Design better databases

dang · on Oct 3, 2019

The original URL was http://dbpatterns.com/, which no longer points to the right content, so we replaced it with the closest archive.org copy of the original.

mortehu · on April 7, 2016

Tip for SQL users:

If you give all your ID fields unique names, e.g. by calling your field "reservation_id" instead of "id", even in the reservation table, you can do stuff like:

SELECT * FROM reservation JOIN guest USING (reservation_id);

By doing "USING (reservation_id)" instead of "ON reservation.id = guest.reservation_id", the field will be automatically deduplicated, so you don't have to qualify it elsewhere in the query, and "SELECT *" will return only one copy.

Tenhundfeld · on April 7, 2016

I actually prefer just using id as the primary key, and I like the explicitness of seeing the table/alias before the column in complex queries. I don't care too much about typing it out; reservation.id isn't longer than reservation_id, and the savings of USING vs ON seem minimal. I also don't care about deduplicating that one field, as I'll likely need to consider other duplicate fields in the results, like created_at or description.

But I can see arguments in both directions. I think the big thing is consistency. If you consistently use <table name>_id as the PK and your tooling doesn't fight it, then cool.

I do feel like this benefit breaks down rather quickly though. For example, if you wanted to track who created a guest record, you probably wouldn't name the column user_id as it's ambiguous. So you might use creator_id to signify it points to the user who created the guest, not to the user invited to the reservation – thus negating your ability to join to a user table via USING.

That's not to say it's a bad tip, just because there are cases where it doesn't fit. Definitely worth knowing. Thanks for sharing.

paisawalla · on April 7, 2016

When you have to join on composite PKs, it's easier to appreciate the parent's advice.

erpellan · on April 7, 2016

Some things that I'd love to see as options in rdbms:

Enforced globally unique column names - natural joins on everything.

Throw an exception if a query requires a table scan (bonus points if it printed the statement that would create the appropriate index).

Translate all update and delete statements into inserts (enforced immutable data + versioning).

bmdavi3 · on April 8, 2016

For queries returning results from a majority of rows in the table, won't a sequential table scan be faster than an index scan?

jpitz · on April 8, 2016

For non-clustered indices, yes.

jeremysmyth · on April 11, 2016

and if the order is not important. Reordering large numbers of rows can be a very expensive operation.

moron4hire · on April 9, 2016

You can force the last one but just dropping the delete and update permissions from your role.

Tenhundfeld · on April 7, 2016

Yeah, I hear that. I also tend to always use surrogate keys. It can have performance implications, but that's typically not my biggest concern.

EDIT: I guess that doesn't really address your point of needing to join on a composite natural key though.

ahachete · on April 8, 2016

Do you know SSSKA? Society to Stop Surrogate Key Abuse: http://www.slideshare.net/PGExperts/keyvil-lightning-talk?qi... ;)

Tenhundfeld · on April 8, 2016

I didn't know about it, but I don't find those slides compelling, not at all. For example, their response to the "keys shouldn't change" is "what is CASCADE for then?". That seems like a naive view of data.

It ignores things like integrating with other systems. If I export my data to an OLAP database, I can't just cascade that key change through it without manual effort.

It also ignores managing historical data. Maybe I don't want that natural key update cascaded everywhere. In their example of hotel room number, if that public-facing number changed (which does happen), I wouldn't necessarily want that cascaded everywhere. For example, I might need past reservations/invoices from before the change to keep the old room number, to match all correspondence with the guest, etc. Okay, so maybe you don't automatically cascade everywhere, but then you need some way to link that old room number and the new room number in all of your reporting.

Bleh. That's a lot of headache that can be greatly minimized with a surrogate key, for very little drawback in my experience. Sure, I can imagine scenarios where the performance impact or additional storage could actually be a real negative, but for your average CRUD app, I think surrogate keys add way, way more value than cost.

ahachete · on April 8, 2016

Sure, there's CASCADE, but as you pointed out it has several drawbacks.

That's why ideally natural keys should be immutable. That has many advantages over surrogate keys.

There's a follow up post (posts) to this presentation: http://www.databasesoup.com/2015/03/primary-keyvil-reprised.... where many problems caused by the abuse of surrogate keys are illustrated.

But the main point is simple: surrogate keys (abuse) is an easy way to data corruption. Not in the sense of durability as in ACID, but in the sense of humans (manually or by bugs in the software) duplicating information that is not enforced to be unique thanks to using non-sense surrogate keys as the uniqueness criteria.

Tenhundfeld · on April 8, 2016

If a combination of data should always be unique in a table, you should put a unique constraint/index on those columns to ensure that data integrity. I think that's mostly unrelated to the surrogate key vs natural key debate.

Your point about immutable keys is important, I think. A "natural key" can mean different things, as I understand the term. You could call a person's social security number a natural key, which for the record is a bad idea as they can change . But you could also call the combination of user_id and reservation_id a composite natural key, which is immutable and not so bad in my book. I'd still use a surrogate key though.

It might seem like I care about this more than I actually do. I've just found over the years of writing systems that I fairly often regretted not having a surrogate key on a table, but I've never once regretted adding a surrogate key that ended up not being essential.

What I've seen happen is that the data model expands/changes and I need to reference a join table. I might just start out referencing it by the composite natural key. Then the model changes again, and now I need to add another piece of data that changes the natural key. For example, maybe the natural key was (user_id, reservation_id), but now it's (user_id, reservation_id, membership_type_id). Now I need to think about anywhere I referenced that join table by the natural key.

ahachete · on April 8, 2016

Unfortunately, UNIQUE is not enough to prevent the kind of errors I am talking about. They are only easily prevented if the PRIMARY KEY is a natural key.

I have seen many cases where concepts that are UNIQUE NOT NULL for a given business case are ignored as natural keys and surrogate keys are used instead. This results in "duplicated" records (from a business logic perspective) that often end up being manually cleaned, resulting in costs in the orders of 10s of thousands. Not funny.

The idea of a primary key is that you are guaranteed 0 or 1 results, but in the absence of UNIQUE and NOT NULL you may get more than one. That's a powerful abstraction. If your business logic determine that a given attribute or combination of attributes satisfies that condition, that deserves to be the row key.

If you rather use a surrogate key:

a) either you also have to enforce UNIQUE and specially NOT NULL on those attributes (apart from the PK) which is definitely not free lunch (extra indexes, more expensive joins, not benefiting from index-only scans)

b) or often someone forgets either the UNIQUE or NOT NULL on the attribute(s) that have to be constrained as such and data integrity is compromised.

Both cases are bad enough. What's the justification to systematically use surrogate keys then?

koide · on April 10, 2016

a) Premature optimization b) Implementation mistake

Using natural keys may offer better performance in certain cases, but does not protect you from b, as you might select a wrong key, or even forget about putting a PK on the table, for example.

Surrogates help you manage the ever changing and interconnected world: the immutable name of something that has been shared between systems is no longer immutable? This type of scenario is also very common and can cost thousands as well.

Which type of problem is more likely in your application? That's up to you to decide.

developer2 · on April 7, 2016

Not really a fan. When your table name is "reservation", you're prefixing "id" with the table name ("reservation_id"), but you don't do the same for the other columns. I've never liked the inconsistency of having "reservation_id", but other columns like "name" instead of "reservation_name". Especially on longer table names where you wind up with columns like "this_really_long_table_id".

All just to shorten join clauses with a natural join or USING? You already get shortened joins - less so, but still a gain - by using table aliases so as not to repeat the long table names in each query multiple times. IMO it also takes an extra moment to parse the ambiguous "id" columns to resolve them mentally to which id is being referred to.

Mahn · on April 7, 2016

Personally, for me it makes sense that the id field has a different distintive name for each table because it's a specific kind of id, i.e., customerID is not interchangeable with reservationID. For me, this is not so much about prepending the field with the table name, but about calling things by their name. "id" is just too vague, even if you use it in the context of a specific table.

_3u10 · on April 7, 2016

I believe this is what foreign keys are for, and also, they are usually both ints/uuids/etc so generally they are interchangeable.

I fail to see how

  customer.id = product.id

is any less obvious of a bug than

  customer.customerID = product.productID

jessaustin · on April 8, 2016

With aliases that first expression might be:

  c.id = p.id

That seems like an easy error to make...

liotier · on April 8, 2016

> With aliases that first expression might be: c.id = p.id

Which is why I don't use aliases, so that future me doesn't have to wonder what the hell he wrote a few months ago.

rlucas · on April 8, 2016

Once an application has been alive for more than a few months, I find it often is subject to self-joins and other query approaches which pretty much beg for aliases in the name of legibility.

dsp1234 · on April 8, 2016

subject to self-joins and other query approaches

Which means that the original argument about reserveration_id vs id doesn't matter since it's likely that both table and column aliases would be needed anyways.

_3u10 · on April 7, 2016

I once made that argument, we ended up with reservation_name, I regret I ever said anything about it.

Apparently, it made joins more clear, I seriously wonder how often people fucked that up for them to think it was a good idea to prefix every fucking column with its table name.

It's like when I see unit tests for setters and think, gee, setters seem pretty straightforward to me, how often to people fuck them up?

developer2 · on April 7, 2016

>> we ended up with reservation_name

That's absolutely terrifying. You've just provided my subconscious with new material with which to populate my nightmares.

How did the code using that database operate? Were you using "reservation_name", or did the application revert the naming scheme by mapping the column to "name"?

Either way, FML.

PeCaN · on April 7, 2016

FFS, it's just reservation.reservation_name, it's not like he's parsing HTML with regular expressions or something. Like... big deal.

kbenson · on April 7, 2016

I'm not sure if this is humor going over my head, or seriousness smacking me in the face.

davvolun · on April 11, 2016

http://stackoverflow.com/questions/1732348/regex-match-open-...

vmasto · on April 7, 2016

Could you elaborate on why this is so awful? (genuine question)

bottled_poe · on April 8, 2016

I can think of a few reasons:

- The field name implies that the primary key field is a string. This entails numerous issues (eg. How do you generate a new unique ID?) - It is not obvious that the field is the primary key. - A 'name' property on a reservation doesn't make sense. Is it the name of the person making the reservation? How can this be unique? etc.

developer2 · on April 8, 2016

My use of "name" was a poorly chosen example to tie to the concept of a reservation. I just chose it because "name" is an extremely common and unambiguous column in many tables.

_3u10 · on April 8, 2016

Too much typing / line noise.

kbenson · on April 7, 2016

Yes, you always have to be really careful when positing an absurd solution to make a point; there's always the risk you'll be taken seriously. :)

leepowers · on April 7, 2016

I prefer longer, more descriptive table names coupled with shorter columns names. Then use aliases when writing queries.

    SELECT * FROM long_table_name l LEFT JOIN another_table_here a ON l.id = a.rel_id

Also, most tables end up being used to populate objects. It's simpler to reference an object with `Reservation.id` than `Reservation.reservation_id`

Regardless of the design choices made it's more important to be consistent in naming conventions and to document why a particular design/schema was chosen.

developer2 · on April 7, 2016

I tabbed away from my previous comment for a minute, which made me forget to mention the ORM scheme in applications. You're exactly right.

Nobody ever uses the class "Reservation" with a property named "reservation_id" in their code. They will name their column "reservation_id" in the database, but then work to undo that at the application level by telling the ORM to map "reservation_id" to an object property named "id". It creates the situation where developers fight against the database schema in their code, and adds an unnecessary level of complexity when comparing code to database. "Why is it 'id' in code but 'reservation_id' in the database?".

A common counterargument from pure DBAs, who are not themselves writing code that integrates with the schemas they create, is that databases should be approached as a completely standalone component that should not be designed with applications' structures in mind. That applications are free to bastardize the database schema in any way they please, so it really doesn't matter anyway. I personally prefer to see it as a reality that applications will be the primary consumer of the database resource, and deserve to have their intended integrations analyzed. Once the database is finalized, it is the application developers who will be studying the schema on a weekly basis. Anything to ease that constant inspection process, the better.

Of course, this is all personal preference. I have learned over the years to finally stop arguing so much over these little details. In real life with coworkers, anyway. It's still fun to battle for your point of view with forum strangers! :D

whitenoice · on April 8, 2016

The only problem I have with this is that when you have a lot of joins and your query is long it gets confusing, as in what table was 'l', what table was 'a' and then you have to scroll up and look it up again, whereas in the parent comment its easy as it makes it explicitly clear and the table name does not matter when looking at joins.

daxelrod · on April 7, 2016

It's long bothered me that SQL doesn't have a way of saying "join these two tables based on the key relationships I've already defined in my schema". (Or maybe some variants do?)

The database already knows that guest.reservation_id is a foreign key referencing reservation.id. Why should you have to repeat yourself?

Someone · on April 8, 2016

I have longed for that, too, but I'm not sure I should want it.

Problem is that it becomes problematic as soon as you keep such queries around (in stored procedures, source code, or batch files), and then modify your schema by adding or removing a foreign key.

ComodoHacker · on April 8, 2016

What if you have more then one foreign key referencing the same table? Or worse, what if you had one and later add another? You would get undefined behavior.

Added:

Also SQL language itself has nothing to do with keys, relationships and other details of database design. It's different level of abstraction.

Various query builders/SQL generators do what you want.

greggyb · on April 7, 2016

And when there are multiple foreign keys between two tables?

kbsletten · on April 7, 2016

Genuinely curious, what's the use-case for this?

level3 · on April 8, 2016

Some examples:

In project management, tasks might have an assigner_id and assignee_id

In transaction management, transactions might have a sender_id and receiver_id

In sports, matches might have a team1_id and team2_id

jonahx · on April 8, 2016

> In transaction management, transactions might have a sender_id and receiver_id

in normal form, shouldn't you have "transactions", "transaction_senders", and "transaction_receivers" anyway? similarly for the other cases?

i know sometimes the simplicity of a single table is a valid choice, so your point still stands, but it does feel like the examples are cases of "well, if you're were doing the right way(TM), that wouldn't happen"

dsp1234 · on April 8, 2016

shouldn't you have "transactions", "transaction_senders", and "transaction_receivers"

were doing the right way(TM), that wouldn't happen

Having a separate transaction_senders and transaction_receivers wouldn't be the right way(TM) unless it's possible to have multiple senders/receivers. If a transaction can only have 1 sender/receiver then normalization is complete when the sender/receiver data is removed from 'transactions' table and replaced with the sender_id and receiver_id columns.

When introducing transaction_senders/transaction_receivers tables without further constraints, it would be immediately possible to have multiple transaction senders and multiple transaction receivers for a single transaction, which is likely an error.

level3 · on April 9, 2016

These examples depend on the use case. If you are guaranteed to only have one sender and one receiver, then having them in the same table is already normal form.

Perhaps (?) a less controversial example is a table of flights between airports. Should a single flight have departure_ids and destination_ids? (assuming we treat legs as separate flights)

Or what about package delivery? Do we need sender_ids and receiver_ids there?

cname · on April 7, 2016

created_by and updated_by both pointing at the user table?

slagfart · on April 7, 2016

Explicit foreign keys perform better in some joins, as the optimiser can assume that all possible values are in the source column of that foreign key.

vog · on April 8, 2016

> join these two tables based on the key relationships I've already defined in my schema

This is almost exactly what "NATURAL JOIN" does. It is like "JOIN USING" with a pre-filled column list.

However, this isn't decided based on the schema, but based on equal column names. So you need to apply the dicipline as noted by the parent comment.

See also the sibling discussion thread at: https://news.ycombinator.com/item?id=11450187

ComodoHacker · on April 8, 2016

> the field will be automatically deduplicated, so you don't have to qualify it elsewhere in the query, and "SELECT " will return only one copy.

You still have in outer joins. Also, SELECT * is a bad practice (with some exceptions) when used in code, suitable only for ad-hoc queries.

huuu · on April 8, 2016

Agreed. Or at least put a table name in front of the asterisk:

  SELECT
      table1.*,
      table2.field
  FROM table1
  JOIN table2
      ON table2.rel_id = table1.id

ComodoHacker · on April 8, 2016

table.* doesn't really do any better. Database evolves, table's structure changes and you get various bugs and unpredictable behavior. From elusive exceptions to memory hogs to network bottlenecks, depending on DBMS, data itself and data access framework you use.

huuu · on April 8, 2016

It does better when, for example, you have a field called 'id' in both tables.

  SELECT
      *
  FROM table1
  JOIN table2
      ON table2.rel_id = table1.id

will use one of the 'id' field from table1 or table2 (depending on the join type).

  SELECT
      table1.*
  FROM table1
  JOIN table2
      ON table2.rel_id = table1.id

will use the 'id' field of table1 and does not include data from table2.

But I agree with your other arguments.

petepete · on April 7, 2016

Or if you're feeling lucky:

SELECT * FROM reservation NATURAL INNER JOIN guest;

paisawalla · on April 7, 2016

Err, be careful with natural joins: suppose both tables have a column called "name" or "created_at" -- natural join will create join conditions from those.

marcosdumay · on April 7, 2016

I never understood why the thing named "natural join" don't use relationships to determine the joining columns. Does not look natural for me.

vog · on April 8, 2016

Indeed, this is also what always annoyed me.

I suspect that this has to do with the fact that NATURAL JOIN is intended to work for sub-SELECTs and VIEWs, too. On those, it is quite hard and error-prone to determine "foreign key" equivalents.

Also note that restricting NATURAL JOINs to tables wouldn't be a good solution here, either, because that would prevent you from ever JOINing more than 2 tables: When JOINing 3 tables, you effectively do have some kind of sub SELECT after having joined the first two tables, which is then supposed to be joined with the third one.

marcosdumay · on April 8, 2016

You get the same sort of problems jessedhillon was talking about.

That is, you lose absolutely nothing - just changes its semantics.

MichaelBurge · on April 8, 2016

That's why he said, "If you're feeling lucky".

sjwright · on April 8, 2016

Which in the world of databases is a synonym for "useless". Typical boutique SQL query strings can be anything from a few to a few dozen lines and the compiled form is typically cached by the SQL engine (if you're using parameterised variables, which you are, because not doing so is stupid and dangerous).

So why not be explicit?

sokoloff · on April 7, 2016

I don't like using select * as it will cause the shape of the result set to change upon schema modifications. We use a DB access layer that "knows" the column indices of result sets at compile time and our zero-downtime process relies on that not changing (for the time period where old code is running against new schema).

Imagine two two-column tables, A with id and name, and B with id and value. (Doesn't matter what the columns are; this is just for concreteness.)

select * from A JOIN B using (id) will give you a three column result set, id, name, value.

Now, suppose you want to do a schema modification to add "last _login" to A. Suddenly, your query is returning a 4 column result set, id, name, last_login, value and code using ordinal positions is now reading last_login into the "value" slot.

Maybe it's a corner-case for our use case and zero-downtime release processes, but it's something to be aware of anyway.

lccarrasco · on April 8, 2016

Most drivers or interfaces to the db let you use the column's name instead of the ordinal positions, precisely because that decouples the query's result from the things you care about, might be worth checking ^^

level3 · on April 8, 2016

I'm curious: what are the advantages of having your access layer rely on column indices rather than column names?

sokoloff · on April 8, 2016

At coding-time (pre-compilation), we create classes representing each stored procedure that we've marked for code generation. That gives parameter list (inc auto-completion in the IDE), column names, types/type safety on inputs and outputs, etc, from the parameter list and result set shape and it was [slightly] easier to write the generic code that binds columns in the result set by column index, since we "knew" the column indices wouldn't change in the result set.

(We had a prior system that relied on column order and so had already banned select * and only allowed additions to the result set on the right side before we even got to the point of doing codegen or zero-downtime releases. This was 2003 that we created the codegen wrapper and 2005 or so when we switched to zero downtime releases.)

I guess "because we're lazy" is the somewhat overly glib answer. ;-)

rusanu · on April 8, 2016

I don't like this because it leads to the misleading assumption that it doesn't matter which value you refer (left or right), since they are equal. But once you consider collations is easy to see the fallacy:

   left.id = 'Something' COLLATE case insensitive
   right.id = 'SOMETHING' COLLATE case insensitive

   SELECT id FROM left JOIN right USING (id);

Will the value be 'Something' or 'SOMETHING'? Better avoid the surprise and be specific...

NicoJuicy · on April 8, 2016

I like using id as key, if you want to "decouple" info, i just use reservation.id as reservation_id

This way, my db fetch is always on id and not a longer name ( reservation_id), which on the end of an webapp, makes me save some reading / writing :)

ciroduran · on April 7, 2016

And if you're using a schema you did not create, don't forget to use shorthands! SELECT * FROM reservation r JOIN guest g ON (r.id = g.reservation_id), queries can get messy if you don't use them! :)

swah · on April 7, 2016

Another option:

    SELECT * FROM reservation r JOIN guest USING (r.id);

piers · on April 8, 2016

I really hope you're using SELECT * as an example and that you don't really do that.

voltagex_ · on April 8, 2016

In which dialect?

ciroduran · on April 8, 2016

It's part of the SQL-92 standard http://flylib.com/books/en/4.343.1.38/1/

rlonstein · on April 7, 2016

But, but, but ActiveRecord doesn't do that by default so it can't be a good idea [sarcasm]

clay_to_n · on April 7, 2016

It's an old-looking website, but I've found this site has some really cool data models: http://www.databaseanswers.org/data_models/index.htm

As someone in the healthcare space, looking at some of these models gives me a better idea of how various aspects of the healthcare industry work, and the things they interact with. Ex) http://www.databaseanswers.org/data_models/patient_data_ware...

sakopov · on April 7, 2016

I was just about to post this. I've been referring to this site for years all the way since University days. Invaluable resource! Thanks for posting!

nradov · on April 7, 2016

That healthcare industry data model is very simplistic and would not be adequate for the majority of real world use cases involving patient visits. If you have to model something in the healthcare domain then don't reinvent the wheel. Start with the HL7 RIM, and then constrain it down to what you really need.

https://www.hl7.org/implement/standards/product_brief.cfm?pr...

clay_to_n · on April 8, 2016

Good point! I've yet to dig into HL7 myself, so thanks for the link.

lai · on April 7, 2016

I think that after you posted this, it caused a surge of traffic to that website, which caused the server to crash lol.

bochoh · on April 7, 2016

Thanks for this, so much information here.

elchief · on April 7, 2016

those models are actually pretty bad

the ones in these books are better:

https://dba.stackexchange.com/questions/12991/ready-to-use-d...

theli0nheart · on April 7, 2016

This strikes me as a very unhelpful comment. Why are they bad? Why are the ones in these books better?

jqm · on April 7, 2016

The site (databaseanswers) is neat and I've gotten some good ideas from it in the past, but most of the models (that I've looked at anyway) are pretty simple and you almost really wouldn't need to look at a diagram drawn by somebody else to intuitively put something like that together. It seems almost more for someone creating MS Access level applications. Granted there may be some more complex schemas that I didn't see.

There might be an argument about excessive normalization in some cases also. Take some of those layouts too far and try to extend them and you might wind up with tons of little tables. Normalization was I think more important back in those days (not that it's not still important... but some of the downsides of going overboard on normalization have become apparent I think.... at least to me).

clay_to_n · on April 7, 2016

It's really mixed. The models on the site I posted are from a variety of places, so some are higher quality than others.

klodolph · on April 7, 2016

I just clicked on the second "featured" pattern and found this hot mess:

http://dbpatterns.com/documents/5091f74289cbad03bc958bc0/

It has the "let's put a UUID on every row" disease common to designers who have never really learned anything other than object oriented design. Price is a string (I guess so you can put "market price" on the fish?), and there's a currency symbol on every "delivery". The whole thing just makes no sense. 18 stars. Unbelievable.

skrebbel · on April 7, 2016

I'm daft. What's wrong with using uuids as primary keys everywhere? I'd appreciate it if you could elaborate :)

mvitorino · on April 7, 2016

The problem is that most uuids are generated in a way that, when sorted alphabetically as strings, would have a random order.

Example: t0: 7458e3a9-716b-4352-b2e4-b5b67d0c089b t1: 4d8d753c-1777-439d-8725-b093b1bd8430

Using this as a PK in any relational database will mean the rows are stored in the clustered index order, which causes extreme fragmentation because the db engine constantly has to find "holes" in the data pages to insert more recent records instead of adding those at the bottom of the table as would happen with any continuously increasing key.

ww520 · on April 7, 2016

This is a non-issue. Most db engines just append new data rows at the end of the data table and assign an internal row ID for it. The keys (UUID in this case) are stored in the separate index pages using B+ tree, which searches random key (UUID) or sequentially incremented key equally well.

UUID key is a problem only if your main query is a range query on the PK of a clustered indexed table. If your main query is a range query, you should pick something that can be sorted in the range anyway.

mvitorino · on April 7, 2016

A lot of times when a UUID is used, it's not generated in the db, such as when you need to generate guaranteed unique IDs client side. Also 4 bytes vs 16 is not "equally well" in terms of performance. Note that the the diagram linked at the root of this thread is encoding uuids as strings which means 32-36 bytes instead of 16.

ddorian43 · on April 7, 2016

Actually only postgresql does it this way (by storing data in the heap and not in the primary index). Mysql(innodb),mssql,oracle uses the store the row in the primary-key.

Edit: I'm ~wrong, see below.

ww520 · on April 7, 2016

Only Innodb does it by default. The others use heap table by default.

MSSQL allows clustered indexed table as an option to order the physical storage of rows. Oracle has index-organized table as an option.

Edit: They don't use clustered indexed table by default because record insertion is very expensive since clustered index forces the table to store the records contiguously in the index's order. Also Innodb is not truely clustered indexed. It only stores records contiguously for one page at the B+ Tree leaf level. Records in different pages are scattered all over even if the index values are sequential.

Jweb_Guru · on April 7, 2016

It's pretty variable, but Postgres definitely is not the only one that does it that way (or even close).

mvitorino · on April 7, 2016

Not the case with MS SQL Server for example (and there are probably several others). Physical row ordering (at least up to 2008 edition as far as I can recall) coincides with clustered index.

wiredfool · on April 7, 2016

I don't believe that the Primary Key affects the physical ordering of records in PostgreSQL.

Primary Key is implemented as a unique index coupled with a not null constraint. It's essentially syntactic sugar.

mortehu · on April 7, 2016

To elaborate, even if you do call CLUSTER[1] to enforce a physical ordering, the primary key is not the default (there is no default).

1. http://www.postgresql.org/docs/9.5/static/sql-cluster.html

gnaritas · on April 7, 2016

That would be true except for that fact that database designers know this, and offer a way to generate sequential GUID's; so it's actually not a problem for that reason. The only real downside is performance, an int key performs better.

mvitorino · on April 7, 2016

A number of scenarios that require UUIDs as keys, involve generation client side where such guarantees may not be available. In my view, UUIDs should not be used unless there is a strong reason for it (such as making keys not guessable). Most people will implement them incorrectly/inefficiently.

jsmthrowaway · on April 7, 2016

> an int key performs better.

A UUID is a 128-bit integer. That people do not store, generate, or interact with them that way is the bug.

I'd kill for a 128-bit architecture so a UUID compare would be a single instruction, and it's too bad consensus is that we don't really need it.

mortehu · on April 7, 2016

> I'd kill for a 128-bit architecture so a UUID compare would be a single instruction

Well, there's PCMPESTRI on Intel.

http://www.felixcloutier.com/x86/PCMPESTRI.html

jsmthrowaway · on April 7, 2016

Yeah, I just said this in another comment, but SSE is my weak point; I stopped writing assembly and dealing with this stuff before SSE was even introduced.

That looks basically like what you need, though, yeah. Cool.

gnaritas · on April 7, 2016

An int is either 4 or 8 bytes, it's still going to perform better no matter how you treat the GUID.

edit: bytes

paisawalla · on April 7, 2016

> An int is either 4 or 8 bits

This perfectly summarizes the bitwise-innumeracy of the argument against UUIDs. Even without 128 bit word sizes, most UUIDs are going to be non-matches in their lower bits, assuming a random distribution. There's no computational efficiency gained in the vastly wide critical path here.

gnaritas · on April 7, 2016

> most UUIDs are going to be non-matches in their lower bits, assuming a random distribution

But you can't assume a random distribution because anyone using UUID's for database keys isn't using random UUID's, they're using sequential ones.

jsmthrowaway · on April 7, 2016

Which would still make the low-order quadword not match.

gnaritas · on April 7, 2016

It's a database though, you don't get to control how comparisons are done. Guid's are larger than ints, so using them as keys or joining on them is going to be slightly slower (not that it makes them bad, I'm not against GUID's). Just as using a bigint will be slower than an int.

jsmthrowaway · on April 7, 2016

If I were implementing the uuid type in Postgres and did anything other than `if(unlikely(LO(uuid1) == LO(uuid2)) && unlikely(HI(uuid1) == HI(uuid2)))` I'd consider myself doing the wrong thing, personally. Two 64-bit compares, two flag checks, and a branch, which is about as fast as it's going to get in the absence of a 128-bit architecture.

You could actually vectorize the compares in SSE, I believe (but I stopped writing assembly before SSE existed and don't know much about it at a low level). This is why it's so critical to treat a UUID as an integer instead of a string, and I see a lot of code that does the latter.

gnaritas · on April 8, 2016

While all of that is true, it's really not relevant to a programmer deciding on which to use as a key; he has no control over any of those things.

jandrese · on April 7, 2016

I hope you mean bytes. On a 128bit architecture that would be a single instruction compare, just like a 32 or 64 bit compare. Performance would be roughly the same (minus maybe a cache miss because you blew it out with your fat ints).

dhd415 · on April 7, 2016

Every sequential GUID generation feature provides guarantees only per session. In other words, when you bring your database down to upgrade or patch it and then bring it back up, your GUID sequence starts over again at some arbitrary value which is likely to overlap with your existing GUIDs.

gnaritas · on April 7, 2016

> your GUID sequence starts over again at some arbitrary value which is likely to overlap with your existing GUIDs.

I don't believe that's true. You're right that they start the session over, but it'll still be a fresh sequence that doesn't overlap anything generated in the past.

enobrev · on April 7, 2016

So you're saying it's specifically less efficient for a write-heavy tables, correct?

I generally prefer auto-incremented integers, but UUIDs are very useful for client-generated records (like for an app/website that's built for offline usage).

acscott · on April 7, 2016

It's also bigger than necessary when a sequence will do. Large PKs mean larger indexes. For very large databases or very high volume every little inefficiency adds up.

As I always say, if we don't split hairs now, we'll be splitting heads later.

Retra · on April 7, 2016

I don't understand this... How does the sort order in the index have anything to do with the arrangement of the table? If you build an index over the UUIDs, it's just going to refer to rows in the table by their internal offsets.

dspillett · on April 8, 2016

>How does the sort order in the index have anything to do with the arrangement of the table?

If you have a clustered index based upon that column (and in systems that support this it is common for the primary key to be the clustered index for the table) then the physical layout of the data will follow this column.

See http://use-the-index-luke.com/blog/2014-01/unreasonable-defa... as a good run down of the whys and wherefores of clustered indexes.

slagfart · on April 7, 2016

Having a sort order is immensely preferable when you're joining multiple sequential rows at once. The optimiser can create a pseudo-partition for your join, and constrain its reads to that. Especially on spinning rust (or I/O constrained systems), you're also reading multiple sequential pages in a stream off disk, which gives a big speed boost compared to random reads.

Without that ordering, the index is 'dumb', and has to find all your rows one by one in its structure. Admittedly it's not as much of an issue for SSDs, but it does still impact performance.

nickpeterson · on April 7, 2016

First there is the question of whether one should have surrogate keys at all, some oppose surrogate keys because: They make tables more difficult to reason about. They encourage structures that involve large amounts of joins.

Others embrace surrogate keys because: They make joins between tables easier (usually one field rather than a compound key where one might accidentally not include all the join predicates). They save space since an integer is almost always smaller than a compound key.

So if you are in the camp that favors surrogate keys, then consider these points:

Some quick downsides for UUIDs: If your table has a clustered index, UUIDs are generally not created in a regular order, and so you'll be constantly inserting into a part of your index rather than the end. Some databases allow sequential UUID generation to help mitigate this. UUIDs are usually 16 bytes, but an int is usually 1-8 bytes. Also keep in mind you pay this price multiple times since the primary key (generally) becomes the foreign key in other tables.

Some upsides for UUIDs: If you have a process that has no master, being able to tie disparate things together with a shared UUID is very useful sometimes.

Often developers use UUIDs/GUIDs because: Their app wants to perform a 'Create' operation and they want to generate the key in advance.

This can usually be mitigated by creating a routine that generates a range of reserved surrogate integer keys, and gives a starting value to the application layer. Then the application layer can use those values with knowledge they won't already exist. For instance, the app layer say, "I'm going to make 100 customers", and the response comes back "10232". Meaning that 10132-10232 is reserved for that process to use for customer primary keys. This allows larger (chunkier) requests which can be interrelated without constantly making requests to the Database layer. One could extent this to be a true cache for entity reserved numbers that gets dolled out as needed by the app layer.

ww520 · on April 7, 2016

UUID as PK is perfectly fine; however, as with any design decision it really depends on your needs and weighing alternatives.

Some popular ways to do PK: natural key, sequentially generated key, and UUID. Personally I would prefer natural key if I can find an immutable natural key for the table. However, natural keys are hard to find. The food example in this case doesn't have natural key. Also if the natural key requires a compound key, it's just not worth the pain.

Sequential auto generated key is good when you need to hand it off to users, like order id or ticket number. It's short and simple and it's auto generated. The downsize is when you migrate databases, you need to seed the new database carefully or it would start from beginning again. Also in record creation, you need to do extra read to get back the newly generated key.

UUID is mostly worry free. It can be generated anywhere and doesn't need to be in the database. For a setup with distributed databases, I would use UUID just to have global uniqueness. For an offline app, I would use UUID to create the data records locally and later sync them with the main database. UUID is good when it's used internally and not exposed to the users.

jon-wood · on April 7, 2016

Sequential IDs can also leak business information when exposed to customers such as how many orders your taking - an interested party can place an order at 10am and another at 11am. Once you've done that you can compare the IDs and know how many orders were taken that hour.

irl_zebra · on April 7, 2016

Why should UUIDs not be exposed to users? Because of the their unwieldy look, or because of security concerns?

I ask because I have an app where previously I was accessing a particular data point via pk, and the user saw the pk in the url bar. But it could expose user/site data as any user could access that pk, or just guess at the next sequentially generated pk. I switched to uuid and now it's a little ugly in the url bar, but no data is exposed.

ww520 · on April 7, 2016

I meant the key handed off to the user for human interaction. As a user I would prefer order number 7356 than 7458e3a9-716b-4352-b2e4-b5b67d0c089b.

If there's no human interaction with the key, UUID is perfectly fine.

mhd · on April 8, 2016

I've seen scenarios where UUIDs were used intentionally for this, to prevent users who get a full export of table data to reference the UUID. If it's just a sequence of small integers they might them to more permanent and intrinsically tied to the data, whereas UUIDs seem more like computing artifacts.

Yeah, I know, this could easily fixed by not friggin' showing surrogate keys to those users in the first place, but, well, data integration is ugly business.

irl_zebra · on April 7, 2016

Gotcha, just wanted to make sure I wasn't missing something. Makes total sense, thanks!

radagast · on April 8, 2016

> But it could expose user/site data as any user could access that pk, or just guess at the next sequentially generated pk.

Using a UUID instead of a sequentially rolling integer ID isn't solving your problem, you're just doing security through obscurity. The real solution is implementing real authentication & authorization - not making the key harder to guess.

naasking · on April 8, 2016

> Using a UUID instead of a sequentially rolling integer ID isn't solving your problem, you're just doing security through obscurity.

A common sentiment, but not true if your id is cryptographically random. It amounts to capability security which is the right approach to security if used comprehensively.

Scirra_Tom · on April 7, 2016

Came across an interesting case against using UUIDs the other day, in the UK it's a legal requirement that VAT invoices are sequential: http://www.hmrc.gov.uk/manuals/vatrecmanual/vatrec5020.htm

Using random UUIDs as an invoice number wouldn't cut it, you'd have to have a sequential ID in some form. You could of course have both but that's adding unnecessary complexity.

Another example, if you are designing a website and lets say give each blog post a UUID, a URL of:

website.com/blog/my-post-34

Looks better than

website.com/blog/my-post-d173affb-6bb8-4435-bef1-4e29409dff4c

Unique sequential Id's also give some clues about the relative age of rows (eg ID 100 is older than ID 50). This doesn't come up often but can sometimes be useful.

Regardless, I always find using a unique int ID for pretty much every record makes your job 10x easier and I fail to see why you would ever want to do anything different. If you want to 'hide' the ID in special cases (eg order numbers) better to generate a unique random int ID.

xrstf · on April 8, 2016

Sequential order IDs are also mandatory in Germany. At least I remember that we had to make sure in one project that the order IDs had no gaps and were sequential.

clavalle · on April 7, 2016

There is no way to enforce data integrity between related tables by virtue of the data model for one thing.

gcr · on April 7, 2016

I don't get this argument. Foreign key constraints are part of the schema.

What should the primary key be if not a UUID?

clavalle · on April 7, 2016

A natural key. It is not always possible, but when it is it should be used.

cortesoft · on April 7, 2016

an integer id?

billybugmaster · on April 7, 2016

isn't that what the fk are for? food_uuid for example. I concede I wouldn't call it uuid but I don't see a problem keeping the data straight.

jtmarmon · on April 7, 2016

?? foreign keys?

kawera · on April 7, 2016

Some good answers here: http://stackoverflow.com/questions/45399/advantages-and-disa...

Mister_Snuggles · on April 7, 2016

But where do you even put the price on a food item to start with?

The only price is on the 'delivery' table, which seems to imply that the price is only set when someone orders something. I guess this is a delivery-only place? But because the 'delivery' table can only have one food_uuid, you'd have to put in for a bunch of deliveries to order more than one thing.

Maybe the 'price' on the menu is in that 'Additional Info' table?

    SELECT A.NAME, A.DESCRIPTION, B.VALUE FROM FOOD A LEFT OUTER JOIN "Additional Info" B ON (A.UUID = B.UUID AND B.TYPE = 'Price');

There's so much wrong here...

Smudge · on April 7, 2016

I don't disagree, but there is a case to be made for using UUID over sequential integer ids, and that is for horizontal scaling. When inserting records with UUID primary keys, you don't have to have your db cluster synchronized around which sequential ID should be next. There are other ways around this problem, of course, but using UUIDs sidesteps the issue entirely.

mozumder · on April 7, 2016

Because a restaurant menu needs horizontal scaling...

Smudge · on April 7, 2016

Any data might need horizontal scaling regardless of the type of content. It all comes down to how much data and how often it is accessed. So, sure, if you're building the website for a single restaurant, don't use UUIDs, because no matter how popular, it will never get that much traffic.

nodefortytwo · on April 8, 2016

Yeah, i found that that too, spent 10 mins poking at it:

http://dbpatterns.com/documents/5707794a1514b4252236dcbe/

better, but not finished...

r0m4n0 · on April 7, 2016

And only one comment that says "thhx" haha

Like any open advice community like this, its important there are checks in place to not perpetuate bad ideas...

wahsd · on April 7, 2016

Did it cross your mind that there is a whole host of reasons why it may have 18 stars that have nothing to do with the quality of the pattern?

zenogais · on April 7, 2016

Kinda misleading title. Doesn't really have anything to do with database design. It has to do with sharing Entity-Relationship diagrams. Maybe "Design better relational models".

jordanlev · on April 7, 2016

If anyone is interested, there is a 2-volume set of books called "The Data Model Resource Book". They've been around for a while, so for more traditional businesses, but very thorough and broken out by industry:

http://www.wiley.com/WileyCDA/WileyTitle/productCd-047138023...

http://www.wiley.com/WileyCDA/WileyTitle/productCd-047135348...

rlonstein · on April 7, 2016

+1. Definitely worth reading, I have Vols 1 & 2 (Vol 3 came out later). I found it useful because the author explained the decisions about the models and traps to avoid.

rukuu001 · on April 8, 2016

> because the author explained the decisions

By far one of the most useful things a resource can do.

It doesn't just broaden your understanding of a domain, but it can teach you how to think about design problems.

koolba · on April 7, 2016

More of an ER modeler than a database designer. The interface is pretty slick/simple. Here's a sample I just cooked up: http://dbpatterns.com/documents/570699101514b428de8893a1

I can't imagine using it for anything real but it was fun to play with.

If the creator is reading this, here's a bug report: If the user hasn't hit "Save" when they export the schema, they get an empty text file and they'll think the app is broken. It should either auto-save or prompt the user "You have unsaved content, do you want to save before exporting?"

UPDATE: One more bug report, the export doesn't escape the columns names at all. It just naively adds double quotes around them. This breaks with double quotes in field names (yeah yeah which is stupid but still..):

    CREATE TABLE "foo" (
	"id" int,
	"baz" varchar(255),
	"bar" varchar(255),
	""test"" varchar(255)
    );

abraae · on April 7, 2016

The social aspects of this are awesome.

But the actual modelling functionality itself is really lacking. The lines joining the entities don't even indicate cardinality (does a ninja have many ninja_nelts? Or is it the other way around?). That's pretty much 101 for any visual database modelling.

Also the lines overlap, meaning you can't see whether ninja_weapon is related to ninja, or to ninja_belt.

I can answer these questions myself by perusing the foreign key columns - but that makes the lines redundant. Since several decades ago, Chen, Bachman, Grady Booch and a bunch of others have given ways to denote cardinality etc., and these are widely used. This tool should use one.

(apologies in advance if I have this wrong and the tool does support this but the diagram authors elected not to use them).

astazangasta · on April 7, 2016

I found the interface very off-putting. If this is a site about exploring data models, a "public feed" and "search" is not a great way to interact with those. In fact, a timeline chronology of when these things were made seems a great way to expose mostly irrelevant information to a large number of users.

How do I browse these models? Is there a taxonomy of organization? How many models are in the system? Are there models relevant to general domains of interest to me? None of this is made apparent.

zo1 · on April 7, 2016

I think they're aiming at the "jsfiddle" field. I.e. Sharing ERDs and such over SO and other social sites. As opposed to them trying to be some sort of comprehensive database of standardized DB patterns.

karmakaze · on April 7, 2016

Thanks for your sample link. It's way more interesting than the feed of model snippets in the original link.

collyw · on April 12, 2016

Isn't that how you design databases? Start with an ER model? Its easy to translate to database tables.

cmrdporcupine · on April 7, 2016

Y'know I really miss designing relational schemas. As much as I hate SQL the language I love the relational model that it is a (butchered and ugly and compromised) implementation of. But working where I do on the systems I do now this is something I never have to deal with anymore.

There's something very therapeutic about organizing data using a system of rules, and the relational data model is a powerful one.

dv35z · on April 7, 2016

Completely agree with you. I have always hoped that some of the GUI modelers would allow you to add notes/explanations on tables & relations, explaining the motivation behind design decisions. Example, "M2M relation here allows us to 'tag' objects with multiple attributes".

iamleppert · on April 8, 2016

Looking around I see lots of people just blindly using UUIDs for everything. At a prior job, I had a boss who forced me to use UUIDs for everything, before we had even collected a single row of data.

It hurts my pragmatism to solve a distributed systems problem before we even had a distributed system! Don't be a sheep and use UUIDs, rather than a simple integer primary key or composite key that is natural to the table. People say it's just an id, who cares? But I say people making these kinds of decisions are probably making other poor choices backed by something they read and don't understand.

Here's a great article on locality and the need for uniqueness and why UUIDs should only be introduced when needed and to solve very specific distributed systems problems. The reality is almost all applications will never need to have multiple concurrent writers.

https://eager.io/blog/how-long-does-an-id-need-to-be/?hn

vkjv · on April 8, 2016

I barely ever use uuids. As you say, it solves a fairly specific problem. Usually when you need to scale to multiple datacenters, it's easier and faster to add a datacenter column and starting using it with id as a composite key.

The only time I use UUIDs is when I need to pass that id out, but I want to hidr how many rows there are or the rate that they are being created. However, I usually use COMBs to prevent fragmentation.

gholevas · on April 7, 2016

I built an electron app that allows you to design, generate, and share Mongoose Schemas if anyone finds it useful: http://mongomulch.com

moron4hire · on April 8, 2016

These schemas seem like pretty simple, straightforward adaptations of specific problems. This doesn't actually look like any sort of actionable advice on how to design better schemas. Are we supposed to just osmose the knowledge somehow? I was expecting something more along the lines of recommendations like "foreign key relationships should always be indexed, nobody ever came up with a realistic example where they shouldn't! Why the hell isn't this the default behavior?"

igrekel · on April 8, 2016

Interesting idea but it the way it is currently done isn't that great.

Most of the patterns I've seen are obvious simple things but they are missing a lot of content like : - What was the design's intended properties? - Expected volumes, access patterns? - Good places for indices? Maybe the comments could be of some help but really haven't seen much. I've actually found more interesting content in the HN comments than on the site itself.

hghar · on April 7, 2016

I don't think this must be called "patterns" in computer science patterns are a model applied to give solutions to recurring problems. I was expecting something like Martin Fowler Patterns-Enterprise-Application-Architecture but for databases.

This should be called something like database designs.

kazinator · on April 7, 2016

So what you're saying is that "design patterns" should only be for code, not for data?

If you apply some recipe to structure code, that's a "design pattern"; if we apply some recipe for structuring data, it's a "data(base) design"?

But code is data. But sometimes it's not clear whether something is code or data, or whether it's closer to being code than to data.

Design patterns incorporate data. The Observer pattern requires a list of observers that are notified; that is a data structure which we can have in a database: an observer table joined to an event source by event source ID.

If I make such a database (say for a large number of users to set up persistent notifications about some interesting interesting things), am I doing "database design" or implementing the Observer pattern? :)

nxzero · on April 7, 2016

One of the keys to understanding SQL, and on that note, I would highly recommend SQLZoo's interactive tutorial that allows someone to query a (fake) database and progress using baby steps: http://sqlzoo.net

haddr · on April 7, 2016

Cool website, for off the shelf models. For anything more advanced you should be careful, as modeling real world scenarios is rarely so abstract and mamy times you need to make some scenario-specific tradeoffs during the modeling phase to fulfill your requirements.

barryosull · on April 7, 2016

Looks like they’re trying to create a repo of open source DB schemas for domains. That's putting the cart before the horse in my mind. People don’t start with DB schemas when building domains, they end up with DB schemas after modelling them.

Scirra_Tom · on April 7, 2016

Had a quick look through, don't understand what audience this is targeted towards.

tacone · on April 7, 2016

I like this site. As simple as it is, it has a lot of potential community-wise.

- it could allow sql exports for various platforms - it could feature an API so people can write their own framework drivers (for example creating migrations, importing existing schemas etc)

Lets hope that it gains users, so that the voting system (the star) can become more useful to filter out the garbage.

Edit: please remove the login wall to see the starred items, it raises the entry barrier quite a bit. Do you really want new visitors to see low rated schemas as the first thing?

adwf · on April 8, 2016

As a database guy this makes me feel good. From looking at a lot of these schemas, my job is not going to disappear anytime soon...

Most egregious example for me is probably the prevalence of a lot of "type" tables when a simple enum column would do. Or maybe the sheer number of UUIDs that are being thrown around.

I even saw a circular ID chain in one. Would be fun setting up foreign key constraints for that!

raziel2p · on April 7, 2016

Most of these seem extremely simple. I would like to see some examples where a good database schema isn't so obvious.

Mister_Snuggles · on April 7, 2016

When looking at an author, I can see all of their patterns. This one, in particular, has a pattern that was forked 13 times: http://dbpatterns.com/accounts/profile/thaichor/

How do you see the forks?

n13 · on April 8, 2016

Was wondering how difficult is to code a UI like this? i.e. SVG with draggable boxes with the connections/lines?

olalonde · on April 7, 2016

Wow, that's really cool! Would be nice to have a "migration" export format for Rails, Knex.js, etc.

jdc0589 · on April 7, 2016

I was bored, so here is a quickly thrown together generic "thing" DB model: http://dbpatterns.com/documents/5706a53c1514b428de88940c

burrox · on April 7, 2016

I like it, I think there's a real use case for it. I spent quite some time looking for examples on how to create an schema for an activity based workflow webapp.

intrasight · on April 8, 2016

An empty schema diagram is to a data modeler what a blank canvas is to an artist or a blank page to a writer - thrilling but somewhat intimidating.

stevesun21 · on April 7, 2016

I thought about the similar idea to create domain models to elaborate business rules with UML ORD rather than database ERD.

jmcgough · on April 8, 2016

The search function doesn't seem to be working - I can't get it to return results ever.

geniium · on April 8, 2016

Seems like a big playground...

vonklaus · on April 7, 2016

wow. doing my first real data model. tough to find resources like this, started literally resterday. can't wait to check this out

whatnotests · on April 7, 2016

Kinda cool but the UX really needs some re-thinking.

Simple things like dialogs getting stacked on top of each other, using `prompt("...")` sometimes and modals other times, foreign key relationship arrows not indicating cardinality, etc --

Looks good but please don't consider it "finished" yet.

gherkin0 · on April 7, 2016

> Public Newsfeed

> jnichols created new pattern

> Penis

> 24 seconds ago

Ok... maybe they need to start with some patterns about filtering spam and noise.