
I worked at a startup (now fairly popular in the US) where we had tables for each thing (users, companies, etc) and a “relationship” table that described each relationship between things. There were no foreign keys, so making changes was pretty cheap. It was actually pretty ingenious (the two guys who came up with the schema went on to get paid to work on k8s).

It was super handy to simply query that table to debug things, since by merely looking for a user, you’d discover everything. If Mongo had been more mature and scalable back then (2012ish), I wonder if we would have used it.



This is quite similar to the RDF triples model. Everything is a 'thing' including 'relationships' between things. So you just need a thing table and a relation table.

The issue with this is that schema management and rules get pushed to the application layer. You also have to deal with a very large relationships table (in terms of # of rows), which leads to potential performance issues.
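A minimal sketch of the thing/relation layout in SQLite (table and column names here are illustrative, not from the original schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE thing (
    id   INTEGER PRIMARY KEY,
    type TEXT NOT NULL,          -- 'user', 'company', ...
    data TEXT                    -- JSON blob for subtype-specific fields
);
CREATE TABLE relation (
    subject_id INTEGER NOT NULL, -- no FK constraint, by design
    predicate  TEXT NOT NULL,    -- 'owner', 'patient', ...
    object_id  INTEGER NOT NULL
);
""")
db.execute("INSERT INTO thing VALUES (1, 'user', '{}')")
db.execute("INSERT INTO thing VALUES (2, 'company', '{}')")
db.execute("INSERT INTO relation VALUES (1, 'owner', 2)")

# "Look for a user, discover everything": a single query against relation.
rows = db.execute(
    "SELECT predicate, object_id FROM relation WHERE subject_id = ?", (1,)
).fetchall()
print(rows)  # [('owner', 2)]
```

Note that all the schema rules (what may relate to what, which ids must exist) live in application code here, which is exactly the trade-off being described.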


let's call them "nodes" and "edges"


Right? What a groundbreaking idea! /s


other downsides I saw from an implementation of this in postgres:

1. You can't have strongly typed schemas specific to a 'thing' subtype; to store it all together, you end up sticking all the interesting bits in a JSON field

2. Any queries of interest require a lot of self joins, and performance craters

Please, please, never implement a "graph database" or "party model" or "entity schema" in a normal RDBMS, except perhaps at minuscule scale; the flexibility is absolutely not worth it.


That's why you have duplicated properties on things you want to query on.


That's why in RDF you should really be using a well-modelled ontology.

As for performance issues, using a dedicated Quad store helps there, rather than one built on top of an RDBMS (OpenLink Virtuoso excluded, because somehow they manage to make Quads on top of a distributed SQL engine blazing fast).

It's been about 6 years since I've been in that world, so things might have changed wrt performance.


Sounds like a 'star schema', just without foreign key constraints on the relationship ('fact') table.

I'm not really sure I understand why not having that constraint is helpful - what sort of 'changes' are significantly cheaper? Schema changes on the thing ('dimension') tables? Or do you mean cheaper in the 'reasoning about it' sense, you can delete a thing without having to worry about whether it should be nullified, deleted, or what in the relationship table? (And sure something starts to 404 or whatever, but in this application that's fine.)


> you can delete a thing without having to worry about whether it should be nullified, deleted, or what in the relationship table?

Pretty much this. You could just delete all the relationships, or just some of them. For example, a user could be related to many things (appointments, multiple vendors, customers, etc). Thus a customer can just delete the relationship between them and a user, but the user still keeps their other relationships with the customer (such as appointments, allowing them to see their appointment history, even though the customer no longer has access to the user's information).


Wouldn't you be able to do the same even with FKs in place? What a FK would prevent you from doing is deleting a user before all their relationships are removed. That should prevent you from ending up with inconsistent data in the tables.


In my experience with regular Postgres databases shared across many teams, inconsistent/broken references are seen as a much less risky thing than, say, cascading a delete. Usually a broken reference is of no consequence. And when you don't use FKs, people tend to bake that into how they code things without much trouble.


I've oft wished for a system where FKs could be used to automatically mark rows as invalid instead of deleting them, and you could inspect the nature of the broken reference as a property of the row/object. Then broken references can be cleaned up when appropriate, possibly never depending on the application.

This gives the neat property of being able to do data entry in parallel, too - different people may insert records into two tables linked by a FK relationship, it's nice for this to happen in either order. It also allows the user to query for exactly what rows will be affected by a cascaded delete, or to perform the operation in steps.


Wouldn't you effectively get those features by just not using FKs? You'd get your broken-reference marker by `where foreign_id not in (select id from foreign_table)`. You could create that as a view or materialized view.
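The suggested broken-reference view can be sketched like this in SQLite (table names made up for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);  -- no FK
INSERT INTO users  VALUES (1);
INSERT INTO orders VALUES (10, 1), (11, 2);  -- user 2 was deleted

-- Rows whose reference no longer resolves, as a queryable view.
CREATE VIEW broken_orders AS
    SELECT * FROM orders
    WHERE user_id NOT IN (SELECT id FROM users);
""")
broken = db.execute("SELECT id, user_id FROM broken_orders").fetchall()
print(broken)  # [(11, 2)]
```

One caveat: `NOT IN` behaves surprisingly if the subquery can return NULL, so `NOT EXISTS` is often the safer spelling if the referenced column is nullable.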


It would be nice to have the schema metadata, at the very least, to be able to analyze a given table's reverse dependencies without resorting to naming conventions.


> tables for each thing (users, companies, etc) and a “relationship” table that described each relationship between things

Sounds like objects and associations in the Tao model:

https://engineering.fb.com/2013/06/25/core-data/tao-the-powe...


So basically one table for each type of node, plus one table for all edges. Neat!

How big was the relationships table? How was performance?


It was pretty massive. I’m certain that it is probably sharded by now. Database performance wasn’t our bottleneck; we had NP-hard problems all over the place, which were our primary performance issues. For most customers, they never really had issues, but our larger customers def did. Those problems were amazingly fun to solve.


How do you shard a thing like this? (Apart from “with great difficulty.”) :)


Does this question imply sharding would be more difficult with this sort of table design, compared with other table designs?

Obviously sharding is difficult, but what would make it any more difficult with this table design?

Picking the sharding keys, avoiding hotspots, rebalancing, failover, joining disparate tables on different shards, etc... these are all complicated issues with far reaching implications, but I just don't understand how this table design would complicate sharding any more than another.

I'm not trying to sound condescending and suggest it's a dumb question... genuinely interested.


This would have happened after I left, so I'm just speculating based on experience.

1. You create a function in your application that maps a customer id to a database server. There is only one at first.

2. You set up replication to another server.

3. You update your application's mapping function to map about half of your ids to the new replica. Use something like md5(customer id) % 2. Using md5 or something else that has a good random distribution is important (sha-x DOES NOT have a good random distribution).

4. Your application now writes to the replica. Turn off replication, and delete the ids from each server that no longer map to that server.

Adding a new server follows approximately the same lines, except you need to rebalance your keys which can be tricky.
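The mapping function from steps 1 and 3 above can be sketched like this (the function name and the string encoding of ids are my own choices):

```python
import hashlib

def shard_for(customer_id: int, num_shards: int) -> int:
    # md5(customer id) % num_shards, as described in step 3.
    digest = hashlib.md5(str(customer_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

# With two shards, roughly half of sequential ids land on each server,
# even though the ids themselves are not random.
counts = [0, 0]
for cid in range(10_000):
    counts[shard_for(cid, 2)] += 1
print(counts)
```

The point of hashing first is that sequential customer ids get spread evenly; a plain `customer_id % 2` would also work for pure integers, as discussed further down the thread.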


So I'm trying to learn about system design without having ever encountered large problems from any of my projects. What I'm reading keeps pushing for hash rings for these types of distribution problems.

Someone who does this stuff for real, do orgs actually just do a hash % serverCount? That's what a startup I worked at did 8 years ago. I thought nobody actually did this anymore though given the benefits of hash rings?


It depends on control and automation. If you can easily migrate things between servers, a hash ring makes more sense. If you are a scrappy startup who has too much traffic/data, you are doing everything manually, including rebalancing and babysitting servers.


Can you please clarify why md5 has a better random distribution than sha?


It doesn't. I'm curious to hear how withinboredom arrived at that conclusion.


I arrived at this empirically doing some tests at work. In general, Sha and md5 BOTH present randomly distributed numbers. However, most IDs (and the assumption made here) are numeric. For the first billion or so numbers, sha doesn't create a random distribution, while md5 does.

Looking back, I don't remember my experimental setup to verify a random distribution and I recommend checking for your own setup. But that was the conclusion we arrived at.


Here's the fraction of 1s for each bit position output by MD5 and SHA-256 using the first million integers as inputs: https://pastebin.com/aVR3u8Vv
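A smaller-scale version of that experiment is easy to reproduce with Python's hashlib (sample size and bit count reduced here for speed):

```python
import hashlib

def bit_fractions(hash_name: str, n: int = 10_000, bits: int = 16):
    """Fraction of 1s at each of the low `bits` bit positions,
    hashing the first n integers (as decimal strings)."""
    ones = [0] * bits
    for i in range(n):
        h = int(hashlib.new(hash_name, str(i).encode()).hexdigest(), 16)
        for b in range(bits):
            ones[b] += (h >> b) & 1
    return [c / n for c in ones]

for name in ("md5", "sha256"):
    fracs = bit_fractions(name)
    # Both stay close to 0.5 at every bit position.
    print(name, round(min(fracs), 3), round(max(fracs), 3))
```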


I quickly threw a few into excel with a pareto graph to show distribution. Granted, it is only the first 10,000 integers % 1000: https://imgur.com/a/HypGWP1

To my untrained eye, md5 and sha1 (and not shown, sha256) look remarkably similar to random. If you want truly even distributions, then going with a naive "X % Y" approach looks better.


If you do % 1000 there'll be a bit of bias from the modulus, but not anything meaningful for the use case, and it affects MD5 and SHA both.

No argument that for simple incrementing integers a raw modulus is better, though. Even if hashing is needed, a cryptographic hash function probably isn't (depending on threat model, e.g. whether user inputs are involved).


MD5 and ShaX are available in almost every language. Things that are probably better (like Murmur3) might not be, or may even have different hashes for the same thing (C# vs JS with Murmur3, for example). I had to reimplement Murmur3 in C# to get the hashes to match from JS, it boiled down to the C# version using longs and eventually casting to an int while JS can’t or didn’t do that final cast.


I wasn't thinking of anything exotic, just the built-in functions used for hash tables and random number generators (for integers and other inputs that can be interpreted as seeds, a random number generator is also a hash function, and of course in your case the hash function was a random number generator all along anyway!).


soooo basically there is no bias


Looks like it. Though I think md5 is faster, which is maybe what I’m remembering, now that I think about it. This was years ago, funny how memory gets tainted.


I think that's probably true of lots of masters of crafts, sometimes you have a gut feeling and don't even remember the formative moments that created the conclusion haha


If you worry about random distribution of the bits wouldn't it be better to do (H(customer_id) % big_prime) % 2? so that the "% 2" does not just read the last bit


A finite field would ALMOST work here. However, you're making the assumption that the user ids using your application are random. If you have an A/B test running using that same mapping, you'll end up with a server hotter than the other if one arm of the experiment is widely successful.


In practice I would have added a salt, but I did not consider that each independent sharding needs to have its own salt.


Thanks for that answer, it seems quite a specialist job. So many moving parts. Why would you need random distribution and not just customers 1 to 100k for example?


> Why would you need random distribution and not just customers 1 to 100k for example?

To keep the databases from having hotspots. For example, if your average customer churns after 6 months, you'll have your "old customer" database mostly just idling, sucking up $$ that you can't downgrade to a smaller machine because there are still customers on it. Meanwhile, your "new customer" database is on fire, swamped with requests. By using a random distribution, you spread the load out.

If you get fancy with some tooling, you can migrate customers between databases, and have a pool of small databases for "dead customers" and a more powerful pool for "live customers", or even have geographical distribution and use the datacenter closest to them. You really do need some fancy tooling to pull that off, but it is doable.


> Meanwhile, your "new customer" database is on fire, swamped with requests. By using a random distribution, you spread the load out.

Great answer.


This depends on access patterns to your data.


I was wondering the same thing. But at the same time, it seems like having it all centralized in one place would make it really easy to write tooling to aid in such an endeavor.

For example, suppose it was decided that sharding would be by account ID. It would probably be very achievable to have some monitoring for all newly created relationships to check that they have an account ID specified, and break that down by relationship type (and therefore attributable to a team). Dashboards and metrics would be trivial to create and track progress, which would be essential for any large scale engineering effort.


Yeah, it does seem you need to think ahead about what the future needs would be.


One neat thing about Mongo is the ObjectIDs:

We didn't know it at the time, but you can get the date of creation from them. Which means... If you sort by ObjectID, you're actually sorting by creation time. You can create a generic ObjectID using JavaScript and filter by time ranges. I still have a codepen proof of concept I did for doing ranges.
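The trick relies on the documented ObjectID format, whose first 4 bytes are a big-endian Unix timestamp; a sketch in Python (function names are my own):

```python
from datetime import datetime, timezone

def objectid_datetime(oid_hex: str) -> datetime:
    # Creation time, read straight off the hex string -- no driver needed.
    return datetime.fromtimestamp(int(oid_hex[:8], 16), tz=timezone.utc)

def objectid_for_time(dt: datetime) -> str:
    # A synthetic ObjectID usable as a range-query bound for that instant.
    return format(int(dt.timestamp()), "08x") + "0" * 16

# The well-known example id from the MongoDB docs:
created = objectid_datetime("507f1f77bcf86cd799439011")
print(created.year)  # 2012
```

Sorting by ObjectID then sorts by creation time because that timestamp prefix is the most significant part of the id.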

Anyway, the other thing is, and I'm not sure how many people use it this way: you can copy an ObjectID verbatim into other documents, like a foreign key, so you can reference between collections. If you do this, you'll save yourself a lot of headaches. Just because you can shove anything and everything into MongoDB documents, doesn't mean you absolutely should.


But why? Ok, no foreign key checking, so it’s fast, great. Why not include the foreign ID in the local table? What’s the purpose behind a global “relationships” table?


Guessing:

* Most relations are properly modeled as N:N anyways.

* Corollary: no "aargh, need to upgrade this relation to N:N" regrets.

* Corollary: no 1000 relationship tables to name & chase around the DB metadata.

* Low friction to add relations on-the-fly when developing code.

* Low friction to query all relations associated with an object.

* Low friction to update a relation, but keep the ex around.

Neat approach IMHO.


>Most relations are properly modeled as N:N anyways

I'm not sure that's true. Containment is a very common type of relationship.

>Corollary: no 1000 relationship tables to name & chase around the DB metadata.

That depends on whether relationships have associated attributes. If and when they do, you have to turn them into object types and create a table for them. And it's exactly the N:M relationships that tend to have associated information.

That's the conceptual problem with this approach in my view. Quite often, there isn't really a conceptual difference between relationship types and object/thing types. So you end up with a model that allows you to treat some relationships as data but not others.

>Neat approach IMHO.

It's a valid compromise I guess, depending heavily on the specifics of your schema. Ultimately, database optimisation is a very dirty business if you can't find a way to completely separate logical from physical aspects, which turns out to be a very hard problem to solve.


Great points! Indeed, hard to tell without knowing a bit more about their schema specifics.


* low friction till things go wrong and you're not even aware they went wrong because the relational mistake wasn't constrained

* aaargh we have trillion row tables that contain everything, how do we scale this (you don't)

* aaargh our tables indices don't fit into memory, what do we do (you cry)

* no at a glance metrics like table sizes to have an idea of which relations are growing

* oh no we have to have complex app code to re-implement for relationships everything a database gave for free

Reddit used a similar system in Postgresql and hit all of these issues before moving to Cassandra. Even then they still had to deal with constraints of the original model.

Edit: Didn't even realize the article was about Reddit; it pretty much confirms what I said. It took them something like 2 years iirc to migrate to Cassandra and the site was wholly unusable during that time. There is no doubt in my mind that if another Reddit-style site had existed during that time it would have eaten their lunch, same as Reddit ate Digg's lunch (for non-technical reasons).

Furthermore, back then iirc they only really had something like 4 tables/relations: comment, post, user, vote. The 'faster development time' from adding new relations was completely moot; they spent more time developing and dealing with a complex db query wrapper and performance issues (caching in memcache all over the place) than actual features.

The (app+query) code and performance for their nested comments was absolutely hilarious. Something a well designed table would have given them for free in one recursive CTE.

It wasn't until they moved to Cassandra that they were able to let the db do that job and work on adding the shit ton of features that exist today.


> low friction till things go wrong and you're not even aware they went wrong because the relational mistake wasn't constrained

These are people-problems, and usually a dev made this mistake exactly once after being hired. Writing robust code isn't hard if the culture supports it.

> we have trillion row tables that contain everything, how do we scale this (you don't)

You do, by sharding the table.

> no at a glance metrics like table sizes to have an idea of which relations are growing

select count(*) from relationship where relation = 'whatever'; is pretty simple.

> we have to have complex app code to re-implement for relationships everything a database gave for free

These queries are no different than standard many-to-many type queries. In all, the logic is no different than anything else. You also still have to do the same number of inserts that you have to do for many-to-many, but now the order isn't important. You can even do a quick consistency check before committing the transaction at the framework level.


> These are people-problems, and usually a dev made this mistake exactly once after being hired.

Classifying bugs as "people problems" strikes me as an incredibly hostile work environment.

What happens to the poor developer who makes a particular mistake twice? They get fired?


Have you ever accidentally deleted a production table before? If so, I doubt you’d do it again if you are able to learn from your mistakes. No hostilities required. Hire good people who are smart and capable of learning.


Deleting a production DB is not the only way to mess up your data.

Smart people design systems that alert them when they're about to commit an error because they have 500 other priorities on their mind.


Yeah, we had that.

Even where I work now (which has almost no table constraints at all because there is no way to support them at our scale), we have tooling that prevents you from making dumb mistakes, plus alerting and backups if you manage to do it anyway. Having processes and tooling in place to deal with these kinds of issues is what I meant by "people problem." No constraint will keep a determined person from deleting the wrong thing (in fact, it might be worse if it cascades).

We had someone accidentally fight the "query nanny" once, thinking they were deleting a table in their dev environment... nope, they deleted a production table with hundreds of millions of rows. It took hours to complete the deletion, while all devs worked to delete the feature so we could restore from backup and revert our hacked out code. They still work there.

My point is, these are all "people problems" and no technology will save you from them. They might mitigate the issue, or slow people down to get them to realize what they are about to do ... but that is all you can really hope for. A constraint isn't necessarily required to provide that mitigation, but it helps when you can use them.


Thanks for clarifying, I had misinterpreted your statement.

I can see how at huge scales, DB constraints can be impossible to implement. But they do come (almost) for free in terms of development effort, whereas what you describe probably took a long time to implement.


"made a big mistake twice" does not correlate with "bad person who is not smart or capable of learning"—hope you're not in management


I never said it did. Just that people tend to learn from their mistakes. Are there people who will make the same mistake different ways over and over again? Yep (I’m one of them). It’s not an issue, I learn and get better at spotting the mistake when I review other’s code and others point it out when reviewing my code. That’s why I said “exactly once” since once you know what it looks like, you can spot it in a review.


> It wasn't until they moved to Cassandra that they were able to let the db do that job and work on adding the shit ton of features that exist today.

But Cassandra is still a NoSQL database and doesn't give you the constraints of something like pg.


Can't help but read this and wonder if it was a strict architectural decision (I think this is what you imply?), or a "move fast and fix it later" side effect that hung around. To me it seems like you're discarding a lot of safety, but perhaps that's overstated in my head.


The relationships were enforced in-code, so the safety existed there instead of by the database. Even where I work now with sharding and such, we don’t have any foreign key constraints yet serve billions of requests. Code can provide just as much safety as the database, you just have to be smart enough to not try and override it.


> Code can provide just as much safety as the database

No, database constraints declaratively specify invariants that have to be valid for the entirety of your data, no matter when it was entered.

Checks and constraints in code only affect data that is currently being modified while that version of the code is live. Unless you run regular "data cleanup" routines on all your data, your code will not be able to fix existing data or even tell you that it needs to be fixed. If you deployed a bug to production that caused data corruption, you deploying a fix for it will not undo the effects of the bug on already created data.

Now, a new database constraint will also not fix previously invalid data, but it will alert you to the problem and allow you to fix the data, after which it will then be able to confirm to you that the data is in correct shape and will be continue to be so in the future, even if you accidentally deploy a bug wherein the code forgets to set a reference properly.

I'm fine with people claiming that, for their specific use case or situation, database constraints are not worth it (e.g. for scalability reasons), but it seems disingenuous to me to claim that code can provide "just as much safety".


> Code can provide just as much safety as the database, you just have to be smart enough to not try and override it.

This statement is absolutely false. Some application code that random developers write will never be a match for constraints in a mature, well tested database.


> Some application code that random developers write will never be a match for constraints in a mature, well tested database.

Well, lucky for you, most places don't hire random developers, only qualified ones. /s

Where I work has billions of database tables sharded all over the place. There simply isn't a way to define a constraint except via code. We seem to get along just fine with code reviews and documentation.


I’ve never seen a database with just one writer. I wouldn’t even bet on all the writers using the same language, much less reusing a single piece of code implementing constraints that never have changed and never will.


Can you tell me what the performance impact is with code logic and extra DB calls versus built-in constraint logic? Because I am fairly convinced that most modern DBs have almost zero impact from relationship constraints. Those are heavily optimised. I am curious about your study and want to know more.


The problem with sharding is that you can't join, as most databases don't support "distributed joins" so relationship constraints don't even make sense. Thus it has to be done in code and often requires multiple queries to various servers to get the full picture. Most of this can be optimized by denormalizing as much as possible for common queries and leaving normalized records for less common queries.

There's a performance impact, for sure, but there's really not much that can be done without switching to some database technology that does sharding for you and can handle distributed joins (rethinkdb comes to mind, but I'm not sure if that is maintained any more). But then you have to rewrite all your queries and logic to handle this new technology and that probably isn't worth it. Even if you were to use something like cockroachdb and postgres, cockroachdb doesn't support some features of postgres (like some triggers and other things), so you'll probably still have to rewrite some things.


> I am fairly convinced that most modern DBs have almost zero impact from relationship constraints

It cannot fundamentally be zero impact, the database needs to guarantee that the referred record actually exists at the time the transaction commits. Postgres takes a lock on all referenced foreign keys when inserting/updating rows, this is definitely not completely free.

Found this out when a plain insert statement caused a transaction deadlock under production load, fun times.


> There were no foreign keys

Then how did the relationship table refer to things in the thing tables?


I bet they meant no foreign key constraints.


OK, but that still doesn't make any sense because:

> we had tables for each thing

I take that to mean "a separate table for each class of thing." So the single "relationship" table had to somehow refer not just to rows in tables but to the tables themselves. AFAIK that is not possible to do directly in SQL. You would have to query the relationship table, extract a table name from the result, and then make a separate query to get the relevant row from that table. And you would have to make that separate query for every relationship. That would be horribly inefficient.


I don't see why you would have to include the table name. I presume the calling code knows what they are looking for.

For example, say we have a table of people and a table of books and a relationship table books-owned-by. If I want to see what's in your library I can select you in the relationship table and join with the books table to get a record set describing your library. Or is this not what the OP was describing?


The OP said, "we had tables for each thing (users, companies, etc) and a “relationship” table that described each relationship between things". I took that to mean that there was only one relationship table.


Oh I see. Hmm... I wonder what OP meant?


Yep, so a table for each thing with their own schemas. A user table, a company table, etc. Then a single relationship table that defines all the relationships between all the objects in a giant many-to-many relationship.

So to query a full relationship, you'd do a join from A -> relationship -> B, probably an inner join. But most of the time, it was usually just a singular join (to get a list of things to show the user, for example) since you'd already be querying in the context of some part of the relationship.


Ah, that makes sense. So something like:

    select whatever from thingy, relation, widget
    where thingy.id = relation.id1
    and widget.id = relation.id2
    and relation.type1 = 'thingy'
    and relation.type2 = 'widget'


I'm a bit confused about how that relationship table would look. Is anybody kind enough to show a concrete example?


This type of setup is semi-common and I've seen it called "polymorphic relationships". The PHP framework Laravel has first-party support for them if you want a real example, although those are generally only polymorphic for one side of the relationship (i.e. building a CRM and having a single "comments" table with the ability to relate Customers, Opportunity, Leads, etc. to each comment without having a unique column for each type).

Generally, the relationship table will look like this:

  - relation1_id
  - relation1_type
  - relation2_id
  - relation2_type
Where *_id is the database ID and *_type is some kind of string constant representing another table. Generally you put an index on "relation#_id, relation#_type", but that's obviously use case dependent
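A sketch of that polymorphic layout in SQLite, including the indexes mentioned (DDL and sample data are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE relationship (
    relation1_id   INTEGER NOT NULL,
    relation1_type TEXT    NOT NULL,   -- e.g. 'customer', 'lead'
    relation2_id   INTEGER NOT NULL,
    relation2_type TEXT    NOT NULL    -- e.g. 'comment'
);
CREATE INDEX idx_rel1 ON relationship (relation1_id, relation1_type);
CREATE INDEX idx_rel2 ON relationship (relation2_id, relation2_type);

INSERT INTO relationship VALUES (7, 'customer', 42, 'comment');
""")
# "All comments related to customer 7":
rows = db.execute("""
    SELECT relation2_id FROM relationship
    WHERE relation1_id = 7 AND relation1_type = 'customer'
      AND relation2_type = 'comment'
""").fetchall()
print(rows)  # [(42,)]
```

Because both sides carry a type column, the same four-column table can hold customer-comment, lead-comment, user-company, etc. without a new join table per pair.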


Like others have said, you can store the table types with the ids.

But here's the neat part. That's often only for safety.

If I want to find all users and their orders, I can join the user table to the relationship table, then join that to the order table. Something like:

    SELECT <COLUMNS GO HERE> FROM USER
    INNER JOIN RELATIONSHIP
        ON USER.Id IN (RELATIONSHIP.LeftId, RELATIONSHIP.RightId)
    INNER JOIN ORDER
        ON ORDER.Id IN (RELATIONSHIP.LeftId, RELATIONSHIP.RightId)

So you're getting the Users, then filtering to the ones that have relationships, then filtering that to the ones that also have orders.

If you've ever managed an N:N relationship table, it's like that, but more generic.


IIRC, we had a nullable column for each ID type. So a column for user, company, appointment, etc. then we had an enumerated column for the relation (owner, patient, vendor, etc).


Sounds like the classic entity–relationship model

I am assisting a professor with his intro to databases class this semester and that was the first thing taught there

https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_mo...


How far is that from RDF triples and similar ideas? Just curious.


I had never heard of this before! But yeah, you could use the relationship table to describe it as an RDF triple. For example, relationships could be OWNER, or PATIENT, or something, and could be seen as a predicate. So USER OWNER of COMPANY, or USER PATIENT of COMPANY, or RECORD BELONGS to COMPANY, or COMPANY VENDOR of COMPANY.


I don't understand this. Wouldn't the relationship rows need to use foreign keys to know which things they apply to?


what's the advantage (besides debugging) to have all relationships in one table, rather than having a separate table for each relationship? seems like it would be strictly worse


Funny. I’ve done exactly this elsewhere.



