Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Graph databases have been around nearly as long as databases generally, at least since the 1970s. The sole feature that makes a database a "graph database" is support for a minimal amount of recursion in queries, which was a feature before SQL even existed. There are good technical reasons graph databases have never been commercially successful.

Most databases support some form of recursion and have for decades. The reason they don't market themselves as a "graph database" is that the performance and scalability of graph data models in conventional database architectures is typically very poor, so representing your entire data model that way is not encouraged. The way indexing is implemented in many traditional database engine designs, e.g. B-trees, is pathological for some common graph-like traversals. Even databases that market themselves as "graph databases" have conventional internals and perform only marginally better for graph data models than databases that are positioned differently.

The idea of graph databases are great. Unfortunately, the existing implementations all suffer from very poor scalability due to fundamental limitations of the data structures and algorithms used internally. The core theoretical challenges are well-understood but few people are credibly working on addressing those.



Interesting. But are those old graph databases similar to modern ones like Neo4j? Neo4j has pretty amazing performance compared to doing the same thing in SQL. Although it's certainly possible that SQL database was poorly designed. Even so, a handful of developers new to graph DBs managed to easily beat its performance.

I've also heard that some modern graph databases are not true native graph databases below the surface, and therefore perform worse at graph-specific queries.

Edit: is it possible you're talking about Network Model DBs[0]? Wikipedia mentions them being around since the late 1960s, and that they could model graphs, but suggests it's more of a predecessor to graph DBs, which saw a lot of improvements until the arrival of modern commercial graph DBs in the 2000s[1].

Again, I'm not an expert on this at all, but it sounds like there have been significant improvements in graph databases since the 1970s. Much more so than in relational databases.

[0] https://en.wikipedia.org/wiki/Network_model

[1] https://en.wikipedia.org/wiki/Graph_database


I wouldn't say relational databases are poorly designed, any more than saying a hammer is poorly designed because it makes for a bad screwdriver. A hammer is still excellent at working with nails, it's hard to find a better tool to work with that. This is just about using the right tool for the job.

Back when data was simpler and not as big, relational databases were perfect, and there have been years of engineering and bug fixes that have gone into them. They are excellent at what they do, and they continue to improve.

But as technology has improved, as our disks and memory have gotten bigger, as the data we collect and want to query over has gotten bigger, and as our queries have gotten more complex, we've been running against the limitations of log(n) joins and relational database technology for some use cases. Now, not every problem is a nail. Some are screws. Some are more exotic.

That's been the reason for nosql databases in the first place, to try to address the shortcomings that arise as data gets bigger, more complex, and as queries and operations become more complex over large data.

log(n) joins are fine...until data explodes, and you're no longer doing just a handful of joins per query, but a very large number of them, maybe even unbounded, and maybe the rules for what data to traverse has soft or even no restrictions. When your data is graphy, when the questions you want to answer require traversals of this scale and nature, and when you want to make sure your traversal costs are proportional only to the graph you want to traverse (and not proportional to the total data in the database), then graph databases provide a very good tool for modeling and efficiently querying over that data.

Graph databases are relatively young, compared to relational databases. Yet their usage has been proven, especially as more graphy problems and data have grown more common.

Relational databases are still useful, and still improving, and graph databases will also continue to grow and improve side by side with them.

We even have a GQL initiative, on the language side, aimed at becoming an ISO standard that will hold an equivalent position as SQL, but for graph querying. That should speak to the value and importance of the paradigm.


The fundamental premise of the relational model is the physical/logical distinction. The relational model deliberately does not make any requirements or assumptions about how data is physically stored or structured.

The difference between relational and graph (and other NoSQL database systems) is not about particular sizes and shapes of data, it is about level of abstraction. For example assuming joins are "log(n)" makes certain assumptions about how relations and indexes are implemented which is only true for some naive implementation (like Access or MySQL).

Just as an example, materialized views is a physical-level optimization where an arbitrary complex query result is stored and kept updated, which means data can be retrieved as fast as physically possible. Of course this has a cost at insert-time, since materialized views also have to be updated - but this is a performance trade-off just like the structure of a graph database is a performance trade-off.

NoSQL databases has a tight coupling between the physical and logical structure, which makes them easier to optimize towards particular usage patterns but harder to adapt to changing requirements over time. The relational model was specifically designed for large databases used by multiple applications and changing over time.


Performance was not the selling point of relational databases in the first place. It is hard to beat the performance of a perfectly tuned hierarchical or graph database on the workloads they are designed for. At best a relational database can be equally fast. For this reason hierarchical/graph databases never went away and are still appropriate for specialized tasks.

The selling point of relational database is they can accommodate long-lived multi-purpose databases used by multiple different users and applications over time and where the data model would change and evolve over time as the world changes. The is achieved by "data independence" - the logical model is decoupled from physical storage structure and is not hard-coded to accommodate particular query patterns. So you can have ad-hoc queries - cross-cutting queries which you didn't know you would need when the schema was designed. The key insight is that relationships between entities are also just data.


You can build almost any kind of database on top of any decent database kernel -- they are similarly expressive. The "kind" of database is just a skin over the kernel. However, different kernel designs make different performance tradeoffs in several dimensions. A graph database is much more join-intensive than a relational database, for example, so it may make sense to reduce select performance to increase join performance.

Graph database technology was essentially stagnant for decades, and the design of Neo4j is from that era. The basic technology is the same across all graph databases, just incrementally modernized and micro-optimized, hence why they all have the same scalability and performance problems. The market gap is that most interesting graph data models have at least 10^12 edges and graph databases stop working well long before you get to that scale.

There was a major breakthrough on this topic around 2008 at a corporate research lab, a new way of representing graph data models that was highly scalable. By coincidence, I was working in graph database R&D at the same time elsewhere and had some exposure to the research, made several material improvements to the new approach, and even built working prototype database based on it that addressed a lot of classic scalability and performance issues. I stopped working in graph databases shortly thereafter but to this day have not seen a single implementation in the wild based on that research. It isn't the kind of thing you can build on top of an existing database kernel, the internals were necessarily pretty exotic, which is part of the problem.

tl;dr: Graph databases have been stuck in the same local architectural minima for half a century, receiving a fresh coat of paint every decade or so but not addressing any of the reasons almost no one can use them.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: