
I can explain it: write throughput has become the bottleneck for an increasing number of applications and workloads. If you can't parse, process, index, and store real-time data at the rate it is created, or you can't load bulk data in a reasonable period of time, query performance doesn't matter. If you can only insert 100k records per second, how long will it take you to load a trillion records? A trillion-row table fits on a single ordinary server these days, and they aren't uncommon.
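
For a sense of scale, the back-of-envelope math on that trillion-row example (same numbers as above):

    # time to load a trillion rows at 100k inserts/sec
    rows = 1_000_000_000_000
    rate = 100_000
    seconds = rows / rate
    print(f"{seconds:,.0f} s ~= {seconds / 86_400:.0f} days")   # ~116 days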

Note that even LSM-trees are only used for medium write-intensity workloads. If you need maximum throughput for indexed access methods, newer database engines that use inductive indexing are on another level. I've seen a few implementations drive 10 million records per second through indexing and storage on a single EC2 instance (and with better query performance than LSM). These were not academic exercises either; they were invented to solve real-world problems LSM could not handle.

As for why anyone needs, say, 10 million inserts per second per server: there are quite a few modern workloads that generate many, many billions of new inserts per second if you run them at scale. Without high throughput on a per-server basis, managing the implied cluster size would become a serious challenge on its own. Server hardware can support very high ingest rates at very high storage densities if the software can drive it.
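
Rough cluster sizing for a hypothetical aggregate rate (the 2B/sec figure is just an illustration, not from any particular workload):

    workload = 2_000_000_000          # hypothetical aggregate inserts/sec
    for per_server in (100_000, 1_000_000, 10_000_000):
        print(f"{per_server:>12,} rows/s/server -> {workload // per_server:>6,} servers")
    # 100k/server needs 20,000 servers; 10M/server needs 200.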



> newer database engines that use inductive indexing are on another level

Can you provide me with some resources where I can read more about this? Google is not helpful.


Interesting. What do you mean by "inductive indexing"? 10 million inserts per second sounds pretty exciting.


The 10 million/sec is largely a consequence of the insert path being highly pipelined all the way through storage without a lot of secondary structure. An EC2 instance like i3en clearly has the end-to-end hardware bandwidth to support that write rate. The usual thread-per-core, user-space I/O, etc. idioms will get you there with a good design. Of course, you still need to insert an index in there that captures every searchable column you need without slowing things down too much, or query performance will be nonexistent.
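
To make the shape of that pipeline concrete, here is a toy sketch of a partitioned ingest path: hash-route records to per-shard workers, batch, append to a per-shard log, and keep a small in-memory primary index. This is illustrative only; a real engine would be thread-per-core in a systems language with user-space I/O and batched durability, none of which Python models.

    import queue, threading, struct

    N_SHARDS = 4
    BATCH = 1024

    def flush(log, index, batch):
        for key, payload in batch:
            index[key] = log.tell()          # record offset before the append
            log.write(struct.pack("<qI", key, len(payload)) + payload)

    def worker(shard_id, q):
        index = {}                           # per-shard primary index: key -> offset
        with open(f"shard-{shard_id}.log", "wb") as log:
            batch = []
            while True:
                item = q.get()
                if item is None:             # shutdown sentinel
                    break
                batch.append(item)
                if len(batch) >= BATCH:
                    flush(log, index, batch)
                    batch = []
            if batch:
                flush(log, index, batch)

    queues = [queue.Queue(maxsize=10_000) for _ in range(N_SHARDS)]
    threads = [threading.Thread(target=worker, args=(i, q)) for i, q in enumerate(queues)]
    for t in threads:
        t.start()

    for key in range(100_000):               # route each record to its shard by key hash
        queues[hash(key) % N_SHARDS].put((key, b"example-payload"))

    for q in queues:
        q.put(None)
    for t in threads:
        t.join()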

Inserting an index into that pipeline means the data must be organized around a single primary index covering every key column you want to query (no secondary indexes), and that you generally cannot use a classic multidimensional spatial index to capture multiple columns: if the column types have fundamentally different distributions of data and load, query selectivity degrades materially. A canonical example is trying to index time and space, and maybe an entity id, in the same index, e.g. GPS probe data models, which can be both extremely large and extremely fast.
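
For a concrete picture of the kind of composite key that breaks down here, this is what a classic Z-order/Morton interleave of a time bucket and a coarse spatial cell looks like (field names and widths are made up):

    def interleave32(a: int, b: int) -> int:
        """Interleave the low 32 bits of a and b into a single 64-bit sortable key."""
        key = 0
        for i in range(32):
            key |= ((a >> i) & 1) << (2 * i)
            key |= ((b >> i) & 1) << (2 * i + 1)
        return key

    time_bucket = 0x0001F2A4        # e.g. minutes since some epoch
    space_cell  = 0x00B3C9D1        # e.g. a coarse geohash-style cell id
    print(hex(interleave32(time_bucket, space_cell)))

    # If one dimension is hot and skewed (GPS probes: dense in cities, bursty in
    # time), a range scan over this single interleaved key still touches many
    # blocks that only match on the other dimension -- the selectivity problem
    # described above.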

The general strategy is to build an indexing structure that adaptively compresses away all of the non-selective feature space of the dimensions, regardless of the characteristic distribution of the data types. The idea is kind of old and follows from the theoretical equivalence between indexing and AI: whereas most succinct indexing structures effect compression via coding theory, this is compression via lossy pattern induction. Storage engines have to be specifically designed for this (ab)use case, and you cannot do any kind of short-circuit queries with these indexes; you always have to look at the underlying data the index points to. Generalized induction is pathologically intractable, so these are efficient and practical approximations. The write (and query) throughput is extremely high because the index structure often fits entirely in CPU cache, even for very large data models. AFAIK, these architectures have been used in production for around a decade, but current versions are qualitatively better than early versions.
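
The induction-based structures described above aren't public, but the contract they impose is the same as this toy block-summary sketch (a zone-map-style stand-in, not the real technique): the index is lossy and only rules blocks out, every candidate block must be verified against the underlying data, and the summaries are small enough to stay cache-resident.

    BLOCK = 4096
    FP_BITS = 256

    def summarize(block):
        lo, hi = min(block), max(block)
        fp = 0
        for v in block:
            fp |= 1 << (hash(v) % FP_BITS)      # lossy membership fingerprint
        return lo, hi, fp

    def build(values):
        blocks = [values[i:i + BLOCK] for i in range(0, len(values), BLOCK)]
        return blocks, [summarize(b) for b in blocks]

    def lookup(blocks, summaries, key):
        hits = []
        probe = 1 << (hash(key) % FP_BITS)
        for i, (lo, hi, fp) in enumerate(summaries):
            if lo <= key <= hi and (fp & probe):    # cheap in-memory pruning
                if key in blocks[i]:                # mandatory verification pass
                    hits.append(i)
        return hits

    values = list(range(0, 1_000_000, 7))
    blocks, summaries = build(values)
    print(lookup(blocks, summaries, 700007))        # 700007 is a multiple of 7, so present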

That said, these indexes still suffer from the write sparsity problem that all cache-mediated, index-organized storage has; it just pushes that performance problem out a couple of orders of magnitude. Those limits are visible on the ultra-dense storage available today. There is a radically different research database architecture that appears to qualitatively breach this limit, but I don't expect it to show up in production software for at least a few years. It isn't fully baked, but people are working on it.

Virtually all of the research in this space happens completely outside of academia, so it rarely gets published, and research groups stopped filing patents over a decade ago because they were effectively unenforceable. It has turned into a boutique business.


How do I learn more about all of this? Where do I go to work?



