A lot depends on the type of queries. You could have tables the size of the entire internet and every disk drive ever made, and they'd still be reasonably fast for queries that just look up a single value by an indexed key.
The trick is to have the right indexes (which include the interior structures of the storage data structure) so that queries jump quickly to the relevant data and ignore the rest. It's like opening a book at the right page because the page number is known. Sometimes a close guess is good enough.
In addition, small indexes and tree interior nodes should stay hot in RAM between queries.
When the indexes are too large to fit in RAM, they get read from storage as well. At a low level it's analogous to finding the right page in an "index book" by first using an "index index" to get that page number, as many levels deep as you need. The number of levels is generally small.
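To make the "index of an index" idea concrete, here's a minimal sketch in C of a lookup walking a few index levels, serving the top levels from a small RAM cache and paying one 4 KB read for each uncached level plus one for the leaf. The page layout, fan-out, cache and file name are hypothetical, purely for illustration; they are not the format of my database or of any particular engine.

```c
/* Minimal sketch of a multi-level ("index of an index") lookup.
 * Page layout, fan-out and cache are hypothetical, illustration only. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PAGE_SIZE 4096
#define FANOUT    255        /* separators per interior page; fits in 4 KB */

/* One interior ("index of index") page: sorted separator keys, each
 * pointing at the file offset of a child page one level down. */
typedef struct {
    uint32_t nkeys;
    uint64_t keys[FANOUT];      /* separator keys, ascending */
    uint64_t children[FANOUT];  /* file offsets of child pages */
} interior_page;

/* Tiny RAM cache of interior pages keyed by file offset.  In practice
 * the top level or two is small and stays hot here between queries. */
#define CACHE_SLOTS 64
static struct { int valid; uint64_t offset; interior_page page; } cache[CACHE_SLOTS];

static const interior_page *cache_get(uint64_t offset)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].offset == offset)
            return &cache[i].page;
    return NULL;
}

/* Binary search within one interior page for the child covering `key`. */
static uint64_t child_for_key(const interior_page *p, uint64_t key)
{
    uint32_t lo = 0, hi = p->nkeys;
    while (lo + 1 < hi) {
        uint32_t mid = (lo + hi) / 2;
        if (p->keys[mid] <= key) lo = mid; else hi = mid;
    }
    return p->children[lo];
}

/* Walk `levels - 1` interior levels, then read the leaf page.  Cached
 * levels cost no I/O; each uncached level costs one 4 KB pread(). */
static int lookup(int fd, uint64_t key, int levels, void *leaf_out)
{
    uint64_t offset = 0;        /* root page at offset 0 in this toy layout */
    interior_page scratch;

    for (int level = 0; level < levels - 1; level++) {
        const interior_page *p = cache_get(offset);
        if (!p) {
            if (pread(fd, &scratch, sizeof scratch, (off_t)offset) != (ssize_t)sizeof scratch)
                return -1;
            p = &scratch;
        }
        offset = child_for_key(p, key);
    }
    /* Final hop: the leaf page that actually holds the value. */
    return pread(fd, leaf_out, PAGE_SIZE, (off_t)offset) == PAGE_SIZE ? 0 : -1;
}

int main(void)
{
    int fd = open("table.idx", O_RDONLY);   /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }
    char leaf[PAGE_SIZE];
    if (lookup(fd, 123456789ULL, 3, leaf) == 0)   /* key and depth are arbitrary */
        puts("leaf fetched with at most one read per uncached level");
    close(fd);
    return 0;
}
```

The point is just that each level divides the search space by the fan-out, so a handful of levels covers an enormous key space, and only the levels that don't fit in RAM cost a disk read.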
For example, the following is something I worked on recently. It's a custom database (written by me), not Postgres, so the performance is higher, but the table scaling principles are similar. It has 200GB tables at the moment, and when it's warmed up, querying a single value takes just one 4k read from disk (a single large sector), because the tree index fits comfortably in RAM.
It runs at approximately 1.1 million random-access queries/second from a single SSD on my machine, which is just a $110/month x86 server. The CPU has to work quite hard to keep up because the data is compressed, albeit with special query-friendly compression.
If there were very little RAM, so nothing could be kept in it, the speed would drop by a factor of about 5, to 0.2 million queries/second. That shows you don't need a lot of RAM; it just helps.
Keeping the RAM and increasing the table size to roughly 10TB, the speed would drop by about half, to 0.5 million queries/second. In principle, with the same storage algorithms, a table size of roughly 1000TB (1PB) would drop it to 0.3 million queries/second, and roughly 50,000TB (50PB) would drop it to 0.2 million. (Of course those sizes won't fit on a single SSD; a real system of that size would have more parallel components and could have higher query performance.) You can grow to very large tables without much slowdown.
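For intuition about why the slowdown stays so gentle, here's a quick back-of-envelope (my own assumptions for illustration, not measurements from the system above): with 4 KB index pages, an assumed fan-out of about 256 and an assumed ~100 bytes per entry, index depth grows only logarithmically with table size, so going from 200 GB to tens of PB adds only a few levels, and therefore only a few extra reads per query for whatever doesn't fit in RAM. (Link with -lm.)

```c
/* Back-of-envelope: index depth vs. table size.  The 100-byte entry
 * size and fan-out of 256 are assumptions for illustration only. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double entry_bytes = 100.0;    /* assumed average entry size */
    const double fanout      = 256.0;    /* assumed keys per 4 KB interior page */
    const double sizes_tb[]  = { 0.2, 10, 1000, 50000 };   /* table sizes from the text */

    for (int i = 0; i < 4; i++) {
        double entries = sizes_tb[i] * 1e12 / entry_bytes;
        double levels  = ceil(log(entries) / log(fanout));
        printf("%8.1f TB -> ~%.0e entries -> ~%.0f index levels\n",
               sizes_tb[i], entries, levels);
    }
    /* Prints roughly 4, 5, 6 and 7 levels.  With the top levels cached
     * in RAM, each extra level is at most one extra disk read per query,
     * which is why throughput falls off so slowly as tables grow. */
    return 0;
}
```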
The current application is Ethereum L1 state and state history, but it has useful properties for other applications. It's particularly good at being small and fast, and at compressing time-varying blockchain-like or graph data.
As it's a prototype I'm not committing to final figures, but measurement, theory and prototype tests project the method to be significantly smaller and faster than other implementations, or at least competitive with the state of the art being researched by other groups.
> Why did you reinvent the wheel?
Different kind of wheel. No storage engine that I'm aware of has the desired combination of properties to get the size (small) and speed (IOPS, lower read & write amplification) in each of the types of operations required. Size and I/O are major bottlenecks for this type of application; in a way it's one of the worst cases for any kind of database or schema.
It's neither a B-tree nor an LSM-tree (not a fractal tree either), because all of those are algorithmically poor for some of the operations required. I found another structure after being willing to "go there", relating the application to low-level storage behaviour and reading older academic papers.
These data structures are not hard to understand or implement once you get used to them. As I've been working on and off on storage structures for many years as a hobby (yeah, it's fun!), it was only natural to consider building one when faced with an unusual performance challenge.
It also allowed me to leverage separate work I've done on raw Linux I/O performance (for filesystems, VMs etc), which is how random-access reads are able to reach millions/s on a single NVMe SSD.
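For context, the usual way to get anywhere near that rate of random 4 KB reads on Linux is O_DIRECT plus a batched submission interface such as io_uring, keeping many reads in flight per SSD. The sketch below is a generic, simplified illustration of that style, not my actual I/O code; the file name and offsets are placeholders, and it needs liburing (build with -luring).

```c
/* Generic sketch: batched 4 KB random reads with O_DIRECT + io_uring.
 * A simplified illustration of the raw-I/O style; "data.tbl" and the
 * offsets are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define QUEUE_DEPTH 64       /* keep many reads in flight per SSD */
#define BLOCK       4096

int main(void)
{
    /* O_DIRECT bypasses the page cache; buffers and offsets must be aligned. */
    int fd = open("data.tbl", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    struct io_uring ring;
    if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) return 1;

    void *bufs[QUEUE_DEPTH];
    for (int i = 0; i < QUEUE_DEPTH; i++)
        if (posix_memalign(&bufs[i], BLOCK, BLOCK)) return 1;

    /* Submit one batch of random 4 KB reads. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        off_t offset = (off_t)(rand() % 1024) * BLOCK;   /* arbitrary aligned offsets */
        io_uring_prep_read(sqe, fd, bufs[i], BLOCK, offset);
        io_uring_sqe_set_data(sqe, (void *)(uintptr_t)i);
    }
    io_uring_submit(&ring);

    /* Reap completions; a real engine keeps the queue topped up so the
     * SSD always has enough parallel requests to reach its full IOPS. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0) break;
        if (cqe->res != BLOCK)
            fprintf(stderr, "read %u: res=%d\n",
                    (unsigned)(uintptr_t)io_uring_cqe_get_data(cqe), cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```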
> Is it as durable as Postgres?
Yes.
Modulo implementation bugs (because it won't have the scrutiny and many eyes/years of testing that Postgres does).