
Over the last decade, the distributed-system nature of modern server hardware internals has become painfully evident in how software architectures scale on a single machine. The traditional approaches -- multithreading, locking, lock-free structures, etc. -- are all forms of coordination and agreement in a distributed system, with the attendant scalability problems if not used very carefully.

At some point several years ago, a few people noticed that if you attack the problem of scalable distribution within a single server the same way you would in a large distributed system (e.g. with a shared-nothing architecture), you can realize huge performance increases on a single machine. The caveat is that the software architectures look unorthodox.

The general model looks like this:

- one process per core, each locked to a single core

- use locked, NUMA-local RAM only (avoiding cross-node memory traffic)

- direct dedicated network queue (bypass kernel)

- direct storage I/O (bypass kernel)

If you do it right, you minimize the amount of silicon that is shared between processes, which has surprisingly large performance benefits. Linux has facilities that make this relatively straightforward too.
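To make the first two points concrete: on Linux the pinning and memory side takes only a few lines with sched_setaffinity(2) and libnuma (link with -lnuma). A minimal sketch, with the core number and arena size as placeholders and error handling abbreviated:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <numa.h>
    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Each shard process owns exactly one core; 0 is a placeholder. */
        int core = 0;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }

        /* Allocate the shard's arena on the NUMA node this core belongs to,
         * then lock it into RAM so it is never paged out. */
        size_t len = 64UL * 1024 * 1024;
        void *arena = numa_alloc_local(len);
        if (arena == NULL || mlock(arena, len) != 0) {
            fprintf(stderr, "arena allocation failed\n");
            return 1;
        }

        /* ... run the shard's event loop here, working only out of `arena` ... */

        numa_free(arena, len);
        return 0;
    }

numactl --physcpubind/--membind can get you part of the way from the command line as well; the network and storage bypass parts (DPDK, O_DIRECT) are a separate story.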

As a consequence, adjacent cores on the same CPU have only marginally more interaction with each other than cores on different machines entirely. Treating a single server as a distributed cluster of 1-core machines, and writing the software in such a way that the operating system's behavior reflects that model to the extent possible, is a great architecture for extreme performance, but you rarely see it outside of closed source software.

As a corollary, garbage-collected languages do not work for this at all.



I think that's a generalization that simply shifts the burden elsewhere, and cannot be said to be "the right" architecture in general. There is a reason CPUs implement cache-coherence on top of their "innate" shared-nothing design, and the reason is abstraction. If you don't need certain abstractions, then a sharded approach is indeed optimal, but if you do, then you have to implement them at some level or another, and it's often better to rely on hardware messaging (cache-coherence) than software messaging.

So for some abstractions, such as columnar and other analytics databases, a sharded architecture works very well, but if you need isolated online transactions, strong consistency, etc., then sharding no longer cuts it. Instead, you should rely on algorithms that minimize coherence traffic while maintaining a consistent shared-memory abstraction. In fact, there are algorithms that provide the exact same linear scalability as sharding with a small added latency constant (not a constant factor but a constant addition) while providing much richer abstractions.
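A toy illustration of the general idea (not one of those richer algorithms): a striped counter keeps the abstraction of one logical counter while the hot path never contends on a shared cache line. The stripe count and relaxed memory ordering below are arbitrary choices, and the read is not a point-in-time snapshot:

    #include <stdatomic.h>
    #include <stdalign.h>

    #define NSTRIPES 64              /* >= core count; just a placeholder */

    /* One logical counter, physically split into per-core stripes so the
     * hot path never bounces a shared cache line between cores. */
    struct stripe {
        alignas(64) atomic_long n;   /* one cache line per stripe */
    };

    static struct stripe stripes[NSTRIPES];

    /* Hot path: each core touches only its own stripe. */
    void counter_inc(unsigned core_id)
    {
        atomic_fetch_add_explicit(&stripes[core_id % NSTRIPES].n, 1,
                                  memory_order_relaxed);
    }

    /* Cold path: the shared view is produced by summing the stripes. */
    long counter_read(void)
    {
        long total = 0;
        for (int i = 0; i < NSTRIPES; i++)
            total += atomic_load_explicit(&stripes[i].n, memory_order_relaxed);
        return total;
    }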

Similarly, your statement about "garbage-collected languages" is misplaced. First, there's that abstraction issue again -- if you need a consistent shared memory, then a good GC can be extremely effective. Second, while it is true that GCs don't contribute much to a sharded infrastructure (and may harm its performance), GCs and "GCed languages" are two different things. For example, in Java you don't have to use the GC. In fact, it is common practice for high-performance, sharded-architecture Java code to use manually-managed memory.


You can still offer isolated transactions, tunable consistency, etc. within a shard though, which Cassandra does.

And yes, you can write high performance Java, but for whatever reasons the Cassandra codebase isn't an example of that. They just did a big storage engine rewrite and the result is slower.

https://issues.apache.org/jira/browse/CASSANDRA-7486


> You can still offer isolated transactions, tunable consistency, etc. within a shard though

And if that happens to be exactly all you need then that's great! :)

> And yes, you can write high performance Java, but for whatever reasons the Cassandra codebase isn't an example of that.

I don't know anything about the Cassandra codebase, but one thing I'm often asked is: if you're not going to use the GC in Java, why write Java at all? The answer is that the very tight core code, like a shard's event loop, turns out to be a rather small part of the total codebase. There's more code dedicated to support functions (such as monitoring and management) than to the core, and relying on the GC for that code makes writing it much more convenient without affecting your performance.


Google built multi-row transactions on top of a partitioned row store though? I guess I'm not really sure which applications have a shared memory architecture like the one you're describing.

And to be fair, some people are pretty productive in modern C++. It's a shame the JNI isn't better so you can have the best of both worlds.

(for the record, Quasar is #1 on my list of libraries to try if I go back to Java)


Of course you can implement any shared/consistent memory transaction on top of shards -- after all, the CPU implements shared memory on top of message-passing in a shared-nothing architecture, too. It's just that then you end up implementing that abstraction yourself. If you need it, someone has to implement it, and if you need it at the machine level, it's better to rely on its hardware implementation than re-implement the same thing in software. Naive implementations end up creating much more contention (i.e. slow communications) than a sophisticated use of hardware concurrency (i.e. communication) instructions.

My point is that if you're providing a shared-memory abstraction to your user (like arbitrary transactions) -- even at a very high level -- then your architecture isn't "shared-nothing", period. Somewhere in your stack there's an implementation of a shared-memory abstraction. And if you decide to call anything that doesn't use shared memory at CPU/OS level "shared nothing", then that's an arbitrary and rather senseless distinction, because even at the CPU/OS level, shared memory is implemented on top of message-passing. So the cost of a shared abstraction is incurred when it's provided to the user, and is completely independent of how it's implemented. The only way to avoid it is to restrict the programming model and not provide the abstraction. If doing that is fine for the user -- great, but there's no way to have this abstraction without paying for it.

And JNI is better now[1] (I've used JNR to implement FUSE filesystems in pure Java). JNR will serve as the basis for "JNI 2.0" -- Project Panama[2]. And thanks!

[1]: https://github.com/jnr/jnr-ffi

[2]: http://openjdk.java.net/projects/panama/


Very interesting, thanks


I really appreciate your insight, but I don't understand why you keep implying that "open source" means some kind of lagging-behind design?


In practice, open source databases use more traditional, simpler architectures for which there is a lot of literature. Ironically, you see a lot more creativity and experimentation in closed source database architectures, and this has accrued some substantial benefits to those implementations.

The architecture at the link looks unusual compared to open source databases, but it is actually a common pattern in closed source databases, with significant benefits, particularly when it comes to performance. There is a lot of what I would call "guild knowledge" in advanced database engine design, much as in HPC: things the small number of experts all seem to know but no one ever writes down.

It is a path dependency problem. Most open source databases were someone's first serious attempt at designing a database, a project that turned into a product. This is an excellent way to learn but it would be unrealistic to expect a thoroughly expert design for a piece of software of such complexity on the first (or second, or third) go at it. The atypical quality of PostgreSQL is a testament to the importance of having an experienced designer involved in the architecture.


It's not exactly guild knowledge; there's just a lot of legacy baggage with open source projects that were started by random people and became popular before much thought was given to the architecture. This model has been considered by the Cassandra devs for at least the last two years, and there are open JIRA tickets associated with it; it just hasn't been considered a priority.


What aspect of the architecture seems unusual?


I agree. I suppose that the vast majority of closed source DBs do not follow this shared nothing architecture either.


How does one bypass the kernel for network and disk I/O? I've never heard of doing this before (as far as I knew, I/O always goes through a system call).


Basically mapping the device registers into user space and doing exactly what the kernel would do, but without the syscall overhead.

Some researchers made a big splash at OSDI by doing this securely:

http://people.inf.ethz.ch/troscoe/pubs/peter-arrakis-osdi14....


Oracle has long argued for the value of bypassing the file system and the associated kernel drivers with its raw devices and ASM. It'd be interesting to see such a thing land on other platforms.


For the network, the kernel is bypassed through DPDK. For disk, the syscalls are still there, but they're always async I/O with O_DIRECT, so the OS won't cache or buffer anything. That's what you bypass.
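Roughly, the disk side looks like this (a minimal sketch with libaio and O_DIRECT; the device path and the 4 KiB alignment are placeholders, error handling is abbreviated, and a real shard would poll for completions from its event loop instead of blocking; link with -laio):

    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* O_DIRECT bypasses the page cache; buffer, offset and length must
         * all be block-aligned (4 KiB assumed here). */
        const size_t len = 4096;
        int fd = open("/dev/sdX", O_RDONLY | O_DIRECT);   /* placeholder device */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, len) != 0) return 1;

        io_context_t ctx = 0;
        if (io_setup(8, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

        /* Queue one aligned read at offset 0 and submit it -- the syscall is
         * still there, but the kernel does no caching or buffering for us. */
        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, len, 0);
        if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

        /* A real shard would poll with a zero timeout from its event loop. */
        struct io_event ev;
        if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) { fprintf(stderr, "io_getevents failed\n"); return 1; }
        printf("read %ld bytes directly from the device\n", (long)ev.res);

        io_destroy(ctx);
        free(buf);
        close(fd);
        return 0;
    }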


> As a corollary, garbage-collected languages do not work for this at all.

As someone naive about this...

Is it possible to reap some of the benefits of this kind of architecture in a managed environment (like .NET)?

If I try to implement, for example, a SQLite-like/kdb+ in-memory database in .NET, how far is it possible to go?

And how do I avoid some common traps?


This approach seems to dovetail nicely with unikernel approaches. Have you experimented with combining the above constraints with a system like MirageOS?


Strongly agreed. It's great that these HPC techniques are finally starting to trickle down.


> one process per core, each locked to a single core; use locked, NUMA-local RAM only; direct dedicated network queue (bypass kernel); direct storage I/O (bypass kernel)

I have no idea how to do any of these things. What are the system/API calls to lock a process to a core? How do you bypass kernel I/O?


Take a look at sched_setaffinity and Intel DPDK.
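For the network side, the shape of a DPDK receive path is roughly this (a heavily abbreviated skeleton: one port, one queue, no packet parsing; the port id, pool sizes and EAL arguments are placeholders, and the exact API varies a bit between DPDK versions):

    #include <rte_eal.h>
    #include <rte_lcore.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define RX_RING_SIZE 1024
    #define BURST_SIZE   32

    int main(int argc, char **argv)
    {
        if (rte_eal_init(argc, argv) < 0)        /* takes over the NIC from the kernel */
            return 1;

        /* Packet-buffer pool allocated on this core's NUMA node. */
        struct rte_mempool *pool = rte_pktmbuf_pool_create(
            "rx_pool", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
        if (pool == NULL)
            return 1;

        uint16_t port = 0;                       /* placeholder port id */
        struct rte_eth_conf conf = {0};
        rte_eth_dev_configure(port, 1, 1, &conf);            /* 1 rx, 1 tx queue */
        rte_eth_rx_queue_setup(port, 0, RX_RING_SIZE, rte_socket_id(), NULL, pool);
        rte_eth_tx_queue_setup(port, 0, RX_RING_SIZE, rte_socket_id(), NULL);
        rte_eth_dev_start(port);

        for (;;) {
            /* Poll the hardware queue directly: no interrupt, no syscall. */
            struct rte_mbuf *bufs[BURST_SIZE];
            uint16_t n = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
            for (uint16_t i = 0; i < n; i++) {
                /* ... process the packet in place ... */
                rte_pktmbuf_free(bufs[i]);
            }
        }
    }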



