
Over the last decade, the distributed-system nature of modern server hardware internals has become painfully evident in how software architectures scale on a single machine. The traditional approaches -- multithreading, locking, lock-free structures, etc. -- are all forms of coordination and agreement in a distributed system, with the attendant scalability problems if not used very carefully.

At some point several years ago, a few people noticed that if you attack the problem of scalable distribution within a single server the same way you would in a large distributed system (e.g. with a shared-nothing architecture), you can realize huge performance increases on a single machine. The caveat is that the software architectures look unorthodox.

The general model looks like this:

- one process per core, each locked to a single core

- use locked, NUMA-local RAM only (avoiding cross-node memory traffic)

- direct dedicated network queue (bypass kernel)

- direct storage I/O (bypass kernel)

If you do it right, you minimize the amount of silicon that is shared between processes, which has surprisingly large performance benefits. Linux has facilities that make this relatively straightforward too.
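To make the first two points concrete: on Linux the pinning and memory side takes only a few lines with sched_setaffinity(2) and libnuma (link with -lnuma). A minimal sketch, with the core number and arena size as placeholders and error handling abbreviated:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <numa.h>
    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Each shard process owns exactly one core; 0 is a placeholder. */
        int core = 0;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }

        /* Allocate the shard's arena on the NUMA node this core belongs to,
         * then lock it into RAM so it is never paged out. */
        size_t len = 64UL * 1024 * 1024;
        void *arena = numa_alloc_local(len);
        if (arena == NULL || mlock(arena, len) != 0) {
            fprintf(stderr, "arena allocation failed\n");
            return 1;
        }

        /* ... run the shard's event loop here, working only out of `arena` ... */

        numa_free(arena, len);
        return 0;
    }

numactl --physcpubind/--membind can get you part of the way from the command line as well; the network and storage bypass parts (DPDK, O_DIRECT) are a separate story.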

As a consequence, adjacent cores on the same CPU have only marginally more interaction with each other than cores on different machines entirely. Treating a single server as a distributed cluster of 1-core machines, and writing the software in such a way that the operating system's behavior reflects that model to the extent possible, is a great architecture for extreme performance, but you rarely see it outside of closed source software.

As a corollary, garbage-collected languages do not work for this at all.



I think that's a generalization that simply shifts the burden elsewhere, and cannot be said to be "the right" architecture in general. There is a reason CPUs implement cache-coherence on top of their "innate" shared-nothing design, and the reason is abstraction. If you don't need certain abstractions, then a sharded approach is indeed optimal, but if you do, then you have to implement them at some level or another, and it's often better to rely on hardware messaging (cache-coherence) than software messaging.

So for some abstractions, such as columnar and other analytics databases, a sharded architecture works very well, but if you need isolated online transactions, strong consistency, etc., then sharding no longer cuts it. Instead, you should rely on algorithms that minimize coherence traffic while maintaining a consistent shared-memory abstraction. In fact, there are algorithms that provide the exact same linear scalability as sharding with a small added latency constant (not a constant factor but a constant addition) while providing much richer abstractions.
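A toy illustration of the general idea (not one of those richer algorithms): a striped counter keeps the abstraction of one logical counter while the hot path never contends on a shared cache line. The stripe count and relaxed memory ordering below are arbitrary choices, and the read is not a point-in-time snapshot:

    #include <stdatomic.h>
    #include <stdalign.h>

    #define NSTRIPES 64              /* >= core count; just a placeholder */

    /* One logical counter, physically split into per-core stripes so the
     * hot path never bounces a shared cache line between cores. */
    struct stripe {
        alignas(64) atomic_long n;   /* one cache line per stripe */
    };

    static struct stripe stripes[NSTRIPES];

    /* Hot path: each core touches only its own stripe. */
    void counter_inc(unsigned core_id)
    {
        atomic_fetch_add_explicit(&stripes[core_id % NSTRIPES].n, 1,
                                  memory_order_relaxed);
    }

    /* Cold path: the shared view is produced by summing the stripes. */
    long counter_read(void)
    {
        long total = 0;
        for (int i = 0; i < NSTRIPES; i++)
            total += atomic_load_explicit(&stripes[i].n, memory_order_relaxed);
        return total;
    }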

Similarly, your statement about "garbage-collected languages" is misplaced. First, there's that abstraction issue again -- if you need a consistent shared memory, then a good GC can be extremely effective. Second, while it is true that GCs don't contribute much to a sharded infrastructure (and may harm its performance), GCs and "GCed languages" are two different things. For example, in Java you don't have to use the GC. In fact, it is common practice for high-performance, sharded-architecture Java code to use manually-managed memory.


You can still offer isolated transactions, tunable consistency, etc. within a shard though, which Cassandra does.

And yes, you can write high performance Java, but for whatever reasons the Cassandra codebase isn't an example of that. They just did a big storage engine rewrite and the result is slower.

https://issues.apache.org/jira/browse/CASSANDRA-7486


> You can still offer isolated transactions, tunable consistency, etc. within a shard though

And if that happens to be exactly all you need then that's great! :)

> And yes, you can write high performance Java, but for whatever reasons the Cassandra codebase isn't an example of that.

I don't know anything about the Cassandra codebase, but one thing I'm often asked is: if you're not going to use the GC in Java, why write Java at all? The answer is that the very tight core code, like a shard's event loop, turns out to be a rather small part of the total codebase. There's more code dedicated to support functions (such as monitoring and management) than to the core, and relying on the GC for that code makes writing it much more convenient without affecting your performance.


Google built multi-row transactions on top of a partitioned row store though? I guess I'm not really sure which applications have a shared memory architecture like the one you're describing.

And to be fair, some people are pretty productive in modern C++. It's a shame the JNI isn't better so you can have the best of both worlds.

(for the record, Quasar is #1 on my list of libraries to try if I go back to Java)


Of course you can implement any shared/consistent memory transaction on top of shards -- after all, the CPU implements shared memory on top of message-passing in a shared-nothing architecture, too. It's just that then you end up implementing that abstraction yourself. If you need it, someone has to implement it, and if you need it at the machine level, it's better to rely on its hardware implementation than re-implement the same thing in software. Naive implementations end up creating much more contention (i.e. slow communications) than a sophisticated use of hardware concurrency (i.e. communication) instructions.

My point is that if you're providing a shared-memory abstraction to your user (like arbitrary transactions) -- even at a very high level -- then your architecture isn't "shared-nothing", period. Somewhere in your stack there's an implementation of a shared-memory abstraction. And if you decide to call anything that doesn't use shared memory at CPU/OS level "shared nothing", then that's an arbitrary and rather senseless distinction, because even at the CPU/OS level, shared memory is implemented on top of message-passing. So the cost of a shared abstraction is incurred when it's provided to the user, and is completely independent of how it's implemented. The only way to avoid it is to restrict the programming model and not provide the abstraction. If doing that is fine for the user -- great, but there's no way to have this abstraction without paying for it.

And JNI is better now[1] (I've used JNR to implement FUSE filesystems in pure Java). JNR will serve as the basis for "JNI 2.0" -- Project Panama[2]. And thanks!

[1]: https://github.com/jnr/jnr-ffi

[2]: http://openjdk.java.net/projects/panama/


Very interesting, thanks


I really appreciate your insight, but I don't understand why you keep implying that "open source" means some kind of lagging-behind design?


In practice, open source databases use more traditional, simpler architectures for which there is a lot of literature. Ironically, you see a lot more creativity and experimentation in closed source database architectures, and this has accrued some substantial benefits to those implementations.

The architecture at the link looks unusual compared to open source databases, but it is actually a common pattern in closed source databases, with significant benefits, particularly when it comes to performance. There is a lot of what I would call "guild knowledge" in advanced database engine design, much as in HPC: things the small number of experts all seem to know but no one ever writes down.

It is a path dependency problem. Most open source databases were someone's first serious attempt at designing a database, a project that turned into a product. This is an excellent way to learn but it would be unrealistic to expect a thoroughly expert design for a piece of software of such complexity on the first (or second, or third) go at it. The atypical quality of PostgreSQL is a testament to the importance of having an experienced designer involved in the architecture.


It's not exactly guild knowledge; there's just a lot of legacy baggage with open source projects that were started by random people and became popular before much thought was given to the architecture. This model has been considered by the Cassandra devs for at least the last two years, and there are open JIRA tickets associated with it; it just hasn't been considered a priority.


What aspect of the architecture seems unusual?


I agree. I suppose that the vast majority of closed source DBs do not follow this shared nothing architecture either.


How does one bypass the kernel for network and disk I/O? I've never heard of doing this before (as far as I knew, I/O always goes through a system call).


Basically mapping the device registers into user space and doing exactly what the kernel would do, but without the syscall overhead.

Some researchers made a big splash at OSDI by doing this securely:

http://people.inf.ethz.ch/troscoe/pubs/peter-arrakis-osdi14....


Oracle has long argued for the value of bypassing the file system and the associated kernel drivers with its raw devices and ASM. It'd be interesting to see such a thing land on other platforms.


For the network, the kernel is bypassed through DPDK. For disk, the syscalls are still there, but they're always async I/O with O_DIRECT, so the OS won't cache or buffer anything. That's what you bypass.
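Roughly, the disk side looks like this (a minimal sketch with libaio and O_DIRECT; the device path and the 4 KiB alignment are placeholders, error handling is abbreviated, and a real shard would poll for completions from its event loop instead of blocking; link with -laio):

    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* O_DIRECT bypasses the page cache; buffer, offset and length must
         * all be block-aligned (4 KiB assumed here). */
        const size_t len = 4096;
        int fd = open("/dev/sdX", O_RDONLY | O_DIRECT);   /* placeholder device */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, len) != 0) return 1;

        io_context_t ctx = 0;
        if (io_setup(8, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

        /* Queue one aligned read at offset 0 and submit it -- the syscall is
         * still there, but the kernel does no caching or buffering for us. */
        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, len, 0);
        if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

        /* A real shard would poll with a zero timeout from its event loop. */
        struct io_event ev;
        if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) { fprintf(stderr, "io_getevents failed\n"); return 1; }
        printf("read %ld bytes directly from the device\n", (long)ev.res);

        io_destroy(ctx);
        free(buf);
        close(fd);
        return 0;
    }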


> As a corollary, garbage-collected languages do not work for this at all.

As someone naive about this...

Is it possible to reap some of the benefits of this kind of architecture in a managed environment (like .NET)?

If I try to implement, for example, a SQLite-like/kdb+ in-memory database in .NET, how far is it possible to go?

And how do I avoid some common traps?


This approach seems to dovetail nicely with unikernel approaches. Have you experimented with combining the above constraints with a system like MirageOS?


Strongly agreed. It's great that these HPC techniques are finally starting to trickle down.


> one process per core, each locked to a single core; use locked, NUMA-local RAM only; direct dedicated network queue (bypass kernel); direct storage I/O (bypass kernel)

I have no idea how to do any of these things. What are the system/API calls to lock a process to a core? How do you bypass kernel I/O?


Take a look at sched_setaffinity and Intel DPDK.
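For the network side, the shape of a DPDK receive path is roughly this (a heavily abbreviated skeleton: one port, one queue, no packet parsing; the port id, pool sizes and EAL arguments are placeholders, and the exact API varies a bit between DPDK versions):

    #include <rte_eal.h>
    #include <rte_lcore.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define RX_RING_SIZE 1024
    #define BURST_SIZE   32

    int main(int argc, char **argv)
    {
        if (rte_eal_init(argc, argv) < 0)        /* takes over the NIC from the kernel */
            return 1;

        /* Packet-buffer pool allocated on this core's NUMA node. */
        struct rte_mempool *pool = rte_pktmbuf_pool_create(
            "rx_pool", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
        if (pool == NULL)
            return 1;

        uint16_t port = 0;                       /* placeholder port id */
        struct rte_eth_conf conf = {0};
        rte_eth_dev_configure(port, 1, 1, &conf);            /* 1 rx, 1 tx queue */
        rte_eth_rx_queue_setup(port, 0, RX_RING_SIZE, rte_socket_id(), NULL, pool);
        rte_eth_tx_queue_setup(port, 0, RX_RING_SIZE, rte_socket_id(), NULL);
        rte_eth_dev_start(port);

        for (;;) {
            /* Poll the hardware queue directly: no interrupt, no syscall. */
            struct rte_mbuf *bufs[BURST_SIZE];
            uint16_t n = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
            for (uint16_t i = 0; i < n; i++) {
                /* ... process the packet in place ... */
                rte_pktmbuf_free(bufs[i]);
            }
        }
    }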



