Keeping CALM: When Distributed Consistency Is Easy

lukevp · on Aug 25, 2020

This is great stuff, will need time to read it and digest. I feel that CRDTs are something that should have been invented much earlier than 2011. Formalizing the conversation around what the base level assumptions are that are necessary to build these systems is really exciting. I have a conceptual understanding of the implications of distributing state and coordination of changes on that state, but it’s so much easier for us all to get things right when there’s best practices and understanding around these concepts. It’s kind of like how Raft made an easy to understand and include consensus library, or Yjs the same for CRDTs, or libsodium makes it easy(er) to do security correctly. It helps us develop the native units of computing for distributed systems, in the “geographically distinct edge computing sense” and not in the “a bunch of nodes with fast interconnects” sense, where offline is common and coordination has major performance implications to UIs.

SmooL · on Aug 25, 2020

I find the similarities & diffferences interesting between this and CRDTs.

This seems to be saying that any algorithm with a monotonic output with respect to input information "has a consistent, coordination-free distributed implementation".

As I understand it, for data to be monotonic requires that the data is partially orderable.

CRDTs require partial ordering, as well as a merge() function so as to create a lattice.

This seems then that CRDT's have stronger requirements - this seems to make sense, since CRDT's are about sharing data, whereas this CALM theory is only talking about making a local decision.

dwenzek · on Aug 26, 2020

> CRDT's are about sharing data, whereas this CALM theory is only talking about making a local decision.

Both CRDT and CALM are about sharing data to make a local decision which is globally consistent.

Both use an order relation over datasets to modelise the concept of "adding more info to some partial input or output".

I would say that the difference is on their focus. CALM determines the frontier where no coordination is required and provides general criteria (no need either to retract former output nor to hear everything there is to hear nor to know all the participants). CDRT provides a mean to meet this criteria. By taking the least upper bound of former partial results, CDRT ensures that the outcome is growing.

vivekseth · on Aug 26, 2020

Another related conversation when this was discussed on The Morning Paper: https://news.ycombinator.com/item?id=19316737

j-pb · on Aug 25, 2020

So basically every system wich allows you to 'delete' stuff, instead of 'forgetting' it has it (distributed state) wrong.

Good thing, that that's not every database ever cough

hinkley · on Aug 26, 2020

I think they're saying that deletion is a consequence of reachability.

A 'delete' is then the removal of all edges that lead to the forgotten item.

j-pb · on Aug 26, 2020

I think you're conflating the deadlock-detection/garbage collector examples with the general gist of the paper, or maybe I just misunderstood?

My point is that 'deletion' in every dbms - be it SQL or NoSQL - is equivalent to logical negation of a statement, which in turn is equivalent to the non monotonic query example that they use throughout the article `¬∃ ("not exists")`.

There is a major logical difference between an explicit delete (negation in a closed world interpretation) and a forget (temporary inconsistency in an open world interpretation), in the sense that the latter can be repaired by subsequent messaging.

This means that you can't have easy distributed consistency with our current databases and data models.

This is partially addressed by temporal databases like datomic and juxt, however they too have delete/retract as logical negation, which makes their monotonicity properties limited to the set of 'change operations' instead of the data itself, which in turn makes it hard to work with. If their monotonicity were on the data-model layer the entire database would essentially become a CRDT.

sriram_malhar · on Aug 26, 2020

Sure, negation and deletion are non-monotonic. And sure, they require coordination and waiting if you really want serializability. But the modern default is snapshot isolation, and experience has shown that there are many many applications that are quite willing to live with this weaker level of isolation (eventual consistency). You can get monotonic reads/writes, without having to wait for all of them to respond, which is the point of this paper. Multi-version databases are lattices.

j-pb · on Aug 26, 2020

Tbh I think you got it backwards. CALM is actually very much about eventual consistency. The gist is basically, that distributed systems are only eventually consistent iff they are logically monotonic.

Coordination as used in this article basically means "needs to be decided by an authority". But because of CALM we know that in order to elect an authority/leader, you need a consensus algorithm that actually has monotonicity at it's core, like raft.

There's a really great talk on this by Heidi Howard: https://www.youtube.com/watch?v=KTHOwgpMIiU

I actually don't think that database isolation levels are a good way to talk about this stuff, because they are practical guarantees, not theoretical ones. Snapshot isolation for example also requires a leader.

From a theoretical point of view there is no in between (at least none that I can see), either you are eventually consistent or you're not, either you are monotonic or you're not.

jupp0r · on Aug 26, 2020

> experience has shown that there are many many applications that are quite willing to live with this weaker level of isolation (eventual consistency)

My experience has also shown that the effort for getting this right is vastly underestimated, initially and that eventually most of these many many applications end up with a set of bugs triggered by edge cases nobody thought about.

hinkley · on Aug 26, 2020

thedailywtf used to be a fun place for seeing people answer the question, "what could go wrong?" but it got too dark or cringey for me (I can find dark without any help, thank you.)

There's a whole ecosystem of minor catastrophes that don't quite warrant a blog post. Do we need a name for the Chesterton house on the corner of Dunning Way and Kruger Lane?

At the same time, it's really hard to learn why that fence is there until you at least see someone else get injured by ignoring the signs, so in that respect TDWTF probably taught a lot of people about quite a few fences.

hinkley · on Aug 26, 2020

Typically we use “conflate” as a synonym for confuse, but it can also mean removing a distinction.

In which case yes, I am conflating the two. I’m saying the paper is making a distinction that doesn’t exist. Concurrent reachability is much the same problem everywhere, even between mark and sweep and reference counting. It’s all thread safe graph theory.

j-pb · on Aug 26, 2020

I think you're actually confusing the concepts.

They talks about a more general concept of consistency that is per se unrelated to graph algorithms.

The examples they have use to illustrate the idea are deadlock detection and garbage collection in a distributed setting, which are two very distinct problems, in that the former will eventually converge, while the other does not necessarily due to the 2 generals problem (you could try to move all the information to a single node, and you can do so in practice, but because of the 2 generals problem you can't once your connections have a chance of being lossy).