Ok, so having worked on distributed consensus a bunch, here are a few thoughts in no particular order:
* In the real world, servers misbehave, like, a lot, so you have to deal with that fact. All assumptions about network robustness in particular will be proven wrong on a long enough timeline.
* Leader election in a world without a robust network is an exercise in managing acceptable failure tolerances. Every application has a notion of acceptable failure rates or acceptable downtime. Discovering that is a non-trivial effort.
Jeff Dean and others at Google famously came to the conclusion that it was ok for some parts of the Gmail service to be down for limited periods of time. Accepting that all self-healing/robust systems will eventually degrade and have to be restored is the first step in building something manageable. The AXD301 is the most robust system ever built by humans to my knowledge (I think it did 20 years of uptime in production). Most other systems will fail long before that. Managing systems as they fail is an art, particularly as all systems operate in a degraded state.
In short, in a lab environment networks function really well. In the real world, it's a jungle. Plan accordingly.
To add to that: "uptime" is a sort of misleading statistic. A single server can have an "uptime" of 20 years sitting in the closet of an office in a corporate park somewhere in suburban Maryland. To the point that nobody can find the server, but they can ssh to it, and yup... it's still up.
On the other hand, the more components (servers, disks, switches, etc.) you have, the higher the probability of failure. More traffic, more data, more changes, more... anything tends to lead to more failure. So paradoxically, small systems can have better "uptime" than large, robust, well-designed ones.
In my experience managing large distributed, decentralized systems, the buggier the hardware, the larger the cascades of software failures and the harder they were to fix. Industrial-grade hardware that nobody monkeyed with (or that was replaced as soon as any failure was detected) led to the most stable systems. Reliable, dedicated private network paths led to more stability. At a certain point we didn't really need leader election, because we'd never need a new leader, because things hardly ever failed.
If you just take one device, it can either be completely functional and chug along for 20 years, or completely fail, which happens rarely but is painful. In either case, your software can ignore the failure state; there's nothing it can do in the case of a total failure.
If you take 1000 such devices, then the vast majority will be fine, but a small number of them will fail. No matter how you fix them and replace them, you will keep seeing some degraded nodes, or temporarily disconnected nodes because the network has glitches, too. So your software has to always be ready to face a failure of a remote node, and always be ready to gracefully handle it. This completely changes the way you think.
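To put rough numbers on that intuition, here's a back-of-the-envelope sketch (the per-node failure probability is made up purely for illustration):

```rust
// Back-of-the-envelope: if a single node has a 1% chance of being degraded
// on any given day (a made-up number), what's the chance that *some* node
// in the fleet is degraded?
fn p_any_failure(nodes: u32, p_node: f64) -> f64 {
    1.0 - (1.0 - p_node).powi(nodes as i32)
}

fn main() {
    for n in [1, 10, 100, 1000] {
        println!(
            "{:>5} nodes -> {:.3}% chance at least one is degraded",
            n,
            100.0 * p_any_failure(n, 0.01)
        );
    }
    // At 1000 nodes the probability is ~99.996%: "something is broken" is the
    // steady state, so the software has to treat node failure as a normal
    // input, not an exceptional one.
}
```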
Good point. Consensus at high data scales is not achievable. We are building a cloud-native log analytics solution at logiq.ai. Our initial design had Raft at the core. Though it is relatively easy to implement, when we started hitting high data rates (50+ Gb per hour) the Raft APIs started to slow down the entire system. We removed it and made each pod stateless, which allowed us to scale exponentially. Now we can go from 1 Gb per hour to hundreds with a single `kubectl scale sts` command. Eventual consistency is the way forward for applications that generate high volumes of data.
Of course, some applications do require consensus; for those, I would use Raft as a starting point.
I don't agree with this. Eventual consistency was majorly hyped a while ago, but these days I don't think there's widespread consensus on this issue. The complexity of getting an eventually consistent system working correctly is often much higher than in a system with a leader.
Most large real-time systems (or near enough) must be eventually consistent due to speed of light delays between all nodes of the system.
Multiplayer video games have big problems with light speed. A user could be 100ms away from the server, and another user 100ms in the other direction, so anything one user does, the other will see 200ms later. If the game weren't eventually consistent, it would effectively have to be turn-based: a 200ms round trip only allows about 5 confirmed actions per second, while most real-time games run at 60-100 ticks per second.
A bank account is also eventually consistent, as transactions need several days to clear, so the exact balance is unknown. This is why a bank establishes an "available" balance: the most pessimistic estimate of an account's balance.
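As a toy illustration of that "most pessimistic estimate" idea (the struct and field names below are invented; real core-banking systems are far messier):

```rust
// Toy model of an "available" balance: the posted balance minus every
// pending debit that might still clear. Field names are invented for
// illustration only.
struct Account {
    posted_balance_cents: i64,      // what has fully cleared
    pending_debits_cents: Vec<i64>, // authorizations that may still settle
}

impl Account {
    fn available_cents(&self) -> i64 {
        // Pessimistic: assume every pending debit will clear.
        self.posted_balance_cents - self.pending_debits_cents.iter().sum::<i64>()
    }
}

fn main() {
    let acct = Account {
        posted_balance_cents: 100_00,
        pending_debits_cents: vec![25_00, 10_50],
    };
    // Posted says $100.00, but the bank will only let you spend $64.50.
    println!("available: {} cents", acct.available_cents());
}
```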
Like with all things, designing a system as eventually consistent or as a leader-type really depends on the application, the team that is going to build it and the resources available to support it.
Since the article mentions Google as the outlier preferring Paxos, I may be able to shed some light from a few years ago.
The Paxos, paxosdb, and related libraries (despite the name, all are multi-paxos) are solid and integrated directly into a number of products (Borg, Chubby, CFS, Spanner, etc.). There are years of engineering effort and unit tests behind the core Paxos library and so it makes sense to keep using and improving it instead of going off to Raft. As far as I am aware the Google Paxos implementation predates Raft by quite a while.
I think in general, if most other people use Raft, it's better for the community to converge on stable, well-tested shared implementations, for much the same reason it's good for Google to stick with Paxos.
This makes sense to me. Very few of us have the resources to maintain, for example, the kind of globally synced (way beyond typical NTP) clock infrastructure that Google has (TrueTime[1]).
This is just best effort on Google's end, right? I don't think anything is documented/guaranteed such that you could, for example, rely on it the way Spanner relies on TrueTime.
Sure, you'd have to invent the rest, but TrueTime isn't about having perfect clocks; it's about estimating the error of your peers' clocks. Having a reasonable platform clock is a good starting point, and addresses the problems discussed in the Jane Street article.
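The core of the idea is an interval clock: instead of a single timestamp you get [earliest, latest], and you wait out the uncertainty before acting on it. Here's a minimal sketch of that idea only; this is not Google's API, and the error bound is a made-up constant rather than something derived from GPS/atomic references:

```rust
use std::thread::sleep;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// An interval timestamp: the true time is somewhere in [earliest, latest].
struct TTInterval {
    earliest: Duration,
    latest: Duration,
}

/// Made-up uncertainty bound; TrueTime derives this from GPS/atomic clock
/// references and how long ago the local clock was last synced.
const EPSILON: Duration = Duration::from_millis(5);

fn now() -> TTInterval {
    let t = SystemTime::now().duration_since(UNIX_EPOCH).unwrap();
    TTInterval { earliest: t - EPSILON, latest: t + EPSILON }
}

/// Spanner-style "commit wait": don't act on a timestamp until we're sure
/// every other node's clock has passed it.
fn commit_wait(commit_ts: &TTInterval) {
    while now().earliest <= commit_ts.latest {
        sleep(Duration::from_millis(1));
    }
}

fn main() {
    let ts = now();
    commit_wait(&ts); // blocks roughly 2 * EPSILON
    println!("timestamp is now guaranteed to be in the past everywhere");
}
```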
> The Paxos, paxosdb, and related libraries (despite the name, all are multi-paxos) are solid and integrated directly into a number of products (Borg, Chubby, CFS, Spanner, etc.).
Maybe things have changed, but I thought the bottom turtle for pretty much any infrastructure system at Google was Chubby. I didn't realize Borg now directly does Paxos.
What I meant was that I thought any system that required some form of distributed locking or consensus, did so by building on top of Chubby (which does Paxos), not by implementing Paxos directly.
I hope that the "Paxos vs Raft" debate can die down, now that engineers are learning TLA+ and distributed systems more thoroughly. These days we can design new protocols and prove their correctness, instead of always relying on academics. For example, at MongoDB we considered adopting the reconfiguration protocol from Raft, but instead we designed our own and checked it with TLA+. See "Design and Verification of a Logless Dynamic Reconfiguration Protocol in MongoDB Replication" for details: https://arxiv.org/pdf/2102.11960.pdf
In practice, for the systems where I built a replication system from the ground up, once you factor in all the performance, scale, storage layer and networking implications, this Paxos vs. Raft thing is largely a theoretical discussion.
Basic Paxos is, well, too basic, and people mostly run modifications of it to get higher throughput and better latencies. After those modifications, it does not look very different from Raft with modifications applied for storage integration and so on.
> Basic Paxos is, well, too basic, and people mostly run modifications of it to get higher throughput and better latencies. After those modifications, it does not look very different from Raft.
Allan Vermeulen, one of the founding AWS engineers, calls inventing newer solutions to distributed consensus an exercise in rediscovering Paxos.
This is exactly my take as well. Multi-Paxos and Raft seem very similar to me. Calling out what the exact differences and tradeoffs are would be good blog/research fodder.
I think the differences become more stark and more valuable/surprising the closer you get to understanding the protocols. There are some major availability and performance tradeoffs involved in the choice between Multi-Paxos and Raft, as you go from paper to production. This can be the difference between your cluster remaining available, and the loss of an entire cluster merely because of a latent sector error.
For example, UW-Madison's paper "Protocol-Aware Recovery for Consensus-Based Storage" [1] won best paper at FAST '18 and describes simple scenarios where an entire LogCabin, Raft, Kafka or ZooKeeper cluster can become unavailable far too soon, or even suffer global cluster data loss.
Ok, maybe not for the win, but it's worth a look. I'm actually fairly certain one of the Paxos implementations I've worked with and used is really more of a VR bend to Paxos anyway.
I very recently learned about VR (VSR?) and am wondering if someone can provide information about the tradeoffs relative to Raft/Paxos (which are "basically the same").
I did a three minute "one-slide" lightning talk [1] on Viewstamped Replication at the PaPoC '21 workshop at EuroSys organized by Heidi Howard, and had a few seconds to spend on how Viewstamped Replication compares with Multi-Paxos (Paxos with leader election) and Raft (Paxos with leader election and some overly strong restrictions).
Raft is at the extreme end of the spectrum with lowest availability and lowest performance, compared to Viewstamped Replication and Multi-Paxos further along the spectrum.
The reason for this is that Raft suffers from the distributed equivalent of the classic RAID 5 problem, where if the leader fails, it requires the next leader to have a perfectly pristine log without even a single latent sector error, or else the whole Raft cluster can become unavailable.
Raft has no mechanism to repair the new leader's log and so cannot fully utilize redundancy to maximize cluster availability, whereas Viewstamped Replication Revisited describes how to leverage Merkle Trees for leader repair, and also features a recovery protocol in case a replica loses its whole state. The Raft paper and thesis have no storage fault model. At first glance, Raft appears easy to implement, yet Raft's correctness (let alone availability) also depends entirely on "perfect" storage across the whole cluster, which I find surprising given how even RAID systems have long moved on from RAID 5. Here again, Viewstamped Replication Revisited has no such dependence on durable storage for correctness and can be run entirely in-memory much more simply.
Raft also leans on randomized election timeouts to deal with its problem of split votes during leader election, and both of these (split votes and randomized timeouts) add unnecessary latency to the leader election phase.
Viewstamped Replication has no problem of split votes to begin with, so it can react much more quickly to leader failure: replicas in Viewstamped Replication are surprisingly "not selfish" and don't vote for themselves, but rather work together with shared insight into who the next leader should be (a simple modulo function, sketched below). I find this additional deterministic input to the protocol to be a massive algorithmic gain.
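The "who leads next" function really is that small. A sketch (the function and field names here are mine, not the paper's):

```rust
// In Viewstamped Replication, the primary for a view is a pure function of
// the view number, so every replica independently agrees on who it is.
// No votes, no randomized timeouts, no split-vote retries.
fn primary(view_number: u64, replica_count: u64) -> u64 {
    view_number % replica_count
}

fn main() {
    let replica_count = 3;
    for view in 0..5u64 {
        // When the current primary is suspected dead, replicas start a view
        // change for view + 1 and already know who they're handing off to.
        println!("view {view}: primary is replica {}", primary(view, replica_count));
    }
}
```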
Raft is popular, but as Heidi Howard said in another of her talks, it's worth surveying the literature before reaching for Raft.
Multi-Paxos is identical to Viewstamped Replication, which preceded it, except that it tolerates "gaps" in the log; in other words, operations can be committed out of order. However, with the hash chaining optimization on Viewstamped Replication that we created for TigerBeetle [2], we're able to achieve the same performance benefit of having followers ack ops instantly back to the leader (even if a follower doesn't yet have the complete log, i.e. we've moved log backfill repairs out of the critical path), but without all the added complexity Multi-Paxos has of resolving "gaps" that are never filled.
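For the curious, the hash chaining idea boils down to each op carrying the hash of its parent, so a follower can ack an op the instant it arrives and still detect that it's missing history behind it. A rough sketch of the idea only, not TigerBeetle's actual code, using the `sha2` crate purely for illustration:

```rust
use sha2::{Digest, Sha256};

struct Op {
    parent: [u8; 32], // hash of the previous op in the chain
    body: Vec<u8>,
}

fn hash_op(op: &Op) -> [u8; 32] {
    let mut h = Sha256::new();
    h.update(op.parent);
    h.update(&op.body);
    h.finalize().into()
}

/// A follower can ack `op` as soon as it arrives; the chain tells it whether
/// its local log actually connects to this op. If not, it asks the cluster
/// to backfill the gap outside the critical path of the ack.
fn connects(local_head: &[u8; 32], op: &Op) -> bool {
    op.parent == *local_head
}

fn main() {
    let genesis = [0u8; 32];
    let op1 = Op { parent: genesis, body: b"create account".to_vec() };
    let op2 = Op { parent: hash_op(&op1), body: b"transfer".to_vec() };
    assert!(connects(&hash_op(&op1), &op2));
    assert!(!connects(&genesis, &op2)); // gap detected -> repair in background
}
```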
The difference between Paxos and Multi-Paxos is something people often get wrong. Paxos does not do leader election but requires two RTTs to commit an op, with the advantage that any replica can lead for the op. Multi-Paxos does do leader election so the leader can order the op, with the advantage that it requires just one RTT to commit an op. The choice of Paxos ("active replication" = 2 RTT) or Multi-Paxos ("passive replication" = 1 RTT) depends on how close your users are to all your replicas, and how you lay out your cluster in terms of geography and availability regions.
Raft was published in 2013 and is very similar to the Viewstamped Replication Revisited paper published by Barbara Liskov and James Cowling a year earlier in 2012 (in fact, I find the latter more understandable than the former). Both Raft and Viewstamped Replication are in the Multi-Paxos family. Of course, Viewstamped Replication pioneered consensus a year before Paxos, in 1988, in Brian M. Oki's doctoral dissertation; it's just that "Paxos" and "Multi-Paxos" are how we classify these categories of algorithms today. Multi-Paxos is what most people need for state machine replication, and I find it remarkable that Viewstamped Replication had all of this, batteries included, from the beginning.
Ahaha, cool, thanks! The RAID 5 analogy was what was missing for me from the TigerBeetle talk. Perfect, since I also used to work for a storage company. I wonder if there's a materialized-log state machine method waiting in the wings that uses some form of erasure recovery to rebuild broken logs.
Paxos is a family of algorithms aimed at distributed consistency / monotonic state guarantees. However, Paxos allows servers with incomplete logs to be elected leader (provided they then bring their logs up to date), whereas Raft requires a server's log to be up-to-date before it can be elected leader. Moreover, Raft has a reputation for being more understandable than Paxos.
edit: It looks like the linked paper covers the main differences, albeit in a more detailed manner. Also, it seems the author rejects the idea that Raft is more understandable and makes a case for why he thinks Paxos is more understandable.
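To make that "log must be up-to-date" restriction concrete, here's roughly what Raft's voting check looks like, following the rule in §5.4.1 of the Raft paper (the struct and names are mine):

```rust
// Raft's voting restriction: a voter grants its vote only if the candidate's
// log is at least as up-to-date as its own, comparing the last entry's term
// first and then the log length.
struct LogInfo {
    last_term: u64,
    last_index: u64,
}

fn candidate_is_up_to_date(candidate: &LogInfo, voter: &LogInfo) -> bool {
    (candidate.last_term, candidate.last_index) >= (voter.last_term, voter.last_index)
}

fn main() {
    let voter = LogInfo { last_term: 3, last_index: 10 };
    // A higher term wins even with a shorter log:
    assert!(candidate_is_up_to_date(&LogInfo { last_term: 4, last_index: 7 }, &voter));
    // The same term with a shorter log loses:
    assert!(!candidate_is_up_to_date(&LogInfo { last_term: 3, last_index: 9 }, &voter));
}
```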
Personally I find Paxos more understandable. For example, KTH have a really nice incremental development of Multi-Paxos called Sequence Paxos: https://arxiv.org/abs/2008.13456
Problem: "Raft protocol described and analyzed in English has problems."
Solution: "Here is a modification to the protocol, described in English, that does not have such a problem."
Seems like there is a common failure mode of "describing distributed protocols in plain English and thinking this is a proof"?
The actual problem here is not "There was a problem in the Raft protocol, and I figured it out and provided a work around". The actual problem here is "Reasonably experienced software engineers reviewed the specification and didn't see any problems." This actual problem has not been addressed by the article.
> is that it's hard to tell if a particular C++ or Rust implementation conforms to the spec.
You can use either TLC's graph output or simulation mode to generate lots of TLA+ traces. Then you write an orchestrator that reads each trace as JSON and runs the same actions in each step against the program, then confirms they have the same final states. You can go the other way, too, where you take program traces and map them onto the TLA+ state space.
It's a bit tricky to set up, but it's possible. I've had several clients who've built something similar.
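For concreteness, the orchestrator can be as simple as "replay each step, compare abstract states." A sketch under a hypothetical trace format; the JSON shape and the `Impl` trait are invented, and `serde`/`serde_json` are assumed as dependencies:

```rust
use serde::Deserialize;

// Hypothetical shape of an exported trace step: the action taken plus the
// abstract state the TLA+ spec says the system should be in afterwards.
#[derive(Deserialize)]
struct TraceStep {
    action: String,
    expected_state: serde_json::Value,
}

// Stand-in for the real system under test.
trait Impl {
    fn apply(&mut self, action: &str);
    // Project the implementation's state down to the spec's state space.
    fn abstract_state(&self) -> serde_json::Value;
}

fn check_trace(trace_json: &str, system: &mut dyn Impl) -> Result<(), String> {
    let steps: Vec<TraceStep> =
        serde_json::from_str(trace_json).map_err(|e| e.to_string())?;
    for (i, step) in steps.iter().enumerate() {
        system.apply(&step.action);
        if system.abstract_state() != step.expected_state {
            return Err(format!("divergence from spec at step {i} ({})", step.action));
        }
    }
    Ok(())
}
```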
https://www.microsoft.com/en-us/research/blog/p-programming-... makes the case better than I can for why it's important to have the specification and implementation come from the same source, as opposed to maintaining two and then verifying via traces that they're compatible.
But definitely worth doing (like Jepsen does) to find problems in the implementation.
This is, incidentally, one of the kinds of use cases that informed part of our product functionality.
1. A way to write down fairly complicated system/behavior-level specs for distributed systems.
2. A way to instrument implementations to get the necessary signal out of the system to verify it nominally.
3. A way to automatically explore adverse conditions to validate that the system keeps behaving as needed and specifically isolate contributing factors for when it doesn't.
The very first prototype was leveraged (in part, as there was much more to the system under test) in a consensus mechanism that was central to the correct, stable, and most importantly... safe... operation of some critical software infrastructure. It was necessary to explore where the model assumptions broke down in practice in a real system.
> The issue is that rust is a very large language and it's hard to get it right.
In my experience, Rust's type system (especially the Send and Sync traits, which prevent data races) makes it easier, not harder, to get big things to work.
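A tiny example of what that buys you, using nothing but the standard library:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared mutable state has to be wrapped in something Send + Sync:
    let counter = Arc::new(Mutex::new(0));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || *counter.lock().unwrap() += 1)
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(*counter.lock().unwrap(), 4);

    // Swap Arc for Rc (which is not Send) and this stops compiling: the data
    // race is rejected by the type checker instead of found in production.
}
```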
My intention is to reuse the Rust type system where we can, and yet keep the language small enough to be able to implement and verify non-trivial real-world programs.
You can abstract IO and asynchrony away from the implementation and model it as a pure function, then build and inspect the state space. A detailed description was done here:
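Concretely, the shape that makes this work is a transition function with no IO in it: all inputs arrive as events, all outputs leave as values, so a harness (or a model checker driving the same function) can enumerate interleavings deterministically. A minimal sketch with invented event/output types:

```rust
// All IO is pushed to the edges: the core is a pure function from
// (state, event) to (state, outputs), which a harness can drive through
// arbitrary event orderings without any real network or clock.
#[derive(Clone, Debug, PartialEq)]
enum Event {
    Request { id: u64 },
    Timeout,
}

#[derive(Clone, Debug, PartialEq)]
enum Output {
    SendReply { id: u64 },
    StartElection,
}

#[derive(Clone, Debug, PartialEq, Default)]
struct State {
    handled: Vec<u64>,
}

fn step(mut state: State, event: Event) -> (State, Vec<Output>) {
    match event {
        Event::Request { id } => {
            state.handled.push(id);
            (state, vec![Output::SendReply { id }])
        }
        Event::Timeout => (state, vec![Output::StartElection]),
    }
}

fn main() {
    // A test harness can now explore different interleavings deterministically.
    let (s, out) = step(State::default(), Event::Request { id: 1 });
    let (s, _) = step(s, Event::Timeout);
    assert_eq!(s.handled, vec![1]);
    assert_eq!(out, vec![Output::SendReply { id: 1 }]);
}
```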