Instead of doing all of this complicated machinery, how about simply following a Raft-like consensus protocol with the minor modification that the leader won't include a write op in its read processing until that write op has been applied to the log of every replica, not just a quorum? When the heartbeat responses from the replicas indicate to the leader that the write op has been applied everywhere, it can advance its internal marker to include that write op in read operations.
This simple scheme allows all members, including replicas, to serve reads with read-after-write consistency, at the cost of penalizing the write op: that write op won't be acknowledged to the caller until it has been applied everywhere.
There are no fault-tolerance issues here, btw. If any replica fails, then as long as quorum was reached, the repair procedure will ensure that the write is eventually applied to all replicas. If quorum itself could not be reached, the write is lost anyway, which is no different from the typical case of reading only from the leader.
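To make the bookkeeping concrete, here's a rough sketch of what I mean, not code from any real Raft implementation; the names (match_index, read_watermark, on_heartbeat) are made up for illustration:

use std::collections::HashMap;

struct Leader {
    // Highest log index each replica is known to have applied (from heartbeat responses).
    match_index: HashMap<u64, u64>,
    // Reads are only served up to this index.
    read_watermark: u64,
}

impl Leader {
    // Called when a heartbeat response arrives from `replica`.
    fn on_heartbeat(&mut self, replica: u64, applied_index: u64) {
        self.match_index.insert(replica, applied_index);
        // Advance the watermark only to what *every* replica has applied,
        // not just a quorum.
        if let Some(&min_applied) = self.match_index.values().min() {
            self.read_watermark = self.read_watermark.max(min_applied);
        }
    }

    // A write is acknowledged to the caller (and becomes readable) only once
    // it falls under the watermark.
    fn is_ack_ready(&self, write_index: u64) -> bool {
        write_index <= self.read_watermark
    }
}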
I don't think this scheme provides the "monotonic reads" property discussed in the blog post. Specifically, it would be possible for a reader to observe a new value from r2 (who received a timely heartbeat), then to later observe an older value from r3 (who received a delayed heartbeat). This would be a violation of linearizability, which mandates that operations appear to take place atomically, regardless of which replica is consulted behind the scenes. This is important because linearizability is compositional, so users of CockroachDB and internal systems within CockroachDB can both use global tables as a building block without needing to design around subtle race conditions.
However, for the sake of discussion, this is an interesting point on the design spectrum! A scheme that provides read-your-writes but not monotonic reads is essentially what you would get if you took global tables as described in this blog post, but then never had read-only transactions commit-wait. It's a trade-off we considered early in the design of this work and one that we may consider exposing in the future for select use cases. Here's the relevant section of the original design proposal, if you're interested: https://github.com/cockroachdb/cockroach/blob/master/docs/RF....
Because the definitions and leveling are arbitrary and cultural within a given firm or context. Notice how "plan" is both at the bottom level and also at the highest level, becoming all-encompassing if you call it "long term".
That said, regardless of labels, this leveled model is reasonable and better than a model that isn't cascaded.
What does "stream.write_all(&number_as_bytes).unwrap();" do if the socket buffer is full? Does it block this virtual thread running this function? Or does the stream keep buffering? or is it sending the message to some other process which is accumulating those messages. What if I don't wait this thread to block and instead do something else?
I believe all of these are handled. I just cannot find sufficient documentation to understand the details of how this works.
Same as the synchronous version: It will block until more data can be written, and then go on and write as much as possible using another async .write() call. It's the same as:
let mut offset = 0;
while offset != number_as_bytes.len() {
    // write() may send only part of the buffer; keep going until it's all out.
    let written = stream.write(&number_as_bytes[offset..]).await.unwrap();
    offset += written;
}
The synchronous version would be the same without the .await, and it offers the stronger guarantee that either all bytes are written to the socket or the socket errored and is dead. The async version can be cancelled in the middle of the invocation, after some bytes have already been written.
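If it helps, here's a rough sketch of what that cancellation risk looks like, assuming tokio; the timeout wrapper and the send helper are just illustrative, since a future can be dropped mid-write for other reasons too:

use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;
use tokio::time::{timeout, Duration};

async fn send(stream: &mut TcpStream, number_as_bytes: &[u8]) -> std::io::Result<()> {
    match timeout(Duration::from_secs(1), stream.write_all(number_as_bytes)).await {
        // write_all ran to completion (or failed) normally.
        Ok(result) => result,
        Err(_elapsed) => {
            // The write_all future was dropped mid-write: an unknown prefix of
            // `number_as_bytes` may already be on the wire, so the stream's
            // framing state is unknown and it is best closed.
            Err(std::io::Error::new(
                std::io::ErrorKind::TimedOut,
                "write cancelled; partial write possible",
            ))
        }
    }
}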
It was disappointing to see no detail of the actual design of the notification system using DDB and Lambda. I was hoping to see the schema, information architecture, etc.; instead it was high-level stuff.
So as a staff eng you will have projects that you are leading but that have cross-team/org dependencies.
You have to get alignment with those teams and get projects on their roadmap/sprints so you can get unblocked.
But a lot of engineers think they can just ask cross-functional teams to fit the work in, and then get surprised when things don't happen. That's because the other team is busy, not idling.
As a staff engineer you have to figure out how you can contribute cross-team such that they agree on the problem, let you contribute to their code base, and help unblock you.
It's a lot of championing and relationship building that is expected at the staff level.
In practice, for the systems where I've built replication from the ground up, once you factor in all the performance, scale, storage-layer, and networking implications, the Paxos vs. Raft question is largely a theoretical discussion.
Basic Paxos is, well, too basic, and people mostly run modifications of it to get higher throughput and better latencies. After those modifications, it does not look very different from Raft with its own modifications applied for storage integration and so on.
> Basic Paxos, is well, too basic and people mostly run modifications of this to get higher throughput and better latencies. After those modifications, it does not look very different from Raft.
Alan Vermeulen, one of the founding AWS engineers, calls inventing newer solutions to distributed consensus an exercise in re-discovering Paxos.
This is exactly my take as well. Multi-Paxos and Raft seem very similar to me. Calling out exactly what the differences and tradeoffs are would be good blog/research fodder.
I think the differences become more stark and more valuable/surprising the closer you get to understanding the protocols. There are some major availability and performance tradeoffs involved in the choice between Multi-Paxos and Raft, as you go from paper to production. This can be the difference between your cluster remaining available, and the loss of an entire cluster merely because of a latent sector error.
For example, UW-Madison's paper "Protocol-Aware Recovery for Consensus-Based Storage" [1] won Best Paper at FAST '18 and describes simple scenarios where an entire LogCabin, Raft, Kafka, or ZooKeeper cluster can become unavailable far too soon, or even suffer global cluster data loss.