I find "distributed systems" to be a huge source of imposter syndrome. Despite having worked almost exclusively with distributed applications for several years now, it is difficult to consider myself experienced. When I'm asked if I've worked with distributed systems, I don't think they are asking me if I've managed a Hadoop cluster. They are interested in building new applications using some of the primitives discussed in this post. All of these links are great, but the fact is that building and operating tools like this is hard. In addition to consensus primitives, your system may need very precise error handling, structured logging, distributed tracing, resource monitoring, schema evolution, etc. In the end, I probably pause for a second too long when answering that question, but I don't think its because of a lack of experience, quite the opposite!
Do you know what the CAP theorem is, and can you explain it to me like I'm 5? Can you tell me how a SQL DB fits into it, and where something like DynamoDB fits into it?
Congratulations, you are better than 95% of the people that I've interviewed out there saying they are experienced building distributed systems. Including system/solution architects.
Sure, but the OP was talking about their difficulty during interviews; I'm just saying, they're almost assuredly better than the majority of the rest of the hiring pool. The company has an opening they need filled; they can only fill it with people from the hiring pool, not those who aren't in the hiring pool.
I'm realizing I've only worked in distributed systems as well, but I'd never feel comfortable telling potential employers I'm an expert. Being an expert in distributed systems seems almost too broad. At a high level couldn't it be expertise at integration, accessible logging, and configuration?
Distributed systems research has been going on since the '70s, and Unix Neckbeards have probably forgotten more about it than we have learned, so actually I think impostor syndrome is a bit warranted here.
The actual hard stuff is not even these papers, it's the implementations that are way more complex than some algorithm or architectural pattern. Anyone who says "X is better than Y" is fooling themselves because it's only the implementation context that matters.
The only thing you can say for certain is that reducing the amount of components and complexity in the system often results in better outcomes.
> The only thing you can say for certain is that reducing the amount of components and complexity in the system often results in better outcomes.
No, there are a few other things that you can say for certain:
Watch out for positive-only feedback loops; you absolutely need negative feedback as well - or only negative. E.g. exponential back-off (see the sketch after this list).
Sometimes, you just need a decentralized solution, rather than a distributed one, and you don't have to have the same answer at every scale (eg. distributed intra-datacenter, decentralized inter-datacenter, or vice-versa).
Loose coupling is your friend.
Sure, add an extra layer of indirection, but you probably need to pay more attention to cache invalidation than you think.
Throughput probably matters more than latency.
Reducing the size/number of writes will probably help more than trying to speed them up.
Multi-tenancy is a PITA for systems in general, and distributed ones are no exception (aside: there is probably a huge business for multi-tenancy-as-a-service, if anyone manages to solve it in a general-purpose way), but a series of per-customer single-tenant deployments may be worse, especially if they are all on different versions of the code. Here be dragons.
Don't overthink it. Start with a naive implementation and go from there (see loose coupling above).
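To make the back-off point concrete, here is a minimal Python sketch; the `call_with_backoff` name, the hypothetical `call_remote` callable, and the retry parameters are illustrative placeholders, not anything prescribed above. The key property is that each failed attempt waits longer, so a struggling dependency sees less traffic rather than more:

```python
import random
import time

def call_with_backoff(call_remote, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry a flaky remote call with capped exponential back-off plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return call_remote()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Sleep a random amount up to the capped exponential delay; the jitter
            # keeps many clients from retrying in lockstep and re-creating the spike.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```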
Maybe there are some other things you can say for certain. But as to some of your points:
> Watch out for positive-only feedback loops, you absolutely need negative feedback as well - or only. Eg. exponential back-off.
Agreed it may need negative feedback, but I'm not sure about always.
If your service has a latency SLA, exponential back-off might kill that SLA (depending on the wording and on where the back-off happens). The fix is to soft-reject requests (send an RST rather than silently dropping packets) when you can't meet the demand. That change may let you meet your SLA, if the SLA is written to prioritize low latency over unavailability.
This is its own negative feedback loop, but switch from sending RSTs to silently dropping and you no longer have the feedback.
> Sometimes, you just need a decentralized solution, rather than a distributed one
Agreed
> Loose coupling is your friend.
Until it isn't? :)
> add an extra layer of indirection, but you probably need to pay more attention to cache invalidation
Fixes for additional layers tend to increase system complexity compared to fixes for fewer layers.
> Throughput probably matters more than latency.
Until it doesn't :)
> Reducing the size/number of writes will probably help more than trying to speed them up.
Depending on 20 different things... You really have to account for all the system's limits (and business use cases) and find the solution that matches the implementation needs.
> there is probably a huge business for multi-tenancy-as-a-service
Sure, it's called EKS :-) Just build more clusters... Don't worry, we'll bill you...
> Don't overthink it
Yes and no. Yes, in that there will always be unknowns. But no, in that often improvements in communication will provide better solutions without extra work. Think smarter, not harder!
Yes. The area is fairly broad. In my opinion, the question to ask is "Do you have the distributed systems mindset?" rather than "Are you an expert in distributed systems?"
It is definitely a broad term, and I think that disciplined implementation of the things that you mentioned is the real key. It just isn't as exciting to talk about.
I don't trust anyone who comes back with quick answers without any hesitation or pause. I remember a lecture from a while ago, I can't remember who was giving it but they were someone well known, where they asked people to implement bubble sort. The results that came back ranged from something like 20 lines to 2000 lines, and every single one had bugs. I don't know how anyone who works in this industry for any length of time doesn't follow up every answer with, "...but there's probably some problem that I'm not thinking of at the moment".
I wonder if this source of imposter syndrome applies to any field where the job roles and success criteria are not clearly defined and vary from company to company (e.g. data analyst, product owner, distributed systems developer).
Great list. Only thing I’d add for the other enterprise developers out there is to first default to not building a distributed system at all, but rather build a much smaller monolith. In 25 years I’ve worked for so many orgs that wanted to build The World’s Most Scalable System for what would maybe be a few hundred concurrent users. Not surprisingly, those projects tend to tank.
And with a single monolith you can comfortably handle several thousand concurrent users, not just several hundred.
From my humble experience, I've found that it is also easier to migrate an established monolith to a semi-distributed system than it is to build and scale a distributed system from the ground up while trying to figure out all of its kinks before it, too, becomes an established system.
The trick is to find "fault lines" throughout your monolith - places where it's easy to create separation of concerns and narrow interfaces. Maintain and document these like they are external APIs. Only pass simple data through them (serialized, or anything easy to serialize).
When the time comes, cleave these chunks off and wrap them in RPC/REST/GraphQL.
It's good form even for monoliths since it makes for an easy interface to test.
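A sketch of what such a fault line can look like in practice (the module, function, and field names here are invented for illustration): the rest of the monolith calls one function, and only plain, JSON-serializable data crosses the boundary, so fronting it with REST/RPC later doesn't change any caller.

```python
# billing.py - an internal "fault line" inside the monolith.
# Callers import only this function and pass plain dicts across the boundary,
# never ORM objects or live handles, so it can later be replaced by a thin
# HTTP/RPC client with the same signature.

def create_invoice(order: dict) -> dict:
    """Turn an order (plain dict) into an invoice (plain dict)."""
    total = sum(item["unit_price"] * item["quantity"] for item in order["items"])
    return {
        "order_id": order["order_id"],
        "currency": order["currency"],
        "total": total,
    }

# Example caller elsewhere in the monolith:
# invoice = billing.create_invoice({"order_id": 42, "currency": "USD",
#                                   "items": [{"unit_price": 9.99, "quantity": 2}]})
```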
+1 to you sir. I was just having a discussion this past weekend with some technical friends and noted that a well-architected monolith often lends itself to easy deconstruction such that you get the best of both worlds when they're necessary in your product lifecycle.
Ah, I kinda agree. My former employer developed and sold an editorial suite architected in about 1999-2000 around the SOA model (along with CORBA) and later evolved through the years.
So basically some monoliths playing together.
I've always admired that design and how much it could do and handle with so few resources.
Most of the time, a single 48-core/128 GB machine can be more than enough for a whole editorial newsroom.
They were particularly smart in avoiding relational databases and sticking to an object database (the Versant OODBMS, another mostly unknown marvel of software engineering).
That software, even with its shortcomings, was really a great piece of software engineering.
People today don't realize how cheaply you can get a really scaled-up machine. In Azure, a 48-core, 192 GB machine is 2000 USD a month. That is nothing compared to scaling your app out across 10 servers and 20 services, with a lot of overhead, maintenance, and dev time.
If you can get away with a monolith, it's the way to go, as long as you leave room to quickly grow horizontally as well, if needed.
That is great advice. However, if scalability must be introduced later on, it can be really hard: by then many features have been added to the monolith, and refactoring it to become scalable can be a huge task.
The conditions where the monolith must be converted to a scalable system should be defined as early as possible.
Designing something to be “scalable” before it’s actually established is a recipe for premature optimization at best and a poorly baked SOA with service boundaries that make change incredibly difficult at worst.
It’s important to keep in mind that SOA is about scaling teams first and code second, and not really about throughput per se. A shared-nothing web tier plus a couple of judiciously applied databases and background job queues can effectively scale a huge proportion of applications without the overhead of a full SOA.
But our industry has all kinds of perverse incentives to do résumé-driven development. How does one get to put Paxos on their résumé if they built the correct, simple solution?
The problem with scaling is that visitor counts often grow exponentially. So while in the beginning everything seems calm and quiet, suddenly you can see a massive spike in interest. And if you didn't take the right measures, you might lose all that interest.
That's what startups are seeking, but it doesn't actually happen that often. Also, exponential doesn't necessarily mean "happens too fast to react". If the number of your users doubles every year, you still have exponential growth; if they are paid users, you have very nice business growth and you also have time to react.
It's indeed pretty rare to see services crumble under an exponential increase in traffic, especially if it's not a spike from publicity (I don't know how you would say "being slashdotted" these days). It happens, but it's rare. Because it's usually newsworthy, or at least interesting, it gets more attention and feels like much more of a danger than it actually is. To software developers and entrepreneurs (or even the general public) it feels lame: ha, they should have expected this! But the truth is that most of the time you should not prepare for it, because it's pretty rare and not necessary for success, not even for startup-scale success at the beginning.
It's pretty easy to see, actually: (almost) all of today's successful startup services started as monoliths (or, more realistically, some kind of SOA, because monolith vs. microservices is really a false dichotomy). It worked a decade ago on weaker hardware, so why wouldn't it work now?
There are times when you might be able to expect a huge spike, but even then, if the product is early and the overall design is still pretty fluid, I would try to stay flexible.
It's more important to be aware of overload, provide feedback to users (or potential users), and have some form of load shedding. And test all of that. For a lot of applications, being able to quickly plop together a second monolith could work to address spikes. Or switching to a bigger machine: Epyc 2-socket systems get you up to 128 cores (256 SMT threads) and, I think, 8 TB of RAM. You can do an awful lot with that.
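As a minimal sketch of the load-shedding point, assuming an in-process concurrency cap (the `AdmissionControl` name, the limit, and the 503 response are my own illustrative choices, not anything prescribed above): rejecting immediately when over capacity keeps latency bounded for the requests you do accept and gives callers an explicit signal to back off, which silently queueing or dropping does not.

```python
import threading

class AdmissionControl:
    """Toy load shedder: reject new work once too many requests are in flight."""
    def __init__(self, max_in_flight=100):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_acquire(self):
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return False  # shed load: the caller gets an immediate "try later"
            self.in_flight += 1
            return True

    def release(self):
        with self.lock:
            self.in_flight -= 1

def handle(request, shedder, do_work):
    # Fast rejection is the explicit feedback; pair it with client-side back-off.
    if not shedder.try_acquire():
        return {"status": 503, "body": "overloaded, retry with back-off"}
    try:
        return {"status": 200, "body": do_work(request)}
    finally:
        shedder.release()
```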
Yep, it is still a matter of abandoning the monolith and building a new scalable system in whatever fashion you choose at that time, microservices or whatever.
In today's cloud deployments where cost is not a mega concern (as compared to having physical servers in a DC), this approach is very much feasible.
You could run the two systems in parallel (monolith vs. scalable) and ditch the monolith once you have the necessary cutover logistics in place.
Of course, the sooner you do this the better, and it is non-trivial to figure out the 'when'.
That's great advice, as long as different systems in this monolith are loosely coupled both in terms of code and data, and you keep in mind that you will need to move them to separate services later on.
Dr. Kleppmann also provides notes for that specific lecture [0]; they contain all the slides with the corresponding detailed text, which is really awesome!
Good list, but is it still being actively updated? Not having Kleppmann’s seminal Designing Data-Intensive Applications (2017) on it would indicate no.
Alex Petrov’s Database Internals: A Deep Dive Into How Distributed Data Systems Work (2019) is another essential recent reference that should be here. Not as broad as Kleppmann but dives a lot deeper into certain topics.
HighScalability is also a great practical blog for this. I've learned a ton and implemented learnings based on some of their write-ups: http://highscalability.com/
I guess I'm pretty opinionated about this, but it was odd that the author talked about the necessity of changing the way you think without including anything about TLA+. IMO the "way you think" about distributed systems - if you want to be effective - will basically end up looking exactly like the way you think when writing a TLA+ spec, and learning TLA+ is a fast-track method of thinking like a distributed systems engineer. This is much, much more important in day-to-day work on distributed systems than knowing how a bunch of distributed systems algorithms work.
Is it though? The hard part about distributed systems is performance in our crappy real-world environment, with an unreliable, poorly performing, faulty public internet, unreliable hardware, OSes, etc. Which is directly at odds with needing TLA+, because if you do need it, it means the complexity of the algorithms is so great that you won't be able to keep them in your head and understand every aspect of their performance to make something work well. It's similar to how people think they can just pick a random consensus algorithm they've heard is correct and easy and make a decently working distributed storage system, which is silly of course; it's only good for educational purposes.
EDIT: (Ah, I see you are a TLA+ promoter, that's why you made a comment like that)
I am not a "TLA+ promoter" but think it is a very valuable tool for anyone building distributed systems. The value of TLA+ is that it forces you to carefully consider your algorithm, which is certainly important if the algorithm is complex but equally important if the algorithm is simple. Most people will struggle to correctly specify even a simple algorithm in TLA+ because they will miss a lot of things they had assumed without ever thinking about.
Real-world systems need to handle all the things you mention. TLA+ helps you consider all these issues by spelling them out individually. There is no point in building a complex system if you haven't taken the time to validate the correctness of the target system in the first place.
> Which is directly at odds with needing TLA+, because if you do need it, it means the complexity of the algorithms is so great, that you won't be able to keep them in your head and understand every aspect of their performance to make something work well.
I don't think this is correct.
What TLA+ allows us to do is be more creative in our design and choice of algorithms, while allowing the computer to help us reason about whether the choices we're making still result in a system that is correct. "Correct" in this context means two things: "safe" as in it doesn't lose or corrupt data, and "live" as in it eventually makes progress without deadlock or other blockers. That doesn't capture "meets the SLA" or "fast enough for real use" or even "tolerates gray failures". All of those are critical properties indeed - but unless you have fundamental safety and liveness you're never going to get those properties anyway. You might think you have them, but then you'll have a bad time eventually.
So TLA+ (and similar tools) aren't a complete solution to the problem, but they are an exceptionally useful one. Fundamentally, they're useful because distributed and concurrent protocols, even very simple ones like 2PC, are wickedly difficult to reason about clearly. Computers can help us reason, and specification languages can help us communicate clearly about our reasoning.
That's the thing: performance should dictate the algorithms, not the other way around, and TLA+ can't make that process any easier, only harder. I get that it's not the case at AWS, where the distributed services AWS thinks customers might want are what dictate the choices, but that is the exception, not the rule; unless someone wants to work there, they have no reason to do it this way, especially not for educational purposes while learning distributed systems.
> performance should dictate the algorithms, not the other way
You want correctness first, and performance second. But these two are very much intertwined. And knowing exactly where the boundary is will help you co-design them.
Some AWS engineers have said the following[1]:
"TLA+ [...] giving us enough understanding and confidence to make aggressive performance optimizations without sacrificing correctness."
I think the hard part about distributed systems is the combinatorial explosion of possible system states, which is also common to any concurrent program. Really, distributed systems are just concurrency on hard mode, where failures are basically guaranteed instead of being very rare.
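As a tiny illustration of that explosion (and of why tools that enumerate states for you are useful), here is a sketch that brute-forces every interleaving of two processes doing a non-atomic increment on a shared counter; even this toy system has schedules that lose a write. The process names and step encoding are made up for the example.

```python
from itertools import permutations

# Each process does a non-atomic increment: read the counter, then write read + 1.
STEPS = [("p1", "read"), ("p1", "write"), ("p2", "read"), ("p2", "write")]

def valid(order):
    # Within each process, its read must happen before its write.
    return all(order.index((p, "read")) < order.index((p, "write")) for p in ("p1", "p2"))

def run(order):
    counter, local = 0, {}
    for proc, op in order:
        if op == "read":
            local[proc] = counter
        else:
            counter = local[proc] + 1
    return counter

schedules = [order for order in permutations(STEPS) if valid(order)]
outcomes = {run(order) for order in schedules}
print(len(schedules), "interleavings, possible final counter values:", outcomes)
# -> 6 interleavings, possible final counter values: {1, 2}
#    (1 means one increment was lost; add a third process and the count balloons)
```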
I wouldn't particularly say I'm a TLA+ promoter (it's a FOSS project), any more than anyone who has a great fascination with a language/framework/algorithm/viewpoint is a promoter. We're all promoters of the memes that live inside our heads!
> Which is directly at odds with needing TLA+, because if you do need it, it means the complexity of the algorithms is so great, that you won't be able to keep them in your head and understand every aspect of their performance to make something work well.
Having used TLA+ for years, I would say it's the exact opposite. All bugs happen due to someone believing something is simple enough to work out in their head while it actually isn't. So if your judgment about what you can keep in your head is good, you never have any bugs and you really don't need TLA+. But if you do happen to have bugs occasionally, then your belief about how much you can keep in your head is sometimes wrong. TLA+ is a very quick way to write down what's in your head so you can think about it more rigorously. Surely, if you can truly keep it in your head, it should be easy for you to write it down precisely. And just in case you're wrong, there are tools that can check if you're right, just to be extra sure. In short, TLA+ helps if you ever have bugs. It doesn't help if you never do.
I'm surprised there is no mention of the Bitcoin white paper. One of the most practical consensus protocols out there. Sure there is a lot of emphasis on the currency itself but I found the consensus protocol explanation so simple and easy to understand, as opposed to the Paxos algorithm.
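For anyone curious what that core mechanism boils down to, here is a toy sketch of the proof-of-work loop; real Bitcoin uses double SHA-256 over a specific block-header layout and a vastly higher difficulty, so the header bytes and difficulty here are purely illustrative:

```python
import hashlib

def mine(block_header: bytes, difficulty_bits: int = 18):
    """Brute-force a nonce so the hash of (header || nonce) falls below a target.

    The chain whose blocks embody the most accumulated work is the one honest
    nodes extend - that is the consensus rule the paper describes.
    """
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(block_header + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
        nonce += 1

nonce, digest = mine(b"prev_hash|merkle_root|timestamp")
print("found nonce", nonce, "->", digest)
```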
I don't know if it belongs to the same category, but I find distributed learning very interesting and related. It has been a trending topic in the biomedical field: in areas where you can't ethically gather enough data, distributed systems are key.
How long would it take to realistically get through all these books for the average developer? I'm thinking 3-4 years? And this includes taking weekends outside of work to painfully work through every chapter (plus break weekends, since working and studying this in the evenings would probably lead to burnout within a year).
I would wager around a year at most. Most materials related to distributed systems are information-heavy (requiring you to read and remember) as opposed to math-heavy (with the possible exception of graph theory, which usually none of the fundamental texts expect the readers to know). Even many popular papers like Raft, Paxos, Chubby, BitTorrent, and GFS do not require much mathematical knowledge beyond basic arithmetic. You can read them like non-fiction and understand most of it.
Not saying that Distributed Systems is easy, doing research in Distributed Systems is very challenging, but I don’t think reading and understanding these materials should be that difficult for an average software engineer.
Definitely - maybe not directly, but it lies at the heart of software engineering (or, more appropriately, computer science). It helps with thinking differently. For example, I don't know if undergrad courses teach distributed search algorithms using eigenvectors and eigenvalues, but it helps to understand invariants and transformations better.
IMO a better approach would be to read Designing Data Intensive Applications (mentioned a million times on HN) which is more like a high-level map of the field. The references in DDIA are also a goldmine of information.
You don't have to read DDIA front to back. Just picking a topic (for instance "Distributed Transactions") is enough to get you started building an intuition about these issues.
Most of the links in the comments are useful, but after a point it's hard to find small projects to implement in order to practice these things. I had to do 2PC for a class.
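For anyone looking for a similarly sized exercise, here is a toy, in-memory sketch of two-phase commit; everything interesting about the real protocol (timeouts, crashes, persistent logs, recovery) is deliberately left out, and the class and participant names are invented for the example:

```python
class Participant:
    """In-memory stand-in for a resource manager in a classroom-style 2PC exercise."""
    def __init__(self, name, will_vote_yes=True):
        self.name, self.will_vote_yes, self.state = name, will_vote_yes, "init"

    def prepare(self):
        # Phase 1: vote yes (promising to be able to commit) or vote no.
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: collect votes. Phase 2: commit only if every participant voted yes.
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

ps = [Participant("db"), Participant("queue", will_vote_yes=False)]
print(two_phase_commit(ps), [p.state for p in ps])  # -> aborted ['aborted', 'aborted']
```

Adding a coordinator crash between the two phases (and a recovery log to survive it) is where it starts to resemble the real thing, and where the class assignment gets interesting.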