Hacker News

This often devolves into extremely fragile systems instead. For instance, let's say you failed to load an image on your web site. Would you rather the web site still work with the image broken or just completely fail? What if that image is a tracking pixel? What if you failed to load some experimental module?

Being able to still do something useful in the face of something not going according to plan is essential to being reliable enough to trust.



Systems need to be robust against uncontrollable failures, like a cosmic ray destroying an image as it travels over the internet, because we can never prevent those.

But systems should quickly and reliably surface bugs, which are controllable failures.
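That distinction can be sketched in code. This is only an illustration of the idea, not anything from the thread: `parse_header` and `fetch_with_retry` are made-up names, and the policy shown (fail fast on bugs, retry then degrade on environmental errors) is the point.

```python
class BugError(Exception):
    """A controllable failure: a programmer error that should surface loudly."""

def parse_header(data):
    # Controllable failure: calling this with the wrong type is a bug in
    # our own code, so fail fast rather than coerce and limp along.
    if not isinstance(data, bytes):
        raise BugError(f"parse_header expects bytes, got {type(data).__name__}")
    return data[:4].hex()

def fetch_with_retry(fetch, attempts=3):
    # Uncontrollable failure: the network can always flake, so retry a few
    # times and degrade gracefully (None) instead of crashing the system.
    for _ in range(attempts):
        try:
            return fetch()
        except IOError:
            pass
    return None
```

The asymmetry is deliberate: the bug path has no retry loop at all, because retrying a logic error just hides it.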

A layer of suffering on top of that simple story is that it's not always clear what is and what is not a controllable failure. Is a logic error in a dependency of some infrastructure tooling somewhere in your stack controllable or not? Somebody somewhere could have avoided making that mistake, but it's not clear that you could.

An additional layer of suffering is that we have a habit of allowing this complexity to creep or flood into our work and telling ourselves that it's inevitable. The author writes:

> Once your system is spread across multiple nodes, we face the possibility of one node failing but not another, or the network itself dropping, reordering, and delaying messages between nodes. The vast majority of complexity in distributed systems arises from this simple possibility.

But somehow, the conclusion isn't "so we shouldn't spread the system across multiple nodes". Yo Martin, can we get the First Law of Distributed Object Design a bit louder for the people at the back?

https://www.drdobbs.com/errant-architectures/184414966

And let us never forget to ask ourselves this question:

https://www.whoownsmyavailability.com/


> systems should quickly and reliably surface bugs, which are controllable failures

I was thinking: if the error exists between keyboard and chair, I want the strictest failure mode, both to catch it and to force me to do things right the first time.

But once the thing is up and running, I want it to be as resilient as possible. Resource corrupted? Try again. Still can't load it? At this point, in "release mode" we want a graceful fallback -- also to prevent eventual bit rot. But during development it should be a red flag of the highest order.


Are robustness and loose engineering the same, or overlapping, quality measures?

If so, it makes sense not to be strict; if not, it's you (and all of us) rolling two different modes of failure up into a single classification.


That's an interesting distinction. I think each resource should be self contained. Malformed HTML? HTML error. Malformed or missing image? Browser displays an image error.

The key here is that the web wasn't designed for engineers but for amateurs to slap something together sloppily in the first place.

As an aside, it's curious how ridiculously forgiving HTML and JS are while CSS craps itself on a single missing semicolon. As though it were okay for the thing to be semantically and functionally malformed and malfunctioning... as long as it looks good!


> This often devolves into extremely fragile systems instead.

as if the systems we have today aren't fragile? instead, they're fragile but their fragility is hidden and obfuscated.

being robust and reliable is different than just letting systems do whatever they think is best.



