> that people seem to be treating this as if it's some sort of profound insight that you only get if you work at a very senior level in engineering for major US cloud providers, when it's in fact blindingly obvious!

I don't mean to imply it's a profound insight, and the discussions I had in AWS were never in those terms. It's just that when you're designing and building things that are going to operate at that scale, you have to very seriously consider the improbable.
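To put rough numbers on that, here's a back-of-the-envelope sketch. The probability and request rates are made up purely for illustration (they are not AWS figures), and `expected_events` is just a hypothetical helper name:

```python
SECONDS_PER_DAY = 86_400


def expected_events(per_op_probability: float, ops_per_second: float, seconds: float) -> float:
    """Expected number of times a rare per-operation failure occurs in a time window."""
    return per_op_probability * ops_per_second * seconds


# Assumed, illustrative numbers: a one-in-a-billion failure per operation.
P_FAILURE = 1e-9

# A small service handling 100 operations per second.
small = expected_events(P_FAILURE, 100, SECONDS_PER_DAY)

# A very large service handling 10 million operations per second.
large = expected_events(P_FAILURE, 10_000_000, SECONDS_PER_DAY)

print(f"small service: {small:.5f} expected events per day")
print(f"large service: {large:.0f} expected events per day")
```

With those assumed numbers, the small service expects the failure roughly once every few months, while the large one expects it hundreds of times a day. Same probability, very different engineering problem.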

What's more difficult is actually knowing what needs to be considered. For example, prior to working at AWS, I don't think I'd have even considered "NIC corrupts a packet in such a way that it reaches the OS mangled" as something worth handling. Yet S3 and services of similar scale see that, and other improbable events, so regularly that they have to consciously design for them, everywhere.
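As a minimal sketch of what "designing for it" can look like at the application layer: carry your own checksum with the payload and verify it on receipt, rather than trusting that the lower layers delivered the bytes intact. This is only an illustration of the general end-to-end check idea, not a description of how S3 does it; `seal` and `unseal` are hypothetical names, and SHA-256 is simply an assumed choice of digest.

```python
import hashlib

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest before sending or storing it."""
    return hashlib.sha256(payload).digest() + payload


def unseal(message: bytes) -> bytes:
    """Verify the digest prefix and return the payload, or raise if it was mangled."""
    digest, payload = message[:DIGEST_SIZE], message[DIGEST_SIZE:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("payload failed integrity check")
    return payload


# Round trip with intact bytes.
message = seal(b"object bytes")
assert unseal(message) == b"object bytes"

# Simulate a single flipped bit somewhere between sender and receiver.
corrupted = bytearray(message)
corrupted[-1] ^= 0x01
try:
    unseal(bytes(corrupted))
    detected = False
except ValueError:
    detected = True
```

The point is less the specific hash and more that the check lives at the ends, so corruption introduced anywhere in between gets caught instead of silently stored.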

It's also one reason why larger services end up being incredibly conservative about the technology they use. With established technology, you know what the failure modes are, however improbable, and can account for them. New technology tends to be kept on the fringes, and is only adopted in more significant places once it has been proven and its improbable failures are understood.