Spot on, and thank you. My second team was ~40 (might have peaked at ~50) split ...

jjirsa · on Oct 14, 2022

I'll see if we can get permission to discuss this publicly.

sdevoid · on Oct 10, 2022

Exactly.

- There's nothing scarier than "the automation had no rate-limiting or health-checking". Of course, what do we mean by automation? At some point it becomes impractical to slow every change down to a crawl, so some judgement is required. But "slow enough to mount a human response to stop it" is the standard I've applied to critical systems.

- Thankfully I've avoided having to support "real" storage systems. The challenges of "simple" distributed databases storing is enough for me. :-)

On the "pets vs. cattle" metaphor, I think most people fail to grok the second component of that. I don't think there are many successful cattle ranchers that will just "take the cattle out of line and shoot it in the head." The point of the metaphor is: When you have thousands of head of cattle, you need to think about the problems of feeding and caring for cattle operationally, not as a one-off.

Despite what https://xkcd.com/1737 might make one believe, people don't just throw out servers when one part goes bad, or (intentionally) burn down datacenters. What the "hyperscalers" do is decouple the operational challenges of running machines from the operational challenges of running services (or at least try to). Of course this results in a TON of work on both ends.