
I'm going to guess somewhere between 50 and 400 people. Smaller than 50: this is probably an org where "we maintain the source code and documentation. Running, debugging, and scaling your cluster is YOUR problem". Larger than 400: the org probably fulfills other functions besides "running thousands of production Cassandra clusters."

This is a lot, and it's one of the big things that surprised me when I first joined a large organization. Here's an example breakdown, in no particular order:

- 10 to 30 people for managing, contributing to, and maintaining the open source aspect. Just a handful of engineers could contribute features and fixes to the project, but once the open source project and community get larger it becomes a full-time job. System component maintainers, foundation boards, committees, conferences, etc. add up.

- 10 to 70 people for "operations". As you mentioned, the load here tends to scale with the number of clusters (customers). At the large end this is several teams, with, say, a team dedicated to (a) fleet management of the individual machines in a cluster, (b) cluster lifecycle management, and (c) macro-level operations above the cluster level. Alerts can't all go to this team, so some of the work is writing alerts that go directly to customers in a self-serve model.

- 10 to 40 people for "scale projects." At this scale you have 1 to 10 customers that are on the verge of toppling the system over. They've grown and hit various system bottlenecks that need to be addressed. And you'd be lucky if they're all hitting the same bottleneck. With this many customers, it's likely that they've all adopted orthogonal anti-patterns that you need to find a fix for: too many rows, too many columns, poorly optimized queries, too many schema changes, too many cluster creates and destroys, etc. So you probably have multiple projects ongoing for these.

- 10 to 30 people for "testing infrastructure". Everyone writes unit tests, but once you get to integration and scale testing, you need a team that writes and maintains fixtures to spin up a small cluster for some tests and a large cluster for scale tests (which your "scale projects" teams need, btw) - there's a rough sketch of that kind of fixture after this list. And your customers probably need ways of getting access to small test Cassandra clusters (or mocks of the same) for THEIR integration and scale tests, since Cassandra is just a small part of their system.

- 10 to 30 people for automating resource scaling and handling cost attribution. These may not be one function, but I'm lumping them together. "Operations" might handle some of the resource scale problems, but at some point it's probably worth a team to continually look for ways to manage the multi-dimensional scaling problem that large cluster software systems inevitably create. (Is it better to have a few large nodes, or many small nodes?) You need some way of attributing cost back to customer organizations - otherwise you're paying $50M because one engineer on the weather team forgot to tear down a test cluster in one automated test 6 months ago and... (there's a toy cost-attribution sketch after this list, too). You also need to make sure that growth projections for customers are defined and tracked so you have enough machines on hand.

- I'll also add that it's worth dedicating whole teams to some of the more complex internal bits of this system, even if the actual rate of change in those sub-systems is not very high. At this scale organizations need to optimize for stability, not efficiency. You don't want to be in the situation where the only person who understands the FizzBuzz system leaves and now dozens of people/projects are blocked because nobody understands how to safely integrate changes into FizzBuzz.

- Things not covered: security, auditing, internal documentation, machine provisioning, datacenter operations, operating system maintenance, firmware maintenance, new hardware qualification, etc. Maybe there's an entire organization dedicated to each of these, in which case you get it for free. If not, some of your time needs to be spent on these. (Even "free" might have a cost as you need to integrate with those services and update when those services change.)
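
To make the "testing infrastructure" bullet concrete: below is a minimal sketch of the kind of fixture such a team ends up owning, assuming Docker and the Python cassandra-driver are available. The container name, image tag, and timeouts are just illustrative; real fixtures also have to deal with port conflicts, parallel test runs, version matrices, and multi-node topologies.

    # test_fixture_sketch.py - throwaway single-node Cassandra for integration tests
    import subprocess
    import time

    import pytest
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    @pytest.fixture(scope="session")
    def cassandra_session():
        name = "cassandra-itest"  # hypothetical container name
        # Start a disposable single-node cluster; scale tests would stand up a
        # multi-node topology instead of one container.
        subprocess.run(
            ["docker", "run", "-d", "--rm", "--name", name,
             "-p", "9042:9042", "cassandra:4.1"],
            check=True,
        )
        try:
            yield _wait_for_cql("127.0.0.1", timeout_s=180)
        finally:
            subprocess.run(["docker", "rm", "-f", name], check=False)

    def _wait_for_cql(host, timeout_s):
        # Cassandra takes a while to accept CQL connections after the container starts.
        deadline = time.time() + timeout_s
        while True:
            try:
                return Cluster([host]).connect()
            except Exception:
                if time.time() > deadline:
                    raise
                time.sleep(5)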
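
And for the cost-attribution bullet, the core of it can be as simple as rolling node-hours up by owning team and flagging anything unowned or idle. A toy sketch - the data model, rate, and thresholds are invented, and in practice the inputs come from your fleet inventory and billing systems:

    # cost_attribution_sketch.py - roll node-hours up to owning teams
    from collections import defaultdict
    from dataclasses import dataclass
    from typing import List, Optional

    NODE_HOUR_COST = 0.50  # hypothetical blended $/node-hour

    @dataclass
    class ClusterUsage:
        cluster: str
        owner: Optional[str]   # owning team, if anyone ever claimed it
        node_hours: float
        reads_per_sec: float

    def attribute_costs(usage: List[ClusterUsage]):
        per_team = defaultdict(float)
        unowned, idle = [], []
        for u in usage:
            cost = u.node_hours * NODE_HOUR_COST
            if u.owner is None:
                unowned.append((u.cluster, cost))  # nobody to bill -> chase it down
            else:
                per_team[u.owner] += cost
            if u.reads_per_sec < 1.0 and u.node_hours > 24 * 30:
                idle.append(u.cluster)             # the forgotten test cluster problem
        return per_team, unowned, idle

The arithmetic is trivial; the work this team actually owns is keeping the ownership metadata accurate and making sure somebody acts on the report.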




Spot on, and thank you. My second team was ~40 (might have peaked at ~50) split across four sub-teams, for software that ran at a similar scale and was designed and developed to rely heavily on other in-house infra. There were maybe half a dozen people on adjacent teams (including customers) who had more than trivial knowledge of our system. Some on our team were almost pure developers, some were almost pure operators, and most were at various points in between.

I think the reason you and I (we know each other on Twitter BTW) are so at odds with some of the other commenters is that they haven't maxed out on automation yet and don't realize that's A Thing. Automation is absolutely fantastic and essential for running anything at this scale, but it's no panacea. While it usually helps you get more work done faster, sometimes it causes damage faster. Some of our most memorable incidents involved automation run amok, like suddenly taking down 1000 machines in a cluster for trivial reasons or even false alarms while we were already fighting potential-data-loss or load-storm problems. That, in turn, was largely the result of the teams responsible for that infra only thinking about ephemeral web workers and caches, hardly even trying to understand the concerns of permanent data storage. But I digress.

The point, still, is that when you've maxed out on what automation can do for you, your remaining workload tends to scale with the number of clusters. And having thousands instead of dozens of clusters sounds like a nightmare. There are many ways to scale such systems, and increasing the size of individual clusters sure as hell ain't easy - I joined that team because it seemed like a hard and fun challenge - but ultimately it pays off by avoiding the operational challenge of Too Many Clusters.


I'll see if we can get permission to discuss this publicly.


Exactly.

- There's nothing scarier than "the automation had no rate-limiting or health-checking". Of course, what do we mean by automation? At some point it becomes impractical to slow every change down to a crawl, so some judgement is required. But "slow enough to mount a human response to stop it" is the standard I've applied to critical systems (there's a rough sketch of what I mean after these bullets).

- Thankfully I've avoided having to support "real" storage systems. The challenges of "simple" distributed databases are enough for me. :-)
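
To put a shape on "slow enough to mount a human response", something like the sketch below is what I have in mind. Everything here is hypothetical - the batch size, the pause, and the apply_change / health / pause hooks all stand in for whatever your own fleet tooling actually provides:

    # rollout_sketch.py - batched rollout with health checks and a human-scale pause
    import time

    BATCH_SIZE = 5                    # touch only a few nodes at a time
    PAUSE_BETWEEN_BATCHES = 15 * 60   # seconds: long enough for a human to hit stop
    MIN_HEALTHY_FRACTION = 0.99

    def rollout(nodes, apply_change, fleet_healthy_fraction, is_paused):
        # apply_change(node), fleet_healthy_fraction(), and is_paused() are
        # placeholder hooks into your own automation, not a real API.
        for i in range(0, len(nodes), BATCH_SIZE):
            if is_paused():
                raise RuntimeError("rollout stopped by operator")
            if fleet_healthy_fraction() < MIN_HEALTHY_FRACTION:
                raise RuntimeError("fleet unhealthy; refusing to make it worse")
            for node in nodes[i:i + BATCH_SIZE]:
                apply_change(node)
            # Deliberately slow: a human can notice and intervene before the
            # automation has touched a meaningful share of the fleet.
            time.sleep(PAUSE_BETWEEN_BATCHES)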

On the "pets vs. cattle" metaphor, I think most people fail to grok the second component of that. I don't think there are many successful cattle ranchers that will just "take the cattle out of line and shoot it in the head." The point of the metaphor is: When you have thousands of head of cattle, you need to think about the problems of feeding and caring for cattle operationally, not as a one-off.

Despite what https://xkcd.com/1737 might make one believe, people don't just throw out servers when one part goes bad, or (intentionally) burn down datacenters. What the "hyperscalers" do is decouple the operational challenges of running machines from the operational challenges of running services (or at least try to). Of course this results in a TON of work on both ends.


Just wanted to say thanks for understanding how hard this is.

It's a fun sub-thread to read.

As I mentioned elsewhere, I'll see if I can get permission to talk publicly about the actual numbers.



