
I'm going to guess somewhere between 50 and 400 people. Smaller than 50: this is probably an org where "we maintain the source code and documentation. Running, debugging, and scaling your cluster is YOUR problem". Larger than 400: the org probably fulfills other functions besides "running thousands of production Cassandra clusters."

This is a lot, and it's one of the big things that surprised me when I first joined a large organization. Here's an example breakdown, in no particular order:

- 10 to 30 people for managing, contributing to, and maintaining the open source aspect. Just a handful of engineers could contribute features and fixes to the project, but once the open source project and community get larger it becomes a full-time job. System component maintainers, foundation boards, committees, conferences, etc. add up.

- 10 to 70 people for "operations". As you mentioned, the load here tends to scale with the number of clusters (customers). At the large end this is several teams, with, say, a team dedicated to (a) fleet management of the individual machines in a cluster, (b) cluster lifecycle management, and (c) macro-level operations above the cluster level. Alerts can't all go to this team, so some of the work is writing alerts that go directly to customers in a self-serve model.

- 10 to 40 people for "scale projects." At this scale you have 1 to 10 customers that are on the verge of toppling the system over. They've grown and hit various system bottlenecks that need to be addressed. And you'd be lucky if they're all hitting the same bottleneck. With this many customers, it's likely that they've all adopted orthogonal anti-patterns that you need to find a fix for: too many rows, too many columns, poorly optimized queries, too many schema changes, too many cluster creates and destroys, etc. So you probably have multiple projects ongoing for these.

- 10 to 30 people for "testing infrastructure". Everyone writes unit tests, but once you get to integration and scale testing, you need a team that writes and maintains fixtures to spin up a small cluster for some tests and a large cluster for scale tests (which your "scale projects" teams need, btw) - there's a rough sketch of that kind of fixture after this list. And your customers probably need ways of getting access to small test Cassandra clusters (or mocks of the same) for THEIR integration and scale tests, since Cassandra is just a small part of their system.

- 10 to 30 people for automating resource scaling and handling cost attribution. These may not be one function, but I'm lumping them together. "Operations" might handle some of the resource scale problems, but at some point it's probably worth a team to continually look for ways to manage the multi-dimensional scaling problem that large cluster software systems inevitably create. (Is it better to have a few large nodes, or many small nodes?) You need some way of attributing cost back to customer organizations - otherwise you're paying $50M because one engineer on the weather team forgot to tear down a test cluster in one automated test 6 months ago and... (there's a toy cost-attribution sketch after this list, too). You also need to make sure that growth projections for customers are defined and tracked so you have enough machines on hand.

- I'll also add that it's worth dedicating whole teams to some of the more complex internal bits of this system, even if the actual rate of change in those sub-systems is not very high. At this scale organizations need to optimize for stability, not efficiency. You don't want to be in the situation where the only person who understands the FizzBuzz system leaves and now dozens of people/projects are blocked because nobody understands how to safely integrate changes into FizzBuzz.

- Things not covered: security, auditing, internal documentation, machine provisioning, datacenter operations, operating system maintenance, firmware maintenance, new hardware qualification, etc. Maybe there's an entire organization dedicated to each of these, in which case you get it for free. If not, some of your time needs to be spent on these. (Even "free" might have a cost as you need to integrate with those services and update when those services change.)
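
To make the "testing infrastructure" bullet concrete: below is a minimal sketch of the kind of fixture such a team ends up owning, assuming Docker and the Python cassandra-driver are available. The container name, image tag, and timeouts are just illustrative; real fixtures also have to deal with port conflicts, parallel test runs, version matrices, and multi-node topologies.

    # test_fixture_sketch.py - throwaway single-node Cassandra for integration tests
    import subprocess
    import time

    import pytest
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    @pytest.fixture(scope="session")
    def cassandra_session():
        name = "cassandra-itest"  # hypothetical container name
        # Start a disposable single-node cluster; scale tests would stand up a
        # multi-node topology instead of one container.
        subprocess.run(
            ["docker", "run", "-d", "--rm", "--name", name,
             "-p", "9042:9042", "cassandra:4.1"],
            check=True,
        )
        try:
            yield _wait_for_cql("127.0.0.1", timeout_s=180)
        finally:
            subprocess.run(["docker", "rm", "-f", name], check=False)

    def _wait_for_cql(host, timeout_s):
        # Cassandra takes a while to accept CQL connections after the container starts.
        deadline = time.time() + timeout_s
        while True:
            try:
                return Cluster([host]).connect()
            except Exception:
                if time.time() > deadline:
                    raise
                time.sleep(5)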
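
And for the cost-attribution bullet, the core of it can be as simple as rolling node-hours up by owning team and flagging anything unowned or idle. A toy sketch - the data model, rate, and thresholds are invented, and in practice the inputs come from your fleet inventory and billing systems:

    # cost_attribution_sketch.py - roll node-hours up to owning teams
    from collections import defaultdict
    from dataclasses import dataclass
    from typing import List, Optional

    NODE_HOUR_COST = 0.50  # hypothetical blended $/node-hour

    @dataclass
    class ClusterUsage:
        cluster: str
        owner: Optional[str]   # owning team, if anyone ever claimed it
        node_hours: float
        reads_per_sec: float

    def attribute_costs(usage: List[ClusterUsage]):
        per_team = defaultdict(float)
        unowned, idle = [], []
        for u in usage:
            cost = u.node_hours * NODE_HOUR_COST
            if u.owner is None:
                unowned.append((u.cluster, cost))  # nobody to bill -> chase it down
            else:
                per_team[u.owner] += cost
            if u.reads_per_sec < 1.0 and u.node_hours > 24 * 30:
                idle.append(u.cluster)             # the forgotten test cluster problem
        return per_team, unowned, idle

The arithmetic is trivial; the work this team actually owns is keeping the ownership metadata accurate and making sure somebody acts on the report.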




Spot on, and thank you. My second team was ~40 (might have peaked at ~50) split across four sub-teams, for software that ran at a similar scale and was designed and developed to rely heavily on other in-house infra. There were maybe half a dozen people on adjacent teams (including customers) who had more than trivial knowledge of our system. Some on our team were almost pure developers, some were almost pure operators, and most were at various points in between.

I think the reason you and I (we know each other on Twitter BTW) are so at odds with some of the other commenters is that they haven't maxed out on automation yet and don't realize that's A Thing. Automation is absolutely fantastic and essential for running anything at this scale, but it's no panacea. While it usually helps you get more work done faster, sometimes it causes damage faster. Some of our most memorable incidents involved automation run amok, like suddenly taking down 1000 machines in a cluster for trivial reasons or even false alarms while we were already fighting potential-data-loss or load-storm problems. That, in turn, was largely the result of the teams responsible for that infra only thinking about ephemeral web workers and caches, hardly even trying to understand the concerns of permanent data storage. But I digress.

The point, still, is that when you've maxed out on what automation can do for you, your remaining workload tends to scale with the number of clusters. And having thousands instead of dozens of clusters sounds like a nightmare. There are many ways to scale such systems, and increasing the size of individual clusters sure as hell ain't easy - I joined that team because it seemed like a hard and fun challenge - but ultimately it pays off by avoiding the operational challenge of Too Many Clusters.


I'll see if we can get permission to discuss this publicly.


Exactly.

- There's nothing scarier than "the automation had no rate-limiting or health-checking". Of course, what do we mean by automation? At some point it becomes impractical to slow every change down to a crawl, so some judgement is required. But "slow enough to mount a human response to stop it" is the standard I've applied to critical systems (there's a rough sketch of what I mean after these bullets).

- Thankfully I've avoided having to support "real" storage systems. The challenges of "simple" distributed databases are enough for me. :-)
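
To put a shape on "slow enough to mount a human response", something like the sketch below is what I have in mind. Everything here is hypothetical - the batch size, the pause, and the apply_change / health / pause hooks all stand in for whatever your own fleet tooling actually provides:

    # rollout_sketch.py - batched rollout with health checks and a human-scale pause
    import time

    BATCH_SIZE = 5                    # touch only a few nodes at a time
    PAUSE_BETWEEN_BATCHES = 15 * 60   # seconds: long enough for a human to hit stop
    MIN_HEALTHY_FRACTION = 0.99

    def rollout(nodes, apply_change, fleet_healthy_fraction, is_paused):
        # apply_change(node), fleet_healthy_fraction(), and is_paused() are
        # placeholder hooks into your own automation, not a real API.
        for i in range(0, len(nodes), BATCH_SIZE):
            if is_paused():
                raise RuntimeError("rollout stopped by operator")
            if fleet_healthy_fraction() < MIN_HEALTHY_FRACTION:
                raise RuntimeError("fleet unhealthy; refusing to make it worse")
            for node in nodes[i:i + BATCH_SIZE]:
                apply_change(node)
            # Deliberately slow: a human can notice and intervene before the
            # automation has touched a meaningful share of the fleet.
            time.sleep(PAUSE_BETWEEN_BATCHES)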

On the "pets vs. cattle" metaphor, I think most people fail to grok the second component of that. I don't think there are many successful cattle ranchers that will just "take the cattle out of line and shoot it in the head." The point of the metaphor is: When you have thousands of head of cattle, you need to think about the problems of feeding and caring for cattle operationally, not as a one-off.

Despite what https://xkcd.com/1737 might make one believe, people don't just throw out servers when one part goes bad, or (intentionally) burn down datacenters. What the "hyperscalers" do is decouple the operational challenges of running machines from the operational challenges of running services (or at least try to). Of course this results in a TON of work on both ends.


Just wanted to say thanks for understanding how hard this is.

It's a fun sub-thread to read.

As I mentioned elsewhere, I'll see if I can get permission to talk publicly about the actual numbers.



