Hacker News

> why they decided to make services depend on just one data center

In my experience, no engineers really decided to make services depend on just one data center. It happened because the dependency was overlooked. Or it happened because the dependency was thought to be a "soft dependency" with graceful degradation in case of unavailability, but the graceful-degradation path had a bug. Or it happened because the engineers thought the dependency could fail over to any of multiple data centers, but the failover process had a bug.
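The "soft dependency with a buggy degradation path" case can be sketched in a few lines. This is a hypothetical illustration (the names `fetch_from_dc`, `LocalCache`, and `get_setting` are made up, not from Cloudflare or GCP): the fallback works in every test where the cache happens to be warm, and the cold-cache path only runs for the first time during a real outage.

```python
# Hypothetical sketch of a "soft dependency" whose graceful-degradation
# path has a latent bug. All names are illustrative.

class DCUnavailable(Exception):
    pass

class LocalCache:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Latent bug: raises KeyError on a cold cache instead of
        # returning a safe default.
        return self._data[key]

def fetch_from_dc(key):
    # Simulate the primary data center being down.
    raise DCUnavailable("primary DC is down")

def get_setting(key, cache):
    try:
        return fetch_from_dc(key)
    except DCUnavailable:
        # "Graceful degradation": serve the cached copy instead.
        return cache.get(key)

cache = LocalCache()
cache.put("warm_key", "v1")
print(get_setting("warm_key", cache))   # fallback works: prints v1
try:
    get_setting("cold_key", cache)      # fallback path itself fails
except KeyError:
    print("hard dependency after all")
```

In tests the cache is always warm, so the service looks like it degrades gracefully; in production the unexercised cold-cache branch turns the "soft" dependency into a hard one.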

Reminds me of the time a single GCP data center in Paris briefly brought down the entire Google Cloud Console. Really the same thing.




> In my experience, no engineers really decided to make services depend on just one data center.

Partially true in this case; I can't speak to modern CF (or won't, more so), but a large number of internal services were built around SQL databases and weren't built with any sense of eventual consistency. Usage of read replicas was basically unheard of. Knowing that, and that this was normal, it's a cultural issue rather than an "oops" issue.

Flipping data sources for the whole DC is a sign of what I'm describing; a FAANG would instead run services across multiple DCs rather than relying on a primary/secondary architecture.
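The contrast above can be sketched as a toy read/write router. This is a minimal illustration under assumed names (`Router`, `Replica` are hypothetical, and replication is modeled synchronously for brevity): writes go to a single primary, but reads prefer a per-DC replica, so losing the primary's DC degrades writes without taking down reads.

```python
# Toy sketch of reads served from per-DC replicas while writes go to
# one primary. Hypothetical names; real replication is asynchronous.

class Replica:
    def __init__(self, dc):
        self.dc = dc
        self.data = {}
        self.up = True

    def read(self, key):
        if not self.up:
            raise ConnectionError(f"{self.dc} unavailable")
        return self.data.get(key)

class Router:
    def __init__(self, primary, replicas):
        self.primary = primary      # all writes land here
        self.replicas = replicas    # reads prefer the caller's DC

    def write(self, key, value):
        if not self.primary.up:
            raise ConnectionError("writes unavailable: primary DC down")
        self.primary.data[key] = value
        for r in self.replicas:     # modeled as synchronous for brevity
            r.data[key] = value

    def read(self, key, local_dc):
        # Try the local-DC replica first, then any other replica.
        for r in sorted(self.replicas, key=lambda r: r.dc != local_dc):
            try:
                return r.read(key)
            except ConnectionError:
                continue
        return self.primary.read(key)

primary = Replica("dc1")
router = Router(primary, [Replica("dc2"), Replica("dc3")])
router.write("k", "v")
primary.up = False                       # primary DC goes dark
print(router.read("k", local_dc="dc2"))  # reads survive: prints v
```

With a primary/secondary design, by contrast, that same outage forces the "flip the whole DC" failover the parent comment describes, and every service takes the hit at once.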


Dunno about that, I've read similar internal postmortems at the FAANG I worked at.


Everywhere I've worked requires a DR drill per service, but I've never seen anything where the whole company shuts down a DC at once across all services.

But probably we should. It's an immensely larger coordination problem, but frankly, it's probably the more common failure mode.


The FAANG I worked at did this back in 2016-18, precisely so that what happened to Cloudflare didn't happen to them.





