
> While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.

> The other two data centers running in the area would take over responsibility for the high availability cluster and keep critical services online. Generally that worked as planned. Unfortunately, we discovered that a subset of services that were supposed to be on the high availability cluster had dependencies on services exclusively running in PDX-04.

> A handful of products did not properly get stood up on our disaster recovery sites. These tended to be newer products where we had not fully implemented and tested a disaster recovery procedure.

So the root cause for the outage was that they relied on a single data center. I find that pretty embarrassing for a company like Cloudflare, which powers such relevant parts of the internet.




> I find that pretty embarrassing for a company like Cloudflare, which powers such relevant parts of the internet.

Bah, who cares about such unimportant details, what's important is that ~dev velocity~ was reaaally high right until that moment!

> We were also far too lax about requiring new products and their associated databases to integrate with the high availability cluster. Cloudflare allows multiple teams to innovate quickly. As such, products often take different paths toward their initial alpha. While, over time, our practice is to migrate the backend for these services to our best practices, we did not formally require that before products were declared generally available (GA). That was a mistake as it meant that the redundancy protections we had in place worked inconsistently depending on the product.

Complete and utter management failure. And customers apparently are sold what Cloudflare internally considers to be alpha quality software?


> Complete and utter management failure. And customers apparently are sold what Cloudflare internally considers to be alpha quality software?

This has been my experience with AWS and GCP as well. Assume anything that's under 3 years old is not really GA quality no matter what they say publicly.


I've been involved with some new service launches at AWS, and it's a strict requirement that everything goes through some rigorous operational and security reviews that cover exactly these issues before the service can be launched as GA. Feature-wise people might consider them "alpha", but when it comes to the resilience and security of the launched features, they are held to much higher standards than what is being described in this post-mortem.


Your operational reviews at AWS must be lacking then (surprise surprise), because there are so many instances where something will be released in alpha yet the documentation will still be outdated, stale, and incorrect LOL.


I think you misunderstand what's being talked about in this thread. "Operations" in this context has nothing to do with external-facing documentation, and instead refers to the resilience of the service and ensuring it doesn't for example, stop working when a single data center experiences a power outage.


"It stopped working because you did XYZ which you shouldn't have done despite it not being documented as something you shouldn't do" isn't different to a customer than a data center going down. For example, I'm sure the EKS UI was really resilient which meant little when random nodes dropped from a cluster due to the utter crap code in the official CNI network driver. My point wasn't that every cloud provider released alpha level software by the same definition but that by a customer's definition they all released alpha level software and label it GA.


> This has been my experience with AWS and GCP as well. Assume anything that's under 3 years old is not really GA quality no matter what they say publicly.

GCP runs multi-year betas of services and features, so I'm doubtful there were still things not ironed out by GA. Do you have some examples?


Having worked at companies with varying degrees of autonomy, in my experience a more flexible structure allows for building systems that are ultimately more resilient. Of course, there are ways to do it poorly, but that doesn’t mean it’s a “complete and utter management failure”.


> Complete and utter management failure

Too strong. A failure certainly, but painting this as the worst possible management failure is kind of silly.


To be honest, given the circumstances and the fact that they spend half of their post-mortem blaming the vendor, it does look like a total shitshow.


I’m going to leave out some details but there was a period of time where you could bypass cloudflare’s IP whitelisting by using Apple’s iCloud relay service. This was fixed but to my knowledge never disclosed.


There was a time when they were dumping encryption keys into search engine caches for weeks, and had the audacity to claim here that the issue was "mostly" solved. Until they were called out on it by Google's Project Zero team...

"Cloudflare Reverse Proxies Are Dumping Uninitialized Memory" - https://news.ycombinator.com/item?id=13718752


There still exist many bypasses that work in a lot of cases. There are even services for it now. Wouldn't be surprised if that or something similar was the technique employed.




And the top comment on the other HN post called it: https://news.ycombinator.com/item?id=38113503


And that this was unironically written in the same post mortem: “We are good at distributed systems.”

There’s a lack of awareness there.


Well, they did distribute their systems. Some were in the running DC, some were not ;)


Their uptime was eventually consistent


haha. The control plane was eventually consistent after 3 days


They are good at systems that are distributed; they are very bad at ensuring the systems they sell their customers are distributed.


They distributed the faults across all their customers....


Good != infallible


> While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.

It's amazing that they don't have standards mandating that all new systems use HA from the beginning.


> I find that pretty embarrassing for a company like Cloudflare, which powers such relevant parts of the internet.

Absolute lack of faith in cloudflare rn.

This is amateur hour stuff.

It's especially egregious that these are new services that were rolled out without HA.


?

Tbh. As far as I can see, their data plane worked at the edge.

Cloudflare has released a lot of new products, and the ones that were affected were Stream, new image uploads, and Logpush.

Their control plane was bad, though. But since most products kept working, that's more redundancy than most products have.

The proposed solution is simple (roughly the kind of gate sketched below):

- make being in the high availability cluster a requirement for GA

- test entire DC outages
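
A minimal sketch of what that first gate could look like, purely illustrative: the registry, the Service fields, and the blocker strings are all made up here, not anything Cloudflare actually runs. The idea is just that "GA" becomes a check a machine can refuse, not a label a team can self-declare.

    # Hypothetical GA-readiness gate; names are invented for illustration.
    from dataclasses import dataclass

    # Pretend registry of services already running on the HA cluster.
    HA_CLUSTER_MEMBERS = {"dns-control", "waf-config"}

    @dataclass
    class Service:
        name: str
        dr_tested: bool  # has failover for a full DC loss actually been exercised?

    def ga_blockers(service):
        """Return the blockers that should stop a GA announcement."""
        blockers = []
        if service.name not in HA_CLUSTER_MEMBERS:
            blockers.append("not on the high availability cluster")
        if not service.dr_tested:
            blockers.append("DR never tested against a full DC outage")
        return blockers

    stream = Service(name="stream", dr_tested=False)
    for blocker in ga_blockers(stream) or ["ready for GA"]:
        print(f"stream: {blocker}")

The second bullet is the one that actually catches hidden cross-DC dependencies like the ones in the post-mortem; a membership check alone wouldn't have.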


The combination of “newer products” and then having “our Stream service” as the only named service in the post-mortem is very odd, since Stream is hardly a “newer product”. It was launched in 2017[1] and went GA in 2018[2]. If after 5 years it still didn't have a disaster recovery procedure, I find it hard to believe they ever even considered it.

[1]: https://blog.cloudflare.com/introducing-cloudflare-stream/ [2]: https://www.cloudflare.com/press-releases/2018/cloudflare-st...


From what I was reading on the status page & customers here on HN, WARP + Zero Trust were also majorly affected, which would be quite impactful for a company using these products for their internal authentication.

It's not just streams, image upload & Logpush.


Those customers were impacted until the DC was back up (1-2 hours?) on the config plane.

The data plane (which I mentioned) had no issues.

It's literally in the title what was affected: "Post Mortem on Cloudflare Control Plane and Analytics Outage"

E.g. the status page mentioned the healthchecks not working, while everything was actually fine with them. There were just no analytics at that time to confirm that.

Source: I watched it all happen in the cloudflare discord channel.

If you know anyone that is claiming to be affected on the data plane for the services you mentioned, that would be an interesting one.

Note: I remember email was also more affected, though.


> Those customers were impacted until the DC was back up (1-2 hours?) on the config plane.

Which was still like ~12+ hours, if we check the status page.

>E.g. the status page mentioned the healthchecks not working, while everything was actually fine with them. There were just no analytics at that time to confirm that.

What good is a status page that's lying to you? Especially since CF manually updates it, anyway?

>Source: I watched it all happen in the cloudflare discord channel.

Wow, as a business customer I definitely like watching some Discord channel for status updates.


?

This wasn't about status updates going to discord only.

There is literally a discussion section on the Discord named #general-discussions.

Not everything was clear in the Discord either (e.g. the healthchecks were discussed there), and that's not something you want to copy-paste into the status updates...

Cloudflare's priority seemed to be getting everything back up. And what they thought was down was always mentioned in the status updates.


Oh, I just looked it up, and I thought you meant that CF engineers were giving real-time updates there. That's not the case.

However, I still fail to see your argument regarding Zero Trust not being impacted. The status page literally mentioned that the service was recovered on Nov 3, so I don't understand what you mean by:

>The data plane (which I mentioned) had no issues.

There's literally a section with "Data plane impact" all over the status page, and ZT is definitely in the earlier ones. And this is despite the fact that status updates on Nov 2 were very sparse until power was restored.


We don't use Zero Trust atm, so I can't know for sure.

What I mentioned was what I saw passing by in the channel at the time.

I also saw no incoming help requests for Zero Trust tbh (I did some community help)


This was a short downtime. But big companies must create their own gateways, while small ones just wait and rely on CF.


> Tbh. As far as I can see, their data plane worked at the edge.

Arguably, it's best to think of the edge as a buffering point in addition to processing. Aggregation has to happen somewhere, and that's where shit hit the fan.


? That would mean their data is in the core cluster. That's not true, or at least I haven't seen any evidence to support that statement.

Cloudflare's data lives at the edge and is constantly moving.

The only things not living at the edge (as was noticed) are Stream, Logpush, and new image resize requests (existing ones worked fine) on the data plane


>That would mean their data is in the core cluster. That's not true, or at least I haven't seen any evidence to support that statement.

You're being loose in your usage of 'data'. No one is talking about cached copies of an upstream, but you probably are.

Read the post mortem a bit more closely. They explicitly state that the control plane's source of truth lives in core, and that logs aggregate back to core for analytics and service ingestion. Think through the implications of that one.


That's my interpretation as well. There is one central brain, and "the edge" is like the nervous system that collects signals, sends them to the brain, and is _eventually consistent_ with instructions/config generated by the brain.
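
For what it's worth, that mental model is easy to sketch. A toy version (class names are mine, nothing to do with Cloudflare's actual internals): the core holds the versioned source of truth, edges keep serving whatever config they last pulled, and they only converge when the core is reachable.

    # Toy model of the core/edge split described above; names are invented.
    class Core:
        """Single source of truth for config, living in the core data centers."""
        def __init__(self):
            self.version = 0
            self.config = {}

        def update(self, key, value):
            self.version += 1
            self.config[key] = value

    class Edge:
        """Serves from its last-known config; converges only when core is reachable."""
        def __init__(self, name):
            self.name, self.version, self.config = name, 0, {}

        def pull(self, core, core_reachable=True):
            if core_reachable and self.version < core.version:
                self.version, self.config = core.version, dict(core.config)

    core = Core()
    edges = [Edge("edge-a"), Edge("edge-b")]
    core.update("waf_rule", "block /admin")

    edges[0].pull(core)                        # converges to v1
    edges[1].pull(core, core_reachable=False)  # core outage: stale but still serving
    for e in edges:
        print(e.name, e.version, e.config)

Which lines up with what people saw during the outage: the data plane kept serving from its last-known state, while anything that needed fresh state from the brain (config changes, analytics) stalled.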


> I am sorry and embarrassed for this incident and the pain that it caused our customers and our team.

As they should be.


[flagged]


Sounds like chatgpt doesn't want your business and tuned their cloudflare settings accordingly. Conveniently, cloudflare is getting the blame, which is presumably part of what they're paying for.


>Sounds like chatgpt doesn't want your business and tuned their cloudflare settings accordingly. Conveniently, cloudflare is getting the blame, which is presumably part of what they're paying for.

The issue is fixed now. But as I mentioned, Cloudflare still has a shit captcha, and the one for disabilities was broken.


Yep, it's easy to spot folks who have never configured Cloudflare's WAF when they suggest Cloudflare is blocking their browser of choice instead of the website itself.



