> While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.
> The other two data centers running in the area would take over responsibility for the high availability cluster and keep critical services online. Generally that worked as planned. Unfortunately, we discovered that a subset of services that were supposed to be on the high availability cluster had dependencies on services exclusively running in PDX-04.
> A handful of products did not properly get stood up on our disaster recovery sites. These tended to be newer products where we had not fully implemented and tested a disaster recovery procedure.
So the root cause of the outage was that they relied on a single data center. I find that pretty embarrassing for a company like Cloudflare, which powers such relevant parts of the internet.
> I find that pretty embarrassing for a company like Cloudflare, which powers such relevant parts of the internet.
Bah, who cares about such unimportant details, what's important is that ~dev velocity~ was reaaally high right until that moment!
> We were also far too lax about requiring new products and their associated databases to integrate with the high availability cluster. Cloudflare allows multiple teams to innovate quickly. As such, products often take different paths toward their initial alpha. While, over time, our practice is to migrate the backend for these services to our best practices, we did not formally require that before products were declared generally available (GA). That was a mistake as it meant that the redundancy protections we had in place worked inconsistently depending on the product.
Complete and utter management failure. And customers apparently are sold what Cloudflare internally considers to be alpha quality software?
> Complete and utter management failure. And customers apparently are sold what Cloudflare internally considers to be alpha quality software?
This has been my experience with AWS and GCP as well. Assume anything that's under 3 years old is not really GA quality no matter what they say publicly.
I've been involved with some new service launches at AWS, and it's a strict requirement that everything goes through some rigorous operational and security reviews that cover exactly these issues before the service can be launched as GA. Feature-wise people might consider them "alpha", but when it comes to the resilience and security of the launched features, they are held to much higher standards than what is being described in this post-mortem.
Your operational reviews at AWS must be lacking then (surprise surprise), because there are so many instances where something will be released in alpha yet the documentation will still be outdated, stale and incorrect LOL.
I think you misunderstand what's being talked about in this thread. "Operations" in this context has nothing to do with external-facing documentation, and instead refers to the resilience of the service and ensuring it doesn't for example, stop working when a single data center experiences a power outage.
"It stopped working because you did XYZ which you shouldn't have done despite it not being documented as something you shouldn't do" isn't different to a customer than a data center going down. For example, I'm sure the EKS UI was really resilient which meant little when random nodes dropped from a cluster due to the utter crap code in the official CNI network driver. My point wasn't that every cloud provider released alpha level software by the same definition but that by a customer's definition they all released alpha level software and label it GA.
> This has been my experience with AWS and GCP as well. Assume anything that's under 3 years old is not really GA quality no matter what they say publicly.
GCP runs multi-year betas of services and features, so I'm doubtful there were still things not ironed out for GA. Do you have some examples?
Having worked at companies with varying degrees of autonomy, in my experience a more flexible structure allows for building systems that are ultimately more resilient. Of course, there are ways to do it poorly, but that doesn’t mean it’s a “complete and utter management failure”.
I’m going to leave out some details but there was a period of time where you could bypass cloudflare’s IP whitelisting by using Apple’s iCloud relay service. This was fixed but to my knowledge never disclosed.
There was a time when they were dumping encryption keys into search engine caches for weeks, and had the audacity to claim here that the issue was "mostly" solved. Until they were called out on it by the Google Project Zero team...
There still exist many bypasses that work in a lot of cases. There's even services for it now. Wouldn't be surprised if that or similar was a technique employed.
> While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.
It's amazing that they don't have standards mandating that all new systems use HA from the beginning.
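A gate like that doesn't even have to be sophisticated. As a purely hypothetical sketch (the service names, fields, and tooling below are made up, not Cloudflare's actual systems), a pre-GA check could walk each service's declared dependency graph and refuse the GA label if anything in it runs in only one data center, which is exactly the hidden-dependency-on-PDX-04 failure mode described in the post:

```python
# Hypothetical pre-GA readiness check (illustrative only): every service that
# wants the GA label must have all of its transitive dependencies deployed to
# more than one data center.
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    datacenters: set[str]                      # where this service actually runs
    depends_on: list[str] = field(default_factory=list)


def single_dc_dependencies(catalog: dict[str, Service], root: str) -> list[str]:
    """Return every service reachable from `root` that runs in only one DC."""
    offenders, seen, stack = [], set(), [root]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        svc = catalog[name]
        if len(svc.datacenters) < 2:
            offenders.append(name)
        stack.extend(svc.depends_on)
    return offenders


# An "HA" front end with a hidden dependency on a single-DC backend.
# (Only PDX-04 comes from the post-mortem; the other names are invented.)
catalog = {
    "dashboard-api": Service("dashboard-api", {"PDX-01", "PDX-04"}, ["feature-flags"]),
    "feature-flags": Service("feature-flags", {"PDX-04"}),
}

if offenders := single_dc_dependencies(catalog, "dashboard-api"):
    raise SystemExit(f"Not GA-ready, single-DC dependencies: {offenders}")
```

Nothing fancy, but it turns the "subset of services had dependencies on services exclusively running in PDX-04" problem into a build failure instead of a mid-outage surprise.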
The combination of "newer products" and then having "our Stream service" as the only named service in the post-mortem is very odd, since Stream is hardly a "newer product". It was launched in 2017 and went GA in 2018[2]. If after 5 years it still didn't have a disaster recovery procedure I find it hard to believe they even considered it.
From what I was reading on the status page & customers here on HN, WARP + Zero Trust were also majorly affected, which would be quite impactful for a company using these products for their internal authentication.
Those customers were impacted until the DC was back up (1-2 hours?) on the config plane.
The data plane (which I mentioned) had no issues.
It's literally in the title what was affected: "Post Mortem on Cloudflare Control Plane and Analytics Outage"
E.g. the status page mentioned the healthchecks not working, while they were actually fine. There were just no analytics at the time to confirm that.
Source: I watched it all happen in the cloudflare discord channel.
If you know anyone that is claiming to be affected on the data plane for the services you mentioned, that would be an interesting one.
Note: I remember emails were also more affected though.
> Those customers were impacted until the DC was back up (1-2 hours?) on the config plane.
Which was still like ~12+ hours, if we check the status page.
>E.g. the status page mentioned the healthchecks not working, while they were actually fine. There were just no analytics at the time to confirm that.
What good is a status page that's lying to you? Especially since CF manually updates it, anyway?
>Source: I watched it all happen in the cloudflare discord channel.
Wow, as a business customer I definitely like watching some Discord channel for status updates.
This wasn't about status updates going to discord only.
There is literally a discussion section on the discord, named: #general-discussions
Not everything was clear in the Discord either (e.g. the healthchecks were discussed there), and that's not something you want to copy-paste into the status updates...
The priority for Cloudflare seemed to be getting everything back up. And what they thought was down was always mentioned in the status updates.
Oh, I just looked it up; I thought you meant that CF engineers were giving real-time updates there. That's not the case.
However, I still fail to see your argument regarding Zero Trust not being impacted. The status page literally mentioned that the service was recovered on Nov 3, so I don't understand what you mean by:
>The data plane (which I mentioned) had no issues.
There's literally a "Data plane impact" section all over the status page, and ZT is definitely in the earlier ones. And that's despite the fact that status updates on Nov 2 were very sparse until power was restored.
> Tbh. As far as I can see, their data plane worked at the edge.
That's arguable; it's best to think of the edge as a buffering point in addition to a processing point. Aggregation has to happen somewhere, and that's where shit hit the fan.
? That would mean their data is at the core cluster. That's not true or I haven't seen any evidence to support that statement.
Cloudflare's data lives in the edge and is constantly moving.
The only things on the data plane not living at the edge (as was noticed) are Stream, Logpush, and new image resize requests (existing ones worked fine).
>That would mean their data is at the core cluster. That's not true or I haven't seen any evidence to support that statement.
You're being loose in your usage of 'data'. No one is talking about cached copies of an upstream, but you probably are.
Read the post mortem a bit more closely. They explicitly state that the control plane's source of truth lives in core, and that logs aggregate back to core for analytics and service ingestion. Think through the implications of that one.
That’s my interpretation as well. There is one central brain, and “the edge” is like the nervous system that collects signals, sends it to the brain, and is _eventually consistent_ with instructions/config generated by the brain.
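To make that analogy concrete, here's a toy sketch of what "eventually consistent with the brain" could look like. The class names and fields below are mine, assumed for illustration and not Cloudflare's code: the edge keeps serving its last-known config when core is unreachable, and buffers logs locally until core can ingest them again.

```python
# Toy model of an eventually consistent edge node (illustrative, not Cloudflare's
# code): config flows core -> edge, logs/analytics flow edge -> core. When core
# is unreachable, the edge keeps serving stale config and buffers logs instead
# of dropping traffic.
import time


class CoreUnavailable(Exception):
    """Raised by the (assumed) core client when the core data center is down."""


class EdgeNode:
    def __init__(self, core_client):
        self.core = core_client      # assumed to expose fetch_config() and ingest_logs()
        self.config = {"blocked_ips": set()}
        self.config_version = 0      # last config version applied at this edge node
        self.log_buffer = []         # analytics held locally while core is down

    def refresh_config(self):
        """Pull newer config from core if possible; otherwise keep the stale copy."""
        try:
            version, config = self.core.fetch_config()
        except CoreUnavailable:
            return  # control plane is down, but the data plane keeps serving
        if version > self.config_version:
            self.config_version, self.config = version, config

    def handle_request(self, request):
        """Serve traffic purely from local state; never block on core."""
        decision = "block" if request["ip"] in self.config["blocked_ips"] else "allow"
        self.log_buffer.append({"ts": time.time(), "req": request, "decision": decision})
        return decision

    def flush_logs(self):
        """Ship buffered analytics to core; if that fails, analytics just lag behind."""
        try:
            self.core.ingest_logs(self.log_buffer)
            self.log_buffer.clear()
        except CoreUnavailable:
            pass
```

Which lines up with what people saw during the outage: requests kept flowing at the edge, while anything that needed core (config changes, analytics, logs) stalled until PDX-04 came back.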
Sounds like ChatGPT doesn't want your business and tuned their Cloudflare settings accordingly. Conveniently, Cloudflare is getting the blame, which is presumably part of what they're paying for.
>Sounds like ChatGPT doesn't want your business and tuned their Cloudflare settings accordingly. Conveniently, Cloudflare is getting the blame, which is presumably part of what they're paying for.
The issue is fixed now. But as I mentioned, Cloudflare still has a shit captcha, and the one for disabilities was broken.
Yep, it's easy to spot folks who have never configured Cloudflare's WAF when they suggest Cloudflare is blocking their browser of choice instead of the website itself.