> I find that pretty embarrassing for a company like Cloudflare, which powers such relevant parts of the internet.
Bah, who cares about such unimportant details, what's important is that ~dev velocity~ was reaaally high right until that moment!
> We were also far too lax about requiring new products and their associated databases to integrate with the high availability cluster. Cloudflare allows multiple teams to innovate quickly. As such, products often take different paths toward their initial alpha. While, over time, our practice is to migrate the backend for these services to our best practices, we did not formally require that before products were declared generally available (GA). That was a mistake as it meant that the redundancy protections we had in place worked inconsistently depending on the product.
Complete and utter management failure. And customers apparently are sold what Cloudflare internally considers to be alpha quality software?
> Complete and utter management failure. And customers apparently are sold what Cloudflare internally considers to be alpha quality software?
This has been my experience with AWS and GCP as well. Assume anything that's under 3 years old is not really GA quality no matter what they say publicly.
I've been involved with some new service launches at AWS, and it's a strict requirement that everything goes through some rigorous operational and security reviews that cover exactly these issues before the service can be launched as GA. Feature-wise people might consider them "alpha", but when it comes to the resilience and security of the launched features, they are held to much higher standards than what is being described in this post-mortem.
Your operational reviews at AWS must be lacking then (surprise, surprise), because there are so many instances where something will be released in alpha yet the documentation will still be outdated, stale, and incorrect. LOL.
I think you misunderstand what's being talked about in this thread. "Operations" in this context has nothing to do with external-facing documentation; it refers to the resilience of the service, e.g. ensuring it doesn't stop working when a single data center experiences a power outage.
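To make that concrete, here's a minimal sketch (in Python, with made-up endpoint names) of the kind of cross-data-center failover an operational-readiness review like that is checking for. It's an illustration of the concept only, not anything from AWS's actual review process:

```python
import urllib.request
import urllib.error

ENDPOINTS = [
    "https://us-east.example.internal/health",  # primary data center (hypothetical)
    "https://us-west.example.internal/health",  # standby data center (hypothetical)
]

def first_healthy(endpoints=ENDPOINTS, timeout=2.0):
    """Return the first endpoint whose health check answers, so that a
    single data-center outage degrades the service instead of killing it."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # that data center is down or unreachable; try the next
    raise RuntimeError("all data centers unreachable")
```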
"It stopped working because you did XYZ which you shouldn't have done despite it not being documented as something you shouldn't do" isn't different to a customer than a data center going down. For example, I'm sure the EKS UI was really resilient which meant little when random nodes dropped from a cluster due to the utter crap code in the official CNI network driver. My point wasn't that every cloud provider released alpha level software by the same definition but that by a customer's definition they all released alpha level software and label it GA.
> This has been my experience with AWS and GCP as well. Assume anything that's under 3 years old is not really GA quality no matter what they say publicly.
GCP runs multi-year betas of services and features, so I'm doubtful there were still things not ironed out by GA. Do you have some examples?
Having worked at companies with varying degrees of autonomy, I've found that a more flexible structure allows for building systems that are ultimately more resilient. Of course, there are ways to do it poorly, but that doesn’t mean it’s a “complete and utter management failure”.
I’m going to leave out some details, but there was a period of time where you could bypass Cloudflare’s IP whitelisting by using Apple’s iCloud Private Relay service. This was fixed but, to my knowledge, never disclosed.
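Since the details are deliberately left out, here's only the generic failure mode with shared relays: an allowlist keyed on the peer IP extends trust to everyone behind a trusted egress. A minimal sketch, with all addresses hypothetical (drawn from a documentation prefix) and no claim about Cloudflare's actual implementation:

```python
from ipaddress import ip_address, ip_network

# Hypothetical "trusted" range (a documentation prefix, not real data).
ALLOWLIST = [ip_network("203.0.113.0/24")]

def is_allowed(peer_ip: str) -> bool:
    # Naive check: trusts whoever the immediate peer is. If a shared
    # relay's egress IP falls inside an allowlisted range, every user
    # of that relay inherits the trust, not just the intended client.
    addr = ip_address(peer_ip)
    return any(addr in net for net in ALLOWLIST)

# A relay egress that happens to sit inside the allowlisted range lets
# arbitrary clients behind it through the "IP whitelist":
print(is_allowed("203.0.113.42"))  # True, regardless of who the real client is
```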
There was a time when they were dumping encryption keys into search engine caches for weeks, and had the audacity to claim here that the issue was "mostly" solved, until they were called out on it by Google's Project Zero team...
There still exist many bypasses that work in a lot of cases; there are even services for it now. Wouldn't be surprised if that or something similar was the technique employed.
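For context, the standard mitigation against these origin-IP bypasses is to refuse any traffic that didn't arrive through Cloudflare at all, using the ranges Cloudflare publishes at https://www.cloudflare.com/ips/. A sketch of the check (the range list is truncated here, and in practice this is usually enforced at the firewall rather than in application code):

```python
from ipaddress import ip_address, ip_network

# Two of Cloudflare's published IPv4 ranges
# (https://www.cloudflare.com/ips/); the remaining entries are elided.
CLOUDFLARE_RANGES = [
    ip_network("173.245.48.0/20"),
    ip_network("103.21.244.0/22"),
    # ...
]

def came_through_cloudflare(peer_ip: str) -> bool:
    """True if the connecting address belongs to Cloudflare's edge,
    i.e. the request was proxied rather than sent to the origin directly."""
    addr = ip_address(peer_ip)
    return any(addr in net for net in CLOUDFLARE_RANGES)
```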