Long story short, they give no indication as to why their data center cooling systems were unable to handle the voltage changes caused by the storm, and their systems are not designed for speedy restoration into another region.
Data center problems are Azure's issue, not the VSTS team's, so your criticism should be focused there. Even though Microsoft runs all of these services, that's out of scope for a VSTS blog.
They linked to the Azure service details on the Azure Status page if you want more information: https://azure.microsoft.com/en-us/status/history/. It sounds as though they're compiling a report on this problem, which will be released in the coming weeks.
> Initially, the datacenter was able to maintain its operational temperatures through a load dependent thermal buffer that was designed within the cooling system. However, once this thermal buffer was depleted the datacenter temperature exceeded safe operational thresholds, and an automated shutdown of devices was initiated
Total failure of a data center's cooling apparatus seems to me to be a very rare occurrence, perhaps limited to simultaneous failure of utility and genset power (example: electrical switchgear and fuel pumps underwater due to flooding).
Anyone have any data around how frequently such a failure occurs?
I had the same thought. The only thing I could come up with is that it wasn't a failure of the power supply, but that a surge took down enough cooling systems that they couldn't maintain temperature. A lot of DCs I've seen are N+1 on cooling (or even 2N), but all the units run at the same time and are identical, so a common-mode event like a surge can hit them all at once. Or the control system went down and they weren't able to get it back up and running, although I would think they would have redundancy in that case.
> The primary solution we are pursuing to improve handling datacenter failures is Availability Zones, and we are exploring the feasibility of asynchronous replication.
This I do not understand. I was also amazed when I saw that Azure AZs are not available in all regions. In AWS, the bare minimum is 2 AZs per region (except for one odd region). Same thing for Google Cloud.
From what I understand (though I can't find the reference anywhere), each region has at least three availability zones; some regions only expose two user-selectable AZs.
For instance, S3 promises that it is replicated across three AZs in a region. That guarantee holds even in regions that only have two publicly available AZs.
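If you're curious, this is easy to check yourself. A rough sketch with boto3 (assuming default AWS credentials; the API only returns the zones your account can actually select, so non-selectable zones won't show up):

```python
# Rough sketch: count the AZs visible to this account in each region.
# Assumes default AWS credentials; output differs per account.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

for region in sorted(regions):
    zones = boto3.client("ec2", region_name=region).describe_availability_zones(
        Filters=[{"Name": "state", "Values": ["available"]}]
    )["AvailabilityZones"]
    print(f"{region}: {len(zones)} AZs -> {[z['ZoneName'] for z in zones]}")
```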
At the end of the day, Azure and AWS are monocultures with a considerable amount of centralization and interdependency among their services. Their scale undermines the original purpose behind the Internet.
It bothers me that an increasing number of large companies are dumping their own data centers to jump into The Cloud. This means future outages (which will undoubtedly happen) will have a wider and wider impact on end users.
For example, if your email is hosted on AWS and it goes down, you lose access to your email. No big deal. However, if your email, VoIP and IM/chat go down at the same time, you may lose all ability to communicate electronically. This can be a very big deal in certain situations.
The original purpose behind the Internet was to build a robust layer-3 network based on packet switching technology. The designers weren't focused on the application layer.
Separately, I think we have enough history of working with the cloud at this point to demonstrate that major providers' availability is on par with, or better than, the availability of the typical small entity. Sure, the impact is potentially wider spread (although this can be mitigated with a cellular architecture, which first-class providers do employ), but there's a perverse advantage that when outages occur, they tend to get fixed a lot faster because the complaint volume is much higher.
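For anyone unfamiliar with the term, a cellular architecture just means partitioning customers across independent deployment cells so a failure in one cell only hits a slice of them. A toy sketch of the idea (cell names and the hashing scheme are purely illustrative; real systems usually keep the assignment in a directory service so cells can be drained or rebalanced):

```python
# Toy sketch of cell-based routing: pin each customer to one of several
# independent stacks so an outage in one cell doesn't take everyone down.
# Cell names and the hashing scheme here are illustrative only.
import hashlib

CELLS = ["cell-a", "cell-b", "cell-c"]  # independently deployed, isolated stacks

def cell_for(customer_id: str) -> str:
    """Deterministically map a customer to one cell."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

print(cell_for("acme-corp"))  # always routes this customer to the same cell
```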
> when outages occur, they tend to get fixed a lot faster because the complaint volume is much higher.
On the other hand, they can be much harder to fix because of the sheer scale of the failures and the complexity of the infrastructure. There is a higher probability of complex systemic issues, as demonstrated by this very outage.
There are plenty of smaller providers that beat Azure VMs in uptime. Plus, smaller websites/services can employ much simpler failure mitigation strategies.
The "complex systemic issue" here is that Azure is only now rolling out availability zones, and the product in question hasn't yet been able to take advantage of them to mitigate a serious DC fault caused by an Act of God.
The necessity of low-latency-but-decoupled-physical-plant AZs is well known in the art by now, and these issues will no doubt be addressed as Azure matures. Remember, they're 5 years behind AWS.
> The "complex systemic issue" here is that Azure is only now rolling out availability zones,
Availability zones are a mitigation. The issue is the sequence of events and dependencies described in the postmortem. The description runs to six paragraphs.
I'm not precisely sure what you're referring to. Can you cite the precise problem discussed in the postmortem, and how, specifically, you think it could have been better designed?
And how could your perfect model, whatever that is, survive a similar catastrophic DC failure without availability zones?
My favorite is AWS slinging “make sure you have copies in other regions.” Look: every time an entire region has an issue on AWS, all the regions are fucked. APIs etc. just drop or hang. If a region has dropped, things are really bad. So keep paying to double up your data, even though it does you no good. If Virginia East suffers some kind of nuclear attack, I have worse problems than keeping my crappy CRUD app running.
> Look: every time an entire region has an issue on AWS, all the regions are fucked.
This is not true. I've watched full region outages play out while our systems were chugging along just fine. I have also seen temporary network issues in one region (which caused our monitoring to go haywire) while other regions were completely unaffected.
What tends to happen is that AWS concentrates a bunch of services in us-east-1 (Route 53 being one example), so an outage there can knock out the control plane for a bunch of services in every region, which shouldn't be a thing.
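This is also why the usual advice is to depend on the data plane rather than the control plane: a health-checked failover record keeps resolving even while the Route 53 control plane is down, you just can't change it during the outage. A rough boto3 sketch, with a hypothetical zone ID, hostname, IPs and health-check ID:

```python
# Rough sketch: PRIMARY/SECONDARY DNS failover in Route 53 via boto3.
# The zone ID, hostname, IPs and health-check ID below are hypothetical.
import boto3

route53 = boto3.client("route53")
HOSTED_ZONE_ID = "Z0123456789EXAMPLE"  # hypothetical hosted zone

def upsert_failover_record(role, set_id, ip, health_check_id=None):
    """UPSERT an A record that takes part in PRIMARY/SECONDARY failover."""
    record = {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,               # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id  # primary fails over when unhealthy
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )

# Primary endpoint in us-east-1 (health-checked), secondary in us-west-2.
upsert_failover_record("PRIMARY", "use1", "203.0.113.10", "abcdef12-3456-7890-abcd-ef1234567890")
upsert_failover_record("SECONDARY", "usw2", "198.51.100.20")
```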
One issue is the stability of The Cloud of your choice, but in reality, few apps are truly multi-cloud and ready to lose a region.
Really makes you think that, with all these big players pushing chaos engineering and multi-cloud deployments, the cloud suppliers themselves should be better prepared.
How about at least two "chaos regions" for every supplier, available at half cost but with random outages, to really test our applications, and the suppliers' infrastructure!
> On the other hand, they have more experience handling failure than 99% of companies, so I think 99% of companies are better off with Azure/AWS/Alibaba.
Nowhere near 99%.
Running a large service on top of Azure or AWS is not a walk in the park. You don't manage hardware and you deal less with provisioning, but you need a significant level of expertise to deal with the idiosyncrasies of the particular host you've chosen.
I've seen companies transition to AWS. Very often they spend months "figuring things out". And then it turns out that they've just traded low-level problems which everyone knows how to solve for high-level problems no one knows how to solve.
Also, just because AWS/Azure is running doesn't mean your service is up. You end up managing high-level company-specific infrastructure on top of low-level one-size-fits-all infrastructure. Both can fail.
A lot of companies jump onto AWS/Azure/other big clouds because of hype, the bandwagon effect, and the general feeling that everything will become "world-class" by magic. In many cases there is no legitimate cost-benefit analysis comparing the cost of an AWS migration to the cost of fixing whatever problems the company's current infrastructure suffers from. Also, almost no one factors in the cost of vendor lock-in and the "too big to care" effect.
There are definitely use cases where Big Cloud makes sense, but it's nowhere near 99%.
I hear your argument, but the notion is that if customers are expected to trust the Microsoft Cloud to run their operations, the company itself should also trust it. How would it look if Microsoft was slinging its cloud services but wasn't primarily using them?
Of course, Azure has such a massive footprint that the bigger issue is: why wasn't that used as an advantage when South Central US went into shutdown mode? This does play against the notion that, with so many regions, this sort of event should have been preventable.
The next tech "black swan" event will be cloud. At some point in the future, we are going to have a major event that takes down an enormous number of businesses and costs billions of dollars.
Comments like this are counterproductive to the discussion. Competition is always good, and some of us like that AWS has competition in the form of Azure and GCP. AWS has had its own share of outages, so it's not perfect either.
I know you just got used to, and are productive with, the old UX and naming... but hey, now we call it something else and there's a fancy new UX featuring more whitespace for you to learn. Fire and motion!
The point is that the headline should've been "Azure DevOps Outage…" but they're afraid that other outlets will take that over as "Azure Outage..." and they don't want a headline like that making the rounds. So they use the old name for bad news and the new name for good news.
Having worked at a big Azure customer, I wouldn't say resources equal stability. The reality is more like resources + quality engineering + failure testing.
As this case shows, Azure isn't spending a lot on failure testing. Also, having been a big Azure customer, I can tell you that things look better from the outside than they are.
Small Azure customer (five figures a month), but can co-sign all of this.
Azure looks shiny from the outside but we've had way, way more problems, from uptime to bad APIs to awful language SDKs to bad user interfaces to licensing hell, than I've ever had on AWS or GCP. It's so bad that I am currently weighing whether or not to advocate for a migration off of it, at nontrivial expense, because I cannot pretend to provide reliable services for our customers.