Long story short, they give no indication as to why their data center cooling systems were unable to handle the voltage changes caused by the storm, and their systems are not designed for speedy restoration into another region.
Data center problems are Azure's issue, not the VSTS team's, so your criticism should be focused there. Even though Microsoft runs all of these services, that's out of scope for a VSTS blog.
They linked to the Azure service details on the Azure Status page if you want more information: https://azure.microsoft.com/en-us/status/history/. It sounds as though they're compiling a report on this problem, which will be released in the coming weeks.
> Initially, the datacenter was able to maintain its operational temperatures through a load dependent thermal buffer that was designed within the cooling system. However, once this thermal buffer was depleted the datacenter temperature exceeded safe operational thresholds, and an automated shutdown of devices was initiated
Total failure of a data center's cooling apparatus seems to me to be a very rare occurrence, perhaps limited to simultaneous failure of utility and genset power (example: electrical switchgear and fuel pumps underwater due to flooding).
Anyone have any data around how frequently such a failure occurs?
I had the same thought. The only thing I could come up with is that it wasn't a failure of the power supply, but that a surge took down enough cooling systems that they couldn't maintain temperature. A lot of DCs I've seen are N+1 on cooling (or even 2N), but all the units run at the same time and are identical, so a common-mode event like a surge can hit them all at once. Or the control system went down and they weren't able to get it back up and running, although I would think they would have redundancy in that case.
> The primary solution we are pursuing to improve handling datacenter failures is Availability Zones, and we are exploring the feasibility of asynchronous replication.
This I do not understand. I was also amazed when I saw that Azure AZs are not available in all regions. In AWS, the bare minimum is 2 AZs per region (except for one odd region). Same thing for Google Cloud.
From what I understand (though I can't find the reference anywhere), each region has at least three availability zones; some regions only expose two user-selectable AZs.
For instance, S3 promises that it is replicated across three AZs in a region. That guarantee holds even in regions that only have two publicly available AZs.
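If you're curious, this is easy to check yourself. A rough sketch with boto3 (assuming default AWS credentials; the API only returns the zones your account can actually select, so non-selectable zones won't show up):

```python
# Rough sketch: count the AZs visible to this account in each region.
# Assumes default AWS credentials; output differs per account.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

for region in sorted(regions):
    zones = boto3.client("ec2", region_name=region).describe_availability_zones(
        Filters=[{"Name": "state", "Values": ["available"]}]
    )["AvailabilityZones"]
    print(f"{region}: {len(zones)} AZs -> {[z['ZoneName'] for z in zones]}")
```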
At the end of the day, Azure and AWS are monocultures with a considerable amount of centralization and interdependency among their services. Their scale undermines the original purpose behind the Internet.
It bothers me that an increasing number of large companies are dumping their own data centers to jump into The Cloud. This means future outages (which will undoubtedly happen) will have a wider and wider impact on end users.
For example, if your email is hosted on AWS and it goes down, you lose access to your email. No big deal. However, if your email, VoIP and IM/chat go down at the same time, you may lose all ability to communicate electronically. This can be a very big deal in certain situations.
The original purpose behind the Internet was to build a robust layer-3 network based on packet switching technology. The designers weren't focused on the application layer.
Separately, I think we have enough history of working with the cloud at this point to demonstrate that major providers' availability is on par with, or better than, the availability of the typical small entity. Sure, the impact is potentially wider spread (although this can be mitigated with a cellular architecture, which first-class providers do employ), but there's a perverse advantage that when outages occur, they tend to get fixed a lot faster because the complaint volume is much higher.
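For anyone unfamiliar with the term, a cellular architecture just means partitioning customers across independent deployment cells so a failure in one cell only hits a slice of them. A toy sketch of the idea (cell names and the hashing scheme are purely illustrative; real systems usually keep the assignment in a directory service so cells can be drained or rebalanced):

```python
# Toy sketch of cell-based routing: pin each customer to one of several
# independent stacks so an outage in one cell doesn't take everyone down.
# Cell names and the hashing scheme here are illustrative only.
import hashlib

CELLS = ["cell-a", "cell-b", "cell-c"]  # independently deployed, isolated stacks

def cell_for(customer_id: str) -> str:
    """Deterministically map a customer to one cell."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

print(cell_for("acme-corp"))  # always routes this customer to the same cell
```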
> when outages occur, they tend to get fixed a lot faster because the complaint volume is much higher.
On the other hand, they can be much harder to fix because of the sheer scale of the failures and the complexity of the infrastructure. There is a higher probability of complex systemic issues, as demonstrated by this very outage.
There are plenty of smaller providers that beat Azure VMs in uptime. Plus, smaller websites/services can employ much simpler failure mitigation strategies.
The "complex systemic issue" here is that Azure is only now rolling out availability zones, and the product in question hasn't yet been able to take advantage of them to mitigate a serious DC fault caused by an Act of God.
The necessity of low-latency-but-decoupled-physical-plant AZs is well known in the art by now, and these issues will no doubt be addressed as Azure matures. Remember, they're 5 years behind AWS.
> The "complex systemic issue" here is that Azure is only now rolling out availability zones,
Availability zones are a mitigation. The issue is the sequence of events and dependencies described in the postmortem. The description runs to six paragraphs.
I'm not precisely sure what you're referring to. Can you cite the precise problem discussed in the postmortem, and how, specifically, you think it could have been better designed?
And how could your perfect model, whatever that is, survive a similar catastrophic DC failure without availability zones?
My favorite is AWS slinging “make sure you have copies in other regions.” Look: every time an entire region has an issue on AWS, all the regions are fucked. APIs etc. just drop or hang. If a region has dropped, things are really bad. So keep paying to double up your data, even though it does you no good. If Virginia East suffers some kind of nuclear attack, I have worse problems than keeping my crappy CRUD app running.
> Look: every time an entire region has an issue on AWS, all the regions are fucked.
This is not true. I've watched full region outages play out while our systems were chugging along just fine. I have also seen temporary network issues in one region (which caused our monitoring to go haywire) while other regions were completely unaffected.
What tends to happen is that AWS concentrates a bunch of services in us-east-1 (Route 53 being one example), so an outage there can knock out the control plane for a bunch of services in every region, which shouldn't be a thing.
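This is also why the usual advice is to depend on the data plane rather than the control plane: a health-checked failover record keeps resolving even while the Route 53 control plane is down, you just can't change it during the outage. A rough boto3 sketch, with a hypothetical zone ID, hostname, IPs and health-check ID:

```python
# Rough sketch: PRIMARY/SECONDARY DNS failover in Route 53 via boto3.
# The zone ID, hostname, IPs and health-check ID below are hypothetical.
import boto3

route53 = boto3.client("route53")
HOSTED_ZONE_ID = "Z0123456789EXAMPLE"  # hypothetical hosted zone

def upsert_failover_record(role, set_id, ip, health_check_id=None):
    """UPSERT an A record that takes part in PRIMARY/SECONDARY failover."""
    record = {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,               # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id  # primary fails over when unhealthy
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )

# Primary endpoint in us-east-1 (health-checked), secondary in us-west-2.
upsert_failover_record("PRIMARY", "use1", "203.0.113.10", "abcdef12-3456-7890-abcd-ef1234567890")
upsert_failover_record("SECONDARY", "usw2", "198.51.100.20")
```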
One issue is the stability of The Cloud of your choice, but in reality, few apps are truly multi-cloud and ready to lose a region.
Really makes you think that, with all these big players pushing chaos engineering and multi-cloud deployments, the cloud suppliers themselves should be better prepared.
How about at least two "chaos regions" for every supplier, available at half cost but with random outages, to really test our applications, and the suppliers' infrastructure!
> On the other hand, they have more experience handling failure than 99% of companies, so I think 99% of companies are better off with Azure/AWS/Alibaba.
Nowhere near 99%.
Running a large service on top of Azure or AWS is not a walk in the park. You don't manage hardware and you deal less with provisioning, but you need a significant level of expertise to deal with the idiosyncrasies of the particular host you've chosen.
I've seen companies transition to AWS. Very often they spend months "figuring things out". And then it turns out that they've just traded low-level problems which everyone knows how to solve for high-level problems no one knows how to solve.
Also, just because AWS/Azure is running doesn't mean your service is up. You end up managing high-level company-specific infrastructure on top of low-level one-size-fits-all infrastructure. Both can fail.
A lot of companies jump onto AWS/Azure/other big clouds because of hype, the bandwagon effect, and the general feeling that everything will become "world-class" by magic. In many cases there is no legitimate cost-benefit analysis comparing the cost of an AWS migration to the cost of fixing whatever problems the company's current infrastructure suffers from. Also, almost no one factors in the cost of vendor lock-in and the "too big to care" effect.
There are definitely use cases where Big Cloud makes sense, but it's nowhere near 99%.
I hear your argument, but the notion is that if customers are expected to trust the Microsoft Cloud to run their operations, the company itself should also trust it. How would it look if Microsoft was slinging its cloud services but wasn't primarily using them?
Of course, Azure has such a massive footprint that the bigger issue is: why wasn't that used as an advantage when South Central US went into shutdown mode? This does play against the notion that, with so many regions, this sort of event should have been preventable.
The next tech "black swan" event will be cloud. At some point in the future, we are going to have a major event that takes down an enormous number of businesses and costs billions of dollars.
Comments like this are counterproductive to the discussion. Competition is always good, and some of us like that AWS has competition in the form of Azure and GCP. AWS has had its own share of outages, so it's not perfect either.
I know you just got used to, and are productive with, the old UX and naming... but hey, now we call it something else and there's a fancy new UX featuring more whitespace for you to learn. Fire and motion!
The point is that the headline should've been "Azure DevOps Outage…" but they're afraid that other outlets will take that over as "Azure Outage..." and they don't want a headline like that making the rounds. So they use the old name for bad news and the new name for good news.
Having worked at a big Azure customer, I wouldn't say resources equal stability. The reality is more like resources + quality engineering + failure testing.
As this case shows, Azure isn't spending a lot on failure testing. Also, having been a big Azure customer, I can tell you that things look better from the outside than they are.
Small Azure customer (five figures a month), but can co-sign all of this.
Azure looks shiny from the outside but we've had way, way more problems, from uptime to bad APIs to awful language SDKs to bad user interfaces to licensing hell, than I've ever had on AWS or GCP. It's so bad that I am currently weighing whether or not to advocate for a migration off of it, at nontrivial expense, because I cannot pretend to provide reliable services for our customers.