Hacker News
Facebook blames a server configuration change for yesterday’s outage (techcrunch.com)
76 points by Errorcod3 on March 14, 2019 | 54 comments



To everyone jumping to conclusions, remember that the words "server" and "configuration" can mean a whole host of things. It doesn't necessarily mean they mistyped their nginx config.


Exactly. Doing an upgrade to an internal email service is a configuration change. Scaling down a cluster is a configuration change. Mitigating a DDoS attack by implementing a firewall rule is a configuration change.

“Configuration” in this context is the high level system configuration, and can mean pretty much anything that falls under that.


A large part of this outage's impact fell on those who use Facebook Login as a convenient OAuth option. A good thing for developers to remember if someone asks them to avoid a native login option.


Not a big surprise as it's one of the harder things to test.


Keep calm and blame DevOps!


...but, but, didn't they put the DevOps folks through coding challenges, sorting algos and whiteboard coding before hiring them? I heard that's the number 1 way to ensure uptime at FAANG.

(Configuration changes, that's the source of my sarcasm)


It took a very long time to fix, so I think this was related to their databases, maybe some data corruption.


Seems like a lot of people were forced to be productive yesterday (:


> Seems like a lot of people were forced to be productive yesterday

Well... except for those companies dumb enough to farm out their intra-company communications to facebook.

A friend of mine's law firm had to resort to - OMG - the phone - yesterday.


Wait, the law firm talks about their dealings over Facebook?


Facebook Workplace, presumably (the Slack competitor)


Nope, Slack was working fine...


Ah, the tao of reliability.

Async too.


Must be related to them merging the chat backends for WhatsApp, FB, and Instagram


I suspect this too; by quickly integrating the systems, they make sure they can't be broken up.


Also, 'many people had trouble accessing our apps and services' is some ninja-level gaslighting: https://twitter.com/facebook/status/1106229690069442560


In what way is that gaslighting?


'many people': every product was down, completely, for everyone, afaik.


No. The services were down intermittently, with only certain parts down completely (auth, etc).


People just like to use that word


WAT. A server configuration change? What kind of server configuration can affect presumably thousands of machines replicated across the globe? I'm trying to understand this.


Most failures of this type end up being a cascading resource exhaustion problem propagated by an un- or mis-analyzed feedback or dependency path. It is frankly amazing it doesn't happen more often.

I'm excluding the other common type of long outage, the head-desking "failover didn't work, backups are horked, it'll take tens of hours to restore/cold start" kind.
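For anyone who hasn't watched one of these unfold: here's a toy model of that feedback loop (nothing to do with Facebook's actual stack, all numbers invented). A brief capacity dip from a bad config starts a retry storm, and because failures come back as extra load, the system stays saturated long after the config is reverted.

    # Toy model: a backend with fixed capacity, clients that retry every
    # failure, and no backoff. A transient capacity dip triggers a retry
    # storm that outlives the trigger.
    BASE_LOAD = 900          # organic requests/sec (healthy: below capacity)
    CAPACITY = 1000          # requests/sec the fleet can serve
    RETRIES = 2              # each failed request is retried this many times

    load = BASE_LOAD
    for t in range(12):
        capacity = 600 if 2 <= t <= 4 else CAPACITY   # transient "bad config" window
        served = min(load, capacity)
        failed = load - served
        print(f"t={t:2d}s  load={load:7.0f}  served={served:5.0f}  failed={failed:7.0f}")
        load = BASE_LOAD + failed * RETRIES            # failures come back as retries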


It's hardly unheard of.

https://en.wikipedia.org/wiki/Cascading_failure

An organization Facebook's size isn't gonna be applying configuration changes to one server at a time over SSH, either. A server configuration can easily affect thousands of machines across the globe if it's deployed to them all.


Why did it take so long for Facebook to release the cause of the outage? If they are applying configuration changes at such a large scale, shouldn't it be fairly easy for them to figure out what the cause was?


That's silly. Error rates showed as elevated on https://developers.facebook.com/status/dashboard/ until 11pm Pacific yesterday. The @facebook Twitter account sent out a statement basically within an hour of the start of the next business day.


Possibly because it doesn't matter to us really. The postmortem will be interesting to read if they publish it, but otherwise - it stopped working. Time to explain it to the peanut gallery is better spent dealing with the actual issue.


It takes an admin to bring down a host, but it takes a configuration management system to bring down a site.


"To err is human, but to really foul things up you need a computer."


Bad configuration in a tunnel, IP, BGP etc.

https://www.bleepingcomputer.com/news/technology/facebook-an...


Yes, but those are all reversible relatively quickly. They aren't something that I would think ought to take almost an entire day to resolve.


The side-effects of such a thing might not be as easily reversible.

I've had to sit around waiting a couple hours for a Percona database cluster to re-sync after a major networking whoops, and it only had a few hundred gigabytes of data.


This is fucking bananas. For nearly a decade, Facebook has been at the forefront of innovating how code is deployed at global scale. They presumably have gradual rollouts, automated rollbacks, anomaly detection, not to mention (I assume) loads of organizational safeguards in place to ensure this sort of thing never happens.

Something else happened. This was not a configuration issue. Edit: If it was, I'd expect a post-mortem post-haste.


Google also has all that and now and then their network explodes anyway when they do configuration changes. :)

Certain configurations at a big enough scale are dangerous, just because you could hit a terrible corner case when you rolled out the change on 50% capacity, and lose all of it so fast that your magic automatic rollback is pointless because your infrastructure is burning.
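To put rough numbers on the 50% scenario (all made up, just for the shape of it): if the half of the fleet running the bad config falls over, the surviving half inherits all the traffic and tips over too, well before an automated rollback can re-deploy the dead half.

    # Back-of-envelope with invented numbers: losing half the fleet at peak.
    FLEET = 10_000               # hosts
    PEAK_UTILIZATION = 0.7       # healthy per-host utilization at peak
    ROLLBACK_RATE = 500          # hosts/minute the deploy system can revert

    dead = FLEET // 2            # the half that got the bad config
    alive = FLEET - dead
    utilization = PEAK_UTILIZATION * FLEET / alive
    print(f"surviving hosts now at {utilization:.0%} utilization")      # 140% -> they fall over too
    print(f"rollback needs ~{dead / ROLLBACK_RATE:.0f} min the site doesn't have")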


Or the closed loop of an automatic rollback itself tips over the system and causes the outage.


> This is fucking bananas. For nearly a decade, Facebook has been at the forefront of innovating how code is deployed at global scale. They presumably have gradual rollouts, automated rollbacks, anomaly detection, not to mention (I assume) loads of organizational safeguards in place to ensure this sort of thing never happens.

> Something else happened. This was not a configuration issue. Edit: If it was, I'd expect a post-mortem post-haste.

I've worked on automation projects at a large scale, and Facebook uses an unusual and clever method to deploy their software: BitTorrent.

I can only speculate about why FB went down yesterday. But if you understand that it's being deployed via BT, you can see that there's the potential to have a lengthy rollback window.

I.e., this isn't like uninstalling a single RPM; this could have impacted a significant fraction of their fleet of systems, across multiple datacenters, and if so, the amount of data they'd need to move to roll back could have been tremendous.

https://www.quora.com/Why-does-Facebook-use-BitTorrent-to-co...
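Some rough arithmetic on that point (all numbers invented, just to show the order of magnitude): re-pushing a previous build to a big fleet is a lot of bytes even before you count restarts and cache warm-up.

    # Invented numbers: how much data a fleet-wide rollback might move.
    FLEET = 100_000        # hosts that need the old artifact back
    ARTIFACT_GB = 1.5      # size of one deployable bundle
    WAN_GBPS = 100         # assumed usable inter-datacenter bandwidth (gigabits/s)

    total_gb = FLEET * ARTIFACT_GB
    hours_on_wan = total_gb * 8 / WAN_GBPS / 3600
    print(f"{total_gb:,.0f} GB to redistribute")
    print(f"~{hours_on_wan:.1f} h if every copy had to cross a {WAN_GBPS} Gbit/s WAN link")
    # Peer-to-peer fan-out inside each datacenter helps a lot, but restarting
    # services and re-warming caches on every affected host still takes time.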


I totally agree with all this, and I'm completely open to a valid technical explanation here.

My initial comment is/was admittedly a bit reactive, and more so to the general tone of their explanation than the likelihood of a legitimate technical explanation. This wasn't one service -- every product was down for nearly 24 hours, and their explanation is basically, "uh, yea, it was a...um...configuration issue." The terseness of that explanation, in my opinion, is insulting to the millions of people and businesses that rely on facebook to get information and operate their businesses.


I can't speak beyond 2011, but in the early part of the last decade you're giving Facebook far too much credit.

Facebook went down for most of a day in ~2009 because a new hire mistakenly removed the memcached server config in sitevars.

Facebook went down for several hours in ~2010 because someone configured a cyclic dependency in GateKeeper.

Circa 2011, Facebook's deployment process was very good, but also very very far from infallible.


It would take a computer the size of facebook to perfectly plan how a change will actually affect facebook.

Nobody actually spends the money to do that. They all wing it at some level or another. They're just winging it at a scale vastly more massive than the hundred or thousand computers most people manage.

Source: I worked at Amazon back when managing 30,000 servers was a lot, and I can extrapolate.


Come on, have people not read COMP.RISKS? Computing history is full of high-uptime systems wrecked by a single critically misplaced typo.


How did you determine that Facebook leads this space? I recently read an article about how Facebook distributes RPMs internally and it struck me as the kind of thing an insane person might have invented fifteen years ago. I mean, NFS in front of glusterfs? Also, RPMs???? Talk about bananas.


What's wrong with RPMs?


It has a bunch of well-built and well-supported tooling, including dependency management and versioning /s

Nothing. I quite like using them to deploy applications. If you package them right and build your deployment system correctly, they're not the worst way to do things.


I'm mostly guessing based on what I've read over the years. They've published a hefty corpus of work regarding their deployment infrastructure and their broader code review/quality approach.

Here are a few examples:

- https://code.fb.com/web/rapid-release-at-massive-scale/
- https://www.quora.com/How-does-Facebook-release-deploy-proce...


Something else happened.

?

This was not a configuration issue.

??

I'm mostly guessing

!!!


I mean, yea, I don't work there, so I'm intuiting based on:

my own professional experience

!

deductive reasoning

!!

general...intuition

!!!


“Intuiting” is a funny way to spell “just making shit up”


Could you link the article?



Thanks. What I read there seems sensible to me.


If not a configuration issue, care to speculate what actually happened? There could've been malicious intent, whether on the part of an internal or external actor --and it really could be either-- given the amount of criticism Facebook has drawn in the past few years. But perhaps Facebook's PR statement would've addressed that, had it been the case.


These are basically my thoughts on it: https://twitter.com/ajsharp/status/1106308735142526976

Normally, this level of outage would come with a technical post-mortem. Instead, they issued a super vague statement.


We likely won't even have an internal post-mortem for at least a couple weeks. There's no way you can possibly expect a full breakdown of what went wrong at this scale less than 24 hours after it gets resolved.


Google managed to put one together in a little over a day: https://status.cloud.google.com/incident/storage/19002


Expectations are vastly different for a business-focused hosting platform with SLAs than a social network.



