WAT. A server configuration change? What kind of server configuration can affect presumably thousands of machines replicated across the globe? I'm trying to understand this.
Most failures of this type end up being a cascading resource-exhaustion problem propagated by an un- or mis-analyzed feedback or dependency path (toy sketch below). It is frankly amazing it doesn't happen more often.
I'm excluding the other common type of long outage, the head-desking "failover didn't work, backups are horked, it'll take tens of hours to restore/cold start" kind.
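To make the feedback-path point concrete, here's a toy model (every number in it is made up) of how client retries can turn a brief capacity dip into runaway resource exhaustion:

    # Toy model of a retry feedback loop -- all numbers are hypothetical.
    # A backend serves up to `capacity` requests/sec; anything above that
    # fails, and each failure comes back as retries on top of the next
    # second's organic traffic.

    CAPACITY = 1000          # requests/sec the backend can handle
    BASE_LOAD = 900          # steady organic traffic, comfortably under capacity
    RETRIES_PER_FAILURE = 2  # each failed request gets retried twice

    load = BASE_LOAD
    for t in range(12):
        # Pretend a bad change knocks out half the capacity for two seconds.
        capacity = CAPACITY // 2 if 2 <= t < 4 else CAPACITY
        served = min(load, capacity)
        failed = load - served
        print(f"t={t:2d}s  capacity={capacity:4d}  load={load:6d}  failed={failed:6d}")
        # The feedback path: this second's failures become next second's extra load.
        load = BASE_LOAD + failed * RETRIES_PER_FAILURE

Note that load never comes back down even after capacity recovers at t=4; that's the cascading part, and it's why these things take hours to dig out of.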
An organization Facebook's size isn't gonna be applying configuration changes to one server at a time over SSH, either. A server configuration change can easily affect thousands of machines across the globe if it's deployed to them all.
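Here's roughly what I mean, as a crude sketch (all names and host counts invented); the point is just that an automated push has no natural blast-radius limit unless you build one in:

    # Hypothetical sketch of a fleet-wide config push; names and numbers are made up.
    REGIONS = {"us-east": 2500, "eu-central": 1800, "ap-south": 1200}  # hosts per region

    def apply_to_host(host: str, config: dict) -> None:
        """Stand-in for whatever actually writes the config out to one machine."""
        pass

    def push_config(config: dict) -> int:
        """Apply the same config to every host in every region; return hosts touched."""
        touched = 0
        for region, host_count in REGIONS.items():
            for host_id in range(host_count):
                apply_to_host(f"{region}-{host_id:04d}", config)
                touched += 1
        return touched

    if __name__ == "__main__":
        # With no canary stage or staggered rollout, one bad value lands everywhere
        # in a single pass.
        print(push_config({"some_flag": True}), "hosts updated")

Staged rollouts and canaries are the usual guard rail against exactly this; whether and why they didn't catch it here is the kind of thing a postmortem would have to answer.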
Why did it take so long for Facebook to release the cause of the outage? If they are applying configuration changes at that scale, shouldn't it be fairly easy for them to figure out what the cause was?
That's silly. Error rates showed as elevated on https://developers.facebook.com/status/dashboard/ until 11pm Pacific yesterday. The @facebook Twitter account sent out a statement basically within an hour of the start of the next business day.
Possibly because it doesn't really matter to us. The postmortem will be interesting to read if they publish it, but otherwise: it stopped working. Time spent explaining it to the peanut gallery is better spent dealing with the actual issue.
The side-effects of such a thing might not be as easily reversible.
I've had to sit around waiting a couple hours for a Percona database cluster to re-sync after a major networking whoops, and it only had a few hundred gigabytes of data.