WAT. A server configuration change? What kind of server configuration can affect presumably thousands of machines replicated across the globe? I'm trying to understand this.
Most failures of this type end up being a cascading resource-exhaustion problem propagated by an un- or mis-analyzed feedback or dependency path (toy sketch below). It is frankly amazing it doesn't happen more often.
I'm excluding the other common type of long outage, the head-desking "failover didn't work, backups are horked, it'll take tens of hours to restore/cold start" kind.
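To make the feedback-path point concrete, here's a toy model (every number in it is made up) of how client retries can turn a brief capacity dip into runaway resource exhaustion:

    # Toy model of a retry feedback loop -- all numbers are hypothetical.
    # A backend serves up to `capacity` requests/sec; anything above that
    # fails, and each failure comes back as retries on top of the next
    # second's organic traffic.

    CAPACITY = 1000          # requests/sec the backend can handle
    BASE_LOAD = 900          # steady organic traffic, comfortably under capacity
    RETRIES_PER_FAILURE = 2  # each failed request gets retried twice

    load = BASE_LOAD
    for t in range(12):
        # Pretend a bad change knocks out half the capacity for two seconds.
        capacity = CAPACITY // 2 if 2 <= t < 4 else CAPACITY
        served = min(load, capacity)
        failed = load - served
        print(f"t={t:2d}s  capacity={capacity:4d}  load={load:6d}  failed={failed:6d}")
        # The feedback path: this second's failures become next second's extra load.
        load = BASE_LOAD + failed * RETRIES_PER_FAILURE

Note that load never comes back down even after capacity recovers at t=4; that's the cascading part, and it's why these things take hours to dig out of.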
An organization Facebook's size isn't gonna be applying configuration changes to one server at a time over SSH, either. A server configuration change can easily affect thousands of machines across the globe if it's deployed to them all.
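Here's roughly what I mean, as a crude sketch (all names and host counts invented); the point is just that an automated push has no natural blast-radius limit unless you build one in:

    # Hypothetical sketch of a fleet-wide config push; names and numbers are made up.
    REGIONS = {"us-east": 2500, "eu-central": 1800, "ap-south": 1200}  # hosts per region

    def apply_to_host(host: str, config: dict) -> None:
        """Stand-in for whatever actually writes the config out to one machine."""
        pass

    def push_config(config: dict) -> int:
        """Apply the same config to every host in every region; return hosts touched."""
        touched = 0
        for region, host_count in REGIONS.items():
            for host_id in range(host_count):
                apply_to_host(f"{region}-{host_id:04d}", config)
                touched += 1
        return touched

    if __name__ == "__main__":
        # With no canary stage or staggered rollout, one bad value lands everywhere
        # in a single pass.
        print(push_config({"some_flag": True}), "hosts updated")

Staged rollouts and canaries are the usual guard rail against exactly this; whether and why they didn't catch it here is the kind of thing a postmortem would have to answer.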
Why did it take so long for Facebook to release the cause of the outage? If they are applying configuration changes at that scale, shouldn't it be fairly easy for them to figure out what the cause was?
That's silly. Error rates showed as elevated on https://developers.facebook.com/status/dashboard/ until 11pm Pacific yesterday. The @facebook Twitter account sent out a statement basically within an hour of the start of the next business day.
Possibly because it doesn't really matter to us. The postmortem will be interesting to read if they publish it, but otherwise: it stopped working. Time spent explaining it to the peanut gallery is better spent dealing with the actual issue.
The side-effects of such a thing might not be as easily reversible.
I've had to sit around waiting a couple hours for a Percona database cluster to re-sync after a major networking whoops, and it only had a few hundred gigabytes of data.