
WAT. A server configuration change? What kind of server configuration can affect presumably thousands of machines replicated across the globe? I'm trying to understand this.



Most failures of this type end up being a cascading resource exhaustion problem propagated by an un- or mis-analyzed feedback or dependency path. It is frankly amazing it doesn't happen more often.

I'm excluding the other common type of long outage, the head-desking "failover didn't work, backups are horked, it'll take tens of hours to restore/cold start" kind.
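To make that feedback path concrete, here's a toy simulation (a sketch with invented numbers and an invented retry policy, not modeled on any real system): clients retry failed requests, so a brief capacity dip pushes offered load above capacity, and the retries keep it there.

    # Toy sketch of a retry feedback loop; every number here is invented.
    CAPACITY = 1000          # requests the service can serve per tick
    BASE_LOAD = 900          # fresh requests arriving each tick
    RETRIES_PER_FAILURE = 2  # each failed request is retried twice next tick

    def simulate(ticks=8, glitch_tick=2):
        pending_retries = 0
        for t in range(ticks):
            # One transient event (say, a bad config push) halves capacity once.
            capacity = CAPACITY // 2 if t == glitch_tick else CAPACITY
            offered = BASE_LOAD + pending_retries
            failed = max(0, offered - capacity)
            pending_retries = failed * RETRIES_PER_FAILURE
            print(f"tick {t}: offered={offered:5d} failed={failed:5d}")

    simulate()

Even though capacity recovers the very next tick, the retries keep offered load above capacity from then on: the brief glitch cascades into a sustained outage.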


It's hardly unheard of.

https://en.wikipedia.org/wiki/Cascading_failure

An organization Facebook's size isn't gonna be applying configuration changes to one server at a time over SSH, either. A configuration change can easily affect thousands of machines across the globe if it's deployed to all of them.
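As a sketch of what "deployed to them all" looks like (the hostnames, helper functions, and canary logic below are hypothetical, not Facebook's actual tooling), a config management rollout is roughly a loop over the fleet, and the same property that makes it convenient makes a bad change global:

    # Hypothetical fleet-wide config rollout; hosts and helpers are invented.
    FLEET = [f"web{i:04d}.example.com" for i in range(10_000)]

    def apply_config(host, config):
        """Stand-in for the real push mechanism (agent pull, RPC, etc.)."""

    def canary_healthy(hosts):
        """Stand-in for real monitoring; a check that is too weak or too
        fast is exactly how a bad change escapes to the whole fleet."""
        return True

    def deploy(config, hosts=FLEET, canary_fraction=0.01):
        canary = hosts[: max(1, int(len(hosts) * canary_fraction))]
        for host in canary:
            apply_config(host, config)
        if not canary_healthy(canary):
            raise RuntimeError("canary failed; rollout aborted")
        for host in hosts[len(canary):]:
            apply_config(host, config)  # one change, thousands of machines

    deploy({"feature_x": True})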


Why did it take so long for Facebook to release the cause of the outage? If they're applying configuration changes at that scale, shouldn't it be fairly easy for them to figure out what the cause was?


That's silly. Error rates showed as elevated on https://developers.facebook.com/status/dashboard/ until 11pm Pacific yesterday, and the @facebook Twitter account sent out a statement within about an hour of the start of the next business day.


Possibly because it doesn't really matter to us. The postmortem will be interesting to read if they publish it, but otherwise: it stopped working. Time spent explaining it to the peanut gallery is better spent dealing with the actual issue.


It takes an admin to bring down a host, but it takes a configuration management system to bring down a site.


"To err is human, but to really foul things up you need a computer."


Bad configuration in a tunnel, IP routing, BGP, etc.

https://www.bleepingcomputer.com/news/technology/facebook-an...


Yes, but those are all relatively quick to reverse. I wouldn't expect any of them to take almost an entire day to resolve.


The side-effects of such a thing might not be as easily reversible.

I've had to sit around waiting a couple hours for a Percona database cluster to re-sync after a major networking whoops, and it only had a few hundred gigabytes of data.
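A back-of-envelope with assumed numbers shows why that's about right: a Galera-style state snapshot transfer re-ships the whole dataset to the rejoining node, so re-sync time is roughly data size divided by effective throughput.

    # Back-of-envelope only; both inputs are assumptions, not measurements.
    data_gb = 300          # "a few hundred gigabytes"
    throughput_mb_s = 50   # assumed effective throughput during the transfer
    hours = data_gb * 1024 / throughput_mb_s / 3600
    print(f"~{hours:.1f} hours")  # ~1.7 hours at these rates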



