This is fucking bananas. For nearly a decade, Facebook has been at the forefront... | Hacker News

Hacker News new | past | comments | ask | show | jobs | submit

login

ajsharp on March 14, 2019 | parent | context | favorite | on: Facebook blames a server configuration change for ...

This is fucking bananas. For nearly a decade, Facebook has been at the forefront of innovating how code is deployed at global scale. They presumably have gradual rollouts, automated rollbacks, anomaly detection, not to mention (I assume) loads of organizational safeguards in place to ensure this sort of thing never happens.

Something else happened. This was not a configuration issue. Edit: If it was, I'd expect a post-mortem post-haste.

chronid on March 14, 2019 | [–]

Google also has all that and now and then their network explodes anyway when they do configuration changes. :)

Certain configurations at a big enough scale are dangerous, just because you could hit a terrible corner case when you rolled out the change on 50% capacity, and lose all of it so fast that your magic automatic rollback is pointless because your infrastructure is burning.

groestl on March 14, 2019 | | [–]

Or the closed loop of an automatic rollback itself tips over the system and causes the outage.

johnvanommen on March 14, 2019 | | [–]

> This is fucking bananas. For nearly a decade, Facebook has been at the forefront of innovating how code is deployed at global scale. They presumably have gradual rollouts, automated rollbacks, anomaly detection, not to mention (I assume) loads of organizational safeguards in place to ensure this sort of thing never happens.

> Something else happened. This was not a configuration issue. Edit: If it was, I'd expect a post-mortem post-haste.

I've worked on automation projects at a large scale, and Facebook uses an unusual and clever method to deploy their software: BitTorrent.

I can only speculate about why FB went down yesterday. But if you understand that it's being deployed via BT, you can see that there's the potential to have a lengthy rollback window.

IE, this isn't like uninstalling a single RPM; this could have impacted a significant fraction of their fleet of systems, across multiple datacenters, and if so, the amount of data they'd need to move to rollback could have been tremendous.

https://www.quora.com/Why-does-Facebook-use-BitTorrent-to-co...

ajsharp on March 14, 2019 | | [–]

I totally agree with all this, and I'm completely open to a valid technical explanation here.

My initial comment is/was admittedly a bit reactive, and more so to the general tone of their explanation than the likelihood of a legitimate technical explanation. This wasn't one service -- every product was down for nearly 24 hours, and their explanation is basically, "uh, yea, it was a...um...configuration issue." The terseness of that explanation, in my opinion, is insulting to the millions of people and businesses that rely on facebook to get information and operate their businesses.

epriest on March 14, 2019 | | [–]

I can't speak beyond 2011, but in the early part of the last decade you're giving Facebook far too much credit.

Facebook went down for most of a day in ~2009 because a new hire mistakenly removed the memcached server config in sitevars.

Facebook went down for several hours in ~2010 because someone configured a cyclic dependency in GateKeeper.

Circa 2011, Facebook's deployment process was very good, but also very very far from infallible.

lamontcg on March 14, 2019 | | [–]

It would take a computer the size of facebook to perfectly plan how a change will actually affect facebook.

Nobody actually spends the money to do that. They all wing it at some level or another. They're just winging it at a scale vastly more massive than the hundred or thousand computers most people manage.

Source: I worked at Amazon back when managing 30,000 servers was a lot, and I can extrapolate.

pjc50 on March 14, 2019 | | [–]

Come on, have people not read COMP.RISKS? Computing history is full of high-uptime systems wrecked by a single critically misplaced typo.

shereadsthenews on March 14, 2019 | | [–]

How did you determine that Facebook leads this space? I recently read an article about how Facebook distributes RPMs internally and it struck me as the kind of thing an insane person might have invented fifteen years ago. I mean, NFS in front of glusterfs? Also, RPMs???? Talk about bananas.

gmmeyer on March 14, 2019 | | [–]

What's wrong with RPMs?

lykr0n on March 14, 2019 | | | [–]

It has a bunch of well built and supported tooling, including dependency management, dependencies, and versioning /s

Nothing. I quite like using them to deploy applications. If you package them right and build your deployment system correctly, they're not the worst way to do things.

ajsharp on March 14, 2019 | | | [–]

I'm mostly guessing based on what I've read over the years. They've published a hefty corpus of work regarding their deployment infrastructure and greater code review/quality approach.

Here's a few examples: - https://code.fb.com/web/rapid-release-at-massive-scale/. - https://www.quora.com/How-does-Facebook-release-deploy-proce...

WillPostForFood on March 14, 2019 | | | [–]

Something else happened.

?

This was not a configuration issue.

??

I'm mostly guessing

!!!

ajsharp on March 14, 2019 | | | [–]

I mean, yea, I don't work there, so I'm intuiting based on:

my own professional experience

!

deductive reasoning

!!

general...intuition

!!!

packetslave on March 15, 2019 | | | [–]

“Intuiting” is a funny way to spell “just making shit up”

muxator on March 14, 2019 | | | [–]

Could you link the article?

shereadsthenews on March 15, 2019 | | | [–]

https://www.slideshare.net/mobile/PhilDibowitz/centos-at-fac...

muxator on March 15, 2019 | | | [–]

Thanks. What I read there seems sensibile to me.

JMTQp8lwXL on March 14, 2019 | [–]

If not a configuration issue, care to speculate what actually happened? There could've been malicious intent, whether on the part of an internal or external actor --and it really could be either-- given the amount of criticism Facebook has drawn in the past few years. But perhaps Facebook's PR statement would've addressed that, had it been the case.

ajsharp on March 14, 2019 | [–]

These are basically my thoughts on it: https://twitter.com/ajsharp/status/1106308735142526976.

Normally, this level of outage would come with a technical post-mortem. Instead, they issue a super vague statement.

amethyst on March 14, 2019 | | [–]

We likely won't even have an internal post-mortem for at least a couple weeks. There's no way you can possibly expect a full breakdown of what went wrong at this scale less than 24 hours after it gets resolved.

ajsharp on March 15, 2019 | | [–]

Google managed to put one together in a little over a day: https://status.cloud.google.com/incident/storage/19002.

ceejayoz on March 15, 2019 | | [–]

Expectations are vastly different for a business-focused hosting platform with SLAs than a social network.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact