Running BGP in large-scale data centers (fb.com)
193 points by radiator on Oct 5, 2021 | 40 comments


I can guess why this was posted but it probably has nothing to do with today's outage. Routing inside a datacenter is different from routing to the Internet. This post is probably more relevant: https://engineering.fb.com/2017/08/21/networking-traffic/ste...


I don’t know, the paper[1] describes how they set up their systems, so I think it is relevant to what happened.

[1] https://research.fb.com/publications/running-bgp-in-data-cen...


BGP on the internet and inside your datacenter isn't any different, though. It has different rules for iBGP and eBGP neighbors, but it's common to have multiple ASes in your own datacenter, especially at Facebook's scale.


It seems like the two are coupled at Facebook :)


To achieve the goals we’d set, we had to go beyond using BGP as a mere routing protocol.

Uh-oh.


We know failures happen in any large-scale system — hence, our routing design aims to minimize the impact of any potential failures.

Oops!


Petr wrote up this RFC. I believe he started it when he was at Microsoft, but he finished it at Facebook:

https://datatracker.ietf.org/doc/html/rfc7938


Awesome info, I will use this config for all servers ASAP.


So, who pushed out the configuration changes and is this person still employed at FB?


“Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody to hire his experience?”

— Thomas J. Watson, allegedly


And if it was in the hands of a single person who made the technical mistake (and we're humans, we make mistakes), the real question is: how can a single person, mistakenly or deliberately, have the power to update BGP config and take the entire infrastructure offline? This is plain wrong, and it shouldn't be possible in the first place.


And what would you expect? A secret gathering of company elders to commit the change? I guess this change was approved by someone, but in the end someone has to push the keys to perform the operation.


> And what would you expect? A secret gathering of company elders to commit the change?

Push your changes to a simulated network to see what it does before pushing to real hardware:

> VIRL is a network virtualization simulator by Cisco. Similar to the likes of GNS3 and EVE-NG. VIRL will be used to build a test network (shown below), so that we can validate our changes prior to deploying them to production.

* https://www.packetcoders.io/netdevops-ci-cd-with-ansible-git...


At FB/AMZN scale it's really tough to accurately emulate/virtualize the network; these are deep Clos fabrics with LOTS OF STATE. At best you can get a sense of what the protocols might/should do, but at the end of the day the state of the real environment isn't present.

So that micro-loop you created with your new policy and observed in the lab turns into a wave of micro-loops in real life because some random router was slow to update due to load.


You're not wrong, but it would be interesting to see some 'hyperscale' operator publish a paper on how many simulated routers you'd need to get a somewhat accurate test: 0.1%, 1%, or 10% of the network?


Automated testing and validation? Something like what the linked article says they have in place?

"To minimize impact on production traffic while achieving high release velocity for the BGP agent, we built our own testing and incremental deployment framework, consisting of unit testing, emulation, and canary testing. We use a multi-phase deployment pipeline to push changes to agents."

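To make the "multi-phase deployment pipeline" idea in that quote concrete, here is a minimal Python sketch of a phased rollout with a health gate. It is not Facebook's framework: the function `phased_rollout`, the callbacks, the `rsw…` device names, and the 1% / 10% / 100% phases are all hypothetical, and the "network" is just an in-memory dict.

```python
from typing import Callable, Dict, List, Sequence


def phased_rollout(
    devices: Sequence[str],
    phase_percents: Sequence[int],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Push a change in growing phases (hypothetical helper).

    After each phase every updated device is health-checked. On the first
    failure all updated devices are rolled back and an empty list is returned.
    Otherwise the list of updated devices is returned.
    """
    updated: List[str] = []
    for percent in phase_percents:
        # Cumulative number of devices that should be updated after this phase.
        target = max(1, len(devices) * percent // 100)
        for device in devices[len(updated):target]:
            apply_change(device)
            updated.append(device)
        if not all(is_healthy(d) for d in updated):
            for d in reversed(updated):
                rollback(d)
            return []
    return updated


# Toy run: 100 made-up devices, one of which misbehaves on the new config.
devices = [f"rsw{i:03d}" for i in range(100)]
state: Dict[str, str] = {}
bad = {"rsw005"}

result = phased_rollout(
    devices,
    phase_percents=[1, 10, 100],
    apply_change=lambda d: state.__setitem__(d, "new"),
    is_healthy=lambda d: d not in bad,
    rollback=lambda d: state.__setitem__(d, "old"),
)

print(result)       # [] because the rollout aborted
print(len(state))   # 10 devices were touched, all rolled back
```

In this toy run the bad device is caught in the 10% phase, so the other 90 devices never see the change. As the comments above point out, a real fabric carries far more state than a health callback like this can capture.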

Then again, training efficiency isn't the same for every employee. If this particular employee has been trained for this situation in the past and still required training, then this employee might need more training in future. Perhaps it is more cost effective to work with a different employee.


Do you want a workplace culture where people hide their mistakes? Because this is how you create such a culture.


It really depends. I've put someone through a disciplinary before for a really expensive failed change. Not because they made the mistake, but because they failed to do any sanity checks beforehand or post-implementation checks afterwards; they just bashed it in and walked away.

I'd never go this route with a genuine mistake, as they are life's best lessons. But there is a difference between an honest mistake and a mistake that could have been easily avoided had process been followed.


People who talk BGP should be above mundane and cruel human dynamics and culture, because their function is too important. All the BGP-speaking people I know are, and perhaps that's both why this discussion is being held and why the internet is still up and running. "If you don't know what you are doing, don't do it."


But humans, even the smartest ones with great knowledge and experience in their field, make mistakes. Considering such a big outage last occurred 13 years ago, I think it's perfectly normal for an experienced entity (whether a human or, more likely, a team) to make a big mistake in this time window.


Also, don't forget the reviewers who approved the change. And the designers of the whole system/process that enabled this thing to fall through this process.


I think it's the design. It should have been fault-tolerant to possible issues exactly like this, and should not allow the configuration to go live in the first place, in case the person or team pushing the configuration makes a mistake (or a sabotage attempt).


Of course! That person learnt a valuable lesson yesterday, so they cannot be fired.


Not per se. Depending on the type of error made, that person might have become a pariah no one wants to peer with.


*Pariah

Also, if someone making a mistake means no one wants to work with them, that seems like a terrible team. Perhaps you should reassess how you deal with failures if that is your go-to thought on this.


If it was something truly idiotic or malevolent, sure. Otherwise, that would be a really crap culture to work in.


Many people love to work at, let's say, a nuclear power plant or in certain parts of the military, and experience similar pressure not to fail. One could say such work is not for everyone, but to call it "crap culture" is somewhat disrespectful towards the people who do it and the masses who suffer when things go wrong.


Thankfully, here in the real world, nuclear power plants and similar critical infrastructure in most countries operate with a safety culture, where people are encouraged to report their mistakes, learn from them, and help others avoid making the same mistakes, rather than a punitive culture, where people are incentivised to hide mistakes in order to save their job.


BGP is spoken by all cultures, including those who do not know what "safety culture" is, and that's a good thing because it allows everyone to connect to the internet, bridging cultures. When the critical infrastructure in Texas went down, it had everything to do with corporate culture and trying to save jobs, and not with security culture. The engineers were not to blame; what we discuss today is likely a similar case. At nuclear power plants, a single individual can't cause a meltdown, thanks to security culture. For AS owners and engineers this is not always the case, yet all of them know how not to fail, with the exception of some who are ruled by a different working culture than those who talk BGP. One that is not punitive but open, albeit with social hygiene keeping the incompetent and corrupt out. The recent events are not a mistake to be forgiven but a failure to be studied, while not taking risks by eliminating the human source.


I'd go for a beer or four with that dude.


I remember a blog entry about a guy who took Amazon offline for some minutes and was fired, even though the change had been approved by several people.

I can't find it right now, but that's the kind of attitude you would expect from Amazon. I wonder what FB is going to do.


Add (2021) to the title?


Not following. Is it 2022 already?


No. It's 2038


Isn’t that 1901 or 1902?


No, it's 1st Jan 1970


It’s a signed integer, so when it rolls, you roll ~68 years before 0.
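
For anyone who wants to check the arithmetic, here is a minimal Python sketch, assuming a 32-bit signed `time_t` that counts seconds from the 1970 epoch. The helper names `wrap_int32` and `to_utc` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit counter can hold


def wrap_int32(value: int) -> int:
    """Wrap an integer into the signed 32-bit range (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31


def to_utc(seconds: int) -> datetime:
    """Convert seconds since the epoch to a UTC datetime (negative values allowed)."""
    return EPOCH + timedelta(seconds=seconds)


last_good = to_utc(INT32_MAX)
after_rollover = to_utc(wrap_int32(INT32_MAX + 1))

print(last_good.isoformat())       # 2038-01-19T03:14:07+00:00
print(after_rollover.isoformat())  # 1901-12-13T20:45:52+00:00
```

So one tick past the 2038 limit lands in December 1901, roughly 68 years before the epoch, rather than at 1 Jan 1970.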


(Before the fall)


> ERR_CONNECTION_CLOSED

Yeah. Not sure I want to read that "advice" today.



