I can guess why this was posted but it probably has nothing to do with today's outage. Routing inside a datacenter is different from routing to the Internet. This post is probably more relevant: https://engineering.fb.com/2017/08/21/networking-traffic/ste...
BGP on the internet and inside your datacenter isn't any different though. It has different rules for iBGP and eBGP neighbors, but it's common to have multiple ASes in your own datacenter, especially at Facebook's scale.
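To make the iBGP/eBGP distinction concrete, here is a minimal Python sketch. A session is iBGP when both peers share an AS number and eBGP otherwise, and by default a route learned over iBGP is not re-advertised to another iBGP peer. The function names and the private-range ASNs are made up for illustration, and route reflectors and confederations are deliberately ignored.

```python
def session_type(local_asn: int, peer_asn: int) -> str:
    """Classify a BGP session by comparing the two AS numbers."""
    return "iBGP" if local_asn == peer_asn else "eBGP"


def may_readvertise(learned_via: str, send_via: str) -> bool:
    """Default rule: a route learned from an iBGP peer is not passed on
    to another iBGP peer. Assumes no route reflectors or confederations."""
    return not (learned_via == "iBGP" and send_via == "iBGP")


# Hypothetical private ASNs, not taken from any real network.
SPINE_ASN = 65001
RACK_ASN = 65002

print(session_type(SPINE_ASN, SPINE_ASN))  # same AS on both ends
print(session_type(SPINE_ASN, RACK_ASN))   # different AS on each end
```

Running multiple ASes inside one datacenter means most fabric sessions fall on the eBGP side of this check, even though no traffic leaves the building.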
“Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody to hire his experience?”
And if it was in the hands of a single person who made the technical mistake (and we're humans, we make mistakes), the real question is: how can a single person, mistakenly or deliberately, have the power to update BGP config to take the entire infrastructure offline? This is plain wrong, and it shouldn't be possible in the first place.
And what would you expect? A secret gathering of company elders to commit the change? I guess this change was approved by someone, but in the end someone has to press the keys to perform the operation.
> And what would you expect? A secret gathering of company elders to commit the change?
Push your changes to a simulated network to see what it does before pushing to real hardware:
> VIRL is a network virtualization simulator by Cisco. Similar to the likes of GNS3 and EVE-NG. VIRL will be used to build a test network (shown below), so that we can validate our changes prior to deploying them to production.
At FB/AMZN scale it's really tough to accurately emulate/virtualize the network; these are deep Clos fabrics with LOTS OF STATE. At best you can get a sense of what protocols might/should do, but at the end of the day the state of the real environment isn't present.
So that micro-loop you created with your new policy and observed in the lab turns into a wave of micro-loops in real life because some random router was slow to update due to load.
You're not wrong, but it would be interesting to see some 'hyperscale' operator publish a paper on how many simulated routers you'd need to get a somewhat accurate test: 0.1%, 1%, or 10% of a network?
Automated testing and validation? Something like what the linked article says they have in place?
"To minimize impact on production traffic while achieving high release velocity for the BGP agent, we built our own testing and incremental deployment framework, consisting of unit testing, emulation, and canary testing. We use a multi-phase deployment pipeline to push changes to agents."
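A rough Python sketch of what a gated, multi-phase rollout like the one quoted above could look like. This is not Facebook's framework: the stage names follow the quote, but every function, check, and change string here is a hypothetical stand-in. The only point is that a change stops at the first failing stage and never reaches the later ones.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a check that returns True when the change passes.
Stage = Tuple[str, Callable[[str], bool]]


def run_pipeline(change: str, stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (all_passed, names_of_stages_that_passed).
    """
    passed: List[str] = []
    for name, check in stages:
        if not check(change):
            return False, passed
        passed.append(name)
    return True, passed


# Hypothetical checks: real ones would run tests, an emulated fabric,
# and a small slice of production respectively.
EXAMPLE_STAGES: List[Stage] = [
    ("unit", lambda change: bool(change.strip())),
    ("emulation", lambda change: "withdraw-all" not in change),
    ("canary", lambda change: True),
]

ok, reached = run_pipeline("withdraw-all backbone routes", EXAMPLE_STAGES)
print(ok, reached)
```

With these stand-in checks, the dangerous change is rejected in emulation and the canary stage is never run.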
Then again, training efficiency isn't the same for every employee. If this particular employee has been trained for this situation in the past and still required training, then this employee might need more training in future. Perhaps it is more cost effective to work with a different employee.
It really depends. I've put someone through a disciplinary before for a really expensive failed change. Not because they made the mistake, but because they failed to do any sanity checks beforehand or any post-implementation checks. They just bashed it in and walked away.
I'd never go this route with a genuine mistake, as they are life's best lessons. But there is a difference between an honest mistake and a mistake that could have been easily avoided had process been followed.
People who talk BGP should be above mundane and cruel human dynamics and culture because their function is too important.
All the BGP-speaking people I know are, and perhaps that's both why this discussion is being held and why the internet is still up and running.
"If you don't know what you are doing, don't do it"
But humans, even the smartest ones with great knowledge and experience in their field, make mistakes. Considering such a big outage last occurred 13 years ago, I think it's perfectly normal for an experienced entity (whether a human or more likely a team) to make a big mistake in this time window.
Also, don't forget the reviewers who approved the change. And the designers of the whole system/process that enabled this thing to fall through this process.
I think it's the design. It should have been fault-tolerant to issues exactly like this, and it should not have allowed the configuration to go live in the first place, whether the person or team pushing it made a mistake or attempted sabotage.
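One way to picture that kind of design guard is a pre-push check that refuses any change withdrawing too large a share of the prefixes currently being advertised. The Python below is only a sketch under assumptions: the function name, the 10% threshold, and the documentation-range prefixes are all invented for illustration, and a real system would need far more than this.

```python
from typing import Set


def safe_to_push(current: Set[str], proposed: Set[str],
                 max_withdraw_fraction: float = 0.1) -> bool:
    """Return False if the change would withdraw more than the allowed
    fraction of currently advertised prefixes (threshold is an assumption)."""
    if not current:
        return True  # nothing is advertised, so nothing can be withdrawn
    withdrawn = current - proposed
    return len(withdrawn) / len(current) <= max_withdraw_fraction


# Documentation-only prefixes, not real announcements.
advertised = {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"}

print(safe_to_push(advertised, advertised))  # no prefixes withdrawn
print(safe_to_push(advertised, set()))       # every prefix withdrawn
```

A check like this does not care whether the bad change came from a typo or from sabotage; it only looks at the blast radius.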
Also, if someone making a mistake means no one wants to work with them, that seems like a terrible team. Perhaps you should reassess how you deal with failures if that is your go-to thought on this.
Many people love to work at, let's say, a nuclear power plant or in certain parts of the military, and experience similar pressure not to fail. One could say such work is not for everyone, but to state it's "crap culture" is somewhat disrespectful towards the people who do it and the masses who suffer when things go wrong.
Thankfully, here in the real world, nuclear power plants and similar critical infrastructure in most countries operate with a safety culture, where people are encouraged to report their mistakes, learn from them, and help others avoid making the same mistakes, rather than a punitive culture, where people are incentivised to hide mistakes in order to save their job.
BGP is spoken by all cultures, including those who do not know what "safety culture" is, and that's a good thing because it allows everyone to connect to the internet, bridging cultures.
When the critical infrastructure in Texas went down, it had everything to do with corporate culture, trying to save jobs, and not with security culture. The engineers were not to blame; what we discuss today is likely a similar case.
At nuclear power plants, a single individual can't cause a meltdown due to security culture.
In the case of AS owners/engineers this is not always the case, yet all of them know how not to fail, with the exception of some who are ruled by a different working culture than those who talk BGP. One that is not punitive but open, albeit with social hygiene keeping the incompetent and corrupt out.
The recent events are not a mistake to be forgiven but a failure to be studied while not taking risks by eliminating the human source.