There are entire systems engineering courses focused on failures that result from a series of small problems that, in the right succession, eventually lead to catastrophic failure. And I think we can say this was a catastrophic failure.
Think about it: first you need a race condition, and that race condition has to produce the unexpected result. That right there, assuming this code has been tested and is frequently used, is probably less than a 10% chance (if it were happening frequently, someone would have noticed). Then you need an engineer to decide they need this particular crash dump. Then you need your credential scanning software (which, again, presumably usually catches this stuff) to fail to detect this particular credential. Now you need an account to be compromised for network access, that user has to have access to this crash dump, and the hacker has to happen upon it and grab it.
But even then, you should be safe, because the key is old and is only good for getting into consumer email accounts... except you have a bug that accepts the old key AND a bug that doesn't reject this signing key for a token accessing corporate email accounts.
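For what it's worth, here's a minimal sketch in Python (invented key IDs, dates, and field names, not Microsoft's actual validation logic) of the two checks that chain of events shows were effectively missing: reject an expired signing key outright, and never accept a consumer-scoped key for an enterprise token.

    from datetime import datetime, timezone

    # Hypothetical key metadata; real token validation is far more involved.
    SIGNING_KEYS = {
        "consumer-key-2016": {
            "expires": datetime(2021, 4, 1, tzinfo=timezone.utc),  # made-up date
            "scope": "consumer",
        },
    }

    def signing_key_is_acceptable(key_id: str, token_audience: str) -> bool:
        key = SIGNING_KEYS.get(key_id)
        if key is None:
            return False
        # Check 1: an expired key must be rejected, full stop.
        if datetime.now(timezone.utc) >= key["expires"]:
            return False
        # Check 2: a consumer key must never validate enterprise tokens.
        if token_audience == "enterprise" and key["scope"] != "enterprise":
            return False
        return True

Either check alone would have blocked the forged corporate-email tokens; both were missing, which is exactly the "small things adding up" point.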
This is a really good systems engineering lesson. Try all you want; eventually enough small things will add up to cause a catastrophic result. The lesson is: to the extent you can, engineer things so that when they blow up, the blast radius is limited.
> that, in the right succession, eventually lead to catastrophic failure.
With the caveat that when it comes to security, the eventual succession isn't a random process: it will be actively targeted and exploited. The attackers are not random processes flipping coins; rather, they get to flip a coin that often lands on "heads", in their favor.
The post-mortem results are presented as if events happened as a random set of unfortunate circumstances: the attacker just happened to work for Microsoft, there just happened to be a race condition, a crash randomly happened, and then the attacker just happened to find the crash dump somewhere. We should even consider that the initial "race condition" bug might have been inserted deliberately. The crash could have been triggered deliberately. An attacker may have been expecting the crash dump to appear in a particular place, ready to grab it. The attacker may have had accomplices.
The other frightening possibility is that the attack surface targeted by persistent threat actors is so large that a breach becomes certain (the law of large numbers): when you own enough accounts, one of them will have the right access rights; when there are enough dumps, one of them will have the key; and so on...
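To put rough numbers on that (purely illustrative, not from the post-mortem): even if any single crash dump is very unlikely to contain a usable secret, an attacker who can sift through enough of them is nearly guaranteed a hit.

    # Illustrative only: probability of at least one "lucky" dump when an
    # attacker can go through many of them.
    p_single = 0.001      # assumed chance a given dump holds a usable secret
    n_dumps = 10_000      # assumed number of dumps the attacker can reach
    p_at_least_one = 1 - (1 - p_single) ** n_dumps
    print(f"{p_at_least_one:.3%}")   # ~99.995%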
> The post-mortem results are presented as if events happened as a random set of unfortunate circumstances: the attacker just happened to work for Microsoft
Does it say that?
> the Storm-0558 actor was able to successfully compromise a Microsoft engineer’s corporate account
"Race condition" is the explanation we all use to tell management why we wrote a stupid bug. Everything is a race condition: "the masker is asynchronous, so the writer starts writing dumps before the masker is set up" sounds like a completely moronic thing to do. Say there is a race condition and people say it's "less than a 10% chance of happening", but what do we know? Maybe it happens on every big crash, and it just doesn't crash that often.
Why isn't it masking before writing to disk? God only knows.
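A toy version of the ordering bug being described (all names invented): if the masker is initialized on a background thread, the dump writer can run before the masking rules exist, and the boring fix is simply to make the writer wait for the masker before touching disk.

    import threading

    class CrashDumpPipeline:
        def __init__(self):
            self.masker_ready = threading.Event()
            self.patterns = []

        def init_masker_async(self):
            # The racy design: masking rules load on a background thread,
            # so there is a window where dumps could go out unredacted.
            threading.Thread(target=self._load_patterns, daemon=True).start()

        def _load_patterns(self):
            self.patterns = ["PRIVATE KEY", "password="]  # invented rules
            self.masker_ready.set()

        def write_dump(self, data: str, path: str) -> None:
            # The fix: don't write anything until the masker is provably
            # ready, instead of racing against its setup.
            self.masker_ready.wait()
            for secret in self.patterns:
                data = data.replace(secret, "***REDACTED***")
            with open(path, "w") as f:
                f.write(data)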
Crash handlers don't know what state the system will be in when they're called. Will we be completely out of memory, so even malloc calls have started failing and no library is safe to call? Are we out of disk space, so we maybe can't write our logs out anyway? Is storage impaired, so we can write but only incredibly slowly? Is there something like a garbage collector that's trying to use 100% of every CPU? Are we crashing because of a fault in our logging system, which we're about to log to, giving us a crash in a crash? Does the system have an alarm or automated restart that won't fire until we exit, which our crash handler delays?
It's pretty common to keep it simple in the crash handler.
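As one illustration of that philosophy (Python's faulthandler here, not whatever Microsoft's dump writer actually does): do all the setup while the process is still healthy, so the crash-time path is reduced to writing on an already-open file descriptor.

    import faulthandler

    # Open the crash log up front, while allocation and normal I/O still work.
    crash_log = open("crash.log", "w")   # path chosen here just as an example

    # faulthandler's handler is deliberately minimal: on a fatal signal it
    # dumps tracebacks to the pre-opened file using async-signal-safe writes,
    # without allocating memory or calling back into complex machinery.
    faulthandler.enable(file=crash_log, all_threads=True)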
Unknown-unknown catastrophic failures like this one have always happened and will continue to happen; that's why we need resilience, which probably means a less centralised worldview.
Which should probably mean that half (or more) of the Western business world relying on Outlook.com is a very wrong thing to have in place. But since the current money incentives are focused neither on resilience nor on breaking up super-centralized Outlook.com-like entities, I'm pretty sure we'll continue to see events like this one well into the future.
Indeed. While reading that I thought to myself “gosh, that’s a lot of needles that got threaded right there”. It feels like the Voyager Grand Tour gravitationally-assisted trajectory… happening by mistake.
A lot of accident analysis reads like this (air accident reports especially tend to read like they've come from a writer who's just discovered foreshadowing). And often there are a few points where it could have been worse. There's a reason for the "Swiss cheese" model of safety. The main thing to remember is that there's not just one needle: it's somewhere between a bundle of spaghetti and water being pushed up against the barriers, and that's before you assume malicious actors.
Yeah, I get that: it’s not a single Voyager, it’s millions of them sent out radially in random directions and at random speeds, and one or two of them just happen to thread the needle and go on the Grand Tour. It’s just an impression. Plus, as you say, there’s the selective element of an intelligence deliberately selecting for an outcome at the end (which, confusingly, is also a beginning).
"reducing your blast radius" is never truly finished, so how do you know what is sufficient, or when the ROI on investing time/money is still positive?