There are entire systems engineering courses focused on failures that result from a series of small problems that, in the right succession, eventually lead to catastrophic failure. And I think we can say this was a catastrophic failure.
Think about it: first you need a race condition, and that race condition has to produce the unexpected result. That right there, assuming this code has been tested and is frequently used, is probably less than a 10% chance (if it were happening frequently, someone would have noticed). Then you need an engineer to decide they need this particular crash dump. Then you need your credential scanning software (which, again, presumably usually catches this stuff) to fail to detect this particular credential. Now you need an account to be compromised for network access, that user has to have access to this crash dump, and the hacker has to happen upon it and grab it.
But even then, you should be safe, because the key is old and is only good for getting into consumer email accounts... except you have a bug that accepts the old key AND a bug that doesn't reject this signing key for a token accessing corporate email accounts.
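For what it's worth, here's a minimal sketch in Python (invented key IDs, dates, and field names, not Microsoft's actual validation logic) of the two checks that chain of events shows were effectively missing: reject an expired signing key outright, and never accept a consumer-scoped key for an enterprise token.

    from datetime import datetime, timezone

    # Hypothetical key metadata; real token validation is far more involved.
    SIGNING_KEYS = {
        "consumer-key-2016": {
            "expires": datetime(2021, 4, 1, tzinfo=timezone.utc),  # made-up date
            "scope": "consumer",
        },
    }

    def signing_key_is_acceptable(key_id: str, token_audience: str) -> bool:
        key = SIGNING_KEYS.get(key_id)
        if key is None:
            return False
        # Check 1: an expired key must be rejected, full stop.
        if datetime.now(timezone.utc) >= key["expires"]:
            return False
        # Check 2: a consumer key must never validate enterprise tokens.
        if token_audience == "enterprise" and key["scope"] != "enterprise":
            return False
        return True

Either check alone would have blocked the forged corporate-email tokens; both were missing, which is exactly the "small things adding up" point.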
This is a really good systems engineering lesson. Try all you want; eventually enough small things will add up to cause a catastrophic result. The lesson is: to the extent you can, engineer things so that when they blow up, the blast radius is limited.
> that, in the right succession, eventually lead to catastrophic failure.
With the caveat that when it comes to security, the eventual succession isn't a random process: it will be actively targeted and exploited. The attackers are not random processes flipping coins; rather, they get to flip a coin that often lands on "heads", in their favor.
The post-mortem results are presented as if events happened as a random set of unfortunate circumstances: the attacker just happened to work for Microsoft, there just happened to be a race condition, a crash randomly happened, and then the attacker just happened to find the crash dump somewhere. We should even consider that the initial "race condition" bug might have been inserted deliberately. The crash could have been triggered deliberately. An attacker may have been expecting the crash dump to appear in a particular place, ready to grab it. The attacker may have had accomplices.
The other frightening possibility is that the attack surface targeted by persistent threat actors is so large that a breach becomes certain (the law of large numbers): when you own enough accounts, one of them will have the right access rights; when there are enough dumps, one of them will have the key; and so on...
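To put rough numbers on that (purely illustrative, not from the post-mortem): even if any single crash dump is very unlikely to contain a usable secret, an attacker who can sift through enough of them is nearly guaranteed a hit.

    # Illustrative only: probability of at least one "lucky" dump when an
    # attacker can go through many of them.
    p_single = 0.001      # assumed chance a given dump holds a usable secret
    n_dumps = 10_000      # assumed number of dumps the attacker can reach
    p_at_least_one = 1 - (1 - p_single) ** n_dumps
    print(f"{p_at_least_one:.3%}")   # ~99.995%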
> The post-mortem results are presented as if events happened as a random set of unfortunate circumstances: the attacker just happened to work for Microsoft
Does it say that?
> the Storm-0558 actor was able to successfully compromise a Microsoft engineer’s corporate account
"Race condition" is the explanation we all use to tell management why we wrote a stupid bug. Everything is a race condition: "the masker is asynchronous, so the writer starts writing dumps before the masker is set up" sounds like a completely moronic thing to do. Say there is a race condition and people say it's "less than a 10% chance of happening", but what do we know? Maybe it happens on every big crash, and it just doesn't crash that often.
Why isn't it masking before writing to disk? God only knows.
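A toy version of the ordering bug being described (all names invented): if the masker is initialized on a background thread, the dump writer can run before the masking rules exist, and the boring fix is simply to make the writer wait for the masker before touching disk.

    import threading

    class CrashDumpPipeline:
        def __init__(self):
            self.masker_ready = threading.Event()
            self.patterns = []

        def init_masker_async(self):
            # The racy design: masking rules load on a background thread,
            # so there is a window where dumps could go out unredacted.
            threading.Thread(target=self._load_patterns, daemon=True).start()

        def _load_patterns(self):
            self.patterns = ["PRIVATE KEY", "password="]  # invented rules
            self.masker_ready.set()

        def write_dump(self, data: str, path: str) -> None:
            # The fix: don't write anything until the masker is provably
            # ready, instead of racing against its setup.
            self.masker_ready.wait()
            for secret in self.patterns:
                data = data.replace(secret, "***REDACTED***")
            with open(path, "w") as f:
                f.write(data)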
Crash handlers don't know what state the system will be in when they're called. Will we be completely out of memory, so even malloc calls have started failing and no library is safe to call? Are we out of disk space, so we maybe can't write our logs out anyway? Is storage impaired, so we can write but only incredibly slowly? Is there something like a garbage collector that's trying to use 100% of every CPU? Are we crashing because of a fault in our logging system, which we're about to log to, giving us a crash in a crash? Does the system have an alarm or automated restart that won't fire until we exit, which our crash handler delays?
It's pretty common to keep it simple in the crash handler.
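As one illustration of that philosophy (Python's faulthandler here, not whatever Microsoft's dump writer actually does): do all the setup while the process is still healthy, so the crash-time path is reduced to writing on an already-open file descriptor.

    import faulthandler

    # Open the crash log up front, while allocation and normal I/O still work.
    crash_log = open("crash.log", "w")   # path chosen here just as an example

    # faulthandler's handler is deliberately minimal: on a fatal signal it
    # dumps tracebacks to the pre-opened file using async-signal-safe writes,
    # without allocating memory or calling back into complex machinery.
    faulthandler.enable(file=crash_log, all_threads=True)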
Unknown-unknown catastrophic failures like this one have always happened and will continue to happen; that's why we need resilience, which probably means a less centralised worldview.
Which should probably mean that half (or more) of the Western business world relying on Outlook.com is a very wrong thing to have in place. But since the current money incentives are focused neither on resilience nor on breaking up super-centralized Outlook.com-like entities, I'm pretty sure we'll continue to see events like this one well into the future.
Indeed. While reading that I thought to myself “gosh, that’s a lot of needles that got threaded right there”. It feels like the Voyager Grand Tour gravitationally-assisted trajectory… happening by mistake.
A lot of accident analysis reads like this (air accident reports especially tend to read like they've come from a writer who's just discovered foreshadowing). And often there are a few points where it could have been worse. There's a reason for the "Swiss cheese" model of safety. The main thing to remember is that there's not just one needle: it's somewhere between a bundle of spaghetti and water being pushed up against the barriers, and that's before you assume malicious actors.
Yeah, I get that: it’s not a single Voyager, it’s millions of them sent out radially in random directions and at random speeds, and one or two of them just happen to thread the needle and go on the Grand Tour. It’s just an impression. Plus, as you say, there’s the selective element of an intelligence deliberately selecting for an outcome at the end (which, confusingly, is also a beginning).
"reducing your blast radius" is never truly finished, so how do you know what is sufficient, or when the ROI on investing time/money is still positive?