Perhaps - but if I made a list of all of the things your company *should* be doi...

jjav · on July 19, 2024

> all of the things your company should be doing and didn't

Processes need to match the potential risk.

If your company is doing some inconsequential social app or whatever, then sure, go ahead and move fast and break things if that's how you roll.

If you are a company, let's call them Crowdstrike, that has access to push root privileged code to a significant percentage of all machine on the internet, the minimum quality bar is vastly higher.

For this type of code, I would expect a comprehensive test suite that covers everything and a fleet of QA machines representing every possible combination of supported hardware and software (yes, possibly thousands of machines). A build has to pass that and then get rolled into dogfooding usage internally for a while. And then very slowly gets pushed to customers, with monitoring that nothing seems to be regressing.

Anything short of that is highly irresponsible given the access and risk the Crowdstrike code represents.

Denvercoder9 · on July 19, 2024

> A build has to pass that and then get rolled into dogfooding usage internally for a while. And then very slowly gets pushed to customers, with monitoring that nothing seems to be regressing.

That doesn't work in the business they're in. They need to roll out definition updates quickly. Their clients won't be happy if they get compromised while CrowdStrike was still doing the dogfooding or phased rollout of the update that would've prevented it.

jjav · on July 20, 2024

> That doesn't work in the business they're in. They need to roll out definition updates quickly.

Well clearly we have incontrovertible evidence now (if it was needed) that YOLO-pushing insufficiently tested updates to everyone at once does not work either.

This is being called in many places (righfully) the largest IT outage in history. How many billions will be the cost? How many people died?

So yes, clearly not the correct way to operate.

nikau · on July 21, 2024

I mean this isn't some bizarre edge case bug, its a file full of nulls and a broken parser that blindly imported it.

Its negligence.

EvanAnderson · on July 19, 2024

A company deploying kernel-mode code that can render huge numbers of machines unusable should have done better. It's one of those "you had one job" kind of situations.

They would be a gigantic target for malware. Imagine pwning a CDN to pwn millions of client computers. The CDN being malicious would be a major threat.

soraminazuki · on July 19, 2024

Oh, they have one job for sure. Selling compliance. All else isn't their job, including actual security.

Antiviruses are security cosplay that works by using a combination of bug-riddled custom kernel drivers and unsandboxed C++ parsers running with the highest level of privileges to tamper with every bit of data it can get its hands on. They violate every security common sense. They also won't even hesitate to disable or delay rollouts of actual security mechanisms built into browsers and OSes if it gets in the way.

The software industry needs to call out this scam and put them out of business sooner than later. This has been the case for at least a decade or two and it's sad that nothing has changed.

https://ia801200.us.archive.org/1/items/SyScanArchiveInfocon... https://robert.ocallahan.org/2017/01/disable-your-antivirus-...

heraldgeezer · on July 19, 2024

Nope, I have seen software like Crowdstrike, S1, Huntress and Defender E5 stop active ransomware attacks.

jjav · on July 20, 2024

> Nope, I have seen software like Crowdstrike, S1, Huntress and Defender E5 stop active ransomware attacks.

Yes, occasionally they do. This is not an either-or situation.

While they do catch and stop attacks, it is also true that crowdstrike and its ilk are root-level backdoors into the system that bypass all protections and thus will cause problems sometimes.

soraminazuki · on July 19, 2024

That anecdote doesn't justify installing gaping security holes into the kernel with those tools. Actual security requires knowledge, good practice, and good engineering. Antiviruses can never be a substitute.

lytedev · on July 19, 2024

You seem security-wise, so surely you can understand that in some (many?) cases, antivirus is totally acceptable given the threat model. If you are wanting to keep the script kiddies from metasploiting your ordinary finance employees, it's certainly worth the tradeoff for some organizations, no? It's but one tool with its tradeoffs like any tool.

TeMPOraL · on July 20, 2024

That's like pointing at the occasional petty theft and mugging, and using it to justify establishing an extraordinary secret police to run the entire country. It's stupid, and if you do it anyway, it's obvious you had other reasons.

Antivirus software is almost universally malware. Enterprise endpoint "protection" software like CrowdStrike is worse, it's an aggressive malware and a backdoor controlled by a third party, whose main selling points are compliance and surveillance. Installing it is a lot like outsourcing your secret police to a consulting company. No surprise, everything looks pretty early on, but two weeks in, smart consultants rotate out to bring in new customers, and bad ones rotate in to run the show.

Yeah, that's definitely a good tradeoff against script kiddies metasploiting your ordinary finance employees. Wonder if it'll look as good when loss of life caused by CrowdStrike this weekend gets tallied up.

jmb99 · on July 19, 2024

How many attacks have they stopped that would have DoS’d a significant fraction of the world’s Windows machines roughly instantly?

The ends don’t justify the means.

cduzz · on July 19, 2024

Which is their "One Job" ?

Options include:

1. protected the systems always work even if things are messed up

2. protected systems are always protected even when things are messed up

The two failure modes are exclusive; ideally you let the end user decide what to do if the protection mechanism is itself unstable.

One could suggest "the system must always work" but that's ignoring that sometimes things don't go to plan.

None of the systems in boot loops were p0wned by known exploits while they were boot looping. As far as we know anyhow.

(edited to add the obvious default of "just make a working system" which is of course both a given and not going to happen)

jmb99 · on July 19, 2024

The failure mode here was a page fault due to an invalid definition file. That (likely) means the definition file was being used as-is without any validation, and pointers were being dereferenced based on that non-validated definition file. That means this software is likely vulnerable to some kind of kernel-level RCE through its definition files, and is (clearly) 100% vulnerable to DoS attacks through invalid definition files. Who knows how long this has been the case.

This isn’t a matter of “either your system is protected all the time, even if that means it’s down, or your system will remain up but might be unprotected.” It’s “your system is vulnerable to kernel-level exploits because of your AV software’s inability to validate definition files.”

The failure mode here should absolutely not be to soft-brick the machine. You can have either of your choices configurable by the sysadmin; definition file fails to validate? No problem, the endpoint has its network access blocked until the problem can be resolved. Or, it can revert to a known-good definition, if that’s within the organization’s risk tolerance.

But that would require competent engineering, which clearly was not going on here.

cduzz · on July 21, 2024

Well.... yeah, incuriously shoving unsigned data from elsewhere into ring 0 and executing it is malpractice.

EvanAnderson · on July 20, 2024

Their "one job" is to not make things worse than the default. DoS'ing the OS with an unhandled kernel mode exception would be not doing that job.

How about a different analogy: First do no harm.

bn-l · on July 19, 2024

I think in this case it’s reasonable for us to expect that they are doing what they should be doing.