
What basis do you have for saying that? Their DR was likely running on a mirror of their production systems and was similarly impacted by the CrowdStrike outage, so they fell back to Windows servers stuck in the same boot loop.

Keep in mind there was no way to opt out of or delay CS Channel updates.



If your DR system is susceptible to the same faults as your main system, it's not a DR system.

It would be like claiming RAID 1 is a backup.


Or it would be like claiming my backup isn't a backup because both systems run OpenSSH, so a remote code execution vuln there could take down both.

Any DR system has to accept some risks, and those don't necessarily invalidate it in general; they just make it insufficient for some scenarios.

Conversely, if they ran the main system on Windows with CrowdStrike and the DR one on poorly configured Linux with no security software, they probably would have needed more sysadmins, had more trouble maintaining software for both, and been exposed to bugs in both Linux and Windows. So I feel they made the right tradeoff in general.

I’m sure you, who can deride this DR system, have devised your own system such that it is resilient to a meteor destroying the earth.


> I’m sure you, who can deride this DR system, have devised your own system such that it is resilient to a meteor destroying the earth.

That reminds me of one of Corey Quinn's comfortable AWS truths.

https://x.com/QuinnyPig/status/1173371749808783360

> If your DR plan assumes us-east-1 dies unrecoverably, what you're really planning for is 100 square miles of Northern Virginia no longer existing. Good luck with that ad farm in a nuclear wasteland, buddy!


As HN itself discovered a couple of years ago, when a set of same-manufacturer, same-batch disks in both its RAID arrays and its backup server failed within a few hours of one another:

<https://news.ycombinator.com/item?id=32048148>

<https://news.ycombinator.com/item?id=32031243>


One idea: build a DR system and turn it off. Ideally it would be cloneable, but even without that ability, one could test it every few months to make sure it boots quickly enough and then turn it back off. The attack surface of a bunch of computers or instances that are powered down is pretty low.
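
To make that concrete, here's a rough sketch of what such a periodic boot test could look like, assuming the DR fleet is a set of stopped EC2 instances managed with boto3. The instance IDs, health-check URL, and boot-time budget are all placeholders, not anyone's actual setup:

  import time
  import urllib.request

  import boto3

  # Hypothetical DR fleet and health endpoint -- placeholders, not a real setup.
  DR_INSTANCE_IDS = ["i-0123456789abcdef0"]
  HEALTH_URL = "https://dr.example.internal/healthz"
  BOOT_BUDGET_SECONDS = 15 * 60  # how fast DR must come up to count as a pass

  ec2 = boto3.client("ec2")

  def run_dr_boot_drill() -> bool:
      started = time.monotonic()
      ec2.start_instances(InstanceIds=DR_INSTANCE_IDS)
      ec2.get_waiter("instance_running").wait(InstanceIds=DR_INSTANCE_IDS)

      # Poll the app's health endpoint until it answers or the budget runs out.
      healthy = False
      while time.monotonic() - started < BOOT_BUDGET_SECONDS:
          try:
              with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                  if resp.status == 200:
                      healthy = True
                      break
          except OSError:
              time.sleep(15)

      # Power the fleet back down either way; the point is to stay dark between drills.
      ec2.stop_instances(InstanceIds=DR_INSTANCE_IDS)
      return healthy

  if __name__ == "__main__":
      print("DR drill passed" if run_dr_boot_drill() else "DR drill FAILED")

Run it from a scheduler every few months; a failure means the cold DR copy has drifted and needs attention before you actually need it.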


Better yet, alternate between them every month or two.
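
For what it's worth, a scheduled rotation could be as small as a weighted-DNS flip. A rough sketch assuming Route 53 and boto3; the zone ID, record name, and endpoint addresses are made up:

  import boto3

  # Hypothetical zone, record, and endpoints -- placeholders only.
  ZONE_ID = "Z0000000000000000000"
  RECORD_NAME = "app.example.com."
  SIDES = {"primary": "192.0.2.10", "dr": "192.0.2.20"}

  route53 = boto3.client("route53")

  def make_active(active_side: str) -> None:
      """Give the chosen side weight 100 and the other side weight 0."""
      changes = []
      for side, ip in SIDES.items():
          changes.append({
              "Action": "UPSERT",
              "ResourceRecordSet": {
                  "Name": RECORD_NAME,
                  "Type": "A",
                  "SetIdentifier": side,
                  "Weight": 100 if side == active_side else 0,
                  "TTL": 60,
                  "ResourceRecords": [{"Value": ip}],
              },
          })
      route53.change_resource_record_sets(
          HostedZoneId=ZONE_ID,
          ChangeBatch={"Comment": "scheduled prod/DR rotation", "Changes": changes},
      )

  # Run from a scheduler, e.g. make_active("dr") one month, make_active("primary") the next.

Actually carrying real traffic on the other side is the only test that proves the DR copy can run the business, not just boot.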


> Keep in mind there was no way to opt out of or delay CS Channel updates.

Do CS updates somehow work over air gaps? You know, the kind that production systems have to prevent any access to or from external networks? Well... some production systems, anyway.


What's your point? An air-gapped disaster recovery system would be useless. An airline operations application has to connect to a bunch of other external systems to be of any use.



