but checklists work (you might have heard about surgeons leaving medical tools in patients, and checklists eliminating this problem, seemingly the dumbest simplest technology, yet it's very powerful compared to the default of nothing)
of course the quality of answers matters, but that's on the environment (auditors, regulators, industry best practices, trade groups, client expectations). see the absolute total shambles of the state of IT sec in South Korea, the 1000-year-old COBOL systems in finance/insurance, the reservation systems in air travel, the sorry state of TLS before Let's Encrypt, SMTP servers (how SPF, DKIM, DMARC, ARC and MTA-STS are all needed to signal that you really prefer TLS, really don't want to be impersonated, really vouch for forwarded stuff, etc).
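to make the "signaling" concrete, here's a rough sketch of what those mechanisms look like as DNS records (example.com, the selector, and the policy values are all placeholders; the DKIM key is truncated):

```
; SPF: which hosts are allowed to send mail for this domain
example.com.               TXT  "v=spf1 mx include:_spf.example.net -all"

; DKIM: public key receivers use to verify message signatures ("s1" is an arbitrary selector)
s1._domainkey.example.com. TXT  "v=DKIM1; k=rsa; p=MIIBIjANBg..."

; DMARC: what receivers should do when SPF/DKIM fail alignment, and where to send reports
_dmarc.example.com.        TXT  "v=DMARC1; p=reject; rua=mailto:dmarc@example.com"

; MTA-STS: "please deliver over TLS"; the actual policy is fetched over HTTPS
; from https://mta-sts.example.com/.well-known/mta-sts.txt
_mta-sts.example.com.      TXT  "v=STSv1; id=20240101T000000"
```

note how each record is a separate opt-in bolted onto a protocol that defaulted to plaintext and no authentication; that's the point about the environment doing the heavy lifting.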
also there's the BeyondCorp step-up. nobody bothered to really segment the internal network. nowadays the default is zero-trust.
>but checklists work (you might have heard about surgeons leaving medical tools in patients, and checklists eliminating this problem, seemingly the dumbest simplest technology, yet it's very powerful compared to the default of nothing)
These glorified checklists also backfire all the time. With the surgeon it's indeed simple. With software, what I see is that certification and process often lessen quality.
Why, you ask? Well, if you need certification, changes become expensive, because there's a huge tail of documentation and process work behind each one. This creates the perverse incentive to change as little as possible, even when you know something is broken.
Boeing knew that they would need to retrain pilots if they changed the 737 MAX airplane too much. So they hacked the hardware, which led to hacking the software, which led to omissions in the training. All because of the incentive not to change too much, lest they need to retrain all pilots.
The problem, of course, is that the retraining is an either/or. Either your plane is sufficiently 737-like or it isn't. If it isn't, the costs are huge, especially since the competition's airplane would not need retraining.
So the risks of these hacks to make it 737-like enough were weighed against huge costs. The costs amounted to basically not being able to do business at all, since these planes would be cost-prohibitive for the cheap domestic airlines that want them.
If the costs were more linear, I'm sure this wouldn't have happened, e.g. if you could retrain pilots on only the MCAS system without triggering full recertification.
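The incentive structure described above can be put as a toy model (all numbers are invented for illustration): with a step cost, any change that crosses the "needs retraining" threshold is catastrophically expensive, so the rational move is to shave the change under the threshold; with a linear cost, under-reporting barely pays.

```python
# Toy model of the certification cost cliff (numbers made up for illustration).

def step_cost(change: float, threshold: float = 0.3) -> float:
    """Cliff regime: any change past the threshold triggers full
    pilot retraining / recertification."""
    return 1.0 if change > threshold else 0.0  # 1.0 ~ "can't do business"

def linear_cost(change: float) -> float:
    """Hypothetical smoother regime: cost scales with the size of the change."""
    return change

# A redesign that genuinely amounts to a 0.35 "change" (new engines, MCAS, ...):
honest_change = 0.35

# Under the cliff, gaming the reported change down to 0.29 saves everything,
# which is exactly the incentive to hack hardware/software/training:
assert step_cost(0.29) == 0.0 and step_cost(honest_change) == 1.0

# Under a linear regime the saving from the same under-reporting is marginal:
assert abs(linear_cost(honest_change) - linear_cost(0.29)) < 0.1
```

The asymmetry between those two assertions is the whole argument: the cliff makes honesty ruinous and gaming nearly free.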
I work for a company that ships safety-certified software. We, or often our customers, have discovered bugs in that software. We do not fix the bugs because one single small bugfix means re-certifying the entire software, a process that takes months of producing proof of matching the safety case plus many more months of updating and approving accompanying documentation and going through an audit. Everything has to be re-touched.
We just issue an updated defect list to go with the software, and our customers need to work around the bugs. Known unfixed bugs are a fact of life in certified software; update releases are not. Customers pay a premium for this because they, too, would have to go through the same pain and expense on their side.
This reminds me of how NASA would never have a Shuttle in flight during the transition from December 31 to January 1, because they were unsure as to whether the shuttle's computers could handle the rollover correctly [1]. Sure, they could have updated the software to make sure that the rollover was handled correctly, but that would have required them to recertify the entire OS running the Shuttle, and it was easier to just plan missions such that the Shuttle was never flying on New Year's Eve.
We discovered a critical bug in the QNX 6 kernel in a networking scenario. There was no workaround, since the bug was in the core of their message-passing infrastructure: a kernel call that is non-blocking by design, SendPulse(), sometimes blocks.
It took me 9 months of talking to them about the problem before I managed to reproduce it on just two nodes with half a page of code, and to record kernel logs that clearly showed a race condition.
We received a patched kernel within a few days, and it worked like that for a while. The fix was merged into the official release only after almost two years.
After that, only Linux, where we can see and fix stuff. No proprietary code and bureaucracy, no "fast, robust and reliable" operating systems.
There is no safety-certified Linux. As far as I know there was no safety-certified QNX 6 either (QOS 1.0 was based on QNX 6.5 SP1, which is not the same as QNX 6 despite the numbers looking eerily similar).
With a safety-certified system, you do not receive a patch because it violates the safety certification. Of course, you can get a patch and use it but then you're responsible for safety-certifying the entire stack including the closed-source vendor code, and best of luck.
This is exactly my point. Also, even if there is a workaround, more often than not the complexity of the mountain of workarounds just creates the next set of certified bugs.
It's a valid point, but the solution is not obvious. It's a trade-off in a big design space. (Of course with software it seems "trivial" to make sure the certification can be done quickly and cheaply. Just automate it! Unfortunately we're not there yet. :/ )
The difference is that those bugs are certified in one and not the other. In many cases, bugs are known to the SW provider but not disclosed to customers unless they happen upon them. With certified SW, bugs are by default disclosed up front.
All true! The checklist is only as good as the system that produced it, and that was basically the largest sentence/paragraph in the comment.
Having a forward-looking industry working in symbiosis with a top-notch, high-functioning regulatory environment is the ideal state. It's rare. (I would say its appreciation is academic-only today. State capability (or "state capacity") is getting to be a buzzword now for pundits[0][1][2][3]. But it's not that surprising that these problems seem to be cropping up now, in the Internet era, and not during the Cold War.)
My theory is that having a basic regulatory environment allows industry-wide quality to increase effectively and quickly. For example, the CDC is bad at counting COVID cases[2], but catching blindness-causing eye drops with "just" ~70 cases countrywide[4] is a good example of the basic safety net.
Similarly, the whole aviation industry's safety process and context were what allowed the MCAS fuckup to come to light fast.
"Fun fact" regarding MCAS: if I recall correctly, Boeing argued that an MCAS malfunction was covered by the runaway stabilizer procedure (a checklist!); in reality, the astronomically bad UX of MCAS itself is what confused pilots. (It activated for 10 seconds every minute, or something WTF like that, so they had no idea they needed to reach for the runaway stabilizer checklist.) And that's exactly what you're saying: it wasn't sufficiently 737-like.
The workaround is to add UX to these type-similarity checks done by the FAA and other regulatory bodies. And, again, exactly as you mentioned, the cost-benefit discontinuity led to this bad trade-off. While we can't magically smooth over all of these discontinuities, at least (and that's my argument) we have good frameworks to start looking at them: to detect, recognize, analyze and work around them. (And checklists are level 1 of these tools.)