From a safety-critical standpoint, I've always found this article interesting but strange. You want both parsing and validation before taking in any data from outside the system. Do both, as soon as possible. Don't propagate data you haven't validated in every way your spec calls for. If your specs are more stringent than any standard you're building on, be explicit about it and reject the data with a clear failure report. Check for anything corrupted, malformed, or otherwise unexpected that could cause unexpected behaviour.
I feel the lack of investment in eliminating the parsing- (and validation-) related classes of bugs is the worst oversight in the history of computing. We have the tools to build crash-proof parsers (SPARK, Frama-C, and custom model-checked code generators such as RecordFlux). They're not perfect in any way, but if they had received a tiny fraction of the effort the security industry has put into mending all the 'Postel's law' junk out there, we'd be working on other problems by now.
I built, with an intern, an in-house bit-precise code generator for deserializers that can be proven free of runtime errors, and I'm moving on to semantic checks ('field X and field Y can only be present together', or 'field Y must be greater than or equal to its value the previous time it was present'). It's not that hard, compared to many other proof and safety/security endeavours.
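To make the semantic checks concrete, here is a minimal sketch (in Rust, not the actual generator's output; `Record`, `check_semantics`, and the field names are hypothetical) of the two rules above: a presence dependency between two fields, and monotonicity of a field across records:

```rust
#[derive(Debug)]
struct Record {
    x: Option<u32>, // "field X"
    y: Option<u64>, // "field Y", e.g. a timestamp
}

#[derive(Debug, PartialEq)]
enum SemanticError {
    XWithoutY,
    YWithoutX,
    YWentBackwards { prev: u64, now: u64 },
}

// Checks one record against both rules, carrying the last accepted Y
// between calls so the monotonicity rule spans the whole stream.
fn check_semantics(rec: &Record, last_y: &mut Option<u64>) -> Result<(), SemanticError> {
    // Rule 1: X and Y can only be present together.
    match (rec.x, rec.y) {
        (Some(_), None) => return Err(SemanticError::XWithoutY),
        (None, Some(_)) => return Err(SemanticError::YWithoutX),
        _ => {}
    }
    // Rule 2: Y must never decrease relative to its previous occurrence.
    if let Some(y) = rec.y {
        if let Some(prev) = *last_y {
            if y < prev {
                return Err(SemanticError::YWentBackwards { prev, now: y });
            }
        }
        *last_y = Some(y);
    }
    Ok(())
}
```

Note that the check returns a structured error rather than panicking, which is exactly the "clear failure report" a more stringent spec should produce.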
> It's not that hard, compared to many other proof and safety/security endeavours.
Yes, but the code has to understand and model the input as a program representation: the AST. That's the essence of the "parse, don't validate" paradigm. Instead of looking at each piece of a blob of data in isolation to determine whether it's a valid value, turn the input into a type-rich representation in the problem domain.
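A tiny illustration of that idea (a Rust sketch, not the FPRSA-R code; `Route`, `Waypoint`, and `parse_route` are made up): the parser's output type makes an empty route unrepresentable, so downstream code never has to re-check that invariant:

```rust
#[derive(Debug)]
struct Waypoint(String);

// A route that is non-empty by construction: one mandatory first
// waypoint plus a possibly-empty tail. No valid value of this type
// can represent "a flight plan with no waypoints".
#[derive(Debug)]
struct Route {
    first: Waypoint,
    rest: Vec<Waypoint>,
}

// Parsing either yields the type-rich representation or a clear error;
// there is no "validated flag" to forget to check later.
fn parse_route(input: &str) -> Result<Route, String> {
    let mut names = input.split_whitespace();
    let first = names.next().ok_or_else(|| "route has no waypoints".to_string())?;
    Ok(Route {
        first: Waypoint(first.to_string()),
        rest: names.map(|n| Waypoint(n.to_string())).collect(),
    })
}
```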
In the case of the FPRSA-R system in question, it does none of that. It's simply a gateway to translate data in format A to data in format B, like an ETL system. It's not looking at the input as a flight plan with waypoints, segments and routes.
Why the programmers chose to do the equivalent of bluescreening on one failed input, I can't say. As others have pointed out, the situation it gave up on isn't so rare: a 1-in-15-million event will happen. Of course, switching to an identical backup system is a bad choice too. In safety-critical work, there needs to be a different backup, much like the Backup Flight System on the Space Shuttle or the Abort Guidance System on the Apollo Lunar Module: a completely different set of avionics, programmed independently.
One of the reasons developers 'let it crash' is that no one wants to pay for error recovery, and by that I mean the whole design (including at the system level), testing, and long-term maintenance of rarely exercised code.
THAT SAID, isolating the decoding code and data structures, and having a way back via checkpoint/restore or wiping out bad state (or proving the absence of side effects, as SPARK dataflow contracts allow, for example), is better design that I wish were taught more often. I really dislike how often exception propagation is taught without showing how to handle the side effects...
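The wipe-out-bad-state idea above can be sketched in a few lines (Rust, purely illustrative; `SystemState`, `decode_update`, and `apply_message` are hypothetical names): decode into a scratch copy and commit to the live state only if the whole message succeeds, a poor man's checkpoint/restore:

```rust
#[derive(Clone, Debug, PartialEq)]
struct SystemState {
    count: u32,
}

// Decoding is a pure function of the input: it cannot touch live state,
// so a failure here can never leave the system half-updated.
fn decode_update(input: &str) -> Result<u32, String> {
    input.trim().parse::<u32>().map_err(|e| e.to_string())
}

// All mutation happens on a clone; the live state is replaced only at
// the single commit point, so every message is applied all-or-nothing.
fn apply_message(state: &mut SystemState, input: &str) -> Result<(), String> {
    let mut scratch = state.clone();
    scratch.count = decode_update(input)?;
    *state = scratch; // commit point
    Ok(())
}
```

Cloning the whole state is obviously too crude for a real system, but the shape (decode in isolation, commit atomically) is the point.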
You're right about all the buggy stuff out there, and that nobody wants to pay to make it better, though.