
Checksumming at that level is a bit pointless, because then you can't repair the data. Instead of being able to recover from a mirror or parity data, all you get is "it's corrupt, oh well".



> Instead of being able to recover from a mirror or parity data, all you get is "it's corrupt, oh well".

That's exactly what you want – a clean failure which prevents other software from silently propagating corrupt data. Even more importantly, with corruption it's usually much easier to recover from another copy if you notice the problem quickly.

Think about what happens with e.g. photos – you copy files from a phone, SD card, etc. and something gets corrupted on local storage. With integrity checks, the next time you load your photo app, go to upload photos somewhere, run a backup, etc., the filesystem immediately gives you an unavoidable error telling you exactly what happened. If it's recent, you can go back to the source phone/camera and copy the file over again.

With all current consumer filesystems, what you instead get is a file which looks okay at first – maybe because it's just displaying an embedded JPEG thumbnail, maybe because software like Photoshop is enormously fault-tolerant – but doesn't work when you open it in a different app. Or, because you never got an error, the underlying hardware fault affecting your storage gets worse over time until something fails so badly that it becomes obvious. By the time you notice the problem, the original file is long gone because you reused the device storage, and you have to check every backup, online service, etc. to find which copies are corrupt.
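The detect-on-read behavior described above can be sketched with a toy sidecar-checksum scheme. (The dict-backed `store`, the SHA-256 choice, and the function names are all illustrative assumptions – this is not how any real filesystem lays things out.)

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def write_with_checksum(store: dict, name: str, data: bytes) -> None:
    # Store a checksum alongside the data, as a checksumming FS does per block.
    store[name] = (data, checksum(data))

def read_verified(store: dict, name: str) -> bytes:
    data, expected = store[name]
    if checksum(data) != expected:
        # Fail loudly instead of silently handing back garbage.
        raise IOError(f"{name}: checksum mismatch, data is corrupt")
    return data
```

The point is the failure mode: reading the corrupted photo raises immediately, instead of delivering damaged bytes to whatever app asked for them.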


Checksumming at the file system level solves the problem of corruption that occurs off the media (on the bus). The media has checksumming on it that allows detection and recovery from errors that occur on the media itself.
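A rough sketch of why the end-to-end check adds coverage: the checksum is computed in memory before the data crosses the bus and verified after it comes back, so corruption anywhere along the path is caught even when the media's own internal ECC sees nothing wrong. (The dict-backed `dev`, the CRC-32 choice, and the names below are purely illustrative.)

```python
import zlib

def write_block(dev: dict, lba: int, block: bytes) -> None:
    # Checksum computed in memory, before the data crosses the bus.
    dev[lba] = (block, zlib.crc32(block))

def read_block(dev: dict, lba: int) -> bytes:
    block, stored = dev[lba]
    # Verified after the full round trip, so bus/controller corruption
    # is caught even if the media's per-sector ECC reported no error.
    if zlib.crc32(block) != stored:
        raise IOError(f"block {lba}: end-to-end checksum mismatch")
    return block
```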

The questions I'm sure Apple engineers had were: how often do we see bit rot occurring off media, and is the media we're deploying sufficiently resistant to it?

And, with APFS's flexible structure, this is a feature that can be added at a later time. It probably made more sense to deliver something rock solid in 2017 that they could build on than to either (A) push out the delivery date, or (B) not fully bake all the features of the file system.


And here's a nice summary of the bitrot that you see in systems with disks. Just like network devices, anyone with enough gear sees these failures.

https://www.usenix.org/legacy/event/fast08/tech/full_papers/...


Others have pointed out the potential HW issues, but I implemented something similar in a product that stored data using its own disk format, and found that adding checksums to all data written to disk revealed a number of cases where what we thought were HW failures were actually SW failures. AKA really, really, really obscure bugs that only happened under obscure conditions (think the equivalent of fsck bugs checking the filesystem after power loss, for one example: the journal needed to be in exactly the right state to trigger the recovery bug).


I have no idea if Apple flash storage devices do that (or will), but Hamming (7,4) can correct single-bit errors. Most ECC devices are SECDED (single-error correction, double-error detection).
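For reference, here's a minimal Hamming(7,4) encoder/decoder showing the single-bit correction mentioned above. (The bit ordering is one common convention, with parity bits at the power-of-two positions; the names are mine, and nothing here is Apple-specific.)

```python
def hamming74_encode(d):
    # d: four data bits [d1, d2, d3, d4] -> 7-bit codeword, positions 1..7
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    # Returns (corrected data bits, error position; 0 means no error seen).
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # binary position of the flipped bit
    c = c[:]
    if syndrome:
        c[syndrome - 1] ^= 1  # correct the single-bit error in place
    return [c[2], c[4], c[5], c[6]], syndrome
```

Flipping any one of the seven bits yields a syndrome pointing at that position, so the data comes back intact. SECDED variants extend this to Hamming(8,4) by adding one overall parity bit, which is what lets them also detect (but not fix) double-bit errors.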


I'd rather be told that the data is corrupt, than get garbage.


Failing to replicate corrupt data is the point. If corrupt data is delivered to user space, it proliferates into backups. And since it's common practice, even if suboptimal, to conflate archive and backup and rotate those media, it's just a matter of time before every backup has a corrupt copy of the file.



