I would recommend that anybody interested in filesystems watch Jeff Bonwick (the ZFS inventor) explain the design of ZFS: https://www.youtube.com/watch?v=NRoUC9P1PmA. They tell a few very nice war stories explaining why they found it useful to have the user data checksummed as well.
I really like the hack of block pointers being a data structure that contains the birth time (the transaction ID, I guess), and how that avoids having to manage bitmaps.
The explanation of the advantage of send/receive over rsync was also very nice (there's a sketch of both ideas after this comment).
It is not often that you see a technology and think "Oh, this is great stuff", but ZFS is like that. I haven't played much with it yet, but I get that impression just from learning about it so far.
Both Jeff and Bill are great at communicating and explaining the technology. I like how they tag-team, with minor funny bits here and there.
Regarding the main issue here, checksums -- yeah I don't see how Apple engineers could have watched this and said "Meh, don't need data checksums". Maybe they do have a secret vault with magic new holographic storage, immune to cosmic rays and other vagaries of physics, who knows.
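To make the birth-time trick concrete, here's a minimal sketch (my own simplification, not ZFS's actual blkptr_t layout) of how a birth transaction ID in every block pointer answers "did this change since snapshot X?" with a single comparison, and why the same property makes an incremental send proportional to the amount of changed data rather than to the size of the dataset:

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified sketch -- NOT ZFS's real blkptr_t layout. Every pointer to a
 * block records where the block lives, how to verify it, and the
 * transaction group ("birth time") in which it was written. */
struct blkptr {
    uint64_t       dva;        /* data virtual address: where the block is */
    uint64_t       birth_txg;  /* transaction group that wrote this block */
    uint8_t        checksum[32];
    struct blkptr *children;   /* non-NULL for indirect blocks */
    size_t         nchildren;
};

/* "Was this block modified after snapshot snap_txg?" is one comparison;
 * no per-snapshot bitmaps to allocate, update, or fsck. */
static int modified_since(const struct blkptr *bp, uint64_t snap_txg)
{
    return bp->birth_txg > snap_txg;
}

/* Incremental send: copy-on-write rewrites every ancestor of a modified
 * block, so an indirect block's birth_txg is >= that of its children.
 * That lets us prune entire untouched subtrees -- the walk only visits
 * changed data, whereas rsync has to scan everything to find the diff. */
static void send_delta(const struct blkptr *bp, uint64_t from_txg)
{
    if (!modified_since(bp, from_txg))
        return;                    /* whole subtree unchanged: skip it */
    if (bp->children == NULL) {
        /* leaf block born after the snapshot: stream it out here
         * (emit_block(bp) -- hypothetical output routine) */
        return;
    }
    for (size_t i = 0; i < bp->nchildren; i++)
        send_delta(&bp->children[i], from_txg);
}
```

The pruning step is the whole trick: the cost of an incremental send scales with what changed, not with how much data you have.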
I might be wrong, but I read APFS's no-checksum decision as founded upon Apple's hardware-and-software strategy. Given their situation, they can decide that checksumming is a hardware problem. They can build their latest devices with ECC, and require customers to only use Apple-developed USB keys. (I would not be shocked if, in late 2017, Apple starts selling USB-C storage keys preformatted with APFS.)
I think it is a decision based on Apple's image. They can only lose by enabling data checksums.
The user would have to be informed, since Apple's hardware has no storage redundancy. That would be perceived as an admission of failure on Apple's part: your $2k device just murdered your data.
There is not really a UX flow apart from telling the user to recover the file from backup.
Technically, having data checksums is the dominant strategy, even if you assume the hardware is awesomely perfect.
>There is not really a UX flow apart from telling the user to recover the file from backup.
I agree with all your points, but just want to offer a possible UX solution:
Since the vast majority of data on people's hard drives is video and images, where minor data corruption in most cases results in nothing worse than visual artifacts, we could have a pop-up dialogue that says: "A higher quality version of this file is found on your backup. Do you want to restore it?" when the corrupted file is a video or image and there's a confirmed backup of it.
If there's no confirmed backup, then just silently ignore the corruption, since the user won't notice anyway.
> There is not really a UX flow apart from telling the user to recover the file from backup.
There absolutely is, since that could be performed automatically. Inform the user about the corruption, rename the corrupted file, then restore the most recent backup in its place.
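Concretely, the flow could look something like this. A minimal sketch where every helper (backup_has_copy, restore_from_backup, etc.) is a hypothetical placeholder, not any real macOS/APFS API:

```c
#include <stdbool.h>

/* Every helper here is a hypothetical placeholder -- the point is the
 * shape of the flow, not any real macOS/APFS API. */
bool backup_has_copy(const char *path);
bool restore_from_backup(const char *path);
int  rename_aside(const char *path);   /* e.g. photo.jpg -> photo.jpg.corrupt */
void notify_user(const char *message, const char *path);

/* Called when a read fails its filesystem checksum. */
void handle_corrupt_file(const char *path)
{
    notify_user("Corruption detected in", path);

    if (!backup_has_copy(path))
        return;   /* nothing to restore from, but at least the user knows */

    if (rename_aside(path) == 0 && restore_from_backup(path))
        notify_user("Restored a clean copy from backup:", path);
    else
        notify_user("Could not automatically restore", path);
}
```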
That's assuming there's a backup and the OS knows about it. The huge majority of people don't have any backups, or their backups are not automatic, so the OS can't restore the file.
To the extent that this is true in the modern era of cloud services, it's irresponsibly abetted by so many programs silently ignoring errors rather than reporting them.
In Apple's case, they have complete control over iCloud and could offer this easily for any file which is stored there. They could also add some sort of metadata API so services like CrashPlan, Backblaze, etc. could register the presence of other copies in a generic manner (sketched after this comment). Third-party services could also integrate background scrubs into their existing applications.
In each case, the first time that dialog appeared you'd likely have a customer for life from anyone who's gone through the hassle of losing a personal memory, important document, etc., or made a panicked search for other/older copies.
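The registration API mentioned above could be as small as this. A purely hypothetical sketch, since no such interface actually exists:

```c
#include <stdint.h>

/* Hypothetical registration interface -- nothing like this exists today;
 * it's a sketch of what the parent comment describes. */
typedef struct {
    const char *provider;      /* e.g. "iCloud", "Backblaze", "CrashPlan" */
    const char *locator;       /* provider-specific handle for the copy */
    uint8_t     checksum[32];  /* checksum of the copy when it was verified */
    uint64_t    verified_at;   /* timestamp the copy was last known good */
} copy_record;

/* A backup agent registers that it holds a verified copy of `path`. */
int fs_register_copy(const char *path, const copy_record *rec);

/* On a checksum failure, the OS enumerates known copies and can offer
 * "restore from iCloud/Backblaze/..." instead of a dead-end error. */
int fs_enumerate_copies(const char *path,
                        int (*callback)(const copy_record *rec, void *arg),
                        void *arg);
```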
If there isn't a backup, it should at least notify the user of the error. But the OS would know about backups, since it has a built-in backup system.
Also, I wouldn't be surprised if Apple adds a cloud-based backup for macOS once APFS is the default filesystem, since change sets would be extremely efficient.
Checksumming at that level is a bit pointless, because then you can't repair the data. Instead of being able to recover from a mirror or parity data, all you get is "it's corrupt, oh well".
> Instead of being able to recover from a mirror or parity data, all you get is "it's corrupt, oh well".
That's exactly what you want – a clean failure which prevents other software from silently propagating corrupt data. Even more importantly, with corruption it's usually much easier to recover from another copy if you notice the problem quickly.
Think about what happens with e.g. photos – you copy files from a phone, USB stick, etc. and something gets corrupted on local storage. With integrity checks, the next time you load your photo app, go to upload photos somewhere, run a backup, etc., the filesystem immediately gives you an unavoidable error telling you exactly what happened. If it's recent, you can go back to the source phone/camera and copy the file back.
With all current consumer filesystems, what you instead get are things like a file which looks okay at first – maybe because it's just displaying an embedded JPEG thumbnail, maybe because software like Photoshop is enormously fault-tolerant – but doesn't work when you use it in a different app. Or, because you never got an error, the underlying hardware fault affecting your storage gets worse over time until something fails so badly that it becomes obvious. By the time you notice the problem, the original file is long gone because you reused the device storage, and you have to check every backup, online service, etc. to find out which copies are corrupt.
Checksumming at the filesystem level solves the problem of corruption that occurs off the media (on the bus). The media itself has checksumming that allows detection of, and recovery from, errors that occur on the media (see the sketch below).
The questions I'm sure Apple engineers asked were: how often do we see bit rot occurring off-media, and is the media we're deploying sufficiently resistant to bit rot?
And, with APFS's flexible structure, this is a feature that can be added at a later time. It probably made more sense to deliver something rock solid in 2017 that they could build on than to either (A) push out the delivery date, or (B) not fully bake all features of the file system.
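Here's a minimal sketch of the on-media vs. off-media distinction (a Fletcher-style checksum chosen only for illustration): the checksum is computed in host memory before the data crosses the bus and stored in the parent pointer, ZFS-style, then recomputed after the data comes back, so bus/controller/DMA corruption that the drive's own ECC never sees still gets caught:

```c
#include <stdint.h>
#include <stddef.h>

/* Fletcher-style checksum, for illustration only. */
static uint64_t checksum64(const uint8_t *buf, size_t len)
{
    uint64_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) & 0xffffffffu;
        b = (b + a) & 0xffffffffu;
    }
    return (b << 32) | a;
}

struct blkptr { uint64_t addr; uint64_t cksum; };

/* Hypothetical raw I/O routines standing in for the driver stack. */
int media_write(uint64_t addr, const uint8_t *buf, size_t len);
int media_read(uint64_t addr, uint8_t *buf, size_t len);

/* Write path: the checksum is computed in host memory, BEFORE the data
 * crosses the bus, and stored in the parent pointer -- not with the data. */
int fs_write_block(struct blkptr *bp, const uint8_t *buf, size_t len)
{
    bp->cksum = checksum64(buf, len);
    return media_write(bp->addr, buf, len);
}

/* Read path: recompute AFTER the data comes back across the bus. If the
 * bus, controller, or DMA mangled it, the drive's on-media ECC already
 * reported success -- only this end-to-end check can catch it. */
int fs_read_block(const struct blkptr *bp, uint8_t *buf, size_t len)
{
    if (media_read(bp->addr, buf, len) != 0)
        return -1;   /* the drive itself reported an error */
    if (checksum64(buf, len) != bp->cksum)
        return -2;   /* silent corruption caught here */
    return 0;
}
```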
Others have pointed out the potential HW issues, but I implemented something similar in a product that stored data using its own disk format, and found that adding checksums to all data written to disk surfaced a number of cases where what we thought were HW failures were actually SW failures. AKA really, really, really obscure bugs that only happened under obscure conditions (think of the equivalent of fsck bugs when checking the filesystem after power loss, for one example: the journal needed to be in exactly the right state to trigger the recovery bug).
I have no idea if Apple flash storage devices do that (or will), but Hamming (7,4) can correct single-bit errors. Most ECC devices are SECDED (single-error correction, double-error detection).
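Since Hamming(7,4) came up: the whole code fits in a few lines, and the neat part is that the 3-bit syndrome directly names the position of the flipped bit. A toy sketch for illustration; real flash controllers use far stronger codes (BCH/LDPC):

```c
#include <stdint.h>
#include <stdio.h>

/* Encode 4 data bits (low nibble of `data`) into a 7-bit codeword.
 * Bit i of the result is codeword position i+1 (positions 1..7);
 * parity bits sit at the power-of-two positions 1, 2 and 4. */
static uint8_t hamming74_encode(uint8_t data)
{
    uint8_t d0 = (data >> 0) & 1, d1 = (data >> 1) & 1;
    uint8_t d2 = (data >> 2) & 1, d3 = (data >> 3) & 1;
    uint8_t p1 = d0 ^ d1 ^ d3;   /* covers positions 3, 5, 7 */
    uint8_t p2 = d0 ^ d2 ^ d3;   /* covers positions 3, 6, 7 */
    uint8_t p4 = d1 ^ d2 ^ d3;   /* covers positions 5, 6, 7 */
    return p1 | (p2 << 1) | (d0 << 2) | (p4 << 3)
              | (d1 << 4) | (d2 << 5) | (d3 << 6);
}

/* Decode, correcting up to one flipped bit. The syndrome -- the XOR of
 * the 1-based positions of all set bits -- is 0 for a valid codeword and
 * otherwise equals the position of the error. */
static uint8_t hamming74_decode(uint8_t code)
{
    int syndrome = 0;
    for (int pos = 1; pos <= 7; pos++)
        if ((code >> (pos - 1)) & 1)
            syndrome ^= pos;
    if (syndrome)                        /* flip the offending bit back */
        code ^= (uint8_t)(1u << (syndrome - 1));
    return ((code >> 2) & 1) | (((code >> 4) & 1) << 1)
         | (((code >> 5) & 1) << 2) | (((code >> 6) & 1) << 3);
}

int main(void)
{
    uint8_t cw = hamming74_encode(0xB);  /* data bits 1011 */
    cw ^= 1u << 4;                       /* flip one bit "in transit" */
    printf("recovered: 0x%X\n", hamming74_decode(cw)); /* prints 0xB */
    return 0;
}
```

Two flipped bits still produce a nonzero syndrome, but it points at the wrong position and mis-corrects, which is exactly why SECDED codes add one more overall parity bit: to at least detect that case instead of silently making it worse.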
Failing to replicate corrupt data is the point. If corrupt data is delivered to user space, it proliferates into backups. And since it's common practice, even if suboptimal, to conflate archive and backup and rotate those media, it's just a matter of time before all backups have a corrupt copy of the file.
You may be interested in http://daringfireball.net/thetalkshow/2016/06/17/ep-158, where Craig Federighi discusses how Apple implemented wear-your-Apple-Watch-to-unlock functionality for their Macs. As it turns out, they make calculations based on how fast the Bluetooth responses are to judge the distance between the laptop and the watch, so that a snoop can't unlock your laptop if you're merely in the room. It's nice to know that Federighi, at least, is involved in the nitty-gritty, even though such details aren't reflected in the keynotes.
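I have no idea what Apple actually measures, but the time-of-flight arithmetic behind "how fast the responses are" is simple. A speculative back-of-the-envelope sketch, assuming you could timestamp precisely enough and knew the watch's fixed processing delay:

```c
#include <stdio.h>

/* Speculative back-of-the-envelope: radio propagates ~0.3 m per
 * nanosecond, so round-trip time minus the responder's (known, fixed)
 * processing delay gives twice the distance. None of this reflects
 * Apple's actual implementation. */
static double distance_m(double rtt_ns, double processing_ns)
{
    const double c_m_per_ns = 0.299792458;  /* speed of light */
    return (rtt_ns - processing_ns) * c_m_per_ns / 2.0;
}

int main(void)
{
    /* A response arriving 20 ns later than the known processing delay
     * puts the watch ~3 m away -- plausibly too far to auto-unlock. */
    printf("%.1f m\n", distance_m(1020.0, 1000.0));
    return 0;
}
```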