It really bugs me (and has for a while) that there is still no mainstream linux ...

nayuki · on March 12, 2017

Btrfs has experienced some data loss bugs in recent memory. It looks like ZFS is the only remaining option.

https://www.phoronix.com/scan.php?page=news_item&px=Btrfs-Da...

https://www.spinics.net/lists/linux-btrfs/msg59190.html

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg...

seiferteric · on March 12, 2017

Yes I have tried to use btrfs several times for work projects and personally because I was very excited about it but every time I have ran into severe bugs even though it was said to be "stable". I have given up for now, maybe I will check back in a couple more years.

Dylan16807 · on March 12, 2017

I really like btrfs for my backup machine, but it sometimes manages to hang when it's cleaning up deleted snapshots. This is a problem that's much worse when on a hard drive or fragmented file system, but I've gotten it to happen even on a recently-balanced drive with a small amount of snapshots.

I recently had a server be crippled by running snapper on default settings for a few months. And after a couple days of balancing (which desperately needs a throttle control) it wasn't much better, so I gave up on having it run btrfs.

I think the record I managed was slightly over two minutes of btrfs blocking all disk I/O. Something is deeply wrong with how it organizes transactions.

dom0 · on March 13, 2017

Funny, my btrfs story is somewhat similar. I don't hit severe bugs, though, just severe performance issues. E.g. btrfs on / always sooner or later means for me that the entire IO to the disk that's on will be _thrashed_ during system updates or package installs / uninstalls.

djsumdog · on March 12, 2017

The last time I was using btrfs was around 2014, and I was wondering why my hard drive was always showing 100% utilization even after I moved/deleted a ton-o-stuff.

Turns out at the time, re-balancing still had to be run manually. I'm not sure if that still holds true.

delroth · on March 13, 2017

It is still a problem, I just had one of my personal servers hit issues where I couldn't even manually re-balance because the metadata was full. Had to apply weird workarounds to be able to write to that filesystem again...

tombrossman · on March 12, 2017

Any idea if ZFS plays well with Ubuntu's full-disk encryption? I've used the FDE option at install for years and every time I upgrade (I wipe & reinstall every year or so) I try to understand how to first set up ZFS, then FDE, and then I realize it's far too complicated for me. Any good tutorials or setup guides that even a moron could understand?

I've got a pretty good setup now with a fairly complex fstab, multiple SSDs, backup drives, and everything fully encrypted and auto mounting at boot. I'd really love to move this to a file system more resistant to data corruption.

zxv · on March 12, 2017

Native ZFS encryption may become available soon. It is pending code review: https://github.com/zfsonlinux/zfs/pull/5769

There was a presentation of this by Tom Caputi at the most recent OpenZFS Developer Summit.

Slides: https://drive.google.com/file/d/0B5hUzsxe4cdmU3ZTRXNxa2JIaDQ... Video: https://youtu.be/frnLiXclAMo Conference: http://open-zfs.org/wiki/OpenZFS_Developer_Summit

tnorgaard · on March 12, 2017

We run with ZFS over LUKS encrypted volumes in production on AWS ephemeral disks and have done so in over two years on Ubuntu 14.04 and 16.04. The major issue for us has been getting the startup order right, as timing issues does occur once you have many instances. To solve this, we use upstart (14.04) and systemd (16.04) together with Puppet to control the ordering.

Performance wise it does fairly well, our benchmarks shows ~10-15% decrease on random 8kb IO (14.04).

We are definitely looking forward to ZFS native encryption!

newman314 · on March 13, 2017

What is the right order?

weitzj · on March 13, 2017

Since ZFS will run on blocklevel devices and you want to get the ZFS benefits of Snapshots/compression/(deduplication), in my opinion it makes sense to do the encryption at the blocklevel, i.e. LUKS has to provide decrypted block level devices before ZFS searches for its zpools. When ZFS native encryption is available on Linux this will be different, since you much finer control on what to encrypt and you can keep all ZFS features.

So:

First decrypt LUKS (we are doing this in GRUB) Then mount zpool(s)

drvdevd · on March 12, 2017

I use it all over with LUKS on Ubuntu. It works fine, but there's one little hitch:

when calling anything that ultimately calls grub-probe (e.g. apt-get upgrade), you have to symlink the decrypted device mapper volume up a layer into /dev because grub-probe can't seem to find the ZFS vdev(s) otherwise. ie: "ln -s /dev/mapper/encrypted-zfs-vdev /dev".

This is in fact the case on every Linux distribution I've run ZFS over dm-crypt on.

EDIT: IOW its a grub bug not an Ubuntu bug.

XorNot · on March 13, 2017

The other problem I've found is that grub's update-grub scripts do not handle mirrored ZFS volumes well at all - they wind up just spraying doubled up invalid commands everywhere even up to yakkety so far.

I've had it on my backlog to at some point go in and sort out my initramfs's insanity when it comes to handling crypt'd disks in general - it should be a lot less brittle then it is.

weitzj · on March 13, 2017

We are rolling our own update grub as a bash script, which does some sanity checks:

- is the crypto module still present in GRUB config?

- is grub running in text only mode? Otherwise we cannot see the LUKS prompt to decrypt the devices.

- does initramfs know about all mirror devices (instead of an early return after the first)

nisa · on March 12, 2017

You can always use zvols and just use ext4+luks on that. There is also work on ZFS native encryption that looks pretty promising, not sure if it's ready yet.

benley · on March 12, 2017

I haven't tried to do it with Ubuntu specifically, but I do know that ZFS-on-linux works fine atop LUKS full-disk encryption. My laptop is running NixOS with such a setup, and I'm pretty sure I followed the Ubuntu ZFS documentation while figuring out how to do it.

(summary: I don't know how to make it super easy, but what you want should totally work)

_vdrz · on March 12, 2017

bcachefs is working on it but needs support:

- https://bcache.evilpiepirate.org/Bcachefs/

- https://www.patreon.com/bcachefs/

(I'm not Kent but I love the work)

patrickg_zill · on March 12, 2017

Having used ZFS on Solaris/x86 for years and years, and now on Linux, I would say that your best best is to use ZFS.

While ZFS might not be supported oput of the box on "your-choice-distro" I would point out that Ansible or other automation tools should make it pretty easy to end up with a repeatable installed machine or VM image with ZFS.

cyphar · on March 13, 2017

> (...sure, you could call zfs or btrfs "mainstream", I suppose, but when I say "mainstream" I mean something along the lines of "officially supported by RedHat". zfs isn't, and RH considers btrfs to still be "experimental".)

SUSE (the other main enterprise Linux distribution vendor) has been supporting Btrfs in the enterprise (as the default filesystem) for several years.

[Disclosure: I work for SUSE.]

throwaway7767 · on March 15, 2017

No disrespect intended, but my experience is that SuSE has always played fast and loose with the filesystem defaults.

I recall they switched to reiserfs as default at one point. reiserfs was never a good choice for data consistency - the fact that storing a reiserfs image in a file on the host reiserfs filesystem and then doing an fsck on the host FS would corrupt the FS should be a clear signal that there are fundamental problems remaining to be solved.

That said, I'm playing with btrfs on some of my machines, and it seems quite nice. But no way would I risk using it on a production server at this time.

cyphar · on March 17, 2017

None taken. To be clear, we do provide enterprise support for ext4 and XFS as well (which a lot of people use for the reasons you mentioned). In my experience, btrfs still has some growing pains (especially when it comes to quotas, which will cause your machine to lag quite a bit when doing a balance) but is definitely serviceable as a daily driver (though for long-term storage I use XFS).

throwaway7767 · on March 17, 2017

Ah, that's good to hear. It's been a while since I've used SuSE.

Of course, someone has to go first and filesystems never truly get battle-hardened until distros start pushing them. I appreciate that SuSE does this from that perspective. It means when I switch over there will be less bugs. :)

I'm using btrfs as a daily driver on my workstations so I get some experience with the tooling, and also because features like consistent snapshots are really nice to have. Still haven't taken the plunge on the server side, I expect I'll give it a few years until it's considered "boring".

dom0 · on March 12, 2017

btrfs only uses CRC32c which is weakish. ZFS is great but not exactly portable. I started to use Borg now for archiving purposes as well, not just backup. For me (low access concurrency, i.e. single or at most "a few" users) that works very well. Portable + strong checksumming + strong crypto + mountable + reasonable speed (with prospect of more) is a good package. It doesn't solve error correction, though.

rleigh · on March 12, 2017

Not portable? In comparison to which filesytems? I can easily export a ZFS pool on Linux, physically transport the discs to a FreeBSD or IllumOS server and import the pool. Or to MacOS X, which is also supported (though I haven't tried it unlike the others where it worked perfectly).

That's already far ahead of ext, ufs, xfs, jfs, btrfs etc. The only ones offhand which are possibly more portable are fat, hfs, udf and ntfs, and you're not exactly going to want to use them for any serious purpose on a Unix system. ZFS is the most portable and featureful Unix filesystem around at present IMHO.

ak217 · on March 12, 2017

crc32c is not weakish, and was chosen for a reason: crc32c has widespread hardware acceleration support that remains faster than any hash, and crc32c can be computed in parallel (unlike a hash, it has no hidden state, so you can sum independently computed block checksums to get the overall blob checksum). Bitrot detection doesn't need a cryptographic hash. You may want a hash for other purposes (like if you somehow trust your metadata volume more than your data volume), but that's a separate and slower use case.

bascule · on March 13, 2017

Bitrot detection doesn't need a cryptographic hash.

Not only is a cryptographic hash unnecessary, under certain circumstances it will actually do a worse job.

Cryptographic hashes operate under a different set of constraints than error detecting code. With an error detecting code, it's desirable to guarantee a different checksum in the event of a bitflip.

With a cryptographic random oracle, this is not the case: we want all outcomes to have equal probability, even potentially producing the same digest in the event of a bitflip. As an example of a system which failed in this way: Enigma was specifically designed so the ciphertext of a given letter was always different from its plaintext. This left a statistical signature on the ciphertext which in turn was exploited by cryptanalysts to recover plaintexts. (Note: a comparison to block ciphers is apt as most hash functions are, at their core, similar to block ciphers)

Though the digests of cryptographic hash functions are so large it's statistically improbable for a single bitflip to result in the same digest as the original, it is not a guarantee the same way it is with the CRC family.

Cryptographic hash functions are not designed to be error detecting codes. They are designed to be random oracles. Outside a security context, using a CRC family function will not only be faster, but will actually provide guarantees cryptographic hash functions can't.

rwiggins · on March 13, 2017

In practice, the chance of a digest collision between two messages that differ in a single bit is exceedingly small for any secure cryptographic hash function. It's so small that it's practically not worth considering. Cryptographers are incredibly careful in building and ensuring proper diffusion in cryptographic hash functions.

mrb · on March 13, 2017

"Not only is a cryptographic hash unnecessary, under certain circumstances it will actually do a worse job."

A cryptographic hash is unnecessary, but as you point out in the next to last paragraph, it is statistically improbable that it will do a worse job. Because collisions are statistically improbable.

jepler · on March 13, 2017

Specifically, I wrote a program to search for single-bit-flip collisions in sha1 truncated to 16 bits. The program didn't need to search for long before finding two messages with the same 16-truncated sha1 with a single bit flip at bit 1 of byte 171 of a 256-byte message. 376 1 171 be44b935e7ecfc81d1fe2cddcd7c1d7e04338fd83fa994cd6a877732ca5d8db83346bd9ccbfc4c8770682bd307c782421a512a80a106be87825d5c13f3156e23ffaacdfc1651f88f775507d1175542def2ccf084271ebd4ead175c8a448be0d50b26f59d970301ebc5a7f672d3ea870d9a1e02f8f5fd01c38297b8aa264a3f07fec32f9a91aa359784d2d9ce0e4649465c705f50feed23dcbefc0a726cfadb5e47ee577ed45203f90d6e2e650d42ddb10cba49d06bd4cdad4e6eaf5cfcb062de2539fc847ce0c104f2e667369080eaaab5934ae5f7f1ba733c3d1bfbda87bfa72ef12475b9ff0edc4deb99e6a5cf387c7f6b9c71ea62b4db4bb67c92d36460dd be44b935e7ecfc81d1fe2cddcd7c1d7e04338fd83fa994cd6a877732ca5d8db83346bd9ccbfc4c8770682bd307c782421a512a80a106be87825d5c13f3156e23ffaacdfc1651f88f775507d1175542def2ccf084271ebd4ead175c8a448be0d50b26f59d970301ebc5a7f672d3ea870d9a1e02f8f5fd01c38297b8aa264a3f07fec32f9a91aa359784d2d9ce0e4649465c705f50feed23dcbefc0a726cfadb5e47ee577ed45203f90d6e2e640d42ddb10cba49d06bd4cdad4e6eaf5cfcb062de2539fc847ce0c104f2e667369080eaaab5934ae5f7f1ba733c3d1bfbda87bfa72ef12475b9ff0edc4deb99e6a5cf387c7f6b9c71ea62b4db4bb67c92d36460dd

https://gist.github.com/jepler/96d1e779dc95b8941b208887e10a8...

On the other hand, any CRC will detect all such errors; a well-chosen one such as CRC32C will detect all messages with up to 5 bits flipped at this message size.

This is quite appropriate for the error model in data transmission, of uncorrelated bit errors. https://users.ece.cmu.edu/~koopman/networks/dsn02/dsn02_koop... is a pretty good paper, though there are probably better ones for readers without an existing background in how CRC works.

mrb · on March 13, 2017

You are not testing a crypto hash. "Crypto hash" means it is cryptographically strong, not truncated to 16 bits. For example ZFS with checksum=sha256 will use the full 256-bit hash for detecting data corruption.

jepler · on March 14, 2017

Yup, you're right. if you use a full size cryptographic hash then the number of undetected errors can be treated as 0 regardless of hamming distance. On the other hand, it has 8x the storage overhead of a 32-bit CRC.

jepler · on March 13, 2017

Less so if you assume the cryptographic hash will be truncated to 32bits so that its size matches crc32c. Furthermore, some of those collisions will be on message pairs with small hamming distance, probably including messages with a single bit flipped which CRC will always detect.

dom0 · on March 12, 2017

CRC32c -> I saw this many times fail to detect corruption on message lengths anywhere between a couple kB and a few MB. btrfs blocks are 16 kB iirc, so in range. The longer hashes of ZFS, Borg and so on mean that if it's corrupted I _definitely_ know. Not so confident with CRC32 from experience.

ak217 · on March 13, 2017

I'm curious about the setting in which you saw these failures, could you elaborate?

Unlike a plain checksum, CRC-32C is hardened against bias, which means its distribution is not far from that of an ideal checksum. This means if your bitrot is random and you're using 16KB blocks, you will need to see on the order of ((2 * * 32) * 16KB)=64TB of corrupted data to get a random failure. Modern hard drives corrupt data at a rate of once every few terabytes. TCP without SSL (because of its highly imperfect checksum) corrupts data at a rate of once every few gigabytes. Assuming an extremely bad scenario of a corrupt packet once every 1 GB, in theory you'd need to read more than a zettabyte of data to get a random CRC-32C failure. I'm not doubting that real world performance could be much worse, but I'd like to understand how.

vardump · on March 13, 2017

> This means if your bitrot is random and you're using 16KB blocks, you will need to see on the order of ((2 * * 32) * 16KB)=64TB of corrupted data to get a random failure.

No, that's what you need to generate one guaranteed failure, when enumerating different random corruption possibilities. Simply because 32-bit number can at most represent 2^32 different states.

In practice, you'd have 50% probability to have a collision for every 32 TB... assuming perfect distribution.

By the way, 32 TB takes just 4-15 hours to read from a modern SSD. Terabyte is just not so much data nowadays.

leni536 · on March 13, 2017

Just a nit: you don't get a guaranteed failure at 64TB, you get a failure with approx 1-1/e ~= 63% probability. At 32TB you get a failure with approx 1-1/sqrt(e) ~= 39% probability.

I do agree that tens of TBs are not too much data, but mind that this probability means that you need to feed your checksum 64TB worth of 16KB blocks, every one of them being corrupt, to let at least one of them go trough unnoticed with 63% probability. So you don't only need to calculate with the throughput of your SSD, but the throughput multiplied with the corruption rate.

marcosdumay · on March 13, 2017

For disk storage CRC-32C is still non-broken. You can't say the same about on-board communication protocols or even some LANs.

When people started using CRC-32 it was because with the technology of the time it was virtually impossible to see collisions. Nowadays we are discussing if it's reasonable to expect a data volume that gives you 40% or 60% of collision chance.

CRC32 end is way overdue. We should standardize on a CRC64 algorithm soon, or will have our hands forced and probably stick with a bad choice.

speleo_engr · on March 13, 2017

Posts like this are what keep me coming back to HN.

cmurf · on March 13, 2017

Not exactly, blocks are 4KiB so the vast majority of the CRC32C's apply to 4KiB block size; the metadata uses 16KiB nodes and those have their own checksum also. From what I see in the on disk format docs, it's a 20 byte checksum.

johnramsden · on March 13, 2017

ZFS is probably one of your better options if you want portability. It works on most Unix like operating systems, and due to its zfs send/recv capability you can send your data all over the place - and can trust that it will actually end up the same on the other end. If that's not portability I don't know what is.

If what you're saying is it's not portable to Windows then sure, but compared to most filesystems it's extremely portable.

X86BSD · on March 12, 2017

ZFS not portable? Have you ever used it?

"ZFS export" on the host system, remove drives, insert drives in new server, "ZFS import" on the new server, It really is that simple.

dpedu · on March 12, 2017

I think he means portability of the software. What I want to 'ZFS import' on a Windows or macOS machine?

X86BSD · on March 12, 2017

Oh, well it's ported from Solaris to illumos and FreeBSD and to a lesser extent OS X and Linux. So I'm still confused about the portability claim.

tux1968 · on March 12, 2017

The license incompatibility has kept it out of the mainline kernel for Linux, so it's not really a viable option there in many situations. Linux is definitely lacking in this department.

rleigh · on March 12, 2017

That's a minor concern and unrelated to the portability claim. That comes down to choice, and is not a technical consideration. I'm using it with Ubuntu 16.04 LTS and 16.10 where it works out of the box. It's most certainly portable to and from Linux and other systems; I've done it personally, and it works a treat.

tux1968 · on March 12, 2017

Minor for some, major for others. And it is technical too, because there are maintenance ramifications for it not being in the mainline kernel, for example how quickly a security patch can be applied.

And for others a constriction on which distribution they can move to etc. It just reduces the number of situations where it can be used, even if you find yourself in one where it can.

justincormack · on March 12, 2017

ZFS on OSX has been revived I believe.

drvdevd · on March 12, 2017

I wonder if the recent Linux syscall emulation on Windows would somehow make it possible or easier to port ZFS on Linux to Windows.

I know you have the SPL anyway, so maybe with the addition of the Linux POSIX-ish layer in there this could be the case...

chungy · on March 12, 2017

Short answer: No it wouldn't make it easier to port.

Longer answer: The Linux subsystem in Windows 10 only deals with userspace. It doesn't support kernel modules nor changes anything about making Windows drivers. Porting ZFS to Windows is certainly possible, but it will take quite a lot of effort, and the Linux subsystem is irrelevant in that situation.

drvdevd · on March 12, 2017

Yeah, I figured as much, but was hoping there might be something about the Linux subsystem that would be helpful in porting drivers around, beyond userspace.

Obviously, I've yet to actually use it myself.

gjjrfcbugxbhf · on March 12, 2017

Would a FUSE implementation of ZFS not be possible? (Just wondering I've no idea what's possible here)

Dylan16807 · on March 12, 2017

A FUSE implementation of ZFS exists and works well, and adding FUSE support to the Windows 10 Linux subsystem appears to be reasonably high up on the priority list.

That doesn't get you access from Windows programs, but there are some other ways to do FUSE or FUSE-like things on Windows..

problems · on March 13, 2017

It may be fairly straightforward to port the ZFS FUSE to Windows if you use things like Dokan or WinFsp which have the FUSE interface supported fairly well - these would give full access via standard Windows tools.

rincebrain · on March 12, 2017

Last I knew, the zfs-fuse codebase hadn't been updated since before feature flags were added to any of the OpenZFS targets, so it's not a particularly well-supported solution...

e12e · on March 12, 2017

Fwiw I've successfully shared my luks-encrypted usb3 zfs-formatted disk to Windows pro on my Surface 4 via a hyper-v vm running Ubuntu and samba. It won't work for all external drives - you need to be able to set the drive as "offline" in device manager under Windows in order to pass it through to the hyper-v vm (and sadly this doesn't appear possible with the sdcard - I had hoped to install Ubuntu on the sdcard and have the option to boot from the sdcard and also boot into the same filsystem under hyper-v).

s-macke · on March 12, 2017

A real benefit would be file system support for Reed-Solomon error correction. Some archival tools support such a feature. You would spent 1%-10% of the disk space for error correction.

fnord123 · on March 12, 2017

Sector failures on a single drive are not independent random events. So doing this on a single drive is not perceived to be a good idea.

The only choices, imo, are multiple disks, or automatic backup to offsite (dropbox, box, one drive, etc).

fnord123 · on March 13, 2017

Sorry, a better answer is that it's already done in the hardware/firmware layer. e.g. from Wikipedia[1]:

""" Modern HDDs present a consistent interface to the rest of the computer, no matter what data encoding scheme is used internally. Typically a DSP in the electronics inside the HDD takes the raw analog voltages from the read head and uses PRML and Reed–Solomon error correction[144] to decode the sector boundaries and sector data, then sends that data out the standard interface. That DSP also watches the error rate detected by error detection and correction, and performs bad sector remapping, data collection for Self-Monitoring, Analysis, and Reporting Technology, and other internal tasks. """

With the answer I gave above, and the fact that it's already being done in hardware, I don't think adding another layer of EC will be fruitful.

[1] https://en.wikipedia.org/wiki/Hard_disk_drive#Access_and_int...

pmarreck · on March 13, 2017

If this is the case, then why does bit-rot still occur?

fnord123 · on March 13, 2017

It's not magic. If you have enough bit flips for the same data then it's not recoverable. Sometimes it could even flip and resemble correct data. It might not happen to you, but if 500 people reading this thread each have access to 50T of data, then sure bit rot can happen in some of that 25 petabytes.

pmarreck · on March 13, 2017

Do you think that there is a need for some kind of ECC at the OS level? There are surely some applications where even 1 bit flip in 25 petabytes is bad. Amazon's 2008 site outage comes to mind: http://status.aws.amazon.com/s3-20080720.html

fnord123 · on March 13, 2017

Yes, that's why you have erasure coding across drives either using RAID groups or using an object store. Just not at the single drive file system level where blocks going bad within a disk tend to be correlated (i.e. the drive is getting old and will die).

Also, EC at the block level would probably spread the blocks around the drive. This means any read would need to seek all over the dang place trying to reassemble files and that's a bad access pattern for spinning disks. Real bad. Like, the worst. It might even reduce the effective lifetime of the drive. So it would be not only correlated with device failure, it could precipitate it.

Maybe it would be ok on an SSD.

zzzcpan · on March 12, 2017

What would you do with a failed checksum on a filesystem level? These errors could be and likely to be transient too.

elfchief · on March 12, 2017

If nothing else, you can log the error. If you have RAID1 you can recopy the block from a good copy. It is, honestly, probably situation-specific, but step #1 is always going to be "identify that you have a problem"

zzzcpan · on March 12, 2017

That's the thing, identifying the problem on the filesystem level is useless if you can only correct it on another level. Unless your filesystem is distributed and self-healing, it's not a place for checksums, it must remain a thin predictable layer on top of a disk.

ProblemFactory · on March 12, 2017

There absolutely is a practical use for checksums and error detection, even if you cannot do error correction.

Error detection allows you to:

* Discover and replace bad hardware (like the author of the article),

* Avoid unknowingly using invalid data for calculations or program code,

* Stop using the system and corrupting more data over time,

* Restore files from backup while still in the retention period of your backup service.

I once had a computer with a bad disk that somehow corrupted 1 bit out of about 150MB written, and probably only in some regions. I only found out after the GZip checksum of a very large downloaded file mismatched, and it took a while to figure out the real cause. By that time I had been using it for months, so it's unclear to this day what other files might have been corrupted.

floatboth · on March 12, 2017

ZFS is self-healing. It can automatically restore from mirror/RAID drives. And if you only have one drive, you can set a parity option so that information will be written several times, like a mirror but on one disk. (Obviously doesn't protect against a whole drive failure :D)

g0xA52A2A · on March 12, 2017

> That's the thing, identifying the problem on the filesystem level is useless if you can only correct it on another level.

Nope I'd take a system that can tell me "Hey sorry I don't have the right data" over "Here's some data, looks good to me" any day of the week.

Also as ProblemFactory points out, ZFS will self-heal whenever possible.

zzzcpan · on March 13, 2017

"Nope I'd take a system that can tell me "Hey sorry I don't have the right data" over "Here's some data, looks good to me" any day of the week."

Nope, it doesn't help with anything. I think you are making false assumptions about the safety of your data. Because to keep consistency such system must automatically shutdown on data corruption and only wake up once the problem is fixed. To keep availability it must automatically fix the problem. There is no third option, because filesystems are not aware of how applications use them. Returning EIO would also introduce inconsistency.

But if you actually care about your data, you shouldn't trust a single box, ever, no matter how fancy the filesystem is. There are just so many things that could go wrong, it's hard to believe someone would even take that risk.

Just a few weeks ago I saw data loss on one of the object storage nodes, because of a typical sata cable issue. It caused multiple kernel panics, affecting data on multiple disks, before I was able to figure it out. Not a problem for a distributed storage, but you wouldn't survive that with a local storage. This is also one of the reasons I consider zfs an over engineered toy, that doesn't target real world problems.

heinrich5991 · on March 12, 2017

What is wrong with returning an IO error (e.g. EIO) if the data read from the disk is corrupt?

adrianN · on March 12, 2017

Use an error correcting code, fix errors you can fix, report unrecoverable errors to the user so that they can restore from backup.

Mister_Snuggles · on March 12, 2017

OpenSuSE (and SLES) has considered btrfs stable for a while.

morecoffee · on March 12, 2017

Making data on individual drives is probably not a long term goal for integrity. It is more likely that multi homing data in geographically disjoint locations with some sort of syncing is a better long term goal. Fixing silent data corruption on a single drive doesn't solve any of the much more likely disasters, like fire, flooding, weather, etc. Not even datacenters can withstand lightning.

elfchief · on March 12, 2017

The problem with multiple geographic locations, in this context, is that you'd have to read from all of them, and compare the results, to know that you have file corruption. Which is, needless to say, not something that it makes sense to do.

The purpose of data block checksumming isn't to make your data more resilient (at least directly), it's to make sure you know you have a problem. Once you know you have a problem, then you can go read from your alternate datacenter or whatever.

db48x · on March 12, 2017

Agreed, and that's how zfs does it behind the scenes. When it detects a read error, it uses the pool redundancy (whether mirrors or RAIDZn, whichever you're using) to transparently recover from the error. Even if you set up a pool with only one disk you can still set it to keep multiple copies of all the data.

A super-ZFS that automatically did that using remote mirrors would be interesting, but would also stretch the definition of "transparent" a bit.

TheSpiceIsLife · on March 12, 2017

Not even datacenters can withstand lightning.

Data centers do have lighting protection. Lighting protection is very low-tech.

https://en.m.wikipedia.org/wiki/Lightning_rod

stubish · on March 13, 2017

zfs is supported in Ubuntu by Canonical, both in the 'out of the box and recommended' sense and the pay-for-support sense via Ubuntu Advantage.

Unfortunately, you still can't get a root zfs partition without hassles.

X86BSD · on March 12, 2017

If you actually care about your data you simply will have to switch platforms. Illumos or FreeBSD. Linux has no answer to the horrors of garbage Filesystems and garbage lying hardware, firmware etc that has plagued UNIX admins for ever. Btrfs is apparently the best they can do and its glaringly deficient architecturally.

Filesystems are HARD. And as bcantrill notes, if you can't get simple things like epoll right your dead in the water on harder things like kqueue and zfs, Dtrace, jails, etc.

This is proving true looking at the fact Linux land can't get a filesystem right and they've been trying for decades.

Many will argue and try and justify that simple fact but as the cold harsh hammer of reality slams into their skull eventually those folks will be forced to face the reality.

Linux honestly just needs to adopt zfs. Period. But because of the Linux license that may be all but impossible now. Which leaves Linux in this untenable position of being stuck in a tar pit. Unable to adopt the clear choice to move forward and unable to have the technical ability to implement their own solution equal to zfs. So wither now Linux?

stock_toaster · on March 12, 2017

  > Linux honestly just needs to adopt zfs. Period.
  > But because of the Linux license that may be all but impossible now.

Hammer2 might be an option, once it gets finished.

tytso · on March 12, 2017

The question is whether or not "the market" cares enough about "its data" enough that they are willing to pay for what would cost. And data checksumming and RAID 1 (so you can recover from the data corruption) does cost something --- even if it is a 100% expansion in the bytes needed to store the data. And COW file systems do cost something in terms of HDD overhead. Maybe you have enough slack in terms of how you are using your HDD that you don't notice, but if you are using all of your HDD's I/O capacity with an update-in-place file system, when you switch to a COW file system, you will notice the performance hit.

If your only copy of your children's baby pictures are on a solo 4TB drive in your house --- then you may be more than willing to pay that cost. But what if your house burns down? It may be that the right answer is do data backups in the cloud, and there you will want erasure coding, and perhaps more than 100% overhead --- you might erasure coding scheme that has a 150% to 300% blowup in space so you have better data availability, not just data loss avoidance.

I do agree that file systems are hard, but at the same time you need to have a business case for doing that level of investment. This is true whether it is for a proprietary or open source code base. Many years ago, before ZFS was announced, I participated in a company wide study at my employer at the time about whether it was worth it to invest in file systems. Many distinguished engineers and fellows, as well as business people and product managers participated. The question was studied not just in terms of the technical questions, but also from the business perspective --- "was the ROI on this positive; would customers actually pay more such that the company would actually gain market share, or otherwise profit from making this investment". And the answer was "No". This made me sad, because it meant that my company wasn't going to invest in further file system technologies. But from a stock holder's perspective, the company's capital was better spent investing in middleware and database products, because there the ROI was actually positive.

From everything that I've read and from listening to Bryan's presentations, I understanding is that at Sun they did _not_ do a careful business investigation before deciding to invest in ZFS. And so, "Good for Open Solaris!" Maybe it kinda sucked if you were a Sun shareholder, but maybe Sun was going to go belly-up anyway, so might as well get some cool technology developed before the ship went under. :-)

As far as Linux is concerned, at some level, the amount of volunteer manpower and corporate investment in btrfs speaks to, I suspect, a similar business decision being made across the Linux ecosystem about whether or not the ROI of investing in btrfs makes sense. The companies that have invested in ext4 because in no-journal mode, have done so because if what you want is a back-end, local disk file system, on top of which you put a cluster file system like HDFS or GFS or Colossus, and where the data integrity is done end-to-end at the cluster file system level, and not at the local disk file system, you'll want the lowest overhead file system layer you can get. That doesn't mean that you don't care about data integrity; you do! But you've made an architectural decision about where to place that functionality, and a business decision about where to invest their proprietary and open source development work.

Each company and each user should make their own decisions. If you don't need the software compatibility and other advantages of Linux, and if data integrity is hugely important, and you don't want to go down the path of using some userspace solution like Camilstore, or some cloud service like AWS S3 or Google Compute Storage, then perhaps switching to FreeBSD is the right solution for you. Or you can choose to contribute to make btrfs more stable. And in some cases that may mean being willing to accept a certain risk for data loss at one level, because you handle your integrity requirements at a different level. (e.g., why worry about your source tree on your laptop when it's backed up regularly via "git push" to servers around the internet?)