He doesn't really mention how this relates to the current bcache code. That one is brilliant. I run a 4-drive RAID with slow spinning 2TB drives and a 100G SSD as a writeback cache on top. It detects strides, so reads and writes of large files stream directly to the RAID, while small reads and writes go via the SSD. By filtering out all the little traffic, the actual traffic to the drives is more efficient and spends less time seeking.
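(For anyone wanting to reproduce that behaviour: once the bcache device exists, writeback mode and the sequential-bypass threshold are just sysfs knobs. A rough Python sketch; it assumes an already-created and attached /dev/bcache0, and the cutoff value is only an example.)

    # Sketch: tune an existing bcache device via sysfs (run as root).
    # Assumes /dev/bcache0 was already set up with make-bcache and attached
    # to a cache set; the cutoff below is illustrative, not a recommendation.

    BCACHE = "/sys/block/bcache0/bcache"

    def set_tunable(attr, value):
        """Write one bcache sysfs tunable."""
        with open(f"{BCACHE}/{attr}", "w") as f:
            f.write(str(value))

    # Writes are acknowledged once they hit the SSD, flushed to the RAID later.
    set_tunable("cache_mode", "writeback")

    # I/O detected as sequential beyond this many bytes bypasses the SSD,
    # so large streaming reads/writes go straight to the spinning disks.
    set_tunable("sequential_cutoff", 8 * 1024 * 1024)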
In that setup bcache effectively creates a cached block device, which I then format with ext4. It seems that now the filesystem itself is included.
Not sure I like that, but it seems to be the trend in filesystems since ZFS.
He mentions multiple times in the first few paragraphs that this is an evolution of the current bcache code:
"For those who haven't kept up with bcache, the bcache codebase has been evolving/metastasizing into a full blown, general purpose posix filesystem... I and the other people working on bcache realized that what we were working on was, almost by accident, a good chunk of the functionality of a full blown filesystem - and there was a really clean and elegant design to be had there if we took it and ran with it."
I've been considering using bcache on one of my systems, last time I checked it seemed a bit unready in terms of all the steps needed to set it up and keep it working. What has your experience been?
I've run bcache-backed databases in production for about a year or so - it's been absolutely rock solid, and I'm very happy with it.
We spent a lot of time prototyping basically every SSD caching tech that existed at the time, and bcache was the clear winner (note that dm-cache, the tech underpinning lvm-cache, was pretty immature at the time).
You do need a workload with a fairly reasonable cache hit rate to get the best out of it, but that's obviously true of all such technologies.
Yes, see the "What is bcache?" section here:
http://bcache.evilpiepirate.org/
He compares it to L2ARC... the same, but with writeback (i.e. writes can be considered finished once they are recorded by the SSD; faster, but less safe unless your SSD is also a RAID).
The bcachefs part at the top of that page is the new stuff this announcement is describing.
Another file system pursuing the lofty goals of ZFS/btrfs on Linux (the other being tux3). It's a catch-22 for these guys: if they don't release early, there will be no users to test and report bugs; if they do release early, it's half-baked. I hope the guy has a large stash in his bank account. File systems take notoriously long to stabilize.
The performance is so close on all filesystems right now on Linux, why not just contribute to the code for one of the existing solutions and make it more mature?
PS: Kent's benchmark numbers struck me for one particular aspect I haven't seen in other benchmarks: max latency. EXT4 is looking darn good in that regard. I still use EXT4 over XFS - I simply do not trust XFS enough yet.
You should give it another look. My team's used it for a mission critical project for over a decade. It's older and more mature than ext4, IMO. For sustained write throughput, it does much better than ext4. That said, many folks have utilization which looks more like random I/O than sequential.
I'd also like to hear any rationale for why one would trust XFS less than EXT4. There might be arguments against using XFS -- but trust/maturity sounds like a very strange one to pick?
Yeah, still better than ext2. I think part of it is that many people expect the file either not to have grown at all, or to contain the data they wrote to it.
IIRC (and I'm a bit hazy) there were other issues too that have since been fixed. The long delay between write and commit meant that a lot of bugs that would otherwise have been vanishingly rare got exposed. Likely the ext systems have/had similar bugs that have only happened a single-digit number of times in the past 20 years.
ext3 with the default data=ordered was the filesystem that taught a generation of Linux developers that fsync(2) does nothing to improve data consistency and has crippling performance impact, so should never ever be called.
The inevitable results of applications developed for ext3 running on other file systems incurring data loss on system crashes then got blamed on those other filesystems (including ext4, ironically).
I don't think the performance for btrfs can be considered "close".
And the thing with filesystems is that the format can't really change once stable - and you get locked into an architecture that dictates an awful lot about what features you can provide. So if you want to add ZFS-like features to a Linux fs that doesn't eat your data or have abysmal performance (btrfs), then you're basically back at square one and have to write a new one.
Actually, btrfs looks pretty good on the Phoronix benchmarks (note that higher is better in all those tests - go look at the write tests).
On Kent's benchmarks, something is up with the results in the thousands of MB/s: high-end flash is fast, but not that fast, unless the entire transaction is taking place in the RAM cache rather than on the drive - which would explain btrfs having numbers in the hundreds of MB/s instead.
My experience with btrfs in the real world was that it was unreasonably slow - at least an order of magnitude slower last time I tested it (which, admittedly, was a year or two ago). I keep hoping they'll introduce some massive speedups, but until then it's ext4 or xfs for me.
> why not just contribute to the code for one of the existing solutions and make it more mature?
Well, Linus took that gamble when he blessed btrfs as the future of Linux filesystems four or five years ago, and it hasn't really worked out as intended.
Last time I used XFS there were some metadata operations that were just painfully slow, and at least one of them was used by svn update. To the point that an svn update of a large repository over the LAN on XFS was ~10x slower than on reiserfs.
Now that is something I'd really like to see in Linux filesystems. AFAIK the GNU/Linux implementations of ZFS do not support ECC. Only the Oracle version.
There are a few different profiles offered in the Ceph erasure coded pool support; AFAIK they are based on research papers. I assume due diligence was done to ensure they can be used openly (IANAL, etc.).
> Now that is something I'd really like to see in Linux filesystems. AFAIK the GNU/Linux implementations of ZFS do not support ECC. Only the Oracle version.
When people talk about "ECC" in relation to ZFS, they are generally referring to ECC memory, i.e. RAM that detects and corrects single-bit corruption. This does not need any support from software, it is a hardware thing. So if you have ECC memory, it will "work with ZFS" no matter your OS.
Erasure coding here refers to redundancy in the block storage, so you can lose a disk and still maintain access to the data. And I can confirm that the linux ZFS implementation, along with all others (as they're all based on the same codebase), support that.
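To make the distinction concrete, here's a toy single-parity example of what erasure coding buys you (RAID-5-style XOR, nothing ZFS-specific):

    # Toy single-parity erasure code (RAID-5 style): lose any one "disk"
    # in the stripe and rebuild it by XOR-ing the survivors. Purely
    # illustrative; no real filesystem lays out parity this naively.

    def xor_blocks(*blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    d0 = b"hello world....."          # data block on disk 0
    d1 = b"more user data.."          # data block on disk 1
    parity = xor_blocks(d0, d1)       # parity block on disk 2

    # Disk 1 dies: its block is reconstructed from the remaining stripe.
    assert xor_blocks(d0, parity) == d1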
AFAIK zfs doesn't use erasure coding. It has a checksum for each block. Together with redundant storage of blocks, this allows it to determine which block is corrupt when two copies of the same block have different contents.
Nope, all blocks, unless wikipedia is wrong or I am completely misunderstanding the following:
"When a block is accessed, regardless of whether it is data or meta-data, its checksum is calculated and compared with the stored checksum value of what it "should" be. If the checksums match, the data are passed up the programming stack to the process that asked for it; if the values do not match, then ZFS can heal the data if the storage pool provides data redundancy (such as with internal mirroring), assuming that the copy of data is undamaged and with matching checksums."
What ZFS implementations are you talking about? I'm only aware of zfsonlinux which AFAIK is CDDL (and out-of-tree -- but works fine) -- is there a port licensed under GPL?
This isn't a union filesystem; it's a primary filesystem that happens to use COW internally for better reliability. It's comparable to ZFS, btrfs, or tux3.
Looks very interesting. Would be cool if it was also compared against my favourite "dark horse" fs: nilfs2[1]. I always thought it'd make a great fs for flash storage -- but in the end I've generally ended up running ext4 over LUKS on my SSDs so far.
> It's taken a long time to get to this point - longer than I would have guessed if you'd asked me back when we first started talking about it - but I'm pretty damn proud of where it's at now.
This seems true for most of the non-trivial projects. :)
The guide for bcache mentions that RAID/erasure coding isn't implemented yet - does anyone know whether that is still up to date, and/or when it could be expected to be integrated?
Unfortunately lkml.org is quite buggy. Try again in the future, or find a different mailing list archive, subject "[ANNOUNCE] bcachefs - a general purpose COW filesystem".
Depends on the use case, but ext4 is a pretty good default. I like using IBM's JFS on my laptops to prolong battery life since it minimizes CPU usage [0][1][2].
ext4 is really good at the single user (single process / single thread) use case.
Other file systems shine in different categories, like number of features (btrfs), snapshots (btrfs), very large devices (xfs), or really high concurrency (xfs).
ext4 has its place - and deservedly so. But picking the right file system really depends on your requirements. For example, if you're building a storage server then ZFS would be a better fit.
XFS is also worth taking notice of since the benchmarks I've read rate it for having faster read and write speeds than ext4. However I don't have extensive first hand experience running XFS (something which I'm currently addressing).
I've set up several hundred storage servers with XFS (up to 250 TB per FS) in the past 15 years. There were sour times, 10 years ago or more, but in recent times it's been rock solid and beat the crap out of other FS.
Note that XFS has always hated crappy hardware, because it pushes the hardware to its limits.
I'm running XFS on a couple of my disks. It seems quick and reliable so far. My only complaint is that you can expand a volume (online, too), but you can't shrink it. Shrinking volumes comes in handy occasionally, and when you need it, you need it. But so far it hasn't bitten me yet.
I'm storing a lot of small files on an ext4 drive (on a 48 GB Linode). Right now it's 5M files, and they take up 40G or so. I formatted it to use at least twice as many inodes as the default (otherwise I would have run out), and then lowered the block size to 2K rather than 4K, which reduced a ton of wasted space because the files are so small.
That's not terabytes, but it's definitely a lot of small files for the size. I probably will end up using 50M or so files up from 5M (on a bigger drive).
So is ext4 less inefficient for this use case than other file systems? Why? I hear you can turn on options like dir_index but I haven't tried it yet.
Note that since Linux 3.8 [1] ext4 can embed small files into inodes, thus using less space and allowing faster reads. Using that feature and adjusting the inode size, you can control which file sizes get inlined.
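If you want to play with it without touching a real disk, a throwaway image file works; rough sketch (needs a reasonably new e2fsprogs, and the block/inode sizes are just examples):

    import subprocess

    # Scratch experiment: build an ext4 image with inline_data and larger
    # inodes, then confirm the feature is enabled. No root needed since
    # it's a plain file, not a real block device.

    IMG = "ext4-inline-test.img"

    with open(IMG, "wb") as f:
        f.truncate(1 << 30)                       # 1 GiB sparse image

    # 2K blocks, 512-byte inodes, inline_data so tiny files live in the inode.
    subprocess.run(
        ["mkfs.ext4", "-F", "-b", "2048", "-I", "512", "-O", "inline_data", IMG],
        check=True,
    )

    # 'inline_data' should show up in the superblock's feature list.
    info = subprocess.run(["dumpe2fs", "-h", IMG], capture_output=True, text=True)
    print([line for line in info.stdout.splitlines() if "features" in line.lower()])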
OK interesting, so this capability is on by default then? I didn't change the default inode size, but it looks like I could SAVE space by increasing it (from 256 or something). Since I have 1.91M files less than 2048 bytes, I guess I could increase it enough to store most/many of those.
And I guess it halves the IOPS for serving those files, since with a cold cache you would otherwise need to read both the block holding the inode and the block holding the data.
In my experience ext4 becomes very unhappy/slow when you have many files/folders in the same folder (I noticed this with backuppc). dir_index may have helped me with this.
Do you have a reason for keeping them stored as lots of very small files in a filesystem?
If not, you might take a look at this page on the Sqlite home site: https://www.sqlite.org/intern-v-extern-blob.html and consider storing the small items within a sqlite database file instead.
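The API side of that is minimal; a sketch with the stdlib sqlite3 module (schema and names are just examples):

    import sqlite3

    # Sketch of storing small files as BLOBs in sqlite instead of as
    # individual files; the schema here is illustrative only.

    db = sqlite3.connect("smallfiles.db")
    db.execute("CREATE TABLE IF NOT EXISTS blobs (key TEXT PRIMARY KEY, data BLOB)")

    def put(key, data):
        with db:                                  # implicit transaction
            db.execute("INSERT OR REPLACE INTO blobs VALUES (?, ?)", (key, data))

    def get(key):
        row = db.execute("SELECT data FROM blobs WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    put("docs/readme.txt", b"hello")
    assert get("docs/readme.txt") == b"hello"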
Thanks for the pointer. I actually am using sqlite for the metadata, but not the data. I was thinking that sqlite wouldn't be happy with 40+ GB of files, though I haven't tested it. This page seems to say it's doable with some tuning [1]. (Although as mentioned I may go up to 500 GB or more later)
It's nice to see that intuitive performance characteristic actually tested -- small files perform better when stored internally, but large files perform better when stored on the file system.
I want to use rsync to sync files between machines, so that's a consideration. Otherwise I would have to roll my own.
With sqlite you'll also get duplication of the OS buffer cache in user space. Not sure how much of an issue that is in practice.
I like sqlite a lot, but it seems to have more nontrivial data structures than a file system, so it's harder to reason about performance and space overhead. It also has more concurrency issues than a file system.
One thing I like about the file system is that I was able to account for the waste within 1%. I wrote a simple script to compare the waste with 2K ext4 block size vs 4K block size, and it predicted 7 GB of saved space (out of 40 GB or so), and that turned out to be exactly what happened in practice. I don't think I'd be able to account for space overheads so exactly in sqlite (in theory or in practice).
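The calculation is roughly this (a sketch, not the exact script; it ignores sparse files, inline data and per-inode/metadata overhead):

    import os

    # Rough tail-block waste comparison: the last, partially-filled block
    # of each file wastes block_size - (size % block_size) bytes.

    def wasted_bytes(root, block_size):
        waste = 0
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                size = os.path.getsize(os.path.join(dirpath, name))
                if size % block_size:
                    waste += block_size - (size % block_size)
        return waste

    ROOT = "/data/smallfiles"                     # hypothetical path
    for bs in (2048, 4096):
        print(bs, round(wasted_bytes(ROOT, bs) / 2**30, 2), "GiB wasted")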
The other dimension is IOPS as mentioned here, but I think that will be easier with just a file system too. There are more observability and debugging tools with file systems.
It's possible that, because of the inode issue, I could go for storing small files internally and big files externally though. The distribution of sizes is pretty much Zipfian. That isn't too hard to code, and the complexity may be justified.
Actually I just checked, and the bottom 1.91M files out of 5M are less than 1024 bytes each and sum to only 1 GB! The other 3.1M files sum to 34 GB (the largest file being 2 GB or so).
The distribution was more skewed than I thought. So I think the hybrid solution with internal sqlite for small files may be a good idea. Thanks!
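If I go that route, the dispatch logic itself is tiny; something like this (threshold and paths are made up, error handling omitted):

    import os
    import sqlite3

    # Sketch of the hybrid layout: blobs under a size threshold go into
    # sqlite, larger ones stay as plain files.

    THRESHOLD = 1024                              # bytes, per the size split above
    FILE_ROOT = "/data/big"                       # hypothetical directory

    db = sqlite3.connect("small.db")
    db.execute("CREATE TABLE IF NOT EXISTS small (key TEXT PRIMARY KEY, data BLOB)")

    def store(key, data):
        if len(data) < THRESHOLD:
            with db:
                db.execute("INSERT OR REPLACE INTO small VALUES (?, ?)", (key, data))
        else:
            path = os.path.join(FILE_ROOT, key)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                f.write(data)

    def load(key):
        row = db.execute("SELECT data FROM small WHERE key = ?", (key,)).fetchone()
        if row:
            return row[0]
        with open(os.path.join(FILE_ROOT, key), "rb") as f:
            return f.read()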
ext4 is usually fine, but it has really bad corner case behavior with respect to large directories (> 10k dentries). Iterating through them becomes increasingly slow as they grow larger -- worse than O(n), which is the expected case.
What's worse is that even if you then delete all the files in the directory, performance does not return to normal; iterating becomes just as slow at that point with 10 files. The only solution is to remove the directory and recreate it.
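It's easy to reproduce if you want to see it for yourself; a quick sketch (it creates and then deletes ~200k files under ./dentry-test, so run it somewhere disposable):

    import os
    import time

    # Reproduction sketch: time directory iteration with many entries,
    # then again after deleting them all.

    TESTDIR = "dentry-test"
    N = 200_000
    os.makedirs(TESTDIR, exist_ok=True)

    def time_listing():
        start = time.perf_counter()
        count = sum(1 for _ in os.scandir(TESTDIR))
        return time.perf_counter() - start, count

    for i in range(N):
        open(os.path.join(TESTDIR, f"f{i:07d}"), "w").close()
    print("listing with %d entries: %.3fs" % (N, time_listing()[0]))

    for name in os.listdir(TESTDIR):
        os.unlink(os.path.join(TESTDIR, name))
    print("listing when empty again: %.3fs" % time_listing()[0])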
We've been running XFS on our 1000+ hosts for 4 years without issue. There are old stories of XFS issues, but it's been rock solid for quite some time, at least since RHEL 5.
ext4 is comparable to NTFS or HFS+. It mostly works, but other filesystems like btrfs or ZFS give you proper data integrity (checksumming) and other features.
BtrFS will stop being "unstable" when development is done, in 20 years (much like ext*).
I run BtrFS on my 2 laptops (one with SSD, one with HDD): the only issue I ran into is metadata space exhaustion: I have well over 100 snapshots and suddenly, the drive becomes unwritable. Once you know that can happen, you're more careful about backups and when to remove the old snapshots.
Also, Debian might not be the best place for bleeding-edge stuff like BtrFS: you should consider a distribution with a more recent version of kernel and utils (archlinux, gentoo...)
I might be in the minority here, but I've been burned twice by btrfs (in Q3 and Q4 of 2014) and was only able to recover the data with great effort. Both times the errors only occurred after extended use and without a clear cause. My (unfortunately necessarily vague) bug report was met with scepticism as to whether my hardware was at fault (I ran memtest for 48h without any errors, and never had any problems with other filesystems), which was quite frustrating after I had already lost a significant amount of time to their experimental filesystem and at least wanted to help others avoid the same problems in the future.
I'll consider btrfs again when a large distribution has adopted it as default and it has worked for them for at least a year.
Yes, and I still wonder at the sanity of making it the default given the large number of known problems.
I've lost data, irrecoverably, with Btrfs on at least two occasions--the filesystem was completely toasted. More recently I've encountered the unbalancing issue--the filesystem would become read-only approximately every 36 hours with sustained intensive use. That's simply unacceptable--the default filesystem needs to be utterly reliable, and becoming unusable at unpredictable intervals is not suitable for production.
I've been running raid-1 btrfs on workstations and a server for a while, haven't seen any issues at all. I did have an issue on a btrfs laptop, it didn't cope with on/off all that well, but that was more than a year ago, maybe I will retest.
I'm a ZFS user, but recently I took the plunge and tried btrfs out. I'm mostly impressed. This was after an encounter with ZFS's horrendous performance with dedup enabled. btrfs at least lets you manually dedup (which works fine for my purposes - I rely heavily on hashdeep in my workflows already, and it's trivial to locate duplicates in hashdeep inventories).
My sense is that it's the more exotic parts of btrfs (i.e. the native RAID modes beyond RAID-0/1, especially the stuff intended as RAID-Z alternatives) that have the most issues.
But ext4 doesn't natively support any of that either, so maybe any place you would consider using ext4, btrfs is potentially OK as an alternative. I mean, if you approach it as ext4 plus a few very useful features ext4 is missing, I think you hit the most useful and widely used corners. You can still use md etc. I'm not talking about datacenter-scale things, I'm talking workstations and laptops.
I currently run a zfs-on-linux RAID-1 and have been testing btrfs for one of my backup copies. There are parts of btrfs that seem to work fine and that I really like a lot compared to the ZFS equivalents. I'm a bit paranoid about data loss and corruption (I do research with large sets of medical images - it's sort of like video, except the data is typically very compressible, but few tools work with compressed images, so filesystem-level compression is great for this stuff), so I keep hashdeep inventories of everything to document provenance. One thing about medical image processing workflows is that you easily end up with multiple copies of images in different directories. I hoped to use ZFS deduplication for this, but it is a nightmare. With btrfs you at least have the option of safely deduping files manually using cp --reflink, so you can dedup periodically (or even based on knowledge of how the data is laid out in the filesystem) or add it into workflow scripts, which works well.
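The manual pass is easy to script; here's a rough version that hashes files itself rather than parsing an existing hashdeep inventory (assumes coreutils cp on a reflink-capable filesystem like btrfs):

    import hashlib
    import os
    import subprocess

    # Rough manual dedup for a reflink-capable filesystem: group files by
    # content hash, then replace each duplicate with a reflinked copy of
    # the first instance so they share extents on disk.

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def dedup(root):
        first_seen = {}
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                digest = sha256_of(path)
                if digest in first_seen:
                    # Identical content: overwrite the duplicate with a
                    # reflink clone of the first copy we saw.
                    subprocess.run(
                        ["cp", "--reflink=always", first_seen[digest], path],
                        check=True,
                    )
                else:
                    first_seen[digest] = path

    dedup("/srv/images")                          # hypothetical dataset root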
Unfortunately, the one thing that I haven't figured out is how to make a bit-by-bit clone of an existing btrfs filesystem. You can dd, but that leads to issues because device UUIDs are embedded in the metadata. Working on btrfs, I've managed to consolidate 5.1TB of data that includes very compressible source images, duplicates and a lot of text files into 1.2TB, but I can't figure out how to correctly duplicate the 1.2TB version of the data without it transiently exploding back to 5.1TB in the process.
I do like the way ZFS approaches the concept and organization of "datasets" better than btrfs's approach, though. btrfs's approach seems more ad hoc and less opinionated. I think if you lack discipline and experience with large datasets, btrfs can enable you to do unwise things that will seriously bite you in the butt down the road, whereas ZFS enforces some discipline. I think it's probably because ZFS was designed by people who have seen hell.
No, I have not actually tested it yet, but according to my research the current state of btrfs send/receive is this:
1. btrfs send streams the files (i.e. it decompresses files that are compressed on disk)
2. The send stream (optionally) is compressed
3. btrfs receive (optionally) recompresses data into the destination
If you have a large dataset and have taken the time to use one of the slower compression algorithms/settings, this means you have to wait for that compression to happen all over again at the destination. (You could have different compression settings on the two btrfs filesystems; btrfs send/receive is one of the ways to migrate data between those settings.)
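For reference, the pipeline that "compress the send stream" usually refers to looks like this (a sketch; the snapshot/destination paths and the choice of zstd are only examples, and the source must be a read-only snapshot):

    import subprocess

    # Sketch of a compressed btrfs send/receive round trip. The send
    # stream is decompressed before 'btrfs receive' writes it out, so if
    # the destination is mounted with compress=..., the data is compressed
    # all over again there.

    SNAPSHOT = "/mnt/src/@data-snap"              # read-only snapshot to send
    DEST = "/mnt/backup"                          # destination btrfs mount

    # Send + compress into a file.
    with open("data-snap.btrfs.zst", "wb") as out:
        send = subprocess.Popen(["btrfs", "send", SNAPSHOT], stdout=subprocess.PIPE)
        subprocess.run(["zstd", "-19"], stdin=send.stdout, stdout=out, check=True)
        send.stdout.close()
        send.wait()

    # Later: decompress + receive on the destination filesystem.
    unzstd = subprocess.Popen(["zstd", "-dc", "data-snap.btrfs.zst"],
                              stdout=subprocess.PIPE)
    subprocess.run(["btrfs", "receive", DEST], stdin=unzstd.stdout, check=True)
    unzstd.stdout.close()
    unzstd.wait()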
This guy should rush to write documentation first... He will probably "build to budget", exhausting his money and his time on coding.
If he does so and finishes the project with a burnout, there is a huge question of how the people running it in production will be able to fix anything without good knowledge of the design during his recovery time (recovery =~ burnout time).
"Read the source, Luke" is nice, but 10 pages of natural language speak louder than 10K lines of code, even with comments.
I dislike this kind of coder, who goes for the beef and despises the grunt work as not for them.
Documentation should come first. This code will have to be maintained, since he - like a good drug dealer - is trying to hook people on using his code.