Bcachefs – A general purpose COW filesystem (lkml.org)
385 points by koverstreet on Aug 21, 2015 | 101 comments



He doesn't really mention how this relates to the current bcache code. That one is brilliant. I run a 4-drive raid with slow spinning 2TB drives and a 100G SSD as a writeback cache on top. It detects strides, so reads and writes of large files stream directly to the raid, but little reads and writes go via the SSD. By filtering out all the little traffic, the actual traffic to the drives is more efficient and spends less time seeking. In that setup bcache effectively creates a cached block device and I format that with ext4. It seems now it includes the filesystem. Not sure I like that, but it seems to be the trend in filesystems since ZFS.
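For reference, that kind of setup looks roughly like the sketch below - hypothetical device names, and flags recalled from memory, so check the bcache docs before copying it:

    # Sketch of a bcache-over-md setup (device names are examples).
    # make-bcache and the sysfs knobs are standard bcache tooling, but
    # verify the flags against your distro's documentation.
    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run(["make-bcache", "-B", "/dev/md0"])   # slow 4-drive array as backing device
    run(["make-bcache", "-C", "/dev/sde"])   # the 100G SSD as the cache device
    # (you still need to register/attach the cache set before /dev/bcache0 appears)
    run(["mkfs.ext4", "/dev/bcache0"])       # format the cached block device

    # switch from the default writethrough to writeback caching
    with open("/sys/block/bcache0/bcache/cache_mode", "w") as f:
        f.write("writeback")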


He mentions multiple times in the first few paragraphs that this is an evolution of the current bcache code:

"For those who haven't kept up with bcache, the bcache codebase has been evolving/metastasizing into a full blown, general purpose posix filesystem... I and the other people working on bcache realized that what we were working on was, almost by accident, a good chunk of the functionality of a full blown filesystem - and there was a really clean and elegant design to be had there if we took it and ran with it."


I've been considering using bcache on one of my systems; last time I checked, it seemed a bit unready in terms of all the steps needed to set it up and keep it working. What has your experience been?


I've run bcache-backed databases in production for about a year or so - it's been absolutely rock solid, and I'm very happy with it.

We spent a lot of time prototyping basically every SSD caching tech that existed at the time and bcache was the clear winner (note that dm-cache, the tech underpinning lvm-cache was pretty immature at the time).

You do need a workload with a fairly reasonable cache hitrate to get the best out of it, but that's obviously true of all such technologies.


I'm afraid I still see little bugs appearing on the mailing list and on my test systems.

For bugs, look at: https://github.com/koverstreet/linux-bcache/issues and http://dir.gmane.org/gmane.linux.kernel.bcache.devel



I have lost data, twice in the last six months.


Is your SSD similar to L2ARC in ZFS?


Yes, see the "What is bcache?" section here: http://bcache.evilpiepirate.org/ He compares it to L2ARC... the same, but with writeback (i.e. writes can be considered finished when they are recorded by the SSD. Faster, but less safe unless your SSD is also a raid.)

The bcachefs part at the top of that page is the new stuff this announcement is describing.


Ok, so it is a combination of L2ARC and ZIL (zfs intent log), then?

L2ARC improves read operations, ZIL improves writes.


It's neither. It's just a writeback cache, that gets bypassed for sequential operations.


Another file system pursuing the lofty goals of ZFS/btrfs on Linux (the other being tux3). It is a catch-22 situation for these guys: if they do not release it early, there will be no users to test and report bugs; if they do release it early, it is half baked. I hope the guy has a large stash in his bank. File systems take notoriously long to stabilize.


I think it's great to see new FSs targeted to SSDs. There's an alpha version to try out: http://bcache.evilpiepirate.org/


The performance is so close on all filesystems right now on Linux, why not just contribute to the code for one of the existing solutions and make it more mature?

http://www.phoronix.com/scan.php?page=article&item=linux_rai...

F2FS is looking really promising for SSD

P.S. Kent's benchmark numbers struck me for one particular aspect I haven't seen in other benchmarks - max latency - and EXT4 is looking darn good in that aspect. I still use EXT4 over XFS; I simply do not trust XFS enough yet.


> I simply do not trust XFS enough yet

You should give it another look. My team's used it for a mission critical project for over a decade. It's older and more mature than ext4, IMO. For sustained write throughput, it does much better than ext4. That said, many folks have utilization which looks more like random I/O than sequential.


I'd also like to hear any rationale for why one would trust XFS less than EXT4. There might be arguments against using XFS -- but trust/maturity sounds like a very strange one to pick?


I don't know how it is now, but XFS was at one point, almost by design, set up to lose your data on power failure.


Referring to this?: http://xfs.org/index.php/XFS_FAQ#Q:_Why_do_I_see_binary_NULL...

Are you sure it was worse than ext2/3?

[ed: Looks like it might be better than/as good as ext2, but somewhat worse than ext3/4 -- apparently XFS journals only metadata: http://superuser.com/questions/84257/xfs-and-loss-of-data-wh...

Still doesn't seem like it destroys data on power loss -- just the usual: if data isn't written to disk, then it's not written to disk.]


Yeah, still better than ext2. I think part of it is that many people expect the file either not to have grown, or to have the data you wrote to it.

IIRC (and I'm a bit hazy) there were other issues too that have since been fixed. The long delay between write and commit meant that a lot of bugs that would otherwise have been vanishingly rare got exposed. Likely the ext systems have/had similar bugs that have only happened a single-digit number of times in the past 20 years.


ext3 with the default data=ordered was the filesystem that taught a generation of Linux developers that fsync(2) does nothing to improve data consistency and has crippling performance impact, so should never ever be called.

The inevitable results of applications developed for ext3 running on other file systems incurring data loss on system crashes then got blamed on those other filesystems (including ext4, ironically).
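(For context, the pattern fsync(2) exists for is the classic write-to-temp-then-rename dance; a rough Python sketch, with example filenames:)

    # Durable-replace pattern that many ext3-era applications skipped
    # because fsync() was so painful there. Filenames are examples.
    import os

    def atomic_write(path, data):
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # push the file's data to stable storage
        os.rename(tmp, path)          # atomically replace the old file
        # fsync the directory so the rename itself survives a crash
        dirfd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
        try:
            os.fsync(dirfd)
        finally:
            os.close(dirfd)

    atomic_write("config.json", b'{"version": 2}')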


You could say the same about ext4. It was fixed.


I don't think the performance for btrfs can be considered "close".

And the thing with filesystems is that the format can't really change once stable - and you get locked into an architecture that dictates an awful lot about what features you can provide. So if you want to add zfs-like features to a linux fs that doesn't eat your data/have abysmal performance (btrfs) then you're basically back at square 0 and you have to write a new one.


From the article:

"[..] the main goal of bcachefs to match ext4 and xfs on performance and reliability, but with the features of btrfs/zfs."


And btrfs is really frightening there.


Actually, btrfs looks pretty good on the Phoronix benchmarks (note that higher is better in all those tests - go look at the write tests).

On Kent's benchmarks, something is up with the results in the thousands of MB/s - high-end flash is fast, but not that fast, unless the entire transaction is taking place in RAM cache instead of on the drive - which would explain btrfs having numbers in the hundreds of MB/s instead.


Even mid range PCIe flash devices are fast enough to make the Btrfs numbers seem unreasonably slow.


My experience with btrfs in the real world was that it was unreasonably slow. At least an order of magnitude slower last time I tested it (which admittedly was a year or 2 ago). I keep hoping they'll introduce some massive speedups but until then it's ext4 or xfs for me.


High end flash is that fast. Btrfs is simply not a good filesystem (yet?).


> why not just contribute to the code for one of the existing solutions and make it more mature?

Well, Linus took that gamble when he blessed btrfs as the future of Linux filesystems four or five years ago, and it hasn't really worked out as intended.


Is there a link for his blessing of btrfs and an explanation?


Last time I used XFS there were some metadata operations that were just painfully slow, and at least one of them was used by svn update. To the point that an svn update of a large repository over the LAN on XFS was ~10x slower than on reiserfs.


XFS has recently-ish (depending on your definition of recently) had a significant workover in this area.

Before this, EXT4 was the king of metadata-intensive workloads. Nowadays XFS wins.

I'm guessing you used the older code (rhel/centos pre-6.3 perhaps?) and ran into this issue.


ext4 didn't exist when I last used xfs, so maybe I'll try it again.


> PLANNED FEATURES: erasure coding

Now that is something I'd really like to see in Linux filesystems. AFAIK the GNU/Linux implementations of ZFS do not support ECC. Only the Oracle version.


The problem with the erasure coding implementations: patents. An example:

https://www.techdirt.com/articles/20141115/07113529155/paten...


There are a few different profiles offered in the Ceph erasure-coded pool support; AFAIK they are based on research papers. Assume due diligence was done to ensure they can be used openly (IANAL etc.).

http://ceph.com/docs/master/rados/operations/erasure-code-pr...

edit: Indeed, checked jerasure site and that's gone.


> Now that is something I'd really like to see in Linux filesystems. AFAIK the GNU/Linux implementations of ZFS do not support ECC. Only the Oracle version.

When people talk about "ECC" in relation to ZFS, they are generally referring to ECC memory, i.e. RAM that detects and corrects single-bit corruption. This does not need any support from software; it is a hardware thing. So if you have ECC memory, it will "work with ZFS" no matter your OS.

Erasure coding here refers to redundancy in the block storage, so you can lose a disk and still maintain access to the data. And I can confirm that the Linux ZFS implementation, along with all others (as they're all based on the same codebase), supports that.
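To make the distinction concrete, the simplest erasure code is plain XOR parity (RAID5-style). A toy sketch - not how any real filesystem lays data out:

    # Toy single-parity erasure code: any one lost "disk" can be rebuilt
    # by XORing the survivors. Real schemes (e.g. Reed-Solomon) tolerate
    # more failures, but the principle is the same.
    from functools import reduce

    def xor_blocks(blocks):
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    data = [b"disk0 block", b"disk1 block", b"disk2 block"]
    parity = xor_blocks(data)                  # stored on a fourth disk

    # lose disk1, then reconstruct it from the survivors plus parity
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]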


Unless he is talking about scrubbing and self-healing? In which case, the same thing still applies -- all releases do this as far as I know.

See: https://pthree.org/2012/12/11/zfs-administration-part-vi-scr...


Well, scrubbing and self-healing are really just the user-facing operations enabled by the erasure coding, so it's really the same thing.


AFAIK zfs doesn't use erasure coding. It has a checksum for each block. Together with redundant storage of blocks, this allows it to determine which block is corrupt when two copies of the same block have different contents.
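Conceptually, the heal path looks something like this toy sketch (not ZFS's actual on-disk logic):

    # Checksum-plus-mirror self-healing in miniature: the checksum says
    # which copy is good; the redundancy supplies the data to repair with.
    import hashlib

    def read_and_heal(copies, stored_checksum):
        good = next((c for c in copies
                     if hashlib.sha256(c).hexdigest() == stored_checksum), None)
        if good is None:
            raise IOError("all copies corrupt - unrecoverable")
        # rewrite any mirror that doesn't match (the "heal" step)
        return good, [good for _ in copies]

    block = b"important data"
    checksum = hashlib.sha256(block).hexdigest()
    mirrors = [block, b"important dat\x00"]    # second copy silently corrupted
    good, mirrors = read_and_heal(mirrors, checksum)
    assert mirrors[0] == mirrors[1] == block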


Redundant storage of the metadata - not the actual blocks. But at least you know they're corrupted if they become corrupted ;)


Nope, all blocks, unless wikipedia is wrong or I am completely misunderstanding the following:

"When a block is accessed, regardless of whether it is data or meta-data, its checksum is calculated and compared with the stored checksum value of what it "should" be. If the checksums match, the data are passed up the programming stack to the process that asked for it; if the values do not match, then ZFS can heal the data if the storage pool provides data redundancy (such as with internal mirroring), assuming that the copy of data is undamaged and with matching checksums."

https://en.wikipedia.org/wiki/ZFS


"ZFS can heal the data IF the storage pool provides data redundancy"

You can configure ZFS either with or without redundancy.


Yes, which is why I said "Together with redundant storage of blocks" in my original comment.


I'd love to see it too, but I've yet to see an implementation of EC that wasn't far, far too slow to be used for any kind of performant workload.


There is RozoFS; it's from a French company that has made a high-performance erasure-coding FS.

https://en.wikipedia.org/wiki/RozoFS


What ZFS implementations are you talking about? I'm only aware of zfsonlinux which AFAIK is CDDL (and out-of-tree -- but works fine) -- is there a port licensed under GPL?


And just today I was wondering what happened with this btrfs patch set which introduces some of the benefits of bcache to btrfs: http://marc.info/?l=linux-btrfs&m=129622115023547


Sounds like a really cool alternative to aufs, would love to see a Docker storage driver added to support this, once stable. https://docs.docker.com/reference/commandline/daemon/#daemon...

Any chance this might get added to the mainline kernel? aufs was never able to get merged in. OverlayFS was merged in, but IMHO isn't as good as aufs.


This isn't a union filesystem; it's a primary filesystem that happens to use COW internally for better reliability. It's comparable to ZFS, btrfs, or tux3.


Looks very interesting. Would be cool if it was also compared against my favourite "dark horse" fs: nilfs2[1]. I always thought it'd make a great fs for flash storage -- but in the end I've generally ended up running ext4 over LUKS on my SSDs so far.

[1] http://nilfs.sourceforge.net/en/


Does it outperform F2FS on flash storage, too?


Interested in that too, I was following F2FS progress a bit as well.


> It's taken a long time to get to this point - longer than I would have guessed if you'd asked me back when we first started talking about it - but I'm pretty damn proud of where it's at now.

This seems true for most of the non-trivial projects. :)


the guide for bcache mentioned raid/erasure coding isn't implemented yet - anyone know if that is up to date and/or when it could be expected to be integrated?


> anyone know if that is up to date and/or when it could be expected to be integrated?

If btrfs is used as an approximation, approximately three weeks after the heat death of the universe.


Anybody else just see this? http://cl.ly/ccI0


Unfortunately lkml.org is quite buggy. Try again in the future, or find a different mailing list archive, subject "[ANNOUNCE] bcachefs - a general purpose COW filesystem".


ext4 always seems to be the best... no?


Depends on the use case, but ext4 is a pretty good default. I like using IBM's JFS on my laptops to prolong battery life since it minimizes CPU usage [0][1][2].

[0] http://fsbench.netnation.com/

[1] https://www.debian-administration.org/article/388/Filesystem...

[2] http://linuxgazette.net/122/TWDT.html#piszcz


those benchmarks are 10+ years old - are you sure they are still valid?


ext4 is really good at the single user (single process / single thread) use case.

Other file systems shine in different categories, like number of features (btrfs), snapshots (btrfs), very large devices (xfs), or really high concurrency (xfs).


ext4 has its place - and deservedly so. But picking the right file system really depends on your requirements. For example, if you're building a storage server then ZFS would be a better fit.

XFS is also worth taking notice of, since the benchmarks I've read rate it as having faster read and write speeds than ext4. However, I don't have extensive first-hand experience running XFS (something I'm currently addressing).


I've set up several hundred storage servers with XFS (up to 250 TB per FS) in the past 15 years. There were sour times, 10 years ago or more, but in recent times it's been rock solid and beats the crap out of other FSes. Note that XFS has always hated crappy hardware, because it pushes it to the limits.


Unsurprising given it was built by the same people who built these:

https://www.sgi.co.jp/features/2001/dec/fleet_numerical/imag...

And more recently this:

https://www.sgi.com/pdfs/4555.pdf

Because pushing the envelope of HPC awesomeness was just another day in the office for folks at SGI. :)


XFS is hands down the best at sustained write speeds. Each time I've stacked it up against the competition it wins.


I'm running XFS on a couple of my disks. It seems quick and reliable so far. My only complaint is that you can expand a volume (online, too), but you can't shrink it. Shrinking volumes comes in handy occasionally, and when you need it, you need it. But so far it hasn't bitten me yet.


Until you get to many small files (multiple terabytes) or very large volumes.

Then its metadata overheads consume more I/O than the data itself.


Can you elaborate on this or provide a pointer?

I'm storing a lot of small files on an ext4 drive (on a 48 GB Linode). Right now it's 5M files, and they take up 40G or so. I formatted it to use at least twice as many inodes as the default (otherwise I would have run out), and then lowered the block size to 2K rather than 4K, which reduced a ton of wasted space because the files are so small.

That's not terabytes, but it's definitely a lot of small files for the size. I probably will end up using 50M or so files up from 5M (on a bigger drive).

So is ext4 less inefficient for this use case than other file systems? Why? I hear you can turn on options like dir_index but I haven't tried it yet.


Note that since Linux 3.8 [1] ext4 can embed small files into inodes, thus using less space and allowing faster reads. By using that feature and adjusting the inode size you can control which file sizes are inlined.

[1] http://kernelnewbies.org/Linux_3.8#head-372b38979138cf2006bd...
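If I remember the knobs right, enabling it at mkfs time looks something like the sketch below (flags from memory - verify against mkfs.ext4(8) and your e2fsprogs version; the device name is just an example):

    # Sketch: format with inline_data and larger inodes so small files
    # can live entirely inside the inode. Not tested - check that your
    # e2fsprogs actually supports -O inline_data.
    import subprocess

    subprocess.run(
        ["mkfs.ext4", "-O", "inline_data", "-I", "512", "/dev/sdb1"],
        check=True,
    )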


OK interesting, so this capability is on by default then? I didn't change the default inode size, but it looks like I could SAVE space by increasing it (from 256 or something). Since I have 1.91M files less than 2048 bytes, I guess I could increase it enough to store most/many of those.

And I guess it halves the IOPS for serving those files, since with a cold cache you otherwise need to read both the block holding the inode and the block holding the data.


Happy to. :)

Talk by dchinner @ Redhat https://www.youtube.com/watch?v=FegjLbCnoBw


> Can you elaborate on this or provide a pointer?

In my experience ext4 becomes very unhappy/slow when you have many files/folders in the same folder (I noticed this with backuppc). dir_index may have helped me with this.


Do you have a reason for keeping them stored as lots of very small files in a filesystem?

If not, you might take a look at this page on the Sqlite home site: https://www.sqlite.org/intern-v-extern-blob.html and consider storing the small items within a sqlite database file instead.
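A minimal sketch of what that looks like (schema and paths are just examples):

    # Storing many small files as blobs in a single sqlite file.
    import sqlite3

    db = sqlite3.connect("smallfiles.db")
    db.execute("CREATE TABLE IF NOT EXISTS blobs (path TEXT PRIMARY KEY, data BLOB)")

    def put(path, data):
        db.execute("INSERT OR REPLACE INTO blobs VALUES (?, ?)", (path, data))

    def get(path):
        row = db.execute("SELECT data FROM blobs WHERE path = ?", (path,)).fetchone()
        return row[0] if row else None

    put("images/scan_0001.meta", b"x" * 500)   # small payloads win stored internally
    db.commit()
    assert get("images/scan_0001.meta") == b"x" * 500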


Thanks for the pointer. I actually am using sqlite for the metadata, but not the data. I was thinking that sqlite wouldn't be happy with 40+ GB of files, though I haven't tested it. This page seems to say it's doable with some tuning [1]. (Although as mentioned I may go up to 500 GB or more later)

It's nice to see that intuitive performance characteristic actually tested -- small files perform better when stored internally, but large files perform better when stored on the file system.

I want to use rsync to sync files between machines, so that's a consideration. Otherwise I would have to roll my own.

With sqlite you'll also get duplication of the OS buffer cache in user space. Not sure how much of an issue that is in practice.

I like sqlite a lot, but it seems to have more nontrivial data structures than a file system, so it's harder to reason about performance and space overhead. It also has more concurrency issues than a file system.

One thing I like about the file system is that I was able to account for the waste within 1%. I wrote a simple script to compare the waste with 2K ext4 block size vs 4K block size, and it predicted 7 GB of saved space (out of 40 GB or so), and that turned out to be exactly what happened in practice. I don't think I'd be able to account for space overheads so exactly in sqlite (in theory or in practice).
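(Roughly this kind of thing - a simplified sketch, not the exact script; the path is an example:)

    # Sum, per candidate block size, the space lost to rounding every file
    # up to whole blocks. Ignores inodes, directories and tail packing.
    import os

    def slack(root, block_size):
        wasted = 0
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                size = os.path.getsize(os.path.join(dirpath, name))
                blocks = -(-size // block_size)        # ceiling division
                wasted += blocks * block_size - size
        return wasted

    for bs in (2048, 4096):
        print(bs, round(slack("/srv/smallfiles", bs) / 2**30, 2), "GiB wasted")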

The other dimension is IOPS as mentioned here, but I think that will be easier with just a file system too. There are more observability and debugging tools with file systems.

It's possible that, because of the inode issue, I could go for storing small files internally and big files externally though. The distribution of sizes is pretty much Zipfian. That isn't too hard to code, and the complexity may be justified.

[1] http://stackoverflow.com/questions/784173/what-are-the-perfo...


Actually I just checked, and the bottom 1.91M files out of 5M are less than 1024 bytes, and sum to only 1 GB! The other 3.1M files sum to 34 G (largest file being 2GB or so).

The distribution was more skewed than I thought. So I think the hybrid solution with internal sqlite for small files may be a good idea. Thanks!


ext4 is usually fine, but it has really bad corner case behavior with respect to large directories (> 10k dentries). Iterating through them becomes increasingly slow as they grow larger -- worse than O(n), which is the expected case.

What's worse is that even if you then delete all the files in the directory, performance does not return to normal; iterating is just as slow at that point even with only 10 files. The only solution is to remove the directory and recreate it.

We've been running XFS on our 1000+ hosts for 4 years without issue. There are old stories of XFS issues, but it's been rock solid for quite some time, at least since RHEL 5.


ext4 is comparable to NTFS or HFS+. It mostly works, but other filesystems like btrfs or ZFS give you proper data integrity (checksumming) and other features.


Isn't btrfs still plagued with stability issues/bugs though? (and zfs isn't readily available in default kernels?)

Reading planet.debian.org I seem to get the impression that btrfs is not 100% there yet, unless you feel a little adventurous.


BtrFS will stop being "unstable" when development is done, in 20 years (much like ext*).

I run BtrFS on my 2 laptops (one with SSD, one with HDD): the only issue I ran into is metadata space exhaustion: I have well over 100 snapshots and suddenly, the drive becomes unwritable. Once you know that can happen, you're more careful about backups and when to remove the old snapshots.

Also, Debian might not be the best place for bleeding-edge stuff like BtrFS: you should consider a distribution with a more recent version of kernel and utils (archlinux, gentoo...)


For a btrfs NAS use case, I suggest Rockstor. It ships with bleeding edge kernels from elrepo and is pretty much exclusively btrfs.


Debian Unstable is usually fairly up to date wrt bleeding edge though :)


ah, didn't think of Debian unstable, my bad.


I might be in the minority here, but I've been burned twice with btrfs (in Q3 and Q4 of 2014) and was only able to recover the data with great effort. Both times the errors only occurred after extended use and without a clear cause. My (unfortunately necessarily vague) bug report was met with scepticism about whether my hardware was at fault (I ran memtest for 48h without any errors, and never had any problems with other filesystems), which was quite frustrating after I had lost a significant amount of time to their experimental filesystem and at least wanted to help others avoid the same problems in the future.

I'll consider btrfs again when a large distribution has adopted it as default and it has worked for them for at least a year.


For what it's worth, SLES 12 was released in October 2014 and has btrfs by default:

https://www.suse.com/releasenotes/x86_64/SUSE-SLES/12/#intro...

openSUSE also has it default in 13.2 (November 2014):

https://en.opensuse.org/Portal:13.2


Yes, and I still wonder at the sanity of making it the default given the large number of known problems.

I've lost data, irrecoverably, with Btrfs on at least two occasions--the filesystem was completely toasted. More recently I've encountered the unbalancing issue--the filesystem would become read-only approximately every 36 hours with sustained intensive use. That's simply unacceptable--the default filesystem needs to be utterly reliable, and becoming unusable at unpredictable intervals is not suitable for production.


> Reading planet.debian.org I seem to get the impression that btrfs it's not 100% there yet, unless you feel a little adventurous.

Netgear evidently disagrees and has built consumer-grade NASes based on btrfs.

http://www.zdnet.com/article/netgear-revamps-readynas-storag...

I'm also using btrfs for all my home-devices and I've yet to suffer any issues.


I've been running raid-1 btrfs on workstations and a server for a while, haven't seen any issues at all. I did have an issue on a btrfs laptop, it didn't cope with on/off all that well, but that was more than a year ago, maybe I will retest.


I'm a ZFS user, but recently I took the plunge and tried btrfs out. I'm mostly impressed. This was after an encounter with ZFS's horrendous performance with dedup enabled. btrfs at least lets you dedup manually (which works fine for my purposes -- I rely heavily on hashdeep in my workflows already, and it's trivial to locate duplicates in hashdeep inventories).

My sense is that it's the more exotic parts of btrfs (i.e. the native RAID-but-not-RAID-0/1 parts, especially the stuff intended as RAID-Z alternatives) that have the most issues.

But ext4 doesn't natively support any of that either, so maybe any place you would consider using ext4, btrfs is potentially OK as an alternative. I mean, if you approach it as ext4 plus a few very useful features ext4 is missing, I think you hit the most useful and widely used corners. You can still use md etc. I'm not talking about datacenter-scale things, I'm talking workstations and laptops.

I currently run a zfs-on-linux RAID-1 and have been testing btrfs for one of my backup copies. There are parts of btrfs that seem to work fine and that I really like a lot compared to the ZFS versions. I'm a bit paranoid about data loss and corruption (I do research with large sets of medical images; it's sort of like video except the data is typically very compressible, but few tools work with compressed images -- filesystem-level compression is great for this stuff), so I keep hashdeep inventories of everything to document provenance. One thing about medical image processing workflows is you easily end up with multiple copies of images in different directories. I hoped to use ZFS deduplication for this, but it is a nightmare. With btrfs you at least have the option of safely deduping files manually using cp --reflink, so you can dedup periodically (or even based on knowledge of how the data is laid out in the filesystem) or add it into workflow scripts, which works well.
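In rough form, that manual dedup step is something like the sketch below (example paths; a real run should also compare metadata and handle files changing underneath it):

    # Group files by content hash, then replace duplicates with reflinked
    # copies so they share extents instead of storing the data twice.
    import hashlib, os, subprocess
    from collections import defaultdict

    def sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    by_hash = defaultdict(list)
    for dirpath, _, names in os.walk("/data/study42"):
        for name in names:
            p = os.path.join(dirpath, name)
            if os.path.isfile(p) and not os.path.islink(p):
                by_hash[sha256(p)].append(p)

    for paths in by_hash.values():
        original, *dupes = paths
        for dupe in dupes:
            subprocess.run(["cp", "--reflink=always", "--preserve=all",
                            original, dupe], check=True)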

Unfortunately, the one thing that I haven't figured out is how to make a bit-by-bit clone of an existing btrfs filesystem. You can dd, but that leads to issues because device UUIDs are embedded in the metadata. Working on btrfs, I've managed to consolidate 5.1TB of data that includes very compressible source images, duplicates and a lot of text files into 1.2TB, but I can't figure out how to correctly duplicate the 1.2TB version of the data without it transiently exploding back to 5.1TB in the process.

I do like the way ZFS approaches the concept and organization of "datasets" better than btrfs's approach though. btrfs's approach seems more ad hoc and less opinionated. I think if you lack discipline and experience with large datasets, btrfs can enable you to do unwise things that will seriously bite you in the butt down the road, whereas ZFS enforces some discipline. I think it's probably because ZFS was designed by people who have seen hell.


> Unfortunately, the one thing that I haven't figured out is how to make a bit-by-bit clone of an existing btrfs filesystem.

Have you tried btrfs send/receive?


No, I have not actually tested it yet, but according to my research the current state of btrfs send/receive is this:

1. btrfs send streams the files (i.e. it decompresses files that are compressed on disk)

2. The send stream (optionally) is compressed

3. btrfs receive (optionally) recompresses data into the destination

If you have a large dataset and take the time to use one of the slower compression algorithms/settings, this means you have to wait for that compression to happen all over again at the destination. (You could have different compression settings on the two btrfs filesystems; btrfs send/receive is one of the ways of migrating data between these settings.)
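The transfer itself is just a pipe between the two filesystems, with compression reapplied according to the destination's own settings. A sketch with example snapshot paths:

    # Migrate a read-only snapshot between two btrfs filesystems. Data is
    # decompressed into the send stream and recompressed at the destination
    # per its own compression settings. Paths are examples.
    import subprocess

    send = subprocess.Popen(
        ["btrfs", "send", "/mnt/old/@data-20150821"],   # must be a read-only snapshot
        stdout=subprocess.PIPE,
    )
    subprocess.run(["btrfs", "receive", "/mnt/new/"],
                   stdin=send.stdout, check=True)
    send.stdout.close()
    if send.wait() != 0:
        raise RuntimeError("btrfs send failed")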


EXT4 is arguably the most mature, but even recently they found a long-standing bug in EXT4 on RAID that no one noticed for years:

https://www.phoronix.com/scan.php?page=news_item&px=Linux-4-...


Actually it looks like they found a bug in md (RAID) that was causing fs corruption. To my reading it looks like ext4 was just incidental.


Are ACLs and xattrs active by default?


Yes


Nice. I will be keeping a close eye on this fs for future use.


Is the support of O_DIRECT available or planned ?


Yep - it's been done for a while, works fine.


Kent, IMO you should rename bcachefs to something else, maybe bcfs?


This guy should rush to write documentation first... He will probably "build to budget," exhausting his money and time on coding.

If he does so and finishes his project with a burnout, there is a huge question about how production people will be able to fix things without a good knowledge of the design during the recovery time (burn-in =~ burnout time).

"Read the source, Luke" is nice. But 10 pages of natural language speak louder than 10K lines of code, even with comments.

I dislike this kind of coder that goes for the beef and despises the grunt work as if it were not for them.

Documentation should come first. This code will have to be maintained, since he - like a good drug dealer - is trying to hook people on using his code.




