I'm storing a lot of small files on an ext4 drive (on a 48 GB Linode). Right now it's 5M files, and they take up 40G or so. I formatted it to use at least twice as many inodes as the default (otherwise I would have run out), and then lowered the block size to 2K rather than 4K, which reduced a ton of wasted space because the files are so small.
That's not terabytes, but it's definitely a lot of small files for the size. I'll probably end up with 50M or so files, up from 5M (on a bigger drive).
So how does ext4 compare to other file systems for this use case -- is it more or less efficient, and why? I hear you can turn on options like dir_index, but I haven't tried that yet.
Note that since Linux 3.8 [1], ext4 can embed small files directly in the inode, which uses less space and allows faster reads. Using that feature and adjusting the inode size, you can control which file sizes get inlined.
OK interesting, so this capability is on by default then? I didn't change the default inode size, but it looks like I could SAVE space by increasing it (from 256 or something). Since I have 1.91M files less than 2048 bytes, I guess I could increase it enough to store most/many of those.
And I guess it halves the IOPS for serving those files, since with a cold cache you'd otherwise need to read both the block holding the inode and the block holding the data.
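(If I'm reading it right, the feature is ext4's inline_data.) To size this, I can just bucket the files by size -- something like this sketch, where the cutoffs are placeholders; the exact inline capacity for a given inode size is something I'd have to check against the ext4 docs:

    import os
    from collections import Counter

    # Candidate "fits inline" cutoffs in bytes. Illustrative only: the real
    # inline_data capacity per inode size depends on ext4's fixed inode
    # fields and xattr overhead, so check the ext4 docs / dumpe2fs.
    THRESHOLDS = [60, 256, 512, 1024, 2048]

    counts = Counter()
    total = 0
    for root, _dirs, files in os.walk("/path/to/data"):   # hypothetical path
        for name in files:
            try:
                size = os.path.getsize(os.path.join(root, name))
            except OSError:
                continue
            total += 1
            for t in THRESHOLDS:
                if size <= t:
                    counts[t] += 1

    for t in THRESHOLDS:
        print("files <= %4d bytes: %d of %d" % (t, counts[t], total))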
In my experience ext4 becomes very unhappy/slow when you have many files/folders in the same folder (I noticed this with backuppc). dir_index may have helped me with this.
Do you have a reason for keeping them stored as lots of very small files in a filesystem?
If not, you might take a look at this page on the Sqlite home site: https://www.sqlite.org/intern-v-extern-blob.html and consider storing the small items within a sqlite database file instead.
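In case it's useful, storing small items as BLOBs is simple with Python's sqlite3 module. A rough sketch (the table name and key scheme are just for illustration):

    import sqlite3

    conn = sqlite3.connect("blobs.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, data BLOB NOT NULL)")

    def put(path, data):
        # Store the raw bytes as a BLOB keyed by path.
        conn.execute("INSERT OR REPLACE INTO files (path, data) VALUES (?, ?)",
                     (path, data))
        conn.commit()

    def get(path):
        row = conn.execute("SELECT data FROM files WHERE path = ?",
                           (path,)).fetchone()
        return None if row is None else bytes(row[0])

    put("notes/a.txt", b"hello, small file")
    print(get("notes/a.txt"))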
Thanks for the pointer. I actually am using sqlite for the metadata, but not the data. I was thinking that sqlite wouldn't be happy with 40+ GB of files, though I haven't tested it. This page seems to say it's doable with some tuning [1]. (Although as mentioned I may go up to 500 GB or more later)
It's nice to see that intuitive performance characteristic actually tested -- small files perform better when stored internally, but large files perform better when stored on the file system.
I want to use rsync to sync files between machines, so that's a consideration. Otherwise I would have to roll my own.
With sqlite you'll also get duplication of the OS buffer cache in user space. Not sure how much of an issue that is in practice.
I like sqlite a lot, but it seems to have more nontrivial data structures than a file system, so it's harder to reason about performance and space overhead. It also has more concurrency issues than a file system.
One thing I like about the file system is that I was able to account for the waste within 1%. I wrote a simple script to compare the waste with 2K ext4 block size vs 4K block size, and it predicted 7 GB of saved space (out of 40 GB or so), and that turned out to be exactly what happened in practice. I don't think I'd be able to account for space overheads so exactly in sqlite (in theory or in practice).
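Roughly, the script just walks the tree and sums the slack in the last block at each block size -- something like this sketch, which ignores inode, extent, and directory overhead:

    import os

    def wasted_bytes(size, block_size):
        """Slack in the last block when a file of `size` bytes is stored
        in `block_size`-byte blocks (zero-length files use no data block)."""
        if size == 0:
            return 0
        return -size % block_size

    total_4k = total_2k = 0
    for root, _dirs, files in os.walk("/path/to/data"):   # hypothetical path
        for name in files:
            try:
                size = os.path.getsize(os.path.join(root, name))
            except OSError:
                continue
            total_4k += wasted_bytes(size, 4096)
            total_2k += wasted_bytes(size, 2048)

    gb = float(2 ** 30)
    print("waste at 4K blocks: %.1f GB" % (total_4k / gb))
    print("waste at 2K blocks: %.1f GB" % (total_2k / gb))
    print("predicted savings:  %.1f GB" % ((total_4k - total_2k) / gb))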
The other dimension is IOPS, as mentioned here, but I think that will be easier to reason about with just a file system too. There are more observability and debugging tools for file systems.
It's possible that because of the inode issue, I could go for storing small files internally and big files externally, though. The distribution of sizes is pretty much Zipfian. That isn't too hard to code, and the complexity may be justified.
Actually I just checked, and the bottom 1.91M files out of 5M are less than 1024 bytes, and they sum to only 1 GB! The other 3.1M files sum to 34 GB (the largest file being 2 GB or so).
The distribution was more skewed than I thought. So I think the hybrid solution with internal sqlite for small files may be a good idea. Thanks!
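Rough shape of what I have in mind for the hybrid -- the cutoff and paths are placeholders, not a finished design:

    import os
    import sqlite3

    SMALL_THRESHOLD = 1024      # illustrative cutoff based on the size distribution
    DB_PATH = "small_files.db"  # hypothetical paths
    BIG_FILE_DIR = "big"

    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS small_files (key TEXT PRIMARY KEY, data BLOB NOT NULL)")

    def put(key, data):
        # Small payloads go into sqlite; everything else stays a regular file.
        if len(data) <= SMALL_THRESHOLD:
            conn.execute("INSERT OR REPLACE INTO small_files (key, data) VALUES (?, ?)",
                         (key, data))
            conn.commit()
        else:
            path = os.path.join(BIG_FILE_DIR, key)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                f.write(data)

    def get(key):
        # Check sqlite first, then fall back to the file system.
        row = conn.execute("SELECT data FROM small_files WHERE key = ?",
                           (key,)).fetchone()
        if row is not None:
            return bytes(row[0])
        with open(os.path.join(BIG_FILE_DIR, key), "rb") as f:
            return f.read()

One nice property: the cutoff can change later without touching the read path, since get() checks sqlite and then falls back to the file system either way.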
When the files are that small, their metadata overhead consumes more I/O than the data itself.