> Anything where you want an image to be guaranteed consistent, even if not complete, could use an ordering guarantee without a particular "this has been written now" guarantee. A log-structured data store where you don't mind a bit of data loss if there's a power outage is a particularly clear example of that, but it's a useful property in general.

So a little bit of data loss is okay, but a lot isn't? How does a program or operator determine how much is okay and how much isn't? How does the application ensure that that limit isn't exceeded? Without answers to these questions, it feels like "you'll probably be fine, but I can't be sure of anything", which feels pretty lame. But if a little data loss really is okay, then forget about both ordering and fsync and truncate the log after the last consecutive valid record.

> The problem appears to be that today there's only "write this all out now and DO ABSOLUTELY NOTHING ELSE until that happens", and "yeah, whatever, write it whatever order you like and I sure hope it all works out."

> Is that correct?

I don't think so, but it depends on what you mean by "absolutely nothing else". You can always use other threads, and on most systems, you can do lots of useful I/O with reasonable performance while an fsync() is going on.
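(For what it's worth, a minimal sketch of the "use other threads" approach, assuming POSIX threads; the function names are made up:)

    #include <pthread.h>
    #include <unistd.h>

    /* fsync() runs on a helper thread, so only that thread blocks
     * while the flush is in progress; the caller keeps issuing I/O
     * and joins the thread when it actually needs the durability. */
    static void *fsync_worker(void *arg)
    {
        fsync(*(int *)arg);
        return NULL;
    }

    /* Kick off the flush; the fd pointer must stay valid until the
     * caller joins the thread. */
    static int fsync_in_background(int *fd, pthread_t *tid)
    {
        return pthread_create(tid, NULL, fsync_worker, fd);
    }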

> There really isn't anything like a write barrier?

Other than fsync() and equivalents, I don't know of one. Non-blocking write barriers would represent a much more complicated abstraction for both applications and the filesystem, and (as you can tell from my comments on this thread) I'm not convinced it's worth the complexity for any rigorous program.



> You can always use other threads, and on most systems, you can do lots of useful I/O with reasonable performance while an fsync() is going on.

No, you can't. At least not without knowing what you're doing and planning carefully.

My day job is PostgreSQL DBA, and I've been doing that for most of a decade now. As the kids on the Reddits would say, "I've seen some shit." I have some rather large servers, with some rather powerful IO subsystems — my production environment has SAS SLC SSDs under hardware RAID with a ginormous cache. I still see the behavior described in TFA far more often than I'd like. Linux really is pretty dumb here.

For example, because of this fsync() issue, and the fact that an fsync() flushes all outstanding writes for the entire filesystem on which the fsync()'ed file(s) reside, I've set up my servers so that $PGDATA/pg_xlog is a symlink pointing out of the volume mounted at (well, above) $PGDATA and onto a separate, much smaller filesystem. (That is: the transaction logs, which must be fsync()'ed often to guarantee consistency and enable crash recovery, live on a smaller, dedicated filesystem, separate from the rest of my database's disk footprint.)

If I didn't do that, at every checkpoint, my performance would measurably fall. I learned this lesson the hard way, at an old job, where my postgres clusters lived on a SAN — it wasn't just my db instances that were being adversely affected by this IO storm. It was everything else that lived on the filer, too.

That's how bad it can be.


It's not true that fsync() calls flush all outstanding writes for the entire file system; that was true for ext3 in data=ordered mode, but it's definitely not true for ext4 or xfs. If you use fdatasync(), and none of the writes issued against the file descriptor required metadata updates (i.e., you didn't do any block allocations, etc.), then neither ext4 nor xfs needs to trigger a journal commit, so the only thing that has to get sent to disk is the file's dirty data blocks, followed by a SYNC CACHE command which forces the disk drive to guarantee that all writes sent to the disk will survive a power cut.

If you use fsync(), and/or you have allocated blocks or otherwise performed a write which required updating file system metadata, and thus will require a journal commit, then you will need to force out all pending metadata updates to the journal as part of the file system commit, but that's still not the same as "flush all outstanding writes for the entire file system".
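To make that concrete, here is a hedged sketch of the cheap case being described: overwriting already-allocated bytes in place and then calling fdatasync(), which on ext4/xfs shouldn't require a journal commit, whereas fsync() would also force out inode metadata such as mtime. (The function name is made up.)

    #include <fcntl.h>
    #include <unistd.h>

    /* Overwrite existing bytes in place, then make only the data
     * durable.  With no block allocation or size change, ext4/xfs
     * should not need a journal commit for this fdatasync(). */
    int overwrite_and_sync(int fd, const void *buf, size_t len, off_t off)
    {
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
            return -1;
        return fdatasync(fd);   /* fsync(fd) would also flush metadata */
    }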


> or you have allocated blocks or otherwise performed a write which required updating file system metadata

What if you're appending to a file and want to checkpoint every so often? I guess you can be clever with fallocate(FALLOC_FL_KEEP_SIZE) to avoid the block allocation, but won't st_size still need to be updated?

I also assume that st_mtime doesn't count towards dirtying the metadata.
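A sketch of the pattern being asked about (Linux-specific; the helper name is made up): FALLOC_FL_KEEP_SIZE reserves blocks past EOF without touching st_size, so later appends don't need to allocate, but the appends themselves still grow st_size and therefore still dirty the inode.

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Reserve 'reserve' bytes beyond the current end of file without
     * changing st_size.  Subsequent appends into that range need no
     * block allocation, but they still extend st_size, so the inode
     * is still dirtied and the checkpoint still implies a journal
     * commit. */
    int reserve_append_space(int fd, off_t cur_size, off_t reserve)
    {
        return fallocate(fd, FALLOC_FL_KEEP_SIZE, cur_size, reserve);
    }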


You're absolutely right. The situation on Linux sounds bad, and it sounds like work's being done to improve that. Some other systems handle this case fine, though, and the problem isn't with the fsync() interface.

Aside: I worked on one of those filers, and I've seen some shit too. :) Firmware, especially disk firmware, was the worst.


> So a little bit of data loss is okay, but a lot isn't?

I don't think that's the issue being put forth here, going all the way back to the ancestor comment. It's not a choice between a bit of data loss and a lot of data loss: if the power fails, I fully expect that some data that was in the buffer might not have made it to disk. But that's okay: I left my file in a state that's still valid (using a journal or something), and I can recover from it.

But if write B happens before write A, my file is corrupt, and it's game over. I don't need to override the kernel's I/O scheduler to say "write this write right now!", I just need to ensure that A goes before B. That's it; it can happen whenever, but it must happen in that order.

Currently, the only solution seems to be fsync, forcing everything to disk right now and screwing over any buffering whatsoever¹. I just want to say to the kernel, "here's my data, here's the order it must be written in, but you can buffer it and write it when convenient, since you have a better picture of the whole system's I/O."

¹It's funny that the article mentions Postgres/MySQL, as I've heard of fsync woes through browsers using SQLite.
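Concretely, today that "A before B" ordering can only be expressed with something like the following sketch (names made up): a full, blocking flush sits between the two writes even though nothing here ever needed "A is durable right now".

    #include <unistd.h>

    /* B must not reach the disk before A, so we pay for a full flush
     * between them; the durability of A is a side effect we never
     * actually asked for. */
    ssize_t write_a_then_b(int fd,
                           const void *a, size_t alen, off_t aoff,
                           const void *b, size_t blen, off_t boff)
    {
        if (pwrite(fd, a, alen, aoff) != (ssize_t)alen)
            return -1;
        if (fsync(fd) != 0)        /* today's only "barrier" */
            return -1;
        return pwrite(fd, b, blen, boff);
    }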


> So a little bit of data loss is okay, but a lot isn't?

No amount of data loss is OK. However, delaying the acknowledgement to the network client requesting the db update often is OK, especially if your protocol allows many simultaneous requests to be in flight with asynchronous completion notification. (Similar to how TCQ on a hard drive works.)

Right now it's very hard to do this type of workload efficiently. That's a pity because with write-notifications you could do it with zero extra I/O cost.
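The usual approximation today is group commit: batch the in-flight requests, fsync() the log once, and only then send all the acknowledgements. It still costs a flush, so it isn't the zero-extra-I/O write-notification scheme being asked for, but it amortises the cost across the batch. A rough sketch, with made-up names:

    #include <unistd.h>

    struct pending_req { int client_fd; /* request-specific state */ };

    /* All log records for the batch are assumed to be written to
     * log_fd already; one fsync() makes the whole batch durable, and
     * only then are the clients acknowledged. */
    int commit_batch(int log_fd, struct pending_req *batch, size_t n,
                     void (*ack)(struct pending_req *))
    {
        if (fsync(log_fd) != 0)
            return -1;
        for (size_t i = 0; i < n; i++)
            ack(&batch[i]);
        return 0;
    }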


I've never implemented database transactions, so this might be a little hand wavy.

A database transaction promises that all of the changes happen or none of them happen, i.e., that you don't get only some of the changes. The easiest way to implement that is to write what you're going to do to a log, and then, once you know that the log is on disk, make the change in the actual data file. Write-ahead logging, essentially.

But what if I could loosen that some? It isn't that important when the log file is actually written; only that it will hit disk before the changes to the data file. If I write to the log, tell the operating system "after you get around to putting that on disk, make these changes to this other file," I can promise that the transaction will be atomic. I can't tell you if the transaction will actually be committed, but I know there won't be a partial commit. If the system crashes before the log gets updated, the transaction doesn't commit. But I don't have to wait around to find out when the file gets written out. I just need to know that if anything gets written to the data file, then all my writes to the log file are safely on disk.
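A minimal sketch of that scheme as it has to be written today; the fsync() on the log is exactly the wait that an "after that hits disk, then do this" ordering hint would let you drop (names made up):

    #include <unistd.h>

    /* The log record must be durable before the data file is touched.
     * Today that ordering can only be expressed by blocking on
     * fsync(log_fd); the proposal above would let the data write be
     * queued behind the log write instead. */
    int commit_one(int log_fd, const void *rec, size_t rec_len,
                   int data_fd, const void *upd, size_t upd_len, off_t off)
    {
        if (write(log_fd, rec, rec_len) != (ssize_t)rec_len)
            return -1;
        if (fsync(log_fd) != 0)
            return -1;
        return pwrite(data_fd, upd, upd_len, off) == (ssize_t)upd_len ? 0 : -1;
    }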



