> What bodyfour is proposing is not just getting a write-notification -- it's in tandem with being able to specify write ordering. If you need to be able to specify write ordering, you also want to be able to have a write notification.
It's still not enough: with write ordering and notification but no instruction to actually write the data soon, the kernel can buffer it indefinitely.
If you want the data on disk, use fsync(). If you don't care, don't. If the problem is that you can't afford the latency imposed by the multiple fsync() calls required to ensure correct data ordering for your application, fine. But that's not the problem the OP talks about. That was about fsync() hammering the I/O subsystem. You can solve that problem by fixing the fsync() implementation.
> fsync does not work because it returns when everything (in the same thread? or whatever) has been written to disk, and doesn't let you wait for a particular block to have been written.
If you really want that, could you mmap(2) the file and use msync(2)?
For this as well as the other cases described (e.g., wanting write ordering), I don't know what the intended use case actually is, but is it possible that there's another way to organize the data that's still correct and performs well without changing the POSIX interface? It seems likely, given the number of different programs out there that manage to get by with it, and there's a rather significant cost to adding a new interface.
(One option is to write everything you need into a temporary file, fsync() it, then synchronously rename it to "commit" it. That still requires two fsync's, but never more than that. You can generalize this for multiple files using a temporary directory.)
FWIW, I typically work on illumos systems. On fsync(), ZFS records only an intent log record. That alone helps, since it's not stopping the world to write out everything that's been buffered. For particularly latency-sensitive applications, we use a separate intent log device on an SSD. (Regardless of write ordering and filesystem optimizations, an SSD is necessary in order to guarantee something is on stable storage with latency better than spindles can provide.) This configuration works very well.
> I don't know what the intended use case actually is,
Generally speaking, if you've got a database system of some sort and want to write data to a file.
> It's still not enough: with write ordering and notification but no instruction to actually write the data soon, the kernel can buffer it indefinitely.
That's not really the problem -- being able to say "do this write operation after that other write operation" would let you pump modifications into a file at a faster rate than if you had to wait on an fsync between them. Suppose you have a modification to make. Right now, you might write a batch of blocks, wait for them to complete, and then write another block elsewhere (a new "superblock", or whatever terminology you prefer). You'd rather send all the blocks simultaneously and say "the superblock write should happen _after_ these other blocks' writes". (Another option is to checksum the new blocks the superblock refers to, but that requires pulling them up to the CPU and checksumming them. And there are more complicated options with other trade-offs -- it would be nice if you could just submit multiple blocks to write with a partial ordering specified.)
So even with no fsync at all, you'd be able to pump modifications into a database file faster than before. Without some kind of fsync you couldn't confirm they'd ever been written; with a fine-grained fsync -- a "flush and notify on a per-block basis" call -- you can confirm that a certain subset of changes has been written. Generally speaking, it's nice to be able to send in a bunch of changes without flushing, because when the disk has multiple noncontiguous block writes to choose from and can perform them simultaneously, they get thrown on disk with better throughput.
> If you really want that, could you mmap(2) the file and use msync(2)?
>> I don't know what the intended use case actually is,
>Generally speaking, if you've got a database system of some sort and want to write data to a file.
But database systems have been around for years without such an interface, and can't they basically saturate a storage subsystem?
You can always saturate a storage subsystem -- just add more clients (assuming you don't first saturate the CPU, the CPU's memory bandwidth, or the network interface -- any of which can happen if you put a high-end storage device on otherwise typical hardware). But what you get is latency higher than the minimum possible.
For example, suppose you send a bunch of write operations to the disk and then send an fsync. If those write operations complete one after the other (figuratively speaking, because there's a bunch of them), then on average each one actually finished in about half the time that all of them had to wait for the fsync: if ten writes take 1 ms each, the last finishes at 10 ms, so every caller waits 10 ms even though the mean completion time was around 5 ms.
Now suppose you've got the ability to do fine-grained fsyncs on particular write operations, efficiently. It would still be useful and result in improved latency if the disk or OS knew that getting block A on disk didn't matter to the process until block B was also on disk, and took advantage of that fact. And it would be extra-useful if the disk or OS knew that block B had to be written after block A, because then you would be able to save a round-trip or save on CPU bandwidth necessary for marking or checksumming blocks to a sufficient degree that you can determine upon startup whether they were completely and correctly written.
> , an SSD is necessary in order to guarantee something is on stable storage with latency better than spindles can provide
Interestingly enough, if we ignore current hard drive firmware, I don't agree with this. For arbitrary random sequences of block writes, sure. But for a database or filesystem that wants low-latency writes? My guess is that you could accomplish this if you tracked the location of the drive head and spindle. The last time I tried anything like this, though (talking to /dev/sdb, a new 7200 RPM WD Black laptop drive, from userland), I could only get about 1.5 ms per block write + fsync (IIRC -- the numbers 1.3 ms and 2.0 ms ring a bell too). I didn't try writing near the middle of the disk, though, so it could have been drifting the head off the track each time for some reason. There's just no hope in general with current rotational drives, when one takes ~250 microseconds just to read a 4 KB buffer from memory. They just don't care about latency.
If you actually did take advantage of physical information to hold down write latency, garbage collection and keeping startup times low would be a pain (but hey, SSDs have GC worries too), and there'd definitely be throughput and capacity trade-offs.