> What bodyfour is proposing is not just getting a write-notification -- it's in tandem with being able to specify write ordering. If you need to be able to specify write ordering, you also want to be able to have a write notification.
It's still not enough: with write ordering and notification but no instruction to actually write the data soon, the kernel can buffer it indefinitely.
If you want the data on disk, use fsync(). If you don't care, don't. If the problem is that you can't afford the latency imposed by the multiple fsync() calls required to ensure correct data ordering for your application, fine. But that's not the problem the OP talks about. That was about fsync() hammering the I/O subsystem. You can solve that problem by fixing the fsync() implementation.
> fsync does not work because it returns when everything (in the same thread? or whatever) has been written to disk, and doesn't let you wait for a particular block to have been written.
If you really want that, could you mmap(2) the file and use msync(2)?
For this as well as the other cases described (e.g., wanting write ordering), I don't know what the intended use case actually is, but is it possible that there's another way to organize the data that's still correct and performs well without changing the POSIX interface? It seems likely, given the number of different programs out there that manage to get by with it, and there's a rather significant cost to adding a new interface.
(One option is to write everything you need into a temporary file, fsync() it, then synchronously rename it to "commit" it. That still requires two fsync's, but never more than that. You can generalize this for multiple files using a temporary directory.)
FWIW, I typically work on illumos systems. On fsync(), ZFS records only an intent log record. That alone helps, since it's not stopping the world to write out everything that's been buffered. For particularly latency-sensitive applications, we use a separate intent log device on an SSD. (Regardless of write ordering and filesystem optimizations, an SSD is necessary in order to guarantee something is on stable storage with latency better than spindles can provide.) This configuration works very well.
> I don't know what the intended use case actually is,
Generally speaking, if you've got a database system of some sort and want to write data to a file.
> It's still not enough: with write ordering and notification but no instruction to actually write the data soon, the kernel can buffer it indefinitely.
That's not really the problem -- being able to say "do this write operation after that other write operation" would let you pump modifications into a file at a faster rate than if you had to wait on an fsync between them. Suppose you have a modification to make. Right now, you might write a batch of blocks, wait for them to complete, and then write another block elsewhere (a new "superblock", or whatever terminology you prefer). You'd rather send all the blocks simultaneously and say "the superblock write should happen _after_ these other blocks' writes". (Another option is to checksum the new blocks the superblock refers to, but that requires pulling them up to the CPU and checksumming them. And there are more complicated options with other trade-offs -- it would be nice if you could just submit multiple blocks to write with a partial ordering specified.)
So even with no fsync at all, you'd be able to pump modifications into a database file faster than before. Without some kind of fsync you couldn't confirm they'd ever been written; with a fine-grained fsync -- a "flush and notify on a per-block basis" call -- you can confirm that a certain subset of changes has been written. Generally speaking, it's nice to be able to send in a bunch of changes without flushing, because when the disk has multiple noncontiguous block writes to choose from and can perform them simultaneously, they get thrown on disk with better throughput.
> If you really want that, could you mmap(2) the file and use msync(2)?
>> I don't know what the intended use case actually is,
>Generally speaking, if you've got a database system of some sort and want to write data to a file.
But database systems have been around for years without such an interface, and can't they basically saturate a storage subsystem?
You can always saturate a storage subsystem -- just add more clients (assuming you don't first saturate the CPU, the CPU's memory bandwidth, or the network interface -- any of which can happen if you put a high-end storage device on otherwise typical hardware). But what you get is latency higher than the minimum possible.
For example, suppose you send a bunch of write operations to the disk and then send an fsync. If those write operations complete one after the other (figuratively speaking, because there's a bunch of them), then on average each one actually finished in about half the time that all of them had to wait for the fsync: if ten writes take 1 ms each, the last finishes at 10 ms, so every caller waits 10 ms even though the mean completion time was around 5 ms.
Now suppose you've got the ability to do fine-grained fsyncs on particular write operations, efficiently. It would still be useful and result in improved latency if the disk or OS knew that getting block A on disk didn't matter to the process until block B was also on disk, and took advantage of that fact. And it would be extra-useful if the disk or OS knew that block B had to be written after block A, because then you would be able to save a round-trip or save on CPU bandwidth necessary for marking or checksumming blocks to a sufficient degree that you can determine upon startup whether they were completely and correctly written.
> , an SSD is necessary in order to guarantee something is on stable storage with latency better than spindles can provide
Interestingly enough, if we ignore current hard drive firmware, I don't agree with this. For arbitrary random sequences of block writes, sure. But for a database or filesystem that wants low-latency writes? My guess is that you could accomplish this if you tracked the location of the drive head and spindle. The last time I tried anything like this, though (talking to /dev/sdb, a new 7200 RPM WD Black laptop drive, from userland), I could only get about 1.5 ms per block write + fsync (IIRC -- the numbers 1.3 ms and 2.0 ms ring a bell too). I didn't try writing near the middle of the disk, though, so it could have been drifting the head off the track each time for some reason. There's just no hope in general with current rotational drives, when one takes ~250 microseconds just to read a 4 KB buffer from memory. They just don't care about latency.
If you actually did take advantage of physical information to hold down write latency, garbage collection and keeping startup times low would be a pain (but hey, SSDs have GC worries too), and there'd definitely be throughput and capacity trade-offs.