> From a safety perspective, O_DIRECT is now table stakes Except for the awkward...

jorangreef · on Jan 24, 2021

That's a common misunderstanding of the purpose of O_DIRECT.

For a write barrier, you would still use fsync() or O_DSYNC, along with O_DIRECT.

The man page for open(2) is clear on this: https://man7.org/linux/man-pages/man2/open.2.html

jlokier · on Jan 25, 2021

I guess I know, as this is what I found when I googled just now :-) https://linux-scsi.vger.kernel.narkive.com/yNnBRBPn/o-direct...

I was trying to address this aspect of the parent comment:

> O_DIRECT is the only safe way to recover from the journal after fsync failure (when the page cache can no longer be trusted by the database to be coherent with the disk)

O_DIRECT is not a safe way to recover from the journal if you have decided you cannot trust fsync to do its job, because you need fsync to make O_DIRECT write-cache durable.

(By the way, O_SYNC/O_DSYNC are equivalent to calling fsync/fdatasync after each write, therefore subject to some of the same issues.)

But even in normal situations with fsync working fine, it is not clear if you can rely on fsync to do a drive write-cache flush when there isn't any metadata or page cache data for the file because you've only been using O_DIRECT.

Neither open(2) or fsync(2) man pages address this durability issue. You can't use O_DSYNC or O_SYNC for good throughout with O_DIRECT because your database does not want the overhead of a write-cache flush on every write. You only want it for barriers. And you can't rely on fdatasync because there's no data to flush in the page cache, no block I/O to do, so fdatasync could meet expectations by doing nothing.

My solution in the past has been to toggle the LSB in st_mtime before async just to force a drive write-cache flush when I'm not sure that anything else will force one. It's not pretty.

jorangreef · on Jan 26, 2021

> O_DIRECT is not a safe way to recover from the journal if you have decided you cannot trust fsync to do its job, because you need fsync to make O_DIRECT write-cache durable.

I was specifically referring not to an fsync in your sense (where the disk or fs does not respect fsync at all so that fsync is a no-op, or where the fs has a bug with O_DIRECT not flushing if it sees nothing dirty in the page cache - by the way I think this is no longer an issue, otherwise it's a kernel bug you can report)

...but to handling an fsync error in the context of the paper from WISC that I linked to in that parent comment, where the kernel's page cache has gone out of sync with the disk after an fsync EIO error ("Fsyncgate"):

"when the page cache can no longer be trusted by the database to be coherent with the disk: https://www.usenix.org/system/files/atc20-rebello.pdf"

The details are all in the paper. Sure, some disks may not respect fsync, but O_DIRECT is still the only way to safely read and recover from the journal when the kernel's page cache is out of sync with the disk (again, details in the paper). It's another fantastic paper out of WISC.