
I guess I'm not grokking how the behavior of write() blocking while fsync() is being called is considered a problem.

write() is documented and well known to be a potentially blocking function, potentially for a long time. If you are writing a process with a small number of threads that does disk and network IO in the same threads, you can definitely starve out the network IO when the disk subsystem starts thrashing, but generally that is fine, because you can't keep up with new requests at that point anyway.

For that matter, read() can block too; people are just used to their kernels being smart and the most important data being in cache.

If you want your event-style threads servicing network clients to be able to use write() or read() without blocking up a whole thread for apparently random amounts of time, consider using the aio_* style write/fsync, although you would definitely want to benchmark their impact on overall performance.
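
For anyone who hasn't touched the aio_* interfaces, here is a minimal sketch of what that looks like (my own toy example, not anything from the thread; link with -lrt on glibc, and the busy-wait polling is only to keep the sketch short; aio_suspend() or a sigevent notification is what you'd actually use):

    /* Toy example of aio_write() + aio_fsync(). Compile with -lrt on glibc. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("aio-test.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        static char buf[] = "hello\n";      /* must stay valid until completion */
        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf) - 1;

        if (aio_write(&cb) < 0) { perror("aio_write"); return 1; }

        /* Busy-wait for brevity; use aio_suspend() in real code. */
        while (aio_error(&cb) == EINPROGRESS)
            ;
        if (aio_return(&cb) < 0) { perror("aio_write result"); return 1; }

        /* Queue an asynchronous flush of everything written so far. */
        struct aiocb sync_cb;
        memset(&sync_cb, 0, sizeof(sync_cb));
        sync_cb.aio_fildes = fd;
        if (aio_fsync(O_SYNC, &sync_cb) < 0) { perror("aio_fsync"); return 1; }

        while (aio_error(&sync_cb) == EINPROGRESS)
            ;
        close(fd);
        return 0;
    }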

If you don't want to use aio_*, which is likely to be.. dangerous ground considering how few people use them, you end up adding threads to provide more isolation and parallelism to IO.



Hello. write() on non-real-time Linux is not able to guarantee that it will return within a given amount of time, but all the system calls still have more or less predictable timing behavior when the disk and the CPU are not busy. What I mean is that if you remove the fsync() call from the other thread, what you get is a constant stream of "13 microseconds" delays.

So when fsync() is not in the mix, the kernel will do the right thing: it will use buffers and make the write calls very cheap. This is important for many applications. But when there are stricter durability requirements this is no longer true, and care must be taken.
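
To make the trade-off concrete, here is a hypothetical sketch (made-up helper names, not actual Redis code) of an "fsync at most once per second" append loop: each write() just fills the page cache and stays cheap, and durability is paid for once per second:

    #include <time.h>
    #include <unistd.h>

    static int aof_fd;            /* assumed open with O_WRONLY | O_APPEND */
    static time_t last_fsync;

    ssize_t append_record(const void *buf, size_t len)
    {
        ssize_t n = write(aof_fd, buf, len);  /* usually just fills the page cache */
        if (n < 0)
            return -1;

        time_t now = time(NULL);
        if (now - last_fsync >= 1) {          /* pay for durability once per second */
            if (fsync(aof_fd) == 0)
                last_fsync = now;
        }
        return n;
    }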

Non-blocking I/O (aio_*) is an interesting alternative for some applications, but in the case of Redis it is important to return "OK" only after we get an acknowledgement from write(2). Doing this by suspending the client and resuming it when the write has been performed would turn Redis from a 140k operations/second database into a 10k operations/second database, so this is not going to be the solution.

Real-world software is written not by reading manual pages, but by checking how the underlying OS actually works, IMHO. For instance, Redis persistence uses the fork() copy-on-write semantics of modern operating systems. Also, the fact that write(2) is allowed to block by its semantics does not mean you'll be happy to learn that your kernel is blocking a process for seconds, many times, as a result of a write(2) operation.
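
For readers unfamiliar with that trick, here is a rough sketch of fork()-based snapshotting (save_dataset_to_disk() is a hypothetical serializer, not the real Redis function):

    #include <sys/types.h>
    #include <unistd.h>

    void save_dataset_to_disk(const char *path);  /* assumed to exist elsewhere */

    void background_save(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: its address space is a copy-on-write snapshot taken
             * at fork() time, frozen no matter what the parent does next. */
            save_dataset_to_disk("dump.snapshot");
            _exit(0);
        }
        /* Parent: keeps mutating memory; the kernel copies touched pages
         * lazily, so the child still sees the old contents. A real program
         * would also reap the child with waitpid(). */
    }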


Kernel developer here - honestly, I'm not surprised that fsync() stops the world, and furthermore I suspect that even if you got past the VFS layer, you would see different effects on different filesystems (i.e., you still couldn't bet that fsync() would act like you want it to). The semantics of fsync mean "Please guarantee that everything is written to disk; flush all caches now".

The kernel doesn't keep a 2nd queue for post-fsync writes that it will then swap into the "real" one - think about what happens if someone else calls fsync(); does it spin up a 3rd queue for that one? Does the fsync block? I think it would quickly descend into Crazyville.


Hello xpaulbettsx, thanks for your comment!

Yes, I guess the implementation may get more complex. I'm not sure about the actual implementation, but if it's just a linked list of operations to flush, as it appears to be from a few source code fragments, then it would be possible to put a "sentinel" in the list, have the first fsync() block when the first sentinel is found, and so forth.

I mean, I'm all against complexity myself in the code I write, so I can't really question this behavior, and the "fsync every second" policy is not a huge use case indeed, but it's still important to know this. Googling around a bit, there are tons of people who appear to be pretty confident that moving fsync() into another thread is the way to go, while things actually work differently.


A pair of queues should be enough. The first queue is the one being fsynced, the second queue is open for writes. The first fsync call starts the first queue flushing. The second fsync call marks the second queue as ready to flush once the first one is done. A third has no additional effect. Writes can still proceed against the second queue until it actually starts flushing, at which point the queues swap and you're back where you started.
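
The swap itself is easy to sketch in user space (toy code, nothing like the real block layer; coordinating several overlapping flushers is exactly the complexity the kernel-developer comment above alludes to):

    #include <pthread.h>
    #include <stddef.h>

    struct op { struct op *next; /* payload omitted */ };

    static struct op *open_q;     /* the queue accepting new writes */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void queue_write(struct op *o)
    {
        pthread_mutex_lock(&lock);
        o->next = open_q;
        open_q = o;               /* writes never wait on a flush */
        pthread_mutex_unlock(&lock);
    }

    void flush_all(void (*write_out)(struct op *))
    {
        struct op *to_flush;
        pthread_mutex_lock(&lock);
        to_flush = open_q;        /* swap: the old queue drains... */
        open_q = NULL;            /* ...while new writes go to a fresh one */
        pthread_mutex_unlock(&lock);

        for (struct op *o = to_flush; o; o = o->next)
            write_out(o);         /* only this caller blocks on the drain */
    }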


Isn't this just good old group commit? I'm surprised the kernel wouldn't do this optimization for you.


The interesting thing about your timing is that I would expect you would still see similar slowdowns on ext3, although perhaps they would be harder to see, due to how the most common configuration, data=ordered, behaves: it should cause effectively any fsync to flush everything to disk, and even then the kernel will trigger one ext3 commit every ~5 seconds by default.

I didn't see it mentioned in the blog post, but what filesystem were you running this test on?

I don't know the details of Redis' threading and event model, but please feel free to share them.

Most event-y systems I've worked on end up falling back to a smallish number of threads to provide IO concurrency, because you just don't have fine-grained enough control over the kernel to provide a perfect level of service with a tiny number of threads.
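
In its simplest form that fallback looks something like this (the names io_job and submit_io are made up for illustration; a real system would reuse a fixed pool of workers and wake the event loop through a pipe or eventfd rather than calling back from the worker thread):

    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct io_job {
        int fd;
        const void *buf;
        size_t len;
        void (*done)(ssize_t result);   /* runs on the worker thread here */
    };

    static void *io_worker(void *arg)
    {
        struct io_job *job = arg;
        ssize_t n = write(job->fd, job->buf, job->len); /* may block: fine here */
        if (n >= 0)
            fsync(job->fd);                             /* blocks only the worker */
        job->done(n);
        free(job);
        return NULL;
    }

    void submit_io(struct io_job *job)
    {
        pthread_t t;
        pthread_create(&t, NULL, io_worker, job);
        pthread_detach(t);
    }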

This is the model the Event MPM in httpd uses, it's the model that Traffic Server uses, and it's the model that lighttpd tried to avoid at first, but they added threads later because of blocking stat() calls, etc.

Kernels like the liberty that the POSIX standards give them: things will block, and even if you want to do 140,000 operations a second, if you want any durability at all, there will have to be a tradeoff. The most likely ones are things like multiple commit logs and more threading.


Sorry for not mentioning this, but I'm using ext4.

About using more threads: it seems very unlikely to me that this is going to help in a persistence model where an Append Only File is used for persistence. All the clients will eventually write to the same file object and will get automatically serialized.

Btw, as proof of how the same filesystem can behave in a very different way, check the O_SYNC test above: same ext4 filesystem, but an order of magnitude faster doing things one way instead of another.
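
For reference, the O_SYNC variant is just a flag at open() time; each write(2) then waits for its own data to reach the disk, with no separate fsync() competing to flush every other dirty page (a minimal sketch, not the actual benchmark code):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("sync-test.log",
                      O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char rec[] = "record\n";
        /* With O_SYNC, write() returns only once the record is durable. */
        if (write(fd, rec, sizeof(rec) - 1) < 0)
            perror("write");

        close(fd);
        return 0;
    }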

Unfortunately, POSIX or not, real-world kernel implementations are full of small implementation details that a programmer doing systems programming is required to know in order to write fast code :(


"If you don't want to use aio_*, which is likely to be.. dangerous ground considering how few people use them"

Oracle uses AIO (you can strace it and see for yourself) and if it's safe for them, it's probably safe for you.



