One interesting thing is that io_uring can operate in different modes. One of the modes enables kernel-side polling, where a kernel thread watches the submission ring: when you place a request in it, the kernel picks the request up and performs the IO on its own. That means that, from the application side, you can perform IO without any system calls.
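To make that concrete, here's a minimal sketch of setting up a ring in that mode with liburing. IORING_SETUP_SQPOLL and sq_thread_idle are the real flag/field names; the queue depth and idle timeout are arbitrary picks for illustration:

```c
#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_params params;

    memset(&params, 0, sizeof(params));
    params.flags = IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 2000;   /* poller thread sleeps after ~2s of inactivity */

    /* Note: older kernels restrict SQPOLL to privileged processes. */
    int ret = io_uring_queue_init_params(8, &ring, &params);
    if (ret < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-ret));
        return 1;
    }

    /* From here on, io_uring_submit() usually doesn't enter the kernel:
       the poller thread picks up new SQEs on its own. */
    io_uring_queue_exit(&ring);
    return 0;
}
```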
Our general take was also that it has a lot of potential, but it's low-level enough that most mainstream programmers aren't going to pay attention to it. Hence, it'll be a while before it permeates through various ecosystems.
For those of you who like to listen on the way to work, we cover io_uring on our podcast, The Technium:

https://www.youtube.com/watch?v=Ebpnd7rPpdI

https://open.spotify.com/episode/3MG2FmpE3NP7AK7zqQFArE?si=s...
> One interesting thing is that io_uring can operate in different modes. One of the modes enables kernel-side polling [...]
On a related note, I recently saw this presentation[1] where they show some benchmarks of the various modes.
One gotcha of sorts, though obvious when you think about it, is that the kernel-side polling mode requires a free CPU core for the polling thread. Meaning you'll get very poor performance if you're not leaving enough CPU for the kernel to do its polling.
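If you do have a core to spare, the setup params also let you pin the poller thread to it, so it doesn't compete with your worker threads. A sketch, assuming core 3 is the one you've set aside (IORING_SETUP_SQ_AFF and sq_thread_cpu come from the liburing headers):

```c
#include <liburing.h>
#include <string.h>

/* Returns 0 on success, -errno on failure, like liburing itself. */
static int init_pinned_ring(struct io_uring *ring)
{
    struct io_uring_params params;

    memset(&params, 0, sizeof(params));
    params.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
    params.sq_thread_cpu  = 3;     /* assumption: core 3 is reserved for polling */
    params.sq_thread_idle = 2000;  /* ms of idleness before the poller sleeps */

    return io_uring_queue_init_params(8, ring, &params);
}
```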
> Our general take was also that it has a lot of potential, but is relatively low level that most mainstream programmers aren't going to pay attention to it. Hence, it'll be a while before it permeates through various ecosystems.
But POSIX async I/O (the aio_* functions) on Linux is basically worthless performance-wise, AFAIU, because glibc implements it in userspace by spawning threads that do standard sync I/O. Linux also has its own non-POSIX async I/O (the io_* functions), but it’s very situational: it only works if you bypass the page cache (O_DIRECT), and it can still randomly block on metadata operations (so can Win32, to be fair).

There’s select/poll/epoll with O_NONBLOCK of course, which is what people normally use, but those do not really work with files on disk (neither do their WinSock equivalents). Hell, signal-driven I/O (O_ASYNC) exists; I’ve used it to make a single-threaded emulator (CPU-bound, unlike a network server) interact with the terminal.

But asynchronous I/O of normal, cached files is only possible on Linux through io_uring, as far as I’ve been able to figure out.
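For what it's worth, here's roughly what that last case looks like with liburing: an async read of a plain, page-cache-backed file, no O_DIRECT and no thread pool. A sketch assuming a reasonably recent kernel; /etc/hostname is just a stand-in for any regular file:

```c
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    char buf[4096];

    int fd = open("/etc/hostname", O_RDONLY);   /* any regular file, no O_DIRECT */
    if (fd < 0 || io_uring_queue_init(4, &ring, 0) < 0)
        return 1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);  /* read from offset 0 */
    io_uring_submit(&ring);

    /* Block until the completion arrives; a real program would go do
       other work and reap completions later. */
    if (io_uring_wait_cqe(&ring, &cqe) == 0) {
        printf("read %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    return 0;
}
```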
That said, I’ve read people here saying[1] that overlapped I/O on Windows also works by scheduling operations on a thread pool, even referencing KB articles[2]. This doesn’t mesh with everything I’ve read about I/O in the NT kernel, which is supposed to be natively async to the point where the I/O request data structure (the IRP) has what’s essentially an emulated call stack inside it, in order to allow the I/O subsystem to juggle continuations. What am I missing? Does the Win32 subsystem need to dumb things down that much even inside its own implementation?
(Windows 8 also introduced a ringbuffer-based, no-syscalls thing called Registered I/O that looks very much like io_uring.)
It's a bit misleading. What they mean is that some operations can act as barriers for further operations. E.g. async calls to ReadFile won't run until the call to WriteFile finishes (if it's writing past the end of the file).
Per open(2) [1], you can’t really ask the kernel to not block on regular files:
> O_NONBLOCK [...] has no effect for regular files and will (briefly) block when device activity is required, regardless of whether O_NONBLOCK is set. [...] O_NONBLOCK semantics might eventually be implemented[.]
I’m actually not sure if the reported readiness for them is of any use, but the documentation for select(2) [2] doesn’t give me a lot of hope:
> A file descriptor is ready for writing if a write operation will not block. However, even if a file descriptor indicates as writable, a large write may still block.
This is for data operations; if you want open() itself to avoid spelunking through NFS or spinning up optical drives or whatnot, before io_uring you simply had no way to tell that to the kernel: you call open*() or perhaps creat(), which must hand you an fd, and thus must block until it can do so.
(As far as I’ve seen, tutorial documentation usually rounds this down to “you can’t do nonblocking I/O on disk files”.)
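With io_uring, the open itself can be queued like any other operation. A sketch using io_uring_prep_openat (which, if I recall correctly, needs kernel 5.6 or newer):

```c
#include <fcntl.h>
#include <liburing.h>

static void submit_async_open(struct io_uring *ring, const char *path)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    /* AT_FDCWD: resolve path relative to the current directory, just
       like plain open(). The resulting fd (or -errno) shows up later
       in cqe->res, so even NFS spelunking happens off your thread. */
    io_uring_prep_openat(sqe, AT_FDCWD, path, O_RDONLY, 0);
    io_uring_submit(ring);
}
```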
Sort of. In the strictest sense, yes: the CQ is a ring buffer (implemented with fancy atomics), so you have to check whether there's a completion on the queue before you read the entry. But that check doesn't need a syscall; you poll the shared ring directly, and if more completions come in while you're processing, they'll be there waiting for you.
There's also a syscall (io_uring_enter) that will put your thread to sleep and wake it when completions are available. (It's a complicated syscall with a lot of knobs and switches and levers; be ready for a LOT of information if you go read the man page.)
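In liburing terms, the two styles look roughly like this (a sketch; real code would check cqe->res and handle errors):

```c
#include <liburing.h>

static void reap(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    /* Busy-poll style: io_uring_peek_cqe() just reads the shared ring,
       no syscall; it returns -EAGAIN when the ring is empty. */
    while (io_uring_peek_cqe(ring, &cqe) == 0) {
        /* ... handle cqe->res / cqe->user_data ... */
        io_uring_cqe_seen(ring, cqe);
    }

    /* Sleep style: parks the thread in io_uring_enter() until at least
       one completion is posted, then hands it back. */
    if (io_uring_wait_cqe(ring, &cqe) == 0) {
        /* ... handle it ... */
        io_uring_cqe_seen(ring, cqe);
    }
}
```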