My side project has been to rewrite https://github.com/erpc-io/eRPC, which does RPCs over UDP with congestion control and is supposedly quite fast. Paper: https://www.usenix.org/system/files/nsdi19-kalia.pdf. I never got the code to work in AWS; I believe the author focused on Mellanox NICs, but that's not really commodity H/W, which is where my interests lie.
So I dug into it, and well, I'll have my own library soon. I should be able to send UDP w/o congestion control sometime this week.
eRPC uses DPDK (100% user space NIC TX/RX control) plus the author's other ideas to get performance. Since I'm getting into DPDK way, way late in the game, I hope and believe DPDK is and will remain a better way to go than kernel fix-ups.
Getting the kernel out of the way with pinned threads -- or something more exotic with special H/W if preferred -- is cleaner if one can develop from scratch. DPDK is zero copy all day long, and always has been.
This library will be part of something bigger. For many frameworks, a key architectural question is:
I got an RPC/packet/message. OK, now what?
* process it in-place, e.g. on the thread that was doing RX?
* if I don't delegate, am I creating back-pressure?
* delegate it to another core? How do I do that efficiently? (A ring sketch follows this list.) Hint: https://blog.acolyer.org/2017/12/04/ffwd-delegation-is-much-...
* if I delegate ... to whom? Maybe I'm partitioning?
* if I delegate, how do I get a response back?
In DPDK I believe these are easier to decide on and manage well in code.
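A minimal sketch of the "delegate to another core" option, assuming one RX (producer) thread and one worker (consumer) thread per ring. The names and sizes are mine, not from eRPC or the ffwd paper; DPDK's rte_ring plays a similar role if you're already in that ecosystem:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SLOTS 1024                      /* must be a power of two */

    struct spsc_ring {
        _Atomic size_t head;                     /* advanced by the RX (producer) core */
        _Atomic size_t tail;                     /* advanced by the worker (consumer) core */
        void *slots[RING_SLOTS];
    };

    /* RX core: hand a message off without locks; a 'false' return is back-pressure. */
    static bool spsc_push(struct spsc_ring *r, void *msg)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SLOTS)
            return false;                        /* full */
        r->slots[head & (RING_SLOTS - 1)] = msg;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    /* Worker core: drain the ring and process (or reply) in its own time. */
    static void *spsc_pop(struct spsc_ring *r)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail == head)
            return NULL;                         /* empty */
        void *msg = r->slots[tail & (RING_SLOTS - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return msg;
    }

Getting a response back can be the mirror image: a second ring owned by the worker, drained by the RX/TX core.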
(Author of eRPC.) I'm glad you've found the project useful, and my apologies for the code quality, build quality, and NIC compatibility issues that you faced. I'm very interested in learning from your experience (perhaps even as a GitHub issue on our repo), especially w.r.t. how the code organization should be improved, although I understand the list might be too long :).
I know of some issues myself that I plan to fix, e.g., using macros to decide which Transport class to use -- this should just use virtual methods; messy inter-mixing of congestion control and packet I/O; some "performance optimizations" like preallocated buffers that are just not worth it.
I'll be in touch, you can be sure. The code is in Git, but I will not make it public until I have UDP working --- hopefully within +/- 3 days.
I do and will continue to mention your paper (which I also found on HN) and your repo because it has a lot of really good ideas ... ideas which I'm forced to engage with and master.
I haven't yet had time to focus on how you deal with packet loss and congestion management. Anything you can do there --- I'll have to reread your paper (well written, BTW) to pull my own weight here --- is a real benefit to the community.
If you want to get packets into userspace fast, you can now apparently also use AF_XDP on Linux, which is also effectively a kernel bypass. The driver puts packets into a ring buffer before they hit the rest of the kernel stack, and userspace can fetch them directly from there. I haven't benchmarked it for the same application, but it promises to give similar levels of performance.
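For a feel of the shape of the API, here is a hedged sketch of the RX side using the xsk_* helpers from libxdp/libbpf. UMEM registration, socket creation with xsk_socket__create(), and fill-ring refilling are omitted; poll_rx and BATCH are names I made up:

    #include <xdp/xsk.h>                 /* libxdp; older setups ship this as <bpf/xsk.h> */

    #define BATCH 64

    /* Drain up to BATCH frames from an AF_XDP RX ring; umem_area is the mmap'd UMEM. */
    static void poll_rx(struct xsk_ring_cons *rx, void *umem_area)
    {
        __u32 idx = 0;
        __u32 rcvd = xsk_ring_cons__peek(rx, BATCH, &idx);

        for (__u32 i = 0; i < rcvd; i++) {
            const struct xdp_desc *desc = xsk_ring_cons__rx_desc(rx, idx + i);
            void *frame = xsk_umem__get_data(umem_area, desc->addr);
            /* desc->len bytes at `frame` are the raw L2 packet, still sitting in the UMEM */
            (void)frame;
        }

        xsk_ring_cons__release(rx, rcvd);
        /* A real loop would now give these frame addresses back to the fill ring. */
    }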
I think the other questions are more about library design and less about the packet API one uses. On top of all kinds of APIs one could build single-threaded state machines which dequeue packets and process them. Or other architectures if required.
Depending on the requirements of the RPC, you could implement the logic in eBPF code using the XDP framework, which executes the code directly inside the NIC driver and can modify the packet buffer into a response packet that gets sent out immediately, without ever going through the rest of the kernel or userspace.
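As a rough illustration of that idea (not a full RPC responder), an XDP program can rewrite the frame in place and bounce it straight back out of the same interface with XDP_TX. This sketch only swaps the Ethernet addresses; a real responder would also rewrite the IP/UDP headers and payload:

    /* xdp_echo.c -- compile with: clang -O2 -target bpf -c xdp_echo.c -o xdp_echo.o */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int xdp_echo(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;

        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;                 /* runt frame: let the kernel have it */

        /* Swap src/dst MACs so the (modified) frame heads back to the sender. */
        unsigned char tmp[ETH_ALEN];
        __builtin_memcpy(tmp, eth->h_dest, ETH_ALEN);
        __builtin_memcpy(eth->h_dest, eth->h_source, ETH_ALEN);
        __builtin_memcpy(eth->h_source, tmp, ETH_ALEN);

        return XDP_TX;                       /* transmit without touching the rest of the stack */
    }

    char _license[] SEC("license") = "GPL";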
No, it's offloading the code onto the processor on the NIC (such NICs usually have several general-purpose CPU cores). According to Wikipedia [0] it's not widely supported in hardware, but a few companies like Intel are "working on it". I guess the upshot is that it gracefully "degrades" to running on the host CPU on an unsupported NIC.
Offloading to the NIC is of course the fastest way, but even running it in the NIC driver (check whether your NIC driver supports it, or you'll fall back to the slower generic path) already gives you advantages similar to doing it with DPDK.
It seems like a lot of performance optimizations that used to require custom kernels can now be implemented using io_uring instead. Does anyone know if any such performance improvements are making their way into places where they can be felt by those of us working higher up the tech stack? (E.g. if nginx, node or similar got faster networking or disk access)
I think the performance optimisations people use these days are not kernel patches so much as:
- userspace networking which avoids copying in the kernel and expensive context switches (both advantages of io_uring) and some costs of sockets apis[1]
- specialised NICs that can e.g. do TLS in hardware
- specialised NICs with on-board FPGAs to which some custom processing may be offloaded
- specialised NICs with first class virtualisation support to avoid hypervisor overhead (mostly just a cloud thing)
One thing about high performance networking is that there are really two different desires: maximising throughput and minimising latency. Different choices will be made based on what kind of performance is most desired.
[1] If you imagine a typical high-performance, pre-io_uring, portable web server: you have a socket you listen on, many threads calling accept on it (with an option so that only one is woken up at a time for new connections), and a load of fds for open connections, with the server calling epoll to wait for input on them. When a packet comes in and makes it through the kernel TCP stack, the kernel has to assign the packet to the appropriate fd and figure out whether any thread needs to wake up. With specialised networking configurations, you can instead receive the packets in the order they arrive and send them off to their handlers yourself, without the overhead of explaining how to do that to the kernel.
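A bare-bones sketch of that pre-io_uring pattern, with error handling and edge-triggering details left out (serve_loop is just my name for it):

    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Classic single-threaded accept + epoll loop over a listening TCP socket. */
    static void serve_loop(int listen_fd)
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        for (;;) {
            struct epoll_event events[64];
            int n = epoll_wait(epfd, events, 64, -1);   /* kernel decides who is ready */

            for (int i = 0; i < n; i++) {
                if (events[i].data.fd == listen_fd) {
                    int conn = accept(listen_fd, NULL, NULL);
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                    epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
                } else {
                    char buf[4096];
                    ssize_t r = read(events[i].data.fd, buf, sizeof(buf));
                    if (r <= 0)
                        close(events[i].data.fd);
                    /* else: parse the request and queue a response */
                }
            }
        }
    }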
Not really. The only thing that needs to be offloaded is the record encapsulation (the format of which has remained backwards compatible) and the encryption/decryption (which is limited to AES-GCM for all intents and purposes). The handshake and the control messages can still be handled in software, since those aren't in the critical path.
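On Linux the usual way to get at that split is kTLS: the handshake stays in your TLS library, and once the keys exist you push them into the kernel (and, with a capable NIC and driver, into hardware). A hedged sketch, assuming the key material comes from your TLS library:

    #include <linux/tls.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    #ifndef SOL_TLS
    #define SOL_TLS 282
    #endif
    #ifndef TCP_ULP
    #define TCP_ULP 31
    #endif

    /* Hand the AES-GCM TX keys to the kernel; after this, plain send() emits TLS records. */
    static int enable_ktls_tx(int fd,
                              const unsigned char *key, const unsigned char *iv,
                              const unsigned char *salt, const unsigned char *rec_seq)
    {
        if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
            return -1;

        struct tls12_crypto_info_aes_gcm_128 ci = {
            .info.version     = TLS_1_2_VERSION,
            .info.cipher_type = TLS_CIPHER_AES_GCM_128,
        };
        memcpy(ci.key,     key,     TLS_CIPHER_AES_GCM_128_KEY_SIZE);
        memcpy(ci.iv,      iv,      TLS_CIPHER_AES_GCM_128_IV_SIZE);
        memcpy(ci.salt,    salt,    TLS_CIPHER_AES_GCM_128_SALT_SIZE);
        memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

        return setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));
    }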
DDIO applies to ANY I/O device, which makes it a horrible idea on servers with even a moderate amount of I/O. With DDIO, EVERY I/O device's DMA writes (e.g., transfers from device to host) allocate space in cache. This means it works great on toy benchmarks with a single thread polling for I/O from a NIC. But scale it up to a 32c/64t server that is reading a few gigabytes/sec from disk and doing a few tens of gigabits of network I/O, and you thrash your caches.
Its precursor, DCA, was limited to Intel (and a few other vendors) NICs, and was far better. With DCA, the NIC decides what, if any, DMA write should be pushed to cache and tags that DMA write with a PCIe TLP steering tag. The best choices are typically RX descriptors, and protocol headers. Other data is not tagged, and hence does not pollute the cache.
It sounds like what you want is RDT+CAT, so you can allocate your LLC as you like. And that means you have to pick and choose your platform since Intel keeps waffling over whether CAT should exist on every SKU or not.
Well, yes and no. I don't see how CAT can differentiate between a packet header and packet data, but being able to control the number of ways dedicated to DDIO would be nice.
In terms of features like CAT being available on different SKUs.. Intel needs to remember that they're not just competing with themselves these days. Their competitors generally don't go fusing off features on low-end SKUs to drive sales of higher-margin SKUs. Intel needs to differentiate chips solely on core count, clock speed, cache size, etc.
Small correction: DDIO is not limited to Intel NICs; it's a mostly-transparent-to-the-hardware mechanism by which DMAs are snooped and some fraction of the NIC-local (in the case of multi-socket systems) L3 cache is filled with incoming data.
DDIO works for any DMA initiator, not just Intel NICs.
It's too bad that DDIO has not trickled down the product line. It doesn't exist on desktop-class parts, and many people can't use it even on top-of-the-line Xeon-SP because cloud operators disable it and RDT/CAT, which is helpful when using direct cache access.
I was just looking into building a custom router using an older PC and a dual-port 2.5GbE NIC. OPNsense sounds like a solid choice, but I'd prefer a Linux solution because the CPU is pretty beefy and I wouldn't mind running a few Docker containers on the side (yes, I know multi-purposing a router is a bad idea, but meh).
It's a NAT router, so at minimum it needs to masquerade (rewrite the src/dst IPs in the packet headers between the internal and public addresses). I'd also be running fail2ban or crowdsec for basic intrusion detection.
Have you considered vyos as a Linux alternative to *sense?
Additionally, you could set the box up as a hypervisor, pass the NIC through to the router VM, and have the other VMs/containers use a bridge or some more advanced virtual networking.
That obviously has potential downsides, but I had a bit of fun doing it before downsizing my server
I wanted to see some benchmark numbers, but the first hit on Google returns an unresolved 150+ message GitHub issue about how io_uring might not be better than epoll: https://github.com/axboe/liburing/issues/189
I didn't read the entire thread, but are there still implementation kinks being worked out, or is the API ready to use now?
Note: since Alex is a fairly gender neutral name I'm going to use they/them pronouns.
I went through and read it, and the submitter is incredibly confrontational and not at all open to feedback on the correctness of their benchmark. Also, when presented with contradictory evidence (their own benchmark where the results show io_uring is faster than epoll on other machines), they essentially dismiss it and say the other users ran the benchmark wrong, or that they can't reproduce the results on their machine, or that the other users used Boost and therefore their results don't count. So not an entirely reliable criticism coming from them, in my opinion.
That's not my take at all. Many people respond with hand-waving arguments: maybe epoll is faster because the benchmark is wrong, or because only one file descriptor is used (which wasn't correct, but was said anyway), etc. Other people also respond saying they've been unable to match epoll performance when trying to add io_uring support to their libraries.
I would recommend doing your own benchmarking for whatever application you want to use it in. This might require building a prototype.
My understanding at the moment is that there is not one io_uring; every kernel version has a different implementation, and it has changed quite a bit over the years. In the earlier versions it just delegated all IO to an in-kernel threadpool. Especially for network IO that's not ideal, since the alternative (epoll) didn't require a threadpool in either the kernel or userspace. And this probably showed up in benchmarks. The implementation has changed, but I can't really tell how everything works now for disk and network IO without reviewing the source code again.
A comprehensive changelog explaining the implementation changes and bugs in different versions would be super helpful.
I think io_uring more or less works as advertised now for the key thing it set out to do, but it removes one bottleneck in a larger system of network processing. You can switch to io_uring, yet without any other changes a real-world application can hit the next bottleneck due to some adjacent architectural decision (or to performance regressions as implementations are changed to accommodate an expanding envelope of uses).
There are a lot of rough edges in io_uring; foremost among them is that it has been written, in classic Linux style, without regard to the security implications, so despite its youth it has already collected four local privilege escalation CVEs. I recommend approaching it skeptically: write your own test and make sure it's actually faster in your use case.
> there needs to be a mechanism by which the kernel can tell applications that a given buffer can be reused. MSG_ZEROCOPY handles this by returning notifications via the error queue associated with the socket — a bit awkward, but it works.
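For reference, this is roughly what that error-queue dance looks like with plain MSG_ZEROCOPY (kernel 4.14+); the function names are mine and error handling is trimmed:

    #include <linux/errqueue.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Opt in once per socket, then send without the kernel copying the payload. */
    static void send_zerocopy(int fd, const void *buf, size_t len)
    {
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
        send(fd, buf, len, MSG_ZEROCOPY);    /* `buf` is now pinned; don't reuse it yet */
    }

    /* Poll the error queue for "sends N..M are done, buffers reusable" notifications. */
    static void reap_zerocopy_completions(int fd)
    {
        for (;;) {
            char control[128];
            struct msghdr msg = { .msg_control = control, .msg_controllen = sizeof(control) };

            if (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) < 0)
                break;                       /* nothing pending */

            for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                struct sock_extended_err *serr = (struct sock_extended_err *)CMSG_DATA(cm);
                if (serr->ee_errno == 0 && serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
                    /* buffers for send numbers serr->ee_info .. serr->ee_data may be reused */
                }
            }
        }
    }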
Let's say they have a circular buffer for the TX queue.
Why can't they have an atomic variable for the read end of the circular buffer? One could map this variable into shared memory, the kernel could advance the counter once the NIC has finished sending a frame with the data at the read end of that circular buffer, and userspace could still query the counter to check for 'buffer full' conditions.
Sounds like it, yes, but it is infinitely more flexible, and IIRC the general architecture also allows chaining, e.g., you can in a sense bind a parameter of a system call to the result of another to-be-made system call. An example might be passing a length or a file descriptor number from one system call to the next.
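For context, this is what chaining looks like with liburing's IOSQE_IO_LINK. Note that a link only enforces ordering and failure propagation: in this sketch the write reuses a fixed length rather than automatically picking up how many bytes the read actually returned (infd/outfd are placeholders):

    #include <liburing.h>

    /* Queue a read and a dependent write in one submission; the write only runs
       if the read completes successfully. */
    static void linked_copy(int infd, int outfd)
    {
        static char buf[4096];
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, infd, buf, sizeof(buf), 0);
        sqe->flags |= IOSQE_IO_LINK;                 /* chain to the next SQE */

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, outfd, buf, sizeof(buf), 0);

        io_uring_submit(&ring);                      /* one syscall submits both ops */

        struct io_uring_cqe *cqe;
        for (int i = 0; i < 2; i++) {                /* reap both completions */
            io_uring_wait_cqe(&ring, &cqe);
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
    }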
Or stuff that requires low latency on a weak platform. When I was at Microsoft we did this with Media Center Extender running on linux on ARM where we needed to stream 1080p MPEG and didn’t have the memory bandwidth for two copies. We added send/recv calls that allowed the memory that the NIC was reading into to be the same memory the mpeg decoder was reading from. It allowed us to get better latency by not having to buffer as much video to keep the decoder fed. Neat that there is a general mechanism for this now.
I was involved with trying to do a similar kind of thing with a video streaming box I was tasked with working on for Google Fiber some years ago. It never shipped, but we had to repurpose an old MIPS-based box that was being used as a network box to spool up an inbound MPEG-TS video stream and then re-serve it as MPEG-DASH segments to the local network. I was trying to do it with zero copies and minimal transitions between user and kernel space, so it was a lot of mmap'ing and the like. On weak hardware these kinds of optimizations showed a lot of promise.
Unfortunately the whole project was killed and all Fiber projects pulled from our office before we could finish.
There are multiple problems that "kernel bypass" solves.
Generally speaking:
(A) prevent kernel from touching data and only operate on metadata (to reduce cache misses/memory bandwidth)
(B) in routing situations prevent application from copying data
(C) busy-polling - avoid the cost of jumping to/from kernel context by doing everything in the application, for better latency
(D) allow manual buffer management
For example, splice or tcp_mmap() can be used to "receive" data from the networking stack into an application-managed place. splice and MSG_ZEROCOPY can be used to "transmit" data into the networking stack. The problem with MSG_ZEROCOPY is that, while the buffers belong to the application, the API doesn't tell you when it's done sending and when it's safe to repurpose the buffers.
This io_uring article seems to indicate this fault has been fixed in the io_uring zero-copy API.
These APIs are pretty much (A): the kernel doesn't touch the data and only looks at metadata.
Due to API limitations it's hard or even impossible to do (B) in an application. But with better APIs we'll get there.
To get (C), you need raw hardware access basically.
But the holy grail is (D): the buffer-reuse problem. Ideally you'd want to submit a buffer to the network card (or network stack), get data from the NIC directly into it, then get a handle to this data in the application, perform a routing decision, submit that buffer for transmission on some other NIC, and eventually get a completion notification. At which point the buffer could be submitted for RX again. The userspace zero-copy APIs are far away from this ideal scenario.
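For comparison, the DPDK version of that loop already works this way: mbufs come off one port's RX ring, the application makes its decision, and the same buffers go onto another port's TX ring, with the driver recycling them after transmit. A hedged sketch (EAL and port setup omitted; forward_loop is my name for it):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    static void forward_loop(uint16_t rx_port, uint16_t tx_port)
    {
        struct rte_mbuf *pkts[BURST];

        for (;;) {
            /* Frames come straight off the NIC's RX ring; no copy into the app. */
            uint16_t n = rte_eth_rx_burst(rx_port, 0, pkts, BURST);

            for (uint16_t i = 0; i < n; i++) {
                /* a routing decision would inspect rte_pktmbuf_mtod(pkts[i], ...) here */
            }

            /* Hand the same mbufs to the other port; ownership passes to its TX ring. */
            uint16_t sent = rte_eth_tx_burst(tx_port, 0, pkts, n);

            /* Anything the TX queue would not accept must be freed (or retried). */
            for (uint16_t i = sent; i < n; i++)
                rte_pktmbuf_free(pkts[i]);
        }
    }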
So no, zero-copy kernel APIs don't immediately solve all the problems a typical kernel bypass solution aims for. Having said that, I don't think many users have the problems that true kernel bypass solves.
> The problem with MSG_ZEROCOPY is that, while the buffers belong to the application, the API doesn't tell you when it's done sending and when it's safe to repurpose the buffers. This io_uring article seems to indicate this fault has been fixed in the io_uring zero-copy API.
This is the hard part of a zero-copy API, and why it's hard to make a zero-copy API that's a drop-in replacement for sockets and can magically enable naive applications to have zero-copy transmits. I attempted to do zero-copy sockets in FreeBSD ~20+ years ago (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34...). What's not mentioned in the paper is that the change was an overall net negative, because applications would frequently trigger a COW fault when attempting to write to a buffer owned by the kernel.
It almost seems like zero-copy design is a larger design principle that has to permeate how all user and kernel code is structured in order to really see the benefits.
I've been thinking about these sorts of things (kernel/user data barriers and high-performance IPC) a lot lately, as I've been wondering what a high-performance, high-security microkernel might look like.
Depends; kernel bypass isn't just about avoiding the kernel. A lot of the cards used are specifically designed for this and come with much lower latency (as low as 1us to write a packet).
Eliminating the kernel's copy of the buffer will improve performance significantly, but probably won't get you the same performance as a specialised card, which is what you'd usually go for if it's that important.