My side project has been to rewrite https://github.com/erpc-io/eRPC, which does RPCs over UDP with congestion control and is supposedly quite fast. Paper: https://www.usenix.org/system/files/nsdi19-kalia.pdf. I never got the code to work in AWS; I believe the author focused on Mellanox NICs, but that's not really commodity H/W, which is where my interests lie.
So I dug into it, and well, I'll have my own library soon. I should be able to send UDP w/o congestion control sometime this week.
eRPC uses DPDK (100% user space NIC TX/RX control) plus the author's other ideas to get performance. Since I'm getting into DPDK way, way late in the game, I hope and believe DPDK is and will remain a better way to go than kernel fix-ups.
Getting the kernel out of the way with pinned threads -- or something more exotic with special H/W if preferred -- is cleaner if one can develop from scratch. DPDK is zero copy all day long, and always has been.
This library will be part of something bigger. For many frameworks, a key architectural question is:
I got an RPC/packet/message. OK, now what?
* process it in-place, e.g. on the thread that was doing RX?
* if I don't delegate, am I creating back-pressure?
* delegate it to another core? How do I do that efficiently? (A ring sketch follows this list.) Hint: https://blog.acolyer.org/2017/12/04/ffwd-delegation-is-much-...
* if I delegate ... to whom? Maybe I'm partitioning?
* if I delegate, how do I get a response back?
In DPDK I believe these are easier to decide on and manage well in code.
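A minimal sketch of the "delegate to another core" option, assuming one RX (producer) thread and one worker (consumer) thread per ring. The names and sizes are mine, not from eRPC or the ffwd paper; DPDK's rte_ring plays a similar role if you're already in that ecosystem:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SLOTS 1024                      /* must be a power of two */

    struct spsc_ring {
        _Atomic size_t head;                     /* advanced by the RX (producer) core */
        _Atomic size_t tail;                     /* advanced by the worker (consumer) core */
        void *slots[RING_SLOTS];
    };

    /* RX core: hand a message off without locks; a 'false' return is back-pressure. */
    static bool spsc_push(struct spsc_ring *r, void *msg)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SLOTS)
            return false;                        /* full */
        r->slots[head & (RING_SLOTS - 1)] = msg;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    /* Worker core: drain the ring and process (or reply) in its own time. */
    static void *spsc_pop(struct spsc_ring *r)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail == head)
            return NULL;                         /* empty */
        void *msg = r->slots[tail & (RING_SLOTS - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return msg;
    }

Getting a response back can be the mirror image: a second ring owned by the worker, drained by the RX/TX core.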
(Author of eRPC.) I'm glad you've found the project useful, and my apologies for the code quality, build quality, and NIC compatibility issues that you faced. I'm very interested in learning from your experience (perhaps even as a GitHub issue on our repo), especially w.r.t. how the code organization should be improved, although I understand the list might be too long :).
I know of some issues myself that I plan to fix, e.g., using macros to decide which Transport class to use -- this should just use virtual methods; messy inter-mixing of congestion control and packet I/O; some "performance optimizations" like preallocated buffers that are just not worth it.
I'll be in touch, you can be sure. The code is in Git, but I will not make it public until I have UDP working --- hopefully within +/- 3 days.
I do and will continue to mention your paper (which I also found on HN) and your repo because it has a lot of really good ideas ... ideas which I'm forced to engage with and master.
I haven't yet had time to focus on how you deal with packet loss and congestion management. Anything you can do there --- I'll have to reread your paper (well written, BTW) to pull my own weight here --- is a real benefit to the community.
If you want to get packets into userspace fast, you can now apparently also use AF_XDP on Linux, which is also effectively a kernel bypass. The driver puts packets into a ring buffer before they hit the rest of the kernel stack, and userspace can fetch them directly from there. I haven't benchmarked it for the same application, but it promises to give similar levels of performance.
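For a feel of the shape of the API, here is a hedged sketch of the RX side using the xsk_* helpers from libxdp/libbpf. UMEM registration, socket creation with xsk_socket__create(), and fill-ring refilling are omitted; poll_rx and BATCH are names I made up:

    #include <xdp/xsk.h>                 /* libxdp; older setups ship this as <bpf/xsk.h> */

    #define BATCH 64

    /* Drain up to BATCH frames from an AF_XDP RX ring; umem_area is the mmap'd UMEM. */
    static void poll_rx(struct xsk_ring_cons *rx, void *umem_area)
    {
        __u32 idx = 0;
        __u32 rcvd = xsk_ring_cons__peek(rx, BATCH, &idx);

        for (__u32 i = 0; i < rcvd; i++) {
            const struct xdp_desc *desc = xsk_ring_cons__rx_desc(rx, idx + i);
            void *frame = xsk_umem__get_data(umem_area, desc->addr);
            /* desc->len bytes at `frame` are the raw L2 packet, still sitting in the UMEM */
            (void)frame;
        }

        xsk_ring_cons__release(rx, rcvd);
        /* A real loop would now give these frame addresses back to the fill ring. */
    }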
I think the other questions are more about library design and less about the packet API one uses. On top of all kinds of APIs one could build single-threaded state machines which dequeue packets and process them. Or other architectures if required.
Depending on the requirements of the RPC, you could implement the logic in eBPF code using the XDP framework, which executes the code directly inside the NIC driver and can modify the packet buffer into a response packet that gets sent out immediately, without ever going through the rest of the kernel or userspace.
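As a rough illustration of that idea (not a full RPC responder), an XDP program can rewrite the frame in place and bounce it straight back out of the same interface with XDP_TX. This sketch only swaps the Ethernet addresses; a real responder would also rewrite the IP/UDP headers and payload:

    /* xdp_echo.c -- compile with: clang -O2 -target bpf -c xdp_echo.c -o xdp_echo.o */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int xdp_echo(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;

        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;                 /* runt frame: let the kernel have it */

        /* Swap src/dst MACs so the (modified) frame heads back to the sender. */
        unsigned char tmp[ETH_ALEN];
        __builtin_memcpy(tmp, eth->h_dest, ETH_ALEN);
        __builtin_memcpy(eth->h_dest, eth->h_source, ETH_ALEN);
        __builtin_memcpy(eth->h_source, tmp, ETH_ALEN);

        return XDP_TX;                       /* transmit without touching the rest of the stack */
    }

    char _license[] SEC("license") = "GPL";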
No, it's offloading the code onto the processor on the NIC (such NICs usually have several general-purpose CPU cores). According to Wikipedia [0] it's not widely supported in hardware, but a few companies like Intel are "working on it". I guess the upshot is that it gracefully "degrades" to running on the host CPU on an unsupported NIC.
Offloading to the NIC is of course the fastest way, but even running it in the NIC driver (check whether your NIC driver supports it, or you'll fall back to the slower generic path) already gives you advantages similar to doing it with DPDK.
It seems like a lot of performance optimizations that used to require custom kernels can now be implemented using io_uring instead. Does anyone know if any such performance improvements are making their way into places where they can be felt by those of us working higher up the tech stack? (E.g. if nginx, node or similar got faster networking or disk access)
I think the performance optimisations people use these days are not kernel patches so much as:
- userspace networking which avoids copying in the kernel and expensive context switches (both advantages of io_uring) and some costs of sockets apis[1]
- specialised NICs that can e.g. do TLS in hardware
- specialised NICs with on-board FPGAs to which some custom processing may be offloaded
- specialised NICs with first class virtualisation support to avoid hypervisor overhead (mostly just a cloud thing)
One thing about high performance networking is that there are really two different desires: maximising throughput and minimising latency. Different choices will be made based on what kind of performance is most desired.
[1] If you imagine a typical high-performance, pre-io_uring, portable web server: you have a socket you listen on, many threads calling accept on it (with an option so that only one is woken up at a time for new connections), and a load of fds for open connections, with the server calling epoll to wait for input on them. When a packet comes in and makes it through the kernel TCP stack, the kernel has to assign the packet to the appropriate fd and figure out whether any thread needs to wake up. With specialised networking configurations, you can instead receive the packets in the order they arrive and send them off to their handlers yourself, without the overhead of explaining how to do that to the kernel.
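A bare-bones sketch of that pre-io_uring pattern, with error handling and edge-triggering details left out (serve_loop is just my name for it):

    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Classic single-threaded accept + epoll loop over a listening TCP socket. */
    static void serve_loop(int listen_fd)
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        for (;;) {
            struct epoll_event events[64];
            int n = epoll_wait(epfd, events, 64, -1);   /* kernel decides who is ready */

            for (int i = 0; i < n; i++) {
                if (events[i].data.fd == listen_fd) {
                    int conn = accept(listen_fd, NULL, NULL);
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                    epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
                } else {
                    char buf[4096];
                    ssize_t r = read(events[i].data.fd, buf, sizeof(buf));
                    if (r <= 0)
                        close(events[i].data.fd);
                    /* else: parse the request and queue a response */
                }
            }
        }
    }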
Not really. The only thing that needs to be offloaded is the record encapsulation (the format of which has remained backwards compatible) and the encryption/decryption (which is limited to AES-GCM for all intents and purposes). The handshake and the control messages can still be handled in software, since those aren't in the critical path.
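On Linux the usual way to get at that split is kTLS: the handshake stays in your TLS library, and once the keys exist you push them into the kernel (and, with a capable NIC and driver, into hardware). A hedged sketch, assuming the key material comes from your TLS library:

    #include <linux/tls.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    #ifndef SOL_TLS
    #define SOL_TLS 282
    #endif
    #ifndef TCP_ULP
    #define TCP_ULP 31
    #endif

    /* Hand the AES-GCM TX keys to the kernel; after this, plain send() emits TLS records. */
    static int enable_ktls_tx(int fd,
                              const unsigned char *key, const unsigned char *iv,
                              const unsigned char *salt, const unsigned char *rec_seq)
    {
        if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
            return -1;

        struct tls12_crypto_info_aes_gcm_128 ci = {
            .info.version     = TLS_1_2_VERSION,
            .info.cipher_type = TLS_CIPHER_AES_GCM_128,
        };
        memcpy(ci.key,     key,     TLS_CIPHER_AES_GCM_128_KEY_SIZE);
        memcpy(ci.iv,      iv,      TLS_CIPHER_AES_GCM_128_IV_SIZE);
        memcpy(ci.salt,    salt,    TLS_CIPHER_AES_GCM_128_SALT_SIZE);
        memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

        return setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));
    }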
DDIO applies to ANY I/O device, which makes it a horrible idea on servers with even a moderate amount of I/O. With DDIO, EVERY I/O device's DMA writes (e.g., transfers from device to host) allocate space in cache. This means it works great on toy benchmarks with a single thread polling for I/O from a NIC. But scale it up to a 32c/64t server that is reading a few gigabytes/sec from disk and doing a few tens of gigabits of network I/O, and you thrash your caches.
Its precursor, DCA, was limited to Intel (and a few other vendors) NICs, and was far better. With DCA, the NIC decides what, if any, DMA write should be pushed to cache and tags that DMA write with a PCIe TLP steering tag. The best choices are typically RX descriptors, and protocol headers. Other data is not tagged, and hence does not pollute the cache.
It sounds like what you want is RDT+CAT, so you can allocate your LLC as you like. And that means you have to pick and choose your platform since Intel keeps waffling over whether CAT should exist on every SKU or not.
Well, yes and no. I don't see how CAT can differentiate between a packet header and packet data, but being able to control the number of ways dedicated to DDIO would be nice.
In terms of features like CAT being available on different SKUs.. Intel needs to remember that they're not just competing with themselves these days. Their competitors generally don't go fusing off features on low-end SKUs to drive sales of higher-margin SKUs. Intel needs to differentiate chips solely on core count, clock speed, cache size, etc.
Small correction: DDIO is not limited to Intel NICs; it's a mostly-transparent-to-the-hardware mechanism by which DMAs are snooped and some fraction of the NIC-local (in the case of multi-socket systems) L3 cache is filled with incoming data.
DDIO works for any DMA initiator, not just Intel NICs.
It's too bad that DDIO has not trickled down the product line. It doesn't exist on desktop-class parts, and many people can't use it even on top-of-the-line Xeon-SP because cloud operators disable it and RDT/CAT, which is helpful when using direct cache access.
I was just looking into building a custom router using an older PC and a dual-port 2.5GbE NIC. OPNsense sounds like a solid choice, but I'd prefer a Linux solution because the CPU is pretty beefy and I wouldn't mind running a few Docker containers on the side (yes, I know multi-purposing a router is a bad idea, but meh).
It's a NAT router, so at minimum it needs to masquerade (rewrite the src/dst IPs in the packet headers between the internal and public addresses). I'd also be running fail2ban or crowdsec for basic intrusion detection.
Have you considered vyos as a Linux alternative to *sense?
Additionally, you could set the box up as a hypervisor, pass the NIC through to the router VM, and have the other VMs/containers use a bridge or some more advanced virtual networking.
That obviously has potential downsides, but I had a bit of fun doing it before downsizing my server
I wanted to see some benchmark numbers, but the first hit on Google returns an unresolved 150+ message GitHub issue about how io_uring might not be better than epoll: https://github.com/axboe/liburing/issues/189
I didn't read the entire thread, but are there still implementation kinks being worked out, or is the API ready to use now?
Note: since Alex is a fairly gender neutral name I'm going to use they/them pronouns.
I went through and read it, and the submitter is incredibly confrontational and not at all open to feedback on the correctness of their benchmark. Also, when presented with contradictory evidence (their own benchmark where the results show io_uring is faster than epoll on other machines), they essentially dismiss it and say the other users ran the benchmark wrong, or that they can't reproduce the results on their machine, or that the other users used Boost and therefore their results don't count. So not an entirely reliable criticism coming from them, in my opinion.
That's not my take at all. Many people respond with hand-waving arguments: maybe epoll is faster because the benchmark is wrong, or because only one file descriptor is used (which wasn't correct, but was said anyway), etc. Other people also respond saying they've been unable to match epoll performance when trying to add io_uring support to their libraries.
I would recommend doing your own benchmarking for whatever application you want to use it in. This might require building a prototype.
My understanding at the moment is that there is not one io_uring; every kernel version has a different implementation, and it has changed quite a bit over the years. In the earlier versions it just delegated all IO to an in-kernel threadpool. Especially for network IO that's not ideal, since the alternative (epoll) didn't require a threadpool in either the kernel or userspace. And this probably showed up in benchmarks. The implementation has changed, but I can't really tell how everything works now for disk and network IO without reviewing the source code again.
A comprehensive changelog explaining the implementation changes and bugs in different versions would be super helpful.
I think io_uring more or less works as advertised now for the key thing it set out to do, but it removes one bottleneck in a larger system of network processing. You can switch to io_uring, yet without any other changes a real-world application can hit the next bottleneck due to some adjacent architectural decision (or to performance regressions as implementations are changed to accommodate an expanding envelope of uses).
There are a lot of rough edges in io_uring; foremost among them is that it has been written, in classic Linux style, without regard to the security implications, so despite its youth it has already collected four local privilege escalation CVEs. I recommend approaching it skeptically: write your own test and make sure it's actually faster in your use case.
> there needs to be a mechanism by which the kernel can tell applications that a given buffer can be reused. MSG_ZEROCOPY handles this by returning notifications via the error queue associated with the socket — a bit awkward, but it works.
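For reference, this is roughly what that error-queue dance looks like with plain MSG_ZEROCOPY (kernel 4.14+); the function names are mine and error handling is trimmed:

    #include <linux/errqueue.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Opt in once per socket, then send without the kernel copying the payload. */
    static void send_zerocopy(int fd, const void *buf, size_t len)
    {
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
        send(fd, buf, len, MSG_ZEROCOPY);    /* `buf` is now pinned; don't reuse it yet */
    }

    /* Poll the error queue for "sends N..M are done, buffers reusable" notifications. */
    static void reap_zerocopy_completions(int fd)
    {
        for (;;) {
            char control[128];
            struct msghdr msg = { .msg_control = control, .msg_controllen = sizeof(control) };

            if (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) < 0)
                break;                       /* nothing pending */

            for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                struct sock_extended_err *serr = (struct sock_extended_err *)CMSG_DATA(cm);
                if (serr->ee_errno == 0 && serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
                    /* buffers for send numbers serr->ee_info .. serr->ee_data may be reused */
                }
            }
        }
    }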
Let's say they have a circular buffer for the TX queue.
Why can't they have an atomic variable for the read end of the circular buffer? One could map this variable into shared memory, the kernel could advance the counter once the NIC has finished sending a frame with the data at the read end of that circular buffer, and userspace could still query the counter to check for 'buffer full' conditions.
Sounds like it, yes, but it is infinitely more flexible, and IIRC the general architecture also allows chaining, e.g., you can in a sense bind a parameter of a system call to the result of another to-be-made system call. An example might be passing a length or a file descriptor number from one system call to the next.
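For context, this is what chaining looks like with liburing's IOSQE_IO_LINK. Note that a link only enforces ordering and failure propagation: in this sketch the write reuses a fixed length rather than automatically picking up how many bytes the read actually returned (infd/outfd are placeholders):

    #include <liburing.h>

    /* Queue a read and a dependent write in one submission; the write only runs
       if the read completes successfully. */
    static void linked_copy(int infd, int outfd)
    {
        static char buf[4096];
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, infd, buf, sizeof(buf), 0);
        sqe->flags |= IOSQE_IO_LINK;                 /* chain to the next SQE */

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, outfd, buf, sizeof(buf), 0);

        io_uring_submit(&ring);                      /* one syscall submits both ops */

        struct io_uring_cqe *cqe;
        for (int i = 0; i < 2; i++) {                /* reap both completions */
            io_uring_wait_cqe(&ring, &cqe);
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
    }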
Or stuff that requires low latency on a weak platform. When I was at Microsoft we did this with Media Center Extender running on linux on ARM where we needed to stream 1080p MPEG and didn’t have the memory bandwidth for two copies. We added send/recv calls that allowed the memory that the NIC was reading into to be the same memory the mpeg decoder was reading from. It allowed us to get better latency by not having to buffer as much video to keep the decoder fed. Neat that there is a general mechanism for this now.
I was involved with trying to do a similar kind of thing with a video streaming box I was tasked with working on for Google Fiber some years ago. It never shipped, but we had to repurpose an old MIPS-based box that was being used as a network box to spool up an inbound MPEG-TS video stream and then re-serve it as MPEG-DASH segments to the local network. I was trying to do it with zero copies and minimal transitions between user and kernel space, so it was a lot of mmap'ing and the like. On weak hardware these kinds of optimizations showed a lot of promise.
Unfortunately the whole project was killed and all Fiber projects pulled from our office before we could finish.
There are multiple problems that "kernel bypass" solves.
Generally speaking:
(A) prevent kernel from touching data and only operate on metadata (to reduce cache misses/memory bandwidth)
(B) in routing situations prevent application from copying data
(C) busy-polling - avoid the cost of jumping to/from kernel context by doing everything in the application, for better latency
(D) allow manual buffer management
For example, splice or tcp_mmap() can be used to "receive" data from the networking stack into an application-managed place. splice and MSG_ZEROCOPY can be used to "transmit" data into the networking stack. The problem with MSG_ZEROCOPY is that, while the buffers belong to the application, the API doesn't tell you when it's done sending and when it's safe to repurpose the buffers.
This io_uring article seems to indicate this fault has been fixed in the io_uring zero-copy API.
These APIs are pretty much (A): the kernel doesn't touch the data and only looks at metadata.
Due to API limitations it's hard or even impossible to do (B) in an application. But with better APIs we'll get there.
To get (C), you need raw hardware access basically.
But the holy grail is (D): the buffer-reuse problem. Ideally you'd want to submit a buffer to the network card (or network stack), get data from the NIC directly into it, then get a handle to this data in the application, perform a routing decision, submit that buffer for transmission on some other NIC, and eventually get a completion notification. At which point the buffer could be submitted for RX again. The userspace zero-copy APIs are far away from this ideal scenario.
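For comparison, the DPDK version of that loop already works this way: mbufs come off one port's RX ring, the application makes its decision, and the same buffers go onto another port's TX ring, with the driver recycling them after transmit. A hedged sketch (EAL and port setup omitted; forward_loop is my name for it):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    static void forward_loop(uint16_t rx_port, uint16_t tx_port)
    {
        struct rte_mbuf *pkts[BURST];

        for (;;) {
            /* Frames come straight off the NIC's RX ring; no copy into the app. */
            uint16_t n = rte_eth_rx_burst(rx_port, 0, pkts, BURST);

            for (uint16_t i = 0; i < n; i++) {
                /* a routing decision would inspect rte_pktmbuf_mtod(pkts[i], ...) here */
            }

            /* Hand the same mbufs to the other port; ownership passes to its TX ring. */
            uint16_t sent = rte_eth_tx_burst(tx_port, 0, pkts, n);

            /* Anything the TX queue would not accept must be freed (or retried). */
            for (uint16_t i = sent; i < n; i++)
                rte_pktmbuf_free(pkts[i]);
        }
    }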
So no, zero-copy kernel APIs don't immediately solve all the problems a typical kernel bypass solution aims for. Having said that, I don't think many users have the problems that true kernel bypass solves.
> The problem with MSG_ZEROCOPY is that, while the buffers belong to the application, the API doesn't tell you when it's done sending and when it's safe to repurpose the buffers. This io_uring article seems to indicate this fault has been fixed in the io_uring zero-copy API.
This is the hard part of a zero-copy API, and why it's hard to make a zero-copy API that's a drop-in replacement for sockets and can magically enable naive applications to have zero-copy transmits. I attempted to do zero-copy sockets in FreeBSD ~20+ years ago (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34...). What's not mentioned in the paper is that the change was an overall net negative, because applications would frequently trigger a COW fault when attempting to write to a buffer owned by the kernel.
It almost seems like zero-copy design is a larger design principle that has to permeate how all user and kernel code is structured in order to really see the benefits.
I've been thinking about these sorts of things (kernel/user data barriers and high-performance IPC) a lot lately, as I've been wondering what a high-performance, high-security microkernel might look like.
Depends; kernel bypass isn't just about avoiding the kernel. A lot of the cards used are specifically designed for this and come with much lower latency (as low as 1us to write a packet).
Eliminating the kernel's copy of the buffer will improve performance significantly, but probably won't get you the same performance as a specialised card, which is what you'd usually go for if it's that important.