It seems like a lot of performance optimizations that used to require custom kernels can now be implemented using io_uring instead. Does anyone know if any such performance improvements are making their way into places where they can be felt by those of us working higher up the tech stack? (E.g. if nginx, node or similar got faster networking or disk access)
I think the performance optimisations people use these days are not kernel patches so much as:
- userspace networking, which avoids copying in the kernel and expensive context switches (both also advantages of io_uring; see the sketch after this list) and some costs of the socket APIs[1]
- specialised NICs that can e.g. do TLS in hardware
- specialised NICs with on-board FPGAs to which some custom processing may be offloaded
- specialised NICs with first class virtualisation support to avoid hypervisor overhead (mostly just a cloud thing)
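To make the io_uring comparison concrete, here is a minimal sketch in C using liburing (handle_packet and the socket setup are placeholders, error handling omitted). The point is the batched model: many receives queued and submitted with a single syscall, and completions reaped from a shared ring rather than one wakeup-plus-read per fd.

    #include <liburing.h>
    #include <stdint.h>

    #define QUEUE_DEPTH 256
    #define BUF_SIZE    2048

    static char bufs[QUEUE_DEPTH][BUF_SIZE];

    /* Placeholder for application logic. */
    static void handle_packet(const char *data, int len) { (void)data; (void)len; }

    /* Queue one recv per connection, submit them all with a single syscall,
     * then reap completions as they arrive.  The batching is where the
     * saved context switches come from. */
    static void serve(const int *conn_fds, int nconns)
    {
        struct io_uring ring;
        io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

        for (int i = 0; i < nconns && i < QUEUE_DEPTH; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_recv(sqe, conn_fds[i], bufs[i], BUF_SIZE, 0);
            io_uring_sqe_set_data(sqe, (void *)(uintptr_t)i); /* remember the buffer */
        }
        io_uring_submit(&ring);              /* one syscall for all queued recvs */

        struct io_uring_cqe *cqe;
        while (io_uring_wait_cqe(&ring, &cqe) == 0) {
            unsigned idx = (unsigned)(uintptr_t)io_uring_cqe_get_data(cqe);
            if (cqe->res > 0)
                handle_packet(bufs[idx], cqe->res);
            io_uring_cqe_seen(&ring, cqe);
            /* a real server would re-arm the recv for this connection here */
        }
        io_uring_queue_exit(&ring);
    }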
One thing about high performance networking is that there are really two different desires: maximising throughput and minimising latency. Different choices will be made based on what kind of performance is most desired.
[1] If you imagine a typical high-performance, pre-io_uring, portable web server, you have a socket you listen on, many threads calling accept on it (with an option so that only one is woken at a time for new connections), and a load of fds for open connections, with the server calling epoll to wait for input on them. When a packet comes in and makes it through the kernel TCP stack, the kernel has to assign it to the appropriate fd and figure out whether any thread needs to wake up. With a specialised networking setup, by contrast, you can receive packets in the order they arrive and dispatch them to their handlers yourself, without the overhead of explaining to the kernel how to do that.
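For contrast, a stripped-down version of that classic epoll loop might look something like this (single-threaded, error handling omitted); it shows the per-fd bookkeeping both the kernel and the application have to do:

    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define MAX_EVENTS 64

    /* Classic pre-io_uring shape: one listening socket plus every open
     * connection registered with epoll, a wakeup per readable fd, and a
     * separate recv() syscall for each of them. */
    static void epoll_loop(int listen_fd)
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        struct epoll_event events[MAX_EVENTS];
        char buf[2048];

        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1); /* sleep in the kernel */
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                if (fd == listen_fd) {                        /* new connection */
                    int conn = accept(listen_fd, NULL, NULL);
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                    epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
                } else {                                      /* readable client */
                    ssize_t len = recv(fd, buf, sizeof(buf), 0);
                    if (len <= 0) { close(fd); continue; }
                    /* ... parse the request and respond ... */
                }
            }
        }
    }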
Not really. The only things that need to be offloaded are the record encapsulation (the format of which has remained backwards compatible) and the encryption/decryption (which, for all intents and purposes, is limited to AES-GCM). The handshake and the control messages can still be handled in software, since those aren't on the critical path.
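That split is roughly what Linux kTLS exposes, and it's the hook NIC TLS offload plugs into: the handshake stays in userspace, then the negotiated AES-GCM keys are handed down and plain send() produces encrypted records. A rough sketch, with the key material as placeholders supplied by whatever did the handshake:

    #include <linux/tls.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* After the TLS handshake has completed in userspace, hand the negotiated
     * TLS 1.2 AES-128-GCM write keys to the kernel.  key/iv/salt/rec_seq come
     * from the handshake library and are placeholders here. */
    static int enable_ktls_tx(int fd, const unsigned char *key, const unsigned char *iv,
                              const unsigned char *salt, const unsigned char *rec_seq)
    {
        if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
            return -1;

        struct tls12_crypto_info_aes_gcm_128 ci;
        memset(&ci, 0, sizeof(ci));
        ci.info.version = TLS_1_2_VERSION;
        ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
        memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
        memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
        memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
        memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

        /* From here on, plain send() produces TLS records; with a capable NIC
         * the kernel can push the record crypto down to hardware. */
        return setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));
    }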
DDIO applies to ANY I/O device, which makes it a horrible idea on servers with even a moderate amount of I/O. With DDIO, EVERY I/O device's DMA writes (i.e. transfers from device to host) allocate space in the cache. That works great on toy benchmarks with a single thread polling for I/O from a NIC. But scale it up to a 32c/64t server that is reading a few gigabytes per second from disk and doing a few tens of gigabits of network I/O, and you thrash your caches.
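Some rough, purely illustrative arithmetic on why the benefit evaporates at that scale (every number below is an assumption, not a measurement):

    #include <stdio.h>

    /* Back-of-the-envelope for the thrashing claim.  Assumptions: DDIO gets
     * only a couple of LLC ways (here ~10% of a 40 MB LLC), and the DMA
     * rates roughly match the scenario above. */
    int main(void)
    {
        double llc_bytes     = 40e6;       /* assumed LLC size            */
        double ddio_fraction = 0.10;       /* assumed DDIO-eligible share */
        double disk_bps      = 3e9;        /* "a few GB/s" from disk      */
        double net_bps       = 40e9 / 8;   /* "a few tens of Gbit/s"      */

        double ddio_bytes = llc_bytes * ddio_fraction;
        double dma_bps    = disk_bps + net_bps;

        /* Time until the DDIO-eligible ways have been completely overwritten:
         * anything not consumed within that window is evicted to DRAM anyway. */
        printf("DDIO slice: %.1f MB, refilled every %.0f microseconds\n",
               ddio_bytes / 1e6, ddio_bytes / dma_bps * 1e6);
        return 0;
    }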
Its precursor, DCA, was limited to NICs from Intel (and a few other vendors), and was far better. With DCA, the NIC decides which DMA writes, if any, should be pushed to cache and tags them with a PCIe TLP steering tag. The best choices are typically RX descriptors and protocol headers; other data is not tagged, and hence does not pollute the cache.
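Conceptually, the driver-side split looks something like the sketch below. All of the names and fields are hypothetical; real NICs use vendor-specific descriptor formats and the PCIe TPH capability. The only point is that the header DMA carries a steering tag while the bulk payload does not:

    #include <stdbool.h>
    #include <stdint.h>

    /* Purely illustrative, made-up RX descriptor for a DCA/TPH-style NIC
     * that splits headers and payload. */
    struct rx_desc {
        uint64_t hdr_addr;       /* small buffer for protocol headers        */
        uint64_t payload_addr;   /* large buffer for packet data             */
        uint16_t hdr_steer_tag;  /* TPH steering tag: inject into this cache */
        bool     hdr_push;       /* header write goes to the LLC             */
        bool     payload_push;   /* payload write stays in DRAM              */
    };

    static void fill_rx_desc(struct rx_desc *d, uint64_t hdr, uint64_t payload,
                             uint16_t consumer_core_tag)
    {
        d->hdr_addr      = hdr;
        d->payload_addr  = payload;
        d->hdr_steer_tag = consumer_core_tag; /* steer toward the consuming core */
        d->hdr_push      = true;              /* descriptors/headers: cache them */
        d->payload_push  = false;             /* bulk data: don't pollute cache  */
    }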
It sounds like what you want is RDT+CAT, so you can allocate your LLC as you like. And that means you have to pick and choose your platform since Intel keeps waffling over whether CAT should exist on every SKU or not.
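On Linux those knobs are exposed through the resctrl filesystem; carving out a few L3 ways for one group of tasks might look like the sketch below (assumes resctrl is already mounted, a single L3 cache domain, root privileges, and a capacity-mask width that matches the part; the PID is a placeholder):

    #include <errno.h>
    #include <stdio.h>
    #include <sys/stat.h>

    /* Sketch: create a resctrl group that owns L3 ways 0-3 on cache domain 0
     * and move one (placeholder) PID into it. */
    static int write_str(const char *path, const char *s)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        int bad = fputs(s, f) < 0;
        return (fclose(f) || bad) ? -1 : 0;
    }

    int main(void)
    {
        /* mount -t resctrl resctrl /sys/fs/resctrl   (done beforehand, as root) */
        if (mkdir("/sys/fs/resctrl/netgroup", 0755) && errno != EEXIST)
            return 1;

        /* Ways 0-3 for this group; trim the default group's mask separately
         * if you want these ways to be exclusive. */
        if (write_str("/sys/fs/resctrl/netgroup/schemata", "L3:0=00f\n"))
            return 1;

        /* Placeholder PID of the latency-critical process. */
        return write_str("/sys/fs/resctrl/netgroup/tasks", "12345\n") ? 1 : 0;
    }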
Well, yes and no. I don't see how CAT can differentiate between a packet header and packet data, but being able to control the number of ways dedicated to DDIO would be nice.
In terms of features like CAT being available on different SKUs: Intel needs to remember that they're not just competing with themselves these days. Their competitors generally don't fuse off features on low-end SKUs to drive sales of higher-margin parts. Intel needs to differentiate chips solely on core count, clock speed, cache size, etc.
Small correction: DDIO is not limited to Intel NICs; it's a mostly-transparent-to-the-hardware mechanism by which DMA writes are snooped and some fraction of the NIC-local L3 cache (in the case of multi-socket systems) is filled with incoming data.
DDIO works for any DMA initiator, not just Intel NICs.
It's too bad that DDIO has not trickled down the product line. It doesn't exist on desktop-class parts, and many people can't use it even on top-of-the-line Xeon-SP because cloud operators disable both it and RDT/CAT (the latter being helpful when using direct cache access).