It seems like a lot of performance optimizations that used to require custom kernels can now be implemented using io_uring instead. Does anyone know if any such performance improvements are making their way into places where they can be felt by those of us working higher up the tech stack? (E.g. if nginx, node or similar got faster networking or disk access)
I think the performance optimisations people use these days are not kernel patches so much as:
- userspace networking, which avoids copying in the kernel and expensive context switches (both also advantages of io_uring; see the sketch after this list) and some costs of the socket APIs[1]
- specialised NICs that can e.g. do TLS in hardware
- specialised NICs with on-board FPGAs to which some custom processing may be offloaded
- specialised NICs with first class virtualisation support to avoid hypervisor overhead (mostly just a cloud thing)
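To make the io_uring comparison concrete, here is a minimal sketch in C using liburing (handle_packet and the socket setup are placeholders, error handling omitted). The point is the batched model: many receives queued and submitted with a single syscall, and completions reaped from a shared ring rather than one wakeup-plus-read per fd.

    #include <liburing.h>
    #include <stdint.h>

    #define QUEUE_DEPTH 256
    #define BUF_SIZE    2048

    static char bufs[QUEUE_DEPTH][BUF_SIZE];

    /* Placeholder for application logic. */
    static void handle_packet(const char *data, int len) { (void)data; (void)len; }

    /* Queue one recv per connection, submit them all with a single syscall,
     * then reap completions as they arrive.  The batching is where the
     * saved context switches come from. */
    static void serve(const int *conn_fds, int nconns)
    {
        struct io_uring ring;
        io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

        for (int i = 0; i < nconns && i < QUEUE_DEPTH; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_recv(sqe, conn_fds[i], bufs[i], BUF_SIZE, 0);
            io_uring_sqe_set_data(sqe, (void *)(uintptr_t)i); /* remember the buffer */
        }
        io_uring_submit(&ring);              /* one syscall for all queued recvs */

        struct io_uring_cqe *cqe;
        while (io_uring_wait_cqe(&ring, &cqe) == 0) {
            unsigned idx = (unsigned)(uintptr_t)io_uring_cqe_get_data(cqe);
            if (cqe->res > 0)
                handle_packet(bufs[idx], cqe->res);
            io_uring_cqe_seen(&ring, cqe);
            /* a real server would re-arm the recv for this connection here */
        }
        io_uring_queue_exit(&ring);
    }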
One thing about high performance networking is that there are really two different desires: maximising throughput and minimising latency. Different choices will be made based on what kind of performance is most desired.
[1] If you imagine a typical high-performance, pre-io_uring, portable web server, you have a socket you listen on, many threads calling accept on it (with an option so that only one is woken at a time for new connections), and a load of fds for open connections, with the server calling epoll to wait for input on them. When a packet comes in and makes it through the kernel TCP stack, the kernel has to assign it to the appropriate fd and figure out whether any thread needs to wake up. With a specialised networking setup, by contrast, you can receive packets in the order they arrive and dispatch them to their handlers yourself, without the overhead of explaining to the kernel how to do that.
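For contrast, a stripped-down version of that classic epoll loop might look something like this (single-threaded, error handling omitted); it shows the per-fd bookkeeping both the kernel and the application have to do:

    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define MAX_EVENTS 64

    /* Classic pre-io_uring shape: one listening socket plus every open
     * connection registered with epoll, a wakeup per readable fd, and a
     * separate recv() syscall for each of them. */
    static void epoll_loop(int listen_fd)
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        struct epoll_event events[MAX_EVENTS];
        char buf[2048];

        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1); /* sleep in the kernel */
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                if (fd == listen_fd) {                        /* new connection */
                    int conn = accept(listen_fd, NULL, NULL);
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                    epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
                } else {                                      /* readable client */
                    ssize_t len = recv(fd, buf, sizeof(buf), 0);
                    if (len <= 0) { close(fd); continue; }
                    /* ... parse the request and respond ... */
                }
            }
        }
    }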
Not really. The only things that need to be offloaded are the record encapsulation (the format of which has remained backwards compatible) and the encryption/decryption (which, for all intents and purposes, is limited to AES-GCM). The handshake and the control messages can still be handled in software, since those aren't on the critical path.
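That split is roughly what Linux kTLS exposes, and it's the hook NIC TLS offload plugs into: the handshake stays in userspace, then the negotiated AES-GCM keys are handed down and plain send() produces encrypted records. A rough sketch, with the key material as placeholders supplied by whatever did the handshake:

    #include <linux/tls.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* After the TLS handshake has completed in userspace, hand the negotiated
     * TLS 1.2 AES-128-GCM write keys to the kernel.  key/iv/salt/rec_seq come
     * from the handshake library and are placeholders here. */
    static int enable_ktls_tx(int fd, const unsigned char *key, const unsigned char *iv,
                              const unsigned char *salt, const unsigned char *rec_seq)
    {
        if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
            return -1;

        struct tls12_crypto_info_aes_gcm_128 ci;
        memset(&ci, 0, sizeof(ci));
        ci.info.version = TLS_1_2_VERSION;
        ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
        memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
        memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
        memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
        memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

        /* From here on, plain send() produces TLS records; with a capable NIC
         * the kernel can push the record crypto down to hardware. */
        return setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));
    }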
DDIO applies to ANY I/O device, which makes it a horrible idea on servers with even a moderate amount of I/O. With DDIO, EVERY I/O device's DMA writes (i.e. transfers from device to host) allocate space in the cache. That works great on toy benchmarks with a single thread polling for I/O from a NIC. But scale it up to a 32c/64t server that is reading a few gigabytes per second from disk and doing a few tens of gigabits of network I/O, and you thrash your caches.
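Some rough, purely illustrative arithmetic on why the benefit evaporates at that scale (every number below is an assumption, not a measurement):

    #include <stdio.h>

    /* Back-of-the-envelope for the thrashing claim.  Assumptions: DDIO gets
     * only a couple of LLC ways (here ~10% of a 40 MB LLC), and the DMA
     * rates roughly match the scenario above. */
    int main(void)
    {
        double llc_bytes     = 40e6;       /* assumed LLC size            */
        double ddio_fraction = 0.10;       /* assumed DDIO-eligible share */
        double disk_bps      = 3e9;        /* "a few GB/s" from disk      */
        double net_bps       = 40e9 / 8;   /* "a few tens of Gbit/s"      */

        double ddio_bytes = llc_bytes * ddio_fraction;
        double dma_bps    = disk_bps + net_bps;

        /* Time until the DDIO-eligible ways have been completely overwritten:
         * anything not consumed within that window is evicted to DRAM anyway. */
        printf("DDIO slice: %.1f MB, refilled every %.0f microseconds\n",
               ddio_bytes / 1e6, ddio_bytes / dma_bps * 1e6);
        return 0;
    }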
Its precursor, DCA, was limited to NICs from Intel (and a few other vendors), and was far better. With DCA, the NIC decides which DMA writes, if any, should be pushed to cache and tags them with a PCIe TLP steering tag. The best choices are typically RX descriptors and protocol headers; other data is not tagged, and hence does not pollute the cache.
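Conceptually, the driver-side split looks something like the sketch below. All of the names and fields are hypothetical; real NICs use vendor-specific descriptor formats and the PCIe TPH capability. The only point is that the header DMA carries a steering tag while the bulk payload does not:

    #include <stdbool.h>
    #include <stdint.h>

    /* Purely illustrative, made-up RX descriptor for a DCA/TPH-style NIC
     * that splits headers and payload. */
    struct rx_desc {
        uint64_t hdr_addr;       /* small buffer for protocol headers        */
        uint64_t payload_addr;   /* large buffer for packet data             */
        uint16_t hdr_steer_tag;  /* TPH steering tag: inject into this cache */
        bool     hdr_push;       /* header write goes to the LLC             */
        bool     payload_push;   /* payload write stays in DRAM              */
    };

    static void fill_rx_desc(struct rx_desc *d, uint64_t hdr, uint64_t payload,
                             uint16_t consumer_core_tag)
    {
        d->hdr_addr      = hdr;
        d->payload_addr  = payload;
        d->hdr_steer_tag = consumer_core_tag; /* steer toward the consuming core */
        d->hdr_push      = true;              /* descriptors/headers: cache them */
        d->payload_push  = false;             /* bulk data: don't pollute cache  */
    }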
It sounds like what you want is RDT+CAT, so you can allocate your LLC as you like. And that means you have to pick and choose your platform since Intel keeps waffling over whether CAT should exist on every SKU or not.
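On Linux those knobs are exposed through the resctrl filesystem; carving out a few L3 ways for one group of tasks might look like the sketch below (assumes resctrl is already mounted, a single L3 cache domain, root privileges, and a capacity-mask width that matches the part; the PID is a placeholder):

    #include <errno.h>
    #include <stdio.h>
    #include <sys/stat.h>

    /* Sketch: create a resctrl group that owns L3 ways 0-3 on cache domain 0
     * and move one (placeholder) PID into it. */
    static int write_str(const char *path, const char *s)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        int bad = fputs(s, f) < 0;
        return (fclose(f) || bad) ? -1 : 0;
    }

    int main(void)
    {
        /* mount -t resctrl resctrl /sys/fs/resctrl   (done beforehand, as root) */
        if (mkdir("/sys/fs/resctrl/netgroup", 0755) && errno != EEXIST)
            return 1;

        /* Ways 0-3 for this group; trim the default group's mask separately
         * if you want these ways to be exclusive. */
        if (write_str("/sys/fs/resctrl/netgroup/schemata", "L3:0=00f\n"))
            return 1;

        /* Placeholder PID of the latency-critical process. */
        return write_str("/sys/fs/resctrl/netgroup/tasks", "12345\n") ? 1 : 0;
    }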
Well, yes and no. I don't see how CAT can differentiate between a packet header and packet data, but being able to control the number of ways dedicated to DDIO would be nice.
In terms of features like CAT being available on different SKUs: Intel needs to remember that they're not just competing with themselves these days. Their competitors generally don't fuse off features on low-end SKUs to drive sales of higher-margin parts. Intel needs to differentiate chips solely on core count, clock speed, cache size, etc.
Small correction: DDIO is not limited to Intel NICs; it's a mostly-transparent-to-the-hardware mechanism by which DMA writes are snooped and some fraction of the NIC-local L3 cache (in the case of multi-socket systems) is filled with incoming data.
DDIO works for any DMA initiator, not just Intel NICs.
It's too bad that DDIO has not trickled down the product line. It doesn't exist on desktop-class parts, and many people can't use it even on top-of-the-line Xeon-SP because cloud operators disable both it and RDT/CAT (the latter being helpful when using direct cache access).