This is more than merely UDP with encryption. Rather, this seems to be a reliable stream protocol aimed at dethroning TCP. I don't think it intends to replace UDP. If anything, I would describe it as a user space replacement for TCP. It uses UDP, yes, but that's because UDP is about as close as you can get to IP at the user level.
It's about time, too. Certainly user-level substitutes for TCP are not a new idea (I've worked a good deal with UDT), but this is the first one to my knowledge that has the weight of something like Google behind it.
There are a lot of advantages to doing this at the user level with UDP as well. Not needing to install a kernel module, for one. But also P2P rendezvous connections through NAT with UDP (we wouldn't even need to wait for router manufacturers to adopt anything), better control over protocol options without going through the kernel, custom congestion control algorithms, finer control over buffers (mmap'd files as buffers for file transfers, anyone?), etc.
I'd much rather see this at the kernel level, but TCP as it's implemented on most systems has too many issues on modern networks, and most acceptable user-level solutions (ugh, parallel streams) are really terrible hacks. If Google does it right, maybe it'll force OS developers to wake up and rethink the increasingly clunky and antiquated BSD sockets API that is the source of many of TCP's woes and finally modernize the network stack for the fiber era.
What exactly about the BSD sockets API makes it a source of TCP's many woes? What woes are you talking about?
Also, why do we want to re-implement TCP in user space when it still needs to traverse the kernel as a UDP packet? This doesn't have any of the benefits that netmap or PF_RING, for example, bring to the table, where the software implements the full stack and thus there is less latency involved.
QUIC as it currently stands will be sitting on top of UDP, so all packets traverse the kernel, get dumped to user space, user space parses them, and sends packets back as UDP. That's a lot of extra copying, when with TCP in the kernel, the kernel would be responsible for reassembling packets, ACKing them, and stuff like that.
Also, mmap'd files as buffer for file transfers ... you are going to have to explain that one. There are already various zero copy mechanisms for sending files, such as sendfile().
As for new congestion control algorithms, great, sounds fantastic, but how much more of a nightmare is it going to be to get the parties to agree on a congestion control mechanism? We already have that problem now ...
Sorry, but I see a whole lot of complaining without any real data or information to back it up. [[citation needed]].
Maybe TCP's issues aren't apparent when you're using it to download page assets from AWS over your home Internet connection, but they become apparent when you're doing large file transfers between systems whose bandwidth-delay products (BDPs) greatly exceed the upper limit of the TCP buffers on the end systems.
This may not be an issue for users of consumer grade Internet service, but it is an issue for organizations that have private, dedicated, high-bandwidth links and need to move a lot of data over large distances (equating to high latency) very quickly and often: CDNs, data centers, research institutions, or, I dunno, maybe someone like Google.
The BDP and the TCP send buffer size impose an upper limit on the window size for the connection. Ideally, in a file transfer scenario, the BDP and the socket's send buffer size should be equal. If your send buffer size is lower than the BDP, you cannot ever transfer at a greater throughput than buffer_size / link_latency, and thus you cannot ever attain maximum bandwidth. I can explain in more detail why that's true if you want, but otherwise here's this: http://www.psc.edu/index.php/networking/641-tcp-tune
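To put some rough numbers on it (made-up example figures, not measurements): on a 10 Gbit/s link with 100 ms RTT the BDP is about 125 MB, so a 4 MB send buffer caps you at roughly 40 MB/s no matter how fat the pipe is. A quick back-of-envelope sketch in C:

    /* Back-of-envelope: how a capped send buffer limits TCP throughput.
     * Example figures only; plug in your own link speed, RTT, and buffer size. */
    #include <stdio.h>

    int main(void) {
        double bandwidth_bps = 10e9;    /* 10 Gbit/s link */
        double rtt_s         = 0.100;   /* 100 ms round trip */
        double sndbuf_bytes  = 4e6;     /* 4 MB send buffer */

        double bdp_bytes = bandwidth_bps / 8.0 * rtt_s;  /* ideal window */
        double ceiling   = sndbuf_bytes / rtt_s;         /* max bytes/sec you can push */

        printf("BDP:            %.0f MB\n", bdp_bytes / 1e6);          /* ~125 MB */
        printf("Throughput cap: %.0f MB/s (~%.0f Mbit/s)\n",
               ceiling / 1e6, ceiling * 8 / 1e6);                      /* ~40 MB/s */
        return 0;
    }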
Unfortunately for end systems with a high BDP between them, most of the time the maximum send buffer size for a socket is capped by the system to something much lower than the BDP. This is a result of the socket implementation of these systems, not an inherent limitation of TCP.
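On Linux, for instance, you can ask for a big buffer with SO_SNDBUF, but the kernel silently clamps the request to net.core.wmem_max unless an admin raises that sysctl. A rough sketch of checking what you actually got (error handling omitted):

    /* Sketch: request a large TCP send buffer and see what the kernel grants.
     * On Linux the value is clamped by net.core.wmem_max (and internally doubled). */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        int requested = 64 * 1024 * 1024;   /* ask for 64 MB */
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &requested, sizeof(requested));

        int granted = 0;
        socklen_t len = sizeof(granted);
        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &granted, &len);

        /* On a default config this prints something far below 64 MB. */
        printf("requested %d bytes, got %d bytes\n", requested, granted);
        return 0;
    }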
An accepted user-level solution to this issue is to use multiple sockets in parallel, but that has its own issues, such as breaking fairness and not working well with the stream model. I can explain this more if you want, too, just let me know.
As for zero-copy with sendfile, well, even when you do zero-copy, the above issues still apply, because the socket buffer is still used. Admittedly my research into zero-copy is very cursory, but from what I understand, even when you use sendfile, it still copies from files into the TCP send buffer, so zero-copy is actually less zero than it seems. It just doesn't require the whole user space read-buffer-write-buffer loop, which does yield a noticeable performance increase, but that doesn't mean the buffer magically goes away and the BDP issue is solved.
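For reference, this is roughly what a sendfile() transfer looks like on Linux (hypothetical fds, error handling trimmed). It skips the user space copy loop, but the data is still queued through the socket's send buffer, so the buffer/BDP ceiling above still applies:

    /* Sketch: classic sendfile() file transfer over a TCP socket.
     * No user space read/write loop, but the data is still staged in the
     * kernel's TCP send buffer, so SO_SNDBUF still limits the window. */
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/sendfile.h>

    /* sock: connected TCP socket, file_fd: open regular file (both hypothetical). */
    int send_whole_file(int sock, int file_fd)
    {
        struct stat st;
        if (fstat(file_fd, &st) < 0)
            return -1;

        off_t offset = 0;
        while (offset < st.st_size) {
            ssize_t n = sendfile(sock, file_fd, &offset, st.st_size - offset);
            if (n <= 0)
                return -1;   /* real code would handle EINTR/EAGAIN */
        }
        return 0;
    }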
What I was suggesting, using an mmap'd file as the send buffer, would allow the TCP send buffer issue to be circumvented completely. In a user space protocol specialized for file transfers (not saying QUIC is, but I'm not talking about QUIC), you wouldn't need a send buffer at all, and the window size would become limited only by the BDP; the file contents are always available without needing to buffer them the way the socket API does, and the network never has to sit idle just because the send buffer is capped at 10 MB or something.
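As a toy illustration of what I mean (a sketch of a hypothetical user space UDP-based protocol, not anything QUIC actually does): map the file and hand byte ranges of the mapping straight to the socket, so the only "send buffer" is the page cache itself and retransmissions can grab any range at any time:

    /* Sketch: use an mmap'd file as the "send buffer" of a user space protocol.
     * Retransmit logic can re-send any byte range just by indexing into the
     * mapping; no separate per-socket send buffer is needed. */
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    /* udp_sock: connected UDP socket, file_fd: open file (both hypothetical). */
    int send_range(int udp_sock, int file_fd, off_t start, size_t len)
    {
        struct stat st;
        if (fstat(file_fd, &st) < 0)
            return -1;

        /* A real sender would keep this mapping around instead of mapping per call. */
        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, file_fd, 0);
        if (data == MAP_FAILED)
            return -1;

        /* A real protocol would prepend its own header and chunk this into
         * MTU-sized datagrams; this just shows where the bytes come from. */
        ssize_t n = send(udp_sock, data + start, len, 0);

        munmap(data, st.st_size);
        return n < 0 ? -1 : 0;
    }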
You mentioned QUIC needing to do the whole userland dance to let the application parse packets and all, and that is a valid point, but, in the case of reading, it's entirely possible to read in only the header data you need from the datagram and let the kernel take care of as much of the actual copying as it can for the rest of the packet. In the case of writing, you can "plug" the socket, write headers in user space, then zero-copy the file segment for added performance.
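By "plug" I'm thinking of something like Linux's UDP_CORK (or the MSG_MORE flag): hold the datagram open, write the protocol header from user space, splice the file payload in with sendfile(), then uncork so it goes out as one datagram. A rough sketch, assuming Linux's corking behaves as documented (names and error handling are illustrative only):

    /* Sketch: build one datagram from a user space header plus a zero-copied
     * file segment, using UDP_CORK to "plug" the socket until it's assembled. */
    #include <netinet/in.h>
    #include <netinet/udp.h>
    #include <sys/sendfile.h>
    #include <sys/socket.h>

    int send_packet(int udp_sock, int file_fd, const void *hdr, size_t hdr_len,
                    off_t file_off, size_t payload_len)
    {
        int on = 1, off = 0;

        setsockopt(udp_sock, IPPROTO_UDP, UDP_CORK, &on, sizeof(on));    /* plug */

        send(udp_sock, hdr, hdr_len, 0);                      /* header from user space */
        sendfile(udp_sock, file_fd, &file_off, payload_len);  /* payload, no user space copy */

        setsockopt(udp_sock, IPPROTO_UDP, UDP_CORK, &off, sizeof(off));  /* unplug: datagram goes out */
        return 0;
    }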
Certainly there is a performance hit, since it's still going into user space, but there wouldn't be if this were all part of the kernel. Like I was saying, I'm not so much excited over the fact that this is user space or anything, but more that it has the potential to exert pressure on the Linux community and such to finally make some broad and much-needed changes to the network stack. Even if QUIC isn't being designed to address the issues I'm mentioning, maybe a user space-configurable socket library with the backing of Google will make experimentation with internals normally obscured by the socket API more accessible to people, and something good will come out of that eventually.
Anyway, all of these issues stem from the fact that BSD sockets are meant to be a general purpose communications interface. However, some applications (such as file transfers) don't need all of the amenities general purpose sockets offer, such as separate kernel-allocated buffers so applications don't need to keep their own buffers around after calling write(), etc.
There are other problems with TCP, such as slow start being, well, slow to converge on high-BDP networks, bad performance in the face of random packet loss (e.g., TCP over Wi-Fi), congestion control algorithms being too conservative (IMO, not everyone needs to agree on the same congestion control protocol for it to work well, it just needs to converge to network conditions faster, better differentiate types of loss, and yield to fairness more quickly), TCP features such as selective ACKs not being widely used, default TCP socket settings sucking and requiring a lot of tuning to get right, crap with NAT that can't be circumvented at the user level (UDP-based stream protocols can do rendezvous connections to get around NAT), and more. People write whole papers on all these things, though, and I don't want to make this even more of a wall of stupid text than it already is.
Okay, you can call me out for not having citations or backing up my claims again now. Problem is, most of the public research exists as shitty academic papers you probably wouldn't bother reading anyway, and most of the people actually studying this stuff in-depth and coming up with solutions are private researchers and engineers working for companies like Google.
How would you differentiate types of loss? The main issue with TCP over large-BDP paths at the moment is not the limited buffer size, IMHO. You can increase it dramatically on modern Linux. You can maybe attempt to work around congestion mechanisms and slow start by using very aggressive algorithms. But I wonder how researchers deal with packet loss?