I generally say yes. The fixed timer is for the delayed ACK. That was a terrible idea. Both Linux and Windows now have a way to turn delayed ACKs off, but they're still on by default.
TCP_QUICKACK, which turns off delayed ACKs, is in Linux, but the manual page is very confused about what it actually does. Apparently it turns itself off after a while. I wish someone would get that right. I'd disable delayed ACKs by default. It's hard to think of a case today where they're a significant win. As I've written in the past, delayed ACKs were a hack to make remote Telnet character echo work better.
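To make the "turns itself off" behaviour concrete, here is a minimal Python sketch (my own illustration, not from the comment above): TCP_QUICKACK is Linux-only, and because the kernel clears it again, it has to be re-armed after every recv() if you want ACKs to stay immediate.

```python
import socket

def enable_quickack(sock: socket.socket) -> bool:
    """Re-arm TCP_QUICKACK on a TCP socket (Linux only, sketch).

    The kernel clears this flag again after a while, so it has to be
    re-set after every recv() to keep ACKs immediate."""
    if not hasattr(socket, "TCP_QUICKACK"):
        return False  # option not available on this platform
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
    return True

# Loopback demo: accept a connection and enable quick ACKs on it.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()
result = enable_quickack(conn)
for s in (cli, conn, srv):
    s.close()
print(result)
```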
A key point is asymmetry. If you're the one who's doing lots of little writes, you can either set TCP_NODELAY at your end, or turn off delayed ACKs at the other end. If you can. Things doing lots of little writes but not filling up the pipe, typically game clients, can't change the settings at the other end. So it became standard practice to do what you could do at your end.
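"Do what you can at your end" is one setsockopt() call. A hedged sketch (the helper name is mine): a client, say a game client, disables Nagle on its own socket because it cannot change delayed-ACK behaviour on the server.

```python
import socket

def connect_nodelay(host: str, port: int) -> socket.socket:
    """Connect and disable Nagle's algorithm on our side of the
    connection -- the one setting we actually control."""
    s = socket.create_connection((host, port))
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return s

# Loopback demo
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
host, port = srv.getsockname()
cli = connect_nodelay(host, port)
nodelay = cli.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
conn, _ = srv.accept()
for s in (cli, conn, srv):
    s.close()
print(nodelay)
```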
Some linux settings apply to every TCP connection which can be a very surprising result and hard to debug. I can see someone wanting to time out a feature, so it doesn't upset the rest of the system forever.
I liked the article, but it's not entirely clear to me what the cause of the problem was. Linux delays sending out data only if the data size is less than the packet size AND the previous packet is not yet ACKed (Nagle's algorithm). My guess is that this app is doing a write, write, read and hitting the delayed ACK problem.
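That rule can be written down as a toy model (this is an illustration of the condition described above, not the actual kernel code): a sub-MSS write is held back only while earlier data is un-ACKed, which is exactly what the write/write/read pattern trips over.

```python
MSS = 1448  # illustrative payload bytes per full-size segment

def nagle_holds_write(pending_bytes: int, unacked_data: bool,
                      nodelay: bool = False) -> bool:
    """Toy model of the sender-side rule: hold back a small write
    only while previously sent data is still un-ACKed."""
    if nodelay or pending_bytes >= MSS:
        return False
    return unacked_data

# write/write/read: the first small write goes out immediately, the
# second is held until the first is ACKed -- and if the peer delays
# that ACK, the two delays compound.
first_held = nagle_holds_write(100, unacked_data=False)
second_held = nagle_holds_write(100, unacked_data=True)
print(first_held, second_held)
```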
The problem with delayed ACKs is that turning them off requires controlling the server. If you only control the client, you can't remotely turn off delayed ACKs, so instead you have to disable Nagle's algorithm.
Not topic related, but John's move from networking to Autodesk got me interested. I don't immediately see the connection between the two, back when Autodesk was a startup.
Would love to see a blog post from John Nagle on this, now that we know the outcome of Autodesk.
So the kernel is using a static decision that's really bad sometimes? Would it be too expensive to treat this like a branch predictor and keep some state to have the kernel enable/disable the delayed ACK dynamically depending on how it has won/lost the bet recently?
I think that's what John Nagle's answer was suggesting:
> A delayed ACK is a bet. The TCP implementation is betting that data will be sent shortly and will make it unnecessary to send a lone ACK. Every time a delayed ACK is actually sent, that bet was lost. The TCP spec allows an implementation to lose that bet every time without turning off delayed ACKs. Properly, delayed ACKs should only turn on when a few unnecessary ACKs that could have been piggybacked have been sent in a row, and any time a delayed ACK is actually sent, delayed ACKs should be turned off again. There should have been a counter for this.
Yes, that's the suggested fix. My question is if treating this like a CPU branch predictor and keeping some state that needs to be updated is too expensive or not. The packet pipeline is very performance sensitive.
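As a rough answer to the cost question, the state in the quoted fix is tiny: one counter and one flag per connection, updated only when an ACK is generated, not per packet. A toy sketch of that predictor (all names are invented for illustration):

```python
class DelayedAckPredictor:
    """Toy sketch of the counter idea from the quote above: start
    delaying ACKs only after several ACKs in a row could have been
    piggybacked, and stop the moment a lone delayed ACK is sent."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streak = 0        # consecutive piggybacked ACKs (bets won)
        self.delaying = False

    def should_delay(self) -> bool:
        return self.delaying

    def ack_was_piggybacked(self) -> None:
        self.streak += 1
        if self.streak >= self.threshold:
            self.delaying = True

    def lone_ack_sent(self) -> None:
        """The bet was lost: turn delayed ACKs off again."""
        self.streak = 0
        self.delaying = False

p = DelayedAckPredictor()
for _ in range(3):
    p.ack_was_piggybacked()
after_wins = p.should_delay()   # delaying turned on after a streak
p.lone_ack_sent()
after_loss = p.should_delay()   # turned off again after one lost bet
print(after_wins, after_loss)
```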
It is amusing that you doubt that "John Nagle" on Stack Overflow is M. Nagle, but don't express any doubt that "Animats" here on this WWW site, and indeed this very discussion, is M. Nagle. Surely the reverse stance is the more logical, if one has no idea what "Animats" is.
It is odd that some people give more credence to a pseudonym, or in this case a company name, than to using one's own name.
Among people who would pretend to be a given well known person, surely more of them know that person's real name than something that would make a plausible pseudonym. Or in any case, more of them would choose it.
And if someone thinks a pseudonym is a particular famous person, they must have a reason, which is unlikely to be weaker than just assuming a normal name is accurate.
I'm not up to doing a Bayesian analysis right now, but I feel like one could show it makes more sense to doubt an unverified name.
Ohhhhh so true. I sadly have no such story to tell regarding performance optimization, but figuring out the intricacies of any complex system (for me at least) inevitably leads to you knowing arcane stuff that might come in handy some time. But on the other hand, it also - in my humble experience - leads to you knowing a lot of arcane stuff that might have an impact on a problem, but is absolutely not related in the specific case one is dealing with.
Knowing when to discard arcane knowledge and when to jump onto that train of thought imho is crucial.
But on the other hand debugging arcane stuff in complex systems is just so much fun. One learns so much.
People should know about that more so that they can learn lessons. Any "little tweak" to an otherwise simple and elegant spec increases complexity which generations will have to deal with. This complexity often times compounds exponentially. Just see the interaction between Nagle and delayed acks. Each on their own they sound like a cool idea, but the compounding complexity is what kills understanding.
Unfortunately, new generations don't learn the lesson. Modern web dev for example has so many layers of complexity and bloat, and they all interact, and you need to know all the layers intimately for any real understanding as the complexity explodes. It does not have to be that way, if only every layer had a clean, small abstraction. Then you don't need to know the details, only the small spec. But that does not work if everybody just adds a hack here and there which breaks some separation, but oh well.
There's nothing that makes me sadder in the world. How a lack of understanding can poison what could otherwise be in our reach, and moves it again further away from our understanding. It's almost as if we love "simple" because it's the perfect platform to start complicating. I mean, it's everywhere, not just IT systems. And part of it is unavoidable, but... when you start looking at it, it becomes impossible to unsee. So many absurd roadblocks, so much wasted potential.
But $LARGE_CLIENT told our sales guy that if we were to just implement $ARBITRARY_CHECKBOX_FEATURE in our advanced configuration they'd sign a 7 figure deal!
Uh, yeah, if you have millions of dollars lined up and you turn it down to maintain some sense of technical purity, I hope your revenue is in the billions.
But that's not how this works. Customers order a feature. They don't order how messed up your implementation of that feature is.
In other words, the agreement with the customer is the spec. But this discussion is about the implementation. Conflating these two is one of the major problems in our industry.
> John Heidemann. Performance Interactions Between P-HTTP and TCP Implementations. ACM Computer Communication Review. 27, 2 (Apr. 1997), 65–73.
> This document describes several performance problems resulting from interactions between implementations of persistent-HTTP (P-HTTP) and TCP. Two of these problems tie P-HTTP performance to TCP delayed-acknowledgments, thus adding up to 200ms to each P-HTTP transaction. A third results in multiple slow-starts per TCP connection. Unresolved, these problems result in P-HTTP transactions which are 14 times slower than standard HTTP and 20 times slower than potential P-HTTP over a 10 Mb/s Ethernet. We describe each problem and potential solutions. After implementing our solutions to two of the problems, we observe that P-HTTP performs better than HTTP on a local Ethernet. Although we observed these problems in specific implementations of HTTP and TCP (Apache-1.1b4 and SunOS 4.1.3, respectively), we believe that these problems occur more widely.
Solutions for efficient batching of HTTP headers + data without delays involve TCP_NODELAY, and MSG_MORE / SPLICE_F_MORE / TCP_CORK / TCP_NOPUSH. Possibly TCP_QUICKACK may come in handy. Same for any protocol really, but HTTP is the one where there tends to be a separate sendmsg() and sendfile() on Linux.
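A hedged sketch of the cork-style batching mentioned above (the helper name is mine): queue the headers and body as separate small writes, then release them together. It uses TCP_CORK where available (Linux); on other platforms it simply degrades to plain sends.

```python
import socket

def send_batched(sock: socket.socket, parts) -> None:
    """Queue several small writes and release them as full segments.
    Linux-only TCP_CORK; falls back to plain sends elsewhere."""
    cork = getattr(socket, "TCP_CORK", None)
    if cork is not None:
        sock.setsockopt(socket.IPPROTO_TCP, cork, 1)
    for part in parts:
        sock.sendall(part)
    if cork is not None:
        sock.setsockopt(socket.IPPROTO_TCP, cork, 0)  # uncork: flush now

# Loopback demo: headers and body leave as one coalesced burst.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()
send_batched(cli, [b"HTTP/1.1 200 OK\r\n\r\n", b"hello"])
cli.close()
received = b""
while True:
    chunk = conn.recv(4096)
    if not chunk:
        break
    received += chunk
conn.close()
srv.close()
print(len(received))
```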
This is exactly why the Socket API in WinRT has Nagle off by default. The old way of dealing with sockets was to treat them like buffered files, or to drive them from a keyboard (so that Nagle is useful). But newer socket programs seem to just make a full chunk of information, and send it at once. Those newer programs either turn off Nagle, or would be improved if they did.
So we bit the bullet, and decided to make Nagle off by default.
That's easy to say but the bad interaction happens when the algorithms operate at opposite ends of the connection, and you often don't control both ends.
The bottom line is, you need to understand the semantics of your application protocol to best know how to apply them.
I once had to debug the scaling performance of a MPI-based simulation algorithm on cheap linux machines with TCP. I finally collected a TCP trace and showed it to the local expert who said: "hmm, 250ms delay right there.. that's the TCP retransmit timer... you're flooding the ethernet switch with too many packets and the switch is dropping them. Enable <such and such a feature>."
Since then I've always kept various constants in human RAM because it helps root cause.
A good one (although it's really a constant that has to be determined per-system) is the amount of time it takes to look up data in local RAM vs. RAM that is attached to the other CPU socket (i.e., over the system bus); it's about 50% longer.
I typically remember all the TCP timing constants, the seek time of a hard drive, everything in "Programmer's rule of thumb" (how long it takes for data to travel by light from CA to NL, etc)
I have also run into this, but for me it was a periodic latency spike with steady but periodic messages. That latency spike went away when the messages were sent as fast as possible.
Similar to Nagle, there are reasons to combine packets on a session. Network equipment that fools with every packet can get backed up if the traffic packet count exceeds a limit. By Nagling (or doing something similar in your transmit code) you can increase your message rate through such bottlenecks.
Used to have a server cluster that used some 'hologram' style router on the receiving end, to spread load. It had a hard limit on # packets per second it could handle. I changed our app to combine sends (2ms timer, not 40ms!) and halved our total traffic packet count. Put off the day they had to buy more server-side hardware to handle the load.
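The "combine sends on a 2ms timer" trick above can be sketched as application-level batching (all names here are hypothetical, not from the actual app): messages are buffered and emitted as one send when either enough bytes accumulate or the short timer expires.

```python
from typing import Callable, List

class SendCoalescer:
    """Sketch of app-level batching: buffer small messages and emit
    them as one packet when either max_bytes accumulates or max_delay
    seconds have passed since the first queued message."""

    def __init__(self, send: Callable[[bytes], None],
                 max_bytes: int = 1200, max_delay: float = 0.002):
        self.send = send
        self.max_bytes = max_bytes
        self.max_delay = max_delay
        self.buf: List[bytes] = []
        self.first_queued = 0.0

    def queue(self, msg: bytes, now: float) -> None:
        if not self.buf:
            self.first_queued = now
        self.buf.append(msg)
        if sum(map(len, self.buf)) >= self.max_bytes:
            self.flush()

    def poll(self, now: float) -> None:
        """Call periodically; flushes once the 2 ms timer expires."""
        if self.buf and now - self.first_queued >= self.max_delay:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            self.send(b"".join(self.buf))
            self.buf.clear()

packets = []
c = SendCoalescer(packets.append, max_bytes=100, max_delay=0.002)
c.queue(b"a" * 60, now=0.0)
c.queue(b"b" * 60, now=0.001)   # 120 bytes >= 100: flushed as one send
c.queue(b"c" * 10, now=0.0015)
c.poll(now=0.004)               # 2 ms timer expired: flushed
print(len(packets))             # 2 sends instead of 3
```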
Btw if the clients are on wifi networks, then there's no point in aggregating sends past a pretty small size (512 bytes?) because wifi fragments (used to fragment?) packets to that smaller size over the air, and never reassembles them, leaving that to the target server.
> Stuff like this just proves that a large part of this job is just remembering a bunch of weird data points and knowing when to match this story to that problem.
I've hit Nagle far in the past, and reading the title I thought 'well that can't be about Nagle because that was a 200ms delay'
Looks like someone tuned it down to 40ms but didn't dare remove it. It would be interesting to know how they came to that choice.
Why not just use tcpdump or wireshark when troubleshooting network latencies? Usually only takes a minute or two to pinpoint the issue. Then you would need to spend time understanding why the pinpointed behavior is what it is and sometimes it is in the application, sometimes not.. I've solved so many issues over the years with tcpdump that it has become one of the most valuable tools I know.
If you're debugging weird latency problems, I'd recommend to start from relative timestamps. Then filter to just one or two TCP streams that exhibit the problem and go from there.
Inspect packet contents. The packet dissectors are doing a lot of heavy lifting, so look at the deconstructed data. I realise this may not be a welcome suggestion, but a sharp pencil, an A3 (or larger) scratch pad and a good ruler go a long way.
I've dug to the bottom of quite a few network and traffic problems in my life by drawing the observed traffic patterns into sequence diagrams. Once you have the diagram visualised, it's easier to spot the places where something funky either happens or looks to be missing.
No kidding. I imagined Google or DDG would turn "multiple messages in a single TCP package" into something useful, since it's basically the description of the algorithm. But no luck. The best I got was somebody with an IO buffering problem on Stack Overflow who commented that he'd turned the algorithm off.
Well, as soon as you'd look for and find any one of your messages in wireshark to use as an example, then you'd notice that the packet has not only that message but others as well.
I remember first learning of Nagle's algorithm back in the early WoW days in my endless quest to get lower latency for PvP on my neighbor's cracked WEP. I don't really know if it matters much in 2020, but I still habitually run the *.reg file to disable it on every new windows install.
It explains how to avoid the 40ms delay and still batch data where possible for maximum efficiency. The key part is that you can toggle the TCP options during the lifetime of the connection to force flushes.
By using sendmsg() with MSG_MORE instead of write(), you can avoid the setsockopt() with TCP_CORK to cork before the write, and the later setsockopt() with TCP_NODELAY to push despite the cork.
You can't give MSG_MORE to sendfile(), though with HTTP you don't need to. But if you need an equivalent of MSG_MORE with sendfile() you can in theory use SPLICE_F_MORE with splice() instead.
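A small sketch of the MSG_MORE pattern (helper name is mine): every chunk except the last is sent with the flag, so no setsockopt() calls are needed to cork and uncork. MSG_MORE is Linux-only; on other platforms this degrades to plain sends.

```python
import socket

def send_chunk(sock: socket.socket, chunk: bytes, more: bool) -> int:
    """Send one chunk; MSG_MORE on all but the final chunk tells the
    kernel to hold the data and coalesce it with what follows."""
    flags = socket.MSG_MORE if (more and hasattr(socket, "MSG_MORE")) else 0
    return sock.sendmsg([chunk], [], flags)

# Loopback demo: headers held back, final chunk pushes everything out.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()
send_chunk(cli, b"header: value\r\n\r\n", more=True)
send_chunk(cli, b"body", more=False)   # final chunk: flush
cli.close()
data = b""
while True:
    part = conn.recv(4096)
    if not part:
        break
    data += part
conn.close()
srv.close()
print(data)
```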
Ah, I see, you were referring to saving `setsockopt()`, not data-carrying syscalls.
Yes that makes sense. But I guess that in most cases where you have control over the `sendmsg()`'s calls flags, you'd also have control over its buffer, so you may be able to build the buffer in userspace in many situations, thus even saving multiple data-carrying syscalls.
The `setsockopt()` approach has the benefit that it works even when you have no control over the sending syscalls, e.g. when some library does it for you that you cannot modify or configure.
MSG_MORE comes in useful for these examples, where you can't use writev() alone, but do control the sending syscalls:
- HTTP (unencrypted) serving static files or cache files, to combine sendmsg() for the headers followed by sendfile() for the body. You can't batch using writev() in that case, if you want the benefit of sendfile().
- Transmitting a stream of data that is being forwarded or generated. For example a HTTPS reverse proxy which forwards incoming unencrypted data and formats it into TLS progressively. It can't buffer the whole response as that would add too much delay, so it can send using sendmsg() with MSG_MORE until it reaches the end of the forwarded data.
To be fair, this can be fixed with well-designed libraries that don't rely on TCP doing the job for them of merging buffers and preventing small writes.
The issue is that the vast majority of libraries treat the problem as if it did not exist, prefer to not get their hands dirty, and just conveniently write a stream of data to the socket, leaving it to the user to correctly configure options on the socket.
But yes, in general, performance is at least in significant part about remembering a huge amount of trivia.
Well, yes. The point of TCP is that it's an opaque reliable linear-stream abstraction. If you're not treating it like an opaque reliable linear-stream abstraction, you shouldn't be using TCP. If you want to manage your own datagrams, use a datagram transport. (Not necessarily UDP. I'd suggest SCTP, personally. Or maybe QUIC.)
That's naive at best. You say that if I wanted to write a good HTTP client, I should decide not to use TCP, even though TCP is meant to relieve the user from managing datagrams?
The reality is, when you design a client library you work with what you have, and in the case of writing an HTTP client, what you have is TCP on one side and a user application on the other side that expects the maximum performance and minimum latency possible (just look at the benchmarks on the net that compare tens of frameworks in a competition for who can produce the largest TPS of hello world with the lowest latency).
HTTP is specifically the case where Nagle will probably hit most often.
If you treat TCP just like a file stream (it's a stream after all) and you're implementing, say, a webserver, the most straightforward way to implement a response is to:
- figure out what was requested
- build headers
- send headers
- send the file body
especially if you intend to send the file body using sendfile. But this pattern is broken by Nagle - because there's no return traffic between the two sends in HTTP and the headers often won't fill out a full TCP packet, you'll trigger the wait.
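The straightforward pattern from the steps above looks like this (a sketch, with an invented helper name): one send for the headers, then sendfile() for the body, with no read in between.

```python
import os
import socket
import tempfile

def serve_file(conn: socket.socket, header: bytes, path: str) -> None:
    """Send a sub-MSS header, then the body via the socket's
    sendfile() helper -- the back-to-back send pattern at issue."""
    conn.sendall(header)
    with open(path, "rb") as f:
        conn.sendfile(f)

# Loopback demo with a small temp file standing in for a static file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"<html>hi</html>")
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()
serve_file(conn, b"HTTP/1.1 200 OK\r\n\r\n", tmp.name)
conn.close()
data = b""
while True:
    part = cli.recv(4096)
    if not part:
        break
    data += part
cli.close()
srv.close()
os.unlink(tmp.name)
print(data)
```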
In your example sending the file body will fill the remainder of the packet containing the headers and trigger a packet send. The client waiting for your response isn't going to lose out. In fact, they'll benefit from having file body data in that first packet.
Your example is when you want Nagle's algorithm to apply. This exact use case is mentioned in tcp(7) under the TCP_CORK option, which is like a more explicit, application-controlled version of Nagle.
Nagle's algorithm sucks for HTTP clients who happen to send big HTTP requests. If the HTTP headers being sent are >1400 bytes or so, then the tail end of the request will wait 1 RTT for the ACK of the first part. This will be particularly bad if the HTTP server has delayed ACKs enabled (TCP_QUICKACK is not on). The server's TCP stack has no idea that the HTTP server application hasn't yet got a full request and can't yet respond, so it will delay the ACK of the first 1400 request bytes in the hope that it will.
The delayed ACK delay is probably small compared to the RTT, so Nagle still dominates
True, but also kind of irrelevant, in that HTTP1.x is a protocol designed under the constraint that there are going to be an unbounded number of application-level proxies/gateways between the client and the server, where some of them might be under the client’s control (e.g. Tor), some under the server’s (e.g. Varnish), and some neither (e.g. a corporate or ISP WAF.)
And because of that, modern web servers don’t really do NODELAY; rather the opposite — they fully buffer HTTP requests and responses into what are essentially single giant datagrams.
Put your Express app behind Nginx, and Nginx will buffer the entire request to feed it to your app as one TCP jumbo-frame (if possible); and then buffer your response to feed it back down the line to the client as one TCP jumbo-frame, or a burst of such packets. This happens not just with HEAD responses, but with full-bodied GETs/POSTs/PUTs, too.
The only time modern web servers attempt to break your buffer up and schedule sends at different times, is when you tell them that you’re doing a `Transfer-Encoding: Chunked` response; or when your upstream is HTTP/2 and sends down an HTTP/2 stream non-terminal body-frame; or when your upstream is speaking websockets or WebRTC. Otherwise you get this “hyper-Nagling.”
(Why do they do it? Because these servers want to achieve sub-worker-thread concurrency, and system calls like write(2) are blocking, getting in the way of that worker doing anything else. So web-server worker threads want to do as few system calls as possible, which means batching up your writes so they can do one context-switch for one big write(2) or writev(2). For these servers, sendfile(2) can actually be an anti-pattern, compared to mmap(2)ing the relevant static files so that they can just be passed as a pointer to writev(2) along with the headers buffer. That’s one blocking context-switch rather than two!)
The solution I have implemented in my case is to move the problem into the application by designing the writes a little bit better. Instead of writing into a dumb stream, the application writes buffers of data or a stream of data, but each time indicating if there is more data that will immediately follow.
In any case, TCP_NODELAY is enabled and all writes are immediately flushed unless the application clearly indicates it has a queue of data that will immediately follow.
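A minimal sketch of that design (the class and its API are hypothetical, not the actual codebase): the socket runs with TCP_NODELAY, and each write declares whether more data follows immediately, so flushing is explicit and Nagle is never needed.

```python
import socket

class BurstWriter:
    """Writes buffer in userspace until the caller signals the end of
    a burst; only then does one sendall() hit the TCP_NODELAY socket."""

    def __init__(self, sock: socket.socket):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        self.sock = sock
        self.pending = bytearray()

    def write(self, data: bytes, more: bool = False) -> None:
        self.pending += data
        if not more:
            self.sock.sendall(bytes(self.pending))  # flushed immediately
            self.pending.clear()

# Loopback demo: two logical writes, one syscall, no Nagle delay.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()
w = BurstWriter(cli)
w.write(b"part1,", more=True)   # buffered in userspace
w.write(b"part2", more=False)   # end of burst: sent at once
cli.close()
data = b""
while True:
    part = conn.recv(4096)
    if not part:
        break
    data += part
conn.close()
srv.close()
print(data)
```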
If your library is already merging buffers and preventing small writes then you would still want to set TCP_NODELAY, to eliminate delays due to send/send/recv patterns where merged buffers are less than the MSS... because you know for sure that you're already doing all you can, and Nagle's algorithm can't help further except introduce delay.
Something is not right with this. The blog post has a link to https://bugs.ruby-lang.org/issues/8681, which made the change to disable Nagle's algorithm since ruby 2.1.0 released in 2013.
World of Warcraft had Nagle's algorithm enabled for YEARS. That's one reason that VPN services were so popular and could cut 50-100ms off your ping time, especially if you were playing from Oceania.
This isn't explicitly related, but interesting, so I offer it up here.
When I read 40ms, it triggered a memory from tracking down a different 40ms latency bug a few years ago. I work on the Netflix app for set top boxes, and a particular pay TV company had a box based on AOSP L. Testing discovered that after enough times switching the app between foreground and background, playback would start to stutter. The vendor doing the integration blamed Netflix - they showed that in the stutter case, the Netflix app was not feeding video data quickly enough for playback. They stopped their analysis at this point, since as far as they were concerned, they had found the issue and we had to fix the Netflix app.
I doubted the app was the issue, as it ran on millions of other devices without showing this behavior. I instrumented the code and measured 40ms of extra delay from the thread scheduler. The 40ms was there, and was outside of our app's context. Literally, I measured it between the return of the thread handler and the next time the handler was called. So I responded, to paraphrase, it's not us, it's you. Your Android scheduler is broken.
But the onus was on me to prove it by finding the bug. I read the Android code, and learned Android threads are a userspace construct - the Android scheduler uses epoll() as a timer and calls your thread handler based on priority level. I thought, epoll() performance isn't guaranteed, maybe something obscure changed, and this change is adding an additional 40ms in this particular case. So I dove into the kernel, thinking the issue must be somewhere inside epoll().
Lucky for me, another engineer, working for a different vendor on the project, found the smoking gun in this patch in Android M (the next version). It was right there, an extra 40ms explicitly (and mistakenly) added when a thread is created while the app is in the background.
Fix janky navbar ripples -- incorrect timerslack values
If a thread is created while the parent thread is "Background",
then the default timerslack value gets set to the current
timerslack value of the parent (40ms). The default value is
used when transitioning to "Foreground" -- so the effect is that
the timerslack value becomes 40ms regardless of foreground/background.
This does occur intermittently for systemui when creating its
render thread (pretty often on hammerhead and has been seen on
shamu). If this occurs, then some systemui animations like navbar
ripples can wait for up to 40ms to draw a frame when they intended
to wait 3ms -- jank.
This fix is to explicitly set the foreground timerslack to 50us.
A consequence of setting timerslack behind the process' back is
that any custom values for timerslack get lost whenever the thread
has transition between fg/bg.
--- a/libcutils/sched_policy.c
+++ b/libcutils/sched_policy.c
@@ -50,6 +50,7 @@
// timer slack value in nS enforced when the thread moves to background
#define TIMER_SLACK_BG 40000000
+#define TIMER_SLACK_FG 50000
static pthread_once_t the_once = PTHREAD_ONCE_INIT;
@@ -356,7 +357,8 @@
param);
}
- prctl(PR_SET_TIMERSLACK_PID, policy == SP_BACKGROUND ? TIMER_SLACK_BG : 0, tid);
+ prctl(PR_SET_TIMERSLACK_PID,
+ policy == SP_BACKGROUND ? TIMER_SLACK_BG : TIMER_SLACK_FG, tid);
return 0;
> But the onus was on me to prove it by finding the bug.
As a network engineer who has had to prove numerous times that "it's not the firewall" or "it's not the network", you have my deepest sympathies.
--
Story time:
My most memorable instance of this was when I worked at a .edu and was going through yet another iteration of the back-and-forth, "it's not on our side, must be on your side", "no, it's not us, it's your firewall", finger-pointing blame game.
This particular issue had been dragging on for a while and, eventually, I got a packet capture from their side (packet captures are, to me, the ultimate source of truth, "the last word"). Yet, even after showing (and explaining) it to them, they continued to insist that their firewall was absolutely, positively, 100% configured properly and, thus, it had to be my firewall which was not properly configured.
Finally, I called in a favor from a (Ph.D.) faculty member who "owed me one". After explaining what was going on and, more importantly, showing him the timestamps, he agreed to help me out. He wrote up a wonderful e-mail (which I still have a copy of, somewhere) stating how it just was not possible, literally, that mine was the "offending firewall". He explained how he was able to conclude this with absolute certainty, thanks to the laws of physics and politely suggested that -- unless they had evidence that Checkpoint was able to somehow violate those laws of physics (in which case he would be extremely interested in their proof) -- that perhaps they might wanna take another look at their firewall?
I waited anxiously, curious what the response (and latest excuse) would be. A few days passed before, finally, an e-mail appeared in my inbox. I quickly opened it and read the one-line reply:
If TCP had a header flag to indicate that the next segment was <MSS in size, then the receiver could be a lot smarter about whether it delayed the ACK.
I was dealing with "Nagle's delay" only yesterday, adding "setsockopt(TCP_NODELAY)" for the alpha version of a new high-performance payments database called TigerBeetle: https://github.com/coilhq/tiger-beetle
Comments like this are just spam and don’t contribute anything to the discussion. If you have some interesting insights, share them, but don’t shift the discussion to your own completely unrelated project.
I appreciate your positive intention of watching out for the quality of the threads. But please don't do it this way. Instead, please follow the site guidelines and be kind.
HN threads are conversations, and it's natural in a conversation for someone to connect a topic to something they were working on recently. That's not shifting discussion to something unrelated. I can see how one could interpret it just as an excuse for spamming one's project, but such an interpretation is just a guess, and if you apply the site guideline "Assume good faith", you'd guess the opposite. It's clear from https://news.ycombinator.com/item?id=24801075 that the good-faith guess is the true one.
I ran into similar problems when doing performance testing to determine our new wireless router's limits, only to find out they were far too low.
> you can’t fix TCP problems without understanding TCP
True, but many problems are not TCP problems, and it is not always easy to determine where your delay is coming from.
> Stuff like this just proves that a large part of this job is just remembering a bunch of weird data points and knowing when to match this story to that problem.
Sounds like interns doing rounds with the chief resident.
> Stuff like this just proves that a large part of this job is just remembering a bunch of weird data points and knowing when to match this story to that problem.
No, the TCP_NODELAY setsockopt is not exposed by the browser.
If you only control the client then you could make your http/fetch requests as big as possible, i.e. do the Nagle algorithm yourself but at a higher level to avoid the delays, but I don't think there's much else you can do.
Then again, if you control the server you could just run HTTP/3 to be on QUIC, and then you're not on TCP any more; you would also save a lot of connection handshake latency and benefit from advances in bufferbloat-sensitive congestion control algorithms much quicker.
To be fair, I would be more concerned about bufferbloat delays on small shared networks than delays due to Nagle's algorithm. Bufferbloat delays can be up to tens of seconds, and you can trigger them disturbingly easily, just have someone else in the office upload a large attachment to an email in Gmail's web client to saturate the router's send buffer, while you watch your ping times.