I generally say yes. The fixed timer is for the delayed ACK. That was a terrible idea. Both Linux and Windows now have a way to turn delayed ACKs off, but they're still on by default.
TCP_QUICKACK, which turns off delayed ACKs, is in Linux, but the manual page is very confused about what it actually does. Apparently it turns itself off after a while. I wish someone would get that right. I'd disable delayed ACKs by default. It's hard to think of a case today where they're a significant win. As I've written in the past, delayed ACKs were a hack to make remote Telnet character echo work better.
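To make the "turns itself off" behaviour concrete, here is a minimal Python sketch (my own illustration, not from the comment above): TCP_QUICKACK is Linux-only, and because the kernel clears it again, it has to be re-armed after every recv() if you want ACKs to stay immediate.

```python
import socket

def enable_quickack(sock: socket.socket) -> bool:
    """Re-arm TCP_QUICKACK on a TCP socket (Linux only, sketch).

    The kernel clears this flag again after a while, so it has to be
    re-set after every recv() to keep ACKs immediate."""
    if not hasattr(socket, "TCP_QUICKACK"):
        return False  # option not available on this platform
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
    return True

# Loopback demo: accept a connection and enable quick ACKs on it.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()
result = enable_quickack(conn)
for s in (cli, conn, srv):
    s.close()
print(result)
```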
A key point is asymmetry. If you're the one who's doing lots of little writes, you can either set TCP_NODELAY at your end, or turn off delayed ACKs at the other end. If you can. Things doing lots of little writes but not filling up the pipe, typically game clients, can't change the settings at the other end. So it became standard practice to do what you could do at your end.
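"Do what you can at your end" is one setsockopt() call. A hedged sketch (the helper name is mine): a client, say a game client, disables Nagle on its own socket because it cannot change delayed-ACK behaviour on the server.

```python
import socket

def connect_nodelay(host: str, port: int) -> socket.socket:
    """Connect and disable Nagle's algorithm on our side of the
    connection -- the one setting we actually control."""
    s = socket.create_connection((host, port))
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return s

# Loopback demo
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
host, port = srv.getsockname()
cli = connect_nodelay(host, port)
nodelay = cli.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
conn, _ = srv.accept()
for s in (cli, conn, srv):
    s.close()
print(nodelay)
```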
Some linux settings apply to every TCP connection which can be a very surprising result and hard to debug. I can see someone wanting to time out a feature, so it doesn't upset the rest of the system forever.
I liked the article, but it's not entirely clear to me what the cause of the problem was. Linux delays sending out data only if the data size is less than the packet size AND the previous packet is not yet ACKed (Nagle's algorithm). My guess is that this app is doing a write, write, read and hitting the delayed ACK problem.
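That rule can be written down as a toy model (this is an illustration of the condition described above, not the actual kernel code): a sub-MSS write is held back only while earlier data is un-ACKed, which is exactly what the write/write/read pattern trips over.

```python
MSS = 1448  # illustrative payload bytes per full-size segment

def nagle_holds_write(pending_bytes: int, unacked_data: bool,
                      nodelay: bool = False) -> bool:
    """Toy model of the sender-side rule: hold back a small write
    only while previously sent data is still un-ACKed."""
    if nodelay or pending_bytes >= MSS:
        return False
    return unacked_data

# write/write/read: the first small write goes out immediately, the
# second is held until the first is ACKed -- and if the peer delays
# that ACK, the two delays compound.
first_held = nagle_holds_write(100, unacked_data=False)
second_held = nagle_holds_write(100, unacked_data=True)
print(first_held, second_held)
```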
The problem with delayed ACKs is that turning them off requires controlling the server. If you only control the client, you can't remotely turn off delayed ACKs, so instead you have to disable Nagle's algorithm.
Not topic related, but John's move from networking to Autodesk got me interested. I don't immediately see the connection between the two, back when Autodesk was a startup.
Would love to see a blog post from John Nagle on this, now that we know the outcome of Autodesk.
So the kernel is using a static decision that's really bad sometimes? Would it be too expensive to treat this like a branch predictor and keep some state to have the kernel enable/disable the delayed ACK dynamically depending on how it has won/lost the bet recently?
I think that's what John Nagle's answer was suggesting:
> A delayed ACK is a bet. The TCP implementation is betting that data will be sent shortly and will make it unnecessary to send a lone ACK. Every time a delayed ACK is actually sent, that bet was lost. The TCP spec allows an implementation to lose that bet every time without turning off delayed ACKs. Properly, delayed ACKs should only turn on when a few unnecessary ACKs that could have been piggybacked have been sent in a row, and any time a delayed ACK is actually sent, delayed ACKs should be turned off again. There should have been a counter for this.
Yes, that's the suggested fix. My question is if treating this like a CPU branch predictor and keeping some state that needs to be updated is too expensive or not. The packet pipeline is very performance sensitive.
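As a rough answer to the cost question, the state in the quoted fix is tiny: one counter and one flag per connection, updated only when an ACK is generated, not per packet. A toy sketch of that predictor (all names are invented for illustration):

```python
class DelayedAckPredictor:
    """Toy sketch of the counter idea from the quote above: start
    delaying ACKs only after several ACKs in a row could have been
    piggybacked, and stop the moment a lone delayed ACK is sent."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streak = 0        # consecutive piggybacked ACKs (bets won)
        self.delaying = False

    def should_delay(self) -> bool:
        return self.delaying

    def ack_was_piggybacked(self) -> None:
        self.streak += 1
        if self.streak >= self.threshold:
            self.delaying = True

    def lone_ack_sent(self) -> None:
        """The bet was lost: turn delayed ACKs off again."""
        self.streak = 0
        self.delaying = False

p = DelayedAckPredictor()
for _ in range(3):
    p.ack_was_piggybacked()
after_wins = p.should_delay()   # delaying turned on after a streak
p.lone_ack_sent()
after_loss = p.should_delay()   # turned off again after one lost bet
print(after_wins, after_loss)
```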
It is amusing that you doubt that "John Nagle" on Stack Overflow is M. Nagle, but don't express any doubt that "Animats" here on this WWW site, and indeed this very discussion, is M. Nagle. Surely the reverse stance is the more logical, if one has no idea what "Animats" is.
It is odd that some people give more credence to a pseudonym, or in this case a company name, than to using one's own name.
Among people who would pretend to be a given well known person, surely more of them know that person's real name than something that would make a plausible pseudonym. Or in any case, more of them would choose it.
And if someone thinks a pseudonym is a particular famous person, they must have a reason, which is unlikely to be weaker than just assuming a normal name is accurate.
I'm not up to doing a Bayesian analysis right now, but I feel like one could show it makes more sense to doubt an unverified name.
Ohhhhh so true. I sadly have no such story to tell regarding performance optimization, but figuring out the intricacies of any complex system (for me at least) inevitably leads to you knowing arcane stuff that might come in handy some time. But on the other hand, it also - in my humble experience - leads to you knowing a lot of arcane stuff that might have an impact on a problem, but is absolutely not related in the specific case one is dealing with.
Knowing when to discard arcane knowledge and when to jump onto that train of thought imho is crucial.
But on the other hand debugging arcane stuff in complex systems is just so much fun. One learns so much.
People should know about that more so that they can learn lessons. Any "little tweak" to an otherwise simple and elegant spec increases complexity which generations will have to deal with. This complexity often times compounds exponentially. Just see the interaction between Nagle and delayed acks. Each on their own they sound like a cool idea, but the compounding complexity is what kills understanding.
Unfortunately, new generations don't learn the lesson. Modern web dev for example has so many layers of complexity and bloat, and they all interact, and you need to know all the layers intimately for any real understanding as the complexity explodes. It does not have to be that way, if only every layer had a clean, small abstraction. Then you don't need to know the details, only the small spec. But that does not work if everybody just adds a hack here and there which breaks some separation, but oh well.
There's nothing that makes me sadder in the world. How a lack of understanding can poison what could otherwise be in our reach, and moves it again further away from our understanding. It's almost as if we love "simple" because it's the perfect platform to start complicating. I mean, it's everywhere, not just IT systems. And part of it is unavoidable, but... when you start looking at it, it becomes impossible to unsee. So many absurd roadblocks, so much wasted potential.
But $LARGE_CLIENT told our sales guy that if we were to just implement $ARBITRARY_CHECKBOX_FEATURE in our advanced configuration they'd sign a 7 figure deal!
Uh, yeah, if you have millions of dollars lined up and you turn it down to maintain some sense of technical purity, I hope your revenue is in the billions.
But that's not how this works. Customers order a feature. They don't order how messed up your implementation of that feature is.
In other words, the agreement with the customer is the spec. But this discussion is about the implementation. Conflating these two is one of the major problems in our industry.
> John Heidemann. Performance Interactions Between P-HTTP and TCP Implementations. ACM Computer Communication Review. 27, 2 (Apr. 1997), 65–73.
> This document describes several performance problems resulting from interactions between implementations of persistent-HTTP (P-HTTP) and TCP. Two of these problems tie P-HTTP performance to TCP delayed-acknowledgments, thus adding up to 200ms to each P-HTTP transaction. A third results in multiple slow-starts per TCP connection. Unresolved, these problems result in P-HTTP transactions which are 14 times slower than standard HTTP and 20 times slower than potential P-HTTP over a 10 Mb/s Ethernet. We describe each problem and potential solutions. After implementing our solutions to two of the problems, we observe that P-HTTP performs better than HTTP on a local Ethernet. Although we observed these problems in specific implementations of HTTP and TCP (Apache-1.1b4 and SunOS 4.1.3, respectively), we believe that these problems occur more widely.
Solutions for efficient batching of HTTP headers + data without delays involve TCP_NODELAY, and MSG_MORE / SPLICE_F_MORE / TCP_CORK / TCP_NOPUSH. Possibly TCP_QUICKACK may come in handy. Same for any protocol really, but HTTP is the one where there tends to be a separate sendmsg() and sendfile() on Linux.
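A hedged sketch of the cork-style batching mentioned above (the helper name is mine): queue the headers and body as separate small writes, then release them together. It uses TCP_CORK where available (Linux); on other platforms it simply degrades to plain sends.

```python
import socket

def send_batched(sock: socket.socket, parts) -> None:
    """Queue several small writes and release them as full segments.
    Linux-only TCP_CORK; falls back to plain sends elsewhere."""
    cork = getattr(socket, "TCP_CORK", None)
    if cork is not None:
        sock.setsockopt(socket.IPPROTO_TCP, cork, 1)
    for part in parts:
        sock.sendall(part)
    if cork is not None:
        sock.setsockopt(socket.IPPROTO_TCP, cork, 0)  # uncork: flush now

# Loopback demo: headers and body leave as one coalesced burst.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()
send_batched(cli, [b"HTTP/1.1 200 OK\r\n\r\n", b"hello"])
cli.close()
received = b""
while True:
    chunk = conn.recv(4096)
    if not chunk:
        break
    received += chunk
conn.close()
srv.close()
print(len(received))
```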
This is exactly why the Socket API in WinRT has Nagle off by default. The old way of dealing with sockets was to treat them like buffered files, or to drive them from a keyboard (so that Nagle is useful). But newer socket programs seem to just make a full chunk of information, and send it at once. Those newer programs either turn off Nagle, or would be improved if they did.
So we bit the bullet, and decided to make Nagle off by default.
That's easy to say but the bad interaction happens when the algorithms operate at opposite ends of the connection, and you often don't control both ends.
The bottom line is, you need to understand the semantics of your application protocol to best know how to apply them.
I once had to debug the scaling performance of a MPI-based simulation algorithm on cheap linux machines with TCP. I finally collected a TCP trace and showed it to the local expert who said: "hmm, 250ms delay right there.. that's the TCP retransmit timer... you're flooding the ethernet switch with too many packets and the switch is dropping them. Enable <such and such a feature>."
Since then I've always kept various constants in human RAM because it helps root cause.
A good one (although it's really a constant that has to be determined per-system) is the amount of time it takes to look up data in local RAM vs. RAM that is attached to the other CPU socket (i.e., over the system bus); it's about 50% longer.
I typically remember all the TCP timing constants, the seek time of a hard drive, everything in "Programmer's rule of thumb" (how long it takes for data to travel by light from CA to NL, etc)
I have also run into this, but for me it was a periodic latency spike with steady but periodic messages. That latency spike went away when the messages were sent as fast as possible.
Similar to Nagle, there are reasons to combine packets on a session. Network equipment that fools with every packet can get backed up if the traffic packet count exceeds a limit. By Nagling (or doing something similar in your transmit code) you can increase your message rate through such bottlenecks.
Used to have a server cluster that used some 'hologram' style router on the receiving end, to spread load. It had a hard limit on # packets per second it could handle. I changed our app to combine sends (2ms timer, not 40ms!) and halved our total traffic packet count. Put off the day they had to buy more server-side hardware to handle the load.
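The "combine sends on a 2ms timer" trick above can be sketched as application-level batching (all names here are hypothetical, not from the actual app): messages are buffered and emitted as one send when either enough bytes accumulate or the short timer expires.

```python
from typing import Callable, List

class SendCoalescer:
    """Sketch of app-level batching: buffer small messages and emit
    them as one packet when either max_bytes accumulates or max_delay
    seconds have passed since the first queued message."""

    def __init__(self, send: Callable[[bytes], None],
                 max_bytes: int = 1200, max_delay: float = 0.002):
        self.send = send
        self.max_bytes = max_bytes
        self.max_delay = max_delay
        self.buf: List[bytes] = []
        self.first_queued = 0.0

    def queue(self, msg: bytes, now: float) -> None:
        if not self.buf:
            self.first_queued = now
        self.buf.append(msg)
        if sum(map(len, self.buf)) >= self.max_bytes:
            self.flush()

    def poll(self, now: float) -> None:
        """Call periodically; flushes once the 2 ms timer expires."""
        if self.buf and now - self.first_queued >= self.max_delay:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            self.send(b"".join(self.buf))
            self.buf.clear()

packets = []
c = SendCoalescer(packets.append, max_bytes=100, max_delay=0.002)
c.queue(b"a" * 60, now=0.0)
c.queue(b"b" * 60, now=0.001)   # 120 bytes >= 100: flushed as one send
c.queue(b"c" * 10, now=0.0015)
c.poll(now=0.004)               # 2 ms timer expired: flushed
print(len(packets))             # 2 sends instead of 3
```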
Btw if the clients are on wifi networks, then there's no point in aggregating sends past a pretty small size (512 bytes?) because wifi fragments (used to fragment?) packets to that smaller size over the air, and never reassembles them, leaving that to the target server.
> Stuff like this just proves that a large part of this job is just remembering a bunch of weird data points and knowing when to match this story to that problem.
I've hit Nagle far in the past, and reading the title I thought 'well that can't be about Nagle because that was a 200ms delay'
Looks like someone tuned it down to 40ms but didn't dare remove it. It would be interesting to know how they came to that choice.
Why not just use tcpdump or wireshark when troubleshooting network latencies? Usually only takes a minute or two to pinpoint the issue. Then you would need to spend time understanding why the pinpointed behavior is what it is and sometimes it is in the application, sometimes not.. I've solved so many issues over the years with tcpdump that it has become one of the most valuable tools I know.
If you're debugging weird latency problems, I'd recommend to start from relative timestamps. Then filter to just one or two TCP streams that exhibit the problem and go from there.
Inspect packet contents. The packet dissectors are doing a lot of heavy lifting, so look at the deconstructed data. I realise this may not be a welcome suggestion, but a sharp pencil, an A3 (or larger) scratch pad and a good ruler go a long way.
I've dug to the bottom of quite a few network and traffic problems in my life by drawing the observed traffic patterns into sequence diagrams. Once you have the diagram visualised, it's easier to spot the places where something funky either happens or looks to be missing.
No kidding. I imagined Google or DDG would turn "multiple messages in a single TCP package" into something useful, since it's basically the description of the algorithm. But no luck. The best I got was somebody with an IO buffering problem on Stack Overflow who commented that he'd turned the algorithm off.
Well, as soon as you'd look for and find any one of your messages in wireshark to use as an example, then you'd notice that the packet has not only that message but others as well.
I remember first learning of Nagle's algorithm back in the early WoW days in my endless quest to get lower latency for PvP on my neighbor's cracked WEP. I don't really know if it matters much in 2020, but I still habitually run the *.reg file to disable it on every new windows install.
It explains how to avoid the 40ms delay and still batch data where possible for maximum efficiency. The key part is that you can toggle the TCP options during the lifetime of the connection to force flushes.
By using sendmsg() with MSG_MORE instead of write(), you can avoid the setsockopt() with TCP_CORK to cork before the write, and the later setsockopt() with TCP_NODELAY to push despite the cork.
You can't give MSG_MORE to sendfile(), though with HTTP you don't need to. But if you need an equivalent of MSG_MORE with sendfile() you can in theory use SPLICE_F_MORE with splice() instead.
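A small sketch of the MSG_MORE pattern (helper name is mine): every chunk except the last is sent with the flag, so no setsockopt() calls are needed to cork and uncork. MSG_MORE is Linux-only; on other platforms this degrades to plain sends.

```python
import socket

def send_chunk(sock: socket.socket, chunk: bytes, more: bool) -> int:
    """Send one chunk; MSG_MORE on all but the final chunk tells the
    kernel to hold the data and coalesce it with what follows."""
    flags = socket.MSG_MORE if (more and hasattr(socket, "MSG_MORE")) else 0
    return sock.sendmsg([chunk], [], flags)

# Loopback demo: headers held back, final chunk pushes everything out.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()
send_chunk(cli, b"header: value\r\n\r\n", more=True)
send_chunk(cli, b"body", more=False)   # final chunk: flush
cli.close()
data = b""
while True:
    part = conn.recv(4096)
    if not part:
        break
    data += part
conn.close()
srv.close()
print(data)
```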
Ah, I see, you were referring to saving `setsockopt()`, not data-carrying syscalls.
Yes that makes sense. But I guess that in most cases where you have control over the `sendmsg()`'s calls flags, you'd also have control over its buffer, so you may be able to build the buffer in userspace in many situations, thus even saving multiple data-carrying syscalls.
The `setsockopt()` approach has the benefit that it works even when you have no control over the sending syscalls, e.g. when some library does it for you that you cannot modify or configure.
MSG_MORE comes in useful for these examples, where you can't use writev() alone, but do control the sending syscalls:
- HTTP (unencrypted) serving static files or cache files, to combine sendmsg() for the headers followed by sendfile() for the body. You can't batch using writev() in that case, if you want the benefit of sendfile().
- Transmitting a stream of data that is being forwarded or generated. For example a HTTPS reverse proxy which forwards incoming unencrypted data and formats it into TLS progressively. It can't buffer the whole response as that would add too much delay, so it can send using sendmsg() with MSG_MORE until it reaches the end of the forwarded data.
To be fair, this can be fixed with well-designed libraries that don't rely on TCP doing the job for them of merging buffers and preventing small writes.
The issue is that the vast majority of libraries treat the problem as if it did not exist, prefer to not get their hands dirty, and just conveniently write a stream of data to the socket, leaving it to the user to correctly configure options on the socket.
But yes, in general, performance is at least in significant part about remembering a huge amount of trivia.
Well, yes. The point of TCP is that it's an opaque reliable linear-stream abstraction. If you're not treating it like an opaque reliable linear-stream abstraction, you shouldn't be using TCP. If you want to manage your own datagrams, use a datagram transport. (Not necessarily UDP. I'd suggest SCTP, personally. Or maybe QUIC.)
That's naive at best. You say that if I wanted to write a good HTTP client, I should decide not to use TCP, even though TCP is meant to relieve the user from managing datagrams?
The reality is, when you design a client library you work with what you have, and in the case of writing an HTTP client, what you have is TCP on one side and a user application on the other side that expects the maximum performance and minimum latency possible (just look at the benchmarks on the net that compare tens of frameworks in a competition for who can produce the largest TPS of hello world with the lowest latency).
HTTP is specifically the case where Nagle will probably hit most often.
If you treat TCP just like a file stream (it's a stream after all) and you're implementing, say, a webserver, the most straightforward way to implement a response is to:
- figure out what was requested
- build headers
- send headers
- send the file body
especially if you intend to send the file body using sendfile. But this pattern is broken by Nagle - because there's no return traffic between the two sends in HTTP and the headers often won't fill out a full TCP packet, you'll trigger the wait.
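The straightforward pattern from the steps above looks like this (a sketch, with an invented helper name): one send for the headers, then sendfile() for the body, with no read in between.

```python
import os
import socket
import tempfile

def serve_file(conn: socket.socket, header: bytes, path: str) -> None:
    """Send a sub-MSS header, then the body via the socket's
    sendfile() helper -- the back-to-back send pattern at issue."""
    conn.sendall(header)
    with open(path, "rb") as f:
        conn.sendfile(f)

# Loopback demo with a small temp file standing in for a static file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"<html>hi</html>")
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()
serve_file(conn, b"HTTP/1.1 200 OK\r\n\r\n", tmp.name)
conn.close()
data = b""
while True:
    part = cli.recv(4096)
    if not part:
        break
    data += part
cli.close()
srv.close()
os.unlink(tmp.name)
print(data)
```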
In your example sending the file body will fill the remainder of the packet containing the headers and trigger a packet send. The client waiting for your response isn't going to lose out. In fact, they'll benefit from having file body data in that first packet.
Your example is when you want Nagle's algorithm to apply. This exact use case is mentioned in tcp(7) under the TCP_CORK option, which is like a more explicit, application-controlled version of Nagle.
Nagle's algorithm sucks for HTTP clients who happen to send big HTTP requests. If the HTTP headers being sent are >1400 bytes or so, then the tail end of the request will wait 1 RTT for the ACK of the first part. This will be particularly bad if the HTTP server has delayed ACKs enabled (TCP_QUICKACK is not on). The server's TCP stack has no idea that the HTTP server application hasn't yet got a full request and can't yet respond, so it will delay the ACK of the first 1400 request bytes in the hope that it will.
The delayed ACK delay is probably small compared to the RTT, so Nagle still dominates
True, but also kind of irrelevant, in that HTTP1.x is a protocol designed under the constraint that there are going to be an unbounded number of application-level proxies/gateways between the client and the server, where some of them might be under the client’s control (e.g. Tor), some under the server’s (e.g. Varnish), and some neither (e.g. a corporate or ISP WAF.)
And because of that, modern web servers don’t really do NODELAY; rather the opposite — they fully buffer HTTP requests and responses into what are essentially single giant datagrams.
Put your Express app behind Nginx, and Nginx will buffer the entire request to feed it to your app as one TCP jumbo-frame (if possible); and then buffer your response to feed it back down the line to the client as one TCP jumbo-frame, or a burst of such packets. This happens not just with HEAD responses, but with full-bodied GETs/POSTs/PUTs, too.
The only time modern web servers attempt to break your buffer up and schedule sends at different times, is when you tell them that you’re doing a `Transfer-Encoding: Chunked` response; or when your upstream is HTTP/2 and sends down an HTTP/2 stream non-terminal body-frame; or when your upstream is speaking websockets or WebRTC. Otherwise you get this “hyper-Nagling.”
(Why do they do it? Because these servers want to achieve sub-worker-thread concurrency, and system calls like write(2) are blocking, getting in the way of that worker doing anything else. So web-server worker threads want to do as few system calls as possible, which means batching up your writes so they can do one context-switch for one big write(2) or writev(2). For these servers, sendfile(2) can actually be an anti-pattern, compared to mmap(2)ing the relevant static files so that they can just be passed as a pointer to writev(2) along with the headers buffer. That’s one blocking context-switch rather than two!)
The solution I have implemented in my case is to move the problem into the application by designing the writes a little bit better. Instead of writing into a dumb stream, the application writes buffers of data or a stream of data, but each time indicating if there is more data that will immediately follow.
In any case, TCP_NODELAY is enabled and all writes are immediately flushed unless the application clearly indicates it has a queue of data that will immediately follow.
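A minimal sketch of that design (the class and its API are hypothetical, not the actual codebase): the socket runs with TCP_NODELAY, and each write declares whether more data follows immediately, so flushing is explicit and Nagle is never needed.

```python
import socket

class BurstWriter:
    """Writes buffer in userspace until the caller signals the end of
    a burst; only then does one sendall() hit the TCP_NODELAY socket."""

    def __init__(self, sock: socket.socket):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        self.sock = sock
        self.pending = bytearray()

    def write(self, data: bytes, more: bool = False) -> None:
        self.pending += data
        if not more:
            self.sock.sendall(bytes(self.pending))  # flushed immediately
            self.pending.clear()

# Loopback demo: two logical writes, one syscall, no Nagle delay.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()
w = BurstWriter(cli)
w.write(b"part1,", more=True)   # buffered in userspace
w.write(b"part2", more=False)   # end of burst: sent at once
cli.close()
data = b""
while True:
    part = conn.recv(4096)
    if not part:
        break
    data += part
conn.close()
srv.close()
print(data)
```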
If your library is already merging buffers and preventing small writes then you would still want to set TCP_NODELAY, to eliminate delays due to send/send/recv patterns where merged buffers are less than the MSS... because you know for sure that you're already doing all you can, and Nagle's algorithm can't help further except introduce delay.
Something is not right with this. The blog post has a link to https://bugs.ruby-lang.org/issues/8681, which made the change to disable Nagle's algorithm since ruby 2.1.0 released in 2013.
World of Warcraft had Nagle's algorithm enabled for YEARS. That's one reason that VPN services were so popular and could cut 50-100ms off your ping time, especially if you were playing from Oceania.
This isn't explicitly related, but interesting, so I offer it up here.
When I read 40ms, it triggered a memory from tracking down a different 40ms latency bug a few years ago. I work on the Netflix app for set top boxes, and a particular pay TV company had a box based on AOSP L. Testing discovered that after enough times switching the app between foreground and background, playback would start to stutter. The vendor doing the integration blamed Netflix - they showed that in the stutter case, the Netflix app was not feeding video data quickly enough for playback. They stopped their analysis at this point, since as far as they were concerned, they had found the issue and we had to fix the Netflix app.
I doubted the app was the issue, as it ran on millions of other devices without showing this behavior. I instrumented the code and measured 40ms of extra delay from the thread scheduler. The 40ms was there, and was outside of our app's context. Literally, I measured it between the return of the thread handler and the next time the handler was called. So I responded, to paraphrase, it's not us, it's you. Your Android scheduler is broken.
But the onus was on me to prove it by finding the bug. I read the Android code, and learned Android threads are a userspace construct - the Android scheduler uses epoll() as a timer and calls your thread handler based on priority level. I thought, epoll() performance isn't guaranteed, maybe something obscure changed, and this change is adding an additional 40ms in this particular case. So I dove into the kernel, thinking the issue must be somewhere inside epoll().
Lucky for me, another engineer, working for a different vendor on the project, found the smoking gun in this patch in Android M (the next version). It was right there, an extra 40ms explicitly (and mistakenly) added when a thread is created while the app is in the background.
Fix janky navbar ripples -- incorrect timerslack values
If a thread is created while the parent thread is "Background",
then the default timerslack value gets set to the current
timerslack value of the parent (40ms). The default value is
used when transitioning to "Foreground" -- so the effect is that
the timerslack value becomes 40ms regardless of foreground/background.
This does occur intermittently for systemui when creating its
render thread (pretty often on hammerhead and has been seen on
shamu). If this occurs, then some systemui animations like navbar
ripples can wait for up to 40ms to draw a frame when they intended
to wait 3ms -- jank.
This fix is to explicitly set the foreground timerslack to 50us.
A consequence of setting timerslack behind the process' back is
that any custom values for timerslack get lost whenever the thread
has transition between fg/bg.
--- a/libcutils/sched_policy.c
+++ b/libcutils/sched_policy.c
@@ -50,6 +50,7 @@
// timer slack value in nS enforced when the thread moves to background
#define TIMER_SLACK_BG 40000000
+#define TIMER_SLACK_FG 50000
static pthread_once_t the_once = PTHREAD_ONCE_INIT;
@@ -356,7 +357,8 @@
param);
}
- prctl(PR_SET_TIMERSLACK_PID, policy == SP_BACKGROUND ? TIMER_SLACK_BG : 0, tid);
+ prctl(PR_SET_TIMERSLACK_PID,
+ policy == SP_BACKGROUND ? TIMER_SLACK_BG : TIMER_SLACK_FG, tid);
return 0;
> But the onus was on me to prove it by finding the bug.
As a network engineer who has had to prove numerous times that "it's not the firewall" or "it's not the network", you have my deepest sympathies.
--
Story time:
My most memorable instance of this was when I worked at a .edu and was going through yet another iteration of the back-and-forth, "it's not on our side, must be on your side", "no, it's not us, it's your firewall", finger-pointing blame game.
This particular issue had been dragging on for a while and, eventually, I got a packet capture from their side (packet captures are, to me, the ultimate source of truth, "the last word"). Yet, even after showing (and explaining) it to them, they continued to insist that their firewall was absolutely, positively, 100% configured properly and, thus, it had to be my firewall which was not properly configured.
Finally, I called in a favor from a (Ph.D.) faculty member who "owed me one". After explaining what was going on and, more importantly, showing him the timestamps, he agreed to help me out. He wrote up a wonderful e-mail (which I still have a copy of, somewhere) stating how it just was not possible, literally, that mine was the "offending firewall". He explained how he was able to conclude this with absolute certainty, thanks to the laws of physics and politely suggested that -- unless they had evidence that Checkpoint was able to somehow violate those laws of physics (in which case he would be extremely interested in their proof) -- that perhaps they might wanna take another look at their firewall?
I waited anxiously, curious what the response (and latest excuse) would be. A few days passed before, finally, an e-mail appeared in my inbox. I quickly opened it and read the one-line reply:
If TCP had a header flag to indicate that the next segment was <MSS in size, then the receiver could be a lot smarter about whether it delayed the ACK.
I was dealing with "Nagle's delay" only yesterday, adding "setsockopt(TCP_NODELAY)" for the alpha version of a new high-performance payments database called TigerBeetle: https://github.com/coilhq/tiger-beetle
Comments like this are just spam and don’t contribute anything to the discussion. If you have some interesting insights, share them, but don’t shift the discussion to your own completely unrelated project.
I appreciate your positive intention of watching out for the quality of the threads. But please don't do it this way. Instead, please follow the site guidelines and be kind.
HN threads are conversations, and it's natural in a conversation for someone to connect a topic to something they were working on recently. That's not shifting discussion to something unrelated. I can see how one could interpret it just as an excuse for spamming one's project, but such an interpretation is just a guess, and if you apply the site guideline "Assume good faith", you'd guess the opposite. It's clear from https://news.ycombinator.com/item?id=24801075 that the good-faith guess is the true one.
I ran into similar problems when doing performance testing to determine our new wireless router's limits, only to find out they were far too low.
> you can’t fix TCP problems without understanding TCP
True, but many problems are not TCP problems, and it is not always easy to determine where your delay is coming from.
> Stuff like this just proves that a large part of this job is just remembering a bunch of weird data points and knowing when to match this story to that problem.
Sounds like interns doing rounds with the chief resident.
> Stuff like this just proves that a large part of this job is just remembering a bunch of weird data points and knowing when to match this story to that problem.
No, the TCP_NODELAY setsockopt is not exposed by the browser.
If you only control the client then you could make your http/fetch requests as big as possible, i.e. do the Nagle algorithm yourself but at a higher level to avoid the delays, but I don't think there's much else you can do.
Then again, if you control the server you could just run HTTP/3 to be on QUIC, and then you're not on TCP any more; you would also save a lot of connection handshake latency and benefit from advances in bufferbloat-sensitive congestion control algorithms much quicker.
To be fair, I would be more concerned about bufferbloat delays on small shared networks than delays due to Nagle's algorithm. Bufferbloat delays can be up to tens of seconds, and you can trigger them disturbingly easily, just have someone else in the office upload a large attachment to an email in Gmail's web client to saturate the router's send buffer, while you watch your ping times.