The post seems to boil down to a complaint against doing process-per-CPU.
Let's say your server has 4 CPUs. The conservative option is to limit yourself to 4 requests at a time. But for most web applications, requests use tiny bursts of CPU in between longer spans of I/O, so your CPUs will be mostly idle.
Let's say we want to make better use of our CPUs and accept 40 requests at a time. Some environments (Java, Go, etc) allow any of the 40 requests to run on any of the CPUs. A request will have to wait only if 4+ of the 40 requests currently need to do CPU work.
Some environments (Node, Python, Ruby) allow a process to only use a single CPU at a time (roughly). You could run 40 processes, but that uses a lot of memory. The standard alternative is to do process-per-CPU; for this example we might run 4 processes and give each process 10 concurrent requests.
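To make that concrete, the process-per-CPU split above could be expressed as a Gunicorn config along these lines (a sketch; the worker class and numbers just stand in for "async workers, 4 processes, 10 requests each"):

    # Hypothetical gunicorn.conf.py for this example; values are
    # illustrative, not a recommendation.
    workers = 4                # one worker process per CPU
    worker_class = "gevent"    # async worker, so one process can juggle many requests
    worker_connections = 10    # cap on concurrent requests per worker process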
But now a request will have to wait if more than 1 of the 10 requests in its process needs to do CPU work. This has a higher probability of happening than "4+ out of 40". That's why this setup will result in higher latency.
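To put rough numbers on that, here's a back-of-the-envelope sketch that assumes each in-flight request independently needs the CPU at any given instant with some small probability p (the 2% figure is made up):

    # Back-of-the-envelope sketch, not a measurement: assume each
    # in-flight request is on-CPU at any instant with probability p.
    from math import comb

    def at_least(k, n, p):
        """P(k or more of the n requests need the CPU at the same moment)."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    p = 0.02  # assumed: a request spends ~2% of its time on CPU

    print(at_least(4, 40, p))   # shared pool: 4+ of the 40 need CPU
    print(at_least(2, 10, p))   # one process: 2+ of its 10 need CPU

With these made-up numbers the per-process case comes out roughly twice as likely, and the gap widens as the per-request CPU fraction shrinks.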
And there's a bunch more to it. For example, it's slightly more expensive (for cache/NUMA reasons) for a request to switch from one CPU to another, so some high-performance frameworks intentionally pin requests to CPUs, e.g. Nginx, Seastar. A "work-stealing" scheduler tries to strike a balance: requests are pinned to CPUs, but if a CPU is idle it can "steal" a request from another CPU.
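For illustration, pinning per-CPU worker processes looks roughly like this on Linux (a sketch, not how Nginx or Seastar actually implement it; Nginx has a worker_cpu_affinity directive for this):

    # Linux-only sketch: start one worker process per CPU and pin each
    # one to its own core with sched_setaffinity.
    import os
    from multiprocessing import Process

    def worker(cpu: int) -> None:
        os.sched_setaffinity(0, {cpu})   # restrict this process to one CPU
        print("worker pinned to CPUs", os.sched_getaffinity(0))
        # ... run the event loop / handle requests here ...

    if __name__ == "__main__":
        n_cpus = os.cpu_count() or 1
        procs = [Process(target=worker, args=(cpu,)) for cpu in range(n_cpus)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()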
The starvation/timeout problem described in the post is strictly more likely to happen in process-per-CPU, sure. But for a ton of web app workloads, the odds of it happening are low, and there are things you can do to improve the situation.
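For example, in a single-CPU event-loop environment one common mitigation is to push known CPU-heavy work off the event loop into a small process pool, so it can't starve the other requests in that process. A rough asyncio sketch (render_report and the pool size are placeholders):

    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def render_report(data):
        # Stand-in for some CPU-heavy work (templating, image resizing, ...).
        return sum(x * x for x in data)

    async def handle_request(pool, data):
        loop = asyncio.get_running_loop()
        # The event loop stays free for other requests while the heavy
        # function runs in a separate process.
        return await loop.run_in_executor(pool, render_report, data)

    async def main():
        with ProcessPoolExecutor(max_workers=2) as pool:   # assumed sizing
            print(await handle_request(pool, range(1_000_000)))

    if __name__ == "__main__":
        asyncio.run(main())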
The post also talks about Gunicorn accepting connections inefficiently, and that should probably be fixed, but that space has very similar tradeoffs: <https://blog.cloudflare.com/the-sad-state-of-linux-socket-ba...>
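One of the knobs in that space is SO_REUSEPORT: each worker opens its own listening socket on the same port, and the kernel spreads new connections across them instead of every worker racing to accept() on one shared socket. A rough Python sketch (Linux-only; I believe Gunicorn exposes this as its reuse_port setting):

    import socket

    def make_listener(port: int) -> socket.socket:
        # Called once per worker process: each worker gets its own
        # listening socket (and accept queue) for the same port.
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind(("0.0.0.0", port))
        s.listen(128)
        return s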