
I think this conflates a poor implementation of a webserver with python/gunicorn/gevent being bad. There are a few (easy) things to do to avoid some of the pitfalls she encountered:

> A connection arrives on the socket. Linux runs a pass down the list of listeners doing the epoll thing -- all of them! -- and tells every single one of them that something's waiting out there. They each wake up, one after another, a few nanoseconds apart.

Linux is known to have poor fairness when multiple processes listen on the same socket. For most setups that require forking processes, you run a local load balancer on the box, whether it's haproxy or something else, and have each process listen on its own port. This not only lets you enforce fairness with whatever load-balancing policy you want, but also gives you health checks, queueing, etc.
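
As a rough sketch of that one-port-per-worker pattern (the ports, worker count, and toy HTTP response are all illustrative, not anything gunicorn-specific), you'd point haproxy at the ports as backends and run something like:

    import os
    import socket

    BASE_PORT = 8000    # hypothetical; the load balancer's backends point here
    NUM_WORKERS = 4     # typically num_cpus

    def serve(port):
        # Each worker owns exactly one listening socket, so only it is
        # woken for connections on that port -- no thundering herd.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("127.0.0.1", port))
        sock.listen(128)
        while True:
            conn, _ = sock.accept()
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
            conn.close()

    for i in range(NUM_WORKERS):
        if os.fork() == 0:          # child: serve exactly one port
            serve(BASE_PORT + i)    # never returns

    for _ in range(NUM_WORKERS):
        os.wait()                   # parent: a real setup would supervise the children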

>Meanwhile, that original request is getting old. The request it made has since received a response, but since there's not been an opportunity to flip back to it, the new request is still cooking. Eventually, that new request's computations are done, and it sends back a reply: 200 HTTP/1.1 OK, blah blah blah.

This can happen whether it's an OS-threaded design or a userspace green-thread runtime. If a process is overloaded, clients can and will time out on the request. The main difference is that in a green-thread runtime the failure mode is overloading the single process rather than exhausting a pool of threads. You can make this better by using a local load balancer on the box and spreading load evenly. It's also best practice to minimize "blocking" in the application, since that is what causes these pauses.
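
To make the "blocking" point concrete, here's a minimal gevent sketch (the function names are made up): a CPU-bound handler that never yields keeps an already-ready reply from going out, while sprinkling in cooperative yields lets it flush promptly:

    import gevent

    def cpu_heavy():
        total = 0
        for i in range(10_000_000):
            total += i
            if i % 100_000 == 0:
                gevent.sleep(0)   # yield to the hub so other greenlets run;
                                  # remove this and quick_reply waits for the
                                  # entire loop to finish
        return total

    def quick_reply():
        print("replying immediately")

    gevent.joinall([gevent.spawn(cpu_heavy), gevent.spawn(quick_reply)])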

>That's why they fork-then-load. That's why it takes up so much memory, and that's why you can't just have a bunch of these stupid things hanging around, each handling one request at a time and not pulling a "SHINYTHING!" and ignoring one just because another came in. There's just not enough RAM on the machine to let you do this. So, num_cpus + 1 it is.

Delayed imports (used to work around cyclical dependencies) are bad practice. That being said, forking N processes is standard for languages/runtimes that can only utilize a single core per process (Python, Ruby, JavaScript, etc.).
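
For gunicorn specifically, the usual counter to fork-then-load is its preload_app setting, which loads the app once in the master so forked workers share pages via copy-on-write. A sketch of the relevant config (assuming your app has no import-time state that breaks across fork):

    # gunicorn.conf.py
    import multiprocessing

    preload_app = True                          # load-then-fork instead of fork-then-load
    workers = multiprocessing.cpu_count() + 1   # the num_cpus + 1 from the article

Then run it with: gunicorn -c gunicorn.conf.py myapp:app (where myapp:app is whatever your WSGI entry point is).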

This is not to say that this solution is ideal -- just that with a little work you can improve the scalability, reliability, and behavior under load of these systems by quite a bit.


