
This is a good survey of options with pros and cons. While you're at it, please spend some time understanding failure modes and, more importantly, recovery from failures.

Let me describe a scenario my team ran into while using option #3, i.e., Look-Aside. We used etcd as the service discovery store. Servers, as part of their startup, would register with etcd. Clients would query etcd and randomly pick a server to talk to. Now, for some reason, the entire service fleet got isolated. A couple of hours later the network partition healed just fine, but by then all the servers had shut down.
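
For concreteness, the client side of that look-aside setup looked roughly like this (a sketch using the Go etcd v3 client; the key prefix is made up):

    package lookaside

    import (
        "context"
        "errors"
        "math/rand"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    // pickBackend is the client side of the look-aside pattern described
    // above: list whatever servers have registered themselves under a
    // prefix and pick one at random. The key prefix is hypothetical.
    func pickBackend(ctx context.Context, cli *clientv3.Client) (string, error) {
        resp, err := cli.Get(ctx, "/services/rides/", clientv3.WithPrefix())
        if err != nil {
            return "", err
        }
        if len(resp.Kvs) == 0 {
            return "", errors.New("no backends registered")
        }
        // Every client chooses independently, so the first server to
        // re-register after an outage meets the entire herd at once.
        kv := resp.Kvs[rand.Intn(len(resp.Kvs))]
        return string(kv.Value), nil
    }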

The real pain started when we began bringing up servers. As soon as a server was brought up, it would register with etcd, immediately get bombarded by hundreds of clients, and then promptly die. No matter what permutation we tried, servers would just refuse to come up because of the thundering herd. The only option was to shut down the client app (which was another fiasco, because we hadn't built a clean way to do it), bring up all the servers, and then gingerly bring up clients, hoping they wouldn't kill the servers.

The downtime lasted for almost a day. And this is a largish app-based ride-hailing business I'm talking about, so you can imagine the shitstorm it created among customers and investors alike.

A key lesson for us was to isolate service startup from load-balancer registration. A service should not be responsible for registering itself with a load balancer; that should be someone else's job.
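
A sketch of what "someone else does it" can look like: a registrar sidecar or deploy hook that waits for the server's gRPC health check to report SERVING and only then writes the etcd key, under a lease so the entry expires if the registrar dies. Addresses, key names and the TTL here are all made up:

    package registrar

    import (
        "context"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        healthpb "google.golang.org/grpc/health/grpc_health_v1"
    )

    // registerWhenReady runs outside the server process. It polls the
    // server's gRPC health endpoint and only registers the address in etcd
    // once the server says it is actually ready to take traffic.
    func registerWhenReady(ctx context.Context, etcd *clientv3.Client, addr string) error {
        conn, err := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
        if err != nil {
            return err
        }
        defer conn.Close()
        hc := healthpb.NewHealthClient(conn)

        for {
            resp, err := hc.Check(ctx, &healthpb.HealthCheckRequest{})
            if err == nil && resp.Status == healthpb.HealthCheckResponse_SERVING {
                break
            }
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(2 * time.Second):
            }
        }

        // Register under a lease; if we stop renewing it, the key expires
        // and clients stop seeing this backend.
        lease, err := etcd.Grant(ctx, 15)
        if err != nil {
            return err
        }
        if _, err := etcd.Put(ctx, "/services/rides/"+addr, addr, clientv3.WithLease(lease.ID)); err != nil {
            return err
        }
        // In real code you'd also drain the keep-alive channel.
        _, err = etcd.KeepAlive(ctx, lease.ID)
        return err
    }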

While gRPC does have lots of positives, it still has some way to go before it reaches operational maturity. People will discover these deficiencies in a painful manner.



The "thundering herd" problem implies a misconfiguration - surely you can crank down the accept limit on the server? This is a problem that Apache spotted in the 90s and provided tuning parameters to mitigate.
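
In grpc-go at least you can approximate those Apache-style knobs at the listener and connection level; a rough sketch (the limits are arbitrary, and other language runtimes differ):

    package main

    import (
        "log"
        "net"

        "golang.org/x/net/netutil"
        "google.golang.org/grpc"
    )

    func main() {
        lis, err := net.Listen("tcp", ":50051")
        if err != nil {
            log.Fatal(err)
        }
        // Cap simultaneously accepted connections so a reconnect storm
        // queues in the kernel backlog instead of overwhelming the process.
        lis = netutil.LimitListener(lis, 500)

        srv := grpc.NewServer(
            // Also cap concurrent streams per HTTP/2 connection.
            grpc.MaxConcurrentStreams(100),
        )
        // ...register services here...
        if err := srv.Serve(lis); err != nil {
            log.Fatal(err)
        }
    }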


As with so many things gRPC, they've had an open issue about it for years, but it doesn't seem to be going anywhere: https://github.com/grpc/grpc-java/issues/1886

I like gRPC overall, but it's definitely got some rough spots like this. I'm fortunate enough to be using it at a small enough scale that they don't really hinder me.

My sense, based on the multitude of GitHub issues like this, is that the project itself is thrashing. They've got a relatively small core team that seems to be micro-managing all the official implementations, largely out of a desire to maintain cross-language compatibility. But the cross-language compatibility is actually rather poorer than you'd expect, because they're trying to maintain it by directly micro-managing the implementations, which turns every feature design project into a complicated n-body problem.

It would be nice if they could just publish a comprehensive formal spec, and let the language implementations follow it. Then each (official) language would have only one external thing to track, instead of ten.


+1. To add, the lack of reverse proxy support back then made it worse. So things like rate limiting, max connections, timeouts, etc., that we take for granted in Nginx or HAProxy had to be either hand-coded or done in some roundabout manner.

This is what I meant by "operational maturity". A big factor in a new tech gaining adoption is the ease with which it fits into the existing ecosystem. For gRPC that means load balancers, reverse proxies, deployments, and so forth. I'm sure gRPC will get there, depending on how fast it's adopted in big tech companies, but I suspect it's not there yet.


HAProxy has been able to load balance gRPC since 2.0. Here is the link to the documentation for the current 2.4 version.

http://cbonte.github.io/haproxy-dconv/2.4/configuration.html...
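
For reference, a minimal end-to-end HTTP/2 pass-through for gRPC looks roughly like this (addresses, ports and cert paths are placeholders):

    frontend grpc_in
        mode http
        bind :443 ssl crt /etc/haproxy/cert.pem alpn h2
        default_backend grpc_servers

    backend grpc_servers
        mode http
        balance roundrobin
        server srv1 10.0.0.1:50051 check proto h2
        server srv2 10.0.0.2:50051 check proto h2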


It's good you learned that lesson about having someone else announce the server, but it isn't always that simple. With things like Apache Aurora, jobs would often be announced as soon as the process started. With something like Finagle, it tends to announce as soon as the component is initialized, instead of waiting until the server is fully initialized and ready to handle requests.

Something I implemented at my last job, and that others rediscovered at my current job, is an "administratively up/down" API as part of the control plane, with the server only announced when it is "up." Decoupling the announcement from process start/initialization-complete allowed us to roll out new versions of software in a disabled state and then "flip the switch" (red/black deployments). It also enabled us to take individual instances out of service without killing them, letting developers debug issues/anomalies more easily.
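
One way to wire that up is with gRPC's standard health service, starting NOT_SERVING and flipping it from a control-plane call; a sketch (the admin HTTP endpoints and ports are invented, and your announcer has to actually respect the health status):

    package main

    import (
        "log"
        "net"
        "net/http"

        "google.golang.org/grpc"
        "google.golang.org/grpc/health"
        healthpb "google.golang.org/grpc/health/grpc_health_v1"
    )

    func main() {
        srv := grpc.NewServer()
        h := health.NewServer()
        healthpb.RegisterHealthServer(srv, h)

        // Start administratively DOWN: the process is running and
        // debuggable, but nothing that watches the health service will
        // send it traffic until we flip the switch.
        h.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)

        // Hypothetical control-plane endpoints to flip the switch.
        http.HandleFunc("/admin/up", func(w http.ResponseWriter, r *http.Request) {
            h.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
        })
        http.HandleFunc("/admin/down", func(w http.ResponseWriter, r *http.Request) {
            h.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
        })
        go http.ListenAndServe(":8080", nil)

        lis, err := net.Listen("tcp", ":50051")
        if err != nil {
            log.Fatal(err)
        }
        // ...register application services here...
        if err := srv.Serve(lis); err != nil {
            log.Fatal(err)
        }
    }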

Load shedding/backpressure/rate limiting at various layers is also extremely helpful, whether at the load balancer/API gateway or at individual servers. That has saved our bacon numerous times.
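
At the individual-server layer, the simplest useful version of that is a token-bucket interceptor that fails fast instead of queueing; a sketch in Go (limits are illustrative):

    package ratelimit

    import (
        "context"

        "golang.org/x/time/rate"
        "google.golang.org/grpc"
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // rateLimitUnary sheds load at the server itself: requests beyond the
    // limiter's budget fail fast with RESOURCE_EXHAUSTED instead of piling up.
    func rateLimitUnary(l *rate.Limiter) grpc.UnaryServerInterceptor {
        return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
            if !l.Allow() {
                return nil, status.Error(codes.ResourceExhausted, "server is shedding load, retry later")
            }
            return handler(ctx, req)
        }
    }

    // Usage:
    //   limiter := rate.NewLimiter(rate.Limit(1000), 200) // 1000 req/s, burst 200
    //   srv := grpc.NewServer(grpc.ChainUnaryInterceptor(rateLimitUnary(limiter)))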


Maybe you could bring up a bunch of fake servers that "implement" every RPC call with a long sleep, or maybe by responding with an internal error. Then take them down gradually as the real ones come up.
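
That decoy is cheap to build with gRPC's unknown-service handler, which catches any method name; a sketch (port and timing made up; UNAVAILABLE is used here since clients usually treat it as retryable, but INTERNAL works too):

    package main

    import (
        "log"
        "net"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    func main() {
        // Accept any method via the unknown-service handler, stall briefly,
        // then fail so well-behaved clients back off and retry elsewhere.
        srv := grpc.NewServer(grpc.UnknownServiceHandler(
            func(srv interface{}, stream grpc.ServerStream) error {
                time.Sleep(2 * time.Second) // soak up a bit of the stampede
                return status.Error(codes.Unavailable, "backend warming up, try again")
            },
        ))
        lis, err := net.Listen("tcp", ":50051")
        if err != nil {
            log.Fatal(err)
        }
        if err := srv.Serve(lis); err != nil {
            log.Fatal(err)
        }
    }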


I think what you really want to do is not give every client a full view of all the backends. xDS lets you write a service discovery server that meets this condition (it knows the full state, potentially with health information about upstreams, and it knows which client is connecting, so you can adjust this as you see fit). I've also seen people do AZ or regional aggregation, i.e. given some consumer in AZ A, the consumer gets a list of endpoints like, 0.upstream.a, 1.upstream.a, ..., regional-aggregator.B, regional-aggregator.C, etc. It sees all the endpoints in the same AZ, but goes through a proxy to get to other regions/zones. Under non-panic circumstances, you'd want all requests to be served from the same node, then the same zone, and only go to other regions in degraded cases.
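
A crude way to get that effect even without xDS is deterministic subsetting, where each client derives a stable fraction of the backend list; a sketch (hash choice and subset size are arbitrary):

    package subset

    import (
        "hash/fnv"
        "sort"
    )

    // subsetFor gives each client a stable slice of the backend list instead
    // of the full view, so N clients x M servers doesn't become N*M
    // connections and a re-registering server only meets a fraction of the
    // herd. Note that it sorts the input slice in place.
    func subsetFor(clientID string, backends []string, subsetSize int) []string {
        if len(backends) <= subsetSize {
            return backends
        }
        sort.Strings(backends) // stable ordering across clients

        h := fnv.New32a()
        h.Write([]byte(clientID))
        start := int(h.Sum32() % uint32(len(backends)))

        out := make([]string, 0, subsetSize)
        for i := 0; i < subsetSize; i++ {
            out = append(out, backends[(start+i)%len(backends)])
        }
        return out
    }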

I don't know what the state of the art around tooling to manage this is. For some reason, I suspect that the service meshes punt on this, because N * M isn't a problem in the demo environments where these systems spend most of their time. Meanwhile, the big companies that hit the scalability limit of N * M connections across upstream/downstream pairs wrote their own service discovery stuff decades ago.


That is pretty clever.




