> Making the server forcefully shut down connections after some time, so that when clients reconnect, the new connections automatically go to the healthier instances.
This is good practice anyway. Having long-lived servers causes all kinds of issues. In the old days, if you had a server that was up for years, you had no idea whether you could reboot the server safely or not. Additionally, from a security perspective, the longer-lived a server is, the more likely that an attacker installed some manner of backdoor on the server that's just waiting for the attacker's opportune moment to come back and exploit it.
Much better to make chaos an expected part of your operations - make sure your servers are cattle (not pets), give them maximum lifetimes, and terminate them on a regular basis. The more frequent the termination, the better prepared you are for single-node overload, and the less likely that it happens in your use case in the first place.
Do you have any research source for that, please? AFAIK death is more than anything connected to offspring survival. That is not considered to be 'social' by evolutionary biologists.
I’m confused what you want research about. That death and new generations enable our much faster social evolution and inherently improves the bus factor of all our systems? Do you need research to see this?
I would like to see research that connects age of death or death itself with social factors. AFAIK it's connected to offspring survival, which is not a social factor.
I'd just invoke Planck's principle and thousands of other examples right before your eyes. You shouldn't need research when you're swimming in the conclusion.
Sorry but this is going directly against evolutionary biology as it is commonly understood by evolutionary biologists and very successfully used to explain evolution. They don't care about invocations of Planck's principle, that's philosophical pondering, not useful genomics.
Cultural evolution is part of evolution as much as genetic evolution is. In fact cultural evolution in practice supersedes genetic evolution in many places because it adapts and propagates orders of magnitudes faster than genes.
If you want to ignore most of the picture and focus on the part you like, you're like a horse with eye covers. Only what you look at exists, and where you look determines where you want to go. It's not pretty.
I'm open to the broader picture, and certainly human culture has an enormous impact on human bodies, the age of death, etc. But the comment I originally reacted to suggested that death is entirely socially driven, which IMHO needs enormous proof and to me simply doesn't make sense, as even entirely non-cultural beings die.
> Either of these approaches defeats a fundamental gRPC benefit: reusable connections.
I feel like that's not quite true. The server could notice it's overworked and start telling clients to back off or to try another instance. At that point the clients can make the call to either back off or terminate the connection. The server can keep increasing its "back off for N ms" hint until it finally has to terminate connections.
It requires a bit of extra coordination, but not much.
Also, even without such a system, forcing a reconnect periodically or when a new instance starts up still keeps connection overhead tiny. If you force reconnects every minute and a client sends 10 rps, that's 1 reconnect per 600 calls.
I guess in theory a load balancer has the most information to make these decisions, so really you'd want it to be aware of "back off" requests from the server. One way of doing this might be something with health checks? Dunno.
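For what it's worth, in grpc-go the "force reconnects periodically" part is just a server option. A minimal sketch, with illustrative durations and the actual service registration elided:

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}

	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		// After roughly a minute (gRPC adds jitter), the server sends GOAWAY,
		// forcing clients to re-resolve and reconnect - ideally to a less
		// loaded instance.
		MaxConnectionAge: time.Minute,
		// Let in-flight RPCs finish before the connection is torn down.
		MaxConnectionAgeGrace: 10 * time.Second,
	}))

	// Register your services on srv here, then serve.
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```

With the 10 rps example above, a one-minute MaxConnectionAge works out to exactly that 1-reconnect-per-600-calls amortization.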
This is a good survey of options with pros and cons. While one is at it, please spend some time understanding failure modes and, more importantly, recovery from failures.
Let me describe a scenario my team ran into while using option #3, i.e. look-aside load balancing. We used etcd as the service discovery store. Servers would register with etcd as part of their startup. Clients would query etcd and randomly pick a server to talk to. Then, for some reason, the entire service fleet got isolated. A couple of hours later the network partition healed just fine, but by then all the servers had shut down.
The real pain started when we began bringing up servers. As soon as a server came up it would register with etcd, immediately get bombarded by hundreds of clients, and promptly die. No matter what permutation we tried, servers would just refuse to stay up because of the thundering herd. The only option was to shut down the client app, which was another fiasco because we hadn't built a clean way to do that, then bring up all the servers and gingerly bring up clients, hoping they wouldn't kill the servers.
The downtime lasted almost a day. And this is a largish app-based ride-hailing business I'm talking about, so you can imagine the shitstorm it created among customers and investors alike.
A key learning for us was to isolate service startup from load-balancer registration. A service should not be responsible for registering itself with a load balancer; that should be someone else's job.
While gRPC has lots of positives, it still has some way to go before it reaches operational maturity. People will discover these deficiencies in a painful manner.
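Not a substitute for handing registration to an external agent, but for readers following along: even the in-process version can at least be gated on readiness and jittered so a recovering fleet doesn't all re-register at the same instant. A rough sketch; `registerWhenReady`, the `ready` channel, and the key layout are made up, while the lease calls are the standard etcd clientv3 ones:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"math/rand"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// registerWhenReady is a hypothetical helper: it only advertises the server in
// etcd after the service reports itself warm, and adds jitter so a fleet that
// restarts together doesn't all re-register (and get hammered) at the same time.
func registerWhenReady(ctx context.Context, cli *clientv3.Client, key, addr string, ready <-chan struct{}) error {
	<-ready // wait for warmup / health checks to pass, NOT just process start

	// Spread re-registrations out to soften the thundering herd on recovery.
	time.Sleep(time.Duration(rand.Intn(30)) * time.Second)

	lease, err := cli.Grant(ctx, 10) // 10s TTL: the entry disappears if we die
	if err != nil {
		return fmt.Errorf("grant lease: %w", err)
	}
	if _, err := cli.Put(ctx, key, addr, clientv3.WithLease(lease.ID)); err != nil {
		return fmt.Errorf("register: %w", err)
	}
	ch, err := cli.KeepAlive(ctx, lease.ID) // heartbeat the lease in the background
	if err != nil {
		return fmt.Errorf("keepalive: %w", err)
	}
	go func() {
		// Drain keepalive acks; the loop ends when ctx is cancelled.
		for range ch {
		}
	}()
	log.Printf("registered %s -> %s", key, addr)
	return nil
}
```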
The "thundering herd" problem implies a misconfiguration - surely you can crank down the accept limit on the server? This is a problem that Apache spotted in the 90s and provided tuning parameters to mitigate.
I like gRPC overall, but it's definitely got some rough spots like this. I'm fortunate enough to be using it at a small enough scale that they don't really hinder me.
My sense, based on the multitude of GitHub issues like this, is that the project itself is thrashing. They've got a relatively small core team that seems to be micro-managing all the official implementations, largely out of a desire to maintain cross-language compatibility. But the cross-language compatibility is actually rather poorer than you'd expect, because they're trying to maintain it by directly micro-managing the implementations, which turns every feature design project into a complicated n-body problem.
It would be nice if they could just publish a comprehensive formal spec, and let the language implementations follow it. Then each (official) language would have only one external thing to track, instead of ten.
+1. To add, the lack of reverse-proxy support back then made it worse. Things like rate limiting, max connections, timeouts, etc., that we take for granted in Nginx or HAProxy had to be either hand-coded or done in some roundabout manner.
This is what I meant by "operational maturity". A big factor in a new technology gaining adoption is the ease with which it fits into the existing ecosystem. For gRPC that's load balancers, reverse proxies, deployments, and so forth. I'm sure gRPC will get there, depending on how fast it's adopted in big tech companies, but I suspect it's not there yet.
It's good you learned that lesson about having someone else announce the server, but it isn't always that simple. Something like Apache Aurora would often announce jobs as soon as the process started. Finagle tends to announce as soon as the component is initialized, instead of waiting until the server is fully initialized and ready to handle requests.
Something that I implemented at my last job and others rediscovered as part of my current job is to implement an "administratively up/down" API as part of the control plane and only have the server announced if it was "up." Decoupling the announcement from process start/initialization complete allowed us to roll out new versions of software in a disabled fashion and then "flip the switch" (red/black deployments). It also enabled us to take individual instances out of service without killing them, enabling developers to debug issues/anomalies more easily.
Load shedding/backpressure/rate limiting at various layers is also extremely helpful, whether at the load balancer/API gateway or at individual servers. That has saved our bacon numerous times.
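A minimal sketch of the "administratively up/down" idea using the stock gRPC health service; the /admin endpoints and ports are invented for illustration, and a real control plane would add auth, auditing, and persistence:

```go
package main

import (
	"log"
	"net"
	"net/http"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	grpcServer := grpc.NewServer()
	hs := health.NewServer()
	healthpb.RegisterHealthServer(grpcServer, hs)

	// Start administratively "down": the process is running but not announced
	// as ready, so a health-checking LB / discovery layer won't send it traffic.
	hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)

	// Hypothetical control-plane endpoints to flip the switch without restarting.
	http.HandleFunc("/admin/up", func(w http.ResponseWriter, r *http.Request) {
		hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	})
	http.HandleFunc("/admin/down", func(w http.ResponseWriter, r *http.Request) {
		hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
	})
	go func() { log.Fatal(http.ListenAndServe(":8081", nil)) }()

	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(grpcServer.Serve(lis))
}
```

A health-checking load balancer or discovery layer then only routes to instances reporting SERVING, so a deploy can start dark and be flipped on, or an instance pulled out for debugging, without killing the process.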
Maybe you could bring up a bunch of fake servers that "implement" every RPC call with a long sleep, or maybe by responding with an internal error. Then take them down gradually as real ones come up.
I think what you really want to do is not give every client a full view of all the backends. xDS lets you write a service discovery server that meets this condition (it knows the full state, potentially with health information about upstreams, and it knows which client is connecting, so you can adjust this as you see fit). I've also seen people do AZ or regional aggregation, i.e. given some consumer in AZ A, the consumer gets a list of endpoints like, 0.upstream.a, 1.upstream.a, ..., regional-aggregator.B, regional-aggregator.C, etc. It sees all the endpoints in the same AZ, but goes through a proxy to get to other regions/zones. Under non-panic circumstances, you'd want all requests to be served from the same node, then the same zone, and only go to other regions in degraded cases.
I don't know what the state of the art around tooling to manage this is. For some reason, I suspect that the service meshes punt on this, because N * M isn't a problem in the demo environments where these systems spend most of their time. Meanwhile, the big companies that hit the scalability limit of N * M connections across upstream/downstream pairs wrote their own service discovery stuff decades ago.
Even better - they do not move for me at all, but they put a ridiculous amount of load on the system. This was the heaviest tab on my system, and its contenders were big fscking SPAs.
I'm not sure if I'm missing something or if the post only wants to discuss load-balancers available in AWS infra, but is there any reason we can't have an active application level load balancer for gRPC? As in, one that connects to all backends ahead of time and... balances the load on a per-request rather than per-connection basis?
There are quite a few proxies that can do this; Envoy's probably the most popular.
You can also load-balance on the client side. Clients connect to a management server, and the server tells them what percentage of connections to send to which upstream. I don't think anyone has written any publicly available software to manage this, but it can use the same underlying protocol as Envoy's configuration servers (which are also rare in public; I wrote one, but it sends clients Endpoints and Clusters and the gRPC implementation happens to want Listeners). The xDS implementation is a newer version of grpclb, which maybe had more support, but I've never heard of anyone using it. (Google internally uses client-side load balancing. It worked great when I used it, so I was always surprised it wasn't more popular in the real world.)
The article is really negative about client-side load balancing, but you can't complain about the latency properties -- no proxy between you and the other end. I always wished the Internet standards did a better job here; you can serve multiple A records for a DNS lookup, and browsers kind of sort of try to load balance (by selecting one of the endpoints randomly), but it doesn't work too well if one of the backends is broken. If you could just supply a pre-defined health-check function and maybe select a simple balancing algorithm, HTTP would be just a little bit smoother for everyone. (If you put topology information in the DNS lookup, you could even balance geographically without having to run special DNS servers that serve different replies to different users. A lot of upstream DNS caching hacks could then go away. It would be great! People are onto the value of this; Envoy Mobile lets you serve this information to your mobile app, so it can easily survive regional failures in your server-side infrastructure. Probably not a lot of adoption, but something that browsers should start doing.)
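For completeness, the simplest built-in form of client-side balancing in grpc-go - not the management-server/xDS setup described above - is DNS resolution plus round_robin. A sketch, with a made-up service name:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// "dns:///" makes the client resolve all A/AAAA records for the name and
	// treat each one as a candidate backend; round_robin then spreads RPCs
	// across every healthy subchannel instead of pinning to one connection.
	conn, err := grpc.Dial(
		"dns:///my-service.internal:50051", // hypothetical service name
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	_ = ctx // invoke your generated client stubs with ctx here
}
```

Each resolved address becomes its own subchannel, so a single client spreads RPCs across backends instead of pinning everything to one connection, which is exactly the failure mode the article complains about.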
>You can also load-balance on the client side. Clients connect to a management server, and the server tells them what percentage of connections to send to which upstream. I don't think anyone has written any publicly available software to manage this, but it can use the same underlying protocol as Envoy's configuration servers
I believe this is exactly the way that Hashicorp's Consul Connect service mesh is implemented for L7 routing.
This is fine if you're comfortable adding a few hundred usec average / 1ms p99 latency to every request. A lot of gRPC use cases don't want to make this tradeoff.
> Although this feature from ALBs is relatively new, the arguments for sticky gRPC connections and NLBs are also applicable to ALBs. If you end up with a single sticky and persistent connection from a client that starts sending a large volume of requests, you are overloading the single server instance that holds that connection, and the requests are not being load balanced as they should be.
That seems strange to me. Without end-to-end HTTP/2, ALBs would load balance individual requests, not the entire TCP session, even with keepalives. And it sounds like individual messages on the same connection can be routed to different target groups, which would make it even stranger not to balance individual requests.
Yeah, I think the post author may be confused. I've migrated several high traffic gRPC services (basically TensorFlow Serving) from AWS NLBs to ALBs for the improved load balancing. It worked pretty much as desired (although a little expensive).
It very much depends on your clients and the nature of the traffic they send. For a high-volume client, assuming a default of 100 concurrent streams per connection max, this actually works out to <connection establishment>, <50 parallel requests>, <connection close>. You have to ~quadruple your connection count vs. what you could have done on one unfettered connection, and every connection spends 50% of its time being established.
Some hybrid of count and time might do, e.g. `50 requests AND 1+ minute since establishment`. But it's nuanced - surprisingly hard to find logic that works well in all cases.
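One way that hybrid might look, as a sketch; the 50-request and 1-minute thresholds are the ones from the comment above and would obviously need tuning per workload:

```go
package main

import (
	"fmt"
	"time"
)

// connState tracks one client connection. Names and thresholds are illustrative.
type connState struct {
	established time.Time
	requests    int
}

// shouldRecycle is the "50 requests AND 1+ minute" hybrid: a connection is
// recycled only once it has both served enough requests to amortize its setup
// cost and lived at least a minute, which caps a busy client's reconnect rate
// at roughly one per minute while leaving cheap, low-volume connections alone.
func (c *connState) shouldRecycle(now time.Time) bool {
	return c.requests >= 50 && now.Sub(c.established) >= time.Minute
}

func main() {
	c := &connState{established: time.Now().Add(-2 * time.Minute), requests: 120}
	fmt.Println(c.shouldRecycle(time.Now())) // true: old enough and busy enough
}
```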
As a concrete case of what EricBurnett says, we have a service that has a "long" warmup (p99 30ms - you can think of it like loading a symbol table) but then very fast requests once that data is loaded (p99 about 400usec, fast specifically because it can exchange indexes into the shared symbol table and not full data).
I was hoping to see something about xDS. I've seen a lot of work being done on adding support for it in gRPC on git but (as usual with gRPC) the docs and blog posts seem to be rare and terse, so I haven't had a chance to learn it yet.
Also, does anybody know how gRPC is used at Google? Why do they support its development (other than GCP clients)? I'm under the impression Stubby is a completely different codebase and is far superior and fully integrated in their stack.
IIRC, right before I left Google at the end of 2019 you could be added to an allowlist to use gRPC internally. I believe they want to eventually replace Stubby with gRPC. Someone who still works at Google would have to comment on the current state of the world or if my memory is failing.
As for Stubby it is a completely different codebase and is tightly integrated with a lot of internal tooling and features.
One other thing I miss from Stubby C++ compared to gRPC is that async Stubby is super opinionated about its threading model and comes with batteries included, whereas I find the completion queue approach very hard to use.
Lastly, as far as load balancing goes: I think some load balancers with native gRPC support allow different servers to get different individual requests, instead of requiring all requests to go to the same server. Of course this only really works for unary RPC calls. But load balancing HTTP/2 isn't unique to gRPC, nor are keep-alive HTTP/1.1 connections. (Outside of the streaming RPCs of course, but there's an equivalent of websockets.)
Thank you! Do you know why they were looking to replace Stubby with gRPC? It doesn't seem to me that gRPC is quite customizable enough currently to hook in all the monitoring and tooling that I imagine Google has.
And yes, the completion queue approach is pretty cumbersome, would be very interesting to see how stubby went about it :P
> Do you know why they were looking to replace Stubby?
Likely to maintain less code long term. It's nice to have a primitive like this open source so you can use it everywhere (even on mobile). I'm not sure about the claim that gRPC isn't customizable enough; I think there are a lot of hooks in place. For example, OpenCensus does a lot of the instrumentation that you get for free at Google with Stubby, but for gRPC: https://opencensus.io/zpages/
Is there a name for the pattern where your client retrieves a subset of the available instances via service discovery, establishes connections to every instance in the subset, and spreads requests between them?
E.g. you have 200 clients connected to 100 servers, your service discovery sends each client a list of 10 servers, and 2000 connections are established instead of 20000.
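This is usually just called subsetting; the SRE book describes a "deterministic subsetting" variant, and Finagle's take is the "deterministic aperture" mentioned elsewhere in the thread. A simplified sketch of picking a stable per-client subset, using the 100-server / size-10 numbers above (a hypothetical helper, not what either of those systems literally does):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// subsetFor picks a stable subset of size k from backends for a given client,
// by shuffling the backend list with a seed derived from the client's identity.
// Real deterministic subsetting additionally groups clients into rounds so the
// number of clients per backend stays even; this sketch skips that step.
func subsetFor(clientID string, backends []string, k int) []string {
	h := fnv.New64a()
	h.Write([]byte(clientID))
	r := rand.New(rand.NewSource(int64(h.Sum64())))

	shuffled := append([]string(nil), backends...)
	r.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	if k > len(shuffled) {
		k = len(shuffled)
	}
	return shuffled[:k]
}

func main() {
	backends := make([]string, 100)
	for i := range backends {
		backends[i] = fmt.Sprintf("10.0.0.%d:50051", i)
	}
	// Each of the 200 clients connects to 10 backends: 2000 connections total.
	fmt.Println(subsetFor("client-42", backends, 10))
}
```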
"Partial mesh" is a good description, but it's mostly used to describe routing, not load balancing.
Well, it does exactly what op wrote: "your client retrieves a subset of the available instances via service discovery, establishes connections to every instance in the subset, and spreads requests between them"
The article you linked describes retrieving all instances and establishing connections to a subset of them, not retrieving a subset of them. To achieve deterministic apertures it further needs both the complete serverset and the complete peerset.
Not sure why you want to do this. In normal RPC over messaging you would use the competing consumers pattern which also solves the connection issues. But I don't think it exists in gRPC.
What would be the advantage of doing this at the service discovery level (filtering what the client sees) as opposed to the client level (filtering what the client uses)?
I haven't been in the app development space in ages, are people really exposing their service meshes to public clients? I have trouble imagining a case where external clients could cause sufficient load for this to matter, but also where relying on the client to do it right wouldn't be a DoS risk.
Nobody should be afraid of 20000 connections or even 20 billion connections. The cost of having a lot of TCP connections hanging around is wildly over-estimated. You should not even think about subsetting unless/until you can measure the cost in terms of CPU spent on keepalive messages or memory spent on connection state, or you're getting into millions of file descriptors on a single machine. Even if you are subsetting, very small subsets (like 10) are definitely harmful, and hiding the full set of candidate backends is also harmful. If you are subsetting, allow your clients to observe the global state and make their own decisions. What's it going to cost you? A few thousand IP addresses and port numbers in memory? That's basically nothing.
I also struggle to see the case the parent is trying to solve, but I would be concerned purely on a memory basis if my service opens even millions of unnecessary connections, let alone 20 billion.
Across what size of fleet? It's meaningless to say you'd be concerned about millions of connected sockets, except to the extent that it was a significant fraction of total system cost.
You've got at least the read and write buffers for every open connection to contend with. Let's be really conservative and clamp those down to 4KiB each. So that's 8KiB per connection.
Meaning that a rough lower bound on the memory usage under this scenario would be something like ~8GiB per million open connections, and ~8TiB per billion open connections.
You're right that it comes down to the extent to which that's a significant fraction of total system cost. I think I can say with reasonable confidence, though, that this would be a significant fraction of total system cost for the vast majority of systems.
Really? I'm not willing to let you wave your arms about that last conclusion. Suppose you had 100 frontends and 10000 backends, fully connected, for 1 million sockets, and you spent 8KiB of state per connection (I think that's a bit low, but I'm using your figure). That 8GiB would only be a significant fraction of total system memory (more than 1%) if the whole system had less than 100 * 1E6 * 8KiB == 800GiB, which, spread across your 10100 containers, means they'd each have only 80MiB of memory. That seems very unlikely.
A much more likely scenario is each of your 10100 things has at least 2-4GB of memory per CPU core, meaning at least 20-40TiB of memory in your system, meaning that the million connections cost you 0.02% of your system memory, or essentially nothing.
> A much more likely scenario is each of your 10100 things has at least 2-4GB of memory per CPU core
This is nearly two orders of magnitude off (250-500MB using 4-16 CPU cores) for some of our most connection-intense services. And as you said, 8KiB is extremely low; in practice it's more like 2-10x that once all buffering is taken into account. >5% of my memory spent on connection handling that (by the original assumption) I don't need - yes, that worries me.
That just doesn't make any sense. You can't even build a computer with 16 cores and 500MB of memory, so why would you have that ratio in production? This whole thread has been completely surreal.
> You can't even build a computer with 16 cores and 500MB of memory, so why would you have that ratio in production?
Because the service has no substantial local state, so it doesn't need any more than that? Should we invent unnecessary local state just so we can use more memory? That sounds considerably more surreal to me.
Or maybe instead we should use all the memory we saved to keep the 200-300GB working set of a different, colocated service in RAM - which is what we do...
I think the point of contention here is probably what a typical system looks like. I haven't personally had the pleasure of working on a system that's anywhere close to having 100 frontends and 10,000 backends, and I don't have many friends who have, either, so I assume that's an atypical use case. Perhaps I'm wrong there.
Then why would you have a million established connections? That's my whole point here: don't worry about the connection count until you have a concrete need to worry about it.
I think that's actually my whole point here. If I'm running a system on anywhere near the scale I'm used to operating, you damn well better believe that I'm going to worry about "20 billion connections." The concrete reason, which I can give a priori without even waiting for it to happen first, is that the best case scenario for what's going on is probably that the servers are being hugged to death.
That and we've violated the laws of physics. Because even by the numbers that you agreed are grossly optimistic (which was deliberate), we're looking at that consuming an order of magnitude or two more RAM than we have across all our servers. So that would admittedly be cool. But still.
Memory when using a workload scheduler can for sure be as low as 80MB; I believe it was set at 100MB for all containers handled by Nomad in a solution I used to work with. It would be insanely expensive to build solutions with your specs just to solve a problem that should be solved by redesigning for fewer connections.
I'm not an expert on this, but I think what's costly is storing and communicating information about the load of each server a client is connected to, not the overhead of TCP.
> The official implementation uses per-call load balancing and not a per-connection one, so each call is load balanced separately. This is the ideal and desired case, and it avoids heavy sticky connections.
I wonder how much of this is just leaking from how Stubby & GSLB are done internally at Google. I'll admit, I don't know much about gRPC.
Separate point: if you control both the clients and the servers completely, you can have servers respond with load shedding signals if they start to get overloaded, and use that as a hint for the client to go look elsewhere. This doesn't solve all problems, but it does make some cases easier to handle. Also it is obviously not a viable solution if you don't fully control all clients - some of them will just not respect your load shedding and continue to beat an individual server into the ground.
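A minimal grpc-go sketch of that server-side signal, assuming unary RPCs only; the in-flight threshold is arbitrary, and real implementations usually shed based on queue depth or latency rather than a raw counter:

```go
package main

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// sheddingInterceptor rejects new work once too many RPCs are already in
// flight, returning RESOURCE_EXHAUSTED so well-behaved clients can back off
// or retry against another backend.
func sheddingInterceptor(maxInFlight int64) grpc.UnaryServerInterceptor {
	var inFlight int64
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		if atomic.AddInt64(&inFlight, 1) > maxInFlight {
			atomic.AddInt64(&inFlight, -1)
			return nil, status.Error(codes.ResourceExhausted, "overloaded, try another backend")
		}
		defer atomic.AddInt64(&inFlight, -1)
		return handler(ctx, req)
	}
}

func main() {
	// Register services on srv and call srv.Serve(...) as usual.
	srv := grpc.NewServer(grpc.UnaryInterceptor(sheddingInterceptor(1000)))
	_ = srv
}
```

Clients that respect RESOURCE_EXHAUSTED can back off or pick another backend; the ones that don't are the "beat an individual server into the ground" case above.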
I'm not even sure the problem described in the article is specific to gRPC. I think it's an HTTP/2 problem.
Any sane HTTP/2 client making N queries to the same host/IP should ideally use 1 connection, since h2 supports multiplexing (which is bad for load balancing if you have an L4 LB). So even if you don't use gRPC, consider putting a per-call (L7) LB in front of your app if you use HTTP/2 between your apps.
Nice read! I can add that, in my experience, using DNS for load balancing isn't the best idea because of the complexities of DNS caching. Caching happens at DNS resolvers and at the OS level, but some clients do it too, which can be tricky to watch for. Basically you might end up with the IP address of a failed server and no way of retrying until the TTL of the DNS record expires.
The article doesn't mention the one big con of look-aside load balancing: in a partition scenario, the load director may be able to contact backends that the client can't, so the client gets bad advice from the load director. You need to think about fallbacks and how clients can operate autonomously in this scenario.
We have lots of client instances communicating with other server instances over ZMQ PubSub via a network load balancer. We find that periodically recreating the client-side connection works fine for us.
HTTP2 is not a suitable protocol for implementing persistent, stateful communication channels. HTTP2 is bidirectional but it's stateless. gRPC is bidirectional but intended to be stateful. WebSockets (bidirectional and stateful) would have been a better match and not required sticky sessions. gRPC essentially takes us back to the dark ages of long polling.
I was making this argument for years. WebSockets was designed for bidirectional data exchange between client and server. On the other hand, HTTP2 was designed for pushing static resources to the client to speed up loading of front ends (it was meant to make bundling front-end scripts redundant). HTTP2 was not designed for bidirectional data exchange over a stateful channel.
Can you be more specific? What technical drawbacks are relevant? It's certainly not literally polling any more than a websocket so what do you mean? Long running websockets are also hard to balance for the exact same reasons. Much harder in fact because you can't use a per request LB.
It's not sticky sessions in the traditional sense. It's a single connection. Http/1.1 requests are allowed to be stateful within a single connection as well. The difference is there is built support for multiplexed requests within that single connection.
Yeah, Google's L7 load balancers have supported gRPC since forever, but I'm also surprised to hear AWS and Cloudflare only recently added gRPC support. Maybe the author only uses AWS, so that's why only AWS is relevant to them. OSS proxies and gateways like Envoy had it from the get-go.
I really wonder how one gets the power to downvote on HN. What is the downvote for? Is this a gRPC-specific problem? Because I certainly work with LBs that are set to time out at 4 minutes, and most of the backend clients are set to drop sockets after 120s. This causes challenges for all traffic.