
This is a good survey of options with pros and cons. While you're at it, please spend some time understanding failure modes and, more importantly, recovery from failures.

Let me describe a scenario my team ran into while using option #3, i.e., Look-Aside. We used etcd as the service discovery store. Servers, as part of their startup, would register with etcd. Clients would query etcd and randomly pick a server to talk to. Now, for some reason, the entire service fleet got isolated. A couple of hours later the network partition healed just fine, but by then all the servers had shut down.
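
For concreteness, the client side of that look-aside setup looked roughly like this (a sketch using the Go etcd v3 client; the key prefix is made up):

    package lookaside

    import (
        "context"
        "errors"
        "math/rand"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    // pickBackend is the client side of the look-aside pattern described
    // above: list whatever servers have registered themselves under a
    // prefix and pick one at random. The key prefix is hypothetical.
    func pickBackend(ctx context.Context, cli *clientv3.Client) (string, error) {
        resp, err := cli.Get(ctx, "/services/rides/", clientv3.WithPrefix())
        if err != nil {
            return "", err
        }
        if len(resp.Kvs) == 0 {
            return "", errors.New("no backends registered")
        }
        // Every client chooses independently, so the first server to
        // re-register after an outage meets the entire herd at once.
        kv := resp.Kvs[rand.Intn(len(resp.Kvs))]
        return string(kv.Value), nil
    }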

The real pain started when we began bringing up servers. As soon as a server was brought up, it would register with etcd, immediately get bombarded by hundreds of clients, and then promptly die. No matter what permutation we tried, servers would just refuse to come up because of the thundering herd. The only option was to shut down the client app (which was another fiasco, because we hadn't built a clean way to do it), bring up all the servers, and then gingerly bring up clients, hoping they wouldn't kill the servers.

The downtime lasted for almost a day. And this is a largish app-based ride-hailing business I'm talking about, so you can imagine the shitstorm it created among customers and investors alike.

A key lesson for us was to isolate service startup from load-balancer registration. A service should not be responsible for registering itself with a load balancer; that should be someone else's job.
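
A sketch of what "someone else does it" can look like: a registrar sidecar or deploy hook that waits for the server's gRPC health check to report SERVING and only then writes the etcd key, under a lease so the entry expires if the registrar dies. Addresses, key names and the TTL here are all made up:

    package registrar

    import (
        "context"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        healthpb "google.golang.org/grpc/health/grpc_health_v1"
    )

    // registerWhenReady runs outside the server process. It polls the
    // server's gRPC health endpoint and only registers the address in etcd
    // once the server says it is actually ready to take traffic.
    func registerWhenReady(ctx context.Context, etcd *clientv3.Client, addr string) error {
        conn, err := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
        if err != nil {
            return err
        }
        defer conn.Close()
        hc := healthpb.NewHealthClient(conn)

        for {
            resp, err := hc.Check(ctx, &healthpb.HealthCheckRequest{})
            if err == nil && resp.Status == healthpb.HealthCheckResponse_SERVING {
                break
            }
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(2 * time.Second):
            }
        }

        // Register under a lease; if we stop renewing it, the key expires
        // and clients stop seeing this backend.
        lease, err := etcd.Grant(ctx, 15)
        if err != nil {
            return err
        }
        if _, err := etcd.Put(ctx, "/services/rides/"+addr, addr, clientv3.WithLease(lease.ID)); err != nil {
            return err
        }
        // In real code you'd also drain the keep-alive channel.
        _, err = etcd.KeepAlive(ctx, lease.ID)
        return err
    }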

While gRPC does have lots of positives, it still has some way to go before it reaches operational maturity. People will discover these deficiencies in a painful manner.



The "thundering herd" problem implies a misconfiguration - surely you can crank down the accept limit on the server? This is a problem that Apache spotted in the 90s and provided tuning parameters to mitigate.
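
In grpc-go at least you can approximate those Apache-style knobs at the listener and connection level; a rough sketch (the limits are arbitrary, and other language runtimes differ):

    package main

    import (
        "log"
        "net"

        "golang.org/x/net/netutil"
        "google.golang.org/grpc"
    )

    func main() {
        lis, err := net.Listen("tcp", ":50051")
        if err != nil {
            log.Fatal(err)
        }
        // Cap simultaneously accepted connections so a reconnect storm
        // queues in the kernel backlog instead of overwhelming the process.
        lis = netutil.LimitListener(lis, 500)

        srv := grpc.NewServer(
            // Also cap concurrent streams per HTTP/2 connection.
            grpc.MaxConcurrentStreams(100),
        )
        // ...register services here...
        if err := srv.Serve(lis); err != nil {
            log.Fatal(err)
        }
    }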


As with so many things gRPC, they've had an open issue about it for years, but it doesn't seem to be going anywhere: https://github.com/grpc/grpc-java/issues/1886

I like gRPC overall, but it's definitely got some rough spots like this. I'm fortunate enough to be using it at a small enough scale that they don't really hinder me.

My sense, based on the multitude of GitHub issues like this, is that the project itself is thrashing. They've got a relatively small core team that seems to be micro-managing all the official implementations, largely out of a desire to maintain cross-language compatibility. But the cross-language compatibility is actually rather poorer than you'd expect, because they're trying to maintain it by directly micro-managing the implementations, which turns every feature design project into a complicated n-body problem.

It would be nice if they could just publish a comprehensive formal spec, and let the language implementations follow it. Then each (official) language would have only one external thing to track, instead of ten.


+1. To add, the lack of reverse proxy support back then made it worse. So things like rate limiting, max connections, timeouts, etc., that we take for granted in Nginx or HAProxy had to be either hand-coded or done in some roundabout manner.

This is what I meant by "operational maturity". A big factor in a new tech gaining adoption is the ease with which it fits into the existing ecosystem. For gRPC that means load balancers, reverse proxies, deployments, and so forth. I'm sure gRPC will get there, depending on how fast it's adopted in big tech companies, but I suspect it's not there yet.


HAProxy has been able to load balance gRPC since 2.0. Here is the link to the documentation for the current 2.4 version.

http://cbonte.github.io/haproxy-dconv/2.4/configuration.html...
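
For reference, a minimal end-to-end HTTP/2 pass-through for gRPC looks roughly like this (addresses, ports and cert paths are placeholders):

    frontend grpc_in
        mode http
        bind :443 ssl crt /etc/haproxy/cert.pem alpn h2
        default_backend grpc_servers

    backend grpc_servers
        mode http
        balance roundrobin
        server srv1 10.0.0.1:50051 check proto h2
        server srv2 10.0.0.2:50051 check proto h2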


It's good you learned that lesson about having someone else announce the server, but it isn't always that simple. With things like Apache Aurora, jobs would often be announced as soon as the process started. With something like Finagle, it tends to announce as soon as the component is initialized, instead of waiting until the server is fully initialized and ready to handle requests.

Something I implemented at my last job, and that others rediscovered at my current job, is an "administratively up/down" API as part of the control plane, with the server only announced when it is "up." Decoupling the announcement from process start/initialization-complete allowed us to roll out new versions of software in a disabled state and then "flip the switch" (red/black deployments). It also enabled us to take individual instances out of service without killing them, letting developers debug issues/anomalies more easily.
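
One way to wire that up is with gRPC's standard health service, starting NOT_SERVING and flipping it from a control-plane call; a sketch (the admin HTTP endpoints and ports are invented, and your announcer has to actually respect the health status):

    package main

    import (
        "log"
        "net"
        "net/http"

        "google.golang.org/grpc"
        "google.golang.org/grpc/health"
        healthpb "google.golang.org/grpc/health/grpc_health_v1"
    )

    func main() {
        srv := grpc.NewServer()
        h := health.NewServer()
        healthpb.RegisterHealthServer(srv, h)

        // Start administratively DOWN: the process is running and
        // debuggable, but nothing that watches the health service will
        // send it traffic until we flip the switch.
        h.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)

        // Hypothetical control-plane endpoints to flip the switch.
        http.HandleFunc("/admin/up", func(w http.ResponseWriter, r *http.Request) {
            h.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
        })
        http.HandleFunc("/admin/down", func(w http.ResponseWriter, r *http.Request) {
            h.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
        })
        go http.ListenAndServe(":8080", nil)

        lis, err := net.Listen("tcp", ":50051")
        if err != nil {
            log.Fatal(err)
        }
        // ...register application services here...
        if err := srv.Serve(lis); err != nil {
            log.Fatal(err)
        }
    }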

Load shedding/backpressure/rate limiting at various layers is also extremely helpful, whether at the load balancer/API gateway or at individual servers. That has saved our bacon numerous times.
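
At the individual-server layer, the simplest useful version of that is a token-bucket interceptor that fails fast instead of queueing; a sketch in Go (limits are illustrative):

    package ratelimit

    import (
        "context"

        "golang.org/x/time/rate"
        "google.golang.org/grpc"
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // rateLimitUnary sheds load at the server itself: requests beyond the
    // limiter's budget fail fast with RESOURCE_EXHAUSTED instead of piling up.
    func rateLimitUnary(l *rate.Limiter) grpc.UnaryServerInterceptor {
        return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
            if !l.Allow() {
                return nil, status.Error(codes.ResourceExhausted, "server is shedding load, retry later")
            }
            return handler(ctx, req)
        }
    }

    // Usage:
    //   limiter := rate.NewLimiter(rate.Limit(1000), 200) // 1000 req/s, burst 200
    //   srv := grpc.NewServer(grpc.ChainUnaryInterceptor(rateLimitUnary(limiter)))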


Maybe you could bring up a bunch of fake servers that "implement" every RPC call with a long sleep, or maybe by responding with an internal error. Then take them down gradually as the real ones come up.
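
That decoy is cheap to build with gRPC's unknown-service handler, which catches any method name; a sketch (port and timing made up; UNAVAILABLE is used here since clients usually treat it as retryable, but INTERNAL works too):

    package main

    import (
        "log"
        "net"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    func main() {
        // Accept any method via the unknown-service handler, stall briefly,
        // then fail so well-behaved clients back off and retry elsewhere.
        srv := grpc.NewServer(grpc.UnknownServiceHandler(
            func(srv interface{}, stream grpc.ServerStream) error {
                time.Sleep(2 * time.Second) // soak up a bit of the stampede
                return status.Error(codes.Unavailable, "backend warming up, try again")
            },
        ))
        lis, err := net.Listen("tcp", ":50051")
        if err != nil {
            log.Fatal(err)
        }
        if err := srv.Serve(lis); err != nil {
            log.Fatal(err)
        }
    }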


I think what you really want to do is not give every client a full view of all the backends. xDS lets you write a service discovery server that meets this condition (it knows the full state, potentially with health information about upstreams, and it knows which client is connecting, so you can adjust this as you see fit). I've also seen people do AZ or regional aggregation, i.e. given some consumer in AZ A, the consumer gets a list of endpoints like, 0.upstream.a, 1.upstream.a, ..., regional-aggregator.B, regional-aggregator.C, etc. It sees all the endpoints in the same AZ, but goes through a proxy to get to other regions/zones. Under non-panic circumstances, you'd want all requests to be served from the same node, then the same zone, and only go to other regions in degraded cases.
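
A crude way to get that effect even without xDS is deterministic subsetting, where each client derives a stable fraction of the backend list; a sketch (hash choice and subset size are arbitrary):

    package subset

    import (
        "hash/fnv"
        "sort"
    )

    // subsetFor gives each client a stable slice of the backend list instead
    // of the full view, so N clients x M servers doesn't become N*M
    // connections and a re-registering server only meets a fraction of the
    // herd. Note that it sorts the input slice in place.
    func subsetFor(clientID string, backends []string, subsetSize int) []string {
        if len(backends) <= subsetSize {
            return backends
        }
        sort.Strings(backends) // stable ordering across clients

        h := fnv.New32a()
        h.Write([]byte(clientID))
        start := int(h.Sum32() % uint32(len(backends)))

        out := make([]string, 0, subsetSize)
        for i := 0; i < subsetSize; i++ {
            out = append(out, backends[(start+i)%len(backends)])
        }
        return out
    }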

I don't know what the state of the art around tooling to manage this is. For some reason, I suspect that the service meshes punt on this, because N * M isn't a problem in the demo environments where these systems spend most of their time. Meanwhile, the big companies that hit the scalability limit of N * M connections across upstream/downstream pairs wrote their own service discovery stuff decades ago.


That is pretty clever.




