It seems that they recreated a service mesh. Running Istio or Consul Connect takes care of the vast majority of issues listed in this post: encryption, identity, access control. And it does so transparently for the developers (no modification of the code...)
Isn't Istio implemented mostly as a sidecar container (Envoy proxy), though? The article mentions they are running containers via their Tupperware orchestrator. If they are largely running containerized, where is the scaling issue with adding sidecars to implement the service mesh? I don't have any experience with Istio, but I'm genuinely curious along which axis it (or Connect, Linkerd, etc.) doesn't scale.
Envoy is so slow that deployment at this scale would be too costly or, if you could afford it, would immediately present itself as a huge opportunity for cost reduction. People who are measuring their tail latency in microseconds aren't going to tolerate Envoy's marginal latency, which will be milliseconds even at the median.
Indeed, Istio is very CPU heavy and probably not great for cloud use on a truly large scale, just due to cost concerns. Shopify found it uses 50% more CPU than an alternative like Linkerd [1]. Istio also incurs much higher latency and lower throughput than Linkerd [2]. (To be clear, I don't have a dog in this Istio vs. Linkerd fight; those are just some recent benchmarks I'm aware of.)
Interesting, I wasn't aware that Istio has such performance issues. Isn't Google using this as well, though, or at least an internal version of it? Surely they are on the same scale as FB.
I'm curious as to what the cause of the latency is. TLS handshakes?