> After several days, we finally narrowed down the issue to a bad Advanced Vector Extensions (AVX) instruction on a single CPU in our fleet
This isn't even the first time I've heard of an issue at FB being caused by a single bad CPU instruction. Working at a scale where "Problem X is a one-in-a-million edge case" and "Problem X happens several times per day" are synonymous is weird...
Sure. I didn't figure this out myself, but I was relatively close to the investigation. My team provided a lot of the data that was ultimately used to reach the root cause.
My team and extended teams managed almost 200,000 network devices, spread all over the world, most of which were Cisco, and most of which were installed in stores. Most of the switch ports were connected to customer-facing point-of-sale devices, among them employee-facing registers and customer-facing card scanners. That is, the devices you interact with whenever you scan your card to pay for something.
With that many devices in that many locations in a largely unmanaged environment (the switches would be installed all over the store, often in the ceiling, and many of them experienced extreme temperatures), there was a constant stream of failures. The process for managing these failures was optimized, streamlined, and largely automated.
However, it was discovered that switches were failing far more frequently in the northern Midwestern US than elsewhere, and then only in the winter.
So this wasn't a really big operational issue, but it had a substantial cost impact, and the rate was high enough that a lot of the affected stores did notice and were complaining.
Right. Very strange, very mysterious.
So, briefly, the root cause:
Apparently, people in the upper Midwest wear wool to stay warm far more often than people in other cold places, specifically the US Northeast. And much of the time, the humidity there is quite low. So you have a lot of people wearing a lot of wool in low-humidity air. These people generated a lot of static, which they would all too often discharge while interacting with the customer-facing point-of-sale device. And, all too frequently, that pulse of static would flow all the way back to the switch, often killing it.
I didn't follow the subsequent remediation efforts, so I don't know what if anything was done about that.
I suspect it would not have been received as that big a deal to management or to anyone else. In those halcyon days, we were running into and usually solving all kinds of such edgy, extreme scale problems. It was a lot of work, but a hell of a lot of fun too.
If you are interested in something like that, I would recommend listening to this podcast [0]. One of the stories, as I remember it, was that a `delete table` command had apparently been run, but it turned out the processor had a flipped bit that mapped the `create table` command to `delete table`. It's really, really interesting.
Yes. For that particular CPU, on one core (and its sibling hyperthread), an AVX instruction is broken. It doesn't do what it is supposed to do.
Recently, again at Facebook, we found a few machines with CPUs where the ADCX instruction is broken on at least one core. This is especially fatal because OpenSSL and Fizz (our TLS 1.3 implementation) use these instructions in their RSA implementations on the AMD64 architecture.
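For anyone who wants to poke at their own machines: below is a minimal sketch (my own, nothing to do with FB's actual burn-in tooling) that pins itself to each logical CPU in turn and compares `_addcarryx_u64` (which GCC/Clang compile to ADCX when built with `-madx`; worth confirming in the disassembly) against plain 128-bit arithmetic. A core with a broken ADCX should eventually disagree with the reference:

```c
/* adx_check.c -- build with: gcc -O2 -madx adx_check.c -o adx_check
 * Hypothetical per-core ADCX sanity check; a real burn-in would run far
 * longer and exercise many more instruction sequences. */
#define _GNU_SOURCE
#include <immintrin.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Reference add-with-carry computed with plain 128-bit arithmetic. */
static unsigned char ref_adc(unsigned char c, uint64_t a, uint64_t b,
                             uint64_t *out) {
    __uint128_t s = (__uint128_t)a + b + c;
    *out = (uint64_t)s;
    return (unsigned char)(s >> 64);
}

/* Push a stream of pseudo-random inputs through the ADX intrinsic and
 * compare sums and carry-outs against the reference. */
static int core_is_sane(void) {
    uint64_t x = 0x9e3779b97f4a7c15ULL; /* arbitrary seed */
    for (int i = 0; i < 10000000; i++) {
        uint64_t a = x, b = x ^ (x >> 31), want;
        unsigned long long got;
        x = x * 6364136223846793005ULL + 1442695040888963407ULL; /* LCG step */
        unsigned char cg = _addcarryx_u64(i & 1, a, b, &got);
        unsigned char cw = ref_adc(i & 1, a, b, &want);
        if (cg != cw || got != want)
            return 0;
    }
    return 1;
}

int main(void) {
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    for (long cpu = 0; cpu < ncpu; cpu++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)cpu, &set);
        /* Pin to one logical CPU so a failure is attributable to that core. */
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            continue; /* core offline or not permitted */
        printf("cpu %ld: %s\n", cpu, core_is_sane() ? "ok" : "ADCX MISMATCH");
    }
    return 0;
}
```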
Without going into specifics: we deploy new machines in batches of tens of thousands at a time, and they receive practically no production traffic for weeks while repeatedly running burn-in tests followed by baseline daemons. If they survive, they get provisioned for production.
The infant mortality rate of CPUs--especially single-socket designs--is very low compared to DIMMs and SSDs; CPUs tend to develop these issues later in their economic lives. CPU failures are also comparatively rare relative to other module failures.
It's always fun to have the space in your schedule to drill down to that level of issue. TLS is pretty useful in that way, as you can catch CPU issues as transport layer errors instead of way down in your application with strange decoding issues.
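To make that concrete: one cheap way to turn a silent miscomputation into a local, attributable error is a pairwise sign-then-verify self-check (a sketch of the general idea, not something I know FB's stack to do; requires OpenSSL 3.0 for `EVP_RSA_gen`). A core with a broken carry chain will typically produce an RSA signature that fails its own verification, so you catch it on the box that computed it instead of as a mysterious handshake failure on the peer:

```c
/* sign_check.c -- build with: gcc sign_check.c -o sign_check -lcrypto */
#include <openssl/evp.h>
#include <openssl/rsa.h>
#include <stdio.h>

int main(void) {
    /* Throwaway 2048-bit RSA key just for the demonstration. */
    EVP_PKEY *key = EVP_RSA_gen(2048);
    const unsigned char msg[] = "handshake transcript (stand-in)";
    unsigned char sig[512];
    size_t siglen = sizeof(sig);

    /* Sign... */
    EVP_MD_CTX *sctx = EVP_MD_CTX_new();
    EVP_DigestSignInit(sctx, NULL, EVP_sha256(), NULL, key);
    EVP_DigestSign(sctx, sig, &siglen, msg, sizeof(msg) - 1);

    /* ...then immediately verify our own signature before it leaves the box. */
    EVP_MD_CTX *vctx = EVP_MD_CTX_new();
    EVP_DigestVerifyInit(vctx, NULL, EVP_sha256(), NULL, key);
    int ok = EVP_DigestVerify(vctx, sig, siglen, msg, sizeof(msg) - 1);
    printf("%s\n", ok == 1 ? "self-check ok"
                           : "self-check FAILED: suspect this core");

    EVP_MD_CTX_free(sctx);
    EVP_MD_CTX_free(vctx);
    EVP_PKEY_free(key);
    return ok == 1 ? 0 : 1;
}
```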
Does anyone know if they use encryption and access control to granularly regulate access to data? For example, the part of the system that feeds data to advertisements shouldn’t have access to my private messages (in my view that would be a huge breach of trust with users.)
Ya, they're similar in that they're all signed blobs of data, but different in the sense that they're specifically designed to send authentication information via several layers of proxies.
I'm actually interested in this subject, so I'll check out your links when I get a chance. At first sight this sounds like wrapping tokens, or third-party caveats in Macaroons.
If you don't mind, I wouldn't necessarily agree with the comment about JWT by Yueting. JWT is just a format; querying the backend to get a new token is not necessary (that's just how people often use them). I actually built a small PoC that mints new JWTs on the client side (in the browser), signing them with a non-exportable key (through WebCrypto).
As for Macaroons, I believe they could also be adjusted to resemble CATs as I understand them (with layers for different services). I do have other issues with Macaroons, though (https://news.ycombinator.com/item?id=17878845)...
Borg != Kubernetes. K8s is based on Borg, but they are quite different.
Companies at this scale have integrations between all different levels and layers of the 'stack' that make the use of off the shelf software difficult or impossible.
This would make a great comparison. I'm not certain whether or not K8s's mutual auth supports session ticket resumption and the distribution of short-lived ticket keys. The ticket rotation design would probably make a great addition to K8s. There are a lot of intricate details in the design that can make a major difference, not only in performance but also in whether or not the system wakes you up at night.
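For reference, here's roughly what rotating short-lived ticket keys looks like at the OpenSSL level. This is my sketch: the `current_key`/`previous_key` scheme and rotation policy are made up, but the callback contract is OpenSSL's pre-3.0 `SSL_CTX_set_tlsext_ticket_key_cb`. The important detail is returning 2 when a ticket was minted under an older key, which tells OpenSSL to re-issue a fresh ticket under the current key rather than forcing a full handshake:

```c
#include <openssl/hmac.h>
#include <openssl/rand.h>
#include <openssl/ssl.h>
#include <string.h>

struct ticket_key {
    unsigned char name[16], hmac_key[32], aes_key[32];
};

/* In a large deployment these would be fetched from a central service and
 * rotated every few hours; here they are just process-global placeholders. */
static struct ticket_key current_key, previous_key;

static int ticket_cb(SSL *s, unsigned char key_name[16], unsigned char *iv,
                     EVP_CIPHER_CTX *ctx, HMAC_CTX *hctx, int enc) {
    (void)s;
    if (enc) { /* minting a new ticket: always use the current key */
        if (RAND_bytes(iv, EVP_MAX_IV_LENGTH) != 1)
            return -1;
        memcpy(key_name, current_key.name, 16);
        EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL,
                           current_key.aes_key, iv);
        HMAC_Init_ex(hctx, current_key.hmac_key, 32, EVP_sha256(), NULL);
        return 1;
    }
    /* decrypting a presented ticket: accept the current and previous keys */
    struct ticket_key *k = NULL;
    if (memcmp(key_name, current_key.name, 16) == 0)
        k = &current_key;
    else if (memcmp(key_name, previous_key.name, 16) == 0)
        k = &previous_key;
    if (k == NULL)
        return 0; /* unknown key name: fall back to a full handshake */
    HMAC_Init_ex(hctx, k->hmac_key, 32, EVP_sha256(), NULL);
    EVP_DecryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, k->aes_key, iv);
    /* 2 = ticket is valid but stale; re-issue under the current key */
    return (k == &previous_key) ? 2 : 1;
}

/* Installed with: SSL_CTX_set_tlsext_ticket_key_cb(ssl_ctx, ticket_cb); */
```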
"After several years of trying to manage these issues with Kerberos, we decided to redesign the system from the ground up.."
I am curious: has anyone gone in the opposite direction and implemented a new Kerberos library or setup?
So, several places, such as Stanford or Morgan Stanley, have sophisticated Kerberos setups using something like Russ Allbery's Wallet[0] (Stanford) or Roland Dowdeswell's OSKT[1] (Morgan Stanley, Two Sigma) stack.
For example, OSKT is a self-service toolkit that lets users build up access controls for clusters, "role accounts" (user accounts for running application automation), and whatnot. Users use krb5_prestash to indicate which hosts should have which role accounts' credentials (the user must own the hosts and role accounts), and krb5_keytab to get keys for services on hosts they are allowed to run them on. A nifty trick is to have wildcard DNS A RRs for hosts so that one can have HTTP/${USER}.$(uname -n) principals (and keys for them) on any host the user can log in to.
All of this is high-performance and self-service. Users don't need to file JIRA tickets or whatever to get their keys for their services.
Self-service credential provisioning is absolutely essential to successful deployment at scale of any authentication system one uses, whether that be Kerberos or PKIX or DANE or anything else one might find or invent.
That actually sounds about right from my perspective (and experience). Out of curiosity, why do you think that is a lot?
I would go a bit further and say most companies should only proceed with a microservice architecture if they have sufficient scale and automation such that decomposing their architecture will result in at least a high double-digit number of discrete services.
I think I was reading microservices but I was imagining services and that's why it seemed like a lot. I guess it depends on the granularity of the decomposition.
I agree, scale and automation are two important factors when decomposing architectures. I think it would be really valuable if systems could decompose themselves to some degree, based on scale and other factors, without much of an operator's intervention.
It seems that they recreated a service mesh. Running Istio or Consul Connect takes care of the vast majority of the issues listed in this post: encryption, identity, access control. And it's even transparent for developers (no modification of the code...).
Isn't Istio implemented mostly as a sidecar container (Envoy proxy), though? The article mentions they are running containers via their Tupperware orchestrator. If they are largely running containerized, where is the scaling issue with adding sidecars to implement the service mesh? I don't have any experience with Istio, but I'm genuinely curious along which axis it (or Connect, Linkerd, etc.) doesn't scale.
Envoy is so slow that deployment at this scale would be too costly, or if you could afford it would immediately present itself as a huge opportunity for cost reduction. People who are measuring their tail latency in microseconds aren't going to tolerate Envoy's marginal latency, which will be milliseconds even at the median.
Indeed, Istio is very CPU-heavy and probably not great for cloud use on a truly large scale just due to cost concerns. Shopify found it uses 50% more CPU than an alternative like Linkerd [1]. Istio also incurs much higher latency and lower throughput than Linkerd [2]. (To be clear, I don't have a dog in this Istio vs Linkerd fight; those are just some recent benchmarks I'm aware of.)
Interesting, I wasn't aware that Istio had such performance issues. Isn't Google using it as well, though, or at least an internal version of it? Surely they are at the same scale as FB.
I'm curious as to what the cause of the latency is. TLS handshakes?
Is this an engineering article to help prime the pump for discussing FB's approach to taking privacy more seriously (e.g., Zuckerberg's "the future is private")? The article does not explicitly state any connection to such larger FB company and product developments, but it made me think it's connected in some way.