> After several days, we finally narrowed down the issue to a bad Advanced Vector Extensions (AVX) instruction on a single CPU in our fleet
This isn't even the first time I've heard of an issue at FB being caused by a single bad CPU instruction. Working at a scale where "Problem X is a one-in-a-million edge case" and "Problem X happens several times per day" are synonymous is weird...
Sure. I didn't figure this out myself, but I was relatively close to the investigation. My team provided a lot of the data that was ultimately used to reach the root cause.
My team and extended teams managed almost 200,000 network devices, spread all over the world, most of which were Cisco, and most of which were installed in stores. Most of the switch ports were connected to customer-facing point-of-sale devices, among them employee-facing registers and customer-facing card scanners. That is, the devices you interact with whenever you scan your card to pay for something.
With that many devices in that many locations in a largely unmanaged environment (the switches would be installed all over the store, often in the ceiling, and many of them experienced extreme temperatures), there was a constant stream of failures. The process for managing these failures was optimized, streamlined, and largely automated.
However, it was discovered that switches were failing far more frequently in the northern Midwestern US than elsewhere, and then only in the winter.
So this wasn't a really big operational issue, but it had a substantial cost impact, and the rate was high enough that a lot of the affected stores did notice and were complaining.
Right. Very strange, very mysterious.
So, briefly, the root cause:
Apparently, people in the upper Midwest wear wool to stay warm far more often than people in other cold places, specifically the US Northeast. And much of the time, the humidity there is quite low. So you have a lot of people wearing a lot of wool in low-humidity air. These people generated a lot of static, which they would all too often discharge while interacting with the customer-facing point-of-sale device. And, all too frequently, that pulse of static would flow all the way back to the switch, often killing it.
I didn't follow the subsequent remediation efforts, so I don't know what if anything was done about that.
I suspect it would not have been received as that big a deal to management or to anyone else. In those halcyon days, we were running into and usually solving all kinds of such edgy, extreme scale problems. It was a lot of work, but a hell of a lot of fun too.
If you are interested in something like that, I would recommend listening to this podcast [0]. One of the stories, as I remember it, was that a `delete table` command had apparently been run, but it turned out the processor had a flipped bit that mapped the `create table` command to `delete table`. It's really, really interesting.
Yes. For that particular CPU, on one core (and its sibling hyperthread), an AVX instruction is broken. It doesn't do what it is supposed to do.
Recently, again at Facebook, we found a few machines with CPUs where the ADCX instruction is broken on at least one core. This is especially fatal because OpenSSL and Fizz (our TLS 1.3 implementation) use these instructions in their RSA implementations on the AMD64 architecture.
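For anyone who wants to poke at their own machines: below is a minimal sketch (my own, nothing to do with FB's actual burn-in tooling) that pins itself to each logical CPU in turn and compares `_addcarryx_u64` (which GCC/Clang compile to ADCX when built with `-madx`; worth confirming in the disassembly) against plain 128-bit arithmetic. A core with a broken ADCX should eventually disagree with the reference:

```c
/* adx_check.c -- build with: gcc -O2 -madx adx_check.c -o adx_check
 * Hypothetical per-core ADCX sanity check; a real burn-in would run far
 * longer and exercise many more instruction sequences. */
#define _GNU_SOURCE
#include <immintrin.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Reference add-with-carry computed with plain 128-bit arithmetic. */
static unsigned char ref_adc(unsigned char c, uint64_t a, uint64_t b,
                             uint64_t *out) {
    __uint128_t s = (__uint128_t)a + b + c;
    *out = (uint64_t)s;
    return (unsigned char)(s >> 64);
}

/* Push a stream of pseudo-random inputs through the ADX intrinsic and
 * compare sums and carry-outs against the reference. */
static int core_is_sane(void) {
    uint64_t x = 0x9e3779b97f4a7c15ULL; /* arbitrary seed */
    for (int i = 0; i < 10000000; i++) {
        uint64_t a = x, b = x ^ (x >> 31), want;
        unsigned long long got;
        x = x * 6364136223846793005ULL + 1442695040888963407ULL; /* LCG step */
        unsigned char cg = _addcarryx_u64(i & 1, a, b, &got);
        unsigned char cw = ref_adc(i & 1, a, b, &want);
        if (cg != cw || got != want)
            return 0;
    }
    return 1;
}

int main(void) {
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    for (long cpu = 0; cpu < ncpu; cpu++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)cpu, &set);
        /* Pin to one logical CPU so a failure is attributable to that core. */
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            continue; /* core offline or not permitted */
        printf("cpu %ld: %s\n", cpu, core_is_sane() ? "ok" : "ADCX MISMATCH");
    }
    return 0;
}
```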
Without going into specifics: we deploy new machines in batches of tens of thousands at a time, and they receive practically no production traffic for weeks while repeatedly running burn-in tests followed by baseline daemons. If they survive, they get provisioned for production.
The infant mortality rate of CPUs--especially single-socket designs--is very low compared to DIMMs and SSDs; CPUs tend to develop these issues later in their economic lives. CPU failures are also comparatively rare relative to other module failures.
It's always fun to have the space in your schedule to drill down to that level of issue. TLS is pretty useful in that way, as you can catch CPU issues as transport layer errors instead of way down in your application with strange decoding issues.
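To make that concrete: one cheap way to turn a silent miscomputation into a local, attributable error is a pairwise sign-then-verify self-check (a sketch of the general idea, not something I know FB's stack to do; requires OpenSSL 3.0 for `EVP_RSA_gen`). A core with a broken carry chain will typically produce an RSA signature that fails its own verification, so you catch it on the box that computed it instead of as a mysterious handshake failure on the peer:

```c
/* sign_check.c -- build with: gcc sign_check.c -o sign_check -lcrypto */
#include <openssl/evp.h>
#include <openssl/rsa.h>
#include <stdio.h>

int main(void) {
    /* Throwaway 2048-bit RSA key just for the demonstration. */
    EVP_PKEY *key = EVP_RSA_gen(2048);
    const unsigned char msg[] = "handshake transcript (stand-in)";
    unsigned char sig[512];
    size_t siglen = sizeof(sig);

    /* Sign... */
    EVP_MD_CTX *sctx = EVP_MD_CTX_new();
    EVP_DigestSignInit(sctx, NULL, EVP_sha256(), NULL, key);
    EVP_DigestSign(sctx, sig, &siglen, msg, sizeof(msg) - 1);

    /* ...then immediately verify our own signature before it leaves the box. */
    EVP_MD_CTX *vctx = EVP_MD_CTX_new();
    EVP_DigestVerifyInit(vctx, NULL, EVP_sha256(), NULL, key);
    int ok = EVP_DigestVerify(vctx, sig, siglen, msg, sizeof(msg) - 1);
    printf("%s\n", ok == 1 ? "self-check ok"
                           : "self-check FAILED: suspect this core");

    EVP_MD_CTX_free(sctx);
    EVP_MD_CTX_free(vctx);
    EVP_PKEY_free(key);
    return ok == 1 ? 0 : 1;
}
```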
Does anyone know if they use encryption and access control to granularly regulate access to data? For example, the part of the system that feeds data to advertisements shouldn’t have access to my private messages (in my view that would be a huge breach of trust with users.)
Ya, they're similar in that they're all signed blobs of data, but different in the sense that they're specifically designed to send authentication information via several layers of proxies.
I'm actually interested in this subject, so I'll check out your links when I get a chance. At first sight this sounds like wrapping tokens, or third-party caveats in Macaroons.
If you don't mind, I wouldn't necessarily agree with the comment about JWT by Yueting. JWT is just a format; querying the backend to get a new token is not necessary (that's just how people often use them). I actually built a small PoC that mints new JWTs on the client side (in the browser), signing them with a non-exportable key (through WebCrypto).
As for Macaroons, I believe they could also be adjusted to resemble CATs as I understand them (with layers for different services). I do have other issues with Macaroons, though (https://news.ycombinator.com/item?id=17878845)...
Borg != Kubernetes. K8s is based on Borg, but they are quite different.
Companies at this scale have integrations between all different levels and layers of the 'stack' that make the use of off the shelf software difficult or impossible.
This would make a great comparison. I'm not certain whether or not K8s's mutual auth supports session ticket resumption and the distribution of short-lived ticket keys. The ticket rotation design would probably make a great addition to K8s. There are a lot of intricate details in the design that can make a major difference, not only in performance but also in whether or not the system wakes you up at night.
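For reference, here's roughly what rotating short-lived ticket keys looks like at the OpenSSL level. This is my sketch: the `current_key`/`previous_key` scheme and rotation policy are made up, but the callback contract is OpenSSL's pre-3.0 `SSL_CTX_set_tlsext_ticket_key_cb`. The important detail is returning 2 when a ticket was minted under an older key, which tells OpenSSL to re-issue a fresh ticket under the current key rather than forcing a full handshake:

```c
#include <openssl/hmac.h>
#include <openssl/rand.h>
#include <openssl/ssl.h>
#include <string.h>

struct ticket_key {
    unsigned char name[16], hmac_key[32], aes_key[32];
};

/* In a large deployment these would be fetched from a central service and
 * rotated every few hours; here they are just process-global placeholders. */
static struct ticket_key current_key, previous_key;

static int ticket_cb(SSL *s, unsigned char key_name[16], unsigned char *iv,
                     EVP_CIPHER_CTX *ctx, HMAC_CTX *hctx, int enc) {
    (void)s;
    if (enc) { /* minting a new ticket: always use the current key */
        if (RAND_bytes(iv, EVP_MAX_IV_LENGTH) != 1)
            return -1;
        memcpy(key_name, current_key.name, 16);
        EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL,
                           current_key.aes_key, iv);
        HMAC_Init_ex(hctx, current_key.hmac_key, 32, EVP_sha256(), NULL);
        return 1;
    }
    /* decrypting a presented ticket: accept the current and previous keys */
    struct ticket_key *k = NULL;
    if (memcmp(key_name, current_key.name, 16) == 0)
        k = &current_key;
    else if (memcmp(key_name, previous_key.name, 16) == 0)
        k = &previous_key;
    if (k == NULL)
        return 0; /* unknown key name: fall back to a full handshake */
    HMAC_Init_ex(hctx, k->hmac_key, 32, EVP_sha256(), NULL);
    EVP_DecryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, k->aes_key, iv);
    /* 2 = ticket is valid but stale; re-issue under the current key */
    return (k == &previous_key) ? 2 : 1;
}

/* Installed with: SSL_CTX_set_tlsext_ticket_key_cb(ssl_ctx, ticket_cb); */
```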
"After several years of trying to manage these issues with Kerberos, we decided to redesign the system from the ground up.."
I am curious: has anyone gone in the opposite direction and implemented a new Kerberos library or setup?
So, several places, such as Stanford or Morgan Stanley, have sophisticated Kerberos setups using something like Russ Allbery's Wallet[0] (Stanford) or Roland Dowdeswell's OSKT[1] (Morgan Stanley, Two Sigma) stack.
For example, OSKT is a self-service toolkit that lets users build up access controls for clusters, "role accounts" (user accounts for running application automation), and whatnot. Users use krb5_prestash to indicate which hosts should have which role accounts' credentials (the user must own the hosts and role accounts), and krb5_keytab to get keys for services on hosts they are allowed to run them on. A nifty trick is to have wildcard DNS A RRs for hosts so that one can have HTTP/${USER}.$(uname -n) principals (and keys for them) on any host the user can log in to.
All of this is high-performance and self-service. Users don't need to file JIRA tickets or whatever to get their keys for their services.
Self-service credential provisioning is absolutely essential to successful deployment at scale of any authentication system one uses, whether that be Kerberos or PKIX or DANE or anything else one might find or invent.
That actually sounds about right from my perspective (and experience). Out of curiosity, why do you think that is a lot?
I would go a bit further and say most companies should only proceed with a microservice architecture if they have sufficient scale and automation such that decomposing their architecture will result in at least a high double-digit number of discrete services.
I think I was reading microservices but I was imagining services and that's why it seemed like a lot. I guess it depends on the granularity of the decomposition.
I agree, scale and automation are two important factors when decomposing architectures. I think it would be really valuable if systems could decompose themselves to some degree, based on scale and other factors, without much of an operator's intervention.
It seems that they recreated a service mesh. Running Istio or Consul Connect takes care of the vast majority of the issues listed in this post: encryption, identity, access control. And it's even transparent for developers (no modification of the code...).
Isn't Istio implemented mostly as a sidecar container (Envoy proxy), though? The article mentions they are running containers via their Tupperware orchestrator. If they are largely running containerized, where is the scaling issue with adding sidecars to implement the service mesh? I don't have any experience with Istio, but I'm genuinely curious along which axis it (or Connect, Linkerd, etc.) doesn't scale.
Envoy is so slow that deployment at this scale would be too costly, or if you could afford it would immediately present itself as a huge opportunity for cost reduction. People who are measuring their tail latency in microseconds aren't going to tolerate Envoy's marginal latency, which will be milliseconds even at the median.
Indeed, Istio is very CPU-heavy and probably not great for cloud use on a truly large scale just due to cost concerns. Shopify found it uses 50% more CPU than an alternative like Linkerd [1]. Istio also incurs much higher latency and lower throughput than Linkerd [2]. (To be clear, I don't have a dog in this Istio vs Linkerd fight; those are just some recent benchmarks I'm aware of.)
Interesting, I wasn't aware that Istio had such performance issues. Isn't Google using it as well, though, or at least an internal version of it? Surely they are at the same scale as FB.
I'm curious as to what the cause of the latency is. TLS handshakes?
Is this an engineering article to help prime the pump for discussing FB's approach to taking privacy more seriously (e.g., Zuckerberg's "the future is private")? The article does not explicitly state any connection to such larger FB company and product developments, but it made me think it's connected in some way.