Twitter has an internal root CA problem (izzodlaw.com)
117 points by loriverkutya on March 13, 2023 | 76 comments



I think people who work in reliability see this type of thing as the real existential threat to Twitter. It's unrealistic that a large infrastructure would fall over overnight, but what is very realistic is small problems being neglected until they become big problems, or multiple problems happening at the same time.

This alone is probably manageable; it might even be simple but painful to handle for 2-15 of Twitter's (pre-firing) employees with specialized knowledge. If 3 people knew the disaster recovery plan and they all got fired because they were so busy maintaining things and fighting fires that they failed to get good reviews by building things, well, I wouldn't be surprised. Likewise, the employees trusted with extreme disaster recovery mechanisms are not the poor souls on H-1Bs who don't have the option of leaving easily, so the people trusted with access might have already jumped ship, since they aren't being coerced into staying on board with a madman.

The real existential threat is another problem compounding on top of this, or a disastrous recovery effort. Auto-remediation systems could do something awful. A master database could fall over and a replica be promoted, but what if that happens twice, or 4 times? Without Puppet to configure replacement machines appropriately, there could be a very real problem very quickly. Similarly, extremely powerful tools, like a root SSH key, might be taken out, but those keys do not have seat-belts, and one command typed wrong could be catastrophic. Sometimes bigger disasters are made trying to fix smaller ones.

Puppet can be in the critical path of both recovery (via config change) and capacity.


That's okay; Musk tweeted 5 days ago that Twitter needs a complete, green-field rewrite. I'm sure that will solve the problem.


How many engineers would that take, Michael? 4, hardcore, over the weekend?


Whenever I hear specifics about likely ways things could fail, I always see a plan. "Hey, this all makes sense, let's focus on having these areas covered before they come to pass."

Same goes when someone lists all the reasons why a proposal isn't viable. "Great, so we'll address those and be golden then?" Often they list them as fact without considering (or being able to imagine) that the proposal could be made viable with additional effort.


Oh, I know that problem; we had to change our Puppet root CA after one of the admins botched the migration to SHA-256 certs. But IIRC (it was a long time ago) Puppet CA certs are by default issued for something like 10-20 years, so an expired CA would be a bit weird if true. Also, older versions didn't have a trust chain, "just" a root CA, so the puppet master would have to have the key for it on disk anyway; a proper "root CA + leaf CA for puppetmasters" setup has only been a thing in Puppet for a few years.
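If anyone wants to eyeball their own setup, the CA cert's lifetime is easy to check with openssl. A quick sketch; the path below is the default agent-side copy on current installs and may differ on older ones:

    # On any Puppet-managed node: which CA does it trust, and when does that cert expire?
    openssl x509 -in /etc/puppetlabs/puppet/ssl/certs/ca.pem -noout -issuer -dates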

It would only be really problematic if they also lost SSH access to those machines using Puppet. If you have root access the fix is not exactly hard.

But then they fired the people that did have access, so that might also be a problem.

We made sure all of our machines can be accessed both by Puppet and by SSH, kinda for that reason; we've had both kinds of accident, someone fucking up Puppet and someone fucking up the SSH config, rendering machines impossible to log into (the lessons were learned and etched in stone).

So really, depending on who has access to what, it can be anything from "just pipe a list of hosts to a few ssh commands fixing it" to "get access to each server manually and change stuff, or redeploy the machine from scratch". Again, assuming muski boy didn't fire the wrong people.
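For the first option, a rough sketch of what "pipe a list of hosts to a few ssh commands" might look like; the filenames are placeholders, swapping the CA bundle alone isn't the whole fix (each agent's own cert also needs reissuing), and at real scale you'd reach for pdsh or parallel-ssh rather than a serial loop:

    # Push a replacement CA bundle to every host and bounce the agent, noting stragglers.
    while read -r host; do
      scp new-ca-bundle.pem "root@${host}:/etc/puppetlabs/puppet/ssl/certs/ca.pem" \
        && ssh "root@${host}" 'systemctl restart puppet' \
        || echo "${host}" >> failed-hosts.txt   # come back to these by hand
    done < hosts.txt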


> It would only be really problematic if they also lost SSH access to those machines using Puppet. If you have root access the fix is not exactly hard.

> But then they fired the people that did have access, so that might also be a problem.

Oh my, wouldn't that be delicious...

Gotta wonder how you'd go about fixing that, though. Assuming that those people's access was also tied to their employment and irrevocably voided when they were fired: I guess it would depend on how well those machines are secured against attackers with access to the hardware.


> Musk fired everyone with access to the private key to their internal root CA,

The way forward is to generate a new CA root certificate.

> and they can no longer run puppet because the puppet master's CA cert expired

They can reconfigure internal tools to use the new CA root certificate, or rather one of the signed intermediate certificates.

> and they can't get a new one because no one has access.

They can simply generate new CA root certificates, and sign or create new intermediate certificates.

> They no longer can mint certs.

Yes, they, can...

> My limited understanding in this area is that this is...very bad

No, it, is, not...

There are two immediate issues that come to mind.

* Twitter was so awful before that it relied on people to safeguard the keys to the kingdom. This is very bad practice, and one of the many things Musk will no doubt be fixing. For any mission-critical assets, especially certificates but also passwords... current corporate practice is to have a secure ledger of these that can be accessed by the board of directors, the executive managers, and designated maintainers. At no point should a password be entrusted to an individual, but rather to a "role" that functions as the one who has access. Say, for example, the CIO/CTO and their subordinates.

* The second issue is the one everyone is fixating upon, and that's firing important people who put the company at risk. This is a big issue, and certainly Musk could have done a better job of scoping out who represents a single point of failure at Twitter, eliminating that risk, and then proceeding with the culling. In a modern enterprise no single person should be capable of putting the entire operation at risk. It's just that simple. So in a way, Musk accelerated what was probably inevitable at Twitter already. They were probably precariously close to destruction already, and now they can learn the hard way not to repeat these mistakes.


>Twitter was so awful before that it relied on people to safeguard the keys to the kingdom. This is very bad practice...that can be accessed by the board of directors, the executive managers, and designated maintainers.

LOL, you realize all the PEOPLE you list as the PEOPLE who should be able to manage the keys to the kingdom are PEOPLE? Board of directors - fired on day one of the Musk takeover; executive managers - many fired on day one by Musk as well; designated maintainers - for all we know they could have been fired in the purge or quit when Musk offered the 3-month severance.

All systems require people to run.


According to the latest publicly available sources there are still many hundreds of folks on the active payroll at Twitter, do you know of any evidence to the contrary?


Significant edits for clarity.

Serious question... How do I build a system that grants access to a company role, not a person? In other words, if the CIO is fired, how does this system ensure that the new CIO can access it and the old one no longer can?

If we tie it to the HR system, whoever admins that effectively has the keys to the kingdom. Same for Active Directory or any other technical solution.
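One partial answer, which admittedly just moves the trust anchor to the identity provider (the parent's point), is to attach the sensitive permissions to a role rather than a user and let IdP group membership decide who can assume it. A hedged, AWS-flavored sketch; the role name and account ID are made up:

    # Nobody holds the CA permissions directly; they sit on a role.
    # Whoever currently holds the job assumes it through their SSO session,
    # and every assumption lands in the audit log.
    aws sts assume-role \
        --role-arn arn:aws:iam::123456789012:role/BreakGlassCACustodian \
        --role-session-name "$(whoami)-ca-recovery" \
        --duration-seconds 3600

When the CIO changes, you change group membership in the IdP; no credentials get rotated, and the old CIO's SSO login simply stops working.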


Something like the nuclear football is probably the only answer: something very obvious that is transferred with the role.


You're probably right, though honestly I'm not sure that helps here either. If I'm the CIO and Musk walks in and tells me to get out, I'm not going to go to any pains to make sure he knows about the football. Sure, I'll leave it there in my desk, where if someone knows of its existence they can find it, but it probably just ends up going in the dumpster or with the desk when he sells it.


> For any mission critical assets, and especially certificates, but also passwords... current modern day corporate practice is to have a secure ledger of these that can be accessed by the board of directors, the executive managers, and designated maintainers. At no point ever should the password be entrusted to anybody, but rather a "role" that functions as the one who has access. Say for example, the CIO/CTO and their subordinates.

Maybe in hacker movies. In real life, you try your best to avoid anyone having access to keys or passwords, and rely on HSMs, cloud KMS, secrets-management services, etc. Access to those things is controlled by your security team, with multi-factor authentication, often stored in safes, with alerts being fired when they are used (because they should never be used). The audit logs that trigger these alerts should be written to WORM storage, so you can track access back down to individuals, and so that you know when you need to rotate secrets accessed by humans. Ideally your CA infrastructure automatically rotates and distributes.

There's absolutely no way in hell you should allow your board to have access to these things.

Most companies slowly work their way towards full automation, and until that happens, your security team usually owns manual rotations of critical systems like this. Only a fucking moron would fire all of these people.
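To make the HSM / cloud KMS point a couple of paragraphs up concrete, here is a small sketch of "the key never leaves the hardware"; the key alias and filenames are made up, and a real setup would gate the Sign permission behind the security team's roles and alert on the audit events:

    # Sign an artifact's digest with a KMS-backed asymmetric key.
    # The private key material stays inside the KMS/HSM; no human ever sees it.
    openssl dgst -sha256 -binary artifact.bin > digest.bin
    aws kms sign \
        --key-id alias/internal-signing-key \
        --message fileb://digest.bin --message-type DIGEST \
        --signing-algorithm RSASSA_PKCS1_V1_5_SHA_256 \
        --output text --query Signature | base64 -d > artifact.sig

    # Every use of the key shows up in the audit trail, which is what the alerts key off.
    aws cloudtrail lookup-events \
        --lookup-attributes AttributeKey=EventName,AttributeValue=Sign --max-results 20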


When building secure systems, one of the key principles is to assume someone will leak the private key. This is how we get to HSMs.

Maybe another one is to assume you will lose access to the HSM. Sure, spinning up a new trust chain is annoying, but it wouldn't take that long to do. Totally agree this post is overblown.


Spinning up a new trust chain is not so hard, but deploying that trust chain to thousands of servers around the world when your automation tool isn't available to do it for you is really, really hard.


This is why I've been very skeptical of the kids these days kicking literally everyone off of the production servers.

Having a few greybeards with the keys to the kingdom and the wisdom not to use them to screw around in prod, outside of existential emergencies, can be quite useful.

They should also have console access.

One time a bad config push took out a couple hundred webservers with effectively a single iptables default-deny rule, and we had to get a dozen people to fix them in chunks by logging in manually over remote terminal (we probably could have expect-scripted that up, but it was quicker to just get it done).
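For anyone who hasn't hit this failure mode, the hypothetical rules below show the shape of it: one policy flip kills every remote session at once, and the fix has to go in over a console because SSH is dead. Illustrative only, not the actual config from that incident:

    # The classic footgun: flip the default policy to DROP before any ACCEPT rules exist.
    iptables -P INPUT DROP      # every new inbound connection, including SSH, is now dropped

    # Remediation, typed at the out-of-band/serial console of each affected box:
    iptables -I INPUT -p tcp --dport 22 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
    iptables -P INPUT ACCEPT    # or restore a known-good ruleset with iptables-restore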


I'll take the rumor with a grain of salt, but can anyone unpack what the recovery plan would be for something like this? It would obviously be a big problem, but where would you even start?


Assuming they've still got access to the servers themselves via SSH, you'd start by issuing a new root CA cert for the Puppetmaster and putting that in place, then you've got to issue a new cert for every client and distribute those. It's not impossible, but it's also going to be a pain in the backside to do.
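The "issue a new root CA" half is the easy bit. A minimal openssl sketch, with placeholder names and lifetimes; a real replacement would mirror whatever profile and extensions the old CA used:

    # New root: key plus self-signed cert
    openssl genrsa -out root-ca.key 4096
    openssl req -x509 -new -key root-ca.key -sha256 -days 3650 \
        -subj "/CN=Example Internal Root CA" -out root-ca.crt

    # Intermediate for the puppetmaster: CSR, then sign it with the new root as a CA
    printf 'basicConstraints=critical,CA:TRUE\nkeyUsage=critical,keyCertSign,cRLSign\n' > ca-ext.cnf
    openssl genrsa -out puppet-ca.key 4096
    openssl req -new -key puppet-ca.key -subj "/CN=Example Puppet CA" -out puppet-ca.csr
    openssl x509 -req -in puppet-ca.csr -CA root-ca.crt -CAkey root-ca.key \
        -CAcreateserial -extfile ca-ext.cnf -days 1825 -sha256 -out puppet-ca.crt

The hard part, as the parent says, isn't minting these; it's getting the new chain and freshly signed client certs onto every node.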


If you read through the guide [1] it requires you to have sudo access to bounce the puppet process on the client nodes.

This is because the whole idea is that you have inaccessible, locked down Production servers that only Puppet (which is driven from a central, governed configuration management source) has authority to configure i.e. no SSH and no root access.

Thus the only option left is to physically visit each server at the datacenter and issue the commands.

[1] https://www.puppet.com/docs/puppet/5.5/ssl_regenerate_certif...


Been there before; we did exactly this, except over OOB + reboot-into-single-user (because SELinux). Took us a few days (~5k servers), but we managed to get out of it with no public-facing downtime. The other way would have just been to rekick the world one box at a time. A number of integration tests were added after that disaster :)


According to this [1] Twitter has 500,000+ servers spread across DCs, GCP and AWS.

If we assume only a team of your size remains, that would take 300+ days (500,000 servers is roughly 100x the parent's ~5k, at a few days per batch).

That would mean no OS patches etc. for the better part of a year, which would put them firmly in the crosshairs of the FTC.

[1] https://twitter.com/d_feldman/status/1562265193249390593


Interesting. Taking MAU to be around 350 million, that's roughly 700 users per server, a bit fewer than 1000. Of course it's not that simple, because not all servers are the same and, more importantly, not all users are the same, but it does sound a bit on the low end.

Anecdotal data point: infosec.exchange hosts 30k users on 7 servers (https://infosec.exchange/@jerry/109374478717918484). That's roughly one server per 4-5k users, several times Twitter's apparent density. Again, not the same usage and performance requirements, but I find it interesting.


If you are split-cloud under a homogeneous puppet master without homogeneous break-glass SSH access (which would be crazy), then probably your best bet is to just re-kick the world. But the scaling factor for this sort of thing is most certainly not team size; it's "how many X servers can be down at the same time", which will increase with your number of servers. In any case, I think the FTC is the least of Twitter's concerns right now.


Not sure if it's still the case, but the last time I had co-located servers you could access the systems via OOB without needing to reboot them into single-user mode.

If that's not the case, then Twitter is in far more trouble, because according to past engineers at least a few of their services need manual intervention on a full-scale reboot. And losing quorum in a distributed system is never pretty.


It's nearly impossible to predict recovery without understanding the system. You would probably need to know how ssh is configured, how secrets are managed, and how files are distributed, both before and after puppet.

Circular dependencies can absolutely wreck you. For example, Puppet could configure sudoers, and without the Puppet config being applied, people who would normally expect access might not have it. So now you have to find a privileged SSH key for unconfigured machines.

I would be surprised if twitter did not have a physical vault with a USB drive with a root SSH key on it. With that you can do just about everything.

I would be most terrified of machine churn. Auto-remediation systems or elastic capacity systems can result in lost capacity that can't come back until the configuration problem is resolved.


Create a new root CA, SSH to each machine, remove the old certs, re-add the machine to Puppet, and sign the new CSR on the Puppet master; then it will download the new root.

Very simple operation... if you have working SSH access with root. If they don't, well...
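Roughly like this, assuming Puppet 6+ tooling (puppetserver ca), default paths, and working root SSH; older versions use the puppet cert commands instead, so treat it as a sketch rather than a runbook:

    # On the Puppet server: stand up a fresh CA (destructive; certs from the old CA stop validating).
    puppetserver ca setup

    # On every agent, over SSH: drop the old certs and request a new one.
    while read -r host; do
      ssh "root@${host}" '
        systemctl stop puppet
        rm -rf /etc/puppetlabs/puppet/ssl
        puppet agent --test --waitforcert 10
      ' || echo "${host}" >> failed-hosts.txt
    done < hosts.txt

    # Back on the server: sign the pending CSRs (spot-check the names before using --all for real).
    puppetserver ca sign --all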


If you don't have SSH access with root, hopefully you have access to something like the underlying hypervisor, to do the equivalent of "sudo xl console vmname" on a Xen dom0 and get what is logically the same as a physical serial tty (or local VGA+keyboard) console on the domU machine.

Or the VMware ESXi emulated graphical console, etc.

Or, if it's a bunch of bare-metal machines, hopefully someone old-school in the organization thought to deploy 48/96-port RS-232 serial console concentrators and wire them up to the DB9 serial port on each physical server. And you didn't disable all local serial ttys in your operating config.
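For concreteness, a sketch of two of those console paths; hostnames and credentials are placeholders, and the second assumes the box has a reachable BMC on an out-of-band network:

    # Xen guest: console from the dom0, as above
    sudo xl console vmname

    # Bare metal: IPMI Serial-over-LAN from a jump host on the OOB network
    ipmitool -I lanplus -H bmc-host.oob.example.net -U admin -P "$IPMI_PASSWORD" sol activate
    # (the default escape to leave the SOL session is ~.)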


To my knowledge all modern DCs have out-of-band networks for this sort of thing that provide serial access to the BMC chip, nothing old school about that. Old school is having to submit a ticket to Jerry in the DC to walk the crash cart down to box 55AE, hook up a serial console, run diagnostics, and attach the output back to the ticket. You only have to deal with Jerry occasionally now, usually when the BMC or power rails fail.


There are more than a few people who've decided the security risk of a full console-capable BMC is not acceptable, and, if other failover systems are engineered appropriately, not necessary at all. BMC/IPMI intentionally disabled or not connected to any network.

Anecdotally, I have seen a number of low-cost x86-64 pseudo-blade setups, similar to Open Compute Platform designs, which have no OOB. If a unit fails it's pulled entirely and put in a work queue for someone to repair.


In both cases it's a disruptive event, as you have to reboot the machine to get into rescue mode (where you don't need the password).


> Or, if it's a bunch of bare-metal machines, hopefully someone old-school in the organization thought to deploy 48/96-port RS-232 serial console concentrators and wire them up to the DB9 serial port on each physical server. And you didn't disable all local serial ttys in your operating config.

In a hacker folklore story this would 100% be the solution. And for some reason they'd have to use an original VT100 that some greybeard had lovingly restored at home.


If they're in the cloud, it's pretty straightforward to re-mount the drive somewhere else and replace the SSH keys.


And if they have the same template everywhere, it's probably not even too hard to script.
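Something like this, to sketch the idea for EC2; instance/volume IDs and the key are placeholders, device names vary by instance type, and a real run would batch it and handle errors properly:

    INSTANCE=i-0123456789abcdef0
    RESCUE=i-0fedcba9876543210      # a healthy instance you still control
    VOL=$(aws ec2 describe-instances --instance-ids "$INSTANCE" \
          --query 'Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId' --output text)

    aws ec2 stop-instances --instance-ids "$INSTANCE"
    aws ec2 wait instance-stopped --instance-ids "$INSTANCE"
    aws ec2 detach-volume --volume-id "$VOL"
    aws ec2 attach-volume --volume-id "$VOL" --instance-id "$RESCUE" --device /dev/sdf

    # On the rescue instance: mount the root filesystem and drop in a fresh key
    sudo mount /dev/xvdf1 /mnt
    echo "$NEW_PUBKEY" | sudo tee -a /mnt/root/.ssh/authorized_keys
    sudo umount /mnt

    # Then detach, re-attach to the original instance as its root device, and start it back up.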


Depends on how disposable the individual servers are. I don't know the specifics of the Twitter infra, but I would probably just issue a new cert and begin shooting and replacing the old servers. Hopefully the services are abstracted from the Puppet cert, and things like Redis and whatnot will safely reprovision and find their quorums.


The certificate for Twitter's hidden service expired a full week ago and they still haven't replaced it.


Taking it with a pinch of salt, but this stuff does happen.

I've received calls from past employers, usually when they migrate a site I worked on to a new CMS or platform. There is some critical service (AWS, CDN credentials, domain-related, etc.) that no one knows who has access to... Happily, those appear to get resolved... but this... yikes (if true).


In a possibly more pedestrian example, my organization needed a re-mailer service set up and found out that the IT worker previously tasked with administering that service had the MFA set up on his personal phone. I think they eventually got a hold of him to coordinate the transfer of credentials, but knowing him, there was a 50% chance he could have left the company on bad terms and made things quite a bit more difficult.


I had something similar happen when I left a company, only I'm fairly consistent about deleting credentials to systems I'm not supposed to have access to. Fortunately it was for an internal service and nothing customer-facing, so they were able to wipe and redeploy.


One of the first things I do when leaving a company is remove all credentials from my password manager. Sure they should disable my accounts, but on the off chance they don't I still want it clear I don't have access.

It doesn't have to be a departure on bad terms; if they needed my TOTP codes, I can't help them. That secret is already gone.


Funnily enough putting it in configuration management (like Puppet) can make it nice and automatic.

But, well, if you fuck up your CM...


The really interesting part of this is what else is tied to that CA. If it's just Puppet, it's bad enough; internal PKIs have a habit of metastasizing into lots of other places, though, precisely because everything internal trusts them. The worst case here is that some piece of the internals of the Twitter app relies on things from that CA: for instance, it relies on packages to do app config changes or updates, and the packages have to be signed by that chain or served from something with a cert from it. In that case they'd be hosed: you'd have to replace every copy of the Twitter app. Fairly unlikely, but it wouldn't be the first time I've seen it happen.

Beyond that, though: Internal build systems? Data encryption? User client auth to critical services? Internal app mTLS for data exchanges? The list of possibilities goes on and on…



It's pretty amazing that a formerly public company like Twitter had such shitty documentation/processes/infrastructure.

I thought SOX mandated this sort of internal control; after all, Twitter basically seems to be full of infrastructure risks that would (and have) negatively impacted it financially in a material way.

No key access? Why didn't they print it out and stick it in a safe-deposit box, which is what a couple of startups I've been with have done... along with a couple of other key pieces of paper. Physical backup.
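If anyone wants to do the paper-backup thing without a single printout being a total compromise, one hedged option is to split the CA key's passphrase with Shamir secret sharing via the ssss tool; the threshold and share count here are just examples:

    # Split the passphrase into 5 shares, any 3 of which can reconstruct it.
    # Print each share and store them in separate safes / safe-deposit boxes.
    ssss-split -t 3 -n 5 -w root-ca

    # Recovery later: feed any 3 shares back in.
    ssss-combine -t 3

The encrypted key file itself still needs to live somewhere durable; the shares only protect the passphrase.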


very possibly bullshit but huge if true: https://twitter.com/davidgerard/status/1634633886712954881


wouldn't put it past him since he only wanted "builders" and I bet he doesn't consider platform ops "builders" (even though they build tons of stuff; twitter's platform was basically a product in and of itself)


Wonder if their servers all share a common NTP server/pool (that they control).


For about 3-4 days now I've had issues using TweetDeck, which fails to load in Firefox with a key pinning error. I'm not sure it's related, but it seems like too large a coincidence not to be.


That Mastodon server has a load-time problem... it took a solid 30 seconds to load for me.


[flagged]


The good part of your insightful analysis is you can keep rolling the dates so you never have to revise it.


So far, there seem to be surprisingly few issues. Some glitches here and there, but overall stability still looks quite good. I would've expected major issues much sooner, especially as they did push out new features in the meantime.


Do you use it actively? There have been minor problems on a near-daily basis and moderate ones a handful of times. It was only ever averagely stable before, but it was at least always that; it's far less consistent recently.


I use it actively and feel like recommendations have improved, at least for me. Ads are a bit more annoying but I've faced zero problems with stability. But I'm only reading, not actively tweeting.


Do you have data on this? It's not like other apps don't have issues. Sometimes I open Netflix and it takes 30 seconds to show my profiles; that doesn't mean the app is garbage.


Do I have data on a website I use sometimes? No.


If this is true (who knows), then it reflects rather badly on the people who were fired, as they didn't implement safeguards for a 'run over by a bus' scenario when they were in charge.


It's normal to plan for scenarios where you abruptly lose some people. It's... less normal to plan for scenarios where you abruptly lose basically everybody; in most cases where that happens the company is basically dead anyway, so they're arguably not worth planning for.

Say you're planning, well, _anything_, and someone says "but in five years, a weird billionaire might buy the company and mismanage it to such an extent that your contingency plans don't work". There's a good argument that the proper response is that (a) that is largely the weird billionaire's problem and (b) that it is impossible to defend against an arbitrarily incompetent speculative future weird billionaire.

If someone takes a hammer to an electricity distribution board and electrocutes themselves, the normal response is not "well, that's the electrician's fault; they should have thought of that".

If true, this would "reflect rather badly" on exactly one person. But, y'know, it'll need to join a rather long queue of poorly reflecting things.


There are "run over by a bus" and "90% of the company got run over by a bus" scenarios. The second one is rarely worth planning for.


Not to forget the "90% get fired by an egotistical maniac who expressed his disdain by quite publicly calling them lazy, useless pieces of shit" scenario. That scenario is also seldom considered.


There's also the possibility of: "If this person hadn't been fired, they could use some other form of credentials within Twitter's internal systems, plus a passphrase they have memorized, to log in to the private-key-repository system where the credentials for the root CA are stored, and retrieve them. But as they were fired abruptly, they are not inclined to help Musk. And nobody has asked them."


Aren't abrupt firings the norm in the USA?

My company had layoffs last year and the US people were gone the same day.


Sure, but you do abrupt firings of whole teams only if you don't need what these teams do.

It is quite reasonable if a company's response plan for scenario "what if we intentionally shoot ourselves in the head" is "don't do that, why would we do that?".


> Aren't abrupt firings the norm in the USA?

Abrupt firings of everyone with critical access, primaries and backups, are not, because it's suicide. (That's also why critical-access roles are vetted carefully: you want a lower-than-normal chance of ever needing to fire any of them, since that's how you minimize the chance of a situation where you'd want to fire enough of them to cause a crisis.)

If you do decide there's a problem that requires you to fire those people, you find every way possible to delay firing some of them while you expand the set of people with that access (which may be only momentary, e.g. by compelling them to hand over credentials as part of the exit process, if you are confident you can do that successfully).


It reflects rather badly on you that you're talking mad shit without knowing their circumstances. What's the bus factor on your systems? Can they handle literally every person being fired overnight?


Or they had a "run over by a bus" scenario that assumed that the entire team wasn't going to be run over by a bus all at the same time?


I've worked at companies where we've had policies about how much of the company can travel together (ie, how many people can be on the same plane). If your entire team is fired then the company is the problem, not the team.


It's management's job to plan for the run-over-by-a-bus scenario, not that of the people whose job is to actually implement things.


There's "someone got run over by a bus outside of our control" and "the people in charge direct a bus to run over everyone covering a key function". You don't really plan for the latter scenario when you are in charge; instead, you just don't direct a bus to do that. If your successor decides to do that, that's... on them.


Maybe building it right costs 5x and you have a budget for 1x. Sometimes money is not unlimited, even at FAANG.


To add on, people forget that Twitter was never really FAANG. It not only wasn't profitable but had no monetization plan for years. I'm sure it paid off for all the investors who got Elon's money but even as a Facebook competitor they don't have Facebook money.


TWAANG wouldn't sound bad tho


It was profitable in 2019 and 2020, and could have been in 2022 (the first half of the year was huge, probably because of all the stuff happening all over the world).


We have 7 racks and 3 people working in ops, and we built our Puppet setup "right". It's not hard. And their setup was probably right too.

It's just that nobody plans for "a bus hit our entire ops team".


More commonly you don't plan for it; you make sure the entire ops team is never on the same bus or the same plane, preferably not even in the same city.


All fine and good. Until the new owner just fires everyone overnight anyway. At which point, I guess, it is not the previous ops team's problem anymore.


Unfortunately even those precautions wouldn't save you from the 'holy shit, basically the entire ops team got fired out of nowhere' scenario...


_Nothing_ can save a company from sufficiently incompetent future management, ultimately.



