The last post-mortem they wrote is very interesting and full of details. Basically, back in 2016 the heart or keystone component of fly.io's production infrastructure was called consul, which is a highly secure TLS server that tracks shared state and requires that both the server certificate and the client certificate be authenticated. Since it was centralized, it had scaling issues, so fly.io wrote a replacement for it in 2020 called corrosion, and quickly forgot about consul, but didn't have the heart to kill it. Then in October 2024 consul's root signing key expired, which brought down all connectivity, and since it uses bidirectional authentication, they couldn't bring it back online until they deployed new SSL certificates to every machine in their fleet. Somehow they did this in half an hour, but the chain of dominoes had already been set in motion to reveal other weaknesses in their infrastructure that they could eliminate. There was this other internal service whose own independent set of TLS keys had also expired long ago, but they didn't notice until they tried rebooting it as part of the consul rekey, since doing so severed the TCP connections it had established way back when its certificate was valid. Plus, the whole time this was happening, their logging tools were DDoSing their network provider. It took some real heroes to save the company and all their customers too when that many things explode at once.
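To make the expiry failure mode concrete, here is a minimal sketch of the kind of check that flags a certificate before it lapses. This is not fly.io's tooling; the function names and the 30-day threshold are assumptions of mine, and the only real API used is Python's standard `ssl.cert_time_to_seconds`, which parses the `notAfter` string format that `getpeercert()` returns.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format found in ssl.getpeercert()["notAfter"],
    e.g. "Jan  5 09:34:43 2018 GMT". A negative result means the
    certificate has already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_rotation(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True if the certificate expires within `warn_days` (hypothetical threshold)."""
    return days_until_expiry(not_after, now) < warn_days


if __name__ == "__main__":
    sample = "Jan  5 09:34:43 2018 GMT"  # illustrative value, not a real cert
    print(needs_rotation(sample, datetime.now(timezone.utc)))
```

Note that a check like this only helps if it reads the certificate on disk or performs a fresh handshake: as the story above shows, connections established while a certificate was still valid can keep working long after it expires.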
No principle in itself is beyond critique, agreed, but it's still the choice to pick this specific principle that tells the whole story. There are so many principles to pick from, and the tech-debt pick follows up with a "We have a 3-month “no refactoring” rule for new hires. This isn’t everyone’s preferred work style! We try to be up front about stuff.", which sounds a bit like an additional "perform or else" principle that just delays ownership of the stuff you're supposed to work with. In the best case that sounds like naive optimism and in the worst case that's gross negligence... neither one speaks "engineering" to me.
It is absolutely not a "perform or else" rule. Why are you reading so far into this? We really do have a rule about tech-debt changes, and it's a useful insight into why you might or might not want to work here, which is why we bring it up, despite the possibility it might alienate people; we'd like to be as honest as we can be. Worrying about people reading hustle-culture bullshit into stuff like this is a reason not to be transparent, which sucks.
All the other comments aside: these aren't even contradictory statements. We really do have no-tech-debt rules, and they generally have not been responsible for our outages. Consul wasn't tech debt; it was a carefully made decision (that I happen to disagree with and enjoy thinking about Michael Ehrmantraut shooting in the face).
People hosting their business with a cloud hosting provider don't care about your technical debt; we care about our businesses not going down for several hours and then being gaslighted that it's normal and told to expect more in the future by the founder.
If you'd be happier without the companies involved in stories commenting here, then by all means get more people to write comments like this and see if you can chase them away. I think you won't have so much luck with me, but it might work with other companies. Nobody is gaslighting you.
For brevity I chose to put up only the conclusion from a postmortem (of which I've read plenty by now) and another point from their otherwise comparatively shorter careers page, which imo capture the inherent tension between building out fast & building out right. This is not something I've started complaining about today or yesterday. I've used Fly in prod for 4 years and spilled much ink on this topic on their forums already. Even if I critique, I remain optimistic about Fly despite the seemingly endless list of failure modes that building such complex systems entails: https://community.fly.io/t/fly-down/10224/15
(personally speaking, I'm humble enough because I can hardly build a toy side-project right!)