This is the multi-million dollar .unwrap() story. In a critical path of infrastr...

abalone · 2025-11-19T05:04:38 1763528678

I’ve led multiple incident responses at a FAANG, here’s my take. The fundamental problem here is not Rust or the coding error. The problem is:

1. Their bot management system is designed to push a configuration out to their entire network rapidly. This is necessary so they can rapidly respond to attacks, but it creates risk as compared to systems that roll out changes gradually.

2. Despite the elevated risk of system wide rapid config propagation, it took them 2 hours to identify the config as the proximate cause, and another hour to roll it back.

SOP for stuff breaking is you roll back to a known good state. If you roll out gradually and your canaries break, you have a clear signal to roll back. Here was a special case where they needed their system to rapidly propagate changes everywhere, which is a huge risk, but didn’t quite have the visibility and rapid rollback capability in place to match that risk.

While it’s certainly useful to examine the root cause in the code, you’re never going to have defect free code. Reliability isn’t just about avoiding bugs. It’s about understanding how to give yourself clear visibility into the relationship between changes and behavior and the rollback capability to quickly revert to a known good state.

Cloudflare has done an amazing job with availability for many years and their Rust code now powers 20% of internet traffic. Truly a great team.

polack · 2025-11-19T09:47:57 1763545677

They failed on so many levels here.

How can you write the proxy without handling the config containing more than the maximum features limit you set yourself?

How can the database export query not have a limit set if there is a hard limit on number of features?

Why do they do non-critical changes in production before testing in a stage environment?

Why did they think this was a cyberattack and only after two hours realize it was the config file?

Why are they that afraid of a botnet? Does not leave me confident that they will handle the next Aisuru attack.

I'm migrating my customers off Cloudflare. I don't think they can swallow the next botnet attack, and everyone on Cloudflare will go down with the ship, so it will be safer not to be behind Cloudflare when it hits.

huijzer · 2025-11-19T11:03:39 1763550219

> They failed on so many levels here.

That's often the case with human error as especially aviation safety experts know: https://en.wikipedia.org/wiki/Swiss_cheese_model

itzjacki · 2025-11-19T13:24:14 1763558654

Exactly. The only way this could happen in the first place was _because_ they failed at so many levels. And as a result, more layers of Swiss cheese will be added, and holes in existing ones will be patched. This process is the reason flying is so safe, and the reason why Cloudflare will be a little bit more resilient tomorrow than it was yesterday.

miki123211 · 2025-11-19T14:00:24 1763560824

In organizations with this level of care, if you fail at fewer levels, customers just never notice the error.

Any big and noticeable incident is one of the "we failed on so many levels here" kind, by definition.

michaelt · 2025-11-19T11:20:51 1763551251

> Why did they think this was a cyberattack

Isn’t getting cyberattacked their core business?

Yokohiii · 2025-11-20T08:36:52 1763627812

If so, why isn't their discovery of one unambiguous?

cowsandmilk · 2025-11-19T09:59:35 1763546375

> Why do they do non-critical changes in production before testing in a stage environment?

I guess the noncritical change here was the change to the database? My experience has been a lot of teams do a poor job having a faithful replica of databases in stage environments to expose this type of issue.

brookst · 2025-11-19T14:37:13 1763563033

In part because it is somewhere between really hard and impossible. Is your staging DB going to be as big? Seeing the same RPS as prod? Seeing the same scenarios?

Permissions stuff might be caught without a completely faithful replica, but there are always going to be attributes of the system that only exist in prod.

nip · 2025-11-19T13:50:41 1763560241

It’s easy to pick on logic that failed and for which you have a very detailed and great post mortem write-up.

Yet you omit to acknowledge that the remaining 99.99999% logic written that powers Cloudflare works flawlessly.

Also, hindsight is 20/20

Yokohiii · 2025-11-20T09:08:59 1763629739

You are less critical of CF than they are of themselves.

A system that is 99.99999% flawless can still be unusable.

optimism bias: 100/100

aforwardslash · 2025-11-19T20:58:50 1763585930

I know it's easy to criticize what happened after the fact, with a clear(er) picture of all the moving parts and the timeline of events, but while most of the people in the thread are pointing out either Rust-related issues or the lack of configuration validation, what really grinds my gears is something that - in my opinion - is bad engineering.

Having an unprivileged application querying system.columns to infer the table layout is just bad; not having a proper, well-defined table structure indicates sloppiness in the overall schema design, especially if it changes quickly. Considering specifically clickhouse, and even if this approach would be a good idea, the unprivileged way of doing it would be "DESCRIBE TABLE <name>", NOT iterating system.columns. The gist of it - sloppy design not even well implemented.

Having a critical application issuing ad-hoc commands to system.* tablespace instead of using a well-tested library is just amateurism, and again - bad engineering; IMO it is good practice to consider all system.* privileged applications and ensure their querying is completely separate from your application logic; Sometimes some system tables change, and fields are added and/or removed - not planning for this will basically make future compatibility a nightmare.

Not only the problematic query itself, but the whole context of this screams "lack of proper application design" and devs not knowing how to use the product and/or read the documentation. Granted, this is a bit "close to home" for me, because I use ClickHouse extensively (at a scale - I'm assuming - several orders of magnitude smaller than CloudFlare) and I have spent a lot of time designing specifically to avoid at least some of these kinds of mistakes. But, if I can do it at my scale, why aren't they doing it?

Yokohiii · 2025-11-20T09:02:36 1763629356

On all the other issues, I thought they wanted to do the right thing at heart, but failed to make it fail-safe. I can pass it off as a problem of a journey to maturity or simply the fact that you can't get everything perfect. Maybe even a bit of sloppiness here and there.

The database issue screamed at me: lack of expertise. I don't use CH, but seeing someone mess with a production system and then be surprised ("Oh, it does that?") is really bad. And this is obviously not knowledge that is hard to acquire, buried deep in a manual or an edge case only discoverable by reading source code; it's bread and butter knowledge you should have.

What is confusing is that they didn't add this to their follow-up steps. With some benefit of the doubt I'd assume they didn't want to put something very basic out there as a reason, just to protect the people behind it from widespread blame. But if that's not the case, then it's a general problem. Sadly it's not uncommon that components like databases are dealt with on a low-effort basis. Just a thing we plug in and it works. But it obviously doesn't.

raxxorraxor · 2025-11-19T09:56:43 1763546203

I don't think these are realistic requirements for any engineered system to be honest. Realistic is to have contingencies for such cases, which are simply errors.

But the case for Cloudflare here is complicated. Every engineer is very free to make a better system though.

polack · 2025-11-19T11:59:36 1763553576

What is not realistic? To do simple input validation on data that has the potential to break 20% of the internet? To not have a system in place to rollback to the latest known state when things crash?

Cloudflare builds a global scale system, not an iphone app. Please act like it.

raxxorraxor · 2025-11-19T13:06:16 1763557576

Cloudflare's success came from how simple it was to build a distributed system in different data centers around the world, implemented by third party IT workers while Cloudflare was a few people. There are probably a lot of shitty iPhone apps that do less important work and are vastly more complex than the former Cloudflare server node configuration.

Every system has a non-reducible risk and no data rollback is trivial, especially for a CDN.

aquariusDue · 2025-11-19T12:44:37 1763556277

Yeah, I don't quite understand the people cutting Cloudflare massive slack. It's not about nailing blame on a single person or a team, it's about keeping a company that is THE closest thing to a public utility for the web accountable. They more or less did a Press Release with a call to action to buy or use their services at the end and everybody is going "Yep, that's totally fine. Who hasn't sent a bug to prod, amirite?".

It goes over my head why Cloudflare is HN's darling while others like Google, Microsoft and AWS don't usually enjoy the same treatment.

miyuru · 2025-11-19T13:40:39 1763559639

>It goes over my head why Cloudflare is HN's darling while others like Google, Microsoft and AWS don't usually enjoy the same treatment.

Do the others you mentioned provide such detailed outage reports, within 24 hours of an incident? I’ve never seen others share the actual code that related to the incident.

Or the CEO or CTO replying to comments here?

>Press Release

This is not a press release, they have always done these outage posts, from the start of the company.

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

aquariusDue · 2025-11-19T16:43:23 1763570603

> Do the others you mentioned provide such detailed outage reports, within 24 hours of an incident? I’ve never seen others share the actual code that related to the incident.

Azure (albeit pretty old): https://devblogs.microsoft.com/devopsservice/?p=17665

AWS: https://aws.amazon.com/message/101925/

GCP: https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...

The code sample might as well be COBOL for people not familiar with Rust and its error handling semantics.

> Or the CEO or CTO replying to comments here?

I've looked around the thread and I haven't seen the CTO here nor the CEO, probably I'm not familiar with their usernames and that's on me.

> This is not a press release, they have always done these outage posts, from the start of the company.

My mistake calling them press releases. Newspapers and online publications also skim this outage report to inform their news stories.

I wasn't clear enough in my previous comment. I'd like all major players in the internet and web infrastructure to be held to higher standards. As it stands, when it comes to them or the tech department of a retail store, the retail store must answer to more laws when the surface area of combined activities is taken into account.

Yes, Cloudflare excels where others don't or barely bother and I too enjoyed the pretty graphs, diagrams and I've learned some nifty Rust tricks.

EDIT: I've removed some unwarranted snark from my comment which I apologize for.

dspillett · 2025-11-19T12:24:33 1763555073

> To do simple input validation on data that has the potential to break 20% of the internet?

There will always be bugs in code, even simple code, and sometimes those things don't get caught before they cause significant trouble.

The failing here was not having a quick rollback option, or having it and not hitting the button soon enough (even if they thought the problem was probably something else, I think my paranoia about my own code quality is such that I would have been rolling back much sooner just in case I was wrong about the “something else”).

vsl · 2025-11-19T10:25:03 1763547903

> Why did they think this was a cyberattack and only after two hours realize it was the config file?

They explain that at some length in TFA.

jve · 2025-11-19T10:12:07 1763547127

> I'm migrating my customers off Cloudflare.

Is that an overreaction?

Name me global, redundant systems that have not (yet) failed.

And if you used cloudflare to protect against botnet and now go off cloudflare... you are vulnerable and may experience more downtime if you cannot swallow the traffic.

I mean no service has 100% uptime - just that some have more nines than others.

Carriethebest · 2025-11-19T10:20:00 1763547600

There are many self-hosted alternatives to protect against botnet. We don't have to use cloudflare. Everything is under their control!

sofixa · 2025-11-19T10:25:45 1763547945

> There are many self-hosted alternatives to protect against botnet

Whatever you do, unless you have their bandwidth capacity, at some point those "self-hosted" will get flooded with traffic.

benjiro · 2025-11-19T12:59:26 1763557166

Ask yourself the question instead: is your service that important to need 99.999% uptime? Because I get the impression that people are so fixated on this uptime concept that the idea of being down for a few hours is the most horrible issue in the world. To the point that they would rather hand over control of their own system to a 3rd party than accept downtime.

The fact that cloudflare can literally read every bit of communication (as it sits between the client and your server) is already plenty bad. And yet, we accept this more easily than a bit of downtime. We shall not ask about the prices for that service ;)

To me it's nothing more than the whole "everybody on the cloud" issue, when most do not need the resources that cloud companies like AWS provide (and the bill), and yet get totally tied down to this one service.

I am getting old lol ...

throw0101c · 2025-11-19T13:37:45 1763559465

> Ask yourself the question instead: is your service that important to need 99.999% uptime?

What is the cost of many-9s uptime from Cloudflare? For DDoS protection it is $0/month on their free tier:

* https://www.cloudflare.com/en-ca/plans/

benjiro · 2025-11-20T11:21:41 1763637701

Not when you start pushing into the TB's range of monthly data... When you get that dreaded phone call from a CF rep, because the bill that is coming is no joke.

It's free as long as you really are small, not worth milking. The moment you can afford to run your own mini dc at your office, you start to enter the "well, hello there" for CF.

throw0101c · 2025-11-20T14:01:35 1763647295

> The moment you can afford to run your own mini dc at your office, you start to enter the "well, hello there" for CF.

As someone who has run (and is running) a DC with all the electrical/UPS, cooling, piping, HVAC+D stuff to deal with: it can be a lot of just time/overhead.

Especially if you don't have a number of folks in-house to deal with all that 'non-IT' equipment (I'm a bit strange in that I have an interest in both IT and HVAC-y stuff).

literalAardvark · 2025-11-19T19:57:35 1763582255

There are many systems that benefit from ddos protection without actually needing the high uptime.

The bandwidth costs of a ddos alone would close down a small shop.

Cloudflare provide an incredibly good service with a great track record, and sometimes shit happens.

KronisLV · 2025-11-19T10:39:34 1763548774

> There are many self-hosted alternatives to protect against botnet.

What would some good examples of those be? I think something like Anubis is mostly against bot scraping, not sure how you'd mitigate a DDoS attack well with self-hosted infra if you don't have a lot of resources?

On that note, what would be a good self-hosted WAF? I recall using mod_security with Apache and the OWASP ruleset, apparently the Nginx version worked a bit slower (e.g. https://www.litespeedtech.com/benchmarks/modsecurity-apache-... ), there was also the Coraza project but I haven't heard much about it https://coraza.io/ or maybe the people who say that running a WAF isn't strictly necessary also have a point (depending on the particular attack surface).

Genuine questions.

weberer · 2025-11-19T15:33:54 1763566434

>What would some good examples of those be?

There is haproxy-protection, which I believe is the basis of Kiwiflare. Clients making new connections have to solve a proof-of-work challenge that takes about 3 seconds of compute time.

Enterprise: https://www.haproxy.com/solutions/ddos-protection-and-rate-l...

FOSS: https://gitgud.io/fatchan/haproxy-protection

jve · 2025-11-19T10:59:08 1763549948

Well if you self host a DDoS protection service, that would be VERY expensive. You would need to rent rack space along with a very fast internet connection at multiple data centers to host this service.

purple_turtle · 2025-11-19T10:45:28 1763549128

Can you name three of these many alternatives?

How do they magically manage a DDOS larger than their bandwidth?

If the plan is to have larger bandwidth than any DDOS it is going to be expensive, quickly.

monerozcash · 2025-11-19T12:03:48 1763553828

You could probably get a very fat pipe with usage based billing, you'd only go bankrupt when you get hit by a big DDoS and not before.

lgeek · 2025-11-19T14:06:05 1763561165

If you're buying transit, you'll have a hard time getting away with less than 10% commit, i.e. you'll have to pay for 10 Gbps of transit to have a 100 Gbps port, which will typically run into 4 digits USD / month. You'll need a few hundred Gbps of network and scrubbing capacity to handle common DDoS attacks using amplification from script kids with a 10 Gbps uplink server that allow spoofing, and probably on the order of 50+ Tbps to handle Aisuru.

If you're just renting servers instead, you have a few options that are effectively closer to a 1% commit, but better have a plan B for when your upstreams drop you if the incoming attack traffic starts disrupting other customers - see Neoprotect having to shut down their service last month.

nijave · 2025-11-19T16:59:19 1763571559

We had better uptime with AWS WAF in us-east-1 than we've had in the last 1.5 years of Cloudflare.

I do like the flat cost of Cloudflare and feature set better but they have quite a few outages compared to other large vendors--especially with Access (their zero trust product)

I'd lump them into GitHub levels of reliability

We had a comparable but slightly higher quote from an Akamai VAR.

polack · 2025-11-19T11:43:53 1763552633

Yes, it's probably an overreaction.

But at the same time, what value do they add if they:

* Took down the customers' sites due to their bug.

* Never protected against an attack that our infra could not have handled by itself.

* Don't think that they will be able to handle the "next big ddos" attack.

It's just an extra layer of complexity for us. I'm sure there are attacks that they could help our customers with, that's why we're using them in the first place. But until the customers are hit with multiple ddos attacks that we can not handle ourselves, it's just not worth it.

dspillett · 2025-11-19T12:58:05 1763557085

> • Took down the customers' sites due to their bug.

That is always a risk with using a 3rd party service, or even adding extra locally managed moving parts. We use them in DayJob, and despite this huge issue and the number of much smaller ones we've experienced over the last few years their reliability has been pretty darn good (at least as good as the Azure infrastructure we have their services sat in front of).

> • Never protected against an attack that our infra could not have handled by itself.

But what about the next one… Obviously this is a question sensitive to many factors in our risk profiles and attitudes to that risk, there is no one right answer to the “but is it worth it?” question here.

On a slightly facetious point: if something malicious does happen to your infrastructure, that it does not cope well with, you won't have the “everyone else is down too” shield :) [only slightly facetious because while some of our clients are asking for a full report including justification for continued use of CF and any other 3rd parties, which is their right both morally and as written in our contracts, most, especially those who had locally managed services affected, have taken the “yeah, half our other stuff was affected too, what can you do?” viewpoint].

> • Don't think that they will be able to handle the "next big ddos" attack.

It is a war of attrition. At some point a new technique, or just a new botnet significantly larger than those seen before, will come along that they might not be able to deflect quickly. I'd be concerned if they were conceited enough not to be concerned about that possibility. Any new player is likely to practise on smaller targets first before directly attacking CF (in fact I assume that it is rather rare that CF is attacked directly) or a large enough segment of their clients to cause them specific issues. Could your infrastructure do any better if you happen to be chosen as one of those earlier targets?

Again, I don't know your risk profile so can't say which is the right answer, if there even is an easy one other than “not thinking about it at all” being a truly wrong answer. Also DDoS protection is not the only service many use CF for, so those need to be considered too if you aren't using them for that one thing.

tete · 2025-11-20T11:24:30 1763637870

I agree. I think the comments about how "it is fine, because so many things had to fail" do not apply in this case.

It's not that many things had to fail, it's that many things that are obvious haven't been done. It would be a valid excuse if many "exotic" scenarios would have to align, not when it's obvious error cases that weren't handled and changes have not been tested.

While having wrong first assumptions is just how things work when you try to analyze the issue[1], not testing changes before production is just stupidity and nothing else.

The story would be different if eg. multiple unlikely, hard to track things happened at once without anyone making a clearly linkable event, something that would also happen in staging. Most of the things mentioned could essentially be statically checked. This is the prime example of what you want as any tech person, because it's not hard to prevent compared to a lot of scenarios where you deal with balancing likelihoods of scenarios, timings, etc.

You don't think someone is a great plumber, because they forgot their tools and missed that big hole in the pipe and also rang at the wrong door, because all these things failed. You think someone is a good plumber if they said they would have to go back to fetch a bulky specialized tool, because this is the rare case in which they need it, but they could also do this other thing in this specific case. They are great plumbers if they tell you how this happened in the first place and how to fix it. They are great plumbers if they manage to fix something outside of their usual scope.

Here pretty much all of the things that you pay them for failed. At a large scale.

I am sure there are reasons for this which we don't know about, and I hope that CloudFlare can fix them. Be it management focusing on the wrong things, be it developers not being in the right position or annoyed enough to care, or something else entirely. However, not doing these things is (likely) a sign that currently they are not in the state of creating reliable systems - at least none reliable enough for what they are doing. It would be perfectly fine if they ran a web shop or something, but if, as just experienced, many other companies rely on you being up or their stuff fails, then maybe you should not run a company with products like "Always Online".

[1] And should make you adapt the process of analyzing issues. Eg. making sure config changes are "very loud" in monitoring. It's one of the most easily tracked things that can go wrong, and can relatively easily be mapped to a point in time compared to many other things.

kosolam · 2025-11-19T12:09:32 1763554172

So where are you migrating to?

JB_Dev · 2025-11-19T06:28:37 1763533717

Does their ring based rollout really truly have to be 0->100% in a few seconds?

I don’t really buy this requirement. At least make it configurable with a more reasonable default for “routine” changes. E.g. ramping to 100% over 1 hour.

As long as that ramp rate is configurable, you can retain the ability to respond fast to attacks by setting the ramp time to a few seconds if you truly think it’s needed in that moment.
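A minimal sketch of what a configurable ramp could look like, assuming a linear ramp and a stable per-machine bucket number (both are my assumptions for illustration, not anything Cloudflare has described):

```rust
use std::time::Duration;

/// Fraction of the fleet (0.0 to 1.0) that should have the change after
/// `elapsed`, given a linear ramp lasting `ramp`.
/// A zero-length ramp means "everywhere immediately".
fn rollout_fraction(elapsed: Duration, ramp: Duration) -> f64 {
    if ramp.is_zero() {
        return 1.0;
    }
    (elapsed.as_secs_f64() / ramp.as_secs_f64()).min(1.0)
}

/// Decide whether a given machine takes the change yet.
/// `bucket` is a hypothetical stable per-machine number in 0..10_000
/// (for example a hash of the hostname), so a machine that is already in
/// the rollout stays in it as the fraction grows.
fn machine_in_rollout(bucket: u32, fraction: f64) -> bool {
    f64::from(bucket) < fraction * 10_000.0
}
```

A routine change would pass something like `Duration::from_secs(3600)` as `ramp`; during an attack the operator could pass a few seconds instead and get close to the current behavior.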

cowsandmilk · 2025-11-19T10:02:16 1763546536

The configuration file is updated every five minutes, so clearly they have some past experience where they’ve decided an hour is too long. That said, even a roll out over five minutes can be helpful.

NicoJuicy · 2025-11-19T08:38:13 1763541493

I think defence against a DDOS against your network is the best reason for a quick rollout

matteocontrini · 2025-11-19T10:20:06 1763547606

This was not about DDoS defense but the Bot Management feature, which is a paid Enterprise-only feature not enabled by default to block automated requests regardless of whether an attack is going on.

https://developers.cloudflare.com/bots/get-started/bot-manag...

nijave · 2025-11-19T17:28:38 1763573318

Bots can also cause a DoS/DDoS. We use the feature to restrict certain AI scraper tools by user agent that adversely impact performance (they have a tendency to hammer "export all the data" endpoints much more than regular users do)

inemesitaffia · 2025-11-19T10:32:26 1763548346

So if you didn't enable it your stuff would work?

matteocontrini · 2025-11-19T10:39:18 1763548758

It would still fail if you were unluckily on the new proxy (it's not very clear why if the feature was not enabled, indeed):

> Unrelated to this incident, we were and are currently migrating our customer traffic to a new version of our proxy service, internally known as FL2. Both versions were affected by the issue, although the impact observed was different.

> Customers deployed on the new FL2 proxy engine, observed HTTP 5xx errors. Customers on our old proxy engine, known as FL, did not see errors, but bot scores were not generated correctly, resulting in all traffic receiving a bot score of zero. Customers that had rules deployed to block bots would have seen large numbers of false positives. Customers who were not using our bot score in their rules did not see any impact.

jabl · 2025-11-19T08:49:29 1763542169

Maybe, but in that case maybe have some special casing logic to detect that yes indeed we're under a massive DDOS at this very moment, do a rapid rollout of this thing that will mitigate said DDOS. Otherwise use the default slower one?

Of course, this is all so easy to say after the fact..

xp84 · 2025-11-19T09:14:40 1763543680

Isn’t CF under a ‘massive DDOS’ 24/7 pretty much by definition? When does malicious traffic rest, and how many targets of same aren’t using CF?

NicoJuicy · 2025-11-19T09:54:49 1763546089

It's literally in the blog post as well

> In the internal incident chat room, we were concerned that this might be the continuation of the recent spate of high volume Aisuru DDoS attacks:

rkachowski · 2025-11-19T10:07:33 1763546853

Classic @devops_borat

"To make error is human. To propagate error to all server in automatic way is #devops"

throw0101c · 2025-11-19T13:42:41 1763559761

> "To make error is human. To propagate error to all server in automatic way is #devops"

This saying dates back to 1969: To err is human but to really foul things up requires a computer.

* https://quoteinvestigator.com/2010/12/07/foul-computer/

Also: I know there’s a proverb which says ‘To err is human,’ but a human error is nothing to what a computer can do if it tries.

* https://quoteinvestigator.com/2017/05/26/computer-error/

mongol · 2025-11-19T12:17:33 1763554653

I miss him. It must be more than 10 years now

ignoramous · 2025-11-19T06:11:20 1763532680

> Their bot management system is designed to push a configuration out to their entire network rapidly.

Once every 5m is not "rapidly". It isn't uncommon for configuration systems to do it every few seconds [0].

> While it’s certainly useful to examine the root cause in the code.

I believe the issue is as much that the output of a periodic run (a clickhouse query), altered by an on-the-surface unrelated change, caused this failure. That is, the system that validated the configuration (FL2) was different from the one that generated it (ML Bot Management DB).

Ideally, it is the system that vends a complex configuration that also vends & tests the library to consume it, or the system that consumes it, does so as if it was "tasting" the configuration first before devouring it unconditionally [1].

Of course, as with all distributed system failures, this is all easier said and done in hindsight.

[0] Avoiding overload in distributed systems by putting the smaller service in control (pg 4), https://d1.awsstatic.com/builderslibrary/pdfs/Avoiding%20ove...

[1] Lessons from CloudFront (2016), https://youtube.com/watch?v=n8qQGLJeUYA&t=1050

Hamuko · 2025-11-19T06:20:08 1763533208

>Once every 5m is not "rapidly".

Isn't rapidly more of how long it takes to get from A to Z rather than how often it is performed? You can push out a configuration update every fortnight but if it goes through all of your global servers in three seconds, I'd call it quite rapid.

abalone · 2025-11-19T06:25:31 1763533531

By rapid I mean a rapid rollout of changes to 100% of the fleet, not how often changes are made.

evntdrvn · 2025-11-19T13:41:06 1763559666

Thanks for sharing that AWS doc

matternot · 2025-11-19T09:57:31 1763546251

I don't understand why they didn't validate and sanitize the new config file revision. If it is bad (whatever the reason), throw an error and revert to the previous version. You don't need to take down the whole internet for that.
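Roughly what I have in mind, as a sketch only. The one-name-per-line file format, the function names and the error type are made up for illustration; the limit of 200 is the number mentioned elsewhere in this thread:

```rust
/// Assumed hard limit on the number of ML features.
const MAX_FEATURES: usize = 200;

#[derive(Debug, PartialEq)]
enum ConfigError {
    Empty,
    TooManyFeatures { found: usize, max: usize },
}

/// Parse a feature file with one feature name per line (an assumed format).
/// Returns an error instead of panicking when the file is out of bounds.
fn parse_features(raw: &str) -> Result<Vec<String>, ConfigError> {
    let features: Vec<String> = raw
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();
    if features.is_empty() {
        return Err(ConfigError::Empty);
    }
    if features.len() > MAX_FEATURES {
        return Err(ConfigError::TooManyFeatures {
            found: features.len(),
            max: MAX_FEATURES,
        });
    }
    Ok(features)
}

/// Swap in the new revision only if it validates. On error, `current` is
/// left untouched, so the caller keeps serving with the last known good
/// revision and can log or alert on the returned error.
fn apply_revision(current: &mut Vec<String>, raw: &str) -> Result<(), ConfigError> {
    let candidate = parse_features(raw)?;
    *current = candidate;
    Ok(())
}
```

The point is just that the oversized file becomes an `Err` the caller has to deal with, not a panic in the request path.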

WJW · 2025-11-19T10:22:36 1763547756

Same as for almost every bug I think: the dev in question hadn't considered that the input could be bad in the way that it turned out to be. Maybe they were new, or maybe they hadn't slept much because of a newborn baby, or maybe they thought it was a reasonable assumption that there would never be more than 200 ML features in the array in question. I don't think this developer will ever make the same mistake again at least.

Let those who have never written a bug before cast the first stone.

adriand · 2025-11-19T11:24:58 1763551498

> Maybe they were new, or maybe they hadn't slept much because of a newborn baby

Reminds me of House of Dynamite, the movie about nuclear apocalypse that really revolves around these very human factors. This outage is a perfect example of why relying on anything humans have built is risky, which includes the entire nuclear apparatus. “I don’t understand why X wasn’t built in such a way that wouldn’t mean we live in an underground bunker now” is the sentence that comes to mind.

Yokohiii · 2025-11-20T09:44:05 1763631845

I don't think this is an error originating from a single human. At CF scale I'd expect that multiple humans saw that code and gave it a pass. Rust or not, but an experienced dev could have seen this can lead to issues. Panicking without restoring a healthy state is just not an option in this case. They *know* that.

I guess you are right, likely a social issue, but certainly not a single exhausted parent.

throw0101c · 2025-11-19T13:44:27 1763559867

> I don't understand why they didn't validate and sanitize the new config file revision.

The new config file was not (AIUI) invalid (syntax-wise) but rather too big:

> […] That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

> The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.

matternot · 2025-11-19T18:42:55 1763577775

if the config is too big, then it's an invalid config
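A minimal sketch of that idea, treating "too big" as a validation error at load time. `MAX_FEATURES`, `ConfigError` and `validate_features` are hypothetical names, and the limit of 200 is just the figure mentioned upthread, not something taken from Cloudflare's code.

```rust
/// Hypothetical limit; 200 is the number quoted upthread.
const MAX_FEATURES: usize = 200;

#[derive(Debug, PartialEq)]
enum ConfigError {
    TooManyFeatures { got: usize, max: usize },
}

/// Accepts the parsed feature list only if it fits the fixed limit.
/// An oversized file is reported as an error the caller must handle.
fn validate_features(features: Vec<String>) -> Result<Vec<String>, ConfigError> {
    if features.len() > MAX_FEATURES {
        return Err(ConfigError::TooManyFeatures {
            got: features.len(),
            max: MAX_FEATURES,
        });
    }
    Ok(features)
}
```

The caller then decides what to do with the `Err`, for example keep serving the previous file.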

dev_l1x_be · 2025-11-19T07:42:40 1763538160

Exactly the right take. Even when you want to have rapid changes on your infra, do it at least by region. You can start with the region where the least amount of users are impacted and if everything is fine, there is no elevated number of crashes for example, you can move forward. It was a standard practice at $RANDOM_FAANG when we had such deployments.
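Roughly what I mean, as a sketch only. `staged_rollout` and the `deploy_and_check` callback are made-up names; the real health check would be whatever crash or error-rate signal you trust.

```rust
/// Deploys to regions in the order given (caller puts the lowest-impact
/// region first) and stops at the first region whose health check fails.
/// On failure, returns the regions that were touched and need a rollback.
fn staged_rollout<F>(regions: &[&str], mut deploy_and_check: F) -> Result<(), Vec<String>>
where
    F: FnMut(&str) -> bool,
{
    let mut touched = Vec::new();
    for &region in regions {
        touched.push(region.to_string());
        // Hypothetical callback: push the change, then report health.
        if !deploy_and_check(region) {
            return Err(touched);
        }
    }
    Ok(())
}
```

The point is only that a failure in the first region never reaches the later ones.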

abalone · 2025-11-19T09:46:38 1763545598

Thank you. I am sympathetic to CF’s need to deploy these configs globally fast and don’t think slowing down their DDoS mitigation is necessarily a good trade off. What I am saying is this presents a bigger reliability risk and needs correspondingly finely crafted observability around such config changes and a rollback runbook. Greater risk -> greater attention.

twoodfin · 2025-11-19T12:28:44 1763555324

But the rapid deployment mechanism for bot features wasn’t where the bug was introduced.

In fact, the root bug (faulty assumption?) was in one or more SQL catalog queries that were presumably written some time ago.

(Interestingly the analysis doesn’t go into how these erroneous queries made it into production OR whether the assumption was “to spec” and it’s the security principal change work that was faulty. Seems more likely to be the former.)

abalone · 2025-11-19T18:42:30 1763577750

It was a change to the database that is used to generate a bot management config file. That file was the proximate cause for the panics. The kind of observability that would have helped here is “panics are elevated and here are the binary and config changes that preceded it,” along with a rollback runbook for it all.

Generally I would say we as an industry are more nonchalant about config changes vs binary changes. Where an org might have great processes and systems in place for binary rollouts, the whole fleet could be reading config from a database in a much more lax fashion. Those systems are quite risky actually.

twoodfin · 2025-11-19T19:05:03 1763579103

I am genuinely curious (albeit skeptical!) how anyone like Cloudflare could make that kind of feedback loop work at scale.

Even only in CF’s “critical path” there must be dozens of interconnected services and systems. How do you close the loop between an observed panic at the edge and a database configuration change N systems upstream?

mlrtime · 2025-11-19T11:46:07 1763552767

I've also led a team of Incident Commanders at a FAANG.

If this was a routine config change, I could see how it could take 2 hours to start the remediation plan. However they should have dashboards that correlate config setting changes with 500 errors (or equivalent). It gets difficult when you have many of these going out at the same time and they are slowly rolled out.

The root cause document is mostly high level and for the public. The details on this specific outage will be in an internal document with many action items, some of them maybe quarter-long projects, including fixing this specific bug and maybe some linter/monitor to prevent it from happening again.

jbs789 · 2025-11-19T06:15:56 1763532956

Thanks for this assessment.

In a productive way, this view also shifts the focus to improving the system (visibility etc), empowering the team, rather than focusing on the code which broke (probably strikes fear in the individuals, to do anything!)

HelloNurse · 2025-11-19T09:13:59 1763543639

The "coding error" is a somewhat deliberate choice to fail eagerly that is usually safe but doesn't align with the need to do something (propagation of the configuration file) without failing.

I'm sure that there are misapplied guidelines to do that instead of being nice to incoming bot management configuration files, and someone might have been scolded (or worse) for proposing or attempting to handle them more safely.

pas · 2025-11-19T07:31:59 1763537519

Rolling out new code should be done differently than rolling out new data to fight bots.

If every time there's a new bot someone needs to write code that can blow up their whole service, maybe they need to iterate a bit on this design?

watchful_moose · 2025-11-19T09:35:44 1763544944

This isn't what they do, though. This is a data/config push - original article says _a “feature file” used by our Bot Management system_

antihero · 2025-11-19T12:23:19 1763554999

I would say that whilst this is a good top down view, that `.unwrap()` should have been caught at code-review and not allowed. Clippy rule could have saved a lot of money.

That and why the hell wasn't their alerting showing up colossal amount of panics in their bot manager thing?

xmcqdpt2 · 2025-11-19T12:42:15 1763556135

Yes the lack of observability is really the disturbing bit here. You have panics in a bunch of your core infrastructure, you would expect there to be a big red banner on the dashboard that people look at when they first start troubleshooting an incident.

This is also a pretty good example why having stack traces by default is great. That error could have been immediately understood just from a stack trace and a basic exception message.

BrtByte · 2025-11-19T11:42:10 1763552530

You can write the safest code in the world, but if you're shipping config changes globally every few minutes without a robust rollback plan or telemetry that pinpoints when things go sideways, you're flying blind

tormeh · 2025-11-19T08:51:00 1763542260

Partial disagree. There should be lints against 'unwrap's. An 'expect' at least forces you to write down why you are so certain it can't fail. An unwrap is not just hubris, it's also laziness, and has no place in sensitive code.

And yes, there is a lint you can use against slicing ('indexing_slicing') and it's absolutely wild that it's not on by default in clippy.
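For illustration, this is the kind of code that lint pushes you towards: `get` instead of `[]`. `length_prefix` is a made-up example function, not from any real codebase.

```rust
use std::convert::TryInto;

/// Reads a 4-byte big-endian length prefix without indexing or slicing,
/// so a short buffer becomes `None` instead of a panic.
fn length_prefix(data: &[u8]) -> Option<u32> {
    // `get(..4)` returns None when fewer than 4 bytes are available.
    let bytes: [u8; 4] = data.get(..4)?.try_into().ok()?;
    Some(u32::from_be_bytes(bytes))
}
```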

milliams · 2025-11-19T10:33:04 1763548384

  [lints.clippy]
  dbg_macro = "deny"
  unwrap_used = "deny"
  expect_used = "deny"

tormeh · 2025-11-19T11:33:21 1763552001

Exactly. This should be the default for production code at companies like Cloudflare.

echelon · 2025-11-19T12:03:05 1763553785

https://github.com/search?q=unwrap%28%29+language%3ARust&typ...

This is sobering.

My new fear is some dependency unwrap()ing or expect()ing something where they didn't prove the correctness.

Unwrap() and expect() are an anti-pattern and have no place in idiomatic Rust code. The language should move to deprecate them.

an_ko · 2025-11-19T12:40:01 1763556001

I use unwrap a lot, and my most frequent target is unwrapping the result of Mutex::lock. Most applications have no reasonable way to recover from lock poisoning, so if I were forced to write a match for each such use site to handle the error case, the handler would have no choice but to just call panic anyway. Which is equivalent to unwrap, but much more verbose.

Perhaps it needs a scarier name, like "assume_ok".

echelon · 2025-11-19T12:53:52 1763556832

I use locks a lot too, and I always return a Result from lock access. Sometimes an anyhow::Result, but still something to pass up to the caller.

This lets me do logging at minimum. Sometimes I can gracefully degrade. I try to be as elegant in failure as possible, but not to the point where I wouldn't be able to detect errors or would enter a bad state.

That said, I am totally fine with your use case in your application. You're probably making sane choices for your problem. It should be on each organization to decide what the appropriate level of granularity is for each solution.

My worry is that this runtime panic behavior has unwittingly seeped into library code that is beyond our ability and scope to observe. Or that an organization sets a policy, but that the tools don't allow for rigid enforcement.

kevin_thibedeau · 2025-11-19T17:27:21 1763573241

> the handler would have no choice but to just call panic anyway

The handler could log the error and then panic. Much better than chasing bad hunches about a DDoS.
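Something like this, as a sketch. `lock_or_log` is a hypothetical helper, not a std API, and `eprintln!` stands in for whatever logging you actually use.

```rust
use std::sync::{Mutex, MutexGuard};

/// Locks the mutex; on poisoning, logs what was being accessed and only
/// then panics, so the failure leaves a trail before the process dies.
fn lock_or_log<'a, T>(m: &'a Mutex<T>, what: &str) -> MutexGuard<'a, T> {
    match m.lock() {
        Ok(guard) => guard,
        Err(poisoned) => {
            eprintln!("lock poisoned while accessing {what}: {poisoned}");
            panic!("lock poisoned: {what}");
        }
    }
}
```

Same end result as `unwrap`, but the verbosity lives in one place instead of at every call site.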

ok123456 · 2025-11-19T21:57:26 1763589446

Unwrap is in a lot of example code.

If you're using Result<T,E>, there's no automatic language feature for statically typing a nested E that mirrors how it was called.

So out of brevity, they unwrap.

Expect to see this sort of error crop up a lot as people use LLMs to vibe with the borrow checker.

ViewTrick1002 · 2025-11-19T12:19:21 1763554761

What would the difference be if they had enforced no unwraps/expects/slicing and instead logged the error and returned a 500?

As the user, I can't tell the difference, but it might have sped up their recovery a bit.

tomtomtom777 · 2025-11-19T12:26:15 1763555175

You could argue that explicitly writing down the assumption would make it clearer to yourself and your reviewer that it might be overly optimistic.

ljm · 2025-11-19T13:16:06 1763558166

Pretty much - the time spent ruling out the hypothesis that it was a cyberattack would have been time spent investigating the uptick in deliberately written error logs, since you would expect alerts to be triggered if those exceed a threshold.

I imagine it would also require less time debugging a panic. That kind of breadcrumb trail in your logs is a gift to the future engineer and also customers who see a shorter period of downtime.
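A rough sketch of the log-and-500 path, to make the comparison concrete. `bot_score` and `handle_request` are invented for illustration and say nothing about how the real proxy is structured.

```rust
/// Hypothetical scoring step: fails if the feature vector is larger than
/// the configured limit instead of assuming it always fits.
fn bot_score(features: &[f32], max_features: usize) -> Result<f32, String> {
    if features.len() > max_features {
        return Err(format!(
            "feature count {} exceeds limit {}",
            features.len(),
            max_features
        ));
    }
    // Placeholder score: mean of the features (0.0 for an empty slice).
    Ok(features.iter().sum::<f32>() / features.len().max(1) as f32)
}

/// Returns an HTTP-style status code instead of panicking.
fn handle_request(features: &[f32], max_features: usize) -> u16 {
    match bot_score(features, max_features) {
        Ok(_score) => 200,
        Err(e) => {
            // This log line is what an alert threshold could key on.
            eprintln!("bot management error: {e}");
            500
        }
    }
}
```

The user still gets a 500, but the error message names the cause directly.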

j-krieger · 2025-11-19T09:21:19 1763544079

I would love to go further and explicitly forbid unwrap and similar calls using a `no_panic` attribute.

stevefan1999 · 2025-11-20T05:05:04 1763615104

I actually have to do this for programs that run on bare metal. You can't afford to have a nondeterministic panic like this. If things have really gone wrong, you'd have a watchdog and health checker to verify the state of the program.

j-krieger · 2025-11-20T12:19:26 1763641166

How do you manage to do this?

stevefan1999 · 2025-11-20T14:06:46 1763647606

There's a crate that prevents linking the panic symbol in the final stage of executable generation, forcing it to be an undefined symbol. It is hard to find out where the panic is, so it effectively requires me to inspect the whole code to find it. Sometimes I have to disassemble the object file to see this.

j-krieger · 2025-11-21T10:55:17 1763722517

it's not the `no_panic` crate by david tolnay, is it?

zelphirkalt · 2025-11-19T09:29:05 1763544545

It is just 2 different layers. Of course the code is also a problem, if it is in fact as the GP describes it. You are taking the higher level view, which is the second layer of dealing with not only this specific mistake, but also other mistakes, that can be related to arbitrary code paths.

Both are important, and I am pretty sure, that someone is gonna fix that line of code pretty soon.

seethishat · 2025-11-19T12:32:09 1763555529

The bot is efficient. This is by design. It will push out mistakes just as efficiently as it pushes out good changes. Good or bad... the plane of control is unchanged.

This is the danger of automated control systems. If they get hacked or somehow push out bad things (CrowdStrike), they will have complete control and be very efficient.

nrhrjrjrjtntbt · 2025-11-19T07:56:09 1763538969

This guy SREs

ithkuil · 2025-11-19T09:35:14 1763544914

Back when it meant Site Reliability Engineer and not Sysadmin Really Expensive

wrs · 2025-11-19T00:32:07 1763512327

It seems people have a blind spot for unwrap, perhaps because it's so often used in example code. In production code an unwrap or expect should be reviewed exactly like a panic.

It's not necessarily invalid to use unwrap in production code if you would just call panic anyway. But just like every unsafe block needs a SAFETY comment, every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.
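A sketch of what that could look like. As far as I know the lint itself does not check for the comment; denying `clippy::unwrap_used` crate-wide just forces an explicit `allow` at each exception, and the comment convention rides on that. `loopback` is a made-up example.

```rust
use std::net::Ipv4Addr;

/// With `clippy::unwrap_used` denied crate-wide, each exception is opted
/// into locally and carries its justification.
#[allow(clippy::unwrap_used)]
fn loopback() -> Ipv4Addr {
    // INFALLIBILITY: the literal below is a valid IPv4 address, so parsing
    // cannot fail regardless of runtime input.
    "127.0.0.1".parse().unwrap()
}
```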

dist1ll · 2025-11-19T01:11:33 1763514693

> every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.

How about indexing into a slice/map/vec? Should every `foo[i]` have an infallibility comment? Because they're essentially `get(i).unwrap()`.

10000truths · 2025-11-19T01:36:59 1763516219

Yes? Funnily enough, I don't often use indexed access in Rust. Either I'm looping over elements of a data structure (in which case I use iterators), or I'm using an untrusted index value (in which case I explicitly handle the error case). In the rare case where I'm using an index value that I can guarantee is never invalid (e.g. graph traversal where the indices are never exposed outside the scope of the traversal), then I create a safe wrapper around the unsafe access and document the invariant.

dist1ll · 2025-11-19T02:04:22 1763517862

If that's the case then hats off. What you're describing is definitely not what I've seen in practice. In fact, I don't think I've ever seen a crate or production codebase that documents infallibility of every single slice access. Even security-critical cryptography crates that passed audits don't do that. Personally, I found it quite hard to avoid indexing for graph-heavy code, so I'm always on the lookout for interesting ways to enforce access safety. If you have some code to share that would be very interesting.

10000truths · 2025-11-19T04:10:00 1763525400

My rule of thumb is that unchecked access is okay in scenarios where both the array/map and the indices/keys are private implementation details of a function or struct, since an invariant is easy to manually verify when it is tightly scoped as such. I've seen it used in:

* Graph/tree traversal functions that take a visitor function as a parameter

* Binary search on sorted arrays

* Binary heap operations

* Probing buckets in open-addressed hash tables
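For the binary search case, a minimal sketch of a safe wrapper with the invariant written down. This is my own illustration, not code from any particular crate.

```rust
/// Returns the index of `target` in the sorted slice, if present.
/// The unchecked access never escapes this function.
fn binary_search(sorted: &[i32], target: i32) -> Option<usize> {
    let (mut lo, mut hi) = (0usize, sorted.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        // SAFETY: the loop maintains lo <= mid < hi <= sorted.len(),
        // so `mid` is always in bounds.
        let v = unsafe { *sorted.get_unchecked(mid) };
        if v == target {
            return Some(mid);
        }
        if v < target {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    None
}
```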

koito17 · 2025-11-19T07:40:33 1763538033

> I don't think I've ever seen a crate or production codebase that documents infallibility of every single slice access.

The smoltcp crate typically uses runtime checks to ensure slice accesses made by the library do not cause a panic. It's not exactly equivalent to GP's assertion, since it doesn't cover "every single slice access", but it at least covers slice accesses triggered by the library's public API. (i.e. none of the public API functions should cause a panic, assuming that the runtime validation after the most recent mutation succeeds).

Example: https://docs.rs/smoltcp/latest/src/smoltcp/wire/ipv4.rs.html...

zelphirkalt · 2025-11-19T09:35:30 1763544930

I think this goes against the Rust goals in terms of performance. Good for safe code, of course, but usually Rust users like to have compile time safety to make runtime safety checks unnecessary.

hansvm · 2025-11-19T02:47:12 1763520432

> graph-heavy code

Could you share some more details, maybe one fully concrete scenario? There are lots of techniques, but there's no one-size-fits-all solution.

dist1ll · 2025-11-19T03:06:26 1763521586

Sure, these days I'm mostly working on a few compilers. Let's say I want to make a fixed-size SSA IR. Each instruction has an opcode and two operands (which are essentially pointers to other instructions). The IR is populated in one phase, and then lowered in the next. During lowering I run a few peephole and code motion optimizations on the IR, and then do regalloc + asm codegen. During that pass the IR is mutated and indices are invalidated/updated. The important thing is that this phase is extremely performance-critical.

wrs · 2025-11-19T04:06:21 1763525181

And it's fine for a compiler to panic when it violates an assumption. Not so with the Cloudflare code under discussion.

echelon · 2025-11-19T12:05:44 1763553944

Idiomatic Rust would have been to return a Result<> to the caller, not to surprise them with a panic.

The developer was lazy.

A lot of Rust developers are: https://github.com/search?q=unwrap%28%29+language%3ARust&typ...

hansvm · 2025-11-19T15:22:07 1763565727

One normal "trick" is phantom typing. You create a type representing indices and have a small, well-audited portion of unsafe code handling creation/unpacking, where the rest of the code is completely safe.

The details depend a lot on what you're doing and how you're doing it. Does the graph grow? Shrink? Do you have more than one? Do you care about programmer error types other than panic/UB?

Suppose, e.g., that your graph doesn't change sizes, you only have one, and you only care about panics/UB. Then you can get away with:

1. A dedicated index type, unique to that graph (shadow / strong-typedef / wrap / whatever), corresponding to whichever index type you're natively using to index nodes.

2. Some mechanism for generating such indices. E.g., during graph population phase you have a method which returns the next custom index or None if none exist. You generated the IR with those custom indexes, so you know (assuming that one critical function is correct) that they're able to appropriately index anywhere in your graph.

3. You have some unsafe code somewhere which blindly trusts those indices when you start actually indexing into your array(s) of node information. However, since the very existence of such an index is proof that you're allowed to access the data, that access is safe.

Techniques vary from language to language and depending on your exact goals. GhostCell [0] in Rust is one way of relegating literally all of the unsafe code to a well-vetted library, and it uses tagged types (via lifetimes), so you can also do away with the "only one graph" limitation. It's been awhile since I've looked at it, but resizes might also be safe pretty trivially (or might not be).

The general principle though is to structure your problem in such a way that a very small amount of code (so that you can more easily prove it correct) can provide promises that are enforceable purely via the type system (so that if the critical code is correct then so is everything else).

That's trivial by itself (e.g., just rely on option-returning .get operators), so the rest of the trick is to find a cheap place in your code which can provide stronger guarantees. For many problems, initialization is the perfect place (e.g., you can bounds-check on init and then not worry about it again) (e.g., if even bounds-checking on initialization is too slow then you can still use the opportunity at initialization to write out a proof of why some invariant holds and then blindly/unsafely assert it to be true, but you then immediately pack that hard-won information into a dedicated type so that the only place you ever have to think about it is on initialization).

[0] https://plv.mpi-sws.org/rustbelt/ghostcell/
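A stripped-down sketch of steps 1 and 2, with hypothetical `Graph` and `NodeId` types. It keeps the ordinary bounds check in `label`, because nothing here stops an index from one graph being used on another; the unchecked access from step 3 would only be sound once that "only one graph" assumption is actually enforced (which is what the GhostCell-style branding is for).

```rust
mod graph {
    /// Index that can only be minted by `Graph::push` (the field is
    /// private to this module), so holding one means it was in bounds
    /// for the graph that created it. The graph never shrinks.
    #[derive(Clone, Copy, Debug, PartialEq)]
    pub struct NodeId(usize);

    pub struct Graph {
        nodes: Vec<&'static str>,
    }

    impl Graph {
        pub fn new() -> Self {
            Graph { nodes: Vec::new() }
        }

        /// The only place a `NodeId` is created.
        pub fn push(&mut self, label: &'static str) -> NodeId {
            self.nodes.push(label);
            NodeId(self.nodes.len() - 1)
        }

        pub fn label(&self, id: NodeId) -> &'static str {
            // Checked indexing is kept in this sketch; see the note above
            // about when an unchecked access would be justified.
            self.nodes[id.0]
        }
    }
}
```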

dist1ll · 2025-11-19T16:09:27 1763568567

I do use a combination of newtyped indices + singleton arenas for data structures that only grow (like the AST). But for the IR, being able to remove nodes from the graph is very important. So phantom typing wouldn't work in that case.

tux3 · 2025-11-19T01:21:02 1763515262

Usually you'd want to write almost all your slice or other container iterations with iterators, in a functional style.

For the 5% of cases that are too complex for standard iterators? I never bother justifying why my indexes are correct, but I don't see why not.

You very rarely need SAFETY comments in Rust because almost all the code you write is safe in the first place. The language also gives you the tool to avoid manual iteration (not just for safety, but because it lets the compiler eliminate bounds checks), so it would actually be quite viable to write these comments, since you only need them when you're doing something unusual.

wrs · 2025-11-19T02:13:16 1763518396

I didn't restate the context from the code we're discussing: it must not panic. If you don't care if the code panics, then go ahead and unwrap/expect/index, because that conforms to your chosen error handling scheme. This is fine for lots of things like CLI tools or isolated subprocesses, and makes review a lot easier.

So: first, identify code that cannot be allowed to panic. Within that code, yes, in the rare case that you use [i], you need to at least try to justify why you think it'll be in bounds. But it would be better not to.

There are a couple of attempts at getting the compiler to prove that code can't panic (e.g., the no-panic crate).

lenkite · 2025-11-19T05:31:43 1763530303

What about memory allocation - how will you stop that from panicking? `Vec::resize` can always panic (or abort) in Rust when it has to allocate. And this is just one example out of thousands in the Rust stdlib.

Unless the language addresses no-panic in its governing design or allows try-catch, not sure how you go about this.

wrs · 2025-11-19T05:48:23 1763531303

That is slowly being addressed, but meanwhile it’s likely you have a reliable upper bound on how much heap your service needs, so it’s a much smaller worry. There are also techniques like up-front or static allocation if you want to make more certain.
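One piece that is already usable on stable Rust is `Vec::try_reserve`. A small sketch of a fallible resize built on it; `resize_fallible` is a made-up helper name.

```rust
use std::collections::TryReserveError;

/// Grows or shrinks `buf` to `new_len` zeroed bytes, reporting a failed
/// allocation as an error instead of taking the process down.
fn resize_fallible(buf: &mut Vec<u8>, new_len: usize) -> Result<(), TryReserveError> {
    if new_len > buf.len() {
        // Reserve the extra capacity up front; this is the fallible step.
        buf.try_reserve(new_len - buf.len())?;
    }
    // Capacity is already sufficient, so this does not allocate.
    buf.resize(new_len, 0);
    Ok(())
}
```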

NewJazz · 2025-11-19T05:58:39 1763531919

Yep and this postmortem details how their proxy modules use static allocation.

echelon · 2025-11-19T12:38:02 1763555882

I'm far more worried about some dependency calling unwrap() or expect() now.

https://github.com/search?q=unwrap%28%29+language%3ARust&typ...

This is ridiculous. We're probably going to start seeing more of these. This was just the first, big highly visible instance.

We should have a name for this similar to "my code just NPE'd". I suggest "unwrapped", as in, "My Rust app just unwrapped a present."

I think we should start advocating for the deprecation and eventual removal of the unwrap/expect family of methods. There's no reason engineers shouldn't be handling Options and Results gracefully, either passing the state to the caller or turning to a success or fail path. Not doing this is just laziness.

assbuttbuttass · 2025-11-19T05:55:29 1763531729

In TFA they mentioned they preallocate all the memory up front

kibwen · 2025-11-19T02:34:33 1763519673

Indexing is comparatively rare given the existence of iterators, IMO. If your goal is to avoid any potential for panicking, I think you'd have a harder time with arithmetic overflow.
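For reference, the non-panicking route for arithmetic is the `checked_*` family. A tiny sketch with a made-up `total_bytes` function:

```rust
/// Sums sizes, returning `None` on overflow instead of panicking (the
/// default in debug builds) or wrapping silently (the default in release).
fn total_bytes(sizes: &[u32]) -> Option<u32> {
    sizes.iter().try_fold(0u32, |acc, &s| acc.checked_add(s))
}
```

It works, but you have to remember to do it at every arithmetic site, which is the hard part.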

echelon · 2025-11-19T03:59:18 1763524758

Cargo needs to grow a label for crates that provably do not panic. (Neverminding allocations and things outside our control flow.)

I want to ban crates that panic from my dependency chain.

The language could really use an extra set of static guarantees around this. I would opt in.

lenkite · 2025-11-19T05:33:25 1763530405

> I want to ban crates that panic from my dependency chain.

Which means banning anything that allocates memory and thousands of stdlib functions/methods.

echelon · 2025-11-19T05:59:22 1763531962

See the immediately preceding sentence.

I'm fine with allocation failures. I don't want stupid unwrap()s, improper slice access, or other stupid and totally preventable behavior.

There are things inside the engineer's control. I want that to not panic.

throwaway2037 · 2025-11-19T07:05:57 1763535957

Your pair of posts is very interesting to me. Can you share with me: What is your programming environment such that you are "fine with allocation failures"? I'm not doubting you, but for me, if I am doing systems programming with C or C++, my program is doomed if a malloc fails! When I saw your post, I immediately thought: Am I doing it wrong? If I get a NULL back from malloc(), I just terminate with an error message.

deredede · 2025-11-19T07:55:38 1763538938

Not GP but I read "I'm fine with allocation failures" as "I'm OK with my program terminating if it can't allocate (but not for other errors)".

johnisgood · 2025-11-19T09:14:27 1763543667

I mean, yeah, if I am using a library, as a user of this library, I would like to be able to handle the error myself. Having the library decide to panic, for example, is the opposite of it.

echelon · 2025-11-19T11:41:46 1763552506

If I can't allocate memory, I'm typically okay with the program terminating.

I don't want dependencies deciding to unwrap() or expect() some bullshit and that causing my entire program to crash because I didn't anticipate or handle the panic.

Code should be written, to the largest extent possible, to mitigate errors using Result<>. This is just laziness.

I want checks in the language to safeguard against lazy Rust developers. I don't want their code in my dependency tree, and I want static guarantees against this.

edit: I just searched unwrap() usage on Github, and I'm now kind of worried/angry:

https://github.com/search?q=unwrap%28%29+language%3ARust&typ...

A lot of this is just pure laziness.

PhilipRoman · 2025-11-19T11:15:15 1763550915

Not to mention overcommit has become standard behavior on many systems, so you wouldn't even get a NULL unless you really tried.

phire · 2025-11-19T04:52:03 1763527923

I think I'd prefer a compile-time guarantee.

Something that allows me to annotate a function (or my whole crate) as "no panic", and get a compile error if the function or anything it calls has a reachable panic.

This will allow it to work with many unmodified crates, as long as constant propagation can prove that any panics are unreachable. This approach will also allow crates to provide panicking and non panicking versions of their API (which many already do).

echelon · 2025-11-19T05:16:34 1763529394

Yes, I want that. I also want to be able to (1) statically apply a badge on every crate that makes and meets these guarantees (including transitively with that crate's own dependencies) so I can search crates.io for stronger guarantees and (2) annotate my Cargo.toml to not import crates that violate this, so time isn't wasted compiling - we know it'll fail in advance.

On the subject of this, I want more ability to filter out crates in our Cargo.toml. Such as a max dependency depth. Or a frozen set of dependencies that is guaranteed not to change so audits are easier. (Obviously we could vendor the code in and be in charge of our own destiny, but this feels like something we can let crate authors police.)

aw1621107 · 2025-11-19T05:15:21 1763529321

I think the most common solution at the moment is dtolnay's no_panic [0]. That has a bunch of caveats, though, and the ergonomics leave something to be desired, so a first-party solution would probably be preferable.

[0]: https://github.com/dtolnay/no-panic

yakshaving_jgt · 2025-11-19T09:46:08 1763545568

This sounds a little bit like Safe Haskell, which never really took off.

echelon · 2025-11-19T12:08:54 1763554134

I would be fine just getting rid of unwrap(), expect(), etc. That's still a net win.

Look at how many lazy cases of this there are in Rust code [1].

Some of these are no doubt tested (albeit impossible to statically guarantee), but a lot of it looks like sloppiness or not leaning on the language's strong error handling features.

It's disappointing to see. We've had so much of this creep into the language that eventually it caused a major stop-the-world outage. This is unlikely to be the last time we see it.

[1] https://github.com/search?q=unwrap%28%29+language%3ARust&typ...

yakshaving_jgt · 2025-11-19T12:17:41 1763554661

I don't write Rust so I don't really know, but from someone else's description here it sounds similar to `fromJust` in Haskell which is a common newbie footgun. I think you're right that this is a case of not using the language properly, though I know I was seduced into the idea that Haskell is safe by default when I was first learning, which isn't quite true — the safety features are opt-in.

A language DX feature I quite like is when dangerous things are labelled as such. IIRC, some examples of this are `accursedUnutterablePerformIO` in Haskell, and `DO_NOT_USE_OR_YOU_WILL_BE_FIRED_EXPERIMENTAL_CREATE_ROOT_CONTAINERS` in React.js.

echelon · 2025-11-19T12:43:24 1763556204

I would be in favor of renaming unwrap() and its family to `unwrap_do_not_use_or_you_will_break_the_internet()`

I still think we should remove them outright or make production code fail to compile without a flag allowing them. And we also need tools to start cleaning up our dependency tree of this mess.

dist1ll · 2025-11-19T01:41:57 1763516517

For iteration, yes. But there's other cases, like any time you have to deal with lots of linked data structures. If you need high performance, chances are that you'll have to use an index+arena strategy. They're also common in mathematical codebases.

danielheath · 2025-11-19T01:17:09 1763515029

I mean... yeah, in general. That's what iterators are for.

brabel · 2025-11-19T07:56:02 1763538962

Yes, I always thought it was wrong to use unwrap in examples. I know, people want to keep examples simple, but it trains developers to use unwrap() as they see that everywhere. Yes, there are places where it's ok as that blog post explains so well: https://burntsushi.net/unwrap/ But most devs IMHO don't have the time to make the call correctly most of the time... so it's just better to do something better, like handle the error and try to recover, or if impossible, at least do `expect("damn it, how did this happen")`.

gwd · 2025-11-19T09:35:44 1763544944

> Yes, I always thought it was wrong to use unwrap in examples.

And because it gets picked up by LLMs. It would be interesting to know if this particular .unwrap() was written by a human.

mattacular · 2025-11-19T12:15:59 1763554559

There is a prevailing mentality that LLMs make it easy to become productive in new languages, if you are already proficient in one. That's perhaps true until you suddenly bump up against the need to go beyond your superficial understanding of the new language and its idiosyncrasies. These little collisions with reality occur until one of them sparks an issue of this magnitude.

In theory, experienced human code reviewers can course correct newer LLM-guided devs' work before it blows up. In practice, reviewers are already stretched thin, and submitters' ability to now rapidly generate more and more code to review makes that exhaustion effect way worse. It becomes less likely they spot something small but obvious amongst the haystack of LLM generated code coming their way.

gwd · 2025-11-20T09:48:35 1763632115

> There is a prevailing mentality that LLMs make it easy to become productive in new languages, if you are already proficient in one.

Yes, and: I've found this to be mostly true, if you make sure you take the time to deeply understand what the code is doing. When I asked an LLM to do something for me in Javascript, then I said, "What if X happens, wouldn't that cause Y? Would it be better to restructure it like so and so to make it more robust?" The LLM immediately improves it.

Any experienced programmer who was taking the time to review this code, on learning that unwrap() has a "panic" inside, would certainly change it. But as you say, reviewers are already stretched thin.

empath75 · 2025-11-19T16:41:33 1763570493

I have removed many 'unwraps' from copilot and cursor generated code, but cursor is a lot better at it than copilot was.

inferiorhuman · 2025-11-19T08:46:04 1763541964

Dunno, I think the alternatives have their own pretty significant downsides. All would require front loading more in-depth understanding of error handling and some would just be quite a bit more verbose.

IMO making unwrap a clippy lint (or perhaps a warning) would be a decent start. Or maybe renaming unwrap.

jandrewrogers · 2025-11-19T13:29:56 1763558996

This strikes me as a culture issue more than one of language.

A tenet of systems code is that every possible error must be handled explicitly and exhaustively close to the point of occurrence. It doesn’t matter if it is Rust, C, etc. Knowing how to write systems code is unrelated to knowing a systems language. Rust is a systems language but most people coming into Rust have no systems code experience and are “holding it wrong”. It has been a recurring theme I’ve seen with Rust development in a systems context.

C is pretty broken as a language but one of the things going for it is that it has a strong systems code culture surrounding it that remembers e.g. why we do all of this extra error handling work. Rust really needs systems code practice to be more strongly visible in the culture around the language.

empath75 · 2025-11-19T16:51:23 1763571083

Unwrap _is_ explicitly handling an error at the point of occurrence. You have explicitly decided to panic, which is sometimes a valid choice. I use it (on startup only) when server configs are missing or invalid or in CLI tools when the options aren't valid. Crashing a pod on startup before it goes Ready is a valid pattern in k8s and generally won't cause an outage because the previous pod will continue working.
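
A minimal sketch of that startup-only pattern, assuming a hypothetical `Config` type and a made-up one-line `max_features=<n>` format (neither comes from this thread or from Cloudflare's code). Booting panics on a bad config, while a runtime reload keeps the last good one:

```rust
#[derive(Debug, Clone, PartialEq)]
struct Config {
    max_features: usize,
}

// Hypothetical parser for a one-line "max_features=<n>" format.
fn parse_config(raw: &str) -> Result<Config, String> {
    let value = raw
        .trim()
        .strip_prefix("max_features=")
        .ok_or_else(|| format!("missing max_features key in {raw:?}"))?;
    let max_features = value
        .parse::<usize>()
        .map_err(|e| format!("bad max_features {value:?}: {e}"))?;
    Ok(Config { max_features })
}

// Startup path: panicking is deliberate, the process has not served traffic yet.
fn load_at_startup(raw: &str) -> Config {
    parse_config(raw).expect("invalid config at startup")
}

// Reload path: a bad update is logged and the previous config stays in use.
fn reload(current: &Config, raw: &str) -> Config {
    match parse_config(raw) {
        Ok(new) => new,
        Err(err) => {
            eprintln!("config reload rejected, keeping previous: {err}");
            current.clone()
        }
    }
}
```

The point of splitting the two paths is that the same parse failure gets a different policy depending on whether anything is already running on the old state.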

inferiorhuman · 2025-11-19T23:43:36 1763595816

> A tenet of systems code is that every possible error must be handled explicitly and exhaustively close to the point of occurrence.

All the more reason it doesn't really belong in examples for third party libraries.

wongarsu · 2025-11-19T09:25:11 1763544311

> at least do `expect("damn it, how did this happen")`

That gives you the same behavior as unwrap with a less useful error message though. In theory you can write useful messages, but in practice (and in your example) expect is rarely better than unwrap in modern Rust.

echelon · 2025-11-19T11:39:07 1763552347

We shouldn't be using unwrap() or expect() at all.

This is Rust's Null Pointer Exception.

unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers or Rust developers not utilizing the language's design features.

The language should grow the ability to mark this code as dangerous, and we should have static tools to exclude this code from our dependency tree.

I don't want some library I use to `unwrap()` and cause my application to crash because I didn't anticipate their stupid panic.

Rust developers have clearly leaned on this crutch far too often:

https://github.com/search?q=unwrap%28%29+language%3ARust&typ...

The Rust team needs to plug this leak.

burntsushi · 2025-11-19T13:06:15 1763557575

I'm on the Rust libs-api team and you're mistaken. I use `unwrap()` all the time.

My blog on this topic was linked above, you should read it: https://burntsushi.net/unwrap/

underdeserver · 2025-11-19T13:59:51 1763560791

Eh, "mistaken" might be a bit harsh. He's stating an opinion, which you and I disagree with.

> The language should grow the ability to mark this code as dangerous, and we should have static tools to exclude this code from our dependency tree.

Might be useful to point out that this static tool exists (clippy::unwrap_used).
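
For anyone who hasn't used it, a minimal sketch of opting in. `clippy::unwrap_used` and `clippy::expect_used` are in Clippy's restriction group, so they are off unless you enable them. The `parse_port` function is just a hypothetical example target:

```rust
// Item-level form so the snippet compiles on its own. The more common
// placement is crate-wide, as `#![deny(clippy::unwrap_used, clippy::expect_used)]`
// at the top of main.rs or lib.rs.
#[deny(clippy::unwrap_used, clippy::expect_used)]
fn parse_port(raw: &str) -> Result<u16, std::num::ParseIntError> {
    // Writing `raw.trim().parse::<u16>().unwrap()` here would fail `cargo clippy`.
    raw.trim().parse::<u16>()
}
```

Two caveats: plain `cargo build` does not run these lints, only `cargo clippy` does, and they check the crate being linted, not its dependencies, so this does not cover the "exclude it from my dependency tree" part of the request. The same setting can also live in the `[lints.clippy]` table of Cargo.toml.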

burntsushi · 2025-11-19T15:05:45 1763564745

They said:

> unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers or Rust developers not utilizing the language's design features.

That's factually incorrect. (And insulting.)

underdeserver · 2025-11-19T15:37:43 1763566663

I disagree with that characterization. Using unwrap() like you suggest in your blog post is an intentional, well-thought-out choice. Using unwrap() the way Cloudflare did it is, with hindsight, a bad choice, that doesn't utilize the language's design features.

Note that they're not criticizing the language. I read "Rust developers" in this context as developers using Rust, not those who develop the language and ecosystem. (In particular they were not criticizing you.)

I think it's reasonable to question the use of unwrap() in this context. Taking a cue from your blog post^ under runtime invariant violations, I don't think this use matches any of your cases. They assumed the size of a config file is small, it wasn't, so the internet crashed.

^ https://burntsushi.net/unwrap/#what-is-my-position

wongarsu · 2025-11-19T15:55:42 1763567742

Echelon's comment was "We shouldn't be using unwrap() or expect() at all. [...] unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers". Even in my most generous interpretation I can't see how that is anything except a rejection of all unwraps (and equivalent constructs like expect()).

I fully agree with burntsushi that echelon is taking an extreme and arguably wrong stance. His sentiment becomes more and more correct as Rust continues to evolve ways to avoid unwrap as an ergonomic shortcut, but I don't think we are quite there yet for general use. There absolutely is code that should never panic, but that involves tradeoffs and design choices that aren't true for every project (or even the majority of them)

burntsushi · 2025-11-19T15:44:51 1763567091

The commenter also said:

> We shouldn't be using unwrap() or expect() at all.

So the context of their comment is not some specific nuanced example. They made a blanket statement.

> Note that they're not criticizing the language. I read "Rust developers" in this context as developers using Rust, not those who develop the language and ecosystem.

I have the same interpretation.

> I think it's reasonable to question the use of unwrap() in this context. Taking a cue from your blog post^ under runtime invariant violations, I don't think this use matches any of your cases. They assumed the size of a config file is small, it wasn't, so the internet crashed.

Yes? I didn't say it wasn't reasonable to question the use of unwrap() here. I don't think we really have enough information to know whether it was inappropriate or not.

unwrap() is all about nuance. I hope my blog post conveyed that. Because unwrap() is a manifestation of an assertion on a runtime invariant. A runtime invariant can be arbitrarily complicated. So saying things like, "we shouldn't be using unwrap() or expect() at all" is an extreme position to carve out that is also way too generalized.

I stand by what I said. They are factually mistaken in their characterization of the use of unwrap()/expect() in general.

underdeserver · 2025-11-19T15:49:36 1763567376

> So the context of their comment is not some specific nuanced example. They made a blanket statement.

That is their opinion, I disagree with it, but I don't think it's an insulting or invalid opinion to have. There are codebases that ban nulls in other languages too.

> They are factually mistaken in their characterization of the use of unwrap()/expect() in general.

It's an opinion about a stylistic choice. I don't see what fact there is here that could be mistaken.

burntsushi · 2025-11-19T15:57:52 1763567872

I'm finding this exchange frustrating, and now we're going in circles. I'll say this one last time in as clear language as I can. They said this:

> unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers or Rust developers not utilizing the language's design features.

The factually incorrect part of this is the statement that use of `unwrap()`, `expect()` and so on is caused by X or Y, where X is "lazy Rust developers" and Y is "Rust developers not utilizing the language's design features." But there are, factually, other causes than X or Y for use of `unwrap()`, `expect()` and so on. So stating that it is all caused by X or Y is factually incorrect. Moreover, X is 100% insulting when applied to any one specific individual. Y can be insulting when applied to any one specific individual.

Now this:

> We shouldn't be using unwrap() or expect() at all.

That's an opinion. It isn't factually incorrect. And it isn't insulting.

underdeserver · 2025-11-19T16:15:54 1763568954

I'm sorry I'm frustrating you. It was not my intention. For what it's worth, I use ripgrep every day, and it's made my life appreciably better. (Same goes for Astral products.) Thank you for that, and I wish your day improves.

> unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers or Rust developers not utilizing the language's design features

I just read that line as shorthand for large outages caused by misuse of unwrap(), expect(), bad math etc. - all caused by...

That's also an opinion, by my reading.

I assumed we were talking specifically about misuses, not all uses of unwrap(), or all bad bugs. Anyway, I think we're ultimately saying the same thing. It's ironic in its own way.

petcat · 2025-11-19T12:48:01 1763556481

How would they plug it? Just deprecate .unwrap and .expect and then remove them from the API completely?

frumplestlatz · 2025-11-19T09:22:52 1763544172

I have to disagree that unwrap is ever OK. If you have to use unwrap, your types do not match your problem. Fix them. You have encoded invariants in your types that do not match reality.

Change your API boundary, surface the discrepancy between your requirements and the potential failing case at the edges where it can be handled.

If you need the value, you need to handle the case that it’s not available explicitly. You need to define your error path(s)

Anything else leads to, well, this.
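
One way to read "surface the discrepancy at the edges", sketched with hypothetical names (`FeatureSet`, `ConfigError`, and the limit value are illustrative, not Cloudflare's actual code). The constructor returns a `Result`, so a caller cannot get a value without deciding what an oversized input means:

```rust
use std::fmt;

// Illustrative limit, chosen only for the example.
const MAX_FEATURES: usize = 200;

#[derive(Debug, PartialEq)]
enum ConfigError {
    TooManyFeatures { found: usize, limit: usize },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::TooManyFeatures { found, limit } => {
                write!(f, "config has {found} features, limit is {limit}")
            }
        }
    }
}

impl std::error::Error for ConfigError {}

#[derive(Debug)]
struct FeatureSet {
    names: Vec<String>,
}

impl FeatureSet {
    // The size check happens once, at the boundary, and failure is in the signature.
    fn new(names: Vec<String>) -> Result<Self, ConfigError> {
        if names.len() > MAX_FEATURES {
            return Err(ConfigError::TooManyFeatures {
                found: names.len(),
                limit: MAX_FEATURES,
            });
        }
        Ok(FeatureSet { names })
    }

    fn len(&self) -> usize {
        self.names.len()
    }
}
```

Code that receives a `FeatureSet` can then rely on the limit without re-checking it.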

echelon · 2025-11-19T11:50:56 1763553056

This.

This is a failure caused by lazy Rust programming and not relying on the language's design features.

It's a shame this code can even be written. It is surprising and escapes the expected safety of the language.

I'm terrified of some dependency using unwrap() or expect() and crashing for something entirely outside of my control.

We should have an opt-in strict Cargo.toml declaration that forbids compilation of any crate that uses entirely preventable panics. The only panics I'll accept are those relating to memory allocation.

This is one of the sharpest edges in the language, and it needs to be smoothed away.

burntsushi · 2025-11-19T13:07:23 1763557643

The blog linked in the GP anticipates this rebuttal and already addresses it.

Your argument also implies that things like `slice[i]` are never okay.

frumplestlatz · 2025-11-19T14:19:57 1763561997

`slice[i]` is also a hole in the type system, but at least it’s generally relying on a local invariant, immediate to the surrounding context, that does not require lying about invariants across your API surface.

The blog post doesn’t address the issue, it simply pretends it’s not a real problem.

Also from the post: “If we were to steelman advocates in favor of this style of coding, then I think the argument is probably best limited to certain high reliability domains. I personally don’t have a ton of experience in said domains …”

Enough said.

burntsushi · 2025-11-19T15:10:38 1763565038

`slice[i]` is just sugar for `slice.get(i).unwrap()`. And whether it's a "local" invariant or not is orthogonal. And `unwrap()` does not "require lying about invariants across your API surface."
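
A small sketch of that relationship using only the standard library (the function names are made up for the example). For an in-range index the two spellings return the same value, and `get` exposes the out-of-range case as `None` where indexing would panic. Strictly speaking, indexing goes through the `Index` trait rather than literally calling `unwrap()`; the panicking branch is the part they share.

```rust
// Indexing: panics if `i >= values.len()`.
fn by_index(values: &[u32], i: usize) -> u32 {
    values[i]
}

// The explicit spelling: same result in range, same panic out of range.
fn by_get_unwrap(values: &[u32], i: usize) -> u32 {
    *values.get(i).unwrap()
}

// The non-panicking alternative: the caller supplies what "missing" means.
fn by_get_or(values: &[u32], i: usize, fallback: u32) -> u32 {
    values.get(i).copied().unwrap_or(fallback)
}
```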

> The blog post doesn’t address the issue, it simply pretends it’s not a real problem.

It very explicitly addresses it! It even gives real examples.

> Also from the post: “If we were to steelman advocates in favor of this style of coding, then I think the argument is probably best limited to certain high reliability domains. I personally don’t have a ton of experience in said domains …”
>
> Enough said.

Ad hominem... I don't have experience working on, e.g., medical devices upon which someone's life depends. So the point of that sentence is to say, "yes, I acknowledge this advice may not apply there." You also cherry picked that quote and left off the context, which is relevant here.

And note that you said:

> I have to disagree that unwrap is ever OK.

That's an extreme position. It isn't caveated to only apply to certain contexts.

frumplestlatz · 2025-11-19T20:34:47 1763584487

> `slice[i]` is just sugar for `slice.get(i).unwrap()`. And whether it's a "local" invariant or not is orthogonal. And `unwrap()` does not "require lying about invariants across your API surface."

It's not orthogonal. `Result` isn't a local invariant, and yes, `.unwrap()` does require lying. If your code depends on an API that can fail, and you cannot handle that failure locally (`.unwrap()` is not handling it), then your type signature needs to express that you can fail -- and you need to raise an error on that failure.

> That's an extreme position. It isn't caveated to only apply to certain contexts.

No, it's a principled position. Correct code doesn't `.unwrap()`, but code that hides failure cases -- or foists invariant enforcement onto programmers remembering not to screw up -- does.

I've built and worked on ridiculously complex code bases without a single instance of `.unwrap()` or the local language equivalent; it's just not necessary. This is just like the unchecked exception debate in Java -- complex explanations for a very simple goal of avoiding the thought, time, and effort to accurately model a system's invariants.

burntsushi · 2025-11-20T16:19:49 1763655589

> No, it's a principled position. Correct code doesn't `.unwrap()`, but code that hides failure cases -- or foists invariant enforcement onto programmers remembering not to screw up -- does.

I don't think you understand what an internal runtime invariant is. Either way, I don't know of any widespread libraries (in any language) that follow this "principled" position. That makes it de facto extreme.

> I've built and worked on ridiculously complex code bases without a single instance of `.unwrap()` or the local language equivalent; it's just not necessary.

Show me. If you're using `slice[i]`, then you're using `unwrap()`. It introduces a panicking branch.

> If your code depends on an API that can fail, and you cannot handle that failure locally (`.unwrap()` is not handling it), then your type signature needs to express that you can fail -- and you need to raise an error on that failure.

You use `unwrap()` when you know the failure cannot happen.

I note you haven't engaged with any of the examples I provided in the blog.

frumplestlatz · 2025-11-20T18:58:14 1763665094

> You use `unwrap()` when you know the failure cannot happen.

That’s an invariant meant to be expressed by your type system — and it is.

You’ve failed to model your invariants in your API — and thus the type system — if you ever reach a point where an engineer has to manually assess and assert whether “cannot” applies.

You will get it wrong. That is bad code.

burntsushi · 2025-11-26T19:11:39 1764184299

You can't model all invariants in the type system. My blog even shows examples of this too.

quotemstr · 2025-11-19T12:47:54 1763556474

> If you have to use unwrap, your types do not match your problem

The problem starts with Rust stdlib. It panics on allocation failure. You expect Rust programmers to look at stdlib and not imitate it?
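
For what it's worth, most allocating APIs in std are infallible from the caller's point of view, but a few fallible counterparts do exist. A minimal sketch using `Vec::try_reserve`, where `buffer_with_capacity` is a hypothetical helper name:

```rust
use std::collections::TryReserveError;

// Reserves space up front and reports failure as a value instead of
// panicking or aborting inside the allocation call.
fn buffer_with_capacity(len: usize) -> Result<Vec<u8>, TryReserveError> {
    let mut buf = Vec::new();
    buf.try_reserve(len)?;
    Ok(buf)
}
```

This only covers the reservation itself; later calls such as `push` are still the ordinary infallible ones unless they stay within the reserved capacity.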

Sure, you can try to taboo unwrap(), but 1) it won't work, and 2) it'll contort program design in places where failure really is a logic bug, not a runtime failure, and for which unwrap() is actually appropriate.

The real solution is to go back in time, bonk the Rust designers over the head with a cluebat, and have them ship a language that makes error propagation the default and syntactically marks infallible cleanup paths --- like C++ with noexcept.

frumplestlatz · 2025-11-19T17:59:44 1763575184

> 1) it won't work

Of course it will. I've built enormous systems, including an entire compiler, without once relying on the local language equivalent of `.unwrap()`.

> 2) it'll contort program design in places where failure really is a logic bug, not a runtime failure, and for which unwrap() is actually appropriate.

That's a failure to model invariants in your API correctly.

> ... have them ship a language that makes error propagation the default and syntactically marks infallible cleanup paths --- like C++ with noexcept.

Unchecked exceptions aren't a solution. They're a way to avoid taking the thought, time, and effort to model failure paths, and instead leave that inherent unaddressed complexity until a runtime failure surprises users. Like just happened to Cloudflare.

empath75 · 2025-11-19T16:49:54 1763570994

We flag unwrap/expect usage in lints and have limited it to just server startup where we want the server to crash if a file is missing...

dehrmann · 2025-11-19T02:09:01 1763518141

It's the same blind spot people have to Java's checked exceptions. People commonly resort to Pokemon exception handling, either blindly ignoring exceptions or rethrowing them as runtime exceptions. When Rust got popular, I was a bit confused by people talking about how great Result is, since it's essentially a checked exception without a stack trace.

Terr_ · 2025-11-19T02:46:31 1763520391

"Checked Exceptions Are Actually Good" gang, rise up! :p

I think adoption would have played out very differently if there had only been some more syntactic sugar. For example, an easy syntax for saying: "In this method, any (checked) DeepException e that bubbles up should immediately be replaced by a new (checked) MylayerException(e) that contains the original one as a cause."

We might still get lazy programmers making systems where every damn thing goes into a generic MylayerException, but that mess would still be way easier to fix later than a hundred scattered RuntimeExceptions.

twhitmore · 2025-11-19T10:35:22 1763548522

Exception handling would be better than what we're seeing here.

The problem is that any non-trivial software is composition, and encapsulation means most errors aren't recoverable.

We just need easy ways to propagate exceptions out to the appropriate reliability boundary, ie. the transaction/ request/ config loading, and fail it sensibly, with an easily diagnosable message and without crashing the whole process.

C# or unchecked Java exceptions are actually fairly close to ideal for this.

The correct paradigm is "prefer throw to catch" -- requiring devs to check every ret-val just created thousands of opportunities for mistakes to be made.

By contrast, a reliable C# or Java version might have just 3 catch clauses and handle errors arising below sensibly without any developer effort.

https://literatejava.com/exceptions/ten-practices-for-perfec...

bigstrat2003 · 2025-11-19T03:27:32 1763522852

I'm with you! Checked exceptions are actually good and the hate for them is super short sighted. The exact same criticisms levied at checked exceptions apply to static typing in general, but people acknowledge the great value static types have for preventing errors at compile time. Checked exceptions have that same value, but are dunked on for some reason.

never_inline · 2025-11-19T04:57:31 1763528251

The dislike probably comes down to two reasons and their consequence.

1. in most cases they don't want to handle `InterruptedException` or `IOException` and yet need to bubble them up. In that case the code is very verbose.

2. it makes lambdas and functions incompatible. So eg: if you're passing a function to forEach, you're forced to wrap it in runtime exception.

3. Due to (1) and (2), most people become lazy and do `throws Exception` which negates most advantages of having exceptions in the first place.

In line-of-business apps (where Java is used the most), an uncaught exception is not a big deal. It will bubble up and gets handled somewhere far up the stack (eg: the server logger) without disrupting other parts of the application. This reduces the utility of having every function throw InterruptedException / IOException when those hardly ever happen.

loglog · 2025-11-19T12:06:55 1763554015

Java checked exceptions suffer from a lack of generic exception types ("throws T", where T can be e.g. "Exception", "Exception1|Exception2", or "never"). This would also require union types and a bottom type. Without generics, higher order functions are very hard to use.

dehrmann · 2025-11-19T06:55:04 1763535304

> 2. it makes lambdas and functions incompatible.

This is true, but the hate predated lambdas in Java.

ErikCorry · 2025-11-19T09:34:58 1763544898

You could always manually build the same thing as lambda with a class and you had the same problem.

frumplestlatz · 2025-11-19T09:30:51 1763544651

> an uncaught exception is not a big deal

In my experience, it actually is a big deal, leaving a wake of indeterminate state behind after stack unwinding. The app then fails with heisenbugs later, raising more exceptions that get ignored, compounding the problem.

People just shrug off that unreliability as an unavoidable cost of doing business.

Terr_ · 2025-11-19T04:05:42 1763525142

Yeah, in both cases it's a layering situation, where it's the duty of your code to decide what layers of abstraction need to be bridged, and to execute on that decision. Translating/wrapping exception-types from deeper functions is the same as translating/wrapping return-types in the same places.

I think it comes down to a psychological or use-case issue: People hate thinking about errors and handling them, because it's that hard stuff that always consumes more time than we'd like to think. Not just digitally, but in physical machines too. It's also easier to put off "for later."

lenkite · 2025-11-19T05:35:25 1763530525

Checked exceptions in theory were good, but Java simply did not add facilities to handle or support them well in many APIs. Even the new APIs in Java (Streams, etc.) do not support checked exceptions.

Skeime · 2025-11-19T09:10:12 1763543412

There is also the problem that they decided to make all references nullable, so `NullPointerException`s could appear everywhere. This "forced" them to introduce the escape hatch of `RuntimeException`, which of course was way overused immediately, normalizing it.

gwbas1c · 2025-11-19T02:59:06 1763521146

It's a lot lighter: a stack trace takes a lot of overhead to generate; a result has no overhead for a failure. The overhead (panic) only comes once the failure can't be handled. (Most books on Java/C# don't explain that throwing exceptions has high performance overhead.)

Exceptions force a panic on all errors, which is why they're supposed to be used in "exceptional" situations. To avoid exceptions when an error is expected, (eof, broken socket, file not found,) you either have to use an unnatural return type or accept the performance penalty of the panic that happens when you "throw."

In Rust, the stack trace happens at panic (unwrap), which is when the error isn't handled. IE, it's not when the file isn't found, it's when the error isn't handled.
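
A minimal sketch of that distinction, with a hypothetical `read_or_default` helper: "file not found" is handled as an ordinary value and never reaches a panic, while any other I/O error is passed up for the caller to decide on.

```rust
use std::fs;
use std::io::{self, ErrorKind};

// A missing file is an expected case here, so it is handled in place.
// Other I/O errors (permissions, invalid UTF-8, ...) are returned to the caller.
fn read_or_default(path: &str, default: &str) -> io::Result<String> {
    match fs::read_to_string(path) {
        Ok(contents) => Ok(contents),
        Err(err) if err.kind() == ErrorKind::NotFound => Ok(default.to_string()),
        Err(err) => Err(err),
    }
}
```

A panic would only enter the picture if a caller chose to `unwrap()` the returned `io::Result`.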

branko_d · 2025-11-19T07:26:57 1763537217

> Exceptions force a panic on all errors

What do you mean?

Exceptions do not force panic at all. In most practical situations, an exception unhandled close to where it was thrown will eventually get logged. It's kind of a "local" panic, if you will, that will terminate the specific function, but the rest of the program will remain unaffected. For example, a web server might throw an exception while processing a specific HTTP request, but other HTTP requests are unaffected.

Throwing an exception does not necessarily mean that your program is suddenly in an unsupported state, and therefore does not require terminating the entire program.

gwbas1c · 2025-11-19T16:58:12 1763571492

> Throwing an exception does not necessarily mean that your program is suddenly in an unsupported state, and therefore does not require terminating the entire program.

That's not what a panic means. Take a read through Go's panic / recover mechanism; it's similar to exceptions, but the semantics (with multiple return values) make it clear that panic is for exceptional situations. (IE, panic isn't for "file not found," but instead it's for when code isn't written to handle "file not found.")

Even Rust has mechanisms to panic without aborting the process, although I will readily admit that I haven't used them and don't understand them: https://doc.rust-lang.org/std/panic/fn.resume_unwind.html
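
For reference, `resume_unwind` re-raises a panic that was already caught; the function that stops one is `std::panic::catch_unwind`. A minimal sketch with hypothetical `handle_request` and `serve` functions, containing a panic to a single request. It assumes the default `panic = "unwind"` setting; with `panic = "abort"` there is nothing to catch.

```rust
use std::panic;

// Stand-in for a handler with a bug: it panics on input the author never expected.
fn handle_request(body: &str) -> usize {
    body.parse::<usize>().unwrap()
}

// Boundary around one request: a panic inside becomes an error for that
// request only, and the caller can keep serving others.
fn serve(body: &str) -> Result<usize, String> {
    panic::catch_unwind(|| handle_request(body))
        .map_err(|_| "internal error while handling request".to_string())
}
```

The standard library docs caution against using this as a general try/catch, so it fits best as a last-resort boundary rather than routine error handling. The default panic hook still prints the panic message to stderr.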

frumplestlatz · 2025-11-19T09:32:53 1763544773

> Throwing an exception does not necessarily mean that your program is suddenly in an unsupported state

When everyone uses runtime exceptions and doesn’t account for exception handling in every possible code path, that’s exactly what it means.

branko_d · 2025-11-19T14:56:27 1763564187

Sure, but the same is true of any error handling strategy.

When you work with exceptions, the key is to assume that every line can throw unless proven otherwise, which in practice means almost all lines of code can throw. Once you adopt that mental model, things get easier.

frumplestlatz · 2025-11-19T15:10:27 1763565027

Explicit error handling strategies allow you to not worry about all the code paths that explicitly cannot throw -- which is a lot of them. It makes life a lot easier in the non-throwing case, and doesn't complicate life any more in the throwing case as compared to exception-based error handling.

It also makes errors part of the API contract, which is where they belong, because they are.

branko_d · 2025-11-20T06:55:38 1763621738

I would respectfully disagree with most of what you said. I guess individual perspective depends a lot on the kinds of code you work on.

The point about being explicitly part of the API stands, though.

dehrmann · 2025-11-19T06:57:57 1763535477

> a stack trace takes a lot of overhead to generate

Can't Hotspot not generate the stack trace when it knows the exception will be caught and the stack trace ignored?

mike_hearn · 2025-11-19T14:01:05 1763560865

It can and that optimization has existed for a while.

Actually it can also just turn off the collection of stack traces entirely for throw sites that are being hit all the time. But most Java code doesn't need this because code only throws exceptions for exceptional situations.

BlackFly · 2025-11-19T08:02:19 1763539339

> it's essentially a checked exception without a stack trace

In theory, theory and practice are the same. In practice...

You can't throw a checked exception in a stream, this fact actually underlines the key difference between an exception and a Result: Result is in return position and exceptions are a sort of side effect that has its own control flow. Because of that, once your method throws an Exception or you are writing code in a try block that catches an exception, you become blind to further exceptions of that type, even if you might be able to or required to fix those errors. Results are required to be handled individually and you get syntactic sugar to easily back propagate.

It is trivial to include a stack trace, but stack traces are really only useful for identifying where something occurred, and generally what is superior is attaching context as you back propagate which trivially occurs with judicious use of custom error types with From impls. Doing this means that the error message uniquely defines the origin and paths it passed through without intermediate unimportant stack noise. With exceptions you would always need to catch each exception and rethrow a new exception containing the old to add contextual information, then to avoid catching too much you need variables that will be initialized inside the try block defined outside of the try block. So stack traces are basically only useful when you are doing Pokemon exception handling.
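
A minimal sketch of that pattern, with a hypothetical `LoadError` type and `parse_limit` function. The `From` impl records which layer failed, and `?` performs the conversion while propagating:

```rust
use std::fmt;
use std::num::ParseIntError;

#[derive(Debug)]
enum LoadError {
    // Wraps the lower-level error and adds where it happened.
    Limit(ParseIntError),
    Empty,
}

impl fmt::Display for LoadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LoadError::Limit(e) => write!(f, "reading feature limit: {e}"),
            LoadError::Empty => write!(f, "feature limit is empty"),
        }
    }
}

impl std::error::Error for LoadError {}

impl From<ParseIntError> for LoadError {
    fn from(e: ParseIntError) -> Self {
        LoadError::Limit(e)
    }
}

fn parse_limit(raw: &str) -> Result<usize, LoadError> {
    let raw = raw.trim();
    if raw.is_empty() {
        return Err(LoadError::Empty);
    }
    // `?` converts ParseIntError into LoadError::Limit via the From impl.
    Ok(raw.parse::<usize>()?)
}
```

Each layer that wraps the error this way adds one segment to the message, which is the "path it passed through" without a stack trace.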

Ygg2 · 2025-11-19T09:00:07 1763542807

> When Rust got popular, I was a bit confused by people talking about how great Result is, since it's essentially a checked exception without a stack trace.

It's not a checked exception without a stack trace.

Rust doesn't have Java's checked or unchecked exception semantics at the moment. Panics are more like Java's Errors (e.g. OOM error). Results are just error codes on steroids.

baq · 2025-11-19T08:11:04 1763539864

checked exceptions failed because when used properly they fossilize method signatures. they're fine if your code will never be changed and they're fine when you control 100% of users of the throwing code. if you're distributing a library... no bueno.

frumplestlatz · 2025-11-19T09:34:20 1763544860

That’s just not true. They required that you use hierarchical exception types and define your own library exception type that you declare at the boundary.

The same is required for any principled error handling.

speed_spread · 2025-11-19T02:36:35 1763519795

Pet peeve: unwrap() should be deprecated and renamed or_panic(). More consistent with the rest of stdlib methods and appropriately scarier.
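
Nothing stops a codebase from trying the name out today. A minimal sketch with a hypothetical `OrPanic` extension trait (not a std API), which could be paired with a lint that bans `unwrap()` itself:

```rust
use std::fmt::Debug;

// Hypothetical extension trait; `or_panic` does not exist in std.
trait OrPanic<T> {
    fn or_panic(self) -> T;
}

impl<T> OrPanic<T> for Option<T> {
    #[track_caller] // report the caller's location in the panic message
    fn or_panic(self) -> T {
        match self {
            Some(value) => value,
            None => panic!("or_panic() called on None"),
        }
    }
}

impl<T, E: Debug> OrPanic<T> for Result<T, E> {
    #[track_caller]
    fn or_panic(self) -> T {
        match self {
            Ok(value) => value,
            Err(err) => panic!("or_panic() called on Err: {err:?}"),
        }
    }
}
```

Behavior is the same as `unwrap()`; only the spelling at the call site changes.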

wrs · 2025-11-19T04:13:39 1763525619

That's kind of what I'm saying with the blind spot comment. The words "unwrap" and "expect" should be just as much a scary red flag as the word "panic", but for some reason it seems a lot of people don't see them that way.

shadowmatter · 2025-11-19T05:28:53 1763530133

Even in lowly Java, they later added to Optional the orElseThrow() method since the name of the get() method did not connote the impact of unwrapping an empty Optional.

vbezhenar · 2025-11-19T10:19:03 1763547543

I've found both methods very useful. I'm using `get()` when I've checked that the value is present and I don't expect any exceptions. I'm using `orElseThrow()` when I actually expect that value can be absent and throwing is fine. Something like

    if (userOpt.isPresent()) {
      var user = userOpt.get();
      var accountOpt = accountRepository.selectAccountOpt(user.getId());
      var account = accountOpt.orElseThrow();
    }

Idea checks it by default and highlights if I've used `get()` without previous check. It's not forced at compiler level, but it's good enough for me.

loglog · 2025-11-19T20:08:34 1763582914

While the `Optional` API is generally pretty inconvenient (compared e.g. to Kotlin), it does offer the more precise `ifPresent`.

vbezhenar · 2025-11-20T00:27:55 1763598475

Java lambdas are terrible so I usually avoid them. There's no reason to invent new methods, when language already has corresponding statements.

echelon · 2025-11-19T03:25:47 1763522747

A lot of stuff should be done about the awful unwrap family of methods.

A few ideas:

- It should not compile in production Rust code

- It should only be usable within unsafe blocks

- It should require explicit "safe" annotation from the engineer. Though this is subject to drift and become erroneous.

- It should be possible to ban the use of unsafe in dependencies and transitive dependencies within Cargo.

kibwen · 2025-11-19T03:33:58 1763523238

The `unsafe` keyword means something specific in Rust, and panicking isn't unsafe by Rust's definition. Sometimes avoiding partial functions just isn't feasible, and an unwrap (or whatever you want to call the method) is a way of providing a (runtime-checked) proof to the compiler that the function is actually total.
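
A tiny illustration of that, using a made-up `last_digit` function. `char::from_digit` is partial because it returns `None` for digits outside the radix, but the argument here can never be outside it:

```rust
fn last_digit(n: u32) -> char {
    // `n % 10` is always in 0..=9, so `from_digit(_, 10)` cannot return None here.
    // The signature of `from_digit` has no way to express that, so the unwrap
    // acts as the runtime-checked assertion that this function is total.
    char::from_digit(n % 10, 10).unwrap()
}
```

Returning `Option<char>` from `last_digit` instead would push a case that cannot happen onto every caller.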