More

palcu · 2025-12-14T23:22:34 1765754554

Thank you! Opening an incident as soon as user impact begins is one of those instincts you develop after handling major incidents for years as an SRE at Google, and now at Anthropic.

I was also fortunate to be using Claude at that exact moment (for personal reasons), which meant I could immediately see the severity of the outage.

koakuma-chan · 2025-12-15T00:08:30 1765757310

It's important for companies to use their own products.

awesome_dude · 2025-12-15T01:05:16 1765760716

Unless using your own dogfood prevents you from fixing it if it breaks

https://www.theguardian.com/technology/2021/oct/05/facebook-...

I have a memory that Slack fell into this trap too (I could be wrong)

hiddencost · 2025-12-15T03:11:40 1765768300

Facebook notoriously had to cut open the doors to one of their data centers.

Google SRE still keeps IRC available in case of an emergency.

antonvs · 2025-12-15T02:31:49 1765765909

Now I’m imagining the folks at Slack gritting their teeth and using MS Teams

aduwah · 2025-12-14T23:39:48 1765755588

Take my condolences, Sunday outages are rough

nrhrjrjrjtntbt · 2025-12-15T01:28:13 1765762093

Sweet. Hopefully it is more than instinct but a codified at Anthropic. I.e. a graduate engineer with little experience can assess and raise incident if needed.

palcu · 2025-12-14T22:56:34 1765752994

Hello, I'm one of the engineers who worked on the incident. We have mitigated the incident as of 14:43 PT / 22:43 UTC. Sorry for the trouble.

l1n · 2025-12-14T23:45:15 1765755915

Also an engineer on this incident. This was a network routing misconfiguration - an overlapping route advertisement caused traffic to some of our inference backends to be blackholed. Detection took longer than we’d like (about 75 minutes from impact to identification), and some of our normal mitigation paths didn’t work as expected during the incident.

The bad route has been removed and service is restored. We’re doing a full review internally with a focus on synthetic monitoring and better visibility into high-impact infrastructure changes to catch these faster in the future.

ammut · 2025-12-15T01:46:38 1765763198

If you have a good network CI/CD pipeline and can trace the time of deployment to when the errors began, it should be easy to reduce your total TTD/TTR. Even when I was parsing logs years ago and matching them up against AAA authorization commands issued, it was always a question of "when did this start happening?" and then "who made a change around that time period?"

giancarlostoro · 2025-12-15T00:59:40 1765760380

I don't know if you guys do write ups, but cloudflare's write ups on outages is in my eyes the gold standard the entire industry should follow.

Arcuru · 2025-12-15T01:56:43 1765763803

When I was at Big Corp, I loved reading the internal postmortems. They were usually very interesting and I learned a lot. It's one of the things I miss about leaving.

A tech company that publishes the postmortems when possible always get a +1 in my eyes, I think it's a sign of good company culture. Cloudflare's are great and I would love to see more from others in the industry.

boondongle · 2025-12-15T02:05:38 1765764338

A big reason for that is it comes from the CEO. Other providers have a team and then at least 2 to 3 layers of management above them and a dotted line legal counsel. So the goal posts randomly shift from "more information" to "no information" over time based on the relationships of that entire chain, the customer heat of the moment, and personality.

Underneath a public statement they all have extremely detailed post-mortems. But how much goes public is 100% random from the customer's perspective. There's no Monday Morning QB'ing the CEO, but there absolutely is "Day-Shift SRE Leader Phil"

bflesch · 2025-12-15T02:14:36 1765764876

Cloudflare deploys stuff on Fridays, and it directly affected shopify, one of their major ecommerce customers. Until they fix their internal processes all writeups should be seen as purely marketing material.

giancarlostoro · 2025-12-15T13:45:23 1765806323

I absolutely see it as marketing, and it is effective because I still appreciate the write ups. Arguably any publicly traded company should be letting their investors know more details about outages.

999900000999 · 2025-12-15T00:02:32 1765756952

Was this a typo situation or a bad process thing ?

Back when I did website QA Automation I'd manually check the website at the end of my day. Nothing extensive, just looking at the homepage for piece of mind.

Once a senior engineer decided to bypass all of our QA, deploy and took down prod. Fun times.

spike021 · 2025-12-15T01:30:02 1765762202

Depending on how long someone's been in the industry it's more a question of if, not when, an outage will occur due to someone deciding to push code haphazardly.

At my first job one of my more senior team members would throw caution to the wind and deploy at 3pm or later on Fridays because he believed in shipping ASAP.

There were a couple times that those changes caused weekend incidents.

MobiusHorizons · 2025-12-15T03:49:08 1765770548

I think you meant to write “when, not if” instead of “if, not when”

spike021 · 2025-12-15T06:07:26 1765778846

heh, probably. that's what I get for writing a comment while walking my dog.

userbinator · 2025-12-15T02:49:30 1765766970

In these times, it could be "the AI did it".

tayo42 · 2025-12-15T00:54:42 1765760082

I was kind surprised to see details like that in a comment, but clicked on your personal website and see your a Co-founder, so I guess no one is going to repremand you lol

wouldbecouldbe · 2025-12-15T00:16:22 1765757782

Trying to understand what this means.

Did the bad route cause an overload? Was there a code error on that route that wasn’t spotted? Was it a code issue or an instance that broke?

bc569a80a344f9c · 2025-12-15T00:23:34 1765758214

It says network routing issue.

Network routes consist of a network (a range of IPs) and a next hop to send traffic for that range to.

These can overlap. Sometimes that’s desirable, sometimes it is not. When routers have two routes that are exactly the same they often load balance (in some fairly dumb, stateless fashion) between possible next hops, when one of the routes is more specific, it wins.

Routes get injected by routers saying “I am responsible for this range” and setting themselves as the next hop, others routers that connect to them receive this advertisement and propagate it to their own router peers further downstream.

An example would be advertising 192.168.0.0/23, which is the range of 192.168.0.0-192.168.1.255.

Let’s say that’s your inference backend in some rows in a data center.

Then, through some misconfiguration, some other router starts announcing 192.168.1.0/24 (192.168.1.0-192.168.1.255). This is more specific, that traffic gets sent there, and half of the original inference pod is now unreachable.

disqard · 2025-12-15T00:36:02 1765758962

Thank you for that explanation!

mattdeboard · 2025-12-15T00:23:28 1765758208

it means their servers were unreachable due to network misconfig.

colechristensen · 2025-12-15T01:29:17 1765762157

The details and promptness of reporting are much appreciated and build trust, so thanks!

giancarlostoro · 2025-12-15T00:59:08 1765760348

Any chance you guys could do write ups on these incidents similar to how CloudFlare does? For all the heat some people give them, I trust CloudFlare more with my websites than a lot of other companies because of their dedication to transparency.

l1n · 2025-12-15T01:00:20 1765760420

We're considering this!

giancarlostoro · 2025-12-15T01:59:35 1765763975

I already love the product, and I think it would be great to see. Even if its not as "quickly" as CloudFlares (they post ASAP its insane) I would still be happy to see postmortem threads. We all learn industry wide from them.

nickpeterson · 2025-12-14T23:07:14 1765753634

The one time you desperately need to ask Claude and it isn’t working…

dan_wood · 2025-12-14T22:58:48 1765753128

Can you divulge more on the issue?

Only curious as a developer and dev op. It's all quite interesting where and how things go wrong especially with large deployments like Anthropic.

binsquare · 2025-12-14T23:02:59 1765753379

I yearn for the nitty gritty details too

mulhoon · 2025-12-14T23:11:16 1765753876

They turned it off and on again.

cheschire · 2025-12-14T23:35:15 1765755315

Three times.

https://youtu.be/uRGljemfwUE?si=Sq0t-2ipXr_gDqao&t=69

binsquare · 2025-12-15T00:26:36 1765758396

dgellow · 2025-12-14T23:38:08 1765755488

Hope you have a good rest of your weekend

Chance-Device · 2025-12-14T23:05:02 1765753502

Thank you for your service.

g-mork · 2025-12-14T23:38:51 1765755531

it's still down get back to work

palcu · 2025-07-18T16:05:58 1752854758

There is an external incident now.

https://status.cloud.google.com/incidents/8cY8jdUpEGGbsSMSQk...

palcu · 2025-06-14T18:42:39 1749926559

One of the problems has been that most users have requested that quotas get updated as fast as possible and that they should be consistent across regions, even for global quotas. As such people have been prioritising user experience rather than availability.

I hope the pendulum swings the other way around now in the discussion.

[disclaimer that I worked as a GCP SRE for a long time, but not left recently]

stogot · 2025-06-14T22:47:26 1749941246

Why did you leave? If you don’t mind asking

PNewling · 2025-06-15T02:13:35 1749953615

Not sure if you'll get an answer (I'd be interesting in a response as well), but from the blog in their profile it looks like they moved to be a 'member of technical staff working in the AI Reliability Engineering (AIRE) team at Anthropic'. So it might have just been an upward move to something different/more-exciting.

palcu · 2025-06-15T12:14:31 1749989671

I confirm. I’m still doing reliability, but in a fun and exciting way for Claude and Anthropic.

palcu · 2025-03-04T11:25:57 1741087557

seems to be OpenAI

https://news.ycombinator.com/item?id=36155028

TheAlchemist · 2025-03-04T22:01:31 1741125691

I would assume that's the Microsoft part actually.

palcu · on Dec 25, 2023

And the queries flow through.

palcu · on Aug 8, 2023

My adjacent teams in London who work in SRE on Google Cloud (GCE) got some well deserved doughnuts today for rolling out the patches on time.

palcu · on July 3, 2023

Yeah yeah, they've broken the old TweetDeck. You need to wait for the pop-up to ask you to transition to the new TweetDeck. Or, search on the internet for the Javascript variable you have to change in your console.

The more important question is that they've removed the Activity feed, where you could see likes from other people. Which was like a realtime feed to what your friends were doing on the website. The website is way more boring now.

palcu · on April 27, 2023

[disclaimer: SRE @ Google, I was involved with the incident, obvious conflicts of interest]

Hey Dang, thanks for cleaning up the thread. One thing to note is that the title is not correct. The entire region is not currently down, as the regional impact was mitigated as of 06:39 PDT, per the support dashboard (though I think it was earlier). The impact is currently zonal (europe-west9-a), so having zone in the title as opposed to region would reflect reality closer.

Finally, there's lots of good feedback on this thread and on the previous one (https://news.ycombinator.com/item?id=35711349), so we obviously have a lot of lessons to learn.

Waterluvian · on April 27, 2023

Would you be able to comment a bit on the emotional (perhaps there’s a better word) aspect of the response?

Was there a lot of anxiety? Panic? Or was it just a “woof that sucks. Time to follow a checklist and then do a bunch of paper work” ?

What I’m curious about is what it feels like on a team at a company like Google when there is a major system failure.

palcu · on April 27, 2023

There's not much emotion as the core team working on the huge outages is more like an "SRE for SRE". They are all people who've been with the company for a long time and they've been in the secondary seat for at least one previous big rodeo. Not to mention that we're all running a checklist that has been exercised multiple times and there's always somebody on the call who could help if a step fails.

Personally, I wasn't part this time for the actual mitigation of the overall Paris DC recovery, as I was busy with an unfortunate[0] side effect of the outage. These generate more anxiety, as being woken up at 6am and being told that nobody understands exactly why the system is acting this way is not great. But then again, we're trained for this situation and there are always at least several ways of fixing the issue.

Finally, it's worth repeating that incident management is just a part of the SRE job and after several years I've understood that it is not the most important one. The best SREs I know are not great when it comes to a huge incident. But, they're work has avoided the other 99 outages that could have appeared on the front page of Hacker News.

[0]: https://news.ycombinator.com/item?id=35734224

Waterluvian · on April 28, 2023

I appreciate your insight into this. Thanks!

throwbigdata · on April 28, 2023

Who trains the trainers?

palcu · on April 28, 2023

Life and experience, if you're looking for a short answer. For example, last year we had an outage in London[0] and the folks who worked on it learnt a lot. Now, they applied the learnings in this incident.

[0]: https://news.ycombinator.com/item?id=32161755

palcu · on Jan 6, 2023

I absolutely love the answer to any of the other obvious questions that one would have about the membership.

https://www.bitsaboutmoney.com/memberships