From my experience working on SaaS, and improving ops at large organizations, I'...

DanHulton · 2025-03-18T05:03:14 1742274194

This assumes that the engineers in question get to choose how to allot their time, and are _allowed_ to spend time to add graceful failure modes. I cannot tell you how many stories I have heard of, and companies I have directly worked at, where this power is not granted to engineers, and they are instead directed to "stop working on technical debt, we'll make time to come back to that later". Of course, time is never found later, and the 3am pages continue because the people who DO choose how time is allocated are not the ones waking up at 3am to fix problems.

nijave · 2025-03-18T06:06:51 1742278011

Definitely an issue but I think there's a little room for push back. Work done outside normal working hours is automatically the highest priority, by definition. It's helpful to remind people of that.

If it's important enough to deserve a page, it's top priority work. The reverse is also true (if a page isn't top priority, disable the paging alert and stick it on a dashboard or periodic checklist)

whstl · 2025-03-18T09:14:55 1742289295

You're right, but it's still outrageous that engineers need to burn political capital in order to have proper sleep and avoid burnout.

nijave · 2025-03-18T20:46:18 1742330778

Agreed.

tacticus · 2025-03-18T03:54:13 1742270053

IMO it's when the incident response and readiness practice imposes a direct backpressure on feature delivery that you get the issues actually fixed and a resilient system.

if it's just the engineer while product and management see no real cost then people burn out and leave.

> The most successful teams I've seen treat on-call like a leading indicator - every incident represents unpriced technical debt that should be systematically eliminated. Each alert becomes an investment opportunity rather than a burden to be rotated.

100%

hinkley · 2025-03-18T16:29:03 1742315343

The people who do the real work don’t get raises and promotions because the annual review system punishes them for doing the right thing.

rednafi · 2025-03-18T04:12:52 1742271172

> When engineers bear the full financial consequences of 3AM pages, they're more likely to make systems more resilient by adding graceful failure modes.

Making engineers handle 3 AM issues caused by their code is one thing, but making them bear the financial consequences is another. That’s how you create a blame-game culture where everyone is afraid to deploy at the end of the day or touch anything they don’t fully understand.

cyberax · 2025-03-18T06:42:43 1742280163

"Financial consequences" probably mean "the success of the startup, so your options won't be worth less than the toilet paper", rather than "you'll pay for the downtime out of your salary".

mrkeen · 2025-03-18T07:39:55 1742283595

Engineers don't pick their work, management does.

A manager no longer needs to choose between system reliability and churning out new features with on-call:

The manager can get all the credit for pushing out new features during the day, and sleep well at night knowing that the engineers aren't.

procaryote · 2025-03-18T11:04:52 1742295892

At a lot of companies engineers are involved in picking the work. It's silly to hire competent problem solvers and treat them as unskilled workers needing micro-management.

Besides, if you set the on-call system up so people get free time the following day to compensate for waking up at night, the manager can't pretend there's no cost.

Bad management will fail on both of these of course, but there's no saving that beyond finding a better company.

whstl · 2025-03-18T12:51:32 1742302292

It is silly indeed but unfortunately this is what happens in companies that don’t have a good engineering culture.

northern-lights · 2025-03-18T07:03:47 1742281427

This assumes that the engineers who wrote the code that caused the 3 AM pages will still be around to suffer the consequences of the 3 AM pages. This is a lot of times, not true, especially in an environment which fostered moving around internally every now and then. Happens in at least one of the FAANGs.

meeshmuesh · 2025-03-18T13:28:28 1742304508

Sounds like Amazon

adrianN · 2025-03-18T05:55:13 1742277313

Minimizing 3am pages is good for engineers but it is not necessarily the best investment for the company. Beyond a certain scale it is probably not a good investment to try to get rid of all pages.

mook · 2025-03-18T07:15:12 1742282112

By that point wouldn't it start to make sense to have people across time zones so that it will be working hours somewhere?