Another status page that sucks. Slack goes down, people start texting me about i...

sjsdaiuasgdia · on May 17, 2023

Progression of status pages, from experience at a large cloud provider...

Stage 1: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.

Problems: Delayed or missed updates. Customers complain that you're not being honest about outages.

Stage 2: Status is automatically set based on the outcome of some monitoring check or functional test.

Problems: Any issue with the system that performs the "up or not?" source of truth test can result in a status change regardless of whether an actual problem exists. "Override automatic status updates" becomes one of the first steps performed during incident response, turning this into "status is manually set, but with extra steps". Customers complain that you're not being honest about outages and latency still sucks.

Stage 3: Status is automatically set based on a consensus of results from tests run from multiple points scattered across the public internet.

Problems: You now have a network of remote nodes to maintain yourself or pay someone else to maintain. The more reliable you want this monitoring to be, the more you need to spend. The cost justification discussions in an enterprise get harder as that cost rises. Meanwhile, many customers continue to say you're not being honest because they can't tell the difference between a local issue and an actual outage. Some customers might notice better alignment between the status page and their experience, but they're content, so they have little motivation to reach out and thank you for the honesty.
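
To make the Stage 3 idea concrete, here is a rough Python sketch of the consensus step. The function name, the status labels, and the 50% quorum are all assumptions for illustration, not what any particular provider runs. Each remote node reports whether its test passed, and the page only goes red when more than the quorum of nodes agree.

```python
from typing import Iterable


def consensus_status(probe_results: Iterable[bool], down_quorum: float = 0.5) -> str:
    """Collapse pass/fail results from remote probes into one status label.

    probe_results: True if a probe's test passed, False if it failed.
    down_quorum:   hypothetical threshold; more than this fraction of
                   failing probes counts as a real outage.
    """
    results = list(probe_results)
    if not results:
        # No probes reported at all: the monitoring itself may be broken,
        # so refuse to claim either "up" or "down".
        return "unknown"

    failures = sum(1 for ok in results if not ok)
    failure_ratio = failures / len(results)

    if failure_ratio == 0:
        return "operational"
    if failure_ratio > down_quorum:
        return "outage"
    # Some probes fail but not enough to agree on an outage; this could be
    # a regional problem or a broken probe node.
    return "degraded"
```

A single failing node out of three would give "degraded" rather than "outage", which is the whole point: one broken probe should not flip the page. It does nothing about the cost of running the nodes.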

Eventually, the monitoring service gets axed because we can just manually update the status page after all.

Stage 4: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.

Not saying this is a great outcome, but it is an outcome that is understandable given the parameters of the situation.

croes · on May 17, 2023

More like this

https://www.hasthelargehadroncolliderdestroyedtheworldyet.co...

viraptor · on May 17, 2023

There are two kinds of status pages. Those that show you red when the service is up (automated) and those that show you green when the service is down (manual and automated) - or some mix of those two. You don't want made-up information either, so with the manual one you'll get some delay while the situation is analyzed.

Your actions are extremely unlikely to change if the downtime is for less than half an hour. So what exactly do you expect to happen here?

(Yes, I got annoyed at the thousandth comment that essentially says the status update is not instantaneous and perfectly reflecting the situation)

ilyt · on May 17, 2023

I expect:

* not lying

* not having to dig out our logs and proof for SLA reasons after an obvious fuckup.

> (Yes, I got annoyed at the thousandth comment that essentially says the status update is not instantaneous and perfectly reflecting the situation)

It's not about it being a minute behind reality, it's about it lying entirely. Why do you have so much trouble understanding that?

viraptor · on May 17, 2023

There's a bit of time between a delayed status report and lying. If you referred to them historically not reporting anything at all then that's fair, but it wasn't clear from the message.

And it's a common comment I see 10 minutes after some service starts having issues: "the status page is still green" - yes, people are probably still logging in to things and figuring out if the issue actually is internal.

ilyt · on May 18, 2023

But the expectation is that the site reports live state.

If it is up to a person to change it, then it is useless as a status page; might as well look at Reddit to see whether people are complaining...

efficax · on May 17, 2023

status pages are rarely automated and it looks like slack was down at like 1am pacific. somebody got woken up by a page and groggily escalated and they sat there fighting the outage for 30 minutes before someone said “what about the status page”. or at least that’s how it worked at my last company

stevewodil · on May 17, 2023

This is very accurate. Or the customer communication lead was shadowing and this was their first incident.

It’s all the same

switch007 · on May 17, 2023

If you accept that status pages are (partially) under the control of PR teams, it makes more sense that they’re useless and a lie.

nurettin · on May 17, 2023

Status page pull request is pending merge, because seniors are either fired or overworked.