Actionable: I should only be alerted if there is a definite action to take: clearing up disk space, investigating suspicious activity, replacing a hard drive or BBU, fixing connectivity issues. If an alert clears on its own, that probably indicates a problem with monitoring frequency or thresholds. If an outage occurs without alerting, ditto. If an alert gets ignored, it probably indicates that there are too many non-actionable alerts.
Minimal: Reduce alerting noise. Actionable takes care of a lot of that, but parent/child relationships between checks can reduce a dozen or a hundred alerts to one. One "SQL database is down" rather than 100 "web frontend timeout" alerts, for example.
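A minimal sketch of that parent relationship, assuming each service has at most one parent. The `Alert` class, the `PARENTS` map, and the service names are hypothetical, made up for illustration; real monitoring tools express dependencies in their own configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    message: str


# Hypothetical dependency map: child service -> the parent it depends on.
PARENTS = {"web-frontend": "sql-database"}


def suppress_children(alerts, parents=PARENTS):
    """Drop alerts whose parent (direct or transitive) is also alerting."""
    alerting = {a.service for a in alerts}

    def has_alerting_ancestor(service):
        seen = set()  # guards against cycles in the map
        parent = parents.get(service)
        while parent is not None and parent not in seen:
            if parent in alerting:
                return True
            seen.add(parent)
            parent = parents.get(parent)
        return False

    return [a for a in alerts if not has_alerting_ancestor(a.service)]


# 100 frontend timeouts plus the one database alert that explains them.
alerts = [Alert("sql-database", "SQL database is down")]
alerts += [Alert("web-frontend", "web frontend timeout") for _ in range(100)]
remaining = suppress_children(alerts)
```

Here `remaining` holds only the database alert. If the frontend times out while the database is healthy, the frontend alert is not suppressed.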
Nagging: An alert should go to one person, in a way that they are sure to see it: bypass DND settings, text, push notification, call. That person should either acknowledge and work it, or the next person in the on-call rotation should get nagged, and so on until someone owns it. I used to have an Android app that I wrote that would ring my phone on loud every 15 seconds if I had a missed call or text, for example. These days, I rely more on pings every 15 minutes if the alert is not acked or resolved.
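As a rough sketch of that escalation, assuming a 15-minute nag interval and two nags per person before moving on. The two-nag limit, the function name, and the names in the rotation are assumptions for illustration, not something any particular paging tool does.

```python
from typing import Optional, Sequence

NAG_INTERVAL_MIN = 15        # re-ping interval, as described above
NAGS_BEFORE_ESCALATION = 2   # assumption: pings per person before escalating


def who_to_page(rotation: Sequence[str],
                minutes_since_fired: int,
                acked: bool) -> Optional[str]:
    """Return the one person to ping right now, or None if no ping is due."""
    if acked or not rotation:
        return None
    if minutes_since_fired % NAG_INTERVAL_MIN != 0:
        return None  # between nag ticks
    nag_number = minutes_since_fired // NAG_INTERVAL_MIN
    level = nag_number // NAGS_BEFORE_ESCALATION
    # Wrap around the rotation so the nagging continues until someone owns it.
    return rotation[level % len(rotation)]


rotation = ["alice", "bob"]  # hypothetical on-call order
schedule = [(m, who_to_page(rotation, m, acked=False)) for m in range(0, 75, 15)]
```

With this rotation, alice is pinged at 0 and 15 minutes, bob at 30 and 45, and alice again at 60. Only one person is pinged at any tick, and everything stops once the alert is acked.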
Thank you! The first two points make sense. The third, or to put it differently, not nagging the whole on-call group at once, is interesting. Is this a widespread practice?