Ask HN: What Makes an Infra Monitoring and Alerting System Effective?

linsomniac · on March 5, 2021

For alerts: Actionable, minimal, and nagging.

Actionable: I should only be alerted if there is a definite action to take: clearing up disk space, investigating suspicious activity, replacing a hard drive or BBU, fixing connectivity issues. If an alert clears on its own, that probably indicates a problem with monitoring frequency or thresholds. If an outage occurs without alerting, ditto. If an alert gets ignored, it probably indicates that there are too many non-actionable alerts.

Minimal: Reduce alerting noise. Actionable takes care of a lot of that, but parent relationships between checks can reduce a dozen or a hundred alerts to one: "SQL database is down" rather than 100 "web frontend timeout", for example.
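
A rough sketch of the parent idea, not how any particular monitoring tool implements it: if a check's parent (or any ancestor) is also failing, suppress the child and alert only on the root. The function name `root_alerts`, the `parents` mapping, and the check names are made up for illustration.

```python
def root_alerts(failing, parents):
    """Return the failing checks that have no failing ancestor.

    failing: set of check names currently failing (hypothetical names).
    parents: dict mapping a check to the check it depends on.
    """
    def has_failing_ancestor(check):
        seen = set()  # guards against accidental cycles in the mapping
        parent = parents.get(check)
        while parent is not None and parent not in seen:
            if parent in failing:
                return True
            seen.add(parent)
            parent = parents.get(parent)
        return False

    return sorted(c for c in failing if not has_failing_ancestor(c))


# 100 web frontends that all depend on one SQL database.
parents = {f"web-frontend-{i}": "sql-database" for i in range(100)}
failing = set(parents) | {"sql-database"}

print(root_alerts(failing, parents))  # ['sql-database']
```

If the database is healthy and only one frontend is failing, that frontend has no failing ancestor and still alerts on its own.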

Nagging: An alert should go to one person, in a way that they are sure to see it. Bypass DND settings; text, notification, call. That person should either acknowledge and work it, or the next person in the on-call rotation should get nagged, until someone owns it. I used to have an Android app that I wrote that would ring my phone on loud every 15 seconds if I had a missed call or text, for example. These days, I'm relying more on pings every 15 minutes if the alert is not acked or resolved.
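
As a sketch of that escalation logic (assumptions, not anyone's actual setup): page one person every 15 minutes, move to the next person in the rotation after a fixed number of unanswered pages, and keep nagging the last person rather than going silent. The name `nag_schedule`, the `nags_per_person` parameter, and the people are hypothetical; only the 15-minute interval comes from the comment above. The caller would stop consuming the schedule as soon as someone acks.

```python
def nag_schedule(rotation, total_minutes, nag_every=15, nags_per_person=2):
    """Yield (minute, person) pages for an alert that nobody acknowledges.

    rotation: on-call people in escalation order (hypothetical names).
    nags_per_person: unanswered pages before escalating (an assumption).
    """
    for n, minute in enumerate(range(0, total_minutes + 1, nag_every)):
        # Escalate one step per `nags_per_person` pages; stay on the last
        # person once the rotation is exhausted.
        step = min(n // nags_per_person, len(rotation) - 1)
        yield minute, rotation[step]


for minute, person in nag_schedule(["alice", "bob"], total_minutes=60):
    print(f"t+{minute:>2} min: page {person}")
```

Only one person is paged at any given time, so ownership is never ambiguous.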

03z · on March 5, 2021

Thank you! The first two points make sense. The third, put differently, is about not nagging the whole on-call group at once, which is interesting. Is this a widespread practice?

linsomniac · on March 8, 2021

On-call groups are very widespread. PagerDuty and Icinga/Nagios implement them for sure.