Ask HN: What Makes an Infra Monitoring and Alerting System Effective?
2 points by 03z on March 5, 2021 | 3 comments
Hello HN,

I'm writing up some of my thoughts on what makes a good infrastructure monitoring and alerting system. Do you have any thoughts based on your experience?



For alerts: Actionable, minimal, and nagging.

Actionable: I should only be alerted if there is a definite action to take: clearing up disk space, investigating suspicious activity, replacing a hard drive or BBU, fixing connectivity issues. If an alert clears on its own, that probably indicates a problem with monitoring frequency or thresholds. If an outage occurs without alerting, ditto. If an alert gets ignored, it probably indicates that there are too many non-actionable alerts.

Minimal: Reduce alerting noise. Actionable takes care of a lot of that, but parent/child dependency relationships between checks can reduce a dozen or a hundred alerts to one. "SQL database is down" rather than 100 "web frontend timeout", for example.
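
To make the parent relationship idea concrete, here is a rough Python sketch. The service names, the `parents` map, and the `root_causes` helper are all made up for illustration and are not taken from any particular monitoring tool. It assumes each check has at most one parent and only alerts on failing checks that have no failing ancestor.

```python
from typing import Dict, List, Optional, Set


def root_causes(failing: Set[str], parents: Dict[str, Optional[str]]) -> List[str]:
    """Return the failing checks that have no failing ancestor.

    `parents` is a hypothetical dependency map: check name -> parent check
    name, or None for a top-level check.
    """
    roots = []
    for check in sorted(failing):
        ancestor = parents.get(check)
        seen = set()  # guards against accidental cycles in the map
        suppressed = False
        while ancestor is not None and ancestor not in seen:
            if ancestor in failing:
                suppressed = True  # a parent is already down; stay quiet
                break
            seen.add(ancestor)
            ancestor = parents.get(ancestor)
        if not suppressed:
            roots.append(check)
    return roots


# Illustrative topology: 100 web frontends that all depend on one database.
parents: Dict[str, Optional[str]] = {"sql-database": None}
frontends = [f"web-frontend-{i}" for i in range(100)]
for name in frontends:
    parents[name] = "sql-database"

everything_down = set(frontends) | {"sql-database"}
print(root_causes(everything_down, parents))
```

With the database and all 100 frontends failing, only the database alert survives. If just a few frontends fail while the database is healthy, those frontends are reported individually.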

Nagging: An alert should go to one person, in a way that they are sure to see it. Bypass DND settings, text, notification, call. That person should either acknowledge and work it, or the next person in the on-call rotation should get nagged, and so on until someone owns it. I used to have an Android app that I wrote that would ring my phone on loud every 15 seconds if I had a missed call or text, for example. These days, I rely more on pings every 15 minutes if the alert is not acked or resolved.
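
A minimal sketch of that nag-then-escalate logic, in Python. The function name, the rotation list, and the "two nags before moving to the next person" rule are assumptions for illustration; only the 15-minute ping interval comes from my own setup. Real tools express this through their own escalation policies.

```python
from typing import List, Optional


def who_to_nag(
    rotation: List[str],
    minutes_since_fired: int,
    acked: bool,
    nag_interval: int = 15,           # ping every 15 minutes
    nags_before_escalation: int = 2,  # assumed: two unanswered pings per person
) -> Optional[str]:
    """Return the one person to page right now, or None if no page is due."""
    if acked or not rotation:
        return None  # someone owns it, or there is nobody to page
    if minutes_since_fired % nag_interval != 0:
        return None  # not on a nag boundary yet
    nag_number = minutes_since_fired // nag_interval  # 0 for the first page
    # Move down the rotation after each batch of unanswered nags,
    # and keep nagging the last person once the rotation is exhausted.
    level = min(nag_number // nags_before_escalation, len(rotation) - 1)
    return rotation[level]


rotation = ["alice", "bob"]  # hypothetical on-call order
for minute in (0, 15, 30, 45):
    print(minute, who_to_nag(rotation, minute, acked=False))
```

Only one person is paged at any time: the primary gets the first two pings, then the next person in the rotation takes over, and the pings stop as soon as the alert is acknowledged.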


Thank you! The first two points make sense. The third, or to put it differently, not nagging the whole on-call group at once, is interesting. Is this a widespread practice?


On-call groups are very widespread. PagerDuty and Icinga/Nagios implement them for sure.



