
If companies took the concept of an 'error budget' seriously (every SRE principle seems to matter to companies except this one) and saw it as a signal instead of an annoyance, there would be some ebb and flow in this realm, instead of a continually compounding increase where stability rests on the backs of a few people with the tribal knowledge. Just my 2 cents.


Say more, please.

How does an error budget account for a lack of experts in the stability area? I can't quite make that connection, and I'd love to learn.


I'm guessing the suggestion is: once the error budget is spent, no new feature work is accepted until the causes of the errors are fixed.


Sure, that's solid policy, but I'm not sure it really addresses the "tribal knowledge" issue per se.

I mean, I'd like to think that fixing reliability issues increases knowledge, but decades on multithreaded codebases littered with "sleep(0); // just in case" point in a different direction.

All I've found to work so far is a deliberate attempt to shift to more active knowledge sharing. I was hoping to learn a few new tricks from OP, but "reliability freezes" are not it, based on my experience.


"Fixing the error" is at times subjective. A less nuanced approach is to freeze non-reliability-improving changes (i.e. merges) until production meets its SLO again. That is the canonical example policy given in Google's SRE Book.
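
For anyone who wants the arithmetic spelled out, here is a rough sketch in Python of how such a freeze could be decided. The names (SloWindow, merge_allowed) and the 99.9% target are made up for illustration and are not taken from the SRE Book; it also assumes the SLO is measured as a simple success ratio over a request count.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO measurement window (hypothetical shape)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int

    def error_budget(self) -> float:
        # Number of failed requests the SLO tolerates in this window.
        return (1.0 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        # Negative once more requests have failed than the SLO tolerates.
        return self.error_budget() - self.failed_requests


def merge_allowed(window: SloWindow, improves_reliability: bool) -> bool:
    # Reliability fixes are always allowed; everything else is frozen
    # once the budget for the window has been spent.
    if improves_reliability:
        return True
    return window.budget_remaining() > 0


# Example: 1,000,000 requests at a 99.9% target tolerate roughly 1,000 failures.
overspent = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)
healthy = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=500)
```

With these numbers, a feature merge would be blocked for `overspent` and allowed for `healthy`, while a reliability fix goes through either way. Note the sketch says nothing about who decides what counts as "improves reliability", which is the subjective part.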



