If companies took the concept of an 'error budget' (every SRE principle seems to...

groby_b · on Oct 26, 2022

Say more, please.

How does an error budget account for a lack of experts in the stability area? I can't make that connection, and I'd love to learn.

bigbluedots · on Oct 26, 2022

I'm guessing the suggestion is: once the error budget is spent, no new feature work is accepted until the causes of the errors are fixed.
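
For anyone following along, here is a minimal sketch of how the policy described above could be expressed in Python. It is only an illustration: the `SLOWindow` class, the `merge_allowed` function, and the 99.9% target are hypothetical names and numbers chosen for the example, not taken from any particular tool or from the comments here.

```python
from dataclasses import dataclass


@dataclass
class SLOWindow:
    """Hypothetical summary of one rolling SLO measurement window."""
    slo_target: float     # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int   # requests observed in the window
    failed_requests: int  # requests that counted against the SLO

    def error_budget(self) -> float:
        # The budget is the number of failures the SLO tolerates.
        return (1.0 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        # Negative once more failures occurred than the SLO allows.
        return self.error_budget() - self.failed_requests


def merge_allowed(window: SLOWindow, improves_reliability: bool) -> bool:
    """Assumed freeze rule: reliability fixes always go in,
    feature work only while some error budget is left."""
    if improves_reliability:
        return True
    return window.budget_remaining() > 0


# Two illustrative windows with made-up numbers.
healthy = SLOWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
exhausted = SLOWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)

print(merge_allowed(healthy, improves_reliability=False))    # feature work while budget remains
print(merge_allowed(exhausted, improves_reliability=False))  # feature work after budget is spent
print(merge_allowed(exhausted, improves_reliability=True))   # reliability fix during the freeze
```

With these numbers the budget is roughly 1,000 failed requests per window, so the first window still accepts feature work and the second accepts only reliability fixes. Deciding what counts as "improves reliability" is left to a human in this sketch.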

groby_b · on Nov 4, 2022

Sure, that's solid policy, but I'm not sure it really addresses the "tribal knowledge" issue per se.

I mean, I'd like to think that fixing reliability issues increases knowledge, but decades on multithreaded codebases littered with "sleep(0); // just in case" point in a different direction.

All I've found to work so far is a deliberate attempt to shift to more active knowledge sharing. I was hoping to learn a few new tricks from OP, but "reliability freezes" are not it, based on my experience.

kubanczyk · on Oct 26, 2022

"Fixing the error" is at times subjective. A less nuanced approach is to freeze non-reliability-improving changes (i.e. merges) until production meets its SLO again. That is the canonical example policy given in Google's SRE Book.