If companies took the concept of an 'error budget' seriously (every SRE principle seems to matter to companies except this one) and treated it as a signal instead of an annoyance, there would be some ebb and flow in this realm, rather than a steadily compounding increase where stability rests on the backs of a few people with the tribal knowledge. Just my 2 cents.
Sure, that's solid policy, but I'm not sure it really addresses the "tribal knowledge" issue per se.
I mean, I'd like to think that fixing reliability issues increases knowledge, but decades spent in multithreaded codebases littered with "sleep(0); // just in case" point in a different direction.
All I've found to work so far is a deliberate attempt to shift to more active knowledge sharing. I was hoping to learn a few new tricks from OP, but "reliability freezes" are not it, based on my experience.
"Fixing the error" is at times subjective. A less nuanced approach is to freeze non-reliability-improving changes (i.e. merges) until the production meets SLO again. That is the canonical example policy given in Google's SRE Book.