A string value in a json config needed to be updated.
On one prod instance, typo while updating the config by hand. Config validation of the software caught it, software stopped with the appropriate error message, a few minutes later we were up and running again.
We introduced work reviews on prod instances (similar to code reviews) after that.
Later, he then wrote a patch script to avoid making that mistake again.
In the json schema definition used in the script, the name of the property had a typo (how it came to be... no clue, copy paste should have taken care of that).
The script was part of a MR, the reviewer missed the typo. We noticed it in staging.
We introduced tests for config editing scripts after that.
And so it went on and on... The problem is not that it happens and we then refine our processes. It is the frequency.
What I’m seeing here is that you don’t have mature mechanisms to assure the reliability of your services yet. The second paragraph suggests that a misconfiguration was able to make it into production that arguably should have been caught at an earlier stage of the deployment pipeline. Anyone can make these sorts of mistakes; the fact that a particular colleague is more prone to them really doesn’t matter all that much.
Fortify your delivery pipeline and the problem should resolve itself.
They are not, but think of it like learning to play a guitar: at first, the strings cut into your fingers, but then you build up enough calluses and playing it stops hurting. Or, consider a building code: every rule was written in blood, and new buildings get safer over time.
A string value in a json config needed to be updated.
On one prod instance, typo while updating the config by hand. Config validation of the software caught it, software stopped with the appropriate error message, a few minutes later we were up and running again.
We introduced work reviews on prod instances (similar to code reviews) after that.
Later, he then wrote a patch script to avoid making that mistake again.
In the json schema definition used in the script, the name of the property had a typo (how it came to be... no clue, copy paste should have taken care of that).
The script was part of a MR, the reviewer missed the typo. We noticed it in staging.
We introduced tests for config editing scripts after that.
And so it went on and on... The problem is not that it happens and we then refine our processes. It is the frequency.