> It must have something to do with the number of mistakes, otherwise it's all a waste of time!
Not really. Imagine two systems with the same number of mistakes. (Here the mistakes can be either bugs or operator mistakes.)
One is designed such that every mistake brings the whole system down for a day with millions of dollars of lost revenue each time.
The other is designed such that a mistake is usually caught early, and when it is not caught, it only impacts a limited part of the system, and recovery is fast and reliable.
They both have the same number of mistakes, yet one of these two systems is vastly more reliable.
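To put rough numbers on it, here is a back-of-the-envelope sketch. Every figure in it (10 mistakes a year, a 24-hour full outage versus a 30-minute outage hitting 5% of the system) is invented purely for illustration, and the `availability` helper is hypothetical, not a standard formula from any tool.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def availability(mistakes_per_year, hours_lost_per_mistake, fraction_affected):
    """Fraction of total system-hours that stay up over a year.

    fraction_affected is the share of the system a single mistake takes down
    (1.0 means the whole system, 0.05 means 5% of it).
    """
    lost_hours = mistakes_per_year * hours_lost_per_mistake * fraction_affected
    return 1 - lost_hours / HOURS_PER_YEAR

# Same number of mistakes for both systems (made-up figure).
mistakes = 10

# System 1: every mistake takes everything down for a day.
fragile = availability(mistakes, hours_lost_per_mistake=24, fraction_affected=1.0)

# System 2: a mistake hits 5% of the system and is recovered in 30 minutes.
resilient = availability(mistakes, hours_lost_per_mistake=0.5, fraction_affected=0.05)

print(f"fragile:   {fragile:.5f}")
print(f"resilient: {resilient:.5f}")
```

With these assumed inputs the first system loses 240 system-hours a year and the second loses a quarter of an hour, even though the mistake count is identical.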
> if it's not reducing the number of mistakes, what's it all for
For reducing their impact.