We've started noting these won't-fixes down as risks and talking about their probability and impact. That has led to good, realistic discussions with people from other departments and higher up.
Like, sure, people with access to the servers can run <ansible all -m command -a 'shutdown now' -b> and worse. And we've had people nuke production servers, so there is real impact involved in our work style -- though redundancy, and gradually ramping people up from non-critical systems to more critical ones, mitigates this a lot.
But some people got a bit concerned about the potential impact.
However, if you look realistically at the number of changes people push into the infrastructure on a daily basis, the chance of this actually happening seems very low -- and the errors we do see mostly happen under pressure and stress. Our team is already over capacity, so adding more controls here would also slow all of our internal customers down a lot.
So now it is just a documented and accepted risk that we're able to burn production to the ground in one or two shell commands.
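For concreteness, the register entry for that one looks roughly like this -- a minimal sketch with invented field names and wording, not our actual document:

    - risk: anyone with server access can take all of production down
      probability: low    # given how many changes go out daily, the base rate is tiny
      impact: high        # full outage until redundancy and restores kick in
      mitigations:
        - redundant systems
        - new people start on non-critical systems and ramp up gradually
      status: accepted    # signed off after discussion with the other departments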
The amount of deliberate damage anyone on my team can do is pretty much catastrophic. But we accept this as a risk; it is appropriate for the environment. If we were running a bank it would be inappropriate, but we're not running a bank.
I pushed back on risk management one time, when The New Guy rebuilt our CI system. It was great, all bells and whistles and tests, except that deploying a change now took 5 minutes. Same for rolling one back. I said "Dude, this used to take 20 seconds. If I made a mistake I would know, and fix it, in 20 seconds. Now we have all these tests which still let me cause a total outage, but fixing it takes 10 minutes." He did make it faster in the end :)
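I won't claim this is exactly how our deploys worked, but a 20-second rollback is usually some variant of the symlink-flip pattern -- a minimal sketch, with the inventory group, paths, and service name all invented for illustration:

    - hosts: app_servers                    # hypothetical inventory group
      become: true
      tasks:
        - name: Point 'current' back at the previous release
          ansible.builtin.file:
            src: /opt/app/releases/previous # hypothetical release layout
            dest: /opt/app/current
            state: link
        - name: Restart the service so it serves the old code again
          ansible.builtin.service:
            name: app                       # hypothetical service name
            state: restarted

No tests in the rollback path, on purpose: when production is already down, the fastest safe action is to go back to the thing that worked.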