The problem is that it sounds good to talk about "allocating resources" and "com...

tithe · on Aug 6, 2024

> In the end the only way to do this is to allow the system to fail.

This may be the efficient way for systems under test, but for a live, production system there must be a higher bar for performance than "let it fail". I agree with several of the points Malmberg makes (which my original sarcastic comment probably doesn't suggest), but his final conclusion of "let the system break" is alarming and dangerous.

> If you find yourself in a position of being a hero, you have to notice it and do something about it.

If I found myself being the hero, I would absolutely push this forward and do something about it. It's also a tragedy that this may actually result in the opposite of the outcome you want (like being fired for "not being a team player"). At the end of the day, it's still human beings in charge of these systems, which means handling our communications with grace and tact.

andrewla · on Aug 6, 2024

Just from experience, the system failures tend not to be of this mode. If heroism is routinely deployed, then yes, failures can be huge and even more heroism then needs to be applied.

But normally, what failures look like is a degradation in responsiveness, or a failure to scale up quickly enough for surging demand, or faster turnaround on canary failures, or caches that need to be purged after batch jobs, etc.

Much more "degradation below SLA" rather than "every Windows machine in the world blue screens". Heroism for disasters like that, sure, but that's going to be a post-mortem and a big deal. Most of the time failures are small, and letting them fail means that generally there is more awareness of the problem -- clearing caches or restarting the instances because they get slow conceals the problem and becomes part of the background routine.

The note on heroism is not a note for managers -- it's a note for SREs to actively notice when they are engaging in heroism and to stop doing that. Letting the cache get overloaded so that an automated system can do the purges because the development team is now aware of the issue is far preferable. And sometimes these routine acts of heroism become routine process/superstition to fix problems that no longer exist, or that are minor and not worth the time spent.