Being the hero once is fine -- but if the only reason we don't have outages on every update is because of an SRE who takes it upon themselves to babysit the update, then the system is super broken and there will come a time when that babysitting doesn't work and the lack of awareness of the problem will cause the cascade to be that much worse.
I do agree. If persistent heroic acts are a requirement to keep any system running, then let's allocate resources to determine why that's the case and how the system can be changed into something more resilient.
I also concur. A firefighter putting out a dangerous kitchen fire is heroic, but putting out the same fire several times without finding out the cause is negligent.
The problem is that it sounds good to talk about "allocating resources" and "communication" and making the system more resilient. But even identifying places where heroism is being applied is extremely difficult for anyone except the hero. Most importantly, the people who need to know that heroism is keeping things alive will often be completely unaware that there is even a problem.
In the end the only way to do this is to allow the system to fail. If you find yourself in a position of being a hero, you have to notice it and do something about it. You could do a big writeup of how to fix the system to remove the requirement for heroism, etc., but as an SRE you don't always have insight into how important the particular issue is, so you could be wasting a bunch of time on something completely unnecessary (or that is not worth your time or the time to fix it).
> In the end the only way to do this is to allow the system to fail.
This may be the efficient way for systems under test, but for a live, production system there must be a higher bar of performance than "let it fail". I agree with several of the points Malmberg makes (which my original sarcastic comment probably doesn't suggest), but his final conclusion of "let the system break" is alarming and dangerous.
> If you find yourself in a position of being a hero, you have to notice it and do something about it.
If I found myself being the hero, I would absolutely push this forward and do something about it. It's also a tragedy that this may actually result in the opposite of the outcome you want (like being fired for "not being a team player"). At the end of the day, it's still human beings in charge of these systems, which means handling our communications with grace and tact.
Just from experience, the system failures tend not to be of this mode. If heroism is routinely deployed, then yes, failures can be huge and even more heroism then needs to be applied.
But normally, what failures look like is a degradation in responsiveness or a failure to scale up quickly enough for surging demand or faster turnaround on canary failures or caches that need to be purged after batch jobs, etc.
Much more "degradation below SLA" rather than "every Windows machine in the world blue screens". Heroism for disasters like that, sure, but that's going to be a post-mortem and a big deal. Most of the time failures are small, and letting them fail means that generally there is more awareness of the problem -- clearing caches or restarting the instances because they get slow conceals the problem and becomes part of the background routine.
The note on heroism is not a note for managers -- it's a note for SREs to actively notice when they are engaging in heroism and to stop doing that. Letting the cache get overloaded so that an automated system can do the purges because the development team is now aware of the issue is far preferable. And sometimes these routine acts of heroism become routine process/superstition to fix problems that no longer exist, or that are minor and not worth the time spent.
There are people so addicted to this that they will literally create problems out of nowhere so they can pull some heroics and save the day, always with high visibility from management. Seen a person advance pretty far in their career this way. Until management stops incentivizing this behavior, it won't stop. This is a management issue - which seems weird because this writing seems targeted towards ICs.
One instance I can remember immediately - lead SRE turns off an alert for some DLQ that has been finicky and experiencing periodic issues, often requiring intervention from the on-call team. He doesn’t tell the on-call team he turned this off, suspecting that after a day or so something downstream will blow up. Then it does, he appears out of nowhere to save the day with the precise solution and looks like a genius for it.
At my place we would wonder why the alert was turned off which would most likely have been audit logged in some way. Perhaps they only play chaos monkey in systems where you can change things anonymously.
Have you never heard of a volunteer firefighter starting fires? It doesn't happen often, but it does happen. (And maybe with professional firefighters, I don't know.)
You don't want heroes at large companies with top-down product management. You need heroes at small innovative startups. This write-up is more of a documentation of the stagnant culture inside Google.
In my experience being the hero is fantastic … until you want to go on vacation, have a sick day, change teams, or get promoted. Hard to promote someone irreplaceable.
Always code (or mentor) yourself out of the job and let others play with your legos. Even if they do it wrong.
I strongly suggest that after that slide, there needs to be a whole series of slides about how to make it so that it's ok to let the system break. If you haven't already done the hard work to make your stuff resilient, "let the system break" is a recipe for blowing up customers, damaging reputations, and hurting people.
I really dislike the way this slide deck is written. It's rewriting a failure of management (bad project planning, too few people for the workload) and presenting it as a failure by all the team members.
"The Hero decides that, despite this, ..."
"No matter what they're told about not doing this."
"The team doesn't realize..."
"Heroism is low risk, and easy to do."
"Help the Hero figure out what they should do instead."
"But the Hero won't let it go."
I suspect the likely scenario that prompted this document to be written was something like a manager facing low morale from his team, who has just been asked to explain why there was a catastrophic failure that he hadn't communicated upwards. Likely, he hadn't been doing his job properly, had no idea how much work his team was actually doing, the team was massively overloaded and worried about the job culls in other departments, worried because their boss kept saying things like "this was due yesterday", and so had been doing everything possible to stop the proverbial hitting the fan... and one day it reached bursting point, and they simply couldn't cope with all the work, despite already being forced to do overtime. Maybe some of them had even quit as a result, and complained to HR about the work-life balance in the team.
But the team leader can't possibly be at fault. This is the management spin on it: it's all the team members' fault, and the poor manager had no idea what was going on, not because he was a terrible manager, but because the team had been deliberately hiding all the work they were doing from him. They didn't want to go home to their wives and kids, but were choosing to spend their evenings working on secret projects to stoke their own egos or deal with their own insecurities, and concealing all the extra work from their managers.
This article highlights many pitfalls but fails to explain "how to practice heroism effectively".
For instance, a team member might notice a recurring pattern and repeatedly save the SLA by addressing it immediately. While this quick fix is heroic, it should also be escalated for a long-term solution. This way, the hero tackles the immediate issue, and the team ensures that such heroism isn't needed in the future, and so on.
Heroism is a good thing I believe, as long as it is not applied systematically.
Example 1: A client has a deadline and a malfunction or unpredictable limitation of our product is in their critical path. A few people put in a collaborative effort, meaning working extra hours for a few days, to help them out. Later the customer is happy and the boss throws a celebration drink.
Example 2: An ICT member got a message that could indicate a security breach over the weekend. He logs in and sees more suspicious activity. He takes first actions (disabling all logins/access matching certain criteria) and calls the head of ICT.
Did any of the commenters read the slideshow? Heroism is bad when it covers systemic problems.
Heroes are great -- SREs who rise to the occasion to prevent horrors are appropriately rewarded and congratulated for their work.
But when a product relies upon heroes to continue operating, you are in a dangerous situation. That's how major outages occur; the hero goes on vacation or decides to let it break this time and the cascade of failures causes huge amounts of damage, where letting the system break much earlier would have made it clear to the development team that there is a major gap in the intrinsic reliability of the system.
Could Google stop the "heroism syndrome" and give us the source code for their deactivated services? Even if they aren't parsed to their heroic servers and it's about being self-hostable by non-heroes.
[1] All teams should have a Jordan, a Kobe, a Shaquille or a combi. One needs A players and a supporting cast. It is not the culture or the org that decides upon the evolution of the heroism. It is the hero who builds a team around him/her.
[2] The Scrum or agile saga that promotes that all team members should be able to do what all team members do is just Excel-minded nonsense. Can't win championships with only goalkeepers, or only midfielders. Can't prep one to be good at both during a lifetime, either.
Probably Google wants wheat crops that always look alike and are predictable?
Agreed. The other plausible, realistic and decent option is that the author of the reaction to my comment does not understand the article at all and does not understand much about who has influence in a company and who’s incentivized to hide, change or create problems. Good luck!
Total junk. Don't blame the "hero" for their behavior, blame the management for not thinking ahead and making sure the problems didn't fester, blister, and boil over to the point at which babysitting the systems over the weekend became necessary.
No "hero" ever does this work without trying to plan for it ahead of time. "Heroics" are necessary when the system lets them down and stops letting long-term thinking and planning account for problems.
If you're in a profit center, you might get rewarded for your risk.