Hacker News
Why Heroism Is Bad and What We Can Do to Stop It (sre.google)
67 points by RyeCombinator on Aug 6, 2024 | 45 comments


Whenever I see these Google SRE articles, I kinda reduce the message to "don't be a hero in a cost center in an org with nearly unlimited resources."

If you're in a profit center, you might get rewarded for your risk.


Yeah ha, this was literally my thought.


I imagine a conversation with this individual as team lead would go something like this:

"So, you worked overtime to save systems across the planet from crashing due to a botched update?"

"Yes, sir. We're 'Site Reliability Engineering', after all."

"And people in airports don't have to sleep on the floors because airlines can actually schedule flights?"

"Yes, sir. Site Reliability Engineering, at its finest, sir!"

"No, you played the hero. That's bad for the team and normally for you, really. You should have let it break."

"But our team...is 'Site Reliability Engineering'?"

"You should have let it break."

"But, Site...Reliability?"

"You're fired."


Being the hero once is fine -- but if the only reason we don't have outages on every update is because of an SRE who takes it upon themselves to babysit the update, then the system is super broken and there will come a time when that babysitting doesn't work and the lack of awareness of the problem will cause the cascade to be that much worse.


I do agree. If persistent heroic acts are a requirement to keep any system running, then let's allocate resources to determine why that's the case and how the system can be changed into something more resilient.


I also concur. A firefighter putting out a dangerous kitchen fire is heroic, but putting out the same fire several times without finding out the cause is negligent.


The problem is that it sounds good to talk about "allocating resources" and "communication" and making the system more resilient. But even identifying places where heroism is being applied is extremely difficult for anyone except the hero. Most importantly, the people who need to know that heroism is keeping things alive will often be completely unaware that there is even a problem.

In the end the only way to do this is to allow the system to fail. If you find yourself in a position of being a hero, you have to notice it and do something about it. You could do a big writeup of how to fix the system to remove the requirement for heroism, etc., but as an SRE you don't always have insight into how important the particular issue is, so you could be wasting a bunch of time on something completely unnecessary (or that is not worth your time or the time to fix it).


> In the end the only way to do this is to allow the system to fail.

This may be the efficient way for systems under test, but for a live, production system there must be a higher bar of performance than "let it fail". I agree with several of the points Malmberg makes (which my original sarcastic comment probably doesn't suggest), but his final conclusion of "let the system break" is alarming and dangerous.

> If you find yourself in a position of being a hero, you have to notice it and do something about it.

If I found myself being the hero, I would absolutely push this forward and do something about it. It's also a tragedy that this may actually result in the opposite of the outcome you want (like being fired for "not being a team player"). At the end of the day, it's still human beings in charge of these systems, which means handling our communications with grace and tact.


Just from experience, the system failures tend not to be of this mode. If heroism is routinely deployed, then yes, failures can be huge and even more heroism then needs to be applied.

But normally, what failures look like is a degradation in responsiveness, or a failure to scale up quickly enough for surging demand, or slow turnaround on canary failures, or caches that need to be purged after batch jobs, etc.

Much more "degradation below SLA" than "every Windows machine in the world blue screens". Heroism for disasters like that, sure, but that's going to be a post-mortem and a big deal. Most of the time failures are small, and letting them fail means that generally there is more awareness of the problem -- clearing caches or restarting the instances because they get slow conceals the problem and becomes part of the background routine.

The note on heroism is not a note for managers -- it's a note for SREs to actively notice when they are engaging in heroism and to stop doing that. Letting the cache get overloaded so that an automated system can do the purges because the development team is now aware of the issue is far preferable. And sometimes these routine acts of heroism become routine process/superstition to fix problems that no longer exist, or that are minor and not worth the time spent.


More likely: "You worked overtime to save systems across the planet from crashing due to a potentially botched update?"

"Yes, sir. We're 'Site Reliability Engineering', after all."

"And nothing has changed. Updates are as smooth as usual."

"Yes, sir. Site Reliability Engineering, at its finest, sir!"

"You're fired."


There are people so addicted to this that they will literally create problems out of nowhere so they can pull some heroics and save the day, always with high visibility from management. I've seen a person advance pretty far in their career this way. Until management stops incentivizing this behavior, it won't stop. This is a management issue, which seems weird because this writing seems targeted towards ICs.


I haven't run into this person but I'd love to hear about problems you've seen people create - I can't even think of a problem I could create and fix.


One instance I can remember immediately: a lead SRE turns off an alert for some DLQ that has been finicky and experiencing periodic issues, often requiring intervention from the on-call team. He doesn't tell the on-call team he turned this off, suspecting that after a day or so something downstream will blow up. Then it does, and he appears out of nowhere to save the day with the precise solution and looks like a genius for it.


At my place we would wonder why the alert was turned off which would most likely have been audit logged in some way. Perhaps they only play chaos monkey in systems where you can change things anonymously.


Have you never heard of a volunteer firefighter starting fires? It doesn't happen often, but it does happen. (And maybe with professional firefighters, I don't know.)


You don't want heroes at large companies with top-down product management. You need heroes at small innovative startups. This write-up is more a documentation of the stagnant culture inside Google.


You DO need heroes at large companies with top-down product managers.

But even small companies cannot afford to have heroism be the only reason that their systems work.


Yeah, the whole thing reads kind of like a research paper "Amazing Procedure Dramatically Improves Group Cohesion -- in mice."

Analogous to "Employees Having Outstanding Problem-Solving Initiative; Shown Detrimental to Company Objectives -- at Google."


"No matter how many hours they need to work."

"No matter that they need to work evenings and weekends."

I don't call these people heroes, I call them idiots.

Also, not being able to copy/paste text from text slides is a pretty terrible design choice, but we shouldn't be surprised knowing what the source is.


"I'm going to be a Happy Idiot

and struggle for the Legal Tender

where the Ads Take Aim

and Lay Their Claim

to the Heart and the Soul of the Spender

And believe in Whatever May lie

in those things that Money Can Buy

though True Love could have been a Contender

Are you there?

Say a Prayer

for The Pretender

who started out so Young and Strong

only to Surrender"

-- J. Browne, 1976


They're not idiots if they're paid by the hour.


They (SREs) are not, generally, paid by the hour.


In my experience being the hero is fantastic … until you want to go on vacation, have a sick day, change teams, or get promoted. Hard to promote someone irreplaceable.

Always code (or mentor) yourself out of the job and let others play with your legos. Even if they do it wrong.


That's precisely what I learned from things such as Bank Python:

1. Create legos

2. Jump to another similar place to create the same legos


Watch the world burn with vodka in hand. It's much better than doing nothing while you can.


The last slide says "let the system break".

I strongly suggest that after that slide, there needs to be a whole series of slides about how to make it so that it's ok to let the system break. If you haven't already done the hard work to make your stuff resilient, "let the system break" is a recipe for blowing up customers, damaging reputations, and hurting people.


Sometimes that's the only way to get the message across to leaders.


It's wild to me that ICANN allowed .google and other brands to be TLDs.


Why not though?


Heroism is what you do until you manage to secure the headcount and hire the team that lets you run things smoothly.

In the real world, getting approval for headcount can take 6 months, hiring 3, training another 3.

So you need to sustain heroism for a year without burning out.


Heroism as described in this slideshow prevents anyone from realizing/accepting that they need to hire the appropriate headcount.


While not doing heroism prevents your product/service from being successful or reaching the market in a timely fashion.


Not really. At a mature org, if anyone is putting in over 40 hours a week, management should be investigating. It's a warning flag either way.


Management frequently does not care so long as nobody is complaining hard enough.


I really dislike the way this slide deck is written. It's rewriting a failure of management (bad project planning, too few people for the workload) and presenting it as a failure by all the team members.

"The Hero decides that, despite this, ..."

"No matter what they're told about not doing this."

"The team doesn't realize..."

"Heroism is low risk, and easy to do."

"Help the Hero figure out what they should do instead."

"But the Hero won't let it go."

I suspect the scenario that prompted this document was something like a manager who is facing low morale on his team and has just been asked to explain a catastrophic failure that he hadn't communicated upwards. Likely, he hadn't been doing his job properly and had no idea how much work his team was actually doing. The team was massively overloaded, worried about the job culls in other departments, and worried because their boss kept saying things like "this was due yesterday", and so had been doing everything possible to stop the proverbial hitting the fan... until one day it reached bursting point and they simply couldn't cope with all the work, despite already being forced to do overtime. Maybe some of them had even quit as a result and complained to HR about the work-life balance in the team.

But the team leader can't possibly be at fault. This is the management spin on it: it's all the team members' fault, and the poor manager had no idea what was going on, not because he was a terrible manager, but because the team had been deliberately hiding all the work they were doing from him. They didn't want to go home to their wives and kids, but were choosing to spend their evenings working on secret projects to stoke their own egos or deal with their own insecurities, and concealing all the extra work from their managers.


This article highlights many pitfalls but fails to explain "how to practice heroism effectively".

For instance, a team member might notice a recurring pattern and repeatedly save the SLA by addressing it immediately. While this quick fix is heroic, it should also be escalated for a long-term solution. This way, the hero tackles the immediate issue, and the team ensures that such heroism isn't needed in the future, and so on.


What a bombastic title—heroism is bad.

I’m thinking this whole piece is slanted to correct some other toxic or difficult to manage culture issue.

Getting to examples quickly saves the piece. Sounds like there are some gung-ho youths happy to be working at Google and they need some mentoring.


Heroism is a good thing I believe, as long as it is not applied systematically.

Example 1: A client has a deadline, and a malfunction or unpredictable limitation of our product is in their critical path. A few people put in a collaborative effort, meaning working extra hours for a few days, to help them out. Later the customer is happy and the boss throws a celebration drink.

Example 2: An ICT member gets a message over the weekend that could indicate a security breach. He logs in and sees more suspicious activity. He takes first actions (disabling all logins/access matching certain criteria) and calls the head of ICT.


Did any of the commenters read the slideshow? Heroism is bad when it covers systemic problems.

Heroes are great -- SREs who rise to the occasion to prevent horrors are appropriately rewarded and congratulated for their work.

But when a product relies upon heroes to continue operating, you are in a dangerous situation. That's how major outages occur; the hero goes on vacation or decides to let it break this time and the cascade of failures causes huge amounts of damage, where letting the system break much earlier would have made it clear to the development team that there is a major gap in the intrinsic reliability of the system.


> Why Heroism Is Bad and What We Can Do to Stop It

Talk to Hollywood? /s


could Google stop the "heroism syndrome" and give us the source-code for their deactivated services? even if they aren't parsed to their heroic servers and it's about being self-host-able by non-heroes


Not sure if heroism is bad.

[1] All teams should have a Jordan, a Kobe, a Shaquille or a combination. One needs A players and a supporting cast. It is not the culture or the org that decides how the heroism evolves. It is the hero who builds a team around him/her. [2] The scrum or agile saga that promotes the idea that all team members should be able to do what all other team members do is just Excel-minded nonsense. You can't win championships with only goalkeepers, or only midfielders. Nor can you prep one person to be good at both in a lifetime.

Probably Google wants wheat crops that always look alike and are predictable?


What you're describing isn't at all what the linked page is discussing.


Agreed. The other plausible, realistic and decent option is that the author of the reaction to my comment does not understand the article at all and does not understand much about who has influence in a company and who’s incentivized to hide, change or create problems. Good luck!


Total junk. Don't blame the "hero" for their behavior, blame the management for not thinking ahead and making sure the problems didn't fester, blister, and boil over to the point where babysitting the systems over the weekend became necessary.

No "hero" ever does this work without trying to plan for it ahead of time. "Heroics" are necessary when the system lets them down and stops letting long-term thinking and planning account for problems.



