I haven't asked AWS employees specifically about blameless postmortems, but seve...

azinman2 · on Dec 7, 2021

When I was at Google I didn't have a lot of exposure to the public infra side. However, I do remember back in 2008 when a colleague working on the routing side of YouTube made a change that cost millions of dollars in mere hours before he noticed and reverted it. He mentioned this to the larger team during a tech talk, and they applauded. I cannot possibly generalize the culture differences between Amazon and Google, but at least in that one moment, the Google culture seemed to support the idea that errors happen, get noticed, and get fixed without harming the perceived performance of those responsible.

wolverine876 · on Dec 7, 2021

While I support that, how are the people involved evaluated?

azinman2 · on Dec 7, 2021

I was not informed of his performance reviews. However, given the reception, his work in general, and the attitudes of the team, I cannot imagine this even came up. More likely, I'm sure, the ability to improve routing and actually make YouTube cheaper in the end was the ultimate positive result.

This was also towards the end of the golden age of Google, when the percentage of top talent was a lot higher.

wolverine876 · on Dec 7, 2021

So on what basis is someone's performance reviewed, if such performance is omitted?

marcan_42 · on Dec 8, 2021

The entire point of blameless postmortems is acknowledging that the mere existence of an outage does not inherently reflect on the performance of the people involved. This allows you to instead focus on building resilient systems that avoid the possibility of accidental outages in the first place.

wolverine876 · on Dec 8, 2021

I know. That's not what I'm asking about, if you might read my question.

xmprt · on Dec 8, 2021

I'll play devil's advocate here and say that sometimes these incidents deserve praise because they uncovered an issue that was previously unknown. Also, if the incident had a large negative impact, then it shows leadership how critical normal operation of that service is. Even if you were the cause of the issue, the fact that you fixed it and kept the critical service operating the rest of the time is worth something good.

wolverine876 · on Dec 8, 2021

I know; that's not what I'm asking about. I'm talking about a different issue.

yebyen · on Dec 8, 2021

Mistakes happen, and a culture that insists too hard that "mistakes shouldn't happen, and so we can't be seen making mistakes" is harmful toward engineering.

How should their performance be evaluated, if not by the rote number of mistakes that can be pinned onto the person, and their combined impact? (Was that the question?)

abdabab · on Dec 7, 2021

If an engineer causes an outage by mistake and then ensures it never happens again, he has made a positive impact.

wolverine876 · on Dec 7, 2021

I understand that, but eventually they need to evaluate performance, for promotions, demotions, raises, cuts, hiring, firing, etc. How is that done?

abdabab · on Dec 8, 2021

It’s standard. The career ladder [1] sets expectations for each level. Performance is measured against those expectations. Outages don’t negatively impact a single engineer.

The key difference is the perspective. If reliability is bad, that’s an organizational problem, and blaming or punishing one engineer won’t fix that.

[1] An example ladder from Patreon: https://levels.patreon.com/

wolverine876 · on Dec 8, 2021

> The key difference

The key difference between what and what?

abdabab · on Dec 8, 2021

Your approach and their approach. It sounded like you have a different perspective about who is responsible for an outage.