
That sounds like the exact opposite of human-factors engineering. No one likes taking blame. But when things go sideways, people are extra spicy and defensive, which makes them clam up and often withhold useful information, which can extend the outage.

No-blame analysis is a much better pattern. Everyone wins. It's about building the system that builds the system. Stuff broke; fix the stuff that broke, then fix the things that let stuff break.



I worked at Walmart Technology. I bravely wrote post-mortem documents owning my team's faults (100+ people), owning them both technically and culturally as their leader. I put together a plan to fix the problems and executed it. I thought that was the right thing to do. This happened two times in my 10-year career there.

Both times I was called out as a failure in my performance eval. Second time, I resigned and told them to find a better leader.

Happy now I am out of such shitty place.


That's shockingly stupid. I also worked for a major Walmart IT services vendor in another life, and we always had to be careful about how we handled them, because they didn't always show a lot of respect for vendors.

On another note, thanks for building some awesome stuff -- walmart.com is awesome. I have both Prime and whatever they're currently calling Walmart's version, and I love that Walmart doesn't appear to mix SKUs together in the same bin, which seems to cause counterfeiting fraud at Amazon.


walmart.com's user design sucks. My particular grudge right now: I'm shopping for stuff to go pick up (and indicate "in store pickup"), and each time I search for the next item, it resets that filter, making me click on that filter for each item on my list.


Almost every physical-store-chain company's website makes it way too hard to do the thing I nearly always want out of their interface, which is to search the inventory of the X nearest locations. They all want to push online orders or 3rd-party-seller crap, it seems.


Yes, I assume they intentionally make it difficult in order to push third-party sellers, where they get to earn bigger profit margins and/or hide their low inventory.

Amazon is the worst, though, then Walmart (still much better than Amazon since you can at least filter). The others are not bad in my experience.


Walmart.com: am I the only one in the world who can't view their site on my phone? I tried it on a couple of devices and couldn't get it to work. Scaling is fubar. I assume this must be costing them millions/billions, since it's impossible to buy something from my phone right now. S21+ in portrait, on multiple browsers.


What's a "bin" in this context?


I believe he means a literal bin. E.g. Amazon takes products from all their sellers and chucks them in the same physical space, so they have no idea who actually sold the product when it's picked. So you could have gotten something from a dodgy 3rd party seller that repackages broken returns, etc, and Amazon doesn't maintain oversight of this.


Literally just a bin in a fulfillment warehouse.

An amazon listing doesn't guarantee a particular SKU.


Ah, whew. That's what I thought. Thanks! I asked because we make warehouse and retail management systems and every vendor or customer seems to give every word their own meanings (e.g., we use "bin" in our discounts engine to be a collection of products eligible for discounts, and "barcode" has at least three meanings depending on to whom you're speaking).


Is WalMart.com awesome?


Props to you, and Walmart will never realize their loss. Unfortunately. But one day there will be a headline (or even a couple of them) and you will know that if you had been there it might not have happened, and that in the end it is Walmart's customers who will pay the price for that, not their shareholders.


Stories like this are why I'm really glad I stopped talking to that Walmart Technology recruiter a few years ago. I love working for places where senior leadership constantly repeat war stories about "that time I broke the flagship product" to reinforce the importance of blameless postmortems. You can't fix the process if the people who report to you feel the need to lie about why things go wrong.


But hope you found a better place?


that's awful. You should have been promoted for that.


is it just 'ceremony' to be called out on those things? (even if it is actually a positive sum total)


> Happy now I am out of such shitty place.

Doesn't sound like it.


I firmly believe in the dictum "if you ship it you own it". That means you own all outages. It's not just an operator flubbing a command, or a bit of code that passed review when it shouldn't. It's all your dependencies that make your service work. You own ALL of them.

People spend all this time threat modelling their stuff against malefactors, and yet so often people don't spend any time thinking about the threat model of decay. They don't do it when adding new dependencies (build-time or runtime), and are therefore unprepared to handle an outage.

There's a good reason for this, of course: modern software "best practices" encourage moving fast and breaking things, which includes "add this dependency we know nothing about, and which gives an unknown entity the power to poison our code or take down our service, arbitrarily, at runtime, but hey its a cool thing with lots of github stars and it's only one 'npm install' away".

Just want to end with this PSA: Dependencies bad.


Should I be penalized if an upstream dependency, owned by another team, fails? Did I lack due diligence in choosing to accept the risk that the other team couldn't deliver? These are real problems in the microservices world, especially since I own the UI and there are dozens of teams pumping out services, and I'm at the mercy of all of them. The best I can do is fail gracefully when those services aren't healthy.


You and many others here may be conflating two concepts which are actually quite separate.

Taking blame is a purely punitive action and solves nothing. Taking responsibility means it's your job to correct the problem.

I find that the more "political" the culture in the organization is, the more likely everyone is to search for a scapegoat to protect their own image when a mistake happens. The higher you go up in the management chain, the more important vanity becomes, and the more you see it happening.

I have made plenty of technical decisions that turned out to be the wrong call in retrospect. I took _responsibility_ for those by learning from the mistake and reversing or fixing whatever was implemented. However, I never willfully took _blame_ for those mistakes because I believed I was doing the best job I could at the time.

Likewise, the systems I manage sometimes fail because something that another team manages failed. Sometimes it's something dumb that could easily have been prevented. In those cases, it's easy to point blame and say, "Not our fault! That team or that person is being a fuckup and causing our stuff to break!" It's harder but much more useful to reach out and say, "hey, I see x system isn't doing what we expect, can we work together to fix it?"


Every argument I have on the internet is between prescriptive and descriptive language.

People tend to believe that if you can describe a problem, that means you can prescribe a solution. Oftentimes, the only way to survive is to make it clear that the first thing you are doing is describing the problem.

After you do that, and it's clear that's all you are doing, then you follow up with a prescription, laying out clearly what could be done to manage a future scenario.

If you don't create this bright line, you create a confused interpretation.


My comment was made from the relatively simpler entrepreneurial perspective, not the corporate one. Corp ownership rests with people in the C-suite who are social/political lawyer types, not technical people. They delegate responsibility but not authority, because they can hire people, even smart people, to work under those conditions. This is an error mode where "blame" flows from those who control the money to those who control the technology. Luckily, not all money is stupid so some corps (and some parts of corps) manage to function even in the presence of risk and innovation failures. I mean the whole industry is effectively a distributed R&D budget that may or may not yield fruit. I suppose this is the market figuring out whether iterated R&D makes sense or not. (Based on history, I'd say it makes a lot of sense.)


I wish you wouldn't talk about "penalization" as if it was something that comes from a source of authority. Your customers are depending on you, and you've let them down, and the reason that's bad has nothing to do with what your boss will do to you in a review.

The injustice that can and does happen is that you're explicitly given a narrow responsibility during development, and then a much broader responsibility during operation. This is patently unfair, and very common. For something like a failed microservice you want to blame "the architect" who didn't anticipate these system-level failures. What is the solution? Have plan B (and plan C) ready to go. If those services don't exist, then you must build them. It also implies a level of indirection that most systems aren't comfortable with, because we want to consume services directly (and for good reason), but reliability requires that you never, ever consume a service directly, but instead through an in-process location that is failure aware.

This is why reliable software is hard, and engineers are expensive.

Oh, and it's also why you generally do NOT want to defer the last build step to runtime in the browser. If you start combining services on both the client and server, you're in for a world of hurt.
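A minimal sketch of that failure-aware, in-process indirection (everything here is hypothetical, including the fetch_recommendations example): callers go through a wrapper that trips after repeated failures and serves a fallback, instead of calling the dependency directly.

    import time

    class FailureAwareClient:
        """Minimal circuit-breaker-style wrapper: callers never hit the
        dependency directly; they go through this in-process indirection."""

        def __init__(self, call, fallback, max_failures=3, reset_after=30.0):
            self.call = call              # the real service call
            self.fallback = fallback      # plan B when the dependency is down
            self.max_failures = max_failures
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = None

        def request(self, *args, **kwargs):
            # While "open", skip the dependency entirely until the cool-down expires.
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after:
                    return self.fallback(*args, **kwargs)
                self.opened_at = None     # half-open: give the dependency another try
                self.failures = 0
            try:
                result = self.call(*args, **kwargs)
                self.failures = 0
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
                return self.fallback(*args, **kwargs)

    # Hypothetical usage: degrade to an empty list instead of erroring out.
    # recs = FailureAwareClient(fetch_recommendations, lambda user: [])
    # homepage_recs = recs.request(user)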


Not penalised no, but questioned as to how well your graceful failure worked in the end.

Remember: it may not be your fault, but it still is your problem.


An analogy for illustrating this:

You get hit by a car and are injured. The accident is the other driver's fault, but getting to the ER is your problem. The other driver may help and call an ambulance, but they might not even be able to help you if they also got hurt in the crash.


> Should I be penalized if an upstream dependency, owned by another team, fails?

Yes

> Did I lack due diligence in choosing to accept the risk that the other team couldn't deliver?

Yes


Say during due diligence two options are uncovered: use an upstream dependency owned by another team, or use that plus a 3P vendor for redundancy. Implementing parallel systems costs 10x more than the former and takes 5x longer. You estimate a 0.01% chance of serious failure for the former, and 0.001% for the latter.

Now say you're a medium sized hyper-growth company in a competitive space. Does spending 10 times more and waiting 5 times longer for redundancy make business sense? You could argue that it'd be irresponsible to over-engineer the system in this case, since you delay getting your product out and potentially lose $ and ground to competitors.

I don't think a black and white "yes, you should be punished" view is productive here.
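For what it's worth, a back-of-the-envelope way to frame that call (the failure probabilities come from the comment above; every dollar figure is invented for illustration):

    def expected_total_cost(build_cost, p_serious_failure, failure_cost):
        """Build cost plus the probability-weighted cost of a serious failure."""
        return build_cost + p_serious_failure * failure_cost

    failure_cost = 2_000_000  # hypothetical cost of one serious outage

    single_dependency = expected_total_cost(100_000, 0.0001, failure_cost)    # 0.01% risk
    with_redundancy = expected_total_cost(1_000_000, 0.00001, failure_cost)   # 0.001% risk

    print(single_dependency)  # 100200.0
    print(with_redundancy)    # 1000020.0
    # With these made-up numbers the 10x build cost dwarfs the risk reduction,
    # and that's before counting the 5x schedule hit.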


Where does this mindset end? Do I lack due diligence by choosing to accept that the cpu microcode on the system I’m deploying to works correctly?


If it's a brand new RISC-V CPU that was just released 5 minutes ago and nobody has really tested it, then yes.

If it's a standard CPU that everybody else uses, and it's not known to be bad, then no.

Same for software. Is it OK to have a dependency on AWS services? Their history says yes. A dependency on a brand new SaaS product? Nothing mission critical.

Or npm/crates/pip packages. Packages that have been around and steadily maintained for a few years, and have active users, are worth checking out. Some random project from a single developer? Consider vendoring it (and owning it if necessary).


Why? Intel has Spectre/Meltdown which erased like half of everyone's capacity overnight.


You choose the CPU and you choose what happens in a failure scenario. Part of engineering is making choices that meet the availability requirements of your service. And part of that is handling failures from dependencies.

That doesn't extend to ridiculous lengths but as a rule you should engineer around any single point of failure.


I think this is why we pay for support, with the expectation that if their product inadvertently causes losses for you they will work fast to fix it or cover the losses.


Yes? If you are worried about CPU microcode failing, then you do a NASA and have multiple CPU architectures doing calculations in a voting block. These are not unsolved problems.
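A toy sketch of the voting-block idea (purely illustrative; real redundant flight computers are far more involved):

    from collections import Counter

    def vote(results):
        """Majority vote across redundant computations (e.g. the same value
        computed on dissimilar CPU architectures); fail loudly without a majority."""
        winner, count = Counter(results).most_common(1)[0]
        if count <= len(results) // 2:
            raise RuntimeError("no majority agreement between redundant units")
        return winner

    # Hypothetical: three independent units compute the same value;
    # a single corrupted result gets outvoted.
    print(vote([42, 42, 41]))  # -> 42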


JPL goes further and buys multiple copies of all hardware and software media used for ground systems, and keeps them in storage "just in case". It's a relatively cheap insurance policy against the decay of progress.


That's a great philosophy.

Ok, let's take an organization, let's call them, say, Ammizzun. Totally not Amazon. Let's say you have a very aggressive hire/fire policy which worked really well during the rapid scaling and growth of your company. Now you have a million-odd customers highly dependent on systems that were built by people who are now one? two? three? four? hire/fire, up-or-out, or cashed-out cycles gone.

So.... who owns it if the people that wrote it are lllloooooonnnnggg gone? Like, not just long gone one or two cycles ago so some institutional memory exists. I mean, GONE.


A lot can go wrong as an organization grows, including loss of knowledge. At Amazon, "Ownership" officially rests with the non-technical money that owns voting shares. They control the board, which controls the CEO. "Ownership" can be perverted to mean that you, a wage slave, are responsible for the mess that previous ICs left behind. The obvious thing to do in such a circumstance is quit (or not apply). It is unfair and unpleasant to be treated in a way that gives you responsibility but no authority, and to participate in maintaining (and extending) that moral hazard, and as long as there are better companies you're better off working for them.


I worked on a project like this in government for my first job. I was the third butt in that seat in a year. Everyone associated with the project whom I knew was gone within a year of my own departure date.

They are now on the 6th butt in that seat in 4 years. That poor fellow is entirely blameless for the mess that accumulated over time.


Having individuals own systems seems like a terrible practice. You're essentially creating a single point of failure if only one person understands how the system works.


If I were a black hat, I would absolutely love GitHub and all the various language-specific package systems out there, giving me sooooo many ways to sneak arbitrary, tailored malicious code into millions of installs around the world, 24x7. Sure, some of my attempts might get caught, or might go unnoticed yet not lead to a valuable outcome for me. But the percentage that does pay off? That can make it worth it. It's about scale and a massive parallelization of infiltration attempts: logic similar to the folks blasting out phishing emails or scam calls.

I love the ubiquity of third-party software from strangers, and the lack of bureaucratic gatekeepers. But I also hate it in ways. And not enough people know about the dangers of that second thing.


And yet, oddly enough, the Earth continues to spin and the internet continues to work. I think the system we have now is necessarily the system that must exist (in this particular case, not in all cases). Something more centralized is destined to fail. And, while the open source nature of software introduces vulnerabilities, it also fixes them.


> And, while the open source nature of software introduces vulnerabilities it also fixes them.

dat gap tho... which was my point. smart black hats will be exploiting this gap, at scale. and the strategy will work because the majority of folks seem to be either lazy, ignorant or simply hurried for time.

and btw your 1st sentence was rude. constructive feedback for the future


For my vote, I don't think it was rude, I think it was making a point.


When working on CloudFiles, we often had monitoring for our limited dependencies that was better than their own monitoring. Don't just know what your stuff is doing; know what your whole dependency ecosystem is doing, and know when it all goes south. It also helps to learn where and how you can mitigate some of those dependencies.
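A minimal sketch of that kind of dependency watching (endpoints and thresholds are made up): probe each dependency from your own vantage point and alert on what you observe, not on their status page.

    import time
    import urllib.request

    # Hypothetical health endpoints for the things we depend on.
    DEPENDENCIES = {
        "object-store": "https://storage.example.internal/health",
        "anti-ddos": "https://edge.example.internal/health",
    }

    def probe(name, url, timeout=2.0):
        """Return (dependency, ok, latency_seconds) as seen from our side."""
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                ok = resp.status == 200
        except OSError:
            ok = False
        return name, ok, time.monotonic() - start

    for name, url in DEPENDENCIES.items():
        dep, ok, latency = probe(name, url)
        # Alert on our observation of the dependency, not their dashboard.
        if not ok or latency > 1.0:
            print(f"ALERT: {dep} looks unhealthy from here (ok={ok}, {latency:.2f}s)")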


This. We found very big, serious issues with our anti-DDOS provider because their monitoring sucked compared to ours. It was a sobering reality check when we realized that.


It's also a nightmare for software preservation. There's going to be a lot from this era that won't be usable 80 years from now because everything is so interdependent and impossible to archive. It's going to be as messy and irretrievable as the pre-Internet Archive, pre-Wayback Web.


I don't think engineers can believe in no-blame analysis if they know it'll harm career growth. I can't unilaterally promote John Doe, I have to convince other leaders that John would do well the next level up. And in those discussions, they could bring up "but John has caused 3 incidents this year", and honestly, maybe they'd be right.


Would they be right, though? Having 3 outages in a year sounds like an organizational problem: not enough safeguards to prevent very routine human errors. But instead of worrying about that, we just assign a guy to take the fall.


If you work in a technical role and you _don't_ have the ability to break something, you're unlikely to be contributing in a significant way. Likely that would make you a junior developer whose every line of code is heavily scrutinized.

Engineers should be experts and you should be able to trust them to make reasonable choices about the management of their projects.

That doesn't mean there can't be some checks in place, and it doesn't mean that all engineers should be perfect.

But you also have to acknowledge that adding all of those safeties has a cost. You can be a competent person who requires fewer safeties or less competent with more safeties.

Which one provides more value to an organization?


The tactical point is to remove sharp edges, e.g. there's a tool that optionally takes a region argument:

    network_cli remove_routes [--region us-east-1]
Blaming the operator, saying they should have known that running

    network_cli remove_routes

will take down all regions because the region wasn't specified, is exactly the kind of thing being called out here.

All of the tools need to not default to breaking the world. That is the first and foremost thing being pushed. If an engineer is remotely afraid to come forward (beyond self-shame/judgement) after an incident and say "hey, I accidentally did this thing", then the situation will never get any better.

That doesn't mean that engineers don't have the ability to break things, but it means it's harder (and very intentionally so) for a stressed out human operator to do the wrong thing by accident. Accidents happen. Do you just plan on never getting into a car accident, or do you wear a seat belt?
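One way to file off that particular sharp edge, sketched against the hypothetical network_cli above: make the all-regions case an explicit, confirmed choice rather than the silent default.

    import argparse
    import sys

    ALL_REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]  # made-up region list

    def main(argv=None):
        parser = argparse.ArgumentParser(prog="network_cli")
        parser.add_argument("command", choices=["remove_routes"])
        parser.add_argument("--region", action="append",
                            help="region to act on (repeatable)")
        parser.add_argument("--all-regions", action="store_true",
                            help="explicitly act on every region")
        args = parser.parse_args(argv)

        if args.all_regions:
            confirm = input(f"This will remove routes in ALL {len(ALL_REGIONS)} "
                            "regions. Type 'remove everything' to continue: ")
            if confirm != "remove everything":
                sys.exit("Aborted.")
            regions = ALL_REGIONS
        elif args.region:
            regions = args.region
        else:
            # The dangerous case is no longer the silent default.
            parser.error("specify --region (repeatable) or --all-regions")

        for region in regions:
            print(f"removing routes in {region}")  # placeholder for the real work

    if __name__ == "__main__":
        main()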


> Which one provides more value to an organization?

Neither, they both provide the same value in the long term.

Senior engineers cannot execute on everything they commit to without having a team of engineers they work with. If nobody trains junior engineers, the discipline would go extinct.

Senior engineers provide value by building guardrails to enable junior engineers to provide value by delivering with more confidence.


Well, if John caused 3 outages and his peers Sally and Mike each caused 0, it's worth taking a deeper look. There's a real possibility he's getting screwed by a messed-up org; he could also be doing slapdash work, or he seriously might not understand the seriousness of an outage.


John’s team might also be taking more calculated risks and running circles around Sally and Mike’s teams with respect to innovation and execution. If your organization categorically punishes failures/outages, you end up with timid managers that are only playing defense, probably the opposite of what the leadership team wants.


Worth a look, certainly. It's also very possible that John is upfront about honest postmortems and, like a good leader, takes the blame, whereas Sally and Mike are out all day playing politics, looking for ways to shift blame so that nothing has their name attached. At most larger companies, that's how it goes.


Or John's work is in frontline production use and Sally's and Mike's is not, so there's different exposure.


You're not wrong, but it's possible that the organization is small enough that it's just not feasible to have enough safeguards that would prevent the outages John caused. And in that case, it's probably best that John not be promoted if he can't avoid those errors.


Current co is small. We are putting in the safeguards from Day 1. Well, okay technically like day 120, the first few months were a mad dash to MVP. But now that we have some breathing room, yeah, we put a lot of emphasis on preventing outages, detecting and diagnosing outages promptly, documenting them, doing the whole 5-why's thing, and preventing them in the future. We didn't have to, we could have kept mad dashing and growth hacking. But very fortunately, we have a great culture here (founders have lots of hindsight from past startups).

It's like a seed for crystal growth. Small company is exactly the best time to implement these things, because other employees will try to match the cultural norms and habits.


Well, I started at the small company I'm currently at around day 7300, where "source control" consisted of asking the one person who was in charge of all source code for a copy of the files you needed to work on, and then giving the updated files back. He'd write down the "checked out" files on a whiteboard to ensure that two people couldn't work on the same file at the same time.

The fact that I've gotten it to the point of using git with automated build and deployment is a small miracle in itself. Not everybody gets to start from a clean slate.


> I have to convince other leaders that John would do well the next level up.

"Yes, John has made mistakes and he's always copped to them immediately and worked to prevent them from happening again in the future. You know who doesn't make mistakes? People who don't do anything."


You know why SO-teams, firefighters and military pilots are so successful?

- You don't hide anything

- Errors will be made

- After training/mission, everyone talks about the errors (or potential ones) and how to prevent them

- You don't make the same error twice

Being afraid to make errors and learn from them creates a culture of hiding, a culture of denial and especially being afraid to take responsibility.


You can even make the same error twice, but you'd better have a much better explanation the second time around than you had the first time, because you already knew that what you did was risky and/or failure-prone.

But usually it isn't the same person making the same mistake, usually it is someone else making the same mistake and nobody thought of updating processes/documentation to the point that the error would have been caught in time. Maybe they'll fix that after the second time ;)


Yes. AAR process in the army was good at this up to the field grade level, but got hairy on G/J level staffs. I preferred being S-6 to G-6 for that reason.


There is no such thing as "no-blame" analysis. Even in the best organizations with the best effort to avoid it, there is always a subconscious "this person did it". It doesn't help that these incidents serve as convenient places for others to leverage to climb their own career ladder at your expense.


Or just take responsibility. People will respect you for doing that and you will demonstrate leadership.


Cynical/realist take: take responsibility and then hope your bosses already love you, that you can immediately come up with a way to prevent it from happening again, and that you can convince them to give you the resources to implement it. Otherwise your responsibility is, unfortunately, just blood in the water for someone else to do all of that, protect the company against you, and springboard their reputation on the descent of yours. There were already senior people scheming to take over your department from your bosses; now they have an excuse.


This seems like an absolutely horrid way of working or doing 'office politics'.


Yes, and I personally have worked in environments that do just that. They said they didn't, but with management "personalities" plus stack ranking, you know damn well that they did.


And the guy who doesn't take responsibility gets promoted. Employees are not responsible for failures of management to set a good culture.


The Gervais/Peter Principle is alive and well in many orgs. That doesn't mean that when you have the prerogative to change the culture, you just give up.

I realize that isn't an easy thing to do. Often the best bet is to just jump around till you find a company that isn't a cultural superfund site.


Not in healthy organizations, they don't.


You can work an entire career and maybe enjoy life in one healthy organization in that entire time even if you work in a variety of companies. It just isn't that common, though of course voicing the _ideals_ is very, very common.


Once you reach a certain size there are surprisingly few healthy organizations; most of them turn into externalization engines with 4 beats per year.


I love it when I share a mental model with someone in the wild.


Way more fun argument: Outages just, uh… uh… find a way.


> No-blame analysis is a much better pattern. Everyone wins. It's about building the system that builds the system. Stuff broke; fix the stuff that broke, then fix the things that let stuff break.

Yea, except it doesn't work in practice. I work with a lot of people who come from places with "blameless" post-mortem 'culture' and they've evangelized such a thing extensively.

You know what all those people have proven themselves to really excel at? Blaming people.


Ok, and? I don't doubt it fails in places. That doesn't mean that it doesn't work in practice. Our company does it just fine. We have a high trust, high transparency system and it's wonderful.

It's like saying unit tests don't work in practice because bugs got through.


Have you ever considered that the “no-blame” postmortems you are giving credit for everything are just a side effect of living in a high trust, high transparency system?

In other words, “no-blame” should be an emergent property of a culture of trust. It’s not something you can prescribe.


Yes, exactly. Culture of trust is the root. Many beneficial patterns emerge when you can have that: more critical PRs, blameless post-mortems, etc.



