Should I be penalized if an upstream dependency, owned by another team, fails? Did I lack due diligence in choosing to accept the risk that the other team couldn't deliver? These are real problems in the microservices world, especially since I own the UI and there are dozens of teams pumping out services, and I'm at the mercy of all of them. The best I can do is fail gracefully when those services aren't healthy.
You and many others here may be conflating two concepts which are actually quite separate.
Taking blame is a purely punitive action and solves nothing. Taking responsibility means it's your job to correct the problem.
I find that the more "political" the culture in the organization is, the more likely everyone is to search for a scapegoat to protect their own image when a mistake happens. The higher you go up in the management chain, the more important vanity becomes, and the more you see it happening.
I have made plenty of technical decisions that turned out to be the wrong call in retrospect. I took _responsibility_ for those by learning from the mistake and reversing or fixing whatever was implemented. However, I never willfully took _blame_ for those mistakes because I believed I was doing the best job I could at the time.
Likewise, the systems I manage sometimes fail because something that another team manages failed. Sometimes it's something dumb that could have easily been prevented. In these cases, it's easy to point blame and say, "Not our fault! That team or that person is being a fuckup and causing our stuff to break!" It's harder but much more useful to reach out and say, "Hey, I see x system isn't doing what we expect, can we work together to fix it?"
Every argument I have on the internet comes down to the difference between prescriptive and descriptive language.
People tend to believe that if you can describe a problem, it means you can prescribe a solution. Oftentimes, the only way to survive is to make it clear that the first thing you are doing is describing the problem.
After you do that, and it's clear that's all you are doing, then you follow up with a prescription, laying out clearly what could be done to manage a similar scenario in the future.
If you don't create this bright line, you create a confused interpretation.
My comment was made from the relatively simpler entrepreneurial perspective, not the corporate one. Corp ownership rests with people in the C-suite who are social/political lawyer types, not technical people. They delegate responsibility but not authority, because they can hire people, even smart people, to work under those conditions. This is an error mode where "blame" flows from those who control the money to those who control the technology. Luckily, not all money is stupid so some corps (and some parts of corps) manage to function even in the presence of risk and innovation failures. I mean the whole industry is effectively a distributed R&D budget that may or may not yield fruit. I suppose this is the market figuring out whether iterated R&D makes sense or not. (Based on history, I'd say it makes a lot of sense.)
I wish you wouldn't talk about "penalization" as if it was something that comes from a source of authority. Your customers are depending on you, and you've let them down, and the reason that's bad has nothing to do with what your boss will do to you in a review.
The injustice that can and does happen is that you're explicitly given a narrow responsibility during development, and then a much broader responsibility during operation. This is patently unfair, and very common. For something like a failed microservice you want to blame "the architect" who didn't anticipate these system-level failures. What is the solution? Have plan B (and plan C) ready to go. If those fallback services don't exist, then you must build them. It also implies a level of indirection that most systems aren't comfortable with, because we want to consume services directly (and for good reason), but reliability requires that you never, ever consume a service directly, only through an in-process location that is failure aware.
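For illustration, a minimal sketch of that kind of failure-aware, in-process indirection in TypeScript. The upstream URL, the empty-list fallback, and the thresholds are all hypothetical; this is one shape the "never consume a service directly" rule can take, not the only one.

```typescript
// Hypothetical failure-aware wrapper: callers never hit the upstream
// service directly; they go through this in-process layer, which tracks
// recent failures and serves a fallback (plan B) when the dependency
// looks unhealthy. Names and thresholds are illustrative.

type Fetcher<T> = () => Promise<T>;

class FailureAwareClient<T> {
  private consecutiveFailures = 0;
  private openedAt = 0;

  constructor(
    private primary: Fetcher<T>,   // plan A: the upstream service
    private fallback: Fetcher<T>,  // plan B: cache, static data, or a third-party vendor
    private maxFailures = 3,       // trip after this many failures in a row
    private cooldownMs = 30_000,   // how long to stay on plan B before retrying plan A
  ) {}

  async get(): Promise<T> {
    const tripped =
      this.consecutiveFailures >= this.maxFailures &&
      Date.now() - this.openedAt < this.cooldownMs;

    if (tripped) return this.fallback();

    try {
      const result = await this.primary();
      this.consecutiveFailures = 0;
      return result;
    } catch (err) {
      this.consecutiveFailures++;
      this.openedAt = Date.now();
      return this.fallback(); // degrade gracefully instead of surfacing the outage
    }
  }
}

// Usage: the UI asks this client, never the service itself.
// The URL is a placeholder, and the fallback here is just an empty list.
const recommendations = new FailureAwareClient(
  () => fetch("https://api.example.com/recommendations").then(res => {
    if (!res.ok) throw new Error(`upstream returned ${res.status}`);
    return res.json();
  }),
  async () => [], // plan B: something the UI can still render
);
```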
This is why reliable software is hard, and engineers are expensive.
Oh, and it's also why you generally do NOT want to defer the last build step to runtime in the browser. If you start combining services on both the client and server, you're in for a world of hurt.
You get hit by a car and injured. The accident is the other driver's fault, but getting to the ER is your problem. The other driver may help and call an ambulance, but they might not even be able to help you if they also got hurt in the car crash.
Say during due diligence two options are uncovered: use an upstream dependency owned by another team, or use that plus a third-party vendor for redundancy. Implementing the parallel systems costs 10x more than the single dependency and takes 5x longer. You estimate a 0.01% chance of serious failure for the former and 0.001% for the latter.
Now say you're a medium-sized hyper-growth company in a competitive space. Does spending 10 times more and waiting 5 times longer for redundancy make business sense? You could argue that it would be irresponsible to over-engineer the system in this case, since you delay getting your product out and potentially lose money and ground to competitors.
I don't think a black and white "yes, you should be punished" view is productive here.
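To make that trade-off concrete, here is the arithmetic as a back-of-the-envelope TypeScript sketch. The failure probabilities come from the scenario above; the build cost and the cost of a serious failure are invented placeholders, so only the shape of the calculation matters, not the numbers.

```typescript
// Rough expected-cost comparison for the two options above.
// Probabilities come from the scenario; dollar figures are placeholders.

const costOfSeriousFailure = 5_000_000; // assumed loss if the dependency fails badly

const singleDependency = {
  buildCost: 100_000,   // assumed baseline build cost
  pFailure: 0.0001,     // 0.01% chance of serious failure
};

const redundantSystems = {
  buildCost: 1_000_000, // "costs 10x more"
  pFailure: 0.00001,    // 0.001% chance of serious failure
};

const expectedTotal = (o: { buildCost: number; pFailure: number }) =>
  o.buildCost + o.pFailure * costOfSeriousFailure;

console.log(expectedTotal(singleDependency)); // 100_000 + 500 = 100_500
console.log(expectedTotal(redundantSystems)); // 1_000_000 + 50 = 1_000_050
```

With these made-up numbers the redundant option never pays for itself on expected value alone, and that's before counting the 5x schedule delay; plug in a much larger assumed failure cost and the conclusion can flip, which is exactly why this is a judgment call rather than a yes/no punishment question.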
If it's a brand-new RISC-V CPU that was released five minutes ago and nobody has really tested it, then yes.
If it's a standard CPU that everybody else uses and it's not known to be bad, then no.
Same for software. Is it OK to have a dependency on AWS services? Their history shows yes. A dependency on a brand-new SaaS product? Nothing mission critical.
Or npm/crates/pip packages. Packages that have been around and been steadily maintained for a few years, and have active users, are worth checking out. Some random project from a single developer? Consider vendoring it (and owning it if necessary).
You choose the CPU and you choose what happens in a failure scenario. Part of engineering is making choices that meet the availability requirements of your service. And part of that is handling failures from dependencies.
That doesn't extend to ridiculous lengths, but as a rule you should engineer around any single point of failure.
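As a sketch of what "engineer around the single point of failure" can look like in code: query two independent providers and take whichever succeeds first. The provider URLs and the exchange-rate example are hypothetical.

```typescript
// Hypothetical sketch: no single provider is a single point of failure,
// because the result comes from whichever independent source succeeds first.

async function getExchangeRate(currency: string): Promise<number> {
  const providers = [
    `https://rates.provider-a.example/${currency}`,
    `https://rates.provider-b.example/${currency}`,
  ];

  // Promise.any resolves with the first fulfilled promise and only
  // rejects if every provider fails.
  return Promise.any(
    providers.map(async (url) => {
      const res = await fetch(url);
      if (!res.ok) throw new Error(`${url} returned ${res.status}`);
      const body = await res.json();
      return body.rate as number;
    }),
  );
}
```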
I think this is why we pay for support: the expectation is that if their product inadvertently causes losses, they will work fast to fix it or cover them.
Yes? If you are worried about CPU microcode failing, then you do what NASA does and have multiple CPU architectures doing calculations in a voting block. These are not unsolved problems.
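Transposed to software, the voting-block idea looks roughly like this sketch: run independent implementations of the same calculation and accept only a majority answer. The three implementations below are trivial stand-ins for results produced on different CPU architectures or by independently developed code.

```typescript
// Sketch of N-version / voting-block redundancy: run independent
// implementations of the same calculation and accept the majority result.

function majorityVote<T>(results: T[]): T {
  const counts = new Map<T, number>();
  for (const r of results) counts.set(r, (counts.get(r) ?? 0) + 1);

  for (const [value, count] of counts) {
    if (count > results.length / 2) return value; // strict majority wins
  }
  throw new Error("no majority: redundant implementations disagree");
}

// Stand-ins for the same computation done by independent implementations.
const implementations = [
  (x: number) => x * x,
  (x: number) => Math.pow(x, 2),
  (x: number) => x * x,
];

const answer = majorityVote(implementations.map(f => f(7))); // 49
```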
JPL goes further and buys multiple copies of all hardware and software media used for ground systems, and keeps them in storage "just in case". It's a relatively cheap insurance policy against the decay of progress.