So the proper way to think about this is in terms of where you put abstractions. Much like you'd write a function or library, you can abstract physical machines by idealizing the components in a system.
These comments are incredibly helpful, thank you for taking your time to write them.
Please correct me if I'm wrong, but I think this is called Reliability Engineering / Safety Engineering? Those might be some good things to search for, for anyone interested.
I don't think it's really a separate keyword to search for. This is all probabilities.
The math isn't complex; the hard part is writing down a complete graph of all the connections between different components, environments, and failure scenarios. If your valve is made of five parts and only one of them has a meaningful chance of failing, say 10% per year, then your valve has a 10% chance of failing per year. If it has two parts that each have a 10% chance of failing in a year, then, assuming independent failures, the total probability of that component failing is 19% in a year.
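To make that arithmetic concrete, here's a minimal sketch of my own in Python: the component fails if any of its parts fails, assuming independent failures, so the 19% above is just 1 - (0.9 x 0.9).

    # Probability that a component fails within a year if it fails whenever
    # ANY of its parts fails, assuming independent part failures.
    def component_failure_prob(part_probs):
        p_all_survive = 1.0
        for p in part_probs:
            p_all_survive *= (1.0 - p)
        return 1.0 - p_all_survive

    print(round(component_failure_prob([0.10]), 4))        # 0.1  -> one 10% part
    print(round(component_failure_prob([0.10, 0.10]), 4))  # 0.19 -> two 10% parts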
These numbers are rarely known with much precision during initial design. Consider it akin to estimating the probability of certain kinds of predictable bugs in a library you're using: how much do you trust that GitHub repository vs. Intel? The most robust approach is typically to design around your best guesses, then do validation testing to refine those guesses.
So if I think a critical valve or seal has a high probability of failure but have low confidence in what the probability is, I'll take that valve or seal and literally set up a test case to make sure it performs as expected. Then I can collect real statistics and go from there. Data >> Guesses, but the systems are so complex that guesses are where you have to start.
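One simple way to do that refinement (my own sketch, not necessarily how anyone does it in practice) is to treat the design-time guess as a weak prior and fold in the bench-test results; all numbers below are made up.

    # Sketch: treat the design-time guess as a weak Beta prior and update it
    # with test results (failures observed over so many unit-years on test).
    def updated_failure_estimate(prior_guess, prior_weight, failures, trials):
        a = prior_guess * prior_weight + failures                    # pseudo + real failures
        b = (1 - prior_guess) * prior_weight + (trials - failures)   # pseudo + real survivals
        return a / (a + b)                                           # posterior mean

    # Guessed 10%/year with low confidence (worth ~5 pseudo-tests), then saw
    # 2 failures in 50 unit-years of accelerated testing:
    print(round(updated_failure_estimate(0.10, 5, 2, 50), 3))  # 0.045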
Then you'd basically put the system together, one part at a time, and validate with each added part that the entire system still behaves as expected. And you throw in some edge cases to ensure that controls are working properly, like perhaps in the aluminum heater case you'd simply break a thermocouple yourself to ensure that the safety system works. But you'd do that in testing, not in production.
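As a software analogy for "break a thermocouple yourself in testing," a fault-injection test might look like the sketch below; HeaterController, the None-means-broken convention, and the 600 C trip point are all hypothetical, purely for illustration.

    # Hypothetical safety interlock: a broken thermocouple reads as None,
    # and the controller must treat that as a fault and shut the heater off.
    class HeaterController:
        def __init__(self, read_thermocouple):
            self.read_thermocouple = read_thermocouple
            self.heater_on = True

        def step(self):
            temp = self.read_thermocouple()
            if temp is None or temp > 600:   # open sensor or overtemperature
                self.heater_on = False

    def test_broken_thermocouple_trips_safety():
        ctrl = HeaterController(read_thermocouple=lambda: None)  # inject the fault
        ctrl.step()
        assert not ctrl.heater_on

    test_broken_thermocouple_trips_safety()
    print("safety interlock trips on a broken thermocouple")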
It's really very analogous to unit tests, unsurprisingly, because the need is similar. I've had vendors ship me special custom thermocouples that they claimed would run for 10 years at 600C. We threw them in an oven as a trivial validation test. They caught fire. We didn't use that vendor again. By analogy, the firmware blob you get from a vendor is the same: they sure claim it does something, but until you've done real testing with it, who knows?
As you pin down the true probabilities of different failures, you just propagate them through your graph of possible failures to estimate the probability of different scenarios and focus on the high risk and high likelihood events. Sometimes the risk is as simple as "the system will be down for an hour while we replace a failed component". No biggie, maintenance is an expected cost. Sometimes the risk is a nuclear plant meltdown.
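A toy version of that propagation (my own sketch, not a real fault-tree tool): leaves carry estimated per-year probabilities, and OR/AND gates combine them assuming independence. The OR gate is the same arithmetic as the valve example above; the numbers here are made up.

    # OR gate: the event happens if ANY input fails (independent failures).
    def p_or(probs):
        p_none = 1.0
        for p in probs:
            p_none *= (1.0 - p)
        return 1.0 - p_none

    # AND gate: the event happens only if ALL inputs fail.
    def p_and(probs):
        result = 1.0
        for p in probs:
            result *= p
        return result

    # Hypothetical tree: overtemperature needs the control loop to misbehave
    # AND the independent safety cutout to fail at the same time.
    control_loop_fails = p_or([0.05, 0.02])   # bad sensor OR software fault
    safety_cutout_fails = 0.01
    print(round(p_and([control_loop_fails, safety_cutout_fails]), 5))  # 0.00069 per year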
EDIT: The goal of the above is to identify which causes result in critical failures with high likelihood. Once you've identified them, you focus on addressing the root cause. It's more about identifying where problems would start if there were a bad scenario, so you know where to spend more attention in quality control.
If you identify debouncing as a point where a failure means your machine doesn't work as needed, the actual solution could be software or hardware. What's the relative probability that each solution will work? How costly is a failure? How much does it cost to implement? At that point you're talking cost models, with reliability requirements as an input.
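For the debounce decision, the cost model can be as simple as "implementation cost plus expected failure cost," with the reliability requirement as a hard filter. The sketch below is my own and every number in it is made up for illustration.

    # Hypothetical options: (implementation cost, residual failure probability/year)
    candidates = {
        "software debounce":  (2_000, 0.02),
        "hardware RC filter": (8_000, 0.001),
    }
    REQUIREMENT = 0.01      # allowed failure probability per year
    FAILURE_COST = 50_000   # downtime, scrap, and repair if debouncing misbehaves

    def expected_cost(impl_cost, p_fail):
        return impl_cost + p_fail * FAILURE_COST

    # Only options that meet the reliability requirement are in the running;
    # among those, pick the lowest expected total cost.
    feasible = {name: expected_cost(cost, p)
                for name, (cost, p) in candidates.items() if p <= REQUIREMENT}
    best = min(feasible, key=feasible.get)
    print(best, feasible[best])   # hardware RC filter 8050.0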