I am 100% certain I live more in that world than you; You can check my resume if you want to get into a dick waving contest.
What I'm saying is that the two probabilities are independent, possibly correlated, but not dependent. You need some number of nines in your control plane for scaling operations. You need some number of nines in your control plane for updates. These are very few, and they don't overly affect the serving plane, so long as the serving plane is itself resilient to the errors that happen even when a control plane is running, like sudden node failure.
Proper modeling of these failure conditions is not as simple as multiplying probabilities. The chance of failures in your serving path goes up as the time between control plane readiness goes up. You calculate (Really, only ever guesstimate, but you can get some good information for those guesses) the probability of a failure in the serving plane (incl. increases in traffic to the point of overload) before the control plane has had a chance to take actions again, and you worry about MTTF and MTBR of the control plane more than the "Reliability" - You can have a control plane with 75% or less "uptime" by action failure rate but that still takes actions on a regular cadence and never notice.
You can build reliable infrastructure out of unreliable components. The control plane itself is an unreliable component, and you can serve traffic at massive scale with control planes faulty or down completely - Without affecting serving traffic. You don't need more nines in your control plane than your serving cluster - That is the only point I am addressing/contesting. You can have many, many less and still be doing right fine.
What I'm saying is that the two probabilities are independent, possibly correlated, but not dependent. You need some number of nines in your control plane for scaling operations. You need some number of nines in your control plane for updates. These are very few, and they don't overly affect the serving plane, so long as the serving plane is itself resilient to the errors that happen even when a control plane is running, like sudden node failure.
Proper modeling of these failure conditions is not as simple as multiplying probabilities. The chance of failures in your serving path goes up as the time between control plane readiness goes up. You calculate (Really, only ever guesstimate, but you can get some good information for those guesses) the probability of a failure in the serving plane (incl. increases in traffic to the point of overload) before the control plane has had a chance to take actions again, and you worry about MTTF and MTBR of the control plane more than the "Reliability" - You can have a control plane with 75% or less "uptime" by action failure rate but that still takes actions on a regular cadence and never notice.
You can build reliable infrastructure out of unreliable components. The control plane itself is an unreliable component, and you can serve traffic at massive scale with control planes faulty or down completely - Without affecting serving traffic. You don't need more nines in your control plane than your serving cluster - That is the only point I am addressing/contesting. You can have many, many less and still be doing right fine.