
This is a persistent myth that is just flat-out wrong. Your k8s cluster orchestrator does not need to be online very often at all. The kube-proxies will gladly continue proxying traffic using the last configuration they know, and your containers will continue to run. Hiccups, or outright outages, in the kube API server do not cause downtime, unless you are using certain terrible, awful, no good, very bad proxies within the cluster (istio, linkerd).


Your CONTROL PLANE doesn't immediately cause outages if it goes down.

But if your workloads stop and can't be restarted on the same node, you've got a degradation if not an outage.


What alternatives do you have? No matter which system you are using, database failovers will require external coordination. We are talking about PostgreSQL, so that normally means something like Patroni with an external service (unless you mean something manual). I find it easier to manage just one such service, Kubernetes, and to use it both for running the database process and for coordinating failovers via Patroni.
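
To make "external coordination" concrete: what Patroni does is essentially keep a leader lock with a TTL in a shared store (etcd, Consul, or the Kubernetes API itself) and promote a replica when the lock expires. A minimal conceptual sketch of that loop, not Patroni's actual code; the dcs, me, and postgres objects and their methods are hypothetical stand-ins:

    import time

    LEADER_TTL = 30  # seconds; hypothetical, plays the role of Patroni's ttl setting

    def failover_loop(dcs, me, postgres):
        # dcs: a leader key with a TTL in some shared store (etcd, Consul,
        # or the Kubernetes API). Everything here is a hypothetical stand-in,
        # not Patroni's real interfaces.
        while True:
            leader = dcs.get_leader()
            if leader == me.name:
                # I'm the primary: heartbeat to keep the lock from expiring.
                dcs.refresh_leader(me.name, ttl=LEADER_TTL)
            elif leader is None:
                # Lock expired: the healthiest replica races to take it and promotes.
                if me.is_healthiest_replica() and dcs.acquire_leader(me.name, ttl=LEADER_TTL):
                    postgres.promote()
            else:
                # Someone else holds the lock: make sure we replicate from them.
                postgres.follow(leader)
            time.sleep(10)  # loop_wait-style cadence

The point of the sketch is that the coordinator only needs a consistent key-value store with TTLs, which is exactly the part Kubernetes already gives you.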


Yes, but that's workloads || operator, not workloads && operator: you don't need four nines on your control plane just to keep your workloads alive. Your control plane can be significantly less reliable than your workloads, and the workloads will keep serving fine.

In real practice, it's so cheap to keep your operator running redundantly that it will probably end up with more nines than your workloads, but it doesn't need to.
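
To put rough numbers on the || vs && point, a minimal sketch with assumed availabilities (illustrative, not measured from any real cluster):

    # Assumed, illustrative availabilities.
    workloads = 0.9999       # four nines for the serving plane
    control_plane = 0.99     # two nines for the operator / control plane

    # If serving truly depended on both being up at once (workloads && operator):
    serial = workloads * control_plane

    # If serving only needs the workloads, and the control plane merely has to
    # show up eventually to repair/scale (workloads || operator, roughly):
    serving = workloads

    print(f"workloads && operator: {serial:.4%}")   # ~98.99%, nines destroyed
    print(f"workloads alone:       {serving:.4%}")  # still 99.99%

In this toy model the operator's two nines only show up in how long repairs and scale-ups take, not as a multiplier on serving availability.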


You're assuming a static cluster.

In my world scaling is required. Meaning new nodes and new pods. Meaning you need a control plane.

Even in development, no control plane means no updates.

In production, no scaling means I'm going to have a user-facing issue at the next traffic spike.


I am 100% certain I live in that world more than you do; you can check my resume if you want to get into a dick-waving contest.

What I'm saying is that the two probabilities are separate: possibly correlated, but not dependent on each other. You need some number of nines in your control plane for scaling operations. You need some number of nines in your control plane for updates. Those requirements are modest, and they don't much affect the serving plane, so long as the serving plane is itself resilient to the errors that happen even when a control plane is running, like sudden node failure.

Proper modeling of these failure conditions is not as simple as multiplying probabilities. The chance of a failure in your serving path goes up with the length of the gaps during which the control plane can't take action. You calculate (really, only ever guesstimate, but you can get good information to feed those guesses) the probability of a failure in the serving plane (including traffic increasing to the point of overload) before the control plane has had a chance to take action again, and you worry about the MTTF and MTTR of the control plane more than its "reliability". You can have a control plane with 75% or less "uptime" by action failure rate, but that still takes actions on a regular cadence, and never notice.
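
A back-of-the-envelope version of that guesstimate, as a sketch with made-up rates: serving-plane incidents modeled as Poisson arrivals, the control plane described only by its MTTF and MTTR. Every number here is an assumption for illustration:

    import math

    # Hypothetical numbers, purely for illustration.
    cp_mttf_h = 500.0       # control plane: mean time to failure, hours
    cp_mttr_h = 1.0         # control plane: mean time to repair, hours
    sp_events_per_h = 0.01  # serving-plane events needing control-plane action
                            # (node loss, scale-up), per hour

    # Fraction of time the control plane is unavailable.
    cp_downtime_frac = cp_mttr_h / (cp_mttf_h + cp_mttr_h)

    # Probability that at least one serving-plane event lands inside a single
    # control-plane outage window of length cp_mttr_h (Poisson arrivals).
    p_event_during_outage = 1 - math.exp(-sp_events_per_h * cp_mttr_h)

    # Expected "exposed" events per year: events that occur while the control
    # plane cannot respond. This, not the control plane's raw uptime, is what
    # actually threatens serving.
    exposed_events_per_year = sp_events_per_h * cp_downtime_frac * 24 * 365

    print(f"control plane uptime:        {1 - cp_downtime_frac:.4%}")
    print(f"P(event during one outage):  {p_event_during_outage:.2%}")
    print(f"exposed events per year:     {exposed_events_per_year:.3f}")

With these made-up numbers the control plane is "only" about 99.8% available, yet serving is exposed for roughly 0.17 events per year; that is the sense in which MTTR and cadence matter more than raw nines.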

You can build reliable infrastructure out of unreliable components. The control plane itself is an unreliable component, and you can serve traffic at massive scale with control planes faulty or down completely, without affecting serving traffic. You don't need more nines in your control plane than in your serving cluster; that is the only point I am addressing/contesting. You can have many, many fewer and still be doing just fine.



