
Always said by people who haven't spent much time in the cloud.

Because single hosts will always go down. Just a question of when.



I love k8s, but bringing back up a single app that crashed is a very different problem from "our k8s is down" - because if you think your k8s won't go down, you're in for a surprise.

You can also view a single k8s cluster as a single host, which will go down at some point (e.g. a botched upgrade, cloud network partition, or something similar). While such failures are much less frequent, they're also much more difficult to get out of.

Of course, if you have a multi-cloud setup with automatic (and periodically tested!) app migration across clouds, well then... Perhaps that's the answer nowadays... :)


> if you think your k8s won't go down, you're in for a surprise

Kubernetes is a remarkably reliable piece of software. I've administered a (large X) number of clusters, often with several years of lifetime each, everything being upgraded through the relatively frequent Kubernetes release cycle. We definitely needed maintenance windows sometimes, but well, no, Kubernetes didn't unexpectedly crash on us. Maybe I just got lucky, who knows. The closest we ever got was the underlying etcd cluster having heartbeat timeouts due to insufficient hardware, and etcd healed itself when the nodes were reprovisioned.
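
(For anyone chasing similar etcd symptoms, the checks we leaned on were roughly the ones below. The endpoint addresses are placeholders, and the tuning flags are standard etcd server options; where they actually get set depends on how your distro deploys etcd.)

    # member health and leader status (endpoints are placeholders)
    etcdctl --endpoints=https://10.0.0.1:2379,https://10.0.0.2:2379 endpoint health
    etcdctl --endpoints=https://10.0.0.1:2379,https://10.0.0.2:2379 endpoint status --write-out=table

    # heartbeat/election tuning lives on the etcd servers themselves (values in ms,
    # plus whatever other flags your deployment already passes)
    etcd --heartbeat-interval=100 --election-timeout=1000 ...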

There's definitely a whole lotta stuff in the Kubernetes ecosystem that isn't nearly as reliable, but that has to be differentiated from Kubernetes itself (and the internal etcd dependency).

> You can view a single k8s also as a single host, which will go down at some point (e.g. a botched upgrade, cloud network partition, or something similar)

The managed Kubernetes services solve the whole "botched upgrade" concern. etcd is designed to tolerate cloud network partitions and recover.

Comparing this to sudden hardware loss on a single-VM app is, quite frankly, insane.
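
(The partition-tolerance part is just Raft quorum math, if I have it right: a strict majority of members has to stay connected, so the cluster rides out minority partitions and re-elects a leader once they heal.)

    quorum = floor(members / 2) + 1
    3 members -> quorum 2 -> tolerates 1 partitioned/lost member
    5 members -> quorum 3 -> tolerates 2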


If you start using more esoteric features, the reliability of k8s goes down. Guess what happens when you enable the in-place vertical pod scaling feature gate?

It restarts every single container in the cluster at the same time: https://github.com/kubernetes/kubernetes/issues/122028
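
(For context, this is roughly how that gate gets turned on; the gate name is from the upstream feature-gates list, but exactly where the flag lands depends on how your control plane and kubelets are deployed.)

    # alpha feature gate, enabled on the API server and on the kubelets
    kube-apiserver --feature-gates=InPlacePodVerticalScaling=true ...
    kubelet --feature-gates=InPlacePodVerticalScaling=true ...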

We have also found data races in the statefulset controller that only occur when you have thousands of statefulsets.

Overall, if you stay on the beaten path k8s reliability is good.


Even if your entire control plane disappears, your nodes will keep running, likely for long enough to build an entirely new cluster to flip over to.

I don’t get it either. It’s not hard at all.
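
(Easy to convince yourself on a test node: with the API server unreachable, and assuming a CRI runtime with crictl installed, something like this shows the split.)

    crictl ps          # workload containers are still running under the kubelet
    kubectl get pods   # fails - there's no control plane left to answer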


Your nodes & containers keep running, but is your networking up when your control plane is down?
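
(That part depends heavily on the CNI and the proxy mode. One way to poke at the dataplane on a node, assuming kube-proxy in iptables mode: rules that were already programmed stay put, but updates stop flowing while the API is away, and pod-to-pod behaviour is up to the CNI plugin.)

    iptables-save | grep KUBE-SERVICES   # previously programmed Service rules are still present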



