For fun, I run a bare metal k8s cluster that hosts my blog and other projects. I've spent my last three nights fighting bugs: volumes not attaching, nginx magically configuring itself incorrectly, and a whole bunch of other crap. This just started happening out of nowhere, but crap like this seems to happen at least once a month. It's to the point where I spend at least one night a week babysitting the cluster.
I don’t have to pay someone else to handle this, but if I did, I would get rid of k8s in a heartbeat. I’ve seen a devops team of only a few people manage tens of thousands of traditional servers, but I doubt such a small team could handle a k8s cluster of the same size.
I’m considering moving back to traditional architecture for my blog and other projects. K8s has been fun, but there’s too much magic everywhere.
No one has ever explained the point of it to me either.
I’ve heard it’s supposed to solve the problem of programs running differently on different machines. That’s a problem I’ve never encountered in my 12 years of experience.
But the types of issues you describe are very real and very time consuming.
> No one has ever explained the point of it to me either.
It makes sense if you use Docker. Docker containers need somewhere to live. If you want two copies of your service alive at all times, K8s is the thing that watches for crashes, restarts the containers, and so on.
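To make that concrete, here's a minimal sketch of the "two copies, restart on crash" setup. The names and image are made up for illustration:

```yaml
# Hypothetical Deployment: asks k8s to keep 2 replicas of a container running.
# If a container crashes or its node dies, the controller replaces it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service                 # made-up name
spec:
  replicas: 2                      # "two copies alive at all times"
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-service:1.0   # placeholder image
          ports:
            - containerPort: 8080
```

Kill one of those pods (or the node it's on) and the controller notices the count dropped below 2 and spins up a replacement.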
The ecosystem isn't really anything more than the sum of its features.
I already mentioned K8s as an automatic container runner/restarter. But if you run two copies of a service, you need a load balancer to route traffic to them. You can program your own (more work), or download & run someone else's (less work). Or you can see what K8s provides [0] and do even less work than that.
If your services talk to one another, they could talk by hard-coded IP (maintenance nightmare), or by hostname. If they talk by hostname, then they need DNS to resolve those host names. Again, you can roll DNS yourself, or you can see what K8s gives you [1].
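Concretely, the load balancing from the previous paragraph and the DNS name come from the same object, a Service. A sketch, reusing the hypothetical `app: my-service` label from above:

```yaml
# Hypothetical Service: load-balances across all pods labeled app=my-service
# and gets a stable cluster-DNS name, e.g. my-service.default.svc.cluster.local.
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-service        # routes to the Deployment's pods
  ports:
    - port: 80             # port other services connect to
      targetPort: 8080     # container port behind it
```

Other pods can then just hit `http://my-service` and never care which replica, or which node, actually answers.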
And on and on. Firewalls, https, permissions, password/secrets management.
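To pick one item from that list, secrets follow the same declare-and-reference pattern. A hedged sketch with made-up names and a placeholder value:

```yaml
# Hypothetical Secret (placeholder value; don't commit real credentials).
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
stringData:
  password: change-me
```

A container in the hypothetical Deployment above would then pull it in as an environment variable instead of baking it into the image:

```yaml
# Fragment of a container spec in the Deployment sketched earlier.
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: password
```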
There's one more thing to say about K8s, which is that it has become a bit of a de facto standard. So you don't need to relearn a completely new way of doing this stuff if you decide to switch jobs / cloud providers.
K8s gives you a lot for free, until it doesn’t. I’m not saying the old way is better, but where it is better is this: it’s easier to fix when shit hits the fan. A bad day on k8s will take you completely offline, while a bad day on a single server may or may not take you completely offline (depending on your backup situation and how good your devops is).
You’re not making an apples-to-apples comparison. You can run k8s on a single server or run 1,000 bare metal servers. The number of servers and how you deploy to them are orthogonal concerns, not mutually exclusive choices.
You also seem to be implying that by running a single bare metal server you have eliminated any chance of downtime, which isn’t true.
For example, if your process crashes on bare metal, you go down unless you have some kind of supervisor that watches and restarts the process. If you’re not using Kubernetes as that supervisor, you need to set one up with some other tool (systemd, supervisord, whatever). At the end of the day you can’t eliminate all tooling or all downtime.
I was just saying that no matter what, all your eggs are in one basket. K8s is a program that can fail like any other program. If it does fail (like etcd getting corrupted, or the process itself crashing for some reason) you can end up with a collection of servers that can’t do anything (I’m actually in this position right now). It’s exceedingly rare that this can happen, but it’s also exceedingly rare with regular servers. The difference is cost, right?
If a single server fails, you may be offline but there are well-trodden paths to come back online. Your material cost is the cost of that single server. If k8s goes down, oh boy. Not only is it very complex, requiring knowledge of how it works to diagnose and recover from, but there can be zero documentation on how to recover. You are now also paying for a cloud of bricks.
A random example from $dayjob: vendors like ESRI ship products that are actually a dozen components spread across five sets of servers, with certificates and load balancers everywhere. My customer has 7 sets of them due to acquisitions, each with dev, test, and prod instances. That’s 21 sets of a dozen servers or so. Just keeping up with OS updates and app patching is nearly a full-time job!
Or just apply their official helm chart… and you’re pretty much done. You’ll also get better efficiency because the various environments will all get bin-packed together.
Is it perfect? No, but it’s better than doing it yourself!
Consider the alternative in conditions where you need various forms of scalability in a cloud agnostic way. Especially when you have a complicated system of many services.
I think some use cases might be: running/testing software on a variety of hardware configurations, and sharing a limited pool of machines among people/projects.
K8s was never meant to be used for running a blog :) It was built to support Google-scale deployment, with probably dozens of engineers just supporting the live clusters as they stumble into various bizarre states.
> I don’t have to pay someone else to handle this, but if I did, I would get rid of k8s in a heartbeat. I’ve seen a devops team of only a few people manage tens of thousands of traditional servers, but I doubt such a small team could handle a k8s cluster of the same size.
This has been my experience with a lot of the "we need to be cloud native! containers!" mantra in the enterprise. Some exec gets it in their head that this is a must-do (and probably gets non-trivial "referral agent fees" for pushing it), and all of the young, hip developer types are happy to cheerlead it.
Two years later OpEx is exploding, most of the processes haven't yet been converted to be in the cloud, and the environment isn't noticeably better or different. It sucks, just sucks in a new and more expensive way that gives you less control of your data.
Seen this at 3 x F500 orgs and with multiple cloud providers, including the big 3 + one of the well known second tiers.
How do you solve logging and storage? Those were two issues that caused me to leave it behind. With k8s, there is Longhorn for storage, so I can move databases around and have volumes replicated to deal with disk failures. Is there anything like that for Nomad?
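For anyone unfamiliar, the Longhorn setup described above looks roughly like this on the k8s side (a sketch: the names are made up, and the exact provisioner string varies by Longhorn version):

```yaml
# Hypothetical StorageClass: Longhorn keeps 3 replicas of each volume,
# so a single disk or node failure doesn't lose the data.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated
provisioner: driver.longhorn.io      # rancher.io/longhorn on older installs
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
---
# A database pod would then claim replicated storage through it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data                # made-up name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-replicated
  resources:
    requests:
      storage: 10Gi
```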
Nowadays when I try to scaffold quick ideas, I just spin up a Cloudflare Worker. You get a URL, cron, a key-value store, and an Express-like JS server going with a click of a button. I don't even have to run npm install.
I can definitely see the appeal of tinkering with 'advanced tech' as a personal hobby, though. Because now I am pretty sure you know more about k8s than me :)