I run multi-site Ceph+Nomad clusters with NixOS on Hetzner for our startup and maintaining those takes less than 5% of my time.
By using great tools and understanding them well you can do it with little manpower. I learned all those tools in around 3 months total -- so around as much as getting a basic understanding of AWS IAM ;-)
The only thing you don't get with that from your list is auto-scaling. But with Hetzner the price difference vs AWS is 10x for storage, 20x for compute, and 10000x for traffic, so we just over-provision a little. And my 5% time /includes/ manual upscaling.
Yes, I am on call 24/7 to manage that infra, but I would be just the same when using hosted cloud services. Yes, fixing a Ceph issue, or handling HashiCorp Consul not handling an out-of-disk situation correctly, is more complicated than waiting for S3 to come back from its outage, but the savings are massive. Testing whether your backup restore works is something you need to do equally with hosted services.
So it is definitely possible to self-manage everything, for 5% of one engineer.
> By using great tools and understanding them well you can do it with little manpower.
“and understanding them well” is doing a lot of legwork there. From a standing start, how does a startup that has the skills & experience to build the product, but not necessarily to manage the infrastructure, get to the point of understanding the tools well, or even know which tools are worth learning to that level?
> So it is definitely possible to self-manage everything, for 5% of one engineer.
I can accept that as true, if you have the right person/people, and they are willing (particularly the on-call part).
I'm in a similar situation; what resources did you find helpful for learning NixOS? Tho I could skip that for now and stick with containers, in which case I just need Nomad... but I'm not certain about picking it over K8s in any case. I just know I'm gonna have to deal with this soon, and you seem to have it figured out well enough!
I found NixOps when searching for an alternative to Ansible that is actually declarative and not just a "bash in yaml" runner. Our Ansible deployments took > 10 minutes and were not "congruent" (well explained in [1]): removing the Ansible line that installed nginx did not uninstall nginx, so the state on all servers diverged over time and we had no clue what was running where. Docker was also very slow, because changing something early in a Dockerfile leads to lots of re-building -- again, it's just bash scripts with snapshotting.
I thought "surely somebody must have invented a better system for this" and NixOps was exactly that. Deploying config changes always took a few seconds with that, instead of 10 minutes.
> what resources did you find helpful for learning NixOS?
This was back in 2017, so documentation was worse than it is today.
On a flight I read the Nix, NixOS, and Nixpkgs manuals top to bottom. I also read some of the nix-pills, but didn't like that they went so deep into the weeds of packaging when my primary interest at the time was OS configuration management. In retrospect, I should have read those front to back as well, to save some time later when packaging our own software and some specific dependencies became more important for us. I also read various blog posts and examples, and asked some questions in the IRC channel (now Matrix), where there were some people who simply knew every detail and were willing to spend hours sharing their knowledge (thanks cleverca22!).
I also read key NixOS logic source code, such as the `switch-to-configuration` script that switches between 2 declarative configs (like many, I do not like that this is written in Perl, and I'm sure it will eventually be switched).
A thing I did wrong was to learn too late how to write my own NixOS modules; I wrote our own systems as "plain nix functions" but they would have been better as NixOS modules, because those allow overriding parts of the config from outside, and make code more composable (see also https://news.ycombinator.com/item?id=41355203).
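A minimal sketch of what such a NixOS module looks like (the service name and options here are hypothetical, and the ExecStart uses a stand-in package): instead of returning a finished config like a plain function, the module declares options, so any machine's configuration can override parts of it, e.g. `services.myqueue.port = 9000;`.

    { config, lib, pkgs, ... }:

    with lib;

    {
      options.services.myqueue = {
        enable = mkEnableOption "myqueue, a hypothetical internal service";
        port = mkOption {
          type = types.port;
          default = 8080;
          description = "TCP port the service listens on.";
        };
      };

      config = mkIf config.services.myqueue.enable {
        systemd.services.myqueue = {
          wantedBy = [ "multi-user.target" ];
          # stand-in binary; a real module would point at your own package
          serviceConfig.ExecStart =
            "${pkgs.hello}/bin/hello --greeting 'port ${toString config.services.myqueue.port}'";
        };
      };
    }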
I spent 2 months prototyping all our infra in NixOps and learned by doing.
I also learned specifically where the gaps are: NixOS generally handles what's running on a single machine (with systemd units), and with e.g. NixOps you can access the global config of other machines (to render e.g. a WireGuard config file where you need to put in all the machines to connect to, so {all machines' IPs} \ {own IP}). It does not handle active cross-machine coordination, e.g. if some GlusterFS or Ceph tutorial says "first run this command on this machine, then afterwards that command on that other machine", or "run this command on any machine, but only run it once". So I learned Consul as a distributed lock service to coordinate (mutex) commands across machines. Luckily, the amount of software that needs "installation by a human operator running commands" is continuously going down; declarative config is becoming more of the norm.
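A sketch of that "all machines' IPs minus my own" pattern under NixOps (the WireGuard wiring is illustrative; keys are elided): every machine definition receives `nodes`, the evaluated configs of all machines in the deployment, so each host can render its peer list purely from declared data.

    { config, nodes, lib, ... }:

    let
      myIP    = config.networking.privateIPv4;   # option provided by NixOps
      peerIPs = lib.filter (ip: ip != myIP)
                  (lib.mapAttrsToList (_: node: node.config.networking.privateIPv4) nodes);
    in
    {
      networking.wireguard.interfaces.wg0.peers =
        map (ip: {
          allowedIPs = [ "${ip}/32" ];
          endpoint   = "${ip}:51820";
          publicKey  = "...";          # elided; managed out of band
        }) peerIPs;
    }

For the "run this once, anywhere" case, the cross-machine mutex maps onto `consul lock <key-prefix> <command>`, which runs the command only while holding the lock.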
With NixOS, a good thing is that while it is reasonably complex, it is simple enough that you can understand it fully -- that is, for any given behaviour you _know_ where in the nixpkgs code it is implemented. I recommend taking that approach (spend a few months to understand it fully), because it makes you massively more productive.
I also believe that this is a big benefit of NixOS vs e.g. containers on Kubernetes: Kubernetes is big and complicated, with likely more lines of code than anybody could read, and the mechanisms are more involved (for example, you need to know a lot of iptables to understand how a request is eventually routed to your application code). NixOS is simpler (packaging software and rendering systemd units); it is built on a more radically different foundation, but in turn advanced features on top of it are straightforward (multiple versions of libraries on the same machine, knowing for every binary exactly which source code built it, running _only_ what's declared, automatic transparent build caching, spawning VMs that mimic your physical servers). NixOS provides less than cluster orchestrators like Nomad and Kubernetes (e.g. no multi-machine rolling deploys with automatic rollbacks), but one person can keep it all in their head, and it is very good at building the things that run in cluster orchestrators. (Disclosure: I know much more about NixOS than Kubernetes; maybe Kubernetes experts disagree with me and think that a single person can understand the Kubernetes source entirely to get the fast, directed debugging I claim is possible with NixOS.)
Often, you also don't need a cluster orchestrator. Our Ceph runs straight on NixOS on Hetzner dedicated machines; it does not run in our Nomad. We use Nomad to schedule our application-specific jobs onto our machines -- that is, we use the cluster orchestrator for its original design goal (bin-packing CPU + memory jobs across machines), and do not use it as a "code packaging and deployment tool", which is what much of current Docker+Kubernetes usage amounts to. We find that Nix is simpler and better for the latter.
Starting from NixOps, we Nixified all of our tooling (e.g. we build our Haskell / C++ / Python / TypeScript with Nix), fixed things in nixpkgs in our submodule, and made lots of upstream PRs for it (I'm currently at ~300 nixpkgs commits). NixOS works extra well if you upstream the stuff your company needs, because it reduces your maintenance burden and makes other industrial users' lives easier too. Especially recommended is to upstream NixOS VM tests for services you rely on; for example, I contributed the Consul multi-machine VM test [2], which automatically runs for any version upgrade of Consul in nixpkgs, so nobody will break our infra that way.
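For reference, a multi-machine NixOS VM test is only a few lines; this sketch tests nginx instead of Consul to stay short (the real multi-node Consul test lives under nixos/tests/ in nixpkgs). Each node boots as a QEMU VM on a shared test network, and the Python test script drives the nodes by name.

    { pkgs ? import <nixpkgs> {} }:

    pkgs.nixosTest {
      name = "nginx-reachable-from-client";

      nodes = {
        server = { pkgs, ... }: {
          services.nginx.enable = true;
          services.nginx.virtualHosts."server".root =
            pkgs.writeTextDir "index.html" "hello from the test VM";
          networking.firewall.allowedTCPPorts = [ 80 ];
        };
        client = { pkgs, ... }: {
          environment.systemPackages = [ pkgs.curl ];
        };
      };

      testScript = ''
        start_all()
        server.wait_for_unit("nginx.service")
        server.wait_for_open_port(80)
        client.wait_for_unit("multi-user.target")
        client.succeed("curl --fail http://server/")
      '';
    }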