There are a lot of reasons we arrived here after decades of struggling to keep servers in good working order in a sea of change. One is that backup and restore is inherently fragile, and there are many cases where restorability degrades over a server’s long life. Restore verification is not a regular part of hygiene because it’s intrusive, tedious, and slow. If it’s ever done, it’s usually done once. Reproducible builds allow for automated verification and testing offline.

Changes are only captured at snapshot intervals and are not coherent or atomic, so you can easily miss crucial changes while capturing destructive ones between deltas. Worse are flaws that are introduced but not observed for a long time, and by then are hopelessly intermixed with other changes. Reproducible build systems let you use a revision control system to manage change and cherry-pick changesets to resolve intermixed flaws, and even if they’re deeply intermixed you can work through them on an offline server until the build is healthy enough to rebuild your online server.

The problem with reproducible build systems isn’t that they fall short of backup and restore; they’re superior in every way. It’s that the interfaces we provide today are overly complex compared to the simple interface of “backup and restore,” which, despite what that interface promises, always works in the backup part but often fails in the restore. These ideas of hermetic server builds are relatively new and the tooling hasn’t matured.

I would actually say click ops is an ideal way to solve that issue. Click ops that serializes resiliently to a revision-controlled metadata store that drives the build solves the usability problem. Having the metadata store be text configs that can be modified directly without breaking the user interface would be necessary to deal with the tedium of complex changes in a UI, while the UI still provides a nice rendering of state for simple exploratory changes. Backup and restore would only be necessary for stateful changes, but since the stateful changes aren’t at the OS layer, you won’t end up with a bricked server.
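
To make that concrete, here is a rough sketch of what "click ops that serializes to a text metadata store" could look like. It assumes the store is plain JSON, and the function names (`apply_ui_change`, `serialize`, `load`) and the example keys are made up for illustration, not taken from any real tool. The point is only that a UI edit and a hand edit land in the same deterministic text form, so revision control sees small, clean diffs.

```python
import json


def apply_ui_change(config: dict, path: list, value) -> dict:
    """Return a new config with one UI edit applied at a nested key path."""
    # Cheap deep copy; fine because the config is plain JSON data.
    updated = json.loads(json.dumps(config))
    node = updated
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = value
    return updated


def serialize(config: dict) -> str:
    """Deterministic text form: sorted keys and fixed indentation keep diffs small."""
    return json.dumps(config, indent=2, sort_keys=True) + "\n"


def load(text: str) -> dict:
    """Read the text config back, whether the UI or a person wrote it."""
    return json.loads(text)


if __name__ == "__main__":
    # Hypothetical server description; the keys are illustrative only.
    base = {"packages": ["nginx"], "sshd": {"port": 22}}
    changed = apply_ui_change(base, ["sshd", "port"], 2222)
    print(serialize(changed))
```

Because the output is sorted and stable, committing it after every click gives you a history you can diff, revert, or cherry-pick from, and someone who prefers an editor can change the same file without the UI losing track of state.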