Oh wow! Unexpected and cool to see this post on Hacker News! Since then we have evolved our VM infra a bit, and I've written two more posts about this.
First, we started cloning VMs using userfaultfd, which allows us to bypass the disk and let children read memory directly from parent VMs [1].
And we also moved to saving memory snapshots compressed. To keep VM boots fast, we need to decompress on the fly as VMs read from the snapshot, so we chunk up snapshots in 4kb-8kb pieces that are zstd compressed [2].
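A rough sketch of the idea (illustrative only, not the actual CodeSandbox code): split the snapshot into fixed-size chunks, zstd-compress each one, and only decompress a chunk the first time that part of memory is read — which is the on-demand access pattern that userfaultfd makes possible. Uses the third-party zstandard package; the chunk size and caching policy below are assumptions.

    # pip install zstandard
    import zstandard as zstd

    CHUNK_SIZE = 4096  # 4 KiB chunks; the post mentions 4-8 KiB pieces

    def compress_snapshot(snapshot: bytes) -> list[bytes]:
        cctx = zstd.ZstdCompressor(level=3)
        return [
            cctx.compress(snapshot[off:off + CHUNK_SIZE])
            for off in range(0, len(snapshot), CHUNK_SIZE)
        ]

    class LazySnapshot:
        """Serve reads from a compressed snapshot, decompressing chunks on demand."""
        def __init__(self, chunks: list[bytes]):
            self.chunks = chunks
            self.cache: dict[int, bytes] = {}   # chunk index -> decompressed bytes
            self.dctx = zstd.ZstdDecompressor()

        def read(self, offset: int, length: int) -> bytes:
            out = bytearray()
            while length > 0:
                idx, within = divmod(offset, CHUNK_SIZE)
                if idx not in self.cache:        # decompress on first touch only
                    self.cache[idx] = self.dctx.decompress(self.chunks[idx])
                piece = self.cache[idx][within:within + length]
                out += piece
                offset += len(piece)
                length -= len(piece)
            return bytes(out)

    snap = bytes(range(256)) * 64                # a fake 16 KiB "memory snapshot"
    lazy = LazySnapshot(compress_snapshot(snap))
    assert lazy.read(5000, 100) == snap[5000:5100]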
The VM vs container debate is fascinating. They are separate yet slowly merging concepts that become more blurred as technology gets cheaper and faster. If the real bottleneck to scale is adaptable code, then it is foolish to dismiss the VM as outdated tech when it can be completely rehomed in 2 seconds. That megabyte of Python code managing your containers would still be busy checking its dependencies in that same timeframe.
In a way it's nothing new. MOSIX/openMosix (https://en.wikipedia.org/wiki/MOSIX, https://en.wikipedia.org/wiki/OpenMosix) did similar stuff with individual processes. It would probably be even faster, as you would only have to move the process's memory and state rather than the whole VM's memory (and its state).
I guess it would/could be nice to have something that moves Kubernetes pods around rather than killing them and starting new ones.
Unmentioned: there are serious security issues with memory cloning code not designed for it.
For example, an SSL library might have pre-calculated the random nonce for the next incoming SSL connection.
If you clone the VM containing a process using that library, both child VMs will now use the same nonce. Some crypto is 100% broken open if a nonce is reused.
It's important to refresh entropy immediately after clone. Still, there can be code that didn't assume it could be cloned (even though there's always been `fork`, of course). Because of this, we don't live clone across workspaces for unlisted/private sandboxes and limit the use case to dev envs where no secrets are stored.
VMware even has a vSphere Fault Tolerance product that creates a "live shadow instance" of a VM that mirrors the primary virtual machine (with up to 4 vCPUs). So you can do a quick failover in "immediate planned" cases, but apparently even when the primary DB goes down. I guess this might work when some external system (like a storage array) goes down in the primary: you just switch to the other VM (with the latest memory/CPU state), replay that I/O there, and keep going... But if there's a hard crash of the primary, and it actually does work, then they must be doing lots of reasoning about internal state-change ordering and external device side-effects (somewhat like Antithesis, but for a different purpose). Back in the day, they supported only uniprocessor VMs (with something called vLockstep) and later up to 4 vCPUs with something called Fast Checkpointing.
I've always wanted to test this out for fun, but 15 years have gone by and I've never gotten around to it...
Live migration had some very cool demos. They would run an intensive workload, such as a game, cause a crash, and the VM would resume with zero buffering.
Isn't this still a concern even if you're not pre-calculating way ahead of time? If you generate it when needed, it could still catch you at the wrong time (e.g. right before encryption, but right after nonce generation)
Unless your encryption and transport protocols are 100% stateless, only one connection will actually be able to form, even if you duplicate the machine during connection creation.
The problem with pre-computing a bunch and keeping them in memory is that brand-new connections made post-cloning would use the same list of nonces.
Sounds like it would simply be inappropriate to clone & use a VM that assumes its data is unique. This would also be true of other conditions, e.g. if you needed to spoof a MAC or IPv6 address & picked one randomly.
The problem is modern software is so fiendishly complicated there almost certainly is stuff like that in the code. The question is where, and does it matter?
I don't really follow, what's the issue with that? The two nodes will encrypt using the same key, so they can snoop on each other's outgoing traffic? Doesn't sound that big of a deal per se.
A nonce is not a key, it's a random value that is meant to be used at most once.
If an attacker sees valid nonces on a VM, and knows of another VM sharing the same nonces, then your crypto on *both* VMs becomes vulnerable to replay attacks.
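To make the nonce-reuse danger concrete, here is a toy sketch (using the third-party cryptography package; the key/nonce handling is purely illustrative): if two clones encrypt different messages under the same AES-GCM key and the same pre-generated nonce, an eavesdropper who knows one plaintext can recover the other by XORing the ciphertexts, because the keystream repeats.

    # pip install cryptography
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)                  # imagine this was computed pre-clone

    msg_a = b"secret from clone A!"
    msg_b = b"secret from clone B!"
    ct_a = AESGCM(key).encrypt(nonce, msg_a, None)
    ct_b = AESGCM(key).encrypt(nonce, msg_b, None)

    # GCM is a stream cipher under the hood: same key + same nonce = same keystream.
    # XOR of the two ciphertexts equals XOR of the two plaintexts, so anyone who
    # knows (or can guess) one message recovers the other without the key.
    stream_xor = bytes(a ^ b for a, b in zip(ct_a, ct_b))   # tags differ; only the prefix matters
    recovered_b = bytes(x ^ m for x, m in zip(stream_xor, msg_a))
    assert recovered_b == msg_b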
I'm starting to see a pattern here. This describes a technology that rapidly deploys "VM" instances in the cloud which support things like Lambda and single-process containers. At what point do we scale this all back to a more rudimentary OS that provides security and process management across multiple physical machines? Or is there already a Linux distro that does this?
I ask because watching cloud providers like AWS slowly reinvent mainframes just seems like the painful way around.
We were working on this at CoreOS before Kubernetes came about – called fleet https://github.com/coreos/fleet. Distributed systemd units run across a cluster, typically running containers or golang binaries with a super minimal OS underneath. I always thought it was cool but it definitely had its challenges and Kubernetes is better in most ways, IMO.
> I ask because watching cloud providers like AWS slowly reinvent mainframes just seems like the painful way around.
When AWS was the hot new thing in town, a server was coming in at 12/24 threads.
A modern AMD machine tops out at 700+ threads and 400Gb QSFP interconnects. Go back to 2000 and the dotcom boom and that's a whole mid-sized company, in a 2U rack.
Finding single applications that can leverage all that horsepower is going to be a challenge... and that's before you layer in lift for redundancy.
Strip away all the bloat, all the fine examples of Conway's law that organizations drag around (or inherit from other orgs), and compute is at a place where it's effectively free... with the real limits/costs being power and data (and these are driven by density).
There was a multi-machine single-Linux-kernel-instance distro many years ago called Kerrighed. The company behind it died unfortunately so it hasn't kept up with Linux kernel patch rebasing. It offered a "view of a unique SMP machine on top of a cluster of standard PCs".
They’re missing multi-machine orchestration: Run thousands of jails on these dozen machines. Don’t bother me with the details at runtime.
They are also missing an ergonomic tool like Dockerfiles. A single declarative file, plus a CLI tool for "run N copies on my M machines", should be enough to run BSD in prod, and it is not.
If there's any difference now versus the past, it's that pretty much every point on the wheel is readily available right now. If you want a more "rudimentary OS" you don't need to wait for the next turning of the wheel, it's here now. Need full VMs? Still a practical technology. Containers enough? Actively in development and use. Mix & match? Any sensible combination, you can do it now. And so on.
We are shown a person quitting the server, and then the server stops and restarts (that 2-second clone of the VM).
But what if I have a service like, say, normal Minecraft servers such as Hypixel? They can't accept a 2-second delay. Maybe we would have to use proxies in that case.
I am genuinely interested by this tech.
Currently, I am much in favour of tinykvm and its snapshotting because it's even lighter than Firecracker (I think). I really like the dev behind tinykvm as well.
> How to handle network and IP duplicates on cloned VMs
That is indeed what I would love to read the most! Because no matter what you do, it gets complex. If you tear down the network stack of the "old" VM, applications (like Minecraft) might head into unstable territory when the listener socket disappears, and the "new" VM has to go through the entire DHCP flow, which may easily take a second or more. And if you just do the equivalent of S3 sleep (suspend to RAM), the first "new" VM will have everything working as expected, but any further VM spawned from the template will run into duplicate IP/MAC address usage.
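One hedged sketch of the clean-up a clone could run (the interface name, the use of iproute2 and dhclient, and the MAC scheme are all assumptions; the article doesn't say how they actually handle this): give the clone a fresh locally administered MAC and re-run DHCP, accepting the roughly one-second cost.

    # Sketch: re-identify a cloned VM's NIC so it doesn't collide with its parent.
    # Assumes an eth0 interface, `ip` from iproute2, and dhclient installed; run as root.
    import os
    import subprocess

    def random_laa_mac() -> str:
        """Random locally administered, unicast MAC (x2:xx:xx:xx:xx:xx style)."""
        octets = bytearray(os.urandom(6))
        octets[0] = (octets[0] | 0x02) & 0xFE   # set the local bit, clear the multicast bit
        return ":".join(f"{o:02x}" for o in octets)

    def refresh_network_identity(iface: str = "eth0") -> None:
        mac = random_laa_mac()
        subprocess.run(["ip", "link", "set", "dev", iface, "down"], check=True)
        subprocess.run(["ip", "link", "set", "dev", iface, "address", mac], check=True)
        subprocess.run(["ip", "link", "set", "dev", iface, "up"], check=True)
        # Fresh DHCP lease for the new MAC; this is the part that costs ~a second.
        subprocess.run(["dhclient", "-v", iface], check=True)

    if __name__ == "__main__":
        refresh_network_identity()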
Interesting read—thanks! One question: in the CoW example, if VM A modifies the data post-fork, what does VM B see when it later copies that data? Does it get the original data from the time of the fork, or VM A’s modified version?
They tried running minecraft, but I wonder if a similar (or better) cloning is possible for a mission critical workload - like a database consuming a huge amount of memory. Neon uses QEMU to achieve this for example: https://neon.tech/docs/reference/glossary#live-migration but is that the only way?
A search for 'linux process live migration' picks up at least one repo that claims to have done it, and a bunch of other interesting things.
For a very simple program, with limited I/O, it's not too hard; especially if you don't mind a significant pause to move. Difficulty comes when you have FDs to migrate and if you need to reduce the pausing. If you need to keep FDs to the filesystem or the program will load/store to the filesystem periodically, you'd need to do a filesystem migration too... If you need to keep FDs for network sockets, you've got to transfer those somehow.
If it's just stdin/out/err, you could probably do the migration in userspace with some difficulty if you need to keep pid constant (but maybe you don't need that either).
Minimal pausing involves letting the program run on the initial machine while you copy memory, setting pages to read-only so you can catch writes, and only pausing the program once the copy is substantially finished. Then you pause execution on the initial machine. If there's a significant number of modified pages to copy over when you pause, you can still start execution on the new machine, as long as the modified pages are marked unavailable; if you manage to background-copy them before they're used, great... if not, you have to block until the modified data comes through.
Probably you do this on two nearby machines with fast networking, and the program doesn't have a lot of writes all over memory, so the pause should be short.
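A toy, pure-Python simulation of that pre-copy loop is below (no real VM and no real write-protection; a real implementation would mark pages read-only and service the faults, and every name and number here is made up):

    import random

    PAGES = 1024
    source = {i: f"page-{i}" for i in range(PAGES)}   # memory of the running machine
    dest = {}                                          # memory on the target machine

    def copy_round(pages_to_copy):
        """Copy the given pages while the workload keeps (slowly) dirtying memory."""
        newly_dirty = set()
        for n, i in enumerate(pages_to_copy):
            dest[i] = source[i]
            if n % 50 == 0:                            # workload dirties ~1 page per 50 copied
                j = random.randrange(PAGES)
                source[j] = f"page-{j}-v{n}"
                newly_dirty.add(j)
        return newly_dirty

    dirty = copy_round(range(PAGES))       # round 0: full copy, program keeps running
    while len(dirty) > 16:                 # keep re-copying the delta until it's small
        dirty = copy_round(sorted(dirty))

    # Now "pause" the program, copy the final handful of pages, resume on the target.
    for i in dirty:
        dest[i] = source[i]

    assert dest == source
    print(f"paused only for {len(dirty)} of {PAGES} pages")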
If you're talking about CRIU, then it's not just a claim; it actually works well in production. I know Google was using it in prod on their internal systems, and probably many others do too. It can even migrate TCP connections for you via the socket repair API in Linux.
Interesting thought, but highly dependent on the actual program. Let's assume it doesn't touch any files on disk (no opening sockets either). You would need to at least:
1. Halt the process (SIGSTOP comes to mind)
2. Create a copy of the running program and /proc/$pid - which will also include memory and mmap details
3. Transfer everything to the other machine
4. Load memory, somehow spawn a new process with the info from /proc/$pid we saved, and mmap the loaded memory into it
5. Continue the process on the new machine (SIGCONT)
Let me admit that I do not have the slightest clue how to achieve step 4. I wonder if a systemd namespace could make things easier.
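For reference, CRIU (mentioned above) automates roughly these five steps, including the hard step 4. A minimal sketch of driving it, assuming a recent CRIU install, root privileges, and a process simple enough for --shell-job; the paths and pid are placeholders:

    import subprocess

    def checkpoint(pid: int, images_dir: str) -> None:
        # Roughly steps 1-3: freeze the task and dump memory/maps/fds to image files.
        subprocess.run(
            ["criu", "dump", "-t", str(pid), "-D", images_dir, "--shell-job"],
            check=True,
        )
        # images_dir can now be rsync'd/scp'd to the other machine.

    def restore(images_dir: str) -> None:
        # Roughly steps 4-5: recreate the process from the images and let it continue.
        subprocess.run(["criu", "restore", "-D", images_dir, "--shell-job"], check=True)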
if you put your program in a qemu/kvm VM then it just works
I was completely blown away when I first experienced it. My code running in a VM never even noticed any downtime. All the network connections are preserved and so on.
Cool article! The stack (and results) are impressive, but I also appreciate the article in itself, starting from basics and getting to the point in a clear and slowly expanding way. Easy to follow and appreciate.
On a bit of a tangent rant, this kind of writing is slowly going away, taken over by LLM slop (and I'm a huge fan of LLMs, just not of the people who write those kinds of articles). I was recently looking for real-world benchmarks for vllm/sglang deployments of DeepSeek3 on an 8x 96GB pod, to see if the model fits into that amount of RAM with kv cache and context length, what numbers people get, etc.
Of the ~20 articles that Google surfaced across various keyword attempts, none were what I was looking for. The excerpts seemed promising, some even offered tables & stuff related to ds3 and RAM usage, but all were LLM crap. All were written in that same simple style: intro, bla bla, conclusion. Some even had RAM requirements that made no sense (running a model trained in FP8 in 16-bit, something no one would do, etc.)
While I fully agree with you on the absence of good benchmarks and the growing LLM slop ...
"running a model trained in FP8 in 16bit, something noone would do, etc"
I did that because on the RTX 3090 - which can be a good bang per buck for inference - the FP8 support is nerfed at the driver level. So a kernel that upscales FP8 to FP16 inside SRAM, then does the matmul, then downscales to FP8 again can bring massive performance benefits on those consumer cards.
BTW, you can run a good DeepSeek3 quant on a single H200.
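For illustration, here's a crude sketch of the store-in-FP8, compute-in-FP16 idea at the PyTorch level (assumes a CUDA device and a PyTorch build with float8 dtypes; this is not the fused in-SRAM kernel described above, just the memory/precision trade-off it exploits):

    import torch

    a = torch.randn(512, 512, device="cuda")
    b = torch.randn(512, 512, device="cuda")

    # Store in FP8 (half the memory of FP16)...
    a8 = a.to(torch.float8_e4m3fn)
    b8 = b.to(torch.float8_e4m3fn)

    # ...but upcast to FP16 right before the matmul, since a plain fp8 matmul
    # generally isn't available; compute then runs at FP16 throughput.
    out = a8.to(torch.float16) @ b8.to(torch.float16)
    print(out.dtype, out.shape)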
Thanks! I was looking at blackwell 6000PROs, 8x 96GB for running full fp8 (as it's supported and presumably fast).
I know AWQ should run, and be pretty snappy & efficient w/ the new MLA added, but wanted to check if fp8 fits as well, because from a simple napkin math it seems pretty tight (might only work for bs1, ctx_len <8k which would probably not be suited for coding tasks).
Has anybody tried running ollama and Open WebUI in firecracker instead of full VMs? I assume this should work, but not sure about GPU (single and multi) passthrough.
VMs have a full OS that needs to be maintained (patched, upgraded when EOL, etc.).
Hypervisors traditionally cost a metric crapton of money per core. Yes, Proxmox is pretty good, but it's the exception, not the norm. They're also relatively slow in spinning up new VMs (kind of by definition, it takes a lot of time to emulate a full blown replica of hardware vs just starting a process in a cgroup/jail).
And most of all, VMs are just solving the wrong problem. You don't care about emulating hardware, you care about running some workload. Maybe it needs specific hardware or a virtual version of it, but more likely than not, it's a regular batch processor or API that can happily run in a container with almost none of the overhead of a full VM.
While you are correct in calling VM startup slow compared to container startup, reading "emulating hardware" burns my eyes.
Modern VMs don't emulate hardware. When a VM has a hard drive or a network device there's no sophisticated code to trick the VM into believing that this is real hardware. Virtio drivers are about the VM writing data to a memory area and assuming it's written to disk / sent to the network (because in the background the hypervisor reads the same memory area and does the job).
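A toy illustration of that shared-memory handoff, with no device emulation involved (this is not virtio, just the general shape: one process writes a request into a shared region, another reads the same region and performs the real I/O):

    from multiprocessing import Process, shared_memory

    def guest(shm_name: str) -> None:
        shm = shared_memory.SharedMemory(name=shm_name)
        payload = b"write sector 42"
        shm.buf[0] = len(payload)                   # one-byte "descriptor": length
        shm.buf[1:1 + len(payload)] = payload       # the request itself
        shm.close()

    def hypervisor(shm_name: str) -> None:
        shm = shared_memory.SharedMemory(name=shm_name)
        length = shm.buf[0]
        request = bytes(shm.buf[1:1 + length])
        print("hypervisor handling:", request)      # here it would hit the real disk/NIC
        shm.close()

    if __name__ == "__main__":
        region = shared_memory.SharedMemory(create=True, size=4096)
        for fn in (guest, hypervisor):              # guest writes, then hypervisor reads
            p = Process(target=fn, args=(region.name,))
            p.start()
            p.join()
        region.close()
        region.unlink()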
>Virtual machines are often seen as slow, expensive, bloated and outdated.
by whom?
I tend to loathe Firecracker posts because they're all just thinly veiled ads for Amazon services.
Firecracker is not included in the standard Linux KVM/QEMU duo and has sparse documentation. You cannot deploy a Firecracker image like a traditional VM. In fact there are no tools to assist in creating a Firecracker VM, and the filesystem for the VM must be ext4.
TL;DR: this is all fun stuff if you're 200% cloud, but most companies run a ton of on-prem VMs as well.
Happy to answer any questions here!
[1]: https://codesandbox.io/blog/cloning-microvms-using-userfault...
[2]: https://codesandbox.io/blog/how-we-scale-our-microvm-infrast...