So if I'm understanding correctly, all the classic instances were migrated to more modern types with no intervention from the account holder?

Did they suffer a reboot during that migration, or was it done via some live-migration process? (It's hard to live-migrate off a virtualization platform that was never designed with that in mind!)

What about the original network setup? Is that still emulated, or might some customer applications have broken?




No, migrating did involve intervention from the account holder. More information here: https://aws.amazon.com/blogs/aws/ec2-classic-is-retiring-her...

It seems like AWS spent time, people and money to migrate customers off EC2 classic. They made a fairly good effort to automate the process and make it less painful for customers. For example: https://repost.aws/knowledge-center/ssm-migrate-ec2classic-v...

The original network moved from an everyone-on-the-same-subnet model to a you-get-your-own-subnet model, so yes, customer applications could break in the process. People do all sorts of non-smart things for good reasons, like hardcoding an IP address in /etc/hosts when a nameserver is down. And then they forget to change it back. Doing these sorts of migrations well requires a carrot-and-stick approach. The stick: we want to shut down this service and will eventually refuse you service. The carrot: automation, reminders that people need maintenance windows for their applications, clear directions, and above all, willingness to deal with people and actually talk to them.
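To make that /etc/hosts failure mode concrete, here's a minimal sketch (hypothetical, not anything AWS shipped) of a pre-migration check that flags hardcoded private IPs that would silently break once an instance's private address changes:

    import ipaddress

    # Flag hardcoded private IPs in /etc/hosts that would silently break
    # when an instance's private address changes during a migration.
    def hardcoded_private_entries(path="/etc/hosts"):
        flagged = []
        with open(path) as f:
            for line in f:
                line = line.split("#", 1)[0].strip()  # drop comments
                if not line:
                    continue
                addr, *names = line.split()
                try:
                    ip = ipaddress.ip_address(addr)
                except ValueError:
                    continue  # not an address line
                if ip.is_private and not ip.is_loopback:
                    flagged.append((addr, names))
        return flagged

    for addr, names in hardcoded_private_entries():
        print(f"review before migration: {addr} -> {' '.join(names)}")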


Looking at that blog post, I think AWS could have done the migration for most users with no involvement from the users themselves.

In an ideal world, they would have written software to live-migrate VMs to the new platform and emulate the old networking.

Emulating old stuff should be pretty easy, because hardware moves on, and an instance back in 2006 probably had far lower performance expectations - so even a fairly poor-performing emulation would be sufficient to meet user needs.


"emulate the old networking" is something that can't be done per customer, and the new platform makes networking per customer.

Let's say I have my aws account "account1", and my friend has their account "account2", both running classic. We could have both talked to each other's instances by their _private IPs_ even though they're in different accounts. AWS has no way of knowing those two instances are related, other than that they're both in classic.

Sure, AWS could make a global cross-account emulated flat network, but at that point, it's probably cheaper to just use the real thing, which was already built and functions... and at that point, you're not migrating them to "the new platform", but rather to "ec2 classic 2"
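For contrast, the closest thing the new platform offers to that cross-account reachability is explicit VPC peering, which both sides have to opt into. A rough boto3 sketch, with all IDs made up:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # account1 requests a peering connection to account2's VPC;
    # nothing is flat by default, unlike classic.
    resp = ec2.create_vpc_peering_connection(
        VpcId="vpc-11111111",        # account1's VPC (hypothetical)
        PeerVpcId="vpc-22222222",    # account2's VPC (hypothetical)
        PeerOwnerId="222222222222",  # account2's account ID (hypothetical)
    )
    pcx_id = resp["VpcPeeringConnection"]["VpcPeeringConnectionId"]

    # account2 then has to accept, and both sides still need route table
    # entries pointing at the peering connection before traffic flows.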


If there is a small number of classic users, a single special case in the code to have all classic users connected to a single network under an admin account seems very doable...

I wonder if perhaps part of the reason for not doing this was they were worried about malware spreading across that shared internal network from one VM without security patches to the next VM without security patches.

Even if that were the case, they could monitor all VMs on the classic network, and any VM which doesn't contact another user's VMs for ~1 month would have that ability blocked.


I wonder why they didn't do that in the 14 years since VPCs were introduced?


They had to be restarted, but not only that, they had to have their networks reconfigured.

But they gave people YEARS to do that, and tracked down every user to help them if necessary.


> had to have their networks reconfigured.

I don't see why every user couldn't have had a virtual network auto-created with the same 10.x.x.x IP addresses their original machines had - then there would be no need for any reconfiguration on the user's side.
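Something like this boto3 sketch is what I have in mind (IDs and addresses are hypothetical; note a VPC CIDR tops out at /16, so one VPC can't span classic's whole /8):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Recreate a network covering part of the old 10.x.x.x space...
    vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
    subnet = ec2.create_subnet(
        VpcId=vpc["Vpc"]["VpcId"], CidrBlock="10.0.42.0/24"
    )

    # ...and relaunch the instance with its original private IP pinned.
    ec2.run_instances(
        ImageId="ami-00000000",         # image of the old instance (hypothetical)
        InstanceType="t3.micro",
        MinCount=1, MaxCount=1,
        SubnetId=subnet["Subnet"]["SubnetId"],
        PrivateIpAddress="10.0.42.17",  # the address the machine always had
    )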


Because there's more than just the local IP address to worry about.

Remember, all of EC2 Classic was in a single /8 of private IPs. You could communicate with EC2 instances in another account via their private IP address.

If you have two instances in different accounts that need to communicate, upgrading from EC2 Classic to VPC couldn't be done automatically.


Because people could have that same network on-premises on the other side of the VPN (I have).


But that isn't a new problem if the same user already uses that address - you're just leaving them with the same issue they already had.


It's not clear, but my interpretation is that they contacted every account holder, somehow convinced them to migrate (perhaps with discounts and/or threats of termination) and then shut down once everyone migrated.

Would be very interesting to learn how that was possible; it seems surprising to me that there wasn't even one instance whose owner had forgotten about it or was just unwilling to do any work on it.

It's possible that credit card expiration was the key, as that may have automatically disabled almost all forgotten accounts.


They don't need to threaten. Their SLAs don't offer to run VMs indefinitely. AWS will send you an email about shutting down your VM if, e.g., they need to rotate the disk used for storing the VM image. It's somewhere in the contract, and it's a usual process for anyone who keeps long-running VMs in EC2.


Do they really not do live migration or at least auto-restart (if configured) in those cases?


They have had live migration for many years now, but they didn't back in the early era. I'm not sure they set it up for the classic environment, but I think they must have: in the early 2010s you could get notices that your VM's host had failed or was about to fail and you needed to launch a new one, but I think that stopped by around 2015. I had servers running for a deprecated project which were finally shut down this decade, and it seems like rather good luck not to have had any failures in 7+ years on old hardware.


I received the last such email about a year ago, for a VM in either the free tier or the cheapest one available. This probably has to do with the VM flavor you choose, as well as with the time you created it.


Definitely possible - I only had about ten old servers but that’s a nice stretch without a hardware failure.


It's your responsibility to make your systems resistant to failure.


I don't know the details of this particular migration, but I used to have a VM in some low-price tier that had been running for a long time (a few years), and eventually AWS sent me an email telling me they were going to shut it down for maintenance reasons.

Guess this was something similar. VMs, if not specifically configured to be able to move, cannot really be moved automatically. Think about e.g. randomness of ordering on the PCIe bus (i.e. after moving, the devices may not come up in the same order as before), or various machine IDs, like the MAC address -- if you don't make sure the VM isn't affected by these changes, it likely will be, if moved.


> Think about e.g. randomness of ordering on the PCIe bus (i.e. after moving, the devices may not come up in the same order as before), or various machine IDs, like the MAC address -- if you don't make sure the VM isn't affected by these changes, it likely will be, if moved.

QEMU/KVM/libvirt/... present consistent hardware to the VM across migrations - the exceptions are the CPU model, which can't be changed on the fly without at least rebooting the VM in question, and hardware in passthrough mode, like GPUs.

All the VM sees from a live migration is a few seconds of "lost" time, as if someone had stopped the CPU clock.
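For reference, a live migration with the libvirt Python bindings looks roughly like this (hostnames and domain name are made up); the domain XML travels with the guest, which is why it sees identical hardware on the other side:

    import libvirt

    src = libvirt.open("qemu:///system")
    dst = libvirt.open("qemu+ssh://dest-host/system")  # hypothetical target

    dom = src.lookupByName("guest1")  # hypothetical domain name
    flags = libvirt.VIR_MIGRATE_LIVE | libvirt.VIR_MIGRATE_PERSIST_DEST

    # Memory pages are copied while the guest keeps running; it's paused
    # only for the final dirty-page sync -- the few seconds of "lost" time.
    dom.migrate(dst, flags, None, None, 0)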


The clouds have figured this out: https://dl.acm.org/doi/pdf/10.1145/3186411.3186415


That's a very bold claim :)

If you prepare for migration, then it will work. If you don't -- it might or might not work, and it depends on way too many things to be confident that it will.

For example, in our VM deployment process we heavily rely on PXE boot and our code that runs during initramfs and also after pivot. So, even if whatever you have in the hypervisor and the virtualized OS has somehow managed to solve the moving problem, our own (screwy) code might not be so smart.

In general, the more of the underlying infrastructure you touch, the harder it will be to move you, unless you specifically prepare to move. E.g., what if you set up a BMC for various components on your system? What if your code relies on knowing particular details of hardware to which you were given pass-through access from your VM - e.g., you record the serial number of your disk somewhere in order to identify it on the next boot, but suddenly that disk is gone?

Even simpler: the MIG flag on NVIDIA GPUs is stored in the persistent writable memory of the GPU itself. Suppose your VM has moved to a different location and is connected to a different (but compatible) GPU -- you need to run the code that sets the flag again and then reboot the system in order to start working with the GPU, but the host may not even be aware of the fact that you need that particular setting.

The guest side of things needs to be prepared to move, to mitigate these problems.
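One concrete way to prepare on the guest side, at least for the disk case: key off filesystem UUIDs instead of hardware serial numbers, since udev keeps those stable across a move. A Linux-specific sketch:

    import os

    BY_UUID = "/dev/disk/by-uuid"  # udev-maintained, survives reattachment

    # Map filesystem UUIDs to whatever device node they currently live on,
    # instead of recording a hardware serial that may vanish after a move.
    def disks_by_uuid():
        return {
            uuid: os.path.realpath(os.path.join(BY_UUID, uuid))
            for uuid in os.listdir(BY_UUID)
        }

    print(disks_by_uuid())  # e.g. {'3f1b-...': '/dev/vda1'}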


I'm saying that at least GCP has supported VM migration for a while, and it's generally not something people worry much about these days, given that the clouds have attempted to mitigate issues like the ones you've pointed out.



