So I've created ~300k ec2 instances with SadServers, and my experience was that starting an ec2 VM from stopped took ~30 seconds and creating one from an AMI took ~50 seconds.
Recently I decided to actually look at boot times, since I store in the db when the servers are requested and when they become ready, and it turns out for me it's really bimodal: some take about 15-20s and many take about 80s, see the graph at https://x.com/sadservers_com/status/1782081065672118367
Pretty baffled by this (same region, pretty much same everything), any idea why? Definitely going to try this trick in the article.
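For the curious, measuring "requested → ready" boils down to something like this boto3 sketch (not my actual code; the region, AMI and instance type are placeholders, and "ready" here just means the instance_status_ok waiter passes, which is close to but not exactly what my db records):

    import time
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

    requested_at = time.monotonic()

    # Placeholder AMI / instance type; add
    # InstanceMarketOptions={"MarketType": "spot"} to request Spot instead of on-demand.
    resp = ec2.run_instances(
        ImageId="ami-xxxxxxxxxxxxxxxxx",
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # "Ready" = both EC2 status checks pass; swap in an SSH/agent check if that's
    # closer to what you actually record.
    ec2.get_waiter("instance_status_ok").wait(
        InstanceIds=[instance_id],
        WaiterConfig={"Delay": 5, "MaxAttempts": 120},
    )
    print(f"ready after {time.monotonic() - requested_at:.1f}s")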
My guess is it's probably related to AWS Spot capacity.
The second and third spikes, at 80 and 140 seconds, line up nicely with this kind of behavior.
The second spike would be optimised workloads that can respond to spot interruption in under 60 seconds.
The third spike would be Spot workloads that are being force-terminated.
The reason it falls on those boundaries is that whatever is scheduling your workload only re-checks for free capacity once a minute, so a ~20s baseline boot plus one or two 60-second waits lands you right around 80s and 140s.
I used to be able to spin up spot instances and basically never get interruptions. They'd stay on for weeks/months.
In my experience, it used to be fairly safe to use Spot instances for most workloads; you'd almost never get Spot interruptions. Now, in some regions and for some instance types it's difficult to run Spot instances at all.
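If you want a rough read on where Spot capacity is tight, I believe EC2 exposes Spot placement scores per region/instance type; something like this with boto3 (the instance types and regions here are just examples, and the 1-10 score is a likelihood estimate, not a guarantee):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # any region can make this call

    # Example instance types/regions; scores range from 1 (unlikely to get
    # Spot capacity) to 10 (very likely).
    resp = ec2.get_spot_placement_scores(
        InstanceTypes=["t3.micro", "m5.large"],
        TargetCapacity=1,
        SingleAvailabilityZone=False,
        RegionNames=["us-east-1", "eu-west-1"],
    )
    for score in resp["SpotPlacementScores"]:
        print(score["Region"], score["Score"])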
Thanks, Spot capacity being scheduled differently would explain the behavior.
Almost all my ec2 instances are spot, and actually I can compare the distribution with the on-demand ones.
My spot instances are very short lived (15-30 mins max) and AFAIK I've never seen a spot instance force-terminated (this would be hard to find I think).
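(If I ever do want to check, my understanding is that the two-minute interruption notice shows up in the instance metadata, so something like this from inside the instance would catch it; it uses the IMDSv2 token flow, and a 404 just means no interruption is scheduled:)

    import urllib.error
    import urllib.request

    # IMDSv2: grab a session token first, then query the spot metadata path.
    token_req = urllib.request.Request(
        "http://169.254.169.254/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    token = urllib.request.urlopen(token_req, timeout=2).read().decode()

    action_req = urllib.request.Request(
        "http://169.254.169.254/latest/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        notice = urllib.request.urlopen(action_req, timeout=2).read().decode()
        print("interruption notice:", notice)
    except urllib.error.HTTPError as e:
        if e.code == 404:
            print("no interruption scheduled")
        else:
            raise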
Perhaps in one case you are getting a slice of a machine that is already running, versus AWS powering up a machine that was offline and getting a slice of that one?
I'm happy that a lot of people use SadServers and find it beneficial, while it doesn't cost me a lot of money. Still a lot of features to implement on the website and a shrinking backlog of scenario ideas to materialize (if somebody has ideas for Linux/Docker/Kubernetes etc. troubleshooting scenarios, please let me know).
I've just tried several of your challenges and they're all painfully accurate for real-world scenarios. I will definitely point people at these the next time I get asked how I learned to fix <random Linux configuration problem>!
As for suggestions, here are some random things I needed to do recently:
- resize the boot partition of an OS (don't know how doable this is with your vserver setup, maybe use one of those WASM Linux emulators?)
- set up a systemd service/timer/socket that starts at the right time and responds correctly to reloads/restarts
- set up IPv6 correctly
- troubleshoot why a device wasn't connecting to the WiFi (DHCP service problem!)
- set up a VPN (wireguard/openvpn/etc). Expert mode: make the remote endpoint have an A/AAAA record that the server isn't listening on
- troubleshoot why some of my devices couldn't ssh into a server despite the pubkeys being in the authorized_keys file (old sshd version didn't understand the most recent key algorithm!). Bonus problem: ~/.ssh had the wrong permissions so the authorized keys weren't loading.
- renew an ACME/letsencrypt certificate in nginx in proxy mode (location / was proxied but location /.well-known/... shouldn't have been!)
- check your preferred smtp daemon to see if it's set as an open relay (see the sketch after this list)
- upgrade postgres from an old version to a new version without data loss (hard mode: the partition postgres uses by default doesn't have the free space to make a copy and migrate the data)
- figure out why the firewall isn't blocking port 1234 despite UFW being enabled and a block-all rule being present (it was because of Docker iptables rules overriding UFW rules)
- update a package that has some kind of dependency issue (e.g. an external repository that is no longer needed)
- make Ubuntu shut up about Ubuntu Pro and stop it from fetching ads on ssh login
- alter a systemd service file so that it no longer runs as root (hard mode: set up dynamic users and other hardening features)
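For the open-relay item, this is roughly the check I had in mind, sketched in Python (the hostname and addresses are made up; it only probes MAIL FROM/RCPT TO and backs out before DATA, so nothing is ever sent):

    import smtplib

    # Hypothetical values - point these at your own test server and use
    # sender/recipient domains the server is NOT responsible for.
    HOST = "mail.example.com"
    EXTERNAL_FROM = "probe@external-a.example"
    EXTERNAL_TO = "someone@external-b.example"

    with smtplib.SMTP(HOST, 25, timeout=10) as smtp:
        smtp.ehlo()
        smtp.mail(EXTERNAL_FROM)
        code, msg = smtp.rcpt(EXTERNAL_TO)
        smtp.rset()  # back out without ever reaching the DATA phase
        # Accepting a RCPT for a foreign domain from a foreign sender is the
        # classic open-relay symptom.
        if code in (250, 251):
            print("looks like an open relay:", code, msg)
        else:
            print("relay refused:", code, msg)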
Yes, good catch (I should forbid internet access to this endpoint); the poor queue is waiting on a VM to come up but there's no quota left until other VMs are garbage-collected.
Didn't know about this one. There are quite a few lab/sandbox SaaS offerings, but what I've seen so far is that they're more for training with a "follow the recipe" model (do this, do that to configure something), rather than "this (real) server is broken, fix it (with possibly different solutions)", which imho is more real-life and useful.
I believe the company was founded by some coworkers of mine way back when at Rackspace, who often interviewed Linux admins with a lab VM, and I assume they just automated the setup and spun it off as their own business. At least that's what happened as far as I can tell; I didn't know the parties involved.