How often do you boot up instances? Do you measure detailed metrics for the time from the RunInstances call to the earliest timestamp you can easily capture from user code, to quantify how much time is spent in AWS before any of your instance's code gets control?
If so, I'd love to see your measured distribution of boot times, because I've seen results similar to what you observed with EBS, including some long-tail outliers.
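Concretely, the kind of measurement I mean (a minimal boto3 sketch; the AMI, instance type, and the mechanism for collecting the timestamp file back afterwards are all placeholders):

```python
import time
import boto3

ec2 = boto3.client("ec2")

# First thing our code does on the instance: stamp the moment it got control.
USER_DATA = """#!/bin/bash
date -u +%s.%N > /var/tmp/user_code_start_epoch
"""

t_api_call = time.time()  # wall clock just before the RunInstances call
resp = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",  # placeholder
    InstanceType="c7g.medium",        # placeholder
    MinCount=1,
    MaxCount=1,
    UserData=USER_DATA,
)
instance_id = resp["Instances"][0]["InstanceId"]

# Later, fetch /var/tmp/user_code_start_epoch from the instance (SSM, SSH, an S3
# upload from user data, ...) and compute:
#   pre_user_code_seconds = user_code_start_epoch - t_api_call
print(instance_id, t_api_call)
```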
Instances are constantly booting up because most of our instances live <30d. Boot time, in terms of how soon a node is fully booted, joined to the EKS apiserver, and ready for workloads, is approx 2.5-3 min. There are a lot of parts involved in getting to that point, though, some of which wouldn't matter if you're not using EKS. Also, this isn't something we measure super closely, since from a user's perspective it's generally imperceptible.
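If you ever did want to pull that number yourself, here's roughly how it could be done (a sketch assuming boto3 plus the kubernetes Python client and a kubeconfig for the cluster; the node name is a placeholder, and the Ready condition's lastTransitionTime only reflects the most recent flip, so it's only meaningful for nodes that haven't flapped):

```python
import boto3
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
ec2 = boto3.client("ec2")

def node_boot_to_ready_seconds(node_name: str) -> float:
    node = v1.read_node(node_name)
    # spec.provider_id looks like "aws:///us-west-2a/i-0123456789abcdef0"
    instance_id = node.spec.provider_id.rsplit("/", 1)[-1]
    launch_time = ec2.describe_instances(InstanceIds=[instance_id])[
        "Reservations"][0]["Instances"][0]["LaunchTime"]
    ready = next(c for c in node.status.conditions if c.type == "Ready")
    return (ready.last_transition_time - launch_time).total_seconds()

print(node_boot_to_ready_seconds("ip-10-0-0-1.us-west-2.compute.internal"))  # placeholder
```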
A possibly better metric for your particular case (assuming you're interested in the fastest boot achievable) is from our self-managed github-actions runners. Those boot times are in the 40-50s range, which is consistent with what others see, as far as I know. A good blog post on this topic - including how they got boot-to-ready times down to 5s - that you might be interested in, from the depot.dev folks: https://depot.dev/blog/github-actions-breaking-five-second-b...
I'm already at the ~5s mark booting a brand-new instance, and almost all of that is AWS time before my instance gets control; once the kernel takes over, the remaining boot time is milliseconds. (I plan to go the "pool of instances" route in the future to eliminate the AWS time I have no control over.)
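Roughly what I have in mind for the pool (a sketch assuming a set of pre-created, stopped instances tagged pool=warm; the tag is made up, and whether start_instances actually dodges the variable AWS-side time is exactly what I'd want to measure):

```python
import boto3

ec2 = boto3.client("ec2")

def take_from_pool():
    """Grab one stopped instance from the warm pool and start it."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:pool", "Values": ["warm"]},
            {"Name": "instance-state-name", "Values": ["stopped"]},
        ],
    )
    for reservation in resp["Reservations"]:
        for inst in reservation["Instances"]:
            instance_id = inst["InstanceId"]
            ec2.start_instances(InstanceIds=[instance_id])
            return instance_id
    return None  # pool is empty; fall back to RunInstances and refill later
```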
But every so often, I observe instances taking several more seconds of that uncontrollable AWS time, and I wondered what statistics you might have on that.
Possibly relatedly, do you ever observe EBS being degraded at initial boot?
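For context, the kind of check I'd run right after boot (a rough boto3 sketch; the instance id is a placeholder):

```python
import boto3

ec2 = boto3.client("ec2")

def root_volume_status(instance_id: str) -> str:
    """Return EC2's view of the root EBS volume's health: ok / impaired / insufficient-data."""
    inst = ec2.describe_instances(InstanceIds=[instance_id])[
        "Reservations"][0]["Instances"][0]
    root_dev = inst["RootDeviceName"]
    vol_id = next(
        m["Ebs"]["VolumeId"]
        for m in inst["BlockDeviceMappings"]
        if m["DeviceName"] == root_dev
    )
    status = ec2.describe_volume_status(VolumeIds=[vol_id])["VolumeStatuses"][0]
    return status["VolumeStatus"]["Status"]

print(root_volume_status("i-xxxxxxxxxxxxxxxxx"))  # placeholder
```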
Great deep dive. I've been actively curious about some of the results you found, since they show up similarly in infra setups I run or have run previously.
This kind of miffs me as well:
> AWS doesn’t describe how failure is distributed for gp3 volumes
I wonder why? Because it affects their number of 9s? Rep?
Our experience has been that they do fire, but not reliably enough or soon enough to be worth anything other than validating the problem after the fact.