How often do you boot up instances? Do you measure detailed metrics for the time from the RunInstances call to the earliest timestamp you can easily capture from user code, to quantify how much time is spent in AWS before any of your instance's code gets control?
If so, I'd love to see your measured distribution of boot times, because I've seen results similar to what you observed with EBS, including some long-tail outliers.
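Concretely, the kind of measurement I mean (a minimal boto3 sketch; the AMI, instance type, and the mechanism for collecting the timestamp file back afterwards are all placeholders):

```python
import time
import boto3

ec2 = boto3.client("ec2")

# First thing our code does on the instance: stamp the moment it got control.
USER_DATA = """#!/bin/bash
date -u +%s.%N > /var/tmp/user_code_start_epoch
"""

t_api_call = time.time()  # wall clock just before the RunInstances call
resp = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",  # placeholder
    InstanceType="c7g.medium",        # placeholder
    MinCount=1,
    MaxCount=1,
    UserData=USER_DATA,
)
instance_id = resp["Instances"][0]["InstanceId"]

# Later, fetch /var/tmp/user_code_start_epoch from the instance (SSM, SSH, an S3
# upload from user data, ...) and compute:
#   pre_user_code_seconds = user_code_start_epoch - t_api_call
print(instance_id, t_api_call)
```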
Instances are constantly booting up because most of our instances live <30d. Boot time, in terms of how soon a node is fully booted, joined to the EKS apiserver, and ready for workloads, is approx 2.5-3 min. There are a lot of parts involved in getting to that point, though, some of which wouldn't matter if you're not using EKS. Also, this isn't something we measure super closely, since from a user's perspective it's generally imperceptible.
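If you ever did want to pull that number yourself, here's roughly how it could be done (a sketch assuming boto3 plus the kubernetes Python client and a kubeconfig for the cluster; the node name is a placeholder, and the Ready condition's lastTransitionTime only reflects the most recent flip, so it's only meaningful for nodes that haven't flapped):

```python
import boto3
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
ec2 = boto3.client("ec2")

def node_boot_to_ready_seconds(node_name: str) -> float:
    node = v1.read_node(node_name)
    # spec.provider_id looks like "aws:///us-west-2a/i-0123456789abcdef0"
    instance_id = node.spec.provider_id.rsplit("/", 1)[-1]
    launch_time = ec2.describe_instances(InstanceIds=[instance_id])[
        "Reservations"][0]["Instances"][0]["LaunchTime"]
    ready = next(c for c in node.status.conditions if c.type == "Ready")
    return (ready.last_transition_time - launch_time).total_seconds()

print(node_boot_to_ready_seconds("ip-10-0-0-1.us-west-2.compute.internal"))  # placeholder
```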
A possibly better metric for your particular case (assuming you're interested in the fastest boot achievable) is from our self-managed github-actions runners. Those boot times are in the 40-50s range, which is consistent with what others see, as far as I know. A good blog post on this topic - including how they got boot-to-ready times down to 5s - that you might be interested in, from the depot.dev folks: https://depot.dev/blog/github-actions-breaking-five-second-b...
I'm already at the ~5s mark booting a brand-new instance, and almost all of that is AWS time before my instance gets control; once the kernel takes over, the remaining boot time is milliseconds. (I plan to go the "pool of instances" route in the future to eliminate the AWS time I have no control over.)
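Roughly what I have in mind for the pool (a sketch assuming a set of pre-created, stopped instances tagged pool=warm; the tag is made up, and whether start_instances actually dodges the variable AWS-side time is exactly what I'd want to measure):

```python
import boto3

ec2 = boto3.client("ec2")

def take_from_pool():
    """Grab one stopped instance from the warm pool and start it."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:pool", "Values": ["warm"]},
            {"Name": "instance-state-name", "Values": ["stopped"]},
        ],
    )
    for reservation in resp["Reservations"]:
        for inst in reservation["Instances"]:
            instance_id = inst["InstanceId"]
            ec2.start_instances(InstanceIds=[instance_id])
            return instance_id
    return None  # pool is empty; fall back to RunInstances and refill later
```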
But every so often, I observe instances taking several more seconds of that uncontrollable AWS time, and I wondered what statistics you might have on that.
Possibly relatedly, do you ever observe EBS being degraded at initial boot?
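For context, the kind of check I'd run right after boot (a rough boto3 sketch; the instance id is a placeholder):

```python
import boto3

ec2 = boto3.client("ec2")

def root_volume_status(instance_id: str) -> str:
    """Return EC2's view of the root EBS volume's health: ok / impaired / insufficient-data."""
    inst = ec2.describe_instances(InstanceIds=[instance_id])[
        "Reservations"][0]["Instances"][0]
    root_dev = inst["RootDeviceName"]
    vol_id = next(
        m["Ebs"]["VolumeId"]
        for m in inst["BlockDeviceMappings"]
        if m["DeviceName"] == root_dev
    )
    status = ec2.describe_volume_status(VolumeIds=[vol_id])["VolumeStatuses"][0]
    return status["VolumeStatus"]["Status"]

print(root_volume_status("i-xxxxxxxxxxxxxxxxx"))  # placeholder
```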
Great deep dive. I've been actively curious about some of the results you found, since they show up similarly in infra setups I run or have run previously.
This kind of miffs me as well:
> AWS doesn’t describe how failure is distributed for gp3 volumes
I wonder why? Because it affects their number of 9s? Rep?
Our experience has been that they do fire, but not reliably enough or soon enough to be worth anything other than validating the problem after the fact.