
We have a lot more content like this on the way. If anyone has feedback or questions, let us know.



LOVE this stuff Sam, it's highly educational but also establishes a ton of trust in PS. Please keep it up!


How often do you boot up instances? Do you measure detailed metrics for the time from the RunInstances call to the earliest possible timestamp you can easily get from the user code, to quantify the amount of time spent in AWS before any instance code gets control?

If so, I'd love to see your measured distribution of boot times. Because I've observed results similar to your observations on EBS, with some long-tail outliers.
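
For context, the measurement I have in mind is roughly the sketch below (boto3; the AMI, tag key, and user-data details are illustrative, and it assumes the AMI has the AWS CLI, an instance profile allowed to call ec2:CreateTags, and clocks kept in sync by NTP):

    # Sketch: measure RunInstances -> first user code, via a tag the instance
    # writes as soon as user data starts executing. All names are illustrative.
    import time
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    t0 = time.time()  # taken immediately before the RunInstances call
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # illustrative
        InstanceType="c6i.large",          # illustrative
        MinCount=1,
        MaxCount=1,
        # Requires an instance profile that may call ec2:CreateTags.
        UserData=(
            "#!/bin/bash\n"
            "TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token"
            " -H 'X-aws-ec2-metadata-token-ttl-seconds: 60')\n"
            'IID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN"'
            " http://169.254.169.254/latest/meta-data/instance-id)\n"
            "aws ec2 create-tags --region us-east-1 --resources $IID"
            " --tags Key=user-code-start,Value=$(date +%s.%N)\n"
        ),
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # Poll until the instance has written its first-user-code timestamp.
    while True:
        tags = ec2.describe_tags(
            Filters=[
                {"Name": "resource-id", "Values": [instance_id]},
                {"Name": "key", "Values": ["user-code-start"]},
            ]
        )["Tags"]
        if tags:
            user_code_start = float(tags[0]["Value"])
            break
        time.sleep(1)

    print(f"time before any user code ran: {user_code_start - t0:.2f}s")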

Thanks for the analysis and article!


Instances are constantly booting up because most instances live <30d. Boot time, in terms of how soon a node is fully booted, joined to the EKS apiserver, and ready for workloads, is approx 2.5-3 min. There are a lot of parts involved in getting to this point though, some of which would not matter if you're not using EKS. Also, this is not something we measure super closely, as from a user perspective it is generally imperceptible.
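
If you ever want to approximate it without extra instrumentation, something like this sketch gets launch-to-Ready per node from the EC2 and Kubernetes APIs. It assumes kubeconfig access and boto3 credentials, and it is only meaningful for nodes that have stayed Ready since they booted; it is not how we instrument things internally:

    # Sketch: EC2 launch time vs. the node's Ready condition transition time.
    import boto3
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()
    ec2 = boto3.client("ec2", region_name="us-east-1")  # region illustrative

    for node in v1.list_node().items:
        # providerID looks like aws:///us-east-1a/i-0abc123...
        instance_id = node.spec.provider_id.rsplit("/", 1)[-1]
        launch_time = ec2.describe_instances(InstanceIds=[instance_id])[
            "Reservations"][0]["Instances"][0]["LaunchTime"]

        ready = next(c for c in node.status.conditions if c.type == "Ready")
        delta = (ready.last_transition_time - launch_time).total_seconds()
        print(f"{node.metadata.name}: {delta:.0f}s from launch to Ready")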

A possibly better metric for your particular case (assuming you're interested in the fastest bootup achievable) is from our self-managed github-actions runners. Those boot times are in the 40-50s range, which is consistent with what others see, as far as I know. A good blog post on this topic from the depot.dev folks - including how they got boot-to-ready times down to 5s - that you might be interested in: https://depot.dev/blog/github-actions-breaking-five-second-b...


I'm already at the ~5s mark, booting a brand new instance, almost all of which is AWS time before my instance gets control; once the kernel takes over the remaining boot time is milliseconds. (I plan to go the "pool of instances" route in the future to eliminate the AWS time I have no control over.)

But every so often, I observe instances taking several more seconds of that uncontrollable AWS time, and I wondered what statistics you might have on that.

Possibly relatedly, do you ever observe EBS being degraded at initial boot?
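
What I'd check right after boot, roughly - a boto3 sketch with illustrative IDs; describe_volume_status reports "ok", "impaired", or "insufficient-data", plus any open volume events:

    # Sketch: inspect the status of the volumes attached to a fresh instance.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # illustrative
    instance_id = "i-0123456789abcdef0"                 # illustrative

    volumes = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )["Volumes"]

    status = ec2.describe_volume_status(
        VolumeIds=[v["VolumeId"] for v in volumes]
    )

    for vs in status["VolumeStatuses"]:
        print(vs["VolumeId"], vs["VolumeStatus"]["Status"])
        for event in vs.get("Events", []):
            print("  event:", event["EventType"], event["Description"])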


Great deep dive. I've been actively curious about some of the results you found, since they present themselves similarly in infra setups I run or have run previously.

This kind of miffs me also:

> AWS doesn’t describe how failure is distributed for gp3 volumes

I wonder why? Because it affects their number of 9s? Rep?


It's hard to know for sure. It might be that, or it might just be that it would present a number that is confusing to most people.


Thanks! This is extremely useful and I'll be waiting for the next ones.


Do you listen for volume degradation EventBridge notifications? I'm curious whether, and how often, AWS flags these failed volumes for you.
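
For context, the kind of rule I mean - a rough boto3 sketch; the detail-type, target, and names are illustrative and worth double-checking against the current EBS/Health docs:

    # Sketch: an EventBridge rule forwarding EBS volume notifications to SNS.
    import json
    import boto3

    events = boto3.client("events", region_name="us-east-1")  # illustrative

    pattern = {
        "source": ["aws.ec2"],
        "detail-type": ["EBS Volume Notification"],
    }

    events.put_rule(
        Name="ebs-volume-notifications",   # illustrative name
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )

    events.put_targets(
        Rule="ebs-volume-notifications",
        # Illustrative SNS topic ARN.
        Targets=[{"Id": "sns",
                  "Arn": "arn:aws:sns:us-east-1:123456789012:ebs-alerts"}],
    )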


Our experience has been that they do fire, but not reliably enough or soon enough to be worth anything other than validating the problem after the fact.


Love how educational it is. I'd love even more if formulas were included for the statistics calculations.



