It's not a contradiction, but there is nuance. Local disks mean we can do a significant amount of the operations involved in a write locally, without every block going over the network. It's true that a replica has to acknowledge it received the write, but that's a single operation vs. hundreds over the network.
How often do you boot up instances? Do you measure detailed metrics for the time from the RunInstances call to the earliest possible timestamp you can easily get from the user code, to quantify the amount of time spent in AWS before any instance code gets control?
If so, I'd love to see your measured distribution of boot times. Because I've observed results similar to your observations on EBS, with some long-tail outliers.
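For concreteness, this is roughly the measurement I have in mind - a minimal sketch assuming boto3; the AMI ID is a placeholder, and how the boot timestamp gets off the box (tag, S3 object, log shipper) is left open:

    import time
    import boto3

    ec2 = boto3.client("ec2")

    # User data runs about as early as you can easily get control; it records a
    # wall-clock timestamp that we later fetch off the instance (retrieval
    # mechanism not shown). Ignores clock skew between the caller and the box.
    USER_DATA = """#!/bin/bash
    date +%s.%N > /var/tmp/first-user-code-timestamp
    """

    t_api_call = time.time()
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        UserData=USER_DATA,
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # Later: time_in_aws = first_user_code_timestamp - t_api_call
    print(instance_id, t_api_call)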
Instances are constantly booting up because most instances live <30d. Boot time, in terms of how soon a node is fully booted, joined to the EKS apiserver, and ready for workloads, is approx 2.5-3min. There are a lot of parts involved in getting to this point though, some of which would not matter if you're not using EKS. Also, this is not something we measure super closely, as from a user perspective it is generally imperceptible.
A possibly better metric for your particular case (assuming you're interested in the fastest bootup achievable) is from our self-managed github-actions runners. Those boot times are in the 40-50s range, which is consistent with what others see, as far as I know. A good blog post on this topic from the depot.dev folks - including how they got boot-to-ready times down to 5s - that you might be interested in: https://depot.dev/blog/github-actions-breaking-five-second-b...
I'm already at the ~5s mark, booting a brand new instance, almost all of which is AWS time before my instance gets control; once the kernel takes over the remaining boot time is milliseconds. (I plan to go the "pool of instances" route in the future to eliminate the AWS time I have no control over.)
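Roughly what I have in mind for that pool, as a sketch assuming boto3 - the tag names are made up, and a real version needs an atomic claim step rather than tagging alone:

    import boto3

    ec2 = boto3.client("ec2")

    def claim_pooled_instance():
        # Find an already-running, idle instance in the (hypothetical) warm
        # pool, so none of the RunInstances/boot time lands on the request path.
        resp = ec2.describe_instances(
            Filters=[
                {"Name": "tag:pool", "Values": ["warm"]},
                {"Name": "tag:pool-status", "Values": ["idle"]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )
        for reservation in resp["Reservations"]:
            for instance in reservation["Instances"]:
                instance_id = instance["InstanceId"]
                # Mark it busy; a real implementation needs an atomic claim
                # (e.g. a lock table), since tagging alone can race.
                ec2.create_tags(
                    Resources=[instance_id],
                    Tags=[{"Key": "pool-status", "Value": "busy"}],
                )
                return instance_id
        return None  # pool exhausted: fall back to RunInstances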
But every so often, I observe instances taking several more seconds of that uncontrollable AWS time, and I wondered what statistics you might have on that.
Possibly relatedly, do you ever observe EBS being degraded at initial boot?
Great deep dive. I've been actively curious about some of the results you found, since they present themselves similarly in infra setups I run or have run previously.
This kind of miffs me as well:
> AWS doesn’t describe how failure is distributed for gp3 volumes
I wonder why? Because it affects their number of 9s? Rep?
Our experience has been that they do fire, but not reliably enough or soon enough to be worth anything other than validating the problem after the fact.
Answering on behalf of PlanetScale (I'm the CTO!). I don't remember exactly what was different from upstream, but it wasn't a whole lot of changes.
Fortunately, we at PlanetScale run a well-maintained fork ourselves, so we're very used to taking custom changes and getting them deployed. In this case, we asked the Block team for all of their changes and went through them one by one to see what was needed to pull into our fork.
By the time we did the migration, we made sure that any behavior wouldn't be different where it mattered.
Mostly the diffs were related to running against the on-prem MySQL instances smoothly: stuff like changes to split tooling or how you boot up the pieces. We have had unique vindexes or query planning changes in the past but we either deprecated or upstreamed them prior to migration.
Just to address the core of your comment: 20 magnetic disks would combine for roughly 2,000 IOPS of capacity, provide no redundancy, and allow only one machine to process the entirety of the queries coming in to power the application.
Even a full 60 disk server filled with magnetic disks would provide less I/O capacity for running a relational database than a single EBS volume.
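The rough arithmetic, where the per-disk and EBS figures are my own assumptions (~100 IOPS for a 7,200 RPM spindle, 16,000 IOPS ceiling for a single gp3 volume):

    # Back-of-the-envelope only; the per-disk and gp3 figures are assumptions.
    IOPS_PER_MAGNETIC_DISK = 100   # typical 7,200 RPM spindle
    GP3_MAX_IOPS = 16_000          # ceiling for a single gp3 volume

    print(20 * IOPS_PER_MAGNETIC_DISK)  # ~2,000 IOPS from 20 spindles
    print(60 * IOPS_PER_MAGNETIC_DISK)  # ~6,000 IOPS from a full 60-bay server
    print(GP3_MAX_IOPS)                 # one gp3 volume outruns both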
It might not look like a lot of data if you're talking about storing media files, but it's quite a bit of relational data to be queried in single-digit milliseconds at scale.
I assumed people did not need to be explicitly reminded that you have to provision additional capacity for redundancy, and that you can use different layers of caching (SSD caches, RAM caches, etc.).
And by the way, it was just posted today that you can get 60TB PCIe Gen5 SSDs from Micron: https://news.ycombinator.com/item?id=42122434 - you can still fit that entire dataset in a single machine and provide all the IOPS you need. You'd need just 7 of those.
45drives is... bad. I cannot understand why people would use them, particularly for a homelab. Three SC846 24x3.5" 4U chassis can be had for no more than $400/ea, and there are 36-bay versions.
> Three SC846 24x3.5" 4U chassis can be had for no more than $400/ea, and there are 36-bay versions.
Because then you need three times the rack space. What you don't spend one-time on the hardware, you'll spend extra every month on rack space, connectivity, cooling, power, etc.
You might use that budget for redundancy, for example.
> After the commit, the Connection object associated with that transaction is closed, causing its underlying DBAPI connection to be released back to the connection pool associated with the Engine to which the Session is bound.
I expect other reasonably sane libs for working with transactional databases do the same.
So, if you are doing pooling correctly, you can only run out of available connections if you want to have a lot of long-running transactions.
So, why would you want every one of your 50k frontends to keep an open transaction simultaneously?
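A small example of the behavior the quoted docs describe (real SQLAlchemy API; the DSN, pool sizes, and query are just illustrative):

    from sqlalchemy import create_engine, text
    from sqlalchemy.orm import Session

    engine = create_engine(
        "postgresql+psycopg2://app@db/app",  # illustrative DSN
        pool_size=5,      # idle connections kept around
        max_overflow=5,   # extra connections allowed under burst load
    )

    def handle_request():
        with Session(engine) as session:
            session.execute(text("UPDATE counters SET n = n + 1 WHERE id = 1"))
            session.commit()
        # Leaving the block releases the DBAPI connection back to the pool,
        # so each frontend only pins a server-side connection while a
        # transaction is actually in flight.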
Because there's an overhead to make a connection, authenticate, set the default parameters on the connection, etc. I've never seen a framework that closed db connections between requests.
Of course, the better design is to write a nonblocking worker that can run async requests on a single connection, and not need a giant pool of blocking workers, but that is a major architecture plan that can't be added late in a project that started as blocking worker pools. MySQL has always fit well with those large blocking worker pools. Postgres less so.
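The nonblocking shape I mean looks roughly like this - a sketch assuming asyncpg, and using a small pool rather than literally one connection; DSN, query, and pool sizes are illustrative:

    import asyncio
    import asyncpg

    async def handle_request(pool, user_id):
        # Hundreds of these coroutines can be in flight at once; they share a
        # handful of connections instead of each blocking worker holding its own.
        async with pool.acquire() as conn:
            return await conn.fetchrow(
                "SELECT name FROM users WHERE id = $1", user_id
            )

    async def main():
        pool = await asyncpg.create_pool(
            "postgresql://app@db/app", min_size=1, max_size=5
        )
        results = await asyncio.gather(
            *(handle_request(pool, i) for i in range(500))
        )
        print(len(results))
        await pool.close()

    asyncio.run(main())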
As I said, you can return the connection to the connection pool.
From the perspective of keeping the number of open connections low it doesn't really matter if you close it or return to the pool, because in either case the connection becomes available to other clients.
I might not be understanding what you're pointing out here. It sounds to me like sqlalchemy is talking about a pool of connections within one process, in which case releasing back to that pool does not close the connection by that process to the database. Parent comment is talking about one connection per process with 50k processes. My comment was that you don't need that many processes if each process can handle hundreds of web requests asynchronously.
If you are saying that a connection pool can be shared between processes without pgbouncer, that is news to me.
The most common design for a Web app on Linux in the last 20 years is to have a pool of worker processes, each single-threaded and ready to serve one request. The processes might be apache ready to invoke PHP, or mod-perl, or a pool of ruby-on-rails or perl or python processes receiving the requests directly. Java tends to be threads instead of processes. I've personally never needed to go past about 100 workers, but I've talked to people who scale up to thousands, and they happen to be using MySQL. I've never used pgbouncer, but understand that's the tool to reach for rather than configuring Pg to allow thousands of connections.
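From what I understand (never having run it myself), the pgbouncer setup those thousands-of-workers shops reach for looks roughly like this - every value here is illustrative, not a recommendation:

    ; minimal pgbouncer.ini sketch; values are illustrative
    [databases]
    app = host=127.0.0.1 port=5432 dbname=app

    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 6432
    ; a server connection is held only for the duration of a transaction
    pool_mode = transaction
    ; thousands of workers can connect to pgbouncer...
    max_client_conn = 5000
    ; ...while Postgres itself only ever sees ~50 connections
    default_pool_size = 50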