
no. fundamentally different architectures.


so proud of you Mike. we still regularly talk about how we miss you. i am so glad you are doing well.


it's a lot of complexity and cost for a service that is already replicating 3 ways. 6x replication for a single node's disks seems excessive.


it's not a contradiction but there is nuance. local disks mean we can do a significant amount of the operations involved in a write locally without every block going over the network. It's true that a replica has to acknowledge it received the write but that's a single operation vs hundreds over a network.


we have a lot more content like this on the way. if anyone has feedback or questions let us know.


LOVE this stuff sam, it's highly educational but also establishes a ton of trust in PS. please keep it up!


How often do you boot up instances? Do you measure detailed metrics for the time from the RunInstances call to the earliest possible timestamp you can easily get from the user code, to quantify the amount of time spent in AWS before any instance code gets control?

If so, I'd love to see your measured distribution of boot times. Because I've observed results similar to your observations on EBS, with some long-tail outliers.
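
For reference, a rough sketch of the kind of measurement I mean (the AMI ID and report endpoint are placeholders, not anything from your setup): record a wall-clock timestamp right before RunInstances, have user data report the first timestamp user code can observe, and diff the two on the collector side.

    # rough sketch only; ami-... and REPORT_URL are placeholders
    import time
    import boto3

    ec2 = boto3.client("ec2")

    t_api_call = time.time()
    resp = ec2.run_instances(
        ImageId="ami-00000000000000000",   # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        # user data runs early in boot and POSTs its own timestamp to a
        # (hypothetical) collector endpoint, to diff against t_api_call
        UserData='#!/bin/sh\ncurl -s -d "t=$(date +%s.%N)" https://REPORT_URL/boot',
    )
    print(resp["Instances"][0]["InstanceId"], t_api_call)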

Thanks for the analysis and article!


Instances are constantly booting up because most instances live <30d. Boot time, in terms of how soon a node is fully booted, joined to the EKS apiserver, and ready for workloads, is approx 2.5-3min. There are a lot of parts involved in getting to this point though, some of which would not matter if you're not using EKS. Also, this is not something we measure super closely, as from a user perspective it is generally imperceptible.

A possibly better metric for your particular case (assuming you're interested in the fastest bootup achievable) is from our self-managed github-actions runners. Those boot times are in the 40-50s range. This is consistent with what others see, as far as I know. A good blog post on this topic from the depot.dev folks, including how they got boot-to-ready times down to 5s, that you might be interested in: https://depot.dev/blog/github-actions-breaking-five-second-b...


I'm already at the ~5s mark, booting a brand new instance, almost all of which is AWS time before my instance gets control; once the kernel takes over the remaining boot time is milliseconds. (I plan to go the "pool of instances" route in the future to eliminate the AWS time I have no control over.)

But every so often, I observe instances taking several more seconds of that uncontrollable AWS time, and I wondered what statistics you might have on that.

Possibly relatedly, do you ever observe EBS being degraded at initial boot?


Great deep dive. I've been actively curious about some of the results you found, which present themselves similarly in infra setups I run or have run previously.

This kind of miffs me, too:

> AWS doesn’t describe how failure is distributed for gp3 volumes

I wonder why? Because it affects their number of 9s? Rep?


it's hard to know for sure. it might be that, or the number might just be confusing to most people.


Thanks! This is extremely useful and I'll be waiting for the next ones.


Do you listen for volume degradation EventBridge notifications? I'm curious if or how often AWS flags these failed volumes for you


Our experience has been that they do fire, but not reliably enough or soon enough to be worth anything other than validating the problem after the fact.


Love how educational it is. I'd love even more if formulas were included for the statistics calculations.


Some of the largest relational databases run on PlanetScale/Vitess. It's beyond battle hardened.


It's great. We use it.

I'm not sure I would call it even close to battle hardened.

There are still many lurking footguns and bugs.

Try running it with > 250k tables. Falls down hard.

Error logic around the etcd/topo server is very shaky; edge cases can wedge cells/clusters into a broken state.


We've loved working with the incredible team at Cash App. If anyone has questions, I or someone on the PlanetScale team will answer.


The article says they used a forked version of Vitess with customizations. What were they and how did you address that when migrating?


Answering on behalf of PlanetScale (I'm the CTO!). I don't remember exactly what was different from upstream, but it wasn't a whole lot of changes.

Fortunately, we at PlanetScale run a well-maintained fork ourselves, so we're very used to taking custom changes and getting them deployed. In this case, we asked the Block team for all of their changes and went through them one by one to see what was needed to pull into our fork.

By the time we did the migration, we had made sure that behavior wouldn't be different where it mattered.


Mostly the diffs were related to running smoothly against the on-prem MySQL instances: stuff like changes to split tooling or how you boot up the pieces. We have had unique vindexes and query-planning changes in the past, but we either deprecated or upstreamed them prior to the migration.


It would be interesting to know why a Bitcoin application requires 400 TB of disk space.


The article says,

> At peak times, Cash App's database handles approximately 3–4 million queries per second (QPS) across 400 shards, totaling around 400TiB of data.

400TiB is not a lot of data for that query volume. If each query stored only 1 byte, it would take only about 4 years to accumulate this much data.

If duplicated, or processed and the results stored, that would add up, too.
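
Back-of-envelope, using the mid-range 3.5M QPS figure from the article:

    # rough arithmetic only; assumes 1 byte stored per query
    qps = 3.5e6
    seconds_per_year = 365 * 24 * 3600
    bytes_per_year = qps * seconds_per_year            # ~110 TB/year
    years_to_400_tib = 400 * 2**40 / bytes_per_year    # ~4 years
    print(bytes_per_year / 1e12, years_to_400_tib)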


Why would a query store data? Are they logging individual queries?


It's peer-to-peer payments and banking, which have been around for much longer than the stocks/bitcoin aspect of the app.


It's not much data.

With current 22TB magnetic disks that's fewer than 20 disks; they would fit into a single machine (4U, likely).

The Storinator XL60 from 45drives (https://www.45drives.com/products/storinator-xl60-configurat...) can hold 60 disks, for (advertised) ~1.4PB of data.

(btw I learned about 45drives through the Linus Tech Tips channel, so I think it's obligatory to say LTTSTORE.COM - for the meme)


Just to address the core of your comment, 20 magnetic disks would combine for about 2,000 IOPS of capacity, provide no redundancy, and allow only one machine to process the entirety of the queries coming in to power the application.

Even a full 60-disk server filled with magnetic disks would provide less I/O capacity for running a relational database than a single EBS volume.

It might not look like a lot of data if you're talking about storing media files, but it's quite a bit of relational data to be queried in single-digit milliseconds at scale.
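
Rough numbers, assuming the usual ~100 IOPS per 7,200 RPM disk and gp3's 16,000 provisioned-IOPS ceiling:

    # assumed figures: ~100 IOPS per magnetic disk, 16,000 max IOPS per gp3 volume
    iops_per_hdd = 100
    print(20 * iops_per_hdd)   # ~2,000 IOPS across 20 disks
    print(60 * iops_per_hdd)   # ~6,000 IOPS for a full 60-disk chassis
    print(16_000)              # a single gp3 EBS volume at max provisioned IOPS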


I assumed people did not need to be explicitly reminded that you have to provision additional capacity for redundancy, and that you can use different layers of caching (SSD caches, RAM caches, etc.).

And by the way, it was just posted today that you can get 60TB PCIe Gen5 SSDs from Micron: https://news.ycombinator.com/item?id=42122434 - you can still fit that whole dataset in a single machine and provide all the IOPS you need. You'd need just 7 of those.

So yeah, 400TB of data is not much data nowadays.


45drives is... bad. I cannot understand why people would use them - particularly for a homelab. Three SC846 24x3.5" 4U chassis can be had for no more than $400/ea, and there are 36-bay versions.

60 drives in 6U is a crap ton of weight.


> Three SC846 24x3.5" 4U can be had for no more than $400/ea and, theres 36bay versions.

Because then you need three times the rack space. What you don't spend one time on the hardware, you'll spend extra every month on rack space, connectivity, cooling, power, etc.

You might use that budget for redundancy, for example.


because people run large numbers of frontends and workers that create a significant number of connections. it doesn't matter if they are all active.


Why would you want every "frontend" to keep an open connection all the time?

> it doesn't matter if they are all active

It does; if the connection is inactive (doesn't hold an open transaction) you should close it or return it to the pool.


so you are suggesting you close a connection between queries?


Between queries in the same transaction? No

Between transactions? Yes, absolutely

In fact, many libraries do it automatically.

For example, the SQLAlchemy docs explicitly say [0]:

> After the commit, the Connection object associated with that transaction is closed, causing its underlying DBAPI connection to be released back to the connection pool associated with the Engine to which the Session is bound.

I expect other reasonably sane libs for working with transactional databases do the same.

So, if you are doing pooling correctly, you can only run out of available connections if you want to have a lot of long-running transactions.

So, why would you want every one of your 50k frontends to keep an open transaction simultaneously?

[0] https://docs.sqlalchemy.org/en/20/orm/session_basics.html#co...
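
A minimal sketch of the pattern (SQLAlchemy, with a made-up DSN and table; pool_size caps what one process holds):

    # made-up DSN and table; commit ends the transaction, and leaving the
    # block releases the DBAPI connection back to the engine's pool
    from sqlalchemy import create_engine, text
    from sqlalchemy.orm import Session

    engine = create_engine("mysql+pymysql://app:pw@db/app", pool_size=5)

    with Session(engine) as session:
        session.execute(text("UPDATE accounts SET balance = balance - 1 WHERE id = 42"))
        session.commit()   # transaction over; no connection is pinned after this
    # at this point the connection is available to other requests in the process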


Because there's an overhead to make a connection, authenticate, set the default parameters on the connection, etc. I've never seen a framework that closed db connections between requests.

Of course, the better design is to write a nonblocking worker that can run async requests on a single connection and not need a giant pool of blocking workers, but that is a major architectural choice that can't be added late to a project that started with blocking worker pools. MySQL has always fit well with those large blocking worker pools; Postgres less so.
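
Roughly what I mean by the nonblocking design, as a sketch (aiomysql; host, user, and table are illustrative): one process, a small pool, many concurrent requests.

    # sketch only: one worker process multiplexes many requests over a
    # small pool of connections (aiomysql; connection details are made up)
    import asyncio
    import aiomysql

    async def handle_request(pool, user_id):
        async with pool.acquire() as conn:            # borrow a pooled connection
            async with conn.cursor() as cur:
                await cur.execute("SELECT balance FROM accounts WHERE id = %s", (user_id,))
                return await cur.fetchone()
        # the connection returns to the pool as soon as the request finishes

    async def main():
        pool = await aiomysql.create_pool(host="db", user="app", password="pw",
                                          db="app", minsize=1, maxsize=10)
        # hundreds of concurrent "requests" share at most 10 connections
        results = await asyncio.gather(*(handle_request(pool, i) for i in range(500)))
        pool.close()
        await pool.wait_closed()
        return results

    asyncio.run(main())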


As I said, you can return the connection to the connection pool.

From the perspective of keeping the number of open connections low, it doesn't really matter if you close it or return it to the pool, because in either case the connection becomes available to other clients.


I might not be understanding what you're pointing out here. It sounds to me like SQLAlchemy is talking about a pool of connections within one process, in which case releasing back to that pool does not close that process's connection to the database. Parent comment is talking about one connection per process with 50k processes. My comment was that you don't need that many processes if each process can handle hundreds of web requests asynchronously.

If you are saying that a connection pool can be shared between processes without pgbouncer, that is news to me.


Of course, you're right, it is not possible to share a connection pool between processes without pgbouncer.

> Parent comment is talking about one connection per process with 50k processes.

It is actually not clear what the parent comment was talking about. I don't know what exactly they meant by "front ends".


The most common design for a web app on Linux in the last 20 years is to have a pool of worker processes, each single-threaded and ready to serve one request. The processes might be Apache ready to invoke PHP, or mod_perl, or a pool of Ruby on Rails or Perl or Python processes receiving the requests directly. Java tends to use threads instead of processes. I've personally never needed to go past about 100 workers, but I've talked to people who scale up to thousands, and they happen to be using MySQL. I've never used pgbouncer, but I understand that's the tool to reach for rather than configuring Pg to allow thousands of connections.


no fights at all. we all wanted to do it.


this is correct

