
The visuals are awesome; the bouncing-box is probably the best illustration of relative latency I've seen.

Your "1 in a million" comment on durability is certainly too pessimistic once you consider the briefness of the downtime before a new server comes in and re-replicates everything, right? I would think if your recovery is 10 minutes for example, even if each of three servers is guaranteed to fail once in the month, I think it's already like 1 in two million? and if it's a 1% chance of failure in the month failure of all three overlapping becomes extremely unlikely.

Thought I would note this because one-in-a-million is not great if you have a million customers ;)
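
For anyone who wants to sanity-check that intuition, here is a back-of-envelope sketch in Python of the scenario above: each of three servers fails exactly once at a uniformly random time in the month, and a failed server is fully re-replicated 10 minutes later, so data is lost only if all three outages overlap. This is a simplified model using the commenter's assumed numbers, not real failure data; under it the odds come out even better than one in two million.

    # Each of three servers fails once, at a uniformly random minute of the
    # month; a failure is healed by re-replication after recovery_min minutes.
    # Data is lost only if all three outage windows overlap at some instant,
    # i.e. the three failure times land within one recovery window.
    # For three uniform points, P(spread < r) = 3r^2 - 2r^3, with r the
    # recovery time as a fraction of the month. The 10-minute recovery time
    # is an assumption for illustration, not a measured figure.
    month_min = 43_800
    recovery_min = 10
    r = recovery_min / month_min

    p_loss = 3 * r**2 - 2 * r**3
    print(f"P(all three failures overlap): {p_loss:.1e} (about 1 in {1 / p_loss:,.0f})")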



> Your "1 in a million" comment on durability is certainly too pessimistic once you consider the briefness of the downtime before a new server comes in and re-replicates everything, right?

Absolutely. Our actual durability is far, far, far higher than this. We believe that nobody should ever worry about losing their data, and that's the peace of mind we provide.


> Instead of relying on a single server to store all data, we can replicate it onto several computers. One common way of doing this is to have one server act as the primary, which will receive all write requests. Then 2 or more additional servers get all the data replicated to them. With the data in three places, the likelihood of losing data becomes very small.

Is my understanding correct that this means writes are propagated asynchronously from the primary to the secondary servers (without waiting for an "ACK" from them)?


For PlanetScale Metal, we use semi-sync replication. The primary needs to get an ack from at least one replica before committing.
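
A minimal sketch of what that means, purely for illustration (a toy model, not PlanetScale's or MySQL's actual semi-sync implementation; the replica names and delays are made up): the primary ships the write to every replica and blocks its commit until the first ack arrives, while the remaining replicas catch up asynchronously.

    import random
    import threading
    import time

    # Toy semi-synchronous replication: the primary sends the write to all
    # replicas and acknowledges the commit only once at least one replica
    # has confirmed receipt. Names and delays are illustrative.
    def replicate(write, replica_id, first_ack):
        time.sleep(random.uniform(0.001, 0.005))  # simulated network + relay-log write
        first_ack.set()                           # the first ack unblocks the commit

    def commit(write, replicas=("replica-1", "replica-2")):
        first_ack = threading.Event()
        for r in replicas:
            threading.Thread(target=replicate, args=(write, r, first_ack)).start()
        first_ack.wait()       # semi-sync: wait for >= 1 replica ack
        return "committed"     # the write now exists on at least two machines

    print(commit({"sql": "INSERT ..."}))

The key property is that an acknowledged write exists on at least two machines before the client hears "committed".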


Soo... We have a network hop after all?


For writes, yes. But what if your workload is 90% reads?
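
A rough way to see why the read/write mix matters: weight the extra replica-ack hop by the write fraction. The latency figures below are assumed placeholder numbers, not measurements.

    # Average per-query latency when only writes pay the replication hop.
    # All latency figures here are assumptions for illustration.
    local_ms = 0.5            # read served from local storage / buffer pool
    write_ms = 0.5 + 1.0      # local work plus ~1 ms assumed intra-AZ replica ack

    for write_fraction in (0.10, 0.25, 0.50):
        avg = (1 - write_fraction) * local_ms + write_fraction * write_ms
        print(f"{write_fraction:.0%} writes -> avg {avg:.2f} ms per query")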


It makes a lot of sense for read-heavy workloads, for sure!

I was just trying to get a better understanding of what is happening under the hood :)


Is that ack sent once the request is received or once it is stored on the remote disk?


Kudos to whoever patiently & passionately built these. Slightly off topic: this is a great perspective for building realistic coursework for middle and high school students. I'm sure they'd learn faster and better with visuals like these.


It would be incredibly cool if this were used in high school curricula.


1 in a million is the probability that all three servers die within the same month, without swapping out the broken ones. So at some point in the month all the data is gone.

If you replace the failed (or failing) node right away, the failure probability goes down greatly. What you'd need then is the probability of a node going down within a 30-minute time span, assuming the migration can be done in 30 minutes.

(I hope this calculation is correct.)

If the failure probability is 1% per month, then per 30-minute window it is 1% / (43800/30) = 1% / 1460, i.e. about 6.8e-6 per node.

For all three instances in the same window: (6.8e-6)^3 ≈ 3.2e-16.

There are 1460 such windows in a month, so over the whole month: 1460 * 3.2e-16 = 1e-6 / 1460^2 = 1e-6 / 2,131,600 ≈ 4.7e-13.

So roughly one in 2 trillion that all three servers go down within the same 30-minute time span somewhere in the month. After the 30 minutes another replica will already be available, making the data safe.

I'm happy to be corrected. The probability course was some years back :)
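
The same arithmetic as a short Python sketch, so the assumptions (1% per node per month, 30-minute replacement, independent failures) are easy to tweak. The inputs are the assumed numbers from the comment above, not measured failure rates.

    # Probability that all three replicas fail inside the same replacement
    # window at some point during a month, assuming independent failures.
    # Inputs are assumptions (1% per node per month, 30-minute replacement).
    p_month = 0.01                  # per-node failure probability per month
    window_min = 30                 # time to bring up a fresh replica
    windows = 43_800 // window_min  # ~43,800 minutes per month -> 1,460 windows

    p_window = p_month / windows            # per node, per window
    p_all_three = p_window ** 3             # all three in one given window
    p_any_window = windows * p_all_three    # at some point during the month

    print(f"per node, per window: {p_window:.1e}")
    print(f"all three in the same window, over a month: {p_any_window:.1e} "
          f"(about 1 in {1 / p_any_window:,.0f})")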


One thing I will suggest: you're assuming failures are uncorrelated and have an equally weighted chance per unit of time.

Neither is a good assumption from my experience. Failures being correlated to any degree greatly increases the chances of what the aviation world refers to as “the holes in the Swiss cheese lining up”.


You are 100% correct. It heavily depends on where the servers reside. This was just a rough estimate for the case where the failures are unrelated.
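
To make the correlation point concrete, here is a toy comparison of purely independent failures versus a simple common-cause model (a shared rack, power feed, or bad firmware rollout that can take out all three nodes at once). All probabilities are made-up illustrative numbers.

    # Toy comparison: independent node failures vs. a common-cause event.
    # All probabilities are made-up illustrative numbers.
    p_node = 1e-5      # per-node failure probability in some window
    p_common = 1e-6    # probability of one shared event killing all three

    independent = p_node ** 3
    with_common_cause = p_common + (1 - p_common) * p_node ** 3

    print(f"independent only:  {independent:.1e}")
    print(f"with common cause: {with_common_cause:.1e} "
          f"({with_common_cause / independent:,.0f}x more likely)")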



