
For (1), I definitely want my production HA databases to fsync every write.

Of course configurability is good (e.g. for automated fast tests you don't need it), but safe is a good default here, and if somebody sets up a Kubernetes cluster, they can and should afford enterprise SSDs where fsync of small data is fast and reliable (e.g. 1000 fsyncs/second).



Most database systems are designed to amortize fsyncs when they have high write throughput. You want every write to be fsync'd, but you don't want to actually call fsync for each individual write operation.
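A minimal sketch of that idea (group commit, with made-up names rather than any particular database's internals): queue records in memory, write them out together, and call fsync once per batch, so every write is durable once the flush returns but the fsync cost is shared across the batch. Real systems usually trigger the flush on a timer or a batch-size threshold rather than explicitly.

    // Group commit sketch: many logical writes share one fsync.
    // Names are illustrative, not from any specific database.
    package main

    import "os"

    type walWriter struct {
        f       *os.File
        pending [][]byte
    }

    // Append queues a record; durability is deferred until flush().
    func (w *walWriter) Append(rec []byte) {
        w.pending = append(w.pending, rec)
    }

    // flush writes every queued record, then issues exactly one fsync,
    // so the fsync cost is amortized over the whole batch.
    func (w *walWriter) flush() error {
        for _, rec := range w.pending {
            if _, err := w.f.Write(rec); err != nil {
                return err
            }
        }
        w.pending = w.pending[:0]
        return w.f.Sync() // one fsync durably commits the batch
    }

    func main() {
        f, err := os.OpenFile("wal.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
        if err != nil {
            panic(err)
        }
        defer f.Close()

        w := &walWriter{f: f}
        for i := 0; i < 10000; i++ {
            w.Append([]byte("record\n"))
        }
        if err := w.flush(); err != nil { // 10,000 writes, one fsync
            panic(err)
        }
    }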


> I definitely want my production HA databases to fsync every write.

I didn't! Our business DR plan only called for us to restore to an older version with short downtime, so fsync on every write on every node was a reduction in performance for no actual business purpose or benefit. IIRC we modified our database to run off a ramdisk and snapshot every few minutes, which ran way better and had no impact on our production recovery strategy.
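A rough sketch of that pattern, assuming etcd and its clientv3 Snapshot API (the endpoint, interval, and output path below are illustrative assumptions, not our actual setup): run the data directory on a ramdisk and periodically stream a consistent snapshot to durable storage.

    // Periodically snapshot an etcd cluster whose data dir lives on a
    // ramdisk, so a restore point exists even though individual writes
    // are not fsync'd to durable disk.
    package main

    import (
        "context"
        "io"
        "log"
        "os"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"127.0.0.1:2379"}, // assumed local member
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()

        // Every few minutes, stream a snapshot of the keyspace to durable
        // storage; between snapshots the data lives only in the ramdisk.
        for range time.Tick(5 * time.Minute) {
            ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
            if err := saveSnapshot(ctx, cli, "/var/backups/etcd-snapshot.db"); err != nil {
                log.Printf("snapshot failed: %v", err)
            }
            cancel()
        }
    }

    func saveSnapshot(ctx context.Context, cli *clientv3.Client, path string) error {
        rc, err := cli.Snapshot(ctx)
        if err != nil {
            return err
        }
        defer rc.Close()

        f, err := os.Create(path)
        if err != nil {
            return err
        }
        defer f.Close()

        _, err = io.Copy(f, rc)
        return err
    }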

> if somebody sets up a Kubernetes cluster, they can and should afford enterprise SSDs where fsync of small data is fast and reliable

At the time, one of the problems I ran into was that public cloud regions in Southeast Asia had significantly worse SSDs that couldn't keep up. This was on one of the big three cloud providers.

1000 fsyncs/second is a tiny fraction of the real world production load we required. An API that only accepts 1000 writes a second is very slow!

Also, plenty of people run k8s clusters on commodity hardware. I ran one on an old gaming PC with a budget SSD for a while in my basement. Great use case for k3s.


> IIRC we modified our database to run off ramdisk

If a ramdisk is sufficient for your use case, why would you use a Raft-based distributed networked consensus database like etcd in the first place?

Its whole point is to protect you not only from power failure (like fsync does) but even from entire machine failure.

And the network is usually higher latency than a local SSD anyway.

> 1000 fsyncs/second is a tiny fraction of the real world production load we required

Kubernetes uses etcd for storing the cluster state. Do you update the cluster state more than 1000 times per second? Curious what operation needs that.


> Do you update the cluster state more than 1000 times per second? Curious what operation needs that.

Ooo, ooo, I know this one! It's for clusters with more than approximately 300 Nodes, as their status polling will actually knock over the primary etcd cluster. That's why kube-apiserver introduced this fun parameter: https://kubernetes.io/docs/reference/command-line-tools-refe... and the "but why" of https://kubernetes.io/docs/setup/best-practices/cluster-larg...

I've only personally gotten clusters up to about 550, so I'm sure the next scaling wall is hiding between that and the 5000-node limit they advertise.


> their status polling will actually knock over the primary etcd cluster

Polling the status sounds like a read-only operation, why would it trigger an fsync?


kubelet constantly checks in to report "I am alive, and here is the state of affairs" so that kube-apiserver can make informed decisions about whether a Pod needs attention to align with its desired state.

So, I don't know offhand what the formal reconciliation loop is called for that Node->apiserver handshake, but it is a polling operation in that the connection isn't left open all the time, and it is a status-reporting operation. That's how I ended up calling it "status polling." It is a write operation because whichever etcd is assigned to track the Events needs to be mutated to record their current state.

It actually could be that even a smaller cluster could get into this same situation if there were a bazillion tiny Pods (i.e. it's not directly correlated with the physical size of the cluster), but since the limits say one cannot have more than 110 Pods per Node, I'm guessing the Event magnification is easier to see with physically wider clusters.
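A rough illustration of why that check-in is a write rather than a read (this is not kubelet's actual code, just a hedged sketch with client-go; the node name and interval are made up): each heartbeat patches the Node's status subresource through kube-apiserver, and that mutation has to be persisted by etcd. Multiply one small patch by every Node every few seconds, plus Pod status updates and Events, and the steady-state write rate climbs quickly.

    // Sketch of a node heartbeat as a write against the API server.
    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    func main() {
        cfg, err := rest.InClusterConfig() // assumes running inside the cluster
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }

        nodeName := "worker-1" // hypothetical node name

        for range time.Tick(10 * time.Second) {
            // "I am alive": bump the Ready condition's heartbeat timestamp.
            patch := []byte(fmt.Sprintf(
                `{"status":{"conditions":[{"type":"Ready","status":"True","lastHeartbeatTime":%q}]}}`,
                time.Now().UTC().Format(time.RFC3339),
            ))
            // Every one of these patches is a mutation etcd must record.
            _, err := client.CoreV1().Nodes().Patch(
                context.TODO(), nodeName, types.StrategicMergePatchType,
                patch, metav1.PatchOptions{}, "status",
            )
            if err != nil {
                log.Printf("heartbeat patch failed: %v", err)
            }
        }
    }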


I see, that makes sense. Thanks for the details!


> And the network is usually higher latency than a local SSD anyway.

Until your writes outrun the disk and the page cache, and your disk I/O latency spikes. Linux used to be really bad at this when memory cgroups were involved, until a couple of years ago.

> If a ramdisk is sufficient for your use case, why would you use a Raft-based distributed networked consensus database like etcd in the first place?

Because at the time Kubernetes required it. If adapters to other databases had existed at the time, I would have tested them out.

> Kubernetes uses etcd for storing the cluster state. Do you update the cluster state more than 1000 times per second? Curious what operation needs that.

Steady state in a medium to large cluster exceeds that. At the time I was looking at these etcd issues, I was running fleets of 200+ node clusters and hitting a scaling wall around 200-300 nodes. These days I use a major Kubernetes service that does not use etcd behind the scenes, and my fleets can scale up to 15,000 nodes at the extreme end.



