
If you can detect EBS failure better than Amazon - I'd be selling this to them tomorrow.


They probably detect this. That's why the problem is solved after one to ten minutes, according to the article. There's probably nothing they can do that wouldn't stress the disks more.


Probably sometimes, at least if we trust the article:

> In our experience, the documentation is accurate: sometimes volumes pass in and out of their provisioned performance in small time windows:

What AWS considers a "small degradation" is sometimes "100% down" for its users, though. Look at any previous "AWS is down/having problems" HN comment thread and you'll see there tends to be a huge mismatch between what AWS considers "not working" and what users of AWS consider "not working".

Doesn't surprise me people want better tooling than what AWS themselves offer.


Author here - it's not that we're detecting failure better than they are (though certainly, we might be able to do it as fast as anyone else) - it's what you do afterwards that matters.

Being able to fail over to another database instance backed by a different volume in a different zone minimizes the impact. This is well in line with AWS best practices; it's just arduous to do quickly and at scale.
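
For illustration only, here's a minimal sketch of what a detect-then-fail-over loop could look like, assuming boto3's CloudWatch client, the standard AWS/EBS volume metrics, and a hypothetical promote_replica() helper standing in for whatever actually promotes a standby in another zone (the volume ID and threshold are made up for the example, not the author's tooling):

    # Hypothetical sketch: poll CloudWatch for per-op EBS read latency and
    # trigger a failover when it exceeds a threshold. Volume ID, threshold,
    # and promote_replica() are placeholders.
    import datetime
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    VOLUME_ID = "vol-0123456789abcdef0"   # assumed volume backing the primary DB
    LATENCY_THRESHOLD_S = 0.05            # 50 ms per read op, chosen arbitrarily

    def avg_read_latency(volume_id, minutes=5):
        """Approximate per-op read latency as VolumeTotalReadTime / VolumeReadOps."""
        end = datetime.datetime.utcnow()
        start = end - datetime.timedelta(minutes=minutes)
        sums = {}
        for metric in ("VolumeTotalReadTime", "VolumeReadOps"):
            resp = cloudwatch.get_metric_statistics(
                Namespace="AWS/EBS",
                MetricName=metric,
                Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
                StartTime=start,
                EndTime=end,
                Period=300,
                Statistics=["Sum"],
            )
            sums[metric] = sum(dp["Sum"] for dp in resp["Datapoints"])
        ops = sums["VolumeReadOps"]
        return sums["VolumeTotalReadTime"] / ops if ops else 0.0

    def promote_replica():
        """Placeholder: promote a standby backed by a different volume in another AZ."""
        raise NotImplementedError

    if avg_read_latency(VOLUME_ID) > LATENCY_THRESHOLD_S:
        promote_replica()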


It's not just failure detection. A write to EBS is at least two additional network hops: the first to get to the machine for the initial write, and the second for that write to be propagated to another machine for durability. Multiply this by the number of IOPS required to complete a database transaction.
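
To put rough numbers on that (all figures below are illustrative assumptions, not EBS measurements):

    # Back-of-the-envelope illustration of how per-write network latency compounds.
    # Every number here is an assumption for the example.
    network_hops_per_write = 2        # client -> EBS server, EBS server -> replica
    latency_per_hop_s = 0.0005        # assume ~0.5 ms per hop
    sync_writes_per_txn = 20          # assume 20 dependent writes per transaction

    overhead_per_write_s = network_hops_per_write * latency_per_hop_s
    overhead_per_txn_s = overhead_per_write_s * sync_writes_per_txn

    print(f"added latency per write: {overhead_per_write_s * 1000:.1f} ms")
    print(f"added latency per transaction: {overhead_per_txn_s * 1000:.1f} ms")
    # -> 1.0 ms per write, 20.0 ms per transaction under these assumptions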


Why? They wouldn't buy it.

No offence to anyone who has drunk the AWS Kool-Aid, but honestly, they're making a product, *not* foundational infrastructure.

This might feel like a jarring point.

When you think of foundational infrastructure in the real world, you think of bridges and plumbing and the cost of building such things, which is stupidly high.

Yet when those things get grossly privatised they end up like Lagos, Nigeria[0].

Because there is a difference between delivering something that works most of the time and something that works all of the time. The major point being: one of them is obscenely profitable, and the other might not even break even, which is why governments usually take on the cost of foundational infrastructure: they never expect to break even.

[0]: https://ourworld.unu.edu/en/water-privatisation-a-worldwide-...


I think the more interesting part here (besides the fact that AWS SLAs sneakily screw you over and make it hard to guarantee static stability) is the remediation aspect.

This is a consistent letdown across most AWS products: they build the undifferentiated 90% of a thing, but some PM refuses to admit their product isn't complete, so instead of offering optional feature flags or CDK samples or something to help with that last 10%, they bury it deep in the docs and try not to draw attention to it. Then when you open a support case they tell you to pound sand, or maybe suggest rearchitecting to avoid the foot-gun they never told you about.


Or in this case, to spend far more $$ on io2.



