
I wonder if you could work around this problem by having two EBS volumes on each host and writing to both. You'd have the OS report the write as successful as soon as either drive reported success. With reads you could alternate between drives for double the read performance during happy times, but quickly detect when one drive is slow and reroute those reads to the other drive.

We could call this RAID -1.

You'd need some accounting to ensure that the drives are eventually consistent, but based on the graphs of the issue it seems like you could keep the queue of pending writes in RAM for the duration of the slowdown.
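To make the idea concrete, here is a rough in-memory sketch of that accounting. It does no real I/O: `Volume`, `MirroredPair`, and the `slow` flag are hypothetical names for illustration, not any EBS or kernel API. It assumes slowness is detected elsewhere and simply flipped on the volume, and that the pending queue fits in RAM.

```python
from collections import deque


class Volume:
    """Hypothetical stand-in for one EBS volume: an in-memory block map."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.slow = False  # set by the caller to simulate a latency spike


class MirroredPair:
    """Ack a write once any healthy volume has it; queue it for slow ones."""

    def __init__(self, a, b):
        self.volumes = [a, b]
        # Writes that a slow volume has not applied yet, kept in RAM.
        self.pending = {a.name: deque(), b.name: deque()}
        self.next_read = 0  # index of the volume to try first on the next read

    def _drain(self, volume):
        # Replay queued writes in order so the volume catches up.
        queue = self.pending[volume.name]
        while queue:
            block, data = queue.popleft()
            volume.blocks[block] = data

    def write(self, block, data):
        if all(v.slow for v in self.volumes):
            raise RuntimeError("both volumes are slow; nothing can ack the write")
        for v in self.volumes:
            if v.slow:
                self.pending[v.name].append((block, data))
            else:
                self._drain(v)  # keep ordering: older queued writes go first
                v.blocks[block] = data
        return True  # at least one healthy volume holds the data

    def read(self, block):
        # Alternate between volumes, skipping any that is currently slow.
        n = len(self.volumes)
        for i in range(n):
            idx = (self.next_read + i) % n
            v = self.volumes[idx]
            if not v.slow:
                self._drain(v)  # never serve stale data from a recovered volume
                self.next_read = (idx + 1) % n
                return v.blocks.get(block), v.name
        raise RuntimeError("both volumes are slow")
```

A real version would also have to bound the queue and decide what happens if the host crashes while writes are still pending, since those writes exist on only one volume at that point.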

Of course, it's quite likely that there will be correlated failures, as the two EBS volumes might end up on the same SAN and set of physical drives. Also it doesn't seem worth paying double for this.



The blog post mentioned correlated failures within an availability zone. You could likely reduce the exposure a bit, but you'd still run into this frequently enough to matter.


It's a lot of complexity and cost for a service that is already replicating 3 ways. 6x replication for a single node's disks seems excessive.



