I wonder if you could work around this problem by attaching two EBS volumes to each host and writing to both. You'd have the OS report the write as successful as soon as either drive reported success. For reads, you could alternate between drives for double the read throughput during happy times, but quickly detect when one drive is slow and reroute those reads to the other drive.
We could call this RAID -1.
You'd need some accounting to ensure that the drives are eventually consistent, but based on the graphs of the issue it seems like you could keep the queue of pending writes in RAM for the duration of the slowdown.
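To make the accounting concrete, here's a rough sketch of what I have in mind. It's a toy in-memory model, not anything that talks to real block devices: `Drive`, `MirroredPair`, and the `slow` flag are all names I made up, and I'm assuming some separate latency monitor flips `slow` for you. Real hardware would also need the queue to be bounded and to survive a crash, which this ignores.

```python
from collections import deque


class Drive:
    """Hypothetical in-memory stand-in for one EBS volume."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}    # block number -> data
        self.slow = False   # assumed to be set by an external latency monitor


class MirroredPair:
    """Acknowledge a write once one drive has it; replay the rest later."""

    def __init__(self, a, b):
        self.drives = [a, b]
        # Per-drive queue of writes that have not landed yet, kept in RAM.
        self.pending = {a.name: deque(), b.name: deque()}
        self.next_read = 0  # index of the drive the next read tries first

    def _catch_up(self, drive):
        # Replay queued writes in their original order.
        queue = self.pending[drive.name]
        while queue:
            block, data = queue.popleft()
            drive.blocks[block] = data

    def write(self, block, data):
        if all(d.slow for d in self.drives):
            raise OSError("both drives are slow; no drive can acknowledge")
        for d in self.drives:
            if d.slow:
                # Don't wait on the slow drive; remember the write instead.
                self.pending[d.name].append((block, data))
            else:
                self._catch_up(d)
                d.blocks[block] = data
        return True  # at least one drive holds the data

    def read(self, block):
        # Alternate between drives, skipping whichever one is slow.
        for _ in range(2):
            d = self.drives[self.next_read]
            self.next_read = 1 - self.next_read
            if not d.slow:
                self._catch_up(d)
                return d.name, d.blocks.get(block)
        raise OSError("both drives are slow")

    def drain(self):
        # Call once a slowdown ends to bring the lagging drive up to date.
        for d in self.drives:
            if not d.slow:
                self._catch_up(d)
```

While one drive is marked slow, writes land on the other and pile up in `pending`; once the flag clears, `drain()` (or the next read or write that touches that drive) replays them in order so the two copies converge.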
Of course, it's quite likely that there will be correlated failures, as the two EBS volumes might end up on the same SAN and set of physical drives. Also it doesn't seem worth paying double for this.