A “former S3 engineer” commented on HN during the Glacier launch. Nothing verifiable, but it suggests some contrast with the idea that Glacier is optical-backed. [Also interesting: he suggests that S3 has an erasure-coding strategy.]
“They’ve optimized for low-power, low-speed, which will lead to increased cost savings due to both energy savings and increased drive life. I’m not sure how much detail I can go into, but I will say that they’ve contracted a major hardware manufacturer to create custom low-RPM (and therefore low-power) hard drives that can programmatically be spun down. These custom HDs are put in custom racks with custom logic boards all designed to be very low-power. The upper limit of how much I/O they can perform is surprisingly low – only so many drives can be spun up to full speed on a given rack. I’m not sure how they stripe their data, so the perceived throughput may be higher based on parallel retrievals across racks, but if they’re using the same erasure coding strategy that S3 uses, and writing those fragments sequentially, it doesn’t matter – you’ll still have to wait for the last usable fragment to be read.”
"he suggests that S3 has an erasure encoding strategy"
Apologies for the diversion - what does this mean? Does it mean that when an item is erased from S3, S3 "encodes" the data so that the next person who gets the same physical disk space can't read what was there before?
If you push erasure coding hard enough, you can use extremely unreliable components throughout the infrastructure while maintaining overall reliability. That introduces high latency, depending on the hardware specifics, but that's what Glacier is about.
Imo, they found a sweet spot of $/GB in a much higher-latency, lower-reliability region. (This is analogous to increasing the overall capacity of a communication channel by using optimally many unreliable low-power symbols with error correction, instead of a few highly reliable high-power ones.) Disk manufacturers already use this aggressively for soft failures within a disk, but they are obviously restricted on more systematic failures (i.e. if the whole drive fails there's nothing they can do).
If a single drive has failure probability P_failure, with many drives they can achieve reliable storage capacity close to a (1-P_failure) fraction of the raw capacity [1]. So all they have to do is seek the optimum
$/GB_opt = min over C,D [ C/(D*(1-P_failure(C))) ],
where C is the cost per drive and D is the drive capacity.
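The optimization above can be sketched numerically. All drive prices, capacities, and failure probabilities below are invented for illustration; the point is only the shape of the objective, C divided by the usable (failure-discounted) capacity:

```python
def cost_per_gb(cost_usd, capacity_gb, p_failure):
    """Effective $/GB once redundancy must compensate for failures:
    C / (D * (1 - P_failure(C)))."""
    return cost_usd / (capacity_gb * (1.0 - p_failure))

# (cost $, capacity GB, failure probability) -- hypothetical catalog entries
catalog = [
    (60.0, 1000, 0.02),   # decent consumer drive
    (45.0, 1000, 0.10),   # cheaper, much less reliable drive
    (120.0, 2000, 0.03),  # larger, pricier drive
]

# Seek the minimum of $/GB over the catalog
best = min(catalog, key=lambda d: cost_per_gb(*d))
for c, d, p in catalog:
    print(f"${c:>6.2f} / {d} GB, P_fail={p:.2f} -> ${cost_per_gb(c, d, p):.4f}/GB")
```

With these made-up numbers the unreliable drive wins ($0.050/GB vs $0.061 and $0.062), which is the comment's point: past some level of erasure coding, cheap-and-flaky beats expensive-and-solid on $/GB.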
If they're aggressive enough, I'd say the latency could justifiably be quite large. Say, for example, their block size is 55 disks with 10% redundancy. Then the retrieval time is governed by the slowest of the fastest 50 reads among the disks that haven't failed. Now there's a queue to read each drive, and the maximum across that many queues may well be large pretty much every time. Even if each queue is typically short, they would still have to quote a typical worst-case value, which I'd say is essentially the read time of an entire disk.
Now factor in that the disks are both large and crappy. A 120 MB/s read speed and a 1 TB disk size would imply ~8000 secs ≈ 2 hours. Factor in the possibility of differential pricing as you mentioned (even longer queues), and you may get an upper bound of 3 or 4 hours.
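The back-of-the-envelope above checks out; here it is made explicit (decimal units, i.e. 1 TB = 10^12 bytes, matching the comment's rough figures):

```python
# Worst-case retrieval cost if the answer is roughly "read an entire disk"
disk_bytes = 1 * 1000**4       # 1 TB
read_bps = 120 * 1000**2       # 120 MB/s sequential read

full_read_s = disk_bytes / read_bps
print(f"full-disk read: {full_read_s:.0f} s = {full_read_s / 3600:.1f} h")
# -> ~8333 s, a bit over 2 hours; queueing and pricing tiers push toward 3-4 h
```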
S3 has to handle random access patterns, which are the hardest to optimize.
I wouldn't be surprised if Glacier's latency isn't purely artificial so much as a deliberate design decision that lets the architecture be very different from S3's: pure streaming I/O, huge block sizes, far less concurrent access, etc. That much time allows really aggressive disk scheduling, and it'd make it much easier to do things like spread data across a large number of devices with wide geographic separation.
The current strategy for highly durable object stores is [largely] to have 3+ replicas of the data. If you utilized erasure coding [Cleversafe does this, others are working on it], you can achieve very high durability by spreading data over multiple disks or even datacenters such that if a few of them fail, you still can recover all of the information. But, like with RAID, there are many tradeoffs in how you configure such a system. Erasure coding makes sense for rarely-accessed data since IO is considerably more expensive in a networked setting.
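The replication-vs-erasure-coding tradeoff mentioned above is easy to quantify. The k and m values below are illustrative, not any vendor's actual configuration:

```python
def overhead(raw_units, usable_units):
    """Raw bytes stored per byte of user data."""
    return raw_units / usable_units

# 3+ replicas: every byte is stored three times, survives 2 copy losses
replication = overhead(raw_units=3, usable_units=1)

# A k+m erasure code: k data shards plus m parity shards,
# survives the loss of any m shards (assumed example: 10+4)
k, m = 10, 4
erasure = overhead(raw_units=k + m, usable_units=k)

print(f"3x replication: {replication:.1f}x overhead, tolerates 2 losses")
print(f"{k}+{m} erasure code: {erasure:.1f}x overhead, tolerates {m} losses")
```

The catch, as the comment notes, is I/O: rebuilding or reading erasure-coded data touches many shards across the network, which is why it suits rarely-accessed data.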
I read it as saying that the data has redundancy added to it so that some amount of unreadable ("erased") bits can be interpolated. CDs do this with Reed-Solomon encoding, and there's a decoder in the hardware.
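A minimal sketch of that "interpolate the erased part" idea. Real systems (CDs, S3-style stores) use Reed-Solomon codes that tolerate many erasures; this toy version uses a single XOR parity shard, the simplest erasure code, which can rebuild any one lost shard:

```python
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_shards):
    """Append one parity shard: the XOR of all data shards."""
    return data_shards + [reduce(xor_bytes, data_shards)]

def recover(shards, lost_index):
    """Rebuild one erased shard by XORing all surviving shards together."""
    survivors = [s for i, s in enumerate(shards) if i != lost_index]
    return reduce(xor_bytes, survivors)

shards = encode([b"abcd", b"efgh", b"ijkl"])   # 3 data shards + 1 parity
assert recover(shards, lost_index=1) == b"efgh"  # erase shard 1, rebuild it
```

The same principle, generalized with polynomial arithmetic instead of plain XOR, is what lets a Reed-Solomon (k, m) code survive m lost fragments rather than one.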
https://news.ycombinator.com/item?id=4416065