A “former S3 engineer” commented on HN during the Glacier launch. Nothing verifiable, but it suggests some contrast with the idea that Glacier is optical-backed. [Also interesting: he suggests that S3 has an erasure-coding strategy.]
“They’ve optimized for low-power, low-speed, which will lead to increased cost savings due to both energy savings and increased drive life. I’m not sure how much detail I can go into, but I will say that they’ve contracted a major hardware manufacturer to create custom low-RPM (and therefore low-power) hard drives that can programmatically be spun down. These custom HDs are put in custom racks with custom logic boards all designed to be very low-power. The upper limit of how much I/O they can perform is surprisingly low – only so many drives can be spun up to full speed on a given rack. I’m not sure how they stripe their data, so the perceived throughput may be higher based on parallel retrievals across racks, but if they’re using the same erasure coding strategy that S3 uses, and writing those fragments sequentially, it doesn’t matter – you’ll still have to wait for the last usable fragment to be read.”
"he suggests that S3 has an erasure encoding strategy"
Apologies for the diversion - what does this mean? Does it mean that when an item is erased from S3, S3 "encodes" the data so that the next person who gets the same physical disk space can't read what was there before?
If you push erasure coding hard enough, you can use extremely unreliable components throughout the infrastructure while maintaining overall reliability. That introduces high latency, depending on the hardware specifics, but that's what Glacier is about.
Imo, they found a sweet spot of $/GB in a much higher-latency, lower-reliability region. (This is analogous to increasing the overall capacity of a communication channel by using optimally many unreliable low-power symbols with error correction, instead of a few highly reliable high-power ones.) Disk manufacturers already use this aggressively for soft failures within a disk, but they are obviously restricted on more systematic failures (i.e. if the whole drive fails there's nothing they can do).
If a single drive has failure probability P_failure, with many drives they can achieve reliable storage capacity close to a (1-P_failure) fraction of the raw capacity [1]. So all they have to do is seek the optimum
$/GB_opt = min over C,D [ C/(D*(1-P_failure(C))) ],
where C is the cost per drive and D is the drive capacity.
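The optimization above can be sketched numerically. All drive prices, capacities, and failure probabilities below are invented for illustration; the point is only the shape of the objective, C divided by the usable (failure-discounted) capacity:

```python
def cost_per_gb(cost_usd, capacity_gb, p_failure):
    """Effective $/GB once redundancy must compensate for failures:
    C / (D * (1 - P_failure(C)))."""
    return cost_usd / (capacity_gb * (1.0 - p_failure))

# (cost $, capacity GB, failure probability) -- hypothetical catalog entries
catalog = [
    (60.0, 1000, 0.02),   # decent consumer drive
    (45.0, 1000, 0.10),   # cheaper, much less reliable drive
    (120.0, 2000, 0.03),  # larger, pricier drive
]

# Seek the minimum of $/GB over the catalog
best = min(catalog, key=lambda d: cost_per_gb(*d))
for c, d, p in catalog:
    print(f"${c:>6.2f} / {d} GB, P_fail={p:.2f} -> ${cost_per_gb(c, d, p):.4f}/GB")
```

With these made-up numbers the unreliable drive wins ($0.050/GB vs $0.061 and $0.062), which is the comment's point: past some level of erasure coding, cheap-and-flaky beats expensive-and-solid on $/GB.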
If they're aggressive enough, I'd say the latency could justifiably be quite large. Say, for example, their block size is 55 disks with 10% redundancy. Then the retrieval time is governed by the slowest of the fastest 50 reads among the disks that haven't failed. Now there's a queue to read each drive, and the maximum across that many queues may well be large pretty much every time. Even if each queue is typically short, they would still have to quote a typical worst-case value, which I'd say is essentially the read time of an entire disk.
Now factor in that the disks are both large and crappy. A 120 MB/s read speed and a 1 TB disk size would imply ~8000 secs ≈ 2 hours. Factor in the possibility of differential pricing as you mentioned (even longer queues), and you may get an upper bound of 3 or 4 hours.
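The back-of-the-envelope above checks out; here it is made explicit (decimal units, i.e. 1 TB = 10^12 bytes, matching the comment's rough figures):

```python
# Worst-case retrieval cost if the answer is roughly "read an entire disk"
disk_bytes = 1 * 1000**4       # 1 TB
read_bps = 120 * 1000**2       # 120 MB/s sequential read

full_read_s = disk_bytes / read_bps
print(f"full-disk read: {full_read_s:.0f} s = {full_read_s / 3600:.1f} h")
# -> ~8333 s, a bit over 2 hours; queueing and pricing tiers push toward 3-4 h
```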
S3 has to handle random access patterns, which are the hardest to optimize.
I wouldn't be surprised if Glacier's latency isn't purely artificial so much as a deliberate design decision that lets the architecture be very different from S3's: pure streaming I/O, huge block sizes, far less concurrent access, etc. That much time allows really aggressive disk scheduling, and it'd make it much easier to do things like spread data across a large number of devices with wide geographic separation.
The current strategy for highly durable object stores is [largely] to have 3+ replicas of the data. If you utilized erasure coding [Cleversafe does this, others are working on it], you can achieve very high durability by spreading data over multiple disks or even datacenters such that if a few of them fail, you still can recover all of the information. But, like with RAID, there are many tradeoffs in how you configure such a system. Erasure coding makes sense for rarely-accessed data since IO is considerably more expensive in a networked setting.
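The replication-vs-erasure-coding tradeoff mentioned above is easy to quantify. The k and m values below are illustrative, not any vendor's actual configuration:

```python
def overhead(raw_units, usable_units):
    """Raw bytes stored per byte of user data."""
    return raw_units / usable_units

# 3+ replicas: every byte is stored three times, survives 2 copy losses
replication = overhead(raw_units=3, usable_units=1)

# A k+m erasure code: k data shards plus m parity shards,
# survives the loss of any m shards (assumed example: 10+4)
k, m = 10, 4
erasure = overhead(raw_units=k + m, usable_units=k)

print(f"3x replication: {replication:.1f}x overhead, tolerates 2 losses")
print(f"{k}+{m} erasure code: {erasure:.1f}x overhead, tolerates {m} losses")
```

The catch, as the comment notes, is I/O: rebuilding or reading erasure-coded data touches many shards across the network, which is why it suits rarely-accessed data.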
I read it as saying that the data has redundancy added to it so that some amount of unreadable ("erased") bits can be interpolated. CDs do this with Reed-Solomon encoding, and there's a decoder in the hardware.
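A minimal sketch of that "interpolate the erased part" idea. Real systems (CDs, S3-style stores) use Reed-Solomon codes that tolerate many erasures; this toy version uses a single XOR parity shard, the simplest erasure code, which can rebuild any one lost shard:

```python
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_shards):
    """Append one parity shard: the XOR of all data shards."""
    return data_shards + [reduce(xor_bytes, data_shards)]

def recover(shards, lost_index):
    """Rebuild one erased shard by XORing all surviving shards together."""
    survivors = [s for i, s in enumerate(shards) if i != lost_index]
    return reduce(xor_bytes, survivors)

shards = encode([b"abcd", b"efgh", b"ijkl"])   # 3 data shards + 1 parity
assert recover(shards, lost_index=1) == b"efgh"  # erase shard 1, rebuild it
```

The same principle, generalized with polynomial arithmetic instead of plain XOR, is what lets a Reed-Solomon (k, m) code survive m lost fragments rather than one.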
https://news.ycombinator.com/item?id=4416065