Just out of curiosity, has anyone done any research into determining whether a drive which fails after X days/years has some properties in the first Y days that could be a signal for future failure?
I'm certain there are more recent statistics, but Google's "Failure Trends in a Large Disk Drive Population"[1] (2007) is a good start:
"In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity."
There is also a more recent open source dataset from Backblaze[2] that includes:
"Every day, the software that runs the Backblaze data center takes a snapshot of the state of every drive in the data center, including the drive’s serial number, model number, and all of its SMART data"
which forms the basis of an article correlating SMART data with drive failures at Backblaze[3].
The TL;DR answer is yes: some hard drive SMART values can indicate that failure is likely, but which ones matter varies by model, and they don't always show up before a failure.
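For anyone who wants to poke at this kind of data themselves, here is a minimal sketch of the comparison. It assumes daily snapshot rows with columns named `serial_number`, `model`, `failure` and `smart_5_raw` (reallocated sector count); those column names and the inline sample rows are my assumptions for illustration, not real data, so check them against the actual files before relying on this.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in the assumed shape of a daily drive snapshot.
SAMPLE = """date,serial_number,model,failure,smart_5_raw
2015-01-01,A1,ModelX,0,0
2015-01-01,A2,ModelX,0,8
2015-01-02,A1,ModelX,0,0
2015-01-02,A2,ModelX,1,24
2015-01-01,B1,ModelY,0,0
2015-01-02,B1,ModelY,1,0
"""

def summarize(rows):
    """Per model: drives seen, drives that failed, and failed drives
    that reported a nonzero smart_5_raw at any point in the data."""
    peak = defaultdict(int)  # (model, serial) -> highest smart_5_raw seen
    failed = set()           # (model, serial) of drives marked as failed
    for row in rows:
        key = (row["model"], row["serial_number"])
        raw = row["smart_5_raw"]
        peak[key] = max(peak[key], int(raw) if raw else 0)
        if row["failure"] == "1":
            failed.add(key)

    out = defaultdict(lambda: {"drives": 0, "failed": 0, "failed_with_warning": 0})
    for key, value in peak.items():
        stats = out[key[0]]
        stats["drives"] += 1
        if key in failed:
            stats["failed"] += 1
            if value > 0:
                stats["failed_with_warning"] += 1
    return dict(out)

summary = summarize(csv.DictReader(io.StringIO(SAMPLE)))
for model, stats in sorted(summary.items()):
    print(model, stats)
```

In the sample, the ModelX failure was preceded by a rising reallocated sector count while the ModelY failure showed nothing, which is the "varies by model, not always visible beforehand" caveat in miniature.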
Yeah, I was wondering whether there are measurables that could be correlated with failure before SMART kicks in, even something like date of year, location of manufacture, or the shipping route they took. :P
Generally the most common measurable is sector reallocation errors. These come from various random things going wrong at the wrong time: the disk allocates a new sector from the spare pool to replace one that has gone bad. In operation, our disks at Blekko pick up sector reallocation errors at a low rate that increases prior to total failure. Since our infrastructure is triply redundant (three disks hold a copy of every piece of data), we can simply reformat drives which develop sector errors. If you plot the time between sector errors over the life of the drive, it gets shorter rapidly as the drive nears complete failure. Sometimes, however, there is no warning; the drive simply fails. Consistent with my previous experience at Google, and at NetApp before that, there is a small rise in early failure (infant mortality), then a long tail toward a steep failure rate after about 10 years.
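The "time between sector errors gets shorter" signal is easy to sketch. This is a rough illustration, not what Blekko actually runs: the function names, the three-gap window, and the event dates are all made up for the example.

```python
from datetime import date

def intervals_between_events(event_dates):
    """Days between consecutive reallocation events, in chronological order."""
    ordered = sorted(event_dates)
    return [(b - a).days for a, b in zip(ordered, ordered[1:])]

def is_accelerating(gaps, window=3):
    """True when the last `window` gaps are strictly shrinking.

    The window size is an arbitrary choice for illustration.
    """
    if len(gaps) < window:
        return False
    recent = gaps[-window:]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))

# Hypothetical dates on which one drive logged a new reallocated sector.
events = [
    date(2014, 1, 10),
    date(2014, 7, 1),
    date(2014, 9, 15),
    date(2014, 10, 5),
    date(2014, 10, 12),
]

gaps = intervals_between_events(events)
print(gaps)                   # [172, 76, 20, 7]
print(is_accelerating(gaps))  # True
```

A check like this only helps for drives that give a warning at all; the ones that simply die with no sector errors never produce any gaps to measure.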