At the Flash Memory Summit in Santa Clara, CA, in August 2016, SoftRAID’s VP of Engineering, Tim Standing, talked about the challenges of predicting SSD failure.
September 9, 2016
Tim started off talking about SoftRAID’s efforts to make storage more reliable: “In 2010, we added a feature for predicting disk failure, which used the results from a Google study on 100,000 rotating media disk drives. This feature can warn users weeks or months before a disk fails. The feature predicts about 75% of disk drive failures; the other 25% of the failures happen without any warning.”
SoftRAID’s success in predicting disk failure in rotating media spurred Tim and his team to develop a similar system for SSDs: “After we saw the power of failure prediction, we wanted to develop the same feature for SSDs.”
For those of us who don’t know why SSDs can’t use the same process as rotating media for failure prediction, Tim explains: “When disks with rotating media are about to fail, they start reallocating sectors. We can use the reallocated sector count as an indicator for impending disk failure; the more sectors reallocated, the nearer the disk is to failure. Unfortunately, this technique doesn’t work with SSDs because SSDs reallocate sectors during everyday use—every time a flash memory block stops working, the controller reallocates another block of flash memory to replace it. It’s not unusual for a healthy SSD to have thousands of reallocated sectors.”
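The rotating-media heuristic Tim describes can be sketched in a few lines of Python. This is only an illustration of the idea, not SoftRAID’s implementation: the function name and both threshold values are hypothetical, chosen to show the shape of the check rather than taken from the Google study or from SoftRAID.

```python
# Hypothetical thresholds, for illustration only; not SoftRAID's or Google's values.
ELEVATED_THRESHOLD = 1    # any reallocated sector on a rotating disk is worth noting
HIGH_THRESHOLD = 100      # many reallocated sectors: treat failure as likely


def hdd_failure_risk(reallocated_sectors: int) -> str:
    """Classify a rotating-media disk by its SMART reallocated sector count.

    The more sectors reallocated, the nearer the disk is assumed to be to failure.
    """
    if reallocated_sectors < 0:
        raise ValueError("reallocated_sectors must be non-negative")
    if reallocated_sectors >= HIGH_THRESHOLD:
        return "high"
    if reallocated_sectors >= ELEVATED_THRESHOLD:
        return "elevated"
    return "normal"
```

Applied to an SSD, the same check would flag a perfectly healthy drive as “high” risk, since thousands of reallocated sectors are normal there. That is the reason the technique does not carry over.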
So SSDs needed a different technique for failure prediction, and Tim thought his team had found it: “We were excited to discover that SSDs contain a Media Wearout Indicator as one of their SMART parameters.”
Tim then described how the Media Wearout Indicator works: “Remember that SSDs have 10–20% extra flash memory (a 100 GB SSD actually contains 110–120 GB of flash memory). This extra flash memory is used to replace flash memory blocks that wear out as the SSD is used. The Media Wearout Indicator displays the amount of extra flash memory still available in an SSD. It goes from 100% when the SSD is new down to 0% when all this extra flash memory has been used up.”
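As a rough worked example of the arithmetic in that description, here is a minimal Python sketch. It models the indicator exactly as Tim describes it, as the share of extra flash still available; the function names are hypothetical, and real drives may compute and report this SMART value differently from vendor to vendor.

```python
def spare_capacity_gb(advertised_gb: float, raw_gb: float) -> float:
    """Extra flash memory beyond the advertised capacity."""
    return raw_gb - advertised_gb


def media_wearout_indicator(spare_total_gb: float, spare_used_gb: float) -> float:
    """Percentage of extra flash still available: 100 when new, 0 when used up."""
    if spare_total_gb <= 0:
        raise ValueError("spare_total_gb must be positive")
    # Clamp so the result always stays within 0..100.
    used = min(max(spare_used_gb, 0.0), spare_total_gb)
    return 100.0 * (spare_total_gb - used) / spare_total_gb


# A 100 GB SSD built with 120 GB of flash has 20 GB of extra flash memory.
spare = spare_capacity_gb(advertised_gb=100, raw_gb=120)
# After 5 GB of that extra flash has replaced worn-out blocks:
indicator = media_wearout_indicator(spare, spare_used_gb=5)
```

Under this model, the drive in the example would report 75% once a quarter of its extra flash had been consumed.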
However, as Tim went on to explain, the Media Wearout Indicator didn’t turn out to be quite as useful as expected: “We had high hopes that this indicator would provide us with a predictive indicator for impending failure. Two years ago, we incorporated a mechanism for monitoring it into SoftRAID. Since then, we have seen no SSDs which have failed because all their extra flash memory has been consumed. All the SSDs we have seen fail have failed with the Media Wearout Indicator well above 80%. We are still trying to develop a reliable mechanism for predicting when SSDs will fail.”
After his talk, Tim spoke to Chris Bross of DriveSavers Data Recovery, Inc., who said that their experience was exactly the same. SSDs fail catastrophically and without warning, and the Media Wearout Indicator is not useful in predicting when they will fail.