I think this must have been nature’s way of preparing me for the grim reality of hard drive failure. At about the same time, my work as an IT journalist was starting to fill the house with computers. I had three or four constantly running machines by the time Windows appeared as a serious commercial effort in 1987, and that was enough to ensure that my hard drives would give up the ghost from time to time. I learnt an important lesson: hard drives are guaranteed to fail. It says so on the casing – that’s what the Mean Time To Failure (MTTF) figure means.
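For anyone who wants to turn the number on the casing into something more tangible, here is a rough sketch that converts an MTTF figure into the chance of a drive dying within a year of continuous running. It assumes a constant failure rate (the simple exponential model), which is a textbook simplification rather than anything a manufacturer promises. The one-million-hour MTTF is a hypothetical example, not a figure from any particular drive.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous running


def annual_failure_probability(mttf_hours: float) -> float:
    """Chance that a drive fails within one year of 24/7 use.

    Assumption: failures follow an exponential model with a constant
    rate of 1 / MTTF. Real drives do not behave this neatly.
    """
    if mttf_hours <= 0:
        raise ValueError("MTTF must be a positive number of hours")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


# Hypothetical MTTF of one million hours, chosen only for illustration.
p = annual_failure_probability(1_000_000)
print(f"Chance of failure within a year: {p:.2%}")
```

Under these assumptions even a very large MTTF leaves a small but non-zero chance of failure every year, which is the point: the figure describes when drives fail, not whether they do.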
In those days, when a drive went down you really knew about it. The head tearing into the platter filled the room with a banshee wail that scared the life out of you. Today they just click, or not even that. Last week’s drive failed with hardly a sigh. But fail it did, sure as death and taxes.
The crash wasn’t completely unexpected. As I found with the light bulbs, as the number of computers you own increases, so does the number of hard drives you wave goodbye to. Last time I counted, I was spinning something like a couple of dozen drives on a daily basis. Regular readers will know that my key data is out there in the cloud being professionally cared for by Google, so my documents were safe – but a boot drive, with all those carefully tuned applications, takes time to restore. And of course it had to be a boot drive that went down. And in the new Cosmos machine that I told you about in Issue 282.
“It is estimated that over 90 per cent of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime.” That’s the opening of a white paper called Failure Trends in a Large Disk Drive Population that Google published a couple of years ago. Its conclusions rattled industry confidence in figures like MTTF, and contradicted conventional wisdom on, for example, the relationship between drive failure and ambient temperature. It also threw doubt on the supposed advantage of paying over the odds for ‘industrial-strength’ fibre channel and SCSI drives.
Amid all this uncertainty, one statistic does stand out from the Google study: although the annualised failure rate (AFR) appears to be unaffected across a wide range of temperatures for the first two years of the drive’s life, in year three the AFR shoots up to 10 per cent for drives above 35°C and to a very nasty 15 per cent for drives hotter than 40°C. Assuming that the average life of a drive is five years, you might conclude that drives mostly need to be kept cool. According to the study, however, temperatures below 30°C are very bad news for new drives.
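To see what failure rates like these would mean for a houseful of machines, here is a back-of-envelope sketch using the figures quoted above (10 and 15 per cent) and a fleet of 24 drives, taking "a couple of dozen" literally. It assumes every drive fails independently and that all of them sit at the same rate, which is a deliberately pessimistic simplification; the function names are mine, for illustration only.

```python
def expected_failures(n_drives: int, afr: float) -> float:
    """Average number of drives expected to fail in one year."""
    return n_drives * afr


def prob_at_least_one_failure(n_drives: int, afr: float) -> float:
    """Chance that one or more drives fail in a year.

    Assumption: drives fail independently of one another.
    """
    return 1.0 - (1.0 - afr) ** n_drives


N_DRIVES = 24  # "a couple of dozen", taken literally

for afr in (0.10, 0.15):  # the year-three figures quoted in the column
    print(
        f"AFR {afr:.0%}: expect {expected_failures(N_DRIVES, afr):.1f} failures a year, "
        f"{prob_at_least_one_failure(N_DRIVES, afr):.0%} chance of at least one"
    )
```

On these assumptions, losing at least one drive a year is close to a certainty for a fleet this size, whatever the exact rate turns out to be.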
But statistics can be misleading. I’m wondering if those three-year-old drives, evidently of an older generation than the one- and two-year-old drives that apparently shrugged off high temperatures, were failing at 35°C and more simply because they were earlier technology. This question doesn’t appear to have occurred to the authors of the paper, and it would take a follow-up to clear up the point. Are newer drives really more temperature-resilient? If so, we can generalise Google’s conclusion that “at moderate temperature ranges it’s likely that there are other effects which affect failure rates much more strongly than temperatures do”, a conclusion that seems to be geared towards enterprises that routinely pension off their drives within a couple of years. This optimistic scenario says that with drives made after 2005, this temperature tolerance may hold true for the whole life of the drive.
SMART, the technology in all modern drives that keeps you informed of a drive’s general health and running temperature, is supposed to be able to warn you when a drive is about to fail. Alas, the Google study debunks that as another myth. When drives fail, they’re likely simply to fail. But SMART does at least let you monitor the temperature. My replacement boot drive is running at 47°C, and it’s getting on for a couple of years old. Hmm.
The optimistic scenario says don’t worry – but I’m going to install a fan on top of my drive stack anyway. And I’ll be keeping a close eye on the light bulbs, too.
-PCPLUS