Minimizing Hard Disk Drive Failure and Data Loss/Detecting an Impending Drive Failure

Diagnosis and repair

Operating system tools such as chkdsk on Windows and fsck on Linux can be run routinely, perhaps once every three months, to check the integrity of the file system on the drive and repair errors where possible. On Linux, smartctl and badblocks complement these by checking the drive itself rather than the file system. Third-party scanning tools are available as well. In addition to routine scans, a scan should be run immediately if problems are experienced when working with files on the drive. Typical examples of such problems are a hang or a CRC error when moving files.

A diagnostic check can also include a bad sector scan. Scanning a large drive for bad sectors can take from several hours to a few days, depending on the drive and the utility used, but it is still recommended. Several bad sectors, or a growing number of them, can indicate poor drive health. Such a drive can be replaced to avoid risking further data loss.

S.M.A.R.T.

S.M.A.R.T. reliability data can be queried from drives using various S.M.A.R.T. tools. This data can be used as an estimate of drive health. Based on the data, if the software reports the drive health as being unacceptably low, the drive can be preemptively replaced.
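
As a rough illustration, on Linux the attributes can be read with `smartctl -A` from smartmontools and parsed into a dictionary. The sketch below assumes the classic ATA attribute table with ten whitespace-separated columns and the raw value last; the function names and the sample text are illustrative only, and drives that report data in another layout (for example NVMe drives) are not handled.

```python
import subprocess


def parse_smart_attributes(text):
    """Parse the ATA attribute table printed by `smartctl -A`.

    Returns {attribute_id: {"name", "value", "worst", "thresh", "raw"}}.
    Assumes the classic ten-column layout; other lines are ignored.
    """
    attributes = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # Skip rows whose VALUE/WORST/THRESH columns are not plain numbers.
        if not all(f.isdigit() for f in fields[3:6]):
            continue
        attributes[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            # The raw column may carry trailing notes; keep only its first token.
            "raw": fields[9],
        }
    return attributes


def read_smart_attributes(device):
    """Run smartctl on a device such as /dev/sda (usually requires root)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_smart_attributes(result.stdout)


# Illustrative sample text, not output captured from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
"""

attrs = parse_smart_attributes(SAMPLE)
```

Keeping the parser separate from the `smartctl` call means it can be tried on saved output without access to a drive.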

Software applications exist to automatically monitor S.M.A.R.T. data on a schedule. Such an application can alert the user if a minimum reliability threshold is crossed, and may therefore be preferred over one that queries the S.M.A.R.T. data only on manual request. Free software applications for Windows with this functionality include PassMark DiskCheckup and Acronis Drive Monitor.

Software also exists to interpret S.M.A.R.T. data and assign a numerical percentage value to a drive's health. Free software applications for Windows with this functionality include SpeedFan (when used in conjunction with its online analysis feature) and Acronis Drive Monitor.

As with temperature data, the S.M.A.R.T. data provided by a drive may not be readable for various reasons. In particular, S.M.A.R.T. data is not readable from the majority of drives connected externally via USB or FireWire. This is because the protocol bridge between the USB and ATA protocols does not seem to pass S.M.A.R.T. data through.

Relevant parameters

While S.M.A.R.T. has several parameters, a small subset of them is strongly associated with failure probability. These parameters are scan errors, reallocation counts, offline reallocation counts, and probational counts. The critical threshold for each of these four parameters is one, so even a single recorded occurrence is significant.


Parameter                      Increase in likelihood of failure within 60 days of reaching the critical threshold of one
Scan errors                    39 times*
Reallocation counts            14 times
Offline reallocation counts    21 times
Probational counts             16 times

*A scan error in a young drive increases its probability of failure more dramatically than it does in an older drive. While drives with just one scan error are more likely to fail than those with none, drives with multiple scan errors fail even more quickly.
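
The table can be applied mechanically, as in the following minimal sketch. The multipliers and the threshold of one are taken from the table; the dictionary keys, the function name, and the sample counts are hypothetical, and mapping these four parameters onto a particular tool's attribute names is left to the reader.

```python
# Multipliers copied from the table above; the critical threshold is one.
FAILURE_MULTIPLIER = {
    "scan_errors": 39,
    "reallocation_counts": 14,
    "offline_reallocation_counts": 21,
    "probational_counts": 16,
}
CRITICAL_THRESHOLD = 1


def critical_parameters(counts):
    """Return {parameter: multiplier} for each parameter at or above one."""
    return {
        name: FAILURE_MULTIPLIER[name]
        for name, count in counts.items()
        if name in FAILURE_MULTIPLIER and count >= CRITICAL_THRESHOLD
    }


# Hypothetical counts for a single drive.
drive_counts = {
    "scan_errors": 0,
    "reallocation_counts": 3,
    "offline_reallocation_counts": 0,
    "probational_counts": 1,
}

flagged = critical_parameters(drive_counts)
for name, multiplier in flagged.items():
    print(f"{name}: about {multiplier} times more likely to fail within 60 days")
```

The sketch reports each flagged parameter separately and does not attempt to combine the multipliers into a single figure, since the table gives no basis for doing so.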

Unfortunately, it is unlikely that S.M.A.R.T. data by itself can be used to develop an effective predictive model of individual drive failures. This is because a significant percentage of drives that fail have no S.M.A.R.T. errors whatsoever.

System event logs

A sample disk warning event as shown by Event Viewer in Windows XP

The operating system logs system events. Of particular interest are system events triggered by a disk or a disk controller. Only events logged as errors or warnings are of concern; those logged solely for informational purposes are not. Under Windows, events can be viewed using the built-in Event Viewer application. Under other operating systems, other applications may be available for viewing event logs.

The system event log can be monitored for the presence of disk-related errors and warnings. If any such events are logged, they can be checked to see which drive or device they pertain to. If similar events are suddenly logged by multiple drives in a short span of time, it is more likely that the problem is with a common controller card or motherboard component than with the individual drives.
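
The multiple-drive heuristic can be sketched as follows, assuming the disk errors and warnings have already been extracted from the event log as (timestamp, drive) pairs. The function name, the drive labels, and the ten-minute window are assumptions chosen for illustration; the text above does not define how short a "short span of time" is.

```python
from datetime import datetime, timedelta


def suspect_shared_component(events, window=timedelta(minutes=10)):
    """Return True if two or more distinct drives logged events within `window`.

    `events` is a list of (timestamp, drive) tuples for disk errors or warnings.
    """
    ordered = sorted(events)
    for i, (start, _) in enumerate(ordered):
        # Collect every drive that logged an event within `window` of this one.
        drives = {drive for when, drive in ordered[i:] if when - start <= window}
        if len(drives) >= 2:
            return True
    return False


# Hypothetical events: two drives report errors three minutes apart.
clustered = [
    (datetime(2024, 1, 1, 12, 0), "disk0"),
    (datetime(2024, 1, 1, 12, 3), "disk1"),
]
# Hypothetical events: a single drive reports errors repeatedly.
single_drive = [
    (datetime(2024, 1, 1, 12, 0), "disk0"),
    (datetime(2024, 1, 1, 12, 5), "disk0"),
]

print(suspect_shared_component(clustered))
print(suspect_shared_component(single_drive))
```

A True result only suggests that a shared controller or motherboard component is worth checking first; it does not rule out a fault in the drives themselves.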

Depending upon the event and its frequency, if the problem is with a drive, diagnostic software can be run. If the event continues to be logged, it can serve as a sign of an impending drive failure. The relevant device can be replaced if the error persists.