Operating System Design/File Systems/Faults

In engineering in general, fault tolerance refers to the ability of a system to continue functioning (though perhaps at a reduced level) after something has gone wrong. More specifically, in filesystem design, it refers to the ability of a filesystem to store data reliably, even in the face of hardware faults.

Many things can go wrong in a storage system, especially one with moving parts such as hard disk drives. A bad sector makes only a small part of the disk unusable, while a head crash can permanently ruin an entire disk. Other kinds of crash (software bugs, unexpected loss of electric power, etc.) generally cause no physical damage to the disk, but they often garble the sector(s) that were in the middle of a write at the time of the crash. Because many computer systems buffer writes in RAM and reorder them before the data is stored on non-volatile media, such crashes often leave the data on disk in an inconsistent state: some of a group of related writes may have reached the disk while others have not.
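The buffering problem can be illustrated from the application side. The following is a minimal sketch, not a description of how any particular filesystem works internally: it writes new contents to a temporary file, forces them out of the RAM buffers with fsync, and only then renames the temporary file over the original. The function name atomic_write is a hypothetical helper chosen for this example. On POSIX systems the rename replaces the old file in a single step, so a reader sees either the old or the new contents; how much of this survives a power loss still depends on the underlying filesystem and hardware.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data` (bytes) via a temporary file.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the OS
            os.fsync(tmp.fileno())   # ask the OS to push its buffers to the device
        os.replace(tmp_path, path)   # swap the new file into place
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # On POSIX, also sync the directory so the rename itself is recorded.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Calling atomic_write("settings.cfg", b"key=value\n") would replace settings.cfg in the current directory. The ordering is the point of the sketch: the data is made durable first, and the name is switched afterwards, so a crash between the two steps leaves the old file untouched.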

There are several ways of improving fault tolerance in filesystems:

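  • RAID, which stores data redundantly across several disks (through mirroring or parity) so that the failure of one disk does not lose data.
  • Journaling, which records pending changes in a log before applying them, so that the filesystem can be brought back to a consistent state after a crash.
  • Dealing with bad blocks, which prevents the use of corrupted disk sectors.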
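To make the redundancy idea behind RAID concrete, here is a small sketch of XOR parity, the technique used by parity-based RAID levels such as RAID 5. It is a toy model operating on in-memory byte strings rather than real disks, and the function names xor_blocks and reconstruct are hypothetical, chosen only for this example. The parity block is the byte-wise XOR of the data blocks; if any single block is lost, XOR-ing the surviving blocks with the parity recovers it.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte strings."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three "disks" each hold one data block; a fourth holds the parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second disk and rebuilding its contents.
recovered = reconstruct([data[0], data[2]], parity)
assert recovered == data[1]
```

This scheme tolerates the loss of any one block per stripe. Losing two blocks at once leaves too little information to rebuild either, which is why some RAID levels keep additional redundancy.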