Embedded Systems/Watchdog Timer

Watchdog Timer

In an embedded environment, far away from the lab, and far away from the programmers, engineers, and technicians, all sorts of things can go wrong, and the embedded system needs to be able to fix itself. Remember, once you close the box, and shrink-wrap your product, it's hard to get back in there and fix your mistakes.

In a typical computer system, cosmic rays flip a bit of RAM about once a month [citation needed]. If that happens to the wrong bit, the program can "hang", stuck in a short infinite loop.

Turning the power off then on again gets it unstuck. But how do you jiggle the power switch when you are in Paris, and your embedded system is in Antarctica? Or you are on Earth and your embedded system is near Neptune?

One of the most important tools of an embedded systems engineer is the Watch-Dog Timer (WDT). A WDT is a timer with a very long fuse (several seconds, usually).

The WDT counts down toward zero(*), like the big red numbers counting down on the bombs in the movies. Left to itself, eventually the counter will reach zero. When the counter reaches zero, the WDT resets the microcontroller (as if the power were turned off, then turned back on).
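The countdown described above can be sketched as a small software model. This is only an illustration: a real WDT is a hardware peripheral, and the names `wdt_model`, `wdt_init`, `wdt_tick`, and `wdt_run_unfed` are hypothetical, not taken from any vendor's library. The model also assumes that the countdown starts over from its full value after each reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a count-down watchdog.
 * Real hardware does this in a dedicated peripheral. */
typedef struct {
    uint32_t reload;   /* value the counter starts from, in ticks */
    uint32_t counter;  /* ticks remaining before a reset */
    uint32_t resets;   /* how many times the watchdog has fired */
} wdt_model;

void wdt_init(wdt_model *w, uint32_t timeout_ticks)
{
    assert(timeout_ticks > 0);
    w->reload = timeout_ticks;
    w->counter = timeout_ticks;
    w->resets = 0;
}

/* Advance the model by one clock tick. Returns true if the counter
 * reached zero on this tick, i.e. the microcontroller would be reset. */
bool wdt_tick(wdt_model *w)
{
    w->counter--;
    if (w->counter == 0) {
        w->resets++;
        w->counter = w->reload;  /* assumption: countdown restarts after reset */
        return true;
    }
    return false;
}

/* Run the model for `ticks` ticks without ever feeding it and
 * report how many resets occurred. */
uint32_t wdt_run_unfed(uint32_t timeout_ticks, uint32_t ticks)
{
    wdt_model w;
    wdt_init(&w, timeout_ticks);
    for (uint32_t i = 0; i < ticks; i++) {
        wdt_tick(&w);
    }
    return w.resets;
}
```

With a timeout of 5 ticks, `wdt_run_unfed(5, 4)` reports no reset, while `wdt_run_unfed(5, 5)` reports one: left to itself, the counter reaches zero on the fifth tick.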

When the system is running normally, you don't want it to randomly reset itself, so you need to make sure that your program always "feeds the watch-dog" long before time runs out. Good practice is to feed the WDT before it is halfway through its countdown. For instance, if the WDT has a timeout of 20 seconds, then you will want to feed the WDT at least once every 10 seconds.

Unlike when our hero deals with bombs in the movies, feeding the watch-dog doesn't stop the countdown. When the code uses a "reset" or "clear" command to feed the watchdog, it merely sets the WDT back to some large number -- and then the watchdog timer immediately starts counting down from there.

If the programmer fails to feed the watchdog in time -- or if the program hangs for any reason -- then sooner or later the WDT will time out, and the system will reset, hopefully getting your system unstuck.
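To make the feeding rule and the hang case concrete, here is a minimal simulation in C, using the 20-second timeout and 10-second feeding period from the example above. It is a sketch, not driver code: the function `simulate_main_loop` and both constants are hypothetical names chosen for illustration, and time advances in whole simulated seconds.

```c
#include <assert.h>
#include <stdint.h>

#define WDT_TIMEOUT_S 20u  /* example timeout from the text */
#define FEED_PERIOD_S 10u  /* feed at half the timeout */

/* Simulate `total_s` seconds of a main loop that feeds a count-down
 * watchdog every FEED_PERIOD_S seconds, then hangs at `hang_at_s`
 * and never feeds again. Returns the second at which the watchdog
 * resets the system, or -1 if no reset happens in the simulated window. */
int32_t simulate_main_loop(uint32_t total_s, uint32_t hang_at_s)
{
    assert(FEED_PERIOD_S < WDT_TIMEOUT_S);
    uint32_t counter = WDT_TIMEOUT_S;

    for (uint32_t t = 1; t <= total_s; t++) {
        counter--;                    /* the countdown never pauses */
        if (counter == 0) {
            return (int32_t)t;        /* timed out: system reset */
        }
        if (t < hang_at_s && t % FEED_PERIOD_S == 0) {
            counter = WDT_TIMEOUT_S;  /* feeding only reloads the counter */
        }
    }
    return -1;
}
```

In this model, `simulate_main_loop(100, 1000)` returns -1, because a healthy loop feeds the watchdog every 10 seconds and the counter never reaches zero. `simulate_main_loop(100, 35)` returns 50: the last feed happens at second 30, the program hangs at second 35, and the watchdog fires 20 seconds after that last feed.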

Some "multi-stage" watchdogs have a first stage that interrupts the CPU. The corresponding interrupt handler logs debug information about the current state (a crash dump) and attempts a "soft" recovery. If some fault causes the soft recovery to time out, the final stage of the watchdog will reset the entire system.^[1]
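The two-stage behaviour can be modelled roughly as follows. Everything here is an assumption made for illustration: the type and function names (`wdt2_model`, `wdt2_tick`, `simulate_hang`) are hypothetical, the crash dump is represented by a single flag, and whether the soft recovery succeeds is simply passed in as a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { WDT2_NONE, WDT2_INTERRUPT, WDT2_RESET } wdt2_event;

/* Hypothetical model of a two-stage watchdog. */
typedef struct {
    uint32_t stage1_ticks;  /* ticks until the first-stage interrupt */
    uint32_t stage2_ticks;  /* extra ticks until the full reset */
    uint32_t elapsed;       /* ticks since the last feed */
} wdt2_model;

void wdt2_init(wdt2_model *w, uint32_t stage1_ticks, uint32_t stage2_ticks)
{
    assert(stage1_ticks > 0 && stage2_ticks > 0);
    w->stage1_ticks = stage1_ticks;
    w->stage2_ticks = stage2_ticks;
    w->elapsed = 0;
}

void wdt2_feed(wdt2_model *w)
{
    w->elapsed = 0;
}

/* Advance by one tick and report which stage, if any, fired. */
wdt2_event wdt2_tick(wdt2_model *w)
{
    w->elapsed++;
    if (w->elapsed == w->stage1_ticks) {
        return WDT2_INTERRUPT;
    }
    if (w->elapsed == w->stage1_ticks + w->stage2_ticks) {
        return WDT2_RESET;
    }
    return WDT2_NONE;
}

/* Simulate a main program that has stopped feeding the watchdog.
 * The first-stage handler records a crash dump (modelled as a flag)
 * and tries a soft recovery, which feeds the watchdog if it works. */
wdt2_event simulate_hang(uint32_t stage1_ticks, uint32_t stage2_ticks,
                         bool soft_recovery_works, bool *dump_logged)
{
    wdt2_model w;
    wdt2_init(&w, stage1_ticks, stage2_ticks);
    *dump_logged = false;

    for (uint32_t t = 0; t < stage1_ticks + stage2_ticks; t++) {
        wdt2_event ev = wdt2_tick(&w);
        if (ev == WDT2_INTERRUPT) {
            *dump_logged = true;      /* stand-in for writing a crash dump */
            if (soft_recovery_works) {
                wdt2_feed(&w);        /* recovered: no reset needed */
                return WDT2_INTERRUPT;
            }
        } else if (ev == WDT2_RESET) {
            return WDT2_RESET;        /* soft recovery timed out */
        }
    }
    return WDT2_NONE;
}
```

In this model the crash dump is recorded in both cases; the only difference is whether the final stage still has to reset the whole system.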

(*) Some watchdogs count up. With this kind of watchdog, "feeding the watchdog" resets it to zero. If it ever reaches some high limit, it resets the system.
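A count-up watchdog can be sketched the same way. As before, this is a hypothetical software model with invented names (`wdt_up_model`, `wdt_up_feed`, `wdt_up_tick`), not the register interface of any particular chip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a count-up watchdog. */
typedef struct {
    uint32_t limit;    /* reaching this value resets the system */
    uint32_t counter;  /* ticks since the last feed */
} wdt_up_model;

void wdt_up_init(wdt_up_model *w, uint32_t limit)
{
    assert(limit > 0);
    w->limit = limit;
    w->counter = 0;
}

/* Feeding a count-up watchdog sets the counter back to zero. */
void wdt_up_feed(wdt_up_model *w)
{
    w->counter = 0;
}

/* Advance by one tick. Returns true if the limit has been reached,
 * i.e. the system would be reset. */
bool wdt_up_tick(wdt_up_model *w)
{
    w->counter++;
    return w->counter >= w->limit;
}
```

The behaviour seen by the program is the same as with a count-down watchdog; only the direction of counting and the value written by a feed differ.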

Further reading