In embedded electronic systems, a watchdog timer is a simple device that is supposed to ensure fail-safe behaviour. Unfortunately, it is all too often abused.

The basic idea is this: there is a timer. One (or several) components in the system can reset the timer. If the timer ever hits zero (i.e. there is too long a gap between two successive reset signals), the entire circuit is reset.
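To make the mechanism concrete, here is a minimal software model of it in C. It is only a sketch: real watchdogs are driven through device-specific hardware registers, and the names `watchdog_t`, `watchdog_init`, `watchdog_pet` and `watchdog_tick` are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a watchdog timer. */
typedef struct {
    uint32_t period;    /* ticks allowed between two reset signals */
    uint32_t remaining; /* ticks left before the watchdog fires */
    bool     fired;     /* latched once the countdown reaches zero */
} watchdog_t;

void watchdog_init(watchdog_t *wd, uint32_t period) {
    assert(period > 0); /* a zero period would fire on the first tick */
    wd->period = period;
    wd->remaining = period;
    wd->fired = false;
}

/* The reset signal: restart the countdown from the full period. */
void watchdog_pet(watchdog_t *wd) {
    wd->remaining = wd->period;
}

/* Called once per clock tick. Returns true when the countdown has
   hit zero, i.e. when the whole circuit should be reset. */
bool watchdog_tick(watchdog_t *wd) {
    if (wd->fired) {
        return true;    /* stays fired until re-initialised */
    }
    if (wd->remaining > 0) {
        wd->remaining--;
    }
    if (wd->remaining == 0) {
        wd->fired = true;
    }
    return wd->fired;
}
```

With a period of 3, two ticks followed by a pet keep the model quiet, while three ticks in a row without a pet make `watchdog_tick` return true.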

The idea is that during normal operation of the device, one central subsystem will regularly reset the watchdog timer (also known as petting the watchdog, or kicking the dog). Only a catastrophic error will cause the subsystem to fail to reset the timer (and every catastrophic error that is otherwise untrapped will cause the subsystem to fail to reset the timer). If a timeout ever occurs, then the simplest way to resolve the serious error is to reset the entire system.

There are several problems with the implementations of watchdog timers. Some implementations do not even have a well-defined period for the timer, which makes it impossible to use the watchdog effectively. Other implementations do not provide for an appropriate timer period - some systems may be better off with a fast timer, others may require a slow one. If the programmer is lazy or the implementation of the watchdog is inappropriate, there may be watchdog resets in several subsystems, or in a peripheral subsystem (whose only purpose, sometimes, is to prevent the watchdog timer from resetting the system). If there are resets in several subsystems, then one subsystem could fail, stop transmitting its own reset signal, and the watchdog would not catch the failure because it's still getting petted by the others. If the resets are put on a peripheral subsystem, then there could be a major failure in the core and the watchdog would not help at all, whereas a failure in that particular (minor) subsystem could incapacitate the entire system.
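The "several subsystems" problem can be sketched in C. One common way around it (an assumption on my part, not something the implementations above necessarily do) is to forbid subsystems from resetting the watchdog directly: each one only checks in, and a single central routine resets the timer only when every subsystem has reported. The task names and the functions `task_check_in` and `should_pet_watchdog` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subsystems, one bit each. */
#define TASK_SENSOR  (1u << 0)
#define TASK_CONTROL (1u << 1)
#define TASK_COMMS   (1u << 2)
#define ALL_TASKS    (TASK_SENSOR | TASK_CONTROL | TASK_COMMS)

static uint32_t checked_in = 0;

/* Each subsystem calls this instead of resetting the watchdog itself. */
void task_check_in(uint32_t task_bit) {
    assert((task_bit & ~ALL_TASKS) == 0); /* reject unknown task bits */
    checked_in |= task_bit;
}

/* Called by the one central subsystem that owns the watchdog.
   Returns true when it is safe to reset the timer; the caller would
   then issue the hardware-specific reset signal. */
bool should_pet_watchdog(void) {
    if ((checked_in & ALL_TASKS) != ALL_TASKS) {
        return false;   /* someone is silent: let the timer run out */
    }
    checked_in = 0;     /* everyone must report again next time round */
    return true;
}
```

If the control task hangs, the other two can check in as often as they like and `should_pet_watchdog` still returns false, so the failure is no longer masked by the healthy subsystems.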

Even when the watchdog timer is used correctly, it's a bit of a cop-out. There are many types of failure that will have no effect on the watchdog or on the subsystem in charge of it. And with enough design work, one can predict the sets of circumstances that would cause a watchdog timeout, then add to the design to handle these particular circumstances in a more graceful way than crapping out the entire thing. That way, one could prove at design time that there will be little if any benefit to including a watchdog timer, and a very real risk with potentially serious consequences to having one in.

The above sounds harsh - and it's meant to. However, I do not discount the value of watchdog timers altogether. There are times when the extra design work I talk of is simply not worth it. In this wonderful capitalistic society, failure is always an option (provided the cost of failure is less than the cost of avoiding failure).

Pyrogenic's implementation of a watchdog timer is somewhat different from the type I've been ranting about: it's much more limited in scope. When calling a routine that may hang, it's only common sense and good programming practice to add a timer if you don't want your own system to hang. This timer won't overkill the error recovery process by starting the whole thing from scratch again. This kind of watchdog is reasonable, and is a good idea.
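This is not Pyrogenic's code, just a sketch in C of the general idea of a locally scoped timeout. It assumes the possibly hanging operation can be polled, and it counts polls rather than reading a real clock; `wait_with_timeout`, `fake_device_t` and `fake_device_ready` are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Callback that reports whether the awaited operation has finished. */
typedef bool (*ready_fn)(void *ctx);

/* Polls `ready` at most `max_polls` times. Returns true on success and
   false on timeout, so the caller can recover locally instead of
   resetting the whole system. */
bool wait_with_timeout(ready_fn ready, void *ctx, uint32_t max_polls) {
    assert(ready != NULL);
    for (uint32_t i = 0; i < max_polls; i++) {
        if (ready(ctx)) {
            return true;
        }
    }
    return false;
}

/* Hypothetical stand-in for a peripheral: it reports ready only after
   it has been polled `polls_until_ready` times. */
typedef struct {
    uint32_t polls_until_ready;
} fake_device_t;

bool fake_device_ready(void *ctx) {
    fake_device_t *dev = ctx;
    if (dev->polls_until_ready == 0) {
        return true;
    }
    dev->polls_until_ready--;
    return false;
}
```

A device that needs 3 polls succeeds under a budget of 10, while one that needs 1000 makes `wait_with_timeout` return false. The caller then decides what to do about that one routine, and nothing else gets restarted.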