Hangcheck


The hangcheck timer was developed by Oracle and is included in the Linux kernel:

 The hangcheck-timer driver is a method for catching various system hangs and pauses.
 It catches hangs and pauses where the system resumes after some time, but the clock has
 not noticed the system's lost time.

Unlike the watchdog, hangcheck has no userspace daemon; we only have to configure the kernel module:

 hangcheck_tick        - period of time between system checks, default is 60 sec
 hangcheck_margin      - maximum hang delay before resetting the node, default is 180 sec
 hangcheck_dump_tasks  - If nonzero, the machine will dump the system task state when the timer margin is exceeded
 hangcheck_reboot      - If nonzero, the machine will reboot when the timer margin is exceeded

When the sum of both timers is exceeded, the box will be reset (with the defaults above: 60+180=240 sec). We should also adjust modprobe.conf; here is how we configure the module on the command line:

 $ sudo modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180 hangcheck_dump_tasks=1 hangcheck_reboot=1
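
To keep the same settings across reboots, the parameters can be placed in the modprobe configuration as an options line. This is a minimal sketch, assuming a distribution that reads /etc/modprobe.d/; the file name hangcheck-timer.conf is only an example, and older systems use /etc/modprobe.conf itself:

 # /etc/modprobe.d/hangcheck-timer.conf (example file name, an assumption)
 options hangcheck-timer hangcheck_tick=30 hangcheck_margin=180 hangcheck_dump_tasks=1 hangcheck_reboot=1

With these values the reset threshold is 30+180=210 sec. After loading the module, the kernel log should show the tick and margin actually in use:

 $ dmesg | grep -i hangcheck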
