Watchdog on Linux/Ubuntu

Watchdog timers are commonly found in embedded systems and other computer-controlled equipment where humans cannot easily access the equipment or would be unable to react to faults in a timely manner. In such systems, the computer cannot depend on a human to reboot it if it hangs; it must be self-reliant.

The Odroid N2 supports the meson_wdt watchdog driver to control the PMU.

  • Available with Linux odroid 4.9.177-28 (May 16, 2019) or later

Watchdog driver meson_wdt is configurable for Odroid N2.

Once the driver is loaded, you should see the /dev/watchdog and /dev/watchdog0 device files:

odroid@odroid:~$ ls -la /dev/watchdog*
crw------- 1 root root  10, 130 Jan 28  2018 /dev/watchdog
crw------- 1 root root 243,   0 Jan 28  2018 /dev/watchdog0
odroid@odroid:~$
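The check above can also be scripted. The following is a minimal sketch: the function name list_watchdogs is hypothetical, and the sysfs attributes it reads (identity, timeout under /sys/class/watchdog) are an assumption, since they only exist when the kernel was built with CONFIG_WATCHDOG_SYSFS. If they are missing, only the device name is printed.

```shell
# list_watchdogs [DEV_DIR] [SYS_DIR]: print one line per watchdog entry found
# in DEV_DIR (default /dev), adding driver details from sysfs when available.
# Returns non-zero if no watchdog entry exists.
list_watchdogs() {
    dev_dir=${1:-/dev}
    sys_dir=${2:-/sys/class/watchdog}   # assumption: needs CONFIG_WATCHDOG_SYSFS
    found=0
    for dev in "$dev_dir"/watchdog*; do
        [ -e "$dev" ] || continue       # the glob matched nothing
        found=1
        name=${dev##*/}
        info=""
        if [ -r "$sys_dir/$name/identity" ]; then
            info=" identity=$(cat "$sys_dir/$name/identity")"
        fi
        if [ -r "$sys_dir/$name/timeout" ]; then
            info="$info timeout=$(cat "$sys_dir/$name/timeout")s"
        fi
        echo "$name:$info"
    done
    [ "$found" -eq 1 ]
}
```

Calling list_watchdogs with no arguments inspects the real /dev and /sys paths; a non-zero return status suggests the driver is not loaded.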

Install the watchdog daemon:

sudo apt-get install watchdog

Create a directory for the watchdog log files:

sudo mkdir -p /var/log/watchdog

Set the following default watchdog configuration in /etc/default/watchdog:

# Start watchdog at boot time? 0 or 1
run_watchdog=1
# Start wd_keepalive after stopping watchdog? 0 or 1
run_wd_keepalive=1
# Load module before starting watchdog
watchdog_module="none"
# Specify additional watchdog options here (see manpage).
watchdog_options="-s -v -c /etc/watchdog.conf"
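If you prefer to apply these settings from a script rather than an editor, something like the sketch below could work. The helper name set_default is hypothetical, and it assumes GNU sed (for -i) and values that contain neither "|" nor "&".

```shell
# set_default KEY VALUE [FILE]: set KEY=VALUE in a shell-style defaults file.
# An existing assignment is replaced in place; a missing key is appended.
# Assumes GNU sed and a VALUE without '|' or '&' characters.
set_default() {
    key=$1 value=$2 file=${3:-/etc/default/watchdog}
    if grep -q "^${key}=" "$file"; then
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}
```

For example, running set_default run_watchdog 1 as root would enable the daemon at boot. Pass values that need quoting with their quotes included, such as '"-s -v -c /etc/watchdog.conf"' for watchdog_options.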

Edit /etc/watchdog.conf and uncomment the watchdog-device line so that the daemon actually uses the /dev/watchdog hardware device. Otherwise the daemon does not use the hardware timer and relies only on its own code to soft-reboot a broken machine.
This example configuration sets the WDT timeout to 15 seconds. If you need a faster reboot, reduce the value of watchdog-timeout.

$ cat /etc/watchdog.conf
#ping			= 172.31.14.1
#ping			= 172.26.1.255
#interface		= eth0
#file			= /var/log/messages
#change			= 1407
 
# Uncomment to enable test. Setting one of these values to '0' disables it.
# These values will hopefully never reboot your machine during normal use
# (if your machine is really hung, the loadavg will go much higher than 25)
#max-load-1		= 24
#max-load-5		= 18
#max-load-15		= 12
 
# Note that this is the number of pages!
# To get the real size, check how large the pagesize is on your machine.
#min-memory		= 1
#allocatable-memory	= 1
 
#repair-binary		= /usr/sbin/repair
#repair-timeout		= 60
#test-binary		=
#test-timeout		= 60
 
# The retry-timeout and repair limit are used to handle errors in a more robust
# manner. Errors must persist for longer than retry-timeout to action a repair
# or reboot, and if repair-maximum attempts are made without the test passing a
# reboot is initiated anyway.
#retry-timeout		= 60
#repair-maximum		= 1
 
watchdog-device	= /dev/watchdog
 
# Defaults compiled into the binary
#temperature-sensor	=
#max-temperature	= 90
 
# Defaults compiled into the binary
#admin			= root
#interval		= 1
#logtick                = 1
#log-dir		= /var/log/watchdog
 
# This greatly decreases the chance that watchdog won't be scheduled before
# your machine is really loaded
realtime		= yes
priority		= 1
 
# Check if rsyslogd is still running by enabling the following line
#pidfile		= /var/run/rsyslogd.pid
 
watchdog-timeout        = 15

Note: watchdog-timeout sets how long the hardware timer waits for a keep-alive from the daemon. If no keep-alive arrives within that time, the timer expires and triggers a reboot.
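Because the daemon pings the device once per interval, the interval has to stay well below watchdog-timeout or a healthy system could be rebooted. The sketch below checks this for a given configuration file. The function names conf_get and check_timing are hypothetical, and the fallback of 60 seconds for watchdog-timeout is an assumption about the daemon's compiled-in default; check the watchdog.conf man page on your system.

```shell
# conf_get KEY DEFAULT [FILE]: print the value of the last uncommented
# "key = value" line for KEY, or DEFAULT if the key is not set.
conf_get() {
    key=$1 default=$2 file=${3:-/etc/watchdog.conf}
    value=$(sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*\$/\1/p" "$file" | tail -n 1)
    echo "${value:-$default}"
}

# check_timing [FILE]: succeed only if the ping interval is shorter than the
# hardware timeout configured in FILE.
check_timing() {
    file=${1:-/etc/watchdog.conf}
    timeout=$(conf_get watchdog-timeout 60 "$file")   # 60 is an assumed default
    interval=$(conf_get interval 1 "$file")           # 1 matches the commented default above
    echo "watchdog-timeout=${timeout}s interval=${interval}s"
    [ "$interval" -lt "$timeout" ]
}
```

Running check_timing against the example configuration above reports a 15 second timeout and a 1 second interval, which leaves a comfortable margin.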

For more configuration options, see: http://www.sat.dundee.ac.uk/psc/watchdog/watchdog-configure.html

Enabling the watchdog service on Ubuntu 18.04.x

To start the watchdog service at boot, create a symbolic link to the service file, then enable and start the service:

sudo ln -s  /lib/systemd/system/watchdog.service /etc/systemd/system/multi-user.target.wants/watchdog.service
sudo systemctl enable watchdog.service
sudo systemctl start watchdog.service

Check that the watchdog service is running:

odroid@odroid:~$ service watchdog status
● watchdog.service - watchdog daemon
   Loaded: loaded (/lib/systemd/system/watchdog.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2019-05-28 08:12:51 UTC; 2min 32s ago
  Process: 2718 ExecStart=/bin/sh -c [ $run_watchdog != 1 ] || exec /usr/sbin/watchdog $watchdog_options (code=exited, status=0/SUCCESS)
  Process: 2715 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modprobe $watchdog_module (code=exit
 Main PID: 2720 (watchdog)
   CGroup: /system.slice/watchdog.service
           └─2720 /usr/sbin/watchdog -s -v -c /etc/watchdog.conf
 
May 28 08:15:13 odroid watchdog[2720]: still alive after 121 interval(s)
May 28 08:15:14 odroid watchdog[2720]: still alive after 122 interval(s)
May 28 08:15:15 odroid watchdog[2720]: still alive after 123 interval(s)
May 28 08:15:16 odroid watchdog[2720]: still alive after 124 interval(s)
May 28 08:15:17 odroid watchdog[2720]: still alive after 125 interval(s)
May 28 08:15:19 odroid watchdog[2720]: still alive after 126 interval(s)
May 28 08:15:20 odroid watchdog[2720]: still alive after 127 interval(s)
May 28 08:15:21 odroid watchdog[2720]: still alive after 128 interval(s)
May 28 08:15:22 odroid watchdog[2720]: still alive after 129 interval(s)
May 28 08:15:23 odroid watchdog[2720]: still alive after 130 interval(s)

Once the watchdog daemon is configured, it continuously resets the watchdog timer. If it fails to do so (because the system is unresponsive), the timer expires and the board reboots.
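The keep-alive mechanism itself is simple and can be illustrated without the daemon. The sketch below relies on the standard Linux watchdog device interface: opening the device arms the timer, any write resets it, and writing the character V immediately before closing (the "magic close") asks the driver to stop the timer. The function name pet_watchdog is hypothetical, and whether meson_wdt honours the magic close is an assumption. The device can only be opened by one process at a time, so stop the watchdog service first.

```shell
# pet_watchdog DEVICE COUNT INTERVAL: keep DEVICE open, write a keep-alive
# COUNT times with INTERVAL seconds between writes, then request a clean stop.
pet_watchdog() {
    device=$1 count=$2 interval=$3
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '.'              # any write resets the countdown
            sleep "$interval"
            i=$((i + 1))
        done
        printf 'V'                  # magic close: ask the driver to disarm
    } > "$device"                   # the device stays open for the whole block
}
```

For example, pet_watchdog /dev/watchdog 10 5 (as root) would keep the board alive for roughly 50 seconds. If the loop were interrupted before the final write, the timer would be expected to expire and reboot the board.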

One way to test the watchdog device is to kill the watchdog daemon after it has started. With nothing left to reset the timer, the board should reboot once the timeout expires.

root@odroid64:~# pkill -9 watchdog

To test the watchdog against a real system failure, crash the kernel deliberately.

Warning: the command below crashes the kernel. Use it with caution, and never run it on a production machine.

echo c > /proc/sysrq-trigger

This will force the Linux kernel to crash. If the watchdog works properly, it will reboot the system.

root@odroid:~# echo c > /proc/sysrq-trigger
[   46.497202@3] sysrq: SysRq : Trigger a crash
[   46.497523@0] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[   46.510196@0] pgd = ffffffc0c6d62000
[   46.510356@0] [0000000000000000] *pgd=0000000000000000, *pud=0000000000000000
[   46.517274@0] Internal error: Oops: 96000045 [#1] PREEMPT SMP
[   46.521024@0] Modules linked in: fuse squashfs cpufreq_ondemand cpufreq_powersave cpufreq_userspace cpufreq_conservative rtc_pcf8563 i2c_meson_master sch_6
[   46.556125@0] CPU: 0 PID: 2906 Comm: bash Not tainted 4.9.177-28 #1
[   46.562356@0] Hardware name: Hardkernel ODROID-N2 (DT)
[   46.567472@0] task: ffffffc0c8de8000 task.stack: ffffffc0c7b68000
[   46.573563@0] PC is at sysrq_handle_crash+0x28/0x38
[   46.578393@0] LR is at sysrq_handle_crash+0x14/0x38
[   46.583244@0] pc : [<ffffff80094e3698>] lr : [<ffffff80094e3684>] pstate: 60000145
[   46.590782@0] sp : ffffffc0c7b6bcd0
[   46.594248@0] x29: ffffffc0c7b6bcd0 x28: ffffffc0c8de8000 
[   46.599709@0] x27: ffffff8009c12000 x26: 0000000000000040 
[   46.605168@0] x25: 0000000000000123 x24: 0000000000000000 
[   46.610628@0] x23: 0000000000000004 x22: ffffff800a656000 
[   46.616088@0] x21: ffffff800a656488 x20: 0000000000000063 
[   46.621549@0] x19: ffffff800a5f9000 x18: ffffffffffffffff 
[   46.627008@0] x17: 0000007f93178028 x16: ffffff800923a770 
[   46.632468@0] x15: ffffff800a5d7e90 x14: ffffff808a77b11f 
[   46.637928@0] x13: 0000000000000000 x12: 0000000000000007 
[   46.643388@0] x11: 0000000000000006 x10: 0000000000000358 
[   46.648848@0] x9 : 0000000000000001 x8 : 0000000000000000 
[   46.654308@0] x7 : ffffff800a640130 x6 : 0000000000000000 
[   46.659768@0] x5 : 0000000000000000 x4 : 0000000000000000 
[   46.665228@0] x3 : 0000000000000000 x2 : 00000000000409b1 
[   46.670688@0] x1 : 0000000000000000 x0 : 0000000000000001 
[   46.676150@0] 
[   46.676150@0] SP: 0xffffffc0c7b6bc50:
[   46.681435@0] bc50  0a656000 ffffff80 00000004 00000000 00000000 00000000 00000123 00000000
[   46.689755@0] bc70  00000040 00000000 09c12000 ffffff80 c8de8000 ffffffc0 c7b6bcd0 ffffffc0
[   46.698074@0] bc90  094e3684 ffffff80 c7b6bcd0 ffffffc0 094e3698 ffffff80 60000145 00000000
[   46.706394@0] bcb0  c7b6bcd0 ffffffc0 094e3684 ffffff80 ffffffff 0000007f 00000000 00000000
[   46.714714@0] bcd0  c7b6bce0 ffffffc0 094e3d58 ffffff80 c7b6bd20 ffffffc0 094e4328 ffffff80
[   46.723034@0] bcf0  00000002 00000000 8da7b0d0 00000055 8da7b0d0 00000055 00000002 00000000
[   46.731354@0] bd10  c7b6beb0 ffffffc0 00000015 00000000 c7b6bd40 ffffffc0 092b9070 ffffff80
[   46.739674@0] bd30  3d0b0b40 ffffffc0 3cd44700 ffffffc0 c7b6bd80 ffffffc0 09238058 ffffff80
[   46.748007@0] 
[   46.748007@0] X28: 0xffffffc0c8de7f80:
[   46.753368@0] 7f80  00000000 00000000 00000000 00000000 c8de7fc0 ffffffc0 00000000 00000000
[   46.761688@0] 7fa0  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   46.770008@0] 7fc0  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   46.778328@0] 7fe0  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   46.786648@0] 8000  00000008 00000000 ffffffff ffffffff 00000001 00000000 00000000 00000000
[   46.794967@0] 8020  c7b68000 ffffffc0 00000002 00400100 00000000 00000000 00000000 00000000
[   46.803288@0] 8040  00000001 00000000 00000005 00000000 ffff0866 00000000 3d284600 ffffffc0
[   46.811608@0] 8060  00000000 00000001 00000078 00000078 00000078 00000000 09c19458 ffffff80
[   46.819929@0] 
[   46.819929@0] X29: 0xffffffc0c7b6bc50:
[   46.825301@0] bc50  0a656000 ffffff80 00000004 00000000 00000000 00000000 00000123 00000000
[   46.833621@0] bc70  00000040 00000000 09c12000 ffffff80 c8de8000 ffffffc0 c7b6bcd0 ffffffc0
[   46.841941@0] bc90  094e3684 ffffff80 c7b6bcd0 ffffffc0 094e3698 ffffff80 60000145 00000000
[   46.850261@0] bcb0  c7b6bcd0 ffffffc0 094e3684 ffffff80 ffffffff 0000007f 00000000 00000000
[   46.858581@0] bcd0  c7b6bce0 ffffffc0 094e3d58 ffffff80 c7b6bd20 ffffffc0 094e4328 ffffff80
[   46.866901@0] bcf0  00000002 00000000 8da7b0d0 00000055 8da7b0d0 00000055 00000002 00000000
[   46.875221@0] bd10  c7b6beb0 ffffffc0 00000015 00000000 c7b6bd40 ffffffc0 092b9070 ffffff80
[   46.883541@0] bd30  3d0b0b40 ffffffc0 3cd44700 ffffffc0 c7b6bd80 ffffffc0 09238058 ffffff80
[   46.891861@0] 
[   46.893510@0] Process bash (pid: 2906, stack limit = 0xffffffc0c7b68000)
[   46.900185@0] Stack: (0xffffffc0c7b6bcd0 to 0xffffffc0c7b6c000)
[   46.906078@0] bcc0:                                   ffffffc0c7b6bce0 ffffff80094e3d58
[   46.914052@0] bce0: ffffffc0c7b6bd20 ffffff80094e4328 0000000000000002 000000558da7b0d0
[   46.922025@0] bd00: 000000558da7b0d0 0000000000000002 ffffffc0c7b6beb0 0000000000000015
[   46.929999@0] bd20: ffffffc0c7b6bd40 ffffff80092b9070 ffffffc03d0b0b40 ffffffc03cd44700
[   46.937972@0] bd40: ffffffc0c7b6bd80 ffffff8009238058 ffffff800a5d7000 ffffffc03cd44700
[   46.945946@0] bd60: 0000000000000002 ffffffc0c7b6beb0 000000558da7b0d0 ffffffc0c9bb7a80
[   46.953919@0] bd80: ffffffc0c7b6be30 ffffff8009239084 0000000000000002 ffffffc03cd44700
[   46.961892@0] bda0: 0000000000000000 000000558da7b0d0 ffffffc0c7b6beb0 0000000000000002
[   46.969866@0] bdc0: ffffffc0c7b6bdf0 ffffff800923c88c 0000000000000000 ffffffc0ca86ca80
[   46.977839@0] bde0: ffffffc0ca86ca80 0000000000000002 ffffffc0c7b6be30 ffffff8009239174
[   46.985812@0] be00: 0000000000000002 ffffffc03cd44700 0000000000000000 000000558da7b0d0
[   46.993786@0] be20: ffffffc03cd44700 00000000000409b1 ffffffc0c7b6be70 ffffff800923a7dc
[   47.001759@0] be40: ffffff800a5d7000 ffffffc03cd44700 ffffffc03cd44700 000000558da7b0d0
[   47.009732@0] be60: 0000000000000002 0000000000000000 0000000000000000 ffffff80090839c0
[   47.017705@0] be80: ffffffffffffff1d 00000040c5119000 ffffffffffffffff 0000007f931cfbac
[   47.025679@0] bea0: 0000000020000000 0000000000000400 0000000000000000 00000000000409b1
[   47.033652@0] bec0: 0000000000000001 000000558da7b0d0 0000000000000002 0000007f932651a8
[   47.041626@0] bee0: 0000000000000000 0000000155510004 0000000000000000 0000000000000001
[   47.049599@0] bf00: 0000000000000040 0000007f932f3700 0000000000000010 0000000000000000
[   47.057572@0] bf20: 0000000000000001 000000000000270f 0000000000000002 0000000000000000
[   47.065545@0] bf40: 000000556d115bf0 0000007f93178028 0000007f93260a70 0000000000000001
[   47.073526@0] bf60: 000000558da7b0d0 0000007f93261560 0000000000000002 000000558da7b0d0
[   47.081494@0] bf80: 0000000000000002 0000007f93261648 000000556d0fe000 000000556d0eb000
[   47.089466@0] bfa0: 000000558da7ae60 0000007ff070e2d0 0000007f9317b398 0000007ff070e2d0
[   47.097438@0] bfc0: 0000007f931cfbac 0000000020000000 0000000000000001 0000000000000040
[   47.105412@0] bfe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   47.113386@0] Call trace:
[   47.115988@0] Exception stack(0xffffffc0c7b6bae0 to 0xffffffc0c7b6bc10)
[   47.122572@0] bae0: ffffff800a5f9000 0000007fffffffff ffffffc0c7b6bcd0 ffffff80094e3698
[   47.130545@0] bb00: 0000000060000145 ffffff800a77a000 ffffffc0c7b6bb30 ffffff800911278c
[   47.138519@0] bb20: ffffff8009f1c628 0000000100000000 ffffffc0c7b6bbd0 ffffff8009112938
[   47.146492@0] bb40: ffffffc0c7b6bc30 ffffff8009f53a98 ffffff800a656488 ffffff800a656000
[   47.154464@0] bb60: 0000000000000004 0000000000000000 0000000000000123 0000000000000040
[   47.162439@0] bb80: ffffff8009c12000 ffffffc0c8de8000 ffffffc0ca408240 00000000000409b1
[   47.170412@0] bba0: 0000000000000001 0000000000000000 00000000000409b1 0000000000000000
[   47.178385@0] bbc0: 0000000000000000 0000000000000000 0000000000000000 ffffff800a640130
[   47.186358@0] bbe0: 0000000000000000 0000000000000001 0000000000000358 0000000000000006
[   47.194330@0] bc00: 0000000000000007 0000000000000000
[   47.199371@0] [<ffffff80094e3698>] sysrq_handle_crash+0x28/0x38
[   47.205255@0] [<ffffff80094e3d58>] __handle_sysrq+0xb0/0x1a8
[   47.210887@0] [<ffffff80094e4328>] write_sysrq_trigger+0x90/0xa0
[   47.216871@0] [<ffffff80092b9070>] proc_reg_write+0x90/0xd0
[   47.222416@0] [<ffffff8009238058>] __vfs_write+0x60/0x150
[   47.227785@0] [<ffffff8009239084>] vfs_write+0xac/0x1b0
[   47.232985@0] [<ffffff800923a7dc>] SyS_write+0x6c/0xd8
[   47.238104@0] [<ffffff80090839c0>] el0_svc_naked+0x34/0x38
[   47.243562@0] Code: 52800020 b90a9820 d5033e9f d2800001 (39000020) 
[   47.249813@0] ---[ end trace a309fd0bed7660d7 ]---
[   47.266264@0] Kernel panic - not syncing: Fatal exception
[   47.266368@0] SMP: stopping secondary CPUs
[   47.270152@0] Kernel Offset: disabled
[   47.273744@0] Memory Limit: none
[   47.288644@0] Rebooting in 5 seconds..
[   52.288923@0] reboot reason 12
bl31 reboot reason: 0xd
bl31 reboot reason: 0xc
system cmd  1.
G12B:BL:6e7c85:7898ac;FEAT:E0F83180:2000;POC:F;RCY:0;EMMC:0;READ:0;0.4
bl2_stage_init 0x01
bl2_stage_init 0x81
hw id: 0x0000 - pwm id 0x01
bl2_stage_init 0xc1
bl2_stage_init 0x02
 
L0:00000000
L1:00000703
L2:00008067
L3:04000000
B2:00002000
B1:e0f83180
 
TE: 303140
 
BL2 Built : 10:47:19, Jan 14 2019. g12b g152d217 - guotai.shen@droid11-sz
 
Board ID = 4
Set A53 clk to 24M
Set A73 clk to 24M
Set clk81 to 24M
A53 clk: 1200 MHz
A73 clk: 1200 MHz
CLK81: 166.6M
smccc: 0004e8b4
eMMC boot @ 0

Watchdog on Android

To test the watchdog on Android, crash the kernel deliberately.

Warning: the command below crashes the kernel. Use it with caution, and never run it on a production machine.

echo c > /proc/sysrq-trigger