[PVE-User] watchdog timeout hardcoded to 10 sec

Thomas Lamprecht t.lamprecht at proxmox.com
Fri Dec 10 16:34:32 CET 2021


On 10.12.21 15:22, Stefan Radman wrote:
> What is the reason for hardcoding the watchdog timeout into pve-ha-manager/watchdog-mux.c?

Note that this is the multiplexer; the actual timeout for its clients is 60s.

The MUX opens the actual watchdog. It's a really small C program with a very small
footprint and static resource usage, so it won't ever fail to update the watchdog
in any situation where the system isn't totally lost.

The MUX then checks the actual clients. If those did not ping in the last 60s, the
MUX stops updating the actual watchdog, causing a reset around 0s to 10s later.

So the in-practice timeout for the watchdog services the MUX provides is 60 to 70
seconds, not ten.
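
To make the arithmetic concrete, here is a rough sketch in C. It is not the
actual watchdog-mux.c code: the function and macro names are made up for
illustration, only the two values (60s and 10s) come from the explanation above,
and whether the comparison at exactly 60s is strict is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Values from this mail; all names here are hypothetical. */
#define CLIENT_TIMEOUT_S 60 /* timeout the MUX enforces on its clients */
#define HW_TIMEOUT_S     10 /* timeout set on the actual watchdog */

/* Should the MUX keep updating the actual watchdog?
 * Only while the client pinged within the last CLIENT_TIMEOUT_S seconds. */
static bool mux_should_update(long now_s, long last_client_ping_s)
{
    assert(now_s >= last_client_ping_s);
    return (now_s - last_client_ping_s) < CLIENT_TIMEOUT_S;
}

/* Earliest reset, counted from the last client ping: the MUX gives up
 * at 60s and the actual watchdog happens to expire right away. */
static long earliest_reset_s(void)
{
    return CLIENT_TIMEOUT_S;
}

/* Latest reset: the MUX updated the actual watchdog just before giving
 * up, so the full 10s still has to run down. */
static long latest_reset_s(void)
{
    return CLIENT_TIMEOUT_S + HW_TIMEOUT_S;
}
```

With these values, earliest_reset_s() is 60 and latest_reset_s() is 70, which is
where the 60 to 70 second window comes from.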

> https://git.proxmox.com/?p=pve-ha-manager.git;a=blob;f=src/watchdog-mux.c#l33
>   33 int watchdog_timeout = 10;
> https://git.proxmox.com/?p=pve-ha-manager.git;a=blob;f=src/watchdog-mux.c#l157
>  157     if (ioctl(watchdog_fd, WDIOC_SETTIMEOUT, &watchdog_timeout) == -1) {
> I am trying to use a more conservative 5 minute timeout for the IPMI watchdog but it gets changed to 10 seconds when the watchdog-mux.service starts.

That's not a reasonable timeout for Proxmox VE's HA self-fencing, as pmxcfs locks have
a timeout of 2 minutes. If you go above that, all consistency guarantees from the
self-fencing are void and an HA service can be recovered while the original one still
accesses some of its resources. IOW, there be dragons.
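
As a simplified illustration of that constraint (the helper name is made up, and
the real interplay between the lock timeout and fencing has more moving parts
than a single comparison), one could check a candidate setup like this:

```c
#include <assert.h>
#include <stdbool.h>

/* pmxcfs lock timeout of 2 minutes, as stated above. */
#define PMXCFS_LOCK_TIMEOUT_S 120

/* Hypothetical check: the worst-case delay until the node resets
 * (client timeout plus actual watchdog timeout) must not exceed the
 * lock timeout, otherwise self-fencing no longer guarantees anything. */
static bool fence_delay_is_safe(int client_timeout_s, int hw_timeout_s)
{
    assert(client_timeout_s >= 0 && hw_timeout_s >= 0);
    return client_timeout_s + hw_timeout_s <= PMXCFS_LOCK_TIMEOUT_S;
}
```

fence_delay_is_safe(60, 10) holds for the current values, while a 5 minute
watchdog timeout, i.e. fence_delay_is_safe(60, 300), does not.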

PS: Personally, I'd only rely on a HW watchdog if I'm really sure it runs stably. Most
of the time their firmware is just a mess, and they have so many bugs that the kernel's
softdog, which is itself a quite small and simple kernel module, works more
reliably. YMMV, but I never saw a situation where the softdog didn't do its job, while
we did get some reports of failing HW watchdogs. Not /that/ many, but most users go for
the default setup, so this may be biased.

hope that helps,
