watchdog timeout hardcoded to 10 sec

Stefan Radman stefan.radman at me.com
Fri Dec 10 17:06:04 CET 2021


Hi Thomas

Thank you for the thorough explanation. That makes sense and we’ll test and reconsider.

Regarding the hardware watchdog, we are using recent Dell and Supermicro hardware with up-to-date firmware, and we are pretty sure the watchdog is stable.
In our experience, there are hardware failures that a power cycle will “cure” (at least temporarily, until the hardware is replaced).
The softdog probably won’t help in this case.

> got some report of failing HW watchdogs

I’d be interested to hear more about the circumstances (make, model, settings) from the community.
We are usually more interested in reliability (24/7/365) than in performance.

> hope that helps,

It does, indeed :)

Thanks & cheers

Stefan


> On Dec 10, 2021, at 18:34, Thomas Lamprecht <t.lamprecht at proxmox.com> wrote:
> 
> Hi,
> 
> On 10.12.21 15:22, Stefan Radman wrote:
>> What is the reason for hardcoding the watchdog timeout into pve-ha-manager/watchdog-mux.c?
> 
> Note that this is the multiplexer, the actual timeout for its clients is 60s.
> 
> The MUX opens the actual watchdog, it's a really small C program with a very small
> footprint and static resource usage, so it won't ever fail to update the watchdog
> in any situation where the system isn't totally lost.
> 
> The MUX then checks the actual clients; if those did not ping in the last 60s, the
> MUX will stop updating the actual watchdog, causing a reset around 0s to 10s later.
> 
> So the in-practice timeout for the watchdog services the MUX provides is 60 to 70
> seconds, not ten.
> 
>> 
>> https://git.proxmox.com/?p=pve-ha-manager.git;a=blob;f=src/watchdog-mux.c#l33
>>  33 int watchdog_timeout = 10;
>> https://git.proxmox.com/?p=pve-ha-manager.git;a=blob;f=src/watchdog-mux.c#l157
>> 157     if (ioctl(watchdog_fd, WDIOC_SETTIMEOUT, &watchdog_timeout) == -1) {
>> 
>> I am trying to use a more conservative 5 minute timeout for the IPMI watchdog but it gets changed to 10 seconds when the watchdog-mux.service starts.
> 
> That's not a reasonable timeout for Proxmox VE's HA self fencing, as pmxcfs locks have
> a timeout of 2 minutes. If you go above that, all consistency guarantees from the self
> fencing are void and an HA service can be recovered while the original one still accesses
> some of its resources, iow. there be dragons.
> 
> ps. Personally I'd only rely on a HW watchdog if I'm really sure it runs stable, most
> of the time their firmware is just a mess and they have so many bugs that the softdog
> of the kernel, which itself is a quite small and simple kernel module, works more
> reliably. YMMV, but I have never seen a situation where the softdog didn't do its job, but we
> got some report of failing HW watchdogs - not /that/ many, but most users go for the
> default setup so this may be biased.
> 
> hope that helps,
> Thomas
> 
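
For illustration only, the timing described in the quoted message can be sketched in a few lines of C. This is not the code from watchdog-mux.c: the helper names (mux_should_feed, worst_case_reset_s, fencing_within_lock_timeout) are hypothetical, and the strict "less than" comparison against the lock timeout is an assumption. The three constants are the values named in the thread (10 s hardware timeout, 60 s client timeout, 2 minute pmxcfs lock timeout).

```c
#include <stdbool.h>

/* Values named in the thread. */
enum {
    HW_TIMEOUT_S     = 10,  /* watchdog_timeout in watchdog-mux.c */
    CLIENT_TIMEOUT_S = 60,  /* timeout for the mux's clients */
    LOCK_TIMEOUT_S   = 120  /* pmxcfs lock timeout (2 minutes) */
};

/* Hypothetical helper: decide whether the mux keeps updating the hardware
 * watchdog. Returns false as soon as any client has been silent for
 * client_timeout_s seconds or longer (the exact boundary is an assumption). */
static bool mux_should_feed(const long *last_ping_s, int n_clients,
                            long now_s, long client_timeout_s)
{
    for (int i = 0; i < n_clients; i++) {
        if (now_s - last_ping_s[i] >= client_timeout_s)
            return false;
    }
    return true;
}

/* Latest point, in seconds after a client's last ping, at which the
 * hardware watchdog resets the node: the client timeout plus one full
 * hardware timeout. */
static long worst_case_reset_s(long client_timeout_s, long hw_timeout_s)
{
    return client_timeout_s + hw_timeout_s;
}

/* Assumed safety check: the node must be reset before the lock can expire. */
static bool fencing_within_lock_timeout(long client_timeout_s,
                                        long hw_timeout_s,
                                        long lock_timeout_s)
{
    return worst_case_reset_s(client_timeout_s, hw_timeout_s) < lock_timeout_s;
}
```

With the values from the thread, worst_case_reset_s(CLIENT_TIMEOUT_S, HW_TIMEOUT_S) gives 70 seconds, which is below the 120 second lock timeout. Replacing the hardware timeout with 300 seconds (the 5 minute value from the original question) gives 360 seconds, so the check fails.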




