[PVE-User] Reboot on psu failure in redundant setup

Fri Nov 8 16:45:13 CET 2019

Hi,

On 11/8/19 4:35 PM, Daniel Berteaud wrote:
> ----- On 8 Nov 19, at 16:22, Mark Adams mark at openvs.co.uk wrote:
>> Hi All,
>>
>> This cluster is on 5.4-11.
>>
>> This is most probably a hardware issue with either the UPS or the server
>> PSUs, but I wanted to check whether there is any default watchdog or auto
>> reboot in a Proxmox HA cluster.
>>
>> Explanation of what happened:
>>
>> All servers have redundant PSUs, fed from separate UPSes in
>> separate racks on separate feeds. One of the UPSes went out, and when it did
>> all nodes rebooted. They were functioning normally after the reboot, but I
>> wasn't expecting the reboot to occur.
>>
>> When the UPS went down, it also took down all of the core network because
>> the network's power was not connected up in a redundant fashion. Ceph and
>> "LAN" traffic were blocked because of this. Did a watchdog reboot each node
>> because it lost contact with its cluster peers? I didn't configure it to do
>> this myself, so is this an automatic feature? Everything I have read says
>> it should be configured manually.
>>
>> Thanks in advance.
> 
> Yes, that's expected. If all nodes are isolated from each other, they will be self-fenced (using a software watchdog) to prevent any corruption and allow services to be recovered on the quorate part of the cluster. In your case, there was no quorate part, as there was no network at all.

Small addition: it can also be a hardware watchdog, if one is configured [0].

And yes, as soon as you enable an HA service, that node and the current
HA manager node will each arm a watchdog and start updating it. If the node
hangs or loses quorum for more than 60s, the watchdog updates stop
and the node gets self-fenced soon afterwards (within a few seconds).
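
To make the timing above concrete, here is a rough simulation sketch in
Python. It is not the actual Proxmox implementation: the 60s quorum grace
period comes from the description above, while the 10s watchdog timeout, the
names NodeSim and simulate, and the one-second tick are assumptions chosen
only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

QUORUM_GRACE_S = 60.0      # from the text: quorum loss is tolerated for 60s
WATCHDOG_TIMEOUT_S = 10.0  # assumption: illustrative watchdog expiry time


@dataclass
class NodeSim:
    """Hypothetical model of one HA node and its watchdog."""
    last_quorate: float = 0.0   # last time the node saw quorum
    last_update: float = 0.0    # last time the watchdog was updated
    fenced_at: Optional[float] = None

    def tick(self, now: float, quorate: bool) -> None:
        if self.fenced_at is not None:
            return
        # The watchdog fires once it has gone too long without an update.
        if now - self.last_update > WATCHDOG_TIMEOUT_S:
            self.fenced_at = now
            return
        if quorate:
            self.last_quorate = now
        # Keep updating the watchdog only while quorum is recent enough.
        if now - self.last_quorate <= QUORUM_GRACE_S:
            self.last_update = now


def simulate(quorum_lost_at: float, duration: float,
             step: float = 1.0) -> Optional[float]:
    """Return the simulated time of the self-fence, or None if none occurs."""
    node = NodeSim()
    t = 0.0
    while t <= duration:
        node.tick(t, quorate=t < quorum_lost_at)
        if node.fenced_at is not None:
            return node.fenced_at
        t += step
    return None
```

With these assumed values, simulate(100, 300) reports a fence roughly 70
simulated seconds after quorum is lost, and a run in which quorum is never
lost returns None. The point is only the ordering: updates stop after the
grace period, and the reset follows one watchdog timeout later.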

[0]: https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_configure_hardware_watchdog

cheers,
Thomas