[PVE-User] Corosync and Cluster reboot

Alwin Antreich alwin at antreich.com
Wed Jan 8 13:02:14 CET 2025


Hi Iztok,


January 8, 2025 at 11:12 AM, "Iztok Gregori" <iztok.gregori at elettra.eu> wrote:


> 
> Hi!
> 
> On 07/01/25 15:15, DERUMIER, Alexandre wrote:
> 
> > 
> > Personally, I'd recommend disabling HA temporarily during the network change (mv /etc/pve/ha/resources.cfg to a tmp directory, stop all pve-ha-lrm, then stop all pve-ha-crm to stop the watchdog).
> >  
> >  Then, after the migration, check the corosync logs for 1 or 2 days, and after that, if no retransmits occur, re-enable HA.
> > 
> Good advice. But with the pve-ha-* services down, the "HA-VMs" cannot 
> migrate from one node to another, because migration is handled by 
> HA (or at least that is how I remember it working some time ago). So 
> I've (temporarily) removed all resources (VMs) from HA, which 
> tells pve-ha-lrm to disable the watchdog ("watchdog closed 
> (disabled)"), and no reboot should occur.
Yes, after a minute or two when no resource is under HA the watchdog is closed (lrm becomes idle).
I second Alexandre's recommendation when working on the corosync network/config.
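For the archives, a rough sketch of that procedure (the paths and service names are the standard PVE ones, but treat this as an outline to adapt to your cluster, not a tested runbook):

```sh
# On one node: move the HA resource config out of the way.
# /etc/pve is cluster-wide, so this only needs to be done once.
mv /etc/pve/ha/resources.cfg /root/resources.cfg.bak

# On EVERY node: stop the LRM first, then the CRM,
# so the watchdog is released cleanly.
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm

# ... perform the network change, then watch the corosync
# logs for a day or two for retransmits ...
journalctl -u corosync -f

# Re-enable HA afterwards.
mv /root/resources.cfg.bak /etc/pve/ha/resources.cfg
systemctl start pve-ha-crm
systemctl start pve-ha-lrm
```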

> 
> > 
> > It's really possible that it's a corosync bug (I remember having this kind of error with PVE 7.x)
> > 
> I'm leaning toward a similar conclusion, but I'm still lacking an 
> understanding of how corosync/watchdog is handled in Proxmox.
> 
> For example, I still don't know which service updates watchdog-mux. 
> Is it corosync (but no "watchdog_device" is set in corosync.conf, and per 
> the manual, "if unset, empty or "off", no watchdog is used.") or is it pve-ha-lrm?
The watchdog-mux service is fed by the LRM service.
The LRM holds a lock in /etc/pve when it becomes active. This allows the node to fence itself, since the watchdog is no longer updated once the node drops out of quorum. By default the softdog is used, but it can be changed to a hardware watchdog in /etc/default/pve-ha-manager.
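For reference, switching to a hardware watchdog is a matter of setting the module in that file, e.g. (ipmi_watchdog is just an example module for IPMI-based hardware; use whichever module matches your board):

```
# /etc/default/pve-ha-manager
# Set to load a hardware watchdog module instead of the softdog.
WATCHDOG_MODULE=ipmi_watchdog
```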

> 
> I think that, after the migration, my best shot is to upgrade the 
> cluster, but I have to understand if newer libcephfs client libraries 
> support old Ceph clusters.
Ceph usually guarantees compatibility across two-ish major versions (e.g. Quincy -> Squid, Pacific -> Reef), unless stated otherwise.
A bigger version difference usually works as well, but it is strongly recommended to upgrade Ceph, as numerous bugs have been fixed over the past years.
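Before and after upgrading, you can check what is actually connected to the cluster with the standard Ceph CLI (run on a monitor/admin node); this helps spot old clients before raising any cluster-level requirements:

```sh
# Show the Ceph release of all daemons and connected clients.
ceph versions

# Show the feature bits/release of connected clients; useful for
# spotting old librbd/libcephfs clients still in use.
ceph features
```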

Cheers,
Alwin
--
croit GmbH,
Consulting / Training / 24x7 Support
https://www.croit.io/services/proxmox


