[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Thomas Lamprecht t.lamprecht at proxmox.com
Thu Sep 10 20:21:14 CEST 2020


On 10.09.20 13:34, Alexandre DERUMIER wrote:
>>> as said, if the other nodes were not using HA, the watchdog-mux had no
>>> client which could expire.
> 
> sorry, maybe I explained it badly,
> but all my nodes had HA enabled.
> 
> I have double-checked the lrm_status JSON files from my morning backup, taken 2h before the problem,
> they were all in "active" state. ("state":"active","mode":"active" )
> 

OK, so all nodes had a connection to the watchdog-mux open. This shifts the
suspicion back to pmxcfs and/or corosync.

> I don't know why node7 didn't reboot, the only difference is that it was the crm master.
> (I think the crm also resets the watchdog counter? Maybe the behaviour is different from the lrm?)

The watchdog-mux stops updating the real watchdog as soon as any client disconnects
or times out. It does not know which client (daemon) that was.
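
To make that concrete, here is a rough toy model of the behaviour described
above. It is only a sketch, not the actual watchdog-mux code (which is written
in C): the class name, the method names and the 10 second client timeout are
assumptions for illustration, and the handling of a clean client shutdown is
left out entirely.

```python
class WatchdogMux:
    """Toy model: several client daemons share one real watchdog."""

    def __init__(self, client_timeout=10.0):
        # Assumed timeout value, for illustration only.
        self.client_timeout = client_timeout
        self.last_seen = {}    # client id -> time of last keep-alive
        self.tripped = False   # once set, the real watchdog is never updated again
        self.hw_updates = 0    # number of updates sent to the real watchdog

    def connect(self, client, now):
        self.last_seen[client] = now

    def keepalive(self, client, now):
        if client in self.last_seen:
            self.last_seen[client] = now

    def disconnect(self, client):
        # Any disconnect trips the mux; the identity of the client is not kept.
        if self.last_seen.pop(client, None) is not None:
            self.tripped = True

    def tick(self, now):
        """Periodic loop: update the real watchdog only while all clients are alive."""
        if any(now - t > self.client_timeout for t in self.last_seen.values()):
            self.tripped = True
        if not self.tripped:
            self.hw_updates += 1
        return not self.tripped
```

In this model it makes no difference whether the LRM or the CRM connection
expires: a single late client is enough to stop the updates, and a later
keep-alive from that client does not resume them.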

>>> above lines also indicate very high load. 
>>> Do you have some monitoring which shows the CPU/IO load before/during this event? 
> 
> load (1, 5, 15) was: 6 (for 48 cores), CPU usage: 23%
> no iowait on disk (VMs are on a remote Ceph, only Proxmox services are running on the local SSD)
> 
> so nothing strange here :/

Hmm, the long loop times could then be the effect of a pmxcfs read or write
operation being (temporarily) stuck.





