[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Alexandre DERUMIER aderumier at odiso.com
Sat Sep 5 15:32:41 CEST 2020


> Something like an extra heartbeat between the node daemons, and a check that we also have quorum with these heartbeats?

>
>>Was this even related to corosync? What exactly caused the reboot?

Hi Dietmar,

What I'm 100% sure of is that the watchdog rebooted all the servers (I have the watchdog trace in IPMI).

That happened just after shutting down the server.
What is strange is that the corosync logs on all servers show that they correctly saw the node go down and still saw the other nodes.

So, I really don't know.
Maybe corosync was hanging?


I don't have any other logs from crm/lrm/pmxcs...

I'm really blind. :/




----- Original Message -----
From: "dietmar" <dietmar at proxmox.com>
To: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>, "aderumier" <aderumier at odiso.com>
Cc: "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Friday, 4 September 2020 17:42:45
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

> Do you think it would be possible to add an extra, optional layer of security checks, not related to corosync?

I would try to find the bug instead. 

> I have been afraid of this corosync bug for years and still don't use HA (or rather, I tried enabling it 2 months ago, and that caused a disaster yesterday).
> 
> Something like an extra heartbeat between the node daemons, and a check that we also have quorum with these heartbeats?

Was this even related to corosync? What exactly caused the reboot? 



