[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Alexandre DERUMIER aderumier at odiso.com
Tue Sep 15 11:35:57 CEST 2020


Hi,

I have finally reproduced it!

But this is with a corosync restart run from cron every minute, on node1.

Then the lrm was stuck for too long (around 60s), and softdog was triggered on multiple other nodes.


Here are the logs, with full corosync debug, from the time of the last corosync restart.

node1 (where corosync is restarted every minute)
https://gist.github.com/aderumier/c4f192fbce8e96759f91a61906db514e

node2
https://gist.github.com/aderumier/2d35ea05c1fbff163652e564fc430e67

node5
https://gist.github.com/aderumier/df1d91cddbb6e15bb0d0193ed8df9273




I'll prepare logs from the previous corosync restart, as the lrm seems to have already been stuck before that.


----- Original Message -----
From: "aderumier" <aderumier at odiso.com>
To: "dietmar" <dietmar at proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>, "Thomas Lamprecht" <t.lamprecht at proxmox.com>
Sent: Tuesday, 15 September 2020 10:42:15
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>pmxcfs cannot send anything in that case, so it is impossible that this has effects on other nodes. 
Yes, I understand that, but I was thinking of the case where corosync is in its stopping phase (not totally stopped).

Something racy (I really don't know).

I just sent 2 patches to start pve-cluster && corosync after syslog; this way we'll have shutdown logs too.


(I'm currently trying to reproduce the problem with reboot loops, but I still can't reproduce it :/ )





----- Original Message -----
From: "dietmar" <dietmar at proxmox.com>
To: "aderumier" <aderumier at odiso.com>
Cc: "Thomas Lamprecht" <t.lamprecht at proxmox.com>, "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
Sent: Tuesday, 15 September 2020 09:13:53
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

> I ask the question because both times I had the problem, it was when shutting down a server.
> So maybe some strange behaviour occurs when both corosync && pmxcfs are stopped at the same time?

pmxcfs cannot send anything in that case, so it is impossible that this has effects on other nodes. 


_______________________________________________ 
pve-devel mailing list 
pve-devel at lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 



