[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Alexandre DERUMIER aderumier at odiso.com
Wed Sep 30 08:09:15 CEST 2020


Hi,

Some news: my last test has been running for 14h now, and I haven't had any problems :)

So, it seems that it is indeed fixed! Congratulations!



I wonder if it could be related to this forum user
https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871/

His problem is that after a corosync lag (he has 1 cluster stretched across 2 DCs 10 km apart, so I think he sometimes has some small lag),
1 node floods the other nodes with a lot of UDP packets (which makes things worse, as the corosync CPU usage goes to 100% / overloaded, and then it can't see the other nodes).

I had this problem 6 months ago after shutting down a node, which is why I think it could "maybe" be related.

So, I wonder if it could be the same pmxcfs bug, where something is looping or sending packets again and again.

The forum user seems to have had the problem multiple times within a few weeks, so maybe he'll be able to test the new fixed pmxcfs and tell us if it fixes this bug too.



----- Original Message -----
From: "aderumier" <aderumier at odiso.com>
To: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
Sent: Tuesday, 29 September 2020 15:52:18
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>huge thanks for all the work on this btw! 

huge thanks to you ! ;) 


>>I think I've found a likely culprit (a missing lock around a 
>>non-thread-safe corosync library call) based on the last logs (which 
>>were now finally complete!). 

YES :) 


>>if feedback from your end is positive, I'll whip up a proper patch 
>>tomorrow or on Thursday. 

I'm going to launch a new test right now ! 


----- Original Message -----
From: "Fabian Grünbichler" <f.gruenbichler at proxmox.com>
To: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
Sent: Tuesday, 29 September 2020 15:28:19
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

huge thanks for all the work on this btw! 

I think I've found a likely culprit (a missing lock around a 
non-thread-safe corosync library call) based on the last logs (which 
were now finally complete!). 
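
To illustrate the general kind of bug meant by "a missing lock around a non-thread-safe library call", here is a minimal sketch. It is not the pmxcfs code (which is written in C) and does not use any corosync API; `UnsafeCounter`, `LockedCounter` and `run_workers` are hypothetical names invented for this example. The idea is only that every call into a handle that is not thread-safe goes through one shared lock.

```python
import threading


class UnsafeCounter:
    """Hypothetical stand-in for a library handle that is NOT thread-safe."""

    def __init__(self):
        self.value = 0

    def bump(self):
        # Read-modify-write with no internal locking: concurrent callers
        # may interleave between these two statements and lose updates.
        current = self.value
        self.value = current + 1


class LockedCounter:
    """Wrapper that serializes every call into the unsafe handle."""

    def __init__(self, handle):
        self._handle = handle
        self._lock = threading.Lock()

    def bump(self):
        # Only one thread at a time may be inside the unsafe call.
        with self._lock:
            self._handle.bump()


def run_workers(target, n_threads=8, n_calls=10_000):
    """Call target.bump() n_calls times from each of n_threads threads."""

    def worker():
        for _ in range(n_calls):
            target.bump()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    handle = UnsafeCounter()
    run_workers(LockedCounter(handle))
    print(handle.value)  # 8 threads x 10_000 calls, all serialized
```

With the wrapper, the final count always equals the number of calls made. Calling `run_workers(handle)` directly on the unwrapped object gives no such guarantee.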

rebuilt packages with a proof-of-concept-fix: 

23b03a48d3aa9c14e86fe8cf9bbb7b00bd8fe9483084b9e0fd75fd67f29f10bec00e317e2a66758713050f36c165d72f107ee3449f9efeb842d3a57c25f8bca7 pve-cluster_6.1-8_amd64.deb 
9e1addd676513b176f5afb67cc6d85630e7da9bbbf63562421b4fd2a3916b3b2af922df555059b99f8b0b9e64171101a1c9973846e25f9144ded9d487450baef pve-cluster-dbgsym_6.1-8_amd64.deb 
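
A downloaded package can be checked against the checksums above before installing it. The sketch below assumes the values are SHA-512 digests (they are 128 hex digits long, which matches SHA-512, but the mail does not say so explicitly); the helper names `sha512_of_file` and `matches` are made up for this example, and the file path is wherever the .deb was saved.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's digest equals the expected hex string."""
    return sha512_of_file(path) == expected_hex.strip().lower()
```

For example, `matches("pve-cluster_6.1-8_amd64.deb", "<checksum from the list above>")` should return True for an intact download, assuming the file is in the current directory.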

I removed some logging statements which are no longer needed, so output 
is a bit less verbose again. if you are not able to trigger the issue 
with this package, feel free to remove the -debug and let it run for a 
little longer without the massive logs. 

if feedback from your end is positive, I'll whip up a proper patch 
tomorrow or on Thursday. 


_______________________________________________ 
pve-devel mailing list 
pve-devel at lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 

