[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Wed Sep 30 08:26:25 CEST 2020

Hi,

On 30.09.20 08:09, Alexandre DERUMIER wrote:
> Some news: my last test has been running for 14h now, and I haven't had any problems :)
> 

Great! Thanks for all your testing time. This would have been much harder,
if even possible at all, without you providing so much testing effort on a
production(!) cluster - appreciated!

Naturally many thanks to Fabian too, for reading so many logs without going
insane :-)

> So, it seems that it is indeed fixed! Congratulations!
> 

honza confirmed Fabian's suspicion that cpg_mcast_joined lacks thread-safety
guarantees, which was sadly not documented, so this is surely a bug. Let's
hope it is the last of such hard-to-reproduce ones.
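
For illustration only, one common way to cope with a call that gives no
thread-safety guarantees is to serialize all callers behind a single lock.
The sketch below is not the actual pmxcfs change: FakeTransport.mcast is a
hypothetical stand-in for a non-thread-safe send such as cpg_mcast_joined,
and SerializedSender and run_demo are made-up names.

```python
import threading
import time


class SerializedSender:
    """Wrap a send callable that must not be entered by two threads at once."""

    def __init__(self, send):
        self._send = send
        self._lock = threading.Lock()

    def send(self, payload):
        # Only one thread at a time may be inside the wrapped call.
        with self._lock:
            return self._send(payload)


class FakeTransport:
    """Hypothetical stand-in for the real library; records overlapping calls."""

    def __init__(self):
        self.active = 0       # callers currently inside mcast()
        self.max_active = 0   # highest overlap ever observed
        self.sent = []

    def mcast(self, payload):
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.0005)    # widen the window in which calls could overlap
        self.sent.append(payload)
        self.active -= 1


def run_demo(n_threads=4, per_thread=25):
    """Send from several threads through one SerializedSender."""
    transport = FakeTransport()
    sender = SerializedSender(transport.mcast)

    def worker(tid):
        for i in range(per_thread):
            sender.send((tid, i))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return transport
```

Because every call goes through the same lock, the stand-in never sees two
callers at once. The price is that senders block each other, which is
usually acceptable for low-rate cluster messages.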

> 
> 
> I wonder if it could be related to this forum user
> https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871/
> 
> His problem is that after a corosync lag (he has 1 cluster stretched over 2 DCs at 10km distance, so I think he sometimes has some small lag),
> 1 node floods the other nodes with a lot of UDP packets (making things worse, as corosync CPU goes to 100% / overloaded and then can't see the other nodes).

I can imagine this bug showing up as a side effect of such a flood, where
partition changes happen. I'm not so sure it can be the direct cause of the
flood, though.

> 
> I had this problem 6 months ago after shutting down a node, that's why I'm thinking it could "maybe" be related.
> 
> So, I wonder if it could be the same pmxcfs bug, where something is looping or sending packets again and again.
> 
> The forum user seems to have had the problem multiple times in some weeks, so maybe he'll be able to test the new fixed pmxcfs and tell us if it fixes this bug too.

Testing it once available would surely be a good idea for them.