[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Wed Sep 16 17:17:02 CEST 2020

I have reproduced it again, with the coredump this time

corosync restart: 17:05:27

http://odisoweb1.odiso.net/pmxcfs-corosync2.log

bt full

https://gist.github.com/aderumier/466dcc4aedb795aaf0f308de0d1c652b

coredump

http://odisoweb1.odiso.net/core.7761.gz

----- Original Message -----
From: "Thomas Lamprecht" <t.lamprecht at proxmox.com>
To: "aderumier" <aderumier at odiso.com>, "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
Sent: Wednesday, 16 September 2020 16:45:12
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/16/20 3:15 PM, Alexandre DERUMIER wrote: 
> I have reproduced it again, with pmxcfs in debug mode
> 
> corosync restart at 15:02:10, and it was already blocked on other nodes at 15:02:12
> 
> The pmxcfs was still logging after the lock. 
> 
> 
> here is the log on node1, where corosync has been restarted
> 
> http://odisoweb1.odiso.net/pmxcfs-corosync.log 
> 

Thanks for those, I need a bit of time to sift through them. It seems like either
dfsm gets out of sync or we do not get an ACK reply from cpg_send.

A full core dump would still be nice; in gdb:
generate-core-file 

PS: instead of manually switching to threads you can do: 
thread apply all bt full 

to get a backtrace for all threads in one command
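
The two gdb commands above can also be run non-interactively in one go. This is only a rough sketch: the function name and the /tmp output paths are made up for illustration, and it assumes gdb and pidof are installed and that it is run as root on the affected node.

```shell
# Hypothetical helper: attach gdb to a running process by name, print a full
# backtrace of every thread, and write a core file, without an interactive session.
# The output paths under /tmp are placeholders; adjust them as needed.
dump_threads_and_core() {
    name="$1"
    # pidof -s prints a single PID; it fails if no such process is running
    pid="$(pidof -s "$name")" || { echo "$name is not running" >&2; return 1; }
    gdb -p "$pid" -batch \
        -ex "thread apply all bt full" \
        -ex "generate-core-file /tmp/$name.core" \
        > "/tmp/$name-bt.txt" 2>&1
}
```

Calling `dump_threads_and_core pmxcfs` should then leave the backtraces in /tmp/pmxcfs-bt.txt and the core in /tmp/pmxcfs.core; gdb detaches from the process once batch mode finishes.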