[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Fabian Grünbichler f.gruenbichler at proxmox.com
Fri Sep 25 11:19:04 CEST 2020


On September 25, 2020 9:15 am, Alexandre DERUMIER wrote:
> 
> Another hang, this time on corosync stop, coredump available
> 
> http://odisoweb1.odiso.net/test3/ 
> 
>
> node1
> ----
> stop corosync : 09:03:10
> 
> node2: /etc/pve locked
> ------
> Current time : 09:03:10

Thanks, these all indicate the same symptoms:

1. cluster config changes (corosync goes down/comes back up in this case)
2. pmxcfs starts the sync process
3. all (online) nodes receive sync request for dcdb and status
4. all nodes send state for dcdb and status via CPG
5. all nodes receive state for dcdb and status from all nodes except one (node 13 in test 2, node 10 in test 3; see the sketch below)
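
For illustration, here is a minimal sketch of the receiving side of
steps 3 and 5, assuming the standard libcpg callback API; the names are
made up and this is not the actual pmxcfs code:

#include <stdint.h>
#include <stdio.h>
#include <corosync/cpg.h>

/* illustrative only: the CPG deliver callback is where a member's state
 * message (steps 3-5) arrives on each node */
static void
deliver_cb(cpg_handle_t handle, const struct cpg_name *group_name,
           uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
    /* in step 5, this point is never reached for one node's message */
    fprintf(stderr, "received %zu bytes from node %u\n", msg_len, nodeid);
}

static cpg_callbacks_t callbacks = {
    .cpg_deliver_fn = deliver_cb,
    .cpg_confchg_fn = NULL, /* membership changes handled separately */
};

/* cpg_initialize(&handle, &callbacks) and cpg_join(handle, &group)
 * register this callback; cpg_dispatch() then invokes it per message. */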

In step 5, there is no trace of the message on the receiving side, even
though the sending node does not log an error. As before, the hang
itself is just a side effect of the state machine ending up in a state
that should be short-lived (syncing, waiting for state from all nodes)
without making any progress. The code and theory say this should not
happen: either sending the state fails, which triggers the node to leave
the CPG (restarting the sync); or a node drops out of quorum, which
triggers a config change (which in turn restarts the sync); or we get
all states from all nodes and the sync proceeds. This looks to me like a
fundamental assumption/guarantee does not hold.
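
As a rough model of that guarantee (simplified; this is not the real
pmxcfs state machine and all names are illustrative), the waiting state
should only ever be left via one of those three paths:

#include <stddef.h>

/* simplified sketch of the sync wait described above */
enum sync_mode { SYNC_WAIT_STATE, SYNC_DONE };

struct sync_ctx {
    enum sync_mode mode;
    size_t pending; /* members we still expect a state message from */
};

/* deliver path: a state message from one member arrived */
static void state_received(struct sync_ctx *ctx)
{
    if (ctx->mode == SYNC_WAIT_STATE && --ctx->pending == 0)
        ctx->mode = SYNC_DONE; /* all states in: sync proceeds */
}

/* confchg path: membership changed (e.g. after a failed send made the
 * node leave the CPG, or a node dropped out of quorum), so the sync
 * restarts against the new member list */
static void config_changed(struct sync_ctx *ctx, size_t member_count)
{
    ctx->mode = SYNC_WAIT_STATE;
    ctx->pending = member_count;
}

In tests 2 and 3, the sending node logs no error, no membership change
happens, and one state message never arrives, so none of these exits
fire and the node stays in the waiting state indefinitely.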

I will rebuild once more, modifying the send code a bit to log a lot
more detail when sending state messages. It would be great if you could
repeat the test with that build, as we are still unable to reproduce the
issue ourselves. Hopefully those logs will then indicate whether this is
a corosync/knet bug, or whether the issue is somewhere in our state
machine code. So far it looks more like the former.
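
To give an idea of what I mean (a hypothetical sketch, the actual patch
will likely differ), the send path could log the group, sender, length
and result of every state message:

#include <stdint.h>
#include <stdio.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

/* hypothetical instrumentation: wrap the state send and log enough
 * context to correlate it with the receiving side's logs */
static cs_error_t
send_state_logged(cpg_handle_t handle, const struct cpg_name *group,
                  uint32_t local_nodeid, const void *state, size_t len)
{
    struct iovec iov = { .iov_base = (void *)state, .iov_len = len };

    /* CPG_TYPE_AGREED: agreed (totally ordered) delivery to all members */
    cs_error_t res = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);

    fprintf(stderr, "state message: group=%.*s sender=%u len=%zu result=%d\n",
            (int)group->length, group->value, local_nodeid, len, res);

    return res;
}

That should make it possible to tell on which side the message
disappears.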




