[pve-devel] corosync bug: cluster break after 1 node clean shutdown
Alexandre DERUMIER
aderumier at odiso.com
Fri Sep 25 11:46:46 CEST 2020
>>I will rebuild once more modifying the send code a bit to log a lot more
>>details when sending state messages, it would be great if you could
>>repeat with that as we are still unable to reproduce the issue.
OK, no problem. I can reproduce it easily; I'll run a new test when you
send the new version.
(And thanks again for debugging this, because it's really beyond my
expertise.)
----- Original Message -----
From: "Fabian Grünbichler" <f.gruenbichler at proxmox.com>
To: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
Sent: Friday, September 25, 2020 11:19:04
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
On September 25, 2020 9:15 am, Alexandre DERUMIER wrote:
>
> Another hang, this time on corosync stop, coredump available
>
> http://odisoweb1.odiso.net/test3/
>
>
> node1
> ----
> stop corosync : 09:03:10
>
> node2: /etc/pve locked
> ------
> Current time : 09:03:10
Thanks, these all indicate the same symptoms:
1. cluster config changes (corosync goes down/comes back up in this case)
2. pmxcfs starts the sync process
3. all (online) nodes receive the sync request for dcdb and status
4. all nodes send their state for dcdb and status via CPG (see the sketch below)
5. all nodes receive state for dcdb and status from all nodes except one (13 in test 2, 10 in test 3)
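For reference, the state broadcast in step 4 comes down to a single CPG
multicast. A minimal sketch, using the real corosync CPG API but with an
illustrative function name and an opaque state buffer (this is not the
actual pmxcfs code):

    #include <corosync/cpg.h>
    #include <sys/uio.h>

    /* Illustrative sketch: broadcast this node's state to every member
     * of the CPG (step 4 above). */
    static cs_error_t
    dfsm_send_state_sketch(cpg_handle_t handle, const void *state, size_t len)
    {
        struct iovec iov = {
            .iov_base = (void *)state,
            .iov_len  = len,
        };

        /* CPG_TYPE_AGREED requests agreed (totally ordered) delivery to
         * all group members -- the guarantee the sync protocol relies on. */
        return cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
    }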
In step 5, there is no trace of the message on the receiving side, even
though the sending node does not log an error. As before, the hang is
just a side-effect of the state machine ending up in a state that should
be short-lived (syncing, waiting for state from all nodes) with no
progress. The code and theory say that this should not happen: either
sending the state fails, triggering the node to leave the CPG (which
restarts the sync); or a node drops out of quorum (triggering a config
change, which restarts the sync); or we get all states from all nodes
and the sync proceeds. This looks to me like a fundamental
assumption/guarantee does not hold...
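To make those three exit paths explicit, here is a minimal sketch of the
intended logic; the structure and all names are illustrative, not the
actual pmxcfs dfsm code:

    #include <stdio.h>

    /* Illustrative stubs -- in the real code these would leave/rejoin
     * the CPG, reset the state machine, and continue the sync. */
    static void leave_cpg_and_restart(void) { puts("leave CPG -> restart sync"); }
    static void restart_sync(void)          { puts("config change -> restart sync"); }
    static void proceed_with_sync(void)     { puts("all states received -> proceed"); }

    typedef struct {
        unsigned int member_count; /* nodes in the current CPG membership */
        unsigned int state_count;  /* state messages received so far */
    } sync_status_t;

    /* Exit path 1: sending our state failed. */
    static void on_send_failure(sync_status_t *s)
    {
        (void)s;
        leave_cpg_and_restart();
    }

    /* Exit path 2: membership changed, e.g. a node dropped out of quorum. */
    static void on_config_change(sync_status_t *s, unsigned int new_member_count)
    {
        s->member_count = new_member_count;
        s->state_count = 0;
        restart_sync();
    }

    /* Exit path 3: a state message arrived; once one has arrived from
     * every member, the short-lived "syncing" state ends. */
    static void on_state_received(sync_status_t *s)
    {
        if (++s->state_count == s->member_count)
            proceed_with_sync();
    }

The observed hang means none of the three paths fired: one node's state
message was sent without error but silently never delivered.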
I will rebuild once more, modifying the send code a bit to log a lot
more details when sending state messages. It would be great if you could
repeat the test with that build, as we are still unable to reproduce the
issue.
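For illustration, the kind of instrumentation meant here might look like
the following (a hypothetical sketch, not the actual patch):

    #include <corosync/cpg.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/uio.h>

    /* Hypothetical logging wrapper around the state send -- just the
     * kind of detail the rebuilt package would record. */
    static cs_error_t
    dfsm_send_state_logged(cpg_handle_t handle, uint32_t local_nodeid,
                           const void *state, size_t len)
    {
        struct iovec iov = { .iov_base = (void *)state, .iov_len = len };

        fprintf(stderr, "sending state: local nodeid %u, %zu bytes\n",
                local_nodeid, len);

        cs_error_t result = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);

        fprintf(stderr, "cpg_mcast_joined: result %d (CS_OK = %d)\n",
                (int)result, (int)CS_OK);
        return result;
    }

A CS_OK result on the sender paired with a missing deliver callback on
one receiver would point at corosync/knet rather than at our state
machine.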
Hopefully those logs will then indicate whether this is a corosync/knet
bug, or whether the issue is somewhere in our state machine code. So far
it looks more like the former...
_______________________________________________
pve-devel mailing list
pve-devel at lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel