[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Tue Sep 15 11:46:51 CEST 2020

On 9/15/20 11:35 AM, Alexandre DERUMIER wrote:
> Hi,
> 
> I have finally reproduced it!
> 
> But this is with a corosync restart run from cron every minute, on node1.
>
> Then the lrm was stuck for too long (around 60s) and softdog was triggered on multiple other nodes.
> 
> Here are the logs with full corosync debug at the time of the last corosync restart.
> 
> node1 (where corosync is restarted every minute)
> https://gist.github.com/aderumier/c4f192fbce8e96759f91a61906db514e
> 
> node2
> https://gist.github.com/aderumier/2d35ea05c1fbff163652e564fc430e67
> 
> node5
> https://gist.github.com/aderumier/df1d91cddbb6e15bb0d0193ed8df9273
> 
> I'll prepare logs from the previous corosync restart, as the lrm seems to have already been stuck before that.

Yeah, that would be good, as the lrm does seem to get stuck at around 10:46:21:

> Sep 15 10:47:26 m6kvm2 pve-ha-lrm[3736]: loop take too long (65 seconds)
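
For reference, the 10:46:21 estimate is just the log timestamp minus the reported loop duration. A rough Python sketch of that arithmetic follows. It assumes the warning is logged when the slow loop iteration finishes, and the helper name stall_start and the year default are only illustrative, not part of any PVE tooling:

```python
import re
from datetime import datetime, timedelta

# Matches syslog-style lines like the pve-ha-lrm warning quoted above.
PATTERN = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"pve-ha-lrm\[\d+\]: loop take too long \((?P<secs>\d+) seconds\)"
)

def stall_start(line, year=2020):
    """Return (host, estimated stall start), or None if the line does not match.

    Assumption: the message is emitted at the end of the slow iteration,
    so the stall began roughly <secs> seconds before the timestamp.
    Syslog lines carry no year, so it is passed in separately.
    """
    m = PATTERN.match(line)
    if m is None:
        return None
    end = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
    return m.group("host"), end - timedelta(seconds=int(m.group("secs")))

line = "Sep 15 10:47:26 m6kvm2 pve-ha-lrm[3736]: loop take too long (65 seconds)"
host, start = stall_start(line)
print(host, start.strftime("%H:%M:%S"))  # m6kvm2 10:46:21
```

Running the same helper over the logs from the other nodes should show whether their stalls started at the same moment or only after the restart on node1.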