[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Fabian Grünbichler f.gruenbichler at proxmox.com
Tue Sep 29 10:51:32 CEST 2020


On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote:
> Here is a new test: http://odisoweb1.odiso.net/test5
> 
> This occurred at corosync start
> 
> 
> node1:
> -----
> start corosync: 17:30:19
> 
> 
> node2: /etc/pve locked
> --------------
> Current time: 17:30:24
> 
> 
> I took backtraces of all nodes at the same time with parallel ssh at 17:35:22,
> 
> and coredumps of all nodes at the same time with parallel ssh at 17:42:26.
> 
> 
> (Note that this time, /etc/pve was still locked after the backtrace/coredump.)
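
For anyone reproducing the capture: taking backtraces and coredumps on all
nodes at the same moment can be done roughly as below. This is only a sketch;
parallel-ssh (pssh), gdb, and gcore are assumed to be installed, the host list
is a placeholder, and pmxcfs is assumed to be the process of interest.

    # batch backtraces of all threads, on all nodes at once
    parallel-ssh -i -H "node1 node2 node3" \
        'gdb --batch -p "$(pidof pmxcfs)" -ex "thread apply all bt"' > backtraces.txt

    # coredumps of the same process on all nodes at once
    parallel-ssh -i -H "node1 node2 node3" \
        'gcore -o /tmp/pmxcfs.core "$(pidof pmxcfs)"'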

Okay, so this time two more log lines got printed on the (again) 
problem-causing node #13, but it still stops logging at a point where 
that makes no sense.

I rebuilt the packages with a change to how the logging is set up (I now 
suspect that some messages might get dropped if the logging throughput 
is high enough):

f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa  pve-cluster_6.1-8_amd64.deb
b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7  pve-cluster-dbgsym_6.1-8_amd64.deb
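
To double-check the downloads before installing, verify them against the 
sums above (the sums file name below is just an example):

    sha512sum pve-cluster_6.1-8_amd64.deb pve-cluster-dbgsym_6.1-8_amd64.deb
    # or, with the two checksum lines above saved to a file:
    sha512sum -c pve-cluster_6.1-8_amd64.sha512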

Let's hope this gets us the information we need. Please repeat test5 
with these packages.
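
As an aside, one generic mechanism for silently dropped log messages under 
high throughput is journald's rate limiting; whether it plays a role here 
is just a guess, but it is cheap to rule out:

    # suppression notices show up in the journal itself
    journalctl --since "17:30:00" | grep -i suppressed

    # to rule rate limiting out, set in /etc/systemd/journald.conf:
    #   RateLimitIntervalSec=0
    #   RateLimitBurst=0
    # and restart journald afterwards:
    systemctl restart systemd-journald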

Is there anything special about node 13? Network topology, slower 
hardware, ... ?
