[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Alexandre DERUMIER aderumier at odiso.com
Tue Sep 29 12:52:44 CEST 2020


Here is a new test:

http://odisoweb1.odiso.net/test6/

node1
-----
start corosync : 12:08:33


node2 (/etc/pve locked)
-----
Current time : 12:08:39
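
(The exact check isn't shown in the thread; as an assumption, confirming from node2 that /etc/pve is blocked could be done like this, with the path and timeout purely illustrative:)

  date                                                                     # the "Current time" above
  timeout 10 touch /etc/pve/.locktest && echo writable || echo "blocked / timed out"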


node1 (stop corosync: unlocks /etc/pve)
-----
12:28:11 : systemctl stop corosync


backtraces: 12:26:30


coredump : 12:27:21
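
(For reference, a hedged sketch of how the backtrace and coredump of the pmxcfs process could be collected on a node; gdb/gcore and the output paths are assumptions, not the exact commands used:)

  PID=$(pidof pmxcfs)                                    # pmxcfs is the daemon behind /etc/pve
  gdb -p "$PID" --batch -ex 'thread apply all bt full' \
      > /tmp/pmxcfs-bt-$(hostname).txt                   # backtrace of all threads
  gcore -o /tmp/pmxcfs-core "$PID"                       # core dump without stopping the process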


----- Original Message -----
From: "aderumier" <aderumier at odiso.com>
To: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
Sent: Tuesday, 29 September 2020 11:37:41
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>with a change to how the logging is set up (I now suspect that some 
>>messages might get dropped if the logging throughput is high enough). 
>>Let's hope this gets us the information we need. Please repeat test5 
>>with these packages. 

I'll test this afternoon 

>>is there anything special about node 13? network topology, slower 
>>hardware, ... ? 

No, nothing special; all nodes have exactly the same hardware: CPU (24 cores / 48 threads, 3 GHz), memory and disks. 

This node is at around 10% CPU usage, and its load average is around 5. 

----- Original Message ----- 
From: "Fabian Grünbichler" <f.gruenbichler at proxmox.com> 
To: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com> 
Sent: Tuesday, 29 September 2020 10:51:32 
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On September 28, 2020 5:59 pm, Alexandre DERUMIER wrote: 
> Here is a new test: http://odisoweb1.odiso.net/test5 
> 
> This occurred at corosync start 
> 
> 
> node1: 
> ----- 
> start corosync : 17:30:19 
> 
> 
> node2: /etc/pve locked 
> -------------- 
> Current time : 17:30:24 
> 
> 
> I took backtraces of all nodes at the same time with parallel SSH at 17:35:22 
> 
> and coredumps of all nodes at the same time with parallel SSH at 17:42:26 
> 
> 
> (Note that this time, /etc/pve was still locked after backtrace/coredump) 
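
(A hedged sketch of how such a simultaneous collection with parallel SSH might look; the host file, output paths and pssh flags are assumptions:)

  # run the same gdb backtrace on every node at once via pssh (parallel-ssh)
  parallel-ssh -h nodes.txt -t 0 -i \
      "gdb -p \$(pidof pmxcfs) --batch -ex 'thread apply all bt full' > /tmp/pmxcfs-bt-\$(hostname).txt"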

Okay, so this time two more log lines got printed on the (again) problem-causing 
node #13, but it still stops logging at a point where that makes no sense. 

I rebuilt the packages: 

f318f12e5983cb09d186c2ee37743203f599d103b6abb2d00c78d312b4f12df942d8ed1ff5de6e6c194785d0a81eb881e80f7bbfd4865ca1a5a509acd40f64aa pve-cluster_6.1-8_amd64.deb 
b220ee95303e22704793412e83ac5191ba0e53c2f41d85358a247c248d2a6856e5b791b1d12c36007a297056388224acf4e5a1250ef1dd019aee97e8ac4bcac7 pve-cluster-dbgsym_6.1-8_amd64.deb 
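
(As a hedged example, verifying the downloaded .debs against the SHA-512 sums above and installing them could look like this; the apt invocation is an assumption:)

  sha512sum pve-cluster_6.1-8_amd64.deb pve-cluster-dbgsym_6.1-8_amd64.deb   # must match the hashes above
  apt install ./pve-cluster_6.1-8_amd64.deb ./pve-cluster-dbgsym_6.1-8_amd64.deb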

with a change to how the logging is set up (I now suspect that some 
messages might get dropped if the logging throughput is high enough). 
Let's hope this gets us the information we need. Please repeat test5 
with these packages. 

is there anything special about node 13? network topology, slower 
hardware, ... ? 


_______________________________________________ 
pve-devel mailing list 
pve-devel at lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 

