[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Alexandre DERUMIER aderumier at odiso.com
Tue Sep 22 07:43:53 CEST 2020


I ran a test with "kill -9 <pid of corosync>" and saw a hang of around 20s on the other nodes,
but after that they became available again.


So the problem really seems tied to corosync being in its shutdown phase while pmxcfs is still running.

So for now, as a workaround, I have changed

/lib/systemd/system/pve-cluster.service

#Wants=corosync.service
#Before=corosync.service
Requires=corosync.service
After=corosync.service


This way, pve-cluster is stopped before corosync at shutdown, and if I restart corosync, pve-cluster is stopped first.
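
A minimal sketch of applying the same change to a copy of the unit instead of editing the packaged file under /lib (unit files in /etc/systemd/system take precedence over /lib/systemd/system). The function name patch_unit is hypothetical, and the sketch assumes the unit contains exactly the Wants=corosync.service and Before=corosync.service lines shown above; it replaces them rather than commenting them out.

```shell
#!/bin/bash
# Hypothetical helper (not part of pve-cluster): read a unit file on stdin and
# swap the corosync dependency lines so that pve-cluster requires corosync and
# is ordered after it. All other lines pass through unchanged.
patch_unit() {
    sed -e 's/^Wants=corosync\.service$/Requires=corosync.service/' \
        -e 's/^Before=corosync\.service$/After=corosync.service/'
}

# Usage (as root, on a test node first):
#   patch_unit < /lib/systemd/system/pve-cluster.service \
#       > /etc/systemd/system/pve-cluster.service
#   systemctl daemon-reload
#   systemctl show -p Requires -p After pve-cluster.service
```

The last command only prints the resulting dependencies, so the change can be checked before rebooting or restarting corosync.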




----- Original Message -----
From: "aderumier" <aderumier at odiso.com>
To: "Thomas Lamprecht" <t.lamprecht at proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
Sent: Monday, 21 September 2020 01:54:59
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Hi, 

I have done a new test, this time with "systemctl stop corosync", wait 15s, "systemctl start corosync", wait 15s. 
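
A minimal sketch of that stop/wait/start cycle, assuming the sequence was repeated in a loop like the earlier restart script. The function name cycle_corosync, its parameters, and the injectable control command are hypothetical and only there for illustration.

```shell
#!/bin/bash
# Hypothetical helper: stop corosync, wait, start it again, wait, and repeat.
#   $1 = number of cycles (default 10)
#   $2 = pause in seconds after each stop/start (default 15)
#   $3 = control command (default systemctl; pass "echo" for a dry run)
cycle_corosync() {
    local cycles=${1:-10} pause=${2:-15} ctl=${3:-systemctl}
    local i
    for ((i = 1; i <= cycles; i++)); do
        # Log the time of each action to stderr so it can be matched
        # against the moment /etc/pve locks up on the other nodes.
        echo "cycle $i: stop corosync at $(date +%T)" >&2
        "$ctl" stop corosync
        sleep "$pause"
        echo "cycle $i: start corosync at $(date +%T)" >&2
        "$ctl" start corosync
        sleep "$pause"
    done
}

# Usage (as root on node1):  cycle_corosync 100 15
# Dry run without systemd:   cycle_corosync 2 0 echo
```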

I was able to reproduce it: when corosync was stopped on node1, /etc/pve was locked on all other nodes 1 second later.


I started corosync on node1 about 10 minutes later, and /etc/pve became writable again on all nodes.



node1: corosync stop: 01:26:50 
node2 : /etc/pve locked : 01:26:51 

http://odisoweb1.odiso.net/corosync-stop.log 


pmxcfs : bt full all threads: 

https://gist.github.com/aderumier/c45af4ee73b80330367e416af858bc65 

pmxcfs: coredump: http://odisoweb1.odiso.net/core.17995.gz


node1:corosync start: 01:35:36 
http://odisoweb1.odiso.net/corosync-start.log 





BTW, a user following this mailing list thread contacted me by PM on the forum;
he recently had exactly the same problem with a 7-node cluster.
(After shutting down 1 node, /etc/pve was locked until the node was restarted.)



----- Original Message -----
From: "Thomas Lamprecht" <t.lamprecht at proxmox.com>
To: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>, "aderumier" <aderumier at odiso.com>
Sent: Thursday, 17 September 2020 13:35:55
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/17/20 12:02 PM, Alexandre DERUMIER wrote: 
> if needed, here is my test script to reproduce it

Thanks, I'm now using this specific one. I had a similar one (but with all nodes writing)
running here for about two hours without luck so far; let's see how this one behaves.

> 
> node1 (restart corosync until node2 no longer sends the timestamp)
> ----- 
> 
> #!/bin/bash
>
> for i in $(seq 10000); do
>     now=$(date +"%T")
>     echo "restart corosync : $now"
>     systemctl restart corosync
>     # poll for up to 59s; stop once node2's last timestamp is older than 20s
>     for j in {1..59}; do
>         last=$(cat /tmp/timestamp)
>         curr=$(date '+%s')
>         diff=$((curr - last))
>         if [ "$diff" -gt 20 ]; then
>             echo "too old"
>             exit 0
>         fi
>         sleep 1
>     done
> done
> 
> 
> 
> node2 (write to /etc/pve/test each second, then send the last timestamp to node1) 
> ----- 
> #!/bin/bash
> for i in {1..10000}; do
>     now=$(date +"%T")
>     echo "Current time : $now"
>     curr=$(date '+%s')
>     # send the current timestamp to node1, then write to /etc/pve
>     ssh root@node1 "echo $curr > /tmp/timestamp"
>     echo "test" > /etc/pve/test
>     sleep 1
> done
> 

