[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Alexandre DERUMIER aderumier at odiso.com
Mon Sep 21 01:54:59 CEST 2020


Hi,

I ran a new test, this time with "systemctl stop corosync", wait 15s, "systemctl start corosync", wait 15s.

I was able to reproduce it at the corosync stop on node1: 1 second later, /etc/pve was locked on all other nodes.
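
A minimal sketch of that stop/wait/start/wait cycle as a script, in case it helps to repeat the test. The run and cycle_corosync functions and the CYCLES, WAIT and DRY_RUN variables are assumptions made for illustration, not part of the original test. By default it only prints the commands; set DRY_RUN=0 to really execute them.

```shell
#!/bin/bash
# Sketch of the stop / wait / start / wait cycle described above.
# CYCLES, WAIT and DRY_RUN are hypothetical knobs, not part of the original test.
CYCLES=${CYCLES:-1}
WAIT=${WAIT:-15}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = really run them

run() {
    # Log the command with a timestamp, then execute it unless in dry-run mode.
    echo "$(date +%T) $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

cycle_corosync() {
    i=1
    while [ "$i" -le "$CYCLES" ]; do
        run systemctl stop corosync
        run sleep "$WAIT"
        run systemctl start corosync
        run sleep "$WAIT"
        i=$((i + 1))
    done
}

cycle_corosync
```

Saved under a name of your choice, it could be run on node1 with DRY_RUN=0 and, for example, CYCLES=100 set in the environment. The timestamps it prints make it easier to match a lock on the other nodes against the exact stop or start.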


I started corosync again on node1 roughly 9 minutes later, and /etc/pve became writable again on all nodes.
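
To record when /etc/pve stops and resumes accepting writes on a node, a probe along these lines could be used. This is only a sketch: probe_write and PROBE_TIMEOUT are hypothetical names, it reuses the "test" file name from the test script, and it assumes a blocked write can be interrupted by timeout(1).

```shell
#!/bin/bash
# Probe whether a directory accepts a write within a time limit.
# probe_write and PROBE_TIMEOUT are hypothetical, for illustration only.
PROBE_TIMEOUT=${PROBE_TIMEOUT:-5}

probe_write() {
    dir=$1
    # A write to a locked /etc/pve may block or fail, so bound it with timeout(1).
    if timeout "$PROBE_TIMEOUT" sh -c 'echo probe > "$1/test"' sh "$dir" 2>/dev/null; then
        echo "$(date +%T) $dir writable"
    else
        echo "$(date +%T) $dir NOT writable"
        return 1
    fi
}
```

Calling "probe_write /etc/pve" once per second in a loop on each node would give a per-node timeline of the lock.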



node1: corosync stop: 01:26:50
node2: /etc/pve locked: 01:26:51

http://odisoweb1.odiso.net/corosync-stop.log


pmxcfs: bt full, all threads:

https://gist.github.com/aderumier/c45af4ee73b80330367e416af858bc65

pmxcfs: coredump: http://odisoweb1.odiso.net/core.17995.gz


node1: corosync start: 01:35:36
http://odisoweb1.odiso.net/corosync-start.log





BTW, I was contacted by PM on the forum by a user following this mailing list thread,
who recently had exactly the same problem on a 7-node cluster
(after shutting down 1 node, /etc/pve stayed locked until the node was restarted).



----- Original Message -----
From: "Thomas Lamprecht" <t.lamprecht at proxmox.com>
To: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>, "aderumier" <aderumier at odiso.com>
Sent: Thursday, 17 September 2020 13:35:55
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/17/20 12:02 PM, Alexandre DERUMIER wrote: 
> if needed, here my test script to reproduce it 

thanks, I'm now using this specific one. I had a similar one (but with all nodes
writing) running here for ~ two hours without luck yet, let's see how this behaves.

> 
> node1 (restart corosync until node2 doesn't send the timestamp anymore)
> ----- 
> 
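> #!/bin/bash
>
> for i in $(seq 10000); do
>     now=$(date +"%T")
>     echo "restart corosync : $now"
>     systemctl restart corosync
>     # for up to 59s, check that node2 is still updating the timestamp
>     for j in {1..59}; do
>         last=$(cat /tmp/timestamp)
>         curr=$(date '+%s')
>         diff=$((curr - last))
>         # node2 sends one timestamp per second; a gap over 20s
>         # presumably means its write to /etc/pve is blocked
>         if [ "$diff" -gt 20 ]; then
>             echo "too old"
>             exit 0
>         fi
>         sleep 1
>     done
> done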
> 
> 
> 
> node2 (write to /etc/pve/test each second, then send the last timestamp to node1) 
> ----- 
> #!/bin/bash
> for i in {1..10000}; do
>     now=$(date +"%T")
>     echo "Current time : $now"
>     curr=$(date '+%s')
>     # send the timestamp to node1 first, then write to /etc/pve
>     ssh root@node1 "echo $curr > /tmp/timestamp"
>     echo "test" > /etc/pve/test
>     sleep 1
> done
> 




