[pve-devel] corosync bug: cluster break after 1 node clean shutdown
Alexandre DERUMIER
aderumier at odiso.com
Mon Sep 21 01:54:59 CEST 2020
Hi,
I ran a new test, this time with "systemctl stop corosync", wait 15s, "systemctl start corosync", wait 15s.
I was able to reproduce it: when corosync was stopped on node1, /etc/pve became locked on all other nodes 1 second later.
I started corosync again on node1 about 10 minutes later, and /etc/pve became writable again on all nodes.
node1: corosync stop: 01:26:50
node2 : /etc/pve locked : 01:26:51
http://odisoweb1.odiso.net/corosync-stop.log
pmxcfs: bt full, all threads:
https://gist.github.com/aderumier/c45af4ee73b80330367e416af858bc65
pmxcfs: coredump: http://odisoweb1.odiso.net/core.17995.gz
node1: corosync start: 01:35:36
http://odisoweb1.odiso.net/corosync-start.log
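For anyone who wants to repeat the stop / wait / start test described above, here is a minimal sketch of the cycle. It is not the exact script used for this run: the function names run and cycle_corosync, the DRY_RUN switch, and the cycle count argument are assumptions added for illustration. Only the two systemctl commands and the 15s waits come from the test itself.

```shell
#!/bin/bash
# Sketch of the stop / wait / start cycle (hypothetical helper names).
# DRY_RUN is an assumed safety switch: by default commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command with a timestamp; execute it only when DRY_RUN=0.
    echo "$(date +%T) $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

cycle_corosync() {
    # $1 = number of cycles (assumed default 1), $2 = wait in seconds (15 as in the test)
    local cycles="${1:-1}" wait_s="${2:-15}" i
    for ((i = 1; i <= cycles; i++)); do
        run systemctl stop corosync
        run sleep "$wait_s"
        run systemctl start corosync
        run sleep "$wait_s"
    done
}
```

To really restart corosync, set DRY_RUN=0 before sourcing the script and call for example "cycle_corosync 100 15" on node1. The timestamps printed by run can then be compared with the time /etc/pve gets locked on the other nodes.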
BTW, a user following this mailing list thread contacted me by private message on the forum,
and he recently had exactly the same problem with a 7-node cluster.
(After shutting down 1 node, /etc/pve was locked until the node was restarted.)
----- Original Message -----
From: "Thomas Lamprecht" <t.lamprecht at proxmox.com>
To: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>, "aderumier" <aderumier at odiso.com>
Sent: Thursday, 17 September 2020 13:35:55
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
On 9/17/20 12:02 PM, Alexandre DERUMIER wrote:
> if needed, here my test script to reproduce it
thanks, I'm now using this specific one. I had a similar one (but with all nodes writing)
running here for ~ two hours without luck yet, let's see how this one behaves.
>
> node1 (restart corosync until node2 stops sending the timestamp)
> -----
>
> #!/bin/bash
>
> for i in $(seq 10000); do
>     now=$(date +"%T")
>     echo "restart corosync : $now"
>     systemctl restart corosync
>     # watch the timestamp pushed by node2 for up to 59 seconds
>     for j in {1..59}; do
>         last=$(cat /tmp/timestamp)
>         curr=$(date '+%s')
>         diff=$((curr - last))
>         # stop once node2 has not updated the timestamp for more than 20s
>         if [ "$diff" -gt 20 ]; then
>             echo "too old"
>             exit 0
>         fi
>         sleep 1
>     done
> done
>
>
>
> node2 (write to /etc/pve/test each second, then send the last timestamp to node1)
> -----
> #!/bin/bash
> for i in {1..10000}; do
>     now=$(date +"%T")
>     echo "Current time : $now"
>     curr=$(date '+%s')
>     # push the current epoch timestamp to node1
>     ssh root@node1 "echo $curr > /tmp/timestamp"
>     # write through pmxcfs
>     echo "test" > /etc/pve/test
>     sleep 1
> done
>
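The two checks used by the quoted scripts (timestamp older than 20s on node1, one write to /etc/pve/test per second on node2) could also be written as small helpers, which makes them easier to reuse in other reproducers. This is only a sketch: the names is_stale and probe_write and the 5s write timeout are assumptions, not part of the original scripts, and the thread does not say whether a write to a locked /etc/pve blocks or fails immediately.

```shell
#!/bin/bash
# is_stale LAST NOW [MAX_AGE]
# Succeeds when NOW - LAST is greater than MAX_AGE seconds (20 as in the node1 script).
is_stale() {
    local last="$1" now="$2" max_age="${3:-20}"
    # Assumption: a missing or non-numeric timestamp is treated as stale.
    [[ "$last" =~ ^[0-9]+$ ]] || return 0
    (( now - last > max_age ))
}

# probe_write FILE [TIMEOUT]
# Tries a single write and gives up after TIMEOUT seconds (assumed default 5),
# so the calling loop does not hang forever if the write blocks.
# Note: timeout(1) cannot kill a process stuck in uninterruptible sleep.
probe_write() {
    local file="$1" limit="${2:-5}"
    timeout "$limit" bash -c 'echo test > "$1"' _ "$file"
}
```

On node1 the inner loop would then call: is_stale "$(cat /tmp/timestamp)" "$(date +%s)" && echo "too old". On node2 the write becomes: probe_write /etc/pve/test || echo "write failed or timed out".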