[pve-devel] corosync bug: cluster break after 1 node clean shutdown
Alexandre DERUMIER
aderumier at odiso.com
Tue Sep 22 07:43:53 CEST 2020
I ran a test with "kill -9 <pidofcorosync>", and I see a hang of around 20s on the other nodes,
but after that they are available again.
So the problem really seems to occur when corosync is in its shutdown phase while pmxcfs is still running.
So, for now, as a workaround, I have changed
/lib/systemd/system/pve-cluster.service
#Wants=corosync.service
#Before=corosync.service
Requires=corosync.service
After=corosync.service
With this change, pve-cluster is stopped before corosync at shutdown, and if I restart corosync, pve-cluster is stopped first.
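The same change can be sketched as a small script that patches a copy of the unit instead of editing the packaged file in place, since files under /lib/systemd/system are normally replaced on package upgrade and a unit of the same name under /etc/systemd/system takes precedence. A full copy is used rather than a drop-in because, as far as I know, drop-ins can only add dependencies, not remove an existing Before=. The function name patch_unit is hypothetical, and the sketch assumes the two corosync lines appear in the unit exactly as quoted above.

```shell
#!/bin/bash
# Sketch only: rewrite the corosync dependency lines of a unit file.
# Assumption: the unit contains the lines "Wants=corosync.service" and
# "Before=corosync.service" exactly as quoted in this message.
patch_unit() {
    local src="$1" dst="$2"
    # Wants -> Requires and Before -> After, every other line is copied as-is.
    sed -e 's/^Wants=corosync\.service$/Requires=corosync.service/' \
        -e 's/^Before=corosync\.service$/After=corosync.service/' \
        "$src" > "$dst"
}

# Possible usage on a node (not executed here):
#   patch_unit /lib/systemd/system/pve-cluster.service \
#              /etc/systemd/system/pve-cluster.service
#   systemctl daemon-reload
```

This is still only the workaround described above, not a fix for the underlying hang.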
----- Original Message -----
From: "aderumier" <aderumier at odiso.com>
To: "Thomas Lamprecht" <t.lamprecht at proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
Sent: Monday, 21 September 2020 01:54:59
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
Hi,
I ran a new test, this time with "systemctl stop corosync", wait 15s, "systemctl start corosync", wait 15s.
I was able to reproduce it at the corosync stop on node1: 1 second later, /etc/pve was locked on all other nodes.
I started corosync again on node1 about 10 minutes later, and /etc/pve became writable again on all nodes.
node1: corosync stop: 01:26:50
node2: /etc/pve locked: 01:26:51
http://odisoweb1.odiso.net/corosync-stop.log
pmxcfs: bt full, all threads:
https://gist.github.com/aderumier/c45af4ee73b80330367e416af858bc65
pmxcfs: coredump: http://odisoweb1.odiso.net/core.17995.gz
node1: corosync start: 01:35:36
http://odisoweb1.odiso.net/corosync-start.log
BTW, I was contacted by PM on the forum by a user following this mailing list thread,
and he recently had exactly the same problem with a 7-node cluster.
(After shutting down 1 node, /etc/pve was locked until the node was restarted.)
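To note the moment /etc/pve stops accepting writes on another node, a bounded write probe along these lines could be used. It is only a sketch: the function name probe_write and the 5-second default are assumptions, and a write that is stuck in uninterruptible sleep may not be killable by timeout at all.

```shell
#!/bin/bash
# Sketch only: try one write with a time limit and report the result.
# Prints "writable" on success, "blocked" if the write fails or times out.
probe_write() {
    local file="$1" limit="${2:-5}"
    # Run the write in a child shell so timeout can kill it if it hangs.
    if timeout "$limit" bash -c 'echo probe > "$1"' _ "$file" 2>/dev/null; then
        echo "writable"
    else
        echo "blocked"
        return 1
    fi
}

# Possible usage on node2 (not executed here):
#   while sleep 1; do echo "$(date +%T) $(probe_write /etc/pve/test)"; done
```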
----- Original Message -----
From: "Thomas Lamprecht" <t.lamprecht at proxmox.com>
To: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>, "aderumier" <aderumier at odiso.com>
Sent: Thursday, 17 September 2020 13:35:55
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
On 9/17/20 12:02 PM, Alexandre DERUMIER wrote:
> if needed, here my test script to reproduce it
thanks, I'm now using this specific one; I had a similar one (but with all nodes writing)
running here for ~two hours without luck so far, let's see how this behaves.
>
> node1 (restart corosync until node2 stops sending the timestamp)
> -----
>
> #!/bin/bash
>
> for i in $(seq 10000); do
>     now=$(date +"%T")
>     echo "restart corosync : $now"
>     systemctl restart corosync
>     for j in {1..59}; do
>         last=$(cat /tmp/timestamp)
>         curr=$(date '+%s')
>         diff=$((curr - last))
>         if [ "$diff" -gt 20 ]; then
>             echo "too old"
>             exit 0
>         fi
>         sleep 1
>     done
> done
>
>
>
> node2 (write to /etc/pve/test each second, then send the last timestamp to node1)
> -----
> #!/bin/bash
> for i in {1..10000}; do
>     now=$(date +"%T")
>     echo "Current time : $now"
>     curr=$(date '+%s')
>     ssh root@node1 "echo $curr > /tmp/timestamp"
>     echo "test" > /etc/pve/test
>     sleep 1
> done
>