[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Thu Sep 17 13:35:55 CEST 2020

On 9/17/20 12:02 PM, Alexandre DERUMIER wrote:
> if needed, here is my test script to reproduce it

Thanks, I'm now using this specific one. I've had a similar test (but with all
nodes writing) running here for about two hours without luck so far. Let's see
how this one behaves.

> 
> node1 (restart corosync until node2 no longer sends the timestamp)
> -----
> 
> #!/bin/bash
> 
> for i in $(seq 10000); do
>     now=$(date +"%T")
>     echo "restart corosync : $now"
>     systemctl restart corosync
>     # for up to 59 seconds, check whether node2 still updates the timestamp
>     for j in {1..59}; do
>         last=$(cat /tmp/timestamp)
>         curr=$(date '+%s')
>         diff=$((curr - last))
>         if [ "$diff" -gt 20 ]; then
>             echo "too old"
>             exit 0
>         fi
>         sleep 1
>     done
> done
> 
> 
> 
> node2 (write to /etc/pve/test each second, then send the last timestamp to node1)
> -----
> #!/bin/bash
> for i in {1..10000};
> do
>    now=$(date +"%T")
>    echo "Current time : $now"
>    curr=$(date '+%s')
>    ssh root@node1 "echo $curr > /tmp/timestamp"
>    echo "test" > /etc/pve/test
>    sleep 1
> done
>
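
One small thing about the node1 check: if /tmp/timestamp does not exist yet or
is empty, the arithmetic fails instead of reporting "too old". A minimal sketch
of a more defensive check follows. The function name timestamp_age_ok and its
two arguments are made up for illustration and are not part of the script
above. It assumes the file holds a single epoch timestamp in seconds.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): succeed if the epoch
# timestamp stored in file $1 is at most $2 seconds old (default: 20).
timestamp_age_ok() {
    local file=$1 max_age=${2:-20} last curr
    # A missing or unreadable file counts as stale.
    last=$(cat "$file" 2>/dev/null) || return 1
    # Anything that is not a plain non-negative integer counts as stale too.
    [[ $last =~ ^[0-9]+$ ]] || return 1
    curr=$(date '+%s')
    # 10# forces base 10 so a leading zero is not parsed as octal.
    (( curr - 10#$last <= max_age ))
}

# Possible use inside the inner loop of the node1 script:
#   timestamp_age_ok /tmp/timestamp 20 || { echo "too old"; exit 0; }
```

With this, a missing timestamp file is treated the same as a stale one, so the
loop stops with "too old" rather than printing a shell syntax error.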