[pve-devel] ha and zfs replication bug
Alexandre DERUMIER
aderumier at odiso.com
Wed Nov 22 17:01:20 CET 2017
Hi,
This is our training week with customers,
and we are testing HA + the new storage replication with zfs.
We have hit some bugs:
setup is:
node1: kvm1
node2: kvm2
both with local-zfs
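(For context, 'local-zfs' here is the standard zfspool storage; our /etc/pve/storage.cfg entry looks roughly like this, the pool name may differ on other setups:)

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1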
We are testing with a CT, id 101.
The CT is running on kvm1, with HA enabled.
group: mongroupeha
nodes kvm2:1,kvm1:2
nofailback 0
restricted 0
ct: 101
group mongroupeha
state started
kvm1 is the preferred node.
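(For anyone who wants to reproduce: we created the group and resource with ha-manager, something like this if I remember correctly:)

# create the HA group, kvm1 gets the higher priority (2) so it is preferred
ha-manager groupadd mongroupeha --nodes "kvm2:1,kvm1:2"
# add the CT as an HA resource in that group
ha-manager add ct:101 --group mongroupeha --state started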
replication job:
local: 101-0
target kvm2
schedule */1
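(the job was created with something like:)

# replication job 101-0: replicate CT 101 from kvm1 to kvm2 every minute
pvesr create-local-job 101-0 kvm2 --schedule "*/1"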
Now, when kvm1 crashes, the CT is started on kvm2.
That's ok.
But when kvm1 comes back online, the CT can't fall back anymore.
1st bug:
In the GUI, when the CT is on kvm2, we can't see the replication job anymore, because its config is still:
local: 101-0
target kvmformation2
schedule */1
if we edit the config with
local: 101-0
target kvmformation1
schedule */1
we can see it again.
(So, I think we could delete or change the config, to avoid leaving a phantom job in the GUI.)
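(For reference, we changed the target by editing the job config by hand, in /etc/pve/replication.cfg; I guess the same could be done by removing and recreating the job with pvesr, untested:)

# drop the job that now points at the node the CT is running on
pvesr delete 101-0
# recreate it pointing back at the original node
pvesr create-local-job 101-0 kvmformation1 --schedule "*/1"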
Then, after that, the replication job fails with:
"
Task viewer: CT 101 - Migrate
task started by HA resource agent
2017-11-22 16:50:33 use dedicated network address for sending migration traffic (10.59.100.221)
2017-11-22 16:50:33 starting migration of CT 101 to node 'kvmformation1' (10.59.100.221)
2017-11-22 16:50:33 found local volume 'local-zfs:subvol-101-disk-1' (in current VM config)
2017-11-22 16:50:33 start replication job
2017-11-22 16:50:33 guest => CT 101, running => 0
2017-11-22 16:50:33 volumes => local-zfs:subvol-101-disk-1
2017-11-22 16:50:35 create snapshot '__replicate_101-0_1511365833__' on local-zfs:subvol-101-disk-1
2017-11-22 16:50:35 full sync 'local-zfs:subvol-101-disk-1' (__replicate_101-0_1511365833__)
volume 'rpool/data/subvol-101-disk-1' already exists
exit code 255
full send of rpool/data/subvol-101-disk-1@__replicate_101-1_1511364690__ estimated size is 1.23G
send from @__replicate_101-1_1511364690__ to rpool/data/subvol-101-disk-1@__replicate_101-0_1511365833__ estimated size is 10.8M
total estimated size is 1.24G
TIME SENT SNAPSHOT
command 'zfs send -Rpv -- rpool/data/subvol-101-disk-1@__replicate_101-0_1511365833__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
2017-11-22 16:50:36 delete previous replication snapshot '__replicate_101-0_1511365833__' on local-zfs:subvol-101-disk-1
2017-11-22 16:50:36 end replication job with error: command 'set -o pipefail && pvesm export local-zfs:subvol-101-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_101-0_1511365833__' failed: exit code 255
2017-11-22 16:50:36 ERROR: command 'set -o pipefail && pvesm export local-zfs:subvol-101-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_101-0_1511365833__' failed: exit code 255
2017-11-22 16:50:36 aborting phase 1 - cleanup resources
2017-11-22 16:50:36 start final cleanup
2017-11-22 16:50:36 ERROR: migration aborted (duration 00:00:03): command 'set -o pipefail && pvesm export local-zfs:subvol-101-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_101-0_1511365833__' failed: exit code 255
TASK ERROR: migration aborted"
This is because of the old zfs volume still present on kvm1.
If we delete the volume manually on kvm1 with
pvesm free subvol-101-disk-1 --storage local-zfs
then the replication job runs again, and the CT is finally migrated back to kvm1.
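(To be complete, this is roughly the sequence we used on kvm1, pvesr command names from memory:)

# check that the dataset on kvm1 is really the stale copy from before the crash
zfs list -t all -r rpool/data/subvol-101-disk-1
# after the 'pvesm free' above, re-trigger the job and check its state
pvesr schedule-now 101-0
pvesr status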
Is it a bug or expected behaviour?