[pve-devel] ha and zfs replication bug
Alexandre DERUMIER
aderumier at odiso.com
Wed Nov 22 17:01:20 CET 2017
Hi,
This is our training week with customers,
and we are testing HA + the new storage replication with zfs.
We have hit some bugs:
setup is:
node1: kvm1
node2: kvm2
both with local-zfs
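(For context, 'local-zfs' here is the standard zfspool storage; our /etc/pve/storage.cfg entry looks roughly like this, the pool name may differ on other setups:)

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1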
We are testing with a CT, id 101.
The CT is running on kvm1, with HA enabled.
group: mongroupeha
nodes kvm2:1,kvm1:2
nofailback 0
restricted 0
ct: 101
group mongroupeha
state started
kvm1 is the preferred node.
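(For anyone who wants to reproduce: we created the group and resource with ha-manager, something like this if I remember correctly:)

# create the HA group, kvm1 gets the higher priority (2) so it is preferred
ha-manager groupadd mongroupeha --nodes "kvm2:1,kvm1:2"
# add the CT as an HA resource in that group
ha-manager add ct:101 --group mongroupeha --state started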
replication job:
local: 101-0
target kvm2
schedule */1
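(the job was created with something like:)

# replication job 101-0: replicate CT 101 from kvm1 to kvm2 every minute
pvesr create-local-job 101-0 kvm2 --schedule "*/1"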
Now, when kvm1 crashes, the CT is started on kvm2.
That's ok.
But when kvm1 comes back online, the CT can't fall back anymore.
1st bug:
In the GUI, when the CT is on kvm2, we can't see the replication job anymore, because its config is still:
local: 101-0
target kvmformation2
schedule */1
if we edit the config with
local: 101-0
target kvmformation1
schedule */1
we can see it again.
(So, I think we could delete or change the config, to avoid leaving a phantom job in the GUI.)
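(For reference, we changed the target by editing the job config by hand, in /etc/pve/replication.cfg; I guess the same could be done by removing and recreating the job with pvesr, untested:)

# drop the job that now points at the node the CT is running on
pvesr delete 101-0
# recreate it pointing back at the original node
pvesr create-local-job 101-0 kvmformation1 --schedule "*/1"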
Then, after that, the replication job fails with:
"
Task viewer: CT 101 - Migrate
task started by HA resource agent
2017-11-22 16:50:33 use dedicated network address for sending migration traffic (10.59.100.221)
2017-11-22 16:50:33 starting migration of CT 101 to node 'kvmformation1' (10.59.100.221)
2017-11-22 16:50:33 found local volume 'local-zfs:subvol-101-disk-1' (in current VM config)
2017-11-22 16:50:33 start replication job
2017-11-22 16:50:33 guest => CT 101, running => 0
2017-11-22 16:50:33 volumes => local-zfs:subvol-101-disk-1
2017-11-22 16:50:35 create snapshot '__replicate_101-0_1511365833__' on local-zfs:subvol-101-disk-1
2017-11-22 16:50:35 full sync 'local-zfs:subvol-101-disk-1' (__replicate_101-0_1511365833__)
volume 'rpool/data/subvol-101-disk-1' already exists
exit code 255
full send of rpool/data/subvol-101-disk-1@__replicate_101-1_1511364690__ estimated size is 1.23G
send from @__replicate_101-1_1511364690__ to rpool/data/subvol-101-disk-1@__replicate_101-0_1511365833__ estimated size is 10.8M
total estimated size is 1.24G
TIME SENT SNAPSHOT
command 'zfs send -Rpv -- rpool/data/subvol-101-disk-1@__replicate_101-0_1511365833__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
2017-11-22 16:50:36 delete previous replication snapshot '__replicate_101-0_1511365833__' on local-zfs:subvol-101-disk-1
2017-11-22 16:50:36 end replication job with error: command 'set -o pipefail && pvesm export local-zfs:subvol-101-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_101-0_1511365833__' failed: exit code 255
2017-11-22 16:50:36 ERROR: command 'set -o pipefail && pvesm export local-zfs:subvol-101-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_101-0_1511365833__' failed: exit code 255
2017-11-22 16:50:36 aborting phase 1 - cleanup resources
2017-11-22 16:50:36 start final cleanup
2017-11-22 16:50:36 ERROR: migration aborted (duration 00:00:03): command 'set -o pipefail && pvesm export local-zfs:subvol-101-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_101-0_1511365833__' failed: exit code 255
TASK ERROR: migration aborted"
This is because of the old zfs volume still present on kvm1.
If we delete the volume manually on kvm1 with
pvesm free subvol-101-disk-1 --storage local-zfs
then the replication job runs again, and the CT is finally migrated back to kvm1.
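(To be complete, this is roughly the sequence we used on kvm1, pvesr command names from memory:)

# check that the dataset on kvm1 is really the stale copy from before the crash
zfs list -t all -r rpool/data/subvol-101-disk-1
# after the 'pvesm free' above, re-trigger the job and check its state
pvesr schedule-now 101-0
pvesr status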
Is it a bug or expected behaviour?