[pve-devel] ha and zfs replication bug

Gilberto Nunes gilberto.nunes32 at gmail.com
Wed Nov 22 17:40:50 CET 2017


Yes... I have faced this bug here too.
What I do as a workaround is create a script that, via an SSH session,
destroys the old ZFS volume:

ssh root@nodea zfs destroy vol

Then migrate

ha-migrate CT

This is a shame! With a small volume that's OK, but with a large volume this
can be unacceptable, since destroying the replica means the whole volume has
to be sent over again from scratch.
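
A minimal sketch of such a script (the node name, CT ID and dataset are
placeholders for my environment; adjust them to yours):

#!/bin/bash
# Drop the stale replica on the old node, then let HA migrate the CT back.
set -euo pipefail

OLD_NODE="nodea"                              # node that still holds the stale ZFS volume
CTID="101"                                    # container ID
DATASET="rpool/data/subvol-${CTID}-disk-1"    # leftover replicated dataset on the old node

# Destroy the leftover dataset (and its snapshots) on the old node over SSH.
ssh "root@${OLD_NODE}" "zfs destroy -r ${DATASET}"

# Ask the HA manager to move the CT back to its preferred node.
ha-manager migrate "ct:${CTID}" "${OLD_NODE}"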

---
Gilberto Ferreira

Consultor TI Linux | IaaS Proxmox, CloudStack, KVM | Zentyal Server |
Zimbra Mail Server

(47) 3025-5907
(47) 99676-7530

Skype: gilberto.nunes36


konnectati.com.br <http://www.konnectati.com.br/>


https://www.youtube.com/watch?v=2rkgOxuyuu8


2017-11-22 14:01 GMT-02:00 Alexandre DERUMIER <aderumier at odiso.com>:

> Hi,
>
> This is our training week with customers,
>
> and we are testing HA + the new storage replication with ZFS.
>
>
>
> We have some bugs:
>
> The setup is:
>
> node1: kvm1
> node2: kvm2
>
> both with local-zfs
>
>
> We are testing with a CT, ID 101.
>
> The CT is running on kvm1, with HA enabled:
>
> group: mongroupeha
>         nodes kvm2:1,kvm1:2
>         nofailback 0
>         restricted 0
>
> ct: 101
>         group mongroupeha
>         state started
>
>
> kvm1 is the preferred node.
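>
> For reference, roughly the equivalent CLI (from memory, so the syntax may
> need checking):
>
> ha-manager groupadd mongroupeha --nodes "kvm2:1,kvm1:2"
> ha-manager add ct:101 --group mongroupeha --state started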
>
>
> replication job:
>
> local: 101-0
>         target kvm2
>         schedule */1
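>
> For reference, roughly the equivalent pvesr call (again from memory; syntax
> may need checking):
>
> pvesr create-local-job 101-0 kvm2 --schedule '*/1'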
>
>
>
> Now, when kvm1 crashes, the CT is started on kvm2.
> That's OK.
>
>
> When kvm1 comes back online, the CT can't fall back anymore.
>
>
> 1st bug:
>
> In the GUI, when the CT is on kvm2, we can't see the replication job,
> because its config is still:
>
> local: 101-0
>         target kvmformation2
>         schedule */1
>
>
> If we edit the config to
> local: 101-0
>         target kvmformation1
>         schedule */1
>
> we can see it again.
>
> (So I think we could delete or change the config, so that we don't end up
> with a phantom job in the GUI.)
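>
> A CLI-only way to do the same fix would be something like the following --
> again from memory, and assuming the stale job can simply be dropped and
> recreated against the new target:
>
> pvesr delete 101-0
> pvesr create-local-job 101-0 kvmformation1 --schedule '*/1'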
>
>
> Then, after that, the replication job fails with:
>
> "
> Task viewer: CT 101 - Migrate
> Output
> Status
> Stop
> task started by HA resource agent
> 2017-11-22 16:50:33 use dedicated network address for sending migration
> traffic (10.59.100.221)
> 2017-11-22 16:50:33 starting migration of CT 101 to node 'kvmformation1'
> (10.59.100.221)
> 2017-11-22 16:50:33 found local volume 'local-zfs:subvol-101-disk-1' (in
> current VM config)
> 2017-11-22 16:50:33 start replication job
> 2017-11-22 16:50:33 guest => CT 101, running => 0
> 2017-11-22 16:50:33 volumes => local-zfs:subvol-101-disk-1
> 2017-11-22 16:50:35 create snapshot '__replicate_101-0_1511365833__' on
> local-zfs:subvol-101-disk-1
> 2017-11-22 16:50:35 full sync 'local-zfs:subvol-101-disk-1'
> (__replicate_101-0_1511365833__)
> volume 'rpool/data/subvol-101-disk-1' already exists
> exit code 255
> full send of rpool/data/subvol-101-disk-1@__replicate_101-1_1511364690__
> estimated size is 1.23G
> send from @__replicate_101-1_1511364690__ to rpool/data/subvol-101-disk-1@
> __replicate_101-0_1511365833__ estimated size is 10.8M
> total estimated size is 1.24G
> TIME        SENT   SNAPSHOT
> command 'zfs send -Rpv -- rpool/data/subvol-101-disk-1@__replicate_101-0_1511365833__'
> failed: got signal 13
> send/receive failed, cleaning up snapshot(s)..
> 2017-11-22 16:50:36 delete previous replication snapshot
> '__replicate_101-0_1511365833__' on local-zfs:subvol-101-disk-1
> 2017-11-22 16:50:36 end replication job with error: command 'set -o
> pipefail && pvesm export local-zfs:subvol-101-disk-1 zfs - -with-snapshots
> 1 -snapshot __replicate_101-0_1511365833__' failed: exit code 255
> 2017-11-22 16:50:36 ERROR: command 'set -o pipefail && pvesm export
> local-zfs:subvol-101-disk-1 zfs - -with-snapshots 1 -snapshot
> __replicate_101-0_1511365833__' failed: exit code 255
> 2017-11-22 16:50:36 aborting phase 1 - cleanup resources
> 2017-11-22 16:50:36 start final cleanup
> 2017-11-22 16:50:36 ERROR: migration aborted (duration 00:00:03): command
> 'set -o pipefail && pvesm export local-zfs:subvol-101-disk-1 zfs -
> -with-snapshots 1 -snapshot __replicate_101-0_1511365833__' failed: exit
> code 255
> TASK ERROR: migration aborted"
>
>
> This is because of the old ZFS volume left on kvm1.
> If we delete the volume manually on kvm1 with
>  pvesm free subvol-101-disk-1 --storage local-zfs
>
> Then the replication job runs again, and the CT is finally migrated back
> to kvm1.
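>
> As an interim workaround this could be scripted: free the stale volume on
> the old node before letting HA move the CT back (a sketch; volume and
> storage names are taken from the log above):
>
> if pvesm list local-zfs | grep -q 'subvol-101-disk-1'; then
>     pvesm free subvol-101-disk-1 --storage local-zfs
> fi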
>
>
>
> Is this a bug or expected behaviour?
>
>
>
>
> _______________________________________________
> pve-devel mailing list
> pve-devel at pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
>


