[PVE-User] proxmox 5 - replication fails

dORSY dorsyka at yahoo.com
Sun Jul 9 12:04:44 CEST 2017


Hi,

I upgraded my 2-node setup to Proxmox 5 and Debian Stretch.
I am using local ZFS storage and unicast corosync (OVH).

I set up storage replication in both directions for some VMs and a container. It makes migrating VMs really fast; however, the VMs need to be shut down, as online migration no longer works once replication is enabled. I can live with that for the time being, but fixing it would be a great improvement.
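In case it is relevant, this is how I have been inspecting the jobs and their state (standard pvesr CLI, as far as I understand it; the state file path is from memory and may be slightly off):

  pvesr list      # configured replication jobs
  pvesr status    # last run, duration and current state of each job
  cat /var/lib/pve-manager/pve-replication-state.json    # raw per-guest job state on the source node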

The problem is that some ZFS replication attempts fail, and then stay failed.

2017-07-09 11:03:00 103-0: start replication job
2017-07-09 11:03:00 103-0: guest => VM 103, running => 2979
2017-07-09 11:03:00 103-0: volumes => local-zfs:vm-103-disk-1
2017-07-09 11:03:00 103-0: delete stale replication snapshot '__replicate_103-0_1499590800__' on local-zfs:vm-103-disk-1
2017-07-09 11:03:05 103-0: (remote_prepare_local_job) delete stale replication snapshot '__replicate_103-0_1499590800__' on local-zfs:vm-103-disk-1
2017-07-09 11:03:09 103-0: create snapshot '__replicate_103-0_1499590980__' on local-zfs:vm-103-disk-1
2017-07-09 11:03:09 103-0: full sync 'local-zfs:vm-103-disk-1' (__replicate_103-0_1499590980__)
2017-07-09 11:03:10 103-0: delete previous replication snapshot '__replicate_103-0_1499590980__' on local-zfs:vm-103-disk-1
2017-07-09 11:03:10 103-0: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-103-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_103-0_1499590980__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=xxhostnamexx' root@ip.of.other.node -- pvesm import local-zfs:vm-103-disk-1 zfs - -with-snapshots 1' failed: exit code 255

I could not run that command by hand, because the job deletes those snapshots right away.
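I guess the send side could at least be exercised by hand with a manually created snapshot (the snapshot name below is made up, the rest is copied from the failing command), something like:

  zfs snapshot rpool/data/vm-103-disk-1@manual_test
  pvesm export local-zfs:vm-103-disk-1 zfs - -with-snapshots 1 -snapshot manual_test > /dev/null
  zfs destroy rpool/data/vm-103-disk-1@manual_test

but I have not verified whether the import side would even accept such a stream while the target volume already exists.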

Digging into the pvesr logs in the journal, I found that replication works as expected for a while:

Jul 09 10:15:02 NameOfNode1 pvesr[642]: send from @__replicate_103-0_1499587200__ to rpool/data/vm-103-disk-1@__replicate_103-0_1499588100__ estimated size is 60.5M
Jul 09 10:15:02 NameOfNode1 pvesr[642]: total estimated size is 60.5M
Jul 09 10:15:02 NameOfNode1 pvesr[642]: TIME        SENT   SNAPSHOT
Jul 09 10:15:02 NameOfNode1 pvesr[642]: rpool/data/vm-103-disk-1@__replicate_103-0_1499587200__        name        rpool/data/vm-103-disk-1@__replicate_103-0_1499587200__        -
Jul 09 10:15:03 NameOfNode1 pvesr[642]: 10:15:03   5.47M   rpool/data/vm-103-disk-1@__replicate_103-0_1499588100__
Jul 09 10:15:04 NameOfNode1 pvesr[642]: 10:15:04   10.4M   rpool/data/vm-103-disk-1@__replicate_103-0_1499588100__
Jul 09 10:15:05 NameOfNode1 pvesr[642]: 10:15:05   40.4M   rpool/data/vm-103-disk-1@__replicate_103-0_1499588100__
Jul 09 10:15:12 NameOfNode1 pvesr[642]: send from @__replicate_109-0_1499587210__ to rpool/data/vm-109-disk-1@__replicate_109-0_1499588111__ estimated size is 172M
Jul 09 10:15:12 NameOfNode1 pvesr[642]: total estimated size is 172M
Jul 09 10:15:12 NameOfNode1 pvesr[642]: TIME        SENT   SNAPSHOT
Jul 09 10:15:12 NameOfNode1 pvesr[642]: rpool/data/vm-109-disk-1@__replicate_109-0_1499587210__        name        rpool/data/vm-109-disk-1@__replicate_109-0_1499587210__        -
Jul 09 10:15:13 NameOfNode1 pvesr[642]: 10:15:13   15.6M   rpool/data/vm-109-disk-1@__replicate_109-0_1499588111__
Jul 09 10:15:14 NameOfNode1 pvesr[642]: 10:15:14   41.0M   rpool/data/vm-109-disk-1@__replicate_109-0_1499588111__
Jul 09 10:15:15 NameOfNode1 pvesr[642]: 10:15:15   71.2M   rpool/data/vm-109-disk-1@__replicate_109-0_1499588111__
Jul 09 10:15:16 NameOfNode1 pvesr[642]: 10:15:16   88.8M   rpool/data/vm-109-disk-1@__replicate_109-0_1499588111__
Jul 09 10:15:17 NameOfNode1 pvesr[642]: 10:15:17    114M   rpool/data/vm-109-disk-1@__replicate_109-0_1499588111__
Jul 09 10:15:18 NameOfNode1 pvesr[642]: 10:15:18    121M   rpool/data/vm-109-disk-1@__replicate_109-0_1499588111__
Jul 09 10:15:19 NameOfNode1 pvesr[642]: 10:15:19    149M   rpool/data/vm-109-disk-1@__replicate_109-0_1499588111__
Jul 09 10:15:20 NameOfNode1 pvesr[642]: 10:15:20    160M   rpool/data/vm-109-disk-1@__replicate_109-0_1499588111__
Jul 09 10:15:21 NameOfNode1 pvesr[642]: 10:15:21    172M   rpool/data/vm-109-disk-1@__replicate_109-0_1499588111__
Jul 09 10:30:05 NameOfNode1 pvesr[3893]: send from @__replicate_103-0_1499588100__ to rpool/data/vm-103-disk-1@__replicate_103-0_1499589000__ estimated size is 8.30M
Jul 09 10:30:05 NameOfNode1 pvesr[3893]: total estimated size is 8.30M
Jul 09 10:30:05 NameOfNode1 pvesr[3893]: TIME        SENT   SNAPSHOT
Jul 09 10:30:05 NameOfNode1 pvesr[3893]: rpool/data/vm-103-disk-1@__replicate_103-0_1499588100__        name        rpool/data/vm-103-disk-1@__replicate_103-0_1499588100__        -
Jul 09 10:30:06 NameOfNode1 pvesr[3893]: 10:30:06   2.11M   rpool/data/vm-103-disk-1@__replicate_103-0_1499589000__
Jul 09 10:30:07 NameOfNode1 pvesr[3893]: 10:30:07   2.11M   rpool/data/vm-103-disk-1@__replicate_103-0_1499589000__
Jul 09 10:30:08 NameOfNode1 pvesr[3893]: 10:30:08   2.11M   rpool/data/vm-103-disk-1@__replicate_103-0_1499589000__
Jul 09 10:30:09 NameOfNode1 pvesr[3893]: 10:30:09   2.11M   rpool/data/vm-103-disk-1@__replicate_103-0_1499589000__
Jul 09 10:30:10 NameOfNode1 pvesr[3893]: 10:30:10   3.10M   rpool/data/vm-103-disk-1@__replicate_103-0_1499589000__
Jul 09 10:30:23 NameOfNode1 pvesr[3893]: send from @__replicate_109-0_1499588111__ to rpool/data/vm-109-disk-1@__replicate_109-0_1499589022__ estimated size is 22.5M
Jul 09 10:30:23 NameOfNode1 pvesr[3893]: total estimated size is 22.5M
Jul 09 10:30:23 NameOfNode1 pvesr[3893]: TIME        SENT   SNAPSHOT
Jul 09 10:30:24 NameOfNode1 pvesr[3893]: rpool/data/vm-109-disk-1@__replicate_109-0_1499588111__        name        rpool/data/vm-109-disk-1@__replicate_109-0_1499588111__        -
Jul 09 10:30:24 NameOfNode1 pvesr[3893]: 10:30:24   2.10M   rpool/data/vm-109-disk-1@__replicate_109-0_1499589022__
Jul 09 10:30:25 NameOfNode1 pvesr[3893]: 10:30:25   2.10M   rpool/data/vm-109-disk-1@__replicate_109-0_1499589022__
Jul 09 10:30:26 NameOfNode1 pvesr[3893]: 10:30:26   4.55M   rpool/data/vm-109-disk-1@__replicate_109-0_1499589022__

But then it somehow wants to send from "@" (a full stream instead of an incremental one), and note that the estimated size even exceeds the amount of data actually used on the VM's disk; it is close to the volume size.
This happens for both VM 103 and VM 109 (sometimes only on one node).

zfs list and zfs get volsize of those volumes:

NAME                          USED   AVAIL  REFER  MOUNTPOINT
rpool/data/vm-103-disk-1      14.5G  2.29T  14.5G  -
rpool/data/vm-109-disk-1      21.4G  2.29T  21.4G  -

NAME                          PROPERTY     VALUE    SOURCE
rpool/data/vm-103-disk-1      volsize      20G      local
rpool/data/vm-109-disk-1      volsize      30G      local
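As far as I understand zfs send, an incremental stream needs a snapshot that still exists on both sides; once the last common __replicate_* snapshot is gone, pvesr apparently falls back to a full send from "@", which is what the log below shows. This is how I would check whether the two nodes still share a base snapshot (dataset name taken from the listing above, the remote address is of course just my placeholder):

  # on the source node
  zfs list -t snapshot -o name -r rpool/data/vm-109-disk-1
  # and the same on the target node
  ssh root@ip.of.other.node zfs list -t snapshot -o name -r rpool/data/vm-109-disk-1

If no __replicate_* snapshot appears in both lists, only a full send is possible, and that then collides with the volume that already exists on the target.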


Jul 09 10:36:11 NameOfNode1 pvesr[5123]: send from @ to rpool/data/vm-109-disk-1@__replicate_109-0_1499589360__ estimated size is 28.4G
Jul 09 10:36:11 NameOfNode1 pvesr[5123]: total estimated size is 28.4G
Jul 09 10:36:11 NameOfNode1 pvesr[5123]: TIME        SENT   SNAPSHOT
Jul 09 10:36:12 NameOfNode1 pvesr[5123]: rpool/data/vm-109-disk-1        name        rpool/data/vm-109-disk-1        -
Jul 09 10:36:12 NameOfNode1 pvesr[5123]: volume 'rpool/data/vm-109-disk-1' already exists
Jul 09 10:36:12 NameOfNode1 pvesr[5123]: warning: cannot send 'rpool/data/vm-109-disk-1@__replicate_109-0_1499589360__': Broken pipe
Jul 09 10:36:12 NameOfNode1 pvesr[5123]: cannot send 'rpool/data/vm-109-disk-1': I/O error
Jul 09 10:36:12 NameOfNode1 pvesr[5123]: command 'zfs send -Rpv -- rpool/data/vm-109-disk-1@__replicate_109-0_1499589360__' failed: exit code 1
Jul 09 10:36:12 NameOfNode1 pvesr[5123]: send/receive failed, cleaning up snapshot(s)..
Jul 09 10:45:17 NameOfNode1 pvesr[6669]: send from @__replicate_103-0_1499589000__ to rpool/data/vm-103-disk-1@__replicate_103-0_1499589900__ estimated size is 9.12M
Jul 09 10:45:17 NameOfNode1 pvesr[6669]: total estimated size is 9.12M
Jul 09 10:45:17 NameOfNode1 pvesr[6669]: TIME        SENT   SNAPSHOT
Jul 09 10:45:17 NameOfNode1 pvesr[6669]: rpool/data/vm-103-disk-1@__replicate_103-0_1499589000__        name        rpool/data/vm-103-disk-1@__replicate_103-0_1499589000__        -
Jul 09 10:45:18 NameOfNode1 pvesr[6669]: 10:45:18   7.24M   rpool/data/vm-103-disk-1@__replicate_103-0_1499589900__
Jul 09 10:46:02 NameOfNode1 pvesr[7055]: send from @ to rpool/data/vm-109-disk-1@__replicate_109-0_1499589960__ estimated size is 28.4G
Jul 09 10:46:02 NameOfNode1 pvesr[7055]: total estimated size is 28.4G
Jul 09 10:46:02 NameOfNode1 pvesr[7055]: TIME        SENT   SNAPSHOT
Jul 09 10:46:02 NameOfNode1 pvesr[7055]: rpool/data/vm-109-disk-1        name        rpool/data/vm-109-disk-1        -
Jul 09 10:46:02 NameOfNode1 pvesr[7055]: volume 'rpool/data/vm-109-disk-1' already exists
Jul 09 10:46:02 NameOfNode1 pvesr[7055]: warning: cannot send 'rpool/data/vm-109-disk-1@__replicate_109-0_1499589960__': Broken pipe
Jul 09 10:46:02 NameOfNode1 pvesr[7055]: cannot send 'rpool/data/vm-109-disk-1': I/O error
Jul 09 10:46:02 NameOfNode1 pvesr[7055]: command 'zfs send -Rpv -- rpool/data/vm-109-disk-1@__replicate_109-0_1499589960__' failed: exit code 1
Jul 09 10:46:02 NameOfNode1 pvesr[7055]: send/receive failed, cleaning up snapshot(s)..
Jul 09 11:00:08 NameOfNode1 pvesr[9425]: send from @__replicate_103-0_1499589900__ to rpool/data/vm-103-disk-1@__replicate_103-0_1499590800__ estimated size is 7.24M
Jul 09 11:00:08 NameOfNode1 pvesr[9425]: total estimated size is 7.24M
Jul 09 11:00:08 NameOfNode1 pvesr[9425]: TIME        SENT   SNAPSHOT
Jul 09 11:00:08 NameOfNode1 pvesr[9425]: rpool/data/vm-103-disk-1@__replicate_103-0_1499589900__        name        rpool/data/vm-103-disk-1@__replicate_103-0_1499589900__        -
Jul 09 11:00:09 NameOfNode1 pvesr[9425]: 11:00:09   2.11M   rpool/data/vm-103-disk-1@__replicate_103-0_1499590800__
Jul 09 11:03:09 NameOfNode1 pvesr[10151]: send from @ to rpool/data/vm-103-disk-1@__replicate_103-0_1499590980__ estimated size is 19.5G
Jul 09 11:03:09 NameOfNode1 pvesr[10151]: total estimated size is 19.5G
Jul 09 11:03:09 NameOfNode1 pvesr[10151]: TIME        SENT   SNAPSHOT
Jul 09 11:03:10 NameOfNode1 pvesr[10151]: rpool/data/vm-103-disk-1        name        rpool/data/vm-103-disk-1        -
Jul 09 11:03:10 NameOfNode1 pvesr[10151]: volume 'rpool/data/vm-103-disk-1' already exists
Jul 09 11:03:10 NameOfNode1 pvesr[10151]: warning: cannot send 'rpool/data/vm-103-disk-1@__replicate_103-0_1499590980__': Broken pipe
Jul 09 11:03:10 NameOfNode1 pvesr[10151]: cannot send 'rpool/data/vm-103-disk-1': I/O error
Jul 09 11:03:10 NameOfNode1 pvesr[10151]: command 'zfs send -Rpv -- rpool/data/vm-103-disk-1@__replicate_103-0_1499590980__' failed: exit code 1
Jul 09 11:03:10 NameOfNode1 pvesr[10151]: send/receive failed, cleaning up snapshot(s)..
Jul 09 11:04:02 NameOfNode1 pvesr[10347]: send from @ to rpool/data/vm-109-disk-1@__replicate_109-0_1499591040__ estimated size is 28.4G
Jul 09 11:04:02 NameOfNode1 pvesr[10347]: total estimated size is 28.4G
Jul 09 11:04:02 NameOfNode1 pvesr[10347]: TIME        SENT   SNAPSHOT
Jul 09 11:04:02 NameOfNode1 pvesr[10347]: rpool/data/vm-109-disk-1        name        rpool/data/vm-109-disk-1        -
Jul 09 11:04:02 NameOfNode1 pvesr[10347]: volume 'rpool/data/vm-109-disk-1' already exists
Jul 09 11:04:02 NameOfNode1 pvesr[10347]: warning: cannot send 'rpool/data/vm-109-disk-1@__replicate_109-0_1499591040__': Broken pipe
Jul 09 11:04:02 NameOfNode1 pvesr[10347]: cannot send 'rpool/data/vm-109-disk-1': I/O error
Jul 09 11:04:02 NameOfNode1 pvesr[10347]: command 'zfs send -Rpv -- rpool/data/vm-109-disk-1@__replicate_109-0_1499591040__' failed: exit code 1
Jul 09 11:04:02 NameOfNode1 pvesr[10347]: send/receive failed, cleaning up snapshot(s)..
Jul 09 11:13:03 NameOfNode1 pvesr[11914]: send from @ to rpool/data/vm-103-disk-1@__replicate_103-0_1499591580__ estimated size is 19.5G
Jul 09 11:13:03 NameOfNode1 pvesr[11914]: total estimated size is 19.5G
Jul 09 11:13:03 NameOfNode1 pvesr[11914]: TIME        SENT   SNAPSHOT
Jul 09 11:13:03 NameOfNode1 pvesr[11914]: rpool/data/vm-103-disk-1        name        rpool/data/vm-103-disk-1        -
Jul 09 11:13:03 NameOfNode1 pvesr[11914]: volume 'rpool/data/vm-103-disk-1' already exists
Jul 09 11:13:03 NameOfNode1 pvesr[11914]: warning: cannot send 'rpool/data/vm-103-disk-1@__replicate_103-0_1499591580__': Broken pipe
Jul 09 11:13:03 NameOfNode1 pvesr[11914]: cannot send 'rpool/data/vm-103-disk-1': I/O error
Jul 09 11:13:03 NameOfNode1 pvesr[11914]: command 'zfs send -Rpv -- rpool/data/vm-103-disk-1@__replicate_103-0_1499591580__' failed: exit code 1
Jul 09 11:13:03 NameOfNode1 pvesr[11914]: send/receive failed, cleaning up snapshot(s)..

After that it keeps wanting to transfer the whole disks again and again, but fails to do so every time. There seems to be no way to recover from this other than destroying the whole ZFS volume on the target side and retransferring it.
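For completeness, this is roughly the recovery I mean (to be used with care; the dataset and job ID are from my setup, and "pvesr schedule-now" is just how I read the CLI help, so please correct me if that is not the intended way):

  # on the target node: throw away the stale copy so a full send can go through
  zfs destroy -r rpool/data/vm-109-disk-1
  # on the source node: trigger the replication job again right away
  pvesr schedule-now 109-0

After that the job does one full sync and subsequent runs should be incremental again.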
Please help me work around this problem or find out what goes wrong there.

Regards,
dorsy


