[PVE-User] proxmox 5 - replication fails

Wed Jul 12 13:50:44 CEST 2017

# cat /etc/pve/replication.cfg
local: 105-0
     target ns302695
     rate 10
     schedule */2:00

local: 103-0
     target ns3511723
     rate 11
     schedule */20

local: 109-0
     target ns3511723
     rate 10

local: 102-0
     target ns302695
     rate 10
     schedule 22:30

local: 107-0
     target ns302695
     rate 10

local: 100-0
     target ns302695
     rate 10
     schedule */2:00
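
For a quick overview of which jobs carry an explicit schedule, the file above can be read with a few lines of Python. This is only a rough sketch: it assumes the layout shown here (a "type: id" header followed by indented "key value" lines), and parse_replication_cfg is a made-up helper name, not part of any Proxmox tool. The sample is a shortened copy of two of the jobs above.

```python
def parse_replication_cfg(text):
    """Parse 'type: id' headers followed by indented 'key value' lines."""
    jobs = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current job section.
            current = None
            continue
        if not line[0].isspace():
            # Section header, e.g. "local: 100-0".
            job_type, job_id = (part.strip() for part in line.split(":", 1))
            current = {"type": job_type}
            jobs[job_id] = current
        elif current is not None:
            # Indented option line, e.g. "     rate 10".
            key, _, value = line.strip().partition(" ")
            current[key] = value.strip()
    return jobs


# Shortened copy of two jobs from the config above.
SAMPLE = """\
local: 100-0
     target ns302695
     rate 10
     schedule */2:00

local: 109-0
     target ns3511723
     rate 10
"""

jobs = parse_replication_cfg(SAMPLE)
for job_id, opts in sorted(jobs.items()):
    print(job_id, opts["target"], opts.get("schedule", "(no schedule line)"))
```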

# cat /var/lib/pve-manager/pve-replication-state.json
{"103":{"local/ns3511723":{"storeid_list":["local-zfs"],"fail_count":0,"last_try":1499859600,"last_sync":1499859600,"last_iteration":1499859600,"last_node":"ns302695","duration":4.482678}},"109":{"local/ns3511723":{"fail_count":0,"storeid_list":["local-zfs"],"last_sync":1499859000,"last_try":1499859000,"last_iteration":1499859000,"last_node":"ns302695","duration":7.828846}}}

On the failing node (so far I have had failures from both sides):
# cat /var/lib/pve-manager/pve-replication-state.json
{"105":{"local/ns302695":{"last_iteration":1499853601,"fail_count":0,"duration":32.107092,"last_node":"ns3511723","storeid_list":["local-zfs"],"last_try":1499853633,"last_sync":1499853633}},"102":{"local/ns302695":{"last_try":1499805001,"last_sync":1499805001,"last_node":"ns3511723","duration":126.81862,"storeid_list":["local-zfs"],"last_iteration":1499805001,"fail_count":0}},"107":{"local/ns302695":{"fail_count":0,"last_iteration":1499859000,"duration":3.511844,"last_node":"ns3511723","storeid_list":["local-zfs"],"last_try":1499859000,"last_sync":1499859000}},"100":{"local/ns302695":{"error":"command 
'set -o pipefail && pvesm export local-zfs:vm-100-disk-1 zfs - 
-with-snapshots 1 -snapshot __replicate_100-0_1499858220__ | 
/usr/bin/cstream -t 10000000 | /usr/bin/ssh -o 'BatchMode=yes' -o 
'HostKeyAlias=ns302695' root@IP.OF.TAR.GET -- pvesm import 
local-zfs:vm-100-disk-1 zfs - -with-snapshots 1' failed: exit code 
255","fail_count":5,"last_iteration":1499858220,"duration":2.493542,"last_node":"ns3511723","storeid_list":["local-zfs"],"last_try":1499858220,"last_sync":1499846406}}}
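
Since the state file is a single long JSON line, it may be easier to read after a small filter. The following is a sketch only: failing_jobs and utc are hypothetical helper names, and the sample is a cut-down copy of the state above with the error text shortened. On a real node one would load /var/lib/pve-manager/pve-replication-state.json instead.

```python
import json
from datetime import datetime, timezone


def utc(ts):
    """Format a Unix timestamp as a UTC string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")


def failing_jobs(state):
    """Return (guest, job key, fail_count, seconds between last_sync and last_try)."""
    rows = []
    for guest, targets in state.items():
        for job_key, info in targets.items():
            if info.get("fail_count", 0) > 0:
                gap = info["last_try"] - info["last_sync"]
                rows.append((guest, job_key, info["fail_count"], gap))
    return rows


# Cut-down copy of the state shown above; the error string is shortened.
SAMPLE = json.loads('''
{"107": {"local/ns302695": {"fail_count": 0, "last_try": 1499859000,
                            "last_sync": 1499859000}},
 "100": {"local/ns302695": {"error": "... failed: exit code 255",
                            "fail_count": 5, "last_try": 1499858220,
                            "last_sync": 1499846406}}}
''')

for guest, job_key, fails, gap in failing_jobs(SAMPLE):
    info = SAMPLE[guest][job_key]
    print(f"VM {guest} ({job_key}): {fails} failures, "
          f"last good sync {utc(info['last_sync'])}, "
          f"last try {utc(info['last_try'])} ({gap} s later)")
```

The timestamps are printed in UTC, so they are two hours behind the CEST times in the log output.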

But I already knew all of this from the API :)

pve:/> get nodes/ns3511723/replication/100-0/log
200 OK
[
    {
       "n" : 1,
       "t" : "2017-07-12 13:17:00 100-0: start replication job"
    },
    {
       "n" : 2,
       "t" : "2017-07-12 13:17:00 100-0: guest => VM 100, running => 12279"
    },
    {
       "n" : 3,
       "t" : "2017-07-12 13:17:00 100-0: volumes => local-zfs:vm-100-disk-1"
    },
    {
       "n" : 4,
       "t" : "2017-07-12 13:17:01 100-0: create snapshot 
'__replicate_100-0_1499858220__' on local-zfs:vm-100-disk-1"
    },
    {
       "n" : 5,
       "t" : "2017-07-12 13:17:01 100-0: full sync 
'local-zfs:vm-100-disk-1' (__replicate_100-0_1499858220__)"
    },
    {
       "n" : 6,
       "t" : "2017-07-12 13:17:03 100-0: delete previous replication 
snapshot '__replicate_100-0_1499858220__' on local-zfs:vm-100-disk-1"
    },
    {
       "n" : 7,
       "t" : "2017-07-12 13:17:03 100-0: end replication job with error: 
command 'set -o pipefail && pvesm export local-zfs:vm-100-disk-1 zfs - 
-with-snapshots 1 -snapshot __replicate_100-0_1499858220__ | 
/usr/bin/cstream -t 10000000 | /usr/bin/ssh -o 'BatchMode=yes' -o 
'HostKeyAlias=ns302695' root@IP.OF.TAR.GET -- pvesm import 
local-zfs:vm-100-disk-1 zfs - -with-snapshots 1' failed: exit code 255"
    }
]

pve:/> get nodes/ns3511723/replication/100-0/status
200 OK
{
    "duration" : 2.493542,
    "error" : "command 'set -o pipefail && pvesm export 
local-zfs:vm-100-disk-1 zfs - -with-snapshots 1 -snapshot 
__replicate_100-0_1499858220__ | /usr/bin/cstream -t 10000000 | 
/usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=ns302695' 
root@IP.OF.TAR.GET -- pvesm import local-zfs:vm-100-disk-1 zfs - 
-with-snapshots 1' failed: exit code 255",
    "fail_count" : 5,
    "guest" : "100",
    "id" : "100-0",
    "jobnum" : "0",
    "last_sync" : 1499846406,
    "last_try" : 1499858220,
    "next_sync" : 1499860020,
    "rate" : 10,
    "schedule" : "*/2:00",
    "target" : "ns302695",
    "type" : "local",
    "vmtype" : "qemu"
}

Also, I have set a rate limit of 10 MB/s for the replication jobs, which is 
only a fraction of the available bandwidth between the nodes, so it should 
not be an issue.
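
As a rough sanity check on the rate limit, the durations from the state files give an upper bound on how much data each job could have sent. This is a back-of-the-envelope sketch, assuming that "rate 10" corresponds to the 10000000 bytes/s passed to cstream in the error message and that the reported duration covers the whole job; max_bytes_sent is just an illustrative helper.

```python
# Throughput limit taken from 'cstream -t 10000000' in the error message.
BYTES_PER_SEC = 10_000_000


def max_bytes_sent(duration_s, rate=BYTES_PER_SEC):
    """Upper bound on bytes that can pass the throttle in duration_s seconds."""
    return duration_s * rate


# (guest, duration in seconds) as reported in the state files above.
for guest, duration in [("102", 126.81862), ("105", 32.107092), ("100", 2.493542)]:
    megabytes = max_bytes_sent(duration) / 1e6
    print(f"VM {guest}: at most {megabytes:.1f} MB in {duration:.1f} s")
```

The failed job for VM 100 ran for only about 2.5 seconds, so at most roughly 25 MB could have gone through the pipe before it stopped. That would fit a failure early in the transfer rather than a bandwidth problem, although it does not prove it.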

On 2017-07-12 13:39, Dominik Csapak wrote:
> hi,
>
> i reply here, to avoid confusion in the other thread
>
> can you post the content of the two files:
>
> /etc/pve/replication.cfg
> /var/lib/pve-manager/pve-replication-state.json (of the source node)
>
> ?