[PVE-User] Replication blocked issue

Wed Apr 28 17:34:14 CEST 2021

Dear PVE users,

I have a 3-node cluster with ZFS storage.
Each node uses its own storage, and the VMs/LXCs are replicated to the
other nodes every 10 minutes.

Sometimes a replication job keeps running and never finishes.

For example, at the moment I have a replication job that started yesterday:

2021-04-27 07:20:01 101-1: start replication job
2021-04-27 07:20:01 101-1: guest => CT 101, running => 1
2021-04-27 07:20:01 101-1: volumes => DS1:subvol-101-disk-1
2021-04-27 07:20:02 101-1: freeze guest filesystem
2021-04-27 07:20:05 101-1: create snapshot '__replicate_101-1_1619500801__' on DS1:subvol-101-disk-1
2021-04-27 07:20:06 101-1: thaw guest filesystem
2021-04-27 07:20:06 101-1: using secure transmission, rate limit: none
2021-04-27 07:20:06 101-1: incremental sync 'DS1:subvol-101-disk-1' (__replicate_101-1_1619500201__ => __replicate_101-1_1619500801__)
2021-04-27 07:20:08 101-1: send from @__replicate_101-1_1619500201__ to zp1/subvol-101-disk-1@__replicate_101-0_1619500211__ estimated size is 213K
2021-04-27 07:20:08 101-1: send from @__replicate_101-0_1619500211__ to zp1/subvol-101-disk-1@__replicate_101-1_1619500801__ estimated size is 26.1M
2021-04-27 07:20:08 101-1: total estimated size is 26.4M
2021-04-27 07:20:09 101-1: TIME        SENT   SNAPSHOT zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
2021-04-27 07:20:09 101-1: 07:20:09   3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
[...]
2021-04-28 17:27:25 101-1: 17:27:25   3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
2021-04-28 17:27:26 101-1: 17:27:26   3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
2021-04-28 17:27:27 101-1: 17:27:27   3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__

As you can see, there has been no progress in all this time: still only
3.18M transferred out of an estimated 26.4M.

There are 2 big problems with this:

1) the blocked replication prevents the other replication jobs scheduled
on the source node from running until it ends or fails

2) I have found no other solution than rebooting the destination node to
get out of this situation.

I tried to kill the process on the destination node, but it is in D state
(uninterruptible sleep) and cannot be killed.
Is there a way to get out of this scenario without rebooting the nodes?
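
For anyone who wants to check the same thing, here is a minimal sketch of
how processes in D state could be listed on the destination node. It assumes
a Linux /proc filesystem; the function names parse_stat and
d_state_processes are made up for this example and are not part of PVE.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST closing parenthesis.
    head, _, tail = text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    return int(pid_str), comm, tail.split()[0]


def d_state_processes(proc="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            path = os.path.join(proc, entry, "stat")
            with open(path, errors="replace") as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were reading it
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

This only identifies the stuck processes; it does not unblock them.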

Thanks a lot and best regards,

-- 
Marco Bertorello
https://www.marcobertorello.it