[PVE-User] Replication blocked issue
Bertorello, Marco
me at marcobertorello.it
Wed Apr 28 17:34:14 CEST 2021
Dear PVE users,
I have a 3-node cluster with ZFS storage.
Every node uses its own storage, and the VMs/LXCs are replicated to the
other nodes every 10 minutes.
Sometimes a replication job keeps running and never finishes.
For example, at the moment I have a replication job that started yesterday:
2021-04-27 07:20:01 101-1: start replication job
2021-04-27 07:20:01 101-1: guest => CT 101, running => 1
2021-04-27 07:20:01 101-1: volumes => DS1:subvol-101-disk-1
2021-04-27 07:20:02 101-1: freeze guest filesystem
2021-04-27 07:20:05 101-1: create snapshot '__replicate_101-1_1619500801__' on DS1:subvol-101-disk-1
2021-04-27 07:20:06 101-1: thaw guest filesystem
2021-04-27 07:20:06 101-1: using secure transmission, rate limit: none
2021-04-27 07:20:06 101-1: incremental sync 'DS1:subvol-101-disk-1' (__replicate_101-1_1619500201__ => __replicate_101-1_1619500801__)
2021-04-27 07:20:08 101-1: send from @__replicate_101-1_1619500201__ to zp1/subvol-101-disk-1@__replicate_101-0_1619500211__ estimated size is 213K
2021-04-27 07:20:08 101-1: send from @__replicate_101-0_1619500211__ to zp1/subvol-101-disk-1@__replicate_101-1_1619500801__ estimated size is 26.1M
2021-04-27 07:20:08 101-1: total estimated size is 26.4M
2021-04-27 07:20:09 101-1: TIME SENT SNAPSHOT zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
2021-04-27 07:20:09 101-1: 07:20:09 3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
[...]
2021-04-28 17:27:25 101-1: 17:27:25 3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
2021-04-28 17:27:26 101-1: 17:27:26 3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
2021-04-28 17:27:27 101-1: 17:27:27 3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
As you can see, there has been no progress in more than a day: still only
3.18M transferred.
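A stall like this can also be spotted automatically from the job log. The following is only a minimal sketch in Python, assuming progress lines shaped like the ones quoted above (log timestamp, job ID, then the time and the amount sent so far); the function name `stalled_for` is hypothetical and not part of any Proxmox tool.

```python
import re
from datetime import datetime

# Matches the start of a progress line such as:
#   2021-04-28 17:27:25 101-1: 17:27:25 3.18M
# Other log lines (job start, snapshot creation, size estimates) do not match.
PROGRESS = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+): "
    r"\d{2}:\d{2}:\d{2}\s+(\d[\d.]*[KMGT]?)"
)

def stalled_for(log_lines):
    """Return (sent, seconds) for the trailing run of unchanged 'sent' values.

    Returns None if the log contains no progress lines at all.
    """
    current = None      # last 'sent' value seen
    first_seen = None   # timestamp when that value first appeared
    last_seen = None    # timestamp of the most recent progress line
    for line in log_lines:
        m = PROGRESS.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        sent = m.group(3)
        if sent != current:
            # The counter moved, so a new run of identical values starts here.
            current = sent
            first_seen = ts
        last_seen = ts
    if current is None:
        return None
    return current, int((last_seen - first_seen).total_seconds())
```

Fed the first and last progress lines of the log above, this reports that the counter has been sitting at 3.18M for a little over 34 hours. A monitoring script could raise an alert once that value exceeds a chosen threshold.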
There are 2 big problems with this:
1) the blocked replication prevents the other replication jobs scheduled on
the source node from running until it ends or fails
2) I have found no solution other than rebooting the destination node to get
out of this situation.
I tried to kill the process on the destination node, but the process is
in D state and cannot be killed.
Is there a way to get out of this situation without rebooting the nodes?
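For anyone who wants to check whether they are in the same situation, processes in D state (uninterruptible sleep) can be listed by reading `/proc/<pid>/stat` on the affected node. This is a minimal sketch in Python, assuming a Linux host; the helper names `parse_stat` and `uninterruptible_processes` are made up for illustration, and the script only reports such processes, it cannot unblock them.

```python
import os

def parse_stat(stat_text):
    """Split the contents of /proc/<pid>/stat into (pid, comm, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state

def uninterruptible_processes(proc_root="/proc"):
    """Return a list of (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found

if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

Run on the destination node, this would be expected to list the stuck receive process along with anything else blocked in the kernel.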
Thanks a lot and best regards,
--
Marco Bertorello
https://www.marcobertorello.it