[PVE-User] Back on replica, 'got unexpected replication job error ... timeout'
Marco Gaiarin
gaio at lilliput.linux.it
Fri Oct 21 12:16:14 CEST 2022
I continue to get spurious errors like:
Subject: Replication Job: 121-0 failed
command 'zfs snapshot rpool-data/vm-121-disk-0@__replicate_121-0_1666288805__' failed: got timeout
I'm convinced that:
1) they are I/O-bound, not network-bound; if I limit the bandwidth of the
replica to some indecent value (e.g., 5 Mbit/s) they still happen.
2) they are totally self-healing and benign.
In practice, if the I/O is under stress (for example, during a running
backup), the PVE Perl code times out waiting for a reply to an operation
that does in fact succeed, just not within the specified time.
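To illustrate the behaviour I mean, here is a minimal Python sketch (not
PVE's Perl code, and I make no claim about how PVE handles the child
process internally). The helper name run_with_deadline and the sleeping
stand-in command are hypothetical; the point is only that a caller can
give up and report a timeout while the command itself still finishes
successfully a moment later.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout_s):
    """Return (timed_out, exit_code); the command is never killed."""
    proc = subprocess.Popen(cmd)
    try:
        return False, proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The caller would report "got timeout" here, yet the command
        # keeps running and may still complete successfully.
        return True, proc.wait()


# Hypothetical stand-in for a slow `zfs snapshot`: sleeps 1 s, exits 0.
slow_cmd = [sys.executable, "-c", "import time; time.sleep(1)"]
timed_out, exit_code = run_with_deadline(slow_cmd, timeout_s=0.2)
print(timed_out, exit_code)
```

Here the deadline is missed, but the exit code is still 0, which matches
what I see with the replication jobs.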
Looking at the log, I've also found:
Oct 21 02:30:25 pppve2 pvesr[19291]: command 'zfs destroy rpool-data/vm-128-disk-1@__replicate_128-0_1666297807__' failed: got timeout
so destroy operations also time out, but PVE does not send an email
complaining about them. And the snapshot does get correctly deleted:
root@pppve2:~# zfs list -t snapshot | grep _128
rpool-data/vm-128-disk-1@__replicate_128-0_1666312205__ 378K - 2.02G -
rpool/data/vm-128-disk-0@__replicate_128-0_1666312205__ 50.2M - 19.7G -
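The manual check above could be automated roughly as follows. This is only
a sketch under my own assumptions: the function names are hypothetical, the
regular expression is guessed from the two log lines quoted in this message,
and the archive's " at " spelling of "@" is accepted as well. It cross-checks
each timed-out zfs command in the log against the snapshot listing and calls
it benign when the intended end state was reached.

```python
import re

# Pattern guessed from the log lines quoted in this message.
TIMEOUT_RE = re.compile(
    r"command 'zfs (snapshot|destroy) (\S+?)(?:@| at )(\S+)' failed: got timeout"
)


def parse_snapshot_list(listing):
    """Parse `zfs list -t snapshot` output into a set of (dataset, snapshot)."""
    found = set()
    for line in listing.splitlines():
        m = re.match(r"(\S+?)(?:@| at )(\S+)", line.strip())
        if m:
            found.add((m.group(1), m.group(2)))
    return found


def timed_out_ops(log_text):
    """Return (action, dataset, snapshot) for each timed-out zfs command."""
    return [m.groups() for m in TIMEOUT_RE.finditer(log_text)]


def is_benign(action, dataset, snapshot, existing):
    """Benign means: a snapshot op left the snapshot present,
    a destroy op left it absent."""
    present = (dataset, snapshot) in existing
    return present if action == "snapshot" else not present


log = ("Oct 21 02:30:25 pppve2 pvesr[19291]: command 'zfs destroy "
       "rpool-data/vm-128-disk-1@__replicate_128-0_1666297807__' "
       "failed: got timeout")
listing = ("rpool-data/vm-128-disk-1@__replicate_128-0_1666312205__ 378K - 2.02G -\n"
           "rpool/data/vm-128-disk-0@__replicate_128-0_1666312205__ 50.2M - 19.7G -")

existing = parse_snapshot_list(listing)
for action, dataset, snapshot in timed_out_ops(log):
    print(action, dataset, snapshot, is_benign(action, dataset, snapshot, existing))
```

In real use the listing would come from `zfs list -H -t snapshot -o name`
and the log from the journal or syslog, rather than from pasted strings.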
Am I right? Can I file a bug report for this?
Thanks.
--
...the Messina bridge will join «not two coasts but two mafia clans».
(Niki Vendola)