[PVE-User] Back on replica, 'got unexpected replication job error ... timeout'

Fri Oct 21 12:16:14 CEST 2022

I continue to get spurious errors like:

	Subject: Replication Job: 121-0 failed

  command 'zfs snapshot rpool-data/vm-121-disk-0@__replicate_121-0_1666288805__' failed: got timeout

I'm convinced that:

1) they are I/O-bound, not network-bound; if I limit the bandwidth of the
  replica to some indecent value (e.g. 5 Mbit/s) they still happen.

2) they are totally self-healing and benign.

In practice, if the I/O is under stress (for example, because of a running
backup), the Perl PVE code times out waiting for a reply to an operation that
does in fact succeed, just not within the specified time.
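
To illustrate what I mean by "times out but still succeeds", here is a
minimal Python sketch. It is not the PVE code: slow_operation() and
run_with_deadline() are hypothetical names, and the sleep only stands in for
a zfs command slowed down by I/O load. It just shows that a caller-side
deadline can report a failure for work that completes a moment later.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def slow_operation(duration):
    """Stand-in for an I/O-bound command: sleeps, then reports success."""
    time.sleep(duration)
    return "done"


def run_with_deadline(duration, deadline):
    """Return (reported, actual): what the caller saw vs. what finally happened."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_operation, duration)
        try:
            # The caller only waits up to `deadline` seconds for the reply.
            reported = future.result(timeout=deadline)
        except FutureTimeout:
            reported = "got timeout"
        # The work itself was never cancelled, so it still finishes.
        actual = future.result()
    return reported, actual
```

With a deadline shorter than the operation, the caller sees "got timeout"
while the operation itself still ends with "done".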

Looking at the logs I've also found:

	Oct 21 02:30:25 pppve2 pvesr[19291]: command 'zfs destroy rpool-data/vm-128-disk-1@__replicate_128-0_1666297807__' failed: got timeout

so destroy operations time out too, but PVE does not send an email complaining
about them. And the snapshot does get deleted correctly:

 root@pppve2:~# zfs list -t snapshot | grep _128
 rpool-data/vm-128-disk-1@__replicate_128-0_1666312205__       378K      -     2.02G  -
 rpool/data/vm-128-disk-0@__replicate_128-0_1666312205__      50.2M      -     19.7G  -
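
The manual check above could be scripted roughly as follows. This is only a
sketch under assumptions: the function names are made up, the regular
expression is modelled on the two log lines quoted in this mail, and snapshot
names are expected in the usual dataset@snapshot form. It takes a "got
timeout" log line plus the text output of `zfs list -t snapshot` and says
whether the operation took effect anyway.

```python
import re

# Pattern modelled on the log lines quoted in this mail (an assumption,
# not an official pvesr log format).
TIMEOUT_RE = re.compile(
    r"command 'zfs (snapshot|destroy) (\S+)' failed: got timeout"
)


def timed_out_operation(log_line):
    """Return (action, snapshot_name) for a timeout line, else None."""
    match = TIMEOUT_RE.search(log_line)
    return (match.group(1), match.group(2)) if match else None


def snapshot_names(zfs_list_output):
    """Collect snapshot names from the first column of `zfs list -t snapshot`."""
    names = set()
    for line in zfs_list_output.splitlines():
        fields = line.split()
        if fields and "@" in fields[0]:
            names.add(fields[0])
    return names


def took_effect(log_line, zfs_list_output):
    """True if the timed-out operation succeeded anyway, None if not a timeout line."""
    operation = timed_out_operation(log_line)
    if operation is None:
        return None
    action, snapshot = operation
    present = snapshot in snapshot_names(zfs_list_output)
    # A snapshot must exist afterwards; a destroyed one must be gone.
    return present if action == "snapshot" else not present
```

For a timed-out snapshot the listing has to be taken soon after the error,
because a later replication run normally removes the older replication
snapshot again.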

Am I right? Can I file a bug for that?

Thanks.

-- 
  ...il ponte di Messina unirà «non due coste ma due cosche».
							(Niki Vendola)