[pve-devel] [guest-common] fix #1694: Replication risks permanently losing sync in high loads due to timeout bug
Thomas Lamprecht
t.lamprecht at proxmox.com
Thu Mar 29 06:22:25 CEST 2018
On 3/23/18 12:15 PM, Wolfgang Link wrote:
> If the pool is under heavy load ZFS will low prioritized deletion jobs.
> This ends in a timeout and the program logic will delete the current sync snapshot.
> On the next run the former sync snapshots will also removed because they are not in the state file.
> In this state it is no more possible to sync and a full sync has to be performed.
>
> We do not delete the former snapshot on the end of the replication run,
> because prepare_local_job will delete it anyway and
> when a timeout happens in this state we can ignore it and start the replication.
>
So why was it ever done this way? We always get the remote side snapshots
and cleanup the stale ones there, so what purpose had the
remote_finalize_local_job method? Cleaning up in some edge case the
remote_prepare_local_job method didn't reached?
Just asking so that I can better understand any possible side effects of
your proposed changes...
> The snapshot deletion error will be logged in the replication log.
> ---
> PVE/Replication.pm | 28 +++++++++++++++-------------
> 1 file changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/PVE/Replication.pm b/PVE/Replication.pm
> index 9bc4e61..1eb853d 100644
> --- a/PVE/Replication.pm
> +++ b/PVE/Replication.pm
> @@ -136,8 +136,18 @@ sub prepare {
> $last_snapshots->{$volid}->{$snap} = 1;
> } elsif ($snap =~ m/^\Q$prefix\E/) {
> $logfunc->("delete stale replication snapshot '$snap' on $volid");
> - PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
> - $cleaned_replicated_volumes->{$volid} = 1;
> +
> + eval {
> + PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
> + $cleaned_replicated_volumes->{$volid} = 1;
> + };
> +
> + # If deleting the snapshot fails, we can not be sure if it was due to an error or a timeout.
> + # The likelihood that the delete has worked out is high at a timeout.
> + # If it really fails, it will try to remove on the next run.
> + warn $@ if $@;
> +
> + $logfunc->("delete stale replication snapshot error: $@") if $@;
> }
> }
> }
> @@ -282,24 +292,16 @@ sub replicate {
> }
>
> replicate_volume($ssh_info, $storecfg, $volid, $base_snapname, $sync_snapname, $rate, $insecure, $logfunc);
> + # old snapshots will removed by next run from prepare_local_job.
> }
> };
> - $err = $@;
> -
> - if ($err) {
> + if ($@) {
> $cleanup_local_snapshots->($replicate_snapshots, $sync_snapname); # try to cleanup
> # we do not cleanup the remote side here - this is done in
> # next run of prepare_local_job
> - die $err;
> + die $@;
> }
The above change is unnecessary, I could fix-it up and amend your patch.
If there's a v2 of this please address that.
>
> - # remove old snapshots because they are no longer needed
> - $cleanup_local_snapshots->($last_snapshots, $last_sync_snapname);
> -
> - remote_finalize_local_job($ssh_info, $jobid, $vmid, $sorted_volids, $start_time, $logfunc);
> -
This makes remote_finalize_local_job an orphan, it has no callers anymore...
also the finalize-local-job CLI command has no caller.
So if this approach is deemed good those would need to be cleaned up, or
deprecated?
> - die $err if $err;
> -
> return $volumes;
> }
>
>
More information about the pve-devel
mailing list