[pve-devel] [Patch V2 guest-common] fix #1694: Replication risks permanently losing sync in high loads due to timeout bug

Thu Apr 12 11:38:10 CEST 2018

minor nit: please use a short subject that indicates what the commit
does, not what the bug was that it fixes.

e.g., something like

fix #1694: make failure of snapshot removal non-fatal

On Thu, Apr 12, 2018 at 09:46:03AM +0200, Wolfgang Link wrote:
> If the pool is under heavy load ZFS will low prioritized deletion jobs.

I don't think this is actually the reason - destroy is already async in
ZFS. I think the reason is that in certain high-load scenarios ANY ZFS
operation can block, including registering an (async) destroy. Since ZFS
operations are implemented via ioctl's, killing the user space process
does not affect the waiting kernel thread processing the ioctl.

Whether freeing the data has a lower priority than writing stuff is not
related IMHO.

> This ends in a timeout and the program logic will delete the current sync snapshot.
> On the next run the former sync snapshots will also removed because they are not in the state file.
> In this state it is no more possible to sync and a full sync has to be performed.

I understand this (because we discussed/triaged the bug together ;)),
but I don't think anybody can understand that just by looking at the
commit.

the actual problem is:
once "zfs destroy" has been called, killing it does not say anything
about whether the destroy operation will be aborted or not. since
running into a timeout effectively means killing it, we don't know
whether the snapshot exists afterwards or not. we also don't know how
long it takes for ZFS to catch up on pending ioctls.

> We ignore if the deletion of the former snapshot failed on the end of the replication run,
> because prepare_local_job will delete it anyway and
> when a timeout happens in this state we can ignore it and start the replication.

given the above problem, we must to not die on errors when deleting a no
longer needed snapshot fails (under a timeout) after an otherwise
successful replication. since we retry on the next run anyway, this is
not problematic.

> 
> The snapshot deletion error will be logged in the replication log.
> ---
>  PVE/Replication.pm | 26 ++++++++++++++++++++------
>  1 file changed, 20 insertions(+), 6 deletions(-)
> 
> Patch V2:
> As discuss with Dietmar we will keep remove the former snapshot at the end of the run,
> but ignore if it fails. This is to prevent we have 2 replication snapshots at time.
> 
> diff --git a/PVE/Replication.pm b/PVE/Replication.pm
> index 9bc4e61..d8ccfaf 100644
> --- a/PVE/Replication.pm
> +++ b/PVE/Replication.pm
> @@ -136,8 +136,18 @@ sub prepare {
>  		$last_snapshots->{$volid}->{$snap} = 1;
>  	    } elsif ($snap =~ m/^\Q$prefix\E/) {
>  		$logfunc->("delete stale replication snapshot '$snap' on $volid");
> -		PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
> -		$cleaned_replicated_volumes->{$volid} = 1;
> +
> +		eval {
> +		    PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
> +		    $cleaned_replicated_volumes->{$volid} = 1;
> +		};

either this eval or the one in the last hunk should be enough, no need
to have two evals? this eval also changes the semantics for each cleanup
BEFORE the replication.

> +
> +		# If deleting the snapshot fails, we can not be sure if it was due to an error or a timeout.
> +		# The likelihood that the delete has worked out is high at a timeout.
> +		# If it really fails, it will try to remove on the next run.
> +		warn $@ if $@;
> +
> +		$logfunc->("delete stale replication snapshot error: $@") if $@;

do we need the double warn + log here? if so, maybe a comment would be
nice to indicate why..

>  	    }
>  	}
>      }
> @@ -293,12 +303,16 @@ sub replicate {
>  	die $err;
>      }
>  
> -    # remove old snapshots because they are no longer needed
> -    $cleanup_local_snapshots->($last_snapshots, $last_sync_snapname);
> +    eval {
> +	# remove old snapshots because they are no longer needed
> +	$cleanup_local_snapshots->($last_snapshots, $last_sync_snapname);

like Dietmar said, $cleanup_local_snapshots already does an eval, so
this should stay as it was before

>  
> -    remote_finalize_local_job($ssh_info, $jobid, $vmid, $sorted_volids, $start_time, $logfunc);
> +	remote_finalize_local_job($ssh_info, $jobid, $vmid, $sorted_volids, $start_time, $logfunc);

this needs to be in an eval only if prepare() does not have one, unless
I misunderstand the code?

this is the part that actually triggers the bug, since if we die here,
we think the replication has failed, clean up the new replication
snapshot locally, and don't update the replication state. the remote
side does have the new snapshot that the local side is now lacking, but
it also (potentially) no longer has the old snapshot that the local side
has (the last one according to our state file). maybe something like
that should go into the commit message as well?

since there is only this one caller AFAICT, the eval could also be moved
into remote_finalize_local_job (with the same question attached though
;))

> +    };
>  
> -    die $err if $err;
> +    # old snapshots will removed by next run from prepare_local_job.
> +    warn $@ if $@;
> +    $logfunc->("delete stale replication snapshot error: $@") if $@;

again, is this double warn + log needed? (I ask because 2 out of 3 'warn'
statements in Replication.pm are introduced by this patch..)

(cleanup_local_snapshots already warns, which maybe should be converted
to $logfunc as well?)

>  
>      return $volumes;
>  }
> -- 
> 2.11.0
> 
> 
> _______________________________________________
> pve-devel mailing list
> pve-devel at pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel