[pve-devel] applied: [Patch V3 guest-common] fix #1694: make failure of snapshot removal non-fatal

Mon Apr 16 10:54:31 CEST 2018

applied with fixes - see comment inline:

> On April 13, 2018 at 12:24 PM Wolfgang Link <w.link at proxmox.com> wrote:
> 
> 
> In certain high-load scenarios ANY ZFS operation can block,
> including registering an (async) destroy.
> Since ZFS operations are implemented via ioctl's,
> killing the user space process
> does not affect the waiting kernel thread processing the ioctl.
> 
> Once "zfs destroy" has been called, killing it does not say anything
> about whether the destroy operation will be aborted or not.
> Since running into a timeout effectively means killing it,
> we don't know whether the snapshot exists afterwards or not.
> We also don't know how long it takes for ZFS to catch up on pending ioctls.
> 
> Given the above problem, we must to not die on errors when deleting a no
> longer needed snapshot fails (under a timeout) after an otherwise
> successful replication. Since we retry on the next run anyway, this is
> not problematic.
> 
> The snapshot deletion error will be logged in the replication log
> and the syslog/journal.
> ---
>  PVE/Replication.pm | 30 ++++++++++++++++++++++++++----
>  1 file changed, 26 insertions(+), 4 deletions(-)
> 
> Patch V2:
> As discuss with Dietmar we will keep remove the former snapshot at the end of
> the run,
> but ignore if it fails. This is to prevent we have 2 replication snapshots at
> time.
> 
> Patch V3:
> Remove cleanup_local_snapshots from eval as Diemar suggest.
> Use Fabian commit message.
> 
> diff --git a/PVE/Replication.pm b/PVE/Replication.pm
> index 9bc4e61..98ba1b6 100644
> --- a/PVE/Replication.pm
> +++ b/PVE/Replication.pm
> @@ -136,8 +136,21 @@ sub prepare {
>  		$last_snapshots->{$volid}->{$snap} = 1;
>  	    } elsif ($snap =~ m/^\Q$prefix\E/) {
>  		$logfunc->("delete stale replication snapshot '$snap' on $volid");
> -		PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
> -		$cleaned_replicated_volumes->{$volid} = 1;
> +
> +		eval {
> +		    PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
> +		    $cleaned_replicated_volumes->{$volid} = 1;
> +		};
> +
> +		# If deleting the snapshot fails, we can not be sure if it was due to an
> error or a timeout.
> +		# The likelihood that the delete has worked out is high at a timeout.
> +		# If it really fails, it will try to remove on the next run.
> +
> +		# warn is for syslog/journal.
> +		warn $@ if $@;
> +
> +		# logfunc will written in replication log.
> +		$logfunc->("delete stale replication snapshot error: $@") if $@;
>  	    }
>  	}
>      }
> @@ -296,9 +309,18 @@ sub replicate {
>      # remove old snapshots because they are no longer needed
>      $cleanup_local_snapshots->($last_snapshots, $last_sync_snapname);
>  
> -    remote_finalize_local_job($ssh_info, $jobid, $vmid, $sorted_volids,
> $start_time, $logfunc);
> +    eval {
> +	remote_finalize_local_job($ssh_info, $jobid, $vmid, $sorted_volids,
> $start_time, $logfunc);
> +    };
>  
> -    die $err if $err;
> +    # old snapshots will removed by next run from prepare_local_job.
> +    if ($err = $@) {
> +	# warn is for syslog/journal.
> +	warn $err;
> +
> +	# logfunc will written in replication log.
> +	$logfunc->("delete stale replication snapshot error: err");

should be:

	$logfunc->("delete stale replication snapshot error: $err");

> +    }
>  
>      return $volumes;
>  }
> --