[pve-devel] [Patch V3 guest-common] fix #1694: make failure of snapshot removal non-fatal

Fri Apr 13 12:24:39 CEST 2018

In certain high-load scenarios ANY ZFS operation can block,
including registering an (async) destroy.
Since ZFS operations are implemented via ioctl's,
killing the user space process
does not affect the waiting kernel thread processing the ioctl.

Once "zfs destroy" has been called, killing it does not say anything
about whether the destroy operation will be aborted or not.
Since running into a timeout effectively means killing it,
we don't know whether the snapshot exists afterwards or not.
We also don't know how long it takes for ZFS to catch up on pending ioctls.

Given the above problem, we must to not die on errors when deleting a no
longer needed snapshot fails (under a timeout) after an otherwise
successful replication. Since we retry on the next run anyway, this is
not problematic.

The snapshot deletion error will be logged in the replication log
and the syslog/journal.
---
 PVE/Replication.pm | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

Patch V2:
As discuss with Dietmar we will keep remove the former snapshot at the end of the run,
but ignore if it fails. This is to prevent we have 2 replication snapshots at time.

Patch V3:
Remove cleanup_local_snapshots from eval as Diemar suggest.
Use Fabian commit message.

diff --git a/PVE/Replication.pm b/PVE/Replication.pm
index 9bc4e61..98ba1b6 100644
--- a/PVE/Replication.pm
+++ b/PVE/Replication.pm
@@ -136,8 +136,21 @@ sub prepare {
 		$last_snapshots->{$volid}->{$snap} = 1;
 	    } elsif ($snap =~ m/^\Q$prefix\E/) {
 		$logfunc->("delete stale replication snapshot '$snap' on $volid");
-		PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
-		$cleaned_replicated_volumes->{$volid} = 1;
+
+		eval {
+		    PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
+		    $cleaned_replicated_volumes->{$volid} = 1;
+		};
+
+		# If deleting the snapshot fails, we can not be sure if it was due to an error or a timeout.
+		# The likelihood that the delete has worked out is high at a timeout.
+		# If it really fails, it will try to remove on the next run.
+
+		# warn is for syslog/journal.
+		warn $@ if $@;
+
+		# logfunc will written in replication log.
+		$logfunc->("delete stale replication snapshot error: $@") if $@;
 	    }
 	}
     }
@@ -296,9 +309,18 @@ sub replicate {
     # remove old snapshots because they are no longer needed
     $cleanup_local_snapshots->($last_snapshots, $last_sync_snapname);
 
-    remote_finalize_local_job($ssh_info, $jobid, $vmid, $sorted_volids, $start_time, $logfunc);
+    eval {
+	remote_finalize_local_job($ssh_info, $jobid, $vmid, $sorted_volids, $start_time, $logfunc);
+    };
 
-    die $err if $err;
+    # old snapshots will removed by next run from prepare_local_job.
+    if ($err = $@) {
+	# warn is for syslog/journal.
+	warn $err;
+
+	# logfunc will written in replication log.
+	$logfunc->("delete stale replication snapshot error: err");
+    }
 
     return $volumes;
 }
-- 
2.11.0