[pve-devel] [PATCH zsync] fix #2821: only abort if there really is a waiting/syncing job instance already

Fabian Ebner f.ebner at proxmox.com
Mon Dec 14 14:00:39 CET 2020


By remembering the instance via PID and start time and checking for that
information in later instances. If the old instance can't be found, the new one
will continue and register itself in the state.

After updating, if there is a waiting instance running the old version, one more
might be created, because there is no instance_id yet. But the new instance will
set the instance_id, which any later instance will see.

More importantly, if the state is wrongly 'waiting' or 'syncing', e.g.
because an instance was terminated before finishing, we don't abort anymore and
recover from the wrong state, thus fixing the bug.

Signed-off-by: Fabian Ebner <f.ebner at proxmox.com>
---

I couldn't find a better unique identifier that can be easily verfied from
within another instance, but PID and start time should be good enough for the
intended purpose.

Another alternative would be to introduce job-specific locking around the whole
sync() block, but then we would have some three lock-level deep code...

@Thomas: I felt like this was more complete than the "clear state after boot"-
solution, because it also works when the processes are killed for different
reasons than during shutdown.

 pve-zsync | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/pve-zsync b/pve-zsync
index f3b98c4..506bfab 100755
--- a/pve-zsync
+++ b/pve-zsync
@@ -266,6 +266,7 @@ sub add_state_to_job {
     $job->{state} = $state->{state};
     $job->{lsync} = $state->{lsync};
     $job->{vm_type} = $state->{vm_type};
+    $job->{instance_id} = $state->{instance_id};
 
     for (my $i = 0; $state->{"snap$i"}; $i++) {
 	$job->{"snap$i"} = $state->{"snap$i"};
@@ -365,6 +366,7 @@ sub update_state {
     if ($job->{state} ne "del") {
 	$state->{state} = $job->{state};
 	$state->{lsync} = $job->{lsync};
+	$state->{instance_id} = $job->{instance_id};
 	$state->{vm_type} = $job->{vm_type};
 
 	for (my $i = 0; $job->{"snap$i"} ; $i++) {
@@ -584,6 +586,33 @@ sub destroy_job {
     });
 }
 
+sub get_process_start_time {
+    my ($pid) = @_;
+
+    return eval { run_cmd(['ps', '-o', 'lstart=', '-p', "$pid"]); };
+}
+
+sub get_instance_id {
+    my ($pid) = @_;
+
+    my $starttime = get_process_start_time($pid)
+	or die "could not determine start time for process '$pid'\n";
+
+    return "${pid}:${starttime}";
+}
+
+sub instance_exists {
+    my ($instance_id) = @_;
+
+    if (defined($instance_id) && $instance_id =~ m/^([1-9][0-9]*):(.*)$/) {
+	my ($pid, $starttime) = ($1, $2);
+	my $actual_starttime = get_process_start_time($pid);
+	return defined($actual_starttime) && $starttime eq $actual_starttime;
+    }
+
+    return 0;
+}
+
 sub sync {
     my ($param) = @_;
 
@@ -593,11 +622,18 @@ sub sync {
 	eval { $job = get_job($param) };
 
 	if ($job) {
-	    if (defined($job->{state}) && ($job->{state} eq "syncing" || $job->{state} eq "waiting")) {
+	    my $state = $job->{state} // 'ok';
+	    $state = 'ok' if !instance_exists($job->{instance_id});
+
+	    if ($state eq "syncing" || $state eq "waiting") {
 		die "Job --source $param->{source} --name $param->{name} is already scheduled to sync\n";
 	    }
 
 	    $job->{state} = "waiting";
+
+	    eval { $job->{instance_id} = get_instance_id($$); };
+	    warn "Could not set instance ID - $@" if $@;
+
 	    update_state($job);
 	}
     });
@@ -671,6 +707,7 @@ sub sync {
 		eval { $job = get_job($param); };
 		if ($job) {
 		    $job->{state} = "error";
+		    delete $job->{instance_id};
 		    update_state($job);
 		}
 	    });
@@ -687,6 +724,7 @@ sub sync {
 		    $job->{state} = "ok";
 		}
 		$job->{lsync} = $date;
+		delete $job->{instance_id};
 		update_state($job);
 	    }
 	});
-- 
2.20.1






More information about the pve-devel mailing list