[pve-devel] [PATCH pve-guest-common 1/1] partially fix #2530: snapshots: add pre/post/failed-snapshot hooks
Fabian Grünbichler
f.gruenbichler at proxmox.com
Wed Dec 21 13:41:53 CET 2022
On December 21, 2022 12:26 pm, Stefan Hanreich wrote:
>
>
> On 12/21/22 11:44, Fabian Grünbichler wrote:
>> this is v2, right? ;)
>
> Oh no - for some reason it's only in the cover letter..
>
>>
>> On December 12, 2022 2:43 pm, Stefan Hanreich wrote:
>>> This commit adds hooks to the snapshotting process, which can be used
>>> to run additional setup scripts to prepare the VM for snapshotting.
>>>
>>> Examples for use cases include:
>>> * forcing processes to flush their writes
>>> * blocking processes from writing
>>> * altering the configuration of the VM to make snapshotting possible
>>>
>>> The prepare step has been split into two parts, so the configuration
>>> can be locked a bit earlier during the snapshotting process. Doing it
>>> this way ensures that the configuration is already locked during the
>>> pre-snapshot hook. Because of this split, the VM config gets written
>>> in two stages now, rather than one.
>>>
>>> In case of failure during the preparation step - after the lock is
>>> written - error handling has been added so the lock gets released
>>> properly. The failed-snapshot hook runs when the snapshot fails, if
>>> the pre-snapshot hook ran already. This enables users to revert any
>>> changes done during the pre-snapshot hookscript.
>>
>> see below
>>
>>> The preparation step assumes that the hook does not convert the
>>> current VM into a template, which is why the basic checks are not
>>> re-run after the pre-snapshot hook. The storage check runs after the
>>> pre-snapshot hook, because the hook might get used to setup the
>>> storage for snapshotting. If the hook ran after the storage
>>> checks, this would not be possible.
>>>
>>> cfs_update() gets called after every invocation of a hookscript, since
>>> it is impossible to know which changes get made by the hookscript.
>>> Doing this ensures that we see the updated state of the CFS after the
>>> hookscript got invoked.
>>>
>>> Signed-off-by: Stefan Hanreich <s.hanreich at proxmox.com>
>>> ---
>>> src/PVE/AbstractConfig.pm | 49 ++++++++++++++++++++++++++++++++++-----
>>> 1 file changed, 43 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/src/PVE/AbstractConfig.pm b/src/PVE/AbstractConfig.pm
>>> index a0c0bc6..3bff600 100644
>>> --- a/src/PVE/AbstractConfig.pm
>>> +++ b/src/PVE/AbstractConfig.pm
>>> @@ -710,8 +710,7 @@ sub __snapshot_prepare {
>>>
>>> my $snap;
>>>
>>> - my $updatefn = sub {
>>> -
>>> + my $run_checks = sub {
>>> my $conf = $class->load_config($vmid);
>>>
>>> die "you can't take a snapshot if it's a template\n"
>>> @@ -721,15 +720,21 @@ sub __snapshot_prepare {
>>>
>>> $conf->{lock} = 'snapshot';
>>>
>>> - my $snapshots = $conf->{snapshots};
>>> -
>>> die "snapshot name '$snapname' already used\n"
>>> - if defined($snapshots->{$snapname});
>>> + if defined($conf->{snapshots}->{$snapname});
>>> +
>>> + $class->write_config($vmid, $conf);
>>> + };
>>>
>>> + my $updatefn = sub {
>>> + my $conf = $class->load_config($vmid);
>>> my $storecfg = PVE::Storage::config();
>>> +
>>> die "snapshot feature is not available\n"
>>> if !$class->has_feature('snapshot', $conf, $storecfg, undef, undef, $snapname eq 'vzdump');
>>>
>>> + my $snapshots = $conf->{snapshots};
>>> +
>>> for my $snap (sort keys %$snapshots) {
>>> my $parent_name = $snapshots->{$snap}->{parent} // '';
>>> if ($snapname eq $parent_name) {
>>> @@ -753,7 +758,32 @@ sub __snapshot_prepare {
>>> $class->write_config($vmid, $conf);
>>> };
>>>
>>> - $class->lock_config($vmid, $updatefn);
>>> + $class->lock_config($vmid, $run_checks);
>>> +
>>> + eval {
>>> + my $conf = $class->load_config($vmid);
>>> + PVE::GuestHelpers::exec_hookscript($conf, $vmid, "pre-snapshot", 1);
>>> + };
>>> + my $err = $@;
>>> +
>>> + PVE::Cluster::cfs_update();
>>> +
>>> + if ($err) {
>>> + $class->remove_lock($vmid, 'snapshot');
>>> + die $err;
>>> + }
>>> +
>>
>> I wonder if we don't also want to call the 'failed-snapshot' phase when just the
>> pre-snapshot invocation failed? might be possible to combine the error handling
>> then, although I am not sure it makes it more readable if combined..
>>
>
> I thought about it, but I figured that if the user dies in his Perl
> script, he should be able to run any cleanup code before that. This
> doesn't consider any problems in the hookscript unforeseen by the user
> though, so I think your approach is better, since it is easier to use.
> This places less burden on the author of the hookscript. Might make the
> code a bit more convoluted though (depending on how we want to handle
> errors in failed-snapshot), but the upsides are way better imo.
>
> One thing that would be easier if the user did his cleanup in
> pre-snapshot is that the pre-snapshot hook knows exactly what
> failed there, so the cleanup code could use that information to
> skip certain steps. But again, it assumes that pre-snapshot will
> properly handle any possible error, which might be a bit much to assume.
yes, there is always the question of whether the hookscript does (proper) error
handling.. but if it does, an additional call to failed-snapshot shouldn't hurt
since that should be covered as well (in this case, if we pass the information
that prepare failed, it could be a no-op for example).
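To illustrate, a minimal hookscript sketch of that no-op case (assuming the
PVE_SNAPSHOT_PHASE variable discussed further down ends up carrying the
sub-phase in which the failure happened - the fsfreeze path is just a made-up
example):

#!/usr/bin/perl
use strict;
use warnings;

my ($vmid, $phase) = @ARGV; # hookscripts get guest ID and phase as arguments

if (($phase // '') eq 'failed-snapshot') {
    # hypothetical: the sub-phase in which snapshotting failed
    my $failed_in = $ENV{PVE_SNAPSHOT_PHASE} // 'unknown';
    if ($failed_in eq 'prepare') {
        # pre-snapshot already cleaned up after itself before dying
        exit 0;
    }
    # actual "snapshotting didn't work" case: undo what pre-snapshot set up,
    # e.g. unfreeze a filesystem that pre-snapshot froze (path is made up)
    system('fsfreeze', '--unfreeze', '/mnt/appdata') == 0
        or warn "unfreeze failed for VM $vmid\n";
}

exit 0;
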
>>> +
>>> + if (my $err = $@) {
>>> + my $conf = $class->load_config($vmid);
>>> + PVE::GuestHelpers::exec_hookscript($conf, $vmid, "failed-snapshot");
>>
>> this exec_hookscript needs to be inside an eval {}, with warn in case it fails..
>
> Isn't this already handled by the exec_hookscript function, since I am
> not passing $stop_on_error? It should warn instead of die
> then. Maybe I am misunderstanding something.
>
> See:
> https://git.proxmox.com/?p=pve-guest-common.git;a=blob;f=src/PVE/GuestHelpers.pm;h=b4ccbaa73a3fd08ba5d34350ebd57ee31355035b;hb=HEAD#l125
ah yeah, missed that - thanks for pointing it out :)
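For the archives, the relevant behaviour boils down to roughly this - a
paraphrased sketch, not the verbatim PVE::GuestHelpers code (which resolves
the script from $conf instead of taking a path directly):

sub exec_hookscript_sketch {
    my ($script, $vmid, $phase, $stop_on_error) = @_;

    eval {
        # run the hookscript with guest ID and phase as arguments
        system($script, $vmid, $phase) == 0
            or die "hookscript exited with non-zero status\n";
    };
    if (my $err = $@) {
        my $msg = "hookscript error for $vmid on $phase: $err";
        die $msg if $stop_on_error; # caller asked for a hard failure
        warn $msg;                  # default: warn and continue
    }
}
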
>>
>> also, this call here happens when preparing for making the snapshot, after
>> possibly saving the VM state, but before taking the volume snapshots..
>>
>
> This should be alleviated by the envvars you proposed below, because
> then we could pass that information to the hookscript and the user
> decides what to do with this information, right?
yes exactly, this is part of the issue that could be solved by passing more
information to the hook script.
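e.g. the failed-snapshot phase could then check how far things got before
deciding what to roll back - rough sketch, both variable names are made up
for illustration, nothing like this exists yet:

use strict;
use warnings;

my $vmstate_saved = $ENV{PVE_SNAPSHOT_VMSTATE_SAVED} // 0;
my @snapshotted   = split(/,/, $ENV{PVE_SNAPSHOT_VOLUMES} // '');

if ($vmstate_saved && !@snapshotted) {
    # failed after saving the vmstate, before any volume snapshot was taken
    warn "only the state-related preparation needs to be undone\n";
} elsif (@snapshotted) {
    # some volume snapshots exist; snapshot_delete will clean those up again
    warn "partial volume snapshots: @snapshotted\n";
}
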
>>> + PVE::Cluster::cfs_update();
>>> +
>>> + $class->remove_lock($vmid, 'snapshot');
>>> + die $err;
>>> + }
>>>
>>> return $snap;
>>> }
>>> @@ -837,11 +867,18 @@ sub snapshot_create {
>>>
>>> if ($err) {
>>> warn "snapshot create failed: starting cleanup\n";
>>> +
>>> + PVE::GuestHelpers::exec_hookscript($conf, $vmid, "failed-snapshot");
>>
>> eval + warn as well
>
> see above
indeed
>>
>> this call here happens when the volume snapshots might or might not have been
>> created already (depending on what exactly the error cause is).
>>
>
> same here - should be alleviated by adding envvars, right?
>
>>> + PVE::Cluster::cfs_update();
>>> +
>>> eval { $class->snapshot_delete($vmid, $snapname, 1, $drivehash); };
>>> warn "$@" if $@;
>>> die "$err\n";
>>> }
>>>
>>> + PVE::GuestHelpers::exec_hookscript($conf, $vmid, "post-snapshot");
>>
>> and here we have a similar issue (no eval): what should happen if post-snapshot
>> fails?
>>
>> A die immediately (very likely wrong, current)
>> B eval + warn but proceed with commit (possibly leaving leftover hook changes around)
>> C eval + warn, call failed-snapshot but proceed with commit (gives the
>> hookscript a chance to cleanup, but how does it differentiate between the
>> different failed-snapshot call sites?)
>> D eval + delete snapshot (seems suboptimal)
>> E eval + call failed-snapshot + delete snapshot (same, and also the issue of the
>> hookscript being able to know what's going on again)
>>
>> B and C seem most sensible to me, but C adds to the issue of "missing
>> failed-snapshot context", depending on what the hookscript is doing..
>
> again, see above - I think it currently actually behaves like B because
> of how exec_hookscript works if I understand correctly.
yes. but the question is whether that is good ;) I guess switching to C would
require passing stop_on_error and wrapping in eval..
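i.e. something along the lines of the following in snapshot_create - just a
sketch of variant C, untested:

    eval {
        PVE::GuestHelpers::exec_hookscript($conf, $vmid, "post-snapshot", 1);
    };
    if (my $hook_err = $@) {
        warn "post-snapshot hook failed: $hook_err";
        # give the hookscript a chance to clean up, but still commit the snapshot
        PVE::GuestHelpers::exec_hookscript($conf, $vmid, "failed-snapshot");
    }
    PVE::Cluster::cfs_update();
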
> It's a similar idea to not running the failed-snapshot hook if pre-snapshot
> fails. I thought that the user should be aware that his hookscript
> failed at some point and run possible cleanup code before returning. As
> I said above that's probably a worse idea than just running
> failed-snapshot. It also enables the user to just have all the cleanup
> handled by failed-snapshot instead of having to add it to pre/post/failed.
As mentioned above, doing

    if !pre {
        failed
    }
    if !snapshot {
        failed
    }
    if !post {
        failed
    }

doesn't stop the user from handling all errors in the phase itself, and then
doing `exit 1` to fail the hookscript, with the failed phase only handling
actual "snapshotting didn't work" errors - as long as we pass along *why* we are
calling the failed phase.
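A hookscript that prefers to do its cleanup inside the phase itself could
then still look like this (continuing the hookscript sketch from above,
$phase comes from @ARGV; the my-app-ctl commands are made up):

    if ($phase eq 'pre-snapshot') {
        if (system('my-app-ctl', 'quiesce') != 0) {
            # handle the error inside the phase itself ...
            warn "quiesce failed, reverting partial preparation\n";
            system('my-app-ctl', 'resume');
            # ... and still fail the hook so the snapshot is aborted; the
            # failed phase then only handles real "snapshotting didn't work"
            # errors
            exit 1;
        }
    }
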
>> one way to pass information is via the environment; we do that for the migration
>> case already (setting PVE_MIGRATED_FROM, so that the pre-start/post-start
>> hookscript can know the start happens in a migration context, and where to
>> (possibly) find the guest config)..
>>
>> for example, we could set PVE_SNAPSHOT_PHASE here, and have prepare/commit/post
>> as sub-phases, or even pass a list of volumes already snapshotted (or created,
>> in case of vmstate), or something along those lines.
>
> That's a good idea, I'll look into sensible values for
> PVE_SNAPSHOT_PHASE as well as look into how we could pass the
> information about volumes to the hookscript best.
>
>> obviously setting the environment is only allowed in a forked worker context,
>> else it would affect the next API endpoint handled by the pveproxy/pvedaemon/..
>> process, so it might be worth double-checking and cleaning up to avoid
>> side-effects with replication/migration/.. if we go down that route..
>>
>
> very good remark - thanks. I would not have thought of it even though it
> is kinda obvious now that you've pointed it out.
well, I just realized that with replication/migration we are not doing a full
guest snapshot anyway, so that should be irrelevant ;) That just leaves container
backups in snapshot mode where we need to be careful to clean the env so that
the next container's hookscript execution in a single vzdump job doesn't see
wrong information.
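Localizing the variable around the hook invocation should be enough to keep
it contained - rough sketch, assuming this runs inside the forked worker:

    {
        # scope the extra information to this single hook invocation; 'local'
        # restores the previous value even if exec_hookscript dies
        local $ENV{PVE_SNAPSHOT_PHASE} = 'prepare'; # hypothetical variable
        PVE::GuestHelpers::exec_hookscript($conf, $vmid, "pre-snapshot", 1);
    }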