[pve-devel] [PATCH v3 qemu-server 11/11] qcow2: add external snapshot support

Mon Jan 13 11:08:27 CET 2025

> Alexandre Derumier via pve-devel <pve-devel at lists.proxmox.com> wrote on
> 16.12.2024 10:12 CET:

>>it would be great if there'd be a summary of the design choices and a
>>high level summary of what happens to the files and block-node-graph
>>here. it's a bit hard to judge from the code below whether it would
>>be possible to eliminate the dynamically named block nodes, for
>>example ;)

Yes, sorry. I'll add more information about the qemu limitations and
why I'm doing it like this.

>>a few more comments documenting the behaviour and ideally also some
>>tests (mocking the QMP interactions?) would be nice
Yes, I'll add tests; I need to find a good way to mock the QMP
interactions.

> +
> +    #preallocate add a new current file
> +    my $new_current_fmt_nodename = get_blockdev_nextid("fmt-$deviceid", $nodes);
> +    my $new_current_file_nodename = get_blockdev_nextid("file-$deviceid", $nodes);

>>okay, so here we have a dynamic node name because the desired target
>>name is still occupied. could we rename the old block node first?

We can't rename a node; that's the problem.

> +    PVE::Storage::volume_snapshot($storecfg, $volid, $snap);

>>(continued from above) and this invocation here are the same??
The invocation is the same, but it doesn't do the same thing if it's an
external snapshot.

>> wouldn't this already create the snapshot on the storage layer? 
Yes, it creates the (LVM volume +) qcow2 file with preallocation.

>>and didn't we just hardlink + reopen + unlink to transform the
>>previous current volume into the snap volume?
Yes.
Here, we create the new current volume, add it to the graph with
blockdev-add, then finally switch to it with blockdev-snapshot.

The ugly trick in pve-storage is in plugin.pm:
#rename current volume to snap volume
rename($path, $snappath) if -e $path && !-e $snappath;
or in the LVM plugin:
eval { lvrename($vg, $volname, $snap_volname) } ;

(and you have already commented on them ;)

That's because I'm re-using volume_snapshot to allocate the snapshot
file (so I didn't have to add a new method in pve-storage), but indeed,
we have already done the rename online at that point.

>>should this maybe have been vdisk_alloc and it just works by
>>accident?
It doesn't work with vdisk_alloc, because the volume needs to be
created without a size, but with a backing file parameter instead.
(If I remember correctly, qemu-img looks at the backing file's size and
metadata and sets the correct total size for the new volume.)

Maybe a better way would be to reuse vdisk_alloc and add the backing
file as a parameter?
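
To illustrate the point about the missing size, here is a minimal sketch of the kind of qemu-img invocation meant above. The helper name and the file names are made up for this example and are not from the patch or from pve-storage; preallocation options are left out on purpose.

```perl
use strict;
use warnings;

# Hypothetical helper: build a qemu-img command line for a new overlay.
# No size argument is passed; qemu-img derives the virtual size from the
# backing file.
sub overlay_create_cmd {
    my ($backing_path, $overlay_path) = @_;
    return [
        'qemu-img', 'create',
        '-f', 'qcow2',          # format of the new overlay
        '-F', 'qcow2',          # format of the backing file
        '-b', $backing_path,    # backing file, given instead of a size
        $overlay_path,
    ];
}

# Example file names, assumed for illustration only.
my $cmd = overlay_create_cmd('snap-snap1-vm-100-disk-0.qcow2', 'vm-100-disk-0.qcow2');
print join(' ', @$cmd), "\n";
```

If vdisk_alloc were extended, the backing file would presumably take the place of the size in a similar way.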

> +    my $new_file_blockdev = generate_file_blockdev($storecfg,
> $drive, $new_current_file_nodename);
> +    my $new_fmt_blockdev = generate_format_blockdev($storecfg,
> $drive, $new_current_fmt_nodename, $new_file_blockdev);
> +
> +    $new_fmt_blockdev->{backing} = undef;

>>generate_format_blockdev doesn't set backing? 
Yes, it adds backing.

>>maybe this should be converted into an assertion?

But there is a limitation in the QMP blockdev-add + blockdev-snapshot
sequence: the backing attribute needs to be undef in blockdev-add, or
blockdev-snapshot will fail, because it tries to set the backing file
itself when doing the switch.

From my tests, it was related to this:
https://lists.gnu.org/archive/html/qemu-block/2019-10/msg01404.html
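
As a rough sketch of the two QMP payloads involved: the node names and the file path below are invented for the example, and the real code builds these hashes through generate_file_blockdev / generate_format_blockdev. The only point shown is that a Perl undef for backing ends up as JSON null in the blockdev-add arguments.

```perl
use strict;
use warnings;
use JSON::PP;

# Node names and file path are assumptions for illustration.
my $blockdev_add = {
    driver      => 'qcow2',
    'node-name' => 'fmt-drive-scsi0-new',
    file        => {
        driver      => 'file',
        'node-name' => 'file-drive-scsi0-new',
        filename    => '/path/to/vm-100-disk-0.qcow2',
    },
    # undef is serialized as JSON null: the overlay is opened without its
    # backing file, so blockdev-snapshot can attach the backing node itself.
    backing => undef,
};

my $blockdev_snapshot = {
    node    => 'fmt-drive-scsi0',               # currently active format node
    overlay => $blockdev_add->{'node-name'},    # node added just before
};

my $json = JSON::PP->new->canonical->pretty;
print $json->encode({ execute => 'blockdev-add', arguments => $blockdev_add });
print $json->encode({ execute => 'blockdev-snapshot', arguments => $blockdev_snapshot });
```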

> +    PVE::QemuServer::Monitor::mon_cmd($vmid, 'blockdev-add',
> %$new_fmt_blockdev);
> +    mon_cmd($vmid, 'blockdev-snapshot', node => $format_nodename,
> overlay => $new_current_fmt_nodename);
> +}
> +
> +sub blockdev_snap_rename {
> +    my ($storecfg, $vmid, $deviceid, $drive, $src_path,
> $target_path) = @_;

>>I think this whole thing needs more error handling and thought about
>>how to recover from various points failing.. 
Yes, that's the problem with renaming: it's not atomic.

Also, if we need to recover (rollback), how do we manage multiple disks?

>>there's also quite some overlap with blockdev_current_rename, I
>>wonder whether it would be possible to simplify the code further by
>>merging the two? but see below, I think we can even get away with
>>dropping this altogether if we switch from block-commit to
>>block-stream..
Yes, I separated them because I was not sure about the different
workflows, and it was simpler to fix one method without breaking the
other.

I'll look into implementing block-stream (and keep commit into the
initial image for deleting the last snapshot).

> +    #untaint
> +    if ($src_path =~ m/^(\S+)$/) {
> +        $src_path = $1;
> +    }
> +    if ($target_path =~ m/^(\S+)$/) {
> +        $target_path = $1;
> +    }

>>shouldn't that have happened in the storage plugin?
>>
> +
> +    #create a hardlink
> +    link($src_path, $target_path);

>>should this maybe be done by the storage plugin?

This was to avoid introducing a new sub/method, but yes, that would
indeed be better.

PVE::Storage::link  ?

> 
> +    #delete old $path link
> +    unlink($src_path);

And this:

PVE::Storage::unlink  ?
(We can't use free_image here, because we really want to remove only
the link and not the volume.)
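
For what it's worth, a checked version of the link + unlink step could look roughly like the sketch below. The helper name relink_path and the file names are hypothetical (no such method exists in PVE::Storage today), and it is exercised on a temporary file rather than on a real volume.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: give an image a second name, then drop the old one.
# Both steps are checked; the new link is removed again if the unlink fails.
sub relink_path {
    my ($src_path, $target_path) = @_;

    link($src_path, $target_path)
        or die "hardlink $src_path -> $target_path failed: $!\n";

    if (!unlink($src_path)) {
        my $err = $!;
        unlink($target_path);    # best-effort rollback to the original state
        die "unlink $src_path failed: $err\n";
    }
}

# Demo on a scratch directory; the file names are examples only.
my $dir = tempdir(CLEANUP => 1);
my $src = "$dir/vm-100-disk-0.qcow2";
my $target = "$dir/snap-snap1-vm-100-disk-0.qcow2";

open(my $fh, '>', $src) or die "cannot create $src: $!\n";
print $fh "data";
close($fh);

relink_path($src, $target);
print "renamed via hardlink\n" if !-e $src && -e $target;
```

This still isn't atomic across the two steps, but at least each failure is reported and the first one is undone.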

> +
> +    #rename underlay
> +    my $storage_name = PVE::Storage::parse_volume_id($volid);
> +    my $scfg = $storecfg->{ids}->{$storage_name};
> +    if ($scfg->{type} eq 'lvm') {
> +        print"lvrename $src_path to $target_path\n";
> +        run_command(
> +            ['/sbin/lvrename', $src_path, $target_path],
> +            errmsg => "lvrename $src_path to $target_path error",
> +        );
> +    }

>>and this as well?
I didn't reuse lvrename from the LVM plugin because it takes
vgname/lvname and not the path, but I can look into extending it.

> +}
> +
> +sub blockdev_current_rename {
> +    my ($storecfg, $vmid, $deviceid, $drive, $path, $target_path,
> $skip_underlay) = @_;
> +    ## rename current running image
> +
> +    my $nodes = get_blockdev_nodes($vmid);
> +    my $volid = $drive->{file};
> +    my $target_file_nodename = get_blockdev_nextid("file-$deviceid",
> $nodes);

>>here we could already incorporate the snapshot name, since we know
>>it?

There is the 31-character limit on node names.
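
To make the length problem concrete, here is a small sketch. The naming scheme and the helper node_name_for are purely hypothetical and not what the patch does: it keeps the readable name when it fits and otherwise falls back to a fixed-length name derived from a hash of the full name.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

my $MAX_NODE_NAME_LEN = 31;    # node-name length limit discussed above

# Hypothetical scheme: "<prefix>-<deviceid>-<snap>" if it fits, otherwise
# "<prefix>-" followed by 26 hex digits of a SHA-256 over the full name.
sub node_name_for {
    my ($prefix, $deviceid, $snap) = @_;
    my $name = "$prefix-$deviceid-$snap";
    return $name if length($name) <= $MAX_NODE_NAME_LEN;
    return $prefix . '-' . substr(sha256_hex($name), 0, 26);
}

# Short snapshot name: fits as-is. Long one: replaced by the hashed form.
print node_name_for('fmt', 'drive-scsi0', 'snap1'), "\n";
print node_name_for('file', 'drive-scsi0', 'before-the-big-upgrade-2025'), "\n";
```

With prefixes of at most four characters ("fmt", "file") the hashed form stays within 31 characters, at the cost of no longer being human-readable.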

> +
> +    my $file_blockdev = generate_file_blockdev($storecfg, $drive,
> $target_file_nodename);
> +    $file_blockdev->{filename} = $target_path;
> +
> +    my $format_node = find_blockdev_node($nodes, $path, 'fmt');

>>then we'd know this is always the "current" node, however we
>>deterministically name it?

Only until you do a block-mirror: then the current fmt node is replaced
with another (current2) fmt node.

>>and this should be done by the storage layer I think? how does this
>>interact with LVM?
From my tests, a hardlink works.

>>would we maybe want to mknod instead of hardlinking the
>>device node?

No, because /dev/<vgname>/<lv> is not a device node; it's already a
link to the device node.

for example:
lrwxrwxrwx  1 root root    7 Dec 10 00:11 vm-10001-disk-0 -> ../dm-9

#ln vm-10001-disk-0 testrename

lrwxrwxrwx  1 root root    7 Dec 10 00:11 vm-10001-disk-0 -> ../dm-9
lrwxrwxrwx  1 root root    7 Dec 10 00:11 testrename -> ../dm-9

>>did you try whether a plain rename would also work (not sure - qemu
>>already has an open FD to the file/blockdev, but I am not sure how
>>LVM handles this ;))?

From my tests, lvrename simply renames the LVM volume internally, then
renames the link. And as we have already created the link, it simply
renames it without a problem.

#lvrename vm-10001-disk-0  vm-10001-disk-snap1

lrwxrwxrwx  1 root root    7 Dec 10 00:11 vm-10001-disk-snap1 -> ../dm-9
lrwxrwxrwx  1 root root    7 Dec 10 00:11 testrename -> ../dm-9

#lvrename vm-10001-disk-snap1 testrename

lrwxrwxrwx  1 root root    7 Dec 10 00:11 testrename -> ../dm-9

> 
> +
> +sub blockdev_commit {

>>see comments below for qemu_volume_snapshot_delete, I think this..

>>and this can be replaced altogether with blockdev_stream..

>>wouldn't it make more sense to use block-stream to merge the contents
>>of the to-be-deleted snapshot into the current overlay? that way we
>>wouldn't need to rename anything, AFAICT..

>>same here, instead of committing from the child into the
>>to-be-deleted snapshot, and then renaming, why not just block-stream
>>from the to-be-deleted snapshot into the child, and then discard the
>>snapshot that is no longer needed?

>>commit is the wrong direction though?
>>
>>if we have A -> B -> C, and B is deleted, the delta previously
>>contained in B should be merged into C, not into A?
>>
>>so IMHO a simple block-stream + removal of the to-be-deleted snapshot
>>should be the right choice here as well?
>>
>>that would effectively make all the paths identical AFAICT (stream
>>from to-be-deleted snapshot to child, followed by deletion of the no
>>longer used volume corresponding to the deleted/streamed snapshot)
>>and no longer require any renaming..

Yes, got it now. I'll implement block-stream.
But I'll keep commit for deleting the last snapshot.
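
To check my understanding of the A -> B -> C case, here is a toy model in plain Perl. It involves no QMP at all; the image names and the cluster contents are invented. It only shows that streaming B into C leaves A untouched, keeps what the guest reads from C unchanged, and needs no rename.

```perl
use strict;
use warnings;

# Toy model: each image maps cluster index -> data and names its backing image.
my %images = (
    A => { backing => undef, data => { 0 => 'a0', 1 => 'a1', 2 => 'a2' } },
    B => { backing => 'A',   data => { 1 => 'b1' } },
    C => { backing => 'B',   data => { 2 => 'c2' } },
);

# Read a cluster the way the guest would: the topmost image that has it wins.
sub read_cluster {
    my ($name, $cluster) = @_;
    for (my $img = $name; defined($img); $img = $images{$img}->{backing}) {
        my $data = $images{$img}->{data};
        return $data->{$cluster} if exists($data->{$cluster});
    }
    return undef;
}

# Model of streaming the direct backing image into $top, then dropping it.
sub stream_into {
    my ($top) = @_;
    my $mid = $images{$top}->{backing} // die "$top has no backing image\n";
    for my $cluster (keys %{ $images{$mid}->{data} }) {
        # data already present in the overlay is newer and must win
        $images{$top}->{data}->{$cluster} //= $images{$mid}->{data}->{$cluster};
    }
    $images{$top}->{backing} = $images{$mid}->{backing};
    delete $images{$mid};
}

my @before = map { read_cluster('C', $_) } 0 .. 2;
stream_into('C');    # delete snapshot B from A -> B -> C
my @after = map { read_cluster('C', $_) } 0 .. 2;

print "before: @before\nafter:  @after\n";
```

Both reads give the same content, and C keeps its name throughout, which is why the rename step can go away.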