[pve-devel] [PATCH v3 qemu-server 11/11] qcow2: add external snapshot support
DERUMIER, Alexandre
alexandre.derumier at groupe-cyllene.com
Mon Jan 13 11:08:27 CET 2025
> Alexandre Derumier via pve-devel <pve-devel at lists.proxmox.com> wrote on
> 16.12.2024 10:12 CET:
>>it would be great if there'd be a summary of the design choices and a
>>high level summary of what happens to the files and block-node-graph
>>here. it's a bit hard to judge from the code below whether it would
>>be possible to eliminate the dynamically named block nodes, for
>>example ;)
Yes, sorry, I'll add more information about the QEMU limitations and why
I'm doing it this way.
>>a few more comments documenting the behaviour and ideally also some
>>tests (mocking the QMP interactions?) would be nice
Yes, I'll add tests; I need to find a good way to mock the QMP interactions.
> +
> + #preallocate add a new current file
> + my $new_current_fmt_nodename = get_blockdev_nextid("fmt-$deviceid", $nodes);
> + my $new_current_file_nodename = get_blockdev_nextid("file-$deviceid", $nodes);
>>okay, so here we have a dynamic node name because the desired target
>>name is still occupied. could we rename the old block node first?
we can't rename a node, that's the problem.
> + PVE::Storage::volume_snapshot($storecfg, $volid, $snap);
>>(continued from above) and this invocation here are the same??
The invocation is the same, but it doesn't do the same thing if it's an
external snapshot.
>> wouldn't this already create the snapshot on the storage layer?
Yes, it creates the (LVM volume +) qcow2 file with preallocation.
>>and didn't we just hardlink + reopen + unlink to transform the
>>previous current volume into the snap volume?
yes.
Here, we are creating the new current volume, adding it to the graph
with blockdev-add, then finally switching to it with blockdev-snapshot.
The ugly trick in pve-storage is in plugin.pm:
#rename current volume to snap volume
rename($path, $snappath) if -e $path && !-e $snappath;
or in lvm plugin.
eval { lvrename($vg, $volname, $snap_volname) } ;
(and you have already commented on them ;)
This is because I'm reusing volume_snapshot to allocate the snapshot file
(so I didn't have to add a new method in pve-storage), but indeed, we have
already done the rename online by that point.
>>should this maybe have been vdisk_alloc and it just works by
>>accident?
It doesn't work with vdisk_alloc, because the volume needs to be
created without a size specified, but with a backing file param instead.
(If I remember correctly, qemu-img looks at the backing file size +
metadata and sets the correct total size for the new volume.)
Maybe a better way would be to reuse vdisk_alloc and add the backing file
as a param?
> + my $new_file_blockdev = generate_file_blockdev($storecfg, $drive, $new_current_file_nodename);
> + my $new_fmt_blockdev = generate_format_blockdev($storecfg, $drive, $new_current_fmt_nodename, $new_file_blockdev);
> +
> + $new_fmt_blockdev->{backing} = undef;
>>generate_format_blockdev doesn't set backing?
Yes, it does set backing,
>>maybe this should be converted into an assertion?
but there is a limitation in the QMP blockdev-add + blockdev-snapshot
sequence: the backing attribute needs to be undef in the blockdev-add, or
the blockdev-snapshot will fail, because it tries to set the backing file
itself when doing the switch.
From my tests, it is related to this:
https://lists.gnu.org/archive/html/qemu-block/2019-10/msg01404.html
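To make the limitation above concrete, here is a minimal sketch of the two
QMP payloads, serialized with the core JSON::PP module. The node names and
the file path are made up for illustration, and the nested layout of the
format node is an assumption, not necessarily what generate_format_blockdev
returns. The only point is that a Perl undef for "backing" becomes JSON
null in the blockdev-add arguments.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical node names and path, for illustration only.
my $file_blockdev = {
    driver      => 'file',
    filename    => '/path/to/new-current.qcow2',
    'node-name' => 'file-drive-scsi0-2',
};
my $fmt_blockdev = {
    driver      => 'qcow2',
    file        => $file_blockdev,
    'node-name' => 'fmt-drive-scsi0-2',
    # undef is serialized as JSON null, so the overlay is added without
    # its backing chain; blockdev-snapshot attaches the backing node later.
    backing     => undef,
};

my $json = JSON::PP->new->canonical(1);

# Step 1: add the new overlay node to the graph.
my $add_cmd = $json->encode({ execute => 'blockdev-add', arguments => $fmt_blockdev });

# Step 2: switch to it; 'node' becomes the backing node of 'overlay'.
my $snap_cmd = $json->encode({
    execute   => 'blockdev-snapshot',
    arguments => { node => 'fmt-drive-scsi0', overlay => 'fmt-drive-scsi0-2' },
});

print "$add_cmd\n$snap_cmd\n";
```

In the patch itself the same effect comes from setting
$new_fmt_blockdev->{backing} = undef before calling mon_cmd.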
> + PVE::QemuServer::Monitor::mon_cmd($vmid, 'blockdev-add', %$new_fmt_blockdev);
> + mon_cmd($vmid, 'blockdev-snapshot', node => $format_nodename, overlay => $new_current_fmt_nodename);
> +}
> +
> +sub blockdev_snap_rename {
> + my ($storecfg, $vmid, $deviceid, $drive, $src_path, $target_path) = @_;
>>I think this whole thing needs more error handling and thought about
>>how to recover from various points failing..
Yes, that's the problem with renaming: it's not atomic.
Also, if we need to recover (roll back), how do we manage multiple disks?
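As a starting point for that discussion, a minimal sketch of the hardlink +
unlink sequence with error checks and a rollback of the hardlink. The helper
name rename_with_rollback and the $reopen callback are hypothetical and not
part of the patch. The sketch covers a single disk only and does not revert
a reopen that has already succeeded inside QEMU.

```perl
use strict;
use warnings;

# Hypothetical helper: create the new name, run the caller-supplied reopen
# step, then drop the old name. If anything after link() fails, the new
# name is removed again so the old path stays the only one.
sub rename_with_rollback {
    my ($src_path, $target_path, $reopen) = @_;

    link($src_path, $target_path)
        or die "link $src_path -> $target_path failed: $!\n";

    eval {
        # e.g. the QMP reopen of the block node on the new path
        $reopen->($target_path) if $reopen;
        unlink($src_path) or die "unlink $src_path failed: $!\n";
    };
    if (my $err = $@) {
        # roll back: the old name is still in place, remove the new one
        unlink($target_path)
            or warn "rollback: unlink $target_path failed: $!\n";
        die "rename aborted: $err";
    }
    return 1;
}
```

For multiple disks, one option would be to collect the renames that
succeeded and undo them in reverse order on the first failure, but that is
only an idea at this point.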
>>there's also quite some overlap with blockdev_current_rename, I
>>wonder whether it would be possible to simplify the code further by
>>merging the two? but see below, I think we can even get away with
>>dropping this altogether if we switch from block-commit to block-
>>stream..
Yes, I separated them because I was not sure about the different
workflows, and it was simpler to fix one method without breaking
the other.
I'll look into implementing block-stream (and keep commit to the initial
image for deleting the last snapshot).
> + #untaint
> + if ($src_path =~ m/^(\S+)$/) {
> + $src_path = $1;
> + }
> + if ($target_path =~ m/^(\S+)$/) {
> + $target_path = $1;
> + }
>>shouldn't that have happened in the storage plugin?
>>
> +
> + #create a hardlink
> + link($src_path, $target_path);
>>should this maybe be done by the storage plugin?
This was to avoid introducing a new sub, but yes, it would indeed be
better.
PVE::Storage::link ?
>
> + #delete old $path link
> + unlink($src_path);
And this:
PVE::Storage::unlink ?
(We can't use free_image here, because we really want to remove the link
and not the volume.)
> +
> + #rename underlay
> + my $storage_name = PVE::Storage::parse_volume_id($volid);
> + my $scfg = $storecfg->{ids}->{$storage_name};
> + if ($scfg->{type} eq 'lvm') {
> + print"lvrename $src_path to $target_path\n";
> + run_command(
> + ['/sbin/lvrename', $src_path, $target_path],
> + errmsg => "lvrename $src_path to $target_path error",
> + );
> + }
>>and this as well?
(I didn't reuse lvrename from the LVM plugin because it uses vgname/lvname
and not the path, but I can look into extending it.)
> +}
> +
> +sub blockdev_current_rename {
> + my ($storecfg, $vmid, $deviceid, $drive, $path, $target_path, $skip_underlay) = @_;
> + ## rename current running image
> +
> + my $nodes = get_blockdev_nodes($vmid);
> + my $volid = $drive->{file};
> + my $target_file_nodename = get_blockdev_nextid("file-$deviceid", $nodes);
>>here we could already incorporate the snapshot name, since we know
>>it?
No: QEMU node names are limited to 31 characters.
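To illustrate the constraint, a minimal sketch of a helper that uses a
readable name when it fits and otherwise falls back to a truncated digest.
This is only one possible workaround and not what the patch does; the
helper name snapshot_node_name is hypothetical. Truncating a digest also
makes collisions possible in principle, so a real implementation would
still have to check the name against the existing nodes.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Maximum node-name length accepted by QEMU, as mentioned above.
my $NODE_NAME_MAX = 31;

# Hypothetical helper: "<prefix>-<deviceid>-<snapname>" if it fits,
# otherwise "<prefix>-<hex digest>" cut to the maximum length.
sub snapshot_node_name {
    my ($prefix, $deviceid, $snapname) = @_;

    my $name = "$prefix-$deviceid-$snapname";
    return $name if length($name) <= $NODE_NAME_MAX;

    # deterministic fallback for long snapshot names
    my $digest = sha256_hex("$deviceid-$snapname");
    return substr("$prefix-$digest", 0, $NODE_NAME_MAX);
}

my $short = snapshot_node_name('fmt', 'drive-scsi0', 'snap1');
my $long  = snapshot_node_name('fmt', 'drive-scsi0', 'before-upgrade-to-version-9');
print "$short\n$long\n";
```

The digest form is deterministic, but it is no longer human-readable, which
loses most of the benefit of putting the snapshot name into the node name.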
> +
> + my $file_blockdev = generate_file_blockdev($storecfg, $drive, $target_file_nodename);
> + $file_blockdev->{filename} = $target_path;
> +
> + my $format_node = find_blockdev_node($nodes, $path, 'fmt');
>>then we'd know this is always the "current" node, however we
>>deterministically name it?
Only until you do a block-mirror: after that, the current fmt node is
replaced with another (current2) fmt node.
>>and this should be done by the storage layer I think? how does this
>>interact with LVM?
From my tests, a hardlink works.
>>would we maybe want to mknod instead of hardlinking the
>>device node?
No, because /dev/<vgname>/<lv> is not a device node; it's already a
symlink to the device node.
for example:
lrwxrwxrwx 1 root root 7 Dec 10 00:11 vm-10001-disk-0 -> ../dm-9
#ln vm-10001-disk-0 testrename
lrwxrwxrwx 1 root root 7 Dec 10 00:11 vm-10001-disk-0 -> ../dm-9
lrwxrwxrwx 1 root root 7 Dec 10 00:11 testrename -> ../dm-9
>>did you try whether a plain rename would also work (not sure - qemu
>>already has an open FD to the file/blockdev, but I am not sure how
>>LVM handles this ;))?
From my tests, lvrename simply renames the LVM volume internally and then
renames the link; as we have already created the link, it simply renames
it without problems.
#lvrename vm-10001-disk-0 vm-10001-disk-snap1
lrwxrwxrwx 1 root root 7 Dec 10 00:11 vm-10001-disk-snap1 -> ../dm-9
lrwxrwxrwx 1 root root 7 Dec 10 00:11 testrename -> ../dm-9
#lvrename vm-10001-disk-snap1 testrename
lrwxrwxrwx 1 root root 7 Dec 10 00:11 testrename -> ../dm-9
>
> +
> +sub blockdev_commit {
>>see comments below for qemu_volume_snapshot_delete, I think this..
>>and this can be replaced altogether with blockdev_stream..
>>wouldn't it make more sense to use block-stream to merge the contents
>>of the to-be-deleted snapshot into the current overlay? that way we
>>wouldn't need to rename anything, AFAICT..
>>same here, instead of commiting from the child into the to-be-deleted
>>snapshot, and then renaming, why not just block-stream from the to-
>>be-deleted snapshot into the child, and then discard the snapshot
>>that is no longer needed?
>>commit is the wrong direction though?
>>
>>if we have A -> B -> C, and B is deleted, the delta previously
>>contained in B should be merged into C, not into A?
>>
>>so IMHO a simple block-stream + removal of the to-be-deleted snapshot
>>should be the right choice here as well?
>>
>>that would effectively make all the paths identical AFAICT (stream
>>from to-be-deleted snapshot to child, followed by deletion of the no
>>longer used volume corresponding to the deleted/streamed snapshot)
>>and no longer require any renaming..
Yes, got it now. I'll implement block-stream,
but keep commit for deleting the last snapshot.
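To check my understanding of the A -> B -> C case, a toy model in plain
Perl (no QMP involved, and the cluster contents are made up): each image
maps cluster numbers to data, and reads fall through from the newest image
to the oldest. Streaming B into C before dropping B keeps the guest-visible
content identical and leaves A untouched.

```perl
use strict;
use warnings;

# Read one cluster from a chain given oldest-first: the newest image that
# contains the cluster wins.
sub read_chain {
    my ($cluster, @chain) = @_;
    for my $img (reverse @chain) {
        return $img->{$cluster} if exists $img->{$cluster};
    }
    return undef;
}

# Stream: copy every cluster of $src that $overlay does not already have.
sub stream_into {
    my ($src, $overlay) = @_;
    for my $cluster (keys %$src) {
        $overlay->{$cluster} = $src->{$cluster}
            unless exists $overlay->{$cluster};
    }
    return;
}

my %A = (0 => 'a0', 1 => 'a1');   # oldest snapshot
my %B = (1 => 'b1', 2 => 'b2');   # snapshot to be deleted
my %C = (2 => 'c2');              # child of B (current overlay)

my @before = map { read_chain($_, \%A, \%B, \%C) } 0 .. 2;
stream_into(\%B, \%C);
# B is no longer needed: the chain is now A -> C
my @after = map { read_chain($_, \%A, \%C) } 0 .. 2;

print "guest view unchanged\n" if "@before" eq "@after";
```

Committing B into A instead would overwrite cluster 1 in A, which changes
what the older snapshot A contains, so it only makes sense when there is no
older snapshot left to preserve.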