[pve-devel] [PATCH v3 qemu-server 11/11] qcow2: add external snapshot support
Fabian Grünbichler
f.gruenbichler at proxmox.com
Mon Jan 13 14:27:03 CET 2025
> DERUMIER, Alexandre <alexandre.derumier at groupe-cyllene.com> hat am 13.01.2025 11:08 CET geschrieben:
>
>
> > Alexandre Derumier via pve-devel <pve-devel at lists.proxmox.com> hat am
> > 16.12.2024 10:12 CET geschrieben:
>
> >>it would be great if there'd be a summary of the design choices and a
> >>high level summary of what happens to the files and block-node-graph
> >>here. it's a bit hard to judge from the code below whether it would
> >>be possible to eliminate the dynamically named block nodes, for
> >>example ;)
>
> yes, sorry, I'll add more infos with qemu limitations and why I'm doing
> it like this.
>
> >>a few more comments documenting the behaviour and ideally also some
> >>tests (mocking the QMP interactions?) would be nice
> yes, I'll add tests, need to find a good way to mock it.
>
>
> > +
> > + #preallocate add a new current file
> > + my $new_current_fmt_nodename = get_blockdev_nextid("fmt-
> > $deviceid", $nodes);
> > + my $new_current_file_nodename = get_blockdev_nextid("file-
> > $deviceid", $nodes);
>
> >>okay, so here we have a dynamic node name because the desired target
> >>name is still occupied. could we rename the old block node first?
>
> we can't rename a node, that's the problem.
>
>
> > + PVE::Storage::volume_snapshot($storecfg, $volid, $snap);
>
> >>(continued from above) and this invocation here are the same??
> The invocation is the same, but they it's not doing the same if it's an
> external snasphot.
>
> >> wouldn't this already create the snapshot on the storage layer?
> yes, it's create the (lvm volume) + qcow2 file with preallocation
>
> >>and didn't we just hardlink + reopen + unlink to transform the
> >>previous current volume into the snap volume?
> yes.
> here, we are creating the new current volume, adding it to the graph
> with blockdev-add, then finally switch to it with blockdev-snapshot
>
> The ugly trick in pve-storage is in plugin.pm
> #rename current volume to snap volume
> rename($path, $snappath) if -e $path && !-e $snappath;
> or in lvm plugin.
> eval { lvrename($vg, $volname, $snap_volname) } ;
>
>
> (and you have already made comment about them ;)
>
>
> because I'm re-using volume_snapshot (I didn't have to add a new method
> in pve-storage) to allocate the snasphot file, but indeed, we have
> already to the rename online.
>
>
> >>should this maybe have been vdisk_alloc and it just works by
> accident?
> It's not works fine with vdisk_alloc, because the volume need to be
> created without the size specified but with backing file param instead.
> (if I remember, qemu-img is looking at the backing file size+metadas
> and set the correct total size for the new volume)
I am not sure I follow.. we create a snapshot, but then we pretend it isn't a file with backing file when passing it to qemu? this seems wrong.. IMHO we should just allocate (+format) here, and then let qemu do the backing link up instead of this confusing (and error-prone, as it masks problems that should be a hard error!) call..
> Maybe a better way could be to reuse vdisk_alloc, and add backing file
> as param ?
that seems.. wrong as well? the file can never be bigger just because it has a backing file right? why can't we just allocate and format the regular volume?
> > + my $new_file_blockdev = generate_file_blockdev($storecfg,
> > $drive, $new_current_file_nodename);
> > + my $new_fmt_blockdev = generate_format_blockdev($storecfg,
> > $drive, $new_current_fmt_nodename, $new_file_blockdev);
> > +
> > + $new_fmt_blockdev->{backing} = undef;
>
> >>generate_format_blockdev doesn't set backing?
> yes, it's adding backing
>
> >>maybe this should be >>converted into an assertion?
>
> but they are a limitation of the qmp blockdev-ad ++blockdev-snapshot
> where the backing attribute need undef in the blockdev-add or the
> blockdev-snapshot will fail because it's trying itself to set the
> backing file when doing the switch.
>
> From my test, it was related to this
> https://lists.gnu.org/archive/html/qemu-block/2019-10/msg01404.html
yeah, what I mean is:
generate_format_blockdev doesn't (and should never) set backing. so setting it to undef has no effect. but we might want to assert that it *is* undef, so that if we ever mistakenly change generate_format_blockdev to set backing, it will be caught instead of silently being papered over..
> > + PVE::QemuServer::Monitor::mon_cmd($vmid, 'blockdev-add',
> > %$new_fmt_blockdev);
> > + mon_cmd($vmid, 'blockdev-snapshot', node => $format_nodename,
> > overlay => $new_current_fmt_nodename);
> > +}
> > +
> > +sub blockdev_snap_rename {
> > + my ($storecfg, $vmid, $deviceid, $drive, $src_path,
> > $target_path) = @_;
>
> >>I think this whole thing needs more error handling and thought about
> >>how to recover from various points failing..
> yes, that's the problem with renaming, it's not atomic.
>
> Also, if we need to recover (rollback), how to manage multiple disk ?
in general, a rollback with multiple disks that fails half-way through can only be recovered using another rollback, and that only works if the half-rollback is idempotent? another reason to avoid the need for renames wherever possible ;)
> >>there's also quite some overlap with blockdev_current_rename, I
> >>wonder whether it would be possible to simplify the code further by
> >merging the two? but see below, I think we can even get away with
> >>dropping this altogether if we switch from block-commit to block-
> >>stream..
> Yes, I have seperated them because I was not sure of the different
> workflow, and It was more simplier to fix one method without breaking
> the other.
>
> I'll look to implement block-stream. (and keep commit to initial image
> for the last snapshot delete)
>
>
> > + #untaint
> > + if ($src_path =~ m/^(\S+)$/) {
> > + $src_path = $1;
> > + }
> > + if ($target_path =~ m/^(\S+)$/) {
> > + $target_path = $1;
> > + }
>
> >>shouldn't that have happened in the storage plugin?
> >>
> > +
> > + #create a hardlink
> > + link($src_path, $target_path);
>
> >>should this maybe be done by the storage plugin?
>
> This was to avoid to introduce a sub method, but yes, it could be
> better indeed.
>
> PVE::Storage::link ?
the issue is that these are all storage-specific things done in qemu-server, and they should be done by the storage plugin, else a third-party plugin can never support external storages.. so we'd need to find the right level of abstraction to tell the storage plugin what to do when, and then the storage plugin can decide how it exposes both snapshot names with the same underlying snapshot (for a while)..
More information about the pve-devel
mailing list