[pve-devel] [PATCH v3 pve-storage 2/3] lvmplugin: add qcow2 snapshot
DERUMIER, Alexandre
alexandre.derumier at groupe-cyllene.com
Fri Jan 10 11:16:44 CET 2025
>>one downside with this part in particular - we have to always
>>allocate full-size LVs (+qcow2 overhead), even if most of them will
>>end up storing just a single snapshot delta which might be a tiny
>>part of that full-size.. hopefully if discard is working across the
>>whole stack this doesn't actually explode space usage on the storage
>>side, but it makes everything a bit hard to track.. OTOH, while we
>>could in theory extend/reduce the LVs and qcow2 images on them when
>>modifying the backing chain, the additional complexity is probably
>>not worth it at the moment..
See this RFC with dynamic extend (no shrink / no discard yet):
https://lore.proxmox.com/pve-devel/mailman.475.1725007456.302.pve-devel@lists.proxmox.com/t/
(I think the tricky part (as Dominic was quick to point out in his
review) is to handle the resize cluster lock correctly, and to handle
timeout/retry with a queue through a specific daemon.)
But technically, this is how ovirt manages it (and it works in
production; I have customers who have been using it for multiple years).
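To illustrate the general idea (rough sketch only, not the RFC code; the
names and thresholds below are hypothetical): a daemon watches how far the
active qcow2 has been written and grows the LV by a fixed chunk when the
remaining headroom gets too small.

use PVE::Tools qw(run_command);

my $CHUNK     = 1024 * 1024 * 1024;  # grow the LV in 1 GiB steps
my $WATERMARK =  512 * 1024 * 1024;  # extend when less than 512 MiB headroom is left

# $lv_size: current LV size in bytes (e.g. from 'lvs --units b')
# $used:    highest written offset inside the qcow2 (e.g. wr_highest_offset
#           from QMP 'query-blockstats' for a running VM)
sub maybe_extend_lv {
    my ($vg, $lv, $lv_size, $used) = @_;

    return if $lv_size - $used > $WATERMARK;

    # the real series also needs a cluster-wide lock and a timeout/retry
    # queue around this call
    run_command(['/sbin/lvextend', '-L', "+${CHUNK}b", "$vg/$lv"]);
}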
>
> sub plugindata {
> return {
> content => [ {images => 1, rootdir => 1}, { images => 1 }],
> + format => [ { raw => 1, qcow2 => 1 } , 'raw' ],
>>I wonder if we want to guard the snapshotting-related parts below
>>with an additional "snapext" option here as well?
I really don't know; it's not possible to do snapshots with .raw
anyway.
On the GUI side, it could allow enabling/displaying the format field,
for example, if snapext is defined on the storage.
>>or even the usage of qcow2 altogether?
I think we should keep the possibility to choose .raw vs .qcow2 on the
same storage, because a user might really need maximum performance for a
specific VM without needing snapshots.
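For illustration (hypothetical VM config lines, assuming the volname keeps
the format suffix as in this series): scsi0 stays raw for maximum
performance, scsi1 gets qcow2 for snapshot support, both on the same LVM
storage 'mylvm'.

scsi0: mylvm:vm-100-disk-0
scsi1: mylvm:vm-100-disk-1.qcow2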
>
> +
> + #add extra space for qcow2 metadatas
> + #without sub-allocated clusters : For 1TB storage : l2_size = disk_size × 8 / cluster_size
> + #with sub-allocated clusters : For 1TB storage : l2_size = disk_size × 8 / cluster_size / 16
> + #4MB overhead for 1TB with extented l2 clustersize=128k
> +
> + my $qcow2_overhead = ceil($size/1024/1024/1024) * 4096;
>>there's "qemu-img measure", which seems like it would do exactly what
>>we want ;)
"Calculate the file size required for a new image. This information can
be used to size logical volumes or SAN LUNs appropriately for the image
that will be placed in them."
Indeed, lol. I knew the command, but I thought it was only for measuring
the content of an existing file. I'll run tests to see if I get the same
results (and if sub-allocated clusters are handled correctly).
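Something along these lines should do it (sketch only; measure_qcow2_size()
is a hypothetical helper, and the -o options just mirror the
cluster_size=128k / extended L2 assumption from the comment above):

use JSON;
use PVE::Tools qw(run_command);

# ask qemu-img how much space a fully allocated qcow2 of $size_bytes
# needs, instead of hand-computing the L2 table overhead
sub measure_qcow2_size {
    my ($size_bytes) = @_;

    my $json = '';
    run_command(
        ['/usr/bin/qemu-img', 'measure', '--output=json', '-O', 'qcow2',
         '-o', 'cluster_size=128k,extended_l2=on', '--size', $size_bytes],
        outfunc => sub { $json .= shift },
    );

    my $res = decode_json($json);
    # 'fully-allocated' already includes all qcow2 metadata
    return $res->{'fully-allocated'};
}

Comparing the result with and without 'extended_l2=on' should also show
whether sub-allocated clusters are taken into account.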
> +
> + my $lvmsize = $size;
> + $lvmsize += $qcow2_overhead if $fmt eq 'qcow2';
> +
> die "not enough free space ($free < $size)\n" if $free < $size;
>
> - $name = $class->find_free_diskname($storeid, $scfg, $vmid)
> + $name = $class->find_free_diskname($storeid, $scfg, $vmid, $fmt)
> if !$name;
>
> - lvcreate($vg, $name, $size, ["pve-vm-$vmid"]);
> -
> + my $tags = ["pve-vm-$vmid"];
> + push @$tags, "\@pve-$name" if $fmt eq 'qcow2';
>>that's a creative way to avoid the need to discover and activate
>>snapshots one by one below, but it might warrant a comment ;)
Ah, sorry (but yes, the idea was to activate/deactivate the whole chain
in one call).
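The comment could basically say this (hypothetical volume name, just to
illustrate the mechanism): every LV belonging to the chain of
vm-100-disk-0.qcow2 (the current volume plus its snapshot volumes) carries
the tag 'pve-vm-100-disk-0.qcow2', so activating or deactivating the whole
backing chain is a single call:

use PVE::Tools qw(run_command);

# activate / deactivate every LV carrying the chain tag in one call
run_command(['/sbin/lvchange', '-ay', '@pve-vm-100-disk-0.qcow2']);
run_command(['/sbin/lvchange', '-an', '@pve-vm-100-disk-0.qcow2']);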
> +
> + #rename current lvm volume to snap volume
> + my $vg = $scfg->{vgname};
> + print"rename $volname to $snap_volname\n";
> + eval { lvrename($vg, $volname, $snap_volname) } ;
>> missing error handling..
> +
> +
> + #allocate a new lvm volume
> + $class->alloc_new_image($storeid, $scfg, $vmid, 'qcow2', $volname, $size/1024);
>>missing error handling
Ah, sorry, it should be included in the following eval (see the sketch
after the quoted hunk below).
> + eval {
> + $class->format_qcow2($storeid, $scfg, $volname, undef, $snap_path);
> + };
> +
> + if ($@) {
> + eval { $class->free_image($storeid, $scfg, $volname, 0) };
> + warn $@ if $@;
> + }
> +}
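Something like this (rough sketch, reusing the variables from the quoted
hunk; assuming lvrename() dies on failure):

# rename the current volume to the snapshot name, then allocate and
# format the new top image; try to undo the rename if anything fails
eval { lvrename($vg, $volname, $snap_volname) };
die "lvrename '$volname' to '$snap_volname' failed: $@\n" if $@;

eval {
    $class->alloc_new_image($storeid, $scfg, $vmid, 'qcow2', $volname, $size/1024);
    $class->format_qcow2($storeid, $scfg, $volname, undef, $snap_path);
};
if (my $err = $@) {
    # best-effort cleanup: drop the half-created image and move the old volume back
    eval { $class->free_image($storeid, $scfg, $volname, 0) };
    warn $@ if $@;
    eval { lvrename($vg, $snap_volname, $volname) };
    warn $@ if $@;
    die $err;
}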
> +
> +sub volume_rollback_is_possible {
> + my ($class, $scfg, $storeid, $volname, $snap, $blockers) = @_;
> +
> + my $snap_path = $class->path($scfg, $volname, $storeid, $snap);
> +
> + my $snapshots = $class->volume_snapshot_info($scfg, $storeid, $volname);
> + my $parent_snap = $snapshots->{current}->{parent};
> +
> + return 1 if !-e $snap_path || $snapshots->{$parent_snap}->{file} eq $snap_path;
>>the first condition here seems wrong, see storage patch #1
Yes, see the corrected sketch after the quoted hunk below.
> + die "can't rollback, '$snap' is not most recent snapshot on '$volname'\n";
> +
> + return 1;
> }
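A corrected sketch (dropping the '!-e $snap_path' shortcut; assuming
volume_snapshot_info() reports the parent of 'current' as in the hunk
above):

sub volume_rollback_is_possible {
    my ($class, $scfg, $storeid, $volname, $snap, $blockers) = @_;

    my $snapshots = $class->volume_snapshot_info($scfg, $storeid, $volname);
    my $parent_snap = $snapshots->{current}->{parent};

    # only the most recent snapshot (the direct backing of 'current')
    # can be rolled back to
    return 1 if defined($parent_snap) && $parent_snap eq $snap;

    die "can't rollback, '$snap' is not most recent snapshot on '$volname'\n";
}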
>
> +
> sub volume_snapshot_rollback {
> my ($class, $scfg, $storeid, $volname, $snap) = @_;
>
> - die "lvm snapshot rollback is not implemented";
> + die "can't rollback snapshot this image format\n" if $volname !~ m/\.(qcow2|qed)$/;
>>above we only have qcow2, which IMHO makes more sense..
We could remove .qed everywhere; it has been deprecated since 2017 and we
have never exposed it in the GUI.
> +
> + my $snapshots = $class->volume_snapshot_info($scfg, $storeid, $volname);
> + my $snap_path = $snapshots->{$snap}->{file};
> + my $snap_volname = $snapshots->{$snap}->{volname};
> + return if !-e $snap_path; #already deleted ?
>>should maybe be a die?
Same as the patch #1 comment: this was for snapshot delete retries with
multiple disks.
> + } else {
> + print"commit $snap_path\n";
> + $cmd = ['/usr/bin/qemu-img', 'commit', $snap_path];
>> leftover?
Still no ;) see my patch #1 reply.
> + #if we delete an intermediate snapshot, we need to link upper snapshot to base snapshot
> + die "missing parentsnap snapshot to rebase child $child_path\n" if !$parent_path;
> + print "link $child_snap to $parent_snap\n";
> + $cmd = ['/usr/bin/qemu-img', 'rebase', '-u', '-b', $parent_path, '-F', 'qcow2', '-f', 'qcow2', $child_path];
> + run_command($cmd);
>>same as for patch #1, I am not sure the -u here is correct..
This is correct, see my patch #1 reply.
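Just to spell out the general qemu-img behaviour (not a claim about the
exact flow here, that's in the patch #1 thread):

# plain rebase: copies every cluster that differs between the old and the
# new backing file into the overlay, so any new backing file is allowed
$cmd = ['/usr/bin/qemu-img', 'rebase', '-b', $parent_path, '-F', 'qcow2', '-f', 'qcow2', $child_path];
# with '-u': only the backing-file entry in the header is rewritten; this
# is only valid when the new backing file already provides exactly the
# data the overlay used to see, e.g. right after the deleted snapshot has
# been committed down into it
$cmd = ['/usr/bin/qemu-img', 'rebase', '-u', '-b', $parent_path, '-F', 'qcow2', '-f', 'qcow2', $child_path];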
>
> +# don't allow to clone as we can't activate the base on multiple host at the same time
> +# clone => {
> +# base => { qcow2 => 1, raw => 1},
> +# },
>>I think activating the base would actually be okay, we just must
>>never write to it? ;)
Ah, this is a good remark. I thought we couldn't activate an LV on
multiple nodes at the same time. I'll look into this; it would add the
possibility of linked clones. (I need to check the external snapshot code
with backing chains first.)