[pve-devel] [PATCH v2 pve-storage 1/2] add external snasphot support

DERUMIER, Alexandre alexandre.derumier at groupe-cyllene.com
Wed Oct 23 14:59:33 CEST 2024


Hi Fabian,

thanks for the review !

>>-------- Message initial --------
>>De: Fabian Grünbichler <f.gruenbichler at proxmox.com>
>>À: Proxmox VE development discussion <pve-devel at lists.proxmox.com>
>>Cc: Alexandre Derumier <alexandre.derumier at groupe-cyllene.com>
>>Objet: Re: [pve-devel] [PATCH v2 pve-storage 1/2] add external
>>snasphot support
>>Date: 23/10/2024 12:12:46
>>
>>some high level comments:
>>
>>I am not sure how much we gain here with the raw support?

They are really qcow2 overhead, mostly with big disk.
as for good performance, the qcow2 l2-cache-size need to be keeped in
memory (and it's 1MB by disk)
https://events.static.linuxfound.org/sites/events/files/slides/kvm-forum-2017-slides.pdf

Hopefully, they are improvments with the "new" sub-cluster feature
https://people.igalia.com/berto/files/kvm-forum-2020-slides.pdf
I'm already using it at snapshot create, but I think we should also use
it for main qcow2 volume.


But even with that, you can still have performance impact.
So yes, I think they are really usecase for workload when you only need
snapshot time to time (before an upgrade for example), but max
performance with no snaphot exist.



>> it's a bit confusing to have a volid ending with raw, with the
>>current volume and all but the first snapshot actually being stored
>>in qcow2 files, with the raw file being the "oldest" snapshot in the
>>chain..
if it's too confusing, we could use for example an .snap extension.
(as we known that it's qcow2 behind)


if possible, I'd be much happier with the snapshot name in the snapshot
file being a 1:1 match, see comments inline

>>- makes it a lot easier to understand (admin wants to manually remove
>>snapshot "foo", if "foo" was the last snapshot then right now the
>>volume called "foo" is actually the current contents!)

This part is really difficult, because you can't known in advance the
name of the snapshot you'll take in the future. The only way could be
to create a "current" volume ,  rename it when you took another
snasphot (I'm not sure it's possible to do it live,
and this could break link chain too)

Also, I don't known how to manage the main volume, when you take the
first snapshot ? we should rename it too.

so "vm-disk-100-disk-0.raw|qcow2"  , become "vm-disk-100-disk-0-
snap1.(raw|qcow2)" + new "vm-disk-100-disk-0-current.qcow2" ?


I'll try to do test again to see what is possible.




>>- means we don't have to do lookups via the full snapshot list all
>>the time (e.g., if I want to do a full clone from a snapshot "foo", I
>>can just pass the snap-foo volume to qemu-img)

ok got it



>>the naming scheme for snapshots needs to be adapted to not clash with
>>regular volumes:

>>$ pvesm alloc extsnap 131314 vm-131314-disk-foobar.qcow2 2G
>>Formatting '/mnt/pve/ext4/extsnap/images/131314/vm-131314-disk-
>>foobar.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off
>>preallocation=off compression_type=zlib size=2147483648
>>lazy_refcounts=off refcount_bits=16
>>successfully created 'extsnap:131314/vm-131314-disk-foobar.qcow2'
>>$ qm rescan --vmid 131314
>>rescan volumes...
>>can't parse snapname from path at
>>/usr/share/perl5/PVE/Storage/Plugin.pm line 1934.

any preference for naming scheme ? for lvm external snap, I have used
"vm-131314-disk-0-snap-<foobar>";

>>storage_migrate needs to handle external snapshots, or at least error
>>out. 
it should already work. (I have tested move_disk, and live migration +
storage migration). qemu_img_convert offline and qemu block job for
live.


>>I haven't tested that part or linked clones or a lot of other
>>advanced related actions at all ;)

For linked clone, we can't have a base image with snapshots (other than
_base_). so It'll be safe.



> Alexandre Derumier via pve-devel <pve-devel at lists.proxmox.com> hat am
> 30.09.2024 13:31 CEST geschrieben:
> Signed-off-by: Alexandre Derumier <alexandre.derumier at groupe-
> cyllene.com>
> ---
>  src/PVE/Storage/DirPlugin.pm |   1 +
>  src/PVE/Storage/Plugin.pm    | 225 +++++++++++++++++++++++++++++++--
> --
>  2 files changed, 201 insertions(+), 25 deletions(-)
> 
> diff --git a/src/PVE/Storage/DirPlugin.pm
> b/src/PVE/Storage/DirPlugin.pm
> index 2efa8d5..2bef673 100644
> --- a/src/PVE/Storage/DirPlugin.pm
> +++ b/src/PVE/Storage/DirPlugin.pm
> @@ -80,6 +80,7 @@ sub options {
>   is_mountpoint => { optional => 1 },
>   bwlimit => { optional => 1 },
>   preallocation => { optional => 1 },
> + snapext => { optional => 1 },
>     };
>  }
>  
> diff --git a/src/PVE/Storage/Plugin.pm b/src/PVE/Storage/Plugin.pm
> index 6444390..5e5197a 100644
> --- a/src/PVE/Storage/Plugin.pm
> +++ b/src/PVE/Storage/Plugin.pm
> @@ -214,6 +214,11 @@ my $defaultData = {
>       maximum => 65535,
>       optional => 1,
>   },
> +        'snapext' => {
> +     type => 'boolean',
> +     description => 'enable external snapshot.',
> +     optional => 1,
> +        },
>      },
>  };
>  
> @@ -695,7 +700,7 @@ sub get_subdir {
>  }
>  
>  sub filesystem_path {
> -    my ($class, $scfg, $volname, $snapname) = @_;
> +    my ($class, $scfg, $volname, $snapname, $current_snap) = @_;

see comment below

>  
>      my ($vtype, $name, $vmid, undef, undef, $isBase, $format) =
>   $class->parse_volname($volname);
> @@ -703,7 +708,7 @@ sub filesystem_path {
>      # Note: qcow2/qed has internal snapshot, so path is always
>      # the same (with or without snapshot => same file).
>      die "can't snapshot this image format\n"
> - if defined($snapname) && $format !~ m/^(qcow2|qed)$/;
> + if defined($snapname) && !$scfg->{snapext} && $format !~
> m/^(qcow2|qed)$/;
>  
>      my $dir = $class->get_subdir($scfg, $vtype);
>  
> @@ -711,13 +716,22 @@ sub filesystem_path {
>  
>      my $path = "$dir/$name";
>  
> +    if($scfg->{snapext}) {
> + my $snappath = get_snap_path($path, $snapname);
> + if($snapname) {
> +     $path = $snappath;
> + } elsif ($current_snap) {
> +     $path = $current_snap->{file};
> + }
> +    }

see commente below

>      return wantarray ? ($path, $vmid, $vtype) : $path;
>  }
>  
>  sub path {
>      my ($class, $scfg, $volname, $storeid, $snapname) = @_;
>  
> -    return $class->filesystem_path($scfg, $volname, $snapname);
> +    my $current_snapshot = $class->get_current_snapshot($scfg,
> $storeid, $volname);

>>this is pretty expensive, and would only be needed if $snapname is
>>not set..

The main problem is when you start a vm on a specific snasphot,
we don't send the $snapname param.

One way could be that qemu-server check the current snapshot from
config when doing specific action like start.





> +    return $class->filesystem_path($scfg, $volname, $snapname,
> $current_snapshot);

>>couldn't we avoid extending the signature of filesystem_path and just
pass the name of the current snapshot as $snapname?

I need to redo test, I don't remember why I have splitted them, but you
are right, it should be cleaner.

>  }
>  
>  sub create_base {
> @@ -1074,13 +1088,31 @@ sub volume_resize {
>  sub volume_snapshot {
>      my ($class, $scfg, $storeid, $volname, $snap) = @_;
>  
> -    die "can't snapshot this image format\n" if $volname !~
> m/\.(qcow2|qed)$/;
> +    die "can't snapshot this image format\n" if $volname !~
> m/\.(raw|qcow2|qed)$/;
>  
> -    my $path = $class->filesystem_path($scfg, $volname);
> +    die "external snapshot need to be enabled to snapshot .raw
> volumes\n" if !$scfg->{snapext};

>>this condition is definitely wrong - it means no more snapshotting
>>unless external snapshot support is enabled..

oops, sorry.

>  
> -    my $cmd = ['/usr/bin/qemu-img', 'snapshot','-c', $snap, $path];
> +    if($scfg->{snapext}) {
>  
> -    run_command($cmd);
> + my $path = $class->path($scfg, $volname, $storeid);
> +
> + my $snappath = get_snap_path($path, $snap);
> + my $format = ($class->parse_volname($volname))[6];
> +
> + my $cmd = ['/usr/bin/qemu-img', 'create', '-b', $path,
> +    '-F', $format, '-f', 'qcow2', $snappath];

>>see comments on qemu-server, but.. wouldn't it be better if the file
>>with $snap in its name would be the one storing that snapshot's data?
>>i.e., rename the "current" volume to be called ...-$snap... , and
>>then create a new "current" file without a suffix with the renamed
>>volume as backing file?

I'll try it !

> +
> + my $options = "extended_l2=on,";
> + $options .= preallocation_cmd_option($scfg, 'qcow2');
> + push @$cmd, '-o', $options;
> + run_command($cmd);
> +
> +    } else {
> +
> + my $path = $class->filesystem_path($scfg, $volname);
> + my $cmd = ['/usr/bin/qemu-img', 'snapshot','-c', $snap, $path];
> + run_command($cmd);
> +    }
>  
>      return undef;
>  }
> @@ -1091,19 +1123,39 @@ sub volume_snapshot {
>  sub volume_rollback_is_possible {
>      my ($class, $scfg, $storeid, $volname, $snap, $blockers) = @_;
>  
> +    if ($scfg->{snapext}) {
> + #technically, we could manage multibranch, we it need lot more work
> for snapshot delete

>>would multibranch be easier if there is a simple 1:1 correspondence
>>between snapshots and their filenames?
>>
>>switching to a different part of the "hierarchy" is then just
>>- delete current volume
>>- create new current volume using rollback target as backing file
the rollback/branch switch is not too difficult, maybe 1:1 naming could
help.

>>I guess deletion does become harder then, since it potentially
>>requires multiple rebases..

yes, The biggest difficulty is snapshot delete, as you need to create a
block-stream  job, mergin/writing to each branch child, and you need to
do it atomically with a transaction with multiple jobs. 
So yes, it's possible, but I wanted to keep it easy for now.

> + my $path = $class->filesystem_path($scfg, $volname);
> + my $snappath = get_snap_path($path, $snap);
> +
> + my $snapshots = $class->volume_snapshot_info($scfg, $storeid,
> $volname);
> + my $currentpath = $snapshots->{current}->{file};
> + return 1 if !-e $snappath || $currentpath eq $snappath;
> +
> + die "can't rollback, '$snap' is not most recent snapshot on
> '$volname'\n";
> +    }
> +
>      return 1;
>  }
>  
>  sub volume_snapshot_rollback {
>      my ($class, $scfg, $storeid, $volname, $snap) = @_;
>  
> -    die "can't rollback snapshot this image format\n" if $volname !~
> m/\.(qcow2|qed)$/;
> +    die "can't rollback snapshot this image format\n" if $volname !~
> m/\.(raw|qcow2|qed)$/;
>  
> -    my $path = $class->filesystem_path($scfg, $volname);
> +    die "external snapshot need to be enabled to rollback snapshot
> .raw volumes\n" if $volname =~ m/\.(raw)$/ && !$scfg->{snapext};
>  
> -    my $cmd = ['/usr/bin/qemu-img', 'snapshot','-a', $snap, $path];
> +    my $path = $class->filesystem_path($scfg, $volname);
>  
> -    run_command($cmd);
> +    if ($scfg->{snapext}) {
> + #simply delete the current snapshot and recreate it
> + my $snappath = get_snap_path($path, $snap);
> + unlink($snappath);
> + $class->volume_snapshot($scfg, $storeid, $volname, $snap);

this *reads* so weird ;) it is right given the current semantics
(current snapshot == live image, snapshot data actually stored in
parent snapshot)

> +    } else {
> + my $cmd = ['/usr/bin/qemu-img', 'snapshot','-a', $snap, $path];
> + run_command($cmd);
> +    }
>  
>      return undef;
>  }
> @@ -1111,17 +1163,50 @@ sub volume_snapshot_rollback {
>  sub volume_snapshot_delete {
>      my ($class, $scfg, $storeid, $volname, $snap, $running) = @_;
>  
> -    die "can't delete snapshot for this image format\n" if $volname
> !~ m/\.(qcow2|qed)$/;
> +    die "can't delete snapshot for this image format\n" if $volname
> !~ m/\.(raw|qcow2|qed)$/;
> +
> +    die "external snapshot need to be enabled to delete snapshot of
> .raw volumes\n" if !$scfg->{snapext};
>  
>      return 1 if $running;
>  
> -    my $path = $class->filesystem_path($scfg, $volname);
> +    my $cmd = "";
> +    if ($scfg->{snapext}) {
> +
> + my $snapshots = $class->volume_snapshot_info($scfg, $storeid,
> $volname);
> + my $snappath = $snapshots->{$snap}->{file};
> + return if !-e $snappath;  #already deleted ?
> +
> + my $parentsnap = $snapshots->{$snap}->{parent};
> + my $childsnap = $snapshots->{$snap}->{child};
> +        die "error: can't find a parent for this snapshot" if
> !$parentsnap;
>  
> -    $class->deactivate_volume($storeid, $scfg, $volname, $snap, {});
> + my $parentpath = $snapshots->{$parentsnap}->{file};
> + my $parentformat = $snapshots->{$parentsnap}->{'format'} if
> $parentsnap;
> + my $childpath = $snapshots->{$childsnap}->{file} if $childsnap;
> + my $childformat = $snapshots->{$childsnap}->{'format'} if
> $childsnap;
>  
> -    my $cmd = ['/usr/bin/qemu-img', 'snapshot','-d', $snap, $path];
> + print "merge snapshot $snap to $parentsnap\n";
> + $cmd = ['/usr/bin/qemu-img', 'commit', $snappath];
> + run_command($cmd);
> +
> + #if we delete an intermediate snapshot, we need to link upper
> snapshot to base snapshot
> + if($childpath && -e $childpath) {
> +     die "missing parentsnap snapshot to rebase child $childpath\n"
> if !$parentpath;
> +     print "link $childsnap to $parentsnap\n";
> +     $cmd = ['/usr/bin/qemu-img', 'rebase', '-u', '-b', $parentpath,
> '-F', $parentformat, '-f', $childformat, $childpath];
> +     run_command($cmd);
> + }

>>wouldn't a regular safe rebase work just as well, instead of commit +
>>unsafe rebase? if there is no parent, passing in "" as "new" backing
>>file should work..

I'll test it, but I'm pretty sure this is the correct way.

> +
> + #delete the snapshot
> + unlink($snappath);
> +    } else {
> + my $path = $class->filesystem_path($scfg, $volname);
>  
> -    run_command($cmd);
> + $class->deactivate_volume($storeid, $scfg, $volname, $snap, {});
> +
> + $cmd = ['/usr/bin/qemu-img', 'snapshot','-d', $snap, $path];
> + run_command($cmd);
> +    }
>  
>      return undef;
>  }
> @@ -1140,10 +1225,6 @@ sub volume_has_feature {
>      my ($class, $scfg, $feature, $storeid, $volname, $snapname,
> $running, $opts) = @_;
>  
>      my $features = {
> - snapshot => {
> -     current => { qcow2 => 1 },
> -     snap => { qcow2 => 1 },
> - },
>   clone => {
>       base => { qcow2 => 1, raw => 1, vmdk => 1 },
>   },
> @@ -1159,11 +1240,23 @@ sub volume_has_feature {
>       base => { qcow2 => 1, raw => 1, vmdk => 1 },
>       current => { qcow2 => 1, raw => 1, vmdk => 1 },
>   },
> - rename => {
> -     current => {qcow2 => 1, raw => 1, vmdk => 1},
> - },
> + 'rename' => {
> +     current => { qcow2 => 1, raw => 1, vmdk => 1},
> + }
>      };
>  
> +    if ($scfg->{snapext}) {
> + $features->{snapshot} = {
> + current => { raw => 1, qcow2 => 1 },
> + snap => { raw => 1, qcow2 => 1 },
> + }
> +    } else {
> + $features->{snapshot} = {
> + current => { qcow2 => 1 },
> + snap => { qcow2 => 1 },
> + };
> +    }
> +

>>this could just leave $features as it is, and add the "raw" bits:
>>
>>if ($scfg->{snapext}) {
>>    $features->{snapshot}->{current}->{raw} = 1;
>>    $features->{snapshot}->{snap}->{raw} = 1;
>>}

ok !
>      if ($feature eq 'clone') {
>   if (
>       defined($opts->{valid_target_formats})
> @@ -1222,7 +1315,9 @@ sub list_images {
>   }
>  
>   if ($vollist) {
> -     my $found = grep { $_ eq $volid } @$vollist;
> +     my $search_volid = $volid;
> +     $search_volid =~ s/-snap-.*\./\./;
> +     my $found = grep { $_ eq $search_volid } @$vollist;
>       next if !$found;
>   }
>  
> @@ -1380,7 +1475,53 @@ sub status {
>  sub volume_snapshot_info {
>      my ($class, $scfg, $storeid, $volname) = @_;
>  
> -    die "volume_snapshot_info is not implemented for $class";
> +    die "volume_snapshot_info is not implemented for $class" if
> !$scfg->{snapext};
> +
> +    my $path = $class->filesystem_path($scfg, $volname);
> +
> +    my ($vtype, $name, $vmid, $basename, $basevmid, $isBase,
> $format) = $class->parse_volname($volname);
> +
> +    my $basevolname = $volname;
> +    $basevolname =~ s/\.(raw|qcow2)$//;
> +
> +    my $snapshots = $class->list_images($storeid, $scfg, $vmid);
> +    my $info = {};
> +    for my $snap (@$snapshots) {
> +
> + my $volid = $snap->{volid};
> + next if ($volid !~ m/$basevolname/);

>>this regex is broken w.r.t. partial matching!
>>
>>e.g., if a VM has both a disk -1.qcow2 and -11.qcow2 and I attempt to
>>snapshot it using external snapshots:
ok !


snapshotting 'drive-scsi0' (extsnap:131314/vm-131314-disk-0.raw)
Formatting '/mnt/pve/ext4/extsnap/images/131314/vm-131314-disk-0-snap-
test2.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=on
preallocation=off compression_type=zlib size=200704
backing_file=/mnt/pve/ext4/extsnap/images/131314/vm-131314-disk-0-snap-
test.qcow2 backing_fmt=raw lazy_refcounts=off refcount_bits=16
snapshotting 'drive-scsi1' (extsnap:131314/vm-131314-disk-1.qcow2)
Formatting '/mnt/pve/ext4/extsnap/images/131314/vm-131314-disk-11-snap-
test2.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=on
preallocation=off compression_type=zlib size=2147483648
backing_file=/mnt/pve/ext4/extsnap/images/131314/vm-131314-disk-
11.qcow2 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
snapshotting 'drive-scsi2' (extsnap:131314/vm-131314-disk-11.qcow2)
qemu-img: /mnt/pve/ext4/extsnap/images/131314/vm-131314-disk-11-snap-
test2.qcow2: Error: Trying to create an image with the same filename as
the backing file
snapshot create failed: starting cleanup
merge snapshot test2 to test
Image committed.
merge snapshot test2 to base
Image committed.
TASK ERROR: command '/usr/bin/qemu-img create -b
/mnt/pve/ext4/extsnap/images/131314/vm-131314-disk-11-snap-test2.qcow2
-F qcow2 -f qcow2 /mnt/pve/ext4/extsnap/images/131314/vm-131314-disk-
11-snap-test2.qcow2 -o 'extended_l2=on,preallocation=off'' failed: exit
code 1 

> +
> + my (undef, $snapvolname) = parse_volume_id($volid);
> + my $snapname = get_snapname_from_path($volid);
> + my $snapfile = $class->filesystem_path($scfg, $snapvolname,
> $snapname);
> + $snapname = 'base' if !$snapname;
> +
> + my $format = $snap->{'format'};
> + my $parentfile = $snap->{parent};
> + my $parentname = get_snapname_from_path($parentfile) if
> $parentfile;
> + $parentname = 'base' if !$parentname && $parentfile;
> +
> + $info->{$snapname}->{file} = $snapfile;
> + $info->{$snapname}->{volid} = $volid;
> + $info->{$snapname}->{'format'} = $format;
> + $info->{$snapname}->{parent} = $parentname if $parentname;
> + $info->{$parentname}->{child} = $snapname if $parentname;
> +    }
> +
> +    my $current = undef;
> +    for my $id (keys %$info) {
> + my $snap = $info->{$id};
> + die "error: snap $id: you can't have multiple current snapshot: 
> current:$current\n" if !$snap->{child} && $current;
> + $current = $id if !$snap->{child};
> +    }
> +
> +    if ($current) {
> + $info->{current}->{file} = $info->{$current}->{file};
> + $info->{current}->{'format'} = $info->{$current}->{'format'};
> + $info->{current}->{parent} = $info->{$current}->{parent};
> +    }
> +
> +    return $info;
>  }
>  
>  sub activate_storage {
> @@ -1764,4 +1905,38 @@ sub config_aware_base_mkdir {
>      }
>  }
>  
> +sub get_snap_path {
> +    my ($path, $snap) = @_;
> +
> +    my $basepath = "";
> +    my $baseformat = "";
> +    if ($path =~ m/^((.*)(vm-(\d+)-disk-(\d+)))(-snap-
> (.*))?\.(raw|qcow2)/) {

>>this regex is wrong - volumes can have arbitrary names after the -
>>disk- part..

ah sorry. do you have some example where it's used ? (maybe for efi or
other specific disk ?)


> + $basepath = $1;
> + $baseformat = $8;
> +    }
> +    my $format = $snap ? 'qcow2' : $baseformat;
> +    my $snappath = $snap ? $basepath."-snap-$snap.$format" : undef;
> +
> +    return $snappath;
> +}
> +
> +sub get_snapname_from_path {
> +    my ($path) = @_;
> +
> +    if ($path =~ m/^((.*)(vm-(\d+)-disk-(\d+)))(-snap-
> (.*))?\.(raw|qcow2)/) {

>>here as well.. and this whole helper is just used twice in
>>volume_snapshot_info, maybe it could be inlined or made private 
ok !


> + my $snapname = $7;
> + return $snapname;
> +    }
> +    die "can't parse snapname from path";
> +}
> +
> +sub get_current_snapshot {
> +    my ($class, $scfg, $storeid, $volname) = @_;
> +    #IMPROVE ME: faster way to find current snapshot? (search the
> most recent created snapshot file ? need to works with lvm volume
> too)
> +
> +    return if !$scfg->{snapext};
> +    my $snapshots = $class->volume_snapshot_info($scfg, $storeid,
> $volname);
> +    return $snapshots->{current};
> +}
> +
>  1;
> -- 
> 2.39.2




More information about the pve-devel mailing list