[pve-devel] [PATCH-SERIES v3 pve-storage/qemu-server/pve-qemu] add external qcow2 snapshot support

DERUMIER, Alexandre alexandre.derumier at groupe-cyllene.com
Fri Jan 10 13:30:22 CET 2025


-------- Original Message --------
From: Fiona Ebner <f.ebner at proxmox.com>
To: Proxmox VE development discussion <pve-devel at lists.proxmox.com>,
f.gruenbichler at proxmox.com <f.gruenbichler at proxmox.com>
Cc: "DERUMIER, Alexandre" <alexandre.derumier at groupe-cyllene.com>
Subject: Re: [pve-devel] [PATCH-SERIES v3 pve-storage/qemu-server/pve-qemu] add external qcow2 snapshot support
Date: 10/01/2025 10:55:14

On 10.01.25 at 08:44, DERUMIER, Alexandre via pve-devel wrote:
> -------- Original Message --------
> From: Fabian Grünbichler <f.gruenbichler at proxmox.com>
> To: Proxmox VE development discussion <pve-devel at lists.proxmox.com>
> Cc: Alexandre Derumier <alexandre.derumier at groupe-cyllene.com>
> Subject: Re: [pve-devel] [PATCH-SERIES v3 pve-storage/qemu-server/pve-qemu] add external qcow2 snapshot support
> Date: 09/01/2025 15:13:14
> 
> > Alexandre Derumier via pve-devel <pve-devel at lists.proxmox.com> wrote
> > on 16.12.2024 10:12 CET:
> 
> > This patch series implements qcow2 external snapshot support for
> > files && lvm volumes.
> > 
> > The current internal qcow2 snapshots have bad write performance,
> > because no metadata can be preallocated.
> > 
> > This is particularly visible on a shared filesystem like ocfs2 or
> > gfs2.
> > 
> > Other known bugs are the freezes/locks that users have reported for
> > years when deleting snapshots on nfs (disk access seems to be
> > frozen for the whole duration of the delete).
> > 
> > This also opens the door to remote snapshot export-import for
> > storage replication.
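
(For context, a simplified sketch of the two approaches; file names
are made up:

    # internal snapshot: stored inside the same qcow2 file
    qemu-img snapshot -c snap1 vm-100-disk-0.qcow2

    # external snapshot: writes go to a new overlay file, the previous
    # image becomes a read-only backing file
    qemu-img create -f qcow2 -b vm-100-disk-0.qcow2 -F qcow2 \
        vm-100-disk-0-snap1.qcow2

Since the overlay is a fresh file, its allocation can be controlled at
creation time, which is not possible for internal snapshots.)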
> > > 
> > > a few high level remarks:
> > > - I am not sure whether we want to/can switch over to blockdev on
> > > the fly (i.e., without some sort of opt-in phase to iron out
> > > kinks). what about upgrades with running VMs? I guess some sort
> > > of flag and per-VM switching would be a better plan..
> 
> I have tested live migration, and it works for the small tests I
> have done (I was surprised myself). I'll try to run longer tests to
> be 100% sure that there is no data corruption.
> 
> On the guest side, it's transparent. On the qemu side, the devices
> and pci plumbing are still the same, and qemu already uses blockdev
> behind the scenes.
> 
> If needed, we could make a switch based on the qemu version, or a
> manual option.

>>Yes, we need to be very careful that all the defaults/behavior would
>>be the same to not break live-migration. A known difference is
>>format autodetection, which happens with "-drive file=" but not with
>>"-blockdev", but not relevant as we explicitly set the format.
>>Dumping the QObject JSON configs of the drives might be a good
>>heuristic to check that the properties really are the same before
>>and after the switch.
I had looked manually at the qdev info, and dumped the blockdevs
generated by the -drive command to compare; I didn't see any
difference (only the node names and the additional throttle-group
node).
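
Such a dump can be done with the generic QMP queries, e.g.:

    { "execute": "query-named-block-nodes" }
    { "execute": "query-block" }

Diffing the returned properties (driver, cache, discard,
detect-zeroes, ...) before and after the switch should catch most
behavioral differences.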

>>Switching based on QEMU version would need to be the creation QEMU
>>from the meta config property. Using machine or running binary
>>version would mean we would automatically switch for non-Windows
>>OSes which are not version pinned by default, so that doesn't help
>>if there would be breakage.

That's why I was thinking of implementing this for pve9 (based on the
qemu version).
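
For example, qemu-server already stores the creation version in the
meta property of the VM config, so the check could key off that
(values below are only illustrative):

    # in /etc/pve/qemu-server/<vmid>.conf
    meta: creation-qemu=9.2.0,ctime=1736500000

    # -> only generate '-blockdev' style command lines when
    #    creation-qemu is new enough (exact cutoff to be decided)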

>>I'd really hope it is compatible, because for a per-VM flag,
>>for backwards-compat reasons (e.g. rolling back to a snapshot with
>>VMstate) it would need to start out as being off by default.

I think that the vmstate is not a problem, because this is only the
guest memory, right? And the devices are not changing.


>>We wouldn't even need to switch to using '-blockdev' right now
>>(still a good thing to do long-term wise, but if it is opt-in, we
>>can't rely on all VMs having it, which is bad), you could also set
>>the node-name for the '-drive'.

Are you sure about this? I haven't seen any documentation about adding
the node-name to -drive (and we need it for hotplug via hmp drive_add
too :/ ).

I'm not even sure this can define a specific name for the file
nodename, which is needed for the snapshot renaming with
blockdev-reopen.
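
If it is supported, I would expect it to look something like this
(completely untested assumption on my side):

    # naming the format node directly on -drive (untested):
    -drive file=/dir/vm-100-disk-0.qcow2,format=qcow2,if=none,id=drive-virtio0,node-name=fmt-drive-virtio0

but even then I don't see how to give a specific name to the file
node below it, which is the part needed for the renaming.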


>>I.e. switching to '-blockdev' can be done separately to
>>switching to 'blockdev-*' QMP operations.


I really don't know if you can use the qmp blockdev-* commands with
the -drive syntax (where it could be possible to define a nodename).

I know that the qmp blockdev-* commands accept "device" (for -drive)
or "node-name" (for -blockdev).


(BTW, switching to -blockdev already breaks the qmp proxmox backup
^_^, possibly because of the throttle-group top node, I don't remember
exactly).



I'll take time to retest live migration with different OSes, restore
snapshots with state, and see if I get crashes or silent data
corruption.


> > > - if you see a way to name the block graph nodes in a
> > > deterministic fashion (i.e., have a 1:1 mapping between snapshot
> > > and block graph node name) that would be wonderful, else we'd
> > > have to find another way to improve the lookup there..
> 
> A 1:1 mapping with the snapshot is not possible (I have tried it a
> lot), because:
>   - the snapshot name can be too long (a blockdev name is 31
> characters max, and a hash based on the filename is difficult)
>   - with the external snapshot file renaming, this doesn't work
> (snap --> current). We can't rename a blockdev, so the mapping will
> drift.
> 
> So, I don't think that it's possible to avoid the lookup (I really
> have no idea how to manage it differently).
> I'm not sure it's really a problem? It's just an extra qmp call, but
> it's super fast.
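
The lookup is just resolving the drive to whatever node is currently
inserted, e.g. with query-block (output trimmed, node name
illustrative):

    { "execute": "query-block" }
    # -> [ { "device": "drive-virtio0",
    #        "inserted": { "node-name": "#block542", ... } } ]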

>>Are we sure the node-name for the drive is always stable? I.e. is
>>the block node that the guest sees inserted in the drive, always the
>>one named by the 'node-name' that was initially set when attaching
>>the drive via '-blockdev' or QMP 'blockdev-add'? After all kinds of
>>block operations? Even if there are partially completed/failed block
>>operations? After live migration from a not-yet-updated node?

No, the only stable nodename for me (in my implementation) is the top
throttle-group node, as it never changes during mirroring, snapshot
renaming, ...

The drive nodes (format-node and file-node) can change (2 file-nodes
during live file renaming with blockdev-reopen for example, 2
format-nodes switching after a mirror, ...).
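
So the resulting graph looks roughly like this (command-line sketch,
node names illustrative):

    -object throttle-group,id=throttle-drive-virtio0
    -blockdev driver=file,node-name=file-drive-virtio0,filename=/dir/vm-100-disk-0.qcow2
    -blockdev driver=qcow2,node-name=fmt-drive-virtio0,file=file-drive-virtio0
    -blockdev driver=throttle,node-name=drive-virtio0,throttle-group=throttle-drive-virtio0,file=fmt-drive-virtio0
    -device virtio-blk-pci,drive=drive-virtio0

The guest device always points at the throttle node; mirror, snapshot
rename, etc. only swap the format/file nodes below it.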


>>Otherwise, I'd prefer always querying the node-name before doing a
>>QMP 'blockdev-*' command to make sure it's actually the node that
>>the guest sees as well, like we currently do for 'block-export-add'.
That's the way I have done it.
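
i.e. roughly (the node-name comes from the query-block lookup shown
above, export id illustrative):

    { "execute": "block-export-add",
      "arguments": { "type": "nbd",
                     "id": "drive-virtio0-export",
                     "node-name": "fmt-drive-virtio0",
                     "writable": false } }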

>>And we wouldn't even need to set the node-names ourselves at all if
>>always querying first. Seems a bit more future-proof as well.

blockdev-reopen doesn't work with autogenerated nodenames (#block<id>)
(not sure if it's a bug or not).
That's why I'm currently naming all of them (including the backing
chain snapshots too).
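
For reference, the renaming with named nodes then looks roughly like
this (sketch only; node names and paths illustrative, error handling
omitted):

    # 1. open the file under its new name as a fresh file node
    { "execute": "blockdev-add",
      "arguments": { "driver": "file",
                     "node-name": "file-drive-virtio0-new",
                     "filename": "/dir/snap1.qcow2" } }

    # 2. re-point the format node at the new file node
    { "execute": "blockdev-reopen",
      "arguments": { "options": [ { "driver": "qcow2",
                                    "node-name": "fmt-drive-virtio0",
                                    "file": "file-drive-virtio0-new" } ] } }

    # 3. drop the old file node
    { "execute": "blockdev-del",
      "arguments": { "node-name": "file-drive-virtio0" } }

Step 2 has to reference the format node by name, which is why
everything in the chain needs a deterministic name up front.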
