[pve-devel] [PATCH-SERIES v3 pve-storage/qemu-server/pve-qemu] add external qcow2 snapshot support

Fri Jan 10 10:55:14 CET 2025

On 10.01.25 at 08:44, DERUMIER, Alexandre via pve-devel wrote:
> -------- Original message --------
> From: Fabian Grünbichler <f.gruenbichler at proxmox.com>
> To: Proxmox VE development discussion <pve-devel at lists.proxmox.com>
> Cc: Alexandre Derumier <alexandre.derumier at groupe-cyllene.com>
> Subject: Re: [pve-devel] [PATCH-SERIES v3 pve-storage/qemu-server/pve-
> qemu] add external qcow2 snapshot support
> Date: 09/01/2025 15:13:14
> 
>> Alexandre Derumier via pve-devel <pve-devel at lists.proxmox.com> wrote
>> on 16.12.2024 10:12 CET:
> 
>> This patch series implements qcow2 external snapshot support for files
>> and lvm volumes
>>
>> The current internal qcow2 snapshots have bad write performance
>> because no metadata can be preallocated.
>>
>> This is particularly visible on a shared filesystem like ocfs2 or
>> gfs2.
>>
>> Other bugs are the freezes/locks that users have reported for years
>> when deleting snapshots on nfs
>> (The disk access seems to be frozen for the whole duration of the
>> delete)
>>
>> This also opens the door to remote snapshot export-import for storage
>> replication.
>>>
>>> a few high level remarks:
>>> - I am not sure whether we want to/can switch over to blockdev on the
>>> fly (i.e., without some sort of opt-in phase to iron out kinks). what
>>> about upgrades with running VMs? I guess some sort of flag and per-VM
>>> switching would be a better plan..
> 
> I have tested live migration, and it works for the small tests I have
> done. (I was surprised myself). I'll try to run longer tests to be
> 100% sure that there is no data corruption.
> 
> On the guest side, it's transparent. On the qemu side, the devices and
> pci plumbing are still the same, and qemu already uses blockdev behind
> the scenes.
> 
> If needed, we could make a switch based on qemu version, or a manual
> option.

Yes, we need to be very careful that all the defaults/behavior would be
the same to not break live-migration. A known difference is format
autodetection, which happens with "-drive file=" but not with
"-blockdev", but that is not relevant as we explicitly set the format.
Dumping the QObject JSON configs of the drives might be a good heuristic
to check that the properties really are the same before and after the
switch.
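
To make the comparison idea concrete, here is a minimal sketch in Python
(not the Perl used in qemu-server) that diffs two already-dumped drive
configs. How the dumps are obtained is left open; the helper names
flatten() and diff_props() and the sample properties are hypothetical
and only serve as illustration.

```python
import json

MISSING = "<missing>"  # placeholder for a property present on one side only


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {"dotted.path": leaf_value}."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{index}."))
    else:
        # Drop the trailing dot that was appended for the last path element.
        items[prefix[:-1] if prefix else prefix] = obj
    return items


def diff_props(before, after, ignore=()):
    """Return {path: (old, new)} for every leaf property that differs."""
    old, new = flatten(before), flatten(after)
    diffs = {}
    for path in sorted(set(old) | set(new)):
        if path in ignore:
            continue
        left, right = old.get(path, MISSING), new.get(path, MISSING)
        if left != right:
            diffs[path] = (left, right)
    return diffs


# Hypothetical dumps taken before and after the switch (made-up values).
before = json.loads(
    '{"driver": "qcow2", "cache": {"direct": true, "no-flush": false},'
    ' "detect-zeroes": "on"}'
)
after = json.loads(
    '{"driver": "qcow2", "cache": {"direct": true, "no-flush": false},'
    ' "detect-zeroes": "unmap"}'
)

for path, (left, right) in diff_props(before, after).items():
    print(f"{path}: {left!r} -> {right!r}")
```

Properties that are expected to differ (for example an auto-generated
node name) could be passed via the ignore parameter, so that only
unexpected differences are reported.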

Switching based on QEMU version would need to use the creation QEMU
version from the meta config property. Using the machine or running
binary version would mean we would automatically switch for non-Windows
OSes, which are not version pinned by default, so that doesn't help if
there is breakage. I'd really hope it is compatible, because a per-VM
flag would need to start out as being off by default for
backwards-compat reasons (e.g. rolling back to a snapshot with VMstate).

We wouldn't even need to switch to using '-blockdev' right now (still a
good thing to do in the long term, but if it is opt-in, we can't rely on
all VMs having it, which is bad); you could also set the node-name for
the '-drive'. I.e. switching to '-blockdev' can be done separately from
switching to 'blockdev-*' QMP operations.

>>> - if you see a way to name the block graph nodes in a deterministic
>>> fashion (i.e., have a 1:1 mapping between snapshot and block graph
>>> node name) that would be wonderful, else we'd have to find another
>>> way to improve the lookup there..
> 
> 1:1 mapping with snapshot is not possible (I have tried it a lot),
> because:
>   - snapshot name can be too long (blockdev name is 31 characters max,
> hash based on filename is difficult)
>   - with external snapshot file renaming, this doesn't work (snap-->
> current). We can't rename a blockdev, so the mapping will drift.
> 
>   So, I don't think that it's possible to avoid lookup. (I really have
> no idea how to manage it).
> I'm not sure it's really a problem? It's just an extra qmp call, but
> it's super fast.

Are we sure the node-name for the drive is always stable? I.e. is the
block node that the guest sees inserted in the drive, always the one
named by the 'node-name' that was initially set when attaching the drive
via '-blockdev' or QMP 'blockdev-add'? After all kinds of block
operations? Even if there are partially completed/failed block
operations? After live migration from a not-yet-updated node? Otherwise,
I'd prefer always querying the node-name before doing a QMP 'blockdev-*'
command to make sure it's actually the node that the guest sees as well,
like we currently do for 'block-export-add'. And we wouldn't even need
to set the node-names ourselves at all if always querying first. Seems a
bit more future-proof as well.
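
As a rough illustration of "always query first", here is a minimal
Python sketch that picks the node-name currently inserted in a drive out
of an already-decoded reply of the QMP 'query-block' command. It assumes
the usual reply shape (a list of entries with "device" and an optional
"inserted" object carrying "node-name"); the helper name
current_node_name() and the sample reply are hypothetical, and this is
not how qemu-server implements the lookup.

```python
def current_node_name(query_block_reply, device):
    """Return the node-name of the block node currently inserted in `device`.

    `query_block_reply` is the decoded "return" list of QMP 'query-block'.
    Raises LookupError if the drive is unknown, empty, or has no node-name.
    """
    for info in query_block_reply:
        if info.get("device") != device:
            continue
        inserted = info.get("inserted")
        if not inserted:
            raise LookupError(f"drive '{device}' has no medium inserted")
        node_name = inserted.get("node-name")
        if not node_name:
            raise LookupError(f"drive '{device}' reports no node-name")
        return node_name
    raise LookupError(f"drive '{device}' not found")


# Hypothetical, heavily trimmed 'query-block' reply for illustration only.
sample_reply = [
    {"device": "drive-scsi0", "inserted": {"node-name": "#block123", "drv": "qcow2"}},
    {"device": "drive-ide2"},  # e.g. an empty CD-ROM drive
]

print(current_node_name(sample_reply, "drive-scsi0"))
```

The returned name would then be passed to the following 'blockdev-*'
command, so the operation always targets the node the guest actually
sees, whatever name it happens to have after earlier block operations.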