[pve-devel] [PATCH SERIES storage/qemu-server/-manager] RFC : add lvmqcow2 storage support
Fabian Grünbichler
f.gruenbichler at proxmox.com
Thu Sep 5 09:51:52 CEST 2024
> Alexandre Derumier via pve-devel <pve-devel at lists.proxmox.com> wrote on 26.08.2024 13:00 CEST:
> This patch series adds support for a new lvmqcow2 storage format.
>
> Currently, we can't do snapshots && thin provisioning on shared block devices because
> LVM-thin can't share its metadata volume. I have a lot of on-prem VMware customers
> for whom this is really blocking the Proxmox migration. (They are looking at oVirt/Oracle
> virtualisation, where it works fine.)
>
> It's possible to format a block device directly with the qcow2 format, without a filesystem.
> This has been used by Red Hat RHV/oVirt for almost 10 years in their VDSM daemon.
>
> For thin provisioning, or to handle the extra size needed by snapshots, we need to be able to resize
> the LVM volume dynamically.
> The volume is increased in chunks of 1GB by default (can be changed).
> QEMU implements an event to send an alert when write usage reaches a threshold.
> (The threshold is 50% of the last chunk, so when the VM has 500MB free.)
>
> The resize is async (around 2s), so the user needs to choose a suitable chunk size && threshold
> if the storage is really fast (NVMe for example, where you can write more than 500MB in 2s).
>
> If the resize is not fast enough, the VM will pause with an io-error.
> pvestatd watches for this error, tries to extend again if needed, and resumes the VM.
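The chunk/threshold arithmetic described in the quoted series can be sketched roughly as follows. This is an illustrative sketch only: the function names, the default values, and the cap at the guest-visible size (real qcow2 also needs some metadata overhead, ignored here) are my assumptions, not the actual patch code:

```python
# Illustrative sketch of the extend-on-threshold arithmetic, NOT the patch code.
CHUNK = 1024 * 1024 * 1024          # extend step: 1 GiB by default
THRESHOLD_FRACTION = 0.5            # alert when half of the last chunk remains

def write_threshold(allocated: int, chunk: int = CHUNK,
                    fraction: float = THRESHOLD_FRACTION) -> int:
    """Offset at which QEMU should raise its write-threshold event:
    `fraction` of the last chunk still free (500 MiB with the defaults)."""
    return allocated - int(chunk * fraction)

def next_allocation(allocated: int, virtual_size: int, chunk: int = CHUNK) -> int:
    """LV size to extend to when the threshold fires, capped at the
    guest-visible size (qcow2 metadata overhead ignored in this sketch)."""
    return min(allocated + chunk, virtual_size)

alloc = 4 * CHUNK                          # LV currently 4 GiB
print(write_threshold(alloc))              # threshold sits 512 MiB below the end
print(next_allocation(alloc, 32 * CHUNK))  # grow by one more chunk, to 5 GiB
```

With a 1 GiB chunk and a 50% fraction this reproduces the "500MB free" trigger from the cover letter; a faster storage would want a larger chunk or an earlier threshold.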
I agree with Dominik about the downsides of this approach.
We had a brief chat this morning and came up with a possible alternative that would still allow snapshots (even if thin-provisioning would be out of scope):
- allocate the volume with the full size and put a fully pre-allocated qcow2 file on it
- no need to monitor regular guest I/O, it's guaranteed that the qcow2 file can be fully written
- when creating a snapshot
-- check the actual usage of the qcow2 file
-- extend the underlying volume so that the total size is current usage + size exposed to the guest
-- create the actual (qcow2-internal) snapshot
- still no need to monitor guest I/O, the underlying volume should be big enough to overwrite all data
this would give us effectively the same semantics as thick-provisioned zvols, which also always reserve enough space at snapshot creation time to allow a full overwrite of the whole zvol. if the underlying volume cannot be extended by the required space, snapshot creation would fail.
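the sizing rule at snapshot time could be sketched like this (hypothetical helper, not existing code): the LV must hold everything already allocated in the qcow2 file plus a full overwrite of the size exposed to the guest.

```python
# Hypothetical sketch of the snapshot-time sizing rule, not existing code:
# after the snapshot, old data is frozen, so the LV needs room for the
# current allocation plus one full overwrite of the guest-visible size.
def lv_size_for_snapshot(qcow2_allocated: int, guest_size: int) -> int:
    """Target LV size to extend to before creating a qcow2-internal snapshot."""
    return qcow2_allocated + guest_size

GiB = 1024 ** 3
# guest sees 32 GiB, 10 GiB actually written so far -> extend LV to 42 GiB
print(lv_size_for_snapshot(10 * GiB, 32 * GiB) // GiB)
```

if lvextend to that size fails, snapshot creation fails up front instead of the guest hitting an io-error later.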
some open questions:
- do we actually get enough information about space usage out of the qcow2 file? (I think so, but haven't checked in detail)
- is there a way to compact/shrink either when removing snapshots, or as (potentially expensive) standalone action (worst case, compact by copying the whole disk?)
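on the first open question: `qemu-img check --output=json` reports an `image-end-offset` field, the highest offset in use inside the qcow2 container, which is the relevant usage number for a qcow2 sitting directly on an LV (this is also what oVirt probes). a minimal parsing sketch, where the sample JSON document is illustrative rather than captured from a real run:

```python
import json

# Sample `qemu-img check --output=json` output; the values and the device
# path are made up for illustration, the key names are real.
sample = """
{
    "image-end-offset": 10737418240,
    "total-clusters": 524288,
    "check-errors": 0,
    "format": "qcow2",
    "filename": "/dev/vg0/vm-100-disk-0"
}
"""

info = json.loads(sample)
used = info["image-end-offset"]
# qcow2 container currently uses the first N bytes of the LV
print(used // (1024 ** 3))
```

the LV could then be extended so that `image-end-offset + guest_size` fits.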
another, less involved approach would be to over-allocate the volume to provide a fixed, limited amount of slack for snapshots (e.g., "allocate 50% extra space for snapshots" when creating a guest volume) - but that has all the usual downsides of thin-provisioning (the guest is lied to about the disk size, and can run into weird error states when space runs out) and is less flexible.
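the fixed-slack variant boils down to a single allocation-time computation; a tiny sketch, where the 50% default mirrors the example above and the parameter name is made up:

```python
# Sketch of the fixed-slack variant: over-allocate the LV by a configurable
# percentage at creation time. Parameter name is hypothetical.
def lv_size_with_slack(guest_size: int, snapshot_slack_pct: int = 50) -> int:
    """LV size to allocate for a guest volume, including snapshot slack."""
    return guest_size + guest_size * snapshot_slack_pct // 100

GiB = 1024 ** 3
print(lv_size_with_slack(32 * GiB) // GiB)  # 32 GiB guest disk -> 48 GiB LV
```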
what do you think about the above approaches?