[pve-devel] [PATCH SERIES storage/qemu-server/-manager] RFC : add lvmqcow2 storage support

DERUMIER, Alexandre alexandre.derumier at groupe-cyllene.com
Thu Aug 29 10:27:07 CEST 2024




>>just my personal opinion, maybe you also want to wait for more
>>feedback from somebody else...
>>(also i just glanced over the patches, so correct me if I'm wrong)

Hi Dominik !

>>i see some problems with this approach (some are maybe fixable, some
>>probably not?)

>>* as you mentioned, if the storage is fast enough you have a runaway
>>VM
>>   this is IMHO not acceptable, as that leads to VMs that are
>>completely blocked and
>>   can't do anything. I fear this will generate many support calls
>>why their guests
>>   are stopped/hanging...

If the chunk size is correctly configured, it shouldn't happen.
(For example, if the storage is able to write at 500 MB/s, use a
chunk size of 5~10 GB; this gives you a 10~20 s window.)
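
To make the sizing explicit, here is a back-of-the-envelope sketch
(purely illustrative, not part of the patches; the 500 MB/s and
5~10 GB figures are just the example numbers above):

# headroom the monitor has before a guest writing at full speed
# fills the remaining pre-allocated chunk
def extend_window_seconds(chunk_size_gb, write_speed_mb_s):
    return chunk_size_gb * 1024 / write_speed_mb_s

for chunk_gb in (5, 10):
    print(chunk_gb, 'GB chunk ->',
          round(extend_window_seconds(chunk_gb, 500)), 's window')
# prints: 5 GB chunk -> 10 s window, 10 GB chunk -> 20 s window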


>>* the code says containers are supported (rootdir => 1) but i don't
>>see how?
>>   there is AFAICS no code to handle them in any way...
>>   (maybe just falsely copied?)

Oh indeed, I haven't checked CT yet. (It could be implemented with a
storage usage check every x seconds, but I'm not sure it scales well
with a lot of CT volumes.)
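
Something along these lines is what I have in mind; only a rough,
untested sketch (the image-end-offset probe via qemu-img check
--output=json, the 80% threshold, the +1G step and the device list
are all assumptions, not code from this series):

import json, subprocess, time

POLL_INTERVAL = 10   # seconds between checks (arbitrary)
THRESHOLD = 0.80     # extend when the qcow2 data reaches 80% of the LV
EXTEND_STEP = '+1G'  # how much to grow the LV each time

def qcow2_end_offset(dev):
    # highest byte offset used by the qcow2 image on the LV
    # (assumes the image is not write-locked by a running guest)
    out = subprocess.run(['qemu-img', 'check', '--output=json', dev],
                         capture_output=True, text=True)
    return json.loads(out.stdout)['image-end-offset']

def lv_size(dev):
    out = subprocess.check_output(
        ['lvs', '--noheadings', '--units', 'b', '--nosuffix',
         '-o', 'lv_size', dev], text=True)
    return int(out.strip())

def watch(devices):
    while True:
        for dev in devices:
            if qcow2_end_offset(dev) > THRESHOLD * lv_size(dev):
                subprocess.check_call(['lvextend', '-L', EXTEND_STEP, dev])
        time.sleep(POLL_INTERVAL)

The scaling worry shows up right there: two external commands per
volume per interval, so with hundreds of CT volumes the polling loop
itself gets expensive.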



>>* you lock the local blockextend call, but give it a timeout of 60
>>seconds.
>>   what if that timeout expires? the vm again gets completely blocked
>>until it's
>>   resized by pvestatd

I'm locking it to avoid multiple concurrent extends. I set an
arbitrary 60 s, but it could be a lot lower (an LVM extend doesn't
take more than 1 s for me).
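
For illustration, a minimal sketch of what I mean: a local per-VG
lock with a much shorter timeout (Python only for brevity, not the
actual Perl code of the patch; the lock path and the 5 s value are
assumptions):

import fcntl, subprocess, time

LOCK_TIMEOUT = 5  # seconds; an extend itself takes around 1 s here

def extend_lv(vg, lv, step='+1G'):
    lockfile = '/run/lock/pve-lvmqcow2-%s.lock' % vg   # hypothetical path
    with open(lockfile, 'w') as fh:
        deadline = time.monotonic() + LOCK_TIMEOUT
        while True:
            try:
                fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    raise TimeoutError('could not lock VG %s within %ss'
                                       % (vg, LOCK_TIMEOUT))
                time.sleep(0.1)
        # only the lock holder runs the extend; concurrent callers wait
        # briefly or fail fast instead of blocking for a full minute
        subprocess.check_call(['lvextend', '-L', step,
                               '/dev/%s/%s' % (vg, lv)])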




>>* IMHO pvestatd is the wrong place to make such a call. It's already
>>doing much
>>   stuff in a way where a single storage operation blocks many other
>>things
>>   (metrics, storage/vm status, ballooning, etc..)
>>
>>   cramming another thing in there seems wrong and will only lead to
>>even more people
>>   complaining about the pvestatd not working, only in this case the
>>vms
>>   will be in an io-error state indefinitely then.
>>
>>   I'd rather make a separate daemon/program, or somehow integrate it
>>into
>>   qmeventd (but then it would have to become multi
>>threaded/processes/etc.
>>   to not block it's other purposes)

Yes, I agree with this. (BTW, if one day we could have threading,
queues, or a separate daemon for each storage monitor, it would help
a lot with hanging storages.)




>>* there is no cluster locking?
>>   you only mention
>>
>>   ---8<---
>>   #don't use global cluster lock here, use on native local lvm lock
>>   --->8---
>>
>>   but don't configure any lock? (AFAIR lvm cluster locking needs
>>additional
>>   configuration/daemons?)
>>
>>   this *will* lead to errors if multiple VMs on different hosts try
>>   to resize at the same time.
>>
>>   even with cluster locking, this will very soon lead to contention,
>>since
>>   storage operations are inherently expensive, e.g. if i have
>>   10-100 VMs wanting to resize at the same time, some of them will
>>run
>>   into a timeout or at least into the blocking state.
>>
>>   That does not even need much IO, just bad luck when multiple VMs
>>go
>>   over the threshold within a short time.

Mmm, OK, this one could indeed be a problem.
I need to look at the oVirt code (they have really been using this in
production for 10 years) to see how they handle locks.


>>All in all, I'm not really sure if the gain (snapshots on shared LVM)
>>is worth
>>the potential cost in maintenance, support and customer
>>dissatisfaction with
>>stalled/blocked VMs.

>>Generally a better approach could be for your customers to use some
>>kind of shared filesystem (GFS2/OCFS/?). I know those are not really
>>tested or supported by us, but i would hope that they scale and
>>behave
>>better than qcow2-on-lvm-with-dynamic-resize.

Yes, if we can get it working fine, it could be *a lot* better. I'm
still afraid of kernel bugs/regressions. (At least OCFS2, 10 years
ago, was a nightmare; I used it in production for 1~2 years.)

For GFS2, there is a user on the Proxmox forum who has been using it
in production since 2019 without any problem:
https://forum.proxmox.com/threads/pve-7-x-cluster-setup-of-shared-lvm-lv-with-msa2040-sas-partial-howto.57536/


I need to test whether we get storage timeouts if one node goes down.
(For OCFS2 that was the case; the forum user tells me it was OK with
GFS2.)

I'll do tests on my side.

I really need this feature for a lot of on-prem customers migrating
from VMware. They are mostly small clusters (2~3 nodes with
direct-attach SAN).

So even if GFS2 doesn't scale well to many nodes, personally it would
be enough for me if we limit the number of supported nodes.


>>best regards
>>Dominik

Thanks again for the review ! 



(BTW, I have some small fixes to do to the pvestatd code in this
patch series.)


