[pve-devel] [PATCH SERIES storage/qemu-server/-manager] RFC : add lvmqcow2 storage support

DERUMIER, Alexandre alexandre.derumier at groupe-cyllene.com
Fri Aug 30 10:44:03 CEST 2024


Hi,
I have done some more tests.

> > * there is no cluster locking?
> >   you only mention
> >
> >   ---8<---
> >   #don't use global cluster lock here, use on native local lvm lock
> >   --->8---
> >
> >   but don't configure any lock? (AFAIR lvm cluster locking needs
> >   additional configuration/daemons?)
> >
> >   this *will* lead to errors if multiple VMs on different hosts try
> >   to resize at the same time.
> >
> >   even with cluster locking, this will very soon lead to contention,
> >   since storage operations are inherently expensive, e.g. if i have
> >   10-100 VMs wanting to resize at the same time, some of them will
> >   run into a timeout or at least into the blocking state.
> >
> >   That does not even need much IO, just bad luck when multiple VMs
> >   go over the threshold within a short time.

>>> mmm, ok, this one could indeed be a problem.
>>> I need to look at the oVirt code (they have really been using this in
>>> production for ~10 years) to see how they handle locks.

OK, you are right, we need a cluster lock here. Red Hat uses the sanlock
daemon, or dlm with corosync, for coordination.
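
Just to make the intent concrete, here is a minimal sketch (Python only for
illustration, not the actual Perl code) of the resize path serialized under
a cluster-wide lock. The cluster_lock() helper is a pure placeholder, the
real backend (sanlock, dlm, ...) is exactly the open point:

---8<---
import subprocess
from contextlib import contextmanager

@contextmanager
def cluster_lock(name):
    # placeholder: a real implementation would take a sanlock resource or a
    # dlm lock named after the shared VG, this no-op only exists for the sketch
    yield

def extend_lv_for_vm(vg, lv, grow_gib):
    # only one node at a time may extend an LV on the shared VG, otherwise
    # concurrent LVM metadata updates race with each other
    with cluster_lock(f"lvmqcow2-{vg}"):
        subprocess.run(
            ["lvextend", "-L", f"+{grow_gib}G", f"/dev/{vg}/{lv}"],
            check=True,
        )
    # after the LV has grown, the paused VM can be resumed so the stalled
    # write is retried (the QMP side is intentionally left out here)
--->8---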



> > * IMHO pvestatd is the wrong place to make such a call. It's already
> >   doing much stuff in a way where a single storage operation blocks
> >   many other things (metrics, storage/vm status, ballooning, etc..)
> >
> >   cramming another thing in there seems wrong and will only lead to
> >   even more people complaining about the pvestatd not working, only
> >   in this case the vms will be in an io-error state indefinitely then.
> >
> >   I'd rather make a separate daemon/program, or somehow integrate it
> >   into qmeventd (but then it would have to become multi
> >   threaded/processes/etc. to not block it's other purposes)

>> Yes, I agree with this. (BTW, if one day we could have threading,
>> queues or a separate daemon for each storage monitor, it would help
>> a lot with hanging storages.)

OK, I think we could manage a queue of disks to resize somewhere.
pvestatd could fill the queue on io-error, and it could be processed by
qmeventd (or maybe another daemon). It could be done sequentially, as we
need a cluster lock anyway.
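
Something like this rough sketch (again Python just to show the flow; the
names and the in-memory queue are made up, and it reuses the placeholder
cluster_lock() from the sketch above): pvestatd only enqueues the io-error
event and returns immediately, while a separate worker drains the queue one
request at a time.

---8<---
import queue
import subprocess
import threading

resize_queue = queue.Queue()   # filled by the status daemon on io-error

def enqueue_resize(vmid, vg, lv, grow_gib):
    # cheap and non-blocking, so pvestatd is never stuck on storage
    resize_queue.put((vmid, vg, lv, grow_gib))

def resize_worker():
    # sequential on purpose: it matches the single cluster-wide lock
    while True:
        vmid, vg, lv, grow_gib = resize_queue.get()
        try:
            with cluster_lock(f"lvmqcow2-{vg}"):   # placeholder, see above
                subprocess.run(
                    ["lvextend", "-L", f"+{grow_gib}G", f"/dev/{vg}/{lv}"],
                    check=True,
                )
            # resume the paused VM here once the LV has grown (omitted)
        except Exception as err:
            print(f"resize of vm {vmid} volume {vg}/{lv} failed: {err}")
        finally:
            resize_queue.task_done()

threading.Thread(target=resize_worker, daemon=True).start()
--->8---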


> > All in all, I'm not really sure if the gain (snapshots on shared LVM)
> > is worth the potential cost in maintenance, support and customer
> > dissatisfaction with stalled/blocked VMs.

> > Generally a better approach could be for your customers to use some
> > kind of shared filesystem (GFS2/OCFS/?). I know those are not really
> > tested or supported by us, but i would hope that they scale and
> > behave better than qcow2-on-lvm-with-dynamic-resize.

>>> Yes, if we can get it working fine, it could be *a lot* better. I'm
>>> still afraid of kernel bugs/regressions. (At least with ocfs2, 10
>>> years ago, it was a nightmare. I used it in production for 1-2 years.)

>>> For gfs2, there is a user on the proxmox forum who has been using it
>>> in production since 2019 without any problem.


>>> I need to test whether we get storage timeouts if one node goes down.
>>> (For ocfs2 it was the case; the forum user told me that it is OK with
>>> gfs2.)

>>> I'll do tests on my side.

OK, I have done tests with gfs2. Installation is easy, and it's well
integrated with corosync (it uses the dlm daemon, which sits on top of
corosync, to manage locks). Note: it needs fencing if corosync dies on a
node, as dlm is currently not able to recover the locks otherwise.
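
For completeness, the setup boils down to something like the following
(sketched as Python around the standard tools; device, cluster and
filesystem names are only examples). The important details are that the
lock table name must match the corosync cluster name and that one journal
per node is needed:

---8<---
import subprocess

def setup_gfs2(device, cluster_name, fs_name, node_count, mountpoint):
    # dlm_controld must already be running on every node; it plugs into
    # the existing corosync membership
    subprocess.run(
        ["mkfs.gfs2", "-p", "lock_dlm",
         "-t", f"{cluster_name}:{fs_name}",
         "-j", str(node_count), device],
        check=True,
    )
    subprocess.run(["mount", "-t", "gfs2", device, mountpoint], check=True)
--->8---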

It's working fine with preallocated qcow2 images. I get almost the same
performance as a raw device, around 20k IOPS at 4k and 3 GB/s on my test
storage.
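
"Preallocated" here means the data clusters are allocated at image creation
time, e.g. something like this (a sketch; the exact size and options of my
test images are not the important part):

---8<---
import subprocess

def create_preallocated_qcow2(path, size="100G"):
    # preallocation=metadata only pins the qcow2 L1/L2 tables;
    # "falloc" or "full" also allocate the data clusters, which is what
    # avoids a gfs2 block allocation (and its cluster lock) on first write
    subprocess.run(
        ["qemu-img", "create", "-f", "qcow2",
         "-o", "preallocation=full", path, size],
        check=True,
    )
--->8---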

But when the file is not preallocated (or when you take a snapshot of a
preallocated drive, so new writes are not preallocated anymore), the
performance is abysmal (60 IOPS at 4k, 40 MB/s). This seems to be a
well-known problem with gfs2, caused by the cluster lock taken on block
allocation.


I'll do tests with ocfs2 to compare.



