[RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel

Adam Kalisz adam.kalisz at notnullmakers.com
Tue Jul 8 17:08:11 CEST 2025


On Tue, 2025-07-08 at 12:58 +0200, Dominik Csapak wrote:
> On 7/8/25 12:04, Adam Kalisz wrote:
> > Hi Dominik,
> > 
> 
> Hi,
> 
> > this is a big improvement, I have done some performance
> > measurements
> > again:
> > 
> > Ryzen:
> > 4 worker threads:
> > restore image complete (bytes=53687091200, duration=52.06s,
> > speed=983.47MB/s)
> > 8 worker threads:
> > restore image complete (bytes=53687091200, duration=50.12s,
> > speed=1021.56MB/s)
> > 
> > 4 worker threads, 4 max-blocking:
> > restore image complete (bytes=53687091200, duration=54.00s,
> > speed=948.22MB/s)
> > 8 worker threads, 4 max-blocking:
> > restore image complete (bytes=53687091200, duration=50.43s,
> > speed=1015.25MB/s)
> > 8 worker threads, 4 max-blocking, 32 buffered futures:
> > restore image complete (bytes=53687091200, duration=52.11s,
> > speed=982.53MB/s)
> > 
> > Xeon:
> > 4 worker threads:
> > restore image complete (bytes=10737418240, duration=3.06s,
> > speed=3345.97MB/s)
> > restore image complete (bytes=107374182400, duration=139.80s,
> > speed=732.47MB/s)
> > restore image complete (bytes=107374182400, duration=136.67s,
> > speed=749.23MB/s)
> > 8 worker threads:
> > restore image complete (bytes=10737418240, duration=2.50s,
> > speed=4095.30MB/s)
> > restore image complete (bytes=107374182400, duration=127.14s,
> > speed=805.42MB/s)
> > restore image complete (bytes=107374182400, duration=121.39s,
> > speed=843.59MB/s)
> 
> just for my understanding: you left the parallel futures at 16 and
> changed the threads in the tokio runtime?

Yes, that's correct.

> The biggest issue here is that we probably don't want to increase
> that number by default by much, since on e.g. a running system this
> will have an impact on other running VMs. Adjusting such a number
> (especially in a way where it's now actually used in contrast to
> before) will come as a surprise for many.
> 
> That's IMHO the biggest challenge here, that's why I did not touch
> the tokio runtime thread settings, to not increase the load too much.
> 
> Did you by any chance observe the CPU usage during your tests?
> As I wrote in my commit message, the cpu usage quadrupled
> (proportional to the more chunks we could put through) when using 16
> fetching tasks.

Yes, please see the mpstat attachments. The one with yesterday's date
is from the first patch; of the two from today, the first is your
patch from today without changes and the second is with 8 worker
threads. All of them use 16 buffered futures.

> Just an additional note: With my solution, the blocking threads should
> not have any impact at all, since the fetching should be purely
> async (so no blocking code anywhere) and the writing is done in the
> main thread/task in sequence so no blocking threads will be used
> except one.

I actually see a slight negative impact, but that's based on only a
few runs.

> > On the Ryzen system I was hitting:
> > With 8-way concurrency, 16 max-blocking threads:
> > restore image complete (bytes=53687091200, avg fetch
> > time=24.7342ms,
> > avg time per nonzero write=1.6474ms, storage nonzero total write
> > time=19.996s, duration=45.83s, speed=1117.15MB/s)
> > -> this seemed to be the best setting for this system
> > 
> > It seems the counting of zeroes works in some kind of steps (seen
> > on the Xeon system with mostly incompressible data):
> > 
> 
> yes, only whole zero chunks will be counted.
> 
> [snip]
> > 
> > Especially during a restore the speed is quite important if you
> > need to hit Restore Time Objectives under SLAs. That's why we were
> > targeting 1 GBps for incompressible data.
> 
> I get that, but this will always be a tradeoff between CPU load and
> throughput and we have to find a good middle ground here.

Sure, I am not disputing that for the default configuration.

> IMO with my current patch, we have a very good improvement already,
> without increasing the (theoretical) load to the system.
> 
> It could be OK from my POV to make the number of threads of the
> runtime configurable e.g. via vzdump.conf. (That's a thing that's
> easily explainable in the docs for admins)

That would be great, because some users have other priorities depending
on the operational situation. The nice thing about the ENV override in
my submission is that somebody can run PBS_RESTORE_CONCURRENCY=8
qmrestore ... to change the priority of the restore ad hoc, e.g. if
they really need to restore as quickly as possible they can throw more
threads at the problem within some reasonable bounds. In some cases
they restore a VM that isn't running, so the resources it would
otherwise be using on the system are available for the restore.

Again, I agree the defaults should be conservative.
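
For illustration, what I have in mind for the override boils down to
roughly this (the function name, the default and the bounds are just
examples, not the actual code):

use std::env;
use tokio::runtime::Builder;

// Illustrative sketch only: size the tokio runtime from an environment
// override, falling back to a conservative default and clamping the
// value to some reasonable bounds.
fn build_restore_runtime() -> tokio::runtime::Runtime {
    let workers = env::var("PBS_RESTORE_CONCURRENCY")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .map(|n| n.clamp(1, 16)) // don't let an ad-hoc override run wild
        .unwrap_or(4); // conservative default, as discussed

    Builder::new_multi_thread()
        .worker_threads(workers)
        .enable_all()
        .build()
        .expect("failed to build tokio runtime")
}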

> If someone else (@Fabian?) wants to chime in to this discussion,
> I'd be glad.
> 
> Also feedback on my code in general would be nice ;)
> (There are probably better ways to make this concurrent in an
> async context, e.g. maybe using 'async-channel' + fixed number
> of tasks ?)

Having more timing information about how long the fetches and writes of
non-zero chunks take would be great for making informed estimates about
the effect of any performance settings we end up exposing. It would
also help with the benchmarking right now, to see where we are
saturated.

Do I understand correctly that while we are writing a chunk to storage
with the write_data_callback or write_zero_callback, we block a worker
that could otherwise be fetching chunks?
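
For the timing, something as simple as the following would already
help (a rough sketch; the stats struct and its fields are made up,
only the callback name matches the existing write_data_callback):

use std::time::{Duration, Instant};

// Rough sketch: accumulate per-chunk write timings so the final restore
// summary can report averages. Everything except the callback name is
// made up for illustration.
#[derive(Default)]
struct RestoreStats {
    write_time: Duration,
    nonzero_chunks: u64,
}

impl RestoreStats {
    fn timed_write<F: FnOnce() -> i32>(&mut self, write_data_callback: F) -> i32 {
        let start = Instant::now();
        let rc = write_data_callback(); // the blocking write to the VM image
        self.write_time += start.elapsed();
        self.nonzero_chunks += 1;
        rc
    }
}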

From what I gathered, we have to have a single thread that writes the
VM image, because otherwise we would have problems with concurrent
access.
We need to feed this thread as well as possible with a mix of zero
chunks and non-zero chunks. The zero chunks are cheap because we
generate them from information we already have in memory. The non-zero
chunks we have to fetch over the network (or from cache, which is
still more expensive than zero chunks). If I understand correctly,
when we fill the read queue with non-zero chunks and wait for them to
become available, we will not write any zero chunks that come after
them to storage, and our bottleneck, the storage writer thread, will
sit idle and hungry for more chunks.

My original solution basically wrote all the zero chunks first and then
started working through the non-zero chunks. This split seems to mostly
avoid the cost difference between zero and non-zero chunks keeping
futures slots occupied. However, I had not considered the memory
consumption of the chunk_futures vector, which might grow very big for
multi-TB VM backups. Still, the idea of keeping track of the cheap
filler zero chunks, which we can always write when no non-zero chunk is
available, is perhaps not so bad, especially for NVMe systems. For hard
drives I imagine the linear strategy might be faster because it should
avoid some expensive seeks.
Would it make sense to have a reasonable buffer of zero chunks ready
for writing while we fetch non-zero chunks over the network?

Is this thought process correct?
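
To make it concrete, here is a rough sketch of the "async-channel +
fixed number of tasks" idea you mentioned, only with bounded tokio
mpsc channels: zero chunks are queued straight to a single writer task
and a fixed pool of fetch tasks supplies the non-zero ones. All names
and types are made up for illustration, this is not the actual patch:

use std::sync::Arc;
use tokio::sync::{mpsc, Mutex};

// chunks still to be restored
enum Job {
    Fetch { offset: u64, digest: [u8; 32] }, // has to be fetched over the network
    Zero { offset: u64, len: u64 },          // known to be all zeros, nothing to fetch
}

// what the single writer task actually writes to the image
enum WriteJob {
    Data { offset: u64, data: Vec<u8> },
    Zero { offset: u64, len: u64 },
}

async fn fetch_chunk(_digest: [u8; 32]) -> Vec<u8> {
    Vec::new() // placeholder for the real network fetch + decode
}

fn write_to_image(_job: &WriteJob) {
    // placeholder for write_data_callback / write_zero_callback
}

async fn restore(jobs: Vec<Job>, fetcher_count: usize) {
    let (write_tx, mut write_rx) = mpsc::channel::<WriteJob>(64);
    let (fetch_tx, fetch_rx) = mpsc::channel::<(u64, [u8; 32])>(64);
    let fetch_rx = Arc::new(Mutex::new(fetch_rx));

    // single writer task: sequential writes, no concurrent access to the image
    let writer = tokio::spawn(async move {
        while let Some(job) = write_rx.recv().await {
            write_to_image(&job);
        }
    });

    // fixed pool of fetch tasks for the non-zero chunks
    let fetchers: Vec<_> = (0..fetcher_count)
        .map(|_| {
            let rx = Arc::clone(&fetch_rx);
            let tx = write_tx.clone();
            tokio::spawn(async move {
                loop {
                    // the lock is only held for the dequeue, not for the fetch
                    let next = rx.lock().await.recv().await;
                    let (offset, digest) = match next {
                        Some(v) => v,
                        None => break,
                    };
                    let data = fetch_chunk(digest).await;
                    if tx.send(WriteJob::Data { offset, data }).await.is_err() {
                        break;
                    }
                }
            })
        })
        .collect();

    // zero chunks go straight to the writer, non-zero ones to the fetch pool,
    // so the writer never waits on the network just to emit cheap zero chunks
    for job in jobs {
        match job {
            Job::Zero { offset, len } => {
                let _ = write_tx.send(WriteJob::Zero { offset, len }).await;
            }
            Job::Fetch { offset, digest } => {
                let _ = fetch_tx.send((offset, digest)).await;
            }
        }
    }

    drop(fetch_tx); // fetchers finish once their queue drains
    drop(write_tx); // writer finishes once the fetchers are done
    for f in fetchers {
        let _ = f.await;
    }
    let _ = writer.await;
}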


Btw. I am not on the PBS list, so to avoid getting stuck in a queue
there I am posting only to PVE devel.

Adam

