[pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
Dominik Csapak
d.csapak at proxmox.com
Thu Jul 10 14:48:23 CEST 2025
On 7/8/25 17:08, Adam Kalisz wrote:
> On Tue, 2025-07-08 at 12:58 +0200, Dominik Csapak wrote:
>> On 7/8/25 12:04, Adam Kalisz wrote:
[snip]
>>>
>>> Especially during a restore the speed is quite important if you
>>> need to hit Restore Time Objectives under SLAs. That's why we were
>>> targeting 1 GBps for incompressible data.
>>
>> I get that, but this will always be a tradeoff between CPU load and
>> throughput and we have to find a good middle ground here.
>
> Sure, I am not disputing that for the default configuration.
>
>> IMO with my current patch, we have a very good improvement already,
>> without increasing the (theoretical) load to the system.
>>
>> It could be OK from my POV to make the number of threads of the
>> runtime configurable e.g. via vzdump.conf. (That's a thing that's
>> easily explainable in the docs for admins)
>
> That would be great, because some users have other priorities depending
> on the operational situation. The nice thing about the ENV override in
> my submission is that if somebody runs PBS_RESTORE_CONCURRENCY=8
> qmrestore ... they can change the priority of the restore ad-hoc e.g.
> if they really need to restore as quickly as possible, they can throw
> more threads at the problem within some reasonable bounds. In some
> cases they do a restore of a VM that isn't running and the resources it
> would be using on the system are available for use.
>
> Again, I agree the defaults should be conservative.
Ok, I'll probably send a new version soon that will make the threads
and futures configurable via env variables and leave the default at
16 futures with 4 threads (which seems to be a good start).
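
Just to sketch what I have in mind; the env variable names below are
placeholders, not the final interface:

    use std::env;

    // Placeholder defaults, matching the values mentioned above.
    const DEFAULT_THREADS: usize = 4;
    const DEFAULT_FUTURES: usize = 16;

    // Read an override from the environment, falling back to the default.
    fn env_or(name: &str, default: usize) -> usize {
        env::var(name)
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(default)
    }

    fn build_runtime() -> std::io::Result<tokio::runtime::Runtime> {
        let threads = env_or("PBS_RESTORE_THREADS", DEFAULT_THREADS);
        // The futures count would be read the same way
        // (env_or("PBS_RESTORE_FUTURES", DEFAULT_FUTURES)) and passed to the
        // buffered/buffer_unordered call that drives the chunk fetches.
        tokio::runtime::Builder::new_multi_thread()
            .worker_threads(threads)
            .enable_all()
            .build()
    }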
>
>> If someone else (@Fabian?) wants to chime in to this discussion,
>> I'd be glad.
>>
>> Also feedback on my code in general would be nice ;)
>> (There are probably better ways to make this concurrent in an
>> async context, e.g. maybe using 'async-channel' + fixed number
>> of tasks ?)
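
(For reference, the 'async-channel + fixed number of tasks' variant could
look roughly like this; just a sketch, the types and the fetch function are
stand-ins:)

    use anyhow::Result;

    // Stand-in types; the real code works on chunk digests and raw chunk data.
    type ChunkDigest = [u8; 32];
    type ChunkData = Vec<u8>;

    async fn fetch_chunk(_digest: ChunkDigest) -> Result<ChunkData> {
        unimplemented!("stand-in for the actual chunk fetch")
    }

    // Spawn a fixed number of fetcher tasks that pull digests from a bounded
    // channel and push the fetched chunks to the writer side.
    async fn run_fetchers(
        requests: async_channel::Receiver<ChunkDigest>,
        results: async_channel::Sender<Result<ChunkData>>,
        workers: usize,
    ) {
        let mut handles = Vec::new();
        for _ in 0..workers {
            let requests = requests.clone();
            let results = results.clone();
            handles.push(tokio::spawn(async move {
                // recv() returns Err once the request channel is closed and drained.
                while let Ok(digest) = requests.recv().await {
                    if results.send(fetch_chunk(digest).await).await.is_err() {
                        break; // writer side hung up
                    }
                }
            }));
        }
        for handle in handles {
            let _ = handle.await;
        }
    }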
>
> Having more timing information about how long the fetches and writes of
> nonzero chunks take would be great for doing informed estimates about
> the effect of performance settings should there be any. It would also
> help with the benchmarking right now to see where we are saturated.
yes, some tracing information would be really nice. I'm currently a bit
pressed for time, but will look into that.
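
Even something very simple would already help as a first step, e.g.
accumulating the time spent in the fetch path and the write path separately
(sketch only, the call sites are assumptions):

    use std::future::Future;
    use std::time::{Duration, Instant};

    // Time a single async operation and add its duration to a running total,
    // e.g. one total for chunk fetches and one for the write callbacks.
    async fn timed<T, F: Future<Output = T>>(total: &mut Duration, fut: F) -> T {
        let start = Instant::now();
        let result = fut.await;
        *total += start.elapsed();
        result
    }

    // usage (function name assumed):
    // let chunk = timed(&mut fetch_time, fetch_chunk(digest)).await;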
>
> Do I understand correctly that we are blocking a worker when writing a
> chunk to storage with the write_data_callback or write_zero_callback
> from fetching chunks?
yes. Technically we should call something like 'spawn_blocking' or
'block_in_place' so the executor won't get stuck, but in most cases it
will work out just fine.
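
The escape hatch would be roughly this (block_in_place needs the
multi-threaded runtime; on a current-thread runtime spawn_blocking would be
required instead):

    // Run the synchronous write callback without stalling the async executor.
    // block_in_place lets the runtime move the other tasks of this worker
    // thread elsewhere before running the closure.
    fn write_without_stalling<R>(write: impl FnOnce() -> R) -> R {
        tokio::task::block_in_place(write)
    }

    // e.g.: write_without_stalling(|| write_data_callback(/* offset, data */))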
>
> From what I gathered, we have to have a single thread that writes the
> VM image because otherwise we would have problems with concurrent
> access.
Currently it's that way because the default behavior of Rust is
to prevent void pointers from being used across threads (for good reason!).
Though I think in this case the callback is static, so we should be able
to call it from multiple threads (assuming the qemu layer can handle that).
For that I have to rewrite some parts of the qemu-side integration too,
so that it does not modify the callback data in the C callback itself
(which is not a problem).
Then I can test if writing in multiple threads makes a difference.
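
The usual pattern for that would be a tiny wrapper around the opaque callback
data that asserts Send/Sync, which is only sound under exactly that
assumption (the qemu side tolerating calls from multiple threads):

    use std::os::raw::c_void;

    // Wrapper around the opaque C callback data so it can be shared across
    // threads. The unsafe impls are only sound if the qemu callback does not
    // mutate this data and can be called concurrently from any thread.
    struct CallbackData(*mut c_void);

    unsafe impl Send for CallbackData {}
    unsafe impl Sync for CallbackData {}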
> We need to feed this thread as well as possible with a mix of zero
> chunks and non-zero chunks. The zero chunks are cheap because we
> generate them based on information we already have in memory. The non-
> zero chunks we have to fetch over the network (or from cache, which is
> still more expensive than zero chunks). If I understand correctly, if
> we fill the read queue with non-zero chunks and wait for them to become
> available we will not be writing any possible zero chunks that come
> after it to storage and our bottleneck storage writer thread will be
> idle and hungry for more chunks.
>
> My original solution basically wrote all the zero chunks first and then
> started working through the non-zero chunks. This split seems to mostly
> avoid the cost difference between zero and non-zero chunks keeping
> futures slots occupied. However I have not considered the memory
> consumption of the chunk_futures vector which might grow very big for
> multi TB VM backups. However the idea of knowing about the cheap filler
> zero chunks we might always write if we don't have any non-zero chunks
> available is perhaps not so bad, especially for NVMe systems. For
> hard drives I imagine the linear strategy might be faster because it
> should, in my imagination, avoid some expensive seeks.
> Would it make sense to have a reasonable buffer of zero chunks ready
> for writing while we fetch non-zero chunks over the network?
>
> Is this thought process correct?
I think so. The issue with such optimizations is IMO that no matter
which case we optimize for, some other case will suffer for it.
Since the systems out there are very diverse, I'd like to not optimize
that at all and simply write the chunks in the order we get them from the index.
There are also other settings that play a role here, e.g. the target storage.
If that is configured to be able to skip zeros, the zero write callback
will actually do nothing at all (e.g. when the target is a file-based
storage with qcow2 files).
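
In other words, keep walking the index in order and only overlap the network
fetches; roughly like this (sketch with stand-in helpers, writes stay
sequential and in index order):

    use anyhow::Result;
    use futures::stream::{self, StreamExt, TryStreamExt};

    // Stand-ins for the actual index entries and callbacks.
    enum IndexEntry {
        Zero { offset: u64, len: u64 },
        Data { offset: u64, digest: [u8; 32] },
    }
    async fn fetch_chunk(_digest: [u8; 32]) -> Result<Vec<u8>> { unimplemented!() }
    fn write_data(_offset: u64, _data: &[u8]) -> Result<()> { unimplemented!() }
    fn write_zeros(_offset: u64, _len: u64) -> Result<()> { unimplemented!() }

    async fn restore_in_index_order(entries: Vec<IndexEntry>, concurrency: usize) -> Result<()> {
        stream::iter(entries)
            .map(|entry| async move {
                match entry {
                    // Zero chunks need no fetch at all.
                    IndexEntry::Zero { offset, len } => Ok((offset, None, len)),
                    // Data chunks are fetched concurrently, up to `concurrency` at a time.
                    IndexEntry::Data { offset, digest } => {
                        let data = fetch_chunk(digest).await?;
                        let len = data.len() as u64;
                        Ok((offset, Some(data), len))
                    }
                }
            })
            .buffered(concurrency) // yields results in index order
            .try_for_each(|(offset, data, len)| async move {
                // Writes happen here, one after the other, in index order;
                // the zero write may be a no-op if the target can skip zeros.
                match data {
                    Some(data) => write_data(offset, &data),
                    None => write_zeros(offset, len),
                }
            })
            .await
    }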
Just for the record, I also benchmarked a slower system here:
6x 16 TiB spinners in RAID-10 with NVMe special devices,
over a 2.5 Gbit link:
the current approach is ~61 MiB/s restore speed,
with my patch it's ~160 MiB/s restore speed, with not much increase
in CPU time (both were under 30% of a single core).
I also ran perf stat for both restores to compare how much overhead the
additional futures/async/await add:
first restore:
         62,871.24 msec task-clock                       # 0.115 CPUs utilized
           878,151      context-switches                 # 13.967 K/sec
            28,205      cpu-migrations                   # 448.615 /sec
           519,396      page-faults                      # 8.261 K/sec
   277,239,999,474      cpu_core/cycles/                 # 4.410 G/sec    (89.20%)
   190,782,860,504      cpu_atom/cycles/                 # 3.035 G/sec    (10.80%)
   482,534,267,606      cpu_core/instructions/           # 7.675 G/sec    (89.20%)
   188,659,352,613      cpu_atom/instructions/           # 3.001 G/sec    (10.80%)
    46,913,925,346      cpu_core/branches/               # 746.191 M/sec  (89.20%)
    19,251,496,445      cpu_atom/branches/               # 306.205 M/sec  (10.80%)
       904,032,529      cpu_core/branch-misses/          # 14.379 M/sec   (89.20%)
       621,228,739      cpu_atom/branch-misses/          # 9.881 M/sec    (10.80%)
 1,633,142,624,469      cpu_core/slots/                  # 25.976 G/sec   (89.20%)
   489,311,603,992      cpu_core/topdown-retiring/       # 29.7% Retiring           (89.20%)
    97,617,585,755      cpu_core/topdown-bad-spec/       # 5.9% Bad Speculation     (89.20%)
   317,074,236,582      cpu_core/topdown-fe-bound/       # 19.2% Frontend Bound     (89.20%)
   745,485,954,022      cpu_core/topdown-be-bound/       # 45.2% Backend Bound      (89.20%)
    57,463,995,650      cpu_core/topdown-heavy-ops/      # 3.5% Heavy Operations    # 26.2% Light Operations  (89.20%)
    88,333,173,745      cpu_core/topdown-br-mispredict/  # 5.4% Branch Mispredict   # 0.6% Machine Clears     (89.20%)
   217,424,427,912      cpu_core/topdown-fetch-lat/      # 13.2% Fetch Latency      # 6.0% Fetch Bandwidth    (89.20%)
   354,250,103,398      cpu_core/topdown-mem-bound/      # 21.5% Memory Bound       # 23.7% Core Bound        (89.20%)

     548.195368256 seconds time elapsed

      44.493218000 seconds user
      21.315124000 seconds sys
second restore:
         67,908.11 msec task-clock                       # 0.297 CPUs utilized
           856,402      context-switches                 # 12.611 K/sec
            46,539      cpu-migrations                   # 685.323 /sec
           942,002      page-faults                      # 13.872 K/sec
   300,757,558,837      cpu_core/cycles/                 # 4.429 G/sec    (75.93%)
   234,595,451,063      cpu_atom/cycles/                 # 3.455 G/sec    (24.07%)
   511,747,593,432      cpu_core/instructions/           # 7.536 G/sec    (75.93%)
   289,348,171,298      cpu_atom/instructions/           # 4.261 G/sec    (24.07%)
    49,993,266,992      cpu_core/branches/               # 736.190 M/sec  (75.93%)
    29,624,743,216      cpu_atom/branches/               # 436.248 M/sec  (24.07%)
       911,770,988      cpu_core/branch-misses/          # 13.427 M/sec   (75.93%)
       811,321,806      cpu_atom/branch-misses/          # 11.947 M/sec   (24.07%)
 1,788,660,631,633      cpu_core/slots/                  # 26.339 G/sec   (75.93%)
   569,029,214,725      cpu_core/topdown-retiring/       # 31.4% Retiring           (75.93%)
   125,815,987,213      cpu_core/topdown-bad-spec/       # 6.9% Bad Speculation     (75.93%)
   234,249,755,030      cpu_core/topdown-fe-bound/       # 12.9% Frontend Bound     (75.93%)
   885,539,445,254      cpu_core/topdown-be-bound/       # 48.8% Backend Bound      (75.93%)
    86,825,030,719      cpu_core/topdown-heavy-ops/      # 4.8% Heavy Operations    # 26.6% Light Operations  (75.93%)
   116,566,866,551      cpu_core/topdown-br-mispredict/  # 6.4% Branch Mispredict   # 0.5% Machine Clears     (75.93%)
   135,276,276,904      cpu_core/topdown-fetch-lat/      # 7.5% Fetch Latency       # 5.5% Fetch Bandwidth    (75.93%)
   409,898,741,185      cpu_core/topdown-mem-bound/      # 22.6% Memory Bound       # 26.2% Core Bound        (75.93%)

     228.528573197 seconds time elapsed

      48.379229000 seconds user
      21.779166000 seconds sys
So the overhead for the additional futures was ~8% in cycles and ~6% in
instructions, which does not seem too bad.
>
>
> Btw. I am not on the PBS list, so to avoid getting stuck in a queue
> there I am posting only to PVE devel.
>
> Adam