[pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
Dominik Csapak
d.csapak at proxmox.com
Thu Jul 10 14:48:23 CEST 2025
On 7/8/25 17:08, Adam Kalisz wrote:
> On Tue, 2025-07-08 at 12:58 +0200, Dominik Csapak wrote:
>> On 7/8/25 12:04, Adam Kalisz wrote:
[snip]
>>>
>>> Especially during a restore the speed is quite important if you
>>> need to hit Restore Time Objectives under SLAs. That's why we were
>>> targeting 1 GBps for incompressible data.
>>
>> I get that, but this will always be a tradeoff between CPU load and
>> throughput and we have to find a good middle ground here.
>
> Sure, I am not disputing that for the default configuration.
>
>> IMO with my current patch, we have a very good improvement already,
>> without increasing the (theoretical) load to the system.
>>
>> It could be OK from my POV to make the number of threads of the
>> runtime configurable e.g. via vzdump.conf. (That's a thing that's
>> easily explainable in the docs for admins)
>
> That would be great, because some users have other priorities depending
> on the operational situation. The nice thing about the ENV override in
> my submission is that if somebody runs PBS_RESTORE_CONCURRENCY=8
> qmrestore ... they can change the priority of the restore ad-hoc e.g.
> if they really need to restore as quickly as possible, they can throw
> more threads at the problem within some reasonable bounds. In some
> cases they do a restore of a VM that isn't running and the resources it
> would be using on the system are available for use.
>
> Again, I agree the defaults should be conservative.
Ok, I'll probably send a new version soon that will make the threads
and futures configurable via env variables and leave the default at
16 futures with 4 threads (which seems to be a good start).
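
Just to sketch what I have in mind; the env variable names below are
placeholders, not the final interface:

    use std::env;

    // Placeholder defaults, matching the values mentioned above.
    const DEFAULT_THREADS: usize = 4;
    const DEFAULT_FUTURES: usize = 16;

    // Read an override from the environment, falling back to the default.
    fn env_or(name: &str, default: usize) -> usize {
        env::var(name)
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(default)
    }

    fn build_runtime() -> std::io::Result<tokio::runtime::Runtime> {
        let threads = env_or("PBS_RESTORE_THREADS", DEFAULT_THREADS);
        // The futures count would be read the same way
        // (env_or("PBS_RESTORE_FUTURES", DEFAULT_FUTURES)) and passed to the
        // buffered/buffer_unordered call that drives the chunk fetches.
        tokio::runtime::Builder::new_multi_thread()
            .worker_threads(threads)
            .enable_all()
            .build()
    }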
>
>> If someone else (@Fabian?) wants to chime in to this discussion,
>> I'd be glad.
>>
>> Also feedback on my code in general would be nice ;)
>> (There are probably better ways to make this concurrent in an
>> async context, e.g. maybe using 'async-channel' + fixed number
>> of tasks ?)
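
(For reference, the 'async-channel + fixed number of tasks' variant could
look roughly like this; just a sketch, the types and the fetch function are
stand-ins:)

    use anyhow::Result;

    // Stand-in types; the real code works on chunk digests and raw chunk data.
    type ChunkDigest = [u8; 32];
    type ChunkData = Vec<u8>;

    async fn fetch_chunk(_digest: ChunkDigest) -> Result<ChunkData> {
        unimplemented!("stand-in for the actual chunk fetch")
    }

    // Spawn a fixed number of fetcher tasks that pull digests from a bounded
    // channel and push the fetched chunks to the writer side.
    async fn run_fetchers(
        requests: async_channel::Receiver<ChunkDigest>,
        results: async_channel::Sender<Result<ChunkData>>,
        workers: usize,
    ) {
        let mut handles = Vec::new();
        for _ in 0..workers {
            let requests = requests.clone();
            let results = results.clone();
            handles.push(tokio::spawn(async move {
                // recv() returns Err once the request channel is closed and drained.
                while let Ok(digest) = requests.recv().await {
                    if results.send(fetch_chunk(digest).await).await.is_err() {
                        break; // writer side hung up
                    }
                }
            }));
        }
        for handle in handles {
            let _ = handle.await;
        }
    }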
>
> Having more timing information about how long the fetches and writes of
> nonzero chunks take would be great for doing informed estimates about
> the effect of performance settings should there be any. It would also
> help with the benchmarking right now to see where we are saturated.
yes, some tracing information would be really nice. I'm currently a bit
pressed for time, but will look into that.
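
Even something very simple would already help as a first step, e.g.
accumulating the time spent in the fetch path and the write path separately
(sketch only, the call sites are assumptions):

    use std::future::Future;
    use std::time::{Duration, Instant};

    // Time a single async operation and add its duration to a running total,
    // e.g. one total for chunk fetches and one for the write callbacks.
    async fn timed<T, F: Future<Output = T>>(total: &mut Duration, fut: F) -> T {
        let start = Instant::now();
        let result = fut.await;
        *total += start.elapsed();
        result
    }

    // usage (function name assumed):
    // let chunk = timed(&mut fetch_time, fetch_chunk(digest)).await;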
>
> Do I understand correctly that we are blocking a worker when writing a
> chunk to storage with the write_data_callback or write_zero_callback
> from fetching chunks?
yes. Technically we should call something like 'spawn_blocking' or
'block_in_place' so the executor won't get stuck, but in most cases it
will work out just fine.
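
The escape hatch would be roughly this (block_in_place needs the
multi-threaded runtime; on a current-thread runtime spawn_blocking would be
required instead):

    // Run the synchronous write callback without stalling the async executor.
    // block_in_place lets the runtime move the other tasks of this worker
    // thread elsewhere before running the closure.
    fn write_without_stalling<R>(write: impl FnOnce() -> R) -> R {
        tokio::task::block_in_place(write)
    }

    // e.g.: write_without_stalling(|| write_data_callback(/* offset, data */))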
>
> From what I gathered, we have to have a single thread that writes the
> VM image because otherwise we would have problems with concurrent
> access.
Currently it's that way because the default behavior of Rust is
to prevent void pointers from being used across threads (for good reason!).
Though I think in this case the callback is static, so we should be able
to call it from multiple threads (assuming the qemu layer can handle that).
For that I have to rewrite some parts of the qemu-side integration too,
so that it does not modify the callback data in the C callback itself
(which is not a problem).
Then I can test if writing in multiple threads makes a difference.
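
The usual pattern for that would be a tiny wrapper around the opaque callback
data that asserts Send/Sync, which is only sound under exactly that
assumption (the qemu side tolerating calls from multiple threads):

    use std::os::raw::c_void;

    // Wrapper around the opaque C callback data so it can be shared across
    // threads. The unsafe impls are only sound if the qemu callback does not
    // mutate this data and can be called concurrently from any thread.
    struct CallbackData(*mut c_void);

    unsafe impl Send for CallbackData {}
    unsafe impl Sync for CallbackData {}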
> We need to feed this thread as well as possible with a mix of zero
> chunks and non-zero chunks. The zero chunks are cheap because we
> generate them based on information we already have in memory. The non-
> zero chunks we have to fetch over the network (or from cache, which is
> still more expensive than zero chunks). If I understand correctly, if
> we fill the read queue with non-zero chunks and wait for them to become
> available we will not be writing any possible zero chunks that come
> after it to storage and our bottleneck storage writer thread will be
> idle and hungry for more chunks.
>
> My original solution basically wrote all the zero chunks first and then
> started working through the non-zero chunks. This split seems to mostly
> avoid the cost difference between zero and non-zero chunks keeping
> futures slots occupied. However I have not considered the memory
> consumption of the chunk_futures vector which might grow very big for
> multi TB VM backups. However the idea of knowing about the cheap filler
> zero chunks we might always write if we don't have any non-zero chunks
> available is perhaps not so bad, especially for NVMe systems. For
> hard drives I imagine the linear strategy might be faster because it
> should, in my imagination, avoid some expensive seeks.
> Would it make sense to have a reasonable buffer of zero chunks ready
> for writing while we fetch non-zero chunks over the network?
>
> Is this thought process correct?
I think so. The issue with such optimizations is IMO that no matter
which case we optimize for, some other case will suffer for it.
Since the systems out there are very diverse, I'd like to not optimize
that at all and simply write the chunks in the order we get them from the index.
There are also other settings that play a role here, e.g. the target storage.
If that is configured to be able to skip zeros, the zero write callback
will actually do nothing at all (e.g. when the target is a file-based
storage with qcow2 files).
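
In other words, keep walking the index in order and only overlap the network
fetches; roughly like this (sketch with stand-in helpers, writes stay
sequential and in index order):

    use anyhow::Result;
    use futures::stream::{self, StreamExt, TryStreamExt};

    // Stand-ins for the actual index entries and callbacks.
    enum IndexEntry {
        Zero { offset: u64, len: u64 },
        Data { offset: u64, digest: [u8; 32] },
    }
    async fn fetch_chunk(_digest: [u8; 32]) -> Result<Vec<u8>> { unimplemented!() }
    fn write_data(_offset: u64, _data: &[u8]) -> Result<()> { unimplemented!() }
    fn write_zeros(_offset: u64, _len: u64) -> Result<()> { unimplemented!() }

    async fn restore_in_index_order(entries: Vec<IndexEntry>, concurrency: usize) -> Result<()> {
        stream::iter(entries)
            .map(|entry| async move {
                match entry {
                    // Zero chunks need no fetch at all.
                    IndexEntry::Zero { offset, len } => Ok((offset, None, len)),
                    // Data chunks are fetched concurrently, up to `concurrency` at a time.
                    IndexEntry::Data { offset, digest } => {
                        let data = fetch_chunk(digest).await?;
                        let len = data.len() as u64;
                        Ok((offset, Some(data), len))
                    }
                }
            })
            .buffered(concurrency) // yields results in index order
            .try_for_each(|(offset, data, len)| async move {
                // Writes happen here, one after the other, in index order;
                // the zero write may be a no-op if the target can skip zeros.
                match data {
                    Some(data) => write_data(offset, &data),
                    None => write_zeros(offset, len),
                }
            })
            .await
    }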
Just for the record, I also benchmarked a slower system here:
6x 16 TiB spinners in RAID-10 with NVMe special devices,
over a 2.5 Gbit link:
the current approach is ~61 MiB/s restore speed,
with my patch it's ~160 MiB/s restore speed, with not much increase
in CPU time (both were under 30% of a single core).
I also ran perf stat for both restores to compare how much overhead the
additional futures/async/await add:
first restore:
         62,871.24 msec task-clock                       # 0.115 CPUs utilized
           878,151      context-switches                 # 13.967 K/sec
            28,205      cpu-migrations                   # 448.615 /sec
           519,396      page-faults                      # 8.261 K/sec
   277,239,999,474      cpu_core/cycles/                 # 4.410 G/sec    (89.20%)
   190,782,860,504      cpu_atom/cycles/                 # 3.035 G/sec    (10.80%)
   482,534,267,606      cpu_core/instructions/           # 7.675 G/sec    (89.20%)
   188,659,352,613      cpu_atom/instructions/           # 3.001 G/sec    (10.80%)
    46,913,925,346      cpu_core/branches/               # 746.191 M/sec  (89.20%)
    19,251,496,445      cpu_atom/branches/               # 306.205 M/sec  (10.80%)
       904,032,529      cpu_core/branch-misses/          # 14.379 M/sec   (89.20%)
       621,228,739      cpu_atom/branch-misses/          # 9.881 M/sec    (10.80%)
 1,633,142,624,469      cpu_core/slots/                  # 25.976 G/sec   (89.20%)
   489,311,603,992      cpu_core/topdown-retiring/       # 29.7% Retiring           (89.20%)
    97,617,585,755      cpu_core/topdown-bad-spec/       # 5.9% Bad Speculation     (89.20%)
   317,074,236,582      cpu_core/topdown-fe-bound/       # 19.2% Frontend Bound     (89.20%)
   745,485,954,022      cpu_core/topdown-be-bound/       # 45.2% Backend Bound      (89.20%)
    57,463,995,650      cpu_core/topdown-heavy-ops/      # 3.5% Heavy Operations    # 26.2% Light Operations  (89.20%)
    88,333,173,745      cpu_core/topdown-br-mispredict/  # 5.4% Branch Mispredict   # 0.6% Machine Clears     (89.20%)
   217,424,427,912      cpu_core/topdown-fetch-lat/      # 13.2% Fetch Latency      # 6.0% Fetch Bandwidth    (89.20%)
   354,250,103,398      cpu_core/topdown-mem-bound/      # 21.5% Memory Bound       # 23.7% Core Bound        (89.20%)

     548.195368256 seconds time elapsed

      44.493218000 seconds user
      21.315124000 seconds sys
second restore:
         67,908.11 msec task-clock                       # 0.297 CPUs utilized
           856,402      context-switches                 # 12.611 K/sec
            46,539      cpu-migrations                   # 685.323 /sec
           942,002      page-faults                      # 13.872 K/sec
   300,757,558,837      cpu_core/cycles/                 # 4.429 G/sec    (75.93%)
   234,595,451,063      cpu_atom/cycles/                 # 3.455 G/sec    (24.07%)
   511,747,593,432      cpu_core/instructions/           # 7.536 G/sec    (75.93%)
   289,348,171,298      cpu_atom/instructions/           # 4.261 G/sec    (24.07%)
    49,993,266,992      cpu_core/branches/               # 736.190 M/sec  (75.93%)
    29,624,743,216      cpu_atom/branches/               # 436.248 M/sec  (24.07%)
       911,770,988      cpu_core/branch-misses/          # 13.427 M/sec   (75.93%)
       811,321,806      cpu_atom/branch-misses/          # 11.947 M/sec   (24.07%)
 1,788,660,631,633      cpu_core/slots/                  # 26.339 G/sec   (75.93%)
   569,029,214,725      cpu_core/topdown-retiring/       # 31.4% Retiring           (75.93%)
   125,815,987,213      cpu_core/topdown-bad-spec/       # 6.9% Bad Speculation     (75.93%)
   234,249,755,030      cpu_core/topdown-fe-bound/       # 12.9% Frontend Bound     (75.93%)
   885,539,445,254      cpu_core/topdown-be-bound/       # 48.8% Backend Bound      (75.93%)
    86,825,030,719      cpu_core/topdown-heavy-ops/      # 4.8% Heavy Operations    # 26.6% Light Operations  (75.93%)
   116,566,866,551      cpu_core/topdown-br-mispredict/  # 6.4% Branch Mispredict   # 0.5% Machine Clears     (75.93%)
   135,276,276,904      cpu_core/topdown-fetch-lat/      # 7.5% Fetch Latency       # 5.5% Fetch Bandwidth    (75.93%)
   409,898,741,185      cpu_core/topdown-mem-bound/      # 22.6% Memory Bound       # 26.2% Core Bound        (75.93%)

     228.528573197 seconds time elapsed

      48.379229000 seconds user
      21.779166000 seconds sys
So the overhead for the additional futures was ~8% in cycles and ~6% in
instructions, which does not seem too bad.
>
>
> Btw. I am not on the PBS list, so to avoid getting stuck in a queue
> there I am posting only to PVE devel.
>
> Adam