[pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
Dominik Csapak
d.csapak at proxmox.com
Fri Jul 11 10:21:45 CEST 2025
On 7/10/25 14:48, Dominik Csapak wrote:
[snip]
>
> Just for the record, I also benchmarked a slower system here:
> 6x 16 TiB spinners in RAID-10 with NVMe special devices,
> over a 2.5 Gbit/s link:
>
> current approach: ~61 MiB/s restore speed
> with my patch: ~160 MiB/s restore speed, with not much increase
> in CPU time (both were under 30% of a single core)
>
> I also ran perf stat on both to compare how much overhead the additional
> futures/async/await machinery brings:
>
>
> first restore:
>
> 62,871.24 msec task-clock # 0.115 CPUs utilized
> 878,151 context-switches # 13.967 K/sec
> 28,205 cpu-migrations # 448.615 /sec
> 519,396 page-faults # 8.261 K/sec
> 277,239,999,474 cpu_core/cycles/ # 4.410 G/sec (89.20%)
> 190,782,860,504 cpu_atom/cycles/ # 3.035 G/sec (10.80%)
> 482,534,267,606 cpu_core/instructions/ # 7.675 G/sec (89.20%)
> 188,659,352,613 cpu_atom/instructions/ # 3.001 G/sec (10.80%)
> 46,913,925,346 cpu_core/branches/ # 746.191 M/sec (89.20%)
> 19,251,496,445 cpu_atom/branches/ # 306.205 M/sec (10.80%)
> 904,032,529 cpu_core/branch-misses/ # 14.379 M/sec (89.20%)
> 621,228,739 cpu_atom/branch-misses/ # 9.881 M/sec (10.80%)
> 1,633,142,624,469 cpu_core/slots/ # 25.976 G/sec (89.20%)
> 489,311,603,992 cpu_core/topdown-retiring/ # 29.7% Retiring (89.20%)
> 97,617,585,755 cpu_core/topdown-bad-spec/ # 5.9% Bad Speculation (89.20%)
> 317,074,236,582 cpu_core/topdown-fe-bound/ # 19.2% Frontend Bound (89.20%)
> 745,485,954,022 cpu_core/topdown-be-bound/ # 45.2% Backend Bound (89.20%)
> 57,463,995,650 cpu_core/topdown-heavy-ops/ # 3.5% Heavy Operations # 26.2% Light Operations (89.20%)
> 88,333,173,745 cpu_core/topdown-br-mispredict/ # 5.4% Branch Mispredict # 0.6% Machine Clears (89.20%)
> 217,424,427,912 cpu_core/topdown-fetch-lat/ # 13.2% Fetch Latency # 6.0% Fetch Bandwidth (89.20%)
> 354,250,103,398 cpu_core/topdown-mem-bound/ # 21.5% Memory Bound # 23.7% Core Bound (89.20%)
>
>
> 548.195368256 seconds time elapsed
>
>
> 44.493218000 seconds user
> 21.315124000 seconds sys
>
> second restore:
>
> 67,908.11 msec task-clock # 0.297 CPUs utilized
> 856,402 context-switches # 12.611 K/sec
> 46,539 cpu-migrations # 685.323 /sec
> 942,002 page-faults # 13.872 K/sec
> 300,757,558,837 cpu_core/cycles/ # 4.429 G/sec (75.93%)
> 234,595,451,063 cpu_atom/cycles/ # 3.455 G/sec (24.07%)
> 511,747,593,432 cpu_core/instructions/ # 7.536 G/sec (75.93%)
> 289,348,171,298 cpu_atom/instructions/ # 4.261 G/sec (24.07%)
> 49,993,266,992 cpu_core/branches/ # 736.190 M/sec (75.93%)
> 29,624,743,216 cpu_atom/branches/ # 436.248 M/sec (24.07%)
> 911,770,988 cpu_core/branch-misses/ # 13.427 M/sec (75.93%)
> 811,321,806 cpu_atom/branch-misses/ # 11.947 M/sec (24.07%)
> 1,788,660,631,633 cpu_core/slots/ # 26.339 G/sec (75.93%)
> 569,029,214,725 cpu_core/topdown-retiring/ # 31.4% Retiring (75.93%)
> 125,815,987,213 cpu_core/topdown-bad-spec/ # 6.9% Bad Speculation (75.93%)
> 234,249,755,030 cpu_core/topdown-fe-bound/ # 12.9% Frontend Bound (75.93%)
> 885,539,445,254 cpu_core/topdown-be-bound/ # 48.8% Backend Bound (75.93%)
> 86,825,030,719 cpu_core/topdown-heavy-ops/ # 4.8% Heavy Operations # 26.6% Light Operations (75.93%)
> 116,566,866,551 cpu_core/topdown-br-mispredict/ # 6.4% Branch Mispredict # 0.5% Machine Clears (75.93%)
> 135,276,276,904 cpu_core/topdown-fetch-lat/ # 7.5% Fetch Latency # 5.5% Fetch Bandwidth (75.93%)
> 409,898,741,185 cpu_core/topdown-mem-bound/ # 22.6% Memory Bound # 26.2% Core Bound (75.93%)
>
>
> 228.528573197 seconds time elapsed
>
>
> 48.379229000 seconds user
> 21.779166000 seconds sys
>
>
> So the overhead of the additional futures was ~8% in cycles and ~6% in
> instructions, which does not seem too bad.
>
addendum:
The tests above sadly ran into a network limit of ~600 Mbit/s (I am still
trying to figure out where the bottleneck in the network is...).
I tested again from a different machine that has a 10G link to the PBS mentioned above.
This time I restored to QEMU's 'null-co' driver, since the target storage was too slow.
Anyway, the results are:
current code: restore ~75 MiB/s
16-way parallel: ~528 MiB/s (7x!)
CPU usage went up from <50% of one core to ~350% (as in my initial tests with a different setup).
perf stat output below:
current:
183,534.85 msec task-clock # 0.409 CPUs utilized
117,267 context-switches # 638.936 /sec
700 cpu-migrations # 3.814 /sec
462,432 page-faults # 2.520 K/sec
468,609,612,840 cycles # 2.553 GHz
1,286,188,699,253 instructions # 2.74 insn per cycle
41,342,312,275 branches # 225.256 M/sec
846,432,249 branch-misses # 2.05% of all branches
448.965517535 seconds time elapsed
152.007611000 seconds user
32.189942000 seconds sys
16-way parallel:
228,583.26 msec task-clock # 3.545 CPUs utilized
114,575 context-switches # 501.240 /sec
6,028 cpu-migrations # 26.371 /sec
1,561,179 page-faults # 6.830 K/sec
510,861,534,387 cycles # 2.235 GHz
1,296,819,542,686 instructions # 2.54 insn per cycle
43,202,234,699 branches # 189.000 M/sec
828,196,795 branch-misses # 1.92% of all branches
64.482868654 seconds time elapsed
184.172759000 seconds user
44.560342000 seconds sys
So: still ~8% more cycles and about the same number of instructions, but in much less time.