[RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
Adam Kalisz
adam.kalisz at notnullmakers.com
Tue Jul 8 12:04:56 CEST 2025
Hi Dominik,
this is a big improvement. I have done some performance measurements
again:
Ryzen:
4 worker threads:
restore image complete (bytes=53687091200, duration=52.06s, speed=983.47MB/s)
8 worker threads:
restore image complete (bytes=53687091200, duration=50.12s, speed=1021.56MB/s)
4 worker threads, 4 max-blocking:
restore image complete (bytes=53687091200, duration=54.00s, speed=948.22MB/s)
8 worker threads, 4 max-blocking:
restore image complete (bytes=53687091200, duration=50.43s, speed=1015.25MB/s)
8 worker threads, 4 max-blocking, 32 buffered futures:
restore image complete (bytes=53687091200, duration=52.11s, speed=982.53MB/s)
Xeon:
4 worker threads:
restore image complete (bytes=10737418240, duration=3.06s, speed=3345.97MB/s)
restore image complete (bytes=107374182400, duration=139.80s, speed=732.47MB/s)
restore image complete (bytes=107374182400, duration=136.67s, speed=749.23MB/s)
8 worker threads:
restore image complete (bytes=10737418240, duration=2.50s, speed=4095.30MB/s)
restore image complete (bytes=107374182400, duration=127.14s, speed=805.42MB/s)
restore image complete (bytes=107374182400, duration=121.39s, speed=843.59MB/s)
Much better, but it would need to be 25% faster on this older system to
hit the numbers I have already seen with my solution.
For comparison, with my solution on the same Xeon system I was hitting:
With 8-way concurrency, 16 max-blocking threads:
restore image complete (bytes=10737418240, avg fetch time=16.7572ms, avg time per nonzero write=1.9310ms, storage nonzero total write time=1.580s, duration=2.25s, speed=4551.25MB/s)
restore image complete (bytes=107374182400, avg fetch time=29.1714ms, avg time per nonzero write=2.2216ms, storage nonzero total write time=55.739s, duration=106.17s, speed=964.52MB/s)
restore image complete (bytes=107374182400, avg fetch time=28.2543ms, avg time per nonzero write=2.1473ms, storage nonzero total write time=54.139s, duration=103.52s, speed=989.18MB/s)
With 16-way concurrency, 32 max-blocking threads:
restore image complete (bytes=10737418240, avg fetch time=25.3444ms, avg time per nonzero write=2.0709ms, storage nonzero total write time=1.694s, duration=2.02s, speed=5074.13MB/s)
restore image complete (bytes=107374182400, avg fetch time=53.3046ms, avg time per nonzero write=2.6692ms, storage nonzero total write time=66.969s, duration=106.65s, speed=960.13MB/s)
restore image complete (bytes=107374182400, avg fetch time=47.3909ms, avg time per nonzero write=2.6352ms, storage nonzero total write time=66.440s, duration=98.09s, speed=1043.95MB/s)
-> this seemed to be the best setting for this system
On the Ryzen system I was hitting:
With 8-way concurrency, 16 max-blocking threads:
restore image complete (bytes=53687091200, avg fetch time=24.7342ms, avg time per nonzero write=1.6474ms, storage nonzero total write time=19.996s, duration=45.83s, speed=1117.15MB/s)
-> this seemed to be the best setting for this system
It seems the counting of zeroes advances in steps of some kind (seen on
the Xeon system with mostly incompressible data):
download and verify backup index
progress 1% (read 1073741824 bytes, zeroes = 0% (0 bytes), duration 1 sec)
progress 2% (read 2147483648 bytes, zeroes = 0% (0 bytes), duration 2 sec)
progress 3% (read 3221225472 bytes, zeroes = 0% (0 bytes), duration 3 sec)
progress 4% (read 4294967296 bytes, zeroes = 0% (0 bytes), duration 5 sec)
progress 5% (read 5368709120 bytes, zeroes = 0% (0 bytes), duration 6 sec)
progress 6% (read 6442450944 bytes, zeroes = 0% (0 bytes), duration 7 sec)
progress 7% (read 7516192768 bytes, zeroes = 0% (0 bytes), duration 8 sec)
progress 8% (read 8589934592 bytes, zeroes = 0% (0 bytes), duration 10 sec)
progress 9% (read 9663676416 bytes, zeroes = 0% (0 bytes), duration 11 sec)
progress 10% (read 10737418240 bytes, zeroes = 0% (0 bytes), duration 12 sec)
progress 11% (read 11811160064 bytes, zeroes = 0% (0 bytes), duration 14 sec)
progress 12% (read 12884901888 bytes, zeroes = 0% (0 bytes), duration 15 sec)
progress 13% (read 13958643712 bytes, zeroes = 0% (0 bytes), duration 16 sec)
progress 14% (read 15032385536 bytes, zeroes = 0% (0 bytes), duration 18 sec)
progress 15% (read 16106127360 bytes, zeroes = 0% (0 bytes), duration 19 sec)
progress 16% (read 17179869184 bytes, zeroes = 0% (0 bytes), duration 20 sec)
progress 17% (read 18253611008 bytes, zeroes = 0% (0 bytes), duration 21 sec)
progress 18% (read 19327352832 bytes, zeroes = 0% (0 bytes), duration 23 sec)
progress 19% (read 20401094656 bytes, zeroes = 0% (0 bytes), duration 24 sec)
progress 20% (read 21474836480 bytes, zeroes = 0% (0 bytes), duration 25 sec)
progress 21% (read 22548578304 bytes, zeroes = 0% (0 bytes), duration 27 sec)
progress 22% (read 23622320128 bytes, zeroes = 0% (0 bytes), duration 28 sec)
progress 23% (read 24696061952 bytes, zeroes = 0% (0 bytes), duration 29 sec)
progress 24% (read 25769803776 bytes, zeroes = 0% (0 bytes), duration 31 sec)
progress 25% (read 26843545600 bytes, zeroes = 1% (515899392 bytes), duration 31 sec)
progress 26% (read 27917287424 bytes, zeroes = 1% (515899392 bytes), duration 33 sec)
Especially during a restore, speed is quite important if you need to
hit Restore Time Objectives under SLAs. That's why we were targeting
1 GB/s for incompressible data.
Thank you
Adam
On Tue, 2025-07-08 at 10:49 +0200, Dominik Csapak wrote:
> Load chunks with async futures and use stream::buffer_unordered to
> buffer up to 16 of them, depending on write/load speed. Use tokio's
> task spawn to make sure they continue to run in the background, since
> buffer_unordered starts the futures but does not poll them to
> completion unless we're awaiting them.
>
> With this, we don't need to increase the number of threads in the
> runtime to trigger parallel reads and network traffic to us. This way
> it's only limited by CPU if decoding and/or decrypting is the
> bottleneck.
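[Illustration only: a minimal, self-contained sketch of the spawn +
buffer_unordered pattern the commit message describes. fetch_chunk and
the chunk count are made-up stand-ins; the real implementation is in
src/restore.rs in the patch below. Needs the tokio, futures and anyhow
crates.]

use std::time::Duration;

use anyhow::Error;
use futures::StreamExt;

// hypothetical stand-in for the real chunk fetch; only the concurrency
// pattern matters here
async fn fetch_chunk(pos: usize) -> Result<Vec<u8>, Error> {
    tokio::time::sleep(Duration::from_millis(10)).await; // pretend network I/O
    Ok(vec![pos as u8; 4096])
}

#[tokio::main]
async fn main() -> Result<(), Error> {
    const MAX_BUFFERED_FUTURES: usize = 16;
    let chunk_count = 100usize;

    // one future per chunk; tokio::task::spawn lets each fetch keep making
    // progress on the runtime even while we are busy writing an earlier
    // chunk, instead of only advancing while the stream itself is polled
    let read_queue = (0..chunk_count).map(|pos| async move {
        let data = tokio::task::spawn(fetch_chunk(pos)).await??;
        Ok::<_, Error>((pos, data))
    });

    // buffer_unordered keeps up to MAX_BUFFERED_FUTURES fetches in flight
    // and yields results in completion order, not submission order
    let mut stream = futures::stream::iter(read_queue)
        .buffer_unordered(MAX_BUFFERED_FUTURES);

    while let Some(res) = stream.next().await {
        let (pos, data) = res?;
        // the real code writes `data` at the offset derived from `pos` here
        println!("fetched chunk {pos} ({} bytes)", data.len());
    }
    Ok(())
}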
>
> I measured restoring a VM backup with a 60GiB disk (filled with ~42GiB
> data) and fast storage over a local network link (from the PBS VM to
> the host). I did 3 runs each, but the variance was not that big, so
> here's some representative log output with various
> MAX_BUFFERED_FUTURES values.
>
> benchmark   duration   speed         cpu percentage
> current     107.18s     573.25MB/s   < 100%
> 4:           44.74s    1373.34MB/s   ~ 180%
> 8:           32.30s    1902.42MB/s   ~ 290%
> 16:          25.75s    2386.44MB/s   ~ 360%
>
> I saw an increase in CPU usage proportional to the speed increase:
> while the current version uses less than a single core in total,
> using 16 parallel futures resulted in 3-4 of the tokio runtime's
> available threads being utilized.
>
> In general I'd like to limit the buffering somehow, but I don't think
> there is a good automatic metric we can use, and giving the admin a
> knob whose actual ramifications are hard to explain is also not good,
> so I settled for a value that showed improvement but does not seem
> too high.
>
> In any case, if the target and/or source storage is too slow, there
> will be back/forward pressure, and this change should only matter for
> storage systems where IO depth plays a role and that are fast enough.
>
> The way we count the finished chunks also changes a bit: since they
> can complete out of order, we can't rely on the index position to
> calculate the percentage.
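[Again for illustration only, a small runnable sketch of why the
percentage has to come from a completion counter rather than the
chunk's index position once buffer_unordered is in play; the delays are
artificial and not from the patch.]

use std::time::Duration;

use futures::StreamExt;

#[tokio::main]
async fn main() {
    let total = 10usize;

    // artificial delays so later chunks finish before earlier ones
    let jobs = (0..total).map(|pos| async move {
        tokio::time::sleep(Duration::from_millis(((total - pos) * 20) as u64)).await;
        pos
    });
    let mut stream = futures::stream::iter(jobs).buffer_unordered(4);

    let mut count = 0;
    while let Some(pos) = stream.next().await {
        count += 1;
        // progress has to come from `count`; `pos` arrives out of order
        println!("finished chunk {pos}, progress {}%", count * 100 / total);
    }
}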
>
> This patch is loosely based on the patch from Adam Kalisz[0], but
> removes the need to increase the blocking threads and uses the
> (actually always used) underlying async implementation for reading
> remote chunks.
>
> 0: https://lore.proxmox.com/pve-devel/mailman.719.1751052794.395.pve-devel@lists.proxmox.com/
>
> Signed-off-by: Dominik Csapak <d.csapak at proxmox.com>
> Based-on-patch-by: Adam Kalisz <adam.kalisz at notnullmakers.com>
> ---
> changes from RFC v1:
> * uses tokio task spawn to actually run the fetching in the background
> * redo the counting for the task output (pos was unordered, so we
>   sometimes got weird ordering)
>
> When actually running the fetching in the background, the speed
> increase is much higher than just using buffer_unordered for the
> fetching futures, which is nice (although the CPU usage is much
> higher now).
>
> Since the benchmark was much faster with higher values, I used a
> different, bigger VM this time around so the timings are more
> consistent and the disk does not fit in the PBS's memory.
>
> The question of what count we should use remains, though...
>
>  src/restore.rs | 63 +++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 47 insertions(+), 16 deletions(-)
>
> diff --git a/src/restore.rs b/src/restore.rs
> index 5a5a398..4e6c538 100644
> --- a/src/restore.rs
> +++ b/src/restore.rs
> @@ -2,6 +2,7 @@ use std::convert::TryInto;
>  use std::sync::{Arc, Mutex};
>  
>  use anyhow::{bail, format_err, Error};
> +use futures::StreamExt;
>  use once_cell::sync::OnceCell;
>  use tokio::runtime::Runtime;
>  
> @@ -13,7 +14,7 @@ use pbs_datastore::cached_chunk_reader::CachedChunkReader;
>  use pbs_datastore::data_blob::DataChunkBuilder;
>  use pbs_datastore::fixed_index::FixedIndexReader;
>  use pbs_datastore::index::IndexFile;
> -use pbs_datastore::read_chunk::ReadChunk;
> +use pbs_datastore::read_chunk::AsyncReadChunk;
>  use pbs_datastore::BackupManifest;
>  use pbs_key_config::load_and_decrypt_key;
>  use pbs_tools::crypt_config::CryptConfig;
> @@ -29,6 +30,9 @@ struct ImageAccessInfo {
>      archive_size: u64,
>  }
>  
> +// use this many buffered futures to make loading of chunks more concurrent
> +const MAX_BUFFERED_FUTURES: usize = 16;
> +
>  pub(crate) struct RestoreTask {
>      setup: BackupSetup,
>      runtime: Arc<Runtime>,
> @@ -165,26 +169,53 @@ impl RestoreTask {
>  
>          let start_time = std::time::Instant::now();
>  
> -        for pos in 0..index.index_count() {
> -            let digest = index.index_digest(pos).unwrap();
> +        let read_queue = (0..index.index_count()).map(|pos| {
> +            let digest = *index.index_digest(pos).unwrap();
>              let offset = (pos * index.chunk_size) as u64;
> -            if digest == &zero_chunk_digest {
> -                let res = write_zero_callback(offset, index.chunk_size as u64);
> -                if res < 0 {
> -                    bail!("write_zero_callback failed ({})", res);
> +            let chunk_reader = chunk_reader.clone();
> +            async move {
> +                let chunk = if digest == zero_chunk_digest {
> +                    None
> +                } else {
> +                    let raw_data = tokio::task::spawn(async move {
> +                        AsyncReadChunk::read_chunk(&chunk_reader, &digest).await
> +                    })
> +                    .await??;
> +                    Some(raw_data)
> +                };
> +
> +                Ok::<_, Error>((chunk, offset))
> +            }
> +        });
> +
> +        // this buffers futures and pre-fetches some chunks for us
> +        let mut stream = futures::stream::iter(read_queue).buffer_unordered(MAX_BUFFERED_FUTURES);
> +
> +        let mut count = 0;
> +        while let Some(res) = stream.next().await {
> +            let res = res?;
> +            match res {
> +                (None, offset) => {
> +                    let res = write_zero_callback(offset, index.chunk_size as u64);
> +                    if res < 0 {
> +                        bail!("write_zero_callback failed ({})", res);
> +                    }
> +                    bytes += index.chunk_size;
> +                    zeroes += index.chunk_size;
>                  }
> -                bytes += index.chunk_size;
> -                zeroes += index.chunk_size;
> -            } else {
> -                let raw_data = ReadChunk::read_chunk(&chunk_reader, digest)?;
> -                let res = write_data_callback(offset, &raw_data);
> -                if res < 0 {
> -                    bail!("write_data_callback failed ({})", res);
> +                (Some(raw_data), offset) => {
> +                    let res = write_data_callback(offset, &raw_data);
> +                    if res < 0 {
> +                        bail!("write_data_callback failed ({})", res);
> +                    }
> +                    bytes += raw_data.len();
>                  }
> -                bytes += raw_data.len();
>              }
> +
> +            count += 1;
> +
>              if verbose {
> -                let next_per = ((pos + 1) * 100) / index.index_count();
> +                let next_per = (count * 100) / index.index_count();
>                  if per != next_per {
>                      eprintln!(
>                          "progress {}% (read {} bytes, zeroes = {}% ({} bytes), duration {} sec)",