[RFC PATCH proxmox-backup-qemu] restore: make chunk loading more parallel

Adam Kalisz adam.kalisz at notnullmakers.com
Mon Jul 7 21:49:09 CEST 2025


Hi Dominik,

This seems like a more elegant solution. In my testing on the Ryzen system:

progress 100% (read 53624176640 bytes, zeroes = 5% (2776629248 bytes), duration 52 sec)
progress 96% (read 53628370944 bytes, zeroes = 5% (2776629248 bytes), duration 52 sec)
progress 97% (read 53645148160 bytes, zeroes = 5% (2776629248 bytes), duration 52 sec)
progress 96% (read 53657731072 bytes, zeroes = 5% (2776629248 bytes), duration 52 sec)
progress 97% (read 53666119680 bytes, zeroes = 5% (2776629248 bytes), duration 52 sec)
progress 96% (read 53678702592 bytes, zeroes = 5% (2776629248 bytes), duration 52 sec)
progress 97% (read 53682896896 bytes, zeroes = 5% (2776629248 bytes), duration 52 sec)
restore image complete (bytes=53687091200, duration=52.35s, speed=977.94MB/s)

As you can see, the speed is still much better than the original, but
lower than with my solution, and the progress reporting shows the lack
of proper ordering.

This is with 8 futures:

progress 97% (read 52076478464 bytes, zeroes = 2% (1199570944 bytes), duration 57 sec)
progress 96% (read 52080672768 bytes, zeroes = 2% (1199570944 bytes), duration 57 sec)
progress 97% (read 52084867072 bytes, zeroes = 2% (1203765248 bytes), duration 57 sec)
progress 98% (read 52583989248 bytes, zeroes = 3% (1702887424 bytes), duration 57 sec)
progress 99% (read 53120860160 bytes, zeroes = 4% (2239758336 bytes), duration 57 sec)
progress 100% (read 53657731072 bytes, zeroes = 5% (2776629248 bytes), duration 57 sec)
progress 97% (read 53661925376 bytes, zeroes = 5% (2776629248 bytes), duration 57 sec)
restore image complete (bytes=53687091200, duration=57.29s, speed=893.66MB/s)

With 32 futures I got 984 MB/s, so I guess that is the limit.

The Ryzen system does have checksum acceleration using SHA NI.

What I have noticed is that in my solution the zero chunks seem to get
written very early, while the restore process is still fetching other
chunks, which greatly speeds up images that are mostly zeroes.
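
Roughly what I mean, as a minimal standalone sketch (simplified digest
list and a dummy write_zero_callback, not the real restore code): zero
chunks are written immediately, only data chunks are queued for the
slower fetch stage.

const CHUNK_SIZE: u64 = 4 * 1024 * 1024;

// stand-in for the real callback that writes/discards a zero chunk
fn write_zero_callback(offset: u64, len: u64) {
    println!("zeroes written at offset {offset} (+{len} bytes)");
}

fn main() {
    // None marks a zero chunk, Some(id) a chunk that has to be fetched
    let digests: Vec<Option<u64>> = vec![None, Some(1), None, Some(2)];

    let mut fetch_queue = Vec::new();
    for (pos, digest) in digests.iter().enumerate() {
        let offset = pos as u64 * CHUNK_SIZE;
        match digest {
            // zero chunk: handled right away, no network round trip
            None => write_zero_callback(offset, CHUNK_SIZE),
            // data chunk: deferred to the concurrent fetch/write stage
            Some(id) => fetch_queue.push((offset, *id)),
        }
    }
    println!("{} data chunks still to fetch", fetch_queue.len());
}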

This zero-heavy case shows up on the Xeon system, where I restore 3 drives:

progress 80% (read 10716446720 bytes, zeroes = 69% (7415529472 bytes), duration 10 sec)
progress 79% (read 10720641024 bytes, zeroes = 69% (7415529472 bytes), duration 10 sec)
progress 80% (read 10737418240 bytes, zeroes = 69% (7415529472 bytes), duration 10 sec)
restore image complete (bytes=10737418240, duration=10.18s, speed=1006.31MB/s)
--> this can be 5x faster with my solution, as the image consists largely of zero blocks

progress 99% (read 106292051968 bytes, zeroes = 1% (1606418432 bytes), duration 314 sec)
progress 98% (read 106296246272 bytes, zeroes = 1% (1606418432 bytes), duration 314 sec)
progress 99% (read 106304634880 bytes, zeroes = 1% (1606418432 bytes), duration 314 sec)
progress 100% (read 107374182400 bytes, zeroes = 1% (2139095040 bytes), duration 316 sec)
restore image complete (bytes=107374182400, duration=316.35s, speed=323.70MB/s)
--> here I would get ~1 GB/s with my solution

The other image is similarly slow.

This system does not have checksum acceleration using SHA NI; maybe
that could explain the large difference in speed between the solutions?
Is the SHA-256 calculation a bottleneck when fetching chunks?
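
If it helps to narrow that down, this is the kind of rough single-core
hashing micro-benchmark I would run on both machines (only a sketch,
using the sha2 crate as a stand-in for whatever the restore path
actually uses; needs sha2 as a Cargo dependency):

use sha2::{Digest, Sha256};
use std::time::Instant;

fn main() {
    let chunk = vec![0xabu8; 4 * 1024 * 1024]; // one 4 MiB chunk
    let iterations = 256; // 1 GiB hashed in total
    let start = Instant::now();
    for _ in 0..iterations {
        let mut hasher = Sha256::new();
        hasher.update(&chunk);
        let _digest = hasher.finalize();
    }
    let secs = start.elapsed().as_secs_f64();
    let mib = (chunk.len() * iterations) as f64 / (1024.0 * 1024.0);
    println!("SHA-256: ~{:.0} MiB/s on one core", mib / secs);
}

If the Xeon lands far below the ~1 GB/s restore speed here, the hashing
alone could already be the ceiling for a single-threaded pipeline.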

With the futures solution, mpstat shows only 1-2 busy cores on these
two-socket systems. The speed is very similar to that of the
single-threaded solution that was there before, but I confirmed
multiple times that I am using the new version (the out-of-order
progress output shows that as well, which is why I included it).
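
If the per-chunk decode/verify work really stays on one runtime thread,
one idea that might be worth a try (only a sketch against made-up
types, assuming tokio's multi-threaded runtime, not the real restore
code) is to push the CPU-heavy part onto the blocking pool so several
chunks can be processed at once:

use tokio::task::{spawn_blocking, JoinError};

// placeholder for the real per-chunk decrypt/decompress/checksum work
fn decode_and_verify(raw: Vec<u8>) -> Vec<u8> {
    raw
}

async fn process_chunk(raw: Vec<u8>) -> Result<Vec<u8>, JoinError> {
    // move the CPU-bound part off the async worker thread, so
    // buffer_unordered can keep fetching other chunks in the meantime
    spawn_blocking(move || decode_and_verify(raw)).await
}

#[tokio::main]
async fn main() {
    let raw = vec![0u8; 4 * 1024 * 1024];
    let decoded = process_chunk(raw).await.unwrap();
    println!("decoded {} bytes", decoded.len());
}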

Best regards
Adam

On Mon, 2025-07-07 at 10:14 +0200, Dominik Csapak wrote:
> by using async futures to load chunks and stream::buffer_unordered to
> buffer up to 16 of them, depending on write/load speed.
> 
> With this, we don't need to increase the number of threads in the
> runtime to trigger parallel reads and network traffic towards us. This
> way it's only limited by the CPU if decoding and/or decrypting is the
> bottleneck.
> 
> I measured restoring a VM backup with about 30GiB of data and fast
> storage over a local network link (from the PBS VM to the host).
> I let it do multiple runs, but the variance was not that big, so
> here's some representative log output with various
> MAX_BUFFERED_FUTURES values.
> 
> no MAX_BUFFERED_FUTURES: duration=43.18s, speed=758.82MB/s
> 4:                       duration=38.61s, speed=848.77MB/s
> 8:                       duration=33.79s, speed=969.85MB/s
> 16:                      duration=31.45s, speed=1042.06MB/s
> 
> note that increasing the number has diminishing returns after 10-12
> on my system, but I guess it depends on the exact configuration. (For
> more than 16 I did not see any improvement, but this is probably just
> my setup.)
> 
> I saw an increase in CPU usage (from ~75% to ~100% of one core),
> which is very likely due to the additional chunks being decoded.
> 
> In general I'd like to limit the buffering somehow, but I don't think
> there is a good automatic metric we can use, and giving the admin a
> knob whose actual ramifications are hard to explain is also not good,
> so I settled for a value that showed improvement but does not seem
> too high.
> 
> In any case, if the target and/or source storage is too slow, there
> will be back/forward pressure, and this change should only matter for
> storage systems where IO depth plays a role.
> 
> This patch is loosely based on the patch from Adam Kalisz[0], but
> removes the need to increase the blocking threads and uses the
> (actually always used) underlying async implementation for reading
> remote chunks.
> 
> 0: https://lore.proxmox.com/pve-devel/mailman.719.1751052794.395.pve-devel@lists.proxmox.com/
> 
> Signed-off-by: Dominik Csapak <d.csapak at proxmox.com>
> Based-on-patch-by: Adam Kalisz <adam.kalisz at notnullmakers.com>
> ---
> @Adam could you please test this patch too, to see if you still see
> the improvements you saw in your version?
> 
> Also I sent it as an RFC to discuss how we decide how many chunks we
> want to buffer/threads we want to allocate. This is a non-trivial
> topic, and as I wrote we don't have a real metric to decide upfront,
> but giving the admin knobs that are complicated is also not the best
> solution.
> 
> My instinct would be to simply increase to 16 (as I have done here)
> and maybe expose this number in /etc/vzdump.conf or
> /etc/pve/datacenter.cfg
> 
> Also I tried to make the writes multi-threaded too, but my QEMU
> knowledge is not very deep for this kind of thing, and I wanted to
> get this version out there soon. (Increasing the write threads can
> still be done afterwards if this change is enough for now.)
> 
> I developed this patch on top of the 'stable-bookworm' branch, but it
> should apply cleanly on master as well.
> 
>  src/restore.rs | 57 ++++++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 42 insertions(+), 15 deletions(-)
> 
> diff --git a/src/restore.rs b/src/restore.rs
> index 5a5a398..741b3e1 100644
> --- a/src/restore.rs
> +++ b/src/restore.rs
> @@ -2,6 +2,7 @@ use std::convert::TryInto;
>  use std::sync::{Arc, Mutex};
>  
>  use anyhow::{bail, format_err, Error};
> +use futures::StreamExt;
>  use once_cell::sync::OnceCell;
>  use tokio::runtime::Runtime;
>  
> @@ -13,7 +14,7 @@ use pbs_datastore::cached_chunk_reader::CachedChunkReader;
>  use pbs_datastore::data_blob::DataChunkBuilder;
>  use pbs_datastore::fixed_index::FixedIndexReader;
>  use pbs_datastore::index::IndexFile;
> -use pbs_datastore::read_chunk::ReadChunk;
> +use pbs_datastore::read_chunk::AsyncReadChunk;
>  use pbs_datastore::BackupManifest;
>  use pbs_key_config::load_and_decrypt_key;
>  use pbs_tools::crypt_config::CryptConfig;
> @@ -29,6 +30,9 @@ struct ImageAccessInfo {
>      archive_size: u64,
>  }
>  
> +// use at max 16 buffered futures to make loading of chunks more concurrent
> +const MAX_BUFFERED_FUTURES: usize = 16;
> +
>  pub(crate) struct RestoreTask {
>      setup: BackupSetup,
>      runtime: Arc<Runtime>,
> @@ -165,24 +169,47 @@ impl RestoreTask {
>  
>          let start_time = std::time::Instant::now();
>  
> -        for pos in 0..index.index_count() {
> +        let read_queue = (0..index.index_count()).map(|pos| {
>              let digest = index.index_digest(pos).unwrap();
>              let offset = (pos * index.chunk_size) as u64;
> -            if digest == &zero_chunk_digest {
> -                let res = write_zero_callback(offset, index.chunk_size as u64);
> -                if res < 0 {
> -                    bail!("write_zero_callback failed ({})", res);
> +            let chunk_reader = chunk_reader.clone();
> +            async move {
> +                let chunk = if digest == &zero_chunk_digest {
> +                    None
> +                } else {
> +                    let raw_data = AsyncReadChunk::read_chunk(&chunk_reader, digest).await?;
> +                    Some(raw_data)
> +                };
> +
> +                Ok::<_, Error>((chunk, pos, offset))
> +            }
> +        });
> +
> +        // this buffers futures and pre-fetches some chunks for us
> +        let mut stream = futures::stream::iter(read_queue).buffer_unordered(MAX_BUFFERED_FUTURES);
> +
> +        while let Some(res) = stream.next().await {
> +            let res = res?;
> +            let pos = match res {
> +                (None, pos, offset) => {
> +                    let res = write_zero_callback(offset, index.chunk_size as u64);
> +                    if res < 0 {
> +                        bail!("write_zero_callback failed ({})", res);
> +                    }
> +                    bytes += index.chunk_size;
> +                    zeroes += index.chunk_size;
> +                    pos
>                  }
> -                bytes += index.chunk_size;
> -                zeroes += index.chunk_size;
> -            } else {
> -                let raw_data = ReadChunk::read_chunk(&chunk_reader, digest)?;
> -                let res = write_data_callback(offset, &raw_data);
> -                if res < 0 {
> -                    bail!("write_data_callback failed ({})", res);
> +                (Some(raw_data), pos, offset) => {
> +                    let res = write_data_callback(offset, &raw_data);
> +                    if res < 0 {
> +                        bail!("write_data_callback failed ({})", res);
> +                    }
> +                    bytes += raw_data.len();
> +                    pos
>                  }
> -                bytes += raw_data.len();
> -            }
> +            };
> +
>              if verbose {
>                  let next_per = ((pos + 1) * 100) / index.index_count();
>                  if per != next_per {


