[pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
Dominik Csapak
d.csapak at proxmox.com
Tue Jul 8 12:58:01 CEST 2025
On 7/8/25 12:04, Adam Kalisz wrote:
> Hi Dominik,
>
Hi,
> this is a big improvement, I have done some performance measurements
> again:
>
> Ryzen:
> 4 worker threads:
> restore image complete (bytes=53687091200, duration=52.06s,
> speed=983.47MB/s)
> 8 worker threads:
> restore image complete (bytes=53687091200, duration=50.12s,
> speed=1021.56MB/s)
>
> 4 worker threads, 4 max-blocking:
> restore image complete (bytes=53687091200, duration=54.00s,
> speed=948.22MB/s)
> 8 worker threads, 4 max-blocking:
> restore image complete (bytes=53687091200, duration=50.43s,
> speed=1015.25MB/s)
> 8 worker threads, 4 max-blocking, 32 buffered futures:
> restore image complete (bytes=53687091200, duration=52.11s,
> speed=982.53MB/s)
>
> Xeon:
> 4 worker threads:
> restore image complete (bytes=10737418240, duration=3.06s,
> speed=3345.97MB/s)
> restore image complete (bytes=107374182400, duration=139.80s,
> speed=732.47MB/s)
> restore image complete (bytes=107374182400, duration=136.67s,
> speed=749.23MB/s)
> 8 worker threads:
> restore image complete (bytes=10737418240, duration=2.50s,
> speed=4095.30MB/s)
> restore image complete (bytes=107374182400, duration=127.14s,
> speed=805.42MB/s)
> restore image complete (bytes=107374182400, duration=121.39s,
> speed=843.59MB/s)
Just for my understanding: you left the parallel futures at 16 and
changed the number of threads in the tokio runtime?
The biggest issue here is that we probably don't want to increase that
number by default by much, since e.g. on a running system this will
have an impact on other running VMs. Adjusting such a number
(especially now that it's actually used, in contrast to before) will
come as a surprise for many.
That's IMHO the biggest challenge here, and that's why I did not touch
the tokio runtime thread settings: to not increase the load too much.
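For reference, the knob I mean is the worker thread count the runtime
is built with, roughly like this (just to illustrate which setting we
are talking about; proxmox-backup-qemu sets up its runtime elsewhere):

    fn main() {
        // minimal sketch of the tokio runtime thread setting in question
        let runtime = tokio::runtime::Builder::new_multi_thread()
            .worker_threads(4) // the number we would be increasing
            .enable_all()
            .build()
            .expect("failed to build tokio runtime");
        runtime.block_on(async { /* restore work would run here */ });
    }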
Did you by any chance observe the CPU usage during your tests?
As I wrote in my commit message, the CPU usage quadrupled
(proportionally to the additional chunks we could push through) when
using 16 fetching tasks.
Just an additional note: with my solution, the blocking threads should
not have any impact at all, since the fetching is purely async (so no
blocking code anywhere) and the writing is done sequentially in the
main thread/task, so at most one blocking thread will be used.
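To illustrate the structure (heavily simplified; fetch_chunk and
write_chunk are made-up stand-ins for the real chunk reader and write
callbacks, so this is not the actual restore code):

    use futures::StreamExt;

    async fn fetch_chunk(pos: usize) -> Vec<u8> {
        vec![pos as u8; 4] // stand-in for the async remote chunk read
    }

    fn write_chunk(_offset: u64, _data: &[u8]) {
        // stand-in for the synchronous write callback
    }

    #[tokio::main]
    async fn main() {
        let read_queue = (0..100usize).map(|pos| async move {
            // spawn so the fetch makes progress even while we're not polling it
            let data = tokio::task::spawn(fetch_chunk(pos)).await.unwrap();
            (pos as u64 * 4, data)
        });

        // keep up to 16 fetches in flight; results arrive in completion order
        let mut stream = futures::stream::iter(read_queue).buffer_unordered(16);

        // the writes stay sequential in this single task
        while let Some((offset, data)) = stream.next().await {
            write_chunk(offset, &data);
        }
    }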
>
> Much better but it would need to be 25% faster on this older system to
> hit the numbers I have already seen with my solution:
>
> For comparison, with my solution on the same Xeon system I was hitting:
> With 8-way concurrency, 16 max-blocking threads:
> restore image complete (bytes=10737418240, avg fetch time=16.7572ms,
> avg time per nonzero write=1.9310ms, storage nonzero total write
> time=1.580s, duration=2.25s, speed=4551.25MB/s)
> restore image complete (bytes=107374182400, avg fetch time=29.1714ms,
> avg time per nonzero write=2.2216ms, storage nonzero total write
> time=55.739s, duration=106.17s, speed=964.52MB/s)
> restore image complete (bytes=107374182400, avg fetch time=28.2543ms,
> avg time per nonzero write=2.1473ms, storage nonzero total write
> time=54.139s, duration=103.52s, speed=989.18MB/s)
>
> With 16-way concurrency, 32 max-blocking threads:
> restore image complete (bytes=10737418240, avg fetch time=25.3444ms,
> avg time per nonzero write=2.0709ms, storage nonzero total write
> time=1.694s, duration=2.02s, speed=5074.13MB/s)
> restore image complete (bytes=107374182400, avg fetch time=53.3046ms,
> avg time per nonzero write=2.6692ms, storage nonzero total write
> time=66.969s, duration=106.65s, speed=960.13MB/s)
> restore image complete (bytes=107374182400, avg fetch time=47.3909ms,
> avg time per nonzero write=2.6352ms, storage nonzero total write
> time=66.440s, duration=98.09s, speed=1043.95MB/s)
> -> this seemed to be the best setting for this system
>
> On the Ryzen system I was hitting:
> With 8-way concurrency, 16 max-blocking threads:
> restore image complete (bytes=53687091200, avg fetch time=24.7342ms,
> avg time per nonzero write=1.6474ms, storage nonzero total write
> time=19.996s, duration=45.83s, speed=1117.15MB/s)
> -> this seemed to be the best setting for this system
>
> It seems the counting of zeroes works in some kind of steps (seen on
> the Xeon system with mostly incompressible data):
>
yes, only whole zero chunks will be counted.
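The zero accounting therefore advances in whole-chunk steps: each index
entry's digest is simply compared against the digest of a single
all-zero chunk, roughly like this (simplified, not the literal restore
code):

    // a chunk that is only partially zero never matches the all-zero digest,
    // so zeroes are only ever counted in multiples of the chunk size
    fn count_zero_bytes(
        index_digests: &[[u8; 32]],
        zero_chunk_digest: &[u8; 32],
        chunk_size: usize,
    ) -> usize {
        index_digests
            .iter()
            .filter(|digest| *digest == zero_chunk_digest)
            .count()
            * chunk_size
    }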
[snip]
>
> Especially during a restore the speed is quite important if you need to
> hit Restore Time Objectives under SLAs. That's why we were targeting 1
> GBps for incompressible data.
I get that, but this will always be a tradeoff between CPU load and
throughput, and we have to find a good middle ground here.
IMO my current patch already gives a very good improvement without
increasing the (theoretical) load on the system.
It could be OK from my POV to make the number of runtime threads
configurable, e.g. via vzdump.conf. (That's something that is easy to
explain in the docs for admins.)
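E.g. something like the following in /etc/vzdump.conf (the option name
is made up purely for illustration, nothing like it exists yet):

    # hypothetical example, not an existing option
    restore-threads: 8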
If someone else (@Fabian?) wants to chime in to this discussion,
I'd be glad.
Also feedback on my code in general would be nice ;)
(There are probably better ways to make this concurrent in an
async context, e.g. maybe using 'async-channel' + fixed number
of tasks? A rough sketch of what I mean is below.)
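Rough, untested sketch of the async-channel variant (fetch_chunk is a
stand-in for the real async chunk read, and the numbers are arbitrary):

    use async_channel::bounded;

    async fn fetch_chunk(pos: usize) -> Vec<u8> {
        vec![pos as u8; 4] // stand-in for the real async chunk read
    }

    #[tokio::main]
    async fn main() {
        let (pos_tx, pos_rx) = bounded::<usize>(32);
        let (chunk_tx, chunk_rx) = bounded::<(u64, Vec<u8>)>(32);

        // fixed number of fetcher tasks instead of buffer_unordered
        for _ in 0..4 {
            let pos_rx = pos_rx.clone();
            let chunk_tx = chunk_tx.clone();
            tokio::spawn(async move {
                while let Ok(pos) = pos_rx.recv().await {
                    let data = fetch_chunk(pos).await;
                    if chunk_tx.send((pos as u64 * 4, data)).await.is_err() {
                        break;
                    }
                }
            });
        }
        // close our copy so the consumer stops once all fetchers are done
        drop(chunk_tx);

        // feed the chunk positions
        tokio::spawn(async move {
            for pos in 0..100usize {
                if pos_tx.send(pos).await.is_err() {
                    break;
                }
            }
        });

        // single consumer doing the writes sequentially
        while let Ok((_offset, _data)) = chunk_rx.recv().await {
            // write_data_callback / write_zero_callback would go here
        }
    }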
Thanks
Dominik
>
> Thank you
> Adam
>
> On Tue, 2025-07-08 at 10:49 +0200, Dominik Csapak wrote:
>> Use async futures to load chunks and stream::buffer_unordered to
>> buffer up to 16 of them, depending on write/load speed. Use tokio's
>> task spawn to make sure they continue to run in the background, since
>> buffer_unordered starts them but does not poll them to completion
>> unless we're awaiting.
>>
>> With this, we don't need to increase the number of threads in the
>> runtime to trigger parallel reads and network traffic to us. This way
>> it's only limited by CPU if decoding and/or decrypting is the
>> bottleneck.
>>
>> I measured restoring a VM backup with a 60GiB disk (filled with ~42GiB
>> data) and fast storage over a local network link (from PBS VM to the
>> host). I did 3 runs, but the variance was not that big, so here's some
>> representative log output with various MAX_BUFFERED_FUTURES values.
>>
>> benchmark    duration    speed          cpu percentage
>> current      107.18s      573.25MB/s    < 100%
>> 4:            44.74s     1373.34MB/s    ~ 180%
>> 8:            32.30s     1902.42MB/s    ~ 290%
>> 16:           25.75s     2386.44MB/s    ~ 360%
>>
>> I saw an increase in CPU usage proportional to the speed increase, so
>> while the current version uses less than a single core in total,
>> using 16 parallel futures resulted in 3-4 of the available tokio
>> runtime threads being utilized.
>>
>> In general I'd like to limit the buffering somehow, but I don't think
>> there is a good automatic metric we can use, and giving the admin a
>> knob whose actual ramifications are hard to explain is also not good,
>> so I settled on a value that showed improvement but does not seem too
>> high.
>>
>> In any case, if the target and/or source storage is too slow, there
>> will be back/forward pressure, and this change should only matter for
>> storage systems where IO depth plays a role and that are fast enough.
>>
>> The way we count the finished chunks also changes a bit: since they
>> can arrive out of order, we can't rely on the index position to
>> calculate the percentage.
>>
>> This patch is loosely based on the patch from Adam Kalisz[0], but
>> removes the need to increase the number of blocking threads and uses
>> the (actually always used) underlying async implementation for
>> reading remote chunks.
>>
>> 0: https://lore.proxmox.com/pve-devel/mailman.719.1751052794.395.pve-devel@lists.proxmox.com/
>>
>> Signed-off-by: Dominik Csapak <d.csapak at proxmox.com>
>> Based-on-patch-by: Adam Kalisz <adam.kalisz at notnullmakers.com>
>> ---
>> changes from RFC v1:
>> * use tokio task spawn to actually run the fetching in the background
>> * redo the counting for the task output (pos was unordered so we got
>>   weird ordering sometimes)
>>
>> When actually running the fetching in the background, the speed
>> increase is much higher than just using buffer_unordered for the
>> fetching futures, which is nice (although the CPU usage is much
>> higher now).
>>
>> Since the benchmark was much faster with higher values, I used a
>> different, bigger VM this time around so the timings are more
>> consistent and to make sure the disk does not fit into the PBS's
>> memory.
>>
>> The question of what count we should use remains, though...
>>
>>  src/restore.rs | 63 +++++++++++++++++++++++++++++++++++++-------------
>> 1 file changed, 47 insertions(+), 16 deletions(-)
>>
>> diff --git a/src/restore.rs b/src/restore.rs
>> index 5a5a398..4e6c538 100644
>> --- a/src/restore.rs
>> +++ b/src/restore.rs
>> @@ -2,6 +2,7 @@ use std::convert::TryInto;
>> use std::sync::{Arc, Mutex};
>>
>> use anyhow::{bail, format_err, Error};
>> +use futures::StreamExt;
>> use once_cell::sync::OnceCell;
>> use tokio::runtime::Runtime;
>>
>> @@ -13,7 +14,7 @@ use pbs_datastore::cached_chunk_reader::CachedChunkReader;
>> use pbs_datastore::data_blob::DataChunkBuilder;
>> use pbs_datastore::fixed_index::FixedIndexReader;
>> use pbs_datastore::index::IndexFile;
>> -use pbs_datastore::read_chunk::ReadChunk;
>> +use pbs_datastore::read_chunk::AsyncReadChunk;
>> use pbs_datastore::BackupManifest;
>> use pbs_key_config::load_and_decrypt_key;
>> use pbs_tools::crypt_config::CryptConfig;
>> @@ -29,6 +30,9 @@ struct ImageAccessInfo {
>> archive_size: u64,
>> }
>>
>> +// use this many buffered futures to make loading of chunks more concurrent
>> +const MAX_BUFFERED_FUTURES: usize = 16;
>> +
>> pub(crate) struct RestoreTask {
>> setup: BackupSetup,
>> runtime: Arc<Runtime>,
>> @@ -165,26 +169,53 @@ impl RestoreTask {
>>
>> let start_time = std::time::Instant::now();
>>
>> - for pos in 0..index.index_count() {
>> - let digest = index.index_digest(pos).unwrap();
>> + let read_queue = (0..index.index_count()).map(|pos| {
>> + let digest = *index.index_digest(pos).unwrap();
>> let offset = (pos * index.chunk_size) as u64;
>> - if digest == &zero_chunk_digest {
>> -                let res = write_zero_callback(offset, index.chunk_size as u64);
>> - if res < 0 {
>> - bail!("write_zero_callback failed ({})", res);
>> + let chunk_reader = chunk_reader.clone();
>> + async move {
>> + let chunk = if digest == zero_chunk_digest {
>> + None
>> + } else {
>> + let raw_data = tokio::task::spawn(async move {
>> +                        AsyncReadChunk::read_chunk(&chunk_reader, &digest).await
>> + })
>> + .await??;
>> + Some(raw_data)
>> + };
>> +
>> + Ok::<_, Error>((chunk, offset))
>> + }
>> + });
>> +
>> + // this buffers futures and pre-fetches some chunks for us
>> +        let mut stream = futures::stream::iter(read_queue).buffer_unordered(MAX_BUFFERED_FUTURES);
>> +
>> + let mut count = 0;
>> + while let Some(res) = stream.next().await {
>> + let res = res?;
>> + match res {
>> + (None, offset) => {
>> +                    let res = write_zero_callback(offset, index.chunk_size as u64);
>> + if res < 0 {
>> +                        bail!("write_zero_callback failed ({})", res);
>> + }
>> + bytes += index.chunk_size;
>> + zeroes += index.chunk_size;
>> }
>> - bytes += index.chunk_size;
>> - zeroes += index.chunk_size;
>> - } else {
>> -                let raw_data = ReadChunk::read_chunk(&chunk_reader, digest)?;
>> - let res = write_data_callback(offset, &raw_data);
>> - if res < 0 {
>> - bail!("write_data_callback failed ({})", res);
>> + (Some(raw_data), offset) => {
>> +                    let res = write_data_callback(offset, &raw_data);
>> + if res < 0 {
>> +                        bail!("write_data_callback failed ({})", res);
>> + }
>> + bytes += raw_data.len();
>> }
>> - bytes += raw_data.len();
>> }
>> +
>> + count += 1;
>> +
>> if verbose {
>> -                let next_per = ((pos + 1) * 100) / index.index_count();
>> + let next_per = (count * 100) / index.index_count();
>> if per != next_per {
>> eprintln!(
>>                          "progress {}% (read {} bytes, zeroes = {}% ({} bytes), duration {} sec)",
>