[pbs-devel] [PATCH proxmox-backup v2 1/3] tape: introduce a tape backup read thread tuning option
Max Carrara
m.carrara at proxmox.com
Wed May 8 15:37:16 CEST 2024
On Wed May 8, 2024 at 8:56 AM CEST, Dominik Csapak wrote:
> On 5/7/24 17:10, Max Carrara wrote:
> > On Tue May 7, 2024 at 9:29 AM CEST, Dominik Csapak wrote:
> >> using a single thread for reading is not optimal in some cases, e.g.
> >> when the underlying storage can handle more reads in parallel than with
> >> a single thread.
> >>
> >
> > Got only a few tiny notes that are otherwise unremarkable:
> >
> > Would prefer to reword this as:
> >
> > Using a single thread for reading is not optimal in some cases, e.g.
> > when the underlying storage can handle reads from multiple threads in
> > parallel.
> >
> >> We use the ParallelHandler to handle the actual reads. Make the
> >> sync_channel buffer size depending on the number of threads so we have
> >> space for two chunks per thread.
> >
> > Would s/depending/depend/ for the above paragraph.
> >
>
> ok to both comments, will update in a v3 (if necessary),
> or, if preferred, whoever commits this could fix up the commit message when applying
>
> >>
> >> How this impacts the backup speed largely depends on the underlying
> >> storage and how the backup is laid out on it.
> >>
> >> I benchmarked the following setups:
> >>
> >> * Setup A: relatively spread out backup on a virtualized pbs on single HDDs
> >> * Setup B: mostly sequential chunks on a virtualized pbs on single HDDs
> >> * Setup C: backup on virtualized pbs on a fast NVME
> >> * Setup D: backup on bare metal pbs with ZFS in a RAID10 with 6 HDDs
> >> and 2 fast special devices in a mirror
> >>
> >> (values are reported in MB/s as seen in the task log, caches were
> >> cleared between runs, backups were bigger than the memory available)
> >>
> >> setup   1 thread   2 threads   4 threads   8 threads
> >> A          55         70          80          95
> >> B         110         89         100         108
> >> C         294        294         294         294
> >> D         118        180         300         300
> >>
> >> So there are cases where multiple read threads speed up the tape backup
> >> (dramatically). On the other hand there are situations where reading
> >> from a single thread is actually faster, probably because we can read
> >> from the HDD sequentially.
> >>
> >> I used a new default value of '4' here since that gave good performance
> >> on all setups (not necessarily the best) and we also have a default
> >> value of '4' for verification threads.
> >>
> >> We use a struct here for the datastore since we want to introduce other
> >> thread tuning options too.
> >>
> >> Signed-off-by: Dominik Csapak <d.csapak at proxmox.com>
> >
> > The patch is pretty straightforward; there's one question inline, but
> > otherwise this looks fine to me. Solid work!
> >
> > Wasn't able to test this yet as setting up a VTL on Bookworm failed for
> > me, unfortunately. Will try to test this tomorrow if possible.
>
> it's probably obvious, but just to be sure:
>
> make sure the target storage for the vtl is faster than your
> source storage, otherwise you'll always be limited by that...
Thanks for the tip! Will consider it. Am in the process of spinning up
another VM.
>
> >
> > For now:
> >
> > Reviewed-by: Max Carrara <m.carrara at proxmox.com>
> >
> [snip]
> > diff --git a/src/tape/pool_writer/new_chunks_iterator.rs b/src/tape/pool_writer/new_chunks_iterator.rs
> >> index 1454b33d2..a2a8091f6 100644
> >> --- a/src/tape/pool_writer/new_chunks_iterator.rs
> >> +++ b/src/tape/pool_writer/new_chunks_iterator.rs
> >> @@ -6,8 +6,9 @@ use anyhow::{format_err, Error};
> >> use pbs_datastore::{DataBlob, DataStore, SnapshotReader};
> >>
> >> use crate::tape::CatalogSet;
> >> +use crate::tools::parallel_handler::ParallelHandler;
> >>
> >> -/// Chunk iterator which use a separate thread to read chunks
> >> +/// Chunk iterator which uses separate threads to read chunks
> >> ///
> >> /// The iterator skips duplicate chunks and chunks already in the
> >> /// catalog.
> >> @@ -25,7 +26,11 @@ impl NewChunksIterator {
> >>          snapshot_reader: Arc<Mutex<SnapshotReader>>,
> >>          catalog_set: Arc<Mutex<CatalogSet>>,
> >>      ) -> Result<(std::thread::JoinHandle<()>, Self), Error> {
> >> -        let (tx, rx) = std::sync::mpsc::sync_channel(3);
> >> +        let read_threads = datastore
> >> +            .get_thread_configuration()
> >> +            .tape_backup_read_threads;
> >> +
> >> +        let (tx, rx) = std::sync::mpsc::sync_channel(read_threads * 2);
> >
> > Is there any reason you're using `* 2` here? For example, is the
> > throughput unaffected if you use a larger value, like `* 8`?
> >
> > If the constant has an effect like that it should IMO be documented, but
> > if not, then it can just stay like it is.
>
> i did not benchmark that much here (would have increased my benchmark time since
> that's another dimension in the testing with many possible values)
>
> but my thought process was this:
>
> if the source storage is much slower than what we can write to tape, that
> buffer will be empty most of the time
>
> if the source storage is much faster than what we can write to tape, it
> will be full most of the time
>
> in both cases the buffer size won't matter much (as long as there is a bit of
> buffer)
>
> the sweet spot is when the storage is about as fast as the tape writes
> and in that case you want to be able to have just a bit more buffer
> than threads reading from the storage so there is always a read
> going on (so the buffer will never empty, since that slows
> down the tape write)
>
> so 1*number of threads seemed too low, but i think anything bigger
> than 2* is just a waste of memory
>
> does that make sense?
>
> if we really want, i can make additional benchmarks but my guess
> is that it won't make much difference in practice
Okay, that makes a lot of sense; that has cleared everything up for me.
Thanks!
I don't think this needs to be changed at all, then. LGTM
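As a footnote for the archives: the reasoning above is the usual bounded
producer/consumer argument. Here is a minimal, self-contained sketch of
that pattern with plain std::sync::mpsc (not the actual
NewChunksIterator/ParallelHandler code; the thread count, chunk sizes
and sleeps are made up for illustration):

use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

fn main() {
    // hypothetical value; the patch takes this from the datastore tuning option
    let read_threads: usize = 4;

    // two slots per reader thread, mirroring the read_threads * 2 in the patch
    let (tx, rx) = sync_channel::<Vec<u8>>(read_threads * 2);

    let mut readers = Vec::new();
    for t in 0..read_threads {
        let tx = tx.clone();
        readers.push(thread::spawn(move || {
            for _ in 0..8 {
                // simulate reading one chunk from the source storage
                thread::sleep(Duration::from_millis(10));
                // send() blocks once the bounded buffer is full
                tx.send(vec![t as u8; 1024]).unwrap();
            }
        }));
    }
    // drop the original sender so the channel closes when all readers are done
    drop(tx);

    // single consumer, standing in for the tape write loop
    let mut written = 0usize;
    for chunk in rx {
        thread::sleep(Duration::from_millis(5));
        written += chunk.len();
    }

    for r in readers {
        r.join().unwrap();
    }
    println!("wrote {written} bytes");
}

With the capacity at read_threads * 2, each reader can have one chunk
parked in the channel while it is already fetching the next one, which
is exactly the "buffer never runs empty" property described above.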