[pbs-devel] [RFC pxar proxmox-backup 00/36] fix #3174: improve file-level backup

Fabian Grünbichler f.gruenbichler at proxmox.com
Wed Mar 13 12:44:03 CET 2024


On March 5, 2024 10:26 am, Christian Ebner wrote:
> Disclaimer: These patches are work in progress and not intended for
> production use just yet. The purpose is for initial testing and review.
> 
> This series of patches implements a metadata-based file change
> detection mechanism for improved pxar file level backup creation speed
> for unchanged files.
> 
> The chosen approach is to split pxar archives on creation via the
> proxmox-backup-client into two separate archives and upload streams,
> one exclusive for regular file payloads, the other one for the rest
> of the pxar archive, which is mostly metadata.
> 
> On consecutive runs, the metadata archive of the previous backup run,
> which is limited in size and therefore rapidly accessed, is used to
> look up and compare the metadata for entries to encode.
> This assumes that the connection speed to the Proxmox Backup Server is
> sufficiently fast, allowing the download and caching of the chunks for
> that index.
> 
> Changes to regular files are detected by comparing the file's entire
> metadata object, including mtime, ACLs, etc. If no changes are detected,
> the previous payload index is used to look up chunks to possibly re-use
> in the payload stream of the new archive.
> In order to reduce possible chunk fragmentation, the decision whether to
> re-use or re-encode a file payload is deferred until enough information
> is gathered by adding entries to a look-ahead cache. If enough payload
> is referenced, the chunks are re-used and injected into the pxar payload
> upload stream, otherwise they are discarded and the files encoded
> regularly.
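
To make the comparison step described above concrete, here is a minimal sketch in Rust. The `FileMetadata` struct, its fields, and `detect_change` are hypothetical names chosen for illustration; they are not the types used in the series, and the real pxar metadata object carries more fields. The sketch only shows the idea that any difference in the metadata object forces a re-encode, while an exact match merely makes the file a candidate for chunk re-use (the final decision is deferred to the look-ahead cache).

```rust
/// Hypothetical subset of per-file metadata (assumption for illustration only).
#[derive(Debug, Clone, PartialEq, Eq)]
struct FileMetadata {
    mode: u32,
    uid: u32,
    gid: u32,
    size: u64,
    mtime_secs: i64,
    mtime_nanos: u32,
    acls: Vec<String>,
}

/// Outcome of the metadata comparison for one regular file.
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    /// Metadata is identical: the previous payload chunks may be re-used.
    ReuseCandidate,
    /// Metadata differs or no previous entry exists: encode the payload again.
    Reencode,
}

/// Compare the whole metadata object of the previous and the current entry.
fn detect_change(previous: Option<&FileMetadata>, current: &FileMetadata) -> Decision {
    match previous {
        Some(prev) if prev == current => Decision::ReuseCandidate,
        _ => Decision::Reencode,
    }
}
```

Comparing the whole struct via `PartialEq` keeps the check conservative: a change in any single field, even only the nanosecond part of mtime, is treated as a changed file.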

I like how this is shaping up!

some high-level feedback in addition to things noted at individual
patches:

I think the two archive types should also get a proper header that has
fields like an archive version and possible other metadata. while this
means losing concat support, this is not something we use or need
anyway. it would make the next bump a lot less painful, since the old
client can print meaningful error messages like "encountered pxar
archive v3, unsupported, please upgrade" instead of opaque "invalid
entry type <magic blob>, abort" (which cannot be differentiated from a
corrupt archive!).
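
a rough sketch of what I mean, in Rust. everything here is made up for illustration (the magic value, the 12-byte layout, the version numbers, all names), it is not the actual pxar format. the point is only that a fixed magic plus an explicit version field lets the reader tell "newer format" apart from "garbage":

```rust
use std::fmt;

/// Hypothetical magic value marking the start of an archive (assumption).
const EXAMPLE_MAGIC: [u8; 8] = *b"PXARHDR\0";
/// Highest archive version this hypothetical client understands (assumption).
const MAX_SUPPORTED_VERSION: u32 = 2;

#[derive(Debug, PartialEq)]
enum HeaderError {
    /// Too short or wrong magic: corrupt input or not an archive at all.
    Invalid,
    /// Well-formed header, but written with a newer format version.
    UnsupportedVersion(u32),
}

impl fmt::Display for HeaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            HeaderError::Invalid => {
                write!(f, "invalid archive header, archive may be corrupt")
            }
            HeaderError::UnsupportedVersion(v) => write!(
                f,
                "encountered pxar archive v{}, unsupported, please upgrade",
                v
            ),
        }
    }
}

/// Parse a 12-byte header: 8 bytes magic followed by a little-endian u32 version.
fn parse_header(data: &[u8]) -> Result<u32, HeaderError> {
    if data.len() < 12 || !data.starts_with(&EXAMPLE_MAGIC) {
        return Err(HeaderError::Invalid);
    }
    let version = u32::from_le_bytes([data[8], data[9], data[10], data[11]]);
    if version > MAX_SUPPORTED_VERSION {
        return Err(HeaderError::UnsupportedVersion(version));
    }
    Ok(version)
}
```

with something like this, an old client handed a v3 archive fails with `UnsupportedVersion(3)` and a clear message, while truncated or random input still ends up as `Invalid`.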

I think the pxar/create.rs code can be simplified/refactored to make it
easier to understand, although it's probably not the easiest task.

Some (at least debug) collection of the "wasted space" in the form of
padding (i.e., all the bytes of re-used chunks that are not referenced
by this snapshot) would be nice to have. Or at least an upper bound of
that (calculating an accurate amount might be expensive for
intra-archive dedup, and also, in real-world, the actual waste depends
on other snapshots anyway..). maybe we can also re-visit some sort of
heuristic for this, so that at least the final chunk of a file is not
re-used unless it or the next re-used file(s) make up > $threshold of
the chunk.
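
to illustrate both ideas, a small Rust sketch. the types and function names are hypothetical and not taken from the series. it assumes re-used chunks and referenced file payloads are both given as half-open byte ranges in the previous payload stream, and that the referenced ranges do not overlap each other. the result is only an upper bound, since it ignores that padding bytes might still be referenced through dedup elsewhere:

```rust
/// A re-used chunk as byte range [start, end) in the previous payload stream.
#[derive(Debug, Clone, Copy)]
struct ChunkRange {
    start: u64,
    end: u64,
}

/// A file payload referenced by the new snapshot, as byte range [start, end).
#[derive(Debug, Clone, Copy)]
struct PayloadRef {
    start: u64,
    end: u64,
}

/// Number of bytes of `chunk` that are covered by the payload reference `r`.
fn overlap(chunk: ChunkRange, r: PayloadRef) -> u64 {
    chunk.end.min(r.end).saturating_sub(chunk.start.max(r.start))
}

/// Bytes of `chunk` referenced by any of `refs` (refs assumed non-overlapping).
fn referenced_bytes(chunk: ChunkRange, refs: &[PayloadRef]) -> u64 {
    refs.iter().map(|r| overlap(chunk, *r)).sum()
}

/// Upper bound on padding: bytes in re-used chunks not referenced by this snapshot.
fn padding_upper_bound(chunks: &[ChunkRange], refs: &[PayloadRef]) -> u64 {
    chunks
        .iter()
        .map(|c| (c.end - c.start).saturating_sub(referenced_bytes(*c, refs)))
        .sum()
}

/// Heuristic: only re-use a chunk if more than `threshold` (0.0..=1.0) of it is referenced.
fn worth_reusing(chunk: ChunkRange, refs: &[PayloadRef], threshold: f64) -> bool {
    let size = chunk.end - chunk.start;
    if size == 0 {
        return false;
    }
    referenced_bytes(chunk, refs) as f64 / size as f64 > threshold
}
```

e.g., with two 100-byte chunks and references covering 60 bytes of the first and 20 bytes of the second, the bound is 120 bytes of padding, and with a threshold of 0.5 the second chunk would not be re-used.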

the benchmark tool is not that meaningful without some way of testing
*changing* input data in a systematic fashion ;)

I'll give this a more in-depth spin and see what else I notice/find!

> The following lists the most notable changes included in this series since
> version 1:
> - also cache pxar exclude patterns passed via cli instead of encoding
>   them directly. This led to an inconsistent archive while caching.
> - Fix the flushing of entries and chunks to inject before finishing the
>   archiver. Previously these last entries have been re-encoded, now they
>   are re-used.
> - add a dedicated method and type in the decoder for decoding payload
>   references.
> 
> An invocation of a backup run with these patches now is:
> ```bash
> proxmox-backup-client backup <label>.pxar:<source-path> --change-detection-mode=metadata
> ```
> During the first run, no reference index is available; the pxar archive
> will nevertheless be split into the two parts.
> Subsequent backups will utilize the pxar archive accessor and
> index files of the previous run to perform file change detection.
> 
> As benchmarks, the linux source code as well as the coco dataset for
> computer vision and pattern recognition can be used.
> The benchmarks can be performed by running:
> ```bash
> proxmox-backup-test-suite detection-mode-bench prepare --target /<path-to-bench-source-target>
> proxmox-backup-test-suite detection-mode-bench run linux.pxar:/<path-to-bench-source-target>/linux
> proxmox-backup-test-suite detection-mode-bench run coco.pxar:/<path-to-bench-source-target>/coco
> ```
> 
> The above command invocations assume the default repository and credentials
> to be set as environment variables; they might however be passed as
> additional optional parameters instead.
> 
> Benchmark runs using this test data show a significant improvement in
> the time needed for the backups. Note that all of these runs were against a
> local PBS instance within a VM, thereby minimizing possible network influences.
> 
> For the linux source code backup:
>     Completed benchmark with 5 runs for each tested mode.
> 
>     Completed regular backup with:
>     Total runtime: 51.31 s
>     Average: 10.26 ± 0.12 s
>     Min: 10.16 s
>     Max: 10.46 s
> 
>     Completed metadata detection mode backup with:
>     Total runtime: 4.89 s
>     Average: 0.98 ± 0.02 s
>     Min: 0.95 s
>     Max: 1.00 s
> 
>     Differences (metadata based - regular):
>     Delta total runtime: -46.42 s (-90.47 %)
>     Delta average: -9.28 ± 0.12 s (-90.47 %)
>     Delta min: -9.21 s (-90.64 %)
>     Delta max: -9.46 s (-90.44 %)
> 
> For the coco dataset backup:
>     Completed benchmark with 5 runs for each tested mode.
> 
>     Completed regular backup with:
>     Total runtime: 520.72 s
>     Average: 104.14 ± 0.79 s
>     Min: 103.44 s
>     Max: 105.49 s
> 
>     Completed metadata detection mode backup with:
>     Total runtime: 6.95 s
>     Average: 1.39 ± 0.23 s
>     Min: 1.26 s
>     Max: 1.79 s
> 
>     Differences (metadata based - regular):
>     Delta total runtime: -513.76 s (-98.66 %)
>     Delta average: -102.75 ± 0.83 s (-98.66 %)
>     Delta min: -102.18 s (-98.78 %)
>     Delta max: -103.69 s (-98.30 %)
> 
> This series of patches implements an alternative, but more promising
> approach to the series presented previously [0], with the intention to
> solve the same issue with fewer changes required to the pxar format and to
> be more efficient.
> 
> [0] https://lists.proxmox.com/pipermail/pbs-devel/2024-January/007693.html
> 
> pxar:
> 
> Christian Ebner (10):
>   format/examples: add PXAR_PAYLOAD_REF entry header
>   encoder: add optional output writer for file payloads
>   format/decoder: add method to read payload references
>   decoder: add optional payload input stream
>   accessor: add optional payload input stream
>   encoder: move to stack based state tracking
>   encoder: add payload reference capability
>   encoder: add payload position capability
>   encoder: add payload advance capability
>   encoder/format: finish payload stream with marker
> 
>  examples/mk-format-hashes.rs |  10 +
>  examples/pxarcmd.rs          |   6 +-
>  src/accessor/aio.rs          |   7 +
>  src/accessor/mod.rs          |  85 ++++++++-
>  src/decoder/mod.rs           |  92 ++++++++-
>  src/decoder/sync.rs          |   7 +
>  src/encoder/aio.rs           |  52 +++--
>  src/encoder/mod.rs           | 357 +++++++++++++++++++++++++----------
>  src/encoder/sync.rs          |  45 ++++-
>  src/format/mod.rs            |  10 +
>  src/lib.rs                   |   3 +
>  11 files changed, 534 insertions(+), 140 deletions(-)
> 
> proxmox-backup:
> 
> Christian Ebner (26):
>   client: pxar: switch to stack based encoder state
>   client: backup: factor out extension from backup target
>   client: backup: early check for fixed index type
>   client: backup: split payload to dedicated stream
>   client: restore: read payload from dedicated index
>   tools: cover meta extension for pxar archives
>   restore: cover meta extension for pxar archives
>   client: mount: make split pxar archives mountable
>   api: datastore: refactor getting local chunk reader
>   api: datastore: attach optional payload chunk reader
>   catalog: shell: factor out pxar fuse reader instantiation
>   catalog: shell: redirect payload reader for split streams
>   www: cover meta extension for pxar archives
>   index: fetch chunk form index by start/end-offset
>   upload stream: impl reused chunk injector
>   client: chunk stream: add chunk injection queues
>   client: implement prepare reference method
>   client: pxar: implement store to insert chunks on caching
>   client: pxar: add previous reference to archiver
>   client: pxar: add method for metadata comparison
>   specs: add backup detection mode specification
>   pxar: caching: add look-ahead cache types
>   client: pxar: add look-ahead caching
>   fix #3174: client: pxar: enable caching and meta comparison
>   test-suite: add detection mode change benchmark
>   test-suite: Add bin to deb, add shell completions
> 
>  Cargo.toml                                    |   1 +
>  Makefile                                      |  13 +-
>  debian/proxmox-backup-client.bash-completion  |   1 +
>  debian/proxmox-backup-client.install          |   2 +
>  debian/proxmox-backup-test-suite.bc           |   8 +
>  examples/test_chunk_speed2.rs                 |  10 +-
>  pbs-client/src/backup_specification.rs        |  53 ++
>  pbs-client/src/backup_writer.rs               |  89 ++-
>  pbs-client/src/chunk_stream.rs                |  42 +-
>  pbs-client/src/inject_reused_chunks.rs        | 152 +++++
>  pbs-client/src/lib.rs                         |   1 +
>  pbs-client/src/pxar/create.rs                 | 620 +++++++++++++++++-
>  pbs-client/src/pxar/look_ahead_cache.rs       |  40 ++
>  pbs-client/src/pxar/mod.rs                    |   3 +-
>  pbs-client/src/pxar_backup_stream.rs          |  54 +-
>  pbs-client/src/tools/mod.rs                   |   2 +-
>  pbs-datastore/src/dynamic_index.rs            |  55 ++
>  proxmox-backup-client/src/catalog.rs          |  73 ++-
>  proxmox-backup-client/src/main.rs             | 280 +++++++-
>  proxmox-backup-client/src/mount.rs            |  56 +-
>  proxmox-backup-test-suite/Cargo.toml          |  18 +
>  .../src/detection_mode_bench.rs               | 294 +++++++++
>  proxmox-backup-test-suite/src/main.rs         |  17 +
>  proxmox-file-restore/src/main.rs              |  11 +-
>  .../src/proxmox_restore_daemon/api.rs         |  16 +-
>  pxar-bin/src/main.rs                          |   7 +-
>  src/api2/admin/datastore.rs                   |  45 +-
>  tests/catar.rs                                |   4 +
>  www/datastore/Content.js                      |   6 +-
>  zsh-completions/_proxmox-backup-test-suite    |  13 +
>  30 files changed, 1827 insertions(+), 159 deletions(-)
>  create mode 100644 debian/proxmox-backup-test-suite.bc
>  create mode 100644 pbs-client/src/inject_reused_chunks.rs
>  create mode 100644 pbs-client/src/pxar/look_ahead_cache.rs
>  create mode 100644 proxmox-backup-test-suite/Cargo.toml
>  create mode 100644 proxmox-backup-test-suite/src/detection_mode_bench.rs
>  create mode 100644 proxmox-backup-test-suite/src/main.rs
>  create mode 100644 zsh-completions/_proxmox-backup-test-suite
> 
> -- 
> 2.39.2