[pbs-devel] partially-applied: [PATCH v8 pxar proxmox-backup 00/69] fix #3174: improve file-level backup

Fabian Grünbichler f.gruenbichler at proxmox.com
Wed Jun 5 10:51:41 CEST 2024


applied the pxar patches + follow-ups, and patches 16, 17 and 39 for
PBS. while most of the rest of the patches LGTM as well, there are too
many inter-dependencies to just pick a few, and quite a lot would be
required to make pbs buildable again with a bumped pxar, so I left the
rest even if most of them will likely be unchanged for v9, and skipped
the bump of pxar as well for now.

On May 28, 2024 11:41 am, Christian Ebner wrote:
> This series of patches implements a metadata-based file change
> detection mechanism for improved pxar file level backup creation speed
> for unchanged files.
> 
> The chosen approach is to split pxar archives on creation via the
> proxmox-backup-client into two separate data and upload streams,
> one exclusive for regular file payloads, the other one for the rest
> of the pxar archive, which is mostly metadata.
> 
> On consecutive runs, the metadata archive of the previous backup run,
> which is limited in size and can therefore be accessed rapidly, is used
> to look up and compare the metadata for entries to encode.
> This assumes that the connection speed to the Proxmox Backup Server is
> sufficiently fast, allowing the download and caching of the chunks for
> that index.
> 
> Changes to regular files are detected by comparing the file's entire
> metadata object, including mtime, ACLs, etc. If no changes are detected,
> the previous payload index is used to look up chunks to possibly re-use
> in the payload stream of the new archive.
> In order to reduce possible chunk fragmentation, the decision whether to
> reuse or reencode a file payload is deferred until enough information
> is gathered by adding entries to a look-ahead cache. If the padding
> introduced by reusing chunks falls below a threshold, the entries are
> referenced, the chunks are reused and injected into the pxar payload
> upload stream, otherwise they are discarded and the files are encoded
> regularly.
> 
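To illustrate the reuse decision described above, here is a minimal sketch in Rust. It is not the code from the series: the `Metadata` fields, the `Decision` enum, `is_unchanged`, `decide` and the ratio-based threshold are hypothetical names and assumptions. Padding is taken to be the part of the reused chunks that is not referenced by any unchanged file.

```rust
/// Hypothetical subset of the per-file metadata. The series compares the
/// whole metadata object (mtime, ACLs, etc.); only a few fields are shown.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Metadata {
    pub mtime_ns: i64,
    pub mode: u32,
    pub uid: u32,
    pub gid: u32,
    pub size: u64,
}

/// Outcome for a batch of cached entries (illustrative name).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    /// Reference the previous chunks and inject them into the payload stream.
    Reuse,
    /// Drop the cached references and encode the file payloads regularly.
    Reencode,
}

/// A file counts as unchanged only if all compared metadata fields match.
pub fn is_unchanged(previous: &Metadata, current: &Metadata) -> bool {
    previous == current
}

/// Decide whether reusing chunks is worthwhile.
///
/// `referenced_bytes`: payload bytes of unchanged files inside the chunks.
/// `chunk_bytes`: total size of the chunks that would be reused.
/// `max_padding_ratio`: assumed threshold, as a fraction of `chunk_bytes`.
pub fn decide(referenced_bytes: u64, chunk_bytes: u64, max_padding_ratio: f64) -> Decision {
    if chunk_bytes == 0 {
        // Nothing to reuse.
        return Decision::Reencode;
    }
    // Bytes carried along by the reused chunks without being referenced.
    let padding = chunk_bytes.saturating_sub(referenced_bytes);
    let ratio = padding as f64 / chunk_bytes as f64;
    if ratio <= max_padding_ratio {
        Decision::Reuse
    } else {
        Decision::Reencode
    }
}
```

With an assumed threshold of 0.2, `decide(900, 1000, 0.2)` yields `Decision::Reuse` (10% padding), while `decide(500, 1000, 0.2)` yields `Decision::Reencode` (50% padding).
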
> Patches 16 and 17 are to be applied before the patches to the pxar
> repository, while patches 14 and 15 are to be applied to the pxar repository
> only after patch 52 in the series, for the patches to compile in a sequential
> chain.
> 
> The following lists the most notable changes included in this series since
> the version 7:
> - Fixed incorrectly squashed patches during rebase
> 
> The following lists the most notable changes included in this series since
> the version 6:
> - Allow using the `.pxar` extension in cli commands for convenience
> - Refactor the input/output interface for the pxar encoder, decoder and
>   accessor to use a `PxarVariant` enum, in order to guarantee the
>   payload-related input/output is always attached for split archives.
> - Refactor the lookahead caching logic in the pxar `Archiver` to
>   improve overall code readability.
> - Add helper method for file name matching and use it where possible,
>   for it to be handled in a single place.
> - Extend documentation to include additional information about which
>   metadata is compared to the previous snapshot
> - Fix an issue with the `pxar list` which failed in case of metadata
>   only pxar archives.
> - Fix an issue in the payload chunker test where the context was not
>   updated accordingly.
> - Various clippy fixes, smaller refactoring and reordering of patches
> 
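The idea behind the `PxarVariant` refactoring mentioned in the list above can be sketched as follows. This is an illustrative stand-in, not the definition from the pxar crate: the name `ArchiveVariant`, its single type parameter and the accessor methods are assumptions. The point is that a split archive cannot be constructed without its payload side.

```rust
/// Illustrative stand-in for an input/output variant type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ArchiveVariant<T> {
    /// Single stream holding metadata and file payloads together.
    Unified(T),
    /// Metadata stream plus a dedicated payload stream.
    Split(T, T),
}

impl<T> ArchiveVariant<T> {
    /// The metadata (or unified) stream, present in both variants.
    pub fn archive(&self) -> &T {
        match self {
            ArchiveVariant::Unified(archive) | ArchiveVariant::Split(archive, _) => archive,
        }
    }

    /// The payload stream, available only for split archives.
    pub fn payload(&self) -> Option<&T> {
        match self {
            ArchiveVariant::Unified(_) => None,
            ArchiveVariant::Split(_, payload) => Some(payload),
        }
    }
}
```

For example, `ArchiveVariant::Split("root.mpxar", "root.ppxar").payload()` returns `Some(..)`, whereas a `Unified` value returns `None`, so callers have to handle both cases explicitly.
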
> The following lists the most notable changes included in this series since
> the version 5:
> - Fix an issue where the payload chunker was not correctly reset after
>   suggested or forced boundaries.
> - Added regression tests for payload chunker and chunk stream.
> 
> The following lists the most notable changes included in this series since
> the version 4:
> - Increase open file handle limit to hard limit and adapt lookahead
>   cache size dynamically (thanks a lot to Thomas for pointing this out
>   and providing the necessary background information). This helps to
>   reuse multiple entries contained within the same chunk, which would
>   otherwise exceed the padding threshold and therefore be reencoded
>   instead.
> - Fix payload chunker scan to only scan up until chunk pos in case a
>   suggested boundary is chosen.
> - Fix issue with decoder state not being set to the correct
>   `InDirectory` state after reading prelude and getting root directory
>   entry.
> - Fix issue with kept back chunk injection when the chunk follows a
>   range discontinuity.
> - Add regression test for pxar create with metadata archive and payload
>   index reference.
> 
> The following lists the most notable changes included in this series since
> the version 3:
> - Rework the whole reused chunk injection and accounting logic and use
>   lockless async `mpsc::channel`s instead of `Arc<Mutex<VecDeque<..>>>`.
> - Reworked lookahead caching logic to use payload ranges and check for
>   possible range continuation instead of looking up the reusable dynamic
>   entries immediately in case of a reusable entry chain. This also
>   avoids edge cases not covered in the previous version of the patch series.
>   This current version therefore tends to reencode small files more
>   aggressively, since they might introduce additional unwanted paddings.
> - Also cover hardlinks correctly in the reuse logic, avoiding
>   reencoding of these entries.
> - Add additional dedicated chunker implementation for payload data
>   stream, allowing the archiver to suggest boundaries to the chunker to
>   reduce padding for reused chunks.
> - Add additional `change-detection-mode=data`, in order to allow
>   creating split archives with fully reencoded payload data.
> - Add additional payload input readers for pxar accessor type
>   implementations where needed.
> - Add additional consistency check in pxar encoder when dropping state
>   or encoder instance.
> - CliParams was renamed to the more opaque Prelude, since the pxar
>   archive does not care about its contents and this might be extended to
>   store other information about the archive as well.
> - Add missing proxmox-file-restore for split archives and fix restore of
>   tar/zip archives via WebUI. This is handled by the same decoder logic,
>   and needed an updated payload input content range to read the data
>   from the correct location in the payload data archive.
> - Additional refactoring to use the pxar reader helpers where possible.
> 
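The range continuation check mentioned in the list above can be sketched roughly as follows. This is a simplified illustration under assumed names (`RangeTracker`, `push`, `finish`), not the look-ahead cache from the series: consecutive reusable entries whose payload ranges are contiguous are merged into one range, and a discontinuity closes the current range.

```rust
use std::ops::Range;

/// Hypothetical helper collecting payload ranges of reusable entries.
#[derive(Debug, Default)]
pub struct RangeTracker {
    /// Range currently being extended, if any.
    current: Option<Range<u64>>,
    /// Ranges that were closed by a discontinuity.
    flushed: Vec<Range<u64>>,
}

impl RangeTracker {
    /// Add the payload range of the next reusable entry.
    pub fn push(&mut self, next: Range<u64>) {
        match self.current.take() {
            // Continuation: the entry starts exactly where the range ends.
            Some(cur) if cur.end == next.start => {
                self.current = Some(cur.start..next.end);
            }
            // Discontinuity: close the previous range and start a new one.
            Some(cur) => {
                self.flushed.push(cur);
                self.current = Some(next);
            }
            // First entry.
            None => self.current = Some(next),
        }
    }

    /// Close the last open range and return all collected ranges.
    pub fn finish(mut self) -> Vec<Range<u64>> {
        if let Some(cur) = self.current.take() {
            self.flushed.push(cur);
        }
        self.flushed
    }
}
```

Pushing `0..10`, `10..25` and `40..50` in that order and calling `finish()` gives `[0..25, 40..50]`. In such a scheme the chunks would only need to be looked up once per merged range rather than once per entry.
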
> The following lists the most notable changes included in this series since
> the version 2:
> - many bugfixes regarding incorrect archive encoding by wrong offset
>   generation, adding additional sanity checks and rather fail on
>   encoding than produce an incorrectly encoded archive
> - different approach for deciding whether to reuse or reencode the
>   entries. Previously, the entries have been encoded when a cached
>   payload size threshold was reached. Now, the padding introduced by
>   reusable chunks is tracked, and only if the padding does not exceed
>   the set threshold, the entries are reused. This reduces the possible
>   padding, at the cost of reencoding more entries. Also avoids
>   reusing chunks which now have large padding holes because of
>   moved/removed files contained within.
> - added headers for metadata archive and payload file
> - added documentation
> 
> An invocation of a backup run with these patches now is:
> ```bash
> proxmox-backup-client backup <label>.pxar:<source-path> --change-detection-mode=metadata
> ```
> During the first run, no reference index is available; the pxar archive
> will nevertheless be split into the two parts.
> Subsequent backups will utilize the pxar archive accessor and
> index files of the previous run to perform file change detection.
> 
> As benchmarks, the Linux source code as well as the COCO dataset for
> computer vision and pattern recognition can be used.
> The benchmarks can be performed by running:
> ```bash
> proxmox-backup-test-suite detection-mode-bench prepare --target /<path-to-bench-source-target>
> proxmox-backup-test-suite detection-mode-bench run linux.pxar:/<path-to-bench-source-target>/linux
> proxmox-backup-test-suite detection-mode-bench run coco.pxar:/<path-to-bench-source-target>/coco
> ```
> 
> The above command invocations assume the default repository and
> credentials to be set as environment variables; they might however be
> passed as additional optional parameters instead.
> 
> pxar:
> 
> Christian Ebner (15):
>   decoder: factor out skip part from skip_entry
>   lib: add type for input/output variant differentiation
>   encoder: move to stack based state tracking
>   format/examples: add header type `PXAR_PAYLOAD_REF`
>   decoder: add method to read payload references
>   encoder: allow split output writer for archive creation
>   decoder/accessor: allow for split input stream variant
>   decoder: set payload input range when decoding via accessor
>   encoder: add payload reference capability
>   encoder: add payload position capability
>   encoder: add payload advance capability
>   encoder/format: finish payload stream with marker
>   format: add payload stream start marker
>   format/encoder/decoder: new pxar entry type `Version`
>   format/encoder/decoder: new pxar entry type `Prelude`
> 
>  examples/apxar.rs            |   2 +-
>  examples/mk-format-hashes.rs |  21 ++
>  examples/pxarcmd.rs          |   7 +-
>  src/accessor/aio.rs          |  10 +-
>  src/accessor/mod.rs          | 120 +++++++--
>  src/accessor/sync.rs         |   8 +-
>  src/decoder/aio.rs           |  13 +-
>  src/decoder/mod.rs           | 249 ++++++++++++++---
>  src/decoder/sync.rs          |  21 +-
>  src/encoder/aio.rs           |  90 +++++--
>  src/encoder/mod.rs           | 508 ++++++++++++++++++++++++++---------
>  src/encoder/sync.rs          |  75 +++++-
>  src/format/mod.rs            |  63 +++++
>  src/lib.rs                   |  71 +++++
>  tests/compat.rs              |   3 +-
>  tests/simple/fs.rs           |   8 +-
>  tests/simple/main.rs         |  11 +-
>  17 files changed, 1027 insertions(+), 253 deletions(-)
> 
> proxmox-backup:
> 
> Christian Ebner (54):
>   client: backup: factor out extension from backup target
>   api: datastore: refactor getting local chunk reader
>   client: pxar: switch to stack based encoder state
>   client: pxar: combine writers into struct
>   client: pxar: optionally split metadata and payload streams
>   client: helper: add helpers for creating reader instances
>   client: helper: add method for split archive name mapping
>   client: tools: helper to check pxar filename extensions
>   client: restore: read payload from dedicated index
>   tools: cover extension for split pxar archives
>   restore: cover extension for split pxar archives
>   client: mount: make split pxar archives mountable
>   api: datastore: attach split archive payload chunk reader
>   catalog: shell: make split pxar archives accessible
>   www: cover metadata extension for pxar archives
>   file restore: factor out getting pxar reader
>   file restore: cover split metadata and payload archives
>   file restore: show more error context when extraction fails
>   pxar: add optional payload input for archive restore
>   pxar: cover listing for split archives
>   pxar: add more context to extraction error
>   client: pxar: include payload offset in entry listing
>   pxar: show padding in debug output on archive list
>   datastore: dynamic index: add method to get digest
>   client: pxar: helper for lookup of reusable dynamic entries
>   upload stream: implement reused chunk injector
>   client: chunk stream: add struct to hold injection state
>   chunker: add method to reset chunker state
>   client: streams: add channels for dynamic entry injection
>   specs: add backup detection mode specification
>   client: implement prepare reference method
>   client: pxar: add method for metadata comparison
>   pxar: caching: add look-ahead cache
>   client: pxar: refactor catalog encoding for directories
>   fix #3174: client: pxar: enable caching and meta comparison
>   client: backup writer: add injected chunk count to stats
>   pxar: create: keep track of reused chunks and files
>   pxar: create: show chunk injection stats debug output
>   client: pxar: add helper to handle optional preludes
>   client: pxar: opt encode cli exclude patterns as Prelude
>   pxar: ignore version and prelude entries in listing
>   docs: file formats: describe split pxar archive file layout
>   docs: add section describing change detection mode
>   test-suite: add detection mode change benchmark
>   test-suite: Makefile: add debian package and related files
>   datastore: chunker: add Chunker trait
>   datastore: chunker: implement chunker for payload stream
>   client: chunk stream: switch payload stream chunker
>   client: pxar: allow to restore prelude to optional path
>   client: pxar: add archive creation with reference test
>   client: tools: add helper to raise nofile rlimit
>   client: pxar: set cache limit based on nofile rlimit
>   chunker: tests: add regression tests for payload chunker
>   chunk stream: tests: add regression tests for payload chunker
> 
>  Cargo.toml                                    |   1 +
>  Makefile                                      |  18 +-
>  debian/control                                |   7 +
>  debian/proxmox-backup-client.bash-completion  |   1 +
>  debian/proxmox-backup-test-suite.bc           |   8 +
>  debian/proxmox-backup-test-suite.install      |   3 +
>  docs/Makefile                                 |   2 +
>  docs/backup-client.rst                        |  45 +
>  docs/command-line-tools.rst                   |   5 +
>  docs/command-syntax.rst                       |   4 +
>  docs/conf.py                                  |   1 +
>  docs/file-formats.rst                         |  46 +
>  docs/meta-format-overview.dot                 |  50 +
>  .../proxmox-backup-test-suite/description.rst |   2 +
>  docs/proxmox-backup-test-suite/man1.rst       |  17 +
>  docs/technical-overview.rst                   |   3 +
>  examples/test_chunk_size.rs                   |   9 +-
>  examples/test_chunk_speed.rs                  |   7 +-
>  examples/test_chunk_speed2.rs                 |   2 +-
>  pbs-client/src/backup_specification.rs        |  26 +
>  pbs-client/src/backup_writer.rs               | 118 ++-
>  pbs-client/src/chunk_stream.rs                | 238 ++++-
>  pbs-client/src/inject_reused_chunks.rs        | 129 +++
>  pbs-client/src/lib.rs                         |   3 +-
>  pbs-client/src/pxar/create.rs                 | 911 +++++++++++++++++-
>  pbs-client/src/pxar/extract.rs                |  28 +-
>  pbs-client/src/pxar/look_ahead_cache.rs       | 165 ++++
>  pbs-client/src/pxar/mod.rs                    |   5 +-
>  pbs-client/src/pxar/tools.rs                  | 123 ++-
>  pbs-client/src/pxar_backup_stream.rs          |  71 +-
>  pbs-client/src/tools/mod.rs                   |  69 +-
>  pbs-datastore/src/chunker.rs                  | 267 ++++-
>  pbs-datastore/src/dynamic_index.rs            |  14 +-
>  pbs-datastore/src/lib.rs                      |   2 +-
>  pbs-pxar-fuse/src/lib.rs                      |   2 +-
>  proxmox-backup-client/src/catalog.rs          |  29 +-
>  proxmox-backup-client/src/helper.rs           | 114 +++
>  proxmox-backup-client/src/main.rs             | 291 +++++-
>  proxmox-backup-client/src/mount.rs            |  33 +-
>  proxmox-backup-test-suite/Cargo.toml          |  18 +
>  .../src/detection_mode_bench.rs               | 294 ++++++
>  proxmox-backup-test-suite/src/main.rs         |  17 +
>  proxmox-file-restore/src/main.rs              |  73 +-
>  .../src/proxmox_restore_daemon/api.rs         |  20 +-
>  pxar-bin/src/main.rs                          |  85 +-
>  src/api2/admin/datastore.rs                   |  48 +-
>  src/api2/tape/restore.rs                      |  22 +-
>  src/bin/proxmox_backup_debug/diff.rs          |   2 +-
>  src/tape/file_formats/snapshot_archive.rs     |   8 +-
>  tests/catar.rs                                |   7 +-
>  tests/pxar/backup-client-pxar-data.mpxar      | Bin 0 -> 15070 bytes
>  tests/pxar/backup-client-pxar-data.ppxar.didx | Bin 0 -> 8096 bytes
>  tests/pxar/backup-client-pxar-expected.mpxar  | Bin 0 -> 15086 bytes
>  www/datastore/Content.js                      |   6 +-
>  zsh-completions/_proxmox-backup-test-suite    |  13 +
>  55 files changed, 3145 insertions(+), 337 deletions(-)
>  create mode 100644 debian/proxmox-backup-test-suite.bc
>  create mode 100644 debian/proxmox-backup-test-suite.install
>  create mode 100644 docs/meta-format-overview.dot
>  create mode 100644 docs/proxmox-backup-test-suite/description.rst
>  create mode 100644 docs/proxmox-backup-test-suite/man1.rst
>  create mode 100644 pbs-client/src/inject_reused_chunks.rs
>  create mode 100644 pbs-client/src/pxar/look_ahead_cache.rs
>  create mode 100644 proxmox-backup-client/src/helper.rs
>  create mode 100644 proxmox-backup-test-suite/Cargo.toml
>  create mode 100644 proxmox-backup-test-suite/src/detection_mode_bench.rs
>  create mode 100644 proxmox-backup-test-suite/src/main.rs
>  create mode 100644 tests/pxar/backup-client-pxar-data.mpxar
>  create mode 100644 tests/pxar/backup-client-pxar-data.ppxar.didx
>  create mode 100644 tests/pxar/backup-client-pxar-expected.mpxar
>  create mode 100644 zsh-completions/_proxmox-backup-test-suite
> 
> -- 
> 2.39.2
> 