[pbs-devel] partially-applied: [PATCH v8 pxar proxmox-backup 00/69] fix #3174: improve file-level backup
Fabian Grünbichler
f.gruenbichler at proxmox.com
Wed Jun 5 10:51:41 CEST 2024
applied the pxar patches + follow-ups, and patches 16, 17 and 39 for
PBS. while most of the rest of the patches LGTM as well, there are too
many inter-dependencies to just pick a few, and quite a lot would be
required to make pbs buildable again with a bumped pxar, so I left the
rest even if most of them will likely be unchanged for v9, and skipped
the bump of pxar as well for now.
On May 28, 2024 11:41 am, Christian Ebner wrote:
> This series of patches implements a metadata-based file change
> detection mechanism for improved pxar file-level backup creation speed
> for unchanged files.
>
> The chosen approach is to split pxar archives on creation via the
> proxmox-backup-client into two separate data and upload streams,
> one exclusive for regular file payloads, the other one for the rest
> of the pxar archive, which is mostly metadata.
>
> On consecutive runs, the metadata archive of the previous backup run,
> which is limited in size and can therefore be accessed rapidly, is used to
> look up and compare the metadata for entries to encode.
> This assumes that the connection speed to the Proxmox Backup Server is
> sufficiently fast, allowing the download and caching of the chunks for
> that index.
>
> Changes to regular files are detected by comparing the file's entire
> metadata object, including mtime, ACLs, etc. If no changes are detected,
> the previous payload index is used to look up chunks to possibly reuse
> in the payload stream of the new archive.
> In order to reduce possible chunk fragmentation, the decision whether to
> reuse or reencode a file payload is deferred until enough information
> is gathered by adding entries to a look-ahead cache. If the padding
> introduced by reusing chunks falls below a threshold, the entries are
> referenced, the chunks are reused and injected into the pxar payload
> upload stream, otherwise they are discarded and the files encoded
> regularly.
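
The reuse-or-reencode decision described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the series: `CachedEntry`, `Decision`, `decide` and the `max_padding_ratio` parameter are hypothetical names, and padding is modelled simply as the bytes in the reused chunks that do not belong to any cached entry.

```rust
/// One cached entry whose payload could be reused from the previous backup.
/// Hypothetical type for illustration; not taken from the patch series.
#[derive(Debug, Clone, Copy)]
struct CachedEntry {
    /// Size of the file payload in bytes.
    payload_size: u64,
}

/// Outcome of flushing the look-ahead cache.
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    /// Reference the entries and inject the existing chunks.
    Reuse,
    /// Drop the cached chunks and encode the files regularly.
    Reencode,
}

/// Decide whether to reuse the chunks covering `entries`.
///
/// `chunk_bytes` is the total size of the chunks that would be injected.
/// Bytes in those chunks that belong to no cached entry count as padding.
/// `max_padding_ratio` is an assumed tuning knob in the range 0.0..=1.0.
fn decide(entries: &[CachedEntry], chunk_bytes: u64, max_padding_ratio: f64) -> Decision {
    let payload: u64 = entries.iter().map(|e| e.payload_size).sum();
    if chunk_bytes == 0 || payload > chunk_bytes {
        // Nothing to reuse, or inconsistent accounting: encode regularly.
        return Decision::Reencode;
    }
    let padding = chunk_bytes - payload;
    let ratio = padding as f64 / chunk_bytes as f64;
    if ratio <= max_padding_ratio {
        Decision::Reuse
    } else {
        Decision::Reencode
    }
}
```

Deferring the call until several entries are cached matters because many small files may share one chunk: judged individually each would look like mostly padding, while judged together they may fill the chunk almost completely.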
>
> Patches 16 and 17 are to be applied before the patches to the pxar
> repository, while patches 14 and 15 are to be applied to the pxar repository
> only after patch 52 in the series, for the patches to compile in a sequential
> chain.
>
> The following lists the most notable changes included in this series since
> the version 7:
> - Fixed incorrectly squashed patches during rebase
>
> The following lists the most notable changes included in this series since
> the version 6:
> - Allow to use `.pxar` extension in cli commands for convenience
> - Refactor the input/output interface for the pxar encoder, decoder and
> accessor to use a `PxarVariant` enum, in order to guarantee the
> payload-related input/output is always attached for split archives.
> - Refactor the lookahead caching logic in the pxar `Archiver` to
> improve overall code readability.
> - Add helper method for file name matching and use it where possible,
> for it to be handled in a single place.
> - Extend documentation to include additional information about which
> metadata is compared to the previous snapshot
> - Fix an issue with `pxar list`, which failed for metadata-only
> pxar archives.
> - Fix an issue in the payload chunker test where the context was not
> updated accordingly.
> - Various clippy fixes, smaller refactoring and reordering of patches
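
To illustrate the idea behind the `PxarVariant` item above, here is a small sketch of an enum that makes it impossible to construct a split archive without its payload stream. It is a hypothetical stand-in (`ArchiveVariant` and its methods are assumed names), not the definition from the pxar crate.

```rust
/// Hypothetical stand-in for the variant idea: either a single stream
/// (unified archive) or a metadata stream paired with a payload stream.
#[derive(Debug)]
enum ArchiveVariant<T, P> {
    /// Regular pxar archive, metadata and payload in one stream.
    Unified(T),
    /// Split archive: metadata stream plus dedicated payload stream.
    Split(T, P),
}

impl<T, P> ArchiveVariant<T, P> {
    /// The stream carrying the archive itself (metadata for split archives).
    fn archive(&self) -> &T {
        match self {
            ArchiveVariant::Unified(archive) => archive,
            ArchiveVariant::Split(archive, _) => archive,
        }
    }

    /// The payload stream, present only for split archives.
    fn payload(&self) -> Option<&P> {
        match self {
            ArchiveVariant::Unified(_) => None,
            ArchiveVariant::Split(_, payload) => Some(payload),
        }
    }
}
```

Compared with passing an optional second reader or writer alongside the first, the enum ties the payload stream to the split case at the type level, so callers cannot forget to attach it.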
>
> The following lists the most notable changes included in this series since
> the version 5:
> - Fix an issue where the payload chunker was not correctly reset after
> suggested or forced boundaries.
> - Added regression tests for payload chunker and chunk stream.
>
> The following lists the most notable changes included in this series since
> the version 4:
> - Increase open file handle limit to hard limit and adapt lookahead
> cache size dynamically (thanks a lot to Thomas for pointing this out
> and providing the necessary background information). This helps with
> reusing multiple entries contained within the same chunk, which would
> otherwise exceed the padding threshold and therefore be reencoded
> instead.
> - Fix payload chunker scan to only scan up until chunk pos in case a
> suggested boundary is chosen.
> - Fix issue with decoder state not being set to the correct `InDirectory`
> after reading prelude and getting root directory entry.
> - Fix issue with kept back chunk injection when the chunk follows a
> range discontinuity.
> - Add regression test for pxar create with metadata archive and payload
> index reference.
>
> The following lists the most notable changes included in this series since
> the version 3:
> - Rework the whole reused chunk injection and accounting logic and use
> lockless async `mpsc::channel`s instead of `Arc<Mutex<VecDeque<..>>>`.
> - Reworked lookahead caching logic to use payload ranges and check for
> possible range continuation instead of looking up the reusable dynamic
> entries immediately in case of a reusable entry chain. This also
> avoids edge cases not covered in the previous version of the patch series.
> This current version therefore tends to reencode small files more
> aggressively, since they might introduce additional unwanted paddings.
> - Correctly cover hardlinks in the reuse logic as well, to avoid
> reencoding these entries.
> - Add additional dedicated chunker implementation for payload data
> stream, allowing the archiver to suggest boundaries to the chunker to
> reduce padding for reused chunks.
> - Add additional `change-detection-mode=data`, in order to allow
> creating split archives with fully reencoded payload data.
> - Add additional payload input readers for pxar accessor type
> implementations where needed.
> - Add additional consistency check in pxar encoder when dropping state
> or encoder instance.
> - CliParams was renamed to the more opaque Prelude, since the pxar
> archive does not care about its contents and this might be extended to
> store other information about the archive as well.
> - Add missing proxmox-file-restore for split archives and fix restore of
> tar/zip archives via WebUI. This is handled by the same decoder logic,
> and needed an updated payload input content range to read the data
> from the correct location in the payload data archive.
> - Additional refactoring to use the pxar reader helpers where possible.
>
> The following lists the most notable changes included in this series since
> the version 2:
> - many bugfixes regarding incorrect archive encoding caused by wrong
> offset generation; added sanity checks so that encoding fails rather
> than producing an incorrectly encoded archive
> - different approach for deciding whether to reuse or reencode the
> entries. Previously, the entries have been encoded when a cached
> payload size threshold was reached. Now, the padding introduced by
> reusable chunks is tracked, and only if the padding does not exceed
> the set threshold, the entries are reused. This reduces the possible
> padding, at the cost of reencoding more entries. This also avoids
> reusing chunks which now have large padding holes because of
> moved/removed files contained within.
> - added headers for metadata archive and payload file
> - added documentation
>
> An invocation of a backup run with these patches is now:
> ```bash
> proxmox-backup-client backup <label>.pxar:<source-path> --change-detection-mode=metadata
> ```
> During the first run, no reference index is available; the pxar archive
> will nevertheless be split into the two parts.
> Subsequent backups will utilize the pxar archive accessor and
> index files of the previous run to perform file change detection.
>
> As benchmarks, the Linux source code as well as the COCO dataset for
> computer vision and pattern recognition can be used.
> The benchmarks can be performed by running:
> ```bash
> proxmox-backup-test-suite detection-mode-bench prepare --target /<path-to-bench-source-target>
> proxmox-backup-test-suite detection-mode-bench run linux.pxar:/<path-to-bench-source-target>/linux
> proxmox-backup-test-suite detection-mode-bench run coco.pxar:/<path-to-bench-source-target>/coco
> ```
>
> The above command invocations assume the default repository and
> credentials to be set as environment variables. They can however be
> passed as additional optional parameters instead.
>
> pxar:
>
> Christian Ebner (15):
> decoder: factor out skip part from skip_entry
> lib: add type for input/output variant differentiation
> encoder: move to stack based state tracking
> format/examples: add header type `PXAR_PAYLOAD_REF`
> decoder: add method to read payload references
> encoder: allow split output writer for archive creation
> decoder/accessor: allow for split input stream variant
> decoder: set payload input range when decoding via accessor
> encoder: add payload reference capability
> encoder: add payload position capability
> encoder: add payload advance capability
> encoder/format: finish payload stream with marker
> format: add payload stream start marker
> format/encoder/decoder: new pxar entry type `Version`
> format/encoder/decoder: new pxar entry type `Prelude`
>
> examples/apxar.rs | 2 +-
> examples/mk-format-hashes.rs | 21 ++
> examples/pxarcmd.rs | 7 +-
> src/accessor/aio.rs | 10 +-
> src/accessor/mod.rs | 120 +++++++--
> src/accessor/sync.rs | 8 +-
> src/decoder/aio.rs | 13 +-
> src/decoder/mod.rs | 249 ++++++++++++++---
> src/decoder/sync.rs | 21 +-
> src/encoder/aio.rs | 90 +++++--
> src/encoder/mod.rs | 508 ++++++++++++++++++++++++++---------
> src/encoder/sync.rs | 75 +++++-
> src/format/mod.rs | 63 +++++
> src/lib.rs | 71 +++++
> tests/compat.rs | 3 +-
> tests/simple/fs.rs | 8 +-
> tests/simple/main.rs | 11 +-
> 17 files changed, 1027 insertions(+), 253 deletions(-)
>
> proxmox-backup:
>
> Christian Ebner (54):
> client: backup: factor out extension from backup target
> api: datastore: refactor getting local chunk reader
> client: pxar: switch to stack based encoder state
> client: pxar: combine writers into struct
> client: pxar: optionally split metadata and payload streams
> client: helper: add helpers for creating reader instances
> client: helper: add method for split archive name mapping
> client: tools: helper to check pxar filename extensions
> client: restore: read payload from dedicated index
> tools: cover extension for split pxar archives
> restore: cover extension for split pxar archives
> client: mount: make split pxar archives mountable
> api: datastore: attach split archive payload chunk reader
> catalog: shell: make split pxar archives accessible
> www: cover metadata extension for pxar archives
> file restore: factor out getting pxar reader
> file restore: cover split metadata and payload archives
> file restore: show more error context when extraction fails
> pxar: add optional payload input for archive restore
> pxar: cover listing for split archives
> pxar: add more context to extraction error
> client: pxar: include payload offset in entry listing
> pxar: show padding in debug output on archive list
> datastore: dynamic index: add method to get digest
> client: pxar: helper for lookup of reusable dynamic entries
> upload stream: implement reused chunk injector
> client: chunk stream: add struct to hold injection state
> chunker: add method to reset chunker state
> client: streams: add channels for dynamic entry injection
> specs: add backup detection mode specification
> client: implement prepare reference method
> client: pxar: add method for metadata comparison
> pxar: caching: add look-ahead cache
> client: pxar: refactor catalog encoding for directories
> fix #3174: client: pxar: enable caching and meta comparison
> client: backup writer: add injected chunk count to stats
> pxar: create: keep track of reused chunks and files
> pxar: create: show chunk injection stats debug output
> client: pxar: add helper to handle optional preludes
> client: pxar: opt encode cli exclude patterns as Prelude
> pxar: ignore version and prelude entries in listing
> docs: file formats: describe split pxar archive file layout
> docs: add section describing change detection mode
> test-suite: add detection mode change benchmark
> test-suite: Makefile: add debian package and related files
> datastore: chunker: add Chunker trait
> datastore: chunker: implement chunker for payload stream
> client: chunk stream: switch payload stream chunker
> client: pxar: allow to restore prelude to optional path
> client: pxar: add archive creation with reference test
> client: tools: add helper to raise nofile rlimit
> client: pxar: set cache limit based on nofile rlimit
> chunker: tests: add regression tests for payload chunker
> chunk stream: tests: add regression tests for payload chunker
>
> Cargo.toml | 1 +
> Makefile | 18 +-
> debian/control | 7 +
> debian/proxmox-backup-client.bash-completion | 1 +
> debian/proxmox-backup-test-suite.bc | 8 +
> debian/proxmox-backup-test-suite.install | 3 +
> docs/Makefile | 2 +
> docs/backup-client.rst | 45 +
> docs/command-line-tools.rst | 5 +
> docs/command-syntax.rst | 4 +
> docs/conf.py | 1 +
> docs/file-formats.rst | 46 +
> docs/meta-format-overview.dot | 50 +
> .../proxmox-backup-test-suite/description.rst | 2 +
> docs/proxmox-backup-test-suite/man1.rst | 17 +
> docs/technical-overview.rst | 3 +
> examples/test_chunk_size.rs | 9 +-
> examples/test_chunk_speed.rs | 7 +-
> examples/test_chunk_speed2.rs | 2 +-
> pbs-client/src/backup_specification.rs | 26 +
> pbs-client/src/backup_writer.rs | 118 ++-
> pbs-client/src/chunk_stream.rs | 238 ++++-
> pbs-client/src/inject_reused_chunks.rs | 129 +++
> pbs-client/src/lib.rs | 3 +-
> pbs-client/src/pxar/create.rs | 911 +++++++++++++++++-
> pbs-client/src/pxar/extract.rs | 28 +-
> pbs-client/src/pxar/look_ahead_cache.rs | 165 ++++
> pbs-client/src/pxar/mod.rs | 5 +-
> pbs-client/src/pxar/tools.rs | 123 ++-
> pbs-client/src/pxar_backup_stream.rs | 71 +-
> pbs-client/src/tools/mod.rs | 69 +-
> pbs-datastore/src/chunker.rs | 267 ++++-
> pbs-datastore/src/dynamic_index.rs | 14 +-
> pbs-datastore/src/lib.rs | 2 +-
> pbs-pxar-fuse/src/lib.rs | 2 +-
> proxmox-backup-client/src/catalog.rs | 29 +-
> proxmox-backup-client/src/helper.rs | 114 +++
> proxmox-backup-client/src/main.rs | 291 +++++-
> proxmox-backup-client/src/mount.rs | 33 +-
> proxmox-backup-test-suite/Cargo.toml | 18 +
> .../src/detection_mode_bench.rs | 294 ++++++
> proxmox-backup-test-suite/src/main.rs | 17 +
> proxmox-file-restore/src/main.rs | 73 +-
> .../src/proxmox_restore_daemon/api.rs | 20 +-
> pxar-bin/src/main.rs | 85 +-
> src/api2/admin/datastore.rs | 48 +-
> src/api2/tape/restore.rs | 22 +-
> src/bin/proxmox_backup_debug/diff.rs | 2 +-
> src/tape/file_formats/snapshot_archive.rs | 8 +-
> tests/catar.rs | 7 +-
> tests/pxar/backup-client-pxar-data.mpxar | Bin 0 -> 15070 bytes
> tests/pxar/backup-client-pxar-data.ppxar.didx | Bin 0 -> 8096 bytes
> tests/pxar/backup-client-pxar-expected.mpxar | Bin 0 -> 15086 bytes
> www/datastore/Content.js | 6 +-
> zsh-completions/_proxmox-backup-test-suite | 13 +
> 55 files changed, 3145 insertions(+), 337 deletions(-)
> create mode 100644 debian/proxmox-backup-test-suite.bc
> create mode 100644 debian/proxmox-backup-test-suite.install
> create mode 100644 docs/meta-format-overview.dot
> create mode 100644 docs/proxmox-backup-test-suite/description.rst
> create mode 100644 docs/proxmox-backup-test-suite/man1.rst
> create mode 100644 pbs-client/src/inject_reused_chunks.rs
> create mode 100644 pbs-client/src/pxar/look_ahead_cache.rs
> create mode 100644 proxmox-backup-client/src/helper.rs
> create mode 100644 proxmox-backup-test-suite/Cargo.toml
> create mode 100644 proxmox-backup-test-suite/src/detection_mode_bench.rs
> create mode 100644 proxmox-backup-test-suite/src/main.rs
> create mode 100644 tests/pxar/backup-client-pxar-data.mpxar
> create mode 100644 tests/pxar/backup-client-pxar-data.ppxar.didx
> create mode 100644 tests/pxar/backup-client-pxar-expected.mpxar
> create mode 100644 zsh-completions/_proxmox-backup-test-suite
>
> --
> 2.39.2
>