[pbs-devel] [PATCH v8 pxar proxmox-backup 00/69] fix #3174: improve file-level backup
Christian Ebner
c.ebner at proxmox.com
Tue May 28 11:41:54 CEST 2024
This series of patches implements a metadata-based file change
detection mechanism to speed up pxar file-level backup creation
for unchanged files.
The chosen approach is to split pxar archives on creation via the
proxmox-backup-client into two separate data and upload streams,
one exclusive for regular file payloads, the other one for the rest
of the pxar archive, which is mostly metadata.
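To illustrate the naming of the two parts: the test files added by this
series use the `.mpxar` (metadata) and `.ppxar` (payload) extensions. A
minimal sketch of such a name mapping follows. The function name
`split_archive_names` is hypothetical and is not the helper added to the
client; it only shows the idea.

```rust
/// Hypothetical helper (assumption, not the client's actual code):
/// map a `<label>.pxar` archive name to the names of its metadata
/// and payload parts.
fn split_archive_names(name: &str) -> Option<(String, String)> {
    // Only names ending in `.pxar` with a non-empty label are mapped.
    let base = name.strip_suffix(".pxar")?;
    if base.is_empty() {
        return None;
    }
    Some((format!("{base}.mpxar"), format!("{base}.ppxar")))
}
```

Calling `split_archive_names("root.pxar")` yields the pair
`root.mpxar` / `root.ppxar`, while a name without the `.pxar` extension
yields `None`.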
On consecutive runs, the metadata archive of the previous backup run,
which is limited in size and can therefore be accessed rapidly, is used
to look up and compare the metadata for entries to encode.
This assumes that the connection speed to the Proxmox Backup Server is
sufficiently fast, allowing the download and caching of the chunks for
that index.
Changes to regular files are detected by comparing the file's entire
metadata object, including mtime, ACLs, etc. If no changes are detected,
the previous payload index is used to look up chunks to possibly re-use
in the payload stream of the new archive.
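The comparison described above can be sketched roughly as follows. The
`EntryMetadata` struct and its fields are a simplified, hypothetical
stand-in for the pxar metadata object (only mtime and ACLs are named in
this cover letter, the other fields are assumptions), and
`detect_change` is not a function from the patches.

```rust
/// Simplified, hypothetical stand-in for the metadata of an entry.
/// The real pxar metadata object carries more information.
#[derive(Debug, Clone, PartialEq, Eq)]
struct EntryMetadata {
    mode: u32,
    uid: u32,
    gid: u32,
    mtime_secs: i64,
    acls: Vec<String>,
}

/// Outcome of the comparison for a regular file.
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    /// Metadata is unchanged, the previous payload chunks may be reused.
    ReuseCandidate,
    /// No previous entry or metadata differs, the file must be reencoded.
    Reencode,
}

/// Compare the whole metadata object of the previous and current entry.
fn detect_change(previous: Option<&EntryMetadata>, current: &EntryMetadata) -> Decision {
    match previous {
        Some(prev) if prev == current => Decision::ReuseCandidate,
        _ => Decision::Reencode,
    }
}
```

Comparing the derived `PartialEq` of the whole struct mirrors the idea
that any difference in the metadata, not only the mtime, marks the file
as changed.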
In order to reduce possible chunk fragmentation, the decision whether to
reuse or reencode a file payload is deferred until enough information
is gathered by adding entries to a look-ahead cache. If the padding
introduced by reusing chunks falls below a threshold, the entries are
referenced, the chunks are reused and injected into the pxar payload
upload stream, otherwise they are discarded and the files encoded
regularly.
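As a rough sketch of this decision, assuming the threshold is expressed
as a ratio of padding bytes to reused chunk bytes (the cover letter does
not state how the threshold is defined, so both the ratio and the
function `should_reuse` are assumptions made for illustration):

```rust
/// Decide whether cached entries should reference reused chunks.
///
/// `reused_chunk_bytes` is the total size of the chunks that would be
/// injected, `referenced_payload_bytes` the part of them that is
/// actually referenced by the cached entries. The difference is padding.
fn should_reuse(
    reused_chunk_bytes: u64,
    referenced_payload_bytes: u64,
    max_padding_ratio: f64,
) -> bool {
    // Nothing to reuse, or inconsistent accounting: encode regularly.
    if reused_chunk_bytes == 0 || referenced_payload_bytes > reused_chunk_bytes {
        return false;
    }
    let padding = reused_chunk_bytes - referenced_payload_bytes;
    (padding as f64 / reused_chunk_bytes as f64) < max_padding_ratio
}
```

With a hypothetical threshold of 0.1, reusing 1000 bytes of chunks for
950 bytes of referenced payload would be accepted, while referencing
only 500 of those bytes would lead to the files being reencoded.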
Patches 16 and 17 are to be applied before the patches to the pxar
repository, while patches 14 and 15 are to be applied to the pxar repository
only after patch 52 in the series, for the patches to compile in a sequential
chain.
The following lists the most notable changes included in this series since
the version 7:
- Fixed incorrectly squashed patches during rebase
The following lists the most notable changes included in this series since
the version 6:
- Allow using the `.pxar` extension in CLI commands for convenience
- Refactor the input/output interface for the pxar encoder, decoder and
accessor to use a `PxarVariant` enum, in order to guarantee the
payload-related input/output is always attached for split archives.
- Refactor the lookahead caching logic in the pxar `Archiver` to
improve overall code readability.
- Add helper method for file name matching and use it where possible,
for it to be handled in a single place.
- Extend documentation to include additional information about which
metadata is compared to the previous snapshot
- Fix an issue with `pxar list`, which failed for metadata-only
pxar archives.
- Fix an issue in the payload chunker test where the context was not
updated accordingly.
- Various clippy fixes, smaller refactoring and reordering of patches
The following lists the most notable changes included in this series since
the version 5:
- Fix an issue where the payload chunker was not correctly reset after
suggested or forced boundaries.
- Added regression tests for payload chunker and chunk stream.
The following lists the most notable changes included in this series since
the version 4:
- Increase the open file handle limit to the hard limit and adapt the
lookahead cache size dynamically (thanks a lot to Thomas for pointing
this out and providing the necessary background information). This
helps to reuse multiple entries contained within the same chunk, which
would otherwise exceed the padding threshold and therefore be
reencoded instead.
- Fix payload chunker scan to only scan up until chunk pos in case a
suggested boundary is chosen.
- Fix issue with decoder state not being set to the correct
`InDirectory` state after reading prelude and getting root directory
entry.
- Fix issue with kept back chunk injection when the chunk follows a
range discontinuity.
- Add regression test for pxar create with metadata archive and payload
index reference.
The following lists the most notable changes included in this series since
the version 3:
- Rework the whole reused chunk injection and accounting logic and use
lockless async `mpsc::channel`s instead of `Arc<Mutex<VecDeque<..>>>`.
- Reworked lookahead caching logic to use payload ranges and check for
possible range continuation instead of looking up the reusable dynamic
entries immediately in case of a reusable entry chain. This also
avoids edge cases not covered in the previous version of the patch series.
This current version therefore tends to reencode small files more
aggressively, since they might introduce additional unwanted paddings.
- Also correctly cover hardlinks in the reuse logic, avoiding
reencoding of these entries.
- Add additional dedicated chunker implementation for payload data
stream, allowing the archiver to suggest boundaries to the chunker to
reduce padding for reused chunks.
- Add additional `change-detection-mode=data`, in order to allow
creating split archives with fully reencoded payload data.
- Add additional payload input readers for pxar accessor type
implementations where needed.
- Add additional consistency check in pxar encoder when dropping state
or encoder instance.
- CliParams was renamed to the more opaque Prelude, since the pxar
archive does not care about its contents and this might be extended to
store other information about the archive as well.
- Add missing proxmox-file-restore for split archives and fix restore of
tar/zip archives via WebUI. This is handled by the same decoder logic,
and needed an updated payload input content range to read the data
from the correct location in the payload data archive.
- Additional refactoring to use the pxar reader helpers where possible.
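The range continuation check mentioned in the list above can be
sketched as follows. This is only an illustration under the assumption
that payload ranges are half-open byte ranges; `try_extend` and
`coalesce` are hypothetical names and not part of the patches.

```rust
use std::ops::Range;

/// Extend `current` by `next` if `next` starts exactly where `current`
/// ends. Returns true when the range was continued.
fn try_extend(current: &mut Range<u64>, next: &Range<u64>) -> bool {
    if next.start == current.end {
        current.end = next.end;
        true
    } else {
        false
    }
}

/// Merge payload ranges of consecutive entries into contiguous ranges.
/// A new range is started at every discontinuity.
fn coalesce(ranges: &[Range<u64>]) -> Vec<Range<u64>> {
    let mut out: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        if let Some(last) = out.last_mut() {
            if try_extend(last, r) {
                continue;
            }
        }
        out.push(r.clone());
    }
    out
}
```

For the input ranges `0..10`, `10..25`, `40..50` and `50..60` this
produces two contiguous ranges, `0..25` and `40..60`, so the chunk
lookup would only be needed once per contiguous range instead of once
per entry.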
The following lists the most notable changes included in this series since
the version 2:
- many bugfixes regarding incorrect archive encoding by wrong offset
generation, adding additional sanity checks and rather fail on
encoding than produce an incorrectly encoded archive
- different approach for deciding whether to reuse or reencode the
entries. Previously, the entries have been encoded when a cached
payload size threshold was reached. Now, the padding introduced by
reusable chunks is tracked, and only if the padding does not exceed
the set threshold, the entries are reused. This reduces the possible
padding, at the cost of reencoding more entries. This also avoids
re-using chunks which now have large padding holes because of
moved/removed files contained within.
- added headers for metadata archive and payload file
- added documentation
An invocation of a backup run with these patches now is:
```bash
proxmox-backup-client backup <label>.pxar:<source-path> --change-detection-mode=metadata
```
During the first run, no reference index is available; the pxar archive
will nevertheless be split into the two parts.
Subsequent backups will utilize the pxar archive accessor and
index files of the previous run to perform file change detection.
As benchmarks, the Linux source code as well as the COCO dataset for
computer vision and pattern recognition can be used.
The benchmarks can be performed by running:
```bash
proxmox-backup-test-suite detection-mode-bench prepare --target /<path-to-bench-source-target>
proxmox-backup-test-suite detection-mode-bench run linux.pxar:/<path-to-bench-source-target>/linux
proxmox-backup-test-suite detection-mode-bench run coco.pxar:/<path-to-bench-source-target>/coco
```
The above command invocations assume the default repository and
credentials to be set as environment variables; they can however be
passed as additional optional parameters instead.
pxar:
Christian Ebner (15):
decoder: factor out skip part from skip_entry
lib: add type for input/output variant differentiation
encoder: move to stack based state tracking
format/examples: add header type `PXAR_PAYLOAD_REF`
decoder: add method to read payload references
encoder: allow split output writer for archive creation
decoder/accessor: allow for split input stream variant
decoder: set payload input range when decoding via accessor
encoder: add payload reference capability
encoder: add payload position capability
encoder: add payload advance capability
encoder/format: finish payload stream with marker
format: add payload stream start marker
format/encoder/decoder: new pxar entry type `Version`
format/encoder/decoder: new pxar entry type `Prelude`
examples/apxar.rs | 2 +-
examples/mk-format-hashes.rs | 21 ++
examples/pxarcmd.rs | 7 +-
src/accessor/aio.rs | 10 +-
src/accessor/mod.rs | 120 +++++++--
src/accessor/sync.rs | 8 +-
src/decoder/aio.rs | 13 +-
src/decoder/mod.rs | 249 ++++++++++++++---
src/decoder/sync.rs | 21 +-
src/encoder/aio.rs | 90 +++++--
src/encoder/mod.rs | 508 ++++++++++++++++++++++++++---------
src/encoder/sync.rs | 75 +++++-
src/format/mod.rs | 63 +++++
src/lib.rs | 71 +++++
tests/compat.rs | 3 +-
tests/simple/fs.rs | 8 +-
tests/simple/main.rs | 11 +-
17 files changed, 1027 insertions(+), 253 deletions(-)
proxmox-backup:
Christian Ebner (54):
client: backup: factor out extension from backup target
api: datastore: refactor getting local chunk reader
client: pxar: switch to stack based encoder state
client: pxar: combine writers into struct
client: pxar: optionally split metadata and payload streams
client: helper: add helpers for creating reader instances
client: helper: add method for split archive name mapping
client: tools: helper to check pxar filename extensions
client: restore: read payload from dedicated index
tools: cover extension for split pxar archives
restore: cover extension for split pxar archives
client: mount: make split pxar archives mountable
api: datastore: attach split archive payload chunk reader
catalog: shell: make split pxar archives accessible
www: cover metadata extension for pxar archives
file restore: factor out getting pxar reader
file restore: cover split metadata and payload archives
file restore: show more error context when extraction fails
pxar: add optional payload input for archive restore
pxar: cover listing for split archives
pxar: add more context to extraction error
client: pxar: include payload offset in entry listing
pxar: show padding in debug output on archive list
datastore: dynamic index: add method to get digest
client: pxar: helper for lookup of reusable dynamic entries
upload stream: implement reused chunk injector
client: chunk stream: add struct to hold injection state
chunker: add method to reset chunker state
client: streams: add channels for dynamic entry injection
specs: add backup detection mode specification
client: implement prepare reference method
client: pxar: add method for metadata comparison
pxar: caching: add look-ahead cache
client: pxar: refactor catalog encoding for directories
fix #3174: client: pxar: enable caching and meta comparison
client: backup writer: add injected chunk count to stats
pxar: create: keep track of reused chunks and files
pxar: create: show chunk injection stats debug output
client: pxar: add helper to handle optional preludes
client: pxar: opt encode cli exclude patterns as Prelude
pxar: ignore version and prelude entries in listing
docs: file formats: describe split pxar archive file layout
docs: add section describing change detection mode
test-suite: add detection mode change benchmark
test-suite: Makefile: add debian package and related files
datastore: chunker: add Chunker trait
datastore: chunker: implement chunker for payload stream
client: chunk stream: switch payload stream chunker
client: pxar: allow to restore prelude to optional path
client: pxar: add archive creation with reference test
client: tools: add helper to raise nofile rlimit
client: pxar: set cache limit based on nofile rlimit
chunker: tests: add regression tests for payload chunker
chunk stream: tests: add regression tests for payload chunker
Cargo.toml | 1 +
Makefile | 18 +-
debian/control | 7 +
debian/proxmox-backup-client.bash-completion | 1 +
debian/proxmox-backup-test-suite.bc | 8 +
debian/proxmox-backup-test-suite.install | 3 +
docs/Makefile | 2 +
docs/backup-client.rst | 45 +
docs/command-line-tools.rst | 5 +
docs/command-syntax.rst | 4 +
docs/conf.py | 1 +
docs/file-formats.rst | 46 +
docs/meta-format-overview.dot | 50 +
.../proxmox-backup-test-suite/description.rst | 2 +
docs/proxmox-backup-test-suite/man1.rst | 17 +
docs/technical-overview.rst | 3 +
examples/test_chunk_size.rs | 9 +-
examples/test_chunk_speed.rs | 7 +-
examples/test_chunk_speed2.rs | 2 +-
pbs-client/src/backup_specification.rs | 26 +
pbs-client/src/backup_writer.rs | 118 ++-
pbs-client/src/chunk_stream.rs | 238 ++++-
pbs-client/src/inject_reused_chunks.rs | 129 +++
pbs-client/src/lib.rs | 3 +-
pbs-client/src/pxar/create.rs | 911 +++++++++++++++++-
pbs-client/src/pxar/extract.rs | 28 +-
pbs-client/src/pxar/look_ahead_cache.rs | 165 ++++
pbs-client/src/pxar/mod.rs | 5 +-
pbs-client/src/pxar/tools.rs | 123 ++-
pbs-client/src/pxar_backup_stream.rs | 71 +-
pbs-client/src/tools/mod.rs | 69 +-
pbs-datastore/src/chunker.rs | 267 ++++-
pbs-datastore/src/dynamic_index.rs | 14 +-
pbs-datastore/src/lib.rs | 2 +-
pbs-pxar-fuse/src/lib.rs | 2 +-
proxmox-backup-client/src/catalog.rs | 29 +-
proxmox-backup-client/src/helper.rs | 114 +++
proxmox-backup-client/src/main.rs | 291 +++++-
proxmox-backup-client/src/mount.rs | 33 +-
proxmox-backup-test-suite/Cargo.toml | 18 +
.../src/detection_mode_bench.rs | 294 ++++++
proxmox-backup-test-suite/src/main.rs | 17 +
proxmox-file-restore/src/main.rs | 73 +-
.../src/proxmox_restore_daemon/api.rs | 20 +-
pxar-bin/src/main.rs | 85 +-
src/api2/admin/datastore.rs | 48 +-
src/api2/tape/restore.rs | 22 +-
src/bin/proxmox_backup_debug/diff.rs | 2 +-
src/tape/file_formats/snapshot_archive.rs | 8 +-
tests/catar.rs | 7 +-
tests/pxar/backup-client-pxar-data.mpxar | Bin 0 -> 15070 bytes
tests/pxar/backup-client-pxar-data.ppxar.didx | Bin 0 -> 8096 bytes
tests/pxar/backup-client-pxar-expected.mpxar | Bin 0 -> 15086 bytes
www/datastore/Content.js | 6 +-
zsh-completions/_proxmox-backup-test-suite | 13 +
55 files changed, 3145 insertions(+), 337 deletions(-)
create mode 100644 debian/proxmox-backup-test-suite.bc
create mode 100644 debian/proxmox-backup-test-suite.install
create mode 100644 docs/meta-format-overview.dot
create mode 100644 docs/proxmox-backup-test-suite/description.rst
create mode 100644 docs/proxmox-backup-test-suite/man1.rst
create mode 100644 pbs-client/src/inject_reused_chunks.rs
create mode 100644 pbs-client/src/pxar/look_ahead_cache.rs
create mode 100644 proxmox-backup-client/src/helper.rs
create mode 100644 proxmox-backup-test-suite/Cargo.toml
create mode 100644 proxmox-backup-test-suite/src/detection_mode_bench.rs
create mode 100644 proxmox-backup-test-suite/src/main.rs
create mode 100644 tests/pxar/backup-client-pxar-data.mpxar
create mode 100644 tests/pxar/backup-client-pxar-data.ppxar.didx
create mode 100644 tests/pxar/backup-client-pxar-expected.mpxar
create mode 100644 zsh-completions/_proxmox-backup-test-suite
--
2.39.2