[pbs-devel] [PATCH-SERIES v4 pxar proxmox-backup proxmox-widget-toolkit 00/26] fix #3174: improve file-level backup

Fabian Grünbichler f.gruenbichler at proxmox.com
Mon Nov 13 15:23:21 CET 2023


On November 9, 2023 7:45 pm, Christian Ebner wrote:
> Changes to the patch series since version 3 are based on the feedback
> obtained via internal communication channels. Many thanks to Thomas,
> Fabian, Wolfgang and Dominik for their continuous feedback up until
> now.
> 
> This series of patches implements a metadata-based file change
> detection mechanism for improved pxar file level backup creation speed
> for unchanged files.
> 
> The chosen approach is to skip encoding of regular file payloads,
> for which metadata (currently ctime and size) did not change as
> compared to a previous backup run. Instead of re-encoding the files, a
> reference to a newly introduced appendix section of the pxar archive
> will be written. The appendix section will be created as concatenation
> of indexed chunks from the previous backup run, thereby containing the
> sequential file payload at a calculated offset with respect to the
> starting point of the appendix section.
> 
> Metadata comparison and calculation of the chunks to be indexed for the
> appendix section is performed using the catalog of a previous backup as
> reference. In order to be able to calculate the offsets, an updated
> catalog file format version 2 is introduced which extends the previous
> version by including the file offset with respect to the pxar archive
> byte stream, as well as the file's ctime. This allows finding the required
> chunk indexes and the start padding within the concatenated chunks.
> The catalog reader remains backwards compatible to the catalog file
> format version 1.
> 
> During encoding, the chunks needed for the appendix section are injected
> in the backup upload stream after forcing a chunk boundary when regular
> pxar encoding is finished. Finally, pxar archives containing an
> appendix section are marked as such by appending a final pxar goodbye
> lookup table only containing the offset to the appendix section start and
> total size of that section, needed for random access, e.g. to mount
> the archive via the fuse filesystem implementation.
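
for readers following along, the reuse decision and the chunk lookup
described in the cover letter can be sketched roughly as below. this is
only an illustration under simplifying assumptions, not the code from
the series: `CatalogEntry`, `is_unchanged` and `locate_payload` are
hypothetical names, the recorded offset is treated as the payload start,
and the previous index is modelled as a plain sorted list of chunk end
offsets.

```rust
/// Hypothetical, simplified view of a catalog v2 file entry.
#[derive(Debug, Clone, PartialEq)]
pub struct CatalogEntry {
    pub size: u64,
    pub ctime: i64,
    /// Offset of the payload within the previous pxar byte stream
    /// (assumption: the real format may record a different position).
    pub offset: u64,
}

/// A file counts as unchanged if ctime and size match the previous run.
pub fn is_unchanged(prev: &CatalogEntry, size: u64, ctime: i64) -> bool {
    prev.size == size && prev.ctime == ctime
}

/// Given the sorted chunk end offsets of the previous archive, return
/// `(first_chunk, last_chunk, start_padding)` for the byte range
/// `[offset, offset + len)`. `start_padding` is the distance between the
/// start of the first chunk and `offset`.
pub fn locate_payload(
    chunk_ends: &[u64],
    offset: u64,
    len: u64,
) -> Option<(usize, usize, u64)> {
    if len == 0 {
        return None;
    }
    let end = offset.checked_add(len)?;
    // First chunk whose end lies beyond `offset`.
    let first = chunk_ends.partition_point(|&e| e <= offset);
    // Chunk that contains the last byte of the range.
    let last = chunk_ends.partition_point(|&e| e < end);
    if last >= chunk_ends.len() {
        // The range reaches past the end of the indexed stream.
        return None;
    }
    let chunk_start = if first == 0 { 0 } else { chunk_ends[first - 1] };
    Some((first, last, offset - chunk_start))
}
```

with chunks ending at 100, 250 and 400, a 200 byte payload at offset 120
maps to chunks 1 through 2 with a start padding of 20 bytes, so only
those two chunks would have to be referenced for that file.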

some (high-level) comments focused on compatibility:

the catalog v2 format is used unconditionally at the moment. IMHO it
should be guarded/opt-in via --change-detection-method, since old
clients cannot parse it.

else, the following would happen if a client system upgrades:

- pre-upgrade backup (readable by all clients)
- upgrade
- post-upgrade backup *with --c-d-m data* (readable by all clients, but
  everything catalog related only works with new clients)
- post-upgrade backup *with --c-d-m metadata* (readable by new clients
  only)
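
to make the opt-in idea concrete, here is a minimal sketch of how the
format selection could depend on the mode. all type and function names
are made up for illustration, and "V1"/"V2" stand in for whatever the
old and new catalog/pxar format versions end up being called:

```rust
/// Values of --change-detection-method as used in the scenario above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChangeDetection {
    Data,
    Metadata,
}

/// Hypothetical stand-in for the catalog/pxar format generation.
/// Variant order matters: the derived ordering gives V1 < V2.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum FormatVersion {
    V1,
    V2,
}

/// Opt-in: only the metadata mode selects the new formats.
pub fn format_for(mode: ChangeDetection) -> FormatVersion {
    match mode {
        ChangeDetection::Data => FormatVersion::V1,
        ChangeDetection::Metadata => FormatVersion::V2,
    }
}

/// A client can read a snapshot if it supports at least the format used.
pub fn can_read(client_max: FormatVersion, snapshot: FormatVersion) -> bool {
    snapshot <= client_max
}
```

with this guard, a backup made in data mode by a new client stays fully
readable for old clients, catalog included, and only metadata mode
backups require a new client.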

since the pxar format itself also changes (new entry types), it should
also be bumped (see below). if the new formats are then only used with
the new metadata mode, both new formats are effectively opt-in (until we
make that the default mode). having the incompatibility between old and
new clients encoded right in the magic value in the header also means we
don't spend time downloading indices and chunks only to notice at some
random point within the restore that we actually don't know how to parse
this particular pxar archive.
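
a rough sketch of such an early check, assuming the format is
identified by an 8 byte magic at the very start of the stream. the
constants below are placeholders and not the real pxar or catalog magic
values, and `check_magic` is a hypothetical helper:

```rust
use std::io::{self, Read};

/// Placeholder magic values, NOT the real pxar/catalog constants.
pub const MAGIC_V1: [u8; 8] = *b"EXAMPLE1";
pub const MAGIC_V2: [u8; 8] = *b"EXAMPLE2";

/// Read the magic at the start of the stream and reject anything this
/// client cannot parse before the rest of the stream is consumed.
pub fn check_magic<R: Read>(
    mut reader: R,
    supported: &[[u8; 8]],
) -> io::Result<[u8; 8]> {
    let mut magic = [0u8; 8];
    // Fails with UnexpectedEof if the stream is shorter than 8 bytes.
    reader.read_exact(&mut magic)?;
    if supported.contains(&magic) {
        Ok(magic)
    } else {
        Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("got unexpected magic number {:02x?}", magic),
        ))
    }
}
```

an old client would call this with only `MAGIC_V1` in `supported` and
bail out right after the first few bytes, instead of failing somewhere
in the middle of a restore.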

an additional bonus point - tools like pxar and proxmox-backup-debug
could also list the raw+parsed magic value, and in general, error
messages like:

 Error: got unexpected magic number for catalog

are a lot easier to grasp than (pxar extract)

 Error: encountered unexpected error during extraction

or (proxmox-backup-client restore)

 Error: error extracting archive - encountered unexpected error during extraction

the magic values could also be backported to the oldstable client
version, to make the error messages even better ("known unsupported" vs
"unexpected").
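
the "known unsupported" vs "unexpected" distinction could look roughly
like the following on the oldstable side. again just a sketch: the
numeric magic values are invented placeholders, and `MagicCheck`,
`classify` and `check` are hypothetical names.

```rust
/// Outcome of looking at a magic value.
#[derive(Debug, PartialEq, Eq)]
pub enum MagicCheck {
    Supported,
    /// Recognized, but this client version cannot parse the format.
    KnownUnsupported(&'static str),
    Unexpected,
}

// Placeholder values, NOT the real magic numbers.
const SUPPORTED: &[u64] = &[0x1111];
const KNOWN_UNSUPPORTED: &[(u64, &str)] = &[(0x2222, "catalog v2")];

pub fn classify(magic: u64) -> MagicCheck {
    if SUPPORTED.contains(&magic) {
        return MagicCheck::Supported;
    }
    match KNOWN_UNSUPPORTED.iter().find(|(m, _)| *m == magic) {
        Some((_, name)) => MagicCheck::KnownUnsupported(name),
        None => MagicCheck::Unexpected,
    }
}

/// Turn the classification into a user-facing error message.
pub fn check(magic: u64) -> Result<(), String> {
    match classify(magic) {
        MagicCheck::Supported => Ok(()),
        MagicCheck::KnownUnsupported(name) => Err(format!(
            "format '{}' (magic {:#018x}) is not supported by this client version",
            name, magic
        )),
        MagicCheck::Unexpected => {
            Err(format!("got unexpected magic number {:#018x}", magic))
        }
    }
}
```

the backport would then only consist of the table of known magic
values, without any of the new parsing code.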

in general, UX wise it might be nice to mark backups using the new mode,
although I am not sure how specifically (some variants - just the
version/mode, archives, archives+snapshots, ..?).

one more peculiarity I noted while testing - doing three backups in a
row without changing the input tree at all:

- old client
- new client, mode data
- new client, mode metadata

the last snapshot has a bigger "logical" size, e.g., when doing this for
my kernel clone (6.8G), the first two have a logical size of 7.736 GiB,
while the last one is 8.064 GiB. for smaller input dirs, the effect is
even more pronounced: a 56M dir with 10 dirs with one file each is
listed as 55M for the first two, and 97.989 MiB for the last one (almost
double the size!). the resulting pxar archives are actually this size,
so I guess there is some optimization potential still left for this
particular case. the actual (deduplicated) difference is just two (small
test case) / eight (linux) very small chunks, so I hope this issue is
mostly cosmetic, unless one really goes down the "download pxar file,
extract manually" route.

I hope to do some more in-depth testing and code review over the course
of the week!




