[pbs-devel] [RFC PATCH proxmox-backup] pbs-tools: zip: add EFS flag to zip files

Sat Sep 11 17:08:47 CEST 2021

On 10.09.21 11:09, Dominik Csapak wrote:
> this flag marks the file names as 'UTF-8' encoded.
> 
> By default, encoding of file names in zips are defined as code page 437,
> but we save the filenames as bytes (like in linux fs).
> 
> For linux systems this neither would be a problem since most tools
> simply use the filenames as bytes, but for the zip utility under
> windows it's important since NTFS uses UTF-16 for file names.
> 
> Since we generate zips only on pxars (file based backup on linux) or
> via file-restore-daemons (linux; ntfs mounted as UTF-8), it's a fair
> assumption that we can mark most filenames as UTF-8.
> 
> For zips generated from linux backups to be extracted on windows it is
> impossible to do the correct thing anyway, since windows can not have
> arbitrary bytes in file names, and for each encoding chosen, there is
> some file that cannot be shown correctly.
> so either all filenames are decoded as CP437 ('ö' -> '├╢')
> or non UTF-8 encoded file-names have garbage characters in them (�)
> 
> Signed-off-by: Dominik Csapak <d.csapak at proxmox.com>
> ---
> sending as RFC since there is no way to have it correct in all cases,
> and we have to decide if we want CP437 or UTF-8 by default
> 

Yeah, it's not only that we may be incorrect; the closest thing to a ZIP spec also says
"not set == should be cp437 but meh" and "set == MUST be valid UTF-8" about this
bit:

> D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform 
> to the original ZIP character encoding.  If general purpose bit 11 is set, the 
> filename and comment MUST support The Unicode Standard, Version 4.1.0 or 
> greater using the character encoding form defined by the UTF-8 storage 
> specification.  The Unicode Standard is published by the The Unicode
> Consortium (www.unicode.org).  UTF-8 encoded data stored within ZIP files 
> is expected to not include a byte order mark (BOM). 
> 
- Appendix D, https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

Also interesting, just below the above quote:

> D.3 Applications MAY choose to supplement this file name storage through the use 
> of the 0x0008 Extra Field.  Storage for this optional field is currently 
> undefined, however it will be used to allow storing extended information 
> on source or target encoding that MAY further assist applications with file 
> name, or file content encoding tasks.  Please contact PKWARE with any
> requirements on how this field SHOULD be used.

So I'd like to know what standard tools like info-zip (i.e., Debian's "zip" package) or
other cross-platform tools like 7zip do.

It seems that at least Debian's version of Info-ZIP put some thought into this and can
(or always does, I did not check that closely) save UTF-8 filenames in an extra field, one
that some other tools may check for?

https://sources.debian.org/src/zip/3.0-12/zip.c/#L967

I say Debian's version, as upstream still talks about Unicode support on their home page,
which itself may be just outdated too, but it could also be that Debian patched that in.
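
For reference, one extra field of that kind is the Info-ZIP Unicode Path Extra Field
(header ID 0x7075, described in APPNOTE section 4.6.9). I'm assuming that is the one the
Debian source refers to; I did not verify it. Below is a minimal, untested-against-pbs-tools
sketch of how such a field could be built. The helper names are made up for illustration
and are not part of our zip.rs:

```rust
/// Bitwise CRC-32 (reflected polynomial 0xEDB88320), the checksum variant ZIP uses.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // mask is all ones if the lowest bit is set, zero otherwise
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical helper: build a 0x7075 "Unicode Path" extra field.
///
/// `header_name` are the raw bytes stored in the regular file name field,
/// `utf8_name` is the UTF-8 form of that name.
/// Returns `None` if the payload does not fit the 16-bit size field.
fn unicode_path_extra_field(header_name: &[u8], utf8_name: &str) -> Option<Vec<u8>> {
    // payload: version (1 byte) + CRC-32 of header name (4 bytes) + UTF-8 name
    let data_len = u16::try_from(1 + 4 + utf8_name.len()).ok()?;

    let mut field = Vec::with_capacity(4 + usize::from(data_len));
    field.extend_from_slice(&0x7075u16.to_le_bytes()); // header ID ("up")
    field.extend_from_slice(&data_len.to_le_bytes()); // TSize
    field.push(1); // version of this extra field
    field.extend_from_slice(&crc32(header_name).to_le_bytes()); // NameCRC32
    field.extend_from_slice(utf8_name.as_bytes()); // UnicodeName
    Some(field)
}
```

The CRC ties the extra field to the name stored in the header, so a reader can ignore the
field if the header name was changed by a tool that does not know about it. Whether the
common Windows tools actually honor this field is exactly what would need checking.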

Anyhow, it seems to me that there are some more compatible options that do not plainly
state that they're 100% UTF-8 while actually not being so sure of that, so I'd explore that
angle quite a bit more; data restoration is probably the most important aspect of a backup
system, so every way we expose for doing so should work as well as possible, even when going
outside our Linux bubble.

>  pbs-tools/src/zip.rs | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/pbs-tools/src/zip.rs b/pbs-tools/src/zip.rs
> index 605480a8..88eea07b 100644
> --- a/pbs-tools/src/zip.rs
> +++ b/pbs-tools/src/zip.rs
> @@ -34,6 +34,8 @@ const VERSION_MADE_BY: u16 = 0x032d;
>  const ZIP64_EOCD_RECORD: u32 = 0x06064B50;
>  const ZIP64_EOCD_LOCATOR: u32 = 0x07064B50;
>  
> +const GENERAL_PURUPOSE_FLAGS: u16 = (1 << 3) | (1 << 11); // EFS + Data Descriptor
> +

- typo in constant name: purupose vs. purpose
- the comment order does not match the bits used: bit 11 is EFS, and bit 3 tells
  the parser that the crc32 is not in the header but in the data descriptor after
  the compressed data; your bitwise-OR + comment order suggests otherwise.
- isn't this related to BZ entry #3618? That is mentioned neither here nor in the
  bug report...

_If_ we go down this route, then the following const name and formatting would make this
easier to read IMO:

const LFH_GENERAL_PURPOSE_FLAGS: u16 = (1 << 3) // we place crc32 in data descriptor
    | (1 << 11); // EFS, mark filenames & comments as UTF-8 (not guaranteed but more often OK than CP437)
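
As a variation on the constant above, the flags could also be computed per entry, so that
EFS is only claimed when the name bytes really are valid UTF-8. This is just a sketch of
one possible approach, not something the patch does; the constant and function names are
hypothetical:

```rust
/// Bit 3: crc32 and sizes are stored in the data descriptor after the data.
const FLAG_DATA_DESCRIPTOR: u16 = 1 << 3;
/// Bit 11 (EFS): file name and comment are UTF-8 encoded.
const FLAG_EFS: u16 = 1 << 11;

/// Hypothetical helper: general purpose flags for a single entry.
/// EFS is set only if `file_name` is valid UTF-8.
fn lfh_general_purpose_flags(file_name: &[u8]) -> u16 {
    let mut flags = FLAG_DATA_DESCRIPTOR;
    if std::str::from_utf8(file_name).is_ok() {
        flags |= FLAG_EFS;
    }
    flags
}
```

With this, a name like "ö" stored as UTF-8 gets both bits, while a lone 0xF6 byte (not
valid UTF-8) only gets the data descriptor bit and falls back to the CP437 interpretation.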