[pbs-devel] [RFC PATCH proxmox-backup] pbs-tools: zip: add EFS flag to zip files

Mon Sep 13 09:25:52 CEST 2021

On 13.09.21 09:14, Dominik Csapak wrote:
> On 9/11/21 17:08, Thomas Lamprecht wrote:
>> On 10.09.21 11:09, Dominik Csapak wrote:
>> Also interesting, just below above quote:
>>
>>> D.3 Applications MAY choose to supplement this file name storage through the use
>>> of the 0x0008 Extra Field.  Storage for this optional field is currently
>>> undefined, however it will be used to allow storing extended information
>>> on source or target encoding that MAY further assist applications with file
>>> name, or file content encoding tasks.  Please contact PKWARE with any
>>> requirements on how this field SHOULD be used.
> 
> AFAIU, that part of the 'spec' is not open and would require a license of PKWARE

yeah that's what I figured, but observe and replicate could in theory still be OK I guess?

> 
>>
>>
>> So I'd like to know what standard tools like info-zip (i.e., Debian's "zip" package) or
>> other cross-platform tools like 7zip do.
>>
>> It seems that at least Debian's version of info zip had some thoughts about this and can
>> (or always does, did not checked that closely) safe utf8 filenames in an extra field, one
>> that some other tools maybe check for?
>>
>> https://sources.debian.org/src/zip/3.0-12/zip.c/#L967
>>
>> I say Debian's version, as upstream still talks about Unicode support on their home page,
>> which itself may be just outdated too, but it could also be that Debian patched that in.
>>
>> Any how, it seems to me that there'd be some more compatible options that do not plainly
>> state that they're 100% utf-8 while actually not being so sure of that, so I'd explore that
>> angle quite some more; data restoration is probably the most important aspect of a backup
>> system - so every way we expose doing so should work as as good as possible - even if going
>> outside our Linux bubble.
> 
> I tested it with debians zip (Info-Zip), and despite was the spec says,
> i could not get it to write an 'extra-field' with the filenames.
> 

hmm, meh...

> Instead, when it finds a filename that has a filename with any byte that
> has the high-bit set (>127) it sets the EFS bit in the filename iff
> the filename is valid utf-8, and does nothing if not.
> 
> IMHO, that sounds like a reasonable thing to do, so i'd suggest
> that we test each filename for valid UTF-8, and set the bit
> if that's true. This marks ASCII only filenames also as UTF-8,
> but thats technically true and simplifies the rust code a bit
> (and should be faster for UTF-8 Filenames,
> since we do not have to check for a high bit first and then try to
> convert...)
> 
> Does that sound ok to you?

Yes, sounds reasonable to me for nwo. It'd be an improvement and we basically only use
this for on-the-fly generation, so if we find a possible "better" (hard to measure that
with such a mess of format) approach in the future it would help all users immediately
too then.

> 
> For non-valid UTF-8 filenames that have a high bit i'd produce
> CP437 filenames on windows, and on linux it'd just be the
> byte value.

just to be sure: what does "i'd produce" mean here? As we certainly do not differ ZIP
generation semantics by user OS (thank goodness), this means that a unzip by most
reasonable tools would produce the outcome you described, or? 

> 
>>
>>>   pbs-tools/src/zip.rs | 6 ++++--
>>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/pbs-tools/src/zip.rs b/pbs-tools/src/zip.rs
>>> index 605480a8..88eea07b 100644
>>> --- a/pbs-tools/src/zip.rs
>>> +++ b/pbs-tools/src/zip.rs
>>> @@ -34,6 +34,8 @@ const VERSION_MADE_BY: u16 = 0x032d;
>>>   const ZIP64_EOCD_RECORD: u32 = 0x06064B50;
>>>   const ZIP64_EOCD_LOCATOR: u32 = 0x07064B50;
>>>   +const GENERAL_PURUPOSE_FLAGS: u16 = (1 << 3) | (1 << 11); // EFS + Data Descriptor
>>> +
>>
>> - typo in constant name: purupose vs. purpose
> 
> yeah thanks...
> 
>> - comment order do not match the bits used, bit 11 is EFS and bit 3 is telling
>>    the parser that the crc32 is not in the header but in the data descriptor after
>>    the compressed data; your bitwise-OR+comment order suggests different.
> 
> sorry
> 
>> - isn't this related to BZ entry #3618, but that is neither mentioned here nor in the
>>    bug report...
> 
> that bug-report wasn't there when i wrote the patch.

I naturally only compared mail vs BZ time and missed the date difference of one day,
argh, sorry.

> 
>>
>> _If_ we'd go down this way then the following const name and formatting would make this
>> easier to read IMO:
>>
>> const LFH_GENERAL_PURPOSE_FLAGS: u16 = (1 << 3) // we place crc32 in data descriptor
>>      | (1 << 11); // EFS, mark filenames & comments as UTF-8 (not guaranteed but more often OK than CP437)
>>
> 
> makes, sense, though if we only set it conditionally i'd split the
> EFS_MARK into its own constant.
> 

sounds good to me.