[pve-devel] [RFC PATCH pve-storage/common] fix #3256: allow special characters in storage-related config files

Mon Feb 17 11:15:29 CET 2025

On 14.02.25 at 16:40, Laurențiu Leahu-Vlăducu wrote:
> 
> This patch series fixes bug #3256:
> 
> 1. It ensures that general config files (e.g. storage.cfg) are decoded
>    from UTF-8 when deserialized. Previously, no decoding happened,
>    meaning that Perl interpreted the string as single bytes instead of
>    Unicode code points. Note: while I would have preferred to decode
>    the text right after reading from the file, there are some Perl
>    functions like Digest::SHA::sha1_hex that expect bytes
>    instead of decoded character strings.

What about pre-existing configs that are not UTF-8? Not breaking those
is very important here.
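
One possible way to keep such files readable, sketched in Python for brevity (the actual parsers are Perl, and the helper name `decode_config` is hypothetical): try a strict UTF-8 decode first and fall back to Latin-1 only if that fails. This is an assumption about a possible approach, not what the series currently does.

```python
def decode_config(raw: bytes) -> str:
    """Decode config file bytes, preferring UTF-8 and falling back to Latin-1.

    Hypothetical helper, for illustration only.
    """
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Latin-1 maps each byte 0x00-0xFF to the code point with the same
        # value, so this fallback can never raise.
        return raw.decode("latin-1")


utf8_line = "mp1: /root/ö,mp=/o\n".encode("utf-8")
legacy_line = "mp1: /root/ö,mp=/o\n".encode("latin-1")

# Both encodings yield the same decoded text.
assert decode_config(utf8_line) == decode_config(legacy_line)
```

A known limitation of this heuristic: a Latin-1 file whose high bytes happen to form valid UTF-8 (for example the bytes 0xC3 0xB6, which are "Ã¶" in Latin-1) would be read as UTF-8 instead.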

> 
> 2. It ensures that general config files are explicitly encoded
>    as UTF-8 before serialization to prevent similar issues the other
>    way around.
> 
> 3. It adds a unit test to prevent similar issues from happening in
>    the future.
> 
> 4. It fixes the PBS storage plugin for serializing/deserializing the
>    password, similar to points 1 and 2, but for the case where the
>    password itself contains Unicode characters.
> 
> For more information on this topic, please read:
> https://perldoc.perl.org/perlunifaq#When-should-I-decode-or-encode?
> 
> I'm sending this patch series to begin a discussion on how to handle
> encodings in our config files, and eventually also other relevant
> files. In my opinion, we should handle them consistently as UTF-8,
> also over both Perl and Rust code.

Yes, that is the long-term plan AFAIK, but right now existing config
files might be encoded differently.

> 
> Since Linux, as well as browsers* and other software, has used UTF-8
> by default for a long time, I doubt that we have to worry too much
> about other encodings like Latin-1 (ISO-8859-1). However, according to
> the Perl documentation, Perl could have deserialized such a string in
> the past (since Latin-1 is the default in Perl when not decoding
> explicitly), and it is no longer able to after the fixes included in
> this patch series.

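Unfortunately, we do have to worry about that. For example, the
following currently results in a config file that is not UTF-8: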

> [I] root@pve8a1 ~# pct set 112 --mp1 /root/ö,mp=/o
> [I] root@pve8a1 ~# file /etc/pve/lxc/112.conf
> /etc/pve/lxc/112.conf: ISO-8859 text
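
For reference, a small Python sketch of the bytes involved (an illustration, not read from the actual file). In Latin-1, ö is the single byte 0xF6, which is not valid UTF-8, so a strict UTF-8 decode of such a file fails. That a lone 0xF6 is what ended up in 112.conf is an assumption consistent with the `file` output above, not something verified here.

```python
path = "/root/ö"

latin1_bytes = path.encode("latin-1")  # ö becomes the single byte 0xF6
utf8_bytes = path.encode("utf-8")      # ö becomes the two bytes 0xC3 0xB6


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin1_bytes, is_valid_utf8(latin1_bytes))  # b'/root/\xf6' False
print(utf8_bytes, is_valid_utf8(utf8_bytes))      # b'/root/\xc3\xb6' True
```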

> 
> We have to ask ourselves:
> 
> a. Do we want to define, in general, that configuration files should
>    always be serialized and deserialized as UTF-8? If yes, should we
>    consider this a breaking change?

Yes, see above.

> 
> b. Do we want to introduce any backward-compatibility for existing
>    config files? In other words, assume that older files might have
>    used other encodings in the past. To be honest, I didn't test
>    Latin-1 encoded files yet, so I'm not sure how (or if) our
>    current code would handle it.

Yes, we certainly need to.
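
A rough sketch of how backward-compatible reading could be combined with UTF-8-only writing, again in Python for illustration only (`parse_raw` and `serialize` are hypothetical names, not functions from pve-storage or pve-common). The digest is computed over the raw bytes before any decoding, in line with the note above that Digest::SHA::sha1_hex expects bytes.

```python
import hashlib


def parse_raw(raw: bytes) -> tuple[str, str]:
    """Return (decoded text, SHA-1 hex digest of the raw bytes)."""
    # Hash the bytes exactly as they are stored, before decoding.
    digest = hashlib.sha1(raw).hexdigest()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed legacy encoding for files that are not valid UTF-8.
        text = raw.decode("latin-1")
    return text, digest


def serialize(text: str) -> bytes:
    """Always write UTF-8, whatever encoding the file was read from."""
    return text.encode("utf-8")


legacy = "mp1: /root/ö,mp=/o\n".encode("latin-1")
text, digest = parse_raw(legacy)
rewritten = serialize(text)

# After one read/write cycle the file is UTF-8 and decodes to the same text.
assert parse_raw(rewritten)[0] == text
```

Note that the digest of the rewritten file differs from the digest of the legacy file, since the underlying bytes change. Whether that matters for existing digest checks would need to be looked at separately.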

> 
> There are further parsers and plugins that I still need to modify,
> but I first wanted to get your feedback on this subject.
> 
> 
> * By browsers I mean the encoding used in HTML, not the JavaScript
> internals with their UTF-16 encoding.