[pve-devel] [PATCH v3 many 00/66] fix #4156: introduce new notification system

Wed Jul 19 11:54:25 CEST 2023

On Wed, Jul 19, 2023 at 10:40:09AM +0200, Lukas Wagner wrote:
> Hi again,
> 
> On 7/18/23 14:34, Dominik Csapak wrote:
> > * i found one bug, but not quite sure yet where it comes from exactly,
> >    putting in emojis into a field (e.g. a comment or author) it's accepted,
> >    but editing a different entry fails with:
> > 
> > --->8---
> > could not serialize configuration: writing 'notifications.cfg' failed: detected unexpected control character in section 'testgroup' key 'comment' (500)
> > ---8<---
> > 
> > not sure where the utf-8 info gets lost. (or we could limit all fields to ascii?)
> > such a notification target still works AFAICT (but if set as e.g. the author it's
> > probably the wrong value)
> > 
> > (i used 😀 as a test)
> 
> So I investigated a bit and found a minimal reproducer. Turns out it's an encoding issue
> in the FFI interface (perl->rust).
> 
> Let's assume that we have the following exported function in the pve-rs bindings:
> 
>   #[export]
>   fn test_emoji(name: &str) {
>       dbg!(&name);
>   }
> 
> 
> 
>   use PVE::RS::Notify;
>   my $str = "😊";

Without `use utf8;`, this produces a "byte string":

    $ perl -MDevel::Peek -e 'my $str = "😊"; Dump($str);'
    SV = PV(0x5576f4e0cea0) at 0x5576f4e39370
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x5576f4e424d0 "\xF0\x9F\x98\x8A"\0
      CUR = 4
      LEN = 10
      COW_REFCNT = 1

Note the four raw bytes \xF0\x9F\x98\x8A, and that FLAGS does not
include UTF8.

>   PVE::RS::Notify::test_emoji($str);
> 
> 
>   root@pve:~# perl test.pl
>   [src/notify.rs:562] &name = "ð\u{9f}\u{98}\u{8a}"

Note the `\u` portions here. This string contains the *UTF-8* encoding
of the four characters U+00F0, U+009F, U+0098 and U+008A, one character
per input byte.

And how is it supposed to know any better?
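
To illustrate where that output comes from, here is a minimal sketch in
Rust. It assumes the binding layer maps every byte of a perl string
without the UTF8 flag to the code point of the same value. The helper
name `bytes_as_chars` is made up for this example and is not perlmod's
actual API:

```rust
/// Hypothetical helper (not part of perlmod): treat every byte as the
/// code point with the same value, i.e. Latin-1 semantics, which is how
/// perl interprets a string that does not carry the UTF8 flag.
fn bytes_as_chars(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}
```

Fed the four bytes from the dump above (0xF0, 0x9F, 0x98, 0x8A), this
yields a 4-character string whose `Debug` form is the same
"ð\u{9f}\u{98}\u{8a}" that `dbg!` printed.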

> 
> To me it looks a bit like a UTF-16/UTF-8 mixup:
> 
> ð = 0x00F0 in UTF16
> 😊 = 0xF0 0x9F 0x98 0x8A in UTF-8
> 
> The issue can be fixed by doing a `$str = encode('utf-8', $str);` before calling
> `test_emoji`.

Perl and most of our perl code never cared about encodings. Hence we
have already run into a bunch of utf-8 issues, and for a long time we
got the whole "transport encoding vs actual encoding" question in HTTP
vs JS vs json vs perl strings completely *wrong* (and probably still
do). On top of that, a lot of *files* aren't even *defined* to have a
specific encoding (e.g. interpreting bytes >=0x80 from
`/etc/network/interfaces` as utf-8 may simply be the *wrong* thing to
do).

Sure, the perlmod layer could be an issue. But I wouldn't jump to
conclusions there.

Also, note what *actually* happens if you `encode('utf-8', $str)`:

    $ perl -MEncode -MDevel::Peek -e 'my $a = encode("utf-8", "👍"); Dump($a);'
    SV = PV(0x55f62dfe6170) at 0x55f62e012430
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x55f62e1cbe60 "\xC3\xB0\xC2\x9F\xC2\x91\xC2\x8D"\0
      CUR = 8
      LEN = 10
      COW_REFCNT = 0

Now each of the four original bytes has been treated as a character of
its own and UTF-8 encoded explicitly, so you end up with 8 bytes.
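
The same double encoding can be reproduced outside of perl. This is
only a sketch, the function name `latin1_to_utf8` is invented for the
example, and it assumes Latin-1 semantics for the input bytes:

```rust
/// Hypothetical helper: interpret each input byte as a Latin-1
/// character and return the UTF-8 encoding of the resulting string.
/// Every byte >= 0x80 turns into two bytes this way.
fn latin1_to_utf8(bytes: &[u8]) -> Vec<u8> {
    bytes
        .iter()
        .map(|&b| char::from(b))
        .collect::<String>()
        .into_bytes()
}
```

For the input bytes F0 9F 91 8D this returns C3 B0 C2 9F C2 91 C2 8D,
which is exactly the PV shown in the dump above.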

What you really want would be for perl to acknowledge that you already
have utf-8:

    $ perl -MDevel::Peek -e 'use utf8; my $a = "👍"; Dump($a);'
    SV = PV(0x56265764cea0) at 0x5626576793e8
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x5626576a2b60 "\xF0\x9F\x91\x8D"\0 [UTF8 "\x{1f44d}"]
      CUR = 4
      LEN = 10
      COW_REFCNT = 1

But we don't use `use utf8;` in our code base because it has too many
side effects.

To mark a string that is utf-8 encoded, but not flagged as such, as
utf-8 in perl, you can *decode* it:

    $ perl -MDevel::Peek -e 'use utf8; no utf8; my $a = "👍"; utf8::decode($a); Dump($a);'
    SV = PV(0x55a5c9c45ea0) at 0x55a5c9c723a8
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x55a5c9c64280 "\xF0\x9F\x91\x8D"\0 [UTF8 "\x{1f44d}"]
      CUR = 4
      LEN = 10
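
For comparison, a rough Rust counterpart of that decode step, as a
sketch only. `decode_utf8` is a made-up wrapper around the standard
library's `std::str::from_utf8`, which validates the bytes instead of
reinterpreting them:

```rust
/// Hypothetical wrapper: accept the bytes as a string only if they are
/// valid UTF-8, without changing them.
fn decode_utf8(bytes: &[u8]) -> Result<String, std::str::Utf8Error> {
    std::str::from_utf8(bytes).map(|s| s.to_owned())
}
```

With the bytes F0 9F 91 8D this gives a string holding the single
character U+1F44D, while a truncated sequence such as F0 9F is
rejected with an error.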

All that said, I have not yet looked at the perl side (or perlmod side)
and cannot say what's going on.

But if you hand utf-8 *bytes* which aren't marked as utf-8 to perlmod,
it'll do what perl does and just encode each byte as utf-8.

*Guessing* that it's utf-8 would surely work - *this time* - but might
simply be *wrong* other times.
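
A small sketch of why guessing is fragile. The function `guess_decode`
is purely hypothetical and is not a suggestion for what perlmod should
do. It tries UTF-8 first and falls back to Latin-1:

```rust
/// Hypothetical "guessing" decoder: use the bytes as UTF-8 if they
/// happen to be valid, otherwise treat each byte as a Latin-1
/// character.
fn guess_decode(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.to_owned(),
        Err(_) => bytes.iter().map(|&b| char::from(b)).collect(),
    }
}
```

The bytes C3 A9 are valid UTF-8 and come out as "é", although a Latin-1
source may have meant the two characters "Ã©". The single byte E9 is
not valid UTF-8, takes the fallback path and also comes out as "é". The
bytes alone cannot tell you which reading was intended.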