[pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse

Filip Schauer f.schauer at proxmox.com
Mon Sep 23 16:08:56 CEST 2024


Changing the way we write files can eliminate the rapid growth of
write amplification for files of up to 128KiB in size.

Right now we are using `PVE::Tools::file_set_contents`, which itself
uses `print`. Looking at the debug output, we can see that this ends up
writing the file contents in 8k blocks.

```
$ echo 1 > /etc/pve/.debug && perl -e 'use PVE::Tools; my $data = "a"x(128*1024); PVE::Tools::file_set_contents("/etc/pve/testfile", $data);' && echo 0 > /etc/pve/.debug
$ journalctl -n250 -u pve-cluster | grep cfs_fuse_write
Sep 23 15:23:04 pmxcfsbench pmxcfs[16835]: [main] debug: enter cfs_fuse_write /testfile.tmp.22487 8192 0 (pmxcfs.c:355:cfs_fuse_write)
Sep 23 15:23:04 pmxcfsbench pmxcfs[16835]: [main] debug: leave cfs_fuse_write /testfile.tmp.22487 (8192) (pmxcfs.c:368:cfs_fuse_write)
...
```
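
The 8k block size matches PerlIO's default output buffer (8KiB on
typical Linux builds). The chunking can be reproduced outside of
/etc/pve with a plain file; a minimal sketch (the path is arbitrary,
observe the write(2) calls with `strace -e trace=write`):

```
use strict;
use warnings;

# print() goes through PerlIO's buffered layer, so a single 128KiB
# print should reach the kernel as a series of 8192-byte write calls
open(my $fh, '>', '/tmp/buftest') or die "open failed - $!\n";
print {$fh} 'a' x (128 * 1024);
close($fh) or die "close failed - $!\n";
```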

So let's change the benchmark script to write all the contents in a
single block.

```diff
@@ -21,10 +21,9 @@
  sub test_write {
          my ($k) = @_;
          system("rm /etc/pve/testfile");
-        my $data = "a"x($k*1024);
          system("sync; echo -n 3 > /proc/sys/vm/drop_caches");
          my $bytes_before = get_bytes_written();
-        PVE::Tools::file_set_contents("/etc/pve/testfile", $data);
+        system("dd if=/dev/urandom of=/etc/pve/testfile bs=${k}k count=1 2> /dev/null");
          system("sync; echo -n 3 > /proc/sys/vm/drop_caches");
          my $bytes_after = get_bytes_written();
          return $bytes_after - $bytes_before;
```

With the `-obig_writes` patch applied as well, this gives the following
results (sizes in KiB):

data size  written  amplification
1          54       54.0
2          33       16.5
4          49       12.3
8          53       6.6
16         62       3.9
32         77       2.4
64         114      1.8
128        178      1.4
256        580      2.3
512        2157     4.2
1024       9844     9.6

With this we write in 128k blocks instead of 8k blocks, eliminating the
rapid write amplification growth for data sizes up to 128k.

It seems that `PVE::Tools::file_set_contents` needs to be optimized to
not write the contents in 8k blocks. Instead of `print` we might want to
use `syswrite`.
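
A minimal sketch of what that could look like (hypothetical, not the
actual `file_set_contents` code; error handling reduced to `die`, and
the write-then-rename scheme mirrors what it already does):

```
use strict;
use warnings;

sub file_set_contents_unbuffered {
    my ($filename, $data) = @_;
    my $tmpname = "$filename.tmp.$$";
    open(my $fh, '>', $tmpname) or die "open '$tmpname' failed - $!\n";
    # syswrite bypasses the PerlIO buffer, handing the whole remaining
    # buffer to a single write(2) call instead of 8k chunks
    my ($len, $offset) = (length($data), 0);
    while ($offset < $len) {
        my $written = syswrite($fh, $data, $len - $offset, $offset)
            // die "write to '$tmpname' failed - $!\n";
        # syswrite may write less than requested, so loop until done
        $offset += $written;
    }
    close($fh) or die "close '$tmpname' failed - $!\n";
    rename($tmpname, $filename) or die "rename failed - $!\n";
}
```

With `-obig_writes` on the fuse side, the kernel should then be able to
pass such a write through in 128k requests instead of 8k (or 4k) ones.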


On 23/09/2024 13:48, Filip Schauer wrote:
> I also ran some benchmarks with the same script.
>
> I created a VM with two virtual disks, (both on an LVM Thin storage)
> installed PVE on one disk and set up an ext4 partition on the other.
>
> I stopped pvestatd and pve-cluster,
>
> ```
> systemctl stop pvestatd
> systemctl stop pve-cluster
> ```
>
> moved the pmxcfs database file to its own disk
>
> ```
> mv /var/lib/pve-cluster/config.db /tmp/
> mount /dev/sdb1 /var/lib/pve-cluster
> mv /tmp/config.db /var/lib/pve-cluster/
> ```
>
> and started pve-cluster again.
>
> ```
> systemctl start pve-cluster
> ```
>
> These are my results before and after applying the patch:
>
> data size  written (old)  amplification (old)  written (new)  amplification (new)
> 1          48             48                   45             45
> 2          48             24                   45             23
> 4          82             21                   80             20
> 8          121            15                   90             11
> 16         217            14                   146            9
> 32         506            16                   314            10
> 64         1472           23                   826            13
> 128        5585           44                   3765           29
> 256        20424          80                   10743          42
> 512        86715          169                  43650          85
> 1024       369568         361                  187496         183
>
> As can be seen, my results are similar, with amplification really picking
> up at 128k. The patch seems to cut write amplification in half for big
> files, while making virtually no difference for files up to 4k.
>
> On 23/09/2024 11:17, Dominik Csapak wrote:
>> On 9/19/24 16:57, Thomas Lamprecht wrote:
>>> Am 19/09/2024 um 14:45 schrieb Dominik Csapak:
>>>> On 9/19/24 14:01, Thomas Lamprecht wrote:
>>>>> Am 19/09/2024 um 11:52 schrieb Dominik Csapak:
>>>>>> by default libfuse2 limits writes to 4k size, which means that on
>>>>>> writes bigger than that, we do a whole write cycle for each 4k
>>>>>> block that comes in. To avoid that, add the option 'big_writes' to
>>>>>> allow writes bigger than 4k at once.
>>>>>>
>>>>>> This should improve pmxcfs performance for situations where we
>>>>>> often write large files (e.g. big ha status) and maybe reduce
>>>>>> writes to disk.
>>>>>
>>>>> Should? Something like before/after benchmark numbers or flamegraphs
>>>>> would be really good to have; without those it's rather hard to
>>>>> discuss this, and I'd like to avoid having to do those, or check the
>>>>> inner workings of the affected fuse userspace/kernel code paths here
>>>>> myself.
>>>>
>>>> well I mean the code change is relatively small and the result is
>>>> rather clear:
>>>
>>> Well sure, the code change is just setting an option... But the
>>> actual change is abstracted away and would benefit from actually
>>> looking into it...
>>>
>>>> in the current case we have the following calls from pmxcfs
>>>> (shortened for e-mail) when writing a single 128k block:
>>>> (dd if=... of=/etc/pve/test bs=128k count=1)
>>>
>>> Better than nothing, but still no actual numbers (reduced time, reduced
>>> write amp in combination with sqlite, ...), no basic analysis of the
>>> file/write size distribution on a single node and a (e.g. three node)
>>> cluster, ...
>>> If that's all obvious to you then great, but as already mentioned in
>>> the past, I want actual data in commit messages for such stuff, and I
>>> cannot really see a downside of having such numbers.
>>>
>>> Again, as is I'm not really seeing what's to discuss; you sent it as
>>> RFC after all.
>>>
>>>> [...]
>>>> so a factor of 32 fewer calls to cfs_fuse_write (including
>>>> memdb_pwrite)
>>>
>>> That can be huge or not so big at all, i.e. as mentioned above, it
>>> would be good to measure the impact through some other metrics.
>>>
>>> And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs; there
>>> I get the 32 calls to cfs_fuse_write only for a new file, while
>>> overwriting the existing file again with the same amount of data (128k)
>>> just causes a single call.
>>> I tried using more data (e.g. from 128k initially to 256k or 512k) and
>>> it's always the data size divided by 128k (even if the first file has a
>>> different size).
>>>
>>> We do not overwrite existing files often, but rather write to a new
>>> file and then rename; still, it is quite interesting and IMO really
>>> shows that just because this is a 1 +-1 line change, it doesn't
>>> necessarily have to be trivial and obvious in its effects.
>>>
>>> [0]: bpftrace -e 'u:cfs_fuse_write /str(args->path) == "/test"/ {@ = count();} END { print(@) }' -p "$(pidof pmxcfs)"
>>>
>>>
>>>>>> If we'd change to libfuse3, this would be a non-issue, since that
>>>>>> option got removed and is the default there.
>>>>>
>>>>> I'd prefer that. At least if done with the future PVE 9.0, as I do
>>>>> not think it's a good idea in the middle of a stable release cycle.
>>>>
>>>> why not this change now, and the rewrite to libfuse3 later? that way
>>>> we can have some improvements now too...
>>>
>>> Because I want some actual data and reasoning first; even if it's
>>> quite likely that this improves things Somehow™, I'd like to actually
>>> know in what metrics and by how much (even if just an upper bound due
>>> to the benchmark or some measurement being rather artificial).
>>>
>>> I mean, you name the big HA status, so why not measure that for real?
>>> Like, probably among other things, in terms of bytes hitting the block
>>> layer, i.e. the actual backing disk, from those requests, as then we'd
>>> know for real if this can reduce the write load there, not just that it
>>> maybe should.
>>
>>
>> hi,
>>
>> first i just wanted to say I'm sorry for my snarky comment about not
>> needing to test performance for such code. You're right, any insight we
>> can gain there is good and we (I!) should take the time to do that, even
>> if the change looks "obvious" like it does here.
>>
>> so i did some benchmarks (mostly disk writes) and wrote the short
>> script below (maybe we can reuse that?)
>>
>> ----8<----
>> use strict;
>> use warnings;
>>
>> use IO::File;
>> use PVE::Tools;
>>
>> my $size = shift;
>>
>> sub get_bytes_written {
>>         my $fh = IO::File->new("/proc/diskstats", "r");
>>         die if !$fh;
>>         my $bytes = undef;
>>         while (defined(my $line = <$fh>)) {
>>                 if ($line =~ m/sdb/) {
>>                         my @fields = split(/\s+/, $line);
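>>                         # /proc/diskstats lines start with whitespace,
>>                         # so $fields[0] is empty and $fields[10] is the
>>                         # 'sectors written' column (512-byte sectors)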
>>                         $bytes = $fields[10] * 512;
>>                 }
>>         }
>>         return $bytes;
>> }
>>
>> sub test_write {
>>         my ($k) = @_;
>>         system("rm /etc/pve/testfile");
>>         my $data = "a"x($k*1024);
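>>         # sync flushes dirty pages and drop_caches empties the page
>>         # cache, so the diskstats delta reflects only this write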
>>         system("sync; echo -n 3> /proc/sys/vm/drop_caches");
>>         my $bytes_before = get_bytes_written();
>>         PVE::Tools::file_set_contents("/etc/pve/testfile", $data);
>>         system("sync; echo -n 3> /proc/sys/vm/drop_caches");
>>         my $bytes_after = get_bytes_written();
>>         return $bytes_after - $bytes_before;
>> }
>>
>> $size //= 128;
>>
>> my $written = test_write($size) / 1024;
>> print("$written\n");
>> ---->8----
>>
>> to simulate our real write patterns with varying file sizes,
>> i installed a fresh pve, turned off pvestatd and put /var/lib/pve-cluster
>> on its own disk (/dev/sdb), so diskstats only contains writes from the
>> pmxcfs
>>
>> the results are below (all sizes are kbytes; i ran them multiple times,
>> but they seem to be consistent)
>>
>> data size  written (old)  amplification (old)  written (new)  amplification (new)
>> 1          56             56                   56             56
>> 2          72             36                   76             38
>> 4          84             21                   88             22
>> 8          144            18                   104            13
>> 16         236            14                   160            10
>> 32         532            16                   324            10
>> 64         1496           23                   836            13
>> 128        6616           51                   3848           30
>> 256        20600          80                   10568          41
>> 512        87296          170                  43416          84
>> 1024       388460         379                  197032         192
>>
>> for smaller writes there seems to be a minimum overhead of ~50-100
>> kbytes; for big files we have a massive amplification of > 100x
>>
>> my patch does seem to make a difference for files >4k
>>
>> but the biggest surprise here is that the write amplification is not
>> constant, but increases with the amount of data we want to write:
>> going from a 128k to a 256k file write, we don't just write double the
>> amount, but 3x to 4x as much.
>>
>> i also produced a flamegraph according to
>> https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html, but that
>> showed virtually no change between the versions without and with my
>> patch (if one has a good svg hoster, i can post them ofc)
>>
>> so, tl;dr
>>
>> for small writes it does not make that much of a difference, but we can
>> save ~half the writes for large files in the pmxcfs (which e.g. a big
>> ha-state file could cause)
>>



