[pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse

Fri Sep 20 08:16:32 CEST 2024

On 9/19/24 16:57, Thomas Lamprecht wrote:
> Am 19/09/2024 um 14:45 schrieb Dominik Csapak:
>> On 9/19/24 14:01, Thomas Lamprecht wrote:
>>> Am 19/09/2024 um 11:52 schrieb Dominik Csapak:
>>>> by default libfuse2 limits writes to 4k size, which means that on writes
>>>> bigger than that, we do a whole write cycle for each 4k block that comes
>>>> in. To avoid that, add the option 'big_writes' to allow writes bigger
>>>> than 4k at once.
>>>>
>>>> This should improve pmxcfs performance for situations where we often
>>>> write large files (e.g. big ha status) and maybe reduce writes to disk.
>>>
>>> Should? Something like before/after for benchmark numbers, flamegraphs
>>> would be really good to have, without those it's rather hard to discuss
>>> this, and I'd like to avoid having to do those, or check the inner workings
>>> of the affected fuse userspace/kernel code paths here myself.
>>
>> well I mean the code change is relatively small and the result is rather clear:
> 
> Well sure the code change is just setting an option... But the actual change is
> abstracted away and would benefit from actually looking into..
> 
>> in the current case we have the following calls from pmxcfs (shortened for e-mail)
>> when writing a single 128k block:
>> (dd if=... of=/etc/pve/test bs=128k count=1)
> 
> Better than nothing but still no actual numbers (reduced time, reduced write amp
> in combination with sqlite, ...), some basic analysis over file/write size distribution
> on a single node and (e.g. three node) cluster, ...
> If that's all obvious for you then great, but as already mentioned in the past, I
> want actual data in commit messages for such stuff, and I cannot really see a downside
> of having such numbers.
> 
> Again, as is I'm not really seeing what's to discuss, you send it as RFC after
> all.
> 
>> [...]
>> so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)
> 
> That can be huge or not so big at all, i.e. as mentioned above, it would we good to
> measure the impact through some other metrics.
> 
> And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs, there I get
> the 32 calls to cfs_fuse_write only for a new file, overwriting the existing
> file again with the same amount of data (128k) just causes a single call.
> I tried using more data (e.g. from 128k initially to 256k or 512k) and it's
> always the data divided by 128k (even if the first file has a different size)
> 
> We do not override existing files often, but rather write to a new file and
> then rename, but still quite interesting and IMO really showing that just
> because this is 1 +-1 line change it doesn't necessarily have to be trivial
> and obvious in its effects.
> 
> [0]: bpftrace -e 'u:cfs_fuse_write /str(args->path) == "/test"/ {@ = count();} END { print(@) }' -p "$(pidof pmxcfs)"
> 
> 
>>>> If we'd change to libfuse3, this would be a non-issue, since that option
>>>> got removed and is the default there.
>>>
>>> I'd prefer that. At least if done with the future PVE 9.0, as I do not think
>>> it's a good idea in the middle of a stable release cycle.
>>
>> why not this change now, and the rewrite to libfuse3 later? that way we can
>> have some improvements now too...
> 
> Because I want some actual data and reasoning first, even if it's quite likely
> that this improves things Somehow™, I'd like to actually know in what metrics
> and by how much (even if just an upper bound due to the benchmark or some
> measurement being rather artificial).
> 
> I mean you name the big HA status, why not measure that for real? like, probably
> among other things, in terms of bytes hitting the block layer, i.e. the actual
> backing disk from those requests as then we'd know for real if this can reduce
> the write load there, not just that it maybe should.

sure, since you insist, I'll do some benchmarks today (when i have the time).

I'm just not sure if doing that is actually worth it. Would it make any
difference for the patch if the improvements were 1%, 10% or 1000% ?

The resulting change in behavior seems rather obvious to me:

less fuse -> db round trips for big writes, which can really only make the
whole thing more efficient, so IMHO it's very clear to me that this is an
improvement.

I sent this as an RFC because, as i wrote in the beginning, I'd like
for someone with more C/pmxcfs knowledge/experience to check or warn
for edge cases where we might depend on lower blocksizes (or similar things).
I very much doubt that any performance benchmarks would have revealed such things
that i wouldn't already have noticed by testing my local install.

if we do benchmarks on all changes that might impact performance (which
are most changes really) we'd probably would do little else...