[pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse

Dominik Csapak d.csapak at proxmox.com
Fri Sep 20 08:16:32 CEST 2024


On 9/19/24 16:57, Thomas Lamprecht wrote:
> Am 19/09/2024 um 14:45 schrieb Dominik Csapak:
>> On 9/19/24 14:01, Thomas Lamprecht wrote:
>>> Am 19/09/2024 um 11:52 schrieb Dominik Csapak:
>>>> by default libfuse2 limits writes to 4k size, which means that on writes
>>>> bigger than that, we do a whole write cycle for each 4k block that comes
>>>> in. To avoid that, add the option 'big_writes' to allow writes bigger
>>>> than 4k at once.
>>>>
>>>> This should improve pmxcfs performance for situations where we often
>>>> write large files (e.g. big ha status) and maybe reduce writes to disk.
>>>
>>> Should? Something like before/after for benchmark numbers, flamegraphs
>>> would be really good to have, without those it's rather hard to discuss
>>> this, and I'd like to avoid having to do those, or check the inner workings
>>> of the affected fuse userspace/kernel code paths here myself.
>>
>> well I mean the code change is relatively small and the result is rather clear:
> 
> Well sure the code change is just setting an option... But the actual change is
> abstracted away and would benefit from actually looking into..
> 
>> in the current case we have the following calls from pmxcfs (shortened for e-mail)
>> when writing a single 128k block:
>> (dd if=... of=/etc/pve/test bs=128k count=1)
> 
> Better than nothing but still no actual numbers (reduced time, reduced write amp
> in combination with sqlite, ...), some basic analysis over file/write size distribution
> on a single node and (e.g. three node) cluster, ...
> If that's all obvious for you then great, but as already mentioned in the past, I
> want actual data in commit messages for such stuff, and I cannot really see a downside
> of having such numbers.
> 
> Again, as is I'm not really seeing what's to discuss, you send it as RFC after
> all.
> 
>> [...]
>> so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)
> 
> That can be huge or not so big at all, i.e. as mentioned above, it would we good to
> measure the impact through some other metrics.
> 
> And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs, there I get
> the 32 calls to cfs_fuse_write only for a new file, overwriting the existing
> file again with the same amount of data (128k) just causes a single call.
> I tried using more data (e.g. from 128k initially to 256k or 512k) and it's
> always the data divided by 128k (even if the first file has a different size)
> 
> We do not override existing files often, but rather write to a new file and
> then rename, but still quite interesting and IMO really showing that just
> because this is 1 +-1 line change it doesn't necessarily have to be trivial
> and obvious in its effects.
> 
> [0]: bpftrace -e 'u:cfs_fuse_write /str(args->path) == "/test"/ {@ = count();} END { print(@) }' -p "$(pidof pmxcfs)"
> 
> 
>>>> If we'd change to libfuse3, this would be a non-issue, since that option
>>>> got removed and is the default there.
>>>
>>> I'd prefer that. At least if done with the future PVE 9.0, as I do not think
>>> it's a good idea in the middle of a stable release cycle.
>>
>> why not this change now, and the rewrite to libfuse3 later? that way we can
>> have some improvements now too...
> 
> Because I want some actual data and reasoning first, even if it's quite likely
> that this improves things Somehow™, I'd like to actually know in what metrics
> and by how much (even if just an upper bound due to the benchmark or some
> measurement being rather artificial).
> 
> I mean you name the big HA status, why not measure that for real? like, probably
> among other things, in terms of bytes hitting the block layer, i.e. the actual
> backing disk from those requests as then we'd know for real if this can reduce
> the write load there, not just that it maybe should.

sure, since you insist, I'll do some benchmarks today (when i have the time).

I'm just not sure if doing that is actually worth it. Would it make any
difference for the patch if the improvements were 1%, 10% or 1000% ?

The resulting change in behavior seems rather obvious to me:

less fuse -> db round trips for big writes, which can really only make the
whole thing more efficient, so IMHO it's very clear to me that this is an
improvement.

I sent this as an RFC because, as i wrote in the beginning, I'd like
for someone with more C/pmxcfs knowledge/experience to check or warn
for edge cases where we might depend on lower blocksizes (or similar things).
I very much doubt that any performance benchmarks would have revealed such things
that i wouldn't already have noticed by testing my local install.

if we do benchmarks on all changes that might impact performance (which
are most changes really) we'd probably would do little else...




More information about the pve-devel mailing list