[pve-devel] [PATCH v2 qemu-server 8/9] memory: add virtio-mem support

Fiona Ebner f.ebner at proxmox.com
Wed Jan 25 10:54:10 CET 2023


On 25.01.23 at 10:00, DERUMIER, Alexandre wrote:
> On Tuesday, January 24, 2023 at 14:06 +0100, Fiona Ebner wrote:
>> On 04.01.23 at 07:43, Alexandre Derumier wrote:
>>> +sub get_virtiomem_block_size {
>>> +    my ($conf) = @_;
>>> +
>>> +    my $MAX_MEM = get_max_mem($conf);
>>> +    my $static_memory = get_static_mem($conf);
>>> +    my $memory = get_current_memory($conf->{memory});
>>> +
>>> +    # a virtio-mem device can expose at most 32000 blocks
>>> +    # try to use the lowest block size; lower = more chance to unplug memory
>>> +    my $blocksize = ($MAX_MEM - $static_memory) / 32000;
>>> +    #2MB is the minimum to be aligned with THP
>>> +    $blocksize = 2**(ceil(log($blocksize)/log(2)));
>>> +    $blocksize = 4 if $blocksize < 4;
>>
>> Why suddenly 4?
> 
> I have added a note to the commit message:
> 
>> Note: Currently, Linux only supports a 4MB virtio-mem block size; 2MB
>> support is currently in progress.
>>
> 
> So 2MB is valid from the QEMU side, but the Linux guest kernel doesn't
> actually support it. At least, you need to use a multiple of 4MB. You
> can remove/add two 2MB blocks at the same time, but it doesn't seem to
> be atomic, so I think it's better to use the minimum currently supported
> block size.

Sorry, I missed that.
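
Just to double-check my understanding of the computation, here is the
rounding logic as a standalone sketch with made-up example values (not
meant for the patch itself):

use POSIX qw(ceil);

# example values (arbitrary): 64 GiB max memory, 4 GiB static memory, in MiB
my $MAX_MEM = 65536;
my $static_memory = 4096;

# a virtio-mem device can expose at most 32000 blocks
my $blocksize = ($MAX_MEM - $static_memory) / 32000; # ~1.92
# round up to the next power of two...
$blocksize = 2**(ceil(log($blocksize)/log(2)));      # 2
# ...but never go below 4 MiB, the smallest size current guest kernels support
$blocksize = 4 if $blocksize < 4;                    # 4

print "block size: ${blocksize} MiB\n";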

> Maybe later, we could extend the virtio=X option to specify the
> supported virtio version (virtio=1.1, virtio=1.2) and enable the
> supported features?

If we really need to and can't detect it otherwise, sure.

>>> +my sub balance_virtiomem {
>>
>> This function is rather difficult to read. The "record errors and
>> filter" logic really should be its own patch after the initial
>> support.
>> FWIW, I tried my best and it does seem fine :)
>>
>> But it's not clear to me that we even want that logic? Is it really
>> that common for qom-set to take so long that it's worth introducing all
>> this additional handling/complexity? Or should it just be a hard error
>> if qom-set still didn't have an effect on a device after 5 seconds?
>>
> From my tests, it can take 2-3 seconds on unplug on a bigger setup. I'm
> doing it in parallel to be faster, to avoid waiting nbdimm * 2-3 seconds.

Sure, doing it in parallel is perfectly fine. I'm just thinking that
switching gears (too early) and redirecting the request might not be
ideal. You also issue an additional qom-set to go back to
$virtiomem->{current} * 1024 * 1024 if a request didn't make progress in
time. But to be sure that request worked, we'd also need to monitor it
;) I think issuing that request is fine as-is, but if there is a
"hanging" device, we really can't do much. And I do think the user
should be informed via an appropriate error if there is a problematic
device.

Maybe we can use 10 seconds instead of 5 (2-3 seconds already sounds too
close to 5 IMHO), so that we have a good margin, and just die instead of
trying to redirect the request to another device. After issuing the
reset request and writing our best guess of what the memory is to the
config of course.

If it really is an issue in practice that certain devices often take too
long for whatever reason, we can still add the redirect logic. But
starting out, I feel like it's not worth the additional complexity.
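
In (untested) pseudo-Perl, roughly what I have in mind - names like
$virtiomems, $virtiomem->{target} and the per-device hash layout are just
assumed from your patch, so treat it as a sketch only:

use Time::HiRes qw(usleep);
use PVE::QemuServer::Monitor qw(mon_cmd);

# after issuing all qom-set requests (in parallel), poll every device and
# die - instead of redirecting the request - if one still isn't done after
# the timeout
my $timeout = 10; # seconds, a good margin above the observed 2-3s

my $elapsed = 0;
while (1) {
    my @pending = ();
    for my $id (sort keys %$virtiomems) {
        my $virtiomem = $virtiomems->{$id};
        my $size = mon_cmd(
            $vmid, 'qom-get',
            path => "/machine/peripheral/$id",
            property => 'size',
        );
        $virtiomem->{current} = $size / (1024 * 1024);
        push @pending, $id if $virtiomem->{current} != $virtiomem->{target};
    }
    last if !@pending;
    if ($elapsed >= $timeout) {
        # write the best guess of the current memory to the config here, then
        die "virtio-mem device(s) " . join(', ', @pending)
            . " did not reach the requested size in time\n";
    }
    usleep(100_000);
    $elapsed += 0.1;
}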

>> Would it actually be better to just fill up the first, then the
>> second
>> etc. as needed, rather than balancing? My gut feeling is that having
>> fewer "active" devices is better. But this would have to be tested
>> with
>> some benchmarks of course.
> 
> Well, from a NUMA perspective, you really want to balance as much as
> possible. (That's why, with classic hotplug, we add/remove DIMMs on each
> socket alternately.)
> 
> That's the whole point of NUMA: reading the nearest memory attached to
> the processor the process is running on.
> 
> That's a main advantage of virtio-mem vs ballooning (which doesn't
> handle NUMA and removes pages randomly from any socket).

Makes sense. Thanks for the explanation!
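
For anyone following along, the balancing idea boiled down to a toy
example (device count and numbers made up, not taken from the patch):

# distribute a requested memory delta evenly over the per-NUMA-node
# virtio-mem devices instead of filling them up one after the other
my $blocksize = 4;            # MiB
my @nodes = (0, 1, 2, 3);     # one virtio-mem device per NUMA node
my $delta = 4096;             # MiB to plug

my %per_node = map { $_ => 0 } @nodes;
my $blocks = $delta / $blocksize;
for my $i (0 .. $blocks - 1) {
    $per_node{ $nodes[$i % scalar(@nodes)] } += $blocksize;
}
# => 1024 MiB added on each node, so every socket keeps its memory local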
