[pve-devel] [PATCH v2 qemu-server 8/9] memory: add virtio-mem support

Fiona Ebner f.ebner at proxmox.com
Tue Jan 24 14:06:41 CET 2023


Am 04.01.23 um 07:43 schrieb Alexandre Derumier:
> a 4GB static memory is needed for DMA+boot memory, as this memory
> is almost always un-unpluggeable.
> 
> 1 virtio-mem pci device is setup for each numa node on pci.4 bridge
> 
> virtio-mem use a fixed blocksize with 32000 blocks
> Blocksize is computed from the maxmemory-4096/32000 with a minimum of
> 2MB to map THP.
> (lower blocksize = more chance to unplug memory).
> 
> Note: Currently, linux only support 4MB virtio blocksize, 2MB support
> is currently is progress.
> 

For the above paragraphs:
s/GB/GiB/
s/MB/MiB/
?

> For hotplug/unplug, we are try to allocate/unallocate same amount
> of memory aligned to the blocksize on each numa node if possible.
> If a node a not able to reach the target memory (could be an unmovable
> page on unplug for example), we try again to redispatch memory the
> remaining memory on others nodes.
> 
> About hugepages:
> 
> For ordinary memory devices, such as DIMMs, we preallocate memory via the
> memory backend for such use cases; however, with virtio-mem we're dealing
> with sparse memory backends; preallocating the whole memory backend
> destroys the whole purpose of virtio-mem.
> 
> Instead, we want to preallocate memory when actually exposing memory to the
> VM dynamically, and fail plugging memory gracefully + warn the user in case
> preallocation fails.
> 
> fixes:
> https://bugzilla.proxmox.com/show_bug.cgi?id=931

As said last time, one user in the report has Windows, which doesn't
support virtio-mem, so I wouldn't consider the issue fixed.

> https://bugzilla.proxmox.com/show_bug.cgi?id=2949
> ---
> 
> Signed-off-by: Alexandre Derumier <aderumier at odiso.com>
> ---
>  PVE/API2/Qemu.pm         |  10 +-
>  PVE/QemuServer.pm        |   7 +-
>  PVE/QemuServer/Memory.pm | 233 ++++++++++++++++++++++++++++++++++++---
>  PVE/QemuServer/PCI.pm    |   8 ++
>  4 files changed, 242 insertions(+), 16 deletions(-)
> 
> diff --git a/PVE/API2/Qemu.pm b/PVE/API2/Qemu.pm
> index cab1e84..42941ac 100644
> --- a/PVE/API2/Qemu.pm
> +++ b/PVE/API2/Qemu.pm
> @@ -32,7 +32,7 @@ use PVE::QemuServer::Drive;
>  use PVE::QemuServer::ImportDisk;
>  use PVE::QemuServer::Monitor qw(mon_cmd);
>  use PVE::QemuServer::Machine;
> -use PVE::QemuServer::Memory qw(get_current_memory parse_memory get_host_max_mem);
> +use PVE::QemuServer::Memory qw(get_current_memory parse_memory get_host_max_mem get_virtiomem_block_size);
>  use PVE::QemuMigrate;
>  use PVE::RPCEnvironment;
>  use PVE::AccessControl;
> @@ -487,6 +487,14 @@ my $check_memory_param = sub {
>  	    if $mem->{max} > $host_max_mem;
>      }
>  
> +    #unplug works better with 128MB by dimm to match the linux blocksize btyes.

s/MB/MiB/
s/btyes/bytes/

> +    if ($mem->{virtio}) {
> +	my $blocksize = get_virtiomem_block_size($conf);
> +
> +	die "memory need to be a multiple of $blocksize MB when virtiomem is enabled\n"

s/MB/MiB/
s/need/needs/

> +	    if $mem->{current} % $blocksize != 0;
> +    }
> +
>      if ($param->{memory} || defined($param->{balloon})) {
>  
>  	my $maxmem = undef;
> diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> index 5847a78..51b29fc 100644
> --- a/PVE/QemuServer.pm
> +++ b/PVE/QemuServer.pm
> @@ -3857,7 +3857,12 @@ sub config_to_command {
>  	push @$cmd, get_cpu_options($conf, $arch, $kvm, $kvm_off, $machine_version, $winversion, $gpu_passthrough);
>      }
>  
> -    PVE::QemuServer::Memory::config($conf, $vmid, $sockets, $cores, $defaults, $hotplug_features, $cmd);
> +    my $mem_devices = {};
> +    PVE::QemuServer::Memory::config($conf, $vmid, $sockets, $cores, $defaults, $hotplug_features, $cmd, $mem_devices);
> +    foreach my $id (sort keys %$mem_devices) {
> +	my $pciaddr = print_pci_addr($id, $bridges, $arch, $machine_type);
> +	push @$devices, "-device", "$mem_devices->{$id}$pciaddr";
> +    }
>  
>      push @$cmd, '-S' if $conf->{freeze};
>  
> diff --git a/PVE/QemuServer/Memory.pm b/PVE/QemuServer/Memory.pm
> index b9136d2..6827004 100644
> --- a/PVE/QemuServer/Memory.pm
> +++ b/PVE/QemuServer/Memory.pm
> @@ -3,6 +3,8 @@ package PVE::QemuServer::Memory;
>  use strict;
>  use warnings;
>  
> +use POSIX qw/ceil/;

Style nit: haven't seen this variant yet, usually qw() is used in our
codebase

> +
>  use PVE::Tools qw(run_command lock_file lock_file_full file_read_firstline dir_glob_foreach);
>  use PVE::Exception qw(raise raise_param_exc);
>  use PVE::GuestHelpers qw(safe_string_ne safe_num_ne safe_boolean_ne);
> @@ -15,6 +17,7 @@ our @EXPORT_OK = qw(
>  get_current_memory
>  parse_memory
>  get_host_max_mem
> +get_virtiomem_block_size
>  );
>  
>  my $MAX_NUMA = 8;
> @@ -37,6 +40,12 @@ my $memory_fmt = {
>  	minimum => 65536,
>  	maximum => 4194304
>      },
> +    virtio => {
> +	description => "enable virtio-mem memory",

Again, we really should note that the feature is a technology preview
and only supported on recent enough Linux kernels.

> +	type => 'boolean',
> +	optional => 1,
> +	default => 0,
> +    }
>  };
>  
>  sub print_memory {
> @@ -69,7 +78,9 @@ my sub get_static_mem {
>      my $static_memory = 0;
>      my $memory = parse_memory($conf->{memory});
>  
> -    if ($memory->{max}) {
> +    if ($memory->{virtio}) {
> +	$static_memory = 4096;
> +    } elsif ($memory->{max}) {
>  	my $dimm_size = $memory->{max} / $MAX_SLOTS;
>  	#static mem can't be lower than 4G and lower than 1 dimmsize by socket
>  	$static_memory = $dimm_size * $sockets;
> @@ -160,6 +171,134 @@ sub get_current_memory {
>      return $memory->{current};
>  }
>  
> +sub get_virtiomem_block_size {
> +    my ($conf) = @_;
> +
> +    my $MAX_MEM = get_max_mem($conf);
> +    my $static_memory = get_static_mem($conf);
> +    my $memory = get_current_memory($conf->{memory});
> +
> +    #virtiomem can map 32000 block size.
> +    #try to use lowest blocksize, lower = more chance to unplug memory.
> +    my $blocksize = ($MAX_MEM - $static_memory) / 32000;
> +    #2MB is the minimum to be aligned with THP
> +    $blocksize = 2**(ceil(log($blocksize)/log(2)));
> +    $blocksize = 4 if $blocksize < 4;

Why suddenly 4?

As discussed last time, the future-proof way to calculate this (if we
want minimum 2) without risking any negative result for log($blocksize), is
my $blocksize = ($MAX_MEM - $static_memory) / 32000;
$blocksize = 2 if $blocksize < 2;
$blocksize = 2**(ceil(log($blocksize)/log(2)));

> +
> +    return $blocksize;
> +}
> +
> +my sub get_virtiomem_total_current {
> +    my ($mems) = @_;
> +    my $total = 0;
> +    foreach my $id (keys %$mems) {

Style nit: use for instead of foreach (same for all others below)
Nit: can iterate over values instead of keys

Same for the two functions below.

> +	my $mem = $mems->{$id};
> +	$total += $mem->{current};
> +    }
> +    return $total;

Below, you use total for a count, not a size. This is confusing naming.
Here, size should be used.

> +}
> +
> +my sub get_virtiomem_total_noerror {
> +    my ($mems) = @_;
> +
> +    my $total = 0;
> +    foreach my $id (keys %$mems) {
> +	my $mem = $mems->{$id};
> +	next if $mem->{error};
> +	$total++;
> +    }
> +    return $total;

I think the whole function could be
scalar(grep { !$mem->{error} } values $mems->%*)
and since there's only one caller, you can inline it there.

> +}
> +
> +my sub get_virtiomem_total_errors_size {
> +    my ($mems) = @_;
> +
> +    my $size = 0;
> +    foreach my $id (keys %$mems) {
> +	my $mem = $mems->{$id};
> +	next if !$mem->{error};
> +	$size += $mem->{current};
> +    }
> +    return $size;
> +}
> +
> +my sub balance_virtiomem {

This function is rather difficult to read. The "record errors and
filter" logic really should be its own patch after the initial support.
FWIW, I tried my best and it does seems fine :)

But it's not clear to me that we even want that logic? Is it really that
common for qom-set to take so long to be worth introducing all this
additional handling/complexity? Or should it just be a hard error if
qom-set still didn't have an effect on a device after 5 seconds.

Would it actually be better to just fill up the first, then the second
etc. as needed, rather than balancing? My gut feeling is that having
fewer "active" devices is better. But this would have to be tested with
some benchmarks of course.

> +    my ($vmid, $virtiomems, $blocksize, $target_virtiomem_total) = @_;
> +
> +    my $virtiomem_total_noerror = get_virtiomem_total_noerror($virtiomems);
> +
> +    print"try to balance memory on $virtiomem_total_noerror remaining virtiomems\n";

Style nit: missing space after print (same for some others below, also
missing space after if a few times)

> +
> +    die "error. no more available blocks in virtiomem to balance the remaining memory" if $target_virtiomem_total < 0;
> +    die "error. No more available virtiomem to balance the remaining memory\n" if $virtiomem_total_noerror == 0;

Style nit: lines > 100 characters (same for some others below).





More information about the pve-devel mailing list