[pve-devel] 3 numa topology issues
Alexandre DERUMIER
aderumier at odiso.com
Thu Jul 28 08:44:47 CEST 2016
I'm looking at the OpenStack implementation
https://specs.openstack.org/openstack/nova-specs/specs/juno/implemented/virt-driver-numa-placement.html
and it seems that they check whether the host NUMA nodes exist too:
"hw:numa_nodes=NN - numa of NUMA nodes to expose to the guest.
The most common case will be that the admin only sets ‘hw:numa_nodes’ and then the flavor vCPUs and RAM will be divided equally across the NUMA nodes.
"
This is what we are doing with numa:1. (we use sockets to know how many NUMA nodes we need)
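(As a rough sketch of that split, with variable names assumed from the surrounding Memory.pm code, each virtual node simply gets memory/$sockets and a contiguous block of $cores vCPUs:)

    # simplified sketch of the numa:1 behaviour: one virtual NUMA node per
    # socket, memory and cores divided equally ($conf and $static_memory are
    # assumed to come from the surrounding Memory.pm code)
    my $sockets = $conf->{sockets} || 1;
    my $cores   = $conf->{cores}   || 1;
    my $numa_memory = $static_memory / $sockets;
    for (my $i = 0; $i < $sockets; $i++) {
        my $cpustart = $cores * $i;
        my $cpuend   = $cpustart + $cores - 1;
        # virtual node $i: cpus $cpustart-$cpuend, ${numa_memory}MB of memory
    }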
" So, given an example config:
vcpus=8
mem=4
hw:numa_nodes=2 - number of NUMA nodes to expose to the guest.
hw:numa_cpus.0=0,1,2,3,4,5
hw:numa_cpus.1=6,7
hw:numa_mem.0=3072
hw:numa_mem.1=1024
The scheduler will look for a host with 2 NUMA nodes with the ability to run 6 CPUs + 3 GB of RAM on one node, and 2 CPUs + 1 GB of RAM on another node. If a host has a single NUMA node with capability to run 8 CPUs and 4 GB of RAM it will not be considered a valid match.
"
So, if the host doesn't have enough NUMA nodes, it's an invalid match.
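(A minimal sketch of the same kind of check on our side, using a hypothetical helper that counts the host nodes exposed under /sys/devices/system/node:)

    # hypothetical helper: count the host's NUMA nodes via sysfs and refuse
    # a guest topology that wants more virtual nodes than the host provides
    sub host_numa_node_count {
        my @nodes = glob("/sys/devices/system/node/node[0-9]*");
        return scalar(@nodes);
    }

    sub check_guest_numa_nodes {
        my ($guest_nodes) = @_;
        my $host_nodes = host_numa_node_count();
        die "guest wants $guest_nodes NUMA nodes but host only has $host_nodes\n"
            if $guest_nodes > $host_nodes;
    }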
----- Original Message -----
From: "aderumier" <aderumier at odiso.com>
To: "Wolfgang Bumiller" <w.bumiller at proxmox.com>
Cc: "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Wednesday, July 27, 2016 11:38:04
Subject: Re: [pve-devel] 3 numa topology issues
>>I believe we can simply remove this line since qemu allows it and just
>>applies its default policy. Alternatively we can keep a counter and
>>apply host-nodes manually, starting over at 0 when we run out of nodes,
>>but that's no better than letting qemu do this.
Well, I don't know how automatic NUMA balancing behaves on the host when, for example,
a guest defines 2 NUMA nodes and the host has only 1 NUMA node.
I'll have more time next week to do a lot of tests
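(For reference, whether automatic NUMA balancing is enabled on the host can be read from the standard kernel knob /proc/sys/kernel/numa_balancing; a trivial check:)

    # read the kernel's automatic NUMA balancing switch (1 = enabled)
    my $path = "/proc/sys/kernel/numa_balancing";
    if (open(my $fh, '<', $path)) {
        chomp(my $enabled = <$fh>);
        close($fh);
        print "auto NUMA balancing: ", ($enabled ? "enabled" : "disabled"), "\n";
    } else {
        print "kernel has no numa_balancing knob\n";
    }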
----- Original Message -----
From: "Wolfgang Bumiller" <w.bumiller at proxmox.com>
To: "aderumier" <aderumier at odiso.com>
Cc: "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Wednesday, July 27, 2016 09:16:07
Subject: Re: 3 numa topology issues
> On July 26, 2016 at 2:18 PM Alexandre DERUMIER <aderumier at odiso.com> wrote:
>
>
> > >>Issue #1: The above code currently does not honor our 'hostnodes' option
> > >>and breaks when trying to use them together.
>
> Also I need to check how to allocate hugepages when hostnodes is defined with a range like "hostnodes:0-1".
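(Per-node hugepage reservation goes through sysfs; a rough sketch, assuming 2MB pages and a hypothetical helper, of what allocating across a "0-1" range could look like:)

    # hypothetical sketch: set the 2MB hugepage pool on each host node of a
    # "start-end" range via the standard per-node sysfs interface
    sub reserve_hugepages_on_range {
        my ($range, $count) = @_;
        my ($start, $end) = split(/-/, $range);
        $end //= $start;
        for my $node ($start .. $end) {
            my $path = "/sys/devices/system/node/node$node" .
                       "/hugepages/hugepages-2048kB/nr_hugepages";
            die "host NUMA node$node has no 2MB hugepage pool\n" if ! -f $path;
            open(my $fh, '>', $path) or die "cannot open $path: $!\n";
            print $fh "$count\n";
            close($fh);
        }
    }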
>
>
>
>
>
> >>Useless, yes, which is why I'm wondering whether this should be
> >>supported/warned about/error...
>
> I think we could force "hostnodes" to be defined.
> I don't know if a lot of people already use the numaX option, but as we never exposed it in the GUI, I don't think it would break the setup of too many people.
>
>
>
>
>
> ----- Original Message -----
> From: "Wolfgang Bumiller" <w.bumiller at proxmox.com>
> To: "aderumier" <aderumier at odiso.com>
> Cc: "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Tuesday, July 26, 2016 13:59:42
> Subject: Re: 3 numa topology issues
>
> On Tue, Jul 26, 2016 at 01:35:50PM +0200, Alexandre DERUMIER wrote:
> > Hi Wolfgang,
> >
> > I just came back from holiday.
>
> Hope you had a good time :-)
>
> >
> >
> >
> > >>Issue #1: The above code currently does not honor our 'hostnodes' option
> > >>and breaks when trying to use them together.
> >
> > mmm indeed. I think this can be improved. I'll try to check that next week.
> >
> >
> >
> > >>Issue #2: We create one node per *virtual* socket, which means enabling
> > >>hugepages with more virtual sockets than physical numa nodes will die
> > >>with the error that the numa node doesn't exist. This should be fixable
> > >>as far as I can tell, as nothing really prevents us from putting them on
> > >>the same node? At least this used to work and I've already asked this
> > >>question at some point. You said the host kernel will try to map them,
> > >>yet it worked without issues before, so I'm still not sure about this.
> > >>Here's the conversation snippet:
> >
> > you can create more virtual NUMA nodes than physical ones, but only if you don't define the "hostnodes" option.
> >
> > (from my point of view, it's totally useless, as the whole point of the numa option is to map virtual nodes to physical nodes, to avoid a memory access bottleneck)
>
> Useless, yes, which is why I'm wondering whether this should be
> supported/warned about/error...
>
> >
> > if hostnodes is defined, you need to have the physical NUMA nodes available (a VM with 2 NUMA nodes needs a host with 2 NUMA nodes)
> >
> > With hugepages enabled, I added a restriction requiring hostnodes to be defined, because you want to be sure that the memory is on the same node.
> >
> >
> > # hostnodes
> > my $hostnodelists = $numa->{hostnodes};
> > if (defined($hostnodelists)) {
> >     my $hostnodes;
> >     foreach my $hostnoderange (@$hostnodelists) {
> >         my ($start, $end) = @$hostnoderange;
> >         $hostnodes .= ',' if $hostnodes;
> >         $hostnodes .= $start;
> >         $hostnodes .= "-$end" if defined($end);
> >         $end //= $start;
> >         for (my $i = $start; $i <= $end; ++$i) {
> >             die "host NUMA node$i doesn't exist\n" if ! -d "/sys/devices/system/node/node$i/";
> >         }
> >     }
> >
> >     # policy
> >     my $policy = $numa->{policy};
> >     die "you need to define a policy for hostnode $hostnodes\n" if !$policy;
> >     $mem_object .= ",host-nodes=$hostnodes,policy=$policy";
> > } else {
> >     die "numa hostnodes need to be defined to use hugepages" if $conf->{hugepages};
> > }
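(For illustration, a standalone run of that loop; the $numa structure below is made up to match what the code above expects for "hostnodes=0-1" with policy "bind":)

    # illustrative only: "hostnodes=0-1" parses to one [start, end] range
    my $numa = { hostnodes => [ [0, 1] ], policy => 'bind' };
    my $hostnodes;
    foreach my $hostnoderange (@{$numa->{hostnodes}}) {
        my ($start, $end) = @$hostnoderange;
        $hostnodes .= ',' if $hostnodes;
        $hostnodes .= $start;
        $hostnodes .= "-$end" if defined($end);
    }
    print ",host-nodes=$hostnodes,policy=$numa->{policy}\n";
    # prints: ,host-nodes=0-1,policy=bind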
> >
> >
> > >>Issue #3: Actually just an extension to #2: we currently cannot enable
> > >>NUMA at all (even without hugepages) when there are more virtual sockets
> > >>than physical numa nodes, and this used to work. The big question is
> > >>now: does this even make sense? Or should we tell users not to do this?
> >
> > That's strange, it should work if you don't define the hugepages and hostnodes options (in numaX)
>
> Actually this one was my own faulty configuration, sorry.
Gotta take that back, here's the problem:
sockets: 2
numa: 1
(no numaX defined)
will go through Memory.pm's sub config:
| if ($conf->{numa}) {
|
|     my $numa_totalmemory = undef;
|     for (my $i = 0; $i < $MAX_NUMA; $i++) {
|         next if !$conf->{"numa$i"};
(...)
|     }
|
|     # if no custom topology, we split memory and cores across numa nodes
|     if (!$numa_totalmemory) {
|
|         my $numa_memory = ($static_memory / $sockets);
|
|         for (my $i = 0; $i < $sockets; $i++) {
|             die "host NUMA node$i doesn't exist\n" if ! -d "/sys/devices/system/node/node$i/";
and dies there if the corresponding host NUMA node doesn't exist.
I believe we can simply remove this line since qemu allows it and just
applies its default policy. Alternatively we can keep a counter and
apply host-nodes manually, starting over at 0 when we run out of nodes,
but that's no better than letting qemu do this.
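(A rough sketch of that counter alternative, with names assumed from the surrounding code and option strings modeled on the existing code; not an actual patch:)

    # sketch: instead of dying, assign each virtual node a host node and
    # wrap around when the guest has more virtual nodes than the host
    # ($sockets and $numa_memory as in the surrounding Memory.pm code)
    my @host_nodes = sort { $a <=> $b }
        map { /node(\d+)$/ ? $1 : () } glob("/sys/devices/system/node/node[0-9]*");
    for (my $i = 0; $i < $sockets; $i++) {
        my $mem_object = "memory-backend-ram,id=ram-node$i,size=${numa_memory}M";
        if (@host_nodes) {
            my $hostnode = $host_nodes[$i % scalar(@host_nodes)];
            $mem_object .= ",host-nodes=$hostnode,policy=bind";
        }
        # ... push the -object/-numa arguments as the existing code does
    }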
_______________________________________________
pve-devel mailing list
pve-devel at pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel