[pve-devel] 3 numa topology issues

Alexandre DERUMIER aderumier at odiso.com
Tue Jul 26 13:35:50 CEST 2016


Hi Wolfgang,

I just came back from holiday.



>>Issue #1: The above code currently does not honor our 'hostnodes' option 
>>and breaks when trying to use them together. 

mmm indeed. I think this can be improved. I'll try to check that next week.
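
Something along these lines in hugepages_topology() might work (a rough, untested sketch; it assumes parse_numa() returns hostnodes as the same list of [start, end] ranges that the check in the NUMA setup code below iterates over — names and structure still to be verified against the real code):

    # rough sketch: account hugepages against the configured host nodes
    # instead of always using the virtual node index $i
    my $hostnodelists = $numa->{hostnodes};
    if (defined($hostnodelists)) {
        my @hostnodes;
        foreach my $range (@$hostnodelists) {
            my ($start, $end) = @$range;
            $end //= $start;
            push @hostnodes, ($start .. $end);
        }
        # simplest option: put the whole virtual node on the first listed
        # host node; spreading it across several nodes is a separate decision
        $hugepages_topology->{$hugepages_size}->{$hostnodes[0]} +=
            hugepages_nr($numa_memory, $hugepages_size);
    } else {
        $hugepages_topology->{$hugepages_size}->{$i} +=
            hugepages_nr($numa_memory, $hugepages_size);
    }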



>>Issue #2: We create one node per *virtual* socket, which means enabling 
>>hugepages with more virtual sockets than physical numa nodes will die 
>>with the error that the numa node doesn't exist. This should be fixable 
>>as far as I can tell, as nothing really prevents us from putting them on 
>>the same node? At least this used to work and I've already asked this 
>>question at some point. You said the host kernel will try to map them, 
>>yet it worked without issues before, so I'm still not sure about this. 
>>Here's the conversation snippet: 

You can create more virtual NUMA nodes than physical ones, but only if you don't define the "hostnodes" option.

(From my point of view it's totally useless, as the whole point of the numa option is to map virtual nodes to physical nodes, to avoid memory access bottlenecks.)

If hostnodes is defined, the corresponding physical NUMA nodes need to be available (a VM with 2 NUMA nodes needs a host with 2 NUMA nodes).
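
For example, with a config like this (illustrative values, assuming the usual numaX syntax):

    numa0: cpus=0-3,memory=8192,hostnodes=0,policy=bind
    numa1: cpus=4-7,memory=8192,hostnodes=1,policy=bind

this only starts if the host actually has NUMA nodes 0 and 1.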

With hugepages enabled, I have added a restriction that hostnodes must be defined, because you want to be sure that the memory really ends up on the intended node.


            # hostnodes: expand the configured ranges into a comma-separated
            # list for QEMU and verify that each node exists on the host
            my $hostnodelists = $numa->{hostnodes};
            if (defined($hostnodelists)) {
                my $hostnodes;
                foreach my $hostnoderange (@$hostnodelists) {
                    my ($start, $end) = @$hostnoderange;
                    $hostnodes .= ',' if $hostnodes;
                    $hostnodes .= $start;
                    $hostnodes .= "-$end" if defined($end);
                    $end //= $start;
                    for (my $i = $start; $i <= $end; ++$i ) {
                        die "host NUMA node$i doesn't exist\n" if ! -d "/sys/devices/system/node/node$i/";
                    }
                }

                # policy is mandatory as soon as hostnodes is set
                my $policy = $numa->{policy};
                die "you need to define a policy for hostnode $hostnodes\n" if !$policy;
                $mem_object .= ",host-nodes=$hostnodes,policy=$policy";
            } else {
                die "numa hostnodes need to be defined to use hugepages\n" if $conf->{hugepages};
            }
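
With hostnodes and policy set as above, the memory backend passed to qemu should end up looking roughly like this (illustrative; the actual id, size and mem-path depend on the config and the hugepage mount point):

    -object memory-backend-file,id=ram-node0,size=8192M,mem-path=/run/hugepages/kvm/2048kB,share=on,host-nodes=0,policy=bind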


>>Issue #3: Actually just an extension to #2: we currently cannot enable 
>>NUMA at all (even without hugepages) when there are more virtual sockets 
>>than physical numa nodes, and this used to work. The big question is 
>>now: does this even make sense? Or should we tell users not to do this? 

That's strange; it should work if you don't define the hugepages and hostnodes options (in numaX).

(But without hostnodes, I don't see any reason to do this.)
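
Without hostnodes, the generated command line just pins guest CPUs to virtual nodes and lets the host kernel place the memory, roughly like this (illustrative, not copied from the actual generated command line):

    -object memory-backend-ram,id=ram-node0,size=4096M \
    -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
    -object memory-backend-ram,id=ram-node1,size=4096M \
    -numa node,nodeid=1,cpus=2-3,memdev=ram-node1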


----- Original Message -----
From: "Wolfgang Bumiller" <w.bumiller at proxmox.com>
To: "aderumier" <aderumier at odiso.com>
Cc: "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Tuesday, July 26, 2016 12:36:44
Subject: 3 numa topology issues

Currently we have the following code in hugepages_topology(): 

| for (my $i = 0; $i < $MAX_NUMA; $i++) {
|     next if !$conf->{"numa$i"};
|     my $numa = PVE::QemuServer::parse_numa($conf->{"numa$i"});
(...)
|     $hugepages_topology->{$hugepages_size}->{$i} += hugepages_nr($numa_memory, $hugepages_size);
| }

The way $hugepages_topology is used, this means that numa node 0 will
always allocate from the host's numa node 0, node 1 from node 1, and so on:

From hugepages_allocate():

| my $nodes = $hugepages_topology->{$size};
|
| foreach my $numanode (keys %$nodes) {
(...)
|     my $path = "/sys/devices/system/node/node${numanode}/hugepages/hugepages-${hugepages_size}kB/";
(...)
| }

Issue #1: The above code currently does not honor our 'hostnodes' option 
and breaks when trying to use them together. 

Issue #2: We create one node per *virtual* socket, which means enabling 
hugepages with more virtual sockets than physical numa nodes will die 
with the error that the numa node doesn't exist. This should be fixable 
as far as I can tell, as nothing really prevents us from putting them on 
the same node? At least this used to work and I've already asked this 
question at some point. You said the host kernel will try to map them, 
yet it worked without issues before, so I'm still not sure about this. 
Here's the conversation snippet: 

| >>When adding more numaX entries to the VM's config than the host has this 
| >>now produces an 'Use of uninitialized value' error. 
| >>Better check for whether /sys/devices/system/node/node$numanode exists 
| >>and throw a useful error. 
| >>But should this even be fixed to host nodes? Without hugepages I was 
| >>able to provide more smaller numa nodes to the guest (iow. split one big 
| >>host numa node into multiple smaller virtual ones), should this not work 
| >>with hugepages, too? 
| 
| I need to check that. But you shouldn't be able to create more numa nodes in the guest than the host has.
| (Because linux host kernel will try to map guest numa node to host numa node) 

In the worst case we could create one big node for all cpus? 

Issue #3: Actually just an extension to #2: we currently cannot enable 
NUMA at all (even without hugepages) when there are more virtual sockets 
than physical numa nodes, and this used to work. The big question is 
now: does this even make sense? Or should we tell users not to do this? 



