[pve-devel] 3 numa topology issues

Tue Jul 26 12:36:44 CEST 2016

Currently we have the following code in hugepages_topology():

|    for (my $i = 0; $i < $MAX_NUMA; $i++) {
|        next if !$conf->{"numa$i"};
|        my $numa = PVE::QemuServer::parse_numa($conf->{"numa$i"});
(...)
|        $hugepages_topology->{$hugepages_size}->{$i} += hugepages_nr($numa_memory, $hugepages_size);
|    }

Given how $hugepages_topology is used, this means that guest numa node 0
will always allocate from the host's numa node 0, guest node 1 from host
node 1, and so on:

From hugepages_allocate():

|       my $nodes = $hugepages_topology->{$size};
|
|       foreach my $numanode (keys %$nodes) {
(...)
|           my $path = "/sys/devices/system/node/node${numanode}/hugepages/hugepages-${hugepages_size}kB/";
(...)
|       }

Issue #1: The above code currently does not honor our 'hostnodes' option
and breaks when hugepages and 'hostnodes' are used together.
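
To illustrate what honoring 'hostnodes' could mean here, below is a rough,
self-contained sketch that keys the topology by host node instead of by
guest node index. It is not the current implementation: pages_needed() and
topology_by_host_node() are hypothetical names, and the layout assumed for
the parsed 'hostnodes' value (a list of [start, end] ranges) is an
assumption as well. Guest nodes spanning several host nodes are
deliberately left unhandled.

```perl
use strict;
use warnings;

# Hypothetical stand-in for hugepages_nr(): number of pages of $size_kb
# needed to back $memory_mb megabytes, rounded up.
sub pages_needed {
    my ($memory_mb, $size_kb) = @_;
    my $kb = $memory_mb * 1024;
    return int(($kb + $size_kb - 1) / $size_kb);
}

# $guest_nodes: { guest index => { memory => MB, hostnodes => [[start, end], ...] } }
# The 'hostnodes' layout used here is an assumption made for this sketch.
sub topology_by_host_node {
    my ($guest_nodes, $size_kb) = @_;

    my $topology = {};
    foreach my $i (sort { $a <=> $b } keys %$guest_nodes) {
        my $numa = $guest_nodes->{$i};

        my @hosts;
        if (my $ranges = $numa->{hostnodes}) {
            foreach my $range (@$ranges) {
                my ($start, $end) = @$range;
                $end = $start if !defined($end);
                push @hosts, $start .. $end;
            }
        } else {
            # No 'hostnodes' set: fall back to guest index == host index.
            @hosts = ($i);
        }

        # How to split pages across several host nodes is a policy
        # question this sketch does not try to answer.
        die "guest node $i: multiple host nodes are not handled in this sketch\n"
            if @hosts != 1;

        $topology->{$size_kb}->{$hosts[0]} += pages_needed($numa->{memory}, $size_kb);
    }
    return $topology;
}
```

With this shape, two guest nodes that both name the same host node simply
add up on that host node, which is also the behaviour asked about in
issue #2 below.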

Issue #2: We create one node per *virtual* socket, which means enabling
hugepages with more virtual sockets than physical numa nodes will die
with the error that the numa node doesn't exist. This should be fixable
as far as I can tell, as nothing really prevents us from putting them on
the same node? At least this used to work and I've already asked this
question at some point. You said the host kernel will try to map them,
yet it worked without issues before, so I'm still not sure about this.
Here's the conversation snippet:

| >>When adding more numaX entries to the VM's config than the host has this
| >>now produces an 'Use of uninitialized value' error.
| >>Better check for whether /sys/devices/system/node/node$numanode exists
| >>and throw a useful error.
| >>But should this even be fixed to host nodes? Without hugepages I was
| >>able to provide more smaller numa nodes to the guest (iow. split one big
| >>host numa node into multiple smaller virtual ones), should this not work
| >>with hugepages, too?
| 
| I need to check that. But you shouldn't be able to create more numa
| nodes number in guest than host nodes number.
| (Because linux host kernel will try to map guest numa node to host numa node)

In the worst case we could create one big node for all cpus?
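
Independent of how this gets decided, the existence check suggested in the
quoted mail could look roughly like the sketch below. The function name
assert_host_numa_node() is made up for illustration, and the optional
$sysfs_root parameter exists only so the check can be tried against a
scratch directory.

```perl
use strict;
use warnings;

# Die with a readable message instead of running into
# 'Use of uninitialized value' later on.
sub assert_host_numa_node {
    my ($numanode, $sysfs_root) = @_;
    $sysfs_root //= '/sys/devices/system/node';

    my $dir = "$sysfs_root/node$numanode";
    die "host numa node $numanode does not exist ($dir not found)\n"
        if !-d $dir;

    return $dir;
}
```

Calling this once per key of $hugepages_topology->{$size} before touching
the hugepages files would at least turn the current failure into a clear
error.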

Issue #3: Actually just an extension to #2: we currently cannot enable
NUMA at all (even without hugepages) when there are more virtual sockets
than physical numa nodes, and this used to work. The big question is
now: does this even make sense? Or should we tell users not to do this?