[pve-devel] [PATCH ha-manager 09/15] manager: apply colocation rules when selecting service nodes

Daniel Kral d.kral at proxmox.com
Fri Apr 11 17:56:34 CEST 2025


Thanks for taking the time here too!

I'm unsure whether the documentation wasn't clear enough or I'm just 
blinded to some details of how the division between strict/non-strict 
should work, but I hope I can clarify some points about my understanding 
here. Please correct me in any case where the current implementation 
would break user expectations, that's definitely not something that I 
want ;).

I'll definitely take some time to improve the control flow and the names 
of variables/subroutines here to make it easier to understand, and add 
examples of what the content of $together and $separate looks like at 
different stages.

The algorithm is online and quite dependent on many other things, e.g. 
that $allowed_nodes already has those nodes removed that were previously 
tried and failed on, so it's pretty dynamic here.

On 4/3/25 14:17, Fabian Grünbichler wrote:
> On March 25, 2025 4:12 pm, Daniel Kral wrote:
>> Add a mechanism to the node selection subroutine, which enforces the
>> colocation rules defined in the rules config.
>>
>> The algorithm manipulates the set of nodes directly, which the service
>> is allowed to run on, depending on the type and strictness of the
>> colocation rules, if there are any.
> 
> shouldn't this first attempt to satisfy all rules, and if that fails,
> retry with just the strict ones, or something similar? see comments
> below (maybe I am missing/misunderstanding something)

Hm, I'm not sure if I can follow what you mean here.

I tried to come up with some scenarios where there could be conflicts 
because of "loose" colocation rules being overshadowed by strict 
colocation rules, but I'm currently not seeing any. I've also been 
mostly concerned with smaller clusters (3 to 5 nodes) for now, so I'll 
take a closer look at larger applications/environments.

In general, when applying colocation rules, the logic is less concerned 
with which rules specifically get applied than with making sure that 
none is violated.

This is also why a colocation rule with only a single service turns out 
to be a noop, since it never depends on the location of another service 
(the rule will never add anything to $together/$separate, because there 
is only an entry there if other services already have a node pinned to 
them).
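
To make this more concrete, here's a minimal, hypothetical sketch (not 
the actual patch code) of why a rule containing only the service itself 
never contributes anything:

    my $sid  = 'vm:101';
    my $rule = { services => { 'vm:101' => 1 }, affinity => 'together', strict => 1 };

    my $colocated = {};
    for my $csid (sort keys %{ $rule->{services} }) {
        next if $csid eq $sid;    # only *other* services are collected
        $colocated->{$csid} = { affinity => $rule->{affinity}, strict => $rule->{strict} };
    }

    # $colocated stays empty, so $together/$separate stay empty as well
    # and the apply_*_colocation_rules() subroutines return early without
    # touching $allowed_nodes - i.e. the rule is effectively a noop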

I hope the comments below clarify this a little bit or make it clearer 
where I'm missing something, so that the code/behavior/documentation can 
be improved ;).

> 
>>
>> This makes it depend on the prior removal of any nodes, which are
>> unavailable (i.e. offline, unreachable, or weren't able to start the
>> service in previous tries) or are not allowed to be run on otherwise
>> (i.e. HA group node restrictions) to function correctly.
>>
>> Signed-off-by: Daniel Kral <d.kral at proxmox.com>
>> ---
>>   src/PVE/HA/Manager.pm      | 203 ++++++++++++++++++++++++++++++++++++-
>>   src/test/test_failover1.pl |   4 +-
>>   2 files changed, 205 insertions(+), 2 deletions(-)
>>
>> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
>> index 8f2ab3d..79b6555 100644
>> --- a/src/PVE/HA/Manager.pm
>> +++ b/src/PVE/HA/Manager.pm
>> @@ -157,8 +157,201 @@ sub get_node_priority_groups {
>>       return ($pri_groups, $group_members);
>>   }
>>   
>> +=head3 get_colocated_services($rules, $sid, $online_node_usage)
>> +
>> +Returns a hash map of all services, which are specified as being in a positive
>> +or negative colocation in C<$rules> with the given service with id C<$sid>.
>> +
>> +Each service entry consists of the type of colocation, strictness of colocation
>> +and the node the service is currently assigned to, if any, according to
>> +C<$online_node_usage>.
>> +
>> +For example, a service C<'vm:101'> being strictly colocated together (positive)
>> +with two other services C<'vm:102'> and C<'vm:103'> and loosely colocated
>> +separate with another service C<'vm:104'> results in the hash map:
>> +
>> +    {
>> +	'vm:102' => {
>> +	    affinity => 'together',
>> +	    strict => 1,
>> +	    node => 'node2'
>> +	},
>> +	'vm:103' => {
>> +	    affinity => 'together',
>> +	    strict => 1,
>> +	    node => 'node2'
>> +	},
>> +	'vm:104' => {
>> +	    affinity => 'separate',
>> +	    strict => 0,
>> +	    node => undef
>> +	}
>> +    }
>> +
>> +=cut
>> +
>> +sub get_colocated_services {
>> +    my ($rules, $sid, $online_node_usage) = @_;
>> +
>> +    my $services = {};
>> +
>> +    PVE::HA::Rules::Colocation::foreach_colocation_rule($rules, sub {
>> +	my ($rule) = @_;
>> +
>> +	for my $csid (sort keys %{$rule->{services}}) {
>> +	    next if $csid eq $sid;
>> +
>> +	    $services->{$csid} = {
>> +		node => $online_node_usage->get_service_node($csid),
>> +		affinity => $rule->{affinity},
>> +		strict => $rule->{strict},
>> +	    };
>> +        }
>> +    }, {
>> +	sid => $sid,
>> +    });
>> +
>> +    return $services;
>> +}
>> +
>> +=head3 get_colocation_preference($rules, $sid, $online_node_usage)
>> +
>> +Returns a list of two hashes, where each is a hash map of the colocation
>> +preference of C<$sid>, according to the colocation rules in C<$rules> and the
>> +service locations in C<$online_node_usage>.
>> +
>> +The first hash is the positive colocation preference, where each element
>> +represents properties for how much C<$sid> prefers to be on the node.
>> +Currently, this is a binary C<$strict> field, which means either it should be
>> +there (C<0>) or must be there (C<1>).
>> +
>> +The second hash is the negative colocation preference, where each element
>> +represents properties for how much C<$sid> prefers not to be on the node.
>> +Currently, this is a binary C<$strict> field, which means either it should not
>> +be there (C<0>) or must not be there (C<1>).
>> +
>> +=cut
>> +
>> +sub get_colocation_preference {
>> +    my ($rules, $sid, $online_node_usage) = @_;
>> +
>> +    my $services = get_colocated_services($rules, $sid, $online_node_usage);
>> +
>> +    my $together = {};
>> +    my $separate = {};
>> +
>> +    for my $service (values %$services) {
>> +	my $node = $service->{node};
>> +
>> +	next if !$node;
>> +
>> +	my $node_set = $service->{affinity} eq 'together' ? $together : $separate;
>> +	$node_set->{$node}->{strict} = $node_set->{$node}->{strict} || $service->{strict};
>> +    }
>> +
>> +    return ($together, $separate);
>> +}
>> +
>> +=head3 apply_positive_colocation_rules($together, $allowed_nodes)
>> +
>> +Applies the positive colocation preference C<$together> on the allowed node
>> +hash set C<$allowed_nodes> directly.
>> +
>> +Positive colocation means keeping services together on a single node, and
>> +therefore minimizing the separation of services.
>> +
>> +The allowed node hash set C<$allowed_nodes> is expected to contain any node,
>> +which is available to the service, i.e. each node is currently online, is
>> +available according to other location constraints, and the service has not
>> +failed running there yet.
>> +
>> +=cut
>> +
>> +sub apply_positive_colocation_rules {
>> +    my ($together, $allowed_nodes) = @_;
>> +
>> +    return if scalar(keys %$together) < 1;
>> +
>> +    my $mandatory_nodes = {};
>> +    my $possible_nodes = PVE::HA::Tools::intersect($allowed_nodes, $together);
>> +
>> +    for my $node (sort keys %$together) {
>> +	$mandatory_nodes->{$node} = 1 if $together->{$node}->{strict};
>> +    }
>> +
>> +    if (scalar keys %$mandatory_nodes) {
>> +	# limit to only the nodes the service must be on.
>> +	for my $node (keys %$allowed_nodes) {
>> +	    next if exists($mandatory_nodes->{$node});
>> +
>> +	    delete $allowed_nodes->{$node};
>> +	}
>> +    } elsif (scalar keys %$possible_nodes) {
> 
> I am not sure I follow this logic here.. if there are any strict
> requirements, we only honor those.. if there are no strict requirements,
> we only honor the non-strict ones?

Please correct me if I'm wrong, but at least to my understanding this 
seems right, because the nodes in $together are the nodes which other 
co-located services are already running on.

If there is a co-located service already running somewhere and the 
services MUST be kept together, then there will be an entry like 'node3' 
=> { strict => 1 } in $together. AFAICS we can then ignore any 
non-strict nodes here, because we already know where the service MUST run.

If there is a co-located service already running somewhere and the 
services SHOULD be kept together, then there will be one or more 
entries, e.g. $together = { 'node1' => { strict => 0 }, 'node2' => { 
strict => 0 } };

If there is no co-located service already running somewhere, then 
$together = {}; and this subroutine won't do anything to $allowed_nodes.

In theory, we could assume that %$mandatory_nodes always contains only 
one node, because it is mandatory. But currently, we do not hinder users 
from manually migrating against colocation rules (maybe we should?), and 
rules could also suddenly change from non-strict to strict. We do not 
auto-migrate if rules change (maybe we should?).
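
A rough example of what I mean (node names and values are made up):

    # strict case: a co-located service is already pinned to node3, so
    # $allowed_nodes is reduced to node3 (assuming node3 is still allowed)
    my $together      = { node3 => { strict => 1 } };
    my $allowed_nodes = { node1 => 1, node2 => 1, node3 => 1 };
    apply_positive_colocation_rules($together, $allowed_nodes);
    # $allowed_nodes == { node3 => 1 }

    # non-strict case: only narrow down to the preferred nodes that are
    # still allowed, but never force the choice
    $together      = { node1 => { strict => 0 }, node2 => { strict => 0 } };
    $allowed_nodes = { node2 => 1, node3 => 1 };
    apply_positive_colocation_rules($together, $allowed_nodes);
    # $allowed_nodes == { node2 => 1 }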

-----

On another note, intersect() here is used with $together (and 
set_difference() with $separate below), which goes against what I said 
in patch #5 about only using hash sets, but as it only checks the truth 
value of the entries anyway, it was fine here. I'll make that more 
robust in a v1.
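
Roughly what I mean here (assuming a simple key-based intersection; not 
necessarily the exact implementation in PVE::HA::Tools):

    my $allowed_nodes = { node1 => 1, node3 => 1 };
    my $together      = { node1 => { strict => 0 }, node2 => { strict => 0 } };

    # a key-based intersection only cares whether a node is present in
    # both hashes, so passing the hash-of-hashes $together instead of a
    # plain hash set happens to work, but building a proper set first
    # would be cleaner
    my $possible_nodes = {};
    for my $node (keys %$allowed_nodes) {
        $possible_nodes->{$node} = 1 if $together->{$node};
    }
    # $possible_nodes == { node1 => 1 }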

> 
>> +	# limit to the possible nodes the service should be on, if there are any.
>> +	for my $node (keys %$allowed_nodes) {
>> +	    next if exists($possible_nodes->{$node});
>> +
>> +	    delete $allowed_nodes->{$node};
>> +	}
> 
> this is the same code twice, just operating on different hash
> references, so could probably be a lot shorter. the next and delete
> could also be combined (`delete .. if !...`).

Yes, I wanted to break it down more and will improve it, thanks for the 
suggestion with the delete post-if!

I guess we can also move the definition + assignment of $possible_nodes 
down here, as it won't be needed for the $mandatory_nodes case, 
depending on whether the general behavior stays unchanged.
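
Something along these lines could replace the two duplicated loops 
(untested sketch, variable names as in the patch):

    my $keep_nodes = scalar(keys %$mandatory_nodes) ? $mandatory_nodes : $possible_nodes;

    if (scalar keys %$keep_nodes) {
        # drop every allowed node that is not in the set we want to keep
        delete $allowed_nodes->{$_} for grep { !exists($keep_nodes->{$_}) } keys %$allowed_nodes;
    }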

> 
>> +    }
>> +}
>> +
>> +=head3 apply_negative_colocation_rules($separate, $allowed_nodes)
>> +
>> +Applies the negative colocation preference C<$separate> on the allowed node
>> +hash set C<$allowed_nodes> directly.
>> +
>> +Negative colocation means keeping services separate on multiple nodes, and
>> +therefore maximizing the separation of services.
>> +
>> +The allowed node hash set C<$allowed_nodes> is expected to contain any node,
>> +which is available to the service, i.e. each node is currently online, is
>> +available according to other location constraints, and the service has not
>> +failed running there yet.
>> +
>> +=cut
>> +
>> +sub apply_negative_colocation_rules {
>> +    my ($separate, $allowed_nodes) = @_;
>> +
>> +    return if scalar(keys %$separate) < 1;
>> +
>> +    my $mandatory_nodes = {};
>> +    my $possible_nodes = PVE::HA::Tools::set_difference($allowed_nodes, $separate);
> 
> this is confusing or I misunderstand something here, see below..
> 
>> +
>> +    for my $node (sort keys %$separate) {
>> +	$mandatory_nodes->{$node} = 1 if $separate->{$node}->{strict};
>> +    }
>> +
>> +    if (scalar keys %$mandatory_nodes) {
>> +	# limit to the nodes the service must not be on.
> 
> this is missing a not?
> we are limiting to the nodes the service must not not be on :-P
> 
> should we rename mandatory_nodes to forbidden_nodes?

Good idea, yes, this would be a much better fitting name. When I wrote 
$mandatory_nodes above, I was always thinking 'mandatory to not be 
there'...

> 
>> +	for my $node (keys %$allowed_nodes) {
> 
> this could just loop over the forbidden nodes and delete them from
> allowed nodes?

Yes, this should also be possible. I think I had a counterexample in an 
earlier version where this didn't work, but now it should make sense.
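
I.e. something like this (untested, with the rename to $forbidden_nodes 
applied):

    # deleting a key that is not present in $allowed_nodes is a harmless noop
    delete $allowed_nodes->{$_} for keys %$forbidden_nodes;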

> 
>> +	    next if !exists($mandatory_nodes->{$node});
>> +
>> +	    delete $allowed_nodes->{$node};
>> +	}
>> +    } elsif (scalar keys %$possible_nodes) {
> 
> similar to above - if we have strict exclusions, we honor them, but we
> ignore the non-strict exclusions unless there are no strict ones?

Same principle as above, but now $separate holds all nodes that the 
anti-colocated services are already running on, so we're trying not to 
select a node from there.
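
For example (made-up values), with two strictly anti-colocated services 
already running on node1 and node2:

    my $separate      = { node1 => { strict => 1 }, node2 => { strict => 1 } };
    my $allowed_nodes = { node1 => 1, node2 => 1, node3 => 1 };
    apply_negative_colocation_rules($separate, $allowed_nodes);
    # $allowed_nodes == { node3 => 1 }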

> 
>> +	# limit to the nodes the service should not be on, if any.
>> +	for my $node (keys %$allowed_nodes) {
>> +	    next if exists($possible_nodes->{$node});
>> +
>> +	    delete $allowed_nodes->{$node};
>> +	}
>> +    }
>> +}
>> +
>> +sub apply_colocation_rules {
>> +    my ($rules, $sid, $allowed_nodes, $online_node_usage) = @_;
>> +
>> +    my ($together, $separate) = get_colocation_preference($rules, $sid, $online_node_usage);
>> +
>> +    apply_positive_colocation_rules($together, $allowed_nodes);
>> +    apply_negative_colocation_rules($separate, $allowed_nodes);
>> +}
>> +
>>   sub select_service_node {
>> -    my ($groups, $online_node_usage, $sid, $service_conf, $current_node, $try_next, $tried_nodes, $maintenance_fallback, $best_scored) = @_;
>> +    # TODO Cleanup this signature post-RFC
>> +    my ($rules, $groups, $online_node_usage, $sid, $service_conf, $current_node, $try_next, $tried_nodes, $maintenance_fallback, $best_scored) = @_;
>>   
>>       my $group = get_service_group($groups, $online_node_usage, $service_conf);
>>   
>> @@ -189,6 +382,8 @@ sub select_service_node {
>>   
>>       return $current_node if (!$try_next && !$best_scored) && $pri_nodes->{$current_node};
>>   
>> +    apply_colocation_rules($rules, $sid, $pri_nodes, $online_node_usage);
>> +
>>       my $scores = $online_node_usage->score_nodes_to_start_service($sid, $current_node);
>>       my @nodes = sort {
>>   	$scores->{$a} <=> $scores->{$b} || $a cmp $b
>> @@ -758,6 +953,7 @@ sub next_state_request_start {
>>   
>>       if ($self->{crs}->{rebalance_on_request_start}) {
>>   	my $selected_node = select_service_node(
>> +	    $self->{rules},
>>   	    $self->{groups},
>>   	    $self->{online_node_usage},
>>   	    $sid,
>> @@ -771,6 +967,9 @@ sub next_state_request_start {
>>   	my $select_text = $selected_node ne $current_node ? 'new' : 'current';
>>   	$haenv->log('info', "service $sid: re-balance selected $select_text node $selected_node for startup");
>>   
>> +	# TODO It would be better if this information would be retrieved from $ss/$sd post-RFC
>> +	$self->{online_node_usage}->pin_service_node($sid, $selected_node);
>> +
>>   	if ($selected_node ne $current_node) {
>>   	    $change_service_state->($self, $sid, 'request_start_balance', node => $current_node, target => $selected_node);
>>   	    return;
>> @@ -898,6 +1097,7 @@ sub next_state_started {
>>   	    }
>>   
>>   	    my $node = select_service_node(
>> +		$self->{rules},
>>   	        $self->{groups},
>>   		$self->{online_node_usage},
>>   		$sid,
>> @@ -1004,6 +1204,7 @@ sub next_state_recovery {
>>       $self->recompute_online_node_usage(); # we want the most current node state
>>   
>>       my $recovery_node = select_service_node(
>> +	$self->{rules},
>>   	$self->{groups},
>>   	$self->{online_node_usage},
>>   	$sid,
>> diff --git a/src/test/test_failover1.pl b/src/test/test_failover1.pl
>> index 308eab3..4c84fbd 100755
>> --- a/src/test/test_failover1.pl
>> +++ b/src/test/test_failover1.pl
>> @@ -8,6 +8,8 @@ use PVE::HA::Groups;
>>   use PVE::HA::Manager;
>>   use PVE::HA::Usage::Basic;
>>   
>> +my $rules = {};
>> +
>>   my $groups = PVE::HA::Groups->parse_config("groups.tmp", <<EOD);
>>   group: prefer_node1
>>   	nodes node1
>> @@ -31,7 +33,7 @@ sub test {
>>       my ($expected_node, $try_next) = @_;
>>       
>>       my $node = PVE::HA::Manager::select_service_node
>> -	($groups, $online_node_usage, "vm:111", $service_conf, $current_node, $try_next);
>> +	($rules, $groups, $online_node_usage, "vm:111", $service_conf, $current_node, $try_next);
>>   
>>       my (undef, undef, $line) = caller();
>>       die "unexpected result: $node != ${expected_node} at line $line\n"
>> -- 
>> 2.39.5
>>
>>
>>
>> _______________________________________________
>> pve-devel mailing list
>> pve-devel at lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
>>
>>
>>
> 
> 
> _______________________________________________
> pve-devel mailing list
> pve-devel at lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
> 
> 