[PVE-User] Automatic migration before reboot / shutdown? Migration to host in same group?

Uwe Sauter uwe.sauter.de at gmail.com
Thu Jul 6 15:14:37 CEST 2017


Hi Thomas,

thank you for your insight.


>> 1) I was wondering how a PVE (4.4) cluster will behave when one of the nodes is restarted / shut down either via the WebGUI or via
>> the command line. Will hosted, HA-managed VMs be migrated to other hosts before shutting down, or will they be stopped (and restarted on
>> another host once HA recognizes them as gone)?
> 
> First: any graceful shutdown, which triggers stopping the pve-ha-lrm service, causes
> all HA-managed services to be queued to stop (graceful shutdown with timeout).
> This is done to ensure consistency.
> 
> Whether an HA service then gets recovered to another node or "waits" until the current
> node comes up again depends on whether you triggered a shutdown or a reboot.
> On a shutdown the service will be recovered after the node is seen as "dead" (~2 minutes),
> but on a reboot we mark the service as frozen, so the HA stack does not touch it.
> The idea here is that if a user reboots the node without migrating its services away, he expects
> the node to come back up quickly and start the services again on its own.
> Now, we know that this may not always be ideal, especially on really big machines
> with hundreds of gigabytes of RAM and slow-as-hell firmware, where a boot may take > 10 minutes.

Understood. This is also kind of what I expected.

What is still unclear to me is what you consider a "graceful" shutdown. Is it every action that stops pve-ha-lrm?
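
For example, I assume all of the following would count, since they all end up stopping the LRM (please correct me if I'm wrong):

    reboot                        # orderly reboot via the command line
    shutdown -h now               # orderly shutdown
    systemctl stop pve-ha-lrm     # stopping the LRM service directly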

> An idea is to allow the configuration of the behavior and add two additional behaviors,
> i.e. migrate away and relocate away.

What's the difference between migration and relocation? Temporary vs. permanent?

>> 2) Currently I run a cluster of four nodes that share the same 2U chassis:
>>
>> +-----+-----+
>> |  A  |  B  |
>> +-----+-----+
>> |  C  |  D  |
>> +-----+-----+
>>
>> (Please don't comment on whether this setup is ideal – I'm aware of the risks a single chassis brings…)
> As long as your nodes share a continent you're never safe anyway :-P

True, but impossible to implement for approx. 99.999999% of all PVE users. And latencies would be a nightmare, esp. with Ceph :D

>> I created several HA groups:
>>
>> - left  contains A & C
>> - right contains B & D
>> - upper contains A & B
>> - lower contains C & D
>> - all   contains all nodes
>>
>> and configured VMs to run inside one of the groups.
>>
>> For updates I usually follow these steps:
>> - migrate VMs off the node via the "bulk migrate" feature, selecting one of the other nodes
>> - when no more VMs are running, do an "apt-get dist-upgrade" and reboot
>> - repeat until all nodes are up-to-date
>>
>> One issue I ran into with this procedure is that sometimes, while a VM is still being migrated to another host, already migrated VMs are
>> migrated back onto the current node, because the target that was selected for "bulk migrate" was not inside the same group as the
>> current host.
> This is expected; you told the ha-manager that a service should not or cannot run there,
> thus it tried to bring it into an "OK" state again.

Yes, I was aware of the reasons why the VM was moved back, though it would make more sense to move it to another node in the same
(allowed) group for the maintenance case I'm describing here.
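
For reference, my group layout corresponds to something like the following in /etc/pve/ha/groups.cfg (sketched from memory, options omitted; the actual file may differ):

    group: left
            nodes A,C
    group: right
            nodes B,D
    group: upper
            nodes A,B
    group: lower
            nodes C,D
    group: all
            nodes A,B,C,D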

>> Practical example:
>> - VM 101 is configured to run on the left side of the cluster
>> - VM 102 is configured to run on the lower level of the cluster
>> - node C shall be updated
>> - I select "bulk migrate" to node D
>> - VM 101 is migrated to D
>> - VM 102 is migrated to D, but takes some time (a lot of RAM)
>> - HA recognizes that VM 101 is not running in the correct group and schedules a migration back to node C
>> - migration of VM 102 finishes and migration of VM 101 back to node C immediately starts
>> - once migration of VM 101 has finished I manually need to initiate another migration (and after that need to be faster than HA to
>> do a reboot)
>>
>>
>> Would it be possible to implement another "bulk action" that evacuates a host in such a way that the appropriate
>> target node is selected for every VM, depending on the HA group configuration? This might also temporarily disable that node in HA
>> management, e.g. for 10 min or until the next reboot, so that maintenance work can be done…
>> What do you think of that idea?
>>
> 
> Essentially a maintenance mode? I'm not opposed to it, but if such a thing were done
> it would only be a light wrapper around already existing functionality.

Absolutely. Just another action that would evacuate the current host as optimally as possible. All VMs that are constrained to a
specific node group should be migrated within that group; all other VMs should be migrated to any available node (possibly doing
some load balancing inside the cluster).
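
For the practical example above, such a group-aware evacuation of node C would boil down to two calls to the existing CLI (VM IDs as in the example):

    ha-manager migrate vm:101 A    # 101 is in "left" (A,C), so A is the only valid target
    ha-manager migrate vm:102 D    # 102 is in "lower" (C,D), so D is the only valid target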

> Can I ask what the reason for your group setup is?
> I assume that all VMs may run on all nodes, but you want to "pin" some VMs to specific nodes for load reasons?

We started to build a cluster out of just one chassis with four nodes. In the next few weeks I will add additional nodes that will
possibly be located in another building. Those nodes will be grouped similarly, and there will be additional groups that include
subsets of nodes from each building.

The reason behind my group setup is that I have two projects, each with several services that run on two VMs each (for
redundancy and load balancing, e.g. LDAP). A configuration where one LDAP is running "left" and the other is running "right"
eliminates the risk that both VMs run on the same node (and suffer a disruption of service if that particular node fails).

So for the first project I distribute all important VMs between "left" and "right", and the other project's important VMs are
distributed between "upper" and "lower". This ensures that, for both projects, important services are not interrupted if *one* node
fails.

All less-important VMs are allowed to run on all nodes.

If there are valid concerns against this reasoning, I'm open to suggestions for improvement.
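
Concretely, pinning the first project's two LDAP VMs looks like this (the VM IDs are illustrative):

    # one LDAP replica per "side"; the groups share no nodes,
    # so the replicas can never end up on the same node
    ha-manager add vm:201 -group left
    ha-manager add vm:202 -group right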

> If this is the case I'd suggest changing the group configuration.
> I.e. each node gets a group: A, B, C and D. Each group has the respective node with priority 2 and all others with priority 1.
> When doing a system upgrade on node A you would edit group A and set node A's priority to 0;
> all services should then migrate away from this node, trying to balance the service count over all nodes.
> You do not need to trigger a bulk action, at least for the HA-managed VMs.
> 
> After everything has migrated, execute the upgrade and reboot.
> Then reconfigure group A so that node A again has the highest priority,
> i.e. 2, and the respective services migrate back to it again.
> 
> This should be quite fast to do after the initial setup; you just need to open the group configuration
> dialog and lower/raise the priority of one node.
> 
> You could also use a similar procedure with your current group configuration.
> The main difference is that you would need to edit two groups to free up a node.
> The advantage of my method is that the services get distributed across all other nodes, not just moved to a single one.

Interesting idea. I hadn't looked at priorities yet.
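
If I understand the scheme correctly, it would look something like this on the command line (group A shown; the other groups are analogous):

    # normal operation: node A preferred for group A's services
    ha-manager groupset A -nodes "A:2,B:1,C:1,D:1"
    # before maintenance on A: drop its priority so services migrate away
    ha-manager groupset A -nodes "A:0,B:1,C:1,D:1"
    # after the reboot: restore the priority, services migrate back
    ha-manager groupset A -nodes "A:2,B:1,C:1,D:1"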

Request for improvement: in "datacenter -> HA -> groups", show the configured priorities, e.g. in the format
"nodename(priority)[,nodename(priority)]".


Regards,

    Uwe

> If anything is unclear or does not apply to your situation, feel free to ask.
> 
> cheers,
> Thomas
> 
> PS: if not already read, please see also:
> <https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#ha_manager_groups>
> 
> 



