[PVE-User] Automatic migration before reboot / shutdown? Migration to host in same group?

Thu Jul 6 16:03:03 CEST 2017

Hi,

On 07/06/2017 03:14 PM, Uwe Sauter wrote:
> Hi Thomas,
>
> thank you for your insight.
>
>
>>> 1) I was wondering how a PVE (4.4) cluster will behave when one of the nodes is restarted / shutdown either via WebGUI or via
>>> commandline. Will hosted, HA-managed VMs be migrated to other hosts before shutting down or will they be stopped (and restarted on
>>> another host once HA recognizes them as gone)?
>> First: on any graceful shutdown, which triggers stopping the pve-ha-lrm service,
>> all HA managed services will be queued to stop (graceful shutdown with timeout).
>> This is done to ensure consistency.
>>
>> Whether an HA service then gets recovered to another node or "waits" until the current
>> node comes up again depends on whether you triggered a shutdown or a reboot.
>> On a shutdown the service will be recovered after the node is seen as "dead" (~2 minutes),
>> but on a reboot we mark the service as frozen, so the HA stack does not touch it.
>> The idea here is that if a user reboots the node without migrating a service away, they expect
>> the node to come up again quickly and start the service on its own.
>> Now, we know that this may not always be ideal, especially on really big machines
>> with hundreds of gigabytes of RAM and slow-as-hell firmware, where a boot may need > 10 minutes.
> Understood. This is also kind of what I expected.
>
> What is still unclear to me is what you consider a "graceful" shutdown? Every action that stops pve-ha-lrm?

No, not every action which stops the pve-ha-lrm.
If it gets a stop request from anyone, we check whether a shutdown or reboot is
in progress; if so, we know that we have to stop/shut down the services.
If no shutdown or reboot is in progress, we just freeze the services and
do not touch them. We do this because the only case where this happens is
the one where a user manually triggers a stop via:
# systemctl stop pve-ha-lrm
or
# systemctl restart pve-ha-lrm
In both cases stopping running services is probably unwanted; we expect
that the user knows why they are doing this.
One reason could be to shut down the LRM watchdog connection because quorum
loss is expected in the next few minutes.
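
To summarize the three cases described so far, here is a minimal sketch in Python. It is only a model of the behavior as explained in this mail, not code from pve-ha-lrm; the names `NodeEvent` and `on_lrm_stop` are made up for illustration.

```python
from enum import Enum


class NodeEvent(Enum):
    """What is happening on the node when pve-ha-lrm receives a stop request."""
    SHUTDOWN = "shutdown"
    REBOOT = "reboot"
    NONE = "none"  # plain `systemctl stop/restart pve-ha-lrm`


def on_lrm_stop(event):
    """Return (what happens to local services, what the HA stack does next).

    Hypothetical helper that mirrors the description in this thread.
    """
    if event is NodeEvent.SHUTDOWN:
        # Services are stopped gracefully; recovery happens elsewhere
        # once the node is seen as dead (roughly 2 minutes).
        return ("stop", "recover on another node")
    if event is NodeEvent.REBOOT:
        # Services are stopped, then frozen so the HA stack leaves them
        # alone until the node is back.
        return ("stop", "freeze until node returns")
    # Manual stop/restart of the LRM: running services are not touched.
    return ("keep running", "freeze")


for ev in NodeEvent:
    print(ev.value, "->", on_lrm_stop(ev))
```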

>> An idea is to allow the configuration of the behavior and add two additional behaviors,
>> i.e. migrate away and relocate away.
> What's the difference between migration and relocation? Temporary vs. permanent?

Migration does an online migration if possible (i.e. for VMs) and if the
service is already running.
Relocation *always* stops the service if it is running and only then moves it.
Whether it then gets started again on the other side depends on the request
state.

The latter may be useful for really big VMs where a short downtime can
be accepted and an online migration would take far too long or cause
congestion on the network.
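
The difference between the two can be sketched as an ordered list of steps. This is an illustration only: `plan_move` and the step names are hypothetical, and the fallback for a "migrate" that cannot be done online (stop first, then move) is an assumption, not something stated above.

```python
def plan_move(kind, is_vm, running, request_state):
    """Return the ordered steps for moving one HA service to another node.

    kind:          "migrate" or "relocate"
    is_vm:         True for a VM, False for a container
    running:       whether the service is currently running
    request_state: desired state after the move, e.g. "started" or "stopped"
    """
    if kind not in ("migrate", "relocate"):
        raise ValueError("kind must be 'migrate' or 'relocate'")

    # Migration: online if possible, i.e. a VM that is already running.
    if kind == "migrate" and is_vm and running:
        return ["online-migrate"]

    # Relocation (and, as an assumption, any migrate that cannot be done
    # online): stop first if needed, move, then start again only if the
    # request state asks for it.
    steps = []
    if running:
        steps.append("stop")
    steps.append("offline-move")
    if request_state == "started":
        steps.append("start")
    return steps


print(plan_move("migrate", True, True, "started"))
print(plan_move("relocate", True, True, "started"))
```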

>>> 2) Currently I run a cluster of four nodes that share the same 2U chassis:
>>>
>>> +-----+-----+
>>> |  A  |  B  |
>>> +-----+-----+
>>> |  C  |  D  |
>>> +-----+-----+
>>>
>>> (Please don't comment on whether this setup is ideal – I'm aware of the risks a single chassis brings…)
>> As long as nodes share continents you're never safe anyway :-P
> True, but impossible to implement for approx. 99.999999% of all PVE users. And latencies will be a nightmare then, esp. with Ceph :D

Haha, yeah, that would be quite a nightmare unless you have your own
undersea cable connection :D

>>> I created several HA groups:
>>>
>>> - left  contains A & C
>>> - right contains B & D
>>> - upper contains A & B
>>> - lower contains C & D
>>> - all   contains all nodes
>>>
>>> and configured VMs to run inside one of the groups.
>>>
>>> For updates I usually follow the following steps:
>>> - migrate VMs from node via "bulk migrate" feature, selecting one of the other nodes
>>> - when no more VMs run, do a "apt-get dist-upgrade" and reboot
>>> - repeat till all nodes are up-to-date
>>>
>>> One issue I ran into with this procedure is that sometimes while a VM is still migrated to another host, already migrated VMs are
>>> migrated back onto the current node because the target that was selected for "bulk migrate" was not inside the same group as the
>>> current host.
>> This is expected, you told the ha-manager that a service should or can not run there,
>> thus it tried to bring it in an "OK" state again.
> Yes, I was aware of the reasons why the VM was moved back, though it would make more sense to move it to another node in the same
> (allowed) group for the maintenance case I'm describing here.
>
>>> Practical example:
>>> - VM 101 is configured to run on the left side of the cluster
>>> - VM 102 is configured to run on the lower level of the cluster
>>> - node C shall be updated
>>> - I select "bulk migrate" to node D
>>> - VM 101 is migrated to D
>>> - VM 102 is migrated to D, but takes some time (a lot of RAM)
>>> - HA recognizes that VM 101 is not running in the correct group and schedules a migration back to node C
>>> - migration of VM 102 finishes and migration of VM 101 back to node C immediately starts
>>> - once migration of VM 101 has finished I manually need to initiate another migration (and after that need to be faster than HA to
>>> do a reboot)
>>>
>>>
>>> Would it be possible to implement another "bulk action" that will evacuate a host in a way that for every VM, the appropriate
>>> target node is selected, depending on HA group configuration? This might also temporarily disable that node in HA management for
>>> e.g. 10min or until next reboot so that maintenance work can be done…
>>> What do you think of that idea?
>>>
>> Quasi, a maintenance mode? I'm not opposed to it, but if such a thing would be done
>> it would be only a light wrapper around already existing functionality.
> Absolutely. Just another action that would evacuate the current host as optimally as possible. All VMs that are constrained to a
> specific node group should be migrated within that group, all other VMs should be migrated to any node available (possibly doing
> some load balancing inside the cluster).

I'll look into this again. If I get an idea of how to incorporate it
without breaking edge cases, I can give it a shot.
No promises yet, though, sorry :)
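
Just to pin down what such an "evacuate" action would have to compute, here is a rough sketch using the groups from the example above. Nothing like this exists in the HA stack as of this mail; `evacuation_plan`, its arguments and the tie-breaking by VM count are assumptions made for illustration.

```python
# HA groups as described in the example (group name -> member nodes).
GROUPS = {
    "left": {"A", "C"},
    "right": {"B", "D"},
    "upper": {"A", "B"},
    "lower": {"C", "D"},
    "all": {"A", "B", "C", "D"},
}


def evacuation_plan(node, vm_groups, placement, groups=GROUPS):
    """Pick a target for every VM on `node`, staying inside its HA group.

    vm_groups: vmid -> group name
    placement: vmid -> node the VM currently runs on
    Returns vmid -> target node (None if the group has no other member).
    """
    # Current number of VMs per node, used for simple balancing.
    load = {}
    for n in placement.values():
        load[n] = load.get(n, 0) + 1

    plan = {}
    for vmid in sorted(v for v, n in placement.items() if n == node):
        candidates = sorted(groups[vm_groups[vmid]] - {node})
        if not candidates:
            plan[vmid] = None
            continue
        # Least-loaded allowed node wins; the node name breaks ties.
        target = min(candidates, key=lambda n: (load.get(n, 0), n))
        plan[vmid] = target
        load[target] = load.get(target, 0) + 1
    return plan


# The practical example: VM 101 is "left", VM 102 is "lower", node C is evacuated.
plan = evacuation_plan("C", {101: "left", 102: "lower"}, {101: "C", 102: "C"})
print(plan)  # {101: 'A', 102: 'D'}
```

With this, VM 101 would go to A rather than D, so the HA manager would have no reason to move it back.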

>> Can I ask what the reason for your group setup is?
>> I assume that all VMs may run on all nodes, but you want to "pin" some VMs to specific nodes for load reasons?
> We started to build a cluster out of just one chassis with four nodes. In the next few weeks I will add additional nodes that
> will possibly be located in another building. Those nodes will be grouped similarly and there will be additional groups that include
> subsets of nodes from each building.
>
> The reason behind my group setup is that I have two projects which have several services that are running on two VMs each (for
> redundancy and load balancing, e.g. LDAP). A configuration where one LDAP is running "left" and the other is running "right"
> eliminates the risk that both VMs run on the same node (and have a disruption of service if that particular node fails).
> So for the first project I distribute all important VMs between "left" and "right" and the other project's important VMs are
> distributed between "upper" and "lower". This ensures that for both projects, important services are not interrupted if *one* node
> fails.
> All less-important VMs are allowed to run on all nodes.
>
> If there are valid concerns against this reasoning, I'm open to suggestions for improvement.

Sounds OK. I have to think about whether I can propose a better-fitting
solution with regard to our HA stack.
One idea was to add simple dependencies, i.e. this group/service should
not run on the same node as that other group/service. Not sure whether this
is too specialized or whether more people would profit from it...

>> If this is the case I'd suggest changing the group configuration.
>> I.e. each node gets a group, A, B, C and D. Each group has the respective node with priority 2 and all others with priority 1.
>> When doing a system upgrade on node A you would edit group A and set node A's priority to 0,
>> now all should migrate away from this node, trying to balance the service count over all nodes.
>> You do not need to trigger a bulk action, at least for the HA managed VMs.
>>
>> After everything has migrated, execute the upgrade and reboot.
>> Then reconfigure group A so that node A again has the highest priority,
>> i.e. 2, and the respective services migrate back to it.
>>
>> This should be quite fast to do after the initial setup; you just need to open the group configuration
>> dialog and lower/raise the priority of one node.
>>
>> You could also use a similar procedure with your current group configuration.
>> The main thing that changes is that you need to edit two groups to free up a node.
>> The advantage of my method would be that the services get distributed across all other nodes, not just moved to a single one.
> Interesting idea. Didn't have a look at priorities yet.
>
> Request for improvement: In "datacenter -> HA -> groups" show the configured priority, e.g. in a format
> "nodename(priority)[,nodename(priority)]"

Hmm, this should already be the case, unless the default priority is set.
I added this when I reworked the HA group editor sometime in 4.3.
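
Coming back to the per-node groups with priorities quoted above, this is roughly how the priorities steer placement. It is a simplified model and not the actual ha-manager selection code; `select_node` and the service counts are invented for the example.

```python
def select_node(group, online, service_count):
    """Pick a node for a service from a prioritized HA group.

    group:         node -> priority (a higher number is preferred)
    online:        set of nodes that are currently usable
    service_count: node -> number of services already running there
    """
    candidates = {n: p for n, p in group.items() if n in online}
    if not candidates:
        return None
    # Only nodes in the highest available priority class are considered.
    top = max(candidates.values())
    best = [n for n, p in candidates.items() if p == top]
    # Among those, prefer the node with the fewest services.
    return min(best, key=lambda n: (service_count.get(n, 0), n))


online = {"A", "B", "C", "D"}
counts = {"A": 4, "B": 3, "C": 1, "D": 2}

# Normal operation: node A has priority 2 in group A, so services stay on A.
print(select_node({"A": 2, "B": 1, "C": 1, "D": 1}, online, counts))  # A

# Maintenance: node A lowered to 0, services spread over the other nodes.
print(select_node({"A": 0, "B": 1, "C": 1, "D": 1}, online, counts))  # C
```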

cheers,
Thomas