[pve-devel] [RFC ha-manager v2 5/7] allow LRM lock stealing for fenced nodes

Mon Mar 14 16:11:01 CET 2016

On 03/12/2016 01:39 PM, Dietmar Maurer wrote:
>> We are only allowed to recover (=steal) a service when we have its
>> LRMs lock, as this guarantees us that even if said LRM comes up
>> again during the steal operation the LRM cannot start the services
>> when the service config still belongs to it for a short time.
>>
>> This is important, else we have a possible race for the resource
>> which can result in a service started on the old (restarted) node
>> and the node where the service was recovered too, which is really
>> bad!
> I don't really understand that. Wouldn't it be safer to simply wait
> for the LRM lock after fencing?

I thought about the stealing and its safety and came to the conclusion
that it can be made safe (and it is safe in this patch).

If we use the following process for HW fencing:

mark node to fence
        |
        v
start fence agent
        |
        v
agent returns success
        |
        v
remove lock from fenced node and acquire it
        |
        v
mark node to unknown
        |
        v
-> only here can the fenced node try to acquire its lock again, as can
be seen from the code snippet from the LRM:

 >    [..]
 >    if ($state eq 'wait_for_agent_lock') {
 >
 >        my $service_count = $self->active_service_count();
 >
 >        if (!$fence_request && $service_count && $haenv->quorate()) {
 >            if ($self->get_protected_ha_agent_lock()) {
 >                $self->set_local_status({ state => 'active' });
 >            }
 >        }
 >    [..]

Before that point the node was either "dead" or knew that it was being
fenced, and thus started no services.

We only have to make sure that we take its lock before we change its
state from 'fence' to 'unknown'.

The most important part is that we acquire the lock; stealing it is not
necessary, but safe.
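
A rough sketch of that ordering, not the actual ha-manager code: the
%lock_owner and %node_state hashes and the fence_and_recover() helper are
made up for illustration and only model the steps from the diagram above
in memory.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the cluster-wide LRM lock and the
# node status kept by the manager.
my %lock_owner = (node1 => 'node1');   # LRM lock of node1, held by node1
my %node_state = (node1 => 'online');
my @trace;                             # records the order of the steps

sub fence_and_recover {
    my ($node, $manager, $agent) = @_;

    $node_state{$node} = 'fence';      # mark node to fence
    push @trace, 'state=fence';

    # start fence agent; if it fails the node stays in 'fence'
    return 0 if !$agent->($node);
    push @trace, 'agent-ok';

    $lock_owner{$node} = $manager;     # remove lock from fenced node and acquire it
    push @trace, 'lock-taken';

    $node_state{$node} = 'unknown';    # only now may the node try for its lock again
    push @trace, 'state=unknown';

    return 1;
}

my $ok = fence_and_recover('node1', 'manager', sub { 1 });
print join(' -> ', @trace), "\n";
```

The point is only that 'lock-taken' always comes before 'state=unknown',
and that a failing fence agent leaves both the lock and the 'fence' state
untouched.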

To summarize the possible states:
* the node is fenced and stays down until someone comes and checks it
(fencing through switch, power, ...) - here we can do whatever we want
with the lock
* the node comes back immediately (reset) because someone thought this
was a good way to set up the fence agents (it really isn't) and is, by
some miracle, fully functional. Here it sees that it is in the 'fence'
state, so it doesn't even try to get the lock or start anything
(thus lock stealing and timeout are both fine here). Or the lock
stealing/timeout (both have the "same" effect in this context) already
happened and the services are recovered.
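
To make the second case concrete, here is a small model of the LRM check
quoted above. It is only a sketch: lrm_may_activate() and its named
arguments are hypothetical, and 'lock_free' stands in for whether
get_protected_ha_agent_lock() would succeed.

```perl
use strict;
use warnings;

# Hypothetical model of the 'wait_for_agent_lock' branch: the node only
# becomes active if no fence request is pending, it has active services,
# it is quorate, and it can get its agent lock.
sub lrm_may_activate {
    my (%s) = @_;

    return 0 if $s{fence_request};
    return 0 if !$s{service_count};
    return 0 if !$s{quorate};

    return $s{lock_free} ? 1 : 0;
}

# The node came back while it is still marked 'fence': it does not even
# try to get the lock.
my $during_fence = lrm_may_activate(
    fence_request => 1, service_count => 2, quorate => 1, lock_free => 1);

# The node is already 'unknown' and the service config still points to it
# for a short time, but the manager holds the lock.
my $after_steal = lrm_may_activate(
    fence_request => 0, service_count => 2, quorate => 1, lock_free => 0);

print "during fence: $during_fence, after steal: $after_steal\n";
```

In both situations the restarted node stays out of the 'active' state,
which is all the recovery needs.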

Any thoughts? Is my reasoning faulty somewhere?

I do not want to push the lock stealing by any means, but the more I
think about it the more it seems to be okay, and if it is (as it seems)
then I'd do it.