[pve-devel] [RFC ha-manager 3/3] always fence nodes on dead LRM

Fri Apr 29 11:04:59 CEST 2016

On 04/29/2016 08:30 AM, Dietmar Maurer wrote:
>
>> On April 26, 2016 at 10:55 AM Thomas Lamprecht <t.lamprecht at proxmox.com>
>> wrote:
>>
>>
>> fixes a recovery failure if a node starts up with a dead/broken LRM
>> but working corosync.
>>
>> So while it's quorate it doesn't do anything, but the CRM won't fence
>> it as our "last_online" timestamp only checks if quorate, not if the
>> HA manager is actually working.
>>
>> Can be reproduced by having an active node with services, simply
>> disable the lrm:
>> $ systemctl disable pve-ha-lrm
>> and then reboot.
>> (this would simulate a broken update/reboot)
>> So the node gets up again and gains quorum but the LRM does not
>> start and thus no service gets started/migrated/... fencing is
>> appropriate for such a situation.
> Why? And how does it solve the problem? Seems to end in an endless
> reboot cycle?

No, there won't be any endless reboot cycles; the CRM does not control
reboots for any LRM.
Also, if the LRM comes online again, it will be marked as online as soon
as it updates its timestamp again, so almost instantly.

This is a check for the case where a node's corosync is active and
working but its LRM is not.
I fix it by checking not only the quorate state of the node but
also when its LRM last updated its timestamp.
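
As a rough illustration of the check described above (not the code from the
patch, which lives in ha-manager itself), a minimal Python sketch could look
like this. The names NodeView, node_is_online and LRM_TIMEOUT, as well as the
120 second threshold, are assumptions made up for this example:

```python
from dataclasses import dataclass

# Hypothetical threshold: how stale an LRM timestamp may get before the
# node is treated as having a dead LRM. Not a value taken from the patch.
LRM_TIMEOUT = 120.0  # seconds


@dataclass
class NodeView:
    """What the CRM is assumed to know about one node (illustrative only)."""
    quorate: bool          # node is part of the quorate partition
    lrm_timestamp: float   # last time the node's LRM updated its status


def node_is_online(node: NodeView, now: float,
                   lrm_timeout: float = LRM_TIMEOUT) -> bool:
    """Treat a node as online only if it is quorate AND its LRM is alive.

    Checking quorum alone misses the case of a working corosync with a
    dead or disabled LRM, so the age of the LRM timestamp is checked too.
    """
    if not node.quorate:
        return False
    return (now - node.lrm_timestamp) <= lrm_timeout


if __name__ == "__main__":
    now = 1000.0
    healthy = NodeView(quorate=True, lrm_timestamp=now - 10)
    dead_lrm = NodeView(quorate=True, lrm_timestamp=now - 600)
    print(node_is_online(healthy, now))   # quorate, fresh LRM timestamp
    print(node_is_online(dead_lrm, now))  # quorate, stale LRM timestamp
```

With a check like this, a quorate node whose LRM stopped updating its
timestamp no longer counts as online and becomes a fence candidate, while a
node whose LRM comes back counts as online again on its next timestamp update.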

Otherwise you get a healthy-looking cluster and no fence action when the
node is online but its LRM is not (as in broken, dead, ...), which defeats
the purpose of HA, as such cases can be detected and should be recovered, imo.