[pve-devel] [RFC ha-manager 3/3] always fence nodes on dead LRM
t.lamprecht at proxmox.com
Fri Apr 29 11:40:57 CEST 2016
On 04/29/2016 11:30 AM, Dietmar Maurer wrote:
>> On April 29, 2016 at 11:04 AM Thomas Lamprecht <t.lamprecht at proxmox.com>
>> On 04/29/2016 08:30 AM, Dietmar Maurer wrote:
>>>> On April 26, 2016 at 10:55 AM Thomas Lamprecht <t.lamprecht at proxmox.com>
>>>> fixes a recovery failure if a node starts up with a dead/broken LRM
>>>> but working corosync.
>>>> So while it's quorate it doesn't do anything, but the CRM won't fence
>>>> it, as our "last_online" timestamp only checks if quorate, not if the
>>>> HA manager is actually working.
>>>> Can be reproduced by taking an active node with services and simply
>>>> disabling the LRM:
>>>> $ systemctl disable pve-ha-lrm
>>>> and then reboot.
>>>> (this would simulate a broken update/reboot)
>>>> So the node comes up again and gains quorum, but the LRM does not
>>>> start and thus no service gets started/migrated/... Fencing is
>>>> appropriate for such a situation.
>>> Why? And how does it solve the problem? Seems to end in an endless
>>> reboot cycle?
>> No, there won't be any endless reboot cycles, the CRM does not control
>> reboots for any LRM.
>> Also, if the LRM comes online again, it will be marked as online as soon
>> as it updates its timestamp again, thus almost instantly.
> But the service will not come online, because you have disabled it. A reboot
> does not help here!
But it allows fencing to be triggered, as the node went from online to unknown;
previously the node would be marked as online forever and nothing
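
To make the difference between the two checks concrete, here is a minimal
sketch in Python. It is not the actual ha-manager code (which is Perl); the
NodeStatus record, the function names and the LRM_TIMEOUT value of 120
seconds are all assumptions made up for illustration. It only shows the idea
discussed in this thread: a node counts as online only if it is quorate and
its LRM has updated its timestamp recently.

```python
from dataclasses import dataclass

# Hypothetical threshold in seconds; the real value used by ha-manager
# is not stated in this thread.
LRM_TIMEOUT = 120.0


@dataclass
class NodeStatus:
    """Hypothetical per-node view as the CRM might see it."""
    quorate: bool           # node is part of the quorate corosync partition
    lrm_last_update: float  # time of the last LRM timestamp update (seconds)


def node_state_quorum_only(node: NodeStatus) -> str:
    """Old behaviour as described above: only quorum is considered."""
    return "online" if node.quorate else "unknown"


def node_state_with_lrm_check(node: NodeStatus, now: float) -> str:
    """Proposed behaviour: the LRM must also have updated its timestamp recently."""
    lrm_alive = (now - node.lrm_last_update) <= LRM_TIMEOUT
    return "online" if (node.quorate and lrm_alive) else "unknown"


if __name__ == "__main__":
    # Node rebooted with pve-ha-lrm disabled: quorate, but the LRM
    # timestamp is stale.
    broken = NodeStatus(quorate=True, lrm_last_update=0.0)
    print(node_state_quorum_only(broken))                   # online
    print(node_state_with_lrm_check(broken, now=1000.0))    # unknown
```

With the quorum-only check the broken node stays "online" forever; with the
additional LRM check it drops to "unknown", which is the state change the CRM
can react to.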