[pve-devel] [RFC ha-manager 3/3] always fence nodes on dead LRM

Thomas Lamprecht t.lamprecht at proxmox.com
Fri Apr 29 11:43:03 CEST 2016

On 04/29/2016 11:30 AM, Dietmar Maurer wrote:
>> On April 29, 2016 at 11:04 AM Thomas Lamprecht <t.lamprecht at proxmox.com>
>> wrote:
>> On 04/29/2016 08:30 AM, Dietmar Maurer wrote:
>>>> On April 26, 2016 at 10:55 AM Thomas Lamprecht <t.lamprecht at proxmox.com>
>>>> wrote:
>>>> Fixes a recovery failure if a node starts up with a dead/broken LRM
>>>> but working corosync.
>>>> While the node is quorate it doesn't do anything, but the CRM won't
>>>> fence it, as our "last_online" timestamp only checks if the node is
>>>> quorate, not if the HA manager is actually working.
>>>> This can be reproduced by taking an active node with services and
>>>> simply disabling the LRM:
>>>> $ systemctl disable pve-ha-lrm
>>>> and then reboot.
>>>> (This simulates a broken update/reboot.)
>>>> The node then comes up again and gains quorum, but the LRM does not
>>>> start and thus no service gets started/migrated/... Fencing is
>>>> appropriate for such a situation.
>>> Why? And how does it solve the problem? Seems to end in an endless
>>> reboot cycle?
>> No, there won't be any endless reboot cycles; the CRM does not control
>> reboots for any LRM.
>> Also, if the LRM comes online again it will be marked as online as soon
>> as it updates its timestamp again, thus almost instantly.
> But the service will not come online, because you have disabled it. A reboot
> does not help here!

A reboot does not happen here either, at least no more often with this
change than without it! No watchdog is active, so nothing triggers a
reboot of the node, as its LRM is dead!
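
To illustrate the difference between the two checks, here is a rough
sketch. It is Python rather than the Perl used in ha-manager, and every
name and the timeout value are hypothetical, chosen only for this example;
it is not the code from the patch.

```python
import time

# Hypothetical threshold in seconds; not taken from ha-manager.
LRM_TIMEOUT = 120


def node_ok_quorum_only(quorate):
    """Check as described for the current behaviour: quorum is enough."""
    return quorate


def node_ok_with_lrm(quorate, lrm_last_seen, now=None, timeout=LRM_TIMEOUT):
    """Assumed stricter check: quorate AND the LRM updated its timestamp recently.

    lrm_last_seen is a Unix timestamp, or None if the LRM never reported.
    """
    if now is None:
        now = time.time()
    lrm_alive = lrm_last_seen is not None and (now - lrm_last_seen) <= timeout
    return quorate and lrm_alive
```

With the quorum-only check, a quorate node whose LRM never started still
counts as fine, so its services are never recovered. With the stricter
check the same node is no longer considered fine, and it counts as fine
again as soon as the LRM writes a fresh timestamp.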

