[pve-devel] [RFC ha-manager 3/3] always fence nodes on dead LRM

Dietmar Maurer dietmar at proxmox.com
Fri Apr 29 11:30:39 CEST 2016



> On April 29, 2016 at 11:04 AM Thomas Lamprecht <t.lamprecht at proxmox.com>
> wrote:
> 
> 
> 
> 
> On 04/29/2016 08:30 AM, Dietmar Maurer wrote:
> >
> >> On April 26, 2016 at 10:55 AM Thomas Lamprecht <t.lamprecht at proxmox.com>
> >> wrote:
> >>
> >>
> >> fixes a recovery failure if a node starts up with a dead/broken LRM
> >> but working corosync.
> >>
> >> So while it's quorate it doesn't do anything, but the CRM won't fence
> >> it, as our "last_online" timestamp only checks if quorate, not if the
> >> HA manager is actually working.
> >>
> >> Can be reproduced by having an active node with services; simply
> >> disable the LRM:
> >> $ systemctl disable pve-ha-lrm
> >> and then reboot.
> >> (this would simulate a broken update/reboot)
> >> So the node gets up again and gains quorum but the LRM does not
> >> start and thus no service gets started/migrated/... fencing is
> >> appropriate for such a situation.
> > Why? And how does it solve the problem? Seems to end in an endless
> > reboot cycle?
> 
> No, there won't be any endless reboot cycles, the CRM does not control
> reboots for any LRM.
> Also, if the LRM comes online again it will be marked as online as soon as
> it updates its timestamp again, thus almost instantly.

But the service will not come online, because you have disabled it. A reboot
does not help here!
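
To make the disagreement concrete, here is a minimal sketch of the two fencing decisions under discussion. It is not the ha-manager code: the names (NodeState, lrm_last_seen, simulate_reboot) and the 120 second timeout are assumptions made up for illustration. It only models the argument that a quorum-only check never fences a node with a dead LRM, and that a reboot after fencing leaves the LRM dead if the service stays disabled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NodeState:
    quorate: bool
    # Hypothetical field: epoch seconds of the last LRM status update.
    lrm_last_seen: float


def needs_fence_quorum_only(node: NodeState) -> bool:
    """Decision that only looks at quorum (the behaviour described in the patch mail)."""
    return not node.quorate


def needs_fence_with_lrm_check(node: NodeState, now: float,
                               lrm_timeout: float = 120.0) -> bool:
    """Decision that also treats a stale LRM timestamp as a reason to fence.

    lrm_timeout is an assumed value, not taken from ha-manager.
    """
    if not node.quorate:
        return True
    return (now - node.lrm_last_seen) > lrm_timeout


def simulate_reboot(node: NodeState, now: float, lrm_enabled: bool) -> NodeState:
    """After a reboot the node regains quorum; the LRM timestamp is only
    refreshed if the LRM service actually starts."""
    if lrm_enabled:
        return replace(node, quorate=True, lrm_last_seen=now)
    return replace(node, quorate=True)


if __name__ == "__main__":
    broken = NodeState(quorate=True, lrm_last_seen=0.0)
    print(needs_fence_quorum_only(broken))              # quorum-only check
    print(needs_fence_with_lrm_check(broken, 1000.0))   # check including the LRM
    rebooted = simulate_reboot(broken, 1000.0, lrm_enabled=False)
    print(needs_fence_with_lrm_check(rebooted, 1010.0))
```

In this model, with the LRM still disabled, the rebooted node is again a fencing candidate. The LRM-aware check only clears once the LRM service starts and refreshes its timestamp.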


