[PVE-User] pve-firewall, clustering and HA gone bad

Tue Jun 25 09:44:17 CEST 2019

On 6/25/19 9:10 AM, Mark Schouten wrote:
> On Thu, Jun 13, 2019 at 12:34:28PM +0200, Thomas Lamprecht wrote:
>>> 2: ha-manager should not be able to start the VM's when they are running
>>> elsewhere
>>
>> This can only happen if fencing fails, and that fencing works is always
>> a base assumption we must take (as else no HA is possible at all).
>> So it would be interesting why fencing did not work here (see below
>> for the reason I could not determine that yet as I did not have your logs
>> at hand)
> 
> Reading the emails from that specific night, I saw this message:
> 
>  The node 'proxmox01' failed and needs manual intervention.
> 
>  The PVE HA manager tries to fence it and recover the
>  configured HA resources to a healthy node if possible.
> 
>  Current fence status: SUCCEED
>  fencing: acknowledged - got agent lock for node 'proxmox01'
> 
> This seems to suggest that the cluster is confident that the fencing
> succeeded. How does it determine that?
> 

It got the other node's LRM agent lock through pmxcfs.

Normal LRM cycle is

0. startup
1. (re-)acquire agent lock, if OK go to 2, else to 4
2. do work (start, stop, migrate resources)
3. go to 1
4. no lock: if we had the lock once, we stop watchdog updates, stop doing
   anything, and wait for either quorum to return (<60s) or the watchdog to
   trigger (>=60s);
   if we never had the lock, we just poll for it continuously
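
As a rough illustration only (this is not the actual pve-ha-manager code,
which is written in Perl), the cycle above can be sketched as a small state
machine. The state names, the function lrm_step and the parameter
secs_without_lock are made up for this sketch; only the 60s watchdog window
is taken from the description above.

```python
from enum import Enum


class LrmState(Enum):
    WAIT_FOR_LOCK = "wait_for_lock"  # never held the lock: keep polling
    ACTIVE = "active"                # holds the lock: watchdog updated, does work
    LOST_LOCK = "lost_lock"          # held the lock once: no watchdog updates, idle
    FENCED = "fenced"                # watchdog expired, node got reset


WATCHDOG_TIMEOUT = 60  # seconds, from the description above


def lrm_step(state, got_lock, secs_without_lock=0):
    """Return the next state after one pass through the LRM loop (hypothetical)."""
    if state is LrmState.FENCED:
        return state
    # The watchdog is only armed once the lock was held at least once.
    if state is not LrmState.WAIT_FOR_LOCK and secs_without_lock >= WATCHDOG_TIMEOUT:
        return LrmState.FENCED
    if got_lock:
        return LrmState.ACTIVE           # steps 1-3: lock OK, do work, loop
    if state is LrmState.WAIT_FOR_LOCK:
        return LrmState.WAIT_FOR_LOCK    # step 4, never had the lock: poll
    return LrmState.LOST_LOCK            # step 4, had the lock: stop and wait
```

For example, lrm_step(LrmState.ACTIVE, got_lock=False, secs_without_lock=5)
gives LOST_LOCK, and staying without the lock until secs_without_lock reaches
60 gives FENCED.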

A lock can be held by only one node at a time. If the CRM sees a node offline
for >120 seconds (IIRC), it tries to acquire that node's lock. Once it has the
lock, it knows that the HA stack on the other side cannot start any actions
anymore - and if your "unfreeze before watchdog enable" had not happened, the
node would have been fenced by the watchdog.
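
To make the "fencing is acknowledged by getting the lock" idea concrete, here
is a minimal sketch in Python. It is not how pmxcfs or the CRM is implemented;
the class AgentLock, the function crm_try_fence, the holder names and the
assumption that a lock becomes acquirable once its holder has not renewed it
for `lifetime` seconds are all invented for illustration. The 120s offline
threshold is the "IIRC" value from above.

```python
OFFLINE_BEFORE_FENCE = 120  # seconds a node must be offline (IIRC value from above)


class AgentLock:
    """Toy lock: one holder at a time, acquirable by others once it expires."""

    def __init__(self, lifetime):
        self.lifetime = lifetime   # assumed expiry if the holder stops renewing
        self.holder = None
        self.renewed_at = None

    def try_acquire(self, who, now):
        expired = (self.holder is not None
                   and now - self.renewed_at >= self.lifetime)
        if self.holder is None or self.holder == who or expired:
            self.holder, self.renewed_at = who, now
            return True
        return False


def crm_try_fence(lock, offline_since, now):
    """Return True once the CRM holds the offline node's agent lock."""
    if now - offline_since <= OFFLINE_BEFORE_FENCE:
        return False               # node not offline long enough yet
    return lock.try_acquire("crm", now)
```

Once crm_try_fence() returns True, a later try_acquire() by the old LRM fails,
which is the property the recovery relies on: the other side can no longer
start any actions.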

As said, the lock and recovery action itself was not the direct root cause.
The most I could take from the logs you sent was:
> ...
> So, the "unfreeze before the respective LRM got active+online with watchdog"
> seems the cause of the real wrong behavior here in your log, it allows the
> recovery to happen, as else frozen services would not have been recovered
> (that mechanism exactly exists to avoid such issues during an upgrade, where
> one does not want to stop or migrate all HA VM/CTs)

And as also said (see quote below), for more specific hints I need the raw
logs, unmerged and as untouched as possible.

On 6/13/19 6:29 PM, Thomas Lamprecht wrote:
> While you interpolated the different logs into a single time-line it does not
> seem to match everywhere, for my better understanding could you please send me:
> 
> * corosync.conf
> * the journal or syslog of proxmox01 and proxmox03 around "Jun 12 01:38:16"
>   plus/minus ~ 5 minutes, please in separate files, no interpolation and as
>   unredacted as possible
> * information if you have a HW watchdog or use the Linux soft-dog
> 
> that would be appreciated.