[PVE-User] pve-firewall, clustering and HA gone bad

Thu Jun 13 18:29:41 CEST 2019

On 6/13/19 1:30 PM, Mark Schouten wrote:
> On Thu, Jun 13, 2019 at 12:34:28PM +0200, Thomas Lamprecht wrote:
>> Hi,
>> Do the ringX_addr entries in your corosync.conf use the hostnames or the
>> resolved addresses? For nodes added on newer PVE (at least 5.1, IIRC) we try
>> to resolve the node name and use the resolved address, exactly to avoid
>> such issues. If yours does not use the addresses, I recommend changing that
>> instead of the "all nodes in all /etc/hosts" approach.
> 
> It has the hostnames. It's a cluster upgraded from 4.2 up to current.

OK, I suggest that you change that to the resolved IPs and add a "name"
property, if not already there (at the moment I'm not too sure when I added
the "name" by default to the config, it was sometime in a 4.x release).
IOW, the config's "nodelist" section should look something like:

...
nodelist {
  node {
    name: prod1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.30.75
  }
  node {
    name: prod2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.30.76
  }
  ...
}

As said in the previous reply, that should avoid most issues of this kind,
and avoid the need for the /etc/hosts entries on all hosts.
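
To check an existing cluster for this, something like the following could
flag every ringX_addr that is still a hostname. It is only a rough sketch: the
function names are made up for illustration, and the parser assumes flat
node { ... } blocks with one "key: value" per line, as in the excerpt above.
It is not a full corosync.conf parser.

```python
import ipaddress
import re


def ring_addrs_by_node(conf_text):
    """Map node name -> {ringX_addr key: value} for each node block.

    Simplified: assumes flat 'node { ... }' blocks, one 'key: value' per line.
    """
    nodes = {}
    for block in re.findall(r"node\s*\{([^{}]*)\}", conf_text):
        fields = {}
        for line in block.splitlines():
            if ":" not in line:
                continue  # skips blank lines and '...' placeholders
            key, value = line.split(":", 1)  # split once, IPv6 values keep their colons
            fields[key.strip()] = value.strip()
        # Fall back to the nodeid if the optional "name" property is missing.
        name = fields.get("name", "nodeid " + fields.get("nodeid", "?"))
        nodes[name] = {
            k: v for k, v in fields.items() if re.fullmatch(r"ring\d+_addr", k)
        }
    return nodes


def is_ip_literal(value):
    """True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def hostname_ring_addrs(conf_text):
    """List (node, key, value) for every ring address that is not an IP literal."""
    return [
        (node, key, value)
        for node, addrs in ring_addrs_by_node(conf_text).items()
        for key, value in addrs.items()
        if not is_ip_literal(value)
    ]
```

Feeding it the text of /etc/pve/corosync.conf and getting an empty list back
would mean all ring addresses are already plain IPs.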

> 
>>> It seems that pve-firewall tries to detect localnet, but fails to do so
>>> correctly. localnet should be 192.168.1.0/24, but instead it detected the
>>> IPv6 addresses. Which isn't entirely incorrect, but IPv6 is not used for
>>> clustering, so I should open IPv4 in the firewall, not IPv6. So it seems
>>> like name resolution is used to define localnet, and not what corosync is
>>> actually using.
>>
>> From a quick look at the code: That seems true and is definitely the
>> wrong behavior :/
> 
> Ok, I'll file a bug for that.

Thanks!

>>> 2: ha-manager should not be able to start the VM's when they are running
>>> elsewhere
>>
>> This can only happen if fencing fails, and that fencing works is always
>> a base assumption we must make (as else no HA is possible at all).
>> So it would be interesting to know why fencing did not work here (see below
>> for the reason I could not determine that yet: I did not have your logs
>> at hand)
> 
> We must indeed make assumptions. Are there ways we can assume better? :)

Hmm, hard, as fencing must work. And it normally does, provided that
* the fence device works (in this case the watchdog)
* no manual tinkering with HA was involved (no finger pointing, really, but
  while we try to fend off some manual changes, one can get the pve-ha-*
  services into certain states where the VM is running but the watchdog got
  closed)
* no bug is involved (naturally), but the simulation & regression tests
  should cover against that in principle.

If those hold, it will work. Closer analysis of your incident will hopefully
show which case it was, and whether there could be enhancements against it.

>> The list trims attachments, could you please send them directly to my
>> address? I'd really like to see those.
> 
> Attached again, so you should receive it now.
> 

OK, got the attachment now, thanks!

I think I got the relevant part below:

> Jun 12 01:37:56 proxmox01 pve-ha-lrm[3729778]: status change wait_for_agent_lock => active
> Jun 12 01:37:56 proxmox01 pve-ha-lrm[3729778]: successfully acquired lock 'ha_agent_proxmox01_lock'
> Jun 12 01:37:56 proxmox01 pve-ha-lrm[3729778]: watchdog active

-> upgrade stuff on proxmox01, thus restart of pve-ha-lrm

> Jun 12 01:38:05 proxmox01 pve-ha-lrm[3729778]: received signal TERM
> Jun 12 01:38:05 proxmox01 pve-ha-lrm[3729778]: restart LRM, freeze all services
> Jun 12 01:38:14 proxmox03 pve-ha-crm[3084869]: service 'vm:100': state changed from 'started' to 'freeze'
> ...
> Jun 12 01:38:14 proxmox03 pve-ha-crm[3084869]: service 'vm:800': state changed from 'started' to 'freeze'

-> ... all got frozen (which is OK)

> Jun 12 01:38:16 proxmox01 pve-ha-lrm[3729778]: watchdog closed (disabled)
> Jun 12 01:38:18 proxmox01 pve-ha-lrm[3731520]: status change startup => wait_for_agent_lock

-> here, proxmox01 does not yet hold the LRM lock and is not yet active (!),
   but the current master (proxmox03) already unfreezes proxmox01's services:

> Jun 12 01:38:24 proxmox03 pve-ha-crm[3084869]: service 'vm:100': state changed from 'freeze' to 'started'
> ...
> Jun 12 01:38:24 proxmox03 pve-ha-crm[3084869]: service 'vm:800': state changed from 'freeze' to 'started'

(remember that for below: the fact that those services got unfrozen before (!)
the watchdog was active again looks _very_ worrisome to me. They really shouldn't be,
as freeze exists exactly to avoid issues during an upgrade/restart of HA without
stopping all services)

-> now quorum breaks because the firewall allows the IPv6 net rather than the IPv4 one; a bit later the HA cluster master logs it:

> Jun 12 01:38:36 proxmox03 pve-ha-crm[3084869]: node 'proxmox01': state changed from 'online' => 'unknown'

-> proxmox01's LRM gets to log only one more time:
> Jun 12 01:38:37 proxmox01 pve-ha-lrm[3731520]: unable to write lrm status file - closing file '/etc/pve/nodes/proxmox01/lrm_status.tmp.3731520' failed - Operation not permitted 

Now, in your merged logs I'm missing an initial
... node 'proxmox01': state changed from 'unknown' => 'fence'

I only found the somewhat strange sequence below:
> Jun 12 01:39:34 proxmox03 pve-ha-crm[3084869]: fencing: acknowledged - got agent lock for node 'proxmox01
> Jun 12 01:39:34 proxmox03 pve-ha-crm[3084869]: node 'proxmox01': state changed from 'fence' => 'unknown'
> Jun 12 01:39:34 proxmox03 pve-ha-crm[3084869]: node 'proxmox01': state changed from 'unknown' => 'fence'

So, the "unfreeze before the respective LRM got active+online with watchdog"
seems to be the cause of the actual wrong behavior in your log. It allows the
recovery to happen, as frozen services would otherwise not have been recovered
(that mechanism exists exactly to avoid such issues during an upgrade, where
one does not want to stop or migrate all HA VMs/CTs).
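
For spotting that pattern in a merged log, a rough sketch like the one below
could help. It only relies on the message texts quoted above ("watchdog
active", "watchdog closed", and the 'freeze' to 'started' state change). The
function name is made up, and it assumes that the lines are in chronological
order and that every unfreeze the CRM logs in that window concerns services of
the given node, which the log lines themselves do not state.

```python
import re

# Matches e.g. "Jun 12 01:38:16 proxmox01 pve-ha-lrm[3729778]: watchdog closed (disabled)",
# optionally prefixed by a mail quote marker "> ".
LINE_RE = re.compile(
    r"^(?:>\s*)?(\w{3}\s+\d+\s+[\d:]{8})\s+(\S+)\s+(pve-ha-[cl]rm)\[\d+\]:\s+(.*)$"
)
UNFREEZE_RE = re.compile(
    r"service '([^']+)': state changed from 'freeze' to 'started'"
)


def premature_unfreezes(lines, node):
    """Return (timestamp, service) for each unfreeze logged while the
    watchdog of `node`'s LRM is known to be closed.

    Assumption: all unfreeze lines refer to services located on `node`.
    """
    watchdog_active = None  # unknown until the first watchdog message
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp, host, daemon, msg = match.groups()
        if daemon == "pve-ha-lrm" and host == node:
            if msg.startswith("watchdog active"):
                watchdog_active = True
            elif msg.startswith("watchdog closed"):
                watchdog_active = False
        elif daemon == "pve-ha-crm" and watchdog_active is False:
            unfreeze = UNFREEZE_RE.search(msg)
            if unfreeze:
                hits.append((stamp, unfreeze.group(1)))
    return hits
```

Called with the merged log lines and node="proxmox01", any returned entry
marks a service that was unfrozen in the gap between "watchdog closed" and the
next "watchdog active".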

While you merged the different logs into a single timeline, it does not
seem to match everywhere. For my better understanding, could you please send me:

* corosync.conf
* the journal or syslog of proxmox01 and proxmox03 around "Jun 12 01:38:16"
  plus/minus ~ 5 minutes, please in separate files, not merged and as
  unredacted as possible
* information on whether you have a HW watchdog or use the Linux soft-dog

That would be appreciated. I'll try to take a good look at the code in
charge of that tomorrow.
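
In case it helps with cutting the requested window out of a plain syslog
file, here is a minimal sketch. The function names are made up, and it assumes
classic syslog lines that start with a 15-character "Mon DD HH:MM:SS" stamp
without a year (so the year is passed in, 2019 here) and English month names.

```python
from datetime import datetime, timedelta


def parse_stamp(stamp, year=2019):
    """Parse a syslog-style 'Jun 12 01:38:16' stamp; the year is an assumption."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def log_window(lines, center="Jun 12 01:38:16", minutes=5, year=2019):
    """Keep the lines whose timestamp lies within +/- `minutes` of `center`."""
    mid = parse_stamp(center, year)
    delta = timedelta(minutes=minutes)
    kept = []
    for line in lines:
        try:
            when = parse_stamp(line[:15], year)
        except ValueError:
            continue  # no leading timestamp, e.g. a continuation line
        if abs(when - mid) <= delta:
            kept.append(line)
    return kept
```

Running it once per node keeps the files separate, as asked for above.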

thanks,
Thomas