[PVE-User] pve-firewall, clustering and HA gone bad

Horace bugs+proxmoxlist at lhpmail.us
Thu Jun 13 15:29:15 CEST 2019


Should this stuff be in the 'help' documentation?

On 6/13/19 12:29 PM, Thomas Lamprecht wrote:
> On 6/13/19 1:30 PM, Mark Schouten wrote:
>> On Thu, Jun 13, 2019 at 12:34:28PM +0200, Thomas Lamprecht wrote:
>>> Hi,
>>> Do your ringX_addr entries in corosync.conf use the hostnames or the resolved
>>> addresses? For nodes added on newer PVE (at least 5.1, IIRC) we try to
>>> resolve the nodename and use the resolved address precisely to avoid
>>> such issues. If your config doesn't do that, I recommend changing it instead
>>> of adding all nodes to /etc/hosts on every host.
>> It has the hostnames. It's a cluster upgraded from 4.2 up to current.
> OK, I suggest that you change that to the resolved IPs and add a "name"
> property, if not already there (at the moment I'm not too sure when I made
> the "name" property part of the default config; it was sometime in a 4.x release).
> IOW, the config's "nodelist" section should look something like:
>
> ...
> nodelist {
>    node {
>      name: prod1
>      nodeid: 1
>      quorum_votes: 1
>      ring0_addr: 192.168.30.75
>    }
>    node {
>      name: prod2
>      nodeid: 2
>      quorum_votes: 1
>      ring0_addr: 192.168.30.76
>    }
>    ...
> }
>
> As said in the previous reply, that should avoid most issues of this kind,
> and avoid the need for the /etc/hosts entries on all hosts.
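>
> To double-check which address a node name actually resolves to before you
> put it into ring0_addr, something like the following should do (using the
> example node names from above; adjust to your real hostnames):
>
>   getent hosts prod1    # prints the address the name resolves to
>   getent hosts prod2
>
> And IIRC you also need to bump config_version when editing
> /etc/pve/corosync.conf, so corosync actually picks up the new config.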
>
>>>> It seems that pve-firewall tries to detect localnet, but fails to do so
>>>> correctly. localnet should be 192.168.1.0/24, but instead it detected the
>>>> IPv6 addresses. Which isn't entirely incorrect, but IPv6 is not used for
>>>> clustering, so I need IPv4 opened in the firewall, not IPv6. So it seems
>>>> like name resolution is used to determine localnet, and not what corosync is
>>>> actually using.
>>> From a quick look at the code: that seems true and is definitely the
>>> wrong behavior :/
>> Ok, I'll file a bug for that.
> Thanks!
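>
> For the record: you can check what the firewall detected as the local
> network with (if I remember the CLI right)
>
>   pve-firewall localnet
>
> which should make it easy to confirm whether it picked the IPv6 or the
> IPv4 corosync network on your nodes.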
>
>>>> 2: ha-manager should not be able to start the VMs when they are running
>>>> elsewhere
>>> This can only happen if fencing fails, and that fencing works is always
>>> a base assumption we must make (otherwise no HA is possible at all).
>>> So it would be interesting to see why fencing did not work here (see below
>>> for the reason I could not determine that yet, as I did not have your logs
>>> at hand).
>> We must indeed make assumptions. Are there ways we can assume better? :)
> Hmm, hard, as fencing must work. And normally it does, as long as:
> * the fence device works (in this case the watchdog)
> * no manual tinkering with HA was involved (no finger pointing, really, but
>    while we try to fend off some manual changes, one can get the pve-ha-*
>    services into states where the VM is running but the watchdog got closed)
> * there's no bug (naturally), but with the simulation & regression tests this
>    should be covered against in principle.
>
> Closer analysis of your incident will hopefully show what the case was here,
> and whether there could be enhancements against that.
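>
> (Side note, just a sketch from memory: to see which watchdog a node actually
> uses, you can check something like
>
>   cat /etc/default/pve-ha-manager    # WATCHDOG_MODULE set => HW watchdog configured
>   lsmod | grep -E 'softdog|ipmi_watchdog|iTCO_wdt'
>   journalctl -b -u watchdog-mux
>
> if nothing is configured there, the Linux soft-dog gets used.)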
>
>>> The list trims attachments; could you please send them directly to my
>>> address? I'd really like to see those.
>> Attached again, so you should receive it now.
>>
> OK got the attachment now, thanks!
>
> I think I got the relevant part below:
>
>> Jun 12 01:37:56 proxmox01 pve-ha-lrm[3729778]: status change wait_for_agent_lock => active
>> Jun 12 01:37:56 proxmox01 pve-ha-lrm[3729778]: successfully acquired lock 'ha_agent_proxmox01_lock'
>> Jun 12 01:37:56 proxmox01 pve-ha-lrm[3729778]: watchdog active
> -> packages got upgraded on proxmox01, thus restart of pve-ha-lrm
>
>> Jun 12 01:38:05 proxmox01 pve-ha-lrm[3729778]: received signal TERM
>> Jun 12 01:38:05 proxmox01 pve-ha-lrm[3729778]: restart LRM, freeze all services
>> Jun 12 01:38:14 proxmox03 pve-ha-crm[3084869]: service 'vm:100': state changed from 'started' to 'freeze'
>> ...
>> Jun 12 01:38:14 proxmox03 pve-ha-crm[3084869]: service 'vm:800': state changed from 'started' to 'freeze'
> -> ... all got frozen (which is OK)
>
>> Jun 12 01:38:16 proxmox01 pve-ha-lrm[3729778]: watchdog closed (disabled)
>> Jun 12 01:38:18 proxmox01 pve-ha-lrm[3731520]: status change startup => wait_for_agent_lock
> -> here, proxmox01 does not yet hold the LRM lock and is not yet active (!),
>     but the current master (proxmox03) already unfreezes proxmox01's services:
>
>> Jun 12 01:38:24 proxmox03 pve-ha-crm[3084869]: service 'vm:100': state changed from 'freeze' to 'started'
>> ...
>> Jun 12 01:38:24 proxmox03 pve-ha-crm[3084869]: service 'vm:800': state changed from 'freeze' to 'started'
> (remember this for below: the fact that those services got unfrozen before (!)
> the watchdog was active again is _very_ worrisome to me. They really shouldn't
> have been, as freeze exists exactly to avoid issues during an upgrade/restart of HA
> without stopping all services)
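>
> (If you want to verify this during a future upgrade: IIRC the per-service
> state, including 'freeze', is shown by
>
>   ha-manager status --verbose
>
> and the raw manager state lives in /etc/pve/ha/manager_status, so you can
> check that everything really is frozen before the LRM restarts.)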
>
> -> now quorum breaks, because the firewall allows the IPv6 net instead of the
>    IPv4 one; a bit later the HA cluster master logs it:
>
>> Jun 12 01:38:36 proxmox03 pve-ha-crm[3084869]: node 'proxmox01': state changed from 'online' => 'unknown'
> -> proxmox01's LRM only gets to log one more time:
>> Jun 12 01:38:37 proxmox01 pve-ha-lrm[3731520]: unable to write lrm status file - closing file '/etc/pve/nodes/proxmox01/lrm_status.tmp.3731520' failed - Operation not permitted
> Now, from your interpolated logs I'm missing an initial
> ... node 'proxmox01': state changed from 'unknown' => 'fence'
>
> I only found the somewhat strange sequence below:
>> Jun 12 01:39:34 proxmox03 pve-ha-crm[3084869]: fencing: acknowledged - got agent lock for node 'proxmox01
>> Jun 12 01:39:34 proxmox03 pve-ha-crm[3084869]: node 'proxmox01': state changed from 'fence' => 'unknown'
>> Jun 12 01:39:34 proxmox03 pve-ha-crm[3084869]: node 'proxmox01': state changed from 'unknown' => 'fence'
>
>
> So, the "unfreeze before the respective LRM got active+online with watchdog"
> seems the cause of the real wrong behavior here in your log, it allows the
> recovery to happen, as else frozen services wouldn't not have been recovered
> (that mechanism exactly exists to avoid such issues during a upgrade, where
> one does not want to stop or migrate all HA VM/CTs)
>
> While you interpolated the different logs into a single timeline, it does not
> seem to match up everywhere. For my better understanding, could you please send me:
>
> * corosync.conf
> * the journal or syslog of proxmox01 and proxmox03 around "Jun 12 01:38:16"
>    plus/minus ~ 5 minutes, please in separate files, no interpolation, and as
>    unredacted as possible (see the example journalctl call below)
> * information on whether you have a HW watchdog or use the Linux soft-dog
>
> That would be appreciated. I'll try to give the code that is in charge of
> this a good look tomorrow.
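>
> For the journal excerpts, something along these lines, run on both nodes,
> should do (the output file name is just an example):
>
>   journalctl --since "2019-06-12 01:33:00" --until "2019-06-12 01:43:00" \
>       > "ha-incident-$(hostname).log"
>
> i.e. roughly five minutes on either side of the interesting timestamp.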
>
> thanks,
> Thomas
>

-- 
LHProjects Network -- http://www.lhprojects.net



