[PVE-User] pve-firewall, clustering and HA gone bad

Fri Jun 14 08:25:07 CEST 2019

Hi,

On 6/13/19 10:08 PM, JR Richardson wrote:
>> On 6/13/19 3:29 PM, Horace wrote:
>>> Should this stuff be in 'help' documentation ?
>>
>> The thing with the resolved ringX_ addresses?
>>
>> Hmm, it would not hurt if something regarding this is written there.
>> But it isn't as black and white, and often depends a lot on the
>> preferences of the admin(s) and their setup/environment.
>>
>> Some hints could probably be given, especially for an IPv6 addition/switch,
>> as the getaddrinfo preference of IPv6 over IPv4 if both are configured
>> has often bitten people (see /etc/gai.conf , man gai.conf), not only with
>> clustering or PVE.
>>
>> A few other hints could probably be thrown into that too.
>> Stefan (CCd), would you be willing to take a look at this and expand the
>> "Cluster Network" section from the pvecm chapter in pve-docs a bit
>> regarding this? That'd be great.
>>
> 
> Hi All,
> 
> Sorry to hijack thread, but I was about to perform a 10 node cluster
> upgrade and after reading above, I have some reservations.
> 
> I did a mix of versions 4.x and 5.x nodes over the last couple of
> years and my corosync.conf file has a mix of 'ring0_addr' entries as
> DNS name and IP Address. All node hosts files are up to date with all
> nodes in the cluster. I'm running PVE 5.2-5 across all nodes, seems to
> be working fine, no issues.
> 
> Should I update corosync.conf 'ring0_addr:' entries to all IP
> Addresses before attempting the upgrade?

If you did no host network change(s) you really should be fine.

The issue Mark Schouten had was mainly due to a few things coming
together. Even if he had had the ring0_addr's resolved, the FW would still
have blocked in this case, as the local_net calculation still picked up the
new IPv6 net first as primary, AFAICT.

> 
> If so, I assume I have to stop the pmxcfs and/or corosync, update the
> file on any node, then restart the cluster service on that node to
> push the update to all nodes?

Would work, but it is more intrusive than it needs to be. What I would do is:

1. Do an omping check[0] *first* with all the addresses you plan to put in
   place of the hostnames in ring0_addr, as this shows whether the cluster
   nodes can talk with each other through those addresses at all. You can
   get the currently used IPs with the following command (maybe grep for 'ip'):
   # corosync-cmapctl runtime.member

2. # cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new

3. # editor /etc/pve/corosync.conf.new

4. Change all ring0_addr entries to their respective IP counterparts. As we
   use the _exact_ same addresses as before, just written out, you can do
   this all at once. If you /change/ the addresses to other ones, it should
   not be done this way, at least not unless you are really comfortable with
   corosync and have played around a lot with such stuff (in test systems).

5. ensure you increased the config_version by one

6. save and diff to ensure the changes you're about to enact are OK:
   # diff -u /etc/pve/corosync.conf /etc/pve/corosync.conf.new

7. now let's enforce the changes cluster wide:
   # mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf

8. pmxcfs sees that the corosync conf changed and the config version is
   newer, each node thus copies this over from /etc/pve/corosync.conf to
   /etc/corosync/corosync.conf

9. The journal/syslog should show a message about corosync reloading the
   config, possibly telling you that it cannot enact a ring0_addr change of
   itself during runtime, which is OK _here_ as we did not change the address
   at all, just switched to another representation of it.
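
Steps 3 to 5 above can be partly scripted. The following is only a rough
sketch, not an official tool: the function names are made up for
illustration, and it assumes the usual corosync.conf layout with one
"ring0_addr: value" and one "config_version: N" entry per line. It lists the
ring0_addr entries that are still hostnames and bumps config_version by one
in the given file.

```shell
# Hypothetical helpers, assuming one "key: value" pair per line.

# List every ring0_addr value found in the given file, one per line.
ring0_addrs() {
    sed -n 's/^[[:space:]]*ring0_addr:[[:space:]]*\([^[:space:]]*\).*/\1/p' "$1"
}

# Print only the ring0_addr values that do not look like an IPv4 or IPv6
# literal, i.e. the entries that would still need to be replaced by an IP.
# The IP patterns are deliberately rough, not a strict validation.
ring0_hostnames() {
    ring0_addrs "$1" |
        grep -Ev '^([0-9]{1,3}\.){3}[0-9]{1,3}$|^[0-9A-Fa-f:]*:[0-9A-Fa-f:]*$' || true
}

# Increase config_version by one in the given file (intended for the
# .new copy, never the live config).
bump_config_version() {
    file="$1"
    current=$(sed -n 's/^[[:space:]]*config_version:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$file" | head -n 1)
    if [ -z "$current" ]; then
        echo "no config_version found in $file" >&2
        return 1
    fi
    next=$((current + 1))
    # Rewrite only the number, keeping the indentation of the line.
    updated=$(sed "s/^\([[:space:]]*config_version:[[:space:]]*\)[0-9][0-9]*/\1$next/" "$file")
    printf '%s\n' "$updated" > "$file"
}
```

One would run "ring0_hostnames /etc/pve/corosync.conf.new" after editing to
see what is still left to replace, then "bump_config_version
/etc/pve/corosync.conf.new", and still do the diff from step 6 before moving
the file into place.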

As said, that's for an address change which does not really change the address ;)
Otherwise, I would probably:

1. stop pve-ha-lrm everywhere, then pve-ha-crm (order is important)

2. do the edits as above, pmxcfs and corosync must still run, triple check
   the changes, ensure that the new network is reachable from all nodes
   (omping can help)

3. enforce config by moving the .new over the real one.

4. # systemctl restart corosync pve-cluster # everywhere

5. start ha services again.
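
The service handling in steps 1, 4 and 5 could be sketched roughly like
this. It is only an illustration under assumptions: the function names and
the node names in the usage note are hypothetical, root SSH access between
the nodes is assumed, and commands are only printed unless DRY_RUN=0 is set.
The start order in the last function is not specified above, so both HA
services are simply started with one command.

```shell
# Run a command, or only print it when DRY_RUN is 1 (the default here).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: stop pve-ha-lrm on every node first, and only then pve-ha-crm.
stop_ha_everywhere() {
    for node in "$@"; do
        run ssh "root@$node" systemctl stop pve-ha-lrm
    done
    for node in "$@"; do
        run ssh "root@$node" systemctl stop pve-ha-crm
    done
}

# Step 4: restart corosync and pve-cluster on every node.
restart_cluster_everywhere() {
    for node in "$@"; do
        run ssh "root@$node" systemctl restart corosync pve-cluster
    done
}

# Step 5: start the HA services again (order here is an assumption).
start_ha_everywhere() {
    for node in "$@"; do
        run ssh "root@$node" systemctl start pve-ha-lrm pve-ha-crm
    done
}
```

For example, "stop_ha_everywhere pve1 pve2 pve3" prints the planned commands
so they can be reviewed first; the config edit and the move of the .new file
(steps 2 and 3) stay manual on purpose.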

[0]: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#cluster-network-requirements