[PVE-User] pve-firewall, clustering and HA gone bad

Fri Jun 14 17:41:36 CEST 2019

On Fri, Jun 14, 2019 at 1:25 AM Thomas Lamprecht
<t.lamprecht at proxmox.com> wrote:
>
> Hi,
>
> On 6/13/19 10:08 PM, JR Richardson wrote:
> >> On 6/13/19 3:29 PM, Horace wrote:
> >>> Should this stuff be in 'help' documentation ?
> >>
> >> The thing with the resolved ringX_ addresses?
> >>
> >> Hmm, it would not hurt if something regarding this is written there.
> >> But it isn't as black and white, and often depends a lot on the
> >> preferences of the admin(s) and their setup/environment.
> >>
> >> Some hints could probably be given, especially for an IPv6 addition/switch,
> >> as the getaddrinfo preference of IPv6 over IPv4 if both are configured
> >> has often bitten people (see /etc/gai.conf, man gai.conf), not only with
> >> clustering or PVE.
> >>
> >> A few other hints could probably be thrown into that too.
> >> Stefan (CCd), would you be willing to take a look at this and expand the
> >> "Cluster Network" section from the pvecm chapter in pve-docs a bit
> >> regarding this? That'd be great.
> >>
> >
> > Hi All,
> >
> > Sorry to hijack the thread, but I was about to perform a 10-node cluster
> > upgrade and after reading the above, I have some reservations.
> >
> > I had a mix of version 4.x and 5.x nodes over the last couple of
> > years, and my corosync.conf file has a mix of 'ring0_addr' entries as
> > DNS names and IP addresses. All node hosts files are up to date with all
> > nodes in the cluster. I'm running PVE 5.2-5 across all nodes, and it seems
> > to be working fine, no issues.
> >
> > Should I update corosync.conf 'ring0_addr:' entries to all IP
> > Addresses before attempting the upgrade?
>
> If you did no host network change(s) you really should be fine.
>
> The issue of Mark Schouten was mainly due to a few things coming
> together. Even if he had had the ring0_addr's resolved, the FW would still
> have blocked in this case, as the local_net calculations still picked up the
> new IPv6 net as primary first, AFAICT.
>
> >
> > If so, I assume I have to stop pmxcfs and/or corosync, update the
> > file on any node, then restart the cluster service on that node to
> > push the update to all nodes?
>
> Would work, but it is more intrusive than it needs to be. What I would do is:
>
> 1. Do an omping check[0] *first* with all the addresses you plan to use in
>    place of the hostnames in ring0_addr, as this shows if the cluster nodes
>    can talk with each other through those addresses at all. You can also get
>    the currently used IPs by using the following command (maybe grep for 'ip'):
>    # corosync-cmapctl runtime.member
>
> 2. # cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
>
> 3. # editor /etc/pve/corosync.conf.new
>
> 4. change all ring0_addr entries to their respective IP counterparts. As we
>    use the _exact_ same addresses as before, just written out, you can do
>    this all at once. If you /change/ the addresses to other ones, it should
>    not be done this way, at least not unless you are really comfortable with
>    corosync and have played around (in testing systems) a lot with such stuff.
>
> 5. ensure you increased the config_version by one
>
> 6. save and diff to ensure the changes you're about to enact are OK:
>    # diff -u /etc/pve/corosync.conf /etc/pve/corosync.conf.new
>
> 7. now let's enforce the changes cluster-wide:
>    # mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
>
> 8. pmxcfs sees that the corosync conf changed and the config version is
>    newer, each node thus copies this over from /etc/pve/corosync.conf to
>    /etc/corosync/corosync.conf
>
> 9. The journalctl/syslog should show some message about corosync reloading
>    the config, possibly telling you that it cannot enact a change of its own
>    ring0_addr during runtime, which is OK _here_ as we did not change the
>    address at all, just switched to another representation of it.
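
Steps 1 and 5 above lend themselves to a small helper. The following is
only a sketch under assumptions: the function names are hypothetical, and it
assumes the usual "key: value" layout of corosync.conf. It lists the current
ring0_addr values (for example to feed them to an omping check) and prints a
copy of the file with config_version increased by one.

```shell
# Sketch only: helpers for preparing a corosync.conf.new.
# Function names are hypothetical; try them on a scratch copy first.

# Print every ring0_addr value found in the given file, one per line.
list_ring0_addrs() {
    awk '$1 == "ring0_addr:" { print $2 }' "$1"
}

# Print the given file to stdout with config_version increased by one.
# Indentation is kept because only the number itself is substituted.
bump_config_version() {
    awk '$1 == "config_version:" { sub(/[0-9]+/, $2 + 1) } { print }' "$1"
}
```

For example, `bump_config_version /etc/pve/corosync.conf > /root/corosync.conf.new`
writes the bumped copy outside /etc/pve, where it can be edited and diffed
before being moved into place as in steps 6 and 7.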
>
> As said, that's for an address change which does not really change the address ;)
> Else, I probably would:
>
> 1. stop pve-ha-lrm everywhere, then pve-ha-crm (order is important)
>
> 2. do the edits as above, pmxcfs and corosync must still run, triple check
>    the changes, ensure that the new network is reachable from all nodes
>    (omping can help)
>
> 3. enforce config by moving the .new over the real one.
>
> 4. # systemctl restart corosync pve-cluster # everywhere
>
> 5. start ha services again.
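
The ordering in this second procedure can be written down as a dry-run plan.
This is a sketch, not a tested procedure: the function name, the node names
and the use of ssh as root are assumptions, and the start order of the HA
services in the last step is not specified above, so treat it as a
placeholder. The function only prints commands; it does not run them.

```shell
# Sketch only: print (do not execute) the command sequence for a real
# address change. Node names and "ssh root@..." are assumptions.
print_restart_plan() {
    # Stop the local resource manager on every node first...
    for n in "$@"; do echo "ssh root@$n systemctl stop pve-ha-lrm"; done
    # ...and only then the cluster resource manager.
    for n in "$@"; do echo "ssh root@$n systemctl stop pve-ha-crm"; done
    echo "# edit, verify and move corosync.conf.new into place here"
    for n in "$@"; do echo "ssh root@$n systemctl restart corosync pve-cluster"; done
    # Start order of the two HA services is an assumption.
    for n in "$@"; do echo "ssh root@$n systemctl start pve-ha-lrm pve-ha-crm"; done
}

print_restart_plan pve1 pve2 pve3
```

Printing the plan first makes it easy to review the order (pve-ha-lrm
stopped on all nodes before any pve-ha-crm) before running anything.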
>
> [0]: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#cluster-network-requirements
>

Thanks Thomas,

I really appreciate the clarification and directions. I do have a 4-node
lab cluster that I will run through testing with; if all goes well, I
will move on to production.

Thanks.

JR
-- 
JR Richardson
Engineering for the Masses
Chasing the Azeotrope