[PVE-User] Cluster fiasco and recovery
nada at verdnatura.es
Thu Feb 14 10:29:04 CET 2019
MANY thanks, Ian!!!
I am planning to add another node next month, also using a LAG on bond0,
and our DNS failover lives in CTs, so I will be adding the IPs to
/etc/hosts on all nodes ;-)
MANY thanks to Dietmar and Martin, who run this pve-user list!!!
On 2019-02-14 08:15, Ian Coetzee wrote:
> Hi All,
>
> I just wanted to share something that happened to me yesterday. I am
> hoping that this will save someone else 4 hours in the future.
>
> We have a smallish cluster of 5 nodes (3 compute, 2 storage). One of
> the compute nodes also doubles as a storage node.
>
> Yesterday, I was adding another compute node into the mix when suddenly
> the other compute nodes just randomly rebooted with nothing to show in
> the logs. As a result, the cluster lost quorum, as would be expected.
> What wasn't expected was that the cluster could not re-establish quorum.
> It looked to be a network-related issue, and I remembered that back in
> the days when I started the cluster, I had a lot of issues between the
> Linux bonded interface and the switch, where I had to run
> "ifdown bond0; ifup bond0" after a startup. It seems the kernel brings
> the bond up, and then tries to set the bond mode to LACP... But I
> digress.
>
> So the first thing I did was to troubleshoot this part by changing the
> bond types on the switches. When that didn't work, I tore down the bonds
> and went back to basics. Still the cluster couldn't establish quorum.
> After pulling out most of my hair in frustration, I went to ask Google
> and came across an older post where the OP had a member offline while
> joining a new member. This is what pointed me to actually look at my
> corosync.conf file.
>
> It turned out that when I joined the first three members, it was from
> the CLI (before the option was available in the GUI), using the DNS
> names of the other cluster members. This is the important part: as you
> can probably guess, my DNS servers were also running in the cluster, so
> they went offline when the compute nodes rebooted. Thus the cluster
> members had no idea what the IP addresses of the first 3 nodes were...
>
> I replaced the DNS hostnames in the corosync.conf with the actual IP
> addresses and the cluster established quorum.
>
> So my advice to you guys out there, make sure that:
>
> - Your corosync.conf uses IP addresses
> or
> - If you want to use hostnames, put those hostnames in your /etc/hosts
>   file
>
> I also think corosync should have logged errors saying that it was
> unable to resolve the hostnames to IP addresses.
>
> Just a little piece of advice from me to you.
>
> Proxmox rocks though!
> _______________________________________________
> pve-user mailing list
> pve-user at pve.proxmox.com
> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
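
Ian's two checks (IP addresses in corosync.conf, or hostnames pinned in
/etc/hosts) can be scripted. What follows is only a minimal sketch, not
a Proxmox or corosync tool. It assumes node addresses appear as
"ringN_addr: value" lines in corosync.conf, and that /etc/hosts is
consulted before DNS (the usual "hosts: files dns" order in
nsswitch.conf). The node names, the example.lan domain and the 10.0.0.x
addresses in the sample data are made up for illustration.

```python
import ipaddress
import re


def ring_addresses(corosync_conf: str) -> list[str]:
    """Return every ringN_addr value found in the corosync.conf text."""
    return re.findall(r"^\s*ring\d+_addr\s*:\s*(\S+)", corosync_conf, re.MULTILINE)


def hosts_names(hosts_text: str) -> set[str]:
    """Return all names (canonical and aliases) defined in /etc/hosts text."""
    names = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        names.update(name.lower() for name in fields[1:])  # fields[0] is the IP
    return names


def is_ip(value: str) -> bool:
    """True if value is an IPv4 or IPv6 literal."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def dns_dependent(corosync_conf: str, hosts_text: str) -> list[str]:
    """Addresses that are neither IP literals nor listed in /etc/hosts."""
    known = hosts_names(hosts_text)
    return [addr for addr in ring_addresses(corosync_conf)
            if not is_ip(addr) and addr.lower() not in known]


# Hypothetical sample data, not taken from the original post.
SAMPLE_CONF = """
nodelist {
  node {
    name: pve1
    nodeid: 1
    ring0_addr: 10.0.0.11
  }
  node {
    name: pve2
    nodeid: 2
    ring0_addr: pve2.example.lan
  }
  node {
    name: pve3
    nodeid: 3
    ring0_addr: pve3.example.lan
  }
}
"""

SAMPLE_HOSTS = """
127.0.0.1 localhost
10.0.0.12 pve2.example.lan pve2  # pinned locally
"""

print(dns_dependent(SAMPLE_CONF, SAMPLE_HOSTS))  # -> ['pve3.example.lan']
```

On a real node you would feed it the contents of
/etc/corosync/corosync.conf and /etc/hosts instead of the sample strings.
Anything it prints is an address the cluster can only find while DNS is
reachable, which is exactly the situation described above.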