[PVE-User] pve-firewall, clustering and HA gone bad

Thu Jun 13 11:47:18 CEST 2019

Hi,

Let me start off by saying that I am not pointing fingers at anyone,
merely looking for how to prevent sh*t from happening again!

Last month I emailed about issues with pve-firewall. I was told that
there were fixes in the newest packages, so during this maintenance I
started by upgrading pve-firewall before anything else. That went well
for nearly all the clusters I upgraded.

Then I ended up at the last (biggest, 9 nodes) cluster, and stuff got
pretty ugly. Here's what happened:

1: I enabled IPv6 on the cluster interfaces in the last month. I've done
this before on other clusters, nothing special there. So I added the
IPv6 addresses on the interfaces and added all nodes in all the
/etc/hosts files. In the past I've had issues with clusters not
starting because hostnames could not be resolved, so all my nodes in
all my clusters have the hostnames and addresses of their respective
peers in /etc/hosts.
2: I upgraded pve-firewall on all the nodes, no issues there
3: I started dist-upgrading proxmox01 and proxmox02, restarting
pve-firewall with `pve-firewall restart` because of [1], and noticed
that `pvecm status` did not list any of the other nodes in the list of
peers. So we had:
  proxmox01: proxmox01
  proxmox02: proxmox02
  proxmox03-proxmox09: proxmox03-proxmox09

Obviously, /etc/pve was readonly on proxmox01 and proxmox02, since they
had no quorum.
4: HA is heavily used on this cluster. Just about all VMs have it
enabled. So since 'I changed nothing', I restarted pve-cluster a few
times on the broken nodes. Nothing helped.
5: I then restarted pve-cluster on proxmox03, and all of a sudden,
proxmox01 looked happy again.
6: In the meantime, ha-manager had kicked in and started VMs on other
nodes, but it did not actually let proxmox01 fence itself. I did not
notice this at the time.
7: I tried restarting pve-cluster on yet another node, and then all
nodes except proxmox01 and proxmox02 fenced themselves, rebooting
all at once.
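
To illustrate why proxmox01 and proxmox02 went read-only in step 3,
here is a minimal Python sketch of the majority rule. It assumes one
vote per node and the default corosync quorum of floor(votes / 2) + 1;
the function names are made up for illustration and are not part of any
Proxmox tool.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum, assuming a simple majority rule."""
    return expected_votes // 2 + 1


def is_quorate(partition_size: int, expected_votes: int) -> bool:
    """True if a partition of this size holds a majority of the votes."""
    return partition_size >= quorum_threshold(expected_votes)


# The three views from step 3, on a 9-node cluster with one vote each.
partitions = {"proxmox01": 1, "proxmox02": 1, "proxmox03-proxmox09": 7}

for name, size in partitions.items():
    state = "quorate" if is_quorate(size, 9) else "no quorum"
    print(f"{name}: {size} of 9 votes, {state}")
```

With 9 votes the threshold is 5, so only the 7-node group keeps quorum,
and the two isolated nodes cannot write to /etc/pve.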

After rebooting, the cluster was not completely happy, because the
firewall was still confused. So why was this firewall confused? Nothing
changed, remember? Well, nothing except bullet 1.

It seems that pve-firewall tries to detect localnet, but failed to do so
correctly. localnet should be 192.168.1.0/24, but instead it detected
the IPv6 addresses. That isn't entirely incorrect, but IPv6 is not used
for clustering, so IPv4 should be opened in the firewall, not IPv6. So
it seems that name resolution is used to define localnet, and not what
corosync is actually using.
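
Here is a rough Python sketch of the behaviour I suspect. This is not
the actual pve-firewall code, only an illustration under the assumption
that the first address the hostname resolves to decides the network.
The name detect_localnet is made up, the host address 192.168.1.11 is
hypothetical, and the 2001:db8::/64 addresses are documentation
placeholders rather than the real ones.

```python
import ipaddress


def detect_localnet(resolved_addresses, interface_cidrs):
    """Return the interface network containing the first resolved address."""
    if not resolved_addresses:
        return None
    first = ipaddress.ip_address(resolved_addresses[0])
    for cidr in interface_cidrs:
        network = ipaddress.ip_interface(cidr).network
        # Only compare addresses of the same family (IPv4 vs IPv6).
        if network.version == first.version and first in network:
            return network
    return None


# Addresses configured on the cluster interface (hypothetical values).
interfaces = ["192.168.1.11/24", "2001:db8::11/64"]

# Before IPv6 was added, /etc/hosts only had the IPv4 entry.
before = detect_localnet(["192.168.1.11"], interfaces)

# Afterwards, assume the IPv6 entry is returned first.
after = detect_localnet(["2001:db8::11", "192.168.1.11"], interfaces)

print(before)
print(after)
```

On a real node, `pve-firewall localnet` should print what was actually
detected, which is a quick check to run before restarting anything.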

I fixed the current situation by adding the correct [ALIASES] in
cluster.fw, and now all is well (except for the VMs that were running
on two nodes at once and now have broken images).
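
For reference, an override of this kind has roughly the following
shape. This is a sketch that assumes the built-in local_network alias
is the one being overridden and uses the 192.168.1.0/24 network
mentioned above; the exact entries may differ per setup.

```
# /etc/pve/firewall/cluster.fw
[ALIASES]
local_network 192.168.1.0/24
```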

So I think there are two issues here:
1: pve-firewall should do a better job of detecting the IPs used for
essential services
2: ha-manager should not be able to start VMs when they are still
running elsewhere

Obviously, this is a faulty situation which causes unexpected results.
Again, I'm not pointing fingers, I would like to discuss how we can
improve the handling of these kinds of faulty situations.

In the attachment, you can find a log with dpkg, pmxcfs, pve-ha-(lc)rm
from all nodes. So maybe someone can better assess what went wrong.

[1]: https://bugzilla.proxmox.com/show_bug.cgi?id=1823

-- 
Mark Schouten     | Tuxis B.V.
KvK: 74698818     | http://www.tuxis.nl/
T: +31 318 200208 | info at tuxis.nl