[pve-devel] rgmanager + firewall = bad news

Tue Jul 14 23:01:27 CEST 2015

I think I have finally tracked down some bad behavior that's been around
for a while; see e.g. this thread [1] (and I have pages of purple Google
results).

Problem: After rebooting a node (pve01), rgmanager doesn't appear to be
running correctly. The PVE interface doesn't report that node as
running rgmanager, but on the box itself, `service rgmanager status`
lists several pids that are indeed rgmanager processes. `clustat` on
pve01 says rgmanager isn't running anywhere. `clustat` on other nodes
says rgmanager is running everywhere but pve01, which matches the PVE
interface. I also tend to see the kernel reporting a hung task here:

kernel: [  241.175951] INFO: task rgmanager:4321 blocked for more than 120 seconds.

Trying to restart rgmanager on pve01 never completes, and the node
normally then has to be fenced.

I think this is all caused by the cluster firewall. Turning off the
firewall made the problem go away. After further experimentation and
googling I got to the RHEL6 documentation [2], which says dlm needs tcp
dport 21064 open. Once I added that as an allowed dport to the security
group I have for the hypervisors and rebooted the node again, everything
seems happy. I suggest adding this port to the firewall's default accept
list, similar to ports 8006, 5404, 5405 et al.
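
For anyone wanting to check whether they are hitting the same thing, here
is a minimal sketch of a reachability test for the dlm port, run from one
node against its peers. It is not part of PVE or rgmanager; the function
name `tcp_port_open` and the 3 second timeout are my own choices for
illustration. Only the port number comes from [2]. Note that a refused
connection can also just mean dlm isn't listening yet, so a failure is a
hint rather than proof that the firewall is at fault.

```python
import socket
import sys

DLM_PORT = 21064  # TCP port dlm needs, per the RHEL6 document cited as [2]


def tcp_port_open(host, port=DLM_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration only. A firewall that drops
    packets usually shows up as a timeout; a reject rule or a missing
    listener shows up as a refused connection. Both return False here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Pass the other cluster nodes as arguments, e.g. their hostnames.
    for host in sys.argv[1:]:
        state = "reachable" if tcp_port_open(host) else "blocked or closed"
        print(f"{host}:{DLM_PORT} {state}")
```

Running it on pve01 with the other node names as arguments, and again
from another node against pve01, should show whether the port is open in
both directions.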

[1]
http://forum.proxmox.com/threads/9962-rgmanager-running-per-cli-but-not-pve
[2]
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/s2-iptables_firewall-CA.html

Regards,
Nathan