[PVE-User] Whole cluster brokes

Wed Mar 8 10:53:52 CET 2017

On 03/08/2017 10:40 AM, Daniel wrote:
> Hi there,
>
> one of my colleagues removed a server from the datacenter, and after that the whole cluster is broken:

Did this server act as a multicast querier? That could explain the
behavior.

Check whether your switch has IGMP snooping enabled. If it does, you
could disable it temporarily to see if that fixes the problem (this may
have a performance impact on the whole network, as multicast messages
then get delivered to all network members).

You may also try to enable a querier on one node:

# echo 1 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier
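Before flipping the flag it can help to see what the bridge is currently
set to. A minimal sketch, assuming the bridge is vmbr0 as above; the
function name bridge_mcast_state and the BRIDGE_SYSFS_ROOT override are
made up for illustration (the override only exists so the function can
be tried against a scratch directory). Note that multicast_snooping here
is the Linux bridge's own setting, not the one on your physical switch.

```shell
# Print the multicast snooping and querier flags of a Linux bridge.
# BRIDGE_SYSFS_ROOT is a hypothetical override, not a kernel or PVE setting.
bridge_mcast_state() {
    bridge="${1:-vmbr0}"
    root="${BRIDGE_SYSFS_ROOT:-/sys/devices/virtual/net}"
    dir="$root/$bridge/bridge"

    # Only bridge devices have a "bridge" subdirectory in sysfs.
    if [ ! -d "$dir" ]; then
        echo "$bridge: not a bridge (no $dir)" >&2
        return 1
    fi

    # 1 means enabled, 0 means disabled.
    for flag in multicast_snooping multicast_querier; do
        printf '%s %s=%s\n' "$bridge" "$flag" "$(cat "$dir/$flag")"
    done
}
```

Call it as "bridge_mcast_state vmbr0". Keep in mind that a value written
through sysfs does not survive a reboot.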

> Mar  8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar  8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar  8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar  8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar  8 10:35:01 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 230
> Mar  8 10:35:01 host01 snmpd[1441]: Connection from UDP: [10.0.2.50]:40800->[10.0.2.110]:161
> Mar  8 10:35:01 host01 snmpd[1441]: Connection from UDP: [10.0.2.50]:55768->[10.0.2.110]:161
> Mar  8 10:35:02 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 240
> Mar  8 10:35:03 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 250
> Mar  8 10:35:04 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 260
> Mar  8 10:35:05 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 270
> Mar  8 10:35:06 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 280
> Mar  8 10:35:07 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 290
> Mar  8 10:35:08 host01 /usr/share/filebeat/bin/filebeat[20736]: logp.go:230: Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=6 libbeat.logstash.publish.write_bytes=4907 libbeat.publisher.published_events=76 libbeat.logstash.published_and_acked_events=76 publish.events=76 libbeat.logstash.publish.read_bytes=222 registrar.states.update=76 registrar.writes=6
> Mar  8 10:35:08 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 300
> Mar  8 10:35:09 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 310
> Mar  8 10:35:10 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 320
> Mar  8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar  8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar  8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar  8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar  8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar  8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
>
> So /etc/pve/ is not mounted anymore and I can't restart anything.
> Does anyone have an idea what could have happened?

What is the status of corosync and pve-cluster?

# systemctl status corosync pve-cluster

It looks like corosync is dead or broken and does not let our cluster
filesystem (pmxcfs) join, which would match the cpg_join retries in
your log.
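If you want a quick yes/no answer on every node instead of reading the
full status output, something along these lines might do. It is only a
sketch: check_cluster_services is a made-up name, and the SYSTEMCTL
variable is a hypothetical override so the function can be tried without
touching the real services. It relies on "systemctl is-active", which
prints the unit state and returns non-zero when the unit is not active.

```shell
# Report whether corosync and pve-cluster are active on this node.
# SYSTEMCTL is a hypothetical override used only for dry runs.
check_cluster_services() {
    systemctl_cmd="${SYSTEMCTL:-systemctl}"
    failed=0

    for unit in corosync pve-cluster; do
        # is-active exits non-zero for inactive/failed units; keep going.
        state="$($systemctl_cmd is-active "$unit" 2>/dev/null)" || true
        [ -n "$state" ] || state=unknown
        echo "$unit: $state"
        [ "$state" = active ] || failed=1
    done

    # Non-zero return if at least one of the two units is not active.
    return "$failed"
}
```

If corosync shows up as failed or inactive, "journalctl -b -u corosync"
should tell you why it stopped.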

cheers and good luck,
Thomas