[PVE-User] Whole cluster broke
Daniel
daniel at linux-nerd.de
Wed Mar 8 11:02:11 CET 2017
Hi,
the cluster had been working fine the whole time.
I have now found out that the PVE file system is not mounted. Here you can also see the logs you asked for ;)
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: active (running) since Fri 2017-02-17 15:59:11 CET; 2 weeks 4 days ago
Main PID: 2083 (corosync)
CGroup: /system.slice/corosync.service
└─2083 corosync
Mar 08 09:41:28 host01 corosync[2083]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 08 09:41:32 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112748) was formed. Members
Mar 08 09:41:32 host01 corosync[2083]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 12 13
Mar 08 09:41:32 host01 corosync[2083]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112756) was formed. Members joined: 13 left: 13
Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13
Mar 08 09:41:39 host01 corosync[2083]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 12 13
Mar 08 09:41:39 host01 corosync[2083]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 08 09:41:58 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112760) was formed. Members left: 13
Mar 08 09:41:58 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: failed (Result: signal) since Wed 2017-03-08 10:54:06 CET; 6min ago
Process: 22861 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
Main PID: 22868 (code=killed, signal=KILL)
Mar 08 10:54:01 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 950
Mar 08 10:54:02 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 960
Mar 08 10:54:03 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 970
Mar 08 10:54:04 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 980
Mar 08 10:54:05 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 990
Mar 08 10:54:06 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 1000
Mar 08 10:54:06 host01 systemd[1]: pve-cluster.service stop-sigterm timed out. Killing.
Mar 08 10:54:06 host01 systemd[1]: pve-cluster.service: main process exited, code=killed, status=9/KILL
Mar 08 10:54:06 host01 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Mar 08 10:54:06 host01 systemd[1]: Unit pve-cluster.service entered failed state.
It seems that "[TOTEM ] Failed to receive the leave message. failed: 13" was the problem.
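If it is unclear which node id keeps failing, the ids can be pulled straight out of the corosync journal. A small sketch (the message format matches the log lines above; on a live node you would pipe in `journalctl -u corosync`):

```shell
# Sketch: list the node ids that corosync reported with
# "Failed to receive the leave message" (format as in the logs above).
extract_failed_nodes() {
  grep -o 'Failed to receive the leave message. failed: [0-9]*' \
    | awk '{print $NF}' | sort -u
}

# On a live node: journalctl -u corosync | extract_failed_nodes
# Demo on one of the log lines above:
echo 'Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13' \
  | extract_failed_nodes
# prints: 13
```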
--
Regards
Daniel
On 08.03.17, 10:53, "pve-user on behalf of Thomas Lamprecht" <pve-user-bounces at pve.proxmox.com on behalf of t.lamprecht at proxmox.com> wrote:
On 03/08/2017 10:40 AM, Daniel wrote:
> Hi there,
>
> a colleague removed one server from the datacenter, and after that the whole cluster is broken:
Did this server act as a multicast querier? That could explain the behavior.
Check whether your switch has IGMP snooping set up; if yes, you could disable
it temporarily to see if that fixes the problem (this may have a performance
impact on the whole network, as multicast messages then get delivered to all
network members).
You may also try to enable a querier on one node:
# echo 1 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier
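To see where snooping and querier currently stand, the bridge state can be read from sysfs on each node. A sketch (vmbr0 is the usual Proxmox bridge name; omping is the tool commonly used to verify multicast delivery between cluster nodes):

```shell
# Sketch: print multicast snooping and querier state for every bridge
# on this node (standard kernel bridge sysfs attributes).
for br in /sys/class/net/*/bridge; do
  name=$(basename "$(dirname "$br")")
  snoop=$(cat "$br/multicast_snooping" 2>/dev/null || echo "n/a")
  quer=$(cat "$br/multicast_querier" 2>/dev/null || echo "n/a")
  echo "$name: snooping=$snoop querier=$quer"
done

# To actually verify multicast delivery between two nodes, run on both:
#   omping -c 10 -i 1 host01 host02
```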
> Mar 8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:01 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 230
> Mar 8 10:35:01 host01 snmpd[1441]: Connection from UDP: [10.0.2.50]:40800->[10.0.2.110]:161
> Mar 8 10:35:01 host01 snmpd[1441]: Connection from UDP: [10.0.2.50]:55768->[10.0.2.110]:161
> Mar 8 10:35:02 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 240
> Mar 8 10:35:03 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 250
> Mar 8 10:35:04 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 260
> Mar 8 10:35:05 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 270
> Mar 8 10:35:06 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 280
> Mar 8 10:35:07 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 290
> Mar 8 10:35:08 host01 /usr/share/filebeat/bin/filebeat[20736]: logp.go:230: Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=6 libbeat.logstash.publish.write_bytes=4907 libbeat.publisher.published_events=76 libbeat.logstash.published_and_acked_events=76 publish.events=76 libbeat.logstash.publish.read_bytes=222 registrar.states.update=76 registrar.writes=6
> Mar 8 10:35:08 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 300
> Mar 8 10:35:09 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 310
> Mar 8 10:35:10 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 320
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
>
> So /etc/pve/ is not mounted anymore and I can't restart anything.
> Does anyone have an idea what could have happened?
What's your corosync and pve-cluster status?
systemctl status corosync pve-cluster
Looks like corosync is dead/broken and does not let our cluster
filesystem join.
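Once corosync can talk to its peers again, restarting the stack in order usually brings /etc/pve back. A sketch of the common recovery sequence (unit names as shown in the status output above):

```shell
# Sketch: restart corosync first, then the cluster filesystem, then
# confirm that /etc/pve is a mountpoint again.
systemctl restart corosync \
  && systemctl restart pve-cluster \
  && mountpoint -q /etc/pve \
  && echo "/etc/pve is mounted again" \
  || echo "recovery failed; check 'journalctl -u corosync' for details"
```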
cheers and good luck,
Thomas
_______________________________________________
pve-user mailing list
pve-user at pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user