[PVE-User] Whole cluster broke
Daniel
daniel at linux-nerd.de
Wed Mar 8 11:02:11 CET 2017
Hi,
the cluster had been working fine the whole time.
I have now found out that the PVE file system is not mounted. Here you can also see the logs you asked for ;)
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: active (running) since Fri 2017-02-17 15:59:11 CET; 2 weeks 4 days ago
Main PID: 2083 (corosync)
CGroup: /system.slice/corosync.service
└─2083 corosync
Mar 08 09:41:28 host01 corosync[2083]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 08 09:41:32 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112748) was formed. Members
Mar 08 09:41:32 host01 corosync[2083]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 12 13
Mar 08 09:41:32 host01 corosync[2083]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112756) was formed. Members joined: 13 left: 13
Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13
Mar 08 09:41:39 host01 corosync[2083]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 12 13
Mar 08 09:41:39 host01 corosync[2083]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 08 09:41:58 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112760) was formed. Members left: 13
Mar 08 09:41:58 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
Active: failed (Result: signal) since Wed 2017-03-08 10:54:06 CET; 6min ago
Process: 22861 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
Main PID: 22868 (code=killed, signal=KILL)
Mar 08 10:54:01 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 950
Mar 08 10:54:02 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 960
Mar 08 10:54:03 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 970
Mar 08 10:54:04 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 980
Mar 08 10:54:05 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 990
Mar 08 10:54:06 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 1000
Mar 08 10:54:06 host01 systemd[1]: pve-cluster.service stop-sigterm timed out. Killing.
Mar 08 10:54:06 host01 systemd[1]: pve-cluster.service: main process exited, code=killed, status=9/KILL
Mar 08 10:54:06 host01 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Mar 08 10:54:06 host01 systemd[1]: Unit pve-cluster.service entered failed state.
It seems that "[TOTEM ] Failed to receive the leave message. failed: 13" was the problem.
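If it is unclear which node id keeps failing, the ids can be pulled straight out of the corosync journal. A small sketch (the message format matches the log lines above; on a live node you would pipe in `journalctl -u corosync`):

```shell
# Sketch: list the node ids that corosync reported with
# "Failed to receive the leave message" (format as in the logs above).
extract_failed_nodes() {
  grep -o 'Failed to receive the leave message. failed: [0-9]*' \
    | awk '{print $NF}' | sort -u
}

# On a live node: journalctl -u corosync | extract_failed_nodes
# Demo on one of the log lines above:
echo 'Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13' \
  | extract_failed_nodes
# prints: 13
```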
--
Regards
Daniel
On 08.03.17, 10:53, "pve-user on behalf of Thomas Lamprecht" <pve-user-bounces at pve.proxmox.com on behalf of t.lamprecht at proxmox.com> wrote:
On 03/08/2017 10:40 AM, Daniel wrote:
> Hi there,
>
> a colleague removed one server from the datacenter, and after that the whole cluster is broken:
Did this server act as a multicast querier? That could explain the behavior.
Check whether your switch has IGMP snooping set up; if yes, you could disable
it temporarily to see if that fixes the problem (this may have a performance
impact on the whole network, as multicast messages then get delivered to all
network members).
You may also try to enable a querier on one node:
# echo 1 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier
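To see where snooping and querier currently stand, the bridge state can be read from sysfs on each node. A sketch (vmbr0 is the usual Proxmox bridge name; omping is the tool commonly used to verify multicast delivery between cluster nodes):

```shell
# Sketch: print multicast snooping and querier state for every bridge
# on this node (standard kernel bridge sysfs attributes).
for br in /sys/class/net/*/bridge; do
  name=$(basename "$(dirname "$br")")
  snoop=$(cat "$br/multicast_snooping" 2>/dev/null || echo "n/a")
  quer=$(cat "$br/multicast_querier" 2>/dev/null || echo "n/a")
  echo "$name: snooping=$snoop querier=$quer"
done

# To actually verify multicast delivery between two nodes, run on both:
#   omping -c 10 -i 1 host01 host02
```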
> Mar 8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:00 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:01 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 230
> Mar 8 10:35:01 host01 snmpd[1441]: Connection from UDP: [10.0.2.50]:40800->[10.0.2.110]:161
> Mar 8 10:35:01 host01 snmpd[1441]: Connection from UDP: [10.0.2.50]:55768->[10.0.2.110]:161
> Mar 8 10:35:02 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 240
> Mar 8 10:35:03 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 250
> Mar 8 10:35:04 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 260
> Mar 8 10:35:05 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 270
> Mar 8 10:35:06 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 280
> Mar 8 10:35:07 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 290
> Mar 8 10:35:08 host01 /usr/share/filebeat/bin/filebeat[20736]: logp.go:230: Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=6 libbeat.logstash.publish.write_bytes=4907 libbeat.publisher.published_events=76 libbeat.logstash.published_and_acked_events=76 publish.events=76 libbeat.logstash.publish.read_bytes=222 registrar.states.update=76 registrar.writes=6
> Mar 8 10:35:08 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 300
> Mar 8 10:35:09 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 310
> Mar 8 10:35:10 host01 pmxcfs[7399]: [dcdb] notice: cpg_join retry 320
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
> Mar 8 10:35:10 host01 pvestatd[2090]: ipcc_send_rec failed: Connection refused
>
> So /etc/pve/ is not mounted anymore and I can't restart anything.
> Does anyone have an idea what could have happened?
What's your corosync and pve-cluster status?
systemctl status corosync pve-cluster
Looks like corosync is dead/broken and does not let our cluster
filesystem join.
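Once corosync can talk to its peers again, restarting the stack in order usually brings /etc/pve back. A sketch of the common recovery sequence (unit names as shown in the status output above):

```shell
# Sketch: restart corosync first, then the cluster filesystem, then
# confirm that /etc/pve is a mountpoint again.
systemctl restart corosync \
  && systemctl restart pve-cluster \
  && mountpoint -q /etc/pve \
  && echo "/etc/pve is mounted again" \
  || echo "recovery failed; check 'journalctl -u corosync' for details"
```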
cheers and good luck,
Thomas
_______________________________________________
pve-user mailing list
pve-user at pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user