[PVE-User] Whole cluster breaks

Thomas Lamprecht t.lamprecht at proxmox.com
Wed Mar 8 11:15:25 CET 2017


Hi,

On 03/08/2017 11:02 AM, Daniel wrote:
> HI,
>
> the cluster was working pretty well all the time.

Yes, but if this particular node acted as the multicast querier, the
cluster would have worked fine as long as that node was present;
removing it leaves the network without a querier, and that causes
exactly these problems. It's at least worth checking. A simple test
would be:

Executing on at least two nodes:

 > omping -c 600 -i 1 -q NODE1-IP NODE2-IP ...

This runs for ~10 minutes and should ideally show 0% loss, or at least <1%.
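For a quicker first impression there is also a short high-frequency
burst variant (a sketch, with the same NODE-IP placeholders; parameters
as suggested in the pvecm docs linked below):

 > omping -c 10000 -i 0.001 -F -q NODE1-IP NODE2-IP ...

This sends 10000 packets at 1 ms intervals and should likewise show
close to 0% loss.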

See:
http://pve.proxmox.com/pve-docs/chapter-pvecm.html#cluster-network-requirements
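To see whether IGMP snooping is active on a node's bridge without a
querier being present, the kernel's bridge sysfs attributes can be
inspected. A minimal sketch, assuming the cluster network runs over
vmbr0 (adjust the bridge name to your setup):

 > # 1 = snooping enabled; multicast breaks if no querier exists on the net
 > cat /sys/class/net/vmbr0/bridge/multicast_snooping
 > # 1 = this bridge acts as an IGMP querier itself
 > cat /sys/class/net/vmbr0/bridge/multicast_querier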

> So I found out that the PVE file system is not mounted. And here you can also see some of the logs you asked for ;)
Thanks! Have you tried restarting corosync and then pve-cluster (see
the commands below)? This is not entirely safe with HA active, but I
guess you do not have HA configured, or else the watchdog would have
already triggered.
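For reference, the restart sequence would be:

 > systemctl restart corosync
 > systemctl restart pve-cluster
 > # then verify the cluster filesystem came up again
 > systemctl status pve-cluster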

>
> ● corosync.service - Corosync Cluster Engine
>     Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
>     Active: active (running) since Fri 2017-02-17 15:59:11 CET; 2 weeks 4 days ago
>   Main PID: 2083 (corosync)
>     CGroup: /system.slice/corosync.service
>             └─2083 corosync
>
> Mar 08 09:41:28 host01 corosync[2083]: [MAIN  ] Completed service synchronization, ready to provide service.
> Mar 08 09:41:32 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112748) was formed. Members
> Mar 08 09:41:32 host01 corosync[2083]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 12 13
> Mar 08 09:41:32 host01 corosync[2083]: [MAIN  ] Completed service synchronization, ready to provide service.
> Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112756) was formed. Members joined: 13 left: 13
> Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13
> Mar 08 09:41:39 host01 corosync[2083]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 12 13
> Mar 08 09:41:39 host01 corosync[2083]: [MAIN  ] Completed service synchronization, ready to provide service.
> Mar 08 09:41:58 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112760) was formed. Members left: 13
> Mar 08 09:41:58 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13
>
> ● pve-cluster.service - The Proxmox VE cluster filesystem
>     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
>     Active: failed (Result: signal) since Wed 2017-03-08 10:54:06 CET; 6min ago
>    Process: 22861 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
>   Main PID: 22868 (code=killed, signal=KILL)
>
> Mar 08 10:54:01 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 950
> Mar 08 10:54:02 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 960
> Mar 08 10:54:03 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 970
> Mar 08 10:54:04 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 980
> Mar 08 10:54:05 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 990
> Mar 08 10:54:06 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 1000
> Mar 08 10:54:06 host01 systemd[1]: pve-cluster.service stop-sigterm timed out. Killing.
> Mar 08 10:54:06 host01 systemd[1]: pve-cluster.service: main process exited, code=killed, status=9/KILL
> Mar 08 10:54:06 host01 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
> Mar 08 10:54:06 host01 systemd[1]: Unit pve-cluster.service entered failed state.
>
> It seems that "[TOTEM ] Failed to receive the leave message. failed: 13" was the problem.
>

This could indeed indicate multicast problems (see above).
Did the problems happen instantly after the removal of the node, or
with a delay of some minutes?
And how did you remove that node?
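Once pve-cluster is running again, it would also help to check the
membership and quorum state on a few nodes, e.g.:

 > pvecm status
 > corosync-quorumtool -s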

Just trying to understand your situation here :)

cheers,
Thomas




