[PVE-User] Whole cluster broke

Daniel daniel at linux-nerd.de
Wed Mar 8 11:38:58 CET 2017


Hi,

when I try the command with 2 nodes I get the following error.
So it really does seem to be a multicast problem.

root at host01:~# omping -c 10 -i 1 -q 10.0.2.110 10.0.2.111
10.0.2.111 : waiting for response msg
10.0.2.111 : waiting for response msg

I can't restart pve-cluster; I get errors. Corosync has not been restarted yet. And yes, I actually don't have HA configured yet.
Is there any special command to restart Corosync?
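If I understand it right, both are plain systemd services on PVE, so a restart should come down to something like this (the unit names are assumed from a default install):

systemctl restart corosync      # restart the cluster engine first
systemctl restart pve-cluster   # then pmxcfs, which rejoins via corosync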

Would it help if I do this on one node?

echo 1 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier
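To check the current state first, and to make the setting survive a reboot, something like this should work (vmbr0 and the interfaces stanza are assumptions for my setup):

cat /sys/devices/virtual/net/vmbr0/bridge/multicast_querier    # 0 = off, 1 = querier active

# persist it via /etc/network/interfaces on the bridge, e.g.:
#   post-up echo 1 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier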
 
I am not sure how long the cluster kept working after node 13 was shut down.

-- 
Regards
 
Daniel

On 08.03.17 at 11:15, "pve-user on behalf of Thomas Lamprecht" <pve-user-bounces at pve.proxmox.com on behalf of t.lamprecht at proxmox.com> wrote:

    Hi,
    
    On 03/08/2017 11:02 AM, Daniel wrote:
    > HI,
    >
    > the Cluster was working all the time pretty cool.
    
    Yes, but if this particular node acted as the multicast querier, the cluster
    would have worked fine while it was up; removing it leaves the network
    without a querier, and that causes exactly these problems.
    It's at least worth checking. A simple test would be:
    
    Execute the following on at least two nodes:
    
     > omping -c 600 -i 1 -q NODE1-IP NODE2-IP ...
    
    This runs for ~10 minutes and should ideally show 0% loss, or at least <1%.
    
    See:
    http://pve.proxmox.com/pve-docs/chapter-pvecm.html#cluster-network-requirements
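    The linked chapter also describes a much faster burst variant, roughly
    (parameters as given in the docs; adjust the IPs to your nodes):

     > omping -c 10000 -i 0.001 -F -q NODE1-IP NODE2-IP ...

    That sends ~10000 packets within a few seconds and should likewise show
    (almost) no loss.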
    
    > So actually I found out that the PVE file system is not mounted. And here you can also see some of the logs you asked for ;)
    Thanks! Have you tried restarting corosync and then pve-cluster?
    This is not entirely safe with HA active, but I guess you do not have HA
    configured, or else the watchdog would have already triggered.
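    After the restart you can verify that pmxcfs is back, for example with
    (standard PVE tooling, just a rough sketch):

     > systemctl status corosync pve-cluster
     > pvecm status

    pvecm status should again list all expected members and report quorum.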
    
    >
    > ● corosync.service - Corosync Cluster Engine
    >     Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
    >     Active: active (running) since Fri 2017-02-17 15:59:11 CET; 2 weeks 4 days ago
    >   Main PID: 2083 (corosync)
    >     CGroup: /system.slice/corosync.service
    >             └─2083 corosync
    >
    > Mar 08 09:41:28 host01 corosync[2083]: [MAIN  ] Completed service synchronization, ready to provide service.
    > Mar 08 09:41:32 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112748) was formed. Members
    > Mar 08 09:41:32 host01 corosync[2083]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 12 13
    > Mar 08 09:41:32 host01 corosync[2083]: [MAIN  ] Completed service synchronization, ready to provide service.
    > Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112756) was formed. Members joined: 13 left: 13
    > Mar 08 09:41:39 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13
    > Mar 08 09:41:39 host01 corosync[2083]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 12 13
    > Mar 08 09:41:39 host01 corosync[2083]: [MAIN  ] Completed service synchronization, ready to provide service.
    > Mar 08 09:41:58 host01 corosync[2083]: [TOTEM ] A new membership (10.0.2.110:112760) was formed. Members left: 13
    > Mar 08 09:41:58 host01 corosync[2083]: [TOTEM ] Failed to receive the leave message. failed: 13
    >
    > ● pve-cluster.service - The Proxmox VE cluster filesystem
    >     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
    >     Active: failed (Result: signal) since Wed 2017-03-08 10:54:06 CET; 6min ago
    >    Process: 22861 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
    >   Main PID: 22868 (code=killed, signal=KILL)
    >
    > Mar 08 10:54:01 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 950
    > Mar 08 10:54:02 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 960
    > Mar 08 10:54:03 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 970
    > Mar 08 10:54:04 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 980
    > Mar 08 10:54:05 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 990
    > Mar 08 10:54:06 host01 pmxcfs[22868]: [dcdb] notice: cpg_join retry 1000
    > Mar 08 10:54:06 host01 systemd[1]: pve-cluster.service stop-sigterm timed out. Killing.
    > Mar 08 10:54:06 host01 systemd[1]: pve-cluster.service: main process exited, code=killed, status=9/KILL
    > Mar 08 10:54:06 host01 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
    > Mar 08 10:54:06 host01 systemd[1]: Unit pve-cluster.service entered failed state.
    >
    > It seems that "[TOTEM ] Failed to receive the leave message. failed: 13" was the problem.
    >
    
    This could indeed indicate multicast problems (see above).
    Did the problems happen instantly after the removal of the node, or with
    a delay of some minutes?
    And how did you remove that node?
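    For reference, if I recall the procedure correctly, the supported way is
    to power the node off permanently and then, from one of the remaining
    nodes, run (NODENAME is a placeholder):

     > pvecm delnode NODENAME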
    
    Just trying to understand your situation here :)
    
    cheers,
    Thomas
    
    
    _______________________________________________
    pve-user mailing list
    pve-user at pve.proxmox.com
    http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
    


