[pve-devel] corosync bug: cluster break after 1 node clean shutdown
dietmar
dietmar at proxmox.com
Mon Sep 7 10:18:42 CEST 2020
There is a similar report in the forum:
https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111
No HA involved...
> On 09/07/2020 9:19 AM Alexandre DERUMIER <aderumier at odiso.com> wrote:
>
>
> >>Indeed, this should not happen. Do you use a separate network for corosync?
>
> No, I use a 2x 40Gb LACP link.
>
> >>was there high traffic on the network?
>
> But I'm far from saturating them (neither in pps nor in throughput); I'm around 3-4 Gbps.
>
>
> The cluster is 14 nodes, with around 1000 VMs (with HA enabled on all VMs).
>
>
> From my understanding, watchdog-mux was still running, since the watchdog reset the node only after 1 min and not after 10 s,
> so it looks like the LRM was blocked and not sending watchdog timer resets to watchdog-mux.
>
>
> I'll do tests with softdog + soft_noboot=1, so if that happens again, I'll be able to debug.
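(As an aside, here is a minimal sketch of what such a softdog test setup could look like, assuming the standard kernel softdog module parameters soft_noboot and soft_margin; the file path and margin value are only illustrations, not something taken from this thread:

# /etc/modprobe.d/softdog-debug.conf (illustrative path)
# soft_noboot=1: on expiry the softdog only logs instead of resetting the node,
#                so a blocked lrm/watchdog-mux can still be inspected afterwards
# soft_margin:   watchdog timeout in seconds
options softdog soft_noboot=1 soft_margin=60

The module has to be reloaded, or the node rebooted, for the options to take effect.)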
>
>
>
> >>What kind of maintenance was the reason for the shutdown?
>
> RAM upgrade. (The server was running fine before the shutdown, no hardware problem.)
> (I had just shut down the server and had not started it again yet when the problem occurred.)
>
>
>
> >>Do you use the default corosync timeout values, or do you have a special setup?
>
>
> No special tuning, default values. (I haven't had any retransmits in the logs for months.)
>
> >>Can you please post the full corosync config?
>
> (I have verified it: the running corosync version was 3.0.3 with libknet 1.15.)
>
>
> Here is the config:
>
> "
> logging {
>   debug: off
>   to_syslog: yes
> }
>
> nodelist {
>   node {
>     name: m6kvm1
>     nodeid: 1
>     quorum_votes: 1
>     ring0_addr: m6kvm1
>   }
>   node {
>     name: m6kvm10
>     nodeid: 10
>     quorum_votes: 1
>     ring0_addr: m6kvm10
>   }
>   node {
>     name: m6kvm11
>     nodeid: 11
>     quorum_votes: 1
>     ring0_addr: m6kvm11
>   }
>   node {
>     name: m6kvm12
>     nodeid: 12
>     quorum_votes: 1
>     ring0_addr: m6kvm12
>   }
>   node {
>     name: m6kvm13
>     nodeid: 13
>     quorum_votes: 1
>     ring0_addr: m6kvm13
>   }
>   node {
>     name: m6kvm14
>     nodeid: 14
>     quorum_votes: 1
>     ring0_addr: m6kvm14
>   }
>   node {
>     name: m6kvm2
>     nodeid: 2
>     quorum_votes: 1
>     ring0_addr: m6kvm2
>   }
>   node {
>     name: m6kvm3
>     nodeid: 3
>     quorum_votes: 1
>     ring0_addr: m6kvm3
>   }
>   node {
>     name: m6kvm4
>     nodeid: 4
>     quorum_votes: 1
>     ring0_addr: m6kvm4
>   }
>   node {
>     name: m6kvm5
>     nodeid: 5
>     quorum_votes: 1
>     ring0_addr: m6kvm5
>   }
>   node {
>     name: m6kvm6
>     nodeid: 6
>     quorum_votes: 1
>     ring0_addr: m6kvm6
>   }
>   node {
>     name: m6kvm7
>     nodeid: 7
>     quorum_votes: 1
>     ring0_addr: m6kvm7
>   }
>   node {
>     name: m6kvm8
>     nodeid: 8
>     quorum_votes: 1
>     ring0_addr: m6kvm8
>   }
>   node {
>     name: m6kvm9
>     nodeid: 9
>     quorum_votes: 1
>     ring0_addr: m6kvm9
>   }
> }
>
> quorum {
>   provider: corosync_votequorum
> }
>
> totem {
>   cluster_name: m6kvm
>   config_version: 19
>   interface {
>     bindnetaddr: 10.3.94.89
>     ringnumber: 0
>   }
>   ip_version: ipv4
>   secauth: on
>   transport: knet
>   version: 2
> }
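(For context on the "default values" answer above: the totem section sets no token or timeout related options, so the corosync 3 defaults apply, and corosync additionally scales the effective token timeout with the node count via token_coefficient, so a 14-node cluster already runs with a larger timeout than the base value. Purely as an illustration of where such tuning would go if it were ever wanted, with example values that are neither a recommendation nor part of the setup discussed here:

totem {
  # ... existing settings unchanged ...
  # illustrative only: explicit token timeout (ms) and retransmit count
  token: 3000
  token_retransmits_before_loss_const: 10
}

Also worth noting: with corosync_votequorum and 14 votes, quorum is floor(14/2) + 1 = 8, so one node shutting down cleanly leaves 13 votes and should not cost quorum by itself.)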
>
>
>
> ----- Original message -----
> From: "dietmar" <dietmar at proxmox.com>
> To: "aderumier" <aderumier at odiso.com>, "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
> Cc: "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Sunday, 6 September 2020 14:14:06
> Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
>
> > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
> > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
>
> Indeed, this should not happen. Do you use a separate network for corosync? Or
> was there high traffic on the network? What kind of maintenance was the reason
> for the shutdown?