[pve-devel] corosync bug: cluster break after 1 node clean shutdown
Alexandre DERUMIER
aderumier at odiso.com
Mon Sep 7 09:19:40 CEST 2020
>>Indeed, this should not happen. Do you use a separate network for corosync?
No, I use a 2x40Gb LACP link.
>>was there high traffic on the network?
But I'm far from saturating them, in pps or throughput (I'm around 3-4 Gbps).
The cluster has 14 nodes, with around 1000 VMs (with HA enabled on all VMs).
From my understanding, watchdog-mux was still running, as the watchdog reset the node only after 1 min and not 10 s,
so it looks like the LRM was blocked and not sending watchdog timer resets to watchdog-mux.
I'll do tests with softdog + soft_noboot=1, so if that happens again, I'll be able to debug.
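To spell out that reasoning, here is a rough sketch. It is not taken from the watchdog-mux source: the function name is made up, and the 10 s and 1 min figures are simply the ones I mention above, assuming a dead watchdog-mux leads to a reset after about 10 s while a client that stops updating leads to a reset after about 1 min.

```python
# Hypothetical helper for illustration only. The two timeouts are the figures
# quoted in this message, not values read from the watchdog-mux source.
MUX_DEAD_RESET_S = 10     # assumed reset delay if watchdog-mux itself stops
CLIENT_STALL_RESET_S = 60 # assumed reset delay if a client (e.g. the LRM) stops updating


def likely_stalled_component(seconds_until_reset: float) -> str:
    """Guess which component stalled from the delay observed before the reset."""
    if seconds_until_reset < CLIENT_STALL_RESET_S:
        return "watchdog-mux"
    return "watchdog-mux client (e.g. pve-ha-lrm)"


# The reset came after about 1 min, which points at the client side.
print(likely_stalled_component(60))
```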
>>What kind of maintenance was the reason for the shutdown?
RAM upgrade. (The server was running fine before the shutdown, no hardware problem.)
(I had just shut down the server and had not started it again yet when the problem occurred.)
>>Do you use the default corosync timeout values, or do you have a special setup?
No special tuning, default values. (I haven't had any retransmits in the logs for months.)
>>Can you please post the full corosync config?
(I have verified: the running version was corosync 3.0.3 with libknet 1.15.)
Here is the config:
"
logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: m6kvm1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: m6kvm1
  }
  node {
    name: m6kvm10
    nodeid: 10
    quorum_votes: 1
    ring0_addr: m6kvm10
  }
  node {
    name: m6kvm11
    nodeid: 11
    quorum_votes: 1
    ring0_addr: m6kvm11
  }
  node {
    name: m6kvm12
    nodeid: 12
    quorum_votes: 1
    ring0_addr: m6kvm12
  }
  node {
    name: m6kvm13
    nodeid: 13
    quorum_votes: 1
    ring0_addr: m6kvm13
  }
  node {
    name: m6kvm14
    nodeid: 14
    quorum_votes: 1
    ring0_addr: m6kvm14
  }
  node {
    name: m6kvm2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: m6kvm2
  }
  node {
    name: m6kvm3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: m6kvm3
  }
  node {
    name: m6kvm4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: m6kvm4
  }
  node {
    name: m6kvm5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: m6kvm5
  }
  node {
    name: m6kvm6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: m6kvm6
  }
  node {
    name: m6kvm7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: m6kvm7
  }
  node {
    name: m6kvm8
    nodeid: 8
    quorum_votes: 1
    ring0_addr: m6kvm8
  }
  node {
    name: m6kvm9
    nodeid: 9
    quorum_votes: 1
    ring0_addr: m6kvm9
  }
}
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: m6kvm
  config_version: 19
  interface {
    bindnetaddr: 10.3.94.89
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  transport: knet
  version: 2
}
"
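For reference, a minimal sketch of the vote arithmetic for this config. It assumes votequorum's plain majority rule (more than half of the expected votes) with no special options such as two_node or last_man_standing, which matches the config above as far as I can tell. The function name is hypothetical.

```python
def quorum_threshold(votes_per_node: list[int]) -> int:
    """Votes needed for quorum under a simple majority rule (assumed here)."""
    expected_votes = sum(votes_per_node)
    return expected_votes // 2 + 1


# 14 nodes with quorum_votes: 1 each, as in the nodelist above.
votes = [1] * 14
needed = quorum_threshold(votes)

print(needed)                    # 8
print(sum(votes) - 1 >= needed)  # True: 13 votes remain after one node shuts down
```

So by vote count alone, a clean shutdown of one node leaves the remaining 13 nodes well above the threshold.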
----- Original Message -----
From: "dietmar" <dietmar at proxmox.com>
To: "aderumier" <aderumier at odiso.com>, "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
Cc: "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Sunday, 6 September 2020 14:14:06
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
> Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
> Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
Indeed, this should not happen. Do you use a separate network for corosync? Or
was there high traffic on the network? What kind of maintenance was the reason
for the shutdown?