[pve-devel] corosync bug: cluster break after 1 node clean shutdown
Alexandre DERUMIER
aderumier at odiso.com
Mon Sep 7 15:23:26 CEST 2020
Looking at these logs:
Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied
in PVE/HA/Env/PVE2.pm
"
my $ctime = time();
my $last_lock_time = $last->{lock_time} // 0;
my $last_got_lock = $last->{got_lock};
my $retry_timeout = 120; # hardcoded lock lifetime limit from pmxcfs
eval {
    mkdir $lockdir;
    # pve cluster filesystem not online
    die "can't create '$lockdir' (pmxcfs not mounted?)\n" if ! -d $lockdir;
    if (($ctime - $last_lock_time) < $retry_timeout) {
        # try cfs lock update request (utime)
        if (utime(0, $ctime, $filename)) {
            $got_lock = 1;
            return;
        }
        die "cfs lock update failed - $!\n";
    }
"
If retry_timeout is 120, could that explain why I have no logs on the other nodes, given that the watchdog triggers after 60s?
I don't know much about how locks work in pmxcfs, but when a corosync member leaves or joins and a new cluster membership is formed,
could some locks be lost or hang?
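
To make the timing question concrete, here is a minimal sketch (not the actual PVE code). The 120s value is the $retry_timeout from the excerpt, the 60s watchdog delay and the 87s/92s loop durations are the figures reported in this thread, and the helper names lock_branch and watchdog_fired are hypothetical. It also assumes that the time since the last successful lock update is roughly the loop duration, which may not hold exactly.

```perl
use strict;
use warnings;

# Values taken from this thread; the helpers below are hypothetical.
my $retry_timeout    = 120;  # lock lifetime limit quoted from PVE2.pm
my $watchdog_timeout = 60;   # watchdog delay reported in this thread

# Which branch of the quoted "if" would run $elapsed seconds after the
# last successful lock update? 'update' is the utime() path shown in the
# excerpt; 'other' stands for the code path not shown there.
sub lock_branch {
    my ($elapsed) = @_;
    return $elapsed < $retry_timeout ? 'update' : 'other';
}

# Would the watchdog already have expired after $elapsed seconds
# without a timer reset?
sub watchdog_fired {
    my ($elapsed) = @_;
    return $elapsed >= $watchdog_timeout ? 1 : 0;
}

# 87 and 92 are the "loop take too long" durations from the logs.
for my $elapsed (30, 60, 87, 92, 120) {
    printf "%3ds: lock_branch=%s watchdog_fired=%d\n",
        $elapsed, lock_branch($elapsed), watchdog_fired($elapsed);
}
```

Under these assumptions, a loop blocked for 87 or 92 seconds is still inside the 120s window, so it takes the utime() update path (consistent with the "cfs lock update failed" messages), while the 60s watchdog delay has already passed.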
----- Original Message -----
From: "aderumier" <aderumier at odiso.com>
To: "dietmar" <dietmar at proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
Sent: Monday, 7 September 2020 11:32:13
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
>>https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111
>>
>>No HA involved...
I already helped this user a few weeks ago:
https://forum.proxmox.com/threads/proxmox-6-2-4-cluster-die-node-auto-reboot-need-help.74643/#post-333093
HA was activated at that time. (Maybe the watchdog was still running; I'm not sure whether the LRM disables the watchdog if you disable HA on all VMs?)
----- Original Message -----
From: "dietmar" <dietmar at proxmox.com>
To: "aderumier" <aderumier at odiso.com>
Cc: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
Sent: Monday, 7 September 2020 10:18:42
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
There is a similar report in the forum:
https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111
No HA involved...
> On 09/07/2020 9:19 AM Alexandre DERUMIER <aderumier at odiso.com> wrote:
>
>
> >>Indeed, this should not happen. Do you use a separate network for corosync?
>
> No, I use a 2x40Gb LACP link.
>
> >>was there high traffic on the network?
>
> but I'm far from saturating them (in pps or throughput); I'm at around 3-4 Gbps.
>
>
> The cluster has 14 nodes, with around 1000 VMs (with HA enabled on all VMs).
>
>
> From my understanding, watchdog-mux was still running, as the watchdog reset only after 1 min and not 10s,
> so it looks like the LRM was blocked and not sending watchdog timer resets to watchdog-mux.
>
>
> I'll run tests with softdog + soft_noboot=1, so if that happens again, I'll be able to debug.
>
>
>
> >>What kind of maintenance was the reason for the shutdown?
>
> RAM upgrade. (The server was running OK before shutdown, no hardware problem.)
> (I had just shut down the server, and had not started it again yet when the problem occurred.)
>
>
>
> >>Do you use the default corosync timeout values, or do you have a special setup?
>
>
> No special tuning, default values. (I haven't had any retransmits in the logs for months.)
>
> >>Can you please post the full corosync config?
>
> (I have verified: the running version of corosync was 3.0.3, with libknet 1.15.)
>
>
> Here is the config:
>
> "
> logging {
>   debug: off
>   to_syslog: yes
> }
>
> nodelist {
>   node {
>     name: m6kvm1
>     nodeid: 1
>     quorum_votes: 1
>     ring0_addr: m6kvm1
>   }
>   node {
>     name: m6kvm10
>     nodeid: 10
>     quorum_votes: 1
>     ring0_addr: m6kvm10
>   }
>   node {
>     name: m6kvm11
>     nodeid: 11
>     quorum_votes: 1
>     ring0_addr: m6kvm11
>   }
>   node {
>     name: m6kvm12
>     nodeid: 12
>     quorum_votes: 1
>     ring0_addr: m6kvm12
>   }
>   node {
>     name: m6kvm13
>     nodeid: 13
>     quorum_votes: 1
>     ring0_addr: m6kvm13
>   }
>   node {
>     name: m6kvm14
>     nodeid: 14
>     quorum_votes: 1
>     ring0_addr: m6kvm14
>   }
>   node {
>     name: m6kvm2
>     nodeid: 2
>     quorum_votes: 1
>     ring0_addr: m6kvm2
>   }
>   node {
>     name: m6kvm3
>     nodeid: 3
>     quorum_votes: 1
>     ring0_addr: m6kvm3
>   }
>   node {
>     name: m6kvm4
>     nodeid: 4
>     quorum_votes: 1
>     ring0_addr: m6kvm4
>   }
>   node {
>     name: m6kvm5
>     nodeid: 5
>     quorum_votes: 1
>     ring0_addr: m6kvm5
>   }
>   node {
>     name: m6kvm6
>     nodeid: 6
>     quorum_votes: 1
>     ring0_addr: m6kvm6
>   }
>   node {
>     name: m6kvm7
>     nodeid: 7
>     quorum_votes: 1
>     ring0_addr: m6kvm7
>   }
>   node {
>     name: m6kvm8
>     nodeid: 8
>     quorum_votes: 1
>     ring0_addr: m6kvm8
>   }
>   node {
>     name: m6kvm9
>     nodeid: 9
>     quorum_votes: 1
>     ring0_addr: m6kvm9
>   }
> }
>
> quorum {
>   provider: corosync_votequorum
> }
>
> totem {
>   cluster_name: m6kvm
>   config_version: 19
>   interface {
>     bindnetaddr: 10.3.94.89
>     ringnumber: 0
>   }
>   ip_version: ipv4
>   secauth: on
>   transport: knet
>   version: 2
> }
> "
>
>
> ----- Original Message -----
> From: "dietmar" <dietmar at proxmox.com>
> To: "aderumier" <aderumier at odiso.com>, "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
> Cc: "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Sunday, 6 September 2020 14:14:06
> Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
>
> > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
> > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
>
> Indeed, this should not happen. Do you use a separate network for corosync? Or
> was there high traffic on the network? What kind of maintenance was the reason
> for the shutdown?