[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Alexandre DERUMIER aderumier at odiso.com
Tue Sep 15 14:49:54 CEST 2020


Hi,

I have reproduced it again.

Now I can't write to /etc/pve/ from any node.


I have also added some debug logs to pve-ha-lrm, and it was stuck in the block below
(but if /etc/pve is locked, this is normal):

        if ($fence_request) {
            $haenv->log('err', "node need to be fenced - releasing agent_lock\n");
            $self->set_local_status({ state => 'lost_agent_lock'});
        } elsif (!$self->get_protected_ha_agent_lock()) {
            $self->set_local_status({ state => 'lost_agent_lock'});
        } elsif ($self->{mode} eq 'maintenance') {
            $self->set_local_status({ state => 'maintenance'});
        }


corosync quorum is currently OK.

I'm currently digging through the logs.

----- Original Message -----
From: "aderumier" <aderumier at odiso.com>
To: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
Cc: "Thomas Lamprecht" <t.lamprecht at proxmox.com>
Sent: Tuesday, 15 September 2020 13:04:31
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Also, here are the logs of node14, where the lrm loop did not take too long:

https://gist.github.com/aderumier/a2e2d6afc7e04646c923ae6f37cb6c2d 


----- Original Message -----
From: "aderumier" <aderumier at odiso.com>
To: "Thomas Lamprecht" <t.lamprecht at proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
Sent: Tuesday, 15 September 2020 12:15:47
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Here are the logs from the previous restart:

node1 -> corosync restart at 10:46:15 
----- 
https://gist.github.com/aderumier/0992051d20f51270ceceb5b3431d18d7 


node2 
----- 
https://gist.github.com/aderumier/eea0c50fefc1d8561868576f417191ba 



node5 
------ 
https://gist.github.com/aderumier/f2ce1bc5a93827045a5691583bbc7a37 

----- Original Message -----
From: "Thomas Lamprecht" <t.lamprecht at proxmox.com>
To: "aderumier" <aderumier at odiso.com>, "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>
Cc: "dietmar" <dietmar at proxmox.com>
Sent: Tuesday, 15 September 2020 11:46:51
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/15/20 11:35 AM, Alexandre DERUMIER wrote: 
> Hi, 
> 
> I have finally reproduced it!
>
> But this is with a corosync restart from cron every minute, on node1.
>
> Then: the lrm was stuck for around 60s, and softdog was triggered on multiple other nodes.
>
> Here are the logs with full corosync debug at the time of the last corosync restart.
>
> node1 (where corosync is restarted every minute)
> https://gist.github.com/aderumier/c4f192fbce8e96759f91a61906db514e 
> 
> node2 
> https://gist.github.com/aderumier/2d35ea05c1fbff163652e564fc430e67 
> 
> node5 
> https://gist.github.com/aderumier/df1d91cddbb6e15bb0d0193ed8df9273 
> 
> I'll prepare logs from the previous corosync restart, as the lrm already seems to be stuck before that.

Yeah that would be good, as yes the lrm seems to get stuck at around 10:46:21 

> Sep 15 10:47:26 m6kvm2 pve-ha-lrm[3736]: loop take too long (65 seconds) 
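
For anyone who wants to scan their own node logs for the same symptom, here is a minimal sketch. It assumes syslog-style lines in the same format as the pve-ha-lrm message quoted above; the function name `find_slow_lrm_loops` and the regular expression are illustrative and not part of any Proxmox tool.

```python
import re

# Matches lines shaped like the one quoted in this thread:
#   Sep 15 10:47:26 m6kvm2 pve-ha-lrm[3736]: loop take too long (65 seconds)
# The pattern is an assumption based on that single sample line.
PATTERN = re.compile(
    r"(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"pve-ha-lrm\[\d+\]:\s+loop take too long \((?P<secs>\d+) seconds\)"
)


def find_slow_lrm_loops(lines):
    """Return (timestamp, host, seconds) for every 'loop take too long' line."""
    hits = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            hits.append(
                (match.group("stamp"), match.group("host"), int(match.group("secs")))
            )
    return hits


if __name__ == "__main__":
    sample = [
        "Sep 15 10:47:26 m6kvm2 pve-ha-lrm[3736]: loop take too long (65 seconds)",
    ]
    for stamp, host, secs in find_slow_lrm_loops(sample):
        print(f"{host}: lrm loop took {secs}s, logged at {stamp}")
```

Feeding it the lines of a saved log file (for example `open(path)` on an exported journal) gives a quick per-node list of when the lrm loop overran, which can then be compared against the corosync restart times.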


_______________________________________________ 
pve-devel mailing list 
pve-devel at lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 

