[pve-devel] corosync bug: cluster break after 1 node clean shutdown

Alexandre DERUMIER aderumier at odiso.com
Mon Sep 14 17:45:05 CEST 2020


>>Did you get in contact with knet/corosync devs about this? 
>>Because it may well be something their stack is better at handling; maybe 
>>there's also really still a bug, or bad behaviour in some edge cases... 

Not yet. I would like to have more info to submit, because right now I'm flying blind.
I have enabled debug logs on all nodes of my cluster, in case it happens again.
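For reference, the debug logging goes through the logging section of corosync.conf;
roughly something like this (just a sketch, see corosync.conf(5) for the exact options,
the existing section will usually contain more settings):

logging {
  to_syslog: yes
  debug: on
}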


BTW,
I have noticed something:

corosync is stopped after syslog stops, so at shutdown we never get any corosync logs.


I have edited corosync.service:

- After=network-online.target
+ After=network-online.target syslog.target


and now it's logging correctly.
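(Side note: instead of editing the shipped unit file, the same change could probably be
done with a drop-in, so it survives package upgrades. Dependency options like After= are
additive in drop-ins, so something like this should be enough:

systemctl edit corosync.service

and in the override file:

[Unit]
After=syslog.target
)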



Now that logging works, I'm also seeing pmxcfs errors while corosync is stopping
(but no pmxcfs shutdown log).

Do you think it would be possible to shut down pmxcfs cleanly first, before stopping corosync?
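(Just thinking out loud: systemd stops units in the reverse of their start ordering, so if
pve-cluster.service (pmxcfs) were ordered After=corosync.service, it would be stopped before
corosync at shutdown. Only a sketch, I have not checked whether inverting the start ordering
is actually safe for pmxcfs:

systemctl edit pve-cluster.service

[Unit]
After=corosync.service
)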


"
Sep 14 17:23:49 pve corosync[1346]:   [MAIN  ] Node was shut down by a signal
Sep 14 17:23:49 pve systemd[1]: Stopping Corosync Cluster Engine...
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Unloading all Corosync service engines.
Sep 14 17:23:49 pve corosync[1346]:   [QB    ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Sep 14 17:23:49 pve pmxcfs[1132]: [confdb] crit: cmap_dispatch failed: 2
Sep 14 17:23:49 pve corosync[1346]:   [QB    ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Service engine unloaded: corosync configuration map access
Sep 14 17:23:49 pve corosync[1346]:   [QB    ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Service engine unloaded: corosync configuration service
Sep 14 17:23:49 pve pmxcfs[1132]: [status] crit: cpg_dispatch failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [status] crit: cpg_leave failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: cpg_dispatch failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: cpg_leave failed: 2
Sep 14 17:23:49 pve corosync[1346]:   [QB    ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Sep 14 17:23:49 pve pmxcfs[1132]: [quorum] crit: quorum_dispatch failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [status] notice: node lost quorum
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Service engine unloaded: corosync profile loading service
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Service engine unloaded: corosync resource monitoring service
Sep 14 17:23:49 pve corosync[1346]:   [SERV  ] Service engine unloaded: corosync watchdog service
Sep 14 17:23:49 pve pmxcfs[1132]: [quorum] crit: quorum_initialize failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [quorum] crit: can't initialize service
Sep 14 17:23:49 pve pmxcfs[1132]: [confdb] crit: cmap_initialize failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [confdb] crit: can't initialize service
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] notice: start cluster connection
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: cpg_initialize failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: can't initialize service
Sep 14 17:23:49 pve pmxcfs[1132]: [status] notice: start cluster connection
Sep 14 17:23:49 pve pmxcfs[1132]: [status] crit: cpg_initialize failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [status] crit: can't initialize service
Sep 14 17:23:50 pve corosync[1346]:   [MAIN  ] Corosync Cluster Engine exiting normally
"



----- Original Message -----
From: "Thomas Lamprecht" <t.lamprecht at proxmox.com>
To: "Proxmox VE development discussion" <pve-devel at lists.proxmox.com>, "aderumier" <aderumier at odiso.com>, "dietmar" <dietmar at proxmox.com>
Sent: Monday, 14 September 2020 10:51:03
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/14/20 10:27 AM, Alexandre DERUMIER wrote: 
>> I wonder if something like pacemaker sbd could be implemented in proxmox as an extra layer of protection? 
> 
>>> AFAIK Thomas already has patches to implement active fencing. 
> 
>>> But IMHO this will not solve the corosync problems.. 
> 
> Yes, sure. I'd really like to have two different sources of verification, using different paths/software, to avoid this kind of bug. 
> (shit happens, Murphy's law ;) 

You would then need at least three, and if one has a bug flooding the network, in a lot of 
setups (not having beefy switches like you ;) the other two will be taken down as well, as 
either the memory or the system stack gets overloaded. 

> 
> as we say in French "ceinture & bretelles" -> "belt and braces" 
> 
> 
> BTW, 
> a user has reported a new corosync problem here: 
> https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871 
> (Sounds like the bug I had 6 months ago, with corosync flooding a lot of UDP packets, but not the same bug I have here) 

Did you get in contact with knet/corosync devs about this? 

Because it may well be something their stack is better at handling; maybe 
there's also really still a bug, or bad behaviour in some edge cases... 




