[pve-devel] corosync problems - need help

Sun Sep 14 16:11:58 CEST 2014

Note that the corosync layer seems to be fine.

When cman starts on the faulty node, I see the member join in corosync.log on the other nodes.

Then these messages start:

Sep 14 15:49:47 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 10
Sep 14 15:49:48 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 20
Sep 14 15:49:49 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 30
Sep 14 15:49:50 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 40
Sep 14 15:49:51 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 50
Sep 14 15:49:52 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 60
Sep 14 15:49:53 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 70
Sep 14 15:49:54 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 80
Sep 14 15:49:55 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 90
Sep 14 15:49:56 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 100

- then, after killing corosync on the faulty node, it works again:

Sep 14 15:49:56 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retried 100 times

The message seems to come from:
data/src/dfsm.c

        result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len);
        if (retry && result == CPG_ERR_TRY_AGAIN) {
                nanosleep(&tvreq, NULL);
                ++retries;
                if ((retries % 10) == 0)
                        cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries);
                if (retries < 100)
                        goto loop;
        }

        if (retries)
                cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries);
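For readers who want to try the retry behaviour outside pmxcfs, here is a minimal, self-contained sketch of the same bounded-retry pattern. It is not the pmxcfs code: `send_with_retry`, `fake_send`, the `send_result` enum and the 1 ms delay are hypothetical stand-ins chosen for illustration, and the real code calls `cpg_mcast_joined` and logs through `cfs_dom_message`. Only the control flow (sleep, count, log every 10th retry, give up after 100) mirrors the excerpt above.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Hypothetical result codes standing in for the CPG ones. */
enum send_result { SEND_OK = 0, SEND_TRY_AGAIN = 1 };

/* Hypothetical send callback: returns SEND_TRY_AGAIN while the transport is busy. */
typedef enum send_result (*send_fn)(void *ctx);

#define MAX_RETRIES 100

/* Calls send() until it stops returning SEND_TRY_AGAIN or MAX_RETRIES is
 * reached. Stores the retry count in *retries_out, returns the last result. */
static enum send_result send_with_retry(send_fn send, void *ctx, int *retries_out)
{
    /* 1 ms between attempts: an arbitrary value for this sketch. */
    struct timespec delay = { .tv_sec = 0, .tv_nsec = 1000000 };
    int retries = 0;
    enum send_result result;

    for (;;) {
        result = send(ctx);
        if (result != SEND_TRY_AGAIN)
            break;
        nanosleep(&delay, NULL);
        ++retries;
        if ((retries % 10) == 0)
            fprintf(stderr, "cpg_send_message retry %d\n", retries);
        if (retries >= MAX_RETRIES)
            break; /* give up; the caller sees SEND_TRY_AGAIN */
    }

    if (retries)
        fprintf(stderr, "cpg_send_message retried %d times\n", retries);
    *retries_out = retries;
    return result;
}

/* Hypothetical fake transport: busy for the first *ctx calls, then succeeds. */
static enum send_result fake_send(void *ctx)
{
    int *busy_calls = ctx;
    if (*busy_calls > 0) {
        --*busy_calls;
        return SEND_TRY_AGAIN;
    }
    return SEND_OK;
}
```

Calling `send_with_retry(fake_send, &busy, &retries)` with `busy` larger than 100 follows the same path as the log above: a "retry N" line every 10 attempts, then a final "retried 100 times" line once the limit is hit.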

----- Original Message -----

From: "Alexandre DERUMIER" <aderumier at odiso.com>
To: "Dietmar Maurer" <dietmar at proxmox.com>
Cc: pve-devel at pve.proxmox.com
Sent: Sunday, 14 September 2014 15:41:26
Subject: Re: [pve-devel] corosync problems - need help

>>I am curious - you have done that on all nodes, or only on the failing 2 nodes?

Yes, I need to do it on all nodes. 

I have done more investigation, and now I can reproduce the problem 100% of the time.

The problem seems to come from a specific node: kvm11

When I start cman on this node, I get:

pmxcfs[31484]: [status] notice: cpg_send_message retry XX

on all other nodes.

It has the same hardware as the other nodes, so I need to check the network layer.

On the faulty node, I also see some pmxcfs segfaults in dmesg:

[976776.602200] pmxcfs[3130]: segfault at 7ff1dcadef08 ip 00007ff1dcadef08 sp 00007fffd89cfe68 error 15 
[977517.260211] pmxcfs[4947]: segfault at 1956b00 ip 0000000001956b00 sp 00007ffff3b109e8 error 15 
[980494.722550] pmxcfs[15205]: segfault at 7f712457ef08 ip 00007f712457ef08 sp 00007fff4a916668 error 15 
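As an aid for reading those lines, the sketch below decodes the `error` field. It assumes an x86 kernel that prints the page-fault error code in hexadecimal (as far as I know, mainline x86 Linux does), so `error 15` would be 0x15. The bit meanings are the x86 page-fault error code bits; the macro names and the `decode_pf_error` helper are hypothetical names for this example only.

```c
#include <stdio.h>
#include <stdlib.h>

/* x86 page-fault error code bits (names here are local to this sketch). */
#define PF_PROT  (1UL << 0) /* 0: page not present, 1: protection violation */
#define PF_WRITE (1UL << 1) /* 0: read access, 1: write access */
#define PF_USER  (1UL << 2) /* 0: kernel mode, 1: user mode */
#define PF_RSVD  (1UL << 3) /* reserved bit set in a page-table entry */
#define PF_INSTR (1UL << 4) /* fault happened on an instruction fetch */

/* Parses the text after "error " as hex, writes a description into buf,
 * and returns the numeric code. */
static unsigned long decode_pf_error(const char *code_hex, char *buf, size_t len)
{
    unsigned long code = strtoul(code_hex, NULL, 16);

    snprintf(buf, len, "%s, %s, %s mode%s%s",
             (code & PF_PROT) ? "protection violation" : "page not present",
             (code & PF_WRITE) ? "write" : "read",
             (code & PF_USER) ? "user" : "kernel",
             (code & PF_RSVD) ? ", reserved bit set" : "",
             (code & PF_INSTR) ? ", instruction fetch" : "");
    return code;
}
```

Under that assumption, `decode_pf_error("15", buf, sizeof buf)` describes a user-mode instruction fetch that hit a protection violation. That would be consistent with the fault address being equal to `ip` in all three lines, i.e. the process jumped to an address that is mapped but not executable.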

----- Original Message -----

From: "Dietmar Maurer" <dietmar at proxmox.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: pve-devel at pve.proxmox.com
Sent: Sunday, 14 September 2014 12:53:45
Subject: RE: [pve-devel] corosync problems - need help

> OK, I finally solved it:
> 
> kill -9 dlm_controld 
> kill -9 corosync -f 
> 
> and service cman start 
> 
> 
> Now all is working fine again. 

I am curious - you have done that on all nodes, or only on the failing 2 nodes?
_______________________________________________ 
pve-devel mailing list 
pve-devel at pve.proxmox.com 
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel