[pve-devel] corosync problems - need help
Alexandre DERUMIER
aderumier at odiso.com
Sun Sep 14 16:11:58 CEST 2014
Note that the corosync layer seems to be fine:
when cman starts on the faulty node,
I see the member join in corosync.log on the other nodes,
and then these messages start:
Sep 14 15:49:47 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 10
Sep 14 15:49:48 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 20
Sep 14 15:49:49 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 30
Sep 14 15:49:50 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 40
Sep 14 15:49:51 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 50
Sep 14 15:49:52 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 60
Sep 14 15:49:53 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 70
Sep 14 15:49:54 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 80
Sep 14 15:49:55 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 90
Sep 14 15:49:56 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 100
- then, after killing corosync on the faulty node,
it works again:
Sep 14 15:49:56 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retried 100 times
The message seems to come from:
data/src/dfsm.c
result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len);
if (retry && result == CPG_ERR_TRY_AGAIN) {
    nanosleep(&tvreq, NULL);
    ++retries;
    if ((retries % 10) == 0)
        cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries);
    if (retries < 100)
        goto loop;
}
if (retries)
    cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries);
----- Original Message -----
From: "Alexandre DERUMIER" <aderumier at odiso.com>
To: "Dietmar Maurer" <dietmar at proxmox.com>
Cc: pve-devel at pve.proxmox.com
Sent: Sunday 14 September 2014 15:41:26
Subject: Re: [pve-devel] corosync problems - need help
>>I am curious - you have done that on all nodes, or only on the failing 2 nodes?
Yes, I had to do it on all nodes.
I have done more investigation, and now I can reproduce the problem 100% of the time.
The problem seems to come from a specific node: kvm11
When I start cman on this node,
I get:
pmxcfs[31484]: [status] notice: cpg_send_message retry XX
on all other nodes.
It has the same hardware as the other nodes, so I need to check the network layer.
On the faulty node, I also see some pmxcfs segfaults in dmesg:
[976776.602200] pmxcfs[3130]: segfault at 7ff1dcadef08 ip 00007ff1dcadef08 sp 00007fffd89cfe68 error 15
[977517.260211] pmxcfs[4947]: segfault at 1956b00 ip 0000000001956b00 sp 00007ffff3b109e8 error 15
[980494.722550] pmxcfs[15205]: segfault at 7f712457ef08 ip 00007f712457ef08 sp 00007fff4a916668 error 15
----- Original Message -----
From: "Dietmar Maurer" <dietmar at proxmox.com>
To: "Alexandre DERUMIER" <aderumier at odiso.com>
Cc: pve-devel at pve.proxmox.com
Sent: Sunday 14 September 2014 12:53:45
Subject: RE: [pve-devel] corosync problems - need help
> Ok, I finally solved it,
>
> kill -9 dlm_controld
> kill -9 corosync -f
>
> and service cman start
>
>
> Now all is working fine again.
I am curious - you have done that on all nodes, or only on the failing 2 nodes?
_______________________________________________
pve-devel mailing list
pve-devel at pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel