[PVE-User] Unreliable

mail at steffenwagner.com
Tue Mar 12 19:40:00 CET 2013


Hi,

> can you post your /etc/network/interfaces ?

root at kh-proxmox1:~# cat /etc/network/interfaces
# network interface settings
auto lo
iface lo inet loopback

iface eth0 inet manual

iface eth1 inet manual

auto vmbr0
iface vmbr0 inet static
         address  172.16.70.214
         netmask  255.255.255.0
         gateway  172.16.70.1
         bridge_ports eth0
         bridge_stp off
         bridge_fd 0

auto vmbr1
iface vmbr1 inet static
         address  172.16.60.214
         netmask  255.255.255.0
         bridge_ports eth1
         bridge_stp off
         bridge_fd 0

Here vmbr0 carries the rest of the network (and the host traffic), and vmbr1 
is connected only to the nodes and the storage (the storages are attached via this NIC).
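
So, in short:

   eth0 -> vmbr0 (172.16.70.214): LAN, VM and host/cluster traffic
   eth1 -> vmbr1 (172.16.60.214): nodes + iSCSI storage only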

> What you can try:

> 1) update to the latest pve-kernel from the pvetest repository
>
> 2) or try not to put your Proxmox host IP on vmbr0, but directly on ethX
>
> 3) or, on your Cisco switch, disable IGMP snooping:
>    Switch# conf t
>    Switch(config)# no ip igmp snooping
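
For option 2 I'm not sure how that would work for vmbr0, since eth0 also has 
to carry the bridged VM traffic; but no guest is attached to vmbr1, so as a 
first step I could drop that bridge and put its address directly on eth1. A 
sketch of the changed /etc/network/interfaces stanza (untested, and assuming 
nothing else depends on vmbr1):

auto eth1
iface eth1 inet static
         address  172.16.60.214
         netmask  255.255.255.0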

Did any of the above help you? How did you solve the problem? I'll 
give it a try anyway, late at night :-)
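
Before changing anything I'll also check the querier you mention further down 
for options 1 and 2; if I read the IOS command reference for 12.2(55)SE 
correctly, that should be something like this (please correct me if the 
2960-S differs):

   Switch# show ip igmp snooping querier
   Switch# conf t
   Switch(config)# ip igmp snooping querier

(the first command shows whether a querier is already active per VLAN, the 
last one enables the switch's own querier)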


Thanks,
Steffen

On 12.03.2013 19:31, Alexandre DERUMIER wrote:
>>> yes, there were problems with corosync and cman. I can remember that....
>>> something like "Member left membership"... blah blah.
>>>
>>> The Cisco switch used is:
>>>
>>> Catalyst 2960-S series
> no luck for you: I use a Cisco 2960G and I have noticed these problems too.
>
> What you can try:
>
> 1) update to the latest pve-kernel from the pvetest repository
>
> 2) or try not to put your Proxmox host IP on vmbr0, but directly on ethX
>
> 3) or, on your Cisco switch, disable IGMP snooping:
>    Switch# conf t
>    Switch(config)# no ip igmp snooping
>
>
> For 1 and 2, verify that your Cisco switch has "ip igmp snooping querier" enabled.
>
> The problem is that the current Red Hat kernel sends IGMP queries from the Linux bridge to the network, which conflicts with Cisco switches.
> This behaviour was changed recently in the 3.5 kernel, but not in the Red Hat kernel, so we have patched it.
>
> I hope it'll resolve your problems :)
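
(Side note: if updating the kernel is not an option for me right away, could 
the bridge's own snooping/querying be switched off as a workaround? I believe 
the RHEL-based kernel exposes it as a sysfs knob, but this is untested on my 
side:

   echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping
)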
>
>
>
>
>
>>> http://forum.proxmox.com/threads/10755-Constantly-Losing-Quorum
>>>
>>> Could the problem also be the same problem that is described in the
>>> last post? The nodes are connected to the iSCSI storage (QNAP NAS) through
>>> the Cisco switch on VLANX. The other NIC is used to bridge the VMs and
>>> connect them and the hosts to the rest of the network (so "host"
>>> traffic also goes through this VLAN)...
> can you post your /etc/network/interfaces ?
>
> ----- Original Message -----
>
> From: mail at steffenwagner.com
> To: "Alexandre DERUMIER" <aderumier at odiso.com>
> Cc: pve-user at pve.proxmox.com
> Sent: Tuesday, 12 March 2013 18:58:31
> Subject: Re: [PVE-User] Unreliable
>
> Hi,
>
> yes, there were problems with corosync and cman. I can remember that....
> something like "Member left membership"... blah blah.
>
> The Cisco switch used is:
>
> Catalyst 2960-S series
> Product ID: WS-C2960S-24TS-S
> Version ID: V02
> Software: 12.2(55)SE3
>
> Here are some of the logs I found (daemon.log):
>
> Feb 12 14:49:57 kh-proxmox2 pmxcfs[1529]: [quorum] crit: quorum_dispatch
> failed: 2
> Feb 12 14:49:57 kh-proxmox2 pmxcfs[1529]: [libqb] warning:
> epoll_ctl(del): Bad file descriptor (9)
> Feb 12 14:49:57 kh-proxmox2 pmxcfs[1529]: [confdb] crit: confdb_dispatch
> failed: 2
> Feb 12 14:49:59 kh-proxmox2 pmxcfs[1529]: [libqb] warning:
> epoll_ctl(del): Bad file descriptor (9)
> Feb 12 14:49:59 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_dispatch
> failed: 2
> Feb 12 14:50:01 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_leave failed: 2
> Feb 12 14:50:03 kh-proxmox2 pmxcfs[1529]: [libqb] warning:
> epoll_ctl(del): Bad file descriptor (9)
> Feb 12 14:50:03 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_dispatch
> failed: 2
> Feb 12 14:50:04 kh-proxmox2 pmxcfs[1529]: [status] crit:
> cpg_send_message failed: 2
> Feb 12 14:50:04 kh-proxmox2 pmxcfs[1529]: [status] crit:
> cpg_send_message failed: 2
> Feb 12 14:50:06 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_leave failed: 2
> Feb 12 14:50:08 kh-proxmox2 pmxcfs[1529]: [status] crit:
> cpg_send_message failed: 2
> Feb 12 14:50:08 kh-proxmox2 pmxcfs[1529]: [status] crit:
> cpg_send_message failed: 2
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [libqb] warning:
> epoll_ctl(del): Bad file descriptor (9)
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit:
> quorum_initialize failed: 6
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't
> initialize service
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [confdb] crit:
> confdb_initialize failed: 6
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't
> initialize service
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [dcdb] notice: start cluster
> connection
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_initialize
> failed: 6
> Feb 12 14:50:10 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't
> initialize service
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit:
> cpg_send_message failed: 2
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit:
> cpg_send_message failed: 2
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [dcdb] notice: start cluster
> connection
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [dcdb] crit: cpg_initialize
> failed: 6
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [quorum] crit: can't
> initialize service
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit:
> cpg_send_message failed: 9
> Feb 12 14:50:12 kh-proxmox2 pmxcfs[1529]: [status] crit:
> cpg_send_message failed: 9
>
>
> And then it continues with that last line for thousands of lines... (meaning
> the node lost quorum in the cluster.)
>
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members:
> 1/1579, 2/1535
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: starting data
> syncronisation
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members:
> 1/1579, 2/1535
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: starting data
> syncronisation
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members:
> 1/1579, 2/1535, 3/398566
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: members:
> 1/1579, 2/1535, 3/398566
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received sync
> request (epoch 1/1579/0000000A)
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received sync
> request (epoch 1/1579/0000000A)
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received all
> states
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: leader is 1/1579
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: synced members:
> 1/1579, 2/1535, 3/398566
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: all data is up
> to date
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: received all
> states
> Feb 12 16:06:43 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: all data is up
> to date
> Feb 12 16:06:46 kh-proxmox2 pmxcfs[1535]: [quorum] crit: quorum_dispatch
> failed: 2
> Feb 12 16:06:46 kh-proxmox2 pmxcfs[1535]: [libqb] warning:
> epoll_ctl(del): Bad file descriptor (9)
> Feb 12 16:06:46 kh-proxmox2 pmxcfs[1535]: [confdb] crit: confdb_dispatch
> failed: 2
> Feb 12 16:06:48 kh-proxmox2 pmxcfs[1535]: [libqb] warning:
> epoll_ctl(del): Bad file descriptor (9)
> Feb 12 16:06:48 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_dispatch
> failed: 2
> Feb 12 16:06:50 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_leave failed: 2
> Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [status] crit:
> cpg_send_message failed: 2
> Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [status] crit:
> cpg_send_message failed: 2
> Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [libqb] warning:
> epoll_ctl(del): Bad file descriptor (9)
> Feb 12 16:06:52 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_dispatch
> failed: 2
> Feb 12 16:06:54 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_leave failed: 2
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [libqb] warning:
> epoll_ctl(del): Bad file descriptor (9)
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit:
> quorum_initialize failed: 6
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't
> initialize service
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [confdb] crit:
> confdb_initialize failed: 6
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't
> initialize service
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [dcdb] notice: start cluster
> connection
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_initialize
> failed: 6
> Feb 12 16:06:56 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't
> initialize service
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit:
> cpg_send_message failed: 2
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit:
> cpg_send_message failed: 2
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [dcdb] crit: cpg_initialize
> failed: 6
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [quorum] crit: can't
> initialize service
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit:
> cpg_send_message failed: 9
> Feb 12 16:06:58 kh-proxmox2 pmxcfs[1535]: [status] crit:
> cpg_send_message failed: 9
>
> Again you can see that at first all nodes are synced, and then the node loses quorum again.
>
> It's like what is described in this post:
>
> http://forum.proxmox.com/threads/10755-Constantly-Losing-Quorum
>
> Could the problem also be the same problem that is described in the
> last post? The nodes are connected to the iSCSI storage (QNAP NAS) through
> the Cisco switch on VLANX. The other NIC is used to bridge the VMs and
> connect them and the hosts to the rest of the network (so "host"
> traffic also goes through this VLAN)...
>
>> pveperf
> CPU BOGOMIPS: 55876.08
> REGEX/SECOND: 1476041
> HD SIZE: 94.49 GB (/dev/mapper/pve-root)
> BUFFERED READS: 144.93 MB/sec
> AVERAGE SEEK TIME: 8.15 ms
> FSYNCS/SECOND: 30.69
> DNS EXT: 58.96 ms
>
> Thanks,
> Steffen Wagner
>
> P.S. Sorry Alexandre, I pressed the wrong button :-)
>
> On 12.03.2013 17:49, Alexandre DERUMIER wrote:
>> Hi Steffen,
>>
>> It seems that you have multicast errors/hangs which cause the corosync errors.
>> What physical switches do you use? (I ask because we have found a multicast bug involving a feature of the current kernel and Cisco switches.)
>>
>>
>>
>>
>> 2013/3/12 Steffen Wagner <mail at steffenwagner.com>
>>
>>
>> Hi,
>>
>> I had a similar problem with 2.2.
>> I had rgmanager for the HA features running on high-end hardware (Dell, QNAP and Cisco). After about three days one of the nodes (it wasn't always the same one!) left the quorum; the log said something like 'node 2 left, x nodes remaining in cluster, fencing node 2'. After that the node was always fenced successfully... so I disabled fencing and changed it to 'hand' (manual). Then the node didn't shut down anymore. It remained online with all VMs, but the cluster said the node was offline (at reboot the node got stuck at the pve rgmanager service; only a hard reset was possible).
>>
>> In the end I disabled HA and now run the nodes only in cluster mode without fencing... working until now (3 months) without any problems... a pity, because I want to use the HA features, but I don't know what's wrong.
>>
>> My network setup is similar to Fabio's. I'm using VLANs, one for the storage interface and one for everything else...
>>
>> For now I think I'll stay on 2.2 and not upgrade to 2.3 until everyone on the mailing list is happy :-)
>>
>>
>> Kind regards,
>> Steffen Wagner

-- 
Steffen Wagner
Im Obersteig 31
D-76879 Hochstadt / Pfalz

M +49 (0) 1523 3544688
F +49 (0) 6347 918475
E mail at steffenwagner.com



