[PVE-User] Unreliable

Alexandre DERUMIER aderumier at odiso.com
Tue Mar 12 17:49:35 CET 2013


Hi Steffen,

It seems that you have multicast errors/hangs, which cause the corosync errors.
What physical switches do you use? (I ask because we have found a multicast bug involving a feature of the current kernel and Cisco switches.)
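
If you want to rule multicast out before blaming the switches, a quick
sender/receiver pair shows whether multicast frames actually cross the
switch between two nodes. Below is a minimal Python sketch, not the
official Proxmox test (omping is the usual tool); the group address and
port are arbitrary values chosen for the example, not the ones your
corosync configuration uses.

# multicast_check.py -- rough multicast reachability test between two nodes.
# Run "python multicast_check.py recv" on one node, then
# "python multicast_check.py send" on another; the receiver should print
# the test message if multicast passes through the switch.
import socket
import struct
import sys

GROUP = "239.192.0.1"   # arbitrary test group (assumption, adjust as needed)
PORT = 6789             # arbitrary free UDP port (assumption)

def receiver():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    # join the multicast group on all interfaces
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    while True:
        data, addr = sock.recvfrom(1024)
        print("received %r from %s" % (data, addr[0]))

def sender():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # TTL of 2 so the packet survives one routed hop if there is one
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
    sock.sendto(b"pve multicast test", (GROUP, PORT))
    sock.close()

if __name__ == "__main__":
    receiver() if "recv" in sys.argv[1:] else sender()

If repeated sends go out but the receiver stays silent, or reception stops
after a few minutes, IGMP snooping on the switch dropping the group
membership is a common suspect.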




2013/3/12 Steffen Wagner < mail at steffenwagner.com > 


Hi, 

I had a similar problem with 2.2.
I had rgmanager for the HA features running on high-end hardware (Dell, QNAP and Cisco). After about three days one of the nodes (it wasn't always the same one!) left the quorum; the log said something like 'node 2 left, x nodes remaining in cluster, fencing node 2'. The node was then always fenced successfully... so I disabled fencing and switched to manual fencing. After that the node didn't shut down anymore. It remained online with all VMs, but the cluster said the node was offline (on reboot the node got stuck at the pve rgmanager service; only a hard reset was possible). 

In the end I disabled HA and now run the nodes in cluster mode only, without fencing... it has been working for 3 months now without any problems... a pity, because I want to use the HA features, but I don't know what's wrong. 

My network setup is similar to Fábio's. I'm using VLANs, one for the storage interface and one for the rest. 

For now I think I'll stay on 2.2 and not upgrade to 2.3 until everyone on the mailing list is happy :-) 


Kind regards, 
Steffen Wagner 
-- 

Im Obersteig 31 
76879 Hochstadt/Pfalz 

E mail at steffenwagner.com 
M 01523/3544688 
F 06347/918474 

Fábio Rabelo < fabio at fabiorabelo.wiki.br > wrote: 

>2013/3/12 Andreu Sànchez i Costa < andreu.sanchez at iws.es > 
> 
>> Hello Fábio, 
>> 
>> On 12/03/13 01:00, Fábio Rabelo wrote: 
>> 
>> 
>> 2.3 does not have the reliability 1.9 has!!!! 
>> 
>> I have been struggling with it for 3 months, my deadline is gone, and I cannot 
>> make it work for more than 3 days without an issue ... 
>> 
>> 
>> I cannot give my opinion about 2.3, but with 2.2.x it works perfectly; I 
>> only had to change the elevator to deadline because CFQ had performance 
>> problems with our P2000 iSCSI disk array. 
>> 
>> As other list members asked, what are your main problems? 
>> 
>> 
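
As an aside on the elevator change mentioned above: it can be tried at
runtime through sysfs before making it permanent via elevator=deadline on
the kernel command line. A minimal Python sketch follows, assuming the
disk shows up as /dev/sda (an assumption; adjust the device name for your
setup) and run as root.

# set_elevator.py -- switch the I/O scheduler of one block device to
# "deadline" at runtime (device name below is an assumption).
DEVICE = "sda"
path = "/sys/block/%s/queue/scheduler" % DEVICE

# show the available schedulers; the active one is shown in brackets,
# e.g. "noop deadline [cfq]"
with open(path) as f:
    print("before: " + f.read().strip())

# writing a scheduler name selects it for this device
with open(path, "w") as f:
    f.write("deadline")

with open(path) as f:
    print("after: " + f.read().strip())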
>I have already described the problems several times here. 
> 
>This is a five-node cluster, with dual-Opteron motherboards from Supermicro. 
> 
>The storage box uses the same motherboard as the five nodes, but in a chassis 
>with 16 3.5" HD slots, 12 of them occupied by WD enterprise disks. 
> 
>The storage runs NAS4Free (I already tried FreeNAS, same result). 
> 
>Like I said, since I installed PVE 1.9 everything has worked fine: now 9 
>days and counting. 
> 
>Each of the five nodes has 2 onboard network ports, connected to a Linksys 
>switch, which I use to serve the VMs. 
> 
>In one PCIe slot there is an Intel 10 GbE card, talking to a Supermicro 
>10 GbE switch that is dedicated to communication between the five nodes 
>and the storage. 
> 
>This switch has no link to anything else. 
> 
>On the storage box I use one of the onboard ports for management, and all 
>images are served through the 10 GbE card. 
> 
>After some time, between 1 and 3 days of the system running, the nodes stop 
>talking to the storage. 
> 
>When it happens, the log shows lots of messages like this: 
> 
>Mar 6 17:15:29 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online 
>Mar 6 17:15:39 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online 
>Mar 6 17:15:49 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online 
>Mar 6 17:15:59 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online 
>Mar 6 17:16:09 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online 
>Mar 6 17:16:19 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online 
>Mar 6 17:16:29 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online 
>Mar 6 17:16:39 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online 
>Mar 6 17:16:49 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online 
>Mar 6 17:16:59 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online 
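
(These warnings mean pvestatd's storage check is failing, i.e. the node
can no longer reach the storage over the 10 GbE path. A crude way to tell
a network problem from a daemon problem is to probe the export port
directly from the affected node; a minimal sketch, assuming the NAS4Free
box serves NFS on the standard port 2049 and using a made-up address for
the storage host.)

# storage_probe.py -- check whether the storage host still answers on the
# NFS port. Hostname/IP and port are assumptions; adjust for your setup.
import socket

STORAGE_HOST = "10.10.10.10"   # hypothetical 10 GbE address of the NAS
PORT = 2049                    # NFS; use 3260 instead for iSCSI

try:
    conn = socket.create_connection((STORAGE_HOST, PORT), timeout=3)
    conn.close()
    print("storage answers on port %d" % PORT)
except socket.error as err:
    print("storage unreachable: %s" % err)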
> 
> 
> 
>After that, if I try to restart the pve daemon, it refuses to. 
> 
>If I try to reboot the server, it hangs at the point where the PVE daemon 
>should stop, and stays there forever. 
> 
>The only way to reboot any of the nodes is a hard reset! 
> 
>At first my suspicion fell on the storage, so I changed from FreeNAS to 
>NAS4Free: same thing. Desperation! 
> 
>Then, as a test, I installed PVE 1.9 on all five nodes (I have 2 systems 
>running it for 3 years with no issue; this new system is to replace both). 
> 
>Like I said, 9 days and counting!!! 
> 
>So there is no problem with the hardware, and there is no problem with 
>NAS4Free! 
> 
>What is left?!? 
> 
> 
>Fábio Rabelo 
> 




_______________________________________________ 
pve-user mailing list 
pve-user at pve.proxmox.com 
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user 
