[PVE-User] Unreliable

Alexandre DERUMIER aderumier at odiso.com
Tue Mar 12 14:18:46 CET 2013



>> In one PCIe slot there is an Intel 10 Gb card, which talks to a Supermicro 10 Gb switch used exclusively for communication between the five nodes and the storage.

Which Intel card model is it? Do you use MTU 9000?

>>pvestatd[2804]: WARNING: storage 'iudice01' is not online 

Which storage protocol do you use: NFS, iSCSI or LVM?
If NFS, what are your mount options?


>>After that, if I try to restart the PVE daemon, it refuses to.
>>If I try to reboot the server, it stops at the point where the PVE daemon should stop, and stays there forever.
>>
>>The only way to reboot any of the nodes is a hard reset ! 

It's possible that an access to the storage is hanging (stats, VM volume info, ...).
Normally a check is done to avoid that (this is the "not online" message you see).

The checks are:

for NFS:
/usr/bin/rpcinfo -p nfsipserver, with a timeout of 2 seconds

for iSCSI:
a TCP connection test ("ping") to iscsiserverip on port 3260, with a timeout of 2 seconds
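
If you want to reproduce these two checks by hand while the SAN is under load, something like the following Python sketch may help. It is only an approximation of the checks described above, not the actual Proxmox code: the function names tcp_port_open and nfs_server_responds are my own, and the 2 second default simply mirrors the timeout mentioned above.

```python
import socket
import subprocess


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host and timeouts.
        return False


def nfs_server_responds(server, timeout=2.0):
    """Return True if `rpcinfo -p server` exits successfully within `timeout` seconds."""
    try:
        result = subprocess.run(
            ["/usr/bin/rpcinfo", "-p", server],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or rpcinfo is not installed at that path.
        return False
    return result.returncode == 0
```

For an iSCSI target you would call tcp_port_open(iscsiserverip, 3260), and for NFS nfs_server_responds(nfsipserver). Running them in a loop with a larger timeout while the storage is busy should show whether 2 seconds is really too short in your setup.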


So maybe the timeout in the Proxmox code is too low when your SAN is under load.



Also, do your VMs hang too, or is it only pvedaemon/manager?




----- Original Message ----- 

From: "Fábio Rabelo" <fabio at fabiorabelo.wiki.br> 
To: "Andreu Sànchez i Costa" <andreu.sanchez at iws.es> 
Cc: pve-user at pve.proxmox.com 
Sent: Tuesday, 12 March 2013 12:32:21 
Subject: Re: [PVE-User] Unreliable 

2013/3/12 Andreu Sànchez i Costa < andreu.sanchez at iws.es > 





Hello Fábio, 

On 12/03/13 01:00, Fábio Rabelo wrote: 


<blockquote>

2.3 does not have the reliability 1.9 has !!!! 

I have been struggling with it for 3 months, my deadlines are gone, and I cannot make it work for more than 3 days without an issue ... 



I cannot give my opinion about 2.3, but 2.2.x works perfectly for us. I only had to change the I/O elevator to deadline because CFQ had performance problems with our P2000 iSCSI disk array. 

As other list members asked, what are your main problems? 


</blockquote>


I have already described the problems several times here.

This is a five-node cluster, with dual-Opteron motherboards from Supermicro.

The storage server uses the same motherboard as the five nodes, but with 16 3.5-inch HD slots, 12 of them occupied by WD enterprise disks.

The storage runs Nas4Free (I already tried FreeNAS, same result).

Like I said, since I installed PVE 1.9 everything has worked fine, for 9 days now and counting.

Each of the five nodes has 2 embedded network ports, connected to a Linksys switch; I am using them to serve the VMs.

In one PCIe slot there is an Intel 10 Gb card, which talks to a Supermicro 10 Gb switch used exclusively for communication between the five nodes and the storage.

This switch has no link to anything else.

On the storage server, I use one of the embedded ports for management, and all images are served through the 10 Gb card.

After some time, between 1 and 3 days of the system working, the nodes stop talking to the storage.

When it happens, the log shows lots of messages like this:

Mar  6 17:15:29 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
Mar  6 17:15:39 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
Mar  6 17:15:49 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
Mar  6 17:15:59 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
Mar  6 17:16:09 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
Mar  6 17:16:19 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
Mar  6 17:16:29 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
Mar  6 17:16:39 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
Mar  6 17:16:49 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online
Mar  6 17:16:59 nodo-01 pvestatd[2804]: WARNING: storage 'iudice01' is not online 

After that, if I try to restart the PVE daemon, it refuses to.

If I try to reboot the server, it stops at the point where the PVE daemon should stop, and stays there forever.

The only way to reboot any of the nodes is a hard reset!

At first, my suspicion fell on the storage, so I changed from FreeNAS to Nas4Free: same thing. Desperation!

Then, as a test, I installed PVE 1.9 on all five nodes (I have 2 systems that have been running it for 3 years with no issues; this new system is meant to replace both).

Like I said, 9 days and counting !!!

So, there is no problem in the hardware, and there is no problem with Nas4Free!

What is left ?!?


Fábio Rabelo 



_______________________________________________ 
pve-user mailing list 
pve-user at pve.proxmox.com 
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user 


