[PVE-User] Abnormal Load average on nodes

Fri Sep 28 20:32:22 CEST 2012

Hello.

Well, this looks like some low-level glitch. I had 3 nodes in the same 
state, to test what's going on. I don't know if it is worth a bug report; 
it's too unclear. All the nodes are identical: Gigabyte motherboard with 
a Phenom II X4 960T CPU and DDR3 RAM. I'll refer to them by their 
hostnames, to avoid confusion.

proxmox43:
Just rebooted. Seems to have fully recovered. It has been working fine 
(or rather idling, with no VMs or CTs) since yesterday.

proxmox44:
service pvestatd stop
service pvedaemon stop
service cman stop
service pve-cluster stop

Load average drops from 5.00, but stops at 1.00.
If I try to check anything related to proxmox44 in the Web GUI, I'm 
immediately logged out.

I keep on killing things.
"/etc/init.d/apache2 stop" cries about its config test:
Stopping web server: apache2
The apache2 configtest failed, so we are trying to kill it manually. 
This is almost certainly suboptimal, so please make sure your system is 
working as you'd expect now! ... (warning).
  ... waiting .

"/etc/init.d/vz stop" normal.
"/etc/init.d/qemu-server stop" normal.
"/etc/init.d/rgmanager stop" normal.
"/etc/init.d/rsync stop" normal.
"/etc/init.d/ntp stop" normal.
"/etc/init.d/vzeventd stop" normal.
Load Average still on the same level, steady 1.00

"/etc/init.d/rrdcached stop" normal.
Finally, Load average goes down to the expected 0.00.

I start stuff back again:
"/etc/init.d/rrdcached start", normal, Load Average stays around 0.04.
"/etc/init.d/vzeventd start", normal.
"/etc/init.d/ntp start" normal.
"/etc/init.d/rsync start" normal.
"/etc/init.d/rgmanager start" normal.
"/etc/init.d/qemu-server start" normal.
Load Average at 0.00.

"/etc/init.d/vz start" cries about "Unable to open /etc/pve/openvz/0.conf"

"/etc/init.d/vz stop" normal.
"service pve-cluster start" normal.
"/etc/init.d/vz start" normal.

"service cman start" normal.
"service pvedaemon start" normal.
"service pvestatd start" normal.

"/etc/init.d/apache2 start" normal (didn't it fail its config test on 
shutdown?)

Load Average at 0.00.
The light on proxmox44's icon in my Web GUI turns green.
Everything looks normal on proxmox44.

proxmox45:
Since the problem seems to be related to rrdcached, I just restarted 
it: "/etc/init.d/rrdcached restart" runs cleanly but has no effect; 
Load Average is still slightly above 5.

"/etc/init.d/rrdcached stop" normal, but Load Average stays at 5.00.
"service pvestatd stop" normal, Load Average goes lower but stops at 4.00.
"service pvedaemon stop" normal, drops Load Average to 1.00
"service cman stop" normal, Load Average still at 1.00
"service pve-cluster stop" normal, Load Average still at 1.00

Now it's clear it's not rrdcached's fault, it's just the most effective 
trigger.
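
The stop-one-service-then-check-the-load routine above could be scripted 
roughly as follows. This is only a sketch: the service names are the ones 
from this message, while load1, stop_and_report, SETTLE, DRY_RUN and 
LOADAVG_FILE are hypothetical names made up for illustration, and the 
wait time is a guess.

```shell
#!/bin/sh
# Sketch only: stop a service, wait, then print the 1-minute load average.
# LOADAVG_FILE, SETTLE and DRY_RUN are hypothetical knobs, not Proxmox settings.

LOADAVG_FILE="${LOADAVG_FILE:-/proc/loadavg}"
SETTLE="${SETTLE:-120}"   # seconds to let the 1-minute average settle (a guess)

# Print the first field (1-minute load average) of the loadavg file.
load1() {
    cut -d' ' -f1 "$LOADAVG_FILE"
}

# Stop one service, wait, then print "<service> <load>".
# With DRY_RUN=1 nothing is stopped and there is no wait.
stop_and_report() {
    svc="$1"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        service "$svc" stop
        sleep "$SETTLE"
    fi
    printf '%s %s\n' "$svc" "$(load1)"
}
```

Run as root, it would be used along the lines of 
"for s in rrdcached pvestatd pvedaemon cman pve-cluster; do stop_and_report "$s"; done", 
which gives one line per service to compare against the numbers above.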

This is a typical capture from proxmox44, when the problem showed up the 
first time (I stripped some spaces so it fits within 80 columns):

top - 10:59:16 up 13 days, 16:41, 1 user, load average: 5.08, 5.05, 5.01
Tasks: 163 total,   1 running, 162 sleeping,   0 stopped,   0 zombie
Cpu(s): 0.2%us, 0.0%sy, 0.0%ni, 99.7%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:  15912372k total, 599364k used, 15313008k free,  146660k buffers
Swap: 15728632k total,      0k used, 15728632k free,  136704k cached

    PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM  TIME+  COMMAND
259116 www-data 20  0  272m  37m 4028 S    1  0.2  0:00.55 apache2
      1 root     20  0  8360  788  656 S    0  0.0  0:03.50 init
      2 root     20  0     0    0    0 S    0  0.0  0:00.00 kthreadd
      3 root     RT  0     0    0    0 S    0  0.0  0:00.22 migration/0
      4 root     20  0     0    0    0 S    0  0.0  0:00.00 ksoftirqd/0
      5 root     RT  0     0    0    0 S    0  0.0  0:00.00 migration/0
      6 root     RT  0     0    0    0 S    0  0.0  0:00.00 watchdog/0

While the condition is present, the "Summary" tab of the Web GUI shows the 
graph frames, but with no data in them.
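
On Linux the load average counts tasks in uninterruptible sleep (state D) 
as well as runnable ones (state R), which is one way a box can show a load 
of 5 with the CPU 99.7% idle, as in the capture above. Whether that is 
what happens on these nodes is an assumption, not something confirmed 
here. A minimal sketch to list such tasks; the function names are made up 
for illustration:

```shell
#!/bin/sh
# Sketch: list tasks that count toward the Linux load average, i.e. those
# in state R (runnable) or D (uninterruptible sleep).

# Filter "STAT ..." lines on stdin, keeping states that start with R or D.
filter_load_tasks() {
    awk '$1 ~ /^[RD]/'
}

# List such tasks on the live system, threads included (-L).
# The ps process itself will normally show up as R.
list_load_tasks() {
    ps -eL -o stat= -o pid= -o lwp= -o comm= | filter_load_tasks
}
```

If list_load_tasks keeps showing the same few entries in state D while the 
load sits at a whole number, those would be the first things to look at.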

While playing with proxmox45, I noticed that the light on proxmox44 
went red again. Its Load Average was around 0.80 when I noticed it and 
was growing slowly. Judging by the graph, the recovery happened around 
12:26 (the CPU usage % graph starts there), and the node fell over again 
at 12:35 (the graph cuts off there). In nearly one hour it climbed to 
1.00 and stayed there. Stopping rrdcached makes no difference.
Stopping pvestatd, pvedaemon, cman and pve-cluster, and then starting 
pve-cluster, cman, pvestatd and pvedaemon again seems to be a 
temporary fix. Once Load Average reaches 1.05, the light in the Web GUI 
turns red and the node begins to behave oddly. All this with rrdcached 
stopped.
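
To pin down when the load starts climbing, instead of reading it off the 
GUI graph, a timestamped log of /proc/loadavg might help. A rough sketch; 
the function names, the log path and the interval are all hypothetical:

```shell
#!/bin/sh
# Sketch: append "date time load1 load5 load15" lines to a log file so the
# moment the load starts climbing can be matched against the GUI graphs.

# Write one sample; $1 = log file, $2 = loadavg source (default /proc/loadavg).
log_load_once() {
    src="${2:-/proc/loadavg}"
    read -r l1 l5 l15 _rest < "$src"
    printf '%s %s %s %s\n' "$(date '+%F %T')" "$l1" "$l5" "$l15" >> "$1"
}

# Sample every $2 seconds (default 60) until interrupted.
log_load_forever() {
    while :; do
        log_load_once "$1"
        sleep "${2:-60}"
    done
}
```

For example, "log_load_forever /root/loadavg.log 30 &" left running on 
proxmox44 would record a sample every 30 seconds across the next recovery 
and relapse.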

I have no idea.

On 27/09/12 16:19, Alexandre Kouznetsov wrote:
> Hello.
>
> I have 4 nodes in a Proxmox 2.1 cluster.
> After I changed the network configuration on the node I'm using as web
> panel (hostname proxmox42) and rebooted it (as the Web GUI requested), I
> see the rest of the nodes offline (hostnames proxmox43-proxmox45). Well,
> they have a little red dot instead of a green one in their icon in the
> web interface.
>
> The fallen nodes respond via Web and SSH, with some errors on the Web
> GUI. The network configuration change I made was to add a bridge on
> a previously unused NIC.
>
> What can I do (places to look, tests to run) to see what is going on? My
> cluster has to go to production next week, so I'm almost glad this
> happened now and not then.
>
>
> Random details, don't know what may be relevant:
>
> The "Datacenter" (root of the GUI hierarchy) section of the Web GUI
> shows this status:
> "Search" tab lists all the resources but shows the details only for tab
> status for proxmox42's resources.
> "Summary" tab shows all the nodes as "online".
> I have reloaded the page, logged out and logged in (using root PAM
> account), same status.
>
> Curiously, the "Summary" tabs of the fallen nodes are showing a valid
> status. I can see the CPU details, uptime, etc. The only thing out of
> order is the Load Average. They are doing or running nothing, but have
> Load Average above 1.
> Some parts of the GUI do not show details and display a floating
> message "communication failure".
>
> I can SSH to all the nodes and see that "pvecm status" and "pvecm nodes"
> shows all 4 nodes online and running.
> SSH to each node works, "top" confirms a high Load Average but shows
> less than 1% CPU usage.
> Apache access log shows successful connections to the API from proxmox42
> to the fallen nodes.
>
> I have rebooted one of the nodes and it appears to be online now, seems
> normal (Load Average, response to GUI). I have not rebooted any other
> node yet. I'm more interested in finding out what the condition is and
> making sure I eliminate the cause than in getting my nodes back online ASAP.
>
> Thank you.
>

-- 
Alexandre Kouznetsov