[pve-devel] need help to debug random host freeze on multiple hosts

Alexandre DERUMIER aderumier at odiso.com
Tue Dec 30 10:03:22 CET 2014


Hi, thanks for the reply.

>>But long-running 90% CPU usage is never good. Why do you have 90% CPU usage
>>when the load is not high?

When I said the load is not high, I meant that the load average is around 50 on a 64-core (4x16 cores) server.
So it's OK, but close to the point of being overloaded.
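
As a rough illustration of that reasoning, the load average can be divided by the core count; a value close to 1.0 per core means the host is near saturation. This is only a sketch: the helper name load_per_core is made up for this example, and the figures are the ones quoted above.

```shell
# Hypothetical helper: load average divided by the number of cores.
# $1 = load average, $2 = core count
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Figures from this mail: load average 50 on 64 cores.
load_per_core 50 64   # prints 0.78
```

On a live Linux host the two arguments could be taken from `cut -d' ' -f1 /proc/loadavg` and `nproc`.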

Our business activity is currently higher than usual, and we have ordered some new servers,
but until they arrive, all the servers in the cluster are at 70-90% CPU usage.

(It's high, but I don't think it should crash the host.)



About security: we have firewall + anti-DDoS + IPS hardware appliances in front of all VMs.
So a lot of bad traffic is already filtered.

(I forgot to say that we manage all the VMs ourselves for our customers (around 1000 VMs),
so we have good knowledge of the workload of each VM.)



Thanks again for your reply

Alexandre

----- Original Message -----
From: "Detlef Bracker" <bracker at 1awww.com>
To: "aderumier" <aderumier at odiso.com>
Sent: Monday, 29 December 2014 22:40:54
Subject: Re: [pve-devel] need help to debug random host freeze on multiple hosts

But long-running 90% CPU usage is never good. Why do you have 90% CPU usage
when the load is not high? On days with a lot of hacking attempts I have seen the load
and the CPU usage go up and memory consumption go down, with disk I/O, as far as I can
see in atop, working more or less only for host jobs!

Once I had stopped the attacks with the firewall and so on, the situation was very
clear! There were still hacking attempts with many different attacks, which I found via netstat:

netstat -t -u -p 

or continuously:

netstat -t -u -p -c > logfile.txt 
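
To spot the noisiest remote hosts in such output, the connections can be counted per foreign address. The following is only a sketch: the function name count_by_remote is made up here, and it assumes the usual net-tools column layout, where the foreign address is the fifth field.

```shell
# Hypothetical helper: count TCP/UDP connections per remote address.
# Reads netstat output on stdin; assumes field 5 is "Foreign Address".
count_by_remote() {
    awk '$1 ~ /^(tcp|udp)/ {
             addr = $5
             sub(/:[^:]*$/, "", addr)   # strip the trailing :port
             seen[addr]++
         }
         END { for (a in seen) print seen[a], a }' | sort -rn
}
```

It could be used as `netstat -t -u -n | count_by_remote | head`; the -n flag keeps addresses numeric so that the port can be stripped reliably.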

The problem was that the attacks tried, for example, to use an open proxy that was
shipped with old versions of Joomla (plugin_goolgemap2_proxy.php), or went through
WordPress via xmlrpc.php. When hackers use webspaces with these two outdated scripts, the
host does not respond well. The load average on our hosts is normally < 1, and with these
attacks it goes up to 60 or higher!

That is why it is absolutely important that the PVE firewall works correctly on the hosts,
because with it, it is very easy to block all traffic from hackers, and our scripts that create the
second blacklist work absolutely fine! We are only waiting for the Proxmox update that
implements IPv6 correctly, so that we can also activate the firewall for containers
that have mixed IPv4 and IPv6; at the moment we can't handle these containers via the
pve-firewall!


On 29.12.2014 at 20:05, Alexandre DERUMIER wrote:



BQ_BEGIN

BQ_BEGIN

I don't have information about the microcode update, only a note from Dell support saying that it corrects
instability on VMware. (So I don't know about KVM.)



BQ_END

Here is the detail of the microcode patch:

815 Processor May Read Partially Updated Branch Status Register
Description
Under a highly specific and detailed set of internal timing conditions, the processor may read an internal branch
status register (BSR) while the register is being updated resulting in an incorrect rIP.
Potential Effect on System
The incorrect rIP causes unpredictable program or system behavior, usually observed as a page fault.
Suggested Workaround
Contact your AMD representative for information on a BIOS update.
Fix Planned
No fix planned



I had another crash this afternoon, and this host had been at around 90% CPU usage since 12h. (But the load average was OK.)
So maybe more CPU usage gives more chance of hitting this case.

I have patched the BIOS; I'll wait and see whether it improves things or not.



----- Original Message -----
From: "aderumier" <aderumier at odiso.com>
To: "datanom.net" <mir at datanom.net>
Cc: "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Monday, 29 December 2014 16:56:32
Subject: Re: [pve-devel] need help to debug random host freeze on multiple hosts

BQ_BEGIN

BQ_BEGIN

Could this, given the high load, be caused by a race condition which is 
solved in the new microcode? 

BQ_END

BQ_END

I don't have information about the microcode update, only a note from Dell support saying that it corrects
instability on VMware. (So I don't know about KVM.)

BQ_BEGIN

BQ_BEGIN

Have you tried connecting a serial console to one of the nodes? 

If you have IPMI on the nodes you should also be able to monitor 
further than on the default console. 

BQ_END

BQ_END

I'm going to set up serial output over the Dell iDRAC.
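
A minimal sketch of what that could look like on the host side, in /etc/default/grub. The serial port (ttyS1) and the speed (115200) are assumptions for illustration only; the right values depend on how serial redirection is configured in the BIOS and iDRAC.

```shell
# /etc/default/grub (fragment)
# Assumption: the iDRAC serial redirection is mapped to ttyS1 at 115200 baud.
# Kernel messages go to both the local console and the serial port.
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n8"
# Run update-grub and reboot afterwards for the change to take effect.
```

With the kernel console on the serial port, messages printed just before a freeze may be captured even when nothing reaches the logs on disk.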


----- Original Message -----
From: "datanom.net" <mir at datanom.net>
To: "pve-devel" <pve-devel at pve.proxmox.com>
Cc: "aderumier" <aderumier at odiso.com>
Sent: Monday, 29 December 2014 13:27:08
Subject: Re: [pve-devel] need help to debug random host freeze on multiple hosts

On Mon, 29 Dec 2014 07:31:32 +0100 (CET) 
Alexandre DERUMIER <aderumier at odiso.com> wrote: 

BQ_BEGIN

Yes, sure, I have nothing in the logs.
(That's why I thought of kdump, to try to get more info.)

I really don't know whether it's a real software kernel panic or a hardware bug.

I just saw some AMD microcode bug on the VMware forum, and saw that Dell provided a new BIOS update this month.
I'll try the update to see if it helps.

BQ_END

Could this, given the high load, be caused by a race condition which is 
solved in the new microcode? 

Have you tried connecting a serial console to one of the nodes? 

If you have IPMI on the nodes you should also be able to monitor 
further than on the default console. 

BQ_END





