[PVE-User] Constant crashes with high IO load from FreeBSD guests

Myke G mykel at mWare.ca
Fri Jan 7 19:12:37 CET 2011


On 2010-12-28 13:24, Myke G wrote:
> I have 3 SunFire X2200s in my cluster, with 8-16 GB RAM and 4 or 8
> CPU cores each. I just upgraded one of them to v1.7; the rest are
> still running v1.5. It doesn't seem to matter - the problem remains:
>
> If the guest fires up a tar of / or a particularly large ./configure,
> then within seconds to minutes the Proxmox node hangs. Usually there
> are no console messages, and I have to use the IPMI interface to
> reset the machine. (I'm 500 km from the datacenter.)
> We've tried FreeBSD 7 and 8, with SCSI and IDE, and Realtek and Intel
> NICs... no difference. We've only tried i386 builds (stock from the
> ISO, and also updated to -STABLE). I've moved the guest around to
> different nodes in the cluster, and the VM-I/O-induced crashing is
> universal. AFAIK the hardware is fine; the crashes follow this user's
> work patterns, which aren't exceptional IMO. We've recreated this
> instance 4 times now, I think.
> Sometimes the guest machine just ends up "stopped" in the Proxmox
> management interface, but that's only about 1 in 20 occurrences. Even
> that is undesirable, but it's nowhere near as bad as taking down the
> whole node, which is the typical failure mode. I should add that
> sometimes the node self-reboots...
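
For anyone trying to reproduce this: the trigger seems to be nothing
more exotic than sustained sequential I/O inside the guest. A minimal
stand-in for the tar case, run inside the FreeBSD guest (just a sketch;
I'd assume any big streaming read does the job equally well):

   # stream the guest's whole filesystem through tar and discard it;
   # here this hangs the Proxmox node within seconds to minutes
   tar cf /dev/null /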

Finally caught a crash via the IPMI serial console. The capture was
mixed in with wall messages and vmstat 1 output, which I've untangled
below.
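
In case it helps anyone else chasing one of these, here's roughly how I
had the console hooked up (a sketch, with assumptions: the BMC answers
as cb-ipmi, SOL is enabled on it, the host kernel's console= points at
the SOL serial port, and the log filename is just my choice):

   # attach to the node's serial console over IPMI serial-over-LAN and
   # log everything, so the next oops survives the node hanging
   ipmitool -I lanplus -U root -H cb-ipmi sol activate | tee crowbar-console.log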

This is from a kernel that bcrl (of Kvack/kernel.org) built, but the
crashes look the same as with the stock Proxmox kernels.

His analysis is that the KVM developers aren't working on my "older"
Opterons anymore, and that I either need to upgrade or someone needs to
put support for the older CPUs back in. (And he's not volunteering;
he's busy with other stuff.)

So now I'm looking at buying newer hardware. (Any suggestions? The Sun 
Fire X2200 replacements are 4X the price. Thanks Larry.)

FWIW, here's /proc/cpuinfo for one of the cores:
processor       : 3
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2218
stepping        : 2
cpu MHz         : 2613.235
cache size      : 1024 KB
physical id     : 1
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt rdtscp lm 3dnowext 3dnow rep_good extd_apicid pni cx16 lahf_lm
cmp_legacy svm extapic cr8_legacy
bogomips        : 5227.07
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc
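
Note the flags line: svm is there, but npt is not, i.e. these Opterons
predate nested paging, so KVM has to use the shadow page table code
that the oops below dies in. A quick check for any node (a sketch; the
/sys path assumes the kvm_amd module is loaded, and the npt CPU flag
only shows up on reasonably recent kernels):

   # does the CPU advertise nested paging?
   egrep -o 'svm|npt' /proc/cpuinfo | sort -u
   # and is kvm_amd actually using it? (1 = NPT, 0 = shadow paging)
   cat /sys/module/kvm_amd/parameters/npt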


And here's the oops:

BUG: unable to handle kernel paging request at ffff8801fc57d424
IP: [<ffffffffa0362f43>] paging32_cmpxchg_gpte+0x62/0x7d [kvm]
PGD 162b063 PUD f067 PMD fff500000000ff1a
Oops: 0002 [#1] SMP
last sysfs file: /sys/kernel/uevent_seqnum
CPU 3
Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss sunrpc tun 
kvm_amd kvm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bridge 
8021q garp stp snd_pcm snd_timer snd soundcore snd_page_alloc 
amd64_edac_mod tpm_tis psmouse edac_core shpchp edac_mce_amd k8temp tpm 
tpm_bios pci_hotplug button i2c_nforce2 serio_raw processor evdev pcspkr 
i2c_core ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot 
usbhid ata_generic hid sata_nv pata_amd tg3 libphy ehci_hcd ohci_hcd 
usbcore forcedeth libata nls_base thermal fan thermal_sys [last 
unloaded: scsi_wait_scan]

Pid: 2470, comm: kvm Not tainted 2.6.36.2 #1 S39              /Sun Fire 
X2200 M2
RIP: 0010:[<ffffffffa0362f43>]  [<ffffffffa0362f43>] 
paging32_cmpxchg_gpte+0x62/0x7d [kvm]
RSP: 0018:ffff88011d7d9b98  EFLAGS: 00010286
RAX: 0000000005222405 RBX: 0000000000000109 RCX: ffff88011d7d8000
RDX: ffff8801fc57d424 RSI: 00007f1adcdfa000 RDI: ffffea0006f33358
RBP: ffff88011d7d9c88 R08: 0000000000000207 R09: ffff88011d7d9b1c
R10: 0000000000000007 R11: ffff88021e023fc8 R12: 0000000005222405
R13: 0000000005222425 R14: 0000000000000109 R15: 0000000000000000
FS:  0000000040832950(0063) GS:ffff880123b00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff8801fc57d424 CR3: 000000011ee66000 CR4: 00000000000006e0
DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003
DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kvm (pid: 2470, threadinfo ffff88011d7d8000, task ffff88011d63c240)
Stack:
  0000000000000007 00000000000138c3 ffff88021ed6c000 ffffffffa0363118
<0> ffff88021ed6c000 000000000917a000 0000000000000004 0000000028109b7b
<0> 0000000700000007 ffffffffa035ea09 05222405c0b6ea8f 0000000000000000
Call Trace:
  [<ffffffffa0363118>] ? paging32_walk_addr+0x1ba/0x3da [kvm]
  [<ffffffffa035ea09>] ? kvm_set_cr3+0xf9/0x10b [kvm]
  [<ffffffffa03656c1>] ? paging32_page_fault+0x61/0x548 [kvm]
  [<ffffffffa0366eea>] ? do_insn_fetch+0x8e/0xd5 [kvm]
  [<ffffffff8100a66e>] ? call_function_single_interrupt+0xe/0x20
  [<ffffffffa0365bc1>] ? kvm_mmu_page_fault+0x19/0x6a [kvm]
  [<ffffffffa035e624>] ? kvm_arch_vcpu_ioctl_run+0x812/0xa6b [kvm]
  [<ffffffff8106cbbd>] ? do_futex+0xc7/0x989
  [<ffffffffa039f2b8>] ? svm_vcpu_load+0x83/0xa6 [kvm_amd]
  [<ffffffffa0350bca>] ? kvm_vcpu_ioctl+0xfe/0x528 [kvm]
  [<ffffffff810e7146>] ? do_readv_writev+0x102/0x117
  [<ffffffff8103f5b9>] ? finish_task_switch+0x34/0xb2
  [<ffffffff810f32e4>] ? do_vfs_ioctl+0x4a4/0x4eb
  [<ffffffff8106d597>] ? sys_futex+0x118/0x136
  [<ffffffff810f3368>] ? sys_ioctl+0x3d/0x5c
  [<ffffffff81009b02>] ? system_call_fastpath+0x16/0x1b
Code: b6 6d db b6 6d 48 8d 04 07 48 c1 f8 03 48 0f af c2 48 ba 00 00 00 
00 00 88 ff ff 48 c1 e0 0c 48 01 d0 89 da 48 8d 14 90 44 89 e0 <f0> 44 
0f b1 2a 89 c3 ff 49 1c e8 09 c6 fe ff 44 39 e3 5b 41 5c
RIP  [<ffffffffa0362f43>] paging32_cmpxchg_gpte+0x62/0x7d [kvm]
  RSP <ffff88011d7d9b98>
CR2: ffff8801fc57d424
---[ end trace a5afc54964b65567 ]---
note: kvm[2470] exited with preempt_count 1
Message from syslogd at Crowbar at Jan  7 12:20:30 ...
 kernel: Stack:

Message from syslogd at Crowbar at Jan  7 12:20:30 ...
 kernel: Code: b6 6d db b6 6d 48 8d 04 07 48 c1 f8 03 48 0f af c2 48 ba
00 00 00 00 00 88 ff ff 48 c1 e0 0c 48 01 d0 89 da 48 8d 14 90 44 89 e0
<f0> 44 0f b1 2a 89 c3 ff 49 1c e8 09 c6 fe ff 44 39 e3 5b 41 5c

Message from syslogd at Crowbar at Jan  7 12:20:30 ...
 kernel: CR2: ffff8801fc57d424

Message from syslogd at Crowbar at Jan  7 12:20:30 ...
 kernel: Call Trace:

Message from syslogd at Crowbar at Jan  7 12:20:30 ...
 kernel: last sysfs file: /sys/kernel/uevent_seqnum

Message from syslogd at Crowbar at Jan  7 12:20:30 ...
 kernel: Oops: 0002 [#1] SMP

BUG: scheduling while atomic: kvm/2470/0x10000001
Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss sunrpc tun
kvm_amd kvm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bridge
8021q garp stp snd_pcm snd_timer snd soundcore snd_page_alloc
amd64_edac_mod tpm_tis psmouse edac_core shpchp edac_mce_amd k8temp tpm
tpm_bios pci_hotplug button i2c_nforce2 serio_raw processor evdev pcspkr
i2c_core ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot
usbhid ata_generic hid sata_nv pata_amd tg3 libphy ehci_hcd ohci_hcd
usbcore forcedeth libata nls_base thermal fan thermal_sys [last
unloaded: scsi_wait_scan]
Pid: 2470, comm: kvm Tainted: G      D     2.6.36.2 #1
Call Trace:
  [<ffffffff8130e300>] ? schedule+0xe4/0x5e2
  [<ffffffff81042810>] ? __cond_resched+0x1d/0x26
  [<ffffffff8130e926>] ? _cond_resched+0x26/0x31
  [<ffffffff8106b3e7>] ? exit_robust_list+0x2a/0x131
  [<ffffffff81043378>] ? mm_release+0x20/0xe2
  [<ffffffff810470e6>] ? exit_mm+0x1c/0x118
  [<ffffffff81048ab1>] ? do_exit+0x22e/0x72d
  [<ffffffff81061cda>] ? up+0xe/0x37
  [<ffffffff810462f6>] ? kmsg_dump+0xa9/0x141
  [<ffffffff81310d56>] ? oops_end+0xaf/0xb4
  [<ffffffff8102e547>] ? no_context+0x1f2/0x201
  [<ffffffff8102e708>] ? __bad_area_nosemaphore+0x1b2/0x1d6
  [<ffffffff8106e6f4>] ? generic_exec_single+0x64/0x80
  [<ffffffffa0363e41>] ? set_spte+0x355/0x366 [kvm]
  [<ffffffff81312d6e>] ? do_page_fault+0x69/0x2da
  [<ffffffff813101d5>] ? page_fault+0x25/0x30
  [<ffffffffa0362f43>] ? paging32_cmpxchg_gpte+0x62/0x7d [kvm]
  [<ffffffffa0362ef3>] ? paging32_cmpxchg_gpte+0x12/0x7d [kvm]
  [<ffffffffa0363118>] ? paging32_walk_addr+0x1ba/0x3da [kvm]
  [<ffffffffa035ea09>] ? kvm_set_cr3+0xf9/0x10b [kvm]
  [<ffffffffa03656c1>] ? paging32_page_fault+0x61/0x548 [kvm]
  [<ffffffffa0366eea>] ? do_insn_fetch+0x8e/0xd5 [kvm]
  [<ffffffff8100a66e>] ? call_function_single_interrupt+0xe/0x20
  [<ffffffffa0365bc1>] ? kvm_mmu_page_fault+0x19/0x6a [kvm]
  [<ffffffffa035e624>] ? kvm_arch_vcpu_ioctl_run+0x812/0xa6b [kvm]
  [<ffffffff8106cbbd>] ? do_futex+0xc7/0x989
  [<ffffffffa039f2b8>] ? svm_vcpu_load+0x83/0xa6 [kvm_amd]
  [<ffffffffa0350bca>] ? kvm_vcpu_ioctl+0xfe/0x528 [kvm]
  [<ffffffff810e7146>] ? do_readv_writev+0x102/0x117
  [<ffffffff8103f5b9>] ? finish_task_switch+0x34/0xb2
  [<ffffffff810f32e4>] ? do_vfs_ioctl+0x4a4/0x4eb
  [<ffffffff8106d597>] ? sys_futex+0x118/0x136
  [<ffffffff810f3368>] ? sys_ioctl+0x3d/0x5c
  [<ffffffff81009b02>] ? system_call_fastpath+0x16/0x1b

And the fun bits from IRC:

[12:16.58]  * Myke engages in some high-risk behaviour
[12:17.07] <Myke> but I suspect it won't happen again until backup season
[12:17.11] <Myke> (1AM - 5AM)
[12:20.35] <bcrl> there she blows
[12:20.43] <Myke> whoa
[12:20.49] <Myke> but userland survives?
[12:20.53] <Myke> WTF?
[12:21.04] <Myke> or was that a KVM dying without wiping out the kernel?
[12:21.26] <bcrl> well, if an oops happens in process context, the 
kernel kills the process
[12:21.35] <Myke> ah
[12:21.45] <bcrl> the shutdown is unclean, though, and memory isn't 
properly freed
[12:22.00] <Myke> but vmstat is still pumping out lines here...
[12:22.53] <Myke> wow, I can still SSH in
[12:23.03] <Myke> the VMs are dead AFAICS, but ... weird.
[12:23.09] <Myke> this is a new mode of failure FYI.
[12:23.41] <Myke> BUG: unable to handle kernel paging request at 
ffff8801fc57d424
[12:24.26] <bcrl> doh, i have to turn on a few more debug options
[12:24.59] <Myke> how long will that take? (I just want to get the VMs 
back up quick, but I assume a recompile will take < 5min)
[12:28.00] <bcrl> the oopsen are all in the page table handling code
[12:28.12] <bcrl> so this is related to the fact you're using 
old-generation opterons
[12:28.20] <Myke> wow
[12:28.23] <Myke> that's... brutal.
[12:28.37] <Myke> are you ready for a reboot?
[12:28.43] <bcrl> the shadow page table stuff sucks
[12:28.43] <bcrl> no
[12:28.54] <Myke> 'k.
[12:32.31] <bcrl> system needs to be rebooted, kernel compile hung -- 
one of the cpus is stuck
[12:32.43] <Myke> just noticed myself.
[12:32.44] <Myke> will do.
[12:32.46] <bcrl> it won't shut down cleanly most likely
[12:32.56] <bcrl> recommend reboot -f
[12:33.09] <Myke> uh, I think it's dead
[12:33.13] <Myke> my SSH is hung
[12:33.17] <bcrl> yeah
[12:33.20] <Myke> I'll just do a chassis reset
[12:33.32] <bcrl> came back
[12:33.36] <Myke> wow
[12:33.43] <Myke> uh, I was JUST about to hit enter
[12:33.44] <bcrl> 2 cpus hung now, i think
[12:33.47] <Myke> pfft.
[12:33.56] <Myke> should I wait here? :)
[12:34.04] <bcrl> just reboot -f
[12:34.06] <TSIGabe> wtf is going on with your server
[12:34.07] <Myke> or let it keep playing >>>>GAMES WITH MY MIND<<<<
[12:34.15] <Myke> bcrl: if you can do that, do that
[12:34.22] <Myke> oh
[12:34.25] <Myke> I just did that
[12:34.28] <Myke> my SSH also unhung
[12:34.33]  * Myke fl3x0r
[12:34.37] <Myke> TSIGabe: Linux sucks ;)
[12:34.41]  * Myke inflames bcrl
[12:34.54] <bcrl> do a hard reset
[12:35.37] <Myke> [root at Chainsaw ~]# ipmitool -U root -H cb-ipmi power reset
[12:35.37] <Myke> Password:
[12:35.37] <Myke> Chassis Power Control: Reset
[12:35.37] <Myke> [root at Chainsaw ~]#
[12:36.45] <Myke> bootang.
[12:37.01]  * Myke will not engage in high-risk behaviour this time
[12:37.08] <Myke> tho the VMs will be fscking on the way up
[12:37.10] <TSIGabe> let me bet, it runs some kind of modified kernel 
that Ben patched?
[12:37.19] <bcrl> not yet =)
[12:37.26] <Myke> soon
[12:37.39] <Myke> bcrl: d'you think this is something fixable, or should 
I just start getting hardware quotes now?
[12:38.04] <bcrl> it's fixable, but like i said, you're running hardware 
that developers no longer develop on
[12:38.40] <Myke> okay
[12:38.45] <Myke> only 2 VMs running on there now.
[12:38.57] <Myke> so nobody's going to fix it, unless you feel bored.
[12:39.40] <bcrl> at least you have oopses, so kvm developers can fix that
[12:40.27] <Myke> would you do the honours of submitting that? cuz I 
really don't have a clue where to start... or sound intelligent while 
discussing it ;)
[12:40.36] <Myke> I *will* follow-up to the proxmox mailing list
[12:41.50] <Myke> but what you're also telling me, is that my hardware 
IS NOT BAD, so it's likely safe for other roles
[12:41.57] <Myke> (ie: routers, FreeBSD jail servers, etc...)
[12:44.39] <bcrl> correctamundo
[12:45.10] <bcrl> virtualization is only just beginning to mature
[12:45.35] <Myke> sigh
[12:46.23] <Myke> The WinXP VM was doing > 100Mbit over NFS when the 
crash happened
[12:46.36]  * Myke is assembling a post to Proxmox ML
[13:02.26] <bcrl> i bet it's mmio related



