[PVE-User] Random kernel panics of my KVM VMs

Tue Aug 15 07:02:12 CEST 2017

Hello everyone.

I am not sure this is the right place to ask, but I am also not sure where
else to start, so this list seemed like a good choice. I would welcome any
pointers to a better place to look for a solution. :)

For quite some time now I have been having random kernel panics on random VMs.

I have a two-node cluster, currently running a fairly recent PVE version:

PVE Manager Version pve-manager/5.0-23/af4267bf

Now, these kernel panics have continued through several VM kernel upgrades,
and even after the 4.x to 5.x Proxmox upgrade several weeks ago. In addition,
I have moved VMs from one Proxmox node to the other to no avail, which should
rule out a hardware problem on either node.

Also, it does not seem to matter whether the VMs have their (QCOW2) disks on
the Proxmox node's local hardware RAID storage or on the Synology
NFS-connected storage.

I am trying to verify this by moving a few VMs that seem to panic more often
than others back to some local hardware RAID storage on one node as I write
this email...

Typically the kernel panics occur during the nightly backups of the VMs, but I
cannot say that this is always when they occur. I _can_ say that the kernel
panic always reports the sym53c8xx_2 module as the culprit though...

I have set up remote kernel logging on one VM and here is the kernel panic
reported:

----8<----
[138539.201838] Kernel panic - not syncing: assertion "i &&
sym_get_cam_status(cp->cmd) == DID_SOFT_ERROR" failed: file
"drivers/scsi/sym53c8xx_2/sym_hipd.c", line 3399
[138539.201838]
[138539.201838] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.9.34-gentoo #5
[138539.201838] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[138539.201838]  ffff88023fd03d90 ffffffff813a2408 ffff8800bb842700
ffffffff81c51450
[138539.201838]  ffff88023fd03e10 ffffffff8111ff3f ffff880200000020
ffff88023fd03e20
[138539.201838]  ffff88023fd03db8 ffffffff813c70f3 ffffffff81c517b0
ffffffff81c51400
[138539.201838] Call Trace:
[138539.201838]  <IRQ> [138539.201838]  [<ffffffff813a2408>] dump_stack+0x4d/0x65
[138539.201838]  [<ffffffff8111ff3f>] panic+0xca/0x203
[138539.201838]  [<ffffffff813c70f3>] ? swiotlb_unmap_sg_attrs+0x43/0x60
[138539.201838]  [<ffffffff815ff3af>] sym_interrupt+0x1bff/0x1dd0
[138539.201838]  [<ffffffff8163e888>] ? e1000_clean+0x358/0x880
[138539.201838]  [<ffffffff815f8fc7>] sym53c8xx_intr+0x37/0x80
[138539.201838]  [<ffffffff8109fa78>] __handle_irq_event_percpu+0x38/0x1a0
[138539.201838]  [<ffffffff8109fbfe>] handle_irq_event_percpu+0x1e/0x50
[138539.201838]  [<ffffffff8109fc57>] handle_irq_event+0x27/0x50
[138539.201838]  [<ffffffff810a2b39>] handle_fasteoi_irq+0x89/0x160
[138539.201838]  [<ffffffff8101ea5e>] handle_irq+0x6e/0x120
[138539.201838]  [<ffffffff81079315>] ? atomic_notifier_call_chain+0x15/0x20
[138539.201838]  [<ffffffff8101e346>] do_IRQ+0x46/0xd0
[138539.201838]  [<ffffffff818dafff>] common_interrupt+0x7f/0x7f
[138539.201838]  <EOI> [138539.201838]  [<ffffffff818d9e5b>] ?
default_idle+0x1b/0xd0
[138539.201838]  [<ffffffff81025eea>] arch_cpu_idle+0xa/0x10
[138539.201838]  [<ffffffff818da22e>] default_idle_call+0x1e/0x30
[138539.201838]  [<ffffffff81097105>] cpu_startup_entry+0xd5/0x1c0
[138539.201838]  [<ffffffff8103cd98>] start_secondary+0xe8/0xf0
[138539.201838] Shutting down cpus with NMI
[138539.201838] Kernel Offset: disabled
[138539.201838] ---[ end Kernel panic - not syncing: assertion "i &&
sym_get_cam_status(cp->cmd) == DID_SOFT_ERROR" failed: file
"drivers/scsi/sym53c8xx_2/sym_hipd.c", line 3399
----8<----
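
For anyone who wants to check their own remote kernel logs for the same
assertion, here is a minimal Python sketch. The function name, the SAMPLE
text, and the regular expression are illustrative assumptions based only on
the panic message above; they are not part of any Proxmox or kernel tooling.
Long log lines are often wrapped (as in this email), so the sketch collapses
whitespace before matching, and it skips the closing "---[ end Kernel panic"
line so that each panic is counted once.

```python
import re

# Matches the assertion-style panic line shown above. The negative lookbehind
# skips the trailing "---[ end Kernel panic ..." line that repeats the message.
PANIC_RE = re.compile(
    r'(?<!end )Kernel panic - not syncing: assertion "(?P<expr>[^"]+)" '
    r'failed: file "(?P<file>[^"]+)", line (?P<line>\d+)'
)

# Sample text copied from the panic above, including the mail line wrapping.
SAMPLE = '''
[138539.201838] Kernel panic - not syncing: assertion "i &&
sym_get_cam_status(cp->cmd) == DID_SOFT_ERROR" failed: file
"drivers/scsi/sym53c8xx_2/sym_hipd.c", line 3399
[138539.201838] Shutting down cpus with NMI
[138539.201838] ---[ end Kernel panic - not syncing: assertion "i &&
sym_get_cam_status(cp->cmd) == DID_SOFT_ERROR" failed: file
"drivers/scsi/sym53c8xx_2/sym_hipd.c", line 3399
'''


def find_panic_assertions(log_text):
    """Return (expression, source_file, line_number) for each panic found."""
    # Collapse newlines and runs of spaces so wrapped messages match again.
    flat = " ".join(log_text.split())
    return [
        (m.group("expr"), m.group("file"), int(m.group("line")))
        for m in PANIC_RE.finditer(flat)
    ]


if __name__ == "__main__":
    for expr, source_file, line_number in find_panic_assertions(SAMPLE):
        print(f"{source_file}:{line_number}: assertion failed: {expr}")
```

Running the same function over the collected logs of several VMs would show
whether every panic really points at the same file and line.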

The dmesg output on the Proxmox nodes does not show any issues during the
times of these VM kernel panics.

I appreciate any comments, questions, or some direction on this.

Thank you,

Bill

-- 
Bill Arlofski
Reverse Polarity, LLC
http://www.revpol.com/blogs/waa
-------------------------------
He picks up scraps of information
He's adept at adaptation

--[ Not responsible for anything below this line ]--