[PVE-User] HPSA kernel bug

Wed Feb 29 18:45:48 CET 2012

I am seeing the following on an HP system with a Smart Array P410 
controller running Proxmox 1.9.

After some time (apparently when disk activity increases), this message 
appears in "dmesg" and the system can no longer access the disks (a 
longer log follows later in this mail).

"hpsa 0000:04:00.0: resetting device 6:0:0:1"
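
To check how often the controller issues such resets, the kernel log 
can be scanned for this message. The following is only a minimal 
sketch: the function name find_hpsa_resets and the regular expression 
are assumptions based on the single message quoted above, not 
something taken from the hpsa driver documentation.

```python
import re

# Assumed pattern, modelled on the one line quoted in this mail:
#   hpsa 0000:04:00.0: resetting device 6:0:0:1
RESET_RE = re.compile(
    r"hpsa (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"resetting device (?P<scsi>\d+:\d+:\d+:\d+)"
)


def find_hpsa_resets(log_text):
    """Return (pci_address, scsi_address) pairs, one per reset message."""
    return [(m.group("pci"), m.group("scsi"))
            for m in RESET_RE.finditer(log_text)]


sample = "hpsa 0000:04:00.0: resetting device 6:0:0:1"
print(find_hpsa_resets(sample))  # [('0000:04:00.0', '6:0:0:1')]
```

The function takes the log as a string, so it can be fed the saved 
output of "dmesg" or the contents of a kernel log file.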

The array consists of 4x 600GB SAS disks configured as RAID 1+0.

I'm not sure whether this is a driver bug or defective hardware. The 
system is new and freshly installed, with one Windows 2008 server and 
one OpenVZ container.

Any pointers would be appreciated.

Thanks,
Alessandro

-----------------------------------------------------------
proxmox:~# pveversion  --verbose
pve-manager: 1.9-26 (pve-manager/1.9/6567)
running kernel: 2.6.32-7-pve
proxmox-ve-2.6.32: 1.9-55+ovzfix-2
pve-kernel-2.6.32-6-pve: 2.6.32-55+ovzfix-1
pve-kernel-2.6.32-7-pve: 2.6.32-55+ovzfix-2
qemu-server: 1.1-32
pve-firmware: 1.0-15
libpve-storage-perl: 1.0-19
vncterm: 0.9-2
vzctl: 3.0.29-3pve1
vzdump: 1.2-16
vzprocps: 2.0.11-2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.15.0-2
ksm-control-daemon: 1.0-6
proxmox:~# uname -a
Linux proxmox 2.6.32-7-pve #1 SMP Mon Feb 13 07:33:21 CET 2012 x86_64 GNU/Linux
-----------------------------------------------------------

hpsa 0000:04:00.0: resetting device 6:0:0:1
INFO: task scsi_eh_6:937 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
scsi_eh_6     D ffff88043ada6d90     0   937      2    0 0x00000000
  ffff88043ada9b90 0000000000000046 0000000000000000 0000000081326931
  0000000000000041 000000000000f6c8 ffff88043ada9fd8 ffff88043ada9fd8
  ffff88043ada6d90 ffff88043dc56e10 ffff88043ada7348 000000010116be33
Call Trace:
  [<ffffffff81136f16>] ? __alloc_pages_nodemask+0x1a6/0xb80
  [<ffffffff81511935>] schedule_timeout+0x235/0x2d0
  [<ffffffff8106e586>] ? vprintk+0x36/0x50
  [<ffffffff815115c0>] wait_for_common+0x150/0x180
  [<ffffffff8105ffa0>] ? default_wake_function+0x0/0x20
  [<ffffffffa00420a5>] ? enqueue_cmd_and_start_io+0x165/0x180 [hpsa]
  [<ffffffff815116ad>] wait_for_completion+0x1d/0x20
  [<ffffffffa0047482>] hpsa_eh_device_reset_handler+0x172/0x41c [hpsa]
  [<ffffffff81365e34>] scsi_eh_ready_devs+0x224/0x870
  [<ffffffff81065fe7>] ? enqueue_task_fair+0x67/0x100
  [<ffffffff81366b07>] scsi_error_handler+0x497/0x630
  [<ffffffff8105ffb2>] ? default_wake_function+0x12/0x20
  [<ffffffff81366670>] ? scsi_error_handler+0x0/0x630
  [<ffffffff81096ca6>] kthread+0x96/0xb0
  [<ffffffff8100c34a>] child_rip+0xa/0x20
  [<ffffffff81096c10>] ? kthread+0x0/0xb0
  [<ffffffff8100c340>] ? child_rip+0x0/0x20
INFO: task kjournald:1423 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald     D ffff88043bb22b10     0  1423      2    0 0x00000000
  ffff88043aa51d40 0000000000000046 0000000000000000 0000000000000001
  ffff88083e7d0400 000000000000f6c8 ffff88043aa51fd8 ffff88043aa51fd8
  ffff88043bb22b10 ffff88043e6baf50 ffff88043bb230c8 000000010116aaa0
Call Trace:
  [<ffffffff8100986d>] ? __switch_to+0xcd/0x320
  [<ffffffffa00ac7ec>] journal_commit_transaction+0x19c/0x1410 [jbd]
  [<ffffffff810972d0>] ? autoremove_wake_function+0x0/0x40
  [<ffffffff81080a4c>] ? try_to_del_timer_sync+0xac/0xe0
  [<ffffffffa00b2a3d>] kjournald+0xed/0x240 [jbd]
  [<ffffffff810972d0>] ? autoremove_wake_function+0x0/0x40
  [<ffffffffa00b2950>] ? kjournald+0x0/0x240 [jbd]
  [<ffffffff81096ca6>] kthread+0x96/0xb0
  [<ffffffff8100c34a>] child_rip+0xa/0x20
  [<ffffffff81096c10>] ? kthread+0x0/0xb0
  [<ffffffff8100c340>] ? child_rip+0x0/0x20
....
etc.
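
Since the full log is long, a short script can list which tasks were 
reported as blocked. This is a rough sketch under the assumption that 
every report uses the same "INFO: task NAME:PID blocked for more than 
N seconds." wording seen above; the function name find_hung_tasks is 
made up for illustration.

```python
import re

# Assumed pattern, modelled on the hung-task lines in the log above:
#   INFO: task scsi_eh_6:937 blocked for more than 120 seconds.
HUNG_RE = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (task_name, pid, seconds) for each hung-task report."""
    return [(m.group("name"), int(m.group("pid")), int(m.group("secs")))
            for m in HUNG_RE.finditer(log_text)]


sample = (
    "INFO: task scsi_eh_6:937 blocked for more than 120 seconds.\n"
    "INFO: task kjournald:1423 blocked for more than 120 seconds.\n"
)
for name, pid, secs in find_hung_tasks(sample):
    print(f"{name} (pid {pid}) blocked for more than {secs}s")
```

On the excerpt above this lists scsi_eh_6 first and kjournald second, 
the same order in which they appear in the log.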