[PVE-User] smartd - Bad IEC (SMART) mode page

Murphy Lawson murphy.lawson at outlook.com
Fri Jul 7 17:46:48 CEST 2017


Hi Everyone,

A colleague recently deployed 3 servers, all running PVE (Virtual Environment 4.4-1), and all three report SMART errors, but only for one drive in an LSI/Avago RAID array.

When running a single scan with 'smartd -q onecheck' the following error is returned:

Device: /dev/bus/0 [megaraid_disk_05], [SEAGATE  ST2000NM0045     N002], lu id: 0x4f6a1e4f8c6b, S/N: AB123412341234, 2.00 TB
Device: /dev/bus/0 [megaraid_disk_05], Bad IEC (SMART) mode page, err=-5, skip device
Unable to register SCSI device /dev/bus/0 [megaraid_disk_05] at line 21 of file /etc/smartd.conf

In this instance the server had been up for just under an hour:
15:07:00 up 54 min,  1 user,  load average: 0.09, 0.06, 0.01

When the error is seen, it is no longer possible to control the SMART status on the reported drive and no other details are available (e.g. temperature):
# smartctl -s off /dev/bus/0 -d megaraid,5

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.35-1-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
unable to fetch IEC (SMART) mode page [Input/output error]

Only the above error, returned when the drive is queried, is consistent; the error logged when smartd is running as a daemon varies on each server:

smartd[1649]: Device: /dev/bus/0 [megaraid_disk_05], failed to read Temperature
smartd[2050]: Device: /dev/bus/0 [megaraid_disk_04], Read SMART Self-Test Log Failed
smartd[1422]: Device: /dev/bus/0 [megaraid_disk_04], failed to read SMART values
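
Since the affected disk differs between servers (megaraid_disk_04 on some, megaraid_disk_05 on another), it may help to probe every physical disk behind the controller in one pass. The following is only a rough sketch, not a tested diagnostic: the list of disk IDs (0 to 5) and the helper names describe_status and probe_all are assumptions for illustration, and whether the failure described here sets bit 2 of the smartctl exit status is also an assumption. The meaning of the low exit-status bits is taken from smartctl(8).

```shell
#!/bin/sh
# Sketch: probe each physical disk behind the MegaRAID controller and
# report which ones no longer answer SMART queries.
# DISK_IDS is an assumption; adjust it to the IDs present on the controller.
DISK_IDS="${DISK_IDS:-0 1 2 3 4 5}"
DEVICE="${DEVICE:-/dev/bus/0}"

# Decode the low bits of smartctl's exit status (see smartctl(8)):
#   bit 0 = command line did not parse
#   bit 1 = device open failed
#   bit 2 = a SMART (or other) command to the disk failed
describe_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ $((status & 1)) -ne 0 ]; then
        echo "bad command line"
    elif [ $((status & 2)) -ne 0 ]; then
        echo "device open failed"
    elif [ $((status & 4)) -ne 0 ]; then
        echo "SMART command failed"
    else
        echo "disk reports problems (status $status)"
    fi
}

# Query every disk ID in turn and print one summary line per disk.
probe_all() {
    for id in $DISK_IDS; do
        smartctl -a -d "megaraid,$id" "$DEVICE" >/dev/null 2>&1
        rc=$?
        echo "megaraid,$id: $(describe_status "$rc")"
    done
}

# Only touch the hardware when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
    probe_all
fi
```

Saved under a name of your choosing and run as root with the --run argument, this prints one line per disk ID, which should make it easier to see whether only a single disk per server is affected.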

The error only seems to occur after the server has been running for a while, and the only way I have found to clear it is to reboot or power down the servers.
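
To narrow down how long after boot the error first appears, something like the sketch below could be run periodically (for example from cron). It is an illustration under assumptions: the log path, the default disk ID of 5 and the helper names has_iec_error and check_once are hypothetical, and it matches on the "IEC (SMART) mode page" string quoted in the smartctl output earlier in this message. Note that it issues 'smartctl -s on', which tries to enable SMART on the drive.

```shell
#!/bin/sh
# Sketch: record when a disk first stops returning its IEC mode page.
DEVICE="${DEVICE:-/dev/bus/0}"
DISK_ID="${DISK_ID:-5}"                        # assumption: disk to watch
LOGFILE="${LOGFILE:-/var/tmp/iec-watch.log}"   # hypothetical log path

# Succeeds when the text on stdin contains the error string quoted above.
has_iec_error() {
    grep -q 'IEC (SMART) mode page'
}

# Run one query and append a timestamped line, plus the current uptime,
# to the log if the IEC mode page error is seen.
check_once() {
    if smartctl -s on -d "megaraid,$DISK_ID" "$DEVICE" 2>&1 | has_iec_error; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') megaraid,$DISK_ID IEC error; $(uptime)" >> "$LOGFILE"
    fi
}

# Only touch the hardware when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
    check_once
fi
```

The first entry in the log after a reboot would then give an approximate uptime at which the drive stopped responding, to within the polling interval.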

The drives are using the 'megaraid_sas' kernel module, which I've tried to reload, but as the drives are active this is not possible. Further hardware details below:

scsi host0: Avago SAS based MegaRAID driver
81:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)

Kernel version is:
Linux SERVERNAME 4.4.35-1-pve #1 SMP Fri Dec 9 11:09:55 CET 2016 x86_64 GNU/Linux

I've checked the smartmontools forum and there doesn't appear to have been any recent report of this issue. There are a few similar reports from Red Hat, but these date from before these controllers were supported by the kernel.

Has anyone else come across this issue? 

I've changed the RAID controller settings so that it does not power down any spare/unused drives, in case that is what's occurring, but in case this doesn't resolve it, any other advice would be appreciated.

Thanks in advance

Murphy


