[PVE-User] Proxmox 6 - disk problem

lord_Niedzwiedz sir_Misiek1 at o2.pl
Thu Aug 22 10:26:07 CEST 2019


Hello,

The disks are NVMe (M.2), installed via PCIe adapter cards.
Until now everything worked fine and never hung on Proxmox 5.4.

root@tomas:/var/log# smartctl -a /dev/sda
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda ES.2
Device Model:     ST31000340NS
Serial Number:    9QJ2LV6L
LU WWN Device Id: 5 000c50 01082a141
Firmware Version: SN05
User Capacity:    1,000,203,804,160 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Thu Aug 22 10:16:10 2019 CEST

==> WARNING: There are known problems with these drives,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/207963en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                     was completed without error.
                     Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                     without error or no self-test has ever
                     been run.
Total time to complete Offline
data collection:         (  642) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                     Auto Offline data collection on/off support.
                     Suspend Offline collection upon new
                     command.
                     Offline surface scan supported.
                     Self-test supported.
                     Conveyance Self-test supported.
                     Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                     power-saving mode.
                     Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                     General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 237) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                     SCT Error Recovery Control supported.
                     SCT Feature Control supported.
                     SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       60157077
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       119
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       22054801530
  9 Power_On_Hours          0x0032   089   011   000    Old_age   Always       -       10379
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always       -       120
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   048   045    Old_age   Always       -       33 (Min/Max 28/34)
194 Temperature_Celsius     0x0022   033   052   000    Old_age   Always       -       33 (0 15 0 0 0)
195 Hardware_ECC_Recovered  0x001a   027   008   000    Old_age   Always       -       60157077
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0


SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
     1        0        0  Not_testing
     2        0        0  Not_testing
     3        0        0  Not_testing
     4        0        0  Not_testing
     5        0        0  Not_testing
Selective self-test flags (0x0):
   After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
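
To pick the failure-related attributes out of an ATA report like the one above, a small filter can help. This is only a minimal sketch: the function name `ata_suspects` and the choice of attribute IDs (5, 187, 188, 197, 198, 199) are assumptions made for illustration, not an official list. It expects one attribute per line, as `smartctl -A` normally prints them.

```shell
# Print "ID NAME RAW_VALUE" for selected ATA SMART attributes whose raw
# value is non-zero. Reads `smartctl -A` style output on stdin.
# The set of attribute IDs below is an illustrative assumption.
ata_suspects() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # column 10 is the first token of RAW_VALUE.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            id = $1
            if (id == 5 || id == 187 || id == 188 ||
                id == 197 || id == 198 || id == 199) {
                if ($10 + 0 != 0) print id, $2, $10
            }
        }
    '
}

# Usage (needs root and smartmontools installed):
#   smartctl -A /dev/sda | ata_suspects
```

For the sda report above, the only attribute in that set with a non-zero raw value is Reallocated_Sector_Ct (3).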




root@tomas:/var/log# smartctl -a /dev/nvme0n1
=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO Plus 500GB
Serial Number:                      S4EVNG0M138594B
Firmware Version:                   1B2QEXM7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 500,107,862,016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Utilization:            500,021,350,400 [500 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5191511acc
Local Time is:                      Thu Aug 22 10:16:15 2019 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
  0 +     7.80W       -        -    0  0  0  0        0       0
  1 +     6.00W       -        -    1  1  1  1        0       0
  2 +     3.40W       -        -    2  2  2  2        0       0
  3 -   0.0700W       -        -    3  3  3  3      210    1200
  4 -   0.0100W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
  0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        31 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    6,383,822 [3.26 TB]
Data Units Written:                 5,347,865 [2.73 TB]
Host Read Commands:                 36,439,507
Host Write Commands:                19,952,933
Controller Busy Time:               98
Power Cycles:                       76
Power On Hours:                     96
Unsafe Shutdowns:                   47
Media and Data Integrity Errors:    0
Error Information Log Entries:      2
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               31 Celsius
Temperature Sensor 2:               26 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged



root@tomas:/var/log# smartctl -a /dev/nvme1n1
=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO Plus 500GB
Serial Number:                      S4EVNG0M134497V
Firmware Version:                   1B2QEXM7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 500,107,862,016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Utilization:            499,998,076,928 [499 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5191510465
Local Time is:                      Thu Aug 22 10:16:17 2019 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
  0 +     7.80W       -        -    0  0  0  0        0       0
  1 +     6.00W       -        -    1  1  1  1        0       0
  2 +     3.40W       -        -    2  2  2  2        0       0
  3 -   0.0700W       -        -    3  3  3  3      210    1200
  4 -   0.0100W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
  0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    9,246,297 [4.73 TB]
Data Units Written:                 7,122,074 [3.64 TB]
Host Read Commands:                 39,558,971
Host Write Commands:                8,371,335
Controller Busy Time:               74
Power Cycles:                       30
Power On Hours:                     2,702
Unsafe Shutdowns:                   17
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               36 Celsius
Temperature Sensor 2:               37 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

root@tomas:/var/log# smartctl -a /dev/nvme2n1
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.0.15-1-pve] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 500GB
Serial Number:                      S466NB0K630810T
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 500,107,862,016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Utilization:            500,107,853,824 [500 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5681b0985d
Local Time is:                      Thu Aug 22 10:16:19 2019 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
  0 +     6.20W       -        -    0  0  0  0        0       0
  1 +     4.30W       -        -    1  1  1  1        0       0
  2 +     2.10W       -        -    2  2  2  2        0       0
  3 -   0.0400W       -        -    3  3  3  3      210    1200
  4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
  0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        28 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    2%
Data Units Read:                    12,332,923 [6.31 TB]
Data Units Written:                 16,309,063 [8.35 TB]
Host Read Commands:                 670,325,851
Host Write Commands:                594,935,440
Controller Busy Time:               925
Power Cycles:                       120
Power On Hours:                     1,562
Unsafe Shutdowns:                   68
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               28 Celsius
Temperature Sensor 2:               29 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged



root@tomas:/var/log# smartctl -a /dev/nvme3n1
=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 500GB
Serial Number:                      S466NX0K939382F
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 500,107,862,016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5981b1a75f
Local Time is:                      Thu Aug 22 10:16:21 2019 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
  0 +     6.20W       -        -    0  0  0  0        0       0
  1 +     4.30W       -        -    1  1  1  1        0       0
  2 +     2.10W       -        -    2  2  2  2        0       0
  3 -   0.0400W       -        -    3  3  3  3      210    1200
  4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
  0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        27 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    2%
Data Units Read:                    10,105,558 [5.17 TB]
Data Units Written:                 16,223,988 [8.30 TB]
Host Read Commands:                 654,021,540
Host Write Commands:                594,078,253
Controller Busy Time:               930
Power Cycles:                       96
Power On Hours:                     1,540
Unsafe Shutdowns:                   51
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               27 Celsius
Temperature Sensor 2:               28 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

root@tomas:/var/log# smartctl -a /dev/nvme4n1
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.0.15-1-pve] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 500GB
Serial Number:                      S466NB0K630742Y
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 500,107,862,016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Utilization:            498,767,261,696 [498 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5681b09819
Local Time is:                      Thu Aug 22 10:16:24 2019 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
  0 +     6.20W       -        -    0  0  0  0        0       0
  1 +     4.30W       -        -    1  1  1  1        0       0
  2 +     2.10W       -        -    2  2  2  2        0       0
  3 -   0.0400W       -        -    3  3  3  3      210    1200
  4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
  0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        27 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    2%
Data Units Read:                    14,121,818 [7.23 TB]
Data Units Written:                 15,364,291 [7.86 TB]
Host Read Commands:                 668,618,811
Host Write Commands:                581,016,189
Controller Busy Time:               969
Power Cycles:                       102
Power On Hours:                     1,587
Unsafe Shutdowns:                   56
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               27 Celsius
Temperature Sensor 2:               27 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
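
The five NVMe reports above are long, and only a few fields say much about drive health. As a rough sketch, the fields can be pulled out with a small filter; the function name `nvme_health` and the selection of fields are assumptions made here, not something smartctl provides.

```shell
# Print selected health fields as "Name=value" from `smartctl -a` NVMe
# output on stdin. Which fields are worth watching is an assumption.
nvme_health() {
    awk -F': +' '
        /^(Critical Warning|Percentage Used|Unsafe Shutdowns|Media and Data Integrity Errors|Error Information Log Entries):/ {
            print $1 "=" $2
        }
    '
}

# Usage (needs root and smartmontools installed):
#   for dev in /dev/nvme[0-9]n1; do
#       echo "== $dev"
#       smartctl -a "$dev" | nvme_health
#   done
```

In the reports above, all five drives show Critical Warning 0x00 and zero media errors; the unsafe shutdown counts are 47, 17, 68, 51 and 56, and only nvme0n1 has error log entries (2).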


On 22.08.2019 at 10:07, Eneko Lacunza wrote:
> Hi,
>
> So what disks/RAID controller are there on the server? :)
>
> My guess is a disk has failed :) Did you try smartctl?
>
> Also, I think attachments are stripped off :)
>
> Cheers
>
> On 22/8/19 at 10:03, lord_Niedzwiedz wrote:
>> CPU usage 0.04% of 32 CPU(s)
>> IO delay    20.38%        !!
>> Load average    37.97,37.26,30.31
>> RAM usage    45.25% (56.93 GiB of 125.81 GiB)
>> KSM sharing    0 B
>> HD space(root)    0.53% (1.32 GiB of 247.29 GiB)
>> SWAP usage        N/A
>> CPU(s)        32 x AMD EPYC 7281 16-Core Processor (1 Socket)
>> Kernel Version        Linux 5.0.15-1-pve #1 SMP PVE 5.0.15-1 (Wed, 03 Jul 2019 10:51:57 +0200)
>> PVE Manager Version        pve-manager/6.0-4/2a719255
>>
>> Proxmox is running very slowly.
>> I stopped all VMs.
>>
>> htop     -    shows nothing unusual
>> iotop    -    shows nothing unusual
>>
>>
>> If I run the command:
>> # sync
>> - the shell hangs !! ;/
>>
>>
>> The same happens with:
>> root@tomas:~# pveperf
>> CPU BOGOMIPS:      134377.28
>> REGEX/SECOND:      2100393
>> HD SIZE:           247.29 GB (rpool/ROOT/pve-1)
>> FSYNCS/SECOND:     531.28
>>
>> ^C^Z
>> [1]+  Stopped                 pveperf
>> root@tomas:~# ^C
>>
>> After this:    IO delay         40%
>>
>>
>> On the physical console I have:
>> INFO: task zvol:554 blocked for more than 120 seconds.
>> Tainted:    P    O    5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task txg_quiesce:1007 blocked for more than 120 seconds.
>> Tainted:    P    O    5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task kvm:27326 blocked for more than 120 seconds.
>> Tainted:    P    O    5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task kvm:8930 blocked for more than 120 seconds.
>> Tainted:    P    O    5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task zvol:26963 blocked for more than 120 seconds.
>> Tainted:    P    O    5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task zvol:26967 blocked for more than 120 seconds.
>> Tainted:    P    O    5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task zvol:26972 blocked for more than 120 seconds.
>> Tainted:    P    O    5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task zvol:26974 blocked for more than 120 seconds.
>> Tainted:    P    O    5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task zvol:26976 blocked for more than 120 seconds.
>> Tainted:    P    O    5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task zvol:26980 blocked for more than 120 seconds.
>>
>> At the end, during the restart, I have:
>> [  !!  ]  Forcibly rebooting: Ctrl-Alt-Del was pressed more than 7 times within 2s
>> Systemd-shutdown[1]: Syncing filesystems and block devices - timed out, issuing SIGKILL to PID 3940.
>> Started bpfilter
>> pvefw-logger [24351]: received terminate request (signal)
>> pvefw-logger [24351]: stopping pvefw logger
>>
>> The server does not stop or restart   ;-/
>> Any ideas        ??!!
>>
>> Log file attached.
>
>
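
The "blocked for more than 120 seconds" messages quoted above come from tasks stuck in uninterruptible sleep (state D). Linux counts such tasks in the load average, so a load near 38 with almost no CPU use is consistent with many tasks waiting on I/O. A minimal sketch for listing them follows; the function name `blocked_tasks` is an assumption, and command names containing spaces are cut to their first word.

```shell
# List "PID COMMAND" for tasks in uninterruptible sleep (state D).
# Reads `ps -eo state,pid,comm` output on stdin; the header row and all
# other states are ignored because their first column is not "D".
blocked_tasks() {
    awk '$1 == "D" { print $2, $3 }'
}

# Usage:
#   ps -eo state,pid,comm | blocked_tasks
```

If the list keeps showing the same zvol, txg_quiesce or kvm tasks over several runs, that points at storage I/O that never completes rather than at CPU load.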


