[PVE-User] Proxmox 6 - disk problem
lord_Niedzwiedz
sir_Misiek1 at o2.pl
Thu Aug 22 10:26:07 CEST 2019
Hello,
The disks are NVMe (M.2), installed on PCIe adapter cards.
Until now everything worked fine; the server never hung on Proxmox 5.4.
root@tomas:/var/log# smartctl -a /dev/sda
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda ES.2
Device Model: ST31000340NS
Serial Number: 9QJ2LV6L
LU WWN Device Id: 5 000c50 01082a141
Firmware Version: SN05
User Capacity: 1,000,203,804,160 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Thu Aug 22 10:16:10 2019 CEST
==> WARNING: There are known problems with these drives,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/207963en
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 642) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 237) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   063   044    Pre-fail  Always       -       60157077
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       119
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       22054801530
  9 Power_On_Hours          0x0032   089   011   000    Old_age   Always       -       10379
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always       -       120
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   048   045    Old_age   Always       -       33 (Min/Max 28/34)
194 Temperature_Celsius     0x0022   033   052   000    Old_age   Always       -       33 (0 15 0 0 0)
195 Hardware_ECC_Recovered  0x001a   027   008   000    Old_age   Always       -       60157077
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
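For anyone who wants to screen the SATA attribute table mechanically instead of by eye, here is a minimal sketch in Python. It assumes each attribute row has been joined onto a single line, and the watch list of sector counters is my own choice, not something smartctl defines. The sample rows are copied from the /dev/sda output above.

```python
import re

# One attribute row as printed by smartctl, joined onto a single line:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)

# Attributes whose raw value counts sectors. This watch list is an assumption.
SECTOR_COUNTERS = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def parse_rows(text):
    """Return one dict per attribute row found in the text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(4)),
                "worst": int(m.group(5)),
                "thresh": int(m.group(6)),
                "raw": m.group(10).strip(),
            })
    return rows


def findings(rows):
    """List attributes at or below threshold, or with non-zero sector counts."""
    out = []
    for r in rows:
        if r["thresh"] > 0 and r["value"] <= r["thresh"]:
            out.append(
                f"{r['name']}: value {r['value']} at or below threshold {r['thresh']}"
            )
        first = r["raw"].split()[0] if r["raw"] else ""
        if r["name"] in SECTOR_COUNTERS and first.isdigit() and int(first) > 0:
            out.append(f"{r['name']}: raw count {first}")
    return out


SAMPLE = """\
  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 3
  9 Power_On_Hours 0x0032 089 011 000 Old_age Always - 10379
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
"""

if __name__ == "__main__":
    for finding in findings(parse_rows(SAMPLE)):
        print(finding)
```

On the rows above this only reports the three reallocated sectors on /dev/sda; nothing is at or below its threshold.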
root@tomas:/var/log# smartctl -a /dev/nvme0n1
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO Plus 500GB
Serial Number: S4EVNG0M138594B
Firmware Version: 1B2QEXM7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 500,107,862,016 [500 GB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
Namespace 1 Utilization: 500,021,350,400 [500 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5191511acc
Local Time is: Thu Aug 22 10:16:15 2019 CEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.80W - - 0 0 0 0 0 0
1 + 6.00W - - 1 1 1 1 0 0
2 + 3.40W - - 2 2 2 2 0 0
3 - 0.0700W - - 3 3 3 3 210 1200
4 - 0.0100W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 31 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 6,383,822 [3.26 TB]
Data Units Written: 5,347,865 [2.73 TB]
Host Read Commands: 36,439,507
Host Write Commands: 19,952,933
Controller Busy Time: 98
Power Cycles: 76
Power On Hours: 96
Unsafe Shutdowns: 47
Media and Data Integrity Errors: 0
Error Information Log Entries: 2
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 31 Celsius
Temperature Sensor 2: 26 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
root@tomas:/var/log# smartctl -a /dev/nvme1n1
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO Plus 500GB
Serial Number: S4EVNG0M134497V
Firmware Version: 1B2QEXM7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 500,107,862,016 [500 GB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
Namespace 1 Utilization: 499,998,076,928 [499 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5191510465
Local Time is: Thu Aug 22 10:16:17 2019 CEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.80W - - 0 0 0 0 0 0
1 + 6.00W - - 1 1 1 1 0 0
2 + 3.40W - - 2 2 2 2 0 0
3 - 0.0700W - - 3 3 3 3 210 1200
4 - 0.0100W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 36 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 9,246,297 [4.73 TB]
Data Units Written: 7,122,074 [3.64 TB]
Host Read Commands: 39,558,971
Host Write Commands: 8,371,335
Controller Busy Time: 74
Power Cycles: 30
Power On Hours: 2,702
Unsafe Shutdowns: 17
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 36 Celsius
Temperature Sensor 2: 37 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
root@tomas:/var/log# smartctl -a /dev/nvme2n1
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.0.15-1-pve] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO 500GB
Serial Number: S466NB0K630810T
Firmware Version: 2B2QEXE7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 500,107,862,016 [500 GB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
Namespace 1 Utilization: 500,107,853,824 [500 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5681b0985d
Local Time is: Thu Aug 22 10:16:19 2019 CEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.20W - - 0 0 0 0 0 0
1 + 4.30W - - 1 1 1 1 0 0
2 + 2.10W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 28 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 2%
Data Units Read: 12,332,923 [6.31 TB]
Data Units Written: 16,309,063 [8.35 TB]
Host Read Commands: 670,325,851
Host Write Commands: 594,935,440
Controller Busy Time: 925
Power Cycles: 120
Power On Hours: 1,562
Unsafe Shutdowns: 68
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 28 Celsius
Temperature Sensor 2: 29 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
root@tomas:/var/log# smartctl -a /dev/nvme3n1
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO 500GB
Serial Number: S466NX0K939382F
Firmware Version: 2B2QEXE7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 500,107,862,016 [500 GB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5981b1a75f
Local Time is: Thu Aug 22 10:16:21 2019 CEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.20W - - 0 0 0 0 0 0
1 + 4.30W - - 1 1 1 1 0 0
2 + 2.10W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 27 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 2%
Data Units Read: 10,105,558 [5.17 TB]
Data Units Written: 16,223,988 [8.30 TB]
Host Read Commands: 654,021,540
Host Write Commands: 594,078,253
Controller Busy Time: 930
Power Cycles: 96
Power On Hours: 1,540
Unsafe Shutdowns: 51
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 27 Celsius
Temperature Sensor 2: 28 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
root@tomas:/var/log# smartctl -a /dev/nvme4n1
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.0.15-1-pve] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO 500GB
Serial Number: S466NB0K630742Y
Firmware Version: 2B2QEXE7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 500,107,862,016 [500 GB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
Namespace 1 Utilization: 498,767,261,696 [498 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5681b09819
Local Time is: Thu Aug 22 10:16:24 2019 CEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.20W - - 0 0 0 0 0 0
1 + 4.30W - - 1 1 1 1 0 0
2 + 2.10W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 27 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 2%
Data Units Read: 14,121,818 [7.23 TB]
Data Units Written: 15,364,291 [7.86 TB]
Host Read Commands: 668,618,811
Host Write Commands: 581,016,189
Controller Busy Time: 969
Power Cycles: 102
Power On Hours: 1,587
Unsafe Shutdowns: 56
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 27 Celsius
Temperature Sensor 2: 27 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
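One thing that stands out when the five NVMe reports are put side by side is the share of power cycles counted as unsafe shutdowns. Below is a small sketch in Python that puts the numbers in one place; the figures are copied from the smartctl output above, and the helper names are made up for illustration. As far as I understand, the counter increments when the drive loses power without a prior shutdown notification, so hard resets of a hung host would plausibly add to it. That reading is an assumption, not a diagnosis.

```python
# (device, power_cycles, unsafe_shutdowns, error_log_entries), copied from
# the smartctl output in this post.
DRIVES = [
    ("nvme0n1", 76, 47, 2),
    ("nvme1n1", 30, 17, 0),
    ("nvme2n1", 120, 68, 0),
    ("nvme3n1", 96, 51, 0),
    ("nvme4n1", 102, 56, 0),
]


def unsafe_ratio(power_cycles, unsafe_shutdowns):
    """Fraction of power cycles that the drive counted as unsafe shutdowns."""
    if power_cycles == 0:
        return 0.0
    return unsafe_shutdowns / power_cycles


def report(drives):
    """Return one summary line per drive."""
    lines = []
    for name, cycles, unsafe, errors in drives:
        pct = 100.0 * unsafe_ratio(cycles, unsafe)
        lines.append(
            f"{name}: {unsafe}/{cycles} power cycles ended unsafely "
            f"({pct:.0f}%), {errors} error log entries"
        )
    return lines


if __name__ == "__main__":
    for line in report(DRIVES):
        print(line)
```

On every drive more than half of the power cycles were counted as unsafe, and only nvme0n1 has entries in its error information log.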
On 22.08.2019 at 10:07, Eneko Lacunza wrote:
> Hi,
>
> So what disks/RAID controller are there on the server? :)
>
> My guess is that a disk has failed :) Did you try smartctl?
>
> Also, I think attachments are stripped off :)
>
> Cheers
>
> On 22/8/19 at 10:03, lord_Niedzwiedz wrote:
>> CPU usage 0.04% of 32 CPU(s)
>> IO delay 20.38% !!
>> Load average 37.97,37.26,30.31
>> RAM usage 45.25% (56.93 GiB of 125.81 GiB)
>> KSM sharing 0 B
>> HD space(root) 0.53% (1.32 GiB of 247.29 GiB)
>> SWAP usage N/A
>> CPU(s) 32 x AMD EPYC 7281 16-Core Processor (1 Socket)
>> Kernel Version Linux 5.0.15-1-pve #1 SMP PVE 5.0.15-1 (Wed, 03
>> Jul 2019 10:51:57 +0200)
>> PVE Manager Version pve-manager/6.0-4/2a719255
>>
>> Proxmox is working very slowly.
>> I stopped all VMs.
>>
>> htop shows nothing unusual.
>> iotop shows nothing unusual.
>>
>>
>> If I run the command:
>> # sync
>> the shell just hangs, waiting !! ;/
>>
>>
>> The same happens with pveperf:
>> root@tomas:~# pveperf
>> CPU BOGOMIPS: 134377.28
>> REGEX/SECOND: 2100393
>> HD SIZE: 247.29 GB (rpool/ROOT/pve-1)
>> FSYNCS/SECOND: 531.28
>>
>> ^C^Z
>> [1]+ Stopped pveperf
>> root@tomas:~# ^C
>>
>> After this: IO delay 40%
>>
>>
>> On the physical console I have:
>> INFO: task zvol:554 blocked for more than 120 seconds.
>> Tainted: P O 5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task txg_quiesce:1007 blocked for more than 120 seconds.
>> Tainted: P O 5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task kvm:27326 blocked for more than 120 seconds.
>> Tainted: P O 5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task kvm:8930 blocked for more than 120 seconds.
>> Tainted: P O 5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task zvol:26963 blocked for more than 120 seconds.
>> Tainted: P O 5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task zvol:26967 blocked for more than 120 seconds.
>> Tainted: P O 5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task zvol:26972 blocked for more than 120 seconds.
>> Tainted: P O 5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task zvol:26974 blocked for more than 120 seconds.
>> Tainted: P O 5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task zvol:26976 blocked for more than 120 seconds.
>> Tainted: P O 5.0.15-1-pve #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> INFO: task zvol:26980 blocked for more than 120 seconds.
>>
>> At the end, during the restart, I have:
>> [ !! ] Forcibly rebooting: Ctrl-Alt-Del was pressed more than 7
>> times within 2s
>> systemd-shutdown[1]: Syncing filesystems and block devices - timed
>> out, issuing SIGKILL to PID 3940.
>> Started bpfilter
>> pvefw-logger [24351]: received terminate request (signal)
>> pvefw-logger [24351]: stopping pvefw logger
>>
>> The server does not stop or restart ;-/
>> Any ideas ??!!
>>
>> Log file attached.
>>
>> _______________________________________________
>> pve-user mailing list
>> pve-user at pve.proxmox.com
>> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
>
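About the quoted load figures: on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so a load near 38 with almost no CPU usage fits the picture of many tasks stuck waiting on I/O, like the zvol and kvm tasks in the hung-task messages. Here is a minimal sketch in Python for listing such tasks by reading /proc; the function names are my own, and it only looks at top-level PIDs, not at threads under /proc/<pid>/task.

```python
import os


def parse_stat(stat_line):
    """Extract (pid, command name, state) from a /proc/<pid>/stat line.

    The line looks like "554 (zvol) D 2 0 ...". The command name may itself
    contain spaces or parentheses, so split on the last ')'.
    """
    left, _, rest = stat_line.rpartition(")")
    pid_text, _, comm = left.partition("(")
    return int(pid_text), comm, rest.split()[0]


def blocked_tasks(proc_root="/proc"):
    """Return sorted (pid, command name) pairs for tasks in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited during the scan, or unreadable line
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

Running it a few times while sync hangs should show whether the same tasks stay in state D.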