On 2025-12-30 I was snooping around the Proxmox UI when I stumbled upon a storage view showing that my ZFS pool (which holds the disks of all my VMs) was in a DEGRADED state. Opening up the details, I saw that one of the disks was in a FAULTED state.

I tried rebooting the host, which triggered a resilvering attempt, but the disk remained in the same state.

## First diagnostic

- The write side of the resilver is running at ~50 MB/s. I'll shut down all the VMs in the hope of reducing contention for disk I/O.
- After around 30 minutes, the speed has increased to ~100 MB/s.
- The resilver will take a long time and it's late, so I'll go to sleep and continue tomorrow.
- The next morning, the status read:
```
  pool: proxmox-tank-1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 495G in 01:07:58 with 0 errors on Sat Jan 3 00:25:33 2026
config:

        NAME                                 STATE     READ WRITE CKSUM
        proxmox-tank-1                       DEGRADED     0     0     0
          mirror-0                           DEGRADED     0     0     0
            ata-ST4000NT001-3M2101_WX11TN0Z  DEGRADED     0     0     0  too many errors
            ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0

errors: No known data errors
```
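The `scan:` line summarizes the whole resilver. To compare it with the ~50-100 MB/s I watched live, here is a minimal Python sketch that extracts the amount, duration and error count and derives an average rate. `parse_scan_line` and `average_mib_per_s` are hypothetical helpers, not part of any ZFS tooling. The sketch assumes zpool's size suffixes are 1024-based and that the duration is printed as `HH:MM:SS`. Any other format (e.g. multi-day durations) raises `ValueError`.

```python
import re

# Assumption: zpool prints sizes with 1024-based suffixes.
_UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}

# Matches e.g. "scan: resilvered 495G in 01:07:58 with 0 errors on ..."
# and "scan: scrub repaired 13.0M in 02:14:22 with 0 errors on ...".
_SCAN_RE = re.compile(
    r"scan: (?P<kind>resilvered|scrub repaired) "
    r"(?P<amount>[\d.]+)(?P<unit>[BKMGTP]) in "
    r"(?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2}) with (?P<errors>\d+) errors"
)


def parse_scan_line(line: str) -> dict:
    """Extract kind, byte count, duration (s) and error count from a 'scan:' line."""
    m = _SCAN_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognised scan line: {line!r}")
    return {
        "kind": m["kind"],
        "bytes": float(m["amount"]) * _UNITS[m["unit"]],
        "seconds": int(m["h"]) * 3600 + int(m["m"]) * 60 + int(m["s"]),
        "errors": int(m["errors"]),
    }


def average_mib_per_s(scan: dict) -> float:
    """Average rate in MiB/s. Only meaningful for resilvers, where 'bytes' is the
    amount rebuilt; for scrubs it is merely the amount repaired."""
    if scan["seconds"] == 0:
        raise ValueError("zero-length scan")
    return scan["bytes"] / scan["seconds"] / 1024**2


if __name__ == "__main__":
    line = "scan: resilvered 495G in 01:07:58 with 0 errors on Sat Jan 3 00:25:33 2026"
    print(f"{average_mib_per_s(parse_scan_line(line)):.1f} MiB/s")
```

For the resilver above this works out to roughly 124 MiB/s on average. That figure assumes the reported duration covers the whole resilver.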
- Apparently, after a crisis like the one this disk had, ZFS only clears the device's error state once a human acknowledges it.
- To do that, I run: `zpool clear proxmox-tank-1 ata-ST4000NT001-3M2101_WX11TN0Z`
- The status immediately becomes:
```
  pool: proxmox-tank-1
 state: ONLINE
  scan: resilvered 495G in 01:07:58 with 0 errors on Sat Jan 3 00:25:33 2026
config:

        NAME                                 STATE     READ WRITE CKSUM
        proxmox-tank-1                       ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            ata-ST4000NT001-3M2101_WX11TN0Z  ONLINE       0     0     0
            ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0

errors: No known data errors
```
- Scrubbing
- I trigger the scrub with: `sudo zpool scrub proxmox-tank-1`
- Once the scrub finished, the status read:
```
  pool: proxmox-tank-1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 13.0M in 02:14:22 with 0 errors on Sat Jan 3 11:03:54 2026
config:

        NAME                                 STATE     READ WRITE CKSUM
        proxmox-tank-1                       ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            ata-ST4000NT001-3M2101_WX11TN0Z  ONLINE       0     0   992
            ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0

errors: No known data errors
```
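The CKSUM column is the interesting part here: the pool reports no known data errors, yet one disk racked up 992 checksum errors. Below is a minimal Python sketch that flags every vdev that isn't ONLINE or has any non-zero READ/WRITE/CKSUM counter. `find_suspect_devices` is a hypothetical helper. The sketch assumes the standard `zpool status` layout, where the table starts right after the `NAME ... CKSUM` header and ends at the next blank line.

```python
def find_suspect_devices(status_text: str) -> list[tuple[str, str, str, str, str]]:
    """Return (name, state, read, write, cksum) for every vdev in `zpool status`
    output whose state isn't ONLINE or whose error counters aren't all zero."""
    suspects = []
    in_table = False
    for raw in status_text.splitlines():
        line = raw.strip()
        if not in_table:
            # The config table starts right after the NAME/STATE/READ/WRITE/CKSUM header.
            in_table = line.startswith("NAME") and line.endswith("CKSUM")
            continue
        if not line:
            break  # a blank line ends the config table
        fields = line.split()
        if len(fields) < 5:
            continue  # not a vdev row
        name, state, read, write, cksum = fields[:5]
        # Counters stay strings: zpool may abbreviate large values, so any
        # non-"0" string counts as non-zero.
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            suspects.append((name, state, read, write, cksum))
    return suspects
```

To feed it live data, something like `subprocess.run(["zpool", "status", "proxmox-tank-1"], capture_output=True, text=True).stdout` would do.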
- I clear the error counters again with `zpool clear proxmox-tank-1 ata-ST4000NT001-3M2101_WX11TN0Z`.
- Still, 992 checksum errors on one disk might be cause for concern.
- Checking the disk with smartctl
- I run `smartctl -x "$(readlink -f "$DISKPATH")" | egrep -i 'Reallocated|Pending|Offline_Uncorrect|CRC|Hardware Resets|COMRESET|SATA Phy'` and get back:
```
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
0x0c       GPL     R/O   2048  Pending Defects log
0x11       GPL     R/O      1  SATA Phy Event Counters log
If Selective self-test is pending on power-up, resume after 0 minute delay.
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x06  0x008  4              41  ---  Number of Hardware Resets
0x06  0x018  4               0  ---  Number of Interface CRC Errors
Pending Defects log (GP Log 0x0c)
SATA Phy Event Counters (GP Log 0x11)
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
```
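The grep above boils down to a handful of counters whose raw values are commonly watched for anything non-zero: reallocated, pending and offline-uncorrectable sectors, plus interface CRC errors. Below is a minimal Python sketch of the same check over the attribute table. It assumes the `smartctl -x` row layout shown here (ID, name, flags, value, worst, threshold, fail, raw). The helper name and the set of watched IDs are my own choice, not a smartmontools feature. The hardware-reset and COMRESET counters from the other logs aren't covered.

```python
import re

# One row of the smartctl attribute table, e.g.
#   5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
_ATTR_RE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flags>[-A-Z]{6})\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<fail>\S+)\s+(?P<raw>\d+)"
)

# Hypothetical selection: reallocated, pending, offline-uncorrectable, UDMA CRC.
WATCHED_IDS = {5, 197, 198, 199}


def check_attributes(smartctl_text: str) -> list[str]:
    """Return warnings for watched counters with non-zero raw values and for
    any attribute whose FAIL column isn't '-'."""
    warnings = []
    for line in smartctl_text.splitlines():
        m = _ATTR_RE.match(line)
        if m is None:
            continue  # not an attribute row (log directory, statistics, etc.)
        name, raw = m["name"], int(m["raw"])
        if int(m["id"]) in WATCHED_IDS and raw != 0:
            warnings.append(f"{name}: raw value {raw}")
        if m["fail"] != "-":
            warnings.append(f"{name}: FAIL column says {m['fail']}")
    return warnings
```

On this disk's output it should return an empty list, matching the zeros above.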
- The full output of `smartctl -x /dev/sdb`:
```
smartctl 7.4 2024-10-15 r5620 [x86_64-linux-6.14.8-2-pve] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST4000NT001-3M2101
Serial Number:    WX11TN0Z
LU WWN Device Id: 5 000c50 0fb8869af
Firmware Version: EN01
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jan 3 11:16:38 2026 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Disabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Write SCT (Get) Feature Control Command failed: scsi error unsupported field in scsi command
Wt Cache Reorder: Unknown (SCT Feature Control command failed)

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 567) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 372) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   080   064   044    -    89842680
  3 Spin_Up_Time            PO----   097   093   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    237
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   078   060   045    -    58464314
  9 Power_On_Hours          -O--CK   099   099   000    -    1551
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    237
 18 Unknown_Attribute       PO-R--   100   100   050    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   060   054   000    -    40 (Min/Max 26/43)
192 Power-Off_Retract_Count -O--CK   100   100   000    -    229
193 Load_Cycle_Count        -O--CK   100   100   000    -    1964
194 Temperature_Celsius     -O---K   040   046   000    -    40 (0 23 0 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
240 Head_Flying_Hours       ------   100   100   000    -    516 (189 160 0)
241 Total_LBAs_Written      ------   100   253   000    -    19110016931
242 Total_LBAs_Read         ------   100   253   000    -    9057450849
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01       SL      R/O      1  Summary SMART error log
0x02       SL      R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x04       GPL     R/O    256  Device Statistics log
0x04       SL      R/O      8  Device Statistics log
0x06       SL      R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09       SL      R/W      1  Selective self-test log
0x0a       GPL     R/W      8  Device Statistics Notification
0x0c       GPL     R/O   2048  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    768  Current Device Internal Status Data log
0x2f       GPL     R/O      1  Set Sector Configuration
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS     160  Device vendor specific log
0xa2       GPL     VS   16320  Device vendor specific log
0xa4       GPL,SL  VS     160  Device vendor specific log
0xa6       GPL     VS     192  Device vendor specific log
0xa8-0xa9  GPL,SL  VS     136  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xad       GPL     VS      16  Device vendor specific log
0xb1       GPL,SL  VS     160  Device vendor specific log
0xb6       GPL     VS    1920  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc1       GPL,SL  VS       8  Device vendor specific log
0xc3       GPL,SL  VS      24  Device vendor specific log
0xc6       GPL     VS    5184  Device vendor specific log
0xc7       GPL,SL  VS       8  Device vendor specific log
0xc9       GPL,SL  VS       8  Device vendor specific log
0xca       GPL,SL  VS      16  Device vendor specific log
0xcd       GPL,SL  VS       1  Device vendor specific log
0xce       GPL     VS       1  Device vendor specific log
0xcf       GPL     VS     512  Device vendor specific log
0xd1       GPL     VS     656  Device vendor specific log
0xd2       GPL     VS   10256  Device vendor specific log
0xd4       GPL     VS    2048  Device vendor specific log
0xda       GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1462         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
Device State:                        Active (0)
Current Temperature:                 40 Celsius
Power Cycle Min/Max Temperature:     26/43 Celsius
Lifetime Min/Max Temperature:        23/46 Celsius
Under/Over Temperature Limit Count:  0/2
SMART Status:                        0xc24f (PASSED)
Vendor specific:
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         4 minutes
Temperature Logging Interval:        59 minutes
Min/Max recommended Temperature:     10/40 Celsius
Min/Max Temperature Limit:           5/60 Celsius
Temperature History Size (Index):    128 (123)

Index    Estimated Time   Temperature Celsius
 124    2025-12-29 05:52    34  ***************
 125    2025-12-29 06:51    34  ***************
 126    2025-12-29 07:50    33  **************
 ...    ..(  3 skipped).    ..  **************
   2    2025-12-29 11:46    33  **************
   3    2025-12-29 12:45    35  ****************
   4    2025-12-29 13:44    35  ****************
   5    2025-12-29 14:43    35  ****************
   6    2025-12-29 15:42     ?  -
   7    2025-12-29 16:41    36  *****************
   8    2025-12-29 17:40     ?  -
   9    2025-12-29 18:39    36  *****************
  10    2025-12-29 19:38     ?  -
  11    2025-12-29 20:37    36  *****************
  12    2025-12-29 21:36    36  *****************
  13    2025-12-29 22:35    35  ****************
 ...    ..(  4 skipped).    ..  ****************
  18    2025-12-30 03:30    35  ****************
  19    2025-12-30 04:29    34  ***************
 ...    ..(  2 skipped).    ..  ***************
  22    2025-12-30 07:26    34  ***************
  23    2025-12-30 08:25    33  **************
  24    2025-12-30 09:24    33  **************
  25    2025-12-30 10:23    33  **************
  26    2025-12-30 11:22    32  *************
 ...    ..(  3 skipped).    ..  *************
  30    2025-12-30 15:18    32  *************
  31    2025-12-30 16:17    33  **************
  32    2025-12-30 17:16    33  **************
  33    2025-12-30 18:15    32  *************
 ...    ..( 10 skipped).    ..  *************
  44    2025-12-31 05:04    32  *************
  45    2025-12-31 06:03    31  ************
 ...    ..(  5 skipped).    ..  ************
  51    2025-12-31 11:57    31  ************
  52    2025-12-31 12:56    30  ***********
 ...    ..(  8 skipped).    ..  ***********
  61    2025-12-31 21:47    30  ***********
  62    2025-12-31 22:46    31  ************
 ...    ..(  3 skipped).    ..  ************
  66    2026-01-01 02:42    31  ************
  67    2026-01-01 03:41    30  ***********
 ...    ..( 10 skipped).    ..  ***********
  78    2026-01-01 14:30    30  ***********
  79    2026-01-01 15:29    29  **********
  80    2026-01-01 16:28    29  **********
  81    2026-01-01 17:27    29  **********
  82    2026-01-01 18:26    30  ***********
  83    2026-01-01 19:25    29  **********
  84    2026-01-01 20:24    29  **********
  85    2026-01-01 21:23    29  **********
  86    2026-01-01 22:22    30  ***********
  87    2026-01-01 23:21    30  ***********
  88    2026-01-02 00:20    32  *************
  89    2026-01-02 01:19    33  **************
  90    2026-01-02 02:18    33  **************
  91    2026-01-02 03:17    33  **************
  92    2026-01-02 04:16    32  *************
  93    2026-01-02 05:15    31  ************
  94    2026-01-02 06:14     ?  -
  95    2026-01-02 07:13    30  ***********
  96    2026-01-02 08:12     ?  -
  97    2026-01-02 09:11    30  ***********
  98    2026-01-02 10:10     ?  -
  99    2026-01-02 11:09    30  ***********
 100    2026-01-02 12:08     ?  -
 101    2026-01-02 13:07    30  ***********
 102    2026-01-02 14:06     ?  -
 103    2026-01-02 15:05    30  ***********
 104    2026-01-02 16:04     ?  -
 105    2026-01-02 17:03    30  ***********
 106    2026-01-02 18:02     ?  -
 107    2026-01-02 19:01    31  ************
 108    2026-01-02 20:00     ?  -
 109    2026-01-02 20:59    31  ************
 110    2026-01-02 21:58     ?  -
 111    2026-01-02 22:57    26  *******
 112    2026-01-02 23:56    38  *******************
 113    2026-01-03 00:55    36  *****************
 114    2026-01-03 01:54    34  ***************
 115    2026-01-03 02:53    33  **************
 ...    ..(  4 skipped).    ..  **************
 120    2026-01-03 07:48    33  **************
 121    2026-01-03 08:47    37  ******************
 122    2026-01-03 09:46    42  ***********************
 123    2026-01-03 10:45    43  ************************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4             237  ---  Lifetime Power-On Resets
0x01  0x010  4            1551  ---  Power-on Hours
0x01  0x018  6     18855234811  ---  Logical Sectors Written
0x01  0x020  6        38962968  ---  Number of Write Commands
0x01  0x028  6      9004753896  ---  Logical Sectors Read
0x01  0x030  6       148517033  ---  Number of Read Commands
0x01  0x038  6               -  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4            1203  ---  Spindle Motor Power-on Hours
0x03  0x010  4             516  ---  Head Flying Hours
0x03  0x018  4            1964  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4             229  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x04  0x018  4               0  -D-  Physical Element Status Changed
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              40  ---  Current Temperature
0x05  0x010  1              32  ---  Average Short Term Temperature
0x05  0x018  1              34  ---  Average Long Term Temperature
0x05  0x020  1              46  ---  Highest Temperature
0x05  0x028  1              27  ---  Lowest Temperature
0x05  0x030  1              43  ---  Highest Average Short Term Temperature
0x05  0x038  1              30  ---  Lowest Average Short Term Temperature
0x05  0x040  1              34  ---  Highest Average Long Term Temperature
0x05  0x048  1              34  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               5  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4              41  ---  Number of Hardware Resets
0x06  0x010  4               8  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
0xff  0x010  7               0  ---  Vendor Specific
0xff  0x018  7               0  ---  Vendor Specific
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS

Seagate FARM log (GP Log 0xa6) supported [try: -l farm]
```
- `smartctl -l error` shows no errors.
- `smartctl -l selftest` output after triggering the short test:
```
smartctl 7.4 2024-10-15 r5620 [x86_64-linux-6.14.8-2-pve] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1551         -
# 2  Short offline       Completed without error       00%      1462         -
```
- Then I start the long test. It should be finished around 18:00.
- While the test runs, I start bringing the VMs back up, since they can run in parallel.
- The self-test is taking far longer than expected, but it's progressing: at 21:00, 20% remains.
- It eventually finished. Here's the full output:
```
|
||||
smartctl 7.4 2024-10-15 r5620 [x86_64-linux-6.14.8-2-pve] (local build)
|
||||
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
|
||||
|
||||
=== START OF INFORMATION SECTION ===
|
||||
Device Model: ST4000NT001-3M2101
|
||||
Serial Number: WX11TN0Z
|
||||
LU WWN Device Id: 5 000c50 0fb8869af
|
||||
Firmware Version: EN01
|
||||
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
|
||||
Sector Sizes: 512 bytes logical, 4096 bytes physical
|
||||
Rotation Rate: 7200 rpm
|
||||
Form Factor: 3.5 inches
|
||||
Device is: Not in smartctl database 7.3/5528
|
||||
ATA Version is: ACS-4 (minor revision not indicated)
|
||||
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
|
||||
Local Time is: Sat Jan 3 23:48:15 2026 CET
|
||||
SMART support is: Available - device has SMART capability.
|
||||
SMART support is: Enabled
|
||||
AAM feature is: Unavailable
|
||||
APM feature is: Unavailable
|
||||
Rd look-ahead is: Enabled
|
||||
Write cache is: Enabled
|
||||
DSN feature is: Disabled
|
||||
ATA Security is: Disabled, NOT FROZEN [SEC1]
|
||||
Write SCT (Get) Feature Control Command failed: scsi error unsupported field in scsi command
|
||||
Wt Cache Reorder: Unknown (SCT Feature Control command failed)
|
||||
|
||||
=== START OF READ SMART DATA SECTION ===
|
||||
SMART overall-health self-assessment test result: PASSED
|
||||
|
||||
General SMART Values:
|
||||
Offline data collection status: (0x82) Offline data collection activity
|
||||
was completed without error.
|
||||
Auto Offline Data Collection: Enabled.
|
||||
Self-test execution status: ( 0) The previous self-test routine completed
|
||||
without error or no self-test has ever
|
||||
been run.
|
||||
Total time to complete Offline
|
||||
data collection: ( 567) seconds.
|
||||
Offline data collection
|
||||
capabilities: (0x7b) SMART execute Offline immediate.
|
||||
Auto Offline data collection on/off support.
|
||||
Suspend Offline collection upon new
|
||||
command.
|
||||
Offline surface scan supported.
|
||||
Self-test supported.
|
||||
Conveyance Self-test supported.
|
||||
Selective Self-test supported.
|
||||
SMART capabilities: (0x0003) Saves SMART data before entering
|
||||
power-saving mode.
|
||||
Supports SMART auto save timer.
|
||||
Error logging capability: (0x01) Error logging supported.
|
||||
General Purpose Logging supported.
|
||||
Short self-test routine
|
||||
recommended polling time: ( 1) minutes.
|
||||
Extended self-test routine
|
||||
recommended polling time: ( 372) minutes.
|
||||
Conveyance self-test routine
|
||||
recommended polling time: ( 2) minutes.
|
||||
SCT capabilities: (0x50bd) SCT Status supported.
|
||||
SCT Error Recovery Control supported.
|
||||
SCT Feature Control supported.
|
||||
SCT Data Table supported.
|
||||
|
||||
SMART Attributes Data Structure revision number: 10
|
||||
Vendor Specific SMART Attributes with Thresholds:
|
||||
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
|
||||
1 Raw_Read_Error_Rate POSR-- 082 064 044 - 160614704
|
||||
3 Spin_Up_Time PO---- 097 093 000 - 0
|
||||
4 Start_Stop_Count -O--CK 100 100 020 - 237
|
||||
5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
|
||||
7 Seek_Error_Rate POSR-- 078 060 045 - 63053692
|
||||
9 Power_On_Hours -O--CK 099 099 000 - 1564
|
||||
10 Spin_Retry_Count PO--C- 100 100 097 - 0
|
||||
12 Power_Cycle_Count -O--CK 100 100 020 - 237
|
||||
18 Unknown_Attribute PO-R-- 100 100 050 - 0
|
||||
187 Reported_Uncorrect -O--CK 100 100 000 - 0
|
||||
188 Command_Timeout -O--CK 100 100 000 - 0
|
||||
190 Airflow_Temperature_Cel -O---K 063 054 000 - 37 (Min/Max 26/45)
|
||||
192 Power-Off_Retract_Count -O--CK 100 100 000 - 229
|
||||
193 Load_Cycle_Count -O--CK 100 100 000 - 1965
|
||||
194 Temperature_Celsius -O---K 037 046 000 - 37 (0 23 0 0 0)
|
||||
197 Current_Pending_Sector -O--C- 100 100 000 - 0
|
||||
198 Offline_Uncorrectable ----C- 100 100 000 - 0
|
||||
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0
|
||||
240 Head_Flying_Hours ------ 100 100 000 - 529 (206 38 0)
|
||||
241 Total_LBAs_Written ------ 100 253 000 - 19648189091
|
||||
242 Total_LBAs_Read ------ 100 253 000 - 9322473897
|
||||
||||||_ K auto-keep
|
||||
|||||__ C event count
|
||||
||||___ R error rate
|
||||
|||____ S speed/performance
|
||||
||_____ O updated online
|
||||
|______ P prefailure warning
|
||||
|
||||
General Purpose Log Directory Version 1
|
||||
SMART Log Directory Version 1 [multi-sector log support]
|
||||
Address Access R/W Size Description
|
||||
0x00 GPL,SL R/O 1 Log Directory
|
||||
0x01 SL R/O 1 Summary SMART error log
|
||||
0x02 SL R/O 5 Comprehensive SMART error log
|
||||
0x03 GPL R/O 5 Ext. Comprehensive SMART error log
|
||||
0x04 GPL R/O 256 Device Statistics log
|
||||
0x04 SL R/O 8 Device Statistics log
|
||||
0x06 SL R/O 1 SMART self-test log
|
||||
0x07 GPL R/O 1 Extended self-test log
|
||||
0x08 GPL R/O 2 Power Conditions log
|
||||
0x09 SL R/W 1 Selective self-test log
|
||||
0x0a GPL R/W 8 Device Statistics Notification
|
||||
0x0c GPL R/O 2048 Pending Defects log
|
||||
0x10 GPL R/O 1 NCQ Command Error log
|
||||
0x11 GPL R/O 1 SATA Phy Event Counters log
|
||||
0x13 GPL R/O 1 SATA NCQ Send and Receive log
|
||||
0x21 GPL R/O 1 Write stream error log
|
||||
0x22 GPL R/O 1 Read stream error log
|
||||
0x24 GPL R/O 768 Current Device Internal Status Data log
|
||||
0x2f GPL R/O 1 Set Sector Configuration
|
||||
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
|
||||
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
|
||||
0xa1 GPL,SL VS 160 Device vendor specific log
|
||||
0xa2 GPL VS 16320 Device vendor specific log
|
||||
0xa4 GPL,SL VS 160 Device vendor specific log
|
||||
0xa6 GPL VS 192 Device vendor specific log
|
||||
0xa8-0xa9 GPL,SL VS 136 Device vendor specific log
|
||||
0xab GPL VS 1 Device vendor specific log
|
||||
0xad GPL VS 16 Device vendor specific log
|
||||
0xb1 GPL,SL VS 160 Device vendor specific log
|
||||
0xb6 GPL VS 1920 Device vendor specific log
|
||||
0xbe-0xbf GPL VS 65535 Device vendor specific log
|
||||
0xc1 GPL,SL VS 8 Device vendor specific log
|
||||
0xc3 GPL,SL VS 24 Device vendor specific log
|
||||
0xc6 GPL VS 5184 Device vendor specific log
|
||||
0xc7 GPL,SL VS 8 Device vendor specific log
|
||||
0xc9 GPL,SL VS 8 Device vendor specific log
|
||||
0xca GPL,SL VS 16 Device vendor specific log
|
||||
0xcd GPL,SL VS 1 Device vendor specific log
|
||||
0xce GPL VS 1 Device vendor specific log
|
||||
0xcf GPL VS 512 Device vendor specific log
|
||||
0xd1 GPL VS 656 Device vendor specific log
|
||||
0xd2 GPL VS 10256 Device vendor specific log
|
||||
0xd4 GPL VS 2048 Device vendor specific log
|
||||
0xda GPL,SL VS 1 Device vendor specific log
|
||||
0xe0 GPL,SL R/W 1 SCT Command/Status
|
||||
0xe1 GPL,SL R/W 1 SCT Data Transfer
|
||||
|
||||
SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
|
||||
No Errors Logged
|
||||
|
||||
SMART Extended Self-test Log Version: 1 (1 sectors)
|
||||
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
|
||||
# 1 Extended offline Completed without error 00% 1563 -
|
||||
# 2 Short offline Completed without error 00% 1551 -
|
||||
# 3 Short offline Completed without error 00% 1462 -
|
||||
|
||||
SMART Selective self-test log data structure revision number 1
|
||||
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
|
||||
1 0 0 Not_testing
|
||||
2 0 0 Not_testing
|
||||
3 0 0 Not_testing
|
||||
4 0 0 Not_testing
|
||||
5 0 0 Not_testing
|
||||
Selective self-test flags (0x0):
|
||||
After scanning selected spans, do NOT read-scan remainder of disk.
|
||||
If Selective self-test is pending on power-up, resume after 0 minute delay.
|
||||
|
||||
SCT Status Version: 3
|
||||
SCT Version (vendor specific): 522 (0x020a)
|
||||
Device State: Active (0)
|
||||
Current Temperature: 37 Celsius
|
||||
Power Cycle Min/Max Temperature: 26/45 Celsius
|
||||
Lifetime Min/Max Temperature: 23/46 Celsius
|
||||
Under/Over Temperature Limit Count: 0/14
|
||||
SMART Status: 0xc24f (PASSED)
|
||||
Vendor specific:
|
||||
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
|
||||
00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00
|
||||
|
||||
SCT Temperature History Version: 2
|
||||
Temperature Sampling Period: 4 minutes
|
||||
Temperature Logging Interval: 59 minutes
|
||||
Min/Max recommended Temperature: 10/40 Celsius
|
||||
Min/Max Temperature Limit: 5/60 Celsius
|
||||
Temperature History Size (Index): 128 (8)
|
||||
|
||||
Index Estimated Time Temperature Celsius
|
||||
9 2025-12-29 18:39 36 *****************
|
||||
10 2025-12-29 19:38 ? -
|
||||
11 2025-12-29 20:37 36 *****************
|
||||
12 2025-12-29 21:36 36 *****************
|
||||
13 2025-12-29 22:35 35 ****************
|
||||
... ..( 4 skipped). .. ****************
|
||||
18 2025-12-30 03:30 35 ****************
|
||||
19 2025-12-30 04:29 34 ***************
|
||||
... ..( 2 skipped). .. ***************
22 2025-12-30 07:26 34 ***************
23 2025-12-30 08:25 33 **************
24 2025-12-30 09:24 33 **************
25 2025-12-30 10:23 33 **************
26 2025-12-30 11:22 32 *************
... ..( 3 skipped). .. *************
30 2025-12-30 15:18 32 *************
31 2025-12-30 16:17 33 **************
32 2025-12-30 17:16 33 **************
33 2025-12-30 18:15 32 *************
... ..( 10 skipped). .. *************
44 2025-12-31 05:04 32 *************
45 2025-12-31 06:03 31 ************
... ..( 5 skipped). .. ************
51 2025-12-31 11:57 31 ************
52 2025-12-31 12:56 30 ***********
... ..( 8 skipped). .. ***********
61 2025-12-31 21:47 30 ***********
62 2025-12-31 22:46 31 ************
... ..( 3 skipped). .. ************
66 2026-01-01 02:42 31 ************
67 2026-01-01 03:41 30 ***********
... ..( 10 skipped). .. ***********
78 2026-01-01 14:30 30 ***********
79 2026-01-01 15:29 29 **********
80 2026-01-01 16:28 29 **********
81 2026-01-01 17:27 29 **********
82 2026-01-01 18:26 30 ***********
83 2026-01-01 19:25 29 **********
84 2026-01-01 20:24 29 **********
85 2026-01-01 21:23 29 **********
86 2026-01-01 22:22 30 ***********
87 2026-01-01 23:21 30 ***********
88 2026-01-02 00:20 32 *************
89 2026-01-02 01:19 33 **************
90 2026-01-02 02:18 33 **************
91 2026-01-02 03:17 33 **************
92 2026-01-02 04:16 32 *************
93 2026-01-02 05:15 31 ************
94 2026-01-02 06:14 ? -
95 2026-01-02 07:13 30 ***********
96 2026-01-02 08:12 ? -
97 2026-01-02 09:11 30 ***********
98 2026-01-02 10:10 ? -
99 2026-01-02 11:09 30 ***********
100 2026-01-02 12:08 ? -
101 2026-01-02 13:07 30 ***********
102 2026-01-02 14:06 ? -
103 2026-01-02 15:05 30 ***********
104 2026-01-02 16:04 ? -
105 2026-01-02 17:03 30 ***********
106 2026-01-02 18:02 ? -
107 2026-01-02 19:01 31 ************
108 2026-01-02 20:00 ? -
109 2026-01-02 20:59 31 ************
110 2026-01-02 21:58 ? -
111 2026-01-02 22:57 26 *******
112 2026-01-02 23:56 38 *******************
113 2026-01-03 00:55 36 *****************
114 2026-01-03 01:54 34 ***************
115 2026-01-03 02:53 33 **************
... ..( 4 skipped). .. **************
120 2026-01-03 07:48 33 **************
121 2026-01-03 08:47 37 ******************
122 2026-01-03 09:46 42 ***********************
123 2026-01-03 10:45 43 ************************
124 2026-01-03 11:44 42 ***********************
125 2026-01-03 12:43 43 ************************
126 2026-01-03 13:42 44 *************************
127 2026-01-03 14:41 45 **************************
0 2026-01-03 15:40 43 ************************
1 2026-01-03 16:39 43 ************************
2 2026-01-03 17:38 42 ***********************
... ..( 2 skipped). .. ***********************
5 2026-01-03 20:35 42 ***********************
6 2026-01-03 21:34 41 **********************
7 2026-01-03 22:33 41 **********************
8 2026-01-03 23:32 38 *******************

SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 237 --- Lifetime Power-On Resets
0x01 0x010 4 1564 --- Power-on Hours
0x01 0x018 6 19393406971 --- Logical Sectors Written
0x01 0x020 6 40248649 --- Number of Write Commands
0x01 0x028 6 9269776944 --- Logical Sectors Read
0x01 0x030 6 154066717 --- Number of Read Commands
0x01 0x038 6 - --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 1215 --- Spindle Motor Power-on Hours
0x03 0x010 4 528 --- Head Flying Hours
0x03 0x018 4 1965 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 0 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 229 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 0 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x04 0x018 4 0 -D- Physical Element Status Changed
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 37 --- Current Temperature
0x05 0x010 1 35 --- Average Short Term Temperature
0x05 0x018 1 34 --- Average Long Term Temperature
0x05 0x020 1 46 --- Highest Temperature
0x05 0x028 1 27 --- Lowest Temperature
0x05 0x030 1 43 --- Highest Average Short Term Temperature
0x05 0x038 1 30 --- Lowest Average Short Term Temperature
0x05 0x040 1 34 --- Highest Average Long Term Temperature
0x05 0x048 1 34 --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 60 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 5 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 41 --- Number of Hardware Resets
0x06 0x010 4 8 --- Number of ASR Events
0x06 0x018 4 0 --- Number of Interface CRC Errors
0xff ===== = = === == Vendor Specific Statistics (rev 1) ==
0xff 0x010 7 0 --- Vendor Specific
0xff 0x018 7 0 --- Vendor Specific
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x000a 2 2 Device-to-host register FISes sent due to a COMRESET
0x0001 2 0 Command failed due to ICRC error
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS

Seagate FARM log (GP Log 0xa6) supported [try: -l farm]
```
- Execution finished, nothing else to do. The disk remains in the mirror.

- Other notes
  - I labeled the two disks by hand as AGAPITO1 and AGAPITO2, but I never noted their serial numbers. Silly me. Here is the mapping:
    - AGAPITO1 is ata-ST4000NT001-3M2101_WX11TN0Z.
    - AGAPITO2 is ata-ST4000NT001-3M2101_WX11TN2P.
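The pool lists each mirror member by its `/dev/disk/by-id` name, and those names already embed the drive model and serial. That means the label-to-serial mapping above can be kept as data and checked mechanically. Below is a minimal Python sketch. It assumes the usual udev naming `ata-<model>_<serial>`, where spaces in the model string become underscores, so the serial is whatever follows the last underscore. `by_id_to_serial` and `LABELS` are hypothetical names for illustration, not part of ZFS or Proxmox.

```python
# Hypothetical helper, not part of ZFS/Proxmox: split a /dev/disk/by-id ATA
# name into (model, serial). Assumes udev's "ata-<model>_<serial>" scheme,
# where spaces in the model were replaced by underscores, so the serial is
# taken to be everything after the LAST underscore.
def by_id_to_serial(by_id: str) -> tuple[str, str]:
    prefix = "ata-"
    if not by_id.startswith(prefix):
        raise ValueError(f"not an ATA by-id name: {by_id!r}")
    model, sep, serial = by_id[len(prefix):].rpartition("_")
    if not sep or not model or not serial:
        raise ValueError(f"cannot split model and serial from {by_id!r}")
    return model, serial


# Hand-written stickers on the drives, mapped to the names ZFS reports.
LABELS = {
    "AGAPITO1": "ata-ST4000NT001-3M2101_WX11TN0Z",
    "AGAPITO2": "ata-ST4000NT001-3M2101_WX11TN2P",
}

if __name__ == "__main__":
    for label, by_id in LABELS.items():
        model, serial = by_id_to_serial(by_id)
        print(f"{label}: model={model} serial={serial}")
```

Run as a script, it prints one line per sticker label, pairing AGAPITO1 with serial WX11TN0Z and AGAPITO2 with WX11TN2P.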
## Side quests