fixing hdd

2026-02-21 11:51:14 +01:00 · 2026-02-21 11:51:14 +01:00 · d7681a969f
commit d7681a969f
parent bfbc1153ad
3 changed files with 364 additions and 0 deletions
--- a/zfs-mirror-incident/20251230_first_zfs_degradation.md
+++ b/zfs-mirror-incident/20251230_first_zfs_degradation.md
--- a/zfs-mirror-incident/20260208_second_zfs_degradation.md
+++ b/zfs-mirror-incident/20260208_second_zfs_degradation.md
--- a/zfs-mirror-incident/20260220_agapito1_replacement_runbook.md
+++ b/zfs-mirror-incident/20260220_agapito1_replacement_runbook.md
@ -0,0 +1,364 @@
+# AGAPITO1 Replacement Runbook
+
+Continuation of [20260208_second_zfs_degradation.md](20260208_second_zfs_degradation.md).
+
+AGAPITO1 (`ata-ST4000NT001-3M2101_WX11TN0Z`) had a failing SATA PHY and was RMA'd. The ZFS mirror `proxmox-tank-1` has been running degraded on AGAPITO2 alone since Feb 8. The replacement drive (same model, serial `WX120LHQ`) needs to be physically installed and added to the mirror.
+
+**Current state:**
+- Pool: `proxmox-tank-1` (mirror-0), DEGRADED
+- AGAPITO2 (`WX11TN2P`): ONLINE, on ata4
+- Old AGAPITO1 (`WX11TN0Z`): shows REMOVED in pool config
+- Physical: drive bay empty, SATA data + power cables still connected to mobo/PSU (should be ata3 port after the cable swap from incident 2)
+- New drive: ST4000NT001-3M2101, serial `WX120LHQ`
+
+---
+
+## Phase 1: Pre-shutdown state capture
+
+While server is still running, log current state for reference.
+
+- [x] **1.1** Record pool status
+  ```
+  zpool status -v proxmox-tank-1
+  ```
+  Expected: DEGRADED, WX11TN0Z shows REMOVED, WX11TN2P ONLINE.
+  ```
+    pool: proxmox-tank-1
+   state: DEGRADED
+  status: One or more devices have been removed.
+          Sufficient replicas exist for the pool to continue functioning in a
+          degraded state.
+  action: Online the device using zpool online' or replace the device with
+          'zpool replace'.
+    scan: scrub repaired 0B in 06:55:06 with 0 errors on Tue Feb 17 20:40:50 2026
+  config:
+
+          NAME                                 STATE     READ WRITE CKSUM
+          proxmox-tank-1                       DEGRADED     0     0     0
+            mirror-0                           DEGRADED     0     0     0
+              ata-ST4000NT001-3M2101_WX11TN0Z  REMOVED      0     0     0
+              ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
+
+  errors: No known data errors
+  ```
+
+- [x] **1.2** Record current SATA layout
+  ```
+  dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up' | tail -20
+  ```
+  Expected: AGAPITO2 visible on ata4. ata3 should show nothing (empty slot).
+  ```
+  [Tue Feb 17 15:37:28 2026] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
+  [Tue Feb 17 15:37:28 2026] ata4.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
+  ```
+
+- [x] **1.3** Confirm AGAPITO2 is healthy before we start
+  ```
+  smartctl -H /dev/disk/by-id/ata-ST4000NT001-3M2101_WX11TN2P
+  ```
+  Expected: PASSED. If not, stop and investigate before proceeding.
+  ```
+  SMART overall-health self-assessment test result: PASSED
+  ```
+
+---
+
+## Phase 2: Graceful shutdown
+
+- [x] **2.1** Shut down all VMs gracefully from Proxmox UI or CLI
+  ```
+  qm list
+  # For each running VM:
+  qm shutdown <VMID>
+  ```
+
+- [x] **2.2** Verify all VMs are stopped
+  ```
+  qm list
+  ```
+  Expected: all show "stopped".
+
+- [x] **2.3** Power down the server
+  ```
+  shutdown -h now
+  ```
+
+---
+
+## Phase 3: Physical installation
+
+- [x] **3.1** Open the case
+- [x] **3.2** Locate the dangling SATA data + power cables (from the old AGAPITO1 slot)
+- [x] **3.3** Visually inspect cables for damage — especially the SATA data connector pins
+- [x] **3.4** Label the new drive as TOMMY with a marker/sticker. Write serial `WX120LHQ` on the label too.
+- [x] **3.5** Seat the new drive in the bay
+- [x] **3.6** Connect SATA data cable to the drive — push firmly until it clicks
+- [x] **3.7** Connect SATA power cable to the drive — push firmly
+- [x] **3.8** Double-check both connectors are fully seated (wiggle test — they shouldn't move)
+- [x] **3.9** Close the case
+
+---
+
+## Phase 4: Boot and verify detection
+
+- [x] **4.1** Power on the server, let it boot into Proxmox
+
+- [x] **4.2** Verify the new drive is detected by the kernel
+  ```
+  dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up'
+  ```
+  Expected: new drive detected on ata3 (or whichever port the cable is on), at 6.0 Gbps.
+  ```
+  [Fri Feb 20 22:57:06 2026] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
+  [Fri Feb 20 22:57:06 2026] ata3.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
+  [Fri Feb 20 22:57:07 2026] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
+  [Fri Feb 20 22:57:07 2026] ata4.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
+  ```
+  TOMMY on ata3, AGAPITO2 on ata4. Both at 6.0 Gbps, firmware EN01.
+
+- [x] **4.3** Verify the drive appears in `/dev/disk/by-id/`
+  ```
+  ls -l /dev/disk/by-id/ | grep WX120LHQ
+  ```
+  Expected: `ata-ST4000NT001-3M2101_WX120LHQ` pointing to some `/dev/sdX`.
+  ```
+  ata-ST4000NT001-3M2101_WX120LHQ -> ../../sda
+  ```
+
+- [ ] **4.4** Set variables for convenience
+  ```
+  NEW_DISKID="ata-ST4000NT001-3M2101_WX120LHQ"
+  NEW_DISKPATH="/dev/disk/by-id/$NEW_DISKID"
+  OLD_DISKID="ata-ST4000NT001-3M2101_WX11TN0Z"
+  echo "New: $NEW_DISKID -> $(readlink -f $NEW_DISKPATH)"
+  ```
+
+- [x] **4.5** Confirm drive identity and firmware version with smartctl
+  ```
+  smartctl -i "$NEW_DISKPATH"
+  ```
+  Expected: Model ST4000NT001-3M2101, Serial WX120LHQ, Firmware EN01, 4TB capacity.
+  ```
+  Device Model:     ST4000NT001-3M2101
+  Serial Number:    WX120LHQ
+  Firmware Version: EN01
+  User Capacity:    4,000,787,030,016 bytes [4.00 TB]
+  SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
+  ```
+
+- [x] **4.6** Check kernel logs are clean — no SATA errors, link drops, or speed downgrades
+  ```
+  dmesg -T | grep -E 'ata[0-9]' | grep -iE 'error|fatal|reset|link down|slow|limiting'
+  ```
+  Expected: nothing. If there are errors here on a brand new drive + known-good cable, **stop and investigate**.
+  ```
+  [Fri Feb 20 22:57:06 2026] ata1: SATA link down (SStatus 0 SControl 300)
+  [Fri Feb 20 22:57:06 2026] ata2: SATA link down (SStatus 0 SControl 300)
+  ```
+  Clean — ata1/ata2 are unused ports. No errors on ata3 or ata4.
+
+---
+
+## Phase 5: Health-check the new drive before trusting data to it
+
+Don't resilver onto a DOA drive.
+
+- [x] **5.1** SMART overall health
+  ```
+  smartctl -H "$NEW_DISKPATH"
+  ```
+  Expected: PASSED.
+  ```
+  SMART overall-health self-assessment test result: PASSED
+  ```
+
+- [x] **5.2** Check SMART attributes baseline
+  ```
+  smartctl -A "$NEW_DISKPATH" | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC|Error_Rate'
+  ```
+  Expected: all counters at 0 (it's a new/refurb drive).
+  ```
+    1 Raw_Read_Error_Rate     ... -       6072
+    5 Reallocated_Sector_Ct   ... -       0
+    7 Seek_Error_Rate         ... -       476
+  197 Current_Pending_Sector  ... -       0
+  198 Offline_Uncorrectable   ... -       0
+  199 UDMA_CRC_Error_Count    ... -       0
+  ```
+  All critical counters at 0. Read/Seek error rate raw values are normal Seagate encoding.
+
+- [x] **5.3** Run short self-test
+  ```
+  smartctl -t short "$NEW_DISKPATH"
+  ```
+  Wait ~2 minutes, then check:
+  ```
+  smartctl -l selftest "$NEW_DISKPATH"
+  ```
+  Expected: "Completed without error".
+  ```
+  # 1  Short offline       Completed without error       00%         0         -
+  ```
+  Passed. 0 power-on hours — fresh drive.
+
+- [x] **5.4** (Decision point) Short test passed. Proceeding.
+
+---
+
+## Phase 6: Add new drive to ZFS mirror
+
+- [x] **6.1** Open a dedicated terminal for kernel log monitoring
+  ```
+  dmesg -Tw
+  ```
+  Leave this running throughout the resilver. Watch for ANY `ata` errors.
+
+- [x] **6.2** Replace the old drive with the new one in the pool
+  ```
+  zpool replace proxmox-tank-1 "$OLD_DISKID" "$NEW_DISKID"
+  ```
+  This tells ZFS: "the REMOVED drive WX11TN0Z is being replaced by WX120LHQ". Resilvering starts automatically.
+
+- [x] **6.3** Verify resilvering has started
+  ```
+  zpool status -v proxmox-tank-1
+  ```
+  Expected: state DEGRADED, new drive shows as part of a `replacing` vdev, resilver in progress.
+  ```
+  resilver in progress since Fri Feb 20 23:10:58 2026
+  5.71G / 1.33T scanned at 344M/s, 0B / 1.33T issued
+  0B resilvered, 0.00% done
+
+  replacing-0                        DEGRADED     0     0     0
+    ata-ST4000NT001-3M2101_WX11TN0Z  REMOVED      0     0     0
+    ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0 7.73K
+  ata-ST4000NT001-3M2101_WX11TN2P    ONLINE       0     0     0
+  ```
+  Resilver running. Cksum count on new drive is expected during resilver (unwritten blocks).
+
+- [x] **6.4** Monitor resilver progress periodically
+  ```
+  watch -n 30 "zpool status -v proxmox-tank-1"
+  ```
+  Expected: steady progress, no read/write/cksum errors on either drive. Based on previous experience (~500GB at ~100MB/s with VMs down), expect roughly 1-2 hours.
+  VMs were auto-started on boot. Resilver completed: 1.34T in 03:32:55 with 0 errors.
+
+- [x] **6.5** VMs were already running (auto-start on boot).
+
+---
+
+## Phase 7: Post-resilver verification
+
+Wait for resilver to complete (status will say "resilvered XXG in HH:MM:SS with 0 errors").
+
+- [x] **7.1** Check final pool status
+  ```
+  zpool status -v proxmox-tank-1
+  ```
+  Expected: ONLINE (or DEGRADED with "too many errors" message requiring a clear — same as last time).
+  ```
+  state: ONLINE
+  scan: resilvered 1.34T in 03:32:55 with 0 errors on Sat Feb 21 02:43:53 2026
+
+  ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0 7.73K
+  ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
+  ```
+  ONLINE. 7.73K cksum on TOMMY is expected resilver artifact — clearing next.
+
+- [x] **7.2** Clear residual cksum counters
+  ```
+  zpool clear proxmox-tank-1
+  ```
+  Counters cleared (status message and cksum count gone on re-check).
+  ```
+  state: ONLINE
+  scan: resilvered 1.34T in 03:32:55 with 0 errors on Sat Feb 21 02:43:53 2026
+
+  ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0     0
+  ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
+
+  errors: No known data errors
+  ```
+
+- [x] **7.3** Run a full scrub to verify data integrity
+  ```
+  zpool scrub proxmox-tank-1
+  ```
+  Expected: **0 errors on both drives**.
+  ```
+  scrub repaired 0B in 03:27:50 with 0 errors on Sat Feb 21 11:38:02 2026
+
+  ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0     0
+  ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
+
+  errors: No known data errors
+  ```
+
+- [x] **7.4** Clean status confirmed — 0B repaired, 0 errors, both drives 0/0/0.
+
+- [x] **7.5** Baseline SMART snapshot of the new drive after heavy I/O
+  ```
+  smartctl -x "$NEW_DISKPATH" | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC|Hardware Resets|COMRESET|Interface'
+  ```
+  Expected: 0 reallocated, 0 CRC errors, low hardware reset count.
+  ```
+  Reallocated_Sector_Ct    ... 0
+  Current_Pending_Sector   ... 0
+  Offline_Uncorrectable    ... 0
+  UDMA_CRC_Error_Count     ... 0
+  Number of Hardware Resets ... 2
+  Number of Interface CRC Errors ... 0
+  COMRESET ... 2
+  ```
+  All clean. 2 hardware resets / COMRESETs from boot — normal.
+
+- ~~**7.6**~~ Skipped — extended SMART self-test is redundant after a clean resilver + scrub. ZFS checksums already verified every data block; the only thing the long test would cover is empty space that ZFS hasn't written to, which ZFS will verify on future use anyway.
+
+---
+
+## Phase 8: Final state — done
+
+- [x] **8.1** Final pool status — already captured in 7.4. Mirror is healthy:
+  ```
+    pool: proxmox-tank-1
+   state: ONLINE
+    scan: scrub repaired 0B in 03:27:50 with 0 errors on Sat Feb 21 11:38:02 2026
+   config:
+     NAME                                 STATE     READ WRITE CKSUM
+     proxmox-tank-1                       ONLINE       0     0     0
+       mirror-0                           ONLINE       0     0     0
+         ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0     0
+         ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
+
+   errors: No known data errors
+  ```
+
+- [x] **8.2** All VMs running normally — verified from Proxmox UI
+
+- [x] **8.3** Celebrate. Mirror is whole again.
+
+---
+
+## Abort conditions
+
+Stop and investigate if any of these happen:
+
+- New drive not detected after boot (bad seating or DOA)
+- SATA errors in `dmesg` during or after boot (bad cable? bad drive?)
+- SMART short test fails on new drive (DOA — contact seller)
+- Resilver stalls or produces errors on the new drive
+- Scrub finds checksum errors on the new drive
+
+---
+
+## Execution summary
+
+Executed 2026-02-20 evening through 2026-02-21 morning. No abort conditions hit — completely clean run.
+
+- TOMMY (`WX120LHQ`) installed on ata3 at 6.0 Gbps, detected first boot
+- SMART short test passed, all critical attributes at zero
+- Resilver: 1.34T in 03:32:55, 0 errors (VMs were running — auto-start on boot)
+- Scrub: repaired 0B in 03:27:50, 0 errors, both drives 0/0/0
+- Post-I/O SMART baseline clean: 0 reallocated, 0 CRC errors
+- Extended SMART test skipped — redundant after clean resilver + scrub (ZFS checksums already verified all data blocks)
+- Pool `proxmox-tank-1` fully healthy. Mirror degradation that started 2026-02-08 is resolved.