fixing hdd

2026-02-21 11:51:14 +01:00 · d7681a969f
commit d7681a969f
parent bfbc1153ad
3 changed files with 364 additions and 0 deletions
--- a/zfs-mirror-incident/20251230_first_zfs_degradation.md
+++ b/zfs-mirror-incident/20251230_first_zfs_degradation.md
--- a/zfs-mirror-incident/20260208_second_zfs_degradation.md
+++ b/zfs-mirror-incident/20260208_second_zfs_degradation.md
@@ -0,0 +1,145 @@
+# Second ZFS Degradation
+
+Continuation of [20251230_first_zfs_degradation.md](20251230_first_zfs_degradation.md).
+
+
+## Problem
+
+On 2026-02-04, a few days after the first incident was resolved, AGAPITO1 (`ata-ST4000NT001-3M2101_WX11TN0Z`) appeared as REMOVED in the ZFS pool again.
+
+I opened the case and inspected the connections. Nothing looked visibly wrong: all cables appeared properly seated.
+
+
+## The cable swap experiment
+
+To narrow down whether the issue was the disk itself or the cable/port, I devised a swap test: I unplugged both the power and data cables from AGAPITO1 and AGAPITO2, and swapped them. After the swap, both drives were connected to the motherboard, but each one now used the other's previous data cable, power cable, and motherboard SATA port.
+
+After booting, everything looked good. I ran a scrub: it reported a single checksum error on AGAPITO1, and the pool returned to a healthy state.
+
+At boot, the new mapping was:
+- ata3 → AGAPITO1 (WX11TN0Z) — previously on ata4
+- ata4 → AGAPITO2 (WX11TN2P) — previously on ata3
+
+
+## Second failure
+
+The pool stayed healthy for a few days, but on 2026-02-08 it degraded again:
+
+```
+counterweight@nodito:~$ zpool status -v proxmox-tank-1
+  pool: proxmox-tank-1
+ state: DEGRADED
+status: One or more devices have been removed.
+        Sufficient replicas exist for the pool to continue functioning in a
+        degraded state.
+action: Online the device using zpool online' or replace the device with
+        'zpool replace'.
+  scan: resilvered 0B in 00:02:20 with 0 errors on Sun Feb  8 01:49:13 2026
+config:
+
+        NAME                                 STATE     READ WRITE CKSUM
+        proxmox-tank-1                       DEGRADED     0     0     0
+          mirror-0                           DEGRADED     0     0     0
+            ata-ST4000NT001-3M2101_WX11TN0Z  REMOVED      0     0     0
+            ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
+```
+
+The disk was no longer visible to the OS at all — `ls /dev/disk/by-id/ | grep WX11TN0Z` returned nothing, and `lsblk` did not show it. SMART queries were not possible.
+
+
+## Diagnostic
+
+### Kernel logs
+
+The kernel logs tell the full story. The failure sequence on `ata3` is virtually identical to the first incident's failure on `ata4`:
+
+1. At `01:42:37`, an interface fatal error on ata3 with failed WRITE commands:
+   ```
+   ata3.00: exception Emask 0x50 SAct 0xc40 SErr 0xe0802 action 0x6 frozen
+   ata3.00: irq_stat 0x08000000, interface fatal error
+   ata3: SError: { RecovComm HostInt PHYInt CommWake 10B8B }
+   ata3.00: failed command: WRITE FPDMA QUEUED
+   ```
+
+2. Link goes down, kernel tries hard resets. The familiar speed downgrade cascade begins:
+   ```
+   ata3: SATA link down (SStatus 0 SControl 300)
+   ata3: hard resetting link
+   ata3: link is slow to respond, please be patient (ready=0)
+   ata3: found unknown device (class 0)
+   ata3.00: qc timeout after 5000 msecs (cmd 0xec)
+   ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
+   ata3: limiting SATA link speed to 3.0 Gbps
+   ```
+
+3. The disk briefly comes back at each speed level, fails immediately with another interface fatal error, and gets downgraded again: 6.0 Gbps → 3.0 Gbps → 1.5 Gbps.
+
+4. At `01:45:31`, the disk detaches (`ata3.00: detaching (SCSI 2:0:0:0)`), leaving a trail of I/O errors on `/dev/sda`. It reappears briefly, then fails again.
+
+5. By `01:46:33`, the disk has detached a second time. It comes back momentarily, then detaches again at `01:48:12` (now as `/dev/sdc` after device name reassignment), with a flood of I/O errors on both reads and writes.
+
+6. From `01:48:19` onward, the kernel enters an endless reset loop:
+   ```
+   ata3: hardreset failed
+   ata3: reset failed, giving up
+   ata3: reset failed (errno=-32), retrying in 8 secs
+   ata3: limiting SATA link speed to 3.0 Gbps
+   ata3: reset failed (errno=-32), retrying in 8 secs
+   ata3: limiting SATA link speed to 1.5 Gbps
+   ata3: reset failed (errno=-32), retrying in 33 secs
+   ata3: hardreset failed
+   ata3: reset failed, giving up
+   ```
+   This loop was still running more than 10 hours later, when the dmesg output was captured: it kept cycling through speed downgrades, giving up, and retrying.
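The failure sequence above can also be tallied per port instead of read line by line. A minimal sketch, assuming kernel log text on stdin and matching only the phrases quoted in the excerpts above; `ata_event_summary` is a hypothetical helper name, not part of any tool:

```shell
#!/usr/bin/env bash
# Hypothetical helper: count link-trouble lines per ATA port.
# Reads kernel log text on stdin. Only the phrases that appear in the
# excerpts above are matched, so other libata messages are not counted.
ata_event_summary() {
  awk '
    /interface fatal error|hard resetting link|SATA link down|limiting SATA link speed|hardreset failed|reset failed/ {
      # Take the first "ataN" token on the line as the port name.
      if (match($0, /ata[0-9]+/)) count[substr($0, RSTART, RLENGTH)]++
    }
    END { for (port in count) print port, count[port] }
  ' | sort
}
```

Usage: `dmesg -T | ata_event_summary`. Ports with nothing attached also log `SATA link down` at boot, so a non-zero count needs context: a lone reset that recovers is very different from a cascade like the one on ata3.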
+
+
+### An interesting contrast: ata4
+
+On Feb 7 at 15:26:48 there was a brief hiccup on ata4, which now carries AGAPITO2 on the swapped cable and port:
+
+```
+ata4.00: exception Emask 0x10 SAct 0x3f000 SErr 0x40d0000 action 0xe frozen
+ata4.00: irq_stat 0x00000040, connection status changed
+ata4: SError: { PHYRdyChg CommWake 10B8B DevExch }
+ata4.00: failed command: READ FPDMA QUEUED
+...
+ata4: hard resetting link
+ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
+ata4.00: configured for UDMA/133
+ata4: EH complete
+```
+
+A single hard reset was enough: the link came back at full speed, error handling completed, and normal operation resumed. This port and cable combination is fine: the cable and port that AGAPITO1 used to be on work for AGAPITO2, which recovered from a transient error without a single speed downgrade.
+
+
+### ZFS events
+
+The ZFS events log shows checksum errors accumulating on `ata-ST4000NT001-3M2101_WX11TN0Z-part1` even before the full crash, during the initial pool import on Feb 4 after the swap. The counter increments from `vdev_cksum_errors = 0x1` through `0x4` during the same import event, and one more later. These early checksum errors were likely precursors to the full failure.
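To follow the counter progression described above without reading raw event dumps, the `vdev_cksum_errors` values can be pulled out of `zpool events -v` output. A rough sketch, assuming the `name = 0x...` line format quoted above; `cksum_error_progression` is a hypothetical name, and the function does not filter by vdev:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every vdev_cksum_errors value as a decimal
# number, in the order the events appear on stdin.
# Assumes lines of the form "vdev_cksum_errors = 0x4" (leading whitespace ok).
cksum_error_progression() {
  awk '$1 == "vdev_cksum_errors" && $2 == "=" { print $3 }' |
    while read -r hex; do
      # printf accepts the 0x prefix and converts hex to decimal.
      printf '%d\n' "$hex"
    done
}
```

Usage: `zpool events -v | cksum_error_progression`. Because events from all vdevs are mixed together, the output is only meaningful here while a single device is misbehaving.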
+
+
+## Conclusion
+
+The cable swap was the decisive experiment. The key facts:
+
+| | First incident (Dec 30) | Second incident (Feb 8) |
+|---|---|---|
+| **Failing disk** | AGAPITO1 (WX11TN0Z) | AGAPITO1 (WX11TN0Z) |
+| **ATA port** | ata4 | ata3 |
+| **Data cable** | cable A | cable B |
+| **Power cable** | power A | power B |
+| **Failure pattern** | Interface fatal, speed downgrades, detach | Interface fatal, speed downgrades, detach |
+
+The problem followed the disk across different cables, different power connections, and different motherboard SATA ports. Cables and ports are exonerated.
+
+AGAPITO1 has a failing SATA PHY: the physical-layer signaling electronics on the drive that handle the SATA link are intermittently dying. This is the kind of failure that SMART tends to miss: its health assessment focuses on media integrity (sectors, heads, motor) and firmware/controller logic, and the only interface-level signals it exposes are counters such as `UDMA_CRC_Error_Count`. As noted in the first incident document, SATA drives are opaque about this kind of issue; they don't "confess" PHY-layer problems the way SAS drives would.
+
+The 992 checksum errors from the first scrub after the initial reseat were likely early symptoms of this progressive hardware failure.
+
+**This disk needs to be replaced.**
+
+
+## Open questions
+
+- How do I replace the disk within the ZFS mirror without data loss? (i.e., `zpool replace` workflow)
+- Should the replacement be another SATA drive, or is this a good opportunity to move to SAS? (Would require a SAS controller since SAS drives don't fit regular SATA ports.)
+- The monitoring question from the first incident remains open: how to automate ZFS pool health monitoring to catch degradations early.
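For the monitoring question, one possible starting point is a small check around `zpool status -x`, which prints a one-line "healthy" message when nothing is wrong. A minimal sketch, assuming it runs from cron or a systemd timer; the function names and the alert placeholder are hypothetical, and the ZFS Event Daemon (`zed`) is an existing alternative worth evaluating:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a periodic pool health check.

# Decide from the text of `zpool status -x [pool]` whether all is well.
# Healthy output is "pool '<name>' is healthy" or "all pools are healthy".
pool_health_ok() {
  case "$1" in
    *"is healthy"* | *"all pools are healthy"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Query one pool and report on stderr if it needs attention.
check_pool() {
  local pool="$1" status
  status="$(zpool status -x "$pool" 2>&1)"
  if pool_health_ok "$status"; then
    return 0
  fi
  # Placeholder alert: replace with mail or another notifier of choice.
  printf 'ZFS pool %s needs attention:\n%s\n' "$pool" "$status" >&2
  return 1
}
```

Usage: `check_pool proxmox-tank-1` returns non-zero on anything other than the healthy message, including a missing pool, so a cron wrapper can alert on the exit status alone.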
--- a/zfs-mirror-incident/20260220_agapito1_replacement_runbook.md
+++ b/zfs-mirror-incident/20260220_agapito1_replacement_runbook.md
@@ -0,0 +1,364 @@
+# AGAPITO1 Replacement Runbook
+
+Continuation of [20260208_second_zfs_degradation.md](20260208_second_zfs_degradation.md).
+
+AGAPITO1 (`ata-ST4000NT001-3M2101_WX11TN0Z`) had a failing SATA PHY and was RMA'd. The ZFS mirror `proxmox-tank-1` has been running degraded on AGAPITO2 alone since Feb 8. The replacement drive (same model, serial `WX120LHQ`) needs to be physically installed and added to the mirror.
+
+**Current state:**
+- Pool: `proxmox-tank-1` (mirror-0), DEGRADED
+- AGAPITO2 (`WX11TN2P`): ONLINE, on ata4
+- Old AGAPITO1 (`WX11TN0Z`): shows REMOVED in pool config
+- Physical: drive bay empty, SATA data + power cables still connected to mobo/PSU (should be ata3 port after the cable swap from incident 2)
+- New drive: ST4000NT001-3M2101, serial `WX120LHQ`
+
+---
+
+## Phase 1: Pre-shutdown state capture
+
+While the server is still running, record the current state for reference.
+
+- [x] **1.1** Record pool status
+  ```
+  zpool status -v proxmox-tank-1
+  ```
+  Expected: DEGRADED, WX11TN0Z shows REMOVED, WX11TN2P ONLINE.
+  ```
+    pool: proxmox-tank-1
+   state: DEGRADED
+  status: One or more devices have been removed.
+          Sufficient replicas exist for the pool to continue functioning in a
+          degraded state.
+  action: Online the device using zpool online' or replace the device with
+          'zpool replace'.
+    scan: scrub repaired 0B in 06:55:06 with 0 errors on Tue Feb 17 20:40:50 2026
+  config:
+
+          NAME                                 STATE     READ WRITE CKSUM
+          proxmox-tank-1                       DEGRADED     0     0     0
+            mirror-0                           DEGRADED     0     0     0
+              ata-ST4000NT001-3M2101_WX11TN0Z  REMOVED      0     0     0
+              ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
+
+  errors: No known data errors
+  ```
+
+- [x] **1.2** Record current SATA layout
+  ```
+  dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up' | tail -20
+  ```
+  Expected: AGAPITO2 visible on ata4. ata3 should show nothing (empty slot).
+  ```
+  [Tue Feb 17 15:37:28 2026] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
+  [Tue Feb 17 15:37:28 2026] ata4.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
+  ```
+
+- [x] **1.3** Confirm AGAPITO2 is healthy before we start
+  ```
+  smartctl -H /dev/disk/by-id/ata-ST4000NT001-3M2101_WX11TN2P
+  ```
+  Expected: PASSED. If not, stop and investigate before proceeding.
+  ```
+  SMART overall-health self-assessment test result: PASSED
+  ```
+
+---
+
+## Phase 2: Graceful shutdown
+
+- [x] **2.1** Shut down all VMs gracefully from Proxmox UI or CLI
+  ```
+  qm list
+  # For each running VM:
+  qm shutdown <VMID>
+  ```
+
+- [x] **2.2** Verify all VMs are stopped
+  ```
+  qm list
+  ```
+  Expected: all show "stopped".
+
+- [x] **2.3** Power down the server
+  ```
+  shutdown -h now
+  ```
+
+---
+
+## Phase 3: Physical installation
+
+- [x] **3.1** Open the case
+- [x] **3.2** Locate the dangling SATA data + power cables (from the old AGAPITO1 slot)
+- [x] **3.3** Visually inspect cables for damage — especially the SATA data connector pins
+- [x] **3.4** Label the new drive as TOMMY with a marker/sticker. Write serial `WX120LHQ` on the label too.
+- [x] **3.5** Seat the new drive in the bay
+- [x] **3.6** Connect SATA data cable to the drive — push firmly until it clicks
+- [x] **3.7** Connect SATA power cable to the drive — push firmly
+- [x] **3.8** Double-check both connectors are fully seated (wiggle test — they shouldn't move)
+- [x] **3.9** Close the case
+
+---
+
+## Phase 4: Boot and verify detection
+
+- [x] **4.1** Power on the server, let it boot into Proxmox
+
+- [x] **4.2** Verify the new drive is detected by the kernel
+  ```
+  dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up'
+  ```
+  Expected: new drive detected on ata3 (or whichever port the cable is on), at 6.0 Gbps.
+  ```
+  [Fri Feb 20 22:57:06 2026] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
+  [Fri Feb 20 22:57:06 2026] ata3.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
+  [Fri Feb 20 22:57:07 2026] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
+  [Fri Feb 20 22:57:07 2026] ata4.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
+  ```
+  TOMMY on ata3, AGAPITO2 on ata4. Both at 6.0 Gbps, firmware EN01.
+
+- [x] **4.3** Verify the drive appears in `/dev/disk/by-id/`
+  ```
+  ls -l /dev/disk/by-id/ | grep WX120LHQ
+  ```
+  Expected: `ata-ST4000NT001-3M2101_WX120LHQ` pointing to some `/dev/sdX`.
+  ```
+  ata-ST4000NT001-3M2101_WX120LHQ -> ../../sda
+  ```
+
+- [x] **4.4** Set variables for convenience
+  ```
+  NEW_DISKID="ata-ST4000NT001-3M2101_WX120LHQ"
+  NEW_DISKPATH="/dev/disk/by-id/$NEW_DISKID"
+  OLD_DISKID="ata-ST4000NT001-3M2101_WX11TN0Z"
+  echo "New: $NEW_DISKID -> $(readlink -f "$NEW_DISKPATH")"
+  ```
+
+- [x] **4.5** Confirm drive identity and firmware version with smartctl
+  ```
+  smartctl -i "$NEW_DISKPATH"
+  ```
+  Expected: Model ST4000NT001-3M2101, Serial WX120LHQ, Firmware EN01, 4TB capacity.
+  ```
+  Device Model:     ST4000NT001-3M2101
+  Serial Number:    WX120LHQ
+  Firmware Version: EN01
+  User Capacity:    4,000,787,030,016 bytes [4.00 TB]
+  SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
+  ```
+
+- [x] **4.6** Check kernel logs are clean — no SATA errors, link drops, or speed downgrades
+  ```
+  dmesg -T | grep -E 'ata[0-9]' | grep -iE 'error|fatal|reset|link down|slow|limiting'
+  ```
+  Expected: nothing. If there are errors here on a brand new drive + known-good cable, **stop and investigate**.
+  ```
+  [Fri Feb 20 22:57:06 2026] ata1: SATA link down (SStatus 0 SControl 300)
+  [Fri Feb 20 22:57:06 2026] ata2: SATA link down (SStatus 0 SControl 300)
+  ```
+  Clean — ata1/ata2 are unused ports. No errors on ata3 or ata4.
+
+---
+
+## Phase 5: Health-check the new drive before trusting data to it
+
+Don't resilver onto a DOA drive.
+
+- [x] **5.1** SMART overall health
+  ```
+  smartctl -H "$NEW_DISKPATH"
+  ```
+  Expected: PASSED.
+  ```
+  SMART overall-health self-assessment test result: PASSED
+  ```
+
+- [x] **5.2** Check SMART attributes baseline
+  ```
+  smartctl -A "$NEW_DISKPATH" | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC|Error_Rate'
+  ```
+  Expected: all critical counters (reallocated, pending, uncorrectable, CRC) at 0 (it's a new/refurb drive).
+  ```
+    1 Raw_Read_Error_Rate     ... -       6072
+    5 Reallocated_Sector_Ct   ... -       0
+    7 Seek_Error_Rate         ... -       476
+  197 Current_Pending_Sector  ... -       0
+  198 Offline_Uncorrectable   ... -       0
+  199 UDMA_CRC_Error_Count    ... -       0
+  ```
+  All critical counters at 0. The non-zero Raw_Read_Error_Rate and Seek_Error_Rate raw values are normal for Seagate's encoding of those attributes.
+
+- [x] **5.3** Run short self-test
+  ```
+  smartctl -t short "$NEW_DISKPATH"
+  ```
+  Wait ~2 minutes, then check:
+  ```
+  smartctl -l selftest "$NEW_DISKPATH"
+  ```
+  Expected: "Completed without error".
+  ```
+  # 1  Short offline       Completed without error       00%         0         -
+  ```
+  Passed. 0 power-on hours — fresh drive.
+
+- [x] **5.4** (Decision point) Short test passed. Proceeding.
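The attribute check in step 5.2 can be made less dependent on eyeballing the table. A small sketch, assuming the standard `smartctl -A` column layout (ID first, attribute name second, raw value tenth); `critical_smart_nonzero` is a hypothetical name, and the choice of attributes 5, 197, 198 and 199 simply mirrors the counters treated as critical above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list critical SMART attributes whose raw value is not 0.
# Reads `smartctl -A` output on stdin and prints "NAME RAW" per offender.
# Raw_Read_Error_Rate (1) and Seek_Error_Rate (7) are deliberately ignored,
# since their raw values are non-zero on a healthy Seagate drive.
critical_smart_nonzero() {
  awk '$1 ~ /^(5|197|198|199)$/ && $10 + 0 != 0 { print $2, $10 }'
}
```

Usage: `smartctl -A "$NEW_DISKPATH" | critical_smart_nonzero` prints nothing for a clean drive, so any output at all is a reason to stop.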
+
+---
+
+## Phase 6: Add new drive to ZFS mirror
+
+- [x] **6.1** Open a dedicated terminal for kernel log monitoring
+  ```
+  dmesg -Tw
+  ```
+  Leave this running throughout the resilver. Watch for ANY `ata` errors.
+
+- [x] **6.2** Replace the old drive with the new one in the pool
+  ```
+  zpool replace proxmox-tank-1 "$OLD_DISKID" "$NEW_DISKID"
+  ```
+  This tells ZFS: "the REMOVED drive WX11TN0Z is being replaced by WX120LHQ". Resilvering starts automatically.
+
+- [x] **6.3** Verify resilvering has started
+  ```
+  zpool status -v proxmox-tank-1
+  ```
+  Expected: state DEGRADED, new drive shows as part of a `replacing` vdev, resilver in progress.
+  ```
+  resilver in progress since Fri Feb 20 23:10:58 2026
+  5.71G / 1.33T scanned at 344M/s, 0B / 1.33T issued
+  0B resilvered, 0.00% done
+
+  replacing-0                        DEGRADED     0     0     0
+    ata-ST4000NT001-3M2101_WX11TN0Z  REMOVED      0     0     0
+    ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0 7.73K
+  ata-ST4000NT001-3M2101_WX11TN2P    ONLINE       0     0     0
+  ```
+  Resilver running. Cksum count on new drive is expected during resilver (unwritten blocks).
+
+- [x] **6.4** Monitor resilver progress periodically
+  ```
+  watch -n 30 "zpool status -v proxmox-tank-1"
+  ```
+  Expected: steady progress, no read/write/cksum errors on either drive. Based on previous experience (~500GB at ~100MB/s with VMs down), expect roughly 1-2 hours.
+  Actual: VMs had auto-started on boot, so the resilver ran under load; it completed 1.34T in 03:32:55 with 0 errors.
+
+- [x] **6.5** VMs were already running (auto-start on boot).
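The gap between the 1-2 hour estimate and the actual 03:32:55 is mostly a matter of size: the estimate assumed ~500GB, while the scan reported 1.33T. A quick sketch of the arithmetic, assuming zpool's `T` means TiB; both function names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helpers: average throughput of a resilver or scrub.

# Convert "HH:MM:SS" to seconds. The 10# prefix forces base 10 so that
# fields such as "08" are not parsed as invalid octal.
hms_to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# $1: size in TiB (assumed meaning of zpool's "T"), $2: duration HH:MM:SS.
# Prints the average rate in MiB/s with one decimal.
throughput_mib_s() {
  awk -v tib="$1" -v secs="$(hms_to_seconds "$2")" \
    'BEGIN { printf "%.1f\n", tib * 1024 * 1024 / secs }'
}

throughput_mib_s 1.34 03:32:55   # prints 110.0
```

That is about 110 MiB/s, in line with the ~100MB/s the estimate assumed, so the longer run is explained by data volume rather than by a slow drive.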
+
+---
+
+## Phase 7: Post-resilver verification
+
+Wait for the resilver to complete (status will say "resilvered XXG in HH:MM:SS with 0 errors").
+
+- [x] **7.1** Check final pool status
+  ```
+  zpool status -v proxmox-tank-1
+  ```
+  Expected: ONLINE (or DEGRADED with "too many errors" message requiring a clear — same as last time).
+  ```
+  state: ONLINE
+  scan: resilvered 1.34T in 03:32:55 with 0 errors on Sat Feb 21 02:43:53 2026
+
+  ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0 7.73K
+  ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
+  ```
+  ONLINE. The 7.73K cksum count on TOMMY is an expected resilver artifact; clearing next.
+
+- [x] **7.2** Clear residual cksum counters
+  ```
+  zpool clear proxmox-tank-1
+  ```
+  Counters cleared (status message and cksum count gone on re-check).
+  ```
+  state: ONLINE
+  scan: resilvered 1.34T in 03:32:55 with 0 errors on Sat Feb 21 02:43:53 2026
+
+  ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0     0
+  ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
+
+  errors: No known data errors
+  ```
+
+- [x] **7.3** Run a full scrub to verify data integrity
+  ```
+  zpool scrub proxmox-tank-1
+  ```
+  Expected: **0 errors on both drives**.
+  ```
+  scrub repaired 0B in 03:27:50 with 0 errors on Sat Feb 21 11:38:02 2026
+
+  ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0     0
+  ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
+
+  errors: No known data errors
+  ```
+
+- [x] **7.4** Clean status confirmed — 0B repaired, 0 errors, both drives 0/0/0.
+
+- [x] **7.5** Baseline SMART snapshot of the new drive after heavy I/O
+  ```
+  smartctl -x "$NEW_DISKPATH" | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC|Hardware Resets|COMRESET|Interface'
+  ```
+  Expected: 0 reallocated, 0 CRC errors, low hardware reset count.
+  ```
+  Reallocated_Sector_Ct    ... 0
+  Current_Pending_Sector   ... 0
+  Offline_Uncorrectable    ... 0
+  UDMA_CRC_Error_Count     ... 0
+  Number of Hardware Resets ... 2
+  Number of Interface CRC Errors ... 0
+  COMRESET ... 2
+  ```
+  All clean. 2 hardware resets / COMRESETs from boot — normal.
+
+- ~~**7.6**~~ Skipped — extended SMART self-test is redundant after a clean resilver + scrub. ZFS checksums already verified every data block; the only thing the long test would cover is empty space that ZFS hasn't written to, which ZFS will verify on future use anyway.
+
+---
+
+## Phase 8: Final state — done
+
+- [x] **8.1** Final pool status — already captured in 7.4. Mirror is healthy:
+  ```
+    pool: proxmox-tank-1
+   state: ONLINE
+    scan: scrub repaired 0B in 03:27:50 with 0 errors on Sat Feb 21 11:38:02 2026
+   config:
+     NAME                                 STATE     READ WRITE CKSUM
+     proxmox-tank-1                       ONLINE       0     0     0
+       mirror-0                           ONLINE       0     0     0
+         ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0     0
+         ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
+
+   errors: No known data errors
+  ```
+
+- [x] **8.2** All VMs running normally — verified from Proxmox UI
+
+- [x] **8.3** Celebrate. Mirror is whole again.
+
+---
+
+## Abort conditions
+
+Stop and investigate if any of these happen:
+
+- New drive not detected after boot (bad seating or DOA)
+- SATA errors in `dmesg` during or after boot (bad cable? bad drive?)
+- SMART short test fails on new drive (DOA — contact seller)
+- Resilver stalls or produces errors on the new drive
+- Scrub finds checksum errors on the new drive
+
+---
+
+## Execution summary
+
+Executed 2026-02-20 evening through 2026-02-21 morning. No abort conditions hit — completely clean run.
+
+- TOMMY (`WX120LHQ`) installed on ata3 at 6.0 Gbps, detected on first boot
+- SMART short test passed, all critical attributes at zero
+- Resilver: 1.34T in 03:32:55, 0 errors (VMs were running — auto-start on boot)
+- Scrub: repaired 0B in 03:27:50, 0 errors, both drives 0/0/0
+- Post-I/O SMART baseline clean: 0 reallocated, 0 CRC errors
+- Extended SMART test skipped — redundant after clean resilver + scrub (ZFS checksums already verified all data blocks)
+- Pool `proxmox-tank-1` fully healthy. Mirror degradation that started 2026-02-08 is resolved.