pablohere/20260220_agapito1_replacement_runbook.md
2026-02-21 12:07:57 +01:00

AGAPITO1 Replacement Runbook

Continuation of 20260208_second_zfs_degradation.md.

AGAPITO1 (ata-ST4000NT001-3M2101_WX11TN0Z) had a failing SATA PHY and was RMA'd. The ZFS mirror proxmox-tank-1 has been running degraded on AGAPITO2 alone since Feb 8. The replacement drive (same model, serial WX120LHQ) needs to be physically installed and added to the mirror.

Current state:

  • Pool: proxmox-tank-1 (mirror-0), DEGRADED
  • AGAPITO2 (WX11TN2P): ONLINE, on ata4
  • Old AGAPITO1 (WX11TN0Z): shows REMOVED in pool config
  • Physical: drive bay empty, SATA data + power cables still connected to mobo/PSU (should be ata3 port after the cable swap from incident 2)
  • New drive: ST4000NT001-3M2101, serial WX120LHQ

Phase 1: Pre-shutdown state capture

While the server is still running, record the current state for reference.

  • 1.1 Record pool status

    zpool status -v proxmox-tank-1
    

    Expected: DEGRADED, WX11TN0Z shows REMOVED, WX11TN2P ONLINE.

      pool: proxmox-tank-1
     state: DEGRADED
    status: One or more devices have been removed.
            Sufficient replicas exist for the pool to continue functioning in a
            degraded state.
    action: Online the device using zpool online' or replace the device with
            'zpool replace'.
      scan: scrub repaired 0B in 06:55:06 with 0 errors on Tue Feb 17 20:40:50 2026
    config:
    
            NAME                                 STATE     READ WRITE CKSUM
            proxmox-tank-1                       DEGRADED     0     0     0
              mirror-0                           DEGRADED     0     0     0
                ata-ST4000NT001-3M2101_WX11TN0Z  REMOVED      0     0     0
                ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
    
    errors: No known data errors
    
  • 1.2 Record current SATA layout

    dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up' | tail -20
    

    Expected: AGAPITO2 visible on ata4. ata3 should show nothing (empty slot).

    [Tue Feb 17 15:37:28 2026] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [Tue Feb 17 15:37:28 2026] ata4.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
    
  • 1.3 Confirm AGAPITO2 is healthy before we start

    smartctl -H /dev/disk/by-id/ata-ST4000NT001-3M2101_WX11TN2P
    

    Expected: PASSED. If not, stop and investigate before proceeding.

    SMART overall-health self-assessment test result: PASSED
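
The pool check in 1.1 can be turned into a scripted gate before shutdown. This is a minimal sketch, assuming `zpool status` keeps the column layout shown in 1.1; `vdev_state` is a hypothetical helper name, not a ZFS command.

```shell
# vdev_state DEVICE: read `zpool status` text on stdin and print the STATE
# column for DEVICE (e.g. ONLINE, REMOVED). Prints nothing if DEVICE is not
# listed in the config section.
vdev_state() {
    awk -v dev="$1" '$1 == dev { print $2; exit }'
}

# Usage on the live host (not run here):
#   zpool status proxmox-tank-1 | vdev_state ata-ST4000NT001-3M2101_WX11TN2P
```

If the surviving disk reports anything other than ONLINE, the same stop-and-investigate rule as in 1.3 applies.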
    

Phase 2: Graceful shutdown

  • 2.1 Shut down all VMs gracefully from Proxmox UI or CLI

    qm list
    # For each running VM:
    qm shutdown <VMID>
    
  • 2.2 Verify all VMs are stopped

    qm list
    

    Expected: all show "stopped".

  • 2.3 Power down the server

    shutdown -h now
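
Steps 2.1 and 2.2 can be scripted rather than typed per VM. This is a sketch only, assuming the usual `qm list` column order (VMID, NAME, STATUS, ...); `running_vmids` is a hypothetical helper name.

```shell
# running_vmids: read `qm list` output on stdin and print the VMID of every
# VM whose STATUS column is "running". The first line is the column header.
running_vmids() {
    awk 'NR > 1 && $3 == "running" { print $1 }'
}

# Usage on the Proxmox host (not run here):
#   for vmid in $(qm list | running_vmids); do qm shutdown "$vmid"; done
#   qm list | running_vmids    # prints nothing once every VM is stopped
```

Run the second command before 2.3; any VMID it still prints belongs to a guest that has not stopped yet.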
    

Phase 3: Physical installation

  • 3.1 Open the case
  • 3.2 Locate the dangling SATA data + power cables (from the old AGAPITO1 slot)
  • 3.3 Visually inspect cables for damage — especially the SATA data connector pins
  • 3.4 Label the new drive as TOMMY with a marker/sticker. Write serial WX120LHQ on the label too.
  • 3.5 Seat the new drive in the bay
  • 3.6 Connect SATA data cable to the drive — push firmly until it clicks
  • 3.7 Connect SATA power cable to the drive — push firmly
  • 3.8 Double-check both connectors are fully seated (wiggle test — they shouldn't move)
  • 3.9 Close the case

Phase 4: Boot and verify detection

  • 4.1 Power on the server, let it boot into Proxmox

  • 4.2 Verify the new drive is detected by the kernel

    dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up'
    

    Expected: new drive detected on ata3 (or whichever port the cable is on), at 6.0 Gbps.

    [Fri Feb 20 22:57:06 2026] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [Fri Feb 20 22:57:06 2026] ata3.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
    [Fri Feb 20 22:57:07 2026] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [Fri Feb 20 22:57:07 2026] ata4.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
    

    TOMMY on ata3, AGAPITO2 on ata4. Both at 6.0 Gbps, firmware EN01.

  • 4.3 Verify the drive appears in /dev/disk/by-id/

    ls -l /dev/disk/by-id/ | grep WX120LHQ
    

    Expected: ata-ST4000NT001-3M2101_WX120LHQ pointing to some /dev/sdX.

    ata-ST4000NT001-3M2101_WX120LHQ -> ../../sda
    
  • 4.4 Set variables for convenience

    NEW_DISKID="ata-ST4000NT001-3M2101_WX120LHQ"
    NEW_DISKPATH="/dev/disk/by-id/$NEW_DISKID"
    OLD_DISKID="ata-ST4000NT001-3M2101_WX11TN0Z"
    echo "New: $NEW_DISKID -> $(readlink -f "$NEW_DISKPATH")"
    
  • 4.5 Confirm drive identity and firmware version with smartctl

    smartctl -i "$NEW_DISKPATH"
    

    Expected: Model ST4000NT001-3M2101, Serial WX120LHQ, Firmware EN01, 4TB capacity.

    Device Model:     ST4000NT001-3M2101
    Serial Number:    WX120LHQ
    Firmware Version: EN01
    User Capacity:    4,000,787,030,016 bytes [4.00 TB]
    SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
    
  • 4.6 Check kernel logs are clean — no SATA errors, link drops, or speed downgrades

    dmesg -T | grep -E 'ata[0-9]' | grep -iE 'error|fatal|reset|link down|slow|limiting'
    

    Expected: nothing for ata3 or ata4. "SATA link down" lines for unpopulated ports are harmless. If there are errors here on a brand new drive + known-good cable, stop and investigate.

    [Fri Feb 20 22:57:06 2026] ata1: SATA link down (SStatus 0 SControl 300)
    [Fri Feb 20 22:57:06 2026] ata2: SATA link down (SStatus 0 SControl 300)
    

    Clean — ata1/ata2 are unused ports. No errors on ata3 or ata4.
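
The filter in 4.6 also matches "link down" on unused ports, so the result has to be read by eye. A possible refinement is to restrict the check to the ports that actually carry drives. This is a sketch under that assumption; `sata_problems` is a hypothetical helper name.

```shell
# sata_problems PORT...: read kernel log text on stdin and print only the
# suspicious lines that belong to one of the given ports (e.g. ata3 ata4).
# Lines for other ports, such as "link down" on empty ports, are ignored.
sata_problems() {
    # Join the port names into a regex alternation: "ata3|ata4"
    _ports=$(IFS='|'; echo "$*")
    grep -E "(${_ports})(\.[0-9]+)?:" |
        grep -iE 'error|fatal|reset|link down|slow|limiting'
}

# Usage on the live host (not run here):
#   dmesg -T | sata_problems ata3 ata4
```

With the boot log from 4.6 this prints nothing, since the only matches were on ata1 and ata2.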


Phase 5: Health-check the new drive before trusting data to it

Don't resilver onto a DOA drive.

  • 5.1 SMART overall health

    smartctl -H "$NEW_DISKPATH"
    

    Expected: PASSED.

    SMART overall-health self-assessment test result: PASSED
    
  • 5.2 Check SMART attributes baseline

    smartctl -A "$NEW_DISKPATH" | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC|Error_Rate'
    

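    Expected: Reallocated, Pending, Offline_Uncorrectable and CRC counters at 0 (it's a new/refurb drive). Raw_Read_Error_Rate and Seek_Error_Rate raw values may be non-zero on Seagate drives.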

      1 Raw_Read_Error_Rate     ... -       6072
      5 Reallocated_Sector_Ct   ... -       0
      7 Seek_Error_Rate         ... -       476
    197 Current_Pending_Sector  ... -       0
    198 Offline_Uncorrectable   ... -       0
    199 UDMA_CRC_Error_Count    ... -       0
    

    All critical counters at 0. Read/Seek error rate raw values are normal Seagate encoding.

  • 5.3 Run short self-test

    smartctl -t short "$NEW_DISKPATH"
    

    Wait ~2 minutes, then check:

    smartctl -l selftest "$NEW_DISKPATH"
    

    Expected: "Completed without error".

    # 1  Short offline       Completed without error       00%         0         -
    

    Passed. 0 power-on hours — fresh drive.

  • 5.4 (Decision point) Short test passed. Proceeding.
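
The decision in 5.4 depends on the critical counters from 5.2 being zero. That check can be automated. This is a sketch, assuming the standard ten-column `smartctl -A` attribute table with the raw value in the last column; `critical_smart_nonzero` is a hypothetical helper name.

```shell
# critical_smart_nonzero: read `smartctl -A` output on stdin and print
# "NAME=RAW" for each watched attribute whose raw value is not 0.
# Exit status is 1 if anything was printed, 0 if all watched counters are 0.
# Raw_Read_Error_Rate and Seek_Error_Rate are deliberately not watched.
critical_smart_nonzero() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ && $10 + 0 != 0 {
            print $2 "=" $10
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Usage on the live host (not run here):
#   smartctl -A "$NEW_DISKPATH" | critical_smart_nonzero || echo "STOP: investigate"
```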


Phase 6: Add new drive to ZFS mirror

  • 6.1 Open a dedicated terminal for kernel log monitoring

    dmesg -Tw
    

    Leave this running throughout the resilver. Watch for ANY ata errors.

  • 6.2 Replace the old drive with the new one in the pool

    zpool replace proxmox-tank-1 "$OLD_DISKID" "$NEW_DISKID"
    

    This tells ZFS: "the REMOVED drive WX11TN0Z is being replaced by WX120LHQ". Resilvering starts automatically.

  • 6.3 Verify resilvering has started

    zpool status -v proxmox-tank-1
    

    Expected: state DEGRADED, new drive shows as part of a replacing vdev, resilver in progress.

    resilver in progress since Fri Feb 20 23:10:58 2026
    5.71G / 1.33T scanned at 344M/s, 0B / 1.33T issued
    0B resilvered, 0.00% done
    
    replacing-0                        DEGRADED     0     0     0
      ata-ST4000NT001-3M2101_WX11TN0Z  REMOVED      0     0     0
      ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0 7.73K
    ata-ST4000NT001-3M2101_WX11TN2P    ONLINE       0     0     0
    

    Resilver running. Cksum count on new drive is expected during resilver (unwritten blocks).

  • 6.4 Monitor resilver progress periodically

    watch -n 30 "zpool status -v proxmox-tank-1"
    

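    Expected: steady progress, no new read/write/cksum errors on either drive beyond the resilver cksum count noted in 6.3. The original estimate, based on previous experience (~500GB at ~100MB/s with VMs down), was roughly 1-2 hours.

    Actual: VMs were auto-started on boot and the pool held 1.33T rather than ~500GB, which is consistent with the longer run. Resilver completed: 1.34T in 03:32:55 with 0 errors.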

  • 6.5 VMs were already running (auto-start on boot).
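
The resilver duration can be sanity-checked with a little arithmetic. This is a sketch, assuming the T in `zpool status` is a binary unit (1T = 1024 * 1024 MiB) and treating the ~100MB/s planning figure as MiB/s. The helper names are hypothetical.

```shell
# hms_to_seconds HH:MM:SS: convert a zpool-style duration to seconds.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# avg_mib_per_s TIB SECONDS: average throughput in MiB/s, rounded.
avg_mib_per_s() {
    awk -v tib="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", tib * 1024 * 1024 / secs }'
}

# eta_hours TIB MIB_PER_S: estimated duration in hours, one decimal place.
eta_hours() {
    awk -v tib="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", tib * 1024 * 1024 / rate / 3600 }'
}

# Example with the figures from this run:
#   hms_to_seconds 03:32:55    # 12775
#   avg_mib_per_s 1.34 12775   # 110
#   eta_hours 1.33 100         # 3.9
```

By this estimate, 1.33T at roughly 100 MiB/s takes close to four hours, which is in line with the observed 03:32:55.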


Phase 7: Post-resilver verification

Wait for resilver to complete (status will say "resilvered XXG in HH:MM:SS with 0 errors").

  • 7.1 Check final pool status

    zpool status -v proxmox-tank-1
    

    Expected: ONLINE (or DEGRADED with "too many errors" message requiring a clear — same as last time).

    state: ONLINE
    scan: resilvered 1.34T in 03:32:55 with 0 errors on Sat Feb 21 02:43:53 2026
    
    ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0 7.73K
    ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
    

    ONLINE. 7.73K cksum on TOMMY is expected resilver artifact — clearing next.

  • 7.2 Clear residual cksum counters

    zpool clear proxmox-tank-1
    

    Counters cleared (status message and cksum count gone on re-check).

    state: ONLINE
    scan: resilvered 1.34T in 03:32:55 with 0 errors on Sat Feb 21 02:43:53 2026
    
    ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0     0
    ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
    
    errors: No known data errors
    
  • 7.3 Run a full scrub to verify data integrity

    zpool scrub proxmox-tank-1
    

    Expected: 0 errors on both drives.

    scrub repaired 0B in 03:27:50 with 0 errors on Sat Feb 21 11:38:02 2026
    
    ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0     0
    ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
    
    errors: No known data errors
    
  • 7.4 Clean status confirmed — 0B repaired, 0 errors, both drives 0/0/0.

  • 7.5 Baseline SMART snapshot of the new drive after heavy I/O

    smartctl -x "$NEW_DISKPATH" | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC|Hardware Resets|COMRESET|Interface'
    

    Expected: 0 reallocated, 0 CRC errors, low hardware reset count.

    Reallocated_Sector_Ct    ... 0
    Current_Pending_Sector   ... 0
    Offline_Uncorrectable    ... 0
    UDMA_CRC_Error_Count     ... 0
    Number of Hardware Resets ... 2
    Number of Interface CRC Errors ... 0
    COMRESET ... 2
    

    All clean. 2 hardware resets / COMRESETs from boot — normal.

  • 7.6 Skipped — extended SMART self-test is redundant after a clean resilver + scrub. ZFS checksums already verified every data block; the only thing the long test would cover is empty space that ZFS hasn't written to, which ZFS will verify on future use anyway.
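
The 0/0/0 checks in 7.1 through 7.4 can be scripted instead of read by eye. This is a sketch, assuming the config rows keep the NAME STATE READ WRITE CKSUM layout shown above; `devices_with_errors` is a hypothetical helper name.

```shell
# devices_with_errors: read `zpool status` text on stdin and print every
# config row whose READ, WRITE or CKSUM column is not a plain 0, as
# "NAME READ WRITE CKSUM". Prints nothing when all counters are clean.
devices_with_errors() {
    awk '
        NF >= 5 && $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|REMOVED|UNAVAIL)$/ &&
        ($3 != "0" || $4 != "0" || $5 != "0") { print $1, $3, $4, $5 }
    '
}

# Usage on the live host (not run here):
#   zpool status proxmox-tank-1 | devices_with_errors
```

Run against the output in 7.1 it lists TOMMY with its 7.73K cksum count. After the clear in 7.2 and the scrub in 7.3 it prints nothing.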


Phase 8: Final state — done

  • 8.1 Final pool status — already captured in 7.4. Mirror is healthy:

      pool: proxmox-tank-1
     state: ONLINE
      scan: scrub repaired 0B in 03:27:50 with 0 errors on Sat Feb 21 11:38:02 2026
    config:
    
            NAME                                 STATE     READ WRITE CKSUM
            proxmox-tank-1                       ONLINE       0     0     0
              mirror-0                           ONLINE       0     0     0
                ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0     0
                ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
    
    errors: No known data errors
    
  • 8.2 All VMs running normally — verified from Proxmox UI

  • 8.3 Celebrate. Mirror is whole again.


Abort conditions

Stop and investigate if any of these happen:

  • New drive not detected after boot (bad seating or DOA)
  • SATA errors in dmesg during or after boot (bad cable? bad drive?)
  • SMART short test fails on new drive (DOA — contact seller)
  • Resilver stalls or produces errors on the new drive
  • Scrub finds checksum errors on the new drive

Execution summary

Executed 2026-02-20 evening through 2026-02-21 morning. No abort conditions hit — completely clean run.

  • TOMMY (WX120LHQ) installed on ata3 at 6.0 Gbps, detected first boot
  • SMART short test passed, all critical attributes at zero
  • Resilver: 1.34T in 03:32:55, 0 errors (VMs were running — auto-start on boot)
  • Scrub: repaired 0B in 03:27:50, 0 errors, both drives 0/0/0
  • Post-I/O SMART baseline clean: 0 reallocated, 0 CRC errors
  • Extended SMART test skipped — redundant after clean resilver + scrub (ZFS checksums already verified all data blocks)
  • Pool proxmox-tank-1 fully healthy. Mirror degradation that started 2026-02-08 is resolved.