diff --git a/20260220_agapito1_replacement_runbook.md b/20260220_agapito1_replacement_runbook.md
deleted file mode 100644
index a06d2a8..0000000
--- a/20260220_agapito1_replacement_runbook.md
+++ /dev/null
@@ -1,364 +0,0 @@
-# AGAPITO1 Replacement Runbook
-
-Continuation of [20260208_second_zfs_degradation.md](20260208_second_zfs_degradation.md).
-
-AGAPITO1 (`ata-ST4000NT001-3M2101_WX11TN0Z`) had a failing SATA PHY and was RMA'd. The ZFS mirror `proxmox-tank-1` has been running degraded on AGAPITO2 alone since Feb 8. The replacement drive (same model, serial `WX120LHQ`) needs to be physically installed and added to the mirror.
-
-**Current state:**
-- Pool: `proxmox-tank-1` (mirror-0), DEGRADED
-- AGAPITO2 (`WX11TN2P`): ONLINE, on ata4
-- Old AGAPITO1 (`WX11TN0Z`): shows REMOVED in pool config
-- Physical: drive bay empty, SATA data + power cables still connected to mobo/PSU (should be ata3 port after the cable swap from incident 2)
-- New drive: ST4000NT001-3M2101, serial `WX120LHQ`
-
----- 
-
-## Phase 1: Pre-shutdown state capture
-
-While server is still running, log current state for reference.
-
-- [x] **1.1** Record pool status
-  ```
-  zpool status -v proxmox-tank-1
-  ```
-  Expected: DEGRADED, WX11TN0Z shows REMOVED, WX11TN2P ONLINE.
-  ```
-    pool: proxmox-tank-1
-   state: DEGRADED
-  status: One or more devices have been removed.
-          Sufficient replicas exist for the pool to continue functioning in a
-          degraded state.
-  action: Online the device using zpool online' or replace the device with
-          'zpool replace'.
-    scan: scrub repaired 0B in 06:55:06 with 0 errors on Tue Feb 17 20:40:50 2026
-  config:
-
-          NAME                                 STATE     READ WRITE CKSUM
-          proxmox-tank-1                       DEGRADED     0     0     0
-            mirror-0                           DEGRADED     0     0     0
-              ata-ST4000NT001-3M2101_WX11TN0Z  REMOVED      0     0     0
-              ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
-
-  errors: No known data errors
-  ```
-
-- [x] **1.2** Record current SATA layout
-  ```
-  dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up' | tail -20
-  ```
-  Expected: AGAPITO2 visible on ata4. ata3 should show nothing (empty slot).
-  ```
-  [Tue Feb 17 15:37:28 2026] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
-  [Tue Feb 17 15:37:28 2026] ata4.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
-  ```
-
-- [x] **1.3** Confirm AGAPITO2 is healthy before we start
-  ```
-  smartctl -H /dev/disk/by-id/ata-ST4000NT001-3M2101_WX11TN2P
-  ```
-  Expected: PASSED. If not, stop and investigate before proceeding.
-  ```
-  SMART overall-health self-assessment test result: PASSED
-  ```
-
----- 
-
-## Phase 2: Graceful shutdown
-
-- [x] **2.1** Shut down all VMs gracefully from Proxmox UI or CLI
-  ```
-  qm list
-  # For each running VM:
-  qm shutdown <vmid>
-  ```
-
-- [x] **2.2** Verify all VMs are stopped
-  ```
-  qm list
-  ```
-  Expected: all show "stopped".
-
-- [x] **2.3** Power down the server
-  ```
-  shutdown -h now
-  ```
-
----- 
-
-## Phase 3: Physical installation
-
-- [x] **3.1** Open the case
-- [x] **3.2** Locate the dangling SATA data + power cables (from the old AGAPITO1 slot)
-- [x] **3.3** Visually inspect cables for damage — especially the SATA data connector pins
-- [x] **3.4** Label the new drive as TOMMY with a marker/sticker. Write serial `WX120LHQ` on the label too.
-- [x] **3.5** Seat the new drive in the bay
-- [x] **3.6** Connect SATA data cable to the drive — push firmly until it clicks
-- [x] **3.7** Connect SATA power cable to the drive — push firmly
-- [x] **3.8** Double-check both connectors are fully seated (wiggle test — they shouldn't move)
-- [x] **3.9** Close the case
-
----- 
-
-## Phase 4: Boot and verify detection
-
-- [x] **4.1** Power on the server, let it boot into Proxmox
-
-- [x] **4.2** Verify the new drive is detected by the kernel
-  ```
-  dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up'
-  ```
-  Expected: new drive detected on ata3 (or whichever port the cable is on), at 6.0 Gbps.
-  ```
-  [Fri Feb 20 22:57:06 2026] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
-  [Fri Feb 20 22:57:06 2026] ata3.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
-  [Fri Feb 20 22:57:07 2026] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
-  [Fri Feb 20 22:57:07 2026] ata4.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
-  ```
-  TOMMY on ata3, AGAPITO2 on ata4. Both at 6.0 Gbps, firmware EN01.
-
-- [x] **4.3** Verify the drive appears in `/dev/disk/by-id/`
-  ```
-  ls -l /dev/disk/by-id/ | grep WX120LHQ
-  ```
-  Expected: `ata-ST4000NT001-3M2101_WX120LHQ` pointing to some `/dev/sdX`.
-  ```
-  ata-ST4000NT001-3M2101_WX120LHQ -> ../../sda
-  ```
-
-- [ ] **4.4** Set variables for convenience
-  ```
-  NEW_DISKID="ata-ST4000NT001-3M2101_WX120LHQ"
-  NEW_DISKPATH="/dev/disk/by-id/$NEW_DISKID"
-  OLD_DISKID="ata-ST4000NT001-3M2101_WX11TN0Z"
-  echo "New: $NEW_DISKID -> $(readlink -f $NEW_DISKPATH)"
-  ```
-
-- [x] **4.5** Confirm drive identity and firmware version with smartctl
-  ```
-  smartctl -i "$NEW_DISKPATH"
-  ```
-  Expected: Model ST4000NT001-3M2101, Serial WX120LHQ, Firmware EN01, 4TB capacity.
-  ```
-  Device Model:     ST4000NT001-3M2101
-  Serial Number:    WX120LHQ
-  Firmware Version: EN01
-  User Capacity:    4,000,787,030,016 bytes [4.00 TB]
-  SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
-  ```
-
-- [x] **4.6** Check kernel logs are clean — no SATA errors, link drops, or speed downgrades
-  ```
-  dmesg -T | grep -E 'ata[0-9]' | grep -iE 'error|fatal|reset|link down|slow|limiting'
-  ```
-  Expected: nothing. If there are errors here on a brand new drive + known-good cable, **stop and investigate**.
-  ```
-  [Fri Feb 20 22:57:06 2026] ata1: SATA link down (SStatus 0 SControl 300)
-  [Fri Feb 20 22:57:06 2026] ata2: SATA link down (SStatus 0 SControl 300)
-  ```
-  Clean — ata1/ata2 are unused ports. No errors on ata3 or ata4.
-
----- 
-
-## Phase 5: Health-check the new drive before trusting data to it
-
-Don't resilver onto a DOA drive.
-
-- [x] **5.1** SMART overall health
-  ```
-  smartctl -H "$NEW_DISKPATH"
-  ```
-  Expected: PASSED.
-  ```
-  SMART overall-health self-assessment test result: PASSED
-  ```
-
-- [x] **5.2** Check SMART attributes baseline
-  ```
-  smartctl -A "$NEW_DISKPATH" | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC|Error_Rate'
-  ```
-  Expected: all counters at 0 (it's a new/refurb drive).
-  ```
-    1 Raw_Read_Error_Rate     ... -  6072
-    5 Reallocated_Sector_Ct   ... -  0
-    7 Seek_Error_Rate         ... -  476
-  197 Current_Pending_Sector  ... -  0
-  198 Offline_Uncorrectable   ... -  0
-  199 UDMA_CRC_Error_Count    ... -  0
-  ```
-  All critical counters at 0. Read/Seek error rate raw values are normal Seagate encoding.
-
-- [x] **5.3** Run short self-test
-  ```
-  smartctl -t short "$NEW_DISKPATH"
-  ```
-  Wait ~2 minutes, then check:
-  ```
-  smartctl -l selftest "$NEW_DISKPATH"
-  ```
-  Expected: "Completed without error".
-  ```
-  # 1  Short offline  Completed without error  00%  0  -
-  ```
-  Passed. 0 power-on hours — fresh drive.
-
-- [x] **5.4** (Decision point) Short test passed. Proceeding.
-
----- 
-
-## Phase 6: Add new drive to ZFS mirror
-
-- [x] **6.1** Open a dedicated terminal for kernel log monitoring
-  ```
-  dmesg -Tw
-  ```
-  Leave this running throughout the resilver. Watch for ANY `ata` errors.
-
-- [x] **6.2** Replace the old drive with the new one in the pool
-  ```
-  zpool replace proxmox-tank-1 "$OLD_DISKID" "$NEW_DISKID"
-  ```
-  This tells ZFS: "the REMOVED drive WX11TN0Z is being replaced by WX120LHQ". Resilvering starts automatically.
-
-- [x] **6.3** Verify resilvering has started
-  ```
-  zpool status -v proxmox-tank-1
-  ```
-  Expected: state DEGRADED, new drive shows as part of a `replacing` vdev, resilver in progress.
-  ```
-  resilver in progress since Fri Feb 20 23:10:58 2026
-    5.71G / 1.33T scanned at 344M/s, 0B / 1.33T issued
-    0B resilvered, 0.00% done
-
-  replacing-0                          DEGRADED     0     0     0
-    ata-ST4000NT001-3M2101_WX11TN0Z    REMOVED      0     0     0
-    ata-ST4000NT001-3M2101_WX120LHQ    ONLINE       0     0 7.73K
-  ata-ST4000NT001-3M2101_WX11TN2P      ONLINE       0     0     0
-  ```
-  Resilver running. Cksum count on new drive is expected during resilver (unwritten blocks).
-
-- [x] **6.4** Monitor resilver progress periodically
-  ```
-  watch -n 30 "zpool status -v proxmox-tank-1"
-  ```
-  Expected: steady progress, no read/write/cksum errors on either drive. Based on previous experience (~500GB at ~100MB/s with VMs down), expect roughly 1-2 hours.
-  VMs were auto-started on boot. Resilver completed: 1.34T in 03:32:55 with 0 errors.
-
-- [x] **6.5** VMs were already running (auto-start on boot).
-
----- 
-
-## Phase 7: Post-resilver verification
-
-Wait for resilver to complete (status will say "resilvered XXG in HH:MM:SS with 0 errors").
-
-- [x] **7.1** Check final pool status
-  ```
-  zpool status -v proxmox-tank-1
-  ```
-  Expected: ONLINE (or DEGRADED with "too many errors" message requiring a clear — same as last time).
-  ```
-   state: ONLINE
-    scan: resilvered 1.34T in 03:32:55 with 0 errors on Sat Feb 21 02:43:53 2026
-
-  ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0 7.73K
-  ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
-  ```
-  ONLINE. 7.73K cksum on TOMMY is expected resilver artifact — clearing next.
-
-- [x] **7.2** Clear residual cksum counters
-  ```
-  zpool clear proxmox-tank-1
-  ```
-  Counters cleared (status message and cksum count gone on re-check).
-  ```
-   state: ONLINE
-    scan: resilvered 1.34T in 03:32:55 with 0 errors on Sat Feb 21 02:43:53 2026
-
-  ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0     0
-  ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
-
-  errors: No known data errors
-  ```
-
-- [x] **7.3** Run a full scrub to verify data integrity
-  ```
-  zpool scrub proxmox-tank-1
-  ```
-  Expected: **0 errors on both drives**.
-  ```
-  scrub repaired 0B in 03:27:50 with 0 errors on Sat Feb 21 11:38:02 2026
-
-  ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0     0
-  ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
-
-  errors: No known data errors
-  ```
-
-- [x] **7.4** Clean status confirmed — 0B repaired, 0 errors, both drives 0/0/0.
-
-- [x] **7.5** Baseline SMART snapshot of the new drive after heavy I/O
-  ```
-  smartctl -x "$NEW_DISKPATH" | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC|Hardware Resets|COMRESET|Interface'
-  ```
-  Expected: 0 reallocated, 0 CRC errors, low hardware reset count.
-  ```
-  Reallocated_Sector_Ct          ... 0
-  Current_Pending_Sector         ... 0
-  Offline_Uncorrectable          ... 0
-  UDMA_CRC_Error_Count           ... 0
-  Number of Hardware Resets      ... 2
-  Number of Interface CRC Errors ... 0
-  COMRESET                       ... 2
-  ```
-  All clean. 2 hardware resets / COMRESETs from boot — normal.
-
-- ~~**7.6**~~ Skipped — extended SMART self-test is redundant after a clean resilver + scrub. ZFS checksums already verified every data block; the only thing the long test would cover is empty space that ZFS hasn't written to, which ZFS will verify on future use anyway.
-
----- 
-
-## Phase 8: Final state — done
-
-- [x] **8.1** Final pool status — already captured in 7.4. Mirror is healthy:
-  ```
-    pool: proxmox-tank-1
-   state: ONLINE
-    scan: scrub repaired 0B in 03:27:50 with 0 errors on Sat Feb 21 11:38:02 2026
-  config:
-          NAME                                 STATE     READ WRITE CKSUM
-          proxmox-tank-1                       ONLINE       0     0     0
-            mirror-0                           ONLINE       0     0     0
-              ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0     0
-              ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
-
-  errors: No known data errors
-  ```
-
-- [x] **8.2** All VMs running normally — verified from Proxmox UI
-
-- [x] **8.3** Celebrate. Mirror is whole again.
-
----- 
-
-## Abort conditions
-
-Stop and investigate if any of these happen:
-
-- New drive not detected after boot (bad seating or DOA)
-- SATA errors in `dmesg` during or after boot (bad cable? bad drive?)
-- SMART short test fails on new drive (DOA — contact seller)
-- Resilver stalls or produces errors on the new drive
-- Scrub finds checksum errors on the new drive
-
----- 
-
-## Execution summary
-
-Executed 2026-02-20 evening through 2026-02-21 morning. No abort conditions hit — completely clean run.
-
-- TOMMY (`WX120LHQ`) installed on ata3 at 6.0 Gbps, detected first boot
-- SMART short test passed, all critical attributes at zero
-- Resilver: 1.34T in 03:32:55, 0 errors (VMs were running — auto-start on boot)
-- Scrub: repaired 0B in 03:27:50, 0 errors, both drives 0/0/0
-- Post-I/O SMART baseline clean: 0 reallocated, 0 CRC errors
-- Extended SMART test skipped — redundant after clean resilver + scrub (ZFS checksums already verified all data blocks)
-- Pool `proxmox-tank-1` fully healthy. Mirror degradation that started 2026-02-08 is resolved.
diff --git a/public/index.html b/public/index.html
index 9593c1a..f7d2efa 100644
--- a/public/index.html
+++ b/public/index.html
@@ -147,10 +147,6 @@

Writings

Sometimes I like to jot down ideas and drop them here.