homelab/20260208_second_zfs_degradation.md

Second ZFS Degradation

Continuation of 20251230_first_zfs_degradation.md.

Problem

On 2026-02-04, a few days after the first incident was resolved, AGAPITO1 (ata-ST4000NT001-3M2101_WX11TN0Z) appeared as REMOVED in the ZFS pool again.

I opened the case and inspected the connections. Nothing looked visibly wrong: all cables appeared properly seated.

The cable swap experiment

To narrow down whether the issue was the disk itself or the cable/port, I devised a swap test: I unplugged both the power and data cables from AGAPITO1 and AGAPITO2, and swapped them. After the swap, both drives were connected to the motherboard, but each one now used the other's previous data cable, power cable, and motherboard SATA port.

After booting, everything looked good. A scrub turned up only one checksum error, on AGAPITO1, and the pool returned to healthy.
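The scrub-and-verify cycle used here can be sketched as a small runbook. The pool name is taken from this document; the commands are printed as a reference rather than executed, since they act on the live pool:

```shell
# Printed as a reference runbook (these commands touch the live pool);
# run the steps interactively on the host, one at a time.
cat <<'EOF' | tee /tmp/scrub_runbook.txt
zpool scrub proxmox-tank-1        # read and verify every block in the mirror
zpool status -v proxmox-tank-1    # watch progress; the CKSUM column counts errors
zpool clear proxmox-tank-1        # reset the error counters once satisfied
EOF
```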

At boot, the new mapping was:

  • ata3 → AGAPITO1 (WX11TN0Z) — previously on ata4
  • ata4 → AGAPITO2 (WX11TN2P) — previously on ata3

Second failure

The pool stayed healthy for a few days, but on 2026-02-08 it degraded again:

counterweight@nodito:~$ zpool status -v proxmox-tank-1
  pool: proxmox-tank-1
 state: DEGRADED
status: One or more devices have been removed.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 0B in 00:02:20 with 0 errors on Sun Feb  8 01:49:13 2026
config:

        NAME                                 STATE     READ WRITE CKSUM
        proxmox-tank-1                       DEGRADED     0     0     0
          mirror-0                           DEGRADED     0     0     0
            ata-ST4000NT001-3M2101_WX11TN0Z  REMOVED      0     0     0
            ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0

The disk was no longer visible to the OS at all — ls /dev/disk/by-id/ | grep WX11TN0Z returned nothing, and lsblk did not show it. SMART queries were not possible.
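The visibility checks can be scripted so the absence is explicit. A minimal sketch, using the serial from this document; on a system where the drive is healthy, each command prints the matching device instead of the fallback message:

```shell
# Confirm whether the kernel still sees the drive (serial WX11TN0Z from this doc).
# If neither source matches, the fallback message is printed instead.
ls /dev/disk/by-id/ 2>/dev/null | grep WX11TN0Z || echo "WX11TN0Z: no by-id entry"
lsblk -o NAME,SERIAL 2>/dev/null | grep WX11TN0Z || echo "WX11TN0Z: not in lsblk"
```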

Diagnosis

Kernel logs

The kernel logs tell the full story. The failure sequence on ata3 is virtually identical to the first incident's failure on ata4:

  1. At 01:42:37, an interface fatal error on ata3 with failed WRITE commands:

    ata3.00: exception Emask 0x50 SAct 0xc40 SErr 0xe0802 action 0x6 frozen
    ata3.00: irq_stat 0x08000000, interface fatal error
    ata3: SError: { RecovComm HostInt PHYInt CommWake 10B8B }
    ata3.00: failed command: WRITE FPDMA QUEUED
    
  2. Link goes down, kernel tries hard resets. The familiar speed downgrade cascade begins:

    ata3: SATA link down (SStatus 0 SControl 300)
    ata3: hard resetting link
    ata3: link is slow to respond, please be patient (ready=0)
    ata3: found unknown device (class 0)
    ata3.00: qc timeout after 5000 msecs (cmd 0xec)
    ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
    ata3: limiting SATA link speed to 3.0 Gbps
    
  3. The disk briefly comes back at each speed level, fails immediately with another interface fatal error, and gets downgraded again. 6.0 Gbps → 3.0 Gbps → 1.5 Gbps.

  4. At 01:45:31, the disk detaches (ata3.00: detaching (SCSI 2:0:0:0)), leaving a trail of I/O errors on /dev/sda. It reappears briefly, fails again.

  5. By 01:46:33, the disk detaches a second time. It comes back momentarily, detaches again at 01:48:12 (now as /dev/sdc after device name reassignment). Massive I/O errors on both reads and writes.

  6. From 01:48:19 onward, the kernel enters an endless reset loop:

    ata3: hardreset failed
    ata3: reset failed, giving up
    ata3: reset failed (errno=-32), retrying in 8 secs
    ata3: limiting SATA link speed to 3.0 Gbps
    ata3: reset failed (errno=-32), retrying in 8 secs
    ata3: limiting SATA link speed to 1.5 Gbps
    ata3: reset failed (errno=-32), retrying in 33 secs
    ata3: hardreset failed
    ata3: reset failed, giving up
    

    This loop continued for over 10 hours, right up until the dmesg output was captured: the kernel cycled through the speed downgrades, gave up, and then retried again.
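For future incidents, the reset loop can be pulled out of a saved dmesg capture with a simple filter. A minimal sketch, using an inline sample of the lines above in place of the real capture file:

```shell
# Inline sample standing in for a real saved capture (e.g. dmesg.txt).
cat > /tmp/dmesg_sample.txt <<'EOF'
ata3: reset failed (errno=-32), retrying in 8 secs
ata3: limiting SATA link speed to 3.0 Gbps
ata3: reset failed (errno=-32), retrying in 33 secs
ata3: hardreset failed
ata3: reset failed, giving up
EOF

# Count the retry events...
grep -c 'reset failed (errno=-32)' /tmp/dmesg_sample.txt   # prints 2
# ...and show the speed-downgrade steps.
grep 'limiting SATA link speed' /tmp/dmesg_sample.txt
```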

An interesting contrast: ata4

On Feb 7 at 15:26:48, there was a brief hiccup on ata4 — which is now AGAPITO2 on the swapped cable+port:

ata4.00: exception Emask 0x10 SAct 0x3f000 SErr 0x40d0000 action 0xe frozen
ata4.00: irq_stat 0x00000040, connection status changed
ata4: SError: { PHYRdyChg CommWake 10B8B DevExch }
ata4.00: failed command: READ FPDMA QUEUED
...
ata4: hard resetting link
ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata4.00: configured for UDMA/133
ata4: EH complete

A single hard reset, link came back at full speed, error handling completed. Normal operation resumed. This port+cable combo is fine — the cable and port that AGAPITO1 used to be on work perfectly for AGAPITO2.

ZFS events

The ZFS event log shows checksum errors accumulating on ata-ST4000NT001-3M2101_WX11TN0Z-part1 even before the full crash: during the initial pool import on Feb 4, right after the swap, the counter increments from vdev_cksum_errors = 0x1 through 0x4 within the same import event, and once more later. These early checksum errors were likely precursors to the full failure.
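Those counters come from zpool events -v. A hedged sketch of pulling them out of a saved capture; the field names (vdev_path, vdev_cksum_errors) follow the excerpt described above, and an inline sample stands in for the real capture file:

```shell
# Inline sample standing in for a saved `zpool events -v` capture.
cat > /tmp/zpool_events_sample.txt <<'EOF'
        vdev_path = "/dev/disk/by-id/ata-ST4000NT001-3M2101_WX11TN0Z-part1"
        vdev_cksum_errors = 0x1
        vdev_path = "/dev/disk/by-id/ata-ST4000NT001-3M2101_WX11TN0Z-part1"
        vdev_cksum_errors = 0x4
EOF

# Show how the checksum-error counter evolves across events.
grep 'vdev_cksum_errors' /tmp/zpool_events_sample.txt
```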

Conclusion

The cable swap was the decisive experiment. The key facts:

                     First incident (Dec 30)                    Second incident (Feb 8)
    Failing disk     AGAPITO1 (WX11TN0Z)                        AGAPITO1 (WX11TN0Z)
    ATA port         ata4                                       ata3
    Data cable       cable A                                    cable B
    Power cable      power A                                    power B
    Failure pattern  interface fatal, speed downgrades, detach  interface fatal, speed downgrades, detach

The problem followed the disk across different cables, different power connections, and different motherboard SATA ports. Cables and ports are exonerated.

AGAPITO1 has a failing SATA PHY — the physical-layer signaling electronics on the drive that handle the SATA link are intermittently dying. This is exactly the kind of failure that SMART doesn't catch: SMART monitors media integrity (sectors, heads, motor) and firmware/controller logic, not the SATA interface circuitry. As noted in the first incident document, SATA drives are opaque about this kind of issue — they don't "confess" PHY-layer problems the way SAS drives would.

The 992 checksum errors from the first scrub after the initial reseat were likely early symptoms of this progressive hardware failure.

This disk needs to be replaced.

Open questions

  • How do I replace the disk within the ZFS mirror without data loss? (i.e., zpool replace workflow)
  • Should the replacement be another SATA drive, or is this a good opportunity to move to SAS? (Would require a SAS controller since SAS drives don't fit regular SATA ports.)
  • The monitoring question from the first incident remains open: how to automate ZFS pool health monitoring to catch degradations early.
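For the first open question, one hedged sketch of the mirror-replace workflow, assuming the replacement drive appears under /dev/disk/by-id after installation (<new-disk-id> is a placeholder, not a real device name). Printed as a reference runbook rather than executed:

```shell
# Reference runbook only; pool and disk names are from this document,
# <new-disk-id> is a placeholder for the replacement drive's by-id name.
cat <<'EOF' | tee /tmp/replace_runbook.txt
# 1. (Optional) take the failed disk offline; here it is already REMOVED:
zpool offline proxmox-tank-1 ata-ST4000NT001-3M2101_WX11TN0Z
# 2. Power down, install the replacement drive, boot.
# 3. Replace within the mirror; ZFS resilvers from the healthy side (WX11TN2P):
zpool replace proxmox-tank-1 ata-ST4000NT001-3M2101_WX11TN0Z /dev/disk/by-id/<new-disk-id>
# 4. Watch the resilver and wait for the pool to return to ONLINE:
zpool status -v proxmox-tank-1
EOF
```

Because the pool is a two-way mirror with a healthy side, the resilver reads entirely from WX11TN2P and no data should be lost as long as that drive stays online.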