# Second ZFS Degradation
Continuation of [20251230_first_zfs_degradation.md](20251230_first_zfs_degradation.md).
## Problem
On 2026-02-04, a few days after the first incident was resolved, AGAPITO1 (`ata-ST4000NT001-3M2101_WX11TN0Z`) appeared as REMOVED in the ZFS pool again.
I opened the case and inspected it. Nothing looked visibly wrong — all cables appeared properly seated.
## The cable swap experiment
To narrow down whether the issue was the disk itself or the cable/port, I devised a swap test: I unplugged both the power and data cables from AGAPITO1 and AGAPITO2, and swapped them. After the swap, both drives were connected to the motherboard, but each one now used the other's previous data cable, power cable, and motherboard SATA port.
After booting, everything looked good. A scrub surfaced only 1 checksum error on AGAPITO1, and the pool returned to healthy.
At boot, the new mapping was:
- ata3 → AGAPITO1 (WX11TN0Z) — previously on ata4
- ata4 → AGAPITO2 (WX11TN2P) — previously on ata3
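
A hedged way to double-check this mapping on a running Linux system: the resolved sysfs path of each block device embeds its kernel ata port number, and the `/dev/disk/by-id` symlinks carry the serials. A sketch (device and port names will differ per machine):

```shell
# For each SATA disk, the resolved sysfs path contains the kernel
# ata port (e.g. .../ata3/host2/...). Print device -> port pairs.
for d in /sys/block/sd?; do
  printf '%s -> %s\n' "${d##*/}" "$(readlink -f "$d" | grep -Eo 'ata[0-9]+' | head -n1)"
done

# The by-id symlinks carry the serial, tying each sdX back to a drive.
ls -l /dev/disk/by-id/ata-*
```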
## Second failure
The pool stayed healthy for a few days, but on 2026-02-08 it degraded again:
```
counterweight@nodito:~$ zpool status -v proxmox-tank-1
  pool: proxmox-tank-1
 state: DEGRADED
status: One or more devices have been removed.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 0B in 00:02:20 with 0 errors on Sun Feb 8 01:49:13 2026
config:

        NAME                                 STATE     READ WRITE CKSUM
        proxmox-tank-1                       DEGRADED     0     0     0
          mirror-0                           DEGRADED     0     0     0
            ata-ST4000NT001-3M2101_WX11TN0Z  REMOVED      0     0     0
            ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
```
The disk was no longer visible to the OS at all — `ls /dev/disk/by-id/ | grep WX11TN0Z` returned nothing, and `lsblk` did not show it. SMART queries were not possible.
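
With the link itself dead a rescan is unlikely to help, but it cheaply rules out a stale device node. A sketch using the standard SCSI-host sysfs scan interface (host numbers vary per machine; requires root):

```shell
# Ask every SCSI/SATA host to rescan its bus;
# '- - -' means "all channels, all targets, all LUNs".
for scan in /sys/class/scsi_host/host*/scan; do
  echo '- - -' > "$scan"
done

# Then check whether the device node reappeared.
ls /dev/disk/by-id/ | grep WX11TN0Z || echo "still not visible"
```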
## Diagnosis
### Kernel logs
The kernel logs tell the full story. The failure sequence on `ata3` is virtually identical to the first incident's failure on `ata4`:
1. At `01:42:37`, an interface fatal error on ata3 with failed WRITE commands:
```
ata3.00: exception Emask 0x50 SAct 0xc40 SErr 0xe0802 action 0x6 frozen
ata3.00: irq_stat 0x08000000, interface fatal error
ata3: SError: { RecovComm HostInt PHYInt CommWake 10B8B }
ata3.00: failed command: WRITE FPDMA QUEUED
```
2. Link goes down, kernel tries hard resets. The familiar speed downgrade cascade begins:
```
ata3: SATA link down (SStatus 0 SControl 300)
ata3: hard resetting link
ata3: link is slow to respond, please be patient (ready=0)
ata3: found unknown device (class 0)
ata3.00: qc timeout after 5000 msecs (cmd 0xec)
ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata3: limiting SATA link speed to 3.0 Gbps
```
3. The disk briefly comes back at each speed level, fails immediately with another interface fatal error, and gets downgraded again. 6.0 Gbps → 3.0 Gbps → 1.5 Gbps.
4. At `01:45:31`, the disk detaches (`ata3.00: detaching (SCSI 2:0:0:0)`), leaving a trail of I/O errors on `/dev/sda`. It reappears briefly, fails again.
5. By `01:46:33`, the disk detaches a second time. It comes back momentarily, detaches again at `01:48:12` (now as `/dev/sdc` after device name reassignment). Massive I/O errors on both reads and writes.
6. From `01:48:19` onward, the kernel enters an endless reset loop:
```
ata3: hardreset failed
ata3: reset failed, giving up
ata3: reset failed (errno=-32), retrying in 8 secs
ata3: limiting SATA link speed to 3.0 Gbps
ata3: reset failed (errno=-32), retrying in 8 secs
ata3: limiting SATA link speed to 1.5 Gbps
ata3: reset failed (errno=-32), retrying in 33 secs
ata3: hardreset failed
ata3: reset failed, giving up
```
This loop was still running more than 10 hours later, when the dmesg was captured: cycling through speed downgrades, giving up, then retrying.
### An interesting contrast: ata4
On Feb 7 at 15:26:48, there was a brief hiccup on ata4 — which is now AGAPITO2 on the swapped cable+port:
```
ata4.00: exception Emask 0x10 SAct 0x3f000 SErr 0x40d0000 action 0xe frozen
ata4.00: irq_stat 0x00000040, connection status changed
ata4: SError: { PHYRdyChg CommWake 10B8B DevExch }
ata4.00: failed command: READ FPDMA QUEUED
...
ata4: hard resetting link
ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata4.00: configured for UDMA/133
ata4: EH complete
```
A single hard reset, link came back at full speed, error handling completed. Normal operation resumed. This port+cable combo is fine — the cable and port that AGAPITO1 used to be on work perfectly for AGAPITO2.
### ZFS events
The ZFS events log shows checksum errors accumulating on `ata-ST4000NT001-3M2101_WX11TN0Z-part1` even before the full crash, during the initial pool import on Feb 4 after the swap. The counter increments from `vdev_cksum_errors = 0x1` through `0x4` during the same import event, and one more later. These early checksum errors were likely precursors to the full failure.
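
These counters come out of the event log; a hedged one-liner to pull them (the field names follow the OpenZFS ereport payload, and the exact output format can vary by version):

```shell
# Dump verbose ZFS events and keep the vdev path plus its
# checksum-error counter from each checksum ereport.
zpool events -v | grep -E 'vdev_path|vdev_cksum_errors'
```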
## Conclusion
The cable swap was the decisive experiment. The key facts:
| | First incident (Dec 30) | Second incident (Feb 8) |
|---|---|---|
| **Failing disk** | AGAPITO1 (WX11TN0Z) | AGAPITO1 (WX11TN0Z) |
| **ATA port** | ata4 | ata3 |
| **Data cable** | cable A | cable B |
| **Power cable** | power A | power B |
| **Failure pattern** | Interface fatal, speed downgrades, detach | Interface fatal, speed downgrades, detach |
The problem followed the disk across different cables, different power connections, and different motherboard SATA ports. Cables and ports are exonerated.
AGAPITO1 has a failing SATA PHY — the physical-layer signaling electronics on the drive that handle the SATA link are intermittently dying. This is exactly the kind of failure that SMART doesn't catch: SMART monitors media integrity (sectors, heads, motor) and firmware/controller logic, not the SATA interface circuitry. As noted in the first incident document, SATA drives are opaque about this kind of issue — they don't "confess" PHY-layer problems the way SAS drives would.
The 992 checksum errors from the first scrub after the initial reseat were likely early symptoms of this progressive hardware failure.
**This disk needs to be replaced.**
## Open questions
- How do I replace the disk within the ZFS mirror without data loss? (i.e., `zpool replace` workflow)
- Should the replacement be another SATA drive, or is this a good opportunity to move to SAS? (Would require a SAS controller since SAS drives don't fit regular SATA ports.)
- The monitoring question from the first incident remains open: how to automate ZFS pool health monitoring to catch degradations early.
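
For the first question, a likely shape of the workflow, sketched from the standard OpenZFS mirror-replacement procedure (the new drive's by-id name is a placeholder; verify against the `zpool-replace` man page before running anything):

```shell
# 1. Mark the failed disk offline (it may already show as REMOVED).
zpool offline proxmox-tank-1 ata-ST4000NT001-3M2101_WX11TN0Z

# 2. Physically install the new drive, then resilver onto it.
#    NEW-DISK-ID is a placeholder for the new drive's /dev/disk/by-id name.
zpool replace proxmox-tank-1 ata-ST4000NT001-3M2101_WX11TN0Z /dev/disk/by-id/NEW-DISK-ID

# 3. Watch the resilver until mirror-0 is back to ONLINE.
zpool status -v proxmox-tank-1
```

On the monitoring question, `zpool status -x` prints `all pools are healthy` when nothing is wrong, which makes it straightforward to run from cron and alert on any other output.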