diff --git a/public/writings/replacing-a-failed-disk-in-a-zfs-mirror.html b/public/writings/replacing-a-failed-disk-in-a-zfs-mirror.html
index 01aadb3..238c640 100644
--- a/public/writings/replacing-a-failed-disk-in-a-zfs-mirror.html
+++ b/public/writings/replacing-a-failed-disk-in-a-zfs-mirror.html
@@ -18,8 +18,8 @@

Replacing a Failed Disk in a ZFS Mirror

If you've been following along, you know the story: I set up a ZFS mirror for my Proxmox VMs, then one of the drives started acting flaky, and I diagnosed and fixed what turned out to be a bad SATA connection.

-Well, the connection wasn't the whole story. A few weeks after that fix, the same drive — AGAPITO1 — started dropping off again. Same symptoms: link resets, speed downgrades, kernel giving up on the connection. I went through the cable swap dance again, tried different SATA ports on the motherboard, tried different cables. Nothing helped. The SATA PHY on the drive itself was failing.

-I contacted Seagate, RMA'd the drive, and ran degraded on AGAPITO2 alone for about two weeks. Then the replacement arrived. This article covers the process of physically installing a new drive and getting it into the ZFS mirror — from "box on the desk" to "pool healthy, mirror whole."

+Well, the connection wasn't the whole story. A few weeks after that fix, the same drive, AGAPITO1, started dropping off again. Same symptoms: link resets, speed downgrades, kernel giving up on the connection. I went through the cable swap dance again, tried different SATA ports on the motherboard, tried different cables. Nothing helped. The SATA PHY on the drive itself was failing.

+I contacted PcComponentes (where I bought it), RMA'd the drive, and ran degraded on AGAPITO2 alone for about two weeks. Then the replacement arrived. This article covers the process of physically installing a new drive and getting it into the ZFS mirror, from "box on the desk" to "pool healthy, mirror whole."

The starting point

Before doing anything, this is what the pool looked like:

@@ -40,26 +40,25 @@ config:
 ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
 errors: No known data errors
-DEGRADED with one drive REMOVED. The old drive (WX11TN0Z) was physically gone — shipped back to Seagate. AGAPITO2 (WX11TN2P) was holding down the fort alone.

+DEGRADED with one drive REMOVED. The old drive (WX11TN0Z) was physically gone, shipped back to PcComponentes. AGAPITO2 (WX11TN2P) was holding down the fort alone.

This is the beauty and the terror of a degraded mirror: everything works fine. Your VMs keep running, your data is intact, reads and writes happen normally. But you have zero redundancy. If that surviving drive has a bad day, you lose everything. Two weeks of running like this was two weeks of hoping AGAPITO2 stayed healthy.
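Two weeks is a long time to run on hope alone, so it's worth checking the survivor's error counters regularly during the wait. A minimal sketch of the idea, pulling the READ/WRITE/CKSUM columns out of a `zpool status` device line; the canned sample line below stands in for the real output, so on the actual server you would pipe `zpool status` in instead:

```shell
# Hypothetical degraded-pool check: sum the READ/WRITE/CKSUM columns for
# the surviving drive. The sample line is a stand-in for real `zpool
# status` output; swap the echo for the real command in a cron job.
sample='    ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0'
errors=$(echo "$sample" | awk '/WX11TN2P/ {print $3 + $4 + $5}')
if [ "${errors:-0}" -eq 0 ]; then
    echo "survivor clean"
else
    echo "survivor reporting $errors errors" >&2
fi
```

Any nonzero count on the only remaining disk means the window for an orderly replacement is closing.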

Before you touch hardware

Before doing anything physical, I wanted to capture the current state. When things go wrong during maintenance, you want to be able to compare "before" and "after."

Three things to record while the server is still running:

-Pool status — the zpool status output above. You want to know exactly what ZFS thinks the world looks like right now.

-SATA layout — which drive is on which port:

+Pool status, the zpool status output above. You want to know exactly what ZFS thinks the world looks like right now.

+SATA layout, which drive is on which port:

dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up'

In my case, AGAPITO2 was on ata4 and ata3 was empty (the old drive's port). This matters because after you install the new drive, you want to confirm it shows up on the expected port.
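That per-port check can be reduced to a quick map of which ports have a link. A sketch of the extraction, fed two canned dmesg lines here (the timestamps and SStatus details are fabricated stand-ins) so the pattern is visible without a live server:

```shell
# Sketch: boil dmesg output down to "which ata port has a SATA link".
# The sample lines are fabricated; on the server, replace the echo with
# the `dmesg -T` pipeline shown above.
sample='[Mon Jan  1 10:00:00 2025] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Mon Jan  1 10:00:01 2025] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)'
ports=$(echo "$sample" | grep -oE 'ata[0-9]+: SATA link up [0-9.]+ Gbps' | sort)
echo "$ports"
```

After installing the replacement, rerunning this should show the previously empty port (ata3 in my case) with a link again.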

-Surviving drive health — make sure the drive you're depending on is actually healthy before you start:

+Surviving drive health, to make sure the drive you're depending on is actually healthy before you start:

smartctl -H /dev/disk/by-id/ata-ST4000NT001-3M2101_WX11TN2P
SMART overall-health self-assessment test result: PASSED
-If this says anything other than PASSED, stop and deal with that first. You don't want to discover your only remaining copy of data is on a failing drive while you're in the middle of hardware work.

-Once you've got your reference snapshots, shut down gracefully. Stop your VMs first, then power off the server. You want a clean shutdown, not a yank-the-plug situation.

-qm shutdown <VMID>   # for each running VM
-shutdown -h now
+If this says anything other than PASSED, stop and deal with that first. You don't want to discover your only remaining copy of data is on a failing drive while you're in the middle of hardware work.
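That PASSED line is easy to gate on if you script the pre-flight checks. A sketch against a canned copy of the smartctl line shown above; on the real box you would pipe `smartctl -H <device>` in instead of the echo:

```shell
# Sketch: abort maintenance unless the surviving drive's SMART
# self-assessment is PASSED. The canned line mirrors the smartctl
# output above; feed it the real command's output in practice.
smart='SMART overall-health self-assessment test result: PASSED'
if echo "$smart" | grep -q 'test result: PASSED'; then
    verdict="safe to proceed"
else
    verdict="STOP: surviving drive is not healthy"
fi
echo "$verdict"
```

Matching the output text keeps the sketch simple; smartctl also encodes failure conditions in its exit status, which a more thorough script could inspect instead.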

+Once you've got your reference snapshots, shut down the server gracefully:

+shutdown -h now

Physical installation

-I won't write a hardware installation tutorial — every case and drive bay is different. But a few practical tips for homelabbers doing this for the first time:

+I won't write a hardware installation tutorial, every case and drive bay is different. But a few practical tips for homelabbers doing this for the first time: