<p>If you've been following along, you know the story: I set up a <a href="why-i-put-my-vms-on-a-zfs-mirror.html">ZFS mirror for my Proxmox VMs</a>, then one of the drives <a href="a-degraded-pool-with-a-healthy-disk.html">started acting flaky</a>, and I <a href="fixing-a-degraded-zfs-mirror.html">diagnosed and fixed what turned out to be a bad SATA connection</a>.</p>
<p>Well, the connection wasn't the whole story. A few weeks after that fix, the same drive, AGAPITO1, started dropping off again. Same symptoms: link resets, speed downgrades, kernel giving up on the connection. I went through the cable swap dance again, tried different SATA ports on the motherboard, tried different cables. Nothing helped. The SATA PHY on the drive itself was failing.</p>
<p>I contacted PcComponentes (where I bought it), RMA'd the drive, and ran degraded on AGAPITO2 alone for about two weeks. Then the replacement arrived. This article covers the process of physically installing a new drive and getting it into the ZFS mirror, from "box on the desk" to "pool healthy, mirror whole."</p>
<p>The pool was <code>DEGRADED</code> with one drive <code>REMOVED</code>. The old drive (WX11TN0Z) was physically gone, shipped back to PcComponentes, and AGAPITO2 (WX11TN2P) was holding down the fort alone.</p>
<p>This is the beauty and the terror of a degraded mirror: everything works fine. Your VMs keep running, your data is intact, reads and writes happen normally. But you have zero redundancy. If that surviving drive has a bad day, you lose everything. Two weeks of running like this was two weeks of hoping AGAPITO2 stayed healthy.</p>
<h3>Before you touch hardware</h3>
<p>Before doing anything physical, I wanted to capture the current state. When things go wrong during maintenance, you want to be able to compare "before" and "after."</p>
<p>Three things to record while the server is still running:</p>
<p><strong>Pool status</strong>, the <code>zpool status</code> output above. You want to know exactly what ZFS thinks the world looks like right now.</p>
<p><strong>SATA layout</strong>, which drive is on which port:</p>
<pre><code>dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up'</code></pre>
<p>In my case, AGAPITO2 was on ata4 and ata3 was empty (the old drive's port). This matters because after you install the new drive, you want to confirm it shows up on the expected port.</p>
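<p><strong>SMART health</strong> of the surviving drive. The device path below is an assumption, built from the by-id naming pattern used for the other drives in this article; list yours with <code>ls -l /dev/disk/by-id/</code> and substitute accordingly:</p>
<pre><code># Overall health self-assessment of the surviving drive (AGAPITO2).
# Path assumed from the by-id pattern used elsewhere in this article;
# confirm yours with: ls -l /dev/disk/by-id/ | grep ata
smartctl -H /dev/disk/by-id/ata-ST4000NT001-3M2101_WX11TN2P</code></pre>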
<p>If this says anything other than <code>PASSED</code>, stop and deal with that first. You don't want to discover your only remaining copy of data is on a failing drive while you're in the middle of hardware work.</p>
<p>Once you've got your reference snapshots, shut down the server gracefully:</p>
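<p>For me that meant stopping the guests and powering off. The VM IDs here are placeholders, not mine; Proxmox will also shut guests down as part of a host shutdown, but I prefer doing it explicitly:</p>
<pre><code># Cleanly stop each VM first (IDs are examples; use your own),
# then power off the host
qm shutdown 100
qm shutdown 101
shutdown -h now</code></pre>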
<p>I won't write a hardware installation tutorial, every case and drive bay is different. But a few practical tips for homelabbers doing this for the first time:</p>
<ul>
<li><strong>Inspect your cables before connecting them.</strong> If the SATA data cable has been sitting disconnected in the case, check the connector pins. Bent pins or dust can cause exactly the kind of intermittent issues that started this whole saga.</li>
<li><strong>Label the new drive.</strong> I labeled mine "TOMMY" with its serial number (WX120LHQ) written on a sticker. Yes, I name my drives. It makes debugging much easier than squinting at serial numbers.</li>
<li><strong>Push connectors until they click.</strong> Both SATA data and power. Then do the wiggle test: grab the connector gently and try to move it. If it shifts at all, it's not fully seated.</li>
</ul>
<p>Seat the drive, connect both cables, close the case, and power on.</p>
<h3>Boot and verify detection</h3>
<p>First thing after boot: did the kernel see the new drive?</p>
<pre><code>dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up'</code></pre>
<pre><code>[Fri Feb 20 22:57:06 2026] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Fri Feb 20 22:57:06 2026] ata3.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
[Fri Feb 20 22:57:07 2026] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Fri Feb 20 22:57:07 2026] ata4.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133</code></pre>
<p>I saw <code>ata1: SATA link down</code> and <code>ata2: SATA link down</code>, which are just unused ports, while ata3 and ata4 both came up at 6.0 Gbps: the new drive on ata3 and AGAPITO2 on ata4, with no errors on either. If you see errors on the port your new drive is on, <strong>stop</strong>. A brand new drive throwing SATA errors on a known-good cable is likely dead on arrival.</p>
<p>A drive can be detected and still be dead on arrival. Before resilvering 1.3 terabytes of data onto it, I wanted to know it was actually healthy.</p>
<pre><code>smartctl -A /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC'</code></pre>
<pre><code> 5 Reallocated_Sector_Ct ... - 0
197 Current_Pending_Sector ... - 0
198 Offline_Uncorrectable ... - 0
199 UDMA_CRC_Error_Count ... - 0</code></pre>
<p>All zeros. Reallocated sectors would mean the drive has already had to remap bad spots. Pending sectors are blocks the drive suspects are bad but hasn't confirmed yet. CRC errors indicate data corruption during transfer. On a new or refurbished drive, all of these should be zero.</p>
<p><strong>Short self-test:</strong></p>
<pre><code>smartctl -t short /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ</code></pre>
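<p>The short test takes a couple of minutes; <code>smartctl -l selftest</code> shows the result. Once it completed without error, the replacement itself was a single command. <code>OLD_DRIVE_ID</code> below is a placeholder for however the removed drive appears in your <code>zpool status</code> output (often a bare GUID once the device is physically gone):</p>
<pre><code># Check the self-test log once the short test has finished
smartctl -l selftest /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ

# Swap the REMOVED drive for the new one. OLD_DRIVE_ID is whatever
# identifier zpool status shows for the missing drive (often a GUID).
zpool replace proxmox-tank-1 OLD_DRIVE_ID \
    /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ</code></pre>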
<p>This tells ZFS "the drive identified as WX11TN0Z (currently <code>REMOVED</code>) is being replaced by WX120LHQ." ZFS starts resilvering immediately, copying all data from the surviving drive (AGAPITO2) onto the new one (TOMMY).</p>
<p>Notice the <code>replacing-0</code> vdev. That's a temporary structure ZFS creates during the replacement, showing both the old (<code>REMOVED</code>) and new (<code>ONLINE</code>) drive while the resilver is in progress.</p>
<p>The 7.73K cksum count on the new drive might look alarming, but it's expected during a resilver. Those are blocks that haven't been written yet. ZFS is aware of them and they'll clear up as the resilver progresses.</p>
<p>Then I settled in to watch the resilver:</p>
<pre><code>watch -n 30 "zpool status -v proxmox-tank-1"</code></pre>
<p>I also kept <code>dmesg -Tw</code> running in another terminal, watching for any SATA errors. The kernel log stayed quiet the entire time.</p>
<p>In my case, the VMs had auto-started on boot, so the resilver was competing with production I/O. It completed in about 3.5 hours: 1.34 terabytes resilvered with 0 errors. Not bad for a pair of 4TB IronWolf drives running alongside active workloads.</p>
<h3>Post-resilver verification</h3>
<p>The resilver finished. Time to verify everything is actually good.</p>
<p><strong>Pool status:</strong></p>
<pre><code> pool: proxmox-tank-1
state: ONLINE
  scan: resilvered 1.34T in 03:32:55 with 0 errors on Sat Feb 21 02:43:53 2026</code></pre>
<p><code>ONLINE</code>. The <code>replacing-0</code> vdev is gone and the mirror now has the new drive in place. The 7.73K cksum on TOMMY is a residual counter from the resilver, so let's clear it:</p>
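<p>This resets the per-device error counters without touching any data:</p>
<pre><code># Reset read/write/cksum error counters on all devices in the pool
zpool clear proxmox-tank-1</code></pre>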
<p>Now for the real test. A resilver copies data to rebuild the mirror, but a <strong>scrub</strong> reads every block on the pool, verifies all checksums, and repairs any mismatches. This is the definitive integrity check:</p>
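<p>Kicking one off and checking on it:</p>
<pre><code># Start a full-pool scrub, then poll its progress
zpool scrub proxmox-tank-1
zpool status proxmox-tank-1   # the "scan:" line reports scrub progress and result</code></pre>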
<p>Zero bytes repaired, zero errors, both drives at 0/0/0. Clean.</p>
<p>One last thing: a post-I/O SMART check on the new drive. After hours of heavy writes during the resilver and reads during the scrub, any hardware weakness should have surfaced:</p>
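<p>Same attribute check as before the resilver:</p>
<pre><code># The critical attributes should all still be zero after hours of heavy I/O
smartctl -A /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ \
    | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC'</code></pre>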
<p>The mirror degradation that started on February 8th is resolved. Two weeks of running on a single drive, an RMA, and one evening of work later, the pool is whole again. Full redundancy restored, zero data lost throughout the entire saga. ZFS did exactly what it was designed to do.</p>
<p><em>This is the fourth and final article in this series. If you're just arriving, start with <a href="why-i-put-my-vms-on-a-zfs-mirror.html">Part 1: Why I Put My VMs on a ZFS Mirror</a>, then <a href="a-degraded-pool-with-a-healthy-disk.html">Part 2: A Degraded Pool with a Healthy Disk</a>, and <a href="fixing-a-degraded-zfs-mirror.html">Part 3: Fixing a Degraded ZFS Mirror</a>.</em></p>