<p>If you've been following along, you know the story: I set up a <a href="why-i-put-my-vms-on-a-zfs-mirror.html">ZFS mirror for my Proxmox VMs</a>, then one of the drives <a href="a-degraded-pool-with-a-healthy-disk.html">started acting flaky</a>, and I <a href="fixing-a-degraded-zfs-mirror.html">diagnosed and fixed what turned out to be a bad SATA connection</a>.</p>
<p>Well, the connection wasn't the whole story. A few weeks after that fix, the same drive, AGAPITO1, started dropping off again. Same symptoms: link resets, speed downgrades, kernel giving up on the connection. I went through the cable swap dance again, tried different SATA ports on the motherboard, tried different cables. Nothing helped. The SATA PHY on the drive itself was failing.</p>
<p>I contacted PcComponentes (where I bought it), RMA'd the drive, and ran degraded on AGAPITO2 alone for about two weeks. Then the replacement arrived. This article covers the process of physically installing a new drive and getting it into the ZFS mirror, from "box on the desk" to "pool healthy, mirror whole."</p>
<p>The pool was <code>DEGRADED</code> with one drive <code>REMOVED</code>. The old drive (WX11TN0Z) was physically gone, shipped back to PcComponentes, and AGAPITO2 (WX11TN2P) was holding down the fort alone.</p>
<p>This is the beauty and the terror of a degraded mirror: everything works fine. Your VMs keep running, your data is intact, reads and writes happen normally. But you have zero redundancy. If that surviving drive has a bad day, you lose everything. Two weeks of running like this was two weeks of hoping AGAPITO2 stayed healthy.</p>
<h3>Before you touch hardware</h3>
<p>Before doing anything physical, I wanted to capture the current state. When things go wrong during maintenance, you want to be able to compare "before" and "after."</p>
<p>Three things to record while the server is still running:</p>
<p><strong>Pool status</strong>, the <code>zpool status</code> output above. You want to know exactly what ZFS thinks the world looks like right now.</p>
<p><strong>SATA layout</strong>, which drive is on which port:</p>
<pre><code>dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up'</code></pre>
<p>In my case, AGAPITO2 was on ata4 and ata3 was empty (the old drive's port). This matters because after you install the new drive, you want to confirm it shows up on the expected port.</p>
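<p><strong>SMART health</strong> of the surviving drive. The device path below is an assumption, built from the by-id naming pattern used for the other drives in this article; list yours with <code>ls -l /dev/disk/by-id/</code> and substitute accordingly:</p>
<pre><code># Overall health self-assessment of the surviving drive (AGAPITO2).
# Path assumed from the by-id pattern used elsewhere in this article;
# confirm yours with: ls -l /dev/disk/by-id/ | grep ata
smartctl -H /dev/disk/by-id/ata-ST4000NT001-3M2101_WX11TN2P</code></pre>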
<p>If this says anything other than <code>PASSED</code>, stop and deal with that first. You don't want to discover your only remaining copy of data is on a failing drive while you're in the middle of hardware work.</p>
<p>Once you've got your reference snapshots, shut down the server gracefully:</p>
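<p>For me that meant stopping the guests and powering off. The VM IDs here are placeholders, not mine; Proxmox will also shut guests down as part of a host shutdown, but I prefer doing it explicitly:</p>
<pre><code># Cleanly stop each VM first (IDs are examples; use your own),
# then power off the host
qm shutdown 100
qm shutdown 101
shutdown -h now</code></pre>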
<p>I won't write a hardware installation tutorial, every case and drive bay is different. But a few practical tips for homelabbers doing this for the first time:</p>
<ul>
<li><strong>Inspect your cables before connecting them.</strong> If the SATA data cable has been sitting disconnected in the case, check the connector pins. Bent pins or dust can cause exactly the kind of intermittent issues that started this whole saga.</li>
<li><strong>Label the new drive.</strong> I labeled mine "TOMMY" with its serial number (WX120LHQ) written on a sticker. Yes, I name my drives. It makes debugging much easier than squinting at serial numbers.</li>
<li><strong>Push connectors until they click.</strong> Both SATA data and power. Then do the wiggle test: grab the connector gently and try to move it. If it shifts at all, it's not fully seated.</li>
</ul>
<p>Seat the drive, connect both cables, close the case, and power on.</p>
<h3>Boot and verify detection</h3>
<p>First thing after boot: did the kernel see the new drive?</p>
<pre><code>dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up'</code></pre>
<pre><code>[Fri Feb 20 22:57:06 2026] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Fri Feb 20 22:57:06 2026] ata3.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
[Fri Feb 20 22:57:07 2026] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Fri Feb 20 22:57:07 2026] ata4.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133</code></pre>
<p>I saw <code>ata1: SATA link down</code> and <code>ata2: SATA link down</code>, which are just unused ports, while ata3 and ata4 both came up at 6.0 Gbps: the new drive on ata3 and AGAPITO2 on ata4, with no errors on either. If you see errors on the port your new drive is on, <strong>stop</strong>. A brand new drive throwing SATA errors on a known-good cable is likely dead on arrival.</p>
<p>A drive can be detected and still be dead on arrival. Before resilvering 1.3 terabytes of data onto it, I wanted to know it was actually healthy.</p>
<pre><code>smartctl -A /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC'</code></pre>
<pre><code> 5 Reallocated_Sector_Ct ... - 0
197 Current_Pending_Sector ... - 0
198 Offline_Uncorrectable ... - 0
199 UDMA_CRC_Error_Count ... - 0</code></pre>
<p>All zeros. Reallocated sectors would mean the drive has already had to remap bad spots. Pending sectors are blocks the drive suspects are bad but hasn't confirmed yet. CRC errors indicate data corruption during transfer. On a new or refurbished drive, all of these should be zero.</p>
<p><strong>Short self-test:</strong></p>
<pre><code>smartctl -t short /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ</code></pre>
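<p>The short test takes a couple of minutes; <code>smartctl -l selftest</code> shows the result. Once it completed without error, the replacement itself was a single command. <code>OLD_DRIVE_ID</code> below is a placeholder for however the removed drive appears in your <code>zpool status</code> output (often a bare GUID once the device is physically gone):</p>
<pre><code># Check the self-test log once the short test has finished
smartctl -l selftest /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ

# Swap the REMOVED drive for the new one. OLD_DRIVE_ID is whatever
# identifier zpool status shows for the missing drive (often a GUID).
zpool replace proxmox-tank-1 OLD_DRIVE_ID \
    /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ</code></pre>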
<p>This tells ZFS "the drive identified as WX11TN0Z (currently <code>REMOVED</code>) is being replaced by WX120LHQ." ZFS starts resilvering immediately, copying all data from the surviving drive (AGAPITO2) onto the new one (TOMMY).</p>
<p>Notice the <code>replacing-0</code> vdev. That's a temporary structure ZFS creates during the replacement, showing both the old (<code>REMOVED</code>) and new (<code>ONLINE</code>) drive while the resilver is in progress.</p>
<p>The 7.73K cksum count on the new drive might look alarming, but it's expected during a resilver. Those are blocks that haven't been written yet. ZFS is aware of them and they'll clear up as the resilver progresses.</p>
<p>Then I settled in to watch the resilver:</p>
<pre><code>watch -n 30 "zpool status -v proxmox-tank-1"</code></pre>
<p>I also kept <code>dmesg -Tw</code> running in another terminal, watching for any SATA errors. The kernel log stayed quiet the entire time.</p>
<p>In my case, the VMs had auto-started on boot, so the resilver was competing with production I/O. It completed in about 3.5 hours: 1.34 terabytes resilvered with 0 errors. Not bad for a pair of 4TB IronWolf drives running alongside active workloads.</p>
<h3>Post-resilver verification</h3>
<p>The resilver finished. Time to verify everything is actually good.</p>
<p><strong>Pool status:</strong></p>
<pre><code> pool: proxmox-tank-1
state: ONLINE
  scan: resilvered 1.34T in 03:32:55 with 0 errors on Sat Feb 21 02:43:53 2026</code></pre>
<p><code>ONLINE</code>. The <code>replacing-0</code> vdev is gone and the mirror now has the new drive in place. The 7.73K cksum on TOMMY is a residual counter from the resilver, so let's clear it:</p>
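<p>This resets the per-device error counters without touching any data:</p>
<pre><code># Reset read/write/cksum error counters on all devices in the pool
zpool clear proxmox-tank-1</code></pre>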
<p>Now for the real test. A resilver copies data to rebuild the mirror, but a <strong>scrub</strong> reads every block on the pool, verifies all checksums, and repairs any mismatches. This is the definitive integrity check:</p>
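<p>Kicking one off and checking on it:</p>
<pre><code># Start a full-pool scrub, then poll its progress
zpool scrub proxmox-tank-1
zpool status proxmox-tank-1   # the "scan:" line reports scrub progress and result</code></pre>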
<p>Zero bytes repaired, zero errors, both drives at 0/0/0. Clean.</p>
<p>One last thing: a post-I/O SMART check on the new drive. After hours of heavy writes during the resilver and reads during the scrub, any hardware weakness should have surfaced:</p>
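<p>Same attribute check as before the resilver:</p>
<pre><code># The critical attributes should all still be zero after hours of heavy I/O
smartctl -A /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ \
    | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC'</code></pre>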
<p>The mirror degradation that started on February 8th is resolved. Two weeks of running on a single drive, an RMA, and one evening of work later, the pool is whole again. Full redundancy restored, zero data lost throughout the entire saga. ZFS did exactly what it was designed to do.</p>
<p><em>This is the fourth and final article in this series. If you're just arriving, start with <a href="why-i-put-my-vms-on-a-zfs-mirror.html">Part 1: Why I Put My VMs on a ZFS Mirror</a>, then <a href="a-degraded-pool-with-a-healthy-disk.html">Part 2: A Degraded Pool with a Healthy Disk</a>, and <a href="fixing-a-degraded-zfs-mirror.html">Part 3: Fixing a Degraded ZFS Mirror</a>.</em></p>