October 12, 2012 – Energiequant

I have yet to figure out why, but today one of my hard drives hit multiple CRC errors and went offline. It is part of a software (mdraid) RAID 1 and I had seen such errors before, so I did the usual procedure: Shutdown and power off for a few minutes, check that all drives come up on boot, boot to a rescue system, stop RAIDs, run a long self-test on the lost drive and then resync the RAID by running mdadm --add using the drive that remained online as source. Sounds okay? Too bad I had errors on the source drive…

When the first RAID partition’s resync neared completion around 95%, it suddenly stopped and marked the target drive as spare again. I wondered what happened and looked at the kernel log which told me that there have been many failed retries to read a particular sector which made the resync impossible to complete. Plus, S.M.A.R.T. now listed one “Current Pending Sector”.

What could I have done to avoid this?

First of all, I should have run check/repair more regularly. If a check is being run, uncorrectable read errors can be noticed and the failed sectors can be re-written from a good disk. However, check/repair can be a rather lengthy task which means it is not very suited to desktop computers that are not running 24/7.

Another option could have been to use --re-add instead of --add, which might have synced only recently modified sectors, thus skipping the bad one I hit on the full resync caused by --add. However, since I had my system in use about one hour before I noticed the emails indicating RAID failure, I doubt this could have helped much. Plus, it would likely have been too late to run that after a resync has already tried and failed as the data on the lost disk was already partially overwritten.

What I did to work around the issue

WARNING!

The following steps can cause irreparable damage to your data. Only continue if you fully understand what you are doing and you either have a working backup or can avoid loosing the data. This post is omitting information you should know if you are going to follow these steps. Please make your own mind about them before running any commands.

THE AUTHOR WILL NOT BE HELD LIABLE FOR ANY DAMAGE CAUSED BY FOLLOWING THE BELOW STEPS. FOLLOWING THIS BLOG POST IS ON YOUR OWN RISK.

NOTE: The failed sector will be called 12345 from now on. The broken sector resides on /dev/sdb and the drive that went offline and has a partial resync but good sector is /dev/sdc.

A quick search turned up some helpful sites ([1], [2]). First, I verified that I had the correct sector address by running hdparm --read-sector 12345 /dev/sdb, which returned an I/O error just as expected. I then checked the sectors immediately before and after the failed one. I was lucky to find a strangely uniform pattern that simply counted up – I’m not sure if that is some feature of either ext4 or mdraid or just random luck. I tried to ask debugfs what’s stored there (could have been free space, as in [2]) but I wasn’t sure if I had the correct ext4 block number, so I didn’t give anything on that information.

Since this is a RAID 1, I thought, maybe I could just selectively copy the sector over from /dev/sdc. What I needed to do was to get the correct sector address for sdc and then run some dd command. Since the partition layouts differ on sdb and sdc, the sector numbers don’t match 1:1 and have to be calculated. I ran parted and set unit s to get sector addresses, then ran print to get the partition tables of both disks. All that has to be done is subtracting the start offset from the failed sector address, then add the other partition’s offset again. Let’s say the address was 23456.

Since I knew what the sector should look like, I could verify it directly with hdparm. Additionally, I checked a few sectors below and above that address and the data matched perfectly.

Next, I had to assemble a dd command to display and then copy the sector. Using dd if=/dev/sdc bs=512 count=1 skip=23456 | hexdump (bs should match the logical sector size) and comparing it to the hdparm output, I could verify that I read the correct sector. I also tried a few sectors above/below again. When I was ready, I finally copied the sector: dd if=/dev/sdGOOD of=/dev/sdBAD bs=512 count=1 skip=23456 seek=12345 oflag=direct (replace sdGOOD and sdBAD by the actual drives – just making this post copy&paste-proof :) ) oflag=direct is required or you will likely get an I/O error.

To be sure that everything went fine, I checked the result with hdparm again. After restarting the RAID, the resync ran fine this time.

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

Day: October 12, 2012

Patching a non-syncing RAID 1

What could I have done to avoid this?

What I did to work around the issue