RAID Disk Failed... What Now?

#1 Tue, 05/01/2012 - 18:25
mrwilder

RAID Disk Failed... What Now?

Hi, I have a RAID setup in my machine. One of the drives failed.

There are three drives listed in the array, md0, md1, and md2.

They all have a screen that looks like the attached screenshot, except md0's says it is mounted on /boot instead of /tmp. In the second screenshot I notice the drives are different sizes and that I don't have the faintest clue what I'm doing.

I found a site that probably tells me exactly how to fix it, but I am not bright enough to understand. Can anyone give me "how to recover your array with Webmin and Virtualmin for Dummies" version? Or perhaps tell me where to find it?

Thanks!

Tue, 05/01/2012 - 20:56
helpmin

You didn't really provide any useful information for us to help you :-) Not even the screenshots you mentioned are there :-)

What distro? What kind of Raid?

Tue, 05/01/2012 - 21:18 (Reply to #2)
mrwilder

Well, it's NOT because I'm nervous. I just like to mix it up like that :^)

I've attached the images. This is a CentOS machine. Unfortunately, I'm confused about which kind of RAID it is because I am not clear which drive failed. To further demonstrate my confusion, the md0 is listed as a RAID level 1... but there's only one drive!

The other two drives are listed as RAID level 5. Their sizes do not match. The machine appears to be operating fine.

I apologize for being so ignorant.

Wed, 05/02/2012 - 00:23
mrwilder

Ok - how about this... I have a VirtualBox version of this machine I could bring up. However, I'm reluctant to bring DOWN the real machine since I'm afraid it won't come back up.

Since I don't know how to proceed and feel quite a bit of urgency, I wonder if anyone can tell me the best course of action I might take at this time...specifically, should I shut down the real machine and bring up the virtualbox image?

Thanks,

Tony

Wed, 05/02/2012 - 01:24
mrwilder

OK, sorry for posting again. Here is where I'm at trying to work through this. I have shut down the real server and brought up a VirtualBox image of it running elsewhere.

Before I shutdown the real server, I ran "cat /proc/mdstat" and got:

Personalities : [raid6] [raid5] [raid4] [raid1]
md0 : active raid1 sda1[2](F) sdb1[1] sdc1[0]
      256896 blocks [2/2] [UU]

md2 : active raid5 sdc2[2] sdb2[1] sda2[3](F)
      20482560 blocks level 5, 256k chunk, algorithm 2 [3/2] [_UU]

md1 : active raid5 sdc5[2] sdb5[1] sda5[3](F)
      198627328 blocks level 5, 256k chunk, algorithm 2 [3/2] [_UU]

unused devices: <none>

I believe this means sda has failed. sda is an IDE disk and the other two are SATA, if I recall. I am GUESSING this is what I would need to do:

  • bring the physical server back up (will it ever boot again, now that I've shut it down?)
      mdadm --manage /dev/md0 --fail /dev/sda1
      mdadm --manage /dev/md1 --fail /dev/sda5
      mdadm --manage /dev/md2 --fail /dev/sda2
  • shutdown (do I really have to shut down??? I'd rather not if it's possible to do this while the server's up)
  • replace the dead disk
  • boot (but will it boot with the new unformatted disk???)
      fdisk -l /dev/sda
      mdadm --manage /dev/md0 --add /dev/sda1
      mdadm --manage /dev/md1 --add /dev/sda5
      mdadm --manage /dev/md2 --add /dev/sda2

Does that seem like the right steps, all the right steps, and nothing but the right steps?

Thanks again,

Tony

Wed, 05/02/2012 - 04:48
Locutus

Some comments from my end... Not a "dummy walkthrough" though, since my mdadm experience is a bit rusty. So please take these as advice only, and don't execute any commands I give without verification!

Your md0 is indeed a RAID-1, with three disks according to mdadm. Was it intentional to set it up like that? It is possible of course, having multiple mirrored disks. It seems that only sdb1 and sdc1 are active in md0, and sda1 is set up as a hot spare.

Where do you see that their "sizes do not match"? Sizes of RAID member partitions MUST match (okay, the smallest one dictates the size for the array actually).

Unfortunately, your printout of the /proc/mdstat file was partially garbled by forum bugs; a link was inserted where important information should be. You might want to check and fix that. (@Eric: Is it possible to get those forum bugs fixed? Inside code blocks, no links or other markup should be inserted.) According to the mdadm documentation, the partitions in /proc/mdstat have their device order number appended in square brackets, and "(F)" follows if that device is failed.

To get details about the failed drive, you can use the commands "mdadm -E /dev/sda1" (examine, to be used with physical partitions) and "mdadm -D /dev/md0" (details, to be used with md devices). Use that output to make sure you know which drive actually failed.

To remove it from the array, IMO you wouldn't need the "--fail" command, since the drive has already failed, but rather "--remove". Check "man mdadm" for details; the -E and -D output should also tell you more. You'll need "--fail" only if the defective disk is a member of other md devices and has not failed there.
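
For md0 that could look roughly like this - a sketch only, with the device names taken from your mdstat output, so verify them first:

mdadm -E /dev/sda1                            # examine a physical member partition
mdadm -D /dev/md0                             # details and state of the whole array
mdadm --manage /dev/md0 --remove /dev/sda1    # drop the already-failed member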

About sda being an IDE disk: Make sure that is really the case. Old IDE disks usually get "hdX" as device nodes and not "sdX".

Do NOT replace the drive while the server is running, unless you have a SATA controller and power connectors that are specifically meant for hot-swapping!

Whether the server will boot again, before and after you remove the defective disk, depends on whether the boot loader (GRUB?) is installed on all the RAID members. If you configured the RAID during OS installation, the installer should have done that for you; otherwise you'll want the commands "grub-install" and "update-grub". Check their man pages; I hope those apply to your CentOS, I'm using Ubuntu/Debian.
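
A rough sketch of what I mean (Debian/Ubuntu syntax; the grub-install part should exist on CentOS too, but there is no update-grub there, so verify before running):

grub-install /dev/sdb    # put the boot loader on the second RAID member
grub-install /dev/sdc    # ...and on the third, so any surviving disk can boot
update-grub              # Debian/Ubuntu only: regenerate the GRUB configuration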

Also your BIOS needs to be configured to boot not only from the first HDD but from subsequent ones, in case the failed disk is the first in your system.

Wed, 05/02/2012 - 11:19
mrwilder

It was a SATA disk.

Because I suffer from early onset hyper brain dysfunctional spasmosis, I didn't wait around for any advice even though the backup server was already running.

Instead, before I shut the server down I ran the --fail commands.

Needless to say, now when I boot it says "Kernel panic - not syncing: Attempted to kill init!"

And stopped.

There's no CD or DVD drive in the machine. If I install one and boot from a CENTOS install disk, can I save the array somehow?

Thanks again,

Tony

Wed, 05/02/2012 - 11:26
Locutus

I'm not sufficiently familiar with CentOS, but in Ubuntu you can boot the install CD to a rescue shell, with mdadm loaded and active, and you can run the test commands I mentioned and should also be able to perform the disk swap and resync from there.

So basically most of what I said in my post is still valid, as long as you can do the required stuff from your install CD. Getting the boot loader back on might be a bit more complicated.

Wed, 05/02/2012 - 12:20
andreychek

Yeah, there is indeed a rescue mode on the CentOS install CDs -- you may be able to figure things out from there.

Many systems can boot from a USB drive, so you may be able to load the CentOS ISO onto a USB drive rather than having to install a CD/DVD drive in your server.

-Eric

Wed, 05/02/2012 - 19:21
mrwilder

Ok, before I take a shot at the "rescue mode", could I simply take the old bad disk, put it in a Windows based machine, then use Norton Ghost to bit-copy it over to the new drive?

Ahem, then pop it in and go have a beer?

Thanks,

Tony

Thu, 05/03/2012 - 05:38
Locutus

That wouldn't help, since you've been using the other two disks of the RAID-5 after the defective disk got dropped from it. Which means the disks are now out-of-sync. Even if not, the RAID information now records the defective disk as failed, and you'll need to re-add it, no matter what.

IF the defective disk and the other two were still in sync, you could perform your Ghost copy, then force-create a new RAID-5 using the "--assume-clean" option, which skips the initial synchronization. But this only works if ONLY the array composition information got garbled, and the array itself is still fully intact and in sync. You should not do this in any other case.
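
Purely as an illustration of what such a force-create would look like (level, chunk size and device order are assumptions taken from your mdstat; getting any parameter wrong destroys the data, so do NOT run this unless the conditions above apply and you have verified everything):

mdadm --create /dev/md2 --level=5 --chunk=256 --raid-devices=3 \
      --assume-clean /dev/sda2 /dev/sdb2 /dev/sdc2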

So, to get the RAID-5 up and running again, the best course of action is to perform a resync through mdadm with a new HDD.
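
In outline, and only as a sketch (repeat for md0 and md1 with their sda partitions; verify the device names against the -D output first):

mdadm --manage /dev/md2 --remove /dev/sda2   # before the swap: drop the failed member
# ...power down, replace the disk, partition it like the others, then:
mdadm --manage /dev/md2 --add /dev/sda2      # start the rebuild onto the new partition
cat /proc/mdstat                             # watch the resync progress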

And actually, this is an intended and regular process for a RAID array - so to speak, doing this is why you're using RAID at all: to be able to replace a defective disk and re-integrate the new one into your disk set. If you start fiddling with external copies of defective disks now, you might as well stop using RAID altogether. :)

Thu, 05/03/2012 - 14:56 (Reply to #11)
mrwilder

Good point.

Should I leave the BAD disk in when I boot to the recovery CD, or install the new disk now, first, before I restart?

Thanks.

Tony

Thu, 05/03/2012 - 19:10
mrwilder

I'm sorry folks, I have a very good general idea of what I must do, unfortunately the devil is in the details.

May I use the "Net Install" disk? That's all I seem to have of CentOS.

Now that I did NOT --remove the disk and instead --fail(ed) it, must I leave it in and boot to the install media, select "repair" somehow, then get into disk utils in some way, then --remove the bad one?

Then, after that, shutdown, install new disk, and use a similar procedure but add the new disk back into the array, then rebuild it... right?

Oh, and then add the grub loader to all the disks in the array.. ?

Thu, 05/03/2012 - 23:44
mrwilder

Hi intelligent knowledgeable people,

After I burned disk one of the CentOS image, I went to install the new disk and a) the cable to sda simply fell out into my hand, and b) I thought about how much traffic has been slamming the machine.

I plugged the cord back in and ran the commands to get sda1 synced. It appears to be resyncing the RAID now.

Assuming that RAID does in fact resync, would you go ahead and change the drive? I may very well have knocked the SATA cable loose myself yesterday, and heavy traffic is probably unlikely to cause read errors... but the fact that the cable was inexplicably loose, and that traffic has increased 1000% over the past two weeks, makes me think the disk should be given another chance.

What would you do at this point? I already have the new disk, but remember, I don't know anything about "Creating partitions with the original layout" or anything like that... whatever instructions I got to set it up in the first place, I learned here.

So, what would some of you much more knowing people do at this point? ASSUMING it might resync, would you keep the old disk and assume the cable was loose or traffic levels caused a read error, or just use the new disk to get rid of variables and for safety's sake?

If you'd keep the old disk, can the new one be used as a "spare"? Where might I learn what a "spare" actually is and what that entails?

Thanks,

Tony

Fri, 05/04/2012 - 07:57
Locutus

If mdadm resyncs the old disk without errors, you can assume that it is still working and the SATA cable was indeed the culprit. mdadm does thorough tests during resync, and if it can't write or re-read any block, it will stop and tell you so.

So in that case you wouldn't need to replace the disk yet.

As for duplicating the partition layout -- there is an "sfdisk" invocation that takes the partitioning of one disk and duplicates it to another. This forum post might help, otherwise Google will surely find it:

http://forum.soft32.com/linux/gentoo-howto-fdisk-input-fdisk-ftopict3268...

This command should do it. VERIFY BEFORE EXECUTING!! Overwriting the partition table of the wrong disk will thoroughly nuke it.

# sfdisk -d /dev/sda | sfdisk /dev/sdb # Overwrites sdb's partition table with that on sda

A "spare" disk is an up-and-running HDD in a system that is registered in an array but not an active part of it. It is used to take over for a defective disk automatically; mdadm will use the spare drives to auto-resync to if an active drive fails.

Fri, 05/04/2012 - 19:33
mrwilder

Thanks Locutus.

The disk did resync ok - unfortunately, errors began again within a few hours, so I was forced to bring the VirtualBox image back up and do it right.

I'll try to install the new disk this weekend. Thanks for the tutorial and pointers.

Do spare disks need to be formatted and partitioned to the same type of layout of any particular disk? If so, wouldn't that limit a spare disk in terms of which disk it can be called into service for?

Thanks,

Tony

Sat, 05/05/2012 - 03:04
Locutus

What errors exactly are you seeing? If you had a loose SATA cable, the controller might still be the problem and not the HDD.

As for spare disks: Yes, since you assign partitions as spare to an array and not a whole drive, they need to be partitioned like member drives before they can be used.

Well actually, that's only half-true. In your specific case it's true, since your existing array uses partitions. mdadm can actually also operate on whole raw drives, without making partitions, by specifying e.g. /dev/sda and /dev/sdb when creating the array, as opposed to /dev/sda1. Naturally you can only have one array per set of drives then.
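
A sketch of that whole-drive variant, with purely hypothetical disks sdd and sde (not your current setup):

mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdd /dev/sde
mkfs.ext3 /dev/md3    # the filesystem goes on the array itself, no partition table involved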

Sat, 05/05/2012 - 10:09
lp86

This is why I use hardware RAID; it's more expensive, but a lot easier to figure out if something stops working. You can get SATA PCI-X cards on eBay pretty cheap now.

Sat, 05/05/2012 - 15:45
Locutus

I agree that a good HW RAID card is more reliable than software RAID, though a good one is rather expensive, and it's less flexible in terms of array composition. But if you can spend the money, and don't need flexibility, HW RAID is the way to go. :)

Sat, 05/05/2012 - 19:31
mrwilder

Sigh - I feel so stupid. I just don't get this. I have (apparently) removed sda from everything I see it listed in. I turn off the machine, replace the faulty drive with a new one, but then it won't boot.

I reinstall the faulty drive, and it DOES boot, even though I've supposedly removed the drive from everywhere I see it involved, i.e.:

mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md0 --remove /dev/sda1

Now why would it boot?

Conversely, if I --fail the drive, --remove it, then install grub on some other disk in the array, it does not boot. Thus, I can't run the commands to add the new drive to the array.

And if I boot from the recovery console, the drives aren't mounted at all!

Any hints?

Thanks

Sun, 05/06/2012 - 01:07
mrwilder

Long story short: I successfully replaced the drive in the array.

... but it's almost inconceivably unlikely that (the disk) hardware was the problem!

Not only has the new drive already failed out of the array, but an identical machine running different domains also now has a failed array.

These are two nearly identical machines, side by side. The only thing that has changed lately is the amount of traffic, which has changed from practically none to 1000s of hits a day on both machines. It's not a lot of traffic and these machines should be able to easily handle that... but BOTH machines have sent me "degrade event" emails followed by "failed drive" events.

Yes, it's possible that three drives failed simultaneously on two separate machines. Perhaps the power is bad or the temperature is high. But I think not.

It must be related to the traffic.

I believe my drives are suffering read errors due to the amount of traffic.

How can I verify this, or cut to the chase, assume it's true, and make my drives not suffer read errors from such puny traffic?

Thanks again,

Tony

Sun, 05/06/2012 - 14:13
Locutus

Sorry, wanted to reply earlier, but the site was down all day.

First, you need to elaborate on "does not boot". With that description alone, it is impossible to give any hints. What exactly happens when you try to boot with the new disk?

You could also use a rescue CD, boot from that, and use it to resync the array with a new disk, then install the boot manager on the new drive. You need to bind-mount /dev, /proc and /sys to the directory where you mount the root of your to-be-repaired installation, and use chroot. Google should find you tutorials on how to do that.
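
Roughly like this, assuming the rescue environment assembles your root array as /dev/md1 and /boot as /dev/md0 (check cat /proc/mdstat before mounting anything):

mount /dev/md1 /mnt            # root filesystem of the installed system
mount /dev/md0 /mnt/boot       # its /boot
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt                    # now you are "inside" the installation
grub-install /dev/sdb          # reinstall the boot loader where needed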

"High traffic" should, under normal circumstances, never be a cause for RAID failure. HDDs are made to transfer data, at full speed, also for longer periods of time. E.g. during resync, the whole disk is read and written at high speed. "1000s of hits per day" on a webpage is very very low traffic, concerning required disk I/O.

It is indeed unlikely that three drives fail nearly at once. You can use "smartctl" to check the SMART data of your drives, to see if there is indication of actual failure.
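
For example (smartctl comes with the smartmontools package; run it against the raw disks, not the md devices):

smartctl -H /dev/sda   # overall health self-assessment
smartctl -a /dev/sda   # full attribute listing and error log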

Otherwise, it is possible that the used HDDs are unsuitable for mdadm usage. Some drives can, under certain circumstances, produce high delay in responses to OS commands, which mdadm might interpret as failure. You might check Google if that is the case for your drives.

Sun, 05/06/2012 - 17:10
mrwilder

Thanks Locutus. I really do appreciate your incredible helpfulness.

As far as it goes, I was able to get the new drive installed and working - it is simply that it too was marked "failed" by mdadm.

The original disks are probably less than three years old. They are Maxtor Fireball 120GB drives, purchased new for the express purpose of building the machine - perhaps in 2010. The new drive was a WD Caviar 500 GB.

I'm quite certain that the problem is due somehow to my improper configuration. Because the most important thing to me is time, I think it would be best to simply ditch the RAID unless that learning curve can be effectively cut to a day or so.

If a hardware RAID can be had that will handle the drives as-is, that would be nice, but something tells me that's an unlikely and expensive fantasy. Which leads me to wonder, is it possible to somehow create physical disks that contain the same data which is now partially RAID 1 and partially RAID 5'd across three "failing" disks?

Thanks again,

Tony

Sun, 05/06/2012 - 19:34
andreychek

Out of curiosity, if you run the command "dmesg", are you seeing any error output at the end of that?

If you were dealing with some sort of hardware error -- it would likely be throwing some errors that were showing up in that dmesg output.

-Eric

Mon, 05/07/2012 - 04:25
Locutus

Yes, what Eric said, and additionally you should examine the SMART data like I mentioned before.

See if the disks report any SMART data that is indicative of failure. You can also instruct your drives to perform self-tests. ATTENTION! In your situation, you should perform those self-tests only when the RAID arrays are not assembled/running, i.e. from a rescue CD! That is because the self-tests, especially the long one, can cause the drive to respond very slowly to OS commands, making mdadm drop the drive from the array because it thinks it's defective.
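
The self-tests would be started and read back roughly like this (again: only with the arrays stopped, e.g. from a rescue CD):

smartctl -t short /dev/sda      # quick test, about two minutes
smartctl -t long /dev/sda       # full surface scan, can take hours
smartctl -l selftest /dev/sda   # read the results afterwards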

As for your Maxtor drives -- I didn't know they still sold drives as small as 120 GB in 2010. Sounds rather outdated.

I have mdadm experience with the following drives. At home, I use three WD EARS (1.5 TB) in a RAID-5 for my NAS, and at University we have two WD RE4 (2.0 TB) in RAID-0. These work okay, all under Ubuntu 10.04.

For server purposes, I'd always suggest using RAID-1 and not RAID-5, unless storage space vs. HDD price is a really big issue (which it shouldn't be when operating a server).

Your last question I didn't understand. Can you re-phrase "create physical disks that contain the same data which is now partially RAID 1 and partially RAID 5'd across three "failing" disks" please?

Mon, 05/07/2012 - 19:43
mrwilder

Hi, thanks. When I ran the SMART utility from Virtualmin, it did say something about "old age". I apologize for not being more specific; I cannot get to the machine at this time, and I ran the command before you guys suggested it.

As for dmesg, I will run that command when I am physically near the machine again.

Since you are saying that RAID level one is the only thing I should use, I would rather go with a nightly full disk dd command and leave it at that. Thus, I want to get rid of the raid entirely. My last question was in reference to doing just that. Is there a procedure to simply put the contents of these disks back onto a single disk, or is that wishful thinking?

I've had good luck restoring virtualmin sites so I have a lot more faith that rebuilding the machine will go off without huge hitches, if that is my least labor intensive path to freedom from the RAID.

For me, because of my lack of knowledge, the RAID has become more of a burden than a tool.

Thanks for your help,

Tony

Tue, 05/08/2012 - 01:49
Locutus

Migrating an existing installation from RAID to non-RAID is -- at least on Ubuntu, I suppose it's the same for CentOS -- a bit tricky, but when you know the right steps, rather simple.

What you need to do is create the proper partitions on the new drive, use "rsync" to copy over the disk contents from the RAID partitions to their non-RAID counterparts. That's the easy bit. Then to get the boot loader on the new drive, you need to mount the root partition of the new drive somewhere reachable, then bind-mount /dev, /proc and /sys into that root mount, and use chroot to go into it. Then use grub-install and update-grub.
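
The copy step could look roughly like this (a sketch only; /dev/sdd1 is a purely hypothetical non-RAID target partition, already formatted):

mkdir -p /mnt/oldroot /mnt/newroot
mount /dev/md1  /mnt/oldroot                         # existing RAID root
mount /dev/sdd1 /mnt/newroot                         # hypothetical new non-RAID root
rsync -aH --numeric-ids /mnt/oldroot/ /mnt/newroot/  # preserve ownership, permissions, hard links
# afterwards: adjust /mnt/newroot/etc/fstab (replace the /dev/mdX entries),
# then bind-mount /dev, /proc and /sys into /mnt/newroot, chroot in and run grub-install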

Here's a website that I use for reference when doing such moves:

http://realtechtalk.com/Ubuntu_1004GRUB2_mdadm_wont_boot-1070-articles

The instructions are for non-RAID to RAID, but they work analogously in the other direction. Just skip the mdadm bit and install grub to just one drive.

Thu, 05/10/2012 - 11:43
mrwilder

The very last messages in dmesg report that the RAID was successfully rebuilt, although the mdadm report for md0 shows "[U_]".

This SMART Report is confusing to me:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   140   139   021    Pre-fail  Always       -       3966
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       19
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       16
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       19
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       67
194 Temperature_Celsius     0x0022   110   104   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

But this is the NEW drive!

Could that 33C be the culprit? Is the server room too hot?

Thanks,

Tony

Thu, 05/10/2012 - 13:25
Locutus

33 Celsius is perfectly okay for an HDD, and the SMART data looks a-ok.

Fri, 05/11/2012 - 23:16
mrwilder

Could it be the fstab configuration that is causing the problem?

/dev/md1                /                       ext3    grpquota,usrquota,rw    0 1
/dev/md2                /tmp                    ext3    nosuid,noexec,nodev,rw  0 0
/dev/md0                /boot                   ext3    defaults                1 2
tmpfs                   /dev/shm                tmpfs   nosuid,noexec,nodev,rw  0 0
devpts                  /dev/pts                devpts  gid=5,mode=620          0 0
sysfs                   /sys                    sysfs   defaults                0 0
proc                    /proc                   proc    defaults                0 0
LABEL=SWAP-sdc3         swap                    swap    defaults                0 0
LABEL=SWAP-sdb3         swap                    swap    defaults                0 0
LABEL=SWAP-sda3         swap                    swap    defaults                0 0

Sat, 05/12/2012 - 08:17
andreychek

Howdy,

Your fstab shouldn't affect the workings of a software RAID device. It'd normally be the other way around :-)

And your fstab looks pretty normal.

What does /proc/mdstat contain, out of curiosity?

-Eric

Tue, 05/15/2012 - 00:41
mrwilder

Just to close this issue. I re-added the new drive after the previous failure and haven't received any failure notifications from that machine since.

I also noticed that the actual fail notification from the other machine was from MONTHS earlier. I must have forgotten about it and Google simply lumped it into the threaded email because of a similar subject line...

... so go figure.

It has all been up and running for the last few days now.

Thanks, everyone, for your never-ending striving to make life great for the rest of us. Virtualmin REALLY IS the reason I run on Linux platforms. I'm sure that's true for thousands and thousands of people.

Thanks again,

Tony

Tue, 05/15/2012 - 08:57
Locutus

"Virtualmin REALLY IS the reason I run on Linux platforms. I'm sure that's true for thousands and thousands of people."

Yes, actually, Virtualmin was the reason also for me to switch from Windows to Linux for my web hosting platform. :)

Sun, 12/29/2013 - 17:11
wocul

I recently had a failing disk, too (software RAID) - the SMART info in webmin was actually very helpful here to get the disk replaced in time, but I do agree that webmin could provide a helping hand when integrating a replaced disk back into the system, i.e. partitioning the new disk according to existing disks, detaching removed partitions, adding the new partitions to each /dev/mdX and installing/updating the boot loader.

Also, /proc/mdstat should probably be evaluated as part of the status display in the "Linux RAID" module, because it showed "Active (green)" despite the RAID missing several partitions... so it would probably be better to show the actual RAID status there, too?

Wed, 01/01/2014 - 05:18
sgrayban

What a mess this thread turned into....

The only 3 steps you needed to take were to fail the whole sda drive for all partitions and then replace that drive.

RAID would have taken care of the rest once you copied the partition structure to the new drive and then assigned the new partitions to the array.

It's a very easy thing to do.

1) Fail the sda drive, shut down the server, have sda replaced, then boot the server back up
2) Log in as root and issue this command: sfdisk -d /dev/sdb | sfdisk /dev/sda
3) Log into Webmin and use the RAID module to add your new sda drive back into the array (see the sketch below).
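
Spelled out as commands, a sketch only - double-check the device names before running anything:

# step 1: mark the whole failed drive out of every array it belongs to
mdadm --manage /dev/md0 --fail /dev/sda1 --remove /dev/sda1
mdadm --manage /dev/md2 --fail /dev/sda2 --remove /dev/sda2
mdadm --manage /dev/md1 --fail /dev/sda5 --remove /dev/sda5
# step 2: after the drive swap, copy the partition table over from a good disk
sfdisk -d /dev/sdb | sfdisk /dev/sda
# step 3: re-add the new partitions (or do the same from Webmin's RAID module)
mdadm --manage /dev/md0 --add /dev/sda1
mdadm --manage /dev/md2 --add /dev/sda2
mdadm --manage /dev/md1 --add /dev/sda5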

That's it...

The webmin RAID module is your friend and will nearly do everything you need done.

Topic locked