RAID the hard way (updated 10/3/13)

It has been a long, frustrating road to use RAID for my TimeMachine backups. Hopefully, you can learn what not to do from this... - by David R. Beebe
I optimistically purchased a Mercury Elite-AL Pro QX2 from OWC (macsales.com) back in December 2010. I had previously purchased two 1.5TB WD SATA green drives to alternately use in a NewerTech drive cradle with SuperDuper. Because I already had 2 discs, I ordered 2 more from Amazon and ordered the empty RAID cabinet from OWC. Hooked it up as the first drive in my FW800 chain.

Lesson 1: The discs in the RAID cabinet have to be the same firmware. So I had to order 2 more from Amazon. This also means that if a disc fails and that firmware is no longer available, you have to replace all 4 discs. In my case, they are all WD15EARS.
I started out in RAID 5 (setting 9) for 3 discs + 1 spare. This gave me a 4.5TB backup drive with 1 redundant disc in case of a single failure.
A few months into the backup, I got an error light/alarm on the QX2 with the disc in Slot C. Spoke with OWC Tech Support. While not listed as incompatible, I was told that because green drives spin at variable speeds, they were a poor choice for RAID 5. A write spans discs and may fail to complete consistently.

Lesson 2: Nothing in the list of incompatible discs said that this would be a problem. There were a lot of known issues with Seagate discs, but these were Western Digital.
Since the variable spin rate was suspected, I switched from the preferred RAID 5 mode to Disc Spanning (setting 2), which has no redundancy or speed advantage. This, of course, wiped out my Time Machine backup.
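
For anyone weighing the same trade-off, here is a rough sketch of the capacity arithmetic behind the two modes I tried. It assumes 4 equal 1.5TB discs and the textbook rules (RAID 5 gives up one disc's worth of space to parity, spanning uses all of it). The function name and mode labels are made up for illustration and have nothing to do with the QX2's own settings.

```python
def usable_capacity_tb(disc_count, disc_size_tb, mode):
    """Return usable capacity in TB for an array of equal-sized discs.

    Textbook rules only; real enclosures may reserve a little extra space.
    """
    if mode == "span":
        # Spanning simply concatenates the discs: no parity, no redundancy.
        return disc_count * disc_size_tb
    if mode == "raid5":
        if disc_count < 3:
            raise ValueError("RAID 5 needs at least 3 discs")
        # One disc's worth of space is consumed by parity.
        return (disc_count - 1) * disc_size_tb
    raise ValueError(f"unknown mode: {mode}")


raid5_tb = usable_capacity_tb(4, 1.5, "raid5")
span_tb = usable_capacity_tb(4, 1.5, "span")

print(f"RAID 5: {raid5_tb} TB usable, survives one disc failure")
print(f"Span:   {span_tb} TB usable, no redundancy")
```

The extra 1.5TB from spanning comes at the cost of losing the whole volume if any one disc fails.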

The QX2 powered itself off and alarmed. There was no indication as to why this happened. Tech support said that there could have been a problem on the FW800 bus that caused the QX2 to panic. Nothing in the console logs. After the 2nd failure in a month, tech support suggested trying a shorter FW800 cable. I also moved other FW800 discs in the chain to USB 2.0. This left 3 discs in the FW800 chain, all self powered.

After limping along for the first year, in 2012 I started getting a blinking red error light (bad HDD on startup) on Slot C. I ran Drive Genius repair mode against the RAID, and it took days. It found 16 bad blocks out of 12 billion blocks. OWC can't say whether the Mac OS Extended (Journaled) filesystem, the disc hardware, or the QX2 should manage sparing bad blocks. Soon after that, the QX2 flagged Slot C with a steady red error light (HDD error). Tech Support suggested pulling the disc and letting Disk Utility repair it. That didn't work, as the drive showed as unreadable. Western Digital replaced the disc with one of the same firmware under warranty.

Lesson 3: A QX2 ordered empty has a 1 year warranty instead of the 5 years you get if it is purchased with discs. In the long run, buying it with discs would have been less expensive, since I ended up with 2 more 1.5TB discs than I needed due to the need for matching firmware.
Again the QX2 powered itself off and alarmed. Tech support thought that 3 discs chained was too many (despite the FW800 standard supporting 63 chained, self-powered devices). I pulled everything but the QX2 and the LG BD-R burner from the FW chain. Had to spend days with Drive Genius again to deal with a few bad blocks on Slot C (which was recently replaced).

After this, I moved the QX2 to USB. I figured it might be the one having the problem with the FW800 bus. The problem with Slot C started up again. With Disk Utility I tried erasing the volume. It progressed to the halfway point on the progress bar (up to Slot C) and then didn't make any progress other than to keep extending the expected completion time. I let it get up to 7 hours and stopped it. I re-partitioned the volume instead. USB is just too slow for this kind of volume. I stopped Time Machine and moved it back to FW800.

To see if the problem followed the disc, I took the QX2 offline and reordered the discs. Moved C to A, D to B, A to C and B to D. To do this, I had to change the RAID mode and let the QX2 rebuild before setting it back to Disc Spanning with 4 discs (position 2).

Now I am faced with another loss of Time Machine history and I am no closer to knowing what is wrong. I can start up Time Machine again with the QX2 on USB 2.0, but that will take forever to catch up on 4+TB. The only other option is to pay an out of warranty fee for OWC to diagnose the QX2, but I am not confident they will find a problem. I'll know more if the QX2 goes offline on USB or if the disc in Slot C errors again. In the meantime, I am moving some of my discs back to FW800.

05.14.12 Update: Before I wiped out the backup on the RAID, I took a subset backup to a single 1.5TB disk. Good thing I did. While rebuilding the backup on RAID, the internal 1TB disk died a slow death (spinning beach ball all the time, would see Macintosh HD as a boot drive but never complete boot, would not recognize the Snow Leopard DVD to run Disk Utility from, and finally I used Target Disk Mode to make 3 repair passes from a MacBook Pro before it completely failed overnight). It died before a backup completed on the RAID, of course.

05.20.12 Update: The iMac is back from AppleCare with a new 1TB internal. It took over 17 hours to restore 600+GB over FW800. Got everything reconfigured and did backups in stages to RAID over FW800. First the newly restored internal, then 2 of the critical external disks. A week after the HD failure, I awoke to the RAID alarming again w/o power. Since it is back on FW800, I've not proven what is shutting it down. It happened just minutes after a successful backup by Time Machine of 3 files on one of the externals.

05.21.12 Update: Another shutdown and alarm situation, this time about 20 minutes after a backup completed. Switched back to USB in order to prove or disprove if FW800 is the culprit. Adding back in another 500GB external to the backup. There is a lot of data to move across USB still.

08.30.13 Update: Here we go again. The RAID took itself offline and would no longer mount. No hardware drive errors were posted but the rebuild light was blinking yet it never completed. Nothing found in system logs. Disk Utility could not repair it. Had to power it off, change RAID mode, let it rebuild, change RAID mode back to position 2 (span with 4 disks), reformat and start over.

09.16.13 Update: Throughput over USB 2.0 was terrible for anything but an incremental backup. I was only getting 5GB per hour while trying to back up nearly 5TB. I let it run for a couple of weeks until I gave up and moved the RAID to FW800. In the meantime, I took a bootable backup of the OS with SuperDuper!.
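
To put that USB 2.0 rate in perspective, here is a quick back-of-the-envelope sketch. It assumes decimal units (1TB = 1000GB), a steady 5GB per hour, and that "a couple of weeks" means 14 days. Those are my round numbers for illustration, not measurements.

```python
def hours_to_copy(total_gb, rate_gb_per_hour):
    """Hours needed to move total_gb at a constant rate."""
    return total_gb / rate_gb_per_hour


BACKUP_GB = 5000          # "nearly 5TB", rounded up, decimal units assumed
USB_RATE_GB_PER_HOUR = 5  # observed rate over USB 2.0

usb_hours = hours_to_copy(BACKUP_GB, USB_RATE_GB_PER_HOUR)
usb_days = usb_hours / 24

# How far the backup would get in an assumed 14 days at that rate.
copied_after_two_weeks_gb = 14 * 24 * USB_RATE_GB_PER_HOUR
fraction_done = copied_after_two_weeks_gb / BACKUP_GB

print(f"Full backup over USB 2.0: {usb_hours:.0f} hours (~{usb_days:.0f} days)")
print(f"After 14 days: {copied_after_two_weeks_gb} GB, {fraction_done:.0%} done")
```

In other words, at that rate the initial backup needs well over a month of uninterrupted copying, which is why giving up after two weeks was the sensible call.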

09.25.13 Update: FW800 speeds finally allowed the backup to complete, and I had completed a number of incremental backups when the QX2 alarmed with a hardware error on slot B. Remember that last year the error was posted on slot C, and that when I rebuilt it I moved C/D to A/B. Slot C has never alarmed since. I could find nothing wrong with the drive, and taking the QX2 offline and power cycling it "cleared" the error.

09.30.13 Update: Since 9/25, the QX2 powered itself down with an audible alarm once, and just now the error on slot B has returned. Another power cycle cleared the error and I switched from FW800 back to USB 2.0. Time Machine continues to work but I have no confidence that a restore would be successful.

10.02.13 Update: I give up! I am not spending $400+ to buy new disks for the QX2 when I have zero confidence in it. Today I purchased a Q-Drive 4TB single disk drive to use with Time Machine. It is a 7200 RPM drive that supports USB 2.0/3.0 and FW800. I expanded the list of exclusions to keep the backup size around 3TB. Static files such as Aperture libraries from past years will be copied to another disk, perhaps to the four 1.5TB drives currently sitting in the QX2. Anyone want to buy a used QX2?