This first part of my blog is known as the "teaser". Hopefully it convinces the reader to view the rest of the blog. In this case, I'll be a little more up front.
You might not want to read this blog. It's a little dry. Some might call it uninteresting. It's about the foundational failure handling techniques first employed in the original CLARiiONs. If I were to compare this topic to an American football game, we'd be talking about the offensive line versus the defensive line: the battle in the trenches. (By the way, wasn't the last Super Bowl won in the trenches? Think NY Giants defensive line.)
So, who should read this post?
Customers who are considering a RAID solution. You should know that shipping a quality RAID system is a huge challenge. RAID solutions come and go. So this post is for customers.
Especially the ones that really, really, really care about the integrity of their data.
In my last post I wrote about making the decision to design CLARiiON RAID-5 algorithms by using state machine techniques. And then I wrote about my hesitancy to ship our first product.
My reason involved the lack of time I had to generate a thorough test suite that was guaranteed to visit every state and arc in the state machine. So I brought my state machine diagram over to a buddy of mine working in quality engineering, and he created an abusive, nasty test suite known as the DAQ: the Disk Array Qualifier.
This was in 1989. It's 2008, and the DAQ is still going strong. In my mind it's one of the critical factors behind CLARiiON's continued growth.
Here's what it does.
One of the first "bugs" I ran into in my state machine was due to an application sending multiple writes to the exact same disk address. The application did this over and over again, while reading an adjacent block on a different disk.
What's the point of this? Well, if the CLARiiON state machine doesn't handle atomic updates of application data and parity, then adjacent data could become corrupt (even if that data is not being updated!). And after several days of writing the same block over and over again, the adjacent sector somehow became corrupt. How did it happen? A disk hiccuped and caused the software to enter a new state that wasn't handling data integrity very well.
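To make that failure mode concrete, here's a toy sketch (my own illustration with made-up block values, certainly not the CLARiiON code itself) of why a non-atomic data/parity update can corrupt a neighboring block that was never written:

```python
# Toy illustration of the RAID-5 parity consistency problem.
# Three data blocks and one parity block make up a stripe.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d0, d1, d2 = b"\x11" * 4, b"\x22" * 4, b"\x33" * 4
parity = xor(xor(d0, d1), d2)

# A small write to d0 is a read-modify-write: the new data AND the new
# parity must land together (atomically) for the stripe to stay consistent.
new_d0 = b"\x44" * 4
new_parity = xor(xor(parity, d0), new_d0)

# If both updates complete, the untouched neighbor d1 can always be
# reconstructed, e.g. after the disk holding d1 fails:
assert xor(xor(new_d0, d2), new_parity) == d1

# But suppose a disk hiccup or reset lands between the two updates:
# new_d0 is on disk, while the parity block still holds the old value.
stale_parity = parity
rebuilt_d1 = xor(xor(new_d0, d2), stale_parity)
assert rebuilt_d1 != d1   # the adjacent block, never written, comes back corrupt
```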
After finding and fixing the problem, I worried about other nasty corner cases of a similar nature. How could I feel confident that we had found them all?
The DAQ
In the RAID-5 state machine there exists a set of arcs between states that represent failure conditions. We needed to cause these failures to occur for all possible states. These failures form quite the list: disk failures, disk disappearances, power failures, software panics, and software resets.
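To give a sense of the size of that job, the testing obligation is essentially a cross product: every failure has to be provoked while the software sits in every state. A small sketch (the state names below are stand-ins I'm inventing for illustration; the real machine had far more states):

```python
from enum import Enum, auto
from itertools import product

# Illustrative stand-ins only; not the real CLARiiON state or event names.
class State(Enum):
    OPTIMAL = auto()       # all members healthy
    DEGRADED = auto()      # one member gone, reads reconstructed from parity
    REBUILDING = auto()    # the replaced member is being rebuilt

class Failure(Enum):
    DISK_FAILED = auto()
    DISK_DISAPPEARED = auto()
    POWER_FAILURE = auto()
    SOFTWARE_PANIC = auto()
    SOFTWARE_RESET = auto()

# Every failure arc needs to be exercised from every state.
test_matrix = list(product(State, Failure))
print(f"{len(test_matrix)} (state, failure) combinations to provoke")
```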
So we added "backdoors" to the CLARiiON RAID-5 build process that exposed "failure hooks" to CLARiiON quality tests. These hooks included:
- the ability to power off a disk
- the ability to write corrupt data onto a disk block or blocks
- the ability to reset the software
- the ability to power on a disk
- access to the RAID-5 mapping algorithms
- access to the RAID-5 rebuild checkpoint
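The real hooks lived behind the CLARiiON build, and I'm not going to reproduce them here. Purely to give the later test descriptions something concrete to hang off of, here's a hypothetical sketch of what such a hook interface (plus the front-door LUN I/O path the tests drive) might look like. Every name in it is my invention:

```python
from typing import Protocol, Tuple

class ArrayHooks(Protocol):
    """Hypothetical backdoor interface a quality test could drive."""

    def power_off_disk(self, disk: int) -> None: ...
    def power_on_disk(self, disk: int) -> None: ...
    def corrupt_blocks(self, disk: int, block: int, count: int = 1) -> None: ...
    def reset_software(self) -> None: ...

    def map_lba(self, lba: int) -> Tuple[int, int]:
        """RAID-5 mapping: logical block address -> (disk, physical block)."""
        ...

    def rebuild_checkpoint(self, disk: int) -> int:
        """How far (in blocks) the rebuild of the given disk has progressed."""
        ...

class ReadResult(Protocol):
    data: bytes
    soft_error: bool   # corruption detected but recovered (e.g. via parity)
    hard_error: bool   # correct data could not be returned

class Lun(Protocol):
    """Hypothetical front-door I/O path, as an application would see it."""
    block_count: int
    def write(self, lba: int, data: bytes) -> None: ...
    def read(self, lba: int, blocks: int = 1) -> ReadResult: ...
```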
Given the ability of the DAQ to wreak this type of havoc, I'd like to cover a set of representative tests run by the DAQ that cycle through many of these internal states.
1. Write, Read, Fail, Read, Rebuild, Read
This test was fairly straightforward. Write a known data pattern to healthy disks. Read it back for comparison. Kill a disk. Read the data again. Turn the disk back on and let it fully rebuild. Read the data again. There should be no errors, and no data compare issues.
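In rough Python, using the hypothetical hooks and LUN sketched earlier (an illustration of the test's shape, not the DAQ itself), test 1 looks something like this:

```python
import time

def wait_for_rebuild(hooks, disk, disk_blocks):
    """Poll the rebuild checkpoint until the whole disk has been rebuilt."""
    while hooks.rebuild_checkpoint(disk) < disk_blocks:
        time.sleep(1)

def test_write_read_fail_read_rebuild_read(hooks, lun, pattern, disk, disk_blocks):
    lun.write(0, pattern)                                 # known pattern, healthy disks
    assert lun.read(0).data == pattern                    # read back for comparison

    hooks.power_off_disk(disk)                            # kill a disk
    assert lun.read(0).data == pattern                    # degraded read must still match

    hooks.power_on_disk(disk)                             # turn it back on...
    wait_for_rebuild(hooks, disk, disk_blocks)            # ...and let it fully rebuild

    result = lun.read(0)
    assert result.data == pattern                         # no data compare issues
    assert not (result.soft_error or result.hard_error)   # and no errors
```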
2. Reverse Rebuild Reads
Write a known pattern of data only to a specific disk in a RAID-5 disk set. Kill that disk. Read all the data from that disk and compare. Turn the disk back on. While the rebuild is occurring, read the known pattern from the highest address to the lowest (this will eventually collide with the rebuild checkpoint, which marches from lowest to highest). There should be no errors, and no data compare issues.
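A sketch of the same flavor, using the same hypothetical hooks and LUN:

```python
def test_reverse_rebuild_reads(hooks, lun, pattern, target_disk):
    # Write the known pattern only to logical addresses that map to target_disk.
    lbas = [lba for lba in range(lun.block_count)
            if hooks.map_lba(lba)[0] == target_disk]
    for lba in lbas:
        lun.write(lba, pattern)

    hooks.power_off_disk(target_disk)          # kill that disk
    for lba in lbas:                           # every read now reconstructs from parity
        assert lun.read(lba).data == pattern

    hooks.power_on_disk(target_disk)           # rebuild marches lowest -> highest
    for lba in reversed(lbas):                 # reads march highest -> lowest and will
        result = lun.read(lba)                 # eventually cross the rebuild checkpoint
        assert result.data == pattern
        assert not (result.soft_error or result.hard_error)
```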
3. Rebuild Checkpoint Reads and Writes
Power off a disk drive, and then immediately power on the same disk. Continually write and read the data sections immediately around the rebuild checkpoint. There should be no errors or data miscompares.
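Sketched the same way (note that mapping the rebuild checkpoint back to logical addresses is glossed over here; this just shows the shape of the test):

```python
def test_rebuild_checkpoint_io(hooks, lun, pattern, disk, disk_blocks, window=8):
    hooks.power_off_disk(disk)
    hooks.power_on_disk(disk)                  # immediate power-on kicks off a rebuild

    while hooks.rebuild_checkpoint(disk) < disk_blocks:
        ckpt = hooks.rebuild_checkpoint(disk)
        low = max(0, ckpt - window)
        high = min(lun.block_count, ckpt + window)
        for lba in range(low, high):           # hammer a small window straddling
            lun.write(lba, pattern)            # the advancing checkpoint
            result = lun.read(lba)
            assert result.data == pattern
            assert not (result.soft_error or result.hard_error)
```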
4. Rebuild Checkpoint Reads and Writes, Adjacent
Same as previous, except write and read areas on healthy disks that are adjacent to the current rebuild checkpoint.
5. Forced Corruption
Write a known pattern throughout every stripe and disk in a RAID-5 set. Read and verify. Corrupt known pattern on every stripe and disk. Read and verify. Examine read response to ensure that a soft error (recoverable) occurred. Overwrite corrupt data with known pattern. Read and verify. There should be no errors or data miscompares.
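Here's the shape of that test, again with the hypothetical hooks and LUN (the `soft_error` flag is my stand-in for whatever the real read response carried):

```python
def test_forced_corruption(hooks, lun, pattern):
    for lba in range(lun.block_count):         # known pattern on every stripe and disk
        lun.write(lba, pattern)
        assert lun.read(lba).data == pattern

    for lba in range(lun.block_count):
        disk, block = hooks.map_lba(lba)
        hooks.corrupt_blocks(disk, block)      # scribble on the data behind the LUN

        result = lun.read(lba)                 # the corruption must be detected,
        assert result.soft_error               # reported as recoverable,
        assert not result.hard_error
        assert result.data == pattern          # and the correct data still returned

        lun.write(lba, pattern)                # overwriting repairs the block
        repaired = lun.read(lba)
        assert repaired.data == pattern and not repaired.soft_error
```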
6. Forced Double Failures
Write a known pattern throughout every stripe and disk in a RAID-5 set. Read and verify. Corrupt known pattern on every stripe and disk, and then corrupt the data on adjacent disks. Read the known pattern and verify that hard errors are returned on every read. (Remember, these are the days before RAID-6!) Clean up all corruption.
7. Forced Checksum and Disk Failures
Write a known pattern throughout every stripe and disk in a RAID-5 set. Read and verify. Corrupt known pattern on every stripe and disk. Then kill a disk. Read the known pattern and verify that hard errors are returned on every read. Clean up all corruption, power on disk.
I'd like to say something about these last two tests. They make sure that no undetected data corruption happens during double failures. It's critically important not only to test double failures (e.g. kill two disks), but also to make sure that CLARiiON can recover from double failures if at all possible.
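One last sketch, closest to test 7 (test 6 corrupts an adjacent disk instead of powering one off). The point is that every read of a doubly-failed stripe must come back as a hard error rather than as silently wrong data:

```python
def test_forced_double_failure(hooks, lun, pattern, second_disk):
    for lba in range(lun.block_count):         # known pattern on every stripe and disk
        lun.write(lba, pattern)

    for lba in range(lun.block_count):         # first failure: corrupt the pattern
        disk, block = hooks.map_lba(lba)
        hooks.corrupt_blocks(disk, block)

    hooks.power_off_disk(second_disk)          # second failure: kill a disk

    for lba in range(lun.block_count):
        assert lun.read(lba).hard_error        # the loss must be reported; returning
                                               # bad data silently would be far worse

    hooks.power_on_disk(second_disk)           # clean up: restore the disk and
    for lba in range(lun.block_count):         # rewrite the known pattern
        lun.write(lba, pattern)
```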
That's Pretty Thorough
You can see from this sampling of tests that the team shipping RAID-5 went to great lengths to stress the RAID-5 software to the limits. I would argue that the DAQ continues to be one of the most critical pieces of intellectual property we have. The entire test suite represents an incredible amount of creativity.
When a new revision of software is ready for testing, these tests run against it for months.
There's much more that I haven't mentioned. The DAQ continually preserves old tests and adds new ones. There's a whole suite of tests covering power failure conditions. And failover conditions. And write cache data integrity. And I haven't even touched upon other rigorous (non-DAQ) qualification techniques.
It Might Have Been Dry, But You Finished
I hope I've done some justice to the topic. A lot of my blogs about CLARiiON to this point have been my attempt at explaining why CLARiiON RAID-5 adoption continues to grow twenty years after it was created. I've heard it said in IT circles that CLARiiON is a recognized brand name that equates to "trusted RAID storage".
Now you know why.
Steve
Hi Steve,
That's a great insight into Clariion's RAID5 details. Could you explain what happens when Clariion encounters a read or write error? Does it start an implicit RAID rebuild, or is the administrator expected to do it? And what happens to the application that issued I/O to the LUN? What error status is returned to the application when it encounters the I/O error?
Anand
Posted by: Anand Vidwansa | June 08, 2008 at 01:24 AM
Hi Anand,
Good question. On a solitary read error, yes, the rebuild is implicit. The administrator doesn't have to do an explicit rebuild. For a single write failure, CLARiiON will also handle everything implicitly.
From the application point of view, the RAID5 algorithms return good status. The I/O error observed at the disk drive level does not get reported to the application.
Steve
Posted by: Steve Todd | June 09, 2008 at 06:32 AM