After graduating from college I started working on the RAID-5 algorithms. Nearly every Friday afternoon at 3PM I would walk over to my boss and tell him I had found another problem with the design of the (eventual) CLARiiON software. He would groan when he saw me coming, because usually we wouldn't be able to solve it before quitting time, and we'd have to drag the unfinished business into the drive home. Not a good way to start the weekend.
"Next time just tell me on Monday morning."
The RAID-5 algorithms were making a young college kid's head spin.
So I became the Secretary of State.
I'm Talking About State Machines
I decided to go with a state machine software design. This helped me to organize and manage the complexity. I believe that this foundational decision has influenced the continual growth of the CLARiiON RAID-5 implementation.
I could have avoided a lot of complexity by going another route. I could have performed stripe locking. I talked about it with my boss. We decided that it would violate one of the sacred tenets of RAID-5: performance. Let me explain.
Performance and Data Integrity
I've consistently blogged about performance and data integrity being the premise for the initial RAID-5 proposal. The main reason for splitting customer data onto multiple spindles was to parallelize disk operations. So for a RAID-5 implementation to remain true to the proposal it needed to perform simultaneous reads and writes to all spindles.
The main reason for adding parity to customer data was to protect the integrity of the customer's data. When the customer data and parity are adjacent to each other, it's called a RAID-5 stripe. When a customer updated data in a stripe, the parity would need to be updated as well. And the parity needed to be updated no matter what was happening on the other disks.
The permutations of what could be happening on those other disks, plus the permutations of error conditions that could be happening on any disk(s), were mind-boggling.
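The parity arithmetic itself was never the hard part. Here's a minimal sketch (my names, modern C, nothing like the actual CLARiiON code) of the classic read-modify-write parity update for a single sector: new parity = old parity XOR old data XOR new data.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512  /* hypothetical sector size for the sketch */

/* Read-modify-write parity update for one sector. The old data and old
 * parity have already been pre-read from disk; XORing them with the new
 * data keeps the stripe's parity consistent without touching the other
 * spindles. */
static void rmw_parity(const uint8_t *old_data,
                       const uint8_t *new_data,
                       const uint8_t *old_parity,
                       uint8_t       *new_parity)
{
    for (size_t i = 0; i < SECTOR_SIZE; i++)
        new_parity[i] = old_parity[i] ^ old_data[i] ^ new_data[i];
}
```

The hard part was everything around that XOR: the pre-reads, the writes, and what to do when any of them failed.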
The Easy Way Out
RAID-5 stripe locking was an answer. With stripe locking, when the first read or write operation hits a particular RAID-5 stripe, that stripe is locked, and the operation is guaranteed full ownership of the entire stripe. This eliminates a lot of complexity. Read and write operations for different stripes can proceed on different spindles, thus preserving the performance promise of RAID-5.
Except when multiple reads and writes hit the same RAID-5 stripe.
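In rough terms, stripe locking looks something like the sketch below (modern C with pthreads, invented names, not how any of this was actually written): every request funnels through its stripe's lock, so two writes to the same stripe serialize even when they land on different spindles.

```c
#include <pthread.h>

#define NUM_STRIPE_LOCKS 256  /* hypothetical size of the lock table */

static pthread_mutex_t stripe_locks[NUM_STRIPE_LOCKS];

static void stripe_locks_init(void)
{
    for (int i = 0; i < NUM_STRIPE_LOCKS; i++)
        pthread_mutex_init(&stripe_locks[i], NULL);
}

/* Any operation touching a stripe holds that stripe's lock for the whole
 * operation. Different stripes proceed in parallel; requests to the SAME
 * stripe wait their turn, which is exactly the problem. */
static void stripe_write(unsigned long stripe_no /*, buffers, etc. */)
{
    pthread_mutex_t *lock = &stripe_locks[stripe_no % NUM_STRIPE_LOCKS];

    pthread_mutex_lock(lock);
    /* pre-read old data and parity, compute new parity, write both */
    pthread_mutex_unlock(lock);
}
```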
I couldn't live with this type of design. It did not satisfy the performance tenet of RAID-5 in all cases. Locking is OK (and often required) in failure conditions, but not in the normal case.
So I turned to some of the state machine education I had just received at the University of New Hampshire.
Mechanical Drawing Class
In high school I had taken a drafting class and used those plastic templates that allow you to draw straight lines and circles. I grabbed one of those and used that thing for weeks. I'm not exaggerating when I say that some weeks I was putting in a full 40 hours just drawing circles and lines. As a matter of fact, I estimate I spent about four months on state machine design and spec writing before even typing one keystroke of code.
I'm not going to describe each state. But I will take a moment to describe potential initial states. I'll focus on RAID-5 write requests.
Initial State #1: We're Good
All of the disks in the RAID-5 configuration are healthy. Submit a read-modify-write request to the lower-level software (known as the Device Handler). Wait in State #1.
Initial State #2: This Disk is Dead
The disk that I want to write to is gone. I need to update the parity information. It's a failure case, so issue lock requests to peer disks containing customer data. Wait in State #2.
Initial State #3: My Disk is Rebuilding
If the write operation straddles the rebuilt and non-rebuilt areas of this disk, I can't do a pre-read, so I need to access the data on other disks. Issue lock requests to peer disks containing customer data. Wait in State #3. (Non-straddling requests can be handled in either State #1 or State #2.)
Initial State #4: Another Disk is Rebuilding
Is it a parity disk? A data disk? How far along is the rebuild? Trying to figure all that out and handle each case is complicated. Proceed as though things are fine, which is the "We're Good" initial state. If we bump into a failure we'll deal with it later.
Initial State #5: Two Disks have Failed
Well, RAID-6 hadn't been invented yet. And trying to write customer data in a dual-failure situation could cause data loss when the disks fully recover. Return an error and terminate the state machine.
These five initial states are joined by many other states for parity update, as well as states for performing read operations in healthy, rebuilding, and failed scenarios.
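If it helps to see the shape of it, here's a rough and entirely hypothetical sketch (modern C, invented names, not the CLARiiON source) of how a write request might pick its initial state:

```c
#include <stdbool.h>

/* Hypothetical state names for the five initial write states above. */
typedef enum {
    ST_NORMAL_WRITE,     /* Initial State #1: all disks healthy        */
    ST_DEAD_DISK_WRITE,  /* Initial State #2: target disk is dead      */
    ST_REBUILD_WRITE,    /* Initial State #3: target disk rebuilding,
                            write straddles the rebuild checkpoint     */
    ST_ERROR_EXIT        /* Initial State #5: two disks have failed    */
} r5_state;

struct r5_group {
    int  dead_disks;            /* how many spindles are gone          */
    bool target_dead;           /* is the disk I want to write to dead? */
    bool target_rebuilding;     /* is that disk rebuilding?            */
};

struct r5_request {
    bool straddles_checkpoint;  /* spans rebuilt and non-rebuilt areas? */
};

static r5_state initial_write_state(const struct r5_group *g,
                                    const struct r5_request *req)
{
    if (g->dead_disks >= 2)
        return ST_ERROR_EXIT;        /* State #5: return an error       */

    if (g->target_dead)
        return ST_DEAD_DISK_WRITE;   /* State #2: lock peers, fix parity */

    if (g->target_rebuilding && req->straddles_checkpoint)
        return ST_REBUILD_WRITE;     /* State #3: lock peers for data    */

    /* State #4 (another disk rebuilding) deliberately falls through to
     * the normal case; if we bump into a failure we deal with it later. */
    return ST_NORMAL_WRITE;          /* State #1: read-modify-write      */
}
```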
States need events in order to form a true "state machine". Limiting the number of events limits the size of the state machine, which means fewer code paths to test. We decided to boil down any and all disk failures to three events.
Event #1: Success
The disk operation, whether it was a read, a write, or an atomic read/write operation, was successful.
Event #2: Hard Error
A disk "read" operation failed, but the disk is still functional. Note that disk "write" operations could never return a Hard Error. If the Device Handler could not write to a disk, it shut the disk off. The DH would perform heroic actions (including sector remaps) to try and prevent this scenario, but in the end, it could never return a hard error on a write. This drastically cut down on the number of permutations to test.
Event #3: Disk Dead
The disk is as dead as a doornail. Doesn't matter if it happened on a read or a write. The disk is guaranteed to be dead because the DH shut off power to the disk and it no longer appears on the bus.
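To round out the sketch from earlier: the events fit in a tiny enum, and every state supplies a handler that maps an event to the next state. The transition below is purely illustrative (the state names here are invented for this example, not the real state list):

```c
/* The three (and only three) events the Device Handler reports back. */
typedef enum {
    EV_SUCCESS,     /* the read, write, or atomic read/write worked      */
    EV_HARD_ERROR,  /* a read failed, but the disk is still functional   */
    EV_DISK_DEAD    /* the DH powered the disk off; it's off the bus     */
} r5_event;

/* States for this example only. */
typedef enum {
    WR_NORMAL,           /* read-modify-write on a healthy stripe        */
    WR_PEER_RECONSTRUCT, /* pre-read failed: rebuild old data from peers */
    WR_DEGRADED,         /* target disk is dead: lock peers, fix parity  */
    WR_DONE              /* terminal: the write is complete              */
} wr_state;

/* One illustrative transition: what a healthy write might do with each
 * of the three events. Every other state has a handler of this shape.  */
static wr_state normal_write_event(r5_event ev)
{
    switch (ev) {
    case EV_SUCCESS:    return WR_DONE;
    case EV_HARD_ERROR: return WR_PEER_RECONSTRUCT;
    case EV_DISK_DEAD:  return WR_DEGRADED;
    }
    return WR_DONE;  /* unreachable while the event list stays at three */
}
```

Three events times a bounded set of states is a test matrix you can actually enumerate, which was the whole point.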
I think you get the picture
The foundation of CLARiiON RAID-5 failure handling is a state machine. There are no uncovered states and no unknown events. As I mentioned, this took about four months to design. And less time than that to code. And I was satisfied that this approach would meet our RAID-5 goals of performance and data integrity.
But when I was done writing the code, I didn't want to ship it.
But that's a story for another post.
Steve