The very first "pre-CLARiiON" version of FLARE (CLARiiON's microcode) supported two RAID levels: RAID-1 and RAID-5. Due to the "RAID-5 write penalty" it was clear that we needed to provide an alternative protection scheme that could handle heavy write workloads. RAID-1 was our answer.
We characterized the performance of our RAID-5 solution and advised customers that if their applications performed highly transactional, multi-threaded disk operations at roughly a 70/30 read-to-write ratio, RAID-5 would perform well. For a higher percentage of writes, use mirroring. We had I/O-per-second (IOPS) characterizations that helped customers make the choice.
Well, there was one application that slipped through the cracks, and our RAID-5 performance was so bad that the executives at DG considered putting the FLARE team on the chopping block again. How bad was it?
How about 12 I/Os per second?
In my mind I had grouped customer applications into different categories like "databases", "file systems", or "video streaming". Each one of those applications accomplished a specific task for a specific business need. We had tweaked and tuned the RAID-5 performance to satisfy some of the nuances that were germane to those applications.
Many others at Data General and I had figured that customers just wouldn't run an application with high write-to-read characteristics against a RAID-5 configuration.
We were wrong.
Several customers were using RAID-5 to run an application with a 100-0% write-to-read ratio. What was the application? Restore from Tape. (Doh!)
The Worst Possible Access Pattern
Several customers had backed up their entire RAID-5 LUN to tape, and for one reason or another they had to restore the entire LUN. The particular DG restore tool that they were using performed single threaded sequential writes. Not only were RAID-5 writes slow, but the single-threaded nature of the restore tool negated the advantages of multiple spindles handling multiple requests in parallel.
So the writes came down one at a time. The FLARE code mapped the write to a particular disk drive. A pre-read request was issued to disk. Seek time was not much of an issue (the disk heads were already in the general area). The access speed of the drives (rotational latency and transfer times) amounted to about 20 milliseconds. Follow that up with the actual write operation and you add another 20 milliseconds. So we're up to 40 ms for the initial writing of the data.
Moving on to the update of the RAID-5 parity information, you can guess how long that would take: approximately 40 ms. This ends up being a whopping total of 80 ms for a write operation. Divide that into 1 second, throw in some overhead, and you've got twelve I/Os per second.
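To make that arithmetic concrete, here's a back-of-envelope sketch. The 20 ms per disk access is the figure quoted above; the four-operation breakdown (pre-read old data, write new data, pre-read old parity, write new parity) is the standard RAID-5 small-write sequence, and the code is illustrative arithmetic, not anything from FLARE itself.

```python
# Back-of-envelope estimate of single-threaded RAID-5 small-write throughput.
# The 20 ms per disk access comes from the text above; everything else is
# illustrative, not actual FLARE internals.

DISK_ACCESS_MS = 20.0   # rotational latency + transfer time per disk operation

# A RAID-5 small write is a read-modify-write: pre-read old data, write new
# data, pre-read old parity, write new parity. With a single-threaded restore
# stream, those four operations are effectively serialized.
OPS_PER_HOST_WRITE = 4

latency_ms = OPS_PER_HOST_WRITE * DISK_ACCESS_MS   # 80 ms per host write
iops = 1000.0 / latency_ms                         # ~12.5 before overhead

print(f"{latency_ms:.0f} ms per write -> about {iops:.1f} I/Os per second "
      "before overhead (hence the observed ~12 IOPS)")
```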
So we had unhappy customers (who were probably not pleased about having to perform a restore in the first place) waiting forever for their restore to finish. It also left annoyed Data General VPs wondering why the company was even trying to be in the storage business.
Data Layout Only Goes So Far
There are many tricks you can play with data layout, pre-fetching, and the like. The bottom line, however, is that trying to solve the problem solely at the disk level will only take you so far. We considered the write cache capability on that generation of disk drives, but, truth be told, we wanted complete control over any algorithm that affected data integrity. The plain and simple answer was to build some sort of "FLARE" write cache.
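To sketch what a write cache buys you, here is a purely illustrative toy in Python. The names (WriteCache, write_full_stripe, write_block) are invented for the example and bear no resemblance to the real FLARE code; the point is simply that the host write completes as soon as the data lands in battery-protected memory, and a sequential stream like a tape restore quickly fills whole stripes, which can then be flushed with parity computed in memory and no pre-reads at all.

```python
# Toy write-back cache in front of a RAID-5 backend. All names here are
# invented for illustration; they are not FLARE interfaces.

class WriteCache:
    def __init__(self, backend, stripe_size):
        self.backend = backend            # provides write_full_stripe() / write_block()
        self.stripe_size = stripe_size    # data blocks per RAID-5 stripe
        self.dirty = {}                   # block address -> data (battery-backed memory)

    def write(self, addr, data):
        # The host sees completion here, so the 80 ms read-modify-write is no
        # longer in the host's latency path.
        self.dirty[addr] = data

    def flush(self):
        # Group dirty blocks by stripe. Sequential restore traffic fills whole
        # stripes, which can be written with parity computed in memory -- no
        # pre-reads needed.
        stripes = {}
        for addr, data in self.dirty.items():
            stripes.setdefault(addr // self.stripe_size, {})[addr] = data
        for stripe_no, blocks in stripes.items():
            if len(blocks) == self.stripe_size:
                self.backend.write_full_stripe(stripe_no, blocks)
            else:
                for addr, data in blocks.items():
                    self.backend.write_block(addr, data)   # falls back to read-modify-write
        self.dirty.clear()
```

Of course, the hard part wasn't the data structure; it was keeping that dirty data safe across a power failure, which is where the battery backup unit mentioned below comes in.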
The FLARE write cache ended up being fairly disruptive on two levels:
- It had a pretty big software impact. Large parts of FLARE would need to be either ripped out or heavily modified.
- It had a pretty big hardware impact. We decided to use a battery backup unit (BBU) to protect against power failures. This meant that existing customers wouldn't be able to upgrade to a caching version of FLARE. They would need what was then called a "forklift upgrade".
I plan on posting more detail about building the CLARiiON write cache. There's a lot to write about. For me it was one of the harder efforts I've ever been a part of. I experienced project leadership for the first time. And when the CLARiiON write cache was done, I decided "so was I".
Steve
Dear sir, is it true that the Clariion runs Windows XP embedded?
Posted by: dale | May 28, 2008 at 08:37 AM
Hello Dale,
Yes it is true that today's CLARiiON is running a version of Windows XP embedded. CLARiiON made the transition to a Windows-based implementation in the late 1990s.
Steve
Posted by: Steve Todd | May 28, 2008 at 09:25 AM
Kind thanks, gent. So does this imply that FLARE runs as a Win32 application?
Posted by: dale | May 29, 2008 at 01:34 PM
FLARE currently runs as a driver in the kernel.
Regards,
Steve
Posted by: Steve Todd | May 30, 2008 at 02:41 AM