EMC's Symmetrix V-Max(TM) announcement today reminded me of my first internal briefing for the V-Max project. The Symmetrix team proposed building a data storage system that would massively scale beyond the capabilities of any existing system in any data center.
This meeting happened maybe 3-4 years ago. I saw a presentation on a new hardware architecture. I listened to the proposed changes to the Symmetrix microcode (Enginuity). I saw preliminary schedules that were aggressive (given the substantial changes to the system). Based on the Symmetrix organization's tradition of hitting their deadlines, I felt that their schedule was achievable.
Then they revealed their quality target of equivalent reliability to the currently shipping Symm systems (the DMX line), and I thought to myself: "no way".
For the layperson: the Symmetrix V-Max line targets 5 9s of reliability (that is, 99.999% uptime). A 5 9s system will experience roughly five minutes of downtime per year.
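For anyone who wants to check the "minutes per year" figure, here is a back-of-the-envelope sketch. It assumes that "N 9s" means an availability of 1 - 10^-N and that a year is 365.25 days. The function name is my own, not anything from EMC.

```python
# Rough conversion from "number of nines" to expected downtime per year.
# Assumption: N nines means availability = 1 - 10**-N.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_seconds_per_year(nines: int) -> float:
    """Return the expected yearly downtime, in seconds, for `nines` nines."""
    unavailability = 10.0 ** -nines
    return unavailability * SECONDS_PER_YEAR


if __name__ == "__main__":
    seconds = downtime_seconds_per_year(5)
    print(f"5 nines: about {seconds / 60:.1f} minutes of downtime per year")
```

Each additional 9 divides the downtime by ten, so going from 5 9s to 6 9s moves the budget from minutes to tens of seconds.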
I was skeptical. Most significant for me was the change from a massive, centrally located cache to the proposed fully distributed caching system. In my mind, a proposal of that magnitude could not target 5 9s and still ship on time.
Not only has the Anarchist proved me wrong (he's good at that), but the Symm team finished ahead of schedule.
Which leaves me asking one question: how did they do it?
Three-Legged Quality Stool: Hardware, Software, and Process
One of the first things that the hardware team did was to enumerate all of the hardware components in the system:
- The number of hardware components had to be minimized. Fewer components increase the odds of better reliability. For example, the traditional Symm front-end directors and back-end adapters were consolidated into a single component. This "integrated" director results in a 75% reduction in parts (yet provides the same function at greater performance).
- Each hardware subsystem (such as the set of directors, or the power subsystem) was assigned a hardware availability goal of 7 9s, which translates to subsystem downtime of roughly 3 seconds per year. This allowed the combined HW system to achieve a design target of 6 9s, or about 31 seconds of downtime per year, overall.
- Component redundancy was designed into every aspect of the eventual system. All components were grouped into subsystems which had a minimum of 1+1 redundancy; critical subsystems had greater than 1+1 redundancy. The fabric power system, for example, is 1+3 redundant. It can withstand 3 failures and still operate.
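The way subsystem targets roll up into a system target, and the way redundancy buys extra 9s, can be sketched with textbook availability math. This is only an illustration under stated assumptions: failures are treated as independent, the count of ten subsystems is hypothetical (I don't have the actual number), and the 0.99 per-unit availability is made up for the example. None of this is EMC's actual model.

```python
import math


def series_availability(subsystem_availabilities):
    """The system is up only if every subsystem is up (independence assumed)."""
    return math.prod(subsystem_availabilities)


def redundant_availability(unit_availability, spares):
    """A 1+N group is down only if the primary and all N spares are down."""
    return 1.0 - (1.0 - unit_availability) ** (spares + 1)


if __name__ == "__main__":
    # Hypothetical: ten subsystems, each designed to 7 nines.
    hypothetical_subsystems = [1 - 1e-7] * 10
    system = series_availability(hypothetical_subsystems)
    print(f"system unavailability: {1 - system:.2e}")

    # Illustrative unit availability of 0.99, with 1+1 and 1+3 redundancy.
    for spares in (1, 3):
        group = redundant_availability(0.99, spares)
        print(f"1+{spares} group unavailability: {1 - group:.2e}")
```

Under these assumptions, ten subsystems at 7 9s each land at roughly 6 9s overall, which is consistent with the design targets above. The redundancy function also shows why a 1+3 group tolerates three failures: it goes down only when all four units are down together.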
One of the keys to the quality of the Enginuity software itself was its ability to abstract itself from the significant changes to the hardware componentry. For example, the caching algorithms themselves were unaware that the underlying physical implementation had changed from a centralized to a distributed model. The fact that front-end directors and back-end adapters were now the same component was invisible to Enginuity.
In other words, Enginuity was able to leave many of its field-proven algorithms untouched.
The newer software (e.g. the distributed caching algorithms) introduced an architecture that protected customer data in transit between distributed directors and all the way down to the disks themselves. This new code was implemented as a separately testable module that was mercilessly beaten for months.
One final note about the software: the shift to an x86 architecture required an endian change. Endian modifications forced the Symm engineers to examine every line of code in light of the new architecture (see code reviews below). Not only did byte ordering have to be analyzed, but inter-system SRDF compatibility implied that endianness had to be transparent between systems with different processor architectures.
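To make the endian problem concrete, here is a small illustration of why byte order matters when two systems exchange binary data. It is a generic sketch using Python's standard struct module, not Enginuity or SRDF code. The helper names and the choice of big-endian as the agreed wire order are my assumptions for the example.

```python
import struct

value = 0x12345678

# The same 32-bit integer laid out in the two byte orders.
big_endian_bytes = struct.pack(">I", value)     # most significant byte first
little_endian_bytes = struct.pack("<I", value)  # least significant byte first

# Reading big-endian bytes as if they were little-endian silently
# produces a different number, which is the bug class to be reviewed for.
misread = struct.unpack("<I", big_endian_bytes)[0]


def encode_field(number: int) -> bytes:
    """Hypothetical helper: always write the agreed (big-endian) wire order."""
    return struct.pack(">I", number)


def decode_field(data: bytes) -> int:
    """Hypothetical helper: always read the agreed (big-endian) wire order."""
    return struct.unpack(">I", data)[0]


if __name__ == "__main__":
    print(big_endian_bytes.hex(), little_endian_bytes.hex())
    print(hex(misread))
    print(hex(decode_field(encode_field(value))))
```

If both ends always encode and decode through an explicit byte order, the native endianness of either processor stops mattering, which is one common way to get the kind of transparency described above.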
The V-Max system, with its enormous scale, required new and/or improved quality processes that would result in a level of quality equivalent to Symmetrix systems already in the field:
- 3 phases of failure modes and effects analysis (FMEA) were held during the design cycle. Each phase (architectural, pre-build design, and post-build validation) covered all of the gory details of all possible modes of failure. Each mode required an appropriate and resilient recovery option.
- Extensive modeling and simulation tools were used during all three phases as well. Modeling occurred at the architectural, pre-build design, and physical system levels. Quality and performance results on the actual V-Max system were constantly compared with, and validated against, models generated in previous phases.
- DFX (design for X) reviews (X = manufacturability, maintainability, serviceability, reliability, and availability) began early in development and continued throughout the development process.
- First-pass concepts all underwent an extensive serviceability review, including user repair, cabling, look and feel, and human factors, which resulted in numerous enhancements.
- An advanced development team (separate from the actual developers) was chartered to shake out many of the new technology, portability, and endian issues that would eventually be discovered during development.
- An independent team was formed to focus on quality above and beyond the DMX-level testing. This separate team worked closely with the DMX teams on DMX-level compatibility and quality metrics. Note that both teams had a strong focus on SRDF and management compatibility. Both systems must seamlessly work together at customer sites.
- The software team increased its focus on rigorous code reviews given the significant architectural changes between V-Max and DMX.
It's difficult to sum up years of quality work in a single blog post, but I believe you get the point. 5 9s was considered from the beginning to the end of the project. Critical subsystems received extra protection (e.g. the fabric power system can survive 3 failures, RAID-6 configurations can survive 2 drive failures, etc.).
Which brings me back to the original question: how did Symmetrix create a V-Max system with equivalent reliability to the systems that are in the field? By following the sequence below:
- Assigned the 5 9s requirement to everything.
- Targeted 7 9s for hardware subsystems, and 6 9s for the hardware system as a whole.
- Combined V-Max modeling tools with field reliability data from existing systems to validate the requirements.
- Performed three phases of FMEA to make sure no stone was left unturned.
- Created development and test processes that continually measured compliance with the original objectives and drove them all the way through system test and QA.