I recall writing a 2009 post about private cloud in which I theorized about ceding internal, private-cloud control of virtual machines to an external cloud. At the time I used the phrase "move a VM right out of the building". I view the official launch of VPLEX as the crossing over from theory to practice. "Out of the building" highlights one of the main characteristics of the VPLEX implementation: removing the physical limitations of data.
It's fascinating to have an insider's view into the construction of a new product like VPLEX. The feature set is challenging to implement (to say the least!). And because VPLEX deploys directly into enterprise environments alongside products that already achieve greater than 5-9s availability, it had to be 5-9s or better from its very first release, which in my opinion is more challenging than implementing the features themselves.
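For reference, the "nines" shorthand translates directly into a yearly downtime budget. Here is a minimal sketch of that arithmetic in Python (the function name is mine, purely for illustration):

```python
# Yearly downtime budget implied by an availability target of N nines.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted at N-nines availability."""
    availability = 1 - 10 ** (-nines)       # e.g. 5 nines -> 0.99999
    return MINUTES_PER_YEAR * (1 - availability)

for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} min/year")
```

At five nines the budget works out to roughly 5.26 minutes of downtime per year, which is why "5-9s at release" is such a demanding bar for a brand-new product.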
How did the team of engineers pull this off?
Start With the VMAX
If you are looking for an excellent primer on the methods that EMC uses for building quality into a new product, I would start with the VMAX Quality post from April of '09. That post covered the three key items considered during the multi-year quality push for VMAX: the hardware, the software, and the quality process itself.
VPLEX is no different in this regard; all three areas received the same maniacal level of quality focus (in fact, VPLEX was developed within the same organization as VMAX). It's worth pointing out, however, that the VPLEX software effort was substantially different. In the case of VMAX, much of the focus went into maintaining compatibility with previous versions of Symmetrix while completely revamping the caching algorithms to run in a distributed fashion (on new hardware).
The VPLEX engine itself has two foundational quality elements: the hardware and the kernel that runs on it. The hardware is the same trusted hardware used across many of EMC's 5-9s storage systems (e.g. Symmetrix, CLARiiON), while the software is based on the same Linux kernel used within other 5-9s storage systems (e.g. Centera). These two facts establish a foundational quality baseline for the VPLEX engine; the hardware and operating system implementations are re-used from products with well-established and proven availability characteristics.
Process and Design
It's not easy to summarize all aspects of the multi-year quality process for VPLEX so let me describe (only) three elements which I think are significant:
- As with VMAX, all the gory details of possible failures were exhaustively covered as part of the failure modes and effects analysis (FMEA) process. This process identifies failures that are likely, unlikely, highly unlikely, and even those that will "probably never happen". I highly recommend reading more on FMEA (e.g. navigate to the Wikipedia link above). I think Wikipedia hits the nail right on the head: "Learning from each failure is both costly and time consuming, and FMEA is a more systematic method of studying failure. As such, it is considered better to first conduct some thought experiments."
- Proactive data unavailable/data loss (DU/DL) measurement and analysis occurred throughout the effort, whether it was part of the development process (engineers), the quality process (QA team), or the beta process (customers and field teams). DU/DL numbers, in fact, are one of the primary measurements used to validate EMC's claims of 5-9s on their storage systems. This gathering process was pulled all the way back into the earliest phases of development as a way of increasing the engineering team's focus on quality.
- HA characteristics designed and built into the VPLEX software are a key piece of the quality puzzle. If a data center has four VPLEX engines running a total of eight instances of VPLEX software, up to seven of them can fail and the system will still run. In fact, the VPLEX N+1 clustering software can handle the failure of all seven AT THE SAME TIME. The design also allows the coupling of the clustering with the metro option (active/active access at synchronous distances). This coupling further protects against failures. These two design principles (N+1 clustering and active/active data centers) increase availability and thus are a key driver for achieving 5-9s.
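To see why the N+1 clustering described above is such a strong availability lever, consider a back-of-the-envelope model. This is my own illustrative sketch, not EMC's reliability math, and it assumes instance failures are independent; real-world failures are often correlated, which is exactly why pairing the clustering with the active/active metro option adds further protection:

```python
# Availability of a cluster that keeps running as long as at least one
# of its n instances is up, assuming independent instance failures
# (a simplifying assumption for illustration only).
def cluster_availability(instance_availability: float, n: int) -> float:
    # The cluster is down only if all n instances are down simultaneously.
    p_all_down = (1 - instance_availability) ** n
    return 1 - p_all_down

# Even with a modest 99% per-instance availability, eight instances
# make the probability of a total outage vanishingly small:
print(cluster_availability(0.99, 8))
```

With eight instances at 99% each, the modeled outage probability is (0.01)^8, i.e. one part in 10^16, comfortably beyond the 5-9s bar. The point of the sketch is the exponent: each additional surviving instance multiplies the outage probability down, which is the intuition behind tolerating seven simultaneous failures.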
An Enhanced Beta Process
The final piece of the VPLEX quality story is the way the product went through the beta cycle. VPLEX beta started in Q4 of 2009. As usual, there was a strong interaction between engineering, field support, and VPLEX customers.
The full range of product-related business activities usually comes on line AFTER a product is announced. One of the big changes with the VPLEX beta, however, is that EVERYTHING came on line during the beta. This means that EMC's support organization was already "all hands on deck", the remote support capabilities (known as ESRS) were turned on, customary post-mortems were conducted after failures, etc.
This beta process, which ran for more than half a year and logged over 40,000 hours of runtime, was the final piece of the quality puzzle.
[Note: links to information about EMC's beta implementations with AOL and Melbourne IT].
I've covered a lot at the risk of leaving out even more. The EMC QA team had a very full plate qualifying the large number of heterogeneous storage devices that can run underneath VPLEX. I hope to publish a support matrix of all the underlying arrays soon.
Steve
http://stevetodd.typepad.com
Twitter: @SteveTodd
EMC Intrapreneur
Steve, trying to find that "heterogeneous" matrix myself - details very much non-existent, and your magic support system (why hide it behind a bunch of systems that need you to register and be an EMC customer or employee)
Weren't AOL about YY's only real name customer, how much of the run time was on YY vs plex?
Be much appreciated if you can post the support matrix, as so far it looks like EMC only.
Barry
Posted by: Barry Whyte | May 12, 2010 at 04:15 PM
Hey Barry,
The 40,000 hours were all VPLEX.
The qualification matrix is indeed password protected within Powerlink, but some of the vendors mentioned during the EMC World announce included HDS, HP, 3PAR, IBM, Sun, etc. I can’t list every vendor (and every class/revision of system for that vendor) because it would quickly go stale. Hope this helps.
Posted by: Steve Todd | May 13, 2010 at 09:24 AM
Steve,
Thanks for the info, need to retry with Powerlink, didn't like my gmail email - didn't want EMC spam on my IBM email !
Why the need to hide support info behind this system?
Shouldn't support info be public information - I searched the 13,000 page PDF and not one reference to plex.
Posted by: Barry Whyte | May 14, 2010 at 03:26 PM
Barry - I did log in to Powerlink to view the VPLEX support matrix and the vendors I mention above are all in there, no more no less as of the announce.
Posted by: Steve Todd | May 14, 2010 at 06:03 PM