When EMC bought the mid-range CLARiiON storage system in 1999, there were two main reasons: (a) it was fast (mirrored write cache), and (b) it was rock-solid reliable (RAID5 algorithms).
At the time of the acquisition, the caching algorithms were logically "above" the RAID5 algorithms from an architectural standpoint. These algorithms ran as user-space processes that shared a CPU via a process scheduler.
After the acquisition, the architecture shifted towards a layered driver model.
In both cases, one of the key computer science principles instilled into the algorithms was separation of concerns: the cache algorithms didn't need to know anything about the RAID algorithms, and vice versa.
And that was a good thing. The RAID algorithms were implemented as a state machine that presided over what could be a very finicky, heterogeneous, and unpredictable storage infrastructure. Complex disk, bus, and power failure scenarios led to distributed locking techniques and other heroic attempts to preserve data integrity. All of these algorithms were hidden behind a wall of encapsulation; the caching algorithms had little to no knowledge of these problems.
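To make that wall of encapsulation concrete, here is a minimal C sketch of the kind of boundary the cache layer saw (the header, types, and function names are hypothetical illustrations, not the actual FLARE interfaces): an opaque volume handle and a pair of asynchronous entry points, with every state machine, lock, and failure path hidden behind them.

```c
/* raid_backend.h -- hypothetical sketch of the encapsulation boundary.
 * The cache layer sees only an opaque handle and two entry points; the
 * RAID state machines, distributed locking, and failure handling live
 * entirely behind this interface. */
#ifndef RAID_BACKEND_H
#define RAID_BACKEND_H

#include <stddef.h>
#include <stdint.h>

/* Opaque: the definition is private to the RAID layer. */
typedef struct raid_volume raid_volume_t;

/* Completion callback invoked when the back end finishes the request. */
typedef void (*io_done_fn)(void *ctx, int status);

/* Asynchronous read/write.  The cache neither knows nor cares whether the
 * request triggers a parity update, a rebuild, or a drive failover. */
int raid_read (raid_volume_t *vol, uint64_t lba, void *buf, size_t blocks,
               io_done_fn done, void *ctx);
int raid_write(raid_volume_t *vol, uint64_t lba, const void *buf, size_t blocks,
               io_done_fn done, void *ctx);

#endif /* RAID_BACKEND_H */
```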
As the layered driver architecture evolved, cores became more powerful. Dual-core and quad-core systems had more than enough power to drive the back-end disk subsystems. The RAID5 algorithms could basically affine all the complex locking to core zero and still have plenty of headroom for everything else.
My colleague and MCx architect Steve Morley points out that as the pace of clock speed increase slowed (due to thermal dissipation), the number of cores grew. This motivated an MCx redesign where core-affined system threads could drive IO through the stack (as opposed to a core zero approach). Chad Sakac also discusses this aspect in his recent post.
In support of core affinity, a decision was made to expose a portion of the underpinnings of the RAID subsystem up into the cache layer. With more knowledge about the underlying infrastructure, one decision can be made at the top of the stack about which core will handle the entirety of the IO – from front to back. This enables IO loads to run in parallel and optimizes for the CPU cache footprint.
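A rough sketch of what a single, front-to-back placement decision can look like (purely illustrative; the core count, hashing policy, and queue structure here are assumptions rather than the MCx implementation):

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_CORES 8                      /* assumed core count for illustration */

struct io_request {
    uint32_t lun;                        /* logical unit the host addressed      */
    uint64_t lba;                        /* starting logical block address       */
    uint32_t core;                       /* chosen once, honored by every layer  */
    struct io_request *next;
};

struct per_core_queue {
    struct io_request *head, *tail;      /* the real system would use per-core
                                            private or lock-free structures      */
};

static struct per_core_queue run_queue[NUM_CORES];

/* The one placement decision, made where the I/O enters the array: which core
 * owns this request from the front-end port all the way down to the disk.
 * The policy (spread by LUN and coarse LBA region) is a guess for the sketch. */
static uint32_t choose_core(const struct io_request *io)
{
    return (uint32_t)((io->lun ^ (io->lba >> 10)) % NUM_CORES);
}

void submit_io(struct io_request *io)
{
    io->core = choose_core(io);
    struct per_core_queue *q = &run_queue[io->core];

    io->next = NULL;
    if (q->tail) q->tail->next = io; else q->head = io;
    q->tail = io;
    /* From here on, cache, RAID, and back-end processing for this request
     * all run in the system thread affined to io->core. */
}
```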
MCx has achieved that goal. At a high level, MCx looks like any sensible implementation of a tiered storage system:
The first step in preparation for multi-core scaling was the creation of a software architecture (often referred to as FBE, or FLARE back end) known internally as the VNX "physical package". The diagram for the physical package is displayed below.
This diagram is essentially an object model of the underlying disk infrastructure. The vertices represent physical objects, and the arcs represent the messages that travel between them. Starting seven years ago, this architecture began to make its way into the VNX product line as dark content: software that was present in the system but not exercised.
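In code terms, an object model like this might be sketched as follows (the class names, packet format, and topology are illustrative only, not the actual FBE definitions): each vertex is an object, and a request travels as a packet handed from object to object along the arcs.

```c
#include <stdio.h>

/* Illustrative object classes for a back-end topology. */
enum object_class { CLASS_BOARD, CLASS_PORT, CLASS_ENCLOSURE, CLASS_DRIVE };

/* The "message" that travels along an arc between two vertices. */
struct packet {
    unsigned opcode;
    unsigned long long lba;
    unsigned block_count;
};

/* A vertex in the graph: a physical object with an edge to the next object. */
struct object {
    enum object_class cls;
    const char *name;
    struct object *downstream;
    int (*handle)(struct object *self, struct packet *pkt);
};

/* Default behavior: do local work, then forward the packet down the edge. */
static int forward(struct object *self, struct packet *pkt)
{
    printf("%s: opcode 0x%x, lba %llu, blocks %u\n",
           self->name, pkt->opcode, pkt->lba, pkt->block_count);
    return self->downstream ? self->downstream->handle(self->downstream, pkt) : 0;
}

int main(void)
{
    struct object drive = { CLASS_DRIVE,     "drive_0_3", NULL,   forward };
    struct object encl  = { CLASS_ENCLOSURE, "encl_0",    &drive, forward };
    struct object port  = { CLASS_PORT,      "be_port_0", &encl,  forward };
    struct object board = { CLASS_BOARD,     "sp_a",      &port,  forward };

    struct packet pkt = { 0x28, 4096, 8 };   /* a read-style request */
    return board.handle(&board, &pkt);
}
```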
Why? Because messing with the back-end architecture is messing with the data integrity of the overall system. The physical package evolved (after years of internal testing) to be functionally equivalent to the traditional back-end system (which was eventually retired).
The introduction of the physical package paved the way for the MCx announcement.
One of the next steps in enabling core distribution was separating the cache and RAID layers, which had been logically embedded within one driver. This occurred shortly after the physical package was introduced, and was a step towards the MCR and MCC drivers pictured above.
Perhaps the final, key decision was how to handle the RAID algorithms (state machines) that existed above the physical package.
The simple decision was made to "leave them alone". Why fix what's not broken? These state machines implement the fastest form of RAID: mathematical RAID. When an I/O drops out of the cache and down to the RAID layer, the physical location of the data has already been calculated mathematically. Math is faster than lookup.
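To illustrate "math is faster than lookup", here is a simplified sketch of rotating-parity RAID5 arithmetic (textbook RAID math, not the VNX code): the target drive, the parity drive, and the offset within the drive are all computed directly from the logical block address, with no mapping table in the path.

```c
#include <stdint.h>
#include <stdio.h>

/* Map a logical block address onto a RAID5 group purely with arithmetic.
 * No mapping table, no lookup: the location is a function of the LBA. */
struct raid5_loc {
    uint32_t data_drive;    /* drive holding the data block            */
    uint32_t parity_drive;  /* drive holding parity for this stripe    */
    uint64_t drive_lba;     /* block offset within that drive          */
};

static struct raid5_loc raid5_map(uint64_t lba, uint32_t drives,
                                  uint32_t blocks_per_chunk)
{
    uint32_t data_drives = drives - 1;
    uint64_t chunk  = lba / blocks_per_chunk;           /* logical chunk number    */
    uint64_t stripe = chunk / data_drives;              /* stripe number           */
    uint32_t parity = (uint32_t)(stripe % drives);      /* rotating parity drive   */
    uint32_t idx    = (uint32_t)(chunk % data_drives);  /* data slot in the stripe */

    struct raid5_loc loc;
    loc.parity_drive = parity;
    loc.data_drive   = (idx >= parity) ? idx + 1 : idx; /* skip the parity drive   */
    loc.drive_lba    = stripe * blocks_per_chunk + (lba % blocks_per_chunk);
    return loc;
}

int main(void)
{
    /* Example: a 4+1 RAID5 group with 64 KB (128-block) chunks. */
    struct raid5_loc loc = raid5_map(1000000, 5, 128);
    printf("data on drive %u, parity on drive %u, drive lba %llu\n",
           loc.data_drive, loc.parity_drive, (unsigned long long)loc.drive_lba);
    return 0;
}
```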
The RAID state machines were wrapped in the FBE architecture. The object modeling implemented by the physical package was extended up into the RAID layer, giving birth to a new form of RAID layer known as MCR. This approach is referred to in VNX as the "logical package".
The logical drives at the bottom of the logical package map to the top of the physical package.
Once the logical package wrapped the RAID state machines via the FBE architecture, a directed acyclic graph (DAG) was created from the top (the basic volume driver layer) to the bottom (the board level containing the disk). Each layer of the graph could be accessed via an API (known as the FBE API).
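Per-layer access could be pictured roughly like this (a hypothetical sketch; the names and structures here are not the actual FBE API): every object in the graph, from the volume layer at the top down to the board at the bottom, is registered by id and can be queried through one entry point.

```c
#include <stdio.h>

/* Hypothetical layers of the DAG, top to bottom. */
enum layer { LAYER_VOLUME, LAYER_RAID_GROUP, LAYER_DRIVE, LAYER_BOARD };

struct object_info {
    unsigned    object_id;
    enum layer  layer;
    const char *state;                   /* e.g. "ready", "degraded", "rebuilding" */
};

/* A static registry stands in for the live object graph. */
static const struct object_info registry[] = {
    { 0x100, LAYER_VOLUME,     "ready"      },
    { 0x080, LAYER_RAID_GROUP, "degraded"   },
    { 0x021, LAYER_DRIVE,      "rebuilding" },
    { 0x001, LAYER_BOARD,      "ready"      },
};

/* One call reaches any layer of the graph by object id. */
int get_object_info(unsigned object_id, struct object_info *out)
{
    for (size_t i = 0; i < sizeof registry / sizeof registry[0]; i++) {
        if (registry[i].object_id == object_id) {
            *out = registry[i];
            return 0;
        }
    }
    return -1;                           /* no such object */
}

int main(void)
{
    struct object_info info;
    if (get_object_info(0x080, &info) == 0)
        printf("object 0x%x at layer %d is %s\n",
               info.object_id, (int)info.layer, info.state);
    return 0;
}
```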
My colleague and MCx architect Dan Cummins emphasizes that in addition to removing lock contention, less busy cores can "pull" work from other cores and balance the workload more effectively:
Each per-core system thread implements multiple queues, which define classes of service. The components of the MCx stack leverage per-core resources to drive IO down the stack from within the system thread context and leverage the system queues to herd work to a) leverage the CPU caches and b) classify the importance of the work.
At the top of the stack the interrupts are distributed evenly across the cores, but within the local socket. We want to minimize reaching into remote PCI root complexes to clear interrupts. We also want to localize accesses to the local memory subsystems (NUMA locality). It is here at the top of the stack where we queue the IO to the local core. We do implement multiple IO distribution schemes, including round robin, but the default is a greedy pull model. That is, if a core has work to do it will pull from its local queue; otherwise it will pull from cores that are busy. This model reduces to all cores pulling from their local queue when all front-end ports are loaded – thus minimizing lock contention.
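A minimal sketch of that greedy pull model (simplified: one mutex-protected queue per core and an unlocked depth hint; the per-core class-of-service queues and lock-free structures of the real system are not shown):

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_CORES 4                       /* assumed core count for illustration */

struct work_item { struct work_item *next; /* per-IO context would live here */ };

struct core_queue {
    pthread_mutex_t   lock;
    struct work_item *head;
    unsigned          depth;              /* backlog size: how "busy" this core is */
};

static struct core_queue queues[NUM_CORES];

void queues_init(void)
{
    for (int c = 0; c < NUM_CORES; c++)
        pthread_mutex_init(&queues[c].lock, NULL);
}

static struct work_item *pop(struct core_queue *q)
{
    pthread_mutex_lock(&q->lock);
    struct work_item *w = q->head;
    if (w) { q->head = w->next; q->depth--; }
    pthread_mutex_unlock(&q->lock);
    return w;
}

/* Greedy pull: a core's system thread drains its own queue first and only
 * reaches into the deepest remote backlog when it would otherwise go idle.
 * When every front-end port is loaded, each core finds local work, so the
 * cross-core path (and its lock traffic) is rarely taken. */
struct work_item *pull_work(unsigned self)
{
    struct work_item *w = pop(&queues[self]);
    if (w)
        return w;

    unsigned victim = self, best = 0;
    for (unsigned c = 0; c < NUM_CORES; c++) {
        /* depth is read without the lock: an approximate hint is enough here */
        if (c != self && queues[c].depth > best) {
            best   = queues[c].depth;
            victim = c;
        }
    }
    return (victim == self) ? NULL : pop(&queues[victim]);
}
```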
This post summarizes the major internal changes made to VNX and the highly significant performance increases that resulted. Chad Sakac put it best: "how do you change the engine of a car while barreling down the highway?"
In future posts I will take a look at the quality and testing framework, and also share a bit more detail about the MCF and MCC layers.
Steve
EMC Fellow
Steve, what is the underlying operating system? Linux? FreeBSD?
Posted by: Alex | September 07, 2013 at 03:49 AM
VNX is Embedded Windows, VNXe is Linux
Posted by: Steve Todd | September 07, 2013 at 01:24 PM
Hi Steve,
Thanks for your very informative posts. I'm wondering if you can give some insight into a question I've pondered for some time. Obviously, the leap to Westmere chips in the VNX and now Sandy Bridge in VNX Next Gen has led to a huge increase in the overall performance capabilities of the mid-tier array. That being said, why did it take so long to finally embrace latest-generation server-class CPUs in storage array controllers? Was the performance just not needed back then (pre-flash), was it cost, or some other reason?
Pre-VNX, most CPUs that were released in storage controllers were typically not the highest-speed CPUs available at the time of release, and many were CPUs designed for embedded use cases (Jasper Forest). Even though EMC has now refreshed to Sandy Bridge, many other storage arrays still haven't caught up to the Westmere chips in the Classic VNX.
Posted by: INDStorage | November 07, 2013 at 06:20 PM
Certainly FLASH was a major factor. With previous generations the FLARE operating environment leveraged asymmetric multiprocessing, which was good enough through about 4 cores. In order to leverage FLASH and to scale with the Intel roadmap we needed to change our approach to multiprocessing. We could not justify this investment until the next gen VNX, which leveraged much more than 4 cores.
Posted by: Daniel Cummins | November 21, 2013 at 10:42 AM