In a recent post I described how trace libraries can influence the software architecture for different disk array features (e.g. Green, Thin, etc.).
The latest Symmetrix release is a great example. The high-level software architecture for sub-LUN FAST algorithms has been directly shaped by analyzing billions of I/O operations captured from customer sites and internal systems.
Trace libraries can be pumped through a microcode simulator that emulates the front-end, back-end, and cache of a product like VMAX. The traffic between the cache and the back-end disks can then be analyzed, and the movement of data between storage tiers (e.g. flash, FC, SATA) can be simulated. A primary goal of this analysis is to build a software architecture that moves the right data to the right tier at the right time, and doesn't move it at the wrong time. Another important goal is to try different chunk/extent movement sizes to understand the cost of moving data between tiers.
An additional goal is to make the architecture customer-friendly (e.g. does not require extensive use of knobs and dials to manage the FAST feature).
Finally, any architecture must be shippable by a certain date. Capturing trace libraries takes time. Getting them into the lab takes more time, and analyzing them takes even more. The architecture must allow the evolution of FAST techniques as more insight is gleaned from more workloads.
So what trace-driven insights made their way into the latest FAST release?
Efficient Extent Selection
Trace libraries allow engineers to experiment with different extent sizes and make trade-offs based on system (and LUN) footprint data. The metadata that describes the location (tier) of each FAST extent increases the capacity footprint required to represent a LUN. The system resources (memory, CPU) required to move extents of different sizes impact the overall system footprint as well.
The trace libraries allowed the architecture team to identify an ideal extent size that minimized capacity footprint and minimized strain on system resources. This result meets two other goals as well:
- Ease-of-use: no need for customers to input (or experiment with) different sizes
- Shippable: there was no need to add an API for extent modification (and no need to test all of the different permutations).
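To make the trade-off concrete, here is a rough back-of-the-envelope sketch. It is not how the Symmetrix microcode does its accounting; the 64 bytes of metadata per extent and the candidate extent sizes are numbers I made up purely for illustration. It shows the tension described above: small extents mean more metadata per LUN, while large extents mean more data has to be copied every time a single extent changes tiers.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def extent_tradeoff(lun_bytes, extent_bytes, meta_bytes_per_extent=64):
    """Estimate the footprint of tracking one LUN at a given extent size.

    meta_bytes_per_extent is a hypothetical figure, not a FAST constant.
    """
    # Ceiling division: a partially filled last extent still needs an entry.
    extent_count = -(-lun_bytes // extent_bytes)
    return {
        "extent_count": extent_count,
        "metadata_bytes": extent_count * meta_bytes_per_extent,
        # Moving one extent between tiers copies the whole extent.
        "bytes_per_move": extent_bytes,
    }


# Compare a few made-up candidate sizes for a 1 TiB LUN.
for size in (1 * MIB, 8 * MIB, 64 * MIB, 1 * GIB):
    result = extent_tradeoff(1024 * GIB, size)
    print(size // MIB, "MiB:", result)
```

Replaying a trace library against each candidate size is what turns a table like this into a real decision, because only the traces tell you how often those moves actually happen.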
When's The Right Time?
The trace libraries also shed some light on "when is the right time for data movement between tiers".
In general, when I/O activity is focusing on a repeatable subset of extents, then it is a great time to move those extents to a faster tier (e.g. flash). When the I/O activity is randomly accessing a wide variety of different extents, then tiering activity can be harmful.
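One simple way to get a feel for "focused" versus "scattered" I/O is to ask what fraction of the accesses in a window land on the busiest handful of extents. The sketch below is my own illustration, not the metric the FAST team uses; the function name and the `top_n` parameter are hypothetical.

```python
from collections import Counter


def hot_fraction(extent_ids, top_n):
    """Fraction of accesses that hit the top_n most-accessed extents.

    extent_ids is a sequence of extent identifiers, one per I/O.
    A value near 1.0 means the I/O is concentrated on a few extents;
    a value near top_n / (number of distinct extents) means it is spread evenly.
    """
    counts = Counter(extent_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hot = sum(count for _, count in counts.most_common(top_n))
    return hot / total
```

A high value alone is not enough to justify promotion. The busy extents also have to stay busy from one window to the next, which is what the charts that follow are getting at.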
Consider the set of extents that are actively accessed at midnight (Extent Subset A) versus the set of extents actively accessed at mid-morning, 10:30 AM (Extent Subset B). On the X-Axis below, I've highlighted these points in time over a three-day trace:
Now let's add in a Y-Axis containing the EXACT SAME trace data, and compare the set of HOT extents at time A against the hot extents at every other point in the three-day period. If the extents are exactly the same for the two time periods (a correlation of 1:1), then plot a point using a WHITE dot. If there is zero correlation, use a BLACK dot. If there is SOME correlation, then shade the dot. The legend is explained on the right-hand side.
Note that the extents that are active at midnight (off-hours) have a very low degree of correlation with the extents that are active at other times of the day. Now look at the exact same chart for the 10:30 AM extent subset:
These dashed lines squarely intersect with the "white rectangles". What does it tell us? It tells us that the subset of extents that are active at 10:30 AM are generally active from 9 AM to 5 PM.
Which means that 9 AM to 5 PM is a great time to run FAST, but other times are not.
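For readers who want to try this kind of comparison on their own traces, here is a minimal sketch. I don't know which correlation measure was used to shade the charts, so I've swapped in Jaccard similarity (size of the intersection divided by size of the union) as a stand-in. The trace format, a list of `(timestamp, extent_id)` pairs, and all of the function names are assumptions of mine.

```python
from collections import Counter


def hot_extents(trace, start, end, top_n):
    """Set of the top_n most-accessed extents with start <= timestamp < end.

    trace is a list of (timestamp, extent_id) pairs (an assumed format).
    """
    counts = Counter(ext for ts, ext in trace if start <= ts < end)
    return {ext for ext, _ in counts.most_common(top_n)}


def overlap(a, b):
    """Jaccard similarity: 1.0 for identical sets, 0.0 for disjoint ones."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def overlap_row(trace, ref_start, window, horizon, top_n):
    """Compare the hot set of one reference window against every window.

    Returns one score per window covering [0, horizon), which corresponds
    to one row of dots in a chart like the ones described above.
    """
    ref = hot_extents(trace, ref_start, ref_start + window, top_n)
    return [
        overlap(ref, hot_extents(trace, t, t + window, top_n))
        for t in range(0, horizon, window)
    ]
```

A row that stays close to 1.0 across a long stretch of windows plays the role of a "white rectangle": the same extents are hot for that whole stretch, so promoting them is likely to pay off.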
Activation Periods
This pattern tends to play out over and over again in a variety of different customer traces and workloads. As a result, the concept of an "activation period" was introduced into the Symmetrix microcode. FAST only runs during the activation period; outside of it, the algorithms are idle. The activation period can be adjusted.
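Conceptually, an activation period is just a time-of-day gate in front of the tiering logic. The snippet below is a sketch of that idea only; it is not the Symmetrix implementation, and the function name and the 9 AM to 5 PM defaults are my own assumptions, taken from the example above.

```python
from datetime import datetime, time


def in_activation_period(now, start=time(9, 0), end=time(17, 0)):
    """Return True if `now` falls inside the daily activation period.

    start and end are hypothetical, adjustable defaults. The window is
    half-open: start is included and end is excluded.
    """
    t = now.time()
    if start <= end:
        return start <= t < end
    # The window wraps past midnight, e.g. 22:00 to 06:00.
    return t >= start or t < end


# A tiering loop would check the gate before doing any work.
if in_activation_period(datetime.now()):
    pass  # analyze extent activity and schedule moves here
```

Keeping the window adjustable matters because not every workload peaks during business hours; a nightly batch system might want the wrapped form instead.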
This is a sensible addition to the architecture. Thanks to EMC Fellow John Walton for explaining it to me.
Trace libraries will continue to drive the FAST algorithms. There is only so much that can be added to the code (and tested!) at any one time. Collecting traces is a huge investment, but the investment pays off when the released code hits the field and reacts favorably to familiar-looking I/O streams.
Steve
Twitter: @SteveTodd