I enjoy reading the research papers that come out of the FAST conference. Inevitably a paper will touch some area that is of interest to me. This year was no exception. When I read the "Parity Lost and Parity Regained" abstract, I nearly fell out of my chair. This is exactly the stuff I've been blogging about!
I've done a lot of research in my career, but I haven't had the privilege of presenting at a conference. Now, however, I have a blog, and it allows me to participate in the research. I can become a research assistant.
So this post is about my reaction to the paper, why I think the topic is critical, and modifications that I'd like to propose.
To start with, I'd like to tip my cap.
Nice Job Research Team
The original RAID paper from 1988 offered very little (if any) advice about how to handle the complex and intricate problems that inevitably occur in a RAID implementation. Customers have heard about the performance goals of RAID and about its price advantage over mirroring, but they haven't heard much about data integrity problems such as "torn writes" and "parity pollution".
Is there any more critical area of research for customers than preserving data correctness and minimizing data loss? Not in my opinion. This paper is long overdue.
The beauty of this paper for me is that the researchers modeled protection techniques and storage errors in order to determine error outcomes. The worst error outcome would be "corrupt data", followed closely by "data loss". This model is clearly tackling an issue that customers care about, and I really like the approach.
Foundational Areas Described by the Paper
The paper describes the different "storage errors" that can impact customer data or the parity that protects that data. These include latent sector errors, corruptions, torn writes, lost writes, and misdirected writes.
"Protection techniques" are then enumerated, including scrubbing, checksums, read-verify-after-write, embedded identities, and version mirroring.
And finally, the combinations of "storage errors" and "protection techniques" result in three types of "error outcomes": (1) the data was recovered, (2) the data was lost, and (3) the data was corrupted.
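Since the rest of this post leans on that vocabulary, here is a minimal sketch of the model as I read it, written as plain Python enums. The names are my own shorthand, not the paper's notation.

```python
from enum import Enum, auto

class StorageError(Enum):
    """Faults the paper models as hitting customer data or parity."""
    LATENT_SECTOR_ERROR = auto()
    CORRUPTION = auto()
    TORN_WRITE = auto()
    LOST_WRITE = auto()
    MISDIRECTED_WRITE = auto()

class ProtectionTechnique(Enum):
    """Defenses the paper evaluates against those faults."""
    SCRUBBING = auto()
    CHECKSUM = auto()
    READ_VERIFY_AFTER_WRITE = auto()
    EMBEDDED_IDENTITY = auto()
    VERSION_MIRRORING = auto()

class ErrorOutcome(Enum):
    """What the customer ultimately experiences."""
    DATA_RECOVERED = auto()
    DATA_LOST = auto()
    DATA_CORRUPTED = auto()
```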
Logical Identity
I have great familiarity with many of the errors and protection techniques described by the paper. I bumped into many of them in the late 80s as CLARiiON began to implement RAID. I have less familiarity with a technique from the paper called "logical identity", so I studied hard to understand the basics of the technique.
Logical identity, as I understand it, is a technique used to handle the error cases where a disk write either disappears entirely (a lost write) or lands in the wrong location (a misdirected write). The paper does not dive into great detail about logical identity other than to say it works "in a similar fashion to parental checksums".
Parental Checksums
When writing a "block" of customer data, a checksum of that data can be calculated and written into a "parent block". So every write of customer data is accompanied by a write of its parent block. When data is read, the paper says the parent block is "accessed first". In other words, the parent block holds the checksum of the customer data as it "should" have been written. When the customer data itself is read, its checksum can be recalculated and compared against the parent's copy. If the write was "lost" or "misdirected", the mismatch will be caught, and the protection technique will have avoided returning corrupt data to the customer.
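To make that concrete for myself, here is a deliberately simplified sketch of a write and a verified read using a parent-block checksum. The toy "disk" dictionary, the CRC-32 choice, and the helper names are all my own assumptions, not details from the paper.

```python
import zlib

# Toy "disk": maps a block address to bytes. Parent blocks live at their own
# addresses and hold the checksum of the child data block they describe.
disk = {}

def write_with_parent(data_addr, parent_addr, data: bytes):
    """Write customer data, then record its checksum in the parent block."""
    disk[data_addr] = data
    disk[parent_addr] = zlib.crc32(data).to_bytes(4, "big")

def read_with_parent(data_addr, parent_addr) -> bytes:
    """Read the parent block first, then verify the data against it."""
    expected = int.from_bytes(disk[parent_addr], "big")
    data = disk.get(data_addr, b"")   # a lost write leaves stale (or no) data here
    if zlib.crc32(data) != expected:
        raise IOError("checksum mismatch: lost, torn, or misdirected write detected")
    return data

# Simulate a lost write: the data block never reached the disk, but the
# parent block (with the checksum of the intended data) did.
disk[100] = b"old contents"                                   # stale data on the platter
disk[900] = zlib.crc32(b"new contents").to_bytes(4, "big")    # parent reflects the intended write
try:
    read_with_parent(100, 900)
except IOError as e:
    print(e)   # the stale data is caught instead of being returned to the customer
```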
If I understand correctly, this technique writes additional "meta-data" beyond what is described in the RAID paper. With this technique there are now three different types of data that get stored on the surface of the disk: (1) customer data, (2) parity data, and (3) parental checksums (integrity meta-data).
Modeling Errors and Corruption
The paper then creates a state machine which shows what can happen during read and write operations performed from and to the surface of the disk. For example, if a latent sector error occurs on the customer's data, the model moves to a state called "Disk X LSE". If a torn write happens while writing parity, the model moves to a state called "Parity Error".
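To illustrate the flavor of that state machine (not to reproduce it), here is a toy transition table; the state and event names are my paraphrases of the paper's diagrams, not the paper's own.

```python
# Toy transition table: (current_state, event) -> next_state.
TRANSITIONS = {
    ("CLEAN", "LSE_ON_DATA"):       "DISK_X_LSE",    # latent sector error under customer data
    ("CLEAN", "TORN_WRITE_PARITY"): "PARITY_ERROR",  # torn write while updating parity
    ("DISK_X_LSE", "RECONSTRUCT_FROM_PARITY"): "CLEAN",
    ("PARITY_ERROR", "SCRUB_AND_RECOMPUTE"):   "CLEAN",
}

def step(state: str, event: str) -> str:
    """Advance the model by one event; unknown combinations leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

state = step("CLEAN", "LSE_ON_DATA")            # -> "DISK_X_LSE"
state = step(state, "RECONSTRUCT_FROM_PARITY")  # -> "CLEAN"
print(state)
```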
This is where I got lost.
Aren't There Additional States That Need To Be Added?
It seems to me that parental checksums require that new states be added to the diagrams in the paper. The states represent all the things that could go wrong when reading and writing to disk. So in my mind, two new states need to be added:
- PC-LSE: latent sector error on parental checksum
- PC-X-Error: the data represented by the parental checksum is no longer valid
Additionally, new "storage errors" need to be added to the diagrams (a rough sketch of both sets of additions follows this list):
- F-LOST(PC): the write of the parental checksum was lost
- F-MISDIR(PC): the write of the parental checksum was misdirected
- F-CORRUPT(PC): the storage media containing the parental checksum has become corrupt
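In the same toy notation as the earlier sketch, and purely as my own speculation about how the model might be extended, the additions could look something like this:

```python
# Hypothetical extension of the toy transition table with the parental-checksum
# (PC) states and storage errors proposed above. Names and transitions are my
# guesses, not anything taken from the paper.
PC_STATES = {"PC_LSE", "PC_X_ERROR"}

PC_TRANSITIONS = {
    ("CLEAN", "F_LOST_PC"):    "PC_X_ERROR",  # write of the parental checksum was lost
    ("CLEAN", "F_MISDIR_PC"):  "PC_X_ERROR",  # write of the parental checksum was misdirected
    ("CLEAN", "F_CORRUPT_PC"): "PC_X_ERROR",  # media under the parental checksum corrupted
    ("CLEAN", "LSE_ON_PC"):    "PC_LSE",      # latent sector error under the parental checksum
    # How (or whether) the model recovers from these states depends on how the
    # parental checksum itself is protected, which is exactly what could change
    # the "Probability of Loss or Corruption" figures in Table 3.
}
```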
I'm interested in finding out whether this is indeed the case, because it would change the values in Table 3, which attempts to describe the "Probability of Loss or Corruption".
Time Machine, Please
If only I had been able to read this paper in 1987! Not only would it have given me a peek at what I was in for, but I would have gotten advice on the pros and cons of the various techniques. The simple fact that the researchers created this model makes it much easier to have conversations about these kinds of problems.
This thought brings me to the potential to do more research going forward. In addition to the states already modeled, there are additional protection techniques and error scenarios that affect data integrity. Modeling them is important. I plan to describe them in an upcoming post.
Steve
Another good paper from FAST '08 is "An Analysis of Data Corruption in the Storage Stack" http://www.usenix.org/events/fast08/tech/bairavasundaram.html
They have some very interesting observations on the expected failure modes of FC and SATA drives.
Posted by: Keith Stevenson | March 07, 2008 at 05:11 AM
Reading your blog has been very instructive, thanks.
I am beginning to read the "Parity Lost and Parity Regained" article, and from its first table it seems clear that the only system doing parental checksums is ZFS. If that is the case, I suppose the authors implicitly have not taken parental checksum errors/corruption into account because of how ZFS implements them: all filesystem metadata is duplicated (at least once), on top of any lower-level redundancy (RAID > 0) that happens to be in place (embedded in ZFS). This metadata duplication, the delayed writes, the copy-on-write design (a new checksum cascades up through the directories until it reaches the root metadata, which is written last), and the embedded volume and RAID functions seem able to "guarantee" metadata integrity in any real scenario (tests included), if I haven't misunderstood the information I've been reading about it.
I will finish reading the FAST article tomorrow, since I don't understand how it is possible that the embedded RAID procedure doesn't deal well with torn writes, for example.
Anyone interested in ZFS can take a look at it, perhaps beginning at http://www.opensolaris.org/os/community/zfs/docs/zfs_last.pdf
Certainly ZFS is not so innovative in its error protection, but it is also available to the masses, at least those using OpenSolaris or FreeBSD, and it certainly has other innovations.
Posted by: Andrés Suárez González | September 28, 2008 at 03:27 PM