At EMC World this week new research has been unveiled in the area of tracing data lineage. Analysts, customers, and partners have been viewing a demo at EMC's Innovation Showcase. I described the thinking behind the research in a recent post.
What I didn't talk about was the graphing aspect of data lineage. For a given piece of content, a navigation map that depicts its lineage graphically would be extremely useful. EMC is proposing that the nodes on the graph be implemented using Centera content addresses, and that the relationships between nodes on the map be implemented using Centera metadata (CDFs, or content descriptor files).
Need help visualizing?
A picture tells a thousand words, so here's a screenshot from the demo at EMC World 2008.
The Use Case
In order to understand the diagram it helps to understand the use case that was chosen to highlight the idea. We had a variety to choose from, including scientific, medical, legal, and media copyrighting. We chose an "Earnings Report" use case. The diagram shows the following:
- An earnings report (Report_1) was created using three inputs: Inventory07Q1, Products07Q1, and Sales07Q1. The report was created (via the demo software) using a "tax free" transformational algorithm.
- Once the report is created, the user can view the "map". This map shows that Report_1 was generated using the aforementioned three pieces of data as input. The use of the "tax free" transformational algorithm is described via the "arrows" on the map.
- The "Inventory07Q1" input was recognized to have an incorrect value, leading to an incorrect earnings report. The demo software allows for a manual edit of the inputs.
- Once the inventory is modified, the data lineage diagram shows that an "Inventory07Q1_2" file has been created using a process of "Manual Edit".
- The demo software then allows for the generation of a second earnings report, using the "new" inventory input, and the "original" product and sales input.
- The screenshot above shows the lineage of both the new and old earnings reports.
Behind The Scenes
What's going on behind the scenes of this demo to make it work? That requires some knowledge about how Centera works. One of the key functions of Centera is its ability to link content and metadata together. As a matter of fact, it is impossible to store content on a Centera without sending along metadata describing that content. Both the content and the metadata are referenced via a unique content address.
So let's make a few suppositions. Let's assume the "Inventory07Q1" data has a content address of "ABC", the "Products07Q1" data has a content address of "DEF", and the "Sales07Q1" data has a content address of "GHI". (Note that real Centera content addresses are a lot longer; these are just examples.)
When the first earnings report is generated (Report_1), the results get stored on a Centera. As I just mentioned, Centera expects metadata to accompany the earnings report. The demo software creates metadata that has three interesting characteristics:
- It contains the content address of the earnings report (Report_1)
- It contains lineage information in the form of content addresses "ABC", "DEF", and "GHI"
- It contains lineage information in the form of the transformational algorithm used to generate the earnings report. This "tax free" algorithm can also be stored on Centera (and thus have its own content address).
Modifying the inventory data results in another content address (e.g. "JKL"), a new earnings report (Report_2), and a new graph which points to the correct inputs for Report_2 ("JKL", "DEF", and "GHI").
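To make the idea concrete, here is a minimal sketch of the two reports and their lineage. It is not the demo software and does not use the Centera SDK: a Python dictionary stands in for the store, a SHA-256 hash stands in for a content address, and the names `put`, `store_with_lineage`, and `ancestors` are hypothetical. Each graph node is the address of a small descriptor that lists the content, its inputs, and the process used.

```python
import hashlib
import json

# Hypothetical in-memory stand-in for a content-addressed store.
store = {}

def put(data: bytes) -> str:
    """Store bytes under an address derived from the bytes themselves."""
    address = hashlib.sha256(data).hexdigest()
    store[address] = data
    return address

def store_with_lineage(content: bytes, inputs=(), process=None) -> str:
    """Store content plus a descriptor naming its inputs and the process used.

    Returns the descriptor's address, which serves as the graph node.
    """
    descriptor = {
        "content": put(content),
        "inputs": list(inputs),
        "process": process,
    }
    return put(json.dumps(descriptor, sort_keys=True).encode("utf-8"))

def ancestors(node: str) -> set:
    """Walk the input links and return every node this one derives from."""
    seen = set()
    stack = [node]
    while stack:
        descriptor = json.loads(store[stack.pop()])
        for parent in descriptor["inputs"]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# The three original inputs and the algorithm (placeholder contents).
inventory = store_with_lineage(b"Inventory07Q1 data")
products = store_with_lineage(b"Products07Q1 data")
sales = store_with_lineage(b"Sales07Q1 data")
tax_free = store_with_lineage(b"tax free algorithm")

# Report_1 is generated from the three inputs.
report_1 = store_with_lineage(
    b"Report_1 data", [inventory, products, sales], process=tax_free)

# A manual edit yields a new node that points back to the original inventory.
inventory_2 = store_with_lineage(
    b"Inventory07Q1 data, corrected", [inventory], process="Manual Edit")

# Report_2 uses the new inventory and the original products and sales.
report_2 = store_with_lineage(
    b"Report_2 data", [inventory_2, products, sales], process=tax_free)
```

Calling `ancestors(report_2)` walks back through the corrected inventory to the original one, which is the kind of navigation the map in the demo provides.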
The Benefits of This Approach
My previous post on this subject indicated that creating a bulletproof data lineage application is very difficult. Using a Centera for lineage solves the following problems:
- The earnings reports and the inputs are immutable. Changing any one of them (e.g. the inventory) results in Centera creating a NEW piece of content. This new piece of content can also have metadata that points back to the original.
- The content is also non-deletable because the application can use Centera's retention feature to prevent deletion.
- The nodes in the graph, and the lineage information that allows for graph navigation, are also immutable and non-deletable.
- The "tamper-proof" nature of all of the content and metadata can be authoritatively proven because all Centera content is protected by a content address. When the application views any piece of content it is known to be authentic.
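The immutability and authenticity points above follow from addressing content by its own bytes. The sketch below illustrates the principle only, assuming a SHA-256 hash as a stand-in for a content address; `content_address` and `is_authentic` are hypothetical helpers, not Centera API calls, and the inventory values are made up.

```python
import hashlib

def content_address(data: bytes) -> str:
    """Derive an address from the content itself (SHA-256 as a stand-in)."""
    return hashlib.sha256(data).hexdigest()

def is_authentic(address: str, data: bytes) -> bool:
    """Content is authentic if it still hashes to the address it was stored under."""
    return content_address(data) == address

original = b"Inventory07Q1: widgets=100"
address = content_address(original)

# Any edit produces different bytes, and therefore a different address.
edited = b"Inventory07Q1: widgets=120"
new_address = content_address(edited)
```

Here `is_authentic(address, original)` holds, while the edited bytes no longer match the original address; they can only be stored as a new piece of content under `new_address`.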
What Next?
So what happens beyond the demo? Feedback from customers and partners at EMC World will answer part of that question. The idea will continue to be described over the coming months and EMC will make a determination on whether to continue the research.
I hope to describe some of this feedback in a future post.
Questions on the post, or the technology? Please submit a comment!
Steve
i think data lineage is next gen technology that emc should invest in as the use cases are too numerous to list.
Posted by: DenisG | May 21, 2008 at 09:05 AM