In February of 2015 I wrote about a potential "metadata lake" architecture that could manage and measure data like an asset. At the time Dr. Jim Short and I had just started our joint research on data valuation architectures. In January of this year we held our first data value workshop. The research itself focused on two areas: (1) the emergence of new business processes that were involved with the valuation of data, and (2) the impact that these new business processes may have on IT architectures. In this post I'd like to focus on IT architectures and how a metadata lake is required to sit alongside data assets. For more information on some of the data valuation business processes we've discovered during the research, I recommend scanning the five use cases discovered by Dr. Short during his industry survey.
With regard to an IT architecture that can support valuation use cases, one primary driver of the metadata lake proposal was to create a governance repository that could track the business value of a data asset. Our thinking at the time focused on the following areas:
- The growth of the data insurance market (estimated at more than 7 billion by 2020) required a metadata lake framework.
- The value of data can govern placement, so we proposed governed placement services via a metadata lake.
- The placement of data based on value is dynamic and requires on-the-fly, programmable placement.
- The transformation to a valuation architecture takes time, so we proposed a set of phases: (1) decommissioning, (2) manual placement on a new architecture, and (3) automated valuation frameworks.
All of these proposals were greenfield ideas that could be discussed for possible implementation.
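To make the "programmable placement" idea a bit more concrete, here is a minimal Python sketch of how a value score held in a metadata lake might drive placement decisions. Everything in it is an assumption made for illustration: the tier names, the thresholds, the normalized `business_value` score, and the `AssetMetadata`, `choose_tier`, and `plan_moves` names are hypothetical and do not come from any product or from our research.

```python
from dataclasses import dataclass

# Hypothetical tiers and minimum value thresholds, highest tier first.
# These numbers are illustrative assumptions, not recommendations.
TIERS = [
    (0.75, "flash"),
    (0.40, "disk"),
    (0.00, "archive"),
]


@dataclass
class AssetMetadata:
    name: str
    business_value: float  # assumed normalized score in [0.0, 1.0]
    current_tier: str


def choose_tier(business_value: float) -> str:
    """Return the first tier whose minimum value threshold is met."""
    if not 0.0 <= business_value <= 1.0:
        raise ValueError("business_value must be between 0.0 and 1.0")
    for minimum, tier in TIERS:
        if business_value >= minimum:
            return tier
    return TIERS[-1][1]


def plan_moves(assets):
    """List (name, from_tier, to_tier) for assets whose tier no longer matches their value."""
    moves = []
    for asset in assets:
        target = choose_tier(asset.business_value)
        if target != asset.current_tier:
            moves.append((asset.name, asset.current_tier, target))
    return moves
```

Because value changes over time, a function like `plan_moves` would be rerun whenever the scores in the metadata lake are refreshed, which is the "dynamic" part of the proposal. How the score itself is computed is deliberately left out of this sketch.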
Over one year later, however, at the MIT Chief Data Officer and Information Quality Symposium, Dr. Short and I listened to discussions of real-world valuation frameworks in operation. And while the industry does not specifically use the term "metadata lake", the operations performed by the frameworks are similar.
My colleague Barbara Latulippe (Chief Data Governance Officer) presented on a panel at the conference (and was also interviewed on the Cube). Barbara and her team have successfully built a data governance framework (known as an Information Marketplace) alongside EMC's corporate data lake. This framework has the following characteristics:
- It runs alongside a consolidated data lake and catalogs business and technical metadata for structured and unstructured data
- It offers a portal for data scientists to create innovation spaces for research tasks
- It presents a catalog for data scientists to quickly find the right data assets (65-70% of their time had been spent searching for data; this number will likely be cut in half this year)
- The quality, or trustworthiness, of the data is a visible part of the catalog
- Critical data fields within each data set can be mapped to business KPIs and a data steward
- The quality of key fields can be mapped to positive or negative economic outcomes (dollars)
- Analytic models can be mapped to the data sources that created them and reused across other innovation workspaces
- Analytic models can be inserted into the solution and become searchable, findable, and valued data assets within the data lake
- Customer-facing or internal data is labeled as such
- Data products and services, such as myService360, are being delivered directly from the solution to EMC customers
- Data is treated as an asset, much like the product assets produced by EMC; it travels through stages of maturity and eventually brings value to the business
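As a rough illustration of the kind of catalog record these characteristics imply, here is a small Python sketch. It is not the Information Marketplace's actual schema: the class and field names (`CatalogEntry`, `CriticalField`, `dollars_at_stake`, and so on) are hypothetical, and the linear "value at risk" formula is a simplifying assumption of mine rather than a valuation method described at the conference.

```python
from dataclasses import dataclass, field


@dataclass
class CriticalField:
    name: str
    kpi: str                 # business KPI this field feeds
    steward: str             # data steward responsible for the field
    quality: float           # assumed quality score in [0.0, 1.0]
    dollars_at_stake: float  # hypothetical economic exposure tied to this field


@dataclass
class CatalogEntry:
    asset: str
    kind: str      # "dataset" or "model"
    audience: str  # "customer-facing" or "internal"
    tags: set
    critical_fields: list = field(default_factory=list)
    derived_from: list = field(default_factory=list)  # lineage, e.g. a model's source data

    def trust_score(self):
        """Average quality of the critical fields, or None if there are none."""
        if not self.critical_fields:
            return None
        total = sum(f.quality for f in self.critical_fields)
        return total / len(self.critical_fields)

    def value_at_risk(self):
        """Assumed linear model: poorer quality exposes more of the dollars at stake."""
        return sum((1.0 - f.quality) * f.dollars_at_stake for f in self.critical_fields)


def search(catalog, tag, min_trust=0.0):
    """Find entries by tag; unscored entries match only when no minimum trust is set."""
    results = []
    for entry in catalog:
        if tag not in entry.tags:
            continue
        score = entry.trust_score()
        if score is None:
            if min_trust > 0.0:
                continue
        elif score < min_trust:
            continue
        results.append(entry)
    return results
```

In this sketch, datasets and analytic models share the same record type, so a model registered with `derived_from` pointing at its source data becomes searchable in the same way as any other asset.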
The architecture of this implementation is fairly elegant; I plan to blog about it soon. The research that Dr. Short has been doing, coupled with some of the valuation approaches created by Gartner's Doug Laney, can be integrated into the data governance framework highlighted above.
In future posts I will highlight the architecture and specifically discuss how data assets are valued.
Steve
EMC Fellow