I've been looking at data valuation in the context of a data insurance use case. In my last post I mentioned that as data architects go through the process of moving siloed data sets into a Business Data Lake, they should simultaneously build a governance structure known as a Metadata Lake.
I like to think of a Metadata Lake as an infrastructure construct that enables partitioning of an overall Data Lake. Indeed, for the data insurance use case, the Metadata Lake becomes a one-stop shop for an insurer to subsequently audit data sets that are insured through premium payments.
One of the key proof points that a data insurer would be looking for is data set compliance with agreed-upon data protection levels. If the insured cannot conclusively prove that the data set is (or was) protected according to the data insurance policy, then the agreement may well be null and void.
My theory is that Metadata Lakes, while clearly useful for auditing insured data sets, can also enable a much larger ecosystem that supports data valuation and other advanced IT capabilities.
At a minimum, a data insurance auditor would want visibility into (a) the insured data set(s), (b) the agreed-upon data protection policy for each data set, and (c) the actual level of data protection currently employed for each data set. To visualize this, please refer to the graphic below.
The applications covered by the insurance policy need to be registered in the lake as a form of application metadata, and the policies themselves should be stored and associated with those applications. At the same time, the location of each data set, and the current data protection level at each location, should be gathered from the infrastructure and time-stamped as an auditable proof point of compliance.
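To make this a bit more concrete, here is a minimal sketch of what one such auditable entry in a Metadata Lake might look like. The field names (application, required protection, observed protection, time-stamp) are purely illustrative choices of mine, not a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProtectionPolicy:
    """Agreed-upon protection terms written into the data insurance policy."""
    policy_id: str
    required_protection: str      # e.g. "replicated + encrypted-at-rest"
    backup_rpo_hours: int         # recovery point objective promised to the insurer

@dataclass
class DataSetLocation:
    """Where a copy of the insured data set physically lives right now."""
    storage_system: str
    path: str
    observed_protection: str      # what the infrastructure actually reports

@dataclass
class AuditRecord:
    """Time-stamped proof point: policy vs. observed protection for one data set."""
    application: str              # the registered application that owns the data set
    data_set: str
    policy: ProtectionPolicy
    locations: list
    captured_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def compliant(self) -> bool:
        # Compliant only if every location provides the protection the policy requires.
        return all(loc.observed_protection == self.policy.required_protection
                   for loc in self.locations)

# Example: one auditable entry gathered from the infrastructure and time-stamped.
record = AuditRecord(
    application="claims-processing",
    data_set="policyholder_claims_2015",
    policy=ProtectionPolicy("INS-0042", "replicated + encrypted-at-rest", 24),
    locations=[DataSetLocation("storage-cluster-03", "/lake/claims/2015",
                               "replicated + encrypted-at-rest")],
)
print(record.compliant(), record.captured_at.isoformat())
```

The point is simply that the agreed-upon policy, the observed protection level, and a time-stamp live together as one record an insurer can audit.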
If your company is considering data insurance, the Metadata Lake construct will be mandatory. This brings us back to the first step in the data underwriting process:
A Metadata Lake is a form of Data Audit and Inventory System (DAIS). If a data insurer discovers during their initial visit that no such structure exists, then some form of it will have to be created.
Here is some advice for Data Architects who are in the midst of migrating siloed data sets to a Data Lake: build a Metadata Lake at the same time. While doing so, a Data Architect should pay careful attention to the mapping of each data set to the underlying trusted (data-protected) infrastructure (or at least think about how it could be segmented in the future).
I'd welcome any thoughts on this topic from Data Architects who are creating Data Lake ecosystems at this time.
Steve
EMC Fellow
Hi, Steve! Metadata management for Lake architecture is very top-of-mind for me and the rest of the EMC² Global Services Transformation work-group.
Our conceptualization is for a Data Catalog, which would describe every individual [published and public] Data Asset in the Lake. That little word 'in' can be a bit misleading, since the Data Lake is a federated architecture and the actual bytes may be physically or logically resident elsewhere. In that case, the Data Catalog entry represents a link or "external" table to the federated Data Asset.
But the brilliant part is that the consumer does not have to care about any of that - the Lake is location- and source-technology agnostic.
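To give a flavor of what we've been drawing up, a Catalog entry might look something like the sketch below; the field names are illustrative placeholders rather than our actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CatalogEntry:
    """One published Data Asset as it appears in the Data Catalog."""
    asset_name: str
    owner: str
    description: str
    # If the bytes live outside the Lake, record where; the consumer never sees this.
    external_location: Optional[str] = None   # e.g. a JDBC URL or object-store URI

    @property
    def is_federated(self) -> bool:
        # Federated assets are catalogued as links ("external" tables) rather than copies.
        return self.external_location is not None

# A local asset and a federated one look identical to the consumer browsing the Catalog.
local = CatalogEntry("customer_360", "marketing", "Merged customer profile data")
remote = CatalogEntry("pos_transactions", "retail-ops",
                      "Point-of-sale feed, resident in the source warehouse",
                      external_location="jdbc:postgresql://warehouse/pos")
print(local.is_federated, remote.is_federated)
```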
Getting to the Catalog entries - you've raised an interesting use case for leveraging metadata at the Data Asset level. But we believe that the first use case most customers will want is Data Provisioning, i.e. giving an analyst or data scientist access to a piece of information so that they can create value with it.
There is a lot more I think we could talk about - let's connect off-blog and I'd love to hear your thoughts on what we've drawn up and prototyped so far. Cheers!
Posted by: Scott Lee | February 27, 2015 at 10:29 AM