In recent posts I've started to take a look at the process of data valuation and the people, processes, and infrastructure that might be involved.
The discussion of data valuation is still at an early stage. To frame the dialogue, I suggested in a recent post that the data insurance underwriting process may be an excellent use case for discussing valuation, and I posted the table below to stimulate discussion of data insurance people and processes.
In this post I'd like to focus on the first row of the table: data set identification. Consider a corporation that wishes to insure one or more data sets because they have concluded that each data set is "valuable" (how they arrived at that conclusion is the topic of a future post). During the underwriting process an insurer would need to understand the current level of risk associated with the data (e.g. how well protected is it?).
In order to assess that risk, the insurer must first engage with the right people within the organization and ask them a simple question:
"Where's the data?"
The exact physical location of individual data sets is usually not something that an organization would discuss with an outside party. However, the data insurance use case makes it clear that an insurer would specifically require that level of granularity.
At EMC, we insure our data centers against situations like floods, power outages, and earthquakes. The current process is highly involved: an insurer must tour our physical locations, interview data center staff, study the IT architecture schematics, evaluate disaster recovery scenarios, and so on. Only after months of building an understanding of the overall data center protection strategy can insurance premiums and policies be generated. An activity like this makes a strong case for creating an overall data center audit and inventory system (DCAIS) to accelerate the process the next time through.
Data sets are more fluid. So when an insurer comes in and asks "Where's the data?", they don't want to take a tour, read schematics, and talk to operators. Ideally, they want an auditable data set dashboard that can be leveraged in exactly the same way on every subsequent audit.
Let's call this dashboard a data audit and inventory system (DAIS). In future posts I will return to the DAIS and further underscore its importance.
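To make the idea concrete, here is a minimal sketch of what one auditable DAIS entry might capture. The field names and values are hypothetical illustrations, not an actual EMC schema or product:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DataSetRecord:
    """One auditable entry in a hypothetical DAIS dashboard."""
    dataset_id: str
    owner: str            # accountable business unit or person
    locations: list[str]  # physical/logical locations of all copies
    last_audited: date    # when this entry was last verified
    protection: str       # e.g. "replicated", "encrypted-at-rest"

# The insurer's "Where's the data?" becomes a simple lookup
# instead of a tour, schematics, and interviews.
inventory = [
    DataSetRecord("cust-orders", "Sales BU",
                  ["dc-east/array-7", "dc-west/array-2"],
                  date(2014, 6, 1), "replicated"),
]
print([r.locations for r in inventory if r.dataset_id == "cust-orders"])
```

The point is not the data structure itself but that every field is queryable and carries an audit trail, so a repeat underwriting visit can start from the dashboard rather than from scratch.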
Needless to say, most companies do not have a DAIS. But a growing set of employees is becoming more and more involved in data set discovery, identification, and migration.
These employees are often referred to as Data Architects. While there is certainly ambiguity in the job description of a Data Architect, I have observed some common behaviors. These architects:
- Are often working in the context of a business unit.
- Are aware of the applications that are used by their business unit.
- Scour the globe trying to identify the location and trajectory of all data sets related to those applications.
- Plan the migration of those data sets into a common repository (e.g. a Business Data Lake).
In general a data architect would also be quite concerned with the semantics of the data (the content), and perhaps the condition and/or quality of the data, but the primary task at hand is the actual creation of the data lake and movement of the data sets. The data architect role for this phase would be highly IT-oriented (not to mention extremely political).
My advice to a data architect in this position is to take a longer view. Data architects should not only be participating in the creation of large data lakes; they should also be driving the creation of smaller metadata lakes.
Metadata lakes contain queryable and auditable bread crumbs for the data insurance use case mentioned above. They also have the potential to enable a larger data valuation framework that can lead to business advantage (consider, for example, if each business unit Data Architect reported up into a Chief Data Officer). A metadata lake would also be a foundational element in an overall Data Audit and Inventory System (DAIS).
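As an illustration, the "bread crumbs" could be as simple as queryable event records tracing where a data set has been over time. The event schema and names below are a hypothetical sketch under that assumption, not a reference design:

```python
# Hypothetical metadata-lake bread crumbs: one event per data set movement.
# The data itself never enters this store -- only metadata about it.
crumbs = [
    {"dataset": "cust-orders", "event": "created",
     "location": "app-db-3", "ts": "2013-01-10"},
    {"dataset": "cust-orders", "event": "migrated",
     "location": "business-lake-1", "ts": "2014-03-22"},
    {"dataset": "web-logs", "event": "created",
     "location": "app-db-9", "ts": "2013-05-02"},
]

def where_is(dataset: str) -> str:
    """Answer the insurer's question from metadata alone:
    the most recent known location of a data set."""
    events = [c for c in crumbs if c["dataset"] == dataset]
    return max(events, key=lambda c: c["ts"])["location"]

print(where_is("cust-orders"))
```

Because the lake stores only small event records, it stays lightweight enough for a business unit to maintain, while still answering audit questions on demand.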
In upcoming posts I will highlight metadata lakes in more detail.
Thanks to my EMC Ireland colleague Denis Canty for collaborating with me on this article.
Steve
EMC Fellow