This series of articles describes an analytic lifecycle being used to gain insight into the innovation and research practices of a multi-national corporation (EMC). After creating an analytic plan in Phase 1, a previous post described the Data Preparation phase of the lifecycle. This phase involves the creation of an enormous sandbox (e.g. ten times the size of a data warehouse configuration). Data scientists and engineers are encouraged to extract data from many sources and load it into the sandbox unchanged. This approach may seem a bit revolutionary (most processes transform the data first). This lifecycle, however, is geared towards the data scientist. The possession of the raw data allows for more robust analysis. The diagram below provides an overview of this approach.
As I mentioned in the last post, there are two types of data that will allow a data scientist to analyze innovation. The first type, depicted on the left, is a structured SQL database containing thousands of innovation ideas submitted by employees over a five-year period. The second type consists of minutes and notes from global innovation and research activities. This content is highly unstructured.
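To make the "load it unchanged" idea concrete, here is a minimal Python sketch. It is not the tooling the team actually used: the SQLite database standing in for the sandbox, the `raw_documents` table, and the function names are all hypothetical, chosen only for illustration. The point it shows is that each document's bytes are stored exactly as extracted, with no cleaning or parsing on the way in.

```python
import sqlite3
from pathlib import Path


def create_sandbox(conn: sqlite3.Connection) -> None:
    """Create a landing table for raw documents (hypothetical schema)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS raw_documents (
            source  TEXT NOT NULL,  -- where the document came from
            content BLOB NOT NULL   -- the bytes exactly as extracted
        )
        """
    )


def load_unchanged(conn: sqlite3.Connection, source: str, content: bytes) -> None:
    """Store one document as-is: no decoding, trimming, or parsing."""
    conn.execute(
        "INSERT INTO raw_documents (source, content) VALUES (?, ?)",
        (source, content),
    )


def load_directory(conn: sqlite3.Connection, directory: Path) -> int:
    """Load every file under a directory unchanged; return the file count."""
    count = 0
    for path in sorted(directory.rglob("*")):
        if path.is_file():
            load_unchanged(conn, str(path), path.read_bytes())
            count += 1
    conn.commit()
    return count


if __name__ == "__main__":
    sandbox = sqlite3.connect(":memory:")  # stand-in for the real sandbox
    create_sandbox(sandbox)
    load_unchanged(sandbox, "notes/example.txt", b"  Meeting notes:\r\n\tIdea #1  ")
    print(sandbox.execute("SELECT COUNT(*) FROM raw_documents").fetchone()[0])
```

Because nothing is altered at load time, any cleanup or text analysis can be tried later, and retried differently, against the same raw copy.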
It’s worth taking a moment to discuss how the global team of users and data scientists came up with relevant innovation activities (depicted below). EMC’s product line consists of high-tech products and services that have been introduced into the marketplace. Tracing the lineage of these products and services usually leads back to an idea that originated long ago during a specific activity. The team came up with a candidate list of activities that are often associated with innovation and research:
Visiting universities, creating publications, attending conferences, visiting customers and partners, holding internal knowledge sessions, holding idea contests, and creating intellectual property are all activities commonly associated with innovation. An effort was therefore made to gather six months' worth of data on these activities from sources worldwide.
Once the team has resisted the urge to transform these documents and has loaded them as-is, the next logical step in Phase 2 is to explore the data. I will step through an example of this process in my next post.
Steve
Twitter: @SteveTodd
Director, EMC Innovation Network