This series of posts describes a cradle-to-grave data analytics project using the lifecycle taught in the Data Science and Big Data Analytics course created by EMC. The steps of the lifecycle are being observed by a business user (myself) who is trying to gain insight into the innovation culture at EMC via a large amount of innovation and research data from around the world. The insight gained as part of this lifecycle should allow us to operationalize new plans and increase the pace of innovation.
As discussed in a previous post, Phase 3 is all about trying out analytical models and continuing to explore collected data. In many cases, Phase 3 can be exited because all of the required data is present and of high quality, and the selected analytical model appears to be promising. My team of global data scientists has had the good fortune of experiencing this situation.
What happens, however, when the data is incomplete, or the selected analytic model does not look promising? This has also happened to our team and it is well worth telling the story.
As we considered the list of hypotheses for our project we focused on #7:
Hypothesis #7: Incubation Lineage and Asset Generation
I believe that the path that knowledge takes, from a local innovator, to a corporate boundary spanner, to an implementation team, to a delivered asset, can be traced and measured. I also believe that this measurement, once studied, can reveal ways to accelerate innovation and point out areas of knowledge that are yet to be converted. I've long been a fan of provenance, and I love the concept of "idea lineage". The lineage can be studied to reduce asset delivery time.
IH7a: Frequent knowledge expansion and transfer events reduce the amount of time it takes to generate a corporate asset from an idea.
IH7b: Lineage maps can reveal when knowledge expansion and transfer did not (or has not) result(ed) in a corporate asset.
When data scientists look at a hypothesis, some potential analytic models come to mind. My colleague Dave Dietrich proposed an approach for Hypothesis #7:
We could in theory apply text mining techniques to address the concept of idea lineage. That is, perhaps you could parse the ideas and descriptions, and then classify them (e.g. using a Topic Modeling approach). Run an automated classification algorithm, such as naïve bayes, to parse and classify certain kinds of ideas. Then create an outcome, such as patent or no-patent, publication or no-publication, new product or no product. That is, you could identify the right outcomes and see if there is a relationship between clusters of certain types of text with discrete outcomes that represent innovation.
Dave's suggestion would use a Naïve Bayes model, and would appear to go a long way toward proving the second hypothesis.
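To make the idea concrete, here is a minimal sketch of a Naïve Bayes text classifier in plain Python. It is not the model our team built: the training texts, the "patent" / "no-patent" labels, and the function names are all invented for illustration, and it skips the topic modeling step Dave mentions. It only shows the mechanics of relating the words in an idea description to a discrete outcome.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and split on whitespace; real text mining would need more care.
    return text.lower().split()

def train(labeled_docs):
    """labeled_docs: list of (text, outcome) pairs. Returns a model dict."""
    doc_counts = Counter()               # number of documents per outcome
    word_counts = defaultdict(Counter)   # word frequencies per outcome
    vocab = set()
    for text, outcome in labeled_docs:
        doc_counts[outcome] += 1
        for word in tokenize(text):
            word_counts[outcome][word] += 1
            vocab.add(word)
    return {"doc_counts": doc_counts, "word_counts": word_counts, "vocab": vocab}

def classify(model, text):
    """Return the outcome with the highest log posterior (Laplace smoothing)."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_outcome, best_score = None, -math.inf
    for outcome, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][outcome]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in tokenize(text):
            if word not in model["vocab"]:
                continue  # ignore words never seen in training
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_outcome, best_score = outcome, score
    return best_outcome

# Invented examples, not data from the Analytic Sandbox.
training = [
    ("novel storage deduplication algorithm prototype", "patent"),
    ("new caching algorithm for flash storage", "patent"),
    ("team lunch discussion about office seating", "no-patent"),
    ("meeting minutes about travel budget", "no-patent"),
]
model = train(training)
print(classify(model, "prototype of a storage algorithm"))
```

With real data, the interesting part would be choosing the outcome labels and checking whether the classifier does better than chance on ideas it has not seen.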
The first hypothesis, however, has a strong focus on elapsed time. During our discussion on analytic models and potential visualizations, Data Scientist Dong Xiang from EMC Labs China decided to do some simulated Model-3 data exploration using an impressive JavaScript visualization tool called d3.js. Using this tool he presented me with a time-lapse view of different phases of an idea.
I liked this approach so much that I commissioned Dong to try it against the Phase 3 data in the Analytic Sandbox. Tracking the progress of an idea and visualizing when it crosses thresholds would bring a time dimension into our study that would be useful for proving our hypothesis.
The sandbox contains a set of unstructured ideas, reports, minutes, and notes about global innovation and research activities. Unfortunately, Dong and the team found out the hard way that the data did not provide us with a good way to visualize the transition of an idea to new phases. EMC internally uses a variant of the Technology Readiness Level (TRL) approach for tracking phases, but the data found in the sandbox did not contain TRL levels. Further searching throughout EMC confirmed that this type of data was nowhere to be found.
Our ability to prove Hypothesis #7 was in jeopardy. This realization was not the end of the world. In data scientist terms, it was time to begin a longitudinal study (making a series of observations over a long period of time). The team began to design a method whereby TRL levels would be gathered and recorded as a regular part of the reporting and gathering of global innovation activities. Over time, we would eventually have enough data to take a good, hard look at our hypothesis.
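As a rough picture of what recording TRL levels over time could look like, the sketch below stores one observation per report and measures how long an idea took to reach a given level. Everything in it is an assumption made for illustration: the `TrlObservation` record, its field names, the `days_to_reach` helper, and the sample dates are hypothetical and do not describe the method our team actually designed.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class TrlObservation:
    idea_id: str       # hypothetical identifier for an idea
    observed_on: date  # date of the report that recorded the level
    trl: int           # readiness level recorded in that report

def days_to_reach(observations, idea_id, target_trl):
    """Days from an idea's first observation until it is first seen at or
    above target_trl. Returns None if the idea has no observations or has
    not reached the target yet."""
    history = sorted(
        (o for o in observations if o.idea_id == idea_id),
        key=lambda o: o.observed_on,
    )
    if not history:
        return None
    start = history[0].observed_on
    for obs in history:
        if obs.trl >= target_trl:
            return (obs.observed_on - start).days
    return None

# Invented observations for two ideas.
log = [
    TrlObservation("idea-A", date(2012, 1, 10), 1),
    TrlObservation("idea-A", date(2012, 3, 10), 3),
    TrlObservation("idea-A", date(2012, 7, 10), 5),
    TrlObservation("idea-B", date(2012, 2, 1), 1),
    TrlObservation("idea-B", date(2012, 6, 1), 2),
]
print(days_to_reach(log, "idea-A", 5))
print(days_to_reach(log, "idea-B", 5))
```

The point of the append-only log is that each regular report adds rows without rewriting history, which is what a longitudinal study needs in order to measure elapsed time later.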
Our longitudinal study would involve the following:
- Establish goal criteria. For our case, what would be the end goal of a successful idea that has traversed the entire journey?
- Identify the right milestones to achieve this goal.
- Trace how people move ideas from each milestone towards the goal.
- Once this is done, trace ideas that die, and trace others that reach the goal. Compare the journeys of ideas that make it and ideas that don't.
- Compare the times and the outcomes using a few different methods (depending on how the data is collected and assembled). These could be as simple as t-tests, or perhaps different types of classification algorithms.
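For the last step, a t-test on elapsed times is the simplest of the options mentioned. The sketch below computes Welch's version of the t statistic (which does not assume the two groups have equal variance) using only the Python standard library. The choice of Welch's variant, the `welch_t` helper, and the day counts are my own illustrative assumptions; no such data exists yet.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples,
    using the Welch-Satterthwaite approximation for the degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances (n - 1)
    se_a, se_b = va / na, vb / nb                    # squared standard errors
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df

# Invented numbers: days spent between two milestones.
reached_goal = [30, 42, 35, 38, 40]   # ideas that made it to the goal
died = [60, 75, 58, 90, 66]           # ideas that stalled along the way
t, df = welch_t(reached_goal, died)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Turning the statistic into a p-value requires the t distribution, which the standard library does not provide; in practice a statistics package such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`) would do the whole calculation.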
A longitudinal study has a similar motto to the Data Analytic Lifecycle: plan everything thoroughly up front!
This post described a hypothesis that fell flat in Phase 3. My previous post described a hypothesis that moved forward into Phase 4 because the model seemed right.
With our analytic plan refined, the team moved to Phase 4. I will introduce this phase in my next post.
Steve
Twitter: @SteveTodd
Director, EMC Innovation Network