This series of posts describes the efforts of a global team of data scientists who are attempting to measure innovation at a large multinational corporation. Their approach is taken from the Data Science and Big Data Analytics course created by their corporation (EMC).
After spending a good amount of time in Phase 1 (Discovery) and Phase 2 (Data Prep) of the Data Analytics Lifecycle, the data scientists enter Phase 3 (Model Planning) once they conclude that the data in their analytic sandbox is of sufficient quality. In Phase 2, they improved the quality of the data through various data cleaning and conditioning techniques.
As I learned in the course (via David Dietrich):
“Phase 3 represents the last step of preparations before executing the analytical models and, as such, requires you to be thorough in planning the analytical work and experiments in the next phase.”
In Phase 3 the data scientists move closer to the algorithms that they will use to prove or disprove the hypotheses generated as part of the Analytic Plan. The hypotheses frame the analytics that will be executed in Phase 4. Choosing the right methods to validate the hypotheses means that the team needs to consider some of the following conditions:
- The structure of the data will dictate what tools and analytic techniques can be used in Phase 4. Is textual data being analyzed? If so, then maybe Sentiment Analysis using Hadoop is the right approach. Does the sandbox contain structured financial data? Perhaps regression via the R analytics platform is the right method to use.
- The analytical technique that is chosen must map back to the business objectives. The objectives are met when the working hypotheses are proved or disproved. This condition clearly highlights why the generation of an Analytic Plan is so important.
- The team must determine whether the situation warrants a series of tests or only one. If a series of techniques must be used as part of a larger analytic workflow, then the team may benefit from an analytic workflow tool such as Alpine Miner.
Some people may be tempted to jump directly to Phase 4 after loading, exploring, and conditioning the data in Phase 2. However, there is more exploring that needs to be done, and this phase of exploration is subtly different.
In Phase 2, the data exploration was mainly about data hygiene and quality.
In Phase 3, additional data exploration should focus on relationships between variables. These relationships will help to further understand the problem domain. The unbiased view of the data scientist is extremely valuable in this phase. Stakeholders (e.g. business users) bring their gut feelings and pre-defined hunches to the problem. Data scientists can translate these hunches into actual correlations between inputs and outcomes. They identify candidate predictors and outcomes, all within the framework of the business problem.
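To make the idea of turning hunches into correlations a little more concrete, here is a minimal sketch in Python. It is not the team's actual code. The column names and values are hypothetical, made up purely for illustration, and Pearson correlation is just one simple way to screen candidate predictors against an outcome.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as no linear relationship
    return cov / sqrt(var_x * var_y)


def rank_predictors(columns, outcome):
    """Return (name, r) pairs for every non-outcome column, strongest |r| first."""
    target = columns[outcome]
    scores = [
        (name, pearson(values, target))
        for name, values in columns.items()
        if name != outcome
    ]
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical sandbox extract: names and numbers are invented for this example.
sandbox = {
    "ideas_submitted": [1, 2, 3, 4, 5],
    "cross_site_contacts": [2, 1, 4, 3, 5],
    "years_at_company": [5, 3, 4, 1, 2],
    "patents_filed": [2, 4, 6, 8, 10],  # the outcome we care about
}

ranking = rank_predictors(sandbox, "patents_filed")
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

A ranking like this only suggests which relationships deserve a closer look in Phase 4. A strong correlation is not evidence of cause, and a stakeholder's hunch that scores poorly here may still matter in a non-linear way.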
Our experience in Phase 3 has been valuable. As part of the analytic plan, we had theorized that the following analytic techniques would be useful (described more fully in a previous post):
- Use Map/Reduce …
- Natural language processing (NLP) …
- Several other techniques would be appropriate:
  - Clustering (e.g. k-means clustering) …
  - Classification …
  - Regression analysis …
  - Graph theory (e.g. Social Network Analysis) …
In Phase 3, the data science team began applying some of these models to the sandbox, and the results were mixed.
These results will be described more fully in future posts.
Steve
Twitter: @SteveTodd
Director, EMC Innovation Network