There is a massive amount of innovation and research data distributed globally among the 50,000+ employees at my corporation (EMC). In this series of blog posts I've been theorizing that analytics can help my Innovation Team unlock key insights from that data and accelerate innovation world-wide. The problem is large, but I had the good fortune to attend the first offering of my company's Data Science and Big Data Analytics course, which describes a logical set of steps for testing business theories as part of a Data Analytics Lifecycle. I've been following these steps, and after a number of posts describing the critical first step (the generation of an analytic plan), it is a good time to move on to Step 2: Data Prep.
As the arrows on the diagram indicate, these steps are iterative. Proceeding to Phase 2 is often a matter of whether or not you are comfortable sharing the analytic plan with your peers. If so, the data preparation phase can begin.
The analytic plan helps the data scientist identify the business problem, a set of hypotheses, the data set, and a preliminary plan for the algorithms that can prove or disprove those hypotheses. Once the analytic plan has been delivered and socialized, the next step is all about the data, and in particular about conditioning it.
The data must be in the right shape, structure, and quality to enable the subsequent analysis.
Building an Analytic Sandbox
In my last post I mentioned that the data set in question falls into two categories: (a) a production "idea submission" server (essentially a large-scale database containing structured data), and (b) a globally-distributed set of unstructured documents representing knowledge expansion within the corporation in the form of minutes and notes about innovation/research activities.
These data sets cannot be analyzed in their current production formats. The data may also be of insufficient quality, and it is likely inconsistent. Together, these possibilities mean that a separate analytic sandbox must be created to run experiments. Industry practice suggests that this sandbox should be roughly ten times the size of the data in question (e.g. the current size of your enterprise data warehouse). Keep these things in mind when creating the sandbox:
- You are going to need strong bandwidth and network connections to your sandbox.
- Collect as much data as you can, including summary data, structured/unstructured, raw data feeds, call logs, web logs, etc. This is why the sandbox needs to be large.
- Determine the type of transformations you will need to assess data quality and derive statistically useful measures.
- Transform the data after it is in the sandbox (ELT: Extract, Load, Transform, as opposed to ETL). This lets analysts choose to (a) transform the data or (b) use it in its raw form. It's worth pointing out that this is the opposite of best practice for some Data Warehousing use cases: while ETL is the widely accepted approach there, the sandbox approach prefers ELT.
- Acquire the right set of tools for the transformation. Good examples would be Hadoop for analysis, Alpine Miner for creating analytic workflows, and R for many general purpose transformations.
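To make the ELT point concrete, here is a minimal sketch (not our actual pipeline) of loading raw records into a sandbox table untouched and layering a transformation on top afterward. The table, field names, and sample rows are hypothetical, and SQLite stands in for whatever database backs the sandbox:

```python
# Minimal ELT sketch: load raw rows into a sandbox table first,
# then transform on demand so the raw form is always preserved.
import sqlite3

# Hypothetical raw export from an idea-submission server.
raw_rows = [
    ("idea-001", "  Flash Caching Layer ", "2011-03-02"),
    ("idea-002", "green data center", None),  # missing date survives the load
]

sandbox = sqlite3.connect(":memory:")
sandbox.execute("CREATE TABLE raw_ideas (id TEXT, title TEXT, submitted TEXT)")
sandbox.executemany("INSERT INTO raw_ideas VALUES (?, ?, ?)", raw_rows)  # Load, no Transform

# Analysts can now choose: query raw_ideas directly, or derive a cleaned view.
sandbox.execute("""
    CREATE VIEW clean_ideas AS
    SELECT id, TRIM(LOWER(title)) AS title, submitted
    FROM raw_ideas
""")
print(sandbox.execute("SELECT title FROM clean_ideas").fetchall())
```

Because the transformation lives in a view rather than in a load script, both the raw and the conditioned forms of the data remain available, which is exactly the flexibility the ELT ordering buys you.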
Sandbox creation typically requires assistance from IT, a DBA, or the person who controls the enterprise data warehouse.
Once the sandbox is created, there are three key activities that allow a data scientist to conclude whether or not the data is "good enough".
- Familiarize yourself with the data thoroughly. List out all the data sources and determine whether key data is available or more information is needed. This can be done by referring back to the analytic plan to determine if you have what's needed, or if more data must be loaded into the sandbox.
- Perform data conditioning. Clean and normalize the data. During this process discern what to keep versus what to discard.
- Survey & Visualize the data. Create overviews, zoom and filter, get details, and begin to create descriptive statistics and evaluate data quality.
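A minimal sketch of the conditioning and survey activities above, using entirely hypothetical idea-submission records and field names: normalize fields, discard duplicates and records missing key data, then compute simple descriptive statistics as a first quality check.

```python
# Sketch of data conditioning followed by a first descriptive-statistics pass.
# The records and field names are invented for illustration.
import statistics

raw_records = [
    {"idea_id": "001", "votes": "14", "region": " EMEA "},
    {"idea_id": "002", "votes": "7",  "region": "apj"},
    {"idea_id": "003", "votes": None, "region": "AMER"},   # missing vote count
    {"idea_id": "001", "votes": "14", "region": " EMEA "}, # duplicate submission
]

# Condition: normalize fields, decide what to keep versus discard.
seen, clean = set(), []
for rec in raw_records:
    if rec["idea_id"] in seen or rec["votes"] is None:
        continue  # discard duplicates and records missing key data
    seen.add(rec["idea_id"])
    clean.append({"idea_id": rec["idea_id"],
                  "votes": int(rec["votes"]),
                  "region": rec["region"].strip().upper()})

# Survey: descriptive statistics on the conditioned data.
votes = [r["votes"] for r in clean]
print(len(clean), statistics.mean(votes), statistics.median(votes))
```

In practice this is where tools like R or Alpine Miner come in; the point of the sketch is only the shape of the work: condition first, then summarize to judge whether the data is "good enough".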
I learned in the course that this part of the process is expected to take at least 50% of the time spent on the entire data analytics lifecycle (and 80% is not uncommon)! Indeed, as our team went through this process for innovation analytics, we had to resolve quite a few issues before we could begin implementing a model.
I will describe these issues in the next post.
Thanks again to David Dietrich for his research on the data lifecycle and ongoing support throughout this series of blog posts.
Director, EMC Innovation Network