I am blogging my way through the Data Analytics Lifecycle as taught in EMC's Data Science and Big Data Analytics course. I am running a Data Analytics project that employs a team of volunteer data scientists from around the world, and I have communicated an analytic plan to them (along with a set of hypotheses). The entry into Phase 2 of the process (typically the longest phase) has resulted in preparing the data, loading it (without transform), and exploring it.
At this point it is worth mentioning a quote that I heard during the course:
If you do not have data of sufficient quality or cannot get good data, you will not be able to perform the subsequent steps in the lifecycle process.
How Clean is the Data?
One of the data scientists on my team was evaluating a tool called Tableau, which can be used for data exploration (among other things). They began to use the tool and explore the data that had been previously loaded into the analytics sandbox. They sent me the following screenshot (I zoomed in and circled my name):
I am showing up twice in the database because some entries have a space before my first name. This is a classic problem (and not always easy to fix). Addressing this problem within the sandbox is clearly a much easier proposition than doing so in the production database. However, it could take a long time to get it right (another reason why Phase 2 takes so long).
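To make the leading-space problem concrete, here is a minimal Python sketch of the kind of cleanup involved. It is not the tooling our team used; the sample rows and the function names (`normalize_name`, `count_contributors`) are hypothetical, and real name cleansing usually needs more than whitespace handling.

```python
from collections import Counter


def normalize_name(raw: str) -> str:
    """Trim leading/trailing whitespace and collapse internal runs of whitespace."""
    return " ".join(raw.split())


def count_contributors(names):
    """Count entries per normalized name."""
    return Counter(normalize_name(n) for n in names)


# Hypothetical sandbox rows: the second entry has a leading space.
rows = ["Todd,Steve", " Todd,Steve", "Doe,Jane"]

raw_counts = Counter(rows)
clean_counts = count_contributors(rows)

print(len(raw_counts), "distinct names before cleaning")
print(len(clean_counts), "distinct names after cleaning")
```

Running this against a copy of the data in the sandbox, rather than against production, keeps the experiment cheap to repeat when the first attempt misses a case.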
Who typically does this work? Is it a database admin (DBA)? A data engineer? Both typically play a role in Phase 2. The "Database Administrator" provisions and configures the database environment to support the analytical needs of the working team. The "Data Engineer" tends to have deep technical skills to assist with tuning SQL queries for data management and extraction. They also support data ingest to the analytic sandbox. These people can be one and the same, but many times the data engineer is an expert on queries and data manipulation (and not necessarily analytics as such). The DBA may be good at this too, but many times they may simply be someone who is primarily skilled at setting up and deploying a large database schema, or product, or stack.
In addition to misspelled names, the data scientists exploring the data are starting to discover that some of the data they need to prove the hypotheses is missing. For example, consider one of the hypotheses generated in Phase 1:
H5: Knowledge transfer activity can identify research-specific boundary spanners in disparate regions.
The association of boundary spanners to the geographic location where they work requires that any names found in the sandbox (e.g. "Todd,Steve") have an associated location (e.g. Hopkinton, MA).
Our data scientists found that this data was nowhere to be found within the sandbox.
In addition to DBAs and Data Engineers, IT often plays a large role in Phase 2. For our project, once the names were "cleansed", we had to bring in IT resources to help generate geographic associations via our employee database. In our particular case we were fortunate: not only did IT grant our access request, but the IT resource had Data Engineering skills and cleansed the data for us! In general, bringing in additional data from the IT realm is no easy task. Gaining access to these types of assets is typically a very tough, time-consuming part of Phase 2.
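The association step itself can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup table `employee_locations`, the sample names, and the function `attach_locations` are all hypothetical stand-ins for whatever the employee database actually provides.

```python
def attach_locations(names, location_by_name):
    """Split names into those with a known location and those without one."""
    matched = {}
    unmatched = set()
    for name in names:
        location = location_by_name.get(name)
        if location is None:
            unmatched.add(name)
        else:
            matched[name] = location
    return matched, sorted(unmatched)


# Hypothetical cleansed names from the sandbox.
sandbox_names = ["Todd,Steve", "Doe,Jane", "Todd,Steve"]

# Hypothetical extract from an employee database.
employee_locations = {"Todd,Steve": "Hopkinton, MA"}

matched, missing = attach_locations(sandbox_names, employee_locations)
print("matched:", matched)
print("still missing a location:", missing)
```

The list of unmatched names is the useful part: it shows how much of the hypothesis can be tested with the data in hand, and what still has to be requested.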
I could write paragraph upon paragraph describing issues that we've come across (and solved) for Phase 2. It may be more useful, however, to summarize some of the lecture material that describes common problems:
- Consistency of data types (e.g. confirm that all numeric fields contain numeric values)
- Data feeds can often change over time (e.g. someone removes a column without telling anyone)
- Fields that contain calculations (e.g. interest charges) may change over time (if interest rates change over time)
- What are the legal ranges of data and are there any values that are out of bounds?
- Is the data standardized/normalized? If so, what is the scale?
- Are geospatial data sets consistent (e.g. metric versus English units, two-letter state abbreviations versus full names)?
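A few of the checks in this list (numeric consistency, legal ranges, and columns that silently disappear from a feed) can be automated. The following Python sketch shows one possible shape for such checks; the column names, the 0 to 100 range, and the sample rows are assumptions made up for illustration and do not come from the course material.

```python
def is_numeric(value) -> bool:
    """Return True if the value can be parsed as a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def check_rows(rows, expected_columns, numeric_column, low, high):
    """Return (row_index, problem) pairs for rows that fail a basic check."""
    problems = []
    for i, row in enumerate(rows):
        # A feed that dropped a column shows up here.
        missing = expected_columns - set(row)
        if missing:
            problems.append((i, f"missing columns: {sorted(missing)}"))
            continue
        value = row[numeric_column]
        if not is_numeric(value):
            problems.append((i, f"non-numeric {numeric_column}: {value!r}"))
        elif not low <= float(value) <= high:
            problems.append((i, f"{numeric_column} out of range: {value}"))
    return problems


# Hypothetical rows, as they might arrive from a data feed.
rows = [
    {"name": "Todd,Steve", "interest_rate": "4.5"},
    {"name": "Doe,Jane", "interest_rate": "n/a"},
    {"name": "Roe,Sam", "interest_rate": "140"},
    {"name": "Poe,Ann"},
]

problems = check_rows(rows, {"name", "interest_rate"}, "interest_rate", 0.0, 100.0)
for index, problem in problems:
    print(index, problem)
```

Checks like these do not clean anything by themselves, but they give the team a quick count of how dirty each field is, which helps when deciding how much cleansing is enough.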
During this phase the data scientist may discern what to keep and what to discard. They have probably already formed an opinion about which model they will use, and data exploration and cleansing will either validate their assumptions or cause them to select a different model. Data cleansing is a big job, so the objective should be to determine "what is enough?". What is clean enough data? What is sufficient quality for the operating context? What will properly enable the analysis? These questions put boundaries around the data cleaning, which is quite intensive.
Phase 3 is about model planning. How does a team know when it is ready to leave Phase 2 and move on to Phase 3 (keep in mind that a return to Phase 2 is highly likely!)?
In general, Phase 3 begins when the data quality is "good enough" to start building the model. In my case, once we had cleaned up erroneous names and associated the names with geographies, the team had sufficient reason to enter Phase 3.
I will relate our team experience with Phase 3 in future posts.
Director, EMC Innovation Network