Earlier this year I participated in a 5-day Data Science and Big Data Analytics course. I was hoping that by the end of the week I would have a greater sense of how to grow the skill sets that would turn me into a more effective data scientist.
An interesting thing happened on day 1 of the course: I became distracted. The instructor (David Dietrich) put up a chart depicting the Data Science Lifecycle (shown below), and I was so drawn into the diagram that I realized the role of data scientist was not for me.
I attended the course because I knew that Data Science would become intrinsically linked to my role as Director of Global Innovation at EMC. The course has been wildly popular inside the company (1,200 employees have taken it). Hundreds of EMC partners and customers have also signed up, and the curriculum is now available at various universities. So far, over 100 faculty from 20+ countries have ordered the readiness training.
For my own data science project, I was collecting innovation data from EMC's R&D locations around the world. I had structured data. I had unstructured data. But I lacked the general knowledge of which analytical techniques would provide me the best insight into EMC's innovation ecosystem. Would it be clustering? Or would regression make the most sense?
Eventually I was taught the basics of clustering and regression algorithms, and I was also taught the circumstances in which these algorithms were most commonly used. Clearly these techniques would assist me in my quest to gain innovation insight.
However, the Lifecycle chart informed me that traditional data science projects have many moving parts and a variety of important roles; the Data Scientist alone could never provide the insight that I was looking for.
In fact, the course defined my role as a Project Sponsor, which is quite distinct from a Data Scientist. Here are the definitions for both:
- Project Sponsor: Responsible for the genesis of the project; provides the impetus for the project and the core business problem, generally provides the funding, and gauges the degree of value from the final outputs of the working team.
- Data Scientist: Provides subject matter expertise for analytical techniques and data modeling, applies valid analytical techniques to given business problems, and ensures overall analytical objectives are met.
Indeed, as the months moved forward and I began implementing the six phases of the Analytic Lifecycle, I found myself building a team that contained not only data scientists but various other roles as well:
- Data Engineers: Bring deep technical skills to assist with tuning SQL queries for data management and extraction, and support data ingest to the analytic sandbox (this included IT personnel on the East Coast and EMC Labs researchers in China).
- Database Administrators: Provision and configure the database environment to support the analytical needs of the working team (my co-workers at the CTO Office Lab in Santa Clara).
- Business User: Someone who benefits from the end results and can consult and advise the project team on the value of the end results and how these will be operationalized (the CTO of EMC, who runs the Innovation programs).
The fact that I was drawn into the "business end" of building a data science team for innovation is actually consistent with industry trends. While it is estimated (in the United States) that employers will be short some 140,000-190,000 data scientists in the years to come, there will be an even bigger gap of some 1.5 million "data savvy professionals". These professionals need to know how to build teams, engage with executives, and guide the data scientists, engineers, DBAs, and business intelligence analysts.
In hindsight, I would have valued a dedicated course on the topic of managing data science projects.
Fortunately, I am in luck: within the next several months EMC Education Services will release a new course on this very topic. The course will be geared toward teaching data science to business leaders. I will attend a pre-release version of the course and will be sure to share my thoughts in advance of the official release.
Director, EMC Innovation Network