Several weeks ago I attended an announcement at MIT CSAIL regarding the new bigdata@CSAIL initiative. Massachusetts Governor Deval Patrick was there, along with MIT President Susan Hockfield. Several others from EMC and I listened in as the Governor announced EMC's participation as a founding member of bigdata@CSAIL and an ongoing contributor to the Massachusetts Green High Performance Computing Center.
Big data workloads are important to both initiatives.
In fact, Multi-Tenant Big Data Workloads (MTBDWs) are gaining quite a bit of attention in academia. How can shared, massive data sets be most effectively (and securely) analyzed by multiple tenants in a cloud environment?
Consider the Sloan Digital Sky Survey (billed on its website as "the largest map in the history of the world"). Images from the heavens are fed into a data processing pipeline, resulting in a massive amount of raw data, processed images, and metadata. A variety of scientists can collaborate on the processing of this data, given the publicly accessible interfaces to a directory tree and forms for querying coordinates.
Providing this type of functionality as a service causes me to ask a few questions:
- How would this type of collaboration work in a commercialized service provider environment?
- How does a service provider enable multi-tenant access to a common data set in a cloud (e.g. using DevPay to access Amazon Public Data sets)?
- How can tenant isolation be enforced, or in the case of collaborating tenants, be disabled?
- Are the tenants, and/or the data services, all located within VMs?
- Can virtualized tenants and/or data services adhere to specific SLAs?
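To make the isolation question a little more concrete, here is a minimal sketch of how I picture it: every tenant can read the common data set, derived results stay private by default, and a collaboration grant relaxes isolation between two specific tenants. Everything here (`SharedCatalog`, `collaborate`, the tenant names, the sample values) is hypothetical and made up for illustration; it is not how any particular service provider or research system works.

```python
from dataclasses import dataclass, field


@dataclass
class SharedCatalog:
    """Hypothetical catalog: one common data set plus per-tenant private results."""
    public: dict = field(default_factory=dict)    # readable by every tenant
    private: dict = field(default_factory=dict)   # tenant -> {key: value}
    grants: set = field(default_factory=set)      # (owner, reader) pairs

    def read_public(self, key):
        # The common data set is visible to all tenants.
        return self.public[key]

    def write(self, tenant, key, value):
        # Derived results land only in the writer's own namespace.
        self.private.setdefault(tenant, {})[key] = value

    def collaborate(self, owner, reader):
        # Relax isolation in one direction: reader may see owner's results.
        self.grants.add((owner, reader))

    def read(self, tenant, owner, key):
        # Isolation is the default; an explicit grant is the only way around it.
        if owner != tenant and (owner, tenant) not in self.grants:
            raise PermissionError(f"{tenant} may not read {owner}'s data")
        return self.private[owner][key]


# Usage with made-up tenants and values.
catalog = SharedCatalog(public={"sky_tile_42": "raw image bytes"})
catalog.write("lab_a", "galaxy_count", 1200)
catalog.collaborate("lab_a", "lab_b")
print(catalog.read("lab_b", "lab_a", "galaxy_count"))
```

In this sketch a third tenant asking for `lab_a`'s result would get a `PermissionError`. A real service would also have to enforce this at the storage and compute layers, which is exactly where the VM and SLA questions come in.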
The research community is beginning to explore these questions. Early results were presented at last year's SIGMOD conference in Athens. MIT researchers studied physical consolidation of database workloads onto fewer machines. In their paper "Workload Aware Database Monitoring and Consolidation", the authors stated the following about virtualizing databases (section 9, Conclusion):
"Additionally, we show that existing virtual machine technologies are not nearly as effective as our techniques at consolidating database workloads."
The paper goes on to list some of the problems (section 7.4) encountered when virtualizing databases (as opposed to physical consolidation):
- Redundant operations (log writes, log reclamations)
- Over-allocation of RAM
- Excessive context switching
- Less code-sharing between workloads
How, then, can these issues be addressed to realize performance gains while still allowing tenant isolation, tenant collaboration, or both for analytic workloads (otherwise known as multi-tenant workload management)?
With a new architecture, of course!
As part of its University R&D program, EMC is collaborating not only with CSAIL but also with the University of Washington (which has data sets of its own). Conducted in partnership with University of Washington Professors Magda Balazinska and Bill Howe, the research will attempt to define an architecture that can address the following questions:
- Should tenants always be placed in their own virtual machines?
- Are there benefits to tenant sharing of parallel data processing engines?
- For overlapping data sets between tenants, how should resources be allocated between different tenants and engines?
- What service level agreements make sense for these new scenarios?
- How is elasticity implemented in these new scenarios?
It's exciting to be at the forefront of these discussions, and I'm looking forward to sharing results as the research progresses.
Steve
Twitter: @SteveTodd
Director, EMC Innovation Network