One of the more interesting ways to discuss Big Data is from the research angle. Chuck pointed out several weeks ago that analytics is not the only theme. As he said: "There's another side to big data that is more about dealing with big files vs. big databases".
I have been paying close attention to the expansion of Big Data research knowledge across EMC. The two uber-themes (analytics and massive files/databases) can be broken down into specific research topics. Some topics are being investigated across our global locations. Other topics are on the to-do list. The following list represents different research areas for Big Data (note the strong emphasis on performance):
- Ingest and Export Speeds. Getting the fish from the ocean to the plate as fast as possible is a critical research area. University researchers are still working on this problem nearly a quarter century after RAID.
- Information Storage. Should it be encrypted, compressed, deduplicated, or some combination thereof?
- Application-aware Data Layout. While distributed scale-out architectures and recovery mechanisms are fairly mature, the performance implications of data layout can vary from application to application. This is closely related to workflow (see below).
- Tiering and Mobility. The movement of big data between tiers for performance improvement extends up to the server flash level. The volume of data causes this research topic to overlap with the ingest/export theme.
- Use cases/Workflows. Understanding the applications that need access to Big Data becomes critical for a number of reasons: performance and configuration management are just two of them. Common workflows include:
- Collaboration. Not all of the scientists working on a data set are necessarily in the same room or facility. The ability to annotate Big Data is an extremely helpful feature, and can result in loosely coupled Big Data / Small Objects. I'm a big fan of the Greenplum Chorus approach, but it needs to extend beyond the enterprise model. Which leads to...
- Geographic Mobility. Big Data captured into one data center may not necessarily stay there. Distributed scientists may want to bring the algorithms to the data and/or vice-versa.
- Visualization. This may be one of the more fun areas of research to watch. Creative visualization may yield more influence than any other research topic.
- Analytics. As a computer scientist I always loved saying "BubbleSort". Analysis of algorithms lives on.
- IT infrastructures. The intersection of Big Data and Cloud. Storing data of this size and magnitude will impact all five cloud characteristics.
- Preservation. If Big Data must age like fine wine, then the methods and processes used by digital curators can be applied to the field of Big Data.
- Provenance. Where did all this Big Data come from?
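To make the Information Storage question above a little more concrete, here is a minimal Python sketch of one possible combination: deduplicate first, then compress. It assumes fixed-size chunks, SHA-256 fingerprints, and zlib compression. The names `store_chunks` and `restore` are hypothetical, chosen for illustration only, and do not describe how any EMC product works.

```python
import hashlib
import zlib


def store_chunks(data: bytes, chunk_size: int = 4096):
    """Deduplicate fixed-size chunks, then compress each unique chunk.

    Returns (store, recipe): store maps a SHA-256 digest to the compressed
    chunk, and recipe lists the digests needed to rebuild the input in order.
    Fixed-size chunking is an assumption made to keep the sketch short.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            # Only chunks not seen before are compressed and stored.
            store[digest] = zlib.compress(chunk)
        recipe.append(digest)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original bytes by decompressing chunks in recipe order."""
    return b"".join(zlib.decompress(store[digest]) for digest in recipe)


if __name__ == "__main__":
    sample = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
    store, recipe = store_chunks(sample)
    assert restore(store, recipe) == sample
    print(len(recipe), "chunks referenced,", len(store), "unique chunks stored")
```

The ordering is the interesting part of the design. If the data were encrypted first with a randomized scheme, identical chunks would generally no longer produce identical bytes, so deduplication would find little to remove, and ciphertext also tends to compress poorly. That is one reason "some combination thereof" is a research question rather than a checkbox.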
Oftentimes EMC acquires a company to enter an emerging market.
In Brazil, EMC has invested in a new research facility to expand research coverage of Big Data. This initiative will clearly result in knowledge expansion in the area of Oil and Gas workflows, but the research will also certainly spill over (no pun intended) into some of these other areas as well.
I've been involved with provenance research in the past and will continue to post updates as we move the research forward in many of these areas.