I enjoyed reading Edd Dumbill's Data Lake Dream article in Forbes. For me, the very first paragraph contained some illuminating statements about the value of a data lake.
"Once in the door, Hadoop tends to become a center of gravity. This effect is amplified by the appeal of big data being not just about the data size, but the agility it brings to an organization."
According to Dumbill, one reason for Hadoop's agility is...
"The old way of data warehousing involves a priori data cleaning and validation: selecting what was important and structuring it before storing it."
So how are some customers beginning to build out a Data Lake infrastructure that runs underneath new forms of analytic applications? Many are creating a central, HDFS-based destination for their data, but they are also carefully architecting in-memory data grids on top of it using products like GemFire XD, which provides seamless read/write capability on top of HDFS at in-memory speed. The diagram below shows a variety of application VMs balanced upon a simplified Data Lake architecture.
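To make this concrete, here is a minimal sketch, in Java over GemFire XD's JDBC interface, of what an HDFS-backed table might look like. The host names, port, store name, and schema are all hypothetical, and the HDFSSTORE DDL follows my reading of the GemFire XD documentation, so treat the details as approximate rather than authoritative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DataLakeSetup {
    public static void main(String[] args) throws Exception {
        // Connect through the GemFire XD thin-client JDBC driver
        // (com.pivotal.gemfirexd.jdbc.ClientDriver on the classpath).
        // Locator host and port are placeholders.
        Connection conn = DriverManager.getConnection(
                "jdbc:gemfirexd://locator-host:1527/");

        try (Statement stmt = conn.createStatement()) {
            // Point the grid at the HDFS cluster underneath it.
            // Namenode URL and home directory are assumptions.
            stmt.execute(
                "CREATE HDFSSTORE sensorStore " +
                "NAMENODE 'hdfs://namenode-host:8020' " +
                "HOMEDIR '/gemfirexd/sensorStore'");

            // An HDFS-backed, partitioned table: recent "hot" rows are
            // served from memory while the full history persists to HDFS.
            stmt.execute(
                "CREATE TABLE sensor_readings (" +
                " reading_id BIGINT NOT NULL PRIMARY KEY," +
                " sensor_id INT," +
                " reading DOUBLE," +
                " ts TIMESTAMP) " +
                "PARTITION BY PRIMARY KEY " +
                "HDFSSTORE (sensorStore)");
        } finally {
            conn.close();
        }
    }
}

The appeal of this arrangement is that applications simply speak SQL to the grid; the tiering between memory and HDFS happens behind the table definition.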
What is the reason for balancing an in-memory data grid on top of HDFS?
Makarand Gokhale, Senior Director of R&D for GemFire XD, shares why:
"A set of integrated, consumer grade services will evolve on top of HDFS - stream ingestion, analytical processing, and transactional serving. Provisioning flexibility and elasticity become critical capabilities for this infrastructure".
This vision makes sense. The GemFire "hot edge" layer performs lightning-fast real-time analytic processing, while deeper historical queries run in parallel against the HDFS infrastructure. Combining these two is not easy, and I hope to spend some time in future posts talking with Mak about the details of this integration. A rough sketch of what the split looks like to an application appears below.
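As a hedged illustration, the sketch below continues the hypothetical table from earlier. By default GemFire XD answers queries from the in-memory operational data (the "hot edge"); the queryHDFS hint, as I understand it from the product documentation, asks the grid to also fan the same query out over the history that has aged down into HDFS. Again, names and syntax details are assumptions, not a definitive recipe.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class HotAndDeepQueries {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:gemfirexd://locator-host:1527/");

        // Hot-edge query: by default only in-memory rows are scanned,
        // which is what delivers the real-time response.
        Timestamp oneHourAgo =
                new Timestamp(System.currentTimeMillis() - 3_600_000L);
        try (PreparedStatement hot = conn.prepareStatement(
                "SELECT sensor_id, AVG(reading) FROM sensor_readings " +
                "WHERE ts > ? GROUP BY sensor_id")) {
            hot.setTimestamp(1, oneHourAgo);
            try (ResultSet rs = hot.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("sensor %d avg %.2f (hot)%n",
                            rs.getInt(1), rs.getDouble(2));
                }
            }
        }

        // Deep historical query: the queryHDFS hint (an assumption from
        // the GemFire XD docs) includes HDFS-resident data, with the scan
        // running in parallel across the grid members.
        try (PreparedStatement deep = conn.prepareStatement(
                "SELECT sensor_id, AVG(reading) FROM sensor_readings " +
                "-- GEMFIREXD-PROPERTIES queryHDFS=true \n" +
                "GROUP BY sensor_id")) {
            try (ResultSet rs = deep.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("sensor %d avg %.2f (full history)%n",
                            rs.getInt(1), rs.getDouble(2));
                }
            }
        }
        conn.close();
    }
}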
For now, however, the analytic speed provided by this kind of Data Lake pairs extremely well with the development speed provided by the data analytic stack on top of it.
Deploying this architecture alongside traditional application stacks (or creating it as a destination for them) requires some thought and pre-planning. In the next few posts I will start outlining solutions to some of the issues involved in integrating legacy systems with a data lake architecture.
Steve
Twitter: @SteveTodd
EMC Fellow