Recently I published a post describing the enormous amount of data being contributed to the Digital Universe by machines.
Billions of devices will generate petabytes of information. Much of it will land in deep archives that are eventually mined and analyzed using a variety of fast, parallel analytic techniques.
The Digital Universe study posits that by 2020 these machines will not only generate an estimated 40% of all digital content, but will also expect an immediate (sub-second) response, or instruction, to determine the next robot-like action they should take. Often, these instructions will need to rely on a variety of other sensor feeds and/or real-time streaming information.
The analytic techniques of 2012 will quite frankly not cut it. What kind of innovation will be required to support this growing reality?
It's an interesting thought exercise to examine some of the emerging technologies in analytics and consider how they can be combined to achieve near-instantaneous machine-to-machine (M2M) communication and decision making.
In this post I'll approach the discussion from the angle of a centralized model. I'm assuming, in the short term, that service providers will begin adding these technologies into the plumbing of their existing data centers.
Over time, however, as the device count scales, you will likely see these implementations move to the edge in a dispersed cloud model.
Consider the graphic below and the bi-directional M2M needs of devices connected to the cloud infrastructure of the future:
Here are two emerging technologies that will facilitate M2M in coming cloud infrastructures:
1. More and more streaming data will be ingested into in-memory, distributed data grids. A good example is VMware GemFire: "an in-memory data grid that enables real-time data distribution, data replication, caching and data management using a non-relational key-value store, to allow storage of data for client applications." In addition, VMware SQLFire is described as follows: "VMware vFabric SQLFire is memory-oriented data management software delivering application data at runtime with horizontal scale and lightning-fast performance while providing developers with the well-known SQL interface and tools."
The key is the in-memory, horizontal scale provided by these types of technologies. Business logic accepts streaming machine input from sensors and immediately stores each reading in either key-value (GemFire) or SQL (SQLFire) format. If a given sensor requires immediate feedback based on the state of other sensors, that state can be quickly retrieved via an in-memory query. This approach, depicted below, enables the "sub-second" response times that many sensors will be expecting.
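To make the ingest-and-lookup pattern concrete, here is a minimal sketch using the GemFire client API. The locator host and port, the "sensors" region name, and the key scheme are all illustrative assumptions, not details from GemFire's documentation:

```java
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.client.ClientCache;
import com.gemstone.gemfire.cache.client.ClientCacheFactory;
import com.gemstone.gemfire.cache.client.ClientRegionShortcut;

public class SensorGridClient {
    public static void main(String[] args) {
        // Connect to the grid via a locator; host and port are illustrative.
        ClientCache cache = new ClientCacheFactory()
            .addPoolLocator("locator-host", 10334)
            .create();

        // A PROXY region holds no local state; all data lives in the grid.
        Region<String, Double> sensors = cache
            .<String, Double>createClientRegionFactory(ClientRegionShortcut.PROXY)
            .create("sensors");

        // Ingest path: store a streaming reading under a key-value scheme.
        sensors.put("sensor-42/temperature", 71.3);

        // Feedback path: an in-memory read of another sensor's latest state.
        Double neighbor = sensors.get("sensor-17/temperature");
        System.out.println("Latest neighbor reading: " + neighbor);

        cache.close();
    }
}
```

The SQLFire variant of this sketch would express the same ingest and lookup over JDBC with ordinary SQL statements, which is the trade-off between the two products: key-value simplicity versus the familiar SQL interface.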
2. More and more streaming machine data will be processed in real time by a technology known as Complex Event Processing (CEP). Wikipedia has a good summary of the technology: "Event processing is a method of tracking and analyzing (processing) streams of information (data) about things that happen (events), and deriving a conclusion from them. Complex event processing, or CEP, is event processing that combines data from multiple sources to infer events or patterns that suggest more complicated circumstances. The goal of complex event processing is to identify meaningful events (such as opportunities or threats) and respond to them as quickly as possible.
These events may be happening across the various layers of an organization as sales leads, orders or customer service calls. Or, they may be news items, text messages, social media posts, stock market feeds, traffic reports, weather reports, or other kinds of data. An event may also be defined as a 'change of state,' when a measurement exceeds a predefined threshold of time, temperature, or other value."
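To make that "change of state" idea concrete, a CEP engine lets you declare such a rule as a continuous query over the stream. Here is a minimal sketch using Esper, one of the two engines discussed next; the SensorReading class, the 100-degree threshold, and the sensor IDs are illustrative assumptions:

```java
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;

public class ThresholdCep {
    // Illustrative event class; Esper reads properties via getters.
    public static class SensorReading {
        private final String sensorId;
        private final double temperature;
        public SensorReading(String sensorId, double temperature) {
            this.sensorId = sensorId;
            this.temperature = temperature;
        }
        public String getSensorId() { return sensorId; }
        public double getTemperature() { return temperature; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("SensorReading", SensorReading.class);
        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // A "change of state" rule: fire whenever a reading crosses the threshold.
        EPStatement stmt = engine.getEPAdministrator().createEPL(
            "select sensorId, temperature from SensorReading where temperature > 100");
        stmt.addListener((newEvents, oldEvents) ->
            System.out.println("Threshold exceeded: "
                + newEvents[0].get("sensorId") + " = " + newEvents[0].get("temperature")));

        // Simulated machine feed; in practice these arrive as a live stream.
        engine.getEPRuntime().sendEvent(new SensorReading("sensor-17", 98.6));
        engine.getEPRuntime().sendEvent(new SensorReading("sensor-42", 104.2));
    }
}
```

The EPL statement runs continuously against the stream, so matching events trigger the listener immediately, with no polling of stored state involved.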
Two implementations of CEP getting a lot of buzz are Esper and Storm (Thomas Dudziak has written a good blog post describing both frameworks). With CEP, the business logic intercepts and aggregates the streams on the fly. This technique favors acting on the immediate state of the stream, as opposed to querying less recent state from a data grid. This approach is depicted below (using Storm's "spouts" and "bolts" icons):
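For a feel of the spout-and-bolt model behind that diagram, here is a minimal, hedged Storm topology sketch using the 2012-era backtype.storm API. The SensorSpout, ThresholdBolt, component names, and threshold value are all illustrative assumptions:

```java
import java.util.Map;
import java.util.Random;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class M2MTopology {

    // Hypothetical spout standing in for a live machine feed: emits random
    // (sensorId, temperature) tuples.
    public static class SensorSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random rand = new Random();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values("sensor-" + rand.nextInt(100),
                                      80.0 + rand.nextDouble() * 40.0));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sensorId", "temperature"));
        }
    }

    // Bolt that processes the stream on the fly, passing along only readings
    // that cross an illustrative threshold.
    public static class ThresholdBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            double temp = tuple.getDoubleByField("temperature");
            if (temp > 100.0) {
                collector.emit(new Values(tuple.getStringByField("sensorId"), temp));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sensorId", "temperature"));
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sensor-spout", new SensorSpout(), 2);
        builder.setBolt("threshold-bolt", new ThresholdBolt(), 4)
               .shuffleGrouping("sensor-spout");

        // Run briefly in-process; a production topology would be submitted
        // to a cluster instead.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("m2m-demo", new Config(), builder.createTopology());
        Utils.sleep(10000);
        cluster.shutdown();
    }
}
```

The shuffleGrouping call is what wires the bolt to the spout's stream; that wiring is the "intercept and aggregate on the fly" pattern described above, with no trip to a stored data grid in the hot path.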
Both of these technologies will become more common in the clouds of the future. Exactly how they fit into a cloud architecture is an area being explored by corporations and academia alike. In a future post I will discuss a research initiative that is starting to investigate analytics of streams at scale.
Steve
Twitter: @SteveTodd
EMC Fellow