The results of the Digital Universe study hint at the product innovations that exponential data growth is likely to bring to the surface.
In my last post I specifically called out complex event processing (CEP) as a technology that will fuel significant innovation going forward.
In this post I'd like to posit that unstructured data tagging will also be on the rise. Consider this interesting statement from the study:
All in all, in 2012, we believe 23% of the information in the digital universe (or 643 exabytes) would be useful for Big Data if it were tagged and analyzed.
On the flip side of that number, does it follow that 77% (2152 EB) of data would be useless?
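The 77% figure is simple arithmetic on the quoted numbers. Here is a minimal check, assuming the 23% and the 643 EB in the IDC statement describe the same total (the variable names are my own, not IDC's):

```python
# Figures taken from the IDC statement quoted above.
useful_fraction = 0.23   # share of the digital universe that would be useful if tagged
useful_eb = 643          # the same share expressed in exabytes

# Implied size of the whole 2012 digital universe, in exabytes.
total_eb = useful_eb / useful_fraction

# Everything that falls outside the "useful if tagged" share.
remaining_eb = total_eb - useful_eb
remaining_fraction = 1 - useful_fraction

print(f"implied total: {total_eb:.0f} EB")
print(f"remaining: {remaining_eb:.1f} EB ({remaining_fraction:.0%})")
```

The remainder comes out a little over 2,152 EB, which matches the number above once the fraction is dropped.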
What exactly does "useful" really mean, anyway? The graphic below offers some context:
Take a look at the 2020 data. Blue means "useful". The height difference between yellow and blue maps to "not useful".
After studying the report for quite some time, it dawned on me that these designations apply to "unstructured data". Transactional, structured content is self-describing. Row-column data, for example, has inherent meaning (e.g., a schema) and is therefore useful.
Unstructured data, on the other hand, is often a blob of unknown, variable-length content. For it to have any value at all, additional metadata is required.
It is precisely the addition of metadata that will render the unstructured content "useful". In fact, IDC specifically calls out the rise of metadata tagging:
Metadata is one of the fastest-growing subsegments of the digital universe (though metadata itself is a small part of the digital universe overall).
If metadata tagging represents an emerging area of high-tech innovation, what types of unstructured files will be the first targets? Once again, IDC has a useful graphic (followed by their definitions of each file type):
- Surveillance footage. Typically, generic metadata (date, time, location, etc.) is automatically attached to a video file. However, as IP cameras continue to proliferate, there is greater opportunity to embed more intelligence into the camera (on the edge) so that footage can be captured, analyzed, and tagged in real time. This type of tagging can expedite crime investigations, enhance retail analytics for consumer traffic patterns, and, of course, improve military intelligence as videos from drones across multiple geographies are compared for pattern correlations, crowd emergence and response, or measuring the effectiveness of counterinsurgency.
- Embedded and medical devices. In the future, sensors of all types (including those that may be implanted into the body) will capture vital and nonvital biometrics, track medicine effectiveness, correlate bodily activity with health, monitor potential outbreaks of viruses, etc. — all in real time.
- Entertainment and social media. Trends based on crowds or massive groups of individuals can be a great source of Big Data to help bring to market the “next big thing,” help pick winners and losers in the stock market, and yes, even predict the outcome of elections — all based on information users freely publish through social outlets.
- Consumer images. We say a lot about ourselves when we post pictures of ourselves or our families or friends. A picture used to be worth a thousand words, but the advent of Big Data has introduced a significant multiplier. The key will be the introduction of sophisticated tagging algorithms that can analyze images either in real time when pictures are taken or uploaded or en masse after they are aggregated from various Web sites.
Let's assume that we'll see significant innovation in unstructured tagging for surveillance, medical, entertainment/social and consumer use cases.
The growth in metadata creates a management problem. If the content gets separated from its metadata, the content reverts to being useless.
How do you permanently unite the metadata with the content? The best solutions will graft the two together so that they cannot drift apart. This will lead to the continued rise of information storage systems that specialize in keeping content and metadata together: object-based storage systems.
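To make the idea concrete, here is a toy, in-memory sketch of what "keeping content and metadata together" means. It is not the API of any real object store: `StoredObject`, `ToyObjectStore`, `put`, and `get` are hypothetical names, and deriving the object ID from a SHA-256 hash of the content is simply one assumption I made for the illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StoredObject:
    """One unit of storage: the raw bytes and their metadata travel together."""
    content: bytes
    metadata: dict = field(default_factory=dict)


class ToyObjectStore:
    """Hypothetical in-memory store; real systems add durability, policy, etc."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes, metadata: dict) -> str:
        # Assumption for this sketch: the ID is a hash of the content.
        object_id = hashlib.sha256(content).hexdigest()
        # Content and metadata are written as a single object, never separately.
        self._objects[object_id] = StoredObject(content, dict(metadata))
        return object_id

    def get(self, object_id: str) -> StoredObject:
        # A read always returns the content with its metadata attached.
        return self._objects[object_id]


store = ToyObjectStore()
oid = store.put(b"raw image bytes", {"type": "consumer-image", "taken": "2012-12-01"})
obj = store.get(oid)
print(obj.metadata["type"], len(obj.content))
```

The point of the design is that there is no call that hands back the bytes without the tags, so the content cannot quietly lose the information that makes it useful.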
For those of you unfamiliar with the benefits of object-based storage, earlier this year Chuck Hollis posted a summary of the Atmos implementation of object-based storage. Study his post and it will become clear why object-based technologies are so important.
Tagging transforms unstructured content into something useful (e.g. something that can be analyzed).
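A small sketch of why tags matter for analysis: the raw blobs below can only be counted, while the tagged versions can be grouped and filtered. The tag names (`source`, `location`) and their values are made up for the illustration.

```python
from collections import Counter

# Without tags, all we can say about the content is how much of it there is.
untagged = [b"blob-1", b"blob-2", b"blob-3"]
print("untagged items:", len(untagged))

# The same content with illustrative (hypothetical) tags attached.
tagged = [
    {"content": b"blob-1", "tags": {"source": "surveillance", "location": "lobby"}},
    {"content": b"blob-2", "tags": {"source": "surveillance", "location": "garage"}},
    {"content": b"blob-3", "tags": {"source": "social-media"}},
]

# Grouping: how much content came from each kind of source?
by_source = Counter(item["tags"]["source"] for item in tagged)

# Filtering: which items were captured in the lobby?
lobby_items = [item for item in tagged if item["tags"].get("location") == "lobby"]

print(dict(by_source))
print(len(lobby_items))
```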
Analysis of objects will therefore become critical for the Digital Universe, and it is a topic worth exploring in a future post.