While working at EMC Cambridge last week I took the Red Line over to Harvard to attend a lunchtime lecture by Joe Futrelle. Joe's lecture focused on the availability of scientific data. Given my interest in digital preservation, as well as my involvement in the MIT DataSpace proposal, I was interested in what Joe had to say.
I jotted down a couple of quotes that I found interesting:
If you can't tweet your data, it doesn't exist.
Scientific data sets are competing with blogs and twitter.
Joe's point was that scientists need a new mindset when generating scientific data sets. Every step of the way (including well before the generation of scientific data), a researcher should be generating and preserving metadata and forming a rich set of network links that will ultimately surround the scientific content.
Without an intentional plan for richly linking scientific data to related internet-based research, scientists' data will not be found amid the blogs, tweets, and scientific papers written daily about (potentially) similar subjects.
To advance this work (as part of the National Center for Supercomputing Applications), Joe has been involved with the Tupelo project. He used the analogy of JDBC (a common Java interface that unites disparate database technologies) to describe how Tupelo uses the Resource Description Framework (RDF) to unite scientific data repositories. In particular, RDF's graph data model provides global identifiers and named links, the two key features that Tupelo leverages.
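To make the graph model concrete, here is a minimal sketch of RDF's core idea: statements are (subject, predicate, object) triples in which subjects are globally unique URIs and predicates act as named links between resources. The URIs and the `derivedFrom` vocabulary term below are hypothetical examples for illustration, not actual Tupelo identifiers.

```python
# RDF models data as (subject, predicate, object) triples.
# Subjects and predicates are global identifiers (URIs); predicates
# serve as named links between resources. All URIs below are
# hypothetical examples, not real Tupelo or Dublin Core usage.

triples = [
    ("http://example.org/dataset/42",
     "http://purl.org/dc/terms/title",
     "Ocean temperature readings (example)"),
    ("http://example.org/dataset/42",
     "http://example.org/vocab/derivedFrom",
     "http://example.org/dataset/7"),
]

def objects(graph, subject, predicate):
    """Follow a named link from a subject and return what it points to."""
    return [o for s, p, o in graph if s == subject and p == predicate]

print(objects(triples,
              "http://example.org/dataset/42",
              "http://example.org/vocab/derivedFrom"))
# → ['http://example.org/dataset/7']
```

Because every node carries a global identifier, triples harvested from separate repositories can be merged by simple set union, which is what makes RDF attractive as a "JDBC for data repositories."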
This picture of globally unique identifiers and named links reminds me of XAM technology, and in particular how it can be used to trace lineage from one object to its parent.
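The parent-tracing idea can be sketched generically: each object records the identifier of the object it was derived from, and lineage is recovered by walking those links back to the root. This is an illustration of the concept only, with made-up object IDs; it is not the actual XAM API.

```python
# Generic lineage tracing via parent identifiers (illustrative only,
# not the XAM API). Each object stores the ID of its parent; the root
# object has no parent.
parents = {
    "obj-c3": "obj-b2",   # hypothetical object IDs
    "obj-b2": "obj-a1",
    "obj-a1": None,       # root: the original source object
}

def lineage(obj_id):
    """Walk parent links back to the original object."""
    chain = [obj_id]
    while parents.get(obj_id) is not None:
        obj_id = parents[obj_id]
        chain.append(obj_id)
    return chain

print(lineage("obj-c3"))  # → ['obj-c3', 'obj-b2', 'obj-a1']
```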
Software that generates data will come and go, but the data itself endures. Software developers should therefore focus less on their own software (which will eventually become obsolete) and more on two crucial aspects:
- How their software interoperates
- How to annotate the resultant scientific data so that it is highly reusable
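One lightweight way to act on the second point is to write provenance metadata alongside the data at generation time. The sketch below assumes a hypothetical JSON "sidecar" convention (the field names and values are invented for illustration); the point is that the annotation travels with the data, independent of the software that produced it.

```python
import json

# Hypothetical sidecar convention: next to each data file, write a
# small JSON record capturing provenance so the data stays findable
# and reusable after the generating software is gone. All values
# below are invented examples.
metadata = {
    "title": "Ocean temperature readings (example)",
    "creator": "J. Smith",
    "derived_from": ["raw_casts.csv"],
    "generated_by": "processing-pipeline v1.2",
}

with open("ocean_temps.csv.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)

# Any other tool can now recover the provenance without the pipeline.
with open("ocean_temps.csv.meta.json") as f:
    print(json.load(f)["generated_by"])  # → processing-pipeline v1.2
```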
After the talk I spoke with Joe about XAM technology and its applicability to this area. Within a few weeks the MIT DataSpace proposal (uniting scientific data archives) will be presented to the National Science Foundation in Washington DC as part of the final stages of the NSF DataNet grant process. If it is approved, it will be interesting to see whether technologies like Tupelo and XAM make their way into the effort.
Twitter: @SteveTodd