The EMC Innovation Lecture that I wrote about several weeks ago went well. A webcast is available here. A few dozen people dialed in to listen to Burt Kaliski from EMC, Stuart Madnick from MIT, and me talk about Digital Curation and a variety of related topics. Wayne Adams (EMC) from SNIA was also on the call to talk about what's been going on lately with the XAM standard within SNIA.
Several attendees called in to ask questions or submitted them via chat.
When the lecture was over I found that I had a question of my own:
Can scientific research repositories benefit from archival standards?
MIT DataSpace
Distributed silos of research information don't necessarily lend themselves to external re-use. For example, if large amounts of anonymous medical X-rays are stored at a hospital for research and analysis, can the X-ray data (and the results of the research) be accessed by another research team? If environmental data is captured off the coast of Massachusetts and used to measure ocean temperature trends, can that data be re-used by other environmentalists around the world?
The answer is often no, and the MIT DataSpace project is a proposal focused on turning that answer around.
The proposal has been submitted to the National Science Foundation (NSF). The NSF is accepting proposals as part of its DataNet initiative. If funded, MIT and a host of other academic and industry teams will work on a solution.
What is the Format of Research Data?
Back to my question. Research data in repositories around the world comes in a variety of formats. Some of it exists in databases, some of it exists in file systems, and some of it exists in both. Let's assume that DataSpace would propose a standard method for taking these existing systems and providing some standardized form of access to the data.
And let's assume that DataSpace would like this research data to be available for a long time.
Keeping digital data for a long time is the job of a digital curator, and most digital curators have adopted a standard for their digital archives known as the Open Archival Information System (OAIS).
How does scientific research data map to OAIS? Consider the oceanographic data example. The temperature measurements are the key piece of data, known in OAIS terms as a Data Object. Any Data Object, when it is contained within an OAIS archive, must be accompanied by the following mandatory metadata:
- Representation Information: explains how to interpret the oceanographic data. What format is it in?
- Reference Information: a persistent identifier that uniquely identifies the data.
- Fixity Information: verifies that the oceanographic data has not been altered (for example, a checksum).
- Provenance Information: documents the history of the data.
- Context Information: describes how the data relates to its environment (for example, why it was collected and how it relates to other data).
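To make the list above a little more concrete, here is a minimal Python sketch of what a Data Object plus its five pieces of metadata might look like in code. This is only an illustration under my own assumptions: OAIS is a reference model and does not prescribe any class, field name, or checksum algorithm. The names ArchivedDataObject and package, the use of SHA-256 for fixity, the example identifier, and the sample readings are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ArchivedDataObject:
    """Hypothetical record: a Data Object plus the five metadata items."""
    data: bytes           # the Data Object itself (e.g. temperature readings)
    representation: str   # Representation Information: how to interpret the bytes
    reference: str        # Reference Information: persistent identifier
    fixity: str           # Fixity Information: here, a SHA-256 hex digest (an assumption)
    provenance: tuple     # Provenance Information: ordered history of the data
    context: str          # Context Information: how the data relates to its environment

    def is_intact(self) -> bool:
        """Recompute the digest and compare it with the stored fixity value."""
        return hashlib.sha256(self.data).hexdigest() == self.fixity


def package(data: bytes, representation: str, reference: str,
            provenance: tuple, context: str) -> ArchivedDataObject:
    """Bundle raw data with its metadata, computing fixity at ingest time."""
    digest = hashlib.sha256(data).hexdigest()
    return ArchivedDataObject(data, representation, reference,
                              digest, provenance, context)


# Made-up sample values, purely for illustration.
readings = b"2008-03-01,11.2\n2008-03-02,11.4\n"
record = package(
    data=readings,
    representation="CSV, columns: ISO date, sea surface temperature in Celsius",
    reference="urn:example:ocean-temp-0001",
    provenance=("captured by buoy", "ingested into archive"),
    context="Part of a study of ocean temperature trends",
)
```

Calling record.is_intact() recomputes the digest, so a researcher on the other side of the world could confirm that the bytes they received match what was originally archived, and the representation field tells them how to read those bytes.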
It seems to me that the above five fields, if consistently made available for scientific research data, would be highly valuable. This leads me to conclude that the techniques used in long-term, OAIS-compliant digital curation should be studied in the context of standardized access to digital research archives.
The next step that I'm interested in pursuing is finding some use cases (e.g., real-life research repositories) and beginning the process of mapping them onto OAIS.
I'd love to hear from anyone who has already thought along these lines (or is interested in doing so).
Steve
http://stevetodd.typepad.com
Twitter: @SteveTodd