I recently met two professors from the UK who are on a data quest. They're touring the United States asking questions about scientific research data. On Wednesday afternoon I spent a couple of hours with Harvard University's IIC team (Initiative in Innovative Computing) listening to their pitch and asking them what they are up to.
Malcolm Atkinson is a professor of e-Science at the University of Edinburgh, and David De Roure is a professor of Computer Science at the University of Southampton.
Both are involved in the e-Science initiative, which was introduced as "the intersection/collaboration between scientists and computer scientists". Data-intensive research by scientists is "emerging as a new paradigm", and one aspect of the e-Science effort is to collaborate effectively on efficient ways to process, analyze, store, and share huge amounts of scientific data.
So why the US tour (taking them through New England, Chicago, the West Coast, the Southwest, etc.)? Well, they called it a "fact-finding mission" on the current state of digital research data. They had a large number of questions for the audience, and I've recorded some of them below:
- What direction (and how) should we steer the digital data revolution?
- How do you characterise your community's data requirements?
- How many researchers are using data-intensive methods today, and what limits the adoption rate?
- In what ways is data changing in your community?
- How do we handle all of the different research software stacks that individual research communities are building on their own?
The Challenges of Sharing Research Data
I've written before about my interest in highly scalable repositories of fixed content, and research data falls into that category. Right down the street from Harvard, in fact, are the researchers at the MIT Sloan School of Management, who are requesting funding from the National Science Foundation to research the topic of uniting research archives.
One of the more interesting "research data sharing" challenges mentioned by Malcolm during his talk was the challenge of sharing copyrighted material. I found out about "illegal database joins". Consider a database query that joins two database tables from different research data stores. If one of these tables contains copyrighted material, then the creation of the resulting "join output" could be illegal.
This is just one example highlighting the challenges of sharing research data. It's one very large, tough nut to crack.
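To make the "illegal database join" idea a little more concrete, here is a minimal sketch in Python using an in-memory SQLite database. Everything in it is my own illustration rather than anything shown in the talk: the table names, the `TABLE_LICENSES` registry, and the `guarded_join` helper are all hypothetical, and a simple tag check like this says nothing about what is actually legal. It only shows the shape of the problem, where a query layer has to know the terms attached to each table before it combines them.

```python
import sqlite3

# Hypothetical license registry: one tag per table (an assumption for this sketch).
TABLE_LICENSES = {
    "open_samples": "open",
    "open_depths": "open",
    "licensed_readings": "copyrighted",
}


def build_demo_db():
    """Create a small in-memory database with two open tables and one restricted table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE open_samples (sample_id INTEGER, site TEXT)")
    conn.execute("CREATE TABLE open_depths (sample_id INTEGER, depth_m REAL)")
    conn.execute("CREATE TABLE licensed_readings (sample_id INTEGER, reading REAL)")
    conn.executemany(
        "INSERT INTO open_samples VALUES (?, ?)",
        [(1, "North Sea"), (2, "Irish Sea")],
    )
    conn.executemany(
        "INSERT INTO open_depths VALUES (?, ?)", [(1, 120.0), (2, 85.0)]
    )
    conn.executemany(
        "INSERT INTO licensed_readings VALUES (?, ?)", [(1, 7.9), (2, 8.1)]
    )
    return conn


def join_allowed(left, right):
    """Allow a join only when both tables are registered and tagged 'open'."""
    return all(TABLE_LICENSES.get(table) == "open" for table in (left, right))


def guarded_join(conn, left, right, key):
    """Run the join only if the license registry permits combining the two tables."""
    if not join_allowed(left, right):
        raise PermissionError(
            f"Joining {left!r} with {right!r} is not permitted by the registry"
        )
    if not key.isidentifier():
        raise ValueError(f"Invalid join column: {key!r}")
    # Table names were checked against the registry above, so they are known values.
    query = f"SELECT * FROM {left} JOIN {right} USING ({key}) ORDER BY {key}"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(guarded_join(conn, "open_samples", "open_depths", "sample_id"))
    try:
        guarded_join(conn, "open_samples", "licensed_readings", "sample_id")
    except PermissionError as err:
        print("Blocked:", err)
```

The join between the two open tables goes through, while the join that touches the restricted table is refused before any "join output" is created. Real research data stores would need far richer license terms than a single tag per table, which is part of why this is such a hard problem.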
After describing the problem space, Malcolm turned the presentation over to David, who presented what I thought was a unique and effective way to make progress on this issue.
Sharing Scientific Workflows
If the research world is not ready for all-out, full-access sharing of scientific data repositories, how about sharing the process or methods used in the analysis of data repositories? What if this sharing of workflows was accomplished through cutting-edge social media technologies?
David then presented an overview of myExperiment, a site dedicated to "finding, using, and sharing" scientific workflows. The website allows users to create and share scientific workflows, join groups, send messages, add comments, and of course search and find other people's workflows.
Social media for scientists. I like it. Scientists can actually re-use workflows and adapt them for their own purposes. While the input and output data related to a workflow may not be physically available, the conclusions, papers, and theories that result from the workflow can be published alongside it.
David went on to say that, in order to make this workflow sharing successful, "credit and attribution" of workflows became one of the key problems to solve. Meeting users' requirements for control over their intellectual property became key to user adoption.
I went to myExperiment.org and typed in the word "ocean" (oceanography was discussed in EMC's innovation lecture from June), and I found an oceanography group, as well as a workflow that read oceanographic data from a file and sent it to COVE (the collaborative ocean visualization environment).
This type of collaboration and sharing of scientific workflows certainly sets the stage for collaboration and sharing of scientific data repositories.
It was an intriguing presentation and I learned a lot. If you see two professors (with UK accents) crossing the US on a data quest this fall, be sure to give them your data requirements ;>).
Steve
http://stevetodd.typepad.com
Twitter: @SteveTodd