Progress on “Data Reproducibility”

This month’s issue of D-Lib Magazine (What? You don’t read D-Lib every issue? Tsk!) has several articles from the Reproducible Open Science conference about the difficult but crucial problem of scientific reproducibility in this digital era. (See Sensei Carole Goble’s authoritative talk [2].)

The entire idea of the scientific enterprise hinges on sharing not only results, but also methods and data, so that the evidence and reasoning are open for all to check and reuse. In principle, it should be possible to independently reproduce the results, to check their correctness. Reproducibility also enables others to build on the work, applying the methods to new data, or exploring modifications to the methods. So, scientific findings should be published with sufficient information to be reproducible.

Over the past several decades, most scientific findings have come to be based on multiple computer-generated datasets, which are processed by complex algorithms. Reproducing the results requires an adequate understanding of, and ability to access, the relevant datasets and algorithms. That is, it is necessary to publish not only the paper, but also the (possibly large) datasets used and a description of the data processing that created the data and results.

This problem is scarcely new, and it is widely accepted that the solution is powerful automated tools that keep track of “who did what” in the computations [1, 3, 4]. This is, as Sensei Jim Myers put it (over ten years ago, now!), reinventing the scientific notebook [5, 8].

This issue of D-Lib has several articles describing advances toward this goal.

Jingbo Wang and colleagues describe a “Provenance Capture System” implemented by the Australian National Computational Infrastructure (NCI) [9]. This system resembles earlier prototypes developed in the US, UK, and elsewhere (see [3]).

The article gives an example case that uses several datasets from Earth Observation satellites, which are processed to calibrate and select the data relevant to the study. The system automatically records the datasets and processing, and stores a record on a public server. Any investigator can use the record to understand and reproduce the data. Other articles in the issue discuss similar efforts in Europe [6, 7], as well as efforts to tackle other aspects of the problem.
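
To make the idea of "automatically recording the datasets and processing" concrete, here is a toy sketch in Python. It is not the NCI system or any of the tools in the cited articles; the function names (run_with_provenance, select_above), the record fields, and the threshold step are all made up for illustration. It fingerprints each input file, runs one processing step, and returns a record that could be saved as JSON next to the output.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_with_provenance(step_name, func, input_paths, output_path, params):
    """Run func(input_paths, output_path, **params) and describe what happened.

    The returned dict is a hypothetical provenance record: which step ran,
    when, with which parameters, on which inputs, producing which output.
    """
    started = datetime.now(timezone.utc).isoformat()
    # Fingerprint the inputs before the step runs.
    inputs = [{"path": p, "sha256": sha256_of(p)} for p in input_paths]
    func(input_paths, output_path, **params)
    return {
        "step": step_name,
        "started_utc": started,
        "parameters": params,
        "inputs": inputs,
        "output": {"path": output_path, "sha256": sha256_of(output_path)},
    }


def select_above(input_paths, output_path, threshold):
    """Hypothetical processing step: keep numeric lines above a threshold."""
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as f:
                for line in f:
                    if float(line) > threshold:
                        out.write(line)
```

Calling run_with_provenance("select", select_above, ["obs.txt"], "selected.txt", {"threshold": 4.0}) would write the selected values and hand back the record. Someone who has the same input files can recompute the hashes, rerun the step with the recorded parameters, and compare the output hash. A real system would also need to record software versions and the computing environment, which this sketch leaves out.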

I’m pleased to see these signs of progress on this very difficult problem. Check out the D-Lib articles.
