Progress on “Data Reproducibility”

This month’s issue of D-Lib Magazine (What? You don’t read D-Lib every issue? Tsk!) has several articles from the Reproducible Open Science conference about the difficult but crucial problem of scientific reproducibility in this digital era. (See Sensei Carole Goble’s authoritative talk [2].)

The entire idea of the scientific enterprise hinges on sharing not only results, but also methods and data, so that the evidence and reasoning are open for all to check and reuse. In principle, it should be possible to independently reproduce the results, to check their correctness. Reproducibility also enables others to build on the work, applying the methods to new data, or exploring modifications to the methods. So, scientific findings should be published with sufficient information to be reproducible.

Over the past several decades, most scientific findings have come to be based on multiple computer-generated datasets, which are processed by complex algorithms. Reproducing the results requires an adequate understanding of, and ability to access, the relevant datasets and algorithms. That is, it is necessary to publish not only the paper, but also the (possibly large) datasets used and a description of the data processing that created the data and results.

This problem is scarcely new, and it is widely accepted that the solution is powerful automated tools that keep track of “who did what” in the computations [1, 3, 4]. This is, as Sensei Jim Myers put it (over ten years ago, now!), reinventing the scientific notebook [5, 8].

This issue of D-Lib has several articles describing advances toward this goal.

Jingbo Wang and colleagues describe a “Provenance Capture System” implemented by the Australian National Computational Infrastructure (NCI) [9]. This system resembles earlier prototypes developed in the US, UK, and elsewhere (see [3]).

The article gives an example case that uses several datasets from Earth Observation satellites, which are processed to calibrate and select the data relevant to the study. The system automatically records the datasets and processing, and stores a record on a public server. Any investigator can use the record to understand and reproduce the data. Other articles in the issue discuss similar efforts in Europe [6, 7], as well as efforts to tackle other aspects of the problem.
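To make the idea of "recording the datasets and processing" concrete, here is a minimal sketch in Python. It is not the NCI Provenance Capture System or any published API. The function names, the record fields ("activity", "agent", "used", "generated"), and the toy processing step are all assumptions made up for illustration. The sketch fingerprints each input dataset, runs one processing step, and produces a record of who ran what on which data.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest identifying the exact bytes of a dataset."""
    return hashlib.sha256(data).hexdigest()


def run_with_provenance(step_name, func, inputs, agent):
    """Run func on the named inputs and return (output, provenance record).

    inputs maps a dataset name to its raw bytes; func takes that dict
    and returns the output bytes. The record layout is hypothetical.
    """
    started = datetime.now(timezone.utc).isoformat()
    output = func(inputs)
    ended = datetime.now(timezone.utc).isoformat()
    record = {
        "activity": step_name,   # what was done
        "agent": agent,          # who did it
        "started": started,
        "ended": ended,
        "used": {name: fingerprint(data) for name, data in sorted(inputs.items())},
        "generated": fingerprint(output),
    }
    return output, record


def drop_flagged(inputs):
    """Hypothetical processing step: keep readings not listed as bad."""
    bad = set(inputs["bad_readings"].split())
    kept = [r for r in inputs["readings"].split() if r not in bad]
    return b" ".join(kept)


# Toy stand-ins for real datasets (made-up values).
inputs = {"readings": b"10 11 999 12", "bad_readings": b"999"}
output, record = run_with_provenance(
    "drop_flagged", drop_flagged, inputs, agent="example-user"
)

# In a real system the record would be stored on a server; here we just print it.
print(json.dumps(record, indent=2))
```

Because the record holds digests of the inputs and the output, anyone who reruns the same step on the same data can check that they get a matching "generated" value. A real system would also have to record software versions, parameters, and where the datasets live.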

I’m pleased to see these signs of progress on this very difficult problem. Check out the D-Lib articles.


  1. Joe Futrelle and James Myers, Tracking Provenance in Heterogeneous Execution Contexts. Concurrency and Computation: Practice and Experience, 20 (5):555-564, 10 April 2008.
  2. Carole Goble, What is Reproducibility? The R* Brouhaha, in First International Workshop on Reproducible Open Science. 2016: Hannover, Germany.
  3. Luc Moreau, Special Issue: The First Provenance Challenge. Concurrency and Computation: Practice and Experience (on-line), 20 (5), November 2007. http://dx.doi.org/10.1002/cpe.1233
  4. Luc Moreau, Juliana Freire, Robert E. McGrath, Jim Myers, Joe Futrelle, and Patrick Paulson, The Open Provenance Model. 2007. http://eprints.ecs.soton.ac.uk/14979/1/opm.pdf
  5. James D. Myers, Alan R. Chappell, Matthew Elder, Al Geist, and Jens Schwidder, Re-Integrating The Research Record. Computing in Science and Engineering, 5 (3):44-50, May/June 2003. http://ieeexplore.ieee.org/document/1196306/
  6. Stefan Pröll and Andreas Rauber (2017) Enabling Reproducibility for Small and Large Scale Research Data Sets. D-Lib Magazine, https://doi.org/10.1045/january2017-proell
  7. Sheeba Samuel, Frank Taubert, Daniel Walther, Birgitta König-Ries, H. Martin Bücker, and Michael Stifel (2017) Towards Reproducibility of Microscopy Experiments. D-Lib Magazine, https://doi.org/10.1045/january2017-samuel
  8. Tara Talbott, Michael Peterson, Jens Schwidder, and James D. Myers. Adapting the Electronic Laboratory Notebook for the Semantic Era. In International Symposium on Collaborative Technologies and Systems (CTS 2005), 2005. http://ieeexplore.ieee.org/document/1553305/
  9. Jingbo Wang, Ben Evans, Lesley Wyborn, Nick Car, and Edward King (2017) Supporting Data Reproducibility at NCI Using the Provenance Capture System. D-Lib Magazine, https://doi.org/10.1045/january2017-wang

 
