This is a golden age for scientific data and computation. Almost all science is digital these days, from instruments through data, analysis, and publication. These extraordinary capabilities are producing remarkable results.
But we are dancing on the edge of a digital dark age for reproducibility.
One of the pillars of science, one that distinguishes it from many other forms of rational argument, is the desire for reproducibility: the intention that any competent person can, with the right tools, walk through the procedure and reach the same result.
One need look no further than the library and archives of any major public university. Pretty much every student thesis is done with digital software and data, and students “deposit” their work into the public record for others to use. But the record is usually a document describing the work in English or another human language, along with some data. Perhaps some source code or pseudocode is included. (Whether the library has any effective way to preserve and disseminate these digital artifacts is a whole other question!)
But it would be extremely difficult to reproduce the results from this information, especially after a year or two. So have we actually accomplished much by publishing the PDF paper, but not the software it is actually about?
Don Monroe comments on this problem in the current Communications of the ACM, “When Data is Not Enough”. He notes that progress has been made in the publication of digital data, e.g., as part of the review and publication of papers.
But, as he discusses, reproducing the results requires reproducing the software which can be very difficult. Even under the best circumstances, and with the best intentions, software is complex and error prone.
There are many obvious issues, such as reliance on proprietary codes or products, and the rapid decay of software (I challenge you to run code that was published even ten years ago).
Even worse, I’m pretty sure that most scientists do not even know the details of the software they use. The science code is the tip of a huge iceberg of utilities, libraries, operating systems, and networks. Not only is it impossible to recreate the configuration, it is most likely impossible even to know the configuration at a given time, and it may well change during the experiment.
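To make the configuration problem concrete, here is a minimal Python sketch (the function name is my own invention) that snapshots only the most visible layer of the stack: interpreter version, platform, and installed packages. Everything beneath that, system libraries, drivers, network services, is already out of reach:

```python
import platform
import subprocess
import sys

def snapshot_environment():
    """Record a partial snapshot of the software stack at run time.

    Even this captures only the tip of the iceberg: the interpreter,
    the operating system, and the Python packages -- not system
    libraries, firmware, or the services the computation touches.
    """
    freeze = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True, text=True, check=False,
    )
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": freeze.stdout.splitlines(),
    }
```

Archiving such a snapshot alongside the results at least records what was knowable at the time, even if it cannot recreate the full configuration.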
(And, by the way, more and more people are using virtual machines and on-demand clusters. In this case, the computation may be wrapped in a cocoon of software from the operating system up to the interface. This is totally opaque, even to the person who made it, let alone to anyone else.)
Furthermore, I would say that describing software is nearly hopeless, because we do not understand it, nor do we have adequate language. What do I need to tell you for you to know how to reproduce my software? There is so much detail that is irrelevant, and what words do I use to describe the essential features better than simply providing the code?
Monroe describes efforts to improve the situation by updating the concept of the “scientific notebook” for the digital age, integrating automated recording of all the steps. This is the right idea, though Monroe does not cite pioneering efforts such as Scientific Annotation Middleware, VisTrails, or myExperiment, as well as standards processes such as the W3C PROV. I’m just saying: we’ve been aware of and working on this problem for a long time.
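As a rough illustration of the automated-recording idea (not any particular system’s API; all names here are my own invention), one can imagine a decorator that logs every analysis step, with its inputs and outputs, into an in-memory notebook:

```python
import datetime
import functools

STEPS = []  # the in-memory "notebook": one record per computational step

def recorded(func):
    """Decorator that automatically logs each call (inputs, result,
    timestamp), in the spirit of an automated digital lab notebook."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        STEPS.append({
            "step": func.__name__,
            "args": repr(args),
            "result": repr(result),
            "time": datetime.datetime.now().isoformat(),
        })
        return result
    return wrapper

@recorded
def normalize(values):
    total = sum(values)
    return [v / total for v in values]

@recorded
def mean(values):
    return sum(values) / len(values)
```

Real provenance systems go much further, of course, recording environments, data lineage, and workflow structure rather than just call traces.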
In any case, one can ask whether we merely want to get the exact software working, or whether we want to understand and reproduce the conceptual results. If the result depends on a very specific collection of software, then how confident can we be about it? What we need is to reproduce results by convergent methods: multiple implementations and independent data and computations.
I note that in systems that require extremely high confidence (e.g., the Space Shuttle), it is common practice to use multiple independently developed systems, and compare the results. If they agree, then we trust the answers. If some disagree, we know we need to check carefully.
(A classic large scale example of this approach is seen in the US nuclear program, which has multiple labs that independently calculate answers to critical questions, answers that simply must be correct.)
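The same cross-checking pattern can be sketched at small scale in Python; the function names and tolerance here are purely illustrative:

```python
import math
import statistics

def variance_textbook(xs):
    """Independently written two-pass sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def variance_library(xs):
    """A second, independent implementation: the standard library's."""
    return statistics.variance(xs)

def cross_check(xs, rel_tol=1e-9):
    """Trust the answer only when independent implementations agree."""
    a = variance_textbook(xs)
    b = variance_library(xs)
    if math.isclose(a, b, rel_tol=rel_tol):
        return a
    raise RuntimeError(f"implementations disagree: {a} vs {b}; check carefully")
```

If the two codes disagree, the discrepancy itself is the finding: something needs careful checking.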
So the real key, I say, is to publish enough of the conceptual logic so that it is possible to create an independent calculation. Dumping all the data and software on us is actually not helpful for this purpose.
- Freire, Juliana, Cláudio T. Silva, Steven P. Callahan, Emanuele Santos, Carlos E. Scheidegger, and Huy T. Vo, Managing Rapidly-Evolving Scientific Workflows. In International Provenance and Annotation Workshop (IPAW), 2006, 10-18.
- Goble, Carole Anne and D. De Roure, myExperiment: social networking for workflow-using e-scientists. In Proceedings of the 2nd Workshop on Workflows in Support of Large-Scale Science, ACM, Monterey, California, USA, 2007.
- Monroe, Don, When data is not enough. Commun. ACM, 58 (12):12-14, 2015.
- Myers, James D., Alan R. Chappell, Matthew Elder, Al Geist, and Jens Schwidder, Re-Integrating The Research Record. Computing in Science and Engineering, 5 (3):44-50, May/June 2003.