I have commented before about the difficult problem of understanding and reproducing computations. This is a deep philosophical problem for computational thinking in general, and a critical concern for practical science. When a measurement, finding, or argument depends on a complicated computational workflow, how can we make it possible to accurately reconstruct and reproduce the computation in order to test or modify it?
This month Cedric Notredame and colleagues from the Centre for Genomic Regulation (CRG, http://www.crg.eu/en/) in Barcelona published a description of Nextflow, a system that addresses many aspects of this problem, at least for common genomic analyses. (These computations need to be highly reproducible for both safety and business purposes, as well as scientific validation.)
One of the key problems they tackle is the subtle issue of numerical stability of the computations. Running the same program on different computers, or with slightly different support libraries, or slightly different settings, can result in accumulations of rounding errors that may change the final result. This issue gives undergraduate Computer Science majors migraines, and most of us try to forget about it as much as possible. Dealing with this requires considerable expert knowledge and a certain amount of practical art.
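To make the rounding issue concrete, here is a minimal Python sketch. It is my own illustration, not anything taken from the Nextflow paper, and the numbers are chosen purely to make the effect visible. It shows that floating-point addition is not associative, so merely changing the order of operations (as a different library or compiler setting might) can change the answer.

```python
import math

# Floating-point addition is not associative: grouping changes the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

# A more dramatic case: a small term is lost entirely when it is added
# to a huge one before the huge terms cancel.
values = [1e16, 1.0, -1e16]

naive_total = 0.0
for v in values:
    naive_total += v  # 1e16 + 1.0 rounds back to 1e16, so the 1.0 vanishes

reordered_total = 0.0
for v in [1e16, -1e16, 1.0]:
    reordered_total += v  # the big terms cancel first, so the 1.0 survives

# math.fsum tracks the lost low-order bits and returns the exact sum.
exact_total = math.fsum(values)

print(naive_total, reordered_total, exact_total)  # 0.0 1.0 1.0
```

Same inputs, same arithmetic on paper, different answers. Multiply this by millions of operations across a long pipeline and it is easy to see how two "identical" runs can drift apart.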
The Nextflow software essentially captures this knowledge, and makes it easier to get the same results every time. These are not necessarily the correct results. They are repeatable results.
Repeatability is necessary, if not sufficient, for seriously testing their validity.
It is very clear why this product was created, and what advantages it offers investigators.
Looking at the technology, Nextflow is built upon Docker container technology, which I’m not familiar with, but which looks really cool and useful. A container encapsulates the dependencies of a particular software component. (You youngsters don’t appreciate how lucky you are. Over my career, I probably spent a cumulative decade or more dealing by hand with these fiddly details of just keeping the darn software running.)
There is a problem, though. When you successfully reproduce the results, you depend on a massive amount of software and settings. Most of those dependencies are supposed to be irrelevant to the conceptual result. By analogy, if you get the result only when you use one specific microscope, or even one model of microscope, then the instrument is confounded with the result.
This result is fragile, because it depends on the exact method that produced it. For robust understanding, it is always important to reproduce results using alternative methods. I.e., you should get the same result with any microscope. Ideally, we want to recreate the same results even with different software components.
A danger with these complex workflows is that it is very difficult to understand the dependencies hidden inside. The flip side of great software like Nextflow (and Docker and everything that it depends on!) is that it makes it easy to reuse components without knowing what is inside. It is so easy to be blind to potential confounding factors hidden in the software or even in the hardware.
I think the next step will be capabilities that explain each component, and reason about the workflow to explain what it is built on top of, and what it “assumes” about the data and computation. For example, if a computation uses a specific library to sift through a database, it implicitly assumes that this library works correctly with this database.
A second improvement will be tools for comparing two workflows, to identify the differences. Some differences should be unimportant (we assume), others are obviously significant. This tool would also help create alternative workflows that use meaningfully different software to do “the same” computation. These can give convergent validity to both results and software.
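As a rough sketch of what such a comparison tool might start from, assume each workflow can be boiled down to a simple manifest mapping component names to versions. The manifests, component names, and version strings below are all hypothetical, invented for illustration; this is not a Nextflow feature or file format.

```python
def diff_manifests(first, second):
    """Return {component: (version_in_first, version_in_second)} for
    every component whose version differs or that is missing from one
    side (a missing component is reported as None)."""
    components = sorted(set(first) | set(second))
    return {
        name: (first.get(name), second.get(name))
        for name in components
        if first.get(name) != second.get(name)
    }

# Hypothetical manifests for two runs of "the same" analysis.
workflow_a = {"aligner": "2.1.0", "quantifier": "0.8.2", "base_image": "debian-8"}
workflow_b = {"aligner": "2.1.0", "quantifier": "0.9.0", "math_library": "1.4"}

differences = diff_manifests(workflow_a, workflow_b)
for name, (version_a, version_b) in differences.items():
    print(f"{name}: {version_a} -> {version_b}")
```

Listing the differences is the easy part. The hard part, which no simple diff can do, is deciding which of them ought to be irrelevant to the conceptual result and which are potential confounds.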
These ideas are scarcely new. But Nextflow and other contemporary systems have advanced to the point where it is realistic to really do these things, and do them well.
This stuff is really hard, but really, really important.
- Paolo Di Tommaso, Maria Chatzou, Evan W. Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame, Nextflow enables reproducible computational workflows. Nature Biotechnology, 35 (4):316-319, April 2017. http://dx.doi.org/10.1038/nbt.3820