The July issue of the DLib Magazine (“The Magazine of Digital Library Research”) is dedicated to software preservation, with several useful (if depressing) articles.
As I have frequently pointed out, almost all contemporary scholarship and science depends on digital software and data. If we wish to continue the traditional mandate to publish and preserve our studies for others to replicate, critique, and build on, then we must (somehow) publish the digital artifacts they are built on.
The special issue of DLib has good coverage of these challenges, but not much in the way of solutions. Preserving software is very hard for many reasons, and resources are scarce. Academic libraries are on the front lines of this quixotic mission.
Fernando Rios of Johns Hopkins University presents a “planning” tool that lays out strategies for preserving software. The article is short on solutions, but it is a pretty good survey of the things you need to think about. It also has a useful collection of references at the end.
He identifies some of the key challenges, including “identifying and capturing metadata, dependencies, support for attribution and citation, infrastructure development, and developing appropriate workflows to enable service provision.” He references other works that discuss, for example, “How to cite and describe software” and “Minimal information for reusable scientific software” and so on.
Think about it. If you needed to record exactly how you did your work, whatever it is, what would you need to write down? In the case of digital software and data, what exactly would you need to tell people so that they could really check your work?
My own view is that this problem is predicated on the thorny issue that we don’t know how to describe software, let alone explain which parts are “important”. Software is one of the most complex artifacts humans have ever created, and every bit of software is interconnected and depends on other software. (Even isolated systems depend on the software that was used to create them.)
Where should the circle be drawn to delineate “my software” from all the rest?
Furthermore, the dependencies are difficult to describe, if they are even known. Ideally, I should know that my software used a particular library, and made specific calls to it. But it is difficult to know what the library did, or what software it depended on, or even where it may have executed. It is also unlikely that we can say how the dependent software was created, and we may not know exactly how it works.
Given the complexity and “size” of software, it isn’t even reasonable to expect any person to know these things.
Yet, a full description of my research results requires a description of these crucial tools, which I can’t really do.
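A partial record is still possible, and it shows how shallow such a record is. The following is a minimal sketch, assuming the work was done in Python; the function name `snapshot_environment` is hypothetical and not taken from any of the cited articles. It writes down the interpreter, the operating system, and the installed package versions. It says nothing about what those packages call internally, which compilers built them, or which system libraries they load, which is exactly the gap described above.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Return a dict describing the interpreter, OS, and installed packages.

    This captures only the top layer of the dependency stack: names and
    versions of Python distributions visible to this interpreter.
    """
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip damaged installs that carry no name
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the record as JSON so it can be stored next to the results.
    print(json.dumps(snapshot_environment(), indent=2))
```

Storing this output alongside a result at least tells a later reader which versions to look for, even if it cannot tell them how those versions behaved.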
Everyone knows that digital artifacts have the maddening properties of being evanescent, yet impossible to delete. Even if I describe the digital tools I used today, they will no longer exist tomorrow in precisely the same form. In many cases, it isn’t even meaningful to talk about “replicating” digital work, because it is essentially a flow of events that will never be repeated exactly.
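One can at least detect when the tools have drifted. The sketch below is an illustration under my own assumptions, not a method from the DLib articles: it records a SHA-256 checksum for every file under a directory and compares two such records. The names `build_manifest` and `changed_files` are hypothetical. It can say that something changed, not what the change means.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """List files that were added, removed, or modified between manifests."""
    names = set(old_manifest) | set(new_manifest)
    return sorted(
        name for name in names
        if old_manifest.get(name) != new_manifest.get(name)
    )
```

Comparing a manifest made today with one made next year gives a list of files that are no longer the same bits.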
Archivists tackle these problems by recording snapshots of digital systems, and attempting to store away a “working copy” of each. As the world changes out from under it, this requires virtualization and emulation, creating replayable copies of whole systems.
Virtualization/emulation is a heroic enterprise, no doubt. But is this really achieving the goal of preservation? Yes, you can recreate some form of the original work, but can you reuse it? Can you even understand it? (I have seen far too many cases where a piece of software only works on one peculiar virtual machine that no one can figure out or recreate. This quickly degenerates into a magic black box, which is intellectually unsatisfactory and probably wrong, too.)
Other articles give yet more challenges, including the difficulty of getting the people who understand (more or less) the objects to do the work of describing and recording them. In addition, various legal licenses place roadblocks in this path, as do persnickety editors and reviewers. And, of course, none of this hard work is covered in the “deliverables” that sponsors are willing to pay for.
The life of an archivist is difficult and poverty-stricken. The life of a digital archivist is impossible.
Enough ranting for now.
- Neil Chue Hong, Minimal information for reusable scientific software. 2014. https://figshare.com/articles/Minimal_information_for_reusable_scientific_software/1112528
- Mike Jackson, How to cite and describe software. The Software Sustainability Institute, Edinburgh, 2012. https://www.software.ac.uk/how-cite-and-describe-software
- Fernando Rios, The Pathways of Research Software Preservation: An Educational and Planning Resource for Service Development. DLib Magazine, 2016. doi:10.1045/july2016-rios http://www.dlib.org/dlib/july16/rios/07rios.html