These projects all share one basic rationale: some tasks are hard for computers, yet easy for people. Galaxy Zoo is an early, successful, and influential project that asks people to perform several image classification tasks, such as categorizing the shape of galaxies. The idea is that computer algorithms are ineffective at this task compared to humans, and there is so much data to process that professional astronomers cannot even dream of looking at it all.
Their studies indicate that large numbers of untrained people (i.e., volunteers from the Internet) provide data that is useful and comparable to alternatives such as machine processing. (Crowdsourcing may also give faster turnaround.) Similar methods have been applied to a variety of image processing tasks in a number of domains, including interpretation of handwriting from old documents.
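To make the mechanics concrete, here is a minimal sketch of how many volunteer answers for one image might be collapsed into a single label. I am assuming a simple majority vote; I don't know that Galaxy Zoo or any other project aggregates this way, and the function name and the sample votes are invented for illustration.

```python
from collections import Counter


def majority_label(votes):
    """Return (winning label, fraction of volunteers who chose it).

    Assumption: a plain majority vote. Real projects may weight
    volunteers or use more elaborate consensus rules.
    """
    if not votes:
        raise ValueError("need at least one vote")
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Hypothetical votes from five volunteers looking at one galaxy image.
votes = ["spiral", "spiral", "elliptical", "spiral", "merger"]
label, agreement = majority_label(votes)
print(f"{label} (agreement {agreement:.0%})")
```

The agreement fraction is worth keeping alongside the label, since a 60% consensus is much weaker evidence than a 100% one.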
In all these cases, the whole enterprise hinges on the claim that human processing beats the computer, at least at the price point (which is generally around zero dollars). These claims are clearly contingent on the specific task, and on the state of technology (and funding).
For example, recent advances in face recognition algorithms (driven, no doubt, by well-financed national security needs) have dramatically changed this calculus in the realm of analysis of digital imagery of human faces. Low-cost, off-the-shelf software can probably beat human performance in most cases.
This is actually one of the continuing technological stories of the early twenty-first century: the development of algorithms to meet and exceed human perception and judgment. Part of the “big” news in “Big Data” is the ways that it can outperform humans.
One example of these developments is CoralNet, from U. C. San Diego [1, 2].
It is now possible to survey large areas of coral reef quickly, generating large amounts of data. From this data, it is important to identify the type of coral and other features, both to understand the ecology of the coral and its surroundings and to monitor changes over time. It isn’t feasible to hand annotate this data, so automated methods are needed.
The CoralNet system annotates digital imagery of coral reefs, identifying the type of coral and state of the reef. The basic idea is to use machine learning techniques to train the computer to reproduce the classifications of human experts. How well does that work?
The Silicon Valley approach would be to assert that they have “disrupted” coral identification, and rush out a beta. Real scientists, however, actually study the question, and publish the results.
In the case of CoralNet, there have been several studies over the past few years. For example, Oscar Beijbom and colleagues published a detailed analysis of the performance of human experts and the automated system [2]. Additional details appear in Beijbom’s thesis [1].
The study found variability among human analysts (to be expected, but often overlooked), and determined that the automated system performed comparably to human raters. These papers are a good example of the careful work that is needed to validate digitally automated science.
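For readers who want to see what "performed comparably to human raters" could mean in numbers, here is a small sketch using Cohen's kappa, a standard chance-corrected agreement score. This is my choice of metric for illustration, not necessarily the one used in the CoralNet papers, and the labels below are made up.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Agreement expected if each rater labeled at random with their own frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for the same eight image points (not real data).
expert_a = ["coral", "coral", "algae", "sand", "coral", "algae", "sand", "coral"]
expert_b = ["coral", "algae", "algae", "sand", "coral", "algae", "sand", "sand"]
machine = ["coral", "coral", "algae", "sand", "coral", "sand", "sand", "coral"]

print("expert vs expert :", round(cohens_kappa(expert_a, expert_b), 3))
print("expert vs machine:", round(cohens_kappa(expert_a, machine), 3))
```

The comparison that matters is between the two lines: if the machine agrees with an expert about as well as a second expert does, it is fair to call the machine "comparable."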
Since the 2014 study, the software has been improved and updated. CoralNet 2 improved the speed to the point that it is 10 to 100 times faster than human classification. This speedup is significant, making data available quickly enough to understand changes to the reefs. Combined with automated data collection (e.g., autonomous submarines), it is now possible to continuously monitor reefs around the world.
It seems obvious to me that crowdsourcing à la Zooniverse would not be warranted for this case. The computer processing is now good enough that human raters, even thousands of them, are not needed.
I note that even in the domain of ocean ecology, there are many examples of simple analysis tasks. For example, “Seafloor Explorer” crowdsourced the classification of images of the seafloor, identifying material and species. This is basically the same task that CoralNet automates, though looking for different targets.
I’m pretty sure that machine learning algorithms could match or exceed the crowdsourced results of Seafloor Explorer. (It may or may not be feasible to develop the system, of course.)
The point is that crowdsourcing science is not a panacea, nor are there any problems that will, for certain and always, be done better by Internet crowds. My own suspicion is that crowdsourcing (at least the “Galaxy Zoo” kind) will fade within a decade, as machine learning conquers every perceptual task.
And since I brought it up, I’ll also note the challenges these techniques pose to reproducibility. Human crowdsourcing is, by definition, impossible to reproduce. Classification via machine learning may be difficult to reproduce as well, especially if the algorithm is updated with new examples.
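One partial mitigation for the machine learning case, sketched here under my own assumptions, is to stamp every annotation with a fingerprint of the training set that produced it, so a later reader can at least tell which version of the classifier made a given call. Nothing below is CoralNet's actual practice; the function name, image identifiers, and record layout are hypothetical.

```python
import hashlib
import json


def training_fingerprint(examples):
    """Short, order-independent hash of a labeled training set.

    `examples` is a list of (image_id, label) pairs. Sorting first means
    the same examples in a different order give the same fingerprint.
    """
    canonical = json.dumps(sorted(examples), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical training sets before and after adding one new example.
training_v1 = [("img_001", "coral"), ("img_002", "algae")]
training_v2 = training_v1 + [("img_003", "sand")]

# Each stored annotation records which training set was behind it.
annotation = {
    "image": "img_100",
    "label": "coral",
    "trained_on": training_fingerprint(training_v1),
}
print(annotation)
print("retrained model differs:",
      training_fingerprint(training_v1) != training_fingerprint(training_v2))
```

This does not make the results reproducible by itself, but it makes it visible when two annotations came from classifiers trained on different data.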
1. Oscar Beijbom, Automated Annotation of Coral Reef Survey Images. Ph.D. Thesis in Computer Science, University of California, San Diego, 2015. http://www.escholarship.org/uc/item/0rd0r3wd
2. Oscar Beijbom, Peter J. Edmunds, Chris Roelfsema, Jennifer Smith, David I. Kline, Benjamin P. Neal, Matthew J. Dunlap, Vincent Moriarty, Tung-Yung Fan, Chih-Jui Tan, Stephen Chan, Tali Treibitz, Anthony Gamst, B. Greg Mitchell, and David Kriegman, Towards Automated Annotation of Benthic Survey Images: Variability of Human Experts and Operational Modes of Automation. PLOS ONE, 10 (7):e0130312, 2015. http://dx.doi.org/10.1371%2Fjournal.pone.0130312