No Big Data story is more famous than Google’s claim to be able to track flu outbreaks in real time, much faster than conventional public health surveillance.
In Science, Lazer and colleagues present an analysis and critique of this claim and of the actual performance of Google Flu Trends.
Their finding is that Google Flu Trends (GFT) consistently overestimates the incidence of flu. In other words, the real-time trigger is "too sensitive", beating the conventional signals in part by "crying wolf".
These errors matter, because this kind of real-time prediction is supposed to enable resources to be swiftly deployed, reacting to epidemics much more quickly than slower conventional methods allow. But if the real-time prediction is a false positive, those resources will be misallocated, and the deployment effort wasted or misdirected.
To the extent that these errors can be assessed (see below), they appear to be due to the use of poor correlates. The GFT is based on analysis of search terms thought to be related to (i.e., correlated with) the outbreak of flu, such as queries about symptoms and medications. This query behavior is only partly driven by actual symptoms; it may, for instance, be triggered by "winter". (No points will be awarded for detecting winter via Google searches.) It is also possible that social phenomena, such as media hype, can increase interest and fears regardless of symptoms.
Obviously, not everything that correlates with flu accurately predicts an actual outbreak. Worse, the candidate predictors number some 50 million search terms, while the series being predicted contains only a few thousand data points; overfitting is almost guaranteed.
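A toy sketch (my own, not GFT's actual method) shows why this ratio of candidate terms to observations is dangerous: even when every "search term" is pure noise, screening enough of them will turn up some that correlate strongly with the target by chance alone. The numbers here are scaled down for speed.

```python
import random

random.seed(0)

n_obs = 100        # a few "weekly flu" observations
n_terms = 20_000   # many candidate "search terms" (GFT screened ~50 million)

# Pure noise: the target and every candidate predictor are independent.
target = [random.gauss(0, 1) for _ in range(n_obs)]

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# The single best correlation found among thousands of noise series.
best = max(
    correlation([random.gauss(0, 1) for _ in range(n_obs)], target)
    for _ in range(n_terms)
)
print(f"best chance correlation among {n_terms} noise terms: {best:.2f}")
```

A model built by keeping the best-correlated terms from such a screen will fit the history well and then fail on new data, which is one plausible mechanism behind GFT's drift.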
The errors in GFT's predictions were quite substantial. In fact, the "bad old" conventional reporting, though not real time, was more accurate than GFT at projecting the actual occurrence of flu. This should not be a surprise, since those projections were carefully designed.
Naturally, combining GFT or similar data with other surveillance will be better than either alone. GFT would also be more useful if standard statistical methods were applied to model and reduce its errors.
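One such standard method can be sketched in a few lines: recalibrate the noisy real-time signal against the slower "gold standard" series using ordinary least squares on past weeks. All of the numbers below are invented for illustration; this is not the paper's actual model.

```python
# Hypothetical history: conventional (CDC-style) flu rates, and a
# real-time estimate that systematically overshoots, as GFT did.
cdc = [1.0, 1.4, 2.1, 3.0, 2.6, 1.8, 1.2]
gft = [1.6, 2.3, 3.4, 4.9, 4.1, 2.9, 1.9]   # roughly 1.6x too high

# Ordinary least squares: fit cdc ≈ intercept + slope * gft.
n = len(cdc)
mean_g = sum(gft) / n
mean_c = sum(cdc) / n
slope = (sum((g - mean_g) * (c - mean_c) for g, c in zip(gft, cdc))
         / sum((g - mean_g) ** 2 for g in gft))
intercept = mean_c - slope * mean_g

# Correct this week's real-time estimate before acting on it.
this_week_gft = 3.8
corrected = intercept + slope * this_week_gft
print(f"raw estimate: {this_week_gft:.1f}, recalibrated: {corrected:.1f}")
```

The point is not this particular regression but the discipline: a real-time signal with known, measurable bias can be corrected against the conventional series instead of being trusted raw.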
But there are problems in any attempt to use the GFT itself as real data.
The GFT is irreproducible, because it has never been adequately reported: the data are unavailable for study, and the algorithms are closed and ever-changing. The GFT could not be published in any reputable scientific journal, and it is difficult to see how it could be validated.
This critique extends to many large-data studies: however spectacular the headlines, it is difficult to make useful technology out of opaque, semi-magical processes. I have remarked on the psychology of Big Data, which offers a secular form of prophecy. Lazer et al. call this tendency "Big Data Hubris": the assumption that big data is always better.
This also demonstrates why we need publicly sponsored science: however wonderful this technology might be, it is owned by Google (or Facebook, or Twitter, or whoever), and they release only what they wish to let out. We have no way to replicate, compare, or even understand exactly what they did. Whatever this is, it isn't science (and it isn't transparent). Is it evil? Possibly.
I see the GFT as an example of Google's overall attitude: a love for quick-and-dirty methods resting on the unquestioned assumption that more data is better than anything else, even careful theory and modeling; "open source" that gives people access to selected data, but complete opacity about the actual data and algorithms. "Trust us" is not good enough for data science.
I would also comment that in GFT and elsewhere, Google has touted the importance of empirical data and analytics for understanding health and safety (and for selling advertising). But in the case of Google Glass, the company has presented no data at all to demonstrate that the device is safe to use, or even useful. We have been given lots of hype and anecdotes, which must make their data scientists cringe.
As I have pointed out several times, there is strong reason to worry that Glass is bad for people, possibly producing eye damage and almost certainly causing distraction. Yet Google has presented no data, and has never publicly said that it has collected, or intends to collect, such data.
Combining these cases, we see a cavalier and selfish attitude toward science and public safety. A major, monopolistic, for-profit company has a right to act in these ways. But if it does, then it had better not BS about "not being evil", because it is, at best, selfish and amoral (just as a capitalist company should be).
David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. "The Parable of Google Flu: Traps in Big Data Analysis." Science 343 (14 March): 1203-1205. Copy at http://j.mp/1ii4ETo