One of the most famous “Big Data” cases has been Google Flu Trends, which uses Google search data and metadata to infer outbreaks of influenza or other diseases. The initial reports were enthusiastic, claiming to detect outbreaks much more quickly than conventional surveillance statistics.
Clairvoyance! Prophecy! Magic! Private Sector Rulz!
It sounds too good to be true, and–wait for it–it is too good to be true. It quickly became apparent that GFT is more sensitive to outbreaks because it is “biased” toward detection, resulting in a high rate of false alarms.
This is not surprising to anyone who does data analysis, nor to anyone who does remote sensing. You absolutely need “ground truth” to confirm your inferences from data.
Sober assessment indicates that GFT actually measures something like “public talk” about a disease rather than actual outbreaks. This shows up clearly in other “outbreaks” that briefly made the news far from any actual cases, e.g., the Ebola scare outside Africa.
This sort of data is correlated with confirmed illness, and possibly useful for understanding it, but it is not the same thing.
Fortunately, we don’t have to rely on GFT alone; we can combine it with the boring old statistics. Surely if “Big Data” is good, then Big Data + Even More Data is better, no?
This week Davidson, Haim, and Radin published a nice study illustrating this point. They don’t just use clinical reports to confirm suspected outbreaks; they use data from previous (real) epidemics to model how the disease spreads. This model helps adjust for the inherent biases in the search statistics by sorting out actual spread in the population from media effects.
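To make the idea concrete, here is a toy sketch of this kind of debiasing, not the authors’ actual method: a simple discrete-time SIR model stands in for the mechanistic component, a hypothetical “search volume” series is inflated by a media bump unrelated to illness, and a least-squares fit against a few sparse clinical reports recovers a corrected estimate. All numbers and variable names here are invented for illustration.

```python
# Illustrative sketch only -- NOT the method of Davidson et al.
# A toy SIR epidemic plays the role of the disease model; the "search"
# series is the biased Big Data signal; sparse "anchors" are clinical reports.

def sir(beta, gamma, s0, i0, steps):
    """Discrete-time SIR; returns the fraction infected at each step."""
    s, i = s0, i0
    curve = []
    for _ in range(steps):
        new_infections = beta * s * i
        recoveries = gamma * i
        s, i = s - new_infections, i + new_infections - recoveries
        curve.append(i)
    return curve

true_i = sir(beta=0.5, gamma=0.3, s0=0.99, i0=0.01, steps=20)

# Hypothetical search volume: proportional to infections, plus a media
# scare around days 4-7 that has nothing to do with actual illness.
search = [2.0 * x + (0.05 if 4 <= t <= 7 else 0.0)
          for t, x in enumerate(true_i)]

# Sparse clinical reports ("ground truth" anchors) every 5 days.
anchors = {t: true_i[t] for t in range(0, 20, 5)}

# Least-squares scale from search volume to clinical incidence,
# fit only on the anchored days.
num = sum(search[t] * v for t, v in anchors.items())
den = sum(search[t] ** 2 for t in anchors)
scale = num / den

corrected = [scale * x for x in search]

# The corrected series tracks the clinical curve far better than raw search.
err_raw = sum(abs(s - v) for s, v in zip(search, true_i))
err_adj = sum(abs(c - v) for c, v in zip(corrected, true_i))
```

The point of the sketch is the division of labor: the model says what an epidemic curve *can* look like, and the ground-truth reports pin the biased signal to it, which is exactly the role clinical surveillance plays in the real study.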
More data is always better than less data, good data is better than bad data, and a solid theoretical understanding trumps simple “bigness”.
1. Davidson, Michael W., Dotan A. Haim, and Jennifer M. Radin. “Using Networks to Combine ‘Big Data’ and Traditional Surveillance to Improve Influenza Predictions.” Sci. Rep. 5 (2015). http://dx.doi.org/10.1038/srep08154