Several recent books about data and analytics are worth our attention.
The Physics of Wall Street by James Owen Weatherall
Boston: Houghton Mifflin Harcourt, 2013.
Automate This: How Algorithms Came to Rule Our World by Christopher Steiner, New York: Portfolio/Penguin, 2012.
Two recent popular books riff on the theme of “algorithms”. These books don’t dwell on data, per se, though everything they talk about depends on the availability of the right kinds of digital data.
Steiner gives a general introduction to algorithms, with a somewhat “gee-whiz” history. Most of the explanation is readable, and not too badly amiss, though I got tired of being told how complex and magical and mathematical algorithms are. Most of them are complex but not very magical, and many are mathematical only in the general sense that they may be expressed formally, not because they use anything especially difficult mathematically. Steiner mostly focuses on business applications, with one chapter on algorithmic medicine. Unfortunately, this book doesn’t really explain the algorithms very well, or even try.
Weatherall’s book is much better on that point. He reviews the history of ideas from physics and mathematics that have migrated to finance. Many of the “gee-whiz” ideas are familiar to me, or to anyone who was a technical student in the ’70s and ’80s, or a reader of Scientific American in those years. These include Shannon’s information theory, Mandelbrot’s fractals, Farmer et al.’s complexity theory, etc. And he brings things up to date with some recent work by Didier Sornette on “Dragon Kings”.
What is new in both Steiner and Weatherall is a rendition of how these ideas moved into pragmatic use by algorithmic traders, and how they were applied. Much of this history is hidden (successful traders do not tell what they are doing), and, partly as a consequence, the pattern was repeated many times: flaky academic gets cool idea, academia doesn’t grok it, flake decides to go make piles of money on Wall Street.
The best feature of Weatherall’s book is that the algorithms are explained pretty well. Some of them are black boxes which work on really short time scales. Others are more transparent, though atheoretical (i.e., they may work, but they don’t explain the phenomenon).
I have some quibbles, of course. As a physicist, Weatherall emphasizes the wonders of the algorithms. He pretty completely ignores the other key piece of the puzzle, specifically computer technology. None of these geniuses would have got very far, or anywhere at all, without the stupendous advances in computing and networking (says someone who helped bring those advances into being, and gave them away for free).
This blind spot is pretty obvious in his manifesto at the end, with its tongue-in-cheek title, “send physics, math, and money”. I’m sure that a physicist believes that the solution to the world’s problems is more funding for physics. The rest of us will want our slice of that imaginary pie, thank you.
I also find his prescription for fixing the financial system and world economy to be preposterous. No, the problem isn’t that bankers and regulators aren’t thinking like physicists. The problem is that the system is corrupt, conflicted, and doesn’t want rational, scientific management. The rich and powerful want to get more, and have the means to ensure that the system is rigged for their benefit. More physics just isn’t relevant to the basic problem.
Big Data: A Revolution That Will Transform How We Live, Work, and Think by Viktor Mayer-Schönberger and Kenneth Cukier, Boston: Houghton Mifflin Harcourt, 2013.
Here we get to a real potboiler on the topic of “Big Data”. M-S & C are computer scientists, and they want to make a point about what is now technically possible. It’s not specifically the size of the data set per se; it is more of a mindset that they extol. “Get all the data”, they say. Look for correlations, don’t worry about causation. Don’t be afraid of messiness.
This is the world of Amazon and Google, where large amounts of apparently trivial data are mashed together and attacked with bazillions of computer models to discover obscure, fine-grained correlations.
The authors use the phrases “treasure hunt” (and “fishing expedition”), which are extremely apt metaphors. The psychology of this enterprise is that there is “treasure” out there in the massive glob of data, valuable gems awaiting discovery and exploitation. We don’t know what is there, or where, so we search broadly and try to grab onto everything. And it is a “race” to find the treasure before someone else does. And, by the way, the “treasure” is free for the taking by the hunters, and to hell with the natives who might live there.
The book does give a decent rendition of the elements of this approach. “More, messy, good enough” is the mantra; correlation is king; prediction is the big game.
They have an interesting point: what is “good enough” for your purposes? Clearly, if you want to, say, advertise, you don’t really care why it works, you only care whether you get results or not. So it may be “good enough” to categorize people down to a class of size one, and use that to target ads, no matter what the classification might “mean”, if anything.
We can think of examples where this concept works and doesn’t work.
Contemporary speech recognition and speech generation systems perform awesome feats, close enough to human to be useful (e.g., see Pentland below). Yet they are based on very complex pattern recognition discovered through analysis of massive amounts of data (and requiring really fast computers to keep up). Nothing in these algorithms tells us much about how humans speak and hear, nor what human language is in any real sense. It turns out that it doesn’t matter that we still haven’t the foggiest clue how humans do it, we can make machines that are “good enough” at it, using slews of data.
On the other hand, contemporary weather forecasting has achieved amazing accuracy, and will get even better as more computing power becomes available (see Weatherall, 2013, Ch. 5). Ensemble forecasts use the same technology as data mining, with the very important difference that they are run with models based on deep theoretical understanding of how weather actually works, i.e., not black boxes based on correlations. Correlation will not work for this problem—and we know exactly why it won’t, because we have good theory. And, by the way, getting slews and slews of messy data would really, really not help. A stream of temperature readings from every cellphone in the world would be unlikely to generate a good weather report, let alone a forecast.
The point being, you really can’t say “all we need is data”. There are many problems where this approach isn’t good enough. But, of course, M-S & C’s point is that we can get good enough solutions to some problems through treasure hunting. Our intuitions are a poor guide to what will work, so we need to be open to “big data” methods.
Now I had some serious problems with their notions of methodology. They advocate this correlation-based fishing expedition approach as an improvement over conventional hypothesis testing.
“In a small-data world, because so little data tended to be available, both causal investigations and correlation analysis began with a hypothesis, which was then tested to be either falsified or verified. But because both methods required a hypothesis to start with, both were equally susceptible to prejudice and erroneous intuition. And the necessary data often were not available. Today, with so much data around and more to come, such hypotheses are no longer crucial for correlational analysis” p. 61
Huh? Obviously, these guys aren’t familiar with the cognitive psychology literature. All human thought is based on assumptions, prejudices, and, yes, errors. Part of the point of hypothesis testing was to try to formally state these assumptions, and draw limited conclusions about them. If you find a study bogus because it began with stupid hypotheses, at least you know about them and can discard the study.
Fishing expeditions based on black box analytics are surely riddled with biases and errors, but it is very difficult to know, since they aren’t explicitly stated. Furthermore, the treasure hunter surely has unstated hypotheses about what would constitute “gold”. So, even if the analytics are unbiased and error free—a hypothesis that is certainly not self-evident—the users of the data clearly are not.
To be fair, the authors do not defend the extreme view that “theory is dead”, acknowledging that there are many reasons why you need to know how things really work.
Their discussion of sociology is pretty bogus (the “demise of the expert” is mostly silly), and the discussion of business models is shallow and probably useless (middle-sized companies will be squeezed out? What the heck does that even mean?). Ironically, much of this discussion is based on “intuition”, not data—exactly what they claim is outdated and doomed to fail. There is precious little data relevant to these claims about the recent past and near future of companies.
The treatment of “risks” is shallow and naïve.
“[Data] can become an instrument of the powerful…” p. 101
Well, duh! And the discussion of free will is pretty much irrelevant. The actual issue is whose interests will be served, and you know very well the answer to that question.
Interestingly, they totally omit the important issues of what these technologies imply for creativity and intellectual property law. See Plotkin, The Genie in the Machine (Plotkin, 2009), and Lessig, Remix: Making Art and Commerce Thrive in the Hybrid Economy (Lessig, 2008).
Overall, the book is interesting, but flawed.
“The possession of knowledge, which once meant understanding of the past, is coming to mean an ability to predict the future.” p. 190
Let’s look at prediction more closely.
The Victory Lab: The Secret Science of Winning Campaigns by Sasha Issenberg. New York: Crown Publishing Group, 2012.
The Signal and the Noise: Why So Many Predictions Fail — but Some Don’t by Nate Silver, New York: Penguin Press, 2012.
These are two very interesting books about contemporary applications of data-intensive prediction.
Issenberg portrays how recent political campaigns have used unprecedented amounts of data to create algorithms that guide political campaigns. As one reviewer tagged it, “moneyball for politics”: using dozens of variables about people to individually target political messages, tailored to persuade, to extract contributions or effort, and to get out the vote.
Though the book was written in 2011, the technology it describes was used in the 2012 US elections, notably by Barack Obama’s successful reelection campaign. Instead of blanketing large areas with general messages, the campaign focused on very specific areas, and located, convinced, and delivered individual voters. As one (Republican) commentator said, “they found voters we didn’t even know existed”. It is reported that the result of the election would have been reversed were it not for the results in three counties (containing the cities of Cleveland, Miami, and Philadelphia), and these counties were intensely combed for every possible Obama vote.
The goals of the game are pretty simple, and the general idea is not hard to understand: you want to find voters who might be persuaded, and predict what might convince them. You want to find supporters who might give money or time, and predict how to get the most from them. You need to identify supporters who are not registered voters, and get them registered. And you need to identify supporters who are likely not to vote, and get them to vote. At the same time, you want to avoid helping voters who likely support the opponent.
The tools are data and algorithms which use the characteristics of known supporters to infer support, or the lack of it, for as many people as possible. The data include obvious things like political and related behaviors, pretty obvious things such as social and economic factors, and perhaps some non-obvious factors.
Interestingly, the models are very dynamic, and are updated continuously on the basis of new data, especially voter contact. That is, you use old-fashioned person-to-person contact to validate and revalidate the model, every day.
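The general shape of such a continuously updated model can be sketched in a few lines. This is purely illustrative; the campaigns’ actual models are proprietary and far richer, and the features, weights, and learning rate below are all made up:

```python
import math

def support_score(features, weights, bias):
    """Logistic model: map a voter's features to a 0-1 support probability."""
    z = bias + sum(weights[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def contact_update(weights, bias, features, observed, lr=0.5):
    """Nudge the model toward the outcome of an actual voter contact
    (observed: 1.0 = supporter, 0.0 = not). Mutates weights; returns new bias."""
    err = observed - support_score(features, weights, bias)
    for k, v in features.items():
        weights[k] += lr * err * v
    return bias + lr * err

# Hypothetical features and starting weights, for illustration only.
weights = {"registered_party": 1.0, "voted_last_midterm": 0.5}
bias = -0.5

voter = {"registered_party": 1.0, "voted_last_midterm": 0.0}
before = support_score(voter, weights, bias)
bias = contact_update(weights, bias, voter, observed=1.0)  # a call finds a supporter
after = support_score(voter, weights, bias)
# After the confirming contact, similar voters score higher than before.
```

Each real-world contact acts like one small gradient step, so the model drifts, day by day, toward what canvassers actually find at the door.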
Another interesting thing in The Victory Lab is the development of techniques for individually targeting messages, based on empirical studies as opposed to ideology or gut instincts. This takes the concepts used by Amazon and Google and the rest, and applies them to optimize political communication.
Nate Silver has become the king of political predictions, and is one of the people who would tell you that these methods really worked in 2008 and 2012. Silver is famous for his predictions for US national elections, based on models that use polling and other data.
Silver’s book is more general: he talks about how to make predictions, and especially, how to use data profitably. His examples include baseball, economic prediction, poker, and weather, as well as politics.
His tool is Bayesian reasoning, which depends on data but is actually a theory of decision making. For those who love it, Bayesian reasoning is the way the world works. Silver’s book is a wonderful walk through how to think carefully and make predictions successfully.
“[Learning to make predictions] often begins with having the right data, the right technology, and the right incentives. You need to have some information—more of it rather than less, ideally—and you need to make sure it is quality controlled. You need to have some familiarity with the tools of your trade—having top-shelf technology is nice, but it’s more important that you know how to use what you have. You need to care about accuracy—about getting at the objective truth—rather than making a pleasing or convenient prediction, or the one that might get you on television.” p. 313
It is interesting to compare his views on data with the “Big Data” folks. First of all, Bayesian reasoning requires hypotheses. There is nothing else to reason about except hypotheses.
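A toy calculation makes the point concrete. The coin example below is mine, not Silver’s: without explicitly stated hypotheses, there is literally nothing for Bayes’ rule to operate on.

```python
from math import comb

# Two explicit hypotheses about a coin, with P(heads) under each.
hypotheses = {"fair": 0.5, "biased": 0.75}
prior = {"fair": 0.5, "biased": 0.5}

# Data: 8 heads observed in 10 flips.
heads, flips = 8, 10
likelihood = {h: comb(flips, heads) * p**heads * (1 - p)**(flips - heads)
              for h, p in hypotheses.items()}

# Bayes' rule: posterior is proportional to prior times likelihood.
evidence = sum(prior[h] * likelihood[h] for h in hypotheses)
posterior = {h: prior[h] * likelihood[h] / evidence for h in hypotheses}
# posterior["biased"] comes to about 0.87: the data shift belief
# toward the biased-coin hypothesis, but only relative to the stated prior.
```

The data never speak for themselves; they only re-weight hypotheses someone had to write down first.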
Furthermore, more data is not necessarily a boon, especially if you don’t have any causal understanding. Discussing the deplorable record of economic predictions, he comments:
“…improved technology did not cover for the lack of theoretical understanding about the economy; it only gave economists faster and more elaborate ways to mistake noise for signal.” p. 198
Silver gives a clear explanation of the dangers of overfitting, i.e., discovering “signals” (correlations) in the noise. The larger and messier your data, the worse these problems become.
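The hazard is easy to demonstrate with made-up, pure-noise data (a sketch of the general problem, not any example from Silver’s book): search enough random series and one of them will “correlate” strongly with your target by sheer chance.

```python
import random

random.seed(42)  # reproducible made-up data

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# A short "target" series of pure noise, and 1,000 candidate
# predictors that are also pure noise: no signal exists anywhere.
target = [random.gauss(0, 1) for _ in range(10)]
candidates = [[random.gauss(0, 1) for _ in range(10)] for _ in range(1000)]

best = max(abs(pearson(c, target)) for c in candidates)
# `best` comes out impressively high, despite being meaningless.
```

Bigger, messier data only multiplies the number of candidate series, and with them the number of spurious “treasures”.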
Silver’s book is certainly recommended.
Honest Signals: How They Shape Our World by Sandy Pentland, Cambridge, Mass.: MIT Press, 2008.
As long as we are on the topic of data and prediction, here’s another great book that everyone should read.
Pentland’s work here used sensor-equipped handheld devices and similar systems that can detect rudimentary behaviors of their users, including who is nearby, when they are speaking, and the flow of conversations. From relatively simple measures, his “sociometer” is able to detect patterns and make predictions about human interpersonal interactions with astonishing accuracy and profound implications for many human activities.
For example, in one study he was able to predict the results of a job interview based on mere seconds of recorded speech; in fact, the prediction rests on the pace and patterns of the speech, not the content. Amazing. Similarly, the results of a “speed dating” game were predicted based on astonishingly small samples of data.
The data and analysis are small enough to run on a personal device such as a phone.
The key idea is that the data he looks at represent unconscious behavioral signals, which are difficult to conceal, yet easy for the computer to detect. In other words, it doesn’t take much data, just the right data.
Data intensive, yes, but is this “big data” or not?
Pentland’s work collected data from hundreds of people, totaling hundreds of thousands of hours of data. But the data are pretty simple, e.g., who interacts with whom and when, the pace of speech and pauses, etc. This is not a gigantic dataset.
Furthermore, this was the data used to develop and validate the algorithms.
But the actual algorithms are pretty simple, and work from data that is available from a properly equipped device such as a phone. If it runs in real time on a phone, it can’t be all that big!
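The flavor of such “simple data” is easy to sketch. The following is illustrative only, not Pentland’s actual sociometer code; it reduces a made-up one-minute voice-activity timeline to the kind of features described above:

```python
def conversation_features(activity):
    """Reduce a per-second voice-activity timeline (1 = wearer speaking,
    0 = silent) to simple conversational features: how much the wearer
    talks, in how many bursts, and how long those bursts run."""
    total = len(activity)
    speaking = sum(activity)
    # A "turn" starts wherever silence flips to speech.
    turns = sum(1 for i, s in enumerate(activity)
                if s == 1 and (i == 0 or activity[i - 1] == 0))
    return {
        "speaking_fraction": speaking / total,
        "turns": turns,
        "avg_turn_seconds": speaking / turns if turns else 0.0,
    }

# One minute of made-up activity: three bursts of talk.
timeline = [1] * 10 + [0] * 5 + [1] * 20 + [0] * 15 + [1] * 6 + [0] * 4
feats = conversation_features(timeline)
# -> speaking 60% of the time, in 3 turns averaging 12 seconds each
```

A few counters over a binary signal; nothing here strains a phone’s processor.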
By the way, you probably should check out whatever Sandy Pentland is up to these days.
It goes on and on, more books and articles every day.
Of the books covered here, I’d rate them:
|Anything by Sandy Pentland||Must read|
|“The Signal and the Noise” by Nate Silver||Should read|
|“The Victory Lab” by Sasha Issenberg||Should read|
|“The Physics of Wall Street” by James Owen Weatherall|| |
|“Big Data” by Viktor Mayer-Schönberger and Kenneth Cukier|| |
|“Automate This” by Christopher Steiner||Readable|
The basic thing to remember is that data alone, big or not, isn’t a substitute for thought or understanding. But you should never trust intuition if you can get data to confirm or refute it. Use Bayesian thinking here, and, like Nate Silver, try to care about accuracy.