Category Archives: Data Mining

NPR Report on Chinese Hacking Raises Interesting Questions

This month NPR reported on Chinese hacking of US data sources [2].  Reports of major data breaches are hardly news, nor are links to Chinese hackers.

The interesting part is speculation on what the goal of all this activity might be.

The article makes the case that Chinese hackers have accumulated, and continue to accumulate, massive amounts of data about Americans (they also have massive amounts of data about their own citizens, of course).

But what is it for?  While sanctioned countries might need cash, the Chinese government doesn’t need to fiddle around with credit card fraud or identity theft.  And, frankly, even if you monetized all of this stuff, it’s hardly a drop in the bucket for the Chinese government. 

The NPR report suggests that all this data is fodder for large AI analyses.  The Chinese government collects and collates vast amounts of data on its own people.  It isn’t far-fetched to imagine using stolen data to create a similar effort to monitor the US and other countries.

And from what we know has been stolen, they have extensive data about almost all of the US.

I’ll also note that businesses in China have developed extremely powerful data analytic capabilities as a part of the global supply chain [1].  Manufacturers and distributors in China can tell you in detail what is going on in the US based on orders from US companies.  In many cases, they know more than the American companies do.  (Chinese businesses can watch Fox News all they want, but have no access to other US outlets, so their vision is rather skewed.)

While the US and other countries are open for China to read and analyze, outside knowledge of China has rapidly closed off [1].  The pandemic accelerated the process, to the point that we have almost no reporting on, or contact with, what is going on inside China.

So, we are blind and they can see everything.  In the Information Age, this is pretty much total dominance by China.

What could they do with this stuff?

For one thing, they almost certainly know everyone who works for the US government, and what they do.  Bad news for secret agents.  They also know all the key people in any organization of interest, and what they do.  Who has the keys to the vault at the bank?  Chinese intelligence probably knows. Heck, they may know the combination already, but they certainly know who to spy on to find out.

But what I would try to do, and what I think they are going to try to do, is make a complete, detailed model of the US.  Everyone and everything.  This model will, of course, reveal US military and strategic capabilities, and likely reveal “secret” plans before they have even started.  It will also predict decision making at all levels, from local school boards to the White House.

The ultimate goal, I say, would be to be able to manipulate such a model to influence and control the behavior of the US.  We have already seen how well-informed information war can influence big elections.  A more complete model could influence every decision-making process.  This would constitute complete control of the infosphere, and near complete control of the US.

Sigh.

I haven’t got a conclusion.  Data breaches can’t be undone, and likely won’t stop. The horses have already been stolen, and the barn has no door to close anyway, because so much of the US is open. 

It is worth noting that the optimists who believed that the Internet could never be censored, nor blockaded at the border, are being proved wrong.  China, especially, is succeeding in creating a separate infosphere that they control and we can’t see.

From this perspective, all the paranoid ranting about Silicon Valley kind of misses the point.  Google, Facebook, Apple, Amazon, et al are annoying.  But they are basically in it for the money.  Great powers are in it for power.

Honestly, I can’t blame China or any other state for wanting to protect themselves from other states, including the US. It’s what I would want to do in their place.

And I’m sure that many Chinese strategists, along with their Russian counterparts, consider this fair turnabout.  The US lorded it over much of the world for many decades, including through data collection and social manipulation.


  1. Peter Hessler, The Peace Corps Breaks Ties with China, in The New Yorker. 2020. https://www.newyorker.com/magazine/2020/03/16/the-peace-corps-breaks-ties-with-china
  2. Dina Temple-Raston, China’s Microsoft Hack May Have Had A Bigger Purpose Than Just Spying, in NPR – Investigations, August 26, 2021. https://www.npr.org/2021/08/26/1013501080/chinas-microsoft-hack-may-have-had-a-bigger-purpose-than-just-spying

“Autocorrect” is hazardous to science

In retirement, I can’t help but look at the world today and wonder at all that my generation of computer programmers brought into the world.  Even your grandma knows what a URL is.  We did that.  And so much more!

But so much of what we did has wreaked havoc on the world, usually as unforeseen side effects.  The Internet was not intended to kill off local record stores.  We never expected people to get their news of the world from unvetted sources on the net.  “Viruses” were irritating vandalism that mainly came from bored undergraduates, not weapons of mass destruction.

The list goes on.  Civilization may not survive our brilliance.

This summer Dyani Lewis reports on yet more unintended havoc—spreadsheets “autocorrecting” scientific data [1].

Let’s be clear.  Autocorrect is one of the coolest features we ever created.  For those of us who remember writing and typing B.A. (before autocorrect), it’s freaking magic!  Sure, we still get things wrong, and sometimes autocorrect makes hilarious gaffes.  But it catches most of the “normal” mistakes our fingers make, to the point that we seldom notice. Now that’s cool!

The problem is, sometimes we need to actually print a specific non-word sequence of characters, such as when we are writing about genetics, which is rife with standard representations like BRCA, or DNA sequences, which are long strings of ACGTs.

Apparently Microsoft Excel and other software “catches” these strings and guesses that they should be, for instance, dates, or floating point numbers.  The gene symbol SEPT2 becomes the date 2-Sep, and so on.  So when you open a data file with the app, it kindly shows you nonsense.  Oops. And if this rewrite is read into a program, the mistake is irreversible: the reading program has no clue what the original was, it only has junk.
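
If you must open these files programmatically, the safest move is to turn type inference off entirely.  Here is a minimal sketch in Python (pandas assumed; the file name is made up):

```python
# A minimal sketch (pandas assumed; the file name is made up): read a
# gene table with all type guessing disabled, so symbols like "SEPT2"
# stay literal text instead of becoming dates.
import pandas as pd

# dtype=str keeps every cell as a string; keep_default_na=False stops
# pandas from turning the string "NA" into a missing value.
genes = pd.read_csv("expression_table.csv", dtype=str, keep_default_na=False)
print(genes.head())
```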

A few such cases are just funny, but sprinkling junk throughout gigabytes of data is a real problem for data processing and analysis.  Mangled data may well be dropped from analysis, which is even more of a problem because some strings will consistently be mangled and lost, biasing results with little warning to the humans trying to interpret them.

It’s hard enough to share data without our tools mangling it along the way!

I’ll note that there now must be many terabytes of data and analyses out there that probably have these errors in them. Or had errors that have since been corrected, if you can find the corrected dataset.

I’ll also note that large scale machine learning will blithely use these datasets, and will happily discover “patterns”. If the datasets have not been cleaned up, the junk will be learned along with the real data. Uh, oh.

I gather that there is a growing literature and a minor industry has risen concerned with detecting and undoing the “helpful” autocorrections.  Glancing at the mitigations, they look pretty kludgy.  It’s a contest between computers being stupid and humans trying to undo the mess.  My money is on the computer’s stupidity every time.
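
To give the flavor of those kludges, here is a toy detector of my own (a sketch, not a vetted tool) that scans a gene-symbol column for entries that look like Excel’s date output:

```python
# Toy detector (a sketch, not a vetted tool): flag entries in a
# gene-symbol column that look like dates rather than genes.
import re

DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$"  # e.g., 2-Sep
    r"|^\d{4}-\d{2}-\d{2}$"   # e.g., 2006-09-02
    r"|^\d{5}$"               # plausible Excel serial date
)

def suspicious(symbols):
    """Return the entries that look like mangled dates, not gene symbols."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]

print(suspicious(["BRCA1", "2-Sep", "TP53", "2006-03-01", "38961"]))
# -> ['2-Sep', '2006-03-01', '38961']
```

Of course, detection is the easy half.  Once the original string has been overwritten, undoing the damage requires guesswork, or the original file.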


  1. Dyani Lewis, Autocorrect errors in Excel still creating genomics headache, in Nature – News, August 13, 2021. https://www.nature.com/articles/d41586-021-02211-4

Study Identifies World Music “Outliers”

Computers have a long and deep connection with music, and have from the beginning. This isn’t surprising to me, because humans will make music with any tool they have (and with their bodies if they have no tools at hand).  People use digital computers and networks to make, perform, fiddle with, and share music, just as they have made music with every technology ever invented.

Some people also like to use digital techniques to classify and otherwise analyze music (e.g., as “Music Information Retrieval”).  This applies widely used learning, classification, and searching schemes to digital representations of music.

I have never really been able to be excited about this topic, myself.

A recent study from Queen Mary University of London reports on “outliers in world music.”  Using a large collection of digital recordings from around the world, the study looked for examples that are “different”, standing out from other music [1].

The research uses widely used data mining techniques, adapted to musical “objects”.  At bottom, this works from summary descriptions of each music sample. There are a huge, if not infinite, number of ways to describe music, so the researchers had to select a handful of characteristics.  (This is common practice, and it generally works OK.)
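
To make “summary descriptions” concrete, here is roughly what such a feature vector might look like.  This is a sketch of the general idea, not the paper’s actual pipeline (librosa and the file name are my assumptions):

```python
# A rough sketch of a "summary description" for one recording.
# librosa and the file name are assumptions for illustration only.
import numpy as np
import librosa

def describe(path):
    """Summarize a recording as one fixed-length feature vector."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # timbre
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)      # pitch content
    # Collapse the time axis: mean and spread of each descriptor.
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        chroma.mean(axis=1), chroma.std(axis=1),
    ])

vec = describe("recording.wav")   # hypothetical file
print(vec.shape)                  # (50,) = 13+13+12+12
```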

The overall goal is to find “outliers”, which they interpret as especially novel or creative examples.  If you have 100 songs, and one or two stand out statistically, there might be something really interesting about those two.
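
The outlier hunt itself is standard data mining.  A sketch with scikit-learn, using random stand-in features where the study would use per-recording descriptors:

```python
# Sketch of the outlier step, with random stand-in "recordings".
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))   # 100 recordings, 50 summary features
X[:2] += 6.0                     # plant two obvious oddballs

labels = IsolationForest(contamination=0.02, random_state=0).fit_predict(X)
print(np.where(labels == -1)[0]) # should recover the planted rows 0 and 1
```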

The research binned the music by country of origin.  This is a very approximate identification for the cultural tradition that the music might be associated with.  So, they found “outliers” within a given country, and also could report countries with relatively high numbers of outliers.  For example, in their collection, Botswana had large numbers of outliers.


This paper made me think a bit. The title and press release [2] piqued my interest in “world music”, and made me wonder what an “outlier” would mean. But looking at the report, I see a lot of limitations.

I’m not sure what significance these “outliers” may have. The researchers imagine that these cases somehow represent innovation or creativity.  But the classification is such a blunt instrument that it’s not clear how “innovative” these examples may be, or whether other equally “creative” samples are not flagged, because they are different in ways that are not detected.

The methodology is “blunt” for many reasons. It’s a small and unsystematic sample. Yes, these are large databases, with enough data to do statistics.  But it is hard to know how representative these samples are.  The entire idea that there is some kind of Platonic ideal for, say “Brazilian music”, is lunacy of the first order.

This limited sample probably doesn’t matter too much, because the extracted features probably obscure its biases anyway.  The features used are only loosely justified, and there is no particular reason to think that they are specifically related to “creativity” or even to differences between musical traditions. Whatever is being classified here, it isn’t obvious that it has much to do with musical creativity, at least not everywhere and at all times.

(Ironically, the methodology is also “too sharp” in a crucial way.  The classification techniques are so powerful that they will find something.  They find outliers and groupings, whether such conclusions are meaningful or not.)

The “world” part of the study is not exactly what I expected.  To me, “world music” means local music that is enjoyed in lots of places other than its home.  This study seems to define it as some kind of expression of aboriginal, pre-colonial, pre-mass-communication culture.  Taking this as the definition, it is certainly misleading to ‘bin’ music by country.  Countries are scarcely mono-cultural, and, by the way, minority “outliers” are often suppressed.  Finding “outliers” at the country level is interesting, but probably not indicative of “creativity” so much as of stereotypes and the vagaries of the collection methods.

Finally, the entire notion that local folk music is somehow generated from a pure, unsullied local culture is highly questionable.  For centuries, musical cultures have been travelling and mixing around the world, and in the twentieth century mass communication has allowed music to spread nearly instantly nearly everywhere.

I would say that some of the most important “innovation” has been in the creative response to all these different sources.  At the very least, this means that an “outlier” in one country might be an import that would be middle of the road at home, or that an imported hybrid in one country might be a shining outlier in its country of origin.  But these cases aren’t found by this study at all.

For example, consider American Jazz.  This music developed from many geographical roots, and has now spread throughout the world, influencing many musical styles.  So, everything that is influenced by jazz will be classified as somewhat similar, and less likely to be an “outlier”.  On the other hand, a pedestrian cover of a familiar standard might be flagged as an “outlier” compared to the rest of the “traditional” music of the country, less directly copied from overseas.  Either way, it misses the whole point, that Jazz has influenced and been influenced by many people, everywhere.

The point is, the methods of this study aren’t a very good way to find meaningful “outliers”.  And whatever this study is about, it probably isn’t finding anything interesting about “innovation” or “creativity”.  For that matter, it doesn’t really describe culture or music very well at all.


  1. Maria Panteli, Emmanouil Benetos, and Simon Dixon, A computational study on outliers in world music. PLOS ONE, 12 (12):e0189399, 2017. https://doi.org/10.1371/journal.pone.0189399
  2. Queen Mary University of London, Computational study of world music outliers reveals countries with distinct recordings, in Queen Mary University of London – News. 2017. http://www.qmul.ac.uk/media/news/2017/se/computational-study-of-world-music-outliers-reveals-countries-with-distinct-recordings.html


Marcus and Davis Op-Ed Is Right But Off Target

In an NYT op-ed today, Gary Marcus and Ernest Davis give “Eight (No, Nine!) Problems With Big Data”.  This is at least partly a response to the recent book “The Naked Future” by Patrick Tucker, which I have not read yet. They also cite the Science article by Lazer and company, which I have commented about.

The points they make are clear and obvious, and I don’t argue with them.

However, the comment is a bit off target: it could have been titled “correlation still isn’t causation”.

I have commented that “Big Data” is a nebulous and unregulated term, which can lead to hype and non-sequiturs.

In this case, Marcus and Davis attack “Big Data” analysis that is based on pretty blind correlation.  This is truly the most problematic and least useful version of Big Data, for the reasons they give.

But this has to do with the analysis methods, not the data itself.  Cherry picking correlations out of gazillions of variables has never been a useful method, especially if you have little understanding of where the data comes from.  In some sense, the Big News from Big Data is that sometimes you can get a little bit of use from such blunt instruments–and sometimes you get “magical” results, too cool to believe.
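
Here is a toy illustration of why blind cherry picking is so treacherous.  The data is pure noise, yet an impressive “correlation” always turns up:

```python
# Pure noise, many variables: the "best" correlation looks impressive.
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 10_000          # 100 observations, 10,000 junk predictors
target = rng.normal(size=n)
junk = rng.normal(size=(p, n))

corrs = np.array([np.corrcoef(target, row)[0, 1] for row in junk])
print(f"best |correlation| among pure noise: {np.abs(corrs).max():.2f}")
# Typically around 0.4: "significant"-looking, and utterly meaningless.
```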

However, mindless correlation studies are not the only analytic methods possible. In fact, the success stories are actually complex cases of model building, using massive data in the construction of models, which are applied to massive amounts of data to generate interesting predictions or understandings.

For example, I’ll refer you to Sandy Pentland’s works. Massive data, sí.  Mindless correlation, no.

Correlation still isn’t causation, no matter how big our data.  And “garbage in” still produces “garbage out”, even if you have absurd amounts of garbage.

But most important, “Big Data” is about Data+Analytics, and you need to work hard on both aspects to get good results.

Pandora Sucked Into the Same Old Exploitation

Is there a viable business model for the Internet that isn’t based on obnoxious “targeted advertising”?  Not that I can see.

The latest example of this “evil” appears to be Pandora, originally designed to put human radio DJs out of business (which is Evil).  This week it is reported that they will now be selling the personal preferences of their customers to advertisers.  In this they join many other companies, including Google, Netflix, and Shazam. Everybody does it, so it must be OK.

I note that this data is created by the users, not Pandora, specifically for the benefit of the users.  What moral right do they have to profit from this information?

The nastiest thing is that data mining musical preferences (probably) can give considerable prediction about other behavior, including, they say, political leanings. After a year of griping about the US government mining data for arguably legitimate purposes (military defense), how can people swallow this kind of invasive rip-off?

Personally, I’ve never used these services because I don’t really want an algorithm mirroring back my day-to-day preferences for music.  My tastes are wide and ever changing. The last thing I want is to be channeled into whatever I wanted to listen to yesterday.

Also, I really liked human DJs on local radio stations, thank you very much.

Sigh.  Having helped create the foundations of this technology, I despair to think of the Karma I have to pay for all this evil unintentionally set loose upon the world.

Review of “Who Owns the Future” by Jaron Lanier

Jaron Lanier, Who Owns the Future?, New York, Simon & Schuster. 2013.  http://www.jaronlanier.com/futurewebresources.html

Jaron Lanier is a strange and interesting guy. I’ve never met him in person, but he’s been around forever, and always seems to be doing something interesting.  His new book, “Who owns the future?” is a blockbuster, hitting on a bunch of things I’ve been worrying about, with authoritative insight.  Like me, he is implicated in the development of today’s Internet, and like me, he feels a responsibility to try to make it  better (a “humanistic information economics”, in his case).  Unlike me, he has some fairly deep and broad ideas that deserve to be implemented.

To summarize the problem as Lanier states it, today’s networked systems are built wrong, designed to create winner-take-all super servers (which he terms “Siren Servers”).  The problem with this is that the servers make money by taking data from everyone for free, and passing risk to everyone else.  Too much is “forgotten”, taken off the books, no longer counted as “value”, and otherwise fraudulently not accounted for. This is not just wrong, it is unsustainable, because it is demonetizing the value created by most people, shrinking the overall economy.  Even the super servers can’t last long in the business of destroying value.

Lanier wants to create a “humanistic” architecture, with humans at the center.  “Information is people in disguise, and people ought to be paid for the value they contribute that can be steered or stored on a digital network.” (p. 245) From this principle, the whole argument flows.

Of course, you must read his book to get the details.

As technologists, we cannot be content with just complaining, nor can we pretend that the horse can be returned to the barn.  Lanier gives a thoughtful set of “tweaks” which change the way networks do business.  The crux of the matter is to pay for everything of value that happens.  Concretely, this means that any data produced by you will earn you a micropayment.  On the other side of the coin, everyone must pay reasonable amounts for what they do on the net.  The argument is that this will vastly enlarge the digital economy, and will provide a way for humans to create a dignified life.

I liked this book for many reasons.  Some of the technical details hit on topics I’ve encountered in my own career.  The basic technical feature missing from the WWW is two-way linking, which was posited in Ted Nelson’s Xanadu from the 1960s, if you can believe it.  (I think Lanier might want to be a “nelsonite” if such things were possible. “Our huge collective task in finding the best future for digital networking will probably turn out to be like finding our way back to approximately where Ted was at the start.” (p. 221))

Two-way links never get stale, and you automatically know who is linking to you, so you can trace back.  This feature is so important that, as Lanier points out, Google and others scrape the whole web every day to compute these relations, which should have been engineered in from the start.  Sigh.
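
The difference is easy to show.  A toy illustration of my own (nothing to do with any real Web code): if every link is registered in both directions, “who links to me?” is a lookup, not a crawl.

```python
# Toy two-way link registry: every link is recorded in both directions,
# so back-links are a lookup instead of a whole-web crawl.
from collections import defaultdict

class LinkRegistry:
    def __init__(self):
        self.out_links = defaultdict(set)    # page -> pages it links to
        self.back_links = defaultdict(set)   # page -> pages linking to it

    def link(self, src, dst):
        self.out_links[src].add(dst)
        self.back_links[dst].add(src)        # the step the Web skipped

    def who_links_to(self, page):
        return self.back_links[page]         # no scraping required

reg = LinkRegistry()
reg.link("blog.example/post", "xanadu.example/home")
print(reg.who_links_to("xanadu.example/home"))   # {'blog.example/post'}
```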

I clearly recall discussions at NCSA in the early days of Mosaic about one-way links (the WWW) versus two-way links (favored by information scientists, librarians, and anyone who understood actual information systems).  We could have built in two-way links, but we didn’t, because it would have been difficult and would have slowed the viral dissemination of Mosaic and related technology.  Even then, zillions of free downloads was the metric of success, regardless of sustainability.

I’m not the only one who knows this.  The noise about the “semantic web” is mainly due to the fact that it enables arbitrary, multi-way linking—even better than two-way links.  Even Tim Berners-Lee quickly realized the flaw in his one-way link architecture (Berners-Lee, T., J. Hendler, and O. Lassila, The Semantic Web. Scientific American, 284, 5 (2001) 35-43.).

Lanier also nods at the importance of provenance, which I learned much about from another sensei, Jim Myers, in 2005-11 (e.g., Myers, J.D., The Coming Metadata Deluge. In: New Collaborative Relationships: The Role of Academic Libraries in the Digital Data Universe Workshop. (2006)).  Also, Lanier’s micropayments concept has been invented as a solution to the related problem of citations and attribution in scholarly work (e.g., see the work out of Ben Shneiderman’s lab (Jennifer Preece and Ben Shneiderman, The Reader-to-Leader Framework: Motivating Technology-Mediated Social Participation. AIS Transactions on Human-Computer Interaction, 1, 1 (2009) 13-32.)).

By the way, Lanier has much to say about 3D printing (I hadn’t thought about the coolness of “unprinting”–using the 3D print programs in reverse to recycle objects.  Wow!)  But even he is falling behind:  at one point he wonders if you will go to your local library where there will be public access 3D printers.  The answer is “yes”, and in fact you already can. For example, our local public library has fabrication equipment, though they are still working out what kinds of services to offer.

Basically, I’m saying Lanier’s technical analysis is sound, whether he cites all the academic sources or not.

Of course, as a grumpy old guy, I was greatly entertained by JL’s dope slapping the business practices of today’s “siren servers”.  Lanier is not amused by almost anything on the Internet, and knows exactly how they work, so you would be well advised to read his critique.  He gives us a blistering faux EULA  (pp. 79-82) and gets grouchy about the future of the book (pp. 354-8), and starts everything off with a very dark science fiction future vision of the virtual world (pp. 1-3).  Ouch.

Lanier also provides an interesting perspective on “Big Data”, differentiating between “Big Science Data” (which is accurate and very hard work) and “Big Business Data” (which is sloppy, possibly not correct, but very valuable) (see Chapter 9).  It is also “stolen”, in that the sources are not paid. This is actually a very useful distinction, because the terminology is so confused, mixing very legitimate advances (e.g., scientific modeling) with bogosity (e.g., fiddling with pricing on a Web store).

I also enjoyed his insider’s version of life in Silicon Valley.  I never moved to California (my version of “humanistic” living involved having a home town), so I never did understand a lot of the crazy stuff coming out of there.  Lanier gives some history, showing the ties between the Bay Area New Age culture, and the Internet, quite visible now in the form of the Singularity University and related religious manias. If you read only one part of the book, read  pp. 211-231.  Seriously, it’s worth getting the book, just for this section.

If there were any doubt that this is a current topic, see George Packer’s article in The New Yorker about Silicon Valley’s political culture and its surprisingly incompetent entry into US politics (George Packer, Change the World: Silicon Valley transfers its slogans—and its money—to the realm of politics, in The New Yorker. May 27, 2013, pp. 44-55).  Packer notes the isolation of the techies from the communities they live in, starkly apparent in their sealed, inwardly facing campuses.  How could we expect anything sensible from such a broken environment? Yet they believe they are the future for everyone.

For that matter, judging from reviews, Google’s Eric Schmidt and Jared Cohen’s new book, The New Digital Age (Schmidt, E. and J. Cohen, THE NEW DIGITAL AGE: Reshaping the Future of People, Nations and Business, New York, Alfred A. Knopf. 2013.), has a lot to say on the same topics.  (Sorry, I haven’t read it yet.) I seriously doubt that they will agree with Lanier on most points.  (As much as I distrust Mr. Assange, I suspect his review of TNDA is probably much more interesting than the book itself.)

And while we are on the topic of “humanistic” computing, let’s look at Kevin Kelly’s massive “What Technology Wants” (Kelly, K., New York, Penguin Group. 2010).  (You can tell it’s going to be interesting, because Jaron Lanier provided a cover blurb, “It isn’t often that a book is so important and well crafted that I feel compelled to urge everyone to buy it and read it even though I profoundly disagree with aspects of it. … You can’t understand the most important conversation of our times without reading this book.”)

Kelly’s general thesis is to take the viewpoint of technology as if it were an autonomous being, to try to understand what it “wants”—where it comes from, where it is going to, how it really works, why it sometimes fails spectacularly.

I’m not sure I agree with Kelly’s approach, but I liked it a lot for his social psychological perspective.  In particular, you should read Chapter 11, “Lessons of Amish Hackers”. Kelly has spent considerable time with Amish friends, and presents a revealing and helpful explanation of how they approach technology.   “In contemporary society our default is set to say yes to new things, in Old Order Amish communities the default is set to ‘not yet’.” (p. 218)  At the bottom, “…is the Amish motivation to strengthen their communities.” (p. 218)  Buy the book for this chapter alone.

Part of Kelly’s point is that everyone should be as conscious as the Amish are about technology, in particular about technology uptake.  This is a brilliant insight, and has made me feel much better about my idiosyncratic adoption of tech.  I’ve been acting Amish, and not even knowing it.

So, by our own different paths, Brother Kevin, Brother Jaron, and Brother Bob all end up in just about the same place.  There must be something there, huh.

So what have we learned?

Let’s look at something that crossed my eyeballs earlier this week. Apparently Google is “teaching” people how to organize their content in ways that will help Google.  Google isn’t clear about why I would want to do this, but I get the idea from gurus such as Terri Griffith, “How we can help Google better track our websites”. I guess this is supposed to be a reasonable motivation.

Let’s look at this offer with Lanier’s “humanistic” principle.  The Siren Server (in this case Google) wants you to donate your labor to help them make money.  Your benefit, if any, is that they will be able to use your data better, maybe get a few more people to look at your content–via Google who gets their eyeballs first.  The value you have added to Google is demonetized, and you do not get any of the wealth Google might generate.

Let’s look at how Bob’s grumpy bad attitude would apply.  What would I charge if Google wanted to hire me to provide structured content for them?  Well, who knows, but my general rates for corporate consulting are, like $250 per hour.  So why would I give this to Google for free?  I don’t get it.

To borrow from Lanier’s blurb for Kelly,

You can’t understand the most important conversation of our times without reading ‘Who Owns the Future?’.


[Note: This post was updated on 24 March 2014 to fix a couple of broken links.]

Flavor of the month: “Big Data”

“Big Data” is the flavor of the month this year, as the popular media are discovering the power of data analytics. A recent post by Terri Griffith pushed me to work on this note. (Thanks, Terri!)

This phenomenon isn’t news to anyone with a decent technical education any time in the last few decades, but the combination of piles of money (Wetherall, 2013), political power (Steiner, 2012), and some romantic storylines (movies such as “The Social Network” (Sony Pictures Digital, 2011), books such as Moneyball (Lewis, 2003), and popular manias) has captured the attention and imagination of the chattering classes, including Wired magazine, the New York Times, the academic literature (the April issue of Communications of the ACM has this and this), and undergraduate education (my friend Sally Jackson).

Now, I’ve been a “data guy” since the 70’s, during which time computers have become more available and more capable, and Bayesian statistics have come into fashion (Good, 1983). During the 90’s I did some work with scientific data, digital libraries, and the beginnings of the WWW. For example, I coined the phrase, “conversation with the data” to describe what we wanted to do with the Web, long before it was possible to do it (McGrath, Futrelle, Plante, & Guillaume, 1999).

In science and engineering, ‘Big Data’ is old news. In fact, we were hyping it long before the mainstream media caught the bug (“data deluge”, “metadata deluge”, and so on; Hey & Trefethen, 2003; Myers, 2006; Rajasekar et al., 2003; Schatz et al., 1997; Thomas & Cook, 2005).

“Big” is a psychological concept, it means “at or beyond what I can conveniently handle”. Given the escalation of computing and bandwidth, “big data” has been a rapidly moving target. The biggest news in “big data” is that it is now possible to get big enough samples of data to a) develop many interesting kinds of models (algorithms) and b) target very fine grain populations, down to individuals. (I.e., we can do data big enough to target individual humans.) The other news is that computing power is cheap enough that you can throw massive, nearly random, analyses at huge blobs of data, and get some answers in a reasonable time. (I.e., we can be stupid about data analysis and still get rich).

The most important point of all is that the real story isn’t “data”; it is, as the title of Niklaus Wirth’s classic textbook (Wirth, 1976) put it, “algorithms + data structures”. And, as everyone knows, data can be misused, or self-fooling, especially if you don’t understand what you are doing, aren’t being honest, or don’t think critically.

A continuing stream of recent books present popular versions of this story, for better or worse. Let’s review a few, with some definite recommendations. There will doubtless be more to consider in the future.

Remote Sensing Plus Neural Nets == Black Magic

One of the most awesome technical developments of the last 20 years has been the maturation of neural nets applied to multispectral and hyperspectral remote sensing.  While these techniques have obvious military uses, they are being used for a variety of scientific investigations, including planetary exploration, geology, and now, paleontology.

NASA has released a nice summary of some really cool work that uses Landsat imagery (note—this is 40+ year old technology!) to locate fruitful sites to look for new fossils. The basic idea uses the 7 spectral bands of the Landsat data as input, and trains a neural net to learn the signature of known fossil beds.  The trained network is then applied to a wider area to identify similar surface geology, which should offer much better than average places to look for new fossils.
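
In outline, the technique is simple enough to sketch.  This is my own illustration with made-up numbers, not NASA’s actual pipeline: treat each pixel as a 7-number spectral signature, and train a small net on pixels from known sites.

```python
# Hedged sketch with fabricated numbers, not NASA's pipeline: each pixel
# is a 7-band spectral signature; train on pixels from known sites.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X_fossil = rng.normal(loc=0.6, scale=0.1, size=(200, 7))   # known fossil beds
X_other = rng.normal(loc=0.4, scale=0.1, size=(800, 7))    # everything else
X = np.vstack([X_fossil, X_other])
y = np.array([1] * 200 + [0] * 800)

net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
net.fit(X, y)

# Apply to a wider scene: score every pixel, keep the promising ones.
scene = rng.normal(loc=0.5, scale=0.15, size=(10_000, 7))
promising = np.where(net.predict_proba(scene)[:, 1] > 0.9)[0]
print(f"{len(promising)} pixels worth a field visit")
```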

Tres cool!  And pretty amazing considering the rudimentary data provided by Landsat: the visible spectrum (plus near IR) is binned into just seven bands.

Bear in mind that contemporary sensors have far more channels, 32, 64, and soon 200 or more.  More channels means much, much more discrimination, so things will only get better.  (See, for example, information in wikipedia  about Landsat 7 multispectral versus new hyperspectral sensors.)

And neural nets have been around a long time, though only in the last 20 years has there been enough computing power to make them practical.  While generally modeled on human nervous systems, artificial neural nets are interesting because they often surprise us, discerning patterns we didn’t or can’t see.  (For a fictional account of one such case, see the spacecraft antenna episode in Blind Lake by Robert Charles Wilson. 2003)

I recall conversations with Prof. Erzsébet Merényi at Rice.  She is a leading expert in this field, and has demonstrated amazing results in NASA-funded studies.  She related to me that a 200-channel hyperspectral sensor is capable of distinguishing the model of a car from the spectrum of its paint—Ford paint is different from Toyota paint, and this can be detected from airborne or satellite imagery. (Other info about Prof. M’s work can be found here and here.)