“Autocorrect” is hazardous to science

In retirement, I can’t help but look at the world today and wonder at all that my generation of computer programmers brought into the world.  Even your grandma knows what a URL is.  We did that.  And so much more!

But so much of what we did has wreaked havoc on the world, usually as unforeseen side effects.  The Internet was not intended to kill off local record stores.  We never expected people to get their news of the world from unvetted sources on the net.  “Viruses” were irritating vandalism that mainly came from bored undergraduates, not weapons of mass destruction.

The list goes on.  Civilization may not survive our brilliance.

This summer Dyani Lewis reports on yet more unintended havoc: spreadsheets “autocorrecting” scientific data [1].

Let’s be clear.  Autocorrect is one of the coolest features we ever created.  For those of us who remember writing and typing B.A. (before autocorrect), it’s freaking magic!  Sure, we still get things wrong, and sometimes autocorrect makes hilarious gaffes.  But it catches most of the “normal” mistakes our fingers make, to the point that we seldom notice. Now that’s cool!

The problem is that sometimes we actually need to write a specific non-word sequence of characters.  Genetics, for example, is rife with standard abbreviations such as BRCA, and with DNA sequences, which are long strings of A, C, G, and T.

Apparently Microsoft Excel and other software “catch” these strings and guess that they should be, for instance, dates or floating-point numbers.  So when you open a data file with the app, it kindly shows you nonsense.  Oops.  And if the rewritten file is then read into a program, the mistake is irreversible: the reading program has no clue what the original was; it only has junk.
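To make the point concrete, here is a minimal Python sketch of the difference between reading a file as plain text and letting a tool guess types.  The sample data and the function name `read_as_text` are my own inventions for illustration, and the `float()` call is only a stand-in for type guessing in general, not a description of what Excel actually does internally.

```python
import csv
import io

# Hypothetical sample file: gene symbols plus an identifier that happens
# to look like a number in scientific notation.
SAMPLE = "gene,clone_id\nMARCH1,2310009E13\nSEPT2,0001234\n"


def read_as_text(handle):
    """Return every row as a list of strings; csv.reader converts nothing."""
    return list(csv.reader(handle))


rows = read_as_text(io.StringIO(SAMPLE))
for row in rows:
    print(row)

# A type-guessing reader might instead do something like this, after which
# the original spelling of the identifier can no longer be recovered
# from the stored value alone.
original = rows[1][1]
guessed = float(original)
print(original, "->", guessed)
```

The design choice is simply to do no guessing at all: every field stays a string until a human or a program that knows what the column means decides otherwise.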

A few cases are merely funny, but sprinkling junk throughout gigabytes of data is a real problem for data processing and analysis.  Mangled data may well be dropped from analysis, which is even more of a problem because some strings will consistently be mangled and lost, biasing results with little warning to the humans trying to interpret them.

It’s hard enough to share data without our tools mangling it along the way!

I’ll note that there now must be many terabytes of data and analyses out there that probably have these errors in them. Or had errors that have since been corrected, if you can find the corrected dataset.

I’ll also note that large-scale machine learning will blithely use these datasets, and will happily discover “patterns”. If the datasets have not been cleaned up, the junk will be learned along with the real data. Uh-oh.

I gather that there is a growing literature, and that a minor industry has arisen, devoted to detecting and undoing the “helpful” autocorrections.  Glancing at the mitigations, they look pretty kludgy.  It’s a contest between computers being stupid and humans trying to undo the mess.  My money is on the computer’s stupidity every time.
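For a flavor of what detection can look like, here is a minimal sketch in Python.  It is my own illustration, not any published tool: the patterns, the name `flag_suspects`, and the sample column are all assumptions.  It only flags values that look like a day-month date (gene symbols such as MARCH1 and SEPT2 are widely reported to end up as "1-Mar" and "2-Sep") or like a number in scientific notation.

```python
import re

# Day-month strings such as "1-Mar" or "2-Sep" (an assumed, simplified format).
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(rf"^\d{{1,2}}-(?:{MONTHS})$", re.IGNORECASE)

# Scientific notation with an explicit sign, such as "2.31E+13".
SCI_LIKE = re.compile(r"^\d(?:\.\d+)?E[+-]\d+$", re.IGNORECASE)


def flag_suspects(symbols):
    """Return (index, value) pairs whose value looks like a date or a float."""
    suspects = []
    for i, value in enumerate(symbols):
        text = value.strip()
        if DATE_LIKE.match(text) or SCI_LIKE.match(text):
            suspects.append((i, value))
    return suspects


# Hypothetical column of gene identifiers, some already mangled.
column = ["BRCA1", "1-Mar", "TP53", "2.31E+13", "2-Sep"]
suspects = flag_suspects(column)
for index, value in suspects:
    print(f"row {index}: {value!r} looks autocorrected")
```

Note that this only detects the damage.  Undoing it is harder, because more than one original string may map to the same mangled value, which is part of why the fixes look kludgy.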


  1. Dyani Lewis, Autocorrect errors in Excel still creating genomics headache, in Nature – News, August 13, 2021. https://www.nature.com/articles/d41586-021-02211-4
