An Obvious Thought Experiment About Machine Learning Is Confirmed Experimentally

This is actually very good news for the long run.

The first principle of computer science—long before it gets to anything scientific—is: “Garbage In, Garbage Out”, AKA GIGO.

In this year of the AIbot, everyone is getting a demonstration of this crucial principle.

The current generation of machine learning models, which are getting so much attention for better or worse, owes much of its behavior to its inputs. ChatGPT and DALL-E, like earlier language and image models, learned from vast datasets collected from the Internet.

Anyone who has spent their life worrying about careful sampling, data cleaning, data provenance, and all that boring stuff can only shudder at how ad hoc this sample of human language and visual imagery really is. No one really knows what is out there on the Internet, or where it came from. And we know that a lot of it is junk. But clearly, there is a lot of it, whatever it is.

I guess that ChatGPT and friends work pretty well, considering that there is no reason they should work at all, given their training sets.   It’s no wonder these AI models get stuff wrong.  They are trained to predict what the Internet would do, and the Internet hasn’t a clue.

The recent moves to release AIbots to widespread use have revealed the extreme incompetence of these models (combined with extreme overconfidence and a “fluent pomposity” that proves that ChatGPT’s pronouns are definitely “he/him/his”).

This explosion of use has, of course, been posted back to the Internet in many forms. Some of it is flagged as the product of AI, some is deliberately misidentified as human work, and some is blended with human text and imagery to form hybrid natural/ML-generated products.

Which means that next year’s training sets from the Internet will contain an unknown, but substantial, amount of AI output mixed in among the “natural” human products.  So, next year’s ML will be learning, in part, to predict what this year’s ML predicted people would say or images would show.

This year’s “Garbage Out” is part of next year’s “Garbage In”.

Sigh.

Everyone knows that this is theoretically a bad thing.  If the training sets weren’t garbage before, they soon will be converted into garbage (and posted to the Internet).  Or, as Matthew S. Smith put it, “The Internet Isn’t Completely Weird Yet; AI Can Fix That” [3].

But will this really happen?  And if so, how much difference does it make?  And how long will it take to notice?

As they say on MythBusters, “It’s time for some science!”


This spring, researchers reported that this effect is very real, and that it can happen very fast.

One study from Spanish researchers tested image generation in the worst case, where the output of the model is fed directly back in as training data for the next generation, with no other inputs [1].  In this case, some of the results degrade with each “generation”, becoming a total blur by the fourth iteration.

Another study from the UK describes “a degenerative process affecting generations of learned generative models, where generated data end up polluting the training set of the next generation of models; being trained on polluted data, they then mis-perceive reality.” ([2], p. 3)  They call this “model collapse” (which is not the same thing as “catastrophic forgetting”).

Fundamentally, the model begins to believe the errors are the real data.  The UK researchers show that model collapse is inevitable, though not necessarily rapid.
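
The mechanism is easy to see in a toy setting.  The sketch below is my own illustration, not one of the experiments in [2]: it repeatedly fits a one-dimensional Gaussian to a modest sample, then draws the next “generation’s” training data from the fitted model.  Because each fit is made from a finite sample, the estimation errors compound, the estimated spread drifts downward, and the tails of the original distribution quietly disappear.

```python
# Toy sketch of model collapse (an illustration in the spirit of [2],
# not a reproduction of the paper's language- and image-model experiments).
# Each generation fits a Gaussian to the previous generation's output and
# then samples its own "training data" from that fit.

import numpy as np

rng = np.random.default_rng(42)
N, GENERATIONS = 50, 50               # small samples make the drift easy to see

data = rng.normal(0.0, 1.0, size=N)   # "real" data: mean 0, std 1

for g in range(1, GENERATIONS + 1):
    mu, sigma = data.mean(), data.std()     # "train" generation g on its inputs
    data = rng.normal(mu, sigma, size=N)    # generation g+1 trains only on g's output
    if g % 10 == 0:
        tail = np.mean(np.abs(data) > 2.0)  # how much of the original tails survives
        print(f"gen {g:2d}: fitted std = {sigma:.3f}, fraction |x| > 2 = {tail:.1%}")
```

Run it a few times: the fitted standard deviation wanders but typically drifts downward, and the fraction of samples out in the tails shrinks from roughly 5% toward a fraction of a percent.  That is the “mis-perceiving reality” in miniature: the rare cases vanish first.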

These findings are particularly important for contemporary large models that are trained repeatedly or continuously, adding in new data.  Each retraining propagates errors from the previous model.  The result accumulates errors and loses the tails of the distributions.

This effect can be mitigated by including non-generated data, and by keeping the original dataset available.  Of course, very large language models are generally too large to retrain from scratch on the original data at each iteration, so they are updated incrementally with new data.  This makes them more vulnerable to model collapse, as they retrain on the errors from the earlier training, and “start misperceiving reality based on errors introduced by their ancestors” ([2], p. 12).
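
As a rough illustration of that mitigation (again, my own toy sketch rather than an experiment from [2]), the same Gaussian-refitting loop behaves very differently if each generation’s training set is anchored with even a modest fraction of the original, non-generated data.  REAL_FRACTION here is an assumed toy parameter, not a value from the paper.

```python
# Same toy loop as above, but each generation's training set mixes a fixed
# fraction of the original "real" data in with the generated samples.
# (REAL_FRACTION is an assumed toy parameter, not a value from [2].)

import numpy as np

rng = np.random.default_rng(7)
N, GENERATIONS, REAL_FRACTION = 50, 50, 0.2

real = rng.normal(0.0, 1.0, size=N)      # the original dataset, kept available
data = real.copy()

for g in range(1, GENERATIONS + 1):
    mu, sigma = data.mean(), data.std()
    n_real = int(N * REAL_FRACTION)
    generated = rng.normal(mu, sigma, size=N - n_real)
    anchor = rng.choice(real, size=n_real, replace=False)  # re-use original data
    data = np.concatenate([generated, anchor])             # mixed training set
    if g % 10 == 0:
        print(f"gen {g:2d}: fitted std = {sigma:.3f}")
```

With the anchor in place, the fitted spread stays close to 1 instead of drifting toward zero.  The cost is that you have to keep (and keep trusting) the original data, which is exactly what incremental updates to very large models tend not to do.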


I’ll also note that the situation is even more complicated than these computational studies portray.  There are a lot of different AIs out there, building a lot of different models, and a lot of sources of data of varying veracity.  The training data for each model is going to contain a range of material, much of which can’t be easily evaluated, and input samples from the Internet can potentially contain the output of many different sources.

Which means that even if a large language model is careful about including non-generated data (or at least putative non-generated data), and about retraining with the original dataset, it will still be ingesting an unknown amount of generated data from other models.


So, to recap. 

Large machine learning models trained from Internet data are iffy to start with.  (Because: GIGO.) With each update that includes the output of iffy models, ML models degenerate, becoming even worse. (Because: GO is used as GI.)

This seems to imply that the current generation of our “mind children” is not going to live a long time.  They will grow senile, poisoned by junk data, possibly very rapidly.

It will be interesting to see what the lifespan of these models turns out to be.  How long will they last?  Will they get old and retire?  Will they hang around, unwanted, causing trouble like Windows XP?  Will they be absorbed into newer models in some kind of cannibalistic ingestion?


  1. Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, and Rik Sarkar, Towards Understanding the Interplay of Generative Artificial Intelligence and the Internet. arXiv:2306.06130, 2023. https://arxiv.org/abs/2306.06130
  2. Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson, The Curse of Recursion: Training on Generated Data Makes Models Forget. arXiv:2305.17493, 2023. https://arxiv.org/abs/2305.17493v2
  3. Matthew S. Smith, The Internet Isn’t Completely Weird Yet; AI Can Fix That. IEEE Spectrum – Artificial Intelligence, June 24, 2023. https://spectrum.ieee.org/ai-collapse
