What is the AI Model Collapse that researchers warn about?

Artificial intelligence researchers are sounding the alarm about a troubling trend known as “model collapse.” The issue arises when AI models are trained on data created by other AI models rather than on original, human-generated data.

The core of the concern is that as AI-generated content becomes more prevalent on the internet, it increasingly feeds back into the training of new AI models. The result can be a degenerative process in which the quality and effectiveness of AI technologies deteriorate over time.


The scientific community is increasingly worried about this phenomenon because it could undermine the reliability and advancement of future AI systems.

Collapse of AI models

In a study published in Nature last July, researchers Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal highlighted significant concerns about the performance of AI models trained on data generated by other AI models. Their research shows that the practice can lead to notable declines in model accuracy and diversity.


The core issue is that as AI-generated content becomes more common on the internet, new models increasingly rely on data created by previous models rather than original human data. This creates a feedback loop where each generation of models is trained on outputs from its predecessors, potentially leading to a gradual degradation in performance—a phenomenon they term “model collapse.”
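This dynamic is easy to reproduce in miniature. The sketch below is a toy illustration of the feedback loop, not the paper's experimental setup: it fits a simple Gaussian “model” to data, samples the next generation's training set from that fit, and repeats. The estimated spread drifts and tends to shrink from generation to generation, a small-scale analogue of collapse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from the true distribution.
data = rng.normal(loc=0.0, scale=1.0, size=1_000)

for generation in range(10):
    # "Train" a model: estimate the distribution from the current data.
    mu, sigma = data.mean(), data.std()
    print(f"gen {generation}: mean = {mu:+.3f}, std = {sigma:.3f}")
    # The next generation trains only on the previous model's output,
    # never on fresh human data -- the feedback loop described above.
    data = rng.normal(loc=mu, scale=sigma, size=1_000)
```

Each refit introduces a small sampling error, and because later generations never see the original data, those errors accumulate rather than averaging out.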

The researchers identify three main types of errors contributing to this collapse:

  • Statistical Approximation Error: This arises because each generation learns from a finite sample of its predecessor's output; rare events can be missed entirely, and once lost they cannot be recovered in later generations (a minimal sketch of this effect follows the list).
  • Functional Expressivity Error: This stems from the limited expressive power of the model itself; a function approximator of finite capacity cannot represent the true distribution exactly, even with perfect data.
  • Functional Approximation Error: This comes from limitations of the learning procedure itself, such as the biases of stochastic gradient descent or the choice of objective, which keep the model from fully recovering the function behind the data.
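Of the three, the statistical approximation error is the easiest to see concretely: a finite sample can simply miss rare events, and once a token's estimated probability reaches zero, the model can never produce it again. The sketch below is a hypothetical illustration using an assumed Zipf-like vocabulary, not an experiment from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy vocabulary: a few common tokens plus a long Zipf-like tail of
# rare tokens that carries most of the diversity.
vocab_size = 1_000
probs = np.arange(1, vocab_size + 1, dtype=float) ** -1.5
probs /= probs.sum()

for generation in range(6):
    # Finite-sample step: draw a training corpus from the current model.
    corpus = rng.choice(vocab_size, size=5_000, p=probs)
    # Tokens that never appear in the sample get probability zero and can
    # never come back -- approximation error compounding across generations.
    counts = np.bincount(corpus, minlength=vocab_size)
    probs = counts / counts.sum()
    print(f"gen {generation}: tokens still alive = {np.count_nonzero(probs)}")
```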

The study warns that as AI models train on increasingly recycled data, they may suffer a “loss of information” early on and, in later stages, reach a “convergence that bears little resemblance to the original.” This deterioration erodes the quality and utility of AI technologies, underscoring the need for fresh, human-generated data to maintain model performance.

Data regurgitation in AI training

Another recent study, titled “Regurgitative Training” by Jinghui Zhang, Dandan Qiao, Mochen Yang, and Qiang Wei, also published in July, delves into the concept of “data regurgitation” in AI training and its impact on model performance.


The paper discusses how the success of large language models (LLMs) like ChatGPT and Llama has led to a significant amount of online content being generated by these models rather than by humans. This AI-generated content then makes its way into the training datasets for future models, creating a cycle of training on data produced by other models.

The researchers argue that “regurgitative training” is becoming increasingly inevitable given the growing prevalence of AI-generated content. They point out that a large portion of web content is already produced by machine-translation models, feeding this cycle.

Their findings indicate that training new LLMs on data generated by other models—whether partially or entirely—tends to result in lower performance compared to training on genuine human-generated data. This suggests that relying on recycled data can lead to diminished accuracy and overall effectiveness of AI models.
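A toy version of that comparison can be run with character-level bigram models standing in for LLMs. This is a sketch of the idea rather than the authors' setup, and `corpus.txt` is an assumed placeholder for any human-written corpus: one model is trained on human text, a second on text generated by the first, and both are scored on held-out human text.

```python
import math
import random
from collections import Counter

def train_bigram(text):
    # Character-level bigram model with add-one smoothing.
    pairs = Counter(zip(text, text[1:]))
    totals = Counter(text[:-1])
    vocab = sorted(set(text))
    def prob(a, b):
        return (pairs[(a, b)] + 1) / (totals[a] + len(vocab))
    return prob, vocab

def generate(prob, vocab, n, rng):
    # Sample synthetic text from the model one character at a time.
    out = [rng.choice(vocab)]
    for _ in range(n - 1):
        weights = [prob(out[-1], c) for c in vocab]
        out.append(rng.choices(vocab, weights=weights)[0])
    return "".join(out)

def bits_per_char(prob, text):
    # Cross-entropy on held-out human text; lower is better.
    return -sum(math.log2(prob(a, b)) for a, b in zip(text, text[1:])) / (len(text) - 1)

rng = random.Random(0)
human = open("corpus.txt").read()        # assumed: any human-written corpus
train, heldout = human[: len(human) // 2], human[len(human) // 2 :]

model_human, vocab = train_bigram(train)                   # trained on humans
synthetic = generate(model_human, vocab, len(train), rng)  # model output
model_regurg, _ = train_bigram(synthetic)                  # trained on a model

print("human-trained model :", bits_per_char(model_human, heldout))
print("regurgitative model :", bits_per_char(model_regurg, heldout))
```

In this setup the regurgitative model can only approximate what the first model already captured, with fresh sampling noise layered on top, so it typically scores somewhat worse on the human held-out text.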


End of human-generated data

The two papers from July also highlight a pressing issue: the potential depletion of human-generated data necessary for training AI models. As AI technologies become more prevalent, the demand for high-quality datasets has surged. Major tech companies like OpenAI, Meta, and Google are actively scraping vast amounts of web content to gather data for training their models.

However, a 2023 paper, “Will We Run Out of Data? Limits to the Scalability of LLM Based on Human-Generated Data,” raises concerns that the supply of human-generated text data could be exhausted by 2026 if current data collection practices continue. The authors developed a predictive model to assess data demand and the rate at which human-generated text is produced on the web.
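The shape of that argument is simple arithmetic: a roughly fixed stock of human text races against training sets that grow geometrically. Below is a deliberately crude sketch in which every number is an illustrative assumption, not a figure from the paper:

```python
# Back-of-envelope version of the race the paper models. All numbers are
# illustrative assumptions for this sketch, not the authors' estimates.
stock = 3e14      # assumed stock of usable human-written text, in tokens
dataset = 1e13    # assumed size of today's largest training sets, in tokens
growth = 2.0      # assumed yearly growth factor of training-set size

year = 2024
while dataset < stock:
    dataset *= growth
    year += 1
print(f"under these assumptions, demand overtakes the stock around {year}")
```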

Their analysis points to a critical juncture by the end of this decade, when reliance on publicly available human text for training large language models (LLMs) may become unsustainable. If the stock of high-quality human data runs low, the learning capabilities and performance of AI models could suffer significantly.

The prospect of future LLMs being trained predominantly on AI-generated data rather than human-generated content raises concerns about a degenerative process. This shift could lead to a decline in AI model performance, potentially resulting in a scenario where AIs become increasingly “dumb” and less effective over time.
