Strong Model Collapse: When Synthetic Data Breaks Scaling Laws

The scaling laws that underpin modern LLM training assume clean data. What happens when the data is contaminated with AI-generated text? Two papers — one at ICLR 2025, one proposing a verification-based escape — show that even small fractions of synthetic data can break scaling and that verification offers a partial but imperfect remedy.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The premise of scaling laws is simple: more data, more compute, better models. This relationship has held reliably enough to justify billions of dollars in training infrastructure. But scaling laws carry an implicit assumption that is becoming increasingly fragile — that the training data is real. As AI-generated text proliferates across the web, the training data for future models will inevitably contain synthetic content. Two recent papers examine what happens when it does. The news is not reassuring.

The Research Landscape

Strong Model Collapse: The Core Result

Dohmatob, Feng, Subramonian, and Kempe (2024), in a spotlight paper at ICLR 2025, establish what they call "strong model collapse." Within the scaling laws paradigm, even a very small fraction of synthetic data in the training corpus (the paper's example is as little as 1% of the total training dataset) can cause scaling laws to stop working: larger and larger training sets no longer enhance performance. The scaling curve, which should keep trending downward (lower loss with more data), instead flattens or reverses.
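
A toy simulation can make the error floor concrete. The sketch below is not the paper's setup: it assumes ordinary least squares on Gaussian features, with a fraction `p` of labels generated by a shifted "synthetic" labeling vector `w_true + delta`. All names and parameter values (`excess_risk`, `delta_norm`, `noise`, the seed) are illustrative assumptions. In this toy, the least-squares solution drifts toward `w_true + p * delta`, so the excess risk stops shrinking near `(p * delta_norm) ** 2` however large `n` gets, while the clean run keeps improving.

```python
import numpy as np

def excess_risk(n, p, d=10, delta_norm=5.0, noise=0.5, seed=0):
    """Squared parameter error of OLS when a fraction p of labels is synthetic.

    Toy assumption: synthetic labels come from w_true + delta instead of w_true.
    With identity feature covariance, squared parameter error equals excess risk.
    """
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)
    delta = rng.normal(size=d)
    delta *= delta_norm / np.linalg.norm(delta)   # fix the size of the label shift
    X = rng.normal(size=(n, d))
    synthetic = rng.random(n) < p                 # which rows are "AI-generated"
    y = X @ w_true + synthetic * (X @ delta) + noise * rng.normal(size=n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(np.sum((w_hat - w_true) ** 2))

sizes = [1_000, 10_000, 100_000]
p = 0.01                                          # 1% synthetic, as in the text
clean = [excess_risk(n, p=0.0) for n in sizes]
mixed = [excess_risk(n, p=p) for n in sizes]
floor = (p * 5.0) ** 2                            # predicted plateau for this toy

for n, c, m in zip(sizes, clean, mixed):
    print(f"n={n:>7}  clean={c:.2e}  1% synthetic={m:.2e}  (toy floor ~ {floor:.2e})")
```

The floor is tiny in absolute terms at 1% contamination, but it is a floor: past a certain corpus size, additional data buys nothing in this toy, which is the qualitative behavior the paper formalizes.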

This is distinct from the weaker forms of model collapse studied previously, where models trained iteratively on their own outputs degraded over multiple generations. Strong model collapse occurs in a single training run — no iterative self-training required. The synthetic data does not need to dominate the corpus. A small contamination is sufficient to break the scaling relationship.

The paper further investigates whether increasing model size, the other lever in the scaling laws framework, can compensate. In a simplified regime where neural networks are approximated via random projections of tunable size, the authors show both theoretically and empirically that larger models can amplify model collapse. One plausible intuition is that larger models have more capacity to memorize the distributional artifacts introduced by synthetic data. The theory also indicates that beyond the interpolation threshold larger models may mitigate the collapse, but this threshold can be extremely high for very large datasets, which may put it out of practical reach.
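
One way to probe the model-size question numerically is a small random-projection experiment. The sketch below is only loosely inspired by the regime described above and does not reproduce the paper's analysis: it assumes a linear model fitted by ridge regression on `m` random Gaussian projections of `d`-dimensional inputs, with a fraction `p` of labels coming from a shifted labeling vector. The function name, the label-shift construction, and every default value are hypothetical choices made for illustration. It is a harness for comparing clean and contaminated runs across widths, and it makes no claim about which way the gap moves in general.

```python
import numpy as np

def random_projection_error(m, p, n=2000, d=50, noise=0.1, ridge=1e-6, seed=0):
    """Squared parameter error of ridge regression on m random projections.

    Toy assumptions: Gaussian inputs, synthetic labels from w_true + delta,
    widths m <= d so the projected problem stays well conditioned.
    """
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d) / np.sqrt(d)
    delta = rng.normal(size=d) / np.sqrt(d)        # hypothetical label shift
    X = rng.normal(size=(n, d))
    synthetic = rng.random(n) < p
    y = X @ w_true + synthetic * (X @ delta) + noise * rng.normal(size=n)

    S = rng.normal(size=(d, m)) / np.sqrt(d)       # random projection; m is "model size"
    Z = X @ S                                      # projected features
    v = np.linalg.solve(Z.T @ Z + ridge * np.eye(m), Z.T @ y)
    w_hat = S @ v                                  # map back to input space
    return float(np.sum((w_hat - w_true) ** 2))

widths = [5, 10, 25, 50]
clean = [random_projection_error(m, p=0.0) for m in widths]
mixed = [random_projection_error(m, p=0.1) for m in widths]

for m, c, x in zip(widths, clean, mixed):
    print(f"m={m:>3}  clean={c:.4f}  10% synthetic={x:.4f}  gap={x - c:+.4f}")
```

Because the same seed is reused, both runs share inputs and noise, so the printed gap isolates the effect of contamination at each width. Reproducing the amplification result itself would need the proportional scaling of `n`, `d`, and `m` that the paper's theory works in.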

The theoretical findings are validated empirically on language models (GPT-2 trained on BabiStories) and feed-forward neural networks for images. The consistency across modalities strengthens the claim that strong model collapse is a general phenomenon, not an artifact of a specific architecture or dataset.

Escaping via Verification

Yi, Liu, Cheng, and Xu (2025) address the follow-up question: can we escape model collapse? Their approach introduces an external synthetic data verifier — whether a human annotator or a better model — that filters synthetic data before it enters the training corpus. The key finding is that verifier-guided retraining can yield near-term improvements.

The paper situates its theoretical analysis in the linear regression setting, showing that verification can avoid collapse by injecting external information about data quality. But the theory also predicts a limitation: unless the verifier is perfectly reliable, early gains will plateau and may even reverse. The verified synthetic retraining process ultimately drives the parameter estimate to the verifier's "knowledge center" — the model converges not to the true distribution but to the verifier's understanding of the distribution.

This is a subtle but important finding. Verification does not solve model collapse; it relocates it. Instead of collapsing toward the distribution of synthetic data, the model converges toward the distribution of the verifier's judgments. If the verifier is good, this is a better outcome. If the verifier has systematic biases, those biases become the model's biases.
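
The convergence toward the verifier's "knowledge center" can be sketched in a few lines of linear regression. This is a toy under stated assumptions, not the construction in Yi et al.: the verifier is modeled as a fixed linear predictor `theta_verifier`, and it accepts a synthetic sample only if the label lies within `tol` of its own prediction. The acceptance rule, the function name, and all numeric values are hypothetical. Each round, the current model generates labels, the verifier filters them, and the model is refitted on what survives.

```python
import numpy as np

def verified_retraining(theta_true, theta_verifier, theta_init,
                        rounds=30, n=5000, noise=0.5, tol=0.5, seed=0):
    """Iteratively refit on self-generated labels that pass a verifier check."""
    rng = np.random.default_rng(seed)
    theta = theta_init.copy()
    history = []
    for _ in range(rounds):
        X = rng.normal(size=(n, theta.size))
        y = X @ theta + noise * rng.normal(size=n)        # synthetic labels from current model
        keep = np.abs(y - X @ theta_verifier) < tol       # hypothetical acceptance rule
        theta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        history.append((np.linalg.norm(theta - theta_true),
                        np.linalg.norm(theta - theta_verifier)))
    return theta, np.array(history)

def unit(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
d = 5
theta_true = rng.normal(size=d)
theta_verifier = theta_true + 0.3 * unit(rng.normal(size=d))  # imperfect verifier
theta_init = theta_true + 1.0 * unit(rng.normal(size=d))      # poor starting model

theta_final, history = verified_retraining(theta_true, theta_verifier, theta_init)
init_error = np.linalg.norm(theta_init - theta_true)
print(f"initial error to truth: {init_error:.3f}")
print(f"final error to truth:   {history[-1, 0]:.3f}")
print(f"final gap to verifier:  {history[-1, 1]:.3f}")
```

In this toy the filtered labels are pulled toward the verifier's predictions on every round, so the estimate contracts onto `theta_verifier`. The error relative to the truth improves from its starting value but settles near the verifier's own bias (0.3 here) and does not go to zero.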

Experiments across linear regression, Variational Autoencoders (VAEs) trained on MNIST, and fine-tuning SmolLM2-135M on the XSUM task confirm these theoretical insights. The empirical results show the predicted pattern: initial improvement from verification, followed by plateau.

Critical Analysis: Claims and Evidence

| Claim | Evidence | Verdict |
|---|---|---|
| Even 1% synthetic data can break scaling laws | Theoretical proof + empirical validation (ICLR 2025 spotlight) | ✅ Supported |
| Larger models can amplify model collapse | Theory (random projection regime) + empirical verification | ✅ Supported (in simplified regime; full-scale LLM verification pending) |
| Beyond the interpolation threshold, larger models may mitigate collapse | Theoretical prediction | ⚠️ Theoretically shown; threshold may be impractically high |
| Verification-based filtering can avoid model collapse short-term | Theory + experiments across three settings | ✅ Supported |
| Verification gains plateau unless verifier is perfect | Theoretical prediction + empirical confirmation | ✅ Supported |
| The web is already heavily contaminated with synthetic text | Not directly studied in either paper | ⚠️ Widely reported but not measured in these papers |

The Practical Alarm

The theoretical results become alarming when mapped onto real-world conditions. The training data for future language models will be drawn from a web that increasingly contains AI-generated content. Neither paper measures the actual contamination rate, but the direction is clear: if the theory carries over, even a small fraction of synthetic text, on the order of the 1% studied in the paper, is enough to disrupt the scaling laws that the entire training paradigm depends on.

This creates a structural problem. The standard response — "just collect more data" — fails because more data means more contamination. The verification escape helps but introduces its own convergence limitation. And larger models amplifying collapse undermines the other standard response — "just scale up."

Open Questions and Future Directions

  • Contamination measurement: What fraction of current web-crawled training data is AI-generated? No rigorous measurement at scale exists, but the answer determines how urgent the model collapse problem is.
  • Detection at scale: Can we reliably detect and filter AI-generated text from training corpora at the terabyte scale? Current detection methods have significant false positive and false negative rates.
  • Verifier quality requirements: How good does a verifier need to be to provide useful filtering? Yi et al. show that imperfect verifiers help short-term but plateau long-term. What is the practical "good enough" threshold?
  • Domain-specific vulnerability: Are some domains more vulnerable to synthetic data collapse than others? Code, scientific text, and creative writing may have different collapse dynamics.
  • Data provenance infrastructure: Should the ML community invest in provenance systems — watermarking methods that track whether text is human-generated or synthetic?
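
The detection question above comes down to simple arithmetic once a detector's error rates are known. The sketch below assumes a filter that drops every document the detector flags. The contamination prior and the true and false positive rates are made-up numbers for illustration, not measurements from either paper or from any real detector.

```python
def residual_contamination(prior, tpr, fpr):
    """Synthetic fraction remaining after removing all flagged documents.

    prior: fraction of the corpus that is synthetic before filtering
    tpr:   probability the detector flags a synthetic document
    fpr:   probability the detector flags a human-written document
    """
    for name, value in (("prior", prior), ("tpr", tpr), ("fpr", fpr)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    kept_synthetic = prior * (1.0 - tpr)          # synthetic docs that slip through
    kept_human = (1.0 - prior) * (1.0 - fpr)      # human docs that survive
    kept_total = kept_synthetic + kept_human
    if kept_total == 0.0:
        raise ValueError("filter removes the entire corpus")
    return kept_synthetic / kept_total

# Hypothetical scenario: 10% contamination, detector with 90% TPR and 5% FPR.
residual = residual_contamination(prior=0.10, tpr=0.90, fpr=0.05)
print(f"residual synthetic share after filtering: {residual:.2%}")
```

Under these assumed rates the filtered corpus is still roughly 1.2% synthetic, and 5% of the human-written text has been thrown away to get there. Filtering lowers the contamination rate but does not bring it to zero.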

What This Means for Your Research

If you are training language models, these results suggest that data quality auditing is no longer optional. The scaling laws that justify your compute budget assume clean data. If your training corpus contains even a small percentage of synthetic text, those scaling predictions may not hold.

If you are generating synthetic data for augmentation, the strong model collapse result adds urgency to verification. The Yi et al. framework helps, but their own results show verification is a mitigation, not a cure. The data abundance assumption, that more data is always better, may be the field's most dangerous blind spot.

References

[1] Dohmatob, E., Feng, Y., Subramonian, A., & Kempe, J. (2024). Strong Model Collapse. ICLR 2025 (Spotlight). arXiv:2410.04840.
[2] Yi, B., Liu, Q., Cheng, Y., & Xu, H. (2025). Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence. arXiv:2510.16657.
