Strong Model Collapse: When Synthetic Data Breaks Scaling Laws

The scaling laws that underpin modern LLM training assume clean data. What happens when the data is contaminated with AI-generated text? Two papers — one at ICLR 2025, one proposing a verification-based escape — show that even small fractions of synthetic data can break scaling and that verification offers a partial but imperfect remedy.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The premise of scaling laws is simple: more data, more compute, better models. This relationship has held reliably enough to justify billions of dollars in training infrastructure. But scaling laws carry an implicit assumption that is becoming increasingly fragile — that the training data is real. As AI-generated text proliferates across the web, the training data for future models will inevitably contain synthetic content. Two recent papers examine what happens when it does. The news is not reassuring.

The Research Landscape

Strong Model Collapse: The Core Result

Dohmatob, Feng, Subramonian, and Kempe (2024), in a paper accepted as a spotlight at ICLR 2025, establish what they call "strong model collapse." The claim is specific: within the scaling laws paradigm, in a supervised regression setting, even a very small fraction of synthetic data in the training corpus (the paper's example is as little as 1% of the total training dataset) can cause scaling laws to stop working. Larger and larger training sets no longer improve performance. The scaling curve, which should keep trending downward (lower loss with more data), instead flattens out.

This is distinct from the weaker forms of model collapse studied previously, where models trained iteratively on their own outputs degraded over multiple generations. Strong model collapse occurs in a single training run — no iterative self-training required. The synthetic data does not need to dominate the corpus. A small contamination is sufficient to break the scaling relationship.
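The plateau can be illustrated with a small simulation. The sketch below is not the paper's experiment. It is a toy linear regression in which a fraction p of the training labels comes from a slightly wrong "generator" weight vector. The names w_synth and delta, the noise level, and the sample sizes are all assumptions chosen for illustration.

```python
import numpy as np


def excess_risk(n, p, w_true, w_synth, sigma, rng):
    """Fit ordinary least squares on n samples, a fraction p of which
    carry labels produced by w_synth instead of w_true.

    Returns ||w_hat - w_true||^2, which equals the excess test risk
    when test inputs are isotropic Gaussian.
    """
    d = w_true.size
    X = rng.standard_normal((n, d))
    n_synth = int(round(p * n))
    y = X @ w_true
    y[:n_synth] = X[:n_synth] @ w_synth      # synthetic labels
    y = y + sigma * rng.standard_normal(n)   # label noise on every row
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((w_hat - w_true) ** 2))


rng = np.random.default_rng(0)
d, sigma, p = 10, 0.2, 0.01                  # illustrative values

w_true = rng.standard_normal(d)
delta = rng.standard_normal(d)
delta /= np.linalg.norm(delta)               # generator error, unit norm
w_synth = w_true + delta

# With infinite data the fit converges to w_true + p * delta,
# so the error cannot drop below this floor.
floor = p**2 * float(delta @ delta)

sizes = [1_000, 10_000, 100_000]
clean = [excess_risk(n, 0.0, w_true, w_synth, sigma, rng) for n in sizes]
mixed = [excess_risk(n, p, w_true, w_synth, sigma, rng) for n in sizes]

for n, c, m in zip(sizes, clean, mixed):
    print(f"n={n:>7}  clean={c:.2e}  mixed={m:.2e}  floor={floor:.2e}")
```

In this toy setup the clean error keeps shrinking roughly like 1/n, while the mixed error settles near p² times the squared norm of delta. More data does not help once that floor is reached, which is the qualitative behavior the paper describes.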

The paper further investigates whether increasing model size, the other lever in the scaling laws framework, can compensate. In a simplified regime where neural networks are approximated via random projections of tunable size, the authors show both theoretically and empirically that larger models can amplify model collapse. A plausible intuition is that larger models have more capacity to fit the distributional artifacts introduced by synthetic data. Interestingly, the theory also indicates that beyond the interpolation threshold, larger models may mitigate the collapse, although they do not entirely prevent it. That threshold can be extremely high for very large datasets, which may put it out of practical reach.
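The capacity intuition can be made concrete with a deliberately simple calculation. This is a toy under strong assumptions, not the paper's analysis: inputs are isotropic, the sample size is infinite, and a "model of size m" is a least-squares fit restricted to the span of the first m columns of a random matrix S. In that limit the fit equals the projection of w_true + p * delta onto that span, so the error splits into an approximation term and a term caused by the synthetic labels. The names S, delta, and widths are hypothetical, and the interpolation-threshold effect is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 200, 0.01                      # ambient dimension, synthetic fraction

w_true = rng.standard_normal(d)
delta = rng.standard_normal(d)        # hypothetical generator error
S = rng.standard_normal((d, d))       # column j is the j-th random direction

# QR gives an orthonormal basis whose first m columns span the same
# subspace as the first m columns of S.
Q, _ = np.linalg.qr(S)
coef_true = Q.T @ w_true
coef_delta = Q.T @ delta

widths = [10, 50, 100, 200]
clean_err = []       # error from limited capacity alone
synth_penalty = []   # extra error caused by the synthetic labels
for m in widths:
    clean_err.append(float(np.sum(coef_true[m:] ** 2)))
    synth_penalty.append(float(p**2 * np.sum(coef_delta[:m] ** 2)))

for m, c, s in zip(widths, clean_err, synth_penalty):
    print(f"m={m:>3}  clean_err={c:9.3f}  synth_penalty={s:.2e}")
```

In this toy, a wider model captures more of w_true, but it also captures more of delta, so the part of the error attributable to synthetic labels grows with m. That is only one simple mechanism consistent with the intuition above.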

The theoretical findings are validated empirically on language models (GPT-2 trained on BabiStories) and feed-forward neural networks for images. The consistency across modalities strengthens the claim that strong model collapse is a general phenomenon, not an artifact of a specific architecture or dataset.

Escaping via Verification

Yi, Liu, Cheng, and Xu (2025) address the follow-up question: can we escape model collapse? Their approach introduces an external synthetic data verifier — whether a human annotator or a better model — that filters synthetic data before it enters the training corpus. The key finding is that verifier-guided retraining can yield near-term improvements.

The paper situates its theoretical analysis in the linear regression setting, showing that verification can avoid collapse by injecting external information about data quality. But the theory also predicts a limitation: unless the verifier is perfectly reliable, early gains will plateau and may even reverse. The verified synthetic retraining process ultimately drives the parameter estimate to the verifier's "knowledge center" — the model converges not to the true distribution but to the verifier's understanding of the distribution.

This is a subtle but important finding. Verification does not solve model collapse; it relocates it. Instead of collapsing toward the distribution of synthetic data, the model converges toward the distribution of the verifier's judgments. If the verifier is good, this is a better outcome. If the verifier has systematic biases, those biases become the model's biases.
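A toy simulation shows what "relocating" the collapse can look like. The sketch below is an assumption-heavy illustration, not the procedure from Yi et al.: the verifier is modeled as holding its own parameter vector theta_verifier (a slightly biased copy of the truth) and accepting a synthetic sample only when its label lies within tau of the verifier's prediction. The acceptance rule, tau, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma, tau, rounds = 5, 20_000, 1.0, 1.0, 15   # illustrative values


def unit(v):
    """Scale v to unit Euclidean norm."""
    return v / np.linalg.norm(v)


theta_true = rng.standard_normal(d)
# Imperfect verifier: its belief sits 0.5 away from the truth.
theta_verifier = theta_true + 0.5 * unit(rng.standard_normal(d))
# Initial model: 1.0 away from the truth, so worse than the verifier.
theta = theta_true + 1.0 * unit(rng.standard_normal(d))

dist_to_true = [float(np.linalg.norm(theta - theta_true))]
dist_to_verifier = [float(np.linalg.norm(theta - theta_verifier))]

for _ in range(rounds):
    X = rng.standard_normal((n, d))
    # Synthetic labels come from the current model, not from the truth.
    y = X @ theta + sigma * rng.standard_normal(n)
    # Hypothetical verifier rule: keep samples it finds plausible.
    keep = np.abs(y - X @ theta_verifier) <= tau
    # Retrain on the verified synthetic data only.
    theta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    dist_to_true.append(float(np.linalg.norm(theta - theta_true)))
    dist_to_verifier.append(float(np.linalg.norm(theta - theta_verifier)))

for t in (0, 1, 2, 5, rounds):
    print(f"round {t:>2}: to_true={dist_to_true[t]:.3f}  "
          f"to_verifier={dist_to_verifier[t]:.3f}")
```

Under these assumptions the estimate moves quickly toward theta_verifier and then stays there. The distance to the truth improves at first and then stalls at roughly the verifier's own error, which mirrors the near-term gain and long-term convergence described above.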

Experiments across linear regression, Variational Autoencoders (VAEs) trained on MNIST, and fine-tuning SmolLM2-135M on the XSUM task confirm these theoretical insights. The empirical results show the predicted pattern: initial improvement from verification, followed by plateau.

Critical Analysis: Claims and Evidence

Claim	Evidence	Verdict
Even 1% synthetic data can break scaling laws	Theoretical proof + empirical validation (ICLR 2025 spotlight)	✅ Supported
Larger models can amplify model collapse	Theory (random projection regime) + empirical verification	✅ Supported — in simplified regime; full-scale LLM verification pending
Beyond the interpolation threshold, larger models may mitigate collapse	Theoretical prediction	⚠️ Theoretically shown; threshold may be impractically high
Verification-based filtering can avoid model collapse short-term	Theory + experiments across three settings	✅ Supported
Verification gains plateau unless verifier is perfect	Theoretical prediction + empirical confirmation	✅ Supported
The web is already heavily contaminated with synthetic text	Not directly studied in either paper	⚠️ Widely reported but not measured in these papers

The Practical Alarm

The theoretical results become alarming when mapped onto real-world conditions. The training data for future language models will be drawn from a web that increasingly contains AI-generated content. The exact contamination rate is not measured in these papers, but the implication is uncomfortable: in the setting Dohmatob et al. analyze, even a small fixed fraction of synthetic data is enough to disrupt the scaling behavior that the entire training paradigm depends on.

This creates a structural problem. The standard response — "just collect more data" — fails because more data means more contamination. The verification escape helps but introduces its own convergence limitation. And larger models amplifying collapse undermines the other standard response — "just scale up."

Open Questions and Future Directions

Contamination measurement: What fraction of current web-crawled training data is AI-generated? No rigorous measurement at scale exists, but the answer determines how urgent the model collapse problem is.

Detection at scale: Can we reliably detect and filter AI-generated text from training corpora at the terabyte scale? Current detection methods have significant false positive and false negative rates.

Verifier quality requirements: How good does a verifier need to be to provide useful filtering? Yi et al. show that imperfect verifiers help short-term but plateau long-term. What is the practical "good enough" threshold?

Domain-specific vulnerability: Are some domains more vulnerable to synthetic data collapse than others? Code, scientific text, and creative writing may have different collapse dynamics.

Data provenance infrastructure: Should the ML community invest in provenance systems, such as watermarking methods, that track whether text is human-generated or synthetic?
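The detection question above comes down to simple arithmetic once a detector's error rates are known. The sketch below computes how much synthetic text survives a filter and how much human text is thrown away. The function names and the example rates (10% contamination, 90% true positive rate, 5% false positive rate) are made up for illustration and do not come from either paper.

```python
def residual_synthetic_fraction(p, tpr, fpr):
    """Fraction of the kept corpus that is still synthetic after filtering.

    p   : fraction of the raw corpus that is synthetic
    tpr : probability the detector flags a synthetic document
    fpr : probability the detector flags a human-written document
    Flagged documents are discarded.
    """
    if not all(0.0 <= v <= 1.0 for v in (p, tpr, fpr)):
        raise ValueError("p, tpr and fpr must lie in [0, 1]")
    kept_synth = p * (1.0 - tpr)
    kept_human = (1.0 - p) * (1.0 - fpr)
    kept = kept_synth + kept_human
    if kept == 0.0:
        raise ValueError("the filter discards the entire corpus")
    return kept_synth / kept


def human_text_lost(p, fpr):
    """Fraction of the raw corpus discarded as wrongly flagged human text."""
    return (1.0 - p) * fpr


residual = residual_synthetic_fraction(p=0.10, tpr=0.90, fpr=0.05)
lost = human_text_lost(p=0.10, fpr=0.05)
print(f"residual synthetic fraction: {residual:.4f}")
print(f"human text discarded:        {lost:.4f}")
```

With these illustrative numbers, a detector that catches 90% of synthetic text still leaves a little over 1% of the kept corpus synthetic, while discarding 4.5% of the raw corpus as wrongly flagged human text.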

What This Means for Your Research

If you are training language models, these results suggest that data quality auditing is no longer optional. The scaling laws that justify your compute budget assume clean data. If your training corpus contains even a small percentage of synthetic text, those scaling predictions may not hold.

If you are generating synthetic data for augmentation, the strong model collapse result adds urgency to verification. The Yi et al. framework helps, but their own results show verification is a mitigation, not a cure. The data abundance assumption — that more data is always better — may be the field's most dangerous blind spot.

References

[1] Dohmatob, E., Feng, Y., Subramonian, A., & Kempe, J. (2024). Strong Model Collapse. ICLR 2025 (Spotlight). arXiv:2410.04840.

[2] Yi, B., Liu, Q., Cheng, Y., & Xu, H. (2025). Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence. arXiv:2510.16657.