
The Generalization Puzzle: Why Overparameterized Neural Networks Don't Overfit

Classical statistics says a model with more parameters than data points should memorize training data and fail on new data. Modern neural networks violate this prediction spectacularly, generalizing well despite massive overparameterization. Four 2025 papers advance our theoretical understanding of why.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Classical statistical learning theory makes a clear prediction: a model with more parameters than training examples will memorize the training data perfectly but fail catastrophically on new data. This prediction is well-founded: it follows from the bias-variance tradeoff, VC dimension bounds, and PAC learning theory that have guided statistical practice for decades.

Modern deep neural networks violate this prediction routinely. A language model with billions of parameters, trained on a dataset of millions of examples, generalizes to new text it has never seen. A vision model with hundreds of millions of parameters, trained on a million images, correctly classifies novel photographs. The parameter-to-data ratio wildly exceeds the thresholds where classical theory predicts catastrophic overfitting, yet the models work.

Understanding why they work is not merely an intellectual curiosity. It determines whether deep learning's success is a fortunate accident that will eventually fail, or a reflection of deeper mathematical structure that we can rely on and extend. The 2025 research on this question makes progress on four fronts.

Implicit Bias: The Hidden Regularizer

The leading explanation for overparameterized generalization is implicit bias: the optimization algorithm (gradient descent and its variants) does not simply find any solution that fits the training data; it finds a specific solution that, among all solutions with zero training error, has properties that promote generalization.

Matt & Stöger provide the most precise characterization to date for a simplified setting: overparameterized linear neural networks (multiple linear layers composed in sequence, no nonlinearities). They prove tight upper and lower bounds on the implicit regularization effect of gradient descent, showing that it implicitly favors solutions with small ℓ₁ norm, the same property that explicit ℓ₁ regularization (Lasso) imposes.

This is remarkable because no regularization term is added to the loss function. The implicit ℓ₁ bias arises purely from the interaction between the network's layered architecture and the gradient descent dynamics. The layered structure, even without nonlinearity, introduces a geometric bias in the optimization landscape that gradient descent exploits.
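
To see the flavor of this effect numerically, here is a minimal sketch using a diagonal linear network, a standard toy model in this literature in which the regression vector is factored as β = u⊙u − v⊙v. The dimensions, step size, and small initialization are illustrative assumptions, not the exact setting analyzed by Matt & Stöger:

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined regression: far more coefficients than samples.
n, d = 20, 50
X = rng.standard_normal((n, d))
beta_true = np.zeros(d)
beta_true[:3] = [3.0, -2.0, 1.5]           # sparse ground truth
y = X @ beta_true                          # noiseless: many interpolators exist

# Diagonal linear network: beta = u*u - v*v, trained from a small init.
# (Toy stand-in for a deep linear network, not the paper's exact model.)
alpha, lr = 1e-3, 1e-3
u = alpha * np.ones(d)
v = alpha * np.ones(d)
for _ in range(100_000):
    beta = u * u - v * v
    g = X.T @ (X @ beta - y) / n           # gradient of 0.5*MSE w.r.t. beta
    u -= lr * 2 * u * g                    # chain rule through the factorization
    v += lr * 2 * v * g

beta_gd = u * u - v * v
beta_l2 = np.linalg.pinv(X) @ y            # min-l2-norm interpolator

print("train error (factored GD):", np.linalg.norm(X @ beta_gd - y))
print("l1 norm, factored GD     :", np.abs(beta_gd).sum())
print("l1 norm, min-l2 solution :", np.abs(beta_l2).sum())
```

On typical random draws, the factored parameterization lands on a sparse interpolator whose ℓ₁ norm is close to that of the ground truth, while the minimum-ℓ₂-norm interpolator (what plain gradient descent on β itself would find) is dense and carries a noticeably larger ℓ₁ norm. No penalty term appears anywhere; the bias comes entirely from the factorization and the small initialization.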

Spectral Bias: Low Frequencies First

Sahs et al. explore a complementary mechanism: spectral bias, the tendency of neural networks to learn low-frequency components of the target function before high-frequency components. This bias acts as implicit regularization because low-frequency functions are smoother and more likely to generalize, while high-frequency functions are more likely to represent noise.

Their contribution is showing that the choice of activation function shapes the spectral bias. Different nonlinearities (ReLU, sigmoid, GELU, sine) produce different frequency learning priorities. ReLU networks, for instance, favor piecewise-linear functions (low effective frequency), while sine-activated networks can learn high-frequency components more readily.

The practical implication: the activation function is not just an architectural choice for computational convenience; it is a regularization choice that determines which functions the network can easily learn and which it suppresses.
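
One way to watch spectral bias happen is to train a small network on a target with one low and one high frequency and track how much of each component remains in the residual. The sketch below, with a one-hidden-layer tanh network and hand-rolled full-batch gradient descent, uses sizes and rates that are arbitrary choices rather than the setup of Sahs et al.:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target with equal-amplitude low- and high-frequency components.
x = np.linspace(-1, 1, 128)[:, None]
low = np.sin(2 * np.pi * x).ravel()
high = np.sin(12 * np.pi * x).ravel()
y = low + high

# One-hidden-layer tanh network, full-batch gradient descent.
width, lr, n = 128, 0.01, len(y)
W1 = rng.standard_normal((1, width))
b1 = np.zeros(width)
W2 = rng.standard_normal(width) / np.sqrt(width)

def residual_coeff(r, basis):
    """How much of a frequency component is still missing from the fit."""
    return abs(r @ basis) / (basis @ basis)

for step in range(5001):
    h = np.tanh(x @ W1 + b1)               # hidden activations, shape (n, width)
    r = h @ W2 - y                         # residual
    if step % 1000 == 0:
        print(f"step {step:4d}  low freq left: {residual_coeff(r, low):.3f}  "
              f"high freq left: {residual_coeff(r, high):.3f}")
    # Backprop through 0.5 * mean squared error.
    gW2 = h.T @ r / n
    gh = np.outer(r, W2) * (1 - h ** 2)    # through tanh
    W1 -= lr * (x.T @ gh) / n
    b1 -= lr * gh.mean(axis=0)
    W2 -= lr * gW2
```

On a typical run the low-frequency residual collapses within the first thousand steps while the high-frequency component barely moves. Swapping the nonlinearity (with its derivative substituted in the backward pass) changes how quickly the high frequency is picked up, which is exactly the activation dependence the paper analyzes.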

Region Counting: A Geometric Perspective

Li et al. propose a geometric characterization of implicit bias: the number of connected regions that the network's decision boundary creates in input space. A network that carves input space into many small regions is more complex (and more likely to overfit) than one that creates fewer, larger regions.

They prove that gradient descent, for certain architectures, converges to solutions with near-minimal region counts, providing a concrete, geometric explanation for why the learned function is simple (generalizes well) even though the network has the capacity to be arbitrarily complex.
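
A region count of this kind is easy to estimate empirically. The sketch below is a crude proxy rather than the paper's exact statistic: it counts the distinct ReLU activation patterns a small random network produces on a fine grid. Each pattern identifies one linear region of the piecewise-linear function the network computes, so the number of unique patterns lower-bounds the region count:

```python
import numpy as np

rng = np.random.default_rng(2)

def count_activation_regions(weights, biases, grid):
    """Count distinct ReLU activation patterns over a grid of input points.

    Each on/off pattern across all layers identifies one linear region of
    the network's piecewise-linear function; counting unique patterns on a
    fine grid lower-bounds the true region count.
    """
    h = grid
    patterns = []
    for W, b in zip(weights, biases):
        z = h @ W + b
        patterns.append(z > 0)
        h = np.maximum(z, 0.0)
    codes = np.concatenate(patterns, axis=1)
    return len(np.unique(codes, axis=0))

# A small random 2-16-16 ReLU network evaluated on [-1, 1]^2
# (illustrative sizes; the output layer does not affect region structure).
sizes = [2, 16, 16]
weights = [rng.standard_normal((m, k)) for m, k in zip(sizes[:-1], sizes[1:])]
biases = [0.2 * rng.standard_normal(k) for k in sizes[1:]]

g = np.linspace(-1, 1, 300)
grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
print("linear regions seen on grid:", count_activation_regions(weights, biases, grid))
```

Tracking a count like this over the course of training is one way to probe the paper's claim that gradient descent drifts toward low-region, geometrically simple solutions.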

Provable Bounds Beyond Classical Theory

Dhingra provides a survey and extension of provable generalization bounds for overparameterized networks. The key insight: classical bounds (VC dimension, Rademacher complexity) become vacuous in the overparameterized regime (the bound on generalization error exceeds 1, guaranteeing nothing) because they depend on the number of parameters without accounting for the constraints that gradient descent imposes.

Newer bounds (norm-based, PAC-Bayes, compression-based) incorporate information about the specific solution found by gradient descent, producing non-vacuous generalization estimates. These bounds, while still loose, correctly predict the qualitative behavior observed in practice: generalization improves as networks grow wider (more parameters per layer) even though classical theory predicts the opposite.
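
To make the contrast concrete, here is a toy comparison, with all numbers invented for illustration, between a parameter-count capacity term (which is vacuous as soon as parameters exceed samples) and a spectrally-normalized margin term in the spirit of norm-based bounds (which depends on the trained weights rather than their count). Real bounds carry additional log factors, layer-wise terms, and confidence terms that this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in weights for a trained two-layer ReLU net (illustrative scaling).
W1 = rng.standard_normal((784, 2048)) / np.sqrt(784)
W2 = rng.standard_normal((2048, 10)) / np.sqrt(2048)

n_params = W1.size + W2.size               # ~1.6M parameters
n_samples = 50_000                         # e.g., an MNIST-sized dataset

# Parameter-count capacity (VC-style): vacuous once params >> samples.
vc_style = np.sqrt(n_params / n_samples)

# Norm-based capacity: product of layer spectral norms over the margin.
spec = np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)
margin = 5.0                               # assumed typical output margin
norm_based = spec / margin * np.sqrt(1 / n_samples)

print(f"parameter-count term : {vc_style:.2f}  (> 1: vacuous)")
print(f"norm-based term      : {norm_based:.4f} (tracks the solution, not the count)")
```

Because the spectral norms of trained layers can stay controlled as width grows, the norm-based term need not blow up with parameter count, matching the qualitative trend described above.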

Claims and Evidence

Claim | Evidence | Verdict
Classical statistical theory predicts overparameterized overfitting | VC dimension, bias-variance tradeoff | ✅ Classical prediction
Modern neural networks violate this prediction | Empirical observation across many domains | ✅ Well-documented
Implicit ℓ₁ regularization explains generalization in linear networks | Matt & Stöger: tight bounds proven | ✅ Proven (linear case)
Spectral bias provides implicit regularization | Sahs et al.: activation-dependent frequency learning priority | ✅ Supported
Region counting characterizes implicit bias geometrically | Li et al.: minimal region convergence demonstrated | ✅ Supported
Current theory fully explains deep learning generalization | Gaps remain between theory (linear/shallow) and practice (deep nonlinear) | ❌ Partial understanding

Open Questions

  • Deep nonlinear networks: Most theoretical results apply to linear networks, shallow networks, or simplified architectures. Can the insights transfer to the deep, nonlinear, attention-based architectures that dominate practice?
  • Transformers specifically: The generalization properties of Transformer architectures (attention mechanisms, positional encodings, layer normalization) are poorly understood theoretically. Is there Transformer-specific implicit bias?
  • Double descent: The "double descent" phenomenon, in which test error peaks near the interpolation threshold and then improves again as models grow larger, remains poorly explained by existing theory. Can implicit bias theory account for double descent? (A random-features sketch of the effect follows this list.)
  • Practical implications: Does understanding implicit bias suggest better architectures or training methods? If gradient descent implicitly regularizes, can we design architectures that enhance this implicit regularization?
  • Fine-tuning dynamics: When a pre-trained model is fine-tuned on a small dataset, the implicit bias of fine-tuning may differ from that of training from scratch. How does the pre-trained initialization affect the implicit bias of subsequent optimization?
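
As a reference point for the double-descent item above, the following random-features sketch usually reproduces the signature curve: test error spikes near the interpolation threshold (features ≈ training samples) and then descends again as the model grows far past it. All sizes are arbitrary choices, and the fit is the minimum-ℓ₂-norm solution via the pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(4)

def rf_test_error(n_features, n_train=100, n_test=1000, d=20, noise=0.2):
    """Min-norm random-ReLU-features regression; returns test MSE."""
    w = rng.standard_normal(d)
    X = rng.standard_normal((n_train + n_test, d))
    y = X @ w / np.sqrt(d) + noise * rng.standard_normal(n_train + n_test)
    P = rng.standard_normal((d, n_features)) / np.sqrt(d)
    F = np.maximum(X @ P, 0.0)             # random ReLU features
    Ftr, Fte = F[:n_train], F[n_train:]
    beta = np.linalg.pinv(Ftr) @ y[:n_train]   # min-l2-norm (interpolating) fit
    return np.mean((Fte @ beta - y[n_train:]) ** 2)

for p in [10, 50, 90, 100, 110, 200, 1000, 5000]:
    errs = [rf_test_error(p) for _ in range(5)]   # median tames threshold blow-ups
    print(f"{p:5d} features  median test MSE = {np.median(errs):8.3f}")
```

The error peak at the threshold comes from the ill-conditioning of the feature matrix there; past the threshold, the minimum-norm solution becomes increasingly well-behaved as features are added, which is one concrete face of the overparameterization puzzle.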
What This Means for Your Research

For ML theorists, the generalization puzzle remains the central open problem in deep learning theory. The 2025 results make meaningful progress, particularly in characterizing implicit bias for specific architectures, but the gap between theory and practice motivates continued investment.

For practitioners, the practical takeaway is that the choice of architecture and optimizer is itself a form of regularization. Activation functions, layer widths, learning rates, and training schedules all influence the implicit bias and therefore the generalization properties of the trained model. Understanding these effects enables more principled model design.

For statisticians trained in classical theory, the overparameterization puzzle is an invitation to extend the foundations of learning theory. The classical frameworks are not wrong; they are incomplete. Incorporating the structure of gradient descent and the geometry of neural network parameter spaces into statistical theory is a productive and impactful research direction.

References

[1] Dhingra, A. (2025). Provable Generalization in Overparameterized Neural Nets. arXiv:2508.17256.
[2] Sahs, J., Pyle, R., & Anselmi, F. (2025). The Spectral Bias of Shallow Neural Network Learning is Shaped by the Choice of Non-linearity. arXiv:2503.10587.
[3] Matt, H., & Stöger, D. (2025). Linear regression with overparameterized linear neural networks: Tight upper and lower bounds for implicit $\ell^1$-regularization. arXiv:2506.01143.
[4] Li, J., Xu, J., & Wang, Z. (2025). Understanding Nonlinear Implicit Bias via Region Counts in Input Space. arXiv:2505.11370.
