
Neural Neural Scaling Laws: When AI Predicts Its Own Future Performance

Can AI predict AI's own scaling behavior? Hu et al. (2026) replace hand-designed scaling law formulas with a neural network that learns to predict downstream task performance, achieving 2.04% MAE across 66 tasks—a 38% error reduction over logistic baselines. The meta-level question: what does it mean when we need neural networks to understand neural networks?

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Scaling laws have been one of the most practically useful discoveries in modern AI research. The observation that model performance follows predictable power-law relationships with compute, data, and parameters—formalized by Kaplan et al. and refined by Hoffmann et al.—gave organizations a planning tool: invest X in compute, expect Y in performance. This predictability turned model training from a gamble into something resembling engineering.

But scaling laws have a dirty secret: they work well for aggregate metrics like validation loss, and much less well for the downstream tasks that actually matter. A model's performance on a specific task—legal question answering, code generation, medical diagnosis—scales in ways that are idiosyncratic, non-monotonic, and poorly captured by the smooth curves that scaling laws predict.

Hu et al. (2026) propose a recursive solution: use a neural network to predict neural network scaling behavior. Their system, NeuNeu, treats scaling law prediction as a time-series extrapolation problem and reports lower prediction error than traditional parametric approaches.

The Research Landscape

The scaling laws literature has evolved through several phases. The initial Kaplan et al. work established that validation loss falls as a power law in model size, dataset size, and compute. Hoffmann et al. refined this by showing that, for compute-optimal training, parameters and training tokens should be scaled up together in roughly equal proportion. Subsequent work has explored whether these aggregate relationships hold for specific capabilities.

The answer, increasingly, is that they do not hold cleanly. Some downstream tasks improve steadily with scale, roughly tracking aggregate loss. Others plateau early—the model reaches near-maximum performance at a modest size and gains little from further scaling. Still others exhibit non-monotonic behavior, where performance temporarily degrades at certain scales before recovering. A few tasks show inverse scaling: larger models perform worse.
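To make the four behaviors above concrete, here is a minimal sketch that labels an accuracy trajectory by its shape. The function name, the labels, and the 0.01 noise threshold are illustrative assumptions, not anything taken from the paper.

```python
def classify_trajectory(acc, tol=0.01):
    """Label an accuracy trajectory ordered by increasing model scale.

    acc: accuracies in [0, 1], one per scale.
    tol: changes smaller than this count as noise (an assumed threshold).
    """
    deltas = [b - a for a, b in zip(acc, acc[1:])]
    net = acc[-1] - acc[0]
    has_drop = any(d < -tol for d in deltas)
    has_gain = any(d > tol for d in deltas)

    if net < -tol and not has_gain:
        return "inverse"        # larger models are worse
    if has_drop and has_gain:
        return "non-monotonic"  # dips followed by recovery, or the reverse
    # Plateau: early gains, then the second half of the curve barely moves.
    half = len(acc) // 2
    if net > tol and abs(acc[-1] - acc[half]) <= tol:
        return "plateau"
    if net > tol:
        return "steady"
    return "flat"


# Hypothetical trajectories, one per behavior described in the text.
print(classify_trajectory([0.30, 0.40, 0.50, 0.60, 0.70]))
print(classify_trajectory([0.30, 0.60, 0.80, 0.80, 0.80]))
print(classify_trajectory([0.50, 0.55, 0.45, 0.60, 0.70]))
print(classify_trajectory([0.70, 0.65, 0.60, 0.50]))
```

A hand-written rule set like this is brittle, which is part of the motivation for learning the curve shapes from data instead.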

This task-level unpredictability is a practical problem. Organizations making decisions about how large a model to train—decisions involving millions of dollars—need to predict not just average quality but performance on the specific tasks their users care about. A model that is better on average but worse on the task your product depends on is not a good investment.

NeuNeu: Learning to Predict Scaling

The core idea of NeuNeu is to replace hand-designed scaling law formulas (power laws, logistic functions, broken power laws) with a neural network that learns the mapping from observable features to downstream performance.
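For reference, the hand-designed forms mentioned above can be written in a few lines. This is a sketch of two common parametric shapes, with parameter names and the FLOP budgets chosen for illustration; it is not the exact baseline used by Hu et al.

```python
import math


def power_law_loss(compute, a, alpha, irreducible=0.0):
    """Loss decaying as a power law in compute: irreducible + a * C**(-alpha)."""
    return irreducible + a * compute ** (-alpha)


def logistic_accuracy(compute, floor, ceiling, midpoint, slope):
    """Accuracy as a logistic (sigmoid) curve in log10(compute)."""
    x = math.log10(compute)
    return floor + (ceiling - floor) / (1.0 + math.exp(-slope * (x - midpoint)))


# Hypothetical FLOP budgets and logistic parameters.
computes = [1e18, 1e19, 1e20, 1e21, 1e22]
curve = [
    logistic_accuracy(c, floor=0.25, ceiling=0.90, midpoint=20.0, slope=2.0)
    for c in computes
]
```

Whatever parameters are fitted, a logistic curve is monotone in compute, so it cannot reproduce a temporary dip followed by a recovery. That is the kind of limitation a learned predictor is meant to avoid.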

The approach combines two types of input:

Temporal patterns from accuracy trajectories: By observing how a task's accuracy changes across a series of model checkpoints at increasing compute, NeuNeu learns the shape of each task's scaling curve. Some curves are smooth and monotonic; others have inflection points, plateaus, or temporary dips. The neural network learns to recognize and extrapolate these diverse patterns without assuming any particular functional form.

Token-level validation losses: Rather than relying solely on aggregate validation perplexity, NeuNeu uses the distribution of per-token losses as a signal. The intuition is that aggregate loss can be misleading—a model might have low average loss because it excels at easy tokens while struggling with the hard tokens that determine downstream performance. The token-level loss distribution provides a richer signal about what the model has and has not learned.
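The two inputs can be pictured as a single feature vector. The sketch below is an assumption about how such features might be assembled (the bin edges, function names, and toy loss values are all hypothetical); the paper's actual architecture and preprocessing are not described in this post.

```python
def loss_histogram(token_losses, edges):
    """Fraction of tokens whose loss falls in each [edges[i], edges[i+1]) bin.

    Losses at or above the last edge are counted in the final bin.
    """
    counts = [0] * (len(edges) - 1)
    for loss in token_losses:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= loss and (loss < edges[i + 1] or is_last):
                counts[i] += 1
                break
    return [c / len(token_losses) for c in counts]


def build_features(accuracy_history, token_losses, edges):
    """Concatenate the observed accuracy trajectory with the loss histogram."""
    return list(accuracy_history) + loss_histogram(token_losses, edges)


edges = [0.0, 1.0, 2.0, 4.0, 8.0]
model_a = [2.0, 2.0, 2.0, 2.0]  # uniformly mediocre on every token
model_b = [0.5, 0.5, 0.5, 6.5]  # easy tokens solved, one hard token missed

mean_a = sum(model_a) / len(model_a)
mean_b = sum(model_b) / len(model_b)
hist_a = loss_histogram(model_a, edges)
hist_b = loss_histogram(model_b, edges)
```

Both toy models have the same mean loss (2.0), yet their histograms differ: `hist_a` puts all mass in the middle bin, while `hist_b` splits it between the easiest and hardest bins. An aggregate metric hides exactly this difference.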

Critically, the system makes no assumption about any bottleneck or functional form. Traditional scaling laws assume a specific mathematical relationship (power law, logistic curve) and fit parameters to data. NeuNeu instead learns the relationship from data, allowing it to capture whatever patterns exist—including patterns that no parametric form would express well.

Results

Trained on open-source model checkpoints, NeuNeu achieves 2.04% mean absolute error (MAE) in predicting model accuracy on 66 downstream tasks. This represents a 38% reduction in error compared to logistic scaling law baselines—a substantial improvement on a problem where prediction accuracy directly translates to better resource allocation decisions.
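A quick arithmetic check helps interpret these two numbers. The sketch below assumes the 38% figure is a relative reduction in MAE, in which case the implied logistic-baseline MAE is about 3.29 percentage points. The per-task values are hypothetical and only illustrate how MAE is computed.

```python
def mean_absolute_error(predicted, observed):
    """Average of |prediction - observation|, in the units of the inputs."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def relative_reduction(baseline_error, new_error):
    """Fractional error reduction relative to the baseline."""
    return (baseline_error - new_error) / baseline_error


# Baseline MAE implied by the two reported figures, assuming the 38% is a
# relative reduction: 2.04 / (1 - 0.38), roughly 3.29 percentage points.
implied_baseline_mae = 2.04 / (1 - 0.38)

# Hypothetical predicted vs. observed accuracies (percent) for three tasks.
predicted = [61.0, 48.5, 75.0]
observed = [63.0, 47.0, 77.5]
example_mae = mean_absolute_error(predicted, observed)  # (2.0 + 1.5 + 2.5) / 3
```

In other words, an MAE of 2.04% means the predicted accuracy is off by about two percentage points per task on average, against roughly three for the baseline under this assumption.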

The system also generalizes zero-shot to new model families, parameter counts, and tasks it has not seen during training. This generalization is important: if NeuNeu only predicted scaling for the exact models it was trained on, it would be a curve-fitting exercise rather than a general prediction tool. Zero-shot generalization suggests that the neural network has learned something about how scaling works in general, not just how it works for specific model families.
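One common way to test this kind of generalization is a leave-one-family-out split, where every checkpoint from one model family is withheld from training. The sketch below illustrates that protocol under assumed record fields and placeholder family names; the post does not specify the paper's exact split.

```python
def leave_one_family_out(records, held_out_family):
    """Split checkpoint records so one model family appears only in the test set."""
    train = [r for r in records if r["family"] != held_out_family]
    test = [r for r in records if r["family"] == held_out_family]
    return train, test


# Hypothetical checkpoint records; family and task names are placeholders.
records = [
    {"family": "family_a", "params": 1e9, "task": "task_1", "accuracy": 0.41},
    {"family": "family_a", "params": 7e9, "task": "task_1", "accuracy": 0.55},
    {"family": "family_b", "params": 3e9, "task": "task_1", "accuracy": 0.47},
    {"family": "family_c", "params": 1e9, "task": "task_2", "accuracy": 0.62},
]

train, test = leave_one_family_out(records, held_out_family="family_b")
```

If the predictor still does well on `test`, it cannot be relying on having memorized that family's curves.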

Critical Analysis: Claims and Evidence

Claim	Evidence	Verdict
Downstream tasks scale differently from aggregate metrics	Analysis of scaling curves across 66 tasks	✅ Well-established
NeuNeu achieves 2.04% MAE across 66 tasks	Evaluation on held-out prediction targets	✅ Supported
38% error reduction vs. logistic baselines	Direct comparison on same evaluation set	✅ Supported
Zero-shot generalization to new model families	Evaluation on model families excluded from training	✅ Supported
No functional form assumption improves flexibility	Comparison with parametric alternatives	✅ Supported

The methodology is straightforward and the claims are appropriately scoped. The 66-task evaluation provides reasonable breadth, though the tasks are drawn from established benchmarks (which may not represent the full diversity of real-world applications). The reliance on open-source model checkpoints means the system has been validated on models whose training details are public; whether it generalizes equally well to proprietary models with different training recipes is an open question.

Open Questions

Self-referential limits: NeuNeu predicts how other neural networks scale. Could a similar approach predict how NeuNeu itself scales—and would that recursion converge to useful predictions or diverge into noise?

Actionable predictions: Knowing that a task will plateau at a certain scale is useful only if you can do something about it. Can NeuNeu's predictions be integrated into training pipelines to dynamically adjust resource allocation—spending more compute on tasks that are still improving and less on tasks that have plateaued?

Inverse scaling detection: Can NeuNeu reliably predict tasks where larger models will perform worse? Early detection of inverse scaling would be particularly valuable, as it could prevent organizations from investing in scale that actively degrades the capabilities they need.

Training data composition: Scaling behavior likely depends not just on model size but on training data composition. Can NeuNeu's approach be extended to predict the effects of data mixture changes, not just compute changes?

What This Means for Your Research

For practitioners making compute allocation decisions, NeuNeu offers a concrete tool: better predictions of task-level performance at different scales, enabling more informed investment decisions. If the reported 38% error reduction over logistic baselines holds for your models and tasks, it should mean fewer wasted GPU-hours and fewer unpleasant surprises when a large model fails to improve on a key task.

For the scaling laws community, the work suggests that the era of simple parametric scaling laws may be ending—not because they are wrong, but because they are insufficiently expressive for the diversity of downstream scaling behaviors. Data-driven approaches like NeuNeu may be the next generation of scaling prediction tools.

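The meta-level observation is worth noting: we are using neural networks to understand neural networks. The recursion has precedents, such as learned optimizers and performance predictors in neural architecture search, but applying it to scaling prediction is a newer step.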



References

[1] Hu, M.Y., Pan, J., Jhaveri, A.R., Lourie, N., & Cho, K. (2026). Neural Neural Scaling Laws. arXiv:2601.19831.
