
Multiagent Finetuning: How One Base Model Becomes Many Specialized Agents

Multiagent Finetuning (MAFT) starts from a single base language model and produces multiple specialized agent copies that generate diverse reasoning chains. Selection pressure across agents then improves each copy beyond what single-model self-improvement can achieve, avoiding the collapse that plagues standard synthetic-data training.

By ORAA Research

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

A persistent challenge in language model self-improvement is diversity collapse. When a single model generates synthetic training data and then trains on it, the output distribution narrows with each iteration — the model converges on a small set of reasoning patterns, losing the ability to explore alternative approaches. This is not a subtle effect; it has been documented across multiple studies as a fundamental limitation of single-agent self-play.

Multiagent Finetuning (MAFT), introduced by Subramaniam et al. (2025), offers a structural solution. Instead of one model improving itself in isolation, MAFT creates multiple copies of the same base model and differentiates them through interaction. Each copy develops specialized reasoning strategies, and the diversity of the population prevents any single copy from collapsing into a narrow pattern. The approach has accumulated notable attention since its publication, reflecting broad interest in overcoming self-improvement bottlenecks.

The Research Landscape

The Core Mechanism

MAFT operates through a multi-step process:

Step 1: Population initialization. Start with N copies of the same base language model. All copies are initially identical.

Step 2: Diverse generation. Each copy generates responses to training prompts. Because the generation process involves sampling (temperature, top-p), even identical models produce different outputs. Over iterations, the copies diverge further as they train on different subsets of correct responses.

Step 3: Cross-agent selection. For each training prompt, responses from all N agents are evaluated (using a reward model, verifier, or ground-truth labels). The best responses are selected regardless of which agent produced them.

Step 4: Specialized training. Each agent is finetuned on a mixture of its own successful responses and the successful responses of other agents — but with a weighting scheme that encourages each agent to develop distinct capabilities.

Step 5: Iteration. Steps 2–4 repeat, with the population progressively specializing and improving.
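The five steps above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the `Agent` class, the `generate`, `verify`, and `finetune` callables, and the `own_weight` parameter are all hypothetical stand-ins, and the toy "model" only adds two integers with occasional errors. The sketch also keeps every verified response rather than ranking them, which is one possible reading of the selection step.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Prompt = Tuple[int, int]             # toy prompt: "what is a + b?"
Example = Tuple[Prompt, int, float]  # (prompt, response, training weight)


@dataclass
class Agent:
    """Hypothetical stand-in for one finetunable copy of the base model."""
    name: str
    train_log: List[Example] = field(default_factory=list)


def maft_round(agents: List[Agent], prompts: List[Prompt],
               generate: Callable, verify: Callable, finetune: Callable,
               own_weight: float = 0.7) -> Dict[str, List[Tuple[Prompt, int]]]:
    """One pass over Steps 2-4. `own_weight` is an assumed weighting scheme."""
    accepted: Dict[str, List[Tuple[Prompt, int]]] = {}
    for agent in agents:
        # Step 2: every agent samples a response to every prompt.
        pairs = [(p, generate(agent, p)) for p in prompts]
        # Step 3: keep only verified responses, tagged by producing agent.
        accepted[agent.name] = [(p, r) for p, r in pairs if verify(p, r)]
    for agent in agents:
        # Step 4: train on own and peers' successes, own ones weighted higher.
        batch: List[Example] = []
        for producer, pairs in accepted.items():
            w = own_weight if producer == agent.name else 1.0 - own_weight
            batch.extend((p, r, w) for p, r in pairs)
        finetune(agent, batch)
    return accepted


def run_maft(agents, prompts, generate, verify, finetune, iterations=3):
    """Step 5: repeat Steps 2-4 for a fixed number of iterations."""
    for _ in range(iterations):
        maft_round(agents, prompts, generate, verify, finetune)
    return agents


# Toy stand-ins so the sketch runs without any ML library.
def toy_generate(agent: Agent, prompt: Prompt) -> int:
    a, b = prompt
    return a + b + random.choice([0, 0, 1])  # sampling noise: sometimes off by one


def toy_verify(prompt: Prompt, response: int) -> bool:
    return response == sum(prompt)  # ground-truth check


def toy_finetune(agent: Agent, batch: List[Example]) -> None:
    agent.train_log.extend(batch)  # a real system would run a gradient update here


random.seed(0)
population = [Agent(f"agent_{i}") for i in range(3)]  # Step 1: N identical copies
run_maft(population, [(1, 2), (3, 4), (5, 6)], toy_generate, toy_verify, toy_finetune)
```

In a real setup, `generate` would sample from a language model with temperature or top-p, and `finetune` would update that agent's weights; the control flow is the part this sketch is meant to show.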

The key insight is that the multi-agent structure maintains diversity by construction. Even if Agent 1 converges on a narrow reasoning style, Agent 2 may have developed a different approach that solves problems Agent 1 fails on. The population as a whole remains more capable than any individual member.
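One way to make the "population versus individual" claim concrete is to compare each agent's accuracy with the fraction of problems solved by at least one agent. The sketch below uses invented problem IDs and invented results purely for illustration; the function names are hypothetical.

```python
from typing import Dict, Set


def individual_accuracy(solved: Set[str], n_problems: int) -> float:
    """Fraction of problems a single agent solves."""
    return len(solved) / n_problems


def population_coverage(solved_by_agent: Dict[str, Set[str]], n_problems: int) -> float:
    """Fraction of problems solved by at least one agent in the population."""
    covered = set().union(*solved_by_agent.values())
    return len(covered) / n_problems


# Hypothetical results on six problems: the agents overlap on p3 only.
solved_by_agent = {
    "agent_1": {"p1", "p2", "p3"},
    "agent_2": {"p3", "p4", "p5"},
}
n_problems = 6

for name, solved in solved_by_agent.items():
    print(name, individual_accuracy(solved, n_problems))
print("population", population_coverage(solved_by_agent, n_problems))
```

Coverage is an upper bound on what the population delivers in practice: it is only reached if something, such as a verifier, can pick the correct answer among the agents' responses.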

Why Single-Agent Self-Improvement Fails

To understand MAFT's contribution, consider why standard self-improvement plateaus. When a model generates synthetic data and trains on the correct subset, it reinforces the reasoning patterns that already work — and loses the patterns that were present but not yet dominant. After a few iterations, the model can only solve problems in the way it already knows, even if alternative approaches would handle novel problems better.
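The narrowing can be illustrated with an idealized toy model. Assume the model chooses among a few reasoning strategies with probabilities `p`, each strategy has a fixed success rate, and training on the correct subset refits the model exactly to the filtered samples. Under those assumptions (which are mine, not the paper's), each iteration multiplies a strategy's probability by its success rate and renormalizes, and the entropy of the strategy distribution falls every round.

```python
import math
from typing import List


def entropy(p: List[float]) -> float:
    """Shannon entropy (in nats) of a distribution over strategies."""
    return -sum(x * math.log(x) for x in p if x > 0)


def self_improve_step(p: List[float], success: List[float]) -> List[float]:
    """Refit to correct samples: strategy i survives with probability success[i]."""
    weighted = [pi * si for pi, si in zip(p, success)]
    total = sum(weighted)
    return [w / total for w in weighted]


def entropy_trajectory(p: List[float], success: List[float], iterations: int) -> List[float]:
    """Entropy before training and after each self-improvement iteration."""
    trajectory = [entropy(p)]
    for _ in range(iterations):
        p = self_improve_step(p, success)
        trajectory.append(entropy(p))
    return trajectory


# Hypothetical setup: four strategies, initially equally likely.
initial = [0.25, 0.25, 0.25, 0.25]
success_rates = [0.9, 0.6, 0.5, 0.3]  # invented numbers for illustration

for step, h in enumerate(entropy_trajectory(initial, success_rates, 5)):
    print(f"iteration {step}: entropy = {h:.3f}")
```

The strategy with a 0.6 success rate is far from useless, yet it is steadily crowded out by the 0.9 strategy. That is the loss of "present but not yet dominant" patterns described above.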

This is analogous to a biological monoculture: optimized for current conditions but brittle against environmental change. MAFT creates a polyculture — a population with diverse strategies that collectively covers a larger portion of the problem space.

Experimental Results

Subramaniam et al. (2025) demonstrate MAFT on mathematical reasoning and code generation tasks, comparing against:

Standard self-improvement: single model generating and training on its own outputs
Best-of-N sampling: single model generating N responses and selecting the best
Rejection sampling finetuning: single model trained on its own high-quality responses

MAFT outperforms all baselines, with the improvement growing over iterations — precisely the regime where single-agent methods plateau. On GSM8K and MATH benchmarks, the multi-agent population achieves accuracy levels that no single agent reaches through self-improvement alone.

Connection to Biological Evolution

The parallel to evolutionary biology is intentional. MAFT implements an analogous mechanism to natural selection: multiple agents with different strategies undergo selection pressure, and population diversity enables exploration beyond individual capacity. The authors frame MAFT as "simulated evolution" — not in a loose metaphorical sense, but in the structural sense of population diversity enabling optimization.

Critical Analysis

Claim	Evidence	Verdict
MAFT prevents diversity collapse in self-improvement	Measured diversity (reasoning strategy distribution) remains high across iterations	✅ Supported — the multi-agent structure maintains diversity by design
MAFT outperforms single-agent self-improvement	Consistent improvements on math and code benchmarks across multiple iterations	✅ Supported — with the caveat that N agents require N× the compute of one agent
Agent specialization emerges without explicit diversity objectives	Analysis shows agents developing distinct error profiles and reasoning preferences	✅ Supported — specialization is an emergent property of the training dynamic
MAFT is computationally efficient	N agents means N× the generation and training cost of single-agent methods	⚠️ Depends on framing — per-agent cost is identical; total cost scales linearly with N
The approach generalizes beyond math and code	Only demonstrated on reasoning-heavy tasks with verifiable answers	⚠️ Plausible but undemonstrated for open-ended generation, creative writing, etc.

The Compute Tradeoff

MAFT's improvement comes at a compute cost: running N agents is approximately N times more expensive than running one. Two comparisons are therefore relevant: "MAFT vs. a single model at the same wall-clock time" (when the agents run in parallel, so MAFT uses N× the compute) and "MAFT vs. a single model given N× the data" (when total compute is matched). Subramaniam et al. report that even when controlling for total compute, MAFT outperforms baselines, which suggests the diversity benefit exceeds what additional compute alone provides.
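The accounting behind these comparisons is simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes cost is proportional to the number of sampled responses and that every agent takes the same time per iteration; real finetuning costs would need to be added on top.

```python
def total_generations(n_agents: int, n_prompts: int, samples_per_prompt: int) -> int:
    """Responses sampled per iteration across the whole population."""
    return n_agents * n_prompts * samples_per_prompt


def compute_matched_samples_per_prompt(n_agents: int, samples_per_prompt: int) -> int:
    """Samples per prompt a single model gets under the same generation budget."""
    return n_agents * samples_per_prompt


def wall_clock(n_agents: int, time_per_agent: float, parallel: bool) -> float:
    """Elapsed time per iteration: agents run side by side or one after another."""
    return time_per_agent if parallel else n_agents * time_per_agent


# Hypothetical configuration: 4 agents, 1000 prompts, 8 samples per prompt.
print(total_generations(4, 1000, 8))
print(compute_matched_samples_per_prompt(4, 8))
print(wall_clock(4, 2.0, parallel=True), wall_clock(4, 2.0, parallel=False))
```

A compute-matched baseline in this accounting is a single model that draws `n_agents * samples_per_prompt` samples per prompt, which is the "N× data" comparison.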

Verification Requirements

MAFT works best when response quality can be verified automatically — math problems have correct answers, code has test cases, formal proofs have validators. For tasks where quality assessment requires human judgment (essay writing, nuanced dialogue, ethical reasoning), the cross-agent selection step becomes the bottleneck. This connects to the broader challenge of reward modeling and its limitations (Eisenstein et al., 2023), where reward models introduce their own biases into the selection process.
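For math problems, automatic verification can be as simple as comparing the last number in a response with the reference answer. The sketch below is one hypothetical way to do that; the function names and the "last number wins" convention are assumptions, and it makes no attempt to judge the reasoning itself.

```python
import re
from fractions import Fraction
from typing import Optional

# Matches integers, decimals, and simple fractions such as -12, 3.5, or 3/4.
_NUMBER = re.compile(r"-?\d+(?:\.\d+|/\d+)?")


def extract_final_number(response: str) -> Optional[Fraction]:
    """Return the last number in the text, or None if there is none."""
    # Dropping commas handles thousands separators like "1,234"; it is a
    # simplification and would also merge unspaced lists like "1,2".
    matches = _NUMBER.findall(response.replace(",", ""))
    if not matches:
        return None
    try:
        return Fraction(matches[-1])
    except (ValueError, ZeroDivisionError):
        return None


def verify_math(response: str, gold_answer: str) -> bool:
    """True if the response's final number equals the reference answer exactly."""
    predicted = extract_final_number(response)
    return predicted is not None and predicted == Fraction(gold_answer)


print(verify_math("So the total is 1,234.", "1234"))
print(verify_math("Therefore x = 3/4", "0.75"))
print(verify_math("I am not sure.", "1"))
```

No comparably cheap check exists for an essay or a piece of ethical reasoning, which is why the selection step becomes the bottleneck there.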

Open Questions

Optimal population size: How many agents are needed to capture sufficient diversity? Is there diminishing return beyond N=4 or N=8?

Merge versus ensemble: Can the specialized agents be merged (model merging techniques) into a single model that retains the diversity benefits, or does deployment require maintaining the full population?

Domain transfer: Does specialization developed on math reasoning transfer to code generation, or do agents need to specialize independently for each domain?

Scaling with model size: MAFT has been demonstrated on ~7B parameter models. How does the diversity benefit scale with model size — does a 70B model already contain sufficient internal diversity to make population-level diversity redundant?

Human-feedback integration: Can MAFT be combined with RLHF, where each agent learns from its own preference data trajectory, producing diverse alignment strategies?
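The merge-versus-ensemble question above contrasts two deployment options, which the sketch below illustrates in their simplest forms: uniform parameter averaging (one basic merging technique among many) and majority voting over the agents' answers. Both functions are hypothetical illustrations on plain Python lists, not the method of any particular paper, and whether averaging preserves the specialization is exactly the open question.

```python
from collections import Counter
from typing import Dict, List, Sequence


def average_weights(state_dicts: Sequence[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Merge: average each parameter elementwise, yielding one model to deploy."""
    n = len(state_dicts)
    return {
        key: [sum(values) / n for values in zip(*(sd[key] for sd in state_dicts))]
        for key in state_dicts[0]
    }


def majority_vote(answers: Sequence[str]) -> str:
    """Ensemble: keep all agents and return their most common answer."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical two-agent population with a single tiny parameter vector.
agent_a = {"w": [1.0, 2.0]}
agent_b = {"w": [3.0, 4.0]}
print(average_weights([agent_a, agent_b]))

# Hypothetical final answers from three agents to the same prompt.
print(majority_vote(["42", "41", "42"]))
```

Merging costs one forward pass at inference time, while the ensemble costs one per agent; the trade is inference cost against the risk of averaging away the differences that made the population useful.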

Closing

Multiagent Finetuning addresses the diversity collapse problem in language model self-improvement through a structurally simple mechanism: maintain a population of model copies, let them specialize through interaction, and use cross-agent selection to preserve high-quality diverse reasoning. The approach draws a deliberate parallel to evolutionary dynamics and demonstrates consistent improvements over single-agent baselines on verifiable reasoning tasks. The open questions — optimal population size, merging strategies, domain transfer, and scaling behavior — define the research agenda for extending MAFT from a compelling proof-of-concept to a practical training methodology.

References

Subramaniam, V., Du, Y., Tenenbaum, J., et al. (2025). Multiagent finetuning: Self improvement with diverse reasoning chains. arXiv preprint.

Eisenstein, J., Nagpal, C., & Agarwal, A. (2023). Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking. arXiv preprint.

Zhuang, Y., Yu, X., Wu, J., et al. (2025). Self-taught agentic long context understanding. arXiv preprint.
