Multiagent Finetuning: How One Base Model Becomes Many Specialized Agents
Multiagent Finetuning (MAFT) starts from a single base language model and produces multiple specialized agent copies that generate diverse reasoning chains. It then uses inter-agent selection pressure to improve each agent beyond what single-model self-improvement can achieve, avoiding the collapse that plagues standard synthetic data training.
By ORAA Research
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.
A persistent challenge in language model self-improvement is diversity collapse. When a single model generates synthetic training data and then trains on it, the output distribution narrows with each iteration: the model converges on a small set of reasoning patterns, losing the ability to explore alternative approaches. This is not a subtle effect; it has been documented across multiple studies as a fundamental limitation of single-agent self-play.
Multiagent Finetuning (MAFT), introduced by Subramaniam et al. (2025), offers a structural solution. Instead of one model improving itself in isolation, MAFT creates multiple copies of the same base model and differentiates them through interaction. Each copy develops specialized reasoning strategies, and the diversity of the population prevents any single copy from collapsing into a narrow pattern. The approach has accumulated notable attention since its publication, reflecting broad interest in overcoming self-improvement bottlenecks.
The Research Landscape
The Core Mechanism
MAFT operates through a multi-step process:
Step 1: Population initialization. Start with N copies of the same base language model. All copies are initially identical.
Step 2: Diverse generation. Each copy generates responses to training prompts. Because the generation process involves sampling (temperature, top-p), even identical models produce different outputs. Over iterations, the copies diverge further as they train on different subsets of correct responses.
Step 3: Cross-agent selection. For each training prompt, responses from all N agents are evaluated (using a reward model, verifier, or ground-truth labels). The best responses are selected regardless of which agent produced them.
Step 4: Specialized training. Each agent is finetuned on a mixture of its own successful responses and the successful responses of other agents, with a weighting scheme that encourages each agent to develop distinct capabilities.
Step 5: Iteration. Steps 2 through 4 repeat, with the population progressively specializing and improving.
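The five steps above can be sketched as a toy loop. This is a minimal illustration of the data flow under stated assumptions, not the authors' implementation. `ToyAgent`, `verify`, `maft_round`, and the `own_weight` mixing ratio are hypothetical names; the "model" is a noisy adder standing in for a language model; and `finetune` only records examples where a real system would run a gradient update, so the toy agents do not actually improve. The weighting scheme in Step 4 is not specified in this summary, so a fixed own-to-borrowed ratio is assumed.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical stand-in for one finetuned copy of the base model."""
    name: str
    train_set: list = field(default_factory=list)

    def generate(self, prompt, rng):
        # Placeholder for sampling from an LLM: a noisy guess at a + b.
        a, b = prompt
        return a + b + rng.choice([-1, 0, 0, 1])

    def finetune(self, examples):
        # Placeholder for a gradient update: only records the data.
        self.train_set.extend(examples)


def verify(prompt, response):
    """Ground-truth check (Step 3 assumes some verifier like this exists)."""
    a, b = prompt
    return response == a + b


def maft_round(agents, prompts, rng, own_weight=0.75):
    """Run Steps 2 to 4 once. `own_weight` is an assumed mixing ratio."""
    # Steps 2 and 3: every agent answers every prompt; keep verified answers.
    correct = {agent.name: [] for agent in agents}
    for prompt in prompts:
        for agent in agents:
            response = agent.generate(prompt, rng)
            if verify(prompt, response):
                correct[agent.name].append((prompt, response))

    # Step 4: each agent trains mostly on its own successes, plus a smaller
    # random share of the other agents' successes.
    for agent in agents:
        own = correct[agent.name]
        others = [ex for name, exs in correct.items()
                  if name != agent.name for ex in exs]
        n_others = round(len(own) * (1 - own_weight) / own_weight)
        borrowed = rng.sample(others, min(n_others, len(others)))
        agent.finetune(own + borrowed)
    return correct


# Step 1: N identical copies. Step 5: repeat the round.
rng = random.Random(0)
agents = [ToyAgent(f"agent{i}") for i in range(3)]
prompts = [(i, i + 1) for i in range(20)]
for _ in range(3):
    maft_round(agents, prompts, rng)
```

Because each agent keeps a different mixture of its own and borrowed examples, the training sets drift apart across rounds even though all copies start identical.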
The key insight is that the multi-agent structure maintains diversity by construction. Even if Agent 1 converges on a narrow reasoning style, Agent 2 may have developed a different approach that solves problems Agent 1 fails on. The population as a whole remains more capable than any individual member.
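One way to make the population claim concrete is to compare the best individual agent with the union of what all agents solve. The sketch below uses a hypothetical `coverage` helper and made-up problem IDs. The union is an upper bound: whether a deployed system reaches it depends on how the agents' answers are aggregated, which this summary does not specify.

```python
def coverage(solved_by_agent):
    """Return (best individual count, population count) of solved problems."""
    best_individual = max((len(s) for s in solved_by_agent.values()), default=0)
    population = len(set().union(*solved_by_agent.values()))
    return best_individual, population


# Made-up example: each agent solves a different, partly overlapping subset.
solved = {"agent1": {1, 2, 3}, "agent2": {3, 4, 5}, "agent3": {2, 6}}
print(coverage(solved))  # (3, 6)
```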
Why Single-Agent Self-Improvement Fails
To understand MAFT's contribution, consider why standard self-improvement plateaus. When a model generates synthetic data and trains on the correct subset, it reinforces the reasoning patterns that already work and loses the patterns that were present but not yet dominant. After a few iterations, the model can only solve problems in the way it already knows, even if alternative approaches would handle novel problems better.
This is analogous to a biological monoculture: optimized for current conditions but brittle against environmental change. MAFT creates a polyculture: a population with diverse strategies that collectively covers a larger portion of the problem space.
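The narrowing effect can be reproduced in a toy simulation. The sketch below is not from the paper: it models a "model" as a probability distribution over a handful of reasoning strategies, treats every sample as correct (a simplifying assumption), and refits the distribution to its own samples each round. The names `entropy`, `self_train`, and `simulate` are illustrative.

```python
import math
import random
from collections import Counter


def entropy(probs):
    """Shannon entropy in nats; lower means less diverse."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def self_train(probs, n_samples, rng):
    """One round: sample strategies, then refit the model to its own samples."""
    strategies = range(len(probs))
    samples = rng.choices(strategies, weights=probs, k=n_samples)
    counts = Counter(samples)
    return [counts[s] / n_samples for s in strategies]


def simulate(n_strategies=8, n_samples=30, n_rounds=40, seed=0):
    """Start from a uniform distribution and self-train repeatedly."""
    rng = random.Random(seed)
    probs = [1 / n_strategies] * n_strategies
    history = [probs]
    for _ in range(n_rounds):
        probs = self_train(probs, n_samples, rng)
        history.append(probs)
    return history


history = simulate()
for step in (0, 10, 40):
    surviving = sum(p > 0 for p in history[step])
    print(step, surviving, round(entropy(history[step]), 3))
```

In this toy model a strategy that drops to zero probability can never be sampled again, so the number of surviving strategies can only shrink. That one-way loss is the mechanism the monoculture analogy describes.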
Experimental Results
Subramaniam et al. (2025) demonstrate MAFT on mathematical reasoning and code generation tasks, comparing against:
- Standard self-improvement: single model generating and training on its own outputs
- Best-of-N sampling: single model generating N responses and selecting the best
- Rejection sampling finetuning: single model trained on its own high-quality responses
MAFT outperforms all baselines, with the improvement growing over iterations, precisely the regime where single-agent methods plateau. On GSM8K and MATH benchmarks, the multi-agent population achieves accuracy levels that no single agent reaches through self-improvement alone.
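Two of the baselines listed above are easy to confuse, so a short sketch may help: Best-of-N picks one answer at inference time, while rejection sampling finetuning builds a training set from a model's own verified answers. The function names, the scoring function, and the addition prompts below are illustrative assumptions, not the paper's setup.

```python
def best_of_n(candidates, score):
    """Inference-time baseline: return the highest-scoring candidate."""
    return max(candidates, key=score)


def rejection_sampling_dataset(samples, is_correct):
    """Training-time baseline: keep (prompt, response) pairs that pass a check.

    `samples` maps each prompt to the responses one model generated for it.
    """
    return [(prompt, response)
            for prompt, responses in samples.items()
            for response in responses
            if is_correct(prompt, response)]


# Toy usage with addition prompts; the score and the check are made up.
candidates = [6, 7, 9]
print(best_of_n(candidates, score=lambda r: -abs(r - 7)))  # 7

samples = {(3, 4): [7, 8, 7], (1, 1): [3]}
print(rejection_sampling_dataset(samples, lambda p, r: r == sum(p)))
# [((3, 4), 7), ((3, 4), 7)]
```

Both baselines draw on a single model's samples, which is the limitation MAFT's cross-agent pool is meant to address.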
Connection to Biological Evolution
The parallel to evolutionary biology is intentional. MAFT implements an analogous mechanism to natural selection: multiple agents with different strategies undergo selection pressure, and population diversity enables exploration beyond individual capacity. The authors frame MAFT as "simulated evolution", not in a loose metaphorical sense, but in the structural sense of population diversity enabling optimization.
Critical Analysis
| Claim | Evidence | Verdict |
|---|---|---|
| MAFT prevents diversity collapse in self-improvement | Measured diversity (reasoning strategy distribution) remains high across iterations | ✅ Supported: the multi-agent structure maintains diversity by design |
| MAFT outperforms single-agent self-improvement | Consistent improvements on math and code benchmarks across multiple iterations | ✅ Supported, with the caveat that N agents require N× the compute of one agent |
| Agent specialization emerges without explicit diversity objectives | Analysis shows agents developing distinct error profiles and reasoning preferences | ✅ Supported: specialization is an emergent property of the training dynamic |
| MAFT is computationally efficient | N agents means N× the generation and training cost of single-agent methods | ⚠️ Depends on framing: per-agent cost is identical; total cost scales linearly with N |
| The approach generalizes beyond math and code | Only demonstrated on reasoning-heavy tasks with verifiable answers | ⚠️ Plausible but undemonstrated for open-ended generation, creative writing, etc. |
The Compute Tradeoff
MAFT's improvement comes at a compute cost: running N agents is approximately N times more expensive than running one. A fair comparison therefore has to fix a budget. If the agents run in parallel, the comparison is MAFT against a single model at the same wall-clock time; if compute is matched, it is MAFT against a single model given N× the data. Subramaniam et al. report that even when controlling for total compute, MAFT outperforms baselines, suggesting the diversity benefit exceeds what additional compute alone provides.
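The compute-matched framing can be written out as simple arithmetic. The sketch below counts only sampled responses as a crude proxy for compute and ignores training cost, which is an assumption of this example. The function names and the numbers are illustrative.

```python
def generation_calls(n_agents, n_prompts, samples_per_prompt, n_rounds):
    """Total sampled responses, used here as a crude proxy for compute."""
    return n_agents * n_prompts * samples_per_prompt * n_rounds


def compute_matched_samples(n_agents, samples_per_agent):
    """Samples per prompt a single model may draw under the same budget."""
    return n_agents * samples_per_agent


# Four agents drawing one sample each vs. one model drawing four samples.
maft = generation_calls(n_agents=4, n_prompts=1000, samples_per_prompt=1, n_rounds=3)
single = generation_calls(n_agents=1, n_prompts=1000,
                          samples_per_prompt=compute_matched_samples(4, 1),
                          n_rounds=3)
print(maft, single)  # 12000 12000
```

Under this accounting, the single-model baseline is given the same number of samples as the whole population, so any remaining gap cannot be attributed to sample count alone.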
Verification Requirements
MAFT works best when response quality can be verified automatically: math problems have correct answers, code has test cases, formal proofs have validators. For tasks where quality assessment requires human judgment (essay writing, nuanced dialogue, ethical reasoning), the cross-agent selection step becomes the bottleneck. This connects to the broader challenge of reward modeling and its limitations (Eisenstein et al., 2023), where reward models introduce their own biases into the selection process.
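Automatic verification of the kind described here can be quite small. The sketch below shows two hypothetical verifiers, `verify_math` and `verify_code`; neither is taken from the paper. The math check assumes the final answer has already been extracted as a numeric string, and the code check assumes the candidate is a trusted Python callable, since running untrusted generated code would need sandboxing.

```python
from fractions import Fraction


def verify_math(response: str, expected: str) -> bool:
    """Compare final numeric answers exactly, so '0.5' matches '1/2'."""
    try:
        return Fraction(response.strip()) == Fraction(expected.strip())
    except (ValueError, ZeroDivisionError):
        # Non-numeric or malformed answers count as wrong.
        return False


def verify_code(candidate, test_cases) -> bool:
    """Run a callable against (args, expected) pairs; any error is a failure."""
    try:
        return all(candidate(*args) == expected for args, expected in test_cases)
    except Exception:
        return False


print(verify_math("0.5", "1/2"))       # True
print(verify_math("one half", "1/2"))  # False
print(verify_code(lambda a, b: a + b, [((1, 2), 3), ((0, 0), 0)]))  # True
```

No comparably cheap check exists for an essay or a dialogue turn, which is why the selection step becomes the bottleneck outside verifiable domains.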
Open Questions
- Optimal population size: How many agents are needed to capture sufficient diversity? Is there diminishing return beyond N=4 or N=8?
- Merge versus ensemble: Can the specialized agents be merged (model merging techniques) into a single model that retains the diversity benefits, or does deployment require maintaining the full population?
- Domain transfer: Does specialization developed on math reasoning transfer to code generation, or do agents need to specialize independently for each domain?
- Scaling with model size: MAFT has been demonstrated on ~7B parameter models. How does the diversity benefit scale with model size? Does a 70B model already contain sufficient internal diversity to make population-level diversity redundant?
- Human-feedback integration: Can MAFT be combined with RLHF, where each agent learns from its own preference data trajectory, producing diverse alignment strategies?
Closing
Multiagent Finetuning addresses the diversity collapse problem in language model self-improvement through a structurally simple mechanism: maintain a population of model copies, let them specialize through interaction, and use cross-agent selection to preserve high-quality diverse reasoning. The approach draws a deliberate parallel to evolutionary dynamics and demonstrates consistent improvements over single-agent baselines on verifiable reasoning tasks. The open questions (optimal population size, merging strategies, domain transfer, and scaling behavior) define the research agenda for extending MAFT from a compelling proof-of-concept to a practical training methodology.
References (3)
Subramaniam, V., Du, Y., Tenenbaum, J., et al. (2025). Multiagent finetuning: Self improvement with diverse reasoning chains. arXiv preprint.
Eisenstein, J., Nagpal, C., & Agarwal, A. (2023). Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking. arXiv preprint.
Zhuang, Y., Yu, X., Wu, J., et al. (2025). Self-taught agentic long context understanding. arXiv preprint.