
The MoE Takeover: Why a Majority of 2025's LLMs Use Mixture-of-Experts

Mixture-of-Experts has become a default architecture for large open LLMs in 2025, with models like DeepSeek-R1, Kimi K2, and Mistral Large 3 adopting it. We examine how DeepSeekMoE's expert specialization strategies shaped this trend and which design choices make MoE work at scale.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Consider a simple observation: among the most capable open-source language models released in 2024 and 2025 (DBRX and Arctic in 2024; DeepSeek-R1, Kimi K2, and Mistral Large 3 in 2025), the majority use Mixture-of-Experts (MoE) architectures. This was not the case two years earlier. In 2023, dense Transformer models dominated. Something changed, and the change was not gradual. MoE went from a niche research topic to the default production architecture within roughly eighteen months.

To understand why, it helps to examine one of the papers that catalyzed this shift: DeepSeekMoE (Dai et al., 2024), which introduced design principles that subsequent MoE models have widely adopted.

Research Landscape: The Economics of Sparsity

The appeal of MoE is fundamentally economic. A dense model activates all its parameters for every token. An MoE model activates only a subset, typically the top-k "experts" selected by a routing mechanism, so a model with 140 billion total parameters might use only 20 billion parameters per forward pass. The total parameter count bounds how much the model can store. The active parameter count sets its per-token inference compute. MoE decouples these two quantities.
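The routing step described above can be sketched in a few lines. This is a minimal illustration rather than any particular model's router: the experts are toy scalar functions, the logits are made up, and renormalizing the gate weights over the selected experts is an assumption of this sketch (implementations differ on that point).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize gate weights so the selected experts sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, experts, router_logits, k):
    """Weighted sum over the selected experts only; the rest are never evaluated."""
    routes = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in routes), routes

# Eight toy scalar "experts"; a real expert is a feed-forward network.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # hypothetical router scores

y, routes = moe_forward(1.0, experts, logits, k=2)
active_fraction = len(routes) / len(experts)
print(routes, y, active_fraction)
```

Only two of the eight experts run for this token, which is the source of the gap between total and active parameters.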

This decoupling matters because scaling dense models has run into practical cost limits. Training and serving a dense 1T-parameter model requires infrastructure that only a handful of organizations can afford. An MoE model with 1T total parameters and 100B active parameters per token aims for knowledge capacity closer to that of the 1T dense model while requiring per-token inference compute closer to that of a 100B dense model.
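A back-of-the-envelope comparison makes the trade concrete. The sketch below relies on two assumptions that are not from the source: the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass, and 2 bytes per parameter for stored weights. It ignores attention over long contexts, KV cache, and communication.

```python
def forward_flops_per_token(active_params):
    # Rule-of-thumb assumption: ~2 FLOPs per active parameter per token.
    return 2 * active_params

def weight_memory_gb(total_params, bytes_per_param=2):
    # Every expert must be stored, so memory follows the total count.
    return total_params * bytes_per_param / 1e9

models = {
    "dense 1T": {"total": 10**12, "active": 10**12},
    "MoE 1T / 100B active": {"total": 10**12, "active": 10**11},
    "dense 100B": {"total": 10**11, "active": 10**11},
}

for name, m in models.items():
    flops = forward_flops_per_token(m["active"])
    mem = weight_memory_gb(m["total"])
    print(f"{name:22s} {flops:.1e} FLOPs/token  {mem:,.0f} GB weights")

compute_ratio = (forward_flops_per_token(models["MoE 1T / 100B active"]["active"])
                 / forward_flops_per_token(models["dense 1T"]["active"]))
```

Under these assumptions the MoE model has the per-token compute of the 100B dense model but the weight memory of the 1T dense model.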

The catch is that MoE introduces engineering complexity: expert routing must be load-balanced, communication overhead between experts on different GPUs must be managed, and the full model must still fit in memory even though only a fraction of it is active for any token. During 2024 and 2025 these challenges were largely worked out at production scale, removing the primary barrier to adoption.
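One widely used way to encourage balanced routing is an auxiliary loss in the style of the Switch Transformer, which multiplies the fraction of tokens sent to each expert by the mean router probability for that expert. The sketch below assumes top-1 routing and plain Python lists; it is an illustration of that loss form, not necessarily the balancing scheme any particular 2025 model uses.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Auxiliary balance loss: num_experts * sum_i f_i * P_i.

    router_probs: one probability list (over experts) per token.
    assignments:  the expert index each token was actually sent to (top-1).
    f_i is the fraction of tokens dispatched to expert i,
    P_i is the mean router probability assigned to expert i.
    """
    num_tokens = len(assignments)
    f = [0.0] * num_experts
    for expert in assignments:
        f[expert] += 1.0 / num_tokens
    p = [sum(tok[i] for tok in router_probs) / num_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly balanced routing: every expert gets one token, uniform probabilities.
balanced = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)

# Collapsed routing: every token goes to expert 0 with high probability.
collapsed = load_balance_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], num_experts=4)

print(balanced, collapsed)
```

The loss is smallest when tokens are spread evenly and grows as routing collapses onto a few experts, so adding a small multiple of it to the training objective pushes the router toward balance.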

DeepSeekMoE: Two Design Innovations

Dai et al. (2024) identified a core problem with conventional MoE architectures: expert redundancy. In standard MoE (e.g., GShard, Switch Transformer), experts tend to learn overlapping representations. When multiple experts encode similar knowledge, the model wastes capacity — the total parameter count overstates the effective knowledge.

DeepSeekMoE introduces two strategies to address this:

Fine-Grained Expert Segmentation

Instead of N experts with K activated, DeepSeekMoE uses mN experts with mK activated, where each expert is 1/m the size of a conventional expert. With more, smaller experts, the routing mechanism can compose expert combinations with finer granularity. The analogy is moving from selecting whole dishes at a buffet to selecting individual ingredients — more combinations become possible, and the model can specialize more precisely.
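The gain in combinatorial flexibility is easy to quantify. The sketch below counts the possible expert subsets before and after segmentation and checks that the active expert parameters per token stay the same; the values N = 16, K = 2, m = 4 and the expert size are illustrative choices, not a statement about any released model.

```python
from math import comb

def routing_combinations(num_experts, top_k, m=1):
    """Number of distinct expert subsets when each expert is split into m parts."""
    return comb(m * num_experts, m * top_k)

def active_expert_params(expert_size, top_k, m=1):
    """Active expert parameters per token: m*K experts, each 1/m the original size."""
    return (m * top_k) * (expert_size // m)

N, K, m = 16, 2, 4          # illustrative values
expert_size = 1_000_000     # hypothetical parameters per conventional expert

conventional = routing_combinations(N, K)        # choose 2 of 16
fine_grained = routing_combinations(N, K, m=m)   # choose 8 of 64

print(conventional, fine_grained)
print(active_expert_params(expert_size, K), active_expert_params(expert_size, K, m=m))
```

With these values the number of possible combinations rises from 120 to more than four billion while the per-token expert compute is unchanged.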

Shared Expert Isolation

The second innovation is isolating K_s shared experts that are always active, regardless of routing decisions. To keep per-token compute constant, the number of routed experts activated per token is reduced by K_s. These shared experts capture common knowledge (syntactic patterns, frequent vocabulary, general world knowledge) that every input requires. Because capacity is explicitly dedicated to this common knowledge, the routed experts are freed to specialize in domain-specific or task-specific representations.
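Combining shared and routed experts in one layer can be sketched as follows. Everything here is a toy: experts are scalar functions, the logits are invented, and two choices are assumptions of this sketch rather than confirmed details: the routed slots are set to mK minus K_s so the active budget stays fixed, and the gate weights are raw softmax probabilities with a residual connection added to the output.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shared_routed_forward(x, shared_experts, routed_experts, router_logits, k_routed):
    """Shared experts always run; only the top k_routed routed experts run."""
    shared_out = sum(expert(x) for expert in shared_experts)
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k_routed]
    routed_out = sum(probs[i] * routed_experts[i](x) for i in chosen)
    return x + shared_out + routed_out, chosen  # residual + shared + routed

m, N, K, K_s = 4, 4, 2, 2          # illustrative values
num_routed = m * N - K_s           # 14 routed experts remain after isolating 2 shared
k_routed = m * K - K_s             # 6 routed slots, so 8 experts are active in total

shared = [lambda x: 0.5 * x for _ in range(K_s)]
routed = [lambda x, s=i: 0.1 * s * x for i in range(num_routed)]
logits = [0.1 * i for i in range(num_routed)]   # hypothetical router scores

y, chosen = shared_routed_forward(1.0, shared, routed, logits, k_routed)
y_rev, chosen_rev = shared_routed_forward(1.0, shared, routed, logits[::-1], k_routed)
print(sorted(chosen), sorted(chosen_rev), y, y_rev)
```

Reversing the logits changes which routed experts fire, but the shared experts contribute the same amount either way, which is the point of isolating them.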

The combination yields measurable results. According to the paper, a 2B-parameter DeepSeekMoE model achieves performance comparable to GShard 2.9B, which has 1.5 times the expert parameters and computation. At the 16B-parameter scale, DeepSeekMoE is comparable to LLaMA2 7B while using approximately 40% of the computation. The 145B-parameter variant achieves performance comparable to DeepSeek 67B while using only 28.5% of the computation.

Critical Analysis: Claims and Evidence

Claim	Source	Assessment
2B DeepSeekMoE matches 2.9B GShard MoE	Paper benchmarks	Supported; comparison on standard language modeling tasks
16B variant matches LLaMA2 7B at ~40% compute	Paper benchmarks	Supported; the compute comparison is meaningful for deployment cost
145B variant matches DeepSeek 67B at 28.5% compute	Paper benchmarks	Supported; demonstrates scaling of the approach
Fine-grained segmentation reduces expert redundancy	Ablation studies	Supported; ablations show performance degradation when segmentation is removed
Shared experts improve specialization of routed experts	Ablation studies	Supported; analysis shows reduced knowledge overlap in routed experts

What the Numbers Do and Do Not Tell Us

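The efficiency claims require careful interpretation. "Matches performance at X% of compute" refers to compute per token (FLOPs), which is driven by active parameters; it says nothing about memory. In a sparse MoE, each token's forward and backward pass touches only the experts it was routed to, so per-token training FLOPs also scale with active parameters rather than total parameters. What does not shrink is everything tied to the total parameter count: weights, gradients, and optimizer states must all be stored, and expert parallelism adds communication and load-balancing overhead. The economics therefore favor MoE most clearly where compute per token is the bottleneck, as in high-volume serving, and less clearly where memory or interconnect is the constraint.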

The 2025 MoE Landscape

DeepSeekMoE's design principles — fine-grained experts, shared expert isolation, careful load balancing — reappear throughout 2025's model releases, though each team adapts them:

DeepSeek-R1: Built on the DeepSeek-V3 MoE base model and trained with reinforcement learning for reasoning, demonstrating that MoE is compatible with reasoning-focused training.
Kimi K2: Adopts MoE with aggressive expert counts, pushing the total-to-active parameter ratio further.
Mistral Large 3: Uses MoE with emphasis on multilingual expert specialization.

The common thread is that MoE has moved from "interesting alternative" to "why would you not use this?" — a status shift driven by demonstrated cost efficiency and the engineering maturity to deploy it reliably.

Open Questions

Expert specialization interpretability: Do individual experts learn interpretable specializations (e.g., "this expert handles legal language"), or are the representations distributed and opaque? Early evidence is mixed.

Routing failure modes: When the router assigns a token to the wrong expert, the failure is silent — the model produces output, but from a suboptimal expert. How do we detect and measure routing errors?

Expert pruning: If some experts are rarely activated, can they be removed post-training to reduce memory footprint?

MoE for small models: Does the architecture provide meaningful benefits at 7B or smaller, where routing overhead may outweigh efficiency gains?
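The routing and pruning questions above both start from the same measurement: how often each expert is actually used. The sketch below assumes a hypothetical routing log (one list of selected expert indices per token) and an arbitrary threshold; it only flags candidates and says nothing about whether removing them preserves quality, which is the open question. Utilization also cannot reveal that a token went to the "wrong" expert, only that load is uneven.

```python
import math
from collections import Counter

def expert_utilization(routing_log, num_experts):
    """Share of all routing decisions that went to each expert."""
    counts = Counter(e for token_experts in routing_log for e in token_experts)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_experts
    return [counts.get(i, 0) / total for i in range(num_experts)]

def normalized_entropy(util):
    """1.0 means perfectly even usage; values near 0 mean routing has collapsed."""
    h = -sum(p * math.log(p) for p in util if p > 0)
    return h / math.log(len(util))

def pruning_candidates(util, threshold):
    """Experts whose share of routing decisions falls below the threshold."""
    return [i for i, p in enumerate(util) if p < threshold]

# Hypothetical log: four tokens, top-2 routing over four experts.
routing_log = [[0, 1], [0, 2], [0, 1], [1, 2]]

util = expert_utilization(routing_log, num_experts=4)
print(util, normalized_entropy(util), pruning_candidates(util, threshold=0.05))
```

In practice the log would be collected over a representative dataset, since an expert that is idle on one domain may be heavily used on another.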

What This Means for Practitioners

If you are choosing an LLM architecture for a new project in 2025, MoE is now a reasonable default for larger models. The threshold of approximately 30B total parameters is a rule of thumb rather than a measured result, and the paper itself reports gains at 2B and 16B. For fine-tuning, selectively fine-tuning routed experts while freezing shared experts can be more parameter-efficient than full fine-tuning.
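The fine-tuning suggestion amounts to choosing which parameters receive gradient updates. The sketch below works on a plain dictionary of parameter names and sizes; the names and the "routed_experts" tag are hypothetical, and freezing attention and router weights along with the shared experts is an assumption of this sketch, not something the source specifies.

```python
def split_parameters(param_counts, routed_tag="routed_experts"):
    """Partition parameters into trainable (routed experts) and frozen (everything else)."""
    trainable = {n: c for n, c in param_counts.items() if routed_tag in n}
    frozen = {n: c for n, c in param_counts.items() if routed_tag not in n}
    return trainable, frozen

def trainable_fraction(param_counts, routed_tag="routed_experts"):
    trainable, _ = split_parameters(param_counts, routed_tag)
    return sum(trainable.values()) / sum(param_counts.values())

# Hypothetical one-layer model; names and sizes are made up for illustration.
toy_model = {
    "layers.0.attention.weight": 400,
    "layers.0.router.weight": 100,
    "layers.0.shared_experts.0.weight": 500,
    "layers.0.routed_experts.0.weight": 250,
    "layers.0.routed_experts.1.weight": 250,
    "layers.0.routed_experts.2.weight": 250,
    "layers.0.routed_experts.3.weight": 250,
}

trainable, frozen = split_parameters(toy_model)
print(sorted(trainable), trainable_fraction(toy_model))
```

Because routed experts usually hold most of an MoE model's parameters, the savings from this split alone may be modest; restricting updates to a subset of routed experts would reduce the trainable share further.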

The broader lesson is architectural: some of the most impactful recent advances in LLM efficiency have come from changing which parameters activate for which inputs. Sparse computation is a strategy whose time has clearly arrived.

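Disclaimer: This post is a research overview intended for informational purposes. Before citing it in academic work, verify specific findings, statistics, and claims against the original papers.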

References (1)

[1] Dai, D., Deng, C., Zhao, C., Xu, R.X., Gao, H., Chen, D., Li, J. et al. (2024). DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv:2401.06066.
