
DeepSeek-V3: How 671 Billion Parameters Activate Only 37 Billion Per Token

DeepSeek-V3 stores 671 billion parameters but activates only 37 billion per token—a ratio of roughly 18:1. This architectural choice, combining Multi-head Latent Attention with auxiliary-loss-free load balancing in a Mixture-of-Experts framework, achieves competitive performance at a reported training cost of 2.788 million H800 GPU-hours.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The economics of large language models present a persistent tension. Larger models tend to perform better, but they also cost more to train and, critically, more to serve. A dense model with 671 billion parameters would require enormous GPU clusters just for inference, making it impractical for most applications. The Mixture-of-Experts (MoE) architecture resolves this tension by decoupling model capacity from computational cost: the model stores knowledge across many parameters but activates only a fraction of them for each input token.
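To make the idea of sparse activation concrete, here is a minimal sketch of generic top-k routing for a single token. It is not DeepSeek-V3's router: the softmax gate, the toy experts (each just scales its input), and the names `moe_forward` and `make_expert` are assumptions chosen for illustration.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(scale):
    # Toy "expert": multiplies every element of the token by a constant.
    return lambda vec: [scale * x for x in vec]

def moe_forward(token, router_weights, experts, top_k):
    """Route one token vector to its top_k experts and mix their outputs."""
    # One router score per expert (dot product of router row and token).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    # Only the top_k experts are evaluated; the rest stay idle for this token.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(token)
    for i in chosen:
        gate = probs[i] / norm  # renormalize gates over the chosen experts
        expert_out = experts[i](token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen

random.seed(0)
dim, num_experts, top_k = 4, 8, 2
router_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
experts = [make_expert(i + 1) for i in range(num_experts)]
token = [0.5, -1.0, 0.25, 2.0]
output, chosen = moe_forward(token, router_weights, experts, top_k)
print("experts used:", chosen, "of", num_experts)
```

All eight experts must be held in memory, but only two of them do any work for this token. That is the same capacity-versus-compute split the 671B/37B figures describe, at toy scale.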

DeepSeek-V3 (DeepSeek-AI, 2024) pushes this architecture to a notable scale: 671 billion total parameters with only 37 billion activated per token. At a ratio of roughly 18:1, the model has the knowledge capacity of a 671B model but a per-token computational cost closer to that of a 37B model. The technical report details several architectural innovations that make this work, including Multi-head Latent Attention (MLA) and an auxiliary-loss-free approach to expert load balancing.

The Research Landscape

Mixture-of-Experts is not new. The concept dates to Jacobs et al. (1991), and Shazeer et al. (2017) demonstrated its applicability to large-scale language models. The GShard and Switch Transformer papers (2020–2021) established MoE as a practical architecture for training models beyond dense-model cost constraints.

The persistent challenge with MoE has been load balancing: ensuring that input tokens are distributed roughly evenly across experts. Without balancing, popular experts become overloaded while others sit idle, wasting both compute and capacity. The standard solution is an auxiliary loss—an additional training objective that penalizes uneven expert utilization.

However, auxiliary losses introduce their own problems. They compete with the primary language modeling objective, and the balance between the two losses requires careful tuning. Too little auxiliary loss and experts become unbalanced; too much and the model sacrifices language quality for load balance. This tension has been a practical bottleneck in MoE training.
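For readers who have not seen one, the sketch below shows an auxiliary balance loss in the style of the Switch Transformer: the weight `alpha` times the number of experts times the sum, over experts, of the fraction of tokens routed to each expert multiplied by its mean router probability. The function name and the example inputs are assumptions for illustration, and top-1 routing is assumed for simplicity.

```python
def load_balance_loss(router_probs, alpha=0.01):
    """router_probs: one list of per-expert probabilities for each token."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f[i]: fraction of tokens whose top-1 expert is i.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda i: probs[i])] += 1
    f = [c / num_tokens for c in counts]
    # p[i]: mean router probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced routing: each of four tokens prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Collapsed routing: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

balanced_loss = load_balance_loss(balanced)
collapsed_loss = load_balance_loss(collapsed)
print(balanced_loss, collapsed_loss)
```

The loss is smallest when routing is uniform and grows as tokens pile onto one expert. The weight `alpha` is the knob the paragraph above refers to: it sets how strongly this term pulls against the language modeling loss.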

DeepSeek-V3 introduces an auxiliary-loss-free load balancing strategy, which, according to the technical report, eliminates this tension by achieving balanced routing without an additional loss term.

Architectural Innovations

The technical report describes three primary architectural contributions:

Multi-head Latent Attention (MLA)

Standard multi-head attention computes separate query, key, and value projections for each attention head. During generation, the keys and values of every previous token are cached, so the KV cache grows linearly with sequence length, number of layers, and number of heads. For very large models, the KV cache becomes a significant memory bottleneck during inference, limiting batch size and throughput.

MLA addresses this by projecting keys and values into a lower-dimensional latent space before computing attention. The latent representations are shared across heads, reducing the KV cache size substantially. During attention computation, the latent representations are projected back to the full dimensionality, preserving expressiveness while reducing memory overhead.

The practical effect is that DeepSeek-V3 can serve longer sequences and larger batches than a comparably sized model with standard attention, improving inference throughput at a given hardware budget.
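A rough size comparison shows why a shared latent helps. The sketch below counts cached elements per token for standard attention versus a single latent vector per layer. The dimensions (60 layers, 128 heads, head size 128, latent size 512) are illustrative assumptions rather than values taken from this post, and the count ignores any extra per-token state a real implementation might keep.

```python
def kv_cache_elements_mha(layers, heads, head_dim):
    # Keys and values are cached for every head in every layer.
    return 2 * layers * heads * head_dim

def kv_cache_elements_latent(layers, latent_dim):
    # One shared latent vector per layer replaces per-head keys and values.
    return layers * latent_dim

def cache_gib(elements_per_token, seq_len, bytes_per_element=2):
    # Total cache size for one sequence, in GiB (2 bytes = 16-bit storage).
    return elements_per_token * seq_len * bytes_per_element / 2**30

layers, heads, head_dim, latent_dim = 60, 128, 128, 512  # illustrative values
seq_len = 32768

mha = kv_cache_elements_mha(layers, heads, head_dim)
latent = kv_cache_elements_latent(layers, latent_dim)
reduction = mha / latent
mha_gib = cache_gib(mha, seq_len)
latent_gib = cache_gib(latent, seq_len)
print(f"standard: {mha_gib} GiB, latent: {latent_gib} GiB, reduction: {reduction}x")
```

Under these assumed dimensions the per-sequence cache drops from 120 GiB to under 2 GiB, which is the kind of headroom that allows longer sequences or larger batches on the same hardware.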

Auxiliary-Loss-Free Load Balancing

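The DeepSeekMoE architecture in V3 uses a routing mechanism that achieves balanced expert utilization without relying on an auxiliary loss. According to the technical report, each expert carries a bias term that is added to its routing score when the top experts are selected. During training the bias is lowered for experts that receive more than their share of tokens and raised for those that receive less, so balance comes from the routing design itself rather than from an additional training signal.

The significance is practical: the auxiliary loss weight, a hyperparameter that has been a persistent source of training instability in MoE models, no longer has to be tuned against the language modeling objective. The gradient is driven almost entirely by language quality, and balanced routing emerges from the routing mechanism's design. The report does retain a sequence-wise balance loss with a very small weight as a safeguard against extreme imbalance within a single sequence, and the bias update rate is a new setting to choose.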
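One way to balance routing without a loss term is to adjust a per-expert bias that influences expert selection only. The simulation below is a sketch of that idea under stated assumptions, not the report's implementation: the synthetic affinities, the skew toward two "popular" experts, the step size `gamma`, and all function names are hypothetical.

```python
import random

def route(affinities, bias, top_k):
    """Pick top_k experts by biased score; the bias affects selection only."""
    scores = [a + b for a, b in zip(affinities, bias)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

def run_step(rng, skew, bias, top_k, num_tokens):
    """Route one batch of synthetic tokens and return per-expert load counts."""
    load = [0] * len(skew)
    for _ in range(num_tokens):
        affinities = [s + rng.gauss(0.0, 1.0) for s in skew]
        for expert in route(affinities, bias, top_k):
            load[expert] += 1
    return load

def update_bias(bias, load, gamma):
    """Nudge overloaded experts down and all other experts up by gamma."""
    mean_load = sum(load) / len(load)
    return [b - gamma if l > mean_load else b + gamma for b, l in zip(bias, load)]

rng = random.Random(0)
skew = [2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # experts 0 and 1 start out popular
bias = [0.0] * len(skew)
top_k, num_tokens, gamma = 2, 256, 0.05

initial_load = run_step(rng, skew, bias, top_k, num_tokens)
for _ in range(100):
    load = run_step(rng, skew, bias, top_k, num_tokens)
    bias = update_bias(bias, load, gamma)
final_load = run_step(rng, skew, bias, top_k, num_tokens)
print("initial load:", initial_load)
print("final load:  ", final_load)
```

No gradient flows through the bias here. It is updated by a simple rule outside the loss, which is why this style of balancing does not compete with the language modeling objective.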

Multi-Token Prediction

Standard language model training predicts one token at a time: given the preceding tokens, predict the next one. Multi-token prediction extends this to predict several future tokens simultaneously. This provides a denser training signal per input sequence, potentially improving both training efficiency and the model's ability to plan ahead in generation.
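The denser training signal can be seen by counting targets. The sketch below builds (context, targets) pairs for a chosen prediction depth; `mtp_targets` is a hypothetical helper, and how the extra targets are actually predicted and weighted in DeepSeek-V3 is not covered here.

```python
def mtp_targets(tokens, depth):
    """For each position, pair the context so far with the next `depth` tokens."""
    examples = []
    for i in range(len(tokens) - depth):
        context = tokens[: i + 1]
        targets = tokens[i + 1 : i + 1 + depth]
        examples.append((context, targets))
    return examples

tokens = ["The", "model", "predicts", "several", "tokens"]
single = mtp_targets(tokens, 1)  # ordinary next-token prediction
double = mtp_targets(tokens, 2)  # predict the next two tokens at each position

single_targets = sum(len(t) for _, t in single)
double_targets = sum(len(t) for _, t in double)
print(single_targets, double_targets)
```

With depth 2, the same five-token sequence yields six supervised targets instead of four, at the price of slightly fewer usable positions near the end of the sequence.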

The combination of these three innovations—latent attention for efficient inference, loss-free load balancing for stable training, and multi-token prediction for dense training signal—constitutes DeepSeek-V3's architectural contribution.

Training Economics

The technical report states a total training cost of 2.788 million H800 GPU-hours. For context, this is a large but not extreme compute investment by current standards. At the rental price of $2 per GPU-hour that the report assumes, this comes to roughly $5.6 million, significantly less than the reported training costs for comparably performing dense models. The figure covers the final training run only and excludes earlier research and ablation experiments.

The MoE architecture is the primary driver of this cost efficiency. Because only 37 billion parameters are active per token, the per-token computational cost is dramatically lower than that of a 671B dense model. The total parameter count affects memory requirements (all 671B parameters must be stored across the cluster), but not the per-token FLOP count.
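The arithmetic behind these figures is short enough to check directly. In the sketch below, the price of $2 per GPU-hour is an assumed rental rate, and the "2 FLOPs per active parameter per token" figure is a common rule of thumb for a forward pass, not a number from the report.

```python
gpu_hours = 2.788e6          # reported H800 GPU-hours
price_per_gpu_hour = 2.00    # USD, assumed rental price
cost_usd = gpu_hours * price_per_gpu_hour

total_params = 671e9
active_params = 37e9

# Rule of thumb: a forward pass costs about 2 FLOPs per active parameter per token.
flops_per_token_moe = 2 * active_params
flops_per_token_dense = 2 * total_params
ratio = flops_per_token_dense / flops_per_token_moe  # about 18.1

print(f"estimated cost: ${cost_usd:,.0f}")
print(f"dense / MoE per-token FLOPs: {ratio:.1f}x")
```

Doubling the assumed price doubles the estimate, so the dollar figure is only as good as the rental rate, while the roughly 18x compute ratio depends only on the two parameter counts.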

This cost structure may lower the barrier to entry for training frontier models.

Critical Analysis: Claims and Evidence

Claim	Source	Verdict
671B total parameters with 37B activated per token	DeepSeek-AI (2024), technical report	✅ Reported architecture specification
Multi-head Latent Attention reduces KV cache overhead	DeepSeek-AI (2024), technical report	✅ Described mechanism; consistent with low-rank compression of keys and values
Auxiliary-loss-free load balancing achieves balanced routing	DeepSeek-AI (2024), technical report	✅ Reported; specific mechanism described in paper
Multi-token prediction improves training efficiency	DeepSeek-AI (2024), technical report	✅ Reported; consistent with prior multi-token prediction literature
Total training cost of 2.788M H800 GPU-hours	DeepSeek-AI (2024), technical report	✅ Reported; plausible given MoE architecture
DeepSeek-V3 achieves competitive performance with frontier models	Contextual claim	⚠️ Performance comparisons depend on benchmark selection

As with all self-reported benchmarks, independent evaluation on held-out benchmarks would strengthen the performance claims.

Open Questions

Expert specialization. In an MoE model with this many experts, what do individual experts learn? Do they specialize by language, topic, reasoning type, or some other dimension? Understanding expert specialization could inform both model design and interpretability.

Routing stability. Auxiliary-loss-free routing removes one instability source but may introduce others. How robust is the routing mechanism to distributional shift—does it maintain balance when the input distribution changes from pretraining to fine-tuning to deployment?

Fine-tuning dynamics. MoE models present unique challenges for fine-tuning. When a model is fine-tuned on a narrow domain, do some experts become underutilized? Does the routing mechanism adapt, or does the fine-tuned model effectively become a smaller model (using fewer experts)?

What This Means for Your Research

For researchers with limited compute budgets, DeepSeek-V3 demonstrates that MoE architectures substantially reduce the cost of training large models. The auxiliary-loss-free load balancing, in particular, removes a significant hyperparameter tuning burden.

For those working on inference optimization, the 671B/37B split presents both challenge and opportunity: the model is large to store but cheap to run per token, suggesting that memory-efficient serving strategies (offloading, quantization of inactive experts) could be particularly effective.
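A back-of-the-envelope calculation shows how much storage precision matters at this scale. The sketch below estimates weight storage only; it ignores the KV cache, activations, and any per-block metadata a real quantization scheme adds, and `weights_gib` is a hypothetical helper.

```python
def weights_gib(num_params, bits_per_param):
    """Approximate weight storage in GiB at a given precision."""
    return num_params * bits_per_param / 8 / 2**30

total_params = 671e9
active_params = 37e9

for bits in (16, 8, 4):
    stored = weights_gib(total_params, bits)
    touched = weights_gib(active_params, bits)
    print(f"{bits:>2}-bit: store {stored:,.0f} GiB, touch {touched:,.0f} GiB per token")
```

The gap between what must be stored and what is touched per token is what makes offloading or aggressively quantizing rarely used experts attractive.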

For the broader field, DeepSeek-V3's training economics suggest that the cost barrier to frontier model development may be lower than previously assumed, at least for organizations willing to adopt MoE architectures and the engineering complexity they entail.

References

[1] DeepSeek-AI. (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437.
