The Communication Wall: Why Scaling LLM Training Infrastructure Is Harder Than Adding More GPUs

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Training a frontier large language model requires thousands of GPUs working in concert. The naive expectation is that doubling the GPU count should halve the training time. In practice, the relationship is far less favorable. As clusters scale from hundreds to thousands and then tens of thousands of accelerators, communication overhead (the time spent synchronizing gradients, passing activations between pipeline stages, and moving data between devices) increasingly dominates the training loop. Model FLOPS utilization (MFU), the fraction of the hardware's theoretical peak FLOPS that goes to useful model computation, routinely falls well short of 100% at scale. This communication wall, not compute capacity, is the binding constraint on LLM training infrastructure in 2025.
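To make the MFU definition concrete, here is a minimal sketch of how it is often estimated. It assumes the common rule of thumb of roughly 6 FLOPs per parameter per token for the forward and backward pass of a dense transformer (attention FLOPs are ignored). The model size, throughput, GPU count, and per-GPU peak are hypothetical values chosen for illustration, not figures from any cited paper.

```python
def model_flops_utilization(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Estimate MFU for a dense transformer.

    Assumes ~6 FLOPs per parameter per token (forward + backward),
    which ignores attention FLOPs; all inputs are illustrative.
    """
    achieved = 6.0 * n_params * tokens_per_second   # useful model FLOP/s
    peak = n_gpus * peak_flops_per_gpu              # theoretical cluster FLOP/s
    return achieved / peak


# Hypothetical run: 70e9 parameters, 1e6 tokens/s, 1024 GPUs at 1e15 peak FLOP/s each
mfu = model_flops_utilization(70e9, 1.0e6, 1024, 1.0e15)
print(f"MFU = {mfu:.1%}")
```

Because the numerator counts only model FLOPs, any time a GPU spends waiting on communication lowers this ratio directly.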

The Research Landscape

Quantifying the Communication Bottleneck

Fernandez et al. (2024) provide the most systematic analysis of hardware scaling trends and diminishing returns in distributed training. Their study across multiple GPU cluster configurations demonstrates that scaling efficiency drops sharply beyond certain cluster sizes: scaling from hundreds to thousands of GPUs yields a sublinear speedup rather than one proportional to the added hardware. The culprit is collective communication (AllReduce for data parallelism, point-to-point transfers for pipeline parallelism), whose cost grows with participant count while computation per device remains fixed.
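The scaling argument can be illustrated with the textbook alpha-beta cost model for ring AllReduce. This is a rough sketch, not the methodology of Fernandez et al. The gradient size, link bandwidth, per-step latency, and compute time are hypothetical, and communication is assumed not to overlap with compute.

```python
def ring_allreduce_seconds(n, message_bytes, link_bytes_per_s, latency_s):
    """Alpha-beta cost of ring AllReduce: 2(n-1) steps, each moving message/n bytes."""
    if n == 1:
        return 0.0
    steps = 2 * (n - 1)
    return steps * (latency_s + (message_bytes / n) / link_bytes_per_s)


def scaling_efficiency(n, compute_s, message_bytes, link_bytes_per_s, latency_s):
    """Fraction of an iteration spent computing when communication is not overlapped."""
    comm = ring_allreduce_seconds(n, message_bytes, link_bytes_per_s, latency_s)
    return compute_s / (compute_s + comm)


# Fixed per-device compute (1 s) and a 10 GB gradient buffer, both hypothetical
efficiencies = [
    scaling_efficiency(n, compute_s=1.0, message_bytes=10e9,
                       link_bytes_per_s=50e9, latency_s=20e-6)
    for n in (8, 64, 512, 4096)
]
for n, eff in zip((8, 64, 512, 4096), efficiencies):
    print(n, round(eff, 3))
```

In this model the bandwidth term levels off near twice the message size divided by link bandwidth, while the latency term grows linearly with the number of participants, so efficiency keeps falling as the ring gets longer.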

Liang et al. (2024) provide a comprehensive survey of communication-efficient techniques for large-scale distributed deep learning. They categorize approaches into four families: gradient compression (reducing the data volume), communication scheduling (overlapping communication with computation), topology-aware algorithms (matching communication patterns to network structure), and decentralized protocols (eliminating central parameter servers). Their analysis suggests that no single technique is sufficient; practical systems combine multiple approaches.
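As one illustration of the gradient compression family, the sketch below shows top-k sparsification with error feedback, a widely used technique in this category. It is an example chosen here for illustration and is not taken from the survey; the gradient values are made up.

```python
def topk_compress(grad, residual, k):
    """Keep the k largest-magnitude entries of (grad + residual); carry the rest forward."""
    corrected = [g + r for g, r in zip(grad, residual)]
    keep = sorted(range(len(corrected)), key=lambda i: abs(corrected[i]), reverse=True)[:k]
    sparse = {i: corrected[i] for i in keep}  # index -> value pairs that would be sent
    # Entries that were not sent stay in the residual and are retried next step
    new_residual = [0.0 if i in sparse else v for i, v in enumerate(corrected)]
    return sparse, new_residual


grad = [0.9, -0.05, 0.02, -1.2, 0.1, 0.03]
sparse, residual = topk_compress(grad, [0.0] * len(grad), k=2)
print(sparse)    # entries to transmit
print(residual)  # error kept locally for the next step
```

Only two of six values cross the network here. The residual ensures that small gradient components are delayed rather than discarded.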

Cai et al. (2026), published in Tsinghua Science and Technology, survey efficient inference for edge LLMs, examining how communication constraints at the edge mirror and differ from datacenter training. While the inference setting is different, the fundamental bottleneck—data movement costs exceeding computation costs—is shared, suggesting that communication efficiency is a general challenge across the LLM lifecycle.

Architectural Solutions: Networks and Topologies

Meng et al. (2025) share operational experience from designing and deploying Astral, a datacenter infrastructure purpose-built for large-scale LLM training. Their work provides rare insight into production infrastructure decisions. Key findings include: (a) network congestion from AllReduce operations is the primary cause of training interruptions, (b) rail-optimized topologies reduce cross-rack traffic but create bandwidth bottlenecks at top-of-rack switches, and (c) failure recovery dominates operational cost: a single node failure in a 10,000-GPU training run can waste hours of work across all nodes.

Feng et al. (2025) propose RailX, a flexible network architecture for hyper-scale LLM training that addresses the cost and scalability limitations of tree-based topologies. Traditional rail-optimized networks scale poorly beyond a few thousand GPUs because the aggregation switches become bandwidth bottlenecks. RailX uses a hybrid direct-indirect topology that reduces the number of expensive high-radix switches while maintaining sufficient bisection bandwidth for collective operations.

TCCL by Kim et al. (2024) tackles a more specific but widely relevant problem: optimizing collective communication for PCIe-connected GPU clusters. While high-end training clusters use NVLink/NVSwitch, many organizations train on PCIe-based systems with substantially lower interconnect bandwidth. TCCL discovers better communication paths by profiling the actual PCIe topology—including NUMA effects and shared switches—rather than assuming an idealized fully-connected topology.
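To show why the measured topology matters, here is a toy sketch that picks a ring order by its slowest link. It is not TCCL's algorithm: it is a brute-force search over a hypothetical 4-GPU bandwidth matrix (in GB/s), where GPUs under the same PCIe switch see the highest bandwidth and cross-NUMA pairs the lowest.

```python
from itertools import permutations


def ring_bottleneck(order, bw):
    """Slowest link bandwidth along the ring visiting GPUs in `order`."""
    n = len(order)
    return min(bw[order[i]][order[(i + 1) % n]] for i in range(n))


def best_ring(bw):
    """Exhaustively search ring orders (GPU 0 fixed first) for the best bottleneck."""
    n = len(bw)
    candidates = ((0,) + p for p in permutations(range(1, n)))
    return max(candidates, key=lambda order: ring_bottleneck(order, bw))


# Hypothetical pairwise bandwidths: 24 = same switch, 12 = same NUMA node, 6 = cross-NUMA
bw = [
    [0, 24, 12, 6],
    [24, 0, 6, 12],
    [12, 6, 0, 24],
    [6, 12, 24, 0],
]
naive = (0, 1, 2, 3)
chosen = best_ring(bw)
print(naive, ring_bottleneck(naive, bw))
print(chosen, ring_bottleneck(chosen, bw))
```

The device-index order crosses the slow cross-NUMA link twice, whereas a ring that respects the measured topology avoids it entirely. Exhaustive search only works for a handful of GPUs; real systems need smarter path discovery.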

Software Approaches: Overlapping Communication and Computation

Wang et al. (2024) demonstrate that existing frameworks leave significant performance on the table by executing communication and computation sequentially. Their profiling reveals that collective communication occupies a substantial fraction of the training iteration time but that the GPU is often idle during communication phases. By co-executing micro-batches—processing one micro-batch's computation while communicating another's gradients—they achieve meaningful training speedup without any hardware changes.
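The benefit of co-execution can be sketched with a two-stream timeline model. This is a simplification under stated assumptions, not the authors' system: each micro-batch's communication can start only after its own compute finishes, communications run one at a time, and contention between compute and communication kernels is ignored. The timings are hypothetical.

```python
def iteration_time(compute, comm, overlap):
    """Time to process micro-batches given per-micro-batch compute and comm times.

    With overlap, micro-batch i's communication runs on a separate stream
    while micro-batch i+1 computes; communications still run one at a time.
    """
    if not overlap:
        return sum(compute) + sum(comm)
    compute_done = 0.0
    comm_done = 0.0
    for c, m in zip(compute, comm):
        compute_done += c                             # compute stream never waits
        comm_done = max(comm_done, compute_done) + m  # comm waits for its data
    return comm_done


compute = [10.0, 10.0, 10.0, 10.0]  # ms per micro-batch (hypothetical)
comm = [6.0, 6.0, 6.0, 6.0]         # ms per micro-batch (hypothetical)
sequential = iteration_time(compute, comm, overlap=False)
overlapped = iteration_time(compute, comm, overlap=True)
print(sequential, overlapped)
```

When each communication is shorter than the next micro-batch's compute, only the final transfer remains exposed. If communication is longer than compute, the communication stream becomes the bottleneck and overlap helps less.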

The Domino paper (Wang, Zhang, Shen et al., 2024) takes the overlap idea further by decomposing tensor operations into slices that can be communicated as soon as they are computed, rather than waiting for an entire layer's computation to complete. Domino demonstrates near-complete elimination of exposed communication time for data-parallel training, and reports markedly higher MFU than standard sequential execution on the same configurations.

Sun et al. (2024) present CO2, a system that achieves full communication-computation overlap through careful scheduling. Their approach is significant for geo-distributed settings—training across data centers connected by wide-area networks—where communication latency is orders of magnitude higher than intra-datacenter networks. CO2 achieves competitive training throughput even when inter-datacenter bandwidth is 10-100x lower than intra-datacenter bandwidth.
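A back-of-the-envelope calculation shows what full overlap demands on slow links. The sketch assumes workers synchronize periodically and asks how many local steps of compute are needed to cover one synchronization. This framing, the 2x traffic factor, and all numbers are assumptions made here for illustration; the post does not describe CO2's actual scheduler.

```python
import math


def min_local_steps(param_bytes, link_bytes_per_s, step_seconds):
    """Smallest number of local steps whose compute time covers one synchronization.

    Assumes a synchronization moves roughly 2x the parameter bytes
    (the bandwidth term of a large ring AllReduce) and ignores latency.
    """
    sync_seconds = 2.0 * param_bytes / link_bytes_per_s
    return max(1, math.ceil(sync_seconds / step_seconds))


# Hypothetical: 14 GB of parameters, 0.5 s of compute per step
intra = min_local_steps(14e9, 50e9, 0.5)   # fast intra-datacenter link
inter = min_local_steps(14e9, 0.5e9, 0.5)  # link with 100x lower bandwidth
print(intra, inter)
```

With 100x less bandwidth, the synchronization window grows from a couple of steps to over a hundred, which is why geo-distributed training depends on tolerating delayed synchronization.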

Critical Analysis: Claims and Evidence

Claim	Evidence	Verdict
Communication overhead causes MFU to fall well below peak at scale	Fernandez et al.; Wang et al. (profiling data)	Supported: consistent finding across multiple studies
Micro-batch co-execution recovers substantial training throughput	Wang et al. (co-execution study)	Supported, but gains are workload-dependent
Near-complete communication hiding is achievable	Domino (MFU well above conventional levels)	Supported for data parallelism; pipeline parallelism is harder
Network topology is a primary design constraint	Meng et al. (Astral); Feng et al. (RailX)	Supported: production experience confirms this
PCIe clusters can match NVLink performance with software optimization	TCCL	Partially supported: gap narrows but NVLink retains advantage
Geo-distributed training is viable	CO2	Supported, with appropriate overlap scheduling

Open Questions and Future Directions

The failure recovery problem. Meng et al. identify failure recovery as the dominant operational cost. A single GPU failure in a 10,000-GPU run can waste hours of synchronized work. Checkpoint-based recovery helps but introduces its own overhead. Elastic training—where the system continues with fewer GPUs—remains an active research area.
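The trade-off between checkpoint overhead and lost work can be estimated with Young's classic first-order approximation for the optimal checkpoint interval, sqrt(2 x checkpoint cost x mean time between failures). The sketch below is not from Meng et al. It assumes independent device failures, any of which stops the job, and uses a hypothetical per-GPU failure rate and checkpoint cost.

```python
import math


def cluster_mtbf(device_mtbf_seconds, n_devices):
    """MTBF of the whole job if device failures are independent and any one stops it."""
    return device_mtbf_seconds / n_devices


def young_interval(checkpoint_seconds, mtbf_seconds):
    """Young's first-order approximation of the optimal time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_seconds * mtbf_seconds)


# Hypothetical: each GPU fails once per 50,000 hours; a checkpoint takes 60 s
mtbf = cluster_mtbf(50_000 * 3600, 10_000)
interval = young_interval(60.0, mtbf)
print(f"job-level MTBF: {mtbf / 3600:.1f} h, checkpoint every {interval / 60:.1f} min")
```

Even with very reliable individual devices, a 10,000-GPU job in this example fails every few hours, so checkpoints must be frequent and cheap to keep wasted work small.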

Heterogeneous clusters. Tang et al. (2025) explore training on hyper-heterogeneous clusters with chips from multiple vendors. As organizations piece together GPU allocations from different generations and manufacturers, heterogeneity-aware scheduling becomes essential but is poorly supported by current frameworks.

Communication-computation co-design. Current approaches treat the network and the compute as separate systems to be optimized independently. Co-designing the network topology, collective algorithms, and parallelization strategy jointly could yield better solutions.

Energy proportionality. At 10,000+ GPU scale, the energy consumed by network switches, memory, and cooling approaches the energy consumed by the GPUs themselves. Communication efficiency improvements that reduce total energy consumption may matter more than raw training speed.

Optical interconnects. Current GPU clusters rely on electrically switched interconnects (NVLink, InfiniBand, Ethernet), even where optical cables carry signals between switches. Optical interconnects promise higher bandwidth at lower power but require new switch architectures and communication protocols. The transition timeline remains uncertain.

What This Means for ML Engineers

For teams training large models, the practical takeaway is that infrastructure design choices—network topology, collective communication library, and parallelization strategy—matter as much as the model architecture and training algorithm. Investing in communication profiling (tools like NCCL's built-in profiler, or frameworks like Domino's analysis pipeline) before scaling up can prevent expensive under-utilization. The era of "just add more GPUs" is over; communication-aware training design is now a core competency.
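A simple first profiling step is to time a large AllReduce and convert the result into bandwidth figures that can be compared against the hardware's rated link speed. The sketch below applies the 2(n-1)/n bus-bandwidth correction used by the nccl-tests benchmark suite. The measurement itself (message size, elapsed time, rank count) is a hypothetical input that would come from your own benchmark run.

```python
def allreduce_bandwidth(message_bytes, seconds, n_ranks):
    """Convert a timed AllReduce into algorithm and bus bandwidth (bytes/s).

    Bus bandwidth applies the 2(n-1)/n factor so that results are
    comparable across different rank counts.
    """
    algbw = message_bytes / seconds
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Hypothetical measurement: a 1 GiB AllReduce across 8 ranks took 25 ms
algbw, busbw = allreduce_bandwidth(2**30, 0.025, 8)
print(f"algbw {algbw / 1e9:.1f} GB/s, busbw {busbw / 1e9:.1f} GB/s")
```

If the resulting bus bandwidth sits far below the interconnect's rated speed, the topology or collective configuration is worth investigating before adding more GPUs.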



References

[1] Fernandez, J., Wehrstedt, L., & Shamis, L. (2024). Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training. arXiv preprint.

[2] Liang, F., Zhang, Z., & Lu, H. (2024). Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey. arXiv preprint.

[3] Cai, G., Tian, R., & Yang, L. (2026). Efficient Inference for Edge Large Language Models: A Survey. Tsinghua Science and Technology.

[4] Meng, Q., Zheng, H., & Zhang, Z. (2025). Astral: A Datacenter Infrastructure for Large Language Model Training at Scale. ACM EuroSys.

[5] Feng, Y., Chen, T., & Wei, Y. (2025). RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems. arXiv preprint.

[6] Kim, H., Ryu, J., & Lee, J. (2024). TCCL: Discovering Better Communication Paths for PCIe GPU Clusters. ASPLOS '24.

[7] Wang, G., Zhang, C., & Shen, Z. (2024). Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping. arXiv preprint.

[8] Sun, W., Qin, Z., & Sun, W. (2024). CO2: Efficient Distributed Training with Full Communication-Computation Overlap. arXiv preprint.

[9] Li, Z., Xu, L., Huang, Z., Qian, S., Bu, H., Yang, M., et al. (2025). CTCCL: Cost-Efficient Joint Device-Network Load Balancing for LLM Training in RoCE-based Intelligent Computing Network. Proceedings of the 39th ACM International Conference on Supercomputing, 355-367.
