The AI Chip Trilemma: NVIDIA GPUs, Groq LPU, and Digital In-Memory Computing

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Every generation of AI hardware promises to solve the same three problems simultaneously: raw throughput, energy efficiency, and programmability. Every generation discovers that optimizing for two of these tends to compromise the third. As the AI accelerator market fragments into distinct architectural philosophies—GPU-centric scaling, deterministic dataflow, and compute-in-memory—the landscape in 2025 reveals not a single winner but an emerging trilemma that shapes what each architecture can and cannot do well.

The Research Landscape

GPU-Centric Scaling: The Incumbent Advantage

NVIDIA's dominance rests on a straightforward proposition: general-purpose GPU architectures, enhanced with tensor cores and high-bandwidth memory (HBM), can handle the widest range of AI workloads with a mature software ecosystem (CUDA, cuDNN, TensorRT). The upcoming Rubin architecture continues this trajectory with larger HBM stacks and faster interconnects.

Peng et al. (2023) provide a comparative evaluation of emerging AI accelerators—including IPUs, RDUs, and AMD/NVIDIA GPUs—across standard benchmarks. Their findings confirm that NVIDIA GPUs maintain strong performance across diverse workloads but reveal diminishing returns in energy efficiency as models scale. The NVIDIA advantage, they argue, is less about raw silicon and more about the compiler, library, and framework ecosystem that has been refined over more than a decade.

Seo et al. (2024) introduce IANUS, an integrated NPU-PIM (Processing-in-Memory) system that addresses a specific limitation of GPU architectures: the memory bandwidth bottleneck during LLM inference. Their design achieves a 6.2x energy efficiency improvement over GPU-only baselines for transformer inference by keeping data close to computation. This work is the most empirically validated result among the papers surveyed here, and it supports the view that the memory wall is the binding constraint for inference workloads.
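The memory-wall argument can be made concrete with a roofline-style estimate. The sketch below is a minimal illustration, not a model of IANUS or of any real GPU: the peak compute and bandwidth figures are hypothetical placeholders, and the function names are invented for this example. It treats a single-token decode step as a matrix-vector product in which every weight is read from memory once and used once.

```python
def arithmetic_intensity_matvec(rows: int, cols: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte for y = W @ x when W is streamed from memory once."""
    flops = 2 * rows * cols                       # one multiply and one add per weight
    bytes_moved = rows * cols * bytes_per_weight  # weight traffic only; vectors are ignored
    return flops / bytes_moved


def attainable_flops(intensity: float, peak_flops: float, mem_bandwidth: float) -> float:
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * mem_bandwidth)


# Hypothetical accelerator parameters, NOT taken from any datasheet or paper.
PEAK_FLOPS = 300e12    # 300 TFLOP/s of 16-bit compute
MEM_BANDWIDTH = 2e12   # 2 TB/s of off-chip memory bandwidth

intensity = arithmetic_intensity_matvec(4096, 4096)  # 16-bit weights
ridge_point = PEAK_FLOPS / MEM_BANDWIDTH             # intensity needed to become compute-bound
achieved = attainable_flops(intensity, PEAK_FLOPS, MEM_BANDWIDTH)

print(f"arithmetic intensity: {intensity:.1f} FLOP/byte")
print(f"ridge point:          {ridge_point:.1f} FLOP/byte")
print(f"compute utilization:  {achieved / PEAK_FLOPS:.2%}")
```

Under these assumed numbers the matrix-vector product sits far below the ridge point, so most of the compute units would idle while waiting on memory. That gap is the one that processing-in-memory designs aim to close by moving the multiply-accumulate next to the stored weights.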

Deterministic Dataflow: The Groq LPU Approach

Groq's Language Processing Unit (LPU) takes a fundamentally different approach: replace the GPU's flexible but unpredictable execution model with a deterministic, compiler-scheduled dataflow architecture. Instead of relying on caches and dynamic scheduling, the LPU uses a Tensor Streaming Processor (TSP) where every data movement is determined at compile time.
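To illustrate what "determined at compile time" means in practice, the following is a minimal sketch of static scheduling, assuming every operation has a fixed, known latency and runs on a named functional unit. The `Op` type, the unit names, and the cycle counts are hypothetical and are not Groq's compiler or instruction set. The point is only that start cycles can be computed before execution, so the hardware needs no runtime arbitration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: str          # functional unit that executes the op (hypothetical names)
    latency: int       # fixed cycle count, assumed known at compile time
    deps: tuple = ()   # names of ops whose results this op consumes


def compile_schedule(ops):
    """Assign a start cycle to each op. Ops must be listed in dependency order."""
    start, finish, unit_free = {}, {}, {}
    for op in ops:
        # An op can begin once its inputs exist and its unit is idle.
        ready = max((finish[d] for d in op.deps), default=0)
        begin = max(ready, unit_free.get(op.unit, 0))
        start[op.name] = begin
        finish[op.name] = begin + op.latency
        unit_free[op.unit] = finish[op.name]
    return start


# Hypothetical program for y = W @ x + b with made-up latencies.
program = [
    Op("load_w", "mem", 4),
    Op("load_x", "mem", 2),
    Op("matmul", "mxm", 8, ("load_w", "load_x")),
    Op("bias_add", "vxm", 1, ("matmul",)),
    Op("store_y", "mem", 2, ("bias_add",)),
]

schedule = compile_schedule(program)
for name, cycle in schedule.items():
    print(f"cycle {cycle:2d}: {name}")
```

Because nothing in the schedule depends on runtime state, running the compiler twice yields the same cycle assignments, which is the source of the latency predictability described above. The cost is that any change to the program or its shapes requires recompilation.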

Xie et al. (2024) investigate thermal management for the Groq LPU architecture, and their thermal analysis inadvertently reveals the architectural tradeoffs. The LPU achieves its latency advantages through a "functionally sliced" design where different chip regions handle different operations in a strict pipeline. This eliminates the scheduling overhead of GPUs but creates thermal hotspots that require advanced cooling solutions—a concrete example of how optimizing for one dimension (latency predictability) creates costs in another (thermal management complexity).

Lee et al. (2025) present RNGD, a 5nm tensor-contraction processor designed for energy-efficient LLM inference. While not the Groq LPU itself, RNGD shares the same architectural philosophy: fixed-function tensor operations with compiler-determined data movement. Their ISSCC results show competitive throughput per watt compared to GPU baselines, but the approach requires model-specific compiler optimization for each new model architecture variant.

Digital In-Memory Computing: Breaking the Von Neumann Bottleneck

The third approach attacks the fundamental bottleneck differently: instead of moving data to computation (GPU) or scheduling data movement perfectly (LPU), compute-in-memory (CIM) performs operations where the data already resides.

Khwa et al. (2025), published in Nature, present a mixed-precision memristor and SRAM CIM processor that achieves notable energy efficiency for neural network inference. Publication in a high-impact venue represents a significant validation of the CIM approach. The key innovation is combining analog memristor arrays for low-precision operations with digital SRAM for high-precision operations, addressing the accuracy limitations that have historically plagued analog CIM designs.

Wu et al. (2024) demonstrate a floating-point 6T SRAM CIM macro that supports the precision requirements of advanced AI workloads. Their hybrid-domain structure—combining time-domain and digital-domain computation—achieves energy efficiency that exceeds conventional digital accelerators while maintaining floating-point accuracy. This addresses a critical objection to CIM: that it only works for low-precision inference.

Mao et al. (2025) push SRAM-based CIM further with a 28nm accelerator achieving 135 TOPS/W through layer-wise precision and sparsity exploitation. Their approach dynamically adjusts computation precision per neural network layer, avoiding the one-size-fits-all limitation of fixed-precision CIM designs.
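The idea of layer-wise precision combined with sparsity exploitation can be sketched in software, with the caveat that this is only a functional illustration and not the circuit or algorithm from Mao et al. The bit widths, the example values, and the function names below are assumptions made for this example. Each layer picks its own integer width, and multiply-accumulates with a zero operand are skipped.

```python
def quantize(values, bits):
    """Symmetric uniform quantization to signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # fall back to 1.0 for all-zero input
    return [round(v / scale) for v in values], scale


def sparse_dot(q_weights, q_inputs):
    """Integer dot product that skips multiply-accumulates with a zero operand."""
    acc, macs = 0, 0
    for w, x in zip(q_weights, q_inputs):
        if w == 0 or x == 0:
            continue  # zero operand: no work is issued
        acc += w * x
        macs += 1
    return acc, macs


def layer_dot(weights, inputs, bits):
    """Quantize both operands at the layer's bit width, then rescale the result."""
    q_w, s_w = quantize(weights, bits)
    q_x, s_x = quantize(inputs, bits)
    acc, macs = sparse_dot(q_w, q_x)
    return acc * s_w * s_x, macs


# Hypothetical values and per-layer bit widths, chosen only for illustration.
weights = [0.5, -0.25, 0.0, 1.0]
inputs = [0.9, 0.0, 2.0, -1.1]
exact = sum(w * x for w, x in zip(weights, inputs))

for layer, bits in {"sensitive_layer": 8, "tolerant_layer": 4}.items():
    approx, macs = layer_dot(weights, inputs, bits)
    print(f"{layer}: {bits}-bit, {macs}/{len(weights)} MACs, error {abs(approx - exact):.4f}")
```

In this toy case half of the multiply-accumulates are skipped because one operand is zero. The design choice the sketch points at is that a layer tolerant of quantization error can run at a narrower width, while a sensitive layer keeps more bits.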

Critical Analysis: Claims and Evidence

Claim	Evidence	Verdict
GPUs maintain broadest workload coverage	Peng et al. multi-accelerator benchmark	Supported — ecosystem advantage remains substantial
Memory bandwidth is the binding constraint for LLM inference	Seo et al. IANUS NPU-PIM results	Supported — 6.2x efficiency gain validates the bottleneck hypothesis
Deterministic dataflow achieves lower inference latency	Xie et al. thermal analysis of Groq LPU	Partially supported — latency advantage confirmed but thermal costs acknowledged
CIM achieves superior energy efficiency for inference	Khwa et al. Nature paper, Wu et al., Mao et al.	Supported for inference — training workloads remain largely unaddressed
Any single architecture dominates all metrics	Cross-paper comparison	Not supported — the trilemma persists

Open Questions and Future Directions

Training vs. inference divergence. CIM and LPU architectures show promise for inference but have not demonstrated viability for training workloads. Will the AI chip market bifurcate into training chips (GPUs) and inference chips (CIM/LPU)?

Software ecosystem lock-in. NVIDIA's CUDA ecosystem, first released in 2007, represents well over a decade of investment by the research community. How much performance advantage do alternative architectures need to justify the switching cost?

Precision-efficiency Pareto frontier. CIM designs increasingly support mixed precision, but can they match GPU-class floating-point accuracy for fine-tuning and reinforcement learning workloads?

Chiplet and heterogeneous integration. Rather than a single winning architecture, the future may involve heterogeneous packages combining GPU cores, CIM arrays, and fixed-function accelerators. Seo et al.'s IANUS points in this direction.

Scaling economics. At datacenter scale, energy cost dominates. CIM's energy efficiency advantage could prove decisive even if per-chip performance is lower, but the manufacturing maturity gap with GPUs remains significant.
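The scaling-economics question comes down to simple arithmetic once efficiency is expressed in TOPS/W, since 1 TOPS/W equals 10^12 operations per joule. The sketch below is a back-of-the-envelope calculation under stated assumptions: the operation count per inference, the request rate, and both efficiency figures are hypothetical placeholders, not measurements from any paper cited here, and it ignores cooling, memory, and host overheads.

```python
def joules_per_inference(ops_per_inference: float, tops_per_watt: float) -> float:
    """Energy per inference, using 1 TOPS/W = 1e12 operations per joule."""
    return ops_per_inference / (tops_per_watt * 1e12)


def annual_energy_kwh(joules_each: float, inferences_per_second: float) -> float:
    """Yearly energy at a constant request rate (1 kWh = 3.6e6 J)."""
    seconds_per_year = 365 * 24 * 3600
    return joules_each * inferences_per_second * seconds_per_year / 3.6e6


# Hypothetical workload and efficiency figures, for illustration only.
OPS_PER_INFERENCE = 2e12
REQUESTS_PER_SECOND = 1000
efficiencies = {"baseline": 2.0, "efficient": 20.0}  # system-level TOPS/W, assumed

for label, tops_w in efficiencies.items():
    joules = joules_per_inference(OPS_PER_INFERENCE, tops_w)
    kwh = annual_energy_kwh(joules, REQUESTS_PER_SECOND)
    print(f"{label}: {joules:.2f} J per inference, {kwh:.0f} kWh per year")
```

A caution when applying this to published numbers: macro-level peak figures such as TOPS/W for a single CIM array are not directly comparable to system-level efficiency, so the inputs should be measured at the same level of the stack.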

What This Means for Practitioners

The AI chip trilemma is not a problem to be solved but a tradeoff to be managed. For training workloads, GPU architectures remain the practical choice due to ecosystem maturity and floating-point precision. For inference at scale, CIM and dataflow architectures offer compelling energy efficiency but require workload-specific optimization. The most productive research direction may not be finding a single winner but developing heterogeneous systems that deploy the right architecture for each stage of the AI pipeline.

References

[1] Peng, H., Ding, C., & Geng, T. (2023). Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs. ACM International Conference on Supercomputing.

[2] Seo, M., Nguyen, X., & Hwang, S. (2024). IANUS: Integrated Accelerator based on NPU-PIM Unified Memory System. ASPLOS '24.

[3] Xie, F., Lyu, S., & Yang, Z. (2024). Direct-On-Chip Hotspot Targeted Microjet Cooling for Ultra-fast Inference at Scale Running on Groq Language Processing Unit. IEEE ITherm.

[4] Lee, S. M., Kim, H., & Yeon, J. (2025). RNGD: A 5nm Tensor-Contraction Processor for Power-Efficient Inference on Large Language Models. IEEE ISSCC.

[5] Khwa, W., Wen, T.-H., & Hsu, H.-H. (2025). A mixed-precision memristor and SRAM compute-in-memory AI processor. Nature.

[6] Wu, P., Su, J.-W., & Hong, L. (2024). A Floating-Point 6T SRAM In-Memory-Compute Macro Using Hybrid-Domain Structure for Advanced AI Edge Chips. IEEE JSSC.

[7] Mao, W., Liu, D., & Zhou, H. (2025). A 28-nm 135.19 TOPS/W Bootstrapped-SRAM Compute-in-Memory Accelerator With Layer-Wise Precision and Sparsity. IEEE TCAS-I.
