The AI Chip Trilemma: NVIDIA GPUs, Groq LPU, and Digital In-Memory Computing

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Every generation of AI hardware promises to solve the same three problems simultaneously: raw throughput, energy efficiency, and programmability. Every generation discovers that optimizing for two of these tends to compromise the third. As the AI accelerator market fragments into distinct architectural philosophies (GPU-centric scaling, deterministic dataflow, and compute-in-memory), the landscape in 2025 reveals not a single winner but an emerging trilemma that shapes what each architecture can and cannot do well.

The Research Landscape

GPU-Centric Scaling: The Incumbent Advantage

NVIDIA's dominance rests on a straightforward proposition: general-purpose GPU architectures, enhanced with tensor cores and high-bandwidth memory (HBM), can handle the widest range of AI workloads with a mature software ecosystem (CUDA, cuDNN, TensorRT). The upcoming Rubin architecture continues this trajectory with larger HBM stacks and faster interconnects.

Peng et al. (2023) provide a comparative evaluation of emerging AI accelerators, including IPUs, RDUs, and AMD/NVIDIA GPUs, across standard benchmarks. Their findings confirm that NVIDIA GPUs maintain strong performance across diverse workloads but reveal diminishing returns in energy efficiency as models scale. The NVIDIA advantage, they argue, is less about raw silicon and more about the compiler, library, and framework ecosystem that has been refined over more than a decade.

Seo et al. (2024) introduce IANUS, an integrated NPU-PIM (Processing-in-Memory) system that addresses a specific limitation of GPU architectures: the memory bandwidth bottleneck during LLM inference. Their design achieves a 6.2x energy efficiency improvement over GPU-only baselines for transformer inference by keeping data close to computation. This is the most empirically validated result among the papers surveyed here, and it supports the view that the memory wall is the binding constraint for inference workloads.
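The memory-wall argument can be made concrete with a roofline-style estimate. The sketch below is a back-of-envelope model, not the methodology of Seo et al.: it assumes a weight-streaming matrix multiply in which each weight is read once and used for one multiply-add per sequence in the batch, and it ignores activation and KV-cache traffic. The hardware figures (`PEAK_FLOPS`, `MEM_BANDWIDTH`) are hypothetical placeholders, not measurements from any cited paper.

```python
def arithmetic_intensity(batch_size: int, bytes_per_weight: float) -> float:
    """FLOPs performed per byte of weight traffic.

    Each weight is loaded once and used for one multiply-add (2 FLOPs)
    per sequence in the batch. Activation traffic is ignored.
    """
    return 2.0 * batch_size / bytes_per_weight


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Intensity (FLOP/byte) above which the chip is compute-bound."""
    return peak_flops / mem_bandwidth


def is_memory_bound(batch_size: int, bytes_per_weight: float,
                    peak_flops: float, mem_bandwidth: float) -> bool:
    """True when the workload sits left of the roofline ridge point."""
    return (arithmetic_intensity(batch_size, bytes_per_weight)
            < ridge_point(peak_flops, mem_bandwidth))


# Hypothetical accelerator (assumed values): 1e15 FLOP/s, 3e12 bytes/s.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 3e12

for batch in (1, 64, 512):
    ai = arithmetic_intensity(batch, bytes_per_weight=2.0)  # 16-bit weights
    memory_bound = is_memory_bound(batch, 2.0, PEAK_FLOPS, MEM_BANDWIDTH)
    label = "memory" if memory_bound else "compute"
    print(f"batch={batch:4d}  intensity={ai:6.1f} FLOP/B  -> {label}-bound")
```

Under these assumed numbers, small-batch decoding sits far below the ridge point, which is the regime where moving computation closer to memory would be expected to help most.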

Deterministic Dataflow: The Groq LPU Approach

Groq's Language Processing Unit (LPU) takes a fundamentally different approach: replace the GPU's flexible but unpredictable execution model with a deterministic, compiler-scheduled dataflow architecture. Instead of relying on caches and dynamic scheduling, the LPU uses a Tensor Streaming Processor (TSP) where every data movement is determined at compile time.
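To illustrate what "determined at compile time" means, here is a toy static scheduler. It is a minimal sketch and not Groq's compiler: it assumes every operation has a fixed, known latency and that functional units are unlimited, and the operation names and cycle counts are hypothetical. Given those assumptions, each operation receives a fixed start cycle before anything runs, so total latency is known in advance.

```python
from graphlib import TopologicalSorter


def static_schedule(latency: dict[str, int],
                    deps: dict[str, list[str]]) -> dict[str, int]:
    """Assign every op a fixed start cycle before execution begins.

    An op starts as soon as all of its predecessors have finished.
    Assumes known latencies and no contention for functional units.
    """
    start: dict[str, int] = {}
    # static_order() yields each op only after all of its predecessors.
    for op in TopologicalSorter(deps).static_order():
        start[op] = max(
            (start[d] + latency[d] for d in deps.get(op, [])),
            default=0,
        )
    return start


# Hypothetical ops and latencies (in cycles) for one small layer.
latency = {"load_w": 4, "load_x": 4, "matmul": 8, "add_bias": 1, "store": 2}
deps = {
    "load_w": [],
    "load_x": [],
    "matmul": ["load_w", "load_x"],
    "add_bias": ["matmul"],
    "store": ["add_bias"],
}

schedule = static_schedule(latency, deps)
total_cycles = max(schedule[op] + latency[op] for op in schedule)

for op, cycle in sorted(schedule.items(), key=lambda item: item[1]):
    print(f"cycle {cycle:3d}: {op}")
print(f"total latency: {total_cycles} cycles")
```

Because nothing is decided at run time, the same input always takes the same number of cycles. The cost is that any change to the operation graph requires producing a new schedule.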

Xie et al. (2024) investigate thermal management for the Groq LPU architecture, and their thermal analysis inadvertently reveals the architectural tradeoffs. The LPU achieves its latency advantages through a "functionally sliced" design where different chip regions handle different operations in a strict pipeline. This eliminates the scheduling overhead of GPUs but creates thermal hotspots that require advanced cooling solutions: a concrete example of how optimizing for one dimension (latency predictability) creates costs in another (thermal management complexity).

Lee et al. (2025) present RNGD, a 5nm tensor-contraction processor designed for energy-efficient LLM inference. While not the Groq LPU itself, RNGD shares the same architectural philosophy: fixed-function tensor operations with compiler-determined data movement. Their ISSCC results show competitive throughput per watt compared to GPU baselines, but the approach requires model-specific compiler optimization for each new variation in model architecture.

Digital In-Memory Computing: Breaking the Von Neumann Bottleneck

The third approach attacks the fundamental bottleneck differently: instead of moving data to computation (GPU) or scheduling data movement perfectly (LPU), compute-in-memory (CIM) performs operations where the data already resides.

Khwa et al. (2025), published in Nature, present a mixed-precision memristor and SRAM CIM processor that achieves notable energy efficiency for neural network inference. This represents a significant validation of the CIM approach in a high-impact venue. The key innovation is combining analog memristor arrays for low-precision operations with digital SRAM for high-precision operations, addressing the accuracy limitations that have historically plagued analog CIM designs.

Wu et al. (2024) demonstrate a floating-point 6T SRAM CIM macro that supports the precision requirements of advanced AI workloads. Their hybrid-domain structure, which combines time-domain and digital-domain computation, achieves energy efficiency that exceeds conventional digital accelerators while maintaining floating-point accuracy. This addresses a critical objection to CIM: that it only works for low-precision inference.

Mao et al. (2025) push SRAM-based CIM further with a 28nm accelerator achieving 135 TOPS/W through layer-wise precision and sparsity exploitation. Their approach dynamically adjusts computation precision per neural network layer, avoiding the one-size-fits-all limitation of fixed-precision CIM designs.
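The idea of combining layer-wise precision with sparsity can be sketched in a few lines. This is an illustrative toy and not the scheme used by Mao et al.: it assumes symmetric uniform quantization, a hand-picked bit width per layer, and a crude cost proxy (nonzero weights times bit width). Layer names, weights, and bit widths are hypothetical.

```python
def quantize(values: list[float], bits: int) -> list[int]:
    """Symmetric uniform quantization to signed integers of width `bits`."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    scale = peak / qmax
    return [round(v / scale) for v in values]


def bit_level_work(layers: dict[str, list[float]],
                   bits_per_layer: dict[str, int]) -> dict[str, int]:
    """Cost proxy per layer: (weights that stay nonzero) x (bit width).

    Weights that quantize to zero are counted as skipped work.
    """
    work = {}
    for name, weights in layers.items():
        bits = bits_per_layer[name]
        nonzero = sum(1 for q in quantize(weights, bits) if q != 0)
        work[name] = nonzero * bits
    return work


# Hypothetical two-layer model with four weights per layer.
layers = {
    "conv1": [0.9, -0.5, 0.02, 0.0],
    "fc": [0.3, -0.01, 0.0, 0.25],
}

mixed = bit_level_work(layers, {"conv1": 8, "fc": 4})
dense_8bit = sum(len(w) * 8 for w in layers.values())  # no skipping, 8 bits

print("per-layer work:", mixed)
print("total:", sum(mixed.values()), "vs dense 8-bit:", dense_8bit)
```

In this toy case the lower bit width on `fc` both shrinks each operation and rounds one small weight to zero, so the two effects compound. Whether accuracy survives such a choice depends on the model and is not captured by this proxy.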

Critical Analysis: Claims and Evidence

Claim | Evidence | Verdict
GPUs maintain broadest workload coverage | Peng et al. multi-accelerator benchmark | Supported: ecosystem advantage remains substantial
Memory bandwidth is the binding constraint for LLM inference | Seo et al. IANUS NPU-PIM results | Supported: 6.2x efficiency gain validates the bottleneck hypothesis
Deterministic dataflow achieves lower inference latency | Xie et al. thermal analysis of Groq LPU | Partially supported: latency advantage confirmed but thermal costs acknowledged
CIM achieves superior energy efficiency for inference | Khwa et al. Nature paper, Wu et al., Mao et al. | Supported for inference: training workloads remain largely unaddressed
Any single architecture dominates all metrics | Cross-paper comparison | Not supported: the trilemma persists

Open Questions and Future Directions

  • Training vs. inference divergence. CIM and LPU architectures show promise for inference but have not demonstrated viability for training workloads. Will the AI chip market bifurcate into training chips (GPUs) and inference chips (CIM/LPU)?
  • Software ecosystem lock-in. NVIDIA's CUDA ecosystem represents nearly two decades of investment by the research community. How much performance advantage do alternative architectures need to justify the switching cost?
  • Precision-efficiency Pareto frontier. CIM designs increasingly support mixed precision, but can they match GPU-class floating-point accuracy for fine-tuning and reinforcement learning workloads?
  • Chiplet and heterogeneous integration. Rather than a single winning architecture, the future may involve heterogeneous packages combining GPU cores, CIM arrays, and fixed-function accelerators. Seo et al.'s IANUS points in this direction.
  • Scaling economics. At datacenter scale, energy cost dominates. CIM's energy efficiency advantage could prove decisive even if per-chip performance is lower, but the manufacturing maturity gap with GPUs remains significant.
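The scaling-economics question above lends itself to a back-of-envelope calculation. The sketch below compares the yearly electricity bill of two fleets sized for the same throughput target. Every figure in it (throughput, power draw, electricity price) is a hypothetical placeholder rather than a measurement from any cited paper, and capital cost, cooling overhead, and utilization are deliberately left out.

```python
import math


def chips_needed(target_tokens_per_s: float, chip_tokens_per_s: float) -> int:
    """Number of chips required to reach a target aggregate throughput."""
    return math.ceil(target_tokens_per_s / chip_tokens_per_s)


def annual_energy_cost(power_watts: float, price_per_kwh: float,
                       pue: float = 1.0) -> float:
    """Yearly electricity cost for a load that runs continuously.

    `pue` (power usage effectiveness) scales the load for facility
    overhead such as cooling; 1.0 means no overhead.
    """
    hours_per_year = 24 * 365
    return power_watts / 1000.0 * hours_per_year * pue * price_per_kwh


def fleet_cost(target: float, chip_tokens_per_s: float,
               chip_watts: float, price_per_kwh: float) -> tuple[int, float]:
    """Return (chip count, yearly energy cost) for a fleet meeting `target`."""
    n = chips_needed(target, chip_tokens_per_s)
    return n, annual_energy_cost(n * chip_watts, price_per_kwh)


# Hypothetical scenario (all values assumed for illustration).
TARGET = 100_000  # tokens/s the service must sustain
PRICE = 0.10      # currency units per kWh

fast_n, fast_cost = fleet_cost(TARGET, chip_tokens_per_s=1000,
                               chip_watts=700, price_per_kwh=PRICE)
lean_n, lean_cost = fleet_cost(TARGET, chip_tokens_per_s=400,
                               chip_watts=100, price_per_kwh=PRICE)

print(f"fast chip: {fast_n} chips, {fast_cost:,.0f} per year")
print(f"lean chip: {lean_n} chips, {lean_cost:,.0f} per year")
```

In this made-up scenario the slower chip needs 2.5 times as many devices but draws roughly a third of the electricity. Purchase price and manufacturing maturity, which the model ignores, would push the comparison the other way.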
What This Means for Practitioners

The AI chip trilemma is not a problem to be solved but a tradeoff to be managed. For training workloads, GPU architectures remain the practical choice due to ecosystem maturity and floating-point precision. For inference at scale, CIM and dataflow architectures offer compelling energy efficiency but require workload-specific optimization. The most productive research direction may not be finding a single winner but developing heterogeneous systems that deploy the right architecture for each stage of the AI pipeline.

References

[1] Peng, H., Ding, C., & Geng, T. (2023). Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs. ACM International Conference on Supercomputing.
[2] Seo, M., Nguyen, X., & Hwang, S. (2024). IANUS: Integrated Accelerator based on NPU-PIM Unified Memory System. ASPLOS '24.
[3] Xie, F., Lyu, S., & Yang, Z. (2024). Direct-On-Chip Hotspot Targeted Microjet Cooling for Ultra-fast Inference at Scale Running on Groq Language Processing Unit. IEEE ITherm.
[4] Lee, S. M., Kim, H., & Yeon, J. (2025). RNGD: A 5nm Tensor-Contraction Processor for Power-Efficient Inference on Large Language Models. IEEE ISSCC.
[5] Khwa, W., Wen, T.-H., & Hsu, H.-H. (2025). A mixed-precision memristor and SRAM compute-in-memory AI processor. Nature.
[6] Wu, P., Su, J.-W., & Hong, L. (2024). A Floating-Point 6T SRAM In-Memory-Compute Macro Using Hybrid-Domain Structure for Advanced AI Edge Chips. IEEE JSSC.
[7] Mao, W., Liu, D., & Zhou, H. (2025). A 28-nm 135.19 TOPS/W Bootstrapped-SRAM Compute-in-Memory Accelerator With Layer-Wise Precision and Sparsity. IEEE TCAS-I.
