The cloud dependency of large language models is not merely a technical inconvenience—it is a structural constraint that limits who can use AI, where they can use it, and what data they must surrender to do so. Every query to GPT-4 or Claude traverses a network to a data center, incurring latency, requiring connectivity, and exposing potentially sensitive inputs to third-party infrastructure. For a physician in a rural clinic, a soldier in a disconnected environment, or a user who simply values privacy, this architecture is a barrier, not a feature.
The race to run LLMs on edge devices—phones, laptops, IoT hardware—is therefore not an academic exercise in compression. It is a contest to determine whether AI remains a centralized service controlled by a few providers, or becomes a distributed capability available to everyone.
The Compression Trilemma
Every approach to on-device LLM deployment confronts a fundamental trilemma: model quality, memory footprint, and inference speed. Improving any two typically degrades the third. The art of edge deployment lies in finding the least painful compromise.
Liu et al.'s comprehensive survey taxonomizes the landscape into four pillars:
Pruning removes parameters deemed unnecessary—zeroing out weights below a threshold or eliminating entire attention heads. Unstructured pruning achieves high compression ratios but creates sparse matrices that standard hardware accelerates poorly. Structured pruning maintains hardware-friendly dense computation but removes less redundancy.
Knowledge distillation trains a smaller "student" model to mimic a larger "teacher." The student inherits the teacher's behavior without its parameter count. Distillation works well but requires access to the teacher model's outputs during training—a constraint that may be prohibitive for proprietary models.
Quantization reduces the numerical precision of weights and activations—from 16-bit floating point to 8-bit, 4-bit, or even lower. This is the dominant approach in 2025 because it offers the most favorable quality-compression trade-off and is well-supported by hardware.
Architectural redesign rethinks the model structure itself—replacing attention mechanisms, reducing layer counts, or introducing sparse mixture-of-experts routing. This is the most radical approach and potentially the most impactful, but it requires retraining from scratch.
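Quantization is the simplest of the four pillars to see in code. The sketch below is a generic illustration of my own, not drawn from any particular paper in the survey: unstructured magnitude pruning plus symmetric per-row 4-bit weight quantization in NumPy. Production toolchains add calibration data, group-wise scales, and packed low-bit storage.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Unstructured pruning: zero out the smallest-magnitude weights."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w).astype(w.dtype)

def quantize_int4(w: np.ndarray):
    """Symmetric per-row 4-bit quantization: each row maps to integers in [-8, 7]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(8, 16).astype(np.float32)
w_pruned = magnitude_prune(w)
q, scale = quantize_int4(w_pruned)
print("quantization error:", np.abs(w_pruned - dequantize(q, scale)).max())
```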
The KV Cache Bottleneck
A subtlety lost in popular discourse is that the primary memory bottleneck during LLM inference is often not the model weights but the key-value (KV) cache—the stored attention states that enable efficient autoregressive generation. For long-context models processing thousands of tokens, the KV cache can consume more memory than the model itself.
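The arithmetic behind that claim is easy to check. The snippet below estimates cache size for an assumed 7B-class configuration (32 layers, 32 KV heads, head dimension 128, FP16 cache); the figures are illustrative, not taken from any of the papers discussed here.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """One key and one value vector per layer per token, at the given precision."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed 7B-class configuration: 32 layers, 32 KV heads, head_dim 128, FP16 cache.
for seq_len in (4_096, 32_768, 128_000):
    gb = kv_cache_bytes(32, 32, 128, seq_len) / 1e9
    print(f"{seq_len:>7} tokens -> {gb:5.1f} GB of KV cache")
```

Under these assumptions the cache passes roughly 17 GB at 32K tokens, more than the ~14 GB the FP16 weights of a 7B-parameter model occupy, which is exactly the regime described above.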
Yao et al.'s VecInfer tackles this directly. Their insight: standard element-wise quantization of KV cache entries suffers from outlier sensitivity—a few extreme values in each cache entry distort the quantization range, degrading quality for all other values. VecInfer suppresses these outliers before applying vector quantization, treating groups of cache values as atomic units rather than independent scalars.
The practical impact: low-bit KV cache quantization with minimal quality degradation, achieving substantial reductions in both memory footprint and end-to-end inference latency on long-context workloads.
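To make the two ingredients concrete, here is a toy version of the idea: clip extreme outliers, then replace groups of cache values with indices into a small learned codebook. This is my own sketch, not VecInfer's algorithm; percentile clipping and vanilla k-means stand in for the paper's outlier suppression and codebook construction.

```python
import numpy as np

def build_codebook(vecs: np.ndarray, n_codes: int = 16, iters: int = 10) -> np.ndarray:
    """Plain k-means over sub-vectors: the codebook's rows are the centroids."""
    rng = np.random.default_rng(0)
    codebook = vecs[rng.choice(len(vecs), n_codes, replace=False)].copy()
    for _ in range(iters):
        dists = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for k in range(n_codes):
            members = vecs[assign == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def vq_encode_kv(kv: np.ndarray, group: int = 4, clip_pct: float = 99.9):
    """Suppress outliers by clipping, then vector-quantize groups of cache values."""
    bound = np.percentile(np.abs(kv), clip_pct)            # crude stand-in for outlier suppression
    vecs = np.clip(kv, -bound, bound).reshape(-1, group)   # groups of values as atomic units
    codebook = build_codebook(vecs)
    dists = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1).astype(np.uint8), codebook

kv = np.random.randn(64, 128).astype(np.float32)           # fake cache block: 64 tokens x 128 dims
codes, codebook = vq_encode_kv(kv)
approx = codebook[codes].reshape(kv.shape)
print("bits per cached value:", 4 / 4, "| mean abs error:", np.abs(kv - approx).mean())
# 16 codes -> a 4-bit index per group of 4 values, i.e. ~1 bit/value plus codebook overhead
```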
VQ-LLM (Liu et al.) complements this with a hardware perspective. Vector quantization introduces lookup-table operations that standard GPU kernels handle inefficiently. Their high-performance code generation framework produces custom kernels optimized for VQ operations, achieving throughput improvements that make the theoretical memory savings of VQ practically realizable.
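A one-line NumPy version of that lookup shows why it is awkward for hardware: dequantization is a gather from a codebook rather than a fused multiply-add, so memory access patterns rather than arithmetic tend to set the speed limit. The sizes below are arbitrary assumptions for illustration.

```python
import numpy as np

# Decoding VQ-compressed weights or cache is a table lookup: every stored index
# is replaced by its codebook entry. This gather dominates the inner loop, which
# is why generic GPU kernels handle it inefficiently.
codebook = np.random.randn(256, 8).astype(np.float16)   # 256 codes of 8 values each (assumed sizes)
codes = np.random.randint(0, 256, size=4096, dtype=np.uint8)
dequantized = codebook[codes]                            # the lookup-table operation itself
print(dequantized.shape)                                 # (4096, 8): 32768 values from 4096 one-byte indices
```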
Beyond Lossy: Lossless Compression for LLMs
Yubeaton et al.'s Huff-LLM challenges a widespread assumption: that LLM compression must be lossy. Their approach applies Huffman coding—a classical lossless compression algorithm—directly to FP16/BF16 model weights as an alternative to lossy techniques like quantization and pruning, achieving compression without any quality degradation.
The key observation is that LLM weight distributions are highly non-uniform. Certain weight values appear far more frequently than others. Huffman coding exploits this statistical redundancy, assigning shorter bit sequences to common values and longer sequences to rare ones.
The result: a meaningful reduction in on-chip memory capacity and bandwidth requirements with zero quality loss. This lossless approach offers a complementary path to the lossy methods that dominate current practice, preserving full model fidelity while still shrinking the deployed model.
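The principle fits in a few lines. The sketch below illustrates Huffman coding applied to weight statistics; it is not Huff-LLM's actual scheme, which is engineered for efficient hardware decoding. It codes only the most skewed byte of each weight and reports the average code length.

```python
import heapq
from collections import Counter

import numpy as np

def huffman_code_lengths(symbols) -> dict:
    """Build a Huffman code over a symbol stream; return code length in bits per symbol."""
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)
        c2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()} | {s: "1" + c for s, c in t2.items()}
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return {s: len(code) for s, code in heap[0][2].items()}

# Illustration only: code the top byte of each FP32 weight (sign plus most of the
# exponent), which is highly non-uniform for roughly Gaussian weight distributions.
weights = (np.random.randn(1_000_000) * 0.02).astype(np.float32)
top_bytes = (weights.view(np.uint32) >> 24).tolist()
lengths = huffman_code_lengths(top_bytes)
counts = Counter(top_bytes)
avg_bits = sum(counts[s] * lengths[s] for s in counts) / len(top_bytes)
print(f"{avg_bits:.2f} bits on average for a byte that took 8 - and decoding is exact")
```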
Qwen2.5 On-Device: A Complete System
Xiang et al.'s work on deploying Qwen2.5 provides a concrete picture of what on-device LLM deployment requires. It is not sufficient to compress the model; you must co-optimize across the entire stack: activation-aware weight quantization, hardware-software co-design where compute-intensive operations are offloaded to the FPGA fabric, and custom hardware pipelines to reduce per-token inference cost.
Their deployed system demonstrates viable on-device inference for quantized LLMs on constrained embedded hardware, with meaningful throughput improvements over an unoptimized baseline.
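In the spirit of the activation-aware step, here is a generic illustration, not the paper's implementation: input channels that see large activations are scaled up before rounding so that quantization error lands on less salient channels. The alpha exponent and the calibration statistics below are placeholders.

```python
import numpy as np

def activation_aware_quantize(w: np.ndarray, act_scale: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Fake-quantize w (out_features x in_features) to 4 bits with per-channel protection.

    Channels with large activations are scaled up before rounding so they lose less
    precision; the scale is undone afterwards to simulate the error in the original
    weight space. Real deployments fold the inverse scale into the preceding op."""
    s = act_scale ** alpha                                   # per-input-channel protection factor
    w_scaled = w * s[None, :]
    q_scale = np.abs(w_scaled).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(w_scaled / q_scale), -8, 7)         # symmetric 4-bit grid
    return (q * q_scale) / s[None, :]

w = np.random.randn(16, 64).astype(np.float32)
calib_acts = np.abs(np.random.randn(100, 64))                # stand-in for calibration activations
w_q = activation_aware_quantize(w, calib_acts.mean(axis=0))
print("mean abs error:", np.abs(w - w_q).mean())
```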
Claims and Evidence
| Claim | Evidence | Verdict |
|---|---|---|
| 4-bit quantization preserves most model quality | Multiple papers show <2% degradation on standard benchmarks | ✅ Strongly supported |
| KV cache is a major memory bottleneck for long contexts | VecInfer achieves low-bit KV cache quantization with minimal quality loss and substantial latency reduction | ✅ Supported |
| Lossless compression provides meaningful size reduction | Huff-LLM demonstrates lossless compression as an alternative to quantization | ✅ Supported |
| On-device quantized LLMs achieve viable inference speed | Qwen2.5 on FPGA demonstrates meaningful throughput improvement over baseline | ✅ Demonstrated |
| Compressed models match cloud model quality | Quality gap remains, especially for complex reasoning tasks | ⚠️ Partially supported |
Open Questions
What This Means for Your Research
The on-device LLM revolution has immediate implications across multiple research domains. For NLP researchers, the constraint of limited compute forces a return to first principles—which aspects of language understanding truly require billions of parameters, and which are achievable with orders of magnitude less? For systems researchers, the co-design of algorithms and hardware for LLM inference represents a rich new design space. For privacy researchers, on-device inference offers a path to AI-assisted applications that process sensitive data—medical, legal, financial—without ever exposing it to external servers.
The trajectory is clear: within two years, running a capable language model locally will be as unremarkable as running a web browser. The research question is not whether this will happen, but what becomes possible when it does.