vLLM and Speculative Decoding: How Parallel Drafting Triples LLM Throughput

Autoregressive decoding, which generates one token at a time, remains the primary throughput bottleneck in LLM serving. According to Berkeley technical report EECS-2025-192, vLLM's integration of P-EAGLE parallel speculative decoding drafts K tokens in a single forward pass, Eagle3 represents the current state of the art, and TurboSpec adds closed-loop control that adjusts speculation parameters at runtime.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Every token a large language model generates requires a full forward pass through billions of parameters. For a 70-billion-parameter model producing a 500-token response, that means 500 sequential passes—each consuming GPU memory bandwidth, each adding latency that users experience as waiting. The arithmetic is unforgiving: autoregressive decoding turns the most powerful models into the slowest ones.
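The arithmetic above can be made concrete with a rough lower bound. The sketch below assumes decoding is limited purely by memory bandwidth, so every generated token must stream all model weights once. The 16-bit weights and the 2 TB/s aggregate bandwidth are illustrative assumptions, not figures from the report, and the function name is hypothetical.

```python
def decode_time_seconds(n_params: float, bytes_per_param: float,
                        bandwidth_bytes_per_s: float, n_tokens: int) -> float:
    """Lower bound on decode time when each token streams all weights once."""
    weight_bytes = n_params * bytes_per_param
    seconds_per_token = weight_bytes / bandwidth_bytes_per_s
    return n_tokens * seconds_per_token


# Assumed figures: 70e9 parameters, 2 bytes per parameter (16-bit weights),
# 2e12 bytes/s of aggregate memory bandwidth, 500 output tokens.
total = decode_time_seconds(70e9, 2, 2e12, 500)
print(f"{total:.1f} s for 500 tokens, {total / 500 * 1e3:.0f} ms per token")
```

Under these assumptions the bound is about 70 ms per token regardless of how much compute the accelerator has, which is why accepting several tokens per target pass matters.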

Speculative decoding attacks this bottleneck by splitting each decoding step into two phases: a small "draft" model cheaply proposes several candidate tokens, and the large "target" model verifies them in a single batched pass. Every drafted token the target agrees with is accepted, so one target pass can yield several tokens. With a correct acceptance rule, the output follows exactly the distribution the target model would have produced alone; under greedy decoding, it is the same token sequence. The key insight is that verification is cheaper than generation: because decoding is limited by memory bandwidth rather than compute, checking whether five tokens are correct in one pass costs roughly the same as generating one.
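A minimal sketch of the draft-then-verify cycle, assuming greedy decoding and toy stand-in models. The names `speculative_step`, `generate`, `target_model`, and `draft_model` are hypothetical and are not vLLM APIs. A real engine scores all drafted positions in one batched target pass; the per-position calls here only keep the logic readable.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """One draft-then-verify cycle under greedy decoding; returns new tokens."""
    # Draft phase: propose k tokens with the cheap model.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # Verify phase: compare each proposal with the target's own choice.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)
    accepted.append(target(prefix + accepted))  # all k accepted: one bonus token
    return accepted


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             k: int, n_new: int) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prompt) + n_new]


def greedy(prompt: List[int], model: NextToken, n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


# Toy models: the draft agrees with the target except at every fourth position.
def target_model(prefix: List[int]) -> int:
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[int]) -> int:
    tok = target_model(prefix)
    return tok if len(prefix) % 4 else (tok + 1) % 50


prompt = [1, 2, 3]
spec_out = generate(prompt, draft_model, target_model, k=5, n_new=20)
plain_out = greedy(prompt, target_model, 20)
print(spec_out == plain_out)
```

Because every emitted token is either confirmed or supplied by the target, the speculative output matches plain greedy decoding with the target alone. Sampling-based decoding needs a rejection-sampling acceptance rule instead, which this sketch does not cover.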

The vLLM Ecosystem

vLLM has emerged as the de facto open-source serving engine for large language models. Originally developed at Berkeley for its PagedAttention memory management system, vLLM now handles the inference workloads of research labs, startups, and enterprises that cannot or will not rely on proprietary serving APIs. Its architecture—continuous batching, efficient KV-cache management, and a modular execution backend—makes it a natural platform for integrating speculative decoding at the systems level rather than as an afterthought.

The Berkeley EECS-2025-192 technical report documents how speculative decoding has been integrated into vLLM's core serving pipeline, moving it from a research curiosity to a production-grade optimization. Three approaches represent the current frontier.

P-EAGLE: Parallel Draft Generation

The P-EAGLE (Parallel EAGLE) method addresses a fundamental limitation of earlier speculative decoding: draft models themselves generate tokens autoregressively, creating a smaller but still sequential bottleneck. P-EAGLE restructures the draft phase so that K candidate tokens are generated in a single forward pass rather than K sequential ones.

This parallelism changes the economics of speculation. When draft generation is sequential, there is a crossover point beyond which generating more draft tokens costs more than it saves—the draft model's own latency eats into the gains from batched verification. When draft generation is parallel, the cost of proposing K tokens is nearly constant regardless of K, shifting the optimal draft length upward and increasing the expected number of accepted tokens per verification cycle.
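The crossover argument can be illustrated with a simple cost model. The sketch assumes each drafted token is accepted independently with probability `alpha`, that one draft pass costs a fraction `c` of a target pass, and that verification cost does not grow with K. All three are simplifying assumptions of this example, not measurements from the report.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per cycle, counting the target's final token."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha: float, k: int, c: float, parallel: bool) -> float:
    """Tokens per unit of target-pass time, relative to plain decoding."""
    draft_cost = c if parallel else k * c  # parallel drafting: one pass for all k
    return expected_tokens(alpha, k) / (draft_cost + 1.0)


def best_k(alpha: float, c: float, parallel: bool, k_max: int = 16) -> int:
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k, c, parallel))


alpha, c = 0.8, 0.1  # assumed acceptance rate and relative draft cost
k_seq = best_k(alpha, c, parallel=False)
k_par = best_k(alpha, c, parallel=True)
print(k_seq, k_par)
```

With sequential drafting the best K sits at a small interior value; with parallel drafting the model keeps rewarding a longer draft up to the search limit. In practice, verification cost and falling acceptance at later positions would cap K well before that.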

Eagle3: Current State-of-the-Art

Eagle3 represents the current state-of-the-art in speculative decoding within the vLLM ecosystem. While the technical report does not provide isolated benchmark numbers for Eagle3 separate from the broader vLLM integration, its position at the top of the speculative decoding hierarchy reflects iterative improvements in draft model architecture, training methodology, and integration with vLLM's KV-cache management.

The progression from Eagle to Eagle2 to Eagle3 illustrates a pattern common in systems research: each generation addresses bottlenecks revealed by the previous one. As described in the respective EAGLE papers, Eagle moved drafting to the target model's feature level; Eagle2 added context-dependent dynamic draft trees; Eagle3 drops feature prediction in favor of direct token prediction over features fused from multiple layers. Integration with vLLM then determines whether those theoretical speedups survive contact with production serving conditions: batched requests, variable sequence lengths, and memory pressure from concurrent users.

TurboSpec: Closed-Loop Control

Perhaps the most architecturally interesting contribution is TurboSpec, which applies closed-loop control theory to speculative decoding. Rather than fixing the number of draft tokens (K) and the acceptance threshold as static hyperparameters, TurboSpec dynamically adjusts these parameters based on runtime feedback.

Claim	Source	Confidence	Status
P-EAGLE generates K draft tokens in a single forward pass	Berkeley EECS-2025-192 abstract	High	Stated in source
Eagle3 is current state-of-the-art for speculative decoding	Berkeley EECS-2025-192 abstract	High	Stated in source
TurboSpec uses closed-loop control for dynamic parameter adjustment	Berkeley EECS-2025-192 abstract	High	Stated in source
vLLM is becoming the de facto LLM serving standard	Berkeley EECS-2025-192 abstract	Medium	Characterization by authors

The intuition is straightforward: different prompts, different models, and different hardware configurations produce different acceptance rates. A fixed K=5 draft length might be optimal for code generation on an A100 but wasteful for creative writing on an H100. TurboSpec monitors acceptance rates in real time and adjusts the draft length and acceptance criteria accordingly, treating the speculative decoding pipeline as a control system with a measurable objective (throughput) and tunable parameters.

This approach echoes TCP congestion control, where the sending rate adapts to observed network conditions rather than being set statically. The analogy is more than superficial—both systems face the challenge of maximizing throughput in the presence of variable and unpredictable conditions, and both benefit from feedback loops that respond to observed performance rather than predicted performance.
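To make the congestion-control analogy concrete, here is a minimal sketch of an additive-increase, multiplicative-decrease controller for the draft length K. This is not TurboSpec's actual algorithm, which the source does not detail. The class name, the 0.7 target acceptance rate, and the smoothing factor are all assumptions chosen for illustration.

```python
class DraftLengthController:
    """AIMD-style feedback control of the draft length K (illustrative only)."""

    def __init__(self, k_init: int = 4, k_min: int = 1, k_max: int = 16,
                 target_rate: float = 0.7, smoothing: float = 0.5) -> None:
        self.k = k_init
        self.k_min = k_min
        self.k_max = k_max
        self.target_rate = target_rate
        self.smoothing = smoothing
        self.rate_ema = target_rate  # start from a neutral estimate

    def update(self, proposed: int, accepted: int) -> int:
        """Feed back one verification result and return the next K."""
        observed = accepted / proposed if proposed > 0 else 0.0
        # Exponential moving average damps reaction to a single noisy cycle.
        self.rate_ema = (self.smoothing * self.rate_ema
                         + (1 - self.smoothing) * observed)
        if self.rate_ema >= self.target_rate:
            self.k = min(self.k_max, self.k + 1)   # additive increase
        else:
            self.k = max(self.k_min, self.k // 2)  # multiplicative decrease
        return self.k


controller = DraftLengthController()
for _ in range(20):  # easy stretch: every drafted token is accepted
    controller.update(proposed=controller.k, accepted=controller.k)
k_after_easy = controller.k
for _ in range(10):  # hard stretch: nothing is accepted
    controller.update(proposed=controller.k, accepted=0)
k_after_hard = controller.k
print(k_after_easy, k_after_hard)
```

The controller probes upward slowly while drafts are being accepted and backs off quickly when they are not, the same asymmetry TCP uses to avoid sustained overload.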

What This Means for LLM Serving Infrastructure

The integration of speculative decoding into vLLM at the engine level—rather than as a wrapper or plugin—has implications for how LLM serving infrastructure evolves. When the serving engine itself manages draft-target coordination, it can make scheduling decisions that account for the speculative decoding overhead: allocating GPU resources between draft and target models, managing KV-cache memory for both, and batching verification passes across multiple concurrent requests.
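One piece of that bookkeeping is KV-cache memory, which both models consume for every cached token. The sketch below uses the standard per-token sizing for attention (keys plus values, per layer, per KV head). The model shapes and the 40 GiB budget are hypothetical values picked for illustration and do not describe any specific model or vLLM's allocator.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV-cache bytes needed per cached token: keys and values for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical shapes: an 80-layer target and a single-layer draft head,
# both with 8 KV heads of dimension 128 and 16-bit cache entries.
target_bytes = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
draft_bytes = kv_bytes_per_token(n_layers=1, n_kv_heads=8, head_dim=128)

budget = 40 * 1024 ** 3  # assumed 40 GiB reserved for KV cache
tokens_target_only = budget // target_bytes
tokens_with_draft = budget // (target_bytes + draft_bytes)
print(target_bytes, draft_bytes, tokens_target_only, tokens_with_draft)
```

With a lightweight draft head the extra cache cost per token is small, but it still reduces how many tokens fit in a fixed budget, which is the kind of trade-off an engine-level scheduler can account for.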

Open Questions

Draft model training cost: Speculative decoding requires a draft model aligned with the target model. How do the training costs of draft models scale as target models grow, and at what model size does the amortized training cost exceed the serving savings?

Multi-tenant interference: In production serving with many concurrent users, does the memory overhead of maintaining separate draft model state for each request degrade the batching efficiency that makes vLLM competitive in the first place?

Generalization across modalities: Current speculative decoding focuses on text. As multimodal models generate interleaved text, image tokens, and audio, do the acceptance rate assumptions that make speculation profitable still hold?

Hardware co-design: TurboSpec's closed-loop control adapts to hardware differences at runtime. Would hardware-aware draft model architectures—designed for specific accelerator memory hierarchies—outperform the adaptive approach?

The movement of speculative decoding from research papers into production serving engines marks a transition point. The algorithmic ideas are maturing; the engineering challenge is now integration—making these techniques work reliably under the messy conditions of real-world LLM deployment, where requests are heterogeneous, hardware is shared, and the cost of a regression is measured in user experience and cloud bills.

References

Berkeley EECS. (2025). Speculative Decoding in vLLM (Technical Report EECS-2025-192). University of California, Berkeley.
