
vLLM and Speculative Decoding: How Parallel Drafting Triples LLM Throughput

Autoregressive decoding, which generates one token at a time, remains the primary throughput bottleneck in LLM serving. Berkeley's integration of speculative decoding into vLLM spans three approaches: P-EAGLE, which generates K draft tokens in a single forward pass; Eagle3, described as the current state of the art; and TurboSpec, which adds closed-loop dynamic parameter control.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.
Disclaimer: This post is a research trend overview for informational purposes. Specific findings, statistics, and claims should be verified against the original papers before citation in academic work.


Every token a large language model generates requires a full forward pass through billions of parameters. For a 70-billion-parameter model producing a 500-token response, that means 500 sequential passes—each consuming GPU memory bandwidth, each adding latency that users experience as waiting. The arithmetic is unforgiving: autoregressive decoding turns the most powerful models into the slowest ones.
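A back-of-the-envelope sketch makes this cost concrete. The figures below are assumptions chosen for illustration, not measurements from the report: 16-bit weights (2 bytes per parameter), a batch size of 1, and a hypothetical aggregate memory bandwidth of 2 TB/s. The estimate ignores KV-cache reads and compute time, so it is only a lower bound.

```python
def min_decode_seconds(params, bytes_per_param, bandwidth_bytes_per_s, new_tokens):
    """Lower bound on decode time when every new token streams all weights once."""
    bytes_per_token = params * bytes_per_param
    seconds_per_token = bytes_per_token / bandwidth_bytes_per_s
    return seconds_per_token * new_tokens


# Assumed values (illustrative only).
PARAMS = 70e9            # 70-billion-parameter model
BYTES_PER_PARAM = 2      # 16-bit weights
BANDWIDTH = 2e12         # 2 TB/s aggregate memory bandwidth (hypothetical)
NEW_TOKENS = 500         # length of the response

total_seconds = min_decode_seconds(PARAMS, BYTES_PER_PARAM, BANDWIDTH, NEW_TOKENS)
print(f"weights read per token: {PARAMS * BYTES_PER_PARAM / 1e9:.0f} GB")
print(f"lower bound for {NEW_TOKENS} tokens: {total_seconds:.1f} s")
```

Under these assumptions each token requires reading about 140 GB of weights, which is why accepting several tokens per target pass matters more than making any single pass faster.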

Speculative decoding attacks this bottleneck by splitting inference into two phases: a small "draft" model proposes multiple candidate tokens cheaply, and the large "target" model verifies them in a single batched pass. When the draft model guesses correctly, multiple tokens are accepted at once, and with a lossless acceptance rule the output distribution is identical to what the target model would have produced alone. The key insight is that verification is cheaper than generation: because decoding is limited by memory bandwidth rather than compute, checking five tokens in one pass costs roughly the same as generating one.
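The accept-or-reject step can be sketched for the simplest case, greedy decoding. This is a minimal illustration, not vLLM's implementation: the function name `verify_greedy` and the list-of-token-ids interface are assumptions, and the target model's per-position choices are passed in as a plain list rather than computed by a real model.

```python
def verify_greedy(draft_tokens, target_argmax):
    """Return the tokens to emit after one verification pass (greedy case).

    draft_tokens:  K token ids proposed by the draft model.
    target_argmax: K + 1 token ids; target_argmax[i] is the target model's
                   greedy choice at position i, given the prompt followed by
                   draft_tokens[:i]. One batched target pass yields all of them.
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("target_argmax must have exactly one more entry than draft_tokens")

    emitted = []
    for i, token in enumerate(draft_tokens):
        if token != target_argmax[i]:
            # First mismatch: emit the target's own token and discard the rest.
            emitted.append(target_argmax[i])
            return emitted
        emitted.append(token)

    # Every draft token matched, so the same pass also provides a bonus token.
    emitted.append(target_argmax[-1])
    return emitted


print(verify_greedy([5, 7, 9], [5, 7, 2, 4]))  # two drafts accepted, then a correction
print(verify_greedy([5, 7], [5, 7, 3]))        # all drafts accepted, plus a bonus token
```

Every cycle emits at least one token, so the worst case matches ordinary decoding in tokens per target pass. When tokens are sampled rather than chosen greedily, a rejection-sampling rule takes the place of the equality check so that the output distribution is preserved.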

The vLLM Ecosystem

vLLM has emerged as the de facto open-source serving engine for large language models. Originally developed at Berkeley for its PagedAttention memory management system, vLLM now handles the inference workloads of research labs, startups, and enterprises that cannot or will not rely on proprietary serving APIs. Its architecture—continuous batching, efficient KV-cache management, and a modular execution backend—makes it a natural platform for integrating speculative decoding at the systems level rather than as an afterthought.

The Berkeley EECS-2025-192 technical report documents how speculative decoding has been integrated into vLLM's core serving pipeline, moving it from a research curiosity to a production-grade optimization. Three approaches represent the current frontier.

P-EAGLE: Parallel Draft Generation

The P-EAGLE (Parallel EAGLE) method addresses a fundamental limitation of earlier speculative decoding: draft models themselves generate tokens autoregressively, creating a smaller but still sequential bottleneck. P-EAGLE restructures the draft phase so that K candidate tokens are generated in a single forward pass rather than K sequential ones.

This parallelism changes the economics of speculation. When draft generation is sequential, there is a crossover point beyond which generating more draft tokens costs more than it saves—the draft model's own latency eats into the gains from batched verification. When draft generation is parallel, the cost of proposing K tokens is nearly constant regardless of K, shifting the optimal draft length upward and increasing the expected number of accepted tokens per verification cycle.
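A simple cost model illustrates the crossover. It is a sketch under strong assumptions that do not come from the report: each draft token is accepted independently with the same probability `alpha`, one draft forward pass costs a fraction `c` of one target pass, and verification cost does not grow with K. The function names are hypothetical.

```python
def expected_tokens(alpha, k):
    """Expected tokens emitted per verification cycle with K = k draft tokens.

    Assumes independent acceptance with probability alpha per draft token and
    counts the one token the target pass always contributes.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha, k, c, parallel):
    """Tokens per unit of cost, relative to plain decoding (1 token per target pass)."""
    draft_cost = c if parallel else c * k   # one draft pass vs. k sequential passes
    return expected_tokens(alpha, k) / (draft_cost + 1.0)


def best_k(alpha, c, parallel, k_max=16):
    """Draft length in 1..k_max that maximizes the modeled speedup."""
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k, c, parallel))


ALPHA, C = 0.8, 0.1   # assumed acceptance rate and draft/target cost ratio
for mode, parallel in (("sequential", False), ("parallel", True)):
    k = best_k(ALPHA, C, parallel)
    print(f"{mode:>10}: best K = {k}, modeled speedup = {speedup(ALPHA, k, C, parallel):.2f}x")
```

With these assumed values the sequential model peaks at a moderate K, while the parallel model keeps improving up to the search limit. Real systems behave less cleanly: acceptance tends to fall for later draft positions and verification is not free, so the optimum shifts upward rather than to infinity.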

Eagle3: Current State-of-the-Art

Eagle3 represents the current state-of-the-art in speculative decoding within the vLLM ecosystem. While the technical report does not provide isolated benchmark numbers for Eagle3 separate from the broader vLLM integration, its position at the top of the speculative decoding hierarchy reflects iterative improvements in draft model architecture, training methodology, and integration with vLLM's KV-cache management.

The progression from Eagle to Eagle2 to Eagle3 illustrates a pattern common in systems research: each generation addresses bottlenecks revealed by the previous one. Eagle improved draft quality; Eagle2 improved draft-target alignment; Eagle3 improves the systems-level integration that determines whether theoretical speedups survive contact with production serving conditions—batched requests, variable sequence lengths, and memory pressure from concurrent users.

TurboSpec: Closed-Loop Control

Perhaps the most architecturally interesting contribution is TurboSpec, which applies closed-loop control theory to speculative decoding. Rather than fixing the number of draft tokens (K) and the acceptance threshold as static hyperparameters, TurboSpec dynamically adjusts these parameters based on runtime feedback.

| Claim | Source | Confidence | Status |
| --- | --- | --- | --- |
| P-EAGLE generates K draft tokens in a single forward pass | Berkeley EECS-2025-192 abstract | High | Stated in source |
| Eagle3 is current state-of-the-art for speculative decoding | Berkeley EECS-2025-192 abstract | High | Stated in source |
| TurboSpec uses closed-loop control for dynamic parameter adjustment | Berkeley EECS-2025-192 abstract | High | Stated in source |
| vLLM is becoming the de facto LLM serving standard | Berkeley EECS-2025-192 abstract | Medium | Characterization by authors |

The intuition is straightforward: different prompts, different models, and different hardware configurations produce different acceptance rates. A fixed K=5 draft length might be optimal for code generation on an A100 but wasteful for creative writing on an H100. TurboSpec monitors acceptance rates in real time and adjusts the draft length and acceptance criteria accordingly, treating the speculative decoding pipeline as a control system with a measurable objective (throughput) and tunable parameters.

This approach echoes TCP congestion control, where the sending rate adapts to observed network conditions rather than being set statically. The analogy is more than superficial—both systems face the challenge of maximizing throughput in the presence of variable and unpredictable conditions, and both benefit from feedback loops that respond to observed performance rather than predicted performance.
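To make the congestion-control analogy tangible, here is a toy controller that borrows TCP's additive-increase, multiplicative-decrease rule to adjust the draft length. It is explicitly not TurboSpec's algorithm, which the source does not describe in detail; the class name, the 0.7 target acceptance rate, and the bounds on K are all assumptions for illustration.

```python
class DraftLengthController:
    """Hypothetical AIMD-style feedback controller for the draft length K."""

    def __init__(self, k_init=4, k_min=1, k_max=16, target_rate=0.7):
        if not (k_min <= k_init <= k_max):
            raise ValueError("k_init must lie between k_min and k_max")
        self.k = k_init
        self.k_min = k_min
        self.k_max = k_max
        self.target_rate = target_rate  # assumed threshold, not from the report

    def update(self, accepted, proposed):
        """Feed back one verification cycle and return K for the next cycle."""
        if proposed <= 0:
            raise ValueError("proposed must be positive")
        rate = accepted / proposed
        if rate >= self.target_rate:
            # Drafts are mostly accepted: probe for a longer draft, one step at a time.
            self.k = min(self.k + 1, self.k_max)
        else:
            # Too many rejections: back off quickly by halving.
            self.k = max(self.k // 2, self.k_min)
        return self.k


controller = DraftLengthController(k_init=4)
# Simulated feedback: (accepted, proposed) per verification cycle.
for accepted, proposed in [(4, 4), (5, 5), (1, 6), (3, 3)]:
    print(f"accepted {accepted}/{proposed} -> next K = {controller.update(accepted, proposed)}")
```

A production controller would more plausibly optimize measured throughput directly and smooth the signal over many requests, but the loop structure is the same: observe, compare against a target, adjust.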

What This Means for LLM Serving Infrastructure

The integration of speculative decoding into vLLM at the engine level—rather than as a wrapper or plugin—has implications for how LLM serving infrastructure evolves. When the serving engine itself manages draft-target coordination, it can make scheduling decisions that account for the speculative decoding overhead: allocating GPU resources between draft and target models, managing KV-cache memory for both, and batching verification passes across multiple concurrent requests.

Open Questions

  • Draft model training cost: Speculative decoding requires a draft model aligned with the target model. How do the training costs of draft models scale as target models grow, and at what model size does the amortized training cost exceed the serving savings?
  • Multi-tenant interference: In production serving with many concurrent users, does the memory overhead of maintaining separate draft model state for each request degrade the batching efficiency that makes vLLM competitive in the first place?
  • Generalization across modalities: Current speculative decoding focuses on text. As multimodal models generate interleaved text, image tokens, and audio, do the acceptance rate assumptions that make speculation profitable still hold?
  • Hardware co-design: TurboSpec's closed-loop control adapts to hardware differences at runtime. Would hardware-aware draft model architectures—designed for specific accelerator memory hierarchies—outperform the adaptive approach?

The movement of speculative decoding from research papers into production serving engines marks a transition point. The algorithmic ideas are maturing; the engineering challenge is now integration: making these techniques work reliably under the messy conditions of real-world LLM deployment, where requests are heterogeneous, hardware is shared, and the cost of a regression is measured in user experience and cloud bills.


    References

    Berkeley EECS. (2025). Speculative Decoding in vLLM (Technical Report EECS-2025-192). University of California, Berkeley.
