TL;DR

vLLM’s V1 engine and a set of runtime optimizations pushed multi-node DeepSeek-style MoE inference to 2.2k tokens/second per H200 in community benchmarks. The team describes Wide-EP, Dual-Batch Overlap, expert-parallel load balancing, DeepEP kernels and disaggregated serving as key contributors to the gain.

What happened

vLLM has completed its migration to the V1 engine (v0.11.0 removed the last V0 code) and reports sustained, production-like inference throughput of 2.2k tokens per second per H200 GPU on multi-node CoreWeave clusters connected over InfiniBand with ConnectX-7 NICs. The number comes from community benchmarking and follows earlier results of roughly 1.5k tok/s per GPU. The project attributes the increase to a chain of optimizations: async scheduling, Dual-Batch Overlap (DBO) to overlap compute and communication, DeepEP and other MoE kernels, CUDA graph modes, DeepGEMM enabled by default, SiLU kernel fusion, and expert-parallel load balancing (EPLB). vLLM documents deployment patterns for Wide-EP (expert parallelism combined with data parallelism) and disaggregated prefill/decode serving, along with integration paths for llm-d, Dynamo, and Ray Serve LLM, and publishes a roadmap covering further work such as elastic expert parallelism and long-context serving.

Why it matters

  • Higher sustained throughput per GPU can reduce the number of replicas needed for a target QPS, lowering infrastructure cost for inference.
  • Wide-EP targets MoE architectures by improving effective KV cache usage for models with sparse expert activation, which can unlock larger models in production.
  • Communication- and routing-focused techniques (DBO, EPLB, disaggregated serving) aim to mitigate the scalability limits of expert-parallel deployments.
  • Upstream kernel and runtime work (DeepEP, DeepGEMM, CUDA graph modes) provides optimizations that other deployments can adopt or replicate through documented integration paths.

Key facts

  • vLLM removed the last V0 engine code in v0.11.0, completing migration to the V1 engine.
  • The project reported 1,969 contributors and over 950 commits in the month leading up to December 18, 2025.
  • Community benchmarks on CoreWeave H200 clusters with InfiniBand and ConnectX-7 NICs show sustained 2.2k tokens/s per H200 in production-like, multi-node deployments.
  • Earlier benchmarks measured ~1.5k tokens/s per GPU; improvements are credited to kernel fixes and runtime strategies including DBO and EPLB.
  • Key runtime and kernel improvements listed include async scheduling, Dual-Batch Overlap, disaggregated serving, CUDA graph FULL_AND_PIECEWISE, DeepGEMM enabled by default, DeepEP kernels, and a SiLU kernel for DeepSeek-R1.
  • Wide-EP (expert parallelism combined with data parallelism) is recommended for MLA (Multi-head Latent Attention) architectures; DeepSeek-R1 activates ~37B of its 671B parameters per token.
  • vLLM supports expert-parallel deployment via --enable-expert-parallel and offers multiple all-to-all kernel backends (DeepEP, Perplexity MoE kernels, NCCL-based AllGather-ReduceScatter); a launch sketch follows this list.
  • Dual-Batch Overlap (--enable-dbo) overlaps microbatch compute and collectives to improve GPU utilization where communication overhead is significant.
  • Expert Parallel Load Balancing (--enable-eplb) implements hierarchical and global policies to rebalance logical-to-physical expert mappings without restarting the model.
  • The project documents integration and deployment paths with llm-d, Dynamo, and Ray Serve LLM for reproducing wide-EP and disaggregated serving setups.
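
For orientation, the launch sketch below combines the flags named in this list. It is a minimal, hypothetical single-node example: the model ID, tensor/data-parallel sizes, and port are illustrative assumptions rather than values from the post, and only --enable-expert-parallel, --enable-dbo, and --enable-eplb come from the text. Real multi-node Wide-EP and disaggregated setups follow the llm-d, Dynamo, and Ray Serve LLM recipes mentioned above.

```python
# Minimal sketch (not from the post): compose a `vllm serve` command that
# enables the expert-parallel features listed above. The model ID, parallel
# sizes, and port are illustrative assumptions.
import subprocess

cmd = [
    "vllm", "serve", "deepseek-ai/DeepSeek-R1",  # assumed model; any EP-capable MoE model works
    "--tensor-parallel-size", "1",               # keep tensor parallelism narrow ...
    "--data-parallel-size", "8",                 # ... and scale out with data parallelism
    "--enable-expert-parallel",                  # shard MoE experts across ranks (Wide-EP)
    "--enable-dbo",                              # Dual-Batch Overlap: overlap compute and collectives
    "--enable-eplb",                             # expert-parallel load balancing
    "--port", "8000",
]
subprocess.run(cmd, check=True)  # blocks while the server is running
```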

What to watch next

  • Progress on roadmap items such as elastic expert parallelism, long-context serving, and KV cache transfer via CPU (as listed in vLLM’s roadmap).
  • Community and operator replication of the 2.2k tok/s/H200 result using the llm-d, Dynamo, and Ray Serve LLM deployment paths and recipes provided by vLLM.
  • Ongoing kernel and hardware-specific optimizations: DeepEP kernel adoption, FlashInfer integration improvements, and planned GB200 optimizations.

Quick glossary

  • Mixture-of-Experts (MoE): A model architecture that routes different input tokens to different expert sub-networks so only a sparse subset of parameters is active per forward pass (see the toy routing sketch after this glossary).
  • Wide-EP: A deployment pattern combining expert parallelism with data parallelism to increase effective KV cache capacity and batch efficiency for models with sparse activation.
  • KV cache: Key-value cache that stores past attention keys and values to avoid recomputing them during autoregressive decoding and speed up inference.
  • Disaggregated serving: A prefill/decode serving pattern that separates phases and resources across ranks or services to reduce contention and improve throughput in distributed inference.
  • Dual-Batch Overlap (DBO): A microbatching strategy that overlaps compute and collective communication by running interleaved microbatch worker threads to improve GPU utilization.
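
As a concrete illustration of the MoE entry above, here is a toy top-k routing sketch in plain NumPy. Everything in it (shapes, expert count, top-k) is an arbitrary assumption and it does not reflect vLLM's actual kernels; it only shows how a router sends each token to a small subset of experts so most expert parameters stay idle for that token.

```python
# Toy illustration of sparse MoE routing (not vLLM code): each token is sent
# to its top-k experts, so only a small fraction of expert parameters is
# active per token. Shapes and expert count are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, hidden, num_experts, top_k = 4, 8, 16, 2

x = rng.normal(size=(num_tokens, hidden))                 # token activations
router_w = rng.normal(size=(hidden, num_experts))         # router (gating) weights
experts = rng.normal(size=(num_experts, hidden, hidden))  # one weight matrix per expert

scores = x @ router_w                                     # (num_tokens, num_experts) routing scores
top = np.argsort(scores, axis=1)[:, -top_k:]              # indices of the top-k experts per token

out = np.zeros_like(x)
for t in range(num_tokens):
    # softmax over the selected experts' scores gives the mixing weights
    sel = scores[t, top[t]]
    gate = np.exp(sel - sel.max())
    gate /= gate.sum()
    for w, e in zip(gate, top[t]):
        out[t] += w * (x[t] @ experts[e])                  # only the top-k experts run for this token

print(out.shape)  # (4, 8): same shape as the input, computed with a sparse subset of experts
```

At DeepSeek-R1 scale, the same idea is what keeps roughly 37B of 671B parameters active per token.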

Reader FAQ

What does the 2.2k tok/s measurement represent?
It is the reported sustained throughput per H200 GPU in community benchmarks run on a CoreWeave H200 cluster with InfiniBand and ConnectX-7 NICs in multi-node, production-like deployments.

Has vLLM fully moved to the V1 engine?
Yes. The project removed the last V0 engine code in v0.11.0, marking a complete migration to the V1 engine.

Are major teams using vLLM in production?
vLLM states it is trusted in production by teams at Meta, LinkedIn, Red Hat, Mistral, and Hugging Face.

Are latency and exact cost-per-token numbers provided?
Not confirmed in the source.

Sources

  • vLLM Team, "vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP," vLLM blog, December 17, 2025.
