vLLM large-scale serving hits 2.2k tok/s per H200 with Wide-EP
TL;DR: vLLM’s V1 engine and a set of runtime optimizations pushed multi-node DeepSeek-style MoE inference to 2.2k tokens/second per H200…
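For context, "Wide-EP" refers to wide expert parallelism: sharding a MoE model's experts across many GPUs, potentially spanning nodes. Below is a minimal sketch of enabling expert parallelism with vLLM's offline Python API; the checkpoint name, parallelism sizes, and the use of the `enable_expert_parallel` engine argument here are illustrative assumptions, not a configuration taken from the article.

```python
# Sketch (assumptions, not the article's setup): run a MoE model with
# vLLM, sharding dense layers via tensor parallelism and MoE experts
# via expert parallelism.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # assumed MoE checkpoint
    tensor_parallel_size=8,           # shard dense layers across 8 GPUs
    enable_expert_parallel=True,      # shard MoE experts across ranks ("EP")
)

outputs = llm.generate(
    ["Explain expert parallelism in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```

The wide-EP deployments the headline describes go further than this single-node sketch, combining expert parallelism with data-parallel attention across multiple H200 nodes.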