TL;DR
TurboDiffusion is an open-source framework that its authors say accelerates end-to-end video diffusion generation by roughly 100–200× on a single RTX 5090. The implementation combines SageAttention, SLA (Sparse-Linear Attention), and rCM timestep distillation, and ships model checkpoints, inference scripts, and installation guidance on GitHub.
What happened
Researchers published TurboDiffusion, a repository and implementation that claims large runtime reductions for video diffusion models. The framework speeds up sampling by combining SageAttention-based attention acceleration, SLA (Sparse-Linear Attention), and rCM timestep distillation. In the project's benchmarks on a single RTX 5090 GPU, TurboDiffusion cut end-to-end (E2E) generation time from 4,549 seconds to 38 seconds for a Wan-2.2 I2V A14B 720p workload and from 184 seconds to 1.9 seconds for a Wan-2.1 T2V 1.3B 480p example; a larger Wan-2.1 14B 720p model saw E2E time fall from 4,767 seconds to 24 seconds. The repository includes quantized and unquantized checkpoints, recommended environment and installation steps, and inference scripts with options for attention type, sampling steps, and quantization. The authors note that the paper and checkpoints are not finalized and may be updated.
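For context, the headline claim can be sanity-checked directly; a few lines of Python, using only the times quoted above, reproduce the 100–200× range:

```python
# Reported E2E times in seconds (baseline, TurboDiffusion), taken from the
# project's published benchmarks on a single RTX 5090.
benchmarks = {
    "Wan-2.2-I2V-A14B-720P": (4549, 38),
    "Wan-2.1-T2V-1.3B-480P": (184, 1.9),
    "Wan-2.1-T2V-14B-720P": (4767, 24),
}

for model, (baseline_s, turbo_s) in benchmarks.items():
    print(f"{model}: {baseline_s / turbo_s:.0f}x speedup")
# Wan-2.2-I2V-A14B-720P: 120x speedup
# Wan-2.1-T2V-1.3B-480P: 97x speedup
# Wan-2.1-T2V-14B-720P: 199x speedup  -> consistent with the stated 100-200x range
```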
Why it matters
- Substantial runtime reductions could enable much faster iteration and lower compute cost for video generation workflows.
- Real-time or near-real-time generation on high-end consumer GPUs (example: RTX 5090) becomes more feasible according to the reported measurements.
- Attention and sampling optimizations demonstrated here could be applied to other large video diffusion pipelines.
- Results and tooling are provided as open-source checkpoints and scripts, allowing independent testing and adaptation.
- The project is still flagged as preliminary; the paper and checkpoints may change, which affects reproducibility and validation.
Key facts
- TurboDiffusion reports 100–200× acceleration for end-to-end diffusion generation on a single RTX 5090.
- Primary techniques named are SageAttention, SLA (Sparse-Linear Attention) and rCM (timestep distillation).
- Benchmarks: Wan-2.2-I2V-A14B-720P E2E time: original 4,549s → TurboDiffusion 38s.
- Benchmarks: Wan-2.1-T2V-1.3B-480P E2E time: original 184s → TurboDiffusion 1.9s. Wan-2.1-T2V-14B-720P: 4,767s → 24s.
- Repository provides several TurboWan checkpoints (480p and 720p variants) on Hugging Face and inference scripts.
- Install notes: Python >= 3.9 and torch >= 2.7.0; torch==2.8.0 is recommended to avoid OOM in some cases.
- Quantized checkpoints (suffix -quant) are provided for GPUs with ~40GB or less (e.g., RTX 5090); unquantized checkpoints are recommended for GPUs with >40GB memory (e.g., H100).
- Inference scripts expose options such as --num_steps (1–4), --attention_type (original, sla, sagesla) and --sla_topk (default 0.1; the repo recommends 0.15 for better quality); a flag-parsing sketch follows this list.
- E2E Time in evaluations excludes text encoding and VAE decoding.
- The repository states checkpoints and the paper are not finalized and will be updated to improve quality.
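For a concrete feel of how those script options fit together, here is an illustrative flag parser. The flag names, choices, and the sla_topk default come from the repository documentation as summarized above; defaults the source does not state are marked as guesses, and the real scripts define more options than this.

```python
# Illustrative parser for the documented TurboDiffusion inference flags.
# This is NOT the repo's actual script; unstated defaults are guesses.
import argparse

parser = argparse.ArgumentParser(description="TurboDiffusion inference flags (sketch)")
parser.add_argument("--num_steps", type=int, default=4, choices=range(1, 5),
                    help="sampling steps after rCM distillation (default is a guess)")
parser.add_argument("--attention_type", default="sagesla",
                    choices=["original", "sla", "sagesla"],
                    help="dense attention, SLA, or SageAttention+SLA (default is a guess)")
parser.add_argument("--sla_topk", type=float, default=0.1,
                    help="fraction of attention kept; repo suggests 0.15 for better quality")

args = parser.parse_args(["--num_steps", "4", "--attention_type", "sagesla"])
print(args)  # Namespace(attention_type='sagesla', num_steps=4, sla_topk=0.1)
```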
What to watch next
- Finalized paper and updated checkpoints from the TurboDiffusion authors (not finalized in the source).
- Independent evaluations of visual quality and fidelity compared with baseline models (not confirmed in the source).
- Community tests on a wider range of GPUs and longer video lengths to verify the reported speedups (not confirmed in the source).
Quick glossary
- SageAttention: An attention kernel implementation the project uses to accelerate attention computation; the repository pairs it with Sparse-Linear Attention for additional speed.
- Sparse-Linear Attention (SLA): An attention technique that reduces computation by sparsifying or approximating the dense attention matrix, lowering runtime and memory cost (see the generic sketch after this glossary).
- rCM (timestep distillation): A method described in the repository for distilling or reducing the number of timesteps needed during diffusion sampling, intended to speed generation.
- Quantized checkpoint: A model checkpoint in which weights or linear layers are quantized to reduce memory footprint and increase inference speed on limited-memory GPUs.
- End-to-end (E2E) time: Latency measurement reported by the project for diffusion generation excluding certain preprocessing and postprocessing steps such as text encoding and VAE decoding.
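To make the SLA entry above concrete, the sketch below is a generic top-k sparse-attention routine in PyTorch. It is not the repository's SLA kernel; it only illustrates the idea of keeping a small fraction of attention scores per query, the role the sla_topk option plays.

```python
# Generic top-k sparse attention sketch (NOT TurboDiffusion's SLA kernel):
# keep only the largest `topk` fraction of scores per query row and mask the
# rest. A dense-then-mask version like this saves no time by itself; real
# sparse kernels avoid materializing the masked-out entries at all.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, topk=0.1):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale       # (B, H, L, L)
    keep = max(1, int(topk * scores.shape[-1]))                 # entries kept per query
    thresh = scores.topk(keep, dim=-1).values[..., -1:]         # k-th largest per row
    scores = scores.masked_fill(scores < thresh, float("-inf"))
    return torch.matmul(F.softmax(scores, dim=-1), v)

q = k = v = torch.randn(1, 8, 256, 64)
out = topk_sparse_attention(q, k, v, topk=0.15)  # 0.15 mirrors the suggested sla_topk
print(out.shape)  # torch.Size([1, 8, 256, 64])
```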
Reader FAQ
Are the code and checkpoints available?
Yes. The TurboDiffusion repository is on GitHub, and the checkpoints are linked on Hugging Face in the project documentation.
How large are the reported speedups?
The repository reports roughly 100–200× acceleration on a single RTX 5090 GPU for end-to-end diffusion generation.
Do I need a specific GPU or setup?
The project provides quantized checkpoints for GPUs with about 40GB or less (e.g., RTX 5090) and unquantized checkpoints for >40GB GPUs (e.g., H100). Python >=3.9 and torch>=2.7.0 are required; torch==2.8.0 is recommended.
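As a minimal sketch of applying that rule of thumb in code (the checkpoint names here are placeholders, not the actual Hugging Face paths):

```python
# Pick quantized vs. unquantized checkpoints from total GPU memory, following
# the repo's rule of thumb (~40GB or less -> quantized, i.e. the -quant suffix).
# Checkpoint names below are placeholders, not real Hugging Face repo ids.
import torch

assert torch.cuda.is_available(), "the reported benchmarks assume a CUDA GPU"
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3

suffix = "-quant" if total_gb <= 40 else ""
checkpoint = f"TurboWan-T2V-480P{suffix}"  # placeholder name
print(f"{total_gb:.0f} GB detected -> {checkpoint}")
```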
Are the paper and checkpoints final?
The repository notes that the checkpoints and paper are not finalized and will be updated to improve quality.
Sources
- TurboDiffusion: 100–200× Acceleration for Video Diffusion Models
- Accelerating Video Diffusion Models by 100-200 Times
- TurboDiffusion: Accelerating Video Diffusion Models by …