TL;DR
Modal has added GPU memory snapshotting to its existing checkpoint/restore system, using the CUDA checkpoint API to capture GPU state along with CPU memory. Early tests show cold start times reduced by up to 10x for some GPU-heavy functions; the feature is available in alpha behind a new flag.
What happened
Modal extended its memory-snapshot technology to include GPU-resident state by using the CUDA checkpoint/restore API introduced in NVIDIA driver branches 570 and 575. Previously, Modal’s snapshots captured only CPU memory, forcing workloads to recreate CUDA contexts, move weights back to device memory, and re-run GPU-dependent warmups after every restore. The new flow locks active CUDA processes, copies GPU memory and CUDA objects into host memory, releases GPU resources and ends CUDA sessions, then includes that captured state in the snapshot. On restore, Modal reverses these steps to reinstantiate CUDA state alongside the CPU memory image. The implementation integrates with Modal’s existing gVisor checkpoint/restore infrastructure, enumerates and monitors active CUDA sessions to ensure consistency, and adds retry logic to handle transient errors. Modal reports tests across several models showing cold-start reductions of up to 10x; the capability is offered in alpha and enabled via an experimental flag.
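The lock → checkpoint-to-host → release → restore → unlock sequence described above can be sketched as a small state machine. The following is an illustrative Python sketch, not Modal’s implementation: `FakeCudaProcess` and `checkpoint_with_retry` are hypothetical names standing in for the driver-level checkpoint entry points (exposed via NVIDIA’s `cuda-checkpoint` utility and `cuCheckpointProcess*` driver calls), and the retry loop mirrors the transient-error handling the post mentions.

```python
import time
from enum import Enum, auto


class CudaProcessState(Enum):
    # Mirrors the driver's process states; the article notes Modal polls
    # for CU_PROCESS_STATE_CHECKPOINTED to confirm a checkpoint completed.
    RUNNING = auto()
    LOCKED = auto()
    CHECKPOINTED = auto()


class FakeCudaProcess:
    """Stand-in for a CUDA process managed through the checkpoint API."""

    def __init__(self):
        self.state = CudaProcessState.RUNNING
        self.device_memory = {"weights": b"\x00" * 16}  # toy GPU contents
        self.host_copy = None

    def lock(self):
        # Quiesce active CUDA work before checkpointing.
        self.state = CudaProcessState.LOCKED

    def checkpoint(self):
        # Copy GPU memory and CUDA objects into host memory,
        # then release GPU resources and end the CUDA session.
        assert self.state is CudaProcessState.LOCKED
        self.host_copy = dict(self.device_memory)
        self.device_memory = None  # GPU resources released
        self.state = CudaProcessState.CHECKPOINTED

    def restore(self):
        # Reverse the checkpoint: rebuild GPU state from the host copy.
        assert self.state is CudaProcessState.CHECKPOINTED
        self.device_memory = dict(self.host_copy)
        self.state = CudaProcessState.LOCKED

    def unlock(self):
        self.state = CudaProcessState.RUNNING


def checkpoint_with_retry(proc, attempts=3, backoff_s=0.0):
    """Lock -> checkpoint -> verify state, retrying on transient failures."""
    for attempt in range(1, attempts + 1):
        try:
            proc.lock()
            proc.checkpoint()
            if proc.state is CudaProcessState.CHECKPOINTED:
                return True
        except Exception:
            time.sleep(backoff_s * attempt)
    return False
```

In this sketch, a full program snapshot would be taken only after `checkpoint_with_retry` confirms the checkpointed state, and restore runs `restore()` then `unlock()` to resume GPU execution.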
Why it matters
- Serverless GPU functions can scale to zero without long cold starts, improving responsiveness for on-demand inference.
- Restoring GPU state directly avoids costly host-to-device transfers and recompilation steps such as torch.compile, saving seconds to minutes per cold start in tested workloads.
- Capturing CUDA kernels, streams and contexts preserves expensive GPU-side initializations and compiled artifacts across restarts.
- Integration with an existing snapshot system means users who already use CPU snapshots can adopt GPU snapshots with minimal code changes.
Key facts
- GPU snapshotting leverages NVIDIA’s CUDA checkpoint/restore API available on driver branches 570 and 575.
- The snapshot sequence locks CUDA processes, checkpoints GPU memory and CUDA objects to host memory, releases GPU resources, and terminates CUDA sessions before completing a full program snapshot.
- During restore, Modal uses the CUDA restore and unlock calls to rebuild GPU state in the container.
- Modal integrates GPU snapshots with its gVisor-based checkpoint/restore runtime and its distributed file-cache system.
- The system monitors CUDA process states (e.g., CU_PROCESS_STATE_CHECKPOINTED) and implements retry logic to detect and handle checkpoint failures.
- Modal reports up to 10x faster cold starts in tests: example reductions include Parakeet transcription from about 20s to ~2s, a ViT inference case from 8.5s to 2.25s, and vLLM Qwen2.5 from 45s to 5s.
- To opt in, users enable memory snapshots for a Modal app and pass experimental_options={"enable_gpu_snapshot": True}.
- GPU memory snapshots are currently available in alpha at Modal; the team notes they are still exploring limitations.
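Based on the opt-in flag above, enabling GPU snapshots looks roughly like the following. This is a hedged sketch of a Modal app configuration, not copied from the source: the app name, GPU type, and function body are placeholders, and the exact decorator parameters should be checked against Modal’s documentation.

```python
import modal

app = modal.App("gpu-snapshot-demo")  # hypothetical app name

@app.function(
    gpu="A10G",                       # any GPU type your workload needs
    enable_memory_snapshot=True,      # existing CPU memory snapshots
    experimental_options={"enable_gpu_snapshot": True},  # alpha GPU snapshots
)
def infer(prompt: str) -> str:
    # Model loading and warmup (e.g. torch.compile) run once here,
    # then are captured in the snapshot and restored on later cold starts.
    ...
```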
What to watch next
- General availability timeline and broader rollout plans: not confirmed in the source
- Support matrix for specific GPU models and driver versions beyond the 570/575 branches: not confirmed in the source
- Limitations and community feedback uncovered during the alpha period (Modal is actively exploring these)
Quick glossary
- CUDA checkpoint/restore API: An NVIDIA driver-level interface that lets software capture and later restore GPU state, including device memory, kernels, and CUDA contexts.
- GPU memory (vRAM): Dedicated memory on a graphics processing unit used to hold model weights, activations, and other data for GPU-accelerated computations.
- torch.compile: A PyTorch feature that compiles model code into optimized kernels; results can be hardware-dependent and costly to rebuild on each startup.
- Snapshot / checkpoint: A saved image of a program’s runtime state (memory, file descriptors, and relevant resources) that can be restored later to resume execution quickly.
- gVisor: A container runtime sandbox that can be extended to support checkpoint/restore workflows for isolating and managing application processes.
Reader FAQ
How much faster are cold starts with GPU memory snapshots?
Modal reports up to 10x faster cold starts in tested workloads, with specific examples like Parakeet reducing from ~20s to ~2s and vLLM from 45s to 5s.
How do I enable GPU snapshots in Modal?
Enable memory snapshots and set experimental_options={"enable_gpu_snapshot": True} in your Modal app configuration.
Is GPU snapshotting broadly available now?
The feature is available in alpha at Modal; wider availability or GA timing is not confirmed in the source.
Does this require specific NVIDIA drivers?
Modal’s implementation uses the CUDA checkpoint/restore API available in driver branches 570 and 575.
Which GPUs and hardware configurations are supported?
Not confirmed in the source.

Sources
- GPU memory snapshots: sub-second startup (2025)
- Serverless GPUs for AI Inference and Training – Beam Cloud
- The Complete Guide to Measuring and Fixing GPU …
- A Concurrent OS-level GPU Checkpoint and Restore …