TL;DR
SkyPilot is an open-source control plane that lets teams launch, manage, and scale AI workloads across Kubernetes, Slurm, and 20+ cloud providers. The project (v0.11 as of Dec 2025) adds multi-cloud pools, faster managed jobs, and features aimed at enterprise-scale deployments.
What happened
SkyPilot, an open-source project that originated at UC Berkeley's Sky Computing Lab, provides a single interface for running AI tasks on diverse infrastructure. A task definition, written in YAML or through the Python API, specifies resources, setup steps, and run commands; SkyPilot then finds and provisions GPUs, TPUs, or CPUs across Kubernetes clusters, Slurm clusters, and public clouds. The project recently published version 0.11, which introduced multi-cloud pools (managed pools of warm workers across clouds or clusters), faster managed jobs, and enterprise-focused improvements. Several features aim to reduce cost and increase reliability: spot instance handling with automated recovery, automatic cleanup of idle resources, and scheduling that prefers cheaper capacity. The repo includes runnable examples for training and serving (LLMs, RL, DeepSpeed, TorchTitan, vLLM, and others), and installation is via pip with extras for selected cloud integrations.
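To make the task definition concrete, here is a minimal sketch of what such a spec might contain. The field names (`resources`, `num_nodes`, `workdir`, `setup`, `run`) follow SkyPilot's task YAML; every value, including the task name, accelerator choice, and the `train.py` entry point, is a hypothetical placeholder. The spec is written out as JSON, which YAML 1.2 parsers accept; whether a given SkyPilot version loads it unchanged is an assumption.

```python
import json

# Field names mirror SkyPilot's task YAML; all values are illustrative
# placeholders and do not come from the source article.
task = {
    "name": "train-demo",            # hypothetical task name
    "resources": {
        "accelerators": "A100:8",    # accelerator type and count per node
        "use_spot": True,            # request spot capacity
    },
    "num_nodes": 2,                  # number of machines to provision
    "workdir": ".",                  # local directory synced to each node
    "setup": "pip install -r requirements.txt",  # runs once per node
    "run": "python train.py",        # hypothetical entry point
}


def to_task_file(spec: dict) -> str:
    """Serialize the spec as JSON text (also parseable as YAML 1.2)."""
    return json.dumps(spec, indent=2)


if __name__ == "__main__":
    print(to_task_file(task))
```

Keeping the spec as plain data is the point: the same definition can be handed to different backends without changing the task itself.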
Why it matters
- Avoids vendor lock-in by letting teams move workloads between Kubernetes, Slurm and many clouds using the same task definition.
- Gives infrastructure teams a unified control plane to coordinate heterogeneous compute (reserved GPUs, clusters, and cloud instances).
- Offers cost-saving mechanisms such as spot instance support, autostop for idle resources, and scheduling that targets cheaper available infra.
- Provides primitives (multi-cluster, gang scheduling, pools of warm workers) aimed at improving throughput for large or latency-sensitive AI workloads.
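The idea of preferring cheaper capacity and moving on when a target cannot be provisioned can be illustrated with a toy sketch. This is not SkyPilot's scheduler: the `Candidate` type, the prices, and the `fake_provision` function are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """One place a task could run. Values used below are made up."""
    infra: str          # e.g. "kubernetes", "aws", "gcp"
    hourly_cost: float  # hypothetical USD per hour


def provision_cheapest(
    candidates: Iterable[Candidate],
    try_provision: Callable[[Candidate], bool],
) -> Optional[Candidate]:
    """Try candidates from cheapest to priciest; return the first success."""
    for cand in sorted(candidates, key=lambda c: c.hourly_cost):
        if try_provision(cand):
            return cand
    return None  # nothing could be provisioned


# Hypothetical inventory: an already-paid-for cluster plus two clouds.
CANDIDATES = [
    Candidate("aws", 3.0),
    Candidate("kubernetes", 0.0),
    Candidate("gcp", 2.5),
]
OUT_OF_CAPACITY = {"kubernetes"}  # simulate a full reserved cluster


def fake_provision(cand: Candidate) -> bool:
    """Stand-in for a real provisioning call; fails when capacity is gone."""
    return cand.infra not in OUT_OF_CAPACITY


if __name__ == "__main__":
    print(provision_cheapest(CANDIDATES, fake_provision))
```

In this sketch the reserved cluster is tried first because it is cheapest, and the task falls through to the next cheapest option when that cluster is full.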
Key facts
- Latest noted release: SkyPilot v0.11 (Dec 2025), featuring Multi-Cloud Pools, Fast Managed Jobs, and enterprise-readiness improvements.
- Supports running on Kubernetes and Slurm as well as 20+ cloud or provider integrations, including AWS, GCP, Azure, OCI, CoreWeave, Nebius, Lambda Cloud, RunPod, Fluidstack, Cudo, DigitalOcean, Paperspace, Cloudflare, Samsung, IBM, Vast.ai, VMware vSphere, Seeweb, Prime Intellect, and Shadeform.
- SkyPilot tasks are defined in YAML or via a Python API and declare resources, node counts, a work directory, setup commands, and run commands that SkyPilot executes on provisioned machines.
- Installation is available through pip with optional extras for specific backends (example: skypilot[kubernetes,aws,gcp,azure,oci,nebius,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp,seeweb,shadeform]).
- Capabilities highlighted include auto-retry on provisioning failures, multi-cluster scheduling, gang scheduling for coordinated allocations, and automatic syncing of local code to remote workdirs.
- Cost and reliability features include autostop (cleanup of idle resources), spot instance support with preemption auto-recovery, and scheduling that prefers cheaper and more available infra.
- Repository is public and includes examples and documentation covering development, training, serving, and integrations with frameworks like PyTorch, DeepSpeed, JAX/TPU, and RL toolkits.
- Project provenance: originated at the Sky Computing Lab at UC Berkeley and has attracted numerous industry contributors; documentation links, blog posts and research papers are available from the project.
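Spot preemption recovery, mentioned in the list above, generally depends on the job saving progress so that a relaunched worker can resume instead of restarting. The simulation below is a sketch of that pattern under stated assumptions: the `Preempted` exception, the in-memory checkpoint, and the step counts are hypothetical and do not represent SkyPilot's implementation.

```python
from typing import Dict, Set, Tuple


class Preempted(Exception):
    """Raised by the simulated worker when its spot instance is reclaimed."""


def run_steps(start: int, total: int, pending: Set[int],
              checkpoint: Dict[str, int]) -> None:
    """Run steps start..total-1, recording progress after each step."""
    for step in range(start, total):
        if step in pending:
            pending.discard(step)  # each simulated preemption fires once
            raise Preempted(f"preempted before step {step}")
        checkpoint["next_step"] = step + 1  # stand-in for a durable checkpoint


def run_with_recovery(total: int, preempt_at: Set[int],
                      max_recoveries: int = 5) -> Tuple[int, int]:
    """Resume from the last checkpoint after each preemption.

    Returns (steps_completed, recoveries_used).
    """
    pending = set(preempt_at)  # copy so the caller's set is not mutated
    checkpoint = {"next_step": 0}
    recoveries = 0
    while checkpoint["next_step"] < total:
        try:
            run_steps(checkpoint["next_step"], total, pending, checkpoint)
        except Preempted:
            recoveries += 1
            if recoveries > max_recoveries:
                raise  # give up rather than loop forever
    return checkpoint["next_step"], recoveries


if __name__ == "__main__":
    print(run_with_recovery(total=10, preempt_at={3, 7}))
```

With preemptions simulated before steps 3 and 7, the job finishes all 10 steps after two recoveries, and no completed step is repeated.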
What to watch next
- Adoption of v0.11 Multi-Cloud Pools in production workflows and how they affect latency and cost for batch inference workloads.
- How enterprise-readiness changes (from v0.11) play out at scale across organizations operating heterogeneous clusters.
- Not confirmed in the source: whether any commercial support plans, SLAs, or hosted offerings built on SkyPilot are forthcoming.
Quick glossary
- Kubernetes: An open-source platform for automating deployment, scaling, and management of containerized applications.
- Slurm: A workload manager commonly used to schedule jobs and manage resources on HPC clusters.
- Spot instance: A cloud VM offered at lower cost with the risk of interruption when the provider reclaims capacity.
- Gang scheduling: A scheduling technique that allocates a group of tasks simultaneously so they can run in a coordinated fashion.
- Autostop: A policy that automatically shuts down or cleans up idle resources to reduce costs.
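The gang scheduling entry can be illustrated with an all-or-nothing allocator: either every node in the group gets its GPUs, or nothing is reserved. This is a generic sketch of the technique, not SkyPilot code; the function name and the node inventory are hypothetical.

```python
from typing import Dict, Optional


def gang_allocate(free_gpus: Dict[str, int], nodes_needed: int,
                  gpus_per_node: int) -> Optional[Dict[str, int]]:
    """Reserve gpus_per_node GPUs on nodes_needed nodes, or reserve nothing."""
    # Nodes that can individually satisfy the per-node requirement.
    eligible = [name for name, free in sorted(free_gpus.items())
                if free >= gpus_per_node]
    if len(eligible) < nodes_needed:
        return None  # all-or-nothing: never hand out a partial group
    chosen = eligible[:nodes_needed]
    for name in chosen:
        free_gpus[name] -= gpus_per_node  # commit only after the check passes
    return {name: gpus_per_node for name in chosen}


if __name__ == "__main__":
    inventory = {"node-a": 8, "node-b": 4, "node-c": 8}  # hypothetical cluster
    print(gang_allocate(inventory, nodes_needed=2, gpus_per_node=8))
    print(gang_allocate(inventory, nodes_needed=1, gpus_per_node=8))
```

The second request returns `None` and leaves the inventory untouched, which is the property that keeps a distributed job from starting with only some of its workers.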
Reader FAQ
What infrastructures does SkyPilot support?
SkyPilot supports Kubernetes, Slurm and integrations with 20+ cloud or provider backends such as AWS, GCP, Azure, OCI, CoreWeave, RunPod, Paperspace and others, as listed in the project documentation.
How do I install SkyPilot?
The project can be installed via pip; the repo provides package variants with extras to enable specific cloud or Kubernetes support.
Is SkyPilot open source?
Yes. The project is hosted publicly on GitHub, the repository shows an open-source license, and the project accepts contributions; both code and docs are available there.
Where can I get support or report issues?
The project directs users to open GitHub issues for bugs, use GitHub Discussions for questions, and join the SkyPilot Slack for general conversation.
Does SkyPilot provide commercial SLAs or managed hosting?
Not confirmed in the source.
Sources
- SkyPilot: One system to use and manage all AI compute (K8s, 20 clouds, Slurm)
- SkyPilot: Run AI on Any Infrastructure — SkyPilot Docs
- Democratizing AI Compute with AMD Using SkyPilot
- Welcome to SkyPilot!