SkyPilot: Unified system to run and manage AI compute across clouds

TL;DR

SkyPilot is an open-source control plane that lets teams launch, manage, and scale AI workloads across Kubernetes, Slurm, and 20+ cloud providers. The project (v0.11 as of Dec 2025) adds multi-cloud pools, faster managed jobs, and features aimed at enterprise-scale deployments.

What happened

SkyPilot, an open-source project originating at UC Berkeley's Sky Computing Lab, provides a single interface for running AI tasks on diverse infrastructure. It exposes a task definition (YAML or Python API) that specifies resources, setup steps, and run commands, then finds and provisions GPUs, TPUs, or CPUs across Kubernetes clusters, Slurm clusters, or public clouds. The project recently published version 0.11, which introduced multi-cloud pools (managed pools of warm workers across clouds or clusters), faster managed jobs, and enterprise-focused improvements. SkyPilot also includes features intended to reduce cost and increase reliability: spot instance handling with automated recovery, automatic cleanup of idle resources, and scheduling that prefers cheaper capacity. The repo includes runnable examples for training and serving (LLMs, RL, DeepSpeed, TorchTitan, vLLM, and others) and supports installation via pip with extras for selected cloud integrations.
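A task definition of the kind described above might look like the following minimal sketch. It builds the task as a plain Python dictionary whose keys mirror the fields the project describes (resources, node count, work directory, setup, run). The task name, accelerator string, paths, and commands are illustrative assumptions, not values from the source. The optional `launch` helper assumes the `skypilot` package is installed and relies on its `sky.Task`, `sky.Resources`, and `sky.launch` calls; return values and behavior may differ between releases, so check the project documentation for your version.

```python
from typing import Any, Dict


def build_task_config() -> Dict[str, Any]:
    """Return a task definition as a dict that mirrors the YAML layout."""
    return {
        "name": "train-demo",  # hypothetical task name
        "resources": {"accelerators": "A100:8", "use_spot": True},  # example values
        "num_nodes": 1,
        "workdir": ".",  # local directory synced to the remote machines
        "setup": "pip install -r requirements.txt",
        "run": "python train.py",  # hypothetical entry point
    }


def launch(config: Dict[str, Any], cluster_name: str = "demo-cluster"):
    """Launch the task through SkyPilot's Python API (needs `pip install skypilot`)."""
    import sky  # imported lazily so this sketch loads without SkyPilot installed

    task = sky.Task(
        name=config["name"],
        setup=config["setup"],
        run=config["run"],
        workdir=config["workdir"],
        num_nodes=config["num_nodes"],
    )
    task.set_resources(sky.Resources(**config["resources"]))
    return sky.launch(task, cluster_name=cluster_name)
```

Calling `build_task_config()` has no side effects; only `launch(...)` contacts real infrastructure, and it would need cloud or cluster credentials already configured.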

Why it matters

Avoids vendor lock-in by letting teams move workloads between Kubernetes, Slurm and many clouds using the same task definition.
Gives infrastructure teams a unified control plane to coordinate heterogeneous compute (reserved GPUs, clusters, and cloud instances).
Offers cost-saving mechanisms such as spot instance support, autostop for idle resources, and scheduling that targets cheaper available infra.
Provides primitives (multi-cluster, gang scheduling, pools of warm workers) aimed at improving throughput for large or latency-sensitive AI workloads.

Key facts

Latest noted release: SkyPilot v0.11 (Dec 2025), featuring Multi-Cloud Pools, Fast Managed Jobs, and enterprise-readiness improvements.
Supports running on Kubernetes and Slurm as well as 20+ cloud or provider integrations, including AWS, GCP, Azure, OCI, CoreWeave, Nebius, Lambda Cloud, RunPod, Fluidstack, Cudo, DigitalOcean, Paperspace, Cloudflare, Samsung, IBM, Vast.ai, VMware vSphere, Seeweb, Prime Intellect, and Shadeform.
SkyPilot tasks are defined in YAML or via a Python API and declare resources, node counts, a work directory, setup commands, and run commands that SkyPilot executes on provisioned machines.
Installation is available through pip with optional extras for specific backends (example: skypilot[kubernetes,aws,gcp,azure,oci,nebius,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp,seeweb,shadeform]).
Capabilities highlighted include auto-retry on provisioning failures, multi-cluster scheduling, gang scheduling for coordinated allocations, and automatic syncing of local code to remote workdirs.
Cost and reliability features include autostop (cleanup of idle resources), spot instance support with preemption auto-recovery, and scheduling that prefers cheaper and more available infra.
Repository is public and includes examples and documentation covering development, training, serving, and integrations with frameworks like PyTorch, DeepSpeed, JAX/TPU, and RL toolkits.
Project provenance: originated at the Sky Computing Lab at UC Berkeley and has attracted numerous industry contributors; documentation links, blog posts and research papers are available from the project.
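The ideas of preferring cheaper infrastructure and retrying when provisioning fails can be illustrated with a toy sketch. This is not SkyPilot's actual optimizer or API: the `Candidate` type, the `provision_cheapest` function, and every price below are hypothetical, chosen only to show the cheapest-first, fall-through-on-failure pattern.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Candidate:
    infra: str           # e.g. "kubernetes", "aws", "gcp"
    hourly_price: float  # made-up number, not a real quote


def provision_cheapest(
    candidates: List[Candidate],
    try_provision: Callable[[Candidate], bool],
) -> Optional[Candidate]:
    """Try candidates from cheapest to priciest; return the first that succeeds."""
    for candidate in sorted(candidates, key=lambda c: c.hourly_price):
        if try_provision(candidate):
            return candidate
    return None  # every candidate failed to provision


# Illustrative options only; an existing cluster is modeled as zero marginal cost.
options = [
    Candidate("aws", 4.10),
    Candidate("kubernetes", 0.0),
    Candidate("gcp", 3.70),
]

# Pretend the Kubernetes cluster is full, so provisioning fails there.
chosen = provision_cheapest(options, lambda c: c.infra != "kubernetes")
print(chosen)
```

In this run the Kubernetes option is tried first and fails, so the sketch falls through to the next-cheapest candidate. A real system would also weigh availability, quotas, and data locality.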

What to watch next

Adoption of v0.11 Multi-Cloud Pools in production workflows and how they affect latency and cost for batch inference workloads.
How enterprise-readiness changes (from v0.11) play out at scale across organizations operating heterogeneous clusters.
Not confirmed in the source: whether any commercial support plans, SLAs, or hosted offerings built on SkyPilot are forthcoming.

Quick glossary

Kubernetes: An open-source platform for automating deployment, scaling, and management of containerized applications.
Slurm: A workload manager commonly used to schedule jobs and manage resources on HPC clusters.
Spot instance: A cloud VM offered at lower cost with the risk of interruption when the provider reclaims capacity.
Gang scheduling: A scheduling technique that allocates a group of tasks simultaneously so they can run in a coordinated fashion.
Autostop: A policy that automatically shuts down or cleans up idle resources to reduce costs.
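The gang scheduling entry above can be made concrete with a small all-or-nothing allocation sketch. It is a generic illustration of the technique, not SkyPilot's implementation; the `gang_allocate` function and the node names are hypothetical.

```python
from typing import Dict, Optional


def gang_allocate(
    free_gpus: Dict[str, int],
    nodes_needed: int,
    gpus_per_node: int,
) -> Optional[Dict[str, int]]:
    """Reserve gpus_per_node on nodes_needed nodes at once, or reserve nothing."""
    eligible = [node for node, free in free_gpus.items() if free >= gpus_per_node]
    if len(eligible) < nodes_needed:
        return None  # not enough capacity: leave free_gpus untouched
    chosen = sorted(eligible)[:nodes_needed]
    for node in chosen:
        free_gpus[node] -= gpus_per_node
    return {node: gpus_per_node for node in chosen}


cluster = {"node-a": 8, "node-b": 8, "node-c": 4}
first = gang_allocate(cluster, nodes_needed=2, gpus_per_node=8)
second = gang_allocate(cluster, nodes_needed=2, gpus_per_node=4)
print(first, second, cluster)
```

The first request gets both nodes together. The second finds only one node with enough free GPUs, so it receives nothing rather than a partial allocation that would leave a distributed job stalled.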

Reader FAQ

What infrastructures does SkyPilot support?
SkyPilot supports Kubernetes, Slurm and integrations with 20+ cloud or provider backends such as AWS, GCP, Azure, OCI, CoreWeave, RunPod, Paperspace and others, as listed in the project documentation.

How do I install SkyPilot?
The project can be installed via pip; the repo provides package variants with extras to enable specific cloud or Kubernetes support.

Is SkyPilot open source?
Yes. The project is hosted publicly (the repository shows an open-source license) and accepts contributions; the code and docs are available on the project's GitHub.

Where can I get support or report issues?
The project directs users to open GitHub issues for bugs, use GitHub Discussions for questions, and join the SkyPilot Slack for general conversation.

Does SkyPilot provide commercial SLAs or managed hosting?
Not confirmed in the source.
