TL;DR

Kyutai released Pocket TTS, a 100M-parameter text-to-speech model that can clone voices from about five seconds of audio and run faster than real time on recent laptop CPUs. The project is released as open source under an MIT license and was trained only on public English datasets totaling roughly 88k hours.

What happened

Kyutai introduced Pocket TTS, a compact text-to-speech system that aims to combine the flexibility of large voice models with the efficiency of lightweight TTS. The generative portion of the model totals 100 million parameters (a 90M transformer plus a 10M decoder), alongside an 18M encoder used to embed a reference voice. Rather than relying on discrete audio tokens, the model predicts sequences of continuous latents and uses a set of training and sampling techniques to preserve quality at small scale, including a Lagrangian Self-Distillation (LSD) loss for one-step sampling, a Head Batch Multiplier (N=8) to amortize transformer cost, and a Gaussian temperature heuristic. Kyutai evaluated Pocket TTS on LibriSpeech test-clean (reference audio denoised at 24 kHz with Adobe Enhance Speech), reported low word error rates as measured by Whisper-large-v3, and ran latency tests on an Intel Core Ultra 7 165H and an Apple M3 MacBook Air. The repo provides local CLI/server commands and a demo, and the code is available under an MIT license.
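
For intuition, here is a minimal sketch of what autoregressive generation over continuous latents with Gaussian temperature sampling can look like. All names, shapes, and the plain Gaussian sampling head are assumptions for illustration; Pocket TTS itself uses an LSD-distilled one-step sampler and other refinements not shown here.

```python
# Illustrative sketch only: not Kyutai's implementation or API.
import torch


@torch.no_grad()
def generate_latents(transformer, text_emb, voice_emb, steps=200, tau=0.7, latent_dim=32):
    """Autoregressively predict continuous latent frames, then decode them to audio.

    `transformer` is a hypothetical callable that maps (text embedding, voice
    embedding, past latents) to the mean of the next latent frame.
    """
    past = torch.zeros(1, 1, latent_dim)                 # start-of-audio frame
    frames = []
    for _ in range(steps):
        mean = transformer(text_emb, voice_emb, past)    # (1, 1, latent_dim)
        # Gaussian temperature heuristic: sample around the predicted mean;
        # tau < 1 shrinks the noise and trades diversity for stability.
        frame = mean + tau * torch.randn_like(mean)
        frames.append(frame)
        past = torch.cat([past, frame], dim=1)
    return torch.cat(frames, dim=1)                      # feed to the codec decoder for waveform
```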

Why it matters

  • Bridges the gap between large, GPU-bound TTS systems and small, inflexible models by delivering voice cloning in a model that runs on CPUs.
  • Enables local, real-time speech generation on laptops without requiring powerful GPUs, lowering hardware barriers for developers and researchers.
  • Open-source release and training on public datasets support reproducibility and independent evaluation.
  • Techniques for continuous-latent prediction and 1-step sampling could influence future compact audio-generation models.

Key facts

  • Total generative parameter count: 100M (90M causal transformer + 10M codec decoder); an 18M encoder is used to embed voice prompts.
  • Voice cloning works from roughly five seconds of reference audio and, according to the authors, reproduces the speaker's voice color, emotion, accent, cadence, and recording conditions.
  • Trained only on publicly available English datasets (AMI, EARNINGS22, GIGASpeech, SPGISpeech, TED-LIUM, VoxPopuli, LibriHeavy, Emilia) totaling about 88,000 hours.
  • Evaluated on LibriSpeech test-clean with reference audio denoised to 24 kHz via Adobe Enhance Speech; WER was measured with Whisper-large-v3, and human pairwise tests assessed audio quality and speaker similarity.
  • Compared to several baselines (F5-TTS, Kyutai TTS 1.6B, Chatterbox Turbo, Kokoro), Pocket TTS achieves competitive WER and audio-quality scores while remaining much smaller.
  • Pocket TTS and Kokoro were the only models in the authors' latency tests that ran faster than real time on the tested laptop CPUs.
  • Architecture departs from the common discrete-token pipeline by predicting continuous latents directly and distilling WavLM into the codec latent space.
  • Training and sampling optimizations include the Head Batch Multiplier (N=8), Gaussian temperature sampling (tau ≈ 0.7), the LSD loss for one-step sampling, and a form of classifier-free guidance applied to the transformer's outputs (a generic CFG sketch follows this list).
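
In the continuous setting, classifier-free guidance amounts to extrapolating from the unconditional transformer output toward the text-conditioned one. The sketch below shows the generic combination; the guidance scale is an arbitrary illustration, not the model's actual setting.

```python
# Generic CFG combination for continuous outputs (illustrative, not Kyutai's code).
import torch


def guided_output(cond_out: torch.Tensor, uncond_out: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    """Return uncond + scale * (cond - uncond).

    scale = 1.0 recovers the conditional prediction; larger values push the
    output further toward the text condition. The default 2.0 is arbitrary.
    """
    return uncond_out + scale * (cond_out - uncond_out)
```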

What to watch next

  • Release of v3 of the Continuous Audio Language Models paper with more implementation and evaluation details (announced as coming soon).
  • How Pocket TTS performs when trained or fine-tuned with additional private data and whether that changes quality or voice-cloning fidelity.

Quick glossary

  • Text-to-speech (TTS): Technology that converts written text into spoken audio using machine learning models and audio decoders.
  • Neural audio codec: A learned encoder–decoder that compresses audio into a compact representation (discrete tokens or continuous latents) and reconstructs waveform audio from that representation.
  • Continuous latents: Real-valued vector representations of audio used as an internal intermediate for generation, as opposed to discrete token sequences.
  • Word Error Rate (WER): A standard automatic metric for transcription accuracy; the number of word substitutions, insertions, and deletions divided by the number of words in the reference transcript (a toy implementation follows this glossary).
  • Classifier-Free Guidance (CFG): A method to steer conditional generative models by interpolating between conditional and unconditional model outputs during sampling.
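
As a concrete reference for the WER entry above, here is a toy word-level edit-distance implementation. It is generic and not tied to the paper's evaluation pipeline, which first transcribes the generated audio with Whisper-large-v3 before scoring.

```python
# Toy word error rate (WER) via Levenshtein distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


# Example: one substitution out of four reference words -> WER = 0.25
print(wer("the cat sat down", "the cat sat dawn"))
```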

Reader FAQ

Is Pocket TTS open source?
Yes. Kyutai released the code under an MIT license.

How much reference audio is needed to clone a voice?
The authors report the model can clone a voice from about five seconds of audio.

Can Pocket TTS run without a GPU?
Yes. The model is small enough to run faster than real time on the tested laptop CPUs (Intel Core Ultra 7 165H and Apple M3 MacBook Air).
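
For context, "faster than real time" is usually quantified with the real-time factor (RTF): synthesis time divided by the duration of the generated audio, with RTF < 1 meaning faster than real time. A minimal timing sketch follows; synthesize() and the 24 kHz sample rate are placeholders, not the actual Pocket TTS API.

```python
# Minimal RTF measurement sketch (placeholder API, not Pocket TTS's interface).
import time


def real_time_factor(synthesize, text: str, sample_rate: int = 24_000) -> float:
    """Return synthesis_time / audio_duration; values below 1.0 are faster than real time."""
    start = time.perf_counter()
    waveform = synthesize(text)           # assumed to return a 1-D sequence of samples
    elapsed = time.perf_counter() - start
    return elapsed / (len(waveform) / sample_rate)
```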

What datasets were used to train it?
It was trained exclusively on the public English datasets listed by the authors (AMI, EARNINGS22, GIGASpeech, SPGISpeech, TED-LIUM, VoxPopuli, LibriHeavy, Emilia), totaling about 88k hours.

Are detailed reproduction instructions provided?
The source includes CLI/server commands and points to the repository and demo; full technical details are slated for the upcoming v3 of the paper.
