TL;DR

Sopro is a lightweight English text-to-speech model built on dilated convolutions and cross-attention rather than Transformers. It supports streaming synthesis and zero-shot voice cloning from 3–12 seconds of reference audio, and runs on CPU with a reported real-time factor of 0.25 on a base Apple M3.

What happened

A developer published Sopro, a compact English TTS model built around dilated convolutional layers (WaveNet-style) and lightweight cross-attention rather than the Transformer architectures commonly used today. The released package (PyPI and GitHub) offers both non-streaming and streaming synthesis, a simple CLI and Python API, and an interactive web demo deployable via uvicorn or Docker. The author reports a 169-million-parameter model trained on a single L40S GPU, capable of zero-shot voice cloning from 3–12 seconds of reference audio. Performance on a base Apple M3 CPU was measured at roughly 0.25 RTF (30 seconds of audio generated in 7.5 seconds). The repo documents usage notes, tuning parameters for style and stopping behavior, and caveats: output inconsistency, sensitivity to microphone quality and background noise, a generation cap around 32 seconds, and quality trade-offs between streaming and non-streaming output.
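
For orientation, here is a minimal usage sketch. The class and method names (`Sopro`, `synthesize`, `stream`) are hypothetical stand-ins, since the source describes a CLI and Python API but does not quote it; check the project's README for the real calls.

```python
# Hypothetical usage sketch -- `Sopro`, `synthesize`, and `stream` are
# assumed names, not confirmed API; consult the package README.
from sopro import Sopro  # assumed entry point after `pip install sopro`

tts = Sopro()  # load the ~169M-parameter model

# Non-streaming: render the full waveform in one call, cloning the voice
# from a 3-12 second reference clip.
audio = tts.synthesize(
    "Hello from a convolutional TTS model.",
    reference_audio="speaker.wav",
)

# Streaming: consume chunks as they are generated instead of waiting for
# the whole utterance.
chunks = []
for chunk in tts.stream("This sentence plays back incrementally."):
    chunks.append(chunk)  # in practice, feed each chunk to an audio sink
```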

Why it matters

  • Enables voice cloning and streaming TTS on commodity CPUs, widening access for lightweight deployments.
  • Offers an alternative architecture (dilated convs + cross-attention) to Transformer-based TTS systems, which may afford efficiency gains.
  • Small model size and PyPI/GitHub availability lower the barrier for experimentation and local use.
  • Practical caveats are documented, helping users understand quality limits and tuning options.

Key facts

  • Model size: 169 million parameters.
  • Architecture: dilated convolutions (WaveNet-like) with lightweight cross-attention layers, not a Transformer.
  • Features: streaming synthesis and zero-shot voice cloning.
  • CPU performance: reported 0.25 real-time factor on a base Apple M3 (30 s of audio generated in ~7.5 s).
  • Reference audio for cloning: between 3 and 12 seconds.
  • Training: the author trained the model on a single L40S GPU and used pre-tokenized data; raw audio was discarded.
  • Generation is currently capped at ~32 seconds (400 frames) to avoid hallucinations; see the arithmetic sketch after this list.
  • Training data sources listed: Emilia YODAS, LibriTTS-R, Mozilla Common Voice 22, MLS.
  • Distribution: available on PyPI (pip install sopro) and as a GitHub repository with demo and Docker options.
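
The reported numbers are internally consistent; the frame rate below is an inference from the stated 400-frame / ~32-second cap, not a figure the author gives directly.

```python
# Sanity-check the reported figures (pure arithmetic, no model required).
audio_seconds = 30.0
generation_seconds = 7.5
print(generation_seconds / audio_seconds)  # 0.25 RTF -> 4x faster than real time

# The ~32 s cap at 400 frames implies the model's frame rate (inferred):
frames, cap_seconds = 400, 32.0
print(frames / cap_seconds)  # 12.5 frames per second, i.e. 80 ms per frame
```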

What to watch next

  • Publication of the training code (author says it will be released once organized).
  • Efforts to support additional languages and dataset improvements to raise voice similarity.
  • Potential optimizations mentioned by the author, such as caching convolutional states to improve speed (sketched after this list).
  • Performance variations depending on PyTorch version (the author notes torch==2.6.0 on the M3 ran roughly 3× faster in their tests).
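
Caching convolutional state is a standard trick for streaming WaveNet-style stacks: a causal conv with kernel size k and dilation d only needs the last (k−1)·d input frames to emit the next output frame. Below is a minimal sketch of the idea; it illustrates the technique, not Sopro's implementation.

```python
# Sketch of caching a dilated conv's input history for streaming.
# Illustrative only -- not Sopro's code.
import torch
import torch.nn as nn


class CachedDilatedConv(nn.Module):
    """Causal Conv1d that emits one output frame per step using a cache."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        # A causal output needs (kernel_size - 1) * dilation past frames.
        self.history = (kernel_size - 1) * dilation
        self.cache = None

    def step(self, frame):
        # frame: (batch, channels, 1) -- one new input frame.
        if self.cache is None:  # prime the cache with zeros ("silence")
            self.cache = frame.new_zeros(frame.shape[0], frame.shape[1], self.history)
        window = torch.cat([self.cache, frame], dim=-1)  # (B, C, history + 1)
        self.cache = window[..., 1:]                     # slide: drop oldest frame
        return self.conv(window)                         # (B, C, 1)


layer = CachedDilatedConv(channels=8)
for _ in range(5):                       # stream five frames, one at a time
    out = layer.step(torch.randn(2, 8, 1))
print(out.shape)  # torch.Size([2, 8, 1])
```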

Quick glossary

  • Zero-shot voice cloning: Generating speech that mimics a speaker after hearing a short reference clip, without explicit per-speaker fine-tuning.
  • Real-time factor (RTF): The ratio of generation time to audio duration; an RTF of 0.25 means 30 seconds of audio can be produced in 7.5 seconds.
  • Dilated convolutions: Convolutional operations whose kernel taps are spaced apart, enlarging the receptive field without adding parameters; used in models like WaveNet for audio generation (see the sketch after this glossary).
  • Cross-attention: A mechanism that lets one sequence attend to another (for example, conditioning audio generation on text or speaker embeddings).
  • Streaming synthesis: Producing audio output incrementally as text is processed, rather than generating the full waveform only after synthesis completes.
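
To make the first two terms concrete, here is a small PyTorch sketch combining a dilated-conv stack with cross-attention over text embeddings. It mirrors the general shape described above but is a toy under assumed sizes, not Sopro's architecture.

```python
# Toy dilated-conv + cross-attention block -- illustrative, not Sopro's code.
import torch
import torch.nn as nn


class DilatedConvStack(nn.Module):
    """1D convs whose dilation doubles each layer (WaveNet-style).

    Four layers with kernel 3 and dilations 1, 2, 4, 8 give a receptive
    field of 1 + 2 * (1 + 2 + 4 + 8) = 31 frames with no extra parameters.
    """

    def __init__(self, channels: int, layers: int = 4, kernel_size: int = 3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(
                channels, channels, kernel_size,
                dilation=2 ** i,
                padding=(kernel_size - 1) * 2 ** i // 2,  # keep length fixed
            )
            for i in range(layers)
        )

    def forward(self, x):  # x: (batch, channels, frames)
        for conv in self.convs:
            x = torch.relu(conv(x)) + x  # residual connection
        return x


class ConditionedBlock(nn.Module):
    """Dilated convs over audio frames, cross-attending to text embeddings."""

    def __init__(self, channels: int = 64, heads: int = 4):
        super().__init__()
        self.convs = DilatedConvStack(channels)
        self.cross_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, frames, text):
        # frames: (B, T_audio, C); text: (B, T_text, C)
        x = self.convs(frames.transpose(1, 2)).transpose(1, 2)
        # Each audio frame queries the text sequence (the conditioning signal).
        attended, _ = self.cross_attn(query=x, key=text, value=text)
        return x + attended


frames = torch.randn(1, 100, 64)  # 100 audio frames, 64 channels
text = torch.randn(1, 20, 64)     # 20 text-token embeddings
print(ConditionedBlock()(frames, text).shape)  # torch.Size([1, 100, 64])
```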

Reader FAQ

Is Sopro state-of-the-art?
No; the author states that Sopro is not SOTA for most voices and situations.

Can Sopro run on a CPU?
Yes; the repository reports a 0.25 real-time factor on a base Apple M3 CPU in the author's measurements.

How much reference audio is needed for voice cloning?
The model requires about 3–12 seconds of reference audio for zero-shot cloning.

Is the training code available?
Not yet; the author says they will publish the training code once they have time to organize it.

Does Sopro support languages other than English?
Sopro is described as an English-only model; the author would like to add more languages, but broader support is not available yet.
