Voice Cloning with Offline TTS: Kokoro, Kitten, and Piper Compared
Voice cloning — the ability to create a synthetic voice that sounds like a specific person — used to require cloud services, massive GPU clusters, and privacy compromises. In 2026, that’s no longer true. Three offline TTS engines now offer voice cloning capabilities, each with a fundamentally different approach.
This article breaks down exactly how Kokoro TTS, Kitten TTS, and Piper TTS handle voice cloning, what the trade-offs are, and which one you should choose for your use case.
The Starting Point: None of Them Clone Voices Out of the Box
Here’s the surprise: Kokoro, Kitten, and Piper all ship without built-in voice cloning. Each engine provides a set of pre-trained voices, and that’s it. If you want a custom voice — your own voice, a client’s voice, a character voice — you need to go beyond the default setup.
The difference is in how far you have to go and what tools are available to help you get there.
Kokoro TTS + KokoClone: Zero-Shot Cloning in Seconds
How It Works
Kokoro TTS ships with 54 curated voices across 9 languages, powered by the StyleTTS 2 architecture. Its voice system is based on speaker embeddings — compact numerical representations that capture the acoustic characteristics of each voice. The engine already uses these embeddings internally to switch between its built-in voices.
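For reference, switching between those built-in voices with the upstream kokoro Python package looks roughly like the sketch below; it is a minimal example and parameter names can shift between releases.

# Minimal sketch with the upstream kokoro package; API details may vary by version.
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code="a")  # "a" selects American English
# Each built-in voice name (e.g. af_heart) maps to a stored speaker embedding
generator = pipeline("Hello from a built-in voice.", voice="af_heart")
for i, (graphemes, phonemes, audio) in enumerate(generator):
    sf.write(f"builtin_{i}.wav", audio, 24000)  # Kokoro synthesizes 24 kHz audio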
KokoClone extends this architecture with a speaker encoder based on ECAPA-TDNN, a neural network originally developed for speaker verification. Think of it like a fingerprint scanner for voices: you feed it a short audio sample, and it extracts a mathematical representation of that voice’s unique characteristics. This representation is then plugged directly into Kokoro’s existing decoder.
The key insight: because Kokoro already uses speaker embeddings to define its voices, replacing one embedding with another requires zero retraining. It’s like swapping a key in a lock — the mechanism stays the same, only the key changes.
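To make the fingerprint analogy concrete, here is a hedged sketch of pulling a speaker embedding out of a reference clip with SpeechBrain's pretrained ECAPA-TDNN model. KokoClone ships its own encoder, so treat this as an illustration of the idea rather than its actual internals.

# Illustration only: extract an ECAPA-TDNN speaker embedding with SpeechBrain.
# KokoClone bundles an equivalent encoder; its internal API is not shown here.
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier  # speechbrain.pretrained in older releases

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
signal, sample_rate = torchaudio.load("my_voice.wav")  # a few seconds of clean 16 kHz speech
embedding = encoder.encode_batch(signal)               # fixed-length vector describing the voice
print(embedding.shape)                                 # roughly (1, 1, 192) for this model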
What You Need
- Reference audio: 3–10 seconds of clean speech
- Hardware: CPU works. GPU is faster but not required
- Training time: None. It’s real-time inference
The Code
from kokoclone import KokoClone

clone = KokoClone(device="cpu")              # zero-shot cloning works on CPU
audio = clone.text_to_speech(
    text="Hello, this is my cloned voice.",
    ref_wav="my_voice.wav",                  # 3–10 seconds of clean reference audio
    language="en"
)
That’s it. No data preparation, no training loop, no GPU requirement. You provide a short audio clip and get synthesized speech back immediately.
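To keep the result, write it to disk. Assuming the call returns a NumPy waveform at Kokoro's native 24 kHz sample rate (check the KokoClone docs for the exact return type), soundfile handles it:

# Assumes `audio` is a NumPy array sampled at 24 kHz (Kokoro's native rate).
import soundfile as sf
sf.write("cloned_output.wav", audio, 24000)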
Strengths and Limitations
Strengths:
- Instant results. No training phase at all
- Runs on CPU with reasonable latency (~150ms per 10 seconds of text)
- Supports multiple languages (English, Chinese, French, Japanese, and more)
- Model footprint is tiny (~84MB total)
Limitations:
- Cloning quality depends heavily on the reference audio quality. Background noise or echo degrades results noticeably
- The cloned voice may not capture very subtle speech mannerisms — think of it as a close approximation rather than a perfect replica
- KokoClone is a community project, not an official Kokoro feature, so updates and support vary
When to Choose KokoClone
Rapid prototyping, personal assistants, IoT devices, or any scenario where you need a custom voice fast and don’t have access to a GPU. It’s the “good enough, right now” option.
Kitten TTS: Cloning Through Fine-Tuning
How It Works
Kitten TTS is built on a lightweight VITS architecture — the entire model is just 15–80MB with 15 million parameters. It’s designed for environments where every megabyte matters: embedded systems, mobile browsers, low-power hardware.
Kitten ships with 8 built-in voices (Bella, Jasper, and a handful of others). There is no speaker encoder, no zero-shot mechanism, no shortcut to a custom voice. If you want a new voice, you have to train it into the model.
The process works like this:
1. Collect paired data. You need 5–30 minutes of clean audio from the target speaker, paired with accurate transcriptions. Tools like Montreal Forced Aligner can help generate these alignments from raw audio (a sketch of the expected dataset layout follows this list).
2. Fine-tune the model. Load a pre-trained Kitten checkpoint (like kitten-tts-mini-0.8), freeze most of the network, and train only the speaker embedding layers. This requires at least 8GB of VRAM and takes 6–12 hours depending on dataset size.
3. Export to ONNX. Once training is complete, use export_onnx.py to generate an offline-runnable model file (~20–30MB).
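The exact dataset layout depends on the scripts you adapt, but an LJSpeech-style manifest, one "file|transcript" line per clip, is the common convention for VITS fine-tuning. A hypothetical helper for writing one:

# Hypothetical helper: write an LJSpeech-style metadata.csv for ./my_dataset.
# The real column format depends on the training scripts you adapt.
from pathlib import Path

clips = [
    ("clip_0001", "Hello, this is the target speaker."),
    ("clip_0002", "A few more seconds of clean, transcribed audio."),
]

dataset = Path("./my_dataset")
(dataset / "wavs").mkdir(parents=True, exist_ok=True)  # put the .wav files here
with open(dataset / "metadata.csv", "w", encoding="utf-8") as f:
    for clip_id, transcript in clips:
        f.write(f"{clip_id}|{transcript}\n")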
# Fine-tuning Kitten TTS
python train.py \
    --model_name kitten-tts-nano-0.8 \
    --train_dataset ./my_dataset \
    --output_dir ./ckpt \
    --epochs 100 \
    --learning_rate 5e-4 \
    --speaker_embedding True

# Export for offline use
python export_onnx.py --ckpt ./ckpt/best.pt --output ./my_voice.onnx
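Once exported, the model runs under ONNX Runtime. The input and output names depend on how export_onnx.py builds the graph, so a sensible first step is to inspect them before wiring up inference:

# Sketch: load the exported model with ONNX Runtime and inspect its interface.
# Input names and shapes depend on the export script; check them before inference.
import onnxruntime as ort

session = ort.InferenceSession("my_voice.onnx", providers=["CPUExecutionProvider"])
for tensor in session.get_inputs():
    print("input:", tensor.name, tensor.shape)
for tensor in session.get_outputs():
    print("output:", tensor.name, tensor.shape)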
Strengths and Limitations
Strengths:
- Smallest model footprint of any TTS engine with custom voice support
- After training, inference is extremely fast — even on CPU
- The resulting model is self-contained and portable
Limitations:
- Significant upfront effort: data collection, alignment, and training
- Requires GPU during the training phase (8–40GB VRAM)
- No zero-shot capability at all — you must train for each new voice
- The training pipeline is not officially supported; you’re adapting existing scripts
When to Choose Kitten TTS Fine-Tuning
Embedded systems, mobile apps, and IoT devices where model size and inference speed matter more than the convenience of instant cloning. If you’re deploying to a Raspberry Pi Zero or a smartwatch, Kitten’s tiny footprint is hard to beat — but you need to be willing to invest in the one-time training cost.
Piper TTS + Training Suite: One-Click Express Clone
How It Works
Piper TTS is a battle-tested VITS-based engine with over 900 pre-trained voices. It’s been the go-to choice for Home Assistant integrations and Raspberry Pi projects for years. On its own, Piper doesn’t support voice cloning — you pick from its library of existing voices.
The Piper Training Suite changes that with a feature called Express Clone. It’s a two-stage pipeline:
Stage 1: Synthetic Data Generation with Chatterbox
You provide 3–10 seconds of reference audio. Chatterbox — a zero-shot voice synthesis model — generates over 1,500 short audio clips in the target voice, paired with their transcriptions. This creates a complete training dataset automatically.
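The Training Suite drives this stage for you, but the underlying idea looks roughly like the sketch below, using the open-source Chatterbox model directly (the suite's actual calls and parameters may differ):

# Illustrative sketch of zero-shot generation with Chatterbox; the Training
# Suite automates this loop and pairs each clip with its transcript.
import torchaudio
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
sentences = ["First synthetic training sentence.", "Second synthetic training sentence."]
for i, text in enumerate(sentences):
    wav = model.generate(text, audio_prompt_path="reference.wav")
    torchaudio.save(f"synthetic_{i:04d}.wav", wav, model.sr)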
Stage 2: Fine-Tuning Piper
The synthetic dataset is fed into Piper’s training pipeline. The model fine-tunes for 300–500 epochs, learning to reproduce the target voice with higher fidelity than zero-shot approaches can achieve. The result is exported as a standard .onnx file that runs anywhere Piper does.
# One-command Express Clone
python cloneToPiper.py MyVoice ./reference.wav \
    --samples 200 --epochs 500 --quality high --language en-us \
    --checkpoint lessac
# After training completes, use the custom voice:
piper -m ./exports/MyVoice.onnx -t "This is my cloned voice"
What You Need
- Reference audio: 3–10 seconds
- Hardware: NVIDIA GPU with CUDA recommended (8–12GB VRAM). CPU works but training takes much longer
- Training time: 3–5 minutes for data generation, 2–4 hours for fine-tuning
Strengths and Limitations
Strengths:
- Highest cloning quality among the three approaches. Fine-tuning produces voices that are more faithful to the reference than zero-shot methods
- One-command pipeline handles the entire process from reference audio to deployable model
- The resulting voice model runs on standard Piper inference — CPU, real-time, fully offline
- Multi-language support via Piper’s existing language framework
Limitations:
- Requires GPU for the training phase
- The 2–4 hour training time means this isn’t suitable for real-time or on-demand cloning
- Chatterbox’s synthetic data is good but not perfect — some artifacts can propagate into the final model
- Docker/WSL2 setup can be involved on some systems
When to Choose Piper Training Suite
Audiobook production, customer service voice bots, game character voiceovers — any scenario where you need high-fidelity voice cloning and can afford a one-time training investment. The “train once, deploy everywhere” model works well for production use.
Side-by-Side Comparison
| Feature | KokoClone | Kitten Fine-Tune | Piper Express Clone |
|---|---|---|---|
| Clone method | Zero-shot (speaker encoder) | Fine-tune from scratch | Synthetic data + fine-tune |
| Reference audio | 3–10 seconds | 5–30 minutes paired | 3–10 seconds |
| Training required | No | Yes (6–12 hours) | Yes (2–4 hours) |
| GPU required | No | Yes (8–40GB VRAM) | Recommended (8–12GB VRAM) |
| Clone quality | Good | Good (with enough data) | Best |
| Inference speed | ~150ms / 10s text on CPU | Very fast (tiny model) | Real-time on CPU |
| Model size | ~84MB | 20–30MB | ~75MB |
| Multi-language | Yes (9 languages) | Single language | Yes |
| Maturity | Community project | Manual adaptation | Documented pipeline |
Choosing the Right Approach
The decision comes down to three questions:
Do you need the voice right now? KokoClone gives you results in seconds. The other two require hours of training.
Do you need the highest possible quality? Piper Training Suite’s fine-tuning approach produces better voice fidelity than zero-shot cloning. If you’re creating a voice that thousands of people will hear, the quality difference matters.
What hardware do you have? KokoClone runs on any CPU. Kitten and Piper both require GPU during training. If you don’t have a GPU, KokoClone is your only option.
Here’s a simple decision framework:
- Personal projects, quick experiments, IoT devices — KokoClone
- Embedded systems with strict size constraints — Kitten TTS fine-tuning
- Production voice cloning for audiobooks, games, enterprise — Piper Training Suite
The Offline Advantage
All three approaches share one critical benefit: your voice data never leaves your machine. Unlike cloud-based cloning services like ElevenLabs or Resemble.ai, which require uploading your reference audio to their servers, these tools work entirely locally.
This matters because:
- Privacy. Voice recordings are biometric data. Uploading them to third-party servers creates privacy and security risks that regulations like GDPR and CCPA take seriously.
- Control. Once you’ve trained or encoded a voice, it’s yours. No subscription fees, no rate limits, no terms of service that could change.
- Reliability. Offline tools don’t depend on internet connectivity or service uptime. Your voice generation works the same way whether you’re in a data center or a cabin in the woods.
Getting Started
Ready to try voice cloning? The fastest path is KokoClone — install it, provide a 5-second audio clip, and generate speech in a custom voice immediately:
from kokoclone import KokoClone

clone = KokoClone(device="cpu")
audio = clone.text_to_speech(
    text="Voice cloning, completely offline.",
    ref_wav="reference.wav",
    language="en"
)
For production-quality results, set aside an afternoon with the Piper Training Suite. The Express Clone pipeline automates most of the work — you provide a reference clip, run one command, and come back to a trained voice model.
And if you just want to explore what Kokoro can do with its 54 built-in voices across 9 languages, try it in your browser — no installation required, completely offline, completely private.