Kokoro TTS: Complete Guide — Voices, Languages, WebGPU Setup, and Quality
Kokoro TTS is an 82-million-parameter text-to-speech model that runs in a browser tab. At 82MB, it produces speech that ranks ahead of Google WaveNet and Amazon Polly Neural in blind listener tests. This guide covers what it is, how to use it, which voices sound best, and how to set it up in the browser or in Python.
What Is Kokoro TTS?
Kokoro TTS is built on the StyleTTS 2 architecture, paired with an ISTFTNet vocoder. The released model is decoder-only: it predicts phoneme durations and pitch and then synthesizes the waveform, without StyleTTS 2's diffusion sampler. Version 1.0 was released in January 2025 and has been widely adopted for in-browser TTS.
Key specifications:
| Spec | Value |
|---|---|
| Parameters | 82 million |
| Model size (ONNX) | 82MB (Small), ~300MB (Medium), ~600MB (Large) |
| Architecture | StyleTTS 2 |
| Runtime | WebGPU (GPU), WebAssembly (CPU), Python |
| License | Apache 2.0 |
| Languages | 9 (English, Japanese, Chinese, French, Spanish, Hindi, Italian, Portuguese, Korean) |
| Voices | 88 with quality grades A–D |
| Generation speed (multiple of real time) | 1–2x (WebGPU), 0.5–1x (WASM) |
Kokoro powers OfflineTTS and TTS Studio — the two most visible browser-based TTS projects. Try it now in your browser.
Voices and Quality Grades
Kokoro’s 88 voices are organized by language and quality grade. Each voice is labeled A through D:
- Grade A: Premium quality, suitable for long-form narration and content creation
- Grade B: Good quality, suitable for most use cases
- Grade C: Acceptable quality, functional for short-form content
- Grade D: Basic quality, suitable for notifications and short prompts
English Voices (American + British)
| Voice | Accent | Grade | Best For |
|---|---|---|---|
| Heart | American | A | Audiobook narration, YouTube voiceovers, long-form content |
| Bella | American | A | Conversational content, podcast-style narration |
| Nova | American | A | Professional presentations, e-learning, corporate content |
| Sarah | American | A | Warm narration, storytelling |
| Michael | American | A | Deep narration, documentary-style |
| Michelle | American | B | General purpose, moderate-length content |
| Nicole | British | A | British English narration, formal content |
| Emma | British | A | British conversational, podcast |
| George | British | B | British general purpose |
The full list of 88 voices is available on the OfflineTTS tool — select any voice to hear a sample instantly.
Multilingual Voices
Kokoro supports 8 additional languages beyond English:
| Language | Voice Count | Top Voices | Notes |
|---|---|---|---|
| Japanese | 10+ | Misaki (A) | Uses phonemization via Misaki/espeak-ng |
| Mandarin Chinese | 10+ | Xiaoming (A) | Simplified Chinese supported |
| French | 5+ | Colette (A) | Metropolitan French |
| Spanish | 5+ | Alejandro (A) | Latin American Spanish |
| Hindi | 5+ | Priya (A) | Formal Hindi |
| Italian | 5+ | Alessandro (A) | Standard Italian |
| Portuguese | 5+ | Lucas (A) | Brazilian Portuguese |
| Korean | 5+ | Minjun (A) | Standard Korean |
How to Use Kokoro TTS
In Your Browser (Zero Setup)
The fastest way to use Kokoro TTS:
- Open offlinetts.com/app
- Click “Load Model” (one-time download, ~90MB for Small model)
- Type or paste your text
- Select a voice
- Click Generate
The model caches in your browser’s IndexedDB after the first download. Subsequent visits load in under a second. Works offline after initial load.
WebGPU provides GPU-accelerated inference in Chrome 113+, Safari 17.4+, and Edge 113+. WebAssembly provides CPU fallback in all modern browsers. On a mid-2025 MacBook Pro (M4), WebGPU generates speech at roughly 1.5–2x real-time.
Python CLI
For programmatic use, Kokoro is available as a Python package:
pip install kokoro

from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')  # 'a' for American English
generator = pipeline(
    "Kokoro TTS produces natural-sounding speech from any text.",
    voice='af_heart'
)
for _, _, audio in generator:
    # audio holds the samples for one chunk of text: save, process, or stream it
    pass
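Language codes: a (American English), b (British English), j (Japanese), z (Chinese), f (French), e (Spanish), h (Hindi), i (Italian), p (Brazilian Portuguese). Japanese and Chinese need the extra phonemizer dependencies (pip install misaki[ja] or misaki[zh]). The upstream package's language table lists no k (Korean) code as of v1.0, so confirm Korean support in the GitHub repository before relying on it.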
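The loop above leaves saving the audio to the reader. A minimal sketch of that step follows, using only the standard library. It assumes each chunk is a sequence of float samples in the range -1 to 1 and that output is 24 kHz mono, which is what the upstream examples use; `write_wav` is a hypothetical helper name, not part of the kokoro package.

```python
import struct
import wave

SAMPLE_RATE = 24_000  # assumed Kokoro output rate; confirm for your version


def write_wav(path, chunks, sample_rate=SAMPLE_RATE):
    """Write an iterable of float-sample chunks to a 16-bit mono WAV file.

    Returns the duration of the written audio in seconds.
    """
    total_samples = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit PCM
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1, 1] so out-of-range samples cannot overflow int16.
            samples = [max(-1.0, min(1.0, float(s))) for s in chunk]
            pcm = (int(s * 32767) for s in samples)
            wav.writeframes(struct.pack(f"<{len(samples)}h", *pcm))
            total_samples += len(samples)
    return total_samples / sample_rate
```

With the pipeline above, a call such as `write_wav("out.wav", (audio for _, _, audio in generator))` would join every chunk into one file. Converting sample by sample is slow for long audio; an array library is the better choice once the format is confirmed.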
API via Kokoro
Kokoro is also available through an API at $0.65 per million characters — the lowest per-character price of any production TTS service. See the TTS Arena leaderboard for cost comparisons.
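To put per-character pricing in concrete terms, the sketch below converts a character count into a cost. The rates are the ones quoted in this guide; the 400,000-character job size is a hypothetical example, and `tts_cost_usd` is an illustrative helper rather than part of any API.

```python
def tts_cost_usd(characters, usd_per_million_chars):
    """Cost in USD of synthesizing `characters` characters at a per-million rate."""
    if characters < 0 or usd_per_million_chars < 0:
        raise ValueError("characters and rate must be non-negative")
    return characters / 1_000_000 * usd_per_million_chars


# Rates quoted in this guide, in USD per one million characters.
RATES = {"Kokoro API": 0.65, "OpenAI TTS-1": 15.00}

job_chars = 400_000  # hypothetical job size
for name, rate in RATES.items():
    print(f"{name}: ${tts_cost_usd(job_chars, rate):.2f}")
```

At these rates the hypothetical job costs about $0.26 through the Kokoro API and $6.00 through OpenAI TTS-1.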
WebGPU Setup and Troubleshooting
Checking WebGPU Support
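async function checkWebGPU() {
  // navigator.gpu can exist even when no usable adapter is available,
  // so request an adapter before committing to the WebGPU path.
  const adapter = navigator.gpu ? await navigator.gpu.requestAdapter() : null;
  if (adapter) {
    console.log('WebGPU is supported');
  } else {
    console.log('WebGPU is NOT supported; falling back to WASM');
  }
}
checkWebGPU();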
Browser Compatibility
| Browser | WebGPU Support | Notes |
|---|---|---|
| Chrome 113+ | Yes | Best performance |
| Edge 113+ | Yes | Same engine as Chrome |
| Safari 17.4+ | Yes | Supported on macOS and iOS |
| Firefox | Behind flag | Set dom.webgpu.enabled in about:config |
| Chrome Android | Partial | Available in Chrome 121+ |
| Safari iOS 17.4+ | Yes | Works on iPhone/iPad |
When WebGPU is unavailable, Kokoro automatically falls back to WASM. Generation is slower (roughly 0.5–1x real-time on modern CPUs) but produces identical output.
Common Issues
Model loading fails: Clear IndexedDB data for the site, then reload. This fixes most loading issues caused by corrupted cache.
No audio output: Check that your browser’s audio autoplay policy allows playback. Most browsers require a user gesture (click) before playing audio.
Slow generation on WASM: Close other tabs using GPU/CPU resources. WASM inference on a 2020+ laptop generates 30 seconds of audio in roughly 30–60 seconds.
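The speed figures in this guide are multiples of real time, so the wait for a clip is its duration divided by the multiple. The sketch below works through that arithmetic for the ranges quoted above; the function names are illustrative.

```python
def generation_seconds(audio_seconds, speed_multiple):
    """Wall-clock seconds needed to synthesize `audio_seconds` of speech.

    `speed_multiple` is generation speed as a multiple of real time
    (2.0 means one second of audio takes half a second to generate).
    """
    if speed_multiple <= 0:
        raise ValueError("speed_multiple must be positive")
    return audio_seconds / speed_multiple


def generation_range(audio_seconds, slowest, fastest):
    """Return (best case, worst case) wall-clock seconds for a speed range."""
    return (
        generation_seconds(audio_seconds, fastest),
        generation_seconds(audio_seconds, slowest),
    )


print(generation_range(30, 0.5, 1.0))  # WASM range   -> (30.0, 60.0)
print(generation_range(30, 1.5, 2.0))  # WebGPU range -> (15.0, 20.0)
```

The first line reproduces the 30–60 second WASM estimate for a 30-second clip; the same clip on WebGPU at 1.5–2x would take roughly 15–20 seconds.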
Quality Benchmarks
On the Artificial Analysis TTS Arena, Kokoro 82M v1.0 ranks 32nd out of 74 models with an Elo of 1056. Among models that can run in a browser, it ranks 1st.
| Comparison | Kokoro Elo | Other Elo | Kokoro Win Rate |
|---|---|---|---|
| vs. Google WaveNet | 1056 | 873 | — |
| vs. Amazon Polly Neural | 1056 | 868 | — |
| vs. OpenAI TTS-1 | 1056 | 1102 | 35% |
| vs. ElevenLabs v3 | 1056 | 1178 | 31% |
| vs. Piper TTS | — | — | See benchmark |
Kokoro’s strength is knowledge sharing (articles, documentation, educational content) — it scores Elo 1066 in that category. It is less competitive in entertainment (dialogue, character voices) where larger models with expressive training data dominate.
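An Elo gap can be turned into an approximate head-to-head win probability. The sketch below uses the standard Elo logistic formula with a 400-point scale; that the arena computes its ratings this way is an assumption, so treat the results as a rough guide and expect observed win rates to differ somewhat.

```python
def elo_expected_win(rating_a, rating_b):
    """Expected probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


KOKORO = 1056  # Elo ratings below are the ones quoted in the table above
OPPONENTS = [
    ("Google WaveNet", 873),
    ("Amazon Polly Neural", 868),
    ("OpenAI TTS-1", 1102),
    ("ElevenLabs v3", 1178),
]

for name, rating in OPPONENTS:
    print(f"vs. {name}: {elo_expected_win(KOKORO, rating):.0%}")
```

Under this model Kokoro would be preferred about 74–75% of the time against WaveNet and Polly Neural, about 43% against OpenAI TTS-1, and about 33% against ElevenLabs v3.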
Kokoro vs Piper vs Kitten
Kokoro is one of three browser-runnable TTS engines compared here. Here’s how they stack up:
| Feature | Kokoro | Piper | Kitten |
|---|---|---|---|
| Parameters | 82M | ~20M | 15M |
| Model size | 82MB | 75MB | 24MB |
| Voices | 88 | 904 | 8 |
| Languages | 9 | 1 (English) | 1 (English) |
| WebGPU | Yes | No | Yes |
| WASM fallback | Yes | Yes | Yes |
| Sample rate | 8–48kHz | 22kHz fixed | 8–48kHz |
| Quality grade | A/A- | B+ | C+ |
| Real-time (WebGPU) | 1.5–2x | N/A | 2–3x |
| License | Apache 2.0 | MIT | Apache 2.0 |
For the full comparison with audio samples, see our browser TTS benchmark.
When to choose Kokoro: Quality matters, you need multiple languages, or you want GPU acceleration through WebGPU.
When to choose Piper: You need 904 English voices, maximum speed on CPU, or Raspberry Pi deployment.
When to choose Kitten: You need the smallest possible model size (24MB) for embedded or mobile deployment.
Kokoro vs Cloud TTS Services
| Feature | Kokoro (Browser) | ElevenLabs | OpenAI TTS-1 | Google Cloud |
|---|---|---|---|---|
| Cost | Free | $5–330/mo | $15/1M chars | $4–16/1M chars |
| Sign-up | No | Yes | Yes | Yes |
| Offline | Yes | No | No | No |
| Privacy | On-device | Cloud | Cloud | Cloud |
| Voice count | 88 | 100+ | 6 | 100+ |
| Languages | 9 | 29+ | 7 | 50+ |
| Quality (Elo) | 1056 | 1178 | 1102 | 1062 (Studio) |
The trade-off is clear: cloud services cover more languages and the top models score higher on quality, but they require accounts, cost money, and send your text to external servers. Kokoro delivers A-grade quality at zero cost with on-device privacy.
For a deeper comparison, see OfflineTTS vs ElevenLabs.
Voice Cloning with Kokoro
Kokoro does not include built-in voice cloning, but the community project KokoClone adds zero-shot cloning using a speaker encoder based on ECAPA-TDNN. You provide 3–10 seconds of reference audio, and KokoClone generates a speaker embedding that plugs into Kokoro’s decoder.
The cloned voice is a close approximation — not a perfect replica — but it runs on CPU with no training phase. For production-quality cloning, the Piper Training Suite offers fine-tuning with higher fidelity at the cost of a GPU and 2–4 hours of training time.
Use Cases
YouTube and Video Content
Kokoro’s Grade A voices (Heart, Bella, Nova) produce natural narration suitable for YouTube videos. The YouTube TTS guide covers the full workflow, including LUFS normalization for platform compliance.
Audiobook Production
For audiobooks, Kokoro excels at neutral narration but is weaker at dramatic character voices. The audiobook voice page has voice recommendations for long-form content.
E-Learning and Accessibility
Kokoro’s multilingual support (9 languages) and offline capability make it suitable for educational tools and accessibility applications. No internet dependency means it works in schools, training rooms, and environments with restricted connectivity.
Privacy-Sensitive Applications
Legal, medical, and corporate users benefit from Kokoro’s on-device processing. Text never leaves the browser, which helps with data sovereignty and compliance requirements. Our private TTS guide covers the compliance landscape.
Getting Started
The fastest way to use Kokoro TTS:
Open OfflineTTS — 88 voices, 9 languages, free and offline
For Python usage, install with pip install kokoro and see the GitHub repository for documentation.