
Kokoro TTS: Complete Guide — Voices, Languages, WebGPU Setup, and Quality

Tags: kokoro, tts, guide, webgpu, offline, browser

Kokoro TTS is an 82-million-parameter text-to-speech model that runs in a browser tab. At 82MB, it produces speech that ranks ahead of Google WaveNet and Amazon Polly Neural in blind listener tests. This guide covers what it is, how to use it, which voices sound best, and how to set it up in the browser or in Python.

What Is Kokoro TTS?

Kokoro TTS is built on the StyleTTS 2 architecture, a neural TTS system that predicts duration and pitch from text and a style embedding. Kokoro pairs it with an ISTFTNet vocoder and ships as a decoder-only model with no diffusion step. Version 1.0 was released in January 2025 and has been rapidly adopted as a leading browser-runnable TTS engine.

Key specifications:

| Spec | Value |
| --- | --- |
| Parameters | 82 million |
| Model size (ONNX) | 82MB (Small), ~300MB (Medium), ~600MB (Large) |
| Architecture | StyleTTS 2 |
| Runtime | WebGPU (GPU), WebAssembly (CPU), Python |
| License | Apache 2.0 |
| Languages | 9 (American English, British English, Japanese, Mandarin Chinese, French, Spanish, Hindi, Italian, Brazilian Portuguese) |
| Voices | 88 with quality grades A–D |
| Generation speed (multiple of real time) | 1–2x (WebGPU), 0.5–1x (WASM) |

Kokoro powers OfflineTTS and TTS Studio — the two most visible browser-based TTS projects. Try it now in your browser.

Voices and Quality Grades

Kokoro’s 88 voices are organized by language and quality grade. Each voice is labeled A through D:

  • Grade A: Premium quality, suitable for long-form narration and content creation
  • Grade B: Good quality, suitable for most use cases
  • Grade C: Acceptable quality, functional for short-form content
  • Grade D: Basic quality, suitable for notifications and short prompts

English Voices (American + British)

| Voice | Accent | Grade | Best For |
| --- | --- | --- | --- |
| Heart | American | A | Audiobook narration, YouTube voiceovers, long-form content |
| Bella | American | A | Conversational content, podcast-style narration |
| Nova | American | A | Professional presentations, e-learning, corporate content |
| Sarah | American | A | Warm narration, storytelling |
| Michael | American | A | Deep narration, documentary-style |
| Michelle | American | B | General purpose, moderate-length content |
| Nicole | British | A | British English narration, formal content |
| Emma | British | A | British conversational, podcast |
| George | British | B | British general purpose |

The full list of 88 voices is available on the OfflineTTS tool — select any voice to hear a sample instantly.
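For readers who want to pick a voice programmatically, the English voices table can be sketched as a small lookup. The names, accents, and grades come from that table; the `Voice` type, the `ENGLISH_VOICES` list, and the `voices_for` helper are hypothetical names introduced here for illustration and are not part of Kokoro's API.

```python
from typing import NamedTuple

class Voice(NamedTuple):
    name: str
    accent: str
    grade: str  # "A" (best) through "D"

# Rows copied from the English voices table in this guide.
ENGLISH_VOICES = [
    Voice("Heart", "American", "A"),
    Voice("Bella", "American", "A"),
    Voice("Nova", "American", "A"),
    Voice("Sarah", "American", "A"),
    Voice("Michael", "American", "A"),
    Voice("Michelle", "American", "B"),
    Voice("Nicole", "British", "A"),
    Voice("Emma", "British", "A"),
    Voice("George", "British", "B"),
]

def voices_for(accent: str, min_grade: str = "A") -> list[str]:
    """Return voice names with the given accent graded min_grade or better."""
    # Grades are single letters, so "A" < "B" < "C" < "D" compares correctly.
    return [v.name for v in ENGLISH_VOICES
            if v.accent == accent and v.grade <= min_grade]

print(voices_for("British"))        # Grade A British voices
print(voices_for("American", "B"))  # Grade A and B American voices
```

Keeping the grade as a letter means the same comparison keeps working if C and D voices are added to the list later.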

Multilingual Voices

Beyond American and British English, Kokoro supports 7 additional languages:

| Language | Voice Count | Top Voices | Notes |
| --- | --- | --- | --- |
| Japanese | 10+ | Misaki (A) | Uses phonemization via Misaki/espeak-ng |
| Mandarin Chinese | 10+ | Xiaoming (A) | Simplified Chinese supported |
| French | 5+ | Colette (A) | Metropolitan French |
| Spanish | 5+ | Alejandro (A) | Latin American Spanish |
| Hindi | 5+ | Priya (A) | Formal Hindi |
| Italian | 5+ | Alessandro (A) | Standard Italian |
| Brazilian Portuguese | 5+ | Lucas (A) | Brazilian Portuguese |

How to Use Kokoro TTS

In Your Browser (Zero Setup)

The fastest way to use Kokoro TTS:

  1. Open offlinetts.com/app
  2. Click “Load Model” (one-time download, ~82MB for the Small model)
  3. Type or paste your text
  4. Select a voice
  5. Click Generate

The model caches in your browser’s IndexedDB after the first download. Subsequent visits load in under a second. Works offline after initial load.

WebGPU provides GPU-accelerated inference in Chrome 113+, Safari 17.4+, and Edge 113+. WebAssembly provides CPU fallback in all modern browsers. On a mid-2025 MacBook Pro (M4), WebGPU generates speech at roughly 1.5–2x real-time.

Python CLI

For programmatic use, Kokoro is available as a Python package:

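pip install kokoro soundfile

from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' for American English
generator = pipeline(
    "Kokoro TTS produces natural-sounding speech from any text.",
    voice='af_heart'
)
# Each item is (graphemes, phonemes, audio); long text arrives as several chunks
for i, (_, _, audio) in enumerate(generator):
    # Save each chunk as its own WAV file (Kokoro outputs 24 kHz audio)
    sf.write(f'{i}.wav', audio, 24000)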

Language codes: a (American English), b (British English), j (Japanese), z (Mandarin Chinese), f (French), e (Spanish), h (Hindi), i (Italian), p (Brazilian Portuguese).
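The pipeline yields audio in chunks, so a common next step is joining them into one file. The sketch below does that with only the Python standard library. It assumes mono audio as floats in the range -1.0 to 1.0 at 24 kHz; `write_wav` is a hypothetical helper, and the two short tone bursts stand in for real Kokoro output so the example runs without the model.

```python
import math
import struct
import wave

def write_wav(path, chunks, sample_rate=24000):
    """Join float audio chunks and save them as one 16-bit mono WAV file.

    Returns the total number of samples written.
    """
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit PCM
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Clamp to [-1, 1], then scale to the signed 16-bit range.
            pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in chunk]
            wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
            total += len(pcm)
    return total

# Stand-in chunks: two 0.1 s bursts of a 440 Hz tone (assumed 24 kHz rate).
rate = 24000
tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 10)]
samples = write_wav("kokoro_demo.wav", [tone, tone], sample_rate=rate)
print(samples / rate, "seconds written")
```

With real output, the same helper should accept the audio from the generator loop, for example `write_wav("out.wav", (audio for _, _, audio in generator))`, as long as each chunk can be iterated as individual float samples.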

API via Kokoro

Kokoro is also available through an API at $0.65 per million characters — the lowest per-character price of any production TTS service. See the TTS Arena leaderboard for cost comparisons.
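To put per-character pricing in concrete terms, here is a rough cost estimate. The $0.65 rate is the one quoted above, and the $15 rate is the OpenAI TTS-1 figure from the cloud comparison later in this guide. The function name and the 500,000-character book length are assumptions for illustration.

```python
def tts_cost_usd(characters: int, price_per_million: float) -> float:
    """Cost in US dollars for synthesizing `characters` at a per-million rate."""
    return characters / 1_000_000 * price_per_million

book_chars = 500_000  # assumed length of a mid-sized book

print(f"Kokoro API:   ${tts_cost_usd(book_chars, 0.65):.2f}")
print(f"OpenAI TTS-1: ${tts_cost_usd(book_chars, 15.00):.2f}")
```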

WebGPU Setup and Troubleshooting

Checking WebGPU Support

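async function hasWebGPU() {
  // navigator.gpu can exist even when no usable adapter is available
  if (!navigator.gpu) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}

hasWebGPU().then((ok) => {
  console.log(ok ? 'WebGPU is supported' : 'WebGPU is NOT supported, falling back to WASM');
});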

Browser Compatibility

| Browser | WebGPU Support | Notes |
| --- | --- | --- |
| Chrome 113+ | Yes | Best performance |
| Edge 113+ | Yes | Same engine as Chrome |
| Safari 17.4+ | Yes | Supported on macOS and iOS |
| Firefox | Behind flag | Set dom.webgpu.enabled in about:config |
| Chrome Android | Partial | Available in Chrome 121+ |
| Safari iOS 17.4+ | Yes | Works on iPhone/iPad |

When WebGPU is unavailable, Kokoro automatically falls back to WASM. Generation is slower (roughly 0.5–1x real-time on modern CPUs) but produces identical output.

Common Issues

Model loading fails: Clear IndexedDB data for the site, then reload. This fixes most loading issues caused by corrupted cache.

No audio output: Check that your browser’s audio autoplay policy allows playback. Most browsers require a user gesture (click) before playing audio.

Slow generation on WASM: Close other tabs using GPU/CPU resources. WASM inference on a 2020+ laptop generates 30 seconds of audio in roughly 30–60 seconds.
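The speed figures in this guide are multiples of real time, so the expected wait is the audio length divided by the speed factor. A minimal sketch of that arithmetic, with `generation_seconds` as a hypothetical helper name:

```python
def generation_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Time to synthesize a clip, where speed_factor is a multiple of real time."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor

clip = 30.0  # seconds of audio to generate

# WASM range quoted in this guide: 0.5x to 1x real time.
print("WASM:  ", generation_seconds(clip, 1.0), "to", generation_seconds(clip, 0.5), "s")
# WebGPU range quoted for an M4 MacBook Pro: 1.5x to 2x real time.
print("WebGPU:", generation_seconds(clip, 2.0), "to", generation_seconds(clip, 1.5), "s")
```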

Quality Benchmarks

On the Artificial Analysis TTS Arena, Kokoro 82M v1.0 ranks 32nd out of 74 models with an Elo of 1056. Among models that can run in a browser, it ranks 1st.

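| Comparison | Kokoro Elo | Other Elo | Kokoro Win Rate |
| --- | --- | --- | --- |
| vs. Google WaveNet | 1056 | 873 | |
| vs. Amazon Polly Neural | 1056 | 868 | |
| vs. OpenAI TTS-1 | 1056 | 1102 | 35% |
| vs. ElevenLabs v3 | 1056 | 1178 | 31% |
| vs. Piper TTS | See benchmark | | |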

Kokoro’s strength is knowledge sharing (articles, documentation, educational content) — it scores Elo 1066 in that category. It is less competitive in entertainment (dialogue, character voices) where larger models with expressive training data dominate.
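Elo gaps can be turned into rough head-to-head expectations with the standard Elo expected-score formula. This is a sketch under the assumption that the arena's ratings behave like classic 400-point-scale Elo. The leaderboard may fit its ratings differently, so measured win rates (such as the 35% and 31% listed in the comparison) need not match the formula exactly.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Classic Elo expected score of A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

KOKORO = 1056
# Opponent ratings as listed in the comparison table.
opponents = {
    "Google WaveNet": 873,
    "Amazon Polly Neural": 868,
    "OpenAI TTS-1": 1102,
    "ElevenLabs v3": 1178,
}

for name, elo in opponents.items():
    print(f"Kokoro vs {name}: {expected_win_rate(KOKORO, elo):.0%} expected")
```

Under this formula, the 122-point gap to ElevenLabs v3 implies listeners would prefer Kokoro roughly one time in three, which is in line with the listed win rate.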

Kokoro vs Piper vs Kitten

Kokoro is one of three browser-runnable TTS engines. Here’s how they compare:

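| Feature | Kokoro | Piper | Kitten |
| --- | --- | --- | --- |
| Parameters | 82M | ~20M | 15M |
| Model size | 82MB | 75MB | 24MB |
| Voices | 88 | 904 | 48 |
| Languages | 9 | 1 (English) | 1 (English) |
| WebGPU | Yes | No | Yes |
| WASM fallback | Yes | Yes | Yes |
| Sample rate | 8–48kHz | 22kHz fixed | 8–48kHz |
| Quality grade | A/A- | B+ | C+ |
| Real-time (WebGPU) | 1.5–2x | N/A | 2–3x |
| License | Apache 2.0 | MIT | Apache 2.0 |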

For the full comparison with audio samples, see our browser TTS benchmark.

When to choose Kokoro: Quality matters or you need multiple languages.

When to choose Piper: You need 904 English voices, maximum speed on CPU, or Raspberry Pi deployment.

When to choose Kitten: You need the smallest possible model size (24MB) for embedded or mobile deployment.

Kokoro vs Cloud TTS Services

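| Feature | Kokoro (Browser) | ElevenLabs | OpenAI TTS-1 | Google Cloud |
| --- | --- | --- | --- | --- |
| Cost | Free | $5–330/mo | $15/1M chars | $4–16/1M chars |
| Sign-up | No | Yes | Yes | Yes |
| Offline | Yes | No | No | No |
| Privacy | On-device | Cloud | Cloud | Cloud |
| Voice count | 88 | 100+ | 6 | 100+ |
| Languages | 9 | 29+ | 7 | 50+ |
| Quality (Elo) | 1056 | 1178 | 1102 | 1062 (Studio) |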

The trade-off is clear: cloud services offer more voices and marginally higher quality, but require accounts, cost money, and send your text to external servers. Kokoro delivers A-grade quality at zero cost with complete privacy.

For a deeper comparison, see OfflineTTS vs ElevenLabs.

Voice Cloning with Kokoro

Kokoro does not include built-in voice cloning, but the community project KokoClone adds zero-shot cloning using a speaker encoder based on ECAPA-TDNN. You provide 3–10 seconds of reference audio, and KokoClone generates a speaker embedding that plugs into Kokoro’s decoder.

The cloned voice is a close approximation — not a perfect replica — but it runs on CPU with no training phase. For production-quality cloning, the Piper Training Suite offers fine-tuning with higher fidelity at the cost of a GPU and 2–4 hours of training time.

Use Cases

YouTube and Video Content

Kokoro’s Grade A voices (Heart, Bella, Nova) produce natural narration suitable for YouTube videos. Our YouTube TTS guide covers the full workflow, including LUFS normalization for platform compliance.

Audiobook Production

For audiobooks, Kokoro excels at neutral narration but is weaker at dramatic character voices. The audiobook voice page has voice recommendations for long-form content.

E-Learning and Accessibility

Kokoro’s multilingual support (9 languages) and offline capability make it suitable for educational tools and accessibility applications. No internet dependency means it works in schools, training rooms, and environments with restricted connectivity.

Privacy-Sensitive Applications

Legal, medical, and corporate users benefit from Kokoro’s on-device processing. Text never leaves the browser, which helps with data sovereignty and compliance requirements. Our private TTS guide covers the compliance landscape.

Getting Started

The fastest way to use Kokoro TTS:

Open OfflineTTS — 88 voices, 9 languages, free and offline

For Python usage, install with pip install kokoro and see the GitHub repository for documentation.


Frequently Asked Questions

How many voices does Kokoro TTS have?
Kokoro TTS includes 88 voices across 9 languages: American English, British English, Japanese, Mandarin Chinese, French, Spanish, Hindi, Italian, and Brazilian Portuguese. Each voice is quality-graded A through D.

Can Kokoro TTS run in a browser?
Yes. Kokoro runs via WebGPU (GPU-accelerated) or WebAssembly (CPU fallback) in any modern browser. The model downloads once (~82MB), caches in IndexedDB, and works offline after that.

What browsers support WebGPU for Kokoro TTS?
Chrome 113+, Edge 113+, and Safari 17.4+ support WebGPU. Firefox has it behind a flag. When WebGPU is unavailable, Kokoro automatically falls back to WASM, which is slower but produces identical output.

Is Kokoro TTS free for commercial use?
Yes. Kokoro TTS is licensed under Apache 2.0, which permits commercial use. Generated audio carries no attribution requirement, so you can use it in YouTube videos, audiobooks, e-learning, and any commercial project. If you redistribute the model or code itself, Apache 2.0 requires keeping the license and copyright notices.

How does Kokoro compare to ElevenLabs?
Kokoro produces A/A- quality at zero cost; ElevenLabs produces premium quality at $5–330/month. On blind listener tests, ElevenLabs v3 scores Elo 1178 vs Kokoro's 1056. Kokoro wins on cost, privacy, and offline capability.
