How many voices does Kokoro TTS have?

Kokoro TTS includes 88 voices across 9 languages: American English, British English, Japanese, Mandarin Chinese, French, Spanish, Hindi, Italian, and Brazilian Portuguese. Each voice is quality-graded A through D.

Can Kokoro TTS run in a browser?

Yes. Kokoro runs via WebGPU (GPU-accelerated) or WebAssembly (CPU fallback) in any modern browser. The model downloads once (~82MB), caches in IndexedDB, and works offline after that.

What browsers support WebGPU for Kokoro TTS?

Chrome 113+, Edge 113+, and Safari 17.4+ support WebGPU. Firefox has it behind a flag. When WebGPU is unavailable, Kokoro automatically falls back to WASM — slower but produces identical output.

Is Kokoro TTS free for commercial use?

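Yes. Kokoro TTS is licensed under Apache 2.0, which permits commercial use. The license places no attribution requirement on generated audio, so you can use it in YouTube videos, audiobooks, e-learning, and any commercial project. If you redistribute the model itself, Apache 2.0 requires you to include the license text and any notices.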

How does Kokoro compare to ElevenLabs?

Kokoro produces A/A- quality at zero cost; ElevenLabs produces premium quality at $5–330/month. On blind listener tests, ElevenLabs v3 scores Elo 1178 vs Kokoro's 1056. Kokoro wins on cost, privacy, and offline capability.

Kokoro TTS: Complete Guide — Voices, Languages, WebGPU Setup, and Quality

Kokoro TTS is an 82-million-parameter text-to-speech model that runs in a browser tab. At 82MB, it produces speech that ranks ahead of Google WaveNet and Amazon Polly Neural on blind listener tests. This guide covers everything: what it is, how to use it, which voices sound best, and how to set it up in any environment.

What Is Kokoro TTS?

Kokoro TTS is built on the StyleTTS 2 architecture, a neural TTS system that uses duration prediction and pitch estimation. Unlike the original StyleTTS 2, Kokoro skips the diffusion stage and synthesizes the waveform with an ISTFTNet vocoder. The model was released in January 2025 and has been rapidly adopted as the leading browser-runnable TTS engine.

Key specifications:

Spec	Value
Parameters	82 million
Model size (ONNX)	82MB (Small), ~300MB (Medium), ~600MB (Large)
Architecture	StyleTTS 2
Runtime	WebGPU (GPU), WebAssembly (CPU), Python
License	Apache 2.0
Languages	9 (American English, British English, Japanese, Mandarin Chinese, French, Spanish, Hindi, Italian, Brazilian Portuguese)
Voices	88 with quality grades A–D
Generation speed (multiple of real time)	1–2x (WebGPU), 0.5–1x (WASM)
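
The size figures can be sanity-checked with simple arithmetic: a model file is roughly the parameter count times the bytes stored per weight. The sketch below assumes the file is dominated by weights. Mapping the Small size to 8-bit weights and the Medium size to 32-bit floats is an assumption on our part; the table does not state the precision of each variant.

```python
PARAMETERS = 82_000_000  # parameter count from the spec table


def approx_size_mb(parameters: int, bits_per_weight: int) -> float:
    """Estimate weight storage in megabytes (1 MB = 1,000,000 bytes)."""
    return parameters * bits_per_weight / 8 / 1_000_000


# Assumed precisions; the guide does not say which one each variant uses.
for label, bits in [("8-bit", 8), ("16-bit", 16), ("32-bit", 32)]:
    print(f"{label}: ~{approx_size_mb(PARAMETERS, bits):.0f} MB")
```

Under these assumptions, 8-bit weights come to about 82MB and 32-bit weights to about 328MB, which is in the neighborhood of the Small and Medium sizes listed above.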

Kokoro powers OfflineTTS and TTS Studio — the two most visible browser-based TTS projects. Try it now in your browser.

Voices and Quality Grades

Kokoro’s 88 voices are organized by language and quality grade. Each voice is labeled A through D:

Grade A: Premium quality, suitable for long-form narration and content creation
Grade B: Good quality, suitable for most use cases
Grade C: Acceptable quality, functional for short-form content
Grade D: Basic quality, suitable for notifications and short prompts

English Voices (American + British)

Voice	Accent	Grade	Best For
Heart	American	A	Audiobook narration, YouTube voiceovers, long-form content
Bella	American	A	Conversational content, podcast-style narration
Nova	American	A	Professional presentations, e-learning, corporate content
Sarah	American	A	Warm narration, storytelling
Michael	American	A	Deep narration, documentary-style
Michelle	American	B	General purpose, moderate-length content
Nicole	British	A	British English narration, formal content
Emma	British	A	British conversational, podcast
George	British	B	British general purpose

The full list of 88 voices is available on the OfflineTTS tool — select any voice to hear a sample instantly.
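
If you keep the voice table in code, choosing a voice by accent and grade becomes a simple filter. This is a minimal sketch: the rows are copied from the English table above, while the `Voice` type and the `pick_voices` helper are hypothetical names used for illustration, not part of any Kokoro package.

```python
from typing import List, NamedTuple, Optional


class Voice(NamedTuple):
    name: str
    accent: str
    grade: str  # single letter, "A" (best) through "D"


# Rows copied from the English voice table in this guide.
ENGLISH_VOICES = [
    Voice("Heart", "American", "A"),
    Voice("Bella", "American", "A"),
    Voice("Nova", "American", "A"),
    Voice("Sarah", "American", "A"),
    Voice("Michael", "American", "A"),
    Voice("Michelle", "American", "B"),
    Voice("Nicole", "British", "A"),
    Voice("Emma", "British", "A"),
    Voice("George", "British", "B"),
]


def pick_voices(voices: List[Voice], accent: Optional[str] = None,
                min_grade: str = "B") -> List[Voice]:
    """Return voices graded min_grade or better, optionally limited to one accent."""
    # Single-letter grades sort alphabetically, so "A" <= "B" means "A is at least B".
    return [
        v for v in voices
        if v.grade <= min_grade and (accent is None or v.accent == accent)
    ]


print([v.name for v in pick_voices(ENGLISH_VOICES, accent="British", min_grade="A")])
```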

Multilingual Voices

Kokoro supports 7 additional languages beyond English:

Language	Voice Count	Top Voices	Notes
Japanese	10+	Misaki (A)	Uses phonemization via Misaki/espeak-ng
Mandarin Chinese	10+	Xiaoming (A)	Simplified Chinese supported
French	5+	Colette (A)	Metropolitan French
Spanish	5+	Alejandro (A)	Latin American Spanish
Hindi	5+	Priya (A)	Formal Hindi
Italian	5+	Alessandro (A)	Standard Italian
Brazilian Portuguese	5+	Lucas (A)	Brazilian Portuguese

How to Use Kokoro TTS

In Your Browser (Zero Setup)

The fastest way to use Kokoro TTS:

Open offlinetts.com/app
Click “Load Model” (one-time download, ~82MB for the Small model)
Type or paste your text
Select a voice
Click Generate

The model caches in your browser’s IndexedDB after the first download. Subsequent visits load in under a second. Works offline after initial load.

WebGPU provides GPU-accelerated inference in Chrome 113+, Safari 17.4+, and Edge 113+. WebAssembly provides CPU fallback in all modern browsers. On a mid-2025 MacBook Pro (M4), WebGPU generates speech at roughly 1.5–2x real-time.

Python CLI

For programmatic use, Kokoro is available as a Python package:

pip install kokoro soundfile

from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' for American English
generator = pipeline(
    "Kokoro TTS produces natural-sounding speech from any text.",
    voice='af_heart'
)
# The pipeline yields one audio chunk per text segment
for i, (_, _, audio) in enumerate(generator):
    # Write each chunk as its own 24 kHz WAV file: 0.wav, 1.wav, ...
    sf.write(f'{i}.wav', audio, 24000)
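
Long inputs arrive as several chunks, so you will often want a single output file. The sketch below joins chunks into one 16-bit mono WAV using only the standard library. `write_wav` is a hypothetical helper, not part of the kokoro package, and it assumes each chunk is a sequence of float samples in the range -1.0 to 1.0 at 24 kHz.

```python
import struct
import wave
from typing import Iterable, Sequence

SAMPLE_RATE = 24_000  # assumption: the rate the audio chunks were generated at


def write_wav(path: str, chunks: Iterable[Sequence[float]],
              sample_rate: int = SAMPLE_RATE) -> int:
    """Join float chunks into one 16-bit mono WAV file and return the sample count."""
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1.0, 1.0] so the int16 conversion cannot overflow.
            clipped = [max(-1.0, min(1.0, float(s))) for s in chunk]
            pcm = [int(s * 32767) for s in clipped]
            wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
            total += len(pcm)
    return total
```

With the generator from the example above, a call would look like `write_wav("speech.wav", (audio for _, _, audio in generator))`. The per-sample Python loop is slow for long recordings; it is meant to show the steps, not to be the fastest route.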

Language codes: a (American English), b (British English), j (Japanese), z (Mandarin Chinese), f (French), e (Spanish), h (Hindi), i (Italian), p (Brazilian Portuguese).
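
The pipeline's language code and the chosen voice should match. The example voice `af_heart` suggests that voice IDs begin with the language code followed by a gender letter. The sketch below relies on that pattern, which is inferred from the example and not stated in this guide; `parse_voice_id` and `check_voice` are hypothetical helpers, not functions from the kokoro package.

```python
from typing import NamedTuple


class VoiceId(NamedTuple):
    lang_code: str  # e.g. 'a' for American English
    gender: str     # e.g. 'f' or 'm'
    name: str       # e.g. 'heart'


def parse_voice_id(voice: str) -> VoiceId:
    """Split an ID such as 'af_heart' into language code, gender letter, and name."""
    prefix, sep, name = voice.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID format: {voice!r}")
    return VoiceId(prefix[0], prefix[1], name)


def check_voice(lang_code: str, voice: str) -> None:
    """Raise ValueError if the voice's language prefix differs from lang_code."""
    parsed = parse_voice_id(voice)
    if parsed.lang_code != lang_code:
        raise ValueError(
            f"voice {voice!r} has language code {parsed.lang_code!r}, "
            f"but the pipeline uses {lang_code!r}"
        )


check_voice("a", "af_heart")  # passes silently: both use 'a'
```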

API via Kokoro

Kokoro is also available through an API at $0.65 per million characters — the lowest per-character price of any production TTS service. See the TTS Arena leaderboard for cost comparisons.
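
Per-character pricing is easy to turn into a project estimate. The sketch below uses the $0.65 rate quoted above and, for contrast, the $15 per million characters listed for OpenAI TTS-1 in the cloud comparison table in this guide. The 500,000-character script length is a made-up figure for illustration.

```python
def api_cost_usd(characters: int, usd_per_million_chars: float) -> float:
    """Cost in US dollars for synthesizing the given number of characters."""
    return characters / 1_000_000 * usd_per_million_chars


SCRIPT_CHARS = 500_000  # hypothetical script length

kokoro_cost = api_cost_usd(SCRIPT_CHARS, 0.65)
openai_cost = api_cost_usd(SCRIPT_CHARS, 15.00)
print(f"Kokoro API:   ${kokoro_cost:.2f}")
print(f"OpenAI TTS-1: ${openai_cost:.2f}")
```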

WebGPU Setup and Troubleshooting

Checking WebGPU Support

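// navigator.gpu can exist even when no usable adapter is available, so request one.
// Run this in a module or the DevTools console, where top-level await is allowed.
const adapter = navigator.gpu ? await navigator.gpu.requestAdapter() : null;
if (adapter) {
  console.log('WebGPU is supported');
} else {
  console.log('WebGPU is NOT supported; falling back to WASM');
}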

Browser Compatibility

Browser	WebGPU Support	Notes
Chrome 113+	Yes	Best performance
Edge 113+	Yes	Same engine as Chrome
Safari 17.4+	Yes	Supported on macOS and iOS
Firefox	Behind flag	Set `dom.webgpu.enabled` in about:config
Chrome Android	Partial	Available in Chrome 121+
Safari iOS 17.4+	Yes	Works on iPhone/iPad

When WebGPU is unavailable, Kokoro automatically falls back to WASM. Generation is slower (roughly 0.5–1x real-time on modern CPUs) but produces identical output.

Common Issues

Model loading fails: Clear IndexedDB data for the site, then reload. This fixes most loading issues caused by corrupted cache.

No audio output: Check that your browser’s audio autoplay policy allows playback. Most browsers require a user gesture (click) before playing audio.

Slow generation on WASM: Close other tabs using GPU/CPU resources. WASM inference on a 2020+ laptop generates 30 seconds of audio in roughly 30–60 seconds.
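
The speed figures in this guide are multiples of real time, so the wait for a clip is its duration divided by the multiple. A minimal sketch of that arithmetic, with `generation_seconds` as a hypothetical helper name:

```python
def generation_seconds(audio_seconds: float, speed_multiple: float) -> float:
    """Seconds needed to synthesize a clip at speed_multiple times real time."""
    if speed_multiple <= 0:
        raise ValueError("speed_multiple must be positive")
    return audio_seconds / speed_multiple


# A 30-second clip at the WASM range quoted in this guide (0.5x to 1x real time)
slowest = generation_seconds(30, 0.5)
fastest = generation_seconds(30, 1.0)
print(f"WASM: {fastest:.0f} to {slowest:.0f} seconds")
```

At 0.5x the clip takes 60 seconds and at 1x it takes 30 seconds, which matches the 30–60 second range given above.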

Quality Benchmarks

On the Artificial Analysis TTS Arena, Kokoro 82M v1.0 ranks 32nd out of 74 models with an Elo of 1056. Among models that can run in a browser, it ranks 1st.

Comparison	Kokoro Elo	Other Elo	Kokoro Win Rate
vs. Google WaveNet	1056	873	—
vs. Amazon Polly Neural	1056	868	—
vs. OpenAI TTS-1	1056	1102	35%
vs. ElevenLabs v3	1056	1178	31%
vs. Piper TTS	—	—	See benchmark

Kokoro’s strength is knowledge sharing (articles, documentation, educational content) — it scores Elo 1066 in that category. It is less competitive in entertainment (dialogue, character voices) where larger models with expressive training data dominate.
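
An Elo gap can be translated into an approximate head-to-head win rate. The sketch below uses the standard Elo logistic formula with a 400-point scale. Whether the arena fits its ratings with exactly this formula is an assumption, so treat the results as rough estimates that can differ from the measured win rates in the table.

```python
def expected_win_rate(rating: float, opponent: float) -> float:
    """Expected share of wins under the standard Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


KOKORO_ELO = 1056  # Elo values below are taken from the comparison table

for name, elo in [
    ("Google WaveNet", 873),
    ("Amazon Polly Neural", 868),
    ("OpenAI TTS-1", 1102),
    ("ElevenLabs v3", 1178),
]:
    print(f"vs. {name}: {expected_win_rate(KOKORO_ELO, elo):.0%}")
```

Equal ratings give 50%, and a higher-rated opponent pushes the estimate below 50%.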

Kokoro vs Piper vs Kitten

Kokoro is one of three browser-runnable TTS engines. Here’s how they compare:

Feature	Kokoro	Piper	Kitten
Parameters	82M	~20M	15M
Model size	82MB	75MB	24MB
Voices	88	904	8
Languages	9	1 (English)	1 (English)
WebGPU	Yes	No	Yes
WASM fallback	Yes	Yes	Yes
Sample rate	8–48kHz	22kHz fixed	8–48kHz
Quality grade	A/A-	B+	C+
Real-time (WebGPU)	1.5–2x	N/A	2–3x
License	Apache 2.0	MIT	Apache 2.0

For the full comparison with audio samples, see our browser TTS benchmark.

When to choose Kokoro: Quality matters, you need multiple languages, or you want the most voices.

When to choose Piper: You need 904 English voices, maximum speed on CPU, or Raspberry Pi deployment.

When to choose Kitten: You need the smallest possible model size (24MB) for embedded or mobile deployment.

Kokoro vs Cloud TTS Services

Feature	Kokoro (Browser)	ElevenLabs	OpenAI TTS-1	Google Cloud
Cost	Free	$5–330/mo	$15/1M chars	$4–16/1M chars
Sign-up	No	Yes	Yes	Yes
Offline	Yes	No	No	No
Privacy	On-device	Cloud	Cloud	Cloud
Voice count	88	100+	6	100+
Languages	9	29+	7	50+
Quality (Elo)	1056	1178	1102	1062 (Studio)

The trade-off is clear: cloud services offer more voices and higher Elo scores (1062–1178 vs. Kokoro’s 1056), but require accounts, cost money, and send your text to external servers. Kokoro delivers A-grade quality at zero cost, with all text processed on-device.

For a deeper comparison, see OfflineTTS vs ElevenLabs.

Voice Cloning with Kokoro

Kokoro does not include built-in voice cloning, but the community project KokoClone adds zero-shot cloning using a speaker encoder based on ECAPA-TDNN. You provide 3–10 seconds of reference audio, and KokoClone generates a speaker embedding that plugs into Kokoro’s decoder.

The cloned voice is a close approximation — not a perfect replica — but it runs on CPU with no training phase. For production-quality cloning, the Piper Training Suite offers fine-tuning with higher fidelity at the cost of a GPU and 2–4 hours of training time.

Use Cases

YouTube and Video Content

Kokoro’s Grade A voices (Heart, Bella, Nova) produce natural narration suitable for YouTube videos. The YouTube TTS guide covers the full workflow, including LUFS normalization for platform compliance.

Audiobook Production

For audiobooks, Kokoro excels at neutral narration but is weaker at dramatic character voices. The audiobook voice page has voice recommendations for long-form content.

E-Learning and Accessibility

Kokoro’s multilingual support (9 languages) and offline capability make it suitable for educational tools and accessibility applications. No internet dependency means it works in schools, training rooms, and environments with restricted connectivity.

Privacy-Sensitive Applications

Legal, medical, and corporate users benefit from Kokoro’s on-device processing. Text never leaves the browser, which helps with data sovereignty and compliance requirements. Our private TTS guide covers the compliance landscape.

Getting Started