Kokoro TTS
Free AI text-to-speech with Kokoro TTS: 54 voices across 9 languages, running on WebGPU or WASM. Audio is synthesized entirely in your browser, and English works fully offline after the model download.
TTS works best on desktop
Audio generation uses WebGPU/WASM. Desktop Chrome or Edge gives the most reliable results.
About Kokoro TTS
Kokoro TTS is the flagship engine on OfflineTTS, offering 54 voices across 9 languages: American and British English, Japanese, Chinese, Spanish, French, Hindi, Italian, and Portuguese. Powered by an 82M-parameter model based on the StyleTTS 2 architecture with an ISTFTNet vocoder, it delivers the highest-quality speech synthesis on OfflineTTS, entirely in the browser.
English voices use kokoro-js's built-in phonemizer and work fully offline after the model download. For non-English languages, a lightweight server API converts text to IPA phonemes (using misaki for Japanese/Chinese and espeak-ng for others), then audio synthesis runs locally via WebGPU/WASM. The server receives only plain text and returns phoneme strings — no audio is sent, no data is stored.
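The routing described above can be sketched as a small lookup. This is only an illustration, not the site's actual code: the function name, the language labels, and the shape of the returned object are all hypothetical.

```typescript
// Hypothetical sketch: names and types here are assumptions for illustration.
type Language =
  | "English" | "Japanese" | "Chinese" | "Spanish"
  | "French" | "Hindi" | "Italian" | "Portuguese";

interface PhonemizerPlan {
  phonemizer: "kokoro-js built-in" | "misaki" | "espeak-ng";
  runsOn: "browser" | "server";
  // True when plain text has to be sent to the phonemization API.
  sendsTextToServer: boolean;
}

function planPhonemization(language: Language): PhonemizerPlan {
  // English is phonemized locally, so nothing leaves the device.
  if (language === "English") {
    return { phonemizer: "kokoro-js built-in", runsOn: "browser", sendsTextToServer: false };
  }
  // Japanese and Chinese go through misaki; the remaining languages use espeak-ng.
  const phonemizer =
    language === "Japanese" || language === "Chinese" ? "misaki" : "espeak-ng";
  return { phonemizer, runsOn: "server", sendsTextToServer: true };
}
```

In every case the audio itself is still synthesized locally; only the text-to-phoneme step differs.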
Kokoro TTS supports three model sizes (q4 ~90MB, q8 ~300MB, fp32 ~600MB) and runs on both WebGPU and WASM backends, automatically selecting the fastest option for your device.
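As a rough sketch of how backend and model selection might work, the helpers below check for WebGPU and pick the largest model that fits a download budget. The function names and the budget-based rule are assumptions made for illustration; the sizes are the approximate figures quoted above, and the tool's real selection logic may differ.

```typescript
// Hypothetical sketch: helper names and the selection rule are assumptions.
type Backend = "webgpu" | "wasm";
type ModelSize = "q4" | "q8" | "fp32";

// Approximate download sizes in MB, as quoted in the text.
const MODEL_SIZE_MB: Record<ModelSize, number> = { q4: 90, q8: 300, fp32: 600 };

// Takes any navigator-like object so the check also works outside a browser.
function detectBackend(nav: object | undefined): Backend {
  const hasGpu =
    nav !== undefined && "gpu" in nav && (nav as { gpu?: unknown }).gpu != null;
  return hasGpu ? "webgpu" : "wasm";
}

// Returns the largest model within the budget, falling back to q4.
function pickModel(maxDownloadMb: number): ModelSize {
  const largestFirst: ModelSize[] = ["fp32", "q8", "q4"];
  return largestFirst.find((m) => MODEL_SIZE_MB[m] <= maxDownloadMb) ?? "q4";
}
```

Finding `navigator.gpu` only shows that the WebGPU API is exposed. A more careful check would also await `navigator.gpu.requestAdapter()` and fall back to WASM if it resolves to `null`.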
Getting Started with Kokoro TTS
1. Choose Your Model Size
Start with the q4 model (~90MB) for quick testing. Switch to q8 (~300MB) for production quality. The fp32 model (~600MB) delivers the highest quality but takes longer to download.
2. Pick a Voice
Heart (A-rated) is the best all-rounder for English. Bella (A-rated) adds more expressiveness. Browse all 54 voices across 9 languages to find the tone that matches your project.
3. Write Your Script
Use proper punctuation — commas add pauses, periods create full stops, question marks raise pitch. Well-punctuated text produces the most natural speech.
4. Generate & Download
Click generate, then download as WAV (lossless, for editing) or MP3 (compressed, for sharing). All processing happens on your device.
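To illustrate why WAV export is lossless and can happen entirely on the device, here is a minimal sketch of packing raw samples into a WAV file. It assumes mono 16-bit PCM and float samples in the range -1 to 1; the function name is hypothetical, and the tool may use a different encoder or the export helpers in kokoro-js instead.

```typescript
// Hypothetical sketch: encodes mono float samples as a 16-bit PCM WAV file.
function encodeWav(samples: Float32Array, sampleRate: number): ArrayBuffer {
  const bytesPerSample = 2; // 16-bit PCM (an assumption for this sketch)
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte header + audio data
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // remaining file size after this field
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, 1, true); // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // bytes per second
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  return buffer;
}
```

In a browser, the returned buffer can be wrapped in `new Blob([buffer], { type: "audio/wav" })` and offered as a download through `URL.createObjectURL`, so no audio needs to leave the device.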
Tips for Kokoro TTS
Use the right model size for your needs. q4 (~90MB) is great for quick drafts and testing. q8 (~300MB) is the sweet spot for production audio. fp32 (~600MB) delivers maximum quality for studio-grade output.
English works fully offline. After the initial model download, English TTS never contacts any server. Non-English TTS sends only plain text for phonemization — audio synthesis stays local.
Use WebGPU for speed. Chrome and Edge support WebGPU, which generates speech 3-5x faster than WASM. The tool auto-detects and uses the best available backend.
Try multiple voices for different content. Heart excels at warm narration, Bella at energetic delivery, Michael at professional reviews. Each of the 54 voices has a distinct character suited to different use cases.