Browser TTS workspace

Kokoro TTS

Free AI text to speech with Kokoro TTS. 54 voices across 9 languages, WebGPU + WASM, highest quality. Runs entirely in your browser, offline and private.

Private generation · WAV + MP3 export · Default TTS workspace
~300 MB model (q4/q8/fp32)
54 voices (9 languages)
82M params (StyleTTS 2)
WebGPU + WASM fallback (GPU+CPU)

TTS works best on desktop

Audio generation uses WebGPU/WASM. Desktop Chrome or Edge gives the most reliable result.

About Kokoro TTS

Kokoro TTS is the flagship engine on OfflineTTS, offering 54 voices across 9 languages, including English, Japanese, Chinese, Spanish, French, Hindi, Italian, and Portuguese. Powered by an 82M-parameter StyleTTS 2 model with ISTFTNet, it delivers the highest-quality speech synthesis available in a browser.

It supports multiple model sizes (q4 ~90MB, q8 ~300MB, fp32 ~600MB) and runs on both WebGPU and WASM backends, automatically selecting the fastest option for your device.
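The backend selection described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function name `detectBackend` is hypothetical, and the only check it performs is whether the browser exposes `navigator.gpu`.

```typescript
type Backend = "webgpu" | "wasm";

// Hypothetical helper: prefer WebGPU when the browser exposes navigator.gpu,
// otherwise fall back to WASM. The scope parameter exists only so the
// function can be exercised outside a browser.
function detectBackend(scope: object = globalThis): Backend {
  const nav = (scope as { navigator?: unknown }).navigator;
  const hasWebGPU = typeof nav === "object" && nav !== null && "gpu" in nav;
  // Note: the presence of navigator.gpu does not guarantee a usable adapter.
  // A real implementation would likely also await navigator.gpu.requestAdapter()
  // and fall back to WASM if it resolves to null.
  return hasWebGPU ? "webgpu" : "wasm";
}
```

In a browser, calling `detectBackend()` with no argument inspects the real `navigator` object.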

Compare engines: Kitten TTS (8 voices, 24MB, lightest) · Piper TTS (25 voices, fastest CPU) · Supertonic TTS (5 languages, local inference)

Getting Started with Kokoro TTS

New to AI text to speech? Here's how to get the best results from Kokoro TTS in under two minutes.

1. Choose Your Model Size

Start with the q4 model (~90MB) for quick testing. Switch to q8 (~300MB) for production quality. The fp32 model (~600MB) delivers the highest quality but takes longer to download.
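As a rough illustration of this trade-off, the sketch below picks the largest model variant that fits a download budget. The helper `pickPrecision` and the idea of a "budget" are assumptions made for the example; the sizes are the approximate figures quoted on this page.

```typescript
type Precision = "q4" | "q8" | "fp32";

// Approximate download sizes in MB, ordered from largest to smallest.
const MODEL_SIZES_MB: ReadonlyArray<[Precision, number]> = [
  ["fp32", 600],
  ["q8", 300],
  ["q4", 90],
];

// Hypothetical helper: return the largest variant that fits the budget,
// falling back to q4 (the smallest) when nothing fits.
function pickPrecision(budgetMB: number): Precision {
  for (const [precision, sizeMB] of MODEL_SIZES_MB) {
    if (sizeMB <= budgetMB) return precision;
  }
  return "q4";
}
```

For example, a 100 MB budget selects q4, while a 300 MB budget selects q8.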

2. Pick a Voice

Heart (A-rated) is the best all-rounder for English. Bella (A-rated) adds more expressiveness. Browse all 54 voices to find the tone that matches your project.

3. Write Your Script

Use proper punctuation: commas add pauses, periods create full stops, question marks raise pitch. Well-punctuated text produces the most natural speech.

4. Generate & Download

Click generate, wait for the audio to play, then download as WAV (lossless) or MP3 (compressed). WAV is recommended for further editing.
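For readers curious what a WAV export involves, here is a minimal sketch of writing uncompressed 16-bit PCM audio with a standard 44-byte RIFF header. It is not the tool's own exporter: the function name `encodeWav`, the mono layout, and the 16-bit depth are assumptions for illustration, and the sample rate is passed in by the caller.

```typescript
// Encode mono float samples in the range [-1, 1] as a 16-bit PCM WAV file.
function encodeWav(samples: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataBytes = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataBytes);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, 1, true); // channel count (mono, assumed)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataBytes, true);

  // Clamp each sample and scale it to the signed 16-bit range.
  samples.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    view.setInt16(44 + i * bytesPerSample, Math.round(clamped * 32767), true);
  });

  return new Uint8Array(buffer);
}
```

In a browser, the returned bytes could be wrapped in a `Blob` with type `audio/wav` and offered as a download.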

Tips for Best TTS Quality

1. Use WebGPU for speed. Chrome 113+ and Edge 113+ support WebGPU, which generates speech 3-5x faster than WASM. The tool auto-detects and uses the best available backend.

2. Punctuate properly. This is the single most important factor for natural-sounding speech. Commas, periods, question marks, and exclamation marks all create distinct prosodic effects.

3. Break long text into paragraphs. The tool handles up to 50,000 characters, but shorter paragraphs with clear punctuation produce better rhythm and pacing.

4. Try multiple voices. Different voices suit different content types. Heart excels at warm narration, Bella at energetic delivery, Michael at professional reviews.

5. Use WAV for production. WAV preserves full audio quality for editing. MP3 is fine for quick sharing, but use WAV if you plan to mix, master, or further process the audio.
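The advice on breaking long text into paragraphs can also be applied programmatically before pasting a script in. The sketch below is one possible approach, not the tool's internal logic: `chunkScript` and the default chunk length of 400 characters are hypothetical, and only the 50,000-character limit comes from this page.

```typescript
const MAX_INPUT_CHARS = 50_000; // overall limit stated in the tips above

// Split a paragraph into sentences, keeping terminal punctuation attached.
// A period counts as a sentence end only when followed by whitespace or the
// end of the text, so decimals such as "3.5" stay intact.
function splitSentences(paragraph: string): string[] {
  const parts = paragraph.match(/[\s\S]+?(?:[.!?]+(?=\s|$)|$)/g) ?? [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Hypothetical helper: break a script into chunks of at most maxChunkChars,
// never merging across blank-line paragraph breaks. A single sentence longer
// than the limit is kept whole rather than cut mid-sentence. Whitespace
// between sentences is normalized to a single space.
function chunkScript(text: string, maxChunkChars = 400): string[] {
  if (text.length > MAX_INPUT_CHARS) {
    throw new RangeError(`Script exceeds ${MAX_INPUT_CHARS} characters`);
  }
  const chunks: string[] = [];
  for (const paragraph of text.split(/\n\s*\n/)) {
    let current = "";
    for (const sentence of splitSentences(paragraph)) {
      const candidate = current === "" ? sentence : `${current} ${sentence}`;
      if (current === "" || candidate.length <= maxChunkChars) {
        current = candidate;
      } else {
        chunks.push(current);
        current = sentence;
      }
    }
    if (current !== "") chunks.push(current);
  }
  return chunks;
}
```

Each returned chunk can then be placed in its own paragraph, which keeps sentence boundaries and punctuation intact.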