Supertonic TTS

Free AI text-to-speech with Supertonic 3: five languages (English, Spanish, Portuguese, French, and Korean), 10 preset voice styles, WebGPU with a WASM fallback, and fully local browser inference after the model download.

Private generation · WAV + MP3 export · 5 languages · Local

Languages: en, es, pt, fr, ko

Styles: M1-M5 + F1-F5

Params: ~99M (public ONNX assets)

GPU+CPU: WebGPU + WASM fallback

TTS works best on desktop

Audio generation runs on WebGPU, with WASM as the fallback. Desktop Chrome or Edge gives the most reliable results.

About Supertonic TTS

Supertonic 3 is an on-device multilingual text-to-speech model from Supertone. This browser integration exposes the currently supported language set: English, Spanish, Portuguese, French, and Korean.

The browser integration runs the public ONNX assets through ONNX Runtime Web. It loads separate duration predictor, text encoder, vector estimator, and vocoder models, then performs local denoising and waveform generation directly in the browser.

Supertonic uses language-tagged Unicode text rather than a server-side phonemization API, so synthesis stays local after the initial model and voice-style downloads. The asset base URL is configurable and defaults to Hugging Face, making it straightforward to move model delivery to a CDN or R2 later.
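As a rough illustration of a configurable asset base, the sketch below resolves one URL per model from a single base URL. The default base and the four file names are placeholders invented for this example, not the integration's real paths. Only the idea (one base, four ONNX models) comes from the description above.

```typescript
// Placeholder default: the text only says the base URL defaults to Hugging Face.
// The org/repo path here is hypothetical.
const DEFAULT_ASSET_BASE =
  "https://huggingface.co/example-org/example-repo/resolve/main/";

// Hypothetical file names for the four models listed in the description.
const MODEL_FILES = {
  durationPredictor: "duration_predictor.onnx",
  textEncoder: "text_encoder.onnx",
  vectorEstimator: "vector_estimator.onnx",
  vocoder: "vocoder.onnx",
} as const;

type ModelName = keyof typeof MODEL_FILES;

function resolveAssetUrls(
  base: string = DEFAULT_ASSET_BASE,
): Record<ModelName, string> {
  // Without a trailing slash, URL resolution would replace the last path
  // segment of the base instead of appending to it.
  const normalized = base.endsWith("/") ? base : base + "/";
  const urls = {} as Record<ModelName, string>;
  for (const name of Object.keys(MODEL_FILES) as ModelName[]) {
    urls[name] = new URL(MODEL_FILES[name], normalized).toString();
  }
  return urls;
}
```

With this shape, moving delivery to another host would only mean passing a different base, for example resolveAssetUrls("https://cdn.example.com/supertonic").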

Compare engines: Kokoro TTS (54 voices · Best quality) · Kitten TTS (8 voices · Lightest) · Piper TTS (25 voices · Fastest CPU)

Getting Started with Supertonic TTS

1. Load the ONNX Models

Supertonic uses multiple ONNX assets for duration, text encoding, denoising, and vocoding. The first load downloads and caches the model files from the configured asset host.

2. Choose Language and Style

Pick the language that matches your text, then choose one of ten preset voice styles: five male and five female styles.

3. Set Steps and Speed

Use 8 steps as a practical default. Higher step counts can improve quality but take longer to generate. Speeds between roughly 0.9 and 1.5 are usually the useful range.
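One way to apply these recommendations in code is to normalize user input before synthesis. This is a minimal sketch, not the tool's actual validation: the step bounds (1 to 32), the default speed of 1.0, and the choice to clamp speed to 0.9-1.5 are assumptions made for the example.

```typescript
interface SynthesisSettings {
  steps: number;
  speed: number;
}

const DEFAULT_STEPS = 8; // default suggested in the text
const DEFAULT_SPEED = 1.0; // assumed neutral speed
const STEP_RANGE = { min: 1, max: 32 }; // assumed bounds, not from the tool
const SPEED_RANGE = { min: 0.9, max: 1.5 }; // "useful range" from the text

function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

function normalizeSettings(
  input: Partial<SynthesisSettings> = {},
): SynthesisSettings {
  // Fall back to defaults when a value is missing or not a finite number.
  const steps = Number.isFinite(input.steps)
    ? Math.round(input.steps as number)
    : DEFAULT_STEPS;
  const speed = Number.isFinite(input.speed)
    ? (input.speed as number)
    : DEFAULT_SPEED;
  return {
    steps: clamp(steps, STEP_RANGE.min, STEP_RANGE.max),
    speed: clamp(speed, SPEED_RANGE.min, SPEED_RANGE.max),
  };
}
```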

4. Generate and Export

Generate speech locally, preview it in the browser, then download WAV for editing or MP3 for sharing.
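For readers curious what WAV export involves, the sketch below packs floating-point samples into a mono 16-bit PCM WAV file using the standard 44-byte RIFF header. It is a generic encoder written for illustration. The tool's own exporter, sample rate, and channel layout are not documented here, so treat those as assumptions.

```typescript
// Encode mono float samples in [-1, 1] as a 16-bit PCM WAV file.
function encodeWav(samples: Float32Array, sampleRate: number): ArrayBuffer {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, 1, true); // one channel (mono)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clip to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    const scaled = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
    view.setInt16(44 + i * bytesPerSample, scaled, true);
  }
  return buffer;
}
```

In a browser, the returned buffer can be wrapped in a Blob with type "audio/wav" and offered as a download link.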

Tips for Supertonic TTS

Match the selected language to the input text. Supertonic wraps text in language tags before inference. Selecting the right language gives the model the best pronunciation context.
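To make the tagging idea concrete, here is a small sketch that validates the language code and wraps the text. The tag syntax shown is an assumption chosen for illustration, since the exact format Supertonic expects is not given here. Only the five language codes come from this page.

```typescript
const SUPPORTED_LANGUAGES = ["en", "es", "pt", "fr", "ko"] as const;
type Language = (typeof SUPPORTED_LANGUAGES)[number];

function isSupportedLanguage(code: string): code is Language {
  return (SUPPORTED_LANGUAGES as readonly string[]).includes(code);
}

// Hypothetical tag format: <lang>text</lang>. The real model input may differ.
function tagText(text: string, language: string): string {
  if (!isSupportedLanguage(language)) {
    throw new Error(`Unsupported language: ${language}`);
  }
  return `<${language}>${text.trim()}</${language}>`;
}
```

Rejecting unsupported codes up front avoids sending the model a tag it was never trained on.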

Start with eight denoising steps. Eight steps is the upstream browser demo default. Increase steps for quality checks and reduce them for faster drafts.

Use WebGPU when available. Chrome and Edge can run ONNX Runtime Web with WebGPU. The app falls back to WASM when WebGPU is unavailable.
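The fallback order can be expressed as a small helper that inspects the navigator object. This is a sketch of the general pattern rather than the app's actual detection code. The helper name and the NavigatorLike type are made up for the example, while "webgpu" and "wasm" are the execution provider names ONNX Runtime Web uses.

```typescript
type ExecutionProvider = "webgpu" | "wasm";

// Minimal structural type so the helper can be tested outside a browser.
interface NavigatorLike {
  gpu?: unknown;
}

// Prefer WebGPU when the browser exposes navigator.gpu; always keep WASM
// in the list so there is a CPU fallback.
function pickExecutionProviders(
  nav: NavigatorLike | undefined,
): ExecutionProvider[] {
  return nav && nav.gpu ? ["webgpu", "wasm"] : ["wasm"];
}
```

In a browser, the result would typically be passed as the executionProviders option when creating an ONNX Runtime Web inference session. Note that the presence of navigator.gpu does not guarantee a usable adapter, so a more careful check would also await navigator.gpu.requestAdapter() and fall back to WASM if it resolves to null.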

Keep long passages structured. The tool chunks long text and inserts short pauses. Paragraphs and punctuation help preserve natural pacing.
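The chunk-and-pause behavior can be approximated as follows. This is an illustrative sketch, not the tool's implementation: the 300-character limit, the 0.3-second pause, and the sentence-splitting rule are all assumptions.

```typescript
// Split text into chunks at paragraph and sentence boundaries.
// A single sentence longer than maxChars is kept whole, not split.
function chunkText(text: string, maxChars = 300): string[] {
  const chunks: string[] = [];
  for (const paragraph of text.split(/\n\s*\n/)) {
    const flat = paragraph.replace(/\s+/g, " ").trim();
    const sentences = flat.match(/[^.!?]+[.!?]*\s*/g) ?? [];
    let current = "";
    for (const sentence of sentences) {
      if (current && (current + sentence).trim().length > maxChars) {
        chunks.push(current.trim());
        current = "";
      }
      current += sentence;
    }
    if (current.trim()) chunks.push(current.trim());
  }
  return chunks;
}

// Concatenate audio segments with a stretch of silence between them.
function joinWithPauses(
  segments: Float32Array[],
  sampleRate: number,
  pauseSeconds = 0.3,
): Float32Array {
  const pause = Math.round(sampleRate * pauseSeconds);
  const gaps = Math.max(0, segments.length - 1);
  const total = segments.reduce((n, s) => n + s.length, 0) + pause * gaps;
  const out = new Float32Array(total); // zero-filled, so gaps are silent
  let offset = 0;
  segments.forEach((segment, i) => {
    out.set(segment, offset);
    offset += segment.length + (i < segments.length - 1 ? pause : 0);
  });
  return out;
}
```

Because chunk boundaries fall on paragraph breaks and sentence-ending punctuation in this sketch, well-punctuated input yields chunks that end at natural pause points.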