Frequently Asked Questions

Everything you need to know about OfflineTTS: free AI text to speech and speech to text tools.

General

How does OfflineTTS work?

OfflineTTS runs AI models directly in your browser using WebGPU or WebAssembly. For text to speech, choose from three engines: Kokoro TTS (54 voices, highest quality), Kitten TTS (8 expressions, lightest), or Piper TTS (25+ voices, fastest on CPU). For speech to text, use Whisper STT (99 languages, streaming transcription). Your data never leaves your device.

Is it really free?

Yes, 100% free. The AI models run on your device, so there are no server costs. All three engines are open-source (Kokoro: Apache 2.0, Piper: MIT, Kitten: open-source). No subscriptions, no per-character charges, no hidden fees.

Does it work offline?

After the initial model download (~90MB for the Small model, cached in your browser), English TTS works completely offline. Non-English TTS requires an internet connection for text-to-phoneme conversion (a tiny API call), but audio synthesis still runs on your device. STT works fully offline.

What browsers are supported?

Chrome 113+, Edge 113+, and Safari 17.4+ support WebGPU for fastest performance. All modern browsers support the WASM fallback.

What is text to speech (TTS)?

Text to speech (TTS) is technology that converts written text into spoken audio. Modern TTS uses neural network models to generate natural-sounding human speech. Unlike older robotic-sounding systems, neural TTS models like Kokoro, Kitten, and Piper produce speech that sounds like a real person reading your text aloud.

What is speech to text (STT)?

Speech to text (STT), also called speech recognition or transcription, converts spoken audio into written text. OfflineTTS uses OpenAI's Whisper model, which supports 99 languages and produces transcriptions with word-level timestamps. This is useful for creating subtitles, meeting notes, and making audio content searchable.

Voices & Languages

What voices are available?

88 voices across 9 languages for TTS, plus 99 languages for speech to text. Choose from 3 TTS engines: Kokoro TTS (54 voices, highest quality), Piper TTS (26 voices, fastest on CPU), and Kitten TTS (8 expression voices, lightest model). For STT, Whisper provides accurate transcription with word-level timestamps.

What languages are supported for TTS?

TTS supports 9 languages: American English (20 voices), British English (8 voices), Japanese (5 voices), Mandarin Chinese (8 voices), Spanish (3 voices), French (1 voice), Hindi (4 voices), Italian (2 voices), and Brazilian Portuguese (3 voices).

What languages are supported for STT?

Speech-to-text supports 99 languages including English, Spanish, French, German, Chinese, Japanese, Korean, Arabic, Hindi, Portuguese, Russian, Italian, Dutch, and many more. Whisper automatically detects the spoken language.

Which voice should I use?

For English: Heart (A-rated, warm storytelling), Bella (A-rated, energetic vlogs), Michael (B-rated, professional reviews). For other languages, each has curated voices optimized for natural pronunciation. Try different voices to find the one that matches your content style.

What are voice quality ratings?

Kokoro TTS voices are rated on a quality scale from A (best) to D (lowest). A-rated voices like Heart and Bella produce the most natural-sounding speech with proper intonation and rhythm. B-rated voices are still good quality for most use cases. Voice quality depends on the training data and model architecture.

What are Kitten TTS expression voices?

Kitten TTS uses 8 expression embeddings instead of individual voice models: cheerful, serious, sad, whisper, excited, gentle, calm, and neutral. Each expression shapes the tone and delivery style of the output. This approach gives you creative control over the emotional character of the speech while keeping the model extremely lightweight (24MB).

Privacy & Data

Is my text data safe?

English TTS is fully offline: no data leaves your browser. For non-English TTS (Japanese, Chinese, Spanish, French, Hindi, Italian, Portuguese), your text is sent to our phonemization server, which converts it to pronunciation data (IPA phonemes) and returns it. The server does not log or store any text. Audio synthesis always happens on your device. STT (speech to text) is fully offline.

What data do you collect?

We do not collect personal data, text inputs, audio inputs, audio outputs, or usage patterns. There are no accounts and no cookies that identify you personally. We use Google Analytics (GA4) and Microsoft Clarity for anonymized site usage data only. See our Privacy Policy for full details.

Can I use generated speech commercially?

Yes. Kokoro TTS is Apache 2.0 licensed, Piper TTS is MIT licensed, and Kitten TTS is open-source; all permit commercial use. You can use the generated audio for videos, podcasts, audiobooks, and any commercial project without restrictions.

What is phonemization and why does it need a server?

Phonemization converts written text into IPA phoneme strings, the pronunciation data that the TTS model uses to generate speech. Kokoro's browser library only handles English natively. For other languages, specialized models (misaki for Japanese/Chinese, espeak-ng for others) are too large to bundle in a browser (~50MB+), so we run them on a lightweight server. The server receives only plain text, returns phonemes in ~10ms, and discards everything immediately.
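The local-versus-server split described above can be pictured as a small routing step. The sketch below is an illustration only, not OfflineTTS's actual code: the function name `phonemizationRoute` and the language codes are assumptions chosen to match the languages listed in this FAQ.

```typescript
type Route = "local" | "server";

// Assumed language codes for the non-English TTS languages named in this FAQ.
const SERVER_LANGUAGES = new Set(["ja", "zh", "es", "fr", "hi", "it", "pt"]);

// Decide where text-to-phoneme conversion would run for a language tag.
// English variants (en-US, en-GB) stay in the browser; the other supported
// languages go to the phonemization server; anything else is rejected.
function phonemizationRoute(lang: string): Route {
  const base = lang.toLowerCase().replace(/-.*$/, ""); // "pt-BR" -> "pt"
  if (base === "en") return "local";
  if (SERVER_LANGUAGES.has(base)) return "server";
  throw new Error(`Unsupported TTS language: ${lang}`);
}
```

Whichever route is taken, only the phoneme lookup differs; audio synthesis stays on your device.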

Technical

What audio formats can I export?

You can export audio as WAV (lossless, studio-quality) or MP3 (compressed, smaller file size). WAV is recommended for further audio editing; MP3 is great for direct use in videos and podcasts.
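For the curious, a WAV file is just a 44-byte header followed by raw samples. The sketch below shows one common way to wrap mono floating-point audio as 16-bit PCM WAV. It is a minimal illustration and not necessarily how OfflineTTS encodes its exports; the function name `encodeWav` is hypothetical.

```typescript
// Encode mono float samples (range -1..1) as a 16-bit PCM WAV file.
function encodeWav(samples: Float32Array, sampleRate: number): ArrayBuffer {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeAscii = (offset: number, s: string): void => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);                // file size minus 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                          // fmt chunk size
  view.setUint16(20, 1, true);                           // format 1 = PCM
  view.setUint16(22, 1, true);                           // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

For example, `encodeWav(samples, 24000)` would wrap audio at the 24kHz rate that Kokoro TTS outputs.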

How much text can I convert at once?

Up to 50,000 characters per session. Longer texts are automatically split into chunks and processed sequentially with natural pauses between segments.

Why is WebGPU recommended?

WebGPU generates speech 3-5x faster than the WASM fallback. Chrome 113+ and Edge 113+ support WebGPU. In any browser where WebGPU is not available, including Safari versions that lack it, the tool automatically falls back to WASM.
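A fallback of this kind usually comes down to a short feature check. The sketch below is a minimal illustration, not OfflineTTS's actual code: `pickBackend`, `NavigatorLike`, and `GpuLike` are hypothetical names. The only browser API it relies on is `navigator.gpu.requestAdapter()`, which resolves to null when no suitable GPU adapter is available.

```typescript
// Minimal structural types so the sketch compiles without WebGPU type packages.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type Backend = "webgpu" | "wasm";

// Returns "webgpu" only when the API exists AND an adapter is actually
// available; every other case (no API, null adapter, error) yields "wasm".
async function pickBackend(nav: NavigatorLike | undefined): Promise<Backend> {
  if (!nav || !nav.gpu) return "wasm";
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    return "wasm";
  }
}
```

In a browser you would pass in the global `navigator` object. Checking for an adapter, rather than only for the presence of `navigator.gpu`, covers devices where the API exists but no usable GPU is exposed.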

What model sizes are available?

Kokoro TTS offers three model sizes: q4 (~90MB, fast), q8 (~300MB, balanced), and fp32 (~600MB, highest quality). For STT, Whisper offers Tiny (~40MB, fastest), Base (~76MB, balanced), and Small (~240MB, best accuracy).

Does it work on mobile?

TTS and STT work best on desktop browsers with WebGPU support. Mobile browsers may have limited WebGPU/WASM support and could produce errors. For the best experience, use Chrome or Edge on a desktop or laptop.

What is the difference between Kokoro, Kitten, and Piper TTS engines?

Kokoro TTS: 82M params, 54 voices, 9 languages, highest quality, WebGPU+WASM, ~90-600MB model. Best for production-quality output. Kitten TTS: 15M params, 8 expressions, English only, lightest model at 24MB. Best for quick prototyping and devices with limited resources. Piper TTS: VITS architecture, 26 curated voices from a 904-speaker dataset, WASM-only at ~75MB. Best for CPU-only environments and maximum voice variety.

How are the models downloaded and cached?

Models are downloaded over HTTPS on first use and cached in your browser's IndexedDB storage. Kokoro models are served from Cloudflare R2 (CDN) and Hugging Face. Subsequent visits load instantly from cache without re-downloading. You can clear the cache through your browser's storage settings.
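The download-once, load-from-cache behavior follows a common cache-first pattern. The sketch below is a simplified illustration, not OfflineTTS's actual code. It uses an in-memory `Map` as a stand-in for IndexedDB so it can run anywhere, and the names `ModelStore`, `MemoryStore`, and `loadModel` are hypothetical.

```typescript
// Anything that can store and return model bytes by key.
interface ModelStore {
  get(key: string): Promise<ArrayBuffer | undefined>;
  put(key: string, data: ArrayBuffer): Promise<void>;
}

// In-memory stand-in for the browser's IndexedDB storage (assumption for the sketch).
class MemoryStore implements ModelStore {
  private readonly items = new Map<string, ArrayBuffer>();
  async get(key: string): Promise<ArrayBuffer | undefined> {
    return this.items.get(key);
  }
  async put(key: string, data: ArrayBuffer): Promise<void> {
    this.items.set(key, data);
  }
}

// Cache-first load: return cached bytes if present, otherwise download,
// store the result, and return it.
async function loadModel(
  url: string,
  store: ModelStore,
  download: (url: string) => Promise<ArrayBuffer>,
): Promise<ArrayBuffer> {
  const cached = await store.get(url);
  if (cached !== undefined) return cached;
  const data = await download(url);
  await store.put(url, data);
  return data;
}
```

With this shape, the second call for the same URL never touches the network, which is why repeat visits load from cache and why offline use works once the model has been fetched.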

What is the sample rate of the generated audio?

Kokoro TTS outputs at 24kHz. Kitten TTS is configurable from 8kHz to 48kHz. Piper TTS has a fixed 22.05kHz sample rate. For STT, audio input at any common sample rate is accepted; the tool handles resampling automatically.
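To give a feel for what resampling involves, here is a minimal linear-interpolation sketch. It is an illustration under stated assumptions, not the tool's actual resampler: the name `resampleLinear` is hypothetical, and production resamplers typically add a low-pass filter to avoid aliasing when downsampling. Whisper models expect 16kHz mono input, so that is the usual target rate for STT.

```typescript
// Resample mono audio from one sample rate to another by linear interpolation.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.round((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.min(Math.floor(pos), input.length - 1);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    // Blend the two neighboring input samples by their distance to pos.
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}
```

For example, `resampleLinear(recording, 48000, 16000)` would reduce a 48kHz recording to one third of its original sample count.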

How does the text chunking work for long texts?

When you enter text longer than the model's optimal chunk size, it is automatically split into segments at sentence boundaries. Each chunk is processed independently and the results are concatenated with natural pauses. This ensures consistent quality even for very long texts like audiobook chapters.
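The split-then-join process can be sketched in a few lines. This is a simplified illustration, not OfflineTTS's actual code: `splitIntoChunks` and `joinWithPauses` are hypothetical names, the sentence detection is deliberately naive (it treats every ".", "!" or "?" as a sentence end), and the chunk size and pause length are left as parameters because this FAQ does not state the real values.

```typescript
// Split text into chunks of roughly maxChars, breaking only at sentence ends.
// A single sentence longer than maxChars becomes its own oversized chunk.
function splitIntoChunks(text: string, maxChars: number): string[] {
  // A "sentence" is a run of text up to and including ., ! or ? plus trailing spaces.
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (current.length > 0 && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}

// Concatenate per-chunk audio, inserting silence between consecutive chunks.
function joinWithPauses(parts: Float32Array[], sampleRate: number, pauseSeconds: number): Float32Array {
  const gap = Math.round(sampleRate * pauseSeconds);
  const audioLength = parts.reduce((n, p) => n + p.length, 0);
  const total = audioLength + gap * Math.max(0, parts.length - 1);
  const out = new Float32Array(total); // zero-filled, so the gaps are silence
  let offset = 0;
  parts.forEach((part, i) => {
    out.set(part, offset);
    offset += part.length + (i < parts.length - 1 ? gap : 0);
  });
  return out;
}
```

Breaking at sentence boundaries matters because each chunk is synthesized independently: a cut in the middle of a sentence would reset the intonation partway through it.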

Comparisons

Is OfflineTTS better than ElevenLabs?

OfflineTTS is completely free with no usage limits, works offline, and keeps your data private. ElevenLabs offers more voices and higher quality but charges per character and requires an internet connection. For most use cases (YouTube voiceovers, e-learning, audiobooks), OfflineTTS delivers comparable quality at zero cost.

How does OfflineTTS compare to NaturalReader?

OfflineTTS is free with no usage limits and works offline. NaturalReader charges $9.99/month for premium features and requires an internet connection. OfflineTTS offers 88 voices across 9 languages while NaturalReader has 60+ voices. OfflineTTS keeps all processing on your device.

How does OfflineTTS compare to Speechify?

OfflineTTS is free with unlimited usage, while Speechify charges per character. OfflineTTS works offline after model download, while Speechify requires an internet connection. Both offer natural-sounding AI voices, but OfflineTTS gives you full privacy since no data leaves your device.

How does OfflineTTS compare to browser built-in TTS?

Browser built-in TTS (SpeechSynthesis API) uses system voices that sound robotic and unnatural. OfflineTTS uses neural network models (Kokoro, Kitten, Piper) that produce natural, human-like speech. The quality difference is dramatic: neural TTS sounds like a real person, while system TTS sounds like a robot.

Use Cases

Can I use OfflineTTS for YouTube videos?

Yes. Generate voice-overs for YouTube videos, download as WAV, and import into your video editor (DaVinci Resolve, Premiere Pro, Final Cut, etc.). Heart (A-rated) is the top pick for educational content, Bella for vlogs, and Michael for review videos.

Can I create audiobooks with OfflineTTS?

Yes. Process one chapter at a time, export as WAV, then assemble in your DAW. Use the q8 or fp32 model for audiobook-quality output. Heart (A-rated) is the best voice for long-form narration. Since there are no per-character charges, your royalties stay yours.

Can I use OfflineTTS for e-learning?

Absolutely. Add voice narration to online courses, training materials, and educational content. Supports 9 languages for international audiences. Generate consistent, professional narration without hiring voice talent for every course update.

Can I use OfflineTTS for accessibility?

Yes. Convert text to speech for visually impaired users, create audio versions of written content, and add voice narration to any web content. The STT tool can also generate subtitles (SRT/VTT) for making video content accessible.
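If you are wondering how timestamps become subtitles, the sketch below turns timed transcript segments into SRT text. It is an illustration only, not the STT tool's actual exporter: the `Segment` shape and the names `formatSrtTime` and `toSrt` are assumptions.

```typescript
// One subtitle cue; times are in seconds from the start of the audio.
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Format seconds as the SRT timestamp HH:MM:SS,mmm.
function formatSrtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n: number, width: number): string => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Build an SRT document: numbered cues separated by blank lines.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) => `${i + 1}\n${formatSrtTime(seg.start)} --> ${formatSrtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join("\n");
}
```

VTT output is very similar: it starts with a `WEBVTT` header line and uses a period instead of a comma before the milliseconds.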

Still Have Questions?

Contact us and we'll help you out.