Question 1

How does OfflineTTS work?

Accepted Answer

OfflineTTS runs AI models directly in your browser using WebGPU or WebAssembly. For text to speech, choose from four engines: Kokoro TTS (54 voices, highest quality), Kitten TTS (8 expressions, lightest), Piper TTS (25+ voices, fastest on CPU), or Supertonic TTS (10 preset styles across 5 languages). For speech to text, use Whisper STT (99 languages, streaming transcription).

Question 2

Is it really free?

Accepted Answer

Yes, 100% free. The AI models run on your device, so there are no per-generation server costs. OfflineTTS includes Kokoro, Kitten, Piper, and Supertonic for TTS plus Whisper for STT. No subscriptions, no per-character charges, no hidden fees.

Question 3

Does it work offline?

Accepted Answer

After the initial model download (~90MB for the smallest Kokoro model, cached in your browser), English TTS works completely offline. Non-English TTS requires an internet connection for text-to-phoneme conversion (a small API call), but audio synthesis still runs on your device. STT works fully offline once its model is downloaded.

Question 4

What browsers are supported?

Accepted Answer

Chrome 113+, Edge 113+, and Safari 17.4+ support WebGPU for fastest performance. All modern browsers support the WASM fallback.

Question 5

What is text to speech (TTS)?

Accepted Answer

Text to speech (TTS) is technology that converts written text into spoken audio. Modern TTS uses neural network models to generate natural-sounding human speech. Unlike older robotic-sounding systems, neural TTS models like Kokoro, Kitten, Piper, and Supertonic produce speech that sounds closer to a real person reading your text aloud.

Question 6

What is speech to text (STT)?

Accepted Answer

Speech to text (STT), also called speech recognition or transcription, converts spoken audio into written text. OfflineTTS uses OpenAI's Whisper model, which supports 99 languages and produces transcriptions with word-level timestamps. This is useful for creating subtitles, meeting notes, and making audio content searchable.

Question 7

What voices are available?

Accepted Answer

OfflineTTS offers 98 voice options and styles across 10 TTS language options, plus 99 languages for speech to text. Choose from 4 TTS engines: Kokoro TTS (54 voices, highest quality), Piper TTS (25+ voices, fastest on CPU), Kitten TTS (8 expression voices, lightest model), and Supertonic TTS (10 preset styles across English, Spanish, Portuguese, French, and Korean). For STT, Whisper provides accurate transcription with word-level timestamps.

Question 8

What languages are supported for TTS?

Accepted Answer

Kokoro TTS covers 9 language options with 54 voices in total: American English (20 voices), British English (8 voices), Japanese (5 voices), Mandarin Chinese (8 voices), Spanish (3 voices), French (1 voice), Hindi (4 voices), Italian (2 voices), and Brazilian Portuguese (3 voices). Supertonic TTS adds Korean, which brings the total to 10 TTS language options.

Question 9

What languages are supported for STT?

Accepted Answer

Speech-to-text supports 99 languages including English, Spanish, French, German, Chinese, Japanese, Korean, Arabic, Hindi, Portuguese, Russian, Italian, Dutch, and many more. Whisper automatically detects the spoken language.

Question 10

Which voice should I use?

Accepted Answer

For English: Heart (A-rated, warm storytelling), Bella (A-rated, energetic vlogs), Michael (B-rated, professional reviews). For other languages, each has curated voices optimized for natural pronunciation. Try different voices to find the one that matches your content style.

Question 11

What are voice quality ratings?

Accepted Answer

Kokoro TTS voices are rated on a quality scale from A (best) to D (lowest). A-rated voices like Heart and Bella produce the most natural-sounding speech with proper intonation and rhythm. B-rated voices are still good quality for most use cases. Voice quality depends on the training data and model architecture.

Question 12

What are Kitten TTS expression voices?

Accepted Answer

Kitten TTS uses 8 expression embeddings instead of individual voice models: cheerful, serious, sad, whisper, excited, gentle, calm, and neutral. Each expression shapes the tone and delivery style of the output. This approach gives you creative control over the emotional character of the speech while keeping the model extremely lightweight (24MB).

Question 13

Is my text data safe?

Accepted Answer

English TTS is fully offline — no data leaves your browser. For non-English TTS (Japanese, Chinese, Spanish, French, Hindi, Italian, Portuguese), your text is sent to our phonemization server which converts it to pronunciation data (IPA phonemes) and returns it. The server does not log or store any text. Audio synthesis always happens on your device. STT (speech to text) is fully offline.
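The routing described above can be sketched as a small decision function. This is only an illustration of the rule "English stays local, other languages use the phonemization server"; the language codes, type names, and function name are assumptions and not OfflineTTS's actual identifiers.

```typescript
// Hypothetical short codes for the TTS languages named in the answer above.
type TtsLanguage = "en" | "ja" | "zh" | "es" | "fr" | "hi" | "it" | "pt";

interface PhonemizationPlan {
  // Where text-to-phoneme conversion runs.
  phonemizer: "browser" | "server";
  // True when the input text has to leave the device for phonemization.
  textLeavesDevice: boolean;
  // Audio synthesis always runs locally, per the answer above.
  synthesis: "browser";
}

function planPhonemization(lang: TtsLanguage): PhonemizationPlan {
  const local = lang === "en";
  return {
    phonemizer: local ? "browser" : "server",
    textLeavesDevice: !local,
    synthesis: "browser",
  };
}
```

For example, `planPhonemization("ja")` reports that the text is sent to the server for phonemes while synthesis stays in the browser.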

Question 14

What data do you collect?

Accepted Answer

We do not collect personal data, text inputs, audio inputs, audio outputs, or usage patterns. There are no accounts and no cookies that identify you personally. We use Google Analytics (GA4) and Microsoft Clarity for anonymized site usage data only. See our Privacy Policy for full details.

Question 15

Can I use generated speech commercially?

Accepted Answer

In most creator workflows, yes: you can download and use generated audio in videos, podcasts, audiobooks, and commercial projects. Kokoro and Piper use permissive upstream licenses; Kitten and Supertonic are also available as local TTS engines, but you should check the upstream model terms for the exact engine you use before large-scale commercial deployment.

Question 16

What is phonemization and why does it need a server?

Accepted Answer

Phonemization converts written text into IPA phoneme strings, the pronunciation data that the TTS model uses to generate speech. Kokoro's browser library only handles English natively. For other languages, the specialized phonemizers (misaki for Japanese/Chinese, espeak-ng for others) are too large to bundle in a browser (~50MB+), so we run them on a lightweight server. The server receives only plain text, returns phonemes in ~10ms, and discards everything immediately.

Question 17

What audio formats can I export?

Accepted Answer

You can export audio as WAV (lossless, studio-quality) or MP3 (compressed, smaller file size). WAV is recommended for further audio editing; MP3 is great for direct use in videos and podcasts.
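To show what a WAV export involves, here is a minimal sketch that wraps raw samples in a standard 44-byte RIFF/WAVE header. It assumes mono audio stored as floats in the range -1 to 1 and writes 16-bit PCM; the function name is hypothetical, and the tool's own exporter may use a different bit depth or channel layout.

```typescript
// Encode mono Float32 samples in [-1, 1] as a 16-bit PCM WAV file.
function encodeWav(samples: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk for PCM
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, 1, true); // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  samples.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    view.setInt16(44 + i * bytesPerSample, Math.round(clamped * 32767), true);
  });
  return new Uint8Array(buffer);
}
```

In a browser, the returned bytes could be wrapped in a `Blob` with type `audio/wav` and offered as a download.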

Question 18

How much text can I convert at once?

Accepted Answer

Up to 50,000 characters per session. Longer texts are automatically split into chunks and processed sequentially with natural pauses between segments.

Question 19

Why is WebGPU recommended?

Accepted Answer

WebGPU generates speech 3-5x faster than the WASM fallback. Chrome 113+ and Edge 113+ support WebGPU. If WebGPU is not available, as in Safari versions without it, the tool automatically falls back to WASM.
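The automatic fallback can be sketched as a capability check. The snippet below assumes the standard `navigator.gpu.requestAdapter()` call from the WebGPU API; the `Backend` type, the `GpuCapableNavigator` interface, and the function name are illustrative, and the tool's real detection logic may differ.

```typescript
type Backend = "webgpu" | "wasm";

// Minimal shape of the parts of `navigator` that this sketch needs.
interface GpuCapableNavigator {
  gpu?: { requestAdapter(): Promise<object | null> };
}

// Prefer WebGPU when an adapter is actually available; otherwise use WASM.
async function pickBackend(nav: GpuCapableNavigator): Promise<Backend> {
  if (!nav.gpu) return "wasm"; // the browser does not expose WebGPU at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm"; // a null adapter means no usable GPU
  } catch {
    return "wasm"; // treat any failure as "WebGPU unavailable"
  }
}
```

In a browser this would be called as `pickBackend(navigator as GpuCapableNavigator)`. Passing the navigator in as a parameter keeps the function testable outside a browser.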

Question 20

What model sizes are available?

Accepted Answer

Kokoro TTS offers three model sizes: q4 (~90MB, fast), q8 (~300MB, balanced), and fp32 (~600MB, highest quality). Kitten is the lightest TTS option at ~24MB, Piper is CPU-oriented at ~75MB, and Supertonic loads a multi-file ONNX model stack for its multilingual synthesis path. For STT, Whisper offers Tiny (~40MB, fastest), Base (~76MB, balanced), and Small (~240MB, best accuracy).
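One way to reason about these sizes is to pick the largest Kokoro variant that fits a download budget. The sketch below copies the approximate sizes from the answer above; the `ModelVariant` type and the function name are hypothetical and not part of the tool.

```typescript
interface ModelVariant {
  name: string;
  approxMb: number;
}

// Approximate sizes taken from the answer above.
const KOKORO_VARIANTS: ModelVariant[] = [
  { name: "q4", approxMb: 90 },
  { name: "q8", approxMb: 300 },
  { name: "fp32", approxMb: 600 },
];

// Return the largest variant that fits the budget, or undefined if none fits.
function largestWithinBudget(
  variants: ModelVariant[],
  budgetMb: number,
): ModelVariant | undefined {
  return variants
    .filter((v) => v.approxMb <= budgetMb) // filter returns a copy, so sort is safe
    .sort((a, b) => b.approxMb - a.approxMb)[0];
}
```

With a 350MB budget this selects q8, and with a 50MB budget nothing fits, in which case a lighter engine such as Kitten (~24MB) would be the alternative.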

Question 21

Does it work on mobile?

Accepted Answer

TTS and STT work best on desktop browsers with WebGPU support. Mobile browsers may have limited WebGPU/WASM support and could produce errors. For the best experience, use Chrome or Edge on a desktop or laptop.

Question 22

What is the difference between Kokoro, Kitten, Piper, and Supertonic TTS engines?

Accepted Answer

Kokoro TTS: 82M params, 54 voices, broad language coverage, highest quality, WebGPU+WASM, ~90-600MB model. Best for production-quality output.

Kitten TTS: 15M params, 8 expressions, English only, lightest model at 24MB. Best for quick prototyping and devices with limited resources.

Piper TTS: VITS architecture, 25 curated voices from a large English speaker dataset, WASM-only at ~75MB. Best for CPU-only environments.

Supertonic TTS: multilingual ONNX model stack with 10 preset voice styles across English, Spanish, Portuguese, French, and Korean. Best for local multilingual generation in its supported languages.
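The trade-offs above can be condensed into a rough selection heuristic. This is one possible reading of the comparison, not a rule the tool applies; the language codes, the `Needs` fields, and the function name are all assumptions made for illustration.

```typescript
type Engine = "kokoro" | "kitten" | "piper" | "supertonic";

interface Needs {
  language: string; // hypothetical short code such as "en", "ja", "ko"
  hasWebGpu: boolean; // false means a CPU-only (WASM) environment
  lowResource: boolean; // true when download size or memory is tight
}

// Hypothetical codes for the Kokoro languages listed elsewhere in this FAQ.
const KOKORO_LANGS = ["en", "ja", "zh", "es", "fr", "hi", "it", "pt"];

function suggestEngine(needs: Needs): Engine | undefined {
  // Korean is only offered by Supertonic in the comparison above.
  if (needs.language === "ko") return "supertonic";
  if (needs.language === "en") {
    if (needs.lowResource) return "kitten"; // lightest model
    return needs.hasWebGpu ? "kokoro" : "piper"; // Piper targets CPU-only use
  }
  // Other languages: Kokoro has the broadest coverage.
  return KOKORO_LANGS.indexOf(needs.language) !== -1 ? "kokoro" : undefined;
}
```

An `undefined` result means none of the four engines lists that language.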

Question 23

How are the models downloaded and cached?

Accepted Answer

Models are downloaded over HTTPS on first use and cached in your browser's IndexedDB storage. Kokoro models are served from Cloudflare R2 (object storage) and Hugging Face. Subsequent visits load the model from cache without re-downloading. You can clear the cache through your browser's storage settings.
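The download-then-cache behavior follows a cache-first pattern, sketched below. The `ModelStore` interface stands in for IndexedDB and the `download` parameter stands in for an HTTPS fetch; both are assumptions introduced so the logic can be shown without browser APIs, and none of these names come from the tool itself.

```typescript
// Minimal async key-value store; in a browser this could be backed by IndexedDB.
interface ModelStore {
  get(key: string): Promise<Uint8Array | undefined>;
  put(key: string, bytes: Uint8Array): Promise<void>;
}

interface LoadedModel {
  bytes: Uint8Array;
  fromCache: boolean;
}

// Return the cached model if present; otherwise download it once and store it.
async function loadModel(
  url: string,
  store: ModelStore,
  download: (url: string) => Promise<Uint8Array>,
): Promise<LoadedModel> {
  const cached = await store.get(url);
  if (cached !== undefined) return { bytes: cached, fromCache: true };
  const bytes = await download(url);
  await store.put(url, bytes);
  return { bytes, fromCache: false };
}
```

Keying the cache by URL means a new model version published at a new URL is downloaded fresh, while repeat visits hit the cache.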

Question 24

What is the sample rate of the generated audio?

Accepted Answer

Kokoro TTS outputs at 24kHz. Kitten TTS is configurable from 8kHz to 48kHz. Piper TTS has a fixed 22.05kHz sample rate. For STT, audio input at any common sample rate is accepted — the tool handles resampling automatically.
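As an illustration of resampling, the sketch below converts audio between sample rates with linear interpolation. Whisper models are typically fed 16kHz mono audio, so that is a likely target rate, but the target and the method the tool actually uses are assumptions here. This version has no anti-aliasing filter, which a production downsampler would normally add.

```typescript
// Resample mono audio by linear interpolation between neighboring samples.
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number,
): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.round((input.length * toRate) / fromRate);
  const output = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    output[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return output;
}
```

For example, one second of 48kHz input (48,000 samples) becomes 16,000 samples at 16kHz.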

Question 25

How does the text chunking work for long texts?

Accepted Answer

When you enter text longer than the model's optimal chunk size, it is automatically split into segments at sentence boundaries. Each chunk is processed independently and the results are concatenated with natural pauses. This ensures consistent quality even for very long texts like audiobook chapters.
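A minimal sketch of sentence-boundary chunking is shown below. It treats `.`, `!`, or `?` followed by whitespace (or the end of the text) as a sentence end and then packs sentences greedily into chunks up to a character limit. The function names and the limit parameter are hypothetical, and the tool's real splitter is likely more careful about abbreviations and other languages.

```typescript
// Split text after '.', '!' or '?' when followed by whitespace or end of text.
function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    const isTerminator = ".!?".indexOf(text.charAt(i)) !== -1;
    const atBoundary = i + 1 === text.length || /\s/.test(text.charAt(i + 1));
    if (isTerminator && atBoundary) {
      const sentence = text.slice(start, i + 1).trim();
      if (sentence.length > 0) sentences.push(sentence);
      start = i + 1;
    }
  }
  const tail = text.slice(start).trim();
  if (tail.length > 0) sentences.push(tail);
  return sentences;
}

// Greedily pack whole sentences into chunks of at most maxChars characters.
function chunkBySentence(text: string, maxChars: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const sentence of splitSentences(text)) {
    const candidate = current === "" ? sentence : current + " " + sentence;
    if (current === "" || candidate.length <= maxChars) {
      current = candidate; // a single over-long sentence is kept whole
    } else {
      chunks.push(current);
      current = sentence;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

Each returned chunk would then be synthesized on its own, with the audio segments joined in order and a short pause inserted between them.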

Question 26

Is OfflineTTS better than ElevenLabs?

Accepted Answer

OfflineTTS is completely free with no usage limits, works offline for English TTS and STT, and runs audio synthesis on your device. ElevenLabs offers more voices and higher peak quality but charges per character and requires an internet connection. For most use cases, such as YouTube voiceovers, e-learning, and audiobooks, OfflineTTS delivers quality that is close enough at zero cost.

Question 27

How does OfflineTTS compare to NaturalReader?

Accepted Answer

OfflineTTS is free with no usage limits and works offline where each engine supports it. NaturalReader charges $9.99/month for premium features and requires an internet connection. OfflineTTS offers 98 voice options and styles across 10 TTS language options while NaturalReader has 60+ voices.

Question 28

How does OfflineTTS compare to Speechify?

Accepted Answer

OfflineTTS is free with unlimited usage, while Speechify charges per character. OfflineTTS works offline after model download, while Speechify requires an internet connection. Both offer natural-sounding AI voices, but OfflineTTS keeps audio synthesis on your device, and English text never leaves your browser.

Question 29

How does OfflineTTS compare to browser built-in TTS?

Accepted Answer

Browser built-in TTS (SpeechSynthesis API) uses system voices, which often sound robotic and unnatural. OfflineTTS uses neural network models (Kokoro, Kitten, Piper, and Supertonic) that produce speech much closer to a real person reading aloud.

Question 30

Can I use OfflineTTS for YouTube videos?

Accepted Answer

Yes. Generate voice-overs for YouTube videos, download as WAV, and import into your video editor (DaVinci Resolve, Premiere Pro, Final Cut, etc.). Heart (A-rated) is the top pick for educational content, Bella for vlogs, and Michael for review videos.

Question 31

Can I create audiobooks with OfflineTTS?

Accepted Answer

Yes. Process one chapter at a time, export as WAV, then assemble in your DAW. Use the q8 or fp32 model for audiobook-quality output. Heart (A-rated) is the best voice for long-form narration. Since there are no per-character charges, your royalties stay yours.

Question 32

Can I use OfflineTTS for e-learning?

Accepted Answer

Absolutely. Add voice narration to online courses, training materials, and educational content. OfflineTTS supports 10 TTS language options across Kokoro and Supertonic for international audiences. Generate consistent, professional narration without hiring voice talent for every course update.

Question 33

Can I use OfflineTTS for accessibility?

Accepted Answer

Yes. Convert text to speech for visually impaired users, create audio versions of written content, and add voice narration to any web content. The STT tool can also generate subtitles (SRT/VTT) for making video content accessible.
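To illustrate how timestamped transcription turns into subtitles, the sketch below formats cues as SRT, which uses a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time line, the text, and a blank line between cues. The `Cue` shape and function names are assumptions and do not reflect the tool's internal data format.

```typescript
interface Cue {
  startSec: number;
  endSec: number;
  text: string;
}

// Left-pad a non-negative integer with zeros to the given width.
function zeroPad(value: number, width: number): string {
  let s = String(value);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function formatSrtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return zeroPad(h, 2) + ":" + zeroPad(m, 2) + ":" + zeroPad(s, 2) + "," + zeroPad(ms, 3);
}

// Build an SRT document: numbered cues separated by blank lines.
function toSrt(cues: Cue[]): string {
  return cues
    .map((cue, i) =>
      String(i + 1) + "\n" +
      formatSrtTime(cue.startSec) + " --> " + formatSrtTime(cue.endSec) + "\n" +
      cue.text + "\n")
    .join("\n");
}
```

VTT output is similar but starts with a `WEBVTT` header line and uses a period instead of a comma before the milliseconds.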

Frequently Asked Questions

General