Local TTS: How to Run AI Voice Synthesis On Your Device (No Cloud)
Cloud TTS APIs are convenient. They’re also a bottleneck — you need internet, you pay per character, and your text passes through someone else’s servers. Local TTS removes all three constraints.
This guide covers the practical side of running TTS on your own hardware: what works, what doesn’t, and how to get started.
What “Local TTS” Actually Means
Local TTS runs the neural network inference on a device you control — your laptop, a server, a Raspberry Pi, or even your browser. The model weights are downloaded once. After that, no network is required.
There are three levels of “local”:
| Level | Where | Example | Setup Effort |
|---|---|---|---|
| Browser | Your browser | OfflineTTS, TTS Studio | Zero — open the URL |
| Desktop | Your machine | Python + Kokoro, Piper CLI | Low — pip install |
| Edge | IoT/embedded | Piper on Pi, Kitten on MCU | Medium — cross-compile |
All three are “local” in the sense that your text never leaves your device. The difference is how much effort it takes to set up and how much hardware you need.
Level 1: Browser-Based Local TTS
This is the easiest path. You open a website, a model downloads to your browser, and all future inference happens in a WebAssembly or WebGPU sandbox on your device.
How It Works
1. Open the website
2. Model downloads (~80–300MB, one time)
3. Browser caches model in IndexedDB
4. Inference uses WebGPU (fast) or WASM (compatible)
5. Audio plays or downloads — no server round-trip
The key insight: your browser is a capable inference runtime. ONNX Runtime Web, which powers most browser ML, supports both WebGPU (GPU-accelerated) and WASM (CPU-only) backends. Modern devices — even phones — have enough compute for the current generation of TTS models.
What to Use
OfflineTTS — Kokoro TTS in the browser. 54 voices, 9 languages, free. Model sizes from 90MB (Small) to 600MB (Large, highest quality). Works offline after first load.
TTS Studio — Side-by-side comparison tool for Kokoro, Piper, and Kitten TTS. Useful for testing which model sounds best for your use case before committing.
Hardware Requirements
| Model Size | RAM | Storage | Recommended |
|---|---|---|---|
| Small (~90MB) | 2GB | 100MB | Phones, tablets, old laptops |
| Medium (~300MB) | 4GB | 350MB | Most laptops, desktops |
| Large (~600MB) | 8GB | 650MB | Modern laptops, desktops |
WebGPU support: Chrome 113+, Edge 113+, Safari 17.4+. Firefox is behind a flag. If WebGPU isn’t available, WASM kicks in automatically — slower, but it works everywhere.
When Browser-Based Is Enough
- Content creators who need quick voice-overs
- Language learners practicing pronunciation
- Anyone who wants TTS without installing software
- Privacy-conscious users who don’t want to send text anywhere
Level 2: Desktop Local TTS
Browser-based TTS is convenient, but it has limits: you’re constrained to models that fit in a browser’s memory sandbox, and you can’t integrate with local scripts or apps directly.
Running TTS as a local process gives you more control.
Kokoro TTS (Python)
```bash
pip install kokoro soundfile
```

```python
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')  # 'a' = American English
generator = pipeline("Hello, this is a test of local TTS.", voice='af_heart')
for i, (_, _, audio) in enumerate(generator):
    # audio is a numpy array at 24kHz — save, process, or stream it
    sf.write(f'segment_{i}.wav', audio, 24000)
```
This runs on CPU. No GPU required. On a modern laptop, generation speed is roughly 1–2x realtime — fast enough for batch processing and interactive use.
Best for: scripting, batch generation, integration with local apps, building your own TTS-powered tools.
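The "1–2x realtime" claim is easy to check on your own hardware. A minimal benchmark sketch, assuming the `kokoro` package from above (the `realtime_factor` helper works standalone):

```python
import time

def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of compute; >1 is faster than realtime."""
    return audio_seconds / wall_seconds

def benchmark_kokoro(text: str) -> float:
    # Import inside the function so realtime_factor() works without kokoro installed.
    from kokoro import KPipeline
    pipeline = KPipeline(lang_code='a')
    start = time.perf_counter()
    audio_seconds = 0.0
    for _, _, audio in pipeline(text, voice='af_heart'):
        audio_seconds += len(audio) / 24000  # Kokoro outputs 24kHz audio
    return realtime_factor(audio_seconds, time.perf_counter() - start)

# Usage (requires kokoro installed):
# print(f"{benchmark_kokoro('The quick brown fox jumps over the lazy dog.'):.2f}x")
```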
Piper TTS (CLI)
```bash
# Install
pip install piper-tts

# Generate and play (piper's raw output is 16-bit mono PCM at 22.05kHz)
echo "Hello world" | piper --model en_US-libritts_r-medium \
  --output-raw | aplay -r 22050 -f S16_LE -t raw -
```
Piper is the established choice for Linux-based local TTS. It’s fast, stable, and ships hundreds of pre-trained voices across dozens of languages. The trade-offs are a fixed 22.05kHz sample rate and flatter, less expressive prosody than newer models like Kokoro.
Best for: Home Assistant, accessibility tools, command-line workflows, Raspberry Pi projects.
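Because Piper is a CLI, the simplest way to use it from your own tools is to shell out to it. A minimal sketch, assuming the `piper` binary from `pip install piper-tts` is on your PATH:

```python
import subprocess

def piper_command(model: str) -> list[str]:
    """piper invocation; --output-raw emits 16-bit mono PCM at 22.05kHz."""
    return ["piper", "--model", model, "--output-raw"]

def synthesize(text: str, model: str = "en_US-libritts_r-medium") -> bytes:
    """Pipe text through the piper CLI and return raw PCM bytes."""
    result = subprocess.run(
        piper_command(model),
        input=text.encode("utf-8"),
        capture_output=True,
        check=True,
    )
    return result.stdout  # feed to aplay, a WAV writer, or a network socket

# Usage (requires piper on PATH):
# pcm = synthesize("Hello from a Python script.")
```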
F5-TTS and XTTS-v2
If you need voice cloning — generating speech in a specific person’s voice from a short audio sample — these are the next step up.
```bash
pip install f5-tts
```
F5-TTS supports zero-shot voice cloning with a 5-second reference clip. It’s MIT-licensed and works well on consumer GPUs (RTX 3060 or better).
Best for: audiobook production with consistent character voices, custom voice creation, voice restoration.
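A rough sketch of scripting F5-TTS via its bundled CLI. The `f5-tts_infer-cli` entry point and flag names below are assumptions based on the project’s README; confirm them with `f5-tts_infer-cli --help` for your installed version:

```python
import subprocess

def f5_clone_command(ref_audio: str, ref_text: str, gen_text: str) -> list[str]:
    # Flag names are assumptions from the F5-TTS README; verify
    # with `f5-tts_infer-cli --help` before relying on them.
    return [
        "f5-tts_infer-cli",
        "--ref_audio", ref_audio,  # ~5-second clip of the target voice
        "--ref_text", ref_text,    # transcript of the reference clip
        "--gen_text", gen_text,    # text to speak in the cloned voice
    ]

# Usage (requires f5-tts installed, ideally with a CUDA GPU):
# subprocess.run(f5_clone_command("narrator.wav", "A short sample.",
#                                 "Chapter one begins here."), check=True)
```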
Desktop Hardware Recommendations
| Use Case | CPU | RAM | GPU | Storage |
|---|---|---|---|---|
| Basic TTS (Kokoro/Piper) | Any modern x86/ARM | 4GB | Any | 500MB |
| Voice cloning (F5-TTS) | 4+ cores | 8GB | RTX 3060+ (6GB VRAM) | 5GB |
| High-quality multi-voice (XTTS-v2) | 8+ cores | 16GB | RTX 3080+ (10GB VRAM) | 10GB |
Level 3: Edge and Embedded TTS
This is where local TTS meets hardware constraints. Running TTS on a Raspberry Pi, microcontroller, or embedded device requires models optimized for low memory and compute.
Piper on Raspberry Pi
Piper was designed for this. A Raspberry Pi 4 can generate speech faster than realtime — impressive for a $35 board.
```bash
# On Raspberry Pi OS
pip install piper-tts

piper --model en_US-libritts_r-medium \
  --output-raw < script.txt | aplay -r 22050 -f S16_LE -t raw -
```
Kitten TTS on Constrained Hardware
At 24MB, Kitten TTS runs on devices where even Piper feels heavy. It’s been tested on Raspberry Pi Zero and has a browser-based version that works on mobile devices with limited memory.
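For a sense of the API surface, a hedged sketch of driving Kitten TTS from Python. The `kittentts` package name, model id, voice name, and 24kHz output rate are assumptions drawn from the project’s README; check the repo for current values:

```python
def kitten_speak(text: str, out_path: str = "out.wav") -> None:
    # Package name, model id, voice name, and sample rate are assumptions
    # from the Kitten TTS README; verify before relying on them.
    import soundfile as sf
    from kittentts import KittenTTS

    model = KittenTTS("KittenML/kitten-tts-nano-0.1")  # ~24MB download
    audio = model.generate(text, voice="expr-voice-2-f")
    sf.write(out_path, audio, 24000)

# Usage (requires kittentts and soundfile):
# kitten_speak("Running on very small hardware.")
```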
Building an Offline Voice Agent
The current community standard for a complete offline voice agent chains three local components:
```
Microphone
    ↓
Whisper (STT) — speech to text, runs on CPU/GPU
    ↓
Ollama / llama.cpp (LLM) — text generation, runs locally
    ↓
Kokoro (TTS) — text to speech, runs on CPU
    ↓
Speaker
```
All three components run on a single machine. No internet required. The latency budget looks like this:
| Component | Latency (CPU) | Latency (GPU) |
|---|---|---|
| Whisper (STT) | ~500ms | ~100ms |
| LLM (8B model) | ~2s | ~300ms |
| Kokoro (TTS) | ~300ms | ~150ms |
| Total | ~3s | ~550ms |
On GPU, sub-second end-to-end voice agent response is achievable. On CPU, 2–3 seconds is realistic — acceptable for most use cases.
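The three-stage chain above can be sketched in one short module. This is a hedged sketch, not a reference implementation: it assumes `faster-whisper` for STT, a local Ollama server on its default port with a `llama3.1:8b` model pulled, and the Kokoro pipeline shown earlier; swap in whichever components you actually run:

```python
import json
import urllib.request

def ollama_payload(model: str, prompt: str) -> dict:
    """Request body for Ollama's local /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(prompt: str, model: str = "llama3.1:8b") -> str:
    """One-shot completion against a local Ollama server."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(ollama_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def voice_agent_turn(wav_path: str) -> None:
    """One mic-to-speaker turn: STT, then LLM, then TTS, all local."""
    # Heavy imports stay inside the function so this module loads
    # even without the STT/TTS packages installed.
    import soundfile as sf
    from faster_whisper import WhisperModel
    from kokoro import KPipeline

    segments, _ = WhisperModel("small").transcribe(wav_path)      # STT
    question = " ".join(s.text for s in segments)
    answer = ask_ollama(question)                                 # LLM
    tts = KPipeline(lang_code='a')                                # TTS
    for i, (_, _, audio) in enumerate(tts(answer, voice='af_heart')):
        sf.write(f"reply_{i}.wav", audio, 24000)

# Usage (requires faster-whisper, kokoro, and a running Ollama):
# voice_agent_turn("question.wav")
```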
How to Choose Your Local TTS Path
```
Do you have a browser?
├── Yes → Start with OfflineTTS (zero setup)
│   └── Need to compare models? → TTS Studio
└── Need programmatic access?
    ├── Python scripting? → Kokoro (pip install kokoro)
    ├── CLI / Linux? → Piper (pip install piper-tts)
    ├── Voice cloning? → F5-TTS or XTTS-v2
    └── Raspberry Pi / embedded? → Piper or Kitten
```
The common thread: once the model is downloaded, you own the entire pipeline. No API keys to manage. No rate limits to hit. No pricing tiers to outgrow. No text leaves your device.
Why Local TTS Matters in 2026
Three shifts have made local TTS the practical choice:
- Model quality caught up. Kokoro TTS at 82M parameters produces speech that scores 4.3–4.5 MOS, comparable to mid-tier cloud APIs. The quality gap that justified cloud dependency has largely closed.
- Hardware got cheaper. A $500 laptop with 8GB RAM can run Kokoro faster than realtime. You don’t need a data center; you need a device made in the last five years.
- Browser runtimes matured. ONNX Runtime Web plus WebGPU means you can run inference in a browser tab with near-native performance. The setup cost dropped to zero.
Get Started
The fastest way to try local TTS is in your browser: open OfflineTTS, let the model download once, and generate speech entirely on your device.