Local TTS: How to Run AI Voice Synthesis On Your Device (No Cloud)
Cloud TTS APIs are convenient. They’re also a bottleneck — you need internet, you pay per character, and your text passes through someone else’s servers. Local TTS removes all three constraints.
This guide covers the practical side of running TTS on your own hardware: what works, what doesn’t, and how to get started.
What “Local TTS” Actually Means
Local TTS runs the neural network inference on a device you control — your laptop, a server, a Raspberry Pi, or even your browser. The model weights are downloaded once. After that, no network is required.
There are three levels of “local”:
| Level | Where | Example | Setup Effort |
|---|---|---|---|
| Browser | Your browser | OfflineTTS, TTS Studio | Zero — open the URL |
| Desktop | Your machine | Python + Kokoro, Piper CLI | Low — pip install |
| Edge | IoT/embedded | Piper on Pi, Kitten on MCU | Medium — cross-compile |
All three are “local” in the sense that your text never leaves your device. The difference is how much effort it takes to set up and how much hardware you need.
Level 1: Browser-Based Local TTS
This is the easiest path. You open a website, a model downloads to your browser, and all future inference happens in a WebAssembly or WebGPU sandbox on your device.
How It Works
1. Open the website
2. Model downloads (~80–300MB, one time)
3. Browser caches model in IndexedDB
4. Inference uses WebGPU (fast) or WASM (compatible)
5. Audio plays or downloads — no server round-trip
The key insight: your browser is a capable inference runtime. ONNX Runtime Web, which powers most browser ML, supports both WebGPU (GPU-accelerated) and WASM (CPU-only) backends. Modern devices — even phones — have enough compute for the current generation of TTS models.
What to Use
OfflineTTS — Kokoro TTS in the browser. 54 voices, 9 languages, free. Model sizes from 90MB (Small) to 600MB (Large, highest quality). Works offline after first load.
TTS Studio — Side-by-side comparison tool for Kokoro, Piper, and Kitten TTS. Useful for testing which model sounds best for your use case before committing.
Hardware Requirements
| Model Size | RAM | Storage | Recommended |
|---|---|---|---|
| Small (~90MB) | 2GB | 100MB | Phones, tablets, old laptops |
| Medium (~300MB) | 4GB | 350MB | Most laptops, desktops |
| Large (~600MB) | 8GB | 650MB | Modern laptops, desktops |
WebGPU support: Chrome 113+, Edge 113+, Safari 17.4+. Firefox is behind a flag. If WebGPU isn’t available, WASM kicks in automatically — slower, but it works everywhere.
When Browser-Based Is Enough
- Content creators who need quick voice-overs
- Language learners practicing pronunciation
- Anyone who wants TTS without installing software
- Privacy-conscious users who don’t want to send text anywhere
Level 2: Desktop Local TTS
Browser-based TTS is convenient, but it has limits: you’re constrained to models that fit in a browser’s memory sandbox, and you can’t integrate with local scripts or apps directly.
Running TTS as a local process gives you more control.
Kokoro TTS (Python)
```bash
pip install kokoro soundfile
```

```python
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')  # 'a' = American English
generator = pipeline("Hello, this is a test of local TTS.", voice='af_heart')
for i, (_, _, audio) in enumerate(generator):
    # audio is a numpy array at 24kHz — save, process, or stream it
    sf.write(f'segment_{i}.wav', audio, 24000)
```
This runs on CPU. No GPU required. On a modern laptop, generation speed is roughly 1–2x realtime — fast enough for batch processing and interactive use.
Best for: scripting, batch generation, integration with local apps, building your own TTS-powered tools.
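The "1–2x realtime" claim is easy to check on your own hardware. A minimal benchmark sketch, assuming the `kokoro` package from above (the `realtime_factor` helper works standalone):

```python
import time

def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of compute; >1 is faster than realtime."""
    return audio_seconds / wall_seconds

def benchmark_kokoro(text: str) -> float:
    # Import inside the function so realtime_factor() works without kokoro installed.
    from kokoro import KPipeline
    pipeline = KPipeline(lang_code='a')
    start = time.perf_counter()
    audio_seconds = 0.0
    for _, _, audio in pipeline(text, voice='af_heart'):
        audio_seconds += len(audio) / 24000  # Kokoro outputs 24kHz audio
    return realtime_factor(audio_seconds, time.perf_counter() - start)

# Usage (requires kokoro installed):
# print(f"{benchmark_kokoro('The quick brown fox jumps over the lazy dog.'):.2f}x")
```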
Piper TTS (CLI)
```bash
# Install
pip install piper-tts

# Generate and play (piper's raw output is 16-bit mono PCM at 22.05kHz)
echo "Hello world" | piper --model en_US-libritts_r-medium \
  --output-raw | aplay -r 22050 -f S16_LE -t raw -
```
Piper is the established choice for Linux-based local TTS. It’s fast, stable, and ships hundreds of pre-trained voices across dozens of languages. The trade-offs are a fixed 22.05kHz sample rate and flatter, less expressive prosody than newer models like Kokoro.
Best for: Home Assistant, accessibility tools, command-line workflows, Raspberry Pi projects.
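Because Piper is a CLI, the simplest way to use it from your own tools is to shell out to it. A minimal sketch, assuming the `piper` binary from `pip install piper-tts` is on your PATH:

```python
import subprocess

def piper_command(model: str) -> list[str]:
    """piper invocation; --output-raw emits 16-bit mono PCM at 22.05kHz."""
    return ["piper", "--model", model, "--output-raw"]

def synthesize(text: str, model: str = "en_US-libritts_r-medium") -> bytes:
    """Pipe text through the piper CLI and return raw PCM bytes."""
    result = subprocess.run(
        piper_command(model),
        input=text.encode("utf-8"),
        capture_output=True,
        check=True,
    )
    return result.stdout  # feed to aplay, a WAV writer, or a network socket

# Usage (requires piper on PATH):
# pcm = synthesize("Hello from a Python script.")
```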
F5-TTS and XTTS-v2
If you need voice cloning — generating speech in a specific person’s voice from a short audio sample — these are the next step up.
```bash
pip install f5-tts
```
F5-TTS supports zero-shot voice cloning with a 5-second reference clip. It’s MIT-licensed and works well on consumer GPUs (RTX 3060 or better).
Best for: audiobook production with consistent character voices, custom voice creation, voice restoration.
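A rough sketch of scripting F5-TTS via its bundled CLI. The `f5-tts_infer-cli` entry point and flag names below are assumptions based on the project’s README; confirm them with `f5-tts_infer-cli --help` for your installed version:

```python
import subprocess

def f5_clone_command(ref_audio: str, ref_text: str, gen_text: str) -> list[str]:
    # Flag names are assumptions from the F5-TTS README; verify
    # with `f5-tts_infer-cli --help` before relying on them.
    return [
        "f5-tts_infer-cli",
        "--ref_audio", ref_audio,  # ~5-second clip of the target voice
        "--ref_text", ref_text,    # transcript of the reference clip
        "--gen_text", gen_text,    # text to speak in the cloned voice
    ]

# Usage (requires f5-tts installed, ideally with a CUDA GPU):
# subprocess.run(f5_clone_command("narrator.wav", "A short sample.",
#                                 "Chapter one begins here."), check=True)
```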
Desktop Hardware Recommendations
| Use Case | CPU | RAM | GPU | Storage |
|---|---|---|---|---|
| Basic TTS (Kokoro/Piper) | Any modern x86/ARM | 4GB | Any | 500MB |
| Voice cloning (F5-TTS) | 4+ cores | 8GB | RTX 3060+ (6GB VRAM) | 5GB |
| High-quality multi-voice (XTTS-v2) | 8+ cores | 16GB | RTX 3080+ (10GB VRAM) | 10GB |
Level 3: Edge and Embedded TTS
This is where local TTS meets hardware constraints. Running TTS on a Raspberry Pi, microcontroller, or embedded device requires models optimized for low memory and compute.
Piper on Raspberry Pi
Piper was designed for this. A Raspberry Pi 4 can generate speech faster than realtime — impressive for a $35 board.
```bash
# On Raspberry Pi OS
pip install piper-tts

piper --model en_US-libritts_r-medium \
  --output-raw < script.txt | aplay -r 22050 -f S16_LE -t raw -
```
Kitten TTS on Constrained Hardware
At 24MB, Kitten TTS runs on devices where even Piper feels heavy. It’s been tested on Raspberry Pi Zero and has a browser-based version that works on mobile devices with limited memory.
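For a sense of the API surface, a hedged sketch of driving Kitten TTS from Python. The `kittentts` package name, model id, voice name, and 24kHz output rate are assumptions drawn from the project’s README; check the repo for current values:

```python
def kitten_speak(text: str, out_path: str = "out.wav") -> None:
    # Package name, model id, voice name, and sample rate are assumptions
    # from the Kitten TTS README; verify before relying on them.
    import soundfile as sf
    from kittentts import KittenTTS

    model = KittenTTS("KittenML/kitten-tts-nano-0.1")  # ~24MB download
    audio = model.generate(text, voice="expr-voice-2-f")
    sf.write(out_path, audio, 24000)

# Usage (requires kittentts and soundfile):
# kitten_speak("Running on very small hardware.")
```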
Building an Offline Voice Agent
The current community standard for a complete offline voice agent chains three local components:
```
Microphone
    ↓
Whisper (STT) — speech to text, runs on CPU/GPU
    ↓
Ollama / llama.cpp (LLM) — text generation, runs locally
    ↓
Kokoro (TTS) — text to speech, runs on CPU
    ↓
Speaker
```
All three components run on a single machine. No internet required. The latency budget looks like this:
| Component | Latency (CPU) | Latency (GPU) |
|---|---|---|
| Whisper (STT) | ~500ms | ~100ms |
| LLM (8B model) | ~2s | ~300ms |
| Kokoro (TTS) | ~300ms | ~150ms |
| Total | ~3s | ~550ms |
On GPU, sub-second end-to-end voice agent response is achievable. On CPU, 2–3 seconds is realistic — acceptable for most use cases.
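The three-stage chain above can be sketched in one short module. This is a hedged sketch, not a reference implementation: it assumes `faster-whisper` for STT, a local Ollama server on its default port with a `llama3.1:8b` model pulled, and the Kokoro pipeline shown earlier; swap in whichever components you actually run:

```python
import json
import urllib.request

def ollama_payload(model: str, prompt: str) -> dict:
    """Request body for Ollama's local /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(prompt: str, model: str = "llama3.1:8b") -> str:
    """One-shot completion against a local Ollama server."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(ollama_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def voice_agent_turn(wav_path: str) -> None:
    """One mic-to-speaker turn: STT, then LLM, then TTS, all local."""
    # Heavy imports stay inside the function so this module loads
    # even without the STT/TTS packages installed.
    import soundfile as sf
    from faster_whisper import WhisperModel
    from kokoro import KPipeline

    segments, _ = WhisperModel("small").transcribe(wav_path)      # STT
    question = " ".join(s.text for s in segments)
    answer = ask_ollama(question)                                 # LLM
    tts = KPipeline(lang_code='a')                                # TTS
    for i, (_, _, audio) in enumerate(tts(answer, voice='af_heart')):
        sf.write(f"reply_{i}.wav", audio, 24000)

# Usage (requires faster-whisper, kokoro, and a running Ollama):
# voice_agent_turn("question.wav")
```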
How to Choose Your Local TTS Path
```
Do you have a browser?
├── Yes → Start with OfflineTTS (zero setup)
│   └── Need to compare models? → TTS Studio
└── Need programmatic access?
    ├── Python scripting? → Kokoro (pip install kokoro)
    ├── CLI / Linux? → Piper (pip install piper-tts)
    ├── Voice cloning? → F5-TTS or XTTS-v2
    └── Raspberry Pi / embedded? → Piper or Kitten
```
The common thread: once the model is downloaded, you own the entire pipeline. No API keys to manage. No rate limits to hit. No pricing tiers to outgrow. No text leaves your device.
Why Local TTS Matters in 2026
Three shifts have made local TTS the practical choice:
- Model quality caught up. Kokoro TTS at 82M parameters produces speech that scores 4.3–4.5 MOS, comparable to mid-tier cloud APIs. The quality gap that justified cloud dependency has largely closed.
- Hardware got cheaper. A $500 laptop with 8GB RAM can run Kokoro faster than realtime. You don’t need a data center; you need a device made in the last five years.
- Browser runtimes matured. ONNX Runtime Web plus WebGPU means you can run inference in a browser tab with near-native performance. The setup cost dropped to zero.
Get Started
The fastest way to try local TTS is in your browser: open OfflineTTS, let the model download once, and generate speech entirely on your device.