
Local TTS: How to Run AI Voice Synthesis On Your Device (No Cloud)

Tags: local TTS, on-device TTS, guide, offline, setup

Cloud TTS APIs are convenient. They’re also a bottleneck — you need internet, you pay per character, and your text passes through someone else’s servers. Local TTS removes all three constraints.

This guide covers the practical side of running TTS on your own hardware: what works, what doesn’t, and how to get started.

What “Local TTS” Actually Means

Local TTS runs the neural network inference on a device you control — your laptop, a server, a Raspberry Pi, or even your browser. The model weights are downloaded once. After that, no network is required.

There are three levels of “local”:

Level    Where         Example                     Setup Effort
-------  ------------  --------------------------  ----------------------
Browser  Your browser  OfflineTTS, TTS Studio      Zero — open the URL
Desktop  Your machine  Python + Kokoro, Piper CLI  Low — pip install
Edge     IoT/embedded  Piper on Pi, Kitten on MCU  Medium — cross-compile

All three are “local” in the sense that your text never leaves your device. The difference is how much effort it takes to set up and how much hardware you need.

Level 1: Browser-Based Local TTS

This is the easiest path. You open a website, a model downloads to your browser, and all future inference happens in a WebAssembly or WebGPU sandbox on your device.

How It Works

1. Open the website
2. Model downloads (~80–300MB, one time)
3. Browser caches model in IndexedDB
4. Inference uses WebGPU (fast) or WASM (compatible)
5. Audio plays or downloads — no server round-trip

The key insight: your browser is a capable inference runtime. ONNX Runtime Web, which powers most browser ML, supports both WebGPU (GPU-accelerated) and WASM (CPU-only) backends. Modern devices — even phones — have enough compute for the current generation of TTS models.

What to Use

OfflineTTS — Kokoro TTS in the browser. 54 voices, 9 languages, free. Model sizes from 90MB (Small) to 600MB (Large, highest quality). Works offline after first load.

TTS Studio — Side-by-side comparison tool for Kokoro, Piper, and Kitten TTS. Useful for testing which model sounds best for your use case before committing.

Hardware Requirements

Model Size       RAM  Storage  Recommended
---------------  ---  -------  ----------------------------
Small (~90MB)    2GB  100MB    Phones, tablets, old laptops
Medium (~300MB)  4GB  350MB    Most laptops, desktops
Large (~600MB)   8GB  650MB    Modern laptops, desktops
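If you are sizing model downloads programmatically, the table above can be encoded as a small lookup. A minimal sketch — the tier names and RAM thresholds come straight from the table; the function name is illustrative:

```python
# RAM thresholds (GB) per model tier, largest first, taken from
# the hardware requirements table above.
TIERS = [
    ("Large", 8),   # ~600MB model
    ("Medium", 4),  # ~300MB model
    ("Small", 2),   # ~90MB model
]

def pick_model_tier(ram_gb: float):
    """Return the largest tier whose RAM requirement fits the device."""
    for name, min_ram in TIERS:
        if ram_gb >= min_ram:
            return name
    return None  # below even the Small tier's requirement

print(pick_model_tier(4))   # Medium
print(pick_model_tier(16))  # Large
```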

WebGPU support: Chrome 113+, Edge 113+, Safari 17.4+. In Firefox it’s behind a flag. If WebGPU isn’t available, WASM kicks in automatically — slower, but it works everywhere.

When Browser-Based Is Enough

  • Content creators who need quick voice-overs
  • Language learners practicing pronunciation
  • Anyone who wants TTS without installing software
  • Privacy-conscious users who don’t want to send text anywhere

Level 2: Desktop Local TTS

Browser-based TTS is convenient, but it has limits: you’re constrained to models that fit in a browser’s memory sandbox, and you can’t integrate with local scripts or apps directly.

Running TTS as a local process gives you more control.

Kokoro TTS (Python)

pip install kokoro soundfile

from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' = American English
generator = pipeline("Hello, this is a test of local TTS.", voice='af_heart')
for i, (_, _, audio) in enumerate(generator):
    # audio is a 24kHz numpy array — save, process, or stream it
    sf.write(f'segment_{i}.wav', audio, 24000)

This runs on CPU. No GPU required. On a modern laptop, generation speed is roughly 1–2x realtime — fast enough for batch processing and interactive use.

Best for: scripting, batch generation, integration with local apps, building your own TTS-powered tools.
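“1–2x realtime” means one to two seconds of audio produced per second of wall-clock time. You can measure it for your own hardware with a one-line helper; the `synthesize()` call in the comments is a placeholder for whatever pipeline you wire in, and 24kHz is Kokoro’s output sample rate:

```python
import time

def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time.
    A value above 1.0 means faster than realtime."""
    return audio_seconds / wall_seconds

# Timing a synthesis call (synthesize() is a placeholder for your pipeline):
# t0 = time.perf_counter()
# audio = synthesize("some text")  # numpy array of samples at 24kHz
# print(realtime_factor(len(audio) / 24000, time.perf_counter() - t0))
```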

Piper TTS (CLI)

# Install
pip install piper-tts

# Generate
echo "Hello world" | piper --model en_US-libritts_r-medium \
  --output-raw | aplay -r 22050 -f S16_LE -t raw -

Piper is the established choice for Linux-based local TTS. It’s fast, stable, and has hundreds of voices across dozens of languages. The trade-offs are a fixed 22kHz sample rate and flatter prosody than newer models like Kokoro.

Best for: Home Assistant, accessibility tools, command-line workflows, Raspberry Pi projects.
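For batch work you can also drive the piper CLI from Python. A hedged sketch, assuming piper is on your PATH and accepts the --output_file flag from its documentation (file names here are illustrative):

```python
# Feed each line of a script to the piper CLI, writing one WAV per line.
import subprocess

def build_piper_cmd(model: str, out_path: str) -> list:
    """Assemble the piper invocation for one output file."""
    return ["piper", "--model", model, "--output_file", out_path]

def batch_synthesize(lines, model="en_US-libritts_r-medium"):
    for i, text in enumerate(lines):
        cmd = build_piper_cmd(model, f"line_{i:03d}.wav")
        # Text goes in on stdin, exactly like the echo | piper example above.
        subprocess.run(cmd, input=text.encode(), check=True)

print(build_piper_cmd("en_US-libritts_r-medium", "line_000.wav"))
```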

F5-TTS and XTTS-v2

If you need voice cloning — generating speech in a specific person’s voice from a short audio sample — these are the next step up.

pip install f5-tts

F5-TTS supports zero-shot voice cloning with a 5-second reference clip. It’s MIT-licensed and works well on consumer GPUs (RTX 3060 or better).

Best for: audiobook production with consistent character voices, custom voice creation, voice restoration.

Desktop Hardware Recommendations

Use Case                            CPU                 RAM   GPU                    Storage
----------------------------------  ------------------  ----  ---------------------  -------
Basic TTS (Kokoro/Piper)            Any modern x86/ARM  4GB   Any                    500MB
Voice cloning (F5-TTS)              4+ cores            8GB   RTX 3060+ (6GB VRAM)   5GB
High-quality multi-voice (XTTS-v2)  8+ cores            16GB  RTX 3080+ (10GB VRAM)  10GB

Level 3: Edge and Embedded TTS

This is where local TTS meets hardware constraints. Running TTS on a Raspberry Pi, microcontroller, or embedded device requires models optimized for low memory and compute.

Piper on Raspberry Pi

Piper was designed for this. A Raspberry Pi 4 can generate speech faster than realtime — impressive for a $35 board.

# On Raspberry Pi OS
pip install piper-tts
piper --model en_US-libritts_r-medium \
  --output-raw < script.txt | aplay -r 22050 -f S16_LE -t raw -

Kitten TTS on Constrained Hardware

At 24MB, Kitten TTS runs on devices where even Piper feels heavy. It’s been tested on Raspberry Pi Zero and has a browser-based version that works on mobile devices with limited memory.

Building an Offline Voice Agent

The current community standard for a complete offline voice agent chains three local components:

Microphone
    → Whisper (STT) — speech to text, runs on CPU/GPU
    → Ollama / llama.cpp (LLM) — text generation, runs locally
    → Kokoro (TTS) — text to speech, runs on CPU
    → Speaker

All three components run on a single machine. No internet required. The latency budget looks like this:

Component       Latency (CPU)  Latency (GPU)
--------------  -------------  -------------
Whisper (STT)   ~500ms         ~100ms
LLM (8B model)  ~2s            ~300ms
Kokoro (TTS)    ~300ms         ~150ms
Total           ~3s            ~550ms

On GPU, sub-second end-to-end voice agent response is achievable. On CPU, 2–3 seconds is realistic — acceptable for most use cases.
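The loop above can be sketched structurally. The three stage functions are placeholders for whichever local engines you wire in (faster-whisper, a llama.cpp server, Kokoro, and so on); only the shape of the pipeline is meant literally:

```python
# Placeholder stages: swap in real local engines for each one.
def transcribe(audio: bytes) -> str: ...
def generate_reply(prompt: str) -> str: ...
def synthesize(text: str) -> bytes: ...

def voice_agent_turn(mic_audio: bytes) -> bytes:
    """One round trip: mic audio in, reply audio out, all on-device."""
    text = transcribe(mic_audio)    # STT
    reply = generate_reply(text)    # LLM
    return synthesize(reply)        # TTS

def latency_budget(stages: dict) -> float:
    """Sum per-stage latencies in seconds (cf. the table above)."""
    return sum(stages.values())

cpu_total = latency_budget({"stt": 0.5, "llm": 2.0, "tts": 0.3})
print(cpu_total)  # about 2.8s, which the table rounds to ~3s
```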

How to Choose Your Local TTS Path

Do you have a browser?
├── Yes → Start with OfflineTTS (zero setup)
│         └── Need to compare models? → TTS Studio
└── Need programmatic access?
    ├── Python scripting? → Kokoro (pip install kokoro)
    ├── CLI / Linux? → Piper (pip install piper-tts)
    ├── Voice cloning? → F5-TTS or XTTS-v2
    └── Raspberry Pi / embedded? → Piper or Kitten

The common thread: once the model is downloaded, you own the entire pipeline. No API keys to manage. No rate limits to hit. No pricing tiers to outgrow. No text leaves your device.
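The decision tree can also be written down as a function. Purely illustrative — the branch order and tool names mirror the tree above:

```python
def choose_tts(*, programmatic=False, cloning=False, embedded=False,
               python=False, cli=False) -> str:
    """Map the needs from the decision tree to a recommended tool."""
    if not programmatic:
        return "OfflineTTS (browser)"
    if cloning:
        return "F5-TTS or XTTS-v2"
    if embedded:
        return "Piper or Kitten"
    if python:
        return "Kokoro (pip install kokoro)"
    if cli:
        return "Piper (pip install piper-tts)"
    return "OfflineTTS (browser)"

print(choose_tts())                                   # zero-setup default
print(choose_tts(programmatic=True, python=True))     # scripting path
```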

Why Local TTS Matters in 2026

Three shifts have made local TTS the practical choice:

  1. Model quality caught up. Kokoro TTS at 82M parameters produces speech that scores 4.3–4.5 MOS — comparable to mid-tier cloud APIs. The quality gap that justified cloud dependency has largely closed.

  2. Hardware got cheaper. A $500 laptop with 8GB RAM can run Kokoro faster than realtime. You don’t need a data center. You need a device made in the last five years.

  3. Browser runtimes matured. ONNX Runtime Web + WebGPU means you can run inference in a browser tab with near-native performance. The setup cost dropped to zero.

Get Started

The fastest way to try local TTS is in your browser:

OfflineTTS — 54 voices, 9 languages, free, private, and works offline on your device.