Browser STT workspace

Whisper Speech to Text

Upload audio or video, record from your microphone, or load a direct media URL. Transcribe privately in your browser and export TXT, SRT, or VTT.

Private transcription, subtitle exports, Whisper models

Model size: 40-240 MB (tiny/base/small)
Languages supported: 99
Exports: TXT, SRT, and WebVTT
Backends: WebGPU (GPU) with WASM fallback (CPU)

STT works best on desktop

Speech recognition runs on WebGPU, with WASM as a fallback. Desktop Chrome or Edge gives the most reliable results.

About Whisper STT

Whisper is OpenAI's state-of-the-art speech recognition model, now running entirely in your browser. It supports 99 languages with automatic language detection, and produces highly accurate transcriptions with word-level timestamps.

Choose from three model sizes: Tiny (~40MB, fastest), Base (~76MB, good balance), or Small (~240MB, best quality). The model automatically uses WebGPU when available, falling back to WASM for broad compatibility. Audio is processed in a background thread so the UI stays responsive.
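The WebGPU-then-WASM fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual code: `Backend`, `GpuLike`, and `pickBackend` are hypothetical names, and the only real API relied on is `navigator.gpu.requestAdapter()` from WebGPU, which resolves to null when no suitable adapter exists.

```typescript
// Hypothetical names for illustration; not taken from the tool's source.
type Backend = "webgpu" | "wasm";

// Minimal local shape of `navigator.gpu`, so the sketch compiles
// without DOM or WebGPU type definitions installed.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}

// Returns "webgpu" only when the WebGPU API exists AND hands back an adapter.
// Every other outcome (no API, null adapter, thrown error) falls back to WASM.
async function pickBackend(
  nav: { gpu?: GpuLike } | undefined
): Promise<Backend> {
  if (!nav || !nav.gpu) return "wasm";
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    return "wasm";
  }
}
```

In a browser this would be called with the global navigator object. Checking for an actual adapter, and not only for the presence of `navigator.gpu`, matters because some browsers expose the API on hardware that cannot use it.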

Audio never leaves your browser; all processing happens locally. Export transcriptions as plain text, SRT subtitles, or WebVTT captions.

Try our TTS tools: Kokoro TTS (54 voices · Best quality) · Kitten TTS (8 voices · Lightest) · Piper TTS (25 voices · Fastest CPU) · Supertonic TTS (5 languages · Local)

Getting Started with Whisper STT

Transcribe audio to text directly in your browser. No API key and no signup: just upload audio and get accurate transcription with word-level timestamps.

1. Choose Model Size

Tiny (~40MB) for quick tests, Base (~76MB) for balanced speed and accuracy, Small (~240MB) for best quality. Start with Tiny to verify your setup.

2. Upload or Record Audio

Upload an audio file (WAV, MP3, WebM, etc.) or record directly in the browser. The tool decodes audio in a background worker for smooth performance.
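Whisper models take mono audio at 16 kHz, so decoded audio usually has to be mixed down before transcription. The sketch below shows only the downmix step, assuming the decoder hands back one Float32Array per channel, all of equal length (as a Web Audio AudioBuffer does). `downmixToMono` is a hypothetical helper, not the tool's own function, and resampling is left out.

```typescript
// Hypothetical helper: average all channels into a single mono track.
// Assumes every channel has the same number of samples.
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length); // zero-initialised
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      // Scale each sample as it is added so the sum never exceeds the input range.
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}
```

Because the function only touches typed arrays, it can run inside a worker as easily as on the main thread.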

3. Transcribe

Whisper auto-detects the spoken language and produces word-level timestamps. Streaming mode shows results in real-time as transcription progresses.

4. Export Results

Download as plain text, SRT subtitles, or WebVTT captions. SRT and VTT include word-level timestamps for video captioning.
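To make the difference between the two subtitle formats concrete, here is a minimal sketch of how timestamped text could be serialised. It assumes the transcript is available as a list of cues with start and end times in seconds; the `Cue` type and the function names are hypothetical, and a cue may hold a single word or a whole phrase. SRT numbers its cues and uses a comma before the milliseconds, while WebVTT starts with a `WEBVTT` header and uses a period.

```typescript
// Hypothetical cue shape: times in seconds, text may be a word or a phrase.
interface Cue {
  start: number;
  end: number;
  text: string;
}

// Left-pad a non-negative integer with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as HH:MM:SS<sep>mmm (sep is "," for SRT, "." for WebVTT).
function formatTimestamp(seconds: number, sep: "," | "."): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}${sep}${pad(ms, 3)}`;
}

// SRT: numbered cues separated by blank lines.
function toSrt(cues: Cue[]): string {
  return cues
    .map((c, i) =>
      `${i + 1}\n${formatTimestamp(c.start, ",")} --> ${formatTimestamp(c.end, ",")}\n${c.text.trim()}\n`
    )
    .join("\n");
}

// WebVTT: "WEBVTT" header, then unnumbered cues separated by blank lines.
function toVtt(cues: Cue[]): string {
  const body = cues
    .map((c) =>
      `${formatTimestamp(c.start, ".")} --> ${formatTimestamp(c.end, ".")}\n${c.text.trim()}\n`
    )
    .join("\n");
  return "WEBVTT\n\n" + body;
}
```

For example, `formatTimestamp(3661.5, ",")` yields `01:01:01,500`.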

Tips for Accurate Transcription

1. Use clear audio. Low background noise and clear speech produce the best results. If possible, use a good microphone and record in a quiet environment.

2. Choose the right model. Tiny works well for quick drafts. For production use (subtitles, meeting notes, accessibility), use Small for highest accuracy.

3. Let auto-detection work. Whisper detects the spoken language automatically. If you have multilingual audio, let it auto-detect rather than forcing a language.

4. Use WebGPU. Chrome and Edge with WebGPU support transcribe significantly faster. The tool auto-selects the best available backend.