Browser Speech Recognition in 2026: Whisper and the STT Landscape
Speech-to-text in the browser has moved from demo-grade to production-ready. OpenAI’s Whisper model — and the ecosystem it spawned — made this possible. But “Whisper” is not a single thing. There are now multiple implementations, each with distinct trade-offs in model size, inference speed, and output quality.
This article examines the current landscape of browser-based STT, explains the underlying architecture, and compares the available libraries on metrics that matter.
How Browser-Based STT Works
Running speech recognition in a browser involves three stages:
1. Audio capture and decoding — Microphone input or file upload is converted to mono 16 kHz PCM, the format Whisper expects. WebCodecs (where available) provides hardware-accelerated decoding; AudioContext is the fallback (a decoding sketch follows this list).
2. Neural inference — The Whisper encoder-decoder model runs via ONNX Runtime Web, using WebGPU when available and falling back to WebAssembly. This is where the heavy computation happens: feature extraction (mel spectrogram), encoder self-attention, and autoregressive token generation.
3. Post-processing — Raw token output is decoded into text, with timestamp tokens parsed into segment boundaries. Long audio is handled by chunking into 30-second windows with overlapping strides.
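To make stage 1 concrete, here is a minimal sketch of the AudioContext fallback path: decode the uploaded file at its native sample rate, then resample and downmix to mono 16 kHz by rendering through an OfflineAudioContext. The function name is illustrative, not part of any library.

```js
// Sketch: decode a File/Blob to mono 16 kHz Float32 PCM (the AudioContext fallback path).
async function decodeTo16kMono(file) {
  const arrayBuffer = await file.arrayBuffer();

  // Decode at the file's native sample rate.
  const ctx = new AudioContext();
  const decoded = await ctx.decodeAudioData(arrayBuffer);
  await ctx.close();

  // Resample and downmix by rendering through a 1-channel OfflineAudioContext at 16 kHz.
  const offline = new OfflineAudioContext(1, Math.ceil(decoded.duration * 16000), 16000);
  const source = offline.createBufferSource();
  source.buffer = decoded;
  source.connect(offline.destination);
  source.start();

  const rendered = await offline.startRendering();
  return rendered.getChannelData(0); // Float32Array ready for the Whisper feature extractor
}
```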
The critical bottleneck is step 2. A Whisper base model has ~74M parameters split across the encoder and decoder. Running this in a browser at acceptable speed requires model quantization and hardware acceleration.
The Whisper Family
OpenAI Whisper (Original)
OpenAI released Whisper in September 2022 as a family of five model sizes ranging from tiny (39M params) to large (1.5B params), most available in both English-only and multilingual variants; large-v2 and large-v3 followed in later releases. The models were trained on 680,000 hours of multilingual audio.
The original implementation runs in Python with PyTorch. It remains the reference for accuracy but is not browser-runnable.
Whisper.cpp (C/C++)
Georgi Gerganov’s whisper.cpp is a C/C++ inference port using GGML tensor operations. It runs on CPU with optimized SIMD (AVX2, ARM NEON) and supports quantized models (Q4, Q5, Q8).
Strengths: Extremely fast on CPU, widely ported (iOS, Android, Raspberry Pi), mature and well-tested.
Limitation: Not directly usable in browsers. WASM builds exist but lack GPU acceleration.
transformers.js Whisper (HuggingFace)
HuggingFace’s transformers.js provides a JavaScript API for running ONNX-converted Whisper models in the browser. It uses onnxruntime-web as the inference backend, with WebGPU support added in v3.x.
This is the most common starting point for browser-based STT. The pipeline API is straightforward:
```js
import { pipeline } from '@huggingface/transformers';

const transcriber = await pipeline('automatic-speech-recognition', 'onnx-community/whisper-base', {
  device: 'webgpu',
  dtype: 'q8',
});

// `audio` is a Float32Array of mono 16 kHz PCM (see stage 1 above).
const result = await transcriber(audio, { return_timestamps: true, chunk_length_s: 30 });
```
Strengths: Familiar API, works with multiple model sizes, WebGPU support.
Caveats: The pipeline runs on the main thread by default, so long audio files block the UI. As of v3.8.1, SuppressTokensLogitsProcessor is commented out, which can cause hallucination in long-form transcription. Quantization choice matters: `q8` for the encoder can degrade feature quality, while hybrid (`fp32` encoder, `q4` decoder) preserves accuracy with acceptable model size.
browser-whisper
browser-whisper is a purpose-built library that wraps transformers.js with production-oriented architecture:
- Web Workers — Audio decoding (via WebCodecs/MediaBunny) and Whisper inference each run in dedicated workers, keeping the main thread responsive during long transcriptions.
- Streaming output — Segments are emitted via `AsyncIterable<TranscriptSegment>` as they're transcribed, enabling real-time UI updates.
- Backpressure — The decoder worker pauses if the inference worker falls behind, preventing memory growth on long files.
- Hybrid quantization — Uses an `fp32` encoder + `q4` decoder by default, balancing quality and model size (Whisper base ≈ 76 MB vs. ~300 MB for full `q8`).
```js
// Import path assumed to match the package name; check the library's docs.
import { BrowserWhisper } from 'browser-whisper';

const whisper = new BrowserWhisper({ model: 'whisper-base' });

// `file` is a File from an <input type="file"> or drag-and-drop.
for await (const segment of whisper.transcribe(file)) {
  console.log(`[${segment.start}s - ${segment.end}s] ${segment.text}`);
}
```
Strengths: Non-blocking, streaming, correct default configuration, shader-compilation pre-warming.
Limitation: Additional dependency (mediabunny for WebCodecs), no word-level timestamps in current version.
Moonshine
Moonshine is a newer model family from Useful Sensors designed specifically for on-device ASR. Available in tiny (5.8M params) and base (61M params) variants via onnx-community on HuggingFace.
Strengths: Very small models, fast on CPU, designed for real-time streaming use cases.
Limitation: English-only, no timestamp support, smaller training dataset than Whisper.
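Moonshine can be loaded through the same transformers.js pipeline API as Whisper. The sketch below assumes a Moonshine export published under onnx-community; the exact model id may differ, so verify it on HuggingFace before relying on it.

```js
// Sketch: loading Moonshine via transformers.js (model id assumed; verify on HuggingFace).
const moonshine = await pipeline('automatic-speech-recognition', 'onnx-community/moonshine-tiny-ONNX', {
  device: 'webgpu',
});

// `pcm` is mono 16 kHz Float32 audio; no return_timestamps, since Moonshine emits none.
const { text } = await moonshine(pcm);
```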
Distil-Whisper
HuggingFace’s distilled version of Whisper large-v3, trained on 22,000 hours of audio. Available in English-only variants (distil-small.en, distil-medium.en).
Strengths: 5-6x faster than Whisper large with minimal quality loss for English.
Limitation: English-only (no multilingual variants), larger than Whisper small.
Comparison Matrix
| Library | Models | Languages | Timestamps | Model Size | WebGPU | Streaming | Worker-Based |
|---|---|---|---|---|---|---|---|
| transformers.js | tiny/base/small/large | 99 | Segment-level | 40-3000 MB | Yes | No | No |
| browser-whisper | tiny/base/small | 99 | Segment-level | 40-240 MB | Yes | Yes | Yes |
| Whisper.cpp | tiny through large-v3 | 99 | Word-level | 39-3000 MB | No | No | N/A (native) |
| Moonshine | tiny/base | English only | No | 6-61 MB | Yes | No | No |
| Distil-Whisper | small/medium | English only | Segment-level | 185-760 MB | Yes | No | No |
Model Size vs. Accuracy
The choice of model size is the primary trade-off in browser-based STT:
| Model | Parameters | Download Size (hybrid) | Relative Accuracy | Real-time Factor (WebGPU) |
|---|---|---|---|---|
| Whisper Tiny | 39M | ~40 MB | Adequate for clear speech | 10-15x |
| Whisper Base | 74M | ~76 MB | Good balance for most use cases | 5-8x |
| Whisper Small | 244M | ~240 MB | Best quality, handles accents/noise | 2-4x |
Tiny is useful for quick previews or constrained environments. Base is the recommended default — it handles most real-world audio well. Small is worth the extra download if accuracy is paramount.
Quantization: Why It Matters
ONNX models can be quantized to reduce size. The key insight is that not all parts of the model should be quantized equally:
- Encoder (feature extractor): Sensitive to quantization. `fp32` is recommended; quantizing the encoder to `q8` can degrade feature quality, leading to garbled output, especially on accented or noisy audio.
- Decoder (text generator): More tolerant of quantization. `q4` or `q8` both work, with `q4` being significantly smaller.
This is why browser-whisper defaults to hybrid quantization (fp32 encoder + q4 decoder). A full q8 model at ~300 MB isn’t just larger — it can produce worse transcriptions than the 76 MB hybrid version because quantization noise in the encoder propagates through the entire decoder stack.
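In transformers.js, this per-module choice can be expressed by passing an object for `dtype`. The module names below follow the usual onnx-community Whisper export layout (`encoder_model`, `decoder_model_merged`); verify them against the files in the model repo you load.

```js
// Sketch: hybrid quantization via per-module dtype in transformers.js.
const transcriber = await pipeline('automatic-speech-recognition', 'onnx-community/whisper-base', {
  device: 'webgpu',
  dtype: {
    encoder_model: 'fp32',        // keep the feature extractor at full precision
    decoder_model_merged: 'q4',   // aggressively quantize the autoregressive decoder
  },
});
```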
WebGPU vs. WebAssembly
WebGPU provides 5-10x speedup over WASM for Whisper inference, but adoption remains limited:
- Chrome/Edge 113+: Supported, with occasional driver-specific issues
- Safari: Not supported as of 2026
- Firefox: Experimental, behind flags
- Linux Chrome: Requires `--enable-unsafe-webgpu --enable-features=Vulkan`
A robust browser STT implementation must fall back to WASM gracefully. The fallback should be automatic — the user should not need to configure anything.
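A minimal feature-detection sketch follows; the `pickDevice` helper is illustrative, and the returned value can be passed as the `device` option shown in the earlier pipeline examples.

```js
// Sketch: detect WebGPU and fall back to WASM automatically (helper name is illustrative).
async function pickDevice() {
  if (navigator.gpu) {
    try {
      const adapter = await navigator.gpu.requestAdapter();
      if (adapter) return 'webgpu';
    } catch {
      // Some browsers expose navigator.gpu but fail to provide an adapter.
    }
  }
  return 'wasm';
}

const device = await pickDevice(); // pass as { device } when creating the pipeline
```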
Long-Form Audio: Chunking and Hallucination
Whisper was designed for 30-second audio clips. For longer files, the standard approach is sliding-window chunking with overlapping strides (typically 30-second windows, 5-second stride).
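With transformers.js, this chunking is driven by two pipeline call options (the values shown are the typical window and stride mentioned above, using the `transcriber` created earlier):

```js
// Sketch: long-form transcription with sliding-window chunking in transformers.js.
const result = await transcriber(audio, {
  return_timestamps: true,
  chunk_length_s: 30,   // window size in seconds
  stride_length_s: 5,   // overlap on each side, used to stitch chunks back together
});
```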
Two common problems arise:
Hallucination — Whisper can generate repetitive, nonsensical text (“biasesVIDEO biasesVIDEO…”) especially at chunk boundaries or in silent regions. This is partially caused by the model’s suppress tokens not being applied. In transformers.js v3.8.1, SuppressTokensLogitsProcessor is commented out, meaning 90 hallucination-prone tokens identified in Whisper’s generation config are never suppressed during decoding. Libraries that work around this (browser-whisper applies correct pipeline configuration; manual logits patching is another approach) produce significantly cleaner output.
Timestamp misalignment — At chunk boundaries, timestamps can drift or overlap. The stride mechanism mitigates this, but post-processing may still be needed for subtitle formats (SRT, VTT).
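As an example of that post-processing, here is a small helper (illustrative, not part of any library) that turns `{ start, end, text }` segments, with times in seconds, into an SRT string:

```js
// Sketch: format transcript segments as SRT.
function toSrtTimestamp(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = String(Math.floor(ms / 3600000)).padStart(2, '0');
  const m = String(Math.floor((ms % 3600000) / 60000)).padStart(2, '0');
  const s = String(Math.floor((ms % 60000) / 1000)).padStart(2, '0');
  const rem = String(ms % 1000).padStart(3, '0');
  return `${h}:${m}:${s},${rem}`;
}

function segmentsToSrt(segments) {
  return segments
    .map((seg, i) => `${i + 1}\n${toSrtTimestamp(seg.start)} --> ${toSrtTimestamp(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}
```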
Architecture Patterns for Production Use
Based on the libraries examined, three architectural patterns have emerged:
1. Main-Thread Pipeline (transformers.js default)
The simplest approach. The entire pipeline — audio decoding, feature extraction, inference, and post-processing — runs on the main thread.
[User action] → [Pipeline on main thread] → [UI frozen until complete]
Works for short clips. Unsuitable for files over 60 seconds because the UI becomes unresponsive.
2. Web Worker Pipeline (browser-whisper)
Audio decoding and inference each run in dedicated Web Workers, connected by a MessageChannel for zero-copy PCM transfer.
[Main thread] ← segments ← [Whisper Worker] ← PCM chunks ← [Decoder Worker] ← file
The main thread stays responsive. Backpressure prevents memory growth. This is the recommended architecture for any production browser STT tool.
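The sketch below is not browser-whisper's implementation, just a reduced illustration of the pattern: a single module worker that owns the transformers.js pipeline and posts segments back to the main thread. File name and message shape are illustrative.

```js
// transcribe-worker.js — sketch of running inference off the main thread.
import { pipeline } from '@huggingface/transformers';

let transcriber;

self.onmessage = async (event) => {
  const { pcm } = event.data; // Float32Array of mono 16 kHz samples
  transcriber ??= await pipeline('automatic-speech-recognition', 'onnx-community/whisper-base', {
    device: 'webgpu',
  });
  const result = await transcriber(pcm, { return_timestamps: true, chunk_length_s: 30 });
  self.postMessage({ segments: result.chunks });
};
```

On the main thread, the worker is created with `new Worker(url, { type: 'module' })` and the PCM Float32Array is posted with its underlying buffer in the transfer list, so no copy is made and the UI thread never touches the model.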
3. Hybrid (main thread + offscreen)
Some implementations move inference to an OffscreenCanvas or SharedArrayBuffer-based worker while keeping audio decoding on the main thread. This is less clean than pattern 2 but avoids the complexity of dual-worker coordination.
Recommendations
For most use cases: Use browser-whisper with the whisper-base model and hybrid quantization. It provides the best balance of correctness, performance, and developer experience.
For maximum accuracy: Use whisper-small with the same pipeline. The extra 164 MB download is worth it for transcribing accented speech or noisy recordings.
For fastest load time on slow connections: Use whisper-tiny. It’s adequate for clear English audio and downloads in seconds on any connection.
For real-time streaming (e.g., live captioning): Consider Moonshine. Its tiny model and English-only focus make it fast enough for sub-second latency, though you'll sacrifice multilingual support and timestamps.
For server-side deployment: Whisper.cpp remains the best option. It’s faster than any browser implementation and supports all model sizes including large-v3.
Conclusion
Browser-based speech recognition has reached a point where it’s genuinely useful for production applications. The combination of WebGPU acceleration, hybrid quantization, and worker-based architecture means that a 76 MB model can transcribe audio at 5-8x real-time speed without blocking the UI.
The key is choosing the right tool for the job: a purpose-built library like browser-whisper for browser applications, whisper.cpp for native deployments, and the raw transformers.js pipeline for custom research or prototyping.
Try browser-based speech recognition yourself at OfflineTTS STT — no signup, no API key, runs entirely in your browser.