Browser Speech Recognition in 2026: Whisper and the STT Landscape
Speech-to-text in the browser has moved from demo-grade to production-ready. OpenAI’s Whisper model — and the ecosystem it spawned — made this possible. But “Whisper” is not a single thing. There are now multiple implementations, each with distinct trade-offs in model size, inference speed, and output quality.
This article examines the current landscape of browser-based STT, explains the underlying architecture, and compares the available libraries on metrics that matter.
How Browser-Based STT Works
Running speech recognition in a browser involves three stages:
1. Audio capture and decoding — Microphone input or file upload is converted to mono 16 kHz PCM, the format Whisper expects. WebCodecs (where available) provides hardware-accelerated decoding; AudioContext is the fallback (a decoding sketch follows this list).
2. Neural inference — The Whisper encoder-decoder model runs via ONNX Runtime Web, using WebGPU when available and falling back to WebAssembly. This is where the heavy computation happens: feature extraction (mel spectrogram), encoder self-attention, and autoregressive token generation.
3. Post-processing — Raw token output is decoded into text, with timestamp tokens parsed into segment boundaries. Long audio is handled by chunking into 30-second windows with overlapping strides.
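To make stage 1 concrete, here is a minimal sketch of the AudioContext fallback path: decode the uploaded file at its native sample rate, then resample and downmix to mono 16 kHz by rendering through an OfflineAudioContext. The function name is illustrative, not part of any library.

```js
// Sketch: decode a File/Blob to mono 16 kHz Float32 PCM (the AudioContext fallback path).
async function decodeTo16kMono(file) {
  const arrayBuffer = await file.arrayBuffer();

  // Decode at the file's native sample rate.
  const ctx = new AudioContext();
  const decoded = await ctx.decodeAudioData(arrayBuffer);
  await ctx.close();

  // Resample and downmix by rendering through a 1-channel OfflineAudioContext at 16 kHz.
  const offline = new OfflineAudioContext(1, Math.ceil(decoded.duration * 16000), 16000);
  const source = offline.createBufferSource();
  source.buffer = decoded;
  source.connect(offline.destination);
  source.start();

  const rendered = await offline.startRendering();
  return rendered.getChannelData(0); // Float32Array ready for the Whisper feature extractor
}
```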
The critical bottleneck is step 2. A Whisper base model has ~74M parameters split across the encoder and decoder. Running this in a browser at acceptable speed requires model quantization and hardware acceleration.
The Whisper Family
OpenAI Whisper (Original)
OpenAI released Whisper in September 2022 as a family of five model sizes ranging from tiny (39M params) to large (1.5B params), most available in both English-only and multilingual variants; large-v2 and large-v3 followed in later releases. The models were trained on 680,000 hours of multilingual audio.
The original implementation runs in Python with PyTorch. It remains the reference for accuracy but is not browser-runnable.
Whisper.cpp (C/C++)
Georgi Gerganov’s whisper.cpp is a C/C++ inference port using GGML tensor operations. It runs on CPU with optimized SIMD (AVX2, ARM NEON) and supports quantized models (Q4, Q5, Q8).
Strengths: Extremely fast on CPU, widely ported (iOS, Android, Raspberry Pi), mature and well-tested.
Limitation: Not directly usable in browsers. WASM builds exist but lack GPU acceleration.
transformers.js Whisper (HuggingFace)
HuggingFace’s transformers.js provides a JavaScript API for running ONNX-converted Whisper models in the browser. It uses onnxruntime-web as the inference backend, with WebGPU support added in v3.x.
This is the most common starting point for browser-based STT. The pipeline API is straightforward:
```js
import { pipeline } from '@huggingface/transformers';

const transcriber = await pipeline('automatic-speech-recognition', 'onnx-community/whisper-base', {
  device: 'webgpu',
  dtype: 'q8',
});

// `audio` is a Float32Array of mono 16 kHz PCM (see stage 1 above).
const result = await transcriber(audio, { return_timestamps: true, chunk_length_s: 30 });
```
Strengths: Familiar API, works with multiple model sizes, WebGPU support.
Caveats: The pipeline runs on the main thread by default, so long audio files block the UI. As of v3.8.1, SuppressTokensLogitsProcessor is commented out, which can cause hallucination in long-form transcription. Quantization choice matters: `q8` for the encoder can degrade feature quality, while hybrid (`fp32` encoder, `q4` decoder) preserves accuracy with acceptable model size.
browser-whisper
browser-whisper is a purpose-built library that wraps transformers.js with production-oriented architecture:
- Web Workers — Audio decoding (via WebCodecs/MediaBunny) and Whisper inference each run in dedicated workers, keeping the main thread responsive during long transcriptions.
- Streaming output — Segments are emitted via `AsyncIterable<TranscriptSegment>` as they're transcribed, enabling real-time UI updates.
- Backpressure — The decoder worker pauses if the inference worker falls behind, preventing memory growth on long files.
- Hybrid quantization — Uses an `fp32` encoder + `q4` decoder by default, balancing quality and model size (Whisper base ≈ 76 MB vs. ~300 MB for full `q8`).
```js
// Import path assumed to match the package name; check the library's docs.
import { BrowserWhisper } from 'browser-whisper';

const whisper = new BrowserWhisper({ model: 'whisper-base' });

// `file` is a File from an <input type="file"> or drag-and-drop.
for await (const segment of whisper.transcribe(file)) {
  console.log(`[${segment.start}s - ${segment.end}s] ${segment.text}`);
}
```
Strengths: Non-blocking, streaming, correct default configuration, shader-compilation pre-warming.
Limitation: Additional dependency (mediabunny for WebCodecs), no word-level timestamps in current version.
Moonshine
Moonshine is a newer model family from Useful Sensors designed specifically for on-device ASR. Available in tiny (5.8M params) and base (61M params) variants via onnx-community on HuggingFace.
Strengths: Very small models, fast on CPU, designed for real-time streaming use cases.
Limitation: English-only, no timestamp support, smaller training dataset than Whisper.
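Moonshine can be loaded through the same transformers.js pipeline API as Whisper. The sketch below assumes a Moonshine export published under onnx-community; the exact model id may differ, so verify it on HuggingFace before relying on it.

```js
// Sketch: loading Moonshine via transformers.js (model id assumed; verify on HuggingFace).
const moonshine = await pipeline('automatic-speech-recognition', 'onnx-community/moonshine-tiny-ONNX', {
  device: 'webgpu',
});

// `pcm` is mono 16 kHz Float32 audio; no return_timestamps, since Moonshine emits none.
const { text } = await moonshine(pcm);
```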
Distil-Whisper
HuggingFace’s distilled version of Whisper large-v3, trained on 22,000 hours of audio. Available in English-only variants (distil-small.en, distil-medium.en).
Strengths: 5-6x faster than Whisper large with minimal quality loss for English.
Limitation: English-only (no multilingual variants), larger than Whisper small.
Comparison Matrix
| Library | Models | Languages | Timestamps | Model Size | WebGPU | Streaming | Worker-Based |
|---|---|---|---|---|---|---|---|
| transformers.js | tiny/base/small/large | 99 | Segment-level | 40-3000 MB | Yes | No | No |
| browser-whisper | tiny/base/small | 99 | Segment-level | 40-240 MB | Yes | Yes | Yes |
| Whisper.cpp | tiny through large-v3 | 99 | Word-level | 39-3000 MB | No | No | N/A (native) |
| Moonshine | tiny/base | English only | No | 6-61 MB | Yes | No | No |
| Distil-Whisper | small/medium | English only | Segment-level | 185-760 MB | Yes | No | No |
Model Size vs. Accuracy
The choice of model size is the primary trade-off in browser-based STT:
| Model | Parameters | Download Size (hybrid) | Relative Accuracy | Real-time Factor (WebGPU) |
|---|---|---|---|---|
| Whisper Tiny | 39M | ~40 MB | Adequate for clear speech | 10-15x |
| Whisper Base | 74M | ~76 MB | Good balance for most use cases | 5-8x |
| Whisper Small | 244M | ~240 MB | Best quality, handles accents/noise | 2-4x |
Tiny is useful for quick previews or constrained environments. Base is the recommended default — it handles most real-world audio well. Small is worth the extra download if accuracy is paramount.
Quantization: Why It Matters
ONNX models can be quantized to reduce size. The key insight is that not all parts of the model should be quantized equally:
- Encoder (feature extractor): Sensitive to quantization. `fp32` is recommended; quantizing the encoder to `q8` can degrade feature quality, leading to garbled output, especially on accented or noisy audio.
- Decoder (text generator): More tolerant of quantization. `q4` or `q8` both work, with `q4` being significantly smaller.
This is why browser-whisper defaults to hybrid quantization (fp32 encoder + q4 decoder). A full q8 model at ~300 MB isn’t just larger — it can produce worse transcriptions than the 76 MB hybrid version because quantization noise in the encoder propagates through the entire decoder stack.
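In transformers.js, this per-module choice can be expressed by passing an object for `dtype`. The module names below follow the usual onnx-community Whisper export layout (`encoder_model`, `decoder_model_merged`); verify them against the files in the model repo you load.

```js
// Sketch: hybrid quantization via per-module dtype in transformers.js.
const transcriber = await pipeline('automatic-speech-recognition', 'onnx-community/whisper-base', {
  device: 'webgpu',
  dtype: {
    encoder_model: 'fp32',        // keep the feature extractor at full precision
    decoder_model_merged: 'q4',   // aggressively quantize the autoregressive decoder
  },
});
```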
WebGPU vs. WebAssembly
WebGPU provides 5-10x speedup over WASM for Whisper inference, but adoption remains limited:
- Chrome/Edge 113+: Supported, with occasional driver-specific issues
- Safari: Not supported as of 2026
- Firefox: Experimental, behind flags
- Linux Chrome: Requires `--enable-unsafe-webgpu --enable-features=Vulkan`
A robust browser STT implementation must fall back to WASM gracefully. The fallback should be automatic — the user should not need to configure anything.
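A minimal feature-detection sketch follows; the `pickDevice` helper is illustrative, and the returned value can be passed as the `device` option shown in the earlier pipeline examples.

```js
// Sketch: detect WebGPU and fall back to WASM automatically (helper name is illustrative).
async function pickDevice() {
  if (navigator.gpu) {
    try {
      const adapter = await navigator.gpu.requestAdapter();
      if (adapter) return 'webgpu';
    } catch {
      // Some browsers expose navigator.gpu but fail to provide an adapter.
    }
  }
  return 'wasm';
}

const device = await pickDevice(); // pass as { device } when creating the pipeline
```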
Long-Form Audio: Chunking and Hallucination
Whisper was designed for 30-second audio clips. For longer files, the standard approach is sliding-window chunking with overlapping strides (typically 30-second windows, 5-second stride).
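With transformers.js, this chunking is driven by two pipeline call options (the values shown are the typical window and stride mentioned above, using the `transcriber` created earlier):

```js
// Sketch: long-form transcription with sliding-window chunking in transformers.js.
const result = await transcriber(audio, {
  return_timestamps: true,
  chunk_length_s: 30,   // window size in seconds
  stride_length_s: 5,   // overlap on each side, used to stitch chunks back together
});
```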
Two common problems arise:
Hallucination — Whisper can generate repetitive, nonsensical text (“biasesVIDEO biasesVIDEO…”) especially at chunk boundaries or in silent regions. This is partially caused by the model’s suppress tokens not being applied. In transformers.js v3.8.1, SuppressTokensLogitsProcessor is commented out, meaning 90 hallucination-prone tokens identified in Whisper’s generation config are never suppressed during decoding. Libraries that work around this (browser-whisper applies correct pipeline configuration; manual logits patching is another approach) produce significantly cleaner output.
Timestamp misalignment — At chunk boundaries, timestamps can drift or overlap. The stride mechanism mitigates this, but post-processing may still be needed for subtitle formats (SRT, VTT).
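As an example of that post-processing, here is a small helper (illustrative, not part of any library) that turns `{ start, end, text }` segments, with times in seconds, into an SRT string:

```js
// Sketch: format transcript segments as SRT.
function toSrtTimestamp(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = String(Math.floor(ms / 3600000)).padStart(2, '0');
  const m = String(Math.floor((ms % 3600000) / 60000)).padStart(2, '0');
  const s = String(Math.floor((ms % 60000) / 1000)).padStart(2, '0');
  const rem = String(ms % 1000).padStart(3, '0');
  return `${h}:${m}:${s},${rem}`;
}

function segmentsToSrt(segments) {
  return segments
    .map((seg, i) => `${i + 1}\n${toSrtTimestamp(seg.start)} --> ${toSrtTimestamp(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}
```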
Architecture Patterns for Production Use
Based on the libraries examined, three architectural patterns have emerged:
1. Main-Thread Pipeline (transformers.js default)
The simplest approach. The entire pipeline — audio decoding, feature extraction, inference, and post-processing — runs on the main thread.
[User action] → [Pipeline on main thread] → [UI frozen until complete]
Works for short clips. Unsuitable for files over 60 seconds because the UI becomes unresponsive.
2. Web Worker Pipeline (browser-whisper)
Audio decoding and inference each run in dedicated Web Workers, connected by a MessageChannel for zero-copy PCM transfer.
[Main thread] ← segments ← [Whisper Worker] ← PCM chunks ← [Decoder Worker] ← file
The main thread stays responsive. Backpressure prevents memory growth. This is the recommended architecture for any production browser STT tool.
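The sketch below is not browser-whisper's implementation, just a reduced illustration of the pattern: a single module worker that owns the transformers.js pipeline and posts segments back to the main thread. File name and message shape are illustrative.

```js
// transcribe-worker.js — sketch of running inference off the main thread.
import { pipeline } from '@huggingface/transformers';

let transcriber;

self.onmessage = async (event) => {
  const { pcm } = event.data; // Float32Array of mono 16 kHz samples
  transcriber ??= await pipeline('automatic-speech-recognition', 'onnx-community/whisper-base', {
    device: 'webgpu',
  });
  const result = await transcriber(pcm, { return_timestamps: true, chunk_length_s: 30 });
  self.postMessage({ segments: result.chunks });
};
```

On the main thread, the worker is created with `new Worker(url, { type: 'module' })` and the PCM Float32Array is posted with its underlying buffer in the transfer list, so no copy is made and the UI thread never touches the model.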
3. Hybrid (main thread + offscreen)
Some implementations move inference to an OffscreenCanvas or SharedArrayBuffer-based worker while keeping audio decoding on the main thread. This is less clean than pattern 2 but avoids the complexity of dual-worker coordination.
Recommendations
For most use cases: Use browser-whisper with the whisper-base model and hybrid quantization. It provides the best balance of correctness, performance, and developer experience.
For maximum accuracy: Use whisper-small with the same pipeline. The extra 164 MB download is worth it for transcribing accented speech or noisy recordings.
For fastest load time on slow connections: Use whisper-tiny. It’s adequate for clear English audio and downloads in seconds on any connection.
For real-time streaming (e.g., live captioning): Consider Moonshine. Its tiny model and English-only focus make it fast enough for sub-second latency, though you'll sacrifice multilingual support and timestamps.
For server-side deployment: Whisper.cpp remains the best option. It’s faster than any browser implementation and supports all model sizes including large-v3.
Conclusion
Browser-based speech recognition has reached a point where it’s genuinely useful for production applications. The combination of WebGPU acceleration, hybrid quantization, and worker-based architecture means that a 76 MB model can transcribe audio at 5-8x real-time speed without blocking the UI.
The key is choosing the right tool for the job: a purpose-built library like browser-whisper for browser applications, whisper.cpp for native deployments, and the raw transformers.js pipeline for custom research or prototyping.
Try browser-based speech recognition yourself at OfflineTTS STT — no signup, no API key, runs entirely in your browser.