TTS & STT Landscape in May 2026: On-Device Breakthroughs, New APIs, and Open-Source Momentum
The text-to-speech and speech-to-text ecosystems are moving faster than at any point in their history. As of May 2026, the defining tension is no longer “cloud vs. local” — it’s the fact that both sides are advancing simultaneously, and the gap between them is narrowing in ways that matter for real products. On-device engines now deliver multilingual speech at speeds that were unthinkable two years ago, while cloud APIs are consolidating the STT-LLM-TTS pipeline into single-call interfaces that eliminate entire layers of integration complexity.
Here is a detailed look at the developments shaping TTS and STT right now.
On-Device TTS: Two Standout Releases
Supertonic — 99M Parameters, 31 Languages, 167x Real-Time
Supertonic is arguably the most technically impressive on-device TTS release of the month. Built on ONNX Runtime for fully local inference, the model weighs in at approximately 99 million parameters — small enough to run comfortably on a Raspberry Pi, a mobile phone, or inside a browser tab — yet it supports 31 languages and achieves inference speeds exceeding 167 times real-time.
What makes the 167x figure meaningful is not just raw throughput. At that speed, generating an hour of audio takes roughly 22 seconds on capable hardware. For batch processing workloads — converting documents, generating audiobooks, building voice datasets — this turns a task that previously required GPU-backed cloud infrastructure into something a consumer laptop handles in seconds.
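The arithmetic behind that claim is worth making explicit. A quick sketch (the 167x real-time factor is Supertone's reported figure; actual speed varies with hardware):

```python
# Real-time factor (RTF): ratio of audio duration to synthesis wall time.
# RTF > 1 means faster than real time. 167 is Supertonic's reported figure;
# the hardware behind it is not specified here.
RTF = 167

def synthesis_wall_time(audio_seconds: float, rtf: float = RTF) -> float:
    """Wall-clock seconds needed to generate `audio_seconds` of speech."""
    return audio_seconds / rtf

one_hour = 60 * 60  # 3600 seconds of audio
print(f"1 hour of audio: ~{synthesis_wall_time(one_hour):.1f} s of compute")
print(f"10-hour audiobook: ~{synthesis_wall_time(10 * one_hour) / 60:.1f} min")
```

At 167x, an hour of speech costs about 21.6 seconds of compute and a ten-hour audiobook under four minutes, which is why batch workloads no longer need GPU-backed cloud infrastructure.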
Supertonic 3, the latest version, ships with improved accuracy and expanded language coverage. It also includes a Voice Builder tool for creating custom voices from short reference clips, which makes it viable for applications that need brand-consistent or character-specific voices without sending audio data to a third party. The project is actively maintained on GitHub, with regular commits and a growing contributor base.
The practical implication is clear: for any use case where privacy, latency, or offline operation is a requirement, Supertonic has become the benchmark that other on-device solutions will be measured against.
ToBe SAID — Fully Offline AI Voice Engine for Android
ToBe SAID takes a different approach. Rather than targeting developers, it delivers a polished end-user experience on Android as a system-level TTS engine. The app runs entirely on-device, prioritizes privacy, and is designed around a specific workflow: converting ebooks into natural-sounding audiobooks with customizable voices.
Recent updates have significantly improved voice generation stability and naturalness — two areas where earlier on-device Android TTS solutions struggled. The free tier supports one voice slot, which is sufficient for casual use. The Pro tier unlocks unlimited voice slots and advanced customization options.
For Android users who want to listen to documents, articles, or ebooks without uploading their reading material to a cloud service, ToBe SAID fills a gap that the platform’s built-in TTS engines have historically left wide open.
Cloud API Developments: Consolidation and Speed
xAI Grok TTS and STT APIs
xAI has entered the voice API market with Grok TTS and STT, offering standalone speech-to-text and text-to-speech services that emphasize speed and accuracy across multiple languages. The APIs are already integrated through channels like Telnyx, making them accessible to developers building telephony and communication applications.
The significance here is not just another API entering a crowded market. xAI’s move signals that the major AI platform companies now consider voice I/O a core capability rather than an add-on. For developers, the practical benefit is more competition driving down latency and pricing while pushing up quality. The API documentation covers both TTS and STT endpoints, voice customization options, and streaming support.
OpenAI Realtime Voice — The Single-Call Paradigm
OpenAI’s continued investment in its Realtime API represents a more fundamental architectural shift. The traditional voice agent pipeline — STT service feeding an LLM feeding a TTS service, with each hop adding latency and integration overhead — is being replaced by a single API call that takes speech in and returns speech out.
The impact on voice agent development is substantial. A three-service pipeline that required careful orchestration, WebSocket management, and latency optimization can now be replaced by one API interaction. This does not eliminate the need for on-device TTS in privacy-sensitive or offline scenarios, but it dramatically lowers the barrier for applications where cloud connectivity is acceptable and speed of deployment matters.
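The architectural difference is easy to see in code. The sketch below uses stand-in stub functions, not real SDK calls, purely to contrast the two shapes:

```python
# Illustrative stubs only: these stand in for real STT, LLM, and TTS
# services. None of the function names correspond to an actual SDK.

def transcribe(audio: bytes) -> str:          # hop 1: STT service
    return "what's the weather like?"

def generate_reply(text: str) -> str:         # hop 2: LLM
    return "Sunny, around 22 degrees."

def synthesize(text: str) -> bytes:           # hop 3: TTS service
    return b"fake-audio:" + text.encode()

def legacy_voice_agent(audio_in: bytes) -> bytes:
    """Classic pipeline: three services, three network hops,
    each hop adding latency and a failure mode to handle."""
    text = transcribe(audio_in)
    reply = generate_reply(text)
    return synthesize(reply)

def realtime_voice_agent(audio_in: bytes) -> bytes:
    """Single-call paradigm: speech in, speech out, one round trip.
    Stand-in for a speech-to-speech API such as OpenAI's Realtime API."""
    return b"fake-audio:single-call-response"
```

The orchestration logic, retry handling, and latency budget that `legacy_voice_agent` implies all collapse into whatever the single-call provider does internally; that is the integration complexity being eliminated.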
Gemini 3.1 Flash TTS — Emotion Control at Scale
Google’s Gemini 3.1 Flash TTS brings two advances worth noting. First, it covers over 70 languages, which makes it one of the most broadly capable cloud TTS models available. Second, and more importantly, it introduces fine-grained emotional controllability — the ability to direct not just what the voice says but how it says it, adjusting tone, emphasis, and affect to match the context of the content.
This matters because TTS quality is not just about naturalness. A model that reads a technical document and a marketing script with identical delivery is failing at half its job. Emotional controllability is the feature that separates “readable” from “engaging,” and it has been conspicuously absent from most cloud TTS offerings. Gemini 3.1 Flash TTS’s performance on the OpenRouter model rankings suggests that listeners are noticing the difference.
Open-Source Models: The MOSS-TTS Family
The MOSS-TTS project, developed by OpenMOSS and MOSI.AI, represents the most ambitious open-source effort in speech generation right now. It is not a single model but a family of models addressing different aspects of speech and sound generation:
- High-fidelity long-form speech — sustained quality across extended passages, addressing the degradation that many models exhibit after the first few sentences
- Multi-speaker dialogue — distinct voices within a single generation, enabling conversational applications without stitching together separate TTS calls
- Real-time streaming TTS — chunk-by-chunk output with minimal buffering, suitable for live interaction scenarios
- Sound design and effects — generation of non-speech audio, opening up applications in game development and media production
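The streaming item above has a characteristic interface shape: a generator that yields audio frames as they are produced, so playback can start before synthesis finishes. A toy sketch (the chunking here is fake; a real engine would emit PCM or Opus frames):

```python
from typing import Iterator

def stream_tts(text: str, chunk_chars: int = 40) -> Iterator[bytes]:
    """Toy streaming TTS: yield audio chunk-by-chunk as it is produced.
    A real streaming engine would emit audio frames with minimal buffering;
    here each chunk is just a fake frame derived from a text slice."""
    for i in range(0, len(text), chunk_chars):
        piece = text[i : i + chunk_chars]
        yield piece.encode()  # stand-in for an audio frame

# The caller can begin playback as soon as the first chunk arrives,
# instead of waiting for the full utterance to be synthesized:
chunks = list(stream_tts("The quick brown fox " * 10))
print(f"{len(chunks)} chunks; playback could start after the first one")
```

The key property is that time-to-first-chunk, not total synthesis time, determines perceived latency in live interaction.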
The MOSS-TTS-Nano variant targets resource-constrained environments, trading some quality for significantly reduced memory and compute requirements. For developers building voice features into mobile apps or embedded systems, Nano provides a path to high-quality open-source TTS that does not require a workstation GPU.
The broader open-source community remains active across multiple projects. Chatterbox, Kokoro, and Fish Audio S2 Pro continue to receive updates and community attention. Local Whisper applications for macOS, iOS, and Android that combine on-device STT with local TTS engines like Kokoro are gaining traction among developers who want to build fully offline voice workflows.
Benchmarks and Industry Trends
The Cloud vs. On-Device Gap Is Narrowing
OpenRouter’s updated 2026 TTS model rankings show cloud models like Gemini and Inworld leading on raw quality metrics. But the on-device story has shifted. Six months ago, running TTS locally meant accepting noticeably inferior quality. Today, engines like Supertonic deliver output that is competitive with mid-tier cloud services for many use cases, while adding no network latency, no per-request cost, and no data exposure.
The three core drivers of on-device adoption remain consistent:
- Privacy protection — no text or audio data leaves the device
- Low latency — no network round-trip, which is critical for real-time interaction
- No network dependency — TTS works in airplanes, tunnels, and areas with poor connectivity
Commercial API Landscape
Developers continue to debate the trade-offs between ElevenLabs, Google Cloud TTS, and newer entrants like xAI’s Grok APIs. The conversation has shifted from “which sounds best” to a more nuanced discussion of quality vs. latency vs. pricing at different scales. ElevenLabs remains the quality benchmark but carries premium pricing. Google’s offerings provide strong quality with better pricing at scale. The new entrants are competing aggressively on both cost and speed.
Azure Speech Service continues to expand its language coverage, maintaining its position as the go-to solution for enterprise applications that need broad language support with enterprise-grade SLAs.
Open-Source STT Benchmarks
On the STT side, the benchmark landscape is more fragmented. Canary, Granite, and various Whisper variants continue to trade top positions depending on the language, accent, and domain being tested. The practical takeaway is that no single open-source STT model dominates across all scenarios — the right choice depends on the specific languages and acoustic conditions of your application.
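Because no single model dominates, the most reliable approach is to benchmark candidates on your own audio. The standard metric is word error rate (WER), which any candidate comparison can share. A self-contained implementation using word-level edit distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference
    length. Useful for comparing STT candidates (Whisper variants,
    Canary, Granite, ...) on your own audio, accents, and domain."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn on the kitchen lights", "turn on the kitten lights"))
```

Run each candidate model over a few dozen utterances representative of your users, average the WER, and the "right choice depends on your application" advice becomes a concrete number.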
What This Means for Developers
If you are building a voice feature today, the decision tree looks different than it did a year ago:
- If offline operation or privacy is non-negotiable — Supertonic for TTS, a Whisper variant for STT. The quality is now good enough for production use in most content-reading and voice-assistant scenarios.
- If you need the highest possible output quality and can accept cloud dependency — Gemini 3.1 Flash TTS for emotional expressiveness, ElevenLabs v3 for overall naturalness, OpenAI Realtime API for the simplest integration path.
- If you are building a voice agent — The single-call Realtime API approach eliminates the need to orchestrate three separate services. But consider whether your users will accept cloud dependency and the associated data handling implications.
- If you need voice cloning or multi-speaker dialogue — MOSS-TTS for open-source flexibility, or the commercial APIs if you need managed infrastructure.
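The decision tree above can be encoded directly; the sketch below mirrors the four branches, with the model names serving as examples from this article rather than endorsements:

```python
from dataclasses import dataclass

@dataclass
class Requirements:
    offline_or_private: bool = False   # offline/privacy is non-negotiable
    voice_agent: bool = False          # conversational speech-to-speech
    cloning_or_dialogue: bool = False  # voice cloning / multi-speaker

def pick_stack(req: Requirements) -> str:
    """Encodes the decision tree above. Branch order matters:
    a hard privacy requirement overrides everything else."""
    if req.offline_or_private:
        return "Supertonic (TTS) + Whisper variant (STT)"
    if req.voice_agent:
        return "single-call realtime API (speech in, speech out)"
    if req.cloning_or_dialogue:
        return "MOSS-TTS (open source) or a managed commercial API"
    return "cloud TTS/STT chosen on quality vs. latency vs. pricing"

print(pick_stack(Requirements(offline_or_private=True)))
```

The point of writing it down is the branch ordering: privacy constraints are eliminative, while quality, latency, and pricing are trade-offs you weigh only after the hard constraints are satisfied.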
The most significant trend is not any single release but the convergence: on-device quality is catching up to cloud, cloud APIs are consolidating multi-step pipelines into single calls, and open-source models are covering capabilities — like emotional control and multi-speaker dialogue — that were recently exclusive to commercial services. The result is more options at every point on the quality-cost-privacy spectrum, and that is unambiguously good for anyone building with voice.
This analysis covers developments reported as of May 7, 2026. Sources include official announcements from xAI, OpenAI, Google, and Supertone; GitHub repository activity; and community discussions on X and developer forums.