TTS Arena Leaderboard 2026: How Every Text-to-Speech Model Ranks — Open Source vs Commercial

Seventy-four text-to-speech models. Thousands of blind A/B comparisons. One question: which TTS engine actually sounds the best?

The Artificial Analysis Speech Arena ranks every major TTS model using an Elo-based rating system — the same methodology behind chess rankings and competitive gaming ladders. Human listeners hear two models side by side, pick the one that sounds more natural, and the scores adjust accordingly. No marketing claims, no cherry-picked demos. Just listener preference, aggregated at scale.

Here’s what the leaderboard reveals in May 2026, and what it means for anyone choosing a TTS engine.

How the Leaderboard Works

The Artificial Analysis Speech Arena uses a crowdsourced Elo rating system. Each comparison follows this process:

1. A listener is presented with the same text spoken by two anonymous models
2. The listener votes for the one that sounds more natural
3. The Elo scores adjust: the winner gains points and the loser loses points
4. Over thousands of comparisons, the ratings converge toward a stable ranking
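
The voting loop above can be sketched with the textbook Elo update. This is a minimal illustration, not the arena's actual implementation: the step size k = 32 and the 400-point scale are the conventional chess defaults and are assumptions here, and the function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return the new (winner, loser) ratings after a single vote.

    k is an assumed step size; the arena does not publish its value here.
    """
    gain = k * (1.0 - expected_score(winner, loser))
    return winner + gain, loser - gain


# Two hypothetical models that start level at 1000 points.
model_a, model_b = 1000.0, 1000.0
model_a, model_b = update_elo(model_a, model_b)
print(model_a, model_b)  # 1016.0 984.0
```

An upset moves the ratings more than an expected result: when the winner was already rated far above the loser, the gain shrinks toward zero.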

The system evaluates models across four categories — assistants, customer service, entertainment, and knowledge sharing — and two accents — US English and UK English. This matters because a model that excels at reading audiobooks might struggle with conversational dialogue.

The global leaderboard aggregates all votes regardless of category or accent. As of May 2026, the arena has collected data from 74 models with thousands of comparisons each.

The Top 10 Overall

Rank	Model	Creator	Elo	Win Rate	Appearances	Open Weights	Price/1M chars
1	Realtime TTS 1.5 Max	Inworld	1209.6	73.3%	1,851	No	$35
2	Gemini 3.1 Flash TTS	Google	1205.8	72.4%	1,890	No	$36.61
3	Eleven v3	ElevenLabs	1178.0	68.9%	3,753	No	$100
4	Inworld TTS 1 Max	Inworld	1165.4	66.1%	1,694	No	$35
5	Speech 2.8 HD	MiniMax	1163.7	65.2%	3,512	No	$100
6	Realtime TTS 1.5 Mini	Inworld	1158.4	66.2%	2,148	No	$25
7	Step TTS 2	StepFun	1149.1	64.6%	1,341	No	$40
8	Speech 2.8 Turbo	MiniMax	1146.7	64.0%	3,666	No	$60
9	Speech 2.6 HD	MiniMax	1133.5	62.1%	3,425	No	$100
10	Speech 2.6 Turbo	MiniMax	1128.7	61.3%	3,748	No	$60

The top 10 is dominated by commercial API services. Inworld’s Realtime TTS 1.5 Max holds the crown with a 73.3% win rate across 1,851 appearances. Google’s Gemini 3.1 Flash TTS trails by less than 4 Elo points. ElevenLabs v3 — the most recognizable name in TTS — sits at number 3 with the largest sample size in the top 10 (3,753 appearances).

But here’s the key observation: every model in the top 10 is closed-weight and available only through an API. No open-weight model cracks the top 10, although the 11th-place model matches the 10th-place Elo of 1128.7.

Open-Weight Models: The Standings

Open-weight models — those with publicly available model weights that anyone can download and run — are a different story. These are the engines you can run on your own hardware, in your browser, or on embedded devices. No API keys, no per-character billing, no data leaving your machine.

Rank (Overall)	Model	Creator	Elo	Win Rate	Appearances	Price/1M chars
11	Fish Audio S2 Pro	Fish Audio	1128.7	61.0%	1,115	$15
16	Step Audio EditX	StepFun	1104.9	58.7%	1,110	—
26	Magpie-Multilingual 357M	NVIDIA	1064.2	53.3%	1,091	—
32	Kokoro 82M v1.0	Kokoro	1056.2	54.4%	5,368	$0.65
33	Voxtral TTS	Mistral	1055.9	52.3%	1,114	$16
35	Maya1	Maya Research	1050.6	50.5%	2,852	—
51	Fish Speech 1.5	Fish Audio	1011.9	49.1%	4,918	$15
52	Chatterbox	Resemble AI	1006.4	47.9%	4,707	$25
54	Magpie-Multilingual 357M (older)	NVIDIA	1001.9	45.2%	3,563	—
55	Zonos v0.1	Zyphra	1000.0	47.1%	4,842	$20
57	VibeVoice 7B	Microsoft	959.7	38.1%	1,649	—
60	OpenVoice v2	OpenVoice	949.9	44.0%	7,834	$8.33
66	XTTS v2	Coqui	885.9	36.4%	6,186	$40.44
67	StyleTTS 2	StyleTTS	878.8	37.4%	5,725	$2.82
74	MetaVoice v1	MetaVoice	765.2	21.5%	3,639	—

Fish Audio S2 Pro: The Open-Weight Leader

Fish Audio S2 Pro sits at rank 11 overall with an Elo of 1128.7 — the highest of any open-weight model. Released in March 2026, it wins 61% of its matchups, which puts it ahead of several commercial offerings including OpenAI’s TTS-1 (Elo 1101.6) and Google’s Studio model (Elo 1062.2).

Its strength shows in specific categories. In customer service scenarios, Fish Audio S2 Pro reaches Elo 1158 — competitive with top-10 commercial models. In knowledge sharing (long-form narration, articles, documentation), it scores 1124.

The catch: Fish Audio S2 Pro requires significant GPU resources for self-hosting. Running it locally means you need hardware capable of handling a large transformer model, which limits deployment options compared to lighter engines like Kokoro.

Kokoro 82M: Best Quality-to-Size Ratio

Kokoro 82M v1.0 ranks 32nd overall (Elo 1056.2, 54.4% win rate) — but that number understates its real-world impact. With 5,368 appearances, Kokoro has one of the largest sample sizes on the leaderboard, making its Elo score unusually stable (confidence interval ±9, compared to ±21 for most newer models).

Kokoro’s standout trait is efficiency. At 82 million parameters, it runs entirely in a web browser via WebGPU or WebAssembly, with no server, no dedicated GPU, and no API key. Its $0.65 per 1M characters pricing (via API) is the lowest on the entire leaderboard, and self-hosting is free.

In specific categories, Kokoro performs differently:

Category	Accent	Elo	Win Rate
Knowledge sharing	All	1065.8	57.1%
Assistants	All	1065.8	50.9%
Customer service	US	1135.4	46.0%
Knowledge sharing	US	1096.0	56.7%
Entertainment	All	976.0	49.5%

The pattern is clear: Kokoro excels at knowledge sharing — reading articles, documentation, educational content. It is competitive in assistants and customer service contexts. It is weaker in entertainment (dialogue, character voices, dramatic reading), where more expressive models dominate.

For a model with 82 million parameters that runs in a browser tab, this performance profile is remarkable. Try Kokoro TTS in your browser; no installation is needed.

Voxtral TTS and Maya1: Newer Challengers

Voxtral TTS by Mistral and Maya1 by Maya Research are both recent additions. Voxtral (released March 2026) sits at Elo 1055.9 with a 52.3% win rate across 1,114 appearances. Maya1 (released June 2025) reaches Elo 1050.6 with a 50.5% win rate across 2,852 appearances.

Both are open-weight models with moderate sample sizes, and their confidence intervals overlap significantly with Kokoro’s. As the arena collects more votes, these rankings could shift.
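
To make the overlap concrete, here is a small sketch that compares the intervals directly. Only Kokoro's ±9 is quoted for a specific model. The ±21 used for Voxtral TTS and Maya1 is an assumption borrowed from the figure given for "most newer models", and the helper names are hypothetical.

```python
def interval(elo: float, half_width: float) -> tuple[float, float]:
    """Return the (low, high) bounds of a symmetric confidence interval."""
    return elo - half_width, elo + half_width


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


kokoro = interval(1056.2, 9.0)    # +/-9 is the figure quoted for Kokoro
voxtral = interval(1055.9, 21.0)  # +/-21 is assumed, not reported per model
maya1 = interval(1050.6, 21.0)    # +/-21 is assumed, not reported per model

for name, other in [("Voxtral TTS", voxtral), ("Maya1", maya1)]:
    print(f"Kokoro vs {name}: overlap = {overlaps(kokoro, other)}")
```

Under these assumptions both comparisons overlap, which is why the three models cannot be ranked against each other with confidence.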

Fish Speech 1.5 vs Fish Audio S2 Pro

The gap between Fish Audio’s two open-weight models tells a story about how fast the field is moving:

Model	Elo	Win Rate	Release
Fish Audio S2 Pro	1128.7	61.0%	March 2026
Fish Speech 1.5	1011.9	49.1%	December 2024

A 117-point Elo jump in roughly 15 months. Fish Speech 1.5 was competitive when it launched; Fish Audio S2 Pro now outranks it by a wide margin. This kind of rapid improvement is common across the entire leaderboard — models released in 2024 consistently rank below their 2026 successors.
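
An Elo gap is easier to interpret as a head-to-head preference rate. The sketch below applies the standard Elo expectation formula to two gaps from the tables: the top two commercial models and the two Fish Audio models. It assumes the arena uses the conventional 400-point scale, and the function name is hypothetical.

```python
def implied_win_probability(elo_gap: float) -> float:
    """Expected preference rate for the higher-rated model, given its Elo lead."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))


gaps = {
    "Realtime TTS 1.5 Max vs Gemini 3.1 Flash TTS": 1209.6 - 1205.8,
    "Fish Audio S2 Pro vs Fish Speech 1.5": 1128.7 - 1011.9,
}

for pairing, gap in gaps.items():
    print(f"{pairing}: gap {gap:.1f} -> {implied_win_probability(gap):.1%}")
```

Under that assumption, the gap between the top two models works out to roughly 50.5%, close to a coin flip, while the Fish Audio gap implies the newer model would be preferred in roughly two out of three direct matchups. These are pairwise estimates and differ from the leaderboard win rates, which are measured against the whole pool of opponents.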

The Quality vs. Accessibility Trade-off

The leaderboard reveals a persistent tension in TTS: the best-sounding models are the least accessible, and the most accessible models trail in quality.

The top 10 models all require API access, per-character billing, and sending your text to external servers. The best open-weight model (Fish Audio S2 Pro) ranks 11th. Kokoro, which runs entirely in a browser, ranks 32nd.

But rankings only tell part of the story. Consider what Kokoro achieves with 82 million parameters:

It outperforms Google’s WaveNet (Elo 873.3), which powered Google Assistant for years
It beats Amazon Polly Neural (Elo 868.3), which processes billions of characters monthly
It surpasses XTTS v2 (Elo 885.9), which was the standard for open-source voice cloning

A browser tab running Kokoro produces speech that listeners prefer over some established enterprise API services that charge per character and require network connectivity.

Category Breakdowns: Where Models Shine

The arena evaluates across four categories, and models perform differently depending on the use case. Here are the top 3 open-weight models in each category:

Knowledge Sharing (long-form narration, articles, documentation)

Model	Elo	Win Rate
Fish Audio S2 Pro	1124.0	61.4%
Kokoro 82M v1.0	1065.8	57.1%
Maya1	1043.0	54.2%

Knowledge sharing is where Kokoro shines. Long-form, informative text plays to its strengths — clear pronunciation, consistent pacing, and natural prosody over extended passages.

Assistants (conversational, task-oriented)

Model	Elo	Win Rate
Fish Audio S2 Pro	1168.5	62.2%
Step Audio EditX	1116.0	57.1%
Kokoro 82M v1.0	1065.8	50.9%

Customer Service (professional, clear, neutral)

Model	Elo	Win Rate
Fish Audio S2 Pro	1158.4	58.2%
Step Audio EditX	1095.0	55.5%
Magpie-Multilingual 357M	1068.0	52.5%

Entertainment (dialogue, character voices, dramatic reading)

Model	Elo	Win Rate
Fish Audio S2 Pro	1073.1	60.9%
Step Audio EditX	1039.0	56.4%
Voxtral TTS	1024.0	50.2%

Entertainment is the hardest category for open-weight models. The top commercial models use much larger architectures with dedicated expressive training, and the Elo gap is largest here.

What the Rankings Miss

Elo ratings measure listener preference in direct comparison, not objective audio quality. A few things the leaderboard cannot capture:

Latency and speed. Kokoro generates speech at up to 96x real-time on GPU hardware. Some top-ranked models take seconds to produce their first audio token. If you need real-time streaming, a slightly lower Elo score might be worth the trade-off.

Privacy. Every closed-source model on this list requires sending your text to an external server. For healthcare, legal, or corporate use cases, that data transfer may violate compliance requirements. Offline TTS eliminates this risk entirely.

Cost at scale. ElevenLabs v3 costs $100 per 1M characters. Kokoro costs $0.65. At 100M characters per month — a reasonable volume for a content platform — that is $10,000 vs. $65. Self-hosting Kokoro is free.
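
The arithmetic behind that comparison is simple enough to check in a few lines. This sketch uses the per-million-character API prices from the tables and a hypothetical helper name. It ignores volume discounts, free tiers, and the hardware cost of self-hosting, none of which the leaderboard reports.

```python
def monthly_cost(chars_per_month: int, price_per_million: float) -> float:
    """API cost in USD for a monthly character volume at a flat per-1M price."""
    return chars_per_month / 1_000_000 * price_per_million


volume = 100_000_000  # 100M characters per month, the volume used above

for name, price in [("Eleven v3", 100.0), ("Kokoro 82M v1.0", 0.65)]:
    print(f"{name}: ${monthly_cost(volume, price):,.2f} per month")
```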

Language support. The arena currently evaluates only English. Kokoro supports 9 languages. Some top-ranked commercial models are English-only. If you need multilingual TTS, the rankings tell an incomplete story.

Browser deployment. Only Kokoro and a handful of other models can run entirely in a web browser. Most top-ranked models require server-side GPU inference. For web applications that must work offline or keep text on the user's device, in-browser TTS is a hard requirement, not a nice-to-have.

Choosing a TTS Model Based on the Rankings

The leaderboard helps, but the right model depends on your use case:

You need the highest possible quality and budget is not a concern. Use Inworld's Realtime TTS 1.5 Max (Elo 1210), Gemini 3.1 Flash TTS (Elo 1206), or Eleven v3 (Elo 1178). All are commercial APIs with per-character pricing.

You need open-weight, high-quality output. Fish Audio S2 Pro (Elo 1129) is the best open-weight model on the leaderboard. It requires GPU hardware for self-hosting.

You need TTS that runs in a browser or on-device. Kokoro 82M (Elo 1056) is the strongest browser-runnable model. It runs via WebGPU or WebAssembly with no server needed — try it free.

You need a balance of quality, speed, and cost. Kokoro at $0.65/1M characters via API (or free self-hosted) offers the best value on the leaderboard. Its Elo of 1056 beats every commercial model priced below $15/1M characters.

You need voice cloning. The arena does not test voice cloning — it evaluates default voices only. For voice cloning, see our voice cloning comparison.
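
The decision rules above amount to filtering the leaderboard by constraints and then taking the highest Elo that remains. Here is a minimal sketch using a few rows copied from the tables. The `Model` class, the `pick` helper, and the `runs_in_browser` flag are hypothetical names introduced for illustration, and the flag is set only for Kokoro because that is the only model described here as browser-runnable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    elo: float
    open_weights: bool
    price_per_million: float  # USD per 1M characters via API
    runs_in_browser: bool = False


# A handful of rows copied from the tables in this article.
MODELS = [
    Model("Realtime TTS 1.5 Max", 1209.6, False, 35.0),
    Model("Gemini 3.1 Flash TTS", 1205.8, False, 36.61),
    Model("Eleven v3", 1178.0, False, 100.0),
    Model("Fish Audio S2 Pro", 1128.7, True, 15.0),
    Model("Kokoro 82M v1.0", 1056.2, True, 0.65, runs_in_browser=True),
]


def pick(models, open_weights_only=False, browser_only=False,
         max_price=None) -> Optional[Model]:
    """Return the highest-Elo model that satisfies every requested constraint."""
    candidates = [
        m for m in models
        if (m.open_weights or not open_weights_only)
        and (m.runs_in_browser or not browser_only)
        and (max_price is None or m.price_per_million <= max_price)
    ]
    return max(candidates, key=lambda m: m.elo, default=None)


print(pick(MODELS).name)
print(pick(MODELS, open_weights_only=True).name)
print(pick(MODELS, browser_only=True).name)
```

With no constraints the sketch returns the overall leader. Requiring open weights returns Fish Audio S2 Pro, and requiring browser execution returns Kokoro, matching the recommendations above.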

The Trend: Open Source Is Catching Up

Grouping models by release period and comparing their current Elo scores reveals a trend:

Era	Best Open-Weight Elo	Best Commercial Elo	Gap
2023	879 (StyleTTS 2)	1102 (TTS-1)	223
2024	950 (OpenVoice v2)	1107 (ElevenLabs v2)	157
Early 2025	1006 (Chatterbox)	1134 (Speech 2.6 HD)	128
Mid 2025	1056 (Kokoro)	1170 (Eleven v3 pre)	114
Early 2026	1129 (Fish S2 Pro)	1210 (Inworld RT 1.5)	81

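The gap has narrowed from 223 Elo points in 2023 to 81 in early 2026. Over that span the best open-weight score rose by 250 points (879 to 1129), while the best commercial score rose by 108 (1102 to 1210), so open-weight models are improving faster. Fish Audio S2 Pro already matches the 10th-place Elo of 1128.7, so an open-weight model entering the top 10 within a year is plausible.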
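
The gap column can be recomputed from the two Elo columns, which also shows how unevenly the gap has closed. This sketch uses only the rounded figures from the table above, and the variable names are illustrative.

```python
# (era, best open-weight Elo, best commercial Elo), copied from the table.
rows = [
    ("2023", 879, 1102),
    ("2024", 950, 1107),
    ("Early 2025", 1006, 1134),
    ("Mid 2025", 1056, 1170),
    ("Early 2026", 1129, 1210),
]

# Gap between the best commercial and best open-weight model in each era.
gaps = [(era, commercial - open_weight) for era, open_weight, commercial in rows]

# Change in the gap from one era to the next (negative means it narrowed).
changes = [later - earlier for (_, earlier), (_, later) in zip(gaps, gaps[1:])]

for era, gap in gaps:
    print(f"{era}: gap {gap}")
print("Period-over-period change:", changes)
```

The gap shrinks in every period, but by anywhere from 14 to 66 points, and the periods are not equal in length. Any extrapolation "at the current rate" is therefore a rough estimate.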

Try the Leaderboard’s Top Browser Model

Kokoro ranks 32nd globally, but first among models that can run entirely in your browser. No API key, no server, no data leaving your device. Generate speech with 54 voices across 9 languages — free and offline.