
TTS Arena Leaderboard 2026: How Every Text-to-Speech Model Ranks — Open Source vs Commercial


Seventy-four text-to-speech models. Thousands of blind A/B comparisons. One question: which TTS engine actually sounds the best?

The Artificial Analysis Speech Arena ranks every major TTS model using an Elo-based rating system — the same methodology behind chess rankings and competitive gaming ladders. Human listeners hear two models side by side, pick the one that sounds more natural, and the scores adjust accordingly. No marketing claims, no cherry-picked demos. Just listener preference, aggregated at scale.

Here’s what the leaderboard reveals in May 2026, and what it means for anyone choosing a TTS engine.

How the Leaderboard Works

The Artificial Analysis Speech Arena uses a crowdsourced Elo rating system. Each comparison follows this process:

  1. A listener is presented with the same text spoken by two anonymous models
  2. The listener votes for the one that sounds more natural
  3. The Elo scores adjust — the winner gains points, the loser loses points
  4. Over thousands of comparisons, models settle into their true ranking
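The four steps above can be sketched with the textbook Elo update. The arena's exact parameters are not stated here, so the K-factor of 32 and the 400-point scale are conventional chess defaults assumed purely for illustration, and the function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return (new_winner, new_loser) after one vote.

    k=32 is an assumed K-factor, not the arena's documented value.
    """
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


# Two models starting level at 1000: the winner takes points from the loser.
new_winner, new_loser = update_elo(1000.0, 1000.0)
print(new_winner, new_loser)

# An upset (lower-rated model wins) moves the ratings further than an expected win.
upset_winner, upset_loser = update_elo(1000.0, 1200.0)
print(round(upset_winner, 1), round(upset_loser, 1))
```

Because the adjustment depends on how surprising the result was, a model gains little from beating a much weaker opponent and a lot from beating a stronger one, which is why ratings stabilize as votes accumulate.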

The system evaluates models across four categories (assistants, customer service, entertainment, and knowledge sharing) and two accents (US English and UK English). This matters because a model that excels at reading audiobooks might struggle with conversational dialogue.

The global leaderboard aggregates all votes regardless of category or accent. As of May 2026, the arena has collected data on 74 models; the models listed below each have between roughly 1,100 and 7,800 appearances.

The Top 10 Overall

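| Rank | Model | Creator | Elo | Win Rate | Appearances | Open Weights | Price/1M chars |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Realtime TTS 1.5 Max | Inworld | 1209.6 | 73.3% | 1,851 | No | $35 |
| 2 | Gemini 3.1 Flash TTS | Google | 1205.8 | 72.4% | 1,890 | No | $36.61 |
| 3 | Eleven v3 | ElevenLabs | 1178.0 | 68.9% | 3,753 | No | $100 |
| 4 | Inworld TTS 1 Max | Inworld | 1165.4 | 66.1% | 1,694 | No | $35 |
| 5 | Speech 2.8 HD | MiniMax | 1163.7 | 65.2% | 3,512 | No | $100 |
| 6 | Realtime TTS 1.5 Mini | Inworld | 1158.4 | 66.2% | 2,148 | No | $25 |
| 7 | Step TTS 2 | StepFun | 1149.1 | 64.6% | 1,341 | No | $40 |
| 8 | Speech 2.8 Turbo | MiniMax | 1146.7 | 64.0% | 3,666 | No | $60 |
| 9 | Speech 2.6 HD | MiniMax | 1133.5 | 62.1% | 3,425 | No | $100 |
| 10 | Speech 2.6 Turbo | MiniMax | 1128.7 | 61.3% | 3,748 | No | $60 |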

The top 10 is dominated by commercial API services. Inworld’s Realtime TTS 1.5 Max holds the crown with a 73.3% win rate across 1,851 appearances. Google’s Gemini 3.1 Flash TTS trails by less than 4 Elo points. ElevenLabs v3 — the most recognizable name in TTS — sits at number 3 with the largest sample size in the top 10 (3,753 appearances).

But here’s the key observation: every single model in the top 10 is closed-source and requires an API connection. No open-weight model cracks the top 10.

Open-Weight Models: The Standings

Open-weight models — those with publicly available model weights that anyone can download and run — are a different story. These are the engines you can run on your own hardware, in your browser, or on embedded devices. No API keys, no per-character billing, no data leaving your machine.

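| Rank (Overall) | Model | Creator | Elo | Win Rate | Appearances | Price/1M chars |
| --- | --- | --- | --- | --- | --- | --- |
| 11 | Fish Audio S2 Pro | Fish Audio | 1128.7 | 61.0% | 1,115 | $15 |
| 16 | Step Audio EditX | StepFun | 1104.9 | 58.7% | 1,110 | |
| 26 | Magpie-Multilingual 357M | NVIDIA | 1064.2 | 53.3% | 1,091 | |
| 32 | Kokoro 82M v1.0 | Kokoro | 1056.2 | 54.4% | 5,368 | $0.65 |
| 33 | Voxtral TTS | Mistral | 1055.9 | 52.3% | 1,114 | $16 |
| 35 | Maya1 | Maya Research | 1050.6 | 50.5% | 2,852 | |
| 51 | Fish Speech 1.5 | Fish Audio | 1011.9 | 49.1% | 4,918 | $15 |
| 52 | Chatterbox | Resemble AI | 1006.4 | 47.9% | 4,707 | $25 |
| 54 | Magpie-Multilingual 357M (older) | NVIDIA | 1001.9 | 45.2% | 3,563 | |
| 55 | Zonos v0.1 | Zyphra | 1000.0 | 47.1% | 4,842 | $20 |
| 57 | VibeVoice 7B | Microsoft | 959.7 | 38.1% | 1,649 | |
| 60 | OpenVoice v2 | OpenVoice | 949.9 | 44.0% | 7,834 | $8.33 |
| 66 | XTTS v2 | Coqui | 885.9 | 36.4% | 6,186 | $40.44 |
| 67 | StyleTTS 2 | StyleTTS | 878.8 | 37.4% | 5,725 | $2.82 |
| 74 | MetaVoice v1 | MetaVoice | 765.2 | 21.5% | 3,639 | |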

Fish Audio S2 Pro: The Open-Weight Leader

Fish Audio S2 Pro sits at rank 11 overall with an Elo of 1128.7, the highest of any open-weight model. Released in March 2026, it wins 61% of its matchups, and its rating places it ahead of several commercial offerings, including OpenAI’s TTS-1 (Elo 1101.6) and Google’s Studio model (Elo 1062.2).

Its strength shows in specific categories. In customer service scenarios, Fish Audio S2 Pro reaches Elo 1158 — competitive with top-10 commercial models. In knowledge sharing (long-form narration, articles, documentation), it scores 1124.

The catch: Fish Audio S2 Pro requires significant GPU resources for self-hosting. Running it locally means you need hardware capable of handling a large transformer model, which limits deployment options compared to lighter engines like Kokoro.

Kokoro 82M: Best Quality-to-Size Ratio

Kokoro 82M v1.0 ranks 32nd overall (Elo 1056.2, 54.4% win rate) — but that number understates its real-world impact. With 5,368 appearances, Kokoro has one of the largest sample sizes on the leaderboard, making its Elo score unusually stable (confidence interval ±9, compared to ±21 for most newer models).
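One rough way to see why more appearances give a tighter interval: sampling error typically shrinks with the square root of the sample size. The sketch below assumes that scaling and takes about 1,100 appearances as typical for the newer models (an assumption based on the appearance counts in the open-weight table). It is not the arena's actual interval calculation, and `scaled_interval` is a hypothetical helper.

```python
import math


def scaled_interval(ref_halfwidth: float, ref_n: int, n: int) -> float:
    """Scale a confidence half-width from ref_n samples to n samples,
    assuming the width is proportional to 1 / sqrt(n)."""
    return ref_halfwidth * math.sqrt(ref_n / n)


# Assumed reference point: +/-21 Elo at roughly 1,100 appearances.
kokoro_halfwidth = scaled_interval(21.0, 1_100, 5_368)
print(round(kokoro_halfwidth, 1))
```

Under these assumptions the estimate for Kokoro's 5,368 appearances lands between 9 and 10 Elo points, in line with the reported ±9.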

Kokoro’s standout trait is efficiency. At 82 million parameters, it runs entirely in a web browser via WebGPU or WebAssembly — no server, no GPU, no API key. Its $0.65 per 1M characters pricing (via API) is the lowest on the entire leaderboard, and self-hosting is free.

In specific categories, Kokoro performs differently:

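| Category | Accent | Elo | Win Rate |
| --- | --- | --- | --- |
| Knowledge sharing | All | 1065.8 | 57.1% |
| Assistants | All | 1065.8 | 50.9% |
| Customer service | US | 1135.4 | 46.0% |
| Knowledge sharing | US | 1096.0 | 56.7% |
| Entertainment | All | 976.0 | 49.5% |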

The pattern is clear: Kokoro excels at knowledge sharing — reading articles, documentation, educational content. It is competitive in assistants and customer service contexts. It is weaker in entertainment (dialogue, character voices, dramatic reading), where more expressive models dominate.

For a model with just 82 million parameters that runs in a browser tab, this performance profile is remarkable. Try Kokoro TTS in your browser; no installation is needed.

Voxtral TTS and Maya1: Newer Challengers

Voxtral TTS by Mistral and Maya1 by Maya Research are both recent additions. Voxtral (released March 2026) sits at Elo 1055.9 with a 52.3% win rate across 1,114 appearances. Maya1 (released June 2025) reaches Elo 1050.6 with a 50.5% win rate across 2,852 appearances.

Both are open-weight models with moderate sample sizes, and their confidence intervals overlap significantly with Kokoro’s. As the arena collects more votes, these rankings could shift.

Fish Speech 1.5 vs Fish Audio S2 Pro

The gap between Fish Audio’s two open-weight models tells a story about how fast the field is moving:

| Model | Elo | Win Rate | Release |
| --- | --- | --- | --- |
| Fish Audio S2 Pro | 1128.7 | 61.0% | March 2026 |
| Fish Speech 1.5 | 1011.9 | 49.1% | December 2024 |

A 117-point Elo jump in roughly 15 months. Fish Speech 1.5 was competitive when it launched; Fish Audio S2 Pro now outranks it by a wide margin. This kind of rapid improvement is common across the entire leaderboard — models released in 2024 consistently rank below their 2026 successors.
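To put Elo gaps in perspective, the standard Elo formula converts a rating difference into an expected head-to-head preference rate. This is a sketch assuming the conventional 400-point logistic scale, which may differ from the arena's exact model; `expected_win_rate` is a hypothetical helper name.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Expected share of votes for the higher-rated model (standard Elo curve)."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))


# Fish Audio S2 Pro (1128.7) vs Fish Speech 1.5 (1011.9)
fish_gap = 1128.7 - 1011.9
print(f"{fish_gap:.1f}-point gap -> {expected_win_rate(fish_gap):.1%}")

# Realtime TTS 1.5 Max (1209.6) vs Gemini 3.1 Flash TTS (1205.8)
top_gap = 1209.6 - 1205.8
print(f"{top_gap:.1f}-point gap -> {expected_win_rate(top_gap):.1%}")
```

Under this assumption, the gap between the two Fish Audio models corresponds to the newer one being preferred in roughly two of three direct matchups, while the gap between the top two models overall is close to a coin flip.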

The Quality vs. Accessibility Trade-off

The leaderboard reveals a persistent tension in TTS: the best-sounding models are the least accessible, and the most accessible models trail in quality.

The top 10 models all require API access, per-character billing, and sending your text to external servers. The best open-weight model (Fish Audio S2 Pro) ranks 11th. Kokoro, which runs entirely in a browser, ranks 32nd.

But rankings only tell part of the story. Consider what Kokoro achieves with 82 million parameters:

  • It outperforms Google’s WaveNet (Elo 873.3), which powered Google Assistant for years
  • It beats Amazon Polly Neural (Elo 868.3), which processes billions of characters monthly
  • It surpasses XTTS v2 (Elo 885.9), which was the standard for open-source voice cloning

A browser tab running Kokoro produces speech that listeners prefer over enterprise API services that charge per character and require network connectivity.

Category Breakdowns: Where Models Shine

The arena evaluates across four categories, and models perform differently depending on the use case. Here are the top 3 open-weight models in each category:

Knowledge Sharing (articles, documentation, educational content)

| Model | Elo | Win Rate |
| --- | --- | --- |
| Fish Audio S2 Pro | 1124.0 | 61.4% |
| Kokoro 82M v1.0 | 1065.8 | 57.1% |
| Maya1 | 1043.0 | 54.2% |

Knowledge sharing is where Kokoro shines. Long-form, informative text plays to its strengths — clear pronunciation, consistent pacing, and natural prosody over extended passages.

Assistants (conversational, task-oriented)

| Model | Elo | Win Rate |
| --- | --- | --- |
| Fish Audio S2 Pro | 1168.5 | 62.2% |
| Step Audio EditX | 1116.0 | 57.1% |
| Kokoro 82M v1.0 | 1065.8 | 50.9% |

Customer Service (professional, clear, neutral)

| Model | Elo | Win Rate |
| --- | --- | --- |
| Fish Audio S2 Pro | 1158.4 | 58.2% |
| Step Audio EditX | 1095.0 | 55.5% |
| Magpie-Multilingual 357M | 1068.0 | 52.5% |

Entertainment (dialogue, character voices, dramatic reading)

| Model | Elo | Win Rate |
| --- | --- | --- |
| Fish Audio S2 Pro | 1073.1 | 60.9% |
| Step Audio EditX | 1039.0 | 56.4% |
| Voxtral TTS | 1024.0 | 50.2% |

Entertainment is the hardest category for open-weight models. The top commercial models use much larger architectures with dedicated expressive training, and the Elo gap is largest here.

What the Rankings Miss

Elo ratings measure listener preference in direct comparison, not objective audio quality. A few things the leaderboard cannot capture:

Latency and speed. Kokoro generates speech at up to 96x real-time on GPU hardware. Some top-ranked models take seconds to produce their first audio token. If you need real-time streaming, a slightly lower Elo score might be worth the trade-off.

Privacy. Every closed-source model on this list requires sending your text to an external server. For healthcare, legal, or corporate use cases, that data transfer may violate compliance requirements. Offline TTS avoids the transfer altogether.

Cost at scale. ElevenLabs v3 costs $100 per 1M characters. Kokoro costs $0.65. At 100M characters per month — a reasonable volume for a content platform — that is $10,000 vs. $65. Self-hosting Kokoro is free.
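The arithmetic behind that comparison is simple enough to rerun for your own volume. A minimal sketch, using the per-million-character prices quoted above; `monthly_cost` is a hypothetical helper, and real invoices may include tiers or minimums not modeled here.

```python
def monthly_cost(chars_per_month: int, price_per_million: float) -> float:
    """API cost in USD for a monthly character volume."""
    return chars_per_month / 1_000_000 * price_per_million


volume = 100_000_000  # 100M characters per month, as in the example above
for name, price in [("Eleven v3", 100.00), ("Kokoro 82M (API)", 0.65)]:
    print(f"{name}: ${monthly_cost(volume, price):,.2f}")
```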

Language support. The arena currently evaluates only English. Kokoro supports 9 languages. Some top-ranked commercial models are English-only. If you need multilingual TTS, the rankings tell an incomplete story.

Browser deployment. Only Kokoro and a handful of other models can run entirely in a web browser. Most top-ranked models require server-side GPU inference. For web applications that must work offline or keep text on the user’s device, browser-based TTS is a hard requirement, not a nice-to-have.

Choosing a TTS Model Based on the Rankings

The leaderboard helps, but the right model depends on your use case:

You need the highest possible quality and budget is not a concern. Use Inworld Realtime TTS 1.5 Max (Elo 1210), Gemini 3.1 Flash TTS (Elo 1206), or Eleven v3 (Elo 1178). All are commercial APIs with per-character pricing.

You need open-weight, high-quality output. Fish Audio S2 Pro (Elo 1129) is the best open-weight model on the leaderboard. It requires GPU hardware for self-hosting.

You need TTS that runs in a browser or on-device. Kokoro 82M (Elo 1056) is the strongest browser-runnable model. It runs via WebGPU or WebAssembly with no server needed — try it free.

You need a balance of quality, speed, and cost. Kokoro at $0.65/1M characters via API (or free self-hosted) offers the best value on the leaderboard. Its Elo of 1056 beats every commercial model priced below $15/1M characters.

You need voice cloning. The arena does not test voice cloning — it evaluates default voices only. For voice cloning, see our voice cloning comparison.
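The decision rules above boil down to filtering by hard constraints and then taking the highest Elo among what remains. The sketch below illustrates that with a handful of models and figures copied from this article's tables; the `Model` class, the `pick` function, and the `runs_in_browser` flags (set from how each model is described here) are illustrative assumptions, not part of the arena.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    elo: float
    open_weights: bool
    runs_in_browser: bool  # as described in this article, not verified independently
    price_per_million: Optional[float]  # USD; None if no API price is listed


# Figures copied from the tables in this article (May 2026 snapshot).
MODELS = [
    Model("Realtime TTS 1.5 Max", 1209.6, False, False, 35.00),
    Model("Gemini 3.1 Flash TTS", 1205.8, False, False, 36.61),
    Model("Eleven v3", 1178.0, False, False, 100.00),
    Model("Fish Audio S2 Pro", 1128.7, True, False, 15.00),
    Model("Kokoro 82M v1.0", 1056.2, True, True, 0.65),
]


def pick(models, open_weights=False, browser=False, max_price=None):
    """Return the highest-Elo model that satisfies every stated constraint."""
    candidates = [
        m for m in models
        if (not open_weights or m.open_weights)
        and (not browser or m.runs_in_browser)
        and (max_price is None
             or (m.price_per_million is not None and m.price_per_million <= max_price))
    ]
    return max(candidates, key=lambda m: m.elo, default=None)


print(pick(MODELS).name)                     # no constraints
print(pick(MODELS, open_weights=True).name)  # must be open-weight
print(pick(MODELS, browser=True).name)       # must run in a browser
```

Treating constraints as filters rather than as weights keeps the choice honest: a model that cannot meet a hard requirement is excluded no matter how high its Elo is.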

The Trend: Open Source Is Catching Up

Looking at the release dates and Elo scores, a trend emerges:

| Era | Best Open-Weight Elo | Best Commercial Elo | Gap |
| --- | --- | --- | --- |
| 2023 | 879 (StyleTTS 2) | 1102 (TTS-1) | 223 |
| 2024 | 950 (OpenVoice v2) | 1107 (ElevenLabs v2) | 157 |
| Early 2025 | 1006 (Chatterbox) | 1134 (Speech 2.6 HD) | 128 |
| Mid 2025 | 1056 (Kokoro) | 1170 (Eleven v3 pre) | 114 |
| Early 2026 | 1129 (Fish S2 Pro) | 1210 (Inworld RT 1.5) | 81 |

The gap has narrowed from 223 Elo points in 2023 to 81 in early 2026, so open-weight models have gained ground faster than commercial ones over this period. Fish Audio S2 Pro already matches the Elo of the 10th-ranked model (1128.7), so an open-weight model could plausibly enter the top 10 within a year.
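The gap column can be reproduced directly from the two Elo columns. The sketch below recomputes it and adds an average change per era step; the eras are unevenly spaced, so that average is only a rough indicator and not a forecast.

```python
# (era, best open-weight Elo, best commercial Elo), copied from the table above
eras = [
    ("2023", 879, 1102),
    ("2024", 950, 1107),
    ("Early 2025", 1006, 1134),
    ("Mid 2025", 1056, 1170),
    ("Early 2026", 1129, 1210),
]

gaps = [commercial - open_weight for _, open_weight, commercial in eras]
print(gaps)

# Change in the gap between consecutive eras (negative means the gap is closing).
steps = [later - earlier for earlier, later in zip(gaps, gaps[1:])]
average_step = sum(steps) / len(steps)
print(average_step)
```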

Try the Leaderboard’s Top Browser Model

Kokoro ranks 32nd globally, but first among models that can run entirely in your browser. No API key, no server, no data leaving your device. Generate speech with 54 voices across 9 languages — free and offline.

