Fullduplex/
the verticals · v05 / 17 · #elevenlabs #tts · 8 sections · 7 figures

ElevenLabs: why a TTS company is priced at $11B.

In February 2026, a two-person London startup closed a $500M Series D at $11B post-money. A TTS company sat at the top of the voice-AI valuation stack. This piece walks through the structure behind that fact in order: founders, hypothesis, product, customers, counterargument.

verticals · v05 of 17 · subject profile
A TTS company priced at $11B. A two-person London startup sits at the top of the voice-AI valuation stack because its voice model feeds two flywheels at once — consumer depth and developer breadth — on a single shared spine.
subject: ElevenLabs · London · founded 2022 · ~$11B post-money · $330M ARR · 1M+ users

1. What the $11B price tag contains

On February 4, 2026, ElevenLabs closed a $500M Series D at $11B post-money. Sequoia Capital led, with Andrew Reed joining the board. Andreessen Horowitz and ICONIQ took super pro-rata positions, and Lightspeed, Evantic, and BOND came in new. Cumulative funding since 2022 now stands at $781M across five rounds.

The number alone is hard to read, so place it next to the other voice-AI rounds from the same eight weeks of Q1 2026. Deepgram raised a $1.3B Series C in January, Decagon $4.5B the same month, Parloa $3B, and Sierra was already at $10B as of September 2025. On the April 2026 private voice-AI valuation board, ElevenLabs sits at the top of the model layer and lands within ten percent of Sierra, the leading vertical integrator. Put plainly, the market is pricing ElevenLabs as a platform, not as a component.

Lead investor Andrew Reed summed up the round in one line: “ElevenLabs is defining the future of voice AI” (Sequoia Capital).

One more number underwrites that valuation. Year-end 2025 annual recurring revenue (ARR) was roughly $330M. In the first quarter of 2026 alone, net-new ARR added more than $100M. Deutsche Telekom, MasterClass, Better, Klarna, and Revolut all moved customer-facing voice onto ElevenLabs’ voice agents. Next to the same-quarter vertical integrators, that slope is the steepest single revenue curve in the voice-AI category.

fig.f1 · funding trajectory
[chart placeholder · valuation (log) vs time, 2023-2026: A $19M · B $80M · Growth $180M · D $500M ($11B); comparables: Sesame B $250M (>$1B) · Cartesia A $27M · Kyutai non-profit, €300M research grant]
Figure F1. In the 32 months between the $2M pre-seed of 2023 and the February 2026 Series D of $500M at $11B, ElevenLabs multiplied its valuation roughly 110x. Sesame’s $250M Series B is the nearest consumer-voice comparable.

2. Founders — two Poles in London, irritated by lektor dubbing

ElevenLabs was founded in 2022 by two Polish engineers who had been friends since university.

Mati Staniszewski (CEO) came from Palantir, where he was a deployment strategist — embedded with a large enterprise customer, integrating Palantir on the customer’s data stack, running the project from inside. Five years of customer implementation rather than product. Piotr Dąbkowski (CTO) came from Google Research as a machine learning engineer, with an Oxford engineering bachelor’s and a Cambridge computer science master’s. He is also the original author of Js2Py. A research-capable ML engineer who writes his own code.

Their origin story is unusually personal for a voice-AI startup. Poland uses a distinctive dubbing format for foreign films called “lektor.” A single male narrator reads every line of dialogue in a monotone over the original soundtrack, regardless of the character’s gender or age.

“It’s a horrible experience.”— Mati Staniszewski, on Polish lektor dubbing (Sifted)

Both founders grew up with it. Their parents and grandparents lived with the same format. The zero-to-one of ElevenLabs sits at the intersection of a personal irritation and the research progress of early-2020s neural speech synthesis.

The labs ElevenLabs is measured against (Kyutai, the OpenAI voice team, DeepMind Gemini Live, Alibaba DAMO) are research organizations with dozens of senior PhDs. ElevenLabs is not. A deployment person and an ML engineer, as a pair, shipped Eleven Multilingual v2 from London with no research team at founding. Four years later they are valued at $11B. What is unusual is not the size of the org. It is the shape.

Picking London as the base has two side effects. First, they were inside the GDPR perimeter from day one, which carries directly into the sales motion for European regulated industries like Deutsche Telekom and Revolut. Second, the time zone. A Bay Area startup has to call Klarna or Revolut procurement at dawn local time. ElevenLabs can call them inside regular business hours.

“Voice will be one of the primary interfaces.”— Mati Staniszewski (Sequoia Training Data)
fig.f2 · dual-flywheel architecture
[diagram placeholder · voice model (Multilingual v2 → v3 → Scribe) as the shared spine; consumer-depth side: dashboard / clone / dubbing, 1M+ registered users, researcher-obsession feedback; developer-breadth side: API + ElevenAgents + SIP, Retell / Vapi / Bland / BPO / medical, horizontal + vertical embed; user feedback and integrator feedback both flow back into the model. One voice model feeds both flywheels.]
Figure F2. Consumer-depth on the left (1M+ dashboard users, cloning, dubbing, Projects) and Developer-breadth on the right (API, ElevenAgents, SIP) both feed the shared voice model at the center, which ships improvements back to both sides.

3. Hypothesis — the dual-flywheel bet

Here is the piece’s hypothesis in one sentence.

ElevenLabs’ bet is to build a model-layer voice company as a dual flywheel: (a) a download-style product used deeply by consumers, and (b) a platform API embedded broadly by developers, with both flywheels spinning on the same single voice model.

The musical metaphor: polish one instrument so that both the concert performer (creators on the consumer surface) and the background-music pipeline feeding public spaces (embeds through developers) play on it. One instrument. The more hands play it, the better it gets.

Most 2024 voice-AI startups built the two sides separately or leaned on one of them. On one pole sit pure TTS API companies (Play.ht, WellSaid, Resemble — developers only). On the other sit the agent-stack companies (Retell, Vapi, Bland — they borrow someone else’s TTS). ElevenLabs did neither.

Point 1 — Consumer-depth leg

ElevenLabs’ dashboard (the logged-in browser app) is where creators, translators, educators, and podcasters use voice cloning, dubbing, and Projects directly. More than one million registered users push on the product monthly and surface the model’s weaknesses as real user feedback: consumers using the model with enough invested that its flaws hurt. That information loop is one a standalone TTS API company cannot replicate.

Point 2 — Developer-breadth leg

The same voice model ships as an API and gets embedded into horizontal integrators (Retell, Vapi, Bland, and ElevenLabs’ own Agents platform), BPO operations, medical scribes, games, education. Through 2023 and 2024, if you tapped a startup with voice in it, the backend was usually ElevenLabs. Behind that API sits the same model being polished by Point 1.

Point 3 — The shared voice model

Multilingual v2 (2023) → Turbo v2.5 / Flash v2.5 (2024, ~75ms latency) → Eleven v3 Alpha (June 5, 2025, with inline emotion tags like [laughs], [whispers], [sighs]) → Scribe STT (February 2025) → Scribe v2 Realtime (January 2026, 150ms). Four years, one engineering organization, one lineage. Feedback from both sides is integrated into a single model, and the integrated model pushes improvements back to both sides.

“The intonation, the emotions, even the imperfections make a voice special.”— Mati Staniszewski (Pigment Perspectives)

4. Product — four years of voice primitives, staged into a platform in fifteen months

The launch of Conversational AI on November 18, 2024 (renamed ElevenAgents during 2025) is the pivot that moved ElevenLabs from “TTS API company” to “voice agent platform.” Fifteen months later, the $11B Series D closed.

ElevenAgents is built as a cascade pipeline. Cascade means stringing STT → LLM → TTS in series, distinct from the joint audio-language approach of Moshi or GPT-4o Realtime. The cooking metaphor: a cascade is a coursed meal, where different chefs prepare appetizer, main, and dessert in order; a joint model is one chef running the whole tasting menu in parallel. In ElevenAgents, the STT is the in-house Scribe, the LLM is chosen by the customer (BYO LLM), and the TTS is Eleven v3. In mid-2025, Conversational AI 2.0 swapped simple VAD-based turn-taking for a classifier that reads filler words, prosody, and rhythm, and added three pacing modes: Eager, Normal, Patient.
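The cascade shape is simple enough to sketch. A minimal, illustrative Python version follows; the function names, the pacing thresholds, and the `end_of_turn` heuristic are all assumptions for illustration, not ElevenLabs’ actual implementation:

```python
# Minimal cascade sketch: STT -> LLM -> TTS in series, plus a pacing
# heuristic for turn-taking. All names and threshold values here are
# illustrative assumptions, not ElevenLabs' published internals.

PACING_HOLD_MS = {"eager": 120, "normal": 300, "patient": 700}  # assumed values

def end_of_turn(silence_ms: int, pacing: str = "normal") -> bool:
    """Treat a pause as end of turn once it exceeds the pacing threshold."""
    return silence_ms >= PACING_HOLD_MS[pacing]

def cascade_turn(audio_chunk: bytes, stt, llm, tts) -> bytes:
    """One conversational turn through the series pipeline.

    stt: speech -> text  (e.g. an in-house STT such as Scribe)
    llm: text -> text    (the BYO LLM slot, chosen by the customer)
    tts: text -> speech  (the voice model)
    """
    transcript = stt(audio_chunk)   # stage 1: transcription
    reply_text = llm(transcript)    # stage 2: reasoning / response
    return tts(reply_text)          # stage 3: synthesis
```

With stub callables, `cascade_turn(b"pcm", lambda a: "hi", str.upper, str.encode)` returns `b"HI"`. The point of the shape is that each stage is swappable independently, which is exactly what BYO LLM exploits.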

Scribe matters for two reasons. One is performance: at launch on February 26, 2025, Scribe posted a 7.7% WER on the Artificial Analysis benchmark, taking the top spot at the time. It beat Whisper v3 and Deepgram Nova-3. The other is structural: with an in-house STT, ElevenLabs can tune end-to-end cascade latency without depending on an outside vendor.
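For readers unfamiliar with the metric: a 7.7% WER means roughly 8 word-level errors (substitutions, deletions, insertions) per 100 reference words. A minimal, self-contained sketch of the standard word-level Levenshtein computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference
    words, computed as a word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the cat sit")` gives 1/3: one substitution over three reference words.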

The other important move was the Iconic Voice Marketplace launched on November 11, 2025. From living icons like Michael Caine, Liza Minnelli, Art Garfunkel, and Al Joyner, to archival rights holders for Judy Garland, Maya Angelou, Mark Twain, John Wayne, Laurence Olivier, and Amelia Earhart, 28 licensed voices in total are available to brands and creators. Every voice has explicit rights-holder consent, usage is compensated by volume, and usage history is traceable. The most operationally advanced instance to date of the opt-in economy substrate.

We should be honest about this. ElevenLabs also carries the category’s largest regulatory and ethical exposure, because the same voice model can clone a speaker from 3 to 10 seconds of reference audio. In 2023, Instant Voice Clone required 30 seconds. By 2026, Mistral Voxtral is at three seconds and Microsoft MAI-Voice-1 at ten. The SAG-AFTRA 2023-2025 contract codifying performer consent, and California AB 1836 / AB 2602, are the social response to that technical reality. Cloning and the Marketplace opt-in substrate live on the same platform.

fig.f3 · 2024-2026 inflection timeline
[timeline placeholder · 2024-11 CAI launch · 2025-02 Scribe STT · 2025-06 v3 Alpha ([laughs] [whispers]) · 2025-07 CAI 2.0 turn-taking · 2025-11 Iconic Voice (28 voices) · 2026-01 Scribe v2 Realtime (150ms) · 2026-02 Series D $11B. From TTS API company to $11B voice platform in fifteen months.]
Figure F3. Six product events span the fifteen months between the Conversational AI launch and the Series D: Scribe STT, v3 Alpha inline tags, CAI 2.0 turn-taking, Iconic Voice Marketplace, and Scribe v2 Realtime.

5. Customers — reading enterprise distribution by name

Among the four horizontal integrators (ElevenLabs, Retell, Vapi, Bland), ElevenLabs carries a markedly thicker logo roster. That is because the Developer-breadth in Point 2 is backed and differentiated by the Consumer-depth in Point 1. Representative accounts:

  • Deutsche Telekom: one of Europe’s largest telecom operators. Partnership announced fall 2025.
  • Epic Games / Fortnite: ElevenLabs generates the voice of Darth Vader for an in-game conversational experience, with rights secured from the James Earl Jones estate.
  • Revolut: after rolling out multilingual ElevenAgents, reported an 8x reduction in ticket resolution time.
  • Klarna: 10x reduction in Time to Resolution across ~35M US users.
  • Square, Ukrainian Government, MasterClass, Better: named on the Series D customer roster.
  • IBM watsonx Orchestrate: the partnership announced March 25, 2026 embeds ElevenLabs’ TTS and STT into IBM’s enterprise agentic-AI stack. 70 languages, 10,000+ voices, PCI / HIPAA / Zero Retention Mode / data residency support.
“ElevenLabs gives watsonx premium voice out of the box.”— IBM watsonx lead (IBM Newsroom)

The procurement read: even with the option to build voice in-house available, the buyer is paying to skip the selection process itself. In the fifteen months after the Conversational AI launch, ElevenLabs accumulated a class of logos no standalone TTS API company could land. Revolut and Klarna chose ElevenLabs because they wanted a single vendor to close out the voice-agent stack.

The developer surface supports the same hypothesis. ElevenAgents ships native Twilio integration, generic SIP trunking across Telnyx, Vonage, RingCentral, Sinch, Infobip, Exotel, Plivo, and Bandwidth, and codec support for G.711 8kHz and G.722 16kHz. Pricing is $0.10/min (Creator / Pro), $0.08/min (Business annual), custom Enterprise. One pricing primitive no one else in the category ships: a silence discount (silent intervals billed at 5% of the per-minute rate), which leans the agent toward patiently waiting for the user to finish.
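The silence discount is easy to make concrete. A hedged sketch using the published per-minute rates; the 5% silence factor comes from the article, while the function shape is illustrative rather than a published billing formula:

```python
def call_cost(total_min: float, silent_min: float,
              rate_per_min: float = 0.10,       # Creator / Pro rate from the article
              silence_factor: float = 0.05) -> float:
    """Cost of one call under a silence discount: active minutes bill at the
    full per-minute rate, silent minutes at silence_factor * rate.
    Illustrative sketch, not ElevenLabs' actual billing code."""
    active_min = total_min - silent_min
    return active_min * rate_per_min + silent_min * rate_per_min * silence_factor
```

A ten-minute call with four silent minutes bills 6 × $0.10 + 4 × $0.005 = $0.62 instead of $1.00, which is the economic nudge toward letting the user finish.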

fig.f4 · enterprise customer map
Customer · Sector · Region · Reported impact
Deutsche Telekom · Telecom · EU (DE) · Voice agent integration
Revolut · Fintech · EU (UK) · Ticket resolution 8x faster
Klarna · Fintech · US · Time to Resolution 10x faster
Epic Games / Fortnite · Gaming · US · Darth Vader conversational experience
Ukrainian Government · Public · EU (UA) · Not disclosed
MasterClass / Better · Consumer / PropTech · US · Named at Series D
IBM watsonx Orchestrate · Enterprise AI · US / Global · 70 languages / 10k+ voices
Cisco (via IBM) · Enterprise IT · US / Global · Through IBM integration
Figure F4. A cross-section spanning European regulated industries (telecom, fintech), gaming, government, and enterprise AI: a class of logos no standalone voice API company could land, accumulated within fifteen months.

6. Counterargument — if OpenAI Realtime commoditizes, does the ElevenLabs model advantage survive?

Here we take the strongest counter head on.

The counter: “OpenAI Realtime API and Google Gemini Live are built on joint audio-language architectures and are structurally faster than a cascade. On top of that, hyperscaler distribution will drive pricing toward free. The TTS quality gap has also closed: as of March 2026, Inworld TTS-1.5 Max sits at #1 on Artificial Analysis Quality ELO, and ElevenLabs Multilingual v2 has fallen to #7. Sesame CSM, Orpheus, Kokoro, F5-TTS, and Cartesia Sonic all converged into the same quality band. The 1.0 MOS lead ElevenLabs had in 2023 is now compressed to 0.1–0.2 MOS in 2026. ElevenLabs is selling a shrinking-differential business at $11B.”

“voice-AI valuations are now a bet on distribution, not on TTS quality alone.”— Matt Turck, Firstmark (mattturck.com)

The response is two-step.

First, the ELO ranking measures the narrowest possible definition of voice-model moat. Artificial Analysis Quality ELO is based on single-utterance English blind A/B tests: read one line, vote which one sounds more human. That framing cannot see depth across 29 languages; the performative range Eleven v3 gives through inline emotion tags; emotional-range cloning from a three-second reference; the rights-cleared Iconic Voice catalog; internal cascade latency tuned through Scribe; or the compliance and residency posture required to pass enterprise procurement. If Klarna, Revolut, and IBM bought on single-utterance blind tests alone, they would be buying Inworld. They are not.

Second, ElevenLabs has already opened an exit from cascade on the product surface side. The components of ElevenAgents (Scribe, turn-taking classifier, voice library, telephony integrations, enterprise sales motion) are all backend-architecture independent. If a Moshi-family or GPT-Realtime-family joint audio-language backend needs to be swapped in later, the existing surface stays in place. The Swiss Army knife analogy: the blade (the backend) is replaceable, but the grip (the product surface) does not change shape. Mati’s self-description as “audio general intelligence” at the Series D announcement signals exactly that intention.

fig.f5 · ELO vs product-surface axes
[two-panel placeholder. Left, AA Quality ELO (2026-03), single-utterance English blind test: #1 Inworld TTS-1.5 Max, #2 Hume, #3 Cartesia Sonic, #4 Sesame CSM, #5 F5-TTS, #6 Orpheus, #7 ElevenLabs Multilingual v2; the narrowest moat definition. Right, what ELO does not measure: 1. depth across 29 languages, 2. inline audio-tag performance, 3. 3-second emotional reference clone, 4. Iconic Voice rights-cleared catalog, 5. Scribe + cascade latency tuning, 6. enterprise compliance / residency. What Klarna / Revolut / IBM actually buy sits on the right; the moat is the wider side.]
Figure F5. AA Quality ELO ranking (March 2026, ElevenLabs Multilingual v2 at #7) against the six product-surface axes ELO does not measure. Klarna, Revolut, and IBM clear procurement on the right-hand set, not the single-utterance blind test on the left.

7. Landscape — two connection points

Finally, we set this profile on the map of the main STS series. Two load-bearing connections matter.

The first is Article 10 (consent / licensing). The Iconic Voice Marketplace is the most operationally advanced instance to date of what Article 10 calls the “substrate of the opt-in economy.” At the same time, the platform’s three-second cloning capability sits directly on the regulatory front line. That dual-use posture places ElevenLabs on both sides of the consent debate simultaneously within the category.

The second is Article 06 (foundation before vertical). Mati’s self-description as “audio general intelligence” runs head-on into Article 06’s timing claim that “the foundation threshold for full-duplex STS has not yet been crossed.” ElevenLabs is recognized as a TTS foundation, but it does not yet have a joint audio-language model. The $11B valuation is market evidence that investors are betting the gap will be closed.

The Fullduplex.ai connection follows. Like the shift from cassette tape to streaming, where the main battleground moved from the songs to the rights plumbing of the distribution network, the shift from TTS to full-duplex will move the battleground from voice models to conversational data infrastructure. If a foundation TTS company prices at $11B, there is matching economic room for a foundation full-duplex data company.

fig.f6 · 2x2 landscape
[2x2 placeholder. Horizontal axis: foundation model owner vs no foundation / orchestrator only; vertical axis: agent surface yes vs no. Cells: ElevenLabs (TTS foundation + Agents, dual flywheel, $11B, $330M ARR); Bland (own TTS/ASR); Retell (orchestrator, OpenAI + EL); Vapi (orchestrator, OpenAI RT); Kyutai / Moshi (open-weights STS, no product); Cartesia (TTS API, Play.ht tier); OpenAI / Google / hyperscaler (Realtime API / Gemini Live, bundled, pricing to near-zero).]
Figure F6. Horizontal axis: foundation model ownership. Vertical axis: agent surface ownership. ElevenLabs is the sole occupant of the top-left cell (foundation plus agent).

8. What to watch — three signals

Three signals over the next five years will tell us whether ElevenLabs’ bet is landing.

Signal 1: whether ElevenLabs ships a joint audio-language backend in-house, acquires one, or routes to an external one. The existing product surface is architecture-independent and a backend swap is drop-in. The “audio general intelligence” self-description points to the in-house path. Whether the product cadence confirms or contradicts that will be the answer.

Signal 2: whether the Iconic Voice Marketplace stays a single-company primitive or becomes a category-wide shared substrate. Article 10’s opt-in economy thesis cannot graduate from investment thesis to operational reality on one company’s product-level consent substrate alone. The split-point is whether ElevenLabs opens the Marketplace as an API or an open standard.

Signal 3: the timing of Agents-specific ARR disclosure. The $330M figure reported today is company-wide, and the Agents share is not disclosed. When an Agents ARR number stands alone, either at a Series E or through an IPO window, the “foundation plus integrator hybrid” story will be confirmed as revenue reality rather than narrative.

fig.f7 · three signals panel
[panel placeholder. Signal 1: joint audio-language backend shipping path (in-house / M&A / route?); verifies “audio general intelligence”; watch product cadence. Signal 2: Iconic Voice Marketplace goes category-wide (API / open standard?); Article 10 opt-in substrate turns real; watch the open-substrate move. Signal 3: Agents-specific ARR disclosure timing (Series E / IPO window?); foundation + integrator hybrid revenue reality; watch IR disclosure.]
Figure F7. Three signals that will score ElevenLabs’ bet over the next five years. Joint audio-language backend path. Iconic Voice opening. Agents-specific ARR disclosure.

ElevenLabs’ four-year bet condenses to one point: voice primitives outcompound the orchestration layer. The 2026 evidence is consistent with that bet landing. The rest is a question of time, and the three signals above.

Investor data room access. Fullduplex.ai is building the full-duplex conversational data and evaluation infrastructure that a foundation-plus-integrator voice-AI platform will call on when it scales a joint audio-language backend to enterprise grade. Contact hello@fullduplex.ai.