Models.
A directory of speech-to-speech, full-duplex, and audio-foundation-model entries — open weights, hosted APIs, voice-agent platforms, and agent frameworks — grouped by what they are and how they close the conversation loop. Each card tags the kind (speech-LM, realtime API, platform…), the FD mode (native vs cascaded stack), and the surface you consume it through (API, model, SDK, framework). Something missing or wrong? Open the channel.
Counter definitions. open weights = openWeights === true. native full-duplex = fdMode === "native". All four counters cover preview + current + legacy entries but exclude deprecated, so the headline does not inflate as older snapshots are retired.
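In code, the counter logic is a one-liner per metric. A minimal TypeScript sketch, assuming an entry record shaped like the definitions above (the `Entry` type and helper names are illustrative, not the site's actual schema):

```typescript
// Illustrative entry shape; field names follow the counter definitions above.
type Status = "preview" | "current" | "legacy" | "deprecated";

interface Entry {
  openWeights: boolean;
  fdMode: "native" | "stack";
  status: Status;
}

// Deprecated entries are excluded, so the headline never inflates
// as older snapshots are retired.
const counted = (entries: Entry[]): Entry[] =>
  entries.filter((e) => e.status !== "deprecated");

const openWeightsCount = (entries: Entry[]): number =>
  counted(entries).filter((e) => e.openWeights === true).length;

const nativeFullDuplexCount = (entries: Entry[]): number =>
  counted(entries).filter((e) => e.fdMode === "native").length;
```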
Model × capability at a glance
Twelve capability axes scored from the canonical model fields and tags — the five language columns track the world's most-spoken tongues (EN / ZH / HI / ES / FR), plus a "Persona" column for models that support voice-prompt or role-prompt conditioning. Rows are split into Full-duplex native and Others; the FD-native band opens by default, and clicking a group header folds it. Hover a column for its definition; click a row to jump to the detailed card below.
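A capability cell can be derived the same way the counters are. A hedged sketch of how a column like Persona could fall out of an entry's tag list (the helpers are ours; only the tag strings come from the cards below):

```typescript
// A cell is "on" when any of the column's defining tags appear on the entry.
const hasCapability = (tags: string[], defining: string[]): boolean =>
  defining.some((tag) => tags.includes(tag));

// Persona column: voice-prompt or role-prompt conditioning, per the
// definition above. Tag names are taken from the cards in this directory.
const personaCell = (tags: string[]): boolean =>
  hasCapability(tags, ["#persona", "#voice-prompt"]);

// e.g. PersonaPlex's tags from its card:
personaCell(["#full-duplex", "#open-weights", "#persona", "#voice-prompt"]); // true
```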
Latency vs release date
First-response latency (log scale) on the Y axis, release date on the X. Open-weights models are filled orange; proprietary APIs are dark grey. Dashed outlines mark frameworks and research-only entries, dashed rings flag preview snapshots, and half-opacity dots flag legacy lines that have been superseded. The default view shows end-to-end full-duplex native models only.
Latency provenance. Values mix vendor claims, paper reports, third-party evaluations, and demo measurements — they are not normalised to a single trace or workload. Read the chart as order-of-magnitude rather than a head-to-head ranking.
Native STS · 2024 – 2026
Native STS only — one lane for end-to-end full-duplex models. Pre-2024 entries are collapsed to a single chip on the left edge. Use All models for the 3-lane view.
Full-duplex speech-LMs
End-to-end models that ingest and emit audio in the same network. Open weights, native barge-in.
Covo-Audio-Chat / Covo-Audio-Chat-FD
7B end-to-end audio LLM with a native full-duplex variant — tri-modal speech-text interleaving and dual-stream listen-and-speak.
preview · FD · native · open weights · open model · research only · Tencent research license (non-commercial)
#full-duplex #open-weights #speech-lm #zh-en #research #non-commercial
Paper: 'Covo-Audio Technical Report' (arXiv 2602.09823, Feb 2026). Three releases share the same 7B backbone (Qwen2.5-7B + Whisper): Covo-Audio (base), Covo-Audio-Chat (spoken-dialogue variant, open-sourced), and Covo-Audio-Chat-FD (native low-latency full-duplex variant, benchmarked against Freeze-Omni / MiMo-Audio / Step-Audio 2 on URO-Bench). The model card labels the weights as 'research and experimental purposes only,' so treat them as non-commercial until a formal license is published.
MiniCPM-o 4.5
9B on-device omni-modal LLM with full-duplex speech + vision streaming, targeting Gemini 2.5 Flash quality on a laptop.
FD · native · open weights · open model · license unclear · Apache-2.0 (code) / MiniCPM Model License (weights)
#full-duplex #open-weights #speech-lm #omni-modal #on-device #vision #tdm
End-to-end multimodal model that processes continuous video + audio streams while generating text + speech in parallel — none of the four streams block each other. Uses a time-division-multiplex (TDM) full-duplex mechanism with an interleaved speech-token decoder. Ships with llama.cpp-omni / WebRTC demos for local deployment on a MacBook (M4 Max / 24 GB) or a small GPU. Covers 30+ languages via Qwen backbone.
PersonaPlex
7B full-duplex speech LM with hybrid voice + text prompts — any role, any voice, on top of Moshi.
FD · native · open weights · open model · 170 ms first-response · research only · MIT (code) / NVIDIA Open Model License (weights)
#full-duplex #open-weights #speech-lm #persona #voice-prompt #moshi-family
Fine-tunes Kyutai's Moshi stack (Mimi codec @ 24 kHz, Helium LM backbone) with a hybrid conditioning path: a voice prompt captures timbre/style and a text prompt pins role, facts, and scenario. Trained on 1,217 h of Fisher English dyads plus ~2,250 h of synthetic assistant / customer-service dialogues rendered with Chatterbox TTS. NVIDIA reports ~170 ms average latency vs cascaded baselines on FullDuplexBench-style evaluations. Weights are hosted as `nvidia/personaplex-7b-v1` on Hugging Face; code is MIT, weights use the NVIDIA Open Model License.
trained on: Fisher English
SALMONN-omni
First standalone codec-free full-duplex speech LLM — no audio tokens in the vocabulary, RL-trained for barge-in.
FD · native · open weights · open model · commercial ok · Apache-2.0
#full-duplex #open-weights #speech-lm #codec-free #research
Codec-free architecture: streaming speech encoder → LLM backbone (no codec tokens in its vocab) → streaming speech synthesizer, synchronised through embeddings and a dynamic 'thinking' mechanism. Reports a 30–36% relative improvement over Moshi, Freeze-Omni, Qwen2.5-Omni, GLM-4-Voice, VITA-1.5, MiniCPM-o, Kimi-Audio, and Baichuan-Audio on open-domain spoken QA with substantially less training data. First FD speech LLM to apply RL to turn-taking / context-dependent barge-in. Published at NeurIPS 2025; weights released on the SALMONN org.
Step-Audio 2 mini
End-to-end multimodal audio LLM for speech conversation with expressive TTS and tool / search integration.
FD · native · open weights · open model · commercial ok · Apache-2.0
#full-duplex #open-weights #speech-lm #zh-en #tool-use #search
Open-weights release of the mini and mini-Base checkpoints (initialized from Qwen2-Audio / Qwen2.5-7B). Step-Audio 2 handles paralinguistic cues (laughter, sighs, accent, dialect), supports dynamic voice switching, and ships with a vLLM backend plus a hosted StepFun realtime console. The `mini Think` variant released in 2025-09 adds chain-of-thought reasoning for speech. Covered as a full-duplex baseline in the Covo-Audio and MiMo-Audio technical reports.
Kimi-Audio
A 7B audio foundation model covering ASR, audio question answering, and end-to-end voice chat in one net.
FD · native · open weights · open model · commercial ok · Modified MIT
#full-duplex #open-weights #audio-foundation #unified
Hybrid audio input (continuous acoustic vectors + discrete semantic tokens) feeds an LLM with parallel text and audio output heads. Pre-trained on 13M hours of audio. Ships with an evaluation toolkit, Kimi-Audio-Evalkit.
LLaMA-Omni 2
Qwen2.5-based real-time spoken chatbot series (0.5B–14B) with autoregressive streaming speech synthesis.
turn-taking · open weights · open model · 583 ms first-response · non-commercial · Apache-2.0 (code) / Non-commercial (weights)
#open-weights #speech-lm #streaming-tts #research #non-commercial
ACL 2025 main-conference paper. Speech encoder → Qwen2.5-0.5B/1.5B/3B/7B/14B/32B-Instruct backbone → autoregressive streaming speech decoder (CosyVoice 2 tokenizer + flow-matching vocoder). 7B variant reports total first-audio latency ≈580 ms with chunk size 3 / 10. Trained on only 200K multi-turn speech dialogues yet matches GLM-4-Voice for spoken QA. Turn-based rather than native full-duplex; interrupt / barge-in is not part of the released model.
evaluated on: VoiceBench · trained on: InstructS2S-200K
Sesame CSM
The conversational speech model everyone called "uncanny-valley-free."
FD · native · open weights · open model · 200 ms first-response · commercial ok · Apache-2.0
#full-duplex #open-weights #prosody #consumer
A Llama-backbone model with a smaller audio decoder that emits Mimi codec frames. The 1B checkpoint is public; a larger 8B variant powers the hosted demo. Strong at laughs, sighs, and non-verbal cues.
GLM-4-Voice
An end-to-end Chinese/English speech LM that exposes tone, dialect, and emotion as instructions.
FD · native · open weights · open model · commercial ok · Apache-2.0
#full-duplex #open-weights #speech-lm #zh-en
A three-part stack: a Whisper-based VQ tokenizer (12.5 tok/s), a 9B LM fine-tuned on speech tokens, and a Flow-Matching decoder that starts streaming audio after as few as 10 tokens. One of the few open speech-LMs with strong Chinese coverage.
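A back-of-envelope consequence of those two numbers (our arithmetic, not a figure from the model card): at 12.5 tokens per second of speech, the 10-token threshold corresponds to under a second of audio represented before the decoder starts emitting.

```latex
\frac{10~\text{tokens}}{12.5~\text{tokens/s}} = 0.8~\text{s of speech before the first streamed sample}
```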
Moshi
A 7B speech-text foundation model that generates user and assistant audio streams in parallel.
FD · native · open weights · open model · 160 ms first-response · commercial ok · Apache-2.0 (code) / CC-BY-4.0 (weights)
#full-duplex #open-weights #speech-lm #low-latency
Moshi ingests user audio and emits its own audio simultaneously through an RQ-Transformer stacked on the Mimi streaming codec. It can think, back-channel, and interrupt while the other side is still talking. Theoretical latency is 160 ms; measured latency is ≈200 ms on an L4 GPU.
trained on: Fisher English
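The 160 ms theoretical figure decomposes cleanly if you take Mimi's 12.5 Hz frame rate at face value. A back-of-envelope split in the spirit of the Moshi paper (one audio frame plus the codec's acoustic delay), offered as a reading rather than a measurement:

```latex
\underbrace{80~\text{ms}}_{\text{one 12.5 Hz Mimi frame}} + \underbrace{80~\text{ms}}_{\text{Mimi acoustic delay}} = 160~\text{ms theoretical latency}
```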
Realtime assistant APIs
Hosted speech-to-speech APIs from the big labs. Pay-per-minute, closed weights, first-response latencies from roughly 260 ms to 1 s.
Gemini 3.1 Flash Live Preview
Current Google Live API target — preview snapshot with proactive-audio, native thinking, and tighter tool-call latency on top of 3.1 Flash.
preview · FD · native · proprietary · API · 260 ms first-response · commercial ok
#full-duplex #proprietary #vision+voice #native-audio #preview #gemini
Snapshot name: `gemini-3.1-flash-live-preview`. Preview-tier successor to Gemini 2.5 Live, rolled out as the recommended Live API target alongside the Gemini 3.1 Flash base model. Reported improvements include tighter first-response latency, proactive-audio triggering, and more stable tool / function-calling through long sessions. Still labelled preview, so naming and pricing may move before GA.
OpenAI Realtime (gpt-realtime)
Canonical GA speech-to-speech stack on the Realtime API — gpt-realtime (current snapshot gpt-realtime-2025-08-28) and gpt-realtime-mini as a lower-latency companion.
FD · native · proprietary · API · 310 ms first-response · commercial ok
#full-duplex #proprietary #realtime-api #webrtc #sip #current
`gpt-realtime` is the canonical current model on the OpenAI Realtime API; the current pinned snapshot is `gpt-realtime-2025-08-28`. `gpt-realtime-mini` is a separate lower-latency / lower-cost companion, not a version bump. Features across the family include Cedar & Marin voices, remote MCP servers, SIP phone calling, image input, and async function calling, over WebRTC / WebSocket / SIP transports with a 32k-token context.
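A minimal WebSocket hello against the Realtime API. A sketch assuming Node's `ws` package; the endpoint, auth header, and event names follow OpenAI's published Realtime docs at the time of writing, but snapshots and event names have shifted before, so check the current reference before shipping:

```typescript
import WebSocket from "ws";

// Open a Realtime session pinned to the canonical model.
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime",
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } },
);

ws.on("open", () => {
  // A bare response.create asks the server to start an audio turn.
  ws.send(JSON.stringify({ type: "response.create" }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Audio arrives as base64-encoded PCM deltas; route them to playback.
  if (event.type === "response.output_audio.delta") {
    // event.delta → audio sink
  }
});
```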
Gemini 2.5 Live
Legacy native-audio live API — still available but Google has announced gemini-3.1-flash-live-preview as the successor.
Legacy — vendor recommends migrating to Gemini 3.1 Flash Live Preview.
legacy · FD · native · proprietary · API · 300 ms first-response · commercial ok
#full-duplex #proprietary #vision+voice #native-audio #legacy
Gemini 2.5 Flash with native audio output went GA on Vertex AI in December 2025 (snapshot: `gemini-2.5-flash-native-audio-preview-12-2025`). A single model replaces the STT→LLM→TTS pipeline, with improved barge-in in noisy rooms and affective dialog that reacts to the speaker's emotion. Superseded by `gemini-3.1-flash-live-preview` in March 2026 — 2.5 remains available for backward compatibility but new integrations should target 3.1.
Amazon Nova 2 Sonic
Bedrock-hosted GA speech-to-speech model — polyglot voices, 1M-token context, async tool calls, telephony-ready.
FD · native · proprietary · API · 400 ms first-response · commercial ok
#full-duplex #proprietary #realtime-api #bedrock #telephony #polyglot #current
Successor to Nova Sonic (Apr 2025); GA on Amazon Bedrock in Dec 2025 via model ID `amazon.nova-2-sonic-v1:0`. Adds Portuguese and Hindi, polyglot voices that keep identity across languages, configurable turn-taking (low / medium / high pause sensitivity), cross-modal input, asynchronous tool calling, and a 1M-token context window. Integrates with Amazon Connect, Vonage, Twilio, AudioCodes, LiveKit, and Pipecat. Artificial Analysis reports median TTFA ≈0.4 s and higher speech-reasoning accuracy than GPT-Realtime and Gemini 2.5 Flash Live on common benchmarks.
evaluated on: Big Bench Audio
Grok Voice Agent API
OpenAI-Realtime-compatible voice agent API — 20+ languages, flat $0.05/min, ships with web/iOS testers and LiveKit plugin.
FD · native · proprietary · API · 1000 ms first-response · commercial ok
#full-duplex #proprietary #realtime-api #webrtc #openai-compatible #current
Launched 2025-12-17 as the developer-facing spin-out of the Grok Voice stack used in the xAI mobile apps and Tesla vehicles. xAI trained its own VAD, tokenizer, and audio models in-house, and exposes them via a WebSocket endpoint at `wss://api.x.ai/v1/realtime` that is wire-compatible with the OpenAI Realtime API. Reports ≈1 s average time-to-first-audio and #1 on Big Bench Audio as of launch. Supports web search / x_search / file_search / custom tools, LiveKit plugin, and ephemeral browser-side client secrets.
evaluated on: Big Bench Audio
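If the wire-compatibility claim holds, the gpt-realtime sketch above should port by swapping host and key, with the event loop unchanged. A sketch only; we have not verified which subset of Realtime events xAI implements:

```typescript
import WebSocket from "ws";

// Same Realtime event protocol, different endpoint, per xAI's claim of
// wire compatibility with the OpenAI Realtime API.
const ws = new WebSocket("wss://api.x.ai/v1/realtime", {
  headers: { Authorization: `Bearer ${process.env.XAI_API_KEY}` },
});
```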
Hume EVI 3
An empathic speech-to-speech foundation model that infers user affect and tunes its tone to match.
FD · native · proprietary · API · 400 ms first-response · commercial ok
#full-duplex #proprietary #emotion #custom-voice
First speech-LM to speak expressively with any voice, real or designed, without fine-tuning. Supports 200K+ custom voices, 30-second voice clones, and pluggable LLM backends (Claude 4, Gemini 2.5, Kimi K2, custom).
read the long-form: 'Hume AI: the smile inside a sentence, and the nine days that clarified voice AI's exit shapes'
Voice-agent platforms
Commercial platforms that wire STT+LLM+TTS together with telephony, RAG, and observability.
ElevenLabs Agents
The TTS incumbent's turnkey agent platform — turn-taking, RAG, and telephony pre-wired.
FD · stack · proprietary · API · 450 ms first-response · commercial ok
#platform #proprietary #turnkey #telephony #rag
Conversational AI 2.0 (rebranded as ElevenLabs Agents) ships SOTA turn-taking that reads backchannels like "um" / "ah," multi-character voices in a single agent, RAG, HIPAA mode, and SIP trunking. 5,000+ voices across 31 languages.
read the long-form: 'ElevenLabs: why a TTS company is priced at $11B'
Deepgram Voice Agent
STT+TTS incumbent's agent API using Nova-3 / Flux for transcription and Aura-2 for voice.
FD · stack · proprietary · API · 450 ms first-response · commercial ok
#platform #proprietary #asr #tts
Pairs Deepgram's Nova-3 / Flux STT (with explicit turn detection) and Aura-2 TTS with pluggable LLMs including GPT-5.4 and Gemini. Flux's end-of-turn model is aimed squarely at sub-second back-and-forth latency.
Retell AI
Low-latency voice agent API built for call-centre workloads and SIP carriers.
FD · stack · proprietary · API · 500 ms first-response · commercial ok
#platform #proprietary #telephony #b2b
Tight SIP and Twilio integration, with batteries-included call recording, post-call analytics, and outbound campaigns. One of the more battle-tested commercial stacks for production voice agents.
Bland
Turnkey voice-agent platform for inbound and outbound telephony with web agents and an API.
FD · stack · proprietary · API · 400 ms first-response · commercial ok
#platform #proprietary #telephony #b2b #outbound
Phone-native voice-agent platform with inbound / outbound orchestration, web-embedded agents, pathways (node-based conversation logic), and a documented API. Often compared directly against Vapi, Retell, and Deepgram Voice Agent for production call deployments.
Vapi
Developer-first orchestration layer for inbound/outbound voice agents with BYO models.
FD · stack · proprietary · API · 500 ms first-response · commercial ok
#platform #proprietary #telephony #byo-models
Billed at $0.05/min of orchestration on top of your own STT/LLM/TTS/telephony keys. Two-tier building model — Assistants (single prompt) and Squads (multi-agent). Claims sub-500 ms infra latency at 99.99% uptime.
Open agent frameworks
Self-hosted libraries for assembling voice agents with swappable STT / LLM / TTS blocks.
Kyutai Unmute
Open reference stack that gives any text LLM ears and a voice via delayed streams modelling.
FD · stack · open src · framework · commercial ok · MIT (code) / CC-BY-4.0 (weights)
#framework #open-source #streaming #low-latency
Wraps a text LLM with Kyutai STT (1B EN/FR with 0.5 s delay or 2.6B EN with 2.5 s delay, both with semantic VAD) and Kyutai TTS 1.6B. Rust and Python implementations, Docker Compose deploy.
LiveKit Agents
Open agent framework on top of the LiveKit WebRTC mesh — 50+ provider plugins, first-class FD.
FD · stack · open src · framework · commercial ok · Apache-2.0
#framework #open-source #webrtc #orchestration
Since v1.5 it ships an adaptive interruption ML model (86% precision vs VAD baselines), dynamic endpointing, preemptive generation, and per-turn latency metrics out of the box. Integrates with 50+ STT/LLM/TTS providers including gpt-realtime.
HF speech-to-speech
Modular reference pipeline that cascades VAD → ASR → LLM → TTS for local voice agents.
turn-taking · open src · framework · commercial ok · MIT
#framework #open-source #cascade #local
Hugging Face's reference cascaded stack (Silero VAD → Whisper / Lightning-Whisper-MLX → any HF LLM → Parler / MeloTTS / ChatTTS / Kokoro). Not full-duplex out of the box — it uses a turn-taking loop, sketched below — but often the fastest way to prototype an open-source voice agent end to end. Apple-silicon optimised.
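That turn-taking loop is worth seeing in miniature. A deliberately hypothetical TypeScript sketch (the `Vad` / `Asr` / `Llm` / `Tts` interfaces are ours, not the repo's actual Python APIs) showing why a cascade is turn-based: every hop awaits the previous one, so the agent cannot listen while it speaks.

```typescript
type Frame = Float32Array;

// Hypothetical stage interfaces for a VAD → ASR → LLM → TTS cascade.
interface Vad { push(frame: Frame): Frame | null; } // full utterance once speech ends
interface Asr { transcribe(utterance: Frame): Promise<string>; }
interface Llm { complete(text: string): Promise<string>; }
interface Tts { synthesize(text: string): Promise<Frame>; }

async function turnLoop(
  mic: AsyncIterable<Frame>,
  vad: Vad,
  asr: Asr,
  llm: Llm,
  tts: Tts,
  play: (audio: Frame) => Promise<void>,
): Promise<void> {
  for await (const frame of mic) {
    const utterance = vad.push(frame);
    if (utterance === null) continue; // still listening

    // Each hop blocks on the previous one, and playback blocks the mic
    // loop entirely. That serialization is what makes a cascade
    // turn-based rather than full-duplex: no barge-in without extra work.
    const text = await asr.transcribe(utterance);
    const reply = await llm.complete(text);
    await play(await tts.synthesize(reply));
  }
}
```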
Pipecat
Python framework for voice agents with swappable ASR / LLM / TTS and sane defaults for turn-taking.
FD · stack · open src · framework · commercial ok · BSD-2
#framework #open-source #orchestration #python
Wires a WebRTC transport (typically Daily) to a chain of ASR / LLM / TTS blocks. Increasingly the default pipeline for teams who want full control over every hop without writing their own media server.
Audio-LLMs & omni-models
Models whose primary job is understanding audio, not driving a conversation. Some emit speech too.
Qwen3-Omni
Natively end-to-end omni-LLM (text / image / audio / video in, text + speech out).
FD · native · open weights · open model · commercial ok · Apache-2.0
#audio-llm #open-weights #omni #streaming
MoE Thinker–Talker architecture with an Audio Transformer encoder pre-trained on 20M hours of speech. Claims SOTA on 22/36 audio/video benchmarks; ASR and voice chat on par with Gemini 2.5 Pro. Speech output in 10 languages.
Qwen2.5-Omni
End-to-end multimodal Qwen (text / image / audio / video in, text + speech out) with streaming response.
Legacy — vendor recommends migrating to Qwen3-Omni.
legacy · turn-taking · open weights · open model · commercial ok · Apache-2.0
#audio-llm #open-weights #omni #streaming
The 2.5-generation omni model, predecessor to Qwen3-Omni. Thinker–Talker split that interleaves text reasoning with streaming speech output. Listed here because it is still the reference for any paper comparing against the Qwen omni line.
Ultravox
Lightweight audio adapter that bolts onto Llama / Qwen to give an LLM ears without retraining.
turn-taking · open weights · open model · commercial ok · MIT
#audio-llm #open-weights #adapter #whisper
Cross-attends a Whisper-class speech encoder into an existing text LLM. Popular when you want to bolt voice onto an in-house base model without running a separate ASR stage.
TTS & voice generation
Speech synthesis APIs and open TTS weights. Feeds every cascaded stack above.
Qwen3-TTS
Open streaming TTS successor to CosyVoice with voice cloning, voice design, and 10-language coverage.
not conversational · open weights · open model · commercial ok · Apache-2.0
#tts #open-weights #multilingual #zeroshot #streaming
Part of the 2026 wave of open expressive TTS, alongside MOSS-TTS. Streams tokens to audio with voice cloning, prompt-based voice design, and cross-lingual speaker transfer. Positioned as the replacement for CosyVoice 2 in the Qwen ecosystem.
MOSS-TTS
Real-time streaming TTS family with voice cloning, multi-speaker dialogue, and sound generation.
not conversational · open weights · open model · commercial ok · Apache-2.0
#tts #open-weights #streaming #dialogue
Open TTS family from the OpenMOSS group at Fudan. Covers real-time streaming synthesis, zero-shot voice cloning, multi-speaker dialogue generation, and ambient sound generation in one codebase — modernises the open TTS drawer alongside Qwen3-TTS.
Cartesia Sonic 3
State-space TTS with ~90 ms time-to-first-audio and realistic non-verbal sounds.
not conversational · proprietary · API · 90 ms first-response · commercial ok
#tts #proprietary #low-latency #websocket
WebSocket TTS with bidirectional streaming, context continuations for prosody, and laughter / breathing synthesis. Shipped by the Mamba / SSM team; the TTS of choice for sub-second voice agents.
read the long-form: 'Cartesia: why AWS put a non-transformer voice AI on its own shelf'
CosyVoice 2
Open expressive TTS with strong zero-shot voice cloning across Japanese, Chinese, and English.
not conversational · open weights · open model · commercial ok · Apache-2.0
#tts #open-weights #multilingual #zeroshot
Half of Alibaba's open voice stack, paired with SenseVoice for ASR. The canonical pipeline is SenseVoice → LLM → CosyVoice. Strong zero-shot cloning with ~3 s of reference audio.
Speech-to-speech translation
Direct speech translation — one language in, another out, with or without text intermediate.
Hibiki
On-device high-fidelity simultaneous speech-to-speech translation — Moshi-style dual-stream for FR → EN, with voice transfer.
FD · native · open weights · open model · commercial ok · CC-BY-4.0 (weights) / Apache-2.0 & MIT (code)
#s2st #translation #open-weights #streaming #on-device #moshi-family
Hibiki reuses Moshi's multistream architecture to jointly model source and target audio, producing text + audio tokens at 12.5 Hz while the user is still speaking. Two checkpoints: Hibiki 2B (≈2.7B with the depth transformer, 16 RVQ per stream) and Hibiki-M 1B for on-device smartphone inference. Currently French → English only. Inference code for PyTorch (CUDA), Rust (CUDA), MLX (macOS), and MLX-Swift (iOS). Open-sourced under CC-BY-4.0 for weights.
SeamlessM4T v2
Direct speech-to-speech translation across 100+ languages, with weights in the open.
not conversational · open weights · open model · non-commercial · CC-BY-NC-4.0
#s2st #translation #open-weights #multilingual
Meta shipped SeamlessExpressive and SeamlessStreaming as companion drops for prosody transfer and streaming translation. Commercial use requires either training a derivative model or a separate arrangement with Meta.
What the taxonomy is telling us.
- gap · 01
Native × English-first
Every native full-duplex speech-LM in the table is primarily English or Chinese. Native open speech-LMs for the other high-speaker-count languages are still unclaimed territory — whoever ships the first one sets the reference for that language.
- gap · 02
Platforms ≠ speech-LMs
Most "voice AI" vendors live in the platform drawer, not the speech-LM drawer. That means the full-duplex behaviour you feel is usually crafted by a turn-taking model on top of a cascade — which is why benchmark coverage of that layer (Talking Turns, Full-Duplex-Bench) matters. - gap · 03
Where the work is heading
The interesting frontier is vertical voice agents on top of these platforms — healthcare, support, field operations. The Observatory will add that vertical drawer as enough public entries appear.
Keep reading — three ways out of the directory.
The observatory is deliberately triangular. Models are where you land; the benchmarks and datasets pages are where you verify the claim, and the blog is where you get the story behind the name.
- B · evaluations
See how these models are measured.
25 native benchmarks for speech-to-speech, full-duplex, and audio foundation work — each entry links back to the model cards that post a score.
open benchmarks
- D · corpora
See what they were trained on.
14 frontier corpora on the 2024–26 full-duplex frontier, plus the classical read/dialog sets. Every row lists its license, hours, and channel-separation status.
open datasets
- A · essays & dispatches
Read the long-form.
The Verticals series profiles the labs, companies, and maintainers behind these entries — Kyutai, Sesame, Cartesia, Hume, ElevenLabs, Meta FAIR Speech, Alibaba DAMO, and more.
open the blog
Spotted a model we're missing? submit an entry