Models.
A directory of speech-to-speech, full-duplex, and audio-foundation-model entries — open weights, hosted APIs, voice-agent platforms, and agent frameworks — grouped by what they are and how they close the conversation loop. Each card tags the kind (speech-LM, realtime API, platform…), the FD mode (native vs cascaded stack), and the surface you consume it through (API, model, SDK, framework). Something missing or wrong? Open the channel.
Counter definitions. open weights = openWeights === true. native full-duplex = fdMode === "native". All four counters cover preview + current + legacy entries but exclude deprecated, so the headline does not inflate as older snapshots are retired.
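In code, the counter logic is a one-liner per metric. A minimal TypeScript sketch, assuming an entry record shaped like the definitions above (the `Entry` type and helper names are illustrative, not the site's actual schema):

```typescript
// Illustrative entry shape; field names follow the counter definitions above.
type Status = "preview" | "current" | "legacy" | "deprecated";

interface Entry {
  openWeights: boolean;
  fdMode: "native" | "stack";
  status: Status;
}

// Deprecated entries are excluded, so the headline never inflates
// as older snapshots are retired.
const counted = (entries: Entry[]): Entry[] =>
  entries.filter((e) => e.status !== "deprecated");

const openWeightsCount = (entries: Entry[]): number =>
  counted(entries).filter((e) => e.openWeights === true).length;

const nativeFullDuplexCount = (entries: Entry[]): number =>
  counted(entries).filter((e) => e.fdMode === "native").length;
```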
Model × capability at a glance
Twelve capability axes scored from the canonical model fields and tags — the five language columns track the world's most-spoken tongues (EN / ZH / HI / ES / FR), plus a "Persona" column for models that support voice-prompt or role-prompt conditioning. Rows are split into Full-duplex native and Others; the FD-native band opens by default, and clicking a group header folds it. Hover a column for its definition; click a row to jump to the detailed card below.
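A capability cell can be derived the same way the counters are. A hedged sketch of how a column like Persona could fall out of an entry's tag list (the helpers are ours; only the tag strings come from the cards below):

```typescript
// A cell is "on" when any of the column's defining tags appear on the entry.
const hasCapability = (tags: string[], defining: string[]): boolean =>
  defining.some((tag) => tags.includes(tag));

// Persona column: voice-prompt or role-prompt conditioning, per the
// definition above. Tag names are taken from the cards in this directory.
const personaCell = (tags: string[]): boolean =>
  hasCapability(tags, ["#persona", "#voice-prompt"]);

// e.g. PersonaPlex's tags from its card:
personaCell(["#full-duplex", "#open-weights", "#persona", "#voice-prompt"]); // true
```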
Latency vs release date
First-response latency (log scale) on the Y axis, release date on the X. Open-weights models are filled orange; proprietary APIs are dark grey. Dashed outlines mark frameworks and research-only entries, dashed rings flag preview snapshots, and half-opacity dots flag legacy lines that have been superseded. The default view shows end-to-end full-duplex native models only.
Latency provenance. Values mix vendor claims, paper reports, third-party evaluations, and demo measurements — they are not normalised to a single trace or workload. Read the chart as order-of-magnitude rather than a head-to-head ranking.
Native STS · 2024 – 2026
Native STS only — one lane for end-to-end full-duplex models. Pre-2024 entries are collapsed to a single chip on the left edge. Use All models for the 3-lane view.
Full-duplex speech-LMs
End-to-end models that ingest and emit audio in the same network. Open weights, native barge-in.
Covo-Audio-Chat / Covo-Audio-Chat-FD
7B end-to-end audio LLM with a native full-duplex variant — tri-modal speech-text interleaving and dual-stream listen-and-speak.
preview · FD · native · open weights · open model · research only · Tencent research license (non-commercial)
#full-duplex #open-weights #speech-lm #zh-en #research #non-commercial
Paper: 'Covo-Audio Technical Report' (arXiv 2602.09823, Feb 2026). Three releases share the same 7B backbone (Qwen2.5-7B + Whisper): Covo-Audio (base), Covo-Audio-Chat (spoken-dialogue variant, open-sourced), and Covo-Audio-Chat-FD (native low-latency full-duplex variant, benchmarked against Freeze-Omni / MiMo-Audio / Step-Audio 2 on URO-Bench). The model card labels the weights as 'research and experimental purposes only,' so treat them as non-commercial until a formal license is published.
MiniCPM-o 4.5
9B on-device omni-modal LLM with full-duplex speech + vision streaming, targeting Gemini 2.5 Flash quality on a laptop.
FD · native · open weights · open model · license unclear · Apache-2.0 (code) / MiniCPM Model License (weights)
#full-duplex #open-weights #speech-lm #omni-modal #on-device #vision #tdm
End-to-end multimodal model that processes continuous video + audio streams while generating text + speech in parallel — none of the four streams block each other. Uses a time-division-multiplex (TDM) full-duplex mechanism with an interleaved speech-token decoder. Ships with llama.cpp-omni / WebRTC demos for local deployment on a MacBook (M4 Max / 24 GB) or a small GPU. Covers 30+ languages via Qwen backbone.
PersonaPlex
7B full-duplex speech LM with hybrid voice + text prompts — any role, any voice, on top of Moshi.
FD · native · open weights · open model · 170 ms first-response · research only · MIT (code) / NVIDIA Open Model License (weights)
#full-duplex #open-weights #speech-lm #persona #voice-prompt #moshi-family
Fine-tunes Kyutai's Moshi stack (Mimi codec @ 24 kHz, Helium LM backbone) with a hybrid conditioning path: a voice prompt captures timbre/style and a text prompt pins role, facts, and scenario. Trained on 1,217 h of Fisher English dyads plus ~2,250 h of synthetic assistant / customer-service dialogues rendered with Chatterbox TTS. NVIDIA reports ~170 ms average latency vs cascaded baselines on FullDuplexBench-style evaluations. Weights are hosted as `nvidia/personaplex-7b-v1` on Hugging Face; code is MIT, weights use the NVIDIA Open Model License.
trained on: Fisher English
SALMONN-omni
First standalone codec-free full-duplex speech LLM — no audio tokens in the vocabulary, RL-trained for barge-in.
FD · native · open weights · open model · commercial ok · Apache-2.0
#full-duplex #open-weights #speech-lm #codec-free #research
Codec-free architecture: streaming speech encoder → LLM backbone (no codec tokens in its vocab) → streaming speech synthesizer, synchronised through embeddings and a dynamic 'thinking' mechanism. Reports a 30–36% relative improvement over Moshi, Freeze-Omni, Qwen2.5-Omni, GLM-4-Voice, VITA-1.5, MiniCPM-o, Kimi-Audio, and Baichuan-Audio on open-domain spoken QA with substantially less training data. First FD speech LLM to apply RL to turn-taking / context-dependent barge-in. Published at NeurIPS 2025; weights released on the SALMONN org.
Step-Audio 2 mini
End-to-end multimodal audio LLM for speech conversation with expressive TTS and tool / search integration.
FD · native · open weights · open model · commercial ok · Apache-2.0
#full-duplex #open-weights #speech-lm #zh-en #tool-use #search
Open-weights release of the mini and mini-Base checkpoints (initialized from Qwen2-Audio / Qwen2.5-7B). Step-Audio 2 handles paralinguistic cues (laughter, sighs, accent, dialect), supports dynamic voice switching, and ships with a vLLM backend plus a hosted StepFun realtime console. The `mini Think` variant released in 2025-09 adds chain-of-thought reasoning for speech. Covered as a full-duplex baseline in the Covo-Audio and MiMo-Audio technical reports.
Kimi-Audio
A 7B audio foundation model covering ASR, audio question answering, and end-to-end voice chat in one net.
FD · native · open weights · open model · commercial ok · Modified MIT
#full-duplex #open-weights #audio-foundation #unified
Hybrid audio input (continuous acoustic vectors + discrete semantic tokens) feeds an LLM with parallel text and audio output heads. Pre-trained on 13M hours of audio. Ships with an evaluation toolkit, Kimi-Audio-Evalkit.
LLaMA-Omni 2
Qwen2.5-based real-time spoken chatbot series (0.5B–14B) with autoregressive streaming speech synthesis.
turn-taking · open weights · open model · 583 ms first-response · non-commercial · Apache-2.0 (code) / Non-commercial (weights)
#open-weights #speech-lm #streaming-tts #research #non-commercial
ACL 2025 main-conference paper. Speech encoder → Qwen2.5-0.5B/1.5B/3B/7B/14B/32B-Instruct backbone → autoregressive streaming speech decoder (CosyVoice 2 tokenizer + flow-matching vocoder). 7B variant reports total first-audio latency ≈580 ms with chunk size 3 / 10. Trained on only 200K multi-turn speech dialogues yet matches GLM-4-Voice for spoken QA. Turn-based rather than native full-duplex; interrupt / barge-in is not part of the released model.
evaluated on: VoiceBench · trained on: InstructS2S-200K
Sesame CSM
The conversational speech model everyone called "uncanny-valley-free."
FD · native · open weights · open model · 200 ms first-response · commercial ok · Apache-2.0
#full-duplex #open-weights #prosody #consumer
A Llama-backbone model with a smaller audio decoder that emits Mimi codec frames. The 1B checkpoint is public; a larger 8B variant powers the hosted demo. Strong at laughs, sighs, and non-verbal cues.
GLM-4-Voice
An end-to-end Chinese/English speech LM that exposes tone, dialect, and emotion as instructions.
FD · native · open weights · open model · commercial ok · Apache-2.0
#full-duplex #open-weights #speech-lm #zh-en
A three-part stack: a Whisper-based VQ tokenizer (12.5 tok/s), a 9B LM fine-tuned on speech tokens, and a Flow-Matching decoder that starts streaming audio after as few as 10 tokens. One of the few open speech-LMs with strong Chinese coverage.
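A back-of-envelope consequence of those two numbers (our arithmetic, not a figure from the model card): at 12.5 tokens per second of speech, the 10-token threshold corresponds to under a second of audio represented before the decoder starts emitting.

```latex
\frac{10~\text{tokens}}{12.5~\text{tokens/s}} = 0.8~\text{s of speech before the first streamed sample}
```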
Moshi
A 7B speech-text foundation model that generates user and assistant audio streams in parallel.
FD · native · open weights · open model · 160 ms first-response · commercial ok · Apache-2.0 (code) / CC-BY-4.0 (weights)
#full-duplex #open-weights #speech-lm #low-latency
Moshi ingests user audio and emits its own audio simultaneously through an RQ-Transformer stacked on the Mimi streaming codec. It can think, back-channel, and interrupt while the other side is still talking. Theoretical latency is 160 ms; measured latency is ≈200 ms on an L4 GPU.
trained on: Fisher English
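The 160 ms theoretical figure decomposes cleanly if you take Mimi's 12.5 Hz frame rate at face value. A back-of-envelope split in the spirit of the Moshi paper (one audio frame plus the codec's acoustic delay), offered as a reading rather than a measurement:

```latex
\underbrace{80~\text{ms}}_{\text{one 12.5 Hz Mimi frame}} + \underbrace{80~\text{ms}}_{\text{Mimi acoustic delay}} = 160~\text{ms theoretical latency}
```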
Realtime assistant APIs
Hosted speech-to-speech APIs from the big labs. Pay-per-minute, closed weights, first-response latencies from roughly 260 ms to 1 s.
Gemini 3.1 Flash Live Preview
Current Google Live API target — preview snapshot with proactive-audio, native thinking, and tighter tool-call latency on top of 3.1 Flash.
preview · FD · native · proprietary · API · 260 ms first-response · commercial ok
#full-duplex #proprietary #vision+voice #native-audio #preview #gemini
Snapshot name: `gemini-3.1-flash-live-preview`. Preview-tier successor to Gemini 2.5 Live, rolled out as the recommended Live API target alongside the Gemini 3.1 Flash base model. Reported improvements include tighter first-response latency, proactive-audio triggering, and more stable tool / function-calling through long sessions. Still labelled preview, so naming and pricing may move before GA.
OpenAI Realtime (gpt-realtime)
Canonical GA speech-to-speech stack on the Realtime API — gpt-realtime (current snapshot gpt-realtime-2025-08-28) and gpt-realtime-mini as a lower-latency companion.
FD · native · proprietary · API · 310 ms first-response · commercial ok
#full-duplex #proprietary #realtime-api #webrtc #sip #current
`gpt-realtime` is the canonical current model on the OpenAI Realtime API; the current pinned snapshot is `gpt-realtime-2025-08-28`. `gpt-realtime-mini` is a separate lower-latency / lower-cost companion, not a version bump. Features across the family include Cedar & Marin voices, remote MCP servers, SIP phone calling, image input, and async function calling, over WebRTC / WebSocket / SIP transports with a 32k-token context.
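A minimal WebSocket hello against the Realtime API. A sketch assuming Node's `ws` package; the endpoint, auth header, and event names follow OpenAI's published Realtime docs at the time of writing, but snapshots and event names have shifted before, so check the current reference before shipping:

```typescript
import WebSocket from "ws";

// Open a Realtime session pinned to the canonical model.
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime",
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } },
);

ws.on("open", () => {
  // A bare response.create asks the server to start an audio turn.
  ws.send(JSON.stringify({ type: "response.create" }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Audio arrives as base64-encoded PCM deltas; route them to playback.
  if (event.type === "response.output_audio.delta") {
    // event.delta → audio sink
  }
});
```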
Gemini 2.5 Live
Legacy native-audio live API — still available but Google has announced gemini-3.1-flash-live-preview as the successor.
Legacy — vendor recommends migrating to Gemini 3.1 Flash Live Preview.
legacy · FD · native · proprietary · API · 300 ms first-response · commercial ok
#full-duplex #proprietary #vision+voice #native-audio #legacy
Gemini 2.5 Flash with native audio output went GA on Vertex AI in December 2025 (snapshot: `gemini-2.5-flash-native-audio-preview-12-2025`). A single model replaces the STT→LLM→TTS pipeline, with improved barge-in in noisy rooms and affective dialog that reacts to the speaker's emotion. Superseded by `gemini-3.1-flash-live-preview` in March 2026 — 2.5 remains available for backward compatibility but new integrations should target 3.1.
Amazon Nova 2 Sonic
Bedrock-hosted GA speech-to-speech model — polyglot voices, 1M-token context, async tool calls, telephony-ready.
FD · native · proprietary · API · 400 ms first-response · commercial ok
#full-duplex #proprietary #realtime-api #bedrock #telephony #polyglot #current
Successor to Nova Sonic (Apr 2025); GA on Amazon Bedrock in Dec 2025 via model ID `amazon.nova-2-sonic-v1:0`. Adds Portuguese and Hindi, polyglot voices that keep identity across languages, configurable turn-taking (low / medium / high pause sensitivity), cross-modal input, asynchronous tool calling, and a 1M-token context window. Integrates with Amazon Connect, Vonage, Twilio, AudioCodes, LiveKit, and Pipecat. Artificial Analysis reports median TTFA ≈0.4 s and higher speech-reasoning accuracy than GPT-Realtime and Gemini 2.5 Flash Live on common benchmarks.
evaluated on: Big Bench Audio
Grok Voice Agent API
OpenAI-Realtime-compatible voice agent API — 20+ languages, flat $0.05/min, ships with web/iOS testers and LiveKit plugin.
FD · native · proprietary · API · 1000 ms first-response · commercial ok
#full-duplex #proprietary #realtime-api #webrtc #openai-compatible #current
Launched 2025-12-17 as the developer-facing spin-out of the Grok Voice stack used in the xAI mobile apps and Tesla vehicles. xAI trained its own VAD, tokenizer, and audio models in-house, and exposes them via a WebSocket endpoint at `wss://api.x.ai/v1/realtime` that is wire-compatible with the OpenAI Realtime API. Reports ≈1 s average time-to-first-audio and #1 on Big Bench Audio as of launch. Supports web search / x_search / file_search / custom tools, LiveKit plugin, and ephemeral browser-side client secrets.
evaluated on: Big Bench Audio
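If the wire-compatibility claim holds, the gpt-realtime sketch above should port by swapping host and key, with the event loop unchanged. A sketch only; we have not verified which subset of Realtime events xAI implements:

```typescript
import WebSocket from "ws";

// Same Realtime event protocol, different endpoint, per xAI's claim of
// wire compatibility with the OpenAI Realtime API.
const ws = new WebSocket("wss://api.x.ai/v1/realtime", {
  headers: { Authorization: `Bearer ${process.env.XAI_API_KEY}` },
});
```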
Hume EVI 3
An empathic speech-to-speech foundation model that infers user affect and tunes its tone to match.
FD · native · proprietary · API · 400 ms first-response · commercial ok
#full-duplex #proprietary #emotion #custom-voice
First speech-LM to speak expressively with any voice, real or designed, without fine-tuning. Supports 200K+ custom voices, 30-second voice clones, and pluggable LLM backends (Claude 4, Gemini 2.5, Kimi K2, custom).
read the long-form: 'Hume AI: the smile inside a sentence, and the nine days that clarified voice AI's exit shapes'
Voice-agent platforms
Commercial platforms that wire STT+LLM+TTS together with telephony, RAG, and observability.
ElevenLabs Agents
The TTS incumbent's turnkey agent platform — turn-taking, RAG, and telephony pre-wired.
FD · stack · proprietary · API · 450 ms first-response · commercial ok
#platform #proprietary #turnkey #telephony #rag
Conversational AI 2.0 (rebranded as ElevenLabs Agents) ships SOTA turn-taking that reads backchannels like "um" / "ah," multi-character voices in a single agent, RAG, HIPAA mode, and SIP trunking. 5,000+ voices across 31 languages.
read the long-form: 'ElevenLabs: why a TTS company is priced at $11B'
Deepgram Voice Agent
STT+TTS incumbent's agent API using Nova-3 / Flux for transcription and Aura-2 for voice.
FD · stack · proprietary · API · 450 ms first-response · commercial ok
#platform #proprietary #asr #tts
Pairs Deepgram's Nova-3 / Flux STT (with explicit turn detection) and Aura-2 TTS with pluggable LLMs including GPT-5.4 and Gemini. Flux's end-of-turn model is aimed squarely at sub-second back-and-forth latency.
Retell AI
Low-latency voice agent API built for call-centre workloads and SIP carriers.
FD · stack · proprietary · API · 500 ms first-response · commercial ok
#platform #proprietary #telephony #b2b
Tight SIP and Twilio integration, with batteries-included call recording, post-call analytics, and outbound campaigns. One of the more battle-tested commercial stacks for production voice agents.
Bland
Turnkey voice-agent platform for inbound and outbound telephony with web agents and an API.
FD · stack · proprietary · API · 400 ms first-response · commercial ok
#platform #proprietary #telephony #b2b #outbound
Phone-native voice-agent platform with inbound / outbound orchestration, web-embedded agents, pathways (node-based conversation logic), and a documented API. Often compared directly against Vapi, Retell, and Deepgram Voice Agent for production call deployments.
Vapi
Developer-first orchestration layer for inbound/outbound voice agents with BYO models.
FD · stack · proprietary · API · 500 ms first-response · commercial ok
#platform #proprietary #telephony #byo-models
Billed at $0.05/min of orchestration on top of your own STT/LLM/TTS/telephony keys. Two-tier building model — Assistants (single prompt) and Squads (multi-agent). Claims sub-500 ms infra latency at 99.99% uptime.
Open agent frameworks
Self-hosted libraries for assembling voice agents with swappable STT / LLM / TTS blocks.
Kyutai Unmute
Open reference stack that gives any text LLM ears and a voice via delayed streams modelling.
FD · stack · open src · framework · commercial ok · MIT (code) / CC-BY-4.0 (weights)
#framework #open-source #streaming #low-latency
Wraps a text LLM with Kyutai STT (1B EN/FR with 0.5 s delay or 2.6B EN with 2.5 s delay, both with semantic VAD) and Kyutai TTS 1.6B. Rust and Python implementations, Docker Compose deploy.
LiveKit Agents
Open agent framework on top of the LiveKit WebRTC mesh — 50+ provider plugins, first-class FD.
FD · stack · open src · framework · commercial ok · Apache-2.0
#framework #open-source #webrtc #orchestration
Since v1.5 it ships an adaptive interruption ML model (86% precision vs VAD baselines), dynamic endpointing, preemptive generation, and per-turn latency metrics out of the box. Integrates with 50+ STT/LLM/TTS providers including gpt-realtime.
HF speech-to-speech
Modular reference pipeline that cascades VAD → ASR → LLM → TTS for local voice agents.
turn-taking · open src · framework · commercial ok · MIT
#framework #open-source #cascade #local
Hugging Face's reference cascaded stack (Silero VAD → Whisper / Lightning-Whisper-MLX → any HF LLM → Parler / MeloTTS / ChatTTS / Kokoro). Not full-duplex out of the box — it uses a turn-taking loop, sketched below — but often the fastest way to prototype an open-source voice agent end to end. Apple-silicon optimised.
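That turn-taking loop is worth seeing in miniature. A deliberately hypothetical TypeScript sketch (the `Vad` / `Asr` / `Llm` / `Tts` interfaces are ours, not the repo's actual Python APIs) showing why a cascade is turn-based: every hop awaits the previous one, so the agent cannot listen while it speaks.

```typescript
type Frame = Float32Array;

// Hypothetical stage interfaces for a VAD → ASR → LLM → TTS cascade.
interface Vad { push(frame: Frame): Frame | null; } // full utterance once speech ends
interface Asr { transcribe(utterance: Frame): Promise<string>; }
interface Llm { complete(text: string): Promise<string>; }
interface Tts { synthesize(text: string): Promise<Frame>; }

async function turnLoop(
  mic: AsyncIterable<Frame>,
  vad: Vad,
  asr: Asr,
  llm: Llm,
  tts: Tts,
  play: (audio: Frame) => Promise<void>,
): Promise<void> {
  for await (const frame of mic) {
    const utterance = vad.push(frame);
    if (utterance === null) continue; // still listening

    // Each hop blocks on the previous one, and playback blocks the mic
    // loop entirely. That serialization is what makes a cascade
    // turn-based rather than full-duplex: no barge-in without extra work.
    const text = await asr.transcribe(utterance);
    const reply = await llm.complete(text);
    await play(await tts.synthesize(reply));
  }
}
```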
Pipecat
Python framework for voice agents with swappable ASR / LLM / TTS and sane defaults for turn-taking.
FD · stack · open src · framework · commercial ok · BSD-2
#framework #open-source #orchestration #python
Wires a WebRTC transport (typically Daily) to a chain of ASR / LLM / TTS blocks. Increasingly the default pipeline for teams who want full control over every hop without writing their own media server.
Audio-LLMs & omni-models
Models whose primary job is understanding audio, not driving a conversation. Some emit speech too.
Qwen3-Omni
Natively end-to-end omni-LLM (text / image / audio / video in, text + speech out).
FD · native · open weights · open model · commercial ok · Apache-2.0
#audio-llm #open-weights #omni #streaming
MoE Thinker–Talker architecture with an Audio Transformer encoder pre-trained on 20M hours of speech. Claims SOTA on 22/36 audio/video benchmarks; ASR and voice chat on par with Gemini 2.5 Pro. Speech output in 10 languages.
Qwen2.5-Omni
End-to-end multimodal Qwen (text / image / audio / video in, text + speech out) with streaming response.
Legacy — vendor recommends migrating to Qwen3-Omni.
legacy · turn-taking · open weights · open model · commercial ok · Apache-2.0
#audio-llm #open-weights #omni #streaming
The 2.5-generation omni model, predecessor to Qwen3-Omni. Thinker–Talker split that interleaves text reasoning with streaming speech output. Listed here because it is still the reference for any paper comparing against the Qwen omni line.
Ultravox
Lightweight audio adapter that bolts onto Llama / Qwen to give an LLM ears without retraining.
turn-taking · open weights · open model · commercial ok · MIT
#audio-llm #open-weights #adapter #whisper
Cross-attends a Whisper-class speech encoder into an existing text LLM. Popular when you want to bolt voice onto an in-house base model without running a separate ASR stage.
TTS & voice generation
Speech synthesis APIs and open TTS weights. Feeds every cascaded stack above.
Qwen3-TTS
Open streaming TTS successor to CosyVoice with voice cloning, voice design, and 10-language coverage.
not conversational · open weights · open model · commercial ok · Apache-2.0
#tts #open-weights #multilingual #zeroshot #streaming
Part of the 2026 wave of open expressive TTS, alongside MOSS-TTS. Streams tokens to audio with voice cloning, prompt-based voice design, and cross-lingual speaker transfer. Positioned as the replacement for CosyVoice 2 in the Qwen ecosystem.
MOSS-TTS
Real-time streaming TTS family with voice cloning, multi-speaker dialogue, and sound generation.
not conversational · open weights · open model · commercial ok · Apache-2.0
#tts #open-weights #streaming #dialogue
Open TTS family from the OpenMOSS group at Fudan. Covers real-time streaming synthesis, zero-shot voice cloning, multi-speaker dialogue generation, and ambient sound generation in one codebase — modernises the open TTS drawer alongside Qwen3-TTS.
Cartesia Sonic 3
State-space TTS with ~90 ms time-to-first-audio and realistic non-verbal sounds.
not conversational · proprietary · API · 90 ms first-response · commercial ok
#tts #proprietary #low-latency #websocket
WebSocket TTS with bidirectional streaming, context continuations for prosody, and laughter / breathing synthesis. Shipped by the Mamba / SSM team; the TTS of choice for sub-second voice agents.
read the long-form: 'Cartesia: why AWS put a non-transformer voice AI on its own shelf'
CosyVoice 2
Open expressive TTS with strong zero-shot voice cloning across Japanese, Chinese, and English.
not conversational · open weights · open model · commercial ok · Apache-2.0
#tts #open-weights #multilingual #zeroshot
Half of Alibaba's open voice stack, paired with SenseVoice for ASR. The canonical pipeline is SenseVoice → LLM → CosyVoice. Strong zero-shot cloning with ~3 s of reference audio.
Speech-to-speech translation
Direct speech translation — one language in, another out, with or without text intermediate.
Hibiki
On-device high-fidelity simultaneous speech-to-speech translation — Moshi-style dual-stream for FR → EN, with voice transfer.
FD · native · open weights · open model · commercial ok · CC-BY-4.0 (weights) / Apache-2.0 & MIT (code)
#s2st #translation #open-weights #streaming #on-device #moshi-family
Hibiki reuses Moshi's multistream architecture to jointly model source and target audio, producing text + audio tokens at 12.5 Hz while the user is still speaking. Two checkpoints: Hibiki 2B (≈2.7B with the depth transformer, 16 RVQ per stream) and Hibiki-M 1B for on-device smartphone inference. Currently French → English only. Inference code for PyTorch (CUDA), Rust (CUDA), MLX (macOS), and MLX-Swift (iOS). Open-sourced under CC-BY-4.0 for weights.
SeamlessM4T v2
Direct speech-to-speech translation across 100+ languages, with weights in the open.
not conversational · open weights · open model · non-commercial · CC-BY-NC-4.0
#s2st #translation #open-weights #multilingual
Meta shipped SeamlessExpressive and SeamlessStreaming as companion drops for prosody transfer and streaming translation. Commercial use requires either training a derivative model or a separate arrangement with Meta.
What the taxonomy is telling us.
- gap · 01
Native × English-first
Every native full-duplex speech-LM in the table is primarily English or Chinese. Native open speech-LMs for the other high-speaker-count languages are still unclaimed territory — whoever ships the first one sets the reference for that language.
- gap · 02
Platforms ≠ speech-LMs
Most "voice AI" vendors live in the platform drawer, not the speech-LM drawer. That means the full-duplex behaviour you feel is usually crafted by a turn-taking model on top of a cascade — which is why benchmark coverage of that layer (Talking Turns, Full-Duplex-Bench) matters. - gap · 03
Where the work is heading
The interesting frontier is vertical voice agents on top of these platforms — healthcare, support, field operations. The Observatory will add that vertical drawer as enough public entries appear.
Keep reading — three ways out of the directory.
The observatory is deliberately triangular. Models are where you land; the benchmarks and datasets pages are where you verify the claim, and the blog is where you get the story behind the name.
- B · evaluations
See how these models are measured.
25 native benchmarks for speech-to-speech, full-duplex, and audio foundation work — each entry links back to the model cards that post a score.
open benchmarks
- D · corpora
See what they were trained on.
14 frontier corpora on the 2024–26 full-duplex frontier, plus the classical read/dialog sets. Every row lists its license, hours, and channel-separation status.
open datasets
- A · essays & dispatches
Read the long-form.
The Verticals series profiles the labs, companies, and maintainers behind these entries — Kyutai, Sesame, Cartesia, Hume, ElevenLabs, Meta FAIR Speech, Alibaba DAMO, and more.
open the blog
Spotted a model we're missing? submit an entry