Fullduplex/blog
§M · models & platforms

Models.

A directory of speech-to-speech, full-duplex, and audio-foundation-model entries — open weights, hosted APIs, voice-agent platforms, and agent frameworks — grouped by what they are and how they close the conversation loop. Each card tags the kind (speech-LM, realtime API, platform…), the FD mode (native vs cascaded stack), and the surface you consume it through (API, model, SDK, framework). Something missing or wrong? Open the channel.

total tracked · 34
native full-duplex · 17
open weights · 20
preview / legacy · 2 / 2

Counter definitions. open weights = openWeights === true. native full-duplex = fdMode === "native". All four counters cover preview + current + legacy entries but exclude deprecated, so the headline does not inflate as older snapshots are retired.
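The counter logic can be sketched directly from those definitions. This is a minimal TypeScript sketch, not the observatory's actual source: the `Entry` interface is hypothetical, assuming only the fields and status values the counter definitions name.

```typescript
type Status = "preview" | "current" | "legacy" | "deprecated";

// Hypothetical entry shape, using the fields named in the counter definitions.
interface Entry {
  name: string;
  fdMode: "native" | "cascaded";
  openWeights: boolean;
  status: Status;
}

function counters(entries: Entry[]) {
  // All counters exclude deprecated entries, so the headline numbers
  // do not inflate as older snapshots are retired.
  const pool = entries.filter((e) => e.status !== "deprecated");
  return {
    totalTracked: pool.length,
    nativeFullDuplex: pool.filter((e) => e.fdMode === "native").length,
    openWeights: pool.filter((e) => e.openWeights === true).length,
    preview: pool.filter((e) => e.status === "preview").length,
    legacy: pool.filter((e) => e.status === "legacy").length,
  };
}
```

Note that preview and legacy entries still count toward the headline totals; only `deprecated` is dropped from the pool.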

FD · native | FD · stack | turn-taking | open weights | proprietary | framework | preview | legacy
M-A · capability matrix

Model × capability at a glance

Twelve capability axes scored from the canonical model fields and tags — the five language columns track the world's most-spoken tongues (EN / ZH / HI / ES / FR), plus a "Persona" column for models that support voice-prompt or role-prompt conditioning. Rows are split into Full-duplex native and Others; the FD-native band opens by default, and clicking a group header folds it. Hover a column for its definition; click a row to jump to the detailed card below.

Columns: FD-native · Cascade · ≤300 ms · Open wts. · Comm.-OK · Persona · EN · ZH · HI · ES · FR · Platform
Moshi · Kyutai
GLM-4-Voice · Zhipu AI (Z.ai)
Sesame CSM · Sesame AI
Kimi-Audio · Moonshot AI
PersonaPlex · NVIDIA (ADLR)
MiniCPM-o 4.5 · OpenBMB / ModelBest
SALMONN-omni · ByteDance · Tsinghua · Cambridge
Step-Audio 2 mini · StepFun
Covo-Audio-Chat / Covo-Audio-Chat-FD · Tencent
OpenAI Realtime (gpt-realtime) · OpenAI
Gemini 2.5 Live · Google DeepMind
Gemini 3.1 Flash Live Preview · Google DeepMind
Hume EVI 3 · Hume AI
Amazon Nova 2 Sonic · Amazon / AWS
Grok Voice Agent API · xAI
Qwen3-Omni · Alibaba Cloud
Hibiki · Kyutai
supported | partial / unofficial | unsupported
M-B · latency × release

Latency vs release date

Scope

First-response latency (log scale) on the Y axis, release date on the X. Open-weights models in filled orange; proprietary APIs in dark grey. Dashed outlines mark frameworks and research-only entries; dashed rings flag preview snapshots, and half-opacity dots flag legacy lines that have been superseded. The default view shows end-to-end full-duplex native models only.

Latency provenance. Values mix vendor claims, paper reports, third-party evaluations, and demo measurements — they are not normalised to a single trace or workload. Read the chart as order-of-magnitude rather than a head-to-head ranking.

[Chart M-B: first-response latency (ms, log scale) vs release date, 2023–2026]
Moshi · Kyutai · 2024-09 · 160 ms
Sesame CSM · Sesame AI · 2025-03 · 200 ms
Hume EVI 3 · Hume AI · 2025-05 · 400 ms
Gemini 2.5 Live · Google DeepMind · 2025-12 · 300 ms
Amazon Nova 2 Sonic · Amazon / AWS · 2025-12 · 400 ms
Grok Voice Agent API · xAI · 2025-12 · 1000 ms
PersonaPlex · NVIDIA (ADLR) · 2026-01 · 170 ms
OpenAI Realtime (gpt-realtime) · OpenAI · 2026-02 · 310 ms
Gemini 3.1 Flash Live Preview · Google DeepMind · 2026-03 · 260 ms
open-weights / open-source | proprietary | framework | preview | legacy
M-C · release timeline

Native STS · 2024 – 2026

Scope

Native STS only — one lane for end-to-end full-duplex models. Pre-2024 entries are collapsed to a single chip on the left edge. Use All models for the 3-lane view.

[Timeline M-C: full-duplex native (end-to-end), 2024–2026]
Moshi · Kyutai · 2024-09
GLM-4-Voice · Zhipu AI (Z.ai) · 2024-10
Hibiki · Kyutai · 2025-02
Sesame CSM · Sesame AI · 2025-03
Kimi-Audio · Moonshot AI · 2025-04
Hume EVI 3 · Hume AI · 2025-05
Step-Audio 2 mini · StepFun · 2025-08
Qwen3-Omni · Alibaba Cloud · 2025-09
SALMONN-omni · ByteDance · Tsinghua · Cambridge · 2025-11
Gemini 2.5 Live · Google DeepMind · 2025-12
Amazon Nova 2 Sonic · Amazon / AWS · 2025-12
Grok Voice Agent API · xAI · 2025-12
PersonaPlex · NVIDIA (ADLR) · 2026-01
MiniCPM-o 4.5 · OpenBMB / ModelBest · 2026-02
OpenAI Realtime (gpt-realtime) · OpenAI · 2026-02
Covo-Audio-Chat / Covo-Audio-Chat-FD · Tencent · 2026-03
Gemini 3.1 Flash Live Preview · Google DeepMind · 2026-03
§01 · native speech-lm

Full-duplex speech-LMs

End-to-end models that ingest and emit audio in the same network. Open weights, native barge-in.

§02 · proprietary s2s api

Realtime assistant APIs

Hosted speech-to-speech APIs from the big labs. Pay-per-minute, closed weights, ≤ 500 ms latency.

§03 · turnkey stack

Voice-agent platforms

Commercial platforms that wire STT+LLM+TTS together with telephony, RAG, and observability.

§04 · open framework

Open agent frameworks

Self-hosted libraries for assembling voice agents with swappable STT / LLM / TTS blocks.
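The cascaded loop that both the platforms and the open frameworks above assemble can be sketched as a single conversation turn: audio in, text through an LLM, audio out. The interfaces below are hypothetical placeholders, not any particular framework's API; real frameworks add streaming, barge-in, and a turn-taking layer on top of this blocking loop.

```typescript
// Hypothetical swappable blocks of a cascaded voice agent.
interface STT { transcribe(audio: Float32Array): Promise<string>; }
interface LLM { reply(text: string): Promise<string>; }
interface TTS { synthesize(text: string): Promise<Float32Array>; }

// One turn of the cascade: speech -> text -> text -> speech.
async function turn(
  audio: Float32Array,
  stt: STT,
  llm: LLM,
  tts: TTS,
): Promise<Float32Array> {
  const userText = await stt.transcribe(audio);
  const replyText = await llm.reply(userText);
  return tts.synthesize(replyText);
}
```

Because each block is an interface, any STT, LLM, or TTS implementation can be dropped in — which is exactly the swap the cascaded stacks in this directory compete on.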

§05 · audio-in / omni

Audio-LLMs & omni-models

Models whose primary job is understanding audio, not driving a conversation. Some emit speech too.

§06 · synthesis

TTS & voice generation

Speech synthesis APIs and open TTS weights. Feeds every cascaded stack above.

§07 · s2st

Speech-to-speech translation

Direct speech translation — one language in, another out, with or without text intermediate.

§N · observatory notes

What the taxonomy is telling us.

  1. gap · 01

    Native × English-first

    Every native full-duplex speech-LM in the table is primarily English or Chinese. Native open speech-LMs for the other high-speaker-count languages are still unclaimed territory — whoever ships the first one sets the reference for that language.
  2. gap · 02

    Platforms ≠ speech-LMs

    Most "voice AI" vendors live in the platform drawer, not the speech-LM drawer. That means the full-duplex behaviour you feel is usually crafted by a turn-taking model on top of a cascade — which is why benchmark coverage of that layer (Talking Turns, Full-Duplex-Bench) matters.
  3. gap · 03

    Where the work is heading

    The interesting frontier is vertical voice agents on top of these platforms — healthcare, support, field operations. The Observatory will add that vertical drawer as enough public entries appear.
§M·∎ · next stops

Keep reading — three ways out of the directory.

The observatory is deliberately triangular. Models are where you land; the benchmarks and datasets pages are where you verify the claims; and the blog is where you get the story behind the name.

Spotted a model we're missing? Submit an entry.