Benchmarks.
How the field measures voice models — not just ASR and TTS, but conversation, turn-taking, emotion, instruction-following, and task completion. Scope is deliberately a little wider than strict STS: this page tracks 42 speech-interaction and adjacent benchmarks. Every entry carries a tier (native / component / adjacent / legacy) and a setting (lab / arena / live / vertical), and links are split into site, paper, code, and leaderboard slots so the link is exactly what it claims to be. Found a stale score or a missing entry? Report it to the community.
What each benchmark actually measures.
30 speech-interaction benchmarks on the rows (the core STS subset of the 42-entry directory below), grouped by capability family, against 15 capability axes that today's benchmarks actually score. Toggle the +5 unexplored axes button above the grid to expose five structurally uncovered columns (code-switch, long-form memory, emotion regulation, on-device, audio adversarial) — axes that exist in text-LLM or ASR/TTS evaluation but have no public STS benchmark as of April 2026. Each row carries two runs-on chips — cas for cascade stacks and fd for full-duplex STS models — plus year / setting / license metadata. Click a group header to fold its rows, or a column header to rank benchmarks by that axis. The cards below this grid add 12 component / adjacent / legacy entries that are out of the grid's scope but still part of the measurement landscape (TTS arenas, ASR WER, S2ST, text-only verticals, and the historical baselines).
Full-duplex & interactive
Benchmarks that score turn-taking, back-channeling, overlap, and interruption — the axes that separate real voice agents from LLM-plus-TTS.
- NTU / UC Berkeley / UW / MIT · updated 2026-04 · verified 2026-04 · native · live
Full-Duplex-Bench v3
April 2026 release — adds tool-use events in the dialogue stream and fine-grained latency breakdown (first-token / first-word / full-reply).
full-duplex · vertical-agent
v3 drops backchannel and overlap subsets (retired to v1.5 / v2) and adds tool invocation + reaction-time breakdown. Expected to become the default 'latency in a dialogue with tools' scoreboard through 2026.
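For intuition only, the kind of breakdown v3 reports reduces to differences between harness event timestamps. A minimal sketch, assuming hypothetical event names and a seconds-based clock; this is not FDB's actual schema or harness code:

```python
# Hypothetical sketch: first-token / first-word / full-reply latency breakdown
# derived from harness event timestamps. Field names are placeholders, not FDB v3's schema.
from dataclasses import dataclass

@dataclass
class ReplyEvents:
    user_turn_end: float  # seconds: user (or tool result) finishes
    first_token: float    # seconds: first audio token emitted by the model
    first_word: float     # seconds: first intelligible word boundary in the reply
    reply_end: float      # seconds: reply audio finishes

def latency_breakdown(ev: ReplyEvents) -> dict:
    """All three latencies measured from the end of the user's turn, in seconds."""
    return {
        "first_token_s": ev.first_token - ev.user_turn_end,
        "first_word_s": ev.first_word - ev.user_turn_end,
        "full_reply_s": ev.reply_end - ev.user_turn_end,
    }

print(latency_breakdown(ReplyEvents(3.20, 3.55, 3.90, 7.10)))
```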
- Sierra Research · updated 2026-02 · verified 2026-04 · native · vertical
τ³-Bench (τ-Voice)
The first agent benchmark with a live, full-duplex voice track — reveals a 50-point gap between text and voice reasoning.
vertical-agent · full-duplex · speech-lm
Extends τ-bench and τ²-bench into banking (698 docs, ~195k tokens of policy) and knowledge retrieval, then adds τ-Voice: live full-duplex conversations with OpenAI / Gemini / xAI voice models under simulated noisy, accented conditions. Text-reasoning agents score ~85%; the same models drop to 26–38% once voice and real-time are introduced.
- ICASSP 2026 Spoken Dialogue Challenge · updated 2026-01 · verified 2026-04 · native · lab
HumDial (ICASSP '26)
Dual-track human-evaluated spoken dialogue challenge — first public benchmark to explicitly score paralinguistic input alongside turn-taking dynamics.
full-duplex · speech-lm · emotion
Two tracks (Track I English; Track II Chinese + English) with a held-out human evaluation protocol. Sits at the intersection of full-duplex dynamics and paralinguistic understanding.
turn-taking · paralinguistic grounding · human preference
- Academic (Audio MultiChallenge authors) · updated 2025-12 · verified 2026-04 · native · lab
Audio MultiChallenge
Multi-turn evaluation of spoken dialogue systems on natural human interaction — scores Inference Memory, Instruction Retention, Self Coherence, Voice Editing, Audio-Cue handling.
speech-lm · full-duplex
Covers the long-range context and self-coherence axes that are currently thin on the STS measurement layer. Strong complement to Talking Turns (turn-taking) and MTR-DuplexBench (multi-turn dynamics).
- NTU / UC Berkeley / UW / MIT · updated 2025-11 · verified 2026-04 · native · live
Full-Duplex-Bench v2
Late-2025 WebRTC-based live harness — first FDB release that runs dynamic evaluation against streaming endpoints, plus multi-turn stimuli.
full-duplex · speech-lm
v2 moves the FDB family from fixed-file scoring to real-time streaming against the model under test. Adds multi-turn dialogue stimuli and a (subjective) first probe of instruction-following inside the FDB harness. Now the most-cited public full-duplex eval outside arenas.
- Academic (FLEXI authors) · updated 2025-11 · verified 2026-04 · native · lab
FLEXI
Full-duplex safety benchmark — scores whether a model correctly barges in on a user in safety-critical moments (medical, self-harm, imminent danger).
full-duplex
The only public eval targeted at proactive safety interrupts. Thin coverage across other axes, but unique: it forces full-duplex systems to demonstrate that they will speak up on their own when needed.
- Academic (MTR-DuplexBench authors) · updated 2025-10 · verified 2026-04 · native · lab
MTR-DuplexBench
Multi-turn, multi-topic full-duplex benchmark built from recorded call-center-style dialogues.
full-duplex · speech-lm
Probes how full-duplex dynamics degrade across longer conversations and topic shifts — the regime where most turn-taking benchmarks stop. Currently the main public signal on long-range full-duplex behaviour.
- Academic (SID-Bench authors) · updated 2025-09 · verified 2026-04 · native · lab
SID-Bench
Spoken Interaction Dynamics benchmark — millisecond-resolution ground truth for end-of-turn detection and interruption recovery.
full-duplex
Zooms in where FDB stays macro: SID-Bench scores reaction-time dynamics in full-duplex speech with millisecond-level labels for turn boundaries, barge-in timing, and interruption recovery. Useful as a paired eval alongside FDB v1 on the reaction-time axis.
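Reaction-time numbers of this kind come down to differences between labelled timestamps. A minimal sketch under an assumed (user_turn_end_ms, model_onset_ms) pair format; SID-Bench's released annotation schema will differ:

```python
# Illustrative only: turn-taking reaction-time stats from millisecond-level labels.
import statistics

def reaction_times_ms(events: list) -> list:
    """Gap between the labelled end of the user's turn and the model starting to speak."""
    return [model_onset - user_end for user_end, model_onset in events]

events = [(12_480, 12_770), (20_310, 20_540), (33_905, 34_600)]  # toy labels
gaps = reaction_times_ms(events)
print(statistics.median(gaps))              # median turn-taking latency in ms
print(statistics.quantiles(gaps, n=10)[8])  # rough p90, the tail listeners actually notice
```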
- Academic (MMedFD authors) · updated 2025-09 · verified 2026-04 · adjacent · vertical
MMedFD
Real-world healthcare benchmark for multi-turn full-duplex automatic speech recognition — adjacent to STS, crucial for duplex deployment.
asr · vertical-agent · full-duplex
Not STS in the strict sense — it scores ASR under multi-turn full-duplex healthcare conversations — but the failure modes it exposes (overlap, disfluency, medical-term recall) are exactly the ones that break vertical voice deployments. Listed as `adjacent` because the STS community will quote it whenever duplex systems hit healthcare production.
- NTU / UC Berkeley / UW / MIT · updated 2025-08 · verified 2026-04 · native · lab
Full-Duplex-Bench v1.5
Mid-2025 refresh of FDB v1 — adds the simultaneous-speech / overlap axis (user interruption, listener back-channel, side conversation, ambient speech).
full-duplex
Same harness as v1, same automatic scoring, one new axis: what happens when both sides are producing audio at once. v1.5 is what most 2025 papers now cite when they claim 'full-duplex evaluated.'
- Academic (FD-Bench authors) · updated 2025-07 · verified 2026-04 · native · lab
FD-Bench
Independent full-duplex benchmark focused on natural-pause / short-utterance regimes — a cross-check against FDB.
full-duplex
Different stimulus sources to FDB but overlapping metrics, which is the point: agreement / disagreement between FDB and FD-Bench has become a quick sanity signal for whether a model is overfit to one harness.
- Apple / CMU · updated 2025-04 · verified 2026-04 · native · lab
Talking Turns
A supervised turn-taking judge trained on human-human conversations to score audio foundation models.
full-duplex · speech-lm
Trains a classifier to predict turn-taking events (end-of-turn, backchannel, interruption) in Switchboard and uses it as an automatic judge for spoken dialogue systems. The first study to systematically show that existing speech-LMs fail to understand when to speak and rarely back-channel. Published at ICLR 2025.
- NTU / UC Berkeley / UW / MIT · updated 2025-03 · verified 2026-04 · native · lab
Full-Duplex-Bench v1
Original Full-Duplex-Bench — automatic scoring of the four canonical turn-taking events: when-to-speak, back-channel, interruption success, pause handling.
full-duplex
Released 2025-03. Fixed audio stimuli + reproducible automatic metrics, establishing the first public benchmark dedicated to full-duplex interactive behaviour. Everything after (v1.5 / v2 / v3) is additive on top of this harness.
- Japanese academic community · updated 2024-11 · verified 2026-04 · native · lab
J-Moshi (subjective)
Japanese open-weights full-duplex model paper whose subjective MOS listening tests are the main public Japanese STS measurement layer today.
full-duplex · speech-lm
Not a held-out test set — just MOS scores on human-evaluated samples. Listed here because right now there is no shared automatic Japanese full-duplex benchmark, and J-Moshi's listener ratings are what the JA community points at.
Speech-LM & Audio Foundation Models
Lab-style offline evals for speech-LMs and audio foundation models — knowledge, reasoning, safety, instruction following, multilingual dialogue, paralinguistic awareness.
- Artificial Analysis · updated 2026-03 · verified 2026-04 · native · lab
Big Bench Audio
1,000 spoken reasoning questions — the dataset behind the Artificial Analysis S2S speech-reasoning leaderboard.
speech-lm · instruction-following
Adapts four Big Bench Hard categories (Formal Fallacies, Navigate, Object Counting, Web of Lies) into 250 spoken questions each, rendered with 23 top TTS voices. Used by Artificial Analysis to publish the public `Speech Reasoning` score across S2S providers and to compute Time-to-First-Audio. v1.1 (2026-03) switched to a Claude Sonnet 4.6 judge and includes unanswered questions in the score. Current top entry (2026-04) is Step-Audio R1.1 Realtime at 97.0%.
Top entry: Step-Audio R1.1 Realtime, 97.0%. Models scored: Amazon Nova 2 Sonic, Gemini 2.5 Live, Gemini 3.1 Flash Live Preview, Grok Voice Agent API, Moshi, OpenAI Realtime (gpt-realtime), Qwen3-Omni, Step-Audio 2 mini. Metrics: accuracy, TTFA (s).
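Both reported numbers are simple to state precisely. A minimal sketch of how a harness might measure Time-to-First-Audio and compute v1.1-style accuracy, where unanswered questions stay in the denominator; `stream_reply` is a placeholder iterator, not any provider's real client API:

```python
import time

def time_to_first_audio(stream_reply, t_request: float):
    """Seconds from sending the spoken question to the first non-empty audio chunk."""
    for chunk in stream_reply:
        if chunk:                      # skip empty keep-alive frames
            return time.monotonic() - t_request
    return None                        # the model never produced audio

def accuracy(judgements: list) -> float:
    """judgements: one of 'correct', 'incorrect', 'unanswered' per question (judge output).
    Unanswered questions count against the model."""
    return judgements.count("correct") / len(judgements)

print(accuracy(["correct", "correct", "unanswered", "incorrect"]))  # 0.5
```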
- SJTU (OmniAgent) · updated 2026-01 · verified 2026-04 · native · lab
VocalBench
A four-dimensional stress test for speech-interaction models — semantic, acoustic, conversational, and robust.
speech-lm · audio-understanding
~24k EN + ZH instances evaluating knowledge / reasoning (ACC), open-ended chat (LLM-judge), acoustic quality (UTMOS, WER), emotional empathy (EER), and robustness to noise, reverb, far-field, packet loss, and clipping. Tested on 27 mainstream speech-interaction models including GPT-4o Voice and Qwen-Audio.
- SJTU (OmniAgent) · updated 2025-11 · verified 2026-04 · native · lab
VocalBench-zh
Mandarin version of VocalBench — 10 subsets, ~10k instances, 14 models evaluated at the Nov 2025 launch.
speech-lm · audio-understanding
The default Mandarin STS scoring target and a common co-citation with VoiceBench. Same four-dimensional design as the original VocalBench adapted to Chinese data.
- Academic (CS3-Bench authors) · updated 2025-10 · verified 2026-04 · native · lab
CS3-Bench
Mandarin-English code-switching benchmark — headline finding: S2S models drop ~66% relative on code-switched inputs vs monolingual ones.
speech-lm
Closest thing the field has to a real code-switch axis for speech-LMs. Probes where most current systems silently fail — language mixing within an utterance — which the multilingual benchmarks above currently underweight.
- Ruiqi Yan et al. · updated 2025-08 · verified 2026-04 · native · lab
URO-Bench
The first end-to-end S2S benchmark with multilingual, multi-turn, and paralinguistic coverage.
speech-lm · instruction-following · emotion
Forty datasets across twenty tasks in full Chinese-English pairings, split into a basic track and a pro track. Each track scores Understanding / Reasoning / Oral-conversation (URO) axes, exposing how open-source spoken dialogue models lag their backbone LLMs in instruction following, paralinguistics, and audio understanding. Findings of EMNLP 2025.
- NUS · updated 2025-06 · verified 2026-04 · native · lab
VoiceBench
Evaluates LLM-based voice assistants across knowledge, reasoning, safety, and instruction-following — judged by a GPT-4-class model.
speech-lm · instruction-following
Wraps nine text/spoken subtasks (AlpacaEval, CommonEval, WildVoice, OpenBookQA, MMSU, SD-QA, IFEval, BBH, AdvBench) around speech-in / text-out assistants, injecting speaker, environment, and content variations. The public leaderboard tracks 39+ systems split across cascaded, audio-LLM, omni, and S2S / full-duplex categories.
- AudioLLMs (I2R Singapore) · updated 2025-05 · verified 2026-04 · native · lab
IFEval-Audio
Format-constrained spoken instructions, verified programmatically — an audio port of text IFEval.
instruction-following · speech-lm
280 audio-instruction-answer triples across six dimensions (content, capitalization, symbol, list, length, format), drawn from Spoken SQuAD, TED-LIUM 3, MuchoMusic, and others. Each answer is programmatically checked against its constraint, giving IFR (format), SCR (semantics), and OSR (both) scores. Published at IJCNLP-AACL 2025.
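The programmatic checking is the interesting part: each constraint is a deterministic predicate over the answer. A simplified illustration with three stand-in checkers; the real constraint set ships with the released dataset and covers all six dimensions:

```python
import re

CHECKS = {
    "capitalization": lambda ans: ans == ans.upper(),                      # "answer in all caps"
    "length": lambda ans: len(ans.split()) <= 30,                          # "at most 30 words"
    "list": lambda ans: bool(re.search(r"^\s*(\d+\.|[-*]) ", ans, re.M)),  # bulleted / numbered
}

def instruction_following_rate(items: list) -> float:
    """IFR: fraction of answers satisfying their format constraint."""
    return sum(CHECKS[it["constraint"]](it["answer"]) for it in items) / len(items)

items = [
    {"constraint": "capitalization", "answer": "PARIS"},
    {"constraint": "length", "answer": "The capital of France is Paris."},
    {"constraint": "list", "answer": "1. Paris\n2. Lyon"},
]
print(instruction_following_rate(items))  # 1.0
```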
- Alibaba DAMO (OFA-Sys) · updated 2024-12 · verified 2026-04 · native · lab
AIR-Bench
A two-tier stress test for audio-LLMs covering speech, natural sound, and music — MCQ plus chat.
audio-understanding · instruction-following
Foundation benchmark runs ~19k single-choice questions across 19 audio-understanding tasks; the chat benchmark scores ~2k open-ended audio-instruction dialogues with a GPT-4 judge. Supports English and Chinese; published at ACL 2024.
Arena & preference
Blind A/B human-preference arenas. The only evaluators that reliably catch 'this just sounds off' — and the only ones that scale across languages and accents.
- Artificial Analysis · updated 2026-04 · verified 2026-04 · adjacent · arena · tts-only · not-full-duplex
Artificial Analysis Speech Arena
Blind user votes turn TTS models into a live Elo leaderboard — the de-facto preference number for TTS. Despite the 'Speech Arena' label, this is a TTS-only (text → speech) arena, not an STS ranking.
tts
Sixty-nine TTS models (15 open-weights) ranked by Elo from blind A/B listening comparisons on the same prompt. Filterable by category (assistants, entertainment, customer service, knowledge sharing) and accent. Currently led by Inworld TTS 1.5 Max and Eleven v3. Useful as a voice-quality proxy, but it does not evaluate conversation, turn-taking, or dialogue behaviour — do not read it as an STS benchmark.
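For readers who haven't met it, the Elo machinery behind these arenas is a handful of lines: a win moves the winner's rating up by K times how surprising the result was. The K-factor and base rating below are illustrative defaults, not the arena's published configuration:

```python
def expected(r_a: float, r_b: float) -> float:
    """Probability the standard Elo model assigns to A beating B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    e_a = expected(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

r_a, r_b = 1000.0, 1000.0
r_a, r_b = elo_update(r_a, r_b, a_wins=True)  # one blind A/B vote for model A
print(r_a, r_b)                               # 1016.0 984.0
```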
- Scale AI · updated 2026-03 · verified 2026-04 · native · arena · not-full-duplex
Scale Voice Showdown
First in-the-wild preference arena for voice AI — 11 frontier models, 60+ languages, real user speech. Dictate + S2S modes live; full-duplex mode still in development.
speech-lm · tts
Intercepts <5% of real ChatLab conversations for blind head-to-head battles between two voice models, then continues the session with the winner. Eighty-one percent of prompts are open-ended conversational, and a third of battles happen in non-English (ES, AR, JA, PT, HI, FR). Modes currently evaluated are Dictate (speech-in / text-out) and S2S — *not* yet full-duplex live streaming, despite the arena framing. Treat scores as 'best voice assistant under turn-based play,' not as a full-duplex ranking.
Models scored: OpenAI Realtime (gpt-realtime). Metrics: preference win-rate, per-language score.
- Hugging Face · updated 2026-03 · verified 2026-04 · adjacent · arena · tts-only · not-full-duplex
TTS Arena
The original blind A/B voting space for TTS, feeding an Elo ladder on Hugging Face.
tts
Users blind-compare outputs from two random TTS systems reading the same prompt; votes feed an Elo score. Predates Artificial Analysis' arena and remains the open-research reference point for human-preference TTS scoring.
- FreedomIntelligence / CUHK-SZ · updated 2025-09 · verified 2026-04 · native · arena
MTalk-Bench
First multi-turn S2S benchmark with both arena-style (pairwise) and rubric-based (absolute) protocols for dialogue evaluation.
speech-lm · instruction-following · emotion
Covers three dimensions — Semantic, Paralinguistic, and Ambient Sound — over nine scenarios each, judged by humans and by LLM-as-a-judge. Unlike VoiceBench (single-turn) and S2S-Arena (no rubric), MTalk-Bench evaluates holistic conversational behaviour in audio-grounded multi-turn contexts. Code released under Apache-2.0; dataset under research license. Updated through 2025-09 with an expanded model pool (GPT-4o Voice, Kimi-Audio, Step-Audio, Qwen2.5-Omni, Moshi, and more).
Vertical & task
Task-completion benchmarks in customer service, outbound calling, and healthcare. Where the rubber meets the phone line — and where text-first agents collapse.
- Meituan / Xbench / Agora · updated 2025-10 · verified 2026-04 · native · vertical
VoiceAgentEval
A dual-dimensional benchmark for expert-level outbound-calling agents — interaction fluency vs task flow compliance.
vertical-agent · speech-lm
Covers six business domains and thirty sub-scenarios (recruitment, sales, CS, financial risk control, market research, proactive care). A large-model User Simulator combines five personalities × thirty scenarios to yield 150 evaluation dialogues. Separates General Interaction Capability (GIC) from Task Flow Compliance (TFC), exposing 'thoughtful listener' vs 'strict executor' trade-offs.
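The 150-dialogue figure is just the cross product of simulator personalities and sub-scenarios; the names below are invented placeholders, not the benchmark's released taxonomy:

```python
from itertools import product

personalities = ["patient", "impatient", "chatty", "terse", "skeptical"]  # 5 simulated users
scenarios = [f"sub_scenario_{i:02d}" for i in range(30)]                  # 30 across 6 domains

dialogue_specs = list(product(personalities, scenarios))
print(len(dialogue_specs))  # 150 evaluation dialogues
```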
- Academic (VoiceAgentBench authors) · updated 2025-10 · verified 2026-04 · native · vertical
VoiceAgentBench
Spoken agentic benchmark — multi-tool workflows, multi-turn interactions, and safety under realistic voice-agent conditions; covers English and Hindi.
vertical-agent · speech-lm · instruction-following
Complements τ³-Bench's voice track: VoiceAgentBench focuses on tool-rich multi-step agent tasks in the voice-first setting, with explicit safety and multilingual (including Hindi) subsets. Fills a gap that VoiceBench / VocalBench leave at the 'does the assistant actually execute the task' end.
- OpenAI · updated 2025-05 · verified 2026-04 · adjacent · vertical · text-only
HealthBench
5,000 multi-turn health conversations × 48k physician-written rubric criteria — text-only, listed as a transferable foundation rubric rather than a voice benchmark.
vertical-agent · instruction-following
The most rigorous vertical dialogue eval to date, but *text-only*: conversations and rubrics are strings, so it scores an LLM's clinical reasoning rather than any speech or full-duplex behaviour. 262 physicians practicing in 60 countries wrote bespoke rubric criteria, weighted by clinical importance; responses are graded by GPT-4.1. Included here as a **transferable text-only benchmark** that any serious medical voice agent will ultimately need to pass on top of its speech stack — not as a native STS benchmark.
Components
Component benchmarks that aren't speech-LM native but score the parts every STS system is built from — translation, synthesis, non-verbal delivery.
- SJTU / MBZUAI · NeurIPS 2025 · updated 2026-02 · verified 2026-04 · component · lab
MMAR
1,000 human-curated deep-reasoning audio QA items spanning speech, sound, music, and their mix — four hierarchical reasoning layers.
audio-understanding · speech-lm
Covers Signal / Perception / Semantic / Cultural reasoning layers, each annotated with chain-of-thought rationales. Harder than AudioBench and MMAU: no evaluated model approaches human performance as of publication. MMAR-Rubrics (2026-02) adds instance-level rubric scoring, and the Interspeech 2026 Audio Reasoning Challenge uses MMAR as its primary evaluation set. Dataset is CC-BY-NC-4.0; code is on GitHub under Apache-2.0.
- NVSpeech authors · updated 2025-08 · verified 2026-04 · component · lab
NV-Bench
Standardised test for generating laughs, sighs, fillers, and back-channels in TTS.
tts · emotion
1,651 multilingual (English + Mandarin) utterances across 14 non-verbal vocalization categories — vegetative sounds, affect bursts, conversational grunts — each paired with human reference audio. Scores instruction alignment via PCER (paralinguistic character error rate) and acoustic fidelity via FAD, WavLM speaker similarity, and DNSMOS.
- ByteDance · updated 2024-11 · verified 2026-04 · component · lab
Seed-TTS Eval
The de-facto evaluation set for zero-shot voice cloning — WER, speaker similarity, and CMOS in one recipe.
tts
Defined in the Seed-TTS tech report and reused by most zero-shot TTS papers since. Combines automatic ASR-based word error, SECS speaker similarity against a reference clip, and CMOS listening tests to triangulate intelligibility, identity, and naturalness.
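SECS is, at its core, a cosine similarity between speaker embeddings of the synthesized clip and the reference clip. A minimal sketch of only the similarity computation; the embeddings themselves would come from whatever speaker-verification encoder a given paper specifies:

```python
import numpy as np

def speaker_similarity(emb_synth: np.ndarray, emb_ref: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_synth, emb_ref) /
                 (np.linalg.norm(emb_synth) * np.linalg.norm(emb_ref)))

rng = np.random.default_rng(0)
emb_ref = rng.normal(size=256)                    # stand-in reference-clip embedding
emb_synth = emb_ref + 0.1 * rng.normal(size=256)  # a clone close to the reference
print(round(speaker_similarity(emb_synth, emb_ref), 3))  # close to 1.0
```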
- NTU et al. · updated 2024-11 · verified 2026-04 · component · lab
Dynamic-SUPERB
Instruction-tuned successor to SUPERB — ~55 tasks across speech understanding and light paralinguistic input.
speech-lm · instruction-following
The baseline every new speech-LM paper still reports. Covers instruction-following over speech across classification, generation, and open-ended tasks; Phase-2 (2024-11) doubled the task pool.
- Academic (VoiceAssistant-Eval authors) · updated 2024-09 · verified 2026-04 · component · lab
VoiceAssistant-Eval
End-to-end voice assistant eval — instruction-following, multi-turn dialogue, and naturalness in one suite with GPT-4-class judges.
speech-lm · instruction-following
Bundles open-ended task prompts with GPT-4-judge scoring on content, style, and multi-turn coherence. Useful as a second opinion alongside VoiceBench on the assistant-quality axis.
- AudioLLMs (I2R Singapore) · updated 2024-07 · verified 2026-04 · component · lab
AudioBench
Broad audio-understanding benchmark covering speech, music, and ambient sound reasoning.
audio-understanding
Good signal on audio-grounded reasoning across speech / sound / music, but does not score conversational dynamics. Complements AIR-Bench as a second audio-LLM reasoning reference.
- Academic (SD-Eval authors) · updated 2024-06 · verified 2026-04 · component · lab
SD-Eval
Baseline probe on paralinguistic input — does a speech-LM use tone, emotion, and speaker traits at all?
speech-lm · emotion
Four subsets (emotion, accent, age, environment) check whether the model's response is conditioned on paralinguistic features rather than transcript-only. A precondition every serious paralinguistic-output claim should first beat.
- Meta AI · updated 2023-12 · verified 2026-04 · component · lab
SeamlessExpressive (mExpresso)
Expressive speech-to-speech translation judged on translation, prosody, and vocal-style preservation.
s2st
Bundles the mExpresso benchmark (EN → FR / DE / ES / IT / ZH, seven styles including happy, sad, whisper, and laughing) with metrics that go beyond ASR-BLEU. AutoPCP scores phrase-level prosody correspondence and Vocal Style Similarity measures voice carry-over across languages.
- Academic (ProsAudit authors) · updated 2023-02 · verified 2026-04 · component · lab
ProsAudit
Prosodic boundary detection benchmark — can the model perceive phrase / sentence boundaries from audio alone?
emotion · audio-understanding
Old (2023) but still cited as a prerequisite: if the model cannot hear prosodic boundaries, downstream paralinguistic-output claims do not hold up.
- Meta AI · updated 2020-07 · verified 2026-04 · component · lab
CoVoST-2 BLEU
Speech-to-text translation from 21 languages into English and from English into 15 — long the S2T standard.
s2st · asr
Built on Common Voice, CoVoST-2 pairs read speech with translated text covering 21 source languages into English and English into 15 target languages. It is the reference workload for cross-model S2T benchmarking used by SeamlessM4T, Whisper, USM, and friends.
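The scoring side is standard corpus BLEU over the produced translations, typically via sacrebleu; the toy hypotheses and references below are illustrative only:

```python
import sacrebleu  # the standard BLEU scorer

# One hypothesis per source utterance, one reference stream of the same length.
hypotheses = ["the cat sits on the mat", "he went to the market yesterday"]
references = [["the cat is sitting on the mat", "he went to the market yesterday"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```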
Adjacent & transferable
Benchmarks that aren't STS in the strict sense — TTS-only arenas, text-only clinical rubrics, duplex-ASR for verticals — but that any serious voice system will ultimately need to pass on top of its speech stack. Kept separate from Components because they don't live inside the STS pipeline itself.
Nothing listed here yet: the adjacent entries currently in the directory (MMedFD, the TTS arenas, HealthBench) are filed under their capability families above.
Legacy & context
Saturated or representation-era benchmarks. Rarely used to rank modern speech-LMs, but still the scaffolding every new speech encoder reports against.
- NTU et al. (s3prl) · updated 2024-06 · verified 2026-04 · legacy · lab
SUPERB
The classic: general-purpose speech representations scored on ten frozen-encoder downstream tasks.
speech-lm · asr · audio-understanding
Ten-task evaluation suite (ASR, PR, KS, SID, IC, SF, ER, QbE, SD, VC) for frozen self-supervised speech encoders, run through the s3prl toolkit introduced at Interspeech 2021. The Dynamic-SUPERB fork extends the protocol to instruction-following for speech-LMs. Listed here for context — modern speech-LMs are no longer ranked on SUPERB.
- OpenSLR · updated 2024-01 · verified 2026-04 · legacy · lab
LibriSpeech WER
The coordinate system of English ASR — test-clean and test-other WER.
asr
Evaluation splits of the 960-hour LibriSpeech read-speech corpus distributed through OpenSLR. Not conversational and long-since saturated, but still the most cited fitness test for English ASR and the reference number every new speech encoder reports.
WER (test-clean) · WER (test-other)
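WER itself is word-level edit distance divided by reference length. A self-contained sketch; real pipelines normalise casing and punctuation before scoring:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over tokens / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 error / 6 words ≈ 0.167
```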
- Academic · updated 2023-06 · verified 2026-04 · legacy · lab
MELD / IEMOCAP
The classic emotion-recognition pair, evaluated with full dialogue context.
emotion · audio-understanding
MELD is built from Friends with 7-way utterance-level emotion labels; IEMOCAP uses scripted-plus-improvised dyadic sessions with 6 emotion classes. Both remain standard baselines for emotional ASR and conversational emotion recognition, though modern speech-LMs are rarely ranked on them.
Weighted F1 · Accuracy
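Weighted F1 here means per-class F1 averaged by class frequency, so majority emotions such as neutral dominate the score. With scikit-learn it is a one-liner; the labels below are toy data:

```python
from sklearn.metrics import f1_score

y_true = ["neutral", "neutral", "joy", "anger", "sadness", "neutral"]
y_pred = ["neutral", "joy", "joy", "anger", "sadness", "neutral"]

# Per-class F1, averaged with class-frequency weights.
print(f1_score(y_true, y_pred, average="weighted"))
```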
Three gaps we're watching.
Based on everything catalogued above, the measurement landscape has a few structural holes. They are where new benchmarks — and new companies — are likely to emerge in 2026.
- gap · 01
Full-duplex × vertical × real audio.
τ³-Bench proves voice reasoning collapses by ~50 pts when real-time is introduced, but its acoustic layer is simulated. No public benchmark ties real recordings, full-duplex timing, and a vertical task (CS, sales, health) together.
- gap · 02
Non-English, non-Chinese full-duplex.
URO-Bench ships EN + ZH; no turn-taking leaderboard exists for most other high-speaker-count languages. Anyone who builds one at scale first writes the standard for that language.
- gap · 03
Latency as a first-class axis.
First-response-ms numbers are vendor-reported and unreplicated across Moshi / GPT-4o Voice / Gemini Live / Sesame. Full-Duplex-Bench v2 makes this measurable; we expect a dedicated latency leaderboard to land within the year.
Fullduplex's editorial bias: we'll flag any new benchmark that closes one of these three gaps on the blog and add it here within the week.
New benchmark or revised score? Submit an entry.