What happened this week
Four papers, all in the full-duplex conversation area. Two of them — Full-Duplex-Bench-v3 and EchoChain — push the evaluation surface forward into tool use and state updates. One paper proposes an RL training recipe that avoids the collapse-into-repetition failure mode, and one paper is a data-engineering contribution for recovering two-track dialogue audio from monaural mixtures.
Evaluation — two new benchmarks
Full-Duplex-Bench-v3 is the third iteration of the Full-Duplex-Bench line, this time evaluating tool use under realistic disfluency. The twist is that every test utterance is real human audio with annotated disfluency categories, and tasks require chained API calls across four domains. The reported pass rates for GPT-Realtime, Gemini Live 2.5 / 3.1, Grok, Ultravox v0.7, and a Whisper cascade give a first honest read on where voice agents are when you let humans actually talk like humans.
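For concreteness, here is a rough sketch of how a chained tool-use item could be represented and scored. The field names, example tags, and the strict order-sensitive pass criterion are assumptions for illustration, not the benchmark's actual schema.

    # Hedged sketch, not Full-Duplex-Bench-v3's real task format.
    from dataclasses import dataclass

    @dataclass
    class ToolCall:
        name: str     # e.g. "search_flights" (hypothetical API)
        args: dict    # normalized argument dict

    @dataclass
    class TestItem:
        audio_path: str        # real human audio; disfluencies annotated separately
        disfluency_tags: list  # e.g. ["filled_pause", "self_repair"] (assumed labels)
        expected_chain: list   # ordered ToolCalls the agent must reproduce

    def passes(item, predicted_chain):
        """Strict pass: same number of calls, same order, same names and arguments."""
        if len(predicted_chain) != len(item.expected_chain):
            return False
        return all(p.name == e.name and p.args == e.args
                   for p, e in zip(predicted_chain, item.expected_chain))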
EchoChain is complementary — it targets state-update reasoning specifically, by injecting mid-response interruptions at standardised points and looking for three failure modes: contextual inertia, interruption amnesia, and objective displacement. No evaluated system exceeds a 50 percent pass rate, and a half-duplex control drops total failures by 40 percent relative, which pins most of the failure on state revision rather than task difficulty.
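A minimal sketch of the injection mechanic, assuming interruptions are spliced in at fixed fractions of the in-progress response; the split logic and point values are illustrative, not EchoChain's released harness.

    # Hedged sketch of standardized interruption injection, not EchoChain's code.
    def inject_interruption(response_tokens, update_utterance, point=0.5):
        """Cut the in-progress response at a fixed fraction and insert the
        user's state-updating interruption at that point."""
        cut = int(len(response_tokens) * point)
        prefix = response_tokens[:cut]      # what the system has already said
        remainder = response_tokens[cut:]   # the continuation it had planned
        return prefix, update_utterance, remainder

    # Failure modes named by the paper, as reported in this summary:
    #   contextual inertia     - keeps executing the stale plan
    #   interruption amnesia   - acknowledges the update, then ignores it
    #   objective displacement - drifts to a goal the user never set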
Training — ASPIRin
ASPIRin attacks a specific RL failure mode in full-duplex speech LMs: applying the reward directly to the raw token stream drives generative collapse into repetition. By projecting the text vocabulary down to a binary active-speech / silence signal and running GRPO with rule-based rewards, the method decouples when to speak from what to say. Duplicate n-grams drop by over 50 percent versus vanilla GRPO while interactivity metrics improve. Useful as a recipe when moving an FD-SLM from SFT to RL.
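A minimal sketch of the decoupling, assuming a single known silence/pad token id and two frame-aligned activity tracks of equal length; the timing rules shown (overlap penalty, response-latency bonus) stand in for the paper's actual rule set.

    # Hedged sketch of the timing-only reward idea, not ASPIRin's code.
    import torch

    def to_activity(token_ids: torch.Tensor, silence_id: int) -> torch.Tensor:
        """Project generated tokens to a binary active-speech (1) / silence (0) track.
        Assumption: one silence/pad id; the paper may define the mapping differently."""
        return (token_ids != silence_id).long()

    def rule_reward(agent_act: torch.Tensor, user_act: torch.Tensor) -> float:
        """Score only timing behaviour; what is said stays under the SFT policy."""
        overlap = (agent_act * user_act).float().mean()       # talking over the user
        user_end = int(user_act.nonzero().max()) if user_act.any() else 0
        window = agent_act[user_end + 1 : user_end + 1 + 25]  # assumed latency window
        responsive = window.float().max() if window.numel() else torch.tensor(0.)
        return float(responsive - overlap)

    # GRPO normalizes these rewards within each group of rollouts to get advantages;
    # because the reward never touches token identity, it has no gradient toward
    # degenerate repeated n-grams in the text stream.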
Data engineering — DialogueSidon
DialogueSidon addresses a long-standing bottleneck: most in-the-wild two-speaker dialogue is recorded as a single monaural track, which makes it useless for FD research that needs speaker-separate streams. The paper combines a self-supervised speech-feature VAE with a diffusion-based latent predictor to recover per-speaker latents from degraded mixtures. Worth watching if you have a pile of YouTube-scale dialogue audio and no way to use it.
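The sketch below shows the shape of that two-stage pipeline; every module name (ssl_encoder, latent_denoiser, vae_decoder), the latent dimensionality, and the step count are placeholders, not DialogueSidon's API.

    # Hedged sketch of the recovery pipeline described above, with placeholder modules.
    import torch

    def separate_dialogue(mix_wave, ssl_encoder, latent_denoiser, vae_decoder, steps=50):
        """Recover two speaker-separate streams from a monaural dialogue mixture."""
        cond = ssl_encoder(mix_wave)                   # assumed shape [frames, feat_dim]
        # Start from noise in the speech VAE's latent space and iteratively denoise,
        # conditioned on the mixture features (diffusion-style latent prediction).
        latents = torch.randn(2, cond.shape[0], 64)    # [speakers, frames, latent_dim], assumed
        for t in reversed(range(steps)):
            latents = latent_denoiser(latents, cond, t)  # one reverse step
        # Decode each speaker's latent track back to audio with the VAE decoder.
        return [vae_decoder(latents[s]) for s in range(2)]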
Corrections to hello@fullduplex.ai. Next issue: 2026-W17.