References & further reading
Every external source cited across the Fullduplex STS Series — papers, benchmarks, repositories, platforms, and corpora. Grouped by article, then by kind. Walk the graph.
Speech-to-speech AI, a primer
22 references — the research arc that produced Moshi: six public papers over four years, plus the benchmarks and open-weight repos that ship with the modern voice stack.
Research papers (16)
- 01. Stivers et al. (2009) — Universals in turn-taking. PNAS. Ten languages, same ~200 ms turn-gap. The foundational claim that the conversational threshold is a biological constant. (pnas.org, paper)
- 02. GSLM (Meta, 2021). Generative Spoken Language Modeling — language modeling on raw speech, with no text at all. (arxiv.org, paper)
- 03. SoundStream (Google, 2021). End-to-end neural audio codec. Introduced residual vector quantization (RVQ) as the alphabet for audio LMs — see the sketch after this list. (arxiv.org, paper)
- 04. AudioLM (Google, 2022). Hierarchy of semantic + acoustic tokens. Bridged GSLM and SoundStream into a single audio language model. (arxiv.org, paper)
- 05. dGSLM (Meta, 2022). Two-speaker dialogue extension of GSLM, trained on Fisher. The first textless model with natural turn-taking. (arxiv.org, paper)
- 06. VALL-E (Microsoft, 2023). Codec + language-model recipe for high-quality TTS. Voice cloning from a three-second sample. (arxiv.org, paper)
- 07. SpeechGPT (Fudan, 2023). Speech tokens plugged into an LLM vocabulary. Early end-to-end spoken-instruction-in, spoken-answer-out. (arxiv.org, paper)
- 08. Translatotron (Google, 2019). Direct speech-to-speech translation without text — a parallel thread proving text is not a mandatory intermediate. (arxiv.org, paper)
- 09. Translatotron 2 (Google, 2021). Follow-up to Translatotron with improved quality and robustness. (arxiv.org, paper)
- 10. Moshi paper (Kyutai, 2024). The first real-time, full-duplex, speech-text foundation model, released under Apache with open weights. (arxiv.org, paper)
- 11. X-Talk survey. Survey on modular voice systems with paralinguistic side-channels — the steel-man for the cascade approach. (arxiv.org, paper)
- 12. Full-Duplex-Bench. The first benchmark for turn-taking and interruption handling in STS models. (arxiv.org, paper)
- 13. URO-Bench. Paralinguistic understanding and response evaluation for speech-to-speech systems. (arxiv.org, paper)
- 14. J-CHAT (2024). ~69,000-hour Japanese dialogue corpus from the public web. (arxiv.org, paper)
- 15. InteractSpeech (EMNLP Findings, 2025). Full-duplex dataset work targeting interactive speech. (aclanthology.org, paper)
- 16. DialogueSidon (2026). Recent dialogue dataset / model release, cited as a 2026 data point in the primer. (arxiv.org, paper)
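A minimal residual vector quantization sketch — illustrative codebooks and brute-force nearest-neighbor search, not the trained SoundStream codec — showing how one continuous codec frame becomes the stack of discrete ids that audio LMs model:

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Each stage quantizes the residual the previous stage left behind,
    so one frame yields one token id per codebook."""
    residual = frame.copy()
    ids = []
    for cb in codebooks:                        # cb shape: (codebook_size, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        ids.append(idx)
        residual -= cb[idx]                     # pass what's left to the next stage
    return ids

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 128)) for _ in range(8)]  # toy: 8 stages
frame = rng.normal(size=128)                    # one continuous codec frame
print(rvq_encode(frame, codebooks))             # 8 ids, one per quantizer stage
```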
Repositories & open weights (02)
- 01. Moshi (Kyutai). Open-weights reference implementation of the Moshi full-duplex model. (github.com, repo)
- 02. Sesame CSM. Open-weights conversational speech model from Sesame AI Labs. (github.com, repo)
Platforms & documentation (03)
- 01. OpenAI — Voice Agents guide. Official framing of the two valid tracks: chained pipelines vs. speech-to-speech. (platform.openai.com, platform)
- 02. OpenAI — gpt-realtime release. Realtime API announcement. Cites loss of emotion, emphasis, and accents in stitched pipelines. (openai.com, platform)
- 03. Gemini Live on Vertex AI. Google Cloud documentation for the Gemini Live API. (cloud.google.com, platform)
Corpora (01)
- 01. Fisher English (LDC2004S13). 1,960-hour two-channel conversational English corpus collected by LDC in 2004, with each speaker on a separate channel. Still the workhorse for dialogue training — see the channel-split sketch below. (catalog.ldc.upenn.edu, corpus)
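Why two-channel capture matters, as a hedged sketch (file name hypothetical, energy threshold arbitrary): with one speaker per channel, each side's activity is directly readable, so gaps and overlaps come straight off the waveform with no separation or diarization model in between.

```python
import soundfile as sf

# Hypothetical Fisher-style file: two channels, one speaker per channel.
audio, sr = sf.read("fisher_conversation.wav")   # shape: (n_samples, 2)

def activity(x, sr, win_ms=20, thresh=1e-4):
    """Crude energy-based voice activity, one boolean per 20 ms frame."""
    win = int(sr * win_ms / 1000)
    frames = x[: len(x) // win * win].reshape(-1, win)
    return (frames ** 2).mean(axis=1) > thresh

a = activity(audio[:, 0], sr)                    # speaker A's channel
b = activity(audio[:, 1], sr)                    # speaker B's channel
print(f"overlap: {(a & b).mean():.1%} of frames")  # both talking at once
```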
The full-duplex threshold
27 references — where the ~200 ms number comes from, the small cluster of systems that have actually crossed it, and the first benchmarks that can tell you so.
Research papers (12)
- 01. Stivers et al. (2009). PNAS — measured the ~200 ms turn-gap across ten languages. Reused here as the biological anchor. (pnas.org, paper)
- 02. Levinson & Torreira (2015). Frontiers in Psychology — predictive processing of upcoming turn ends. (doi.org, paper)
- 03. Magyari et al. (2015). Scientific Reports — brain activity during turn-end prediction in conversation. (nature.com, paper)
- 04. Heldner & Edlund (2010). Journal of Phonetics — distribution of silences and overlaps in conversation; see the gap/overlap sketch after this list. (doi.org, paper)
- 05. De Ruiter et al. (2006). Language — projecting the end of a speaker's turn. (doi.org, paper)
- 06. Full-Duplex-Bench. The benchmark that made the threshold measurable on modern STS systems. (arxiv.org, paper)
- 07. Full-Duplex-Bench v3. Latest iteration of the benchmark, with expanded coverage. (arxiv.org, paper)
- 08. SyncLLM. Synchronous speech-text LLM approach to full-duplex. (arxiv.org, paper)
- 09. OmniFlatten (Alibaba Tongyi, 2024). The paper that named the flattened-token architecture family. (arxiv.org, paper)
- 10. Freeze-Omni (Tencent AI Lab et al.). Adapter-based approach that freezes the backbone LLM while unlocking duplex speech. (arxiv.org, paper)
- 11. Mini-Omni2. Lightweight end-to-end speech-in, speech-out model. (arxiv.org, paper)
- 12. τ-Voice. 2026 paper cited in the threshold discussion. (arxiv.org, paper)
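A hedged sketch of the kind of measurement Heldner & Edlund report (toy turn segments, not their data): the floor-transfer offset is the next speaker's start minus the current speaker's end, so positive values are gaps, negative values overlaps, and the ~200 ms claim is about the center of that distribution.

```python
# Turn segments as (start_s, end_s) per speaker — toy values.
turns_a = [(0.00, 2.10), (4.30, 6.00)]
turns_b = [(2.30, 4.10), (6.05, 8.00)]

def floor_transfer_offsets(turns_a, turns_b):
    """Offsets at speaker changes: positive = gap, negative = overlap."""
    events = sorted([(s, e, "A") for s, e in turns_a] +
                    [(s, e, "B") for s, e in turns_b])
    return [s1 - e0
            for (s0, e0, w0), (s1, e1, w1) in zip(events, events[1:])
            if w0 != w1]                     # skip same-speaker pauses

offsets_ms = [o * 1000 for o in floor_transfer_offsets(turns_a, turns_b)]
print(offsets_ms)                            # ≈ [200, 200, 50]
```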
Repositories & open weights (03)
- 01. Moshi (Kyutai). Reference full-duplex model that first crossed the threshold in a reproducible open release. (github.com, repo)
- 02. Hibiki (Kyutai). Kyutai's follow-up work on simultaneous translation. (github.com, repo)
- 03. Full-Duplex-Bench repo. Evaluation code and harness for Full-Duplex-Bench. (github.com, repo)
Platforms & models (06)
- 01. OpenAI — Realtime API. Documentation for gpt-realtime, OpenAI's production STS endpoint. (platform.openai.com, platform)
- 02. OpenAI — next-generation audio models. Announcement quoting sub-500 ms latency targets for the stack. (openai.com, platform)
- 03. Hello GPT-4o. Launch post. The first consumer-grade STS demo at the threshold. (openai.com, platform)
- 04. Google DeepMind — Gemini. Overview page covering Gemini Live and multimodal capabilities. (deepmind.google, platform)
- 05. Moshi demo. Public hosted demo of the Moshi model. (moshi.chat, platform)
- 06. Kyutai — Unmute. Lab project page for Kyutai's real-time voice work. (kyutai.org, platform)
Background & reference (05)
- 01. Duplex (Wikipedia). Telecom background on half- vs. full-duplex — see the sketch after this list. (en.wikipedia.org, reference)
- 02. Full duplex (Wikipedia anchor). Section specifically defining simultaneous bi-directional transmission. (en.wikipedia.org, reference)
- 03. WHO — vision impairment fact sheet. 2.2 billion people with some form of vision impairment — accessibility sizing. (who.int, reference)
- 04. BetterUp — CANDOR research. The CANDOR corpus: 1,600 English conversations with rich metadata. (betterup.com, corpus)
- 05. CC BY-NC 4.0 license. License referenced for several corpora and open releases. (creativecommons.org, reference)
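A toy transposition of those telecom definitions to a voice agent (queue-based stand-ins for real audio I/O, not any production API): half-duplex alternates strict turns, while full-duplex keeps listening even while it speaks.

```python
import asyncio

async def half_duplex(mic: asyncio.Queue, spk: asyncio.Queue):
    """One direction at a time: the agent is deaf while composing a reply."""
    while True:
        heard = await mic.get()                 # wait out the user's whole turn
        await spk.put(f"reply to {heard!r}")

async def full_duplex(mic: asyncio.Queue, spk: asyncio.Queue):
    """Both directions open at once: input keeps flowing during output."""
    async def listen():
        while True:
            print("heard:", await mic.get())
    async def talk():
        while True:
            await spk.put("mm-hm")              # backchannels can overlap input
            await asyncio.sleep(0.2)
    await asyncio.gather(listen(), talk())
```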
From oto (01)
- 01. oto newsletter. Weekly dispatch tracking STS, full-duplex, and audio foundation models. (oto.earth, oto)
From pipeline to integrated
35 references — the four architectural families of integrated STS as of April 2026: every model, codec, and backbone cited in the field guide, grouped by kind.
Research papers (09)
- 01. Moshi — measured latency. Source for the 200 ms measured end-to-end on an NVIDIA L4 figure. (arxiv.org, paper)
- 02. Mimi codec (Moshi technical report). Streaming neural audio codec at 12.5 Hz — the enabling piece for a joint full-duplex model. (kyutai.org, paper)
- 03. NVIDIA PersonaPlex-7B-v1. NVIDIA ADLR, Jan 2026 — initializes from a Moshi-family checkpoint. (arxiv.org, paper)
- 04. OmniFlatten (Alibaba Tongyi, 2024). The paper that named the flattened-token architectural family — see the interleaving sketch after this list. (arxiv.org, paper)
- 05. LLaMA-Omni 2. Meta-LLaMA-based reimplementation of the flatten idea. (arxiv.org, paper)
- 06. Moonshot — Kimi-Audio. Claims 13M hours of speech training; released under MIT. (arxiv.org, paper)
- 07. Tencent — Covo-Audio / Covo-Audio-Chat-FD. Tencent's entry in the flatten / adapter family. (huggingface.co, paper)
- 08. Freeze-Omni. Adapter approach — frozen backbone plus streaming speech adapters. (arxiv.org, paper)
- 09. SALMONN-omni (ByteDance). Representative entry in the no-codec family: continuous speech features fed straight into the LLM. (arxiv.org, paper)
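A hedged sketch of the flattened-token idea (chunk size and stream order illustrative, not OmniFlatten's exact schedule): chunks from the user's audio stream, the agent's audio stream, and the agent's text stream are interleaved into one sequence, so a single decoder-only LLM can model the duplex conversation autoregressively.

```python
def flatten_streams(user_audio, agent_audio, agent_text, chunk=4):
    """Interleave fixed-size chunks of each token stream into one sequence."""
    flat, longest = [], max(map(len, (user_audio, agent_audio, agent_text)))
    for i in range(0, longest, chunk):
        for stream in (user_audio, agent_audio, agent_text):
            flat.extend(stream[i:i + chunk])    # one time slice from each stream
    return flat

# Toy ids; real stacks use codec tokens for audio and BPE ids for text.
user  = [f"u{i}" for i in range(8)]
agent = [f"a{i}" for i in range(8)]
text  = [f"t{i}" for i in range(4)]
print(flatten_streams(user, agent, text))
# ['u0'..'u3', 'a0'..'a3', 't0'..'t3', 'u4'..'u7', 'a4'..'a7']
```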
Repositories & open weights (13)
- 01. Moshi (Kyutai). Reference implementation for both the model and the Mimi codec (MIT). (github.com, repo)
- 02. Chatterbox TTS (Resemble AI). Cited alongside CSM as an open-weights TTS / voice release. (github.com, repo)
- 03. Sesame CSM-1B. Open-weights conversational speech model. (github.com, repo)
- 04. CosyVoice (FunAudioLLM). Streaming TTS used inside several flattened-token stacks. (github.com, repo)
- 05. Qwen2.5-Omni. Alibaba's omni-modal Qwen release. (github.com, repo)
- 06. Step-Audio 2 (StepFun). Open-weights member of the flatten family. (github.com, repo)
- 07. GLM-4-Voice (THUDM). Tsinghua's GLM-family speech model. (github.com, repo)
- 08. Qwen2-7B-Instruct. Frozen backbone used by Freeze-Omni. (huggingface.co, repo)
- 09. MiniCPM-o 4.5 (OpenBMB). Compact open-weights omni model. (github.com, repo)
- 10. SigLIP2 (Google). Vision encoder cited as a component of modern omni stacks. (huggingface.co, repo)
- 11. Whisper (OpenAI). The ASR backbone that many pipeline and adapter systems still call — see the cascade sketch after this list. (github.com, repo)
- 12. Qwen3-8B. Backbone LLM used inside several 2025–2026 speech stacks. (huggingface.co, repo)
- 13. SALMONN (ByteDance). Repository for the SALMONN line, including the omni variant. (github.com, repo)
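A minimal cascade sketch — the ASR → LLM → TTS pipeline shape that the integrated families replace. The Whisper calls follow the openai-whisper package's documented API; `chat` and `synthesize` are hypothetical stand-ins for whichever LLM and TTS a given stack wires in.

```python
import whisper  # pip install openai-whisper

asr = whisper.load_model("base")

def cascade_turn(wav_path, chat, synthesize):
    """One half-duplex turn. Each stage waits on the last — the source of
    the pipeline's added latency and its lost prosody (text carries neither
    timing nor tone forward)."""
    text_in = asr.transcribe(wav_path)["text"]   # speech -> text
    text_out = chat(text_in)                     # text -> text, any LLM
    return synthesize(text_out)                  # text -> speech, any TTS
```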
Platforms & models (10)
- 01. Kyutai. French non-profit AI lab. Home of Moshi, Hibiki, Unmute, and the Mimi codec. (kyutai.org, platform)
- 02. OpenAI — GPT-4o audio. Consumer-grade pipeline cited at roughly one second end-to-end on a typical day. (openai.com, platform)
- 03. Deepgram Voice Agent (Aura). Commercial agent stack quoting sub-second end-to-end latency. (deepgram.com, platform)
- 04. Cartesia Sonic. One of the fastest commercial TTS engines (~90 ms to first audio). (cartesia.ai, platform)
- 05. Hello GPT-4o. OpenAI's launch of the integrated GPT-4o voice stack. (openai.com, platform)
- 06. Google DeepMind — Gemini. Gemini Live umbrella page. (deepmind.google, platform)
- 07. Gemini Live API on Vertex. Documentation for the Gemini 3.1 Flash Live API on Google Cloud. (cloud.google.com, platform)
- 08. Amazon Nova Sonic. AWS Bedrock Nova family — the cloud provider's STS entry. (aws.amazon.com, platform)
- 09. Microsoft AI Services (MAI-Voice-1). Azure AI Services — Microsoft's production voice stack. (azure.microsoft.com, platform)
- 10. Hume EVI. Empathic Voice Interface — emotional and prosodic voice agent. (hume.ai, platform)
Corpora (01)
- 01. Fisher English (LDC2004S13). The two-channel, 1,960-hour corpus — still the default starting point for duplex dialogue training. (catalog.ldc.upenn.edu, corpus)
From oto (02)
- 01. Contact oto. Get in touch about STS datasets and partnerships. (oto.earth, oto)
- 02. oto investor data room. Materials for investors exploring the STS / full-duplex category. (oto.earth, oto)
The data ceiling
27 references — the post-training data problem: separation and diarization ceilings, license and content-shape filters, the phase-fit matrix, and the public corpus catalog that still pivots on a 2004 telephone corpus.
Research papers (15)
- 01. Cieri et al. (2004) — Fisher corpus design. LREC 2004. Original paper describing the Fisher corpus — 1,960 hours, 11,699 dyadic conversations, each speaker on a separate disk track at collection time. (ldc.upenn.edu, paper)
- 02. Moshi — Défossez et al. (2024). Kyutai's Moshi paper. ~7M hours of mono pre-training, Fisher for the full-duplex fine-tune, 200 ms measured latency. (arxiv.org, paper)
- 03. OmniFlatten (2024). A 0.5B-parameter STS trained on ~2,000 hours of 100% TTS-synthesized dialogue. Proof that the synthetic ceiling is above zero. (arxiv.org, paper)
- 04. SepFormer (2021). Transformer-based monaural source separation. ~22.3 dB SI-SDRi on WSJ0-2mix — see the SI-SDR sketch after this list. (arxiv.org, paper)
- 05. Conv-TasNet (2019). Fully convolutional time-domain audio separation network. The baseline the rest of the field references. (arxiv.org, paper)
- 06. TDANet (2023). Top-down attention network for separation; a mid-2020s high-water mark alongside SepFormer. (arxiv.org, paper)
- 07. MossFormer2 (2024). State-of-the-art separation on WSJ0-2mix (~24.1 dB SI-SDRi). Strong on synthetic mixes, collapses on LibriCSS at 30% overlap. (arxiv.org, paper)
- 08. pyannote 3.x (2023). The current research-default diarization system. ~22% DER on AMI, ~11% on VoxConverse. (arxiv.org, paper)
- 09. EEND-EDA (2020). End-to-end neural diarization with encoder-decoder attractors. The family pyannote and NeMo descend from. (arxiv.org, paper)
- 10. LibriCSS (Chen et al., 2020). Conversation-style benchmark with controlled overlap rates. Evidence that WER at 30–40% overlap stays above 18% even with a 7-channel array. (arxiv.org, paper)
- 11. Raj et al. (2021) — Integration of separation + ASR. The study that caught compounding error on record: a separation front-end helps on overlap but slightly hurts on clean audio. (arxiv.org, paper)
- 12. CHiME-8 DASR overview (Cornell et al., 2024). The benchmark organizers state the ceiling plainly: neural SSE techniques still can't reliably handle complex multi-speaker scenarios. (arxiv.org, paper)
- 13. HuBERT (2021). Self-supervised speech representation learning. The LARGE model pre-trained on ~60,000 hours of LibriLight mono audiobook audio. (arxiv.org, paper)
- 14. wav2vec 2.0 (2020). The other canonical self-supervised speech backbone; same source pool as HuBERT. (arxiv.org, paper)
- 15. Freeze-Omni (2024). A Family-3 STS with a 110,000-hour ASR mid-training corpus. The mono-audio entry point into dialogue-shaped training. (arxiv.org, paper)
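The metric behind those separation scores, as a short numpy sketch of the standard scale-invariant SDR definition (SI-SDRi is this value on the separated output minus the same value on the raw mixture):

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB: project the estimate
    onto the reference so gain differences don't inflate the score."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    target = (estimate @ reference) / (reference @ reference) * reference
    noise = estimate - target
    return 10 * np.log10((target @ target) / (noise @ noise))

# SI-SDRi for one source:
#   si_sdr(separated, source) - si_sdr(mixture, source)
```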
Repositories & models (01)
- 01. pyannote speaker-diarization-3.1 (Hugging Face). The model card behind the DER numbers cited in §2.4 — what a production diarization deployment actually uses; see the usage sketch below. (huggingface.co, repo)
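A minimal usage sketch following that model card (audio file and token are placeholders; the checkpoint is gated, so a Hugging Face token with accepted terms is required):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",                 # gated model: your HF access token
)

diarization = pipeline("conversation.wav")   # returns who-spoke-when segments
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:6.2f}s  {segment.end:6.2f}s  {speaker}")
```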
Platforms & reporting (06)
- 01. CHiME-6 challenge. Dinner-party audio challenge. Track 1 (oracle diarization) vs. Track 2 (system diarization) quantifies the cost of building the label table yourself. (chimechallenge.org, platform)
- 02. YouTube Terms of Service. The legal ceiling: an explicit prohibition on automated extraction and unauthorized ML training. (youtube.com, platform)
- 03. YouTube — third-party AI training opt-in controls. The opt-in control surface creators must actively switch on before their content is a legitimate training input. (support.google.com, platform)
- 04. Millette v. OpenAI (TechCrunch). Class-action suit over OpenAI's scraping of YouTube creator transcripts. One of the cases pressure-testing the terms through 2024–25. (techcrunch.com, platform)
- 05. RSL: RSS-for-AI-licensing protocol (TechCrunch). The RSS co-creator's 2025 protocol for declaring training-license intent in podcast feeds. Evidence that the current ecosystem lacks the field. (techcrunch.com, platform)
- 06. Abaka AI. March 2026 vendor of a 20,000-hour commercial full-duplex corpus. Direct-to-enterprise pricing, seven languages, 100% real human-to-human. (abaka.ai, platform)
Corpora (03)
- 01. Switchboard (LDC97S62). 1997. ~260 hours of two-channel telephone conversation. The smaller predecessor to Fisher, still in use. (catalog.ldc.upenn.edu, corpus)
- 02. CANDOR (BetterUp). 2023. ~850 hours of two-channel natural conversation. CC BY-NC 4.0, so unavailable for commercial training. (betterup.com, corpus)
- 03. Emilia dataset. 2024–25. ~216,000 hours of mono web-scraped speech. The headline number hides a license split between an NC core and a YODAS extension. (emilia-dataset.github.io, corpus)
From oto (02)
- 01. oto — dataset inquiry. Two-channel capture at source, per-speaker consent, commercial redistribution, phase-fit labeling. The post-training column, built to order. (oto.earth, oto)
- 02. oto investor data room. Materials for investors exploring the STS / full-duplex data market. (oto.earth, oto)
Foundation before vertical
43 references — a thesis essay on the foundation threshold: the concept, the three domains that have already crossed it, the 30×–150× gap that full-duplex STS still has to close (a back-of-envelope sketch follows the papers list), and the six plausible routes to 100k+ hours of two-channel conversational data.
Research papers (17)
- 01. Radford et al. (2018) — GPT-1. Improving language understanding with unsupervised pre-training. 117M params, 0.8B tokens; still required task-specific fine-tuning. (cdn.openai.com, paper)
- 02. Radford et al. (2019) — GPT-2. Language models are unsupervised multitask learners. 1.5B params, ~10B tokens; zero-shot was interesting but unreliable. (cdn.openai.com, paper)
- 03. Brown et al. (2020) — GPT-3. Language models are few-shot learners. 175B params, 300B tokens. The text-LLM foundation-threshold crossing. (arxiv.org, paper)
- 04. Singhal et al. (2022) — Med-PaLM. Large language models encode clinical knowledge. 67.6% on MedQA; an adapter on PaLM rather than a from-scratch medical LLM. (arxiv.org, paper)
- 05. Singhal et al. (2023) — Med-PaLM 2. Towards expert-level medical QA. 86.5% on MedQA, built on PaLM 2. The vertical-adapter pattern, matured. (arxiv.org, paper)
- 06. Rozière et al. (2023) — Code Llama. Open foundation models for code. 500B additional code tokens on Llama 2 — a specialized vertical for a fraction of the base model's 2T-token pretraining. (arxiv.org, paper)
- 07. Radford et al. (2021) — CLIP. Learning transferable visual models from natural language supervision. 400M image-text pairs; the vision zero-shot threshold. (arxiv.org, paper)
- 08. Ma et al. (2024) — MedSAM. Nature Communications. +22.51 DICE over zero-shot SAM across 86/86 internal tasks, using 1.57M medical mask annotations. (nature.com, paper)
- 09. Zhang et al. (2023) — BiomedCLIP. 15M biomedical image-text pairs on a CLIP base. Confirms the two-to-three-orders-of-magnitude-smaller adapter pattern. (arxiv.org, paper)
- 10. Radford et al. (2022) — Whisper. Robust speech recognition via large-scale weak supervision. 680,000 hours; the ASR foundation-threshold crossing. (arxiv.org, paper)
- 11. Wu et al. (2023) — BloombergGPT. A large language model for finance. 50B params, 363B finance + 345B general tokens. Matched or exceeded by GPT-4 within twelve months. (arxiv.org, paper)
- 12. Taylor et al. (2022) — Galactica. A large language model for science. Withdrawn after three days — narrow-corpus hallucinations that sounded plausible. (arxiv.org, paper)
- 13. Défossez et al. (2024) — Moshi. Kyutai. ~7B parameters, the first open full-duplex STS model. The GPT-2 analog of the STS scaling arc. (arxiv.org, paper)
- 14. Nakata et al. (2024) — J-CHAT. 69,000 hours of Japanese audio — mono, single-speaker, so unusable for a full-duplex fine-tune despite the headline volume. (arxiv.org, paper)
- 15. Korfiatis et al. (2022) — PriMock57. A primary-care mock-consultation dataset. Mocked with patient actors as a HIPAA workaround. (arxiv.org, paper)
- 16. Yim et al. (2023) — ACI-Bench. Ambient Clinical Intelligence benchmark. Mocked medical dialogues for evaluation; the authors are explicit about the regulatory constraint. (arxiv.org, paper)
- 17. Chiu et al. (2017) — Google Health medical dialogue. 14,000 hours of institutional medical conversations, never released — institutional corpora trapped by regulation. (arxiv.org, paper)
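A back-of-envelope pass at that gap, using only hour counts cited elsewhere in this index (the 100k-hour threshold is the essay's working target; the ~33× result matches the low end of its 30×–150× range, with the high end implying a stricter corpus filter or a higher target):

```python
# Two-channel conversational hours cited in this index.
public = {"Fisher": 1_960, "Switchboard": 260, "CANDOR": 850}
commercial_hours = 20_000               # Abaka AI: vendor-claimed, not audited

threshold = 100_000                     # hours: the foundation-threshold target

public_total = sum(public.values())     # 3,070 h
print(f"public only:     {threshold / public_total:.0f}x short")  # ~33x
print(f"with commercial: {threshold / (public_total + commercial_hours):.0f}x short")  # ~4x
```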
Companies & platforms (21)
- 01. Harvey — Series A announcement (TechCrunch). Five months after ChatGPT. The post-foundation-compression benchmark for a vertical LLM. (techcrunch.com, platform)
- 02. Hippocratic AI — $50M seed (Reuters). Six months after ChatGPT. Would not have been financeable eighteen months earlier. (reuters.com, platform)
- 03. Abridge. $5.3B valuation (June 2025). The post-Whisper vertical winner for medical scribing. (abridge.com, platform)
- 04. Decagon. $4.5B Series D (January 2026). Customer-service STS agents — a pipeline stack, not native full-duplex. (decagon.ai, platform)
- 05. Deepgram. $1.3B Series C (January 2026). Enterprise voice AI. (deepgram.com, platform)
- 06. Vapi. Developer voice platform. ~$130M valuation reported, on $20M of Series A capital. (vapi.ai, platform)
- 07. Retell AI. The honest counter-nuance: $50M ARR on ~$5M funding suggests some verticals compound without foundation-level investment. (retellai.com, platform)
- 08. Abaka AI. 20,000-hour bidirectional commercial release (2026). The single data point above 10k h for Route 3. Vendor-claimed, not independently audited. (abaka.ai, platform)
- 09. Nexdata. 15k-hour multilingual conversational corpus — mono 8 kHz, so it fails the two-channel bar. (nexdata.ai, platform)
- 10. Appen. Managed crowdsourced data collection. Project-based, not standing corpora. (appen.com, platform)
- 11. TELUS Digital. Digital customer-experience vendor offering managed audio collection at enterprise scale. (telusdigital.com, platform)
- 12. Linguistic Data Consortium (LDC). The academic consortium behind Switchboard and Fisher. Its commercial tier yields in-year redistribution rights. (ldc.upenn.edu, platform)
- 13. Reddit–Google licensing deal (TechCrunch). $60M/year (February 2024). Proof that platforms can monetize UGC corpora to AI labs. (techcrunch.com, platform)
- 14. YouTube — third-party training opt-in. December 2024 creator control surface. The opt-in plumbing for Route 6 audio licensing exists. (blog.youtube, platform)
- 15. RSL — Really Simple Licensing. Launched September 2025; 1,500+ publishers by late 2025. Watch Route 6 for a surprise inflection. (rslstandard.org, platform)
- 16. Spotify — Developer Policy (May 2025). Explicit prohibition on training models from Spotify content. One vendor's stated closure of the audio-licensing door. (developer.spotify.com, platform)
- 17. Mozilla Common Voice. 31,841 hours across 286 languages, CC0. The crowdsourced ceiling — but all single-speaker read or monologue speech. (commonvoice.mozilla.org, platform)
- 18. Replika. Since 2017. Luka Inc. hit with a €5M Italian DPA fine (April 2025) after a provisional ban — Route 1 under a regulatory ceiling. (replika.com, platform)
- 19. Character.AI. Since 2021. Consumer companion app with >10k h of in-app conversation. The corpus has never been released. (character.ai, platform)
- 20. Sesame. Beta 2025. Companion STS app; another volume-rich source structurally trapped inside its app. (sesame.com, platform)
- 21. Oyez Project. 5,000+ hours of public-domain US Supreme Court oral-argument audio. Configuration-wrong (mono-mixed), not data-poor. (oyez.org, platform)
Corpora & references (02)
- 01. Switchboard (LDC97S62). 1991. DARPA + Texas Instruments. 260 hours of two-channel telephone conversation. The Route-5 origin point. (catalog.ldc.upenn.edu, corpus)
- 02. Fisher corpus (Cieri et al., 2004). LREC 2004. 1,960 hours, 11,699 dyadic conversations. Still the default full-duplex fine-tune corpus twenty-two years later. (ldc.upenn.edu, corpus)
From Fullduplex (03)
- 01. Article 04 — The data ceiling. The data-supply side of the same coin: why separation AI and YouTube scraping don't rescue the gap. (/blog/data-ceiling, oto)
- 02. Article 01 — Speech-to-speech AI, a primer. Sets the vocabulary — what an STS system is, and what full-duplex changes. (/blog/sts-primer, oto)
- 03. Fullduplex — datasets. The curated index of conversational speech datasets underlying every article in the series. (/datasets, oto)