References & further reading
Every external source cited across the Fullduplex STS Series — papers, benchmarks, repositories, platforms, and corpora. Grouped by article, then by kind. Walk the graph.
Speech-to-speech AI, a primer
22 references — the research arc that produced Moshi: six public papers over four years, plus the benchmarks and open-weight repos that ship with the modern voice stack.
Research papers (16)
- 01. Stivers et al. (2009) — Universals in turn-taking. PNAS. Ten languages, same ~200 ms turn-gap. The foundational claim that the conversational threshold is a biological constant. (pnas.org, paper)
- 02. GSLM (Meta, 2021). Generative Spoken Language Modeling — language modeling on raw speech, with no text at all. (arxiv.org, paper)
- 03. SoundStream (Google, 2021). End-to-end neural audio codec. Introduced residual vector quantization (RVQ) as the alphabet for audio LMs — see the sketch after this list. (arxiv.org, paper)
- 04. AudioLM (Google, 2022). Hierarchy of semantic + acoustic tokens. Bridged GSLM and SoundStream into a single audio language model. (arxiv.org, paper)
- 05. dGSLM (Meta, 2022). Two-speaker dialogue extension of GSLM, trained on Fisher. The first textless model with natural turn-taking. (arxiv.org, paper)
- 06. VALL-E (Microsoft, 2023). Codec + language-model recipe for high-quality TTS. Voice cloning from a three-second sample. (arxiv.org, paper)
- 07. SpeechGPT (Fudan, 2023). Speech tokens plugged into an LLM vocabulary. Early end-to-end spoken-instruction-in, spoken-answer-out. (arxiv.org, paper)
- 08. Translatotron (Google, 2019). Direct speech-to-speech translation without text — a parallel thread proving text is not a mandatory intermediate. (arxiv.org, paper)
- 09. Translatotron 2 (Google, 2021). Follow-up to Translatotron with improved quality and robustness. (arxiv.org, paper)
- 10. Moshi paper (Kyutai, 2024). The first real-time, full-duplex, speech-text foundation model, released under Apache with open weights. (arxiv.org, paper)
- 11. X-Talk survey. Survey on modular voice systems with paralinguistic side-channels — the steel-man for the cascade approach. (arxiv.org, paper)
- 12. Full-Duplex-Bench. The first benchmark for turn-taking and interruption handling in STS models. (arxiv.org, paper)
- 13. URO-Bench. Paralinguistic understanding and response evaluation for speech-to-speech systems. (arxiv.org, paper)
- 14. J-CHAT (2024). ~69,000-hour Japanese dialogue corpus from the public web. (arxiv.org, paper)
- 15. InteractSpeech (EMNLP Findings, 2025). Full-duplex dataset work targeting interactive speech. (aclanthology.org, paper)
- 16. DialogueSidon (2026). Recent dialogue dataset / model release, cited as a 2026 data point in the primer. (arxiv.org, paper)
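A minimal residual vector quantization sketch — illustrative codebooks and brute-force nearest-neighbor search, not the trained SoundStream codec — showing how one continuous codec frame becomes the stack of discrete ids that audio LMs model:

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Each stage quantizes the residual the previous stage left behind,
    so one frame yields one token id per codebook."""
    residual = frame.copy()
    ids = []
    for cb in codebooks:                        # cb shape: (codebook_size, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        ids.append(idx)
        residual -= cb[idx]                     # pass what's left to the next stage
    return ids

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 128)) for _ in range(8)]  # toy: 8 stages
frame = rng.normal(size=128)                    # one continuous codec frame
print(rvq_encode(frame, codebooks))             # 8 ids, one per quantizer stage
```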
Repositories & open weights (02)
- 01. Moshi (Kyutai). Open-weights reference implementation of the Moshi full-duplex model. (github.com, repo)
- 02. Sesame CSM. Open-weights conversational speech model from Sesame AI Labs. (github.com, repo)
Platforms & documentation (03)
- 01. OpenAI — Voice Agents guide. Official framing of the two valid tracks: chained pipelines vs. speech-to-speech. (platform.openai.com, platform)
- 02. OpenAI — gpt-realtime release. Realtime API announcement. Cites loss of emotion, emphasis, and accents in stitched pipelines. (openai.com, platform)
- 03. Gemini Live on Vertex AI. Google Cloud documentation for the Gemini Live API. (cloud.google.com, platform)
Corpora (01)
- 01. Fisher English (LDC2004S13). 1,960-hour two-channel conversational English corpus collected by LDC in 2004, with each speaker on a separate channel. Still the workhorse for dialogue training — see the channel-split sketch below. (catalog.ldc.upenn.edu, corpus)
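Why two-channel capture matters, as a hedged sketch (file name hypothetical, energy threshold arbitrary): with one speaker per channel, each side's activity is directly readable, so gaps and overlaps come straight off the waveform with no separation or diarization model in between.

```python
import soundfile as sf

# Hypothetical Fisher-style file: two channels, one speaker per channel.
audio, sr = sf.read("fisher_conversation.wav")   # shape: (n_samples, 2)

def activity(x, sr, win_ms=20, thresh=1e-4):
    """Crude energy-based voice activity, one boolean per 20 ms frame."""
    win = int(sr * win_ms / 1000)
    frames = x[: len(x) // win * win].reshape(-1, win)
    return (frames ** 2).mean(axis=1) > thresh

a = activity(audio[:, 0], sr)                    # speaker A's channel
b = activity(audio[:, 1], sr)                    # speaker B's channel
print(f"overlap: {(a & b).mean():.1%} of frames")  # both talking at once
```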
The full-duplex threshold
27 references — where the ~200 ms number comes from, the small cluster of systems that have actually crossed it, and the first benchmarks that can tell you so.
Research papers (12)
- 01. Stivers et al. (2009). PNAS — measured the ~200 ms turn-gap across ten languages. Reused here as the biological anchor. (pnas.org, paper)
- 02. Levinson & Torreira (2015). Frontiers in Psychology — predictive processing of upcoming turn ends. (doi.org, paper)
- 03. Magyari et al. (2015). Scientific Reports — brain activity during turn-end prediction in conversation. (nature.com, paper)
- 04. Heldner & Edlund (2010). Journal of Phonetics — distribution of silences and overlaps in conversation; see the gap/overlap sketch after this list. (doi.org, paper)
- 05. De Ruiter et al. (2006). Language — projecting the end of a speaker's turn. (doi.org, paper)
- 06. Full-Duplex-Bench. The benchmark that made the threshold measurable on modern STS systems. (arxiv.org, paper)
- 07. Full-Duplex-Bench v3. Latest iteration of the benchmark, with expanded coverage. (arxiv.org, paper)
- 08. SyncLLM. Synchronous speech-text LLM approach to full-duplex. (arxiv.org, paper)
- 09. OmniFlatten (Alibaba Tongyi, 2024). The paper that named the flattened-token architecture family. (arxiv.org, paper)
- 10. Freeze-Omni (Tencent AI Lab et al.). Adapter-based approach that freezes the backbone LLM while unlocking duplex speech. (arxiv.org, paper)
- 11. Mini-Omni2. Lightweight end-to-end speech-in, speech-out model. (arxiv.org, paper)
- 12. τ-Voice. 2026 paper cited in the threshold discussion. (arxiv.org, paper)
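A hedged sketch of the kind of measurement Heldner & Edlund report (toy turn segments, not their data): the floor-transfer offset is the next speaker's start minus the current speaker's end, so positive values are gaps, negative values overlaps, and the ~200 ms claim is about the center of that distribution.

```python
# Turn segments as (start_s, end_s) per speaker — toy values.
turns_a = [(0.00, 2.10), (4.30, 6.00)]
turns_b = [(2.30, 4.10), (6.05, 8.00)]

def floor_transfer_offsets(turns_a, turns_b):
    """Offsets at speaker changes: positive = gap, negative = overlap."""
    events = sorted([(s, e, "A") for s, e in turns_a] +
                    [(s, e, "B") for s, e in turns_b])
    return [s1 - e0
            for (s0, e0, w0), (s1, e1, w1) in zip(events, events[1:])
            if w0 != w1]                     # skip same-speaker pauses

offsets_ms = [o * 1000 for o in floor_transfer_offsets(turns_a, turns_b)]
print(offsets_ms)                            # ≈ [200, 200, 50]
```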
Repositories & open weights (03)
- 01. Moshi (Kyutai). Reference full-duplex model that first crossed the threshold in a reproducible open release. (github.com, repo)
- 02. Hibiki (Kyutai). Kyutai's follow-up work on simultaneous translation. (github.com, repo)
- 03. Full-Duplex-Bench repo. Evaluation code and harness for Full-Duplex-Bench. (github.com, repo)
Platforms & models (06)
- 01. OpenAI — Realtime API. Documentation for gpt-realtime, OpenAI's production STS endpoint. (platform.openai.com, platform)
- 02. OpenAI — next-generation audio models. Announcement quoting sub-500 ms latency targets for the stack. (openai.com, platform)
- 03. Hello GPT-4o. Launch post. The first consumer-grade STS demo at the threshold. (openai.com, platform)
- 04. Google DeepMind — Gemini. Overview page covering Gemini Live and multimodal capabilities. (deepmind.google, platform)
- 05. Moshi demo. Public hosted demo of the Moshi model. (moshi.chat, platform)
- 06. Kyutai — Unmute. Lab project page for Kyutai's real-time voice work. (kyutai.org, platform)
Background & reference (05)
- 01. Duplex (Wikipedia). Telecom background on half- vs. full-duplex — see the sketch after this list. (en.wikipedia.org, reference)
- 02. Full duplex (Wikipedia anchor). Section specifically defining simultaneous bi-directional transmission. (en.wikipedia.org, reference)
- 03. WHO — vision impairment fact sheet. 2.2 billion people with some form of vision impairment — accessibility sizing. (who.int, reference)
- 04. BetterUp — CANDOR research. The CANDOR corpus: 1,600 English conversations with rich metadata. (betterup.com, corpus)
- 05. CC BY-NC 4.0 license. License referenced for several corpora and open releases. (creativecommons.org, reference)
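A toy transposition of those telecom definitions to a voice agent (queue-based stand-ins for real audio I/O, not any production API): half-duplex alternates strict turns, while full-duplex keeps listening even while it speaks.

```python
import asyncio

async def half_duplex(mic: asyncio.Queue, spk: asyncio.Queue):
    """One direction at a time: the agent is deaf while composing a reply."""
    while True:
        heard = await mic.get()                 # wait out the user's whole turn
        await spk.put(f"reply to {heard!r}")

async def full_duplex(mic: asyncio.Queue, spk: asyncio.Queue):
    """Both directions open at once: input keeps flowing during output."""
    async def listen():
        while True:
            print("heard:", await mic.get())
    async def talk():
        while True:
            await spk.put("mm-hm")              # backchannels can overlap input
            await asyncio.sleep(0.2)
    await asyncio.gather(listen(), talk())
```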
From oto (01)
- 01. oto newsletter. Weekly dispatch tracking STS, full-duplex, and audio foundation models. (oto.earth, oto)
From pipeline to integrated
35 references — the four architectural families of integrated STS as of April 2026: every model, codec, and backbone cited in the field guide, grouped by kind.
Research papers (09)
- 01. Moshi — measured latency. Source for the 200 ms measured end-to-end on an NVIDIA L4 figure. (arxiv.org, paper)
- 02. Mimi codec (Moshi technical report). Streaming neural audio codec at 12.5 Hz — the enabling piece for a joint full-duplex model. (kyutai.org, paper)
- 03. NVIDIA PersonaPlex-7B-v1. NVIDIA ADLR, Jan 2026 — initializes from a Moshi-family checkpoint. (arxiv.org, paper)
- 04. OmniFlatten (Alibaba Tongyi, 2024). The paper that named the flattened-token architectural family — see the interleaving sketch after this list. (arxiv.org, paper)
- 05. LLaMA-Omni 2. Meta-LLaMA-based reimplementation of the flatten idea. (arxiv.org, paper)
- 06. Moonshot — Kimi-Audio. Claims 13M hours of speech training; released under MIT. (arxiv.org, paper)
- 07. Tencent — Covo-Audio / Covo-Audio-Chat-FD. Tencent's entry in the flatten / adapter family. (huggingface.co, paper)
- 08. Freeze-Omni. Adapter approach — frozen backbone plus streaming speech adapters. (arxiv.org, paper)
- 09. SALMONN-omni (ByteDance). Representative entry in the no-codec family: continuous speech features fed straight into the LLM. (arxiv.org, paper)
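A hedged sketch of the flattened-token idea (chunk size and stream order illustrative, not OmniFlatten's exact schedule): chunks from the user's audio stream, the agent's audio stream, and the agent's text stream are interleaved into one sequence, so a single decoder-only LLM can model the duplex conversation autoregressively.

```python
def flatten_streams(user_audio, agent_audio, agent_text, chunk=4):
    """Interleave fixed-size chunks of each token stream into one sequence."""
    flat, longest = [], max(map(len, (user_audio, agent_audio, agent_text)))
    for i in range(0, longest, chunk):
        for stream in (user_audio, agent_audio, agent_text):
            flat.extend(stream[i:i + chunk])    # one time slice from each stream
    return flat

# Toy ids; real stacks use codec tokens for audio and BPE ids for text.
user  = [f"u{i}" for i in range(8)]
agent = [f"a{i}" for i in range(8)]
text  = [f"t{i}" for i in range(4)]
print(flatten_streams(user, agent, text))
# ['u0'..'u3', 'a0'..'a3', 't0'..'t3', 'u4'..'u7', 'a4'..'a7']
```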
Repositories & open weights (13)
- 01. Moshi (Kyutai). Reference implementation for both the model and the Mimi codec (MIT). (github.com, repo)
- 02. Chatterbox TTS (Resemble AI). Cited alongside CSM as an open-weights TTS / voice release. (github.com, repo)
- 03. Sesame CSM-1B. Open-weights conversational speech model. (github.com, repo)
- 04. CosyVoice (FunAudioLLM). Streaming TTS used inside several flattened-token stacks. (github.com, repo)
- 05. Qwen2.5-Omni. Alibaba's omni-modal Qwen release. (github.com, repo)
- 06. Step-Audio 2 (StepFun). Open-weights member of the flatten family. (github.com, repo)
- 07. GLM-4-Voice (THUDM). Tsinghua's GLM-family speech model. (github.com, repo)
- 08. Qwen2-7B-Instruct. Frozen backbone used by Freeze-Omni. (huggingface.co, repo)
- 09. MiniCPM-o 4.5 (OpenBMB). Compact open-weights omni model. (github.com, repo)
- 10. SigLIP2 (Google). Vision encoder cited as a component of modern omni stacks. (huggingface.co, repo)
- 11. Whisper (OpenAI). The ASR backbone that many pipeline and adapter systems still call — see the cascade sketch after this list. (github.com, repo)
- 12. Qwen3-8B. Backbone LLM used inside several 2025–2026 speech stacks. (huggingface.co, repo)
- 13. SALMONN (ByteDance). Repository for the SALMONN line, including the omni variant. (github.com, repo)
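A minimal cascade sketch — the ASR → LLM → TTS pipeline shape that the integrated families replace. The Whisper calls follow the openai-whisper package's documented API; `chat` and `synthesize` are hypothetical stand-ins for whichever LLM and TTS a given stack wires in.

```python
import whisper  # pip install openai-whisper

asr = whisper.load_model("base")

def cascade_turn(wav_path, chat, synthesize):
    """One half-duplex turn. Each stage waits on the last — the source of
    the pipeline's added latency and its lost prosody (text carries neither
    timing nor tone forward)."""
    text_in = asr.transcribe(wav_path)["text"]   # speech -> text
    text_out = chat(text_in)                     # text -> text, any LLM
    return synthesize(text_out)                  # text -> speech, any TTS
```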
Platforms & models (10)
- 01. Kyutai. French non-profit AI lab. Home of Moshi, Hibiki, Unmute, and the Mimi codec. (kyutai.org, platform)
- 02. OpenAI — GPT-4o audio. Consumer-grade pipeline cited at roughly one second end-to-end on a typical day. (openai.com, platform)
- 03. Deepgram Voice Agent (Aura). Commercial agent stack quoting sub-second end-to-end latency. (deepgram.com, platform)
- 04. Cartesia Sonic. One of the fastest commercial TTS engines (~90 ms to first audio). (cartesia.ai, platform)
- 05. Hello GPT-4o. OpenAI's launch of the integrated GPT-4o voice stack. (openai.com, platform)
- 06. Google DeepMind — Gemini. Gemini Live umbrella page. (deepmind.google, platform)
- 07. Gemini Live API on Vertex. Documentation for the Gemini 3.1 Flash Live API on Google Cloud. (cloud.google.com, platform)
- 08. Amazon Nova Sonic. AWS Bedrock Nova family — the cloud provider's STS entry. (aws.amazon.com, platform)
- 09. Microsoft AI Services (MAI-Voice-1). Azure AI Services — Microsoft's production voice stack. (azure.microsoft.com, platform)
- 10. Hume EVI. Empathic Voice Interface — emotional and prosodic voice agent. (hume.ai, platform)
Corpora (01)
- 01. Fisher English (LDC2004S13). The two-channel, 1,960-hour corpus — still the default starting point for duplex dialogue training. (catalog.ldc.upenn.edu, corpus)
From oto (02)
- 01. Contact oto. Get in touch about STS datasets and partnerships. (oto.earth, oto)
- 02. oto investor data room. Materials for investors exploring the STS / full-duplex category. (oto.earth, oto)
The data ceiling
27 references — the post-training data problem: separation and diarization ceilings, license and content-shape filters, the phase-fit matrix, and the public corpus catalog that still pivots on a 2004 telephone corpus.
Research papers (15)
- 01. Cieri et al. (2004) — Fisher corpus design. LREC 2004. Original paper describing the Fisher corpus — 1,960 hours, 11,699 dyadic conversations, each speaker on a separate disk track at collection time. (ldc.upenn.edu, paper)
- 02. Moshi — Défossez et al. (2024). Kyutai's Moshi paper. ~7M hours of mono pre-training, Fisher for the full-duplex fine-tune, 200 ms measured latency. (arxiv.org, paper)
- 03. OmniFlatten (2024). A 0.5B-parameter STS trained on ~2,000 hours of 100% TTS-synthesized dialogue. Proof that the synthetic ceiling is above zero. (arxiv.org, paper)
- 04. SepFormer (2021). Transformer-based monaural source separation. ~22.3 dB SI-SDRi on WSJ0-2mix — see the SI-SDR sketch after this list. (arxiv.org, paper)
- 05. Conv-TasNet (2019). Fully convolutional time-domain audio separation network. The baseline the rest of the field references. (arxiv.org, paper)
- 06. TDANet (2023). Top-down attention network for separation; a mid-2020s high-water mark alongside SepFormer. (arxiv.org, paper)
- 07. MossFormer2 (2024). State-of-the-art separation on WSJ0-2mix (~24.1 dB SI-SDRi). Strong on synthetic mixes, collapses on LibriCSS at 30% overlap. (arxiv.org, paper)
- 08. pyannote 3.x (2023). The current research-default diarization system. ~22% DER on AMI, ~11% on VoxConverse. (arxiv.org, paper)
- 09. EEND-EDA (2020). End-to-end neural diarization with encoder-decoder attractors. The family pyannote and NeMo descend from. (arxiv.org, paper)
- 10. LibriCSS (Chen et al., 2020). Conversation-style benchmark with controlled overlap rates. Evidence that WER at 30–40% overlap stays above 18% even with a 7-channel array. (arxiv.org, paper)
- 11. Raj et al. (2021) — Integration of separation + ASR. The study that caught compounding error on record: a separation front-end helps on overlap but slightly hurts on clean audio. (arxiv.org, paper)
- 12. CHiME-8 DASR overview (Cornell et al., 2024). The benchmark organizers state the ceiling plainly: neural SSE techniques still can't reliably handle complex multi-speaker scenarios. (arxiv.org, paper)
- 13. HuBERT (2021). Self-supervised speech representation learning. The LARGE model pre-trained on ~60,000 hours of LibriLight mono audiobook audio. (arxiv.org, paper)
- 14. wav2vec 2.0 (2020). The other canonical self-supervised speech backbone; same source pool as HuBERT. (arxiv.org, paper)
- 15. Freeze-Omni (2024). A Family-3 STS with a 110,000-hour ASR mid-training corpus. The mono-audio entry point into dialogue-shaped training. (arxiv.org, paper)
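The metric behind those separation scores, as a short numpy sketch of the standard scale-invariant SDR definition (SI-SDRi is this value on the separated output minus the same value on the raw mixture):

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB: project the estimate
    onto the reference so gain differences don't inflate the score."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    target = (estimate @ reference) / (reference @ reference) * reference
    noise = estimate - target
    return 10 * np.log10((target @ target) / (noise @ noise))

# SI-SDRi for one source:
#   si_sdr(separated, source) - si_sdr(mixture, source)
```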
Repositories & models (01)
- 01. pyannote speaker-diarization-3.1 (Hugging Face). The model card behind the DER numbers cited in §2.4 — what a production diarization deployment actually uses; see the usage sketch below. (huggingface.co, repo)
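A minimal usage sketch following that model card (audio file and token are placeholders; the checkpoint is gated, so a Hugging Face token with accepted terms is required):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",                 # gated model: your HF access token
)

diarization = pipeline("conversation.wav")   # returns who-spoke-when segments
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:6.2f}s  {segment.end:6.2f}s  {speaker}")
```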
Platforms & reporting (06)
- 01. CHiME-6 challenge. Dinner-party audio challenge. Track 1 (oracle diarization) vs. Track 2 (system diarization) quantifies the cost of building the label table yourself. (chimechallenge.org, platform)
- 02. YouTube Terms of Service. The legal ceiling: an explicit prohibition on automated extraction and unauthorized ML training. (youtube.com, platform)
- 03. YouTube — third-party AI training opt-in controls. The opt-in control surface creators must actively switch on before their content is a legitimate training input. (support.google.com, platform)
- 04. Millette v. OpenAI (TechCrunch). Class-action suit over OpenAI's scraping of YouTube creator transcripts. One of the cases pressure-testing the terms through 2024–25. (techcrunch.com, platform)
- 05. RSL: RSS-for-AI-licensing protocol (TechCrunch). The RSS co-creator's 2025 protocol for declaring training-license intent in podcast feeds. Evidence that the current ecosystem lacks the field. (techcrunch.com, platform)
- 06. Abaka AI. March 2026 vendor of a 20,000-hour commercial full-duplex corpus. Direct-to-enterprise pricing, seven languages, 100% real human-to-human. (abaka.ai, platform)
Corpora (03)
- 01. Switchboard (LDC97S62). 1997. ~260 hours of two-channel telephone conversation. The smaller predecessor to Fisher, still in use. (catalog.ldc.upenn.edu, corpus)
- 02. CANDOR (BetterUp). 2023. ~850 hours of two-channel natural conversation. CC BY-NC 4.0, so unavailable for commercial training. (betterup.com, corpus)
- 03. Emilia dataset. 2024–25. ~216,000 hours of mono web-scraped speech. The headline number hides a license split between an NC core and a YODAS extension. (emilia-dataset.github.io, corpus)
From oto (02)
- 01. oto — dataset inquiry. Two-channel capture at source, per-speaker consent, commercial redistribution, phase-fit labeling. The post-training column, built to order. (oto.earth, oto)
- 02. oto investor data room. Materials for investors exploring the STS / full-duplex data market. (oto.earth, oto)
Foundation before vertical
43 references — a thesis essay on the foundation threshold: the concept, the three domains that have already crossed it, the 30×–150× gap that full-duplex STS still has to close (a back-of-envelope sketch follows the papers list), and the six plausible routes to 100k+ hours of two-channel conversational data.
Research papers (17)
- 01. Radford et al. (2018) — GPT-1. Improving language understanding with unsupervised pre-training. 117M params, 0.8B tokens; still required task-specific fine-tuning. (cdn.openai.com, paper)
- 02. Radford et al. (2019) — GPT-2. Language models are unsupervised multitask learners. 1.5B params, ~10B tokens; zero-shot was interesting but unreliable. (cdn.openai.com, paper)
- 03. Brown et al. (2020) — GPT-3. Language models are few-shot learners. 175B params, 300B tokens. The text-LLM foundation-threshold crossing. (arxiv.org, paper)
- 04. Singhal et al. (2022) — Med-PaLM. Large language models encode clinical knowledge. 67.6% on MedQA; an adapter on PaLM rather than a from-scratch medical LLM. (arxiv.org, paper)
- 05. Singhal et al. (2023) — Med-PaLM 2. Towards expert-level medical QA. 86.5% on MedQA, built on PaLM 2. The vertical-adapter pattern, matured. (arxiv.org, paper)
- 06. Rozière et al. (2023) — Code Llama. Open foundation models for code. 500B additional code tokens on Llama 2 — a specialized vertical for a fraction of the base model's 2T-token pretraining. (arxiv.org, paper)
- 07. Radford et al. (2021) — CLIP. Learning transferable visual models from natural language supervision. 400M image-text pairs; the vision zero-shot threshold. (arxiv.org, paper)
- 08. Ma et al. (2024) — MedSAM. Nature Communications. +22.51 DICE over zero-shot SAM across 86/86 internal tasks, using 1.57M medical mask annotations. (nature.com, paper)
- 09. Zhang et al. (2023) — BiomedCLIP. 15M biomedical image-text pairs on a CLIP base. Confirms the two-to-three-orders-of-magnitude-smaller adapter pattern. (arxiv.org, paper)
- 10. Radford et al. (2022) — Whisper. Robust speech recognition via large-scale weak supervision. 680,000 hours; the ASR foundation-threshold crossing. (arxiv.org, paper)
- 11. Wu et al. (2023) — BloombergGPT. A large language model for finance. 50B params, 363B finance + 345B general tokens. Matched or exceeded by GPT-4 within twelve months. (arxiv.org, paper)
- 12. Taylor et al. (2022) — Galactica. A large language model for science. Withdrawn after three days — narrow-corpus hallucinations that sounded plausible. (arxiv.org, paper)
- 13. Défossez et al. (2024) — Moshi. Kyutai. ~7B parameters, the first open full-duplex STS model. The GPT-2 analog of the STS scaling arc. (arxiv.org, paper)
- 14. Nakata et al. (2024) — J-CHAT. 69,000 hours of Japanese audio — mono, single-speaker, so unusable for a full-duplex fine-tune despite the headline volume. (arxiv.org, paper)
- 15. Korfiatis et al. (2022) — PriMock57. A primary-care mock-consultation dataset. Mocked with patient actors as a HIPAA workaround. (arxiv.org, paper)
- 16. Yim et al. (2023) — ACI-Bench. Ambient Clinical Intelligence benchmark. Mocked medical dialogues for evaluation; the authors are explicit about the regulatory constraint. (arxiv.org, paper)
- 17. Chiu et al. (2017) — Google Health medical dialogue. 14,000 hours of institutional medical conversations, never released — institutional corpora trapped by regulation. (arxiv.org, paper)
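A back-of-envelope pass at that gap, using only hour counts cited elsewhere in this index (the 100k-hour threshold is the essay's working target; the ~33× result matches the low end of its 30×–150× range, with the high end implying a stricter corpus filter or a higher target):

```python
# Two-channel conversational hours cited in this index.
public = {"Fisher": 1_960, "Switchboard": 260, "CANDOR": 850}
commercial_hours = 20_000               # Abaka AI: vendor-claimed, not audited

threshold = 100_000                     # hours: the foundation-threshold target

public_total = sum(public.values())     # 3,070 h
print(f"public only:     {threshold / public_total:.0f}x short")  # ~33x
print(f"with commercial: {threshold / (public_total + commercial_hours):.0f}x short")  # ~4x
```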
Companies & platforms (21)
- 01. Harvey — Series A announcement (TechCrunch). Five months after ChatGPT. The post-foundation-compression benchmark for a vertical LLM. (techcrunch.com, platform)
- 02. Hippocratic AI — $50M seed (Reuters). Six months after ChatGPT. Would not have been financeable eighteen months earlier. (reuters.com, platform)
- 03. Abridge. $5.3B valuation (June 2025). The post-Whisper vertical winner for medical scribing. (abridge.com, platform)
- 04. Decagon. $4.5B Series D (January 2026). Customer-service STS agents — a pipeline stack, not native full-duplex. (decagon.ai, platform)
- 05. Deepgram. $1.3B Series C (January 2026). Enterprise voice AI. (deepgram.com, platform)
- 06. Vapi. Developer voice platform. ~$130M valuation reported, on $20M of Series A capital. (vapi.ai, platform)
- 07. Retell AI. The honest counter-nuance: $50M ARR on ~$5M funding suggests some verticals compound without foundation-level investment. (retellai.com, platform)
- 08. Abaka AI. 20,000-hour bidirectional commercial release (2026). The single data point above 10k h for Route 3. Vendor-claimed, not independently audited. (abaka.ai, platform)
- 09. Nexdata. 15k-hour multilingual conversational corpus — mono 8 kHz, so it fails the two-channel bar. (nexdata.ai, platform)
- 10. Appen. Managed crowdsourced data collection. Project-based, not standing corpora. (appen.com, platform)
- 11. TELUS Digital. Digital customer-experience vendor offering managed audio collection at enterprise scale. (telusdigital.com, platform)
- 12. Linguistic Data Consortium (LDC). The academic consortium behind Switchboard and Fisher. Its commercial tier yields in-year redistribution rights. (ldc.upenn.edu, platform)
- 13. Reddit–Google licensing deal (TechCrunch). $60M/year (February 2024). Proof that platforms can monetize UGC corpora to AI labs. (techcrunch.com, platform)
- 14. YouTube — third-party training opt-in. December 2024 creator control surface. The opt-in plumbing for Route 6 audio licensing exists. (blog.youtube, platform)
- 15. RSL — Really Simple Licensing. Launched September 2025; 1,500+ publishers by late 2025. Watch Route 6 for a surprise inflection. (rslstandard.org, platform)
- 16. Spotify — Developer Policy (May 2025). Explicit prohibition on training models from Spotify content. One vendor's stated closure of the audio-licensing door. (developer.spotify.com, platform)
- 17. Mozilla Common Voice. 31,841 hours across 286 languages, CC0. The crowdsourced ceiling — but all single-speaker read or monologue speech. (commonvoice.mozilla.org, platform)
- 18. Replika. Since 2017. Luka Inc. hit with a €5M Italian DPA fine (April 2025) after a provisional ban — Route 1 under a regulatory ceiling. (replika.com, platform)
- 19. Character.AI. Since 2021. Consumer companion app with >10k h of in-app conversation. The corpus has never been released. (character.ai, platform)
- 20. Sesame. Beta 2025. Companion STS app; another volume-rich source structurally trapped inside its app. (sesame.com, platform)
- 21. Oyez Project. 5,000+ hours of public-domain US Supreme Court oral-argument audio. Configuration-wrong (mono-mixed), not data-poor. (oyez.org, platform)
Corpora & references (02)
- 01. Switchboard (LDC97S62). 1991. DARPA + Texas Instruments. 260 hours of two-channel telephone conversation. The Route-5 origin point. (catalog.ldc.upenn.edu, corpus)
- 02. Fisher corpus (Cieri et al., 2004). LREC 2004. 1,960 hours, 11,699 dyadic conversations. Still the default full-duplex fine-tune corpus twenty-two years later. (ldc.upenn.edu, corpus)
From Fullduplex (03)
- 01. Article 04 — The data ceiling. The data-supply side of the same coin: why separation AI and YouTube scraping don't rescue the gap. (/blog/data-ceiling, oto)
- 02. Article 01 — Speech-to-speech AI, a primer. Sets the vocabulary — what an STS system is, and what full-duplex changes. (/blog/sts-primer, oto)
- 03. Fullduplex — datasets. The curated index of conversational speech datasets underlying every article in the series. (/datasets, oto)