---
title: "The STS model landscape"
description: "Thirty-plus speech-to-speech models, four architectural families, and a licensing pattern that is starting to split inside each lab. A field guide to the April 2026 map, legible enough to place newly announced models in one or two paragraphs."
article_number: "08"
slug: sts-model-landscape
published_at: 2026-04-20
reading_minutes: 20
tags: ["models", "architecture", "licensing"]
canonical_url: https://fullduplex.ai/blog/sts-model-landscape
markdown_url: https://fullduplex.ai/blog/sts-model-landscape/md
series: "The STS Series"
series_position: 8
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# The STS model landscape — who is building what

Eighteen months ago, writing about the speech-to-speech (STS) landscape meant writing about Moshi and adding "…and some academic papers in China." That framing is out of date. As of April 2026 there are at least thirty publicly documented open-weights or paper-released STS models, at least four architecturally distinct families, and a separate closed commercial frontier layer from every major lab. The landscape is legible enough that buyers, researchers, and investors can start asking useful questions instead of betting on whichever demo was viral last week.

This article maps the field. [Article 03](/blog/pipeline-to-integrated) introduced the four-family taxonomy (dual-stream plus codec, interleaved-flatten, cascade plus predictor, codec-free). This article populates each family with names, licenses, and short architectural notes, then surfaces the closed commercial layer and three sub-categories that are starting to peel off inside the families. The organizing goal is that someone new to STS can leave this page able to place a newly announced model into the landscape in one or two paragraphs.

{{FIG:f1}}

Three things to keep in mind while reading. First, "full-duplex" is not a single spec. A Family 1 dual-stream model and a Family 3 cascade can both claim full-duplex and mean different operational things. Second, "open-weights" does not imply "commercially usable." At least seven distinct non-closed license regimes are in active use, and several block commercial paths. Third, this map is April 2026. New releases are arriving at a cadence of roughly one per month across the four families, which is itself a diagnostic: a field with monthly releases across multiple labs is at a different stage than a field with one lab and a handful of followers.

<div class="callout">
<span class="label">the three things the landscape asks you</span>

Reading a new STS release in 2026 means answering three questions in order: **which family** (dual-stream, interleaved, cascade-plus-predictor, codec-free), **what license posture** (permissive, non-commercial, closed-commercial, or split-with-a-sibling), and **which sub-category, if any** (reasoning-realtime, translation-duplex, or voice-cloning-inside-STS). Every model on the map lands somewhere in that grid.

</div>
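
To make the grid concrete before the tour starts, here is a minimal sketch of the classification record it implies. The `STSModel` type, the field names, and the two example placements are ours, purely illustrative; `None` stands in for models with no sub-category.

```python
from dataclasses import dataclass

FAMILIES = {
    1: "dual-stream",
    2: "interleaved",
    3: "cascade-plus-predictor",
    4: "codec-free",
}
POSTURES = {"permissive", "non-commercial", "closed-commercial", "split-with-a-sibling"}
SUBCATEGORIES = {"reasoning-realtime", "translation-duplex", "voice-cloning-inside-STS", None}

@dataclass(frozen=True)
class STSModel:
    name: str
    family: int                      # key into FAMILIES
    posture: str                     # member of POSTURES
    subcategory: str | None = None   # None = plain conversational STS

    def __post_init__(self):
        assert self.family in FAMILIES
        assert self.posture in POSTURES
        assert self.subcategory in SUBCATEGORIES

# Two placements from this article, as the grid would record them:
moshi = STSModel("Moshi", family=1, posture="permissive")
hibiki_zero = STSModel("Hibiki-Zero", family=1, posture="permissive",
                       subcategory="translation-duplex")
```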

## Why this map matters now

Three-quarters of Q1-2026 investor conversations about voice AI still open with Moshi as the implicit reference model. The mental model goes: "Moshi shipped the first open full-duplex STS in 2024, a few labs fine-tuned it, a few labs tried other approaches, and the rest is closed commercial work at big labs." That model was roughly correct a year ago. It is not correct now.

The quick tally: Kyutai has now shipped three distinct public models (Moshi, Hibiki, Hibiki-Zero) and one open modular alternative (Kyutai Unmute). NVIDIA ADLR shipped PersonaPlex as a Moshi fine-tune. Sesame released CSM-1B open-weights and keeps its 8B variant closed. Alibaba produced an open OmniFlatten paper, then a productised Qwen2.5-Omni under Apache 2.0, then Qwen3-Omni (Apache 2.0, 30B MoE) in September 2025, and then pivoted to closed for Qwen3.5-Omni in March 2026. Tencent shipped Freeze-Omni (cascade family), then in March 2026 released Covo-Audio and Covo-Audio-Chat-FD under CC BY 4.0 (interleaved family). StepFun has an unbroken open-weights cadence through Step-Audio-R1.1. OpenBMB shipped MiniCPM-o 4.5 as an on-device cascade-plus-predictor. ByteDance has two distinct branches: an academic branch (SALMONN-omni, codec-free) and a production branch (Doubao and Seeduplex, closed at hundreds of millions of users). That is a lot of labs, and it is a lot of divergent design choices.

For investors the implication is that the defensibility question is no longer only "who shipped first." It is license posture (CC BY 4.0 and Apache 2.0 clear commercial paths, FAIR-NC and NVIDIA OneWay Noncommercial block them), data moat (how many hours of what kind of training data), and family choice (which of the four architectural branches is being bet on). The Q1-2026 funding wave, detailed further down, concentrated in companies that are building on top of this landscape rather than inside the foundation layer.

{{FIG:f2}}

## What counts as STS in this article

Some definitional discipline, because the field uses overlapping words. This article uses "STS" to mean a model that takes speech in and emits speech out, with the LLM reasoning on speech (or jointly on speech and text) rather than only routing transcripts. The inclusion bar is full-duplex capability (the model can listen while speaking) or integrated speech-language modeling (audio and text being modeled together), even if the duplex behaviour is bolted on via a predictor head. This excludes pure TTS, pure ASR, and pure cascaded voice agents that wrap a text LLM without any joint audio modeling.
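
As a predicate, the bar reads cleanly. A minimal sketch, assuming self-describing boolean flags per system (the flag names are ours, not any lab's):

```python
def counts_as_sts(speech_in: bool, speech_out: bool,
                  transcript_only_routing: bool,
                  full_duplex: bool, joint_audio_text_lm: bool) -> bool:
    """Inclusion test used in this article, as a boolean predicate."""
    if not (speech_in and speech_out):
        return False    # pure TTS and pure ASR fail here
    if transcript_only_routing:
        return False    # cascaded agents that only route transcripts
    # Full-duplex capability counts even when bolted on via a predictor head.
    return full_duplex or joint_audio_text_lm

# Kyutai Unmute: speech in and out, low latency, but the LLM reasons on
# text only -- transcript_only_routing=True puts it in the near-STS
# bucket discussed below.
print(counts_as_sts(True, True, True, True, False))   # False
```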

Some systems sit on the boundary. Kyutai Unmute wraps a text LLM with Kyutai's own streaming STT and TTS; it is fast and fully open, but the LLM itself operates on text. Meta's Spirit-LM is a single-stream expressive LM gated under FAIR-NC. NVIDIA Audio Flamingo 3 has streaming TTS output under NVIDIA OneWay Noncommercial. These are "near-STS" systems; they show up in the commercial-frontier section rather than in the family sections.

## Family 1: dual-stream plus neural codec

Family 1 models treat user audio and model audio as independent token streams, decoded jointly against an inner-monologue text stream. Kyutai's Moshi is the origin: two parallel transformer streams, a 12.5 Hz neural codec (Mimi), and a theoretical latency of 160 ms with ~200 ms measured in practice. Moshi's weights are CC BY 4.0, code is MIT, and the paper is [arXiv:2410.00037](https://arxiv.org/abs/2410.00037).
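
The latency numbers fall out of codec arithmetic. A back-of-envelope sketch; the two-frame decomposition below is our reading of the 160 ms figure, not a breakdown taken from the paper:

```python
# Back-of-envelope latency arithmetic for a frame-synchronous Family 1 model.
frame_rate_hz = 12.5                  # Mimi's codec frame rate
frame_ms = 1_000 / frame_rate_hz      # one frame every 80 ms

# One frame to take audio in, one frame to put audio out:
theoretical_floor_ms = 2 * frame_ms   # 160 ms, matching the Moshi figure
measured_ms = 200                     # measured in practice, per the article

overhead_ms = measured_ms - theoretical_floor_ms   # compute + transport slack
print(frame_ms, theoretical_floor_ms, overhead_ms)  # 80.0 160.0 40.0
```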

Four branches descend from Moshi's root. First, translation-duplex: Kyutai Hibiki is a speech-to-speech translation derivative, and Hibiki-Zero (February 2026, 3B, open-weights) extends it with GRPO reinforcement learning that does not require word-level aligned data, adding Spanish, Portuguese, and German as input languages. Hibiki-Zero is not conversational in the companion sense; it is translation-shaped duplex. Second, specialized fine-tunes: NVIDIA PersonaPlex (January 2026) is a Moshi fine-tune for persona-grounded dialogue, trained on 1,217 hours of Fisher plus 2,250 hours of synthetic data, released under the NVIDIA Open Model License with MIT code. Third, codec siblings: Sesame CSM-1B (Apache 2.0) reuses Mimi as its codec, while CSM-Medium 8B remains closed. Fourth, production-scale closed deployment: ByteDance Seeduplex, shipping inside the Doubao product, has a dual-stream architecture but is API-only; its April 2026 release is the first full-duplex consumer deployment at hundreds-of-millions-of-users scale.

{{FIG:f3}}

Family 1 is the most structurally mature of the four. The codec is reusable (Mimi is now in Moshi, Hibiki, PersonaPlex, Sesame CSM, and derivative experiments), the inner-monologue pattern is portable, and the latency floor sits near the threshold below which human listeners perceive turn-taking as "natural". The data question remains the binding constraint: each of these models needs two-channel dyadic audio to learn the full-duplex behaviour, and the public supply of that data is orders of magnitude short of what text LLMs have had for scaling. That supply constraint is the subject of [Article 04](/blog/data-ceiling).

PersonaPlex is worth a paragraph on its own because it is the first open-weights Moshi-family checkpoint that treats persona as a *first-class input* rather than a post-training style tag. The hybrid conditioning path — a voice prompt capturing timbre and style plus a text prompt pinning role, facts, and scenario — means the same 7B checkpoint can be a customer-service agent on one inference call and a medical intake scribe on the next without any weight change. NVIDIA reports ~170 ms average first-response latency on a FullDuplexBench-style trace, which sits at the better end of the Family 1 distribution. Weights ship on Hugging Face as [`nvidia/personaplex-7b-v1`](https://huggingface.co/nvidia/personaplex-7b-v1) under the NVIDIA Open Model License, with code MIT on [GitHub](https://github.com/NVIDIA/PersonaPlex). Conceptually, the important move is that persona conditioning stops being the responsibility of the wrapper (system prompt + voice selection) and starts being a property the speech-LM itself exposes — a pattern we expect other Family 1 labs to copy in the next year.
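
As a shape, the conditioning path looks like the sketch below. This is a hypothetical interface for illustration, not PersonaPlex's actual API; the type, the field names, and the commented `model.stream` call are all ours.

```python
from dataclasses import dataclass

@dataclass
class PersonaCondition:
    """Hypothetical container for hybrid persona conditioning."""
    voice_prompt_wav: bytes   # a few seconds of reference audio: timbre + style
    text_prompt: str          # role, facts, scenario

agent = PersonaCondition(
    voice_prompt_wav=b"<reference audio bytes>",   # e.g. loaded from a wav file
    text_prompt="You are a customer-service agent for Acme Telecom. ...",
)
scribe = PersonaCondition(
    voice_prompt_wav=b"<reference audio bytes>",
    text_prompt="You are a medical intake scribe. Collect name, DOB, symptoms.",
)

# Same checkpoint, different persona per call -- no weight change:
# model.stream(mic_audio, condition=agent)    # call N
# model.stream(mic_audio, condition=scribe)   # call N+1
```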

## Family 2: interleaved / flatten single-stream

Family 2 packs speech tokens and text tokens into a single repeating-block sequence. Full-duplex behaviour emerges from the blocking cadence (for example, in OmniFlatten's final stage, a repeating pattern of 2 text tokens and 10 speech tokens) rather than from parallel streams. This family has the most entrants as of April 2026.
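
A toy version of the flattening makes the mechanism visible. The generator below packs two token streams into OmniFlatten's final-stage cadence; real systems also handle padding, stream markers, and exhaustion, none of which is shown here.

```python
from itertools import islice

def flatten_streams(text_tokens, speech_tokens, n_text=2, n_speech=10):
    """Pack two token streams into one repeating-block sequence."""
    text_it, speech_it = iter(text_tokens), iter(speech_tokens)
    while True:
        block = list(islice(text_it, n_text)) + list(islice(speech_it, n_speech))
        if not block:               # both streams exhausted
            return
        yield from block

# Each 12-token block is 2 text tokens followed by 10 speech tokens:
# T0 T1 S0 S1 ... S9 | T2 T3 S10 ... S19
flat = list(flatten_streams([f"T{i}" for i in range(4)],
                            [f"S{i}" for i in range(20)]))
print(flat[:12])
```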

The Alibaba stack dominates the count. OmniFlatten (October 2024, paper-only, Qwen2-0.5B base, 100% synthesised training data) was the first public system in this family. Qwen2.5-Omni (March 2025, Apache 2.0) is its productised descendant. Qwen3-Omni (September 2025, Apache 2.0, 30B MoE with 3B active parameters) became the public flagship. In March 2026 Alibaba shipped Qwen3.5-Omni as a closed preview with a native audio-understanding encoder and native turn-taking intent recognition. The Alibaba pattern (open base, closed flagship) is worth flagging separately below.

StepFun has a sustained open cadence: Step-Audio, Step-Audio 2, and Step-Audio-R1.1 (January 2026, reasoning-tuned realtime variant) plus Step-Audio-EditX (January 2026, paralinguistic editing). Zhipu AI shipped GLM-4-Voice. CAS / ICT-CAS released LLaMA-Omni and LLaMA-Omni 2, built on a Meta Llama base. Moonshot AI released Kimi-Audio (April 2025, MIT, 13M+ hours of training data). Tencent added Covo-Audio and Covo-Audio-Chat-FD (March 2026, CC BY 4.0, tri-modal interleaving) as a Family 2 entrant distinct from Freeze-Omni's Family 3 line. Shanghai Jiao Tong released SLAM-Omni.

The newest Family 2 entrant is FlashLabs Chroma (January 2026, 4B, open-weights, interleaved-flatten with a 1:2 text-audio ratio, an RTF of 0.43, i.e. one second of audio generated in roughly 0.43 seconds, and sub-second latency). Chroma is notable because it is the first open-source integrated STS to ship with built-in personalized voice cloning. That has consent implications picked up in [Article 09](/blog/consent-licensing-opt-in).

{{FIG:f4}}

The operational reason Family 2 has the most entrants is that interleaved single-stream is the most natural shape when you start from a text LLM and add audio. Families 1, 3, and 4 each require deeper architectural surgery. Family 2 is a packing-and-sequencing problem, which is tractable for any lab with a strong text LLM and audio tokenization capability. The flip side is that the latency story is less clean (a packed single stream cannot be as parallel as a dual-stream setup), and the full-duplex behaviour depends heavily on the blocking cadence chosen at training time.

## Family 3: cascade with chunk-level duplex predictor

Family 3 keeps ASR and TTS conceptually separate but adds a state-predictor head that chunks input and output so the model can interrupt, backchannel, or pause at sub-second granularity. This is sometimes called time-division multiplexing after MiniCPM-o's framing.
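
The control flow is easiest to see as a loop. This is our skeleton of the pattern, not any specific system's API; chunk sizes, the state set, and the predictor itself vary per model.

```python
from enum import Enum, auto

class DuplexState(Enum):
    LISTEN = auto()
    SPEAK = auto()
    BACKCHANNEL = auto()

def duplex_loop(chunks, asr, predictor, llm, tts):
    """Family 3 skeleton: separable ASR/LLM/TTS, gated by a predictor head.

    The predictor decides per sub-second chunk whether to keep listening,
    respond, or backchannel. Note what it does NOT do: ASR errors still
    flow into the LLM and TTS -- the head gates timing, not content.
    """
    state = DuplexState.LISTEN
    transcript = ""
    for chunk in chunks:                       # e.g. 160-320 ms audio chunks
        transcript += asr(chunk)               # incremental recognition
        state = predictor(chunk, transcript, state)
        if state is DuplexState.SPEAK:
            yield tts(llm(transcript))         # full response for this turn
        elif state is DuplexState.BACKCHANNEL:
            yield tts("mm-hm")                 # minimal acknowledgement
        # LISTEN: emit nothing, keep accumulating context
```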

Freeze-Omni (November 2024, Tencent plus Nanjing University and Fudan) is the reference point: model-only latency of 160-320 ms and ~1.2 seconds end-to-end in deployed scenarios, weights available under an Apache-style release. MiniCPM-o 4.5 (OpenBMB) brought the time-division-multiplex approach to on-device deployment. Mini-Omni and Mini-Omni 2 (Tsinghua) populate the academic-scale end. OpenS2S (CASIA) is the empathy-first fully open entrant, releasing code, data, and weights together.

The important Q1-2026 entrant is DuplexCascade (March 2026, [arXiv:2603.09180](https://arxiv.org/abs/2603.09180)). DuplexCascade is VAD-free: it uses conversational control tokens and micro-turn chunks to make the turn-taking decision end-to-end within the cascade, rather than routing through a voice-activity detector. The paper claims state-of-the-art full-duplex turn-taking on Full-Duplex-Bench among open-source STS systems. That matters because [Article 03](/blog/pipeline-to-integrated) had cast the cascade-plus-predictor family as the branch most likely to lose out to integrated systems. DuplexCascade reopens the question, at least for labs that prefer to keep ASR and TTS as separable components.

{{FIG:f5}}

The trade-off inside Family 3 is that it inherits a compounding-error vulnerability that Families 1, 2, and 4 avoid: errors in the ASR stage propagate into the LLM stage and then into the TTS stage, and the predictor head does not undo them. In practice this means Family 3 systems tend to be stronger on strictly defined turn-taking tasks and weaker on paralinguistic expressiveness, because the text bottleneck in the middle strips prosody. For enterprise buyers that want a clean, inspectable pipeline, that is sometimes a feature. For consumer companion-app deployment, it is usually a problem.

## Family 4: codec-free single-decoder

Family 4 is thin. The canonical example is SALMONN-omni (ByteDance), which operates on continuous embeddings without a neural audio codec in the loop and uses an internal "thinking" state to decide when to emit speech versus listen. The design choice is architectural minimalism: no codec means no codec-artefact failure modes, but it also means fewer reusable components and less production tooling than Families 1 or 2.

As of April 2026 Family 4 has one serious public entrant. If the codec-free approach proves out at production scale, it could seed a distinct research lineage; at this point it is a placeholder family in the taxonomy rather than a populated one. Including it in the map is the right move anyway, because a four-family taxonomy that collapsed it into Family 3 would mis-describe what the ByteDance team is actually doing.

<p class="aside-inline">
<span class="aside-lbl">aside</span>
The taxonomy is a working object, not a finished one. A fifth family would not be surprising in 2026 — a discrete-token diffusion approach, or a retrieval-augmented STS that retrieves at the audio level rather than the text level. The test for "new family vs. variant of an existing family" is whether the training-data shape and the architectural choice are jointly new. Every current family passes that test. Future entrants should be held to the same bar.
</p>

## Near-STS and the closed commercial frontier

Near-STS, as defined earlier, includes Kyutai Unmute (modular cascade wrapping a text LLM with Kyutai's open STT and TTS, production deployment at ~450 ms, all components MIT-licensed), Meta Spirit-LM (single-stream expressive LM under FAIR-NC, gated, English-only), NVIDIA Audio Flamingo 3 (streaming TTS output under NVIDIA OneWay Noncommercial), VITA-MLLM LUCY (emotion-token plus tool use), F-Actor (KIT / Edinburgh / NatWest, 2,000 hours of fine-tuning on an academic budget), IntrinsicVoice, and several others on the boundary. These systems are usable research artefacts, but none of them is a joint audio-language modeler in the Family 1-4 sense.

The closed commercial frontier is where the consumer-scale deployments live. OpenAI ships GPT-4o voice plus the Realtime API and its gpt-realtime successor. Google ships Gemini Live plus Gemini 3.1 Flash Live (March 2026). Amazon ships Nova Sonic. Microsoft ships MAI-Voice-1 plus MAI-Transcribe-1 (broad availability April 2026). ByteDance ships Doubao in China and Seeduplex as the April 2026 full-duplex upgrade. Hume ships EVI plus Octave 2 plus EVI 4-mini (a prosody-driven half-duplex system paired with an external LLM; full EVI 4 with a native-LM variant was still pending as of April 2026). Cartesia ships Sonic. ElevenLabs has expanded its TTS catalogue into streaming voice agents. Deepgram ships Aura Nova. xAI launched Grok Voice APIs in April 2026.

The Q1-2026 funding wave concentrated in this frontier or one layer below it. Deepgram closed a $130M Series C in January 2026. Parloa raised $350M Series D at a ~$3B valuation. Decagon closed a $250M Series D at a $4.5B valuation, followed by a secondary tender offer in March 2026. ElevenLabs closed a $500M Series D at a ~$11B valuation in February 2026. Retell raised a publicly disclosed $4.6M seed in March 2026. Across five voice-AI rounds inside eight weeks, the Q1-2026 total was approximately $1.23B. For comparison, the equivalent Q1-2024 number was roughly an order of magnitude smaller.

Two acquisitions completed the picture. In January 2026 Google DeepMind acqui-hired Hume AI's CEO and approximately seven engineers, with Hume's consumer product continuing under the original team. Apple acquired Q.ai, a silent-speech interface company, for a reported $1.6-2B in January 2026. The hyperscaler absorption pattern is distinct from standard M&A. It suggests that voice-AI talent is being pulled into the frontier labs rather than accumulating inside independent startups, which has implications for who gets to build the next generation of foundation models.

{{FIG:f6}}

## Licensing bifurcation and three emerging sub-categories

Across the ~30 open or paper-released models, at least seven non-closed license regimes are in active use: MIT, Apache 2.0, CC BY 4.0, CC BY-NC 4.0, FAIR-NC, the NVIDIA Open Model License, and NVIDIA OneWay Noncommercial, plus several community-restrictive licenses with commercial carve-outs. This is more license diversity than the text LLM open-weights ecosystem had at a comparable stage. For enterprise buyers, that diversity is a compliance story before it is a capability story. A CC BY-NC or FAIR-NC model cannot be directly deployed in commercial production without a bespoke license; an Apache 2.0 or MIT model can.
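
The buyer-side question collapses to a lookup. A coarse sketch of that lookup, with the obvious caveat that the `commercial_ok` flags encode this article's reading, not legal advice; always check the actual license text and any model-specific rider.

```python
# Coarse commercial-path flags per regime, as read in this article.
LICENSE_REGIMES = {
    "MIT":                         True,
    "Apache-2.0":                  True,
    "CC-BY-4.0":                   True,    # attribution required
    "CC-BY-NC-4.0":                False,
    "FAIR-NC":                     False,
    "NVIDIA-Open-Model-License":   True,    # commercial use, with its own conditions
    "NVIDIA-OneWay-Noncommercial": False,
}

def deployable_without_bespoke_agreement(license_id: str) -> bool:
    """First-pass filter; unknown or community-restrictive regimes fail closed."""
    return LICENSE_REGIMES.get(license_id, False)

assert deployable_without_bespoke_agreement("Apache-2.0")
assert not deployable_without_bespoke_agreement("FAIR-NC")
assert not deployable_without_bespoke_agreement("SomeCommunityLicense-1.0")
```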

A new pattern visible in Q1-2026 is license bifurcation inside a single lab. Alibaba's Qwen3-Omni shipped under Apache 2.0 in September 2025. Its successor Qwen3.5-Omni (March 2026) is a closed API-only preview. ByteDance has a similar split: its academic branch (SALMONN-omni) is open, while its production branch (Doubao and Seeduplex) is closed. StepFun has so far kept Step-Audio 2 at Apache 2.0 but has not confirmed the license for Step-Audio-R1.1 in its initial announcement. The open-first-base, closed-at-the-flagship pattern is now visible enough to be a planning assumption rather than a lab-specific choice.

<div class="callout dark">
<span class="label">the split, in one line</span>

**Open-first-base, closed-at-the-flagship.** Alibaba, ByteDance, and (probably) StepFun are all running the same pattern: keep the research base permissive enough to collect ecosystem contributions, then close the commercial flagship where the margins are. A buyer choosing "Qwen3" today does not get access to the same object a buyer choosing "Qwen3.5" gets, and that divergence is the story of Q1-2026.

</div>

Three sub-categories are starting to peel off inside the four families. The first is reasoning-tuned realtime: Step-Audio-R1.1 and the Qwen3-Omni-Thinking variant both advertise "thinking while speaking," embedding reasoning trajectories in the audio-generation stream. This is a new sub-category inside Family 2. The second is translation-duplex: Kyutai Hibiki and Hibiki-Zero define an application branch of Family 1 that shares Moshi's architecture but serves simultaneous interpretation rather than conversation. Meta SeamlessStreaming sits on this axis from a different architectural base. The third is voice-cloning inside integrated STS: FlashLabs Chroma is the first open STS to ship with built-in personalized voice cloning. Previously that capability lived only in TTS-specific stacks (ElevenLabs, Voxtral TTS). Merging it into integrated STS compounds consent obligations and is part of why [Article 09](/blog/consent-licensing-opt-in)'s treatment of voice-as-biometric data matters now rather than later.

{{FIG:f7}}

## What this map means

For builders, the family choice is a commitment. A Family 1 dual-stream model trained on two-channel dyadic data is not substitutable for a Family 3 cascade trained on single-channel monologue data. The training-data shape and the architectural choice are coupled, which is why the open-vs-closed question for a specific product cannot be answered without first answering the family question.

For researchers, the families share benchmark vocabulary but not comparability: a 200 ms latency number from a Family 1 dual-stream model is not the same artefact as a 200 ms latency number from a Family 3 cascade. [Full-Duplex-Bench v1 through v3](/blog/benchmark-landscape) evaluates behaviours (interruption, pause, backchannel, turn-taking) without isolating architecture, and [the next generation of benchmarks](/blog/why-new-benchmarks) will need to add family labels if headline numbers are going to be comparable.

For enterprise buyers, the license tier determines the deployment surface. NVIDIA OneWay Noncommercial, CC BY-NC, and FAIR-NC block most commercial paths without a bespoke agreement. CC BY 4.0 and Apache 2.0 clear them. Closed APIs let you skip the license question but commit you to a vendor's rate card. Almost no frontier model is sold on truly permissive commercial terms today; the exceptions (Moshi, Qwen3-Omni at the base tier, Step-Audio 2, Kimi-Audio, Covo-Audio, FlashLabs Chroma) are worth identifying early.

For investors, the landscape is starting to bifurcate in a way that was not visible a year ago. A small set of foundation-tier players is building across the four families with capital-intensive runs and growing data moats. A larger set of vertical-tier players is building on top of the foundation layer, capturing workflow and distribution in contact centres, healthcare, gaming, and companion apps. A middle band is getting absorbed into hyperscalers: Google DeepMind's acqui-hire of Hume and Apple's acquisition of Q.ai are the visible 2026 examples, and the pattern is likely to continue. [Article 05](/blog/foundation-before-vertical) argued that the STS foundation moment is near rather than past. The presence of DuplexCascade, Seeduplex, FlashLabs Chroma, and the licensing bifurcation together say the field is still accumulating moves, not consolidating. That is the moment before the moment; knowing which family a move belongs to is the difference between reading the field and guessing at it.

The next article ([Article 09](/blog/consent-licensing-opt-in)) turns to the data side of this landscape: who owns the conversations these models are learning from, how consent works now, and where the regulatory floor is rising.

---

**Fullduplex is building large-scale two-channel full-duplex conversational speech datasets for next-generation STS models.** If you are a frontier lab, a research team, or an enterprise buyer making family or license decisions, [get in touch](mailto:hello@fullduplex.ai). If you are an investor evaluating the voice-AI stack and want access to our data room, [reach out here](mailto:hello@fullduplex.ai).

---

_Originally published at [https://fullduplex.ai/blog/sts-model-landscape](https://fullduplex.ai/blog/sts-model-landscape)._
_Part of **The STS Series** · 08 / 10 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
