Two releases, one inflection week
The STS series argued, across nine articles, that voice AI in 2026 sits between the GPT-2 and GPT-3 moments: architecture is becoming a commodity, the bottleneck is foundation data, evaluation is misaligned, and the closed commercial frontier is pulling ahead of public benchmarks via a single proprietary bridge. The two releases this week pressure-test every one of those claims at once.
#2026-w21-001 — OpenAI Realtime API GA and GPT-Realtime-2
On May 7, OpenAI graduated the Realtime API out of beta and shipped three new audio models: gpt-realtime-2, gpt-realtime-translate, and gpt-realtime-whisper. The flagship adds GPT-5-class reasoning into the realtime path, lifts context from 32k to 128k tokens, and exposes a reasoning_effort knob with five tiers from minimal to xhigh. OpenAI's own scoreboard reports +15.2 pp on Big Bench Audio and +13.8 pp on Audio MultiChallenge over Realtime-1.5 at the higher reasoning tier. Audio billing is $32 / 1M input, $64 / 1M output, $0.40 / 1M cached.
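As a worked example of the quoted audio rates (the per-token prices are from the release; the session token counts below are illustrative, not from OpenAI):

```python
# Back-of-envelope cost for one GPT-Realtime-2 audio session at the
# quoted rates: $32 / 1M input, $64 / 1M output, $0.40 / 1M cached.
PRICE_PER_M = {"input": 32.00, "output": 64.00, "cached": 0.40}  # USD per 1M audio tokens

def session_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Return the USD cost of a session at the quoted audio prices."""
    return round(
        input_tokens * PRICE_PER_M["input"] / 1_000_000
        + output_tokens * PRICE_PER_M["output"] / 1_000_000
        + cached_tokens * PRICE_PER_M["cached"] / 1_000_000,
        4,
    )

# A hypothetical conversation: 30k fresh input tokens, 40k output tokens,
# and 30k input tokens served from cache.
print(session_cost(input_tokens=30_000, output_tokens=40_000, cached_tokens=30_000))  # 3.532
```

The cached-input rate being two orders of magnitude below fresh input is the lever: long-running sessions that reuse context get materially cheaper than the headline prices suggest.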
Three implications against the series.
The reasoning-realtime sub-category from article 08 is now the platform default. That article named Step-Audio-R1.1 and Qwen3-Omni-Thinking as early entrants to a sub-category embedding reasoning trajectories in the audio-generation stream. With OpenAI exposing reasoning_effort as a first-class API parameter, the choice between low-latency conversation and deliberate reasoning stops being a model selection problem and becomes a request parameter. The sub-category is no longer peeling off — it has been absorbed into the closed commercial flagship.
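The shape of that shift can be sketched in a few lines. The five tier names and the `reasoning_effort` parameter come from the release notes above; the payload structure (field names, config object) is an assumption for illustration, not OpenAI's documented schema:

```python
# Reasoning depth as a per-request knob on one model, rather than a choice
# between models. Tier names are from the release; the config shape is a
# hypothetical sketch, not the documented Realtime API payload.
REASONING_TIERS = ("minimal", "low", "medium", "high", "xhigh")

def realtime_session_config(effort: str, model: str = "gpt-realtime-2") -> dict:
    """Build a session config that selects a reasoning tier at request time."""
    if effort not in REASONING_TIERS:
        raise ValueError(f"unknown reasoning_effort: {effort!r}")
    return {"model": model, "reasoning_effort": effort}

fast = realtime_session_config("minimal")  # low-latency conversation
slow = realtime_session_config("xhigh")    # deliberate reasoning, same model
```

The point is structural: a team that previously routed traffic between a fast model and a reasoning model now flips a field per request.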
The gpt-realtime-translate release lands directly on Kyutai Hibiki's territory. Article 08 flagged translation-duplex as an application branch of Family 1, with Hibiki and Hibiki-Zero as the only open-weights entrants and Meta SeamlessStreaming from a different base. A closed-commercial S2ST at per-minute pricing changes the procurement question for any team that was evaluating Hibiki because it was the only option.
Artificial Analysis is no longer the only proprietary bridge. Article 07 singled out AA as the single non-reproducible gateway through which commercial STS scoreboards flow. OpenAI's launch this week reports Big Bench Audio and Audio MultiChallenge lifts in absolute pp terms, citing its own runner. That is not better for reproducibility, but it does shift the bridge from AA-via-vendor to vendor-direct on the reasoning axis. Whether that pattern holds for FD-Bench is the question to watch over the next two months.
Model directory updates: openai-realtime revised to gpt-realtime-2 (May 2026, 128k context, five reasoning tiers); gpt-realtime-translate added as a new s2st entry.
#2026-w21-002 — TML-Interaction-Small, the first VAD-free 5-family entrant
On May 12, Mira Murati's Thinking Machines Lab announced TML-Interaction-Small, a 276B-parameter mixture-of-experts model with 12B active parameters. The single most consequential detail is that it is VAD-free and codec-light: dMel embeddings for audio, hMLP for 40×40 video patches, a flow head for audio decoding, all early-fused and decoded in 200 ms time-aligned micro-turns. Standard voice-activity detection is replaced by model-internal signals tracking whether speakers are thinking, yielding, self-correcting, or inviting response. Turn-taking latency is 0.40 s on FD-Bench v1; interaction quality is 77.8 / 100 on FD-Bench v1.5, ahead of GPT-Realtime-2 and Gemini 3.1 Flash Live. On pure intelligence it trails (43.4% vs 48.5% on Audio MultiChallenge APR).
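The four model-internal signals named above can be read as a per-micro-turn state that replaces a VAD energy threshold in the serving loop. TML has disclosed only the signal names and the 200 ms micro-turn granularity; everything else in this sketch (how signals map to a floor-taking decision) is an assumption for illustration:

```python
from enum import Enum

# The four turn-taking signals named in the TML release. How the model emits
# them is undisclosed; this sketch only shows how such signals could replace
# a VAD threshold in a serving loop (an assumed decision rule, not TML's).
class SpeakerSignal(Enum):
    THINKING = "thinking"                    # mid-thought: keep listening
    YIELDING = "yielding"                    # handing over the floor
    SELF_CORRECTING = "self_correcting"      # revising: do not interrupt
    INVITING_RESPONSE = "inviting_response"  # explicit cue to respond

MICRO_TURN_MS = 200  # time-aligned decode granularity from the release

def should_respond(signal: SpeakerSignal) -> bool:
    """Decide, once per micro-turn, whether the model takes the floor."""
    return signal in (SpeakerSignal.YIELDING, SpeakerSignal.INVITING_RESPONSE)

# The reported 0.40 s turn-taking latency is exactly two micro-turns:
micro_turns_to_respond = 400 // MICRO_TURN_MS  # 2
```

Under this reading, the 0.40 s FD-Bench latency is consistent with a decision made within two decode steps of a yield cue, which is what makes the VAD-free claim architecturally interesting rather than just a benchmark number.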
Four implications against the series.
Article 03's four-family taxonomy needs a fifth slot. The aside in that article set the bar for a new family at "training-data shape and architectural choice are jointly new." TML clears it on both sides. Architecturally it is neither dual-stream-plus-codec (F1) nor interleaved-flatten (F2) nor cascade-plus-predictor (F3) nor codec-free-with-thinking (F4). Encoder-free early fusion with concurrent audio-video-text streams and time-aligned micro-turns is its own shape. Training-data shape is also distinct — the system explicitly trains on three-stream multimodal data rather than two-channel dyadic conversation. Call it Family 5: encoder-free multimodal early-fusion.
The co-completion gap from article 02 is being closed from outside the FD-Bench harness. That article noted that FD-Bench v1 measures three of the four conversational micro-behaviours (barge-in, backchannel, overlap recovery) and explicitly does not score the fourth (co-completion). TML's TimeSpeak (proactive timing, 64.7%) and CueSpeak (verbal-cue response, 81.7%) measure exactly the behaviours FD-Bench leaves on the table. The methodology is vendor-published rather than community-standardised, but the gap is being targeted.
Article 07's requirement #4 — open methodology including judge selection — is the bar TML's own benchmarks fail. TimeSpeak and CueSpeak are vendor-published with no third-party harness; cross-vendor comparability is not established. That is the inverse of what article 07 argued the field needs. A commercial lab publishing four bespoke evaluation axes alongside its model release is the same structural problem as Artificial Analysis — a private scoring gateway — distributed across labs instead of consolidated in one vendor. Watch for whether the FD-Bench team incorporates a TimeSpeak / CueSpeak equivalent in v3.5 or v4.
Article 05's 100k-500k hour foundation-data hypothesis is not falsifiable from this release. TML disclosed nothing about training corpus size or composition. Three of the load-bearing assumptions in that article's hypothesis — the ASR-curve analogy, the full-duplex difficulty multiplier, and parameter-data co-scaling — cannot be evaluated against TML until a weights release, paper, or data card lands. The 276B-A12B parameter footprint does fall in the GPT-3-equivalent band where that article placed the foundation threshold, which is at least directionally consistent with the hypothesis.
Provisioning: limited research preview in the coming months, wider release later in 2026. License and open-weights posture undisclosed. Model directory: tml-interaction-small added as a preview-tier entry under speech-lm-fd. Benchmarks: tml-timespeak and tml-cuespeak added with the preview / vendor-published flag.
What this week answered, and what it did not
The series asked four open questions across articles 02-08. Two are now sharper.
- Will reasoning-realtime stay a sub-category or fold into the platform default? It folded. Five-tier reasoning is now an API parameter.
- Can the FD-Bench family widen to cover co-completion? Not from inside the harness yet. From outside, TML's TimeSpeak / CueSpeak hit the target.
Two remain open.
- Will the Artificial Analysis bridge become reproducible or be replaced? Neither this week. Vendor-direct citation grew; open methodology did not.
- Will the foundation-data threshold come into view? No. TML's training corpus is undisclosed; OpenAI's has been undisclosed since GPT-4o.
One new question, not previously in the series. Does a 5-family taxonomy make article 03 better or worse? A taxonomy with a vendor-of-one fifth slot is a weaker organising tool, not a stronger one — unless a second lab ships something architecturally adjacent to TML inside the next two quarters. The watch list is FlashLabs, ByteDance SALMONN-omni's successor, and any Sesame CSM-Medium-class release.
Corrections to hello@fullduplex.ai.