---
title: "Mapping the benchmark landscape"
description: "Too many speech-to-speech benchmarks, each covering a different slice. The map, as of April 2026 — arena versus fixed test set, four capability axes, a coverage heatmap, and a Japanese gap."
article_number: "06"
slug: benchmark-landscape
published_at: 2026-04-20
reading_minutes: 18
tags: ["benchmarks", "evaluation", "STS"]
canonical_url: https://fullduplex.ai/blog/benchmark-landscape
markdown_url: https://fullduplex.ai/blog/benchmark-landscape/md
series: "The STS Series"
series_position: 6
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# Mapping the STS benchmark landscape

Imagine trying to get faster at running without ever looking at a stopwatch. You can feel whether a run was hard or easy. You can guess whether you are improving. What you cannot do is tell a coach exactly how much you improved this month, or show someone in another city that your method actually works. The stopwatch is what turns effort into measurable progress.

AI research has a version of the same problem, and in the early 2010s DeepMind ran into it directly. They wanted to build AI that learned to play games on its own. Before that era, the field already had AI that played chess, AI that played checkers, AI that played backgammon. But each system was built for a single game, with its own inputs and its own scoring. There was no way to ask whether "AI that plays chess" and "AI that plays backgammon" represented the same kind of progress.

DeepMind's move was to pick something boring on purpose: 49 old Atari 2600 games. Same controller layout, same pixel-based screen input, same visible score in the corner. One AI played all 49. The score was the measurement, the controls were standard, and anyone could run the experiment and compare results. The Atari suite (and later the DeepMind Control Suite) became what researchers call a benchmark: a shared task, a shared scoring rule, and a shared format. It did not make AI smarter by itself. What it did was let the whole field tell, from week to week and lab to lab, whether a new method was actually better.

This is what benchmarks do, and why the presence or absence of one shapes a whole field. Without one, a team produces impressive demos and cannot tell whether the next version is better or just different. With one, a team can run the same test on this month's model and last month's model, see the number move, and show that number to a skeptic. A benchmark is what turns a sandbox into an improvement engine.

Speech-to-speech AI (the category of models that listen and reply in voice, like OpenAI's GPT-4o voice mode, Google's Gemini Live, or Kyutai's open-source Moshi) is at the stage where the demos are impressive but the measurement layer is still being assembled. [Article 05](/blog/foundation-before-vertical) argued that full-duplex STS sits roughly where automatic speech recognition sat in 1991, before TIMIT and Switchboard standardized that field's measurements. This article is the measurement-infrastructure half of that argument, and [the next](/blog/why-new-benchmarks) is its prescriptive companion.

A buyer who wants to compare two STS models today has no single score to rely on. OpenAI cites Big Bench Audio in its GPT-Realtime launch. StepFun cites the same benchmark via the Artificial Analysis leaderboard. A full-duplex benchmark paper reports four numbers that the commercial leaderboards do not even use. A Japanese product team finds no benchmark in its language at all. The gap is not that benchmarks are missing. The gap is that there are too many, each measuring a different slice, and the buyer has to reassemble the picture by hand.

This article is that map, as of April 2026. It is a map, not an argument. The argument comes in [the next dispatch](/blog/why-new-benchmarks). The field's own researchers disagree about where the gaps are, and a shared labeled diagram is the fastest way to make those disagreements visible.

## Two questions, no single answer

Two different people ask two different questions when they look at an STS benchmark. A buyer asks, "can this model handle my product?" A researcher asks, "which axis is my model weakest on?" Neither question has a one-number answer today, and the reasons are different.

The buyer's problem is that **no benchmark scores a complete production voice agent.** Conversation quality, reasoning quality, tool use, safety behavior, and language coverage each live on a separate benchmark. There is no STS equivalent of what MMLU is for text LLMs: one score a non-specialist can point at and say "higher is better across the board."

The researcher's problem is that **the field is fragmented across four capability axes** and four *versions* of the same full-duplex benchmark. The 2024 atomic note that maps this fragmentation already listed four benchmark families covering different slices (representation, instruction following, voice-agent task competence, interaction dynamics) with no unified yardstick, and the 2026 landscape has only gotten wider, not narrower.

So the rest of this article is structured as a map that serves both readers. The next section establishes the two evaluation *styles* the field has produced. After that, we define the four capability *axes* the field actually measures, zoom into the two most-cited anchors, and surface the coverage heatmap. The closing sections cover language coverage, citation patterns, and the thin axes.

## Two evaluation styles — arena versus fixed test set

The first structural split in the landscape is not technical. It is methodological.

**Fixed-test-set benchmarks** run pre-recorded audio through a model and compute deterministic metrics. SUPERB (2021) was the pre-LLM prototype of this style: frozen-encoder representation quality measured against fixed probing tasks. Almost every benchmark covered in this article descends from that lineage — a fixed suite of stimuli, a fixed scoring rule, reproducible results.

**Arena benchmarks** do the opposite. They put two models into live conversation with a human judge, collect preference votes, and rank models by an Elo-like aggregation. Scale AI launched [Voice Showdown](https://scale.com/leaderboard/voice-showdown) on March 20, 2026 as the first full-scale voice arena. It is the speech-side counterpart to LMSYS Chatbot Arena on text. The text-side split between MMLU (fixed) and Chatbot Arena (preference) has been live since 2023. Speech arrived later and less completely.
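
For readers who want the mechanics, here is a minimal sketch of the Elo-style update an arena can run after each preference vote. The K-factor and starting rating are illustrative defaults, not Voice Showdown's parameters; production arenas such as Chatbot Arena typically fit a Bradley-Terry model over the full vote history rather than updating sequentially.

```python
# Minimal sketch of an Elo-style update after one preference vote.
# K = 32 and the 1000-point starting rating are illustrative defaults,
# not Voice Showdown's actual parameters.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict[str, float], winner: str, loser: str,
                k: float = 32.0) -> None:
    """Shift both ratings toward the observed outcome."""
    e_w = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_w)          # bigger upset -> bigger rating swing
    ratings[winner] += delta
    ratings[loser] -= delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
for winner, loser in [("model_a", "model_b"),
                      ("model_a", "model_b"),
                      ("model_b", "model_a")]:
    record_vote(ratings, winner, loser)
print(ratings)   # model_a ends slightly ahead after winning 2 of 3 votes
```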

Why this split matters: the two styles answer different questions. Fixed test sets answer "does the model meet a specification." Arenas answer "do users prefer this model over that one." A model can win an arena with warm prosody and lose a fixed benchmark by getting the Formal Fallacies items wrong. Both are true. Neither is the whole picture.

The split has a second axis worth drawing explicitly: whether the benchmark targets **general-purpose** capability (reasoning, conversation, agent tasks) or **task-specific** capability (one narrow behavior like emergency interruption or code-switching). Crossing these two dichotomies produces four quadrants.

{{FIG:f1}}

Voice Showdown occupies the arena × general-purpose quadrant. Almost everything else in this article is fixed-test-set — split between general-purpose (VoiceBench, URO-Bench, VocalBench, Big Bench Audio) and task-specific (the Full-Duplex-Bench family, FLEXI, HumDial, SID-Bench, FD-Bench, MTR-DuplexBench). The arena × task-specific quadrant is empty as of April 2026. Scale announced a full-duplex mode for Voice Showdown at launch, but that mode was not yet live when this article was written.

That empty quadrant matters. It means nothing currently tells a buyer, "over 500 arena votes, model A holds a conversation better than model B." Commercial comparisons of full-duplex behavior fall back to fixed-test-set scores or internal demos.

## Four capability axes the field actually measures

Within fixed-test-set benchmarks, the coverage concentrates on four orthogonal capability axes.

**Axis A — Speech reasoning.** Can the model, given an audio question, apply logic, arithmetic, spatial reasoning, or multi-step inference to produce a correct answer? [Big Bench Audio](https://huggingface.co/blog/big-bench-audio-release) (HuggingFace, December 2024) is the anchor here. It covers 1,000 audio questions drawn from BIG-Bench text items, in four categories of 250 each: Formal Fallacies, Navigate, Object Counting, Web of Lies. Artificial Analysis implements it as the Speech Reasoning axis of its S2S leaderboard.

**Axis B — Conversational dynamics.** When does the model start talking? When does it stop? Does it yield to an interruption? Does it backchannel at the right moments? [Full-Duplex-Bench](https://arxiv.org/abs/2503.04721) v1 operationalized this as four automatable axes — pause handling, smooth turn-taking, backchanneling, user interruption — and became the field's reference point. The FDB family covered below is the deep spine of this axis.

**Axis C — Paralinguistic understanding and generation.** Does the model hear the user's emotion, and does its spoken reply match? [SD-Eval](https://arxiv.org/abs/2406.13340) scores whether a model uses paralinguistic input at all. [ProsAudit](https://arxiv.org/abs/2302.12057) scores whether the model can detect prosodic boundaries. [VocalBench](https://arxiv.org/abs/2505.15727) and [MTalk-Bench](https://arxiv.org/abs/2505.15524) push into the generation side, scoring whether the spoken reply carries the expected affect. This axis is the thinnest of the four in terms of joint input-output coverage.

**Axis D — Task competence.** Can the model book a flight by voice? Can it finish a support conversation? [VoiceBench](https://github.com/MatthewCYM/VoiceBench) (~6,783 instructions), [URO-Bench](https://arxiv.org/abs/2502.17810), [τ-Voice](https://sierra.ai/blog/tau-voice) (Sierra, 2025), and [AudioBench](https://arxiv.org/abs/2406.16020) anchor this axis. τ-Voice also introduced a direct voice-vs-text retention number — voice agents retain only 30-45% of the corresponding text agent's score on grounded tasks — which is one of the cleaner 2026 data points for "voice is hard at the task layer, not just the latency layer."

A fifth axis cross-cuts all four: language coverage. That is treated separately below because the pattern there is unusual.

## The Full-Duplex-Bench family as the conversational-dynamics backbone

Axis B deserves a dedicated section because the benchmark stack under it is deep, fast-moving, and often mis-cited.

The [Full-Duplex-Bench v1 paper](https://arxiv.org/abs/2503.04721) (March 2025) operationalized full-duplex behavior as four metrics computed over pre-recorded stimuli:

- **Pause handling.** Does the model stay quiet when the user pauses mid-thought? Scored by a Take-Over-Rate detector with a 1-second / 3-word threshold on the model's transcribed output. Lower is better.
- **Smooth turn-taking.** When the user finishes a turn, does the model start within a natural window? Same TOR detector, opposite polarity — higher is better.
- **Backchanneling.** Does the model say "mm-hm" at the right moments? Scored by Jensen-Shannon Divergence of the model's backchannel timing distribution against an ICC corpus ground truth.
- **User interruption.** When the user cuts in, does the model produce a relevant new response quickly? Scored by TOR, latency, and a GPT-4-turbo relevance rating.

Three of the four are fully automatic. Interruption is the only axis that calls a closed-source frontier judge, which adds cost and risks reproducibility drift as judge models age and are retired.
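
To make the scoring concrete, here is a minimal sketch of the two automatic metric styles, a TOR-style detector and the backchannel-timing JSD, assuming word-level timestamps from an ASR pass over the model's output. The data shapes and helper names are ours; the FDB v1 harness's exact operationalization differs in its details.

```python
# Minimal sketch of the two fully-automatic FDB v1 metric styles, assuming
# word-level timestamps over the model's transcribed output. Data shapes
# and helper names are illustrative, not the FDB harness's.
from dataclasses import dataclass
import numpy as np
from scipy.spatial.distance import jensenshannon

@dataclass
class Word:
    text: str
    start: float  # seconds from stimulus onset

def takes_over(model_words: list[Word], pause_start: float,
               min_words: int = 3, window: float = 1.0) -> bool:
    """TOR-style detector: did the model emit >= 3 words within 1 s of
    the user's mid-thought pause beginning?"""
    spoken = [w for w in model_words
              if pause_start <= w.start <= pause_start + window]
    return len(spoken) >= min_words

def take_over_rate(episodes: list[tuple[list[Word], float]]) -> float:
    """Fraction of pause episodes the model barged into. Lower is better
    for pause handling; the same detector with opposite polarity scores
    smooth turn-taking (did the model start once the turn really ended?)."""
    return sum(takes_over(ws, t) for ws, t in episodes) / len(episodes)

def backchannel_jsd(model_times: list[float], corpus_times: list[float],
                    n_bins: int = 20, duration: float = 60.0) -> float:
    """Jensen-Shannon divergence between the model's backchannel-timing
    histogram and a ground-truth corpus histogram (FDB v1 uses ICC).
    0 means identical timing distributions."""
    bins = np.linspace(0.0, duration, n_bins + 1)
    p, _ = np.histogram(model_times, bins=bins)
    q, _ = np.histogram(corpus_times, bins=bins)
    return jensenshannon(p, q) ** 2   # scipy returns the JS *distance*
```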

The v1 paper explicitly framed these four axes as a *first step* rather than a complete theory. The field took that invitation literally, and v1 has since spawned three peer-reviewed successors plus several adjacent benchmarks:

- **[FDB v1.5](https://arxiv.org/abs/2507.23159)** (July 2025) adds overlap scenarios: user interruption, user backchannel, talking to others, background speech. v1.5 is the first FDB extension where the model is scored on what happens when two voices are heard simultaneously.
- **[FDB v2](https://arxiv.org/abs/2510.07838)** (October 2025) replaces pre-recorded stimuli with a live WebRTC-style examiner that runs multi-turn tasks under Fast and Slow pacing. It also replaces threshold metrics with an automated LLM examiner.
- **[FDB v3](https://arxiv.org/abs/2604.04847)** (April 2026) reframes the evaluation around three task-level dimensions — Tool-use Performance, Turn-Taking Dynamic, Latency Breakdown — over real human audio annotated for five disfluency categories (fillers, pauses, hesitations, false starts, self-corrections). GPT-Realtime scores under 59% on self-correction scenarios in v3.

{{FIG:f2}}

Adjacent benchmarks fill specific sub-axes that FDB does not cover. [FLEXI](https://arxiv.org/abs/2509.22243) adds a model-initiated emergency interrupt axis — the model must barge in on the user during a safety-critical scenario. [HumDial](https://sites.google.com/view/humdial-2026) pairs emotional intelligence with full-duplex turn-taking in a single ICASSP 2026 grand challenge with 6,356 interruption and 4,842 rejection utterances. [FD-Bench](https://arxiv.org/abs/2507.19040) uses LLM-driven stimulus generation rather than fixed test sets. [SID-Bench](https://arxiv.org/abs/2603.24144) introduces an APT (Accurate and Prompt Termination) metric that penalizes both false alarms and late responses. [MTR-DuplexBench](https://arxiv.org/abs/2511.10262) targets multi-round dialogues.

<div class="callout dark">
<span class="label">the single detail that matters</span>

**Four distinct metrics now share the name "barge-in latency."** FDB v1 measures latency-to-next-response. SID-Bench's APT is a composite false-alarm-plus-late-response penalty. Chronological Thinking and SALM-Duplex measure the time from user interrupt <em>start</em> to agent <em>stopping speech</em>. SALM-Duplex also reports barge-in <em>success rate</em> as the percentage of cases where the agent stops within 1.5 seconds. A paper reporting "barge-in latency of 0.69s" could mean any of these four, and the numbers are not comparable.

</div>
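
The ambiguity is easiest to see in code. The sketch below computes all four metrics from one hypothetical event log; the `Episode` fields and the APT-style composite are illustrative stand-ins, not the benchmarks' actual schemas or formulas.

```python
# Four metrics that currently share the name "barge-in latency", computed
# from one hypothetical event log. The Episode fields and the APT-style
# composite below are illustrative stand-ins, not the benchmarks' schemas.
from dataclasses import dataclass

@dataclass
class Episode:
    interrupt_start: float       # user begins cutting in (s)
    agent_speech_stop: float     # agent's audio actually goes silent (s)
    agent_next_response: float   # agent's next relevant response begins (s)
    false_alarm: bool            # agent stopped when it should not have

def fdb_v1_latency(ep: Episode) -> float:
    """FDB v1: interrupt start to the agent's next response."""
    return ep.agent_next_response - ep.interrupt_start

def stop_latency(ep: Episode) -> float:
    """Chronological Thinking / SALM-Duplex: interrupt start to the
    moment the agent stops speaking."""
    return ep.agent_speech_stop - ep.interrupt_start

def barge_in_success_rate(eps: list[Episode], limit: float = 1.5) -> float:
    """SALM-Duplex: share of episodes where the agent stops within 1.5 s."""
    return sum(stop_latency(e) <= limit for e in eps) / len(eps)

def apt_style_score(eps: list[Episode], late: float = 1.5) -> float:
    """SID-Bench's APT *idea*: reward stopping that is both accurate (no
    false alarms) and prompt (not late). The real formula differs."""
    ok = sum((not e.false_alarm) and stop_latency(e) <= late for e in eps)
    return ok / len(eps)

ep = Episode(interrupt_start=0.0, agent_speech_stop=0.7,
             agent_next_response=1.9, false_alarm=False)
print(fdb_v1_latency(ep), stop_latency(ep))   # 1.9 vs 0.7 -- same episode
```

The same episode yields 1.9 s under the FDB v1 definition and 0.7 s under the stop-latency definition, which is the whole problem.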

For a buyer, the takeaway is: when a vendor says they post SOTA on full-duplex, ask *which version of Full-Duplex-Bench, which axis, which barge-in definition.*

## The reasoning anchor and the commercial bridge

Two benchmarks do most of the work in commercial STS launches: Big Bench Audio and the Artificial Analysis S2S leaderboard that implements it.

Big Bench Audio is straightforward. HuggingFace released it in December 2024 as a 1,000-item audio adaptation of existing BIG-Bench reasoning tasks. The judge is Claude 3.5 Sonnet (Oct '24), kept frozen so scores stay comparable.
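
The scoring loop is simple enough to sketch. In the sketch below, `call_judge` is a placeholder for whatever client the harness uses against the pinned judge snapshot, and the prompt wording is ours, not the benchmark's actual template.

```python
# Minimal sketch of a frozen-judge accuracy loop in the Big Bench Audio
# style. `call_judge` is a placeholder for the harness's judge client;
# the prompt wording is illustrative, not the benchmark's template.

JUDGE_PROMPT = """Question: {question}
Reference answer: {reference}
Candidate answer: {answer}
Is the candidate answer correct? Reply with exactly CORRECT or INCORRECT."""

def call_judge(prompt: str) -> str:
    """Placeholder: send `prompt` to the frozen judge (a pinned
    Claude 3.5 Sonnet snapshot, per the benchmark) and return its reply."""
    raise NotImplementedError

def big_bench_audio_accuracy(items: list[dict]) -> float:
    """items: [{"question": ..., "reference": ..., "answer": ...}, ...]
    where `answer` is the transcript of the model's spoken reply."""
    correct = 0
    for item in items:
        verdict = call_judge(JUDGE_PROMPT.format(**item)).strip().upper()
        correct += verdict.startswith("CORRECT")
    return correct / len(items)   # the leaderboard's headline percentage
```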

The [Artificial Analysis S2S leaderboard](https://artificialanalysis.ai/speech-to-speech) is the bridge. Artificial Analysis is an independent analyst firm; its S2S product evaluates native audio models on two axes — Speech Reasoning (implementing Big Bench Audio) and Conversational Dynamics (a subset of FDB v1 + v1.5, run by Artificial Analysis rather than the FDB authors). It is the single most-cited commercial speech leaderboard as of April 2026.

The top of the Big Bench Audio ranking as of April 2026:

1. Step-Audio R1.1 (Realtime) — 97.0%
2. Gemini 3.1 Flash Live Preview (High) — 95.9%
3. Grok Voice Agent — 92.9%
4. Gemini 2.5 Flash Native Audio Dialog Thinking — 90.7%
5. Nova 2.0 Sonic (March 2026) — 88.1%

OpenAI posted its GPT-Realtime 83% number directly. Amazon posted Nova Sonic 87.1%. StepFun posted 97.0% for Step-Audio R1.1. Google posted 92% for Gemini 2.5 Native Audio Thinking. Each citation is a tweet from the Artificial Analysis account that the lab quoted. The pattern is unmistakable: commercial labs cite one number, Big Bench Audio via Artificial Analysis, when they launch a new STS model. That one number is doing a lot of work.

<p class="aside-inline">
<span class="aside-lbl">aside</span>
Three caveats are worth naming. The leaderboard is <b>not reproducible</b> without access to the proprietary runner and prompt templating. The exact weighting across Conversational Dynamics sub-axes is opaque. And the fixed Claude 3.5 Sonnet judge means the scores drift the day that judge is retired. These are not design failures — they are structural properties of a privately-run evaluation that every model lab treats as a public number.
</p>

{{FIG:f3}}

## The coverage heatmap

The central visual of this article is a benchmark-by-capability-axis coverage heatmap. The rows are thirty speech-interaction benchmarks, grouped by which capability family they primarily serve. The columns are fifteen capability axes — the most fine-grained disaggregation of "what an STS model could be scored on" that the public literature currently supports.

The fifteen columns:

1. **Latency** (first-word time, end-to-end time)
2. **Turn-taking** (when to start)
3. **Backchannel** (short affirmative sounds at the right moment)
4. **Interruption handling** (yielding when cut in)
5. **Pause handling** (staying quiet during mid-thought pauses)
6. **Overlap** (simultaneous speech)
7. **Tool use** (chained API calls by voice)
8. **Multi-turn consistency** (entity tracking, correction, memory across turns)
9. **Instruction following** (doing what the user asked)
10. **Speech reasoning** (math, logic, structured reasoning)
11. **Paralinguistic input** (hearing emotion, intent, ambient sound)
12. **Paralinguistic output** (producing appropriate prosody / affect)
13. **Naturalness / MOS** (subjective listener quality)
14. **Safety / emergency** (model-initiated interrupt in safety-critical moments)
15. **Multilingual** (non-English coverage)

{{FIG:f4}}

The heatmap is readable in three passes.

**Pass one — rows.** Almost no row is fully green. Full-Duplex-Bench v2 lights up six columns (turn-taking, backchannel, interruption, pause, multi-turn, overlap) and stops. Big Bench Audio lights up one (speech reasoning). VoiceBench lights up three (task competence axes). Artificial Analysis S2S is the only row that spans both the reasoning column and the conversational-dynamics columns, which is why commercial labs cite it. Voice Showdown is unusual — it is scored subjectively, so every column it touches is yellow rather than green.

**Pass two — columns.** The thinnest columns are paralinguistic output, safety / emergency, and multilingual. Paralinguistic output is covered directly only by MOS-style scoring (VocalBench, MTalk-Bench, J-Moshi's subjective protocol); everything else is indirect. Safety / emergency is FLEXI alone; no other benchmark scores the behavior of a model that should barge in on a user for safety reasons. Multilingual is covered by VocalBench-zh and CS3-Bench for Mandarin, HumDial Track II for Chinese + English, and *nothing* for Japanese full-duplex — J-Moshi explicitly uses subjective MOS only.

**Pass three — diagonals.** If you draw a diagonal from FDB v1 (turn-taking, backchannel, interruption, pause) down through FDB v1.5 (adds overlap) to FDB v2 (adds multi-turn) to FDB v3 (adds tool use, latency breakdown), you can watch the full-duplex axis widen in real time across thirteen months of 2025-2026 publishing. The reasoning axis does not move in parallel. Big Bench Audio has no v2 or expansion on the public roadmap. That asymmetry — rapid widening on conversational dynamics, stasis on reasoning — is a structural feature of the field as of April 2026.

A buyer reading the heatmap can pick three or four benchmarks that jointly cover the capabilities their product needs, rather than hunting for one score that does all of it. A researcher reading the same heatmap can find a column with one green cell and treat that as a publishable gap. Both uses are legitimate. The map is designed to support both.
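
That buyer-side composition step is, formally, a small set-cover problem. A minimal greedy sketch, using an illustrative excerpt of the heatmap's coverage sets rather than the full table:

```python
# Buyer-side sketch: greedily pick benchmarks until the capability
# columns a product needs are all covered. The coverage sets below are a
# small illustrative excerpt of the heatmap, not its full contents.

COVERAGE = {
    "Big Bench Audio":      {"speech_reasoning"},
    "Full-Duplex-Bench v2": {"turn_taking", "backchannel", "interruption",
                             "pause", "multi_turn", "overlap"},
    "VoiceBench":           {"instruction_following", "speech_reasoning",
                             "multi_turn"},
    "FLEXI":                {"safety_emergency", "interruption"},
    "VocalBench":           {"paralinguistic_output", "naturalness"},
}

def pick_benchmarks(needed: set[str]) -> list[str]:
    """Greedy set cover: repeatedly take the benchmark covering the most
    still-uncovered axes. Not optimal in general, but fine at this scale."""
    chosen, remaining = [], set(needed)
    while remaining:
        best = max(COVERAGE, key=lambda b: len(COVERAGE[b] & remaining))
        gained = COVERAGE[best] & remaining
        if not gained:
            raise ValueError(f"No benchmark covers: {remaining}")
        chosen.append(best)
        remaining -= gained
    return chosen

# e.g. a support-bot team needing reasoning + interruption + safety:
print(pick_benchmarks({"speech_reasoning", "interruption",
                       "safety_emergency"}))
# -> ["FLEXI", "Big Bench Audio"]
```

Greedy cover is not guaranteed optimal, but at thirty rows and fifteen columns the difference is academic; the point is that coverage composition becomes mechanical once the heatmap exists.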

### Five axes nothing scores yet

The live version of this heatmap on the [/benchmarks page](/benchmarks) adds five extra columns past the fifteen above — rendered as striped, unexplored cells. They are axes that already exist as evaluation targets in the text-LLM or ASR/TTS literature, but that no public STS benchmark (cascade or full-duplex) scores today:

1. **Code-switch** — single-turn mixing of two languages (Hinglish, Spanglish, JP⇄EN). CS3-Bench touches Mandarin↔English only.
2. **Long-form memory** — entity and topic tracking across thirty-minute-plus conversations. Text-LLM harnesses like LongBench measure this on transcripts; no STS benchmark does it from audio.
3. **Emotion regulation** — the model's ability to *modulate* its own affect in response to the user's (e.g. de-escalate instead of matching anger). Paralinguistic output benchmarks score naturalness of the affect, not its appropriateness.
4. **On-device / edge** — latency, memory, and quality degradation when the model runs on CPU or mobile silicon. Relevant as Pocket-TTS-class models appear on-device; no shared held-out evaluation exists.
5. **Audio adversarial** — robustness under codec artifacts, room noise, and deliberate waveform attacks. ASR has years of robustness-challenge precedent here (the CHiME series, NIST evaluations); STS inherits none of it yet.

We flag these not as predictions but as the smallest set of axes a team publishing a new benchmark in 2026 could pick from and land something structurally new. [The next dispatch](/blog/why-new-benchmarks) argues for which two or three of these we think are actually buildable this year.

## Language coverage — English dominance and the Japanese gap

If the four capability axes are the x-dimension of the landscape, language is the y-dimension, and it is unusually skewed.

**English.** SUPERB, Dynamic-SUPERB, AudioBench, VoiceBench, URO-Bench, all four FDB versions, FLEXI, Big Bench Audio, Artificial Analysis, τ-Voice, SID-Bench, FD-Bench, MTR-DuplexBench. Effectively every benchmark in this article's heatmap has an English evaluation track, usually as the default or only track.

**Mandarin.** [VocalBench-zh](https://arxiv.org/abs/2511.08230) (November 2025, 10 subsets, 10K instances, 14 models evaluated) is the general-purpose Mandarin STS benchmark. [CS3-Bench](https://arxiv.org/abs/2510.07881) (October 2025) specifically measures Mandarin-English code-switching; its headline is that S2S models drop ~66% relative on code-switched inputs versus monolingual ones. [HumDial](https://sites.google.com/view/humdial-2026) Track II covers Chinese and English.

**Japanese.** No dedicated full-duplex benchmark exists as of April 2026. [J-Moshi](https://aclanthology.org/2024.emnlp-main.1234/), the Japanese open-weights full-duplex model, uses subjective MOS-based evaluation rather than a shared held-out test set. There is no equivalent of FDB v1 for Japanese audio. The closest substitute is to run the English FDB stimuli through a Japanese-capable model, which does not score Japanese-specific turn-taking conventions (heavier backchanneling, different pause semantics, different repair patterns).

<div class="callout">
<span class="label">the japanese gap</span>

English and Mandarin each have at least one public STS benchmark. Japanese has zero full-duplex benchmarks and a single MOS-based subjective protocol. A Japanese product team comparing two STS vendors today has no shared number to point at — not because the vendors are hiding, but because the measurement layer does not exist.

</div>

**Other languages.** Arabic, Hindi, Spanish, Portuguese, French, German, Russian, Korean — none have a dedicated full-duplex benchmark. Most have ASR benchmarks, some have TTS benchmarks, a few have speech-LLM evaluations, but the joint question "how well does a full-duplex STS model hold a conversation in this language" has no public answer.

{{FIG:f5}}

The multilingual gap is the single largest coverage hole in the map.

## Who cites what

The last structural pattern worth naming is which benchmarks flow into which release channels.

**Academic papers** cite FDB (v1, v1.5, v2, v3), VoiceBench, URO-Bench, SD-Eval, MTalk-Bench when they propose new methods. The citation list on an arXiv speech-LLM paper routinely runs to a dozen benchmarks. These are read by researchers.

**Commercial launches** cite a much narrower set. Across the ten highest-profile STS releases from Q4 2024 through Q1 2026 — OpenAI GPT-4o Realtime, Google Gemini 2.5 Native Audio, Google Gemini 3.1 Flash Live, Amazon Nova Sonic, Amazon Nova 2.0 Sonic, StepFun Step-Audio R1, StepFun Step-Audio R1.1, xAI Grok Voice, Mistral Voxtral, Microsoft MAI-Voice-1 — the benchmark citations cluster into three buckets: Big Bench Audio (via Artificial Analysis), the OpenAI Voice Agent Benchmark, and Artificial Analysis' Conversational Dynamics composite.

{{FIG:f6}}

A few launches cite *no* benchmark. Sesame's [CSM launch](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice) cited transcripts and demo audio rather than scores. Moshi's [original launch](https://arxiv.org/abs/2410.00037) cited FDB v1 because the same team authored both. PersonaPlex's release cited internal voice-agent evals that are not reproducible.

The academic-commercial gap here is real, but it is not unbridged. Artificial Analysis is the bridge. Big Bench Audio flows into commercial launches *through* Artificial Analysis. FDB v1 flows into commercial launches *through* Artificial Analysis' Conversational Dynamics composite. The bridge is proprietary, which means the field has roughly one gateway between its academic evaluation engine and its commercial information diet. If that gateway changes methodology or weighting, the commercial scoreboard moves with it.

For a buyer, the takeaway is to read *both* sides. The commercial benchmarks are the ones vendors will cite. The academic benchmarks are the ones that actually test behavior a vendor might have hand-tuned for. Neither alone is enough.

## Where this lands

Four summary claims follow from the map.

First, **no existing benchmark scores a complete production voice agent.** A buyer has to compose coverage from four or five benchmarks across axes A through D.

Second, **the commercial information diet is narrower than the benchmark landscape itself.** Roughly three citations do most of the work in STS launch posts. Artificial Analysis is the single gateway.

Third, **full-duplex behavior has deepened into a four-version family plus six adjacent benchmarks**, each measuring a distinct sub-axis, with at least four different things sharing the name "barge-in latency."

Fourth, **the thinnest axes are paralinguistic output, safety-assertive behavior, and multilingual coverage** — and multilingual is a global gap, not a Japanese-only one. Japanese full-duplex has no dedicated benchmark at all.

[Article 04](/blog/data-ceiling) covered the data-supply side of the evaluation gap. [Article 05](/blog/foundation-before-vertical) covered the timing argument. [Article 07](/blog/why-new-benchmarks) picks up where this map ends: given the coverage holes named above, what would a next-generation STS benchmark need to measure, and who is positioned to build it? [Article 08](/blog/sts-model-landscape) then covers which models score where.

---

**Fullduplex is open to benchmark collaboration on the thin axes.** Multilingual full-duplex and paralinguistic output are the two we think are buildable in 2026 given the data supply we are operating with. If your lab is working on either, [get in touch](mailto:hello@fullduplex.ai).

---

_Originally published at [https://fullduplex.ai/blog/benchmark-landscape](https://fullduplex.ai/blog/benchmark-landscape)._
_Part of **The STS Series** · 06 / 10 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
