Signals · 2026-W23 — Fullduplex

What happened this week

Note: this is a backfill issue. The scheduled task missed the June 1 publish slot, so W23 is being reconstructed retrospectively. The window itself still produced material: three threads that build on prior issues.

Streaming SpeechLLM translation, without cross-attention

DOA (Papi, Bentivogli) takes on a specific question that W22's Samsung streaming SpeechLLM translator left open. State-of-the-art SimulST policies have relied on attention-based encoder-decoder models, where cross-attention provides explicit alignment signals, but SpeechLLMs are decoder-only. Does decoder self-attention contain alignment stable enough to guide a streaming read-or-write policy? The paper proposes DOA, a training-free decoder-only attention policy, and reports that long-form simultaneous translation with SpeechLLMs holds up under it. Combined with W22's Samsung paper, the open SimulST stack on a SpeechLLM base now has a clean training-free policy.
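To make the read-or-write question concrete, here is a minimal sketch of an attention-guided streaming policy, assuming we can read one row of decoder self-attention restricted to the audio positions. It is not DOA's actual rule, which this issue does not spell out. The name `decide`, the `tail_frames` window, and the `threshold` value are illustrative assumptions, and the attention rows are made up.

```python
from typing import Sequence

READ, WRITE = "READ", "WRITE"


def decide(attn_over_audio: Sequence[float],
           tail_frames: int = 2,
           threshold: float = 0.3) -> str:
    """Toy read-or-write decision for one candidate output token.

    attn_over_audio: attention weights from the candidate token to each
    audio frame received so far (hypothetical input, oldest frame first).
    If too much attention mass sits on the newest frames, the token may
    depend on audio that is still arriving, so we READ more input.
    """
    total = sum(attn_over_audio)
    if total <= 0 or len(attn_over_audio) <= tail_frames:
        return READ  # not enough audio context to commit to a token
    tail_mass = sum(attn_over_audio[-tail_frames:]) / total
    return READ if tail_mass >= threshold else WRITE


# Made-up attention rows: one anchored in older audio, one in the newest frames.
anchored_early = [0.40, 0.35, 0.15, 0.06, 0.04]
anchored_late = [0.05, 0.05, 0.10, 0.30, 0.50]

print(decide(anchored_early))  # tail mass 0.10 is below the threshold
print(decide(anchored_late))   # tail mass 0.80 is above the threshold
```

The open question the paper addresses is whether self-attention rows like these are stable enough in a decoder-only model for any such threshold to be meaningful.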

Why Can't They Remember? introduces EnvMem, a controlled multi-turn benchmark for studying the gap between semantic (speech) and acoustic (non-speech) understanding in LALMs across turns. The paper localises failure modes to the representation level (latent embeddings) and the retrieval level (attention allocation), which is useful for anyone trying to debug why their voice agent forgets non-speech context across turns.

VoxParadox (Pang, Chaubey, Soleymani) is the paralinguistic counterpart. The benchmark uses controlled speech synthesis to deliberately mismatch transcript claims and speaking style across 10 paralinguistic tasks (2,000 verified examples). Audio LLMs evaluated on it score consistently low against the acoustic ground truth when the transcript pulls in a different direction, which supports the view that the modality-gap papers from W20 (TextPro-SLM, MSEB) were not overstating the problem.
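One way to read results from a mismatch benchmark of this kind is to split model answers into those that follow the audio and those that follow the transcript. The sketch below assumes hypothetical record fields (`acoustic_label`, `transcript_label`, `prediction`) and made-up rows. VoxParadox's own data format and metrics are not described in this issue.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def mismatch_breakdown(records: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Fraction of predictions that match the acoustic label, the
    transcript-implied label, or neither (field names are hypothetical)."""
    counts: Counter = Counter()
    for r in records:
        if r["prediction"] == r["acoustic_label"]:
            counts["acoustic"] += 1
        elif r["prediction"] == r["transcript_label"]:
            counts["transcript"] += 1
        else:
            counts["other"] += 1
    n = sum(counts.values())
    keys = ("acoustic", "transcript", "other")
    if n == 0:
        return {k: 0.0 for k in keys}
    return {k: counts[k] / n for k in keys}


# Made-up rows: the speaker sounds angry while the words claim calm, and so on.
rows = [
    {"acoustic_label": "angry", "transcript_label": "calm", "prediction": "calm"},
    {"acoustic_label": "sad", "transcript_label": "happy", "prediction": "happy"},
    {"acoustic_label": "fast", "transcript_label": "slow", "prediction": "fast"},
    {"acoustic_label": "whisper", "transcript_label": "shout", "prediction": "normal"},
]
print(mismatch_breakdown(rows))
```

A high `transcript` share under mismatch is the signature of a model that answers from the words rather than the audio.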

Audio safety, formalised

Audio Jailbreaks in LALMs is the unifying taxonomy paper. It organises prior work into semantic, acoustic, signal, and internal-representation threat classes, then runs a controlled empirical evaluation under a single threat model and cost-aware protocol. With W21's Acoustic Interference (paralinguistic priors) and W22's SpeechJBB (code-switched speech) as recent feeders, this paper consolidates the audio-safety literature into a single comparison grid. The cost-aware evaluation framing is the part to take seriously: jailbreaks that succeed but require thousands of dollars of compute matter less than ones that succeed at near-zero cost.
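The cost-aware framing can be illustrated with a toy metric: dollars spent per successful jailbreak. This is a sketch under stated assumptions, not the paper's protocol. The `AttackResult` fields, the `rank_by_cost` helper, and every number below are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AttackResult:
    name: str
    attempts: int
    successes: int
    cost_per_attempt_usd: float  # hypothetical compute cost of one attempt

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0

    @property
    def cost_per_success_usd(self) -> float:
        # An attack that never succeeds has unbounded cost per success.
        if self.successes == 0:
            return float("inf")
        return self.attempts * self.cost_per_attempt_usd / self.successes


def rank_by_cost(results: Iterable[AttackResult]) -> List[AttackResult]:
    """Cheapest route to a successful jailbreak first."""
    return sorted(results, key=lambda r: r.cost_per_success_usd)


# Made-up numbers for illustration only.
results = [
    AttackResult("optimised-perturbation", attempts=10, successes=9, cost_per_attempt_usd=500.0),
    AttackResult("spoken-rephrase", attempts=100, successes=20, cost_per_attempt_usd=0.01),
    AttackResult("noise-overlay", attempts=50, successes=0, cost_per_attempt_usd=0.05),
]
for r in rank_by_cost(results):
    print(f"{r.name}: rate={r.success_rate:.2f}, usd/success={r.cost_per_success_usd:.2f}")
```

Ranked this way, the low-success but near-free attack comes first, which is the ordering the cost-aware framing argues defenders should care about.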

Speech-LM tokenization, deflated

The WER Trap (Zhang, Li, Zhang) pushes back on a community assumption: that low-WER tokens from Whisper-style tokenizers inherently preserve enough information for intelligible acoustic synthesis. The paper argues that high-frequency tokens succeed at generation due to implicit information leakage, not because the tokens themselves are good representations of speech. Implication for Family 2 (interleaved-flatten) SLMs: the apparent generation quality may be papering over a representation gap the WER metric cannot see.

Product — Pipecat 1.3.0 multi-agent

Pipecat 1.3.0 is the headline platform release of the week. Pipecat pipelines become multi-agent compatible by default: every PipelineWorker (formerly PipelineTask) becomes a peer on a shared bus that passes typed messages, dispatches @job work, and coordinates with siblings, while existing single-pipeline code keeps running untouched. The matching examples ship LLM handoff, parallel debate, sidecar code assistants and hardware controllers, distributed deployments over Redis or PGMQ, and WebSocket proxies. The release also introduces UIWorker — an LLM worker that observes and drives a client web UI over the RTVI UI channel, reading accessibility snapshots and routing client UI events to handlers — for voice agents that act on what the user is looking at. Vonage Video Connector transport, Inception Mercury 2 LLM service, Rime coda TTS support, and a plain WebSocket transport for the development runner round out the release.

Product — LiveKit Agents 1.5.14 and 1.5.15

livekit-agents 1.5.14 (May 27) adds the gpt-realtime-2 model string, VAD reset without closing the stream, the Inworld TTS delivery_mode parameter, and a GnaniAI STT plugin. It also ships fixes for the voice flush / clear-buffer race that leaked unplayed transcript on interrupt and for an Anthropic stream retry path. livekit-agents 1.5.15 (May 29) follows with Cartesia ink-2 STT, an AMD no-speech-timer fix that defers the timer until the SIP call is answered, a Respeecher TTS plugin, and the LLM Responses-API tool serialisation pass. Together they tighten the realtime and telephony surface significantly without any single headline feature.

What is not here

No verified in-window dataset drop and no reclassification. Cartesia, Hume, Deepgram Voice Agent, and ElevenLabs Agents shipped no in-window technical changelog items beyond LiveKit-side plugin additions. Microsoft Build 2026 announcements (including MAI-Voice-2 on June 2) fell into W24 and are captured there.

Corrections to hello@fullduplex.ai.
