What happened this week
Four items this week. One is a direct follow-up on turn detection; two are on speech-LM construction; one is a dataset.
Turn-taking — FastTurn
FastTurn unifies streaming CTC decoding with acoustic features to make early turn decisions from partial observations without waiting for a full ASR result. The paper also releases a test set based on real human dialogue — worth flagging because most existing turn-taking test sets are read-aloud corpora or post-hoc annotations on Switchboard-style data.
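The paper's exact decision rule isn't reproduced here, but the core idea can be sketched: combine what the streaming CTC decoder is (not) emitting with a cheap acoustic cue, and declare the turn over before the final hypothesis lands. Everything below (the thresholds, and the choice of blank posterior plus frame energy as the two signals) is an illustrative assumption, not FastTurn's published recipe.

```python
# Hedged sketch: fuse streaming-CTC evidence with acoustic evidence to make an
# early end-of-turn decision on partial observations. Thresholds and feature
# choices are illustrative assumptions, not FastTurn's actual method.
import numpy as np

def early_turn_decision(ctc_blank_post, frame_energy,
                        blank_thresh=0.9, energy_thresh=0.01,
                        min_trailing_frames=20):
    """Return True if the last `min_trailing_frames` frames look like a finished
    turn: the CTC head keeps emitting blanks (no new tokens are being decoded)
    and the audio has dropped to near-silence."""
    if len(ctc_blank_post) < min_trailing_frames:
        return False
    tail_blank = np.asarray(ctc_blank_post[-min_trailing_frames:])
    tail_energy = np.asarray(frame_energy[-min_trailing_frames:])
    decoder_quiet = tail_blank.mean() > blank_thresh   # linguistic evidence
    audio_quiet = tail_energy.mean() < energy_thresh   # acoustic evidence
    return decoder_quiet and audio_quiet

# usage: call once per incoming frame with everything observed so far
# done = early_turn_decision(blank_post_so_far, energy_so_far)
```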
Building a speech-LM on top of a text LLM
Two papers propose cheap recipes for carrying text-LLM capability over into a speech stack:
- Multimodal Depth Up-Scaling inserts new transformer layers into a frozen text LLM and trains only the added layers on speech data (a minimal sketch follows this list). Applied to SmolLM2-360M and 1.7B on 48k hours of English audio, the authors report minimal degradation on text benchmarks alongside reasonable speech understanding. Likely to be copied by teams that want to add speech capability without retraining a whole LM.
- OmniVoice goes the other direction: a diffusion language model trained directly for zero-shot TTS in 600+ languages, with a discrete non-autoregressive head that maps text to multi-codebook acoustic tokens in one shot (a sketch of that head also follows this list). The interesting contribution is skipping the usual text-to-semantic-to-acoustic two-stage pipeline; whether that trades intelligibility for simplicity is the thing to watch.
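For the depth-up-scaling recipe, here is a minimal sketch of the mechanical part: freeze a text LLM, interleave freshly initialised decoder blocks, and leave only those blocks trainable. The insertion schedule (every fourth position) and the reliance on SmolLM2's Llama-style block list are assumptions for illustration; the paper's actual placement and training objective may differ.

```python
# Hedged sketch of depth up-scaling: keep the text LLM frozen, insert new
# trainable transformer blocks, and train only those on speech data.
# Insertion points and base model are assumptions, not the paper's config.
import torch.nn as nn
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")
for p in base.parameters():
    p.requires_grad = False  # freeze every original weight

cfg = base.config
new_layers = nn.ModuleList()
for i, layer in enumerate(base.model.layers):   # Llama-style decoder blocks
    new_layers.append(layer)                    # frozen original block
    if i % 4 == 3:                              # assumed insertion schedule
        extra = type(layer)(cfg, layer_idx=len(new_layers))
        new_layers.append(extra)                # fresh block, trainable by default
base.model.layers = new_layers
cfg.num_hidden_layers = len(new_layers)

trainable = [p for p in base.parameters() if p.requires_grad]
# only the inserted blocks end up in `trainable`; the speech-adaptation loss
# (e.g. next-token prediction over interleaved speech/text tokens) updates these alone.
```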
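And for the OmniVoice-style head, a sketch of what "multi-codebook acoustic tokens in one shot" can look like: upsample text-encoder states to the acoustic frame rate and classify every codebook in parallel, with no autoregressive loop over frames or codebooks. The dimensions, codebook count, and the crude repeat-based length regulation are assumptions; the diffusion-LM backbone the paper builds on is not modelled here.

```python
# Hedged sketch of a one-shot non-autoregressive acoustic head: project text
# states onto a frame grid and predict all codec codebooks in parallel.
import torch
import torch.nn as nn

class NARCodecHead(nn.Module):
    def __init__(self, d_model=512, n_codebooks=8, codebook_size=1024, upsample=4):
        super().__init__()
        self.upsample = upsample
        # one classifier per codebook, applied at every acoustic frame
        self.heads = nn.ModuleList(
            nn.Linear(d_model, codebook_size) for _ in range(n_codebooks)
        )

    def forward(self, text_states):              # (batch, text_len, d_model)
        # crude length regulation: repeat each text state `upsample` times
        frames = text_states.repeat_interleave(self.upsample, dim=1)
        logits = torch.stack([h(frames) for h in self.heads], dim=2)
        return logits                             # (batch, n_frames, n_codebooks, vocab)

head = NARCodecHead()
tokens = head(torch.randn(2, 50, 512)).argmax(-1)  # (2, 200, 8) codec tokens in one pass
```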
Dataset — AffectSpeech
AffectSpeech is a large-scale emotional speech dataset with fine-grained textual descriptions, aimed at emotion captioning and controllable emotional synthesis. The textual-description layer matters because it moves affective control away from the usual categorical labels and toward free-form natural-language prompts, which is useful for anyone evaluating emotion controllability in expressive TTS.
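To make the categorical-versus-textual point concrete, here is a hypothetical record layout; none of the field names or values below come from the dataset's actual schema.

```python
# Hypothetical AffectSpeech-style record. A caption-conditioned TTS model can
# embed the free-form description directly, instead of looking up a one-hot
# vector for the coarse categorical tag.
sample = {
    "audio": "clips/000123.wav",          # hypothetical path
    "transcript": "I can't believe you did that.",
    "emotion_label": "angry",             # coarse categorical tag
    "emotion_caption": "quiet, clipped anger with a tense, controlled pace",
}

def caption_condition(record):
    # conditioning signal for a caption-driven expressive TTS model
    return record["emotion_caption"]
```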
Corrections to hello@fullduplex.ai. Next issue: 2026-W16.