What happened this week
Seven preprints worth forwarding, weighted toward omni-modal flagships and audio-LM evaluation. The Hugging Face / GitHub / lab-blog buckets did not surface a primary-sourced, in-window release that meets the bar (the only candidates were redistributions and quantizations of older weights), so the issue is paper-only.
Omni-modal headlines
MiniCPM-o 4.5 is the most concrete full-duplex omni-modal release of the window. The Omni-Flow framework aligns vision, audio, and text on a shared temporal axis so perception and response stop alternating, and the system can issue proactive comments in the middle of a live scene rather than waiting for an explicit user turn. At 9B parameters total, OpenBMB claim parity with Gemini 2.5 Flash on vision-language and a win over Qwen3-Omni-30B-A3B on omni-modal understanding, while running real-time full-duplex inference in under 12 GB of RAM on edge devices.
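For readers new to the term, "full-duplex" here means perception and generation run concurrently on a shared clock instead of alternating turns. Below is a purely illustrative asyncio sketch of that control flow; it is not the Omni-Flow implementation, and every name in it is a hypothetical stand-in.

```python
# Purely illustrative asyncio sketch of a full-duplex loop: a perception task keeps
# ingesting timestamped frames while a separate response task, on the same clock,
# decides whether to speak. None of this is Omni-Flow; every name is a stand-in.
import asyncio
import time

async def perceive(state):
    while state["running"]:
        state["context"].append({"t": time.monotonic(), "frame": None})  # AV capture stand-in
        await asyncio.sleep(0.04)                                        # ~25 fps

async def respond(state):
    while state["running"]:
        if state["context"] and should_comment(state["context"]):
            print(f"[{time.monotonic():.2f}] proactive comment on {len(state['context'])} frames")
        await asyncio.sleep(0.1)                                         # response-side clock

def should_comment(context):
    return len(context) % 25 == 0            # placeholder trigger, not a real policy

async def main(duration=2.0):
    state = {"running": True, "context": []}
    tasks = [asyncio.create_task(perceive(state)), asyncio.create_task(respond(state))]
    await asyncio.sleep(duration)            # perception and response overlap the whole time
    state["running"] = False
    await asyncio.gather(*tasks)

asyncio.run(main())
```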
Nemotron 3 Nano Omni is NVIDIA's first Nemotron with native audio input alongside text, image, and video. It is built on the 30B-A3B Nemotron 3 Nano backbone with multimodal token-reduction tricks for throughput, and the BF16 / FP8 / FP4 checkpoints are being released along with portions of the training data and code. The headlined wins are document understanding, long-form audio-video comprehension, and agentic computer use.
Step-Audio-R1.5 is the contrarian piece of the omni cluster. StepFun argue that RLVR (the dominant recipe for audio reasoning since 2025) systematically degrades conversational feel: optimizing isolated, verifiable text labels collapses prosody, emotional continuity, and immersion in long-turn dialogue. Their proposed shift is back to RLHF for audio reasoning. The technical report does not yet ship weights, but it is the most pointed argument against the current evaluation lens we have seen this quarter.
Method paper
Continuous diffusion SLM scaling from Apple's foundation-models group is the methodological contribution of the week. The paper introduces a phoneme Jensen-Shannon divergence (pJSD) metric for SLM linguistic quality, then derives scaling laws for both validation loss and pJSD on a continuous-diffusion speech-only language model. Scaled to 16B parameters on tens of millions of hours of conversational audio, the model generates emotive, prosodically rich, multi-speaker, multilingual speech, but long-form coherence remains an open problem.
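The report's exact pJSD construction is not reproduced here; as a rough intuition, the sketch below computes a Jensen-Shannon divergence between phoneme unigram distributions of generated and reference speech. The unigram simplification, the smoothing, and all names are assumptions, not the paper's definition.

```python
import numpy as np
from collections import Counter

def phoneme_distribution(phonemes, vocab):
    """Unigram distribution over a fixed phoneme vocabulary, add-one smoothed."""
    counts = Counter(phonemes)
    freqs = np.array([counts[p] + 1 for p in vocab], dtype=float)
    return freqs / freqs.sum()

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) between two discrete distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical usage: in practice the phoneme sequences would come from a
# phonemizer or phoneme recognizer run over generated and reference speech.
vocab = ["AA", "AE", "AH", "B", "D", "IY", "K", "S", "T"]   # toy vocabulary
generated = ["K", "AE", "T", "S", "IY", "D", "AH"]           # model output, phonemized
reference = ["K", "AE", "T", "S", "AA", "T", "B", "IY"]      # real speech, phonemized
print(jsd(phoneme_distribution(generated, vocab),
          phoneme_distribution(reference, vocab)))
```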
Evaluation
All That Glitters Is Not Audio is a diagnostic on eight LALMs across three benchmarks. The claim is that models retain 60–72 percent of their full audio score with no audio input at all, and that only 3.0–4.2 percent of the audio-dependent items actually require the full clip. This is the kind of finding that changes how we grade /benchmarks entries: the paper closes with concrete guidelines for benchmark design that we should map back onto VocalBench, AIR-Bench, and MMAR.
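The diagnostic is easy to reproduce in spirit: score each item once with the real clip and once with same-length silence, then compare. A minimal sketch, assuming a hypothetical model interface and item format rather than the paper's harness:

```python
import numpy as np

def no_audio_retention(model, items, sample_rate=16000):
    """Fraction of the full-audio score a model keeps when the audio is silenced.

    `model.answer(question, waveform, sample_rate)` and the item dict fields are
    hypothetical stand-ins, not a real benchmark or model API.
    """
    with_audio, without_audio = 0, 0
    for item in items:
        silence = np.zeros_like(item["audio"])      # same-length silent clip
        if model.answer(item["question"], item["audio"], sample_rate) == item["label"]:
            with_audio += 1
        if model.answer(item["question"], silence, sample_rate) == item["label"]:
            without_audio += 1
    return without_audio / max(with_audio, 1)
```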
Datasets and TTS
Two low-resource releases close out the issue. AppTek Call-Center Dialogues is a commissioned, never-before-public, 14-accent English long-form ASR benchmark for conversational AI evaluation, exactly the kind of artifact missing from the long-form ASR shelf. JaiTTS is a Thai voice-cloning TTS adapted from VoxCPM that handles Thai-English code-switching natively, reports a CER of 1.94 percent on short-duration speech (slightly below the human ground-truth CER of 1.98 percent), and wins 283 of 400 pairwise human comparisons against commercial flagships.
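For context on that CER figure, character error rate is just character-level Levenshtein distance divided by reference length. A minimal sketch (not JaiTTS's evaluation code):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    r, h = reference, hypothesis
    dp = list(range(len(h) + 1))                     # distances against the empty reference
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # delete a reference character
                        dp[j - 1] + 1,                  # insert a hypothesis character
                        prev + (r[i - 1] != h[j - 1]))  # substitute (or match)
            prev = cur
    return dp[len(h)] / max(len(r), 1)

print(cer("สวัสดีครับ", "สวัสดีครับผม"))   # toy Thai example, not from the paper -> 0.2
```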
What is not here
No open-weights drop or lab-blog release surfaced inside the window with a primary source we could verify. MiniCPM-o 4.5 weights were uploaded in February, Step-Audio-R1.5 weights are not yet on Hugging Face, and Nemotron 3 Nano Omni checkpoints landed on April 20–24, just before the window opened.
Corrections to hello@fullduplex.ai.