Fullduplex/
the verticals · v12 / 17 · #nii #j-moshi #japan · 8 sections · 8 figures

NII, Nagoya, and J-Moshi: the morning academia shipped a Japanese listen-while-speaking AI.

On February 25, 2026, NII published LLM-jp-Moshi-v1 — the first commercially usable Japanese full-duplex STS, under Apache 2.0. With no commercial frontier voice-AI lab in the country, this profile traces how a national institute, a PhD thesis, and a University of Tokyo data lab locked together as a tripod and shipped a release that 25 years of academic infrastructure made possible.

verticals · v12 of 17 · subject profile
A national institute, a PhD thesis, and a University of Tokyo data lab locked together as a tripod and shipped Japan’s first commercially usable full-duplex STS — 25 years of academic infrastructure crystallizing into a single Apache 2.0 release.
subject: LLM-jp-Moshi-v1 · NII + Nagoya + U-Tokyo · Feb 25, 2026 · Apache 2.0 · ~1,000 h fine-tune · 344 h stereo upstream

1. February 25, 2026. Apache 2.0, 1,000 hours, and 25 years

On the morning of February 25, 2026, in Chiyoda, Tokyo, the National Institute of Informatics (NII) quietly published something very consequential. A Japanese spoken-dialogue model called LLM-jp-Moshi-v1 landed on Hugging Face under the Apache 2.0 license. The NII press release called it “a commercially usable simultaneous two-way Japanese spoken-dialogue model” and marked it as a world first. Weights, inference code, and training-data composition all arrived on the same day.

The real headline of this article is not the model’s quality. It is the license.

Apache 2.0 sits at the permissive end of the software license spectrum. It is close to saying “use it as you wish; redistribution and commercial deployment are fine.” Most prior Japanese speech releases carried a “research use only” carve-out, so running them inside a contact center meant opening a separate licensing conversation with a university. With an unrestricted Apache 2.0 stamp on it, LLM-jp-Moshi-v1 is on record as the first “anyone can commercially deploy a Japanese listen-while-speaking AI” model.

Listen-while-speaking is what full-duplex means. Concretely, it is the mode of a telephone call or an in-person conversation, in which backchannels, interruptions, and silence-fills all happen simultaneously. Half-duplex is the walkie-talkie world of “over” and “roger”; full-duplex is the telephone world of overlapping speech. Where older voice assistants lived in a “wait for the beep, then speak” world, the Moshi family is the first set of public models to step into the world where overlap is allowed.
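The contrast is easy to state in code. The sketch below is purely conceptual: asyncio tasks stand in for real audio capture and playback, and listen/speak are hypothetical stand-ins; no Moshi API is implied.

```python
import asyncio

async def listen(seconds: float) -> str:
    await asyncio.sleep(seconds)      # stand-in for capturing audio
    return "user audio"

async def speak(seconds: float) -> None:
    await asyncio.sleep(seconds)      # stand-in for playing audio

async def half_duplex() -> None:
    # Walkie-talkie world: strictly alternate. Overlap is structurally
    # impossible, so backchannels and interruptions cannot exist.
    heard = await listen(1.0)
    await speak(1.0)

async def full_duplex() -> None:
    # Telephone world: both directions run concurrently, so an overlapping
    # backchannel can land while the model is still speaking.
    heard, _ = await asyncio.gather(listen(1.0), speak(1.0))

asyncio.run(full_duplex())
```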

The training recipe is simple. On top of the Moshi foundation model released by Paris-based Kyutai, NII’s public statement describes fine-tuning on roughly 1,000 hours of Japanese casual dialogue. Fine-tuning is the process of adding domain knowledge to a pre-built large model after the fact. Grounding the base in everyday conversation rather than keigo-heavy customer service reflects the builders’ design philosophy, which is to align a conversational baseline first and let anyone stack domain specialization on top. Within 48 hours of release, an AICU review observed that natural Japanese dialogue held up even at low latency and low bitrate.
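For readers who want to poke at the artifact itself, the release is an ordinary Hugging Face repository. A minimal sketch of fetching it with huggingface_hub follows; the repo id is an assumption for illustration, so check the LLM-jp organization page for the published name.

```python
from huggingface_hub import snapshot_download

# Hypothetical repo id; the release points at the LLM-jp organization on
# Hugging Face, but the exact repository name may differ.
local_dir = snapshot_download(
    repo_id="llm-jp/llm-jp-moshi-v1",
    revision="main",
)
# local_dir now holds the weights and the Apache 2.0 license file
# that shipped on release day.
print(local_dir)
```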

Now rewind the clock.

Keiichi Tokuda has led the design of HTS, an HMM-based speech synthesis system, at Nagoya Institute of Technology (NITech) since 1999. Before WaveNet arrived, HTS was the default of the text-to-speech (TTS) research world. His Google Scholar citation count stands at 21,713 as of April 2026. IEEE Fellow in 2014. Purple Ribbon award (Shijuhosho, Japan’s honor for contributions in academia and the arts) in 2020. Tokuda has stood at the center of the family tree of Japanese speech research for 25 years.

And Atsumoto Ohashi is the first author of the J-Moshi paper (arXiv 2506.02979) presented at Interspeech in June 2025. A PhD student in Nagoya University’s Higashinaka Lab, he turned his dissertation project into the first public implementation of a Japanese full-duplex STS. One graduate student, four months of work, 344 hours of stereo audio, and the design choice that made Moshi’s weights portable (transferable to other languages). This is the direct upstream of LLM-jp-Moshi-v1.

Japanese full-duplex AI was not built by one commercial startup. It crystallized as a single release, on top of 25 years of academic infrastructure, when a national research institute, a graduate student, and a University of Tokyo data lab locked together as a tripod.

Three points organize the argument.

Point 1 — the 25-year baton

NITech HTS (started 1999) and NII SRC (started 1987) form speech-research and corpus-distribution infrastructure that has been running continuously for 30 to 40 years.

Point 2 — a four-month thesis implementation

Because Moshi was released in September 2024, a Japanese port shrank to a scope one PhD student could finish inside a dissertation timeline.

Point 3 — commercial licensing by a collective ten months later

NII re-implemented the J-Moshi recipe inside the LLM-jp collaborative and published under Apache 2.0. Without all three, the morning of February 25, 2026 does not happen.

fig.f1 · release at a glance
LLM-jp-Moshi-v1, released 2026-02-25 by NII under Apache 2.0. First commercially usable Japanese full-duplex STS on record.
What shipped: Apache 2.0 license with no research-use carve-out; base model Kyutai Moshi (CC-BY 4.0 weights); ~1,000 h Japanese casual dialogue fine-tune corpus; shipped by the LLM-jp collaborative (NII + U-Tokyo + Kyoto + Tohoku); ~18 months after Kyutai Moshi (Sep 2024 → Feb 2026).
Prior Japanese academic default: research-use carve-out typical; single-lab paper release; commercial use requires a separate licensing conversation; typically single PI / lab authorship; downstream is benchmark and paper, rarely a deployed product.
Figure F1. LLM-jp-Moshi-v1 at a glance. The concrete contents of the February 25, 2026 release, contrasted with the prior default for Japanese academic speech releases.

2. The distribution infrastructure that started in 1987, and the last piece the SRC never carried in 30 years

The continuity of Japanese speech research starts in 1987.

That year, NII’s Speech Resources Consortium (SRC) began distributing Japanese speech corpora. The SRC predates NII itself (founded 2000) and was carried into NII at founding. In one line, the SRC is the post office for Japanese speech corpora. Individual labs deposit the data they build, and anyone who wants to use it checks it out under a formal license. In the English-speaking world, the LDC (Linguistic Data Consortium, based at the University of Pennsylvania) plays the same role. The SRC is the Japanese counterpart, one size smaller.

Neighboring institutions stack on top.

Nagoya Institute of Technology (NITech) has maintained Open JTalk and HTS, the global reference implementations for HMM-based speech synthesis before WaveNet. Open JTalk still runs today on embedded devices that predate smartphones. Julius is the open-source Japanese speech recognition engine that Akinobu Lee has maintained at NITech since around 1997. Together with Open JTalk, this makes NITech the host of the longest continuously maintained Japanese open-source speech stack in the world.

The University of Tokyo’s Saruwatari Lab supplies the modern Japanese data layer. JSUT (2017, 10 hours, single speaker, read speech). JVS (2019, 30 hours, 100 speakers, three styles). And J-CHAT (2024, arXiv 2407.15828), 69,000 hours, mono, sourced from podcasts and YouTube. In seven years, public Japanese speech data grew by more than two orders of magnitude. That is the thickness of the family tree. As of 2026, J-CHAT is the largest non-English dialogue corpus in the world.

NINJAL (the National Institute for Japanese Language and Linguistics) supplies the dialogue layer. The CSJ (Corpus of Spontaneous Japanese) runs about 650 hours, but most of it is monologue. The dialogue portion usable for stereo fine-tuning shrinks to around 12 hours. The CEJC (Corpus of Everyday Japanese Conversation) runs 200 hours of multi-mic recordings, but it is not organized as two-channel capture. BTSJ runs 127 hours as a pragmatics-oriented conversation corpus maintained by NINJAL’s Usami group.

Industry is present. NTT Communication Science Laboratories is the largest industrial speech research group in Japan by headcount. 18 papers accepted at Interspeech 2025. Best Paper Award at ASRU 2025 (NTT announcement). Hirokazu Kameoka’s voice conversion lineage runs long, but its weights are rarely published.

Across 40 years, six or seven institutions stacked, and almost every component of Japanese speech infrastructure ended up in academia. But there is one thing the 30-year SRC distribution network has never carried: two-channel Japanese dialogue audio at Fisher scale (2,000+ hours). Two-channel means stereo-style capture in which speaker A and speaker B are recorded on physically separate microphones so the signals stay separated. This missing piece matters in §4, when we unpack the contents of J-Moshi’s 344-hour stereo mixture.

fig.f2 · Japanese speech dataset inventory
Dataset · hours · channel · license · full-duplex ready
J-CHAT (Saruwatari, 2024) · 69,000 · mono · research; upstream YT/podcast · no (mono)
CSJ (NINJAL) · ~650 total, ~12 dialogue · mixed · NII SRC · partial (12 h)
CEJC (NINJAL, 2022) · 200 · multi-mic · NII SRC, research · partial
BTSJ (NINJAL Usami lab, 2023) · 127 · session-variable · NINJAL, research · partial
CALLHOME Japanese (LDC96S37) · 49 · two-channel · LDC gated · yes
Travel Agency Dialogue (Inaba 2024) · 115 · two-channel Zoom · ACM TALLIP, academic · yes
OGVC (NII SRC) · <20 · session-variable · NII SRC, research · partial
J-Moshi stereo (public portion) · 143 · two-channel · mixed academic + LDC · yes
J-Moshi stereo (in-house Nagoya) · 201 · two-channel · not redistributable · blocked (private)
Figure F2. Japanese speech dataset inventory, sorted by full-duplex training fitness. Of nine listed corpora, only three are clean two-channel.

3. The people who carry Japanese speech AI — a 25-year continuous line from Tokuda to Ohashi

Every researcher we spoke with agrees on one thing: the story of Japanese speech AI turns abstract too fast unless you read it through individual names.

The most globally recognized researcher in NII’s speech group is Junichi Yamagishi. Jointly with the University of Edinburgh’s CSTR, he has served on the organizing committee of the ASVspoof challenge since 2015. ASVspoof is the academic reference benchmark for voice spoofing and deepfake detection. Yamagishi’s voice cloning and anti-spoofing work is one of the most cited Japanese speech-research outputs of the past decade.

Sadao Kurohashi is the Director General of NII. In his Kyoto University years he carried the KNP / JUMAN / Kyoto Corpus lineage, a single thick branch of the family tree of Japanese NLP. In the February 25, 2026 release announcement, the NII official account posted one very simple line.

“A commercially usable simultaneous two-way Japanese spoken-dialogue model,” a world first. — NII official account (X)

Koichi Shinoda is a professor at NII. Before NII, he led the speech information processing group at Tokyo Institute of Technology. He covers the lineage of acoustic modeling and HMMs.

At the Higashinaka Laboratory (Nagoya University, founded 2020, inside the Graduate School of Informatics), Professor Ryuichiro Higashinaka spent 19 years at NTT CS Labs before setting up the lab in Nagoya. Co-chair of the Dialogue System Live Competition since 2018. The lab’s field experience even extends to a guide robot deployment at the NIFREL aquarium in Osaka. On the day of the February 2026 release, Higashinaka wrote on X:

“Trained on data independently collected by LLM-jp, released under a commercially usable license.” — Ryuichiro Higashinaka (X)

The lab’s driving force, J-Moshi first author Atsumoto Ohashi, lists himself as a PhD student in Higashinaka Lab on his site. His Interspeech 2025 co-authors are Shinya Iizuka and Jingjing Jiang. The Nagoya University English press release was even more concrete.

“Our J-Moshi was built in approximately four months by adapting the English model.” — Nagoya University press release

One PhD student’s thesis project shipped the first public Japanese full-duplex STS. That fact sits closer to the center of this story than any institution name.

At NITech, Keiichiro Oura maintains the HTS working group. HTS belongs to the lab lineage of Keiichi Tokuda, and that lineage produced researchers including Heiga Zen, a co-author of the WaveNet paper, in his Google Brain years. Tokuda has been a professor at NITech since 2004. He started HTS in 1999 and kept HMM-based TTS the global standard for more than a decade. The IEEE Fellow 2014 citation reads “for contributions to hidden Markov model-based speech synthesis.” In 2020 he received the Purple Ribbon award (Shijuhosho). His Google Scholar citation count stands at 21,713 as of April 2026. Akinobu Lee is the other pillar at NITech, maintaining Julius since 1997.

At the University of Tokyo, Hiroshi Saruwatari leads the Saruwatari Lab. Over the past nine years he has shipped the JSUT / JVS / J-CHAT lineage. The fact that J-CHAT’s 69,000 hours are mono is exactly the structural reason J-Moshi later needs a separate stereo fine-tune.

At NTT CS Labs, Hirokazu Kameoka leads voice conversion research. NTT’s outputs tend to stay inside papers. Published weights are rare.

LLM-jp-Moshi-v1 is not the product of a single PI. It is a collective output of the LLM-jp collaborative. LLM-jp is the NII-led domestic large-language-model collaborative-development project that started in 2023, with researchers from NII, the University of Tokyo, Kyoto University, Tohoku University, and other institutions. The bet rests not on an abstract “NII” or “Nagoya” but on named individuals and on the assumption that the LLM-jp collaborative keeps convening them. From Tokuda to Ohashi, the person who started designing HMM TTS 25 years ago and the person who shipped Japanese full-duplex in 2025 stand in the same Nagoya research ecosystem. An ekiden sash, the tasuki passed between runners in a Japanese long-distance relay, is the image that fits.

fig.f3 · people who carry the stack
The people who carry the Japanese full-duplex STS stack, grouped by institution:
NII (Tokyo): Junichi Yamagishi (ASVspoof · NII + Edinburgh CSTR); Sadao Kurohashi (DG · Kyoto NLP lineage); Koichi Shinoda (acoustic modeling · ex-Tokyo Tech); LLM-jp collaborative (shipped LLM-jp-Moshi-v1).
Nagoya University (Higashinaka Lab): Ryuichiro Higashinaka (lab founder · DSLC since 2018); Atsumoto Ohashi (J-Moshi first author, PhD).
Nagoya Institute of Technology (NITech): Keiichiro Oura (HTS / Open JTalk maintainer); Akinobu Lee (Julius LVCSR since 1997); Keiichi Tokuda (IEEE Fellow 2014, Purple Ribbon 2020; HTS 1999–; ~21,713 Google Scholar citations).
University of Tokyo Saruwatari Lab: Hiroshi Saruwatari (JSUT · JVS · J-CHAT 69k h).
NTT Communication Science Laboratories: Hirokazu Kameoka (voice conversion, closed output).
A 25-year generational line runs from Tokuda (1999 HTS) to Ohashi (2025 J-Moshi). Green box: load-bearing carrier with a public artifact. Yellow box: closed output or collective-authorship release.
Figure F3. The people who carry the Japanese full-duplex STS stack. Grouped by institution, with each person’s load-bearing artifact attached.

4. How J-Moshi was assembled — the four-month procedure that turned Moshi into Japanese

Tracing J-Moshi’s assembly surfaces the engineering ground under everything argued above. A cooking analogy fits: we walk through the recipe once, asking what ingredients went in, in what order, and in what quantity.

J-Moshi (Interspeech paper, arXiv 2506.02979, nu-dialogue/j-moshi) fine-tunes Moshi in two stages.

Stage 1: pre-train on J-CHAT’s 69,000 hours of mono Japanese dialogue (the base seasoning). Stage 2: fine-tune on 344 hours of stereo mixture (the finishing step). Per Table 2 of the paper, the 344 hours break down as Japanese CallHome, 16 h (LDC, paid); CSJ dialogue subset, 12 h; Travel Agency Dialogue Corpus, 115 h (Inaba et al. 2024); Casual Dialogue Corpus, 148 h (Nagoya in-house); and Consultation Dialogue Corpus, 53 h (Nagoya in-house). Of the 344 hours, 143 are publicly licensable. The remaining 201 (58%) are Nagoya privately collected data.
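As a sanity check on the mixture arithmetic, here is the Table 2 breakdown restated in a few lines of Python, with hours exactly as cited above.

```python
# Table 2 of the J-Moshi paper, restated (hours as cited above).
mixture_h = {
    "CallHome Japanese (LDC, paid)": 16,
    "CSJ dialogue subset": 12,
    "Travel Agency Dialogue (Inaba 2024)": 115,
    "Casual Dialogue (Nagoya in-house)": 148,
    "Consultation Dialogue (Nagoya in-house)": 53,
}

total = sum(mixture_h.values())
private = sum(h for name, h in mixture_h.items() if "in-house" in name)
print(total, total - private, private)           # 344 143 201
print(f"in-house share: {private / total:.0%}")  # 58%
```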

Three points come out of this pipeline.

Point 1 — the four-month window was a direct consequence of the foundation model being open

Per the Nagoya press release, adapting Moshi to Japanese took about four months. Before Moshi’s September 2024 release, no public foundation model could have been ported at that price (four months of one graduate student’s labor). In other words, Moshi’s release is what pulled a Japanese port inside a dissertation timeline, and that time compression is the mechanism by which J-Moshi stood up as a thesis project.

Point 2 — the two-stage structure is engineering evidence for Articles 04 and 05

J-CHAT’s 69,000 mono hours are the largest non-English dialogue corpus in the world, larger than the sum of all public English dialogue corpora. But that alone is not enough to train full-duplex behavior (the listen-while-speaking action). The 344 stereo fine-tune hours are only 0.5% of the pre-training pool, yet that small amount carries the crucial supervision signal for listen-while-speaking. Why? In a mono recording, speaker A’s voice and speaker B’s voice arrive mixed on a single channel. The moment overlap happens, the signals add together, and the recording cannot strictly separate who was speaking. In a stereo (two-channel) recording, speaker A is on the left channel and speaker B on the right, physically separated, so overlapping backchannels and interruptions can be cleanly observed as training data. The supervision signal full-duplex learning needs is data that does not blur who spoke when.
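A minimal sketch of what that supervision signal looks like in practice, assuming a two-channel WAV with one speaker per channel. The frame-energy threshold stands in for a real voice-activity detector, and the file name is a placeholder.

```python
import numpy as np
import soundfile as sf

audio, sr = sf.read("dyad_stereo.wav")     # placeholder file; shape (n_samples, 2)

def activity(x: np.ndarray, frame_len: int, thresh: float = 1e-4) -> np.ndarray:
    """Crude per-frame voice activity: mean frame energy over a threshold."""
    usable = len(x) // frame_len * frame_len
    frames = x[:usable].reshape(-1, frame_len)
    return (frames ** 2).mean(axis=1) > thresh

frame_len = sr * 30 // 1000                # 30 ms frames
a_on = activity(audio[:, 0], frame_len)    # when speaker A is talking
b_on = activity(audio[:, 1], frame_len)    # when speaker B is talking
overlap = a_on & b_on                      # backchannels, interruptions
print(f"overlap rate: {overlap.mean():.1%}")

# A mono mix sums the channels before anything can be measured; once mixed,
# a_on and b_on are no longer recoverable from the waveform.
mono = audio.mean(axis=1)
```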

Point 3 — the evaluation protocol is public, but a benchmark is not

The J-Moshi paper publishes a protocol that compares naturalness MOS, meaningfulness MOS, turn-taking latency, and backchannel rate against a human reference. A protocol is an evaluation recipe that another lab can reproduce on the same steps. There is no hidden test set and no public leaderboard yet, but having the protocol alone is enough to place a Japanese row in the Article 07 benchmark landscape. The one line in the Convergence Lab review captures the implication plainly.

“Beyond conversational naturalness, the optimum for business deployment.” — Convergence Lab (2026-02-26 review)
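To fix the idea of the protocol’s two timing metrics, here is a hedged sketch over diarized (speaker, start, end) turns in seconds. The 0.5 s backchannel cutoff is an assumption, and the paper’s exact definitions may differ; this only illustrates the shape of the measurement.

```python
# Toy diarized turns: (speaker, start_s, end_s).
turns = [("A", 0.0, 2.1), ("B", 1.8, 2.0), ("A", 2.1, 4.0), ("B", 4.3, 6.0)]

# Turn-taking latency: gap from one speaker's turn end to the other's start
# (overlapping starts are excluded here).
gaps = [s2 - e1 for (spk1, _, e1), (spk2, s2, _) in zip(turns, turns[1:])
        if spk1 != spk2 and s2 >= e1]
print(f"mean turn-taking latency: {sum(gaps) / len(gaps):.2f} s")

# Backchannel rate: short utterances landing fully inside the other
# speaker's turn, per minute of dialogue.
def inside(short, long):
    return long[1] <= short[1] and short[2] <= long[2]

backchannels = [t for t in turns if t[2] - t[1] < 0.5
                and any(inside(t, u) for u in turns if u[0] != t[0])]
print(f"backchannels per minute: {len(backchannels) / (turns[-1][2] / 60):.1f}")
```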

J-Moshi-Ext (nu-dialogue/j-moshi-ext) followed as an extended fine-tune version. LLM-jp-Moshi-v1 takes J-Moshi’s recipe as is, re-implements and scales it inside the LLM-jp collective to roughly 1,000 hours, and publishes it under Apache 2.0. Ten months apart, but the same process path.

fig.f4 · J-Moshi training pipeline
Moshi base (Kyutai, 7B, CC-BY 4.0)
→ pre-train on J-CHAT, 69,000 h mono (Saruwatari)
→ fine-tune on 344 h stereo (Ohashi 2025): public 143 h + in-house 201 h (Nagoya); CallHome 16 · CSJ 12 · Travel 115 · Casual 148 · Consult 53
→ J-Moshi (Apache 2.0, research)
Parallel branch: fine-tune by NII LLM-jp on ~1,000 h Japanese casual dialogue → LLM-jp-Moshi-v1 (NII, 2026-02-25, Apache 2.0, commercial)
Figure F4. J-Moshi training pipeline, derived from Kyutai Moshi. The parallel LLM-jp-Moshi-v1 branch is shown in the lower row.

5. The tripod structure, and why the foundation layer is filled without a commercial lab

Three legs support Japanese full-duplex STS in 2026.

Picture a camera tripod. If any one leg is shorter, it falls. Only when three legs stand in three different places can one release sit stably on top.

The first leg is NII in Tokyo. The institution that runs the SRC licensing and distribution infrastructure, hosts LLM-jp, employs Yamagishi, Kurohashi, and Shinoda, and shipped LLM-jp-Moshi-v1 on February 25, 2026.

The second leg is Nagoya. Nagoya University’s Higashinaka Lab shipped J-Moshi as Ohashi’s thesis project. Nagoya Institute of Technology is a separate institution, but it sits in the same city and maintains Open JTalk and Julius. The two Nagoya institutions are legally independent, but they share a regional compute and joint-research zone, and some of the people who produced Japanese speech-AI artifacts have moved between them.

The third leg is the University of Tokyo’s Saruwatari Lab, the data-layer supplier. Without J-CHAT, J-Moshi’s two-stage recipe does not even start.

Remove any one leg and nothing moves. Remove NII and the release infrastructure and collective-authorship template disappear. Remove Nagoya and J-Moshi disappears, and the recipe that LLM-jp-Moshi-v1 inherits does not exist. Remove Saruwatari Lab and the public pre-training data pool shrinks by an order of magnitude. The two-stage J-Moshi recipe and the Apache 2.0 collective release are direct products that only stand up when all three legs are functioning at once.

There is no commercial frontier lab inside Japan in the position that a Kyutai (Paris), Cartesia (San Francisco), or Hume (New York) occupies in the US or EU voice foundation cluster. The domestic market is an order of magnitude smaller than the English-speech market, and even in the English-speaking world the number of horizontal integrators is limited. Japanese venture capital has structurally preferred vertical SaaS and deep-tech hardware. The accurate way to say it is that the academic tripod covers the foundation-model layer precisely because commercial institutions are absent from it.

On top of the foundation layer sits a different job, the commercial integrator layer. SLAs, go-to-market execution, call-center-scale inference, frontier-pace iteration. A Japanese version of Article 12’s integrator landscape will eventually fill that layer. But without academic supply underneath, there is no Japanese full-duplex model for integrators to wrap in the first place. The tripod is the foundation the rest of the stack rides on.

A note on the term sovereign AI. Sovereign AI is a policy term for AI infrastructure that can be trained, operated, and audited inside a country’s language, legal system, and policy context. METI and MIC published v1.0 of the AI Guidelines for Business in April 2024, v1.01 in November 2024, and v1.1 on March 28, 2025. The guidelines are not legally binding, but they are the most widely referenced guidance document for AI operators in Japan. Put differently, the fact that a commercially usable Japanese full-duplex model exists in 2026 provides a concrete instance of the “domestically governable AI infrastructure” that METI and MIC’s policy language presupposed. The point is not that Japan is special. It is that even a non-English ecosystem can reach a commercially usable license through the combination of an open foundation model, a national corpus-distribution institute, and universities. One empirical case now stands.

fig.f5 · academic supply triangle
NII (Tokyo): SRC 1987–, LLM-jp 2023– · Yamagishi, Shinoda, Kurohashi
Nagoya + NITech: Higashinaka (J-Moshi) · Open JTalk, Julius, HTS
U-Tokyo Saruwatari (data layer): JSUT, JVS, J-CHAT 69k h
Each leg is necessary: remove any one and J-Moshi or LLM-jp-Moshi-v1 does not ship. (The commercial frontier lab that would sit here in the US or EU is absent.)
Figure F5. The academic supply triangle. With no commercial frontier voice-AI lab in the country, three institutional legs carry Japanese full-duplex STS.

6. An honest counter — does it matter that it is not near the top of English benchmarks

Let us state the obvious counterargument.

As of today, LLM-jp-Moshi-v1 is not in the top five of Big Bench Audio or Artificial Analysis’s S2S comparison. The top of that list is Step-Audio-R1.1 (97.0%), Gemini 3.1 Flash Live (95.9%), Grok Voice (92.9%), Gemini 2.5 Native Audio (90.7%), Nova 2.0 Sonic (88.1%). Looking at the quality gap on English benchmarks, does a “Japanese-first” design choice justify the trade-off?

Three observations answer it.

First, the comparison axis is misaligned to begin with. Big Bench Audio is close to an English task. There is no public benchmark as of April 2026 that measures the naturalness of Japanese aizuchi (the listener’s short overlapping tokens like “ee,” “un,” and “soudesuka” spoken during the speaker’s utterance), pause structure, or keigo hierarchy. “Not in the top five of English benchmarks” is not evidence for “low quality on Japanese tasks.” Sprint times on the track do not substitute for marathon finishing order.

Second, “commercially usable” is a license statement, not a quality statement. Quality comparison is a separate axis. NII’s release notes point to J-Moshi’s MOS and dialogue-naturalness results, and the quality baseline at commercial deployment will move with the additional fine-tuning and evaluation done in the target domain.

Third, the shape of the trade-off is different. Step-Audio-R1.1 and Gemini 3.1 Flash Live are closed APIs. Weights are not distributed. “Top of the English benchmark” can also mean “not usable as is under a public license.” When a commercial deployment requires compliance, data sovereignty, or on-premise hosting, you cannot pick a top-five closed model. The fact that LLM-jp-Moshi-v1 ships under Apache 2.0 is itself an evaluation on a different axis (deployability, auditability).

In short, placing near the top of English Big Bench Audio is not the primary goal of the 2026 Japanese speech-AI plan. The claim of this release is that Japanese-conversation quality and commercial deployability line up inside the same artifact. As the AICU review put it,

“High-quality Japanese simultaneous two-way spoken dialogue even at low latency and low bitrate.” — AICU (2026-02-26 review)

On the quality side, the Japanese-speaking community’s response within the first 48 hours was favorable.

fig.f6 · Japan speech-AI institutions
Academic: NII (Tokyo) · Nagoya U. (Higashinaka) · NITech (Open JTalk, HTS) · U-Tokyo Saruwatari · NINJAL · Kyoto / Waseda / Tohoku · ATR (Kyoto). Output: J-Moshi, LLM-jp-Moshi-v1.
Corporate research: NTT CS Labs · Sony · LY Corporation (LINE/Yahoo) · Rakuten, SB, KDDI labs. Output: papers, not open weights.
Hyperscaler JP subs: AWS Japan · Google Japan · Microsoft Japan · OpenAI Japan. Output: API, closed models.
Figure F6. Japan speech-AI institutions. Academic nodes supply data and models. Most of the commercial layer is served by foreign-made APIs.

7. Where this fits on the series map

The Japanese academic speech ecosystem connects to the STS landscape this series covers at four points.

First, the multilingual frontier. J-Moshi and LLM-jp-Moshi-v1 are the first public case study of porting Moshi to a non-English language. The observed cost, four months of graduate-student labor plus a few hundred hours of stereo data, is the data point that operationalizes Article 03’s “language-port feasibility” claim in Japanese. Subsequent ports that follow the same template (Korean, Vietnamese, Hindi) will have a benchmark for calibration.

Second, the data roadmap. Article 06 hypothesizes 100,000 to 500,000 hours of two-channel dyadic (two-party) audio as the foundation threshold for full-duplex STS. The ceiling on Japanese public stereo is about 250 hours, or 0.25% of the lower bound. The English public stereo floor is roughly 2,000 to 3,000 hours, or 2 to 3% of the lower bound. Neither language has a foundation corpus yet. Moshi’s English-side precondition was the LDC Fisher Corpus, 1,960 hours of two-channel telephone speech from US national investment in 2003 to 2005. On the Japanese side, CALLHOME Japanese runs 49 hours, and nothing larger exists at that specification. The most direct route to closing the Japanese gap is a Fisher-scale new collection. Every other component is already in place. SRC runs licensing. Julius and Open JTalk are deployed in production. J-CHAT is the largest non-English dialogue corpus in the world. What the ecosystem needs is one line item, not an entire budget.
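The gap arithmetic in this paragraph, restated explicitly with the figures as cited in the text:

```python
import math

jp_ceiling = 250            # Japanese public stereo hours, 2026-04 ceiling
en_floor = (2_000, 3_000)   # English public stereo floor
band = (100_000, 500_000)   # Article 06 hypothesized threshold band

print(f"JP vs lower bound: {jp_ceiling / band[0]:.2%}")   # 0.25%
print(f"EN vs lower bound: {en_floor[0] / band[0]:.0%} "
      f"to {en_floor[1] / band[0]:.0%}")                  # 2% to 3%
for bound in band:
    gap = math.log10(bound / jp_ceiling)
    print(f"JP shortfall vs {bound:,} h: {gap:.1f} orders of magnitude")
# → 2.6 orders against 100k h, 3.3 against 500k h
```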

Third, the benchmark map. Articles 07 and 08 wrote the map and the argument of the STS benchmark landscape. As of April 2026, a Japanese full-duplex benchmark does not exist, and J-Moshi’s evaluation protocol is the closest substitute. Japan is the clearest example of the multilingual gap that Article 08 specifically identifies. The Japanese ecosystem does hold SID-Bench and J-Moshi-eval, the only Japanese full-duplex evaluations in the public literature today, built respectively by the Yamagishi group at NII and the Higashinaka group at Nagoya. Because of this prior work, Japan can be placed on the benchmark map at all.

Fourth, consent and licensing. Article 10 places Japan’s APPI (Act on the Protection of Personal Information) alongside the EU AI Act and the US state patchwork as one of the three major regulatory fronts. The Personal Information Protection Commission is in the middle of the 2024-2025 triennial review, and the interim report introduces new rules on biometric data (including voiceprints) that move toward the special-category treatment of GDPR Article 9 (Biometric Update April 2026, Clifford Chance 2024). A draft bill was expected in 2025, with enforcement in 2027. For operators based in Japan, building opt-in consent workflows now, under the clearer current rules, carries a first-mover advantage over retrofitting after enforcement lands.

fig.f7 · Japanese STS output 2024-2026
Timeline 2024 → 2026, three columns:
Academic: J-Moshi (mid-2025) · J-Moshi-Ext · LLM-jp-Moshi-v1 (Feb 2026)
Hyperscaler APIs: GPT-4o voice JP · Gemini Live JP · Nova Sonic JP · Azure voice JP updates
JP commercial: (no Japanese-origin horizontal STS release documented in this window)
Figure F7. Japanese STS output, 2024-2026. Left column is academic supply. Right column is commercial-startup supply (empty). The middle column is hyperscaler APIs, which carry most of production usage.

8. Summary, and three signals to watch over the next five years

The most straightforward reading of Japanese full-duplex STS in April 2026 is this. It is a case where academic infrastructure is carrying the foundation-layer work that a domestic commercial voice-AI industry would normally cover. The institutions that have maintained Japanese speech datasets and open-source toolkits for the past 30 years also produced the country’s first commercial-grade full-duplex STS model. It is the result of two preconditions lining up: the LLM-jp collective-authorship template, and the option to use Kyutai Moshi as a permissive base. One PhD student’s thesis project shipped J-Moshi. A national research institute paired with the LLM-jp collaborative shipped LLM-jp-Moshi-v1 on February 25, 2026. Up to here, not a single commercial startup was required.

The insight to take away comes as a pair. The Japanese academic supply route is both proof of what is possible and a specification of what is still missing. J-Moshi and LLM-jp-Moshi-v1 cleared the commercial-deployability bar. They did not clear the foundation-threshold scale that Article 06 hypothesizes. A ceiling of 250 public stereo hours sits roughly two and a half orders of magnitude below the threshold’s lower bound. Every other precondition already exists inside the Japanese academic ecosystem. What is missing is one line item, collection itself, and that is buildable.

Three signals to watch over the next five years.

Signal 1: whether LLM-jp-Moshi-v2 or a successor pushes the fine-tune corpus from 1,000 hours to the next order of magnitude. Lifting Japanese dialogue that mixes casual and keigo speech by an order of magnitude would significantly close the Japanese-versus-English STS quality gap.

Signal 2: whether a Fisher-class two-channel Japanese dialogue collection starts, either through a national project or through an opt-in commercial operator. The institutional muscle for distribution, licensing, and governance already sits at NII SRC. The rate-limiting step is origination, the act of collecting fresh recordings.

Signal 3: whether the LLM-jp template gets ported into underserved-language ecosystems like Korean, Vietnamese, or Hindi. If another country’s collective reruns this pattern on the same Moshi base, the Japanese release becomes not a one-off but the first instance of a reproducible template.

25 years ago, when Tokuda began designing HTS, the destination “in 2026 a Japanese listen-while-speaking AI ships from academia under a commercial license” was not a goal anyone could write down. The fact that it has now shipped once changed the shape of the question. The question is no longer which lab ships what next. It is whether this style of institutional design repeats. 25 years between Tokuda and Ohashi. 18 months between Moshi and LLM-jp-Moshi-v1. A long clock and a short clock converged on the same morning in 2026.

fig.f8 · Japanese foundation-threshold gap
Log scale, hours: 100 · 1k · 10k · 100k · 500k
Japanese public stereo (2026-04 ceiling): ~250 h
English public stereo floor (Fisher + CALLHOME + ...): ~2,000–3,000 h
Article 06 threshold band (hypothesized foundation floor): 100k → 500k h
Japanese is roughly two-and-a-half to three-and-a-half orders of magnitude short.
The signal to watch: whether a Fisher-class Japanese collection gets funded.
Figure F8. The Japanese foundation-threshold gap. Japanese public stereo dialogue placed against Article 06’s 100,000 to 500,000 hour band, on a log scale. The closing signal to watch next is whether a Fisher-class Japanese collection starts.
Dataset inquiry. Fullduplex.ai is building consent-based two-channel Japanese dyadic audio in Japan, aimed at the next generation of Japanese full-duplex releases. Research institutes, operators, and policy stakeholders interested in Fisher-scale Japanese dialogue collection can reach us at hello@fullduplex.ai. A one-line email is enough.