Fullduplex / the verticals · v11 of 17 · #ldc #data-licensing · 7 sections · 5 figures

LDC at 34: the academic consortium that quietly trains 2026’s full-duplex stack.

On March 25, 2023, Christopher Cieri died in Philadelphia. Eighteen months later, Kyutai’s open-source full-duplex model Moshi shipped with a single corpus buried in its references: Fisher English Training Speech, 1,960 hours, 11,699 telephone calls, published in 2004. Cieri was the first author. This profile asks why an academic consortium founded in 1992 still sits at the center of the training-data supply for open full-duplex STS in 2026.

verticals · v11 of 17 · subject profile
An academic consortium founded in 1992, operating from the same Penn building for thirty-four years, quietly supplies the documented-consent conversational audio that every major open full-duplex STS model in 2026 is fine-tuned on. One early choice — paid, signed speaker consent — compounded for thirty-four years.
subject: Linguistic Data Consortium · Philadelphia · founded 1992 · ~1,000 corpora · USD 2,400 non-profit / USD 34,000 for-profit

1. One corpus, twenty years, and March 25, 2023

On March 25, 2023, a researcher named Christopher Cieri died in Philadelphia, Pennsylvania, at the age of 64. His name is well known in linguistics; in AI, it rarely comes up. Eighteen months later, in September 2024, when the French lab Kyutai released the open-source full-duplex model Moshi, a single corpus sat deep in the paper’s references: Fisher English Training Speech, published in 2004, 1,960 hours, 11,699 telephone conversations. Cieri was the first author; he had designed the protocol and led the collection twenty years earlier.

Cieri never saw Moshi. But the protocol he finished in 2003 is used somewhere in the training of every major open full-duplex model available in 2026: Moshi, dGSLM (Meta FAIR, 2022), J-Moshi (NII, 2025), and PersonaPlex (NVIDIA, January 2026). For anyone trying to teach a machine full-duplex (the telephone-style form in which both sides speak and listen at the same time), Fisher remains the only 2,000-hour dialogue corpus on Earth that is physically recorded per speaker on separate channels, backed by signed consent, and available under a commercial license.

Cieri’s employer is the subject of this profile: the Linguistic Data Consortium (LDC). Founded inside the University of Pennsylvania in 1992 on a DARPA startup grant, it has operated from the same building for thirty-four years. The catalog lists roughly 1,000 corpora. Hundreds of universities, research institutes, and companies pay an annual membership fee for access to every release. In 2025, the non-profit tier is USD 2,400 and the for-profit tier is USD 34,000.

One way to picture LDC is as a national library of speech data. Someone records a conversation, consent is collected properly, contracts are filed, and the audio is stored in a standardized format. Twenty years later, someone else pulls it out for a new purpose. The data is still usable because someone kept the same rules intact the whole time.

This profile has one question to answer. Why is this small academic consortium, founded in 1992, still sitting at the center of the training-data supply for open full-duplex STS in 2026? The short answer is that one early choice — documented consent — compounded for thirty-four years.

fig.f1 · Fisher 2004 fan-out
[diagram: Fisher English 2004, 1,960 h → dGSLM (Meta, 2022, primary CTS) · Moshi (Kyutai, 2024-09, FD fine-tune) · J-Moshi (NII, 2025, via CALLHOME JP) · PersonaPlex (NVIDIA, 2026-01, 1,217 h Fisher) → every open full-duplex STS system, 2022-2026]
Figure F1. Fisher 2004 at the center, fanning out to dGSLM (2022), Moshi (2024), J-Moshi (2025), and PersonaPlex (2026). Open full-duplex research in the 2020s concentrates on a single LDC corpus. Source: Nguyen et al. 2022, Défossez et al. 2024, Ohashi et al. 2025, NVIDIA PersonaPlex model card January 2026.

2. Three people, more than eighty years of tenure

If you start with the institutional story, you miss the point. Most of LDC’s durability comes from one fact. The same people held the same roles for decades.

Mark Liberman (director, since 1992) is LDC’s founding director. He is a full professor in the University of Pennsylvania’s linguistics department, a member of the National Academy of Sciences, and the lead writer on the academic linguistics blog Language Log. In 1992, speech corpora were priced by the kilobyte and shipped on magnetic tape. Liberman’s choice was to build LDC not as a vendor and not as a government lab, but as a membership-based academic consortium.

“LDC exists because researchers need shared data with shared licensing.”— Mark Liberman, Language Log

Christopher Cieri (executive director, 1999 to March 25, 2023) earned his PhD in linguistics at Penn and oversaw LDC’s collection programs for almost twenty-five years. He served as collection PI on dozens of DARPA, IARPA, NIST, and NSF programs and was the person who designed and ran the Fisher protocol in Cieri et al. 2004 (LREC). The Penn Almanac obituary described him in one line.

“A thoughtful leader who built high-trust relationships with sponsors.”— Penn Almanac obituary, March 2023

Stephanie Strassel (associate director, since 1998) served as collection PI on DARPA GALE (2005 to 2011, multilingual broadcast and conversational speech) and IARPA Babel (2012 to 2017, low-resource language packs across more than twenty-five languages). When a 2026 speaker-recognition paper uses the NIST SRE (Speaker Recognition Evaluation) for its test set, a large share of the original field collection traces back to projects Strassel led.

Added together, the three of them have spent 86 years at LDC: Liberman 34 years, Cieri 24 years, Strassel 28 years. That kind of continuity is what lets a corpus from 2003-04 carry the fine-tuning (the stage where an existing model is adapted to a target task) of Moshi twenty years later. The same people defended the same licensing terms for two decades.

“We trust LDC because they have honored their agreements for thirty years.”— Sanjeev Khudanpur, JHU CLSP, ICASSP 2023 award lecture
fig.f2 · three tenure bars
[timeline, 1992-2030: Mark Liberman, founding director, 1992-present (director, LDC · NAS · Language Log) · Christopher Cieri, executive director, 1999 to 2023-03-25, Fisher 2004 first author · Stephanie Strassel, associate director, 1998-present, DARPA GALE / IARPA Babel PI]
Figure F2. Three tenure bars. Liberman 34 years, Cieri 24 years (ending March 25, 2023), Strassel 28 years. Thirty years of senior-staff continuity feed directly into the research output of the institution. Source: LDC About page, Penn Almanac obituary, Language Log about page.

3. Three public goods a consortium supplies by design

LDC’s function in 2026 breaks down into three parts. No commercial market has satisfied all three at once.

Point 1 — Documented-consent conversational audio

Every Fisher speaker was recruited through an agency, signed a release form, and was paid per call. Every CALLHOME caller placed the call voluntarily and understood the recording would be archived for research. IRB approval (Institutional Review Board, the US research-ethics oversight body) came at collection time, so member researchers can use Fisher without running a fresh IRB for each study. In short, thirty-four years of signed speaker agreements are stacked in an archive. When the Italian Garante fined Replika EUR 5 million in April 2025, the central issue was that the consent to use the chatbot and the consent to train on chat logs had been bundled into one notice. Article 50 of the EU AI Act takes effect on August 2, 2026. A web-scraped training corpus does not have a signed release per speaker. The LDC archive does, by construction.

Point 2 — Member-access clearinghouse licensing

LDC negotiates consent and contracts once, at collection time, and then distributes finished corpora to members under a standardized User Agreement. In practice, a Penn PhD student can pull Fisher out of the catalog on a Tuesday afternoon and start writing a paper, without a new IRB or a separate license negotiation. The right frame is a clearinghouse: it consolidates the IRB review and contracting that each lab would otherwise have to run on its own.

“LDC is how linguistics pays its taxes.”— Dan Jurafsky, Stanford, 2024 NLP summer school

Point 3 — Government-sponsored long-horizon collection

Collections at a scale and duration that commercial vendors on quarterly budgets cannot underwrite happen only through the triangle of government sponsor, academic executor, and LDC releaser. Switchboard (1990-91), Fisher (2003-04, DARPA EARS), GALE, Babel, MATERIAL (DARPA, roughly 2017 to 2024), the 2024 AnnoDIFP release (242 hours, co-produced with the Florida Institute of Technology and the University of New Haven), and the forthcoming CALL MY NET 2 (more than 800 hours of Tunisian Arabic conversational telephone speech, planned for 2025) are all outputs of that triangle.

“Fisher is the only public two-channel corpus at useful scale.”— Neil Zeghidour, Kyutai (formerly Google Brain)
fig.f3 · three public goods
[three-panel diagram. Point 1, documented-consent audio: every Fisher speaker recruited, paid, and signed; 34-year compounding of signed agreements and honored redistribution; regulatory anchors Garante EUR 5M Replika (April 2025), EU AI Act Art. 50 (Aug 2026). Point 2, clearinghouse licensing: consent negotiated once at collection time; standardized User Agreement; IRB approval carries through; perpetual commercial rights for for-profit members; 2025 tiers USD 2,400 / yr non-profit, USD 34,000 / yr for-profit. Point 3, long-horizon collection: DARPA EARS → Fisher; DARPA GALE → AR / ZH broadcast (2005-11); IARPA Babel → 25+ low-resource packs (2012-17); the government-funded, academic-executed, LDC-released pipeline a 100k-hour full-duplex corpus would reuse.]
Figure F3. Three public goods a consortium supplies by design. Documented-consent audio is the substrate that scraping-era data does not have. Clearinghouse licensing is the legal infrastructure that turns it into something usable in ten minutes, not ten months. Long-horizon collection is the engine that originates the substrate in the first place. No commercial market has delivered all three at once.

4. How Fisher became the full-duplex anchor

To read Fisher’s 2026 role correctly, you need one observation. Its 2004 collection purpose and its 2024 usage purpose are not the same.

Fisher was originally designed as the training and evaluation data for DARPA EARS (Effective, Affordable, Reusable Speech-to-Text), an ASR (automatic speech recognition, the task of turning audio into text) accuracy program. The target back then was transcription. It was not fine-tuning data for full-duplex STS.

But the protocol Cieri and his team chose had five elements: physical two-channel separation, paid consent-based speaker recruitment, topic-prompted conversational framing, ten-minute call length, and standardized licensing. By coincidence, those conditions turned out to be almost exactly what full-duplex STS training needs twenty years later.

Put another way, in 2004, Cieri and his co-authors were the only people in the world who had collected 2,000 hours of two-channel dialogue audio in which speakers interrupt each other, produce backchannels (short acknowledgements like “uh-huh”), and overlap in speech. YouTube and podcast audio is mono (a single channel) or mixed-down pseudo-stereo. A model looking at that has to infer the speaker structure out of a blended signal. Fisher is different because PSTN (the public switched telephone network, the old phone system) records speaker A and speaker B on physically separate wires. That per-speaker isolation is precisely the condition that lets a model learn to listen and speak at the same time.
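The per-speaker isolation is visible at the file level. As a minimal sketch (standard-library Python only; the file paths are hypothetical, and real Fisher audio actually ships as 8 kHz SPHERE files rather than plain WAV), de-interleaving a two-channel 16-bit PCM recording into one mono track per speaker looks like this:

```python
import wave

def split_channels(path_in, path_a, path_b):
    """De-interleave a two-channel recording into one mono file per
    speaker. Assumes 16-bit PCM WAV; Fisher-style captures keep
    speaker A and speaker B on physically separate channels."""
    with wave.open(path_in, "rb") as src:
        assert src.getnchannels() == 2, "expected a two-channel recording"
        assert src.getsampwidth() == 2, "expected 16-bit samples"
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    # Interleaved layout: [A0, B0, A1, B1, ...], 2 bytes per sample.
    chan_a, chan_b = bytearray(), bytearray()
    for i in range(0, len(frames), 4):
        chan_a += frames[i:i + 2]
        chan_b += frames[i + 2:i + 4]
    for path, data in ((path_a, chan_a), (path_b, chan_b)):
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(params.framerate)
            dst.writeframes(bytes(data))
```

A mono or mixed-down podcast file offers no such split: both voices share the same samples, and the model has to infer who is speaking from the blend.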

“We use Fisher because no other public corpus provides separated speakers at scale.”— Wei-Ning Hsu, Meta FAIR, dGSLM paper footnote

Four labs — Meta FAIR, Kyutai, NII, and NVIDIA — independently reached the same conclusion.

There is one lesson here. The durability of a corpus design is not won by predicting future uses. It is won by picking protocol principles that survive any future use: paid consent, physical separation, standardized licensing, and prompt diversity. Fisher is still on active duty twenty years later and for a different purpose because it was stored like a twenty-year wine. The people who build the next Fisher can pick those conditions as explicit design variables, rather than relying on the luck that Cieri had in 2003.

fig.f4 · 2003-2026 arc
[timeline: Liberman (director, 1992-present) and Cieri (executive director, 1999 to 2023-03-25) bracket Fisher 2004 (Cieri first author) and Moshi 2024-09 (first open FD cite); 20 years from paper to field load, 18 months from Cieri's death to the Moshi launch]
Figure F4. The arc from 2003 to 2026, compressed. Liberman’s 34 years and Cieri’s 1999-2023 run sit above and below the release of Fisher 2004 and the arrival of Moshi in 2024. Twenty years separate the Fisher paper and the September 2024 Moshi release. Eighteen months separate Cieri’s death and the Moshi launch.

5. Is the membership model obsolete in the era of Common Voice and Hugging Face?

Let me engage the counterargument honestly. In 2026, what is the point of paying USD 34,000 a year to join LDC when Common Voice and Hugging Face are free?

Global speech data distribution splits into four modes in 2026.

  1. The Common Voice mode. Volunteer speakers record read-speech audio, released under CC0. More than 110 languages, over 31,000 hours. Read, not conversational. Single speaker, mono. Useful for ASR pre-training at scale.
  2. The Hugging Face mode. Free hosting. Licensing is up to whoever uploads. The consent chain is whatever the uploader documented. Useful for experimental access and early prototypes.
  3. The commercial bulk-purchase mode. For instance, Abaka AI sells 20,000 hours of bidirectional dialogue audio. You buy two-channel conversational audio outright. Useful when a large fine-tuning set is needed fast.
  4. The LDC mode. Membership-based, standardized User Agreement, consent chain collected at the time of collection, perpetual commercial rights for for-profit members. Useful when a canonical corpus has to hold up against legal scrutiny over the long run.

The four modes are not in competition. They cover different uses. The interesting exercise is to ask which mode Fisher could be moved into. Moving it to CC0 would require re-consenting every speaker from more than two decades ago (effectively impossible). Putting it on Hugging Face would require rebuilding the license framework (same problem). Selling it through the commercial-bulk channel would require rewriting the speaker compensation contracts from the time of collection (same problem). Fisher is a corpus that only stands up under the LDC mode. As the other modes mature, the scarcity of “corpora that cannot exist outside a membership consortium” increases.

A fair ceiling note belongs here. USD 34,000 is a rounding error against a frontier AI lab’s GPU budget. For an academic speech lab outside grant funding, it can be the largest single item in a year’s research budget. The membership model still introduces friction at the small-institution end. The 2024 AnnoDIFP joint release shows that a grant-plus-membership combination can put a small institution into the corpus-producing column, so the model still functions. The honest 2026 read is not that the membership model is obsolete, but that it has narrowed to a structural niche: the canonical supplier of documented-consent conversational audio that neither the Common Voice layer nor the commercial layer can fill.

6. Where LDC meets the STS landscape, and the next Fisher

LDC’s 2026 role crosses this series’ main line in two places.

First, as a source of foundation data. The full-duplex STS data gap that Articles 04 and 06 traced is precisely the problem of “how much documented-consent two-channel conversational audio is publicly available.” Switchboard and Fisher were both produced through the academic-executed, government-sponsored, LDC-released triangle. A 100,000-hour full-duplex successor, if it is built, will almost certainly reuse that template. LDC sits at the hinge between sponsor, academic executor, and research user.

Second, as the floor of provenance. LDC’s archive is the reason commercial-clean full-duplex training data is not zero in 2026. Without Fisher, the scarcity triangle in Article 04 has an empty vertex at the top. Any new commercial supply has to clear this floor on at least one of scale, fidelity (audio quality), language coverage, or commercial-rights structure. LDC’s 2026 comparative advantage has shifted from scale-of-collection to provenance-of-collection. That is exactly the position the market most needs filled.

There is one open design question left. How will Fisher’s successor be built? The practical specification is nearly pinned. Scale 10,000 to 100,000 hours. Physical per-speaker channel separation is non-negotiable. Wideband audio (16 kHz or higher) is the new non-negotiable for fidelity. Consent and licensing should extend the Fisher framework with a transparency layer informed by Article 50 of the EU AI Act and the April 2025 Garante ruling. Sponsor options are NSF, NIST, a pool of frontier labs, or a single corporate sponsor. The executor cluster is CMU LTI, JHU CLSP, and LDC. Nearly every condition that made Fisher possible in 2003 is reproducible in 2026. The only missing piece is the funding decision. A public announcement from a sponsor and a PI would be the signal that the successor corpus has started. As of April 2026, that announcement has not arrived.
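The nearly pinned specification above can be written down as a checklist. A minimal sketch follows; every field name is illustrative, compiled from the constraints in the text rather than drawn from any real RFP or announcement:

```python
# Illustrative checklist for a Fisher-successor collection, compiled
# from the constraints named in the text. All names are hypothetical.
SUCCESSOR_SPEC = {
    "scale_hours": (10_000, 100_000),               # target range
    "channel_separation": "physical, per speaker",  # non-negotiable
    "min_sample_rate_hz": 16_000,                   # wideband fidelity floor
    "consent_framework": "Fisher-style signed releases"
                         " + EU AI Act Art. 50 transparency layer",
    "sponsor_options": ["NSF", "NIST", "frontier-lab pool",
                        "single corporate sponsor"],
    "executor_cluster": ["CMU LTI", "JHU CLSP", "LDC"],
    "open_item": "funding decision",                # the only missing piece
}

def clears_fidelity_floor(sample_rate_hz: int) -> bool:
    """True if a candidate recording meets the wideband minimum."""
    return sample_rate_hz >= SUCCESSOR_SPEC["min_sample_rate_hz"]
```

Note what the checklist makes explicit: every entry except the last is a solved problem with a 2003 precedent; only the open item has no owner.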

fig.f5 · provenance vs scale
[scatter, scale in hours (log, 1k to 10M+) vs provenance strength (weak / mixed / strong): LDC, 2,000 h, strong consent · Common Voice, ~31,000 h, mono, CC0 · Abaka AI, 20,000 h, 2-ch, commercial consent · Fullduplex.ai, opt-in, 2-ch, documented · Moshi pretrain / YouTube, 7M h web scrape, thin provenance; annotation: the provenance floor LDC set]
Figure F5. A provenance-by-scale scatter. LDC sits at strong provenance, mid-scale. Abaka AI is strong-scale, mid-provenance. Common Voice ships CC0 so consent documentation is not needed per speaker. The Moshi pretraining set is huge-scale, thin-provenance. Fullduplex.ai sits on the strong-provenance path and extends scale from there.

7. Summary and outlook

To compress LDC’s 2026 role into one line: as commercial scraping saturates, the public-good value of documented-consent conversational audio compounds. Whisper was trained on 680,000 hours of weakly supervised web audio. HuBERT and wav2vec 2.0 use 60,000 hours of audiobooks from LibriLight. Moshi scraped 7,000,000 hours for its backbone. These are cases where the foundation threshold is crossed with super-scale mono audio. But at the fine-tuning stage, the stage where the model learns to take turns, overlap, and backchannel as a conversation, Fisher is still pulled off the shelf. Scraped audio teaches a model to transcribe. LDC audio teaches a model to converse.

The 2026 voice-AI frontier is consolidating rapidly into hyperscalers (Google DeepMind hired Hume’s CEO and engineers on January 22, 2026, and Apple acquired Q.ai for USD 1.6 to 2 billion on January 30, 2026). Outside that consolidation, one function has to survive: shared, reproducible, auditable open research infrastructure. LDC sealed that function into an institutional form thirty-four years ago. A member-funded consortium, per-corpus licenses, a documented consent chain, and decade-scale institutional continuity. Those four pieces combined form a structural counterweight to hyperscaler consolidation.

A field that started with under 300 hours of telephone conversations grew into models trained on millions of hours of web audio, and still cannot fine-tune without those original 300 hours. That situation is the simplest way to state what LDC’s thirty-four years of accumulation mean.

Dataset inquiry. Fullduplex.ai builds full-duplex conversational audio with documented consent, wideband fidelity, and clear commercial rights. Not as a replacement for the provenance lineage Fisher established, but as an heir, complementing the LDC archive. If you are evaluating training data for a full-duplex speech-to-speech model in 2026 and want to see how a modern two-channel corpus pairs with LDC’s fine-tuning anchor, a one-line email to hello@fullduplex.ai is enough.