Trend Analysis · Linguistics & NLP

Can AI Save Dying Languages? NLP Tools for Endangered Language Documentation

Over 40% of the world's languages face extinction. AI and NLP tools promise to accelerate documentation and revitalization, but a persistent gap between theory and practice remains. Five recent papers illuminate what works, what doesn't, and what is lost when a language dies undocumented.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Of the approximately 7,000 languages spoken today, UNESCO estimates that roughly 40% are endangered: spoken by shrinking communities, often without written traditions, and at risk of disappearing within a generation or two. Each language that vanishes takes with it a unique cognitive system, a body of oral literature, and an irreplaceable record of human experience. The question of whether AI and NLP tools can meaningfully contribute to documentation and revitalization efforts is both technically interesting and culturally urgent.

The honest answer, as the recent literature makes clear, is: partially, and less than the hype suggests. NLP tools can accelerate certain documentation tasks, but they face fundamental challenges with low-resource languages, and the gap between what is technically possible and what actually gets deployed in fieldwork settings remains wide.

The Theory-Practice Gap

Gessler and von der Wense (2024), with 4 citations, provide the most direct analysis of why NLP tools have not been widely adopted in language documentation, despite decades of expressed interest from both NLP researchers and field linguists. They identify two core reasons:

Reason 1: The data bootstrapping problem. NLP tools generally require annotated data to function. But for endangered languages, annotated data is precisely what documentation aims to create. This creates a circularity: you need NLP tools to create the data, and you need the data to train the NLP tools. Transfer learning from related high-resource languages can partially address this, but "related" is a strong requirement; many endangered languages belong to families with no well-resourced relatives.

Reason 2: The workflow integration problem. Even when NLP tools exist for a given task (automatic transcription, morphological analysis, interlinear glossing), integrating them into existing documentation workflows is non-trivial. Field linguists typically work with tools like ELAN, FLEx, or SayMore. NLP tools that require command-line interfaces, Python environments, or cloud APIs do not fit naturally into these workflows. The result is that tools get published in NLP conferences and then are not used.
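To make the integration burden concrete: even pulling transcribed utterances out of an ELAN file for downstream processing takes custom glue code. Below is a minimal sketch using only the Python standard library (EAF is plain XML); the tier name and content are invented for the example, and real .eaf files are considerably richer.

```python
# Toy illustration of the "glue code" needed to extract transcriptions from an
# ELAN .eaf document so they can be handed to an NLP tool. Simplified: a real
# EAF file also carries time slots, linguistic types, and dependent tiers.
import xml.etree.ElementTree as ET

EAF_SNIPPET = """
<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>an example utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""

def tier_values(eaf_xml: str, tier_id: str) -> list[str]:
    """Return the annotation values found on one tier of an EAF document."""
    root = ET.fromstring(eaf_xml)
    values = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for value in tier.iter("ANNOTATION_VALUE"):
            values.append(value.text or "")
    return values

print(tier_values(EAF_SNIPPET, "transcription"))  # ['an example utterance']
```

Writing and maintaining this kind of adapter for every tool and every project is exactly the workflow cost that keeps published NLP systems out of documentation practice.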

The observation is sobering but constructive: the bottleneck is not primarily algorithmic (better models) but sociotechnical (better integration with existing practices and genuine collaboration between NLP researchers and field linguists).

Case Studies: What Is Being Attempted

Nüshu: Rescuing a Script from Extinction

Yang, Ma, and Vosoughi (2024), with 6 citations, present NushuRescue, an AI-assisted project for the Nüshu script, a writing system historically used exclusively by women in Jiangyong County, Hunan Province, China. Nüshu is unusual in multiple ways: it is the only known script used exclusively by one gender, its last fluent native writer died in 2004, and existing documentation is fragmentary.

The NushuRescue approach uses LLMs to address a core preservation challenge: translation between Nüshu and Chinese with minimal training data. The framework includes:

  • Parallel corpus creation: NCGold, a 500-sentence Nüshu-Chinese parallel corpus, the first publicly available dataset of its kind.
  • Few-shot LLM translation: Using GPT-4-Turbo with only 35 short examples to achieve 48.69% translation accuracy on withheld test sentences.
  • Corpus expansion: Generating NCSilver, a set of 98 newly translated modern Chinese sentences, expanding the available linguistic resources.
  • Supporting models: FastText-based and Seq2Seq models developed to further support computational research on Nüshu.

The results demonstrate that LLMs can make meaningful progress on endangered language translation with remarkably little data, but the 48.69% accuracy also shows how far the technology remains from reliable translation. The framework is designed to be scalable and minimize the need for extensive human input, though human validation remains essential for quality assurance.
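For readers unfamiliar with few-shot prompting, the mechanics are easy to sketch: a handful of worked example pairs are placed before the sentence to be translated. The pairs and prompt wording below are invented placeholders, not NCGold data or the authors' actual prompt format.

```python
# Assemble a few-shot translation prompt of the general kind NushuRescue's
# GPT-4-Turbo experiments rely on: K example pairs, then the new sentence.
# All strings here are placeholders for illustration only.
def build_few_shot_prompt(example_pairs, source_sentence):
    parts = ["Translate the following sentence from Nushu to Chinese."]
    for src, tgt in example_pairs:
        parts.append(f"Nushu: {src}\nChinese: {tgt}")
    parts.append(f"Nushu: {source_sentence}\nChinese:")
    return "\n\n".join(parts)

examples = [
    ("<nushu example 1>", "<chinese translation 1>"),
    ("<nushu example 2>", "<chinese translation 2>"),
]
prompt = build_few_shot_prompt(examples, "<sentence to translate>")
print(prompt)
```

With only 35 such examples available, everything rests on the model's pretrained knowledge of Chinese; the Nüshu side must be learned almost entirely in context, which helps explain both the surprising progress and the 48.69% ceiling.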

Comanche: Minimal-Cost Language Technologies

Alvarez C, Karajeanes, and Prado (2025), with 1 citation, introduce computational tools for Comanche, an Uto-Aztecan language spoken by fewer than 50 fluent speakers (some estimates say as few as 10). Their approach is notable for its pragmatism: rather than attempting to build full NLP systems, they focus on "minimal-cost" interventions, tools that require minimal data and computation while providing immediate utility.

Their specific contributions include a Comanche tokenizer, a basic morphological analyzer, and a Comanche-English glossary extraction tool. These are not sophisticated by NLP standards, but they address real needs in the documentation process: helping field linguists segment continuous speech, identify morpheme boundaries, and maintain consistent terminology.
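A glossary-extraction step of this kind can be sketched over interlinear glossed text (IGT), where each surface token is paired with a gloss. The paper's actual implementation is not described here; the alignment logic and the toy data below are illustrative assumptions, not real Comanche.

```python
# Minimal sketch: count (surface token, gloss) pairings across aligned IGT
# lines, so a field linguist can check that terminology is used consistently.
from collections import Counter

def extract_glossary(igt_pairs):
    """igt_pairs: iterable of (surface_line, gloss_line) with one gloss per token.

    Misaligned lines are skipped rather than guessed at.
    """
    glossary = Counter()
    for surface, gloss in igt_pairs:
        s_toks, g_toks = surface.split(), gloss.split()
        if len(s_toks) != len(g_toks):
            continue
        glossary.update(zip(s_toks, g_toks))
    return glossary

# Invented toy data for illustration.
igt = [
    ("tokenA tokenB", "gloss-A gloss-B"),
    ("tokenA tokenC", "gloss-A gloss-C"),
]
glossary = extract_glossary(igt)
print(glossary[("tokenA", "gloss-A")])  # 2
```

A tool this simple still pays off in practice: inconsistent glosses surface immediately as low-frequency competing pairings for the same token.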

The paper also raises an important ethical point: the Comanche Nation's cultural preservation office was involved in determining which tools were developed and how the resulting data would be stored and accessed. This is not a technicality: for many Indigenous communities, language data carries cultural and spiritual significance that requires community governance.

Manchu: NER and POS Tagging

Lee, Byun, and Seo (2024), with 2 citations, experiment with three model architectures (BiLSTM-CRF, BERT, and mBERT) for Named Entity Recognition (NER) and Part-of-Speech (POS) tagging in Manchu, an endangered Tungusic language with fewer than 20 fluent speakers. The Manchu script (a vertical alphabet adapted from Mongolian) poses additional challenges for standard NLP pipelines designed for horizontal left-to-right text.

Their results illustrate the trade-offs of different approaches. BERT, fine-tuned on a small Manchu corpus (~50,000 tokens), outperforms BiLSTM-CRF for POS tagging but performs comparably for NER, suggesting that for tasks with limited training data, the advantage of pretrained models is reduced. mBERT, despite its multilingual pretraining, shows no advantage over monolingual BERT, likely because Manchu is absent from mBERT's training data and has no typologically close relatives in the model.
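Such comparisons rest on standard tagging metrics. As a reminder of how span-level NER F1 is computed (the spans below are illustrative, not from the paper):

```python
# Span-level NER F1: a predicted entity counts as correct only if its
# (start, end, type) triple exactly matches a gold entity.
def ner_f1(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 2, "PER"), (5, 7, "LOC")]
pred = [(0, 2, "PER"), (5, 6, "LOC")]  # boundary error on the second span
print(ner_f1(gold, pred))  # 0.5
```

The exact-match criterion is unforgiving of boundary errors, which matters for a vertical script with nonstandard segmentation: a model that finds every entity but misplaces one boundary still loses both precision and recall on that span.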

A Broader Framework

Fakhreldin (2025), with 1 citation, proposes a comprehensive NLP framework for Indigenous dialect documentation that attempts to address the full pipeline: data collection, preprocessing, annotation, model training, and community feedback. The framework includes provisions for dialectal variation (a challenge often overlooked when the "language" is actually a family of related dialects) and emphasizes iterative validation with speaker communities.

The framework's value is more conceptual than empiricalโ€”it has not yet been fully implemented for any single language. But it articulates principles that the field increasingly recognizes: documentation NLP must be community-governed, dialect-aware, and designed for integration with existing fieldwork tools.

Critical Analysis: Claims and Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| NLP tools can accelerate endangered language documentation | NushuRescue, Comanche, Manchu case studies | ✅ Supported, for specific, well-defined tasks |
| The main barrier to NLP adoption is sociotechnical, not algorithmic | Gessler & von der Wense's fieldwork survey | ✅ Supported |
| Transfer learning from high-resource languages helps low-resource NLP | Lee et al.'s mBERT experiment | ⚠️ Uncertain; mBERT showed no advantage for Manchu |
| Community involvement is essential for validation | NushuRescue and Comanche ethical frameworks | ✅ Supported; computational outputs alone are unreliable |

Open Questions and Future Directions

  • Scaling community-driven NLP: The case studies reviewed here all involve close collaboration with speaker communities. Can this approach scale, or is it inherently bespoke?
  • Oral languages: Many endangered languages have no written tradition. Speech recognition and audio analysis are critical, but acoustic models for low-resource languages remain poor.
  • Data sovereignty: Who owns the digital artifacts produced by NLP tools applied to endangered languages? Community data governance frameworks are emerging but not yet standardized.
  • Sustainability: Grant-funded NLP projects often produce tools that become unmaintained when funding ends. How do we build sustainable infrastructure for endangered language technologies?
  • The "last speaker" problem: For languages with only a handful of elderly speakers, documentation is a race against time. Can NLP tools be deployed rapidly enough to make a difference, or do they require lead time that these situations do not allow?
What This Means for Your Research

For NLP researchers interested in endangered languages, Gessler and von der Wense's analysis is essential reading: the gap between what you can build and what field linguists will use is real. Designing tools that integrate with existing workflows (ELAN, FLEx) is as important as improving model performance.

For field linguists, the Comanche and Manchu case studies demonstrate that useful NLP tools do not require massive resources. Even simple tools (tokenizers, morphological analyzers, glossary extractors) can accelerate documentation work.

For policymakers and funders, the sustainability question is critical. One-off projects produce tools that decay; sustainable infrastructure requires ongoing support.


References

[1] Gessler, L. & von der Wense, K. (2024). NLP for Language Documentation: Two Reasons for the Gap between Theory and Practice. Proc. AmericasNLP 2024.
[2] Yang, I., Ma, W., & Vosoughi, S. (2024). NushuRescue: Revitalization of the Endangered Nushu Language with AI. arXiv:2412.00218.
[3] Alvarez C, J., Karajeanes, D.D., & Prado, A.C. (2025). Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language. Proc. AmericasNLP 2025.
[4] Lee, S., Byun, G., & Seo, J. (2024). ManNER & ManPOS: Pioneering NLP for Endangered Manchu Language.
[5] Fakhreldin, M. (2025). Developing a Comprehensive NLP Framework for Indigenous Dialect Documentation and Revitalization. International Journal of Advanced Computer Science and Applications, 16(4).
