Trend Analysis · Linguistics & NLP

Can AI Save Dying Languages? NLP Tools for Endangered Language Documentation

Over 40% of the world's languages face extinction. AI and NLP tools promise to accelerate documentation and revitalization, but a persistent gap between theory and practice remains. Five recent papers illuminate what works, what doesn't, and what is lost when a language dies undocumented.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Of the approximately 7,000 languages spoken today, UNESCO estimates that roughly 40% are endangered: spoken by shrinking communities, often without written traditions, and at risk of disappearing within a generation or two. Each language that vanishes takes with it a unique cognitive system, a body of oral literature, and an irreplaceable record of human experience. The question of whether AI and NLP tools can meaningfully contribute to documentation and revitalization efforts is both technically interesting and culturally urgent.

The honest answer, as the recent literature makes clear, is: partially, and less than the hype suggests. NLP tools can accelerate certain documentation tasks, but they face fundamental challenges with low-resource languages, and the gap between what is technically possible and what actually gets deployed in fieldwork settings remains wide.

The Theory-Practice Gap

Gessler and von der Wense (2024), with 4 citations, provide the most direct analysis of why NLP tools have not been widely adopted in language documentation, despite decades of expressed interest from both NLP researchers and field linguists. They identify two core reasons:

Reason 1: The data bootstrapping problem. NLP tools generally require annotated data to function. But for endangered languages, annotated data is precisely what documentation aims to create. This creates a circularity: you need NLP tools to create the data, and you need the data to train the NLP tools. Transfer learning from related high-resource languages can partially address this, but "related" is a strong requirement; many endangered languages belong to families with no well-resourced relatives.

Reason 2: The workflow integration problem. Even when NLP tools exist for a given task (automatic transcription, morphological analysis, interlinear glossing), integrating them into existing documentation workflows is non-trivial. Field linguists typically work with tools like ELAN, FLEx, or SayMore. NLP tools that require command-line interfaces, Python environments, or cloud APIs do not fit naturally into these workflows. The result is that tools get published in NLP conferences and then are not used.
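To make the integration burden concrete: even pulling transcribed utterances out of an ELAN file for downstream processing takes custom glue code. Below is a minimal sketch using only the Python standard library (EAF is plain XML); the tier name and content are invented for the example, and real .eaf files are considerably richer.

```python
# Toy illustration of the "glue code" needed to extract transcriptions from an
# ELAN .eaf document so they can be handed to an NLP tool. Simplified: a real
# EAF file also carries time slots, linguistic types, and dependent tiers.
import xml.etree.ElementTree as ET

EAF_SNIPPET = """
<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>an example utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""

def tier_values(eaf_xml: str, tier_id: str) -> list[str]:
    """Return the annotation values found on one tier of an EAF document."""
    root = ET.fromstring(eaf_xml)
    values = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for value in tier.iter("ANNOTATION_VALUE"):
            values.append(value.text or "")
    return values

print(tier_values(EAF_SNIPPET, "transcription"))  # ['an example utterance']
```

Writing and maintaining this kind of adapter for every tool and every project is exactly the workflow cost that keeps published NLP systems out of documentation practice.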

The observation is sobering but constructive: the bottleneck is not primarily algorithmic (better models) but sociotechnical (better integration with existing practices and genuine collaboration between NLP researchers and field linguists).

Case Studies: What Is Being Attempted

Nüshu: Rescuing a Script from Extinction

Yang, Ma, and Vosoughi (2024), with 6 citations, present NushuRescue, an AI-assisted project for the Nüshu script, a writing system historically used exclusively by women in Jiangyong County, Hunan Province, China. Nüshu is unusual in multiple ways: it is the only known script used exclusively by one gender, its last fluent native writer died in 2004, and existing documentation is fragmentary.

The NushuRescue approach uses LLMs to address a core preservation challenge: translation between Nüshu and Chinese with minimal training data. The framework includes:

  • Parallel corpus creation: NCGold, a 500-sentence Nüshu-Chinese parallel corpus, the first publicly available dataset of its kind.
  • Few-shot LLM translation: Using GPT-4-Turbo with only 35 short examples to achieve 48.69% translation accuracy on withheld test sentences.
  • Corpus expansion: Generating NCSilver, a set of 98 newly translated modern Chinese sentences, expanding the available linguistic resources.
  • Supporting models: FastText-based and Seq2Seq models developed to further support computational research on Nüshu.

The results demonstrate that LLMs can make meaningful progress on endangered language translation with remarkably little data, but the 48.69% accuracy also shows how far the technology remains from reliable translation. The framework is designed to be scalable and minimize the need for extensive human input, though human validation remains essential for quality assurance.
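For readers unfamiliar with few-shot prompting, the mechanics are easy to sketch: a handful of worked example pairs are placed before the sentence to be translated. The pairs and prompt wording below are invented placeholders, not NCGold data or the authors' actual prompt format.

```python
# Assemble a few-shot translation prompt of the general kind NushuRescue's
# GPT-4-Turbo experiments rely on: K example pairs, then the new sentence.
# All strings here are placeholders for illustration only.
def build_few_shot_prompt(example_pairs, source_sentence):
    parts = ["Translate the following sentence from Nushu to Chinese."]
    for src, tgt in example_pairs:
        parts.append(f"Nushu: {src}\nChinese: {tgt}")
    parts.append(f"Nushu: {source_sentence}\nChinese:")
    return "\n\n".join(parts)

examples = [
    ("<nushu example 1>", "<chinese translation 1>"),
    ("<nushu example 2>", "<chinese translation 2>"),
]
prompt = build_few_shot_prompt(examples, "<sentence to translate>")
print(prompt)
```

With only 35 such examples available, everything rests on the model's pretrained knowledge of Chinese; the Nüshu side must be learned almost entirely in context, which helps explain both the surprising progress and the 48.69% ceiling.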

Comanche: Minimal-Cost Language Technologies

Alvarez C, Karajeanes, and Prado (2025), with 1 citation, introduce computational tools for Comanche, an Uto-Aztecan language spoken by fewer than 50 fluent speakers (some estimates say as few as 10). Their approach is notable for its pragmatism: rather than attempting to build full NLP systems, they focus on "minimal-cost" interventions, tools that require minimal data and computation while providing immediate utility.

Their specific contributions include a Comanche tokenizer, a basic morphological analyzer, and a Comanche-English glossary extraction tool. These are not sophisticated by NLP standards, but they address real needs in the documentation process: helping field linguists segment continuous speech, identify morpheme boundaries, and maintain consistent terminology.
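A glossary-extraction step of this kind can be sketched over interlinear glossed text (IGT), where each surface token is paired with a gloss. The paper's actual implementation is not described here; the alignment logic and the toy data below are illustrative assumptions, not real Comanche.

```python
# Minimal sketch: count (surface token, gloss) pairings across aligned IGT
# lines, so a field linguist can check that terminology is used consistently.
from collections import Counter

def extract_glossary(igt_pairs):
    """igt_pairs: iterable of (surface_line, gloss_line) with one gloss per token.

    Misaligned lines are skipped rather than guessed at.
    """
    glossary = Counter()
    for surface, gloss in igt_pairs:
        s_toks, g_toks = surface.split(), gloss.split()
        if len(s_toks) != len(g_toks):
            continue
        glossary.update(zip(s_toks, g_toks))
    return glossary

# Invented toy data for illustration.
igt = [
    ("tokenA tokenB", "gloss-A gloss-B"),
    ("tokenA tokenC", "gloss-A gloss-C"),
]
glossary = extract_glossary(igt)
print(glossary[("tokenA", "gloss-A")])  # 2
```

A tool this simple still pays off in practice: inconsistent glosses surface immediately as low-frequency competing pairings for the same token.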

The paper also raises an important ethical point: the Comanche Nation's cultural preservation office was involved in determining which tools were developed and how the resulting data would be stored and accessed. This is not a technicality: for many Indigenous communities, language data carries cultural and spiritual significance that requires community governance.

Manchu: NER and POS Tagging

Lee, Byun, and Seo (2024), with 2 citations, experiment with three model architectures (BiLSTM-CRF, BERT, and mBERT) for Named Entity Recognition (NER) and Part-of-Speech (POS) tagging in Manchu, an endangered Tungusic language with fewer than 20 fluent speakers. The Manchu script (a vertical alphabet adapted from Mongolian) poses additional challenges for standard NLP pipelines designed for horizontal left-to-right text.

Their results illustrate the trade-offs of different approaches. BERT, fine-tuned on a small Manchu corpus (~50,000 tokens), outperforms BiLSTM-CRF for POS tagging but performs comparably for NER, suggesting that for tasks with limited training data, the advantage of pretrained models is reduced. mBERT, despite its multilingual pretraining, shows no advantage over monolingual BERT, likely because Manchu is absent from mBERT's training data and has no typologically close relatives in the model.
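Such comparisons rest on standard tagging metrics. As a reminder of how span-level NER F1 is computed (the spans below are illustrative, not from the paper):

```python
# Span-level NER F1: a predicted entity counts as correct only if its
# (start, end, type) triple exactly matches a gold entity.
def ner_f1(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 2, "PER"), (5, 7, "LOC")]
pred = [(0, 2, "PER"), (5, 6, "LOC")]  # boundary error on the second span
print(ner_f1(gold, pred))  # 0.5
```

The exact-match criterion is unforgiving of boundary errors, which matters for a vertical script with nonstandard segmentation: a model that finds every entity but misplaces one boundary still loses both precision and recall on that span.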

A Broader Framework

Fakhreldin (2025), with 1 citation, proposes a comprehensive NLP framework for Indigenous dialect documentation that attempts to address the full pipeline: data collection, preprocessing, annotation, model training, and community feedback. The framework includes provisions for dialectal variation (a challenge often overlooked when the "language" is actually a family of related dialects) and emphasizes iterative validation with speaker communities.

The framework's value is more conceptual than empiricalโ€”it has not yet been fully implemented for any single language. But it articulates principles that the field increasingly recognizes: documentation NLP must be community-governed, dialect-aware, and designed for integration with existing fieldwork tools.

Critical Analysis: Claims and Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| NLP tools can accelerate endangered language documentation | NushuRescue, Comanche, Manchu case studies | ✅ Supported, for specific, well-defined tasks |
| The main barrier to NLP adoption is sociotechnical, not algorithmic | Gessler & von der Wense's fieldwork survey | ✅ Supported |
| Transfer learning from high-resource languages helps low-resource NLP | Lee et al.'s mBERT experiment | ⚠️ Uncertain; mBERT showed no advantage for Manchu |
| Community involvement is essential for validation | NushuRescue and Comanche ethical frameworks | ✅ Supported; computational outputs alone are unreliable |

Open Questions and Future Directions

  • Scaling community-driven NLP: The case studies reviewed here all involve close collaboration with speaker communities. Can this approach scale, or is it inherently bespoke?
  • Oral languages: Many endangered languages have no written tradition. Speech recognition and audio analysis are critical, but acoustic models for low-resource languages remain poor.
  • Data sovereignty: Who owns the digital artifacts produced by NLP tools applied to endangered languages? Community data governance frameworks are emerging but not yet standardized.
  • Sustainability: Grant-funded NLP projects often produce tools that become unmaintained when funding ends. How do we build sustainable infrastructure for endangered language technologies?
  • The "last speaker" problem: For languages with only a handful of elderly speakers, documentation is a race against time. Can NLP tools be deployed rapidly enough to make a difference, or do they require lead time that these situations do not allow?
What This Means for Your Research

For NLP researchers interested in endangered languages, Gessler and von der Wense's analysis is essential reading: the gap between what you can build and what field linguists will use is real. Designing tools that integrate with existing workflows (ELAN, FLEx) is as important as improving model performance.

For field linguists, the Comanche and Manchu case studies demonstrate that useful NLP tools do not require massive resources. Even simple tools (tokenizers, morphological analyzers, glossary extractors) can accelerate documentation work.

For policymakers and funders, the sustainability question is critical. One-off projects produce tools that decay; sustainable infrastructure requires ongoing support.


References

[1] Gessler, L. & von der Wense, K. (2024). NLP for Language Documentation: Two Reasons for the Gap between Theory and Practice. Proc. AmericasNLP 2024.
[2] Yang, I., Ma, W., & Vosoughi, S. (2024). NushuRescue: Revitalization of the Endangered Nushu Language with AI. arXiv:2412.00218.
[3] Alvarez C, J., Karajeanes, D.D., & Prado, A.C. (2025). Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language. Proc. AmericasNLP 2025.
[4] Lee, S., Byun, G., & Seo, J. (2024). ManNER & ManPOS: Pioneering NLP for Endangered Manchu Language.
[5] Fakhreldin, M. (2025). Developing a Comprehensive NLP Framework for Indigenous Dialect Documentation and Revitalization. International Journal of Advanced Computer Science and Applications, 16(4).
