Machine Translation for Low-Resource Languages: Closing the Digital Divide

Machine translation excels for high-resource language pairs but struggles dramatically with the majority of the world's languages. Recent strategies include synthetic pivoting, morphological modeling, and ancient language adaptation.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Modern neural machine translation (NMT) achieves near-human quality for well-resourced language pairs like English-German or English-Chinese, benefiting from billions of parallel sentences and years of engineering optimization. But this success is concentrated in approximately 100 of the world's 7,000+ languages. For the vast majority, including languages spoken by millions of people, translation quality ranges from mediocre to unusable. The fundamental bottleneck is data: NMT systems depend on parallel corpora (aligned translations in both languages), and most language pairs simply lack the millions of sentence pairs that high-quality translation requires. Solving this problem is both a technical challenge in NLP and a question of linguistic equity.

Why It Matters

Language barriers are information barriers. When machine translation fails for a language, its speakers are effectively locked out of the global digital information ecosystem. They cannot access medical information, educational resources, government services, or economic opportunities available in dominant languages. The UN Sustainable Development Goals emphasize information access as a driver of development, but without adequate translation technology, billions of people are underserved. For linguistics, the low-resource translation problem is deeply intertwined with language documentation: every advance in translation for under-resourced languages also generates linguistic resources (parallel texts, lexicons, and grammatical analyses) that serve documentation and preservation goals.

The Science

Synthetic Pivoting for Language Pairs with No Direct Data

Ahmed and Buys (2024) address the most extreme case: translation between two low-resource languages that share no direct parallel data. Traditional pivot-based approaches use a high-resource language (typically English) as an intermediary, but this introduces compounding errors and struggles when the languages are typologically distant from the pivot. Their synthetic pivoting method generates synthetic parallel data between the two target languages using the pivot as a bridge, then trains a direct translation model on this synthetic data. The approach significantly outperforms traditional pivoting, particularly for typologically similar language pairs where synthetic data quality is higher. The linguistic insight is that pivot-based methods lose information that is structurally encoded in the source but absent from the pivot language, and direct models, even when trained on imperfect synthetic data, can preserve this information.
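
The data-generation step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `synthesize_pairs`, the toy lookup table standing in for a trained pivot-to-target model, and the placeholder sentences are all hypothetical.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def synthesize_pairs(
    src_pivot: List[Pair],
    pivot_to_tgt: Callable[[str], str],
) -> List[Pair]:
    """Turn (source, pivot) pairs into synthetic (source, target) pairs.

    The pivot side of each pair is translated into the target language,
    and the result is paired with the original source sentence.
    """
    synthetic: List[Pair] = []
    for src, pivot in src_pivot:
        tgt = pivot_to_tgt(pivot)
        if tgt:  # drop sentences the pivot->target system could not translate
            synthetic.append((src, tgt))
    return synthetic


# Hypothetical stand-in for a trained pivot->target translation model.
TOY_PIVOT_TO_TGT = {
    "good morning": "tgt_good_morning",
    "thank you": "tgt_thank_you",
}


def toy_pivot_to_tgt(sentence: str) -> str:
    return TOY_PIVOT_TO_TGT.get(sentence.lower(), "")


# Placeholder (source, pivot) corpus; real data would be actual sentences.
src_pivot_corpus = [
    ("src_good_morning", "Good morning"),
    ("src_thank_you", "Thank you"),
    ("src_unknown", "See you later"),
]

synthetic_corpus = synthesize_pairs(src_pivot_corpus, toy_pivot_to_tgt)
print(synthetic_corpus)
```

A direct source-to-target model would then be trained on `synthetic_corpus`, so that at inference time no pivot translation step is needed.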

Corpus Development and Human Evaluation

Lankford (2024) takes a holistic approach to low-resource NMT, examining the entire pipeline from corpus development through human evaluation to model architecture for English-Irish and English-Marathi translation. A critical contribution is the emphasis on human evaluation alongside automatic metrics. BLEU scores, the standard automatic metric, correlate poorly with human quality judgments for low-resource languages, particularly those with rich morphology or flexible word order. The study introduces explainable AI architectures that allow linguists to inspect what the translation model has learned, revealing systematic patterns in error types. Error patterns differ across language pairs, reflecting the distinct typological challenges that each language poses for English-centric NMT architectures.
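
One reason word-level overlap metrics can diverge from human judgments on morphologically rich languages is that they give no partial credit for a nearly correct word form. The sketch below illustrates this with clipped unigram precision, which is only one component of BLEU (full BLEU also uses higher-order n-grams and a brevity penalty). The glossed tokens are made-up placeholders, not data from the study.

```python
from collections import Counter
from typing import List


def clipped_unigram_precision(hypothesis: List[str], reference: List[str]) -> float:
    """Fraction of hypothesis tokens that also appear in the reference.

    Each reference token can be matched at most as often as it occurs
    (the "clipping" used in BLEU's modified n-gram precision).
    """
    if not hypothesis:
        return 0.0
    ref_counts = Counter(reference)
    hyp_counts = Counter(hypothesis)
    matched = sum(min(count, ref_counts[token]) for token, count in hyp_counts.items())
    return matched / len(hypothesis)


# Hypothetical glossed tokens: stem plus grammatical markers.
reference = ["walk-PAST-3SG", "house-PL-LOC"]
near_miss = ["walk-PAST-3SG", "house-SG-LOC"]  # right word, one wrong suffix
wrong_word = ["walk-PAST-3SG", "banana"]       # unrelated word

near_miss_score = clipped_unigram_precision(near_miss, reference)
wrong_word_score = clipped_unigram_precision(wrong_word, reference)
print(near_miss_score, wrong_word_score)
```

Both hypotheses receive the same score, although a human evaluator would likely rate the near miss as the better translation. This is the kind of gap that human evaluation is meant to catch.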

Morphological Complexity as a Barrier

Aci et al. (2025) provide a focused analysis of how morphological complexity affects NMT performance, using English-Turkish as their test case. Turkish is highly agglutinative, encoding information through strings of suffixes that can create words equivalent to entire English sentences. Standard NMT tokenization schemes (BPE, SentencePiece) fragment these complex words in linguistically arbitrary ways, losing morphological structure that carries critical meaning. Their analysis demonstrates that NMT error rates correlate directly with morphological complexity: sentences with more agglutinated forms produce more translation errors. The implication is that morphology-aware architectures, not just larger datasets, are needed for typologically diverse languages.
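
The fragmentation problem can be illustrated with a toy segmenter. The sketch below uses greedy longest-match segmentation over a hypothetical subword vocabulary as a simplified stand-in for BPE or SentencePiece (it is not their actual algorithm), and compares its cut points with the morpheme boundaries of the Turkish word "evlerinizden" ("from your houses": ev-ler-iniz-den). The vocabulary is invented for illustration.

```python
from typing import List, Set


def greedy_segment(word: str, vocab: Set[str]) -> List[str]:
    """Split a word by repeatedly taking the longest vocabulary entry.

    Falls back to a single character when no vocabulary entry matches.
    """
    pieces: List[str] = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def boundaries(pieces: List[str]) -> Set[int]:
    """Character offsets at which a segmentation cuts the word."""
    cuts: Set[int] = set()
    position = 0
    for piece in pieces[:-1]:
        position += len(piece)
        cuts.add(position)
    return cuts


word = "evlerinizden"
morphemes = ["ev", "ler", "iniz", "den"]      # house + plural + your + from
toy_vocab = {"evle", "rin", "izd", "en"}      # hypothetical frequency-driven subwords

subwords = greedy_segment(word, toy_vocab)
shared_cuts = boundaries(subwords) & boundaries(morphemes)
print(subwords, shared_cuts)
```

With this vocabulary none of the subword cuts coincides with a morpheme boundary, so the plural, possessive, and case markers are each split across pieces. A morphology-aware tokenizer would aim to make the two sets of cut points agree.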

Ancient Languages as an Extreme Case

Chaoui and Khoury (2025) push the low-resource problem to its logical extreme: machine translation for Coptic, an ancient language with a tiny corpus and no native speakers. Their systematic evaluation of translation strategies, comparing pivot versus direct translation, the impact of pre-training, and robustness to noise, provides a methodological template for any extremely low-resource language. Key findings include that pre-training on related languages (in this case, other Afroasiatic languages) provides measurable benefit, and that multi-version fine-tuning, using different editions and translations of the same texts, effectively multiplies the available training data. For historical linguistics, the ability to translate ancient languages computationally opens new possibilities for large-scale comparative analysis.
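
The data-multiplying effect of using several versions of the same text can be sketched as follows. This is an assumed data layout, not the authors' pipeline: passages are keyed by a hypothetical passage ID, and every source edition of a passage is paired with every available translation of it. All strings are placeholders.

```python
from itertools import product
from typing import Dict, List, Tuple


def expand_versions(
    source_editions: Dict[str, List[str]],
    target_translations: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Pair every source edition of a passage with every translation of it."""
    pairs: List[Tuple[str, str]] = []
    for passage_id, editions in source_editions.items():
        translations = target_translations.get(passage_id, [])
        pairs.extend(product(editions, translations))
    return pairs


# Placeholder data: two passages with several editions and translations each.
source_editions = {
    "p1": ["cop_p1_edition_A", "cop_p1_edition_B"],
    "p2": ["cop_p2_edition_A"],
}
target_translations = {
    "p1": ["fr_p1_version_1", "fr_p1_version_2", "fr_p1_version_3"],
    "p2": ["fr_p2_version_1", "fr_p2_version_2"],
}

training_pairs = expand_versions(source_editions, target_translations)
print(len(training_pairs))
```

Two passages yield eight training pairs here. The pairs are not independent samples, since they share underlying content, so the gain is in surface variation rather than in new material.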

Translation Quality by Resource Level

Resource Level	Example Languages	Parallel Data	Typical BLEU	Primary Strategy
High-resource	EN-DE, EN-ZH, EN-FR	>10M sentences	35-45	Standard NMT
Medium-resource	EN-TR, EN-HI, EN-AR	1-10M sentences	25-35	Transfer learning + data augmentation
Low-resource	EN-GA, EN-MR, JV-MAD	10K-1M sentences	15-25	Pivot, back-translation, multilingual
Extremely low-resource	Coptic, Irula, Suba	<10K sentences	5-15	Synthetic pivoting, related-language transfer
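
The tiers in the table can be expressed as a simple lookup, which may be handy when triaging language pairs by corpus size. This is a sketch only: the function name is hypothetical, and the handling of the exact boundary values (10K, 1M, 10M) is an assumption, since the table's ranges do not specify it.

```python
def resource_level(parallel_sentences: int) -> str:
    """Map a parallel-corpus size to the resource tiers in the table above.

    Boundary handling is an assumption: exactly 10M counts as medium,
    exactly 1M as medium, and exactly 10K as low.
    """
    if parallel_sentences > 10_000_000:
        return "High-resource"
    if parallel_sentences >= 1_000_000:
        return "Medium-resource"
    if parallel_sentences >= 10_000:
        return "Low-resource"
    return "Extremely low-resource"


# Primary strategies per tier, copied from the table.
PRIMARY_STRATEGY = {
    "High-resource": "Standard NMT",
    "Medium-resource": "Transfer learning + data augmentation",
    "Low-resource": "Pivot, back-translation, multilingual",
    "Extremely low-resource": "Synthetic pivoting, related-language transfer",
}

level = resource_level(250_000)
print(level, "->", PRIMARY_STRATEGY[level])
```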

What To Watch

The rise of massively multilingual translation models (such as NLLB-200, which covers 200 languages) is beginning to establish a baseline for many previously untranslatable language pairs, but quality for truly low-resource languages remains well below usability thresholds. The most promising near-term advance is community-driven parallel corpus creation, where bilingual speakers contribute translations through mobile apps and crowdsourcing platforms. Morphology-aware tokenization and subword models designed for agglutinative and polysynthetic languages represent a necessary architectural evolution. Longer-term, the integration of translation with language documentation could create a virtuous cycle: translation tools help document languages, and documented languages in turn provide data for better translation tools.

References

[1] Ahmed, K. & Buys, J. (2024). Neural Machine Translation between Low-Resource Languages with Synthetic Pivoting.

[2] Lankford, S. (2024). Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures.

[3] Aci, M., Sari, N., & Aci, C. (2025). Morphological and structural complexity analysis of low-resource English-Turkish language pair using neural machine translation models. PeerJ Computer Science, 11.

[4] Chaoui, N. & Khoury, R. (2025). Neural Machine Translation for Coptic-French: Strategies for Low-Resource Ancient Languages.
