The 5,980 Languages LLMs Cannot Speak: Three Breakthroughs in Low-Resource NLP
Of the roughly 6,000 languages spoken worldwide, large language models perform well in only about 20. Three recent papers attack this digital divide from different angles: comprehensive benchmarking across 64 African languages, language identification spanning 1,665 languages, and tokenizer optimization for 22 Indian languages. Together, they reveal how deep the gap truly is and where the most promising interventions lie.
By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.
Ethnologue catalogs approximately 7,168 living languages. Large language models---the technology that has reshaped search, writing, coding, and customer service in the span of three years---perform competently in roughly 20 of them. English dominates the training corpora, followed by a handful of European and East Asian languages with large digital footprints. The remaining languages, spoken by billions of people across Africa, South Asia, Southeast Asia, the Pacific, and the Americas, exist in what researchers increasingly call the "low-resource" zone: insufficient training data, inadequate evaluation benchmarks, and tokenizers that fragment their scripts into inefficient subword sequences. The result is a technology that amplifies the communicative power of already-privileged language communities while offering little to those who need it most.
This is not merely a technical inconvenience. When a language is absent from LLM capabilities, its speakers are excluded from AI-assisted education, healthcare information systems, legal document processing, and economic participation in the digital economy. The low-resource problem is, at its root, a problem of global equity. Three recent papers attack this challenge from distinct and complementary angles: benchmarking the performance gap across an entire continent, building identification systems that can recognize over a thousand languages, and redesigning tokenization to serve scripts that current models handle poorly. Each reveals something important about the structure of the problem and the plausibility of solutions.
The Research Landscape
AfroBench: Quantifying the Continental Divide
Ojo et al. (2025) present AfroBench, the most comprehensive evaluation of LLM capabilities across African languages to date. The benchmark spans 64 African languages across 15 NLP tasks, evaluated on 12 LLMs including both proprietary and open-weight models. The scale alone is significant---prior benchmarks for African languages typically covered fewer than 10 languages and a handful of tasks. AfroBench provides, for the first time, a systematic picture of where LLMs stand across an entire continent's linguistic diversity.
The headline findings are sobering. Comparing English with African-language performance, GPT-4o (proprietary) shows a gap of more than 25 points, while Gemma 2 27B (the best open-weight model) shows a gap exceeding 40 points, depending on the task. This is not a marginal difference; it represents a qualitative shift from "useful" to "unreliable." Among proprietary models, GPT-4o achieves an aggregate score of 58.1 and Gemini scores 58.9---reasonable but far below their English-language capabilities. Among open-weight alternatives, the best performer is Gemma 2 27B at approximately 48, about 13 points behind the proprietary leaders---a score that falls below the threshold of practical utility for most downstream applications.
Perhaps the most actionable finding concerns the comparison between prompting and fine-tuning strategies. Fine-tuning consistently outperforms prompting by an average of +11.5 points across tasks and languages. This gap has direct implications for deployment: organizations seeking to serve African language communities cannot simply deploy a general-purpose LLM with translated prompts and expect acceptable results. Task-specific fine-tuning on even modest amounts of in-language data produces substantially better outcomes.
The benchmark also reveals important variation across language families. Niger-Congo B (Bantu) languages tend to perform better than Afro-Asiatic or Nilo-Saharan languages, likely reflecting differences in available training data rather than intrinsic linguistic difficulty. This pattern suggests that data availability, not model architecture, is the binding constraint for most African languages.
GlotLID: Before You Can Serve a Language, You Must Recognize It
Kargaran et al. (2024) address a problem that is logically prior to generation or understanding: language identification. Before an LLM can process text in a given language, the system must determine what language the text is written in. This task, trivial for high-resource languages with distinctive scripts, becomes remarkably challenging when the scope extends to 1,665 languages---as GlotLID attempts.
The model itself is built on FastText, a computationally efficient architecture that scales well to large label spaces. GlotLID's training leverages carefully curated datasets with labels verified against multiple sources, addressing a persistent problem in multilingual NLP: noisy or incorrect language labels in web-crawled corpora. The authors introduce the SET (Semantically Equivalent Translation) evaluation framework, which provides a more rigorous assessment of identification accuracy than simple held-out test sets drawn from the same distribution as training data.
The key challenge GlotLID exposes is confusion between closely related languages. Distinguishing between, say, Serbian and Croatian (which share most of their vocabulary and grammar, and which both appear in Latin script even though Serbian is also written in Cyrillic), or between mutually intelligible West African languages with limited written traditions, pushes the boundaries of character-level and word-level statistical models. GlotLID provides individual performance tables for 1,832 language-script pairs, making it possible for the first time to assess identification reliability on a per-language basis rather than relying on aggregate accuracy figures that are dominated by high-resource languages.
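To see why per-language figures matter more than an aggregate, consider the following minimal sketch. The data is entirely hypothetical (a made-up rare language code `rare` next to `eng`), and the helper names are illustrative assumptions rather than anything from the GlotLID codebase.

```python
from collections import Counter

def accuracy(gold, predicted):
    """Fraction of all examples labelled correctly (the aggregate view)."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def per_language_recall(gold, predicted):
    """Recall for each language that appears in the gold labels."""
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {lang: hits[lang] / totals[lang] for lang in totals}

# Hypothetical test set: 96 English sentences and 4 in a rare language.
gold = ["eng"] * 96 + ["rare"] * 4
# Hypothetical classifier: perfect on English, 1 of 4 correct on the rare language.
predicted = ["eng"] * 96 + ["rare", "eng", "eng", "eng"]

overall = accuracy(gold, predicted)
recalls = per_language_recall(gold, predicted)
print(f"aggregate accuracy: {overall:.2f}")
for lang, recall in recalls.items():
    print(f"recall[{lang}]: {recall:.2f}")
```

In this toy case the aggregate accuracy is 0.97 while recall on the rare language is 0.25, which is the kind of masking that per-language tables are meant to expose.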
The practical implications extend beyond LLM routing. Language identification is a foundational component of corpus construction: to build training data for a low-resource language, you first need to identify and extract text in that language from multilingual web crawls. Errors at this stage---misidentifying language X as related language Y---propagate through the entire pipeline, producing models that are subtly trained on the wrong data. GlotLID's per-language reliability scores provide the information needed to assess and mitigate this risk.
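One way such reliability scores might be used in corpus construction is sketched below. This is an assumption-laden illustration: the record layout `(text, predicted_language, confidence)`, the `reliability` dictionary, and both thresholds are hypothetical and are not taken from the paper.

```python
def filter_corpus(records, target_lang, reliability,
                  min_reliability=0.9, min_confidence=0.8):
    """Keep texts predicted as `target_lang`, but only when the identifier is
    known to be reliable for that language and the prediction is confident.

    records      -- iterable of (text, predicted_language, confidence) tuples
    reliability  -- hypothetical per-language score (e.g. F1) for the identifier
    """
    if reliability.get(target_lang, 0.0) < min_reliability:
        # The identifier is not trustworthy for this language, so automatic
        # extraction is skipped and the data should be reviewed by hand.
        return []
    return [text for text, lang, conf in records
            if lang == target_lang and conf >= min_confidence]

# Hypothetical identifier output for a small web crawl.
records = [
    ("text a", "lang_x", 0.95),
    ("text b", "lang_x", 0.55),   # low confidence, possibly a related language
    ("text c", "lang_y", 0.99),
]
reliability = {"lang_x": 0.93, "lang_y": 0.60}

kept_x = filter_corpus(records, "lang_x", reliability)
kept_y = filter_corpus(records, "lang_y", reliability)
```

Here `kept_x` retains only the confident `lang_x` text, while `kept_y` is empty because the identifier's assumed reliability for `lang_y` is below the threshold. Returning nothing for unreliable languages is a deliberately conservative design choice; a real pipeline might route those texts to manual review instead.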
IndicSuperTokenizer: Fixing the Fertility Problem
Rana et al. (2025) tackle a different bottleneck: tokenization. Standard LLM tokenizers, trained predominantly on English text, are notoriously inefficient for non-Latin scripts. A concept that requires one token in English may require four or five tokens in Hindi, Bengali, or Tamil---a phenomenon measured by "fertility," the average number of tokens per word. High fertility means that the same context window covers less actual content, that inference is slower, and that the model's effective capacity for the language is reduced.
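Fertility as defined above can be measured directly. The sketch below is a minimal illustration, assuming whitespace-delimited words (which does not hold for every script) and using a hypothetical toy tokenizer, `char_pair_tokenizer`, as a stand-in for a real subword tokenizer.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("cannot compute fertility on an empty corpus")
    return n_tokens / n_words

def char_pair_tokenizer(word):
    """Toy stand-in tokenizer: splits a word into chunks of up to two characters."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]

def whole_word_tokenizer(word):
    """Idealized tokenizer that keeps every word as a single token."""
    return [word]

sample = ["low resource languages"]
toy_fertility = fertility(sample, char_pair_tokenizer)
ideal_fertility = fertility(sample, whole_word_tokenizer)
```

Any callable that maps a word to a list of tokens can be passed as `tokenize`, so the same function can compare two tokenizers on the same text.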
IndicSuperTokenizer addresses this for 22 Indian languages through a two-stage approach the authors call SuperBPE. The first stage applies standard Byte Pair Encoding to build a subword vocabulary. The second stage---the innovation---merges frequent subword sequences into "superwords," creating larger units that better capture the morphological structure of Indian languages. The resulting vocabulary of 200K tokens, built with NFKC Unicode normalization to handle script variation, reduces fertility by 39.5% compared to standard tokenizers.
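The two-stage idea can be illustrated with a toy greedy BPE. This is a simplified sketch, not the authors' implementation: the `_` word-start marker, the tie-breaking rule, and the stopping condition are all assumptions made for the example.

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with the joined symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_merges(seqs, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair."""
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for seq in seqs:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        pair, freq = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < 2:
            break
        merges.append(pair)
        seqs = [merge_pair(seq, pair) for seq in seqs]
    return merges, seqs

def two_stage(sentences, stage1_merges, stage2_merges):
    # Stage 1: every word is its own sequence, so merges cannot cross spaces.
    words = [list("_" + w) for s in sentences for w in s.split()]
    m1, _ = learn_merges(words, stage1_merges)
    # Re-segment full sentences with the stage-1 merges, applied in learned order.
    sents = []
    for s in sentences:
        seq = []
        for w in s.split():
            syms = list("_" + w)
            for pair in m1:
                syms = merge_pair(syms, pair)
            seq.extend(syms)
        sents.append(seq)
    # Stage 2: merges may now span word boundaries, producing "superwords".
    m2, sents = learn_merges(sents, stage2_merges)
    return m1, m2, sents

m1, m2, segmented = two_stage(["of the", "of the", "of the"], 10, 10)
```

On this tiny corpus, stage one builds the word-level tokens `_of` and `_the`, and stage two then merges them into the single cross-word token `_of_the`.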
The downstream impact is substantial. Inference throughput improves by 44%, not because the model architecture changes but because the same text is represented in fewer tokens, reducing the computational cost of attention mechanisms that scale quadratically with sequence length. This is a case where a preprocessing improvement produces multiplicative benefits throughout the system.
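The arithmetic behind this argument can be made explicit. The sketch below is an idealized back-of-the-envelope calculation, assuming the only change is the token count; it is not a model of the measured 44% figure, which also depends on costs that grow linearly with sequence length (such as feed-forward layers) and on hardware effects.

```python
def token_savings(fertility_reduction):
    """Idealized effects of cutting the token count by a given fraction.

    Returns (length_ratio, attention_ratio, content_gain):
      length_ratio    -- new sequence length divided by old sequence length
      attention_ratio -- relative self-attention cost, quadratic in length
      content_gain    -- how much more text fits in a fixed context window
    """
    if not 0.0 <= fertility_reduction < 1.0:
        raise ValueError("fertility_reduction must be in [0, 1)")
    length_ratio = 1.0 - fertility_reduction
    attention_ratio = length_ratio ** 2
    content_gain = 1.0 / length_ratio
    return length_ratio, attention_ratio, content_gain

# The 39.5% reduction reported for IndicSuperTokenizer.
length_ratio, attention_ratio, content_gain = token_savings(0.395)
print(f"sequence length: {length_ratio:.3f}x")
print(f"attention cost:  {attention_ratio:.3f}x")
print(f"text per window: {content_gain:.2f}x")
```

Under these assumptions, a 39.5% reduction shortens sequences to about 0.605 of their original length, cuts the quadratic attention term to roughly 0.37 of its original cost, and lets a fixed context window hold about 1.65 times as much text.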
The design decisions are worth noting. The 200K vocabulary is large by current standards---GPT-4's tokenizer uses roughly 100K tokens---but the authors argue that serving 22 languages with diverse scripts requires this capacity. NFKC normalization handles the proliferation of visually identical but Unicode-distinct characters that plague Indic script processing. And the two-stage approach avoids the need to retrain the base model: IndicSuperTokenizer can be applied as a drop-in replacement for existing tokenizers, making adoption relatively straightforward.
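The effect of NFKC normalization can be shown with Python's standard `unicodedata` module. The example below is a small illustration using one Devanagari letter that has two Unicode encodings; the choice of character is ours and is not drawn from the paper.

```python
import unicodedata

def normalize_text(text):
    """Apply NFKC normalization so equivalent encodings become identical."""
    return unicodedata.normalize("NFKC", text)

# The same Devanagari letter encoded two ways:
precomposed = "\u0958"        # DEVANAGARI LETTER QA as a single code point
decomposed = "\u0915\u093C"   # DEVANAGARI LETTER KA followed by SIGN NUKTA

same_before = precomposed == decomposed
same_after = normalize_text(precomposed) == normalize_text(decomposed)
print("equal before normalization:", same_before)
print("equal after normalization: ", same_after)
```

Without normalization, a tokenizer would treat the two strings as different inputs and spend vocabulary entries on both; after normalization they are the same code point sequence.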
Critical Analysis
| Claim | Evidence | Verdict |
|---|---|---|
| Open-weight LLMs lag proprietary models by 15+ points on African languages | AfroBench: Gemma 2 27B ~13 points behind GPT-4o | ⚠️ Partially --- the gap is consistent across tasks, but ~13 points falls short of the claimed 15+ |
| Fine-tuning outperforms prompting for low-resource languages | AfroBench: +11.5 average improvement | ✅ Supported --- effect is robust across language families |
| Language identification at 1,665-language scale is feasible | GlotLID: FastText-based model with per-language evaluation | ⚠️ Partially --- aggregate accuracy is high, but closely related language confusion remains a significant limitation |
| Tokenizer optimization can reduce fertility by ~40% for Indic languages | IndicSuperTokenizer: -39.5% fertility, +44% throughput | ✅ Supported --- though long-term effects on model quality need more evaluation |
| The low-resource gap is primarily a data problem, not an architecture problem | AfroBench's language-family analysis; fine-tuning gains | ⚠️ Likely but not fully established --- architecture choices (tokenizer, attention) also contribute |
Open Questions
Benchmark saturation versus real-world utility. AfroBench demonstrates that LLMs score poorly on African languages, but benchmark scores do not directly measure whether the models are useful for actual applications (healthcare chatbots, legal document summarization, educational tools). Bridging the gap between benchmark performance and deployment-ready quality requires task-specific evaluation in realistic settings.
Data sovereignty and consent. Building training corpora for low-resource languages often involves scraping text from community forums, religious texts, or government documents. Who authorizes this use? The question of data sovereignty---whether language communities have governance rights over digital representations of their languages---is legally and ethically unresolved for most of the world's languages.
Tokenizer-model co-optimization. IndicSuperTokenizer improves tokenization as a preprocessing step, but the base models were trained with different tokenizers. What gains are possible if models are pre-trained from scratch with language-optimized tokenizers? The computational cost of this experiment has so far prevented systematic investigation.
Closely related language identification at scale. GlotLID's confusion between related languages is not just a classification nuisance---it affects corpus quality for all downstream tasks. Can identification accuracy for closely related languages be improved without sacrificing coverage of the long tail of rare languages?
Sustainability of benchmarks. AfroBench covers 64 of Africa's approximately 2,000 languages. Expanding coverage requires sustained effort in data collection, annotation, and community engagement. Who funds this work after the initial publication, and how is the benchmark maintained as languages and technologies evolve?
What This Means for Your Research
For computational linguists and NLP researchers, these three papers collectively argue that the low-resource problem is not monolithic---it is a compound challenge requiring simultaneous progress on benchmarking, identification, and tokenization. Working on any one of these in isolation produces incomplete solutions. AfroBench provides the evaluation infrastructure; GlotLID provides the language identification layer; IndicSuperTokenizer provides the preprocessing optimization. The research agenda that emerges is one of integrated multilingual NLP pipelines rather than isolated model improvements.
For researchers working on specific low-resource languages, the fine-tuning results from AfroBench are immediately actionable: even small amounts of in-language fine-tuning data produce disproportionate gains compared to prompt engineering. Prioritizing the creation of modest, high-quality fine-tuning datasets (hundreds to low thousands of examples) is likely more impactful than waiting for multilingual foundation models to improve organically.
For policymakers and funding agencies, the 5,980-language gap is a concrete measure of digital exclusion. The technical solutions exist in prototype form---what is missing is the sustained investment in data collection, community engagement, and infrastructure maintenance that would make these solutions available at scale. The cost of inaction is not merely technical: it is the progressive exclusion of billions of speakers from the AI-mediated information economy.
References
[1] Ojo, J., Ogueji, K., Stenetorp, P., & Adelani, D.I. (2025). AfroBench: Benchmarking LLMs Across 64 African Languages. arXiv preprint.
[2] Kargaran, A.H., Imani, A., Yvon, F., & Schuetze, H. (2024). GlotLID: Language Identification for Low-Resource Languages. arXiv:2310.16248.
[3] Rana, A., Soumya, S., Saxena, A., & Jyothi, P. (2025). IndicSuperTokenizer: Optimized Tokenization for 22 Indian Languages. arXiv preprint.