Sentiment Analysis Beyond English: Measuring Emotion Across the World's Languages

Sentiment analysis research has been dominated by English, but emotions are expressed differently across languages. New frameworks for South African, South Asian, and code-mixed languages are expanding the frontier.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Sentiment analysis, the computational detection of opinions, emotions, and attitudes in text, is one of NLP's most commercially important applications, driving everything from brand monitoring to political polling to mental health screening. But the field's empirical foundation is radically skewed: the vast majority of sentiment analysis research, training data, and deployed systems target English. This monolingual bias creates a compound problem. First, sentiment analysis tools fail for billions of non-English speakers. Second, the theoretical assumptions embedded in English-centric approaches, including the sentiment lexicon, the polarity scale, and even the granularity of emotion categories, may not transfer across languages and cultures where emotions are categorized, expressed, and communicated differently.

Why It Matters

Consider a global brand monitoring sentiment about a product launch across 50 markets. In an illustrative scenario, English sentiment analysis might achieve 88% accuracy, Hindi 72%, and Swahili 55%, while accuracy for Zulu cannot be measured at all for lack of tools. The business impact is clear: decision-makers see an accurate picture in some markets and a distorted or absent picture in others, which systematically privileges the perspectives of English-speaking consumers. The same pattern applies to political sentiment tracking, public health monitoring, and crisis response. The linguistic communities most likely to be underserved by sentiment analysis are often those most in need of having their voices heard.

The theoretical dimension is equally important. Emotion expression varies profoundly across languages. Japanese encodes speaker affect grammatically through sentence-final particles. Arabic uses morphological patterns to express emotional intensity. Many African languages use tonal variation to convey attitude. Sentiment analysis systems that treat emotion as a simple positive-negative polarity miss the linguistic richness of how affect is actually communicated.

The Science

Adaptive Pretraining for Low-Resource Sentiment

Raychawdhary et al. (2024) address the resource imbalance head-on with a method combining adaptive pretraining and strategic language selection for multilingual sentiment analysis across twelve African languages, including Hausa, Yoruba, Igbo, and Swahili. The key insight is that not all languages are equally useful for cross-lingual transfer: strategically selecting which languages to include in pretraining based on their typological and genealogical relationship to the target low-resource language significantly improves transfer performance. For African languages, this means that closely related languages within the same family (e.g., other Niger-Congo languages for Yoruba) provide stronger transfer than typologically distant high-resource languages. This finding suggests that cross-lingual sentiment transfer is not language-agnostic but follows the contours of language family relationships and shared cultural contexts of emotional expression.
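To make the idea of strategic language selection concrete, here is a minimal sketch that ranks candidate source languages by how many coarse typological features they share with a target. It is not the method from Raychawdhary et al.; the feature sets are hand-written simplifications chosen for illustration, and the names FEATURES, jaccard, and rank_sources are assumptions introduced here.

```python
# Toy source-language ranking by feature overlap.
# The features below are coarse, hand-written simplifications for illustration;
# they are not taken from the paper or from a typological database.
FEATURES = {
    "yoruba": {"family:niger-congo", "tonal", "order:svo"},
    "igbo": {"family:niger-congo", "tonal", "order:svo"},
    "swahili": {"family:niger-congo", "order:svo"},
    "hausa": {"family:afro-asiatic", "tonal", "order:svo"},
    "english": {"family:indo-european", "order:svo"},
}


def jaccard(a: set, b: set) -> float:
    """Share of features two languages have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_sources(target, candidates, features=FEATURES, top_k=2):
    """Return the top_k candidates most similar to the target language."""
    scored = [
        (jaccard(features[target], features[c]), c)
        for c in candidates
        if c != target
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


if __name__ == "__main__":
    print(rank_sources("yoruba", ["igbo", "swahili", "hausa", "english"]))
    # prints ['igbo', 'swahili']
```

In this toy setup the two Niger-Congo candidates outrank English for Yoruba even though English has far more data, which mirrors the direction of the finding described above. A real system would presumably use much richer similarity signals than three features per language.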

Retrieval-Augmented Sentiment Lexicons

Nkongolo et al. (2025) present TriLex, a three-stage retrieval-augmented framework for building sentiment analysis systems for low-resource South African languages. The framework combines corpus-based extraction (mining sentiment-bearing words from available text), cross-lingual projection (transferring sentiment labels from English to target languages via translation), and retrieval-augmented enrichment (using LLMs to expand and validate the lexicon). Applied to three South African languages, the framework demonstrates that retrieval augmentation can compensate for data scarcity by leveraging the broad knowledge encoded in multilingual LLMs while maintaining language-specific accuracy through corpus-based validation. The approach is particularly noteworthy for its attention to cultural specificity: sentiment lexicons are not simply translated but adapted to reflect the emotional connotations specific to each language community.
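The three stages can be sketched as a small pipeline. This is only a rough illustration under stated assumptions, not the TriLex implementation: the function names are invented here, the "tgt_" tokens are placeholders rather than real words in any language, and the suggestions dictionary stands in for whatever an LLM would propose (no model is called).

```python
from collections import Counter


def extract_candidates(corpus, min_count=2):
    """Stage 1 (corpus-based extraction): keep tokens seen at least min_count times."""
    counts = Counter(tok for sentence in corpus for tok in sentence.split())
    return {tok for tok, n in counts.items() if n >= min_count}


def project_labels(english_lexicon, translations):
    """Stage 2 (cross-lingual projection): carry English polarity through a translation table."""
    return {
        translations[en]: polarity
        for en, polarity in english_lexicon.items()
        if en in translations
    }


def enrich(lexicon, suggestions, corpus_vocab):
    """Stage 3 (enrichment): accept suggested entries only if the corpus attests them."""
    merged = dict(lexicon)
    for word, polarity in suggestions.items():
        if word in corpus_vocab and word not in merged:
            merged[word] = polarity
    return merged


def build_lexicon(corpus, english_lexicon, translations, suggestions):
    vocab = extract_candidates(corpus)
    projected = project_labels(english_lexicon, translations)
    # Corpus-based validation: drop projected words the corpus never supports.
    validated = {w: p for w, p in projected.items() if w in vocab}
    return enrich(validated, suggestions, vocab)


if __name__ == "__main__":
    # All data below is hypothetical placeholder material.
    corpus = [
        "tgt_good tgt_food",
        "tgt_good tgt_day",
        "tgt_bad tgt_day",
        "tgt_bad tgt_awful tgt_awful",
    ]
    english_lexicon = {"good": 1, "bad": -1, "rare": 1}
    translations = {"good": "tgt_good", "bad": "tgt_bad", "rare": "tgt_rare"}
    suggestions = {"tgt_awful": -1, "tgt_unseen": 1}  # stand-in for LLM output
    print(build_lexicon(corpus, english_lexicon, translations, suggestions))
```

The design point worth noticing is that both the projected entries and the suggested entries pass through the same corpus check, so a word that never appears in target-language text cannot enter the lexicon however confident the translation or the model is.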

Code-Mixed Sentiment Analysis

Nazir et al. (2025) tackle the especially challenging case of sentiment analysis on code-mixed text in low-resource languages, where speakers alternate between languages (e.g., Urdu-English or Hindi-English) within single messages. Standard sentiment analysis fails spectacularly on code-mixed text because sentiment-bearing words may come from either language, negation patterns may cross language boundaries, and the emotional register of code-switching itself carries sentiment information. Their multilingual transformer approach fine-tunes on code-mixed datasets, learning to process mixed-language sentiment in an integrated way rather than decomposing the text into monolingual segments. The results show that code-mixed sentiment analysis requires dedicated models; multilingual models trained only on monolingual data in each language do not automatically handle the mixed case.
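A toy example may help show why splitting a message into monolingual segments loses information. The sketch below uses a tiny lexicon-based scorer, which is a deliberately simple stand-in and not the transformer approach from Nazir et al.; the lexicons, the one-token negation window, and the language tags on the sample message are all assumptions made for illustration.

```python
# Tiny illustrative lexicons (romanized Hindi/Urdu and English).
EN_LEXICON = {"good": 1, "bad": -1}
HI_LEXICON = {"accha": 1, "bura": -1}
EN_NEGATORS = {"not"}
HI_NEGATORS = {"nahi"}


def score(tokens, lexicon, negators):
    """Sum word polarities, flipping a word if a negator sits directly next to it."""
    total = 0
    for i, tok in enumerate(tokens):
        if tok not in lexicon:
            continue
        neighbours = tokens[max(0, i - 1):i] + tokens[i + 1:i + 2]
        negated = any(n in negators for n in neighbours)
        total += -lexicon[tok] if negated else lexicon[tok]
    return total


def decomposed_score(tagged):
    """Split by language tag and score each monolingual segment on its own."""
    en = [tok for tok, lang in tagged if lang == "en"]
    hi = [tok for tok, lang in tagged if lang == "hi"]
    return score(en, EN_LEXICON, EN_NEGATORS) + score(hi, HI_LEXICON, HI_NEGATORS)


def integrated_score(tagged):
    """Score the full token sequence with merged lexicons and negators."""
    tokens = [tok for tok, _ in tagged]
    return score(tokens, {**EN_LEXICON, **HI_LEXICON}, EN_NEGATORS | HI_NEGATORS)


# "movie good nahi thi": roughly "the movie was not good".
MESSAGE = [("movie", "en"), ("good", "en"), ("nahi", "hi"), ("thi", "hi")]

if __name__ == "__main__":
    print(decomposed_score(MESSAGE))  # prints 1
    print(integrated_score(MESSAGE))  # prints -1
```

The decomposed scorer sees "good" in the English segment and a stranded negator in the other segment, so it reports a positive message. The integrated scorer keeps the negator next to the word it modifies and gets the sign right, which is the kind of cross-boundary interaction that motivates training on code-mixed data directly.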

Addressing Class Imbalance in Bengali

Yousuf et al. (2025) address a pervasive methodological problem: class imbalance in sentiment datasets. In Bengali social media data, positive sentiments vastly outnumber negative ones, causing classifiers to learn a positive-by-default strategy. Their comparative study of BanglaBERT (a Bengali-specific model) and multilingual BERT reveals that language-specific pretraining provides an edge over multilingual models, particularly for the minority sentiment classes that matter most for applications like complaint detection and crisis monitoring. The study demonstrates that the choice between monolingual and multilingual models involves tradeoffs between language coverage and language-specific accuracy that depend on the application context.
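The positive-by-default failure mode is easy to reproduce with arithmetic. The sketch below uses a made-up 90/10 label split (not data from Yousuf et al.) to show how accuracy hides the problem while macro-averaged F1 exposes it, and computes inverse-frequency class weights of the form n_samples / (n_classes * count), one common way to counteract imbalance during training. Function names are assumptions introduced here.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so minority classes count equally."""
    f1_scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    # Hypothetical dataset: 90 positive and 10 negative examples.
    y_true = ["pos"] * 90 + ["neg"] * 10
    # A classifier that always answers "pos".
    y_pred = ["pos"] * 100

    print(accuracy(y_true, y_pred))            # prints 0.9
    print(round(macro_f1(y_true, y_pred), 3))  # prints 0.474
    print(class_weights(y_true)["neg"])        # prints 5.0
```

A classifier that never predicts the negative class still scores 90% accuracy here, yet its macro F1 is below 0.5 because the negative class contributes an F1 of zero. With the weights above, each negative example would count nine times as much as a positive one in a weighted loss.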

Multilingual Sentiment Analysis Resource Landscape

Language Group	Available Resources	Best Approach	Accuracy Gap vs English
Major European (DE, FR, ES)	Extensive corpora + lexicons	Fine-tuned monolingual models	3-5% lower
Major Asian (ZH, JA, KO)	Moderate corpora, growing	Multilingual + domain adaptation	5-10% lower
South Asian (HI, BN, UR)	Limited corpora, code-mixing prevalent	Multilingual transformers + code-mixed training	10-20% lower
African languages (ZU, XH, SW)	Minimal, emerging	Retrieval-augmented + cross-lingual transfer	20-35% lower
Code-mixed varieties	Very limited	Dedicated code-mixed models	15-25% lower

What To Watch

The democratization of sentiment analysis across languages will likely come from two converging trends: massively multilingual LLMs that provide a baseline for any language they have seen in training, and community-driven annotation efforts that create the language-specific evaluation data needed to measure and improve performance. The theoretical frontier involves moving beyond polarity (positive/negative) to fine-grained emotion detection across languages, a task that requires engaging with cultural psychology's research on whether emotion categories are universal or culturally constructed. The answer, almost certainly, is "both, in complex ways," and building sentiment analysis systems that respect this complexity is the field's next grand challenge.

References

[1] Raychawdhary, N., Das, A., & Bhattacharya, S. (2024). Optimizing Multilingual Sentiment Analysis in Low-Resource Languages with Adaptive Pretraining and Strategic Language Selection. Proc. ICMI 2024, IEEE.

[2] Nkongolo, M., Vorster, H., & Warren, J. (2025). TriLex: A Framework for Multilingual Sentiment Analysis in Low-Resource South African Languages.

[3] Nazir, M.K., Faisal, C.N., & Habib, M.A. (2025). Leveraging Multilingual Transformer for Multiclass Sentiment Analysis in Code-Mixed Data of Low-Resource Languages. IEEE Access.

[4] Yousuf, M., Rifat, M.H., & Mondal, P.K. (2025). Addressing Class Imbalance in Bengali Sentiment Analysis. Proc. ECCE 2025, IEEE.

