Automatic Speech Recognition for Accented English: When AI Struggles with Diversity

ASR systems still perform significantly worse on accented English, creating a systematic bias against the non-native and non-standard-dialect speakers who make up most of the world's English users. New approaches, from LoRA mixtures to spectrogram masking, aim to close this gap.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

English is spoken as a first or additional language by approximately 1.5 billion people, encompassing enormous phonological diversity from Nigerian English to Singaporean English to Appalachian English. Yet automatic speech recognition (ASR) systems, trained predominantly on standard American and British English, degrade markedly on accented speech: word error rates (WER) are reported to rise by 20-50% or more for speakers with non-standard accents. This is not merely a technical inconvenience. It is a systematic bias in voice-activated technology that disproportionately affects immigrants, non-native speakers, and speakers of non-prestige dialects, precisely the populations that might benefit most from voice interfaces.
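
To make the metric concrete, here is a minimal sketch of how word error rate and a relative WER increase can be computed. The function names are illustrative, and the example sentences and WER values are invented. The post does not say whether the 20-50% figure is relative or absolute, so the sketch only shows how the two readings differ.

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edits needed to turn the reference prefix seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or a free match
            ))
        prev = curr
    return prev[-1], len(ref)


def wer(pairs):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return errors / words


def relative_increase(baseline_wer, accented_wer):
    """Relative WER increase of the accented condition over the baseline."""
    return (accented_wer - baseline_wer) / baseline_wer


# Invented example: one substituted word out of five reference words.
print(wer([("turn on the kitchen lights", "turn on the chicken lights")]))
# Invented example: WER of 10% on the baseline and 14% on accented speech.
print(relative_increase(0.10, 0.14))
```

In the second example, the gap is 4 points in absolute terms but a 40% increase in relative terms, which is why it matters which convention a reported figure uses.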

Why It Matters

Voice interfaces are increasingly gatekeepers to essential services: healthcare navigation, banking, emergency services, educational platforms, and smart home control. When ASR systems fail on accented speech, they create a two-tier technology landscape where speakers of prestige dialects enjoy seamless voice interaction while others are forced to adapt their speech, switch to text interfaces, or abandon the technology entirely. The scale of the problem is staggering: the majority of English speakers worldwide are non-native speakers, meaning that the typical English speaker is one whose accent ASR systems handle poorly.

For sociolinguistics, the ASR accent gap is a concrete manifestation of linguistic discrimination. Accent-based bias in technology mirrors and potentially reinforces accent-based bias in employment, education, and social evaluation. Understanding and fixing the technical problem requires engaging with the sociolinguistic reality that no accent is inherently more "correct" or more "clear" than any other.

The Science

Mixture of Accent-Specific LoRA Experts

Bagat et al. (2025) introduce MAS-LoRA (Mixture of Accent-Specific LoRAs), a fine-tuning method that leverages a mixture of Low-Rank Adaptation experts, each specialized for a different accent. The approach is elegant: rather than training a single model to handle all accents (which leads to compromised performance on each) or training separate models per accent (which is computationally prohibitive and requires accent identification as a preprocessing step), MAS-LoRA learns to dynamically combine accent-specific adaptations based on the input speech. The method is designed for low-resource multi-accent settings where only small amounts of accented data are available. Results show significant improvements over both accent-agnostic baselines and single-accent fine-tuning, suggesting that accent adaptation benefits from explicitly modeling accent as a source of structured variation rather than noise.
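
The idea of combining accent-specific low-rank updates can be sketched in a few lines. This is a generic mixture-of-LoRA forward pass under stated assumptions, not the authors' implementation: the softmax gate, the `scale` parameter, and all names are hypothetical, and the summary above does not specify how MAS-LoRA computes its mixing weights.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def lora_mixture_forward(x, W, experts, gate, scale=1.0):
    """Compute y = W x + scale * sum_k g_k * B_k (A_k x), with g = softmax(gate x).

    W       : frozen base weight (out x in)
    experts : list of (A_k, B_k) pairs, A_k is (r x in), B_k is (out x r)
    gate    : hypothetical gating matrix (n_experts x in)
    """
    weights = softmax(matvec(gate, x))
    y = matvec(W, x)
    for g, (A, B) in zip(weights, experts):
        delta = matvec(B, matvec(A, x))  # rank-r update: project down, then up
        y = [yi + scale * g * di for yi, di in zip(y, delta)]
    return y, weights


# Toy example: identity base layer and two rank-1 "accent experts".
W = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),  # expert 1 only touches dimension 0
    ([[0.0, 1.0]], [[0.0], [1.0]]),  # expert 2 only touches dimension 1
]
gate = [[0.0, 0.0], [0.0, 0.0]]      # all-zero gate gives uniform weights
y, weights = lora_mixture_forward([2.0, 4.0], W, experts, gate)
print(weights, y)
```

In a real system only the low-rank matrices and the gate would be trained while the base weights stay frozen, which is what keeps per-accent adaptation cheap when little accented data is available.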

Accent-Invariant Representations via Spectrogram Masking

Sameti et al. (2025) take the opposite approach: rather than adapting to specific accents, they aim to learn accent-invariant representations by masking accent-specific features in the input spectrogram. Their saliency-driven method identifies which spectral regions contribute most to accent variation (as opposed to linguistic content) and selectively masks them during training. This forces the model to rely on accent-invariant features for recognition. The approach works for both English and Persian, suggesting that it generalizes across languages with different accent variation patterns. The underlying linguistic premise is that accent information and linguistic content are partially separable in the acoustic signal: accent shows up in cues such as formant frequencies, voice onset times, and prosodic patterns, and masking the regions that carry those cues most strongly still leaves enough of the signal to recover the words.
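
A minimal sketch of the masking step, assuming a saliency map is already available: the post only says the method is saliency-driven, so how the saliency is computed (for example, from the gradients of an accent classifier) is an assumption here, as are the function name, the `fraction` parameter, and the zero fill value.

```python
def mask_salient_regions(spectrogram, saliency, fraction=0.2, fill=0.0):
    """Return a copy of `spectrogram` with the most accent-salient cells set to `fill`.

    spectrogram, saliency : lists of rows with identical shape (time x frequency)
    fraction              : share of cells to mask, chosen by highest saliency
    Ties at the threshold are all masked, so slightly more cells may be hidden.
    """
    ranked = sorted((s for row in saliency for s in row), reverse=True)
    k = int(len(ranked) * fraction)
    if k == 0:
        return [row[:] for row in spectrogram]
    threshold = ranked[k - 1]
    return [
        [fill if s >= threshold else v for v, s in zip(spec_row, sal_row)]
        for spec_row, sal_row in zip(spectrogram, saliency)
    ]


# Toy 2 x 3 "spectrogram" with an invented saliency map.
spec = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
sal = [[0.1, 0.9, 0.2], [0.8, 0.3, 0.0]]
print(mask_salient_regions(spec, sal, fraction=0.5))
# -> [[1.0, 0.0, 3.0], [0.0, 0.0, 6.0]]
```

Applied during training only, a mask like this removes the cells the accent cue depends on most, so the recognizer has to find the words in what remains.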

Accent Identification as a Precursor

Ahmed et al. (2025) focus on the upstream task of accent identification, using spectral features and a hybrid CNN-BiLSTM architecture to classify English accents before feeding the signal to accent-specific recognition modules. Accurate accent identification enables conditional processing pipelines where the ASR system adapts its behavior based on the detected accent. Their system achieves strong identification accuracy across multiple English accent categories, though performance degrades for accents underrepresented in training data and for speakers whose accents blend features from multiple varieties, a common characteristic of multilingual speakers.
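
The conditional pipeline described above can be sketched as a routing step on top of the classifier's output. Everything here is hypothetical: the accent labels, the recognizer names, and the confidence threshold are illustrative, and the fallback to a generic recognizer is one possible way to handle blended or unseen accents, not something the paper is known to do.

```python
def route_by_accent(posteriors, recognizers, fallback, min_confidence=0.6):
    """Choose a recognizer from accent posteriors produced by an identification model.

    posteriors     : dict mapping accent label -> probability
    recognizers    : dict mapping accent label -> accent-specific recognizer
    fallback       : recognizer used when the classifier is unsure or the accent is unknown
    min_confidence : hypothetical threshold below which routing is not trusted
    """
    accent, confidence = max(posteriors.items(), key=lambda item: item[1])
    if confidence >= min_confidence and accent in recognizers:
        return recognizers[accent]
    return fallback


recognizers = {"indian": "asr_indian", "scottish": "asr_scottish"}

# A clearly identified accent is routed to its specialized recognizer.
print(route_by_accent({"indian": 0.85, "scottish": 0.10, "us": 0.05},
                      recognizers, fallback="asr_generic"))

# A blended accent yields no confident label, so the generic recognizer is used.
print(route_by_accent({"indian": 0.45, "scottish": 0.40, "us": 0.15},
                      recognizers, fallback="asr_generic"))
```

The threshold makes the main weakness of this design visible: a wrong but confident label sends the audio to the wrong specialist, and that error carries through to the transcript.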

Data Augmentation for Accent Robustness

Banerjee and Ramasubramanian (2025) address the data scarcity problem directly with Manifold Mixup, a data augmentation technique that creates synthetic training examples by interpolating between accented speech samples in the model's hidden representation space. This approach generates diverse training conditions without requiring additional recordings of accented speech. The method is particularly effective in low-resource settings where collecting and annotating accented speech data is expensive. Their results demonstrate that augmentation in the representation space is more effective than augmentation in the acoustic space (e.g., speed perturbation, pitch shifting), suggesting that meaningful accent variation operates at a more abstract representational level than simple acoustic parameters.
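
The interpolation at the heart of Manifold Mixup can be sketched as follows. This is the generic form of the technique, not the paper's exact recipe: the choice of layer, the Beta parameter `alpha`, and the assumption that the two hidden sequences already have the same shape (real utterances would need padding or alignment) are all simplifications made for illustration.

```python
import random


def sample_lambda(alpha=2.0, rng=random):
    """Draw the mixing coefficient from Beta(alpha, alpha), as is usual in mixup."""
    return rng.betavariate(alpha, alpha)


def mix_hidden(hidden_a, hidden_b, lam):
    """Interpolate two hidden representations of equal shape (frames x features)."""
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(hidden_a, hidden_b)
    ]


def mixed_loss(loss_vs_a, loss_vs_b, lam):
    """Weight the loss against each source transcript by its share of the mix."""
    return lam * loss_vs_a + (1.0 - lam) * loss_vs_b


# Toy example: one frame with two features from each of two accented utterances.
lam = 0.25
mixed = mix_hidden([[0.0, 2.0]], [[4.0, 6.0]], lam)
print(mixed)                       # -> [[3.0, 5.0]]
print(mixed_loss(1.0, 3.0, lam))   # -> 2.5
print(0.0 <= sample_lambda(rng=random.Random(0)) <= 1.0)
```

In training, the mixed representation would be passed through the remaining layers, and the loss computed against both transcripts would be combined with the same coefficient.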

ASR Accent Adaptation Strategies

Strategy	Approach	Data Requirement	Strengths	Limitations
MAS-LoRA experts	Accent-specific modules, dynamic combination	Small per-accent data	Preserves accent-specific detail	Requires some labeled accent data
Spectrogram masking	Remove accent features, learn invariant representations	Standard training data	No accent labels needed	May lose useful accent information
Accent identification + routing	Detect accent, route to specialized model	Accent-labeled speech	Optimal per-accent performance	Pipeline errors compound
Manifold Mixup augmentation	Synthetic accent variation in hidden space	Minimal accented data	Data-efficient	Synthetic variation may not cover real range
Multilingual pre-training	Leverage cross-language phonetic knowledge	Large multilingual corpus	Broad coverage	May not capture accent-specific patterns

What To Watch

The convergence of personalized ASR (adapting to individual speakers over time) with accent-robust ASR promises systems that learn each user's speech patterns regardless of accent category. Large pre-trained speech models, such as Whisper (trained with weak supervision on a very large and varied audio collection) and wav2vec 2.0 (self-supervised on unlabeled audio), have shown surprising accent robustness compared with conventional supervised systems, suggesting that learning from large amounts of diverse speech captures accent variation more effectively than small curated labeled datasets. The critical next step is evaluation: current accent ASR research often uses a small number of accent categories (5-10), but real-world accent variation is continuous and multidimensional. Evaluation frameworks that capture this continuous variation, rather than treating accents as discrete categories, will be essential for measuring genuine progress.
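
One simple way to move away from discrete accent labels is to report error rates along a continuous score. The sketch below assumes each utterance comes with a hypothetical `accent_score` in [0, 1] (for instance, a normalized distance from a reference variety in some embedding space) along with its word error count and reference length. Neither the score nor the binning scheme comes from the cited papers.

```python
def error_rate_by_accent_bin(utterances, n_bins=4):
    """Aggregate WER within equal-width bins of a continuous accent score.

    utterances : iterable of (accent_score in [0, 1], word_errors, reference_words)
    Returns one WER per bin, or None for bins that contain no reference words.
    """
    errors = [0] * n_bins
    words = [0] * n_bins
    for score, word_errors, reference_words in utterances:
        b = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 falls in the last bin
        errors[b] += word_errors
        words[b] += reference_words
    return [errors[b] / words[b] if words[b] else None for b in range(n_bins)]


# Invented data: error counts grow as the accent score increases.
data = [(0.1, 1, 10), (0.2, 1, 10), (0.6, 3, 10), (0.95, 5, 10)]
print(error_rate_by_accent_bin(data, n_bins=2))   # -> [0.1, 0.4]
```

A curve of this kind shows how quickly recognition degrades as speech moves away from the reference variety, which an average over a handful of accent categories would hide.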

References

[1] Bagat, R., Illina, I., & Vincent, E. (2025). Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition. Proc. Interspeech 2025.

[2] Sameti, M.H., Moridani, S.H., & Zarean, A. (2025). Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking.

[3] Ahmed, G., Lawaye, A.A., & Jain, V. (2025). Enhancing English accent identification in automatic speech recognition using spectral features and hybrid CNN-BiLSTM model. Multimedia Tools & Applications.

[4] Banerjee, T. & Ramasubramanian, V. (2025). Accent-robust speech recognition for English in low-resource settings using Manifold Mixup. EURASIP J. Audio, Speech, and Music Processing.
