Trend Analysis | Linguistics & NLP

Automatic Speech Recognition for Accented English: When AI Struggles with Diversity

ASR systems still perform significantly worse on accented English, creating a systematic bias against the more than a billion non-native and non-standard-dialect speakers worldwide. New approaches, from LoRA mixtures to spectrogram masking, aim to close this gap.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

English is spoken as a first or additional language by approximately 1.5 billion people, encompassing enormous phonological diversity from Nigerian English to Singaporean English to Appalachian English. Yet automatic speech recognition systems, trained predominantly on standard American and British English, exhibit significant performance degradation on accented speech, with word error rates increasing by 20-50% or more for speakers with non-standard accents. This is not merely a technical inconvenience: it represents a systematic bias in voice-activated technology that disproportionately affects immigrants, non-native speakers, and speakers of non-prestige dialects, precisely the populations that might benefit most from voice interfaces.

Why It Matters

Voice interfaces are increasingly gatekeepers to essential services: healthcare navigation, banking, emergency services, educational platforms, and smart home control. When ASR systems fail on accented speech, they create a two-tier technology landscape where speakers of prestige dialects enjoy seamless voice interaction while others are forced to adapt their speech, switch to text interfaces, or abandon the technology entirely. The scale of the problem is staggering: the majority of English speakers worldwide are non-native speakers, meaning that the typical English speaker is one whose accent ASR systems handle poorly.

For sociolinguistics, the ASR accent gap is a concrete manifestation of linguistic discrimination. Accent-based bias in technology mirrors and potentially reinforces accent-based bias in employment, education, and social evaluation. Understanding and fixing the technical problem requires engaging with the sociolinguistic reality that no accent is inherently more "correct" or more "clear" than any other.

The Science

Mixture of Accent-Specific LoRA Experts

Bagat et al. (2025) introduce MAS-LoRA (Mixture of Accent-Specific LoRAs), a fine-tuning method that leverages a mixture of Low-Rank Adaptation experts, each specialized for a different accent. The approach is elegant: rather than training a single model to handle all accents (which leads to compromised performance on each) or training separate models per accent (which is computationally prohibitive and requires accent identification as a preprocessing step), MAS-LoRA learns to dynamically combine accent-specific adaptations based on the input speech. The method is designed for low-resource multi-accent settings where only small amounts of accented data are available. Results show significant improvements over both accent-agnostic baselines and single-accent fine-tuning, suggesting that accent adaptation benefits from explicitly modeling accent as a source of structured variation rather than noise.
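The abstract does not spell out the gating or parameterization, but the core mechanism can be sketched as below: a frozen pre-trained projection wrapped with one low-rank (A, B) adapter pair per accent, mixed by a soft gate computed from the utterance itself, so no hard accent label is required at inference. The class name, rank, and dimensions here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MASLoRALinear(nn.Module):
    """Sketch of a mixture of accent-specific LoRA adapters (hypothetical).

    Wraps a frozen linear layer from a pre-trained ASR encoder with one
    low-rank (A, B) pair per accent. A small gating network produces
    per-utterance mixture weights, so the adapters are combined softly
    rather than selected by a hard accent label.
    """

    def __init__(self, base: nn.Linear, num_accents: int, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep pre-trained weights frozen

        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(num_accents, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_accents, d_out, rank))
        self.gate = nn.Linear(d_in, num_accents)  # utterance -> expert weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_in)
        weights = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)  # (batch, experts)
        low_rank = torch.einsum("btd,erd->bter", x, self.A)         # per-expert A @ x
        updates = torch.einsum("bter,eor->bteo", low_rank, self.B)  # per-expert B @ A @ x
        mixed = torch.einsum("bteo,be->bto", updates, weights)      # soft combination
        return self.base(x) + mixed
```

In a full system, wrappers like this would replace the encoder's projection layers, with only the adapters and the gate trained on the small per-accent corpora.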

Accent-Invariant Representations via Spectrogram Masking

Sameti et al. (2025) take the opposite architectural philosophy: rather than adapting to specific accents, they aim to learn accent-invariant representations by masking accent-specific features in the input spectrogram. Their saliency-driven approach identifies which spectral regions contribute most to accent variation (as opposed to linguistic content) and selectively masks them during training. This forces the model to rely on accent-invariant features for recognition. The approach works for both English and Persian, suggesting the method generalizes across languages with different accent variation patterns. The linguistic insight is that accent information and linguistic content are partially separable in the acoustic signal: accent is expressed disproportionately through formant frequencies, voice onset times, and prosodic patterns, while enough of the remaining spectral detail survives masking to recover the linguistic content.
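A minimal sketch of the idea, assuming gradient-based saliency from an auxiliary accent classifier; the paper's actual saliency estimator and masking heuristic may differ, and the function below is hypothetical:

```python
import torch
import torch.nn.functional as F

def accent_saliency_mask(spectrogram, accent_classifier, accent_labels,
                         mask_fraction=0.1):
    """Zero out the spectrogram bins most salient to accent prediction.

    Saliency is approximated by the gradient of an auxiliary accent
    classifier's loss with respect to the input; masking the top bins
    forces the downstream ASR model onto accent-invariant features.
    """
    spec = spectrogram.clone().requires_grad_(True)  # (batch, freq, time)
    loss = F.cross_entropy(accent_classifier(spec), accent_labels)
    grads, = torch.autograd.grad(loss, spec)
    saliency = grads.abs()

    # Keep everything below each example's top-`mask_fraction` threshold
    flat = saliency.flatten(start_dim=1)
    k = max(1, int(mask_fraction * flat.shape[1]))
    threshold = flat.topk(k, dim=1).values[:, -1].view(-1, 1, 1)
    keep = (saliency < threshold).float()
    return spectrogram * keep  # masked input for ASR training
```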

Accent Identification as a Precursor

Ahmed et al. (2025) focus on the upstream task of accent identification, using spectral features and a hybrid CNN-BiLSTM architecture to classify English accents before feeding the signal to accent-specific recognition modules. Accurate accent identification enables conditional processing pipelines where the ASR system adapts its behavior based on the detected accent. Their system achieves strong identification accuracy across multiple English accent categories, though performance degrades for accents underrepresented in training data and for speakers whose accents blend features from multiple varieties, a common characteristic of multilingual speakers.
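The abstract does not report layer sizes, so the sketch below only illustrates the general hybrid shape: a 2-D convolutional front end over the time-frequency plane, a bidirectional LSTM over the resulting frame sequence, and mean pooling before the accent classification head. All hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class CNNBiLSTMAccentID(nn.Module):
    """Sketch of a hybrid CNN-BiLSTM accent classifier (illustrative sizes).

    The CNN extracts local spectral patterns from e.g. MFCC or mel
    features; the BiLSTM models how those patterns evolve over the
    utterance; a linear head predicts the accent class.
    """

    def __init__(self, n_feats: int = 40, n_accents: int = 8):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                        # downsample frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.lstm = nn.LSTM(64 * (n_feats // 4), 128,
                            bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * 128, n_accents)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, freq, time)
        x = self.cnn(feats.unsqueeze(1))                 # (batch, 64, freq//4, time)
        x = x.permute(0, 3, 1, 2).flatten(start_dim=2)   # (batch, time, 64 * freq//4)
        out, _ = self.lstm(x)
        return self.head(out.mean(dim=1))                # pool over time, classify
```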

Data Augmentation for Accent Robustness

Banerjee and Ramasubramanian (2025) address the data scarcity problem directly with Manifold Mixup, a data augmentation technique that creates synthetic training examples by interpolating between accented speech samples in the model's hidden representation space. This approach generates diverse training conditions without requiring additional recordings of accented speech. The method is particularly effective in low-resource settings where collecting and annotating accented speech data is expensive. Their results demonstrate that augmentation in the representation space is more effective than augmentation in the acoustic space (e.g., speed perturbation, pitch shifting), suggesting that meaningful accent variation operates at a more abstract representational level than simple acoustic parameters.
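A simplified training-step sketch of the idea, assuming frame-aligned soft targets so labels can be interpolated directly; real ASR objectives such as CTC make the label mixing subtler, and the helper below is hypothetical rather than the authors' recipe. Mixing is shown at the encoder output for brevity, whereas Manifold Mixup proper interpolates at a randomly chosen hidden layer.

```python
import torch

def manifold_mixup_step(encoder, decoder_head, x1, x2, y1, y2, alpha=0.2):
    """One Manifold Mixup step in the encoder's hidden space (sketch).

    x1/x2 are two accented utterances of equal length; y1/y2 are their
    frame-aligned soft targets over the output vocabulary.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    h = lam * encoder(x1) + (1 - lam) * encoder(x2)  # interpolate hidden reps
    log_probs = torch.log_softmax(decoder_head(h), dim=-1)
    # Cross-entropy against both targets, weighted by the same coefficient
    loss = -(lam * (y1 * log_probs).sum(-1)
             + (1 - lam) * (y2 * log_probs).sum(-1)).mean()
    return loss
```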

ASR Accent Adaptation Strategies

| Strategy | Approach | Data Requirement | Strengths | Limitations |
|---|---|---|---|---|
| MAS-LoRA experts | Accent-specific modules, dynamic combination | Small per-accent data | Preserves accent-specific detail | Requires some labeled accent data |
| Spectrogram masking | Remove accent features, learn invariant representations | Standard training data | No accent labels needed | May lose useful accent information |
| Accent identification + routing | Detect accent, route to specialized model | Accent-labeled speech | Optimal per-accent performance | Pipeline errors compound |
| Manifold Mixup augmentation | Synthetic accent variation in hidden space | Minimal accented data | Data-efficient | Synthetic variation may not cover real range |
| Multilingual pre-training | Leverage cross-language phonetic knowledge | Large multilingual corpus | Broad coverage | May not capture accent-specific patterns |

What To Watch

The convergence of personalized ASR (adapting to individual speakers over time) with accent-robust ASR promises systems that learn each user's speech patterns regardless of accent category. Models pre-trained on massive, diverse speech, whether self-supervised like wav2vec 2.0 or weakly supervised like Whisper, have demonstrated surprising accent robustness compared to conventionally supervised systems, suggesting that learning from diverse speech at scale captures accent variation more effectively than curated labeled datasets. The critical next step is evaluation: current accent ASR research often uses a small number of accent categories (5-10), but real-world accent variation is continuous and multidimensional. Evaluation frameworks that capture this continuous variation, rather than treating accents as discrete categories, will be essential for measuring genuine progress.


References (4)

[1] Bagat, R., Illina, I., & Vincent, E. (2025). Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition. Proc. Interspeech 2025.
[2] Sameti, M.H., Moridani, S.H., & Zarean, A. (2025). Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking.
[3] Ahmed, G., Lawaye, A.A., & Jain, V. (2025). Enhancing English accent identification in automatic speech recognition using spectral features and hybrid CNN-BiLSTM model. Multimedia Tools & Applications.
[4] Banerjee, T. & Ramasubramanian, V. (2025). Accent-robust speech recognition for English in low-resource settings using Manifold Mixup. EURASIP J. Audio, Speech, and Music Processing.
