Trend Analysis · Linguistics & NLP

Sign Language Recognition and Generation: Bridging Deaf and Hearing Worlds with AI

Sign language recognition and generation technology is advancing rapidly, but the gap between isolated gesture recognition and full continuous sign language understanding remains the field's central challenge.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Sign languages are full natural languages with their own grammars, morphologies, and pragmatic systems, used by approximately 70 million deaf people worldwide. Yet the technological infrastructure supporting sign languages lags dramatically behind that of spoken languages. While spoken language processing benefits from decades of automatic speech recognition (ASR) research, sign language recognition (SLR) and sign language generation (SLG) remain challenging open problems. The core difficulty is that sign languages operate in a visual-gestural modality involving the simultaneous use of hand shape, movement, location, facial expression, and body posture, a representational complexity that exceeds what most current computer vision systems can capture.

Why It Matters

The communication barrier between deaf and hearing communities has profound social consequences: reduced access to education, healthcare, employment, and civic participation. Real-time sign language translation could transform this landscape, but the technology must work for actual continuous signing, not just isolated vocabulary items. The linguistic stakes are equally high. Sign languages provide critical evidence for theories of language universals, language acquisition, and the neural basis of language, challenging assumptions derived primarily from spoken modalities. Any adequate theory of human language must account for the visual-gestural modality, and computational sign language research generates both data and formal models that advance this understanding.

The Science

Synthetic Data for Training

Perea-Trigo et al. (2024) address the most fundamental bottleneck in SLR research: data scarcity. Collecting large-scale sign language video corpora is expensive, requiring native signers, controlled recording conditions, and expert annotation. Their solution is synthetic corpus generation for Spanish Sign Language, using 3D avatar technology to produce training data at scale. The review covers state-of-the-art methods in sign language recognition and generation and identifies synthetic data as a critical enabler. The key question is fidelity: can synthetically generated signs capture the phonological and prosodic nuances that distinguish natural from artificial signing? Their results suggest synthetic data is useful for training initial models but must be supplemented with natural signing data for production-quality systems.
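
This finding suggests a simple two-stage training recipe: pretrain at scale on synthetic avatar clips, then fine-tune on the scarce natural footage. Here is a minimal PyTorch sketch of that recipe; the epoch counts, learning rates, and loader contents are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, opt, loss_fn=nn.CrossEntropyLoss()):
    """One pass over batches of (clip_features, sign_label) pairs."""
    for clips, labels in loader:
        opt.zero_grad()
        loss_fn(model(clips), labels).backward()
        opt.step()

def pretrain_then_finetune(model, synthetic_loader, natural_loader):
    """Stage 1: learn sign categories from abundant avatar-rendered clips.
    Stage 2: adapt to natural signing at a lower learning rate so the
    synthetic pretraining is refined rather than overwritten."""
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    for _ in range(20):                      # illustrative epoch counts
        train_epoch(model, synthetic_loader, opt)
    for group in opt.param_groups:
        group["lr"] = 3e-5                   # damp updates in the natural-data stage
    for _ in range(5):
        train_epoch(model, natural_loader, opt)
    return model
```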

Dynamic Temporal Processing

Kim and Kim (2025) tackle a core technical problem in continuous SLR: how to segment and process video input of varying lengths. Conventional systems divide input videos into a fixed number of clips regardless of actual duration, losing temporal information for long utterances and padding short ones. Their coverage-based dynamic clip generation method adapts the number of clips to the actual signing content, preserving the temporal dynamics that carry linguistic meaning. This matters linguistically because sign language grammar makes heavy use of temporal modification: the speed, duration, and rhythm of signs convey morphological and syntactic information that fixed-frame approaches systematically discard.
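
To make the contrast concrete, here is a minimal sketch of both schemes. The fixed clip length and coverage target are hypothetical parameters, not the authors' exact formulation; the point is that the clip count scales with the video instead of staying constant.

```python
import numpy as np

def fixed_clips(num_frames: int, num_clips: int = 12) -> list[range]:
    """Conventional scheme: the same clip count for every video,
    so long utterances get temporally undersampled."""
    bounds = np.linspace(0, num_frames, num_clips + 1, dtype=int)
    return [range(s, e) for s, e in zip(bounds[:-1], bounds[1:])]

def dynamic_clips(num_frames: int, clip_len: int = 16,
                  coverage: float = 1.0) -> list[range]:
    """Coverage-based alternative: use as many fixed-length clips as
    needed so that num_clips * clip_len covers `coverage` of the frames."""
    num_clips = max(1, int(np.ceil(coverage * num_frames / clip_len)))
    starts = np.linspace(0, max(0, num_frames - clip_len), num_clips, dtype=int)
    return [range(s, min(s + clip_len, num_frames)) for s in starts]

# A 40-frame isolated sign and a 400-frame sentence both get 12 clips under
# the fixed scheme, but 3 vs. 25 clips under the coverage-based scheme.
print(len(dynamic_clips(40)), len(dynamic_clips(400)))
```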

Real-Time Bidirectional Translation

The Indian Sign Language system (A. M. et al., 2025) demonstrates a bidirectional approach: not only recognizing signs and converting them to text or speech, but also generating sign language output from text input. The system targets real-world deployment in educational and public service settings. The bidirectional architecture is linguistically significant because sign language generation is not simply the reverse of recognition. Generation requires modeling the grammatical structure of the target sign language, which may differ dramatically from the source spoken language in word order, morphological marking, and discourse organization.
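
A toy example makes the asymmetry visible: even a single step of text-to-gloss conversion must drop, reorder, and restructure material rather than relabel it word by word. The rules below are deliberately simplified stand-ins, not a real Indian Sign Language grammar.

```python
def text_to_gloss(tokens: list[str]) -> list[str]:
    """Reorder an English sentence into a gloss-like sign sequence."""
    drop = {"is", "the", "a", "an", "of", "to"}   # function words often unrealized
    content = [t.upper() for t in tokens if t.lower() not in drop]
    wh = [t for t in content if t in {"WHERE", "WHAT", "WHO", "WHY", "WHEN", "HOW"}]
    rest = [t for t in content if t not in wh]
    return rest + wh     # several sign languages favor clause-final wh-signs

print(text_to_gloss("where is the library".split()))  # ['LIBRARY', 'WHERE']
```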

Privacy-Preserving Distributed Learning

Alzu'bi et al. (2024) introduce a federated learning approach for Arabic Sign Language recognition, addressing both privacy and scalability concerns. In smart city deployments, sign language recognition systems process sensitive biometric video data. Federated learning allows models to be trained across distributed devices without centralizing this data. The system uses 3D virtual signers for generation, connecting to the broader challenge of creating sign language output that is grammatically correct and culturally appropriate. The Arabic Sign Language focus highlights that each sign language has its own grammatical system that must be independently modeled.
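
The core pattern here is federated averaging (FedAvg): each device trains on its own private video, and only model weights travel to the aggregator. Below is a minimal sketch, with a placeholder linear head standing in for a real sign recognizer; a production FedAvg would also weight each client by its sample count.

```python
import copy
import torch
import torch.nn as nn

def federated_average(client_states: list[dict]) -> dict:
    """Average each parameter tensor across clients; raw video stays local."""
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in client_states]).mean(dim=0)
    return avg

# One hypothetical round with three edge devices (e.g., public-service kiosks).
global_model = nn.Linear(512, 100)            # stand-in for a recognition head
clients = [copy.deepcopy(global_model) for _ in range(3)]
# ... each client fine-tunes on its own private signing video here ...
global_model.load_state_dict(federated_average([c.state_dict() for c in clients]))
```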

Sign Language Technology Progress

| Capability | Current State | Key Limitation | Linguistic Requirement |
| --- | --- | --- | --- |
| Isolated sign recognition | 90-98% accuracy | Signer-dependent | Lexical only |
| Continuous sign recognition | 60-75% accuracy | Segmentation and coarticulation | Morphosyntactic processing |
| Sign language generation | Avatar-based, limited grammar | Naturalness and fluency | Full grammatical model |
| Sign-to-text translation | Emerging | Discourse-level meaning | Pragmatic interpretation |
| Real-time deployment | Prototype stage | Latency and reliability | All levels |

What To Watch

The field is at an inflection point. Foundation models for video understanding, trained on massive datasets of human movement, could provide general visual representations on which sign language-specific models can be fine-tuned. The integration of facial expression recognition with hand tracking is critical because non-manual markers (eyebrow raise, head tilt, mouth gestures) carry grammatical information in all sign languages, marking negation, questions, and relative clauses. On the generation side, photorealistic neural avatars are replacing the rigid 3D models that deaf communities have consistently found unnatural and difficult to understand. The most important development may be community-driven: deaf researchers and signers are increasingly leading and co-designing the technology, ensuring that systems reflect genuine sign language use rather than hearing assumptions about what signing looks like.


References

[1] Perea-Trigo, M., Botella-Lopez, C., & Martinez-del-Amor, M.A. (2024). Synthetic Corpus Generation for Deep Learning-Based Translation of Spanish Sign Language. Sensors, 24(5), 1472.
[2] Kim, T. & Kim, B. (2025). Enhancing Sign Language Recognition Performance Through Coverage-Based Dynamic Clip Generation. Applied Sciences, 15(11), 6372.
[3] A. M. et al. (2025). Real-Time Indian Sign Language Recognition & Multilingual Sign Generation. Proc. ICAISS 2025, IEEE.
[4] Alzu'bi, A., Al-Hadhrami, T., & Albashayreh, A. (2024). A Federated Learning-Based Virtual Interpreter for Arabic Sign Language Recognition in Smart Cities.
