Sumaiya Tahseen
Department of Computer Science and Engineering, Integral University
[email protected]
Hina Parveen
Department of Computer Science and Engineering, Integral University
[email protected]
Maruti Maurya
Department of Computer Science and Engineering, Integral University
[email protected]
Abstract:
This work proposes a vision-based approach to real-time sign language translation for Indian Sign Language (ISL). The
system uses state-of-the-art deep learning architectures such as CNN (Convolutional Neural Networks), LSTM (Long
Short-Term Memory) networks, and Transformer-based encoder-decoder models for gesture recognition in both
isolated and continuous forms. Data preprocessing techniques such as DTW (Dynamic Time Warping) were applied to
augment and normalize gesture sequences from custom ISL and public ASL datasets. The model performance was
quantitatively evaluated using precision, recall, F1-score, BLEU, ROUGE, CER (character error rate), and WER (word error
rate). The Transformer-based model outperformed the other architectures, achieving a BLEU score of 0.74 and a classification accuracy of
96.1%. The developed desktop application enables real-time ISL-to-English translation at 18 FPS without requiring
external sensors, while ablation studies validate the benefits of multimodal fusion and pose-language alignment. This
work demonstrates a robust, scalable approach to non-intrusive sign language translation, advancing accessibility for
the DHH community.
Keywords: Transformer-based encoder-decoder, Spatiotemporal gesture modeling, Indian Sign Language (ISL),
Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Dynamic Time Warping (DTW), Real-time sign
language translation.
1. Introduction:
Communication is an essential aspect of human life, yet millions of deaf and hard-of-hearing people around the world
continue to face barriers due to the limited adoption and understanding of sign languages in mainstream
society [5], [12]. Sign languages, being natural and visually rich, vary
significantly across regions, with no universal standard, making the creation of a robust translation system both
necessary and challenging [6], [21], [28]. Despite being linguistically complete, sign languages remain
underrepresented in technological solutions for accessible communication.
Recent advances in artificial intelligence (AI), computer vision, and deep learning have opened the door to real-time
sign language recognition and interpretation. Vision-based techniques built on convolutional neural networks
(CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) models, and transformer-based
architectures have achieved high accuracy in gesture classification and sequence learning [14], [22], [30], [38]. These
models allow systems to learn the spatiotemporal patterns of hand movements and facial
expressions that are essential for correct sign interpretation.
This research focuses on a vision-based AI model for sign language translation, particularly emphasizing Indian and
American Sign Languages, without reliance on wearable sensors or external hardware [9], [16], [29]. The aim is to
provide a non-intrusive, real-time system that captures visual inputs via standard cameras, extracts relevant features,
classifies them into sign tokens, and then translates these into meaningful text or speech.
Furthermore, this paper addresses the social exclusion faced by the deaf community due to linguistic isolation and
evaluates how AI systems can contribute to bridging this gap. The methodology involves camera-based image
acquisition, pre-processing for background removal and contrast enhancement, feature extraction using CNNs,
classification through hybrid models like CNN-SVM, and natural language generation via encoder-decoder models [13],
[26], [31], [40], [50]. The approach also considers non-manual markers such as mouth patterns and facial expressions to enhance
translation accuracy.
By leveraging publicly available datasets and applying state-of-the-art AI models, this work proposes a scalable solution
to reduce communication barriers and increase inclusivity for individuals with hearing and speech impairments.
2. Background:
Sign languages are fully-fledged natural languages that have evolved independently of spoken languages and exhibit all
essential linguistic features such as grammar, morphology, and syntax [6], [21]. In contrast to spoken languages, sign
languages adopt a visual-gestural modality, and meaning is drawn from hand shapes, movement, orientation,
facial expression, and body posture [28], [33]. There are more than 200 identified sign
languages today, each shaped by its geographical, cultural, and social context [25], [39].
Indian Sign Language (ISL), the primary focus of this study, has developed organically across various regions of India
and remains under-documented compared to British Sign Language (BSL) or American Sign Language (ASL) [18], [26],
[41]. ISL does not follow the grammatical structure of spoken Indian languages; instead, it has its own rules, sentence
structures, and lexicons [31], [42]. However, due to the lack of official recognition and limited integration into
educational and governmental institutions until recently, many ISL users face significant barriers to communication and
accessibility [27], [34].
The structure of a typical sign includes five essential components: movement, handshape, orientation, location and
non-manual features such as eye gaze and facial expressions [33], [36]. Variations in any of these components can shift the
meaning of a sign, making sign recognition a complex, high-dimensional problem [8], [37]. Non-manual cues are especially
critical for conveying grammatical aspects such as negation, interrogation, or emotion [7], [24].
Studies have emphasized that sign languages are not mutually intelligible—even among those using the same alphabet
system—due to differences in vocabulary, syntax, and cultural usage [19]. For instance, ASL and BSL are markedly
different in grammar and lexicon, despite being used in English-speaking countries [6]. Moreover, fingerspelling, a
method for spelling out words using hand gestures for each letter, is used variably across different sign languages and
further complicates translation systems [13], [32].
The digital documentation of ISL has gained momentum recently, aided by initiatives from the Indian government and
linguistic researchers, leading to the creation of ISL dictionaries and video corpora [27]. However, compared to ASL
datasets, ISL resources remain limited in both size and diversity, posing a challenge for training robust AI models [20],
[35].
Therefore, an accurate understanding of the linguistic, cultural, and structural features of sign languages is crucial for
the creation of efficient sign language translation systems. This background provides the foundational knowledge
needed to approach the computational challenges addressed in subsequent sections.
3. Advancements in Artificial Intelligence for Vision-Based Sign Language Translation:
Artificial Intelligence (AI) represents the simulation of human cognitive functions through computational systems
capable of learning, reasoning, and adapting autonomously. In recent years, AI has significantly impacted fields such as
speech recognition, natural language processing, and particularly computer vision technologies that form the backbone
of modern sign language translation systems. Within this domain, AI facilitates the interpretation of complex visual
gestures through the integration of deep learning (DL), machine learning (ML), and advanced vision-based techniques.
These methods enable the extraction of temporal and spatial features from gesture-based
inputs, allowing the translation of sign language into textual or spoken language forms with growing accuracy and
fluency [1], [5], [17].
Initially, traditional machine learning algorithms such as Hidden Markov Models (HMM) and Support Vector Machines
(SVM) were employed for isolated sign recognition tasks. While effective in controlled environments, these methods
exhibited limitations when dealing with continuous signing due to issues like gesture overlap, co-articulation, and
contextual dependency inherent in sign languages [8], [11]. Recent advancements have shifted focus toward deep
learning models, namely Convolutional Neural Networks (CNNs) for spatial feature extraction and temporal models
such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) units, and Transformer-based
architectures for dynamic gesture sequences [2], [9], [16]. Moreover, the rise of multi-modal learning, which integrates
video data with skeletal and facial cues, has improved recognition robustness. Architectures such as the Transformer
and T5 have demonstrated efficacy in translating signs into natural language using encoder-decoder mechanisms [14],
[15]. To support real-time use cases, lightweight frameworks like TensorFlow Lite and ONNX have enabled the
deployment of efficient AI models on mobile and embedded platforms, increasing accessibility for the deaf and hard-
of-hearing population [4], [10]. Nevertheless, the effectiveness of such systems remains largely dependent on the
variety and quality of the training datasets, as well as on their capacity to learn the cultural and grammatical variations of different sign
languages.
Fig. 1. Deep artificial neural network framework for sign language translation. Input methods include recorded video, real-time video
feed, and raw image data. The CNN extracts spatial features, LSTM captures gesture dynamics, and Transformer handles sequence
learning. Output methods provide real-time text translation, speech output, and visual feedback with sign overlays.
5. Methodology:
The proposed methodology integrates advanced vision-based AI techniques to facilitate real-time, continuous sign
language recognition and translation. The framework is modular, allowing efficient data acquisition, preprocessing,
feature extraction, classification, and translation. This section outlines the key components and architecture of the
system.
Fig. 2. MediaPipe hand solution framework for gesture recognition. The system processes a video input through a sequence of node
calculators and streams, including image transformation, tensor conversion, model inference, and landmark extraction. The final
output is rendered visually with detected hand landmarks overlaid onto the video stream.
To preserve temporal dynamics, frames are sampled at uniform intervals, and keypoints from hands, face, and upper
body are extracted for every frame. Data augmentation techniques such as temporal shifting, flipping, and rotation are
applied to enhance dataset diversity [31], [41].
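To make the keypoint extraction and augmentation step concrete, the following Python sketch uses the MediaPipe Holistic solution to sample frames at a uniform interval, extract pose and hand landmarks, and apply a horizontal-flip augmentation. The sampling interval, the flattened 225-dimensional feature layout, and the omission of the 468 face landmarks (which can be appended analogously) are illustrative assumptions, not the exact configuration of the reported system.

import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def extract_keypoints(results):
    # Flatten pose (33 points) and both hands (21 points each) into one vector.
    pose = np.array([[p.x, p.y, p.z] for p in results.pose_landmarks.landmark]).flatten() \
        if results.pose_landmarks else np.zeros(33 * 3)
    lh = np.array([[p.x, p.y, p.z] for p in results.left_hand_landmarks.landmark]).flatten() \
        if results.left_hand_landmarks else np.zeros(21 * 3)
    rh = np.array([[p.x, p.y, p.z] for p in results.right_hand_landmarks.landmark]).flatten() \
        if results.right_hand_landmarks else np.zeros(21 * 3)
    return np.concatenate([pose, lh, rh])          # 225-dimensional frame descriptor

def video_to_sequence(path, every_nth=2):
    # Sample frames at uniform intervals and return a (T, 225) keypoint sequence.
    cap = cv2.VideoCapture(path)
    seq, idx = [], 0
    with mp_holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % every_nth == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                seq.append(extract_keypoints(holistic.process(rgb)))
            idx += 1
    cap.release()
    return np.stack(seq) if seq else np.empty((0, 225))

def augment_flip(sequence):
    # Horizontal flip: mirror the normalized x coordinates (x -> 1 - x).
    # Left/right hand channels are not swapped here; this is a simplification.
    flipped = sequence.copy()
    flipped[:, 0::3] = 1.0 - flipped[:, 0::3]
    return flipped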
Fig. 3. Convolutional neural network (CNN) framework for sign gesture recognition. The input image undergoes feature extraction
through convolution and pooling layers, generating feature maps that capture important spatial patterns. These features are then
processed by fully connected layers to classify the gesture into an output category.
For effective classification of sign gestures, two distinct modeling approaches are adopted. Static sign recognition,
particularly suited for alphabet-based or isolated word gestures, utilizes Support Vector Machines (SVM) due to their
capability in handling high-dimensional data and achieving precise classification boundaries in feature space [34], [36].
In contrast, dynamic sign recognition requires modeling temporal dependencies across continuous frames. To address
this, Recurrent Neural Networks (RNN), including Long Short-Term Memory (LSTM) networks and Transformer-based
architectures, are employed. These models effectively capture spatiotemporal patterns and sequential dynamics
inherent in complex gesture transitions [23], [37]. Both classification paradigms are trained using categorical cross-
entropy loss functions and validated through accuracy and F1-score metrics to ensure robust performance.
Furthermore, a hybrid attention mechanism is incorporated within the dynamic model architecture to selectively
prioritize informative frames and enhance the focus on critical motion cues during classification, resulting in improved
recognition accuracy across varied sign sequences [42].
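As an illustration of the two classification branches, the sketch below builds the dynamic-gesture model as an LSTM encoder with a simple additive attention pooling layer trained with categorical cross-entropy, and fits the static branch with an SVM over flattened keypoints. The layer sizes, sequence length, feature dimension, number of classes, and SVM hyperparameters are assumed for illustration and are not the exact settings of the evaluated system.

import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.svm import SVC

NUM_FRAMES, FEAT_DIM, NUM_CLASSES = 60, 225, 50            # illustrative dimensions

def build_dynamic_classifier():
    # LSTM encoder followed by additive attention pooling over the time axis.
    inputs = layers.Input(shape=(NUM_FRAMES, FEAT_DIM))
    x = layers.LSTM(128, return_sequences=True)(inputs)    # per-frame temporal features
    scores = layers.Dense(1, activation="tanh")(x)         # attention score per frame
    weights = layers.Softmax(axis=1)(scores)               # normalize scores over time
    context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(context)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",         # loss named in the text
                  metrics=["accuracy"])
    return model

dynamic_model = build_dynamic_classifier()

# Static branch (isolated alphabet/word signs): an SVM over flattened keypoints.
static_clf = SVC(kernel="rbf", C=10.0)                     # placeholder hyperparameters
# static_clf.fit(X_train_flat, y_train)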
Fig. 4. Hand keypoint detection across different lighting and backgrounds using the MediaPipe Hands model. This figure illustrates
robust 21-point hand landmark detection using computer vision, adapting to variations in skin tone, gesture, lighting, and background.
5.4 Sign-to-Text Translation:
For continuous sign translation, a T5-base encoder-decoder model is fine-tuned to translate gesture sequences into
grammatically correct English text. The encoder ingests the sequence of extracted features, and the decoder outputs
textual equivalents. Beam search and language modeling techniques are used to enhance fluency and reduce
ambiguity in sentence construction.
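A minimal sketch of this translation step is given below, assuming the Hugging Face Transformers library and a recognized gloss sequence as input. The task prefix, the "t5-base" checkpoint (standing in for the fine-tuned weights), and the decoding parameters are illustrative assumptions.

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")   # fine-tuned weights in practice

def glosses_to_text(glosses, num_beams=5):
    # Translate a recognized gloss sequence into a fluent English sentence.
    prompt = "translate glosses to English: " + " ".join(glosses)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs,
                                num_beams=num_beams,            # beam search for fluency
                                max_length=40,
                                early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(glosses_to_text(["YESTERDAY", "I", "SCHOOL", "GO"]))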
To handle real-time inference, the system is optimized using quantization and pruning, enabling deployment on edge
devices [39]. An optional feedback loop allows user correction to improve system learning over time.
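As one hedged example of the optimization step, the snippet below applies post-training dynamic-range quantization while converting a trained Keras classifier to TensorFlow Lite for edge deployment; the model path and output filename are placeholders, and pruning would be applied separately (for example, with the TensorFlow Model Optimization toolkit).

import tensorflow as tf

# Load the trained classifier (path is a placeholder) and convert it to
# TensorFlow Lite with post-training dynamic-range quantization.
model = tf.keras.models.load_model("sign_classifier.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enables weight quantization
tflite_model = converter.convert()

with open("sign_classifier_quant.tflite", "wb") as f:
    f.write(tflite_model)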
Fig. 5. Sign language translation system framework. The process begins with image acquisition using a capture device, followed by
preprocessing, segmentation, and feature extraction of hand gestures. Extracted features are classified using a recognition model, and
the corresponding sign is translated into textual output. A database supports the classification process for improved accuracy and
retrieval.
The CNN extracts spatial features through the 2D convolution operation

Y(i, j) = Σ_m Σ_n X(i + m, j + n) · W(m, n)

where Y(i,j) is the resulting feature map at position (i,j), X is the input frame, and W is the convolution kernel. After
convolution, the non-linear ReLU activation is applied:

f(x) = max(0, x)

For sequence modeling, the Transformer relies on scaled dot-product self-attention:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

where Q, K, and V are the query, key, and value matrices, and d_k is the dimension of the keys.
Positional encodings are added to maintain sequence order:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

This architecture enables robust handling of complex gesture sequences, even under conditions of co-articulation or
signer variability.
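For reference, the attention and positional-encoding formulas above can be realized numerically as follows; this NumPy sketch uses single-head attention and illustrative dimensions, omitting the learned projections of a full Transformer layer.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encodings added to the frame embeddings.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe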
6.4 Dynamic Time Warping (DTW) for Temporal Normalization
Dynamic Time Warping (DTW) is utilized to align gesture sequences of varying lengths to a consistent temporal
dimension, facilitating uniform model input. The recursive DTW formulation between sequences X and Y is

D(i, j) = ∥x_i − y_j∥ + min{D(i−1, j), D(i, j−1), D(i−1, j−1)}

where ∥x_i − y_j∥ denotes the Euclidean distance between frames x_i and y_j. DTW ensures effective comparison and
normalization of sign gestures performed at different speeds.
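The DTW recursion can be implemented directly as below; inputs are two keypoint sequences of shape (T, D), and the quadratic loop is kept for clarity rather than speed.

import numpy as np

def dtw_distance(X, Y):
    # D(i, j) = ||x_i - y_j|| + min(D(i-1, j), D(i, j-1), D(i-1, j-1))
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])   # Euclidean frame distance
            D[i, j] = cost + min(D[i - 1, j],            # insertion
                                 D[i, j - 1],            # deletion
                                 D[i - 1, j - 1])        # match
    return D[n, m]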
The classification models are trained by minimizing the categorical cross-entropy loss

L = − Σ_{i=1}^{C} y_i log(ŷ_i)

where C is the number of classes, y_i the ground-truth indicator, and ŷ_i the predicted probability.
Performance evaluation relies on the following metrics:
Fig. 6. Variation among four key sign language datasets used in this study. The bar chart shows the number of video samples per
dataset, while the red line indicates the number of distinct signers. MS-ASL exhibits the highest diversity, supporting its robustness for
training generalized models. The custom ISL dataset presents a smaller but diverse set, specifically designed for isolated Indian Sign
Language gestures.
CNN-SVM Hybrid: This model employs Convolutional Neural Networks (CNNs) for spatial feature extraction,
followed by a Support Vector Machine (SVM) classifier. While effective for isolated gesture recognition, it shows
limitations in handling continuous gestures due to the absence of temporal modeling [3], [27].
LSTM-CNN: This hybrid approach combines Long Short-Term Memory (LSTM) networks with CNNs to capture both
spatial and temporal features. It demonstrates improved performance on continuous gesture sequences by
modeling motion dynamics between video frames [41].
Transformer-Based Encoder-Decoder (T5): A sequence-to-sequence Transformer architecture is utilized, where
the encoder processes visual-spatial features and the decoder generates context-aware textual outputs. The
integration of self-attention mechanisms enables better disambiguation of co-articulated gestures [28].
To enhance model generalization, pretrained CNNs such as ResNet50 were used for transfer learning [4], [5].
Additionally, pose and keypoint embeddings extracted using OpenPose and MediaPipe were fused with CNN features
to enrich semantic representations [7], [10].
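The fusion of pretrained visual features with pose embeddings can be sketched as follows; the frozen ResNet50 backbone, the 225-dimensional keypoint input, and the layer widths are illustrative choices rather than the exact fusion architecture evaluated here.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

frame_in = layers.Input(shape=(224, 224, 3))               # RGB frame
keypoints_in = layers.Input(shape=(225,))                  # MediaPipe/OpenPose keypoints

backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg")
backbone.trainable = False                                 # transfer learning: freeze weights
visual = backbone(frame_in)                                # 2048-d visual embedding
pose = layers.Dense(128, activation="relu")(keypoints_in)  # pose embedding

fused = layers.Concatenate()([visual, pose])               # multimodal fusion
outputs = layers.Dense(50, activation="softmax")(fused)    # example: 50 gesture classes

fusion_model = models.Model([frame_in, keypoints_in], outputs)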
The classification metrics are defined as:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
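These metrics can be computed, for example, with scikit-learn; the label arrays below are placeholders.

from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

y_true = [0, 1, 2, 2, 1]          # placeholder ground-truth labels
y_pred = [0, 1, 2, 1, 1]          # placeholder model predictions

print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred, average="macro"))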
7.5 Results:
The models were evaluated on multiple datasets, including the ISL dataset, ASLLVD [36], and a custom dataset. The
results are summarized in the following tables.
Fig. 7. Classification accuracy on isolated gesture datasets (bar chart). This bar chart compares the classification accuracy of the
three models (CNN-SVM, LSTM-CNN, Transformer (T5)) across three datasets: ISL, ASLLVD, and Custom.
Fig. 9. Translation quality (BLEU and ROUGE) scores. The radar graph illustrates the comparative performance of CNN-SVM, LSTM-
CNN, and Transformer (T5) models on the ISL dataset across eight key metrics: precision, recall, accuracy, F1-score, BLEU, ROUGE,
WER, and CER. The Transformer (T5) model consistently outperforms the others, particularly in language metrics (BLEU/ROUGE) and
error rates (WER/CER), while the LSTM-CNN offers a balanced performance. The CNN-SVM model shows relatively lower accuracy and
higher error rates, highlighting the advantage of deep sequence models in capturing temporal and semantic features for sign
language recognition.
Table 2: Confusion Matrix and Error Analysis
This table shows the confusion matrix for each model across the ISL dataset for a sample of five different classes (A, B,
C, D, E).
Model               (Class A)  (Class B)  (Class C)  (Class D)  (Class E)
CNN-SVM             180        5          2          1          3
LSTM-CNN            182        4          3          2          4
Transformer (T5)    185        3          1          1          2
Key Observations:
Accuracy: Transformer (T5) consistently outperforms both CNN-SVM and LSTM-CNN across datasets, with a
significant improvement in both isolated and continuous sign recognition.
Translation Quality: BLEU and ROUGE scores further support the superior performance of the Transformer (T5),
highlighting its ability to generate more accurate translations.
Error Analysis: The confusion matrix demonstrates fewer misclassifications for the Transformer (T5) model,
especially in challenging sign gestures like 'A', 'B', 'C', 'D', and 'E'.
This comprehensive evaluation confirms that the Transformer (T5) model is the most effective for both sign recognition
and translation tasks, offering improvements in both accuracy and translation quality.
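As a brief illustration of how such translation-quality scores can be reproduced, the snippet below computes corpus-level BLEU with the sacreBLEU library; the hypothesis and reference sentences here are placeholders only.

import sacrebleu

hypotheses = ["i am going to school today"]        # placeholder model output
references = [["i am going to school today"]]      # placeholder reference translations

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print("BLEU:", round(bleu.score, 2))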
8. Proposed Work
9. Conclusion:
The application of Artificial Intelligence (AI) in sign language translation has evolved greatly, providing promising
solutions towards filling communication gaps between the deaf and hearing populations. Recent advancements,
including the STMC-Transformer, have produced significant improvements in translation accuracy through the use of
transformer-based architectures in gloss-to-text and video-to-text translations.
AI-based sign language translators, which utilize computer vision, machine learning, and natural language processing,
have made it possible to translate sign language into text or speech, and vice versa, in real time. These technologies
have played a crucial role in improving accessibility across different fields, such as education, healthcare, and public
services.
Notwithstanding these developments, challenges remain. The inherent complexity of sign languages, which depend on
hand movements, facial expressions, and body movement, is a major obstacle for AI systems.
Moreover, the absence of standard datasets and regional variations of sign languages hinder the creation of translation
systems applicable everywhere.
Ethical implications are of the utmost importance in the creation of these technologies. Guaranteeing inclusivity and
preventing biases requires direct participation from the deaf and hard-of-hearing communities in the design and
deployment of AI-based translation tools.
In summary, despite the significant achievements of AI in sign language interpretation, continued research, community
participation, and ethical practice are key to unlocking its full potential. By addressing the existing challenges
through inclusive innovation, AI can make a substantial contribution to equal communication and
accessibility for deaf and hard-of-hearing communities worldwide.
10. References:
[1] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," European Conference on Computer
Vision, pp. 404–417, 2006.
[2] A. Yalçıner, "Bag of Visual Words (BoVW)," [Online]. Available: https://medium.com/@yalcinera/bag-of-visual-
words-bovw-cb90c6f3c405
[3] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[4] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[5] R. Elakkiya and B. Natarajan, "ISL-CSLTR: Indian Sign Language Dataset for Continuous Sign Language Translation
and Recognition," Mendeley Data, 2021.
[6] R. Verma, "Indian Sign Language Alphabet Dataset," Kaggle, 2022.
[7] S. Thakar, S. Shah, B. Shah, and A. V. Nimkar, "Sign Language to Text Conversion in Real Time using Transfer
Learning," 2022.
[8] C. C. de Amorim and C. Zanchettin, "ASL-Skeleton3D and ASL-Phono: Two Novel Datasets for the American Sign
Language," 2022.
[9] P. Roy, S. Bhattacharya, P. P. Roy, and U. Pal, "Position and Rotation Invariant Sign Language Recognition from 3D
Kinect Data with Recurrent Neural Networks," 2020.
[10] M. Gupta et al., "CNN-LSTM Hybrid Real-Time IoT-Based Cognitive Approaches for ISLR with WebRTC: Auditory
Impaired Assistive Technology," NCBI, 2022.
[11] M. R. K. et al., "Image-based Indian Sign Language Recognition: A Practical Review using Deep Neural Networks,"
2023.
[12] B. Shi, "Toward American Sign Language Processing in the Real World: Data, Tasks, and Methods," 2023.
[13] I. R. Shaffer, "A study of facial expression recognition technologies on deaf adults and their children," 2018.
[14] H. Cate, F. Dalvi, and Z. Hussain, "Sign Language Recognition Using Temporal Classification," 2017.
[15] F. Wen et al., "AI enabled sign language recognition and VR space bidirectional communication using triboelectric
smart glove," NCBI, 2021.
[16] L. A. Khuzayem et al., "Efhamni: A Deep Learning-Based Saudi Sign Language Recognition Application," NCBI, 2024.
[17] S. Albanie et al., "BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues," 2020.
[18] N. C. Camgoz et al., "Content4All Open Research Sign Language Translation Datasets," 2021.
[19] A. Desai et al., "ASL Citizen: A Community-Sourced Dataset for Advancing Isolated Sign Language Recognition,"
2023.
[20] H. Walsh et al., "A Data-Driven Representation for Sign Language Production," 2024.
[21] P. Roy et al., "American Sign Language Video to Text Translation," 2024.
[22] B. Fang, J. Co, and M. Zhang, "DeepASL: Enabling Ubiquitous and Non-Intrusive Word and Sentence-Level Sign
Language Translation," 2018.
[23] G. Halvardsson et al., "Interpretation of Swedish Sign Language using Convolutional Neural Networks and Transfer
Learning," 2020.
[24] N. Tran et al., "Assessment of Sign Language-Based versus Touch-Based Input for Deaf Users Interacting with
Intelligent Personal Assistants," 2024.
[25] M. Huenerfauth and H. Kacorri, "Best practices for conducting evaluations of sign language animation," 2015.
[26] P. Rust et al., "Towards Privacy-Aware Sign Language Translation at Scale," 2024.
[27] H. Zhang, S. Li, and M. Sun, "Automatic Sign Language Recognition Using Convolutional Neural Networks," IEEE
ICCV, pp. 154–162, 2019.
[28] T. Zhang et al., "A Survey on Real-Time Sign Language Recognition and Translation: Methods, Trends, and
Challenges," Journal of Visual Communication and Image Representation, vol. 68, p. 102764, 2020.
[29] Y. Liu et al., "Deep Learning for Sign Language Recognition: A Review," IEEE Access, vol. 8, pp. 54542–54557, 2020.
[30] J. Liu et al., "Multi-Scale Convolutional Neural Networks for Sign Language Recognition," IEEE Trans. Neural
Networks and Learning Systems, vol. 31, no. 12, pp. 5124–5137, 2020.
[31] S. Wang et al., "Exploring Sign Language Recognition Using Visual Data with Deep Neural Networks," Pattern
Recognition, vol. 99, p. 107084, 2020.
[32] S. Albanie et al., "Co-Articulated Sign Language Recognition from Video Data," IEEE TPAMI, vol. 42, no. 7, pp.
1637–1650, 2020.
[33] R. J. Wang et al., "A Review on Sign Language Recognition and Translation," Sensors, vol. 21, no. 13, p. 4486, 2021.
[34] M. B. Smith and A. J. Binns, "Real-Time Sign Language Recognition Using Machine Learning," IEEE CVPRW, pp. 1–
10, 2020.
[35] P. J. Wong et al., "Deep Learning-Based Sign Language Recognition Using Hand Gestures," Int. J. Adv. Robotic
Systems, vol. 17, no. 5, p. 1726, 2021.
[36] R. S. Camara and S. Lima, "A Comparison of Deep Learning Architectures for Sign Language Translation," ICONIP,
pp. 357–366, 2021.
[37] Y. Wang and Z. Li, "Sign Language Recognition via Deep Learning Techniques: A Review," J. Electrical Eng. & Tech.,
vol. 16, pp. 3589–3596, 2021.
[38] N. P. Lavrenko et al., "Human-Computer Interaction for Sign Language Translation: Challenges and Techniques,"
Computer Science Review, vol. 40, p. 100412, 2021.
[39] Y. Xu et al., "Deep Gesture Recognition for Real-Time Sign Language Translation," IEEE THMS, vol. 51, no. 5, pp.
502–511, 2021.
[40] P. Singh, S. Verma, and A. Saxena, "A Survey on Sign Language Recognition Using Vision-Based Techniques," IEEE
Access, vol. 9, pp. 51001–51015, 2021.
[41] M. S. Sahu and R. S. Meher, "Recognition of Sign Language Gestures Using CNN-LSTM Hybrid Model," J. Electrical
Eng. & Tech., vol. 17, no. 1, pp. 435–444, 2022.
[42] P. Kumar, R. Kumar, and A. Soni, "An Adaptive Framework for Sign Language Translation Using CNN-Based Deep
Learning Models," Expert Systems with Applications, vol. 167, p. 114038, 2021.