Final Project
Bachelor of Engineering
in
Information Science and Engineering
by
SANDEEPA T N (4MC20IS045) SATHVIK RAO (4MC20IS046)
KEERTHAN V (4MC20IS020) ANIRUDH R (4MC19IS009)
Mrs. Shruthi D V
(Assistant Professor)
Department of ISE
External Viva
Name of the Examiners Signature with Date
1.
2.
ACKNOWLEDGEMENT
We have made efforts in this project. However, it would not have been possible
without the kind support and help of many individuals and organizations. We would like to
extend our sincere thanks to all of them.
We would like to express our gratitude to our respected principal Dr. A.J. Krishnaiah
for providing a congenial environment and surroundings to work in. We would like to
express our sincere gratitude to Dr. Chandrika J, Head of the Department of Information
Science and Engineering, for her continuous support and encouragement.
We are highly indebted to Mrs. Shruthi D V for her guidance and constant
supervision as well as for providing necessary information regarding the project & also for
her support in completing the project.
We would like to express our gratitude to our parents & members of Malnad College
of Engineering for their kind co-operation and encouragement which helped us in the
completion of this project.
Our thanks and appreciation also go to our colleagues in developing the project and
the people who have willingly helped us out with their abilities.
SANDEEPA T N- 4MC20IS045
KEERTHAN V - 4MC20IS020
ANIRUDH R - 4MC19IS009
ABSTRACT
CONTENTS
Chapter 1: Introduction
1.1 Introduction to Sign Language
1.2 Potential of the Problem
1.3 Problem Statement
1.4 Objective of the present work
1.5 Expected Impact
1.6 Platform and Tools used
1.6.1 Tools and Technology
1.6.2 Integrated Development Environment
Chapter 2: System Analysis
2.1 Literature Survey
Chapter 7: Conclusion
7.1 Conclusions of the present work
7.2 Limitations
REFERENCES
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1
INTRODUCTION
1.1 Introduction to Sign Language
The core focus of this work is a concise survey of technological developments relevant to ISL recognition, together with the gaps and challenges that remain in the current body of knowledge. The survey aims to provide insights that can guide both further research and the practical implementation of solutions to these problems.
1.2 Potential of the Problem
Sign Language Recognition (SLR) addresses a critical societal need by leveraging
technology to enhance communication for the deaf and hard-of-hearing communities. This
section explores the potential of the SLR problem, outlines the objectives of addressing it,
and provides a detailed description of the current problem landscape.
❖ Societal Impact:
➢ Communication Enhancement: SLR has the potential to significantly enhance
communication for individuals who rely on sign languages. By creating accurate and
efficient recognition systems, it bridges the gap between the deaf community and the
broader society.
➢ Inclusive Society: A successful SLR system contributes to building a more inclusive
society, breaking down communication barriers and fostering understanding among
diverse groups of people.
❖ Educational and Personal Empowerment:
➢ Accessible Learning: A robust SLR system serves as an accessible tool for learning
sign languages, empowering both individuals with hearing impairments and those
seeking to understand and communicate with them.
➢ Educational Equality: The system can contribute to educational equality by
providing deaf individuals with the means to participate more actively in mainstream
educational settings.
❖ Technological Advancements:
➢ Innovative Solutions: SLR necessitates the development of cutting-edge
technologies, including machine learning and computer vision, to accurately
interpret and recognize intricate sign language gestures.
➢ Real-time Processing: Advancements in SLR technology can lead to real-time
processing capabilities, enabling immediate and fluid communication without
delays.
1.3 Problem Statement
Sign languages, such as Indian Sign Language (ISL), serve as crucial modes of
communication for the deaf and hard-of-hearing communities. Despite their significance,
individuals who rely on sign languages face barriers in effective communication,
particularly in interactions with those unfamiliar with sign languages. This limitation
contributes to feelings of isolation, hindering social integration and access to essential
services. Developing robust Sign Language Recognition (SLR) systems can play a pivotal
role in mitigating these challenges.
The present work focuses on addressing the challenges inherent in SLR, with particular attention to Indian Sign Language (ISL). The detailed objectives and problem description are as follows:
➢ Objective: Implement efficient algorithms and processing pipelines to ensure real-
time recognition of sign language gestures.
➢ Rationale: Real-time processing is essential for natural and seamless
communication, eliminating delays that may hinder effective interaction.
❖ Continuous Learning:
➢ Objective: Integrate machine learning capabilities into the SLR system, allowing it
to continuously improve accuracy through user interactions.
➢ Rationale: Continuous learning ensures adaptability and responsiveness, enabling
the system to enhance its performance over time.
1.6.1 Tools and Technology:
• MediaPipe: MediaPipe is an open-source framework used in this project to detect pose and hand keypoints from video streams. It allows for accurate and efficient detection of sign language gestures, forming the basis of the recognition system.
• OpenCV: OpenCV (Open Source Computer Vision Library) is a popular library for
computer vision and image processing tasks. It is used for tasks such as reading and
displaying video streams, image manipulation, and drawing on images. In this
project, OpenCV is utilized for capturing video from webcam feeds, processing
video frames, and visualizing the detected keypoints.
• Python: Python is the primary programming language used for developing the Sign
Speak project. Python's simplicity, versatility, and extensive libraries make it well-
suited for tasks such as machine learning, computer vision, and natural language
processing. The majority of the project's code, including data processing, model
training, and interface development, is written in Python.
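As a brief illustration of how these tools fit together, the sketch below opens a webcam stream with OpenCV and runs the MediaPipe Holistic model on each frame to obtain pose and hand landmarks. It is a simplified outline rather than the project's exact code; the webcam index 0 and the confidence thresholds are assumed values.

```python
import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)  # default webcam (assumed index)
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input, while OpenCV captures BGR frames
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        # Draw the detected pose and hand landmarks back onto the frame
        mp_drawing.draw_landmarks(frame, results.pose_landmarks,
                                  mp_holistic.POSE_CONNECTIONS)
        mp_drawing.draw_landmarks(frame, results.left_hand_landmarks,
                                  mp_holistic.HAND_CONNECTIONS)
        mp_drawing.draw_landmarks(frame, results.right_hand_landmarks,
                                  mp_holistic.HAND_CONNECTIONS)
        cv2.imshow('Sign Speak - keypoints', frame)
        if cv2.waitKey(10) & 0xFF == ord('q'):  # press q to quit
            break
cap.release()
cv2.destroyAllWindows()
```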
1.6.2 Integrated Development Environment (IDE):
• Visual Studio Code (VSCode): Visual Studio Code is a lightweight, open-source
code editor developed by Microsoft. It offers features such as syntax highlighting,
code completion, debugging, version control integration, and extensions support.
VSCode provides a user-friendly interface for writing, debugging, and managing
code files. It was chosen as the primary IDE for its versatility, extensive extensions
ecosystem, and ease of use.
• Jupyter Notebook: Jupyter Notebook is an open-source web application that allows
for the creation and sharing of documents containing live code, equations,
visualizations, and narrative text. It supports various programming languages,
including Python, R, and Julia.
CHAPTER 2
SYSTEM ANALYSIS
2.1 LITERATURE SURVEY
The system proposed in [1] uses the InceptionV3 convolutional neural network (CNN) for feature extraction; the network comprises a ReLU activation layer, a max-pooling layer, and two fully connected layers. The output of the CNN is then fed into a Long Short-Term Memory (LSTM) network for symbol classification and conversion into text. Notably, the LSTM removes the need for manual feature engineering, which makes it a favoured choice over other deep learning approaches. The architecture reports a training accuracy of 100% and a testing accuracy of 81%.
The paper [2] delves into the innovative utilization of MediaPipe technology for real-
time recognition of hand gestures, with a particular emphasis on its applicability to sign
language, a domain where it holds immense promise for individuals grappling with hearing
impairments. The authors detail their utilization of MediaPipe's robust library, enabling the
precise prediction of a human hand's skeletal structure and intricate gestures. This precision
is realized through the integration of two key models: a palm detector and a hand landmark
model.
A noteworthy focal point of this study is the quest for achieving lightweight and
resource-efficient hand gesture detection, ideally suited for mobile devices. The authors
employ rigorous quantitative analysis, punctuated by comparisons to alternative methods,
notably Support Vector Machine (SVM) classifiers. These comparisons show the strength of their model, which achieves an average accuracy of 99% across multiple sign language datasets. In
addition, the study accentuates the cost-effectiveness, real-time responsiveness, and
adaptability of their model when confronted with diverse sign language datasets.
[3] A recent study on Indian Sign Language (ISL) recognition introduced Deepsign, a deep learning system built on Gated Recurrent Unit (GRU) and Long Short-
Term Memory (LSTM) architectures for the discernment of sign language gestures
embedded within video streams, subsequently converting them into their corresponding
English lexemes. The dataset, denoted as IISL2020, was meticulously curated by a cohort
of 16 participants, encompassing both male and female individuals aged between 20 and 25
years. The dataset covers 11 distinct words and contains a total of 1,100 video samples. The videos were recorded using mobile devices,
under natural lighting conditions, at an impressive resolution of 1920 x 1080, and a high
frame rate of 28 frames per second. Each video segment had an average duration of 2
seconds. The model training procedure was executed on a formidable 16GB GDDR6 GPU,
with feature extraction being adeptly conducted using sampling methodologies. The salient
features derived from the video frames were further processed through a pre-trained
MobileNet coupled with InceptionResNetV2, thereby generating a feature vector for LSTM-
based predictions. To validate the model's performance, ten-fold cross-validation was employed.
The deep learning approach to ISL recognition presented in [4] resulted in a classification accuracy of 99.08%. Notably, this approach incorporates the U-Net architecture, obviating the requirement for an RGB-D Kinect camera and thereby enhancing efficiency and accessibility.
[5] In their recent research on Indian Sign Language
(ISL) recognition, a novel Convolutional Neural Network (CNN)-based ISL converter was
unveiled. This cutting-edge architecture demonstrates proficiency in classifying the 26
alphabet letters of ISL. The authors harnessed transfer learning, focusing primarily on
retraining the final layer of the pre-trained MobileNet model when employed as a classifier.
Their self-curated dataset, boasting a voluminous collection of 52,000 images, was
meticulously recorded via a 720p HD camera, with each alphabet being represented by 2000
images. To enrich the dataset's diversity, they introduced various augmentations, including
background alterations, image cropping, flips, expansions, and resizing. Hand segmentation
was executed using the GrabCut algorithm. The overall testing accuracy of this architecture was 96%.
For the INCLUDE dataset work in [6], feature extraction was carried out using pre-trained models, including OpenPose, Pose Videos, and PAF Videos, which were subsequently flattened and normalized before being routed through an array of machine-learning classifiers. Among these, the XGBoost algorithm outperformed regular RNNs and
LSTMs. For recognition, the authors harnessed Pose and PAF videos, extracting features
with the aid of the pre-trained MobileNetV2 model. These features were subsequently
channeled through a BiLSTM architecture for classification, with the hidden states of the
LSTM cells being flattened and conveyed through a fully connected layer and softmax layer.
The result was an overall accuracy of 85.6% on the INCLUDE dataset.
The deep learning study in [8] used the CIFAR-10 dataset as the basis for image classification. A comprehensive preprocessing pipeline was applied, including data augmentation to enrich dataset diversity and normalization to standardize data scaling and distribution. The core architecture was a Convolutional Neural Network (CNN) with two convolutional layers for feature extraction and abstraction, followed by a single fully connected layer. Recognition accuracy on the test set consistently approximated 85%, demonstrating the model's ability to categorize a diverse range of images.
2.2 Findings of the Analysis
❖ Datasets:
➢ Most researchers used self-created datasets, suggesting the need for larger, publicly
available datasets.
➢ Data sizes varied significantly, ranging from 20 signs to 263 signs and 1,100 to
10,000 videos.
➢ Recording conditions varied, with some using mobile phones and others using high-
resolution cameras.
❖ Preprocessing Techniques:
➢ Common techniques included gray scaling, normalization, noise removal, and edge
detection.
➢ Some studies employed data augmentation, such as flipping and resampling, to
improve model generalizability.
➢ Advanced techniques like 3D reconstruction and semantic segmentation were also
explored.
❖ Architectures:
➢ Deep learning architectures dominated, with CNNs and LSTMs being the most
popular choices.
➢ Several studies employed hybrid architectures or multi-stage pipelines.
➢ Some explored classical machine learning methods like SVMs and Extreme Learning Machines.
❖ Recognition Accuracy:
➢ Accuracy varied considerably, ranging from 80.76% to 99.08%.
➢ Higher accuracy was often associated with larger datasets, more complex
architectures, and advanced preprocessing techniques.
➢ However, factors like dataset composition and sign complexity also played a role.
❖ Overall Findings:
➢ Sign language recognition is an active research area with significant progress in
recent years.
➢ Deep learning approaches are currently the most effective, with CNNs and LSTMs
showing promising results.
➢ Preprocessing techniques and dataset composition can significantly influence
accuracy.
➢ Future research directions include exploring hybrid architectures, incorporating
contextual information, and developing larger, publicly available datasets.
The proposed SLR system supports two modes of gesture input:
• Real-time gesture capture: This utilizes cameras to capture gestures in real time, ideal for face-to-face communication.
• Pre-recorded video upload: Users can upload pre-recorded videos containing sign
language gestures for recognition.
The system then employs a deep learning model trained on a diverse dataset for
gesture recognition. This model analyses the captured video or uploaded video frames to
identify the signs being presented. A user-friendly interface provides real-time visual or
auditory feedback, clearly conveying the recognized signs. The system integrates with
external systems through APIs, allowing for broader functionality. Privacy and ethical
considerations are addressed throughout the development process, with clear
communication of data handling policies to ensure user trust.
The SLR system undergoes extensive user testing to evaluate its usability and
effectiveness. Performance metrics are defined to measure accuracy, speed, and robustness.
Additionally, the system is designed for scalability with a modular architecture, enabling
future enhancements and adaptation to evolving needs. Comprehensive documentation and
training resources ensure user accessibility and usability, contributing to enhanced
communication and inclusivity within the deaf and hard-of-hearing community. Regular
updates and refinements based on user feedback and technological advancements are
integral to the system's ongoing improvement.
The main functional capabilities of the system are as follows:
• Video Input: Users can upload video files or stream real-time video from a webcam.
• Keypoint Detection: Detect human poses and hand gestures in video frames using
Mediapipe.
• Sign Language Recognition: Classify sign language gestures based on detected
keypoints using a machine learning model.
• User Interface: Provide an interactive interface for users to upload videos, view
predictions, and control playback.
• Auditory Feedback: Vocalize recognized sign language gestures using text-to-speech
synthesis for immediate feedback.
• Gradio Integration: Integrate with Gradio to create a web-based interface for easy
accessibility.
• Cloud Deployment: Deploy the system on a cloud platform for scalability and
reliability.
• Evaluation and Testing: Evaluate accuracy, speed, and usability through testing and
user feedback.
• Documentation and Support: Provide comprehensive documentation and user
support channels for assistance.
• Continuous Improvement: Continuously update and enhance the system based on
user feedback and technological advancements.
In addition, the system should satisfy the following non-functional requirements:
• Performance: The system should have low latency, providing real-time sign language recognition with minimal delay. It should be capable of handling multiple
concurrent user requests efficiently without performance degradation.
• Accuracy: The sign language recognition model should achieve a high level of
accuracy in identifying gestures, minimizing misclassifications. The system's overall
accuracy should be consistently maintained across different environments and
conditions.
• Reliability: The system should be reliable and available, with minimal downtime or
service interruptions. It should handle errors gracefully, providing informative error
messages and recovering from failures autonomously when possible.
• Scalability: The system should be scalable, capable of handling an increasing
number of users and workload demands without sacrificing performance. It should
scale horizontally by adding more resources or vertically by optimizing existing
resources as needed.
• Security: The system should adhere to security best practices to protect user data and
privacy. It should implement authentication and authorization mechanisms to control
access to sensitive features and data.
• Usability: The user interface should be intuitive and user-friendly, requiring minimal
training for users to operate effectively. It should support accessibility standards to
accommodate users with disabilities, ensuring inclusivity.
• Compatibility: The system should be compatible with a wide range of web browsers,
devices, and operating systems to maximize accessibility for users. It should adhere
to web standards and guidelines to ensure consistent behaviour across different
platforms.
• Maintainability: The system should be designed with modular and well-structured
code, facilitating ease of maintenance and future enhancements. It should include
comprehensive documentation to aid developers in understanding, troubleshooting,
and extending the system.
• Performance Under Load: The system should maintain consistent performance under
varying load conditions, with the ability to handle peak loads efficiently. It should
be stress-tested to identify and mitigate performance bottlenecks before deployment.
• Legal and Regulatory Compliance: The system should comply with relevant laws,
regulations, and industry standards governing data privacy, accessibility, and usage
rights. It should include mechanisms for obtaining user consent and managing data
in accordance with applicable regulations.
2.4.2.1 SOFTWARE REQUIREMENTS
▪ VS Code
▪ opencv-python==4.9.0.80
▪ mediapipe==0.10.10
▪ tensorflow==2.15.0
▪ gradio
2.4.2.2 HARDWARE REQUIREMENTS
❖ Computer System:
▪ Processor: Intel Core i5 / AMD Ryzen 7, or equivalent or better.
▪ RAM: 8 GB or more.
▪ Storage: 256 GB SSD or more.
▪ GPU
▪ Camera
❖ Android Devices:
▪ Quad-core processor or more
▪ Camera sensor
▪ Android 6.0 or later
❖ iOS Devices:
▪ A10 Fusion chip or higher
▪ High-quality camera
CHAPTER 3
DESIGN
➢ Data Collection
▪ User Performs Signs: In this step, a user performs various signs in front of a
camera.
▪ Record Video with Signs: The signs performed by the user are recorded as a
video.
▪ Extract Keypoints (Pose, Hands): A tool like MediaPipe is used to extract
keypoints from the recorded video. Keypoints are important locations on a
person's body, such as wrists, elbows, and shoulders in this case.
▪ Store Keypoints & Sign Labels: The extracted keypoints, along with labels
corresponding to the signs performed in the video, are stored in a database for
later training of the sign recognition model.
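As a sketch of the keypoint-extraction step described above, the function below flattens the MediaPipe Holistic results for one frame into a single vector, assuming only pose and hand landmarks are used (no face mesh); missing detections are replaced with zeros so every frame yields a fixed-length vector.

```python
import numpy as np

def extract_keypoints(results):
    # 33 pose landmarks with (x, y, z, visibility) -> 132 values
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility]
                      for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    # 21 hand landmarks with (x, y, z) -> 63 values per hand
    lh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, lh, rh])  # 258 values per frame
```

Each frame's vector can then be stored as a .npy file alongside its sign label, as described in the final step above.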
➢ Training
▪ Preprocess Data (Normalization, etc.): Before training the model, the collected
data may undergo preprocessing. This can involve normalization, scaling, or
other techniques to ensure the data is in a format suitable for the machine learning
model.
▪ Split Data (Training & Testing): The preprocessed data is divided into two sets:
training and testing. The training set is used to train the model, and the testing
set is used to evaluate the model's performance on unseen data.
▪ Train Deep Learning Model: A deep learning model, such as a Long Short-Term
Memory (LSTM) network, is trained on the training data. The model learns to
map the sequences of keypoints extracted from sign language videos to their
corresponding signs.
▪ Evaluate Model Performance: The model's performance is evaluated using the
testing data. Metrics like accuracy and confusion matrix are used to assess how
well the model can correctly classify signs from unseen videos.
▪ Save Trained Model: Once the model's performance meets the desired criteria,
the trained model is saved for deployment in the real-time prediction phase.
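A compact sketch of how the stored keypoints could be assembled, one-hot encoded, and split for training is shown below. The folder name MP_Data and the example sign labels are hypothetical; the 40 sequences of 30 frames follow the dataset description given later in this report.

```python
import os
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

DATA_PATH = 'MP_Data'                              # hypothetical root folder of saved keypoints
actions = np.array(['hello', 'thanks', 'please'])  # example sign labels
no_sequences, sequence_length = 40, 30             # videos per word, frames per video

label_map = {label: idx for idx, label in enumerate(actions)}
sequences, labels = [], []
for action in actions:
    for seq in range(no_sequences):
        # one .npy file per frame, written during data collection
        window = [np.load(os.path.join(DATA_PATH, action, str(seq), f'{frame}.npy'))
                  for frame in range(sequence_length)]
        sequences.append(window)
        labels.append(label_map[action])

X = np.array(sequences)                 # shape: (samples, 30, 258)
y = to_categorical(labels).astype(int)  # one-hot encoded sign labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
```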
➢ Real-time Prediction
▪ User Interface (Gradio): Users interact with a web interface, likely built using
Gradio, to use the sign language recognition system.
▪ Capture Video/Webcam Stream: The Gradio interface can capture video streams
from a webcam or allow users to upload pre-recorded videos for sign recognition.
▪ Extract Keypoints (Pose, Hands): Similar to data collection, keypoints are
extracted from each frame of the captured video stream or uploaded video.
▪ Predict Sign using Trained Model: The sequence of extracted keypoints is fed to
the trained deep learning model for prediction. The model predicts the most
likely sign the user performed in the video based on the keypoint sequence.
▪ Display Predicted Sign (Text): The Gradio interface displays the text label of the
predicted sign for the user.
▪ Text-to-Speech (Optional): Optionally, the predicted sign label can be sent to a
text-to-speech service to provide spoken feedback to the user.
▪ Play Spoken Sign (Optional): If a text-to-speech service is integrated, the
synthesized speech corresponding to the predicted sign label is played on the
Gradio interface, providing auditory feedback alongside the visual prediction.
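The report does not specify which text-to-speech engine is used for the optional spoken feedback; the sketch below uses pyttsx3, a common offline TTS library for Python, purely as an illustration.

```python
import pyttsx3

def speak(sign_label: str) -> None:
    """Vocalize a predicted sign label to give the user auditory feedback."""
    engine = pyttsx3.init()
    engine.say(sign_label)
    engine.runAndWait()

speak('hello')  # e.g. speak the word predicted by the model
```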
1. Level 0 DFD:
2. Level 1 DFD:
1. LSTM IN SLR:
Image 5: Implemented Training Process
❖ Overview: LSTMs are a type of Recurrent Neural Network that excel at handling
sequential data. They possess an internal memory allowing them to learn long-term
dependencies within sequences, making them ideal for tasks like sign language
recognition where order and timing of gestures are crucial.
2. MediaPipe Holistic Model:
Image 6: Landmarks
❖ Overview: The MediaPipe Holistic Model performs hand and body pose estimation together. For hands, it excels at real-time tracking by identifying 21
keypoints on each hand. These keypoints pinpoint crucial locations like fingertips,
knuckles, and palm base, allowing the model to understand hand gestures and posture.
In parallel, the model tracks 33 keypoints across the body, including joints like elbows,
knees, shoulders, and hips. The MediaPipe Holistic Model's efficiency and open-source
nature make it a valuable tool for developers working on innovative applications that
require real-time hand and body pose analysis.
Sign language research and technological innovation heavily rely on diverse and
comprehensive datasets. This section offers an extensive overview of datasets associated
with both Indian Sign Language (ISL) and American Sign Language (ASL), highlighting
opportunities for creating new datasets through community involvement.
• Words (Signs): 10
• Videos per Word: 40
• Frames per Video: 30
• Total Videos: 10 words * 40 videos/word = 400 videos
• User Interface Design: The interface allows users to stream live video for sign
language recognition, providing instant feedback on their signing. Gradio provides
a straightforward interface where users can upload a video. Once the video is
uploaded, the system processes it to recognize sign language gestures.
• Gesture Input: The system recognizes sign language gestures from the uploaded
video using a pre-trained model. It employs the MediaPipe library for pose detection
and hand tracking, enabling the recognition of gestures.
• Audio Output: The recognized sign language gesture is spoken aloud using text-to-speech (TTS) functionality.
• Multimodal Input: The system supports video input for gesture recognition, making
it accessible for users to upload their signing samples.
• Accessibility: The spoken output provides accessibility for users with hearing
impairments.
• Feedback: Users can provide feedback using a flag button if they encounter a wrong
prediction, enabling continuous improvement of the recognition accuracy.
Chapter 4
IMPLEMENTATION
- Use Mediapipe Holistic model for detecting keypoints of human body parts (pose, left
hand, right hand).
- Define functions for performing Mediapipe detection and drawing landmarks on the
frame.
- Extract keypoints for the pose, left hand, and right hand.
- Create directories to store collected data for each action (word) and sequence (video).
- Prompt the user to enter new words for training and create directories accordingly.
- Collect 30 frames per video sequence for each action (word) and store the keypoints in
numpy files.
- Compile the model with Adam optimizer and categorical cross-entropy loss.
- Train the model using the training data and validate using a validation split.
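The exact layer configuration is not spelled out in this chapter, so the following is a representative sketch of a stacked LSTM classifier compiled with the Adam optimizer and categorical cross-entropy loss. It assumes 30-frame sequences of 258 keypoints per frame and reuses X_train and y_train from the data-preparation sketch in Chapter 3; the layer sizes and epoch count are illustrative.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

num_classes = y_train.shape[1]   # one output unit per sign word

model = Sequential([
    LSTM(64, return_sequences=True, activation='relu', input_shape=(30, 258)),
    LSTM(128, return_sequences=True, activation='relu'),
    LSTM(64, return_sequences=False, activation='relu'),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='Adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
history = model.fit(X_train, y_train, epochs=200, validation_split=0.1)
model.save('action.h5')   # trained model reused for real-time prediction
```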
- Continuously capture video frames from the webcam and predict the sign language
gesture.
- Display the recognized gesture and update the prediction based on the sequence of
frames.
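A minimal sketch of this real-time prediction loop follows, reusing the extract_keypoints function, trained model, and actions array from the earlier sketches; the 0.7 confidence threshold is an assumed value.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

sequence, sentence = [], []
threshold = 0.7   # minimum probability before accepting a prediction (assumed)

cap = cv2.VideoCapture(0)
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        sequence.append(extract_keypoints(results))   # from the Chapter 3 sketch
        sequence = sequence[-30:]                      # keep only the last 30 frames
        if len(sequence) == 30:
            probs = model.predict(np.expand_dims(sequence, axis=0))[0]
            if probs[np.argmax(probs)] > threshold:
                word = actions[np.argmax(probs)]
                if not sentence or sentence[-1] != word:
                    sentence.append(word)              # avoid repeating the same word
        cv2.putText(frame, ' '.join(sentence[-5:]), (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        cv2.imshow('Sign Speak', frame)
        if cv2.waitKey(10) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()
```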
- Utilize Gradio library to create an intuitive and user-friendly interface for real-time sign
language recognition.
- Define input and output components for the interface, allowing users to upload or record
a video and receive the predicted sign language gesture as output.
- Configure the interface with appropriate title, description, and example inputs to guide
users effectively.
- Integrate the Gradio interface with the real-time sign language recognition system to
provide users with a seamless experience.
- Upon receiving input video from the interface, pass it through the trained model to
predict the sign language gesture.
- Display the predicted gesture as output in the interface, allowing users to visualize the
recognition result instantly.
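A minimal Gradio sketch consistent with the description above is shown below. The recognize_sign function body is a placeholder: in the actual system it would extract MediaPipe keypoints from the uploaded video, build a 30-frame sequence, and return the sign predicted by the trained LSTM.

```python
import gradio as gr

def recognize_sign(video_path: str) -> str:
    # Placeholder: run MediaPipe keypoint extraction on the video frames,
    # assemble a 30-frame sequence, and feed it to the trained LSTM model.
    return 'hello'   # example output

demo = gr.Interface(
    fn=recognize_sign,
    inputs=gr.Video(label='Upload or record a sign'),
    outputs=gr.Textbox(label='Predicted sign'),
    title='Sign Speak - Sign Language Recognition',
    description='Upload a short video of a sign; the system returns the recognized word.',
)
demo.launch()
```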
- Leverage the Hugging Face platform for deploying the trained sign language
recognition model.
- Prepare the model for deployment by packaging it with its configuration, tokenizer, and
necessary assets.
- Upload the packaged model to the Hugging Face model repository, making it accessible
to the community for inference and fine-tuning.
- Integrate the deployed model from Hugging Face with the Gradio interface, enabling
users to access the model directly through the interface.
- Configure the interface to use the deployed model hosted on the Hugging Face model
hub for making predictions on input videos.
- Ensure seamless communication between the Gradio interface and the deployed model,
allowing users to experience real-time sign language recognition without any latency.
- Benefit from the scalability and accessibility features offered by Hugging Face,
enabling easy sharing and deployment of the sign language recognition model across
various platforms and applications.
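The deployment steps above can be sketched with the huggingface_hub client as follows. The repository name and model file name are hypothetical, and the snippet assumes you have already authenticated (for example with an access token); it is an outline of the upload flow, not the project's exact deployment script.

```python
from huggingface_hub import HfApi, hf_hub_download

REPO_ID = 'your-username/sign-speak-lstm'   # hypothetical model repository

api = HfApi()
api.create_repo(repo_id=REPO_ID, repo_type='model', exist_ok=True)
api.upload_file(
    path_or_fileobj='action.h5',            # saved Keras model from training
    path_in_repo='action.h5',
    repo_id=REPO_ID,
    repo_type='model',
)

# Later, the Gradio app can pull the same weights from the Hub before serving:
model_path = hf_hub_download(repo_id=REPO_ID, filename='action.h5')
```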
Table 1: Our Dataset Model Summary
Table 2: Kaggle Dataset Model Summary
Table 3: Accuracy Achieved with Kaggle dataset
Chapter 5
TESTING
5.1 Introduction to testing
Test Case ID | Test Case | Expected Result | Actual Result | Status
TC-04 | File Upload Functionality | User can upload video files | Video file uploaded successfully | Pass
TC-06 | Camera Access | System can access the camera | Camera access granted | Pass
This section details the evaluation methods used to assess the performance of the
LSTM model for sign language recognition. We'll focus on two key metrics: Loss vs Epoch
graph and Confusion Matrix.
The Loss vs Epoch graph is a fundamental tool for visualizing the training process
of the LSTM model. Here's a breakdown of its components:
Loss: As explained earlier, the loss function measures the discrepancy between the model's
predicted sign probabilities and the actual sign labels in the training data. A lower loss value
indicates better model performance.
The Loss vs Epoch graph plots the training loss on the Y-axis and the number of epochs on
the X-axis. Ideally, the graph should exhibit a downward trend during training. This signifies
that the model is progressively learning from the data and minimizing its prediction errors.
Steep Descent: A rapid decrease in loss early on indicates the model is efficiently learning
the patterns in the training data.
Gradual Decrease: A slower, steady decline suggests the model is gradually improving its
accuracy.
Stagnation: If the loss plateaus after a certain number of epochs, it might indicate the model
has reached its learning capacity or is overfitting the training data. Techniques like early
stopping or adjusting hyperparameters can be employed to address overfitting.
Fluctuations: Minor fluctuations in the loss curve are normal and can be attributed to the
stochastic nature of the training process.
By analyzing the Loss vs Epoch graph, we can gain valuable insights into the training
progress, identify potential issues, and determine when to stop training to prevent
overfitting.
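Assuming the model was trained with Keras, the History object returned by model.fit (captured as history in the Chapter 4 sketch) can be used to draw the Loss vs Epoch graph described above:

```python
import matplotlib.pyplot as plt

plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('Epoch')
plt.ylabel('Categorical cross-entropy loss')
plt.title('Loss vs Epoch')
plt.legend()
plt.show()
```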
In sign language recognition, where there are multiple possible signs, the Confusion Matrix is a square matrix in which each row corresponds to an actual sign class and each column to a predicted sign class. Because the sign labels are one-hot encoded, the model's outputs and the ground-truth vectors are first converted back to class indices before the matrix is computed. The value in each cell counts how many test samples of the actual sign were classified as the predicted sign, so the diagonal holds correct predictions while off-diagonal cells reveal which signs the model confuses with one another. This per-class breakdown provides a clearer picture of the model's performance for each individual sign.
By combining the insights from the Loss vs Epoch graph and the Confusion Matrix, we can
comprehensively evaluate the effectiveness of the LSTM model for sign language
recognition and identify areas for potential improvement.
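A sketch of how the confusion matrix can be computed and visualized is given below, reusing model, X_test, y_test, and actions from the earlier sketches; scikit-learn builds the matrix and Seaborn renders it as a heatmap.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

y_pred = np.argmax(model.predict(X_test), axis=1)   # predicted class indices
y_true = np.argmax(y_test, axis=1)                  # one-hot labels back to indices

cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=actions, yticklabels=actions)
plt.xlabel('Predicted sign')
plt.ylabel('Actual sign')
plt.title('Confusion matrix on the test set')
plt.show()
```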
Chapter 6
USER MANUAL
6.1 Installation Procedure
Installation procedures for the tools and libraries used in the sign language recognition
project:
TensorFlow:
TensorFlow is an open-source machine learning framework. In this project it is used to build, train, and save the LSTM recognition model; it can be installed with pip (version 2.15.0, as listed in the software requirements).
OpenCV:
OpenCV (Open Source Computer Vision Library) is a popular library for computer vision
and image processing tasks. It provides functions for reading, writing, and manipulating
images and videos.
MediaPipe:
MediaPipe is an open-source framework from Google for building perception pipelines. Its Holistic solution is used in this project to detect pose and hand landmarks in video frames (version 0.10.10, as listed in the software requirements).
scikit-learn:
Scikit-learn is a machine learning library in Python that provides simple and efficient tools
for data analysis and modeling. It includes various algorithms for classification, regression,
clustering, and dimensionality reduction.
Matplotlib:
Matplotlib is a plotting library for creating static, animated, and interactive visualizations in
Python. It provides a MATLAB-like interface for generating plots and charts.
Seaborn:
Seaborn is a statistical data visualization library built on top of Matplotlib. It can be used, for example, to render the confusion matrix as a heatmap.
Gradio:
Gradio is an open-source Python library that allows you to quickly create customizable UI
components for machine learning models. It provides a simple interface for building web-
based applications to interact with models.
6.2 Snapshots
Image 13: Uploading file
Chapter 7
CONCLUSION
7.1 Conclusions of the present work
This project has successfully developed a sign language recognition system using an LSTM
model. The system leverages the power of LSTMs to capture the temporal dependencies
within sign language gestures, leading to accurate recognition of various signs. The project
addressed the critical challenge of communication between deaf and hearing communities
by providing a technology-driven solution.
7.2 Limitations
While the project achieved significant progress, there are some limitations to consider:
• Accuracy Variations: The accuracy of the sign language recognition system can be
impacted by various factors like lighting conditions, video quality, hand pose
variations, and potential background noise. Further training and data augmentation
techniques can be employed to improve robustness in real-world scenarios.
• Limited Sign Set: The current system might recognize a specific set of signs.
Expanding the training dataset with a broader range of signs from different regional
variations of sign language is crucial for wider application.
• Real-Time Performance: While the system aims for real-time processing,
computational limitations might introduce slight delays. Optimizing the model
architecture and utilizing efficient hardware can enhance real-time performance.
Building upon the successes of this project, several directions offer exciting prospects for future development.
By addressing the limitations and exploring these future directions, this project has the
potential to evolve into a robust and comprehensive sign language recognition system,
fostering inclusivity and empowering communication for all.
References
[1] Jayadeep, Gautham, et al. "Mudra: convolutional neural network based Indian sign
language translator for banks." 2020 4th International Conference on Intelligent Computing
and Control Systems (ICICCS). IEEE, 2020.
[3] Kothadiya, D., Bhatt, C., Sapariya, K., Patel, K., Gil-González, A. B., & Corchado, J.
M. (2022). Deepsign: Sign language detection and recognition using deep learning.
Electronics, 11(11), 1780.
[4] Likhar, Pratik, Neel Kamal Bhagat, and G. N. Rathna. "Deep learning methods for Indian
sign language recognition." 2020 IEEE 10th International Conference on Consumer
Electronics (ICCE-Berlin). IEEE, 2020.
[5] Gangadia, D., Chamaria, V., Doshi, V., & Gandhi, J. (2020, December). Indian sign
language interpretation and sentence formation. In 2020 IEEE Pune section international
conference (PuneCon) (pp. 71-76). IEEE.
[6] Sridhar, A., Ganesan, R. G., Kumar, P., & Khapra, M. (2020, October). INCLUDE: A large-scale dataset for Indian Sign Language recognition. In Proceedings of the 28th ACM
international conference on multimedia (pp. 1366-1375).
[7] Kumar, Anand, and Ravinder Kumar. "A novel approach for ISL alphabet recognition
using Extreme Learning Machine." International Journal of Information Technology 13
(2021): 349-357.
[8] Yulius Obi, Kent Samuel Claudio, Vetri Marvel Budiman, Said Achmad, Aditya Kurniawan. "Sign language recognition system for communicating to people with disabilities." Procedia Computer Science 216 (2023): 13–20.
[9] https://data.mendeley.com/datasets/kcmpdxky7p/1
[10] https://pypi.org/project/SignLanguageRecognition/#General%20Info
[11] https://www.kaggle.com/datasets/prathumarikeri/indian-sign-language-isl
[12] https://huggingface.co/datasets/IIT-K/CISLR
[13] ISL-CSLTR: Indian Sign Language Dataset for Continuous Sign Language Translation
and Recognition - Mendeley Data
[15] Real-Time Gesture Recognition Using GOOGLE’S MediaPipe Hands — Add Your
Own Gestures [Tutorial #1] | by Vaibhav Mudgal | Medium
[16] Electronics | Free Full-Text | Deepsign: Sign Language Detection and Recognition
Using Deep Learning (mdpi.com)
[18] Sign Language Recognition Using ResNet50 Deep Neural Network Architecture by
Pulkit Rathi, Raj Kuwar Gupta, Soumya Agarwal, Anupam Shukla :: SSRN
[19] https://www.kaggle.com/datasets/ayuraj/asl-dataset
[20] https://projects.asl.ethz.ch/datasets/