
Malnad College of Engineering

(An Autonomous Institution under Visvesvaraya Technological University, Belagavi)


Hassan–573202

Signspeak: Bridging Communication Through


Deep learning
A Dissertation submitted to Malnad College of Engineering, Hassan, during the
academic year 2023-24 in partial fulfillment for the award of the degree of

Bachelor of Engineering
in
Information Science and Engineering
by
SANDEEPA T N (4MC20IS045) SATHVIK RAO (4MC20IS046)
KEERTHAN V (4MC20IS020) ANIRUDH R (4MC19IS009)

Under the Guidance of

Mrs. Shruthi D V
(Assistant Professor)
Department of ISE

Department of Information Science & Engineering


Malnad College of Engineering
Hassan-573202
Tel.:08172-245093 Fax:08172-245683 URL:www.mcehassan.ac.in
2023-24
Malnad College of Engineering
(An Autonomous Institution under Visvesvaraya Technological University, Belagavi)

Hassan – 573 202


Department of Information Science & Engineering

Certified that the Project Work (20IS802) titled


Signspeak: Bridging Communication Through
Deep learning
is a bonafide work carried out by

Sandeepa T N (4MC20IS045) Sathvik Rao (4MC20IS046)


Keerthan V (4MC20IS020) Anirudh R (4MC19IS009)

in partial fulfillment for the award of


Bachelor Degree in Information Science and Engineering
of
Malnad College of Engineering
affiliated to
Visvesvaraya Technological University, Belagavi

during the year 2023-24. It is certified that all corrections/ suggestions


indicated for Internal Assessment have been incorporated in the Project report
deposited in the Department Library. The Project Report has been approved,
as it satisfies the academic requirements in respect of Project Work prescribed
for the Bachelor of Engineering Degree.

(Mrs. Shruthi D V) (Dr. Chandrika J) (Dr. A.J. Krishnaiah)


Guide Head of the Department Principal

External Viva
Name of the Examiners Signature with Date

1.

2.
ACKNOWLEDGEMENT
We have made efforts in this project. However, it would not have been possible
without the kind support and help of many individuals and organizations. We would like to
extend our sincere thanks to all of them.

We would like to express our gratitude to our respected principal Dr. A.J. Krishnaiah
for providing a congenial environment and surroundings to work in. We would like to
express our sincere gratitude to Dr. Chandrika J, Head of the Department of Information
Science and Engineering, for her continuous support and encouragement.

We are highly indebted to Mrs. Shruthi D V for her guidance and constant
supervision as well as for providing necessary information regarding the project & also for
her support in completing the project.

We would like to express our gratitude to our parents & members of Malnad College
of Engineering for their kind co-operation and encouragement which helped us in the
completion of this project.

Our thanks and appreciation also go to our colleagues in developing the project and
the people who have willingly helped us out with their abilities.

SANDEEPA T N- 4MC20IS045

SATHVIK RAO - 4MC20IS046

KEERTHAN V - 4MC20IS020

ANIRUDH R - 4MC19IS009
ABSTRACT

Sign language recognition (SLR) is an essential application that enables


communication for individuals with hearing impairments. This project implements a system
for real-time SLR using computer vision and deep learning techniques. The system collects
data by capturing hand gestures through a webcam, extracts keypoint values using the
MediaPipe library, and preprocesses the data for training. It then builds and trains a Long
Short-Term Memory (LSTM) neural network model using TensorFlow and Keras. The
trained model is capable of recognizing a variety of sign language gestures in real-time.
Evaluation metrics such as confusion matrices and accuracy scores are employed to assess
the performance of the model. The project also includes functionality for testing the model
in real-time, allowing users to interactively communicate using sign language gestures.
Overall, this system provides a robust and effective solution for real-time sign language
recognition, facilitating communication accessibility for the hearing-impaired community.
CONTENTS

Page no.
Chapter 1: Introduction
1.1 Introduction to Sign Language 1
1.2 Potential of the Problem 2
1.3 Problem Statement 3
1.4 Objective of the present work 3
1.5 Expected Impact 4
1.6 Platform and Tools used 4
1.6.1 Tools and Technology 4
1.6.2 Integrated Development Environment 6
Chapter 2: System Analysis
2.1 Literature Survey 7

2.2 Findings of the Analysis 11


2.3 Proposed System 12
2.4 System Requirement Specification 12
2.4.1 Functional Requirements 12
2.4.2 Non-functional Requirements 13
2.4.2.1 Software Requirements 14
2.4.2.2 Hardware Requirements 15
Chapter 3: Design
3.1 Design of functions 16
3.1.1 Data flow diagrams 18
3.1.2 Algorithms intended to use 19
3.1.3 Datasets intended to use 21
3.1.3.1 Indian Sign Language Data set 21
3.2 Design of User Interface 22
Chapter 4: Implementation
4.1 Modules Implemented 23
4.2 Models Comparison 26
Chapter 5: Testing
5.1 Introduction to testing 28

5.2 Various test case scenarios considered 28

5.3 Testing and Evaluation Metrics 29

5.3.1 Loss vs Epoch Graph 29

5.3.2. One-Hot Encoded Confusion Matrix 30

5.4 Inference drawn from the test cases 31

Chapter 6: User Manual


6.1 Installation Procedure 32

6.2 Snap Shot 34

Chapter 7: Conclusion
7.1 Conclusions of the present work 36

7.2 Limitations 36

7.3 Future Scopes 37

REFERENCES 38

FIGURE

Image 1: System design 16


Image 2: Level 0 DFD 18
Image 3: Level 1 DFD 19
Image 4: LSTM Architecture 19
Image 5: Implemented Training Process 20
Image 6: Landmarks 21
Image 7: User Interface 22
Image 8: Loss VS Epoch 30
Image 9: Accuracy VS Epoch 30
Image 10: Truth VS Predicted 31
Image 11: Real time recognition 34
Image 12: Sign Speak deployed at Hugging Face 34
Image 13: Uploading File 35
Image 14: Uploaded Image Recognized 35

TABLE

Table 1: Our Dataset Model Summary 26


Table 2: Kaggle Dataset Model Summary 26
Table 3: Accuracy Achieved with Kaggle dataset 27
Table 4: Accuracy Achieved with our dataset 27
Table 5: Test case Scenarios 29
Chapter 1

INTRODUCTION
1.1 Introduction to Sign Language

Communication has perennially served as an intrinsic facet of human existence,


where the capacity to articulate one's thoughts remains foundational. Nevertheless, the realm
of communication presents formidable hurdles for individuals contending with speech
impediments, thereby necessitating their reliance on sign language as a conduit for
interaction. Sign language, a visual medium for information transmission, encompasses
numerous linguistic variations globally, inclusive of Indian Sign Language (ISL). ISL stands
out for its inherent intricacy, owing to the amalgamation of both single- and double-handed
gestural lexicons, as well as the incorporation of both static and dynamic signalling modes.

In India, a substantial demographic of approximately 19 lakhs grapples with speech


impairments, underlining an imperative need for technology capable of deftly and precisely
recognizing ISL signs and seamlessly rendering them into a human-comprehensible format.
Eminent researchers have undertaken multifaceted approaches encompassing the domains
of image and video processing, machine learning, deep learning, and sensor-driven
hardware mechanisms, with the overarching objective of engendering robust ISL
recognition systems.

The core focus of this scholarly endeavour resides in the succinct encapsulation of
technological innovations germane to ISL recognition, while simultaneously accentuating
lacunae and challenges entrenched in the current corpus of knowledge. This comprehensive
survey aspires to proffer valuable insights, which, in turn, will serve as a compass for those
navigating the landscape of knowledge dissemination and solution implementation in
addressing conundrums and contingencies with innovative predispositions.

1
1.2 Potential of the Problem
Sign Language Recognition (SLR) addresses a critical societal need by leveraging
technology to enhance communication for the deaf and hard-of-hearing communities. This
section explores the potential of the SLR problem, outlines the objectives of addressing it,
and provides a detailed description of the current problem landscape.

❖ Societal Impact:
➢ Communication Enhancement: SLR has the potential to significantly enhance
communication for individuals who rely on sign languages. By creating accurate and
efficient recognition systems, it bridges the gap between the deaf community and the
broader society.
➢ Inclusive Society: A successful SLR system contributes to building a more inclusive
society, breaking down communication barriers and fostering understanding among
diverse groups of people.
❖ Educational and Personal Empowerment:
➢ Accessible Learning: A robust SLR system serves as an accessible tool for learning
sign languages, empowering both individuals with hearing impairments and those
seeking to understand and communicate with them.
➢ Educational Equality: The system can contribute to educational equality by
providing deaf individuals with the means to participate more actively in mainstream
educational settings.
❖ Technological Advancements:
➢ Innovative Solutions: SLR necessitates the development of cutting-edge
technologies, including machine learning and computer vision, to accurately
interpret and recognize intricate sign language gestures.
➢ Real-time Processing: Advancements in SLR technology can lead to real-time
processing capabilities, enabling immediate and fluid communication without
delays.

2
1.3 Problem Statement

Sign languages, such as Indian Sign Language (ISL) serve as crucial modes of
communication for the deaf and hard-of-hearing communities. Despite their significance,
individuals who rely on sign languages face barriers in effective communication,
particularly in interactions with those unfamiliar with sign languages. This limitation
contributes to feelings of isolation, hindering social integration and access to essential
services. Developing robust Sign Language Recognition (SLR) systems can play a pivotal
role in mitigating these challenges.

1.4 Objective of the Present Work

The present work addresses the challenges inherent in SLR, focusing on Indian Sign
Language (ISL). The detailed objectives and problem description are as
follows:

❖ Develop a Comprehensive SLR System:


➢ Objective: Create a sophisticated SLR system capable of accurately recognizing a
diverse range of sign language gestures.
➢ Rationale: A comprehensive system ensures that it can be applied across various
contexts, addressing the nuanced and intricate nature of sign languages.
❖ Universal Applicability:
➢ Objective: Design the SLR system to recognize ISL, acknowledging and
accommodating the regional and cultural variations in sign languages.
➢ Rationale: Considering the diversity in sign languages globally, a universal SLR
system enhances its applicability and impact.
❖ User-Friendly Interface:
➢ Objective: Develop an intuitive and user-friendly interface for the SLR system to
enhance accessibility for both the deaf community and individuals unfamiliar with
sign languages.
➢ Rationale: Ensuring ease of use is crucial for widespread adoption, making the
technology accessible to a broader audience.
❖ Real-time Processing:

3
➢ Objective: Implement efficient algorithms and processing pipelines to ensure real-
time recognition of sign language gestures.
➢ Rationale: Real-time processing is essential for natural and seamless
communication, eliminating delays that may hinder effective interaction.
❖ Continuous Learning:
➢ Objective: Integrate machine learning capabilities into the SLR system, allowing it
to continuously improve accuracy through user interactions.
➢ Rationale: Continuous learning ensures adaptability and responsiveness, enabling
the system to enhance its performance over time.

1.5 Expected Impact

• Enhanced Communication: The successful execution of the objectives is expected to


facilitate smoother communication between the deaf community and the general
population.
• Increased Accessibility: The user-friendly interface and universal applicability aim to
provide a means of learning and using sign languages that is accessible to a wider
audience.
• Inclusive Technological Solutions: The project contributes to building a more inclusive
society by leveraging technology to break down communication barriers.
• Empowerment: Individuals with hearing impairments are empowered by offering them
a reliable tool for effective communication.
• Cultural Sensitivity: Recognizing and accounting for the diverse cultural and regional
aspects of sign languages ensures a respectful and inclusive approach in the development
of the SLR system.

1.6 Platform and Tools Used


1.6.1 Tools and Technology
• Mediapipe: Mediapipe is an open-source library developed by Google that provides
pre-trained machine learning models for various tasks, including pose detection,
hand tracking, and facial recognition. In this project, Mediapipe's holistic model is
used for detecting keypoints (landmarks) of poses and hands in real-time video

4
streams. It allows for accurate and efficient detection of sign language gestures,
forming the basis of the recognition system.

• OpenCV: OpenCV (Open Source Computer Vision Library) is a popular library for
computer vision and image processing tasks. It is used for tasks such as reading and
displaying video streams, image manipulation, and drawing on images. In this
project, OpenCV is utilized for capturing video from webcam feeds, processing
video frames, and visualizing the detected keypoints.

• TensorFlow: TensorFlow is an open-source machine learning framework developed


by Google for building and training machine learning models. It provides tools and
APIs for creating neural networks, including deep learning models. TensorFlow is
used in this project for training and deploying the sign language recognition model,
which is based on a Long Short-Term Memory (LSTM) neural network architecture.

• Gradio: Gradio is a Python library that simplifies the creation of customizable UI


components for machine learning models. It provides a user-friendly interface for
users to interact with models, making it suitable for building the real-time sign
language recognition interface. Gradio was chosen for its simplicity and ease of
integration with machine learning models.

• Pyttsx3: Pyttsx3 is a text-to-speech conversion library in Python. It allows for the


synthesis of natural-sounding speech from text input. In this project, Pyttsx3 is used
to convert the recognized sign language gestures into spoken language, providing
immediate feedback to the user. This feature enhances accessibility and usability,
especially for users who may not understand sign language (a short usage sketch follows this list).

• Python: Python is the primary programming language used for developing the Sign
Speak project. Python's simplicity, versatility, and extensive libraries make it well-
suited for tasks such as machine learning, computer vision, and natural language
processing. The majority of the project's code, including data processing, model
training, and interface development, is written in Python.
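
As an illustration of the text-to-speech step handled by Pyttsx3 above, the following is a minimal usage sketch; the gesture label passed to it is a placeholder, not the project's actual output:

```python
import pyttsx3

def speak(text: str) -> None:
    """Vocalize a recognized gesture label using the local TTS engine."""
    engine = pyttsx3.init()   # initialize the text-to-speech engine
    engine.say(text)          # queue the text for speaking
    engine.runAndWait()       # block until playback finishes

speak("hello")  # placeholder label; in SignSpeak this would be the predicted sign
```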

5
1.6.2 Integrated Development Environment (IDE):
• Visual Studio Code (VSCode): Visual Studio Code is a lightweight, open-source
code editor developed by Microsoft. It offers features such as syntax highlighting,
code completion, debugging, version control integration, and extensions support.
VSCode provides a user-friendly interface for writing, debugging, and managing
code files. It was chosen as the primary IDE for its versatility, extensive extensions
ecosystem, and ease of use.
• Jupyter Notebook: Jupyter Notebook is an open-source web application that allows
for the creation and sharing of documents containing live code, equations,
visualizations, and narrative text. It supports various programming languages,
including Python, R, and Julia.

6
CHAPTER 2

SYSTEM ANALYSIS
2.1 LITERATURE SURVEY

[1] In a previously published research paper, an Indian Sign Language (ISL)


recognition system named Mudra was introduced. This system specializes in the
categorization of dynamic signs related to the domain of banking. Mudra employs a custom
database comprising 20 signs associated with banks and a set of common signs. The dataset
encompasses 1100 video recordings of varying durations, thoughtfully captured by
volunteer students using mobile phones, recording at a brisk 40 frames per second. The
dataset is thoughtfully divided into training and testing sets following an 80:20 ratio.

For feature extraction, the system leverages the InceptionV3 convolutional neural
network (CNN) model, which encompasses a ReLU correction layer, a max-pooling layer,
and two fully connected layers. The output from this CNN model is subsequently channeled
into a Long Short-Term Memory (LSTM) network for the crucial task of symbol
classification and conversion into textual representation. Notably, LSTM distinguishes itself
by obviating the necessity for manual feature engineering, rendering it a favored choice in
comparison to other deep learning methodologies. The architecture presented in this study
demonstrates a remarkable training accuracy of 100% while maintaining a commendable
testing accuracy level of 81%.

The paper [2] delves into the innovative utilization of MediaPipe technology for real-
time recognition of hand gestures, with a particular emphasis on its applicability to sign
language, a domain where it holds immense promise for individuals grappling with hearing
impairments. The authors detail their utilization of MediaPipe's robust library, enabling the
precise prediction of a human hand's skeletal structure and intricate gestures. This precision
is realized through the integration of two key models: a palm detector and a hand landmark
model.

7
A noteworthy focal point of this study is the quest for achieving lightweight and
resource-efficient hand gesture detection, ideally suited for mobile devices. The authors
employ rigorous quantitative analysis, punctuated by comparisons to alternative methods,
notably the deployment of Support Vector Machine (SVM) algorithms. The outcome of
these comparisons showcases the exceptional prowess of their model, manifesting in an
average accuracy rate of a staggering 99% across multiple sign language datasets. In
addition, the study accentuates the cost-effectiveness, real-time responsiveness, and
adaptability of their model when confronted with diverse sign language datasets.

[3] In a recent study focusing on the realm of Indian Sign Language (ISL)
recognition, an advanced deep learning paradigm, as introduced by Deepsign, was
harnessed. This novel system capitalizes on Gated Recurrent Unit (GRU) and Long Short-
Term Memory (LSTM) architectures for the discernment of sign language gestures
embedded within video streams, subsequently converting them into their corresponding
English lexemes. The dataset, denoted as IISL2020, was meticulously curated by a cohort
of 16 participants, encompassing both male and female individuals aged between 20 and 25
years. Within this dataset, 11 distinct words were represented, amounting to a total of 1100
video samples per word. These video samples were artfully recorded using mobile devices,
under natural lighting conditions, at an impressive resolution of 1920 x 1080, and a high
frame rate of 28 frames per second. Each video segment had an average duration of 2
seconds. The model training procedure was executed on a formidable 16GB GDDR6 GPU,
with feature extraction being adeptly conducted using sampling methodologies. The salient
features derived from the video frames were further processed through a pre-trained
MobileNet coupled with InceptionResNetV2, thereby generating a feature vector for LSTM-
based predictions. To substantiate the model's performance, a robust ten-fold K-fold cross-
validation methodology was employed.

In another recent study [4], researchers introduced two distinct methodologies for


Indian Sign Language (ISL) recognition. The initial approach focuses on segregating hand
movements from depth and RGB data, comprising a repertoire of 36 static signs. Techniques
such as affine transformation and 3D construction were adroitly deployed to partition the
data, which was subsequently channeled through Convolutional Neural Networks (CNNs)
for precise classification. This approach delivered an impressive classification accuracy rate
of 98.91%. The second methodology incorporates Long Short-Term Memory (LSTM) and
convolutional kernels to categorize a dataset encompassing 10 dynamic signs, culminating

8
in a classification accuracy of 99.08%. Notably, this approach incorporates the U-Net
Architecture, obviating the requirement for an RGB-D Kinect camera, thereby enhancing
efficiency and accessibility.

[5] In their recent exploratory research within the sphere of Indian Sign Language
(ISL) recognition, a novel Convolutional Neural Network (CNN)-based ISL converter was
unveiled. This cutting-edge architecture demonstrates proficiency in classifying the 26
alphabet letters of ISL. The authors harnessed transfer learning, focusing primarily on
retraining the final layer of the pre-trained MobileNet model when employed as a classifier.
Their self-curated dataset, boasting a voluminous collection of 52,000 images, was
meticulously recorded via a 720p HD camera, with each alphabet being represented by 2000
images. To enrich the dataset's diversity, they introduced various augmentations, including
background alterations, image cropping, flips, expansions, and resizing. Hand segmentation
was adroitly executed using the GrabCut Algorithm. The overall testing accuracy of this
architectural marvel culminated at an impressive 96%.

The research paper [6] under consideration serves as a comprehensive exploration


of a real-time interactive system that intertwines the realms of gesture recognition and
phrase generation within Indian Sign Language (ISL). This intricate system adeptly
processes video data depicting ISL gestures, subsequently weaving them into coherent and
grammatically sound phrases. The dataset that fuels this study comprises an expansive
collection of 10,000 images, portraying 100 unique signs in four diverse formats, including
no filter, FAST, Canny Edge, and SIFT. A multitude of sophisticated preprocessing
techniques are meticulously applied to the dataset, and a hybrid Convolutional Neural
Network (CNN) model is painstakingly trained to discern and classify these intricate
gestures. Once a gesture is accurately identified, the corresponding label is ingeniously
employed to craft meaningful phrases, thoughtfully combining them with relevant words.

In a preceding research endeavor [7], the authors heralded a significant advancement


in the domain of Indian Sign Language (ISL) recognition. They accomplished this feat by
pioneering a large-scale open-source dataset, aptly named INCLUDE, encompassing a
staggering 263 distinct words. Their robust methodology for recognizing multiple sign
languages seamlessly integrated data augmentation and feature extraction during
preprocessing. The dataset underwent meticulous curation, featuring both horizontal and
vertical image flipping, cropping, and size alterations to introduce diversity. Feature

9
extraction was skilfully carried out through the utilization of pre-trained models, including
OpenPose, Pose Videos, and PAF Videos, which were subsequently flattened and
normalized before being routed through an array of machine-learning classifiers. The stellar
performance was exemplified when the XG-Boost algorithm outshone regular RNNs and
LSTMs. For recognition, the authors harnessed Pose and PAF videos, extracting features
with the aid of the pre-trained MobileNetV2 model. These features were subsequently
channeled through a BiLSTM architecture for classification, with the hidden states of the
LSTM cells being flattened and conveyed through a fully connected layer and softmax layer.
The result was an astounding overall accuracy rate of 85.6% on the extensive INCLUDE
dataset.

In this deep learning study [8], the focal point was the utilization of the CIFAR-10 dataset
as the cornerstone for image classification. A comprehensive preprocessing pipeline was
meticulously executed, encompassing the pivotal stages of data augmentation for enriched
dataset diversity and normalization for optimized data scaling and distribution. The core
architectural paradigm adopted was a Convolutional Neural Network (CNN), strategically
configured with two convolutional layers meticulously engineered for feature extraction and
abstraction. This was supplemented by a singularly sophisticated fully connected layer. The
culmination of this technical pursuit was characterized by the assessment of recognition
accuracy on the exacting test set, yielding a commendable performance benchmark, which
consistently approximated an 85% threshold. This outcome underscores the model's
technical acumen in effectively categorizing a diverse array of images, thereby solidifying
its pre-eminence in the specialized domain of image classification.

10
2.2 Findings of the Analysis
❖ Datasets:
➢ Most researchers used self-created datasets, suggesting the need for larger, publicly
available datasets.
➢ Data sizes varied significantly, ranging from 20 signs to 263 signs and 1,100 to
10,000 videos.
➢ Recording conditions varied, with some using mobile phones and others using high-
resolution cameras.
❖ Preprocessing Techniques:
➢ Common techniques included gray scaling, normalization, noise removal, and edge
detection.
➢ Some studies employed data augmentation, such as flipping and resampling, to
improve model generalizability.
➢ Advanced techniques like 3D reconstruction and semantic segmentation were also
explored.
❖ Architectures:
➢ Deep learning architectures dominated, with CNNs and LSTMs being the most
popular choices.
➢ Several studies employed hybrid architectures or multi-stage pipelines.
➢ Some explored classical machine learning methods like SVMs and Extreme
Machine Learning.
❖ Recognition Accuracy:
➢ Accuracy varied considerably, ranging from 80.76% to 99.08%.
➢ Higher accuracy was often associated with larger datasets, more complex
architectures, and advanced preprocessing techniques.
➢ However, factors like dataset composition and sign complexity also played a role.
❖ Overall Findings:
➢ Sign language recognition is an active research area with significant progress in
recent years.
➢ Deep learning approaches are currently the most effective, with CNNs and LSTMs
showing promising results.
➢ Preprocessing techniques and dataset composition can significantly influence
accuracy.

11
➢ Future research directions include exploring hybrid architectures, incorporating
contextual information, and developing larger, publicly available datasets.

2.3 Proposed System


The proposed Sign Language Recognition (SLR) system is designed to facilitate
accurate and efficient interpretation of sign language gestures. The system offers two modes
for sign recognition:

• Real-time gesture capture: This utilizes cameras to capture gestures in real-time, ideal
for face-to-face communication.
• Pre-recorded video upload: Users can upload pre-recorded videos containing sign
language gestures for recognition.

The system then employs a deep learning model trained on a diverse dataset for
gesture recognition. This model analyses the captured video or uploaded video frames to
identify the signs being presented. A user-friendly interface provides real-time visual or
auditory feedback, clearly conveying the recognized signs. The system integrates with
external systems through APIs, allowing for broader functionality. Privacy and ethical
considerations are addressed throughout the development process, with clear
communication of data handling policies to ensure user trust.

The SLR system undergoes extensive user testing to evaluate its usability and
effectiveness. Performance metrics are defined to measure accuracy, speed, and robustness.
Additionally, the system is designed for scalability with a modular architecture, enabling
future enhancements and adaptation to evolving needs. Comprehensive documentation and
training resources ensure user accessibility and usability, contributing to enhanced
communication and inclusivity within the deaf and hard-of-hearing community. Regular
updates and refinements based on user feedback and technological advancements are
integral to the system's ongoing improvement.

2.4 System Requirement Specification

2.4.1 Functional Requirements

• Video Input: Users can upload video files or stream real-time video from a webcam.

12
• Keypoint Detection: Detect human poses and hand gestures in video frames using
Mediapipe.
• Sign Language Recognition: Classify sign language gestures based on detected
keypoints using a machine learning model.
• User Interface: Provide an interactive interface for users to upload videos, view
predictions, and control playback.
• Auditory Feedback: Vocalize recognized sign language gestures using text-to-speech
synthesis for immediate feedback.
• Gradio Integration: Integrate with Gradio to create a web-based interface for easy
accessibility.
• Cloud Deployment: Deploy the system on a cloud platform for scalability and
reliability.
• Evaluation and Testing: Evaluate accuracy, speed, and usability through testing and
user feedback.
• Documentation and Support: Provide comprehensive documentation and user
support channels for assistance.
• Continuous Improvement: Continuously update and enhance the system based on
user feedback and technological advancements.

2.4.2 Non-functional Requirements

• Performance: The system should have low latency, providing real-time sign
language recognition with minimal delay. It should be capable of handling multiple
concurrent user requests efficiently without performance degradation.
• Accuracy: The sign language recognition model should achieve a high level of
accuracy in identifying gestures, minimizing misclassifications. The system's overall
accuracy should be consistently maintained across different environments and
conditions.
• Reliability: The system should be reliable and available, with minimal downtime or
service interruptions. It should handle errors gracefully, providing informative error
messages and recovering from failures autonomously when possible.
• Scalability: The system should be scalable, capable of handling an increasing
number of users and workload demands without sacrificing performance. It should

13
scale horizontally by adding more resources or vertically by optimizing existing
resources as needed.
• Security: The system should adhere to security best practices to protect user data and
privacy. It should implement authentication and authorization mechanisms to control
access to sensitive features and data.
• Usability: The user interface should be intuitive and user-friendly, requiring minimal
training for users to operate effectively. It should support accessibility standards to
accommodate users with disabilities, ensuring inclusivity.
• Compatibility: The system should be compatible with a wide range of web browsers,
devices, and operating systems to maximize accessibility for users. It should adhere
to web standards and guidelines to ensure consistent behaviour across different
platforms.
• Maintainability: The system should be designed with modular and well-structured
code, facilitating ease of maintenance and future enhancements. It should include
comprehensive documentation to aid developers in understanding, troubleshooting,
and extending the system.
• Performance Under Load: The system should maintain consistent performance under
varying load conditions, with the ability to handle peak loads efficiently. It should
be stress-tested to identify and mitigate performance bottlenecks before deployment.
• Legal and Regulatory Compliance: The system should comply with relevant laws,
regulations, and industry standards governing data privacy, accessibility, and usage
rights. It should include mechanisms for obtaining user consent and managing data
in accordance with applicable regulations.

2.4.2.1 SOFTWARE REQUIREMENTS

▪ Operating system: Windows, Linux, Mac, Android, iOS.

▪ VS code

▪ Opencv-python==4.9.0.80

▪ Mediapipe==0.10.10

▪ Tensorflow==2.15.0

▪ Gradio

▪ Python 3.11 or above

14
2.4.2.2 HARDWARE REQUIREMENTS
❖ Computer System:
▪ Processor: Intel Core i5 or equivalent or Ryzen 7 or more.
▪ RAM: 8 GB or more.
▪ Storage: 256 GB SSD or more.
▪ GPU
▪ Camera
❖ Android Devices:
▪ Quad-core processor or more
▪ Camera sensor
▪ Android 6.0 or later
❖ iOS Devices:
▪ A10 Fusion chip or higher
▪ High-quality camera

15
CHAPTER 3

DESIGN

3.1 Design of Functions

Image 1: System Design

16
➢ Data Collection
▪ User Performs Signs: In this step, a user performs various signs in front of a
camera.
▪ Record Video with Signs: The signs performed by the user are recorded as a
video.
▪ Extract Keypoints (Pose, Hands): A tool like MediaPipe is used to extract
keypoints from the recorded video. Keypoints are important locations on a
person's body, such as wrists, elbows, and shoulders in this case.
▪ Store Keypoints & Sign Labels: The extracted keypoints, along with labels
corresponding to the signs performed in the video, are stored in a database for
later training of the sign recognition model.

➢ Training
▪ Preprocess Data (Normalization, etc.): Before training the model, the collected
data may undergo preprocessing. This can involve normalization, scaling, or
other techniques to ensure the data is in a format suitable for the machine learning
model.
▪ Split Data (Training & Testing): The preprocessed data is divided into two sets:
training and testing. The training set is used to train the model, and the testing
set is used to evaluate the model's performance on unseen data.
▪ Train Deep Learning Model: A deep learning model, such as a Long Short-Term
Memory (LSTM) network, is trained on the training data. The model learns to
map the sequences of keypoints extracted from sign language videos to their
corresponding signs.
▪ Evaluate Model Performance: The model's performance is evaluated using the
testing data. Metrics like accuracy and confusion matrix are used to assess how
well the model can correctly classify signs from unseen videos.
▪ Save Trained Model: Once the model's performance meets the desired criteria,
the trained model is saved for deployment in the real-time prediction phase.

17
➢ Real-time Prediction
▪ User Interface (Gradio): Users interact with a web interface, likely built using
Gradio, to use the sign language recognition system.
▪ Capture Video/Webcam Stream: The Gradio interface can capture video streams
from a webcam or allow users to upload pre-recorded videos for sign recognition.
▪ Extract Keypoints (Pose, Hands): Similar to data collection, keypoints are
extracted from each frame of the captured video stream or uploaded video.
▪ Predict Sign using Trained Model: The sequence of extracted keypoints is fed to
the trained deep learning model for prediction. The model predicts the most
likely sign the user performed in the video based on the keypoint sequence.
▪ Display Predicted Sign (Text): The Gradio interface displays the text label of the
predicted sign for the user.
▪ Text-to-Speech (Optional): Optionally, the predicted sign label can be sent to a
text-to-speech service to provide spoken feedback to the user.
▪ Play Spoken Sign (Optional): If a text-to-speech service is integrated, the
synthesized speech corresponding to the predicted sign label is played on the
Gradio interface, providing auditory feedback alongside the visual prediction.

3.1.1 Data Flow Diagrams

1. Level 0 DFD:

Image 2: Level 0 DFD

18
2. Level 1 DFD:

Image 3: Level 1 DFD

3.1.2 Algorithms Intended to Use

1. LSTM IN SLR:

Image 4: LSTM Architecture

19
Image 5: Implemented Training Process

❖ Overview: LSTMs are a type of Recurrent Neural Network that excel at handling
sequential data. They possess an internal memory allowing them to learn long-term
dependencies within sequences, making them ideal for tasks like sign language
recognition where order and timing of gestures are crucial.

20
2. MediaPipe Holistic Model:

Image 6: Landmarks

❖ Overview: The MediaPipe Holistic Model tackles hand and body pose estimation in a
powerful one-two punch. For hands, it excels at real-time tracking by identifying 21
keypoints on each hand. These keypoints pinpoint crucial locations like fingertips,
knuckles, and palm base, allowing the model to understand hand gestures and posture.
In parallel, the model tracks 33 keypoints across the body, including joints like elbows,
knees, shoulders, and hips. The MediaPipe Holistic Model's efficiency and open-source
nature make it a valuable tool for developers working on innovative applications that
require real-time hand and body pose analysis.

3.1.3 Datasets Intended to Use

Sign language research and technological innovation heavily rely on diverse and
comprehensive datasets. This section offers an extensive overview of datasets associated
with both Indian Sign Language (ISL) and American Sign Language (ASL), highlighting
opportunities for creating new datasets through community involvement.

3.1.3.1 Indian Sign Language Datasets:


▪ Kaggle dataset (Kshitij Kumar and 2 collaborators):
• This dataset is about Indian Sign Language and consists of 50 words each
containing 40 videos of 20 frames each.
▪ Dataset created by our team:
• Number of Words (Classes): 10

21
• Videos per Word: 40
• Frames per Video: 30
• Total Videos: 10 words * 40 videos/word = 400 videos
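
Assuming 30 frames per video and a 258-value keypoint vector per frame (33 MediaPipe pose landmarks with visibility plus 21 landmarks per hand), the training arrays built from this dataset would have roughly the following shapes; the 258 figure is an assumption about the feature layout, not a number stated above:

```python
import numpy as np

num_words, videos_per_word, frames, features = 10, 40, 30, 258
X = np.zeros((num_words * videos_per_word, frames, features))  # keypoint sequences: (400, 30, 258)
y = np.zeros((num_words * videos_per_word, num_words))          # one-hot sign labels: (400, 10)
print(X.shape, y.shape)
```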

3.2 Design of User Interface

• User Interface Design: The interface allows users to stream live video for sign
language recognition, providing instant feedback on their signing. Gradio provides
a straightforward interface where users can upload a video. Once the video is
uploaded, the system processes it to recognize sign language gestures.
• Gesture Input: The system recognizes sign language gestures from the uploaded
video using a pre-trained model. It employs the MediaPipe library for pose detection
and hand tracking, enabling the recognition of gestures.
• Output Representation: The recognized sign language gesture is displayed as text and
spoken aloud using text-to-speech (TTS) functionality.
• Multimodal Input: The system supports video input for gesture recognition, making
it accessible for users to upload their signing samples.
• Accessibility: The spoken output provides accessibility for users with hearing
impairments.
• Feedback: Users can provide feedback using a flag button if they encounter a wrong
prediction, enabling continuous improvement of the recognition accuracy.

Image 7: User Interface

22
Chapter 4

IMPLEMENTATION

4.1 Modules implemented

1. Import and Install Dependencies:

- Import necessary libraries such as TensorFlow, OpenCV, Mediapipe, etc.

- Install required packages using pip.

2. Keypoints using MP Holistic:

- Use Mediapipe Holistic model for detecting keypoints of human body parts (pose, left
hand, right hand).

- Define functions for performing Mediapipe detection and drawing landmarks on the
frame.
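
A minimal sketch of this step is shown below. It relies only on the standard MediaPipe Holistic and drawing utilities; the helper names are illustrative rather than the project's exact code.

```python
import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic
mp_drawing = mp.solutions.drawing_utils

def mediapipe_detection(frame, model):
    """Run the Holistic model on a BGR frame and return the detection results."""
    image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # MediaPipe expects RGB input
    return model.process(image)                     # pose + left/right hand landmarks

def draw_landmarks(frame, results):
    """Overlay the detected pose and hand landmarks on the original frame."""
    mp_drawing.draw_landmarks(frame, results.pose_landmarks, mp_holistic.POSE_CONNECTIONS)
    mp_drawing.draw_landmarks(frame, results.left_hand_landmarks, mp_holistic.HAND_CONNECTIONS)
    mp_drawing.draw_landmarks(frame, results.right_hand_landmarks, mp_holistic.HAND_CONNECTIONS)
```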

3. Extract Keypoint Values

- Define a function to extract keypoints from the detection results.

- Extract keypoints for the pose, left hand, and right hand.
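
A sketch of such an extraction function is given below. The landmark counts follow MediaPipe Holistic (33 pose points with visibility, 21 points per hand); the resulting 258-value vector is an assumed feature layout.

```python
import numpy as np

def extract_keypoints(results):
    """Flatten pose and hand landmarks into one feature vector per frame."""
    pose = (np.array([[p.x, p.y, p.z, p.visibility] for p in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    lh = (np.array([[p.x, p.y, p.z] for p in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[p.x, p.y, p.z] for p in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, lh, rh])  # 132 + 63 + 63 = 258 values
```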

4. Setup Folders for Collection:

- Create directories to store collected data for each action (word) and sequence (video).

- Prompt the user to enter new words for training and create directories accordingly.

5. Collect Keypoint Values for Training and Testing:

- Use a webcam to capture video frames in real-time.

23
- Collect 30 frames per video sequence for each action (word) and store the keypoints in
numpy files.

- Display instructions to the user during data collection.
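
The folder layout and collection loop could look roughly like the sketch below. The root folder name, the action list, and the 40-sequence/30-frame counts are illustrative, and the mediapipe_detection and extract_keypoints helpers are the ones sketched under steps 2 and 3.

```python
import os
import cv2
import numpy as np
import mediapipe as mp

DATA_PATH = "MP_Data"                 # assumed root folder for collected keypoints
actions = ["hello", "thanks"]         # placeholder words entered by the user
no_sequences, sequence_length = 40, 30

cap = cv2.VideoCapture(0)
with mp.solutions.holistic.Holistic() as holistic:
    for action in actions:
        for sequence in range(no_sequences):
            os.makedirs(os.path.join(DATA_PATH, action, str(sequence)), exist_ok=True)
            for frame_num in range(sequence_length):
                ok, frame = cap.read()
                if not ok:
                    break
                results = mediapipe_detection(frame, holistic)   # helper from step 2
                keypoints = extract_keypoints(results)           # helper from step 3
                np.save(os.path.join(DATA_PATH, action, str(sequence), str(frame_num)), keypoints)
                cv2.putText(frame, f"Collecting '{action}' video {sequence}", (15, 25),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
                cv2.imshow("Data Collection", frame)
                cv2.waitKey(1)
cap.release()
cv2.destroyAllWindows()
```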

6. Preprocess Data and Create Labels and Features:

- Load the collected data and labels from the directories.

- Prepare sequences of keypoints and their corresponding labels.

- Split the data into training and testing sets.
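
A sketch of this preprocessing step, assuming the folder layout used during collection (one sub-folder per word, one sub-folder per video, one .npy file per frame):

```python
import os
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

DATA_PATH = "MP_Data"
actions = sorted(os.listdir(DATA_PATH))                  # one folder per sign label
label_map = {label: idx for idx, label in enumerate(actions)}

sequences, labels = [], []
for action in actions:
    for sequence in os.listdir(os.path.join(DATA_PATH, action)):
        window = [np.load(os.path.join(DATA_PATH, action, sequence, f"{frame_num}.npy"))
                  for frame_num in range(30)]            # 30 frames per video
        sequences.append(window)
        labels.append(label_map[action])

X = np.array(sequences)                                  # (num_videos, 30, 258)
y = to_categorical(labels).astype(int)                   # one-hot encoded labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```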

7. Build and Train LSTM Neural Network:

- Create an LSTM neural network model using Keras.

- Compile the model with Adam optimizer and categorical cross-entropy loss.

- Train the model using the training data and validate using a validation split.
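
The report does not list the exact layer configuration, so the architecture below is only a plausible sketch of an LSTM classifier over 30-frame, 258-feature sequences; it assumes the X_train/y_train split from the previous step.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

num_classes = 10                                        # e.g., the 10-word dataset

model = Sequential([
    LSTM(64, return_sequences=True, activation="relu", input_shape=(30, 258)),
    LSTM(128, return_sequences=True, activation="relu"),
    LSTM(64, return_sequences=False, activation="relu"),
    Dense(64, activation="relu"),
    Dense(32, activation="relu"),
    Dense(num_classes, activation="softmax"),           # one probability per sign
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])
history = model.fit(X_train, y_train, epochs=200, validation_split=0.1)
```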

8. Evaluation using Confusion Matrix and Accuracy:

- Evaluate the trained model using the test dataset.

- Calculate accuracy, precision, recall, and generate a confusion matrix.

- Visualize the training and validation metrics using plots.

9. Save and Load Model Weights:

- Save the trained model weights to a file for future use.

- Load the saved model weights for real-time testing.

10. Test in Real Time:

- Implement real-time sign language recognition using the trained model.

24
- Continuously capture video frames from the webcam and predict the sign language
gesture.

- Display the recognized gesture and update the prediction based on the sequence of
frames.

- Visualize the probabilities of predicted gestures.
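
A minimal real-time loop could look like the sketch below; it reuses the mediapipe_detection/extract_keypoints helpers, the trained model, and the actions label list from the earlier steps, and the 0.7 confidence threshold is an assumed value.

```python
import cv2
import numpy as np
import mediapipe as mp

sequence, threshold = [], 0.7                  # rolling window of frames; assumed confidence cutoff
cap = cv2.VideoCapture(0)
with mp.solutions.holistic.Holistic() as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = mediapipe_detection(frame, holistic)   # helpers from steps 2-3
        sequence.append(extract_keypoints(results))
        sequence = sequence[-30:]                        # keep only the last 30 frames
        if len(sequence) == 30:
            probs = model.predict(np.expand_dims(sequence, axis=0))[0]
            if probs.max() > threshold:                  # show the prediction only when confident
                cv2.putText(frame, actions[int(np.argmax(probs))], (15, 40),
                            cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        cv2.imshow("SignSpeak", frame)
        if cv2.waitKey(10) & 0xFF == ord("q"):           # press 'q' to quit
            break
cap.release()
cv2.destroyAllWindows()
```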

11. Interface Creation with Gradio:

- Utilize Gradio library to create an intuitive and user-friendly interface for real-time sign
language recognition.

- Define input and output components for the interface, allowing users to upload or record
a video and receive the predicted sign language gesture as output.

- Configure the interface with appropriate title, description, and example inputs to guide
users effectively.

12. Integration with Real-Time Recognition:

- Integrate the Gradio interface with the real-time sign language recognition system to
provide users with a seamless experience.

- Upon receiving input video from the interface, pass it through the trained model to
predict the sign language gesture.

- Display the predicted gesture as output in the interface, allowing users to visualize the
recognition result instantly.

13. Model Deployment in Hugging Face:

- Leverage the Hugging Face platform for deploying the trained sign language
recognition model.

- Prepare the model for deployment by packaging it with its configuration, tokenizer, and
necessary assets.

25
- Upload the packaged model to the Hugging Face model repository, making it accessible
to the community for inference and fine-tuning.

14. Integration with Gradio Interface:

- Integrate the deployed model from Hugging Face with the Gradio interface, enabling
users to access the model directly through the interface.

- Configure the interface to use the deployed model hosted on the Hugging Face model
hub for making predictions on input videos.

- Ensure seamless communication between the Gradio interface and the deployed model,
allowing users to experience real-time sign language recognition without any latency.

15. Scalability and Accessibility:

- Benefit from the scalability and accessibility features offered by Hugging Face,
enabling easy sharing and deployment of the sign language recognition model across
various platforms and applications.

- Facilitate collaborative development and experimentation by allowing other developers


to fine-tune, evaluate, and deploy the model further.

4.2 Models Comparison

Table 1: Our Dataset Model Summary
Table 2: Kaggle Dataset Model Summary

26
Table 3: Accuracy Achieved with Kaggle dataset

Table 4: Accuracy Achieved with our dataset

27
Chapter 5
TESTING
5.1 Introduction to testing

Testing is a critical phase in the development lifecycle, ensuring the reliability,


functionality, and performance of the system before deployment. It involves systematically
evaluating the system's behavior under different conditions to uncover defects, validate
functionality, and ensure that it meets the specified requirements.

5.2 Various test case scenarios considered

Test Case | Description | Expected Outcome | Actual Outcome | Pass/Fail
TC-01 | Model Training | Model training without errors | Model trained successfully | Pass
TC-02 | Model Evaluation | Model achieves high accuracy | Model accuracy 97.5% | Pass
TC-03 | Real Time Recognition | System accurately recognizes signs | Some misclassifications observed during testing | Pass
TC-04 | File Upload Functionality | User can upload video files | Video file uploaded successfully | Pass
TC-05 | File Processing | Uploaded files are processed accurately | File processed for SLR | Pass
TC-06 | Camera Access | System can access the camera | Camera access granted | Pass
TC-07 | Video Recording | System can record videos | Video recording completed successfully | Pass

Table 5: Test Case Scenarios

5.3 Testing and Evaluation Metrics

This section details the evaluation methods used to assess the performance of the
LSTM model for sign language recognition. We'll focus on two key metrics: Loss vs Epoch
graph and Confusion Matrix.

5.3.1. Loss vs Epoch Graph

The Loss vs Epoch graph is a fundamental tool for visualizing the training process
of the LSTM model. Here's a breakdown of its components:

Loss: As explained earlier, the loss function measures the discrepancy between the model's
predicted sign probabilities and the actual sign labels in the training data. A lower loss value
indicates better model performance.

Epoch: Represents a complete pass through the entire training dataset.

The Loss vs Epoch graph plots the training loss on the Y-axis and the number of epochs on
the X-axis. Ideally, the graph should exhibit a downward trend during training. This signifies
that the model is progressively learning from the data and minimizing its prediction errors.

Interpreting the Loss vs Epoch Graph:

Steep Descent: A rapid decrease in loss early on indicates the model is efficiently learning
the patterns in the training data.

Gradual Decrease: A slower, steady decline suggests the model is gradually improving its
accuracy.

29
Stagnation: If the loss plateaus after a certain number of epochs, it might indicate the model
has reached its learning capacity or is overfitting the training data. Techniques like early
stopping or adjusting hyperparameters can be employed to address overfitting.

Fluctuations: Minor fluctuations in the loss curve are normal and can be attributed to the
stochastic nature of the training process.

By analyzing the Loss vs Epoch graph, we can gain valuable insights into the training
progress, identify potential issues, and determine when to stop training to prevent
overfitting.

Image 8: Loss VS Epoch

Image 9: Accuracy VS Epoch

5.3.2. One-Hot Encoded Confusion Matrix:

In sign language recognition, where we have multiple possible signs, the Confusion
Matrix is typically one-hot encoded. This means each row and column represents a single
sign class. The values in the cells then represent the number of times the model predicted

30
that sign class, regardless of whether it was the actual sign or not. This one-hot encoding
provides a clearer picture of the model's performance for each individual sign.

By combining the insights from the Loss vs Epoch graph and the Confusion Matrix, we can
comprehensively evaluate the effectiveness of the LSTM model for sign language
recognition and identify areas for potential improvement.
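
A sketch of how such a matrix can be computed and visualized, assuming the trained model, the held-out test split, and the actions label list from the implementation chapter, with seaborn used for the heatmap:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = np.argmax(y_test, axis=1)                     # decode one-hot test labels
y_pred = np.argmax(model.predict(X_test), axis=1)      # predicted classes

print("Accuracy:", accuracy_score(y_true, y_pred))
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=actions, yticklabels=actions)  # sign labels on both axes
plt.xlabel("Predicted")
plt.ylabel("Truth")
plt.show()
```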

Image 10: Truth VS Predicted

5.4 Inference drawn from the test cases


Through rigorous testing, we aim to validate the accuracy, reliability, and robustness
of the ISL translation system. By identifying and addressing potential issues and
shortcomings, we can enhance the system's performance and ensure its effectiveness in
real-world scenarios. Testing also provides valuable insights into areas for improvement,
guiding future enhancements and optimizations to deliver a more seamless and user-
friendly ISL translation experience.

31
Chapter 6
USER MANUAL
6.1 Installation Procedure
Installation procedures for the tools and libraries used in the sign language recognition
project:

TensorFlow:

Installation: pip install tensorflow

TensorFlow is an open-source machine learning framework developed by Google. It


provides tools for building and training machine learning models, including neural
networks.

OpenCV:

Installation: pip install opencv-python

OpenCV (Open Source Computer Vision Library) is a popular library for computer vision
and image processing tasks. It provides functions for reading, writing, and manipulating
images and videos.

MediaPipe:

Installation: pip install mediapipe

MediaPipe is a machine learning framework for building scalable and customizable


pipelines for processing media data, such as images and videos. It includes pre-trained
models for tasks like pose detection, hand tracking, and facial recognition.

scikit-learn:

Installation: pip install scikit-learn

32
Scikit-learn is a machine learning library in Python that provides simple and efficient tools
for data analysis and modeling. It includes various algorithms for classification, regression,
clustering, and dimensionality reduction.

Matplotlib:

Installation: pip install matplotlib

Matplotlib is a plotting library for creating static, animated, and interactive visualizations in
Python. It provides a MATLAB-like interface for generating plots and charts.

Seaborn:

Installation: pip install seaborn

Seaborn is a statistical data visualization library based on Matplotlib. It provides a high-


level interface for drawing attractive and informative statistical graphics.

Gradio:

Installation: pip install gradio

Gradio is an open-source Python library that allows you to quickly create customizable UI
components for machine learning models. It provides a simple interface for building web-
based applications to interact with models.

33
6.2 Snap Shots

Image 11: Real Time recognition

Image 12: Sign Speak deployed at Hugging Face

34
Image 13: Uploading file

Image 14: Uploaded file recognised

35
Chapter 7
CONCLUSION
7.1 Conclusions of the present work
This project has successfully developed a sign language recognition system using an LSTM
model. The system leverages the power of LSTMs to capture the temporal dependencies
within sign language gestures, leading to accurate recognition of various signs. The project
addressed the critical challenge of communication between deaf and hearing communities
by providing a technology-driven solution.

Here are the key takeaways from this project:

• Effective Sign Recognition: The LSTM model demonstrated promising results in


recognizing a diverse range of signs. The system effectively learned the intricate
patterns within sign language gestures, enabling accurate translation.
• Real-World Applicability: By utilizing a cloud-based deployment platform like
Hugging Face Hub, the system offers easy accessibility and scalability. This paves
the way for real-world applications, promoting inclusivity and bridging the
communication gap between signing and non-signing communities.
• Contribution to Accessibility: This project contributes significantly to the field of
assistive technologies. The sign language recognition system empowers individuals
with hearing impairments by facilitating seamless communication and information
access.

7.2 Limitations

While the project achieved significant progress, there are some limitations to consider:

• Accuracy Variations: The accuracy of the sign language recognition system can be
impacted by various factors like lighting conditions, video quality, hand pose
variations, and potential background noise. Further training and data augmentation
techniques can be employed to improve robustness in real-world scenarios.

36
• Limited Sign Set: The current system might recognize a specific set of signs.
Expanding the training dataset with a broader range of signs from different regional
variations of sign language is crucial for wider application.
• Real-Time Performance: While the system aims for real-time processing,
computational limitations might introduce slight delays. Optimizing the model
architecture and utilizing efficient hardware can enhance real-time performance.

7.3 Future Scopes

Building upon the successes of this project, several directions offer exciting prospects for
future development:

• Advanced Model Architectures: Exploring more sophisticated deep learning


architectures like attention mechanisms or transformers could potentially improve
the model's ability to handle complex sign language sequences.
• Speaker Independence: The system can be further developed to achieve speaker
independence, meaning it can recognize signs accurately regardless of the signer's
individual hand shape or signing style. This requires a more diverse training dataset
encompassing variations in signing styles.
• Multilingual Support: Extending the system to recognize and translate sign
language into multiple spoken languages would significantly broaden its reach and
impact.

By addressing the limitations and exploring these future scopes, this project has the
potential to evolve into a robust and comprehensive sign language recognition system,
fostering inclusivity and empowering communication for all.

37
References
[1] Jayadeep, Gautham, et al. "Mudra: convolutional neural network based Indian sign
language translator for banks." 2020 4th International Conference on Intelligent Computing
and Control Systems (ICICCS). IEEE, 2020.

[2] Kavana KM, Suma NR. "Recognization of Hand Gestures Using MediaPipe Hands."
IRJETS, Volume 04 (2022): 2582-5208.

[3] Kothadiya, D., Bhatt, C., Sapariya, K., Patel, K., Gil-González, A. B., & Corchado, J.
M. (2022). Deepsign: Sign language detection and recognition using deep learning.
Electronics, 11(11), 1780.

[4] Likhar, Pratik, Neel Kamal Bhagat, and G. N. Rathna. "Deep learning methods for Indian
sign language recognition." 2020 IEEE 10th International Conference on Consumer
Electronics (ICCE-Berlin). IEEE, 2020.

[5] Gangadia, D., Chamaria, V., Doshi, V., & Gandhi, J. (2020, December). Indian sign
language interpretation and sentence formation. In 2020 IEEE Pune section international
conference (PuneCon) (pp. 71-76). IEEE.

[6] Sridhar, A., Ganesan, R. G., Kumar, P., & Khapra, M. (2020, October). Include: A large
scale dataset for indian sign language recognition. In Proceedings of the 28th ACM
international conference on multimedia (pp. 1366-1375).

[7] Kumar, Anand, and Ravinder Kumar. "A novel approach for ISL alphabet recognition
using Extreme Learning Machine." International Journal of Information Technology 13
(2021): 349-357.

[8] Yulius Obi, Kent Samuel Claudio, Vetri Marvel Budiman, Said Achmad, Aditya
Kurniawan. "Sign language recognition system for communicating to people with
disabilities." Procedia Computer Science 216 (2023): 13–20.

[9] https://data.mendeley.com/datasets/kcmpdxky7p/1

38
[10] https://pypi.org/project/SignLanguageRecognition/#General%20Info

[11] https://www.kaggle.com/datasets/prathumarikeri/indian-sign-language-isl

[12] https://huggingface.co/datasets/IIT-K/CISLR

[13] ISL-CSLTR: Indian Sign Language Dataset for Continuous Sign Language Translation
and Recognition - Mendeley Data

[14] An Exploration into Human–Computer Interaction: Hand Gesture Recognition


Management in a Challenging Environment | SN Computer Science (springer.com)

[15] Real-Time Gesture Recognition Using GOOGLE’S MediaPipe Hands — Add Your
Own Gestures [Tutorial #1] | by Vaibhav Mudgal | Medium

[16] Electronics | Free Full-Text | Deepsign: Sign Language Detection and Recognition
Using Deep Learning (mdpi.com)

[17] Isolated Word Sign Language Recognition Based on Improved SKResNet-TCN


Network (hindawi.com)

[18] Sign Language Recognition Using ResNet50 Deep Neural Network Architecture by
Pulkit Rathi, Raj Kuwar Gupta, Soumya Agarwal, Anupam Shukla :: SSRN

[19] https://www.kaggle.com/datasets/ayuraj/asl-dataset

[20] https://projects.asl.ethz.ch/datasets/

39
