
Proceedings of the Second International Conference on Automation, Computing and Renewable Systems (ICACRS-2023)
IEEE Xplore Part Number: CFP23CB5-ART; ISBN: 979-8-3503-4023-5; DOI: 10.1109/ICACRS58579.2023.10404450

A Real Time Hand Gesture Recognition for Indian Sign Language using Advanced Neural Networks

Pradeep S, Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai, India. [email protected]
Poonguzhali R, Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai, India. [email protected]
R.S. Amshavalli, Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai, India. [email protected]
Dr. Leema Nelson, Chitkara University Institute of Engineering & Technology, Chitkara University, Punjab, India. [email protected]

Abstract— Verbal exchange is one of the most important means of survival in a group. According to the World Health Organization, over 5% of people worldwide had speech impairments as of 2021, which deprives them of a fundamental requirement for human interaction. The "Sign Language" these people use in their communities carries meanings, grammar, and vocabulary that everyone else may not understand. The suggested approach focuses on developing a frame-based system that leverages deep learning, an advanced machine learning technology, to translate sign language into legible speech or text on an embedded device. The algorithm processes the gathered samples by using an integrated webcam to record hand gestures in real time. For improved performance, the model uses a dataset including one million images for each move. A 9:1 ratio has been established between the training and test sets of data.

Keywords— Hand gesture recognition, Deep Learning, Convolutional Neural Network, Background Subtraction, Indian Sign Language.

I. INTRODUCTION

Hand gesture is one of the most common ways that people communicate, and researcher interest in this nonverbal mode of communication has been high. The increasing number of persons who are deaf or hard of hearing has given it greater prominence in society. Research indicates that the hard-of-hearing and hearing groups are communicating less with each other. Approximately 50 lakh people in India have such disabilities, whereas there are only around 300 interpreters who can comprehend and translate their signs. Aside from meeting social and personal requirements, hand gestures can also be utilized as a luxury; gesture-controlled video games and home automation are two examples. Deep learning is one of the main fields gaining quick recognition globally for tasks involving pattern matching and recognition. Embedded deep learning frameworks in IoT nodes as well as mobile and wearable devices have become more and more popular in recent years. This is revolutionary for the next generation of technologies because models are trained to analyse data locally. These days, deep learning algorithms that make vision-based sensors possible are highly welcomed. This technology has cleared the path for everything from aerial robots to speed bump detection models. Two main strategies have been the foundation for many other works created to address nonverbal communication: contact-based systems and visual displays. The latter approach is more enticing and economical thanks to advancements in deep learning technology. Originally, image processing was done with Matlab, which takes much longer than OpenCV, which executes far more quickly. The suggested model combines processing methods from software and hardware, pairing the novel algorithm with a specially designed flexible processing architecture. The dataset is trained using neural networks, which aid in recognizing the presented scenarios and produce textual output as a result. Three distinct scenarios—an indoor, an outdoor, and a green background—were used to train the datasets; the datasets with green backgrounds were used for data augmentation. At last, the most likely label is predicted by this trained model. The main idea of the suggested model is shown in Figure 1 below.

Figure 1. Overview model

Advanced convolutional neural networks are employed in the construction of this prototype sign language translation system. Three main sections comprise the process: model training and estimation, data library, and architectural design. The fact that this prototype is intended for individuals who wish to acquire gesticulation rather than those with disabilities is one of its main advantages over other traditional methodologies.


This prototype pays attention to the interpreter and deciphers the signals while translating the signs. Without the aid of a qualified sign translator, non-signers can comprehend sign language with the help of the proposed prototype. The following section provides an in-depth analysis of the existing systems, followed by the suggested approach, covering the algorithm, system architecture, results, and further work.

II. LITERATURE SURVEY

Researchers have been interested in automatic speech recognition for a while now, and many studies on this issue have been conducted for the various sign languages still in use today. Kanchan Dabre and Surekha Dholay introduced a system that combines image processing methodology, visual perception technology, and neural networks to analyze video footage of the interpreter's signals and establish its attributes. After the signs are interpreted by the Haar cascade classifier, speech is synthesized from the relevant output. The entire procedure is divided into two phases: the preprocessing phase, which entails extracting frames from the video data sequences, and the interpretation layer, which uses the Haar cascade classifier to categorize the gesture. With an average accuracy of 92.68%, this system can accurately read 18 to 21 frames per second [1]. S. Gnanavel et al. proposed rapid text retrieval based on a probabilistic model; the mapping of data analytics from image to text was done using the Latent Dirichlet Allocation method, and their work highlights the efficacy of the proposed method compared with contemporary algorithms [2].

Abey Abraham and Rohini V. proposed real-time hand spelling conversion using an ANN. They created a gadget that uses an Arduino Uno board and two flex sensors to recognize hand gestures made by the wearer. The system has pre-defined conditions for the many values generated by the flex sensors, which are supplied as input from an Android app. The Integration System for Mobile module sends the flex sensor data to a cloud server, which then uses it to drive a neural network that anticipates gestures. The model has been fine-tuned to forecast the value received by the sensor at a particular point in time [3]. A software-based real-time communication system was developed using cutting-edge techniques in recent trends. The most important aspect of this notion is the capacity for an interpreter and the user to properly interact with one another through a bidirectional communication framework. The injected neural network that processes the image data first generates the output, which is then used to classify the gesture. Python is used to complete the conversion to sign, and the words are mapped in a database to generate output. With a prediction time of less than 0.000805 seconds, the system can predict around 17,600 photos with a high accuracy of 99% [4]. Meenakshi, Nikhil Gala, Nishi Intwala, and Arkav Banerjee developed a recognition system for Indian Sign Language using a CNN. This method can classify 26 ISL letters by obtaining frames from the user and translating them into the corresponding text. Initially, pre-processing and feature extraction were done on a dataset of images with various backgrounds and angles. The images are classified using a ConvNet, with MobileNet employed as the classifier. Object recognition is a crucial component of the system, and the GrabCut algorithm is used in the segmentation process. A number of images were tested to increase efficiency and accurately predict the outcomes; this model demonstrates 96% accuracy [5].

Another paper focuses on applying the latest developments in deep learning to a vision-based application. The suggested model extracts the spatiotemporal features of a video sequence: a convolutional neural network recognizes the dimensional features, and an RNN trains on the temporally extracted features [6]. One approach to ASL gesture recognition using deep learning can translate non-specific hand features into comprehensible language by identifying both global and local gestures [7]. In vision-based hand gesture identification, a CNN used specifically to recognize the hand signal and categorize it into an understandable language performs better than any other method in terms of classification accuracy [8]. For RGB gesture recognition using a CNN architecture and static images, one model considers RGB and depth values to precisely capture fine details of hand signals despite finger occlusion [9]. Data gloves, an automated interpreter for sign language, are useful for people with speech impairments because their flex sensors record the user's actions and convert them into the appropriate word with 93% accuracy [10]. Vinay Kukreja and Poonam Dhiman stressed the importance of the preprocessing layer; their results highlight the need for preprocessing, with the accuracy ratio elevated from 67% to 89% with the help of a strong preprocessing framework [11]. In research work [12], contour-based background subtraction has been implemented; it motivates us to eliminate background noise to achieve quicker and better results. In our work, background noise in the image frames is eliminated using the VGG 16 transfer learning (TL) methodology, as sketched below. The method proposed in [13] helps in training the dataset with supervised learning methods; among 76 features, around 14 are selected for testing the dataset. Selecting the potential frames helps improve the accuracy. In our proposed work, we have extracted around 1,200 potential frames from among the 43,200 frames, taken from various perspectives, that make up our dataset. Using Matlab, a hand-gesture-to-speech converter takes a quick picture of the hand, identifies the signal, and plays a recorded sound that matches the gesture using trending computer vision technologies [14].
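The paper does not detail how the VGG 16 TL step is wired up, so the following is only a minimal sketch of one plausible transfer-learning setup in Keras: a frozen ImageNet-pretrained VGG16 backbone with a small binary head that flags frames dominated by background noise. The head architecture, 224x224 input size, and keep/discard labelling are illustrative assumptions, not the authors' reported configuration.

```python
# Sketch only: one plausible "VGG 16 TL" setup, not the authors' code.
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Reuse ImageNet weights and freeze the convolutional base (transfer learning).
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

# Hypothetical head: score a frame as clean foreground vs. background noise.
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # 1 = keep frame, 0 = discard
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```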


III. NEURAL NETWORK

Neural networks are computational models that mimic the structure and functions of human brains. The terms "Artificial Neural Networks," "Recurrent Neural Networks," and "Convolutional Neural Networks" are all used in deep learning. A feed-forward neural network, also known as an artificial neural network, processes data and feeds it forward; at every layer, this ANN is made up of several perceptrons. Recurrent neural networks, or RNNs, are particularly useful for temporal problems such as language translation and sentiment analysis. In contrast to other neural networks, the RNN injects its results back into the network.

Convolutional neural networks, or CNNs, are among the best deep learning algorithms available for classifying images and processing video sequences. For this reason, the suggested model is trained with CNNs. The fundamental unit of a CNN is its filter architecture, which brings the pertinent features to light. From the image sequence, this feed-forward neural network can extract spatial features [15]; a spatial feature is the connection between the image and the arrangement of its pixels. The CNN is made up of three hidden layers and an I/O layer:

1. Input Layer: This is where we provide the input to the model. In terms of neurons, it is equal to the number of features discovered in the data.

2. Hidden Layers: The result of the input layer is transferred to the hidden layers. The building blocks are created by pooling, normalization, and convolutional connected layers at the interface between the input and output layers. Both the model and the size of the data affect the number of hidden layers.

3. Output Layer: The softmax, a logistic function for managing output, gradually receives the results from the previous layer (see the sketch below).

Figure 2. Working of CNN
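As a concrete illustration of this input/hidden/output layout, a minimal Keras sketch follows. The filter counts, the 64x64 grayscale input, and the two hidden blocks are assumptions for illustration; only the 36-way softmax is taken from the paper (26 letters plus 10 numerals, described in Section IV).

```python
# Illustrative sketch of the three-part CNN layout described above.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),               # input: grayscale frame
    layers.Conv2D(32, (3, 3), activation="relu"),  # hidden block 1
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),                   # normalization step
    layers.Conv2D(64, (3, 3), activation="relu"),  # hidden block 2
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(36, activation="softmax"),        # 26 letters + 10 numerals
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy", metrics=["accuracy"])
```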

IV. PROPOSED SYSTEM

The network architecture and workflow of the suggested system for interpreting hand gestures are included in this section; more specific details are provided in the ensuing subsections. The procedure is divided into four main stages: gathering datasets, pre-processing, training models, and testing model prediction on a user's real-time image. The gathering of data is the first and most important phase. The gathered data is pre-processed using picture pre-processing techniques such as key frame extraction, background subtraction, brightness and contrast modification, RGB-to-grayscale conversion, and feature extraction. This section then introduces the classification of the pre-processed data. A sliding-window technique and a contour-based technique are used for extracting the key frames and eliminating the background, respectively (a sketch of the sliding-window step follows below). Convolutional neural networks, an advanced deep-structured learning method, are used to classify the pre-processed images. The training dataset for this neural network system includes pictures taken from various perspectives and in various lighting conditions. After the previous stage is finished, the model receives the test data for testing.

Figure 3. Workflow
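The paper names the sliding-window technique for key-frame extraction but not its selection criterion, so the sketch below assumes a simple inter-frame difference score and keeps the most dynamic frame in each fixed-size window. The window length of 30 frames is a hypothetical value.

```python
# Sketch of sliding-window key-frame selection; the difference-based
# saliency score is an assumption, not the paper's stated criterion.
import cv2
import numpy as np

def key_frames(video_path, window=30):
    cap = cv2.VideoCapture(video_path)
    buffer, keys, prev = [], [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        score = 0.0 if prev is None else float(np.mean(cv2.absdiff(gray, prev)))
        buffer.append((score, frame))
        prev = gray
        if len(buffer) == window:                         # window is full
            keys.append(max(buffer, key=lambda t: t[0])[1])  # keep best frame
            buffer.clear()                                # slide to next window
    cap.release()
    return keys
```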


Figure 4. Architecture diagram

A. Data Collection

Training and testing data collection is the first and most important step, and analyzing and working with the data set is one of the most laborious parts of the process. A part of the INCLUDE dataset [17] is used for the proposed work. The quantity and caliber of the data sets used in the model directly correlate with its accuracy. To achieve a high level of accuracy, 43,200 pictures drawn from among the 4,287 videos of INCLUDE are used to train the proposed model. A total of 36 classes—26 alphabets and 10 numerals—with 1,200 photos each from various perspectives make up our dataset, split 9:1 between training and testing as illustrated below.
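As a concrete illustration of the 9:1 split stated in the abstract, here is a minimal sketch assuming the 36-class frame set is stored one directory per class; the root path, file format, and directory layout are hypothetical.

```python
# Sketch of a 9:1 train/test split over a class-per-folder frame set.
import pathlib
import random

DATA_DIR = pathlib.Path("isl_frames")  # hypothetical root: 36 class folders

train, test = [], []
for class_dir in sorted(DATA_DIR.iterdir()):
    images = sorted(class_dir.glob("*.png"))
    random.Random(42).shuffle(images)   # reproducible shuffle per class
    cut = int(0.9 * len(images))        # 9:1 ratio from the abstract
    train += [(p, class_dir.name) for p in images[:cut]]
    test += [(p, class_dir.name) for p in images[cut:]]

print(len(train), "training samples,", len(test), "test samples")
```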
B. Pre-Processing

In this part of the model, we identify and extract the necessary features so that it can be trained and the training process enhanced. In this stage, several image processing techniques are applied, such as filtering, grayscale conversion, and the alteration of edges and corners; background subtraction is performed as well. A sketch of these steps follows the list below.

1. Key Frame Extraction: Potential frames are grouped to provide sensible meaning, and the selection of these important key frames plays a crucial role.

2. Background Subtraction: To prepare the photos for further processing, unwanted background details are removed from the captured images.

3. Noise Reduction: By employing a Gaussian filter to remove the noise, smoothing techniques lessen both noise and discontinuity.

4. Grayscale Conversion: To make pixel operations easier, the RGB color image is transformed into grayscale.

5. Brightness Adjustment: This procedure modifies the pixel intensity values.

6. Image Scaling: Image scaling is carried out for effective image processing.
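A minimal OpenCV sketch of steps 2-6 is given below. The Otsu threshold, the largest-contour heuristic for isolating the hand, the brightness offset, and the 64x64 target size are illustrative assumptions rather than the paper's reported settings.

```python
# Sketch of pre-processing steps 2-6; parameter values are assumptions.
import cv2
import numpy as np

def preprocess(frame, size=(64, 64), brightness=30):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # step 4: grayscale
    blur = cv2.GaussianBlur(gray, (5, 5), 0)         # step 3: Gaussian denoise
    # Step 2: contour-based background removal - keep the largest contour
    # (assumed to be the hand) and mask everything else out.
    _, mask = cv2.threshold(blur, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        hand = np.zeros_like(mask)
        cv2.drawContours(hand, [max(contours, key=cv2.contourArea)],
                         -1, 255, thickness=cv2.FILLED)
        blur = cv2.bitwise_and(blur, hand)
    bright = cv2.convertScaleAbs(blur, alpha=1.0, beta=brightness)  # step 5
    return cv2.resize(bright, size)                  # step 6: scaling
```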
Figure 5. Image processing

C. Network Architecture

The novelty of the proposed work comes from the inclusion of a Multi-Layer Perceptron (MLP). The model utilizes a list of Convolution_3D operations, which can extract several frames in a single operation, and furthermore makes use of the MaxPooling_3D method. The result of these two operations is then flattened and fed into the MLP. Lastly, the probability value is calculated by activating the MLP's SoftMax regression layer, which projects the results and outputs the associated value.

1. Input Block: To create an image sequence, the recorded video is extracted and stored as images in sequential order.

2. Convolution: A 3D convolution-based network can be used to collect and refine features that correspond to both space and time.

3. Maxpooling: The next step after convolution is to reduce the spatial size of the representation in order to minimize computational requirements.

4. Softmax: The final layer anticipates the distribution over all the classes and normalizes the output vector.
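Putting these four blocks together, a minimal Keras sketch of a Conv3D, MaxPooling3D, MLP, and softmax pipeline follows; the 16-frame clip length, filter counts, and dense width are assumptions for illustration, while the 36-way output matches the dataset's classes.

```python
# Sketch of the Conv3D -> MaxPooling3D -> MLP -> softmax pipeline above.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(16, 64, 64, 1)),              # 1: 16-frame sequence
    layers.Conv3D(32, (3, 3, 3), activation="relu"),  # 2: spatio-temporal conv
    layers.MaxPooling3D((2, 2, 2)),                   # 3: shrink representation
    layers.Flatten(),                                 # flatten for the MLP
    layers.Dense(128, activation="relu"),             # MLP hidden layer
    layers.Dense(36, activation="softmax"),           # 4: class distribution
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy", metrics=["accuracy"])
```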


V. PERFORMANCE ANALYSIS

Test data that differed from the training data set was used to evaluate the application of the proposed methodology. A million image samples of various hand signals are used in the testing process, and the proposed model was updated simultaneously with all of these images to produce accurate results. The desired outcome is to receive a text that represents the input sign language translated into English; the model will predict all of the Indian Sign Language's hand gestures. The proposed system's predicted accuracy is more than 95%, which is currently considered a sufficient result for real-time interpretation.

Figure 6. Epoch-accuracy graph

TABLE I: PERFORMANCE OF KEYFRAME EXTRACTION

For each Frame Set (FS), the total frames, number of extracted frames, and compression ratio are calculated and compared with the existing approach. The compression ratio is calculated for each frame set individually, and the obtained results are depicted in Table I. The best extracted frames give better results in the subsequent knowledge framework. A small sketch of the calculation appears below.
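The paper does not state its formula for the compression ratio, so the sketch below assumes the common definition of total frames over extracted key frames; the single frame-set example reuses the dataset figures reported in Section IV (43,200 frames, 1,200 key frames).

```python
# Sketch of the Table I quantities, assuming compression ratio = total/kept.
def keyframe_stats(frame_sets):
    """frame_sets: list of (total_frames, extracted_frames) per frame set."""
    for i, (total, kept) in enumerate(frame_sets, start=1):
        ratio = total / kept
        print(f"FS{i}: {total} frames -> {kept} key frames, "
              f"compression ratio {ratio:.1f}:1")

# Hypothetical example consistent with the dataset figures above.
keyframe_stats([(43200, 1200)])
```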
VI. CONCLUSION AND FUTURE WORK

The proposed framework, which uses CNN deep structured learning techniques to identify and categorize Indian Sign Language, has achieved around 90% accuracy in interpreting ISL. We note that, thanks to its sophisticated methods, the CNN model provides the highest accuracy, and we can infer from the procedure that a CNN is a productive method for accurately classifying hand gestures. In subsequent work, we hope to develop a full application from the model software. The other focus will be to enhance the preprocessing layer, on the assumption that a strong preprocessing layer will improve the accuracy.

REFERENCES

[1] Vijayalakshmi, P., & Aarthi, M. (2016, April). Sign language to speech conversion. In 2016 International Conference on Recent Trends in Information Technology (ICRTIT) (pp. 1-6). IEEE.
[2] Gnanavel, S., Mani, V., Sreekrishna, M., Amshavalli, R. S., Gashu, Y. R., Duraimurugan, N., Rao, N. S., & Reddy, M. P. K. (2022). Rapid text retrieval and analysis supporting Latent Dirichlet Allocation based on probabilistic models. Mobile Information Systems, 2022. https://doi.org/10.1155/2022/6028739
[3] NB, M. K. (2018). Conversion of sign language into text. International Journal of Applied Engineering Research, 13(9), 7154-7161.
[4] Masood, S., Srivastava, A., Thuwal, H. C., & Ahmad, M. (2018). Real-time sign language gesture (word) recognition from video sequences using CNN and RNN. In Intelligent Engineering Informatics (pp. 623-632). Springer, Singapore.
[5] Apoorv, S., Bhowmick, S. K., & Prabha, R. S. (2020, June). Indian sign language interpreter using image processing and machine learning. In IOP Conference Series: Materials Science and Engineering (Vol. 872, No. 1, p. 012026). IOP Publishing.
[6] Kaushik, N., Rahul, V., & Kumar, K. S. (2020). A survey of approaches for sign language recognition system. International Journal of Psychosocial Rehabilitation, 24(01).
[7] Kishore, P. V. V., & Kumar, P. R. (2012). A video based Indian sign language recognition system (INSLR) using wavelet transform and fuzzy logic. International Journal of Engineering and Technology, 4(5), 537.
[8] Dixit, K., & Jalal, A. S. (2013, February). Automatic Indian sign language recognition system. In 2013 3rd IEEE International Advance Computing Conference (IACC) (pp. 883-887). IEEE.
[9] Das, A., Yadav, L., Singhal, M., Sachan, R., Goyal, H., Taparia, K., ... & Trivedi, G. (2016, December). Smart glove for sign language communications. In 2016 International Conference on Accessibility to Digital World (ICADW) (pp. 27-31). IEEE.
[10] Sruthi, R., Rao, B. V., Nagapravallika, P., Harikrishna, G., & Babu, K. N. (2018). Vision based sign language by using MATLAB. International Research Journal of Engineering and Technology (IRJET), 5(3).
[11] Kukreja, V., & Dhiman, P. (2020). A deep neural network based disease detection scheme for citrus fruits. In 2020 International Conference on Smart Electronics and Communication (ICOSEC) (pp. 97-101). IEEE. https://doi.org/10.1109/ICOSEC49089.2020.9215359
[12] Amshavalli, R. S., & Kalaivani, J. (2023). Real-time institution video data analysis using fog computing and adaptive background subtraction. Journal of Real-Time Image Processing, 20, 96. https://doi.org/10.1007/s11554-023-01350-3
[13] TR, R., Lilhore, U. K., M, P., Simaiya, S., Kaur, A., & Hamdi, M. (2022, March). Predictive analysis of heart diseases with machine learning approaches. Malaysian Journal of Computer Science, 132-148.
[14] Kumar, A., & Kumar, R. (2021). A novel approach for ISL alphabet recognition using Extreme Learning Machine. International Journal of Information Technology, 13(1), 349-357.
[15] Maraqa, M., & Abu-Zaiter, R. (2008, August). Recognition of Arabic Sign Language (ArSL) using recurrent neural networks. In 2008 First International Conference on the Applications of Digital Information and Web Technologies (ICADIWT) (pp. 478-481). IEEE.
[16] Masood, S., Srivastava, A., Thuwal, H. C., & Ahmad, M. (2018). Real-time sign language gesture (word) recognition from video sequences using CNN and RNN. In Intelligent Engineering Informatics (pp. 623-632). Springer, Singapore.
[17] Sridhar, A., & Ganesan, R. (2020). INCLUDE: A large scale dataset for Indian sign language recognition. In MM '20: Proceedings of the 28th ACM International Conference on Multimedia (pp. 1366-1375).
