Journal of Theoretical and Applied Information Technology
15th November 2022. Vol.100. No 21
© 2022 Little Lion Scientific
ISSN: 1992-8645 www.jatit.org E-ISSN: 1817-3195
DEEP LEARNING BASED SOUTH INDIAN SIGN
LANGUAGE RECOGNITION BY STACKED
AUTOENCODER MODEL AND ENSEMBLE CLASSIFIER ON
STILL IMAGES AND VIDEOS
RAMESH MANOHAR BADIGER 1, DHARMANNA LAMANI 2
1 Assistant Professor, Computer Science Department, Tontadarya College of Engineering, Gadag 582101, India
2 Associate Professor, Department of ISE, SDMIT, Ujire 574240, India
E-mail: 1 [email protected], 2 [email protected]

ABSTRACT
Recently, sign and gesture recognition has been challenged by concerns such as high computational cost, occlusion of hands, and inaccurate tracking of hand signs and gestures. The existing models face difficulty in managing longer-term sequential data, due to poor information learning and processing. To address the aforementioned concerns, a novel deep learning based ensemble model is proposed in this article. Firstly, the sign/gesture images are acquired from the American Sign Language (ASL)-Modified National Institute of Standards and Technology (MNIST) and real time South Indian Sign Language (SISL) databases. In addition, K-means clustering with the Gaussian blur method is implemented for precisely segmenting the sign/gesture region. Next, feature extraction is carried out using Gray-Level Co-occurrence Matrix (GLCM) features and AlexNet, and then the dimensionality of the extracted feature vectors is reduced using a deep learning model: the stacked autoencoder. The dimensionally reduced feature vectors are fed to the ensemble classifier (Multi-Support Vector Machine (MSVM) and Naïve Bayes) to classify 24 alphabets and 30 SISL classes on the ASL-MNIST and real time SISL databases. The extensive experiments demonstrated that the ensemble based stacked autoencoder model achieved 99.96% and 99.08% accuracy on the ASL-MNIST and real time SISL databases, which is better than that of the traditional machine learning classifiers.
Keywords: Gesture, K-means Clustering, Multi Support Vector Machine, Naïve Bayes, Sign Language
Recognition, Stacked Autoencoder
1. INTRODUCTION

Sign language is a vision based interactive language with complex and unique linguistic rules [1], and it is mainly used by people who are impaired in communicating and exchanging their thoughts, ideas and feelings using different body parts [2-3]. Sign language differs from one place to another based on its geographic location, but it has unique linguistic structures [4]. In recent decades, each nation has created its own sign languages to communicate among the deaf and dumb communities [5-6]. Manual sign/gesture recognition involves hand orientation, hand postures and hand movements [7], while non-manual sign/gesture recognition involves lip movements, eye gaze, and facial expressions. The recognition methods are generally categorized into two types: vision based methods and data glove methods [8-9]. However, the
existing models are very sensitive to lighting conditions and cannot be operated in cluttered environments [10-11]. In addition, the existing models achieve lower classification performance, due to overlapping of the head, hands, skin color, and background color [12-13]. To overcome the above-mentioned problems and to achieve better gesture/sign recognition, a novel deep learning based ensemble model is implemented in this manuscript. The main contributions are listed below:
• Acquired raw images from the ASL-MNIST and real time SISL databases; further, the Region of Interest (RoI) is segmented by using K-means clustering with the Gaussian blur method.
• After segmenting the RoI from the ASL-MNIST and real time SISL databases, feature extraction is performed using GLCM features and the AlexNet model. The semantic space
between the extracted feature sub-sets is reduced by extracting local and deep learning feature vectors, where this process helps in achieving better classification results.
• The extracted multi-dimensional feature vectors are optimally reduced by proposing a stacked autoencoder, which decreases the computational complexity and running time of the proposed system.
• The developed ensemble classifier uses the optimum feature vectors for classifying 24 alphabets and 30 SISL classes on the ASL-MNIST and real time SISL databases, and the effectiveness of the ensemble based stacked autoencoder model is tested by means of sensitivity, accuracy, F1-measure, specificity and Matthews Correlation Coefficient (MCC). The experiments conducted on the ASL-MNIST and real time SISL databases showed that the proposed model achieved 99.96% and 99.08% accuracy.
This article is organized as follows: manuscripts related to sign language and gesture recognition are surveyed in Section 2. The theoretical description, simulation results, and conclusion of the ensemble based stacked autoencoder model are presented in Sections 3, 4, and 5, respectively.
2. RELATED WORKS
Subramanian, et al. [14] introduced a new
Media-Pipe-Optimized Gated Recurrent Unit
(MOPGRU) algorithm for sign detection. As
shown in the results, the implemented MOPGRU algorithm obtained high learning efficiency, prediction accuracy, fast convergence and information processing capability relative to existing sequential algorithms. However, the implemented model was computationally expensive, because it required higher-end graphics processing units for achieving better classification results. Gangrade and Bharti [15] used Gaussian
blur for decreasing the noise from the acquired gray
sign images, and then, the hand segmentation was
accomplished utilizing the background subtraction
technique. Finally, the hand sign detection was
performed by implementing the Convolutional
Neural Network (CNN) model. The simulation
investigation showed that the presented model
efficiently detects ISL alphabet in the real time
database with a high detection rate. The presented model works well with static ISL signs, but it does not manage continuous and dynamic signs.
Kumar and Kumar [16] used Histogram of Oriented Gradients (HOG) and Extreme Learning Machine (ELM) for feature extraction and hand sign recognition. Still, the developed model needs to be extended for recognizing dynamic ISL signs.
Katoch, et al. [17] used background subtraction and skin color techniques to segment sign regions from the collected images. Further, Speeded Up Robust Features (SURF) and a bag of visual words were used for feature extraction, and then the sign classification was accomplished using hybrid classifiers: CNN and SVM; however, the presented hybrid model was computationally costly. Wadhawan and Kumar [18] implemented a deep learning based CNN model
for robust ISL recognition. The effectiveness of
the developed model was tested utilizing
performance metrics like recall, F1-score and
precision. The implemented deep learning based CNN model was computationally complex, as it needed an enormous number of sign images to attain superior classification performance. Additionally, Badhe and Kulkarni
[19] integrated Otsu thresholding and
background subtraction techniques for gesture
segmentation. Then, the handcrafted features
were utilized along with Artificial Neural
Networks (ANNs) for gesture classification. As
noted in its future work, factors like occlusions, lighting conditions, and background variations affect the presented model's classification effectiveness.
In addition, Roy, et al. [20] integrated the hidden Markov model and the CamShift tracker for effective gesture recognition. As a future extension, the classification performance can be further improved by combining the developed model with other deep learning classifiers.
Additionally, Xiao, et al. [21] used Capsule Networks (CapsNet) for alphabetic letter and sign language digit recognition. The CapsNet model achieved higher classification accuracy in ISL recognition, but it was computationally complex. Mannan et al.
[22] used a hyper-tuned deep CNN model for
sign recognition. The conducted experiments
confirmed that the deep CNN model obtained
higher accuracy compared to the existing state-
of-the-art methods. Correspondingly, Fregoso, et al. [23] integrated the Particle Swarm Optimizer (PSO) and CNN for feature optimization and sign language recognition. As stated earlier, the CNN model was computationally costly while experimenting on larger databases. To address the above-stated issues, a new deep learning based ensemble model is proposed in the current manuscript.
3. METHODOLOGY

In hand sign/gesture recognition, the proposed system consists of five phases: image acquisition (ASL-MNIST and real time SISL databases), sign/gesture segmentation (K-means clustering with the Gaussian blur method), feature extraction (GLCM features with the AlexNet model), feature optimization (stacked autoencoder model), and sign/gesture detection (ensemble classifier combining MSVM and Naïve Bayes). The block diagram of the proposed system is shown in figure 1.

Figure 1: Block-diagram of the proposed system
3.1. Image acquisition
In this manuscript, the proposed ensemble model's effectiveness is tested on two databases. The ASL-MNIST database includes 34627 gray-scale sign images with a pixel size of 28 × 28. The ASL-MNIST database consists of 24 labeled classes in the range from zero to twenty-five, where classes nine and twenty-five (alphabets J and Z) are eliminated because these signs involve gestural movement. The statistical description of the ASL-MNIST database is stated in table 1, and the sample images are represented in figure 2.

Table 1: The statistical description of the ASL-MNIST database

Name              ASL-MNIST details
Database format   Comma Separated Values (CSV) file
Image size        28 × 28
Training images   27701
Testing images    6926
Total images      34627

Database link: https://www.kaggle.com/datasets/datamunge/sign-language-mnist
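Given the CSV layout in Table 1, each sample can be read as a label followed by 784 pixel values. A minimal sketch of unpacking such rows into 28 × 28 images follows; the label-then-pixels column order is assumed from the standard Sign Language MNIST release, and the demo rows are synthetic:

```python
import numpy as np

def rows_to_images(rows: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split ASL-MNIST style rows (label + 784 pixel columns) into
    integer labels and 28x28 grayscale images."""
    labels = rows[:, 0].astype(int)
    images = rows[:, 1:].reshape(-1, 28, 28).astype(np.uint8)
    return labels, images

# Synthetic stand-in for two CSV rows: a label column plus 784 pixel columns.
demo = np.hstack([np.array([[3], [7]]), np.random.randint(0, 256, (2, 784))])
labels, images = rows_to_images(demo)
print(labels.shape, images.shape)  # (2,) (2, 28, 28)
```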
Figure 2: Sample images of ASL-MNIST database
In the real time SISL database, the images are captured utilizing smart-phone cameras. The parameter specifications followed during image acquisition are given as follows: mobile type: Samsung A51, light: normal day, image pixel size: 1080 × 2400 and focal length: f/2.0 aperture. Around 2000 sign images are captured in the real time SISL database, which belong to thirty SISL hand signs of the Kannada, Telugu and Tamil languages. The sample images of the real time SISL database are represented in figure 3.

Figure 3: Sample sign images of the real time SISL database

3.2. Sign/gesture segmentation

After acquiring images from the ASL-MNIST and real time SISL databases, the sign/gesture segmentation is accomplished by using K-means clustering. Initially, the acquired images are partitioned into k disjoint clusters or groups. Further, the k centroids are computed, and then the clusters with the nearest centroids are identified from the data points. In the K-means clustering technique, the Euclidean distance is used for determining the distance to the nearest centroids, where each cluster is determined by its member objects and its centroid. The steps involved in K-means clustering are listed as follows:

1st step: Initialize the clusters and centroids k [24].

2nd step: Compute the Euclidean distance d between the image pixels and the centroids s_k using equation (1).

d = ||pixels(x, y) − s_k||   (1)

3rd step: Based on the Euclidean distance d, the pixel values are assigned to the nearest centroids.

4th step: After assigning all pixel values, the position of the centroids is recomputed using equation (2).

s_k = (1/|s_k|) Σ_{y∈s_k} Σ_{x∈s_k} pixels(x, y)   (2)

5th step: The preceding steps are repeated until the tolerance or error value is satisfied. Then, the Gaussian blur method is used for decreasing the noise in the segmented grayscale images, which helps to obtain better classification results.

3.3. Feature extraction

After segmenting the sign/gesture regions, AlexNet and the GLCM features are applied for extracting feature vectors. Initially, the AlexNet model extracts deep feature vectors from the segmented sign/gesture regions, where it consists of 8 pre-defined layers: 5 convolutional and 3 fully-connected layers. These layers comprise two important functions: max-pooling and the leaky Rectified Linear Unit (ReLU) activation function. The AlexNet model extracts 392 deep feature vectors from the segmented sign/gesture regions [25-27].
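The segmentation procedure of Section 3.2 (K-means steps 1-5 followed by Gaussian blur) can be sketched in NumPy as follows; the cluster count, blur width and toy image are illustrative, not the paper's settings:

```python
import numpy as np

def kmeans_segment(img: np.ndarray, k: int = 2, iters: int = 20) -> np.ndarray:
    """Steps 1-5 of Section 3.2: cluster pixel intensities with K-means
    and return the per-pixel cluster label map."""
    pixels = img.reshape(-1, 1).astype(float)
    rng = np.random.default_rng(0)
    centroids = pixels[rng.choice(len(pixels), k, replace=False)]   # step 1
    for _ in range(iters):
        d = np.abs(pixels - centroids.T)              # step 2: distances
        labels = d.argmin(axis=1)                     # step 3: nearest centroid
        new = np.array([pixels[labels == j].mean(axis=0)            # step 4
                        if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):               # step 5: tolerance met
            break
        centroids = new
    return labels.reshape(img.shape)

def gaussian_blur(img: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Separable Gaussian blur used to denoise the segmented image."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    kern = np.exp(-x ** 2 / (2 * sigma ** 2))
    kern /= kern.sum()
    out = np.apply_along_axis(lambda m: np.convolve(m, kern, mode="same"),
                              0, img.astype(float))
    return np.apply_along_axis(lambda m: np.convolve(m, kern, mode="same"),
                               1, out)

img = np.zeros((28, 28))
img[8:20, 8:20] = 255            # toy bright "hand" region on a dark background
mask = kmeans_segment(img, k=2)  # two clusters: sign region vs background
smooth = gaussian_blur(mask.astype(float))
```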
Additionally, the GLCM features include
21 techniques for feature extraction such as
difference variance, the sum of squares, inverse
difference moment normalized, correlations,
sum average, maximum probability, difference
entropy, dissimilarity, inverse difference,
contrast, cluster prominence, variance,
homogeneity, information measure of
correlation, sum entropy, inverse difference
normalized, cluster shades, autocorrelations,
entropy, energy, and the sum variance [28-29].
Around 1819 feature vectors are extracted by
applying 21 GLCM techniques. Then, the
feature level fusion technique is employed to
combine 392 deep feature vectors, and 1819
GLCM feature vectors.
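A minimal sketch of the GLCM stage and the feature-level fusion described above; only four of the 21 listed GLCM descriptors are computed here, and the 392-dimensional deep vector is a random stand-in for the AlexNet output:

```python
import numpy as np

def glcm(img: np.ndarray, levels: int = 8, dx: int = 1, dy: int = 0) -> np.ndarray:
    """Normalized gray-level co-occurrence matrix for one pixel offset."""
    q = (img.astype(float) * levels / (img.max() + 1)).astype(int)  # quantize
    m = np.zeros((levels, levels))
    h, w = q.shape
    for y in range(h - dy):
        for x in range(w - dx):
            m[q[y, x], q[y + dy, x + dx]] += 1
    return m / m.sum()

def glcm_features(p: np.ndarray) -> np.ndarray:
    """Four of the 21 GLCM descriptors named above:
    contrast, energy, homogeneity and entropy."""
    i, j = np.indices(p.shape)
    contrast = np.sum(p * (i - j) ** 2)
    energy = np.sum(p ** 2)
    homogeneity = np.sum(p / (1.0 + np.abs(i - j)))
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return np.array([contrast, energy, homogeneity, entropy])

rng = np.random.default_rng(1)
img = rng.integers(0, 256, (28, 28))
deep = rng.normal(size=392)           # stand-in for the AlexNet deep vectors
fused = np.concatenate([deep, glcm_features(glcm(img))])  # feature-level fusion
```

In the full system the GLCM side would contribute 1819 values from all 21 descriptors, so the fused vector would have 2211 entries rather than the 396 shown here.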
3.4. Feature optimization

After extracting the feature vectors, dimensionality reduction is performed utilizing a stacked autoencoder model, which performs superiorly in feature dimensionality reduction compared to the traditional models. The stacked autoencoder model is a feed forward neural network that consists of an input layer, numerous hidden layers, and an output layer, which are detailed in equations (3) and (4).

Z^(l) = y^(l−1) × W^(l) + b^(l)   (3)

y^(l) = g(Z^(l))   (4)
Where g(·) specifies a non-linear activation function, W^(l) ∈ R^(ni×n0) states the matrix of learnable weights with biases b^(l), y^(L) denotes the final layer output, y^(l−1) indicates the output of the previous layer l−1 and the input of the present layer l, Z^(l) represents the pre-activation vector of the l-th layer, and l ∈ [1, …, L] indexes the layers. In this model, the ReLU is applied as the activation function, which significantly improves the model's learning rate and computational effectiveness for better feature dimensionality reduction. In addition to this, the softmax non-linearity function is utilized for obtaining a probability interpretation in the output layer, and it is mathematically depicted in equation (5).

Softmax(Z^(L))_k = exp(Z_k) / Σ_{k=1…K} exp(Z_k)   (5)

Where K represents the number of output classes. In the stacked autoencoder model, the cross entropy loss function Cr is employed for dealing with the optimization problem, which is mathematically mentioned in equation (6).

Cr = − Σ_{k=1…K} ŷ_k log(y_k^(L))   (6)

Where ŷ_k ∈ {0,1}^K denotes the encoded labels and y^(L) states the model's output. In this article, the deep learning model: stacked autoencoder is used for learning higher dimensional feature vectors. Additionally, the mathematical formula of a stacked autoencoder model with hidden layers is determined in equation (7) [30-31].

h_e = a1(W_e x) and x̂ = a2(W_d h_e)   (7)

Where W_d and W_e are matrices which denote a linear combination of the inputs for decoding and encoding, x̂ indicates the reconstructed feature vectors, and x specifies the input feature vectors. In addition, h_e indicates the bottleneck layer that holds the low dimensional representation of the feature vectors, and a1 and a2 represent constant values. The hyper-parameter settings of the stacked autoencoder model are listed as follows: maximum iterations is 100, the number of hidden layers is 100, L2 weight regularization is 0.4, sparsity regularization is 4, and sparsity proportion is 0.150.

3.5. Sign/gesture recognition

The optimized feature values are fed as the input to the ensemble classifier for classifying hand signs and gestures. The ensemble classifier integrates the MSVM and Naïve Bayes classifiers, and then the best outcomes are voted using weighted voting. The Naïve Bayes performs sign/gesture classification based on the maximum-a-posteriori decision rule, and it is accomplished with an existing probability function pr and a Gaussian function, as stated in equations (8) and (9) [32-33].

Pr(f_1, f_2, …, f_n | C) = Π_{i=1…n} pr(f_i | C)   (8)

Pr(f_i | C_i) = pr(C_i | f) × pr(f) / pr(C_i)   (9)

Further, the association probability is applied for classifying the test data C, which is mathematically indicated in equation (10).

C_n = argmax_t pr(C_t) Π_{i=1…n} pr(f_i | C_t), where t = 1, 2, …   (10)

The MSVM comprises two techniques, One-against-All (1-a-a) and One-against-One (1-a-1), for sign/gesture classification. First, the 1-a-a technique creates a binary classification method for every class, which effectively distinguishes the objects in the same classes, and the result of the i-th class in the 1-a-a technique is compared with the 1-a-1 technique for achieving the highest output value. In addition, the MSVM classifier generates all possible two-class classification methods from the training sets of the i-th class, but it trains only two out of the i classes, which results in i × (i − 1)/2 classifiers. The mathematical illustration of the MSVM is stated in equations (11)-(13) [34-35].

min Φ(E, ξ) = 1/2 Σ_{m=1…o} (E_m)² + C Σ_{i=1…o} Σ_{m≠y_i} ξ_i^m   (11)

The optimized 198 feature vectors are given to the ensemble classifier for sign/gesture recognition.
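The feature-optimization stage of Section 3.4 (equations (3) and (4)) maps the fused 2211 features (392 AlexNet + 1819 GLCM) down to 198-dimensional codes. A minimal NumPy sketch of the encoder pass follows; the single 100-unit hidden layer and the random, untrained weights are purely illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def layer(y_prev, W, b):
    """Equations (3)-(4): Z(l) = y(l-1) W(l) + b(l), then y(l) = g(Z(l))."""
    return relu(y_prev @ W + b)

rng = np.random.default_rng(0)
sizes = [2211, 100, 198]   # fused input -> hidden -> 198-D bottleneck (assumed widths)
Ws = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(sizes, sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]

x = rng.normal(size=(5, 2211))   # five fused GLCM+AlexNet feature vectors
y = x
for W, b in zip(Ws, bs):         # stacked encoder pass down to the bottleneck
    y = layer(y, W, b)
print(y.shape)                   # (5, 198)
```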
(E_{y_i} × x_i) + U_{y_i} ≥ (E_m × x_i) + U_m + 2 − ξ_i^m   (12)

ξ_i^m ≥ 0, i = 1, 2, 3, …, o, m, y_i ∈ {1, 2, 3, …, k}, m ≠ y_i   (13)

The decision function in the MSVM is mathematically determined in equation (14).

df(x) = argmax_i [(E_i × x) + U_i], i = 1, 2, 3, …, k   (14)

Where C indicates the cost constant, o specifies the number of training data points, ξ_i^m states the slack variables, y_i states the class of the training data vector x_i, and k denotes the number of classes.
The hyper-parameters of the ensemble classifier are specified as follows: criterion is Gini, splitter is best, minimum samples split is two, maximum depth is none, minimum samples leaf is one, the degree in the kernel function is three, the tolerance of the termination criteria is 0.1, the cost factor is five, and the kernel function is linear. The experimental results of the proposed model on the ASL-MNIST and real time SISL databases are specified in the next section.

4. SIMULATION RESULTS

In this manuscript, the ensemble based stacked autoencoder model's efficacy is analyzed utilizing the Matlab 2020 software environment and validated on a computer with the configuration: Intel Core i9 processor, 4 TB hard disk, 8 GB random access memory and Windows 10 (64-bit) operating system. In this research, the ensemble based stacked autoencoder model's efficacy is investigated using evaluation measures like sensitivity, MCC, accuracy, F1-measure and specificity. The measures sensitivity and specificity identify the features of the sign/gesture and background regions. Accuracy is an important evaluation measure in sign/gesture recognition, because it finds how close the obtained results are to the true values. In addition, the parametric value of the MCC lies between zero and one, where the ensemble based stacked autoencoder model is effective in sign/gesture classification when the parametric value is one. The F1-measure is the harmonic mean of the precision and sensitivity values. The mathematical depiction of the undertaken evaluation metrics is specified in equations (15)-(19).

Sensitivity = TP / (TP + FN) × 100   (15)

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)) × 100   (16)

Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100   (17)

F1-measure = 2TP / (2TP + FP + FN) × 100   (18)

Specificity = TN / (TN + FP) × 100   (19)

Where TP, TN, FP, and FN state the true positive, true negative, false positive and false negative values.

4.1. Quantitative evaluation

In this section, the ensemble based stacked autoencoder model's efficacy is evaluated on the ASL-MNIST database in light of sensitivity, MCC, accuracy, F1-measure and specificity. As represented earlier, the ASL-MNIST database includes 34627 gray-scale sign images with a pixel size of 28 × 28, and it has 24 labeled classes. The experimental results of the ensemble based stacked autoencoder model on the ASL-MNIST database are represented in table 2. As shown in table 2, the experimental analysis is performed with various classifiers (Ensemble, Naïve Bayes, MSVM and SVM) and optimizers (firefly optimizer, reliefF, infinite, and stacked autoencoder). Compared to the other combinations, the combination of the Ensemble classifier with the stacked autoencoder model obtained the highest classification results, with a sensitivity of 99.98%, MCC of 99.95%, accuracy of 99.96%, F1-measure of 99.80%, and specificity of 99.82% on the ASL-MNIST database. The graphical presentation of the ensemble based stacked autoencoder model on the ASL-MNIST database is represented in figure 4. In this manuscript, the stacked autoencoder significantly optimizes the dimensions of the extracted feature vectors, i.e., it selects the optimum relevant feature vectors. The incorporation of the stacked autoencoder model in the proposed system effectively reduces the running time and computational complexity.
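Equations (15)-(19) can be computed directly from the confusion counts; a short sketch with toy counts, not taken from the paper's experiments:

```python
import math

def metrics(tp, tn, fp, fn):
    """Equations (15)-(19), each scaled to a percentage."""
    sens = tp / (tp + fn) * 100
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) * 100
    acc = (tp + tn) / (tp + tn + fp + fn) * 100
    f1 = 2 * tp / (2 * tp + fp + fn) * 100
    spec = tn / (tn + fp) * 100
    return sens, mcc, acc, f1, spec

# Toy confusion counts for a single class.
print(metrics(95, 90, 5, 10))
```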
Table 2: Experimental results of the ensemble based stacked autoencoder model on the ASL-MNIST database

Classifier    Optimizer    Sensitivity (%)  MCC (%)  Accuracy (%)  F1-measure (%)  Specificity (%)
Naïve Bayes   Firefly      90.76            90.55    90.30         90.88           90.82
Naïve Bayes   Infinite     92.78            92.20    91.76         91.27           92.08
Naïve Bayes   ReliefF      94.80            93.80    92.85         92.77           92.10
Naïve Bayes   Autoencoder  95.78            94.72    93.50         92.95           93.80
SVM           Firefly      92.90            94.22    93.68         93.13           94.44
SVM           Infinite     94.85            95.91    94.46         94.32           94.60
SVM           ReliefF      96.78            96.46    94.85         95.99           96.96
SVM           Autoencoder  96.90            96.85    95.60         96.78           97.94
MSVM          Firefly      93.98            94.28    94.25         95.07           96.38
MSVM          Infinite     95.34            95.46    95.90         96.46           97.38
MSVM          ReliefF      96.96            97.56    96.56         97.63           98.52
MSVM          Autoencoder  97.72            98.90    97.58         98.64           98.82
Ensemble      Firefly      97.60            97.85    98.10         96.40           97.90
Ensemble      Infinite     98.90            98.80    98.28         97.94           98.18
Ensemble      ReliefF      99.12            99.48    99.18         98.86           99.40
Ensemble      Autoencoder  99.98            99.95    99.96         99.80           99.82
Figure 4: Graphical validation of the ensemble based stacked autoencoder model on the ASL-MNIST database
Similar to table 2, the experimental results of the ensemble based stacked autoencoder model on the real time SISL database are given in table 3. The real time SISL database has 2000 sign images with a pixel size of 1080 × 2400. As denoted in table 3, the combination of the ensemble classifier with the stacked autoencoder model attained the maximum classification performance with an 80:20 training-testing split of the data, which is better than the other training percentages. By performing cross-validation, the computational time, variance, and bias of the ensemble based stacked autoencoder model are considerably reduced. Further, the ensemble based stacked autoencoder model obtained 98.72% sensitivity, 98.40% MCC, 99.08% accuracy, 98.68% F1-measure, and 98.24% specificity on the real time SISL database. The graphical validation of the ensemble based stacked autoencoder model on the real time SISL database is represented in figure 5. Compared to the individual classifiers, the ensemble classifier makes superior predictions and achieves better classification performance. In addition, the ensemble classifier effectively decreases the dispersion of the model performance.
Table 3: Experimental results of the ensemble based stacked autoencoder model on the real time SISL database

Classifier    Optimizer    Sensitivity (%)  MCC (%)  Accuracy (%)  F1-measure (%)  Specificity (%)
Naïve Bayes   Firefly      86.80            90.98    92.06         88.32           90.40
Naïve Bayes   Infinite     88.72            93.18    93.26         92.24           91.82
Naïve Bayes   ReliefF      90.98            94.36    94.54         92.88           91.92
Naïve Bayes   Autoencoder  94.84            95.72    95.87         94.38           92.84
SVM           Firefly      92.46            88.78    90.85         91.92           91.44
SVM           Infinite     94.86            91.98    94.56         93.12           92.60
SVM           ReliefF      95.14            92.84    94.90         94.44           93.14
SVM           Autoencoder  95.38            93.62    95.42         95.36           94.98
MSVM          Firefly      88.48            94.48    93.88         92.80           94.04
MSVM          Infinite     93.27            95.06    94.18         94.86           95.44
MSVM          ReliefF      94.18            95.38    95.82         95.88           96.86
MSVM          Autoencoder  96.45            97.90    96.18         96.30           97.82
Ensemble      Firefly      95.94            94.92    95.26         92.92           96.90
Ensemble      Infinite     96.34            95.68    96.68         95.94           97.06
Ensemble      ReliefF      97.26            97.38    97.42         96.96           97.44
Ensemble      Autoencoder  98.72            98.40    99.08         98.68           98.24
Figure 5: Graphical validation of the ensemble based stacked autoencoder model on the real time SISL database
4.2. Comparative evaluation

In this section, a comparative evaluation between the prior models and the proposed ensemble based stacked autoencoder model is specified in table 4 and figure 6. Mannan, et al. [22] implemented a hyper-tuned deep CNN model for sign language recognition. The experiments conducted on the ASL-MNIST database demonstrated that the implemented model achieved 99.67% recognition accuracy. Fregoso et al. [23] integrated the PSO and CNN model for dimensionality reduction and sign language detection. The developed model achieved 99.80% recognition accuracy on the ASL-MNIST database. Compared to the existing research manuscripts, the ensemble based stacked autoencoder model achieved significant classification performance with a recognition accuracy of 99.96% on the ASL-MNIST database.

As stated in the methodology section, feature optimization and sign/gesture classification are the two integral phases of this research, where the extracted higher dimensional features are effectively optimized by a deep learning model: the stacked autoencoder. The dimensionality reduction diminishes the computational complexity of the proposed system to linear in the order of magnitude of the input size. Further, the running time of the ensemble based stacked autoencoder model is 30 and 54.1 seconds on the ASL-MNIST and real time SISL databases, which is lower compared to the state-of-the-art methods. As represented in the literature section, the major problems of computational cost and complexity are effectively decreased by implementing the ensemble based stacked autoencoder model.
Table 4: Comparative evaluation between the prior models and the proposed ensemble based stacked autoencoder model

Model                                Recognition accuracy (%)
Hyper-tuned deep CNN [22]            99.67
PSO-CNN II [23]                      99.80
Ensemble based stacked autoencoder   99.96
Figure 6: Graphical comparison between the prior models and the proposed ensemble based stacked autoencoder model

5. CONCLUSION

The ensemble based stacked autoencoder model is implemented in this manuscript for effective sign/gesture recognition. The developed model includes four important steps. Firstly, the sign/gesture regions are segmented from the ASL-MNIST and real time SISL databases using K-means clustering with the Gaussian blur technique. Then, the discriminative feature vectors are extracted by implementing GLCM features and AlexNet, and these are further dimensionally reduced using a stacked autoencoder model; this action helps in reducing computational complexity and running time. Further, the selected features are classified by proposing an ensemble classifier (Naïve Bayes and MSVM), which classifies 24 alphabetical and 30 Indian sign classes on the ASL-MNIST and real time SISL databases. In the quantitative evaluation phase, the undertaken evaluation measures like sensitivity, MCC, accuracy, F1-measure and specificity demonstrated the effectiveness of the ensemble based stacked autoencoder model. The developed model achieved 99.96% and 99.08% classification accuracy on the ASL-MNIST and real time SISL databases. Further, the ensemble based stacked autoencoder model has shown superior performance in terms of computational complexity and running time. In real time sign/gesture recognition, the proposed model fails to meet the requirements, especially in the grammatical aspects of continuous signs. Therefore, as a future extension, a novel deep learning classifier can be incorporated with the ensemble based stacked autoencoder model to further improve sign/gesture recognition on large unstructured databases.
REFERENCES:
[1] J. Joy, K. Balakrishnan, and M. Sreeraj, "SignQuiz: a quiz based tool for learning fingerspelled signs in Indian sign language using ASLR", IEEE Access, Vol. 7, 2019, pp. 28363-28371.
[2] T. Raghuveera, R. Deepthi, R. Mangalashri, and R. Akshaya, "A depth-based Indian sign language recognition using microsoft Kinect", Sādhanā, Vol. 45, No. 1, 2020, pp. 1-13.
[3] J. Gangrade, J. Bharti, and A. Mulye, "Recognition of Indian sign language using ORB with bag of visual words by Kinect sensor", IETE Journal of Research, 2020, pp. 1-15.
[4] R. Gupta, and A. Kumar, "Indian sign language recognition using wearable sensors and multi-label classification", Computers & Electrical Engineering, Vol. 90, 2021, pp. 106898.
[5] R. Gupta, and S. Rajan, "Comparative analysis of convolution neural network models for continuous Indian sign language classification", Procedia Computer Science, Vol. 171, 2020, pp. 1542-1550.
[6] S.G. Praveena, and C. Jayasri, "Recognition and Translation of Indian Sign Language for Deaf and Dumb People", International Journal of Information and Computing Science, Vol. 6, 2019.
[7] G.A. Rao, and P.V.V. Kishore, "Selfie video based continuous Indian sign language recognition system", Ain Shams Engineering Journal, Vol. 9, No. 4, 2018, pp. 1929-1939.
[8] N. Aloysius, and M. Geetha, "Understanding vision-based continuous sign language recognition", Multimedia Tools and Applications, Vol. 79, No. 31, 2020, pp. 22177-22209.
[9] D.A. Kumar, A.S.C.S. Sastry, P.V.V. Kishore, and E.K. Kumar, "3D sign language recognition using spatio temporal graph kernels", Journal of King Saud University-Computer and Information Sciences, 2018.
[10] G.A. Rao, and P.V.V. Kishore, "Selfie sign language recognition with multiple features on adaboost multilabel multiclass classifier", Journal of Engineering Science and Technology, Vol. 13, No. 8, 2018, pp. 2352-2368.
[11] M. Jebali, A. Dakhli, and M. Jemni, "Vision-based continuous sign language recognition using multimodal sensor fusion", Evolving Systems, Vol. 12, No. 4, 2021, pp. 1031-1044.
[12] N. Krishnaraj, M.G. Kavitha, T. Jayasankar, and K.V. Kumar, "A Glove based approach to recognize Indian Sign Languages", International Journal of Recent Technology and Engineering (IJRTE), Vol. 7, 2019, pp. 1419-1425.
[13] J. Imran, and B. Raman, "Deep motion templates and extreme learning machine for sign language recognition", The Visual Computer, Vol. 36, No. 6, 2020, pp. 1233-1246.
[14] B. Subramanian, B. Olimov, S.M. Naik, S. Kim, K.H. Park, and J. Kim, "An integrated mediapipe-optimized GRU model for Indian sign language recognition", Scientific Reports, Vol. 12, No. 1, 2022, pp. 1-16.
[15] J. Gangrade, and J. Bharti, "Vision-based hand gesture recognition for Indian sign language using convolution neural network", IETE Journal of Research, 2020, pp. 1-10.
[16] A. Kumar, and R. Kumar, "A novel approach for ISL alphabet recognition using Extreme Learning Machine", International Journal of Information Technology, Vol. 13, No. 1, 2021, pp. 349-357.
[17] S. Katoch, V. Singh, and U.S. Tiwary, "Indian Sign Language recognition system using SURF with SVM and CNN", Array, Vol. 14, 2022, pp. 100141.
[18] A. Wadhawan, and P. Kumar, "Deep learning-based sign language recognition system for static signs", Neural Computing and Applications, Vol. 32, No. 12, 2020, pp. 7957-7968.
[19] P.C. Badhe, and V. Kulkarni, "Artificial neural network based Indian sign language recognition using hand crafted features", 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), IEEE, 2020, pp. 1-6.
[20] P.P. Roy, P. Kumar, and B.G. Kim, "An efficient sign language recognition (SLR) system using Camshift tracker and hidden Markov model (HMM)", SN Computer Science, Vol. 2, No. 2, 2021, pp. 1-15.
[21] H. Xiao, Y. Yang, K. Yu, J. Tian, X. Cai, U. Muhammad, and J. Chen, "Sign language digits and alphabets recognition by capsule networks", Journal of Ambient Intelligence and Humanized Computing, Vol. 13, No. 4, 2022, pp. 2131-2141.
[22] A. Mannan, A. Abbasi, A.R. Javed, A. Ahsan, T.R. Gadekallu, and Q. Xin, "Hypertuned deep convolutional neural network for sign language recognition", Computational Intelligence and Neuroscience, 2022.
[23] J. Fregoso, C.I. Gonzalez, and G.E. Martinez, "Optimization of convolutional neural networks architectures using PSO for sign language recognition", Axioms, Vol. 10, No. 3, 2021, pp. 139.
[24] R.C. Hrosik, E. Tuba, E. Dolicanin, R. Jovanovic, and M. Tuba, "Brain image segmentation based on firefly algorithm combined with k-means clustering", Stud. Inform. Control, Vol. 28, No. 2, 2019, pp. 167-176.
[25] M. Lv, G. Zhou, M. He, A. Chen, W. Zhang, and Y. Hu, "Maize leaf disease identification based on feature enhancement and DMS-robust AlexNet", IEEE Access, Vol. 8, 2020, pp. 57952-57966.
[26] H. Alaskar, N. Alzhrani, A. Hussain, and F. Almarshed, "The implementation of pretrained AlexNet on PCG classification", International Conference on Intelligent Computing, Springer, Cham, 2019, pp. 784-794.
[27] S. Sun, T. Zhang, Q. Li, J. Wang, W. Zhang, Z. Wen, and Y. Tang, "Fault diagnosis of conventional circuit breaker contact system based on time-frequency analysis and improved AlexNet", IEEE Transactions on Instrumentation and Measurement, Vol. 70, 2020, pp. 1-12.
[28] A. Tassi, and M. Vizzari, "Object-oriented LULC classification in Google Earth Engine combining SNIC, GLCM, and machine learning algorithms", Remote Sensing, Vol. 12, No. 22, 2020, pp. 3776.
[29] P.K. Mall, P.K. Singh, and D. Yadav, "GLCM based feature extraction and medical x-ray image classification using machine learning techniques", 2019 IEEE Conference on Information and Communication Technology, 2019, pp. 1-6.
[30] M. Yu, T. Quan, Q. Peng, X. Yu, and L. Liu, "A model-based collaborate filtering algorithm based on stacked AutoEncoder", Neural Computing and Applications, Vol. 34, No. 4, 2022, pp. 2503-2511.
[31] A. Sagheer, and M. Kotb, "Unsupervised pre-training of a deep LSTM-based stacked autoencoder for multivariate time series forecasting problems", Scientific Reports, Vol. 9, No. 1, 2019, pp. 1-16.
[32] S. Chen, G.I. Webb, L. Liu, and X. Ma, "A novel selective naïve Bayes algorithm", Knowledge-Based Systems, Vol. 192, 2020, pp. 105361.
[33] H. Zhang, L. Jiang, and L. Yu, "Attribute and instance weighted naive Bayes", Pattern Recognition, Vol. 111, 2021, pp. 107674.
[34] A. Krishnaswamy Rangarajan and R. Purushothaman, "Disease classification in eggplant using pre-trained VGG16 and MSVM", Scientific Reports, Vol. 10, No. 1, 2020, pp. 1-11.
[35] Y. Guo, Z. Zhang, and F. Tang, "Feature selection with kernelized multi-class support vector machine", Pattern Recognition, Vol. 117, 2021, pp. 107988.