Article
A Hybrid Image Augmentation Technique for User- and
Environment-Independent Hand Gesture Recognition Based on
Deep Learning
Baiti-Ahmad Awaluddin 1,2 , Chun-Tang Chao 1 and Juing-Shian Chiou 1, *
1 Department of Electrical Engineering, Southern Taiwan University of Science and Technology, 1, Nan-Tai St.,
Yongkang District, Tainan City 71005, Taiwan; [email protected] (B.-A.A.); [email protected] (C.-T.C.)
2 Department of Electronics Engineering Education, Universitas Negeri Yogyakarta,
Yogyakarta 55281, Indonesia
* Correspondence: [email protected]; Tel.: +886-916-221-152; Fax: +886-6-3010-069
Abstract: This research stems from the increasing use of hand gestures in various applications, ranging
from sign language recognition to electronic device control. The focus is the importance of accuracy and
robustness in recognizing hand gestures to avoid misinterpretation and instruction errors. However,
many experiments on hand gesture recognition are conducted in limited laboratory environments,
which do not fully reflect the everyday use of hand gestures. Therefore, the importance of an
ideal background in hand gesture recognition, involving only the signer without any distracting
background, is highlighted. In the real world, the use of hand gestures involves various unique
environmental conditions, including differences in background colors, varying lighting conditions,
and different hand gesture positions. However, the datasets available to train hand gesture recognition
models often lack sufficient variability, thereby hindering the development of accurate and
adaptable systems. This research aims to develop a robust hand gesture recognition model capable of
operating effectively in diverse real-world environments. By leveraging deep learning-based image
augmentation techniques, the study seeks to enhance the accuracy of hand gesture recognition by
simulating various environmental conditions. Through data duplication and augmentation methods,
including background, geometric, and lighting adjustments, the diversity of the primary dataset is
expanded to improve the effectiveness of model training. It is important to note that the utilization
of the green screen technique, combined with geometric and lighting augmentation, significantly
contributes to the model's ability to recognize hand gestures accurately. The research results show
a significant improvement in accuracy, especially with implementing the proposed green screen
technique, underscoring its effectiveness in adapting to various environmental contexts. Additionally,
the study emphasizes the importance of adjusting augmentation techniques to the dataset's
characteristics for optimal performance. These findings provide valuable insights into the practical
application of hand gesture recognition technology and pave the way for further research in tailoring
techniques to datasets with varying complexities and environmental variations.

Keywords: hand gesture recognition; hybrid augmentation; environment independent; green screen technique

MSC: 68T07

Citation: Awaluddin, B.-A.; Chao, C.-T.; Chiou, J.-S. A Hybrid Image Augmentation Technique for User- and Environment-Independent Hand Gesture Recognition Based on Deep Learning. Mathematics 2024, 12, 1393. https://doi.org/10.3390/math12091393

Academic Editor: Konstantin Kozlov

Received: 8 March 2024
Revised: 13 April 2024
Accepted: 30 April 2024
Published: 2 May 2024
home/assisted living applications [13]. Computer scientists have harnessed diverse com-
putational techniques and methodologies to optimize human-computer interactions [14,15],
integrating hand gestures into software programs that enhance computer and human com-
munication [16]. The advancement of gesture recognition systems significantly enhances
the interaction between computers and humans, with hand gestures becoming increasingly
prevalent across various sectors. Hand gestures are now used in diverse applications such
as gaming [17,18], virtual reality and augmented reality [19,20], assisted living [21,22], and
more. Moreover, the recent proliferation of hand gesture recognition in industries like
human-robot interaction in manufacturing [23,24] and autonomous vehicle control [25]
has spurred considerable interest. Against the backdrop of the COVID-19 pandemic from 2020 to
2023 [26], during which social distancing was a top priority, the scope for implementing hand
gestures has expanded [27], rendering it an intriguing topic for further exploration and discussion.
Although there are several techniques for hand gesture recognition, deep learning, a branch of
machine learning, has become the most advanced approach. Deep learning research
has made remarkable strides in solving complex image recognition and related challenges.
The development of deep learning was greatly influenced by the breakthrough use of convolutional
neural networks (CNNs) for image classification in 2012, when AlexNet performed better than
traditional shallow approaches [28]. The popularity of deep learning in this field can be credited to
advancements in deep network structures, significant processing capability, and the availability of
extensive training datasets. ImageNet [29] is a prominent large-scale dataset that has significantly
stimulated additional advancements. The ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) was responsible for coordinating and monitoring the progress of many CNN models,
leading to the creation of well-known architectures, including VGG [30], GoogLeNet [31], and
ResNet50 [32]. The classification top-5 error rate has decreased significantly over the years: in
2010–2011, when shallow methods were often used, the error rate was over 25%. However, with the
introduction of deep learning, the error rate had dropped to less than 5% by 2015 [33].
However, despite these significant achievements, deep neural networks and their
associated learning algorithms confront several pertinent challenges. One of the most
frequently cited issues is the lack of training data [34–36] or the imbalance of classes
within datasets [31,32]. To resolve this matter, data augmentation has emerged as the
favored technique to enlarge datasets and mitigate the problem of insufficient data for
training models [37,38]. Data augmentation involves applying various transformations or
manipulations to existing data to generate new samples that resemble the original data.
This technique extends the existing dataset by generating new variations from the current
data. Yet, despite the efficacy of data augmentation in alleviating data shortages, there
are still constraints on the number of variations that can be generated from the original
data. This limitation has spurred the exploration of methods to generate an almost limitless
volume of new data for use in datasets. Various methods of augmenting image-based data
have been developed [39], including geometric changes, color tweaks, random occlusions,
and deep learning techniques such as Generative Adversarial Networks (GANs).
Notably, hand gestures in applications demand high robustness and accuracy to
preclude misinterpretation and instruction errors. A review of experiments on hand
gesture recognition conducted by Noraini et al. [40] revealed that the experiments in over 47 articles
had been conducted in constrained laboratory settings. Lim et al. [41] attribute this to the ideal
backdrop for gesture recognition, which includes only the signer, devoid of any background,
as background clutter can impair gesture recognition accuracy. This limitation underscores
the necessity for conducting tests beyond the confines of controlled environments. It is a
pressing concern because the domain of hand gesture recognition extends far beyond the
laboratory setting. Hand gestures now find applications in many real-world scenarios, each
characterized by unique environmental conditions, including varying background colors,
lighting conditions, and hand gesture positions [40]. These challenges are compounded
by limitations in available datasets for real-world HGR, which often lack the variety
and diversity essential for training robust models [42]. The scarcity of comprehensive
and diverse real-world datasets hinders the development of accurate and adaptable hand
gesture recognition systems. Training GANs to produce new synthetic images is challenging
due to mode collapse, non-convergence, and oscillatory behavior, despite their potential
for data augmentation [42].
Data augmentation may also generate new data for testing classifiers in some cases [43].
In comparison to GAN-based augmentation, classical augmentation techniques, such as
background and brightness variations and geometric transformations, are more straightfor-
ward, efficient, and effective for enhancing the performance of CNNs in image classification
tasks. For instance, Dundar et al. [38] investigated the impact of background variations
on image classification. Their study demonstrated that altering the backgrounds of the
training dataset could significantly affect testing accuracies. The study enhanced existing
augmentation techniques with foreground-segmented objects, and its findings are instrumental
in improving accuracy, particularly when working with limited datasets, by creating
synthetic images.
Kandel [44] adopted a meticulous approach to brightness augmentation techniques,
which closely matched the original brightness levels of images. This approach yielded
optimal performance. Previous studies on HGR have employed geometric transformations
as a data augmentation technique. For example, in [45], the implementation of geometric
augmentation significantly enhanced the performance of CNNs by a maximum of 4%. In
another study, an HGR system that utilized Capsule Networks exhibited improved perfor-
mance when combined with geometric augmentation involving rotation and translation
operations [46]. Similarly, ref. [47] employed Adapted CNN and image translation (vertical
and horizontal) to augment the original data, resulting in a 4% improvement in classifi-
cation accuracy. Furthermore, ref. [48] utilized random scaling and horizontal/vertical
translation to augment training data diversity for HGR applications.
The paper presented by Luo, Cui, and Li in 2021 [49] established a CNN model for hand gesture
recognition. Hand gesture detection is hindered by the intricate and varied nature of movements,
which is affected by contextual circumstances, lighting conditions, and occlusion.
The initial step in improving recognition involves the application of a filter to the skin
color of the hand inside the YCbCr color space, thereby isolating the region of motion. A
Gaussian filter is employed to mitigate edge noise, followed by morphological gray opening
operations and the watershed method for contour segmentation. The eight-connected filling
algorithm improves the characteristics of motion. The model in this experiment identifies
ten movements ranging from 0 to 9. The experimental findings illustrate the prompt and
precise identification of the proposed approach, with an average success rate of 96.46%,
without a substantial augmentation in the recognition duration.
However, Rahmat et al. [50] found difficulties when attempting to recognize hand
movements in complex backgrounds and without objects using computer vision in human-
computer interaction. Problems with skin and background detection needed a potent
remedy. The suggested method comprised several stages: acquiring images, resizing them,
converting the color space, utilizing the HS-CbCr format for skin recognition, and applying
averaging to overcome background difficulties. Grayscale image processing, background
accumulation, thresholding, inversion, frame differencing, and picture enhancement were
further processes. Contour, convex hull, and convexity defects were used to extract features,
from which fingers were counted and hand direction was determined. The generated
instructions controlled applications such as PDF readers, video players, audio players, and
slideshow presentations.
which showed up to 98.71% accuracy under well-lit situations. Lighting highly influenced
accuracy, with 95.03% accuracy recorded under lesser illumination. Future improvements
included considering machine learning for better object detection accuracy and integrat-
ing a hand-tracking approach for dynamic gesture recognition. It was suggested that
future studies in increasing hand gesture recognition address skin detection issues related
to lighting.
Yi Yao and Chang-Tsun Li’s research [51] focuses on addressing the formidable task of
recognizing and tracking hand movements in uncontrolled environments. They identify
several critical challenges inherent in such environments, including multiple hand regions,
moving background objects, variations in scale, speed, trajectory location, changing light-
ing conditions, and frontal occlusions. They propose an appearance-based method that
leverages a novel classifier weighting scheme to tackle these challenges. Their method
demonstrates promising performance in effectively handling uncontrolled environments’
complexities without prior knowledge. They utilize the Palm Graffiti Digits Database
and the Warwick Hand Gesture Database to evaluate their approach. Through extensive
experimentation, they illustrate their method’s capability to overcome environmental ob-
stacles and enhance the performance of the initial classifier. Our research methodology
presents significant advancements over previous works, such as those by Luo, Cui, and
Li (2021), Yao and Li, and Rahmat et al., by incorporating cutting-edge data augmentation
techniques and optimizing neural network architectures specifically designed for hand
gesture recognition. Unlike the former, which relies on traditional image pre-processing
and CNN models, and the latter, which focuses on hand gesture recognition under well-lit
conditions, our approach is engineered to robustly adapt to a wider range of environmental
variations, including dramatic lighting changes and complex backgrounds. We employ
advanced data augmentation strategies, including geometric manipulations and lighting
variations, to significantly enhance model resilience against external condition fluctuations.
Furthermore, our specialized neural network architecture is optimized to capture essential
features of hand gestures more efficiently, aiming for superior accuracy while maintaining
or even reducing recognition time. This innovative methodology broadens the spectrum
of recognizable hand gestures under challenging conditions and increases the model’s
applicability and flexibility across diverse domains, setting a new benchmark for future
hand gesture recognition research.
This research study explores the role of brightness as an augmentation
method in training deep learning networks. This work aims to expand current knowledge
and provide a more thorough understanding of color distortion approaches, geometric
augmentation, noise relevance, and image quality in deep learning architectures, in addi-
tion to complementing previous research in these areas. Moreover, the paper explores the
significance of augmenting background variation to enhance the performance and robust-
ness of deep learning models. By considering the intricate interplay between foreground
objects and their surrounding environments, the researchers aim to uncover the potential
benefits and challenges of incorporating background variation as an augmentation strategy.
In conjunction with all these strategies, another objective is to ascertain the possibility of
creating virtually limitless datasets using classical augmentation techniques, resulting in
more diversified datasets to address data insufficiency. In brief, the main contributions of
this study are:
(a) Introducing a new augmentation model that combines geometric transformation with
background, brightness, temperature, and blurriness variations to train deep learning
networks and improve hand gesture recognition.
(b) Offering a background-changing strategy that uses green screening as a data augmentation
technique, eliminating the need for manual object annotations and enhancing training
performance with limited data by replacing the background using computer vision
algorithms.
(c) Presenting the proposed green screen hand gesture dataset, which can be used for training
hand gesture recognition and for studying the effect of background distortion when
simulating HGR in real-world uncontrolled environments.
(d) Exploring the potential of classical augmentation techniques to generate unlimited
data variations while maintaining accuracy.
This research evaluates the accuracy of gesture classification experiments using a
primary dataset that employs the proposed green screen technique to replace the background
with complex backgrounds, such as images from indoor and outdoor locations.
2.1. Material
2.1.1. Environmental Surrounding
Recognizing hand gestures in uncontrolled environmental conditions involves several
key factors. In this context, complex backgrounds are the primary factors affecting the
accuracy of hand gesture recognition. Research conducted by Fuchang et al. [52] highlights
that changes in lighting, rotation, translation, and environmental scaling can make it challenging to
separate the hand from the background. They address this issue by using depth data
to separate the hand and achieve synchronized color and depth images. However, their
research did not explicitly explore the impact of optical blur on recognition accuracy,
leaving a significant gap in our understanding.
It is important to note that another limitation of models utilizing depth images is
the requirement for specialized cameras capable of capturing depth information. This
limitation may restrict the practicality of depth-based models, as such cameras are often
more expensive and less common than regular cameras.
In their study, Igor Vasiljevic, Ayan Chakrabarti, and Gregory Shakhnarovich [53] pro-
vide insights into the specific influence of optical blur, resulting from defocus or subject and
camera motion, on network model performance. Their research delves into the detrimental
effects of blur. It emphasizes the importance of considering blurriness as a significant factor
that can challenge the recognition of hand gestures in uncontrolled environmental condi-
tions. Therefore, addressing the impact of optical blur, along with other environmental
factors, is crucial for enhancing the accuracy of hand gesture recognition systems.
Additionally, hand gesture recognition in uncontrolled environments faces various
challenges beyond complex backgrounds. One of these challenges is the variations in
lighting conditions, as examined in a study by Salunke [54]. Their findings highlight
that optimal accuracy is achieved under bright lighting conditions. To address the issues
related to changes in lighting, researchers have employed techniques such as brightness
augmentation. Temperature sensors, a form of temperature augmentation, have also been
utilized to compensate for temperature fluctuations that can impact image quality, as
demonstrated in the study by Oinam et al. [55]. Temperature augmentation ensures that
machine learning models used for hand gesture recognition perform consistently across
diverse lighting conditions.
In image processing, noise reduction and contrast enhancement under varying lighting
conditions become crucial. Jose et al. [56] use digital image processing techniques to
eliminate noise, separate the hand from the background, and enhance contrast. In cases of
complex backgrounds with significant background variations, hand localization becomes a
challenging task. The study by Peijun et al. [57] shows that accuracy varies depending on
the complexity of the background.
Additionally, geometric transformations such as rotation and scaling are essential, and
several studies, as explained by Yu et al. [58], have successfully addressed this issue. In
situations where devices like Kinect detect a larger object, incorrect results may appear,
as found in the study by Atharva et al. [59]. Environmental noise reduction needs to be
considered in noisy background situations, as seen in the research by Hafız et al. [60].
Multi-modal approaches using depth data and color images enhance hand gesture recog-
nition accuracy in various backgrounds and lighting situations, as shown in the study by
Yifan et al. [61].
Furthermore, addressing geometric transformations is a primary focus. The study by
Yiwen et al. [62] demonstrates that their method can adapt to geometric transformations.
Time of Flight (ToF) data is used in several studies, as presented by Sachara et al. [63].
However, this data can be noisy depending on lighting conditions and other factors.
Technological approaches in the form of sensor systems have also been used to address
various environmental constraints. Washef [64] successfully overcame the limitations of
glove-based and vision-based approaches in situations with varying lighting challenges,
complex backgrounds, and camera distance. Finally, in handling variations in rotation,
scale, translation, and other transformations, the system in the study by Lalit & Pritee [65]
has proven to handle various variations, including digits, letters, and symbols.
In conclusion, the representation of environmental surroundings in gesture recognition
encompasses factors such as complex background, geometric transformation, brightness
augmentation, temperature augmentation, and blurriness augmentation. Researchers have
devised various strategies and techniques to address these challenges and enhance the
accuracy of hand gesture recognition in uncontrolled environments. These insights can
significantly contribute to developing robust and adaptable gesture recognition systems.
2.1.2. Augmentation Technique
Augmentation is a technique used to address data scarcity in neural network training
by applying specific transformations to the original data or images. It is especially valuable in
domains like the medical field, where data availability may be limited. These transformations
generate new images, facilitating the generalization of knowledge learned by classifiers
such as neural networks. Generalization is crucial as it enables models to perform well
on unseen or new data, ensuring their reliability in real-world scenarios. This paper uses
augmentation to construct a model approach to simulate conditions in real-world scenarios.
This technique applies an alternative “on-the-fly” augmentation that dynamically augments
during the training process, generating new data in each training epoch and improving
knowledge generalization, as shown in Figure 1.
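To make this concrete, the following is a minimal sketch of how such on-the-fly augmentation can be attached to a training pipeline. It assumes a torchvision-style setup with illustrative parameter values and a placeholder dataset path; the paper does not prescribe a particular library.

```python
# Minimal sketch of "on-the-fly" augmentation (assumed torchvision pipeline):
# the random transforms are re-sampled every time an image is loaded, so each
# training epoch sees a freshly perturbed copy of the same underlying data.
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

train_transforms = T.Compose([
    T.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.8, 1.2), shear=10),
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.3),                     # lighting variation
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),   # blurriness variation
    T.ToTensor(),
])

# "hand_gestures/train" is a placeholder folder with one sub-directory per class.
train_set = ImageFolder("hand_gestures/train", transform=train_transforms)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
```

Because the transforms are sampled anew for every image access, the effective number of distinct training samples grows with the number of epochs rather than being fixed in advance.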
Geometric Transformation
This work uses this method to simulate hand gesture poses in uncontrolled environments.
Nevertheless, it is essential to acknowledge that certain adjustments, such as inversion, may not be
appropriate for some image categories, such as digit images, as they can lead to ambiguity between
the digits 6 and 9. This research will employ a specific technique:
Image Scaling

Image scaling is a geometric transformation technique used to enlarge an image by
multiplying it with distinct scaling factors on the (x) and (y) axes. This enables us to resize
the image to a larger or smaller size as required while preserving its dimensions and intricate
features. Equations (1) and (2) can be utilized for picture scaling. In these equations, (x, y)
represents the coordinates of a pixel in the original image, (x′, y′) represent the coordinates of
the pixel in the scaled image, and s_x and s_y represent the scale factors for the rows and
columns of the image, respectively.

x′ = x · s_x    (1)

y′ = y · s_y    (2)

Interpolation can enhance the smoothness of the object's edges in the scaled image
while preserving the aspect ratio when the scaling factors s_x and s_y are equal. Resizing
images enhances the model's effectiveness and resilience against input size variations.
When an image is enlarged, it is trimmed to its original dimensions; however, when it is
reduced in size, the original dimensions are preserved, and the empty area is filled using
the nearest pixel neighbor technique. Figure 3 illustrates the description of image scaling
for augmentation.
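As an illustration of Equations (1) and (2), the sketch below (using OpenCV and NumPy; the function name and exact crop/pad strategy are our own assumptions) scales an image by independent factors and then restores the original canvas size as described above.

```python
import cv2
import numpy as np

def scale_augment(img: np.ndarray, s_x: float, s_y: float) -> np.ndarray:
    """Scale an image by (s_x, s_y) as in Equations (1) and (2), keeping the
    original canvas size."""
    h, w = img.shape[:2]
    scaled = cv2.resize(img, None, fx=s_x, fy=s_y, interpolation=cv2.INTER_LINEAR)

    # Enlarged dimensions are trimmed back to the original size (center crop).
    sh, sw = scaled.shape[:2]
    top, left = max((sh - h) // 2, 0), max((sw - w) // 2, 0)
    scaled = scaled[top:top + h, left:left + w]

    # Shrunken dimensions are padded by replicating the nearest border pixels.
    sh, sw = scaled.shape[:2]
    pad_v, pad_h = h - sh, w - sw
    return cv2.copyMakeBorder(scaled, pad_v // 2, pad_v - pad_v // 2,
                              pad_h // 2, pad_h - pad_h // 2,
                              cv2.BORDER_REPLICATE)
```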
Figure 4. Image Rotation illustration.
Image Translation

Image translation refers to the movement of an image in both the horizontal (x-axis)
and vertical (y-axis) directions by a defined number of pixels. This shift in position creates
additional training data, improving the model's ability to handle changes in position.
Equation (5) is employed for translating the x-axis, while Equation (6) is utilized for
translating the y-axis. In these equations, (x, y) represents the coordinates of a pixel in the
original image, (x′, y′) represents the coordinates of the corresponding pixel in the translated
image, and (dx, dy) represents the translation offsets.

x′ = x + dx    (5)

y′ = y + dy    (6)

The nearest pixel interpolation algorithm additionally populates empty regions in the
translated images. The depiction of the picture translation for augmentation may be seen
in Figure 5.
Figure 5. Image Translation Illustration.
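A corresponding sketch of Equations (5) and (6), assuming OpenCV's affine warp with nearest-pixel filling for the uncovered region:

```python
import cv2
import numpy as np

def translate_augment(img: np.ndarray, dx: int, dy: int) -> np.ndarray:
    """Shift an image by (dx, dy) pixels as in Equations (5) and (6); the
    uncovered region is filled from the nearest border pixels."""
    h, w = img.shape[:2]
    # Affine matrix [[1, 0, dx], [0, 1, dy]] maps (x, y) -> (x + dx, y + dy).
    M = np.float32([[1, 0, dx],
                    [0, 1, dy]])
    return cv2.warpAffine(img, M, (w, h),
                          flags=cv2.INTER_NEAREST,
                          borderMode=cv2.BORDER_REPLICATE)
```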
Image Shearing

Image shearing is a technique used to alter an image by shifting each row or column
of pixels along the x-axis or y-axis. The amount of displacement is controlled by the
y-coordinate or x-coordinate of each pixel. This methodology simplifies the handling of
input photographs acquired from different views or angles. Equations (7) and (8) can
be used to shear an image along the x-axis and y-axis. In these equations, (x, y) represent
the coordinates of a pixel in the original image, (x′, y′) represent the coordinates of the
corresponding pixel in the sheared image, and sh_x and sh_y represent the shear factors along
the x-axis and y-axis, respectively.

x′ = x + sh_x · y    (7)

y′ = y + sh_y · x    (8)

To fill in the space in the sheared images, nearest pixel neighbor interpolation is used.
Figure 6 shows an illustration of how image shearing is described for augmentation.
Figure 6. Image Shearing Illustration.
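Equations (7) and (8) can likewise be expressed as an affine warp; the following sketch assumes OpenCV and nearest-pixel filling:

```python
import cv2
import numpy as np

def shear_augment(img: np.ndarray, sh_x: float = 0.0, sh_y: float = 0.0) -> np.ndarray:
    """Shear an image as in Equations (7) and (8):
    x' = x + sh_x * y and y' = y + sh_y * x, with nearest-pixel filling."""
    h, w = img.shape[:2]
    M = np.float32([[1, sh_x, 0],
                    [sh_y, 1, 0]])
    return cv2.warpAffine(img, M, (w, h),
                          flags=cv2.INTER_NEAREST,
                          borderMode=cv2.BORDER_REPLICATE)
```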
Image Flipping

Image flipping is a method that alters the image by inverting its left and right sides.
HGR only employs horizontal flipping. Equation (9) is the formula for horizontal flipping.
In this equation, (x, y) represents the coordinates of a pixel in the original image, x′
represents the matching pixel index in the flipped image, and W represents the image's
width.

x′ = (W − 1) − x    (9)

The order of pixels in each row is reversed from left to right to invert the entire image.
Flipping an image horizontally does not need interpolation. Figure 7 illustrates horizontal
image flipping.
Figure 7. Image Flipping Horizontally Illustration.
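For completeness, Equation (9) corresponds to a simple mirror of each row, for example (assuming OpenCV):

```python
import cv2

def flip_horizontal(img):
    """Horizontal flip per Equation (9): x' = (W - 1) - x; no interpolation needed."""
    return cv2.flip(img, 1)   # flipCode=1 mirrors the image around the vertical axis
```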
Background Augmentation

This research project aims to develop a system capable of accurately recognizing
hand movements, especially in various environmental conditions. We use a technique that
replaces the proposed green screen background with various pre-prepared backgrounds
to simulate such environmental variations. This creates a consistent and uniform
background so the computer can focus more on correctly recognizing hand gestures.

The approach we used was the proposed green screen technique, also known as
Chroma Key, which has become an essential tool in the field of visual effects. When creating
hand gesture images, we use this technique as a backdrop to make it easy to replace the
background to simulate a natural environment. As highlighted in the various abstracts,
green screens have evolved to the point where previously impossible backgrounds, such as
alien planets and outer space, can now be easily realized.

Several studies further outline techniques and advances related to green screens. For
example, Raditya et al. [66] researched using various background colors in the Chroma Key
process to measure efficiency and fatigue. Jin Zhi et al. [67] explore alternatives to green,
incorporating complex mathematical concepts to improve the quality of results. In addition,
Soumyadip Sengupta et al. [68] introduce a technique to produce mattes without relying on
a green screen or hand manipulation, making the background replacement process more
convenient. By referring to these studies, we gain insight into the evolutionary nature of
green screen techniques and their diverse applications. In the context of the hand-gesture
images we create, green screen techniques offer creative freedom to build dynamic and
immersive visual environments, bridging the gap between technology and visual arts to
develop realistic environmental simulations.

Backgrounds used to replace the original background in hand gesture images include
a variety of indoor and outdoor settings, from busy airport scenes and dimly lit basements
to quiet library environments, cozy rooms, and busy city streets. We also incorporate
various other backgrounds, such as forests, subtle color gradients, and vibrant environments.
All these background variations enable computers to effectively learn and recognize hand
movements in various real-world scenarios, improving their operational capabilities in
diverse conditions. Figure 8 shows examples of the images we use as background variations.
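A simplified sketch of how such green screen (Chroma Key) background replacement can be implemented with standard computer vision tools is shown below; the HSV thresholds, morphological cleanup, and function name are illustrative assumptions rather than the exact pipeline used in this study.

```python
import cv2
import numpy as np

def replace_green_screen(fg_path: str, bg_path: str) -> np.ndarray:
    """Replace a green screen backdrop with an arbitrary background image."""
    fg = cv2.imread(fg_path)
    bg = cv2.imread(bg_path)
    bg = cv2.resize(bg, (fg.shape[1], fg.shape[0]))

    # Threshold the green backdrop in HSV space (illustrative bounds).
    hsv = cv2.cvtColor(fg, cv2.COLOR_BGR2HSV)
    green_mask = cv2.inRange(hsv, np.array([35, 60, 60]), np.array([85, 255, 255]))

    # Clean the mask with a small morphological opening to reduce edge noise.
    kernel = np.ones((3, 3), np.uint8)
    green_mask = cv2.morphologyEx(green_mask, cv2.MORPH_OPEN, kernel)

    # Keep the hand where the mask is zero, take the new background elsewhere.
    hand = cv2.bitwise_and(fg, fg, mask=cv2.bitwise_not(green_mask))
    scene = cv2.bitwise_and(bg, bg, mask=green_mask)
    return cv2.add(hand, scene)
```

Because the mask is computed automatically from the uniform backdrop, no manual annotation of the hand region is required, which is the practical advantage the green screen strategy offers for dataset construction.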
Figure 8. Sample Image of Background.

Figure 9. Background changing to make the variation.
Temperature

As a part of augmentation techniques to simulate the real world for handling accuracy
in recognition, temperature augmentation modifies captured images to represent diverse
lighting conditions and color temperature changes. By applying controlled adjustments to
the color channels, such as increasing or decreasing red or blue intensity, images can be
transformed to mimic shifts in color temperature. This process aids in training models to
recognize objects or gestures accurately in environments with varying color temperatures,
ensuring that they perform reliably across different lighting conditions. Temperature
augmentation is essential in the arsenal of techniques to enhance machine-learning models'
robustness and improve their performance in real-world scenarios. The following is an
illustration of temperature augmentation as shown in Figure 10:
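A minimal sketch of such a color temperature adjustment is given below: warming an image raises the red channel and lowers the blue channel, and cooling does the opposite. The scaling scheme and factor values are illustrative assumptions.

```python
import numpy as np

def adjust_temperature(img: np.ndarray, factor: float) -> np.ndarray:
    """Shift the apparent color temperature of a BGR image.
    factor > 1.0 warms the image (more red, less blue);
    factor < 1.0 cools it (more blue, less red)."""
    out = img.astype(np.float32)
    out[..., 2] *= factor          # red channel (BGR ordering)
    out[..., 0] /= factor          # blue channel
    return np.clip(out, 0, 255).astype(np.uint8)
```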
Figure 11. Blurriness Augmentation.
2.1.3. Dataset
In this study, we adopt a highly inclusive, comprehensive approach to evaluate our
Hand Gesture Recognition (HGR) model. This approach involves the combination of
various datasets, including a meticulously created custom dataset. This custom dataset
has undergone several treatments to simulate real-world conditions, creating a more
representative testing environment. It is also essential to understand the concept of a
public dataset, which researchers created openly. Public datasets encourage collaboration,
standardization, and advancement within a specific domain. We aim to create a holistic
approach to HGR model development by merging custom and public datasets.

In the hand gesture recognition (HGR) context, most models have historically operated
in a user-dependent mode. This implies that the training and testing data are derived
from a single dataset. However, in this research, we aim to develop a user-independent
model. In other words, training and testing data will involve various individuals with
diverse hand characteristics. This is highly relevant for practical applications because
many individuals with varying hand features will use the HGR system we are developing.
Combining these two dataset types, we aim to create a robust HGR model that addresses
challenges arising from real-world environmental variations. Our comprehensive approach
allows our research to bridge the gap between controlled laboratory conditions and the
complexity of unpredictable real-world scenarios. Consequently, our research findings are
expected to be relevant in real practical situations.

It is essential to note that both dataset types used in this research adhere to applicable
privacy and usage guidelines. Throughout this research, we are highly committed to ethical
considerations, including protecting individual privacy.
Public Datasets/Secondary Dataset

We utilize a public dataset comprising several sub-datasets to assess our model's
reliability in various real-world scenarios. These sub-datasets include Massey Hand Image
ASL Digit and Alphabet, NUS-II, Sebastian Marcel, and Hand Gesture 14 (HG14), each
with distinct characteristics and data sources. Each sub-dataset contains diverse images
displaying unique hand gestures. The public dataset is utilized as the testing data to
assess the efficacy of our model, which has been trained using a custom-created dataset
subjected to diverse augmentation to replicate real-world scenarios. Each sub-dataset is
described in the paragraphs below:
Massey University (MU) HandImages American Sign Language (A.S.L.)

Barczak and colleagues conducted a study at Massey University (M.U.) in New
Zealand, where they created the MU HandImages A.S.L. dataset. This dataset comprises
2425 images captured from five individuals, each displaying various hand gestures. The
photographs were taken in different lighting conditions and against a green screen
background. The dataset comprises 26 classes representing fundamental American Sign
Language (ASL) movements. The images in the dataset have a black backdrop, and their
pixel sizes may vary depending on the hand posture. Figure 13 [69] visually represents
some images from this dataset. This dataset includes gestures representing digits and
alphabets, making it a valuable resource for research and applications.
Figure 13. The MU HandImages A.S.L. dataset is a collection of 26 unique hand gestures that are
frequently employed in American Sign Language (ASL).
Sebastian Marcel Static Hand Gesture Dataset

The Sebastian Marcel Static Hand Gesture Dataset [70] was used to train a neural
network model to detect hand positions in photographs. Space discretization was used to
separate hand movements based on facial position and body anthropometry. The dataset
has ten persons demonstrating six hand postures (a, b, c, point, five, and v) in uniform and
complicated backdrops, with different picture sizes based on the hand gesture. Figure 14 [70]
depicts several photos from this collection.
Figure 14. Sample images from the Sebastian Marcel Static Hand Gesture Dataset featuring six
distinct hand gestures demonstrated by ten individuals.
The NUS Hand Posture II

The NUS Hand Posture Dataset includes hand posture images captured in and around
the National University of Singapore (NUS) against complex backgrounds. It consists of
10 classes of hand postures performed by 40 subjects of different ethnicities, genders, and
ages (22 to 56 years). Each subject demonstrated the ten hand postures five times,
incorporating natural variations.

The dataset is divided into three folders: “Hand Postures” (2000 images), “Hand
Postures with human noise” (750 images), and “Backgrounds” (2000 images). The hand
posture images are available in grayscale and color, with resolutions of 160 × 120 and
320 × 240 pixels. The images in the “Hand Postures with human noise” folder include
additional elements like the face of the poser and humans in the background.
All images are in RGB format and saved as JPEG files. The dataset has been used in
academic research, and the results of hand posture detection and recognition are reported
in the paper titled “Attention Based Detection and Recognition of Hand Postures Against
Complex Backgrounds” by Pramod Kumar Pisharady, Prahlad Vadakkepat, and Ai Poh
Loh [71]. Figure 15 [71] shows sample images of this dataset.
Figure 15. Sample images of the NUS dataset, consisting of hand postures, hand postures with noise,
and backgrounds.
HG14

The Hand Gestures 14 (HG14) dataset, developed by Guler et al. [72], comprises
14 hand gestures that are appropriate for hand interaction and application control in
augmented reality. The dataset comprises 14,000 photographs, each containing RGB
channels and occupying a 256 by 256 pixel resolution. Every image is accompanied by a
background that is both simple and uniformly colored, as depicted in Figure 16 [72].
Figure 16. Sample images from the HG14 dataset showcasing the 14 distinct hand gestures.
2.1.4. Deep Learning and Neural Network

Deep learning is a cutting-edge method in the machine learning domain that aims to
understand and manage complex information from large data sets automatically. Inspired
by how the human brain works, this approach leverages artificial neural networks to handle
complex data analysis tasks. Neural networks consist of a series of interconnected layers of
neurons, where each layer has its role in processing and interpreting data. At the peak of
deep learning progress, there are various architectures, such as feedforward neural networks
(FNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), and
generative adversarial networks (GANs). These models continue to evolve, adapt to various
problems, and demonstrate the changing dynamics of research and applications in deep
learning. The following is the architecture and approach used in this research.
Convolutional Neural Network

Deep Learning (DL) has various architectures, one of which is the Convolutional Neural
Network (CNN), known for its effectiveness in image recognition compared to traditional
machine learning approaches [32]. The basic idea of CNN is the technique of image
convolution, which combines an input matrix and a kernel matrix to produce a third matrix
that represents how one matrix is modified by the other.

A CNN architecture generally consists of two parts: feature extraction and
classification [73], as depicted in Figure 17. The feature extraction part applies image
convolution to the input image to produce a series of feature maps. These features are
then used in the classification part to classify the label of the input image.
Figure 17. CNN architecture with feature extraction and classification parts.
CNN utilizes convolution to generate a collection of feature maps. Each filter can
identify distinct patterns within the input image, including but not limited to edges, lines,
and corners. To incorporate non-linearity and acquire intricate representations of the input
image, the output of the convolutional layer is subjected to a non-linear activation function,
such as the Rectified Linear Unit (ReLU). The ReLU function is defined in Equation (10) [74].

ReLU(x) = max(0, x)    (10)
The pooling layer is employed after the convolutional layer to decrease the dimen-
sionality of the feature maps and introduce translational invariance. Maximum pooling
selects the highest value inside a specific local area, whereas average pooling computes the
average value.
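The following NumPy sketch illustrates Equation (10) and a simple 2 × 2 max pooling purely for exposition; in practice a deep learning framework provides these operations as built-in layers.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Equation (10): ReLU(x) = max(0, x), applied element-wise."""
    return np.maximum(0, x)

def max_pool_2x2(feature_map: np.ndarray) -> np.ndarray:
    """2x2 max pooling with stride 2: keep the largest value in each local block."""
    h, w = feature_map.shape
    trimmed = feature_map[: h - h % 2, : w - w % 2]   # drop odd trailing row/column
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))
```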
challenges for security and surveillance systems, particularly in recognizing faces wearing
masks. To address this issue, the research faces constraints related to the lack of adequate
datasets for masked face recognition. Available datasets prioritize faces with Caucasian
features, while faces with Ethiopian racial features are often overlooked. Therefore, this
study formulated a specific dataset to overcome these limitations. This study conducted a
comparative analysis among three leading neural network models: AlexNet, ResNet-50,
and Inception-V3. They underwent testing to determine their ability to identify faces
covered by surgical, textile, and N95 masks, among other forms of masks. The research
findings demonstrate that CNN models can achieve very high levels of recognition accuracy,
both for faces wearing masks and those without. Furthermore, model performance analysis
indicates that ResNet-50 stands out by achieving the highest accuracy, reaching 97.5%.
This finding underscores the superiority of ResNet-50 in recognizing faces wearing masks
compared to other models, such as Inception-V3 and AlexNet. From the results of this study,
it can be concluded that the use of the ResNet-50 model makes a significant contribution
to improving the accuracy of masked face recognition, making it the preferred choice in
addressing the challenges of face recognition in the pandemic era.
Furthermore, research related to brain tumor classification also supports the use of
the ResNet-50 model. A framework that utilizes optimal deep learning features has been
proposed for brain tumor classification. In this experiment, the ResNet-50 model was
used with transfer learning, and the results showed significant accuracy for brain tumor
classification [81].
Thus, the use of pre-trained ResNet-50 models supports the success of these studies and demonstrates their enormous potential in realizing intelligent and accurate solutions for challenges across many disciplines. With the continuous advancement of knowledge and technology, the role of this model in future research is expected to become even more prominent and to provide increasingly significant benefits to the broader society.
Accuracy = (TP + TN) / (TP + FP + TN + FN) (11)

The variable TP represents the count of correctly classified true positive labels, TN represents the count of correctly classified true negative labels, FP represents the count of incorrectly classified false positive labels, and FN represents the count of incorrectly classified false negative labels.
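A small worked example of Equation (11) with hypothetical confusion-matrix counts:

```python
# Hypothetical counts from a confusion matrix (not experimental results)
TP, TN, FP, FN = 450, 430, 35, 25
accuracy = (TP + TN) / (TP + FP + TN + FN)
print(f"Accuracy = {accuracy:.3f}")  # (450 + 430) / 940 ≈ 0.936
```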
(d) Exploration of the extent of Classical Augmentation for Generating Varied Data while Maintaining Accuracy
(e) Investigation into the Contribution of the Green Screen Dataset for Implementing Hybrid Image Augmentation.
The research methodology designed to address these inquiries is visually illustrated in Figure 18. This multi-step approach begins with creating a base dataset featuring hand gestures performed against a consistent green screen background. To ensure an ample volume of data for analysis, additional datasets are procured and meticulously replicated 30 times, expanding the dataset's size significantly.
In the subsequent phases, critical augmentations are introduced. The green screen background, the constant backdrop in the initial dataset, is thoughtfully substituted with backgrounds that span a spectrum of colors and settings. These backgrounds include walls, wooded areas, gradient patterns, airports, basements, vibrant and colorful locations, libraries, rooms, and urban streets. The deliberate inclusion of these diverse backgrounds is pivotal to mimicking a real-world environment characterized by many potential scenarios.
Continuing to steps 3 and 4, classical augmentation techniques come into play. Geometric transformations, brightness adjustments, temperature shifts, the introduction of blurriness, and a pretrained ResNet50 are applied systematically to replicate the unpredictability and uncontrolled aspects of real-world environments. These alterations significantly enrich the dataset and prepare it for testing.
The final dataset, a product of these meticulous augmentations and background substitutions, is subjected to rigorous testing. The testing phase involves an examination of its performance against publicly established datasets obtained from Massey, NUS, Sebastian Marcel, DLSI, and HG14. Each public dataset is assessed independently to gauge the impact and effectiveness of the hybrid image augmentation strategy. This approach ensures that the accuracy of each dataset is rigorously evaluated and contributes to a comprehensive understanding of the method's performance across various datasets.
3. Results
3.1. Experimental Setup
This study aims to simulate hand gesture recognition in uncontrolled environments or real-world conditions. It involves treating the initial dataset, which uses a green screen background, by replacing or adding backgrounds that reflect various other locations. Classical augmentation techniques such as geometry, brightness, temperature, and blurriness enhance this simulation.
The study adopts an “on the fly” augmentation strategy to achieve this goal and defines various parameters for each augmentation technique, as shown in Table 1. Augmentations are performed in a series of steps, starting with background augmentation (1), followed by geometry transformation (2), brightness (3), temperature (4), and blurriness (5). The deep learning algorithm’s training phase involves each stage’s application.
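The background-replacement step described above can be sketched with a simple HSV chroma key in OpenCV; the function name, thresholds, and resizing strategy below are assumptions for illustration, not the authors' exact implementation:

```python
import cv2
import numpy as np

def replace_green_background(frame_bgr, background_bgr,
                             lower=(35, 60, 60), upper=(85, 255, 255)):
    """Illustrative chroma key: mask the green backdrop and paste a new background."""
    # Match the background size to the gesture image
    background_bgr = cv2.resize(background_bgr, (frame_bgr.shape[1], frame_bgr.shape[0]))
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    green_mask = cv2.inRange(hsv, np.array(lower), np.array(upper))  # 255 where green
    foreground = cv2.bitwise_and(frame_bgr, frame_bgr, mask=cv2.bitwise_not(green_mask))
    new_background = cv2.bitwise_and(background_bgr, background_bgr, mask=green_mask)
    return cv2.add(foreground, new_background)

# Usage with hypothetical file names:
# composite = replace_green_background(cv2.imread("gesture_green.jpg"),
#                                       cv2.imread("backgrounds/airport.jpg"))
```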
Table 1. Parameter setting for all Augmentation Techniques.
Category | Technique | Parameter Setting | Value/Range | Direction | Description
Geometric Transformation | Rotation | rotation_range | 10° | Positive (Clockwise)/Negative (Counterclockwise) | Rotation of the image within the range of −10 degrees to +10 degrees.
Geometric Transformation | Translation | width_shift_range | 0.1 | Positive (Rightward)/Negative (Leftward) | Shifting the image width within the range of −10% to +10% of the width.
Geometric Transformation | Translation | height_shift_range | 0.1 | Positive (Downward)/Negative (Upward) | Shifting the image height within the range of −10% to +10% of the height.
Geometric Transformation | Shearing | shear_range | 10° | Positive (Right Shear)/Negative (Left Shear) | Shearing the image within the range of −10 degrees to +10 degrees.
Geometric Transformation | Scaling | zoom_range | [1, 1.5] | Positive (Zoom In)/Negative (Zoom Out) | Scaling the image within the range of 1 to 1.5 times the original size.
Flipping | Horizontal Flip | Horizontal Flip | Enabled/Disabled | Horizontal Flip | This enables horizontal flipping, meaning the image can be horizontally flipped (resulting in a mirrored image).
Brightness | Adjustment | brightness | 20 | Positive (Brightening)/Negative (Darkening) | Changing the brightness level within the range of −20 to +20.
Temperature | Adjusting | temperature | 20 | Positive (Warming–Red)/Negative (Cooling–Blue) | Adjusting the image's color temperature within the range of −20 to +20.
Blurriness | Blurring Randomly | blurriness | Random | (No explicit direction) | Randomly adding blur to the image.
Before augmentation, the initial dataset is duplicated 10, 20, 30 times. Each copy of
the dataset then undergoes a series of augmentations. The result is a training dataset with
increased variability. This dataset will be tested to evaluate its impact on model accuracy.
Thus, we hope this research can provide valuable insights into using augmentation in the
context of hand gesture recognition in real-world environments.
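A minimal sketch of this duplication step, with hypothetical directory names (the authors' actual folder layout is not specified here):

```python
import shutil
from pathlib import Path

def duplicate_dataset(src_dir, dst_dir, copies=30):
    # Replicate every image `copies` times so the on-the-fly augmentations
    # later produce `copies` differently augmented versions of each original.
    src, dst = Path(src_dir), Path(dst_dir)
    for image_path in src.rglob("*.jpg"):          # extend the pattern for other formats
        class_dir = dst / image_path.parent.name   # keep the per-class folder structure
        class_dir.mkdir(parents=True, exist_ok=True)
        for i in range(copies):
            shutil.copy(image_path, class_dir / f"{image_path.stem}_copy{i}{image_path.suffix}")

# duplicate_dataset("greenscreen_dataset", "training_dataset_x30", copies=30)
```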
Transfer learning uses pre-trained weights from ImageNet as the initial weights for the network to circumvent the need for intricate and computationally intensive learning processes. Weight optimization is limited to the classification layer of the pre-trained model, so only the fully connected layers are optimized. The optimization technique known as Adaptive Moment Estimation (ADAM) is used to improve the training process and prevent gradient vanishing during training [83]. The pre-existing network undergoes retraining for a total of 50 epochs with a batch size of 32. To preserve the pre-trained ImageNet weights, all layers in the feature extraction phase are frozen during network training. Network performance is evaluated using the “accuracy” metric.
The experimental setup employs the Python programming language, with various
libraries like TensorFlow, Matplotlib, and NumPy. The experiment is conducted on a
personal computer, with the specifications specified in Table 2.
Hardware/Software | Specification
Processor (CPU) | Intel Core i5-9300H @2.40 GHz
Memory (RAM) | 32 GB DDR4
Graphical Processing Unit (GPU) | Nvidia GTX 1660 Ti–6 GB vRAM
Operating System | Windows 11 Home Edition
Python version | 3.6.13
Cuda/CuDNN version | 11.0/8.0
```python
def preprocess_image(image):
    # Custom per-image preprocessing hook passed to ImageDataGenerator
    # (body elided in the original)
    # ...
    return img_array

def blur_image(image):
    # Randomly adds blur to the image (body elided in the original)
    # ...
    return image

def change_background_v2(img, bg):
    # Replaces the green screen background with a new background image
    # (body elided in the original)
    # ...
    return img_edit

def adjust_brightness_temperature(image, bg):
    # Adjusts brightness and color temperature (body elided in the original)
    # ...
    return result_image
```
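The bodies of these functions are elided above. The sketch below shows one possible way to realize the brightness, temperature, and blur adjustments within the ±20 ranges listed in Table 1; it is an assumption for illustration and should not be read as the authors' exact code:

```python
import random
import cv2
import numpy as np

def random_brightness_temperature_blur(img_bgr, limit=20):
    # Work in a wider integer type so offsets can be clipped afterwards
    img = img_bgr.astype(np.int16)

    # Brightness: add one random offset in [-limit, +limit] to all channels
    img += random.randint(-limit, limit)

    # Color temperature: warming pushes red up and blue down, cooling the reverse
    shift = random.randint(-limit, limit)
    img[:, :, 2] += shift   # red channel (OpenCV uses BGR order)
    img[:, :, 0] -= shift   # blue channel
    img = np.clip(img, 0, 255).astype(np.uint8)

    # Blurriness: randomly apply a Gaussian blur with a small odd kernel (or none)
    k = random.choice([0, 3, 5])
    if k:
        img = cv2.GaussianBlur(img, (k, k), 0)
    return img
```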
Next, the configuration of the “ImageDataGenerator” is set for training data with
various augmentation techniques, including rotation, horizontal and vertical shifts, shear,
zoom, and custom preprocessing using the “preprocess_image” function.
```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    shear_range=10,
    zoom_range=[1, 1.5],
    fill_mode='nearest',
    preprocessing_function=preprocess_image,
)
```
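The generator referenced later as "train_generator" is assumed to be created from this configuration roughly as follows; the directory path is a placeholder, while the 224 × 224 target size matches the ResNet50 input used below:

```python
# Hypothetical directory layout: one sub-folder per gesture class
train_generator = train_datagen.flow_from_directory(
    "training_dataset_x30",      # placeholder path, not the authors' actual folder
    target_size=(224, 224),      # resized to the ResNet50 input size
    batch_size=32,
    class_mode="categorical",    # one-hot labels for categorical cross-entropy
)
```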
(a) Finally, a CNN model is built using the ResNet50 architecture as the base, with
additional layers suitable for the hand gesture recognition task. Here’s a detailed
explanation of each step in building a CNN model using the ResNet50 architecture:
(b) Building the Base Model with ResNet50:
- ResNet50 is a Convolutional Neural Network (CNN) architecture developed by
Microsoft Research. It consists of 50 layers (hence the “50” in the name) and has
been proven highly effective in various computer vision tasks, especially image
classification.
- “ResNet50(include_top = False, weights = ‘imagenet’, input_shape = (224, 224, 3))”: This function creates the base ResNet50 model. The argument “include_top = False” indicates that the fully connected layers at the top of the model will not be included, allowing us to customize the top layers according to our task. “weights = ‘imagenet’” initializes the model with weights learned from the “imagenet” dataset, enabling the model to have an initial understanding of various features present in images. The argument “input_shape = (224, 224, 3)” specifies the size and color channels (RGB) of the input images that the model will receive.
(c) Setting Trainable Layers
After creating the base ResNet50 model, we set all its layers to be non-trainable
(“layer.trainable = False”). This step ensures that the weights learned by the ResNet50
model are not altered during the new model’s training process. We want to leverage the
features learned by the ResNet50 model on the “imagenet” dataset without modifying
them.
(d) Adding Additional Layers:
- After the base ResNet50 model, we add several additional layers on top of it to
tailor the model to the hand gesture recognition task.
- Flatten(): This layer flattens the output of the base model into one dimension.
This is necessary because the subsequent Dense layers require input in the form
of a one-dimensional vector.
- Dense(512, activation = ‘relu’): This Dense layer consists of 512 neuron units with
the ReLU activation function. Dense layers like this aim to learn more abstract
feature representations from the image data.
- Dropout(0.5): The Dropout layer is used to reduce overfitting by randomly
deactivating some neuron units during the training process.
- Dense(train_generator.num_classes, activation = ‘softmax’): The final Dense layer
has the same number of neuron units as the number of classes in the training
dataset, and it uses the softmax activation function to generate probabilities of
possible classes.
(e) Compiling the Model:
- After adding the additional layers, the model needs to be compiled before it can
be used for the training process.
- model.compile(optimizer = ‘adam’, loss = ‘categorical_crossentropy’,metrics =
[‘accuracy’]): In this step, we specify the optimizer to be used (in this case, the
Adam optimizer), the loss function appropriate for the multi-class classification
task (categorical cross-entropy), and the metrics to be monitored during training
(in this case, accuracy).
With the above steps, we have successfully built a CNN model using the ResNet50
architecture as the base, ready to be trained for the hand gesture recognition task.
```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Flatten, Dense, Dropout
from tensorflow.keras.models import Model

# Base ResNet50 feature extractor with frozen ImageNet weights
base_model = ResNet50(include_top=False, weights='imagenet', input_shape=(224, 224, 3))
for layer in base_model.layers:
    layer.trainable = False

# Custom classification head for the hand gesture classes
x = base_model.output
x = Flatten()(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.5)(x)
predictions = Dense(train_generator.num_classes, activation='softmax')(x)

model = Model(inputs=base_model.input, outputs=predictions)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```
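A training sketch consistent with the setup described earlier (Adam optimizer, 50 epochs, batches of 32 supplied by the generator); callbacks and any validation split are omitted here and would be additional assumptions:

```python
# The generator already yields batches of 32 augmented images with one-hot labels
history = model.fit(train_generator, epochs=50)
```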
This experiment is directed towards each public dataset to understand the extent of
the impact of hybrid augmentation on accuracy results.
Category | Name of Dataset | Number of Data | Number of Classes | Image Size | Background/Image Consist
Public (as Testing Data) | HG14 | 14,280 | 14 | 256 × 256 | uniform
 | MU HandImages ASL (Digit 0–9) | 700 | 10 | vary | uniform
 | MU HandImages ASL (Alphabet) | 3490 | 26 | vary | uniform
 | Sebastian Marcel | 659 | 6 | vary | uniform & complex
 | NUS-II | 2000 | 10 | 160 × 120 | complex
Custom Dataset (using Green Screen B.G. for each Public Dataset as Training Data) | HG14 | 280 | 14 | 224 × 224 | greenscreen
 | MU HandImages ASL (Digit 0–9) | 201 | 10 | 224 × 224 | greenscreen
 | MU HandImages ASL (Alphabet) | 781 | 26 | 224 × 224 | greenscreen
 | Sebastian Marcel | 120 | 6 | 224 × 224 | greenscreen
 | NUS-II | 210 | 10 | 224 × 224 | greenscreen
Image Background for replacing B.G. Greenscreen | – | 90 | – | 400 × 320 | various outdoor scenarios, gradients, and different colors
Testing is carried out on the relevant public dataset for each specific dataset. The results of these tests provide information about the accuracy of each dataset.
Through this approach, we can conduct a more in-depth analysis to measure the extent
to which our applied approach succeeds when tested on various public datasets.
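A sketch of this per-dataset evaluation; the directory name is a placeholder, and each public dataset would be loaded with its own generator in the same way:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Test images are only resized; no augmentation is applied at evaluation time
test_generator = ImageDataGenerator().flow_from_directory(
    "public_datasets/sebastian_marcel",   # placeholder path for one public dataset
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
    shuffle=False,
)
loss, accuracy = model.evaluate(test_generator)
print(f"Accuracy on this public dataset: {accuracy:.3f}")
```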
Table 4. Accuracy Results on the Sebastian Marcel Dataset with Various Data Augmentation Tech-
niques for Each Duplication.
Table 5. Accuracy Results on the NUS_II Dataset with Various Data Augmentation Techniques for
Each Duplication.
Table 6. Accuracy Results on the Massey (Digit) Dataset with Various Data Augmentation Techniques
for Each Duplication.
Table 7. Accuracy Results on the Massey (Alphabet) Dataset with Various Data Augmentation
Techniques for Each Duplication.
Lastly, the HG14 dataset presents challenges in improving accuracy, with the lowest accuracy values observed for some types of augmentation, as given in Table 8. This indicates that effective augmentation strategies may vary depending on dataset characteristics and require structured approaches and careful exploration.
Table 8. Accuracy Results on the HG 14 Dataset with Various Data Augmentation Techniques for
Each Duplication.
4. Discussion
It is essential to remember that there is no one-size-fits-all approach to hand gesture recognition. Each dataset possesses unique characteristics that influence how it responds to augmentation and data duplication techniques. Therefore, the selection and adaptation of techniques should be based on a deep understanding of the characteristics of the specific dataset used in the research.
Based on our research findings, we draw several pivotal conclusions. The “green screen” technique substantially boosts hand gesture recognition accuracy across various environmental contexts.
5. Conclusions
This research has successfully developed a hybrid augmentation framework for en-
hancing the performance of hand gesture recognition systems in user- and environment-
independent scenarios. The augmentation strategy incorporates background replacement,
geometric transformations, brightness and temperature changes, and blurriness imple-
mentation. All these augmentations aim to generate more training data with variational
conditions. From the experiments using several datasets, it is found that this hybrid aug-
mentation strategy improves the classification accuracy by 8.5% on average. In addition, expanding the training data using 10×, 20×, and 30× duplications also helps to increase the recognition performance by up to 6% compared to using only the original training data.
It is noted that the experiment was conducted using training data from a single person.
Therefore, in the future, it can be observed whether the proposed augmentation strategy
will provide more significant improvement when more volunteers are involved in the
image acquisition process.
Based on the discussion above, we can conclude several crucial points that mark
a significant step forward in hand gesture recognition. First, using the “green screen”
technique has revolutionized the accuracy of hand gesture recognition across various
environmental contexts. By isolating the background, the hand gesture recognition model
becomes more focused on the gestures, overcoming visual disturbances that may arise from
complex or changing backgrounds. Second, innovative approaches in applying hybrid
image augmentation have shown that geometric precision, lighting quality, and consistent
backgrounds are key to improving accuracy. However, this improvement is not uniform
and needs to be tailored to the unique characteristics of each dataset, emphasizing the need
for dataset-based technique adjustments.
Moreover, the findings offer significant insights into the importance of tailoring augmentation strategies to the specific dataset employed. Datasets using the “green screen” technique demonstrate more substantial success in implementing hybrid image augmentation, reaffirming the need for adaptive strategies in data processing to achieve optimal
accuracy. This conclusion paves the way for more detailed approaches in using aug-
mentation techniques, especially on datasets with diverse characteristics and complex
environmental variations. By providing a solid foundation, this research encourages fur-
ther exploration into the potential application of hand gesture recognition technology in
various broader application contexts.
References
1. Sun, J.-H.; Ji, T.-T.; Zhang, S.-B.; Yang, J.-K.; Ji, G.-R. Research on the Hand Gesture Recognition Based on Deep Learning. In
Proceedings of the 2018 12th International Symposium on Antennas, Propagation and EM Theory (ISAPE), Hangzhou, China,
3–6 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–4.
2. Oudah, M.; Al-Naji, A.; Chahl, J. Hand Gesture Recognition Based on Computer Vision: A Review of Techniques. J. Imaging 2020,
6, 73. [CrossRef]
3. Muthu Mariappan, H.; Gomathi, V. Real-Time Recognition of Indian Sign Language. In Proceedings of the ICCIDS 2019—2nd
International Conference on Computational Intelligence in Data Science, Chennai, India, 21–23 February 2019; Institute of
Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA; Department of Computer Science and Engineering, National
Engineering College: Kovilpatti, India, 2019.
4. Makarov, I.; Veldyaykin, N.; Chertkov, M.; Pokoev, A. Russian Sign Language Dactyl Recognition. In Proceedings of the 2019 42nd
International Conference on Telecommunications and Signal Processing, TSP 2019, Budapest, Hungary, 1–3 July 2019; Institute
of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA; National Research University Higher School of Economics:
Moscow, Russia, 2019; pp. 726–729.
5. Žemgulys, J.; Raudonis, V.; Maskeliūnas, R.; Damaševičius, R. Recognition of Basketball Referee Signals from Real-Time Videos. J.
Ambient. Intell. Humaniz. Comput. 2020, 11, 979–991. [CrossRef]
6. Kong, L.; Huang, D.; Qin, J.; Wang, Y. A Joint Framework for Athlete Tracking and Action Recognition in Sports Videos. IEEE
Trans. Circuits Syst. Video Technol. 2020, 30, 532–548. [CrossRef]
7. Carfi, A.; Motolese, C.; Bruno, B.; Mastrogiovanni, F. Online Human Gesture Recognition Using Recurrent Neural Networks
and Wearable Sensors. In Proceedings of the 2018 27th IEEE International Symposium on Robot and Human Interactive
Communication (RO-MAN), Nanjing, China, 27–31 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 188–195.
8. Park, S.; Kim, D. Study on 3D Action Recognition Based on Deep Neural Network. In Proceedings of the 2019 International
Conference on Electronics, Information, and Communication (ICEIC), Auckland, New Zealand, 22–25 January 2019; IEEE:
Piscataway, NJ, USA, 2019; pp. 1–3.
9. Badave, H.; Kuber, M. Head Pose Estimation Based Robust Multicamera Face Recognition. In Proceedings of the 2021 International
Conference on Artificial Intelligence and Smart Systems (ICAIS), Coimbatore, India, 25–27 March 2021; IEEE: Piscataway, NJ,
USA, 2021; pp. 492–495.
10. Liaqat, S.; Dashtipour, K.; Arshad, K.; Assaleh, K.; Ramzan, N. A Hybrid Posture Detection Framework: Integrating Machine
Learning and Deep Neural Networks. IEEE Sens. J. 2021, 21, 9515–9522. [CrossRef]
11. Wang, Y.; Liu, J. A Self-Developed Smart Wristband to Monitor Exercise Intensity and Safety in Physical Education Class. In
Proceedings of the Proceedings—2019 8th International Conference of Educational Innovation through Technology, EITT 2019,
Biloxi, MS, USA, 27–31 October 2019; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 160–164.
12. Caviedes, J.E.; Li, B.; Jammula, V.C. Wearable Sensor Array Design for Spine Posture Monitoring during Exercise Incorporating
Biofeedback. IEEE Trans. Biomed. Eng. 2020, 67, 2828–2838. [CrossRef]
13. Arathi, P.N.; Arthika, S.; Ponmithra, S.; Srinivasan, K.; Rukkumani, V. Gesture Based Home Automation System. In Proceedings
of the 2017 International Conference On Nextgen Electronic Technologies: Silicon to Software, ICNETS2 2017, Chennai, India,
23–25 March 2017; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA; Department of Electronics and
Instrumentation Engineering, Sri Ramakrishna Engineering College: Coimbatore, India, 2017; pp. 198–201.
14. Abraham, L.; Urru, A.; Normani, N.; Wilk, M.P.; Walsh, M.; O’flynn, B. Hand Tracking and Gesture Recognition Using Lensless
Smart Sensors. Sensors 2018, 18, 2834. [CrossRef]
15. Nascimento, T.H.; Soares, F.A.A.M.N.; Nascimento, H.A.D.; Vieira, M.A.; Carvalho, T.P.; de Miranda, W.F. Netflix Control Method
Using Smartwatches and Continuous Gesture Recognition. In Proceedings of the 2019 IEEE Canadian Conference of Electrical
and Computer Engineering (CCECE), Edmonton, AB, Canada, 5–8 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4.
16. Ahmed, S.; Cho, S.H. Hand Gesture Recognition Using an IR-UWB Radar with an Inception Module-Based Classifier. Sensors
2020, 20, 564. [CrossRef]
17. Lee, C.; Kim, J.; Cho, S.; Kim, J.; Yoo, J.; Kwon, S. Development of Real-Time Hand Gesture Recognition for Tabletop Holographic
Display Interaction Using Azure Kinect. Sensors 2020, 20, 4566. [CrossRef] [PubMed]
18. Ekneling, S.; Sonestedt, T.; Georgiadis, A.; Yousefi, S.; Chana, J. Magestro: Gamification of the Data Collection Process for
Development of the Hand Gesture Recognition Technology. In Proceedings of the Adjunct Proceedings—2018 IEEE International
Symposium on Mixed and Augmented Reality, ISMAR-Adjunct 2018, Munich, Germany, 16–20 October 2018; Institute of Electrical
and Electronics Engineers Inc.: Piscataway, NJ, USA; Department of Computer and Systems Sciences, Stockholm University:
Stockholm, Sweden, 2018; pp. 417–418.
19. Bai, Z.; Wang, L.; Zhou, S.; Cao, Y.; Liu, Y.; Zhang, J. Fast Recognition Method of Football Robot’s Graphics from the VR
Perspective. IEEE Access 2020, 8, 161472–161479. [CrossRef]
20. Nooruddin, N.; Dembani, R.; Maitlo, N. HGR: Hand-Gesture-Recognition Based Text Input Method for AR/VR Wearable Devices.
In Proceedings of the Conference Proceedings—IEEE International Conference on Systems, Man and Cybernetics, Toronto, ON,
Canada, 11–14 October 2020; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2020; pp. 744–751.
21. Mezari, A.; Maglogiannis, I. An Easily Customized Gesture Recognizer for Assisted Living Using Commodity Mobile Devices. J.
Healthc. Eng. 2018, 2018, 3180652. [CrossRef]
22. Roberge, A.; Bouchard, B.; Maître, J.; Gaboury, S. Hand Gestures Identification for Fine-Grained Human Activity Recognition in
Smart Homes. Procedia Comput. Sci. 2022, 201, 32–39. [CrossRef]
23. Kaczmarek, W.; Panasiuk, J.; Borys, S.; Banach, P. Industrial Robot Control by Means of Gestures and Voice Commands in Off-Line
and On-Line Mode. Sensors 2020, 20, 6358. [CrossRef] [PubMed]
24. Neto, P.; Simão, M.; Mendes, N.; Safeea, M. Gesture-Based Human-Robot Interaction for Human Assistance in Manufacturing.
Int. J. Adv. Manuf. Technol. 2019, 101, 119–135. [CrossRef]
25. Young, G.; Milne, H.; Griffiths, D.; Padfield, E.; Blenkinsopp, R.; Georgiou, O. Designing Mid-Air Haptic Gesture Controlled User
Interfaces for Cars. Proc. ACM Hum. Comput. Interact 2020, 4, 1–23. [CrossRef]
26. Archived: WHO Timeline—COVID-19. Available online: https://www.who.int/news/item/27-04-2020-who-timeline---covid-
19 (accessed on 25 October 2023).
27. Katti, J.; Kulkarni, A.; Pachange, A.; Jadhav, A.; Nikam, P. Contactless Elevator Based on Hand Gestures during COVID-19 like
Pandemics. In Proceedings of the 2021 7th International Conference on Advanced Computing and Communication Systems,
ICACCS 2021, Coimbatore, India, 19–20 March 2021; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA;
Pimpri Chinchwad College of Engineering: Maharashtra, India, 2021; pp. 672–676.
28. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM
2017, 60, 84–90. [CrossRef]
29. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings
of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops),
Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255.
30. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
31. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June
2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2818–2826.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016;
pp. 770–778.
33. Shafiq, M.; Gu, Z. Deep Residual Learning for Image Recognition: A Survey. Appl. Sci. 2022, 12, 8972. [CrossRef]
34. Khosla, C.; Saini, B.S. Enhancing Performance of Deep Learning Models with Different Data Augmentation Techniques: A Survey.
In Proceedings of the International Conference on Intelligent Engineering and Management, ICIEM 2020, London, UK, 17–19
June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 79–85.
35. Mikolajczyk, A.; Grochowski, M. Data Augmentation for Improving Deep Learning in Image Classification Problem. In
Proceedings of the 2018 International Interdisciplinary PhD Workshop (IIPhDW), Swinoujscie, Poland, 9–12 May 2018; IEEE:
Piscataway, NJ, USA, 2018; pp. 117–122.
36. Kaur, P.; Khehra, B.S.; Mavi, E.B.S. Data Augmentation for Object Detection: A Review. In Proceedings of the 2021 IEEE
International Midwest Symposium on Circuits and Systems (MWSCAS), Lansing, MI, USA, 9–11 August 2021; pp. 537–543.
[CrossRef]
37. Leevy, J.L.; Khoshgoftaar, T.M.; Bauder, R.A.; Seliya, N. A Survey on Addressing High-Class Imbalance in Big Data. J. Big Data
2018, 5, 42. [CrossRef]
38. Shukla, P.; Bhowmick, K. To Improve Classification of Imbalanced Datasets. In Proceedings of the 2017 International Conference
on Innovations in Information, Embedded and Communication Systems, ICIIECS 2017, Coimbatore, India, 17–18 March 2017; pp.
1–5. [CrossRef]
39. Shorten, C.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [CrossRef]
40. Mohamed, N.; Mustafa, M.B.; Jomhari, N. A Review of the Hand Gesture Recognition System: Current Progress and Future
Directions. IEEE Access 2021, 9, 157422–157436. [CrossRef]
41. Lim, K.M.; Tan, A.W.C.; Tan, S.C. A Feature Covariance Matrix with Serial Particle Filter for Isolated Sign Language Recognition.
Expert Syst. Appl. 2016, 54, 208–218. [CrossRef]
42. Farahanipad, F.; Rezaei, M.; Nasr, M.S.; Kamangar, F.; Athitsos, V. A Survey on GAN-Based Data Augmentation for Hand Pose
Estimation Problem. Technologies 2022, 10, 43. [CrossRef]
43. Sharma, S.; Singh, S. Vision-Based Hand Gesture Recognition Using Deep Learning for the Interpretation of Sign Language.
Expert Syst. Appl. 2021, 182, 115657. [CrossRef]
44. Kandel, I.; Castelli, M.; Manzoni, L. Brightness as an Augmentation Technique for Image Classification. Emerg. Sci. J. 2022, 6,
881–892. [CrossRef]
45. Islam, M.Z.; Hossain, M.S.; Ul Islam, R.; Andersson, K. Static Hand Gesture Recognition Using Convolutional Neural Network
with Data Augmentation. In Proceedings of the 2019 Joint 8th International Conference on Informatics, Electronics and Vision,
ICIEV 2019 and 3rd International Conference on Imaging, Vision and Pattern Recognition, icIVPR 2019 with International
Conference on Activity and Behavior Computing, ABC 2019, Spokane, WA, USA, 30 May–2 June 2019; pp. 324–329. [CrossRef]
46. Bousbai, K.; Merah, M. Hand Gesture Recognition Using Capabilities of Capsule Network and Data Augmentation. In Proceedings
of the 2022 7th International Conference on Image and Signal Processing and Their Applications, ISPA 2022—Proceedings,
Mostaganem, Algeria, 8–9 May 2022; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA; Mostaganem
University, Elctronics and Embededd Systems, Department of Electrical Engineering: Mostaganem, Algeria, 2022.
47. Alani, A.A.; Cosma, G.; Taherkhani, A.; McGinnity, T.M. Hand Gesture Recognition Using an Adapted Convolutional Neural
Network with Data Augmentation. In Proceedings of the 2018 4th International Conference on Information Management (ICIM),
Oxford, UK, 25–27 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 5–12.
48. Zhou, W.; Chen, K. A Lightweight Hand Gesture Recognition in Complex Backgrounds. Displays 2022, 74, 102226. [CrossRef]
49. Luo, Y.; Cui, G.; Li, D. An Improved Gesture Segmentation Method for Gesture Recognition Based on CNN and YCbCr. J. Electr.
Comput. Eng. 2021, 2021, 1783246. [CrossRef]
50. Fadillah Rahmat, R.; Chairunnisa, T.; Gunawan, D.; Fermi Pasha, M.; Budiarto, R. Hand gestures recognition with improved skin
color segmentation in human-computer interaction applications. J. Theor. Appl. Inf. Technol. 2019, 97, 727–739.
51. Yao, Y.; Li, C.T. Hand Gesture Recognition and Spotting in Uncontrolled Environments Based on Classifier Weighting. In
Proceedings of the International Conference on Image Processing, ICIP 2015, Quebec City, QC, Canada, 27–30 September 2015;
pp. 3082–3086. [CrossRef]
52. Yang, F.; Shi, H. Research on Static Hand Gesture Recognition Technology for Human Computer Interaction System. In
Proceedings of the 2016 International Conference on Intelligent Transportation, Big Data and Smart City, ICITBS 2016, Changsha,
China, 17–18 December 2016; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2017; pp. 459–463.
53. Vasiljevic, I.; Chakrabarti, A.; Shakhnarovich, G. Examining the Impact of Blur on Recognition by Convolutional Networks. arXiv
2016, arXiv:1611.05760.
54. Salunke, T.P.; Bharkad, S.D. Power Point Control Using Hand Gesture Recognition Based on HOG Feature Extraction and K-Nn
Classification. In Proceedings of the International Conference on Computing Methodologies and Communication, ICCMC 2017,
Erode, India, 18–19 July 2017; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA; Dept. of EandTC
Engineering, Government College of Engineering: Aurangabad, India, 2018; pp. 1151–1155.
55. Chanu, O.R.; Pillai, A.; Sinha, S.; Das, P. Comparative Study for Vision Based and Data Based Hand Gesture Recognition
Technique. In Proceedings of the ICCT 2017—International Conference on Intelligent Communication and Computational
Techniques, Jaipur, India, 22–23 December 2017; pp. 26–31. [CrossRef]
56. Flores, C.J.L.; Cutipa, A.E.G.; Enciso, R.L. Application of Convolutional Neural Networks for Static Hand Gestures Recognition
under Different Invariant Features. In Proceedings of the 2017 IEEE 24th International Congress on Electronics, Electrical
Engineering and Computing, INTERCON 2017, Cusco, Peru, 15–18 August 2017; pp. 5–8. [CrossRef]
57. Bao, P.; Maqueda, A.I.; Del-Blanco, C.R.; Garciá, N. Tiny Hand Gesture Recognition without Localization via a Deep Convolutional
Network. IEEE Trans. Consum. Electron. 2017, 63, 251–257. [CrossRef]
58. Qiao, Y.; Feng, Z.; Zhou, X.; Yang, X. Principle Component Analysis Based Hand Gesture Recognition for Android Phone
Using Area Features. In Proceedings of the 2017 2nd International Conference on Multimedia and Image Processing, ICMIP
2017, Wuhan, China, 17–19 March 2017; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA; School of
Information Science and Engineering, University of Jinan: Jinan, China, 2017; pp. 108–112.
59. Kadethankar, A.A.; Joshi, A.D. Dynamic Hand Gesture Recognition Using Kinect. In Proceedings of the 2017 Innovations
in Power and Advanced Computing Technologies, i-PACT 2017, Vellore, India, 21–22 April 2017; Institute of Electrical and
Electronics Engineers Inc.: Piscataway, NJ, USA; Electronics and Telecommunication, Shri Guru Gobind Singhji Institute of Engg.
and Tech.: Maharashtra, India, 2017; pp. 1–3.
60. Abdul-Rashid, H.M.; Kiran, L.; Mirrani, M.D.; Maraaj, M.N. CMSWVHG-Control MS Windows via Hand Gesture. In Proceedings
of the Proceedings of 2017 International Multi-Topic Conference, INMIC 2017, Lahore, Pakistan, 24–26 November 2017; Institute
of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA; National University of Computer and Emerging Sciences,
FAST-NU: Islamabad, Pakistan, 2018; pp. 1–7.
61. Zhang, Y.; Cao, C.; Cheng, J.; Lu, H. EgoGesture: A New Dataset and Benchmark for Egocentric Hand Gesture Recognition. IEEE
Trans. Multimed. 2018, 20, 1038–1050. [CrossRef]
62. He, Y.; Yang, J.; Shao, Z.; Li, Y. Salient Feature Point Selection for Real Time RGB-D Hand Gesture Recognition. In Proceedings
of the 2017 IEEE International Conference on Real-Time Computing and Robotics, RCAR 2017, Okinawa, Japan, 14–18 July
2017; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA; School of Urban Rail Transportation, Soochow
University: Suzhou, China, 2017; pp. 103–108.
63. Sachara, F.; Kopinski, T.; Gepperth, A.; Handmann, U. Free-Hand Gesture Recognition with 3D-CNNs for in-Car Infotainment
Control in Real-Time. In Proceedings of the IEEE Conference on Intelligent Transportation Systems, Proceedings, ITSC, Yokohama,
Japan, 16–19 October 2017; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA; Computer Science Institute,
Hochschule Ruhr West: Bottrop, Germany, 2018; pp. 959–964.
64. Ahmed, W.; Chanda, K.; Mitra, S. Vision Based Hand Gesture Recognition Using Dynamic Time Warping for Indian Sign
Language. In Proceedings of the 2016 International Conference on Information Science, ICIS 2016, Kochi, India, 12–13 August
2016; pp. 120–125. [CrossRef]
65. Kane, L.; Khanna, P. Vision-Based Mid-Air Unistroke Character Input Using Polar Signatures. IEEE Trans. Hum. Mach. Syst. 2017,
47, 1077–1088. [CrossRef]
66. Raditya, C.; Rizky, M.; Mayranio, S.; Soewito, B. The Effectivity of Color for Chroma-Key Techniques. Procedia Comput. Sci. 2021,
179, 281–288. [CrossRef]
67. Zhi, J. An Alternative Green Screen Keying Method for Film Visual Effects. Int. J. Multimed. Its Appl. 2015, 7, 1–12. [CrossRef]
68. Sengupta, S.; Jayaram, V.; Curless, B.; Seitz, S.; Kemelmacher-Shlizerman, I. Background Matting: The World Is Your Green Screen.
In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19
June 2020; pp. 2288–2297. [CrossRef]
69. Barczak, A.L.C.; Reyes, N.H.; Abastillas, M.; Piccio, A.; Susnjak, T. A New 2D Static Hand Gesture Colour Image Dataset for ASL
Gestures; Massey University: Palmerston North, New Zealand, 2011; Volume 15, Available online: https://mro.massey.ac.nz/
server/api/core/bitstreams/09187662-5ebe-4563-8515-3d7e5e1d2a33/content (accessed on 2 February 2023).
70. Marcel, S. Hand Posture Recognition in a Body-Face Centered Space. In CHI’99 Extended Abstracts on Human Factors in Computing
Systems; Association for Computing Machinery: New York, NY, USA, 1999.
71. Pisharady, P.K.; Vadakkepat, P.; Loh, A.P. Attention Based Detection and Recognition of Hand Postures against Complex
Backgrounds. Int. J. Comput. Vis. 2013, 101, 403–419. [CrossRef]
72. Güler, O.; Yücedağ, İ. Hand Gesture Recognition from 2D Images by Using Convolutional Capsule Neural Networks. Arab. J. Sci.
Eng. 2022, 47, 1211–1225. [CrossRef]
73. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.;
Farhan, L. Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions. J. Big Data 2021,
8, 53. [CrossRef] [PubMed]
74. Agarap, A.F. Deep Learning Using Rectified Linear Units (ReLU). arXiv 2018, arXiv:1803.08375.
75. Subburaj, S.; Murugavalli, S. Survey on Sign Language Recognition in Context of Vision-Based and Deep Learning. Meas. Sens.
2022, 23, 100385. [CrossRef]
76. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from
Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
77. Poojary, R.; Pai, A. Comparative Study of Model Optimization Techniques in Fine-Tuned CNN Models. In Proceedings of the
2019 International Conference on Electrical and Computing Technologies and Applications, ICECTA 2019, Ras Al Khaimah,
United Arab Emirates, 19–21 November 2019; pp. 2–5. [CrossRef]
78. Ozdemir, M.A.; Kisa, D.H.; Guren, O.; Onan, A.; Akan, A. EMG Based Hand Gesture Recognition Using Deep Learning. In
Proceedings of the TIPTEKNO 2020—Tip Teknolojileri Kongresi—2020 Medical Technologies Congress, TIPTEKNO 2020, Antalya,
Turkey, 19–20 November 2020. [CrossRef]
79. Theckedath, D.; Sedamkar, R.R. Detecting Affect States Using VGG16, ResNet50 and SE-ResNet50 Networks. SN Comput. Sci.
2020, 1, 79. [CrossRef]
80. Esi Nyarko, B.N.; Bin, W.; Zhou, J.; Agordzo, G.K.; Odoom, J.; Koukoyi, E. Comparative Analysis of AlexNet, Resnet-50, and
Inception-V3 Models on Masked Face Recognition. In Proceedings of the 2022 IEEE World AI IoT Congress, AIIoT 2022, Seattle,
WA, USA, 6–9 June 2022; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2022; pp. 337–343.
81. Hossain, B.; Sazzad, S.M.H.; Islam, M.; Akhtar, N.; Aziz, A.; Attique, M.; Tariq, U.; Nam, Y.; Nazir, M.; Jeong, C.W.; et al. An
Ensemble of Optimal Deep Learning Features for Brain Tumor Classification. In Proceedings of the 2019 International Conference
on Electrical and Computing Technologies and Applications, ICECTA 2019, Ras Al Khaimah, United Arab Emirates, 19–21
November 2019; Volume 211, pp. 2–5. [CrossRef]
82. Muslikhin, M.; Horng, J.R.; Yang, S.Y.; Wang, M.S.; Awaluddin, B.A. An Artificial Intelligence of Things-based Picking Algorithm
for Online Shop in the Society 5.0's Context. Sensors 2021, 21, 2813. [CrossRef] [PubMed]
83. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.