3D-HandFuser: RGBD Fusion for Two-Hand Arabic Sign Language Recognition via 3D Hand Mesh and Transformer-Based Temporal Modeling

Munazza Aziz1,2†, Hui Liu3†, Shaheryar Najam4†, Mohammed Alshehri5†, Yahya AlQahtani5†, Abdulmonem Alshahrani6†, Nouf Abdullah Almujally7†

1 Guodian Nanjing Automation Co., Ltd, Nanjing, China.
2 Department of Biomedical Engineering, Riphah International University, Pakistan.
3 Cognitive Systems Lab, University of Bremen, Bremen, 28359, Germany.
4 Department of Electrical Engineering, Bahria University, Islamabad, 44000, Pakistan.
5 Department of Computer Science, King Khalid University, Abha 61421, Saudi Arabia.
6 Department of Informatics and Computer Systems, King Khalid University, Abha 61421, Saudi Arabia.
7 Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh, 11671, Saudi Arabia.

Contributing authors: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]

†These authors contributed equally to this work.

Abstract
Sign language recognition (SLR) is vital for improving communication for individuals with hearing and speech impairments. However, traditional computer vision-based SLR systems face challenges such as occlusions and dynamic hand movements. To address these issues, we propose 3D-HandFuser, a multi-modal framework for Arabic Sign Language (ArSL) that combines RGB and depth data for improved recognition. Our approach processes both modalities independently, extracting features that are fused for classification. We use a deep learning model for hand keypoint detection from RGB images, mapping the keypoints to a 3D hand mesh using the SMPL-MANO model. For depth images, we apply a novel Gaussian Mixture Model-based ellipse fitting method for hand extraction and use keypoint detection techniques such as MSER, FAST, GFTT, and SIFT. We then extract features with HOG and Gabor filters, which are fused into a unified vector and processed by a Transformer-based architecture for effective gesture classification. Our framework is evaluated on three benchmark ArSL datasets (KArSL-100, KArSL-190, and mArSL), achieving classification accuracies of 79.3%, 65.8%, and 78.3%, respectively. The results highlight the effectiveness of our multi-modal approach for robust sign language recognition.

Keywords: Gesture Recognition, Hand Pose Estimation, 2.5D Skeletal Modeling, Multi-Keypoint Extraction, Ant Colony Optimization (ACO), Arabic Sign Language Recognition.

1 Introduction
Sign language serves as a vital means of communication for individuals with hearing
and speech impairments, enabling interaction through structured hand movements
and gestures. The demand for automated sign language recognition has surged with
advancements in artificial intelligence, aiming to bridge communication gaps and
enhance inclusivity [1]. Unlike spoken languages, sign languages incorporate both
manual components (hand gestures) and non-manual elements (facial expressions,
body posture), making recognition a complex and multi-faceted task [2]. Among
these languages, Arabic Sign Language (ArSL) plays a crucial role in the communi-
cation of Arabic-speaking deaf communities but remains relatively underexplored in
computational research [3], despite its relevance to public safety, security, and statistical purposes [7, 8].
Sign language recognition faces challenges like hand shape variations, two-handed
gestures, lighting, occlusions, and limited annotated ArSL datasets [4, 5]. Image-based
recognition is hindered by background clutter, skin tone variations, and perspective
distortions [6]. Dynamic gestures add complexity due to motion, rapid transitions, and
overlaps [7], while video-based methods are affected by motion blur, frame rate varia-
tions, and occlusions [8]. Traditional techniques struggle with these issues, highlighting
the need for advanced multi-modal approaches that combine spatial and temporal
information for robust recognition [9].
Recent advancements in deep learning and computer vision have enhanced gesture
recognition through RGB and depth data fusion. RGB images provide rich texture
and shape information [10], while depth-based approaches capture 3D hand structures
independent of lighting [11]. Combining both modalities enables more comprehen-
sive feature extraction, improving robustness and accuracy [12]. However, challenges

remain in effective RGB-depth fusion, handling occlusions, and maintaining accuracy
across diverse environments, highlighting the need for improved multi-modal learning
strategies.
In this work, we propose 3D-HandFuser, a two-hand gesture recognition frame-
work for Arabic Sign Language using a multi-modal approach that combines RGB and
depth data to enhance performance. RGB images are processed with a deep learning
model for hand keypoint detection, mapping them to the SMPL-MANO model for 3D
hand mesh generation. For depth images, a GMM-based ellipse fitting method detects
hands, followed by keypoint extraction and feature computation. The features from
both modalities are early fused and processed through Transformer-based architec-
ture for gesture classification, capturing both spatial and temporal hand movement
characteristics for robust recognition.
The major contributions and highlights presented in this paper are summarized as
follows.
• Multi-Modal Gesture Recognition Framework: We propose 3D-HandFuser, a two-
hand static and dynamic gesture recognition system for Arabic Sign Language
(ArSL) that integrates RGB and depth modalities, ensuring robust and accurate
recognition across diverse conditions.
• Novel Hand Localization via GMM-Based Ellipse Fitting: A novel technique is intro-
duced where GMM-based ellipse fitting is applied on the complete body in the depth
image. Each ellipse corresponds to a different body part, and the component with
the minimum depth from the sensor is selected as the hand. The resulting region is
used to extract the hand silhouette.
• 3D Hand Representation and Feature Extraction: We use a deep learning model
to detect hand keypoints from RGB images and map them to the SMPL-MANO
model. For depth images, we apply GMM-based ellipse fitting, followed by keypoint
detection and feature extraction.
• Early Fusion and Transformer-Based Classification: Keypoints and features from
both modalities are fused into a unified vector and processed by a Transformer-
based model for accurate gesture classification, capturing spatial and temporal
characteristics.

2 Literature Review
RGB-based hand gesture recognition has gained attention due to the accessibility of
conventional cameras and deep learning techniques. Early methods focused on color
segmentation and contour-based approaches to detect hand regions [13] but strug-
gled with varying lighting and complex backgrounds. The use of convolutional neural
networks (CNNs) for feature extraction has since improved accuracy, with models
like Molchanov et al.’s [14] CNN-based architecture that learned spatiotemporal fea-
tures from RGB sequences, enhancing dynamic gesture recognition. More recently,
Transformer-based models such as Vision Transformers (ViTs) have been applied to
improve performance, with Kim et al. [15] integrating self-attention mechanisms to
capture long-range dependencies in hand movements. Despite progress, RGB-based

methods remain sensitive to occlusions and lighting changes, limiting robustness in
real-world scenarios.
Depth-based hand gesture recognition has gained attention due to its ability
to capture three-dimensional hand structures, reducing the influence of background
noise. Depth sensors, such as Microsoft Kinect and Intel RealSense, provide depth
maps that enhance segmentation and feature extraction. Erol et al. [16] developed
an approach using depth maps to extract hand contours and applied hidden Markov
models (HMMs) for gesture classification. Similarly, Oikonomidis et al. [17] used depth
images to perform model-based hand pose estimation, achieving high accuracy in
gesture interpretation. With the rise of deep learning, depth-based CNNs have been
proposed for more robust gesture recognition. For instance, Wan et al. [18] introduced
a 3D CNN that directly processed depth maps, learning spatiotemporal features for
dynamic gestures. Additionally, generative models such as Gaussian Mixture Models
(GMMs) have been utilized for depth-based segmentation, as demonstrated by Yin et
al. [19], who applied GMMs to fit hand shapes and extract key points for recognition.
Combining RGB and depth (RGB-D) data enhances gesture recognition by lever-
aging both modalities, improving robustness against occlusions and lighting variations.
Neverova et al. [20] proposed a multimodal deep learning approach using late fusion,
improving accuracy by combining both features. Huang et al. [21] used a two-stream
CNN to process RGB and depth separately before combining them for classification.
Li et al. [22] introduced a Transformer-based architecture to align RGB and depth fea-
tures, enhancing gesture recognition in cluttered environments. This hybrid approach
has shown promise in sign language recognition and HCI applications.
Further, hand mesh models provide a structured representation of hand gestures,
improving accuracy in pose estimation and fine-grained gesture recognition. SMPL
and MANO models have been widely used to generate parametric hand meshes from
RGB or depth data. Romero et al. [23] introduced the MANO model, a paramet-
ric 3D hand model trained on real-world hand shapes. This model has been used in
conjunction with deep learning techniques to infer hand poses from RGB and depth
images. Similarly, Zhou et al. [24] proposed a hybrid approach that combined CNN-
based feature extraction with mesh reconstruction, achieving high accuracy in hand
tracking. Recent advancements have integrated Graph Neural Networks (GNNs) with
hand meshes to refine hand pose estimation. Ge et al. [25] developed a GNN-based
approach that improved mesh reconstruction from RGB images, enhancing gesture
recognition for virtual reality (VR) applications.

3 Proposed Architecture
In this work, we propose 3D-HandFuser, a hybrid approach for real-time hand ges-
ture recognition using both RGB and depth frames. Our methodology leverages the
strengths of both modalities to enhance the accuracy and robustness of hand ges-
ture detection. The process is divided into several key stages: preprocessing, keypoints
extraction, 3D hand creation, feature extraction, feature fusion, and classification using
a Transformer network. The overall architecture of the proposed system
is shown in Fig. 1.

Fig. 1 Processing pipeline for 3D Hand Mesh and Transformer-Based Temporal Modeling

3.1 RGB Image Processing


RGB image processing includes the following sequence of steps:

3.1.1 Preprocessing


RGB preprocessing starts with denoising using a bilateral filter to reduce noise while
preserving edges (Fig. 2(b)). Next, the image is sharpened with Gaussian Blur and
unsharp masking (Fig. 2(c)). Contrast and brightness are enhanced through histogram
equalization in the YCbCr color space (Fig. 2(d)). Face detection is then performed
using the Haar Cascade algorithm, and detected faces are replaced with a solid color
to eliminate their influence using Eq. 1. (Fig. 3(a)).

I_removed(x, y) = C · χ_B(x, y) + I(x, y) · (1 − χ_B(x, y))     (1)

where I_removed(x, y) is the image with faces removed, χ_B(x, y) is the indicator function of the detected face region B, and C is the solid color used to replace the faces. Following this, skin detection is performed by isolating regions of
skin tone in the YCbCr color space. This is done by defining a range for skin color,
represented using Eq. (2) and it is shown in Fig. 3(b).

Skin Mask = inRange(IYCbCr , lower skin, upper skin) (2)


Morphological operations such as dilation and erosion then refine the skin mask by smoothing boundaries and removing small artifacts (Fig. 3 (c) and
3 (d)). Connected component analysis eliminates small components (less than 700
pixels) likely to be noise using Eq. (3) as shown in Fig. 3 (e).
I_refined(x, y) = 1 if the connected component area ≥ 700, and 0 otherwise     (3)

Fig. 2 Preprocessing: (a) Original images, (b) Denoised images (c) Sharpened images, (d) Enhanced
images using histogram equalization

Fig. 3 RGB hand silhouette extraction: (a) Face detection using Haar Cascade classifier, (b) Skin
mask after face removal, (c) Dilated mask (d) Eroded mask (e) Refined mask (f) RGB hand silhouette

where the connected components are labeled, and small areas are removed based
on their size. The processed RGB hand silhouette is shown in Fig. 3(f).
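For illustration, this preprocessing chain can be sketched in Python with OpenCV roughly as follows; the 700-pixel area threshold follows the text, while the filter parameters and the skin-color bounds are assumptions rather than the paper's exact settings (note that OpenCV orders the channels as Y, Cr, Cb):

import cv2
import numpy as np

def preprocess_rgb(frame, face_cascade):
    # Denoise while preserving edges, then sharpen via unsharp masking.
    denoised = cv2.bilateralFilter(frame, 9, 75, 75)
    blurred = cv2.GaussianBlur(denoised, (0, 0), 3)
    sharp = cv2.addWeighted(denoised, 1.5, blurred, -0.5, 0)

    # Histogram equalization on the luma channel (YCbCr space).
    ycrcb = cv2.cvtColor(sharp, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    enhanced = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

    # Replace detected faces with a solid color (Eq. 1).
    gray = cv2.cvtColor(enhanced, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.1, 5):
        enhanced[y:y + h, x:x + w] = (0, 0, 0)

    # Skin segmentation in YCbCr (Eq. 2); the bounds are typical values.
    ycrcb = cv2.cvtColor(enhanced, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, np.array([0, 133, 77], np.uint8),
                       np.array([255, 173, 127], np.uint8))

    # Morphological refinement and removal of components < 700 px (Eq. 3).
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.erode(cv2.dilate(mask, kernel), kernel)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    refined = np.zeros_like(mask)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] >= 700:
            refined[labels == i] = 255
    return cv2.bitwise_and(enhanced, enhanced, mask=refined)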

3.1.2 3D Keypoints Extraction


Given an input RGB hand silhouette SRGB , the first step involves extracting keypoints
corresponding to anatomical landmarks of the hand. A deep learning-based hand pose
estimation model [? ] is employed to detect crucial joints and fingertip positions as
shown in Figure 4(c). The extracted 3D keypoints KRGB are represented using Eq. (4).

KRGB = {(xi , yi , zi ) | i = 1, . . . , N } (4)


where N represents the number of detected keypoints, typically 21 per hand.
To normalize the key points for scale invariance, the mean and standard deviation
normalization is applied using Eq. (5).

K̂_RGB = (K_RGB − µ(K_RGB)) / σ(K_RGB)     (5)

where µ(KRGB ) and σ (KRGB ) denote the mean and standard deviation of the
keypoints, respectively.
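The paper does not specify the hand pose estimator here, so the following sketch uses MediaPipe Hands as an illustrative stand-in to obtain the 21 keypoints per hand of Eq. (4) and apply the normalization of Eq. (5):

import numpy as np
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2)

def extract_keypoints(rgb_silhouette):
    # Returns an (N, 3) array of (x, y, z) landmarks, 21 per detected hand.
    result = hands.process(rgb_silhouette)  # expects an RGB uint8 image
    if not result.multi_hand_landmarks:
        return np.zeros((0, 3), dtype=np.float32)
    pts = [[lm.x, lm.y, lm.z]
           for hand in result.multi_hand_landmarks
           for lm in hand.landmark]
    return np.asarray(pts, dtype=np.float32)

def normalize_keypoints(k):
    # Zero-mean, unit-variance normalization of Eq. (5).
    return (k - k.mean(axis=0)) / (k.std(axis=0) + 1e-8)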

Fig. 4 Keypoint extraction: (a) Original frame (b) RGB hand silhouette (c) Extracted hand keypoints

3.1.3 3D Hand Mesh using SMPL MANO


Once the 3D keypoints are extracted, they are used as input to the MANO (hand Model with Articulated and Non-rigid defOrmations) model [? ], which reconstructs the corresponding
3D hand mesh representation. The MANO model parameterizes the hand mesh HRGB
using Eq. (6):

HRGB = M (θ, β ) (6)


where M(·) is the MANO hand model function, θ ∈ R^48 represents the pose parameters (joint rotations in axis-angle format), and β ∈ R^10 represents the shape
parameters (individual hand variations). The hand mesh structure is modeled using
Eq. (7):

V = W_pose(θ) + W_shape(β) + ϵ     (7)


where Wpose (θ) accounts for deformations due to finger articulation, Wshape (β )
captures static hand shape variations, and ϵ represents residual modeling noise. From
the generated 3D hand mesh, keypoints corresponding to the joints and fingertips are
extracted using a linear regression mapping defined in Eq. (8):

K3D = J · V (8)
where J is a predefined regression matrix that maps mesh vertices to joint locations.
Inverse kinematics (IK) refines the 3D keypoints by minimizing the deviation from the
provided 3D keypoints while maintaining biomechanical constraints using the following
optimization objective in Eq. (9):
min_θ Σ_{i=1}^{N} ∥Π K_{3D,i} − K_{RGB,i}∥² + λ∥θ∥²     (9)

where K_{RGB,i} are the input 3D keypoints detected from the RGB image, K_{3D,i} are the corresponding keypoints generated by the MANO model, and λ is a regularization weight that discourages unnatural hand poses.
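A minimal sketch of this refinement step, assuming a mano_forward function that maps pose parameters to 3D joint locations (the projection Π is omitted for brevity), could look as follows:

import numpy as np
from scipy.optimize import minimize

def refine_pose(theta_init, k_rgb, mano_forward, lam=1e-3):
    # Minimize ||K_3D(theta) - K_RGB||^2 + lam * ||theta||^2 over theta (Eq. 9).
    def objective(theta):
        k3d = mano_forward(theta)               # (N, 3) joints from the MANO model
        data_term = np.sum((k3d - k_rgb) ** 2)  # keypoint alignment
        reg_term = lam * np.sum(theta ** 2)     # discourage unnatural poses
        return data_term + reg_term
    result = minimize(objective, theta_init, method="L-BFGS-B")
    return result.x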
To enhance stability in motion tracking, the reconstructed 3D hand keypoints
undergo temporal smoothing. A Kalman filter is employed to estimate the true
keypoint trajectory using Eq. (10):

K3D,t = A · K3D,t−1 + B · ut + wt (10)


where A models temporal dependencies, and B represents external inputs. The
One-Euro filter adjusts smoothing based on motion velocity using Eq. (11):

K̂3D,t = αK3D,t + (1 − α)K̂3D,t−1 (11)


where α is computed dynamically using Eq. (12):
α = 1 / (1 + cutoff / (speed + ϵ))     (12)

This ensures that higher-speed movements retain more responsiveness, while lower-speed movements remain smooth. Fig. 5 shows the 3D hand mesh for four different poses.

Fig. 5 3D hand mesh for different hand poses: (a) Five fingers (b) Two fingers (c) Four fingers
(d) Both hands involving five fingers
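The adaptive smoothing of Eqs. (11)-(12) can be sketched as follows; the cutoff value is an assumption, and the speed term is taken simply as the frame-to-frame keypoint displacement:

import numpy as np

def one_euro_smooth(k_t, k_prev, k_smooth_prev, cutoff=1.0, eps=1e-6):
    # alpha grows with motion speed: fast motion stays responsive, slow motion stays smooth.
    speed = np.linalg.norm(k_t - k_prev)          # motion magnitude between frames
    alpha = 1.0 / (1.0 + cutoff / (speed + eps))  # Eq. (12)
    return alpha * k_t + (1.0 - alpha) * k_smooth_prev  # Eq. (11)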

3.2 Depth Image Processing
The depth images provide information about the distance of different body parts of
the signer performing the gesture. The pipeline for depth image processing includes
the following steps:

3.2.1 Preprocessing
In this study, depth-based preprocessing is applied to enhance image quality through
normalization and noise reduction. Normalization ensures consistency by scaling depth
values within a fixed range using Eq. (13).

D′(x, y) = (D(x, y) − D_min) / (D_max − D_min)     (13)
where Dmin and Dmax denote the minimum and maximum depth values, respec-
tively. To suppress noise while preserving edges, a Gaussian filter is used as described
in Eq. (14).
D_filtered(x, y) = Σ_{i,j} D(i, j) · G(i, j)     (14)
where G(i, j ) is the Gaussian kernel.
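A compact sketch of this step (Eqs. 13-14), with an assumed 5x5 Gaussian kernel, is shown below:

import cv2
import numpy as np

def preprocess_depth(depth):
    depth = depth.astype(np.float32)
    d_min, d_max = depth.min(), depth.max()
    normalized = (depth - d_min) / (d_max - d_min + 1e-8)  # Eq. (13)
    return cv2.GaussianBlur(normalized, (5, 5), 1.0)       # Eq. (14)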

3.2.2 GMM for Hand Silhouette Extraction


The process of silhouette extraction and segmentation involves several mathemati-
cal transformations and probabilistic modeling techniques. The first step is resizing
the depth image to a uniform size (W, H ), ensuring consistency across all images. A
centered bounding box defines the ROI, standardizing input for Grab Cut [? ] and
GMM [? ]. An extended box initializes Grab Cut, as defined in Eq. (15).
 
B = (W/m, H/n, pW/m, qH/n)     (15)
where m, n, p, and q are scaling factors that define the ROI, which is iteratively
refined during segmentation. Silhouette extraction uses Grab Cut, guided by an energy
minimization function defined using Eq. (16).
arg min_L Σ_i U(I_i, L_i) + Σ_{i,j} V(I_i, L_j)     (16)
where L labels pixels as foreground or background, U (Ii , Li ) estimates class like-
lihoods using color models, and V (Ii , Lj ) enforces spatial coherence by penalizing
abrupt changes. The optimization refines segmentation, and a binary mask extracts
foreground coordinates for body part identification, with GMM categorizing regions
using Eq. (17).
P(x) = Σ_{i=1}^{k} w_i · N(x | µ_i, Σ_i)     (17)

where P (x) is the probability density function, wi is the weight of the i-th Gaus-
sian, and N (x | µi , Σi ) is the Gaussian distribution. Pixels are assigned to the
highest-probability cluster, and ellipses are fitted based on the covariance matrix Σ.
Eigenvalues λ1 and λ2 define size and orientation, with eigenvectors determining the
angle, as given by Eq. (18):

Θ = tan⁻¹(v_21 / v_11)     (18)
where v21 and v11 are the components of the first eigenvector. The lengths of the
major and minor axes of the ellipse are determined using Eq. (19):
a = 2√λ_1,   b = 2√λ_2     (19)
where a and b represent the major and minor axes, respectively. Ellipses with
axes a and b approximate body parts, while color mapping highlights clusters in the
silhouette. A combination of GMMs, shape modeling, and energy minimization enables
accurate hand extraction and localization, even under occlusion (Fig. 6(e)), supporting
gesture and biomechanical analysis.
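The GMM modeling and ellipse fitting of Eqs. (17)-(19) can be sketched with scikit-learn as follows; the number of mixture components is an assumption, and the hand is selected as the component with the smallest mean depth, as described above:

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_body_part_ellipses(mask, depth, k=6):
    ys, xs = np.nonzero(mask)                              # foreground pixel coordinates
    coords = np.column_stack([xs, ys]).astype(np.float64)
    gmm = GaussianMixture(n_components=k, covariance_type="full").fit(coords)
    labels = gmm.predict(coords)

    ellipses, mean_depths = [], []
    for i in range(k):
        eigvals, eigvecs = np.linalg.eigh(gmm.covariances_[i])
        a, b = 2 * np.sqrt(eigvals[::-1])                   # major/minor axes, Eq. (19)
        theta = np.arctan2(eigvecs[1, -1], eigvecs[0, -1])  # orientation, Eq. (18)
        ellipses.append((gmm.means_[i], a, b, theta))
        pix = coords[labels == i].astype(int)
        mean_depths.append(depth[pix[:, 1], pix[:, 0]].mean())

    hand_idx = int(np.argmin(mean_depths))                  # component closest to the sensor
    return ellipses, hand_idx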

Fig. 6 Depth hand silhouette extraction: (a) Original frame (b) Extracted human silhouette (c)
Fitting of ellipses (d) Labelling of body parts (e) Extracted hand silhouette

3.2.3 Keypoint and Image Feature Extraction


For depth image processing, keypoint detection algorithms like MSER, GFTT, SIFT,
and FAST were used to ensure precise localization under varying conditions. Features
such as HOG and Gabor filters capture textures and edges for accurate hand pose
estimation and gesture recognition. Keypoints mark anatomical landmarks (e.g., fin-
gertips, joints), while features enhance texture analysis. Combined with a 3D hand
mesh, they enable detailed modeling for reliable pose estimation in applications like
sign language recognition and virtual reality.
MSER (Maximally Stable Extremal Regions) Keypoint Detection
MSER [? ] is applied to detect stable regions in depth images. The depth image
is binarized for various threshold values, and connected components are extracted.
Regions whose areas change slowly across thresholds are considered stable and are

detected as keypoints. The stability of each connected component is measured using
the following stability score:

S(T) = |A(T + ϵ) − A(T)| / A(T)     (20)
where A(T ) is the area of the connected component at threshold T , and ϵ is a
small threshold change. MSER will output the set of stable keypoints KMSER for the
depth image. The extracted keypoints using MSER are shown in Fig. 7(a) and Fig.
7(b).

Fig. 7 MSER keypoints (a) MSER keypoints on gesture involving single hand (b) MSER keypoints
on gesture involving both hands

GFTT (Good Features to Track) Keypoint Detection


The GFTT [? ] method is applied to detect corners in the depth image. Using the
second moment matrix, we calculate the corner response function for each pixel. The
keypoints are those pixels with the highest corner response values. The corner response
function R is defined as:

R = det(M ) − k · (trace(M ))2 (21)


where M is the second-moment matrix for the pixel, k is a constant (k = 0.04),
and det(M ) and trace(M ) are the determinant and trace of M , respectively. The

detected keypoints from GFTT are represented as KGFTT . The extracted keypoints
using GFTT are shown in Fig. 8(a) and Fig. 8(b).

Fig. 8 GFTT keypoints (a) Gesture involving single hand (b) Gesture involving both hands

FAST (Features from Accelerated Segment Test) Keypoint Detection
FAST [? ] is used to detect corner-like features in the depth image. For each pixel,
a circular neighborhood of 16 pixels is examined, and if at least 12 pixels in the
neighborhood differ significantly in intensity from the central pixel, it is considered a
keypoint. The decision rule for FAST is:
FAST(p) = Corner if N_t ≥ 12, and Not a corner otherwise     (22)

where p is the central pixel, and Nt is the number of contiguous pixels in the
Bresenham circle that are brighter or darker than p by a threshold t. If 12 out of 16
pixels meet this, p is a corner. FAST keypoints, KFAST , are shown in Fig. 9(a) and
Fig. 9(b).

Fig. 9 FAST keypoints (a) FAST keypoints on gesture involving single hand (b) FAST keypoints
on gesture involving both hands

SIFT (Scale-Invariant Feature Transform) Keypoint Detection


SIFT [? ] detects keypoints by identifying scale-space extrema in the image using
the difference-of-Gaussian (DoG) function. The keypoints are identified as local max-
ima/minima in the scale-space and are further refined using orientation assignment.
The difference-of-Gaussian (DoG) function is defined using Eq. (23):

D(x, y, σ ) = G(x, y, kσ ) − G(x, y, σ ) (23)


where G(x, y, σ ) is the Gaussian function at scale σ , and k is the scaling factor. The
keypoints detected by SIFT are denoted as KSIFT , and they are shown in Fig. 10(a)
and Fig. 10(b).
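The four detectors described above can be run through their standard OpenCV implementations; the sketch below is illustrative, with detector parameters (corner counts, thresholds) chosen as assumptions rather than the paper's settings, and the GFTT call mirroring the response of Eq. (21) with k = 0.04:

import cv2

def detect_depth_keypoints(img):
    # img: single-channel uint8 depth silhouette of the extracted hand.
    keypoints = {}

    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(img)
    keypoints["MSER"] = [r.mean(axis=0) for r in regions]   # region centroids

    corners = cv2.goodFeaturesToTrack(img, maxCorners=50, qualityLevel=0.01,
                                      minDistance=5, useHarrisDetector=True, k=0.04)
    keypoints["GFTT"] = [] if corners is None else corners.reshape(-1, 2)

    fast = cv2.FastFeatureDetector_create(threshold=20)
    keypoints["FAST"] = [kp.pt for kp in fast.detect(img, None)]

    sift = cv2.SIFT_create()
    kps, _ = sift.detectAndCompute(img, None)
    keypoints["SIFT"] = [kp.pt for kp in kps]
    return keypoints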

Histogram of Oriented Gradients (HOG) Keypoint Detection


HOG [? ] extracts features by computing the gradient magnitudes and orientations of
an image, followed by spatially pooling and normalizing the orientation histograms.
The gradient magnitude M (x, y ) and orientation θ(x, y ) at each pixel are computed
using Eq. (24) and Eq. (25):
M(x, y) = √(I_x² + I_y²)     (24)

Fig. 10 SIFT keypoints (a) SIFT keypoints on gesture involving single hand (b) SIFT keypoints
on gesture involving both hands

 
θ(x, y) = tan⁻¹(I_y / I_x)     (25)
where Ix and Iy are the gradients computed using the Sobel operator in the hor-
izontal and vertical directions. The image is divided into p × p cells, with gradient
histograms weighted by magnitudes. Overlapping b × b blocks normalize contrast.
The final HOG descriptor, FHOG , is formed by concatenating the normalized his-
tograms (Fig. 11a, 11b). Next, we calculated HOG features, which capture hand
shape by analyzing edge directions and gradients. The Gradient Orientation Histogram
distinguishes hand shapes and finger positions in localized regions using Eq. (26):
H = Σ_{i=1}^{N} θ_i     (26)
where θ_i represents the orientation of the gradient at pixel i. Edge Density (D_E = Edge pixels / Total pixels) measures the proportion of edge pixels, indicating gesture complexity. Higher density indicates open-hand gestures, while lower density suggests
closed-fist gestures. HOG Entropy measures gradient randomness, with higher entropy
for complex gestures and lower values for simpler shapes, as given in Eq. (27):
H_G = −Σ_i P_i log(P_i)     (27)

Fig. 11 HOG features (a) HOG features on gesture involving single hand (b) HOG features on
gesture involving both hands

where Pi is the probability of a given gradient orientation occurring in the image.
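A sketch of these HOG-based descriptors, including the edge-density and orientation-entropy statistics of Eqs. (26)-(27), might look as follows; the cell and block sizes and the Canny thresholds are assumptions:

import cv2
import numpy as np
from skimage.feature import hog

def hog_features(gray):
    # Standard HOG descriptor over the grayscale hand image.
    descriptor = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                     cells_per_block=(2, 2), feature_vector=True)

    # Edge density: proportion of edge pixels in the image.
    edges = cv2.Canny(gray, 50, 150)
    edge_density = np.count_nonzero(edges) / edges.size

    # Orientation entropy from the gradient-direction histogram (Eq. 27).
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    hist, _ = np.histogram(np.arctan2(gy, gx), bins=36)
    p = hist / (hist.sum() + 1e-12)
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
    return descriptor, edge_density, entropy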

Gabor (Gabor Filter-Based Feature Extraction)


Gabor [? ] filters are used to extract texture-based features by analyzing the frequency
and orientation properties of the image. The Gabor filter is a linear filter that applies
a set of Gaussian-modulated sinusoidal wavelets to the image at different scales and
orientations. The Gabor function is defined using Eq. (28):
G(x, y; λ, θ, ψ, σ, γ) = exp(−(x′² + γ²y′²) / (2σ²)) · cos(2πx′/λ + ψ)     (28)
where x′ = x cos θ + y sin θ and y ′ = −x sin θ + y cos θ represent rotated coordinates,
λ is the wavelength of the sinusoidal component, θ is the orientation of the filter, ψ is
the phase offset, and σ is the standard deviation of the Gaussian envelope, as shown
in Fig. 12.
Next, we calculated Gabor features, which capture texture and frequency compo-
nents of hand gestures. The Mean Gabor Response reflects overall texture patterns,
the Maximum Gabor Response highlights dominant edges, and Gabor Energy Across
Scales measures texture variations, distinguishing fine-detail gestures from smooth
ones. This is given by Eq. (29):

E_G = Σ_{s=1}^{S} Σ_{o=1}^{O} |R(s, o)|     (29)
where S represents different scales, and O represents different orientations.
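Gabor feature extraction at the four orientations of Fig. 12 can be sketched as follows; the wavelengths, sigma, and gamma values are illustrative choices:

import cv2
import numpy as np

def gabor_features(gray, wavelengths=(4, 8), thetas_deg=(0, 45, 90, 135)):
    gray = gray.astype(np.float32)
    responses = []
    for lam in wavelengths:
        for theta in thetas_deg:
            kernel = cv2.getGaborKernel((21, 21), sigma=4.0,
                                        theta=np.deg2rad(theta),
                                        lambd=lam, gamma=0.5, psi=0)
            responses.append(cv2.filter2D(gray, cv2.CV_32F, kernel))
    responses = np.stack(responses)
    mean_resp = responses.mean()      # mean Gabor response (overall texture)
    max_resp = responses.max()        # dominant-edge response
    energy = np.abs(responses).sum()  # Gabor energy across scales, Eq. (29)
    return mean_resp, max_resp, energy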

Fig. 12 Gabor features on gesture (a) Gabor features at 0° (b) Gabor features at 45° (c) Gabor
features at 90° (d) Gabor features at 135°

3.3 Early Feature Fusion


Feature fusion integrates RGB and depth-based hand representations for robust gesture recognition. From the RGB pipeline, we obtain 3D keypoints K_3D^RGB using a pretrained model and reconstruct a hand mesh via MANO. The depth-based pipeline extracts keypoints K_3D^depth using a GMM-based ellipse-fitting approach. To align both modalities, transformation matrices T_RGB and T_depth are applied using Eq. (30) and Eq. (31):

K̂_3D^RGB = T_RGB K_3D^RGB     (30)
K̂_3D^depth = T_depth K_3D^depth     (31)
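A minimal sketch of this early-fusion step, assuming the alignment matrices have already been estimated, maps both keypoint sets into a common frame and concatenates them with the depth image features:

import numpy as np

def fuse_features(k_rgb, k_depth, t_rgb, t_depth, depth_features):
    # k_rgb, k_depth: (N, 3) keypoints; t_rgb, t_depth: (3, 3) alignment matrices.
    k_rgb_aligned = (t_rgb @ k_rgb.T).T        # Eq. (30)
    k_depth_aligned = (t_depth @ k_depth.T).T  # Eq. (31)
    return np.concatenate([k_rgb_aligned.ravel(),
                           k_depth_aligned.ravel(),
                           np.asarray(depth_features).ravel()])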

3.4 Transformer-based Classification


The proposed model uses a Transformer-based encoder for hand gesture classification,
processing multimodal RGB and depth features. Key components include multi-head
self-attention, feedforward networks, residual connections, and layer normalization,

with fused features embedded into a high-dimensional latent space via a learnable
transformation using Eq. (32):

F′ = WF + P (32)
where W is a learnable projection matrix that maps the feature vector to the
embedding space, and P preserves spatial relationships. The encoder uses multi-head
self-attention (MHSA) to capture long-range dependencies between hand keypoints.
The input is projected into query (Q), key (K), and value (V) spaces with learnable
weight matrices using Eq. (33):

Q = F ′ Wq , K = F ′ Wk , V = F ′ Wv (33)
where Wq , Wk , and Wv are learnable parameter matrices. The self-attention mech-
anism computes a weighted representation of the input features through the scaled
dot-product attention function using Eq. (34) and Eq. (35):

A = softmax(QK^T / √d_k)     (34)

F_att = A V     (35)
where dk is the dimensionality of the key vectors, and the softmax function nor-
malizes the attention scores. Multi-head attention extends this mechanism by applying
multiple parallel attention operations and concatenating the results using Eq. (36):

Fig. 13 Proposed architecture of the transformer encoder

MHSA(Q, K, V ) = Concat(head1 , . . . , headh )Wo (36)


where Wo is the final projection matrix that maps the concatenated attention
outputs back to the original feature space. The attended feature representations are

further refined through a position-wise feedforward network (FFN), which applies
non-linear transformations to enhance feature expressiveness using Eq. (37):

Fffn = σ (Fatt W1 + b1 )W2 + b2 (37)

Table 1 Algorithm for Integrated RGB-Depth Hand Gesture Classification with Transformer

Algorithm 1: Integrated RGB-Depth Hand Gesture Classification with Transformer


Input: ImageRGB : An RGB image, ImageDepth : A depth image
Output: The result of gesture classification (recognized gesture).
Algorithm:
procedure MAIN(ImageRGB, ImageDepth)
    results ← []
    RGB ← PROCESS_RGB(ImageRGB)
    depth ← PROCESS_DEPTH(ImageDepth)
    features_fused ← FUSE_FEATURES(RGB, depth)
    features_att ← APPLY_ATTENTION(features_fused)
    classification ← APPLY_TRANSFORMER(features_att)
    append classification to results
    return results
end procedure
procedure PROCESS_RGB(ImageRGB)
    Image_processed ← PREPROCESS(ImageRGB)
    hand_mask ← DETECT_HAND(Image_processed)
    keypoints_3D ← ESTIMATE_KEYPOINTS(hand_mask)
    vertices ← RECONSTRUCT_3D_MESH(keypoints_3D)
    pose_refined ← OPTIMIZE_POSE(vertices, keypoints_3D)
    pose_smooth ← SMOOTH_POSE(pose_refined)
    return pose_smooth
end procedure
procedure PROCESS_DEPTH(ImageDepth)
    silhouette ← PREPROCESS(ImageDepth)
    keypoints_Depth ← EXTRACT_KEYPOINTS(silhouette)
    features_Depth ← EXTRACT_FEATURES(silhouette)
    return (keypoints_Depth, features_Depth)
end procedure
procedure APPLY_TRANSFORMER(features)
    for each encoder layer do
        features ← ENCODER_LAYER(features)
    end for
    output ← FCC_LAYER(features)
    return output
end procedure

where W1 and W2 are weight matrices, b1 and b2 are biases, and σ is an activation
function such as ReLU or GELU. To stabilize training, layer normalization and residual
connections are applied using Eq. (38):

Fenc = LayerNorm(Fatt + Fffn ) (38)

where the residual connection ensures gradient flow through the network. The encoder stacks multiple such layers, each combining self-attention, a position-wise feedforward network, layer normalization, and residual connections. The transformer architecture is
shown in Fig. 13. The algorithm for the overall system is given in Table 1.
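Since the model was developed with Keras, the encoder described above can be sketched roughly as follows; the embedding size, head count, and number of layers are assumptions rather than the paper's exact configuration:

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_gesture_encoder(seq_len, feat_dim, num_classes,
                          d_model=128, num_heads=4, ff_dim=256, num_layers=2):
    inputs = layers.Input(shape=(seq_len, feat_dim))
    # Learnable projection W plus positional embedding P (Eq. 32).
    x = layers.Dense(d_model)(inputs)
    pos_idx = layers.Lambda(lambda t: tf.range(tf.shape(t)[1]))(inputs)
    x = x + layers.Embedding(seq_len, d_model)(pos_idx)

    for _ in range(num_layers):
        # Multi-head self-attention (Eqs. 33-36) with residual + layer norm (Eq. 38).
        attn = layers.MultiHeadAttention(num_heads=num_heads,
                                         key_dim=d_model // num_heads)(x, x)
        x = layers.LayerNormalization()(x + attn)
        # Position-wise feedforward network (Eq. 37).
        ffn = layers.Dense(ff_dim, activation="gelu")(x)
        ffn = layers.Dense(d_model)(ffn)
        x = layers.LayerNormalization()(x + ffn)

    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs)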

4 Experimental Setup and Results


4.1 Experimental Setup
Model training was conducted on a Windows 11 PC with an Intel Core i7 8th Gen
CPU (3.60 GHz, Turbo up to 4.60 GHz), Nvidia RTX 2070 GPU (2304 CUDA cores,
8 GB GDDR6), and 16 GB DDR4 RAM. Python 3.6 and Keras were used for model
development.

4.2 Datasets
We evaluated the 3D-HandFuser model using three datasets: mArSL, KArSL-100,
and KArSL-190, introduced by Hamzah Luqman et al. in 2022 to improve Arabic
Sign Language (ArSL) recognition. The mArSL dataset contains 6,748 samples of 50
signs performed by four male signers, recorded with RGB, depth, and skeleton data.
KArSL-100 includes 15,000 samples of 100 signs (30 numerals, 39 letters, 31 words)
from three signers, with skeleton, depth, and RGB data. KArSL-190 extends this with
28,500 samples across 190 signs, including digits, letters, and words, also recorded in
three modalities.

4.3 Confusion Matrices


The mArSL dataset (50 classes), shown in Fig. 14(a), performs well with most diagonal
values between 0.88 and 0.94. Class 5 and class 50 reach 0.92 and 0.93, and over 80% of
classes exceed 85% accuracy. Misclassifications are low, such as class 13 with class 14
at 6%, and class 27 with class 26 at 4%. In the KArSL-100 dataset (100 classes), shown
in Fig. 14(b), more than 75 classes score above 80%. Class 3 gets 0.89 and class 99
hits 0.91. Misclassification increases slightly, with class 19 confused as class 21 in 8% of cases, and classes 35 and 37 confused in around 5%. The KArSL-190 dataset (190 classes), in Fig. 14(c), is
more challenging. Class 1 and class 152 score 0.87 to 0.89, with over 110 classes above
75%, though around 40 classes drop below 70%. Major confusions include class 109
and 110 at 11%, and class 173 and 175 at 9%.

4.4 Result Discussion and Comparison


As given in Table 2, the mArSL dataset achieves 78.3% accuracy with precision of 0.68
± 0.18, recall of 0.69 ± 0.17, and F1-score of 0.68 ± 0.17, showing strong performance.
Its clear class structure improves recognition. Compared to KArSL-100 and KArSL-190, mArSL benefits from better intra-class separation. KArSL-100 achieves 79.3% accuracy with average metrics of 0.65 ± 0.20, while KArSL-190 has 65.8% accuracy and lower metrics of 0.56 ± 0.20 due to class overlap. ROC analysis shows strong performance across datasets. For mArSL (Figure 15(a)), the average AUC is 0.88,

with perfect scores for some classes and lower AUCs for others (e.g., class 7: 0.63,
class 11: 0.62). In KArSL-100 (Figure 15(b)), the mean AUC is high, with some classes near-perfect and others moderate (e.g., class 1: 0.62, class 14: 0.60). KArSL-
190 (Figure 15(c)) has a solid AUC of 0.80, with most classes performing well, but
some (e.g., class 12: 0.58, class 100: 0.56) show lower scores, indicating challenges with
complex class sets. Given in Table 4, the proposed model outperforms state-of-the-
art methods in both signer-dependent and signer-independent accuracy and achieved
79.3% signer-independent and 96.8% signer-dependent accuracy on KArSL-100, 65.8%
and 97.9% on KArSL-190, and 78.3% and 93.5% on mArSL. These results highlight
the model’s strong generalization to unseen signers, making it promising for practical
sign language recognition. However, all three datasets involve dynamic gestures with
significant self-occlusion, highlighting the need for further improvement.

Fig. 14 Confusion matrices (a) mArSL dataset (b) KArSL-100 dataset (c) KArSL-190 dataset

Fig. 15 ROC curves (a) mArSL dataset (b) KArSL-100 dataset (c) KArSL-190 dataset

4.5 Ablation Study


As shown in Table 3, the model with all components achieved the highest accuracy
across datasets (79.3% on KArSL-100, 65.8% on KArSL-190, and 78.3% on mArSL).

Table 2 Accuracy, Precision, Recall, and F1-Score on
KArSL-100, KArSL-190, and mArSL Datasets

Metric               KArSL-100      KArSL-190      mArSL
Accuracy             79.3%          65.8%          78.3%
Precision ± S.D.     0.65 ± 0.20    0.56 ± 0.20    0.68 ± 0.18
Recall ± S.D.        0.65 ± 0.20    0.56 ± 0.20    0.69 ± 0.17
F1-Score ± S.D.      0.65 ± 0.20    0.56 ± 0.20    0.68 ± 0.17

Removing any key component, especially preprocessing or keypoints, significantly lowered accuracy, highlighting the importance of integrating multiple features for better recognition.

Table 3 Ablation Study on Model Configuration for Hand Gesture Recognition

Model Configuration KArSL-100 KArSL-190 mArSL


All Parameters 79.3% 65.8% 78.3%
Without Preprocessing 65.3% 64.7% 66.2%
Without Key Points + 3D Hand Mesh 68.5% 60.3% 72.1%
Without 3D Hand Mesh 73.2% 61.9% 74.1%
Without Key Points (MSER, GFTT, SIFT, FAST) 70.1% 62.7% 71.8%
Without Image-Based Features (HOG, Gabor) 72.8% 60.5% 75.3%

5 Conclusion
This study presents a multi-keypoint-based deep learning framework for Arabic Sign
Language (ArSL) recognition, integrating 2.5D skeletal modeling, geometric feature
extraction, and temporal motion analysis. The use of Ant Colony Optimization
(ACO) for feature selection enhances classification efficiency by eliminating redundant
features, while the CGAN-GCN hybrid model improves feature generalization and
robustness. The proposed framework effectively handles occlusions, lighting variations,
and complex hand articulations, ensuring high classification accuracy. Experimental
results on KArSL-100, KArSL-190, and ArabSign datasets demonstrate the effective-
ness and scalability of the approach in both signer-independent and signer-dependent
conditions. The model achieves 76% accuracy on KArSL-100, 64.7% on KArSL-190
and 79% on ArabSign in signer-independent settings, outperforming previous works in
generalization across different signers. The ablation study highlights the importance
of multi-modal feature integration, particularly keypoint detection, contour analysis,
and skeletal modeling, in achieving robust and efficient ArSL recognition. Overall,
this work contributes to assistive communication, education, and human-computer
interaction, offering a scalable and cost-effective solution for real-time sign language

Table 4 Comparison with State-of-the-Art Methods

Dataset      Authors                      Method               Accuracy (%)
KArSL-100    Alyami et al. [? ]           Signer-dependent     99.74
                                          Signer-independent   68.2
             Alamri et al. [? ]           Signer-dependent     93.6
                                          Signer-independent   29.4
             Proposed Model               Signer-dependent     96.8
                                          Signer-independent   79.3
KArSL-190    Hamzah Luqman et al. [? ]    Signer-dependent     99.2
                                          Signer-independent   40.6
             Hamzah Luqman et al. [? ]    Signer-dependent     99.1
                                          Signer-independent   40.2
             Proposed Model               Signer-dependent     97.9
                                          Signer-independent   65.8
mArSL        H. Luqman et al. [? ]        Signer-dependent     99.7
                                          Signer-independent   72.4
             Proposed Model               Signer-dependent     93.5
                                          Signer-independent   78.3

recognition. Future research will focus on real-time deployment, adaptive learning, and cross-linguistic sign recognition to further enhance accessibility.
Author Contributions. This submission has been approved by all authors, and
all authors whose names appear on the manuscript have made equal contributions to
the scientific work.
Acknowledgements. The authors acknowledge Princess Nourah Bint Abdul-
rahman University Researchers supporting Project number (PNURSP2025R410),
Princess Nourah Bint Abdulrahman University, Riyadh 11671, Saudi Arabia.
Conflict of interest. The authors declare that there are no conflicts of interest
regarding the publication of this paper.
Funding Statement. The APC was funded by the Open Access Initiative of
the University of Bremen and the DFG via SuUB Bremen. The authors acknowl-
edge Princess Nourah Bint Abdulrahman University Researchers supporting Project
number (PNURSP2025R410), Princess Nourah Bint Abdulrahman University, Riyadh
11671, Saudi Arabia. The authors extend their appreciation to the Deanship of
Research and Graduate Studies at King Khalid University for funding this work
through Large Group Project under grant number (RGP.2/568/45).
Data Availability. The KArSL-100 and KArSL-190 datasets are available online at https://hamzah-luqman.github.io/KArSL/ and the ArabSign dataset is available at https://hamzah-luqman.github.io/ArabSign/

References
