Article Title
Abstract
Sign language recognition (SLR) is vital for improving communication for individuals with hearing and speech impairments. However, traditional computer vision-based SLR systems face challenges such as occlusions and dynamic hand movements. To address these issues, we propose 3D-HandFuser, a multi-modal framework for Arabic Sign Language (ArSL) that combines RGB and depth data for improved recognition. Our approach processes both modalities independently, extracting features that are fused for classification. We use a deep learning model for hand keypoint detection from RGB images, mapping the keypoints to a 3D hand mesh using the SMPL-MANO model. For depth images, we apply a novel Gaussian Mixture Model-based ellipse fitting method for hand extraction and use keypoint detection techniques such as MSER, FAST, GFTT, and SIFT. We then extract features with HOG and Gabor filters, which are fused into a unified vector and processed by a Transformer-based architecture for effective gesture classification. Our framework is evaluated on three benchmark ArSL datasets (KArSL-100, KArSL-190, and mArSL), achieving classification accuracies of 79.3%, 65.8%, and 78.3%, respectively. The results highlight the effectiveness of our multi-modal approach toward robust sign language recognition.
1 Introduction
Sign language serves as a vital means of communication for individuals with hearing
and speech impairments, enabling interaction through structured hand movements
and gestures. The demand for automated sign language recognition has surged with
advancements in artificial intelligence, aiming to bridge communication gaps and
enhance inclusivity [1]. Unlike spoken languages, sign languages incorporate both
manual components (hand gestures) and non-manual elements (facial expressions,
body posture), making recognition a complex and multi-faceted task [2]. Among
these languages, Arabic Sign Language (ArSL) plays a crucial role in the communication of Arabic-speaking deaf communities, yet it remains relatively underexplored in computational research [3], despite its relevance to applications in public safety, security, and statistical purposes [7, 8].
Sign language recognition faces challenges like hand shape variations, two-handed
gestures, lighting, occlusions, and limited annotated ArSL datasets [4, 5]. Image-based
recognition is hindered by background clutter, skin tone variations, and perspective
distortions [6]. Dynamic gestures add complexity due to motion, rapid transitions, and
overlaps [7], while video-based methods are affected by motion blur, frame rate varia-
tions, and occlusions [8]. Traditional techniques struggle with these issues, highlighting
the need for advanced multi-modal approaches that combine spatial and temporal
information for robust recognition [9].
Recent advancements in deep learning and computer vision have enhanced gesture
recognition through RGB and depth data fusion. RGB images provide rich texture
and shape information [10], while depth-based approaches capture 3D hand structures
independent of lighting [11]. Combining both modalities enables more comprehen-
sive feature extraction, improving robustness and accuracy [12]. However, challenges
remain in effective RGB-depth fusion, handling occlusions, and maintaining accuracy
across diverse environments, highlighting the need for improved multi-modal learning
strategies.
In this work, we propose 3D-HandFuser, a two-hand gesture recognition frame-
work for Arabic Sign Language using a multi-modal approach that combines RGB and
depth data to enhance performance. RGB images are processed with a deep learning
model for hand keypoint detection, mapping them to the SMPL-MANO model for 3D
hand mesh generation. For depth images, a GMM-based ellipse fitting method detects
hands, followed by keypoint extraction and feature computation. The features from
both modalities are early fused and processed through a Transformer-based architecture for gesture classification, capturing both spatial and temporal hand movement
characteristics for robust recognition.
The major contributions and highlights presented in this paper are summarized as
follows.
• Multi-Modal Gesture Recognition Framework: We propose 3D-HandFuser, a two-
hand static and dynamic gesture recognition system for Arabic Sign Language
(ArSL) that integrates RGB and depth modalities, ensuring robust and accurate
recognition across diverse conditions.
• Novel Hand Localization via GMM-Based Ellipse Fitting: We introduce a technique in which GMM-based ellipse fitting is applied to the complete body in the depth image. Each ellipse corresponds to a different body part, and the component with the minimum depth from the sensor is selected as the hand. The resulting region is used to extract the hand silhouette.
• 3D Hand Representation and Feature Extraction: We use a deep learning model
to detect hand keypoints from RGB images and map them to the SMPL-MANO
model. For depth images, we apply GMM-based ellipse fitting, followed by keypoint
detection and feature extraction.
• Early Fusion and Transformer-Based Classification: Keypoints and features from
both modalities are fused into a unified vector and processed by a Transformer-
based model for accurate gesture classification, capturing spatial and temporal
characteristics.
2 Literature Review
RGB-based hand gesture recognition has gained attention due to the accessibility of
conventional cameras and deep learning techniques. Early methods focused on color
segmentation and contour-based approaches to detect hand regions [13] but strug-
gled with varying lighting and complex backgrounds. The use of convolutional neural
networks (CNNs) for feature extraction has since improved accuracy, with models
like Molchanov et al.’s [14] CNN-based architecture that learned spatiotemporal fea-
tures from RGB sequences, enhancing dynamic gesture recognition. More recently,
Transformer-based models such as Vision Transformers (ViTs) have been applied to
improve performance, with Kim et al. [15] integrating self-attention mechanisms to
capture long-range dependencies in hand movements. Despite progress, RGB-based
methods remain sensitive to occlusions and lighting changes, limiting robustness in
real-world scenarios.
Depth-based hand gesture recognition has gained attention due to its ability
to capture three-dimensional hand structures, reducing the influence of background
noise. Depth sensors, such as Microsoft Kinect and Intel RealSense, provide depth
maps that enhance segmentation and feature extraction. Erol et al. [16] developed
an approach using depth maps to extract hand contours and applied hidden Markov
models (HMMs) for gesture classification. Similarly, Oikonomidis et al. [17] used depth
images to perform model-based hand pose estimation, achieving high accuracy in
gesture interpretation. With the rise of deep learning, depth-based CNNs have been
proposed for more robust gesture recognition. For instance, Wan et al. [18] introduced
a 3D CNN that directly processed depth maps, learning spatiotemporal features for
dynamic gestures. Additionally, generative models such as Gaussian Mixture Models
(GMMs) have been utilized for depth-based segmentation, as demonstrated by Yin et
al. [19], who applied GMMs to fit hand shapes and extract key points for recognition.
Combining RGB and depth (RGB-D) data enhances gesture recognition by lever-
aging both modalities, improving robustness against occlusions and lighting variations.
Neverova et al. [20] proposed a multimodal deep learning approach using late fusion,
improving accuracy by combining both features. Huang et al. [21] used a two-stream
CNN to process RGB and depth separately before combining them for classification.
Li et al. [22] introduced a Transformer-based architecture to align RGB and depth fea-
tures, enhancing gesture recognition in cluttered environments. This hybrid approach
has shown promise in sign language recognition and HCI applications.
Further, hand mesh models provide a structured representation of hand gestures,
improving accuracy in pose estimation and fine-grained gesture recognition. SMPL
and MANO models have been widely used to generate parametric hand meshes from
RGB or depth data. Romero et al. [23] introduced the MANO model, a paramet-
ric 3D hand model trained on real-world hand shapes. This model has been used in
conjunction with deep learning techniques to infer hand poses from RGB and depth
images. Similarly, Zhou et al. [24] proposed a hybrid approach that combined CNN-
based feature extraction with mesh reconstruction, achieving high accuracy in hand
tracking. Recent advancements have integrated Graph Neural Networks (GNNs) with
hand meshes to refine hand pose estimation. Ge et al. [25] developed a GNN-based
approach that improved mesh reconstruction from RGB images, enhancing gesture
recognition for virtual reality (VR) applications.
3 Proposed Architecture
In this work, we propose 3D-HandFuser, a hybrid approach for real-time hand ges-
ture recognition using both RGB and depth frames. Our methodology leverages the
strengths of both modalities to enhance the accuracy and robustness of hand ges-
ture detection. The process is divided into several key stages: preprocessing, keypoint extraction, 3D hand creation, feature extraction, feature fusion, and classification using a Transformer network. The overall architecture of the proposed system
is shown in Fig. 1.
Fig. 1 Processing pipeline for 3D Hand Mesh and Transformer-Based Temporal Modeling
where $I_{face\ removed}$ is the image with the faces removed, and solid color is the color used to replace the faces. Following this, skin detection is performed by isolating regions of skin tone in the YCbCr color space. This is done by defining a range for skin color, represented using Eq. (2) and shown in Fig. 3(b).
Fig. 2 Preprocessing: (a) Original images, (b) Denoised images, (c) Sharpened images, (d) Enhanced images using histogram equalization
Fig. 3 RGB hand silhouette extraction: (a) Face detection using Haar Cascade classifier, (b) Skin mask after face removal, (c) Dilated mask, (d) Eroded mask, (e) Refined mask, (f) RGB hand silhouette
where the connected components are labeled, and small areas are removed based
on their size. The processed RGB hand silhouette is shown in Fig. 3(f).
$\hat{K}_{RGB} = \dfrac{K_{RGB} - \mu(K_{RGB})}{\sigma(K_{RGB})}$    (5)
where µ(KRGB ) and σ (KRGB ) denote the mean and standard deviation of the
keypoints, respectively.
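For illustration, Eq. (5) amounts to a per-sequence standardization of the detected keypoints. A minimal NumPy sketch is given below; the array shape is an assumption for illustration rather than the exact interface of our implementation.

```python
import numpy as np

def normalize_keypoints(k_rgb: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance normalization of RGB hand keypoints (Eq. 5).

    k_rgb: array of shape (num_keypoints, 2) holding (x, y) pixel coordinates.
    """
    mu = k_rgb.mean(axis=0)            # mean of the keypoints
    sigma = k_rgb.std(axis=0) + 1e-8   # standard deviation (epsilon avoids division by zero)
    return (k_rgb - mu) / sigma
```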
Fig. 4 Keypoint extraction: (a) Original frame (b) RGB hand silhouette (c) Extracted hand keypoints
$K_{3D} = J \cdot V$    (8)
where J is a predefined regression matrix that maps mesh vertices to joint locations.
Inverse kinematics (IK) refines the 3D keypoints by minimizing their deviation from the detected keypoints while maintaining biomechanical constraints, using the optimization objective in Eq. (9):
$\min_{\theta} \sum_{i=1}^{N} \left\| \Pi K_{3D,i} - K_{RGB,i} \right\|^{2} + \lambda \|\theta\|^{2}$    (9)
where $K_{RGB,i}$ denotes the keypoints detected from the RGB image, $K_{3D,i}$ denotes the corresponding keypoints generated by the MANO model, $\Pi$ is the camera projection, and $\lambda$ weights a regularization term that discourages unnatural hand poses.
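The IK refinement of Eq. (9) can be approximated with a simple gradient-based optimization. The sketch below assumes a differentiable wrapper `mano_forward` that maps MANO pose parameters to 3D joints and a camera projection `project`; both are hypothetical placeholders, not the exact components of our pipeline.

```python
import torch

def fit_hand_pose(k_rgb, mano_forward, project, n_iters=200, lam=1e-3, lr=1e-2):
    """Gradient-based IK refinement of MANO pose parameters (Eq. 9).

    k_rgb:        (21, 2) keypoints detected from the RGB image.
    mano_forward: callable theta -> (21, 3) 3D joints (hypothetical MANO wrapper).
    project:      callable (21, 3) -> (21, 2) camera projection Pi.
    """
    k_rgb = torch.as_tensor(k_rgb, dtype=torch.float32)
    theta = torch.zeros(48, requires_grad=True)        # MANO pose parameters
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        k3d = mano_forward(theta)                      # K_3D from the hand model
        # reprojection error plus pose regularization (Eq. 9)
        loss = ((project(k3d) - k_rgb) ** 2).sum() + lam * (theta ** 2).sum()
        loss.backward()
        opt.step()
    return theta.detach()
```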
To enhance stability in motion tracking, the reconstructed 3D hand keypoints
undergo temporal smoothing. A Kalman filter is employed to estimate the true
keypoint trajectory using Eq. (10):
Fig. 5 3D Hand Mesh for different hand poses: (a) Five fingers (b) Two fingers (c) Four fingers (d) Both hands involving five fingers
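As a sketch of the temporal smoothing step, a constant-velocity Kalman filter can be applied per keypoint using OpenCV; the noise covariances and frame rate below are illustrative assumptions, not the exact values used in our experiments.

```python
import numpy as np
import cv2

def make_keypoint_filter(dt: float = 1.0 / 30.0) -> cv2.KalmanFilter:
    """Constant-velocity Kalman filter for one 3D keypoint (state = [x, y, z, vx, vy, vz])."""
    kf = cv2.KalmanFilter(6, 3)
    kf.transitionMatrix = np.eye(6, dtype=np.float32)
    kf.transitionMatrix[:3, 3:] = dt * np.eye(3, dtype=np.float32)
    kf.measurementMatrix = np.hstack([np.eye(3), np.zeros((3, 3))]).astype(np.float32)
    kf.processNoiseCov = 1e-4 * np.eye(6, dtype=np.float32)       # assumed process noise
    kf.measurementNoiseCov = 1e-2 * np.eye(3, dtype=np.float32)   # assumed measurement noise
    kf.errorCovPost = np.eye(6, dtype=np.float32)
    return kf

def smooth_trajectory(points) -> np.ndarray:
    """Smooth a sequence of noisy 3D positions for a single keypoint."""
    pts = np.asarray(points, dtype=np.float32)
    kf = make_keypoint_filter()
    kf.statePost = np.vstack([pts[0].reshape(3, 1), np.zeros((3, 1), np.float32)])
    smoothed = []
    for p in pts:
        kf.predict()
        est = kf.correct(p.reshape(3, 1))   # posterior state estimate
        smoothed.append(est[:3].ravel())
    return np.array(smoothed)
```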
3.2 Depth Image Processing
The depth images provide information about the distance of different body parts of
the signer performing the gesture. The pipeline for depth image processing includes
the following steps:
3.2.1 Preprocessing
In this study, depth-based preprocessing is applied to enhance image quality through
normalization and noise reduction. Normalization ensures consistency by scaling depth
values within a fixed range using Eq. (13).
$D'(x, y) = \dfrac{D(x, y) - D_{min}}{D_{max} - D_{min}}$    (13)
where Dmin and Dmax denote the minimum and maximum depth values, respec-
tively. To suppress noise while preserving edges, a Gaussian filter is used as described
in Eq. (14).
$D_{filtered}(x, y) = \sum_{i,j} D(i, j) \cdot G(i, j)$    (14)
where G(i, j ) is the Gaussian kernel.
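A minimal sketch of the depth preprocessing in Eqs. (13)-(14), assuming OpenCV; the kernel size and sigma are illustrative choices rather than the exact values used in our experiments.

```python
import numpy as np
import cv2

def preprocess_depth(depth: np.ndarray, ksize: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Normalize a raw depth map to [0, 1] (Eq. 13) and suppress noise with a Gaussian filter (Eq. 14)."""
    d = depth.astype(np.float32)
    d_min, d_max = float(d.min()), float(d.max())
    d_norm = (d - d_min) / (d_max - d_min + 1e-8)            # Eq. (13)
    return cv2.GaussianBlur(d_norm, (ksize, ksize), sigma)   # Eq. (14)
```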
where P (x) is the probability density function, wi is the weight of the i-th Gaus-
sian, and N (x | µi , Σi ) is the Gaussian distribution. Pixels are assigned to the
highest-probability cluster, and ellipses are fitted based on the covariance matrix Σ.
Eigenvalues λ1 and λ2 define size and orientation, with eigenvectors determining the
angle, as given by Eq. (18).
$\Theta = \tan^{-1}\!\left(\dfrac{v_{21}}{v_{11}}\right)$    (18)
where v21 and v11 are the components of the first eigenvector. The lengths of the
major and minor axes of the ellipse are determined using Eq. (19):
$a = 2\sqrt{\lambda_1}, \quad b = 2\sqrt{\lambda_2}$    (19)
where a and b represent the major and minor axes, respectively. Ellipses with
axes a and b approximate body parts, while color mapping highlights clusters in the
silhouette. A combination of GMMs, shape modeling, and energy minimization enables
accurate hand extraction and localization, even under occlusion (Fig. 6(e)), supporting
gesture and biomechanical analysis.
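The sketch below outlines the GMM-based ellipse fitting and hand selection, assuming scikit-learn's GaussianMixture; the number of components and the use of 2D pixel coordinates are illustrative assumptions, not the exact configuration of our system.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_body_ellipses(depth: np.ndarray, silhouette: np.ndarray, n_parts: int = 6):
    """Fit a GMM to silhouette pixels, derive one ellipse per component (Eqs. 18-19),
    and select the component closest to the sensor as the hand."""
    ys, xs = np.nonzero(silhouette)
    pts = np.column_stack([xs, ys]).astype(np.float64)
    gmm = GaussianMixture(n_components=n_parts, covariance_type="full", random_state=0).fit(pts)
    labels = gmm.predict(pts)

    ellipses = []
    for i in range(n_parts):
        eigvals, eigvecs = np.linalg.eigh(gmm.covariances_[i])           # ascending eigenvalues
        a, b = 2.0 * np.sqrt(eigvals[1]), 2.0 * np.sqrt(eigvals[0])      # Eq. (19)
        theta = np.arctan2(eigvecs[1, 1], eigvecs[0, 1])                 # orientation, cf. Eq. (18)
        mean_depth = depth[ys[labels == i], xs[labels == i]].mean()
        ellipses.append({"center": gmm.means_[i], "a": a, "b": b,
                         "angle": theta, "mean_depth": mean_depth})

    hand = min(ellipses, key=lambda e: e["mean_depth"])   # minimum depth -> hand region
    return ellipses, hand
```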
Fig. 6 Depth hand silhouette extraction: (a) Original frame (b) Extracted human silhouette (c)
Fitting of ellipses (d) Labelling of body parts (e) Extracted hand silhouette
detected as keypoints. The stability of each connected component is measured using
the following stability score:
$S(T) = \dfrac{|A(T + \epsilon) - A(T)|}{A(T)}$    (20)
where A(T ) is the area of the connected component at threshold T , and ϵ is a
small threshold change. MSER will output the set of stable keypoints KMSER for the
depth image. The extracted keypoints using MSER are shown in Fig. 7(a) and Fig.
7(b). These keypoints feed into the overall recognition framework, as shown in Fig. 1.
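For reference, a minimal MSER keypoint extraction on the normalized depth image can be sketched with OpenCV as follows; taking region centroids as keypoints is an illustrative choice, not necessarily the exact selection rule of our pipeline.

```python
import numpy as np
import cv2

def mser_keypoints(depth_norm: np.ndarray) -> np.ndarray:
    """Detect maximally stable regions on the normalized depth image and return centroids K_MSER."""
    img = (depth_norm * 255).astype(np.uint8)
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(img)
    if not regions:
        return np.empty((0, 2))
    return np.array([r.mean(axis=0) for r in regions])   # one centroid per stable region
```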
Fig. 7 MSER keypoints (a) MSER keypoints on gesture involving single hand (b) MSER keypoints
on gesture involving both hands
The detected keypoints from GFTT are represented as $K_{GFTT}$. The extracted keypoints using GFTT are shown in Fig. 8(a) and Fig. 8(b).
Fig. 8 GFTT keypoints (a) Gesture involving single hand (b) Gesture involving both hands
where p is the central pixel, and Nt is the number of contiguous pixels in the
Bresenham circle that are brighter or darker than p by a threshold t. If 12 out of 16
pixels meet this criterion, p is classified as a corner. FAST keypoints, $K_{FAST}$, are shown in Fig. 9(a) and
Fig. 9(b).
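The remaining detectors are available directly in OpenCV. The sketch below collects GFTT, FAST, and SIFT keypoints from the depth hand silhouette; the detector parameters are illustrative assumptions rather than the exact settings of our experiments.

```python
import numpy as np
import cv2

def detect_depth_keypoints(depth_norm: np.ndarray):
    """Collect GFTT, FAST, and SIFT keypoints (K_GFTT, K_FAST, K_SIFT) from a normalized depth image."""
    img = (depth_norm * 255).astype(np.uint8)

    # GFTT: Shi-Tomasi corners
    corners = cv2.goodFeaturesToTrack(img, maxCorners=50, qualityLevel=0.01, minDistance=5)
    k_gftt = corners.reshape(-1, 2) if corners is not None else np.empty((0, 2))

    # FAST: segment-test corners on the Bresenham circle
    fast = cv2.FastFeatureDetector_create(threshold=20)
    k_fast = np.array([kp.pt for kp in fast.detect(img, None)])

    # SIFT: scale-invariant keypoints
    sift = cv2.SIFT_create()
    k_sift = np.array([kp.pt for kp in sift.detect(img, None)])

    return k_gftt, k_fast, k_sift
```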
Fig. 9 FAST keypoints (a) FAST keypoints on gesture involving single hand (b) FAST keypoints
on gesture involving both hands
Fig. 10 SIFT keypoints (a) SIFT keypoints on gesture involving single hand (b) SIFT keypoints
on gesture involving both hands
$\theta(x, y) = \tan^{-1}\!\left(\dfrac{I_y}{I_x}\right)$    (25)
where Ix and Iy are the gradients computed using the Sobel operator in the hor-
izontal and vertical directions. The image is divided into p × p cells, with gradient
histograms weighted by magnitudes. Overlapping b × b blocks normalize contrast.
The final HOG descriptor, $F_{HOG}$, is formed by concatenating the normalized histograms (Fig. 11(a), 11(b)). In addition, we compute HOG-derived statistics that capture hand shape through edge directions and gradients. The Gradient Orientation Histogram distinguishes hand shapes and finger positions in localized regions using Eq. (26):
$H = \sum_{i=1}^{N} \theta_i$    (26)
where $\theta_i$ represents the orientation of the gradient at pixel $i$. Edge Density, $D_E = \frac{\text{Edge pixels}}{\text{Total pixels}}$, measures the proportion of edge pixels, indicating gesture complexity. Higher density indicates open-hand gestures, while lower density suggests closed-fist gestures. HOG Entropy measures gradient randomness, with higher entropy for complex gestures and lower values for simpler shapes, as given in Eq. (27):

$H_G = -\sum_{i} P_i \log(P_i)$    (27)
Fig. 11 HOG features (a) HOG features on gesture involving single hand (b) HOG features on
gesture involving both hands
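A compact sketch of the HOG-based descriptors, assuming scikit-image's `hog` and OpenCV's Canny detector; the cell sizes, block sizes, and edge thresholds are illustrative values, not the exact settings of our experiments.

```python
import numpy as np
import cv2
from skimage.feature import hog

def hog_shape_features(hand_img: np.ndarray):
    """HOG descriptor F_HOG, edge density D_E, and HOG entropy H_G for an 8-bit grayscale hand crop."""
    f_hog = hog(hand_img, orientations=9, pixels_per_cell=(8, 8),
                cells_per_block=(2, 2), feature_vector=True)

    # edge density: fraction of edge pixels in the crop
    edges = cv2.Canny(hand_img, 50, 150)
    d_e = np.count_nonzero(edges) / edges.size

    # HOG entropy over the normalized descriptor (Eq. 27)
    p = f_hog / (f_hog.sum() + 1e-8)
    h_g = -np.sum(p * np.log(p + 1e-8))

    return f_hog, d_e, h_g
```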
$E_G = \sum_{s=1}^{S} \sum_{o=1}^{O} |R(s, o)|$    (29)
where S represents different scales, and O represents different orientations.
Fig. 12 Gabor features on gesture (a) Gabor features at 0° (b) Gabor features at 45° (c) Gabor
features at 90° (d) Gabor features at 135°
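The Gabor energy of Eq. (29) can be sketched as a filter-bank response over a few scales and the four orientations shown in Fig. 12; the kernel size and filter parameters below are illustrative assumptions.

```python
import numpy as np
import cv2

def gabor_energy(hand_img: np.ndarray, wavelengths=(4, 8, 16), orientations=4) -> float:
    """Total Gabor response energy E_G summed over scales and orientations (Eq. 29)."""
    img = hand_img.astype(np.float32)
    energy = 0.0
    for lambd in wavelengths:                    # wavelengths act as scales
        for k in range(orientations):            # 0, 45, 90, 135 degrees
            theta = k * np.pi / orientations
            # args: kernel size, sigma, orientation, wavelength, aspect ratio, phase offset
            kernel = cv2.getGaborKernel((31, 31), 4.0, theta, lambd, 0.5, 0)
            response = cv2.filter2D(img, cv2.CV_32F, kernel)
            energy += float(np.abs(response).sum())
    return energy
```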
The fused feature vector is embedded into a high-dimensional latent space via a learnable transformation using Eq. (32):
$F' = WF + P$    (32)
where W is a learnable projection matrix that maps the feature vector to the
embedding space, and P preserves spatial relationships. The encoder uses multi-head
self-attention (MHSA) to capture long-range dependencies between hand keypoints.
The input is projected into query (Q), key (K), and value (V) spaces with learnable
weight matrices using Eq. (33):
$Q = F'W_q, \quad K = F'W_k, \quad V = F'W_v$    (33)
where Wq , Wk , and Wv are learnable parameter matrices. The self-attention mech-
anism computes a weighted representation of the input features through the scaled
dot-product attention function using Eq. (34) and Eq. (35):
$A = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)$    (34)

$F_{att} = AV$    (35)
where dk is the dimensionality of the key vectors, and the softmax function nor-
malizes the attention scores. Multi-head attention extends this mechanism by applying
multiple parallel attention operations and concatenating the results using Eq. (36):
The attended features are further refined through a position-wise feedforward network (FFN), which applies non-linear transformations to enhance feature expressiveness using Eq. (37):
Table 1 Algorithm for Integrated RGB-Depth Hand Gesture Classification with Transformer
where W1 and W2 are weight matrices, b1 and b2 are biases, and σ is an activation
function such as ReLU or GELU. To stabilize training, layer normalization and residual
connections are applied using Eq. (38):
where the residual connection ensures gradient flow through the network. The encoder stacks multiple such layers, each combining self-attention, layer normalization, and a feedforward network with residual connections. The transformer architecture is shown in Fig. 13. The algorithm for the overall system is given in Table 1.
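For clarity, the classification stage of Eqs. (32)-(38) corresponds to a standard Transformer encoder over the per-frame fused feature vectors. A minimal PyTorch sketch is shown below; the feature dimension, model width, depth, and pooling strategy are illustrative assumptions rather than the exact hyperparameters of 3D-HandFuser.

```python
import torch
import torch.nn as nn

class GestureTransformer(nn.Module):
    """Encoder-classifier over fused per-frame feature vectors (Eqs. 32-38)."""

    def __init__(self, feat_dim=512, d_model=256, n_heads=8, n_layers=4,
                 n_classes=100, max_len=128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)                    # learnable projection W (Eq. 32)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))   # positional term P (Eq. 32)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True, activation="gelu")
        self.encoder = nn.TransformerEncoder(layer, n_layers)       # MHSA + FFN + LayerNorm + residuals
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                                   # x: (batch, frames, feat_dim)
        h = self.proj(x) + self.pos[:, : x.size(1)]         # Eq. (32)
        h = self.encoder(h)                                 # Eqs. (33)-(38)
        return self.head(h.mean(dim=1))                     # temporal average pooling -> class logits

# usage: logits = GestureTransformer()(torch.randn(2, 64, 512))
```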
4.2 Datasets
We evaluated the 3D-HandFuser model using three datasets: mArSL, KArSL-100,
and KArSL-190, introduced by Luqman et al. in 2022 to improve Arabic
Sign Language (ArSL) recognition. The mArSL dataset contains 6,748 samples of 50
signs performed by four male signers, recorded with RGB, depth, and skeleton data.
KArSL-100 includes 15,000 samples of 100 signs (30 numerals, 39 letters, 31 words)
from three signers, with skeleton, depth, and RGB data. KArSL-190 extends this with
28,500 samples across 190 signs, including digits, letters, and words, also recorded in
three modalities.
with perfect scores for some classes and lower AUCs for others (e.g., class 7: 0.63, class 11: 0.62). In KArSL-100 (Fig. 15(b)), the mean AUC is high, with some classes near-perfect and others moderate (e.g., class 1: 0.62, class 14: 0.60). KArSL-190 (Fig. 15(c)) achieves a mean AUC of 0.80, with most classes performing well, but some (e.g., class 12: 0.58, class 100: 0.56) show lower scores, indicating challenges with larger and more complex class sets. As shown in Table 4, the proposed model outperforms state-of-the-art methods in both signer-dependent and signer-independent settings, achieving 79.3% signer-independent and 96.8% signer-dependent accuracy on KArSL-100, 65.8% and 97.9% on KArSL-190, and 78.3% and 93.5% on mArSL. These results highlight
the model’s strong generalization to unseen signers, making it promising for practical
sign language recognition. However, all three datasets involve dynamic gestures with
significant self-occlusion, highlighting the need for further improvement.
Fig. 14 Confusion matrices (a) mArSL dataset (b) KArSL-100 dataset (c) KArSL-190 dataset
Fig. 15 ROC curves (a) mArSL dataset (b) KArSL-100 dataset (c) KArSL-190 dataset
Table 2 Accuracy, Precision, Recall, and F1-Score on
KArSL-100, KArSL-190, and mArSL Datasets
5 Conclusion
This study presented 3D-HandFuser, a multi-modal deep learning framework for Arabic Sign Language (ArSL) recognition that integrates RGB and depth data. RGB frames are used for hand keypoint detection and 3D hand mesh reconstruction with the SMPL-MANO model, while depth frames are processed with a GMM-based ellipse fitting method for hand localization, followed by keypoint detection (MSER, FAST, GFTT, SIFT) and HOG and Gabor feature extraction. The fused features are classified by a Transformer-based architecture that captures both spatial and temporal hand movement characteristics. The proposed framework effectively handles occlusions, lighting variations, and complex hand articulations, ensuring robust classification. Experimental results on the KArSL-100, KArSL-190, and mArSL datasets demonstrate the effectiveness and scalability of the approach in both signer-independent and signer-dependent conditions. The model achieves 79.3% signer-independent accuracy on KArSL-100, 65.8% on KArSL-190, and 78.3% on mArSL, outperforming previous works in generalization across different signers. The results also highlight the importance of multi-modal feature integration, combining RGB keypoints, 3D hand meshes, and depth-based keypoint and texture features, in achieving robust and efficient ArSL recognition. Overall, this work contributes to assistive communication, education, and human-computer interaction, offering a scalable and cost-effective solution for real-time sign language recognition.
Table 4 Comparison with State-of-the-Art Methods
References