

International Journal of Information Technology & Decision Making


© World Scientific Publishing Company

A NOVEL LIGHTWEIGHT 3DCNN VISUAL HUMAN ACTIVITY RECOGNITION

AHMED S. SALAMA1
Department of Computer Engineering and Electronics, Cairo Higher Institute for Engineering,
Computer Science and Management, Cairo, Egypt.

HAMADA I. ABDULWAKEL2
Computer Science Department, Faculty of Computers and Information, Minia University,
Minia 61519, Egypt

ESRAA ELDESOUKY3,4∗
Department of Computer Science, College of Computer Engineering and Sciences, Prince
Sattam Bin Abdulaziz University, Al Kharj, 11942, Saudi Arabia
Department of Computer Science, Faculty of Computers and Informatics, Suez Canal University
Ismailia, 41522, Egypt

MOHAMED MASHHOUR1
AHMAD M. NAGM1

Received Day Month Year


Revised Day Month Year

Addressing the challenging issues in human-activity detection and classification within real videos, such as spatio-temporal complexities, crowd scenarios, and the problems of missed and false detections, this paper introduces a novel real-time approach for classifying human-object interactions. The proposed method builds upon the three-dimensional convolutional neural network (3DCNN) framework, introducing a refined architecture designed to enhance the richness of contextual information in feature extraction. Distinguishing itself from existing methods, the architecture incorporates novel layers and configurations aimed at capturing complex spatio-temporal dynamics and improving robustness in challenging scenarios. The proposed model was evaluated on various datasets, including the UCF YouTube Action dataset, a modified version of the UCF101 dataset, the full UCF50 dataset, and the LoDVP Abnormal Activities dataset. The proposed architecture outperforms existing models, with experimental results on UCF101 showing a 6.6% increase in accuracy compared to state-of-the-art methods while reducing false positives. These performance gains validate the model's efficacy in mitigating common detection challenges, particularly in diverse and noisy environments.

Keywords: 3DCNN; Deep learning; Human activity recognition; UCF Youtube action
dataset; UCF101 dataset; LoDVP abnormal activities dataset.

∗ Corresponding author.


1. Introduction
Computer Vision (CV), as one of the important advancements in computer science, helps mitigate formidable visual challenges such as dense video captioning 1, monitoring 2, video-object segmentation 3, and video enhancement 4. This increases the dependency on CV to contribute more solutions to the various challenges faced by the research community.
In the realm of video recognition, the precise identification of activities and abnormalities remains a formidable challenge, particularly in scenarios such as patient monitoring 5, driver drowsiness detection systems 6, and security systems 7. An intuitive approach to address this issue is to enhance video classification, aiming to boost the capability of capturing more intricate features, thereby improving the detection and classification of activities. However, when dealing with activities that involve similar movements, and especially those where the human activity spans only a few milliseconds, there is an inherent scarcity of feature information. This necessitates a greater focus on the extraction and utilization of detailed features from video frames.
To analyze the spatiotemporal features of human activity videos, a 3DCNN architecture for motion, static, and hybrid feature extraction was introduced in 8. Similarly, the HarNet model is proposed in 9 to analyze and classify human activities in UAV-captured videos.
For discerning team or group activities, the movement relationships between individuals are analyzed in 10 to solve the problems caused by occlusion and shooting angle. The authors in 11 ingeniously redefined the ranking loss paradigm to optimize the detection of human activities, harnessing deep CNN-extracted features and uncovering latent sequential patterns via LSTM architectures. From a spatiotemporal vantage point, 12 proposes an advanced methodology integrating a pretrained ResNet50 for robust feature extraction with a multilayered LSTM for sophisticated event recognition in soccer video analytics. Although these hybrid CNN-LSTM models demonstrate exceptional efficacy in vision-based human action and activity recognition tasks, they inherently escalate computational demands due to the resource-intensive nature of CNN-driven feature extraction and LSTM-centric action modeling. In addition, accuracy results need to be enhanced to enable better generalization of the classification model.
In constructing human-activity detection models using Three-Dimensional Convolutional Neural Networks (3DCNN), employing a multilayered design is paramount. Adding layers helps increase the ability to learn more complex features and hierarchies from spatiotemporal video frames, significantly aiding in elevating accuracy and in detecting similar human activity movements. Despite the benefits of a multilayered 3DCNN for processing video frames, it confronts numerous challenges during practical detection endeavors, such as vanishing gradients, overfitting, depth balance, and hyperparameter tuning, in addition to increasing computational cost.

The availability of large and diverse human activity video datasets encourages the research community to push further toward machine/deep learning algorithms that provide enhanced and significantly better accuracy when handling these datasets. For this reason, the main contributions of this paper are summarized as follows:
• We develop a new 3DCNN architecture model to recognize and classify
different human-based activities.
• Well-known human activity video datasets (e.g., UCF11a , UCF50b ,
UCF101c , and LoDVPd ) are used to evaluate the proposed model.
• An intensive analysis is conducted on the above-mentioned datasets using different hyperparameters to improve the stability of the proposed model.
• Based on the full UCF101 dataset, we generate a learning model with the ability to classify 101 human activities.
• Our generated learning model is then used as a transfer learning model to evaluate the accuracy and loss function for other datasets.
• Our experimental results prove that the proposed model achieves a higher performance level across different evaluation metrics than existing baseline deep learning models.
The rest of the paper is organized as follows. Section 1 discusses the importance of analyzing human activity from real video. Section 2 describes related work on this topic. The suggested methodology, including the 3DCNN architecture and implementation specifics, is covered in Section 3. Various common human activity video datasets are discussed in Section 4. Section 5 presents the results and analysis of the suggested methodology. Section 6 concludes the paper.

2. Related Work
Regarding the challenges brought up in the pertinent literature, human activity
recognition for elder health concerns, such as detecting falls, is addressed in 13 . Using
neural networks for weapon identification through video processing is investigated
by the authors in 14 . In 15 , the focus is on classifying ATM-person attack scenarios
using camera videos. The suitability of monitoring videos and their content for
children is explored in 16 .
The industry is not far from applying neural networks to video classification in
a range of applications to solve its difficulties 17 . Recognizing gestures using spatial-
temporal information is the domain of the sophisticated and powerful 3DCNN model
18
. Based on RGB input films 19 proposed a revolutionary deep learning architecture

a https://www.crcv.ucf.edu/data/UCF_YouTube_Action.php
b https://www.crcv.ucf.edu/data/UCF50.php
c https://www.crcv.ucf.edu/data/UCF101.php
d https://kmikt.uniza.sk/index.php/vyskum-a-priemysel/datasets

to detect and recognize hand sign language in an automatic manner. To identify human activities, the authors in 20 developed a deep learning model that combines 3DCNN and Convolutional Long Short-Term Memory (ConvLSTM) networks.
Different CNN methods for classifying videos into violent or nonviolent categories are proposed by the authors in 21. The study in 22 examines the ability of advanced video CNNs, including 3D ResNet and I3D, to detect altered videos. In 23, an exponential linear units-3DCNN model is applied to video representation, extracting deep features from mobile videos. Finally, 24 introduces a novel 3DCNN-powered model for scene classification in drone surveillance. Additionally, a more thorough examination of the ConvLSTM architecture, evaluated using the UCF Crime database, is presented in 25, which identifies video sequences encompassing several categories of human activities.
In 26 , 3DCNN is applied for vehicle behavior recognition. The classification of
aerial activities is conducted in 27 using a dataset specific to such activities. Support
vector machines (SVM) and convolutional recurrent neural networks (CRNNs) are
employed to construct the classification model proposed in 28 , where Kinematics
Posture Features are extracted from 3D joint locations. Techniques for identifying
irregularities in crowd scenes are suggested in 29 , where a 3D GAN and 3DCNN
architecture are demonstrated to bridge the domain gap for domain adaptation.
Spatiotemporal features are extracted using ConvLSTM in 30 , which also presents
methods for detecting aggression using Mobile Neural Architecture and a dataset
including both violent and nonviolent samples. Finally, a graph-based system for
learning high-level interactions between objects and humans, referred to as Action
Detection, is proposed in 31 .
Authors in 32 are interested in ensuring human safety and coordinating human
activity. Nevertheless, the authors used a dataset of people walking and running
despite encountering problems with dataset availability. In 33 , 3DCNN and Con-
vLSTM, two prominent neural networks, are used to recognize diseases in videos.
Video recordings are employed to diagnose autism spectrum disorder (ASD) in a
method also discussed in 33 . The conclusions in 34 are drawn by comparing an im-
proved model with two refined deep neural networks (ConvLSTM and 3DCNN).
The authors achieve the best results using a combination of the Histogram of Op-
tical Flow (HOF) descriptor and the MLP classifier. Their deep-learning system,
based on the LSTM network, takes pose-based skeletal joint sequences as input to
learn the temporal progression of postures.

3. Proposed Methodology
3.1. Implementation Details
Deep learning models, especially those dealing with 3D images (i.e., videos), require computationally intensive libraries to perform the necessary processing on the inputs, so we chose the Kaggle infrastructure as our model operating environment. The suggested neural network architecture is implemented using the Python programming language and tools, including the TensorFlow and Keras frameworks. The experiments are conducted on Kaggle using Keras with a TensorFlow backend based on the following specifications: four CPU cores and 30 GB of RAM.

3.2. The Conceived Architecture of the 3DCNN


The suggested 3DCNN’s architecture contains multiple Conv3D layers as depicted
in Figure 1, and includes the following layers:

Fig. 1: The proposed 3DCNN architecture.

• Conv3D layers: moving objects appearing in a related set of 3D images are identified and analyzed. The Conv3D layer relies on a set of ordered 3D filters to learn and extract the feature maps. These 3D filters have a fixed step (stride) and travel along the three dimensions (width, height, and depth) of the 3D input images. In deep learning models, the rectified linear unit (ReLU) 35 is used as the activation function. It mitigates the vanishing gradient problem and applies a piecewise linear transformation to the feature data to obtain the activation values. The activation values, a^L_{i,j,k}, can be calculated by Eqs. (1) and (2).

a^{L}_{i,j,k} = \mathrm{ReLU}\!\left[ \sum_{c=0}^{C-1} \sum_{m=i}^{i+(F-1)} \sum_{n=j}^{j+(F-1)} \sum_{d=k}^{k+(F-1)} W^{L}_{c,m,n,d}\, X_{c,m,n,d} + b^{L} \right]   (1)

\mathrm{ReLU}(x) = \begin{cases} x, & \text{if } x \ge 0 \\ 0, & \text{if } x < 0 \end{cases}   (2)
where the output position is represented by the indices (i, j, k), L denotes the filter at layer L, X is the input image, F is the filter size, and C is the number of input channels. W is a weight matrix of size (C, F, F, F), b^L is the bias of the filter, and (m, n, d) indexes the input position. (A minimal numerical sketch of Eqs. (1) and (2) is given after this list.)
• MaxPooling3D layer: a mathematical operation over the 3D images that takes a 5D tensor representing the 3D image and down-samples this input, with different stride steps across the Conv3D layers. For each channel, the MaxPooling operation determines the maximum value of each 2×2×2 subregion and replaces it with a single 1×1×1 cell holding that maximum value. The 2×2×2 subregion mask is shifted by the stride along each dimension.
• A flatten layer: converts the feature map generated by the max-pooling layer into a format that the dense layers can process. The feature map is a multi-dimensional array of pixel values, whereas the dense layers require a one-dimensional array as input, so the flatten layer flattens the feature maps into a one-dimensional array for the dense layers.
• Batch normalization layers: (a.k.a. batch norm) is a method used to
normalize the input for Conv3D layers as well as hidden layers. This layer’s
normalization is performed by adjusting the mean and scaling of the acti-
vations, which makes the training of 3DCNN faster and more stable.
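
To make Eqs. (1) and (2) and the max-pooling operation concrete, the following is a minimal NumPy sketch of the two operations; the input sizes, filter values, and function names here are illustrative assumptions and not part of the proposed implementation.

import numpy as np

def conv3d_relu_single(X, W, b, i, j, k):
    """Activation a^L_{i,j,k} of one 3D filter at output position (i, j, k), Eqs. (1)-(2).

    X: input volume, shape (C, depth, height, width), channels first.
    W: filter weights, shape (C, F, F, F).
    b: scalar bias of the filter.
    """
    C, F = W.shape[0], W.shape[1]
    z = b
    # Sum over all input channels and the F x F x F neighbourhood anchored at (i, j, k).
    for c in range(C):
        for m in range(F):
            for n in range(F):
                for d in range(F):
                    z += W[c, m, n, d] * X[c, i + m, j + n, k + d]
    return max(z, 0.0)  # ReLU, Eq. (2): negative values are clamped to zero

def maxpool3d(volume, size=2, stride=2):
    """Down-sample one channel of a 3D feature map by taking the max of each 2x2x2 cube."""
    D, H, Wd = volume.shape
    out = np.zeros((D // stride, H // stride, Wd // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                cube = volume[i * stride:i * stride + size,
                              j * stride:j * stride + size,
                              k * stride:k * stride + size]
                out[i, j, k] = cube.max()
    return out

# Toy example: a 3-channel 8x8x8 input volume and one 3x3x3 filter.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8, 8, 8))
W = rng.standard_normal((3, 3, 3, 3))
print(conv3d_relu_single(X, W, b=0.1, i=0, j=0, k=0))   # one activation value
print(maxpool3d(rng.standard_normal((8, 8, 8))).shape)  # (4, 4, 4)
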
The five 3DCNN layers that make up the suggested 3DCNN architecture in this
study are as follows:
(1) The input layer: the input of the suggested network has dimensions of 32, 32, and 3 for the input width, height, and number of channels, respectively.
(2) The 1st 3DCNN layer: this layer uses 32 filters with a 3 × 3 × 3 kernel size. It includes a BatchNormalization layer, a MaxPooling3D layer of size 2 × 2 × 2 with a stride of 2, and another BatchNormalization layer to help enhance the output of the training process in this step.
(3) The 2nd 3DCNN layer: for this layer, we operate 64 filters with the same kernel size as the previous layer. This layer also has two BatchNormalization layers with a MaxPooling3D layer between them, with the same subregion size and stride as before. To prevent model overfitting and make the proposed model more stable for different inputs, we add a Dropout layer with a rate of 0.2 that randomly sets input units to 0 at each step of the training process.
(4) The 3rd 3DCNN layer: the number of 3DCNN filters is increased in this layer to 128 filters with the same kernel size. This layer is followed by MaxPooling3D and BatchNormalization layers.
(5) The 4th 3DCNN layer: the number of filters doubles again in this layer to 256 filters, with the same kernel size as the previous layers. This layer has MaxPooling3D, BatchNormalization, and Dropout layers, in that order. Here, the Dropout layer has a rate of 0.5 to ensure that the neurons in this layer are prevented from synchronously optimizing their weights.
(6) The 5th 3DCNN layer: the final 3DCNN layer has 512 filters with the same fixed kernel size used in all layers, and two BatchNormalization layers interspersed with a MaxPooling3D layer with the same settings as the previous layers. To optimize the training process, this layer concludes with a Dropout layer with a rate of 0.5.
(7) The output layer: after the final 3DCNN layer, a Flatten layer is used to turn the output into a vector. Two Dense layers follow, with 256 units and with a number of units equal to the number of dataset classes, respectively. The second Dense layer produces the final predicted class of the input video. We trained our 3DCNN model using the Adam optimization algorithm with a learning rate of 0.0001. Table 1 provides a more detailed description of our 3DCNN model and its layers.

The Python-based pseudo-code provided in Algorithm 3.1 represents the proposed 3DCNN human activity classification model. Before building the 3DCNN model, the hyperparameters and the dataset need to be prepared so that they are suitable to feed into the model. To specify the hyperparameters, we create a class named UCF that contains the classes of the human activity dataset as well as initialization parameters such as the batch size and the number of frames that make up a 3D image sequence. To prepare the dataset, we divide it into three sets: train_ds, valid_ds, and test_ds. These sets correspond to the training, testing, and validation sets, respectively, with a ratio of 70:20:10. The 3DCNN network is trained by feeding the model with the training set. Finally, the model-building step in Algorithm 3.1 describes the layers forming the proposed 3DCNN model. The Adam optimization algorithm is used with a learning rate of 0.0001.
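
A minimal sketch of this preparation and training step is shown below, assuming a tf.data pipeline; the UCF class values, the load_dataset helper, and the assumption that build3DCNNModel returns the compiled model of Algorithm 3.1 are illustrative and not part of the published implementation.

import tensorflow as tf

class UCF:
    # Hyperparameters and class list for the human activity dataset (illustrative values).
    classes = ["Basketball", "Biking", "Diving"]   # replace with the full class list
    num_classes = len(classes)
    batch_size = 16
    num_frames = 30        # frames per 3D image sequence
    frame_size = (32, 32)  # spatial resolution of each frame

# load_dataset is a hypothetical helper that returns a shuffled tf.data.Dataset
# of (video_tensor, label) pairs with the requested frame count and frame size.
full_ds = load_dataset("UCF101", UCF.num_frames, UCF.frame_size)

# 70:20:10 split into training, testing, and validation sets.
n = int(full_ds.cardinality().numpy())
train_ds = full_ds.take(int(0.7 * n)).batch(UCF.batch_size)
test_ds = full_ds.skip(int(0.7 * n)).take(int(0.2 * n)).batch(UCF.batch_size)
valid_ds = full_ds.skip(int(0.9 * n)).batch(UCF.batch_size)

# Build the network of Algorithm 3.1 (assumed here to return the compiled model),
# train it on the training set, and report performance on the held-out test set.
model = build3DCNNModel(UCF.num_classes)
model.fit(train_ds, validation_data=valid_ds, epochs=50)
model.evaluate(test_ds)
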

4. Methodology Datasets
4.1. LoDVP Abnormal Activities Dataset
The LoDVP Abnormal Activities dataset has 11 human activity categories. Each category contains about 100 videos, and the dataset contains 1069 videos in total. These videos were created by non-professional actors in different places, such as a forest with different scene angles, a university campus, and a parking lot. The dataset includes these categories: (a) Begging, (b) Drunkenness,

Table 1: An explanation of the layers in the proposed 3DCNN model.

Layer type               Output shape               # of parameters
Conv3D                   (None, 50, 50, 50, 32)     2624
Batch Normalization      (None, 50, 50, 50, 32)     128
Max Pooling 3D           (None, 25, 25, 25, 32)     0
Batch Normalization 1    (None, 25, 25, 25, 32)     128
Conv3D 1                 (None, 25, 25, 25, 64)     55360
Batch Normalization 2    (None, 25, 25, 25, 64)     256
Max Pooling 3D 1         (None, 13, 13, 13, 64)     0
Batch Normalization 3    (None, 13, 13, 13, 64)     256
Dropout                  (None, 13, 13, 13, 64)     0
Conv3D 2                 (None, 13, 13, 13, 128)    221312
Max Pooling 3D 2         (None, 7, 7, 7, 128)       0
Batch Normalization 4    (None, 7, 7, 7, 128)       512
Conv3D 3                 (None, 7, 7, 7, 256)       884992
Max Pooling 3D 3         (None, 4, 4, 4, 256)       0
Batch Normalization 5    (None, 4, 4, 4, 256)       1024
Dropout 1                (None, 4, 4, 4, 256)       0
Conv3D 4                 (None, 4, 4, 4, 512)       3539456
Batch Normalization 6    (None, 4, 4, 4, 512)       2048
Max Pooling 3D 4         (None, 2, 2, 2, 512)       0
Batch Normalization 7    (None, 2, 2, 2, 512)       2048
Dropout 2                (None, 2, 2, 2, 512)       0
Flatten                  (None, 4096)               0
Dense                    (None, 256)                1048832
Dense 1                  (None, 50)                 12850

(c) Fight, (d) Harassment, (e) Hijack, (f) Knife hazard, (g) Normal videos, (h) Pollution, (i) Property damage, (j) Robbery, and (k) Terrorism 36.

4.2. UCF11 Dataset


The UCF11 dataset, created by the Center for Research in Computer Vision at the University of Central Florida, is among the most demanding and challenging datasets. Among the challenges in this dataset are object appearance and pose, illumination conditions, and large variations in camera motion 37. The dataset comprises 11 categorized human activities, among them Basketball, Diving, HorseRiding, Swing, TrampolineJumping, and WalkingDog, with a total of 1160 videos. In more detail, each category in this dataset includes 25 different groups of videos (each with about four videos) illustrating the human activity.

4.3. UCF50 Dataset


UCF50 is among the most popular human activity datasets and is an expansion of the YouTube Action dataset (UCF11). It is a publicly available dataset that includes 50 action categories. The human action categories in this dataset represent

Algorithm 3.1 The proposed 3DCNN architecture pseudo-code.


Ensure: a trained deep learning model to classify input videos.
procedure build3DCNNModel(classes) ▷ Model building step
model = Sequential()
model.add(Input(shape=(50, 50, 50, 3)))
model.add(Conv3D(32, (3,3,3), strides=1, activation=’relu’))
model.add(BatchNormalization())
model.add(MaxPool3D((2,2,2), strides=2, padding=’same’))
model.add(BatchNormalization())
model.add(Conv3D(64, (3,3,3), strides=1, activation=’relu’))
model.add(BatchNormalization())
model.add(MaxPool3D((2,2,2), strides=2, padding=’same’))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Conv3D(128, (3,3,3), strides=1, activation=’relu’))
model.add(MaxPool3D((2,2,2), strides=2, padding=’same’))
model.add(BatchNormalization())
model.add(Conv3D(256, (3,3,3), strides=1, activation=’relu’))
model.add(MaxPool3D((2,2,2),strides=2, padding=’same’))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Conv3D(512, (3,3,3), strides=1, activation=’relu’))
model.add(BatchNormalization())
model.add(MaxPool3D((2,2,2), strides=2, padding=’same’))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(units=256, activation=’relu’))
model.add(Dense(classes, activation=’softmax’))
model.compile(loss=SparseCategoricalCrossentropy(),
optimizer=Adam(0.0001), metrics=[”accuracy”])
end procedure

realistic videos with different variations in person motion, background color, and lighting conditions. However, the videos within a category share some common attributes, such as the same background, viewpoint, and person.
The dataset includes various categories such as Baseball Pitch, Fencing, Golf Swing, Playing Guitar, High Jump, Clean and Jerk, Drumming, and more categories that describe different human actions. These categories are part of the UCF50 dataset, where each group contains 4 or more action clips 38.

4.4. UCF101 Dataset


The UCF101 dataset is an extension of the UCF50 dataset, adding 51 further action categories for a total of 13,320 videos across all 101 action categories. The realistic videos in this dataset make it more powerful and lead many research works to rely on it to measure the performance of their deep learning models.
The UCF101 dataset 39 includes various categories such as Apply Eye Makeup, Baby Crawling, Fencing, Golf Swing, Clean and Jerk, Haircut, Knitting, Long Jump, and more. These categories encompass body motion, human-object interaction, playing musical instruments, and sports.

5. Results and Discussion


Firstly, the selected datasets mentioned in Section 4 were employed to assess the suggested model using an identical input size and the same number of epochs and frames. Only the batch size hyperparameter is varied, in order to determine the batch size that gives the highest accuracy for the proposed model.
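
The resizing and frame-count settings used throughout the experiments imply a preprocessing step along the following lines; this OpenCV-based sketch, the uniform sampling strategy, and the function name are assumptions, since the paper does not specify how the 30 frames are selected.

import cv2
import numpy as np

def load_video(path, num_frames=30, size=(32, 32)):
    """Read a video, sample num_frames frames uniformly, and resize them to `size`.

    Returns an array of shape (num_frames, height, width, 3) with values in [0, 1].
    """
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Uniformly spaced frame indices across the whole clip (an assumed sampling scheme).
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, size)                  # spatial resize to 32 x 32
        frames.append(frame.astype(np.float32) / 255.0)  # normalize pixel values
    cap.release()
    return np.stack(frames) if frames else np.zeros((num_frames, *size, 3), np.float32)

clip = load_video("v_Biking_g01_c01.avi")   # hypothetical UCF file name
print(clip.shape)                            # (30, 32, 32, 3)
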
To clarify, the accuracy and loss function measured for the modified UCF101 dataset cover the following classes: BaseballPitch, BasketballDunk, Biking, BenchPress, BreastStroke, Billiards, CleanAndJerk, Diving, Drumming, Fencing, HulaHoop, Golf, HorseRace, HorseRiding, Swing, and HighJump. For the full UCF11 dataset, the measures are based on its 11 classes: Basketball, SoccerJuggling, Diving, GolfSwing, VolleyballSpiking, Swing, TennisSwing, TrampolineJumping, Biking, HorseRiding, and WalkingDog.

Table 2: The proposed model's evaluation criteria for the modified UCF101 dataset.

Modified UCF101: 2035 videos - 1424 training - 183 validation - 428 testing
32x32 input size - 30 frames - 50 epochs

Phase      Metric        batch size = 8    batch size = 16    batch size = 32
Training   Loss          2.62 -> 0.06      2.50 -> 0.05       2.74 -> 0.088
           Accuracy %    27 -> 98          26 -> 99           22 -> 98
           Precision %   88                89                 89
           Recall %      86                89                 88
           F1 score %    87                88                 88
Testing    Loss          0.52              0.44               0.42
           Accuracy %    87                88                 88

The videos in this first evaluation were resized to a dimension of 32 x 32, with a total of 30 frames. The evaluation criteria (precision in Eq. (3), recall in Eq. (4), and F1 score in Eq. (5)) along with the results for the modified UCF101 dataset are displayed in Table 2. A total of 2035 videos were evaluated, split into three groups for training, testing, and validation.
Fig. 2: The accuracy during training step on the Modified UCF101 dataset.

Fig. 3: The loss function during the training step on the modified UCF101 dataset.

The results show a notable increase in accuracy for the modified UCF101 dataset: the training phase reached an accuracy of 99%, as shown in Figure 2, whereas the testing phase achieved an accuracy of 88%. In the early stages, there was a significant decline in the loss rate, which starts at 2.5, as shown in Figure 3. By employing the proposed model, the loss rate was reduced to 0.42 and 0.05 during the testing and training stages, respectively.

\mathrm{Precision} = \frac{TP}{TP + FP}   (3)

\mathrm{Recall} = \frac{TP}{TP + FN}   (4)

F_{1}\ \mathrm{score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}   (5)
where true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are the four confusion counts. Accuracy provides an overall evaluation of the classifier's effectiveness across all classes. Precision measures how accurately the classifier recognizes positive cases; a higher precision means fewer false positives. Recall quantifies the classifier's ability to identify all positive instances and shows how well it identifies positive examples; a higher recall value indicates fewer false negatives.
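
These definitions can be checked with a few lines of Python (the example counts are arbitrary):

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from confusion counts, Eqs. (3)-(5)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 88 true positives, 12 false positives, and 14 false negatives for one class.
p, r, f1 = precision_recall_f1(88, 12, 14)
print(f"precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
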

Table 3: The proposed model's evaluation criteria for the full UCF50 dataset.

Full UCF50: 6442 videos - 1119 training - 144 validation - 336 testing
32x32 input size - 30 frames - 50 epochs

Phase      Metric        batch size = 8    batch size = 16    batch size = 32
Training   Loss          3.78 -> 0.07      3.74 -> 0.07       3.92 -> 0.05
           Accuracy %    11 -> 98          10 -> 98           9.54 -> 99
           Precision %   84                85                 86
           Recall %      84                85                 85
           F1 score %    84                85                 85
Testing    Loss          0.78              0.61               0.70
           Accuracy %    84                85                 84

With the same dataset-splitting ratio, Table 3 shows the results of the proposed model on the full UCF50 dataset, where the training accuracy achieved was 98%, with a significant decrease in the loss function from 3.74 at the beginning of the training phase to 0.07 at the end of the 50 epochs. The proposed model also achieved 85% for precision, recall, and F1 score.
Table 4 shows the accuracy (99%), precision (82%), recall (80%), and F1 score (80%) obtained when applying the proposed model to the full UCF11 dataset.

Table 4: The proposed model's evaluation criteria for the entire UCF11 dataset.

Full UCF11: 1599 videos - 1119 training - 144 validation - 336 testing
32x32 input size - 30 frames - 50 epochs

Phase      Metric        batch size = 8    batch size = 16    batch size = 32
Training   Loss          2.66 -> 0.06      2.69 -> 0.05       2.74 -> 0.06
           Accuracy %    19 -> 99          19 -> 99           16 -> 99
           Precision %   78                82                 79
           Recall %      78                80                 79
           F1 score %    77                80                 79
Testing    Loss          0.73              0.73               0.66
           Accuracy %    78                80                 79

Fig. 4: The accuracy during training step on the full UCF11 dataset.

Figures 4 and 5 also show the accuracy (from 19% to 99%) and loss function (from 2.69 to 0.05) achieved during the training phase.
Using the same architecture parameters with the LoDVP dataset, the proposed model is evaluated as shown in Table 5. The same evaluation criteria are recorded to highlight the effectiveness of the proposed model.
For the transfer learning experiment, we use a 32 x 32 input size, a batch size of 8, and 35 epochs; the dataset is split into training, testing, and validation sets in the ratio 70:20:10, and the model has a total of 4,851,122 trainable parameters.

Fig. 5: The loss function while using the entire UCF11 dataset in the training phase.

Table 5: The evaluation criteria of the proposed model for the LoDVP dataset.

LoDVP: 954 videos - 667 training - 86 validation - 201 testing
32x32 input size - 30 frames - 50 epochs

Phase      Metric        batch size = 8    batch size = 16    batch size = 32
Training   Loss          2.68 -> 0.17      2.70 -> 0.11       2.75 -> 0.13
           Accuracy %    18 -> 95          16 -> 97           18 -> 96
           Precision %   86                83                 83
           Recall %      85                84                 83
           F1 score %    84                83                 82
Testing    Loss          0.51              0.71               0.56
           Accuracy %    84                81                 82

The proposed model is first used to generate a model for the full UCF101 dataset; the generated model is then used as a transfer learning model to evaluate the testing accuracy and loss function for the UCF YouTube and full UCF50 datasets. Table 6 compares the results of the generated transfer learning model with those of 8. Notably, the generated transfer learning model attains better loss scores than 8: on the UCF YouTube dataset, the testing loss is 0.52 compared with 1.51 for 8, and on the full UCF50 dataset it reaches 0.6 compared with 1.61 for 8.
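
A possible realization of this transfer step in Keras is sketched below; the saved-model filename, the layer-freezing choice, and the target class count are assumptions not detailed in the paper.

import tensorflow as tf

# Load the model previously trained on the full UCF101 dataset
# (hypothetical filename; the paper does not name the saved artifact).
base = tf.keras.models.load_model("ucf101_3dcnn.h5")

# Reuse every layer except the final classification Dense layer,
# then attach a new head sized for the target dataset (e.g. 11 UCF11 classes).
x = base.layers[-2].output
outputs = tf.keras.layers.Dense(11, activation="softmax")(x)
transfer_model = tf.keras.Model(inputs=base.input, outputs=outputs)

# Optionally freeze the convolutional backbone and train only the new head.
for layer in transfer_model.layers[:-1]:
    layer.trainable = False

transfer_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(1e-4),
    metrics=["accuracy"],
)
# transfer_model.fit(train_ds, validation_data=valid_ds, epochs=50)
# transfer_model.evaluate(test_ds)
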

Table 6: Results of the transfer learning model.

Model                  UCF Youtube                          Full UCF50
                       Testing loss   Testing accuracy      Testing loss   Testing accuracy
Model in 8             1.51           85.2%                 1.61           82.2%
The transfer model     0.52           86%                   0.6            84%

Table 7: Accuracy comparison of the proposed architecture based on different datasets.

                             Accuracy (%)
Method/Model                 UCF50     UCF101    LoDVP
3D ResNet50 40               –         –         36.19
ConvLSTM 41                  80.38     –         92.38
3D ResNet152 40              83.39     –         90.48
Model in 42                  87.78     –         93.41
ViT-LSTM 43                  96.1      –         93.41
Model in 8                   91.2      87.8      –
Model in 44                  –         72.03     –
Two-Stream 45                –         91.3      –
Model in 46                  –         92.4      –
ConvNet Transformer 47       –         86.1      –
Model in 48                  –         92.8      –
SVFormer-S 49                –         79.1      –
SVFormer-B 49                –         86.7      –
The proposed model           98        99        97

In Table 7, the proposed model is compared with state-of-the-art neural network architectures on different datasets. This accuracy comparison highlights the effectiveness of the proposed model on the three datasets. On average, an accuracy of 98% outperforms existing architectures such as ConvLSTM 41, 3D ResNet152 40, the model in 42, and the model in 8. This significantly better accuracy is based on the observation of how accuracy changes with the addition and removal of layers.

6. Conclusion
This paper presented a new deep learning model for classifying human activity from videos. The proposed model comprises different layers, including Conv3D, MaxPooling, BatchNormalization, Dropout, Flatten, and Dense layers. We utilize different datasets, namely UCF11, UCF50, UCF101, and LoDVP, covering different categories of activities. Various experiments were carried out with the suggested model using different hyperparameters on these datasets, comparing them in accordance with multiple metrics, including F1 score, recall, accuracy, and precision; the highest training accuracies achieved on these datasets are 99%, 98%, 99%, and 97%, respectively.
In the future, we intend to deploy the suggested model in a real-world monitoring setting where videos are submitted in a time-based manner, so that we can evaluate the model's suitability for activity detection and classification. A study also needs to be conducted on how existing pretrained models could be used with the proposed model to improve its results.

Acknowledgments
This study is supported via funding from Prince Sattam bin Abdulaziz University
project number (PSAU/2024/R/1446).

Data Availability
The utilized datasets and the source code of this work are accessible upon readers’
request.

References
1. Nayyer Aafaq, Ajmal Mian, Naveed Akhtar, Wei Liu, and Mubarak Shah. Dense video
captioning with early linguistic information fusion. IEEE Transactions on Multimedia,
25:2309–2322, 2022.
2. Varun Kumar Reja, Koshy Varghese, and Quang Phuc Ha. Computer vision-based
construction progress monitoring. Automation in Construction, 138:104245, 2022.
3. Gensheng Pei, Yazhou Yao, Fumin Shen, Dan Huang, Xingguo Huang, and Heng-
Tao Shen. Hierarchical co-attention propagation network for zero-shot video object
segmentation. IEEE Transactions on Image Processing, 32:2348–2359, 2023.
4. Yixuan Gao, Yuqin Cao, Tengchuan Kou, Wei Sun, Yunlong Dong, Xiaohong Liu,
Xiongkuo Min, and Guangtao Zhai. Vdpve: Vqa dataset for perceptual video en-
hancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 1474–1483, 2023.
5. Ruchi Jayaswal, Anchal Pathak, and Smita Mahajan. Integrating 3dcnn attention
mechanism with pose estimation for indoor fall detection. Available at SSRN 4883239.
6. Sara A Alameen and Areej M Alhothali. A lightweight driver drowsiness detection
system using 3dcnn with lstm. Computer Systems Science & Engineering, 44(1), 2023.
7. PS Shanija and K Rahamathulla. Anomalous event detection from videos using 3d
convolutional network. In AIP Conference Proceedings, volume 3037. AIP Publishing,
2024.
8. Roberta Vrskova, Robert Hudec, Patrik Kamencay, and Peter Sykora. Human activity
classification using the 3dcnn architecture. Applied Sciences, 12(2):931, 2022.
9. Nashwan Adnan Othman and Ilhan Aydin. Development of a novel lightweight cnn
model for classification of human actions in uav-captured videos. Drones, 7(3):148,
2023.
10. Lukun Wang, Wancheng Feng, Chunpeng Tian, Liquan Chen, and Jiaming Pei.
3d-unified spatial-temporal graph for group activity recognition. Neurocomputing,
556:126646, 2023.
11. Amin Ullah, Khan Muhammad, Weiping Ding, Vasile Palade, Ijaz Ul Haq, and
Sung Wook Baik. Efficient activity recognition using lightweight cnn and ds-gru net-
work for surveillance applications. Applied Soft Computing, 103:107102, 2021.
12. Khan Muhammad, Amin Ullah, Ali Shariq Imran, Muhammad Sajjad, Mustafa Servet
Kiran, Giovanna Sannino, and Victor Hugo C. de Albuquerque. Human action recog-
nition using attention based lstm network with dilated cnn features. Future Generation
Computer Systems, 125:820–830, 2021.

13. Bruno Malveira Peixoto, Sandra Avila, Zanoni Dias, and Anderson Rocha. Breaking
down violence: A deep-learning strategy to model and classify violence in videos.
In Proceedings of the 13th International Conference on Availability, Reliability and
Security, pages 1–7, 2018.
14. Chhavi Dhiman and Dinesh Kumar Vishwakarma. High dimensional abnormal hu-
man activity recognition using histogram oriented gradients and zernike moments. In
2017 IEEE International Conference on Computational Intelligence and Computing
Research (ICCIC), pages 1–4. IEEE, 2017.
15. Roberto Olmos, Siham Tabik, and Francisco Herrera. Automatic handgun detection
alarm in videos using deep learning. Neurocomputing, 275:66–72, 2018.
16. Jingen Liu, Yang Yang, and Mubarak Shah. Learning semantic visual vocabularies
using diffusion distance. In 2009 IEEE Conference on Computer Vision and Pattern
Recognition, pages 461–468. IEEE, 2009.
17. Xiang Zhang, Lina Yao, Chaoran Huang, Quan Z Sheng, and Xianzhi Wang. Intent
recognition in smart living through deep recurrent neural networks. In Neural Infor-
mation Processing: 24th International Conference, ICONIP 2017, Guangzhou, China,
November 14-18, 2017, Proceedings, Part II 24, pages 748–758. Springer, 2017.
18. Ziheng Guo, Yang Chen, Wei Huang, and Junhao Zhang. An efficient 3d-nas method
for video-based gesture recognition. In Artificial Neural Networks and Machine
Learning–ICANN 2019: Image Processing: 28th International Conference on Artifi-
cial Neural Networks, Munich, Germany, September 17–19, 2019, Proceedings, Part
III 28, pages 319–329. Springer, 2019.
19. Razieh Rastgoo, Kourosh Kiani, and Sergio Escalera. Hand sign language recognition
using multi-view hand skeleton. Expert Systems with Applications, 150:113336, 2020.
20. Tian Wang, Jiakun Li, Mengyi Zhang, Aichun Zhu, Hichem Snoussi, and Chang Choi.
An enhanced 3dcnn-convlstm for spatiotemporal multimedia data analysis. Concur-
rency and Computation: Practice and Experience, 33(2):e5302, 2021.
21. Jean Phelipe de Oliveira Lima and Carlos Maurício Seródio Figueiredo. A temporal
fusion approach for video classification with convolutional and lstm neural networks
applied to violence detection. Inteligencia Artificial, 24(67):40–50, 2021.
22. Muneer Al-Hammadi, Ghulam Muhammad, Wadood Abdul, Mansour Alsulaiman,
Mohamed A Bencherif, and Mohamed Amine Mekhtiche. Hand gesture recognition
for sign language using 3dcnn. IEEE access, 8:79491–79509, 2020.
23. Yaohui Wang and Antitza Dantcheva. A video is worth more than 1000 lies. com-
paring 3dcnn approaches for detecting deepfakes. In 2020 15Th IEEE international
conference on automatic face and gesture recognition (FG 2020), pages 515–519. IEEE,
2020.
24. Balmukund Mishra, Deepak Garg, Pratik Narang, and Vipul Mishra. A hybrid ap-
proach for search and rescue using 3dcnn and pso. Neural Computing and Applications,
33:10813–10827, 2021.
25. Roberta Vrskova, Robert Hudec, Peter Sykora, Patrik Kamencay, and Miroslav Benco.
Violent behavioral activity classification using artificial neural network. In 2020 New
Trends in Signal Processing (NTSP), pages 1–5. IEEE, 2020.
26. Haochuan Hou, Yaochen Li, Chi Zhang, Hao Liao, Ying Zhang, and Yuehu Liu. Vehicle
behavior recognition using multi-stream 3d convolutional neural network. In 2021 36th
Youth Academic Annual Conference of Chinese Association of Automation (YAC),
pages 355–360. IEEE, 2021.
27. Waqas Sultani and Mubarak Shah. Human action recognition in drone videos using a
few aerial training examples. Computer Vision and Image Understanding, 206:103186,
2021.

28. Md Atiqur Rahman Ahad, Masud Ahmed, Anindya Das Antar, Yasushi Makihara,
and Yasushi Yagi. Action recognition using kinematics posture feature on 3d skeleton
joint locations. Pattern Recognition Letters, 145:216–224, 2021.
29. Wei Lin, Junyu Gao, Qi Wang, and Xuelong Li. Learning to detect anomaly events
in crowd scenes from synthetic data. Neurocomputing, 436:248–259, 2021.
30. Heyam M Bin Jahlan and Lamiaa A Elrefaei. Mobile neural architecture search net-
work and convolutional long short-term memory-based deep features toward detecting
violence from video. Arabian Journal for Science and Engineering, 46(9):8549–8563,
2021.
31. Matteo Tomei, Lorenzo Baraldi, Simone Calderara, Simone Bronzin, and Rita Cuc-
chiara. Video action detection by learning graph-based spatio-temporal interactions.
Computer Vision and Image Understanding, 206:103187, 2021.
32. Yashswi Jain, Ashvini Kumar Sharma, Rajbabu Velmurugan, and Biplab Banerjee.
Posecvae: Anomalous human activity detection. In 2020 25th International Conference
on Pattern Recognition (ICPR), pages 2927–2934. IEEE, 2021.
33. Sibel Kaçdioglu, Barış Özyer, and Gülsah Tümüklü Özyer. Recognizing self-
stimulatory behaviours for autism spectrum disorders. In 2020 28th Signal Processing
and Communications Applications Conference (SIU), pages 1–4. IEEE, 2020.
34. Farhood Negin, Baris Ozyer, Saeid Agahian, Sibel Kacdioglu, and Gulsah Tumuklu
Ozyer. Vision-assisted recognition of stereotype behaviors for early diagnosis of autism
spectrum disorders. Neurocomputing, 446:145–155, 2021.
35. Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural net-
works. In Proceedings of the fourteenth international conference on artificial intelli-
gence and statistics, pages 315–323. JMLR Workshop and Conference Proceedings,
2011.
36. Roberta Vrskova, Robert Hudec, Patrik Kamencay, and Peter Sykora. A new approach
for abnormal human activities recognition based on convlstm architecture. Sensors,
22(8):2946, 2022.
37. Jingen Liu, Jiebo Luo, and Mubarak Shah. Recognizing realistic actions from videos
“in the wild”. In 2009 IEEE Conference on Computer Vision and Pattern Recognition,
pages 1996–2003. IEEE, 2009.
38. Kishore K Reddy and Mubarak Shah. Recognizing 50 human action categories of web
videos. Machine vision and applications, 24(5):971–981, 2013.
39. Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101
human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
40. Emna Ghodhbani, Mounir Kaaniche, and Amel Benazza-Benyahia. An effective 3d
resnet architecture for stereo image retrieval. In VISIGRAPP (4: VISAPP), pages
380–387, 2021.
41. Asanka G Perera, Yee Wei Law, Titilayo T Ogunwa, and Javaan Chahl. A multiview-
point outdoor dataset for human action recognition. IEEE Transactions on Human-
Machine Systems, 50(5):405–413, 2020.
42. Roberta Vrskova, Patrik Kamencay, Robert Hudec, and Peter Sykora. A new deep-
learning method for human activity recognition. Sensors, 23(5):2816, 2023.
43. Altaf Hussain, Tanveer Hussain, Waseem Ullah, and Sung Wook Baik. Vision trans-
former and deep sequence learning for human activity recognition in surveillance
videos. Computational Intelligence and Neuroscience, 2022(1):3454167, 2022.
44. Jinsol Ha, Joongchol Shin, Hasil Park, and Joonki Paik. Action recognition network
using stacked short-term deep features and bidirectional moving average. Applied Sci-
ences, 11(12):5563, 2021.
45. Kai Zhou, Tingting Wu, Chunyu Wang, Junfeng Wang, and Chao Li. Skeleton based abnormal behavior recognition using spatio-temporal convolution and attention-based lstm. Procedia Computer Science, 174:424–432, 2020.
46. Faisal Mehmood, Enqing Chen, Muhammad Azeem Akbar, and Abeer Abdulaziz Al-
sanad. Human action recognition of spatiotemporal parameters for skeleton sequences
using mtln feature learning framework. Electronics, 10(21):2708, 2021.
47. Huu Phong Nguyen and Bernardete Ribeiro. Video action recognition collaborative
learning with dynamics via pso-convnet transformer. Scientific Reports, 13(1):14624,
2023.
48. Amin Ullah, Jamil Ahmad, Khan Muhammad, Muhammad Sajjad, and Sung Wook
Baik. Action recognition in video sequences using deep bi-directional lstm with cnn
features. IEEE access, 6:1155–1166, 2017.
49. Zhen Xing, Qi Dai, Han Hu, Jingjing Chen, Zuxuan Wu, and Yu-Gang Jiang. Sv-
former: Semi-supervised video transformer for action recognition. In Proceedings of
the IEEE/CVF conference on computer vision and pattern recognition, pages 18816–
18826, 2023.
