AHMED S. SALAMA1
Department of Computer Engineering and Electronics, Cairo Higher Institute for Engineering,
Computer Science and Management, Cairo, Egypt.
HAMADA I. ABDULWAKEL2
Computer Science Department, Faculty of Computers and Information, Minia University,
Minia 61519, Egypt
ESRAA ELDESOUKY3,4∗
Department of Computer Science, College of Computer Engineering and Sciences, Prince
Sattam Bin Abdulaziz University, Al Kharj, 11942, Saudi Arabia
Department of Computer Science, Faculty of Computers and Informatics, Suez Canal University
Ismailia, 41522, Egypt
MOHAMED MASHHOUR1
AHMAD M. NAGM1
Keywords: 3DCNN; Deep learning; Human activity recognition; UCF Youtube action
dataset; UCF101 dataset; LoDVP abnormal activities dataset.
∗ Corresponding author.
1. Introduction
Computer Vision (CV) as one of the important advancements in computer science
helps mitigating different formidable visual challenges such as dense video caption-
ing 1 , monitoring 2 , video-object segmentation 3 , and video enhancement 4 , this
increase the dependency on CV for contributing the research community with more
solutions to various challenges.
In the realm of video recognition, the precise identification of activities and abnormalities remains a formidable challenge, particularly in scenarios such as patient monitoring 5 , driver drowsiness detection 6 , and security systems 7 . An intuitive approach to address this issue is to enhance video classification, aiming to capture more intricate features and thereby improve the detection and classification of activities. However, when dealing with activities that involve similar movements, and especially those in which the human activity occupies only a few milliseconds, there is an inherent scarcity of feature information. This necessitates a greater focus on the extraction and utilization of detailed features from video frames.
To analyze the spatiotemporal features of human activity videos, a 3DCNN architecture for extracting motion, static, and hybrid features is introduced in 8 . Similarly, the HarNet model is proposed in 9 to analyze and classify human activities in UAV-captured videos.
For discerning team or group activities, the movement relationships between individuals are analyzed in 10 to overcome the problems caused by occlusion and shooting angle. The work in 11 redefines the ranking loss paradigm to optimize the detection of human activities, harnessing deep CNN-extracted features and uncovering latent sequential patterns via LSTM architectures. From a spatiotemporal vantage point, 12 proposes a methodology integrating a pretrained ResNet50 for robust feature extraction with a multilayered LSTM for event recognition in soccer video analytics. Although these hybrid CNN-LSTM models demonstrate strong efficacy in vision-based human action and activity recognition tasks, they inherently escalate computational demands due to the resource-intensive nature of CNN-driven feature extraction and LSTM-centric action modeling. Moreover, their accuracy still needs to be enhanced to enable better generalization of the classification model.
In constructing human-activity detection models using Three-Dimensional Convolutional Neural Networks (3DCNN), employing a multilayered architecture is paramount. Adding layers increases the ability to learn more complex features and hierarchies from spatiotemporal video frames, which significantly helps to elevate accuracy and to detect human activities with similar movements. Despite the benefits of a multilayered 3DCNN for processing video frames, it confronts numerous challenges in practical detection, such as vanishing gradients, overfitting, depth balance, and hyperparameter tuning, in addition to increased computational cost.
The availability of large and diverse human activity video datasets encourages the research community to pursue more capable machine/deep learning algorithms that deliver significantly better accuracy on these datasets. For this reason, the main contributions of this paper are summarized as follows:
• We develop a new 3DCNN architecture model to recognize and classify different human-based activities.
• Well-known human activity video datasets (e.g., UCF11a , UCF50b , UCF101c , and LoDVPd ) are used to evaluate the proposed model.
• An intensive analysis is conducted on the above-mentioned datasets using different hyperparameters to improve the stability of the proposed model.
• Based on the full UCF101 dataset, we generate a learning model able to classify 101 human activities.
• Our generated learning model is then used as a transfer learning model to evaluate the accuracy and loss function on other datasets.
• Our experimental results show that the proposed model achieves a higher performance level on different evaluation metrics than existing baseline deep learning models.
The structure of the paper is organized as follows. Section 1 discusses the importance of analyzing human activity from real videos. Section 2 describes related work on this topic. The proposed methodology, including the 3DCNN architecture and implementation specifics, is covered in Section 3. Various common human activity video datasets are discussed in Section 4. Section 5 presents the results and analysis of the proposed methodology. Section 6 concludes the paper.
2. Related Work
Regarding the challenges brought up in the pertinent literature, human activity recognition for elderly health concerns, such as fall detection, is addressed in 13 . Using
neural networks for weapon identification through video processing is investigated
by the authors in 14 . In 15 , the focus is on classifying ATM-person attack scenarios
using camera videos. The suitability of monitoring videos and their content for
children is explored in 16 .
Industry is also applying neural networks to video classification in a range of applications to solve its own difficulties 17 . Recognizing gestures from spatiotemporal information is the domain of the sophisticated and powerful 3DCNN model in 18 . Based on RGB input videos, 19 proposed a deep learning architecture to automatically detect and recognize hand sign language. To identify human activities, the authors in 20 developed a deep learning model that combines a 3DCNN with Convolutional Long Short-term Memory (ConvLSTM) for activity classification.
Different CNN methods for classifying videos into violent or nonviolent categories are proposed by the authors in 21 . The study in 22 examines the ability of advanced video CNNs, including 3D ResNet and I3D, to detect altered videos. In 23 , an exponential linear units-3DCNN model is applied to video representation, extracting deep features from mobile videos. Finally, 24 introduces a novel 3DCNN-powered model for scene classification in drone surveillance. Additionally, a more thorough examination of the ConvLSTM architecture is evaluated in 25 using the UCF crime database, identifying video sequences that encompass several categories of human activities.
In 26 , 3DCNN is applied for vehicle behavior recognition. The classification of
aerial activities is conducted in 27 using a dataset specific to such activities. Support
vector machines (SVM) and convolutional recurrent neural networks (CRNNs) are
employed to construct the classification model proposed in 28 , where Kinematics
Posture Features are extracted from 3D joint locations. Techniques for identifying
irregularities in crowd scenes are suggested in 29 , where a 3D GAN and 3DCNN
architecture are demonstrated to bridge the domain gap for domain adaptation.
Spatiotemporal features are extracted using ConvLSTM in 30 , which also presents
methods for detecting aggression using Mobile Neural Architecture and a dataset
including both violent and nonviolent samples. Finally, a graph-based system for
learning high-level interactions between objects and humans, referred to as Action
Detection, is proposed in 31 .
The authors in 32 focus on ensuring human safety and coordinating human activity; despite problems with dataset availability, they use a dataset of people walking and running. In 33 , two prominent neural networks, 3DCNN and ConvLSTM, are used to recognize self-stimulatory behaviours in video recordings for diagnosing autism spectrum disorder (ASD). The conclusions in 34 are drawn by comparing an improved model with two refined deep neural networks (ConvLSTM and 3DCNN). The authors achieve their best results using a combination of the Histogram of Optical Flow (HOF) descriptor and an MLP classifier, while their deep-learning system, based on an LSTM network, takes pose-based skeletal joint sequences as input to learn the temporal progression of postures.
3. Proposed Methodology
3.1. Implementation Details
Deep learning models, especially those that deal with 3D images (i.e., videos), require computationally intensive libraries to perform the necessary processing on the inputs, so we chose the Kaggle infrastructure as our model operating environment.
\mathrm{ReLU}(x) = \begin{cases} x, & \text{if } x \ge 0 \\ 0, & \text{if } x < 0 \end{cases} \qquad (2)
where the output position is given by the indices (i, j, k), L is the layer index of the filter, X is the input image, F is the filter size, C is the number of input channels, W is a weight matrix of size (C, F, F, F), and (m, n, d) is the input position.
• MaxPooling3D layer: a down-sampling operation over the 3D images that takes a 5D tensor representing the 3D image and reduces it with different stride steps across the Conv3D layers. For each channel, the MaxPooling operation determines the maximum value of each 2*2*2 subregion and replaces that region with a single 1*1*1 cube holding this maximum; the 2*2*2 subregion mask is shifted by the stride along each dimension (a minimal shape sketch follows this list).
• Flatten layer: converts the feature map generated by the max-pooling layer into a format that the dense layers can process. The feature map is a multi-dimensional array of values, whereas the dense layers require a one-dimensional array as input; the flatten layer therefore flattens the feature maps into a one-dimensional vector before the dense layers.
• Batch normalization layers: (a.k.a. batch norm) a method used to normalize the input of the Conv3D layers as well as of the hidden layers. Normalization is performed by adjusting the mean and scale of the activations, which makes the training of the 3DCNN faster and more stable.
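As a minimal illustration of this down-sampling, the sketch below (the 30-frame, 32x32 clip shape mirrors the input size used later; the library choice is an assumption) shows how MaxPooling3D halves each spatio-temporal dimension of a 5D tensor:

```python
# Minimal sketch: MaxPooling3D over a 5D tensor (batch, frames, height, width, channels).
import numpy as np
import tensorflow as tf

clip = tf.constant(np.random.rand(1, 30, 32, 32, 3), dtype=tf.float32)
pooled = tf.keras.layers.MaxPooling3D(pool_size=(2, 2, 2), strides=2)(clip)
print(pooled.shape)  # (1, 15, 16, 16, 3): each 2*2*2 cube reduced to its maximum value
```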
The proposed 3DCNN architecture in this study consists of an input layer, five Conv3D blocks, and an output layer, as follows:
(1) The input layer: of the suggested network has dimensions of 32, 32, and 3 for
input width, height, and number of channels, respectively.
(2) The 1st 3DCNN layer: this layer uses 32 filters with a 3 × 3 × 3 kernel size. It includes a BatchNormalization layer, a MaxPooling3D layer of size 2 × 2 × 2 with a stride of 2, and another BatchNormalization layer to help enhance the output of the training process at this step.
(3) The 2nd 3DCNN layer: this layer operates 64 filters with the same kernel size as the previous layer. It also has two BatchNormalization layers with a MaxPooling3D layer between them, using the same pooling region and stride as before. To prevent overfitting and make the proposed model more stable across different inputs, we add a Dropout layer with a rate of 0.2 that randomly sets input units to 0 at each step during training.
(4) The 3rd 3DCNN layer: the number of 3DCNN filters is increased in this layer to 128, with the same kernel size. This layer is followed by MaxPooling3D and BatchNormalization layers.
(5) The 4th 3DCNN layer: the number of filters doubles again in this layer to 256, with the same kernel size as the previous layers. This layer is followed by MaxPooling3D, BatchNormalization, and Dropout layers, in that order. Here, the Dropout layer has a rate of 0.5 to prevent all neurons in this layer from synchronously optimizing their weights.
(6) The 5th 3DCNN layer: the final 3DCNN layer has 512 filters with the same fixed kernel size used in all layers, followed by two BatchNormalization layers interspersed with a MaxPooling3D layer with the same settings as in the previous layers. To regularize the training process, this layer concludes with a Dropout layer with a rate of 0.5.
(7) The output layer: after the final 3DCNN layer, a Flatten layer is used to turn the output into a vector. There are two Dense layers, with 256 units and with as many units as the number of dataset classes, respectively. The second Dense layer produces the final predicted class of the input video. We trained our 3DCNN model using the Adam optimization algorithm with a learning rate of 0.0001. Table 1 provides a more detailed description of our 3DCNN model and its layers; a minimal code sketch of this stack follows the list.
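The layer ordering above can be summarized in the following minimal Keras sketch. It is a sketch under stated assumptions, not the authors' exact implementation: the convolution padding, activations, loss function, and the "same" padding on the pooling layers (needed to keep a 30-frame temporal dimension valid across five pooling stages) are assumptions not specified in the text.

```python
# Hypothetical Keras sketch of the described 3DCNN; padding and loss are assumptions.
from tensorflow.keras import layers, models, optimizers

def build_3dcnn(num_classes, frames=30, height=32, width=32, channels=3):
    pool = dict(pool_size=(2, 2, 2), strides=2, padding="same")  # assumed padding
    model = models.Sequential([
        layers.Input(shape=(frames, height, width, channels)),
        # Block 1: 32 filters, BN -> MaxPool -> BN
        layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(), layers.MaxPooling3D(**pool), layers.BatchNormalization(),
        # Block 2: 64 filters, BN -> MaxPool -> BN -> Dropout(0.2)
        layers.Conv3D(64, (3, 3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(), layers.MaxPooling3D(**pool), layers.BatchNormalization(),
        layers.Dropout(0.2),
        # Block 3: 128 filters, MaxPool -> BN
        layers.Conv3D(128, (3, 3, 3), padding="same", activation="relu"),
        layers.MaxPooling3D(**pool), layers.BatchNormalization(),
        # Block 4: 256 filters, MaxPool -> BN -> Dropout(0.5)
        layers.Conv3D(256, (3, 3, 3), padding="same", activation="relu"),
        layers.MaxPooling3D(**pool), layers.BatchNormalization(), layers.Dropout(0.5),
        # Block 5: 512 filters, BN -> MaxPool -> BN -> Dropout(0.5)
        layers.Conv3D(512, (3, 3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(), layers.MaxPooling3D(**pool), layers.BatchNormalization(),
        layers.Dropout(0.5),
        # Output head: Flatten -> Dense(256) -> Dense(num_classes)
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

Under these assumptions, the five pooling stages reduce a 30 × 32 × 32 input to a single 1 × 1 × 1 × 512 feature map before the Flatten layer, which keeps the dense head compact.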
4. Methodology Datasets
4.1. LoDVP Abnormal Activities Dataset
The LoDVP Abnormal Activity dataset has 11 human activity categories. Each category contains about 100 videos, and the dataset contains 1069 videos in total. These videos were created by non-professional actors in different places, such as a forest with different scene angles, the university campus, and the parking lot. The dataset includes the following categories: (a) Begging, (b) Drunkenness, (c) Fight, (d) Harassment, (e) Hijack, (f) Knife hazard, (g) Normal videos, (h) Pollution, (i) Property damage, (j) Robbery, (k) Terrorism 36 .
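The text does not spell out the video preprocessing pipeline; the result tables below report a 32x32 input size and 30 frames per clip, so a plausible loading step can be sketched as follows (OpenCV-based uniform frame sampling; the function name and padding strategy are assumptions, not the authors' code).

```python
# Hypothetical preprocessing sketch: uniformly sample 30 frames per video,
# resize them to 32x32, and scale pixel values to [0, 1].
import cv2
import numpy as np

def load_clip(video_path, num_frames=30, size=(32, 32)):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * step)  # jump to the i-th sampled frame
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, size) / 255.0)
    cap.release()
    while frames and len(frames) < num_frames:   # pad short videos with the last frame
        frames.append(frames[-1])
    return np.asarray(frames, dtype=np.float32)  # shape: (30, 32, 32, 3)
```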
Table 2: Results of the proposed model on the Modified UCF101 dataset (2035 videos: 1424 training, 183 validation, 428 testing; 32x32 input size, 30 frames, 50 epochs; training loss and accuracy are reported as begin → end of training).

Phase    | Metric      | batch size = 8 | batch size = 16 | batch size = 32
Training | Loss        | 2.62 → 0.06    | 2.50 → 0.05     | 2.74 → 0.088
Training | Accuracy %  | 27 → 98        | 26 → 99         | 22 → 98
Testing  | Precision % | 88             | 89              | 89
Testing  | Recall %    | 86             | 89              | 88
Testing  | F1 score %  | 87             | 88              | 88
Testing  | Loss        | 0.52           | 0.44            | 0.42
Testing  | Accuracy %  | 87             | 88              | 88
Fig. 2: The accuracy during the training phase on the Modified UCF101 dataset.
Fig. 3: The loss function during the training phase on the Modified UCF101 dataset.
\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (3)

\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (4)

F1\,score = 2 \times \frac{\mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \qquad (5)
where true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are the four outcome counts. Accuracy provides an overall evaluation of the classifier's effectiveness across all classes. Precision measures how accurately the classifier recognizes positive cases; a higher precision means fewer false positives. Recall quantifies the classifier's ability to identify all positive instances; a higher recall indicates fewer false negatives.
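The paper does not state the tooling used to compute these metrics; a minimal sketch using scikit-learn with macro averaging (an assumption, since the averaging scheme is not specified above) is:

```python
# Sketch: accuracy, macro precision, recall, and F1 from true vs. predicted labels.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 2, 1, 1, 0]   # illustrative ground-truth class indices
y_pred = [0, 2, 1, 0, 0]   # illustrative model predictions

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```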
Table 3: Results of the proposed model on the full UCF50 dataset (6442 videos: 1119 training, 144 validation, 336 testing; 32x32 input size, 30 frames, 50 epochs; training loss and accuracy are reported as begin → end of training).

Phase    | Metric      | batch size = 8 | batch size = 16 | batch size = 32
Training | Loss        | 3.78 → 0.07    | 3.74 → 0.07     | 3.92 → 0.05
Training | Accuracy %  | 11 → 98        | 10 → 98         | 9.54 → 99
Testing  | Precision % | 84             | 85              | 86
Testing  | Recall %    | 84             | 85              | 85
Testing  | F1 score %  | 84             | 85              | 85
Testing  | Loss        | 0.78           | 0.61            | 0.70
Testing  | Accuracy %  | 84             | 85              | 84
With the same dataset splitting ratio, Table 3 shows the results of the proposed model on the full UCF50 dataset, where the training accuracy reached 98% with a significant decrease in the loss function from 3.74 at the beginning of the training phase to 0.07 at the end of 50 epochs. The proposed model also achieved 85% for precision, recall, and F1 score.
Table 4 shows the accuracy (99%), precision (82%), recall (80%), and F1 score (80%) obtained when applying the proposed model to the full UCF11 dataset.
Table 4: Results of the proposed model on the full UCF11 dataset (1599 videos: 1119 training, 144 validation, 336 testing; 32x32 input size, 30 frames, 50 epochs; training loss and accuracy are reported as begin → end of training).

Phase    | Metric      | batch size = 8 | batch size = 16 | batch size = 32
Training | Loss        | 2.66 → 0.06    | 2.69 → 0.05     | 2.74 → 0.06
Training | Accuracy %  | 19 → 99        | 19 → 99         | 16 → 99
Testing  | Precision % | 78             | 82              | 79
Testing  | Recall %    | 78             | 80              | 79
Testing  | F1 score %  | 77             | 80              | 79
Testing  | Loss        | 0.73           | 0.73            | 0.66
Testing  | Accuracy %  | 78             | 80              | 79
Fig. 4: The accuracy during the training phase on the full UCF11 dataset.

Figures 4 and 5 also show the accuracy (from 19% to 99%) and loss function (from 2.69 to 0.05) achieved during the training phase.
Using the same architecture parameters on the LoDVP dataset, the proposed model is evaluated as shown in Table 5. The same evaluation criteria are recorded to highlight the effectiveness of the proposed model.
Fig. 5: The loss function during the training phase on the full UCF11 dataset.

Using a 32 x 32 input size, a batch size of 8, 35 epochs, a 70:20:10 training:testing:validation split, and a total of 4,851,122 trainable parameters, the proposed model is used to generate a model for the full UCF101 dataset. We then use this generated model as a transfer learning model to evaluate the testing accuracy and loss function on the UCF Youtube and full UCF50 datasets. Table 6 compares the results of the generated transfer learning model with those of 8 . Of note, the generated transfer learning model obtains better loss scores than 8 : on the UCF Youtube dataset, the testing loss is 0.52 compared with 1.51 for 8 , and on the full UCF50 dataset it reaches 0.6 compared with 1.61 for 8 .
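One plausible way to realize this transfer learning step, reusing the UCF101-trained 3DCNN as a frozen feature extractor with a new classification head, is sketched below (the checkpoint file name, the frozen-layer choice, and the 11-class head for the UCF Youtube dataset are assumptions, not details given in the text).

```python
# Hypothetical transfer-learning sketch on top of the UCF101-trained model.
from tensorflow.keras import layers, models, optimizers

base = models.load_model("ucf101_3dcnn.h5")               # assumed checkpoint name
features = base.layers[-2].output                          # output of the 256-unit Dense layer
head = layers.Dense(11, activation="softmax")(features)    # new head: 11 UCF Youtube classes

transfer_model = models.Model(inputs=base.input, outputs=head)
for layer in transfer_model.layers[:-1]:                    # freeze everything but the new head
    layer.trainable = False

transfer_model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                       loss="categorical_crossentropy", metrics=["accuracy"])
# transfer_model.fit(train_clips, train_labels, validation_data=(val_clips, val_labels), ...)
```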
Comparison of accuracy (%) between the proposed model and existing models on the UCF50, UCF101, and LoDVP datasets (– indicates no reported result):

Method/Model             | UCF50 | UCF101 | LoDVP
3D ResNet50 [40]         | –     | –      | 36.19
ConvLSTM [41]            | 80.38 | –      | 92.38
3D ResNet152 [40]        | 83.39 | –      | 90.48
Model in [42]            | 87.78 | –      | 93.41
ViT-LSTM [43]            | 96.1  | –      | 93.41
Model in [8]             | 91.2  | 87.8   | –
Model in [44]            | –     | 72.03  | –
Two-Stream [45]          | –     | 91.3   | –
Model in [46]            | –     | 92.4   | –
ConvNet Transformer [47] | –     | 86.1   | –
Model in [48]            | –     | 92.8   | –
SVFormer-S [49]          | –     | 79.1   | –
SVFormer-B [49]          | –     | 86.7   | –
The proposed model       | 98    | 99     | 97
6. Conclusion
This paper presented a new deep learning model for classifying human activities from videos. The proposed model comprises different layers, including Conv3D, MaxPooling, BatchNormalization, Dropout, Flatten, and Dense layers. We utilize different datasets, namely UCF11, UCF50, UCF101, and LoDVP, covering different categories of activities. Various experiments were carried out with the suggested model using different hyperparameters on these datasets and compared according to multiple metrics, including F1-score, recall, accuracy, and precision; the highest training accuracies achieved on these datasets are 99%, 98%, 99%, and 97%, respectively.
In the future, we intend to deploy the suggested model in a real-world monitoring setting where videos are submitted in a time-based manner, so we can evaluate the model's suitability for activity detection and classification. In addition, a study needs to be conducted to determine how existing pretrained models could be used with the proposed model to improve its results.
Acknowledgments
This study is supported via funding from Prince Sattam bin Abdulaziz University
project number (PSAU/2024/R/1446).
Data Availability
The utilized datasets and the source code of this work are accessible upon readers’
request.
References
1. Nayyer Aafaq, Ajmal Mian, Naveed Akhtar, Wei Liu, and Mubarak Shah. Dense video
captioning with early linguistic information fusion. IEEE Transactions on Multimedia,
25:2309–2322, 2022.
2. Varun Kumar Reja, Koshy Varghese, and Quang Phuc Ha. Computer vision-based
construction progress monitoring. Automation in Construction, 138:104245, 2022.
3. Gensheng Pei, Yazhou Yao, Fumin Shen, Dan Huang, Xingguo Huang, and Heng-
Tao Shen. Hierarchical co-attention propagation network for zero-shot video object
segmentation. IEEE Transactions on Image Processing, 32:2348–2359, 2023.
4. Yixuan Gao, Yuqin Cao, Tengchuan Kou, Wei Sun, Yunlong Dong, Xiaohong Liu,
Xiongkuo Min, and Guangtao Zhai. Vdpve: Vqa dataset for perceptual video en-
hancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 1474–1483, 2023.
5. Ruchi Jayaswal, Anchal Pathak, and Smita Mahajan. Integrating 3dcnn attention
mechanism with pose estimation for indoor fall detection. Available at SSRN 4883239.
6. Sara A Alameen and Areej M Alhothali. A lightweight driver drowsiness detection
system using 3dcnn with lstm. Computer Systems Science & Engineering, 44(1), 2023.
7. PS Shanija and K Rahamathulla. Anomalous event detection from videos using 3d
convolutional network. In AIP Conference Proceedings, volume 3037. AIP Publishing,
2024.
8. Roberta Vrskova, Robert Hudec, Patrik Kamencay, and Peter Sykora. Human activity
classification using the 3dcnn architecture. Applied Sciences, 12(2):931, 2022.
9. Nashwan Adnan Othman and Ilhan Aydin. Development of a novel lightweight cnn
model for classification of human actions in uav-captured videos. Drones, 7(3):148,
2023.
10. Lukun Wang, Wancheng Feng, Chunpeng Tian, Liquan Chen, and Jiaming Pei.
3d-unified spatial-temporal graph for group activity recognition. Neurocomputing,
556:126646, 2023.
11. Amin Ullah, Khan Muhammad, Weiping Ding, Vasile Palade, Ijaz Ul Haq, and
Sung Wook Baik. Efficient activity recognition using lightweight cnn and ds-gru net-
work for surveillance applications. Applied Soft Computing, 103:107102, 2021.
12. Khan Muhammad, Amin Ullah, Ali Shariq Imran, Muhammad Sajjad, Mustafa Servet
Kiran, Giovanna Sannino, and Victor Hugo C. de Albuquerque. Human action recog-
nition using attention based lstm network with dilated cnn features. Future Generation
Computer Systems, 125:820–830, 2021.
13. Bruno Malveira Peixoto, Sandra Avila, Zanoni Dias, and Anderson Rocha. Breaking
down violence: A deep-learning strategy to model and classify violence in videos.
In Proceedings of the 13th International Conference on Availability, Reliability and
Security, pages 1–7, 2018.
14. Chhavi Dhiman and Dinesh Kumar Vishwakarma. High dimensional abnormal hu-
man activity recognition using histogram oriented gradients and zernike moments. In
2017 IEEE International Conference on Computational Intelligence and Computing
Research (ICCIC), pages 1–4. IEEE, 2017.
15. Roberto Olmos, Siham Tabik, and Francisco Herrera. Automatic handgun detection
alarm in videos using deep learning. Neurocomputing, 275:66–72, 2018.
16. Jingen Liu, Yang Yang, and Mubarak Shah. Learning semantic visual vocabularies
using diffusion distance. In 2009 IEEE Conference on Computer Vision and Pattern
Recognition, pages 461–468. IEEE, 2009.
17. Xiang Zhang, Lina Yao, Chaoran Huang, Quan Z Sheng, and Xianzhi Wang. Intent
recognition in smart living through deep recurrent neural networks. In Neural Infor-
mation Processing: 24th International Conference, ICONIP 2017, Guangzhou, China,
November 14-18, 2017, Proceedings, Part II 24, pages 748–758. Springer, 2017.
18. Ziheng Guo, Yang Chen, Wei Huang, and Junhao Zhang. An efficient 3d-nas method
for video-based gesture recognition. In Artificial Neural Networks and Machine
Learning–ICANN 2019: Image Processing: 28th International Conference on Artifi-
cial Neural Networks, Munich, Germany, September 17–19, 2019, Proceedings, Part
III 28, pages 319–329. Springer, 2019.
19. Razieh Rastgoo, Kourosh Kiani, and Sergio Escalera. Hand sign language recognition
using multi-view hand skeleton. Expert Systems with Applications, 150:113336, 2020.
20. Tian Wang, Jiakun Li, Mengyi Zhang, Aichun Zhu, Hichem Snoussi, and Chang Choi.
An enhanced 3dcnn-convlstm for spatiotemporal multimedia data analysis. Concur-
rency and Computation: Practice and Experience, 33(2):e5302, 2021.
21. Jean Phelipe de Oliveira Lima and Carlos Maurício Seródio Figueiredo. A temporal
fusion approach for video classification with convolutional and lstm neural networks
applied to violence detection. Inteligencia Artificial, 24(67):40–50, 2021.
22. Muneer Al-Hammadi, Ghulam Muhammad, Wadood Abdul, Mansour Alsulaiman,
Mohamed A Bencherif, and Mohamed Amine Mekhtiche. Hand gesture recognition
for sign language using 3dcnn. IEEE access, 8:79491–79509, 2020.
23. Yaohui Wang and Antitza Dantcheva. A video is worth more than 1000 lies: comparing 3dcnn approaches for detecting deepfakes. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pages 515–519. IEEE,
2020.
24. Balmukund Mishra, Deepak Garg, Pratik Narang, and Vipul Mishra. A hybrid ap-
proach for search and rescue using 3dcnn and pso. Neural Computing and Applications,
33:10813–10827, 2021.
25. Roberta Vrskova, Robert Hudec, Peter Sykora, Patrik Kamencay, and Miroslav Benco.
Violent behavioral activity classification using artificial neural network. In 2020 New
Trends in Signal Processing (NTSP), pages 1–5. IEEE, 2020.
26. Haochuan Hou, Yaochen Li, Chi Zhang, Hao Liao, Ying Zhang, and Yuehu Liu. Vehicle
behavior recognition using multi-stream 3d convolutional neural network. In 2021 36th
Youth Academic Annual Conference of Chinese Association of Automation (YAC),
pages 355–360. IEEE, 2021.
27. Waqas Sultani and Mubarak Shah. Human action recognition in drone videos using a
few aerial training examples. Computer Vision and Image Understanding, 206:103186,
2021.
28. Md Atiqur Rahman Ahad, Masud Ahmed, Anindya Das Antar, Yasushi Makihara,
and Yasushi Yagi. Action recognition using kinematics posture feature on 3d skeleton
joint locations. Pattern Recognition Letters, 145:216–224, 2021.
29. Wei Lin, Junyu Gao, Qi Wang, and Xuelong Li. Learning to detect anomaly events
in crowd scenes from synthetic data. Neurocomputing, 436:248–259, 2021.
30. Heyam M Bin Jahlan and Lamiaa A Elrefaei. Mobile neural architecture search net-
work and convolutional long short-term memory-based deep features toward detecting
violence from video. Arabian Journal for Science and Engineering, 46(9):8549–8563,
2021.
31. Matteo Tomei, Lorenzo Baraldi, Simone Calderara, Simone Bronzin, and Rita Cuc-
chiara. Video action detection by learning graph-based spatio-temporal interactions.
Computer Vision and Image Understanding, 206:103187, 2021.
32. Yashswi Jain, Ashvini Kumar Sharma, Rajbabu Velmurugan, and Biplab Banerjee.
Posecvae: Anomalous human activity detection. In 2020 25th International Conference
on Pattern Recognition (ICPR), pages 2927–2934. IEEE, 2021.
33. Sibel Kaçdioglu, Barış Özyer, and Gülsah Tümüklü Özyer. Recognizing self-
stimulatory behaviours for autism spectrum disorders. In 2020 28th Signal Processing
and Communications Applications Conference (SIU), pages 1–4. IEEE, 2020.
34. Farhood Negin, Baris Ozyer, Saeid Agahian, Sibel Kacdioglu, and Gulsah Tumuklu
Ozyer. Vision-assisted recognition of stereotype behaviors for early diagnosis of autism
spectrum disorders. Neurocomputing, 446:145–155, 2021.
35. Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural net-
works. In Proceedings of the fourteenth international conference on artificial intelli-
gence and statistics, pages 315–323. JMLR Workshop and Conference Proceedings,
2011.
36. Roberta Vrskova, Robert Hudec, Patrik Kamencay, and Peter Sykora. A new approach
for abnormal human activities recognition based on convlstm architecture. Sensors,
22(8):2946, 2022.
37. Jingen Liu, Jiebo Luo, and Mubarak Shah. Recognizing realistic actions from videos
“in the wild”. In 2009 IEEE Conference on Computer Vision and Pattern Recognition,
pages 1996–2003. IEEE, 2009.
38. Kishore K Reddy and Mubarak Shah. Recognizing 50 human action categories of web
videos. Machine vision and applications, 24(5):971–981, 2013.
39. Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101
human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
40. Emna Ghodhbani, Mounir Kaaniche, and Amel Benazza-Benyahia. An effective 3d
resnet architecture for stereo image retrieval. In VISIGRAPP (4: VISAPP), pages
380–387, 2021.
41. Asanka G Perera, Yee Wei Law, Titilayo T Ogunwa, and Javaan Chahl. A multiview-
point outdoor dataset for human action recognition. IEEE Transactions on Human-
Machine Systems, 50(5):405–413, 2020.
42. Roberta Vrskova, Patrik Kamencay, Robert Hudec, and Peter Sykora. A new deep-
learning method for human activity recognition. Sensors, 23(5):2816, 2023.
43. Altaf Hussain, Tanveer Hussain, Waseem Ullah, and Sung Wook Baik. Vision trans-
former and deep sequence learning for human activity recognition in surveillance
videos. Computational Intelligence and Neuroscience, 2022(1):3454167, 2022.
44. Jinsol Ha, Joongchol Shin, Hasil Park, and Joonki Paik. Action recognition network
using stacked short-term deep features and bidirectional moving average. Applied Sci-
ences, 11(12):5563, 2021.
45. Kai Zhou, Tingting Wu, Chunyu Wang, Junfeng Wang, and Chao Li. Skeleton based