Crime Detection From Pre-Crime Video Analysis With Augmented Activity Information
Abstract—This study focuses on the detection of pre-crime events in videos, specifically shoplifting. In video understanding, visual features, pose, and emotion information are inherently frame-centric, providing insights within isolated frames. Our approach introduces a novel method by extracting activity information between frames and incorporating visual features augmented with human pose and emotion information to enhance the understanding of pre-shoplifting behaviors. We used a set of CCTV videos in which, at the end of some clips, the customers shoplift and in others they do not. We used these videos as training data with a transformer machine learning architecture. We augmented low-level video analysis with pose and emotion information at the frame level and extracted activity information between frames. We conducted a systematic series of experiments to assess the performance of our model under two distinct configurations. The first configuration involved frame-centric visual features augmented with pose and emotion information, while the second configuration focused on activity information between frames. The results demonstrate improvements, underscoring the pivotal role of extracting temporal dynamics between frames in capturing nuanced activities leading up to the crime. The comprehension and capture of these activity patterns between frames proved to be crucial for a more thorough examination of pre-shoplifting events. The essence of our study's contribution hinges on extracting temporal dynamics between frames, enabling the capture of crucial activities of individuals prior to engaging in shoplifting.

Index Terms—crime detection, pre-crime video analysis, augmented information, activity, pose, emotion, transformer

I. INTRODUCTION

Despite the widespread use of surveillance cameras, crime is still pervasive in society. While significant strides have been made in the realms of video analysis and anomaly detection, there is also a body of research dedicated to identifying pre-criminal events, with the ultimate goal of enabling proactive intervention.

This paper focuses on pre-crime events leading to criminal activities, a critical element of crime prevention strategies. Our proposed methodology leverages a transformer architecture renowned for its proficiency in capturing intricate dependencies within sequential data, like frames in videos. A distinctive contribution of our study is the incorporation of pose, emotion, and activity information into the visual features, thereby enhancing our capacity to discern essential behavioral and emotional cues in individuals. This augmentation provides valuable insights into pre-crime events associated with shoplifting activities, contributing to the ongoing discourse on preemptive crime prevention.

II. RELATED WORK

Our approach is influenced by various topics of computer vision, including video understanding, anomaly detection in videos, pose and emotion detection, and activity analysis studies. The seminal work by Vaswani et al. [1] marked the advent of the Transformer architecture, addressing the intricate challenges of capturing long-range dependencies among tokens and overcoming parallelization difficulties. Subsequently, the application of transformers expanded into computer vision, notably with the Vision Transformer (ViT) model [2]. ViT revolutionized computer vision by conceptualizing an image as a sequence of patches and applying the transformer architecture to these patches, particularly for tasks like image classification. ViT has proven especially effective at extracting features from images [2]. The Vision Transformer has become a cornerstone in computer vision research, witnessing widespread adoption and innovative extensions by multiple studies. Researchers have explored novel avenues such as the Swin Transformer, introducing a hierarchical vision transformer utilizing shifted windows [3], and Multiscale Vision Transformers that enhance classification and detection capabilities [4], [5]. Beyond image-related tasks, the transformer architecture has demonstrated effectiveness in video understanding studies [6]–[8]. The impact of the Transformer architecture has thus reshaped the landscape of computer vision and video understanding.

Anomaly detection in video understanding has seen various approaches. Sultani et al. [9] used a C3D-based model with Multiple Instance Learning (MIL) for distinguishing normal from anomalous surveillance videos. Tsushita et al. [10] combined background subtraction with pedestrian tracking for snatch theft detection, focusing on specific features like area and motion. Morales et al. [11] applied ConvLSTM and VGG-16 to differentiate between normal and violent events. Nasaruddin et al. [12] emphasized motion and location analysis using a 3D CNN with background subtraction for feature extraction. Kirichenko et al. [13] adopted a hybrid 3D CNN and RNN network to classify shoplifting and normal activities, illustrating diverse methodologies in criminal activity detection within video data.
While the aforementioned studies focused on anomalies and criminal activities, other researchers directed their attention toward pre-crime events. Martinez et al. [14] deployed CNNs to detect criminal intent in the early stages of shoplifting incidents, showcasing the potential of early intervention. Kilic et al. employed a 3D CNN architecture on pre-crime frames augmented with pose information in [15], and used a transformer model on pre-crime frames augmented with pose and emotion information in [16], emphasizing the detection of pre-crime intention in shoplifting videos.

There have been various research efforts in extracting human pose information [17], [18] and emotion information [19] from images. Optical flow is a technique used to analyze and quantify the motion of objects within a sequence of images or video frames. This method enables the identification of motion patterns, speed, and direction of moving objects [20]. In our study, we utilized optical flow to discern activity information within frames and augment the visual, pose, and emotion features.

III. METHODOLOGY

In the final stage, we implement a transformer architecture designed for video classification (refer to Fig. 1). This framework synergistically utilizes the information gleaned from the extracted features, pose data, embedded emotions, and activity information within the frames, culminating in an effective classification process. The orchestrated interplay of these components not only enhances the richness of information but also contributes to the robustness and accuracy of our video classification system.

B. Pre-processing

Our research exclusively targets shoplifting incidents. We used shoplifting videos and normal videos from the dataset. We carefully trimmed the videos at the frames just before the shoplifting incident happens; in this way, the shoplifting videos consist of the frames before the crime occurs. This meticulous approach yielded a two-class dataset, consisting of pre-shoplifting events and normal events.

Following the video trimming phase, we proceeded with the dataset creation process. Employing a clip length of 16 frames, we sampled frames at every 8th frame within the video. This methodological choice facilitated the generation of video snippets for our training and testing sets, each snippet encapsulating approximately 4.2 seconds of video footage. This dataset configuration is tailored to capture crucial moments preceding shoplifting incidents, enabling a targeted and effective training process for our model.
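As a concrete illustration of this sampling scheme, the sketch below shows one way such 16-frame snippets could be assembled with OpenCV. It is a minimal sketch under stated assumptions: the paper does not specify how the trimmed videos are stored or whether snippets overlap, so the file-based reading and the non-overlapping grouping are illustrative choices; the resize to 320x224 follows the UCF-Crime frame size reported in Section IV.

```python
import cv2
import numpy as np

CLIP_LEN = 16   # frames per snippet
STRIDE = 8      # sample every 8th frame (~4.2 s of 30 fps footage per snippet)

def sample_snippets(video_path):
    """Yield [CLIP_LEN, H, W, 3] arrays sampled at every STRIDE-th frame of a trimmed video."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % STRIDE == 0:
            frames.append(cv2.resize(frame, (320, 224)))  # (width, height)
        idx += 1
    cap.release()
    # Group the sampled frames into non-overlapping 16-frame snippets (an assumption).
    for start in range(0, len(frames) - CLIP_LEN + 1, CLIP_LEN):
        yield np.stack(frames[start:start + CLIP_LEN])
```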
C. Feature Extraction, Pose, Emotion, and Activity Information Augmentation

We used the Vision Transformer (ViT) [2] to extract visual features for each frame of the video snippet. Through this approach, we harness the capabilities of ViT to distill rich visual representations, enabling a more nuanced understanding of the dynamics present in pre-shoplifting events.
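To make the visual-feature step concrete, the following sketch extracts a 768-dimensional ViT embedding per frame with the Hugging Face transformers library. The specific checkpoint (google/vit-base-patch16-224-in21k) and the use of the [CLS] token as the frame representation are assumptions; the paper only states that ViT [2] supplies the per-frame visual values.

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Assumed checkpoint; the paper does not name the exact ViT weights used.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()

@torch.no_grad()
def frame_features(frame_rgb):
    """Return a 768-dim visual feature vector for one RGB frame (uint8 numpy array)."""
    inputs = processor(images=Image.fromarray(frame_rgb), return_tensors="pt")
    outputs = vit(**inputs)
    # Use the [CLS] token embedding as the frame-level representation (an assumption).
    return outputs.last_hidden_state[:, 0, :].squeeze(0).numpy()  # shape: (768,)
```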
In our comprehensive approach, we extracted pose features, emotion features, and activity information from frames to gain intricate descriptive insights. Pose information offers valuable information about the body language exhibited by individuals in the video, enabling the detection of suspicious behavior such as frequent checking of surroundings, attempts to avoid surveillance cameras, or displays of aggressive gestures. To capture pose information, we utilized an off-the-shelf pose extractor, Google's MediaPipe library, to obtain pixel coordinates for 33 keypoints in each frame [17]. Emotion information provides an understanding of individuals' emotional states, facilitating the identification of distress, anger, or other emotional situations. For emotion features, we leveraged the pre-trained DeepFace library, extracting probabilities of emotions, including anger, disgust, fear, happiness, sadness, surprise, and neutrality, from faces in the frames [19].
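Both cues can be obtained per frame with off-the-shelf libraries, as in the minimal sketch below, which uses MediaPipe Pose for the 33 keypoints and DeepFace for the 7 emotion probabilities. The zero-vector fallback when no person or face is detected, and the handling of DeepFace's list-versus-dict return value across library versions, are assumptions not specified in the paper.

```python
import numpy as np
import mediapipe as mp
from deepface import DeepFace

pose_detector = mp.solutions.pose.Pose(static_image_mode=True)
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def pose_vector(frame_rgb):
    """Return x, y, z of the 33 MediaPipe keypoints as a 99-value vector (RGB input)."""
    result = pose_detector.process(frame_rgb)
    if result.pose_landmarks is None:
        return np.zeros(99)  # assumed fallback when no person is detected
    return np.array([[lm.x, lm.y, lm.z]
                     for lm in result.pose_landmarks.landmark]).flatten()

def emotion_vector(frame_bgr):
    """Return the 7 emotion probabilities predicted by DeepFace (BGR input)."""
    analysis = DeepFace.analyze(frame_bgr, actions=["emotion"], enforce_detection=False)
    scores = analysis[0]["emotion"] if isinstance(analysis, list) else analysis["emotion"]
    return np.array([scores[e] for e in EMOTIONS])
```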
Pose and emotion information are inherently frame-centric, providing insights into individuals' body language and emotional states within isolated frames. However, to comprehend the interrelations between consecutive frames in videos, the extraction of activity information becomes paramount. To capture the dynamic interactions and movements between frames, we employed the Farnebäck optical flow technique [20]. This involves obtaining optical flow values for the pose keypoints of individuals within frames, allowing us to discern the direction and magnitude of these keypoints over time. This innovative approach significantly enhances our model's capability to understand the temporal flow of activities, contributing to a detailed and nuanced analysis of pre-crime events. By integrating activity information through optical flow, our methodology bridges the gap between individual frames, providing a more comprehensive understanding of the evolving scenarios captured in video data.
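The sketch below illustrates this step with OpenCV's Farnebäck implementation, sampling flow magnitude and direction at the 33 keypoint locations to yield the 66 activity values per frame. The pyramid parameters passed to calcOpticalFlowFarneback are common defaults rather than values reported in the paper.

```python
import cv2
import numpy as np

def activity_vector(prev_bgr, curr_bgr, keypoints_xy):
    """Magnitude and direction of Farneback optical flow at the 33 pose keypoints (66 values)."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    h, w = prev_gray.shape
    feats = []
    for x_norm, y_norm in keypoints_xy:              # normalized MediaPipe coordinates
        px = min(max(int(x_norm * w), 0), w - 1)
        py = min(max(int(y_norm * h), 0), h - 1)
        feats.extend([mag[py, px], ang[py, px]])
    return np.array(feats)                           # shape: (66,) for 33 keypoints
```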
D. Model Training

Following the pre-processing and feature extraction stages, our model training phase leverages a transformer-based architecture renowned for its aptitude in learning intricate dependencies. The transformer's inherent capability to grasp long-range dependencies between frames, even those widely separated in the sequence, proves invaluable for our video analysis task, where each video essentially represents a sequence of frames.

Our transformer model consists of four encoder layers that progressively create abstract representations of the input. Each layer is equipped with its own multi-head attention mechanism featuring eight heads. These attention mechanisms enable the model to selectively focus on different segments of the input sequence, comprehending the significance and context of each frame within a video. The deployment of eight heads in our attention mechanism enhances the model's ability to capture diverse aspects of the input sequence.

This combination of four encoder layers and eight-head attention is finely tuned for the intricate task of distinguishing shoplifting from normal behaviors in surveillance videos. The four encoder layers offer a balanced approach, enhancing computational efficiency and enabling the model to capture complex nuances without risking overfitting or incurring high computational costs. This architecture facilitates learning detailed sequence representations and managing temporal dynamics efficiently. Simultaneously, the eight-head attention mechanism boosts the model's ability to process diverse video segments concurrently, enhancing feature detection and ensuring robustness to footage variability, such as changes in lighting or occlusions.
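A minimal PyTorch sketch of such an encoder-only classifier is given below. The paper specifies four encoder layers, eight heads, 16-frame inputs, and two output classes; the linear projection of the 940-dimensional frame features to a 512-dimensional model width (needed because the head count must divide the model dimension), the learned positional embeddings, and the mean pooling before the classification head are assumptions.

```python
import torch
import torch.nn as nn

class PreCrimeTransformer(nn.Module):
    """Transformer encoder classifier: 4 layers, 8 attention heads, 2 classes."""

    def __init__(self, feat_dim=940, d_model=512, n_heads=8, n_layers=4, n_classes=2):
        super().__init__()
        # Project the 940-dim augmented frame features; d_model and this
        # projection are assumptions (d_model must be divisible by n_heads).
        self.proj = nn.Linear(feat_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, 16, d_model))   # learned positions, 16 frames
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                        # x: [batch, 16, 940]
        h = self.encoder(self.proj(x) + self.pos)
        h = self.norm(h.mean(dim=1))             # normalized sequence representation
        return self.head(h)                      # logits for 'normal' vs 'shoplifting'
```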
The multi-head attention mechanism is a crucial part of our transformer model, helping it analyze data by focusing on various elements simultaneously. The overall multi-head attention combines insights from several "heads", each examining the input data differently (refer to Eq. 1).

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (1)

Each head, indexed by i, uses a specific attention function, which determines how to distribute focus across different parts of the input (refer to Eq. 2) to process the data. The core process within each head is the attention function (refer to Eq. 3).

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2)

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V    (3)

Here, Q, K, and V stand for the query, key, and value matrices, guiding how the attention is applied. The term d_k represents the size of the key vectors, influencing the scale of the focus. The matrices W^O, W_i^Q, W_i^K, and W_i^V are parameters that transform the input into a format suitable for attention processing. This setup allows the model to focus on different segments of the input sequence, enhancing its ability to interpret complex data.
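For readers who prefer code to notation, the following sketch mirrors Eqs. 1-3 directly. It fuses the per-head matrices W_i^Q, W_i^K, and W_i^V into single projection matrices, which is numerically equivalent to the per-head formulation; it is an illustration of the equations rather than the exact module used in our model, since PyTorch's built-in encoder layers handle this internally.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Eq. 3: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ v

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads=8):
    """Eqs. 1-2: project into n_heads heads, attend per head, concatenate, apply W^O."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    q = (x @ w_q).view(seq_len, n_heads, d_head).transpose(0, 1)
    k = (x @ w_k).view(seq_len, n_heads, d_head).transpose(0, 1)
    v = (x @ w_v).view(seq_len, n_heads, d_head).transpose(0, 1)
    heads = attention(q, k, v)                        # [n_heads, seq_len, d_head]
    concat = heads.transpose(0, 1).reshape(seq_len, d_model)
    return concat @ w_o

# Example: 16 frames with a 512-dim model width.
# x = torch.randn(16, 512); w = [torch.randn(512, 512) for _ in range(4)]
# out = multi_head_attention(x, *w)                   # -> [16, 512]
```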
During the training process, visual features augmented with pose, emotion, and activity information are fed into the transformer model. The model learns to discern context and dependencies between frames by assigning varying weights to different frames based on their contextual importance within the video. After the transformer encoder layers, we obtain a normalized vector representation of the sequence tailored for classification.

To optimize the learning process, we employ Cross-Entropy Loss as our loss function, a fitting choice for binary classification tasks like ours, where videos are categorized as either 'shoplifting' or 'normal' events. The Adam optimizer, known for its efficiency and low memory requirements, is chosen to facilitate the optimization process, aligning well with the demands of our task.
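A compact sketch of this training setup is shown below. The learning rate and the structure of the data loader (batches of [batch, 16, 940] feature tensors with 0/1 labels) are assumptions; the paper specifies only the Cross-Entropy Loss and the Adam optimizer.

```python
import torch
import torch.nn as nn

# PreCrimeTransformer is the encoder classifier sketched in the previous listing.
model = PreCrimeTransformer()
criterion = nn.CrossEntropyLoss()                    # binary: normal vs shoplifting
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate is an assumption

def train_one_epoch(loader):
    model.train()
    for features, labels in loader:                  # features: [batch, 16, 940], labels: 0/1
        optimizer.zero_grad()
        logits = model(features)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
```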
Upon completion of the training phase, our model emerges as a sophisticated tool proficient in discerning and learning the interdependencies between distinct frames within a video sequence. With the ability to classify unseen videos with a high degree of accuracy into 'shoplifting' or 'normal' labels, the model stands as a robust asset for video analysis in the context of our specific task.
IV. EXPERIMENTS AND RESULTS

A. Dataset and Evaluation Metrics

Our study employs the UCF-Crime dataset [21], an extensive collection for scrutinizing anomalies in real-world situations. This dataset encompasses a broad spectrum of incidents and crimes. It consists of 1900 videos, amounting to 128 hours of footage, at a rate of 30 frames per second, with each frame resized to 320x224 pixels. For the specific purpose of our study, we randomly selected 50 videos of shoplifting incidents and 50 videos of normal scenarios from the UCF-Crime dataset. This selection was then further divided into 80% for training and 20% for testing, ensuring a representative distribution of data for both phases of our analysis.
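The sketch below shows how such an 80/20 split could be produced at the video level with scikit-learn. The placeholder file names, the stratification, and the random seed are illustrative assumptions; the paper states only that the selection was divided 80/20.

```python
from sklearn.model_selection import train_test_split

# Hypothetical file lists standing in for the 50 shoplifting and 50 normal videos.
shoplifting_videos = [f"shoplifting_{i:03d}.mp4" for i in range(50)]
normal_videos = [f"normal_{i:03d}.mp4" for i in range(50)]

videos = shoplifting_videos + normal_videos
labels = [1] * 50 + [0] * 50    # 1 = pre-shoplifting, 0 = normal

# 80/20 split at the video level; stratification and the seed are assumptions.
train_videos, test_videos, y_train, y_test = train_test_split(
    videos, labels, test_size=0.2, stratify=labels, random_state=0)
```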
In the stage of Feature Extraction, Pose, Emotion, and Activity Information Augmentation, we formulated a dataset comprising visual features, pose, emotion, and activity data for each frame of the video snippet for our experimental use. Each video snippet is structured in the form of [16, 940], where 16 represents the number of frames. For each frame, the initial 768 values correspond to visual features, the subsequent 99 values represent the pose information as x, y, and z coordinates of 33 keypoints, the following 7 values indicate the probabilities of emotions, and the final 66 values denote the magnitude and direction of the pose keypoints that represent the activity in frames.
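This per-frame layout can be expressed directly as a concatenation, as in the short sketch below; the helper name and the assertion are illustrative only.

```python
import numpy as np

def frame_descriptor(visual, pose, emotion, activity):
    """Concatenate the per-frame features into the 940-value layout:
    768 visual + 99 pose (x, y, z for 33 keypoints) + 7 emotion + 66 activity values."""
    vec = np.concatenate([visual, pose, emotion, activity])
    assert vec.shape == (940,), "unexpected feature layout"
    return vec

# A video snippet is then the [16, 940] stack of its 16 frame descriptors:
# snippet = np.stack([frame_descriptor(v, p, e, a) for v, p, e, a in per_frame_features])
```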
We evaluated the model using accuracy, precision, recall, and the F1 score to gain a nuanced understanding of its performance in distinguishing between shoplifting incidents and normal behavior. These metrics helped us identify areas where the model excels and areas needing improvement, ensuring a balanced approach to minimizing both false positives and false negatives. By leveraging these comprehensive metrics, we aim to refine our model's predictive capabilities, optimizing it for real-world application in surveillance and security contexts.
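These metrics can be computed with scikit-learn as sketched below, treating 'shoplifting' as the positive class; the function name is illustrative.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Report the four metrics used to compare the experiments."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),   # positive class = shoplifting
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
```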
B. Experimental Details

In our study we designed two experiments in order to see the effect of augmented activity information on the understanding of pre-crime events.

1) Experiment 1 - Visual Features, Pose, and Emotion Information: In our initial experiment, our primary emphasis was on harnessing the power of visual features augmented with pose and emotion information. Our approach involved training our model exclusively on visual, pose, and emotion features. The pose information provides additional context about the actions taking place in the video, which can be crucial for distinguishing between normal behavior and shoplifting incidents; it is valuable because suspects usually follow certain general behaviors, which are in turn related to their postures. Another valuable source of information is emotion, which is important for understanding whether someone is in distress, angry, or in any other emotional state. Emotion information can provide valuable insights into the state of individuals in the video, potentially offering additional clues for identifying shoplifting incidents. We extracted pose and emotion features using the methods described above and combined them with the visual features for training. We used our transformer architecture and obtained an accuracy of 0.95, precision of 0.93, and recall of 0.91.

2) Experiment 2 - Activity Information: Building upon the initial experiment, our primary investigation integrated crucial activity information with visual features, pose, and emotion information. While the extraction of visual features, pose, and emotion from frames already provides valuable insights into pre-crime events in videos, it inherently relies on a frame-based analysis.

Pre-crime events often involve a sequence of activities by potential offenders, such as navigating between aisles, repeatedly inspecting items, or displaying specific patterns of behavior like putting an item in their basket and then returning it. Understanding and capturing these patterns of activities between frames are essential for a more comprehensive analysis of pre-shoplifting incidents. In our main experiment, we utilized the same transformer architecture and achieved an accuracy of 0.96, precision of 0.94, and recall of 0.91, surpassing the results obtained by solely using visual features augmented with pose and emotion information.

The incorporation of activity information between video frames proved to be pivotal, offering enhanced comprehension of pre-crime events.

In all experiments, we trained our transformer model using the Adam optimizer and Cross-Entropy Loss as our loss function.

C. Results

In the initial experiment, training the model exclusively on visual features augmented with pose and emotion information resulted in an impressive accuracy of 0.95. This integration facilitated a deeper understanding of pre-crime events, distinguishing normal behavior from potential shoplifting incidents and laying the groundwork for subsequent investigations.

The second experiment integrated crucial activity information with visual features, pose, and emotion information. This inclusion further boosted the model's accuracy to 0.96, emphasizing the importance of incorporating temporal dynamics through activity information and highlighting the value of analysis between frames beyond static frame analysis (refer to Table I).

TABLE I: Comparison of different experiments.

Experiments                                          Acc.  Prec.  Rec.  F1
Exp. 1 - Visual F. & Pose and Emotion Information    0.95  0.93   0.91  0.92
Exp. 2 - Activity Information                        0.96  0.94   0.91  0.92

V. CONCLUSION

In our study, we aimed to address the complex task of identifying pre-crime events in video footage, specifically focusing on predicting incidents of shoplifting.
Our novel approach revolved around the analysis of activity patterns by extracting temporal dynamics between frames. The comprehension and capture of these activity patterns between frames proved to be crucial for a more thorough examination of pre-shoplifting events.

Our findings indicate that the best results were obtained when we focused on understanding and capturing the activity patterns between frames. This implies that taking a holistic approach, considering various aspects of videos, can significantly enhance the performance of pre-crime event recognition in video footage. The implications of this research are substantial, as it has the potential to have a positive impact on the prevention and early detection of shoplifting incidents. Ultimately, our work contributes to creating safer and more secure retail environments.

REFERENCES

[1] A. Vaswani, N. Shazeer, N. Parmar, et al., "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[2] A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[3] Z. Liu, Y. Lin, Y. Cao, et al., "Swin transformer: Hierarchical vision transformer using shifted windows," arXiv preprint arXiv:2103.14030, 2021.
[4] H. Fan, B. Xiong, K. Mangalam, et al., "Multiscale vision transformers," arXiv preprint arXiv:2104.11227, 2021.
[5] Y. Li, C.-Y. Wu, H. Fan, et al., "MViTv2: Improved multiscale vision transformers for classification and detection," arXiv preprint arXiv:2112.01526, 2021.
[6] G. Bertasius, H. Wang, and L. Torresani, "Is space-time attention all you need for video understanding?" in Proceedings of the International Conference on Machine Learning (ICML), 2021.
[7] E. Fish, J. Weinbren, and A. Gilbert, "Two-stream transformer architecture for long video understanding," arXiv preprint arXiv:2208.01753, 2022.
[8] Z. Tong, Y. Song, J. Wang, and L. Wang, "VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training," arXiv preprint arXiv:2203.12602, 2022.
[9] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6479–6488.
[10] H. Tsushita and T. T. Zin, "A study on detection of abnormal behavior by a surveillance camera image," in Big Data Analysis and Deep Learning Applications: Proceedings of the First International Conference on Big Data Analysis and Deep Learning, Springer, 2019, pp. 284–291.
[11] G. Morales, I. Salazar-Reque, J. Telles, and D. Díaz, "Detecting violent robberies in CCTV videos using deep learning," in Artificial Intelligence Applications and Innovations: 15th IFIP WG 12.5 International Conference, AIAI 2019, Hersonissos, Crete, Greece, May 24–26, 2019, Proceedings 15, Springer, 2019, pp. 282–291.
[12] N. Nasaruddin, K. Muchtar, A. Afdhal, and A. P. J. Dwiyantoro, "Deep anomaly detection through visual attention in surveillance videos," Journal of Big Data, vol. 7, no. 1, pp. 1–17, 2020.
[13] L. Kirichenko, T. Radivilova, B. Sydorenko, and S. Yakovlev, "Detection of shoplifting on video using a hybrid network," Computation, vol. 10, p. 199, 2022.
[14] G. A. Martínez-Mascorro, J. R. Abreu-Pederzini, J. C. Ortiz-Bayliss, A. Garcia-Collantes, and H. Terashima-Marín, "Criminal intention detection at early stages of shoplifting cases by using 3D convolutional neural networks," Computation, vol. 9, no. 2, p. 24, 2021.
[15] S. Kilic and M. Tuceryan, "Crime detection from pre-crime video analysis with augmented pose information," in 2023 IEEE International Conference on Electro Information Technology (eIT), IEEE, 2023, pp. 1–6.
[16] S. Kilic and M. Tuceryan, "Crime detection from pre-crime video analysis with augmented pose and emotion information," in Proceedings of the IEEE Southwest Symposium on Image Analysis and Interpretation, accepted for publication, 2024.
[17] C. Lugaresi, J. Tang, H. Nash, et al., "MediaPipe: A framework for building perception pipelines," arXiv preprint arXiv:1906.08172, 2019.
[18] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," arXiv preprint arXiv:2207.02696, 2022.
[19] S. I. Serengil and A. Ozpinar, "Hyperextended LightFace: A facial attribute analysis framework," in 2021 International Conference on Engineering and Emerging Technologies (ICEET), IEEE, 2021, pp. 1–4, doi: 10.1109/ICEET53442.2021.9659697.
[20] G. Farnebäck, "Two-frame motion estimation based on polynomial expansion," in Image Analysis, ser. Lecture Notes in Computer Science, vol. 2749, Berlin, Heidelberg: Springer, 2003, doi: 10.1007/3-540-45103-X_50.
[21] University of Central Florida, "UCF-Crime dataset," 2018. [Online]. Available: https://webpages.uncc.edu/cchen62/dataset.html (visited on 09/05/2023).