
Deep Fake Detection - Finalized

The document presents a novel Double-Phase Siamese Network (DPS) model for detecting deepfake images by analyzing noise patterns in facial and surrounding regions. The proposed model addresses limitations in existing methods by focusing on subtle inconsistencies across frames, utilizing the Inception-v3 model for feature extraction and Siamese training for classification. Experimental results on the CelebDF dataset demonstrate the model's effectiveness in improving accuracy and reliability in deepfake detection compared to state-of-the-art techniques.

UNMASKING DEEP FAKES THROUGH DOUBLE-PHASE SIAMNET MODEL

Sai Lokesh V^1, Hari Linga Kartheek B^1, Sai Sashi Vardhan E^1, Priyanka M^1, Sangeetha J^1,*
^1 Department of CSE, Srinivasa Ramanujan Centre, SASTRA University, Kumbakonam, India
* Corresponding author: [email protected]

ABSTRACT
A deepfake image is a manipulated or synthetic image created using advanced Artificial Intelligence
techniques, which raises concerns about misinformation and privacy breaches. Detecting deepfakes is
crucial to preserving the authenticity of visual content, combating misinformation, and safeguarding
against potential threats to individuals' privacy and reputation. Focusing solely on facial features
through deep neural networks, existing methods lack the sensitivity to detect subtle inconsistencies
across frames, which are crucial for accurate deepfake detection. Despite their complexity, even
baseline deepfake detection models can overfit specific temporal inconsistencies or manipulation
patterns, limiting their ability to perform well on unseen data. This paper proposes a Double-Phase
Siamese network to address manipulated face detection in images by leveraging inconsistencies in the
noise patterns between the face region and the rest of the frame. This network efficiently extracts noise
patterns from both the face region and the patch within the same image using the Inception-v3 model.
Siamese training is then employed to compare and correlate the noise patterns. Experimental assessment
is performed on a benchmark dataset, and the results are compared with state-of-the-art techniques for
validation.
Keywords: Deepfake detection, Inception-V3 model, Siamese Network, Pattern analysis

1 INTRODUCTION

Recent advancements in AI technology have profoundly impacted various aspects of daily life, ranging
from medical diagnoses to entertainment selections. Among these developments, Generative
Adversarial Networks (GANs) have significantly advanced the creation of realistic multimedia fakes.
These networks, by pitting two algorithms against each other (one generating content and the other
evaluating its authenticity), excel at tasks such as face swapping or altering lip movements in videos to
change the conveyed message. Such manipulated media, known as deepfakes, initially found an
entertainment niche, creating humorous clips featuring celebrities. However, the technology's darker
applications have raised serious concerns. For instance, the dignity of female celebrities has been
compromised through the creation of non-consensual fake pornography, and the fabrication of
misleading news content has threatened the dissemination of accurate information [1].

The misuse of deepfakes extends beyond individual harm, touching on broader ethical considerations
involving consent, privacy, and the potential for psychological distress. As society grapples with these
challenges, legal and regulatory frameworks are evolving in an attempt to mitigate the negative impacts
of this technology. Efforts to detect and discriminate between authentic and manipulated content are
underway, employing new AI-driven tools designed to identify deepfakes. Before the advent of these
tools, traditional methods for detecting manipulated media relied on manual inspection and rule-based
approaches [2].

These techniques included metadata analysis, signature analysis, watermarking, and forensic analysis.
Metadata analysis involved scrutinizing timestamps, file formats, and compression artifacts for
anomalies, while signature analysis compared digital signatures or hash values with known authentic
versions. Watermarking embedded invisible marks within multimedia content, while forensic analysis
examined characteristics like noise patterns and lighting inconsistencies [3]. Traditional techniques
offered transparency, low computational overhead, and some level of robustness against manipulation.
However, they scaled poorly, were vulnerable to sophisticated attacks, and required domain-specific
expertise. Given these limitations, the rise of deepfake technology prompted a shift towards
deep learning techniques due to their potential for higher accuracy and scalability in detecting
manipulated media.

To improve on standard deepfake video detection techniques, a CNN-based XGBoost model has been
proposed [4]. This approach begins with YOLO-based face detection, followed by feature extraction
using the InceptionResNetV2 model. By
integrating these components, the model identifies visual anomalies in video frames, effectively
distinguishing between authentic and deepfake videos. Experimental validation conducted on a mixed
dataset of CELEB DF – FACEFORENSICS++ yielded an optimal accuracy of 90.73%, accompanied
by specificity and sensitivity values of 93.53% and 85.39%, respectively. Despite achieving high
accuracy and specificity, it's important to note that the model exhibits limited generalization against
adversarial attacks.

In an alternative approach, facial recognition has been accomplished through the Fisher Face-Local
Binary Pattern Histogram (FF-LBPH) method [5]. This model employs LBPH for dimensionality
reduction within the face space, followed by the application of Fisher Linear Discriminant (FLD) to
maximize class separability and minimize within-class variance. Subsequently, a Deep Belief Network
(DBN) classifier is employed for classification. Testing on CASIA-Web Face and DFFD datasets
resulted in accuracies of 98.82% and 97.82%, respectively. While LBPH proves resilient to lighting and
facial expression variations for face recognition tasks, it does exhibit sensitivity to image quality and
introduces computational complexity. Moreover, another efficient detection method has been proposed
using a hybrid model that combines Visual Geometry Group (VGG16) and Convolutional Neural
Network (CNN) architectures [6]. This model utilizes the VGG16 approach for feature extraction from
images, followed by CNN layers to discern patterns and characteristics, ultimately classifying images
as real or fake. During the testing phase on the CELEB-DF dataset, the model achieved an accuracy of
94%. Despite being fully optimized and having lower complexity, the sophisticated architecture and
numerous configuration parameters introduced a challenge in terms of overfitting the training data. This
highlights the need for ongoing refinement to ensure robust generalization and performance.

In addition, two complementary face recognition models based on deep neural networks (DNNs) were
proposed [7]. These networks analyze
identity cues not only from the face but also from its surrounding context, incorporating features like
ears and hair. To validate their efficiency, the proposed networks underwent testing on the FF-DF and
Celeb-DF-v2 datasets, achieving accuracies of 99.7% and 66.0%, respectively. Emphasizing the
inherent designs of face swap schemes rather than specific artifacts, this model was considered
complementary to other artifact detection methods. Extended from the initial DNNs model [8], this
approach aims to detect Deepfakes by assessing frame variance and determining the rate of change
between extracted frame numbers. Demonstrating effectiveness, the implemented model achieved
detection accuracies of 97.39% for the Face2face dataset, 95.65% for the FaceSwap dataset, and 96.55%
for the DFDC dataset. Despite these notable accomplishments, it's important to acknowledge that this
model exhibited high data requirements, potentially imposing limitations on the classification
capabilities of the neural network.

Building upon the pressing need to address the proliferation of deepfake images, this paper proposes
a technique aimed at enhancing the accuracy and reliability of deepfake detection methods.
Existing detection approaches often fall short in detecting subtle inconsistencies across video frames,
thereby hindering their ability to precisely identify manipulated content. To overcome these limitations,
we propose a novel methodology, the "Double-Phase SIAMNET (DPS)" model. This approach
uniquely focuses on analyzing both the noise patterns within the facial region and the
contextual inconsistencies in the surrounding frame, providing a more holistic and effective analysis.
Through rigorous experimental evaluations on benchmark datasets and comparative studies with
leading-edge techniques, this research aims to demonstrate the superior capability of our DPS model
in the crucial task of deepfake image detection.

The paper is organized as follows: Section 2 provides a detailed explanation of the data and the
workflow of the proposed model; Section 3 evaluates the model using various performance metrics;
Section 4 concludes by discussing the significant impact and potential applications of the model.

2 MATERIALS AND METHODS

2.1 Dataset Description

The proposed model is evaluated using the "CelebDF" dataset, which offers a diverse collection
of videos for research in facial recognition and synthesis tasks. This dataset comprises three subsets:
"Celeb Real," containing 156 videos of real-world footage featuring celebrities; "Celeb Synthesis,"
which includes 795 synthesized videos mimicking celebrity appearances; and "YouTube Real," with
256 videos sourced from real-world footage on YouTube. CelebDF is widely used for training
and evaluating deep learning models for facial recognition and synthesis. Table 1 summarizes the
number of videos in each class.

Classes            Total No. of videos per class
Celeb Real         156
Celeb Synthesis    795
YouTube Real       256
Total              1201

Table 1: Categorical Overview of the Dataset

2.2 Proposed Methodology - Double Phase SIAMNET Model

Fig. 1: Architecture of the Proposed DPS model


2.2.1 Pre-Processing

Initially, frames are captured from videos. We collected a total of 30 frames from each video by setting
the window size to 20. Subsequently, a Multi-Task Cascaded Convolutional Neural Network (MTCNN)
[9] is employed to extract face and patch regions, which utilizes three stages for comprehensive
detection. Firstly, the Proposal Network (P-Net) employs Convolutional networks to generate candidate
bounding boxes for faces. Subsequently, the Refinement Network (R-Net) further refines these
bounding boxes to effectively filter out false positives. Finally, the Output Network (O-Net) refines
them by detecting crucial facial landmarks such as eyes, nose, and mouth. This cascade of neural
networks ensures efficient and accurate face detection in images, making it a crucial phase for
subsequent analyses. The extracted face and patch regions are resized to 299x299x3 (Fig. 1).
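
The frame-capture step can be illustrated with a minimal sketch. The text does not state how the 30 frames are spaced within a video or how the window size of 20 is applied, so the evenly spaced sampling below is an assumption, and the function name `sample_frame_indices` is hypothetical. Only the frame indices are computed; decoding the video and running MTCNN are outside the sketch.

```python
def sample_frame_indices(total_frames: int, num_samples: int = 30) -> list[int]:
    """Return up to num_samples evenly spaced frame indices in [0, total_frames).

    Assumption: frames are spread uniformly over the video. The paper's
    "window size" parameter is not modeled here.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if total_frames <= num_samples:
        # Short video: keep every frame.
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


indices = sample_frame_indices(total_frames=300, num_samples=30)
print(indices[:5], len(indices))  # [0, 10, 20, 30, 40] 30
```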

2.2.2 Feature Extraction

In the first phase of the proposed DPS model, the Inception V3 [10] approach is deployed as shown
in Fig. 1 for feature extraction to discern subtle noise patterns within the facial and patch regions, which
were extracted earlier. This model receives the preprocessed image (299x299x3) as input and
passes it through several operations. Initially, the Convolutional Layer applies convolution operations to
detect features such as edges, textures, and shapes from the input image. Following that, the Pooling
Layer downsamples the feature maps while retaining the most salient information.
This neural network can effectively collect features across various scales and resolutions by traversing
through multiple layers with varied filter sizes (3 x 3) and strides (5). As the network reaches its apex,
global average pooling aggregates feature maps, resulting in a condensed representation of image
characteristics. This aggregated output is eventually fed into fully connected layers, yielding 2048-
dimensional embeddings that contain high-level features for further analysis.
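
To make the global average pooling step concrete, the following toy sketch averages each channel of a feature map over its spatial positions, which is how a stack of feature maps collapses into a single vector with one value per channel. The 2x2x2 input is a made-up example; in the real network the channel count would be 2048. The function name `global_average_pool` is hypothetical and the code is plain Python rather than the authors' implementation.

```python
def global_average_pool(feature_map):
    """Average each channel over all spatial positions.

    feature_map: list of H rows, each a list of W positions,
                 each position a list of C channel values.
    Returns a list of C per-channel averages.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for row in feature_map:
        for position in row:
            for c, value in enumerate(position):
                totals[c] += value
    return [t / (height * width) for t in totals]


# Toy 2x2 feature map with 2 channels (hypothetical values).
toy = [
    [[1.0, 10.0], [3.0, 30.0]],
    [[5.0, 50.0], [7.0, 70.0]],
]
print(global_average_pool(toy))  # [4.0, 40.0]
```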
2.2.3 Classification via SIAMNET:

In the second phase, the output of the Inception V3 model [10], namely the 2048-dimensional
embeddings, is fed into a Siamese network [11] to determine image authenticity as shown in Fig. 1. A
Siamese network is a specialized neural network topology mainly used for similarity assessments, such as
face verification and signature recognition. It is made up of two branches with identical architectures
and weights, one processing the face region and the other the patch region. Each branch's input
consists of a pair of faces and patches from the same frame of videos, prepared to match the required
dimension (299x299x3) for the pre-trained source identification model used as a feature extractor. The
noise embeddings from both branches are concatenated to create a 4096-dimensional vector for binary
classification.

During the classification phase, the concatenated output is processed via two Dense Layers (DL1, DL2),
with 1024 and 1 units, respectively. To generate the binary classification result, the sigmoid
activation function is applied to the output of DL2. The classification process is encapsulated in Eqn. (1),
where the predicted label (\hat{Y}) is determined based on whether the camera noise pattern of the
face region (Y_{face}) correlates with that of the patch region (Y_{patch}), which results in the
classification of images into Real (0) or Fake (1). The end-to-end training of the model is governed by
the loss function (Eqn. (2)), which calculates the cross-entropy loss between the predicted label
(\hat{Y}) and the actual label (Y). This ensures that the model effectively learns the pattern to
discriminate the deepfake image from the authentic one.

\hat{Y} = \sigma( DL_2( DL_1( Concat(Y_{face}, Y_{patch}) ) ) )    (1)

Loss = -( Y \log(\hat{Y}) + (1 - Y) \log(1 - \hat{Y}) )    (2)
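
As a rough illustration of Eqn. (1) and Eqn. (2), the sketch below runs one forward pass of the classification head and computes the binary cross-entropy loss in plain Python. Several things are assumptions rather than details from the paper: the embeddings are shrunk to 4 dimensions and the hidden layer to 3 units so the example stays readable (the paper uses 2048 and 1024), the weights are random, and a ReLU activation is assumed after DL1 because the text does not name one. All function names are hypothetical.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Eqn. (2); y_pred is clipped to avoid log(0)."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))


def predict(y_face, y_patch, w1, b1, w2, b2):
    """Eqn. (1): sigmoid(DL2(DL1(Concat(Y_face, Y_patch))))."""
    concat = y_face + y_patch                              # list concatenation
    hidden = [max(0.0, h) for h in dense(concat, w1, b1)]  # ReLU is an assumption
    logit = dense(hidden, w2, b2)[0]
    return sigmoid(logit)


# Toy sizes (assumption): 4-dim embeddings and 3 hidden units.
rng = random.Random(0)
EMB, HID = 4, 3
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(2 * EMB)] for _ in range(HID)]
b1 = [0.0] * HID
w2 = [[rng.uniform(-0.5, 0.5) for _ in range(HID)]]
b2 = [0.0]
y_face = [rng.random() for _ in range(EMB)]
y_patch = [rng.random() for _ in range(EMB)]

y_hat = predict(y_face, y_patch, w1, b1, w2, b2)
print("predicted probability of Fake:", y_hat)
print("loss if the true label is Fake (1):", binary_cross_entropy(1, y_hat))
```

In a trained model the weights would be learned by minimizing this loss over many face and patch pairs; here they only demonstrate the data flow.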


3 EXPERIMENTAL ANALYSES

3.1 Experimental Setup

The DPS model for deep fake detection was implemented in Python 3.10 using the Jupyter Notebook
environment. The experiments were conducted on a laptop equipped with 8GB RAM, a 512GB SSD,
and an AMD Ryzen 5 5300U processor running the Windows 10 operating system. Validation
was performed using the scikit-learn package, an open-source machine learning library that provides
the evaluation metrics used in this study.

3.2 Evaluation Metrics

To evaluate the performance of the proposed approach, a comprehensive set of evaluation metrics (Eqn.
(3) to Eqn. (6)) was utilized. These metrics offer insights into different aspects of the model's predictive
capabilities and facilitate thorough comparisons among various approaches.

(i) Accuracy - Ratio of correctly classified instances to the total instances.

Accuracy = \frac{\text{Number of correctly classified instances}}{\text{Total number of instances}}    (3)

(ii) Precision - Calculated as the ratio of true positives to the sum of true positives and false
positives.

Precision = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalsePositives}}    (4)

(iii) Recall - Calculated as the ratio of true positives to the sum of true positives and false
negatives.

Recall = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalseNegatives}}    (5)

(iv) F1 Score - Harmonic mean of precision and recall, providing a balanced measure of a
model's performance.

F1\ Score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}    (6)
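
The four metrics in Eqn. (3) to Eqn. (6) can be computed directly from confusion-matrix counts, as in the short sketch below. The counts passed in are hypothetical and are not results from this study; the function name `classification_metrics` is likewise illustrative.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, with "Fake" treated as the positive class.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 185/200 = 0.925 and the recall is 90/100 = 0.9.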

3.3 Results and Discussions

The model underwent evaluation using a total of 1201 videos, with 80% (960 videos) allocated for
training and the remaining 20% for testing. Prior to model training, the videos were segmented into
frames as described in Subsection 2.2.1. Features were extracted from both the face and patch regions
using the Inception V3 model [10] before initiating the training process. The model was then subjected
to various computational analyses, which are elaborately discussed in this section.
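
The reported split sizes follow from simple arithmetic, sketched below. Rounding the training share down is an assumption, chosen because it reproduces the 960 training and 241 testing videos reported in this paper; the function name `split_counts` is hypothetical.

```python
def split_counts(total: int, train_fraction: float = 0.8) -> tuple[int, int]:
    """Return (train, test) sizes, rounding the training share down (assumption)."""
    train = int(total * train_fraction)
    return train, total - train


print(split_counts(1201))  # (960, 241)
```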

In our experimental analysis, the performance of various models was thoroughly evaluated, revealing
significant differences in accuracy (Fig. 2). The Viola-Jones cascade classifier [12] demonstrated a
commendable accuracy of 91.27%, offering a solid baseline for comparison. The FF-LBPH model [5],
known for its robustness in facial recognition, significantly outperformed the Viola-Jones model [12]
by achieving an accuracy of 97.82%, likely due to its effective use of local binary patterns that are
particularly adept at capturing facial textures. The XGBoost model [4], while effective in handling
tabular data, achieved a slightly lower accuracy of 90.73% in this context, possibly because its strength
in structured data classification doesn't directly translate to the complexities of image-based deepfake
detection. The VGG16 model [6], leveraging deep convolutional networks, achieved an impressive 94%
accuracy, benefiting from its deep architecture capable of learning high-level features from images.
However, the standout performer in our experiments was the proposed model, which achieved the
highest accuracy of 98.83% (Fig. 2). This superior performance can be attributed to its innovative use
of a Siamese network architecture, designed specifically for learning fine-grained distinctions between
genuine and manipulated images. Unlike the other models, the Siamese network's ability to compare
and contrast pairs of images allows it to discern subtle inconsistencies indicative of deep fakes, making
it exceptionally well-suited for this task. This comprehensive analysis demonstrates the proposed
model's superior predictive capabilities in comparison to the other evaluated models, highlighting its
potential as a highly effective tool in the ongoing battle against deep fake technology.

Fig. 2: Accuracy comparison between proposed and conventional models

Fig. 3: Comparing the evaluation metrics of the DPS model for optimal frame setting

In order to identify the ideal configuration for the proposed DPS model, we ran experiments with
different numbers of frames (Fig. 3). This investigation provided insights into the model's performance
across different frame counts. When utilizing 10 frames, the model achieved an accuracy, precision,
and recall of 96.4%, 94.4%, and 95.7%, respectively. Increasing the frame count to 20 resulted in
improved metrics, with the model achieving an accuracy, precision, and recall of 97.59%, 95.4%, and
96.4%, respectively. Further increasing the frame count to 30 demonstrated a significant enhancement
in performance, with the model achieving an accuracy, precision, and recall of 98.8%, 96.6%, and
98.01%, respectively. Interestingly, when using 40 frames, the metrics remained largely consistent with
those obtained at 30 frames, with the model achieving an accuracy, precision, and recall of 98.8%,
96.8%, and 98.1%, respectively (Fig. 3). Upon closer examination, the negligible difference between
the results of 30 and 40 frames led us to conclude that the optimal number of frames for our model is
30. This decision was based on the principle of achieving a balance between computational efficiency
and performance, ensuring that the model operates effectively without unnecessary computational
overhead.
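
The selection rule described above can be written as a small sketch: take the smallest frame count whose accuracy is within a tolerance of the best observed accuracy. The accuracies are the values reported in this section; the 0.1 percentage-point tolerance and the function name `smallest_sufficient_frame_count` are assumptions made for illustration.

```python
# Accuracy (%) per frame count, as reported in this section.
accuracy_by_frames = {10: 96.4, 20: 97.59, 30: 98.8, 40: 98.8}


def smallest_sufficient_frame_count(accuracy: dict, tolerance: float = 0.1) -> int:
    """Smallest frame count whose accuracy is within `tolerance` of the best.

    The tolerance value is an assumption, not a figure from the paper.
    """
    best = max(accuracy.values())
    return min(n for n, acc in accuracy.items() if best - acc <= tolerance)


print(smallest_sufficient_frame_count(accuracy_by_frames))  # 30
```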

Table 2 presents the performance of the model on the CelebDF dataset at both the frame and video
levels. During the training phase with 960 videos, the DPS model achieved an accuracy of 97% at the
frame level and 97.5% at the video level. This high level of accuracy can be attributed to the model's
ability to effectively learn discriminative features from the data, enabling it to distinguish between
genuine and manipulated frames with precision. The robustness of the DPS model is further evidenced
by its precision values of 98.23% at the frame level and 99% at the video level, indicating its capability
to accurately identify deep fake content while minimizing false positives. Moreover, the F1 scores of
97.05% and 98.23% reflect the model's balanced performance in terms of both precision and recall
during training.

Moving to the testing phase, which involved 241 videos, the DPS model matched or slightly exceeded
its training performance on most metrics. The accuracy of 97.76% at the frame level and 98.83% at the video
level underscores the model's effectiveness in generalizing well to unseen data. This ability to maintain
high accuracy during testing highlights the robustness and reliability of the DPS model in accurately
detecting deep fake content across both individual frames and entire videos within the CelebDF dataset.
Additionally, the precision metrics improved to 98.33% and 99.24%, respectively, during testing,
indicating the model's capability to minimize false positives while maintaining high levels of true
positive identifications. The corresponding F1 scores of 96.53% and 98.9% further emphasize the
balanced performance of the DPS model in terms of precision and recall, solidifying its position as a
highly effective tool for deepfake detection. Fig. 4 depicts the prediction result of the proposed model
during the testing phase.

Dataset: CelebDF

DPS Model               Evaluation Metrics (30)   Frame-Level (%)   Video-Level (%)
Training (960 videos)   Accuracy                  97                97.5
                        Precision                 98.23             99
                        F1-Score                  97.05             98.23
Testing (241 videos)    Accuracy                  97.76             98.83
                        Precision                 98.33             99.24
                        F1-Score                  96.53             98.9

Table 2: Performance of the model in terms of Frame and video level
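
The paper reports both frame-level and video-level results but does not state how frame predictions are combined into a video decision. One common option, shown here purely as an assumption, is to average the per-frame probabilities of the Fake class and threshold the mean at 0.5. The function name `video_level_prediction` and the sample probabilities are hypothetical.

```python
def video_level_prediction(frame_probs, threshold=0.5):
    """Aggregate per-frame Fake probabilities into one video-level label.

    Assumption: mean pooling followed by a fixed threshold.
    Returns (label, mean_probability) with label 1 = Fake, 0 = Real.
    """
    if not frame_probs:
        raise ValueError("at least one frame probability is required")
    mean_prob = sum(frame_probs) / len(frame_probs)
    return (1 if mean_prob >= threshold else 0), mean_prob


label, score = video_level_prediction([0.9, 0.8, 0.4, 0.7])
print(label, round(score, 2))  # 1 0.7
```

Averaging lets a few misclassified frames be outvoted by the rest of the video, which is one plausible reason video-level scores can exceed frame-level scores.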

Fig. 4: Simulation analysis of the DPS model

4 CONCLUSION
Addressing the sophisticated challenge of deepfake disinformation, we propose the Double-Phase SiamNet
(DPS), a detection model that leverages inconsistencies in camera noise patterns caused by facial
manipulations. By comparing noise features between the altered face and an untouched region within
the same video frame, our Siamese-based network identifies face-swapped videos. Despite a
slight reduction in effectiveness against lip-sync fakes, the model achieves 98.83% accuracy
on the Celeb-DF dataset, outperforming the compared methods in accuracy. This
result offers a robust tool for the reliable detection of deepfake videos.

REFERENCES

[1]. Takale, Dattatray G., Parikshit N. Mahalle, and Bipin Sule. "Advancements and Applications of
Generative Artificial Intelligence."

[2]. Vaccari, C., & Chadwick, A. (2020). Deepfakes and Disinformation: Exploring the Impact of
Synthetic Political Video on Deception, Uncertainty, and Trust in News. Social Media + Society, 6(1).
https://doi.org/10.1177/2056305120903408.

[3]. William D. Ferreira, Cristiane B.R. Ferreira, Gelson da Cruz Júnior, Fabrizzio Soares, A review of
digital image forensics, Computers & Electrical Engineering, Volume 85, 2020, 106685, ISSN 0045-7906,
https://doi.org/10.1016/j.compeleceng.2020.106685.

[4]. Kaliyar, R.K., Goswami, A. & Narang, P. DeepFakE: improving fake news detection using tensor
decomposition-based deep neural network. J Supercomput 77, 1015-1037 (2021).
https://doi.org/10.1007/s11227-020-03294-y

[5]. Lee G, Kim M. Deepfake Detection Using the Rate of Change between Frames Based on Computer
Vision. Sensors (Basel). 2021 Nov 5;21(21):7367. doi: 10.3390/s21217367. PMID: 34770675; PMCID:
PMC8588474.

[6]. Ismail A, Elpeltagy M, S. Zaki M, Eldahshan K. A New Deep Learning-Based Methodology for
Video Deepfake Detection Using XGBoost. Sensors. 2021; 21(16):5413.
https://doi.org/10.3390/s21165413.

[7]. ST S, Ayoobkhan MUA, V KK, Bacanin N, K V, Štěpán H, Pavel T. 2022. Deep learning model
for deep fake face recognition and detection. PeerJ Computer Science 8:e881
https://doi.org/10.7717/peerj-cs.881

[8]. Raza A, Munir K, Almutairi M. A Novel Deep Learning Approach for Deepfake Image
Detection. Applied Sciences 2022; 12(19):9820. https://doi.org/10.3390/app12199820.

[9]. L. Zhang, H. Wang and Z. Chen, "A Multi-task Cascaded Algorithm with Optimized Convolution
Neural Network for Face Detection," 2021 Asia-Pacific Conference on Communications
Technology and Computer Science (ACCTCS), Shenyang, China, 2021, pp. 242-245, doi:
10.1109/ACCTCS52002.2021.00054.

[10]. Wang, Cheng & Chen, Delei & Lin, Hao & Liu, B. & Zeng, C. & Chen, D. & Zhang, Guokai.
(2019). Pulmonary Image Classification Based on Inception-v3 Transfer Learning Model. IEEE Access.
PP. 1-1. 10.1109/ACCESS.2019.2946000.

[11]. Staffy Kingra, Naveen Aggarwal, Nirmal Kaur, SiamNet: Exploiting source camera noise
discrepancies using Siamese Network for Deepfake Detection, Information Sciences, Volume
645, 2023, 119341, ISSN 0020-0255, https://doi.org/10.1016/j.ins.2023.119341.

[12]. S. Tasfia and S. Reno, "Face Mask Detection Using Viola-Jones and Cascade Classifier," 2022
International Conference on Augmented Intelligence and Sustainable Systems (ICAISS), Trichy, India,
2022, pp. 563-569, doi: 10.1109/ICAISS55157.2022.10011114.
