Deep Learning On Deep Fake Creation and Detection
1. Introduction

In a narrow definition, deepfakes (stemming from "deep learning" and "fake") are created by techniques that can superimpose face images of a target person onto a video of a source person to make a video of the target person doing or saying things the source person does. This constitutes one category of deepfakes, namely face-swap. In a broader definition, deepfakes are artificial intelligence-synthesized content that can also fall into two other categories, i.e., lip-sync and puppet-master. Lip-sync deepfakes refer to videos that are modified to make the mouth movements consistent with an audio recording. Puppet-master deepfakes include videos of a target person (puppet) who is animated following the facial expressions, eye and head movements of another person (master) sitting in front of a camera (Agarwal et al., 2019).

While some deepfakes can be created by traditional visual effects or computer-graphics approaches, the recent common underlying mechanism for deepfake creation is deep learning models such as autoencoders and generative adversarial networks (GANs), which have been applied widely in the computer vision domain (Vincent et al., 2008; Kingma and Welling, 2013; Goodfellow et al., 2014; Makhzani et al., 2015; Tewari et al., 2018; Lin et al., 2021; Liu et al., 2021). These models are used to examine facial expressions and movements of a person and synthesize facial images of another person making analogous expressions and movements (Lyu, 2018). Deepfake methods normally require a large amount of image and video data to train models to create photo-realistic images and videos. As public figures such as celebrities and politicians may have a large number of videos and images available online, they are the initial targets of deepfakes. Deepfakes were used to swap the faces of celebrities or politicians onto bodies in porn images and videos. The first deepfake video emerged in 2017, in which the face of a celebrity was swapped onto that of a porn actor.

It is threatening to world security when deepfake methods can be employed to create videos of world leaders with fake speeches for falsification purposes (Bloomberg, 2018; Chesney and Citron, 2019; Hwang, 2020). Deepfakes can therefore be abused to cause political or religious tensions between countries, to fool the public and affect results in election campaigns, or to create chaos in financial markets by creating fake news (Zhou and Zafarani, 2020; Kaliyar et al., 2021; Guo et al., 2020). They can even be used to generate fake satellite images of the Earth
to contain objects that do not really exist to confuse military analysts, e.g., creating a fake bridge across a river although there is no such bridge in reality. This can mislead troops who have been guided to cross the bridge in a battle (Tucker, 2019; Fish, 2019).

As the democratization of creating realistic digital humans has positive implications, there are also positive uses of deepfakes, such as their applications in visual effects, digital avatars and Snapchat filters, creating voices for those who have lost theirs, or updating episodes of movies without reshooting them (Marr, 2019). Deepfakes can have creative or productive impacts in photography, video games, virtual reality, movie productions, and entertainment, e.g., realistic video dubbing of foreign films, education through the reanimation of historical figures, virtually trying on clothes while shopping, and so on (Mirsky and Lee, 2021; Verdoliva, 2020). However, the number of malicious uses of deepfakes largely dominates that of the positive ones. The development of advanced deep neural networks and the availability of large amounts of data have made forged images and videos almost indistinguishable to humans and even to sophisticated computer algorithms. The process of creating those manipulated images and videos is also much simpler today, as it needs as little as an identity photo or a short video of a target individual. Less and less effort is required to produce stunningly convincing tampered footage. Recent advances can even create a deepfake from just a still image (Zakharov et al., 2019). Deepfakes therefore can be a threat affecting not only public figures but also ordinary people. For example, a voice deepfake was used to scam a CEO out of $243,000 (Damiani, 2019). The recent release of a software called DeepNude showed more disturbing threats, as it can transform a person into an object of non-consensual porn (Samuel, 2019). Likewise, the Chinese app Zao has gone viral lately as less-skilled users can swap their faces onto the bodies of movie stars and insert themselves into well-known movies and TV clips (The Guardian, 2019). These forms of falsification create a huge threat to privacy and identity, and affect many aspects of human lives.

Finding the truth in the digital domain has therefore become increasingly critical. It is even more challenging when dealing with deepfakes, as they are majorly used to serve malicious purposes and almost anyone can create deepfakes these days using existing deepfake tools. Thus far, numerous methods have been proposed to detect deepfakes (Lyu, 2020; Guarnera et al., 2020c; Jafar et al., 2020; Trinh et al., 2021; Younus and Hasan, 2020). Most of them are based on deep learning, and thus a battle between malicious and positive uses of deep learning methods has been arising. To address the threat of face-swapping technology, or deepfakes, the United States Defense Advanced Research Projects Agency (DARPA) initiated a research scheme in media forensics (named Media Forensics or MediFor) to accelerate the development of fake digital visual media detection methods (Turek, 2019). Recently, Facebook Inc., teaming up with Microsoft Corp and the Partnership on AI coalition, launched the Deepfake Detection Challenge to catalyze more research and development in detecting and preventing deepfakes from being used to mislead viewers (Schroepfer, 2019). Data obtained from https://app.dimensions.ai at the end of 2021 show that the number of deepfake papers has increased significantly in recent years (Fig. 1). Although the obtained numbers of deepfake papers may be lower than the actual numbers, the research trend of this topic is obviously increasing.

Fig. 1. Number of papers related to deepfakes in the years from 2016 to 2021, obtained from https://app.dimensions.ai at the end of 2021 with the search keyword "deepfake" applied to the full text of scholarly papers.

There have been existing survey papers about creating and detecting deepfakes, presented in Tolosana et al. (2020), Verdoliva (2020) and Mirsky and Lee (2021). For example, Mirsky and Lee (2021) focused on reenactment approaches (i.e., to change a target's expression, mouth, pose, gaze or body) and replacement approaches (i.e., to replace a target's face by swap or transfer methods). Verdoliva (2020) separated detection approaches into conventional methods (e.g., blind methods without using any external data for training, one-class sensor-based and model-based methods, and supervised methods with handcrafted features) and deep learning-based approaches (e.g., CNN models). Tolosana et al. (2020) categorized both creation and detection methods based on the way deepfakes are created, including entire face synthesis, identity swap, attribute manipulation, and expression swap. On the other hand, we carry out this survey with a different perspective and taxonomy. We categorize the deepfake detection methods based on the data type, i.e., images or videos, as presented in Fig. 2. For fake image detection methods, we focus on the features that are used, i.e., whether they are handcrafted features or deep features. For fake video detection methods, two main subcategories are identified based on whether the method uses temporal features across frames or visual artifacts within a video frame. We also discuss extensively the challenges, research trends and directions on deepfake detection and multimedia forensics problems.

2. Deepfake creation

Deepfakes have become popular due to the quality of tampered videos and also the ease of use of their applications for a wide range of users with various computer skills, from professional to novice. These applications are mostly developed based on deep learning techniques. Deep learning is well known for its capability of representing complex and high-dimensional data. One variant of deep networks with that capability is deep autoencoders, which have been widely applied for dimensionality reduction and image compression (Punnappurath and Brown, 2019; Cheng et al., 2019; Chorowski et al., 2019). The first attempt at deepfake creation was FakeApp, developed by a Reddit user using an autoencoder–decoder pairing structure (Faceswap, 2022; FakeApp, 2022). In that method, the autoencoder extracts latent features of face images and the decoder is used to reconstruct the face
images. To swap faces between source images and target images, there is a need for two encoder–decoder pairs, where each pair is used to train on an image set and the encoder's parameters are shared between the two network pairs. In other words, the two pairs have the same encoder network. This strategy enables the common encoder to find and learn the similarity between two sets of face images, which is relatively unchallenging because faces normally have similar features such as eye, nose and mouth positions. Fig. 3 shows a deepfake creation process where the feature set of face A is connected with the decoder B to reconstruct face B from the original face A. This approach is applied in several works such as DeepFaceLab (DeepFaceLab, 2022b), DFaker (DFaker, 2022), and DeepFake_tf (tensorflow-based deepfakes) (DeepFake_tf, 2022).

Fig. 3. A deepfake creation model using two encoder–decoder pairs. The two networks use the same encoder but different decoders for the training process (top). An image of face A is encoded with the common encoder and decoded with decoder B to create a deepfake (bottom). The reconstructed image (in the bottom) is face B with the mouth shape of face A. Face B originally has the mouth of an upside-down heart, while the reconstructed face B has the mouth of a conventional heart.
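To make the shared-encoder design concrete, the sketch below pairs one encoder with two identity-specific decoders in PyTorch; the layer sizes, 64 × 64 input resolution, and L1 reconstruction loss are simplifying assumptions for illustration, not the exact configuration of FakeApp or the tools above.

```python
import torch
import torch.nn as nn

class FaceSwapAutoencoder(nn.Module):
    """One shared encoder with a separate decoder per identity (A and B)."""
    def __init__(self, latent_dim=256):
        super().__init__()
        # Shared encoder: learns identity-agnostic facial structure.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),    # 64x64 -> 32x32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, latent_dim),
        )
        # One decoder per identity: learns identity-specific appearance.
        self.decoders = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(latent_dim, 128 * 16 * 16),
                nn.Unflatten(1, (128, 16, 16)),
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
            ) for name in ("A", "B")
        })

    def forward(self, x, identity):
        return self.decoders[identity](self.encoder(x))

model = FaceSwapAutoencoder()
faces_a = torch.rand(8, 3, 64, 64)  # a mini-batch of face-A crops (placeholder data)
# Training reconstructs A with decoder A (and B with decoder B, not shown).
loss = nn.functional.l1_loss(model(faces_a, "A"), faces_a)
# At generation time, the swap is: encode face A, decode with decoder B.
fake_b = model(faces_a, "B")
```

Because the encoder is shared, the latent code of face A already lies in a representation that decoder B understands, which is what makes the cross-decoding swap possible.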
By adding an adversarial loss and a perceptual loss implemented in VGGFace (Keras-VGGFace, 2022) to the encoder–decoder architecture, an improved version of deepfakes based on the generative adversarial network (Goodfellow et al., 2014), i.e., faceswap-GAN, was proposed in Faceswap-GAN (2022). The VGGFace perceptual loss is added to make eye movements more realistic and consistent with the input faces, and to help smooth out artifacts in the segmentation mask, leading to higher quality output videos. This model facilitates the creation of outputs with 64 × 64, 128 × 128, and 256 × 256 resolutions. In addition, the multi-task convolutional neural network (CNN) from the FaceNet implementation (FaceNet, 2022) is used to make face detection more stable and face alignment more reliable. The CycleGAN (CycleGAN, 2022) is utilized for the generative network implementation in this model.

Fig. 4. The GAN architecture consisting of a generator and a discriminator, each of which can be implemented by a neural network. The entire system can be trained with backpropagation, which allows both networks to improve their capabilities.

A conventional GAN model comprises two neural networks: a generator and a discriminator, as depicted in Fig. 4. Given a dataset of real images x having a distribution p_data, the aim of the generator G is to produce images G(z) similar to the real images x, with z being noise signals having a distribution p_z. The aim of the discriminator D is to correctly classify images generated by G and real images x. The discriminator D is trained to improve its classification capability, i.e., to maximize D(x), which represents the probability that x is a real image rather than a fake image generated by G. On the other hand, G is trained to minimize the probability that its outputs are classified by D as synthetic images, i.e., to minimize 1 − D(G(z)). This is a minimax game between the two players D and G that can be described by the following value function (Goodfellow et al., 2014):

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]    (1)

After sufficient training, both networks improve their capabilities, i.e., the generator G is able to produce images that are really similar to real images, while the discriminator D is highly capable of distinguishing fake images from real ones.
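To make the training dynamics of Eq. (1) concrete, the following PyTorch sketch performs one alternating update of D and G; the tiny fully connected networks and the non-saturating generator loss are illustrative assumptions rather than any specific published configuration.

```python
import torch
import torch.nn as nn

# Placeholder generator and discriminator; real models would be conv nets.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(32, 784)   # stand-in for a batch of real images x ~ p_data
z = torch.randn(32, 100)      # noise z ~ p_z

# Discriminator step: ascend E[log D(x)] + E[log(1 - D(G(z)))].
opt_d.zero_grad()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(G(z).detach()), torch.zeros(32, 1))
d_loss.backward()
opt_d.step()

# Generator step: instead of minimizing log(1 - D(G(z))) literally, maximize
# log D(G(z)) (the standard non-saturating variant with stronger early gradients).
opt_g.zero_grad()
g_loss = bce(D(G(z)), torch.ones(32, 1))
g_loss.backward()
opt_g.step()
```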
Table 1 presents a summary of popular deepfake tools and their typical features. Among them, a prominent method for face synthesis based on a GAN model, namely StyleGAN, was introduced in Karras et al. (2019). StyleGAN is motivated by style transfer (Huang et al., 2017) and has a special generator network architecture that is able to create realistic face images. In a traditional GAN model, e.g., the progressive growing of GAN (PGGAN) (Karras et al., 2017), the noise signal (latent code) is fed to the input layer of a feedforward network that represents the generator. In StyleGAN, two networks are constructed and linked together: a mapping network f and a synthesis network g. The latent code z ∈ Z is first converted to w ∈ W (where W is an intermediate latent space) through a non-linear function f : Z → W, which is characterized by a neural network (i.e., the mapping network) consisting of several fully connected layers. Using an affine transformation, the intermediate representation w is specialized to styles y = (y_s, y_b) that are fed to the adaptive instance normalization (AdaIN) operations, specified as:

\mathrm{AdaIN}(x_i, y) = y_{s,i} \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}    (2)

where each feature map x_i is normalized separately. The StyleGAN generator architecture allows controlling the image synthesis by modifying the styles at different scales. In addition, instead of using one random latent code during training, this method uses two latent codes to generate a given proportion of images. More specifically, two latent codes z_1 and z_2 are fed to the mapping network to create w_1 and w_2, respectively, which control the styles by applying w_1 before and w_2 after a crossover point. Fig. 5 demonstrates examples of images created by mixing two latent codes at three different scales, where each subset of styles controls separate meaningful high-level attributes of the image. In other words, the generator architecture of StyleGAN is able to learn separation of high-level attributes (e.g., pose and identity when trained on human faces) and enables intuitive, scale-specific control of the face synthesis.
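Eq. (2) maps directly to a few lines of tensor code. The sketch below is a minimal stand-alone implementation of the AdaIN operation, with tensor shapes chosen purely for illustration.

```python
import torch

def adain(x, y_s, y_b, eps=1e-5):
    """Adaptive instance normalization, Eq. (2): each feature map x_i is
    normalized with its own mean/std, then scaled and shifted by the style."""
    mu = x.mean(dim=(2, 3), keepdim=True)           # per-sample, per-channel mean
    sigma = x.std(dim=(2, 3), keepdim=True) + eps   # per-sample, per-channel std
    return y_s * (x - mu) / sigma + y_b

x = torch.randn(4, 512, 8, 8)     # feature maps inside the synthesis network
y_s = torch.randn(4, 512, 1, 1)   # style scale from the affine transform of w
y_b = torch.randn(4, 512, 1, 1)   # style bias from the affine transform of w
out = adain(x, y_s, y_b)
```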
Table 1. Summary of notable deepfake tools, listing for each tool its link and key features.

3. Deepfake detection

Deepfake detection is normally deemed a binary classification problem, where classifiers are used to distinguish between authentic videos and tampered ones. This kind of method requires a large database of real and fake videos to train classification models. The number of available fake videos is increasing, but it is still limited in terms of setting a benchmark for validating various detection methods. To address
this issue, Korshunov and Marcel (2019) produced a notable deepfake dataset consisting of 620 videos based on the GAN model, using the open source code Faceswap-GAN (Faceswap-GAN, 2022). Videos from the publicly available VidTIMIT database (VidTIMIT database, 2022) were used to generate low and high quality deepfake videos, which can effectively mimic facial expressions, mouth movements, and eye blinking. These videos were then used to test various deepfake detection methods. Test results show that the popular face recognition systems based on VGG (Parkhi et al., 2015) and FaceNet (FaceNet, 2022; Schroff et al., 2015) are unable to detect deepfakes effectively. Other methods such as lip-syncing approaches (Chung et al., 2017; Suwajanakorn et al., 2017; Korshunov and Marcel, 2018b) and image quality metrics with a support vector machine (SVM) (Galbally and Marcel, 2014) also produce very high error rates when applied to detect deepfake videos from this newly produced dataset. This raises concerns about the critical need for future development of more robust methods that can distinguish deepfakes from genuine content.

This section presents a survey of deepfake detection methods, in which we group them into two major categories: fake image detection methods and fake video detection ones (Fig. 2). The latter is distinguished
On the other hand, Agarwal and Varshney (2019) cast GAN-based deepfake detection as a hypothesis testing problem, in which a statistical framework was introduced using the information-theoretic study of authentication (Maurer, 2000). The minimum distance between the distribution of legitimate images and that of images generated by a particular GAN, termed the oracle error, is defined. The analytic results show that this distance increases when the GAN is less accurate, in which case it is easier to detect deepfakes. In the case of high-resolution image inputs, an extremely accurate GAN is required to generate fake images that are hard to detect by this method.
Fig. 6. A two-step process for face manipulation detection where the preprocessing step aims to detect, crop and align faces on a sequence of frames and the second step
distinguishes manipulated and authentic face images by combining convolutional neural network (CNN) and recurrent neural network (RNN) (Sabir et al., 2019).
set of feature maps from the previous layer, with each convolutional operation defined by:

f_j^{(n)} = \sum_i f_i^{(n-1)} * \omega_{ij}^{(n)} + b_j^{(n)}    (3)

where f_j^{(n)} is the j-th feature map of the n-th layer, ω_{ij}^{(n)} is the weight of the i-th channel of the j-th convolutional kernel in the n-th layer, b_j^{(n)} is the bias term of the j-th convolutional kernel in the n-th layer, and the sum is taken over the feature maps i of layer (n − 1). The proposed approach is evaluated using a dataset consisting of 321,378 face images, which are created by applying the Glow model (Kingma and Dhariwal, 2018) to the CelebA face image dataset (Liu et al., 2015). Evaluation results show that the SCnet model obtains higher accuracy and better generalization than the Meso-4 model proposed in Afchar et al. (2018).
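As a concrete check of Eq. (3), the following sketch (with arbitrary shapes) accumulates the per-channel convolutions and biases explicitly, then verifies that the result matches a standard convolution call.

```python
import torch
import torch.nn.functional as F

# Eq. (3) written out explicitly: output map j sums the per-channel
# convolutions of the previous layer's maps with kernel channels w_ij.
f_prev = torch.randn(1, 8, 32, 32)   # feature maps f_i^(n-1), 8 channels
w = torch.randn(16, 8, 3, 3)         # kernels: 16 output maps, 8 input channels
b = torch.randn(16)

f_next = torch.stack([
    sum(F.conv2d(f_prev[:, i:i + 1], w[j:j + 1, i:i + 1], padding=1)
        for i in range(8)).squeeze(1) + b[j]
    for j in range(16)
], dim=1)

# The same computation expressed as one standard convolutional layer:
assert torch.allclose(f_next, F.conv2d(f_prev, w, b, padding=1), atol=1e-4)
```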
Recently, Zhao et al. (2021) proposed a method for deepfake detection using the self-consistency of local source features, which are content-independent, spatially-local information of images. These features could come from imaging pipelines, encoding methods or image synthesis approaches. The hypothesis is that a modified image would have different source features at different locations, while an original image will have the same source features across locations. These source features, represented in the form of down-sampled feature maps, are extracted by a CNN model using a special representation learning method called pairwise self-consistency learning. This learning method penalizes pairs of feature vectors that refer to locations from the same image for having a low cosine similarity score. At the same time, it also penalizes pairs from different images for having a high similarity score. The learned feature maps are then fed to a classification method for deepfake detection. The proposed approach is evaluated on seven popular datasets, including FaceForensics++ (Rossler et al., 2019), DeepfakeDetection (Dufour and Gully, 2019), Celeb-DF-v1 and Celeb-DF-v2 (Li et al., 2020b), the Deepfake Detection Challenge (DFDC) (Dolhansky et al., 2020), DFDC Preview (Dolhansky et al., 2019), and DeeperForensics-1.0 (Jiang et al., 2020). Experimental results demonstrate that the proposed approach is superior to state-of-the-art methods. It may, however, have a limitation when dealing with fake images generated by methods that directly output whole images, whose source features are consistent across all positions within each image.
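The pairwise self-consistency objective can be sketched as a cosine-similarity loss over spatial feature vectors, as below; this is a simplified illustration of the training signal described above, not the exact formulation of Zhao et al. (2021).

```python
import torch
import torch.nn.functional as F

def self_consistency_loss(feat_a, feat_b):
    """feat_a, feat_b: (C, H, W) source-feature maps from two different images.
    Encourage high cosine similarity within an image, low across images."""
    va = F.normalize(feat_a.flatten(1).t(), dim=1)  # (H*W, C) unit vectors
    vb = F.normalize(feat_b.flatten(1).t(), dim=1)
    within = va @ va.t()    # similarities between locations of the same image
    across = va @ vb.t()    # similarities between locations of different images
    # Penalize low similarity within the same image and high similarity across.
    return (1 - within).mean() + across.clamp(min=0).mean()

feat_a = torch.randn(64, 16, 16)   # down-sampled feature map of image A
feat_b = torch.randn(64, 16, 16)   # down-sampled feature map of image B
loss = self_consistency_loss(feat_a, feat_b)
```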
3.2. Fake video detection

Most image detection methods cannot be used for videos because of the strong degradation of the frame data after video compression (Afchar et al., 2018). Furthermore, videos have temporal characteristics that vary among sets of frames, and they are thus challenging for methods designed to detect only still fake images. This subsection focuses on deepfake video detection methods and categorizes them into two smaller groups: methods that employ temporal features and those that explore visual artifacts within frames.

3.2.1. Temporal features across video frames

Based on the observation that temporal coherence is not enforced effectively in the synthesis process of deepfakes, Sabir et al. (2019) leveraged the use of spatio-temporal features of video streams to detect deepfakes. Video manipulation is carried out on a frame-by-frame basis, so that low level artifacts produced by face manipulations are believed to further manifest themselves as temporal artifacts with inconsistencies across frames. A recurrent convolutional network (RCN) was proposed based on the integration of the convolutional network DenseNet (Huang et al., 2017) and gated recurrent unit cells (Cho et al., 2014) to exploit temporal discrepancies across frames (see Fig. 6). The proposed method is tested on the FaceForensics++ dataset, which includes 1000 videos (Rossler et al., 2019), and shows promising results.

Fig. 7. A deepfake detection method using a convolutional neural network (CNN) and long short term memory (LSTM) to extract temporal features of a given video sequence, which are represented via the sequence descriptor. The detection network, consisting of fully-connected layers, takes the sequence descriptor as input and calculates the probabilities of the frame sequence belonging to either the authentic or the deepfake class (Güera and Delp, 2018).

Likewise, Güera and Delp (2018) highlighted that deepfake videos contain intra-frame inconsistencies and temporal inconsistencies between frames. They then proposed a temporal-aware pipeline method that uses a CNN and long short term memory (LSTM) to detect deepfake videos. The CNN is employed to extract frame-level features, which are then fed into the LSTM to create a temporal sequence descriptor. A fully-connected network is finally used for classifying doctored videos from real ones based on the sequence descriptor, as illustrated in Fig. 7. An accuracy of greater than 97% was obtained using a dataset of 600 videos, including 300 deepfake videos collected from multiple video-hosting websites and 300 pristine videos randomly selected from the Hollywood human actions dataset of Laptev et al. (2008).
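A skeleton of such a temporal-aware pipeline is sketched below; the ResNet-18 backbone, feature dimension, and single-layer LSTM are illustrative assumptions rather than the exact configuration of Güera and Delp (2018).

```python
import torch
import torch.nn as nn
from torchvision import models

class CnnLstmDetector(nn.Module):
    """Frame-level CNN features -> LSTM sequence descriptor -> real/fake logits."""
    def __init__(self, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()            # keep the 512-d pooled features
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)       # authentic vs. deepfake

    def forward(self, clip):                   # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)  # per-frame features
        _, (h_n, _) = self.lstm(feats)         # final hidden state = descriptor
        return self.head(h_n[-1])              # sequence-level prediction

model = CnnLstmDetector()
logits = model(torch.rand(2, 16, 3, 224, 224))  # two 16-frame clips
```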
On the other hand, the use of a physiological signal, eye blinking, to detect deepfakes was proposed in Li et al. (2018), based on the observation that a person in deepfakes blinks much less frequently than in untampered videos. A healthy adult human normally blinks once every 2 to 10 s, and each blink takes 0.1 to 0.4 s. Deepfake algorithms, however, often use face images available online for training, which normally show people with open eyes, i.e., very few images published on the internet show people with closed eyes. Thus, without having access to images of people blinking, deepfake algorithms do not have the capability to generate fake faces that blink normally. In other words, blinking rates in deepfakes are much lower than those in normal videos. To discriminate between real and fake videos, Li et al. (2018) crop the eye areas in the videos and distribute them into long-term recurrent convolutional networks (LRCN) (Donahue et al., 2015) for dynamic state prediction. The LRCN consists of a feature extractor based on a CNN, a sequence learning module based on long short term memory (LSTM), and a state prediction module based on a fully connected layer to predict the probability of the eye open and close state. Eye blinking shows strong temporal dependencies, and thus the implementation of LSTM helps to capture these temporal patterns effectively.
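The blink-frequency cue itself can be illustrated with a simple eye-aspect-ratio (EAR) heuristic over facial landmarks, shown below; this is a lightweight stand-in for the LRCN of Li et al. (2018), and the threshold value is an assumption.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """eye: (6, 2) landmark array ordered as in the common 68-point scheme.
    The EAR drops sharply when the eyelid closes."""
    v1 = np.linalg.norm(eye[1] - eye[5])
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])
    return (v1 + v2) / (2.0 * h)

def blink_rate(ear_sequence, fps, closed_thresh=0.2):
    """Count dips of the EAR below a threshold, in blinks per minute."""
    closed = np.asarray(ear_sequence) < closed_thresh
    # A blink is a run of 'closed' frames; count the rising edges.
    blinks = np.count_nonzero(np.diff(closed.astype(int)) == 1)
    return blinks * 60.0 * fps / len(ear_sequence)

# An unusually low rate (healthy adults blink roughly every 2-10 s) can flag
# a video as suspicious and worth closer inspection.
```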
Recently, Caldelli et al. (2021) proposed the use of optical flow to gauge the information along the temporal axis of a frame sequence for video deepfake detection. The optical flow is a vector field calculated on two temporally distinct frames of a video that can describe the movement of objects in a scene. The optical flow fields are expected to be different between synthetically created frames and naturally generated ones (Amerini et al., 2019). Unnatural movements of the lips, eyes, or of entire faces inserted into deepfake videos would introduce distinctive motion patterns when compared with pristine ones. Based on this assumption, features consisting of optical flow fields are fed into a CNN model for discriminating between deepfakes and original videos. More specifically, the ResNet50 architecture (He et al., 2016) is implemented as the CNN model for the experiments. The results obtained using the FaceForensics++ dataset (Rossler et al., 2019) show that this approach is comparable with state-of-the-art methods in terms of classification accuracy. A combination of this kind of feature with frame-based features was also experimented with, resulting in improved deepfake detection performance. This demonstrates the usefulness of optical flow fields in capturing the inconsistencies on the temporal axis of video frames for deepfake detection.
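A sketch of the feature extraction stage is given below, using OpenCV's Farnebäck dense optical flow; the specific flow algorithm and parameters are illustrative assumptions (Caldelli et al. (2021) feed such flow fields to a ResNet50 classifier).

```python
import cv2
import numpy as np

def flow_fields(video_path, max_pairs=16):
    """Return dense optical-flow fields (dx, dy) for consecutive frame pairs."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    fields = []
    while len(fields) < max_pairs:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Farneback dense flow: one (H, W, 2) displacement field per frame pair.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        fields.append(flow)
        prev_gray = gray
    cap.release()
    return np.stack(fields)  # (N, H, W, 2), ready to feed a CNN classifier
```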
3.2.2. Visual artifacts within video frame

As can be noticed in the previous subsection, the methods using temporal patterns across video frames are mostly based on deep recurrent network models to detect deepfake videos. This subsection investigates the other approach, which normally decomposes videos into frames and explores visual artifacts within single frames to obtain discriminant features. These features are then distributed into either a deep or a shallow classifier to differentiate between fake and authentic videos. We thus group the methods in this subsection based on the type of classifier, i.e., either deep or shallow.
Deep classifiers. Deepfake videos are normally created with limited resolutions, which requires an affine face warping approach (i.e., scaling, rotation and shearing) to match the configuration of the original ones. Because of the resolution inconsistency between the warped face area and the surrounding context, this process leaves artifacts that can be detected by CNN models such as VGG16 (Simonyan and Zisserman, 2014), ResNet50, ResNet101 and ResNet152 (He et al., 2016). A deep learning method to detect deepfakes based on the artifacts observed during the face warping step of the deepfake generation algorithms was proposed in Li and Lyu (2018). The proposed method is evaluated on two deepfake datasets, namely UADFV and DeepfakeTIMIT. The UADFV dataset (Yang et al., 2019) contains 49 real videos and 49 fake videos with 32,752 frames in total. The DeepfakeTIMIT dataset (Korshunov and Marcel, 2018b) includes a set of low quality videos of 64 × 64 size and another set of high quality videos of 128 × 128, with in total 10,537 pristine images and 34,023 fabricated images extracted from 320 videos for each quality set. Performance of the proposed method is compared with other prevalent methods such as the two deepfake detection MesoNet methods, i.e., Meso-4 and MesoInception-4 (Afchar et al., 2018), HeadPose (Yang et al., 2019), and the face tampering detection method two-stream NN (Zhou et al., 2017). An advantage of the proposed method is that it does not need to generate deepfake videos as negative examples before training the detection models. Instead, the negative examples are generated dynamically by extracting the face region of the original image, aligning it into multiple scales, applying Gaussian blur to a randomly picked scaled image, and warping it back to the original image. This saves a large amount of time and computational resources compared to other methods, which require deepfakes to be generated in advance.
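This dynamic negative-example generation can be approximated as below: a real face region is down-scaled, blurred, and pasted back, simulating the resolution inconsistency left by face warping; the scale and blur parameters are assumptions for illustration.

```python
import cv2
import numpy as np

def make_negative_example(image, face_box, scale=0.5, blur_ksize=5):
    """Simulate deepfake warping artifacts on a real image. face_box is
    (x, y, w, h): down-scale the face, blur it, and warp it back in place."""
    x, y, w, h = face_box
    face = image[y:y + h, x:x + w]
    small = cv2.resize(face, (int(w * scale), int(h * scale)))
    small = cv2.GaussianBlur(small, (blur_ksize, blur_ksize), 0)
    out = image.copy()
    out[y:y + h, x:x + w] = cv2.resize(small, (w, h))  # resolution mismatch
    return out

img = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)  # placeholder
negative = make_negative_example(img, (64, 64, 128, 128))
```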
Nguyen et al. (2019) proposed the use of capsule networks for detecting manipulated images and videos. The capsule network was initially introduced to address limitations of CNNs when applied to inverse graphics tasks, which aim to find the physical processes used to produce images of the world (Hinton et al., 2011). The recent development of the capsule network based on a dynamic routing algorithm (Sabour et al., 2017) demonstrates its ability to describe the hierarchical pose relationships between object parts. This development is employed as a component in a pipeline for detecting fabricated images and videos, as demonstrated in Fig. 8. A dynamic routing algorithm is deployed to route the outputs of the three capsules to the output capsules through a number of iterations to separate fake from real images. The method is evaluated on four datasets covering a wide range of forged image and video attacks. They include the well-known Idiap Research Institute replay-attack dataset (Chingovska et al., 2012), the deepfake face swapping dataset created by Afchar et al. (2018), the facial reenactment FaceForensics dataset (Rössler et al., 2018) produced by the Face2Face method (Thies et al., 2016), and the fully computer-generated image dataset generated by Rahmouni et al. (2017). The proposed method yields the best performance compared to its competing methods on all of these datasets. This shows the potential of the capsule network in building a general detection system that can work effectively for various forged image and video attacks.

Fig. 8. The capsule network takes features obtained from the VGG-19 network (Simonyan and Zisserman, 2014) to distinguish fake images or videos from real ones (top). The pre-processing step detects the face region and scales it to the size of 128 × 128 before the VGG-19 is used to extract latent features for the capsule network, which comprises three primary capsules and two output capsules, one for real and one for fake images (bottom). The statistical pooling constitutes an important part of the capsule network that deals with forgery detection (Nguyen et al., 2019).

Shallow classifiers. Deepfake detection methods mostly rely on artifacts or the inconsistency of intrinsic features between fake and real images or videos. Yang et al. (2019) proposed a detection method by observing the differences between 3D head poses, comprising head orientation and position, which are estimated based on 68 facial landmarks of the central face region. The 3D head poses are examined because there is a shortcoming in the deepfake face generation pipeline. The extracted features are fed into an SVM classifier to obtain the detection results. Experiments on two datasets show the great performance of the proposed approach against its competing methods. The first dataset, namely UADFV, consists of 49 deepfake videos and their respective real videos (Yang et al., 2019). The second dataset comprises 241 real images and 252 deepfake images, which is a subset of the data used in the DARPA MediFor GAN Image/Video Challenge (Guan et al., 2019). Likewise, a method to exploit artifacts of deepfakes and face manipulations based on visual features of the eyes, teeth and facial contours was studied in Matern et al. (2019). The visual artifacts arise from a lack of global consistency, wrong or imprecise estimation of the incident illumination, or imprecise estimation of the underlying geometry. For deepfake detection, missing reflections and missing details in the eye and teeth areas are exploited, as well as texture features extracted from the facial region based on facial landmarks. Accordingly, an eye feature vector, a teeth feature vector and features extracted from the full-face crop are used. After extracting the features, two classifiers, logistic regression and a small neural network, are employed to classify deepfakes from real videos. Experiments carried out on a video
dataset downloaded from YouTube show a best result of 0.851 in terms of the area under the receiver operating characteristic curve. The proposed method, however, has the disadvantage of requiring images that meet certain prerequisites, such as open eyes or visible teeth.
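These shallow pipelines share a common shape — a handcrafted feature vector per sample fed to a classical classifier — sketched below with placeholder features; the feature dimension and SVM settings are assumptions, not the exact setups of Yang et al. (2019) or Matern et al. (2019).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Suppose each frame yields a handcrafted feature vector, e.g. head-pose
# differences or eye/teeth descriptors (placeholder random data below).
X = np.random.randn(200, 12)        # 200 frames, 12 handcrafted features
y = np.random.randint(0, 2, 200)    # 0 = authentic, 1 = deepfake

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(X[:150], y[:150])
scores = clf.predict_proba(X[150:])[:, 1]   # deepfake probability per frame
```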
The use of photo response non-uniformity (PRNU) analysis was proposed in Koopman et al. (2018) to distinguish deepfakes from authentic videos. PRNU is a component of sensor pattern noise, which is attributed to the manufacturing imperfections of silicon wafers and the inconsistent sensitivity of pixels to light caused by variation in the physical characteristics of the silicon wafers. PRNU analysis is widely used in image forensics (Rosenfeld and Sencar, 2009; Li and Li, 2011; Lin and Li, 2016; Scherhag et al., 2019; Phan et al., 2018) and is advocated in Koopman et al. (2018) because the swapped face is supposed to alter the local PRNU pattern in the facial area of video frames. The videos are converted into frames, which are cropped to the questioned facial region. The cropped frames are then separated sequentially into eight groups, and an average PRNU pattern is computed for each group. Normalized cross-correlation scores are calculated for comparisons of PRNU patterns among these groups. A test dataset was created consisting of 10 authentic videos and 16 manipulated videos, where the fake videos were produced from the genuine ones by the DeepFaceLab tool (DeepFaceLab, 2022b). The analysis shows a significant statistical difference in mean normalized cross-correlation scores between deepfakes and the genuine videos. This analysis therefore suggests that PRNU has potential in deepfake detection, although a larger dataset would need to be tested.
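The group-wise comparison can be sketched as follows; the PRNU estimate here is reduced to a simple denoising residual, a crude stand-in for proper sensor-noise extraction, with the rest following the compute-average-patterns-then-correlate procedure described above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def prnu_residual(frame):
    """Crude PRNU proxy: noise residual = frame minus a denoised version."""
    frame = frame.astype(np.float64)
    return frame - gaussian_filter(frame, sigma=1.5)

def normalized_cross_correlation(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))

# Average the residuals within each of the eight groups, then compare groups:
# consistently low correlation scores suggest a manipulated face region.
groups = [np.random.rand(8, 128, 128) for _ in range(8)]  # stand-in frame crops
patterns = [prnu_residual(g).mean(axis=0) for g in groups]
scores = [normalized_cross_correlation(patterns[i], patterns[j])
          for i in range(8) for j in range(i + 1, 8)]
```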
When seeing a video or image with suspicion, users normally want to search for its origin; however, no such tool is currently available. Hasan and Salah (2019) proposed the use of blockchain and smart contracts to help users detect deepfake videos, based on the assumption that videos are only real when their sources are traceable. Each video is associated with a smart contract that links to its parent video, and each parent video has a link to its child in a hierarchical structure. Through this chain, users can credibly trace back to the original smart contract associated with the pristine video, even if the video has been copied multiple times. An important attribute of the smart contract is the unique hashes of the interplanetary file system (IPFS), which is used to store the video and its metadata in a decentralized and content-addressable manner (IPFS, 2022). The smart contract's key features and functionalities are tested against several common security challenges, such as distributed denial of service, replay and man-in-the-middle attacks, to ensure the solution meets security requirements. This approach is generic, and it can be extended to other types of digital content, e.g., images, audios and manuscripts.
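While the proposed scheme relies on Ethereum smart contracts and IPFS, the core traceability idea — every copy keeps a verifiable link to its parent — can be illustrated with a plain hash chain; this is a simplified sketch of the concept, not the authors' implementation.

```python
import hashlib
import json

def register_copy(video_bytes, parent_record=None):
    """Create a provenance record linking a video copy to its parent record."""
    return {
        "content_hash": hashlib.sha256(video_bytes).hexdigest(),
        "parent_hash": (hashlib.sha256(
            json.dumps(parent_record, sort_keys=True).encode()).hexdigest()
            if parent_record else None),
    }

original = register_copy(b"...original video bytes...")
copy1 = register_copy(b"...re-encoded copy...", parent_record=original)
# Following parent_hash links from any copy leads back to the original record;
# a video with no resolvable chain to a trusted origin is treated as suspect.
```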
4. Discussions and future research directions

With the support of deep learning, deepfakes can be created more easily than ever before. The spread of these fake contents is also quicker thanks to the development of social media platforms (Zubiaga et al., 2018). Sometimes deepfakes do not need to be spread to a massive audience to cause detrimental effects. People who create deepfakes with malicious purposes only need to deliver them to target audiences as part of their sabotage strategy, without using social media. For example, this approach can be utilized by intelligence services trying to influence decisions made by important people such as politicians, leading to national and international security threats (Chesney and Citron, 2018b). In response to this alarming problem, the research community has focused on developing deepfake detection algorithms, and numerous results have been reported. This paper has reviewed the state-of-the-art methods, and a summary of typical approaches is provided in Table 2. It is noticeable that a battle between those who use advanced machine learning to create deepfakes and those who make the effort to detect deepfakes is growing.
Table 2. Summary of prominent deepfake detection methods (method; classifiers/techniques; key features; data dealt with; datasets used).

- Eye blinking (Li et al., 2018). Technique: LRCN. Key features: uses LRCN to learn the temporal patterns of eye blinking; based on the observation that the blinking frequency of deepfakes is much lower than normal. Data: videos. Datasets: 49 interview and presentation videos and their corresponding generated deepfakes.
- Intra-frame and temporal inconsistencies (Güera and Delp, 2018). Technique: CNN and LSTM. Key features: a CNN is employed to extract frame-level features, which are distributed to an LSTM to construct a sequence descriptor useful for classification. Data: videos. Datasets: a collection of 600 videos obtained from multiple websites.
- Using face warping artifacts (Li and Lyu, 2018). Technique: VGG16 (Simonyan and Zisserman, 2014), ResNet models (He et al., 2016). Key features: artifacts are discovered using CNN models based on the resolution inconsistency between the warped face area and the surrounding context. Data: videos. Datasets: UADFV (Yang et al., 2019), containing 49 real videos and 49 fake videos with 32,752 frames in total; DeepfakeTIMIT (Korshunov and Marcel, 2018b).
- MesoNet (Afchar et al., 2018). Technique: CNN. Key features: two deep networks, Meso-4 and MesoInception-4, are introduced to examine deepfake videos at the mesoscopic analysis level; accuracies obtained on the deepfake and FaceForensics datasets are 98% and 95%, respectively. Data: videos. Datasets: a deepfake dataset constituted from online videos and the FaceForensics dataset created by the Face2Face approach (Thies et al., 2016).
- Eye, teeth and facial texture (Matern et al., 2019). Technique: logistic regression and neural network (NN). Key features: exploits facial texture differences, and missing reflections and details in the eye and teeth areas of deepfakes; logistic regression and a NN are used for classification. Data: videos. Datasets: a video dataset downloaded from YouTube.
- Spatio-temporal features with RCN (Sabir et al., 2019). Technique: RCN. Key features: temporal discrepancies across frames are explored using an RCN that integrates the convolutional network DenseNet (Huang et al., 2017) and gated recurrent unit cells (Cho et al., 2014). Data: videos. Datasets: FaceForensics++, including 1000 videos (Rossler et al., 2019).
- Spatio-temporal features with LSTM (Chintha et al., 2020). Technique: convolutional bidirectional recurrent LSTM network. Key features: an XceptionNet CNN is used for facial feature extraction, while audio embeddings are obtained by stacking multiple convolution modules; two loss functions, cross-entropy and Kullback–Leibler divergence, are used. Data: videos. Datasets: FaceForensics++ (Rossler et al., 2019), Celeb-DF (5639 deepfake videos) (Li et al., 2020b), and the ASVSpoof 2019 Logical Access audio dataset (Todisco et al., 2019).
- Analysis of PRNU (Koopman et al., 2018). Technique: PRNU. Key features: analysis of the noise patterns of the light sensitive sensors of digital cameras due to their factory defects; explores the differences in PRNU patterns between authentic and deepfake videos, because face swapping is believed to alter the local PRNU patterns. Data: videos. Datasets: created by the authors, including 10 authentic and 16 deepfake videos made using DeepFaceLab (DeepFaceLab, 2022b).
- Phoneme-viseme mismatches (Agarwal et al., 2020b). Technique: CNN. Key features: exploits the mismatches between the dynamics of the mouth shape (visemes) and a spoken phoneme; focuses on sounds associated with the M, B and P phonemes, as they require complete mouth closure, which deepfakes often incorrectly synthesize. Data: videos. Datasets: four in-the-wild lip-sync deepfakes from Instagram and YouTube (e.g., www.instagram.com/bill_posters_uk), and others created using synthesis techniques, i.e., Audio-to-Video (A2V) (Suwajanakorn et al., 2017) and Text-to-Video (T2V) (Fried et al., 2019).
- Using the attribution-based confidence (ABC) metric (Fernandes et al., 2020). Technique: ResNet50 model (He et al., 2016) pre-trained on VGGFace2 (Cao et al., 2018). Key features: the ABC metric (Jha et al., 2019) is used to detect deepfake videos without access to training data; ABC values obtained for original videos are greater than 0.94, while deepfakes have low ABC values. Data: videos. Datasets: original videos from VidTIMIT, COHFACE (https://www.idiap.ch/dataset/cohface) and YouTube; the COHFACE (Fernandes et al., 2019) and YouTube videos were used to generate two deepfake datasets via the commercial website https://deepfakesweb.com, and another deepfake dataset is DeepfakeTIMIT (Korshunov and Marcel, 2018a).
- Using appearance and behavior (Agarwal et al., 2020a). Technique: rules based on facial and behavioral features. Key features: temporal behavioral biometrics based on facial expressions and head movements are learned using ResNet-101 (He et al., 2016), while static facial biometrics are obtained using VGG (Parkhi et al., 2015). Data: videos. Datasets: the world leaders dataset (Agarwal et al., 2019), FaceForensics++ (Rossler et al., 2019), the Google/Jigsaw deepfake detection dataset (Dufour and Gully, 2019), DFDC (Dolhansky et al., 2019) and Celeb-DF (Li et al., 2020b).
- FakeCatcher (Ciftci et al., 2020). Technique: CNN. Key features: extracts biological signals in portrait videos and uses them as an implicit descriptor of authenticity, because they are not spatially and temporally well-preserved in deepfakes. Data: videos. Datasets: UADFV (Yang et al., 2019), FaceForensics (Rössler et al., 2018), FaceForensics++ (Rossler et al., 2019), Celeb-DF (Li et al., 2020b), and a new dataset of 142 videos, independent of the generative model, resolution, compression, content, and context.
- Emotion audio–visual affective cues (Mittal et al., 2020). Technique: Siamese network (Chopra et al., 2005). Key features: modality and emotion embedding vectors for the face and speech are extracted for deepfake detection. Data: videos. Datasets: DeepfakeTIMIT (Korshunov and Marcel, 2018a) and DFDC (Dolhansky et al., 2019).
- Head poses (Yang et al., 2019). Technique: SVM. Key features: features are extracted using 68 landmarks of the face region; an SVM classifies using the extracted features. Data: videos/images. Datasets: UADFV, consisting of 49 deepfake videos and their respective real videos; 241 real images and 252 deepfake images from the DARPA MediFor GAN Image/Video Challenge.
- Capsule-forensics (Nguyen et al., 2019). Technique: capsule networks. Key features: latent features extracted by the VGG-19 network (Simonyan and Zisserman, 2014) are fed into the capsule network for classification; a dynamic routing algorithm (Sabour et al., 2017) routes the outputs of three convolutional capsules to two output capsules, one for fake and one for real images, through a number of iterations. Data: videos/images. Datasets: the Idiap Research Institute replay-attack dataset (Chingovska et al., 2012), the deepfake face swapping dataset by Afchar et al. (2018), the facial reenactment FaceForensics dataset (Rössler et al., 2018), and the fully computer-generated image set of Rahmouni et al. (2017).
- Preprocessing combined with a deep network (Xuan et al., 2019). Technique: DCGAN, WGAN-GP and PGGAN. Key features: enhances the generalization ability of deep learning models to detect GAN-generated images; removes low level features of fake images; forces deep networks to focus more on pixel-level similarity between fake and real images to improve generalization. Data: images. Datasets: real images from CelebA-HQ (Karras et al., 2017), including high quality face images of 1024 × 1024 resolution; fake images generated by DCGAN (Radford et al., 2015), WGAN-GP (Gulrajani et al., 2017) and PGGAN (Karras et al., 2017).
- Analyzing convolutional traces (Guarnera et al., 2020a). Technique: KNN, SVM, and linear discriminant analysis (LDA). Key features: uses an expectation–maximization algorithm to extract local features pertaining to the convolutional generative process of GAN-based image deepfake generators. Data: images. Datasets: authentic images from CelebA and corresponding deepfakes created by five different GANs (group-wise deep whitening-and-coloring transformation, GDWCT (Cho et al., 2019), StarGAN (Choi et al., 2018), AttGAN (He et al., 2019), StyleGAN (Karras et al., 2019), StyleGAN2 (Karras et al., 2020)).
- Bag of words and shallow classifiers (Zhang et al., 2017). Technique: SVM, RF, MLP. Key features: extracts discriminant features using a bag-of-words method and feeds these features into SVM, RF and MLP classifiers for binary classification: innocent vs. fabricated. Data: images. Datasets: the well-known LFW face database (Huang et al., 2007), containing 13,233 images with a resolution of 250 × 250.
- Pairwise learning (Hsu et al., 2020). Technique: CNN concatenated to a CFFN. Key features: a two-phase procedure: feature extraction using the CFFN based on the Siamese network architecture (Chopra et al., 2005), and classification using a CNN. Data: images. Datasets: face images: real ones from CelebA (Liu et al., 2015), fake ones generated by DCGAN (Radford et al., 2015), WGAN (Arjovsky et al., 2017), WGAN-GP (Gulrajani et al., 2017), least squares GAN (Mao et al., 2017), and PGGAN (Karras et al., 2017); general images: real ones from ILSVRC12 (Russakovsky et al., 2015), fake ones generated by BIGGAN (Brock et al., 2018), self-attention GAN (Zhang et al., 2019) and spectral normalization GAN (Miyato et al., 2018).
- Defenses against adversarial perturbations in deepfakes (Gandhi and Jain, 2020). Technique: VGG (Parkhi et al., 2015) and ResNet (He et al., 2016). Key features: introduces adversarial perturbations to enhance deepfakes and fool deepfake detectors; improves the accuracy of deepfake detectors using Lipschitz regularization and deep image prior techniques. Data: images. Datasets: 5000 real images from CelebA (Liu et al., 2015) and 5000 fake images created by the "Few-Shot Face Translation GAN" method (Few-Shot Face Translation GAN, 2022).
- Face X-ray (Li et al., 2020a). Technique: CNN. Key features: tries to locate the blending boundary between the target and original faces instead of capturing the synthesized artifacts of specific manipulations; can be trained without fake images. Data: images. Datasets: FaceForensics++ (Rossler et al., 2019), DeepfakeDetection (DFD) (Dufour and Gully, 2019), DFDC (Dolhansky et al., 2019) and Celeb-DF (Li et al., 2020b).
- Using common artifacts of CNN-generated images (Wang et al., 2020). Technique: ResNet-50 (He et al., 2016) pre-trained on ImageNet (Russakovsky et al., 2015). Key features: trains the classifier using a large number of fake images generated by a high-performing unconditional GAN model, i.e., PGGAN (Karras et al., 2017), and evaluates how well the classifier generalizes to other CNN-synthesized images. Data: images. Datasets: a new dataset of CNN-generated images, namely ForenSynths, consisting of synthesized images from 11 models, such as StyleGAN (Karras et al., 2019), super-resolution methods (Dai et al., 2019) and FaceForensics++ (Rossler et al., 2019).
- Using convolutional traces on GAN-based images (Guarnera et al., 2020b). Technique: KNN, SVM, and LDA. Key features: trains the expectation–maximization algorithm (Moon, 1996) to detect and extract discriminative features via a fingerprint that represents the convolutional traces left by GANs during image generation. Data: images. Datasets: a dataset of images generated by ten GAN models, including CycleGAN (Zhu et al., 2017), StarGAN (Choi et al., 2018), AttGAN (He et al., 2019), GDWCT (Cho et al., 2019), StyleGAN (Karras et al., 2019), StyleGAN2 (Karras et al., 2020), PGGAN (Karras et al., 2017), FaceForensics++ (Rossler et al., 2019), IMLE (Li et al., 2019b), and SPADE (Park et al., 2019).
- Using deep features extracted by a CNN (Guo et al., 2021). Technique: a new CNN model, namely SCnet. Key features: the CNN-based SCnet automatically learns high-level forensics features of image data thanks to a hierarchical feature extraction block formed by stacking four convolutional layers. Data: images. Datasets: a dataset of 321,378 face images, created by applying the Glow model (Kingma and Dhariwal, 2018) to the CelebA face image dataset (Liu et al., 2015).
Deepfakes' quality has been increasing, and the performance of detection methods needs to be improved accordingly. The inspiration is that what AI has broken can be fixed by AI as well (Floridi, 2018). Detection methods are still in their early stage, and various methods have been proposed and evaluated, but on fragmented datasets. An approach to improve the performance of detection methods is to create a growing, updated benchmark dataset of deepfakes to validate the ongoing development of detection methods. This will facilitate the training process of detection models, especially those based on deep learning, which require a large training set (Dolhansky et al., 2020). Improving the performance of deepfake detection methods is important, especially in cross-forgery and cross-dataset scenarios. Most detection models are designed and evaluated in same-forgery and in-dataset experiments, which do not ensure their generalization capability. Some previous studies have addressed this issue, e.g., Wang et al. (2020), Caldelli et al. (2021), Zhao et al. (2021), Cozzolino et al. (2018) and Marra et al. (2019), but more work needs to be done in this direction. A model trained on a specific forgery needs to be able to work against another unknown one, because the potential deepfake types are not normally known in real-world scenarios. Likewise, current detection methods mostly focus on drawbacks of the deepfake generation pipelines, i.e., finding weaknesses of the competitors to attack them. This kind of information and knowledge is not always available in adversarial environments, where attackers commonly attempt not to reveal their deepfake creation technologies. Recent works on adversarial perturbation attacks to fool DNN-based detectors make the deepfake detection task more difficult (Gandhi and Jain, 2020; Hussain et al., 2021; Carlini and Farid, 2020; Yang et al., 2021; Yeh et al., 2020). These are real challenges for detection method development, and future studies need to focus on introducing more robust, scalable and generalizable methods.

Another research direction is to integrate detection methods into distribution platforms such as social media to increase their effectiveness in dealing with the widespread impact of deepfakes. A screening or filtering mechanism using effective detection methods can be implemented on these platforms to ease deepfake detection (Chesney and Citron, 2018b). Legal requirements can be imposed on the tech companies who own these platforms to remove deepfakes quickly and reduce their impact. In addition, watermarking tools can be integrated into the devices that people use to make digital content, creating immutable metadata that stores originality details such as the time and location of multimedia contents, as well as an attestation that they have not been tampered with (Chesney and Citron, 2018b). This integration is difficult to implement, but a solution could be the use of the disruptive blockchain technology. The blockchain has been used effectively in many areas, and there are very few studies so far addressing deepfake detection problems based on this technology. As it can create a chain of unique, unchangeable blocks of metadata, it is a great tool for digital provenance. The integration of blockchain technologies into this problem has demonstrated certain results (Hasan and Salah, 2019), but this research direction is far from mature.
Using detection methods to spot deepfakes is crucial, but understanding the real intent of people publishing deepfakes is even more important. This requires the judgement of users based on the social context in which a deepfake is discovered, e.g., who distributed it and what they said about it (Read, 2019). This is critical as deepfakes are getting more and more photorealistic, and it is highly anticipated that detection software will lag behind deepfake creation technology. A study on the social context of deepfakes to assist users in such judgement is thus worth performing.

Videos and photographs have been widely used as evidence in police investigations and justice cases. They may be introduced as evidence in a court of law by digital media forensics experts who have a background in computer science or law enforcement and experience in collecting, examining and analyzing digital information. The development of machine learning and AI technologies might have been used to modify these digital contents, and thus the experts' opinions may not be enough to authenticate this evidence, because even experts are unable to discern manipulated contents. This aspect needs to be taken into account in courtrooms nowadays when images and videos are used as evidence to convict perpetrators, because of the existence of a wide range of digital manipulation methods (Maras and Alexandrou, 2019). The digital media forensics results therefore must be proved to be valid and reliable before they can be used in courts. This requires careful documentation of each step of the forensics process and of how the results are reached. Machine learning and AI algorithms can be used to
support the determination of the authenticity of digital media and have obtained accurate and reliable results, e.g., Su et al. (2017) and Iuliani et al. (2018), but most of these algorithms are unexplainable. This creates a huge hurdle for the application of AI to forensics problems, because not only do the forensics experts often lack expertise in computer algorithms, but the computer professionals also cannot explain the results properly, as most of these algorithms are black box models (Malolan et al., 2020). This is more critical as the most recent models with the most accurate results are based on deep learning methods consisting of many neural network parameters. Researchers have recently attempted to create white box and explainable detection methods. An example is the approach proposed by Giudice et al. (2021), in which discrete cosine transform statistics are used to detect so-called specific GAN frequencies to differentiate between real images and deepfakes. Through the analysis of particular frequency statistics, that method can mathematically explain whether a multimedia content is a deepfake and why it is. More research must be conducted in this area, and explainable AI in computer vision is therefore a research direction that is needed to promote and utilize the advances and advantages of AI and machine learning in digital media forensics.
Lett. 146, 31–37.
direction that is needed to promote and utilize the advances and Cao, Qiong, Shen, Li, Xie, Weidi, Parkhi, Omkar M, Zisserman, Andrew, 2018.
advantages of AI and machine learning in digital media forensics. VGGFace2: A dataset for recognising faces across pose and age. In: The 13th IEEE
5. Conclusions

Deepfakes have begun to erode people's trust in media content, as seeing it is no longer commensurate with believing it. Deepfakes could cause distress and other harm to those targeted, heighten disinformation and hate speech, and even stimulate political tension, inflame the public, or incite violence and war. This is especially critical nowadays, as the technologies for creating deepfakes are increasingly accessible and social media platforms can spread fake content quickly. This survey provides a timely overview of deepfake creation and detection methods and presents a broad discussion on challenges, potential trends, and future directions in this area. The study will therefore be valuable for the artificial intelligence research community in developing effective methods for tackling deepfakes.

CRediT authorship contribution statement

Thanh Thi Nguyen: Conceptualization, Methodology, Investigation, Writing. Quoc Viet Hung Nguyen: Conceptualization, Writing – review & editing. Dung Tien Nguyen: Methodology, Writing – original draft. Duc Thanh Nguyen: Visualization, Writing – original draft. Thien Huynh-The: Validation, Writing – review & editing. Saeid Nahavandi: Validation, Writing – review & editing. Thanh Tam Nguyen: Visualization, Writing – review & editing. Quoc-Viet Pham: Validation, Writing – review & editing. Cuong M. Nguyen: Investigation, Writing – original draft.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References

Afchar, Darius, Nozick, Vincent, Yamagishi, Junichi, Echizen, Isao, 2018. MesoNet: A compact facial video forgery detection network. In: 2018 IEEE International Workshop on Information Forensics and Security, WIFS. IEEE, pp. 1–7.
Agarwal, Shruti, Farid, Hany, El-Gaaly, Tarek, Lim, Ser-Nam, 2020a. Detecting deep-fake videos from appearance and behavior. In: IEEE International Workshop on Information Forensics and Security, WIFS. IEEE, pp. 1–6.
Agarwal, Shruti, Farid, Hany, Fried, Ohad, Agrawala, Maneesh, 2020b. Detecting deep-fake videos from phoneme-viseme mismatches. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 660–661.
Agarwal, Shruti, Farid, Hany, Gu, Yuming, He, Mingming, Nagano, Koki, Li, Hao, 2019. Protecting world leaders against deep fakes. In: Computer Vision and Pattern Recognition Workshops, vol. 1, pp. 38–45.
Agarwal, Sakshi, Varshney, Lav R., 2019. Limits of deepfake detection: A robust estimation viewpoint. arXiv preprint arXiv:1905.03493.
Amerini, Irene, Caldelli, Roberto, 2020. Exploiting prediction error inconsistencies through LSTM-based classifiers to detect deepfake videos. In: Proceedings of the 2020 ACM Workshop on Information Hiding and Multimedia Security. pp. 97–102.
Amerini, Irene, Galteri, Leonardo, Caldelli, Roberto, Del Bimbo, Alberto, 2019. Deepfake video detection through optical flow based CNN. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. pp. 1205–1207.
Arjovsky, Martin, Chintala, Soumith, Bottou, Léon, 2017. Wasserstein generative adversarial networks. In: International Conference on Machine Learning. PMLR, pp. 214–223.
Bai, Shuang, 2017. Growing random forest on deep convolutional neural networks for scene categorization. Expert Syst. Appl. 71, 279–287.
Bayar, Belhassen, Stamm, Matthew C., 2016. A deep learning approach to universal image manipulation detection using a new convolutional layer. In: Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security. pp. 5–10.
Bloomberg, 2018. How faking videos became easy and why that's so scary. https://fortune.com/2018/09/11/deep-fakes-obama-video/.
Brock, Andrew, Donahue, Jeff, Simonyan, Karen, 2018. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
Caldelli, Roberto, Galteri, Leonardo, Amerini, Irene, Del Bimbo, Alberto, 2021. Optical flow based CNN for detection of unlearnt deepfake manipulations. Pattern Recognit. Lett. 146, 31–37.
Cao, Qiong, Shen, Li, Xie, Weidi, Parkhi, Omkar M., Zisserman, Andrew, 2018. VGGFace2: A dataset for recognising faces across pose and age. In: The 13th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2018. IEEE, pp. 67–74.
Carlini, Nicholas, Farid, Hany, 2020. Evading deepfake-image detectors with white- and black-box attacks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 658–659.
Chan, Caroline, Ginosar, Shiry, Zhou, Tinghui, Efros, Alexei A., 2019. Everybody dance now. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5933–5942.
Cheng, Zhengxue, Sun, Heming, Takeuchi, Masaru, Katto, Jiro, 2019. Energy compaction-based image compression using convolutional autoencoder. IEEE Trans. Multimed. 22 (4), 860–873.
Chesney, Robert, Citron, Danielle Keats, 2018a. Deep fakes: A looming challenge for privacy, democracy, and national security. Democr. National Secur. 107.
Chesney, R., Citron, D.K., 2018b. Disinformation on steroids: The threat of deep fakes. https://www.cfr.org/report/deep-fake-disinformation-steroids.
Chesney, Robert, Citron, Danielle, 2019. Deepfakes and the new disinformation war: The coming age of post-truth geopolitics. Foreign Aff. 98, 147.
Chingovska, Ivana, Anjos, André, Marcel, Sébastien, 2012. On the effectiveness of local binary patterns in face anti-spoofing. In: Proceedings of the International Conference of the Biometrics Special Interest Group, BIOSIG. IEEE, pp. 1–7.
Chintha, Akash, Thai, Bao, Sohrawardi, Saniat Javid, Bhatt, Kartavya, Hickerson, Andrea, Wright, Matthew, Ptucha, Raymond, 2020. Recurrent convolutional structures for audio spoof and video deepfake detection. IEEE J. Sel. Top. Sign. Proces. 14 (5), 1024–1037.
Cho, Wonwoong, Choi, Sungha, Park, David Keetae, Shin, Inkyu, Choo, Jaegul, 2019. Image-to-image translation via group-wise deep whitening-and-coloring transformation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10639–10647.
Cho, Kyunghyun, Van Merriënboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, Bengio, Yoshua, 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Choi, Yunjey, Choi, Minje, Kim, Munyoung, Ha, Jung-Woo, Kim, Sunghun, Choo, Jaegul, 2018. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8789–8797.
Chopra, Sumit, Hadsell, Raia, LeCun, Yann, 2005. Learning a similarity metric discriminatively, with application to face verification. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR'05, vol. 1. IEEE, pp. 539–546.
Chorowski, Jan, Weiss, Ron J., Bengio, Samy, van den Oord, Aäron, 2019. Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Trans. Audio Speech Lang. Process. 27 (12), 2041–2053.
Chung, Joon Son, Senior, Andrew, Vinyals, Oriol, Zisserman, Andrew, 2017. Lip reading sentences in the wild. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE, pp. 3444–3453.
Ciftci, Umur Aybars, Demir, Ilke, Yin, Lijun, 2020. FakeCatcher: Detection of synthetic portrait videos using biological signals. IEEE Trans. Pattern Anal. Mach. Intell.
Cozzolino, Davide, Thies, Justus, Rössler, Andreas, Riess, Christian, Nießner, Matthias, Verdoliva, Luisa, 2018. ForensicTransfer: Weakly-supervised domain adaptation for forgery detection. arXiv preprint arXiv:1812.02510.
CycleGAN, 2022. https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix.
Dai, Tao, Cai, Jianrui, Zhang, Yongbing, Xia, Shu-Tao, Zhang, Lei, 2019. Second-order attention network for single image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11065–11074.
Damiani, J., 2019. A voice deepfake was used to scam a CEO out of $243,000. https://www.forbes.com/sites/jessedamiani/2019/09/03/a-voice-deepfake-was-used-to-scam-a-ceo-out-of-243000/.
DeepFaceLab, 2022a. DeepFaceLab: Explained and usage tutorial. https://mrdeepfakes.com/forums/thread-deepfacelab-explained-and-usage-tutorial.
DeepFaceLab, 2022b. https://github.com/iperov/DeepFaceLab.
DeepFake_tf, 2022. DeepFake_tf: Deepfake based on tensorflow. https://github.com/StromWine/DeepFake_tf.
Deng, Yu, Yang, Jiaolong, Chen, Dong, Wen, Fang, Tong, Xin, 2020. Disentangled and controllable face image generation via 3D imitative-contrastive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5154–5163.
DFaker, 2022. https://github.com/dfaker/df.
Dolhansky, Brian, Bitton, Joanna, Pflaum, Ben, Lu, Jikuo, Howes, Russ, Wang, Menglin, Canton Ferrer, Cristian, 2020. The deepfake detection challenge dataset. arXiv preprint arXiv:2006.07397.
Dolhansky, Brian, Howes, Russ, Pflaum, Ben, Baram, Nicole, Ferrer, Cristian Canton, 2019. The deepfake detection challenge (DFDC) preview dataset. arXiv preprint arXiv:1910.08854.
Donahue, Jeffrey, Anne Hendricks, Lisa, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, Darrell, Trevor, 2015. Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2625–2634.
DSSIM, 2022. https://github.com/keras-team/keras-contrib/blob/master/keras_contrib/losses/dssim.py.
Dufour, Nick, Gully, Andrew, 2019. Contributing data to deepfake detection research. https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html.
FaceNet, 2022. https://github.com/davidsandberg/facenet.
Faceswap, 2022. Faceswap: Deepfakes software for all. https://github.com/deepfakes/faceswap.
Faceswap-GAN, 2022. https://github.com/shaoanlu/faceswap-GAN.
FakeApp, 2022. FakeApp 2.2.0. https://www.malavida.com/en/soft/fakeapp/.
Farid, Hany, 2009. Image forgery detection. IEEE Signal Process. Mag. 26 (2), 16–25.
Fernandes, Steven, Raj, Sunny, Ewetz, Rickard, Pannu, Jodh Singh, Jha, Sumit Kumar, Ortiz, Eddy, Vintila, Iustina, Salter, Margaret, 2020. Detecting deepfake videos using attribution-based confidence metric. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 308–309.
Fernandes, Steven, Raj, Sunny, Ortiz, Eddy, Vintila, Iustina, Salter, Margaret, Urosevic, Gordana, Jha, Sumit, 2019. Predicting heart rate variations of deepfake videos using neural ODE. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. pp. 1721–1729.
Few-Shot Face Translation GAN, 2022. https://github.com/shaoanlu/fewshot-face-translation-GAN.
Fish, T., 2019. Deep fakes: AI-manipulated media will be 'weaponised' to trick military. https://www.express.co.uk/news/science/1109783/deep-fakes-ai-artificial-intelligence-photos-video-weaponised-china.
Floridi, Luciano, 2018. Artificial intelligence, deepfakes and a future of ectypes. Philos. Technol. 31 (3), 317–321.
Fried, Ohad, Tewari, Ayush, Zollhöfer, Michael, Finkelstein, Adam, Shechtman, Eli, Goldman, Dan B., Genova, Kyle, Jin, Zeyu, Theobalt, Christian, Agrawala, Maneesh, 2019. Text-based editing of talking-head video. ACM Trans. Graph. 38 (4), 1–14.
Galbally, Javier, Marcel, Sébastien, 2014. Face anti-spoofing based on general image quality assessment. In: The 22nd International Conference on Pattern Recognition. IEEE, pp. 1173–1178.
Gandhi, Apurva, Jain, Shomik, 2020. Adversarial perturbations fool deepfake detectors. In: IEEE International Joint Conference on Neural Networks, IJCNN. IEEE, pp. 1–8.
Giudice, Oliver, Guarnera, Luca, Battiato, Sebastiano, 2021. Fighting deepfakes by detecting GAN DCT anomalies. arXiv preprint arXiv:2101.09781.
Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, Bengio, Yoshua, 2014. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 27, 2672–2680.
Guan, Haiying, Kozak, Mark, Robertson, Eric, Lee, Yooyoung, Yates, Amy N., Delgado, Andrew, Zhou, Daniel, Kheyrkhah, Timothee, Smith, Jeff, Fiscus, Jonathan, 2019. MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation. In: IEEE Winter Applications of Computer Vision Workshops, WACVW. IEEE, pp. 63–72.
Guarnera, Luca, Giudice, Oliver, Battiato, Sebastiano, 2020a. Deepfake detection by analyzing convolutional traces. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 666–667.
Guarnera, Luca, Giudice, Oliver, Battiato, Sebastiano, 2020b. Fighting deepfake by exposing the convolutional traces on images. IEEE Access 8, 165085–165098.
Guarnera, Luca, Giudice, Oliver, Nastasi, Cristina, Battiato, Sebastiano, 2020c. Preliminary forensics analysis of deepfake images. In: AEIT International Annual Conference, AEIT. IEEE, pp. 1–6.
Güera, David, Delp, Edward J., 2018. Deepfake video detection using recurrent neural networks. In: 15th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS. IEEE, pp. 1–6.
Gulrajani, Ishaan, Ahmed, Faruk, Arjovsky, Martin, Dumoulin, Vincent, Courville, Aaron, 2017. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028.
Guo, Bin, Ding, Yasan, Yao, Lina, Liang, Yunji, Yu, Zhiwen, 2020. The future of false information detection on social media: New perspectives and trends. ACM Comput. Surv. 53 (4), 1–36.
Guo, Zhiqing, Hu, Lipin, Xia, Ming, Yang, Gaobo, 2021. Blind detection of glow-based facial forgery. Multimedia Tools Appl. 80 (5), 7687–7710.
Ha, Sungjoo, Kersner, Martin, Kim, Beomsu, Seo, Seokjun, Kim, Dongyoung, 2020. MarioNETte: Few-shot face reenactment preserving identity of unseen targets. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, (07), pp. 10893–10900.
Hasan, Haya R., Salah, Khaled, 2019. Combating deepfake videos using blockchain and smart contracts. IEEE Access 7, 41596–41606.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian, 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778.
He, Zhenliang, Zuo, Wangmeng, Kan, Meina, Shan, Shiguang, Chen, Xilin, 2019. AttGAN: Facial attribute editing by only changing what you want. IEEE Trans. Image Process. 28 (11), 5464–5478.
Hinton, Geoffrey E., Krizhevsky, Alex, Wang, Sida D., 2011. Transforming auto-encoders. In: International Conference on Artificial Neural Networks. Springer, pp. 44–51.
Hsu, Chih-Chung, Lee, Chia-Yen, Zhuang, Yi-Xiu, 2018. Learning to detect fake face images in the wild. In: 2018 International Symposium on Computer, Consumer and Control, IS3C. IEEE, pp. 388–391.
Hsu, Chih-Chung, Zhuang, Yi-Xiu, Lee, Chia-Yen, 2020. Deep fake image detection based on pairwise learning. Appl. Sci. 10 (1), 370.
Huang, Gao, Liu, Zhuang, Van Der Maaten, Laurens, Weinberger, Kilian Q., 2017. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4700–4708.
Huang, Gary B., Ramesh, Manu, Berg, Tamara, Learned-Miller, Erik, 2007. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. Technical Report 07-49, University of Massachusetts, Amherst.
Hussain, Shehzeen, Neekhara, Paarth, Jere, Malhar, Koushanfar, Farinaz, McAuley, Julian, 2021. Adversarial deepfakes: Evaluating vulnerability of deepfake detectors to adversarial examples. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 3348–3357.
Hwang, T., 2020. Deepfakes: A Grounded Threat Assessment. Technical report, Centre for Security and Emerging Technologies, Georgetown University.
IPFS, 2022. IPFS powers the Distributed Web. https://ipfs.io/.
Iuliani, Massimo, Shullani, Dasara, Fontani, Marco, Meucci, Saverio, Piva, Alessandro, 2018. A video forensic framework for the unsupervised analysis of MP4-like file container. IEEE Trans. Inf. Forensics Secur. 14 (3), 635–645.
Jafar, Mousa Tayseer, Ababneh, Mohammad, Al-Zoube, Mohammad, Elhassan, Ammar, 2020. Forensics and analysis of deepfake videos. In: The 11th International Conference on Information and Communication Systems, ICICS. IEEE, pp. 053–058.
Jha, Susmit, Raj, Sunny, Fernandes, Steven, Jha, Sumit K., Jha, Somesh, Jalaian, Brian, Verma, Gunjan, Swami, Ananthram, 2019. Attribution-based confidence metric for deep neural networks. Adv. Neural Inf. Process. Syst. 32, 11826–11837.
Jiang, Liming, Li, Ren, Wu, Wayne, Qian, Chen, Loy, Chen Change, 2020. DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2889–2898.
Kaliyar, Rohit Kumar, Goswami, Anurag, Narang, Pratik, 2021. DeepFakE: Improving fake news detection using tensor decomposition-based deep neural network. J. Supercomput. 77 (2), 1015–1037.
Karras, Tero, Aila, Timo, Laine, Samuli, Lehtinen, Jaakko, 2017. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
Karras, Tero, Laine, Samuli, Aila, Timo, 2019. A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4401–4410.
Karras, Tero, Laine, Samuli, Aittala, Miika, Hellsten, Janne, Lehtinen, Jaakko, Aila, Timo, 2020. Analyzing and improving the image quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8110–8119.
Keras-VGGFace, 2022. Keras-VGGFace: VGGFace implementation with Keras framework. https://github.com/rcmalli/keras-vggface.
Kingma, Diederik P., Dhariwal, Prafulla, 2018. Glow: Generative flow with invertible 1 × 1 convolutions. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. pp. 10236–10245.
Kingma, Diederik P., Welling, Max, 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Koopman, Marissa, Rodriguez, Andrea Macarulla, Geradts, Zeno, 2018. Detection of deepfake video manipulation. In: The 20th Irish Machine Vision and Image Processing Conference, IMVIP. pp. 133–136.
Korshunov, Pavel, Marcel, Sébastien, 2018a. Deepfakes: A new threat to face recognition? Assessment and detection. arXiv preprint arXiv:1812.08685.
Korshunov, Pavel, Marcel, Sébastien, 2018b. Speaker inconsistency detection in tampered video. In: The 26th European Signal Processing Conference, EUSIPCO. IEEE, pp. 2375–2379.
Korshunov, Pavel, Marcel, Sébastien, 2019. Vulnerability assessment and detection of deepfake videos. In: 2019 International Conference on Biometrics, ICB. IEEE, pp. 1–6.
Korshunova, Iryna, Shi, Wenzhe, Dambre, Joni, Theis, Lucas, 2017. Fast face-swap using convolutional neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3677–3685.
Laptev, Ivan, Marszalek, Marcin, Schmid, Cordelia, Rozenfeld, Benjamin, 2008. Learning realistic human actions from movies. In: IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 1–8.
Lattas, Alexandros, Moschoglou, Stylianos, Gecer, Baris, Ploumpis, Stylianos, Triantafyllou, Vasileios, Ghosh, Abhijeet, Zafeiriou, Stefanos, 2020. AvatarMe: Realistically renderable 3D facial reconstruction ''in-the-wild''. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 760–769.
Li, Lingzhi, Bao, Jianmin, Yang, Hao, Chen, Dong, Wen, Fang, 2019a. FaceShifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457.
Li, Lingzhi, Bao, Jianmin, Zhang, Ting, Yang, Hao, Chen, Dong, Wen, Fang, Guo, Baining, 2020a. Face X-Ray for more general face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5001–5010.
Li, Yuezun, Chang, Ming-Ching, Lyu, Siwei, 2018. In ictu oculi: Exposing AI created fake videos by detecting eye blinking. In: 2018 IEEE International Workshop on Information Forensics and Security, WIFS. IEEE, pp. 1–7.
Li, Chang-Tsun, Li, Yue, 2011. Color-decoupled photo response non-uniformity for digital image forensics. IEEE Trans. Circuits Syst. Video Technol. 22 (2), 260–271.
Li, Yuezun, Lyu, Siwei, 2018. Exposing deepfake videos by detecting face warping artifacts. arXiv preprint arXiv:1811.00656.
Li, Yuezun, Yang, Xin, Sun, Pu, Qi, Honggang, Lyu, Siwei, 2020b. Celeb-DF: A large-scale challenging dataset for deepfake forensics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3207–3216.
Li, Ke, Zhang, Tianhao, Malik, Jitendra, 2019b. Diverse image synthesis from semantic layouts via conditional IMLE. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4220–4229.
de Lima, Oscar, Franklin, Sean, Basu, Shreshtha, Karwoski, Blake, George, Annet, 2020. Deepfake detection using spatiotemporal convolutional networks. arXiv preprint arXiv:2006.14749.
Lin, Xufeng, Li, Chang-Tsun, 2016. Large-scale image clustering based on camera fingerprints. IEEE Trans. Inf. Forensics Secur. 12 (4), 793–808.
Lin, Jiacheng, Li, Yang, Yang, Guanci, 2021. FPGAN: Face de-identification method with generative adversarial networks for social robots. Neural Netw. 133, 132–147.
Liu, Ming-Yu, Huang, Xun, Mallya, Arun, Karras, Tero, Aila, Timo, Lehtinen, Jaakko, Kautz, Jan, 2019. Few-shot unsupervised image-to-image translation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10551–10560.
Liu, Ming-Yu, Huang, Xun, Yu, Jiahui, Wang, Ting-Chun, Mallya, Arun, 2021. Generative adversarial networks for image and video synthesis: Algorithms and applications. Proc. IEEE 109 (5), 839–862.
Liu, Ziwei, Luo, Ping, Wang, Xiaogang, Tang, Xiaoou, 2015. Deep learning face attributes in the wild. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3730–3738.
Lyu, Siwei, 2018. Detecting 'deepfake' videos in the blink of an eye. http://theconversation.com/detecting-deepfake-videos-in-the-blink-of-an-eye-101072.
Lyu, Siwei, 2020. Deepfake detection: Current challenges and next steps. In: IEEE International Conference on Multimedia & Expo Workshops, ICMEW. IEEE, pp. 1–6.
Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, Goodfellow, Ian, Frey, Brendan, 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
Malolan, Badhrinarayan, Parekh, Ankit, Kazi, Faruk, 2020. Explainable deep-fake detection using visual interpretability methods. In: The 3rd International Conference on Information and Computer Technologies, ICICT. IEEE, pp. 289–293.
Mao, Xudong, Li, Qing, Xie, Haoran, Lau, Raymond Y.K., Wang, Zhen, Paul Smolley, Stephen, 2017. Least squares generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2794–2802.
Maras, Marie-Helen, Alexandrou, Alex, 2019. Determining authenticity of video evidence in the age of artificial intelligence and in the wake of deepfake videos. Int. J. Evidence Proof 23 (3), 255–262.
Marr, B., 2019. The best (and scariest) examples of AI-enabled deepfakes. https://www.forbes.com/sites/bernardmarr/2019/07/22/the-best-and-scariest-examples-of-ai-enabled-deepfakes/.
Marra, Francesco, Gragnaniello, Diego, Cozzolino, Davide, Verdoliva, Luisa, 2018. Detection of GAN-generated fake images over social networks. In: 2018 IEEE Conference on Multimedia Information Processing and Retrieval, MIPR. IEEE, pp. 384–389.
Marra, Francesco, Saltori, Cristiano, Boato, Giulia, Verdoliva, Luisa, 2019. Incremental learning for the detection and classification of GAN-generated images. In: 2019 IEEE International Workshop on Information Forensics and Security, WIFS. IEEE, pp. 1–6.
Matern, Falko, Riess, Christian, Stamminger, Marc, 2019. Exploiting visual artifacts to expose deepfakes and face manipulations. In: IEEE Winter Applications of Computer Vision Workshops, WACVW. IEEE, pp. 83–92.
Maurer, Ueli M., 2000. Authentication theory and hypothesis testing. IEEE Trans. Inform. Theory 46 (4), 1350–1356.
Mirsky, Yisroel, Lee, Wenke, 2021. The creation and detection of deepfakes: A survey. ACM Comput. Surv. 54 (1), 1–41.
Mittal, Trisha, Bhattacharya, Uttaran, Chandra, Rohan, Bera, Aniket, Manocha, Dinesh, 2020. Emotions don't lie: A deepfake detection method using audio-visual affective cues. arXiv preprint arXiv:2003.06711.
Miyato, Takeru, Kataoka, Toshiki, Koyama, Masanori, Yoshida, Yuichi, 2018. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
Mo, Huaxiao, Chen, Bolin, Luo, Weiqi, 2018. Fake faces identification via convolutional neural network. In: Proceedings of the 6th ACM Workshop on Information Hiding and Multimedia Security. pp. 43–47.
Moon, Todd K., 1996. The expectation-maximization algorithm. IEEE Signal Process. Mag. 13 (6), 47–60.
Nguyen, Huy H., Yamagishi, Junichi, Echizen, Isao, 2019. Capsule-forensics: Using capsule networks to detect forged images and videos. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP. IEEE, pp. 2307–2311.
Nirkin, Yuval, Keller, Yosi, Hassner, Tal, 2019. FSGAN: Subject agnostic face swapping and reenactment. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7184–7193.
Olszewski, Kyle, Tulyakov, Sergey, Woodford, Oliver, Li, Hao, Luo, Linjie, 2019. Transformable bottleneck networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7648–7657.
Park, Taesung, Liu, Ming-Yu, Wang, Ting-Chun, Zhu, Jun-Yan, 2019. Semantic image synthesis with spatially-adaptive normalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2337–2346.
Parkhi, Omkar M., Vedaldi, Andrea, Zisserman, Andrew, 2015. Deep face recognition. In: Proceedings of the British Machine Vision Conference, BMVC. pp. 41.1–41.12.
Phan, Quoc-Tin, Boato, Giulia, De Natale, Francesco G.B., 2018. Accurate and scalable image clustering based on sparse representation of camera fingerprint. IEEE Trans. Inf. Forensics Secur. 14 (7), 1902–1916.
Punnappurath, Abhijith, Brown, Michael S., 2019. Learning raw image reconstruction-aware deep image compressors. IEEE Trans. Pattern Anal. Mach. Intell. 42 (4), 1013–1019.
Qian, Yinlong, Dong, Jing, Wang, Wei, Tan, Tieniu, 2015. Deep learning for steganalysis via convolutional neural networks. In: Media Watermarking, Security, and Forensics, vol. 9409, p. 94090J.
Radford, Alec, Metz, Luke, Chintala, Soumith, 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Rahmouni, Nicolas, Nozick, Vincent, Yamagishi, Junichi, Echizen, Isao, 2017. Distinguishing computer graphics from natural images using convolution neural networks. In: IEEE Workshop on Information Forensics and Security, WIFS. IEEE, pp. 1–6.
Read, M., 2019. Can you spot a deepfake? Does it matter? http://nymag.com/intelligencer/2019/06/how-do-you-spot-a-deepfake-it-might-not-matter.html.
Rosenfeld, Kurt, Sencar, Husrev Taha, 2009. A study of the robustness of PRNU-based camera identification. In: Media Forensics and Security, vol. 7254, p. 72540M.
Rössler, Andreas, Cozzolino, Davide, Verdoliva, Luisa, Riess, Christian, Thies, Justus, Nießner, Matthias, 2018. FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179.
Rossler, Andreas, Cozzolino, Davide, Verdoliva, Luisa, Riess, Christian, Thies, Justus, Nießner, Matthias, 2019. FaceForensics++: Learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1–11.
Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, et al., 2015. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115 (3), 211–252.
Sabir, Ekraam, Cheng, Jiaxin, Jaiswal, Ayush, AbdAlmageed, Wael, Masi, Iacopo, Natarajan, Prem, 2019. Recurrent convolutional strategies for face manipulation detection in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, vol. 3, (1), pp. 80–87.
Sabour, Sara, Frosst, Nicholas, Hinton, Geoffrey E., 2017. Dynamic routing between capsules. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. pp. 3859–3869.
Samuel, S., 2019. A guy made a deepfake app to turn photos of women into nudes. It didn't go well. https://www.vox.com/2019/6/27/18761639/ai-deepfake-deepnude-app-nude-women-porn.
Scherhag, Ulrich, Debiasi, Luca, Rathgeb, Christian, Busch, Christoph, Uhl, Andreas, 2019. Detection of face morphing attacks based on PRNU analysis. IEEE Trans. Biom. Behav. Identity Sci. 1 (4), 302–317.
Schroepfer, M., 2019. Creating a data set and a challenge for deepfakes. https://ai.facebook.com/blog/deepfake-detection-challenge.
Schroff, Florian, Kalenichenko, Dmitry, Philbin, James, 2015. FaceNet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 815–823.
Simonyan, Karen, Zisserman, Andrew, 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Su, Lichao, Li, Cuihua, Lai, Yuecong, Yang, Jianmei, 2017. A fast forgery detection algorithm based on exponential-Fourier moments for video region duplication. IEEE Trans. Multimed. 20 (4), 825–840.
Suwajanakorn, Supasorn, Seitz, Steven M., Kemelmacher-Shlizerman, Ira, 2017. Synthesizing Obama: Learning lip sync from audio. ACM Trans. Graph. 36 (4), 1–13.
Tewari, Ayush, Elgharib, Mohamed, Bharaj, Gaurav, Bernard, Florian, Seidel, Hans-Peter, Pérez, Patrick, Zollhofer, Michael, Theobalt, Christian, 2020. StyleRig: Rigging StyleGAN for 3D control over portrait images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6142–6151.
Tewari, Ayush, Zollhoefer, Michael, Bernard, Florian, Garrido, Pablo, Kim, Hyeongwoo, Perez, Patrick, Theobalt, Christian, 2018. High-fidelity monocular face reconstruction based on an unsupervised model-based face autoencoder. IEEE Trans. Pattern Anal. Mach. Intell. 42 (2), 357–370.
The Guardian, 2019. Chinese deepfake app Zao sparks privacy row after going viral. https://www.theguardian.com/technology/2019/sep/02/chinese-face-swap-app-zao-triggers-privacy-fears-viral.
Thies, Justus, Elgharib, Mohamed, Tewari, Ayush, Theobalt, Christian, Nießner, Matthias, 2020. Neural voice puppetry: Audio-driven facial reenactment. In: European Conference on Computer Vision. Springer, pp. 716–731.
Thies, Justus, Zollhöfer, Michael, Nießner, Matthias, 2019. Deferred neural rendering: Image synthesis using neural textures. ACM Trans. Graph. 38 (4), 1–12.
Thies, Justus, Zollhofer, Michael, Stamminger, Marc, Theobalt, Christian, Nießner, Matthias, 2016. Face2Face: Real-time face capture and reenactment of RGB videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2387–2395.
Todisco, Massimiliano, Wang, Xin, Vestman, Ville, Sahidullah, Md, Delgado, Héctor, Nautsch, Andreas, Yamagishi, Junichi, Evans, Nicholas, Kinnunen, Tomi, Lee, Kong Aik, 2019. ASVspoof 2019: Future horizons in spoofed and fake audio detection. arXiv preprint arXiv:1904.05441.
Tolosana, Ruben, Vera-Rodriguez, Ruben, Fierrez, Julian, Morales, Aythami, Ortega-Garcia, Javier, 2020. Deepfakes and beyond: A survey of face manipulation and fake detection. Inf. Fusion 64, 131–148.
Trinh, Loc, Tsang, Michael, Rambhatla, Sirisha, Liu, Yan, 2021. Interpretable and trustworthy deepfake detection via dynamic prototypes. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1973–1983.
Tucker, Patrick, 2019. The newest AI-enabled weapon: 'deep-faking' photos of the earth. https://www.defenseone.com/technology/2019/03/next-phase-ai-deep-faking-whole-world-and-china-ahead/155944/.
Turek, M., 2019. Media forensics (MediFor). https://www.darpa.mil/program/media-forensics.
Verdoliva, Luisa, 2020. Media forensics and deepfakes: An overview. IEEE J. Sel. Top. Sign. Proces. 14 (5), 910–932.
VidTIMIT database, 2022. http://conradsanderson.id.au/vidtimit/.
Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, Manzagol, Pierre-Antoine, 2008. Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning. pp. 1096–1103.
Wang, Xin, Thome, Nicolas, Cord, Matthieu, 2017. Gaze latent support vector machine for image classification improved by weakly supervised region selection. Pattern Recognit. 72, 59–71.
Wang, Sheng-Yu, Wang, Oliver, Zhang, Richard, Owens, Andrew, Efros, Alexei A., 2020. CNN-generated images are surprisingly easy to spot... for now. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8695–8704.
Xuan, Xinsheng, Peng, Bo, Wang, Wei, Dong, Jing, 2019. On the generalization of GAN image forensics. In: Chinese Conference on Biometric Recognition. Springer, pp. 134–141.
Yang, Chaofei, Ding, Leah, Chen, Yiran, Li, Hai, 2021. Defending against GAN-based deepfake attacks via transformation-aware adversarial faces. In: IEEE International Joint Conference on Neural Networks, IJCNN. IEEE, pp. 1–8.
Yang, Xin, Li, Yuezun, Lyu, Siwei, 2019. Exposing deep fakes using inconsistent head poses. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP. IEEE, pp. 8261–8265.
Yang, Pengpeng, Ni, Rongrong, Zhao, Yao, 2016. Recapture image forensics based on Laplacian convolutional neural networks. In: International Workshop on Digital Watermarking. Springer, pp. 119–128.
Yeh, Chin-Yuan, Chen, Hsi-Wen, Tsai, Shang-Lun, Wang, Sheng-De, 2020. Disrupting image-translation-based deepfake algorithms with adversarial attacks. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops. pp. 53–62.
Younus, Mohammed Akram, Hasan, Taha Mohammed, 2020. Effective and fast deepfake detection method based on Haar wavelet transform. In: International Conference on Computer Science and Software Engineering, CSASE. IEEE, pp. 186–190.
Zakharov, Egor, Shysheya, Aliaksandra, Burkov, Egor, Lempitsky, Victor, 2019. Few-shot adversarial learning of realistic neural talking head models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9459–9468.
Zhang, Han, Goodfellow, Ian, Metaxas, Dimitris, Odena, Augustus, 2019. Self-attention generative adversarial networks. In: International Conference on Machine Learning. PMLR, pp. 7354–7363.
Zhang, Ying, Zheng, Lilei, Thing, Vrizlynn L.L., 2017. Automated face swapping and its detection. In: The 2nd International Conference on Signal and Image Processing, ICSIP. IEEE, pp. 15–19.
Zhao, Tianchen, Xu, Xiang, Xu, Mingze, Ding, Hui, Xiong, Yuanjun, Xia, Wei, 2021. Learning self-consistency for deepfake detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15023–15033.
Zheng, Lilei, Duffner, Stefan, Idrissi, Khalid, Garcia, Christophe, Baskurt, Atilla, 2016. Siamese multi-layer perceptrons for dimensionality reduction and face identification. Multimedia Tools Appl. 75 (9), 5055–5073.
Zhou, Peng, Han, Xintong, Morariu, Vlad I., Davis, Larry S., 2017. Two-stream neural networks for tampered face detection. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW. IEEE, pp. 1831–1839.
Zhou, Xinyi, Zafarani, Reza, 2020. A survey of fake news: Fundamental theories, detection methods, and opportunities. ACM Comput. Surv. 53 (5), 1–40.
Zhu, Jun-Yan, Park, Taesung, Isola, Phillip, Efros, Alexei A., 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2223–2232.
Zubiaga, Arkaitz, Aker, Ahmet, Bontcheva, Kalina, Liakata, Maria, Procter, Rob, 2018. Detection and resolution of rumours in social media: A survey. ACM Comput. Surv. 51 (2), 1–36.