Israa H. Ali
Software Department, College of Information Technology, University of Babylon, Iraq
israa_hadi@[Link]
Received: 21 May 2024 | Revised: 1 June 2024 | Accepted: 3 June 2024
Licensed under a CC-BY 4.0 license | Copyright (c) by the authors | DOI: [Link]
ABSTRACT
Today, detecting fake news has become challenging, as anyone can interact by freely sending or receiving electronic information. Deep learning approaches to detect multimodal fake news have achieved great success. However, these methods often fuse information from different modality sources with simple operations, such as concatenation and element-wise product, without considering how each modality affects the other, resulting in low accuracy. This study presents a focused survey on the use of deep learning approaches to detect multimodal visual and textual fake news on various social networks from 2019 to 2024. Several relevant factors are discussed, including a) the detection stage, which involves deep learning algorithms, b) methods for analyzing various data types, and c) choosing the best fusion mechanism to combine multiple data sources. This study delves into the existing constraints of previous studies to provide future tips for addressing open challenges and problems.
I. INTRODUCTION
Over the past few decades, fake news has become ubiquitous to the point of deceiving the public. When this kind of information becomes available, it causes social divisions and suspicions in the ruling environment and among individuals [1-3]. When data about a specific event (correct or incorrect) are disseminated, they change people's beliefs, typically emphasizing certain prejudices. Furthermore, deceptive or manipulative news seeks to feed widespread ignorance and greed to benefit individuals or groups at the expense of society [4]. Recently, many social networks have become the first choice for transmitting knowledge and exchanging information and events, providing platforms for sharing opinions and beliefs with others around the world [5-6]. Several studies have focused on fake news detection. As a result, specific components have been developed, using some classic datasets, to provide insight into their issue of interest [7]. Some distinctive examples of fake news are the "Zinoviev Letter" [8], the fake news on the 2016 elections in the United States [9-10], and the untrue environmental report on the spread of fires in the Amazon rainforest in 2018 [11].

Fig. 1. Google fake news trends [10].

Social networks have become an ideal setting for the spread of rumors, threatening network order, people's health, and social stability [12-13]. Social networks and live streaming platforms have become an essential part of daily life. Several dictionaries have defined the term fake news [14], which can be defined more broadly based on its authenticity or intent [15]. One possible explanation for the widespread transmission of fake news is a lack of basic knowledge and skills within the population. The public is not informed of the legitimacy of the information sources and the veracity of the news it reads. Another factor is the lack of automatic fact-checking procedures. Although a few websites have made significant efforts to detect fake news, most of them rely on time-consuming manual methods. It is very difficult to prevent fake news, since the extensive use of social networks allows the fast propagation of disinformation [7, 16].
[Link] Abduljaleel & Ali: Deep Learning and Fusion Mechanism-based Multimodal Fake News Detection …
Engineering, Technology & Applied Science Research Vol. 14, No. 4, 2024, 15665-15675 15666
Fake news detection is an ongoing study subject that can be interpreted from several angles. It aims to mitigate the negative effects of such news by creating a system that recognizes it using techniques such as Machine Learning (ML), language proficiency, optimization algorithms, Deep Learning (DL), and others [5, 7]. However, since ML-based systems have several constraints, involving generating a large training dataset and selecting appropriate features to best capture the deception, DL algorithms have been applied to detect fake news. In particular, attention mechanisms have emerged as one of the most potent strategies in Natural Language Processing (NLP). They are primarily used alongside Recurrent Neural Networks (RNNs) to anticipate the most significant information in an input sequence, either textual or visual [17]. Fake news providers frequently employ written content and visuals or distort facts to appeal to readers' psychology and entice and mislead them, allowing for quick diffusion. In general, themes on social hotspots or disputes include detailed textual descriptions of their emotional expression and visual influence on pictures [9]. Multimodal knowledge is more difficult to handle than single-modal knowledge, since it requires information fusion procedures. Data fusion, decision-making, features, and other approaches are examples of information fusion. These approaches contain two steps: combining data, information, and features from multiple data sources, and then processing them. As a result, they can provide a more accurate and reliable data representation [18].

Table I portrays the most important abbreviations used in this paper. This study explored recent suggestive literature on fake news detection, focusing in particular on detection systems built on specific characteristics of multimodal fake news. The papers were obtained by searching for the keywords "fake news" through the search engines observed in Table II. Several review studies exist in this domain, as evidenced in Table III. The main contributions of this review study can be summarized as:
• Provides knowledge about the specific fake news attributes and their corresponding terms.
• Focuses on detecting multimodal fake news and explaining these systems' methods to compare them in all stages, from the perspective of description to detection.
• Focuses briefly on the DL methods deployed in fake news detection models, such as attention mechanisms, CNN, ResNet, etc.

TABLE I. ABBREVIATIONS USED IN THIS PAPER

Abbreviation  Description
CNN           Convolutional Neural Network
ResNet        Residual Neural Network
RNN           Recurrent Neural Network
ViT           Vision Transformers
BERT          Bidirectional Encoder Representation of Transformer
LSTM          Long Short-Term Memory
NLP           Natural Language Processing
GPT           Generative Pre-trained Transformers
POS           Parts of Speech Tagging
TF-IDF        Term Frequency Inverse Document Frequency
BoW           Bag of Words
GRU           Gated Recurrent Unit
ALBERT        A Lite Bidirectional Encoder Representation of Transformer
DeBERTa       Decoding-enhanced Bidirectional Encoder Representation of Transformer with Disentangled Attention
RoBERTa       Robustly optimized Bidirectional Encoder Representation of Transformer Pretraining approach
VGG           Visual Geometry Group
MLP           Multi-Layer Perceptron
DenseNet      Densely Connected Convolutional Networks
GloVe         Global Vectors

TABLE II. SEARCH ENGINES

Search engine        Number of results  Selected references  Type
ACM Digital Library  10,375             [20, 21]             Journals
                                        [22]                 Conference
Science Direct       1,043              [23, 24]             Journals
Google Scholar       7,854              [9, 25-26]           Journals
ResearchGate         216                [16, 27]             Journals
Scopus               361                [18, 28-30]          Journals
                                        [31-33]              Conferences
IEEE                 75                 [34-36]              Journals
MDPI                 79                 [11, 37]             Journals
Springer             102                [38]                 Journal

TABLE III. EXISTING REVIEWS ON DETECTING MULTIMODAL FAKE NEWS

Reference   Datasets  Word Embedding  Fusion Mechanism  CNN  RNN  ViT  Attention  BERT  LSTM
[1]         √         √               √                 √    √    ×    √          √     √
[2]         √         √               ×                 √    √    ×    √          √     √
[3]         √         √               √                 √    √    ×    √          √     √
[39]        √         √               ×                 √    √    ×    √          √     √
[14]        √         √               ×                 √    √    ×    √          √     √
This study  √         √               √                 √    √    √    √          √     √

II. NATURAL LANGUAGE PROCESSING

NLP systems include morphological traits, lexical classes, syntactic categories, semantic connections, etc. In principle, statistical NLP models can be implemented to determine the relevance of these aspects and so gain a greater understanding of the model. In contrast, it is more difficult to explain what occurs in a neural network model. Much of the analytical work therefore seeks to understand how language ideas, often used as features in NLP systems, are captured in neural networks. NLP techniques employ attention mechanisms to increase text classification accuracy. The attention model aims to improve efficiency by predicting the result based on only a few words of the input series rather than the complete phrase [19]. Furthermore, the development of pre-trained language models (e.g., BERT, RoBERTa, and GPT) and their utilization in NLP has opened up new ways to categorize fake news [18].
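As a minimal illustration of the soft-attention idea mentioned above (scoring each input word and forming a weighted average of the word vectors instead of using the complete phrase), the following self-contained sketch uses toy relevance scores and 2-D word vectors; the function name and all values are hypothetical, not taken from any cited model:

```python
import math

def soft_attention(scores, values):
    """Soft (deterministic) attention: softmax the relevance scores,
    then return the weighted average of the value vectors."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # attention weights, sum to 1
    dim = len(values[0])
    # context vector = weighted sum of the value vectors
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Toy example: three word vectors; the second word has the highest
# relevance score, so it dominates the resulting context vector.
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
relevance = [0.1, 2.0, 0.3]
context = soft_attention(relevance, word_vectors)
```

Because the weights are normalized with a softmax, raising one word's score shifts the context vector toward that word's embedding, which is exactly how an attention layer lets a classifier concentrate on a few informative tokens.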
Within NLP, the word and token features are similar. When applying ML algorithms to extract text, it is critical to identify the best features. The goal of identifying these traits is to develop effective indications that can be generalized for text classification. Some of them are mentioned below [40]:
• n-grams: Used to record the dependencies between all words that appear sequentially in a sentence structure. However, n-grams do not maintain the syntactical or semantic relationships of the words.
• Parts Of Speech (POS) tagging: POS tagging distinguishes the grammatical meaning of words in a sentence by applying particular tags, such as noun, pronoun, verb, adjective, adverb, conjunction, etc.
• TF-IDF: Its value increases linearly with the number of times a word appears in the document, but is offset by the term's frequency in the corpus. Although this vectorization is effective, the semantic meaning of the words is lost in the attempt to convert them into digits.
• BoW: This approach treats a single news story as a document and calculates the frequency count of each word to provide a numerical representation of the data. Along with data loss, this strategy has other drawbacks: the relative position of the words is ignored and contextual information is removed. This loss can sometimes be significant when weighed against the benefits of its processing simplicity.
• Word2Vec: Provides a set of model designs and optimizations to extract word embeddings from large datasets. Word embeddings learned through Word2Vec are more effective in capturing word semantics and leveraging word relatedness.

III. DEEP LEARNING (DL)

DL networks are given sensory information, such as texts, photographs, movies, or sounds, to simulate the human learning process. These networks outperform other cutting-edge approaches in several tasks, and as a result, the field has expanded enormously [41]. CNN, RNN, LSTM, and GRU are some of the conventional DL models used to identify fake news. CNN-based techniques can extract relevant information from tiny areas but are incapable of dealing with larger structural links. Time-series techniques examine the sequential spread of misinformation using temporal structural elements while ignoring the broader structural characteristics of fake news. More importantly, these approaches cannot recognize many modes concurrently. For example, existing designs limit the ability to expand the detection to other modalities. Current fusion algorithms are not particularly sophisticated and cannot effectively integrate multi-modal advantages while avoiding noise introduced by other sources [34]. Transfer learning has proven to be indispensable in DL training, as it transfers previously learned context knowledge to new designs that solve different issues [19].

A. Attention Mechanisms

Attention mechanisms try to deal with input in the same way as the human brain/vision would. Human eyesight does not analyze the full image at once; instead, it concentrates on individual areas. This allows the concentrated areas of the human visual space to be experienced in high resolution, while the surroundings appear in low resolution. Instead of analyzing the entire vision space, the brain can examine and narrow down the most important elements in a precise and efficient manner. This aspect of human eyesight led researchers to design the attention mechanism [42]. Attention mechanisms work by assigning varying weights to various types of information. Thus, assigning more weight to important information draws the focus of the DL model. Attention mechanism methods can be classified based on four criteria [16]:
• Softness of attention: Soft (deterministic) attention calculates the average of each input weight item to generate the final context vector. The context vector is a high-dimensional vector that represents the components or sequence of the input factors, and the attention mechanism generally seeks to add more contextual information to the final context vector. Hard attention (stochastic attention) computes the final context vector by choosing pieces arbitrarily from the sample set, which decreases the computation time. In addition, global and local attention are often deployed in computer vision tasks. Global attention is like soft attention in that it evaluates all input items; however, it improves on soft attention by using the output of the current time step rather than the previous one. Local attention combines soft and hard attention: it evaluates a subset of input components at a time, overcoming the drawback of hard attention (i.e., being non-differentiable) while remaining computationally efficient.
• Input requirements: Attention mechanisms can be classified according to their input requirements as item-wise or location-wise. Item-wise attention necessitates inputs that are directly known to the model or generated through pre-processing. However, location-wise attention is not implied, because the model must deal with difficult-to-distinguish input objects.
• Number of inputs: Attention models can work with single and multiple inputs, and the overall processing strategy for the inputs varies between the created models. Most contemporary attention networks utilize a single input and process it in two separate sequences (i.e., a distinctive model). Certain connections exist within sources when recognizing multimodal systems (including images and text). Rather than simply splicing source features, the co-attention method is followed to simulate intense interactions between source features via sharing information; it generates an attention-pooled feature for one modality (e.g., text) based on another one (e.g., image). The similarity of data pairs between sources is utilized to link them. A self-attention network computes attention solely based on the model input, reducing the reliance on external data. This improves the model's performance on images with complicated backgrounds by focusing more on certain locations. The hierarchy attention mechanism computes weights based on the initial input and several of its levels; this mechanism is often referred to as fine-grained attention in image classification.
• Output forms: Attention structures usually utilize a single output form, which processes one characteristic at a time and calculates weight ratings. There are two further systems: multidimensional and multi-head attention. Multi-head attention evaluates inputs linearly in several groups before combining them to compute the final attention weights. This is especially advantageous when deploying the attention mechanism in conjunction with CNN approaches. Multidimensional attention, which is mostly employed for NLP, calculates weights utilizing a matrix representation of the characteristics rather than vectors.

The different types of attention mechanisms for computer vision can be classified into the following categories [36]:
• Channel attention: This category assumes that in deep CNNs, distinct channels in various feature maps frequently represent various objects. As a result, channel attention is responsible for automatically calibrating the weight of each channel.
• Spatial attention: This category is similar to channel attention. In this case, the attention mechanism is responsible for flexibly calibrating the weight of each part of the image. This system functions as an adaptive spatial area selection process, selecting where to focus.
• Temporal attention: This category considers data to have a time component. Thus, in computer vision tasks, this form of attention mechanism is commonly used for video analysis. This system operates as a dynamic temporal selection process, selecting when to pay attention.
• Branch attention: This category covers multi-branched DL architectures. Branch attention adapts the weight of each branch. This mechanism functions as a dynamic branch selection process, deciding which branches to pay attention to.
• Channel and spatial attention: This approach functions as a dynamic spatial area and object choice procedure, deciding what and where to focus attention.
• Spatial and temporal attention: This system functions as a dynamic geographic area and time-frame process to select where and when to focus.

B. Transformers

Transformers primarily deploy the self-attention mechanism to extract fundamental characteristics and have enormous promise for widespread use in AI [43]. Compared to RNNs, transformers can attend to full sequences and thus learn long-term connections. Transformers parse text in parallel, implementing a powerful attention mechanism that produces complex and meaningful word descriptions. This approach looks at the relationships between textual phrases or entities. Many competing models of neural pattern transmission contain an encoder-decoder component. The encoder turns an endless flow of symbols from the input into a continuous output. The decoder then generates an output series involving one symbol at a time, using the encoder's continuous form [44]. BERT is an encoder layer with a transformer design. Instead of a static periodic function in the transformer, BERT learns the embedding location. This increases the learning effort in the relevant step, but additional efforts could be almost completely avoided given the number of trainable parameters in the encoder [16]. In certain recent related tasks, BERT-based models outperform RNN and CNN networks. The Swin transformer broadens the usefulness of the transformer, transferring its outstanding performance to visual surroundings; it addresses the shortage of CNNs for global information feature extraction and, with its unique window mechanism, substantially reduces the computational cost of self-attention and solves the challenge of secured token scale, which has become the general core of computer vision research [37]. ALBERT is a more portable form of BERT, designed to address the drawbacks of the huge number of parameters and the lengthy training time [44]. DeBERTa is an improved BERT with disentangled attention and has two new features. First, the model suggests a disentangled attention mechanism: in DeBERTa, each token in the input is represented by two separate vectors that encode its word embedding and position, and attention weights among words are acquired utilizing disentangled matrices in this paired form. Second, an Enhanced Mask Decoder (EMD) is employed to forecast the masked tokens during the pre-training phase. Although BERT depends on relative positions, EMD enables DeBERTa to make more accurate predictions, since the syntactic functions of words are greatly influenced by their current location within the sentence. In an equivalent spirit, the BERTweet approach shares a similar architecture to BERT and was trained adopting the RoBERTa pre-training process [45]. Vision transformers break the image into 2D patches and feed them into the framework. However, vision transformers face several hurdles, including computational cost, dimensions, scalability to huge datasets, understanding, resilience to adversarial attacks, and generalization accuracy [46].

IV. FAKE NEWS DETECTION

Fake news detection models can be categorized according to the following strategies: strategies based on knowledge, features, and modality [47]. From a knowledge viewpoint, an impartial fact-checker reviews news stories and assigns an actual value to statements. The three kinds of fact-checking are: expert-oriented, assessing the accuracy of information by relying on domain-matter experts who analyze data and documents and draw conclusions; crowd-sourcing-oriented, allowing users to discuss and comment on the accuracy of specific news resources; and computational-oriented, an intelligent system that classifies a news item as having true or false matter.

AI-based algorithms to detect fake news rely on a variety of important criteria, including content-based, network-based, and user-based attributes. However, combining all these variables may not increase the classifier's performance. Many studies relied only on content features or content-based characteristics (textual and visual) in conjunction with additional characteristics to detect fake news. Existing fake news identification research is divided into two groups: single-modal and multimodal.
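Returning to the vision-transformer input pipeline mentioned in Section III.B, splitting an image into non-overlapping 2D patches that become token vectors can be sketched as follows; the helper name and the toy 4x4 single-channel image are hypothetical, and real ViTs additionally project each flattened patch with a learned linear layer:

```python
def split_into_patches(image, patch):
    """Split an H x W image (list of rows) into non-overlapping
    patch x patch blocks, each flattened to a 1D token, row-major order."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            tokens.append(flat)
    return tokens

# A 4x4 "image" split into 2x2 patches yields 4 tokens of length 4.
img = [[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]]
tokens = split_into_patches(img, 2)
```

Each flattened patch then plays the same role for the transformer as a word embedding does in text, which is what lets the same self-attention machinery serve both modalities.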
A. Single-Modal Fake News

In general, text and image characteristics can be employed on their own to detect fake news, while other features are typically deployed as supplementary information to help identification. The single-modal-based technique utilizes only one characteristic for detection. In [48], the relevance of an image component was used for automatic false news detection on social media, as it has been established that authentic and false news events have different image distribution patterns. In [49], a co-attention technique was followed to identify the top K most significant phrases in a news story and the top K most important user evaluations for the final classification. In [50], a CNN-based capsule network model with pre-trained word embeddings was implemented to classify false news in the ISOT and LIAR datasets. In [51], a generative model was proposed to extract new patterns and aid in the identification of fake news by examining previous relevant user reactions. In [52], n-grams were applied with TF-IDF word embedding to obtain content characteristics, and LSTM and BERT models were trained to deal with contextual information. Then, a feedforward neural network was utilized for classification. However, this technique did not account for the complete use of different textual characteristics.

B. Multi-Modal Fake News

In general, social media postings featuring photos and graphics receive far more retweets and comments and spread much faster than those having only text. Images spread widely, captivating people's emotions and expressing a sense of reality. Images related to a post may have been edited or simply taken out of context. It is not uncommon to distort images for political or personal motives, as well as to use photo editing software to change an image. As a result, when analyzing both text and images, photo captions are critical to identifying clickbait and false captions [29].

TABLE IV. MULTIMODAL FAKE NEWS DATASETS

Dataset                    Description                                                      Study
MediaEval (2015)           Contains 15,000 items, including 176 images in 5,008 real       [33]
                           news tweets and 185 misused images in 7,032 fraudulent tweets.
Twitter (2016)             Contains 7,898 fake news, 6,026 real news, and 514 images.      [9, 11, 21, 23, 30, 36]
MediaEval (2016)           Contains 17,000 unique tweets on various events. One-third      [32]
                           are real and the remaining are fake news.
Weibo (2017)               Consists of 4,749 fake and 4,779 real news.                     [9, 23, 27, 33, 36, 37]
Fakeddit (2019)            Multimodal standard dataset of 1,063,106 samples.               [16, 38]
Gossip (2020)              News stories include text, news image link, publishing time,    [9, 21, 26]
                           author name, and social media responses.
Politifact (2020)          Contains text, news image location, publishing time, and        [9, 18, 21]
                           remarks made on social networks.
All Data (2020)            Contains 11,941 fake and 8,074 real news.                       [19, 34]
ReCOVery (2020)            Contains 2,029 news articles shared on social media, most of    [53]
                           which (2,017) have both textual and visual information.
Twitter Indian Dataset v3  Contains a list of fake and accurate news stories covering      [29]
(2021)                     primarily politics, Bollywood, and religion.
Ti-CNN (2021)              20,000 articles from websites, including over 11,000 fake and   [9, 21]
                           more than 8,000 real news items.
Fake news sample by        45,569 news items; 25,343 are real and the remaining are fake.  [20]
Guilherme Pontes (2021)
Twitter_database (2023)    Includes 5 partitions to perform 5-fold cross-validation.       [26]
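The Twitter_database entry in Table IV is distributed as 5 partitions for 5-fold cross-validation. A minimal sketch of how such index partitions can be built follows; the helper is hypothetical and does not reproduce that dataset's official split:

```python
def k_fold_indices(n_samples, k=5):
    """Partition sample indices into k disjoint folds; fold i serves as
    the test split while the remaining folds form the training split."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        # training indices = everything outside the i-th fold
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

# 10 toy samples -> 5 (train, test) pairs, each test fold holding 2 samples.
splits = k_fold_indices(10, k=5)
```

Rotating the held-out fold this way means every sample is tested exactly once, which is why the k-fold protocol gives a less optimistic accuracy estimate than a single train/test split.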
1) Datasets

Table IV lists the multimodal datasets applied in various studies, and Figure 2 describes the most popular dataset dimensions.
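A later subsection classifies fusion mechanisms by the time at which the modalities are merged. As a toy contrast between early (feature-level) and late (decision-level) fusion, the sketch below assumes made-up text/image feature vectors and per-modality scores; the function names and weights are illustrative only:

```python
def early_fusion(text_vec, image_vec):
    """Early (feature-level) fusion: concatenate modality features
    into one joint vector before feeding a single classifier."""
    return text_vec + image_vec

def late_fusion(text_score, image_score, w_text=0.5, w_image=0.5):
    """Late (decision-level) fusion: combine per-modality fake-news
    scores with a weighted average."""
    return w_text * text_score + w_image * image_score

fused = early_fusion([0.2, 0.7], [0.9, 0.1, 0.4])  # length-5 joint feature
score = late_fusion(0.8, 0.6)                      # combined decision score
```

Early fusion lets one model see cross-modal interactions at the cost of a larger input space; late fusion keeps the modality models independent but can only mix their final decisions.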
2) Textual and Visual Preprocessing

Pre-processing in text starts with cleaning the input datasets by extracting excess, extreme, and duplicative text parts. Word embedding methods keep only meaningful tokens, which are transformed into vectors. The text is stemmed/lemmatized, normalized, and tokenized. Stemming and lemmatization remove words and symbols without meaning. Normalization transforms text into canonical form. Stemming cuts off the ends of input words to lower their inflection and convert them into their core structures. In general, the canonical form of the original input word is deployed. It is very important to normalize text in the web scope and social media data, as it contains a lot of noise, such as abbreviations, misspellings, and out-of-vocabulary words. Image data are pre-processed by verifying that all URLs are correct. It is also important to normalize the size of the images and divide them for training and testing. Textual and visual data are pre-processed individually and then merged to complete each instance in terms of its three parameters: title, text, and vision [20].

Fig. 2. Distribution of multimodal fake news datasets.

3) Textual and Visual Feature Extraction and Selection

Textual properties can be obtained at several levels of the hierarchy, such as word, sentence, and message. The most basic lexical characteristics are the overall total of characters, the number of different words, the average length of words, and so on. In the meantime, the semantics of linguistic characteristics, such as the proportion of first/third person pronouns, the number of news detection by pooling and attention blocks, and positive or negative emoji symbols, are all accessible options. Unlike linguistic characteristics, syntactic features improve the aim of feature extraction to a significant
level: emotion score or part-of-speech labeling. Recently, various complicated models, namely BoW, Word2Vec, and other embedding techniques, have been used to recognize fake news. Image extraction provides additional visual information. Several studies employed the BERT pre-trained model to extract text characteristics. However, the BERT model has many parameters and a slower training speed. Furthermore, visual and text characteristics lie in separate semantic feature spaces, resulting in heterogeneity [31].

4) Fusion Mechanism

The combination of textual content and images is one of the most widely utilized features for multimodal fake news detection. The intuition behind this cue is that some fake news spreaders deploy tempting images, e.g., exaggerated, dramatic, or sarcastic graphics, that are far from the textual content, to attract users' attention. Information fusion techniques have an original ability to manage input data with a multimodal nature. Many experiments have proven the benefit of these techniques and that their full exploitation leads to improved performance [31, 38]. Several techniques combine textual and visual information into a single representation, ignoring their associations, which might lead to poor results. Fusion can be classified according to the time at which it takes place, as follows [53]:
• Early fusion (feature fusion): Feature vectors from multiple modalities are combined and fed into a model for prediction. Due to the fusion of pre-processed features from different modalities at the input layer, working with features with higher granularity becomes tedious (Figure 3).
• Late fusion (decision-level or kernel-level fusion): Combines results from various modalities using summation, maximization, average, or weighted average methods. Most late fusion solutions employ handcrafted rules, which are prone to human bias and far from real-world peculiarities (Figure 3).
• Intermediate fusion (mid-fusion): Involves combining units from several modality-specific paths into a single shared layer. It is possible to create a representation layer either by mapping multiple channels at the same time or by combining different modal sets at various levels.

Fig. 3. Early and late fusion mechanism structures.

Fusion mechanisms can also be divided according to the technology followed to merge the textual and visual attributes [36]:
• Simple operation-based: DL combines vectorized features from several data sources using fundamental algorithms, such as concatenation or weighted addition. As models based on DL techniques are trained concurrently, high-level features can be extracted at a level that accommodates both activities. Such processes often have minimal or no correlation factors.
• Attention-based: Fusion often involves attention processing. Different outputs are frequently used to provide different sets of changing weights for summing, preserving more information by merging the results from each peek.
• Bilinear pooling-based: This is achieved by taking the outer product of both vectors (text and image input vectors) to increase and multiply the exchanges between all elements of both vectors. This process is more expressive.

5) Model Evaluation Metrics

A confusion matrix serves as the basis for evaluating a classification model. True Positives (TP) indicate news that was projected to be true and was true, False Positives (FP) indicate news projected to be true but was fake, True Negatives (TN) indicate news that was projected to be false and was untrue, and False Negatives (FN) indicate news projected as untrue but was accurate [52]. The efficiency of a model is evaluated by [54-56]:

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (1)

Precision = TP / (TP + FP)   (2)

Recall = TP / (TP + FN)   (3)

F1-score = 2 × (Precision × Recall) / (Precision + Recall)   (4)

C. Studies on Multimodal Fake News Detection

In [32], a scaled dot-product attention mechanism was implemented to capture the relationship between the text features extracted by BERT and the image features extracted by VGG-19. In [33], another model also based on BERT and VGG-19 was proposed, accepting both text and picture input. Subsequently, the pair of embeddings was joined and subjected to a multi-modal variational autoencoder to obtain the common latent representation. A multimodal cross-attention network was designed to fuse the resulting features. In [23], four distinct submodules made up a fake news system: feature fusion based on multi-modal factorized bilinear pooling; two attention mechanisms, one for textual description combined with stacked BiLSTM and the other for visual feature extraction combined with multi-level CNN-RNN; and MLP for classification. In [20], visual picture attributes were extracted using image captioning and forensic evaluation, and textual hidden patterns were extracted employing a Hierarchical
Attention Network (HAN). In [21], a multi-modal coupled ConvNet architecture was presented, combining textual and visual data modules from three datasets and utilizing a late fusion mechanism. In [58], a detection framework was proposed, deploying Word2Vec to fuse text input to the embedding layer and passing image input to a cross-modal attention residual and multi-channel CNN. The multi-channel CNN was implemented as a reducer of the amount of trash data produced by the cross-modal fusion parts. In [9], an MCNN was proposed, considering the consistency of multi-modal data and capturing the overall characteristics of social media information based on an early fusion mechanism. This model used BERT in the text feature extraction module and the attention mechanism with ResNet-50 in the visual semantic feature extraction module. In [34], three modalities were evaluated: text, image, and image attributes. Additionally, a model based on dual attention fusion networks was applied to combine features. Initially, the model extracts image (based on ResNet-50 V2) and text modalities (based on BERT) in the feature representation of the news content. Then, the feature fusion layer combined these features with the help of the cross-modal attention module to promote various modal feature representations to complement the information. In [29], a multi-modal DL technique was proposed to use and process visual and textual features, employing EfficientNet-B0 and a sentence transformer. Feature embedding was performed on individual channels, while fusion was performed on the last classification layer. Late fusion was applied to mitigate the noisy data generated by the multi-modalities. In [11], TLFND was proposed, which was based on a three-phase feature-matching distance technique to detect fake news. An attention-guiding module was devised to assist in aggregating the cross-modality correlations and the aligned unimodal representations in an effective and interpretable manner. In [30], a model based on transformers and multi-modal fusion was introduced. This model extracts text and image features using different transformers and fuses the features implementing attention mechanisms. In [18], a quantum-based standard was proposed
end, the features were combined to create a feature vector that for multimedia data fusion to identify fake news. This system
can be used for classification. In [28], news post images were extracted features in both textual and visual forms and sent
converted from their spatial dimension to their frequency them to the convolutional-quantum network to achieve
domain utilizing machine learning. Subsequently, a multi-layer classification.
CNN model was engaged to extract the characteristics of the
frequency picture, and MML was deployed to retrieve image- V. DISCUSSION
related web pages on Google. Simultaneously, MML uses the DL has begun to be strongly involved in multimodal fake
evidence veracity classification task to support the false news news detection systems at all stages, whether it is engaged in
detection task by selecting evidence. This part involved feeding extracting features of textual and visual inputs, in the
the evidence and the claim into a BERT-based encoder, mechanisms of fusing features extracted from multimodal data,
followed by learning evidence representations employing or in the classification of fake news. It is possible to detect fake
claim-evidence correlation representations. Ultimately, the co- news adopting these strategies but some restrictions limit their
attention process fuses the representations of the image with accuracy, involving the requirement for a huge dataset
relevant evidence. In [35], a model was presented based on two containing diverse data in all fields of life (political, economic,
principles, blocking and fusion. This model determined the technological, technical, and health, etc.), in addition to the
spatial and temporal location of the data in the fusion inability to fuse the extracted features efficiently, take
mechanism for the visual and textual attributes. In [37], text advantage of the most multimodal important features, and
features were extracted from bidirectional encoder measure the extent of interconnection between them. Some
representations of transformers, image features were extracted studies focused on a single social network, such as Twitter,
from Swin-transformers, and then deep autoencoding was used Weibo, or Facebook, but future fake news detection systems
as an early fusion technique by merging text and visual must be applicable on different websites and social networks to
attributes. In [38], the proposed framework was based on the acquire knowledge deeply and detect fake news quickly. Many
BERT and Xception models to learn visual and linguistic studies used BERT word embedding [9, 33-35, 37-38, 57] and
models. In [31], the ALBERT model was combined with a depreciated traditional techniques, such as GloVe [20-21] and
multi-modal circulant fusion technique to detect fake news. Word2Vec [24, 58] in their textual feature extraction model.
This system included a textual feature extractor (ALBERT), a BERT can discover the implicit associations within the
visual feature extractor (VGG-19), a feature fusion, a fake sentence words and texts in which the system is trained, but
news detector, and domain classification modules. In [26], that has not prevented a recent trend toward including derived
multimodal pre-processing of both words and images was models, like RoBERTa [11, 16], ALBERT [31], distilBERT
performed. Glove embedding and Word2vector approaches [29], and XLNet [18, 27]. Although all proposed multimodal
were deployed to extract the text characteristics and the fake news detection systems still use CNN [21, 25, 26], VGG
Adaptive Water Strider Algorithm (A-WSA) was applied to [31-33], and ResNet [34-35] neural networks in a visual feature
extract the best characteristics from both text and image data. extraction stage, there is a new strategy deploying ViT for
Feature fusion receives the optimized features, which are textual and visual feature extraction. Regarding fusion
obtained by the same A-WSA optimization process based on techniques, it is clear that in recent years there was no clear
the weight factor. Lastly, O-BiLSTM was utilized for fake interest in examining how to benefit from extracted features
news classification. In [27], a model based on BLIP (FNDB) and how to choose, as concatenation [32, 33, 38] of extracted
was proposed. XLNet and VGG-19-based feature extractors features is the common fusing operation in early or late fusion
were engaged to extract textual and visual feature mechanisms. However, there is interest in the technology of
representations, respectively, and the BLIP-based multimodal attention mechanisms [23, 32, 36] and their strong entry during
feature extractor was put into service to obtain multimodal the past two years to support the approved fusion mechanisms.
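The three fusion families surveyed above can be sketched in a few lines of NumPy. The vector sizes, the mixing weights, and the random features below are illustrative assumptions, not the configurations of any surveyed system:

```python
import numpy as np

rng = np.random.default_rng(0)
text_feat = rng.normal(size=16)  # stand-in for a pooled textual embedding
img_feat = rng.normal(size=16)   # stand-in for a pooled visual embedding

# 1) Simple operation-based fusion: concatenation or weighted addition.
concat = np.concatenate([text_feat, img_feat])   # shape (32,)
weighted_sum = 0.6 * text_feat + 0.4 * img_feat  # fixed, input-independent weights

# 2) Attention-based fusion: a softmax over modality scores produces
#    adaptive weights instead of fixed ones.
query = rng.normal(size=16)  # toy attention query vector
scores = np.array([text_feat @ query, img_feat @ query])
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attended = weights[0] * text_feat + weights[1] * img_feat

# 3) Bilinear pooling-based fusion: the outer product captures every
#    pairwise interaction between the two modalities' dimensions.
bilinear = np.outer(text_feat, img_feat).flatten()  # shape (256,)
```

Concatenation keeps the modalities independent, the attention weights adapt to the inputs, and the bilinear map grows quadratically with feature size, which is why factorized variants such as the bilinear pooling in [23] are preferred in practice.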
Summary of the surveyed multimodal fake news detection systems (datasets with results, textual and visual feature extractors, fusion mechanism, classifier, limitations, and future work):

[36] Dataset: Weibo (Acc=65.4, Pre=66.4, Rec=66.8, F1=66.6). Textual: BiLSTM. Visual: VGG-19 + pooling. Fusion: attention mechanism. Classifier: FCL. Limitation: failure of fine-tuned feature extraction. Future work: other fusion methods based on attention mechanisms.

[37] Datasets: Twitter (Acc=75.6, Pre=72.8, Rec=97.7, F1=83.4) and Weibo (Acc=59.7, Pre=56.4, Rec=99.4, F1=71.9). Textual: BERT. Visual: Swin transformer. Fusion: trained deep autoencoder. Classifier: FCL. Limitation: if a post has many images, only one may be used to identify it. Future work: reduce the model's difficulty to ensure its use on small devices.

[38] Dataset: Fakeddit (Acc=91.8, Pre=93.3, Rec=93.2, F1=93.2). Textual: BERT. Visual: Xception. Fusion: concatenation. Classifier: FCL. Limitation: small dataset. Future work: use post content and comments, together with user-related data.

[11] Datasets: Politifact (Acc=94.4, Pre=97.4, Rec=96.6, F1=97), Gossipcop (Acc=90.9, Pre=93.2, Rec=94.7, F1=93.9), and Twitter (Acc=83.1, Pre=85.2, Rec=82.4, F1=83.7). Textual: RoBERTa. Visual: VGG-19 + BiLSTM. Fusion: concatenation. Classifier: FCL. Limitation: a fusion approach that focuses on the contents of the extracted features. Future work: adapting new areas and improving the technology while testing the proposed model is underway.

[15] Datasets: Twitter (Acc=86.8, Pre=83.1, Rec=75.4, F1=79.1) and Weibo (Acc=90.4, Pre=94.3, Rec=87.1, F1=90.5). Textual: BERT + two-text branch. Visual: ResNet-50 + two-image branch. Fusion: multi-modal bilinear pooling + self-attention mechanism. Classifier: FCL. Limitation: the use of multiple techniques negatively affects execution time. Future work: combine textual information with several photos.

[26] Dataset: Twitter (Acc=96.5, Pre=88.8, Rec=96.2, F1=92.41). Textual: Word2Vec + GloVe. Visual: ResNet-50 + VGG-16. Fusion: adaptive feature fusion based on optimized WSA. Classifier: O-BiLSTM. Limitation: not extracting textual features efficiently. Future work: use audio signals and captions to detect false news videos.

[27] Datasets: Weibo (Acc=88.8, Pre=89.1, Rec=97.2, F1=93) and Gossipcop (Acc=87.3, Pre=79, Rec=44, F1=56.5). Textual: XLNet. Visual: VGG-19. Fusion: cross-modal attention. Classifier: FCL. Limitation: the fusion process needs improvement. Future work: use a more effective model to extract features.

[29] Datasets: MediaEval (Acc=86.4, Pre=84, Rec=93, F1=88), Weibo (Acc=81.4, Pre=80.3, Rec=86.3, F1=83.6), Twitter, Indian Dataset v3 (Acc=67.1), and Fakeddit (Acc=88.8, Pre=85, Rec=87, F1=86). Textual: DistilBERT. Visual: EfficientNet-B0. Fusion: late fusion. Classifier: ANN. Limitation: high-resolution photos with only a small altered area appeared to be poorly detected. Future work: detect satirical news and the text that is placed over the photos.

[30] Datasets: Twitter (Acc=93.5, Pre=96.5, Rec=93.7, F1=95.1) and Weibo (Acc=91.5, Pre=91.3, Rec=91.3, F1=91.3). Textual: BERT + BiLSTM. Visual: VGG-19. Fusion: concatenation. Classifier: MLP. Limitation: cannot be used directly when one of the modalities is lacking. Future work: improve feature extraction to counteract intentionally deceiving photos.

[59] Datasets: Twitter (Acc=91.8, Pre=91.2, Rec=85.4, F1=91.8) and Weibo (Acc=92.2, Pre=96.9, Rec=88.6, F1=92.5). Textual: GloVe + Transformer. Visual: ViT. Fusion: late fusion based on an attention mechanism. Classifier: MLP. Limitation: time complexity. Future work: enhance the model for cross-domain news detection.

[18] Dataset: Gossip (Acc=87.9, Pre=95.8, Rec=89.9, F1=92.8). Textual: XLNet. Visual: VGG-19. Fusion: quantum multimodal fusion. Classifier: FCL. Limitation: quantum circuit with time-based complexity. Future work: apply quantum fuzzy neural networks.
VI. RESEARCH GAPS AND CHALLENGES

Fake news is fundamentally multimodal and multilingual, taking visual, auditory, or literary forms and expressed in a language that readers may not be familiar with. A new viewpoint can be developed to make deep systems more acceptable. Additionally, appropriate feature collection and classification techniques can improve the detection of fake news. Studies must investigate which classification approach is most appropriate for certain features and which textual or visual feature extractors to use. As a result, greater attention must be paid to feature choice and fusion to improve performance. The challenges in multimodal fake news detection approaches can be summarized as:

- Existing techniques often employ a basic concatenation strategy to fuse inter-modal information, yielding mediocre detection results.
- There is a significant difference between image similarities and sentences in most fake news, but existing algorithms do not fully capitalize on this.
- The lack of large and rich multimodal fake news datasets negatively affects system development, and existing datasets are often limited to the economic or political field. Moreover, the scarcity of multilingual datasets restricts the development of fake news detection systems for several languages and for different dialects of the same language.
- Existing systems do not rely on psychological data combined with the contextual features of the texts and images of published news, although doing so would save a great deal of time in reaching the people responsible for sharing false information and revealing their purposes.
VII. CONCLUSION

After studying the literature on fake news analysis methods, this paper summarized the basic features of multimodal fake news detection systems, including datasets, visual and textual preprocessing, feature extraction, fusion mechanisms, and fake news detection stages, as well as related techniques such as BERT, transformers, ViT, and attention mechanisms. A brief review of important multimodal fake news detection systems was performed, with different deep learning methods in different stages. Future studies could focus on modern attention mechanisms in fake video detection systems. In addition, efficient early detection mechanisms must be developed.

REFERENCES

[1] S. Hangloo and B. Arora, "Combating multimodal fake news on social media: methods, datasets, and future perspective," Multimedia Systems, vol. 28, no. 6, pp. 2391–2422, Dec. 2022, [Link] s00530-022-00966-y.
[2] L. Hu, S. Wei, Z. Zhao, and B. Wu, "Deep learning for fake news detection: A comprehensive survey," AI Open, vol. 3, pp. 133–155, Jan. 2022, [Link]
[3] C. Comito, L. Caroprese, and E. Zumpano, "Multimodal fake news detection on social media: a survey of deep learning techniques," Social Network Analysis and Mining, vol. 13, no. 1, Aug. 2023, Art. no. 101, [Link]
[4] D. Gifu, "An Intelligent System for Detecting Fake News," Procedia Computer Science, vol. 221, pp. 1058–1065, Jan. 2023, [Link] 10.1016/[Link].2023.08.088.
[5] J. Li and M. Lei, "A Brief Survey for Fake News Detection via Deep Learning Models," Procedia Computer Science, vol. 214, pp. 1339–1344, Jan. 2022, [Link]
[6] A. Gandhi, K. Adhvaryu, S. Poria, E. Cambria, and A. Hussain, "Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions," Information Fusion, vol. 91, pp. 424–444, Mar. 2023, [Link]
[7] M. Nirav Shah and A. Ganatra, "A systematic literature review and existing challenges toward fake news detection models," Social Network Analysis and Mining, vol. 12, no. 1, Nov. 2022, Art. no. 168, [Link]
[8] A. Figueira, N. Guimaraes, and L. Torgo, "Current State of the Art to Detect Fake News in Social Media: Global Trendings and Next Challenges," in Proceedings of the 14th International Conference on Web Information Systems and Technologies, Seville, Spain, 2018, pp. 332–339, [Link]
[9] J. Xue, Y. Wang, Y. Tian, Y. Li, L. Shi, and L. Wei, "Detecting fake news by exploring the consistency of multimodal data," Information Processing & Management, vol. 58, no. 5, Sep. 2021, Art. no. 102610, [Link]
[10] "Google Trends," Google Trends. [Link] explore?date=today%205-y&q=fake%20news&hl=en (accessed May 30, 2024).
[11] J. Wang, J. Zheng, S. Yao, R. Wang, and H. Du, "TLFND: A Multimodal Fusion Model Based on Three-Level Feature Matching Distance for Fake News Detection," Entropy, vol. 25, no. 11, Nov. 2023, Art. no. 1533, [Link]
[12] K. Liu and M. Hai, "Rumor Detection of Covid-19 Related Microblogs on Sina Weibo," Procedia Computer Science, vol. 221, pp. 386–393, Jan. 2023, [Link]
[13] S. Ahmed, K. Hinkelmann, and F. Corradini, "Combining Machine Learning with Knowledge Engineering to detect Fake News in Social Networks - a survey," arXiv, Jan. 20, 2022, [Link] arXiv.2201.08032.
[14] Y. Shen, Q. Liu, N. Guo, J. Yuan, and Y. Yang, "Fake News Detection on Social Networks: A Survey," Applied Sciences, vol. 13, no. 21, Jan. 2023, Art. no. 11877, [Link]
[15] Y. Guo, H. Ge, and J. Li, "A two-branch multimodal fake news detection model based on multimodal bilinear pooling and attention mechanism," Frontiers in Computer Science, vol. 5, Apr. 2023, [Link] 10.3389/fcomp.2023.1159063.
[16] L. Qian, R. Xu, and Z. Zhou, "MRDCA: a multimodal approach for fine-grained fake news detection through integration of RoBERTa and DenseNet based upon fusion mechanism of co-attention," Annals of Operations Research, Dec. 2022, [Link] 05154-9.
[17] A. M. Luvembe, W. Li, S. Li, F. Liu, and X. Wu, "CAF-ODNN: Complementary attention fusion with optimized deep neural network for multimodal fake news detection," Information Processing & Management, vol. 61, no. 3, May 2024, Art. no. 103653, [Link] 10.1016/[Link].2024.103653.
[18] Z. Qu, Y. Meng, G. Muhammad, and P. Tiwari, "QMFND: A quantum multimodal fusion-based fake news detection model for social media," Information Fusion, vol. 104, Apr. 2024, Art. no. 102172, [Link]
[19] F. A. O. Santos, K. L. Ponce-Guevara, D. Macêdo, and C. Zanchettin, "Improving Universal Language Model Fine-Tuning using Attention Mechanism," in 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, Jul. 2019, pp. 1–7, [Link]
[20] P. Meel and D. K. Vishwakarma, "HAN, image captioning, and forensics ensemble multimodal fake news detection," Information Sciences, vol. 567, pp. 23–41, Aug. 2021, [Link] [Link].2021.03.037.
[21] C. Raj and P. Meel, "ConvNet frameworks for multi-modal fake news detection," Applied Intelligence, vol. 51, no. 11, pp. 8132–8148, Nov. 2021, [Link]
[22] L. Wang, C. Zhang, H. Xu, Y. Xu, X. Xu, and S. Wang, "Cross-modal Contrastive Learning for Multimodal Fake News Detection," in Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, Canada, Nov. 2023, pp. 5696–5704, [Link] 3581783.3613850.
[23] R. Kumari and A. Ekbal, "AMFB: Attention based multimodal Factorized Bilinear Pooling for multimodal Fake News Detection," Expert Systems with Applications, vol. 184, Dec. 2021, Art. no. 115412, [Link]
[24] J. Zeng, Y. Zhang, and X. Ma, "Fake news detection for epidemic emergencies via deep correlations between text and images," Sustainable Cities and Society, vol. 66, Mar. 2021, Art. no. 102652, [Link]
[25] I. Segura-Bedmar and S. Alonso-Bartolome, "Multimodal Fake News Detection," Information, vol. 13, no. 6, Jun. 2022, Art. no. 284, [Link]
[26] V. Kishore and M. Kumar, "Enhanced Multimodal Fake News Detection with Optimal Feature Fusion and Modified Bi-LSTM Architecture," Cybernetics and Systems, Jan. 2023, [Link] 2023.2175155.
[27] Z. Liang, "Fake News Detection Based on Multimodal Inputs," Computers, Materials & Continua, vol. 75, no. 2, pp. 4519–4534, 2023, [Link]
[28] X. Cui and Y. Li, "Fake News Detection in Social Media based on Multi-Modal Multi-Task Learning," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 13, no. 7, 2022, [Link]
[29] D. K. Sharma, B. Singh, S. Agarwal, H. Kim, and R. Sharma, "FakedBits - Detecting Fake Information on Social Platforms using Multi-Modal Features," KSII Transactions on Internet and Information Systems (TIIS), vol. 17, no. 1, pp. 51–73, Jan. 2023.
[30] L. Wu, P. Liu, and Y. Zhang, "See How You Read? Multi-Reading Habits Fusion Reasoning for Multi-Modal Fake News Detection," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, pp. 13736–13744, Jun. 2023, [Link] 26609.
[31] X. Wang, X. Li, X. Liu, and H. Cheng, "Using ALBERT and Multi-modal Circulant Fusion for Fake News Detection," in 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Prague, Czech Republic, 2022, pp. 2936–2942, [Link] SMC53654.2022.9945303.
[32] N. M. Duc Tuan and P. Quang Nhat Minh, "Multimodal Fusion with BERT and Attention Mechanism for Fake News Detection," in 2021 RIVF International Conference on Computing and Communication Technologies (RIVF), Hanoi, Vietnam, Aug. 2021, pp. 1–6, [Link]
[33] R. Jaiswal, U. P. Singh, and K. P. Singh, "Fake News Detection Using BERT-VGG19 Multimodal Variational Autoencoder," in 2021 IEEE 8th Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON), Dehradun, India, Nov. 2021, pp. 1–5, [Link] 9667614.
[34] H. Yang et al., "Multi-Modal fake news Detection on Social Media with Dual Attention Fusion Networks," in 2021 IEEE Symposium on Computers and Communications (ISCC), Athens, Greece, Sep. 2021, pp. 1–6, [Link]
[35] L. Ying, H. Yu, J. Wang, Y. Ji, and S. Qian, "Multi-Level Multi-Modal Cross-Attention Network for Fake News Detection," IEEE Access, vol. 9, pp. 132363–132373, 2021, [Link] 3114093.
[36] Y. Guo and W. Song, "A Temporal-and-Spatial Flow Based Multimodal Fake News Detection by Pooling and Attention Blocks," IEEE Access, vol. 10, pp. 131498–131508, 2022, [Link] 2022.3229762.
[37] Y. Liang, T. Tohti, and A. Hamdulla, "False Information Detection via Multimodal Feature Fusion and Multi-Classifier Hybrid Prediction," Algorithms, vol. 15, no. 4, Apr. 2022, Art. no. 119, [Link]
[38] S. K. Uppada, P. Patel, and S. B., "An image and text-based multimodal model for detecting fake news in OSN's," Journal of Intelligent Information Systems, vol. 61, no. 2, pp. 367–393, Oct. 2023, [Link]
[39] S. K. Hamed, M. J. Ab Aziz, and M. R. Yaakub, "A review of fake news detection approaches: A critical analysis of relevant studies and highlighting key challenges associated with the dataset, feature representation, and data fusion," Heliyon, vol. 9, no. 10, Oct. 2023, Art. no. e20382, [Link]
[40] D. S. Asudani, N. K. Nagwani, and P. Singh, "Impact of word embedding models on text analytics in deep learning environment: a review," Artificial Intelligence Review, vol. 56, no. 9, pp. 10345–10425, Sep. 2023, [Link]
[41] J. Egger, A. Pepe, C. Gsaxner, Y. Jin, J. Li, and R. Kern, "Deep learning—a first meta-survey of selected reviews across scientific disciplines, their commonalities, challenges and research impact," PeerJ Computer Science, vol. 7, Nov. 2021, Art. no. e773, [Link] 10.7717/peerj-cs.773.
[42] K. Han et al., "A Survey on Visual Transformer," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 87–110, Jan. 2023, [Link]
[43] A. Choudhary and A. Arora, "Assessment of bidirectional transformer encoder model and attention based bidirectional LSTM language models for fake news detection," Journal of Retailing and Consumer Services, vol. 76, Jan. 2024, Art. no. 103545, [Link]
[44] S. F. N. Azizah, H. D. Cahyono, S. W. Sihwi, and W. Widiarto, "Performance Analysis of Transformer Based Models (BERT, ALBERT and RoBERTa) in Fake News Detection," arXiv, Aug. 09, 2023, [Link]
[45] D. Tomás, R. Ortega-Bueno, G. Zhang, P. Rosso, and R. Schifanella, "Transformer-based models for multimodal irony detection," Journal of Ambient Intelligence and Humanized Computing, vol. 14, no. 6, pp. 7399–7410, Jun. 2023, [Link]
[46] J. Maurício, I. Domingues, and J. Bernardino, "Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review," Applied Sciences, vol. 13, no. 9, Jan. 2023, Art. no. 5521, [Link]
[47] Z. Jin, J. Cao, Y. Zhang, J. Zhou, and Q. Tian, "Novel Visual and Statistical Image Features for Microblogs News Verification," IEEE Transactions on Multimedia, vol. 19, no. 3, pp. 598–608, Mar. 2017, [Link]
[48] K. Shu, L. Cui, S. Wang, D. Lee, and H. Liu, "dEFEND: Explainable Fake News Detection," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, Apr. 2019, pp. 395–405, [Link] 3292500.3330935.
[49] T. Chen, X. Li, H. Yin, and J. Zhang, "Call Attention to Rumors: Deep Attention Based Recurrent Neural Networks for Early Rumor Detection," in Trends and Applications in Knowledge Discovery and Data Mining, Melbourne, Australia, 2018, pp. 40–52, [Link]
[50] M. H. Goldani, S. Momtazi, and R. Safabakhsh, "Detecting fake news with capsule neural networks," Applied Soft Computing, vol. 101, Mar. 2021, Art. no. 106991, [Link]
[51] F. Qian, C. Gong, K. Sharma, and Y. Liu, "Neural User Response Generator: Fake News Detection with Collective User Intelligence," in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, Jul. 2018, pp. 3834–3840, [Link]
[52] N. Kausar, A. AliKhan, and M. Sattar, "Towards better representation learning using hybrid deep learning model for fake news detection," Social Network Analysis and Mining, vol. 12, no. 1, Nov. 2022, Art. no. 165, [Link]
[53] S. Abdali, S. Shaham, and B. Krishnamachari, "Multi-modal Misinformation Detection: Approaches, Challenges and Opportunities," arXiv, Mar. 27, 2024, [Link]
[54] M.-H. Guo et al., "Attention mechanisms in computer vision: A survey," Computational Visual Media, vol. 8, no. 3, pp. 331–368, Sep. 2022, [Link]
[55] B. Ahmed, G. Ali, A. Hussain, A. Baseer, and J. Ahmed, "Analysis of Text Feature Extractors using Deep Learning on Fake News," Engineering, Technology & Applied Science Research, vol. 11, no. 2, pp. 7001–7005, Apr. 2021, [Link]
[56] H. M. Al-Dabbas, R. A. Azeez, and A. E. Ali, "Two Proposed Models for Face Recognition: Achieving High Accuracy and Speed with Artificial Intelligence," Engineering, Technology & Applied Science Research, vol. 14, no. 2, pp. 13706–13713, Apr. 2024, [Link] 10.48084/etasr.7002.
[57] T. Zhang et al., "BDANN: BERT-Based Domain Adaptation Neural Network for Multi-Modal Fake News Detection," in 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, Jul. 2020, pp. 1–8, [Link]
[58] C. Song, N. Ning, Y. Zhang, and B. Wu, "A multimodal fake news detection model based on crossmodal attention residual and multichannel convolutional neural networks," Information Processing & Management, vol. 58, no. 1, Jan. 2021, Art. no. 102437, [Link] 10.1016/[Link].2020.102437.
[59] P. Yang, J. Ma, Y. Liu, and M. Liu, "Multi-modal transformer for fake news detection," Mathematical Biosciences and Engineering, vol. 20, no. 8, pp. 14699–14717, Jul. 2023, [Link]
Attention mechanisms in multimodal fake news detection models enhance the model's ability to weigh different features according to their relevance and importance in a given context. They facilitate the generation of more accurate representations by allowing models to focus on crucial interactions between textual and visual inputs. This is particularly useful for highlighting significant details that may be critical for discerning falsehoods in the data. As a result, attention mechanisms improve not only the interpretability of models, by making it clearer where focus is applied, but also their accuracy in detecting fake news by capturing intricate inter-modal relationships.
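As a minimal illustration of the scaled dot-product attention adopted by several of the surveyed models (e.g. the BERT/VGG-19 pairing in [32]), the sketch below attends toy text-token queries to toy image-region keys and values; the shapes and random features are assumptions made only for demonstration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attend each query in Q to the keys K; return weighted sums of V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(1)
text_tokens = rng.normal(size=(5, 8))  # 5 token embeddings act as queries
img_regions = rng.normal(size=(3, 8))  # 3 region embeddings act as keys/values

fused, attn = scaled_dot_product_attention(text_tokens, img_regions, img_regions)
# each row of `fused` mixes the image regions that the corresponding token
# attends to; `attn` rows are the interpretable per-token focus weights
```

The rows of the attention matrix are exactly the "where focus is applied" signal mentioned above: they sum to one and can be inspected per token.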
Proposed improvements for enhancing early detection accuracy of fake news include the development of more advanced attention mechanisms and the creation of efficient early detection algorithms that can quickly assess credibility in rapidly spreading information. Emphasizing cross-modality feature extraction and effective fusion strategies, such as quantum-based standards, can better integrate diverse data inputs. Incorporating psychological data along with contextual content features could also improve understanding of fake news spread dynamics, providing faster identification of misleading content before it reaches wide dissemination. Moreover, expanding datasets to cover more languages and domains, along with improving multimodal data capabilities, are crucial steps in advancing these systems.
Major limitations include the reliance on large, diverse datasets that are often not available, which hampers the ability to generalize models across different contexts and domains. Additionally, there is a challenge in efficiently integrating and fusing features from multiple modalities, as current techniques may not fully capitalize on the complex relationships within multimodal data, potentially leading to noisy outputs. The computational requirements and time complexity of deep learning techniques also pose challenges, especially with high-dimensional data inputs. Finally, the lack of robust multilingual datasets further limits these systems' applicability to diverse linguistic contexts.
The single-modal approach in fake news detection often focuses on either textual or visual features individually, utilizing models like CNN or LSTM for classification based on these standalone characteristics. It tends to be less robust because it misses cues available through complementary data, such as images or other metadata that multimodal approaches consider. Conversely, the multimodal approach integrates multiple data types, such as text, images, and even user interactions, to enhance detection capabilities, often employing sophisticated fusion techniques using attention mechanisms. However, multimodal approaches require larger, more varied datasets and face challenges in effectively fusing diverse feature types, leading to increased computational complexity and sometimes noisy data integration.
Transformer-based models have improved the fusion of multimodal data by effectively capturing relationships across different data types with their attention mechanisms, which allows for better weighing of relevant information from both textual and visual inputs. They offer superior feature extraction capabilities by deriving dependencies and interactions between modalities more efficiently than traditional concatenation methods. This facilitates a more coherent integration of diverse input features, resulting in improved decision-making processes in fake news detection. However, these benefits come at the cost of increased computational complexity and data requirements, necessitating larger datasets for effective training.
Crucial preprocessing steps for textual data involve cleaning datasets by removing excess and duplicate text elements, token normalization, stemming, lemmatization, and transforming significant tokens into vector forms through word embedding techniques. For visual data, preprocessing includes filtering images, removing noise, aligning image characteristics across a dataset, and applying transformations to unify formats for accurate feature extraction. These steps are essential for effective deep learning model training and ensure that both textual and visual data are optimized for feature extraction and subsequent multimodal fusion in fake news detection.
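A minimal sketch of the textual side of such a pipeline; the stop-word list, the regular expressions, and the naive suffix-stripping "stemmer" are deliberate simplifications chosen for illustration:

```python
import re

def preprocess_text(raw: str) -> list[str]:
    """Toy textual preprocessing: lowercase, strip URLs and punctuation,
    drop stop words, and crudely stem '-ing' forms."""
    stop_words = {"the", "a", "an", "is", "was", "to", "of"}
    text = raw.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"[^a-z\s]", " ", text)      # remove punctuation and digits
    tokens = [t for t in text.split() if t not in stop_words]
    return [t[:-3] if t.endswith("ing") else t for t in tokens]

tokens = preprocess_text("BREAKING!!! The photos were misleading: http://x.co/1")
# tokens -> ['break', 'photos', 'were', 'mislead']
```

In practice, the token list would then be mapped to vectors by an embedding model (GloVe, Word2Vec, or a BERT tokenizer and encoder) before fusion.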
Multimodal fake news detection systems feature the integration of both textual and visual data to improve accuracy, employing advanced techniques like deep learning and attention mechanisms. A key challenge is the need for large and diverse datasets encompassing various domains, such as political, economic, and health-related topics. Current approaches frequently use basic concatenation strategies for feature fusion, which often results in suboptimal detection capabilities due to failure in capturing inter-modal relationships. Additionally, the absence of diverse multimodal datasets, especially in multilingual contexts, hinders overall system development and performance.
BERT improves fake news detection by leveraging its transformer-based architecture to capture deep contextual relationships and semantic nuances, which enables it to understand implicit associations in textual data better than earlier techniques like GloVe or Word2Vec, which use static word embeddings. BERT's use of bidirectional transformers allows it to consider the context from both previous and following words, resulting in more effective feature extraction for nuanced content. However, BERT's major drawbacks include higher computational demands and greater resource requirements compared to simpler models like GloVe or Word2Vec. Additionally, fine-tuning BERT for specific tasks requires large datasets and significant computational power, potentially limiting its usage in environments with limited resources.
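The static-versus-contextual contrast can be seen with a toy example: a lookup table returns the same vector for "bank" in every sentence, while even a crude neighbour-blending encoder (a deliberately simplified stand-in for BERT's self-attention, using made-up 4-dimensional vectors) produces context-dependent vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
# Static GloVe/Word2Vec-style table: one fixed vector per word type.
static = {w: rng.normal(size=4) for w in ["the", "river", "money", "bank"]}

def contextual(tokens, i):
    """Toy context-sensitive encoding: blend the target word's static
    vector with the mean of its neighbours' vectors."""
    neighbours = np.mean([static[t] for j, t in enumerate(tokens) if j != i], axis=0)
    return 0.5 * static[tokens[i]] + 0.5 * neighbours

s1 = ["the", "river", "bank"]
s2 = ["the", "money", "bank"]
static_bank_1 = static["bank"]  # identical in both sentences
static_bank_2 = static["bank"]
ctx_bank_1 = contextual(s1, 2)  # differs between the two sentences
ctx_bank_2 = contextual(s2, 2)
```

Real BERT replaces the naive neighbour average with many layers of learned self-attention, which is exactly what makes it heavier to run and fine-tune than a static lookup.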
User-based attributes enhance fake news detection accuracy by providing additional context on how content is perceived and interacted with across social media platforms. These attributes may include user reactions, sharing patterns, engagement statistics, and historical behavior, each contributing insights into the credibility of the content. By incorporating these user interaction metrics, detection systems can identify anomalous patterns indicative of unnatural propagation or targeted misinformation campaigns, thus improving the robustness and contextuality of the model's decisions.
Feature fusion is challenging because it involves integrating disparate data types, such as text, images, and possibly others, into a unified model input, requiring reconciliation of different scales and dimensions. Misalignment and redundancy in multimodal data can introduce noise, hindering the model's accuracy. Late fusion techniques attempt to address these challenges by keeping feature extraction from each modality separate until the final stages of model processing, allowing for more contextual evaluation of features and reducing early-stage noise interference. This allows models to adjust the weighting and importance of features more flexibly, which can improve classification accuracy.
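A toy contrast between the two fusion schedules; the random features, the logistic scoring heads, and the equal 0.5/0.5 decision averaging are illustrative assumptions rather than any surveyed system's configuration:

```python
import numpy as np

rng = np.random.default_rng(3)
text_feat = rng.normal(size=8)  # toy textual features
img_feat = rng.normal(size=8)   # toy visual features

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Early fusion: merge features first, then apply a single classifier head,
# so noise in one modality directly contaminates the joint representation.
w_joint = rng.normal(size=16)
early_score = sigmoid(np.concatenate([text_feat, img_feat]) @ w_joint)

# Late fusion: score each modality separately, then combine the decisions,
# limiting how much noise one modality can inject into the other.
w_text, w_img = rng.normal(size=8), rng.normal(size=8)
late_score = 0.5 * sigmoid(text_feat @ w_text) + 0.5 * sigmoid(img_feat @ w_img)
```

Both scores are probabilities of the "fake" label; the structural difference is only where the modalities meet, which is precisely the early-versus-late trade-off discussed above.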