
Content Based Fake News Detection with machine and deep learning: a systematic review


Nicola Capuano (a,*), Giuseppe Fenza (b), Vincenzo Loia (b), Francesco David Nota (c)
(a) School of Engineering, University of Basilicata, 85100 Potenza, Italy
(b) Dipartimento di Scienze Aziendali - Management and Information Systems (DISA/MIS), University of Salerno, Italy
(c) Defence Analysis & Research Institute, Center for Higher Defence Studies, 00165 Rome (RM), Italy

Abstract
Fake news, which can be defined as intentionally and verifiably false news, has a strong influence on critical aspects of our society. Manual fact-checking is a widely adopted approach used to counteract the negative effects of fake news spreading. However, manual fact-checking is not sufficient when analysing the huge volume of newly created information. Moreover, the number of labeled datasets is limited, humans are not particularly reliable labelers, and databases are mostly in English and focused on political news. To address these issues, state-of-the-art machine learning models have been used to automatically identify fake news. However, the large number of models and the heterogeneity of features used in the literature often represent a barrier for researchers trying to improve model performance. For this reason, in this systematic review, a taxonomy of the machine learning and deep learning models and features adopted in Content-Based Fake News Detection is proposed, and their performance is compared across the analysed works. To our knowledge, our contribution is the first attempt at identifying, on average, the best-performing models and features over the multiple datasets/topics tested in all the reviewed works. Finally, challenges and opportunities in this research field are described with the aim of indicating areas where further research is needed.
Keywords: Fake News Detection, Content Based Fake News Detection, Content-Based Features


* Corresponding author
Email addresses: [Link]@[Link] (Nicola Capuano), gfenza@[Link] (Giuseppe Fenza), loia@[Link] (Vincenzo Loia), [Link]@[Link] (Francesco David Nota)
1. Introduction
In recent years, attention to the phenomenon of fake news has grown. Fake news, which can be defined as intentionally and verifiably false news [1], is considered to destabilize democracies, weaken the trust that citizens have in public institutions, and have a strong influence on critical aspects of our society such as elections, the economy, and public opinion (e.g. on wars). To counteract the negative effects of the spreading of fake news, several initiatives have started to appear; one widespread approach to studying and analysing fake news is fact-checking.

Manual fact-checking websites, such as [Link] and [Link], employ professional fact-checkers to analyse and detect fake news; the fact-checker's role is to compare known facts with knowledge extracted from news and assess their authenticity. Although manual fact-checking has a critical role in contrasting fake news, it is not sufficient when analysing the huge volume of newly created information, particularly fake news spreading on social networks. Moreover, the number of datasets labeled with fake vs. real news is limited, humans are not particularly reliable labelers, and databases are mostly in English and focused on political news (covering only a small subset of news) [1] [2]. To address these issues, Automatic Fake News Detection techniques have been developed in recent years, and new and more powerful machine learning models have been used to identify fake news and news harmfulness (e.g. [3] [4] [5]).

Content-Based Fake News Detection (CBFND) has the purpose of assessing news intention through a set of quantifiable features, often machine learning features, extracted from news content. CBFND is a critical tool for identifying news harmfulness. Specifically, CBFND can be considered: 1) a fake news detector; 2) a complement to other fake news detection techniques (such as credibility assessment and network propagation pattern analysis) [1] [2]. In this systematic review, we consider all the features that can be extracted from text content, excluding images or audio as they are not always present in news or social network posts.

Table 1 highlights the differences between this work and other reviews in the field of fake news detection. Specifically, compared to other reviews, this work is the only one that provides an extensive evaluation of features and models as well as their performance on multiple datasets; moreover, some reviews focus only on a subset of models (e.g. Natural Language Processing or Deep Learning) or topics (e.g. covid-19), while this work focuses on textual content-based features but does not exclude any kind of algorithm or topic. Finally, some of the existing reviews tend to be too general and do not cover in enough detail the types of features and models that, in most cases, provide better performance. To our knowledge, this is the first attempt at identifying, on average, the best-performing models and features over multiple datasets/topics tested in multiple works.
Table 1: Differences between this work and other reviews in the literature. The symbol "×" denotes the complete absence of a characteristic in a work; written text may represent a complete or partial presence of the characteristic. The columns compared are: algorithms description, features description, datasets description, performance analysis, and year.

this work (2022). Algorithms: it includes a description of both Machine Learning (ML) and Deep Learning (DL) models; specifically, traditional ML, ensemble ML, DL, pre-trained DL, and Mixed models. Features: it describes the features and organizes them into groups. Datasets: it shows most of the available datasets in the literature with links to them. Performance: it provides an extensive performance analysis of all works by showing the best-performing algorithms and features over multiple datasets.
[6] (2022). Algorithms: it shows the accuracy of individual works over machine and deep learning algorithms, but there is no short description of each of them. Features: ×. Datasets: ×. Performance: the study shows the accuracy of individual works over machine and deep learning algorithms, but there is no average performance evaluation over the algorithms in the literature and no reference to features.
[7] (2022). Algorithms: it lists the algorithms used, but no description is provided for each of them. Features: there is a reference to feature groups (e.g. linguistic), but no description of individual features or of the groups themselves is provided. Datasets: ×. Performance: ×.
[8] (2021). Algorithms: it is focused only on Natural Language Processing models. Features: ×. Datasets: ×. Performance: the performances are represented by a small test conducted on some traditional machine learning algorithms on one dataset.
[9] (2021). Algorithms: it lists the machine and deep learning models used in the reviewed works related to covid-19, but does not describe them. Features: it lists the features used in the reviewed works related to covid-19, but does not describe them. Datasets: it lists datasets for fake news detection, but with no description of them. Performance: ×.
[10] (2021). Algorithms: it provides a description only of traditional Machine Learning models (also excluding ensemble learning models). Features: ×. Datasets: ×. Performance: ×.
[11] (2020). Algorithms: it analyses in detail a small minority of Deep Learning algorithms. Features: ×. Datasets: ×. Performance: ×.
[12] (2019). Algorithms: a minority of the algorithms used in the literature are shown, but there is no description associated with them. Features: a minority of the features used in the literature are listed and only some of them are described. Datasets: some datasets used in the literature are listed and described. Performance: ×.
[13] (2018). Algorithms: some algorithms used in the literature are cited but not described. Features: ×. Datasets: an extensive description of the datasets is provided. Performance: the individual accuracy of some algorithms is shown.
[14] (2017). Algorithms: ×. Features: ×. Datasets: some datasets are listed and properly described. Performance: ×.
Despite the efforts to build automatic fake news detectors, further improvements may be achieved by carefully selecting the best-performing features and models. To counter the influence that fake news has on our society, it is critical to provide indications to researchers on how to improve the performance of Automatic Fake News Detectors. The complexity and variety of models and features used in the literature often make research slow and inefficient, creating confusion rather than helping build better fake news detectors. Furthermore, existing systematic reviews tend to describe a more generic picture: they include content-based fake news detection, but with little focus on details. Figure 1, taken from the systematic review in [1], shows a classification of the types of features used in automatic fake news detection and gives an idea of their heterogeneity.

Figure 1: Classes of features used for automatic fake news detection

Because of the heterogeneity and complexity of the models and features related to Automatic Fake News Detection, our systematic review, in contrast with existing works in the literature, focuses on content-based models and features and therefore also excludes hybrid approaches. This work is structured similarly to other reviews in the field, with the goal of facilitating comparison among them. The expected benefit is to provide guidelines to researchers facing the issue of content-based fake news detection and to system developers dealing with the implementation of accurate content-based fake news detectors.
The remainder of the work is structured as follows: Section 2 introduces the methodology adopted in the systematic literature review; Section 3 answers the research questions and presents the results; Section 4 discusses the findings and shows opportunities and gaps in the area. Finally, Section 5 includes the conclusion, highlighting the scientific contributions of the work and the challenges to be addressed in the future.
2. Methodology
In this section, the procedures, methods, and decisions adopted in the development of this systematic literature review are presented.

2.1. Study design


The work focuses on analyzing models and features for the task of Automatic Fake News Detection. This study has been designed by following a schema widely used in the literature [15]:
1. Research Questions: present the proposed research questions;
2. Search strategy: expose the strategy used to collect data;
3. Article selection: introduce the adopted criteria for study selection;
4. Distribution of studies: explain the chronological distribution of the selected articles;
5. Quality assessment: introduce the quality assessment of the studies;
6. Data extraction: apply the research questions and point out the useful information from the selected articles.

2.2. Research questions


The definition of research questions (RQ) was an important step in this study. The questions
are designed to follow the main aims of the study: reviewing features, machine learning (ML)
and Deep Learning (DL) models, and research trends for CBFND. The questions are listed
below together with a brief description of how we deal with each of them in this work.

1. RQ1: What machine learning models are used in CBFND?


Description of RQ1: We classify and describe the machine learning models used in the literature, and indicate how frequently they are used and in which works.
2. RQ2: What features are used in CBFND machine learning models?
Description of RQ2: We classify and describe the machine learning features used in the literature, and indicate how frequently they are used and in which works.
3. RQ3: What are the datasets used in the literature?
Description of RQ3: We describe the datasets used in the literature and we list for
each dataset the works using it.
4. RQ4: What are the best-performing algorithms and features used in literature?
Description of RQ4: For each algorithm and feature, we identify the ones that perform best in relation to the works using them.
2.3. Search strategy
With the aim of selecting studies to answer the research questions, some keywords were identified to compose the search string. This string was divided and combined with Boolean operators, and also included synonyms and acronyms. The applied search string was "(Fake news detection) AND (Machine Learning OR Deep Learning)". The search string described in EC3 is fairly general because of the generality of the terms "Machine Learning" and "Deep Learning". As shown later in the work, the result is that models such as CNN, LSTM, and others are eligible to appear in the selection because, for example, the term Deep Learning is logically related to the term CNN during the search. The articles resulting from the search were filtered using the exclusion criteria listed in Table 2.

Table 2: Exclusion criteria.


Exclusion criteria Description
EC1 Articles not in the English language
EC2 Dissertations, theses, and books
EC3 Articles whose abstract, title, keywords, or text do not include "Fake news detection" and "machine learning", or "Fake news detection" and "Deep Learning"
EC4 Articles published before 2015
EC5 Articles that include features other than content-based ones
EC6 Articles using datasets with a language different from English
EC7 Articles with an unclear specification of features (e.g. only a class of features is specified but not the actual features used)
EC8 Articles using private datasets that cannot be reconstructed from public ones (e.g. we did not exclude articles merging and filtering multiple public datasets to create a private one)

2.4. Article selection


To select the studies, four of the most important database repositories were considered, as shown in Table 3.
By applying the search string and removing articles published before 2015, 145 articles were found. The exclusion criteria were then applied and duplicates were removed (duplicates occur because some works appear in more than one database).

Table 3: Selected Databases.

Acronym Portal
SCOPUS SCOPUS (Elsevier database)
ScienceDirect ScienceDirect (Elsevier database)
IEEE IEEE Xplore Digital Library
ACM ACM Digital Library
Table 4: Selected Studies.
Study name
Liar, Liar Pants on Fire: A New Benchmark Dataset for Fake News Detection [16]
Fake news detection using naive Bayes classifier [17]
Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques [18]
Automatic Detection of Fake News [19]
A Tool for Fake News Detection [20]
Fake News Detection: A Deep Learning Approach [21]
A Benchmark Study on Machine Learning Methods for Fake News Detection [22]
Comparative Performance of Machine Learning Algorithms for Fake News Detection [23]
Which machine learning paradigm for fake news detection? [24]
Semantic Fake News Detection: A Machine Learning Perspective [25]
Supervised Learning for Fake News Detection [26]
Detecting Fake News using Machine Learning and Deep Learning Algorithms [27]
Behind the cues: A benchmarking study for fake news detection [28]
Deep learning methods for Fake News detection [29]
Machine Learning Methods for Fake News Classification [30]
Comparison of Various Machine Learning Models for Accurate Detection of Fake News [31]
A Closer Look at Fake News Detection: A Deep Learning Perspective [3]
Fake News Detection with Semantic Features and Text Mining [32]
dEFEND: Explainable Fake News Detection [33]
Learning Hierarchical Discourse-level Structure for Fake News Detection [34]
Sentiment Aware Fake News Detection on Online Social Networks [35]
Content Based Fake News Detection Using N-Gram Models [36]
Fake News Detection Using Machine Learning Ensemble Methods [37]
Performance Comparison of Machine Learning Classifiers for Fake News Detection [38]
Fake news detection in multiple platforms and languages [39]
FNDNet – A deep convolutional neural network for fake news detection [40]
Fake News Detection using Deep Learning [41]
Fake News Detection with Different Models [42]
Fake News Detection: An Ensemble Learning Approach [43]
Fake News Early Detection: A Theory-driven Model [44]
Fake News Detection Using Machine Learning Approaches [45]
Linguistic feature based learning model for fake news detection and classification [46]
A benchmark study of machine learning models for online fake news detection [47]
Constraint 2021: Machine Learning Models for COVID-19 Fake News Detection Shared Task [48]
Fake news detection: A hybrid CNN-RNN based deep learning approach [2]
WELFake: Word Embedding Over Linguistic Features for Fake News Detection [49]
Fake Detect: A Deep Learning Ensemble Model for Fake News Detection [5]
Transformer based Automatic COVID-19 Fake News Detection System [50]
FakeBERT: Fake news detection in social media with a BERT-based deep learning approach [51]
Evaluating Deep Learning Approaches for Covid19 Fake News Detection [52]
In Fig. 2, details of the filtering process are presented. After the process, the 40 selected articles were used to elaborate the systematic literature review. In Table 4, the selected studies are shown, sorted from least recent to most recent.

Figure 2: Application of exclusion criteria.

2.5. Quality assessment


In this work, the quality of the selected articles was verified through the application of the questions presented in Table 5. The criteria in the table are inspired by [53]. All of the selected studies satisfy five or more of the criteria. For this reason, we decided not to exclude any further studies from the selected list.

Table 5: Quality criteria.

Identifier Description
QC1 Is the goal of the study in the article clear?
QC2 Does the study include the literature review or background?
QC3 Is the main contribution of the study clear?
QC4 Does the article explain a research methodology?
QC5 Does the study show research results?
QC6 Is the conclusion of the study related to the research aim?
QC7 Does the article include future works and developments?
3. Results
This section answers the proposed research questions. To this end, the results obtained from
the 40 final reviewed works are shown and discussed.

3.1. RQ1 What machine learning models are used in CBFND?


In the analysed studies, a wide variety of machine learning (ML) and deep learning (DL) algorithms are used. In this section, we provide a brief description of each algorithm as well as the studies that used it. For simplicity, we divided the algorithms into groups. As with features, we do not exclude the possibility of having missed other algorithms or variants of algorithms (in particular concerning Deep Learning algorithms) that are used in the literature but not in the considered studies.

The groups of algorithms are: 1) Machine Learning (ML); 2) Ensemble Machine Learning; 3) Deep Learning (DL); 4) pre-trained Deep Learning; 5) Mixed models (models that use a mix of multiple sub-models). To give better context, we include some basic information related to machine learning in general and the task associated with CBFND; furthermore, for each group of algorithms, we provide an additional short introductory description.

The two most important higher-level classes of machine learning algorithms are supervised learning, which relies on labeled input and output training data, and unsupervised learning, which processes unlabeled or raw data. In this review we focus our attention on supervised learning algorithms; in particular, a supervised machine learning algorithm has the goal of predicting output values from given input data. The two major tasks of supervised machine learning algorithms are classification and regression. The most important difference between them is that regression predicts a continuous quantity, while classification predicts discrete class labels. CBFND is a classification task, and in its simplest form it consists of predicting whether a text should be considered fake news or not. The most important metrics for measuring the performance of classification algorithms are accuracy, precision, recall, and f1-score.
Accuracy is the ratio of correct predictions out of all predictions. It is not a good metric when the dataset is unbalanced; using accuracy in such scenarios can result in a misleading interpretation of the results.
Precision is the ratio of true positives over the sum of true positives and false positives. The question this metric answers is: what proportion of positive identifications was actually correct?
Recall is the ratio of true positives over the sum of true positives and false negatives. The question recall answers is: what proportion of actual positives was identified correctly?
F1-score combines precision and recall into a single metric that ranges from 0 to 1; being the harmonic mean of precision and recall, it is considered a better measure than accuracy.
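As a concrete illustration (our own minimal sketch, not code from any reviewed work), the snippet below computes the four metrics with scikit-learn; the label vectors are invented for illustration.

```python
# Minimal sketch: computing the four classification metrics with scikit-learn.
# The label vectors below are invented for illustration (1 = fake, 0 = real).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # gold labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

print("accuracy :", accuracy_score(y_true, y_pred))    # correct / all
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1 score :", f1_score(y_true, y_pred))          # harmonic mean of the two
```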
Traditional machine learning uses algorithms to parse data, learn from that data, and make informed decisions based on what has been learned. These algorithms have been widely adopted for solving the task of fake news detection, as they are fast to train and implement, do not require dedicated hardware, and achieve good results on small datasets. In Table 6 we list the algorithms used in the reviewed papers along with a short description of each algorithm.
Table 6: Traditional Machine Learning algorithms.

Support Vector Classifier: maps training examples to points in space so as to maximise the width of the gap between two or more categories; new examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. Studies: [16, 18, 19, 20, 47, 23, 25, 26, 27, 28, 31, 34, 35, 37, 38, 39, 42, 43, 45, 48, 49, 50]
Logistic Regression: models the probability of one event (out of two alternatives) taking place by having the logarithm of the odds for the event be a linear combination of one or more independent variables. Studies: [16, 18, 47, 24, 25, 27, 31, 35, 37, 38, 42, 48, 2]
Naive Bayes: applies Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable. Variants include Gaussian Naive Bayes, where the likelihood of the features is assumed to be Gaussian, and Multinomial Naive Bayes (a variant for multinomially distributed data). Studies: [17, 26, 27, 28, 31, 32, 37, 39, 43, 45, 48, 49, 20, 24, 25, 40, 2, 51, 23]
K-nearest neighbours: classifies an object by a plurality vote of its neighbors, the object being assigned to the class most common among its k nearest neighbors. Studies: [18, 47, 23, 26, 28, 39, 40, 43, 45, 2, 49]
Multilayer perceptron: consists of three types of layers of nodes: an input layer, a hidden layer, and an output layer. Except for the input nodes, each node is an artificial neuron that uses a nonlinear activation function. Studies: [21, 47, 23, 24, 38, 40, 41, 42, 46, 48, 50, 51]
Decision Trees: a non-parametric algorithm with a hierarchical tree structure. It employs a divide-and-conquer learning strategy, conducting a greedy search to identify the optimal split points within a tree; the splitting process is repeated until all, or the majority of, records have been classified under specific class labels. Studies: [18, 47, 24, 25, 28, 30, 31, 35, 37, 38, 40, 45, 2, 49, 51]
Passive Aggressive Classifier: an online learning algorithm; it responds as passive (the model remains unchanged) for correct classifications and as aggressive (the model is tuned) for any miscalculation. Studies: [50]
Ensemble learning refers to algorithms that combine the predictions of two or more models. The most common ensemble learning strategies are Bagging, Stacking, and Boosting. Bagging involves using multiple model instances of a single type of machine learning algorithm and training each model on a different sample of the same dataset; Stacking involves training multiple types of algorithms on the same dataset; in Boosting, the models are fit and added to the ensemble sequentially, such that each model tries to correct the wrong predictions of the previous one, and the predictions made by the models are combined by voting or averaging. In Table 7 we list and describe the ensemble algorithms used in the analysed works; a minimal sketch of the three strategies is shown after the table.

Table 7: Ensemble Machine Learning algorithms.

Random Forest: made up of a collection of decision trees; each tree is trained on a data sample drawn from the training set with replacement, called the bootstrap sample, of which one-third is set aside as test data. Another instance of randomness is then injected through feature bagging, adding more diversity to the dataset and reducing the correlation among decision trees. A majority vote yields the predicted class. Studies: [47, 23, 24, 25, 26, 28, 32, 35, 36, 38, 39, 40, 42, 44, 45, 48, 2, 51]
Gradient Boosting: a prediction model in the form of an ensemble of weak prediction models, typically decision trees. The algorithm builds an additive model in a forward stage-wise fashion and allows for the optimization of arbitrary differentiable loss functions; in each stage, N regression trees are fit on the negative gradient of the loss function. Studies: [47, 23, 36, 38, 45]
eXtreme Gradient Boosting: one of the fastest implementations of gradient boosted trees. It considers the potential loss for all possible splits to create a new branch, and tackles this inefficiency by looking at the distribution of features across all data points in a leaf and using this information to reduce the search space of possible feature splits. Studies: [26, 35, 38, 44]
AdaBoost: an adaptive ensemble algorithm in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner. Studies: [47, 23, 28, 37, 2, 49]
Bagging classifier: an ensemble meta-estimator that fits base classifiers (generally decision trees) on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Studies: [49]
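The following minimal sketch (our own illustration, under the assumption of a generic feature matrix X and binary labels y) shows the three strategies side by side with scikit-learn; in a CBFND pipeline, X would typically be a TF-IDF matrix.

```python
# Minimal sketch of the three ensemble strategies with scikit-learn;
# the synthetic data stands in for any feature matrix and fake/real labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

models = {
    "bagging":  BaggingClassifier(n_estimators=50, random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
    "stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("nb", GaussianNB())],
        final_estimator=LogisticRegression()),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f}")
```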
The term Deep Learning identifies a group of models based on artificial neural networks in which the learning process is deep. Specifically, the structure of artificial neural networks is made of different layers; each layer is formed by units that transform the input data into information that the next layer uses to perform a predictive task. The deep learning group includes two major subclasses of models, Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs); both extract features automatically from the input.

Table 8: Deep learning algorithms.

Long Short Term Memory (LSTM): remembers information over data iterations. There are four gates in an LSTM: 1) the forget gate decides what information is useful in the cell state and what should be thrown away; 2) the input gate adds new and updated information to be stored in the cell state; 3) the output gate selects useful information from the current cell state and exposes only the relevant output; 4) the input modulation gate regulates the values flowing through the network. Studies: [25, 27, 29, 3, 32, 34, 37, 40, 42, 43, 46, 2, 50, 51, 52, 35]
Bidirectional LSTM (Bi-LSTM): a variant of the LSTM model: it processes the data in the "past" and "future" directions with two separate hidden layers, which are then fed forward to the same output layer. Studies: [34, 37, 5, 25, 50, 52]
Gated Recurrent Units (GRUs): a gating mechanism in recurrent neural networks; similarly to LSTM, it has a forget gate, but it lacks an output gate. GRUs have been shown to exhibit better performance on certain smaller and less frequent datasets. Studies: [25, 33]
Bidirectional GRU (Bi-GRU): as with Bi-LSTM, Bi-GRU is a variant of GRU that can access long-range context in both input directions. Studies: [33]
Convolutional Neural Networks (CNN): a model with three main types of layers: 1) convolutional layers; 2) pooling layers; 3) fully-connected layers. With each layer, the CNN identifies greater portions of the input: earlier layers focus on simple features and, as the data progresses through the layers, the network starts to recognize larger elements of the object. Studies: [16, 25, 29, 37, 40, 43, 2, 50, 51, 52]
CapsNet: adds structures called "capsules" to a CNN and reuses output from those capsules to form more stable representations for higher capsules. The output is a vector consisting of the probability of an observation and a pose (position and orientation) for that observation. Studies: [25]
Hierarchical Attention Network (HAN): built from bidirectional RNNs composed of GRUs/LSTMs with attention mechanisms arranged in "hierarchies", where the outputs of the lower hierarchies become the inputs to the upper hierarchies. Studies: [37, 52]
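To make the structure of a recurrent text classifier concrete, the sketch below (our own illustration; the vocabulary size, sequence length, layer sizes, and toy data are all assumptions, not choices from the reviewed works) builds a Bi-LSTM fake-news classifier in Keras.

```python
# Minimal sketch of a Bi-LSTM fake-news classifier in Keras.
# vocab_size, max_len, and the toy data are assumptions for illustration.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, max_len = 20000, 300          # assumed preprocessing choices
model = keras.Sequential([
    keras.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, 128),      # learned word embeddings
    layers.Bidirectional(layers.LSTM(64)),  # the Bi-LSTM variant from Table 8
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # P(fake)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy integer-encoded articles and labels, just to show the training call.
X = np.random.randint(1, vocab_size, size=(32, max_len))
y = np.random.randint(0, 2, size=(32,))
model.fit(X, y, epochs=1, batch_size=8)
```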
The pre-trained deep learning group includes all the models that do not require training from scratch, as they have already been trained and show good average performance on a wide variety of datasets in different domains. This group is comprised mostly of BERT-based models, which, in recent years, have become increasingly important for the task of Fake News Detection. Pre-trained models have the advantage of being reliable starting points for creating fine-tuned models that perform better on domain-specific tasks.
Table 9: Pre-trained Deep Learning algorithms.

BERT: a word embedding model that, in comparison with other word embedding models (such as Word2Vec), is able to distinguish the meaning of the same or similar words used in different contexts. As with other embedding models, it must be used in conjunction with a classifier; in the literature, the term BERT alone is associated with a model made of BERT followed by a dense neural network. Studies: [3, 47, 50, 52]
RoBERTa: the Robustly optimized BERT approach is a retraining of BERT with an improved training methodology. RoBERTa removes the Next Sentence Prediction (NSP) task from BERT's pre-training and introduces dynamic masking, so that the masked token changes during the training epochs. Studies: [47]
DistilBERT: learns an approximate version of BERT, retaining 97% of its performance while using only half the number of parameters. Studies: [47, 52]
ELECTRA: differs from BERT because, instead of masking words, it uses a small BERT-like network as a generator to replace some words with its predictions; the main discriminator network is then used to determine which of the words have been replaced. Studies: [47]
ELMo: a character-based model using character convolutions, and for this reason it can handle out-of-vocabulary words; however, the learned representations are words. Both ELMo and BERT can generate different word embeddings for a word, capturing its context; however, ELMo uses LSTMs internally, while BERT uses transformers. Studies: [47]
ALBERT: a lighter version of BERT. Studies: [50]
XLNet: a large bidirectional transformer that uses an improved training methodology, larger data, and more computational power to achieve better prediction metrics than BERT. Studies: [50]
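The sketch below (our own illustration; the checkpoint name "bert-base-uncased", the label convention, and the toy texts are assumptions) shows the typical pattern of fine-tuning a pre-trained BERT for binary fake-news classification with Hugging Face Transformers.

```python
# Minimal sketch of fine-tuning a pre-trained BERT for binary fake-news
# classification with Hugging Face Transformers; dataset loading is omitted.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 0 = real, 1 = fake (assumed convention)

texts = ["Officials confirm the report.", "Aliens endorse candidate X."]
labels = torch.tensor([0, 1])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# One gradient step; in practice this runs inside a full training loop/Trainer.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
print("loss:", outputs.loss.item())
```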
In the following group, we list all models that are made of multiple sub-models. Many of these models include the concept of attention. Attention is an interface connecting the encoder and decoder that provides the decoder with information from every encoder hidden state; with this framework, the model is able to selectively focus on valuable parts of the input sequence and, hence, learn the associations between them. We omit the description of these models, as the identifier already represents how the models are structured; for example, CNN+LSTM is an LSTM in which the input is transformed by convolutions, and GRU + Attention is a GRU using the attention mechanism.

Table 10: Mixed learning algorithms.

Identifier Studies
CNN+LSTM [16, 37, 2, 50]
LSTM + Attention [35]
BI-LSTM + Attention [34, 37, 5]
GRU + Attention [25]
CNN+BI-LSTM + attention [3]
CNN+Bi-GRU [34]
CNN+HAN [37]
CNN + BERT [51]
LSTM + BERT [51]
Naive Bayes + BERT [51]
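As an example of such a composition, a minimal CNN+LSTM sketch follows (our own illustration; all layer sizes are assumptions): the convolution extracts local n-gram patterns and the LSTM models their sequence.

```python
# Minimal sketch of a CNN+LSTM mixed model: convolutions extract local
# n-gram features, the LSTM models their sequence. Sizes are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(300,)),                            # assumed sequence length
    layers.Embedding(20000, 128),
    layers.Conv1D(64, kernel_size=5, activation="relu"),  # local pattern extraction
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(64),                                      # sequence modelling
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```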
3.2. RQ2 What features are used in CBFND machine learning models?
Numerous features are used in the task of CBFND. In the analysed studies, features are often grouped into several categories such as linguistic, grammatical, readability, and sentiment features. Although most of these groups are associated with similar features, we noticed that some works include the same feature in different groups; for this reason, we propose here a simplified division of the features into groups, intending to provide more intuitive feature groups.

Each of the tables below represents a feature group; moreover, for each feature in the group, we provide a brief description as well as the studies that use it. We omit the description of a feature whenever its identifier is self-explanatory. In addition, although we consider the list of features below exhaustive, we do not exclude the possibility of having missed other important features or variants of features that are used in the literature but not in the considered studies.

The groups of features are:

• Table 11 - Frequency-based: features that involve the calculation of a frequency; specifically, we group here features using frequencies associated with single words in the text and/or n-grams (n-grams are contiguous sequences of n items from a given sample of text or speech).

• Table 12 - Embedding: embeddings are representations of words/characters/sentences defined by a model. An embedding is a low-dimensional space into which high-dimensional vectors can be translated; embeddings make it easier to apply machine learning to sparse vectors representing words.

• Table 13 - Word tagging: features that associate tags with words. The most used ones are Part-of-Speech tags and Named Entities; both are usually computed by employing a machine learning algorithm.

• Table 14 and Table 15 - Absolute/relative quantity: features associated with an absolute or relative count of an element. Some examples are the number of words, characters, or adjectives, or the corresponding percentages. Some of these features, such as the number/percentage of adjectives/nouns/verbs/adverbs, are the outcome of the process of Part-of-Speech tagging (POS tagging).

• Table 16 - Readability: features representing the level of ease of reading the text. Readability focuses on textual content such as lexical, semantic, syntactic, and discourse cohesion analysis. In the context of Machine Learning and Fake News Detection, several indices are used to extract readability-related features, most of which can be computed with the Linguistic Inquiry and Word Count (LIWC) software.1

• Table 17 - Sentiment and emotion-based: a group formed by all those features that capture high-level sensations that a text expresses.

• Table 18 - Other: contains all the features to which we did not assign any group but that have been used in some of the analysed studies.

Table 11: Frequency-based features.

Term Frequency (TF): the number of occurrences of a term in a document of the corpus, i.e. the ratio of the number of times the word appears in a document to the total number of words in that document. Studies: [18, 36, 46]
Term Frequency - Inverse Document Frequency (TF-IDF): the product of Term Frequency (TF) and Inverse Document Frequency (IDF), where IDF is a score that measures how important a term is; rarely occurring terms have a high IDF score. Therefore, TF-IDF increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word. Studies: [18, 21, 34, 36, 37, 40, 41, 42, 45, 47, 48, 51, 20, 32, 23, 24, 27, 31, 38, 43, 49]
Bag of Words: a representation of text that describes the occurrence of words within a document. It involves two things: 1) a vocabulary of known words; 2) a measure of the presence of known words. Any information about the order or structure of words in the document is discarded. Studies: [17, 19, 20, 21, 27, 30, 31, 39, 40, 41, 42, 43, 44, 49, 51]
Bag of N-grams: a generalization of bag of words; n-grams are contiguous sequences of n items in a text or speech, where the items can be phonemes, syllables, letters, words, characters, bytes, or any sequence of data. Bags of words are 1-grams, while it is also possible to use bigrams, trigrams, and so on. Bag of N-grams and TF or TF-IDF can be used together to calculate frequencies of n-grams (e.g. TF-IDF on bigrams). Studies: [26]

1 [Link]
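As an illustration of the frequency-based group (our own sketch; the two example documents are invented), scikit-learn exposes both representations directly:

```python
# Minimal sketch of the frequency-based features in Table 11 with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["breaking news: shocking claim goes viral",
        "officials confirm the report in a press briefing"]

bow = CountVectorizer()                       # bag of words (1-grams)
tfidf = TfidfVectorizer(ngram_range=(1, 2))   # TF-IDF over uni- and bigrams

print(bow.fit_transform(docs).toarray())      # raw term counts per document
print(tfidf.fit_transform(docs).shape)        # (n_docs, n_features)
```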
Table 12: Embedding features.

GloVe: an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. Studies: [23, 24, 25, 3, 32, 37, 40, 47, 2, 50, 51, 52]
Word2Vec: creates vectors of the words that are distributed numerical representations of word features; these word features can comprise words that represent the context of the individual words present in the vocabulary. Studies: [21, 29, 39, 42, 43, 44]
FastText: an extension of Word2Vec; instead of feeding individual words into a neural network, FastText breaks words into several n-grams. Studies: [50, 52]
Spacy embedding: Spacy is an open-source Python library used in natural language processing; Spacy embeddings are pre-trained word embeddings. Studies: [38]
Sentence2Vec: represents sentences and their semantic information as vectors. Studies: [44]
Character embeddings: character embedding uses a one-dimensional convolutional neural network to find a numeric representation of words by looking at their character-level compositions. Studies: [37, 47]
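The sketch below (our own illustration; averaging word vectors is one common simple strategy, not the only one) shows how pre-trained GloVe vectors can be turned into a document-level feature vector with gensim.

```python
# Minimal sketch of turning a document into a GloVe embedding feature vector;
# gensim's downloader fetches the pre-trained vectors on first use.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # 50-dimensional GloVe vectors

def doc_vector(text):
    # Average the vectors of in-vocabulary words (a common simple strategy).
    words = [w for w in text.lower().split() if w in glove]
    return np.mean([glove[w] for w in words], axis=0) if words else np.zeros(50)

print(doc_vector("fake news spreads faster than true news").shape)  # (50,)
```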

Table 13: Word tagging features.

POS tagging: the process of assigning to each word the particular part of speech corresponding to it, based on its context and meaning. Studies: [26, 43, 44]
Named Entities: named entities are proper names and quantities of interest. Person, organization, and location names can be marked in a text, as well as dates, times, and percentages; the task of identifying named entities is called Named-Entity Recognition. Studies: [25]
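Both tags can be extracted with off-the-shelf NLP tools; a minimal sketch with spaCy follows (assuming the small English model has been installed with `python -m spacy download en_core_web_sm`).

```python
# Minimal sketch of POS tagging and Named-Entity Recognition with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Reuters reported that the senator visited Berlin in March.")

print([(token.text, token.pos_) for token in doc])    # part-of-speech tags
print([(ent.text, ent.label_) for ent in doc.ents])   # named entities
```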
Table 14: Relative quantity features.

Identifier Studies
Average number of words per sentence [28, 39, 44, 49]
Average number of syllables per word [28]
Average number of characters per word [44]
Average word length [37, 44, 45, 47]
Percentage of adverbs [28, 39, 49]
Percentage of adjectives [47, 28, 39, 49]
Percentage of articles [28]
Percentage of exclamation marks [39]
Percentage of negations [28]
Percentage of nouns [39]
Percentage of prepositions [47, 28]
Percentage of question marks [39]
Percentage of uppercase characters [39]
Percentage of verbs [47]
Percentage of words longer than 6 letters [28]
Table 15: Absolute quantity features.

Identifier Studies
Number of characters [19, 39, 46]
Number of uppercase/special characters [46, 49]
Number of words [28, 37, 39, 45, 46, 47, 49]
Number of long words [19, 28, 44]
Number of words in the title [46]
Number of syllables [19, 28, 44, 49]
Number of numbers [37, 44, 45, 47]
Number of determinants [49]
Number of adverbs [19, 46, 49]
Number of verbs [19, 26, 44, 46, 49]
Number of nouns [19, 44, 46]
Number of pronouns [19, 46]
Number of adjectives [19, 26, 37, 44, 45, 46, 47, 49]
Number of articles [49]
Number of long/short sentences [28, 49]
Number of sentences [28, 44, 49]
Number of paragraphs [19, 44]
Number of conjunctions [28]
Number of punctuation characters [47, 26, 44]
Number of exclamation marks [37, 44, 47]
Article length [37, 45, 47]
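Since most of the features in Tables 14 and 15 are simple counts, they can be computed with very little code; the following sketch (our own illustration; the tokenizer is deliberately naive) derives a few of them with the standard library only.

```python
# Minimal sketch of a few absolute/relative quantity features from Tables 14
# and 15, using only the standard library; the tokenizer is deliberately naive.
import re

def quantity_features(text):
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "n_characters": len(text),
        "n_words": len(words),
        "n_sentences": len(sentences),
        "n_long_words": sum(1 for w in words if len(w) > 6),
        "pct_uppercase": sum(c.isupper() for c in text) / max(len(text), 1),
        "avg_words_per_sentence": len(words) / max(len(sentences), 1),
    }

print(quantity_features("SHOCKING discovery! Experts are stunned. Read more."))
```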

All the readability features listed in Table 16 produce an approximate representation of the education grade level needed to comprehend the text; however, they use different formulas and algorithms.
Table 16: Readability features.

Automated readability index: 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43, where characters is the number of letters and numbers, words the number of words, and sentences the number of sentences. Studies: [19, 44, 46, 48, 49]
Coleman Liau index: (0.0588 × letters) − (0.296 × sentences) − 15.8, where letters is the average number of letters per 100 words and sentences is the average number of sentences per 100 words. Studies: [44, 46, 48]
Flesch Kincaid Score: 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59, where words, sentences, and syllables are the counts of them. Studies: [19, 28, 44, 46, 48]
Flesch reading ease: 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words), where words, sentences, and syllables are the counts of them. Studies: [19, 44, 46, 48]
Gunning fog index: 0.4 × [(words / sentences) + 100 × (complex words / words)], where words, sentences, and complex words are the counts of them, and complex words are those consisting of three or more syllables. Studies: [19, 44, 46, 49]
Lensear write formula: calculated as follows: a) count a 100-word sample; b) count one point for each one-syllable word; c) give three points for each sentence in the 100-word sample, to the nearest period or semicolon; d) to obtain the final score, add together the one-syllable word count and the three points for each sentence. Studies: [46]
Smog index: calculated by using a piece of text that is 30 sentences or longer and doing the following: a) counting ten sentences near the beginning of the text, ten in the middle, and ten near the end, totaling 30 sentences; b) counting every word with three or more syllables; c) taking the square root of this count, rounded to the nearest perfect square; d) adding three to this figure. Studies: [46, 49]
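Two of these indices are implemented in the sketch below (our own illustration; the syllable counter is a rough vowel-group heuristic rather than a dictionary-based one, so the scores are approximate).

```python
# Minimal sketch of two readability indices from Table 16; the syllable
# counter is a rough vowel-group heuristic, not a dictionary-based one.
import re

def counts(text):
    words = re.findall(r"[A-Za-z]+", text)
    sentences = max(len([s for s in re.split(r"[.!?]+", text) if s.strip()]), 1)
    chars = sum(len(w) for w in words)
    syllables = sum(max(len(re.findall(r"[aeiouy]+", w.lower())), 1) for w in words)
    return len(words), sentences, chars, syllables

def automated_readability_index(text):
    words, sentences, chars, _ = counts(text)
    return 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43

def flesch_kincaid_grade(text):
    words, sentences, _, syllables = counts(text)
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

sample = "The committee approved the proposal. Further analysis is required."
print(automated_readability_index(sample), flesch_kincaid_grade(sample))
```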
Table 17: Sentiment/emotion-based features.

Polarity: defines the orientation of the expressed sentiment; in other words, it determines whether the text expresses a positive, negative, or neutral sentiment of the user about the entity in consideration. Studies: [23, 25, 26, 37, 39, 44, 46, 47, 49]
Subjectivity: a number between 0 and 1 associated with personal opinion, emotion, or judgment, in contrast to facts. Studies: [25, 26, 44, 46, 49]
Toxicity score: indicates how toxic a text is; one of the most used ones is the Google Toxicity Score. Studies: [26]
Ratio between the number of negative and positive words: gives an idea of the negativity of a text. In the literature, the LIWC software is often used to get the positive or negative label associated with each word. Studies: [35, 44]
Number of affective terms: the number of terms that express an emotion. Studies: [28]
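Polarity and subjectivity are readily available in off-the-shelf libraries; a minimal sketch with TextBlob follows (our own illustration; the example sentence is invented).

```python
# Minimal sketch of polarity and subjectivity features with TextBlob
# (pip install textblob); polarity is in [-1, 1], subjectivity in [0, 1].
from textblob import TextBlob

blob = TextBlob("This outrageous report is a complete and utter disgrace.")
print("polarity:    ", blob.sentiment.polarity)       # negative here
print("subjectivity:", blob.sentiment.subjectivity)   # highly subjective
```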
Table 18: Other features.

TF-IDF cosine similarity between headline and content: the cosine similarity of the headline and content vectors, calculated by taking their dot product and dividing it by the product of their norms. Studies: [21, 23]
Context-Free Grammar: a formal grammar whose production rules are of the form A → X, with A a single nonterminal symbol and X a string of terminals and/or nonterminals. A formal grammar is "context free" if its production rules can be applied regardless of the context of a nonterminal. Studies: [19, 44]
Rewrite rules: a rule of the form A → X, where A is a syntactic category label, such as noun phrase or sentence, and X is a sequence of such labels and/or morphemes, expressing the fact that A can be replaced by X in generating the constituent structure of a sentence. Example: S → NP VP, which means: a sentence consists of a noun phrase followed by a verb phrase. Studies: [44]
Encoders: a component of a machine/deep learning algorithm that compresses the input data into an encoded representation that is smaller than the input data. Studies: [33]
Discourse-level structure: uses a dependency parsing approach to represent the hierarchical structure of a document as a dependency tree. In dependency parsing of discourse units, there is the need to identify whether a discourse unit semantically depends on another one; if so, a parent-child link is established. Studies: [34]
Lexical Diversity: a measure of how many different words appear in a text. Studies: [23]
Spelling errors: checks whether there are spelling errors and, if so, how many. Studies: [39]
Informal Language: represents the presence of informal language, usually extracted with the LIWC tool, which uses a dictionary including Swear words, Netspeak, Assent, Nonfluencies, and Fillers. Studies: [47]
Emotiveness index: given by the total number of adjectives plus the total number of adverbs, divided by the total number of nouns plus the total number of verbs. Studies: [28]

3.3. RQ3 What are the datasets used in the literature?


In this section, we briefly describe the datasets used in the analysed studies; note that "other datasets" refers to all the public datasets for which neither the authors of the papers nor the creators of the dataset provided an associated description.
Table 19: Datasets.

LIAR (Politifact) [16]: a publicly available dataset for fake news detection; a decade-long collection of 12.8K manually labeled short statements gathered in various contexts from [Link]. Studies: [16, 24, 25, 37, 41, 43, 44, 45, 47, 5]
Fake Or Real News by George McIntire: includes fake news from the 2016 USA election cycle; real news was collected from media organizations (e.g. New York Times, Bloomberg) during 2015 and 2016. It has 6.3k news items with an equal allocation of fake and real news, with half of the corpus coming from political news. Studies: [20, 23, 33, 34, 36, 37, 39, 47, 49]
ISOT Fake News Dataset: 21,417 real news items were obtained by crawling articles from [Link], and 23,481 fake news articles were collected from unreliable websites flagged by Politifact (a fact-checking organization) and Wikipedia. The dataset contains different types of articles on different topics. Studies: [18, 47, 2, 49]
The Signal Media One-Million News Articles Dataset: contains 1 million articles that are mainly in English but also include non-English and multi-lingual articles. Sources include major ones, such as Reuters, in addition to local news sources and blogs. Studies: [24]
Getting Real about Fake News: contains text and metadata from 244 websites, representing 12,999 posts in total; the data was pulled using the [Link] API. Studies: [30, 34, 38]
BuzzFace [54]: built using a collection of over 1.6 million news items posted to Facebook by nine news outlets during September 2016, annotated for veracity by BuzzFeed. Studies: [23, 26, 46]
PHEME [55]: includes rumour tweets associated with nine different breaking news events. It contains Twitter conversations initiated by a rumourous tweet, including the tweets responding to it; these tweets have been annotated for support, certainty, and evidentiality. Studies: [35]
Random Political news: includes news randomly collected from three types of sources during 2016. The labels are real, satire, and fake; some of the sources from which the news was taken are the Wall Street Journal, The Economist, and the BBC for real news; The Onion and Huffington Post Satire for satire; and Ending The Fed, True Pundit, and [Link] for fake news. Studies: [46]
BuzzFeed News: comprises a complete sample of news published on Facebook by nine news agencies over a week close to the 2016 U.S. election. Every post and the linked article were fact-checked claim-by-claim by five BuzzFeed journalists. Studies: [17, 34, 44, 49]
Covid 19 Fake News Dataset [56]: contains 10,700 fake and real news items related to COVID-19. The posts came from various social media and fact-checking websites and have been manually verified; moreover, the data is class-wise balanced. Studies: [23, 48, 50, 52]
Twitter Brasil [57]: a collection of 3.9 million tweets and 18,413 online news items around the online discussion about COVID-19 in Brazil. The Twitter data was collected through the Twitterscraper Python library using a set of keywords in Portuguese regarding COVID-19. Studies: [39]
FA-KES [58]: consists of news articles from several media outlets representing mobilisation press, loyalist press, and diverse print media. Studies: [2]
FakeNewsNet [59]: contains two datasets with news content, social context, and spatiotemporal information; constructed using an end-to-end system, FakeNewsTracker. Studies: [34]
FakeNewsAMT [60]: includes fake and real news from six different domains: technology, education, business, sports, politics, and entertainment. Studies: [19]
Celebrity: contains news about celebrities. The legitimate news was obtained from the entertainment, fashion, and style sections of news and magazine websites; the fake news was obtained from gossip websites such as Entertainment Weekly, People Magazine, and entertainment-oriented publications. The articles were manually verified. Studies: [19]
Other datasets: Fake News Dataset, NLP Real or Fake News, Fake News Challenge FNC-1, Kaggle Fake News Dataset. Studies: [47, 23, 29, 31, 40, 42, 51, 21, 3, 32]
3.4. RQ4 What are the best-performing algorithms and features used in the literature?
To analyse the performance of the wide variety of algorithms, features, and datasets, we use the average accuracy of the algorithms and features as the main metric. We use only accuracy because many works do not report all four metrics (accuracy, precision, recall, and f1-score), while all works report accuracy. In our analysis, we also consider the number of studies and datasets on which an algorithm or feature has been used and tested.
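The aggregation behind Tables 20 and 21 can be reproduced with a simple group-by; the sketch below (our own illustration; the accuracy values are invented, not taken from the reviewed works) shows the pattern with pandas.

```python
# Minimal sketch of the aggregation behind Tables 20 and 21: per-study
# accuracies (invented numbers here) are averaged per algorithm, while
# counting distinct studies and datasets.
import pandas as pd

rows = [  # (algorithm, study, dataset, accuracy) -- illustrative values only
    ("Gradient Boosting", "s1", "LIAR", 0.86),
    ("Gradient Boosting", "s2", "ISOT", 0.88),
    ("Naive Bayes", "s1", "LIAR", 0.81),
    ("Naive Bayes", "s3", "BuzzFeed", 0.83),
]
df = pd.DataFrame(rows, columns=["algorithm", "study", "dataset", "accuracy"])

summary = df.groupby("algorithm").agg(
    avg_accuracy=("accuracy", "mean"),
    n_studies=("study", "nunique"),
    n_datasets=("dataset", "nunique"),
)
print(summary)
```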

Table 20: Algorithms with average accuracy higher than 80%.

Algorithm                          Average Accuracy   Number of Studies   Number of Datasets
XLNet                              97                 1                   1
ALBERT                             97                 1                   1
LSTM + BERT                        97                 1                   1
Passive Aggressive Classifier      96                 1                   1
CNN + BERT                         92                 1                   1
Bagging classifier                 91                 3                   1
Multinomial Naive Bayes + BERT     92                 1                   1
Bi-GRU                             90                 1                   1
Gradient Boosting                  87                 5                   5
eXtreme Gradient Boosting          86                 4                   5
LSTM + Attention                   86                 1                   1
Multilayer perceptron              84                 8                   12
Naive Bayes                        82                 12                  12

In Table 20, we list the algorithms achieving an average accuracy higher than 80%, as well as the number of reviewed studies in which each algorithm was used and the number of datasets on which it has been tested. Considering the relevance of the number of studies and datasets on which an algorithm has been tested, we can state that the more robust results are given by Gradient Boosting, eXtreme Gradient Boosting, Multilayer perceptron, and Naive Bayes. It is also important to mention that the highest accuracy is given by XLNet, ALBERT, LSTM with BERT, and the Passive Aggressive Classifier; however, the number of analysed studies and datasets these results are based on is only one. Once enough articles on CBFND report recall, precision, and f1-score, this simple but effective comparison method can be reused to evaluate the performance of the analysed algorithms with respect to the full set of metrics (accuracy, precision, recall, and f1-score).
Few works in the literature analyse these algorithms in the context of CBFND, and in general in the field of Fake News Detection, indicating the need for more research to confirm the performance of these models by testing them on multiple datasets. The high performance of Gradient Boosting over multiple datasets and studies could be justified by the model's high generalization capability (given by several weak learners), as well as by its characteristic of producing better weak learners during training. Transformer pre-trained models (XLNet, BERT, and its variants) have shown the best performance compared to other Deep Learning algorithms; this result is expected, as these models have shown high generalization capabilities and have been tested and used successfully across many different research areas involving Natural Language Processing.

In Table 21 we show all the features that are present in more than one study, tested on more than one dataset, and with an average accuracy higher than 80%. Considering the number of studies and datasets on which each feature has been tested, we can state that the strongest results are given by Bag of Words, Automated Readability Index, Number of verbs, and TF-IDF. Particular attention should be given to readability features such as the Coleman Liau Index and Smog Index; however, more research is needed to confirm the results for these features.

The most performant features tend to be language- and topic-invariant and, therefore, provide higher performance on average compared to other features. Interestingly, quantity-based features such as the number of uppercase characters, verbs, adverbs, and syllables are associated with higher performance, as are readability scores, which are mostly formulas composed of quantity-based features. Furthermore, it is equally important to highlight the TF-IDF and bag of words features, which show surprisingly high average performance over many datasets. Given the analysis of features in Table 21, we can state that the most performant features show consistently high performance on different datasets. However, we do not exclude the possibility of improving detection results with the advancement of automatic feature extraction techniques in Deep Learning or by introducing new features.
Table 21: Features with average accuracy higher than 80%, tested on more than one dataset and used in more than one study.

Feature                                Average Accuracy   Number of Studies   Number of Datasets
Coleman Liau Index                     90                 2                   2
Percentage of adjectives               88                 4                   3
Smog Index                             88                 4                   2
Number of short/long sentences         88                 4                   2
Percentage of adverbs                  88                 4                   2
Number of uppercase characters         88                 4                   2
Average number of words per sentence   88                 4                   2
Gunning Fog Index                      87                 6                   3
Automated Readability Index            87                 7                   4
Number of syllables                    87                 6                   3
Bag of words                           86                 12                  15
Number of verbs                        86                 7                   4
Number of adverbs                      86                 6                   3
Term frequency                         85                 2                   2
Word2vec                               84                 6                   5
TF-IDF                                 83                 11                  13
4. Discussion
In this section research questions previously answered are discussed, as well as future chal-
lenges, opportunities, and limitations of the current review.
In this work 40 studies were reviewed, each of them using content-based features and ma-
chine learning for the problem of fake news detection. Our contribution was to summarise
the results of these studies, by providing an extensive description of features, algorithms,
and datasets used in literature, while, at the same time, giving an indication to researchers
of the best-performing models/features. In particular, in Tables 20 and 21 we provide a list
of the average accuracy performances obtained in relation to the number of works that have
used a feature or an algorithm and datasets on which they have been tested on. We consider
a ”promising result” a feature/algorithm that achieves high average accuracy, but is used in
too few studies or datasets; while we consider a ”robust result” a feature/algorithm with a
high average accuracy, a high number of studies using it, and a high number of databases on
which the feature/algorithm has been tested on.

The rationale for classifying results as promising or robust is based on the intuition that the robustness of a result increases with the number of studies that apply a specific feature/algorithm and with the number of datasets on which it is tested; conversely, a performance reported by only one or two of the reviewed studies is less robust and must be further verified by future works, in particular for features and algorithms that are not commonly used in the literature at the time of writing.
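This distinction can be summarised as a simple decision rule. The numeric thresholds in the sketch below are hypothetical, since the review reasons qualitatively rather than with fixed cut-offs:

```python
def classify_result(avg_accuracy: float, n_studies: int, n_datasets: int) -> str:
    """Illustrative encoding of the promising/robust distinction; the
    numeric thresholds are hypothetical, not taken from the review."""
    if avg_accuracy < 80:          # below the accuracy bar used in Table 21
        return "not highlighted"
    if n_studies >= 5 and n_datasets >= 3:
        return "robust"            # strong evidence across studies and datasets
    return "promising"             # high accuracy, but evidence still limited

print(classify_result(86, 12, 15))  # e.g. Bag of Words -> "robust"
print(classify_result(90, 2, 2))    # e.g. Coleman Liau Index -> "promising"
```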

Specifically, the algorithms with "promising results" are the deep learning algorithms XLNet, ALBERT, and LSTM with BERT, while those with "robust results" are Gradient Boosting, eXtreme Gradient Boosting, Multilayer Perceptron, and Naive Bayes. Among the features, the Coleman Liau Index can be classified as a "promising result", while the Automated Readability Index, Bag of Words, Number of verbs, and TF-IDF can be classified as "robust results".

4.1. Challenges and opportunities


In this section, challenges and opportunities are described, organized as follows: a standard for representing fake and real news in a dataset; multi-topic and multi-language resistant features and algorithms for CBFND; continuous learning models; association with context; and attacks on natural language learning. Model explainability, although not treated in a dedicated paragraph, is a further open challenge in this area.

Standard for representing fake and real news in a Dataset: As shown in Section 3.3, a wide variety of datasets is used for the task of fake news detection, all of them providing information to perform the CBFND task. However, a standard is needed for the processes of data collection, verification, and storage. Often the source of the data, the verification process, and the lack of topical diversity within a dataset (e.g., a focus on political topics) are barriers to progress in this research field. Furthermore, it is critical that studies use datasets that respect the principles of Open and FAIR (Findable, Accessible, Interoperable, Reusable) data, in particular with regard to interoperability and dataset metadata [61, 62].
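As a purely illustrative sketch, a FAIR-oriented dataset record could expose fields such as the following; every field name and value here is hypothetical, not part of any established schema:

```python
# Hypothetical metadata record for a fake news dataset; the field names
# and values are illustrative, not an established standard.
dataset_metadata = {
    "identifier": "doi:10.0000/example-fnd-dataset",  # Findable: persistent ID
    "access_url": "https://example.org/fnd-dataset",  # Accessible: open retrieval
    "format": "text/csv",                             # Interoperable: open format
    "license": "CC-BY-4.0",                           # Reusable: explicit terms
    "collection_process": "claims cross-checked against fact-checking sites",
    "label_scheme": {0: "real", 1: "fake"},
    "topics": ["politics", "health"],
    "languages": ["en"],
}
```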

Multi-topic and multi-language resistant features and algorithms for CBFND: In this work, we focused our attention on studies that proposed methods applied to English datasets or to English together with other languages. Unfortunately, only a small minority of works in the literature use English alongside another language, which makes it difficult to analyse the performance of algorithms and features from a multi-language perspective. Furthermore, models need to perform appropriately over the high variety of topics included in different datasets. Many works report results on a single dataset, making it difficult to evaluate a model's capability to generalize over multiple topics and data types (e.g., a social network post versus traditional online news).
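A simple way to probe this generalization gap is cross-dataset evaluation: train on one topic or dataset and test on another. The sketch below uses scikit-learn with toy placeholder corpora (all data shown is hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Toy placeholder corpora; in practice, two distinct labeled datasets
# (e.g., political news vs. health news) would be used.
political_texts = ["budget bill passed after long debate",
                   "senator secretly funded by aliens"]
political_labels = [0, 1]
health_texts = ["regulator approves new vaccine",
                "miracle cure hidden by doctors"]
health_labels = [0, 1]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(political_texts, political_labels)

# A large drop from in-topic to cross-topic accuracy signals poor transfer.
print("cross-topic accuracy:",
      accuracy_score(health_labels, model.predict(health_texts)))
```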

Continuous learning models: The analysed studies have focused on the problem of identifying fake news by training models over existing datasets. However, the continuous evolution of the malicious techniques employed by fake news spreaders requires adaptive, continuously learning models. Creating a model able to adapt to changing fake news characterization patterns is a real-world necessity that is currently not adequately discussed in the literature and is, therefore, an opportunity for future research.
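A minimal sketch of such an adaptive detector, assuming a stream of newly labeled news and using scikit-learn's incremental-learning API, could look as follows; the batches shown are hypothetical:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer needs no fixed vocabulary, so the model can keep
# learning as new labeled news arrives without full retraining.
vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier(loss="log_loss")

def update(batch_texts, batch_labels):
    """Incrementally update the detector on a new batch of labeled news."""
    clf.partial_fit(vectorizer.transform(batch_texts), batch_labels, classes=[0, 1])

update(["well-known hoax pattern"], [1])       # initial batch
update(["newly emerging hoax phrasing"], [1])  # later batch: model adapts in place
print(clf.predict(vectorizer.transform(["newly emerging hoax phrasing"])))
```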
Association with context: The detection of fake news based on content is critically important; however, it is often necessary to associate the news with a context in order to verify it. Therefore, there is a need for models that are able to relate contexts/topics to news content. One interesting research outcome of studying context and content together would be verifying whether, for each content-based feature, conditioning on the context increases or decreases its performance.
Attacks on natural language learning: The use of Natural Language Processing to identify fake news is vulnerable to attacks on machine and deep learning models [63]. The distortion of facts, the exchange between subject and object, and the confusion of causes are three types of attacks that can lead to poor performance. Distortion consists of exaggerating or modifying certain words; textual elements can be distorted to induce a false interpretation. The exchange between subject and object aims to confuse the reader about who performs and who suffers the reported action. The confusion-of-cause attack consists of creating non-existent causal relations between two independent events, or of cutting parts of a story and leaving only the parts that the attacker wishes to present to the reader.
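As a toy illustration of the distortion attack, word-level perturbations of this kind can be generated to probe a detector's robustness; the swap table and replacement rate below are entirely hypothetical, not taken from the cited work:

```python
import random

def distortion_attack(text: str, swaps: dict, rate: float = 1.0, seed: int = 0) -> str:
    """Toy version of the 'distortion' attack: exaggerate or replace selected
    words; the swap table and rate are illustrative only."""
    rng = random.Random(seed)
    return " ".join(
        swaps[w.lower()] if w.lower() in swaps and rng.random() < rate else w
        for w in text.split()
    )

swaps = {"may": "will", "some": "all", "increase": "skyrocket"}
print(distortion_attack("Prices may increase for some goods", swaps))
# -> "Prices will skyrocket for all goods": a distorted claim that can be
#    used to test whether a trained detector's prediction flips.
```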
4.2. Limitations
In this section, we describe the limitations of this work. This systematic review is subject to limitations related to the quality assessment of the articles and, potentially, to a partial selection of works. An additional limitation lies in the search string and in the selected scientific databases; both may restrict the set of analysed works. In this work we have chosen accuracy as the metric to evaluate performance, as it is the only metric available in all the analysed studies; considering only accuracy is a limitation of this work and must be taken into account while reading the analysis. Furthermore, for simplicity, accuracy has been evaluated over an algorithm or a feature, but not over both at the same time. As more articles are published, future research will be able to analyse performance differences using metrics other than accuracy (e.g., recall, precision, and F1-score).
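The following sketch, on hypothetical predictions, shows why accuracy alone can be misleading for fake news detection, where missing a fake item (low recall) is often the costlier error:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical predictions (1 = fake, 0 = real): the classifier misses
# two of the three fake items, yet accuracy still looks acceptable.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.75
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # recall on fakes: 0.33
```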

This work considers Machine/Deep Learning models without considering the performance variability deriving from the choice of hyperparameters. Currently, the number of works using deep learning algorithms is limited, and therefore our performance results on deep learning should be considered promising rather than robust; however, future research will be able to cover this gap as the number of studies employing DL for CBFND increases and the datasets become larger and more standardized. In addition, as more datasets are published, more specialized benchmark studies, each dealing with a single algorithm (or a single class of algorithms) tested over multiple datasets and hyperparameter settings, could be performed. However, the application of deep learning in this field is still in its early stages, and the number of large multi-topic datasets is still insufficient. This work is limited to content-based features and models; in the literature, there exist works dealing with all three categories defined in Figure 1, but they provide only a high-level description of features and models. Our work differs in that it provides greater detail on the content-based category. This approach allows the comparison of performances on multiple datasets and provides more detailed knowledge to researchers and system developers. The adopted approach is a potential template for future systematic reviews of the user- and social-context-based categories.
5. Conclusions
In this review, state-of-the-art algorithms and features associated with Content Based Fake News Detection (CBFND) have been analysed and identified. To answer the research questions, a systematic literature review methodology was adopted, which allowed the works to be selected and organized. Specifically, the features and machine learning algorithms were described and associated with the reviewed works. Furthermore, a description of the most important datasets used is provided. For each work, the best-performing algorithms and features were considered, and an average performance analysis of the extracted data was conducted. The outcome of the analysis shows which features and models perform better over multiple datasets. In more detail, we found the best-performing models to be Gradient Boosting, eXtreme Gradient Boosting, Multilayer Perceptron, and Naive Bayes, and the best-performing features to be the Automated Readability Index, Bag of Words, Number of verbs, and TF-IDF. Furthermore, we identify as promising the models XLNet, ALBERT, and LSTM with BERT, as well as the Coleman Liau Index feature; however, more research is needed to confirm their effect on CBFND. The analysis facilitates the work of researchers in improving CBFND performance and, more importantly, indicates the models and features that are more likely to achieve high performance on multiple datasets and, therefore, have a higher probability of performing well in real-world CBFND systems. Finally, challenges and opportunities in this research field are described to indicate areas where further research is needed.
References
[1] X. Zhang, A. A. Ghorbani, An overview of online fake news: Characterization, detection,
and discussion, Information Processing & Management 57 (2020) 102025.

[2] J. A. Nasir, O. S. Khan, I. Varlamis, Fake news detection: A hybrid cnn-rnn based deep
learning approach, International Journal of Information Management Data Insights 1
(2021) 100007.

[3] A. Abedalla, A. Al-Sadi, M. Abdullah, A closer look at fake news detection: A deep learning perspective, in: Proceedings of the 2019 3rd International Conference on Advances in Artificial Intelligence, 2019, pp. 24–28.

[4] S. A. Alameri, M. Mohd, Comparison of fake news detection using machine learning
and deep learning techniques, in: 2021 3rd International Cyber Resilience Conference
(CRC), IEEE, 2021, pp. 1–6.

[5] N. Aslam, I. Ullah Khan, F. S. Alotaibi, L. A. Aldaej, A. K. Aldubaikil, Fake detect: A deep learning ensemble model for fake news detection, Complexity 2021 (2021).

[6] M. A. Al-Asadi, S. Tasdemir, Using artificial intelligence against the phenomenon of fake news: a systematic literature review, Combating Fake News with Computational Intelligence Techniques (2022) 39–54.

[7] M. Lahby, S. Aqil, W. Yafooz, Y. Abakarim, Online fake news detection using ma-
chine learning techniques: A systematic mapping study, Combating Fake News with
Computational Intelligence Techniques (2022) 3–37.

[8] C. Agrawal, A. Pandey, S. Goyal, A survey on role of machine learning and nlp in
fake news detection on social media, in: 2021 IEEE 4th International Conference on
Computing, Power and Communication Technologies (GUCON), IEEE, 2021, pp. 1–7.

[9] R. Varma, Y. Verma, P. Vijayvargiya, P. P. Churi, A systematic survey on deep learning and machine learning approaches of fake news detection in the pre- and post-covid-19 pandemic, International Journal of Intelligent Computing and Cybernetics (2021).

[10] A. A. A. Ahmed, A. Aljabouh, P. K. Donepudi, M. S. Choi, Detecting fake news using machine learning: A systematic literature review, arXiv preprint arXiv:2102.04458 (2021).

[11] A. Chokshi, R. Mathew, Deep learning and natural language processing for fake news
detection: a survey (2020).

[12] F. Cardoso Durier da Silva, R. Vieira, A. C. Garcia, Can machines learn to detect fake
news? a survey focused on social media (2019).
[13] R. Oshikawa, J. Qian, W. Y. Wang, A survey on natural language processing for fake
news detection, arXiv preprint arXiv:1811.00770 (2018).

[14] K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake news detection on social media: A
data mining perspective, ACM SIGKDD explorations newsletter 19 (2017) 22–36.

[15] J. Biolchini, P. G. Mian, A. C. C. Natali, G. H. Travassos, Systematic review in software engineering, System Engineering and Computer Science Department COPPE/UFRJ, Technical Report ES 679 (2005) 45.

[16] W. Y. Wang, "Liar, liar pants on fire": A new benchmark dataset for fake news detection, arXiv preprint arXiv:1705.00648 (2017).

[17] M. Granik, V. Mesyura, Fake news detection using naive bayes classifier, in: 2017 IEEE
first Ukraine conference on electrical and computer engineering (UKRCON), IEEE, 2017,
pp. 900–903.

[18] H. Ahmed, I. Traore, S. Saad, Detection of online fake news using n-gram analysis
and machine learning techniques, in: International conference on intelligent, secure, and
dependable systems in distributed and cloud environments, Springer, 2017, pp. 127–138.

[19] V. Pérez-Rosas, B. Kleinberg, A. Lefevre, R. Mihalcea, Automatic detection of fake news, arXiv preprint arXiv:1708.07104 (2017).

[20] B. Al Asaad, M. Erascu, A tool for fake news detection, in: 2018 20th International
Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC),
IEEE, 2018, pp. 379–386.

[21] A. Thota, P. Tilak, S. Ahluwalia, N. Lohia, Fake news detection: a deep learning
approach, SMU Data Science Review 1 (2018) 10.

[22] J. Y. Khan, M. T. I. Khondaker, A. Iqbal, S. Afroz, A benchmark study on machine learning methods for fake news detection, arXiv preprint arXiv:1905.04749 (2019) 1–14.

[23] A. P. S. Bali, M. Fernandes, S. Choubey, M. Goel, Comparative performance of machine learning algorithms for fake news detection, in: International Conference on Advances in Computing and Data Sciences, Springer, 2019, pp. 420–430.

[24] D. Katsaros, G. Stavropoulos, D. Papakostas, Which machine learning paradigm for fake news detection?, in: 2019 IEEE/WIC/ACM International Conference on Web Intelligence (WI), IEEE, 2019, pp. 383–387.

[25] A. M. Braşoveanu, R. Andonie, Semantic fake news detection: a machine learning perspective, in: International Work-Conference on Artificial Neural Networks, Springer, 2019, pp. 656–667.
[26] J. C. Reis, A. Correia, F. Murai, A. Veloso, F. Benevenuto, Supervised learning for fake
news detection, IEEE Intelligent Systems 34 (2019) 76–81.

[27] E. M. Mahir, S. Akhter, M. R. Huq, et al., Detecting fake news using machine learning and deep learning algorithms, in: 2019 7th International Conference on Smart Computing & Communications (ICSCC), IEEE, 2019, pp. 1–5.

[28] G. Gravanis, A. Vakali, K. Diamantaras, P. Karadais, Behind the cues: A benchmarking study for fake news detection, Expert Systems with Applications 128 (2019) 201–213.

[29] V. M. Krešňáková, M. Sarnovskỳ, P. Butka, Deep learning methods for fake news
detection, in: 2019 IEEE 19th International Symposium on Computational Intelligence
and Informatics and 7th IEEE International Conference on Recent Achievements in
Mechatronics, Automation, Computer Sciences and Robotics (CINTI-MACRo), IEEE,
2019, pp. 000143–000148.

[30] P. Ksieniewicz, M. Choraś, R. Kozik, M. Woźniak, Machine learning methods for fake
news classification, in: International Conference on Intelligent Data Engineering and
Automated Learning, Springer, 2019, pp. 332–339.

[31] K. Poddar, K. Umadevi, et al., Comparison of various machine learning models for
accurate detection of fake news, in: 2019 Innovations in Power and Advanced Computing
Technologies (i-PACT), volume 1, IEEE, 2019, pp. 1–5.

[32] P. Bharadwaj, Z. Shao, Fake news detection with semantic features and text mining,
International Journal on Natural Language Computing (IJNLC) Vol 8 (2019).

[33] K. Shu, L. Cui, S. Wang, D. Lee, H. Liu, defend: Explainable fake news detection, in:
Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery
& data mining, 2019, pp. 395–405.

[34] H. Karimi, J. Tang, Learning hierarchical discourse-level structure for fake news detec-
tion, arXiv preprint arXiv:1903.07389 (2019).

[35] O. Ajao, D. Bhowmik, S. Zargari, Sentiment aware fake news detection on online social
networks, in: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), IEEE, 2019, pp. 2507–2511.

[36] H. E. Wynne, Z. Z. Wint, Content based fake news detection using n-gram models,
in: Proceedings of the 21st International Conference on Information Integration and
Web-based Applications & Services, 2019, pp. 669–673.

[37] I. Ahmad, M. Yousaf, S. Yousaf, M. O. Ahmad, Fake news detection using machine
learning ensemble methods, Complexity 2020 (2020).
[38] N. Smitha, R. Bharath, Performance comparison of machine learning classifiers for fake
news detection, in: 2020 Second International Conference on Inventive Research in
Computing Applications (ICIRCA), IEEE, 2020, pp. 696–700.

[39] P. H. A. Faustini, T. F. Covões, Fake news detection in multiple platforms and languages,
Expert Systems with Applications 158 (2020) 113503.

[40] R. K. Kaliyar, A. Goswami, P. Narang, S. Sinha, FNDNet – a deep convolutional neural network for fake news detection, Cognitive Systems Research 61 (2020) 32–44.

[41] S. H. Kong, L. M. Tan, K. H. Gan, N. H. Samsudin, Fake news detection using deep
learning, in: 2020 IEEE 10th Symposium on Computer Applications & Industrial Elec-
tronics (ISCAIE), IEEE, 2020, pp. 102–107.

[42] S. Vijayaraghavan, Y. Wang, Z. Guo, J. Voong, W. Xu, A. Nasseri, J. Cai, L. Li, K. Vuong, E. Wadhwa, Fake news detection with different models, arXiv preprint arXiv:2003.04978 (2020).

[43] A. Agarwal, A. Dixit, Fake news detection: an ensemble learning approach, in: 2020
4th International Conference on Intelligent Computing and Control Systems (ICICCS),
IEEE, 2020, pp. 1178–1183.

[44] X. Zhou, A. Jain, V. V. Phoha, R. Zafarani, Fake news early detection: A theory-driven
model, Digital Threats: Research and Practice 1 (2020) 1–25.

[45] Z. Khanam, B. Alwasel, H. Sirafi, M. Rashid, Fake news detection using machine
learning approaches, in: IOP Conference Series: Materials Science and Engineering,
volume 1099, IOP Publishing, 2021, p. 012040.

[46] A. Choudhary, A. Arora, Linguistic feature based learning model for fake news detection
and classification, Expert Systems with Applications 169 (2021) 114171.

[47] J. Y. Khan, M. T. I. Khondaker, S. Afroz, G. Uddin, A. Iqbal, A benchmark study of machine learning models for online fake news detection, Machine Learning with Applications 4 (2021) 100032.

[48] T. Felber, Constraint 2021: Machine learning models for covid-19 fake news detection
shared task, arXiv preprint arXiv:2101.03717 (2021).

[49] P. K. Verma, P. Agrawal, I. Amorim, R. Prodan, Welfake: word embedding over lin-
guistic features for fake news detection, IEEE Transactions on Computational Social
Systems 8 (2021) 881–893.

[50] S. Gundapu, R. Mamidi, Transformer based automatic covid-19 fake news detection
system, arXiv preprint arXiv:2101.00180 (2021).
[51] R. K. Kaliyar, A. Goswami, P. Narang, Fakebert: Fake news detection in social media
with a bert-based deep learning approach, Multimedia tools and applications 80 (2021)
11765–11788.

[52] A. Wani, I. Joshi, S. Khandve, V. Wagh, R. Joshi, Evaluating deep learning approaches
for covid19 fake news detection, in: International Workshop on Combating Online
Hostile Posts in Regional Languages during Emergency Situation, Springer, 2021, pp.
153–163.

[53] M. Galster, D. Weyns, D. Tofan, B. Michalik, P. Avgeriou, Variability in software systems—a systematic literature review, IEEE Transactions on Software Engineering 40 (2013) 282–306.

[54] G. C. Santia, J. R. Williams, Buzzface: A news veracity dataset with facebook user
commentary and egos, in: Twelfth international AAAI conference on web and social
media, 2018.

[55] A. Zubiaga, M. Liakata, R. Procter, G. Wong Sak Hoi, P. Tolmie, Analysing how people
orient to and spread rumours in social media by looking at conversational threads, PloS
one 11 (2016) e0150989.

[56] P. Patwa, S. Sharma, S. Pykl, V. Guptha, G. Kumari, M. S. Akhtar, A. Ekbal, A. Das, T. Chakraborty, Fighting an infodemic: Covid-19 fake news dataset, in: International Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation, Springer, 2021, pp. 21–29.

[57] T. de Melo, C. M. Figueiredo, A first public dataset from Brazilian Twitter and news on covid-19 in Portuguese, Data in Brief 32 (2020) 106179.

[58] F. K. A. Salem, R. Al Feel, S. Elbassuoni, M. Jaber, M. Farah, FA-KES: A fake news dataset around the Syrian war, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 13, 2019, pp. 573–582.

[59] K. Shu, D. Mahudeswaran, S. Wang, D. Lee, H. Liu, FakeNewsNet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media, Big Data 8 (2020) 171–188.

[60] V. Pérez-Rosas, B. Kleinberg, A. Lefevre, R. Mihalcea, Automatic detection of fake news, in: Proceedings of the International Conference on Computational Linguistics (COLING), 2018.

[61] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, et al., The FAIR guiding principles for scientific data management and stewardship, Scientific Data 3 (2016) 1–9.
[62] B. Mons, C. Neylon, J. Velterop, M. Dumontier, L. O. B. da Silva Santos, M. D. Wilkinson, Cloudy, increasingly FAIR; revisiting the FAIR data guiding principles for the European Open Science Cloud, Information Services & Use 37 (2017) 49–56.

[63] Z. Zhou, H. Guan, M. M. Bhat, J. Hsu, Fake news detection via nlp is vulnerable to
adversarial attacks, arXiv preprint arXiv:1901.09657 (2019).
