Fake News Machine Learning Research Paper
misinformation related to COVID-19.

• A quantitative analysis. We present a quantitative analysis of the performance of DL-based FND methods on a variety of datasets, so that researchers can learn from it.

• Some future directions. We discuss the remaining limitations of existing FND methods and point out possible future directions.

The remainder of this survey is organized as follows. We first revisit the task definition of FND in Section 2. We then explain our new taxonomy of DL-based FND methods in Section 3 and present a comprehensive survey of the three categories in Sections 4 and 5, respectively. Section 6 introduces several commonly used FND datasets and presents a quantitative performance analysis of DL-based FND methods, and Section 7 concludes with some promising directions.

3. A Taxonomy of DL-Based Fake News Detection Methods

The previous survey (Zhou and Zafarani, 2018) divided fake news detection methods from the perspective of features. Since limited labeled data has been attracting increasingly many researchers to combat fake news with few or no labels, we propose to categorize fake news detection methods as supervised, weakly supervised, and unsupervised. Fig. 2 shows the resulting taxonomy of deep learning-based fake news detection methods used in this paper. It is worth noting that methods along these three dimensions focus on different features, which we elaborate on in the subsequent subsections.

• Supervised methods: Supervised methods learn from labeled data. The main concern is how the DL models utilize the rich feature information, so we further divide these methods into news content-based, social context-based, and external knowledge-based. In addition, we detail solutions for integrating different, even heterogeneous, feature information in Section 4.4.

• Weakly/unsupervised methods: Weakly supervised methods assume that only limited labeled data is available at the learning stage. The common solution is to derive weak supervision signals from the available information; according to how the weak supervision is obtained, we divide these semi-supervised methods into weak content supervision and weak social supervision. Unsupervised methods learn from entirely unlabeled data, typically by resorting to generative methods or probabilistic knowledge. We detail both families in Section 5.
Supervised methods have long been central to machine learning and information retrieval, particularly in the domain of news content analysis. These methods are designed to learn and make predictions based on labeled examples, and various techniques have been explored to enhance their performance. This article delves into supervised methods with a focus on news content-based approaches, which rely on the explicit information carried by news articles, including text and images. We will explore the different facets of news content-based methods, with an emphasis on textual features and their classification into generic and latent features.

Supervised methods are a fundamental paradigm in machine learning, where models are trained on labeled examples to make predictions or classify data points. In the context of news content analysis, these methods are particularly valuable for tasks such as sentiment analysis, topic classification, and fake news detection. These tasks require the model to learn from various features extracted from news articles, and the choice of features is a critical factor in determining the model's performance.

One approach to enhancing the performance of supervised methods in news content analysis is the incorporation of multi-modal information, which involves utilizing both the textual and visual features present in news articles. In addition to the text of the articles, any accompanying images or multimedia elements can be used as valuable sources of information. This allows models to form a more comprehensive understanding of the news content and can lead to more accurate predictions.

External knowledge is another technique that has been explored to improve the performance of supervised methods. External knowledge sources, such as databases, ontologies, or domain-specific knowledge bases, can be integrated into the model to provide additional context and information that may not be present in the news articles themselves. For instance, integrating information about named entities, geographical locations, or historical events from external knowledge sources can enhance the model's ability to interpret and analyze news content effectively.

Integration strategies also play a vital role in enhancing the performance of supervised methods in news content analysis. These strategies involve combining information from different sources or features in a coherent and effective manner, and the choice of strategy can significantly impact the model's ability to make accurate predictions. For example, fusing textual and visual features might require a different integration strategy than combining textual features with external knowledge.

Now, let us delve deeper into news content-based methods with a focus on textual features. Textual features are at the core of many news content analysis tasks, as they provide valuable information about the linguistic content of news articles. These textual features can be classified into two main categories: generic features and latent features.

Generic textual features encompass a wide range of attributes that describe the text of news articles. These features can be used within a traditional machine learning framework and typically capture information at various linguistic levels, including:

1. Lexicon: These features focus on the vocabulary used in the text, including word frequency, diversity of vocabulary, and language complexity.

2. Syntax: Syntax-related features examine the grammatical structure of sentences, such as the use of punctuation, sentence length, and grammatical patterns.

3. Discourse: Discourse features consider the organization and flow of information within the text, including the use of transitional words and coherence.

4. Semantic: Semantic features delve into the meaning and interpretation of the text, including sentiment analysis, word associations, and semantic roles.

Together, these generic textual features provide a comprehensive view of the linguistic characteristics of news articles and can be used to build models for tasks such as text classification, summarization, and sentiment analysis.

In addition to generic textual features, there are latent textual features, a more abstract representation of news text that is typically derived through text embedding. Text embedding transforms words, sentences, or entire documents into high-dimensional vector representations.

Text embedding can occur at different levels of granularity: word, sentence, and document. At the word level, each word in a news article is represented as a vector, and these vectors can capture semantic relationships between words. At the sentence level, an entire sentence is embedded into a vector that captures its context and meaning. At the document level, the entire news article is represented as a vector that summarizes its content in a high-dimensional space.

Latent textual features have gained popularity due to the effectiveness of word, sentence, and document embeddings. These latent representations allow news articles to be expressed as vectors in a continuous space, on which various machine learning models can operate effectively.

One popular approach for integrating latent textual features into supervised methods is the use of recurrent neural networks (RNNs). RNNs are a class of neural networks well suited to sequential data, making them particularly useful for processing text: they can take sequences of word or sentence embeddings as input and capture dependencies and context within the text.

In the context of news content analysis, RNNs can be used to classify news articles into different categories, perform sentiment analysis, or detect the presence of specific topics. By utilizing latent textual features, RNNs can capture nuanced, context-dependent aspects of news content that may not be apparent from generic textual features alone.

Several deep learning variants build on these ideas. One line of work reformulated the fake news detection task as a graph classification task, where the nodes represent the sentences of the article and the edges represent the semantic similarity between a pair of sentences; two widely applied graph neural networks, GCN and GAT, were used to generate a graph embedding from which the classification is made. Inspired by multi-task learning, Wu et al. (2019) designed a multi-task model that jointly optimizes a fake news detection task and a stance classification task over a shared layer, resulting in enhanced news representations; the shared layer filters and selects feature flows between the two tasks using a gate mechanism and an attention mechanism. Cheng et al. (2020) proposed an LSTM-based variational autoencoder model to extract latent representations of tweet-level text. In addition, in order to detect fake news as early as possible, some researchers assume that multifaceted information is unavailable before a news article becomes popular, and therefore deliberately use text-only features. For example, Qian et al. (2018) proposed a model that generates user feedback from text, which is then used in the classification process together with word-level and sentence-level information from real articles, addressing the lack of user reviews as an auxiliary source of information in early detection. Giachanou et al. (2019) considered the role of emotional signals and proposed an LSTM model that incorporates emotional signals obtained from the text.

To summarize, supervised methods in news content analysis have evolved to incorporate multi-modal information, external knowledge, and effective integration strategies. Within the realm of textual features, both generic and latent features are essential for understanding the linguistic content of news articles: generic features capture linguistic attributes at various levels, while latent features, often derived through text embedding, provide abstract representations of the text. Integration strategies, such as recurrent neural networks, play a pivotal role in making the most of these features. These advancements contribute to the field's ability to extract valuable insights from news content and make informed predictions, which is crucial in the era of information overload and misinformation.
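As a concrete illustration of this latent-feature pipeline, here is a minimal sketch (in PyTorch; the vocabulary size, dimensions, and class count are illustrative placeholders, not any specific published model) of an embedding-plus-RNN fake news classifier:

```python
import torch
import torch.nn as nn

class RNNFakeNewsClassifier(nn.Module):
    """Minimal sketch: word embeddings -> LSTM -> fake/real prediction."""
    def __init__(self, vocab_size=20000, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # latent word-level features
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        embedded = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.rnn(embedded)            # final hidden state summarizes the text
        return self.fc(h_n.squeeze(0))              # (batch, num_classes) logits

model = RNNFakeNewsClassifier()
logits = model(torch.randint(0, 20000, (4, 50)))    # 4 articles, 50 tokens each
probs = torch.softmax(logits, dim=-1)               # fake/real probabilities
```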
The Skip-gram architecture is widely used because it works better on semantic tasks than the CBOW model [9]. The Skip-gram architecture uses each current word $w_t$ as the input to the model and predicts the words within a certain range before and after it ($w_{t-c} \sim w_{t+c}$). It maximizes the classification of a word based on another word in the same sentence, so similar words obtain similar vectors and their similarity increases [10]. Given a sequence of training words $w_1, \dots, w_T$ and a training context of size $c$, the objective of the Skip-gram model is to maximize the average log probability

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t).$$
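For illustration, this Skip-gram objective is what gensim's Word2Vec trains when sg=1; the corpus and hyperparameters below are toy placeholders:

```python
from gensim.models import Word2Vec

# Toy corpus; real training uses a large tokenized news corpus.
sentences = [["fake", "news", "spreads", "fast"],
             ["real", "news", "is", "verified"]]

# sg=1 selects the Skip-gram architecture; window is the context size c.
model = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)

vector = model.wv["news"]                 # 100-dimensional word vector
similar = model.wv.most_similar("news")   # nearest neighbors by cosine similarity
```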
"Fasttext" is a method that adds the concept of sub-words to "Word2vec": each word is represented as the sum of its character n-gram vectors and the word vector itself. Taking the word apple with $n = 3$ as an example, it is represented by the character n-grams <ap, app, ppl, ple, le> and the word itself, <apple>. The 2-character entries appear because special boundary symbols < and > are added at the beginning and end of each word to distinguish prefixes and suffixes from other character sequences. The formulation is as follows: given a dictionary of n-grams of size $G$ and a word $w$, let $\mathcal{G}_w \subset \{1, \dots, G\}$ denote the set of n-grams appearing in $w$; the word is then represented by the sum of the vector representations of the n-grams in $\mathcal{G}_w$.
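gensim's FastText implementation follows the same sub-word scheme; a toy sketch (all parameters illustrative):

```python
from gensim.models import FastText

sentences = [["apple", "pie"], ["apple", "juice"]]

# min_n/max_n bound the character n-gram lengths (n = 3 in the example above).
model = FastText(sentences, vector_size=100, window=5, min_count=1, min_n=3, max_n=3)

# Because of sub-word n-grams, even out-of-vocabulary words get vectors.
oov_vector = model.wv["apples"]
```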
2.2 Shallow-and-Wide CNN

The model architecture shown in Fig. 1 is the "Shallow-and-wide CNN" architecture of Kim [8]. The first layer is a look-up table: a set of k-dimensional word vectors, each corresponding to the i-th word in the sentence. A convolution operation is then applied with multiple filter widths, followed by a max-over-time pooling operation. Finally, these features are passed to a fully connected layer, and the prediction is made with a softmax layer.

The model has two channels of word vectors: one named "Static" that is kept fixed throughout training, and one named "Non-static" that is fine-tuned via backpropagation. Previous work conducted sentiment analysis on a dataset of short sentences and found the "Static" and "Non-static" results comparable, but the "Non-static" channel allows the words to attain more meaningful representations [8]. However, if only the "Non-static" channel is used, new words can be over-fitted in this model; therefore, both channels are used to secure the generality of the meaning of words.
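A minimal PyTorch sketch of this two-channel shallow-and-wide CNN (the filter counts, widths, and dimensions are illustrative, not the exact configuration of [8]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowWideCNN(nn.Module):
    def __init__(self, pretrained, num_classes=2, num_filters=100, widths=(3, 4, 5)):
        super().__init__()
        # "Static" channel: frozen pre-trained vectors; "Non-static": fine-tuned copy.
        self.static = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.nonstatic = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)
        embed_dim = pretrained.size(1)
        # One convolution per filter width, applied over the 2-channel "image".
        self.convs = nn.ModuleList(
            [nn.Conv2d(2, num_filters, (w, embed_dim)) for w in widths])
        self.fc = nn.Linear(num_filters * len(widths), num_classes)

    def forward(self, token_ids):
        # Stack the two channels: (batch, 2, seq_len, embed_dim)
        x = torch.stack([self.static(token_ids), self.nonstatic(token_ids)], dim=1)
        # Convolve, then max-over-time pool each feature map to a single value.
        pooled = [F.relu(conv(x)).squeeze(3).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))    # softmax applied via the loss

pretrained = torch.randn(20000, 100)                # placeholder for Word2Vec/FastText vectors
model = ShallowWideCNN(pretrained)
logits = model(torch.randint(0, 20000, (4, 50)))
```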
2.3 Attentive Pooling

The model architecture shown in Fig. 2 is the "Attentive pooling" architecture of Santos et al. [12]. Recently, attention mechanisms have been used successfully for image captioning [13] and machine translation [14], but there had been few studies applying the attention mechanism to NLP tasks with two inputs, such as pair-wise ranking or text classification. "Attentive pooling" has improved performance on these tasks by effectively representing the similarity of the two inputs [12]. Whereas the Term Frequency-Inverse Document Frequency (TF-IDF) method statistically measures similarity by the frequency of words in a document, this model measures similarity by increasing the weight of words that have the same or similar meanings across the two inputs.
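In code, the mechanism can be sketched as follows, assuming the two inputs are feature matrices produced by the convolutional layers; the bilinear alignment and max-pooling follow the description above, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentivePooling(nn.Module):
    """Sketch of two-input attentive pooling in the spirit of Santos et al. [12]."""
    def __init__(self, dim):
        super().__init__()
        self.U = nn.Parameter(torch.randn(dim, dim) * 0.01)  # bilinear interaction matrix

    def forward(self, Q, A):
        # Q: (batch, m, dim) headline features; A: (batch, n, dim) body features.
        G = torch.tanh(Q @ self.U @ A.transpose(1, 2))   # (batch, m, n) soft alignment
        # Each input is pooled with attention derived from its best match in the other.
        attn_q = F.softmax(G.max(dim=2).values, dim=1)   # (batch, m)
        attn_a = F.softmax(G.max(dim=1).values, dim=1)   # (batch, n)
        r_q = (attn_q.unsqueeze(2) * Q).sum(dim=1)       # (batch, dim)
        r_a = (attn_a.unsqueeze(2) * A).sum(dim=1)       # (batch, dim)
        return r_q, r_a

pool = AttentivePooling(dim=100)
r_q, r_a = pool(torch.randn(4, 12, 100), torch.randn(4, 300, 100))
```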
Sentiment analysis and fake news detection are critical tasks in the field of natural language processing. They are essential for understanding public opinion, identifying deceptive information, and making informed decisions in various domains, including journalism, politics, and social media. This article focuses on a specific approach that combines two channels of word vectors, "Static" and "Non-static," and employs attentive pooling to enhance performance in these tasks.

### The Significance of Sentiment Analysis and Fake News Detection

Sentiment analysis, also known as opinion mining, is the process of determining the sentiment or emotional tone of a piece of text, typically to understand whether it is positive, negative, or neutral. It is widely used in applications such as customer feedback analysis, social media monitoring, and product reviews. Sentiment analysis is particularly relevant in the age of the internet, where vast amounts of textual data are generated daily, far more than humans can manually process and interpret.

Fake news detection, on the other hand, is a critical task in the era of information overload and disinformation. The spread of false or misleading information can have severe consequences, ranging from influencing public opinion to causing real-world harm. Accurate and timely identification of fake news is essential for maintaining the integrity of information dissemination, and machine learning approaches have become valuable tools in this endeavor.

In this context, the article discusses an approach that combines sentiment analysis and fake news detection using deep learning techniques, with a particular focus on the "Static" and "Non-static" channels of word vectors and attentive pooling.

### Two Channels of Word Vectors: "Static" and "Non-static"

The approach employs two channels of word vector representations, "Static" and "Non-static," which play a crucial role in enhancing the model's performance in sentiment analysis and fake news detection.

1. **Static Word Vectors**: The "Static" channel consists of word vectors that remain fixed throughout the training process. These vectors are pre-trained on a large corpus of text data and capture the semantic meaning of words. Static word vectors are often obtained using techniques like Word2Vec or GloVe, which provide distributed representations of words based on co-occurrence statistics in the training corpus. These vectors are highly valuable as they encode general semantic information about words.

2. **Non-static Word Vectors**: The "Non-static" channel starts from the same pre-trained vectors but fine-tunes them via backpropagation during training. These vectors allow the model to make nuanced adjustments to word representations based on the context and goals of the task. This adaptability is particularly advantageous in tasks like sentiment analysis and fake news detection, where the meaning of words can be influenced by their surrounding context.

The rationale behind using both "Static" and "Non-static" channels is to strike a balance between generality and adaptability. When conducting sentiment analysis on short sentences, previous research has shown that the results obtained with the "Static" and "Non-static" channels are comparable. However, relying solely on the "Non-static" channel can lead to overfitting, especially when dealing with new or rare words. Using both channels therefore ensures that the model benefits from the general semantic knowledge provided by the "Static" vectors while also having the flexibility to fine-tune word representations as needed.

This dual-channel approach can be thought of as having a strong foundation in general word semantics (Static) while being able to make task-specific adjustments (Non-static). It strikes a balance between capturing the inherent meaning of words and adapting to the specific requirements of sentiment analysis and fake news detection.

### Attentive Pooling

The model architecture incorporates an "Attentive Pooling" mechanism, a key component contributing to its success. The mechanism is credited to Santos et al., who introduced it for text classification tasks, particularly pair-wise ranking.

1. **Introduction to Attentive Pooling**: Attentive pooling leverages attention mechanisms to effectively represent the similarity between two inputs. Attention mechanisms, originally popularized in computer vision, have found success in various natural language processing (NLP) tasks, such as machine translation and image captioning.

2. **Applying Attention to NLP Tasks**: While attention mechanisms have been widely adopted in NLP for tasks like machine translation, their application to tasks involving two inputs, such as pair-wise ranking or text classification, was relatively unexplored until the introduction of attentive pooling.
The combination of the two channels of word vectors, "Static" and "Non-static," along with attentive pooling, has practical implications in various domains:

1. **Sentiment Analysis**: Sentiment analysis is a fundamental task in customer feedback analysis, market research, and brand management. The model's ability to adapt word representations through the "Non-static" channel allows it to capture fine-grained sentiment information, making it more effective at identifying sentiment nuances in short texts, such as social media posts and product reviews.

2. **Fake News Detection**: Fake news detection is of paramount importance in the current information landscape, where misinformation and disinformation spread rapidly. The combination of both word vector channels and attentive pooling enables the model to understand the semantic relationships between news articles and detect patterns associated with deceptive content, with significant implications for maintaining the integrity of news dissemination and keeping the public well informed.

3. **Natural Language Understanding**: The approach has broader applications in natural language understanding. Understanding the meaning of words and their relationships in context is crucial for NLP tasks such as text summarization, question answering, and dialogue systems, and the dual-channel word vectors and attentive pooling can be adapted to enhance these tasks.

4. **Multimodal Analysis**: The concept of dual-channel word vectors can also be extended to multimodal analysis, where textual content is analyzed alongside accompanying images.

At the same time, approaches of this kind face several challenges:

2. **Handling Multimodal Data**: As mentioned, the approach can be extended to handle both text and image data. However, integrating and effectively utilizing multimodal data presents challenges in data preprocessing, model architecture, and the alignment of features between modalities.

3. **Interpretable Models**: Understanding the decisions made by deep learning models is an ongoing challenge. The model discussed here may achieve high accuracy, but its decision-making process can be opaque. Developing interpretable models that can explain why a particular prediction was made is crucial, especially in sensitive applications like fake news detection.

4. **Adaptation to Evolving Language**: Language is constantly evolving, with new words and phrases emerging over time. The model's adaptability through the "Non-static" channel helps, but keeping up with language change and evolving semantics remains a challenge.

5. **Ethical Considerations**: Sentiment analysis and fake news detection models can have significant societal implications. Ensuring that these models are used responsibly and ethically is an ongoing challenge; bias, fairness, and ethical considerations must be addressed in their development and deployment.

### Conclusion

Sentiment analysis and fake news detection are pivotal tasks in natural language processing and information retrieval. This article discussed an approach that combines two channels of word vectors, "Static" and "Non-static," with attentive pooling to enhance the performance of models on these tasks.
The LSTM cell is defined by

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$h_t = o_t \odot \tanh(C_t)$$

where $\sigma(\cdot)$ is the sigmoid function and $f_t$, $i_t$, $o_t$, $C_t$, and $h_t$ are the vectors of the forget gate, input gate, output gate, memory cell, and hidden state, respectively. All of these vectors are the same size. Moreover, $W_f$, $W_i$, $W_o$, and $W_C$ denote the weight matrices of each gate, and $b_f$, $b_i$, $b_o$, and $b_C$ denote the bias vectors of each gate.
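A direct rendering of these equations as a single step function (a minimal sketch; in practice PyTorch's built-in nn.LSTM implements the same cell):

```python
import torch

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following the equations above.
    W maps the concatenated [h_{t-1}, x_t] to the four gate pre-activations."""
    z = torch.cat([h_prev, x_t], dim=-1) @ W + b       # stacked W_f, W_i, W_o, W_C
    f, i, o, C_hat = z.chunk(4, dim=-1)
    f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
    C_t = f * C_prev + i * torch.tanh(C_hat)           # memory cell update
    h_t = o * torch.tanh(C_t)                          # hidden state
    return h_t, C_t

hidden, inp = 8, 4
W = torch.randn(hidden + inp, 4 * hidden) * 0.1
b = torch.zeros(4 * hidden)
h, C = torch.zeros(1, hidden), torch.zeros(1, hidden)
h, C = lstm_step(torch.randn(1, inp), h, C, W, b)
```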
The BCNN applies convolutional filters to extract relevant features from both the headlines and the article bodies.

#### Different Numbers of Filters

A notable characteristic of the BCNN architecture is the use of different numbers of filters for the headline and body inputs. This is based on the consideration that headlines are typically much shorter than full article bodies, resulting in different amounts of text data to process. To address this imbalance, the BCNN allocates 256 filters for the headline input and a more substantial 1024 filters for the body input.

The rationale behind this distribution is to ensure that the model can extract meaningful features from both types of input, regardless of their differing lengths. By providing more filters for the body input, the BCNN accommodates the greater volume of text found in articles, treating each input source appropriately so that relevant information is extracted effectively.

After extracting features with the convolutional layers, the BCNN employs a max-pooling layer, which transforms the feature maps into one vector per input. This transformation is crucial for obtaining a fixed-length global vector representation of both the headline and the body.

The max-pooling layer selects the most significant information from the feature maps, effectively summarizing the relevant features. By choosing the maximum value within each feature map, the BCNN retains the most salient information while discarding less important details. The result is a concise and informative representation of each input, which is essential for further processing and classification.
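In code, this max-over-time pooling step reduces each feature map to its strongest activation (shapes are illustrative):

```python
import torch

# 1024 feature maps for a batch of 4 article bodies, each of convolved length 498.
feature_maps = torch.randn(4, 1024, 498)

# Max-over-time: keep only the strongest activation per filter.
body_vector = feature_maps.max(dim=2).values   # (4, 1024) fixed-length representation
```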
#### Classification and Activation Functions

The final step in the BCNN architecture is classification. Once the fixed-length global vectors for the headline and body are obtained, they are fed into a fully connected layer that makes predictions based on the learned representations.

To introduce non-linearity and enable the model to capture complex relationships in the data, the BCNN uses the Rectified Linear Unit (ReLU) as an activation function. ReLU replaces negative values with zeros and passes positive values unchanged, enabling the network to learn complex patterns and relationships within the feature vectors.

For classification tasks, especially those involving multiple classes, the softmax function is employed at the output. The softmax function converts the model's output scores into probability distributions over the possible classes, making it easier to interpret the model's predictions and determine the most likely class for a given input.

#### Leveraging the "Static" Channel

The BCNN relies on the "Static" channel, which keeps the word embeddings static throughout the training process: the word embeddings derived from "Fasttext" remain fixed during training. Using static word embeddings ensures that the model preserves the general semantic knowledge encoded in these pre-trained embeddings.

By keeping the word embeddings static, the BCNN benefits from the consistent and well-established word representations provided by "Fasttext." These static embeddings serve as a reliable foundation for understanding the language used in both the headlines and article bodies, contributing to the model's ability to capture the meaning of words effectively.

The BCNN architecture, with its dual-channel word embeddings, convolutional layers, and max-pooling, has practical implications in various domains (a sketch of the full architecture follows this list):

1. **Text Classification**: The BCNN is particularly well suited to text classification. It can be applied to a wide range of problems, including sentiment analysis, topic categorization, and fake news detection; by extracting features and creating global vector representations of text inputs, it provides a powerful tool for making accurate predictions.

2. **News Analysis**: In the context of news analysis, the BCNN can process both news headlines and full article bodies, making it valuable for news classification and summarization. For instance, it can help categorize news articles into different topics or identify potentially misleading headlines.

3. **Multimodal Analysis**: While the BCNN architecture is primarily designed for text, its principles can be extended to multimodal analysis. By incorporating both text and images, it becomes applicable in domains like social media content analysis, where textual content is often accompanied by images.

4. **Natural Language Understanding**: The BCNN contributes to natural language understanding by capturing relevant features from text inputs, which is crucial for NLP tasks such as text summarization, question answering, and dialogue systems.

5. **Domain-Specific Applications**: The architecture can be adapted to specific domains, allowing for fine-tuning and customization to domain-specific data and vocabulary.
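Putting these pieces together, here is a minimal sketch of the two-branch BCNN described above (a single filter width and FastText-style static embeddings are assumed; apart from the 256/1024 filter counts, all dimensions are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BCNN(nn.Module):
    """Sketch: separate conv branches for headline (256 filters) and body (1024)."""
    def __init__(self, pretrained, num_classes=4, width=3):
        super().__init__()
        embed_dim = pretrained.size(1)
        # "Static" channel: FastText-style embeddings kept frozen during training.
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.conv_head = nn.Conv1d(embed_dim, 256, width)    # headline branch
        self.conv_body = nn.Conv1d(embed_dim, 1024, width)   # body branch
        self.fc = nn.Linear(256 + 1024, num_classes)

    def encode(self, token_ids, conv):
        x = self.embed(token_ids).transpose(1, 2)     # (batch, embed_dim, seq_len)
        return F.relu(conv(x)).max(dim=2).values      # max-over-time -> global vector

    def forward(self, headline_ids, body_ids):
        h = self.encode(headline_ids, self.conv_head)  # (batch, 256)
        b = self.encode(body_ids, self.conv_body)      # (batch, 1024)
        return self.fc(torch.cat([h, b], dim=1))       # softmax applied via the loss

pretrained = torch.randn(20000, 100)                   # placeholder FastText vectors
model = BCNN(pretrained)
logits = model(torch.randint(0, 20000, (4, 15)), torch.randint(0, 20000, (4, 500)))
probs = torch.softmax(logits, dim=-1)                  # probability over classes
```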
REFERENCES
[1] G. O. Young, "Synthetic structure of industrial plastics," in Plastics, 2nd ed., vol. 3, J. Peters, Ed. New York: McGraw-Hill, 1964, pp. 15–64.
[2] W.-K. Chen, Linear Networks and Systems. Belmont, CA: Wadsworth, 1993, pp. 123–135.
[3] H. Poor, An Introduction to Signal Detection and Estimation. New York: Springer-Verlag, 1985, ch. 4.
[4] B. Smith, "An approach to graphs of linear forms," unpublished.
[5] E. H. Miller, "A note on reflector arrays," IEEE Trans. Antennas Propagat., to be published.
[6] J. Wang, "Fundamentals of erbium-doped fiber amplifiers arrays," IEEE J. Quantum Electron., submitted for publication.
[7] C. J. Kaufman, Rocky Mountain Research Lab., Boulder, CO, private communication, May 1995.
[8] Y. Yorozu, M. Hirano, K. Oka, and Y. Tagawa, "Electron spectroscopy studies on magneto-optical media and plastic substrate interfaces," IEEE Transl. J. Magn. Jpn., vol. 2, pp. 740–741, Aug. 1987 [Dig. 9th Annu. Conf. Magnetics Japan, 1982, p. 301].
[9] M. Young, The Technical Writer's Handbook. Mill Valley, CA: University Science, 1989.
[10] J. U. Duncombe, "Infrared navigation—Part I: An assessment of feasibility," IEEE Trans. Electron Devices, vol. ED-11, pp. 34–39, Jan. 1959.
[11] S. Chen, B. Mulgrew, and P. M. Grant, "A clustering technique for digital communications channel equalization using radial basis function networks," IEEE Trans. Neural Networks, vol. 4, pp. 570–578, Jul. 1993.
[12] R. W. Lucky, "Automatic equalization for digital communication," Bell Syst. Tech. J., vol. 44, no. 4, pp. 547–588, Apr. 1965.
[13] S. P. Bingulac, "On the compatibility of adaptive controllers," in Proc. 4th Annu. Allerton Conf. Circuits and Systems Theory, New York, 1994, pp. 8–16.
[14] G. R. Faulhaber, "Design of service systems with priority reservation," in Conf. Rec. 1995 IEEE Int. Conf. Communications, pp. 3–8.