Fake News Machine Learning Research Paper

This document discusses the challenges and methodologies of fake news detection (FND) using deep learning, particularly focusing on Korean-language news. It categorizes existing FND methods into supervised, weakly supervised, and unsupervised learning, emphasizing the importance of various linguistic, temporal-structural, and hybrid features. The paper also reviews the performance of deep learning models and suggests future directions for improving FND techniques.

Fake News Detection Using Deep Learning
(Oct 2023)

First A. Author, Second B. Author, Jr., and Third C. Author, Member, IEEE

Abstract—With the wide spread of Social Network Services (SNS), fake news—which is a way of disguising false information as legitimate media—has become a big social issue. This paper proposes a deep learning architecture for detecting fake news written in Korean. Previous works proposed appropriate fake news detection models for English, but Korean raises two issues that prevent existing models from being applied directly: (1) Korean can express the same meaning in shorter sentences than English, so feature scarcity makes it difficult to train a deep neural network; and (2) semantic analysis is difficult because of morpheme ambiguity. We worked to resolve these issues by implementing a system using various convolutional neural network-based deep learning architectures and "Fasttext," a word-embedding model trained at the syllable level. After training and testing the implementation, we achieved meaningful accuracy for classifying discrepancies between the body and the context, but the accuracy was low for classifying discrepancies between the headline and the body.

Keywords: Artificial Intelligence, Fake News Detection, Natural Language Processing.

I. INTRODUCTION

The Internet and social media platforms have become indispensable ways for people to obtain news in their daily lives. Because they allow news to be disseminated rapidly and freely, the public has unhindered access to it whenever and wherever they want; as of August 2018, over 68 percent of Americans acquired their news through social media.1 However, the lack of oversight from authorities renders the quality of news spread online far lower than that of traditional media. The online information ecosystem is extremely noisy and fraught with disinformation and fake news. Fake news refers to news that is fabricated to deceive people, and it exerts negative effects on individuals as well as on society as a whole. It misleads people with false or biased stories for self-serving purposes, seriously affecting public opinion and social stability. For example, during the COVID-19 pandemic, the mix of real and fake information about the outbreak was so overwhelming that the World Health Organization2 called it an "information epidemic." In the first three months of 2020, about 6,000 people were hospitalized worldwide because of coronavirus misinformation, and researchers said at least 800 people may have died due to misinformation related to COVID-19.3

Fake news detection (FND) aims to identify fake news automatically. Existing traditional ML-based FND methods require feature engineering. According to the features the models utilize, these methods can be broadly divided into three categories: linguistic features, temporal-structural features, and hybrid features. Linguistic-feature-based approaches detect fake news from the text content (Castillo et al., 2011). For example, Castillo et al. (2011) employed a variety of linguistic features such as special characters, emoticon symbols, and sentimental words, and Popat (2017) investigated language-style features such as assertive verbs and factive verbs. In addition, some methods explore temporal-structural features (e.g., propagation features) derived from social networks to detect fake news (Jin et al., 2013). For instance, Wu et al. (2015) proposed an SVM to learn high-order propagation patterns, and Sampson et al. (2016) employed implicit links between conversation fragments to properly classify emergent conversations. Hybrid approaches that combine different types of features have also been proposed. For example, Sun et al. (2013) combined news content, user, and multimedia features, and Ma et al. (2015) combined the temporal variations of content-based, user-based, and diffusion-based features as the news propagation evolves. Though these traditional ML methods achieve reasonable results, they depend on labor-intensive, hand-crafted feature engineering.

Deep learning based fake news detection. With the success of deep learning in various domains, DL-based FND methods have been proposed and have recently attracted significant attention. Firstly, deep learning can avoid feature engineering and take full advantage of its strong expressive power to model the features of input news. For example, Ma et al. (2016) modeled the sequential relationship between news posts utilizing recurrent neural networks. Yu et al. (2017) utilized convolutional neural networks to represent high-level semantic relationships between news posts. Bian et al. (2020) leveraged Graph Convolutional Networks (GCNs) over both top-down and bottom-up directed graphs of rumor spreading to learn its propagation and dispersion patterns. Khattar et al. (2019) proposed the multi-modal Variational Autoencoder (MVAE) to extract hidden multi-modal representations of multimedia news. Our contributions are as follows:

• A comprehensive review. We conduct a comprehensive survey to present a thorough overview and analysis of DL-based FND methods along three lines: supervised, semi-supervised, and unsupervised learning.

• A quantitative analysis. We present a quantitative analysis of the performance of DL-based FND methods on a variety of datasets, so that researchers can learn from it.
• Some future directions. We discuss the remaining limitations of existing FND methods and point out possible future directions.

The remainder of this survey is organized as follows. We first revisit the task definition of FND in Section 2. Then we explain our new taxonomy of DL-based FND methods in Section 3. We present a comprehensive survey of the three categories in Sections 4 and 5, respectively. Section 6 introduces several commonly used FND datasets and presents a quantitative performance analysis of the DL-based FND methods, and in Section 7 we finally give a conclusion as well as some promising directions.

A Taxonomy of DL-based methods. The previous survey (Zhou and Zafarani, 2018) divided fake news detection methods from the viewpoint of features. Note that the scarcity of labeled data has been attracting increasingly many researchers to combat fake news with few or no labels. We therefore propose to categorize fake news detection methods as supervised, weakly supervised, and unsupervised. Fig. 2 shows the resulting taxonomy of deep learning based fake news detection methods used in this paper. Moreover, it is worth noting that methods along these three dimensions focus on different features, which we will elaborate on in subsequent subsections.

• Supervised methods: Supervised methods learn with labeled data. The main concern is how the DL models utilize this rich feature information. Therefore, we further divide these methods into news content based, social context based, and external knowledge based. Besides, we particularize solutions that integrate different or even heterogeneous feature information in Section 4.4.
• Weakly/Un-supervised methods: Weakly supervised methods assume that only limited labeled data is available in the learning stage. The common solution is to derive weak supervision from the available information. According to the way of obtaining weak supervision, we divide the semi-supervised methods into weak content supervision and weak social supervision. Unsupervised methods learn with totally unlabeled data; some researchers resort to generative methods or utilize probabilistic knowledge. We particularize them in Section 5.

Problem definition. In the FND task, we define the output space $\mathcal{Y} = \{0, 1\}$, indicating whether the news is fake or not. The input space is relatively complex, comprising not only the information the news itself carries, such as the news content, but also the social context and external knowledge such as knowledge bases (see Table 1). Formally, let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ denote a set of $n$ news items with annotated labels, with $\{x_i\}_{i=1}^{n}$ denoting the news pieces and $\{y_i\}_{i=1}^{n} \subset \{0, 1\}^n$ denoting the corresponding labels of whether each news item is fake or not. Note that, since manually annotating data is expensive and time consuming, there is large-scale unlabeled news in the real world; thus, for scenarios with few or no labeled data, weakly supervised and even unsupervised methods are needed.

Supervised methods. Supervised methods tend to learn from various features with labeled examples. To achieve better performance, numerous techniques such as the introduction of multi-modal information, external knowledge, and integration strategies have been explored. According to the features utilized, we further divide them as follows.

4.1. News content-based methods. News content means the explicit information a news item originally carries, such as the text of an article or the images attached to it. Generally, news content-based methods utilize these textual/visual features. Table 2 compares different news content-based methods, which we elaborate on in subsequent subsections.

4.1.1. Single modality. Textual features can be classified as generic features and latent features. The former are often used within a traditional ML framework and describe textual content at several linguistic levels: lexicon, syntax, discourse, semantics, etc. Previous work has summarized them in a detailed table (Zhou and Zafarani, 2020). Latent textual features refer to news text embeddings, which can be derived at the word, sentence, and document levels. In this way, a news article can be represented by latent vectors, which can either be used right away as input to classifiers (such as SVMs) or subsequently be integrated into neural network structures.
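As a concrete illustration of the "latent vectors as classifier input" route, the sketch below feeds pre-computed news embeddings to an SVM with scikit-learn. The 300-dimensional vectors and labels are random placeholders standing in for real embeddings; this is an illustrative sketch, not code from the surveyed papers.

```python
# Latent news embeddings used directly as SVM input (placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))    # latent vectors for 200 news items
y = rng.integers(0, 2, size=200)   # 0 = real, 1 = fake

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```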
Supervised methods play a crucial role in the field of information retrieval, particularly in the domain of news content analysis. These methods are designed to learn and make predictions based on labeled examples, and to enhance their performance, various techniques have been explored. This article delves into the world of supervised methods with a focus on news content-based approaches, which rely on the explicit information carried by news articles, including text and images. We will explore the different facets of news content-based methods, with an emphasis on textual features and their classification into generic and latent features.

Supervised methods are a fundamental paradigm in machine learning, where models are trained on labeled examples to make predictions or classify data points. In the context of news content analysis, these methods are particularly valuable for tasks such as sentiment analysis, topic classification, and fake news detection. These tasks require the model to learn from various features extracted from news articles, and the choice of features is a critical factor in determining the model's performance.

One approach to enhancing the performance of supervised methods in news content analysis is the incorporation of multi-modal information. Multi-modal information involves utilizing both textual and visual features present in news articles. This means that in addition to the text of the articles, any accompanying images or multimedia elements can be used as valuable sources of information. This approach allows models to have a more comprehensive understanding of the news content and can lead to more accurate predictions.

External knowledge is another technique that has been explored to improve the performance of supervised methods. External knowledge sources, such as databases, ontologies, or domain-specific knowledge bases, can be integrated into the model. This external knowledge can provide additional context and information that may not be present in the news articles themselves. For instance, integrating information about named entities, geographical locations, or historical events from external knowledge sources can enhance the model's ability to interpret and analyze news content effectively.

Integration strategies also play a vital role in enhancing the performance of supervised methods in news content analysis. These strategies involve combining information from different sources or features in a coherent and effective manner. The choice of integration strategy can significantly impact the model's ability to make accurate predictions. For example, a fusion of textual and visual features might require a different integration strategy compared to combining textual features with external knowledge.

Now, let's delve deeper into the specific aspect of news content-based methods, with a focus on textual features. Textual features are at the core of many news content analysis tasks, as they provide valuable information about the linguistic content of news articles. These textual features can be classified into two main categories: generic features and latent features.
Generic textual features encompass a wide range of attributes that describe the text of news articles. These features can be used within a traditional machine learning framework and typically capture information at various linguistic levels, including:

1. Lexicon: These features focus on the vocabulary used in the text, including word frequency, diversity of vocabulary, and language complexity.

2. Syntax: Syntax-related features examine the grammatical structure of sentences, such as the use of punctuation, sentence length, and grammatical patterns.

3. Discourse: Discourse features consider the organization and flow of information within the text, including the use of transitional words and coherence.

4. Semantic: Semantic features delve into the meaning and interpretation of the text, including sentiment analysis, word associations, and semantic roles.

These generic textual features provide a comprehensive view of the linguistic characteristics of news articles. They can be used to build models that perform tasks such as text classification, summarization, and sentiment analysis.

In addition to generic textual features, there are latent textual features, which are a more advanced and abstract representation of news text. These features are often derived through techniques like text embedding. Text embedding is a process that transforms words, sentences, or entire documents into high-dimensional vector representations.

Text embedding can occur at different levels of granularity, including the word, sentence, and document levels. For example, at the word level, each word in a news article can be represented as a vector, and these vectors can capture semantic relationships between words. At the sentence level, the entire sentence is embedded into a vector, which can capture the context and meaning of the sentence. At the document level, the entire news article is represented as a vector, summarizing the content in a high-dimensional space.

The use of latent textual features has gained popularity due to the effectiveness of methods like word embeddings, sentence embeddings, and document embeddings. These latent representations allow news articles to be expressed as vectors in a continuous space, which enables various machine learning models to operate on them effectively.
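The toy example below illustrates these levels under a simplifying assumption: a random lookup table stands in for trained word vectors, and sentence and document embeddings are taken as simple averages.

```python
# Word-, sentence-, and document-level embeddings via averaging (toy data).
import numpy as np

rng = np.random.default_rng(0)
vocab = {"stocks": 0, "fell": 1, "sharply": 2, "today": 3}
word_vecs = rng.normal(size=(len(vocab), 50))  # one 50-dim vector per word

def sentence_vec(sentence):
    # Sentence-level embedding: mean of the sentence's word vectors.
    return word_vecs[[vocab[w] for w in sentence.split()]].mean(axis=0)

def document_vec(sentences):
    # Document-level embedding: mean of the document's sentence vectors.
    return np.mean([sentence_vec(s) for s in sentences], axis=0)

doc = ["stocks fell sharply", "stocks fell today"]
print(document_vec(doc).shape)  # (50,)
```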
One of the popular approaches for integrating latent textual features into supervised methods is the use of recurrent neural networks (RNNs).

RNNs are a class of neural networks that are well-suited for sequential data, making them particularly useful for processing text. RNNs can take sequences of word embeddings or sentence embeddings as input and capture dependencies and context within the text.

In the context of news content analysis, RNNs can be used to classify news articles into different categories, perform sentiment analysis, or detect the presence of specific topics. By utilizing latent textual features, RNNs can capture the nuanced and context-dependent aspects of news content, which may not be apparent when using generic textual features alone.

To summarize, supervised methods in news content analysis have evolved significantly to incorporate multi-modal information, external knowledge, and effective integration strategies. Within the realm of textual features, both generic and latent features are essential for understanding the linguistic content of news articles. Generic features capture linguistic attributes at various levels, while latent features, often derived through text embedding, provide abstract representations of the text. Integration strategies, such as using recurrent neural networks, play a pivotal role in making the most of these features and enhancing the performance of supervised methods. These advancements contribute to the field's ability to extract valuable insights from news content and make informed predictions or classifications, which are crucial in the era of information overload and misinformation.

Recurrent neural networks (RNNs) are highly capable of modeling sequential data. Ma et al. (2016) used recurrent neural networks as the basis and captured the relevant information of an event over time by learning its hidden-layer representation. Chen et al. (2019b) proposed an Attention-Residual network called ARC, which can capture long-range dependencies through an attention mechanism, while a convolutional neural network is applied to select important components and local features. Ma et al. (2019) proposed a GAN-based model to obtain low-frequency but strong representations of fake news: a GRU-based generator produces controversial instances that complicate the distribution of tweets' opinions, and an RNN-based discriminator identifies the effective features from the hard samples generated by the generator. Although the above RNN-based models have achieved good results, they are biased towards the latest elements of the input sequence, while key features do not necessarily appear at the rear of an input sequence. To address this issue with RNN-based models, Yu et al. (2017) proposed a method for fake news detection based on a convolutional neural network (CNN); the model can extract essential features from an input sequence and form relationships among relevant features at a high level. Vaibhav and Hovy (2019) modeled each news article in the dataset as a graph and reformulated the fake news detection task as a graph classification task, where the nodes represent the sentences of the article and the edges represent the semantic similarity between pairs of sentences; they used two widely applied graph neural networks, GCN and GAT, to generate graph embeddings and used the embeddings to classify fake news. Inspired by multi-task learning, Wu et al. (2019) designed a multi-task learning model that employs a fake news detection task and a stance classification task to optimize a shared layer simultaneously, resulting in enhanced news representations; the sharing layer filters and selects shared feature flows between the fake news and stance detection tasks using a gate mechanism and an attention mechanism. Cheng et al. (2020) proposed an LSTM-based variational autoencoder model to extract latent representations of tweet-level text. Besides, in order to detect fake news as early as possible, some researchers assume that multifaceted information is not available before a news article has become popular; these works use text-only features on purpose. For example, Qian et al. (2018) proposed a model that generates user feedback from text, which is then used in the classification process along with word-level and sentence-level information from real articles to compensate for the lack of user reviews as an auxiliary source of information in early detection. Giachanou et al. (2019) considered the role of emotional signals and proposed an LSTM model that incorporates emotional signals obtained from the text.

In this paper, we apply and transform various mechanisms based on "Fasttext" [7] and "Shallow-and-wide CNN" [8] to implement a model for detecting fake news. This section introduces the previous related works that we use to implement models for fake news detection.

2.1 Word Embedding
Word embedding is a method of mapping words or phrases to vectors of real numbers. The traditional method, "discrete representation," uses a "one-hot vector" representation that consists of 0s in all dimensions except for a single 1 in the one dimension used to represent the word. However, "discrete representation" does not reflect context and has problems handling synonyms and antonyms. More recently, "distributed representation" has emerged as a way to represent words in a continuous vector space where all dimensions contribute to representing the word. This paper introduces and applies "Word2vec" and "Fasttext" among the various representations.

2.1.1 Word2vec
"Word2vec" represents word embedding using a neural network; it has two model architectures for learning distributed representations of words: continuous bag-of-words (CBOW) and Skip-gram.
The Skip-gram architecture is widely used because it works better on semantic tasks than the CBOW model [9]. The Skip-gram architecture uses each current word as the input ($w_t$) to the model and predicts the words within a certain range before and after the current word ($w_{t-k} \sim w_{t+k}$). It maximizes the classification of a word based on another word in the same sentence, so similar words have similar vectors and their similarity increases [10]. Given a sequence of training words ($w_1 \sim w_T$) and the size of the training context ($c$), the objective of the Skip-gram model is to maximize the average log probability

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t). \tag{1}$$
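As a hedged illustration, the snippet below trains Skip-gram vectors with the gensim library, where `sg=1` selects Skip-gram over CBOW. The two-sentence corpus and hyperparameters are placeholders, not settings from the paper.

```python
# Training Skip-gram word embeddings with gensim (toy corpus).
from gensim.models import Word2Vec

corpus = [
    ["fake", "news", "spreads", "fast"],
    ["real", "news", "spreads", "slowly"],
]
model = Word2Vec(corpus, vector_size=100, window=5, sg=1,
                 min_count=1, epochs=50)
print(model.wv.most_similar("news", topn=2))
```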
2.1.2 Fasttext
"Fasttext" is a method that adds the concept of "sub-words" to "Word2vec." Each word is represented as the sum of its n-gram vectors and the word vector itself. Taking the word apple and $n = 3$ as an example, it is represented by the character n-grams <ap, app, ppl, ple, le> and the word itself, <apple>. The reason 2-grams appear is that special boundary symbols < and > are added at the beginning and end of each word to distinguish prefixes and suffixes from other character sequences. The formulation is as follows: suppose we are given a dictionary of n-grams of size $G$. Given a word $w$, let $\mathcal{G}_w \subset \{1, \dots, G\}$ denote the set of n-grams appearing in $w$; the word is then represented by the sum of the vector representations of these n-grams.
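A small helper, sketched below, reproduces the sub-word scheme just described for the word apple with n = 3; the function name `char_ngrams` is our own, not from the paper.

```python
# Character n-grams with boundary symbols, as in the Fasttext sub-word scheme.
def char_ngrams(word, n=3):
    padded = "<" + word + ">"  # boundary symbols mark prefixes/suffixes
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams + [padded]    # Fasttext also keeps the whole word itself

print(char_ngrams("apple"))
# ['<ap', 'app', 'ppl', 'ple', 'le>', '<apple>']
```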
2.2 Shallow-and-Wide CNN
The model architecture, shown in Fig. 1, is the "Shallow-and-wide CNN" architecture of Kim [8]. The first layer is a look-up table: the set of k-dimensional word vectors in which the i-th vector corresponds to the i-th word in the sentence. A convolution operation is then applied with multiple filter widths, followed by a max-over-time pooling operation. Finally, these features are passed to a fully connected layer and the prediction is made with a softmax layer.

The model has two channels of word vectors: one named "Static" that is kept static throughout training and one named "Non-static" that is fine-tuned via backpropagation. Previous work conducted sentiment analysis on a dataset of short sentences, and the "Static" and "Non-static" results were comparable, but "Non-static" allows the words to attain more meaningful representations [8]. However, if only "Non-static" is used, new words can be over-fitted in this model. Therefore, both channels are used to secure the generality of the meaning of words.
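The PyTorch sketch below is one possible reading of this architecture: a frozen "Static" embedding channel and a trainable "Non-static" channel, convolutions with several filter widths, max-over-time pooling, and a softmax output. All sizes are assumed placeholders, and the random embedding table stands in for pre-trained vectors.

```python
# A shallow-and-wide CNN sketch with "Static"/"Non-static" channels.
import torch
import torch.nn as nn

class ShallowWideCNN(nn.Module):
    def __init__(self, vocab=10000, dim=300, widths=(3, 4, 5),
                 filters=100, classes=2):
        super().__init__()
        pretrained = torch.randn(vocab, dim)  # stand-in for trained vectors
        self.static = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.nonstatic = nn.Embedding.from_pretrained(pretrained.clone(),
                                                      freeze=False)
        # One convolution per filter width; the input has 2 channels.
        self.convs = nn.ModuleList(
            [nn.Conv2d(2, filters, kernel_size=(w, dim)) for w in widths])
        self.fc = nn.Linear(filters * len(widths), classes)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = torch.stack([self.static(tokens), self.nonstatic(tokens)], dim=1)
        feats = []                                # x: (batch, 2, seq_len, dim)
        for conv in self.convs:
            c = torch.relu(conv(x)).squeeze(3)    # (batch, filters, seq-w+1)
            feats.append(c.max(dim=2).values)     # max-over-time pooling
        return torch.softmax(self.fc(torch.cat(feats, dim=1)), dim=1)

model = ShallowWideCNN()
probs = model(torch.randint(0, 10000, (8, 50)))  # 8 sentences, 50 tokens each
print(probs.shape)                               # torch.Size([8, 2])
```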
2.3 Attentive Pooling
The model architecture shown in Fig. 2 is the "Attentive-pooling" architecture of Santos et al. [12]. Recently, attention mechanisms have been used successfully for image captioning [13] and machine translation [14], but there had been few studies applying the attention mechanism to NLP tasks with two inputs, such as pair-wise ranking or text classification. "Attentive pooling" has improved performance on these tasks by effectively representing the similarity of the two inputs [12]. Although the Term Frequency-Inverse Document Frequency (TF-IDF) method statistically measures similarity by the frequency of words in a document, this model measures similarity by increasing the weight of words that have the same or similar meanings across the two inputs.
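The following PyTorch sketch shows one way attentive pooling along the lines of Santos et al. [12] can be implemented: a bilinear alignment matrix between the two inputs is max-pooled along each axis to produce attention weights. The dimensions and initialization are assumptions for illustration, not the paper's exact configuration.

```python
# An attentive-pooling sketch: align two inputs, then pool with attention.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.U = nn.Parameter(torch.randn(dim, dim) * 0.01)  # bilinear weight

    def forward(self, h1, h2):   # h1: (batch, len1, dim), h2: (batch, len2, dim)
        G = torch.tanh(h1 @ self.U @ h2.transpose(1, 2))  # (batch, len1, len2)
        a1 = torch.softmax(G.max(dim=2).values, dim=1)    # weights over h1
        a2 = torch.softmax(G.max(dim=1).values, dim=1)    # weights over h2
        r1 = (a1.unsqueeze(2) * h1).sum(dim=1)            # (batch, dim)
        r2 = (a2.unsqueeze(2) * h2).sum(dim=1)
        return r1, r2

pool = AttentivePooling()
r1, r2 = pool(torch.randn(4, 20, 128), torch.randn(4, 60, 128))
print(torch.cosine_similarity(r1, r2).shape)  # torch.Size([4]): similarities
```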
Sentiment analysis and fake news detection are critical tasks in the field of natural language processing. They are essential for understanding public opinion, identifying deceptive information, and making informed decisions in various domains, including journalism, politics, and social media. This article focuses on a specific approach that combines the use of two channels of word vectors, "Static" and "Non-static," and employs attentive pooling to enhance performance in these tasks.

### The Significance of Sentiment Analysis and Fake News Detection

Sentiment analysis, also known as opinion mining, is the process of determining the sentiment or emotional tone of a piece of text, typically to understand whether it is positive, negative, or neutral. It is widely used in applications such as customer feedback analysis, social media monitoring, and product reviews. Sentiment analysis is particularly relevant in the age of the internet, where vast amounts of textual data are generated daily, making it impossible for humans to manually process and interpret.

Fake news detection, on the other hand, is a critical task in the era of information overload and disinformation. The spread of false or misleading information can have severe consequences, ranging from influencing public opinion to causing real-world harm. Accurate and timely identification of fake news is essential for maintaining the integrity of information dissemination, and machine learning approaches have become valuable tools in this endeavor.

In this context, the article discusses an approach that combines sentiment analysis and fake news detection using deep learning techniques, with a particular focus on the use of "Static" and "Non-static" channels of word vectors and attentive pooling.

### Two Channels of Word Vectors: "Static" and "Non-static"

The article introduces an approach to word vector representation that employs two channels: "Static" and "Non-static." These channels play a crucial role in enhancing the model's performance in sentiment analysis and fake news detection.

1. **Static Word Vectors**: The "Static" channel consists of word vectors that remain fixed throughout the training process. These vectors are pre-trained on a large corpus of text data and capture the semantic meaning of words. Static word vectors are often obtained using techniques like Word2Vec or GloVe, which provide distributed representations of words based on co-occurrence statistics in the training corpus. These vectors are highly valuable as they encode general semantic information about words.

2. **Non-static Word Vectors**: In contrast, the "Non-static" channel of word vectors is fine-tuned via backpropagation during the model's training. This means that the model can adjust and adapt these vectors to better fit the specific task it is designed for. While the "Static" vectors provide a solid foundation of word semantics, the "Non-static" vectors allow the model to make nuanced adjustments to word representations based on the context and goals of the task. This adaptability is particularly advantageous in tasks like sentiment analysis and fake news detection, where the meaning of words can be influenced by their surrounding context.

The rationale behind using both "Static" and "Non-static" channels is to strike a balance between generality and adaptability. When conducting sentiment analysis on short sentences, previous research has shown that the results obtained using "Static" and "Non-static" channels are comparable. However, relying solely on the "Non-static" channel can lead to overfitting, especially when dealing with new or rare words. Therefore, the use of both channels ensures that the model benefits from the general semantic knowledge provided by "Static" vectors while also having the flexibility to fine-tune word representations as needed.

This dual-channel approach can be thought of as having a strong foundation in general word semantics (Static) while being able to make task-specific adjustments (Non-static). It strikes a balance between capturing the inherent meaning of words and adapting to the specific requirements of sentiment analysis and fake news detection.

### Attentive Pooling

The model architecture employed in this approach incorporates an "Attentive Pooling" mechanism, which is a key component contributing to its success. This mechanism is credited to Santos et al. in the context of text classification tasks, particularly pair-wise ranking.

1. **Introduction to Attentive Pooling**: Attentive pooling is a mechanism that leverages attention to effectively represent the similarity between two inputs. Attention mechanisms, originally popularized in the field of computer vision, have found success in various natural language processing (NLP) tasks, such as machine translation and image captioning.

2. **Applying Attention to NLP Tasks**: While attention mechanisms have been widely adopted in NLP for tasks like machine translation, their application to tasks involving two inputs, such as pair-wise ranking or text classification, was relatively unexplored until the introduction of attentive pooling.

3. **Enhancing Similarity Representation**: Attentive pooling is particularly valuable in tasks where determining the similarity or relatedness between two inputs is essential. In the context of sentiment analysis and fake news detection, understanding the relationship between text data is critical for making accurate predictions. Attentive pooling enhances the model's ability to represent the similarity between the two inputs, improving overall performance.

4. **Comparison with TF-IDF**: While traditional approaches like Term Frequency-Inverse Document Frequency (TF-IDF) measure similarity based on word frequency in a document, attentive pooling takes a different approach. It assigns higher weights to words that have the same or similar meanings in both inputs, effectively capturing the semantic similarity between text data.

The significance of attentive pooling lies in its ability to capture nuanced relationships between inputs. This is particularly useful in sentiment analysis, where the sentiment expressed in a sentence can be influenced by the context and the words used. In the context of fake news detection, attentive pooling helps the model understand the connections between different news articles or pieces of information, which is crucial for identifying deceptive content.

### Practical Implications and Applications

The combination of the two channels of word vectors, "Static" and "Non-static," along with attentive pooling, has practical implications and applications in various domains:

1. **Sentiment Analysis**: Sentiment analysis is a fundamental task in the realm of customer feedback analysis, market research, and brand management. The model's ability to adapt word representations through the "Non-static" channel allows it to capture fine-grained sentiment information, making it more effective at identifying sentiment nuances in short texts, such as social media posts and product reviews.

2. **Fake News Detection**: Fake news detection is of paramount importance in the current information landscape, where misinformation and disinformation spread rapidly. The combination of both word vector channels and attentive pooling enables the model to understand the semantic relationships between news articles and detect patterns associated with deceptive content. This can have significant implications for maintaining the integrity of news dissemination and ensuring the public is well informed.

3. **Natural Language Understanding**: The approach discussed in the article has broader applications in natural language understanding. Understanding the meaning of words and their relationships in context is crucial for various NLP tasks, including text summarization, question answering, and dialogue systems. The dual-channel word vectors and attentive pooling can be adapted to these tasks to enhance their performance.

4. **Multimodal Analysis**: The concept of dual-channel word vectors can also be extended to multimodal analysis, where both text and image data are considered. This can be particularly valuable in applications like social media content analysis, where textual content is accompanied by images. The model's adaptability through the "Non-static" channel can be extended to both text and image data, allowing for a richer understanding of content.

### Future Directions and Challenges

While the approach presented in the article offers promising results and practical applications, there are also challenges and avenues for future research in this field.

1. **Data Quality and Quantity**: The success of deep learning models heavily depends on the quality and quantity of training data. Sentiment analysis and fake news detection models need access to diverse and large datasets to generalize well. Ensuring the quality and representativeness of the data is a challenge in itself, as it often requires manual annotation and curation.

2. **Handling Multimodal Data**: As mentioned, the approach can be extended to handle both text and image data. However, integrating and effectively utilizing multimodal data presents challenges in terms of data preprocessing, model architecture, and alignment of features between modalities.

3. **Interpretable Models**: Understanding the decisions made by deep learning models is an ongoing challenge. The model discussed in the article may achieve high accuracy, but its decision-making process can be opaque. Developing interpretable models that can explain why a particular prediction was made is crucial, especially in sensitive applications like fake news detection.

4. **Adaptation to Evolving Language**: Language is constantly evolving, with new words and phrases emerging over time. The model's adaptability through the "Non-static" channel helps, but staying up to date with language changes and evolving semantics is a challenge.

5. **Ethical Considerations**: Sentiment analysis and fake news detection models can have significant societal implications. Ensuring that these models are used responsibly and ethically is an ongoing challenge. Bias, fairness, and ethical considerations must be addressed in the development and deployment of such models.

### Conclusion

Sentiment analysis and fake news detection are pivotal tasks in natural language processing and information retrieval. The article discussed an approach that combines the use of two channels of word vectors, "Static" and "Non-static," along with attentive pooling to enhance the performance of models in these tasks.

The "Static" channel provides a strong foundation of general word semantics, while the "Non-static" channel allows adaptability to task-specific requirements. This dual-channel approach strikes a balance between generality and adaptability, ensuring that the model can handle both common and rare words effectively.

Attentive pooling, a mechanism introduced by Santos et al., plays a crucial role in improving the similarity representation between inputs. It enhances the model's ability to capture nuanced relationships between text data, which is valuable in tasks like sentiment analysis and fake news detection.

The practical applications of this approach extend to sentiment analysis, fake news detection, natural language understanding, and multimodal analysis. It has the potential to contribute to the responsible and effective use of deep learning in understanding and interpreting textual data.

While there are challenges to address, including data quality, model interpretability, and ethical considerations, the approach presented in the article offers a promising direction for advancing the capabilities of sentiment analysis and fake news detection models. In an age of information overload and rapid content dissemination, such models are more crucial than ever for ensuring accurate information dissemination and informed decision-making.

2.4 Bi-LSTM
Long short-term memory (LSTM) is a structure that learns how much of the previous network state to apply when new input data is received. It resolves the long-term dependency problem of the conventional recurrent neural network (RNN) by using, in addition to the hidden state, a cell state, which is a memory for storing past input information, together with gates that regulate the removal or addition of information to the cell state. The multiplicative gates and memory are defined for time $t$ [15]:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$\tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$h_t = o_t \odot \tanh(C_t)$$

where $\sigma(\cdot)$ is the sigmoid function and $f_t$, $i_t$, $o_t$, $C_t$, and $h_t$ are the vectors of the forget gate, input gate, output gate, memory cell, and hidden state, respectively. All of the vectors are the same size. Moreover, $W_f$, $W_i$, $W_o$, and $W_c$ denote the weight matrices of the gates, and $b_f$, $b_i$, $b_o$, and $b_c$ denote their bias vectors. Another shortcoming of the conventional RNN is that it can only make use of previous context [16]. To resolve this, the bidirectional RNN (Bi-RNN) stacks two RNN layers: whereas the existing RNN is a forward RNN that only passes previous information onward, the Bi-RNN adds a backward RNN that can receive subsequent information, as shown in Fig. 3. Combining the Bi-RNN with LSTM gives the Bidirectional LSTM (Bi-LSTM), which can handle long-range context in both input directions [16].
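In modern frameworks a Bi-LSTM needs no manual stacking; the PyTorch sketch below, with placeholder sizes, shows the forward and backward states that result from setting `bidirectional=True`.

```python
# A Bi-LSTM over a batch of embedded sentences (placeholder sizes).
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=300, hidden_size=128,
                 batch_first=True, bidirectional=True)
x = torch.randn(8, 50, 300)        # 8 sentences, 50 word vectors each
out, (h_n, c_n) = bilstm(x)
print(out.shape)   # torch.Size([8, 50, 256]): forward + backward states
print(h_n.shape)   # torch.Size([2, 8, 128]): final state per direction
```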

3. Model Architecture
This paper modifies and combines "Fasttext" and "Shallow-and-wide CNN" to implement a fake news detection model. To detect so-called "click-bait" articles among the various types of fake news, we need to understand the consistency and relevance between the headline and the body of an article. To do this, we extract a global feature vector from the headline and from the body, respectively, and compare the two vectors. For the extraction method there are several options, such as TF-IDF and RNNs, but since the overall meaning of a text is determined by a few key words, we use a CNN, which can extract the most salient local features to form a fixed-length global feature vector [17]. We then pass these features to a fully connected layer and make the prediction with a softmax layer. We call this model "BCNN (Bi-CNN)" because the convolution and pooling are applied to both inputs of the model, the headline and the body. Moreover, we try to improve the accuracy by implementing new models that apply LSTM/Bi-LSTM and attentive pooling to the BCNN. In this section, we first apply "Word2vec" and "Fasttext," two representative word embedding techniques, to Korean and compare their accuracy. Then, we introduce several BCNN models built on the better-performing word embedding technique.

3.1 Word Embedding
We train 100K articles with "Word2vec" and "Fasttext" to find a suitable word embedding for Korean; the results are shown in Table 1. This paper uses "Fasttext" because its performance is better in terms of accuracy.

3.2 BCNN
The BCNN is a CNN with two inputs and the pre-trained word embedding "Fasttext," as shown in Fig. 4. It extracts feature maps from headlines and bodies using 3-gram filters in the convolution layer. The number of filters is set proportionally, 256 filters for the headline and 1024 filters for the body, considering the huge difference in the amount of text between them. It then reduces each feature map to one vector through the max-pooling layer; this is the process of forming a fixed-length global vector for the headline and the body. Finally, classification is performed through the fully connected layer. We use the Rectified Linear Unit (ReLU) as the activation function and the softmax function as the output function. We use the "Static" channel, which keeps the word embedding static with the pre-trained "Fasttext" vectors throughout training.
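A minimal PyTorch sketch of this description follows: two convolutional branches with 3-gram filters (256 for the headline, 1024 for the body), max-pooling into fixed-length vectors, and a fully connected softmax classifier over their concatenation. The vocabulary size, embedding dimension, and random embedding table are placeholders for the Fasttext setup, not the paper's exact values.

```python
# A two-input BCNN sketch: headline branch + body branch -> softmax.
import torch
import torch.nn as nn

class BCNN(nn.Module):
    def __init__(self, vocab=50000, dim=300, classes=2):
        super().__init__()
        emb = torch.randn(vocab, dim)                  # stand-in for Fasttext
        self.embed = nn.Embedding.from_pretrained(emb, freeze=True)  # "Static"
        self.conv_head = nn.Conv1d(dim, 256, kernel_size=3)   # headline branch
        self.conv_body = nn.Conv1d(dim, 1024, kernel_size=3)  # body branch
        self.fc = nn.Linear(256 + 1024, classes)

    def branch(self, tokens, conv):
        x = self.embed(tokens).transpose(1, 2)   # (batch, dim, seq_len)
        x = torch.relu(conv(x))                  # (batch, filters, seq_len-2)
        return x.max(dim=2).values               # max-pool -> fixed length

    def forward(self, headline, body):
        h = self.branch(headline, self.conv_head)
        b = self.branch(body, self.conv_body)
        return torch.softmax(self.fc(torch.cat([h, b], dim=1)), dim=1)

model = BCNN()
probs = model(torch.randint(0, 50000, (4, 20)),    # 4 headlines, 20 tokens
              torch.randint(0, 50000, (4, 400)))   # 4 bodies, 400 tokens
print(probs.shape)                                 # torch.Size([4, 2])
```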
### Understanding BCNN Architecture

The BCNN is a neural network architecture for processing two different input sources, here referred to as "headlines" and "bodies." These inputs could represent various types of textual data, such as news headlines and the corresponding full article bodies. The primary goal of the BCNN is to learn meaningful and discriminating representations from both inputs and to use these representations for various tasks, with classification being a prominent example. It leverages pre-trained word embeddings, specifically "Fasttext," to represent words as vectors and then employs convolutional layers to extract features from these vectors. The BCNN is particularly useful in tasks that involve two distinct types of text data, such as news headlines and the corresponding article bodies. This section provides a detailed overview of the BCNN architecture and its components, explaining how it processes input data and performs classification.

#### Pre-trained Word Embeddings

One of the foundational elements of the BCNN architecture is the use of pre-trained word embeddings, specifically "Fasttext." Word embeddings are vector representations of words in a continuous space that capture the semantic relationships between words. These embeddings are pre-trained on large text corpora, allowing them to encode general language knowledge. By using "Fasttext," the BCNN benefits from these pre-trained embeddings to represent the words in the input data.

The advantage of pre-trained word embeddings lies in their ability to provide a meaningful and consistent representation of words. This is especially important for neural network architectures, as it enables the model to understand the meaning of words based on their context in a large text corpus. Instead of initializing word representations randomly, leveraging pre-trained embeddings like "Fasttext" equips the BCNN with a solid foundation for understanding the language used in the inputs.

#### Convolutional Layers and Feature Extraction

In the BCNN, the next essential component is the convolutional layers, which are responsible for extracting features from the input data. Convolutional neural networks (CNNs) are well known for their ability to capture local patterns and features within data. In the context of text data, the convolutional layers slide small windows, or filters, over the input to identify relevant patterns.

One notable aspect of the BCNN is the use of 3-gram filters for feature extraction, meaning that the convolutional layers analyze sequences of three words at a time. The choice of 3-gram filters is a common practice in text analysis, as it allows the model to capture short but meaningful sequences of words. These filters can identify patterns like word combinations or phrases that hold significance in the given context. For instance, in a news classification task, a 3-gram filter might detect the phrase "stock market crash" as a relevant feature, which could be highly informative for the classification of financial news. The BCNN employs these filters to extract relevant features from both the headlines and the article bodies.

#### Different Numbers of Filters

A notable characteristic of the BCNN architecture is the use of different numbers of filters for the headline and body inputs. This is based on the consideration that headlines are typically much shorter than full article bodies, resulting in different amounts of text data to process. To address this imbalance, the BCNN allocates 256 filters for the headline input and a more substantial 1024 filters for the body input.

The rationale behind this distribution is to ensure that the model can extract meaningful features from both types of input data, regardless of their differing lengths. By providing more filters for the body input, the BCNN accommodates the greater volume of text data found in articles. This approach acknowledges the importance of treating each input source appropriately to extract relevant information effectively.

#### Max-Pooling Layer

After extracting features using the convolutional layers, the BCNN employs a max-pooling layer. The max-pooling process transforms the feature maps into one vector per input. This transformation is crucial for obtaining a fixed-length global vector representation of both the headline and the body.

The max-pooling layer selects the most significant information from the feature maps, effectively summarizing the relevant features. By choosing the maximum value within each feature map, the BCNN retains the most salient information while discarding less important details. The result is a concise and informative representation of each input, which is essential for further processing and classification.

#### Classification and Activation Functions

The final step in the BCNN architecture involves classification. Once the fixed-length global vectors for the headline and body are obtained, they are fed into a fully connected layer for classification. This layer is responsible for making predictions based on the learned representations.

To introduce non-linearity and enable the model to capture complex relationships in the data, the BCNN uses the Rectified Linear Unit (ReLU) as an activation function. ReLU is a widely used activation function that helps the model handle non-linear transformations in the data. It replaces negative values with zeros and passes positive values unchanged, enabling the network to learn complex patterns and relationships within the feature vectors.

For classification tasks, especially those involving multiple classes or categories, the softmax function is employed as the output function. The softmax function converts the model's output scores into probability distributions over the possible classes. This makes it easier to interpret the model's predictions and determine the most likely class for a given input.

#### Leveraging the "Static" Channel

The BCNN relies on the "Static" channel, which retains word embeddings as static throughout the training process. This means that the word embeddings derived from "Fasttext" remain fixed and unchanged during training. The use of static word embeddings ensures that the model preserves the general semantic knowledge encoded in these pre-trained embeddings.

By keeping the word embeddings static, the BCNN benefits from the consistent and well-established word representations provided by "Fasttext." These static embeddings serve as a reliable foundation for understanding the language used in both the headlines and article bodies, contributing to the model's ability to capture the meaning of words effectively.

### Practical Implications and Applications

The BCNN architecture, with its dual-channel word embeddings, convolutional layers, and max-pooling, has practical implications and applications in various domains:

1. **Text Classification**: The BCNN is particularly well suited for text classification tasks. It can be applied to a wide range of classification problems, including sentiment analysis, topic categorization, and fake news detection. By extracting features and creating global vector representations of text inputs, the BCNN provides a powerful tool for making accurate predictions.

2. **News Analysis**: In the context of news analysis, the BCNN can process both news headlines and full article bodies, making it valuable for news classification and summarization. For instance, it can help categorize news articles into different topics or identify potentially misleading headlines.

3. **Multimodal Analysis**: While the BCNN architecture is primarily designed for text data, its principles can be extended to multimodal analysis. By incorporating both text and images, it becomes applicable in domains like social media content analysis, where textual content is often accompanied by images.

4. **Natural Language Understanding**: The BCNN contributes to natural language understanding by enabling the model to capture relevant features from text inputs. This understanding is crucial for various NLP tasks, such as text summarization, question answering, and dialogue systems.

5. **Domain-Specific Applications**: The architecture can be adapted to specific domains, allowing for fine-tuning and

customization. For example, in the medical field, the model could be fine-tuned on medical news articles.

3.3 (Bi-)LSTM + BCNN
(Bi-)LSTM + BCNN applies context information to the existing word embedding by training it with a (Bi-)LSTM, as shown in Fig. 5. We expect improved performance because each word vector carries both the vector trained by "Fasttext" and the context information of the sentence.
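One plausible realization of this idea, sketched below in PyTorch with placeholder sizes, runs a Bi-LSTM over the embedded sequence first, so the CNN convolves over context-aware vectors rather than raw embeddings.

```python
# One branch of a (Bi-)LSTM + BCNN sketch: Bi-LSTM -> CNN -> max-pool.
import torch
import torch.nn as nn

class BiLSTMBranch(nn.Module):
    def __init__(self, dim=300, hidden=150, filters=256):
        super().__init__()
        self.bilstm = nn.LSTM(dim, hidden, batch_first=True,
                              bidirectional=True)
        self.conv = nn.Conv1d(2 * hidden, filters, kernel_size=3)

    def forward(self, embedded):              # (batch, seq_len, dim)
        ctx, _ = self.bilstm(embedded)        # (batch, seq_len, 2*hidden)
        x = torch.relu(self.conv(ctx.transpose(1, 2)))
        return x.max(dim=2).values            # fixed-length feature vector

branch = BiLSTMBranch()
print(branch(torch.randn(4, 20, 300)).shape)  # torch.Size([4, 256])
```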
4.1 Dataset
We use 100K articles crawled from the Joongang Ilbo, Dong-A Ilbo, Chosun Ilbo, Hankyoreh, and Maeil Business newspapers as the dataset. For each press, we categorize the news into economy, society, politics, entertainment, and sports, and then collect articles in the same proportion for each category. Of these, we use 31K for mission1 and 68K for mission2. Real news and fake news are in the same proportion for each mission, and the training and validation data are split at a ratio of 9:1. We measure the model's accuracy with test data consisting of 350 recent articles (as of March 2018) that are not included in the training and validation data and that contain real and fake news in the same proportion.

4.2 Experiment Results
We measure the accuracy using the model with the lowest validation loss among the training steps; the results are shown in Table 4. AUROC (area under the receiver operating characteristic curve) is used as the evaluation metric [18].
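For reference, the snippet below shows how AUROC is computed with scikit-learn; the labels and scores here are placeholders, not results from Table 4.

```python
# AUROC from predicted probabilities (placeholder labels and scores).
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                 # 1 = fake news
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]   # model's predicted probabilities
print("AUROC:", roc_auc_score(y_true, y_score))
```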
5. Conclusions
This paper implements a deep learning model for fake news detection and measures its accuracy; the main findings are as follows. (1) The accuracy of classification for mission2, which consists of fake news that is irrelevant to the article context, is highest with APS-BCNN, at an AUROC score of 0.726. It can be concluded that the similarity vector between the headline and body contributes to detecting content that is irrelevant to the context. (2) The accuracy of classification for mission1, which consists of fake news where the headline and body are inconsistent, is highest with the BCNN, at an AUROC score of 0.52; however, this accuracy is too low to detect real fake news. We can deduce the causes of the low accuracy as follows: (a) since the CNN classifies using local information in the text, mission2 would have achieved high accuracy because of its large amount of perturbed local information, whereas mission1 has relatively little perturbed local information and would therefore have been difficult to classify; (b) the difference in the amount of training data between mission1 and mission2 would have caused the difference in accuracy between the missions; we were able to acquire a large amount of fake news data for mission2 by mixing parts of the bodies of several articles, but since the fake news data for mission1 had to be created individually, it was difficult to acquire as much data as for mission2. (3) CNN with LSTM has low classification accuracy. Although previous work on LSTM-CNN achieved high accuracy in the text classification of a single input [19], applying LSTM to the text classification of two inputs, as in this paper, yielded low accuracy; consider, for example, the case where both the headline and the body contain the same word "apple". (4) "Fasttext" performs better than "Word2vec" in terms of Korean word similarity. We can deduce the cause of the better performance as follows: unlike in many other languages, the syllables that form Korean words carry their own meaning; for example, one Korean word is composed of a syllable meaning "big" and a syllable meaning "learn." This would have made "Fasttext," which is trained at the syllable level, perform better in word similarity than "Word2vec," which is trained at the word level. This paper proposes a meaningful deep learning model for fake news detection. The limitation of this study is that we could achieve meaningful accuracy for classification in the case where the content of the body is irrelevant to the context, but the accuracy was low when the headline and body were inconsistent.

REFERENCES
[1] G. O. Young, “Synthetic structure of industrial plastics (Book style with
paper title and editor),” in Plastics, 2nd ed. vol. 3, J. Peters, Ed.
New York: McGraw-Hill, 1964, pp. 15–64.
[2] W.-K. Chen, Linear Networks and Systems (Book style). Belmont,
CA: Wadsworth, 1993, pp. 123–135.
[3] H. Poor, An Introduction to Signal Detection and Estimation. New
York: Springer-Verlag, 1985, ch. 4.
[4] B. Smith, “An approach to graphs of linear forms (Unpublished work
style),” unpublished.
[5] E. H. Miller, “A note on reflector arrays (Periodical style—Accepted for
publication),” IEEE Trans. Antennas Propagat., to be published.
[6] J. Wang, “Fundamentals of erbium-doped fiber amplifiers arrays
(Periodical style—Submitted for publication),” IEEE J. Quantum
Electron., submitted for publication.
[7] C. J. Kaufman, Rocky Mountain Research Lab., Boulder, CO, private
communication, May 1995.
[8] Y. Yorozu, M. Hirano, K. Oka, and Y. Tagawa, “Electron spectroscopy
studies on magneto-optical media and plastic substrate interfaces
(Translation Journals style),” IEEE Transl. J. Magn.Jpn., vol. 2, Aug.
1987, pp. 740–741 [Dig. 9th Annu. Conf. Magnetics Japan, 1982, p. 301].
[9] M. Young, The Technical Writer's Handbook. Mill Valley, CA: University Science, 1989.
[10] J. U. Duncombe, “Infrared navigation—Part I: An assessment of
feasibility (Periodical style),” IEEE Trans. Electron Devices, vol. ED-11,
pp. 34–39, Jan. 1959.
[11] S. Chen, B. Mulgrew, and P. M. Grant, “A clustering technique for
digital communications channel equalization using radial basis function
networks,” IEEE Trans. Neural Networks, vol. 4, pp. 570–578, Jul.
1993.
[12] R. W. Lucky, “Automatic equalization for digital communication,” Bell
Syst. Tech. J., vol. 44, no. 4, pp. 547–588, Apr. 1965.
[13] S. P. Bingulac, “On the compatibility of adaptive controllers (Published
Conference Proceedings style),” in Proc. 4th Annu. Allerton Conf.
Circuits and Systems Theory, New York, 1994, pp. 8–16.
[14] G. R. Faulhaber, “Design of service systems with priority reservation,”
in Conf. Rec. 1995 IEEE Int. Conf. Communications, pp. 3–8.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," Advances in Neural Information Processing Systems, vol. 26, pp. 3111–3119, 2013.
[12] C. D. Santos, M. Tan, B. Xiang, and B. Zhou, "Attentive pooling networks," 2016 [Online]. Available: https://arxiv.org/abs/1602.03609.
[13] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, "Show, attend and tell: neural image caption generation with visual attention."
