
The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

Bad Actor, Good Advisor:
Exploring the Role of Large Language Models in Fake News Detection

Beizhe Hu¹,², Qiang Sheng¹, Juan Cao¹,², Yuhui Shi¹,², Yang Li¹,², Danding Wang¹, Peng Qi³

¹ CAS Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences
² University of Chinese Academy of Sciences
³ National University of Singapore
{hubeizhe21s, shengqiang18z, caojuan, shiyuhui22s, liyang23s, wangdanding}@ict.ac.cn, [email protected]

Abstract

Detecting fake news requires both a delicate sense of diverse clues and a profound understanding of the real-world background, which remains challenging for detectors based on small language models (SLMs) due to their knowledge and capability limitations. Recent advances in large language models (LLMs) have shown remarkable performance in various tasks, but whether and how LLMs could help with fake news detection remains underexplored. In this paper, we investigate the potential of LLMs in fake news detection. First, we conduct an empirical study and find that a sophisticated LLM such as GPT-3.5 could generally expose fake news and provide desirable multi-perspective rationales but still underperforms the basic SLM, a fine-tuned BERT. Our subsequent analysis attributes such a gap to the LLM's inability to select and integrate rationales properly when drawing conclusions. Based on these findings, we propose that current LLMs may not substitute fine-tuned SLMs in fake news detection but can be a good advisor for SLMs by providing multi-perspective instructive rationales. To instantiate this proposal, we design an adaptive rationale guidance network for fake news detection (ARG), in which SLMs selectively acquire insights on news analysis from the LLMs' rationales. We further derive a rationale-free version of ARG by distillation, namely ARG-D, which serves cost-sensitive scenarios without querying LLMs. Experiments on two real-world datasets demonstrate that ARG and ARG-D outperform three types of baseline methods, including SLM-based, LLM-based, and combinations of small and large language models.

[Figure 1: Illustration of the role of large language models (LLMs) in fake news detection, using a fake news item ("Detailed photos of Xiang Liu's tendon surgery exposed. Stop complaints and please show sympathy and blessings!"). (a) With plain prompting, the LLM fails to output a correct judgment of news veracity ("The answer is real."). (b) With perspective-specific prompting, the LLM helps the small language model (SLM) judge correctly by providing informative rationales, e.g., on commonsense ("Real surgery generally won't be exposed...") and textual description ("The language is emotional and tries to attract audience...").]

Introduction

The wide and fast spread of fake news online has posed real-world threats in critical domains like politics (Fisher, Cox, and Hermann 2016), economy (CHEQ 2019), and public health (Naeem and Bhatti 2020). Among the countermeasures to combat this issue, automatic fake news detection, which aims at distinguishing inaccurate and intentionally misleading news items from others automatically, has been a promising solution in practice (Shu et al. 2017; Roth 2022).

Though much progress has been made (Hu et al. 2022a), understanding and characterizing fake news is still challenging for current models. This is caused by the complexity of the news-faking process: fake news creators might manipulate any part of the news, using diverse writing strategies and being driven by inscrutable underlying aims. Therefore, to maintain both effectiveness and universality, an ideal fake news detection method is required to have: 1) a delicate sense of diverse clues (e.g., style, facts, commonsense); and 2) a profound understanding of the real-world background.

Recent methods (Zhang et al. 2021; Kaliyar, Goswami, and Narang 2021; Mosallanezhad et al. 2022; Hu et al. 2023) generally exploit pre-trained small language models (SLMs)[1] like BERT (Devlin et al. 2019) and RoBERTa (Liu et al. 2019) to understand news content and provide fundamental representations, plus optional social contexts (Shu et al. 2019; Cui et al. 2022), knowledge bases (Popat et al. 2018; Hu et al. 2022b), or news environments (Sheng et al. 2022) as supplements. SLMs do bring improvements, but their knowledge and capability limitations also compromise further enhancement of fake news detectors. For example, BERT was pre-trained on text corpora like Wikipedia (Devlin et al. 2019) and thus struggles to handle news items that require knowledge not included there (Sheng et al. 2021).

[1] Academia lacks a consensus on the size boundary between small and large language models at present, but it is widely accepted that BERT (Devlin et al. 2019) and the GPT-3 family (Brown et al. 2020) are respectively small and large ones (Zhao et al. 2023).

Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
As a new alternative to SLMs, large language models (LLMs) (OpenAI 2022; Anthropic 2023; Touvron et al. 2023), which are usually trained on larger-scale corpora and aligned with human preferences, have shown impressive emergent abilities on various tasks (Wei et al. 2022a) and are considered promising as general task solvers (Ma et al. 2023). However, the potential of LLMs in fake news detection remains underexplored: 1) Can LLMs help detect fake news with their internal knowledge and capability? 2) What solution should we adopt to obtain better performance using LLMs?

To answer these two questions, we first conduct a deep investigation into the effective role of LLMs in fake news detection and attempt to provide a practical LLM-involved solution. Unlike contemporary works (Pelrine et al. 2023; Caramancion 2023), which simply prompt LLMs to provide predictions with the task instruction, we conduct a detailed empirical study to mine LLMs' potential. Specifically, we use four typical prompting approaches (zero-shot/few-shot vanilla/chain-of-thought prompting) to ask the LLM to make veracity judgments of given news items (Figure 1(a)) and find that even the best-performing LLM-based method still underperforms task-specific fine-tuned SLMs. We then perform an analysis of the LLM-generated explanatory rationales and find that the LLM could provide reasonable and informative rationales from several perspectives. By subsequently inducing the LLM with perspective-specific prompts and performing rule-based ensembles of judgments, we find that rationales indeed benefit fake news detection, and we attribute the unsatisfying performance to the LLM's inability to select and integrate rationales properly when drawing conclusions.

Based on these findings, we propose that the current LLM may not be a good substitute for the well-fine-tuned SLM but could serve as a good advisor by providing instructive rationales, as presented in Figure 1(b). To instantiate our proposal, we design the adaptive rationale guidance (ARG) network for fake news detection, which bridges the small and large LMs by selectively injecting new insights about news analysis from the large LM's rationales into the small LM. The ARG further derives the rationale-free ARG-D via distillation for cost-sensitive scenarios with no need to query LLMs. Experiments on two real-world datasets show that ARG and ARG-D outperform existing SLM-only, LLM-only, and combination methods. Our contributions are as follows:

• Detailed investigation: We investigate the effective role of LLMs in fake news detection and find that the LLM is bad at veracity judgment but good at analyzing content;
• Novel and practical solution: We design a novel ARG network and its distilled version ARG-D, which complement small and large LMs by selectively acquiring insights from LLM-generated rationales for SLMs, and show their superiority through extensive experiments;
• Useful resource: We construct a rationale collection from GPT-3.5 for fake news detection in two languages (Chinese and English) and make it publicly available to facilitate further research.[2]

[2] Code, data, and the extended version are available at https://github.com/ICTMCG/ARG

Is the LLM a Good Detector?

In this section, we evaluate the performance of the representative LLM, GPT-3.5, in fake news detection to reveal its judgment capability. We exploit four typical prompting approaches and perform a comparison with the SLM (here, BERT) fine-tuned on this task. Formally, given a news item x, the model aims to predict whether x is fake or not.

Experimental Settings

Dataset  We employ the Chinese dataset Weibo21 (Nan et al. 2021) and the English dataset GossipCop (Shu et al. 2020) for evaluation. Following existing works (Zhu et al. 2022; Mu, Bontcheva, and Aletras 2023), we preprocess the datasets with deduplication and a temporal data split to avoid possible performance overrating of the SLM caused by data leakage. Table 1 presents the dataset statistics.

Table 1: Statistics of the fake news detection datasets.

            Chinese                   English
#       Train    Val     Test    Train    Val     Test
Real    2,331    1,172   1,137   2,878    1,030   1,024
Fake    2,873    779     814     1,006    244     234
Total   5,204    1,951   1,951   3,884    1,274   1,258

[Figure 2: Illustration of prompting approaches for LLMs. (a) Zero-shot prompting: task description + news. (b) Zero-shot CoT prompting: additionally an eliciting sentence (e.g., "Let's think step by step"). (c) Few-shot prompting: additionally N news-label pairs. (d) Few-shot CoT prompting: additionally N news-label-rationale triplets. The LLM outputs a prediction for (a) and (c), and a rationale plus a prediction for (b) and (d).]

Large Language Model  We evaluate GPT-3.5-turbo, the LLM developed by OpenAI that underlies the popular chatbot ChatGPT (OpenAI 2022), due to its representativeness and convenient API access. The large scale of parameters makes task-specific fine-tuning almost impossible for LLMs, so we use the prompt learning paradigm, where an LLM learns tasks given prompts containing instructions or few-shot demonstrations (Liu et al. 2023a). In detail, we utilize the following four typical prompting approaches to elicit the potential of the LLM in fake news detection (Figure 2):

• Zero-Shot Prompting constructs a prompt containing only the task description and the given news. To make the responses more proficient and decrease the refusal ratio, we optionally adopt the role-playing technique when describing our task (Liu et al. 2023b; Ramlochan 2023).
• Zero-Shot CoT Prompting (Kojima et al. 2022) is a simple and straightforward chain-of-thought (CoT) prompting approach to encourage the LLM to reason. In addition to the elements in zero-shot prompting, it adds an eliciting sentence such as "Let's think step by step."
• Few-Shot Prompting (Brown et al. 2020) provides task-specific prompts and several news-label examples as demonstrations. After preliminary tests of {2,4,8}-shot settings, we choose 4-shot prompting, which includes two real and two fake samples.
• Few-Shot CoT Prompting (Wei et al. 2022b) not only provides news-label examples but also demonstrates reasoning steps with prepared rationales. Here, we obtain the provided rationale demonstrations from the correct and reasonable outputs of zero-shot CoT prompting.
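To make the four settings concrete, the sketch below shows how such prompts could be assembled. The task description, role-play wording, and demonstration formats are illustrative placeholders, not the exact prompts used in the paper.

```python
# Illustrative sketch of the four prompting settings; all strings are
# placeholders, not the exact prompts used in the paper.

TASK = ("You are an experienced fact-checker. "  # optional role-playing
        "Decide whether the following news item is real or fake.")

def zero_shot(news):
    # (a) task description + the given news only
    return f"{TASK}\nNews: {news}\nAnswer:"

def zero_shot_cot(news):
    # (b) additionally an eliciting sentence to trigger step-by-step reasoning
    return f"{TASK}\nNews: {news}\nLet's think step by step."

def few_shot(news, demos):
    # (c) additionally N news-label pairs (the paper uses 4: two real, two fake)
    shots = "\n".join(f"News: {n}\nAnswer: {y}" for n, y in demos)
    return f"{TASK}\n{shots}\nNews: {news}\nAnswer:"

def few_shot_cot(news, demos):
    # (d) additionally N news-label-rationale triplets, the rationales taken
    # from correct and reasonable zero-shot CoT outputs
    shots = "\n".join(f"News: {n}\nRationale: {r}\nAnswer: {y}"
                      for n, y, r in demos)
    return f"{TASK}\n{shots}\nNews: {news}\nAnswer:"
```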
Small Language Model  We adopt the pre-trained small language model BERT (Devlin et al. 2019) as the representative, given its wide use in this task (Kaliyar, Goswami, and Narang 2021; Zhu et al. 2022; Sheng et al. 2022). Specifically, we limit the maximum length of the text to 170 tokens and use chinese-bert-wwm-ext and bert-base-uncased from the Transformers package (Wolf et al. 2020) for the Chinese and English evaluations, respectively. We use Adam as the optimizer and do a grid search for the optimal learning rate. We report the testing result of the best-validation checkpoint.
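A minimal fine-tuning sketch under these stated settings is given below; only the checkpoints, the 170-token limit, and the Adam optimizer come from the paper, while the data pipeline and the concrete learning rate are simplified placeholders.

```python
# Sketch of the SLM baseline; only the checkpoints, 170-token limit, and
# Adam optimizer follow the paper, the rest is simplified.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NAME = "bert-base-uncased"  # "hfl/chinese-bert-wwm-ext" for the Chinese setup
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSequenceClassification.from_pretrained(NAME, num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # lr grid-searched

def train_step(texts, labels):
    batch = tokenizer(texts, max_length=170, truncation=True,
                      padding=True, return_tensors="pt")
    out = model(**batch, labels=torch.tensor(labels))  # cross-entropy inside
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```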
Comparison between Small and Large LMs

Table 2 presents the performance of GPT-3.5-turbo with the four prompting approaches and the fine-tuned BERT on the two datasets.

Table 2: Performance in macro F1 of the large and small LMs. In the original typesetting, the best two results are bolded and underlined, respectively; the relative increases over the second-best results are shown in brackets.

Model           Usage           Chinese          English
GPT-3.5-turbo   Zero-Shot       0.676            0.568
                Zero-Shot CoT   0.677            0.666
                Few-Shot        0.725            0.697
                Few-Shot CoT    0.681            0.702
BERT            Fine-tuning     0.753 (+3.8%)    0.765 (+9.0%)

We observe that:
1) Though the LLM is generally believed to be powerful, it underperforms the fine-tuned SLM with all four prompting approaches. The SLM has a relative increase of 3.8%~11.3% in Chinese and 9.0%~34.6% in English over the LLM, indicating that the LLM lacks the task-specific knowledge that the SLM learns during fine-tuning.
2) Few-shot versions outperform zero-shot ones, suggesting the importance of task samples. However, introducing several samples only narrows the gap with the SLM but does not lead to surpassing it.
3) CoT prompting brings additional performance gains in general, especially under the zero-shot setting on the English dataset (+17.3%). However, we also observe some cases where CoT leads to a decrease. This indicates that effective use of rationales may require more careful design.
Overall, given the LLM's unsatisfying performance and higher inference costs than the SLM, the current LLM is not a "good enough" detector that could substitute task-specific SLMs in fake news detection.

Analysis on the Rationales from the LLM

Though the LLM is bad at news veracity judgment, we also notice that the rationales generated through zero-shot CoT prompting exhibit a unique multi-perspective analytical capability that is challenging and rare for SLMs. For further exploration, we sample 500 items from each of the two datasets and manually categorize them according to the perspectives from which the LLM performs the news analysis. Statistical results by perspective and example cases are presented in Table 3.[3]

Table 3: Analysis of different perspectives in the LLM's rationales on the sample set, including the data proportion, the LLM's performance, and cases.

Textual Description — Chinese: 65%, macF1 0.706; English: 71%, macF1 0.653
  Case — News: Everyone! Don't buy cherries anymore: Cherries of this year are infested with maggots, and nearly 100% are affected. LLM Rationale: ...The tone of the news is extremely urgent, seemingly trying to spread panic and anxiety. Prediction: Fake. Ground Truth: Fake.

Commonsense — Chinese: 71%, macF1 0.698; English: 60%, macF1 0.680
  Case — News: Huang, the chief of Du'an Civil Affairs Bureau, gets subsistence allowances of 509 citizens, owns nine properties, and has six wives... LLM Rationale: ...The news content is extremely outrageous...Such a situation is incredibly rare in reality and even could be thought impossible. Prediction: Fake. Ground Truth: Fake.

Factuality — Chinese: 17%, macF1 0.629; English: 24%, macF1 0.626
  Case — News: The 18th National Congress has approved that individuals who are at least 18 years old are now eligible to marry... LLM Rationale: First, the claim that Chinese individuals at least 18 years old can register their marriage is real, as this is stipulated by Chinese law... Prediction: Real. Ground Truth: Fake.

Others — Chinese: 4%, macF1 0.649; English: 8%, macF1 0.704

We see that: 1) The LLM is capable of generating human-like rationales on news content from various perspectives, such as textual description, commonsense, and factuality, which meets the requirement of a delicate sense of diverse clues and a profound understanding of the real-world background in fake news detection. 2) The detection performance on the subsets using certain perspectives is higher than the zero-shot CoT result on the full testing set. This indicates the potential of analysis by perspective, though the coverage is moderate. 3) The analysis from the factuality perspective leads to performance lower than average, indicating the unreliability of using the LLM for factuality analysis based on its internal memorization. We speculate this is caused by the hallucination issue (Ji et al. 2023; Zhang et al. 2023).

[3] Note that a sample may be analyzed from multiple perspectives, and thus the sum of proportions might be larger than 100%.
We further investigate the LLM's performance when asked to perform analysis from a specific perspective on the full testing set (i.e., 100% coverage).[4] From the first group in Table 4, we see that the LLM's judgment with single-perspective analysis elicited is still promising. Compared with the comprehensive zero-shot CoT setting, the single-perspective-based LLM performs comparably on the Chinese dataset and better on the English dataset (in the commonsense perspective case). These results showcase that the LLM's internal mechanism for integrating rationales from diverse perspectives is ineffective for fake news detection, limiting the full use of the rationales. In this case, combining the small and large LMs to complement each other is a promising solution: the former could benefit from the analytical capability of the latter, while the latter could be enhanced by task-specific knowledge from the former.

Table 4: Performance of the LLM using zero-shot CoT with a perspective specified, and of other compared models. TD: textual description; CS: commonsense.

Model           Usage                   Chinese   English
GPT-3.5-turbo   Zero-Shot CoT           0.677     0.666
                from Perspective TD     0.667     0.611
                from Perspective CS     0.678     0.698
BERT            Fine-tuning             0.753     0.765
Ensemble        Majority Voting         0.735     0.724
                Oracle Voting           0.908     0.878

To exhibit the advantages of this solution, we apply majority voting and oracle voting (assuming the ideal situation where, for each sample, we trust the model that judges it correctly, if any) among the two single-perspective-based LLMs and the BERT. The results show that we are likely to gain performance better than any of the LLM-/SLM-only methods mentioned before if we could adaptively combine their advantages, i.e., the flexible task-specific learning of the SLM and the informative rationales generated by the LLM. That is, the LLM could possibly be a good advisor for the SLM by providing rationales, ultimately improving the performance of fake news detection.

[4] We exclude the factuality perspective to avoid the impacts of hallucination. The eliciting sentence is "Let's think from the perspective of [textual description/commonsense]."
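As an illustration, the two voting rules could be computed as in the sketch below, assuming three 0/1 prediction arrays; note that the oracle is an upper bound for analysis, not a deployable method.

```python
# Sketch of the rule-based ensembles; preds_* are 0/1 arrays from the two
# single-perspective LLMs and BERT, y is the ground truth (oracle only).
import numpy as np

def majority_voting(preds_td, preds_cs, preds_bert):
    votes = preds_td + preds_cs + preds_bert
    return (votes >= 2).astype(int)  # at least two of three agree

def oracle_voting(preds_td, preds_cs, preds_bert, y):
    # upper bound: trust any model that judges the sample correctly
    correct_any = (preds_td == y) | (preds_cs == y) | (preds_bert == y)
    return np.where(correct_any, y, preds_bert)  # fall back to BERT otherwise
```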
ARG: Adaptive Rationale Guidance Network for Fake News Detection

Based on the above findings and discussion, we propose the adaptive rationale guidance (ARG) network for fake news detection. Figure 3 overviews the ARG and its rationale-free version ARG-D for cost-sensitive scenarios. The objective of the ARG is to empower small fake news detectors with the ability to adaptively select useful rationales as references for final judgments. Given a news item x and its corresponding LLM-generated rationales r_t (textual description) and r_c (commonsense), the ARG encodes the inputs using the SLM at first (Figure 3(a)). Subsequently, it builds news-rationale collaboration via predicting the LLM's judgment from the rationale, enriching news-rationale feature interaction, and evaluating rationale usefulness (Figure 3(b)). The interactive features are finally aggregated with the news feature x for the final judgment of x being fake or not (Figure 3(c)). The ARG-D is derived from the ARG via distillation for scenarios where the LLM is unavailable (Figure 3(d)).

Representation

We employ two BERT models separately as the news encoder and the rationale encoder to obtain semantic representations. For the given news item x and the two corresponding rationales r_t and r_c, the representations are X, R_t, and R_c, respectively.

News-Rationale Collaboration

The news-rationale collaboration step aims at providing rich interaction between news and rationales and learning to adaptively select useful rationales as references, which is at the core of our design. To achieve this aim, the ARG includes three modules, detailed and exemplified using the textual description rationale branch below:

News-Rationale Interaction  To enable comprehensive information exchange between news and rationales, we introduce a news-rationale interactor with a dual cross-attention mechanism to encourage feature interactions. The cross-attention can be described as:

CA(Q, K, V) = softmax(Q′K′ᵀ / √d) V′,   (1)

where Q′ = W_Q Q, K′ = W_K K, and V′ = W_V V, and d is the dimensionality. Given the representations of the news X and the rationale R_t, the process is:

f_{t→x} = AvgPool(CA(R_t, X, X)),   (2)
f_{x→t} = AvgPool(CA(X, R_t, R_t)),   (3)

where AvgPool(·) is average pooling over the token representations output by the cross-attention to obtain a one-vector text representation f.
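A sketch of this dual cross-attention is given below; the use of PyTorch's nn.MultiheadAttention and the head count are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of the news-rationale interactor, Eqs. (1)-(3); nn.MultiheadAttention
# and the head count are illustrative assumptions.
import torch
from torch import nn

class NewsRationaleInteractor(nn.Module):
    def __init__(self, dim=768, heads=4):
        super().__init__()
        self.ca_r2x = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ca_x2r = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, X, R):  # X: news token reps, R: rationale token reps
        f_r2x, _ = self.ca_r2x(R, X, X)  # CA(R_t, X, X), Eq. (2)
        f_x2r, _ = self.ca_x2r(X, R, R)  # CA(X, R_t, R_t), Eq. (3)
        # average pooling over tokens yields one-vector representations
        return f_r2x.mean(dim=1), f_x2r.mean(dim=1)
```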
LLM Judgment Prediction  Understanding the judgment hinted at by a given rationale is a prerequisite for fully exploiting the information behind it. To this end, we construct the LLM judgment prediction task, which requires predicting the LLM's judgment of the news veracity according to the given rationale. We expect this to deepen the understanding of the rationale texts. For the textual description rationale branch, we feed its representation R_t into the LLM judgment predictor, which is parametrized using a multi-layer perceptron (MLP)[5]:

m̂_t = sigmoid(MLP(R_t)),  L_{pt} = CE(m̂_t, m_t),   (4)

where m_t and m̂_t are respectively the LLM's claimed judgment and its prediction. The loss L_{pt} is a cross-entropy loss, CE(ŷ, y) = −y log ŷ − (1−y) log(1−ŷ). The case is similar for the commonsense rationale R_c.

[5] For brevity, we omit the subscripts of all independently parametrized MLPs.
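Below is a small sketch of Eq. (4); the MLP shape and the pooling of R_t into a single vector are assumptions for illustration.

```python
# Sketch of the LLM judgment predictor, Eq. (4); the MLP shape and the
# pooled rationale input are illustrative assumptions.
import torch
from torch import nn

class LLMJudgmentPredictor(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(),
                                 nn.Linear(dim // 2, 1))

    def forward(self, r, m=None):  # r: pooled rationale representation
        m_hat = torch.sigmoid(self.mlp(r)).squeeze(-1)  # predicted judgment
        loss = None
        if m is not None:  # m: the LLM's claimed judgment as float 0./1.
            loss = nn.functional.binary_cross_entropy(m_hat, m)  # L_pt
        return m_hat, loss
```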
[Figure 3: Overall architecture of our proposed adaptive rationale guidance (ARG) network and its rationale-free version ARG-D. In the ARG, the news item and the LLM rationales are (a) respectively encoded into X and R_* (* ∈ {t, c}); then (b) the small and large LMs collaborate with each other via news-rationale feature interaction, LLM judgment prediction, and rationale usefulness evaluation; and (c) the obtained interactive features f′_{*→x} (* ∈ {t, c}) are finally aggregated with the attentively pooled news feature x for the final judgment. (d) In the ARG-D, the news encoder and the attention module are preserved, and the output of the rationale-aware feature simulator is supervised by the aggregated feature f_cls for knowledge distillation.]

Rationale Usefulness Evaluation  The usefulness of rationales from different perspectives varies across news items, and improper integration may lead to performance degradation. To enable the model to adaptively select appropriate rationales, we devise a rationale usefulness evaluation process, in which we assess the contributions of different rationales and adjust their weights for the subsequent veracity prediction. The process comprises two phases, i.e., evaluation and reweighting. For evaluation, we input the news-aware rationale vector f_{x→t} into the rationale usefulness evaluator (parametrized by an MLP) to predict its usefulness u_t. Following the assumption that rationales leading to correct judgments are more useful, we use judgment correctness as the rationale usefulness label:

û_t = sigmoid(MLP(f_{x→t})),  L_{et} = CE(û_t, u_t).   (5)

In the reweighting phase, we input the vector f_{x→t} into an MLP to obtain a weight number w_t, which is then used to reweight the rationale-aware news vector f_{t→x}:

f′_{t→x} = w_t · f_{t→x}.   (6)

We also use attentive pooling to transform the representation matrix X into a vector x.

Prediction

Based on the outputs from the last step, we now aggregate the news vector x and the rationale-aware news vectors f′_{t→x}, f′_{c→x} for the final judgment. For a news item x with label y ∈ {0, 1}, we aggregate these vectors with different weights:

f_cls = w_x^cls · x + w_t^cls · f′_{t→x} + w_c^cls · f′_{c→x},   (7)

where w_x^cls, w_t^cls, and w_c^cls are learnable parameters ranging from 0 to 1. The fusion vector f_cls is then fed into an MLP classifier for the final prediction of news veracity:

L_ce = CE(MLP(f_cls), y).   (8)

The total loss function is the weighted sum of the loss terms mentioned above:

L = L_ce + β₁(L_{et} + L_{ec}) + β₂(L_{pt} + L_{pc}),   (9)

where β₁ and β₂ are hyperparameters.

Distillation for Rationale-Free Model

The ARG requires sending a request to the LLM for every prediction, which might not be affordable in cost-sensitive scenarios. Therefore, we attempt to build a rationale-free model, namely ARG-D, based on the trained ARG model via knowledge distillation (Hinton, Vinyals, and Dean 2015). The basic idea is to simulate and internalize the knowledge from rationales into a parametric module. As shown in Figure 3(d), we initialize the news encoder and the classifier with the corresponding modules in the ARG and train a rationale-aware feature simulator (implemented with a multi-head transformer block) and an attention module to internalize the knowledge. Besides the cross-entropy loss L_ce, we let the feature f^d_cls imitate f_cls in the ARG using a mean squared error loss:

L_kd = MSE(f^d_cls, f_cls).   (10)
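To tie Eqs. (5)-(10) together, the sketch below shows one possible composition of the usefulness evaluation, reweighting, aggregation, and losses. The 0/1 usefulness labels and the loss structure follow the paper; all module shapes and the exact gating form are illustrative assumptions.

```python
# Sketch tying Eqs. (5)-(10) together; module shapes are illustrative.
import torch
from torch import nn

bce = nn.functional.binary_cross_entropy

class UsefulnessEvaluator(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.eval_mlp = nn.Linear(dim, 1)    # usefulness prediction, Eq. (5)
        self.weight_mlp = nn.Linear(dim, 1)  # reweighting scalar, Eq. (6)

    def forward(self, f_x2r, f_r2x, u=None):
        u_hat = torch.sigmoid(self.eval_mlp(f_x2r)).squeeze(-1)
        loss_e = bce(u_hat, u) if u is not None else None  # u: correctness label
        w = torch.sigmoid(self.weight_mlp(f_x2r))          # weight in [0, 1]
        return w * f_r2x, loss_e  # reweighted rationale-aware news vector

def aggregate(x, f_t, f_c, w_x, w_t, w_c):
    # Eq. (7): learnable scalar gates combine news and rationale-aware vectors
    return w_x * x + w_t * f_t + w_c * f_c

def arg_total_loss(l_ce, l_et, l_ec, l_pt, l_pc, beta1, beta2):
    # Eq. (9): weighted sum of classification, usefulness, and judgment losses
    return l_ce + beta1 * (l_et + l_ec) + beta2 * (l_pt + l_pc)

def distill_loss(f_cls_d, f_cls):
    # Eq. (10): ARG-D's simulated feature imitates the (frozen) ARG feature
    return nn.functional.mse_loss(f_cls_d, f_cls.detach())
```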
Evaluation

Experimental Settings

Baselines  We compare three groups of methods:
G1 (LLM-Only): We list the performance of the best-performing setting on each dataset in Table 2, i.e., few-shot in Chinese and few-shot CoT in English.
G2 (SLM-Only)[6]: 1) Baseline: The vanilla BERT-base model, whose setting remains consistent with that in the section "Is the LLM a Good Detector?". 2) EANN_T (Wang et al. 2018): A model that learns effective signals using auxiliary adversarial training, aiming at removing event-related features as much as possible; we used the publication year as the label for the auxiliary task. 3) Publisher-Emo (Zhang et al. 2021): A model that fuses a series of emotional features with textual features for fake news detection. 4) ENDEF (Zhu et al. 2022): A model that removes entity bias via causal learning for better generalization on distribution-shifted fake news data. All methods in this group used the same BERT as the text encoder.
G3 (LLM+SLM): 1) Baseline+Rationale: It concatenates the features from the news encoder and the rationale encoder and feeds them into an MLP for prediction. 2) SuperICL (Xu et al. 2023): It exploits the SLM as a plug-in for the in-context learning of the LLM by injecting the SLM's prediction and confidence for each testing sample into the prompt.

[6] As this paper focuses on text-based news, we use the text-only variant of the original EANN following (Sheng et al. 2021) and the publisher-emotion-only variant in (Zhang et al. 2021).

Implementation Details  We use the same datasets introduced in the section "Is the LLM a Good Detector?" and keep the settings the same in terms of the pre-trained model, learning rate, and optimization method. For the ARG-D network, the parameters of the news encoder and the classifier are derived from the ARG model. A four-head transformer block is implemented in the rationale-aware feature simulator. The weights of the loss functions L_{et}, L_{pt}, L_{ec}, L_{pc} in the ARG and L_kd in the ARG-D are grid searched.

Performance Comparison and Ablation Study

Table 5: Performance of the ARG and its variants and the LLM-only, SLM-only, and LLM+SLM methods. In the original typesetting, the best two results in macro F1 and accuracy are respectively bolded and underlined. For GPT-3.5-turbo, the best results in Table 2 are reported.

                                          Chinese                             English
Model                                     macF1   Acc.    F1_real F1_fake    macF1   Acc.    F1_real F1_fake
G1: LLM-Only
GPT-3.5-turbo                             0.725   0.734   0.774   0.676      0.702   0.813   0.884   0.519
G2: SLM-Only
Baseline                                  0.753   0.754   0.769   0.737      0.765   0.862   0.916   0.615
EANN_T                                    0.754   0.756   0.773   0.736      0.763   0.864   0.918   0.608
Publisher-Emo                             0.761   0.763   0.784   0.738      0.766   0.868   0.920   0.611
ENDEF                                     0.765   0.766   0.779   0.751      0.768   0.865   0.918   0.618
G3: LLM+SLM
Baseline + Rationale                      0.767   0.769   0.787   0.748      0.777   0.870   0.921   0.633
SuperICL                                  0.757   0.759   0.779   0.734      0.736   0.864   0.920   0.551
ARG                                       0.784   0.786   0.804   0.764      0.790   0.878   0.926   0.653
(Relative Impr. over Baseline)            (+4.2%) (+4.3%) (+4.6%) (+3.8%)    (+3.2%) (+1.8%) (+1.1%) (+6.3%)
  w/o LLM Judgment Predictor              0.773   0.774   0.789   0.756      0.786   0.880   0.928   0.645
  w/o Rationale Usefulness Evaluator      0.781   0.783   0.801   0.761      0.782   0.873   0.923   0.641
  w/o Predictor & Evaluator               0.769   0.770   0.782   0.756      0.780   0.874   0.923   0.637
ARG-D                                     0.771   0.772   0.785   0.756      0.778   0.870   0.921   0.634
(Relative Impr. over Baseline)            (+2.4%) (+2.3%) (+2.1%) (+2.6%)    (+1.6%) (+0.9%) (+0.6%) (+3.2%)

Table 5 presents the performance of our proposed ARG, its variants, and the compared methods. From the results, we observe that: 1) The ARG outperforms all other compared methods in macro F1, demonstrating its effectiveness. 2) The rationale-free ARG-D still outperforms all compared methods except the ARG and its variants, which shows the positive impact of the knowledge distilled from the ARG. 3) The two compared LLM+SLM methods exhibit different performance. The simple combination of the features of news and rationales yields a performance improvement, showing the usefulness of our prompted rationales. SuperICL outperforms the LLM-only method but fails to consistently outperform the baseline SLM on the two datasets. We speculate that this is due to the complexity of our fake news detection task, where injecting the prediction and confidence of an SLM does not bring sufficient information. 4) We run three ablation groups to evaluate the effectiveness of different modules in the ARG network. From the results, we can see that removing the LLM judgment predictor or the rationale usefulness evaluator brings a significant decrease in ARG performance, highlighting the significance of these two structures. Moreover, even the weakest of the ARG variants still outperforms all other methods, which shows the importance of the news-rationale interaction structure we designed.

Result Analysis

To investigate which part the additional gain of the ARG(-D) should be attributed to, we perform a statistical analysis on the samples that the ARG(-D) judges correctly but the vanilla BERT does not. From Figure 4, we observe that: 1) The proportions of overlap between these samples and those correctly judged by the LLM are over 77%, indicating that the ARG(-D) can exploit (and absorb) the LLM's valuable knowledge for judgments, even though the LLM's own performance is unsatisfying.
[Figure 4: Statistics of the additional correctly judged samples of (a) ARG and (b) ARG-D over the BERT baseline, broken down by whether the LLM from the textual description (TD) and/or commonsense (CS) perspectives also judges them correctly. right(·) denotes the set of samples correctly judged by method (·).]

2) The samples correctly judged by the LLM from both of the two perspectives contribute the most, suggesting that more diverse rationales may enhance the ARG(-D)'s training. 3) 20.4% and 22.1% of the correct judgments should be attributed to the model itself. We speculate that it produces some kind of "new knowledge" based on the wrong judgments over the given knowledge.

Cost Analysis in Practice

We showcase a possible model-shifting strategy to balance performance and cost in practical systems. Inspired by Ma et al. (2023), we simulate the situation where we use the more economical ARG-D by default but query the more powerful ARG for part of the data. As presented in Figure 5, by sending only 23% of the data (selected according to the confidence of the ARG-D) to the ARG, we could achieve 0.784 in macro F1, the same as the performance of fully using the ARG.

[Figure 5: Performance as the shifting threshold changes; the point P(0.23, 0.784) marks where forwarding 23% of the data to the ARG matches the full-ARG macro F1.]
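A sketch of this routing strategy appears below; the predict_proba/predict interfaces are hypothetical, and only the 23% budget comes from Figure 5.

```python
# Sketch of the model-shifting strategy; arg_d.predict_proba and arg.predict
# are hypothetical interfaces, and only the 0.23 budget follows Figure 5.
import torch

def route(news_batch, arg_d, arg, budget=0.23):
    probs = arg_d.predict_proba(news_batch)      # tensor of fake probabilities
    conf = (probs - 0.5).abs()                   # distance from decision boundary
    k = int(budget * len(news_batch))
    uncertain = conf.argsort()[:k]               # indices of least confident items
    preds = (probs > 0.5).long()                 # ARG-D judgments by default
    hard = [news_batch[i] for i in uncertain.tolist()]
    preds[uncertain] = arg.predict(hard)         # assumed to return a LongTensor
    return preds
```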
Related Work

Fake News Detection  Fake news detection is generally formulated as a binary classification task between real and fake news items. Research on this task can be roughly categorized into two groups: social-context-based and content-based methods. Methods in the first group aim at differentiating fake and real news during the diffusion procedure by observing propagation patterns (Zhou and Zafarani 2019), user feedback (Min et al. 2022), and social networks (Nguyen et al. 2020). The second group focuses on finding hints in the given content, including text (Przybyla 2020) and images (Qi et al. 2021), and may require extra assistance from knowledge bases (Popat et al. 2018) and news environments (Sheng et al. 2022). Both groups of methods conventionally obtain textual representations from pre-trained models like BERT but rarely consider their further potential for fake news detection. We conducted such an exploration in this paper by combining large and small LMs and obtained good improvement using only textual content.

LLMs for Natural Language Understanding  LLMs, though mostly generative models, also have powerful natural language understanding (NLU) capabilities, especially in few-shot in-context learning scenarios (Brown et al. 2020). Recent works in this line focus on benchmarking the latest LLMs in NLU. Results show that LLMs may not have comprehensive superiority over a well-trained small model in some types of NLU tasks (Zhong et al. 2023). Our results provide empirical findings in fake news detection with only textual content as the input.

Conclusion and Discussion

We investigated whether large LMs help in fake news detection and how to properly utilize their advantages to improve performance. Results show that the large LM (GPT-3.5) underperforms the task-specific small LM (BERT) but could provide informative rationales and complement small LMs in news understanding. Based on these findings, we designed the ARG network to flexibly combine the respective advantages of small and large LMs and developed its rationale-free version, ARG-D, for cost-sensitive scenarios. Experiments showed the superiority of the ARG and ARG-D.

Discussion  Our findings in fake news detection exemplify the current barrier for LLMs to be competent in applications closely related to a sophisticated real-world background. Though having superior analytical capability, LLMs may struggle to properly make full use of their internal capability. This suggests that "mining" their potential may require novel prompting techniques and a deeper understanding of their internal mechanisms. We then identified the possibility of combining small and large LMs to earn additional improvement and provided a solution especially suitable for situations where the better-performing models have to "select good to learn" from worse ones. We expect our solution to be extended to other tasks and to foster more effective and cost-friendly use of LLMs in the future.

Limitations  We identify the following limitations: 1) We do not examine other well-known LLMs (e.g., Claude[7] and Ernie Bot[8]) due to API unavailability while conducting this research; 2) We only consider the perspectives summarized from the LLM's responses, and there might be other prompting perspectives based on a conceptualization framework of fake news; 3) Our best results still fall behind the oracle voting integration of multi-perspective judgments in Table 4, indicating that room for performance improvement remains in this line of work.

[7] https://claude.ai/
[8] https://yiyan.baidu.com/
Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful comments. This work is supported by the National Natural Science Foundation of China (62203425), the Zhejiang Provincial Key Research and Development Program of China (2021C01164), the Project of the Chinese Academy of Sciences (E141020), the Postdoctoral Fellowship Program of CPSF (GZC20232738), and the CIPSC-SMP-Zhipu.AI Large Model Cross-Disciplinary Fund. The corresponding author is Qiang Sheng.

References

Anthropic. 2023. Model Card and Evaluations for Claude Models. https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf. Accessed: 2023-08-13.

Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models Are Few-Shot Learners. In Advances in Neural Information Processing Systems, 1877–1901. Curran Associates, Inc.

Caramancion, K. M. 2023. News Verifiers Showdown: A Comparative Performance Evaluation of ChatGPT 3.5, ChatGPT 4.0, Bing AI, and Bard in News Fact-Checking. arXiv preprint arXiv:2306.17176.

CHEQ. 2019. The Economic Cost of Bad Actors on the Internet. https://info.cheq.ai/hubfs/Research/THE ECONOMIC COST Fake News final.pdf. Accessed: 2023-08-13.

Cui, J.; Kim, K.; Na, S. H.; and Shin, S. 2022. Meta-Path-based Fake News Detection Leveraging Multi-level Social Context Information. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 325–334. ACM.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. ACL.

Fisher, M.; Cox, J. W.; and Hermann, P. 2016. Pizzagate: From rumor, to hashtag, to gunfire in DC. The Washington Post.

Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.

Hu, B.; Sheng, Q.; Cao, J.; Zhu, Y.; Wang, D.; Wang, Z.; and Jin, Z. 2023. Learn over Past, Evolve for Future: Forecasting Temporal Trends for Fake News Detection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), 116–125. ACL.

Hu, L.; Wei, S.; Zhao, Z.; and Wu, B. 2022a. Deep learning for fake news detection: A comprehensive survey. AI Open, 3: 133–155.

Hu, X.; Guo, Z.; Wu, G.; Liu, A.; Wen, L.; and Yu, P. 2022b. CHEF: A Pilot Chinese Dataset for Evidence-Based Fact-Checking. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3362–3376. ACL.

Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y. J.; Madotto, A.; and Fung, P. 2023. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55: 1–38.

Kaliyar, R. K.; Goswami, A.; and Narang, P. 2021. FakeBERT: Fake News Detection in Social Media with a BERT-based Deep Learning Approach. Multimedia Tools and Applications, 80(8): 11765–11788.

Kojima, T.; Gu, S. S.; Reid, M.; Matsuo, Y.; and Iwasawa, Y. 2022. Large Language Models are Zero-Shot Reasoners. In Advances in Neural Information Processing Systems, volume 35, 22199–22213. Curran Associates, Inc.

Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; and Neubig, G. 2023a. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys, 55(9): 1–35.

Liu, Y.; Deng, G.; Xu, Z.; Li, Y.; Zheng, Y.; Zhang, Y.; Zhao, L.; Zhang, T.; and Liu, Y. 2023b. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. arXiv preprint arXiv:2305.13860.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.

Ma, Y.; Cao, Y.; Hong, Y.; and Sun, A. 2023. Large Language Model Is Not a Good Few-shot Information Extractor, but a Good Reranker for Hard Samples! arXiv preprint arXiv:2303.08559.

Min, E.; Rong, Y.; Bian, Y.; Xu, T.; Zhao, P.; Huang, J.; and Ananiadou, S. 2022. Divide-and-Conquer: Post-User Interaction Network for Fake News Detection on Social Media. In Proceedings of the ACM Web Conference 2022, 1148–1158. ACM.

Mosallanezhad, A.; Karami, M.; Shu, K.; Mancenido, M. V.; and Liu, H. 2022. Domain Adaptive Fake News Detection via Reinforcement Learning. In Proceedings of the ACM Web Conference 2022, 3632–3640. ACM.

Mu, Y.; Bontcheva, K.; and Aletras, N. 2023. It's about Time: Rethinking Evaluation on Rumor Detection Benchmarks using Chronological Splits. In Findings of the Association for Computational Linguistics: EACL 2023, 736–743. ACL.

Naeem, S. B.; and Bhatti, R. 2020. The COVID-19 'infodemic': a new front for information professionals. Health Information & Libraries Journal, 37(3): 233–239.

Nan, Q.; Cao, J.; Zhu, Y.; Wang, Y.; and Li, J. 2021. MDFEND: Multi-domain Fake News Detection. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management. ACM.

Nguyen, V.-H.; Sugiyama, K.; Nakov, P.; and Kan, M.-Y. 2020. FANG: Leveraging Social Context for Fake News Detection Using Graph Representation. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, 1165–1174. ACM.

OpenAI. 2022. ChatGPT: Optimizing Language Models for Dialogue. https://openai.com/blog/chatgpt/. Accessed: 2023-08-13.

Pelrine, K.; Reksoprodjo, M.; Gupta, C.; Christoph, J.; and Rabbany, R. 2023. Towards Reliable Misinformation Mitigation: Generalization, Uncertainty, and GPT-4. arXiv preprint arXiv:2305.14928v1.

Popat, K.; Mukherjee, S.; Yates, A.; and Weikum, G. 2018. DeClarE: Debunking Fake News and False Claims using Evidence-Aware Deep Learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 22–32. ACL.

Przybyla, P. 2020. Capturing the Style of Fake News. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 490–497. AAAI Press.

Qi, P.; Cao, J.; Li, X.; Liu, H.; Sheng, Q.; Mi, X.; He, Q.; Lv, Y.; Guo, C.; and Yu, Y. 2021. Improving Fake News Detection by Using an Entity-enhanced Framework to Fuse Diverse Multimodal Clues. In Proceedings of the 29th ACM International Conference on Multimedia, 1212–1220. ACM.

Ramlochan, S. 2023. Role-Playing in Large Language Models like ChatGPT. https://www.promptengineering.org/role-playing-in-large-language-models-like-chatgpt/. Accessed: 2023-08-13.

Roth, Y. 2022. The vast majority of content we take action on for misinformation is identified proactively. https://twitter.com/yoyoel/status/1483094057471524867. Accessed: 2023-08-13.

Sheng, Q.; Cao, J.; Zhang, X.; Li, R.; Wang, D.; and Zhu, Y. 2022. Zoom Out and Observe: News Environment Perception for Fake News Detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 4543–4556. ACL.

Sheng, Q.; Zhang, X.; Cao, J.; and Zhong, L. 2021. Integrating Pattern- and Fact-based Fake News Detection via Model Preference Learning. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 1640–1650. ACM.

Shu, K.; Cui, L.; Wang, S.; Lee, D.; and Liu, H. 2019. dEFEND: Explainable Fake News Detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 395–405. ACM.

Shu, K.; Mahudeswaran, D.; Wang, S.; Lee, D.; and Liu, H. 2020. FakeNewsNet: A Data Repository with News Content, Social Context and Spatiotemporal Information for Studying Fake News on Social Media. Big Data, 8: 171–188.

Shu, K.; Sliva, A.; Wang, S.; Tang, J.; and Liu, H. 2017. Fake News Detection on Social Media: A Data Mining Perspective. ACM SIGKDD Explorations Newsletter, 19: 22–36.

Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; Rodriguez, A.; Joulin, A.; Grave, E.; and Lample, G. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.

Wang, Y.; Ma, F.; Jin, Z.; Yuan, Y.; Xun, G.; Jha, K.; Su, L.; and Gao, J. 2018. EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 849–857. ACM.

Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; Chi, E. H.; Hashimoto, T.; Vinyals, O.; Liang, P.; Dean, J.; and Fedus, W. 2022a. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research.

Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q. V.; and Zhou, D. 2022b. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, volume 35, 24824–24837. Curran Associates, Inc.

Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Le Scao, T.; Gugger, S.; Drame, M.; Lhoest, Q.; and Rush, A. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45. ACL.

Xu, C.; Xu, Y.; Wang, S.; Liu, Y.; Zhu, C.; and McAuley, J. 2023. Small Models are Valuable Plug-ins for Large Language Models. arXiv preprint arXiv:2305.08848.

Zhang, X.; Cao, J.; Li, X.; Sheng, Q.; Zhong, L.; and Shu, K. 2021. Mining Dual Emotion for Fake News Detection. In Proceedings of The Web Conference 2021, 3465–3476. ACM.

Zhang, Y.; Li, Y.; Cui, L.; Cai, D.; Liu, L.; Fu, T.; Huang, X.; Zhao, E.; Zhang, Y.; Chen, Y.; Wang, L.; Luu, A. T.; Bi, W.; Shi, F.; and Shi, S. 2023. Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. arXiv preprint arXiv:2309.01219.

Zhao, W. X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; Du, Y.; Yang, C.; Chen, Y.; Chen, Z.; Jiang, J.; Ren, R.; Li, Y.; Tang, X.; Liu, Z.; Liu, P.; Nie, J.-Y.; and Wen, J.-R. 2023. A Survey of Large Language Models. arXiv preprint arXiv:2303.18223.

Zhong, Q.; Ding, L.; Liu, J.; Du, B.; and Tao, D. 2023. Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT. arXiv preprint arXiv:2302.10198.

Zhou, X.; and Zafarani, R. 2019. Network-Based Fake News Detection: A Pattern-Driven Approach. ACM SIGKDD Explorations Newsletter, 21(2): 48–60.

Zhu, Y.; Sheng, Q.; Cao, J.; Li, S.; Wang, D.; and Zhuang, F. 2022. Generalizing to the Future: Mitigating Entity Bias in Fake News Detection. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2120–2125. ACM.
