Fakeddit: A Multimodal Fake News Dataset
r/Fakeddit:
A New Multimodal Benchmark Dataset for
Fine-grained Fake News Detection
Kai Nakamura*¶, Sharon Levy*§, William Yang Wang§
¶Laguna Blanca School
§University of California, Santa Barbara
kai.nakamura42@[Link], {sharonlevy, william}@[Link]
Abstract
Fake news has altered society in negative ways in politics and culture. It has adversely affected both online social networks and
offline communities and conversations. Using automatic machine learning classification models is an efficient way to combat
the widespread dissemination of fake news. However, a lack of effective, comprehensive datasets has been a problem for fake news
research and detection model development. Prior fake news datasets do not provide multimodal text and image data, metadata, comment
data, and fine-grained fake news categorization at the scale and breadth of our dataset. We present Fakeddit, a novel multimodal dataset
consisting of over 1 million samples from multiple categories of fake news. After being processed through several stages of review, the
samples are labeled according to 2-way, 3-way, and 6-way classification categories through distant supervision. We construct hybrid
text+image models and perform extensive experiments for multiple variations of classification, demonstrating the importance of the
novel aspect of multimodality and fine-grained classification unique to Fakeddit.
Figure 1: Dataset examples with 6-way classification labels.
Dataset Size (# of samples) # of Classes Modality Source Data Category
LIAR 12,836 6 text Politifact political
FEVER 185,445 3 text Wikipedia variety
BUZZFEEDNEWS 2,282 4 text Facebook political
BUZZFACE 2,263 4 text Facebook political
some-like-it-hoax 15,500 2 text Facebook scientific/conspiracy
PHEME 330 2 text Twitter variety
CREDBANK 60,000,000 5 text Twitter variety
Breaking! 700 2,3 text BS Detector political
NELA-GT-2018 713,000 8 IA text 194 news outlets variety
FAKENEWSNET 602,659 2 text Twitter political/celebrity
FakeNewsCorpus 9,400,000 10 text [Link] variety
FA-KES 804 2 text 15 news outlets Syrian war
Image Manipulation 48 2 image self-taken variety
Fauxtography 1,233 2 text, image Snopes, Reuters variety
image-verification-corpus 17,806 2 text, image Twitter variety
The PS-Battles Dataset 102,028 2 image Reddit manipulated content
Fakeddit (ours) 1,063,106 2,3,6 text, image Reddit variety
Table 1: Comparison of various fake news detection datasets. IA: Individual assessments.
indicates that a post does not contribute to the subreddit’s
theme or is off-topic if it has a low score9. As such, we
filtered out any submissions that had a score of less than 1
to further ensure that our data is credible. We assume that
invalid or irrelevant posts within a subreddit would be ei-
ther removed or down-voted to a score of less than 1. The
high popularity of the Reddit website makes this step par-
ticularly effective as thousands of individual users can give
their opinion of the quality of various submissions.
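As an illustration of this score filter, a minimal sketch in Python (the `submissions` list and its field names are hypothetical stand-ins for Reddit submission records):

```python
# Keep only submissions with a Reddit score of at least 1; posts that the
# community down-votes below 1 are assumed invalid or off-topic.
def filter_by_score(submissions, min_score=1):
    return [s for s in submissions if s.get("score", 0) >= min_score]

posts = [
    {"title": "interesting rock formation", "score": 57},
    {"title": "obvious spam", "score": -4},
    {"title": "borderline post", "score": 0},
]
kept = filter_by_score(posts)
# kept contains only the first post; the other two fall below the threshold.
```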
Our final degree of quality assurance is done manually and
occurs after the previous two stages. We randomly sampled
10 posts from each subreddit in order to determine whether
the submissions really do pertain to each subreddit's theme.
If any of the 10 samples varied from this, we removed the
subreddit from our list. As a result, we ended up with 22
subreddits in our processed data after this filtering. When
labeling our dataset, we labeled each sample according to its
subreddit's theme. These labels were determined during the
last processing phase, as we were able to look through many
samples for each subreddit. Each subreddit is labeled with
one 2-way, 3-way, and 6-way label.
Lastly, we cleaned the submission title text: we removed all
punctuation, numbers, and revealing words such as "PsBattle"
and "colorized" that automatically reveal the subreddit
source. For the savedyouaclick subreddit, we removed text
following the " " character and classified it as misleading
content. We also converted all the text to lowercase.

Figure 2: Distributions of word length in Fakeddit and FEVER
datasets. We exclude samples that have more than 100 words.

As mentioned above, we do not manually label each sample and
instead label our samples based on their respective
subreddit's theme. By doing this, we employ distant
supervision, a commonly used technique, to create our final
labels. While this may create some noise within the dataset,
we aim to remove this from our pseudo-labeled data.

Figure 3: Type-caption curve of Fakeddit vs. FEVER with
4-gram type.

By going through these stages of quality assurance, we can
determine that our final dataset is credible and each subreddit's
label will accurately identify the posts that it contains. We
test this by randomly sampling 150 text-image pairs from our
dataset and having two of our researchers individually and
manually label them for 6-way classification. It is difficult
to narrow down each sample to exactly one subcategory,
especially for those not working in the journalism industry.
We achieve a Cohen's kappa coefficient (Cohen, 1960) of 0.54,
showing moderate agreement and suggesting that some samples
may represent more than one label. While we only provide each
sample with one 6-way label, future work can help identify
multiple labels for each text-image pair.

3.3. Labeling
We provide three labels for each sample, allowing us to train
for 2-way, 3-way, and 6-way classification. Having this
hierarchy of labels will enable researchers to train for fake
news detection at a high level or a more fine-grained one.
The 2-way classification determines whether a sample is fake
or true. The 3-way classification determines whether a sample
is completely true, fake with true text (i.e., direct quotes
from propaganda posters), or fake with false text. Our final
6-way classification was created to categorize different
types of fake news rather than just performing a simple
binary or ternary classification. This can help in
pinpointing the degree and variation of fake news for
applications that require this type of fine-grained
detection. In addition, it will enable researchers to focus
on specific types of fake news classification if they desire;
for example, focusing only on satire. For the 6-way
classification, the first label is true and the other five
are defined within the seven types of fake news (Wardle,
2017). Only five types were chosen, as we did not find
subreddits with posts aligning with the remaining two. We
provide examples from each class for 6-way classification in
Figure 1. The 6-way classification labels are explained
below:

True: True content is accurate in accordance with fact. Eight
of the subreddits fall into this category, such as usnews and
mildlyinteresting. The former consists of posts from various
news sites. The latter encompasses real photos with accurate
captions.

Satire/Parody: This category consists of content that spins
true contemporary content with a satirical tone or
information that makes it false. One of the four subreddits
that make up this label is theonion, with headlines such as
"Man Lowers Carbon Footprint By Bringing Reusable Bags Every
Time He Buys Gas".
9 [Link]

Misleading Content: This category consists of information
that is intentionally manipulated to fool the audience. Our
dataset contains three subreddits in this category.

Imposter Content: This category contains two subreddits,
which contain bot-generated content produced by models
trained on a large number of other subreddits.

False Connection: Submission images in this category do not
accurately support their text descriptions. We have four
subreddits with this label, containing posts of images with
captions that do not relate to the true meaning of the image.

Manipulated Content: Content that has been purposely
manipulated through manual photo editing or other forms of
alteration. The photoshopbattles subreddit comments (not
submissions) make up the entirety of this category. Samples
contain doctored derivatives of images from the submissions.

3.4. Dataset Analysis
In Table 2, we provide an overview of specific statistics
pertaining to our dataset, such as vocabulary size and number
of unique users. We also provide a more in-depth analysis in
comparison to another sizable dataset, FEVER.
First, we examine the word lengths of our text data. Figure 2
shows the proportion of samples per text length for both
Fakeddit and FEVER. Our dataset contains a higher proportion
of longer texts starting from word lengths of around 17,
while FEVER's captions peak at around 10 words. In addition,
while FEVER's peak is very sharp, Fakeddit's distribution has
a much smaller and more gradual slope. Fakeddit also provides
a broader diversity of text lengths, with samples containing
almost 100 words, whereas FEVER's longest texts stop at fewer
than 70 words.
Secondly, we examine the linguistic variety of our dataset by
computing the Type-Caption Curve, as defined in (Wang et al.,
2019). Figure 3 shows these results: Fakeddit provides
significantly more lexical diversity. Even though Fakeddit
contains more samples than FEVER, the number of unique
n-grams contained in similarly sized samples is still much
higher than within FEVER. These effects will be magnified, as
Fakeddit contains more than 5 times as many total samples as
FEVER. In Table 3, we show the number of unique n-grams for
both datasets when sampling n samples, where n is equal to
FEVER's dataset size. This demonstrates that for all n-gram
sizes, our dataset is more lexically diverse than FEVER's for
equal sample sizes.
These salient text features - longer texts, a broad array of
text lengths, and higher linguistic variety - highlight
Fakeddit's diversity. This diversity can strengthen fake news
detection systems by increasing their lexical scope.

Dataset   1-gram  2-gram  3-gram  4-gram
FEVER     40874   179525  315025  387093
Fakeddit  61141   507512  767281  755929
Table 3: Unique n-grams for FEVER and Fakeddit for equal
sample size (FEVER's total dataset size).

4. Experiments
4.1. Fake News Detection
Multiple methods were employed for text and image feature
extraction. We used InferSent (Conneau et al., 2017) and BERT
(Devlin et al., 2019) to generate text embeddings for the
titles of the Reddit submissions. VGG16 (Simonyan and
Zisserman, 2015), EfficientNet (Tan and Le, 2019), and
ResNet50 (He et al., 2016) were utilized to extract features
from the Reddit submission thumbnails.
We used the InferSent model because it performs very well as
a universal sentence embedding generator. For this model, we
loaded a vocabulary of the 1 million most common English
words and used fastText embeddings (Joulin et al., 2017). We
obtained encoded sentence features of length 4096 for each
submission title using InferSent.
In addition, we used the BERT model. BERT achieves
state-of-the-art results on many classification tasks,
including Q&A and named entity recognition. To obtain
fixed-length BERT embedding vectors, we used the
bert-as-service (Xiao, 2018) tool to map each variable-length
submission title into a 768-element array. For our
experiments, we utilized the pre-trained BERT-Large, Uncased
model.
We employed the VGG16, ResNet50, and EfficientNet models for
encoding images. VGG16 and ResNet50 are widely used by many
researchers, while EfficientNet is a relatively newer model.
For EfficientNet, we used the B4 variant, chosen because it
is comparable to ResNet50 in terms of FLOP count. For the
image models, we preloaded weights trained on ImageNet,
included the top layer, and used the penultimate layer for
feature extraction.

Figure 4: Multimodal model for integrating text and image
data for 2, 3, and 6-way classification. n, the hidden layer
size, is tuned for each model instance through hyperparameter
optimization.

4.2. Experiment Settings
As mentioned in section 3.2, the text was cleaned thoroughly
through a series of steps. We also prepared the images by
constraining their sizes to match the input sizes of the
image models and applied the image preprocessing required by
each model.
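The image-side pipeline described above (ImageNet weights, top layer included, penultimate-layer features) can be sketched with Keras. This is an illustrative reconstruction, not the authors' released code; `weights=None` is used below so the sketch runs without downloading the ImageNet weights the paper's setup actually preloads:

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.models import Model

def build_feature_extractor(weights=None):
    """Build ResNet50 (top included) and expose its penultimate layer.

    The paper preloads ImageNet weights (weights="imagenet"); we default
    to None here so this sketch runs offline with random initialization.
    """
    base = ResNet50(weights=weights, include_top=True)
    # "avg_pool" is the layer just before the 1000-way classifier head.
    return Model(inputs=base.input, outputs=base.get_layer("avg_pool").output)

def extract_features(extractor, image_batch):
    """image_batch: float32 array of shape (n, 224, 224, 3), RGB in 0-255."""
    return extractor.predict(preprocess_input(image_batch.copy()), verbose=0)

extractor = build_feature_extractor()
# A dummy batch of two thumbnails resized to ResNet50's 224x224 input size.
thumbs = np.random.uniform(0, 255, (2, 224, 224, 3)).astype("float32")
feats = extract_features(extractor, thumbs)
# feats holds one 2048-dimensional vector per thumbnail.
```

The same pattern applies to VGG16 and EfficientNet-B4, swapping in their respective `preprocess_input` functions and input sizes.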
2-way 3-way 6-way
Type Text Image Validation Test Validation Test Validation Test
Text BERT – 0.8654 0.8644 0.8582 0.8580 0.7696 0.7677
InferSent – 0.8634 0.8631 0.8569 0.8570 0.7652 0.7666
Image – VGG16 0.7355 0.7376 0.7264 0.7293 0.6462 0.6516
– EfficientNet 0.6115 0.6087 0.5877 0.5828 0.4152 0.4153
– ResNet50 0.8043 0.8070 0.7966 0.7988 0.7529 0.7549
Text+Image InferSent VGG16 0.8655 0.8658 0.8618 0.8624 0.8130 0.8130
InferSent EfficientNet 0.8328 0.8339 0.8259 0.8256 0.7266 0.7280
InferSent ResNet50 0.8888 0.8891 0.8855 0.8863 0.8546 0.8526
BERT VGG16 0.8694 0.8699 0.8644 0.8655 0.8177 0.8208
BERT EfficientNet 0.8334 0.8318 0.8265 0.8255 0.7258 0.7272
BERT ResNet50 0.8929 0.8909 0.8905 0.8890 0.8600 0.8588
Table 4: Results on fake news detection for 2, 3, and 6-way classification with combination method of maximum.
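The multimodal rows of Table 4 combine the two feature types as depicted in Figure 4: each modality is condensed to an n-element vector, the two vectors are merged, and a softmax layer classifies the result. A hypothetical Keras sketch follows; the feature sizes (768 for BERT, 2048 for ResNet50), n=224, and the 1e-4 Adam learning rate come from the paper, while the ReLU activations are our assumption, as the paper does not specify them:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

# The four merge methods tried in the paper.
MERGE_LAYERS = {"add": layers.Add, "concatenate": layers.Concatenate,
                "maximum": layers.Maximum, "average": layers.Average}

def build_multimodal(n_units=224, n_classes=6, text_dim=768,
                     image_dim=2048, merge="maximum", lr=1e-4):
    text_in = layers.Input(shape=(text_dim,), name="text_features")
    image_in = layers.Input(shape=(image_dim,), name="image_features")
    # Condense each modality into an n-element vector (ReLU is an assumption).
    t = layers.Dense(n_units, activation="relu")(text_in)
    i = layers.Dense(n_units, activation="relu")(image_in)
    merged = MERGE_LAYERS[merge]()([t, i])  # element-wise maximum worked best
    out = layers.Dense(n_classes, activation="softmax")(merged)
    model = Model([text_in, image_in], out)
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_multimodal()
probs = model.predict([np.zeros((3, 768), "float32"),
                       np.zeros((3, 2048), "float32")], verbose=0)
# probs contains a 6-way class distribution for each of the 3 samples.
```

In the paper's setup, the hidden size and learning rate of such a model were tuned with keras-tuner's Hyperband, with early stopping on validation accuracy.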
For our experiments, we excluded submissions that were
missing either text or image data. We performed 2-way, 3-way,
and 6-way classification for each of the three types of
inputs: image only, text only, and multimodal (text and
image). As shown in Figure 4, when combining the features in
multimodal classification, we first condensed them into
n-element vectors through a trainable dense layer and then
merged them through four different methods: add, concatenate,
maximum, and average. These features were then passed through
a fully connected softmax predictor. For all experiments, we
tuned the hyperparameters on the validation set using the
keras-tuner tool10. Specifically, we employed the Hyperband
tuner (Li and Jamieson, 2018) to find optimal values for the
hidden layer size and learning rate. We varied the number of
units in the hidden layer from 32 to 256 in increments of 32.
For the optimizer, we used Adam (Kingma and Ba, 2014) and
tested three learning rate values: 1e-2, 1e-3, 1e-4. For the
multimodal model, the unit size hyperparameter affected the
sizes of three layers simultaneously: the two layers that are
combined and the layer that results from the combination. For
non-multimodal models, we utilized a single size-tunable
hidden layer, followed by a softmax predictor. For each
model, we specified a maximum of 20 epochs and an early
stopping callback to halt training if the validation accuracy
decreased.

10 [Link]

4.3. Results
The results are shown in Tables 4 and 5. For image and
multimodal classification, ResNet50 performed the best,
followed by VGG16 and EfficientNet. In addition, BERT
achieved better results than InferSent for multimodal
classification. Multimodal features performed the best,
followed by text-only and image-only. Thus, the image and
text multimodality present in our dataset significantly
improves fake news detection. The "maximum" method of merging
image and text features yielded the highest accuracy.
Overall, the multimodal model that combined BERT text
features and ResNet50 image features through the maximum
method performed best. The best 6-way classification model
parameters were: a hidden layer size of 224 units, a 1e-4
learning rate, and training over 20 epochs.

5. Error Analysis
We conduct an error analysis on our 6-way detection model by
examining samples from the test set that the model predicted
incorrectly. A subset of these samples is shown in Table 6.
Firstly, the model had the most difficulty identifying
imposter content. This category contains subreddits
consisting of machine-generated samples. Recent advances in
machine learning such as Grover (Zellers et al., 2019), a
model that produces realistic-looking machine-generated news
articles, have allowed machines to automatically generate
human-like material. Our model has a relatively difficult
time identifying these samples. The second category on which
the model performed poorest was satire. The model may have a
difficult time identifying satire because creators of satire
tend to produce content that seems similar to real news to
readers without a sufficient level of contextual knowledge.
Classifying the data into these two categories (imposter
content and satire) is a complex challenge, and our baseline
results show that there is significant room for improvement
in these areas. On the other hand, the model was able to
correctly classify almost all manipulated content samples. We
also found that misclassified samples were frequently
categorized as true. This can be attributed to the relative
size of the true class in the 6-way classification. While we
have comparable sizes of fake and true samples for 2-way
classification, the 6-way setting breaks the fake news down
into more fine-grained classes. As a result, the model trains
on a higher number of true samples and may be inclined to
predict this label.

Text                                  Image        Predicted Label   Gold Label  PM(%)
volcanic eruption in bali last night  (not shown)  False Connection  True        17.9
Table 6: Classification errors on the BERT+ResNet50 model for
6-way classification. PM: proportion of samples misclassified
within each gold label.

6. Conclusion
In this paper, we presented a novel dataset for fake news
research, Fakeddit. Compared to previous datasets, Fakeddit
provides a large number of multimodal samples with multiple
labels for various levels of fine-grained classification. We
conducted several experiments with multiple baseline models
and performed an error analysis on our results, highlighting
the importance of the large-scale multimodality unique to
Fakeddit and demonstrating that there is still significant
room for improvement in fine-grained fake news detection. Our
dataset has wide-ranging practicalities in fake news research
and other research areas. Although we do not utilize
submission metadata and comments made by users on the
submissions, we anticipate that these additional multimodal
features will be useful for further fake news research. For
example, future research can look into tracking a user's
credibility using the provided metadata and comment data, and
into incorporating video data as another multimedia source.
Implicit fact-checking research with an emphasis on
image-caption verification can also be conducted using our
dataset's unique multimodality aspect. We hope that our
dataset can be used to advance efforts to combat the
ever-growing rampant spread of disinformation in today's
society.

Acknowledgments
We would like to acknowledge Facebook for the Online Safety
Benchmark Award. The authors are solely responsible for the
contents of the paper, and the opinions expressed in this
publication do not reflect those of the funding agencies.

7. Bibliographical References
Abu Salem, F. K., Al Feel, R., Elbassuoni, S., Jaber, M., and
Farah, M. (2019). FA-KES: A fake news dataset around the
Syrian war. Proceedings of the International AAAI Conference
on Web and Social Media, 13(01):573-582, Jul.
Allcott, H. and Gentzkow, M. (2017). Social media and fake
news in the 2016 election. Journal of Economic Perspectives,
31(2):211-36, May.
Boididou, C., Papadopoulos, S., Zampoglou, M., Apostolidis,
L., Papadopoulou, O., and Kompatsiaris, Y. (2018). Detection
and visualization of misleading content on Twitter.
International Journal of Multimedia Information Retrieval,
7(1):71-86.
Subreddit 6-Way Label URL
photoshopbattles submissions True [Link]
nottheonion True [Link]
neutralnews True [Link]
pic True [Link]
usanews True [Link]
upliftingnews True [Link]
mildlyinteresting True [Link]
usnews True [Link]
fakealbumcovers Satire [Link]
satire Satire [Link]
waterfordwhispersnews Satire [Link]
theonion Satire [Link]
propagandaposters Misleading Content [Link]
fakefacts Misleading Content [Link]
savedyouaclick Misleading Content [Link]
misleadingthumbnails False Connection [Link]
confusing perspective False Connection [Link]
pareidolia False Connection [Link]
fakehistoryporn False Connection [Link]
subredditsimulator Imposter Content [Link]
subsimulatorgpt2 Imposter Content [Link]
photoshopbattles comments Manipulated Content [Link]
Table 7: List of subreddits in Fakeddit with their 6-way
classification labels.
Christlein, V., Riess, C., Jordan, J., Riess, C., and
Angelopoulou, E. (2012). An evaluation of popular copy-move
forgery detection approaches. IEEE Transactions on
Information Forensics and Security, 7(6):1841-1854.
Cohen, J. (1960). A coefficient of agreement for nominal
scales. Educational and Psychological Measurement,
20(1):37-46.
Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and
Bordes, A. (2017). Supervised learning of universal sentence
representations from natural language inference data. In
Proceedings of the 2017 Conference on Empirical Methods in
Natural Language Processing, pages 670-680, Copenhagen,
Denmark, September. Association for Computational
Linguistics.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019).
BERT: Pre-training of deep bidirectional transformers for
language understanding. In Proceedings of the 2019 Conference
of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers), pages 4171-4186,
Minneapolis, Minnesota, June. Association for Computational
Linguistics.
Dreyfuss, E. and Lapowsky, I. (2019). Facebook is changing
news feed (again) to stop fake news. Wired.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual
learning for image recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages
770-778.
Heller, S., Rossetto, L., and Schuldt, H. (2018). The
PS-Battles Dataset - an image collection for image
manipulation detection. CoRR, abs/1804.04866.
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T.
(2017). Bag of tricks for efficient text classification. In
Proceedings of the 15th Conference of the European Chapter of
the Association for Computational Linguistics: Volume 2,
Short Papers, pages 427-431, Valencia, Spain, April.
Association for Computational Linguistics.
Kingma, D. P. and Ba, J. (2014). Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980.
Li, L. and Jamieson, K. (2018). Hyperband: A novel
bandit-based approach to hyperparameter optimization. Journal
of Machine Learning Research, 18:1-52.
Mitra, T. and Gilbert, E. (2015). CREDBANK: A large-scale
social media corpus with associated credibility annotations.
In Ninth International AAAI Conference on Web and Social
Media.
Nørregaard, J., Horne, B. D., and Adali, S. (2019).
NELA-GT-2018: A large multi-labelled news dataset for the
study of misinformation in news articles. Proceedings of the
International AAAI Conference on Web and Social Media,
13(01):630-638, Jul.
Pathak, A. and Srihari, R. (2019). BREAKING! Presenting fake
news corpus for automated fact checking. In Proceedings of
the 57th Annual Meeting of the Association for Computational
Linguistics: Student Research Workshop, pages 357-362,
Florence, Italy, July. Association for Computational
Linguistics.
Santia, G. C. and Williams, J. R. (2018). BuzzFace: A news
veracity dataset with Facebook user commentary and egos. In
Twelfth International AAAI Conference on Web and Social
Media.
Shu, K., Mahudeswaran, D., Wang, S., Lee, D., and Liu, H.
(2018). FakeNewsNet: A data repository with news content,
social context and dynamic information for studying fake news
on social media. arXiv preprint arXiv:1809.01286.
Simonyan, K. and Zisserman, A. (2015). Very deep
convolutional networks for large-scale image recognition. In
International Conference on Learning Representations.
Tacchini, E., Ballarin, G., Vedova, M. L. D., Moret, S., and
de Alfaro, L. (2017). Some like it hoax: Automated fake news
detection in social networks. CoRR, abs/1704.07506.
Tan, M. and Le, Q. V. (2019). EfficientNet: Rethinking model
scaling for convolutional neural networks.
Thorne, J., Vlachos, A., Christodoulopoulos, C., and Mittal,
A. (2018). FEVER: A large-scale dataset for fact extraction
and VERification. In Proceedings of the 2018 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 809-819, New Orleans,
Louisiana, June. Association for Computational Linguistics.
Wang, X., Wu, J., Chen, J., Li, L., Wang, Y.-F., and Wang,
W. Y. (2019). VaTeX: A large-scale, high-quality multilingual
dataset for video-and-language research. In The IEEE
International Conference on Computer Vision (ICCV), October.
Wang, W. Y. (2017). "Liar, liar pants on fire": A new
benchmark dataset for fake news detection. In Proceedings of
the 55th Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), pages 422-426,
Vancouver, Canada, July. Association for Computational
Linguistics.
Wardle, C. (2017). Fake news. It's complicated. First Draft.
Xiao, H. (2018). bert-as-service. [Link]
Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi,
A., Roesner, F., and Choi, Y. (2019). Defending against
neural fake news. In Advances in Neural Information
Processing Systems 32.
Zlatkova, D., Nakov, P., and Koychev, I. (2019).
Fact-checking meets fauxtography: Verifying claims about
images. In Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 2099-2108, Hong Kong, China, November.
Association for Computational Linguistics.
Zubiaga, A., Liakata, M., Procter, R., Hoi, G. W. S., and
Tolmie, P. (2016). Analysing how people orient to and spread
rumours in social media by looking at conversational threads.
PLoS ONE, 11(3):e0150989.

Appendix
We show the list of subreddits in Table 7.