GoEmotions Dataset Paper
Abstract

Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks. We introduce GoEmotions, the largest manually annotated dataset of 58k English Reddit comments, labeled for 27 emotion categories or Neutral. We demonstrate the high quality of the annotations via Principal Preserved Component Analysis. We conduct transfer learning experiments with existing emotion benchmarks to show that our dataset generalizes well to other domains and different emotion taxonomies. Our BERT-based model achieves an average F1-score of .46 across our proposed taxonomy, leaving much room for improvement.¹

    Guilty of doing this tbph                                             remorse
    This caught me off guard for real. I'm actually off my bed laughing   surprise, amusement
    I tried to send this to a friend but [NAME] knocked it away.          disappointment

Table 1: Example annotations from our dataset.

∗ Work done while at Google Research.
¹ Data and code available at https://github.com/google-research/google-research/tree/master/goemotions.

1 Introduction

Emotion expression and detection are central to the human experience and social interaction. With as many as a handful of words we are able to express a wide variety of subtle and complex emotions, and it has thus been a long-term goal to enable machines to understand affect and emotion (Picard, 1997).

In the past decade, NLP researchers have made available several datasets for language-based emotion classification across a variety of domains and applications, including news headlines (Strapparava and Mihalcea, 2007), tweets (CrowdFlower, 2016; Mohammad et al., 2018), and narrative sequences (Liu et al., 2019), to name just a few. However, existing available datasets are (1) mostly small, containing up to several thousand instances, and (2) cover a limited emotion taxonomy, with coarse classification into Ekman (Ekman, 1992b) or Plutchik (Plutchik, 1980) emotions.

Recently, Bostan and Klinger (2018) aggregated 14 popular emotion classification corpora under a unified framework that allows direct comparison of the existing resources. Importantly, their analysis suggests annotation quality gaps in the largest manually annotated emotion classification dataset, CrowdFlower (2016), which contains 40K tweets labeled for one of 13 emotions. While their work enables such comparative evaluations, it also highlights the need for a large-scale, consistently labeled emotion dataset over a fine-grained taxonomy, with demonstrated high-quality annotations.

To this end, we compiled GoEmotions, the largest human-annotated dataset of 58k carefully selected Reddit comments, labeled for 27 emotion categories or Neutral, with comments extracted from popular English subreddits. Table 1 shows an illustrative sample of our collected data. We design our emotion taxonomy considering related work in psychology and coverage in our data. In contrast to Ekman's taxonomy, which includes only one positive emotion (joy), our taxonomy includes a large number of positive, negative, and ambiguous emotion categories, making it suitable for downstream conversation understanding tasks that require a subtle understanding of emotion expression, such as the analysis of customer feedback or the enhancement of chatbots.
We include a thorough analysis of the annotated data and the quality of the annotations. Via Principal Preserved Component Analysis (Cowen et al., 2019b), we show strong support for reliable dissociation among all 27 emotion categories, indicating the suitability of our annotations for building an emotion classification model.

We perform hierarchical clustering on the emotion judgments, finding that emotions related in intensity cluster together closely and that the top-level clusters correspond to sentiment categories. These relations among emotions allow for their potential grouping into higher-level categories, if desired for a downstream task.

We provide a strong baseline for modeling fine-grained emotion classification over GoEmotions. By fine-tuning a BERT-base model (Devlin et al., 2019), we achieve an average F1-score of .46 over our taxonomy, .64 over an Ekman-style grouping into six coarse categories, and .69 over a sentiment grouping. These results leave much room for improvement, showing that this task is not yet fully addressed by current state-of-the-art NLU models.

We conduct transfer learning experiments with existing emotion benchmarks to show that our data can generalize to different taxonomies and domains, such as tweets and personal narratives. Our experiments demonstrate that, given limited resources to label additional emotion classification data for specialized domains, our data can provide baseline emotion understanding and contribute to increasing model accuracy for the target domain.

2 Related Work

2.1 Emotion Datasets

Ever since Affective Text (Strapparava and Mihalcea, 2007), the first benchmark for emotion recognition, was introduced, the field has seen several emotion datasets that vary in size, domain, and taxonomy (cf. Bostan and Klinger, 2018). The majority of emotion datasets are constructed manually, but they tend to be relatively small. The largest manually labeled dataset is CrowdFlower (2016), with 39k labeled examples, which were found by Bostan and Klinger (2018) to be noisy in comparison with other emotion datasets. Other datasets are automatically weakly labeled, based on emotion-related hashtags on Twitter (Wang et al., 2012; Abdul-Mageed and Ungar, 2017). We build our dataset manually, making it the largest human-annotated dataset, with multiple annotations per example for quality assurance.

Several existing datasets come from the domain of Twitter, given its informal language and expressive content, such as emojis and hashtags. Other datasets annotate news headlines (Strapparava and Mihalcea, 2007), dialogs (Li et al., 2017), fairytales (Alm et al., 2005), movie subtitles (Öhman et al., 2018), sentences based on FrameNet (Ghazi et al., 2015), or self-reported experiences (Scherer and Wallbott, 1994), among other domains. We are the first to build on Reddit comments for emotion prediction.

2.2 Emotion Taxonomy

One of the main aspects distinguishing our dataset is its emotion taxonomy. The vast majority of existing datasets contain annotations for minor variations of the six basic emotion categories (joy, anger, fear, sadness, disgust, and surprise) proposed by Ekman (1992a) and/or along affective dimensions (valence and arousal) that underpin the circumplex model of affect (Russell, 2003; Buechel and Hahn, 2017).

Recent advances in psychology have offered new conceptual and methodological approaches to capturing the more complex "semantic space" of emotion (Cowen et al., 2019a) by studying the distribution of emotion responses to a diverse array of stimuli via computational techniques. Studies guided by these principles have identified 27 distinct varieties of emotional experience conveyed by short videos (Cowen and Keltner, 2017), 13 by music (Cowen et al., in press), 28 by facial expression (Cowen and Keltner, 2019), 12 by speech prosody (Cowen et al., 2019b), and 24 by nonverbal vocalization (Cowen et al., 2018). In this work, we build on these methods and findings to devise our granular taxonomy for text-based emotion recognition and to study the dimensionality of language-based emotion space.

2.3 Emotion Classification Models

Both feature-based and neural models have been used to build automatic emotion classification models. Feature-based models often make use of hand-built lexicons, such as the Valence Arousal Dominance Lexicon (Mohammad, 2018). Using representations from BERT (Devlin et al., 2019), a transformer-based model with language model pre-training, has recently been shown to reach state-of-the-art performance on several NLP tasks, including emotion prediction: the top-performing models in the EmotionX Challenge (Hsu and Ku, 2018) all employed a pre-trained BERT model. We also use the BERT model in our experiments and find that it outperforms our biLSTM model.
3 GoEmotions

Our dataset is composed of 58K Reddit comments, labeled for one or more of 27 emotions, or Neutral.

3.1 Selecting & Curating Reddit comments

We use a Reddit data dump originating in the reddit-data-tools project², which contains comments from 2005 (the start of Reddit) to January 2019. We select subreddits with at least 10k comments and remove deleted and non-English comments.

² https://github.com/dewarim/reddit-data-tools

Reddit is known for a demographic bias leaning towards young male users (Duggan and Smith, 2013), which is not reflective of a globally diverse population. The platform also introduces a skew towards toxic, offensive language (Mohan et al., 2017). Thus, Reddit content has been used to study depression (Pirina and Çöltekin, 2018) and microaggressions (Breitfeller et al., 2019), and Yanardag and Rahwan (2018) have shown the effect of using biased Reddit data by training a "psychopath" bot. To address these concerns, and to enable building broadly representative emotion models using GoEmotions, we take a series of data curation measures to ensure that our data does not reinforce general, or emotion-specific, language biases.

We identify harmful comments using pre-defined lists containing offensive/adult, vulgar (mildly offensive profanity), identity, and religion terms (included as supplementary material). These lists are used for data filtering and masking, as described below. The lists were compiled internally; we believe they are comprehensive and widely useful for dataset curation, but they may not be complete.

Reducing profanity. We remove subreddits that are not safe for work³ and those where 10% or more of comments include offensive/adult or vulgar tokens. We then remove any remaining comments that include offensive/adult tokens. Vulgar comments are preserved, as we believe they are central to learning about negative emotions. The dataset includes the list of filtered tokens.

³ http://redditlist.com/nsfw

Manual review. We manually review identity comments and remove those that are offensive towards a particular ethnicity, gender, sexual orientation, or disability, to the best of our judgment.

Length filtering. We apply NLTK's word tokenizer and select comments 3-30 tokens long, including punctuation. To create a relatively balanced distribution of comment lengths, we perform downsampling, capping by the number of comments with the median token count (12).

Sentiment balancing. We reduce sentiment bias by removing subreddits with little representation of positive, negative, ambiguous, or neutral sentiment. To estimate a comment's sentiment, we run our emotion prediction model, trained on a pilot batch of 2.2k annotated examples. The mapping of emotions into sentiment categories is given in Figure 2. We exclude subreddits consisting of more than 30% neutral comments, or less than 20% negative, positive, or ambiguous comments.

Emotion balancing. We assign a predicted emotion to each comment using the pilot model described above. Then, we reduce emotion bias by downsampling the weakly-labeled data, capping by the number of comments belonging to the median emotion count.

Subreddit balancing. To avoid overrepresentation of popular subreddits, we perform downsampling, capping by the median subreddit count.

From the remaining 315k comments (from 482 subreddits), we randomly sample for annotation.

Masking. We mask proper names referring to people with a [NAME] token, using a BERT-based Named Entity Tagger (Tsai et al., 2019). We mask religion terms with a [RELIGION] token. The list of these terms is included with our dataset. Note that raters viewed unmasked comments during rating.
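The curation steps above combine threshold filters with median-capped downsampling. The sketch below illustrates the length filter, the sentiment-balancing rule, and a generic median-capping downsampler; the data layout, the helper names, and the exact capping rule are our assumptions, not the released pipeline.

```python
import random
from collections import defaultdict

from nltk.tokenize import word_tokenize


def length_ok(text):
    """Keep comments 3-30 tokens long, including punctuation (Section 3.1)."""
    return 3 <= len(word_tokenize(text)) <= 30


def sentiment_balanced_subreddits(comments):
    """Drop subreddits with a skewed sentiment mix.

    `comments` is a hypothetical list of dicts with 'subreddit' and
    'sentiment' keys, where 'sentiment' (positive/negative/ambiguous/
    neutral) comes from the pilot emotion model described above.
    """
    by_subreddit = defaultdict(list)
    for c in comments:
        by_subreddit[c["subreddit"]].append(c["sentiment"])

    kept = set()
    for subreddit, sentiments in by_subreddit.items():
        n = len(sentiments)
        frac = {s: sentiments.count(s) / n
                for s in ("neutral", "positive", "negative", "ambiguous")}
        # Exclude subreddits with >30% neutral comments or <20% of
        # negative, positive, or ambiguous comments.
        if frac["neutral"] > 0.30:
            continue
        if min(frac["positive"], frac["negative"], frac["ambiguous"]) < 0.20:
            continue
        kept.add(subreddit)
    return [c for c in comments if c["subreddit"] in kept]


def downsample_to_median(comments, key):
    """Cap each group (e.g. token count, predicted emotion, or subreddit)
    at the median group size; one reading of the capping rule above."""
    groups = defaultdict(list)
    for c in comments:
        groups[c[key]].append(c)
    cap = sorted(len(g) for g in groups.values())[len(groups) // 2]
    balanced = []
    for group in groups.values():
        random.shuffle(group)
        balanced.extend(group[:cap])
    return balanced
```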
3.2 Taxonomy of Emotions

When creating the taxonomy, we seek to jointly maximize the following objectives.

1. Provide greatest coverage in terms of emotions expressed in our data. To address this, we manually labeled a small subset of the data and ran a pilot task where raters could suggest emotion labels on top of the pre-defined set.
2. Provide greatest coverage in terms of kinds of emotional expression. We consult psychology literature on emotion expression and recognition (Plutchik, 1980; Cowen and Keltner, 2017; Cowen et al., 2019b). Since, to our knowledge, there has not been research that identifies principal categories for emotion recognition in the domain of text (see Section 2.2), we consider those emotions that are identified as basic in other domains (video and speech) and that we can assume to apply to text as well.

3. Limit overlap among emotions and limit the number of emotions. We do not want to include emotions that are too similar, since that makes the annotation task more difficult. Moreover, combining similar labels with high coverage would result in an explosion of annotated labels.

The final set of selected emotions is listed in Table 4 and Figure 1. See Appendix B for more details on our multi-step taxonomy selection procedure.

3.3 Annotation

We assigned three raters to each example. For those examples where no raters agreed on at least one emotion label, we assigned two additional raters. All raters are native English speakers from India.⁴

⁴ Cowen et al. (2019b) find that emotion judgments in Indian and US English speakers largely occupy the same dimensions.

Instructions. Raters were asked to identify the emotions expressed by the writer of the text, given pre-defined emotion definitions (see Appendix A) and a few example texts for each emotion. Raters were free to select multiple emotions, but were asked to select only those for which they were reasonably confident the emotion was expressed in the text. If raters were not certain about any emotion being expressed, they were asked to select Neutral. We included a checkbox for raters to indicate whether an example was particularly difficult to label, in which case they could select no emotions. We removed all examples for which no emotion was selected.

The rater interface. Reddit comments were presented with no additional metadata (such as the author or subreddit). To help raters navigate the large space of emotions in our taxonomy, they were presented with a table containing all emotion categories, aggregated by sentiment (by the mapping in Figure 2) and by whether the emotion is generally expressed towards something (e.g. disapproval) or is more of an intrinsic feeling (e.g. joy). The instructions highlighted that this separation of categories was by no means clear-cut, but captured general tendencies, and we encouraged raters to ignore the categorization whenever they saw fit. Emotions with a straightforward mapping onto emojis were shown with an emoji in the UI, to further ease their interpretation.

Number of examples                                             58,009
Number of emotions                                             27 + neutral
Number of unique raters                                        82
Number of raters / example                                     3 or 5
Marked unclear or difficult to label                           1.6%
Number of labels per example                                   1: 83%, 2: 15%, 3: 2%, 4+: .2%
Number of examples w/ 2+ raters agreeing on at least 1 label   54,263 (94%)
Number of examples w/ 3+ raters agreeing on at least 1 label   17,763 (31%)

Table 2: Summary statistics of our labeled data.

4 Data Analysis

Table 2 shows summary statistics for the data. Most of the examples (83%) have a single emotion label, and most have at least two raters agreeing on a single label (94%). The Neutral category makes up 26% of all emotion labels; we exclude that category from the following analyses, since we do not consider it to be part of the semantic space of emotions.

Figure 1 shows the distribution of emotion labels. We can see a large disparity in emotion frequencies (e.g. admiration is 30 times more frequent than grief), despite the emotion and sentiment balancing steps taken during data selection. This is expected given the disparate frequencies of emotions in natural human expression.

4.1 Interrater Correlation

We estimate rater agreement for each emotion via interrater correlation (Delgado and Tibau, 2019).⁵ For each rater r ∈ R, we calculate the Spearman correlation between r's judgments and the mean of the other raters' judgments.

⁵ We use correlations as opposed to Cohen's kappa (Cohen, 1960) because the former is a more interpretable metric and is also more suitable for measuring agreement among a variable number of raters rating different examples. In Appendix C we report Cohen's kappa values as well, which correlate highly with the values obtained from interrater correlation (Pearson r = 0.85, p < 0.001).
[Two figures are not recoverable from the extraction: one groups the 27 emotion categories by sentiment (positive, negative, ambiguous), and one plots the emotion categories against interrater correlation.]
Table 3: Top 5 words associated with each emotion (positive, negative, ambiguous). The rounded z-scored log odds ratios in the parentheses, with the threshold set at 3, indicate significance of association.
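The association scores behind Table 3 follow the weighted log odds ratio with an informative Dirichlet prior of Monroe et al. (2008). A minimal sketch, with the prior scale as an assumed smoothing choice:

```python
import numpy as np
from collections import Counter


def z_scored_log_odds(target_tokens, rest_tokens, prior_scale=1.0):
    """Z-scored log odds with an informative Dirichlet prior (Monroe
    et al., 2008): compares word use in comments labeled with one
    emotion against all other comments. Returns {word: z-score};
    values above ~3 indicate a significant association.
    """
    counts_i = Counter(target_tokens)
    counts_j = Counter(rest_tokens)
    vocab = set(counts_i) | set(counts_j)
    n_i, n_j = sum(counts_i.values()), sum(counts_j.values())
    # Informative prior proportional to overall corpus frequencies.
    prior = {w: prior_scale * (counts_i[w] + counts_j[w]) / (n_i + n_j)
             for w in vocab}
    a0 = sum(prior.values())

    z = {}
    for w in vocab:
        aw = prior[w]
        yi, yj = counts_i[w], counts_j[w]
        delta = (np.log((yi + aw) / (n_i + a0 - yi - aw))
                 - np.log((yj + aw) / (n_j + a0 - yj - aw)))
        var = 1.0 / (yi + aw) + 1.0 / (yj + aw)  # approximate variance
        z[w] = delta / np.sqrt(var)
    return z
```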
…why we release all 58K examples with all annotators' ratings.

Grouping emotions. We create a hierarchical grouping of our taxonomy and evaluate model performance at each level of the hierarchy. A sentiment level divides the labels into four categories (positive, negative, ambiguous, and Neutral), with the Neutral category kept intact and the rest of the mapping as shown in Figure 2. The Ekman level further divides the taxonomy using the Neutral label and the following six groups: anger (maps to: anger, annoyance, disapproval), disgust (maps to: disgust), fear (maps to: fear, nervousness), joy (all positive emotions), sadness (maps to: sadness, disappointment, embarrassment, grief, remorse), and surprise (all ambiguous emotions).
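Written out directly, the Ekman-level grouping is the mapping below; the positive and ambiguous label sets are taken from the sentiment grouping shown in Figure 2.

```python
# Ekman-level grouping of the 27 GoEmotions labels (plus Neutral),
# as described above; "joy" collects all positive emotions and
# "surprise" all ambiguous ones.
EKMAN_MAPPING = {
    "anger": ["anger", "annoyance", "disapproval"],
    "disgust": ["disgust"],
    "fear": ["fear", "nervousness"],
    "joy": ["admiration", "amusement", "approval", "caring", "desire",
            "excitement", "gratitude", "joy", "love", "optimism",
            "pride", "relief"],
    "sadness": ["sadness", "disappointment", "embarrassment",
                "grief", "remorse"],
    "surprise": ["surprise", "realization", "curiosity", "confusion"],
    "neutral": ["neutral"],  # kept intact at every level
}
```

Evaluating at a coarser level then amounts to mapping both gold and predicted labels through this dictionary before scoring.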
5.2 Model Architecture

We use the BERT-base model (Devlin et al., 2019) for our experiments. We add a dense output layer on top of the pretrained model for the purposes of finetuning, with a sigmoid cross-entropy loss function to support multi-label classification. As an additional baseline, we train a bidirectional LSTM.

5.3 Parameter Settings

When finetuning the pre-trained BERT model, we keep most of the hyperparameters set by Devlin et al. (2019) intact and change only the batch size and learning rate. We find that training for at least 4 epochs is necessary for learning the data, but that training for more epochs results in overfitting. We also find that a small batch size of 16 and a learning rate of 5e-5 yield the best performance.

For the biLSTM, we set the hidden layer dimensionality to 256 and the learning rate to 0.1, with a decay rate of 0.95. We apply a dropout of 0.7.
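A minimal sketch of this setup, using PyTorch and the HuggingFace transformers API rather than the original TensorFlow implementation (the framework and the cased checkpoint are our assumptions), with the hyperparameters from Section 5.3:

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

NUM_LABELS = 28  # 27 emotion categories + Neutral


class BertMultiLabelClassifier(nn.Module):
    """BERT-base with a dense output layer and sigmoid cross entropy,
    mirroring the architecture described in Section 5.2."""

    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-cased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, NUM_LABELS)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(out.pooler_output)  # raw logits per label


model = BertMultiLabelClassifier()
loss_fn = nn.BCEWithLogitsLoss()  # sigmoid cross entropy, multi-label
# Section 5.3: batch size 16, learning rate 5e-5, about 4 epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
batch = tokenizer(["Thanks, this made my day!"], padding=True,
                  truncation=True, max_length=30, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
targets = torch.zeros(1, NUM_LABELS)  # multi-hot gold label vector
loss = loss_fn(logits, targets)
```

Using a sigmoid per label, rather than a softmax over labels, is what allows multiple emotions to be active on a single comment.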
5.4 Results

Table 4 summarizes the performance of our best model, BERT, on the test set, which achieves an average F1-score of .46 (std = .19). The model obtains the best performance on emotions with overt lexical markers, such as gratitude (.86), amusement (.8), and love (.78). The model obtains the lowest F1-scores on grief (0), relief (.15), and realization (.21), which are the lowest-frequency emotions. We find that less frequent emotions tend to be confused by the model with more frequent emotions related in sentiment and intensity (e.g., grief with sadness, pride with admiration, nervousness with fear); see Appendix G for a more detailed analysis.

Table 5 and Table 6 show results for a sentiment-grouped model (F1-score = .69) and an Ekman-grouped model (F1-score = .64), respectively. The significant performance increase in the transition from the full to the Ekman-level taxonomy indicates that this grouping mitigates confusion among lower-level categories within each group.

The biLSTM model performs significantly worse than BERT, obtaining an average F1-score of .41 for the full taxonomy, .53 for an Ekman-grouped model, and .6 for a sentiment-grouped model.

6 Transfer Learning Experiments

We conduct transfer learning experiments on existing emotion benchmarks in order to show that our data generalizes across domains and taxonomies. The goal is to demonstrate that, given little labeled data in a target domain, one can utilize GoEmotions as baseline emotion understanding data.
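The two transfer settings compared in Figure 3 differ only in whether the pretrained encoder weights are updated. Continuing the sketch above (the paper's exact freezing scheme is not specified here, so this is an illustration):

```python
def prepare_for_transfer(model, freeze_encoder=True):
    """Configure a GoEmotions-pretrained classifier for a small target set.

    freeze_encoder=True trains only the output layer ("w/ Freezing
    Layers" in Figure 3); False fine-tunes the whole network
    ("w/o Freezing Layers").
    """
    for param in model.bert.parameters():
        param.requires_grad = not freeze_encoder
    # When the target taxonomy differs, a fresh output layer sized to
    # the target label set would replace model.classifier here
    # (hypothetical step).
    return model
```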
[Figure 3 plots F-score against training set size (100, 200, 500, 1000, max) on three target datasets: ISEAR (self-reported experiences), Emotion-Stimulus (FrameNet-based sentences), and EmoInt (tweets), comparing Baseline, Transfer Learning w/ Freezing Layers, and Transfer Learning w/o Freezing Layers.]

Figure 3: Transfer learning results in terms of average F1-scores across emotion categories. The bars indicate the 95% confidence intervals, which we obtain from 10 different runs on 10 different random splits of the data.
References

Sven Buechel and Udo Hahn. 2017. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Alan Cowen, Disa Sauter, Jessica L. Tracy, and Dacher Keltner. 2019a. Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression. Psychological Science in the Public Interest, 20(1):69–90.

Alan S. Cowen, Hillary Anger Elfenbein, Petri Laukka, and Dacher Keltner. 2018. Mapping 24 emotions conveyed by brief human vocalization. American Psychologist, 74(6):698–712.

Alan S. Cowen, Xia Fang, Disa Sauter, and Dacher Keltner. In press. What music makes us feel: At least thirteen dimensions organize subjective experiences associated with music across cultures. Proceedings of the National Academy of Sciences.

Alan S. Cowen and Dacher Keltner. 2017. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences, 114(38):E7900–E7909.

Alan S. Cowen and Dacher Keltner. 2019. What the face displays: Mapping 28 emotions conveyed by naturalistic expression. American Psychologist.

Alan S. Cowen, Petri Laukka, Hillary Anger Elfenbein, Runjing Liu, and Dacher Keltner. 2019b. The primacy of categories in the recognition of 12 emotions in speech prosody across two cultures. Nature Human Behaviour, 3(4):369.

CrowdFlower. 2016. https://www.figure-eight.com/data/sentiment-analysis-emotion-text/.

Rosario Delgado and Xavier-Andoni Tibau. 2019. Why Cohen's kappa should be avoided as performance measure in classification. PLoS ONE, 14(9).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Paul Ekman. 1992b. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.

Diman Ghazi, Diana Inkpen, and Stan Szpakowicz. 2015. Detecting emotion stimuli in emotion-bearing sentences. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 152–165. Springer.

Chao-Chun Hsu and Lun-Wei Ku. 2018. SocialNLP 2018 EmotionX challenge overview: Recognizing emotions in dialogues. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 27–31, Melbourne, Australia. Association for Computational Linguistics.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.

Chen Liu, Muhammad Osama, and Anderson De Andrade. 2019. DENS: A dataset for multi-class emotion analysis. arXiv preprint arXiv:1910.11769.

Saif Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 task 1: Affect in tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 1–17, New Orleans, Louisiana. Association for Computational Linguistics.

Saif M. Mohammad. 2012. #Emotional tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics: Volume 1, Proceedings of the main conference and the shared task, and Volume 2, Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 246–255. Association for Computational Linguistics.

Saif M. Mohammad, Xiaodan Zhu, Svetlana Kiritchenko, and Joel Martin. 2015. Sentiment, emotion, purpose, and style in electoral tweets. Information Processing & Management, 51(4):480–499.

Shruthi Mohan, Apala Guha, Michael Harris, Fred Popowich, Ashley Schuster, and Chris Priebe. 2017. The impact of toxic language on the health of Reddit communities. In Canadian Conference on Artificial Intelligence, pages 51–56. Springer.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

Emily Öhman, Kaisla Kajava, Jörg Tiedemann, and Timo Honkela. 2018. Creating a dataset for multilingual fine-grained emotion-detection using gamification-based annotation. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 24–30.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Association for Computational Linguistics.

Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling. In EMNLP 2019.

Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, and Amit P. Sheth. 2012. Harnessing Twitter 'big data' for automatic emotion identification. In 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, pages 587–592. IEEE.

P. Yanardag, M. Cebrian, and I. Rahwan. 2018. Norman: World's first psychopath AI.
Per-emotion agreement values (table continued from the preceding page; Appendix C reports interrater correlations alongside Cohen's kappa):

desire          0.177  0.251
disappointment  0.186  0.184
disapproval     0.274  0.234
disgust         0.192  0.241
embarrassment   0.177  0.218
excitement      0.193  0.222
fear            0.266  0.394
gratitude       0.645  0.749
grief           0.162  0.095
joy             0.296  0.301
love            0.446  0.555
nervousness     0.164  0.144
optimism        0.322  0.300
pride           0.163  0.148
realization     0.194  0.155
relief          0.172  0.185
remorse         0.178  0.358
sadness         0.346  0.336
surprise        0.275  0.331

[Figure 4 plots a softmax weight (0.00-0.06) for each BERT layer (0-12); the graphic itself is not recoverable from the extraction.]

Figure 4: Softmax weights of each BERT layer when trained on our dataset.

D Sentiment of Reddit Subreddits

In Section 3, we describe how we obtain subreddits that are balanced in terms of sentiment. Here, we note the distribution of sentiments across subreddits before we apply the filtering: neutral (M=28%, STD=11%), positive (M=41%, STD=11%), negative (M=19%, STD=7%), ambiguous (M=35%, STD=8%). After filtering, the distribution of sentiments across our remaining subreddits became: neutral (M=24%, STD=5%), positive (M=35%, STD=6%), negative (M=27%, STD=4%), ambiguous (M=33%, STD=4%).

F Number of Emotion Labels Per Example

Figure 5 shows the number of emotion labels per example before and after we filter for those labels that have agreement. We use the filtered set of labels for training and testing our models.

[Figure 5 is a histogram of the number of labels per example (0-7), with counts up to 30k, comparing the pre-filter and post-filter label sets.]

Figure 5: Number of emotion labels per example before and after filtering the labels chosen by only a single annotator.
[Figure 6 plots predicted labels against true labels as a proportion of predictions per emotion category; the matrix itself is not recoverable from the extraction.]

Figure 6: A normalized confusion matrix for our model predictions. The plot shows that the model confuses emotions with other emotions that are related in intensity and sentiment.
[Figure 7 plots F-score against training set size (100, 200, 500, 1000, max) on nine target datasets: DailyDialog (everyday conversations), Emotion-Stimulus (FrameNet-based sentences), Affective Text (news headlines), CrowdFlower (tweets), ElectoralTweets (tweets), ISEAR (self-reported experiences), TEC (tweets), EmoInt (tweets), and SSEC (tweets), comparing Baseline, Transfer Learning w/ Freezing Layers, and Transfer Learning w/o Freezing Layers.]

Figure 7: Transfer learning results on 9 emotion benchmarks from the Unified Dataset (Bostan and Klinger, 2018).