
GoEmotions: A Dataset of Fine-Grained Emotions

Dorottya Demszky1* Dana Movshovitz-Attias2 Jeongwoo Ko2
Alan Cowen2 Gaurav Nemade2 Sujith Ravi3*
1 Stanford Linguistics  2 Google Research  3 Amazon Alexa
[email protected]
{danama, jko, acowen, gnemade}@google.com
[email protected]

Abstract

Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks. We introduce GoEmotions, the largest manually annotated dataset of 58k English Reddit comments, labeled for 27 emotion categories or Neutral. We demonstrate the high quality of the annotations via Principal Preserved Component Analysis. We conduct transfer learning experiments with existing emotion benchmarks to show that our dataset generalizes well to other domains and different emotion taxonomies. Our BERT-based model achieves an average F1-score of .46 across our proposed taxonomy, leaving much room for improvement.1

* Work done while at Google Research.
1 Data and code available at https://github.com/google-research/google-research/tree/master/goemotions.

Sample Text | Label(s)
OMG, yep!!! That is the final answer. Thank you so much! | gratitude, approval
I'm not even sure what it is, why do people hate it | confusion
Guilty of doing this tbph | remorse
This caught me off guard for real. I'm actually off my bed laughing | surprise, amusement
I tried to send this to a friend but [NAME] knocked it away. | disappointment

Table 1: Example annotations from our dataset.

1 Introduction

Emotion expression and detection are central to the human experience and social interaction. With as many as a handful of words we are able to express a wide variety of subtle and complex emotions, and it has thus been a long-term goal to enable machines to understand affect and emotion (Picard, 1997).

In the past decade, NLP researchers made available several datasets for language-based emotion classification for a variety of domains and applications, including for news headlines (Strapparava and Mihalcea, 2007), tweets (CrowdFlower, 2016; Mohammad et al., 2018), and narrative sequences (Liu et al., 2019), to name just a few. However, existing available datasets are (1) mostly small, containing up to several thousand instances, and (2) cover a limited emotion taxonomy, with coarse classification into Ekman (Ekman, 1992b) or Plutchik (Plutchik, 1980) emotions.

Recently, Bostan and Klinger (2018) have aggregated 14 popular emotion classification corpora under a unified framework that allows direct comparison of the existing resources. Importantly, their analysis suggests annotation quality gaps in the largest manually annotated emotion classification dataset, CrowdFlower (2016), containing 40K tweets labeled for one of 13 emotions. While their work enables such comparative evaluations, it highlights the need for a large-scale, consistently labeled emotion dataset over a fine-grained taxonomy, with demonstrated high-quality annotations.

To this end, we compiled GoEmotions, the largest human-annotated dataset of 58k carefully selected Reddit comments, labeled for 27 emotion categories or Neutral, with comments extracted from popular English subreddits. Table 1 shows an illustrative sample of our collected data. We design our emotion taxonomy considering related work in psychology and coverage in our data. In contrast to Ekman's taxonomy, which includes only one positive emotion (joy), our taxonomy includes a large number of positive, negative, and ambiguous emotion categories, making it suitable for downstream
conversation understanding tasks that require a subtle understanding of emotion expression, such as the analysis of customer feedback or the enhancement of chatbots.

We include a thorough analysis of the annotated data and the quality of the annotations. Via Principal Preserved Component Analysis (Cowen et al., 2019b), we show strong support for reliable dissociation among all 27 emotion categories, indicating the suitability of our annotations for building an emotion classification model.

We perform hierarchical clustering on the emotion judgments, finding that emotions related in intensity cluster together closely and that the top-level clusters correspond to sentiment categories. These relations among emotions allow for their potential grouping into higher-level categories, if desired for a downstream task.

We provide a strong baseline for modeling fine-grained emotion classification over GoEmotions. By fine-tuning a BERT-base model (Devlin et al., 2019), we achieve an average F1-score of .46 over our taxonomy, .64 over an Ekman-style grouping into six coarse categories, and .69 over a sentiment grouping. These results leave much room for improvement, showcasing that this task is not yet fully addressed by current state-of-the-art NLU models.

We conduct transfer learning experiments with existing emotion benchmarks to show that our data can generalize to different taxonomies and domains, such as tweets and personal narratives. Our experiments demonstrate that given limited resources to label additional emotion classification data for specialized domains, our data can provide baseline emotion understanding and contribute to increasing model accuracy for the target domain.

2 Related Work

2.1 Emotion Datasets

Ever since Affective Text (Strapparava and Mihalcea, 2007), the first benchmark for emotion recognition, was introduced, the field has seen several emotion datasets that vary in size, domain and taxonomy (cf. Bostan and Klinger, 2018). The majority of emotion datasets are constructed manually, but tend to be relatively small. The largest manually labeled dataset is CrowdFlower (2016), with 39k labeled examples, which were found by Bostan and Klinger (2018) to be noisy in comparison with other emotion datasets. Other datasets are automatically weakly-labeled, based on emotion-related hashtags on Twitter (Wang et al., 2012; Abdul-Mageed and Ungar, 2017). We build our dataset manually, making it the largest human annotated dataset, with multiple annotations per example for quality assurance.

Several existing datasets come from the domain of Twitter, given its informal language and expressive content, such as emojis and hashtags. Other datasets annotate news headlines (Strapparava and Mihalcea, 2007), dialogs (Li et al., 2017), fairytales (Alm et al., 2005), movie subtitles (Öhman et al., 2018), sentences based on FrameNet (Ghazi et al., 2015), or self-reported experiences (Scherer and Wallbott, 1994), among other domains. We are the first to build on Reddit comments for emotion prediction.

2.2 Emotion Taxonomy

One of the main aspects distinguishing our dataset is its emotion taxonomy. The vast majority of existing datasets contain annotations for minor variations of the 6 basic emotion categories (joy, anger, fear, sadness, disgust, and surprise) proposed by Ekman (1992a) and/or along affective dimensions (valence and arousal) that underpin the circumplex model of affect (Russell, 2003; Buechel and Hahn, 2017).

Recent advances in psychology have offered new conceptual and methodological approaches to capturing the more complex "semantic space" of emotion (Cowen et al., 2019a) by studying the distribution of emotion responses to a diverse array of stimuli via computational techniques. Studies guided by these principles have identified 27 distinct varieties of emotional experience conveyed by short videos (Cowen and Keltner, 2017), 13 by music (Cowen et al., in press), 28 by facial expression (Cowen and Keltner, 2019), 12 by speech prosody (Cowen et al., 2019b), and 24 by nonverbal vocalization (Cowen et al., 2018). In this work, we build on these methods and findings to devise our granular taxonomy for text-based emotion recognition and study the dimensionality of language-based emotion space.

2.3 Emotion Classification Models

Both feature-based and neural models have been used to build automatic emotion classification models. Feature-based models often make use of hand-built lexicons, such as the Valence Arousal Dominance Lexicon (Mohammad, 2018). Using representations from BERT (Devlin et al., 2019), a
transformer-based model with language model pre-training, has recently been shown to reach state-of-the-art performance on several NLP tasks, including emotion prediction: the top-performing models in the EmotionX Challenge (Hsu and Ku, 2018) all employed a pre-trained BERT model. We also use the BERT model in our experiments and find that it outperforms our biLSTM model.

3 GoEmotions

Our dataset is composed of 58K Reddit comments, labeled for one or more of 27 emotions or Neutral.

3.1 Selecting & Curating Reddit Comments

We use a Reddit data dump originating in the reddit-data-tools project2, which contains comments from 2005 (the start of Reddit) to January 2019. We select subreddits with at least 10k comments and remove deleted and non-English comments.

Reddit is known for a demographic bias leaning towards young male users (Duggan and Smith, 2013), which is not reflective of a globally diverse population. The platform also introduces a skew towards toxic, offensive language (Mohan et al., 2017). Thus, Reddit content has been used to study depression (Pirina and Çöltekin, 2018) and microaggressions (Breitfeller et al., 2019), and Yanardag and Rahwan (2018) have shown the effect of using biased Reddit data by training a "psychopath" bot. To address these concerns, and to enable building broadly representative emotion models using GoEmotions, we take a series of data curation measures to ensure our data does not reinforce general, nor emotion-specific, language biases.

We identify harmful comments using pre-defined lists containing offensive/adult, vulgar (mildly offensive profanity), identity, and religion terms (included as supplementary material). These are used for data filtering and masking, as described below. The lists were internally compiled, and we believe they are comprehensive and widely useful for dataset curation; however, they may not be complete.

Reducing profanity. We remove subreddits that are not safe for work3 and those where 10%+ of comments include offensive/adult and vulgar tokens. We remove remaining comments that include offensive/adult tokens. Vulgar comments are preserved, as we believe they are central to learning about negative emotions. The dataset includes the list of filtered tokens.

2 https://github.com/dewarim/reddit-data-tools
3 http://redditlist.com/nsfw

Manual review. We manually review identity comments and remove those offensive towards a particular ethnicity, gender, sexual orientation, or disability, to the best of our judgment.

Length filtering. We apply NLTK's word tokenizer and select comments 3-30 tokens long, including punctuation. To create a relatively balanced distribution of comment lengths, we perform downsampling, capping by the number of comments with the median token count (12).
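As a concrete illustration of this step, the following is a minimal sketch using NLTK's word tokenizer, assuming a plain list of comment strings (this is illustrative, not the released pipeline code):

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models; needed once

def keep_by_length(comments, lo=3, hi=30):
    # Keep comments whose token count, including punctuation, is in [lo, hi].
    return [c for c in comments if lo <= len(word_tokenize(c)) <= hi]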
Sentiment balancing. We reduce sentiment bias by removing subreddits with little representation of positive, negative, ambiguous, or neutral sentiment. To estimate a comment's sentiment, we run our emotion prediction model, trained on a pilot batch of 2.2k annotated examples. The mapping of emotions into sentiment categories is found in Figure 2. We exclude subreddits consisting of more than 30% neutral comments or less than 20% negative, positive, or ambiguous comments.

Emotion balancing. We assign a predicted emotion to each comment using the pilot model described above. Then, we reduce emotion bias by downsampling the weakly-labelled data, capping by the number of comments belonging to the median emotion count.

Subreddit balancing. To avoid overrepresentation of popular subreddits, we perform downsampling, capping by the median subreddit count.

From the remaining 315k comments (from 482 subreddits), we randomly sample for annotation.

Masking. We mask proper names referring to people with a [NAME] token, using a BERT-based Named Entity Tagger (Tsai et al., 2019). We mask religion terms with a [RELIGION] token. The list of these terms is included with our dataset. Note that raters viewed unmasked comments during rating.

3.2 Taxonomy of Emotions

When creating the taxonomy, we seek to jointly maximize the following objectives.

1. Provide greatest coverage in terms of emotions expressed in our data. To address this, we manually labeled a small subset of the data, and ran a pilot task where raters could suggest emotion labels on top of the pre-defined set.
2. Provide greatest coverage in terms of kinds of emotional expression. We consult psychology literature on emotion expression and recognition (Plutchik, 1980; Cowen and Keltner, 2017; Cowen et al., 2019b). Since, to our knowledge, there has not been research that identifies principal categories for emotion recognition in the domain of text (see Section 2.2), we consider those emotions that are identified as basic in other domains (video and speech) and that we can assume to apply to text as well.

3. Limit overlap among emotions and limit the number of emotions. We do not want to include emotions that are too similar, since that makes the annotation task more difficult. Moreover, combining similar labels with high coverage would result in an explosion of annotated labels.

The final set of selected emotions is listed in Table 4 and Figure 1. See Appendix B for more details on our multi-step taxonomy selection procedure.

3.3 Annotation

We assigned three raters to each example. For those examples where no raters agreed on at least one emotion label, we assigned two additional raters. All raters are native English speakers from India.4

4 Cowen et al. (2019b) find that emotion judgments in Indian and US English speakers largely occupy the same dimensions.

Instructions. Raters were asked to identify the emotions expressed by the writer of the text, given pre-defined emotion definitions (see Appendix A) and a few example texts for each emotion. Raters were free to select multiple emotions, but were asked to select only those for which they were reasonably confident that the emotion is expressed in the text. If raters were not certain about any emotion being expressed, they were asked to select Neutral. We included a checkbox for raters to indicate if an example was particularly difficult to label, in which case they could select no emotions. We removed all examples for which no emotion was selected.

The rater interface. Reddit comments were presented with no additional metadata (such as the author or subreddit). To help raters navigate the large space of emotions in our taxonomy, they were presented a table containing all emotion categories, aggregated by sentiment (by the mapping in Figure 2) and by whether that emotion is generally expressed towards something (e.g. disapproval) or is more of an intrinsic feeling (e.g. joy). The instructions highlighted that this separation of categories was by no means clear-cut, but captured general tendencies, and we encouraged raters to ignore the categorization whenever they saw fit. Emotions with a straightforward mapping onto emojis were shown with an emoji in the UI, to further ease their interpretation.

4 Data Analysis

Table 2 shows summary statistics for the data. Most of the examples (83%) have a single emotion label and have at least two raters agreeing on a single label (94%). The Neutral category makes up 26% of all emotion labels – we exclude that category from the following analyses, since we do not consider it to be part of the semantic space of emotions.

Number of examples: 58,009
Number of emotions: 27 + neutral
Number of unique raters: 82
Number of raters / example: 3 or 5
Marked unclear or difficult to label: 1.6%
Number of labels per example: 1: 83%; 2: 15%; 3: 2%; 4+: .2%
Number of examples w/ 2+ raters agreeing on at least 1 label: 54,263 (94%)
Number of examples w/ 3+ raters agreeing on at least 1 label: 17,763 (31%)

Table 2: Summary statistics of our labeled data.

Figure 1 shows the distribution of emotion labels. We can see a large disparity in terms of emotion frequencies (e.g. admiration is 30 times more frequent than grief), despite the emotion and sentiment balancing steps taken during data selection. This is expected given the disparate frequencies of emotions in natural human expression.

4.1 Interrater Correlation

We estimate rater agreement for each emotion via interrater correlation (Delgado and Tibau, 2019).5 For each rater r ∈ R, we calculate the Spearman correlation between r's judgments and the mean of the other raters' judgments, for all examples that r rated. We then take the average of these rater-level correlation scores. In Section 4.3, we show that each emotion has significant interrater correlation, after controlling for several potential confounds.

5 We use correlations as opposed to Cohen's kappa (Cohen, 1960) because the former is a more interpretable metric and is also more suitable for measuring agreement among a variable number of raters rating different examples. In Appendix C, we report Cohen's kappa values as well, which correlate highly with the values obtained from interrater correlation (Pearson r = 0.85, p < 0.001).
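The agreement statistic above can be sketched in a few lines. This assumes ratings stored as a nested dict {rater: {example_id: {emotion: 0/1}}}, an illustrative format rather than the released one:

import numpy as np
from scipy.stats import spearmanr

def interrater_correlation(ratings, emotion):
    # Mean, over raters, of the Spearman correlation between each rater's
    # judgments and the mean of the other raters' judgments.
    per_rater = []
    for r, judged in ratings.items():
        own, others = [], []
        for ex in judged:
            other_vals = [ratings[o][ex][emotion]
                          for o in ratings if o != r and ex in ratings[o]]
            if other_vals:  # keep only examples also rated by someone else
                own.append(judged[ex][emotion])
                others.append(np.mean(other_vals))
        if len(own) > 1:
            rho, _ = spearmanr(own, others)
            per_rater.append(rho)
    return float(np.mean(per_rater))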
[Figure 1: Our emotion categories, ordered by the number of examples where at least one rater uses a particular label. The color indicates the interrater correlation.]

[Figure 2: The heatmap shows the correlation between ratings for each emotion. The dendrogram represents a hierarchical clustering of the ratings. The sentiment labeling (positive, negative, ambiguous) was done a priori, and it shows that the clusters closely map onto sentiment groups.]

Figure 1 shows that gratitude, admiration and amusement have the highest and grief and nervousness have the lowest interrater correlation. Emotion frequency correlates with interrater agreement, but the two are not equivalent. Infrequent emotions can have relatively high interrater correlation (e.g., fear), and frequent emotions can have relatively low interrater correlation (e.g., annoyance).
prise) are closer to the positive than to the negative
4.2 Correlation Among Emotions

To better understand the relationship between emotions in our data, we look at their correlations. Let N be the number of examples in our dataset. We obtain an N-dimensional vector for each emotion by averaging raters' judgments for all examples labeled with that emotion. We calculate Pearson correlation values between each pair of emotions. The heatmap in Figure 2 shows that emotions that are related in intensity (e.g. annoyance and anger, joy and excitement, nervousness and fear) have a strong positive correlation. On the other hand, emotions that have the opposite sentiment are negatively correlated.

We also perform hierarchical clustering to uncover the nested structure of our taxonomy. We use correlation as a distance metric and ward as a linkage method, applied to the averaged ratings. The dendrogram at the top of Figure 2 shows that emotions related by intensity are neighbors, and that larger clusters map closely onto sentiment categories. Interestingly, emotions that we labeled as "ambiguous" in terms of sentiment (e.g. surprise) are closer to the positive than to the negative category. This suggests that in our data, ambiguous emotions are more likely to occur in the context of positive sentiment than that of negative sentiment.
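Both analyses can be reproduced with standard tooling; a minimal sketch, with random placeholder data standing in for the rater-averaged judgment matrix:

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Placeholder for the dataset's (N examples x 27 emotions) matrix of
# judgments averaged across raters.
mean_ratings = np.random.rand(1000, 27)

# Pearson correlation between each pair of emotion vectors (Figure 2 heatmap).
corr = np.corrcoef(mean_ratings.T)           # (27 x 27)

# Hierarchical clustering with correlation distance and ward linkage
# (Figure 2 dendrogram).
dist = 1.0 - corr                            # correlation -> distance
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="ward")
tree = dendrogram(Z, no_plot=True)           # cluster structure, no plotting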
4.3 Principal Preserved Component Analysis

To better understand agreement among raters and the latent structure of the emotion space, we apply Principal Preserved Component Analysis (PPCA) (Cowen et al., 2019b) to our data. PPCA extracts linear combinations of attributes (here, emotion judgments) that maximally covary across two sets of data that measure the same attributes (here, randomly split judgments for each example). Thus, PPCA allows us to uncover latent dimensions of emotion that have high agreement across raters.

Unlike Principal Component Analysis (PCA), PPCA examines the cross-covariance between datasets rather than the variance-covariance matrix within a single dataset. We obtain the principal preserved components (PPCs) of two datasets (matrices) X, Y ∈ R^(N×|E|), where N is the number of examples and |E| is the number of emotions, by calculating the eigenvectors of the symmetrized cross-covariance matrix

X^T Y + Y^T X.
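This computation is a few lines of linear algebra; a sketch (ours, not the paper's released code):

import numpy as np

def ppca(X, Y):
    # Principal preserved components of two datasets measuring the same
    # attributes: eigenvectors of the symmetrized cross-covariance matrix
    # X^T Y + Y^T X, returned in descending order of eigenvalue.
    X = X - X.mean(axis=0)                  # demean each attribute (emotion)
    Y = Y - Y.mean(axis=0)
    M = X.T @ Y + Y.T @ X                   # symmetric, |E| x |E|
    eigvals, eigvecs = np.linalg.eigh(M)    # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]   # columns are the PPCs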
Extracting significant dimensions. We remove examples labeled as Neutral, and keep those examples that still have at least 3 ratings after this filtering step. We then determine the number of significant dimensions using a leave-one-rater-out analysis, as described by Algorithm 1.

Algorithm 1: Leave-One-Rater-Out PPCA
1: R ← set of raters
2: E ← set of emotions
3: C ∈ R^(|R|×|E|)
4: for all raters r ∈ {1, ..., |R|} do
5:   n ← number of examples annotated by r
6:   J ∈ R^(n×|R|×|E|) ← all ratings for the examples annotated by r
7:   J^(−r) ∈ R^(n×(|R|−1)×|E|) ← all ratings in J, excluding r
8:   J^r ∈ R^(n×|E|) ← all ratings by r
9:   X, Y ∈ R^(n×|E|) ← randomly split J^(−r) and average ratings across raters for both sets
10:  W ∈ R^(|E|×|E|) ← result of PPCA(X, Y)
11:  for all components† w_i, i ∈ {1, ..., |E|}, in W do
12:    v_i^r ← projection‡ of J^r onto w_i
13:    v_i^(−r) ← projection‡ of J^(−r) onto w_i
14:    C_(r,i) ← correlation between v_i^r and v_i^(−r), partialing out v_k^(−r) for all k ∈ {1, ..., i−1}
15:  end for
16: end for
17: C′ ← Wilcoxon signed-rank test on C
18: C″ ← Bonferroni correction on C′ (α = 0.05)
† in descending order of eigenvalue
‡ we demean vectors before projection

We find that all 27 PPCs are highly significant. Specifically, Bonferroni-corrected p-values are less than 1.5e-6 for all dimensions (corrected α = 0.0017), suggesting that the emotions were highly dissociable. Such a high degree of significance for all dimensions is nontrivial. For example, Cowen et al. (2019b) find that only 12 out of their 30 emotion categories are significantly dissociable.

t-SNE projection. To better understand how the examples are organized in the emotion space, we apply t-SNE, a dimension reduction method that seeks to preserve distances between data points, using the scikit-learn package (Pedregosa et al., 2011). The dataset can be explored in our interactive plot6, where one can also look at the texts and the annotations. The color of each data point is the weighted average of the RGB values representing those emotions that at least half of the raters selected.

6 https://nlp.stanford.edu/~ddemszky/goemotions/tsne.html

4.4 Linguistic Correlates of Emotions

We extract the lexical correlates of each emotion by calculating the log odds ratio, informative Dirichlet prior (Monroe et al., 2008) of all tokens for each emotion category, contrasting with all other emotions. Since the log odds are z-scored, all values greater than 3 indicate a highly significant (>3 std) association with the corresponding emotion. We list the top 5 tokens for each category in Table 3. We find that emotions that are highly significantly associated with certain tokens (e.g. gratitude with "thanks", amusement with "lol") tend to have the highest interrater correlation (see Figure 1). Conversely, emotions that have fewer significantly associated tokens (e.g. grief and nervousness) tend to have low interrater correlation. These results suggest certain emotions are more verbally implicit and may require more context to be interpreted.

admiration: great (42), awesome (32), amazing (30), good (28), beautiful (23)
amusement: lol (66), haha (32), funny (27), lmao (21), hilarious (18)
approval: agree (24), not (13), don't (12), yes (12), agreed (11)
caring: you (12), worry (11), careful (9), stay (9), your (8)
anger: fuck (24), hate (18), fucking (18), angry (11), dare (10)
annoyance: annoying (14), stupid (13), fucking (12), shit (10), dumb (9)
disappointment: disappointing (11), disappointed (10), bad (9), disappointment (7), unfortunately (7)
disapproval: not (16), don't (14), disagree (9), nope (8), doesn't (7)
confusion: confused (18), why (11), sure (10), what (10), understand (8)
desire: wish (29), want (8), wanted (6), could (6), ambitious (4)
excitement: excited (21), happy (8), cake (8), wow (8), interesting (7)
gratitude: thanks (75), thank (69), for (24), you (18), sharing (17)
joy: happy (32), glad (27), enjoy (20), enjoyed (12), fun (12)
disgust: disgusting (22), awful (14), worst (13), worse (12), weird (9)
embarrassment: embarrassing (12), shame (11), awkward (10), embarrassment (8), embarrassed (7)
fear: scared (16), afraid (16), scary (15), terrible (12), terrifying (11)
grief: died (6), rip (4)
curiosity: curious (22), what (18), why (13), how (11), did (10)
love: love (76), loved (21), favorite (13), loves (12), like (9)
optimism: hope (45), hopefully (19), luck (18), hoping (16), will (8)
pride: proud (14), pride (4), accomplishment (4)
relief: glad (5), relieved (4), relieving (4), relief (4)
nervousness: nervous (8), worried (8), anxiety (6), anxious (4), worrying (4)
remorse: sorry (39), regret (9), apologies (7), apologize (6), guilt (5)
sadness: sad (31), sadly (16), sorry (15), painful (10), crying (9)
realization: realize (14), realized (12), realised (7), realization (6), thought (6)
surprise: wow (23), surprised (21), wonder (15), shocked (12), omg (11)

Table 3: Top 5 words associated with each emotion. The rounded z-scored log odds ratios in parentheses, with the threshold set at 3, indicate significance of association.
5 Modeling

We present a strong baseline emotion prediction model for GoEmotions.

5.1 Data Preparation

To minimize the noise in our data, we filter out emotion labels selected by only a single annotator. We keep examples with at least one label after this filtering is performed — this amounts to 93% of the original data. We randomly split this data into train (80%), dev (10%) and test (10%) sets. We only evaluate on the test set once the model is finalized.

Even though we filter our data for the baseline experiments, we see particular value in the 4K examples that lack agreement. This subset of the data likely contains edge/difficult examples for the emotion domain (e.g., emotion-ambiguous text), and presents challenges for further exploration. That is why we release all 58K examples with all annotators' ratings.

Grouping emotions. We create a hierarchical grouping of our taxonomy, and evaluate model performance at each level of the hierarchy. A sentiment level divides the labels into 4 categories – positive, negative, ambiguous and Neutral – with the Neutral category intact, and the rest of the mapping as shown in Figure 2. The Ekman level further divides the taxonomy, using the Neutral label and the following 6 groups: anger (maps to: anger, annoyance, disapproval), disgust (maps to: disgust), fear (maps to: fear, nervousness), joy (all positive emotions), sadness (maps to: sadness, disappointment, embarrassment, grief, remorse) and surprise (all ambiguous emotions).
the Neutral category intact, and the rest of the map-
(.8) and love (.78). The model obtains the lowest
ping as shown in Figure 2. The Ekman level further
F1-score on grief (0), relief (.15) and realization
divides the taxonomy using the Neutral label and
(.21), which are the lowest frequency emotions. We
the following 6 groups: anger (maps to: anger, an-
find that less frequent emotions tend to be confused
noyance, disapproval), disgust (maps to: disgust),
by the model with more frequent emotions related
fear (maps to: fear, nervousness), joy (all positive
in sentiment and intensity (e.g., grief with sadness,
emotions), sadness (maps to: sadness, disappoint-
pride with admiration, nervousness with fear) —
ment, embarrassment, grief, remorse) and surprise
see Appendix G for a more detailed analysis.
(all ambiguous emotions).
Table 5 and Table 6 show results for a sentiment-
5.2 Model Architecture grouped model (F1-score = .69) and an Ekman-
grouped model (F1-score = .64), respectively. The
We use the BERT-base model (Devlin et al., 2019)
significant performance increase in the transition
for our experiments. We add a dense output layer
from full to Ekman-level taxonomy indicates that
on top of the pretrained model for the purposes
this grouping mitigates confusion among inner-
of finetuning, with a sigmoid cross entropy loss
group lower-level categories.
function to support multi-label classification. As an
The biLSTM model performs significantly
additional baseline, we train a bidirectional LSTM.
worse than BERT, obtaining an average F1-score
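A minimal TensorFlow sketch of this output layer and loss; the pooled BERT representation and label tensor are assumed inputs, and the released repository contains the actual training code:

import tensorflow as tf

NUM_LABELS = 28  # 27 emotions + Neutral

def multilabel_head(pooled_output, labels):
    # Dense layer over BERT's pooled output; sigmoid cross entropy treats
    # each label as an independent binary decision (multi-label setup).
    logits = tf.keras.layers.Dense(NUM_LABELS)(pooled_output)
    loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(
            labels=tf.cast(labels, tf.float32), logits=logits))
    probs = tf.sigmoid(logits)  # thresholded per label at inference time
    return loss, probs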
5.3 Parameter Settings

When finetuning the pre-trained BERT model, we keep most of the hyperparameters set by Devlin et al. (2019) intact and only change the batch size and learning rate. We find that training for at least 4 epochs is necessary for learning the data, but that training for more epochs results in overfitting. We also find that a small batch size of 16 and a learning rate of 5e-5 yield the best performance. For the biLSTM, we set the hidden layer dimensionality to 256 and the learning rate to 0.1, with a decay rate of 0.95. We apply a dropout of 0.7.

5.4 Results

Table 4 summarizes the performance of our best model, BERT, on the test set, which achieves an average F1-score of .46 (std=.19). The model obtains the best performance on emotions with overt lexical markers, such as gratitude (.86), amusement (.8) and love (.78). The model obtains the lowest F1-scores on grief (0), relief (.15) and realization (.21), which are the lowest-frequency emotions. We find that less frequent emotions tend to be confused by the model with more frequent emotions related in sentiment and intensity (e.g., grief with sadness, pride with admiration, nervousness with fear) — see Appendix G for a more detailed analysis.

Table 5 and Table 6 show results for a sentiment-grouped model (F1-score = .69) and an Ekman-grouped model (F1-score = .64), respectively. The significant performance increase in the transition from the full to the Ekman-level taxonomy indicates that this grouping mitigates confusion among the lower-level categories within each group.

The biLSTM model performs significantly worse than BERT, obtaining an average F1-score of .41 for the full taxonomy, .53 for an Ekman-grouped model and .6 for a sentiment-grouped model.
Emotion | Precision | Recall | F1
admiration | 0.53 | 0.83 | 0.65
amusement | 0.70 | 0.94 | 0.80
anger | 0.36 | 0.66 | 0.47
annoyance | 0.24 | 0.63 | 0.34
approval | 0.26 | 0.57 | 0.36
caring | 0.30 | 0.56 | 0.39
confusion | 0.24 | 0.76 | 0.37
curiosity | 0.40 | 0.84 | 0.54
desire | 0.43 | 0.59 | 0.49
disappointment | 0.19 | 0.52 | 0.28
disapproval | 0.29 | 0.61 | 0.39
disgust | 0.34 | 0.66 | 0.45
embarrassment | 0.39 | 0.49 | 0.43
excitement | 0.26 | 0.52 | 0.34
fear | 0.46 | 0.85 | 0.60
gratitude | 0.79 | 0.95 | 0.86
grief | 0.00 | 0.00 | 0.00
joy | 0.39 | 0.73 | 0.51
love | 0.68 | 0.92 | 0.78
nervousness | 0.28 | 0.48 | 0.35
neutral | 0.56 | 0.84 | 0.68
optimism | 0.41 | 0.69 | 0.51
pride | 0.67 | 0.25 | 0.36
realization | 0.16 | 0.29 | 0.21
relief | 0.50 | 0.09 | 0.15
remorse | 0.53 | 0.88 | 0.66
sadness | 0.38 | 0.71 | 0.49
surprise | 0.40 | 0.66 | 0.50
macro-average | 0.40 | 0.63 | 0.46
std | 0.18 | 0.24 | 0.19

Table 4: Results based on the GoEmotions taxonomy.

Sentiment | Precision | Recall | F1
ambiguous | 0.54 | 0.66 | 0.60
negative | 0.65 | 0.76 | 0.70
neutral | 0.64 | 0.69 | 0.67
positive | 0.78 | 0.87 | 0.82
macro-average | 0.65 | 0.74 | 0.69
std | 0.09 | 0.10 | 0.09

Table 5: Results based on sentiment-grouped data.

Ekman Emotion | Precision | Recall | F1
anger | 0.50 | 0.65 | 0.57
disgust | 0.52 | 0.53 | 0.53
fear | 0.61 | 0.76 | 0.68
joy | 0.77 | 0.88 | 0.82
neutral | 0.66 | 0.67 | 0.66
sadness | 0.56 | 0.62 | 0.59
surprise | 0.53 | 0.70 | 0.61
macro-average | 0.59 | 0.69 | 0.64
std | 0.10 | 0.11 | 0.10

Table 6: Results using Ekman's taxonomy.

6 Transfer Learning Experiments

We conduct transfer learning experiments on existing emotion benchmarks, in order to show that our data generalizes across domains and taxonomies. The goal is to demonstrate that given little labeled data in a target domain, one can utilize GoEmotions as baseline emotion understanding data.

6.1 Emotion Benchmark Datasets

We consider the nine benchmark datasets from Bostan and Klinger (2018)'s Unified Dataset, which vary in terms of their size, domain, quality and taxonomy. In the interest of space, we only discuss three of these datasets here, chosen based on their diversity of domains. In our experiments, we observe similar trends for the additional benchmarks; all are included in Appendix H.

The International Survey on Emotion Antecedents and Reactions (ISEAR) (Scherer and Wallbott, 1994) is a collection of personal reports on emotional events, written by 3000 people from different cultural backgrounds. The dataset contains 8k sentences, each labeled with a single emotion. The categories are anger, disgust, fear, guilt, joy, sadness and shame.

EmoInt (Mohammad et al., 2018) is part of the SemEval 2018 benchmark, and it contains crowdsourced annotations for 7k tweets. The labels are
intensity annotations for anger, joy, sadness, and fear. We obtain binary annotations for these emotions by using .5 as the cutoff.

Emotion-Stimulus (Ghazi et al., 2015) contains annotations for 2.4k sentences generated based on FrameNet's emotion-directed frames. Its taxonomy is anger, disgust, fear, joy, sadness, shame and surprise.

6.2 Experimental Setup

Training set size. We experiment with varying amounts of training data from the target domain dataset, including 100, 200, 500, 1000, and 80% (named "max") of the dataset's examples. We generate 10 random splits for each training set size, with the remaining examples held out as a test set.

We report the results of the finetuning experiments detailed below for each data size, with confidence intervals based on the repeated experiments over the splits.

Finetuning. We compare three different finetuning setups. In the BASELINE setup, we finetune BERT only on the target dataset. In the FREEZE setup, we first finetune BERT on GoEmotions, then perform transfer learning by replacing the final dense layer, freezing all layers besides the last layer, and finetuning on the target dataset. The NOFREEZE setup is the same as FREEZE, except that we do not freeze the bottom layers. We hold the batch size at 16, the learning rate at 2e-5, and the number of epochs at 3 for all experiments.
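The three setups differ only in which parameters stay trainable; a Keras-style sketch under the assumption that the GoEmotions-finetuned model is a functional Keras model ending in a dense classification layer (function and variable names are ours):

import tensorflow as tf

def transfer_model(goemotions_model, num_target_labels, freeze=True):
    # FREEZE: drop the old emotion head, add a fresh dense layer for the
    # target taxonomy, and train only that layer. NOFREEZE: same surgery,
    # but all layers stay trainable. BASELINE skips GoEmotions finetuning.
    backbone = tf.keras.Model(goemotions_model.input,
                              goemotions_model.layers[-2].output)
    backbone.trainable = not freeze
    logits = tf.keras.layers.Dense(num_target_labels)(backbone.output)
    model = tf.keras.Model(backbone.input, logits)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
    return model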
6.3 Results

[Figure 3: Transfer learning results in terms of average F1-scores across emotion categories, for ISEAR (self-reported experiences), Emotion-Stimulus (FrameNet-based sentences), and EmoInt (tweets), over training set sizes of 100, 200, 500, 1000, and max. The bars indicate the 95% confidence intervals, which we obtain from 10 different runs on 10 different random splits of the data.]

The results in Figure 3 suggest that our dataset generalizes well to different domains and taxonomies, and that a model trained on GoEmotions can help in cases where there is limited data from the target domain, or limited resources for labeling.

Given limited target domain data (100 or 200 examples), both FREEZE and NOFREEZE yield significantly higher performance than the BASELINE, for all three datasets. Importantly, NOFREEZE results show significantly higher performance for all training set sizes, except for "max", where NOFREEZE and BASELINE perform similarly.

7 Conclusion

We present GoEmotions, a large, manually annotated, carefully curated dataset for fine-grained emotion prediction. We provide a detailed data analysis, demonstrating the reliability of the annotations for the full taxonomy. We show the generalizability of the data across domains and taxonomies via transfer learning experiments. We build a strong baseline by fine-tuning a BERT model; however, the results suggest much room for future improvement. Future work can explore the cross-cultural robustness of emotion ratings, and extend the taxonomy to other languages and domains.

Data Disclaimer: We are aware that the dataset contains biases and is not representative of global diversity. We are aware that the dataset contains potentially problematic content. Potential biases in the data include: inherent biases in Reddit and its user base, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in the assessment of offensive identity labels, and the fact that annotators were all native English speakers from India. All of these likely affect labeling, precision, and recall for a trained model. The emotion pilot model used for sentiment labeling was trained on examples reviewed by the research team. Anyone using this dataset should be aware of these limitations.

Acknowledgments

We thank the three anonymous reviewers for their constructive feedback. We would also like to thank the annotators for their hard work.

References

Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 718–728.

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 579–586, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Laura-Ana-Maria Bostan and Roman Klinger. 2018. An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2104–2119.
Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1664–1674.

Sven Buechel and Udo Hahn. 2017. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Alan Cowen, Disa Sauter, Jessica L Tracy, and Dacher Keltner. 2019a. Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression. Psychological Science in the Public Interest, 20(1):69–90.

Alan S Cowen, Hillary Anger Elfenbein, Petri Laukka, and Dacher Keltner. 2018. Mapping 24 emotions conveyed by brief human vocalization. American Psychologist, 74(6):698–712.

Alan S Cowen, Xia Fang, Disa Sauter, and Dacher Keltner. In press. What music makes us feel: At least thirteen dimensions organize subjective experiences associated with music across cultures. Proceedings of the National Academy of Sciences.

Alan S Cowen and Dacher Keltner. 2017. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences, 114(38):E7900–E7909.

Alan S Cowen and Dacher Keltner. 2019. What the face displays: Mapping 28 emotions conveyed by naturalistic expression. American Psychologist.

Alan S Cowen, Petri Laukka, Hillary Anger Elfenbein, Runjing Liu, and Dacher Keltner. 2019b. The primacy of categories in the recognition of 12 emotions in speech prosody across two cultures. Nature Human Behaviour, 3(4):369.

CrowdFlower. 2016. https://www.figure-eight.com/data/sentiment-analysis-emotion-text/.

Rosario Delgado and Xavier-Andoni Tibau. 2019. Why Cohen's kappa should be avoided as performance measure in classification. PLoS ONE, 14(9).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Maeve Duggan and Aaron Smith. 2013. 6% of online adults are Reddit users. Pew Internet & American Life Project, 3:1–10.

Paul Ekman. 1992a. Are there basic emotions? Psychological Review, 99(3):550–553.

Paul Ekman. 1992b. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.

Diman Ghazi, Diana Inkpen, and Stan Szpakowicz. 2015. Detecting emotion stimuli in emotion-bearing sentences. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 152–165. Springer.

Chao-Chun Hsu and Lun-Wei Ku. 2018. SocialNLP 2018 EmotionX challenge overview: Recognizing emotions in dialogues. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 27–31, Melbourne, Australia. Association for Computational Linguistics.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.

Chen Liu, Muhammad Osama, and Anderson De Andrade. 2019. DENS: A dataset for multi-class emotion analysis. arXiv preprint arXiv:1910.11769.

Saif Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 task 1: Affect in tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 1–17, New Orleans, Louisiana. Association for Computational Linguistics.

Saif M Mohammad. 2012. #Emotional tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 246–255. Association for Computational Linguistics.

Saif M Mohammad, Xiaodan Zhu, Svetlana Kiritchenko, and Joel Martin. 2015. Sentiment, emotion, purpose, and style in electoral tweets. Information Processing & Management, 51(4):480–499.

Shruthi Mohan, Apala Guha, Michael Harris, Fred Popowich, Ashley Schuster, and Chris Priebe. 2017. The impact of toxic language on the health of Reddit communities. In Canadian Conference on Artificial Intelligence, pages 51–56. Springer.

Burt L Monroe, Michael P Colaresi, and Kevin M Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.

Emily Öhman, Kaisla Kajava, Jörg Tiedemann, and Timo Honkela. 2018. Creating a dataset for multilingual fine-grained emotion-detection using gamification-based annotation. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 24–30.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Rosalind W Picard. 1997. Affective Computing. MIT Press.

Inna Pirina and Çağrı Çöltekin. 2018. Identifying depression on Reddit: The effect of training data. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task, pages 9–12.

Robert Plutchik. 1980. A general psychoevolutionary theory of emotion. In Theories of Emotion, pages 3–33. Elsevier.

James A Russell. 2003. Core affect and the psychological construction of emotion. Psychological Review, 110(1):145.

Klaus R Scherer and Harald G Wallbott. 1994. Evidence for universality and cultural variation of differential emotion response patterning. Journal of Personality and Social Psychology, 66(2):310.

Hendrik Schuff, Jeremy Barnes, Julian Mohme, Sebastian Padó, and Roman Klinger. 2017. Annotation, modelling and analysis of fine-grained emotions on a stance and sentiment detection corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 13–23.

Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 task 14: Affective text. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 70–74, Prague, Czech Republic. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Association for Computational Linguistics.

Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling. In EMNLP 2019.

Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, and Amit P Sheth. 2012. Harnessing Twitter 'big data' for automatic emotion identification. In 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, pages 587–592. IEEE.

Pinar Yanardag, Manuel Cebrian, and Iyad Rahwan. 2018. Norman: World's first psychopath AI.
A Emotion Definitions

admiration: Finding something impressive or worthy of respect.
amusement: Finding something funny or being entertained.
anger: A strong feeling of displeasure or antagonism.
annoyance: Mild anger, irritation.
approval: Having or expressing a favorable opinion.
caring: Displaying kindness and concern for others.
confusion: Lack of understanding, uncertainty.
curiosity: A strong desire to know or learn something.
desire: A strong feeling of wanting something or wishing for something to happen.
disappointment: Sadness or displeasure caused by the nonfulfillment of one's hopes or expectations.
disapproval: Having or expressing an unfavorable opinion.
disgust: Revulsion or strong disapproval aroused by something unpleasant or offensive.
embarrassment: Self-consciousness, shame, or awkwardness.
excitement: Feeling of great enthusiasm and eagerness.
fear: Being afraid or worried.
gratitude: A feeling of thankfulness and appreciation.
grief: Intense sorrow, especially caused by someone's death.
joy: A feeling of pleasure and happiness.
love: A strong positive emotion of regard and affection.
nervousness: Apprehension, worry, anxiety.
optimism: Hopefulness and confidence about the future or the success of something.
pride: Pleasure or satisfaction due to one's own achievements or the achievements of those with whom one is closely associated.
realization: Becoming aware of something.
relief: Reassurance and relaxation following release from anxiety or distress.
remorse: Regret or guilty feeling.
sadness: Emotional pain, sorrow.
surprise: Feeling astonished, startled by something unexpected.

B Taxonomy Selection & Data Collection

We selected our taxonomy through a careful multi-round process. In the first pilot round of data collection, we used emotions that were identified to be salient by Cowen and Keltner (2017), making sure that our set includes Ekman's emotion categories, as used in previous NLP work. In this round, we also included an open input box where annotators could suggest emotion(s) that were not among the options. We annotated 3K examples in the first round and updated the taxonomy based on the results of this round (see details below). In the second pilot round of data collection, we repeated this process with 2k new examples, once again updating the taxonomy.

While reviewing the results from the pilot rounds, we identified and removed emotions that were scarcely selected by annotators and/or had low interrater agreement due to being very similar to other emotions or too difficult to detect from text. These emotions were boredom, doubt, heartbroken, indifference and calmness. We also identified and added those emotions to our taxonomy that were frequently suggested by raters and/or seemed to be represented in the data upon manual inspection. These emotions were desire, disappointment, pride, realization, relief and remorse. In this process, we also refined the category names (e.g. replacing ecstasy with excitement) to ones that seemed interpretable to annotators. This is how we arrived at the final set of 27 emotions + Neutral. Our high interrater agreement in the final data can be partially explained by the fact that we took interpretability into consideration while constructing the taxonomy. The dataset we are releasing was labeled in the third round, over the final taxonomy.

C Cohen's Kappa Values

In Section 4.1, we measure agreement between raters via Spearman correlation, following considerations by Delgado and Tibau (2019). In Table 7, we report Cohen's kappa values for comparison, which we obtain by randomly sampling two ratings for each example and calculating the Cohen's kappa between these two sets of ratings. We find that all Cohen's kappa values are greater than 0, showing rater agreement. Moreover, the Cohen's kappa values correlate highly with the interrater correlation values (Pearson r = 0.85, p < 0.001), providing corroborative evidence for the significant degree of interrater agreement for each emotion.
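The sampling procedure is straightforward to reproduce; a sketch assuming, per example, a list of the binary judgments given for one emotion (an illustrative format):

import random
from sklearn.metrics import cohen_kappa_score

def sampled_kappa(example_ratings):
    # example_ratings: {example_id: [0/1 judgments, one per rater]}.
    # Draw two ratings per example and compute kappa between the two sets.
    first, second = [], []
    for ratings in example_ratings.values():
        if len(ratings) >= 2:
            a, b = random.sample(ratings, 2)
            first.append(a)
            second.append(b)
    return cohen_kappa_score(first, second)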
Emotion | Interrater Correlation | Cohen's kappa
admiration | 0.535 | 0.468
amusement | 0.482 | 0.474
anger | 0.207 | 0.307
annoyance | 0.193 | 0.192
approval | 0.385 | 0.187
caring | 0.237 | 0.252
confusion | 0.217 | 0.270
curiosity | 0.418 | 0.366
desire | 0.177 | 0.251
disappointment | 0.186 | 0.184
disapproval | 0.274 | 0.234
disgust | 0.192 | 0.241
embarrassment | 0.177 | 0.218
excitement | 0.193 | 0.222
fear | 0.266 | 0.394
gratitude | 0.645 | 0.749
grief | 0.162 | 0.095
joy | 0.296 | 0.301
love | 0.446 | 0.555
nervousness | 0.164 | 0.144
optimism | 0.322 | 0.300
pride | 0.163 | 0.148
realization | 0.194 | 0.155
relief | 0.172 | 0.185
remorse | 0.178 | 0.358
sadness | 0.346 | 0.336
surprise | 0.275 | 0.331

Table 7: Interrater agreement, as measured by interrater correlation and Cohen's kappa.

D Sentiment of Reddit Subreddits

In Section 3, we describe how we obtain subreddits that are balanced in terms of sentiment. Here, we note the distribution of sentiments across subreddits before we apply the filtering: neutral (M=28%, STD=11%), positive (M=41%, STD=11%), negative (M=19%, STD=7%), ambiguous (M=35%, STD=8%). After filtering, the distribution of sentiments across the remaining subreddits became: neutral (M=24%, STD=5%), positive (M=35%, STD=6%), negative (M=27%, STD=4%), ambiguous (M=33%, STD=4%).

E BERT's Most Activated Layers

To better understand whether there are any layers in BERT that are particularly important for our task, we freeze BERT and calculate the center of gravity (Tenney et al., 2019) based on scalar mixing weights (Peters et al., 2018). We find that all layers are similarly important for our task, with center of gravity = 6.19 (see Figure 4). This is consistent with Tenney et al. (2019), who have also found that tasks involving high-level semantics tend to make use of all BERT layers.

[Figure 4: Softmax weights of each BERT layer (0-12) when trained on our dataset.]

F Number of Emotion Labels Per Example

Figure 5 shows the number of emotion labels per example before and after we filter for those labels that have agreement. We use the filtered set of labels for training and testing our models.

[Figure 5: Number of emotion labels per example before and after filtering the labels chosen by only a single annotator.]

G Confusion Matrix

Figure 6 shows the normalized confusion matrix for our model predictions. Since GoEmotions is a multilabel dataset, we calculate the confusion matrix
similarly to how we would calculate a co-occurrence matrix: for each true label, we increase the count for each predicted label. Specifically, we define a matrix M where M_(i,j) denotes the raw confusion count between the true label i and the predicted label j. For example, if the true labels are joy and admiration, and the predicted labels are joy and pride, then we increase the counts for M_(joy,joy), M_(joy,pride), M_(admiration,joy) and M_(admiration,pride). In practice, since most of our examples have only a single label (see Figure 5), our confusion matrix is very similar to one calculated for a single-label classification task.

Given the disparate frequencies among the labels, we normalize M by dividing the counts in each row (representing counts for each true emotion label) by the sum of that row. The heatmap in Figure 6 shows these normalized counts. We find that the model tends to confuse emotions that are related in sentiment and intensity (e.g., grief and sadness, pride and admiration, nervousness and fear).
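A sketch of this co-occurrence-style count, assuming per-example label sets (our function and variable names, not the released code):

import numpy as np

def multilabel_confusion(true_sets, pred_sets, labels):
    # Each (true label, predicted label) pair on an example adds one count;
    # rows are then normalized to sum to 1, as in Figure 6.
    index = {lab: i for i, lab in enumerate(labels)}
    M = np.zeros((len(labels), len(labels)))
    for true, pred in zip(true_sets, pred_sets):
        for t in true:
            for p in pred:
                M[index[t], index[p]] += 1
    row_sums = M.sum(axis=1, keepdims=True)
    return M / np.where(row_sums == 0, 1, row_sums)  # avoid division by zero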
We also perform hierarchical clustering over the normalized confusion matrix, using correlation as a distance metric and ward as a linkage method. We find that the model learns clusters relatively similar to the ones in Figure 2, even though the training data only includes the subset of labels that have agreement (see Figure 5).

H Transfer Learning Results

Figure 7 shows the results for all 9 datasets that are downloadable and have categorical emotions in the Unified Dataset (Bostan and Klinger, 2018). These datasets are DailyDialog (Li et al., 2017), Emotion-Stimulus (Ghazi et al., 2015), Affective Text (Strapparava and Mihalcea, 2007), CrowdFlower (CrowdFlower, 2016), Electoral Tweets (Mohammad et al., 2015), ISEAR (Scherer and Wallbott, 1994), the Twitter Emotion Corpus (TEC) (Mohammad, 2012), EmoInt (Mohammad et al., 2018) and the Stance Sentiment Emotion Corpus (SSEC) (Schuff et al., 2017).

We describe the experimental setup in Section 6.2, which we use across all datasets. We find that transfer learning helps in the case of all datasets, especially when there is limited training data. Interestingly, in the case of CrowdFlower, which is known to be noisy (Bostan and Klinger, 2018), and Electoral Tweets, which is a small dataset of ∼4k labeled examples with a large taxonomy of 36 emotions, FREEZE gives a significant boost of performance over the BASELINE and NOFREEZE for all training set sizes besides "max".

For the other datasets, we find that FREEZE tends to give a performance boost compared to the other setups only up to a couple of hundred training examples. For 500-1000 training examples, NOFREEZE tends to outperform the BASELINE, but we can see that these two setups come closer when there is more training data available. These results suggest that our dataset helps if there is limited data from the target domain.
[Figure 6: A normalized confusion matrix for our model predictions (true label vs. predicted label, grouped by positive, negative, and ambiguous sentiment). The plot shows that the model confuses emotions with other emotions that are related in intensity and sentiment.]
[Figure 7: Transfer learning results (average F1-scores vs. training set size: 100, 200, 500, 1000, max) on 9 emotion benchmarks from the Unified Dataset (Bostan and Klinger, 2018): DailyDialog (everyday conversations), Emotion-Stimulus (FrameNet-based sentences), Affective Text (news headlines), CrowdFlower (tweets), ElectoralTweets (tweets), ISEAR (self-reported experiences), TEC (tweets), EmoInt (tweets), and SSEC (tweets). Each panel compares the Baseline, Transfer Learning w/ Freezing Layers, and Transfer Learning w/o Freezing Layers setups.]
