Introduction to emotion detection
Tyler Schnoebelen
Stanford University
Welcome to the electronic copy of my
presentation!
• I’ve put a lot of stuff in the notes fields, so make
sure to check them out.
• Unfortunately, most of the audio/video
embedded in this presentation will only play if
you have a recent version of PowerPoint.
– If you can’t play the audio/video but would like to
hear it, please email me at tylers at stanford dot edu
and I’ll make it available.
• Please do let me know about any thoughts or
questions!
Goals
1. Reasons to do emotion detection
2. Understanding the main findings (cues)
3. Overview of limitations
4. Finding new ways forward
– Voice quality
– Using change over time
– A new measurement of speech tempo
– Adding lexical content (won’t do much of this
here, though)
WHY DETECT EMOTION?
It looks like you’re full of
rage and indignation.
Would you like to:
[ ] Sound even angrier
[ ] Sound less angry
Also: we aren’t robots
• Linguistics is about
understanding human
beings.
• To understand human
beings is to understand
the variety and
complexity of emotional
experiences they have.
• Linguistics can offer a lot
by showing how linguistic
resources are used in
creating and coping with
these experiences.
WHAT ARE WE DETECTING?
What are we detecting?
Juslin and Laukka (2003)
The basic patterns
• Anger, fear, happiness
– Fast speech rate
– High voice intensity (mean)
– High voice intensity variability
– High-frequency energy
– High F0
• Sadness, tenderness
– Pretty much the opposite of above:
• Slow speech rate, low voice intensity, low voice intensity variability, low high-frequency energy, low F0
• F0 variability?
– Anger and happiness are high
– Fear DOES NOT pattern with them; there’s low F0 variability, as with
tenderness and sadness
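A minimal base-R sketch (not from the talk) of how these cue dimensions can be summarized, assuming frame-level F0 and intensity tracks and a syllable count have already been extracted elsewhere (e.g., with Praat or the de Jong & Wempe script cited later); all names are hypothetical:

# Summarize the cue dimensions above from frame-level tracks (hypothetical inputs)
cue_summary <- function(f0_hz, intensity_db, n_syllables, duration_sec) {
  voiced <- f0_hz > 0                              # unvoiced frames often coded as 0
  c(speech_rate    = n_syllables / duration_sec,   # fast in anger/fear/happiness
    f0_mean        = mean(f0_hz[voiced]),          # high in anger/fear/happiness
    f0_sd          = sd(f0_hz[voiced]),            # high in anger/happiness, low in fear
    intensity_mean = mean(intensity_db),           # high in anger/fear/happiness
    intensity_sd   = sd(intensity_db))             # variable in anger/fear/happiness
}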
CONFRONTING LIMITATIONS
A closer look
• Of the 104 studies J&L looked at from 1900 to June 2002, they found:
– 87% of studies used actors (90 studies)
– 12% used natural speech samples (12 studies)
• Mostly from fear expressions in aviation accidents
For example
• Emotional Prosody Speech and Transcripts
Corpus (EPSaT)
What about naturalistic speech?
Clavel et al (2011)
• A little more naturalistic—film clips involving fear
• 7 hours, 400 speakers
• 3 labelers
• 40 features selected for voiced, 40 for unvoiced:
– Prosodic (pitch)
– Voice quality (jitter, shimmer)
– Spectral and cepstral features (MFCC and Bark band
energy-related measures)
Clavel et al (2011)—Main findings
• Most useful in voiced (all higher in fear than in neutral):
– F0 mean
– Spectral centroid mean
– Jitter
– Note: Intensity was not useful even after normalization (diversity of audio sources)
• It did help for some subtypes of fear, though, such as “panic”
• Most useful in unvoiced:
– Bark band energy-related > MFCC
– Harmonics-to-noise ratio (HNR)
– Unvoiced rate (proportion of unvoiced frames in a segment)
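Since the slide defines it, a one-line base-R sketch of the unvoiced rate, assuming a (hypothetical) logical vector marking which frames are voiced:

# Unvoiced rate = proportion of unvoiced frames in a segment
unvoiced_rate <- function(voiced_frames) mean(!voiced_frames)
unvoiced_rate(c(TRUE, TRUE, FALSE, TRUE, FALSE))   # 0.4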
Sobol-Shikler (2011)
http://www.jkp.com/mindreading/demo/content/dswmedia/MRF_Load.html
Sobol-Shikler (2011)
• 7000 sentences, 2 languages
– Mind Reading database
• Teaches autistic children how to recognize emotions
• 4400 recorded sentences, 10 UK English speakers
• Emotions labeled by 10 people
– Doors corpus
• Humans gambled for 15 minutes—2 repeated sentences (“open this door”, “close door”)
• 100 sentences for each participant (~27 participants)
– 100 spontaneous utterances from 25 speakers
– 9 emotional categories (joyful, absorbed, sure, stressed, excited, opposed,
interested, unsure, thinking)
• 173 features extracted
– F0
– Energy
– Tempo
– Harmonics
– Spectral content
Sobol-Shikler (2011)—Main findings
• To distinguish any pair of emotions, you only
need about 10 features
• But to classify everything, you need 166 of the
173
• Able to classify Hebrew and English affective
states using training on the other language
• About 76% accuracy in detection, which is
comparable to what most studies report
Laukka et al (2011)
• 200 utterances from 64 speakers to a voice-controlled travel service in Sweden
– 20 subjects judged them for:
• Irritation
• Resignation
• Neutrality
• Emotional intensity
• 73 acoustic measures
– Pitch, intensity, formant, voice source, temporal
– 23 selected for further analysis
Laukka et al (2011)—Main findings
• Irritated speech
– Higher F0 mean, first quantile of F0, and fifth quantile of F0 (essentially F0 min and max)
– Higher mean intensity and fifth quantile of intensity; lower percentage of frames with
intensity rise
– Higher median bandwidth of F2
• Resigned speech
– Lower F0 mean
– Smaller F0 standard deviation
– Lower mean intensity, first quantile of intensity and fifth quantile of intensity
– Smaller intensity standard deviation
• Both
– Slower speech rate (mean syllable duration)
– F0 and intensity cues were strongest
• Other
– Maybe an effect for H1MA3 (a measure of spectral tilt at higher formant frequencies), which should be large for breathy and small for creaky voice
• Irritation <-> creaky
– Jitter may be linked to irritation, too (jitter often goes with roughness and breathiness)
Different objects of study
• The emotions that are studied do change when
one switches to naturalistic data. While acted
data tends to investigate rage and sorrow,
naturalistic data tends to feature irritation and resignation
– Ang, Dhillon, Krupski, Shriberg, & Stolcke, 2002; Benus, Gravano, & Hirschberg, 2007; Laukka, Neiberg, Forsell, Karlsson, & Elenius, 2011
• Of course that’s partly because the easiest
naturalistic corpora are human-computer
interactions…
Other naturalistic projects
• Work on naturalistic corpora has increased
through the efforts of the HUMAINE project,
which serves as a repository of emotion
corpora (Douglas-Cowie et al., 2007).
• Most of the data here is from talk shows.
More cues? New cues?
• Modern methods pretty much extract as many features as
they can. For example:
– 46 acoustic features were extracted in Grimm, Kroschel, Mower,
& Narayanan (2007)
– 73 in Laukka, Neiberg, Forsell, Karlsson, & Elenius (2011)
– 87 features in Ververidis, Kotropoulos, & Pitas (2004)
– 100 features in Amir & Cohen (2007)
– 116 in Vidrascu & Devillers (2008)
– 173 features in Sobol-Shikler (2011)
– 534 features for voiced content and 518 for unvoiced content in Clavel, Vasilescu, Devillers, Richard, & Ehrette (2008)
– 1,280 features were extracted in Vogt & André (2005).
NEW WAYS FORWARD
Voice quality
• Voice quality was among the most neglected cues, as Scherer (1986) pointed out
• Things hadn’t improved that much in 2003, when Juslin and Laukka published
• But there have been more studies since:
– Amir & Cohen, 2007
– Campbell, 2004
– Drioli, Tisato, Cosi, & Tesser, 2003
– Fernandez & Picard, 2005
– Gobl & Ní Chasaide, 2000, 2003
– Johnstone & Scherer, 1999
– Laukkanen, Vilkman, Alku, & Oksanen, 1997
– Lugger & Yang, 2007
– Monzo, Alías, Iriondo, Gonzalvo, & Planet,
2007
– Nwe, Foo, & De Silva, 2003
– Yang & Lugger, 2010
Campbell (2004)
• One Japanese woman
who wore a
microphone for two
years
• 13,604 usable
utterances
• Here we see that
breathiness and pitch
are controlled
separately
Voice quality
• Voice quality measures are squirrellier than
people would like, so they are often labeled by
hand
– Tense, harsh
– Whisper
– Creaky
– Breathy
– (Modal)
What about…
http://www.stanford.edu/~tylers/misc/turk/96_A_a.wav
Speaker A: I'm totally getting like his wit and
giving it back {0.7s pause} to him. It's awesome.
Like it has taken a really long time, {breath} but
like I finally get him like as good as he gets me.
The utterance in Praat
How intense is this utterance?
What about pitch?
• Given the literature, we would expect a happy utterance to have a high pitch.
• BUT! Speaker A is using low pitch to indicate excitement, as with the awesome
utterance.
– This is opposite of the prediction that most researchers would make, but there are actually
several instances of it—low pitch enthusiasm seems to be part of A’s emotional style.
[Chart: mean pitch in Hz (y-axis roughly 170–230) across the pre-wedding, wedding, relationship, awesome-utterance, and post-relationship sections]
Style
What else do we notice?
• A: I'm totally getting like his wit and giving it
back {0.7s pause} to him. It's awesome. Like it
has taken a really long time, {breath} but like I
finally get him like as good as he gets me.
Section             Expected “like”s   Observed “like”s
Pre-wedding         42                 38
Wedding             9                  5
Relationship        15                 31
Post-relationship   10                 2
Change
• Normalization lets you see how things differ from the average
• Which is good
• But it misses larger units
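A base-R sketch of what normalization means here, on a hypothetical per-utterance data frame: z-scoring a cue within speaker versus within speaker-and-section (the column names and values are made up):

z <- function(x) (x - mean(x)) / sd(x)

# Hypothetical per-utterance data: mean F0, speaker, and section of the conversation
d <- data.frame(f0      = c(210, 225, 190, 205, 120, 118, 140, 135),
                speaker = rep(c("A", "B"), each = 4),
                section = rep(c("wedding", "relationship"), times = 4))

d$f0_speaker_norm <- ave(d$f0, d$speaker, FUN = z)             # speaker-level normalization
d$f0_section_norm <- ave(d$f0, d$speaker, d$section, FUN = z)  # speaker-by-section normalization

Normalizing within sections of talk rather than only within speaker is one way to keep those larger units in view.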
Speech rate
(all rates in syllables/second)
                                 Avg    3-9 syll.   10-19 syll.   20+ syll.
Speaker A
  Overall (31 utterances)        5.19   4.92        5.17          5.47
  Relationship section (18)      5.40   5.54        5.26          5.46
  Non-relationship section (13)  4.91   3.84        4.91          5.47
Speaker B
  Overall (25 utterances)        5.19   4.99        5.38          5.47
  Relationship section (12)      4.92   4.86        4.87          5.19
  Non-relationship section (13)  5.45   5.18        5.59          5.65
Tempo varies
Who…
I asked who uses tempo even within an utterance…
who?
We’ve got to get Spock to Vulcan!
Tunnel of terror
And some others
William Shatner impersonation
Leonard Nimoy’s Mr. Spock is even
A cue from the Internet
Burstiness
• Variance / (syllables * 0.5)
– Variance captures the dispersion of the data
– The denominator (half the syllable count) scales this by utterance length
– The bigger the ratio, the more the utterance is characterized by clusters (“bursts”); see the sketch below
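A sketch of this measure in base R; the slide doesn’t spell out what the variance is computed over, so this assumes per-syllable durations (in seconds) within the utterance:

# Burstiness = variance / (number of syllables * 0.5), per the formula above
burstiness <- function(syllable_durations) {
  var(syllable_durations) / (length(syllable_durations) * 0.5)
}

burstiness(rep(0.2, 10))                    # perfectly even tempo -> 0
burstiness(c(rep(0.1, 5), rep(0.3, 5)))     # clustered ("bursty") timing -> larger ratio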
Burstiness and emotionality
• 48 Americans judged the emotional intensity of
228 utterances
– Utterances taken from 8 episodes, focusing on:
• Captain Kirk
• Mr. Spock
• Lt. Sulu
• Dr. (Bones) McCoy
– Each utterance judged by 3-5 people
– Scores were normalized per judge and then averaged
– Top 30, bottom 30 and 63 randomly chosen in
between were analyzed for speech rate and burstiness
– Restricted to utterances that were at least 5 syllables
Emotional speech in Star Trek is bursty
speech
Better than speech rate
• Among factors tested:
– Burstiness
– Speech rate
– Syllable count
– Interactions among these
• Only burstiness is significant in a simple linear (ordinary least squares) regression model (p ≈ 0.0125)
– But note that the r-squared isn’t all that great: 0.05044
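A sketch of this fixed-effects-only comparison; the data frame and column names are stand-ins with simulated values, not the actual Star Trek ratings:

set.seed(1)
trek <- data.frame(Speaker    = sample(c("Kirk", "Spock", "Bones", "Sulu"), 123, replace = TRUE),
                   Burstiness = runif(123, 0, 0.3),
                   SpeechRate = runif(123, 3, 6),
                   Syllables  = sample(5:30, 123, replace = TRUE))
trek$Emotionality <- rnorm(123)   # stand-in for the normalized judge ratings

ols <- lm(Emotionality ~ Burstiness * SpeechRate * Syllables, data = trek)
summary(ols)   # in the real data, only Burstiness was significant (p ~ 0.0125, R-squared ~ 0.05)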
Mixed model
• A better approach is to use a mixed model, where speaker is a random effect.
– This allows us to see that Kirk and Bones use burstiness, while Sulu and Spock don’t.
• Kirk 0.4371045
• Bones 0.1710811
• Sulu -0.1518260
• Spock -0.4563595
Spock in reversal
Bursty, but not emotional
Emotional, but not bursty
Emotionality by Burstiness and Speaker

AIC   BIC   logLik deviance REMLdev
341.4 352.7 -166.7 336.7    333.4

Random effects:
 Groups   Name        Variance Std.Dev.
 Speaker  (Intercept) 0.21810  0.46701
 Residual             0.87055  0.93303
Number of obs: 123, groups: Speaker, 4

Fixed effects:
            Estimate Std. Error t value
(Intercept) -0.07646    0.27625 -0.2768
Burstiness   7.07226    3.07077  2.3031

> pvals.fnc(data.lmer)$fixed
            Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept)  -0.0765  -0.0845    -0.8558     0.6649 0.8174   0.7824
Burstiness    7.0723   7.1217     1.0825    13.1220 0.0194   0.0230
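The output above looks like lme4 output (with p-values from languageR’s pvals.fnc, which only works with older versions of lme4). A minimal sketch of the kind of call that could produce it, reusing the hypothetical trek data frame from the earlier sketch; the formula is an assumption based on the random-intercept structure shown:

library(lme4)
data.lmer <- lmer(Emotionality ~ Burstiness + (1 | Speaker), data = trek)
summary(data.lmer)
ranef(data.lmer)$Speaker   # per-speaker adjustments (Kirk/Bones vs. Sulu/Spock in the talk)

# MCMC-based p-values as on the slide; pvals.fnc() is from languageR and
# requires an older lme4:
# library(languageR); pvals.fnc(data.lmer)$fixed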
Intraspeaker variation
Background: The Brunswikian Lens Model
• The lens model is used in several fields to study how observers correctly and incorrectly use objective cues to perceive physical or social reality
[Diagram: cues mediate between the physical or social environment and the observer (organism)]
• Cues have a probabilistic (uncertain) relation to the actual objects
• The same cue can signal several objects in the environment
• Cues are (often) redundant
Slide from Tanja Baenziger
[Figure labels: Overwhelmed, Other-oriented, Overwhelming, Authoritative, Persuading]
Indexical fields
• Variables aren’t fixed but are located in a
“constellation of ideologically related
meanings” (Eckert 2008)
• Diverse but not unconstrained
Summary
• Reasons to do emotion detection
– Call center automation
– Detecting stress/anxiety
– Helping people communicate better
– Terrorism/law
– Progress in detection means progress in synthesis
• Understanding the main findings (cues)
– Speech rate
– Pitch (mean and variability)
– Intensity (mean and variability)
– High frequency energy
– Review of Clavel et al (2011), Sobol-Shikler (2011), and Laukka et al (2011)
• Overview of limitations
– Non-naturalistic data
– Few labelers
– Contextlessness
Summary
• Finding new ways forward
– Voice quality
– Using change over time
• Contrasts
– Low-pitch-but-happy “awesome” utterance
• Changes
– Normalization 1.0=gender
– Normalization 2.0=speaker
– Normalization 3.0=sections of talk
» Shriberg and colleagues’ work on “hot spot” detection may be the most relevant here
– Burstiness (Captain Kirk)
– Embracing indeterminacy (the pitch of “awesome”, indexical fields, etc.)
– Adding linguistic content (not discussed so far—Q&A?)
To learn more…
• I’ve put together a lot of essays and reading notes
about language and emotion here:
– http://www.stanford.edu/~tylers/emotions.shtml
• Among top overviews are:
– Cowie & Cornelius, 2003
– Juslin & Laukka, 2003
– Russell, Bachorowski, & Fernández-Dols, 2003
– Scherer, 1986, 2003
– Schroder, 2004
– Ververidis & Kotropoulos, 2006
Thank you
Works cited (1/5)
• Amir, N., & Cohen, R. (2007). Characterizing Emotion in the Soundtrack of an Animated Film: Credible or
Incredible? Affective Computing and Intelligent Interaction, 148–158.
• Ang, J., Dhillon, R., Krupski, A., Shriberg, E., & Stolcke, A. (2002). Prosody-based automatic detection of annoyance
and frustration in human-computer dialog. In Seventh International Conference on Spoken Language Processing.
• Banse, R., & Scherer, K. (1996). Acoustic profiles in vocal emotion expression. Journal of personality and social
psychology, 70(3), 614–636.
• Benus, S., Gravano, A., & Hirschberg, J. (2007). Prosody, emotions, and…‘whatever’. In Proceedings of International
Conference on Speech Communication and Technology (pp. 2629–2632).
• Campbell, N. (2004). Accounting for voice-quality variation. In Speech Prosody 2004, International Conference.
• Clavel, C., Vasilescu, I., Devillers, L., Richard, G., & Ehrette, T. (2008). Fear-type emotion recognition for future
audio-based surveillance systems. Speech Communication, 50(6), 487–503.
• Cowie, R., & Cornelius, R. R. (2003). Describing the emotional states that are expressed in speech. Speech
Communication, 40(1-2), 5–32.
• Devillers, L., & Campbell, N. (2011). Special issue of Computer Speech and Language on. Computer Speech & Language, 25(1), 1–3. doi:10.1016/j.csl.2010.07.002
• Douglas-Cowie, E., Cowie, R., Sneddon, I., Cox, C., Lowry, O., McRorie, M., Martin, J. C., et al. (2007). The HUMAINE
database: Addressing the collection and annotation of naturalistic and induced emotional data. Affective
computing and intelligent interaction, 488–500.
• Drioli, C., Tisato, G., Cosi, P., & Tesser, F. (2003). Emotions and voice quality: experiments with sinusoidal modeling.
Proceedings of VOQUAL, 27–29.
Works cited (2/5)
• Eckert, P. (2008). Variation and the indexical field. Journal of Sociolinguistics, 12(4),
453–476.
• Fernandez, R., & Picard, R. W. (2005). Classical and novel discriminant features for
affect recognition from speech. In Ninth European Conference on Speech
Communication and Technology.
• Gobl, C., & Ní Chasaide, A. (2000). Testing affective correlates of voice quality
through analysis and resynthesis. In ISCA Tutorial and Research Workshop (ITRW)
on Speech and Emotion.
• Gobl, C., & Ní Chasaide, A. (2003). The role of voice quality in communicating
emotion, mood and attitude. Speech Communication, 40(1-2), 189–212.
• Grimm, M., Kroschel, K., Mower, E., & Narayanan, S. (2007). Primitives-based
evaluation and estimation of emotions in speech. Speech Communication, 49(10-
11), 787–800.
• Huson, D., D. Richter, C. Rausch, T. Dezulian, M. Franz and R. Rupp. (2007). Dendroscope: An interactive viewer for large phylogenetic trees. BMC Bioinformatics 8:460, 2007, software freely available from www.dendroscope.org
• de Jong, N. H., and T. Wempe. (2009). Praat script to detect syllable nuclei and
measure speech rate automatically. Behavior research methods, 41(2), 385.
Works cited (3/5)
• Juslin, P. N., & Laukka, P. (2003). Communication of emotions in vocal expression and music performance: Different channels, same code? Psychological Bulletin, 129(5), 770–814.
• Kendall, T. (2010). Language Variation and Sequential Temporal Patterns of Talk. Linguistics
Department, Stanford University: Palo Alto, CA. February.
• Kendall, T. (2009). Speech Rate, Pause, and Linguistic Variation: An Examination Through the
Sociolinguistic Archive and Analysis Project, Doctoral Dissertation. Durham, NC: Duke University.
• Laukka, P., Neiberg, D., Forsell, M., Karlsson, I., & Elenius, K. (2011). Expression of affect in spontaneous speech: Acoustic correlates and automatic detection of irritation and resignation. Computer Speech & Language, 25(1), 84–104. doi:10.1016/j.csl.2010.03.004
• Laukkanen, A. M., Vilkman, E., Alku, P., & Oksanen, H. (1997). On the perception of emotions in
speech: the role of voice quality. Logopedics Phonatrics Vocology, 22(4), 157–168.
• Lugger, M., & Yang, B. (2007). The relevance of voice quality features in speaker independent
emotion recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing,
2007. ICASSP 2007 (Vol. 4, pp. 17-20).
• Monzo, C., Alías, F., Iriondo, I., Gonzalvo, X., & Planet, S. (2007). Discriminating expressive speech
styles by voice quality parameterization. In Proc. of the 16th International Congress of Phonetic
Sciences (ICPhS) (pp. 2081–2084).
• Nwe, T. L., Foo, S. W., & De Silva, L. C. (2003). Speech emotion recognition using hidden Markov
models. Speech communication, 41(4), 603–623.
• Russell, J. A., Bachorowski, J., & Fernández-Dols, J. (2003). Facial and Vocal Expressions of Emotion.
Annual Review of Psychology, 54(1), 329-349. doi:10.1146/annurev.psych.54.101601.145102
Works cited (4/5)
• Scherer, K. (1979). Nonlinguistic vocal indicators of emotion and psychopathology. Emotions in
personality and psychopathology, 493–529.
• Scherer, K. (1986). Vocal affect expression: A review and a model for future research. Psychological
bulletin, 99(2), 143–165.
• Scherer, K. (1999). Appraisal theory. In T. Dalgleish & M. Power (Eds.), Handbook of cognition and
emotion (pp. 637–663). John Wiley and Sons Ltd.
• Scherer, K. (2003). Vocal communication of emotion: a review of research paradigms. Speech
Communication, 40, 227-256.
• Scherer, K. R. (2005). What are emotions? And how can they be measured? Social Science
Information, 44(4), 695.
• Scherer, K. and J. Oshinsky. (1977). Cue utilization in emotion attribution from auditory stimuli.
Motiv. Emot. 1, 331–346.
• Schnoebelen, T. (2009). The social meaning of tempo.
http://www.stanford.edu/~tylers/notes/socioling/Social_meaning_tempo_Schnoebelen_3-23-
09.pdf
• Schnoebelen, T. (2010). The structure of the affective lexicon. California Universities Semantics and
Pragmatics Workshop (CUSP). Stanford University, October 15, 2010.
http://linguistics.stanford.edu/documents/cusp3-schnoebelen.pdf
• Schnoebelen, T. (2010). Variation in speech tempo: Capt. Kirk, Mr. Spock, and all of us in between.
NWAV, San Antonio, TX.
http://www.stanford.edu/~tylers/notes/socioling/NWAV_Capt_Kirk_Mr_Spock_rest_of_us_burstin
ess_11-4-10.pptx
Works cited (5/5)
• Schroder, M. (2004). Speech and emotion research: an overview of research frameworks and a
dimensional approach to emotional speech synthesis (Ph. D thesis). Saarland University.
• Schröder, M., Cowie, R., Douglas-Cowie, E., Westerdijk, M., & Gielen, S. (2001). Acoustic correlates
of emotion dimensions in view of speech synthesis. In Seventh European Conference on Speech
Communication and Technology.
• Sobol-Shikler, T. (2011). Automatic inference of complex affective states. Computer Speech & Language, 25(1), 45–62. doi:10.1016/j.csl.2009.12.005
• Ververidis, D., & Kotropoulos, C. (2006). Emotional speech recognition: Resources, features, and
methods. Speech Communication, 48(9), 1162–1181.
• Ververidis, D., Kotropoulos, C., & Pitas, I. (2004). Automatic emotional speech classification. In IEEE
International Conference on Acoustics, Speech, and Signal Processing, 2004.
Proceedings.(ICASSP'04).
• Vidrascu, L., & Devillers, L. (2008). Anger detection performances based on prosodic and acoustic
cues in several corpora. In Programme of the Workshop on Corpora for Research on Emotion and
Affect (pp. 13-16).
• Vogt, T., & André, E. (2005). Comparing feature sets for acted and spontaneous speech in view of
automatic emotion recognition. In 2005 IEEE International Conference on Multimedia and Expo (pp.
474–477).
• Yang, B., & Lugger, M. (2010). Emotion recognition from speech signals using new harmony
features. Signal Processing, 90(5), 1415–1423.
Appendix
What’s an emotion?
• Kleinginna & Kleinginna (1981) reviewed 92 definitions
of emotion (and 9 “skeptical statements”). Here’s their
proposal:
– Emotion is a complex set of interactions among subjective
and objective factors, mediated by neural/hormonal
systems, which can (a) give rise to affective experiences
such as feelings of arousal, pleasure/displeasure; (b)
generate cognitive processes such as emotionally relevant
perceptual effects, appraisals, labeling processes; (c)
activate widespread physiological adjustments to the
arousing conditions; and (d) lead to behavior that is often,
but not always, expressive, goal-directed, and adaptive.
(K&K 1981: 355)
Basic assumptions
• Facial and vocal changes occur everywhere and are
coordinated with the sender's psychological state
– Most people can infer something from changes
– So why can’t computers?
• Insistence that there are fixed links between
facial/vocal expression and emotions is misplaced.
(Kappas 2002: 10, Russell et al 2003).
• And of course, the receiving side "is more than a reflex-
like decoding of a message" (Russell et al 2003: 331).
– That is, everything is contextualized! Change!
Kendall (2009)
[Figure: speech rate and pause results broken down by region, ethnicity, gender, age, utterance length, and median pause]
Social meaning
• Speech rates are not stable by demographic
category
• They vary all over the place
• Conveying and creating identities and
attitudes
Accommodation by ethnicity
Speakers A & B (relationship
conversation)
Speaker A (relationship conversation)
Speaker B (relationship conversation)
Ang et al (2002): detecting annoyance
• “Annoyed” was labeled 7.62% of the time; interlabeler agreement (grouping frustrated and annoyed) is 71%, with a Kappa of .47, which is not super great but better than many others
• Frustration
– Longer durations
– Slower speaking
– High values for a number of F0 pitch features
– Repeats and corrections
Ang et al ‘02 Conclusions
• Emotion labeling is a complex decision task
• Cases that labelers independently agree on are classified
with high accuracy
– Extreme emotion (e.g. ‘frustration’) is classified even more
accurately
• Classifiers rely heavily on prosodic features, particularly
duration and stylized pitch
– Speaker normalizations help
• Two nonprosodic features are important: utterance
position and repeat/correction
– Language model is an imperfect surrogate feature for the
underlying important feature repeat/correction
Slide from Shriberg, Ang, Stolcke
Example 3: “How May I Help You”℠ (HMIHY)
• Giuseppe Riccardi, Dilek Hakkani-Tür, AT&T Labs
• Liscombe, Riccardi, Hakkani-Tür (2004)
• Each turn in 20,000 turns (5690 dialogues) annotated for 7
emotions by one person
– Positive/neutral, somewhat frustrated, very frustrated, somewhat angry,
very angry, somewhat other negative, very other negative
– The distribution was heavily skewed (73.1% labeled positive/neutral)
– So classes were collapsed to negative/nonnegative
• Task is hard!
– Subset of 627 turns labeled by 2 people: kappa .32 (full set) and .42
(reduced set)!
Slide from Jackson Liscombe
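Kappa figures like the .32/.42 above are typically Cohen’s kappa; a base-R sketch of computing it from two labelers’ turn-level labels (toy labels, not the HMIHY annotations):

cohen_kappa <- function(labels1, labels2) {
  tab <- table(labels1, labels2)   # assumes both labelers use the same label set
  po  <- sum(diag(tab)) / sum(tab)                      # observed agreement
  pe  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

l1 <- c("neg", "neg", "nonneg", "nonneg", "nonneg", "neg")
l2 <- c("neg", "nonneg", "nonneg", "nonneg", "nonneg", "neg")
cohen_kappa(l1, l2)   # about 0.67 on this toy example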
Lexical Features
• Language Model (ngrams)
• Examples of words significantly correlated with negative user state (p<0.001):
– 1st person pronouns: ‘I’, ‘me’
– requests for a human operator: ‘person’, ‘talk’,
‘speak’, ‘human’, ‘machine’
– billing-related words: ‘dollars’, ‘cents’
– curse words: …
Slide from Jackson Liscombe
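One way to get lists like this is a per-word association test between word presence and turn label; a toy base-R sketch (using Fisher’s exact test here, which is not necessarily the test used in the HMIHY work):

turns    <- c("i want to talk to a person", "how many dollars do i owe",
              "what is my balance", "store hours please")
negative <- c(TRUE, TRUE, FALSE, FALSE)

tokens <- strsplit(turns, " ")
vocab  <- unique(unlist(tokens))

word_p <- sapply(vocab, function(w) {
  in_turn <- sapply(tokens, function(t) w %in% t)
  tab <- table(factor(in_turn, c(FALSE, TRUE)), factor(negative, c(FALSE, TRUE)))
  fisher.test(tab)$p.value
})
head(sort(word_p))   # with real data, words like "person" or "dollars" should surface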
It can be tremendously useful
• To look at confusion matrices (which emotions
are mistaken for which other emotions the
most?)
• The patterns of confusion are not just a
measure of how good/bad people did—
• They are informative!
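In R, a confusion matrix is just a cross-tabulation of intended and perceived labels (toy labels here):

intended  <- c("anger", "anger", "fear", "fear", "happiness", "sadness", "sadness")
perceived <- c("anger", "fear",  "fear", "anger", "happiness", "tenderness", "sadness")
conf <- table(intended, perceived)
conf
prop.table(conf, margin = 1)   # which emotions get mistaken for which, per intended emotion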
Typical sentiment analysis
“subjective”
– positive: love, wonderful, best
– negative: bad, stupid, waste
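The word lists above boil down to lexicon counting; a toy base-R sketch using exactly the six words on the slide:

positive <- c("love", "wonderful", "best")
negative <- c("bad", "stupid", "waste")

sentiment_score <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(words %in% positive) - sum(words %in% negative)
}

sentiment_score("It's awesome, I love it")   #  1
sentiment_score("What a stupid waste")       # -2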
Pennebaker
Word choice matters
• Evidence suggests that people’s physical and
mental health are correlated with the words
they use
– Gottschalk & Gleser (1969); Rosenberg & Tucker (1978); Stiles (1992)
• “Word use is a meaningful marker and
occasional mediator of natural social and
personality processes.”
– (Pennebaker et al 2003: 548)
Not markers…
• Words occur socially
(even when the speaker
is alone).
• So interlocutors aren’t
just listening for
meaning, they are
constructing and
imposing it.
Not markers…makers
Words can be infected…
• Semantic
dynamics,
changes of
meaning
• E.g., semantic
contempt can creep into things from their use
(Kaplan 1999)
Baggage