Learning Alignment for Multimodal Emotion Recognition from Speech

Xu, Haiyang; Zhang, Hui; Han, Kun; Wang, Yun; Peng, Yiping; Li, Xiangang

arXiv:1909.05645 (cs)

[Submitted on 6 Sep 2019 (v1), last revised 3 Apr 2020 (this version, v2)]

Abstract: Speech emotion recognition is a challenging problem because humans convey emotions in subtle and complex ways. For emotion recognition on human speech, one can either extract emotion-related features from audio signals or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Further, although emotion recognition benefits from audio-textual multimodal information, it is not trivial to build a system that learns from multiple modalities. One can build models for the two input sources separately and combine them at the decision level, but this method ignores the interaction between speech and text in the temporal domain. In this paper, we propose to use an attention mechanism to learn the alignment between speech frames and text words, aiming to produce more accurate multimodal feature representations. The aligned multimodal features are fed into a sequential model for emotion recognition. We evaluate the approach on the IEMOCAP dataset, and the experimental results show that the proposed approach achieves state-of-the-art performance on the dataset.
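To make the alignment idea concrete, here is a minimal NumPy sketch of attention between text words and speech frames. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the scoring function, the direction of attention, or the feature dimensions, so the scaled dot-product score, the choice of words attending over frames, the concatenation step, and the function name `align_frames_to_words` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def align_frames_to_words(word_feats, frame_feats):
    """Soft-align speech frames to text words with attention.

    word_feats:  (N, d) array, one feature vector per word.
    frame_feats: (T, d) array, one feature vector per speech frame.

    Assumption: scaled dot-product scoring; the abstract does not
    say which scoring function the paper uses.
    """
    d = word_feats.shape[1]
    # Similarity between every word and every frame: shape (N, T).
    scores = word_feats @ frame_feats.T / np.sqrt(d)
    # For each word, a probability distribution over frames.
    weights = softmax(scores, axis=1)
    # Word-level speech representation: weighted sum of frames, (N, d).
    aligned_speech = weights @ frame_feats
    # Assumed fusion step: concatenate text and aligned speech, (N, 2d).
    fused = np.concatenate([word_feats, aligned_speech], axis=1)
    return weights, fused


# Toy inputs: 5 words and 40 speech frames with 16-dimensional features.
rng = np.random.default_rng(0)
words = rng.normal(size=(5, 16))
frames = rng.normal(size=(40, 16))

weights, fused = align_frames_to_words(words, frames)
print(weights.shape, fused.shape)
```

Each row of `weights` is a soft alignment of one word over all frames, and `fused` is a word-level sequence that a sequential model (for example a recurrent network) could consume. In contrast, decision-level fusion would combine only the final predictions of two separate models and never compute such word-to-frame weights.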

Comments:	InterSpeech 2019
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:1909.05645 [cs.CL]
	(or arXiv:1909.05645v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1909.05645

Submission history

From: Haiyang Xu
[v1] Fri, 6 Sep 2019 03:06:38 UTC (106 KB)
[v2] Fri, 3 Apr 2020 03:08:30 UTC (106 KB)
