Multimodal fusion framework: A multiresolution approach for emotion classification and recognition from physiological signals

G.K. Verma, U.S. Tiwary

NeuroImage (2014). http://dx.doi.org/10.1016/j.neuroimage.2013.11.007

Review
Article history: Accepted 4 November 2013; Available online xxxx

Keywords: Multimodal fusion; Multiresolution; Emotion recognition; Wavelet transforms; Discrete wavelet transforms; Physiological signals; EEG; SVM; KNN

Abstract

The purpose of this paper is twofold: (i) to investigate emotion representation models and find out the possibility of a model with a minimum number of continuous dimensions, and (ii) to recognize and predict emotion from measured physiological signals using a multiresolution approach. The multimodal physiological signals are: electroencephalogram (EEG) (32 channels) and peripheral signals (8 channels: galvanic skin response (GSR), blood volume pressure, respiration pattern, skin temperature, electromyogram (EMG) and electrooculogram (EOG)), as given in the DEAP database. We have discussed the theories of emotion modeling based on (i) basic emotions, (ii) the cognitive appraisal and physiological response approach and (iii) the dimensional approach, and proposed a three continuous dimensional representation model for emotions. A clustering experiment on the given valence, arousal and dominance values of various emotions has been done to validate the proposed model. A novel approach for multimodal fusion of information from a large number of channels to classify and predict emotions has also been proposed. The Discrete Wavelet Transform, a classical transform for multiresolution analysis of signals, has been used in this study. The experiments are performed to classify different emotions with four classifiers. The average accuracies are 81.45%, 74.37%, 57.74% and 75.94% for the SVM, MLP, KNN and MMC classifiers respectively. The best accuracy is for 'Depressing' with 85.46% using SVM. The 32 EEG channels are considered as independent modes and features from each channel are considered with equal importance. Some of the channel data may be correlated, but they may still contain supplementary information. In comparison with the results given by others, the high accuracy of 85% with 13 emotions and 32 subjects from our proposed method clearly proves the potential of our multimodal fusion approach.

© 2013 Elsevier Inc. All rights reserved.
Contents

Introduction
Emotion modeling
    Theory of emotion
        Darwinian evolutionary view of emotion (basic emotions and other emotions)
        Cognitive appraisal and physiological theory of emotions
        Dimensional approaches to emotions
    Proposed model of emotion
Multimodal fusion framework
    Early fusion
    Intermediate fusion
    Late fusion
Wavelet based multiresolution approach
    The wavelet transform
    Multiresolution approach
Experiments on 3D affective model
    DEAP emotion database
Experiments on classification and recognition of emotions from physiological signals
    Feature extraction
    Multimodal fusion
    Classification
Results and discussion
    Results of 3D emotion modeling using continuous VAD (valence, arousal, dominance) dimensions
    Results of emotion classification using the proposed multimodal fusion approach
Conclusion and future work
Appendix A
References
Introduction

Natural communication among human beings takes place in two ways, i.e. verbal and non-verbal. Verbal communication involves voice, speech or audio, whereas non-verbal communication involves facial expression, body movement, sign language, etc. Non-verbal communication plays an important role in deciding the content of the communication. Generally, humans acquire information from more than one modality, such as audio, video, smell, touch, etc., and combine this information into a coherent whole. The human brain also derives information from various modes of communication so as to integrate various complementary and supplementary information. In computational systems, information from various modalities, such as audio, video, electroencephalogram (EEG), electrocardiogram (ECG), etc., may be fused together in a coherent way. This is known as multimodal information fusion. In affective systems, multimodal information fusion can be used for the extraction and integration of interrelated information from multiple modalities to enhance performance (Poh and Bengio, 2005), to reduce uncertainty in data classification, or to reduce ambiguity in decision making. Several examples are: audio-visual speaker detection, multimodal emotion recognition, human face or body tracking, event detection, etc.

Two or more modalities cannot be integrated in a context-free manner; there must be some context dependent model. Information fusion can be broadly classified into three major categories: (i) early fusion, (ii) intermediate fusion, and (iii) late fusion. In early fusion the information is integrated at the signal or feature level, whereas in late fusion higher semantic level information is fused. The key issues in multimodal information fusion are the number of modalities (Pfleger, 2004; Bengio et al., 2002), the synchronization of information derivation and the fusion process, and finding the appropriate level at which the information is to be fused. Different modalities do not always provide complementary information in the fusion process; hence it is important to understand the contribution of each modality with reference to the tasks to be accomplished. A typical framework of multimodal information fusion for human–computer interaction is illustrated in Fig. 1.

In the area of human–computer interaction (HCI), information about the cognitive, affective and emotional states of a user becomes more and more important, as this information could be used to make communication with computers more human-like or to make computer learning environments more effective (Schaaff, 2008). Emotion recognition is an important task as it is finding extensive applications in the areas of HCI and Human–Robot Interaction (HRI) (Cowie et al., 2001) and many other emerging areas. Emotions are expressed through posture and facial expression as well as through physiological signals, such as brain activity, heart rate, muscle activity, blood pressure, skin temperature, etc. (Schaaff, 2008). Recognizing emotion is an interesting and challenging problem. Generally, emotion is recognized through facial expressions or audio–video data, but facial expressions may not involve all emotions (Ekman, 1993). There are a number of advantages of using physiological signals for emotion recognition (Schaaff, 2008).

• Firstly, bio-signals are controlled by the central nervous system and therefore cannot be influenced intentionally, whereas actors can play emotions on their face intentionally.
• Secondly, physiological signals are constantly emitted and, as sensors are attached directly to the body of the subject, they are never out of reach.

In addition, physiological data could also be used as a complement to emotional data collected from voice or facial expressions to improve recognition rates. Physiological patterns help in assessing and quantifying stress, anger and other emotions that influence health (Sebea et al., 2005). Physiological signals include modalities such as electroencephalogram (EEG), electromyogram (EMG), electrooculogram (EOG), galvanic skin response (GSR), blood volume pressure, respiration pattern, skin temperature, etc., which can provide multiple cues. Emotion recognition from physiological patterns has a vast number of applications in the areas of medicine, entertainment and HCI.

The DEAP database (Koelstra et al., 2012) for emotion analysis using physiological signals contains EEG and peripheral signals in addition to face videos from 32 participants. The EEG signals were recorded from 32 active electrodes (channels), whereas the peripheral physiological signals (8 channels) include GSR, skin temperature, blood volume pressure (plethysmography), respiration rate, EMG (zygomaticus major and trapezius) and EOG (horizontal and vertical). We have used the DEAP database in all our experiments. A full description of the database is given in Appendix A.

The purpose of this paper is twofold: (i) to propose a multi-dimensional 3D affective model of emotion representation and (ii) to develop a multimodal fusion framework to classify and recognize emotions from physiological signals using a multiresolution approach. The overall paper is divided into eight sections, including the present one as the first. Theories for emotion modeling are described in the second section, while the basic approaches used in the multimodal fusion framework are described in the third section. The fourth section is about the wavelet based multiresolution approach used for feature extraction. The experiments done on the 3D affective model are described in the fifth section, while those on emotion classification and recognition from physiological signals are described in the sixth section. The results and discussions on both types of experiments are presented in the seventh section, and concluding remarks are given in the last section.

Emotion modeling

Emotion is an affective state of human beings (and animals) arising as a response to some interpersonal or other event. It involves an appraisal of the given situation, and the response is generally present as some physiological signals and/or some action(s). It is worth mentioning that functionalist approaches to emotions vary by level of analysis: the individual and dyadic (inter-personal, between two people) levels of analysis, and the group and cultural levels of analysis. Analysts at the group and cultural levels see emotions as serving social and cultural functions; they believe emotions to be constructed by individuals or groups in social contexts, and they relate them to constructs of individuals, patterns of social hierarchy, language, or requirements of socio-economic organization, etc. (Lutz and Abu-Lughod, 1990). Here we are concerned with the effects (experiences) of emotions within the individual or between interacting individuals (Ekman, 1993; Nesse, 1990).
… describe humans' affective states because it is able to describe the intensity of emotion, which can be used for recognizing dynamics, and allows for adaptation to individual moods and personalities (Schuller, 2011). In this work, we have used three emotion dimensions, valence, arousal and dominance, on a continuous scale from 1 to 9. The valence scale ranges from unhappy or sad to happy or joyful. The arousal scale ranges from calm or bored to stimulated or excited. The dominance scale ranges from submissive (or without control) to dominant (or in control, empowered).

The major contributions of this work are as follows:

• It presents a three-dimensional emotion model in terms of valence, arousal and dominance.
• It proposes a multimodal fusion framework for affective prediction based on physiological signals (not facial expressions).

Multimodal fusion framework

Multimodal fusion refers to the integration of two or more modalities in order to improve the performance of a system. Several multimodal fusion techniques are reported in the literature; however, one major categorization is into early, intermediate and late fusion. In early fusion, the features obtained from different modalities need to be combined into a single representation before feeding them to the learning phase. Intermediate fusion is able to deal with imperfect data together with reliability and asynchrony issues among different modalities. Late fusion, also known as decision level fusion, is based on the semantic information of the modalities. A major issue in multimodal data processing is whether the data should be processed separately and combined only at the end. General fusion architectures and the joint processing of modalities have been discussed in Huang and Suen (1995), Bolt (1980), and Blattner and Glinert (1996). In order to accomplish a human-like multimodal analysis of multiple input signals acquired by different sensors, the signals cannot always be considered mutually independent; hence, they might not be combined in a context-free manner at the end of the intended analysis. On the contrary, the input data might preferably be processed in a joint feature space and according to a context-dependent model (Karray et al., 2008). The major problems faced in multimodal fusion are the dimensionality of the joint feature space, different feature formats, and time-alignment. A potential way to achieve multisensory data fusion is to develop context-dependent versions of a suitable method, such as the Bayesian inference method proposed by Pan et al. (1999).

Multimodal systems usually integrate signals at the feature level (early fusion), at a higher semantic level (late fusion), or somewhere in between (intermediate fusion) (Turk, 2005; Corradini et al., 2003). In this section we present these fusion techniques and related work reported in the literature.

Early fusion

When different modalities are combined into a single representation prior to the learning phase, the fusion is said to be early fusion. In this class of fusion, features are integrated at the beginning (Cees et al., 2005). In the early fusion framework, the recognition process at signal level in some particular mode influences the course of the recognition process in the remaining modes. Consequently, this kind of fusion is found to be more appropriate for highly temporally synchronized input modalities. Audio-visual integration is one of the most suitable examples of early fusion, where one simply concatenates the audio and visual feature vectors to get a combined audio-visual vector. The length of the resulting vector is contained by using dimensionality reduction approaches like linear discriminant analysis (LDA) prior to feeding the feature vectors to the recognition engine. The most preferred classifier for an early integration system is very often the conventional Hidden Markov Model (HMM) trained on the mixed audiovisual feature vector. Pitsikalis et al. (2006) proposed a Bayesian inference method to fuse audio-visual features obtained from Mel-frequency cepstral coefficients (MFCC) and texture analysis respectively; the joint probability was computed over the combined features. Mena and Malpica (2003) proposed a Dempster–Shafer fusion approach for the segmentation of color images. The authors extracted information from terrestrial, aerial or satellite images, based on the location of an isolated pixel, a group of pixels, and a pair of pixels, and fused it using the Dempster–Shafer approach. Nefian et al. (2002) used a coupled HMM (CHMM) to combine audiovisual features for speech recognition; the authors modeled the state asynchrony of the features while preserving their correlation over time. Magalhaes and Ruger (2007) used the maximum entropy model for semantic multimedia indexing. They combined text and image features for image retrieval and reported better performance of maximum entropy model based fusion compared to a Naive Bayes approach.

Intermediate fusion

The basic shortcoming of the early fusion technique is its inability to deal with imperfect data. Besides, early fusion avoids explicit modeling of the different modalities, which also results in a failure to model fluctuations in the signal. Asynchrony among the different streams is also a major problem for early fusion. An approach to overcome these issues is to consider the features of the corresponding stream at various time instances. By comparing previously observed instances with the current data of some observation channel, one can make a statistical prediction, with a certain derivable probability, for the erroneous instances. Therefore, probabilistic graphical models like the HMM (including its hierarchical variants), Bayesian networks and dynamic Bayesian networks are the most appropriate frameworks for fusing multiple sources of information in such a situation (Sebe et al., 2005). They can handle not only noisy features but, using probabilistic inference, also temporal information and missing feature values. Application of the hierarchical HMM has been demonstrated for recognizing facial expressions (Cohen et al., 2003). Dynamic Bayesian networks and HMM variants (Minsky, 1975) have shown their capability to fuse several sources of information to recognize intention, office activities or other events in video using both audio and video signals (Carpenter, 1992).

Late fusion

A late-fusion multimodal system integrates common meaning representations derived from different modalities to arrive at some common interpretation. This necessitates a common framework for meaning representation across all the modalities being used, as well as well-defined operations for integrating the partial meanings. Late integration models generally use independent classifiers, which can be trained separately for each stream. The ultimate classification decision is made by combining the partial outputs of the unimodal classifiers. The correspondence between the channels is taken into account only during the integration step. There are some obvious advantages of late fusion. The inputs can be recognized separately, independently of one another, hence they need not occur simultaneously. It also simplifies the software development process (Rabiner and Juang, 1993). A late integration system uses recognizers trainable on unimodal data sets, but scalability in the number of modes as well as in vocabulary is an important issue. For example, in the case of audiovisual recognition, one can search for a good heuristic, extract the promising hypotheses from audio only, and then rescore them on the basis of the visual evidence (Kittler et al., 1998). Multimodal human–computer interaction is one such application where we need advanced techniques and models for cross-modal information fusion to integrate audio and video features. This is an active area of research and many paradigms are being
proposed for addressing the issues arising in audiovisual fusion. However, they can be generalized into a framework addressing the integration of other modalities. Some approaches based on late fusion are discussed here.

Aguilar et al. (2003) proposed a rule-based fusion and a learning-based fusion strategy for combining the scores of face, fingerprint and online signatures. The sum rule and a Radial Basis Function Support Vector Machine (RBF SVM) were used for comparison. The experimental results demonstrate that the learning-based RBF SVM scheme outperforms the rule-based scheme for appropriate parameter selections. Meyer et al. (2004) proposed a decision level fusion taking speech and visual modalities. The speech features were extracted using the MFCC algorithm and the lip contour features from the speaker's face in the video. HMM classifiers were used in order to obtain the individual decisions for both, and fusion was then performed using the Bayesian inference method to estimate the joint probability of a spoken digit. Singh et al. (2006) used D–S theory to fuse the scores of three different finger print classifiers.

The wavelet transform

… then they are analyzed individually with a resolution matched to their scale. The wavelet transform is the representation of a function by mother wavelets. The one-dimensional continuous wavelet transform W_f(s, τ) of a one-dimensional function f(t) is defined by Eq. (1) (Bhatnagar et al., 2012):

W_f(s,\tau) = \int_{-\infty}^{\infty} f(t)\,\psi_{s,\tau}(t)\,dt    (1)

where ψ is the mother wavelet function, s is the scale parameter and τ is the translation parameter. To reconstruct the original signal from the transformed signal, the inverse continuous wavelet transform is defined as

f(t) = \frac{1}{C_\psi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} W_f(s,\tau)\,\psi_{s,\tau}(t)\,\frac{ds\,d\tau}{s^2}    (2)
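In practice the paper uses the discrete counterpart of this transform (a five-level decomposition with the Daubechies-4 wavelet, as described later). A minimal sketch, assuming the PyWavelets package; the input array is a random placeholder for one physiological channel:

```python
# Minimal sketch: 5-level discrete wavelet decomposition of one channel with
# the Daubechies-4 (db4) wavelet. Assumes the PyWavelets package; `signal`
# is a placeholder 1-D array, not data from the DEAP database.
import numpy as np
import pywt

rng = np.random.default_rng(0)
signal = rng.standard_normal(8064)          # stand-in for one channel

# wavedec returns [cA5, cD5, cD4, cD3, cD2, cD1]: the level-5 approximation
# followed by the detail coefficients from the coarsest to the finest level.
coeffs = pywt.wavedec(signal, wavelet='db4', level=5)
approx, details = coeffs[0], coeffs[1:]

# Perfect-reconstruction check: inverting the transform recovers the signal.
reconstructed = pywt.waverec(coeffs, wavelet='db4')
assert np.allclose(reconstructed[:len(signal)], signal)
```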
Table 1
An overview of the fusion approaches in terms of the modalities used and their applications.

Fusion method | Fusion level | Reference | Modalities / cues | Application
Support Vector Machine | Decision | Aguilar et al. (2003) | Video, audio (MFCC) and textual cues | Semantic concept detection
Support Vector Machine | Hybrid | Bredin and Chollet (2007) | Audio (MFCC), video (DCT of lip area) | Biometric identification of talking face
Support Vector Machine | Hybrid | Zhu et al. (2006) | Low level visual features, text color, size, location, edge density, brightness, contrast | Image classification
Bayesian inference | Feature | Pitsikalis et al. (2006) | Audio (MFCC), video (shape and texture) | Speech recognition
Bayesian inference | Decision | Meyer et al. (2004) | Audio (MFCC) and video (lips contour) | Spoken digit recognition
Bayesian inference | Hybrid | Xu and Chua (2006) | Audio, video, text, web log | Sports video analysis
Bayesian inference | Hybrid | Atrey et al. (2006) | Audio and video | Event detection for surveillance
Dempster–Shafer theory | Feature | Mena and Malpica (2003) | Video (trajectory coordinates) | Segmentation of satellite images
Dempster–Shafer theory | Decision | Guironnet et al. (2005) | Audio (phonemes) and visual (visemes) | Video classification
Dempster–Shafer theory | Decision | Singh et al. (2006) | Audio, video and the synchrony score | Finger print classification
Dynamic Bayesian networks | Feature | Nefian et al. (2002) | Audio and visual (2D-DCT coefficients of the lips region) | Speech recognition
Dynamic Bayesian networks | Decision | Beal et al. (2003) | Audio and video | Object tracking
Dynamic Bayesian networks | Hybrid | Town (2007) | Video (face and blob), ultrasonic sensors | Human tracking
Dynamic Bayesian networks | Hybrid | Xie et al. (2005) | Text, audio, video | Clustering in video
Neural networks | Feature | Zou and Bhanu (2005) | Audio and video | Human tracking
Neural networks | Decision | Gandetto et al. (2003) | CPU load, login process, network | Human activity monitoring
Neural networks | Hybrid | Ni et al. (2004) | Load, camera images | Image recognition
Maximum Entropy Model | Feature | Magalhaes and Ruger (2007) | Text and image | Semantic image indexing
Table 2
Euclidean distances of each emotion from other emotions based on centroids in VAD (valence, arousal and dominance) space. Rows and columns are both ordered E1 to E12 (see the note below the table); the entry in row Ei and column Ej is the distance between emotions Ei and Ej.
E1 0 4.7024 2.7903 3.0255 1.1539 1.2791 0.6569 2.0658 1.3311 3.7630 4.1819 2.5741
E2 4.7024 0 2.0866 2.4879 4.3785 4.1845 4.2124 2.9519 3.9027 2.0082 1.4565 2.7661
E3 2.7903 2.0866 0 0.8270 2.4163 2.1952 2.2290 1.6070 2.3260 2.1677 2.2172 1.4238
E4 3.0255 2.4879 0.8270 0 2.6279 2.3719 2.4409 2.1307 2.7152 2.6873 2.7649 1.5790
E5 1.1539 4.3785 2.4163 2.6279 0 0.2699 0.7221 2.3893 2.0717 3.9552 4.2169 2.8017
E6 1.2791 4.1845 2.1952 2.3719 0.2699 0 0.7304 2.2996 2.0712 3.8264 4.0729 2.6373
E7 0.6569 4.2124 2.2290 2.4409 0.7221 0.7304 0 1.8544 1.4175 3.5065 3.8573 2.2583
E8 2.0658 2.9519 1.6070 2.1307 2.3893 2.2996 1.8544 0 0.9643 1.6977 2.1550 1.0230
E9 1.3311 3.9027 2.3260 2.7152 2.0717 2.0712 1.4175 0.9643 0 2.5764 3.0850 1.6562
E10 3.7630 2.0082 2.1677 2.6873 3.9552 3.8264 3.5065 1.6977 2.5764 0 0.6624 1.7175
E11 4.1819 1.4565 2.2172 2.7649 4.2169 4.0729 3.8573 2.1550 3.0850 0.6624 0 2.1546
E12 2.5741 2.7661 1.4238 1.5790 2.8017 2.6373 2.2583 1.0230 1.6562 1.7175 2.1546 0
Note: E1 — happy, E2 — sad, E3 — anger, E4 — hate, E5 — fun, E6 — exciting, E7 — joy, E8 — cheerful, E9 — love, E10 — sentimental, E11 — melancholy, E12 — pleasure.
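A distance matrix of this form can be computed directly from per-emotion centroids in VAD space. A minimal sketch; the centroid coordinates below are hypothetical placeholders, not the centroids derived from the DEAP ratings:

```python
# Minimal sketch: pairwise Euclidean distances between emotion centroids in
# valence-arousal-dominance (VAD) space, as in Table 2. The centroid values
# are hypothetical placeholders, not the ones used in the paper.
import numpy as np

centroids = {
    "happy": np.array([7.5, 6.0, 6.5]),   # (valence, arousal, dominance)
    "sad":   np.array([3.0, 3.5, 3.0]),
    "anger": np.array([3.5, 6.5, 5.5]),
    "fun":   np.array([7.0, 6.5, 6.0]),
}

names = list(centroids)
dist = np.zeros((len(names), len(names)))
for i, a in enumerate(names):
    for j, b in enumerate(names):
        dist[i, j] = np.linalg.norm(centroids[a] - centroids[b])

print(np.round(dist, 4))   # symmetric matrix with zero diagonal, cf. Table 2
```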
Multimodal fusion

Multimodal fusion is described in detail in the Multimodal fusion framework section. Feature level fusion was applied in this study. We have two feature vectors, say F_i = {f_{i,1}, f_{i,2}, …, f_{i,n}}, the feature vector for one modality, H_i, the corresponding feature vector for the other modality, and X_i = {x_{i,1}, x_{i,2}, …, x_{i,n}}, the fused feature vector of F_i and H_i. X_i is obtained by augmenting the normalized feature vectors F_i and H_i and then performing feature selection on the concatenated vector so formed. The normalized F_i and H_i are calculated by applying standard normalization techniques to each of the individual feature values of F_i and H_i. If the feature vectors {F_i, H_i} and {F_j, H_j} are obtained at two different time instances i and j, then the corresponding fused vectors may be denoted as X_i and X_j respectively.
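As a concrete illustration of this feature level fusion step, the sketch below min-max normalizes the features of each modality and concatenates them per trial. The numbers 640 and 25 follow the experimental setup reported later (640 training instances, 25-dimensional feature vectors); the second modality's dimension and the random data are placeholders, and scikit-learn's MinMaxScaler merely stands in for the Min–Max algorithm mentioned in the paper. The subsequent feature selection step is not shown.

```python
# Minimal sketch of feature level fusion: min-max normalize the features of
# each modality and concatenate them per trial into X. The arrays are random
# placeholders; feature selection is omitted.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
F = rng.normal(size=(640, 25))      # trials x features, modality 1 (e.g. EEG channel features)
H = rng.normal(size=(640, 10))      # trials x features, modality 2 (e.g. peripheral features)

# Scale every feature to [0, 1] across trials, separately per modality.
F_norm = MinMaxScaler().fit_transform(F)
H_norm = MinMaxScaler().fit_transform(H)

# Feature level fusion: each trial's fused vector X_i is the concatenation
# of its normalized modality vectors F_i and H_i.
X = np.hstack([F_norm, H_norm])
print(X.shape)                      # (640, 35); feature selection would follow
```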
Classification
Table 3
Extracted features from EEG and physiological signals.

Feature | Computed over
Relative Power Energy (RPE) | Four bands: alpha, beta, gamma and theta
Logarithmic Relative Power Energy (LRPE) | Four bands: alpha, beta, gamma and theta
Absolute Logarithmic Relative Power Energy (ALRPE) | Four bands: alpha, beta, gamma and theta
Standard deviation | All levels of detail coefficients and the highest-level approximation coefficient
Spectral entropy | All levels of detail coefficients and the highest-level approximation coefficient
Fig. 3. Architecture of emotion recognition from physiological signal.
The K-Nearest Neighbor (K-NN), an instance-based learning classifier, is also used here for classifying unknown instances based on some distance or similarity function. An MMC classifier, which handles multi-class datasets with 2-class classifiers by building a random class-balanced tree structure, has also been used in this work.

Results and discussion

The goal of the analysis and evaluations is twofold: first, the validation of the 3D emotion model, and second, the evaluation of classification and prediction of various emotions from EEG and other peripheral physiological signals. The DEAP database consists of 40 channels of physiological signals (EEG + peripheral) with 8056 data points, with 40 trials for each of the 32 participants. All the calculations were carried out in Matlab 7.7.0 (R2008b) on a 32-bit Intel 3.06 GHz processor with 2 GB RAM. The wavelet transformation was done using the Wavelet toolbox available in Matlab.

Results of 3D emotion modeling using continuous VAD (valence, arousal, dominance) dimensions

Table 2 shows the Euclidean distances among various emotions in VAD space. It shows a maximum distance of 4.70 between happy and sad, and a minimum distance of 0.27 between fun and exciting, which indicates that 'fun' and 'exciting' are very near compared to 'anger' and 'sad'. Similarly 'happy' and 'joy', 'love' and 'cheerful', and 'anger' and 'hate' are near to each other compared to other emotions. These results qualitatively validate the model.

Further, we have plotted a graph of 12 emotions as nodes and the distances as edges in Fig. 4. First we have shown only the distances up to 25% of the maximum value (4.7/4 = 1.18) as solid lines. This emotion network (or graph) interestingly groups the 12 emotions into 5 groups, which is exactly equal to the number of clusters in Fig. 4. Interestingly, these five clusters in the graph are (happy, joy, fun, exciting), (love, cheerful, pleasure), (anger, hate), (sad) and (melancholy, sentimental). However, as the points in the clusters of Fig. 3 have large deviations, one or two emotions might have been clustered in different groups, but the emotion network very well qualifies the 3D VAD model of emotion. We have also shown the distances more than 25% (of maximum) and less than 40% (of maximum) by dotted lines to show the relative positions of each emotion group. As can be seen, the (happy, joy, fun, exciting) group is a good distance away from both the (sentimental, melancholy) group and the (anger, hate) group. The (sad) emotion is further away from the (sentimental, melancholy) group. These results are in agreement with the general findings.

These results also demonstrate, to some extent, the sufficiency of the three continuous dimensions, namely valence, arousal and dominance (although in the DEAP database the values of liking and familiarity are also indicated). As per cognitive appraisal theory, emotions are responses to the cognitive appraisal of the (abnormal) situation. Smith and Ellsworth (1985) have pointed out eight dimensions of cognitive appraisal: pleasantness, attentional activity, control, certainty, goal–path obstacle, legitimacy, responsibility and anticipated effort. Out of these, goal–path obstacles and legitimacy have not been considered by many other researchers. However, here we are not considering the appraisal itself, but the common quantities which result from the appraisal. Valence gives some kind of direction, a kind of pleasantness or sadness (like hue in color representation). Arousal gives sharpness, like very sad or very happy (which may be due to attentional activity or anticipated effort). Dominance quantifies the amount of self or situational control (which may be the result of control, certainty and responsibility). As mentioned in the Emotion modeling section, the two dimensions of valence and arousal alone are not sufficient to explain the distinction among many emotions. Hence, we think that these three continuous dimensions, valence, arousal and dominance, are sufficient to uniquely represent an emotion.

Results of emotion classification using the proposed multimodal fusion approach

For this category of experiment, the EEG (32 channels) and peripheral physiological (8 channels) signals were used to extract the features using multiresolution analysis.
Fig. 4. Emotion graph or network with Euclidean distance ≤ 1.02 (solid lines) shows five clusters. Dotted lines show distances > 1.02 and < 2.0.
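The grouping read off Fig. 4 can be reproduced by thresholding the distance matrix of Table 2 and taking connected components. A minimal sketch, using only a five-emotion subset of Table 2 for brevity:

```python
# Minimal sketch: build the emotion graph of Fig. 4 by connecting emotions
# whose VAD distance is <= 1.02 and reading off connected components.
# Only a 5-emotion subset of Table 2 is used here.
import numpy as np

emotions = ["happy", "sad", "fun", "exciting", "joy"]        # E1, E2, E5, E6, E7
D = np.array([
    [0.0,    4.7024, 1.1539, 1.2791, 0.6569],
    [4.7024, 0.0,    4.3785, 4.1845, 4.2124],
    [1.1539, 4.3785, 0.0,    0.2699, 0.7221],
    [1.2791, 4.1845, 0.2699, 0.0,    0.7304],
    [0.6569, 4.2124, 0.7221, 0.7304, 0.0],
])

threshold = 1.02
adj = (D <= threshold) & ~np.eye(len(emotions), dtype=bool)   # solid-line edges

def connected_components(adj):
    seen, comps = set(), []
    for start in range(adj.shape[0]):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.append(u)
            stack.extend(np.flatnonzero(adj[u]).tolist())
        comps.append(comp)
    return comps

for comp in connected_components(adj):
    print([emotions[i] for i in comp])
# -> one cluster containing happy, joy, fun and exciting, plus the isolated
#    emotion sad, matching the grouping visible in Fig. 4.
```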
Table 4
Accuracy obtained with a 10-fold cross-validation test over the 640 training instances for each individual classifier and emotion for 32 channels EEG.

Emotion | SVM | MLP | KNN | MMC
Terrible | 80.93 | 75.78 | 55.62 | 75.00
Love | 82.18 | 75.62 | 59.06 | 71.87
Hate | 79.84 | 76.71 | 59.06 | 74.84
Sentimental | 82.50 | 77.65 | 60.15 | 77.96
Lovely | 81.56 | 75.62 | 60.00 | 79.06
Happy | 82.81 | 74.53 | 58.90 | 77.03
Fun | 79.84 | 75.78 | 53.12 | 75.15
Shock | 79.68 | 72.03 | 57.18 | 75.15
Cheerful | 80.31 | 58.90 | 55.62 | 73.75
Depressing | 85.46 | 81.25 | 63.28 | 79.21
Exciting | 83.59 | 78.28 | 57.03 | 76.87
Melancholy | 77.65 | 67.18 | 56.40 | 73.90
Mellow | 82.50 | 77.50 | 55.31 | 77.50

Table 5
Accuracy obtained for multimodal fusion of EEG and peripheral (GSR, BVP, EMG & EOG, etc.) signals with a 10-fold cross-validation test.

Emotion | SVM | MLP | KNN | MMC
Terrible | 77.96 | 76.25 | 62.81 | 76.48
Love | 77.96 | 76.87 | 63.20 | 75.39
Hate | 79.45 | 76.25 | 63.67 | 77.18
Sentimental | 78.28 | 78.04 | 62.96 | 75.31
Lovely | 79.14 | 76.25 | 65.00 | 75.78
Happy | 78.43 | 76.40 | 64.14 | 75.54
Fun | 77.96 | 76.64 | 62.81 | 74.14
Shock | 78.20 | 77.26 | 62.26 | 75.31
Cheerful | 80.28 | 78.82 | 59.68 | 75.07
Depressing | 80.15 | 78.35 | 65.39 | 78.04
Exciting | 79.21 | 77.10 | 64.29 | 75.01
Melancholy | 79.14 | 77.65 | 62.03 | 76.40
Mellow | 79.06 | 76.64 | 61.01 | 77.65
The Wavelet Transform, a classical transform for multiresolution analysis, is used to decompose the signals up to five levels in order to obtain the approximation and detail coefficients. As the sampling rate of the physiological signals is 512 Hz, signal decomposition up to five levels is sufficient to capture the information from the signals. The Discrete Wavelet Transform (DWT) is used with the Daubechies 4 family (DB4), as it provides good results for non-stationary signals like EEG (Karray et al., 2008). Prior to classification, all features have been normalized to the range [0, 1] using the Min–Max algorithm. We have performed the experiments for the thirteen emotions for which sufficient data was available in the DEAP database. A 25-dimensional feature vector was generated using the features listed in Table 3.

All experiments are performed with 10-fold cross-validation with random splits of the dataset to increase the recognition rate. In 10-fold cross-validation, the whole dataset is divided into ten subsets. Nine subsets of feature vectors are used for training and the remaining subset for testing. This procedure is repeated ten times with different subset splits. The classification accuracy is obtained as the ratio of the number of correctly classified instances to the total number of instances.

The Waikato Environment for Knowledge Analysis (Weka) machine learning tool was used for classification purposes, as it provides a collection of machine learning algorithms for data classification. From this collection of algorithms, we have selected four, i.e. Support Vector Machine (SVM), Multilayer Perceptron (MLP), K-Nearest Neighbor (K-NN) and Meta-multiclass (MMC), based on best classification accuracy. A 10-fold cross-validation test over 640 training instances has been performed for each selected classifier. The accuracies obtained with the 32-channel EEG data with each classifier for each emotion are shown in Table 4.

Table 4 shows the accuracy rate for different emotions obtained from the four classifiers. The average accuracies are 81.45%, 74.37%, 57.74% and 75.94% for SVM, MLP, KNN and MMC respectively. The best accuracy is for depressing, with 85.46% using SVM. As can be seen from Table 4, SVM outperformed the other classifiers. The SVM was implemented with a polynomial kernel (PolyKernel) and C = 200; the tolerance parameter is 0.001 and epsilon is 1.0E−12. The classification accuracy for MLP is also significant.

Fig. 5 presents the accuracies of the SVM classifier for each emotion when the features obtained from the various frequency bands (theta, alpha, beta and gamma) are taken separately and in a combined (fused) manner. It shows that fusion at the feature level, combining multilevel frequency features, outperforms each single band. Hence, in our proposed method the fusion takes place at two levels: one across different frequency band features and the other across the various channel features. The 32 channels are considered as independent modes and the features from each channel are considered with equal importance. Maybe some of the channel data are correlated, but they may contain complementary information.
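The classification itself was run in Weka; a roughly equivalent sketch in Python is shown below, mirroring the reported SVM settings (polynomial kernel, C = 200, tolerance 0.001) under 10-fold cross-validation. The Weka SMO epsilon parameter has no direct scikit-learn counterpart and is omitted, and random placeholder data stand in for the fused feature vectors.

```python
# Minimal sketch of the classification step: an SVM with a polynomial kernel,
# C = 200 and tol = 1e-3, evaluated with 10-fold cross-validation. The data
# are random placeholders for the fused 25-dimensional feature vectors.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(640, 25))                  # 640 instances, 25 features
y = rng.integers(0, 13, size=640)               # 13 emotion labels

clf = SVC(kernel="poly", C=200, tol=1e-3)
scores = cross_val_score(clf, X, y, cv=10)      # per-fold accuracy
print(scores.mean())
```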
Fig. 5. SVM results for different emotions with each EEG frequency band; 'All' represents all bands combined.
Table 6
Accuracy comparison of various studies.

Appendix A
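Appendix A describes the DEAP database used throughout the paper. As a practical companion, a minimal sketch for reading one participant's recording, assuming the publicly distributed preprocessed Python format (a pickled dictionary per subject with a 'data' array of shape trials × channels × samples and a 'labels' array of valence, arousal, dominance and liking ratings); the file path is a placeholder:

```python
# Minimal sketch: load one participant from the DEAP preprocessed Python
# release. Assumes the documented pickle format with 'data' and 'labels'.
import pickle

with open("data_preprocessed_python/s01.dat", "rb") as f:
    subject = pickle.load(f, encoding="latin1")   # dict with 'data' and 'labels'

data = subject["data"]       # trials x channels x samples (EEG + peripheral)
labels = subject["labels"]   # trials x 4: valence, arousal, dominance, liking

eeg = data[:, :32, :]        # first 32 channels are EEG
peripheral = data[:, 32:, :] # remaining channels are the peripheral signals
print(data.shape, labels.shape)
```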
References

… differences in the judgments of facial expressions of emotion. J. Pers. Soc. Psychol. 53 (4), 712–717.
Fontaine, J., Scherer, K.R., Roesch, E., Ellsworth, P., 2007. The world of emotions is not two-dimensional. Psychol. Sci. 18, 1050–1057.
Gandetto, M., Marchesotti, L., Sciutto, S., Negroni, D., Regazzoni, C.S., 2003. From multi-sensor surveillance towards smart interactive spaces. IEEE Int. Conf. Multimed. Expo. I, 641–644.
Grandjean, D., Sander, D., Scherer, K.R., 2008. Conscious emotional experience emerges as a function of multilevel, appraisal-driven response synchronization. Conscious. Cogn. 17, 484–495.
Grimm, M., Kroschel, K., Narayanan, S., 2008. The Vera am Mittag German audio-visual emotional speech database. Proc. IEEE Int. Conf. Multimed. Expo. 865–868.
Guironnet, M., Pellerin, D., Rombaut, M., 2005. Classification based on low-level feature fusion model. The European Signal Processing Conference, Antalya, Turkey.
Healey, J.A., Picard, R.W., 2005. Detecting stress during real-world driving tasks using physiological sensors. IEEE Trans. Intell. Transp. Syst. 6 (2), 156–166.
Huang, Y.S., Suen, C.Y., 1995. Method of combining multiple experts for the recognition of unconstrained handwritten numerals. IEEE Trans. Pattern Anal. Mach. Intell. 17 (1), 90–94.
James, W., 1950. The Principles of Psychology, vol. 1. Dover Publications.
Juslin, P.N., Scherer, K.R., 2005. Vocal expression of affect. The New Handbook of Methods in Nonverbal Behavior Research. Oxford Univ. Press, 65–135.
Karray, F., Alemzadeh, M., Saleh, J.A., Nours, M., 2008. Human–computer interaction: an overview. Int. J. Smart Sensing Intell. Syst. 1 (1).
Kim, J., Andre, E., 2008. Emotion recognition based on physiological changes in music listening. IEEE Trans. Pattern Anal. Mach. Intell. 30 (12), 2067–2083.
Kittler, J., Hatef, M., Duin, R.P., Matas, J., 1998. On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell. 20 (3), 226–239.
Koelstra, S., Muhl, C., Soleymani, M., Yazdani, A., Lee, J.S., Ebrahimi, T., Pun, T., Nijholt, A., Patras, I., 2012. DEAP: a database for emotion analysis using physiological signals. IEEE Trans. Affect. Comput. 3 (1), 18–31.
Lazarus, R.S., Kanner, A.D., Folkman, S., 1980. Emotions. A cognitive–phenomenological analysis. Theories of Emotion. New York Academic Press, 189–217.
Lewis, P.A., Critchley, H.D., Rotshtein, P., Dolan, R.J., 2007. Neural correlates of processing valence and arousal in affective words. Cereb. Cortex 17 (3), 742–748.
Lutz, C.A., Abu-Lughod, L. (Eds.), 1990. Language and the Politics of Emotion. Studies in Emotion and Social Interaction. Cambridge University Press, New York.
Magalhaes, J., Ruger, S., 2007. Information-theoretic semantic multimedia indexing. Int. Conf. on Image and Video Retrieval, Amsterdam, 619–626.
McKeown, G., Valstar, M.F., Cowie, R., Pantic, M., 2010. The SEMAINE corpus of emotionally coloured character interactions. Proc. IEEE Int. Conf. Multimed. Expo. 1079–1084.
Mena, J.B., Malpica, J., 2003. Color image segmentation using the Dempster–Shafer theory of evidence for the fusion of texture. Int. Arch. Photogram. Rem. Sens. Spatial Inform. Sci. XXXIV (Part 3/W8), 139–144.
Meyer, G.F., Mulligan, J.B., Wuerger, S.M., 2004. Continuous audiovisual digit recognition using N-best decision fusion. J. Inf. Fusion 5, 91–101.
Mihalis, A., Gunes, H., Pantic, M., 2011. Continuous prediction of spontaneous affect from multiple cues and modalities in valence–arousal space. IEEE Trans. Affect. Comput. 2 (2).
Minsky, M., 1975. A framework for representing knowledge. The Psychology of Computer Vision. McGraw-Hill, New York.
Nefian, A.V., Liang, L., Pi, X., Liu, X., Murphy, K., 2002. Dynamic Bayesian networks for audio-visual speech recognition. EURASIP J. Appl. Signal Process. 11, 1–15.
Nesse, R.M., 1990. Evolutionary explanations of emotions. Hum. Nat. 1, 261–289.
Ni, J., Ma, X., Xu, L., Wang, J., 2004. An image recognition method based on multiple BP neural networks fusion. IEEE Int. Conf. Inf. Acquis. 323–326.
Oliveira, A.M., Teixeira, M.P., Fonseca, I.B., Oliveira, M., 2006. Joint model-parameter validation of self-estimates of valence and arousal: probing a differential-weighting model of affective intensity. Proc. 22nd Ann. Meeting Int'l Soc. for Psychophysics, pp. 245–250.
Pan, H., Liang, Z., Anastasio, T., Huang, T., 1999. Exploiting the dependencies in information fusion. Proc. Conf. Comput. Vis. Pattern Recognit. 2, 407–412.
Pantic, M., Valstar, M., Rademaker, R., Maat, L., 2005. Web-based database for facial expression analysis. Proc. IEEE Int. Conf. Multimed. Expo. 317–321.
Parrott, W., 2001. Emotions in Social Psychology. Psychology Press, Philadelphia.
Pereira, C., 2000. Dimensions of emotional meaning in speech. Proc. ISCA Workshop on Speech and Emotion, pp. 25–28.
Pfleger, N., 2004. Context based multimodal fusion. ACM Int. Conf. Multimodal Interfaces, 265–272.
Pitsikalis, V., Katsamanis, A., Papandreou, G., Maragos, P., 2006. Adaptive multimodal fusion by uncertainty compensation. Ninth Int. Conf. on Spoken Language Processing, Pittsburgh.
Plutchik, R., 2008. A psycho-evolutionary theory of emotions. Soc. Sci. Inf. 21 (4–5), 529–553.
Plutchik, R., Conte, H.R., 1997. Circumplex Models of Personality and Emotions. American Psychological Association, Washington, DC.
Poh, N., Bengio, S., 2005. How do correlation and variance of base experts affect fusion in biometric authentication tasks? IEEE Trans. Signal Process. 53, 4384–4396.
Rabiner, L., Juang, B., 1993. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ.
Schaaff, K., 2008. EEG-based Emotion Recognition. Ph.D. thesis, Universitat Karlsruhe.
Schaaff, K., Schultz, T., 2009. Towards emotion recognition from electroencephalography signals. 3rd Int. Conf. on Affective Computing and Intelligent Interaction.
Schachter, S., Singer, J.E., 1962. Cognitive, social and physiological determinants of emotional state. Psychol. Rev. 69, 379–399.
Schuller, B., 2011. Recognizing affect from linguistic information in 3D continuous space. IEEE Trans. Affect. Comput. 4 (4), 192–205.
Sebe, N., Cohen, I., Garg, A., Huang, T.S., 2005. Machine Learning in Computer Vision. Springer, Berlin, NY.
Sebea, N., Cohenb, I., Geversa, T., Huangc, T.S., 2005. Multimodal approaches for emotion recognition: a survey. Proc. of SPIE-IS&T Electronic Imaging, vol. 5670, pp. 56–67.
Singh, R., Vatsa, M., Noore, A., Singh, S.K., 2006. Dempster–Shafer theory based finger print classifier fusion with update rule to minimize training time. IEICE Electron. Express 3 (20), 429–435.
Soleymani, M., Lichtenauer, J., Pun, T., Pantic, M., 2012a. A multimodal database for affect recognition and implicit tagging. IEEE Trans. Affect. Comput. 3 (1).
Soleymani, M., Pantic, M., Pun, T., 2012b. Multi-modal emotion recognition in response to videos. IEEE Trans. Affect. Comput. 99 (2), 211–223.
Stickel, C., Fink, J., Holzinger, A., 2007. Enhancing universal access: EEG based learnability assessment. Lect. Notes Comput. Sci. 4556, 813–822.
Sundberg, J., Patel, S., Bjo, E., Scherer, K.R., 2011. Interdependencies among voice source parameters in emotional speech. IEEE Trans. Affect. Comput. 2 (3).
Tomkins, S.S., 1962. Affect, Imagery, Consciousness. The Positive Affects, vol. 1. Springer, New York.
Town, C., 2007. Multi-sensory and multi-modal fusion for sentient computing. Int. J. Comput. Vis. 71, 235–253.
Turk, M., 2005. Multimodal human–computer interaction. Real-time Vision for Human–Computer Interaction. Springer, Berlin.
Weka machine learning tool. http://www.cs.waikato.ac.nz/ml/weka/downloading.html.
Xie, L., Kennedy, L., Chang, S.F., Divakaran, A., Sun, H., Lin, C.Y., 2005. Layered dynamic mixture model for pattern discovery in asynchronous multi-modal streams. IEEE Int. Conf. Acoust. Speech Signal Process. 2, 1053–1056.
Xu, H., Chua, T.S., 2006. Fusion of AV features and external information sources for event detection in team sports video. ACM Trans. Multimed. Comput. Commun. Appl. 2 (1), 44–67.
Zhu, Q., Yeh, M.C., Cheng, K.T., 2006. Multimodal fusion using learned text concepts for image categorization. ACM Int. Conf. Multimed. 211–220.
Zou, X., Bhanu, B., 2005. Tracking humans using multimodal fusion. IEEE Conf. Comput. Vis. Pattern Recognit. 4.