
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS 1

Multimodal Depression Detection Based on Self-Attention Network With Facial Expression and Pupil

Xiang Liu, Hao Shen, Huiru Li, Yongfeng Tao, Graduate Student Member, IEEE, and Minqiang Yang, Member, IEEE

Abstract—Depression is a major mental health issue in contemporary society, with an estimated 350 million people affected globally, and the number of individuals diagnosed with depression continues to rise each year. Currently, clinical practice relies entirely on self-reporting and clinical assessment, which carries the risk of subjective biases. In this article, we propose a multimodal method based on facial expression and pupil to detect depression more objectively and precisely. Our method first extracts the features of facial expressions and pupil diameter using residual networks and 1-D convolutional neural networks. Second, a cross-modal fusion model based on self-attention networks (CMF-SNs) is proposed, which utilizes cross-modal attention networks within modalities and parallel self-attention networks between different modalities to extract CMF features of facial expressions and pupil diameter, effectively complementing information between different modalities. Finally, the obtained features are fully connected to identify depression. Multiple controlled experiments show that, compared with a single modality, the multimodal fusion method based on self-attention networks has a higher ability to recognize depression, with a highest accuracy of 75.0%. In addition, we conducted comparative experiments under three different stimulation paradigms; the results showed that the classification accuracy under negative and neutral stimuli was higher than that under positive stimuli, indicating a bias of depressed patients toward negative images. The experimental results demonstrate the superiority of our multimodal fusion method.

Index Terms—Facial expression, multimodal fusion, pupil, self-attention network.

I. INTRODUCTION

ACCORDING to the World Health Organization (WHO), depression has become one of the most pressing mental health concerns in modern society [1]. Currently, over 350 million people worldwide, or roughly 5% of the global population, are affected by depression, and in developed countries the percentage of patients with depressive disorders may be as high as 10% [2]. A person's employment, studies, and daily life can all be negatively impacted by depression. It can cause sorrow, irritation, diminished wellbeing, exhaustion, sleeplessness, difficulty concentrating, a sense of hopelessness, worry, and low self-esteem, among other symptoms. In severe cases, depression can even lead to suicidal tendencies [3], [4], [5], [6]. More alarming still is the increased risk of diseases such as cardiovascular disease, cancer, and diabetes as depression worsens [7].

Given the serious consequences of depression, it is crucial to identify and assess the condition as early as possible so that patients receive timely and effective treatment. At present, the diagnosis of depression is based mainly on clinical assessments through interviews and questionnaires. However, these diagnostic techniques rely entirely on the expertise and clinical experience of physicians and are therefore grounded in subjective evaluation [8]. As the number of individuals diagnosed with depression continues to rise, there is an increasing demand for more efficient and accurate diagnostic methods. Depending solely on physicians' subjective diagnosis and subsequent treatment can be limiting and susceptible to inaccuracies. There is an urgent need for an objective and effective method to support the diagnosis of depression, particularly given the shortage of medical professionals available to diagnose and treat this condition. With the rapid advancement of artificial intelligence, machine learning-based depression assessment [9], [10] is useful for accurate and prompt diagnosis, helping ensure that patients obtain timely and effective treatment and enhancing people's sense of wellbeing.

All symptoms of depression can be attributed to mood disorders characterized by depressed mood. Emotions are made up of three components: subjective feelings, physiological arousal, and external manifestations [11].

Manuscript received 21 February 2024; revised 23 April 2024; accepted 21 May 2024. This work was supported in part by the National Natural Science Foundation of China under Grant 62071122, in part by the Outstanding Youth Project of Guangdong Basic and Applied Basic Research Foundation under Grant 2023B1515020064, in part by the Science and Technology Innovation 2030-Major Projects under Grant 2021ZD0202002, in part by the National Key Research and Development Program of China under Grant 2019YFA0706200, in part by the Natural Science Foundation of Gansu Province, China under Grant 22JR5RA488, in part by the Fundamental Research Funds for the Central Universities under Grant lzujbky-2023-16, and in part by the Supercomputing Center of Lanzhou University. (Xiang Liu and Hao Shen are co-first authors.) (Corresponding author: Minqiang Yang.)

Xiang Liu is with the School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523000, China (e-mail: [email protected]).

Hao Shen, Huiru Li, Yongfeng Tao, and Minqiang Yang are with the School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TCSS.2024.3405949

2329-924X © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: INDIAN INST OF INFO TECH AND MANAGEMENT. Downloaded on December 27,2024 at 11:21:36 UTC from IEEE Xplore. Restrictions apply.

The key elements of outward manifestations are changes in voice and expression, which are two crucial signs of mood shifts. Research has indicated that individuals with depression may exhibit lower levels of social behavior than healthy individuals, such as fewer changes in facial expressions [12]. The residual mechanism is beneficial for extracting more informative features in deeper networks [13]; to extract high-level semantic features from facial expressions, we utilize a residual network (ResNext) in our approach.

The manifestations of depression also include eye contact, which can reflect changes in a subject's motor and mental status through eye movements recorded in response to specific stimuli. There is evidence that depressed individuals often look at negative emotional material for long periods of time and pay less attention to positive emotional material [14]. Several studies using eye-movement measures for the diagnostic classification of depression have also been successful [15], [16], [17]. Behavioral information about eye movements is usually measured through fixations, saccades, blinks, and pupil diameter. Eye-movement behaviors are in fact captured in facial video streams. As a physiological feature that effectively indicates depression [18], pupil diameter deserves a fusion study with other modalities. We use a 1-D convolutional neural network (CNN) to extract pupil features.

The choice of method for fusing multiple modalities of information can significantly impact the effectiveness and performance of multimodal depression recognition systems [19]. There are several main methods for multimodal data fusion, including feature-based fusion [20], [21], decision-based fusion [22], and hybrid fusion [23]. Feature-based fusion primarily exploits the complementary nature of different modalities' features and their relationships to improve the accuracy of depression recognition. In contrast, decision-based fusion primarily relies on the integration and analysis of information from each modality to improve decision reliability by fusing features from different levels. The hybrid fusion method combines feature-based and decision-based fusion to further enhance the performance of depression recognition, allowing a more comprehensive and effective integration of multiple modalities for improved recognition accuracy.

In previous studies, we conducted research on depression recognition using single-modality facial expressions [24] and eye movement [25] separately. This study therefore focuses on the complementary nature of the features of the two modalities and proposes a cross-modal fusion (CMF) model based on self-attention networks for depression recognition. Specifically, we first learn the facial expression and pupil modalities, using ResNext to obtain the spatiotemporal features of the video-frame facial expression sequence and a 1-D CNN to obtain the pupil features. The obtained single-modality features are then input simultaneously into the CMF model to obtain more effective features that capture the relationships between the two modalities. At the same time, we perform intermodal feature selection on the two modal features separately using self-attention mechanisms, obtaining features of the two modalities within and between modalities. Finally, we concatenate the features and obtain the depression recognition output. We evaluated our model on a dataset for multimodal depression recognition; the experimental results demonstrate that the proposed method achieves a classification accuracy of 75.0%, indicating its effectiveness for depression recognition research. In summary, we make the following contributions.

1) We propose a novel CMF-SN to extract fusion features from facial expressions and pupil diameter using attention networks, which enhances information integration across modalities.

2) To enable effective feature selection for CMF and accurate depression recognition, we use parallel self-attention networks to preserve the original semantic information within and between modalities.

3) To adequately fuse adjacent feature elements, we utilize the self-attention mechanism and project the different modalities onto separate spaces.

The rest of this article is structured as follows. Section II discusses related works. Section III explains the techniques used for preprocessing and extracting features from facial expressions and pupil diameter, along with the multimodal fusion strategy. Section IV outlines the experimental design and protocols. Section V presents the experimental results. Section VI provides a detailed analysis and discussion of the conclusions drawn from the experiments.

II. RELATED WORKS

Multimodal feature fusion primarily involves two key aspects: first, feature extraction; and second, the interaction between different modalities. Currently, there are two main approaches for extracting facial expression features in depression recognition research, namely geometry-based techniques and appearance-based techniques. Facial landmarks are the essential component of geometry-based face recognition systems, which divide the face image into several areas and then distinguish between expressions by considering the geometric changes in these regions. Examples include the differential geometric fusion network (DGFN) [26] and the deep action unit graph network (DAUGN) [27]. While DAUGN employs facial landmarks and the Voronoi diagram (VD) technique to transform face images into facial graphs, DGFN specifically combines facial landmarks to create distinct characteristics corresponding to action units (AUs) [28]. Geometry-based features, on the other hand, rely significantly on landmark identification techniques and are unable to record face motions that result in landmark displacement. Appearance-based techniques utilize textural information, such as local binary patterns (LBPs) [29], [30], Gabor wavelets [31], and pyramid histogram of gradients-three orthogonal planes (PHOG-TOPs) [32], to capture the variation of facial texture across different expressions. Appearance-based features, however, are affected by background noise and facial organ deformation. Since changes in expression are very subtle, noise in small images may affect the weights of similar expressions and reduce their effectiveness. The feature extraction algorithms discussed above primarily focus on low-level feature extraction, which has been found to be inadequate for accurately recognizing depression. Deep neural


network models can transform lower level features into higher level feature representations through feature selection with different network architectures, which enhances the ability to capture more effective representations of information.

In the field of depression recognition, multimodal fusion methods outperform single modalities because depressive disorder has multifarious significant manifestations [33], [34], [35], [36]. Technically, various types of data can be used for multimodal fusion, such as speech sequences [37], facial expressions [38], text [39], and electroencephalogram (EEG) [40]. On the fusion of pupil-related data and EEG, Zhu et al. [41] used mutual information to measure the relevance between EEG and pupil-area signals, selected EEG electrodes based on mutual information, and then fused the bimodal features using a denoising autoencoder. Zhang et al. [42] integrated eye-movement information into EEG and divided the features into groups according to their respective characteristics; to obtain a fused representation of EEG and eye movement, they used group sparse canonical correlation analysis (GSCCA). On the fusion of visual and other features, Tao et al. [43] proposed a plug-and-play multimodal spatiotemporal fusion attention module (STAT), which captures the global dependencies of the temporal and spatial information of visual and acoustic sequences in a video stream. Ghosh et al. [44] proposed a multimodal MT profile information encoder, utilizing image and text in Twitter tweets to infer users' depression status and emotion. Existing multimodal fusion methods tend to focus on leveraging the complementary nature of information within each modality. While research results have shown the effectiveness of this approach for depression recognition, it does not consider the potential loss of information during the fusion process. These methods only take into account information complementarity, while ignoring the importance of each modality's inherent features for decision-making.

III. METHODS

The recognition of depression based on video and multimodal approaches has received wide attention due to its robustness. However, current multimodal fusion methods mainly focus on the complementarity of different modalities during fusion, while ignoring the importance of the inherent information contained in each modality's features, i.e., the different information they provide when expressing the same object or concept [45], [46]. Failure to consider the complementary information during multimodal fusion may also result in information loss, which can negatively impact the performance of the model for depression recognition. To address this issue, researchers have proposed methods such as attention mechanisms and gate mechanisms, which allow the model to dynamically allocate and select weights between different modalities, retaining important information and suppressing noise, to better address the multimodal problem in facial expressions. In this article, a CMF-SN is proposed for multimodal depression recognition. First, an efficient ResNext-50 (32 × 4d) and a 1-D CNN are used for representation learning of the facial expression and pupil modalities, respectively, to obtain features of the two modalities. Then, the modal fusion features of facial expressions and pupils are extracted through the intramodal self-attention cross-module and a parallel self-attention network between different modalities. The general structure of the model is illustrated in Fig. 1.

A. Preprocessing

1) Facial Image Preprocessing: The preprocessing of face data in this article mainly includes face alignment, cropping, and normalization. The frame rate of the facial video is 30 fps. According to the timestamps corresponding to all trials in the paradigm, the facial expression data of each subject are divided into 40 segments. We first converted the original image to a grayscale image and set the brightness between 0.3 and 1.0. Then, we used the Face_alignment library in the Python toolkit to automatically detect the faces of subjects in each video. We extract regions restricted to faces, removing background distractions such as hairstyles and accessories. Notably, when subjects cover their faces with their hands or something blocks the camera, face detection fails in those frames; in these cases, we simply ignore those frames and interpolate. The effect of frequency on illumination is extremely weak, as shown by [47]. This study utilized histogram equalization [48] to improve the image contrast in video frames. This technique enhances the contrast of image regions with low local contrast and can recover details in underexposed or overexposed frames.

2) Pupil Diameter Preprocessing: Eye trackers typically output gaze coordinates, pupil diameter, and related parameters. The frame rate of our eye tracker is 200 fps, and this article uses only the pupil diameter data. Subjects may blink frequently or close their eyes for a long time during the experiment, which causes the eye tracker to fail to capture the pupils, resulting in blank or invalid values. This article directly discards the missing and invalid values [49] to prevent their influence on the data analysis; pupil diameter samples with confidence scores below 0.5 are removed. Additionally, the pupil diameter is normalized with the min-max method [50] to remove the impact of scale differences between characteristics, which makes subsequent data processing easier and convergence faster. As with the facial expression data, we divided the pupil diameter data into 40 segments.

B. Facial Expression Features

Considering network performance and training efficiency, we used the ResNext-50 (32 × 4d) network to extract facial expression features. One typical approach to enhancing a model's recognition accuracy is to enlarge the depth or width of the network. However, increasing the hyperparameters of the network is not a simple way to improve performance, as it requires a balance between the model's performance and its computational cost. The ResNext-50 network integrates the ResNet stacking approach with the grouped convolution technique of the Inception network to create a more effective and robust neural network model. ResNext-50 differs from ResNet in that it replaces the three-layer convolution residual block with residual blocks of identical topology, which are then stacked in parallel to enhance its performance.


Fig. 1. Multimodal fusion flow diagram of the self-attention network. The dark blue features are facial expression features extracted by ResNext, and the green features are pupil diameter features extracted by the 1-D CNN. The upper and lower right sections represent the single-modality self-attention network models; the blue part in the middle represents the multimodal fusion within and between modalities.
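The parameter economy of the grouped residual blocks just described can be checked with a short calculation. This is an illustrative sketch, not the authors' code: the 128-channel 3 × 3 bottleneck stage and the 32 groups below are assumed values chosen to mirror the ResNext-50 (32 × 4d) naming, and biases are ignored for clarity.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution split into `groups` branches.

    Each branch sees c_in/groups input channels and produces
    c_out/groups output channels, so the weight count shrinks
    by a factor of `groups` compared with a dense convolution.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * (c_out // groups) * k * k * groups

# 3x3 stage of a bottleneck block with 128 internal channels (assumed):
dense   = conv_params(128, 128, 3)             # ordinary convolution
grouped = conv_params(128, 128, 3, groups=32)  # ResNext-style, 32 groups

print(dense, grouped)  # the grouped version uses 1/32 of the weights
```

This is why the text can claim wider blocks without a matching growth in parameters: widening via parallel groups is far cheaper than widening a dense convolution.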

The ResNext-50 network offers several advantages over the ResNet network. First, it can enhance accuracy without raising parameter complexity by leveraging its network structure: it increases the network's width rather than its depth to improve the model's expressive ability and generalization performance. This approach not only improves the model's accuracy but also reduces the risk of overfitting. The second benefit is that ResNext-50 decreases the number of hyperparameters required, because it uses grouped convolution, which decomposes a convolution operation into multiple small convolution operations, reducing the parameter volume and computational cost of each convolution layer. This technique improves the computational efficiency and speed of the model while reducing the number of hyperparameters necessary. The facial expression extraction network consists of a ResNext (32 × 4d) model [51] without structural modification, and we retrained it from pretrained model weights because of the limited size of our collected dataset.

C. Pupil Features

CNN is a deep neural network structure based on convolutional operations, which can learn representations of input information through hierarchical structures and achieve translation invariance. Generally speaking, a 1-D CNN can capture local patterns and features of sequence data by sliding convolutional kernels along the time axis, a 2-D CNN can perform convolutional operations on 2-D images to capture local features and spatial structure information, and a 3-D CNN can perform convolutional operations on 3-D images and videos to capture local features and spatiotemporal structure information. Hence, choosing a suitable CNN model for each kind of data is crucial, as it can enhance the model's efficiency and effectiveness. The pupil feature extraction technique employed in this article is based on a 1-D CNN, whose structure comprises two convolutional layers, two pooling layers, and one fully connected layer. Just like a 2-D CNN, a 1-D CNN operates on the input data through convolutional and pooling layers and performs classification with the fully connected layer at the end. Meanwhile, it avoids the significant loss of feature information that occurs when a 2-D CNN compresses the dimensions of the input data.


Fig. 2. Expression self-attention network structure.
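The pupil branch of Section III-C (two convolutional layers, two pooling layers, and a flattened output for the fully connected layer) can be sketched in a few lines of numpy. This is a hedged illustration, not the authors' implementation: the channel counts (8 and 16), the random weights, and the 200-sample segment length are assumptions; kernel size 3, stride 2, no padding, ReLU, and max pooling with kernel 2 follow the text.

```python
import numpy as np

def conv1d(x, w, b, stride=2):
    """Valid (no padding) 1-D convolution: x is (C_in, T), w is (C_out, C_in, K)."""
    c_out, c_in, k = w.shape
    t_out = (x.shape[1] - k) // stride + 1
    y = np.zeros((c_out, t_out))
    for t in range(t_out):
        patch = x[:, t * stride : t * stride + k]                 # (C_in, K)
        y[:, t] = np.tensordot(w, patch, axes=([1, 2], [0, 1])) + b
    return y

def maxpool1d(x, k=2):
    t_out = x.shape[1] // k
    return x[:, : t_out * k].reshape(x.shape[0], t_out, k).max(axis=2)

relu = lambda x: np.maximum(0, x)

# Toy pupil-diameter segment: 1 channel, 200 samples (1 s at 200 fps).
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 200))
w1, b1 = rng.standard_normal((8, 1, 3)) * 0.1, np.zeros(8)
w2, b2 = rng.standard_normal((16, 8, 3)) * 0.1, np.zeros(16)

h = maxpool1d(relu(conv1d(x, w1, b1)))   # conv -> (8, 99), pool -> (8, 49)
h = maxpool1d(relu(conv1d(h, w2, b2)))   # conv -> (16, 24), pool -> (16, 12)
feat = h.reshape(-1)                     # flattened, fed to the FC layer
print(feat.shape)
```

The shapes trace the text's output-length rule for valid convolution, (T − K) // stride + 1, followed by halving from the kernel-2 max pooling.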

We chose a kernel size of 3, a stride of 2, and no padding; ReLU is the activation function, and 1-D max pooling with a kernel size of 2 is used as the pooling operation.

D. Multimodal Fusion

1) Self-Attention Module for Facial Expressions: The self-attention module, in parallel with the cross-attention module, is designed to capture how different modalities, such as facial expressions and pupil, interact with each other. The self-attention mechanism operates in a similar way to the cross-attention mechanism, but instead of drawing query, key, and value information from different modalities, it draws them from the same modality, as shown in Fig. 2. The complete process can be summarized as follows: using a multihead self-attention model, the sequence S = [s_1, s_2, ..., s_n] is used as input, as shown in the following equations:

    MultiHead(S, S, S) = Concat(head_1, ..., head_i) W^o    (1)

    head_i = Attention(S W_i^Q, S W_i^K, S W_i^V).    (2)

The self-attention module applies linear transformations as its initial step. S is multiplied by the trainable parameter matrices W_i^Q, W_i^K, and W_i^V to obtain Q, K, and V, respectively. These are then fed into the scaled dot-product attention module, which is computed i times, as shown in (3). Finally, the multihead attention values are obtained by concatenating the i scaled dot-product attention results and linearly transforming them. By using these linear transformations, the self-attention module can effectively identify and learn important information in distinct subspaces, which is one of the main benefits of this approach. The scaled dot-product attention is computed as follows:

    Attention(S W_i^Q, S W_i^K, S W_i^V) = softmax( (S W_i^Q)(S W_i^K)^T / √d_k ) S W_i^V.    (3)

Here, d_k is the last dimension of W_i^Q, W_i^K, and W_i^V; the dot product is divided by √d_k because an excessively large value of QK^T can drive the gradient of the softmax function toward zero. Then, a LayerNorm layer is used to normalize the attention values of the network, improving the stability and convergence speed of the network during training. In the multilayer network architecture, the following residual form is commonly used:

    S_l = LayerNorm(S + MultiHead(S, S, S)).    (4)

LayerNorm here refers to a normalization technique applied to each layer of the model. Normalizing the input data by adjusting its mean and variance to a consistent level yields standardized features and thus faster convergence during training. The output S_l, after the self-attention module and LayerNorm, is fed into a feedforward layer comprising two fully connected layers. As shown in (5), the first fully connected layer applies the ReLU activation function, whereas the second applies no activation function:

    S_f = max(0, S_l W_1 + b_1) W_2 + b_2.    (5)

Similarly, the output S_E is obtained by normalizing with the LayerNorm layer:

    S_E = LayerNorm(S_l + S_f).    (6)

2) Self-Attention Module for Pupils: The self-attention module for pupils is similar to the facial expression module. Using a multihead self-attention model, the sequence T = [t_1, t_2, ..., t_n] is used as input:

    MultiHead(T, T, T) = Concat(head_1, ..., head_j) W^o    (7)

    head_j = Attention(T W_j^Q, T W_j^K, T W_j^V).    (8)

The attention values are again normalized with a LayerNorm layer:

    T_l = LayerNorm(T + MultiHead(T, T, T)).    (9)

The output T_l is then fed into a feedforward layer that contains two fully connected layers; the first utilizes the ReLU activation function, and the second applies no activation function:

    T_f = max(0, T_l W_3 + b_3) W_4 + b_4.    (10)

Normalization is applied once more with the LayerNorm layer to obtain the output T_E:

    T_E = LayerNorm(T_l + T_f).    (11)
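The self-attention pipeline of (1)-(11) can be made concrete with a minimal numpy sketch. This is not the authors' implementation: the sequence length, model width, head count, and random weight matrices below are placeholder assumptions; the computation mirrors the scaled dot-product attention, residual LayerNorm, and two-layer feedforward steps in the equations.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q, k, v):
    d_k = q.shape[-1]                       # eq. (3): scaled dot-product
    return softmax(q @ k.T / np.sqrt(d_k)) @ v

def multi_head(s, wq, wk, wv, wo):
    # eqs. (1)-(2): one (S W_i^Q, S W_i^K, S W_i^V) triple per head
    heads = [attention(s @ q, s @ k, s @ v) for q, k, v in zip(wq, wk, wv)]
    return np.concatenate(heads, axis=-1) @ wo

rng = np.random.default_rng(0)
n, d, h = 10, 16, 4                         # toy sizes (assumed)
S = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((h, d, d // h)) * 0.1 for _ in range(3))
wo = rng.standard_normal((d, d)) * 0.1
w1, b1 = rng.standard_normal((d, d)) * 0.1, np.zeros(d)
w2, b2 = rng.standard_normal((d, d)) * 0.1, np.zeros(d)

Sl = layer_norm(S + multi_head(S, wq, wk, wv, wo))   # eq. (4)
Sf = np.maximum(0, Sl @ w1 + b1) @ w2 + b2           # eq. (5)
SE = layer_norm(Sl + Sf)                             # eq. (6)
print(SE.shape)
```

The pupil branch, (7)-(11), is the same computation applied to the pupil sequence T with its own weight matrices.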


Fig. 3. Self-attentional structure used for expressions and pupils.
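The cross-modal attention of Fig. 3 differs from the self-attention above only in where the query and key/value sequences come from. A hedged single-head sketch (toy shapes and random weights are assumptions; the multihead version repeats this per head as in the self-attention case):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(s, t, wq, wk, wv):
    """Query from one modality (s), key/value from the other (t);
    single head for brevity, following the pattern of the paper's
    cross-attention equations."""
    q, k, v = s @ wq, t @ wk, t @ wv
    d_k = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k)) @ v

rng = np.random.default_rng(1)
d = 16
S = rng.standard_normal((10, d))   # pupil sequence (toy length 10)
T = rng.standard_normal((12, d))   # facial-expression sequence (toy length 12)
wq, wk, wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

H = cross_attention(S, T, wq, wk, wv)  # attends from pupils to expressions
print(H.shape)                         # one row per pupil-sequence element
```

Note that the output length follows the query sequence, while the keys and values contribute the other modality's content; swapping the roles of S and T gives the mirrored direction of fusion.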

3) Cross-Attention Module: By encoding the two modali- In the equation involving SWhQ ∗ T WhK , dk represents the
ties, we obtain high-level semantic features of facial expressions final dimension of S and T . To avoid potential issues with
and eye movements. To make more accurate final decisions, we excessively large values in the multiplication of SWhQ ∗ T WhK
use complementary intramodality information fusion between caused by a large dk , it is divided by the square root of dk .
the two modalities. Specifically, we first use the self-attention This helps to prevent the gradient of the softmax function from
mechanism to perform intramodality representation learning on becoming too small. After calculating the attention values in
the facial expression modality. The utilization of a feedforward the model, a LayerNorm layer is applied to normalize these
layer featuring ReLU activation function on the self-attention values. Introducing this normalization step enhances the stabil-
module output empowers the model to acquire and adjust to the crucial high-level features of facial expressions. As a result, the model becomes more attentive to the features that carry greater weight in determining the final output result. By utilizing the self-attention mechanism, the model can effectively incorporate the impact of adjacent feature elements, while its query, key, and value components represent the modalities projected onto separate spaces, as shown in Fig. 3.

To establish a connection between facial expressions and pupils, this article optimized the multihead self-attention network by feeding data from both modalities concurrently, as opposed to the approach depicted in Fig. 2. The computation is given as:

MultiHead(S, T, T) = Concat(head1, ..., headh)W^o (12)

headh = Attention(SWhQ, TWhK, TWhV). (13)

Here, S denotes the input sequence of pupil data, while T represents the input sequence of facial expression data. Within the multihead self-attention network, S is multiplied by the trainable weight matrix WhQ for the linear transformation of attention Q, while T is multiplied by the trainable weight matrices WhK and WhV for the linear transformations of attention K and V, respectively. In this context, h refers to the number of heads contained within the multihead attention network. The incorporation of multihead self-attention allows the model to concentrate on multiple crucial regions in parallel. The attention module is calculated with the scaled dot product:

Attention(SWhQ, TWhK, TWhV) = softmax(SWhQ (TWhK)^T / √dk) TWhV. (14)

The attention output is then normalized by a LayerNorm layer, which enhances stability and speed of convergence for the network while undergoing training. In the multilayer network architecture, the following equation is shown as:

Hl = LayerNorm(S + MultiHead(S, T, T)). (15)

In this context, S refers to the input sequence of pupils. LayerNorm is a layer normalization technique that normalizes each feature of the input data using learnable parameters, including scaling and shifting factors, as well as statistical information such as the mean and standard deviation. This resolves the issue of gradient vanishing while also enhancing generalization ability. Following its passage through the attention mechanism and LayerNorm, the output Hl proceeds to a feedforward layer containing two fully connected layers. According to the design of the network, the initial layer in the feedforward layer implements the ReLU activation function, while the subsequent layer does not use any activation function:

Hm = max(0, Hl W5 + b5)W6 + b6. (16)

Similarly, normalization is used again with the LayerNorm layer, where ST is the facial expression feature that is more closely related to the pupil feature obtained through the attention module, as shown in

ST = LayerNorm(Hl + Hm). (17)

Subsequently, we leverage multihead attention to identify the shared characteristics between facial expressions and pupils, with the calculation (18) outlined as follows:

MultiHead(T, S, S) = Concat(head1, ..., headl)W^o (18)

headl = Attention(TWlQ, SWlK, SWlV). (19)

In (18), T represents the input sequence of facial expressions, and S represents the input sequence of pupils. In the multihead
and S represents the input sequence of pupils. In the multihead

Authorized licensed use limited to: INDIAN INST OF INFO TECH AND MANAGEMENT. Downloaded on December 27,2024 at 11:21:36 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

LIU et al.: MULTIMODAL DEPRESSION DETECTION BASED ON SELF-ATTENTION NETWORK 7

self-attention network, T is multiplied by the trainable weight matrix WlQ for the linear transformation of attention Q, while S is multiplied by the trainable weight matrices WlK and WlV for the linear transformations of attention K and V, respectively. The attention module is calculated with the scaled dot product, as shown in the following equation:

Attention(TWlQ, SWlK, SWlV) = softmax(TWlQ (SWlK)^T / √dk) SWlV. (20)

Here, dk is the last dimension of the shapes of S and T; the dot product is divided by √dk to prevent it from becoming too large when dk is large, which would cause the softmax function's gradient to become too small. Subsequently, the attention values produced by the network undergo normalization through the LayerNorm layer, which serves to enhance the network's stability and speed of convergence throughout the training process. The formula utilized within the multilayer network architecture is as follows:

Ha = LayerNorm(T + MultiHead(T, S, S)). (21)

Here, T is the input sequence of facial expressions, and LayerNorm is the layer normalization. Then, Ha is inputted into a feedforward layer

Hb = max(0, Ha W7 + b7)W8 + b8. (22)

Then, normalization is used again with the LayerNorm layer, where TS is the pupil feature that is more closely related to the facial expression feature obtained through the attention module:

TS = LayerNorm(Ha + Hb).

The following equation represents the computation obtained by integrating the features of all four modalities:

K = Concat(SE, TE, ST, TS). (23)

The output multimodal feature K is processed further by a max pooling layer, which transforms it into a vector. The next step involves the application of a fully connected layer, along with the ReLU activation function, to subject the vector to a nonlinear transformation. Last, classification is performed utilizing a softmax layer.

IV. EXPERIMENTS

A. Experimental Paradigm

When exposed to the same emotional stimuli, depressive subjects and healthy controls display different behavioral patterns. Furthermore, previous research has found that individuals with depression tend to exhibit a bias toward negative information when exposed to different stimulus paradigms [52]. Therefore, in this study, different stimulus paradigms were designed to induce emotional changes in participants based on the behavioral differences observed between individuals with depression and healthy individuals. To this end, we created a stimulus task capable of evoking different expressive emotions, using a traditional psychological experimental paradigm as a basis. During the assessment, participants' emotional states were evoked by freely watching videos. During the study, emotional pictures of three different states (positive, neutral, and negative) were used to induce changes in viewer expressions and pupil behavior.

The emotional images were sourced from the international affective picture system (IAPS), which was developed by the American Center for Emotion and Attention at the National Institute of Mental Health (NIMH) [53]. IAPS contains a wide variety of content covering a very wide range, and different categories of pictures can induce people's emotional expressions in different states. It is used in different fields, including psychology, neurophysiology, and brain cognitive science; most importantly, it has played a huge role in the study of emotion and attention [54], [55], [56], where multidimensional scores can help researchers design targeted experimental paradigms [57]. From the IAPS emotional picture database, we selected 40 pictures of different types, including 10 positive stimulus types (valence: 5.03 ± 1.15, arousal: 2.91 ± 1.97), 20 neutral stimulus types (valence: 7.43 ± 1.48, arousal: 4.33 ± 2.27), and 10 negative stimulus types (valence: 2.95 ± 1.62, arousal: 5.35 ± 2.24).

Fig. 4. Paradigm process.

The whole experimental process is mainly composed of two parts, and each part is composed of four types of experiments. Each experiment can be further divided into two stages: one for the refocusing of the fixation point, and the other for the playback of the stimulus pictures. We first play the sequence of stimulus pictures in the order of neutral, positive, and negative, then loop once with the same playback standard and order. Considering the possibility that the emotional effect of the previous group will interfere with the emotional expression of the next group, each time before the next group of pictures is played, there is a five-second pause to buffer any possible continuing reaction. The process of experimental video playback is depicted in Fig. 4. The number of pictures displayed in each group of small experiments is 5, and the playback time of each picture is 5 s. The total video duration of the entire experiment is 245 s. Table I shows the order of picture types in the eight groups of experiments in the video paradigm. The same stimulus paradigm was also used in the work of Yang et al. [58].

B. Equipment Setup

As described in the previous section, depressed patients presented different emotional states than healthy controls under different stimulation paradigms. This requires that different cameras be used to record the changes in facial


TABLE I
STIMULUS PARADIGM [58]

Block 1 (continuous stimulus; each fixation cross "+" and each picture is shown for 5 s):
  Focus "+"
  Trials 1–5 (Neutral): Room, Basket, Outlet, Clock, Clothespins
  Focus "+"
  Trials 6–10 (Positive): Puppies, Family, Seal, Butterfly, Athletes
  Focus "+"
  Trials 11–15 (Neutral): Chair, Coffee cup, Key ring, Book, Abstract art
  Focus "+"
  Trials 16–20 (Negative): Mutilation, Toilet, Burn, Shadow, Hand
  Focus "+"

Block 2 (each picture is shown for 5 s; a fixation cross "+" follows each group):
  Trials 21–25 (Neutral): Plate, Abstract art, Bowl, Clock, Tool
  Focus "+"
  Trials 26–30 (Positive): Giraffes, Bunnies, Women, Mother, Adult
  Focus "+"
  Trials 31–35 (Neutral): Cabinet, Spoon, Light bulb, Mug, Mug
  Focus "+"
  Trials 36–40 (Negative): Snake, Victim, Snakes, Baby, Sick kitty
  Focus "+"
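As a quick sanity check on Table I's timing — every picture and every fixation cross is displayed for 5 s — the stated 245 s total video duration can be recomputed:

```python
# Block 1 shows 20 pictures plus 5 fixation crosses ("+");
# Block 2 shows 20 pictures plus 4 fixation crosses.
# Every event lasts 5 seconds.
EVENT_SECONDS = 5
block1_events = 20 + 5
block2_events = 20 + 4
total_seconds = (block1_events + block2_events) * EVENT_SECONDS
print(total_seconds)  # 245, matching the stated total video duration
```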

Fig. 5. Eye tracker used in this study.

expressions and eye-movement data of each subject simultaneously while watching the video during the experiment. The most popular type of camera for gathering facial expression data is an optical camera; however, professional eye-tracking equipment with high accuracy is needed to record eye movements. To use a desktop eye tracker, the subjects were required to place their heads on a chin rest. Although it effectively recorded the subjects' eye-movement information, it affected their facial behavior. The wearable eye-tracking devices on the market also have some problems, such as the occlusion of key information in the face area. The eye tracker used in this article is a redesigned eye-tracking device [59], which has the advantage of effectively avoiding occlusion of the key facial expression regions, as shown in Fig. 5; the eye-tracking algorithm and software are ported from [60].

To collect facial expressions and pupil data synchronously, the capture processes and the paradigm display process are synchronized by a real-time signal, which is implemented on a hard real-time operating system. We designed the acquisition scenario as shown in Fig. 6. The parameter settings used are as follows: the eye-tracking frame rate is 200 fps at a resolution of 320 × 200P, and the facial expression videos are captured by a Logitech C1000 working at a frame rate of 30 fps and a resolution of 1920 × 1080P.

Fig. 6. Proposed experimental setup.

C. Participants Selection

Before the start of the experiment, all subjects were made aware of the experimental procedure and signed a written informed consent [24]. The biomedical research consent form was approved by the local ethics committee in accordance with the World Medical Association's Code of Ethics (Declaration of Helsinki), allowing the research to be conducted at the Third People's Hospital of Guangyuan. All depressed patients among our subjects participated in a structured Mini International Neuropsychiatric Interview (M.I.N.I.) to judge whether they met the DSM-IV criteria for depression, and the results showed that the patients involved met this criterion. For this study, all participants were aged between 18 and 55 years [61] and possessed at least an elementary school level of education. Each participant was screened for various exclusion criteria, including organic brain disease, history of epilepsy or cranial trauma, history of drug or substance abuse in the past six months, and concomitant serious physical illness or high risk of suicide. A psychiatrist scored each participant by interview and questionnaire. Among the questionnaires administered during the study were the Patient Health Questionnaire (PHQ-9) and the


TABLE II
RESULTS OF USING DIFFERENT CLASSIFIERS FOR MULTIMODAL FUSION FEATURE
UNDER DIFFERENT STIMULUS MODES

SVM KNN RF DTREE


Positive Accuracy 0.721 0.734 0.711 0.706
Precision 0.691 0.722 0.711 0.645
Recall 0.713 0.710 0.625 0.783
F1 score 0.708 0.710 0.677 0.744
Neutral Accuracy 0.730 0.713 0.732 0.720
Precision 0.776 0.667 0.764 0.752
Recall 0.723 0.751 0.754 0.780
F1 score 0.772 0.712 0.702 0.741
Negative Accuracy 0.726 0.738 0.722 0.720
Precision 0.781 0.731 0.836 0.674
Recall 0.702 0.629 0.695 0.750
F1 score 0.743 0.710 0.761 0.712
Note: The bold entries represent the highest metrics of different classifiers among three stimulus modes.
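The four metrics reported in Table II (and in the later tables) are the standard binary classification scores. A small illustration with hypothetical labels (1 = depressed, 0 = healthy control), assuming scikit-learn is available:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions for 10 subjects.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

acc = accuracy_score(y_true, y_pred)    # (TP + TN) / N   = 7/10
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)  = 3/5
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)  = 3/4
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```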

International Neuropsychiatric Interview (MINI). Following the PHQ-9 criteria, participants were classified as healthy controls (PHQ-9 < 5) or patients (PHQ-9 ≥ 5), and this score was used as the label.

From the dataset, some invalid data were removed; for example, during the acquisition process, some faces disappeared from the video for a long time, and frequent body movements covered the faces of some subjects. We used 57 valid recordings, including 25 depression patients (7 males and 18 females; age range 18–55 years) and 32 healthy controls (2 males and 32 females; age range 18–55 years). We divided the dataset into training and test sets at a ratio of 8:2, without a separate validation set. Meanwhile, we fixed the gender ratio of the training and test sets each time the dataset was divided. Finally, all metrics were averaged over multiple experimental runs.

D. Experiment Process

All subjects were individually invited to the laboratory, read written descriptions of the experimental objectives and procedures, and gave their informed consent. During the experiment, participants were situated within a soundproof and windowless room, seated about 50–70 cm away from a computer screen. Ceiling lighting produced stable illumination conditions. An eye-tracking device was used to record changes in the subjects' pupils and a desktop camera to record changes in the subjects' expressions as they viewed the film. Before the experiment began, the researchers informed the subjects through the computer screen that each stimulus image starts from a fixation point (a white cross on a black background) until the fixation point is concentrated on. If the fixation-point deviation was too large, the calibration procedure was repeated until calibration succeeded and the experiment could start. Subjects watched the displayed pictures naturally; pictures of all stimulus types were equally likely to appear in each instance, only once at a time, and the experiment lasted approximately 4 min. Participants were advised to limit any movements of their body or head to the greatest extent possible.

E. Experiment Details

In the course of the experiment, the expression images and pupil images used are grayscale images. To ensure that the extracted expression and the pupil diameter change belong to the same frame, we chose images from the same time frame, ensuring the consistency of expression and pupil. The relevant parameters in the learning process are obtained through random optimization. SVM, RF, KNN, and DTree algorithms are used for classification modeling, and the trials of negative, positive, and neutral emotional stimuli are investigated, respectively. In the facial expression feature extraction, the sliding window value is set to 12. The SVM algorithm uses a radial basis function (RBF) kernel. The parameter K in the KNN algorithm is set to 3. The parameter criterion in the RF algorithm is set to gini. The parameter n_estimators in the DTree algorithm is set to 12.

TABLE III
COMPARISON BETWEEN EXISTING FEATURE EXTRACTION METHODS AND OUR METHODS

Method               Accuracy  Precision  Recall  F1 Score
Facial Expression
  LBP                0.564     0.560      0.667   0.610
  HOG                0.566     0.818      0.600   0.692
  LBP-TOP            0.610     0.571      0.502   0.530
  ResNext            0.615     0.621      0.802   0.706
Pupil
  T-frequency domain 0.673     0.670      0.405   0.509
  1-D CNN            0.723     0.732      0.801   0.766

Note: The bold entries represent the best feature extraction methods under accuracy, precision, recall, and F1 score.

TABLE IV
COMPARISON OF DIFFERENT FUSION MODELS

Method   Accuracy  Precision  Recall  F1 Score
CCA      0.724     0.722      0.730   0.725
SCCA     0.713     0.701      0.711   0.707
TCCA     0.736     0.752      0.703   0.724
CMF-SN   0.750     0.756      0.800   0.778

Note: The bold entries represent the best fusion models under accuracy, precision, recall, and F1 score.

V. RESULTS

The algorithm was applied to both facial expression and pupil data, and a series of experiments were conducted. Deep


TABLE V
COMPARISON OF CMF-SN ABLATION EXPERIMENTS

SVM KNN RF DTREE


ResNext Accuracy 0.615 0.674 0.667 0.611
Precision 0.621 0.724 0.751 0.673
Recall 0.802 0.704 0.605 0.702
F1 score 0.706 0.655 0.674 0.641
1-D CNN Accuracy 0.723 0.721 0.720 0.716
Precision 0.732 0.695 0.781 0.781
Recall 0.801 0.812 0.703 0.703
F1 score 0.766 0.780 0.745 0.745
Res+Att Accuracy 0.674 0.705 0.679 0.691
Precision 0.649 0.625 0.752 0.602
Recall 0.702 0.805 0.607 0.823
F1 score 0.677 0.701 0.674 0.721
1-D CNN+Att Accuracy 0.720 0.723 0.720 0.719
Precision 0.718 0.703 0.679 0.741
Recall 0.810 0.812 0.800 0.705
F1 score 0.760 0.754 0.731 0.722
Res+Att+1-D CNN+Att Accuracy 0.723 0.716 0.720 0.721
Precision 0.742 0.685 0.690 0.740
Recall 0.831 0.750 0.792 0.800
F1 score 0.783 0.722 0.737 0.790
CMF-SN Accuracy 0.750 0.730 0.725 0.730
Precision 0.756 0.720 0.742 0.739
Recall 0.800 0.702 0.780 0.752
F1 score 0.778 0.711 0.761 0.745
Note: The bold entries mean that our method has achieved the best performance in accuracy.

Fig. 7. Comparison of ablation results.

information was extracted from both modalities, and CMF was performed. To ensure that the original semantic information was not lost during the interaction within and between the modalities, this study also used parallel self-attention networks for feature selection of both facial expression and pupil features during CMF, thus obtaining features within and between the modalities. Finally, the obtained features were fully connected for effective depression recognition. Several commonly used classification models in Python, including SVM, KNN, RF, and DTREE, were selected for depression classification. The experimental results, as shown in Table II, demonstrated that the classification accuracy reached over 70% for all three stimulus states, using SVM as an example. In addition, the effective combination of ResNexts and self-attention networks helps to ensure the integrity of information interaction between different modalities. Therefore, this study further evaluated the performance of this method on a private dataset, and the experimental results proved its effectiveness, achieving a classification accuracy rate of 75.0%. Furthermore, previous research [52] has shown that depressed patients may have different reactions to different stimuli and exhibit more attentional biases toward negative stimuli. Therefore, we further analyzed and validated the study on this basis, and the results showed that the classification accuracy rates for negative and neutral stimuli were 72.6% and 73.0%, respectively, while the classification accuracy rate for positive stimuli was 72.1%.

In the single-modal experiments, as shown in Table III, we compared our deep models with previous studies using traditional methods. In the facial expression modality, we used traditional methods such as LBP, HOG, and LBP-TOP for the experiments in Table III, while in the pupil modality, we used time-frequency domain features, and then used an SVM classification model for classification. The tabular data indicate that traditional learning models had a relatively low classification accuracy for depression recognition utilizing exclusively facial expressions or pupil features, compared to the higher accuracy achieved through the deep learning-based facial expression and pupil feature extraction techniques. We took SVM as an example and examined the variance and bias of the classification model by obtaining the learning curve. To validate the classification performance of our proposed approach, we compared it with existing feature fusion techniques on our collected dataset. The comparison results are presented in Table IV. Canonical correlation analysis (CCA) [62] is a statistical technique for analyzing multiple variables or factors, which can be utilized to investigate the correlation between two datasets. It projects the two datasets into new low-dimensional feature spaces and maximizes their correlation. Kernel canonical correlation analysis (KCCA) [63] is a commonly used multivariate statistical analysis method that maps data to high-dimensional feature spaces through kernel functions and calculates the correlation between the two datasets in the feature space. Supervised canonical correlation analysis (SCCA) [64] can find enhanced correlation subspaces for two observation spaces by adding class label information, in which the mapping of the same pattern to the observation has the highest correlation, and the enhanced correlation subspaces have stronger pattern recognition and semantic discrimination ability. The results showed that the model achieved a relatively good accuracy in depression identification, further demonstrating the effectiveness of the fusion results obtained by the proposed method.

To gain a deeper understanding of the impact of self-attention on the ultimate recognition performance of the two components within CMF-SN, we separated them and carried out comparative experiments independently. The experimental outcomes are presented in Table V. Res+Att denotes the use of ResNext-50 (32 × 4d) for feature extraction of facial expressions, followed by feature selection using a self-attention network. Similarly, 1-D CNN+Att refers to the use of a 1-D CNN for feature extraction of pupils, followed by feature selection using a self-attention network. Res+Att+1-D CNN+Att refers to the parallel fusion of features from both modalities using a self-attention network, without including fusion features within modalities. CMF-SN is the result of considering both intramodality and intermodality fusion. It can be seen that using only features extracted by a single-modality deep network for depression recognition results in a relatively lower recognition accuracy compared to using self-attention network feature selection. Both facial expressions and pupils experience a certain degree of reduction in accuracy across different classifiers. However, compared to traditional single-modality methods for depression recognition, the proposed approach still demonstrated a considerable improvement in recognition accuracy.

The experimental findings depicted in Fig. 7 provide evidence that the proposed CMF-SN feature fusion approach utilizing self-attention networks can effectively amalgamate information from various modalities. The final recognition performance is improved compared to the results of individual modalities. Based on the conclusive experimental results, it is apparent that the recognition performance obtained solely through self-attention network fusion between modalities is inferior to the performance achieved through utilizing both intramodality and intermodality fusion. This is because the optimization of the multihead self-attention network in this article learns the similarity between facial expression and pupil features, obtaining pupil features that are more closely related to facial expression features, and facial expression features that are more closely related to pupil features, achieving complementarity of information. This can better capture some key information that was missed in the fusion between modalities, indicating that the proposed method can more comprehensively model the fusion between multiple modalities and accurately fuse the information expressed by different inputs, leading to better prediction of depression. Furthermore, this further validates the superiority of the feature fusion algorithm proposed in this article.

VI. CONCLUSION

The main focus of this study was to propose a CMF model for depression identification based on self-attention networks. This method considered the complementarity within and between different modalities, used cross-modal blocks to fuse the features within each modality, and then utilized the complementarity of information between modalities for multimodal fusion, achieving feature fusion. Based on the experimental results, the proposed method exhibited effectiveness and superiority in depression recognition research. However, a certain number of samples still have not been correctly classified, and most of them were males. We speculate that two factors cause this phenomenon. First, there was an imbalanced gender distribution in our collected dataset. Besides, the size of our dataset was limited, which may lead to insufficient model training. Therefore, we are continuously collecting new data and may alleviate these problems in future work. The results also suggested that there may be a bias toward negative stimuli. In subsequent studies, we will further compare different datasets to explore this. Another limitation of this article is that only the fusion of pupil diameter and facial expression is considered, which may weaken the recognition accuracy. In following studies, the fusion of multiple features can be considered to improve the accuracy of depression recognition.

REFERENCES

[1] D. Santomauro, J. Amm Herrera, P. Zheng, and A. Ferrari, "Global prevalence and burden of depressive and anxiety disorders in 204 countries and territories in 2020 due to the COVID-19 pandemic," Lancet, vol. 398, no. 10312, pp. 1700–1712, 2021.
[2] K. Skonieczna-Żydecka et al., "Faecal short chain fatty acids profile is changed in Polish depressive women," Nutrients, vol. 10, no. 12, 2018, Art. no. 1939.
[3] J. Park and I. Lee, "Factors influencing suicidal tendencies during COVID-19 pandemic in Korean multicultural adolescents: A cross-sectional study," BMC Psychol., vol. 10, no. 1, 2022, Art. no. 158.

[4] A. Chadha and B. Kaushik, "Performance evaluation of learning models for identification of suicidal thoughts," Comput. J., vol. 65, no. 1, pp. 139–154, 2022.
[5] Z. Sarhan, H. Shinnawy, M. Eltawil, Y. Elnawawy, W. Rashad, and M. Mohammed, "Global functioning and suicide risk in patients with depression and comorbid borderline personality disorder," Neurology, Psychiatry Brain Res., vol. 31, pp. 37–42, 2019, doi: 10.1016/j.npbr.2019.01.001.
[6] S. Ghosh, A. Ekbal, and P. Bhattacharyya, "A multitask framework to detect depression, sentiment and multi-label emotion from suicide notes," Cogn. Comput., vol. 14, no. 1, pp. 110–129, Jan. 2022.
[7] J. Sotelo and C. Nemeroff, "Depression as a systemic disease," Pers. Med. Psychiatry, vol. 1, pp. 11–25, 2017, doi: 10.1016/j.pmip.2016.11.002.
[8] A. Nassibi, C. Papavassiliou, and S. Atashzar, "Depression diagnosis using machine intelligence based on spatiospectrotemporal analysis of multi-channel EEG," Med. Biol. Eng. Comput., vol. 60, no. 11, pp. 3187–3202, 2022.
[9] T. Richter, B. Fishbain, E. Fruchter, G. Richter-Levin, and H. Okon-Singer, "Machine learning-based diagnosis support system for differentiating between clinical anxiety and depression disorders," J. Psychiatric Res., vol. 141, pp. 199–205, 2021, doi: 10.1016/j.jpsychires.2021.06.044.
[10] B. Hu, Y. Tao, and M. Yang, "Detecting depression based on facial cues elicited by emotional stimuli in video," Comput. Biol. Med., vol. 165, 2023, Art. no. 107457.
[11] X. Li, W. Guo, and H. Yang, "Depression severity prediction from facial expression based on the DRR_DepressionNet network," in Proc. IEEE Int. Conf. Bioinf. Biomed. (BIBM), Piscataway, NJ, USA: IEEE Press, 2020, pp. 2757–2764.
[12] A. Pampouchidou et al., "Automatic assessment of depression based on visual cues: A systematic review," IEEE Trans. Affect. Comput., vol. 10, no. 4, pp. 445–470, Oct.–Dec. 2019.
[13] Q. Li, T. Zhang, C. Chen, K. Yi, and L. Chen, "Residual GCB-Net: Residual graph convolutional broad network on emotion recognition," IEEE Trans. Cogn. Devel. Syst., vol. 15, no. 4, pp. 1673–1685, Dec. 2023.
[14] C. Nicolas et al., "Eye movement in unipolar and bipolar depression: A systematic review of the literature," Frontiers Psychol., vol. 6, 2015, Art. no. 1809.
[15] S. Alghowinem, R. Goecke, M. Wagner, G. Parker, and M. Breakspear, "Eye movement analysis for depression detection," in Proc. IEEE Int. Conf. Image Process., Piscataway, NJ, USA: IEEE Press, 2013, pp. 4220–4224.
[16] R. Shen, Q. Zhan, Y. Wang, and H. Ma, "Depression detection by analysing eye movements on emotional images," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Piscataway, NJ, USA: IEEE Press, 2021, pp. 7973–7977.
[17] M. Yang, Z. Weng, Y. Zhang, Y. Tao, and B. Hu, "Three-stream convolutional neural network for depression detection with ocular imaging," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 31, pp. 4921–4930, 2023.
[18] N. Sekaninova et al., "Oculometric behavior assessed by pupil response is altered in adolescent depression," Physiol. Res., vol. 68, no. 3, pp. 325–338, 2019.
[19] P. Atrey, M. Hossain, A. Saddik, and M. Kankanhalli, "Multimodal fusion for multimedia analysis: A survey," Multimedia Syst., vol. 16, no. 6, pp. 345–379, 2010, doi: 10.1007/s00530-010-0182-0.
[20] S. Tripathi, S. Tripathi, and H. Beigi, "Multi-modal emotion recognition on IEMOCAP dataset using deep learning," 2018, arXiv:1804.05788.
[21] H. Zhang, H. Wang, S. Han, W. Li, and L. Zhuang, "Detecting depression tendency with multimodal features," Comput. Methods Programs
[26] T. Yan, X. Zhang, and H. Wang, "Geometric-convolutional feature fusion based on learning propagation for facial expression recognition," IEEE Access, vol. 6, pp. 42532–42540, 2018.
[27] Y. Liu, X. Zhang, Y. Lin, and H. Wang, "Facial expression recognition via deep action units graph network based on psychological mechanism," IEEE Trans. Cogn. Devel. Syst., vol. 12, no. 2, pp. 311–322, Jun. 2020.
[28] P. Shivanasab and R. Abbaspour, "An incremental algorithm for simultaneous construction of 2D Voronoi diagram and Delaunay triangulation based on a face-based data structure," Adv. Eng. Softw., vol. 169, 2022, Art. no. 103129.
[29] X. Huang, S. Wang, X. Liu, G. Zhao, X. Feng, and M. Pietikäinen, "Discriminative spatiotemporal local binary pattern with revisited integral projection for spontaneous facial micro-expression recognition," IEEE Trans. Affect. Comput., vol. 10, no. 1, pp. 32–47, Jan.–Mar. 2017.
[30] L. Liu, S. Lao, P. Fieguth, Y. Guo, X. Wang, and M. Pietikäinen, "Median robust extended local binary pattern for texture classification," IEEE Trans. Image Process., vol. 25, no. 3, pp. 1368–1381, Mar. 2016.
[31] N. Maddage, R. Senaratne, L. Low, M. Lech, and N. Allen, "Video-based detection of the clinical depression in adolescents," in Proc. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., Piscataway, NJ, USA: IEEE Press, 2009, pp. 3723–3726.
[32] X. Fan and T. Tjahjadi, "A spatial-temporal framework based on histogram of gradients and optical flow for facial expression recognition in video sequences," Pattern Recognit., vol. 48, no. 11, pp. 3407–3416, 2015.
[33] J. Ye et al., "Multi-modal depression detection based on emotional audio and evaluation text," J. Affect. Disorders, vol. 295, pp. 904–913, Dec. 2021.
[34] J. Zhu et al., "Content-based multiple evidence fusion on EEG and eye movements for mild depression recognition," Comput. Methods Programs Biomed., vol. 226, 2022, Art. no. 107100.
[35] J. Shen et al., "Depression recognition from EEG signals using an adaptive channel fusion method via improved focal loss," IEEE J. Biomed. Health Inform., vol. 27, no. 7, pp. 3234–3245, Jul. 2023.
[36] G. Zheng et al., "An attention-based multi-modal MRI fusion model for major depressive disorder diagnosis," J. Neural Eng., vol. 20, no. 6, 2023, Art. no. 066005.
[37] Y. Tao, M. Yang, Y. Wu, K. Lee, A. Kline, and B. Hu, "Depressive semantic awareness from vlog facial and vocal streams via spatio-temporal transformer," Digit. Commun. Netw., pp. 2352–8648, Mar. 2023.
[38] S. Ghosh et al., "COMMA-DEER: Common-sense aware multimodal multitask approach for detection of emotion and emotional reasoning in conversations," in N. Calzolari, E. Santus, F. Bond, and S.-H. Na, Eds., Gyeongju, Republic of Korea: Int. Committee Comput. Linguistics, 2022, pp. 6978–6990.
[39] G. Singh, S. Ghosh, A. Verma, C. Painkra, and A. Ekbal, "Standardizing distress analysis: Emotion-driven distress identification and cause extraction (DICE) in multimodal online posts," in Proc. Conf. Empirical Methods Natural Lang. Process., H. Bouamor, J. Pino, and K. Bali, Eds., Singapore: Assoc. Comput. Linguistics, 2023, pp. 4517–4532.
[40] T. Siddharth and T. Sejnowski, "Utilizing deep learning towards multi-modal bio-sensing and vision-based affective computing," IEEE Trans. Affect. Comput., vol. 13, no. 1, pp. 96–107, Jan. 2022.
[41] J. Zhu, C. Yang, X. Xie, S. Wei, Y. Li, X. Li, and B. Hu, "Mutual information based fusion model (MIBFM): Mild depression recognition using EEG and pupil area signals," IEEE Trans. Affect. Comput., vol. 14, no. 3, pp. 2102–2115, Mar. 2023.
[42] X. Zhang et al., "Fusing of electroencephalogram and eye movement with group sparse canonical correlation analysis for anxiety detection," IEEE Trans. Affect. Comput., vol. 13, no. 2, pp. 958–971, Feb. 2022.
Biomed., vol. 240, 2023, Art. no. 107702. [43] Y. Tao, M. Yang, H. Li, Y. Wu, and B. Hu, “DepMSTAT: Multimodal
[22] S. Zhang, S. Zhang, T. Huang, W. Gao, and Q. Tian, “Learning affective spatio-temporal attentional transformer for depression detection,” IEEE
features with a hybrid deep model for audio-visual emotion recognition,” Trans. Knowl. Data Eng., early access, Jan. 5, 2024.
IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 10, pp. 3030– [44] S. Ghosh, A. Ekbal, and P. Bhattacharyya, “What does your bio
3043, Oct. 2018. say? Inferring Twitter users’ depression status from multimodal profile
[23] J. Liu et al., “Multimodal emotion recognition with capsule graph information using deep learning,” IEEE Trans. Comput. Social Syst.,
convolutional based representation fusion,” in Proc. IEEE Int. Conf. vol. 9, no. 5, pp. 1484–1494, Oct. 2022.
Acoust., Speech Signal Process. (ICASSP), Piscataway, NJ, USA: IEEE [45] Y. Zhang, et al., “Improving brain age prediction with anatomical feature
Press, 2021, pp. 6339–6343. attention-enhanced 3D-CNN,” Comput. Biol. Med., vol. 169, 2024,
[24] M. Yang, Y. Ma, Z. Liu, H. Cai, X. Hu, and B. Hu, “Undisturbed mental Art. no. 107873.
state assessment in the 5G era: A case study of depression detection [46] B. Zhang, D. Wei, G. Yan, T. Lei, H. Cai, and Z. Yang, “Feature-
based on facial expressions,” IEEE Wireless Commun., vol. 28, no. 3, level fusion based on spatial-temporal of pervasive EEG for depression
pp. 46–53, Jun. 2021. recognition,” Comput. Methods Programs Biomed., vol. 226, 2022,
[25] M. Yang, C. Cai, and B. Hu, “Clustering based on eye tracking data for Art. no. 107113.
depression recognition,” IEEE Trans. Cogn. Devel. Syst., vol. 15, no. 4, [47] D. Reddy, R. Ramamoorthi, and B. Curless, “Frequency-space de-
pp. 1754–1764, Dec. 2023. composition and acquisition of light transport under spatially varying

Authorized licensed use limited to: INDIAN INST OF INFO TECH AND MANAGEMENT. Downloaded on December 27,2024 at 11:21:36 UTC from IEEE Xplore. Restrictions apply.
illumination,” in Proc. Eur. Conf. Comput. Vis., New York, NY, USA: Springer-Verlag, 2012, pp. 596–610.
[48] R. Szeliski, Computer Vision: Algorithms and Applications. Cham, Switzerland: Springer Nature, 2011.
[49] M. Kwon, Y. Jeong, and H. Choi, “Feature embedding and conditional neural processes for data imputation,” Electron. Lett., vol. 56, no. 11, pp. 546–548, 2020.
[50] T. Chen, X. Xu, and S. Wang, “Signal processing by energy normalization method based on wavelet packet,” in Key Engineering Materials, vol. 413, Trans Tech Publ., Switzerland, 2009, pp. 613–619.
[51] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1492–1500.
[52] I. Gotlib, E. Krasnoperova, D. Yue, and J. Joormann, “Attentional biases for negative interpersonal stimuli in clinical depression,” J. Abnormal Psychol., vol. 113, no. 1, pp. 121–135, 2004.
[53] B. Cuthbert, “International affective picture system (IAPS): Instruction manual and affective ratings,” Center Res. Psychophysiol., Univ. Florida, Tech. Rep. A-4, 1999.
[54] M. Bradley, L. Miccoli, M. Escrig, and P. Lang, “The pupil as a measure of emotional arousal and autonomic activation,” Psychophysiology, vol. 45, no. 4, pp. 602–607, 2010.
[55] D. Sabatinelli, M. Bradley, P. Lang, V. Costa, and F. Versace, “Pleasure rather than salience activates human nucleus accumbens and medial prefrontal cortex,” J. Neurophysiol., vol. 98, no. 3, pp. 1374–1379, 2007.
[56] D. Zald, “The human amygdala and the emotional evaluation of sensory stimuli,” Brain Res. Rev., vol. 41, no. 1, pp. 88–123, 2003.
[57] T. Libkuman, H. Otani, R. Kern, S. Viger, and N. Novak, “Multidimensional normative ratings for the international affective picture system,” Behav. Res. Methods, vol. 39, no. 2, pp. 326–334, 2007.
[58] M. Yang, Y. Wu, Y. Tao, X. Hu, and B. Hu, “Trial selection tensor canonical correlation analysis (TSTCCA) for depression recognition with facial expression and pupil diameter,” IEEE J. Biomed. Health Inform., early access, Oct. 5, 2023.
[59] M. Yang, Y. Gao, L. Tang, J. Hou, and B. Hu, “Wearable eye-tracking system for synchronized multimodal data acquisition,” IEEE Trans. Circuits Syst. Video Technol., early access, Nov. 14, 2023.
[60] M. Kassner, W. Patera, and A. Bulling, “Pupil: An open-source platform for pervasive eye tracking and mobile gaze-based interaction,” in Proc. ACM Int. Joint Conf. Pervasive Ubiquitous Comput., 2014, pp. 1151–1160.
[61] M. Yang, X. Feng, R. Ma, X. Li, and C. Mao, “Orthogonal-moment-based attraction measurement with ocular hints in video-watching task,” IEEE Trans. Comput. Social Syst., vol. 10, no. 3, pp. 900–909, Mar. 2023.
[62] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Comput., vol. 16, no. 12, pp. 2639–2664, 2004.
[63] W. Zheng, X. Zhou, C. Zou, and L. Zhao, “Facial expression recognition using kernel canonical correlation analysis (KCCA),” IEEE Trans. Neural Netw., vol. 17, no. 1, pp. 233–238, Jan. 2006.
[64] X. Gao, Q. Sun, and H. Xu, “Multiple-rank supervised canonical correlation analysis for feature extraction, fusion and recognition,” Expert Syst. Appl., vol. 84, pp. 171–185, Oct. 2017.

Hao Shen received the B.E. degree in computer science from Lanzhou University, Lanzhou, China, in 2022, where he has been working toward the M.E. degree in computer science with the Gansu Provincial Key Laboratory of Wearable Computing, School of Information Science and Engineering, since 2022.
His research interests include affective computing and natural language processing.

Huiru Li received the B.Eng. degree in computer science from Taiyuan University, Taiyuan, China, in 2020. She is currently working toward the M.E. degree with the Gansu Provincial Key Laboratory of Wearable Computing, School of Information Science and Engineering, Lanzhou University, Lanzhou, China.
Her research interests include affective computing and image processing.

Yongfeng Tao (Graduate Student Member, IEEE) received the B.Eng.Mgt. degree in information management and information systems from Tianjin University of Finance and Economics, Tianjin, China, in 2018. He is currently working toward the Ph.D. degree with the Gansu Provincial Key Laboratory of Wearable Computing, School of Information Science and Engineering, Lanzhou University, Lanzhou, China.
His research interests include affective computing, image processing, and machine learning.

Xiang Liu received the Ph.D. degree in electronics science and technology from Beijing Institute of Technology, Beijing, China, in 2019.
He is currently a Teacher with Dongguan University of Technology, Dongguan, China. He completed a general project of the NSFC in 2018 and was awarded a new NSFC general project in 2020. His research interests include artificial intelligence, machine learning, video coding, communication, multimedia information retrieval, visual information processing, and pattern recognition.

Minqiang Yang (Member, IEEE) received the Ph.D. degree in computer science from Lanzhou University, Lanzhou, China, in 2022.
He is currently an Associate Professor with the Gansu Provincial Key Laboratory of Wearable Computing, School of Information Science and Engineering, Lanzhou University. His research interests include affective computing, image processing, machine learning, and automatic depression detection. He has published more than 20 papers in IEEE magazines and journals and at leading conferences.
