
Expert Systems With Applications 237 (2024) 121410

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

Spatial–Temporal Attention Network for Depression Recognition from facial videos

Yuchen Pan a, Yuanyuan Shang a,c,∗, Tie Liu a,d, Zhuhong Shao a,d, Guodong Guo b, Hui Ding a,d, Qiang Hu e

a College of Information Engineering, Capital Normal University, Beijing, 100048, China
b Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA
c Beijing Advanced Innovation Center for Imaging Technology, Beijing, 100048, China
d Beijing Key Laboratory of Electronic System Reliability Technology, Beijing, 100048, China
e Department of Psychiatry, ZhenJiang Mental Health Center, Zhenjiang, Jiangsu, 212000, China

ARTICLE INFO

Keywords: Depression recognition, Attention mechanism, Video recognition, Deep learning, Visualization, Convolutional neural network

ABSTRACT

Recent studies focus on the utilization of deep learning approaches to recognize depression from facial videos. However, these approaches have been hindered by their limited performance, which can be attributed to the inadequate consideration of global spatial–temporal relationships in significant local regions within faces. In this paper, we propose the Spatial–Temporal Attention Depression Recognition Network (STA-DRN) for depression recognition to enhance feature extraction and increase the relevance of depression recognition by capturing the global and local spatial–temporal information. Our proposed approach includes a novel Spatial–Temporal Attention (STA) mechanism, which generates spatial and temporal attention vectors to capture the global and local spatial–temporal relationships of features. To the best of our knowledge, this is the first attempt to incorporate pixel-wise STA mechanisms for depression recognition based on 3D video analysis. Additionally, we propose an attention vector-wise fusion strategy in the STA module, which combines information from both spatial and temporal domains. We then design the STA-DRN by stacking STA modules ResNet-style. The experimental results on AVEC 2013 and AVEC 2014 show that our method achieves competitive performance, with mean absolute error/root mean square error (MAE/RMSE) scores of 6.15/7.98 and 6.00/7.75, respectively. Moreover, visualization analysis demonstrates that the STA-DRN responds significantly in specific locations related to depression. The code is available at: https://github.com/divertingPan/STA-DRN.

1. Introduction

Major depressive disorder is characterized by persistent feelings of low mood and can affect individuals of all ages (Ackerman et al., 2018). Abnormal facial reactions during interviews with psychologists have been identified as an important diagnostic clue for depression (Fava & Kendler, 2000). This has inspired the increasing focus on recognizing depression based on facial videos, aiming to identify the relationship between depression levels and facial visual signals, and to recognize depression levels through algorithms that analyze facial videos. In previous studies (de Melo, Granger, & Hadid, 2019a; He et al., 2022; Jazaery & Guo, 2018; Uddin, Joolee, & Lee, 2020; Zhou, Jin, Shang, & Guo, 2020; Zhu, Shang, Shao, & Guo, 2018), visual features in video signals have been extensively researched to identify potential patterns of depression and improve recognition performance.

Despite the increasing interest in utilizing facial videos to identify depression, there still exists a limited ability to extract features from significant local regions within the face. One crucial contextual element in video signals is the spatial characteristic of global facial actions. For example, a person might seem to be smiling with their mouth but have a calm or even furrowed brow, indicating a hesitant expression rather than a genuine smile. Another critical contextual factor is the temporal relationship among frames. While a person might exhibit distress during an unpleasant scene, it might not be enduring. However, individuals with depression tend to maintain a distressed facial expression, and this variability in facial expression is a dynamic process

The code (and data) in this article has been certified as Reproducible by Code Ocean: (https://codeocean.com/). More information on the Reproducibility
Badge Initiative is available at https://www.elsevier.com/physical-sciences-and-engineering/computer-science/journals.
∗ Corresponding author at: College of Information Engineering, Capital Normal University, Beijing, 100048, China.
E-mail addresses: [email protected] (Y. Pan), [email protected] (Y. Shang), [email protected] (T. Liu), [email protected] (Z. Shao),
[email protected] (G. Guo), [email protected] (H. Ding), [email protected] (Q. Hu).

https://doi.org/10.1016/j.eswa.2023.121410
Received 3 June 2022; Received in revised form 30 August 2023; Accepted 30 August 2023
Available online 4 September 2023
0957-4174/© 2023 Published by Elsevier Ltd.

that necessitates capturing sequential global frames. Facial reactions related to depression are primarily reflected in the temporal and spatial changes of local facial areas, as shown in Fig. 1, but current works rarely take this aspect into consideration. Although video analysis has been developed (Yan & Woźniak, 2022), it is essential not only to locate the expression moment but also to model the correlation between multiple expressions, which could be extremely complex and implicit in a feature extractor. Deep learning models, rather than hand-crafted extractors, are suitable for this task. Furthermore, unlike general video detection tasks such as facial detection (Wieczorek, Siłka, Woźniak, Garg, & Hassan, 2022) or body detection (Woźniak, Siłka, & Wieczorek, 2021), depression recognition typically uses pre-processed data with clear and full-frame facial images. Hence, related works focus on addressing the challenge of facial feature processing. Specifically, Cummins et al. (2013), Meng et al. (2013), Zhou, Jin, et al. (2020) and Zhu et al. (2018) use a single frame to recognize depression, resulting in the loss of dynamic facial expression information. In de Melo et al. (2019a), Jazaery and Guo (2018), Tran, Bourdev, Fergus, Torresani, and Paluri (2015) and Zhou, Wei, Xu, Qu, and Guo (2020), a 3D convolutional neural network (CNN) (Tran et al., 2015) is used as the video feature extractor. However, the translation invariance of convolution and the down-sampling of pooling dilute the spatial–temporal information of the features and neglect the contributions from different spatial–temporal areas. As a result, the model takes all information equally, rather than concentrating on the several temporal frames or spatial areas that are rich in essential information. Recent approaches (He, Chan, & Wang, 2021; Niu, Liu, Tao, & Li, 2021; Niu, Tao, Liu, Huang, & Lian, 2020) add weights to features with various attention mechanisms, boosting performance with features enhanced by adaptively weighted information. However, the spatial and temporal information is mixed during feature extraction, and the fusion strategy of spatial–temporal features fails to preserve the spatial–temporal structure.

Fig. 1. Facial reactions closely associated with depression are primarily manifested in temporal changes of spatial facial areas. For instance, persistent frowns are depicted by yellow boxes in the example, while dynamic eye movements are represented by green boxes. Dynamic facial features, such as changes in angle and illumination, are also taken into consideration when analyzing facial videos. The appearance of nasolabial folds, as shown in the red boxes, is notably influenced by shifts in facial direction.

In this paper, a Spatial–Temporal Attention Depression Recognition Network (STA-DRN) is proposed for depression recognition to enhance feature extraction and relevance by capturing both global and local spatial–temporal information. To achieve this, we first introduce spatial and temporal attention modules that employ a Spatial–Temporal Attention (STA) mechanism to correlate information between frames and pixels, inspired by Li, Wang, Hu, and Yang (2019) and Woo, Park, Lee, and Kweon (2018). Then, we propose a fusion strategy that combines spatial and temporal attention vectors to yield a novel vector, which contains both local and global spatial–temporal information, with which the proposed STA module adaptively weights features. Furthermore, the STA-DRN is constructed by residually connecting (He, Zhang, Ren, & Sun, 2016) the STA modules.

Experiments are conducted on the Audio-Visual Emotion Challenge and Workshop (AVEC) 2013 (Valstar, Schuller, Smith, Eyben, et al., 2013) and AVEC 2014 (Valstar, Schuller, Smith, Almaev, et al., 2014) datasets, respectively, and demonstrate that our approach achieves a competitive result when compared to other vision-based approaches. Specifically, the mean absolute error (MAE) and root mean square error (RMSE) are 6.15/7.98 on AVEC 2013 and 6.00/7.75 on AVEC 2014, respectively. Furthermore, we utilize the modified Grad-CAM++ (Selvaraju et al., 2017) to illustrate that the model responds clearly and stably to multiple specific areas. This approach helps to reveal the relationship between facial action and depression level.

To summarize, our contributions are as follows:

1. We propose the STA mechanism to extract attention vectors with both global and local spatial–temporal information. This aspect has been overlooked in previous works. Furthermore, we develop a vector-wise fusion strategy to fuse spatial–temporal attention vectors in the STA module.
2. By incorporating the STA module with a popular deep learning network structure, we propose the STA-DRN. This approach can enhance the feature extraction and relevance in depression recognition using the STA mechanism.
3. Our experiments on the AVEC 2013 and AVEC 2014 datasets demonstrate the competitive performance of STA-DRN. Additionally, we apply visualization analysis to reveal the visual depression pattern from STA-DRN.

2. Related work

Facial Depression Recognition. Prior research has demonstrated that vision-based approaches are proficient in recognizing depression. For instance, in AVEC 2013, the baseline approach utilized local phase quantization (Ojala, Pietikainen, & Maenpaa, 2002) as the feature descriptor, and a support vector regression was trained on histograms from separated image blocks to predict the depression level. Meng et al. (2013) proposed the Motion History Histogram, which records grayscale value changes for each pixel to describe motion information. They also extracted the Edge Orientation Histogram and Local Binary Patterns from the histogram to recognize depression. Additionally, Cummins et al. (2013) investigated the use of facial cues by combining the Space–Time Interest Points (Laptev, Marszalek, Schmid, & Rozenfeld, 2008) for detecting salient points and the Pyramid of Histogram of Gradients (Bosch, Zisserman, & Munoz, 2007) for reflecting the spatial–temporal changes.

These methods are primarily based on traditional hand-crafted features. More recent works have focused on utilizing CNNs or a combination of CNNs and preprocessed hand-crafted features. For instance, in Zhu et al. (2018), an architecture employing two-stream CNNs was proposed to separately process facial images and optical flows. An end-to-end algorithm (Zhou, Jin, et al., 2020) exclusively utilized the image signal and divided it into different network pipelines. de Melo, Granger, and Hadid (2019b) utilized ResNet-50 as the backbone and proposed a method for transforming score regression into distribution prediction. Shang et al. (2021) showed that local quaternion image features were effective when combined with the CNN model.

The aforementioned approaches employ either single frames or short-term dynamic features derived from consecutive frames (e.g., optical flow (Zhu et al., 2018) and quaternion local features (Shang et al., 2021)) for depression recognition. In other studies (de Melo et al., 2019a; Jazaery & Guo, 2018; Zhou, Wei, et al., 2020), long-term video frames were directly extracted and predicted with 3D CNN. To effectively leverage the long-range temporal information, Jazaery and Guo (2018) extracted feature sequences from 3D convolution and employed an RNN to predict the final score. de Melo et al. (2019a) employed global and local 3D CNNs that focused on the overall face and eye area without an RNN structure. Lastly, Zhou, Wei, et al. (2020) formulated depression score prediction as a label distribution learning problem and proposed a metric learning and 3D CNN-based approach.

Attention mechanism. The use of 3D methods is still limited in terms of effectively correlating global information and dynamically weighting features. To address this, attention mechanisms have been introduced in 2D image recognition tasks to enable adaptively weighted


Fig. 2. The overall architecture of STA-DRN. Table 1 shows more detailed parameters.

features. SENet (Hu, Shen, Albanie, Sun, & Wu, 2020), SKNet (Li et al., 2019) and ResNeSt (Zhang et al., 2020) all proposed different types of attention modules for this purpose, such as channel-wise attention and feature-map split attention. Woo et al. (2018) proposed CBAM, which generates attention information across both channel and spatial dimensions. Similar adaptive weighting and partial information selecting techniques have been used in various vision tasks (Dosovitskiy et al., 2021; Fernandez, Peña, Ren, & Cunha, 2019; Wu et al., 2023; Wu, Wen, Xu, Yang, & Zhang, 2021). In video recognition, Li, Liu, Zhang, and Wang (2020) and Wang, Girshick, Gupta, and He (2018) proposed non-local blocks to capture long-range dependencies along temporal axes, although this requires significant computation. For depression recognition, He et al. (2021) proposed the Deep Local Global Attention Convolutional Neural Network, which can learn both global and local representations from facial images. Nonetheless, this approach may lack sufficient temporal information after feature extraction. In a multimodal approach proposed by Niu et al. (2020), the spatial–temporal fusion in VSLF combines the spatial–temporal information of visual frames, but the feature representation is highly abstract, and the extraction of spatial attention information may interfere with positional relationships.

Visualization. Several CNN visualization techniques (Chattopadhay, Sarkar, Howlader, & Balasubramanian, 2018; Selvaraju et al., 2017; Springenberg, Dosovitskiy, Brox, & Riedmiller, 2015; Wang et al., 2020; Zhou, Khosla, Lapedriza, Oliva, & Torralba, 2016) have been developed to aid in understanding neural networks. In depression recognition research (Carneiro de Melo, Granger, & Hadid, 2020; de Melo et al., 2019b; Jazaery & Guo, 2018; Zhou, Jin, et al., 2020; Zhou, Wei, et al., 2020), these visualization methods have been utilized to display features related to depression. For example, in MR-DepressNet (Zhou, Jin, et al., 2020), stable responses in the eye areas were displayed through a DAM heatmap, which illustrated the depression-related features captured by the model. However, this visualization also revealed that this response was the only region highly related to MR-DepressNet.

To strike a balance between speed and performance, we employed Grad-CAM++ (Chattopadhay et al., 2018) as our visualization method. We expanded the calculation dimension from 2D to 3D and optimized it for the depression recognition task to suit our model.

3. Proposed method

This paper proposes STA-DRN, which aims to capture both spatial–temporal global and local information from video frames. To achieve this, the core of the STA modules uses the STA mechanism to enhance the relationship between information across pixels and frames. Specifically, the model incorporates a spatial attention module and a temporal attention module to extract spatial and temporal attention vectors, which can assign adaptive weights to features with spatial–temporal information. These two sub-modules are combined into an STA module with attention vector-wise fusion. In this section, we provide an overview of the overall structure of STA-DRN, followed by a detailed description of the spatial and temporal modules. Finally, we introduce the STA module, along with the fusion method.

3.1. Spatial–temporal attention depression recognition network

Compared to previous 3D methods (de Melo et al., 2019a; Jazaery & Guo, 2018; Zhou, Wei, et al., 2020), our proposed STA-DRN not only incorporates temporal dynamic features, but also captures the relationship between spatial–temporal features through the STA module. The overall structure of the STA-DRN uses ResNet (He et al., 2016) as the backbone structure, which includes residual connections and bottleneck structures. The network architecture is illustrated in Fig. 2. In the first layer, a convolution operation with a kernel size of 7 × 7 × 7 and a stride of 1 × 2 × 2 is applied to extract and downsample low-level features. After a 3 × 3 × 3 pooling layer with a stride of 1 × 2 × 2, the features are then fed into a residual module containing two bottleneck structures. The proposed STA module replaces the middle layer of the bottleneck and generates a weighted feature with attention information. After a stack of residual modules, an adaptive pooling layer resamples the feature into a fixed shape. Finally, the last fully-connected layer predicts a score as the final output of the STA-DRN.

In the STA module, the spatial attention module and temporal attention module generate the inner features X_s and X_t and the attention vectors s_s and s_t. The attention vectors are fused and then combined with the features, which is formulated as

X_fused = (s_s ⊙ s_t) ⊙ (X_s + X_t),  (1)

where X_fused denotes the output of the STA module. More details are presented in Section 3.2.

3.2. Spatial–temporal attention module

3.2.1. Spatial attention module

In video recognition, objects of interest often appear in a series of contiguous frames, and the ability to discern sequential information from features is essential for accurate recognition. Such information can be extracted from certain relative patterns that remain stable and spatially invariant across frames (Wang et al., 2018). Moreover, owing to the positional relationships between different facial features, the multiple spatial appearances of distinct patterns are also an important indicator for prediction, as depicted in Fig. 1. Consequently, we generate a spatial attention vector to enhance the inter-spatial relationships among features. To more effectively capture spatial information,

Fig. 3. A diagram of the spatial attention module. The output feature X_s_out is generated by multiplying the attention vector s_s with the extracted feature X_s.

Fig. 4. A diagram of the temporal attention module. The output feature X_t_out is generated by multiplying the attention vector s_t with the extracted feature X_t.

CBAM (Woo et al., 2018) proposed a 2D spatial attention module. However, this module was developed for single-frame image recognition, and the resulting spatial attention vector is generated across the entire channel axis without incorporating temporal information. Therefore, we propose a 3D video spatial attention module that draws inspiration from spatial information representation, as shown in Fig. 3. During feature extraction, we group features into distinct sets to increase the number of pattern combinations of feature extractors. In contrast to CBAM's spatial attention mechanism, we use soft-attention (Xu et al., 2015) to extract and combine spatial attention vectors from the spatial statistical information of features, thus fully utilizing global and local spatial information to generate the weights of each feature. With the soft-attention mechanism, the spatial attention vector can be considered as the specific global spatial activation of features obtained from convolutional local feature extractors.

Given the feature X ∈ R^(T×H×W×C) as the input of this module, T is the temporal length of the feature, (H, W) is the spatial size of the feature, and C is the number of channels. The function F_sp is a composite operation to get K feature groups X_s as:

X_s = F_sp(X) = ReLU(BN(F(X))),  (2)

where ReLU denotes the Rectified Linear Unit (Nair & Hinton, 2010) and BN denotes the Batch Normalization (Ioffe & Szegedy, 2015). X_s ∈ R^(T×H×W×C′×K); C′ is the size of channels from an output of the local feature extractor, which is a convolution operator F with a kernel size of 3 × 3 × 3, and K denotes the split groups.

After that, the spatial statistics can be calculated from the feature X_s. The 2D spatial feature statistical vector z_s ∈ R^(H×W×C′×K) can be computed with an average pooling operation F_sap. Notice that the average pooling is operated on the temporal axis:

z_s = F_sap(X_s) = (1/T) Σ_{i=1}^{T} X_s(i).  (3)

The attention vector can be calculated from a fully connected operator F_fc. In our practical implementation, the 1 × 1 convolution layer is recommended to replace the fully connected layer. Then the softmax operator is implemented in groups to combine the features and select the specific combination of patterns and features. The exact description of calculating the spatial attention vector is as follows:

s_s = F_softmax(F_fc(z_s)), s_s ∈ R^(H×W×C′×K),  (4)

F_softmax(z)_j = e^(z_j) / Σ_{k=1}^{K} e^(z_k), j = 1…K.  (5)

The multiplication of attention vectors and extracted features is calculated along the T-axis of features with the Hadamard product to achieve the spatial attention weighting:

X_s_out(t) = s_s ⊙ X_s(t), t = 1…T,  (6)

where X_s_out ∈ R^(T×H×W×(C′×K)) is a 4D tensor that has a size of (C′ × K) along the 4th axis.

3.2.2. Temporal attention module

Temporal variation within frames is a significant factor in video recognition. While short-term feature extraction can make use of dynamic information between a few frames, it is not suitable for extracting long-term dynamic features spanning several seconds. In the case of facial videos, both long-term and short-term features are equally important for recognition, as shown in Fig. 1. A previous method (Niu et al., 2020) employed an LSTM network to generate temporal information, but this approach is limited in its ability to capture long-term temporal relationships and can impede the parallelization of the deep model. In order to address these challenges, we propose a temporal attention model that employs a temporal attention vector to capture the long-term or short-term relationships between features. We introduce a temporal attention module to enhance temporal information, as illustrated in Fig. 4. Similar to the spatial attention module, the temporal attention vector is generated from temporal statistical information and an attention mechanism to represent a long-term combination of short-term convolutional features.

Similar to the spatial module, the input feature X is split by F_sp, which is the same as that shown in Eq. (2) in the spatial module, to get K feature groups, generating X_t ∈ R^(T×H×W×C′×K). Then the temporal statistical vector z_t ∈ R^(T×C′×K) is calculated by the average pooling F_tap:

z_t = F_tap(X_t) = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} X_t(i, j).  (7)

After the fully connected F_fc and softmax F_softmax, the temporal attention vector s_t is generated:

s_t = F_softmax(F_fc(z_t)), s_t ∈ R^(T×C′×K).  (8)

Finally, the temporal attention vector is utilized to weigh the extracted features along both the H-axis and W-axis of features through the Hadamard product, resulting in the generation of the weighted feature:

X_t_out(h, w) = s_t ⊙ X_t(h, w), h = 1…H, w = 1…W,  (9)

where X_t_out ∈ R^(T×H×W×(C′×K)) is a 4D tensor that has a size of (C′ × K) along the 4th axis.

3.2.3. Attention vector-wise feature fusion

The conventional approach for combining spatial and temporal attention modules involves a simple summation of features from each branch of the model (Li et al., 2019). Formally, the fusion of spatial and temporal attention modules on a feature-wise basis can be represented as:

X_fused = X_s_out + X_t_out,  (10)

where X_s_out and X_t_out denote the output features from the spatial and temporal attention modules.

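The shapes and operations above can be sketched concretely. The following is a minimal NumPy sketch of the STA computation, under stated assumptions: the grouped 3 × 3 × 3 convolution F_sp of Eq. (2) is replaced by simply stacking K copies of the input, and the learned F_fc layer is omitted, so only the pooling, grouped softmax, and the attention vector-wise fusion of Eq. (1) are illustrated. The function names `sta_module` and `softmax_over_groups` are hypothetical, not from the paper's released code.

```python
import numpy as np

def softmax_over_groups(z):
    # F_softmax (Eq. (5)): softmax across the K split groups (last axis)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sta_module(x, k=2):
    """Toy STA module on a feature map x of shape (T, H, W, C').

    Stand-in for F_sp: K stacked copies instead of a grouped 3x3x3
    convolution, so X_s and X_t have shape (T, H, W, C', K).
    """
    t_len, h, w, c = x.shape
    xs = np.stack([x] * k, axis=-1)          # spatial-branch feature X_s
    xt = xs.copy()                           # temporal-branch feature X_t

    # Spatial branch: temporal average pooling (Eq. (3)), softmax (Eq. (4))
    z_s = xs.mean(axis=0)                    # (H, W, C', K)
    s_s = softmax_over_groups(z_s)           # spatial attention vector

    # Temporal branch: spatial average pooling (Eq. (7)), softmax (Eq. (8))
    z_t = xt.mean(axis=(1, 2))               # (T, C', K)
    s_t = softmax_over_groups(z_t)           # temporal attention vector

    # Vector-wise fusion (Eq. (1)): s_st(t,h,w) = s_t(t) * s_s(h,w),
    # then weight the summed branch features; a feature-wise baseline
    # (Eq. (10)) would instead return xs * s_s + xt * s_t summed directly.
    s_st = s_t[:, None, None, :, :] * s_s[None, :, :, :, :]
    x_fused = s_st * (xs + xt)
    return x_fused.reshape(t_len, h, w, c * k)

x = np.random.rand(4, 8, 8, 6).astype(np.float32)
y = sta_module(x, k=2)
print(y.shape)  # → (4, 8, 8, 12)
```

Note how the output folds the K groups into the channel axis, matching the (C′ × K) size along the 4th axis stated after Eqs. (6) and (9).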

Table 1
The architecture of our STA-DRN. The module layer is the bottleneck block shown in Fig. 2. Downsampling is performed by module3_1, module4_1, and module5_1 with a stride of (1,2,2).

layer name  output size      layer
conv1       64 × 112 × 112   7 × 7 × 7, 64, stride (1,2,2)
module2_x   64 × 56 × 56     3 × 3 × 3 max pool, stride (1,2,2); [1 × 1 × 1, 16; 3 × 3 × 3, 16; 1 × 1 × 1, 64] × 2
module3_x   64 × 28 × 28     [1 × 1 × 1, 32; 3 × 3 × 3, 32; 1 × 1 × 1, 128] × 2
module4_x   64 × 14 × 14     [1 × 1 × 1, 64; 3 × 3 × 3, 64; 1 × 1 × 1, 256] × 2
module5_x   64 × 7 × 7       [1 × 1 × 1, 128; 3 × 3 × 3, 128; 1 × 1 × 1, 512] × 2
fc1         1 × 1 × 1        adaptive average pool; 1-d fc

Fig. 5. A diagram of the attention vector-wise feature fusion method. The attention vector is utilized to capture the interaction of information between the spatial and temporal modules.

Fig. 6. Some examples of input video frames. The augmentation is applied with the same set of parameters on one video section.

To exploit the attention information from the spatial and temporal modules and incorporate the overall spatial–temporal information into the extracted feature, we propose a fusion method of attention vectors in the STA module. In the Non-local Neural Network (Wang et al., 2018), spatial–temporal information is produced by multiplying high-dimensional matrices, which consumes more memory during computation. Instead, we employ low-dimensional spatial and temporal attention vectors to achieve the integration of spatial–temporal information.

As shown in Fig. 5, the input feature is fed into the temporal and spatial modules, and the two attention vectors are combined into a spatial–temporal attention vector with the following operation:

s_st(t, h, w) = s_t(t) ⊙ s_s(h, w),  (11)

where t = 1…T, h = 1…H, w = 1…W; s_t(t) and s_s(h, w) denote the temporal and spatial attention vectors, respectively.

Multiplying the vector s_st with the features from the spatial module X_s and the temporal module X_t, and then adding the results, outputs the spatial–temporal feature:

X_fused = s_st ⊙ (X_s + X_t).  (12)

4. Experiments

In this section, we conduct experiments on the AVEC 2013 (Valstar, Schuller, Smith, Eyben, et al., 2013) and AVEC 2014 (Valstar, Schuller, Smith, Almaev, et al., 2014) datasets to evaluate the effectiveness of our proposed STA-DRN. We first perform an ablation study to assess the computational complexity and effectiveness of the STA-DRN, followed by visual analysis using the modified Grad-CAM++ (Chattopadhay et al., 2018) on the STA-DRN. Additionally, we evaluate the robustness of our STA-DRN to degraded inputs. Finally, we compare the performance of our proposed STA-DRN with other vision-based methods.

4.1. Datasets

We employ the AVEC 2013 and AVEC 2014 datasets for the training and evaluation of our proposed STA-DRN model. These datasets consist of videos accompanied by BDI-II score labels, which are self-evaluated by the participants in each video. The labels span from 0 to 63 and are divided into four depression levels: minimal (0–13), mild (14–19), moderate (20–28), and severe (29–63).

The AVEC 2013 dataset includes a total of 150 video clips, with each video exclusively featuring one individual. This dataset encompasses a total of 82 unique human participants. It is evenly divided into three separate subsets for training, development, and testing. Each video is presented in a colored MP4 format, with a resolution of 640 × 480 pixels and a frame rate of 30 frames per second. The age range of the participants spans from 18 to 63 years, with an average age of 31.5 years.

The AVEC 2014 dataset consists of a total of 300 video recordings, which are divided into three distinct sets: training, development, and testing sets. Each set contains two different types of video recordings: Freeform and Northwind. The Northwind task involves participants reading aloud the fable ''Die Sonne und der Wind'' (The North Wind and the Sun) in German. On the other hand, the Freeform task requires participants to answer a series of questions in the German language.

To maximize the preservation of short-term facial expressions, we extract video frames as images at a 1-frame interval. Our pre-processing phase utilizes the Dlib toolkit to extract facial landmarks, which serves to exclude background interference and align human faces. During alignment, we align the centers between the eyes and set the vertical distance between the eyes and mouth to be 1/3 of the image height. The aligned facial images are then rescaled to a size of 224 × 224 pixels. For training, we randomly select a sequence of 64 frames from a given video at a stochastic position to create a training clip. To augment the input data during training, we apply random horizontal flips and jitter on brightness, contrast, saturation, and hue ranging between 0 and 0.1 for all frames in one video clip, as illustrated in Fig. 6. In the testing phase, we use the cropped and aligned video frames without any augmentation. For a given testing video, we crop it into sub-videos of 64 frames and predict each clip to calculate the mean value of depression scores, which in turn yields the final result for that video. We train our model on the union set of the AVEC 2013 and AVEC 2014 training sets, and subsequently validate and test our model on the development and testing sets of both datasets, respectively.

4.2. Training details

Our model is implemented in the PyTorch environment (https://pytorch.org) on a server equipped with 6 TITAN RTX GPUs. The initial weights of the STA-DRN are pretrained parameters on


Fig. 7. The visualization of the prediction vs ground truth with different attention modules on AVEC 2013 and 2014 test sets.

UCF101 (Soomro, Zamir, & Shah, 2012) for the temporal prior of videos, and on CK+ (Lucey et al., 2010) for the spatial prior of facial images. During the fine-tuning process, we set the number of training epochs to 200 and the batch size to 5. To prevent overfitting, we save the weights of the network when the validation loss reaches a historical minimum. The network parameters are described in detail in Table 1. We use the Adam optimizer with an initial learning rate of 0.001. For the learning rate schedule, we employ cosine annealing with warm restarts:

\eta_t = \frac{1}{2} \times 0.001 \left( 1 + \cos\left( \pi \frac{T_{cur}}{T_i} \right) \right),   (13)

where T_{cur} represents the current epoch index and T_i is the interval between two restarts. The first restart interval is T_0 = 50, and the factor by which the interval increases after each restart is set to 2, following the suggestion in Loshchilov and Hutter (2016).

The mean absolute error (MAE) and the root mean square error (RMSE) are used as evaluation functions to compare with other methods. The definitions of MAE and RMSE are as follows:

MAE = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|,   (14)

RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2 },   (15)

where N denotes the number of samples, \hat{y}_i represents the predicted result, and y_i denotes the true label of the ith sample.

4.3. Results and analysis

4.3.1. Analysis of the STA mechanism

To evaluate the effectiveness of the STA mechanism, we incorporate different attention modules into the STA-DRN, namely the spatial attention module (s), the temporal attention module (t), the feature-wise fusion module (fusion_f), and the attention vector-wise fusion module (fusion_v), to serve as the feature extractor. The experiments are performed on the AVEC 2013 and AVEC 2014 test sets, and the results are presented in Table 2. In the table, k denotes the number of split groups K in s_p in Eq. (2), and s/t/fusion_f/fusion_v denotes the attention module used in STA-DRN. Furthermore, Fig. 7 displays the predictions made by each model and the corresponding ground truth values for the AVEC 2013 and AVEC 2014 datasets.

Table 2
Comparison of different structures of STA-DRN on AVEC 2013 and AVEC 2014. The prefix k gives the number of split groups; s/t/fusion_f/fusion_v denotes the embedded module (spatial / temporal / feature-wise fusion / attention vector-wise fusion). All other parameters are kept consistent with those shown in Table 1.

               AVEC 2013        AVEC 2014
               MAE     RMSE     MAE     RMSE
  k1_s         7.97    10.35    7.81    10.25
  k1_t         7.58     9.12    7.68     9.29
  k1_fusion_f  7.54     9.88    7.43     9.79
  k1_fusion_v  7.19     9.65    7.07     9.55
  k2_s         7.45     9.37    7.30     9.28
  k2_t         7.21     9.32    7.07     9.22
  k2_fusion_f  7.11     9.21    6.97     9.12
  k2_fusion_v  6.15     7.98    6.00     7.75
  k4_s         7.97    10.11    7.90     9.88
  k4_t         7.77    10.12    7.48     9.56
  k4_fusion_f  7.60     9.70    7.45     9.61
  k4_fusion_v  7.41     9.83    7.38     9.63

The experimental results show that the models with k2 split groups outperform those with k1 split groups, demonstrating the benefit of splitting the feature groups. However, increasing the number of split groups to k4 degrades performance, which suggests that higher model complexity can cause optimization difficulty and overfitting. Comparing module fusion with a single spatial or temporal module, the former is clearly better (fusion_f/fusion_v vs. s/t). Additionally, the attention vector-wise fusion module (fusion_v) outperforms all the other modules, indicating that the spatial–temporal fusion module improves performance more than a single temporal or spatial feature alone. This finding also highlights the superiority of the STA module over the feature-wise fusion module.

In addition, the Pearson Correlation Coefficient (PCC) and the Concordance Correlation Coefficient (CCC) are calculated using the following formulas:

PCC = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y},   (16)
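As a concrete illustration, the warm-restart schedule of Eq. (13) in Section 4.2 can be reproduced in a few lines of plain Python. This is only a sketch of the rule; in practice, PyTorch provides the same schedule as torch.optim.lr_scheduler.CosineAnnealingWarmRestarts.

```python
import math

def warm_restart_lr(epoch, eta_max=0.001, t_0=50, t_mult=2):
    """Learning rate at a given epoch under Eq. (13).

    The schedule restarts after t_0 epochs, and each subsequent
    restart interval is t_mult times longer than the previous one.
    """
    t_i, t_cur = t_0, epoch
    while t_cur >= t_i:      # locate the interval that contains `epoch`
        t_cur -= t_i
        t_i *= t_mult
    return 0.5 * eta_max * (1 + math.cos(math.pi * t_cur / t_i))
```

With t_0 = 50 and t_mult = 2 as above, the rate starts at eta_max, decays toward zero within each interval, and jumps back to eta_max at epochs 50 and 150 within the 200-epoch fine-tuning run.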

Table 3
PCC and CCC of different STA-DRN structures on AVEC 2013 and AVEC 2014.

               AVEC 2013        AVEC 2014
               PCC     CCC      PCC     CCC
  k1_s         0.50    0.48     0.52    0.50
  k1_t         0.65    0.52     0.66    0.52
  k1_fusion_f  0.57    0.54     0.58    0.56
  k1_fusion_v  0.59    0.57     0.60    0.59
  k2_s         0.66    0.65     0.67    0.66
  k2_t         0.67    0.56     0.68    0.58
  k2_fusion_f  0.70    0.62     0.71    0.63
  k2_fusion_v  0.73    0.72     0.75    0.73
  k4_s         0.58    0.44     0.61    0.45
  k4_t         0.58    0.57     0.63    0.59
  k4_fusion_f  0.67    0.66     0.65    0.65
  k4_fusion_v  0.67    0.68     0.67    0.67

Fig. 8. The degraded inputs used to evaluate the robustness of STA-DRN. In (a), Gaussian noise is added to the input frames at various levels, and the numbers beneath indicate the σ of the noise. In (b), the input frames are Gaussian blurred using distinct kernel sizes, and the numbers beneath the images represent the kernel size.

Table 4
Comparison of the performance, number of parameters, and FLOPs of different 3D approaches. P. and F. represent the number of parameters (×10^6) and FLOPs (×10^9), respectively.

  Model        AVEC 2013        AVEC 2014        P.       F.
               MAE     RMSE     MAE     RMSE
  ResNet-18    6.94    9.14     6.72    8.93     33.20    65.80
  MDN-18       7.02    8.58     6.48    8.64     11.41   179.84
  STA-18       6.15    7.98     6.00    7.75     31.27   192.71
  ResNet-50    6.83    8.84     6.53    8.53     46.20    79.96
  MDN-50       6.28    8.02     6.38    8.14     18.73   213.37
  STA-50       6.39    8.19     6.27    7.94     41.41   238.16
  ResNet-152   6.57    8.42     6.64    8.29    117.41   148.41
  MDN-152      6.42    7.83     6.28    7.85     52.25   424.96
  STA-152      6.48    8.16     6.64    8.25     78.14   375.69

CCC = \frac{2 \rho_{x,y} \sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + \left( \mu_x - \mu_y \right)^2},   (17)

where cov denotes the covariance, σ_x and σ_y are the standard deviations, μ_x and μ_y are the mean values, and ρ denotes the PCC. The PCC and CCC of each variant model are listed in Table 3. The attention vector-wise fusion module with two split feature groups (k2_fusion_v) achieves the best performance, with high PCC and CCC on both AVEC 2013 and AVEC 2014, which demonstrates a positive correlation between the predicted values and the ground truth.

Fig. 9. The comparison of STA-DRN performance at different noisy input levels on the AVEC datasets.

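For reference, the four evaluation measures of Eqs. (14)-(17) can be sketched in plain Python using population statistics, as in the formulas above; library implementations such as scipy.stats.pearsonr would serve equally well for the PCC.

```python
import math

def mae(y_hat, y):
    """Mean absolute error, Eq. (14)."""
    return sum(abs(a - b) for a, b in zip(y_hat, y)) / len(y)

def rmse(y_hat, y):
    """Root mean square error, Eq. (15)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y))

def pcc(x, y):
    """Pearson correlation coefficient, Eq. (16)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

def ccc(x, y):
    """Concordance correlation coefficient, Eq. (17); 2*rho*sx*sy = 2*cov."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx2 = sum((a - mx) ** 2 for a in x) / n
    sy2 = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (sx2 + sy2 + (mx - my) ** 2)
```

Note that a constant offset between predictions and labels leaves the PCC at 1 but lowers the CCC, which is why the two measures are reported together.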
4.3.2. Analysis of computational complexity

The performance and computational complexity of STA-DRN are compared with ResNet (He et al., 2016) and MDN (Carneiro de Melo, Granger, & Bordallo Lopez, 2021) across different complexity levels. The experimental results, including the number of parameters (P.) and FLOPs (F.), are presented in Table 4. To ensure a fair comparison, we use the same training and testing procedures and setups as described in Sections 4.1 and 4.2. The STA-18, STA-50, and STA-152 architectures are designed based on the specifications in Table 1, with bottleneck block numbers of [2, 2, 2, 2], [3, 4, 6, 3], and [3, 8, 36, 3], respectively. Interestingly, the lighter STA-DRN (STA-18) outperforms its deeper counterparts (STA-50, STA-152). Moreover, the more complex models exhibit no significant performance gains and may even suffer from over-fitting. Remarkably, STA-18 even outperforms more complicated models such as ResNet-50, MDN-50, and ResNet-152. Although MDN-152 yields a better RMSE on AVEC 2013, it requires significantly more computation (424.96G FLOPs) than STA-18 (192.71G FLOPs).

4.3.3. Analysis of degraded inputs

We evaluate the robustness of STA-DRN to degraded inputs, as shown in Fig. 8. Gaussian noise N(0, σ²) with varying standard deviations σ = [0.03, 0.06, 0.125, 0.25, 0.375, 0.5, 1] is added to the original inputs, and the performance is evaluated as shown in Fig. 9. The baseline model uses the optimal STA-DRN weights (k=2, STA-18), with the input images tested directly. The results show that STA-DRN is robust to small amounts of noise (σ < 0.25), as the recognition error remains trivial. However, the recognition error increases dramatically when the noise level exceeds 0.375, and the trend shows that the error grows as the noise level increases.

Fig. 10. The comparison of STA-DRN performance at different blurring input levels on the AVEC datasets.

Then the STA-DRN is tested with inputs under Gaussian blur with varying kernel sizes [3, 5, 7, 15, 27, 49, 81], and the results are shown in Fig. 10. As the kernel size increases, indicating blurrier inputs, the recognition error increases. This is expected, as the key facial features


are lost in blurry images. However, the error rate rises more slowly when the kernel size exceeds 15, as the model tends to output an average recognition score for smoother, more uniform inputs.

4.3.4. Visualization analysis

After evaluating various existing methods, such as Grad-CAM (Selvaraju et al., 2017), Grad-CAM++ (Chattopadhay et al., 2018), and Score-CAM (Wang et al., 2020), we selected Grad-CAM++ to measure the importance of each area towards the overall prediction. This method provides smoother and more stable heatmaps than Grad-CAM while being faster than Score-CAM.

To apply the Grad-CAM++ method to our depression recognition task, we modified the calculation of the feature weights. Specifically, we take the final prediction from the node of the last layer into account to better emphasize the effect of varying scores. The modified Grad-CAM++ generates the heatmap as follows:

L = \mathrm{ReLU}\left( \sum_{k} \omega \alpha_k A^k \right),   (18)

where ω denotes the output score of the model, and A^k and α_k denote the kth feature map and the corresponding weight, which is presented in detail in Chattopadhay et al. (2018). L reflects the importance of the highlighted areas in the prediction, and these areas are related to the depression level.

To provide visual examples, Fig. 11 shows some sample heatmaps overlaid on facial images (left column), the corresponding heatmaps (middle column), and guided backpropagation (Springenberg et al., 2015) maps (right column) produced using the modified Grad-CAM++ method. To ensure clarity, we sampled the image from the middle of a dynamic image series and recolored the guided backpropagation maps. Unlike the MR-DepressNet proposed in Zhou, Jin, et al. (2020), our STA-DRN highlights areas that cover a range of facial features, including the frown area, eyes, nasolabial folds, and jaw. This aligns with the expectation that facial appearances related to depression detection are distributed across these areas. Specifically, we observe that:

1. Input samples with low depression scores tend to exhibit no specific or minimally activated features across the entire facial area, as depicted in Fig. 11(a) and (b). However, as the depression score increases, certain areas become more intensively activated, as demonstrated in Fig. 11(c), (d), and (e).

2. In the case of depression patients (samples with mild or higher levels), the highlighted features emerge in individual or combined areas, including the forehead, eye socket, nasolabial folds, cheek, and angulus oris. These areas exhibit a strong correlation with facial actions such as frowning and mouth curling. The regions associated with the most influential features are marked in red.

3. The guided backpropagation maps illustrate the comprehensive features captured by our model, concentrating not just on the shape and contour of facial features, but also on potential textures and subtle expressions. In contrast to non-depressed individuals, whose maps present a "plain" feature across the entire area (Fig. 11(a) and (b)), individuals with depression yield guided features with greater contrast (Fig. 11(c), (d), and (e)). This observation further substantiates the significance of distinct features in the context of depression recognition.

Fig. 11. Some examples of the visualization with heatmaps (left column: overlay heatmaps on images; middle column: original heatmaps) and guided backpropagation maps (right column). The red areas in the heatmaps signify a higher contribution to the final prediction, whereas the blue areas indicate a lower contribution, providing clues to recognize depression. The guided backpropagation maps further highlight areas of high activation in yellow and low activation in cyan.

Our STA-DRN model produces reasonable predictions guided by specific features, as observed above. The heatmap generated by the model is consistent with the findings of MR-DepressNet (Zhou, Jin, et al., 2020), but the features identified by our model are more diverse. MR-DepressNet focuses primarily on the eyes, which constitute one of the regions of its multi-region network. This highlights the importance of the eyes in facial depression recognition. However, the performance of MR-DepressNet is limited by its fixed focus regions. In contrast, the adaptive selection of spatial–temporal information in STA-DRN improves the diversity of the identified features.

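The weighting rule of Eq. (18) reduces to a small array operation once the feature maps A^k, the weights α_k, and the output score ω are available. Below is a minimal NumPy sketch; the α_k are assumed to be precomputed, since deriving them from gradients is detailed in Chattopadhay et al. (2018).

```python
import numpy as np

def depression_heatmap(feature_maps, alphas, score):
    """Eq. (18): L = ReLU(sum_k score * alpha_k * A^k).

    feature_maps: array of shape (K, H, W), the K maps A^k of the last layer
    alphas:       array of shape (K,), precomputed Grad-CAM++ weights
    score:        scalar, the model's predicted depression score (omega)
    """
    weighted = score * np.tensordot(alphas, feature_maps, axes=1)  # sum over k
    return np.maximum(weighted, 0.0)  # ReLU keeps positively contributing areas
```

The resulting low-resolution map is then upsampled to the input resolution and overlaid on the face image, as in Fig. 11.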

4.3.5. Comparison with previous works

First, we fine-tuned several popular facial recognition models, namely MobileFace (de Melo et al., 2022), FaceNet (Schroff, Kalenichenko, & Philbin, 2015), IRSE50 (Hu et al., 2020), and IR152 (He et al., 2016), and evaluated their performance on the AVEC 2013 and AVEC 2014 testing sets. The results are presented in Table 5. Our analysis indicates that while pre-trained facial recognition models hold potential for fine-tuning and use in depression recognition, there remains a performance gap between these models and STA-DRN. One reason for this disparity is that these models rely on single-frame input, lacking the ability to extract temporal information. In contrast, STA-DRN incorporates spatial–temporal modules that enhance feature extraction and recognition.

Table 5
Comparison with the fine-tuned performance of popular pre-trained facial recognition models.

  Model                             AVEC 2013        AVEC 2014
                                    MAE     RMSE     MAE     RMSE
  MobileFace (de Melo et al., 2022) 8.15    9.34     8.04    9.20
  IRSE50 (Hu et al., 2020)          8.34    9.86     8.25    9.67
  FaceNet (Schroff et al., 2015)    7.05    8.64     7.04    8.51
  IR152 (He et al., 2016)           6.57    8.42     6.64    8.29
  STA-DRN (Ours)                    6.15    7.98     6.00    7.75

The effectiveness of STA-DRN is demonstrated by testing it on the AVEC 2013 and AVEC 2014 testing sets and comparing it with previous work. The comparison with recent vision-based methods on the AVEC 2013 dataset is shown in Table 6. The results indicate that the MAE/RMSE of our method reaches 6.15/7.98. STA-DRN outperforms all the listed methods on MAE and reaches a competitive RMSE, except for MDN, as reported by its authors (Carneiro de Melo et al., 2021). However, the number of parameters in STA-DRN is 31.27M, which is less than that of MDN (52.25M). Although MDN can scale down to a model with 11.41M parameters, such downsizing significantly compromises its performance; and while MDN's larger parameter configuration yields improved performance, the corresponding increase in computational complexity and model size is not justified by the performance gain. In comparison with the CNN-based methods (de Melo et al., 2019a; Shang et al., 2021; Zhou, Jin, et al., 2020; Zhou, Wei, et al., 2020; Zhu et al., 2018) and the spatial–temporal methods (Jazaery & Guo, 2018; Uddin et al., 2020), our STA-DRN with an attention mechanism demonstrates superiority. Furthermore, STA-DRN achieves significantly better performance than the approaches with a visual attention mechanism (He et al., 2021; Niu et al., 2020).

Table 6
Comparison with previous vision-based methods on the AVEC 2013 testing set.

  Method                                                      MAE     RMSE
  Baseline (Valstar, Schuller, Smith, Eyben, et al., 2013)    10.88   13.61
  PHOG (Cummins et al., 2013)                                 –       10.45
  MHH + PLS (Meng et al., 2013)                               9.14    11.19
  LPQ (Kächele, Glodek, Zharkov, Meudt, & Schwenker, 2014)    8.97    10.82
  LPQ-TOP (Wen, Li, Guo, & Zhu, 2015)                         8.22    10.27
  DCNNs (Zhu et al., 2018)                                    7.58    9.82
  DPFV (He, Jiang, & Sahli, 2019)                             7.55    9.20
  RNN-C3D (Jazaery & Guo, 2018)                               7.37    9.28
  VSLF (Niu et al., 2020)                                     7.32    8.97
  VLDN (Uddin et al., 2020)                                   7.04    8.93
  DJ-LDML (Zhou, Wei, et al., 2020)                           6.63    8.37
  LGA-CNN-WSPP (He et al., 2021)                              6.59    8.39
  Global–Local C3D (de Melo et al., 2019a)                    6.40    8.26
  LQGDNet (Shang et al., 2021)                                6.38    8.20
  MR-DepressNet (Zhou, Jin, et al., 2020)                     6.20    8.28
  MDN (Carneiro de Melo et al., 2021)                         6.24    7.55
  STA-DRN (Ours)                                              6.15    7.98

Upon comparing STA-DRN with the vision-based approaches on AVEC 2014, as presented in Table 7, STA-DRN achieves a MAE/RMSE of 6.00/7.75. It is evident that our STA-DRN outperforms the CNN (de Melo et al., 2019a; Shang et al., 2021; Zhou, Jin, et al., 2020; Zhou, Wei, et al., 2020; Zhu et al., 2018) and spatial–temporal (Jan et al., 2018; Jazaery & Guo, 2018; Uddin et al., 2020) approaches on AVEC 2014 in terms of MAE and ranks second in terms of RMSE. Furthermore, our method prevails over the Local–Global Attention method (He et al., 2021) by incorporating temporal information into the attention mechanism. Moreover, our STA yields superior results compared to the visual spatial–temporal attention method VSLF (Niu et al., 2020), which also incorporates spatial–temporal information during feature extraction. It is important to note that the temporal vector in STA is generated using a simpler CNN structure, unlike VSLF, which employs both a CNN and an LSTM. Additionally, STA effectively preserves spatial information using a spatial attention vector, whereas VSLF employs a 3D CNN whose flattening of the spatial feature disrupts the spatial structure.

Table 7
Comparison with previous vision-based methods on the AVEC 2014 testing set.

  Method                                                      MAE     RMSE
  LGBP-TOP-SVM (Sidorov & Minker, 2014)                       11.20   13.87
  Baseline (Valstar, Schuller, Smith, Almaev, et al., 2014)   8.86    10.86
  INN (Cholet, Paugam-Moisy, & Regis, 2019)                   8.59    10.56
  LBP-EOH-LPQ (Jan, Meng, Gaus, Zhang, & Turabzadeh, 2014)    8.44    10.50
  LPQ-DFT (Kaya, Çilli, & Salah, 2014)                        8.20    10.27
  DCNNs (Zhu et al., 2018)                                    7.47    9.55
  RNN-C3D (Jazaery & Guo, 2018)                               7.22    9.20
  DPFV (He et al., 2019)                                      7.21    9.01
  VLDN (Uddin et al., 2020)                                   6.86    8.78
  DJ-LDML (Zhou, Wei, et al., 2020)                           6.59    8.30
  Global–Local C3D (de Melo et al., 2019a)                    6.59    8.31
  LGA-CNN-WSPP (He et al., 2021)                              6.51    8.30
  VSLF (Niu et al., 2020)                                     6.43    8.60
  FDHH (Jan, Meng, Gaus, & Zhang, 2018)                       6.68    8.01
  MR-DepressNet (Zhou, Jin, et al., 2020)                     6.21    8.39
  LQGDNet (Shang et al., 2021)                                6.08    7.84
  MDN (Carneiro de Melo et al., 2021)                         6.06    7.65
  STA-DRN (Ours)                                              6.00    7.75

5. Conclusion

We propose STA-DRN, a novel approach for recognizing depression from sequential facial video frames, utilizing the STA mechanism. Firstly, we introduce a spatial and a temporal attention module to effectively extract features by assigning adaptive weights, thereby enhancing the ability to capture both global and local information. Secondly, we propose a fusion strategy that combines the information from the spatial and temporal modules via the STA module. Consequently, STA-DRN, comprised of STA modules, is developed to improve feature extraction and relevance in depression recognition. Experiments on AVEC 2013 and AVEC 2014 validate the effectiveness of the proposed methodology, achieving competitive results with a MAE/RMSE of 6.15/7.98 on AVEC 2013 and 6.00/7.75 on AVEC 2014, respectively. Moreover, our visualization analysis not only demonstrates a specific pattern obtained from STA-DRN, but also reveals potential facial appearances that are indicative of varying levels of depression.

Although STA-DRN achieved satisfactory performance in recognizing depression from video data, it currently does not utilize audio signals. Incorporating speech features into the recognition model could potentially improve its performance, as has been demonstrated in recent multimodal approaches. Moreover, the AVEC datasets used to develop and test STA-DRN contain limited visual signal data. Thus, collecting and maintaining a larger and more diverse visual dataset is critical for further improving the model's accuracy and developing a better depression recognition model.

In the future, we intend to develop a multimodal approach that incorporates not only video data but also audio, electroencephalogram


(EEG), and magnetic resonance imaging (MRI), in order to fully utilize biological data. Our future research will focus on exploring joint feature representation of multimodal data, as well as constraints on multimodal features. Furthermore, we are interested in delving into the internal data flow of the multimodal deep learning network, with the objective of developing a novel visualization analysis.

CRediT authorship contribution statement

Yuchen Pan: Methodology, Software, Validation, Resources, Writing – original draft, Writing – review & editing. Yuanyuan Shang: Conceptualization, Validation, Writing – original draft, Writing – review & editing, Supervision, Project administration, Funding acquisition. Tie Liu: Methodology, Validation, Writing – review & editing. Zhuhong Shao: Methodology, Validation, Writing – review & editing. Guodong Guo: Validation, Writing – original draft, Writing – review & editing. Hui Ding: Methodology, Validation, Writing – review & editing. Qiang Hu: Software, Validation, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61876112, 61601311), and the Natural Science Foundation of Beijing (L201022).

References

Ackerman, I., Buchbinder, R., Chin, K., Cicuttini, F., Driscoll, T., Gall, S., .... GBD 2017 Disease and Injury Incidence and Prevalence Collaborators (2018). Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990–2017: a systematic analysis for the Global Burden of Disease Study 2017. The Lancet, 392(10159), 1789–1858. http://dx.doi.org/10.1016/S0140-6736(18)32279-7.

Bosch, A., Zisserman, A., & Munoz, X. (2007). Representing shape with a spatial pyramid kernel. In Proceedings of the 6th ACM international conference on image and video retrieval (pp. 401–408). New York, NY, USA: Association for Computing Machinery. http://dx.doi.org/10.1145/1282280.1282340.

Carneiro de Melo, W., Granger, E., & Hadid, A. (2020). A deep multiscale spatiotemporal network for assessing depression from facial dynamics. IEEE Transactions on Affective Computing, 1. http://dx.doi.org/10.1109/TAFFC.2020.3021755.

Chattopadhay, A., Sarkar, A., Howlader, P., & Balasubramanian, V. N. (2018). Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE winter conference on applications of computer vision (pp. 839–847). http://dx.doi.org/10.1109/WACV.2018.00097.

Cholet, S., Paugam-Moisy, H., & Regis, S. (2019). Bidirectional associative memory for multimodal fusion: a depression evaluation case study. In 2019 international joint conference on neural networks (pp. 1–6).

Cummins, N., Joshi, J., Dhall, A., Sethu, V., Goecke, R., & Epps, J. (2013). Diagnosis of depression by behavioural signals: A multimodal approach. In Proceedings of the 3rd ACM international workshop on audio/visual emotion challenge (pp. 11–20). New York, NY, USA: Association for Computing Machinery. http://dx.doi.org/10.1145/2512530.2512535.

de Melo, W. C., Granger, E., & Hadid, A. (2019a). Combining global and local convolutional 3D networks for detecting depression from facial expressions. In 2019 14th IEEE international conference on automatic face gesture recognition (pp. 1–8). http://dx.doi.org/10.1109/FG.2019.8756568.

de Melo, W. C., Granger, E., & Hadid, A. (2019b). Depression detection based on deep distribution learning. In 2019 IEEE international conference on image processing (pp. 4544–4548). http://dx.doi.org/10.1109/ICIP.2019.8803467.

Deng, J., Guo, J., Yang, J., Xue, N., Kotsia, I., & Zafeiriou, S. (2022). ArcFace: Additive angular margin loss for deep face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10), 5962–5979. http://dx.doi.org/10.1109/TPAMI.2021.3087709.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., .... Houlsby, N. (2021). An image is worth 16 × 16 words: Transformers for image recognition at scale. In 9th international conference on learning representations, ICLR 2021, virtual event, Austria, May 3-7, 2021. OpenReview.net. URL https://openreview.net/forum?id=YicbFdNTTy.

Fava, M., & Kendler, K. (2000). Major depressive disorder. Neuron, 28(2), 335–341. http://dx.doi.org/10.1016/s0896-6273(00)00112-4.

Fernandez, P. D. M., Peña, F. A. G., Ren, T. I., & Cunha, A. (2019). Feratt: Facial expression recognition with attention net. In 2019 IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 837–846). http://dx.doi.org/10.1109/CVPRW.2019.00112.

He, L., Chan, J. C.-W., & Wang, Z. (2021). Automatic depression recognition using CNN with attention mechanism from videos. Neurocomputing, 422, 165–175. http://dx.doi.org/10.1016/j.neucom.2020.10.015.

He, L., Jiang, D., & Sahli, H. (2019). Automatic depression analysis using dynamic facial appearance descriptor and Dirichlet process Fisher encoding. IEEE Transactions on Multimedia, 21(6), 1476–1486.

He, L., Niu, M., Tiwari, P., Marttinen, P., Su, R., Jiang, J., .... Dang, W. (2022). Deep learning for depression recognition with audiovisual cues: A review. Information Fusion, 80, 56–86. http://dx.doi.org/10.1016/j.inffus.2021.10.012.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE conference on computer vision and pattern recognition (pp. 770–778). http://dx.doi.org/10.1109/CVPR.2016.90.

Hu, J., Shen, L., Albanie, S., Sun, G., & Wu, E. (2020). Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(8), 2011–2023. http://dx.doi.org/10.1109/TPAMI.2019.2913372.

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd international conference on machine learning (pp. 448–456). JMLR.org.

Jan, A., Meng, H., Gaus, Y. F. B. A., & Zhang, F. (2018). Artificial intelligent system for automatic depression level analysis through visual and vocal expressions. IEEE Transactions on Cognitive and Developmental Systems, 10(3), 668–680. http://dx.doi.org/10.1109/TCDS.2017.2721552.

Jan, A., Meng, H., Gaus, Y. F. A., Zhang, F., & Turabzadeh, S. (2014). Automatic depression scale prediction using facial expression dynamics and regression. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 73–80). New York, NY, USA: Association for Computing Machinery. http://dx.doi.org/10.1145/2661806.2661812.

Jazaery, M. A., & Guo, G. (2018). Video-based depression level analysis by encoding deep spatiotemporal features. IEEE Transactions on Affective Computing, 1. http://dx.doi.org/10.1109/TAFFC.2018.2870884.

Kächele, M., Glodek, M., Zharkov, D., Meudt, S., & Schwenker, F. (2014). Fusion of audio-visual features using hierarchical classifier systems for the recognition of affective states and the state of depression. In Proceedings of the 3rd international conference on pattern recognition applications and methods (pp. 671–678). SciTePress, INSTICC. http://dx.doi.org/10.5220/0004828606710678.

Kaya, H., Çilli, F., & Salah, A. A. (2014). Ensemble CCA for continuous emotion prediction. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 19–26). New York, NY, USA: Association for Computing Machinery. http://dx.doi.org/10.1145/2661806.2661814.

Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In 2008 IEEE conference on computer vision and pattern recognition (pp. 1–8). http://dx.doi.org/10.1109/CVPR.2008.4587756.

Li, J., Liu, X., Zhang, M., & Wang, D. (2020). Spatio-temporal deformable 3D ConvNets with attention for action recognition. Pattern Recognition, 98, Article 107037. http://dx.doi.org/10.1016/j.patcog.2019.107037.

Li, X., Wang, W., Hu, X., & Yang, J. (2019). Selective kernel networks. In 2019 IEEE/CVF conference on computer vision and pattern recognition (pp. 510–519). http://dx.doi.org/10.1109/CVPR.2019.00060.

Loshchilov, I., & Hutter, F. (2016). SGDR: Stochastic gradient descent with warm restarts. arXiv e-prints, arXiv:1608.03983.

Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar, Z., & Matthews, I. (2010). The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE computer society conference on computer vision and pattern recognition - workshops (pp. 94–101). http://dx.doi.org/10.1109/CVPRW.2010.5543262.

Carneiro de Melo, W., Granger, E., & Bordallo Lopez, M. (2021). MDN: A deep maximization-differentiation network for spatio-temporal depression detection. IEEE Transactions on Affective Computing, 1.

Meng, H., Huang, D., Wang, H., Yang, H., Al-Shuraifi, M., & Wang, Y. (2013). Depression recognition based on dynamic facial and vocal expression features using partial least square regression. In Proceedings of the 3rd ACM international workshop on audio/visual emotion challenge (pp. 21–30). New York, NY, USA: Association for Computing Machinery. http://dx.doi.org/10.1145/2512530.2512532.

Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th international conference on machine learning (pp. 807–814). Madison, WI, USA: Omnipress.

Niu, M., Liu, B., Tao, J., & Li, Q. (2021). A time-frequency channel attention and vectorization network for automatic depression level prediction. Neurocomputing, 450, 208–218. http://dx.doi.org/10.1016/j.neucom.2021.04.056.

Niu, M., Tao, J., Liu, B., Huang, J., & Lian, Z. (2020). Multimodal spatiotemporal representation for automatic depression level detection. IEEE Transactions on Affective Computing, 1. http://dx.doi.org/10.1109/TAFFC.2020.3031345.

Ojala, T., Pietikainen, M., & Maenpaa, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987. http://dx.doi.org/10.1109/TPAMI.2002.1017623.

Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE conference on computer vision and pattern recognition (pp. 815–823). http://dx.doi.org/10.1109/CVPR.2015.7298682.

Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE international conference on computer vision (pp. 618–626). http://dx.doi.org/10.1109/ICCV.2017.74.

Shang, Y., Pan, Y., Jiang, X., Shao, Z., Guo, G., Liu, T., & Ding, H. (2021). LQGDNet: A local quaternion and global deep network for facial depression recognition. IEEE Transactions on Affective Computing, 1. http://dx.doi.org/10.1109/TAFFC.2021.3139651.

Sidorov, M., & Minker, W. (2014). Emotion recognition and depression diagnosis by acoustic and visual features: A multimodal approach. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 81–86). New York, NY, USA: Association for Computing Machinery. http://dx.doi.org/10.1145/2661806.

Wu, Z., Wen, J., Xu, Y., Yang, J., & Zhang, D. (2021). Multiple instance detection networks with adaptive instance refinement. IEEE Transactions on Multimedia, 1. http://dx.doi.org/10.1109/TMM.2021.3125130.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., .... Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd international conference on machine learning (pp. 2048–2057). JMLR.org.

Yan, G., & Woźniak, M. (2022). Accurate key frame extraction algorithm of video action for aerobics online teaching. Mobile Networks and Applications, 27(3), 1252–1261. http://dx.doi.org/10.1007/s11036-022-01939-1.

Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Lin, H., Zhang, Z., .... Smola, A. (2020). ResNeSt: Split-attention networks. arXiv e-prints, arXiv:2004.08955.

Zhou, X., Jin, K., Shang, Y., & Guo, G. (2020). Visually interpretable representation learning for depression recognition from facial images. IEEE Transactions on Affective Computing, 11(3), 542–552. http://dx.doi.org/10.1109/TAFFC.2018.2828819.

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In 2016 IEEE conference on computer vision and pattern recognition (pp. 2921–2929). http://dx.doi.org/10.1109/CVPR.2016.319.

Zhou, X., Wei, Z., Xu, M., Qu, S., & Guo, G. (2020). Facial depression recognition by deep joint label distribution and metric learning. IEEE Transactions on Affective Computing, 1. http://dx.doi.org/10.1109/TAFFC.2020.3022732.

Zhu, Y., Shang, Y., Shao, Z., & Guo, G. (2018). Automated depression diagnosis based on deep networks to encode facial appearance and dynamics. IEEE Transactions on Affective Computing, 9(4), 578–584. http://dx.doi.org/10.1109/TAFFC.2017.2650899.

Yuchen Pan received his B.E. degree from the College
2661816.
of Computer Science and Technology, Harbin Engineering
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: a dataset of 101 human actions University in 2019, And received the M.S. degree from the
classes from videos in the wild. CoRR abs/1212.0402 URL http://arxiv.org/abs/ College of Information Engineering, Capital Normal Univer-
1212.0402. sity in 2022. He is now with the Intelligent Recognition and
Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. A. (2015). Striving for Image Processing Laboratory of Capital Normal University,
simplicity: The all convolutional net. CoRR abs/1412.6806. working on depression recognition in the field of computer
Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotem- vision and deep learning. He has authored papers in peer
poral features with 3D convolutional networks. In 2015 IEEE international conference reviewed journals including IEEE Transactions on Affective
on computer vision (pp. 4489–4497). http://dx.doi.org/10.1109/ICCV.2015.510. Computing and licensed China patents. His research interest
Uddin, M. A., Joolee, J. B., & Lee, Y. (2020). Depression level prediction using deep includes computer vision, digital image processing and deep
spatiotemporal features and multilayer bi-LTSM. IEEE Transactions on Affective learning.
Computing, 1. http://dx.doi.org/10.1109/TAFFC.2020.2970418.
Valstar, M., Schuller, B., Smith, K., Almaev, T., Eyben, F., Krajewski, J., .... Pantic, M.
(2014). AVEC 2014: 3D dimensional affect and depression recognition challenge. Yuanyuan Shang received the Ph.D. degree from Chinese
In Proceedings of the 4th international workshop on audio/visual emotion challenge Academy of Sciences in 2005. She is currently a professor
(pp. 3–10). New York, NY, USA: Association for Computing Machinery, http: and vice dean of the Graduate School, Capital Normal Uni-
versity, Beijing, P.R. China. Her research interests include
//dx.doi.org/10.1145/2661806.2661807.
computer vision and medical image processing. She has
Valstar, M., Schuller, B., Smith, K., Eyben, F., Jiang, B., Bilakhia, S., .... Pantic, M.
authored more than 70 scientific papers in peer-reviewed
(2013). AVEC 2013: The continuous audio/visual emotion and depression recogni-
journals and conferences, including some top venues such
tion challenge. In Proceedings of the 3rd ACM international workshop on audio/visual
as the IEEE Transactions on Pattern Analysis and Machine
emotion challenge (pp. 3–10). New York, NY, USA: Association for Computing
Intelligence, IEEE Transactions on Affective Computing,
Machinery, http://dx.doi.org/10.1145/2512530.2512533.
CVPR and ACM Multimedia. She is currently serving as vice
Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local neural networks. In 2018 president of Beijing Artificial Intelligent Society.
IEEE/CVF conference on computer vision and pattern recognition (pp. 7794–7803).
http://dx.doi.org/10.1109/CVPR.2018.00813.
Wang, H., Wang, Z., Du, M., Yang, F., Zhang, Z., Ding, S., Mardziel, P., & Hu, X. (2020). Tie Liu received the BS, MS, and PhD degrees from Xian
Score-CAM: Score-weighted visual explanations for convolutional neural networks. Jiaotong University, in 2001, 2004, and 2007, respectively.
In 2020 IEEE/CVF conference on computer vision and pattern recognition workshops He is currently an associate professor in the College of In-
(pp. 111–119). http://dx.doi.org/10.1109/CVPRW50498.2020.00020. formation Engineering, Capital Normal University. His areas
Wen, L., Li, X., Guo, G., & Zhu, Y. (2015). Automated depression diagnosis based on of interest include machine learning, pattern recognition,
facial dynamic analysis and sparse coding. IEEE Transactions on Information Forensics multimedia computing, and computer vision. He is also
and Security, 10(7), 1432–1441. http://dx.doi.org/10.1109/TIFS.2015.2414392. interested in data analysis and mining.
Wieczorek, M., Siłka, J., Woz̀niak, M., Garg, S., & Hassan, M. M. (2022). Lightweight
convolutional neural network model for human face detection in risk situations.
IEEE Transactions on Industrial Informatics, 18(7), 4820–4829. http://dx.doi.org/10.
1109/TII.2021.3129629.
Woo, S., Park, J., Lee, J.-Y., & Kweon, I. S. (2018). CBAM: Convolutional block attention
module. In V. Ferrari, M. Hebert, C. Sminchisescu, & Y. Weiss (Eds.), Computer vision Zhuhong Shao received the B.S. degree in Biomedical
(pp. 3–19). Cham: Springer International Publishing. Engineering from Jilin Medical University, Jilin, China, in
Woźniak, M., Siłka, J., & Wieczorek, M. (2021). Deep learning based crowd counting 2009, and the M.S. degree in Electrical Engineering from
model for drone assisted systems. In Proceedings of the 4th ACM mobicom workshop Beijing Jiaotong University, Beijing China, in 2011 and the
on drone assisted wireless communications for 5G and beyond (pp. 31–36). New Ph.D. degree in Computer Science and Technology from
York, NY, USA: Association for Computing Machinery, http://dx.doi.org/10.1145/ Southeast University, Nanjing, China, in 2015. He is cur-
3477090.3481054. rently an associate professor in the College of Information
Wu, Z., Liu, C., Wen, J., Xu, Y., Yang, J., & Li, X. (2023). Selecting high-quality Engineering, Capital Normal University, Beijing, China. His
research interest includes computer vision and multimedia
proposals for weakly supervised object detection with bottom-up aggregated
information security.
attention and phase-aware loss. IEEE Transactions on Image Processing, 32, 682–693.
http://dx.doi.org/10.1109/TIP.2022.3231744.

11
Y. Pan et al. Expert Systems With Applications 237 (2024) 121410

Guodong Guo received his B.E. degree in Automation from Tsinghua University, Beijing, China, in 1991, the Ph.D. degree in Pattern Recognition and Intelligent Control from the Chinese Academy of Sciences in 1998, and the Ph.D. degree in Computer Science from the University of Wisconsin-Madison in 2006. He is currently an Associate Professor in the Lane Department of Computer Science and Electrical Engineering, West Virginia University. In the past, he visited and worked in several places, including INRIA, Sophia Antipolis, France; Ritsumeikan University, Japan; Microsoft Research, China; and North Carolina Central University. He won the North Carolina State Award for Excellence in Innovation in 2008, and Outstanding New Researcher of the Year (2010–2011) at CEMR, WVU. His research areas include computer vision, machine learning, and multimedia. He has authored a book, Face, Expression, and Iris Recognition Using Learning-based Approaches (2008), published over 60 technical papers on face, iris, expression, and gender recognition, age estimation, multimedia information retrieval, and image analysis, and filed three patents on iris and texture image analysis. He is an editorial board member of IET Biometrics and a senior member of the IEEE.

Hui Ding received her Ph.D. degree from the School of Information Science and Technology, Beijing Institute of Technology, China, in 2006. She is currently an Associate Professor with the College of Information Engineering, Capital Normal University, Beijing, China. She has authored over 30 scientific papers in peer-reviewed journals and conferences. Her research interests include computer vision, medical image processing, and machine learning.

Qiang Hu received his Ph.D. degree from Shanghai Jiaotong University in 2022. He is currently working at the Department of Psychiatry, ZhenJiang Mental Health Center, Zhenjiang, Jiangsu, China. His research interests include the conduct of systematic reviews and the treatment of patients with psychoses.
