
Neural Networks 153 (2022) 120–129

Contents lists available at ScienceDirect

Neural Networks
journal homepage: www.elsevier.com/locate/neunet

Reducing noisy annotations for depression estimation from facial images

Lang He a,b,c, Prayag Tiwari d,∗, Chonghua Lv e, WenShuai Wu f, Liyong Guo g
a School of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an, 710121, Shaanxi, China
b Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi'an University of Posts and Telecommunications, Xi'an, 710121, Shaanxi, China
c Xi'an Key Laboratory of Big Data and Intelligent Computing, Xi'an University of Posts and Telecommunications, Xi'an, 710121, Shaanxi, China
d Department of Computer Science, Aalto University, Espoo, Finland
e School of Artificial Intelligence, Xidian University, Xi'an, 710121, Shaanxi, China
f School of Telecommunication and Information Engineering, Xi'an University of Posts and Telecommunications, Xi'an, 710121, Shaanxi, China
g Xiaomi Technology, Beijing, China

article info

Article history:
Received 25 November 2021
Received in revised form 17 April 2022
Accepted 25 May 2022
Available online 3 June 2022

Keywords:
Depression
Self-adaptation network (SAN)
Affective computing
Noisy labels

abstract

Depression has been considered the most dominant mental disorder over the past few years. To help clinicians effectively and efficiently estimate the severity scale of depression, various automated systems based on deep learning have been proposed. To estimate the severity of depression, i.e., the depression severity score (Beck Depression Inventory-II, BDI-II), various deep architectures have been designed to perform regression using the Euclidean loss. However, they do not consider the label distribution, and they do not learn the relationships between the facial images and BDI-II scores, which can result in noisy labeling for automatic depression estimation (ADE). To mitigate this problem, we propose an automated deep architecture, namely the self-adaptation network (SAN), to improve this uncertain labeling for ADE. Specifically, the architecture consists of four modules: (1) ResNet-18 and ResNet-50 are adopted in the deep feature extraction module (DFEM) to extract informative deep features; (2) a self-attention module (SAM) is adopted to learn the weights from the mini-batch; (3) a square ranking regularization module (SRRM) is proposed to create high partitions and low partitions; and (4) a re-label module (RM) is used to re-label the uncertain annotations for ADE in the low partitions. We conduct extensive experiments on depression databases (i.e., AVEC2013 and AVEC2014) and obtain a performance comparable to the performances of other ADE methods in assessing the severity of depression. More importantly, the proposed method can learn valuable depression patterns from facial videos and obtain a performance comparable to the performances of other methods for depression recognition.

© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Due to the increasing pressures of life, a growing number of people are suffering from depression. According to a report released by the World Health Organization (WHO), depression will become the most common mental disorder by 2030 (Lanillos et al., 2020; Mathers & Loncar, 2006). In severe cases, depression leads to suicide (Hawton, i Comabella, Haw, & Saunders, 2013). It was found that about 50% of suicides are related to depression (Hawton et al., 2013; Senn et al., 1996). Unfortunately, there are no impactful clinical patterns for the diagnosis of depression, which makes the diagnosis of depression complicated and subjective (Maj et al., 2020).

Therefore, various researchers from the affective computing field have attempted to use audiovisual cues to help psychologists or clinicians to diagnose the severity of depression. Traditional machine learning approaches for estimating the severity of depression often consist of the following three steps: (i) extracting hand-crafted features, (ii) aggregating the learned features, and

∗ Corresponding author.
E-mail addresses: [email protected] (L. He), [email protected] (P. Tiwari), [email protected] (C. Lv).

https://doi.org/10.1016/j.neunet.2022.05.025
0893-6080/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Table 1
BDI-II score ranges and depression severity.

BDI-II score range    Depression severity
0–13                  None or minimal
14–19                 Mild
20–28                 Moderate
29–63                 Severe

(iii) performing regression (or classification). From the perspective of feature extraction, methods for automatic depression estimation (ADE)1 based on audiovisual information can be classified into two categories: those that use hand-crafted features and those that use deep-learned features. Evidence has shown that hand-crafted features can obtain comparable performance for ADE (He, Jiang, & Sahli, 2018; Wen, Li, Guo, & Zhu, 2015). Fortunately, deep learning (DL) has elevated the field of ADE (de Melo, Granger, & Hadid, 2019b, 2020; de Meto, Granger, & Lopez, 2020; Du, Li, Huang, & Wang, 2019; He & Cao, 2018; He, Chan and Wang, 2021; He et al., 2022; Kang, Jiang, Yin, Shang, & Zhou, 2017; Niu, Liu, Tao, & Li, 2021; Niu, Tao, Liu, & Fan, 2019; Niu, Tao, Liu, Huang, & Lian, 2020; Song, Jaiswal, Shen, & Valstar, 2020; Uddin, Joolee, & Lee, 2020; Zhou, Jin, Shang, & Guo, 2020; Zhu, Shang, Shao, & Guo, 2017). However, due to the difficulty of annotation for depression databases, there exist certain uncertainties in the data samples. These uncertainties can result in incorrect labels, which cause ADE to have a lower performance. Specifically, the use of data samples with uncertain labels for ADE may bring about the following problems. First, overfitting may occur during the training of the deep models. The reason for this is that the uncertain labels can negatively affect the training of the deep models. Second, irrelevant facial features may be learned due to incorrect labels. This is because uncertain labels can result in the extraction of inaccurate discriminative features for ADE.

Created in 2013 and 2014, AVEC2013 (Valstar et al., 2013) and AVEC2014 (Valstar et al., 2014) were meant to be used to estimate depression scores, i.e., the Beck Depression Inventory-II (see Table 1), based on the collected audiovisual clips. This can be viewed as a regression problem. However, if we adapt pre-trained deep models trained on large-scale image datasets for ADE, the following problems occur. First, the image classification task has been changed to regression for ADE, which may bring about overfitting on the small depression datasets. Second, this method is not good for learning discriminative facial features for ADE.

Furthermore, it is considered that facial expression, speech, and semantic information occupy 55%, 38%, and 7%, respectively, of the total information obtained from affective computing (Mehrabian, 2017). In addition, due to the backgrounds of different annotators of the depression level, the labels of the depression videos may be noisy, which can make the trained deep evaluation models inefficient for ADE. To mitigate the issues mentioned above, motivated by the work of He et al. (2022) and Wang, Peng, Yang, Lu, and Qiao (2020), an automated deep architecture called the self-adaptation network (SAN) is proposed to mitigate the uncertain labels for ADE. Specifically, as illustrated in Fig. 2, the SAN consists of four modules: (1) a deep feature extraction module (DFEM); (2) a self-attention module (SAM); (3) a square ranking regularization module (SRRM); and (4) a re-label module (RM). Due to the advantages of ResNets (i.e., ResNet-18, ResNet-50) (He, Zhang, Ren, & Sun, 2016), discriminative feature representations are extracted from ResNets. Then the SAM aims to learn a weight from every image to weight the importance of a batch. After the attention weighting operation, the lower weight is assigned to the uncertain facial images. Moreover, the SRRM ranks the learned weights in descending order, divides them into two partitions (i.e., a high partition and a low partition), and regularizes the two partitions by adding a threshold between the square means of the weights of the two partitions. To learn the discriminative features, a novel loss function termed square ranking regularization loss (SRR-Loss) is proposed to implement the regularization operation. The SRRM can automatically learn the significant weights to underline certain images with accurate labels and to hold back images with unreliable labels. The RM can re-label the data samples from the low partitions by comparing the maximum predicted probabilities to the probabilities of the given labels. Then, each facial image is re-labeled with a pseudo-annotation when the maximum prediction probability is greater than that of the given label by a margin value. To the best of our knowledge, our study is the first to attempt to address the noisy annotations in the ADE task. Therefore, our goal is to provide motivation to the affective computing field for ADE. If the image labels are noisy, we can find the best way to re-label the images. To do this, we need to perform feature extraction, train the model, find the uncertain labels, and provide accurate labels for the uncertain images. More importantly, the proposed simple architecture is vastly different from complex deep architectures, e.g., Carneiro de Melo, Granger, and Bordallo Lopez (2021) and de Melo et al. (2020), etc. We do not consider the design of deep learning architectures and only focus on the problem of noisy labels for ADE. Though the proposed architecture is simple for ADE, it is a discriminative model designed from the perspective of the existing problems of ADE tasks. Based on the illustration of the proposed method, one can note that the proposed method (1) extracts the representative features, (2) relabels the samples with noisy labels, and (3) provides the pseudo-labels for the images and re-trains the deep models for depression recognition.

1.1. Contributions

This study makes the following three contributions:

1. We consider the regression of AVEC2013 and AVEC2014 as a classification problem for ADE. Therefore, the model can efficiently capture the latent patterns between the facial images and the BDI-II scores.
2. We pose the problem of noisy and ambiguous labels in ADE and propose a novel framework, SAN, that effectively learns the facial patterns to reduce the uncertainties of annotation.
3. Extensive experiments are performed on the AVEC2013 and AVEC2014 databases, and our method achieves a promising performance for ADE. Furthermore, the introduced method can learn valuable depression patterns from the facial videos.

1.2. Organizations

The remainder of this paper is structured as follows. We briefly discuss the main studies based on visual cues for ADE in Section 2. We describe the proposed architecture in detail in Section 3. The adopted databases and experimental results are shown in Section 4. In Section 5, conclusions and future work are described.

1 In the present paper, we consider the ADE problem as a regression scheme on the AVEC2013 and AVEC2014 databases from the perspective of machine learning.
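The re-labeling loop sketched in the introduction (extract features, weight them with attention, partition the weights, then re-label the low partition) can be illustrated with a minimal NumPy sketch. Every array below is a toy stand-in: the real pipeline uses ResNet features and a learned FC attention layer, and only the sigmoid weighting and the 80/20 split follow the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the SAN flow (illustrative assumptions, not real data)
N = 10                               # mini-batch size
feats = rng.normal(size=(N, 8))      # "deep features" from the DFEM
w_a = rng.normal(size=8)             # attention weights of the FC layer

# SAM: one importance weight per image via a sigmoid
beta = 1.0 / (1.0 + np.exp(-feats @ w_a))

# SRRM: rank the weights in descending order and split into high/low
# partitions with the 80%/20% threshold used in the paper's experiments
order = np.argsort(-beta)
M = int(0.8 * N)
high_idx, low_idx = order[:M], order[M:]

# The RM then operates only on the low partition (re-label candidates)
print(beta[high_idx].mean() > beta[low_idx].mean())  # prints True
```

By construction the high partition holds the most confidently labeled images, so the RM never touches them; only the bottom 20% of each batch is eligible for a pseudo-label.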

2. Related works

In order to assist clinicians in diagnosing the severity of depression, various studies have emerged from the affective computing field. Algorithms and frameworks have been designed by these affective computing researchers. From the perspective of feature extraction, the categories of hand-crafted features and deep-learned features are discussed. To perform a comprehensive review of ADE, we describe the above two categories based on visual cues as follows.

Algorithm 1: The proposed SRRM algorithm

Input: the input features X; the number of correct images M; the number of uncertain images N − M. β denotes the learned weights, Wa denotes the learned attention weights of the FC layer, and φ represents the activation function, i.e., the sigmoid activation function.
Output: BDI-II score D = [D1, D2, ..., Dm].
1: Initialization
2: β = φ(Wa^T X)
3: j → 0
4: Divide the weights into two partitions with a threshold ε, i.e., βHigh and βlow.
5: while j < N − M do
6:   βlow = Sum(βj)/(N − M)
7:   j = j + 1
8: end while
9: Then, re-label the βlow partition.

2.1. Hand-crafted features for ADE

In this subsection, hand-crafted features for ADE are reviewed. In the AVEC2013 (Valstar et al., 2013) and AVEC2014 (Valstar et al., 2014) depression sub-challenges, two hand-crafted video features are adopted, i.e., local phase quantization (LPQ) and local Gabor binary patterns from three orthogonal planes (LGBP-TOP), for the ADE task. He, Jiang, and Sahli (2015) proposed a multimodal architecture for ADE. In this architecture, various hand-crafted features are extracted, i.e., low-level descriptor (LLD) features, LGBP-TOP, space-time interest point (STIP), and local kinematic features via the divergence-curl-shear descriptors. After feature extraction, the motion history histogram (MHH), bag of words (BOW), and vector of locally aggregated descriptors (VLAD) methods are used to aggregate the discriminative features of ADE. To obtain a promising performance with the adopted features, feature fusion and model fusion by local linear regression (LLR) are used. In Wen et al. (2015), the authors adopt LPQ-TOP features, sparse coding, and discriminative mapping to model the dynamics for ADE. He et al. (2018) proposed a novel framework for estimating a depression severity score, i.e., BDI-II, from facial expression features. The median robust local binary patterns from three orthogonal planes (MRLBP-TOP) method is proposed to learn the microstructure and macrostructure of the facial regions. After the feature extraction step, a new feature aggregation method called the Dirichlet process Fisher vector (DPFV) is proposed. The MRLBP-TOP method is an extension of LBP-TOP (Dhall & Goecke, 2015) for ADE.

2.2. Deep-learned features for ADE

The architectures for ADE, as mentioned earlier, mostly focused on hand-crafted features. They leverage domain knowledge to design hand-crafted features, resulting in some limitations: (1) developing hand-crafted features requires the consideration of certain factors (e.g., domain knowledge, time). For example, LBP-TOP features have been used for ADE and provide a reasonable accuracy (Dhall & Goecke, 2015). However, to develop features like the LBP-TOP features, one must possess depression-specific knowledge. (2) Some valuable depression patterns cannot be learned well. Fortunately, due to the benefits of DL technology, deep-learned features show a promising performance for ADE. In the following paragraphs, we give a summary of deep-learned approaches for video-based ADE.

In 2017, Zhu et al. (2017) fine-tuned deep models (GoogLeNet) pre-trained on the large CASIA facial database for ADE. The main idea of this work (Zhu et al., 2017) is to utilize a 2D convolutional neural network (2D-CNN) to extract informative representations for ADE. Later, Zhou et al. (2020) introduced a framework based on a 2D-CNN for capturing the features from images for ADE. Several 2D-CNN architectures (AlexNet, ResNet, VGGNet, etc.) were pre-trained on the CASIA database. In addition, the works (de Melo, Granger, & Hadid, 2019a; de Meto et al., 2020; He, et al., 2021; Kang et al., 2017; Zhou et al., 2020; Zhu et al., 2017) also used 2D-CNNs to extract discriminative features for ADE. To summarize these works, consider that they possess the following common features: (i) they pre-train their deep architectures on a large-scale database (e.g., CASIA, VGG Webface, etc.), and (ii) they increase the accuracy of the pre-trained models by fine-tuning them on depression databases, e.g., AVEC2013 and AVEC2014. He, Chan et al. (2021) tried to train deep architectures from scratch to mitigate the ADE problem.

However, the works (de Melo et al., 2019a; de Meto et al., 2020; Gao, Liu, & Ju, 2020; He, Chan et al., 2021; Jiang et al., 2021, 2021; Kang et al., 2017; Yang et al., 2021; Zhou et al., 2020; Zhu et al., 2017) adopted 2D-CNNs to represent the deep-learned features, even though 2D-CNNs only capture spatial features for computer vision tasks. Therefore, various works have tried to adopt 3D convolutional neural networks (3D-CNNs) to capture the representative features for ADE from video sequences. Al Jazaery and Guo (2021) used a fusion of recurrent neural networks (RNN) and 3D convolutional networks (C3D) to learn spatio-temporal features at two different scales from video clips for ADE. It was found that C3D architectures can learn spatio-temporal patterns from full-face and local regions, and 3D global average pooling (3D-GAP) was used to pool the local and global features for ADE (de Meto et al., 2020). de Melo et al. (2020) also designed a novel 3D framework called the multi-scale spatio-temporal network (MSN). In addition, Carneiro de Melo et al. (2021) and He, Guo, Tiwari, Pandey and Dang (2021) also used a 3D-CNN to learn depression patterns for ADE. A 3D-CNN and LSTM were also used to model sequence information for ADE (Chao, Tao, Yang, & Li, 2015; Uddin et al., 2020).

One can note that the aforementioned works focus on extracting discriminative hand-crafted and deep-learned features, and do not consider the noisy and uncertain labels of the video clips for ADE. Therefore, we borrow the advantage of the deep architectures to adapt the noisy labels so that they more accurately reflect the characteristics of the video clips.

3. Our architecture

In this part, we first describe a skeleton of the proposed architecture and introduce the deep feature scheme. The self-attention weight learning module is introduced in Section 3.3, which is followed by a description of square ranking regularization in Section 3.4. Lastly, we introduce the re-labeling operation in Section 3.5.

Fig. 1. Illustration of the proposed pipeline for ADE. Training images are used to train the deep models, and testing images are used to validate the performance of
the deep models.

Fig. 2. The detailed flow of the proposed framework SAN for visual-based ADE. The SAN consists of four modules: (1) a deep feature extraction module (DFEM);
(2) self attention module (SAM); (3) square ranking regularization module (SRRM); (4) re-label module (RM). Due to privacy concerns, we blur the samples with a
white rectangle. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

3.1. Architecture overview

Fig. 1 depicts the main structure of the proposed method. We first use the training data to train the deep models, and then we use the test data to validate the deep models. The detailed architecture of the method is illustrated in Fig. 2. First, the OpenFace toolkit (Baltrušaitis, Robinson, & Morency, 2016) is used to pre-process the videos into facial images. Then, ResNet-18 and ResNet-50 are each considered as potential backbone networks for feature extraction; they are used to extract informative features for ADE. After the feature extraction procedure, the SRRM (cf. Algorithm 1) can provide a learned weight for every image using a sigmoid function and a fully-connected (FC) layer. To reduce the effect of uncertain facial images, a square ranking regularization module is adopted to rank and regularize the learned weights. In the SRRM, the attention weights are ranked in descending order of importance, and grouped into two partitions, i.e., a high partition and a low partition. To further restrict the means of the weights between the high and low partitions, a square ranking regularization loss is proposed. Lastly, a re-labeling operation is performed by the RM to alter the ambiguous labels of the images in the low partition. In fact, our goal is to re-adjust the labels so that they approach the true values (i.e., BDI-II scores summed over a certain period of time) of the severity of the depression experienced by the participants. Thus, some of the certain labels can be re-validated, and uncertain images can also be re-labeled to improve the performance of the deep ADE model in classifying the facial images. More importantly, the introduced deep models can be run in an end-to-end manner for ADE.

3.2. Extraction of deep-learned features

Due to the small size of depression databases, transfer learning is used for ADE. The goal of transfer learning is to fine-tune a model that is pre-trained on a source task to fit a new target task. For instance, one can consider fine-tuning an image classification model that is pre-trained on the ImageNet dataset for ADE using a small depression database. There exist two strategies for transferring the model from the source task to the target task. The first is to consider the pre-trained deep model as a feature extractor, whose learned weights are not adapted for the new task. In general, a classifier is added on top of the deep architecture to do classification. The second strategy is to fine-tune the entire network or a subset of it for a new task. Thus, the weights of the pre-trained deep models are viewed as the initial values for the new study, and they are updated during the training procedure.

For the ADE task, due to the small size of the depression databases, we use pre-trained deep models (ResNet-18 and ResNet-50 2) as feature extractors to obtain the discriminative features. To extract informative features, in this study, ResNet-18

2 For convenience, we take ResNet-18 as an example in the present paper.

Fig. 3. Illustration of the ResNet-18 architecture (He et al., 2016). The ResNet-18 architecture consists of 16 convolution layers, 2 down-sampling layers, and 2 fully
connected layers.

is adopted for feature extraction (see the purple dashed rectangle in Fig. 3). As illustrated in Fig. 3, 16 convolution layers, 2 down-sampling layers, and 2 fully connected layers make up ResNet-18. For ResNet-18, the input size of the first convolution layer is 224 × 224, with a kernel size of 7 × 7; 3 × 3 is the kernel size of the other layers. Following the last convolution layer, an eigenvector is generated by a fully connected layer. Meanwhile, ResNet-50 has essentially the same architecture as ResNet-18; the main difference is that it has more layers.

3.3. Self-attention feature learning

After the deep discriminative features are extracted, the self-attention mechanism is used to learn the importance weights. In our task, as shown in Fig. 2 (the blue dashed rectangle), we expect certain facial images to have higher weights, while uncertain images should have lower weights. Formally, let Fea = [y1, y2, ..., yN] ∈ R^{D×N} be the learned deep features from N facial images, where Fea represents the input features. The SAM takes Fea as input and outputs the attention weight of every feature. To learn more effective contexts that will be used in the following steps, a linear fully-connected layer and an activation function, i.e., the sigmoid activation function, are adopted. Mathematically, the operation can be written as (cf. Eq. (1)):

βj = φ(Wa^T xj)    (1)

where βj denotes the learned weight of the jth feature, Wa denotes the learned attention weights of the FC layer, and φ represents the activation function, i.e., the sigmoid activation function. By using the self-attention mechanism, the different scale weights are learned; this provides the foundation for the SRRM and RM.

Motivated by Hu, Huang, Zhang, and Li (2019) and Wang et al. (2020), the logit-weighted loss function is adopted. Mathematically, the loss function ℓatt of the SAM can be expressed as follows:

ℓatt = −(1/N) ∑_{j=1}^{N} log( exp(αj W_{Zj}^T xj) / ∑_{k=1}^{C} exp(αj W_k^T xj) )    (2)

where Wk represents the kth classifier.

3.4. Square ranking regularization

Using the attention mechanism, the self-attention weights (which range from 0 to 1) were learned during the attention learning step. To further improve the contributions of the uncertain samples, a square ranking regularization module is proposed to rank the weights in descending order of importance. Then, the SRRM divides the weights into two partitions with a threshold ε. After performing the regularization operation, the mean value of the high-partition weights is higher than that of the low-partition weights by some margin. Therefore, to perform discriminative segmentation, the ranking regularization loss ℓSRRM can be written as follows (cf. Eq. (3)):

ℓSRRM = max{0, θ1 − (βHigh − βlow)}    (3)

and

βHigh = (1/M) ∑_{j=0}^{M} βj,    βlow = (1/(N − M)) ∑_{j=M}^{N} βj    (4)

where θ1 denotes a margin value that can be learned from the deep models, and βHigh and βlow represent the average values of the high partition with (ε ∗ N = M) data samples and the low partition with (N − M) data samples, respectively. Finally, we learn the deep discriminative features by jointly minimizing the loss functions ℓSRRM and ℓatt as follows:

ℓall = λℓSRRM + (1 − λ)ℓatt    (5)

where λ (λ > 0) represents a trade-off threshold between the loss functions ℓSRRM and ℓatt.

3.5. Relabeling noisy annotations

In the SRRM, every mini-batch is segmented into two partitions, i.e., a high partition and a low partition. By conducting a set of experiments on the proposed framework, we find that there are uncertain data samples with low learned weights. To address this problem, we consider that these data samples have ambiguous labels for model learning.

Thus, to achieve the goal of the present work and to address the problem of noisy labels in ADE, we only consider the low-partition samples that have weights of lower importance. Consequently, softmax probabilities are adopted. For each image, the maximum value Pmax of the predicted probabilities and the predicted probability Ppre of the provided label Pglabel are compared. Then, a low-partition sample is given a pseudo-label Ppse when Pmax is greater than Ppre by a threshold value. Mathematically, the RM can be written as (cf. Eq. (6)):

Ppse = { Pindex,   if Pmax − Ppre > θ2
       { Pglabel,  otherwise    (6)

where Ppse represents the new pseudo-label, θ2 denotes a control threshold, Pmax represents the maximum value of the predicted probability, Ppre denotes the predicted probability of the original label, Pglabel represents the original label, and Pindex is the index of the maximum value of the predicted probability.
L. He, P. Tiwari, C. Lv et al. Neural Networks 153 (2022) 120–129

4. Experiments

In this part, we introduce the experimental validation of the introduced framework for ADE. Furthermore, experiments of different scales are conducted on publicly available databases to validate the proposed method. Our goals for the experimental studies are as follows:

1. Evaluate the effect of uncertain labels for ADE.
2. Validate the performance of the proposed framework based on visual cues for ADE.

In Section 4.1, the datasets that we use are introduced. The experimental setup and evaluation measures are briefly described in Section 4.2. Lastly, the experimental performance is shown in Section 4.3.

Table 2
Accuracy of ADE on the test set of AVEC2013 using ResNet-18.

Models                        RMSE   MAE    PCC
A1: DFEM                      9.78   7.83   0.70
B1: DFEM + SAM + SRRM         9.68   7.20   0.71
C1: DFEM + SAM + SRRM + RM    9.37   7.02   0.78

Table 3
Accuracy of ADE on the test set of AVEC2014 using ResNet-18.

Models                        RMSE   MAE    PCC
A2: DFEM                      9.60   7.33   0.71
B2: DFEM + SAM + SRRM         9.60   7.33   0.71
C2: DFEM + SAM + SRRM + RM    9.31   7.28   0.79

4.1. Datasets

To validate the proposed method, two datasets are used, i.e., AVEC2013 and AVEC2014; BDI-II is used to label the audiovisual data samples in the two databases.

AVEC2013 is a subset of a large audiovisual depressive language corpus (AViD-Corpus); it consists of 340 video clips recorded by 292 subjects participating in a human–computer interaction (HCI) task with a microphone and a webcam. The mean age of the participants is 31.5 years, and the age range is between 18 and 63 years. Audio and video clips were recorded during these HCI tasks. The sampling rate of the audio is 41 kHz (16 bit). The video was collected at 30 frames/s, and the frames have dimensions of 640 × 480 pixels. H.264 is adopted as the codec. The format of the videos is MP4. These recordings are grouped into three partitions: there are 50 recordings in each of the training, development, and test sets.

AVEC2014 is a subset of the AVEC2013 database. Unlike the task of AVEC2013, AVEC2014 contains two HCI tasks (Freeform and Northwind):

1. Freeform — Participants respond to one of a number of questions, such as ''What is your favorite dish?'', or discuss a sad childhood memory (German).
2. Northwind — Participants read aloud an excerpt of the fable ''Die Sonne und der Wind'' (The North Wind and the Sun) (German).

In total, the corpus consists of 300 videos with lengths ranging from 6 s to 4 min. There are a total of 84 subjects (the mean age is 31.5 years, with a standard deviation of 12.3 years). Training, development, and test sets with 50 recordings each are included in the database. In particular, we combine the training and the development sets to train the proposed deep models, and we use the test set to validate the discriminative models.

4.2. Experimental settings and evaluation standards

In this part, we present the experimental settings and evaluation standards for the proposed architecture for ADE.

4.2.1. Evaluation settings

In this study, we adopt the OpenFace architecture to process the facial region and perform landmark localization on both databases (i.e., AVEC2013 and AVEC2014). The size of the facial images is set to 224 × 224. To improve the performance of ADE, we manually check the frames to guarantee that the facial regions are correctly identified for ADE.

We utilize the PyTorch toolbox to train the SAN models. The ResNet-18 and ResNet-50 backbones are used for ADE (the MS-Celeb-1M face recognition dataset (He et al., 2016) is used to pre-train the deep models). To extract the discriminative features for ADE, the last pooling layers of ResNet-18 and ResNet-50 are used. The batch sizes are set to 64 and 128 for ResNet-18 and ResNet-50, respectively. To further validate the effectiveness of the proposed method, we divide the training samples into two groups using the threshold 0.8, i.e., the high-partition samples are the top 80%, and the low-partition samples are the bottom 20%. The threshold θ1 is either set to 0.07 or learned automatically during the training stage. The ADAM parameters have the following values: the learning rate is set to 1e−3, with γ equal to 0.9. The dropout rate is set to 0.5. We carry out the experiments using two Titan-X GPUs (each with 12 GB of memory). The number of training epochs is empirically set to 30. The RM is utilized after the 10th epoch with a threshold of 0.12. The training procedure takes 5 h.

4.2.2. Evaluation standards

In order to make an equitable comparison with previous studies using the two depression databases (i.e., AVEC2013 and AVEC2014), the mean absolute error (MAE) and the root mean square error (RMSE) are used. Formally, the MAE and RMSE can be expressed as follows:

MAE = (1/N) ∑_{j=1}^{N} |lj − Prej|,    (7)

RMSE = √( (1/N) ∑_{j=1}^{N} (lj − Prej)² ),    (8)

where N represents the number of participants, lj is the provided BDI-II score, and Prej represents the predicted value of the jth video clip.

4.3. Results

In this part, we first perform an ablation study to evaluate the performance of the proposed method. Then, we compare the proposed architecture with the existing methods for visual ADE to show its promising performance.

4.3.1. Ablation study

The accuracy of each module of the introduced architecture is verified on the databases (i.e., AVEC2013 and AVEC2014). To further evaluate the proposed scheme, experiments of different scales are performed using the ResNet-18 and ResNet-50 deep architectures. The Pearson correlation coefficient (PCC) is also computed in the experiments. As shown in Tables 2 and 3, according to
performances for ADE. As shown in Table 2, C1 obtains the best performance using ResNet-18 on AVEC2013, with an RMSE of 9.37 and an MAE of 7.83. On the AVEC2014 depression database, as shown in Table 3, and similar to AVEC2013, C2 obtains the best performance, with an RMSE of 9.31 and an MAE of 7.28.

As shown in Tables 4 and 5, ResNet-50 is also adopted to evaluate the proposed method. From Table 4, one can see that C3 obtains the best performance, with an RMSE of 9.45 and an MAE of 6.72. Similar performances are achieved on the AVEC2014 database.

Table 4
Accuracy of ADE on the test set of AVEC2013 using ResNet-50.
Models                          RMSE   MAE    PCC
A3: DFEM                        9.84   7.04   0.68
B3: DFEM + SAM + SRRM           9.61   6.87   0.70
C3: DFEM + SAM + SRRM + RM      9.45   6.72   0.71

Table 5
Accuracy of ADE on the test set of AVEC2014 using ResNet-50.
Models                          RMSE   MAE    PCC
A4: DFEM                        9.55   7.00   0.70
B4: DFEM + SAM + SRRM           9.51   6.91   0.71
C4: DFEM + SAM + SRRM + RM      9.24   6.95   0.79

Based on the above accuracy of ADE, one can notice that the proposed method obtains a reasonable performance by learning facial representations for ADE from video clips. More importantly, our proposed method is the first to consider the uncertain labels for ADE.

4.3.2. Comparison with the state-of-the-art methods

To further describe the capabilities of the proposed architecture, we compare our approach with previous methods for ADE based on visual hand-crafted and deep-learned features. To make a fair comparison, we only list the results on AVEC2013 and AVEC2014 in Tables 6 and 7. From these tables, one can see that our proposed method obtains comparable results on the AVEC2013 and AVEC2014 databases. In addition, in Zhou et al. (2020) and Zhu et al. (2017), the authors first train the deep model on the CASIA webface database and then fine-tune it on the AVEC2013 and AVEC2014 databases. Also, other authors used a model trained on the Sports-1M dataset and fine-tuned on the UCF101 dataset for depression recognition (Al Jazaery & Guo, 2021). Meanwhile, other authors used a model trained on the VGGFace2 dataset for depression recognition (He, Guo, Tiwari, Su et al., 2021).

The advantage of our architecture is that it can model visual patterns obtained from facial images for ADE. The key explanations are given below. First, the proposed architecture does not focus on designing deep architectures but on the existing uncertain label problem for ADE. Second, the proposed architecture is the first framework that attempts to consider the label distribution for ADE.

As can be seen from the performances reported above, using only facial expression representations, our architecture obtains better results than multi-modal approaches for ADE. This further demonstrates the capabilities of our proposed method for ADE to some extent.

Figs. 4 and 5 depict our accuracy on the AVEC2013 and AVEC2014 databases compared to the reported state-of-the-art results using audiovisual features. One can see that our method obtains a better performance than multi-modal methods for ADE. Furthermore, this observation also demonstrates the suitability of our proposed architecture for ADE.

Fig. 4. Comparison with other ADE methods in the AVEC2013 challenge (Cummins et al., 2013; Meng et al., 2013; Williamson et al., 2013). A denotes the audio modality, and V represents the video modality.

4.3.3. Statistical significance tests

To further illustrate the performance of the proposed method, we use statistical significance tests to carry out extensive experiments on AVEC2013 and AVEC2014. Normality (Shapiro–Wilk normality test), correlation (chi-squared test), stationarity (Dickey–Fuller unit root test), parametric (analysis of variance [ANOVA] test), and nonparametric (Mann–Whitney U test) tests are used to estimate the p-values. As shown in Table 8, the p-value of the Shapiro–Wilk normality test is 0.004, indicating that the predicted BDI-II scores are not Gaussian. For the chi-squared test, the p-value is 1.000, indicating that the predicted BDI-II scores are dependent. The p-value of the Dickey–Fuller unit root test is 0.003, indicating that the predicted BDI-II scores are stationary. In addition, we obtain the same conclusions on AVEC2014 (see Table 9). Based on these statistical results, the proposed architecture is promising. Furthermore, we can see that the predicted values are not identical to the observations; this is because the BDI-II labels are discrete for our task. More importantly, most of the predicted values approached the true labels.

Table 6
Accuracy of DLF and HCF architectures for visual-based ADE on the test set of AVEC2013. HCF denotes hand-crafted features, while DLF denotes the deep-learned features.

HCF:
References                                       RMSE    MAE
Baseline (Valstar et al., 2013)/LPQ, SVR         13.61   10.88
Wen et al. (2015)/LPQ-TOP, SVR                   10.27   8.22
He et al. (2018)/MRLBP-TOP, DPFV, SVR            9.20    7.55

DLF:
References                                       RMSE    MAE
Zhu et al. (2017)/Optical Flow, 2D-CNN           9.82    7.58
Al Jazaery and Guo (2021)/C3D, RNN               9.28    7.37
Uddin et al. (2020)/LSTM                         8.93    7.04
Zhou et al. (2020)/2D-CNN                        8.19    6.30
de Melo et al. (2019b)/C3D                       8.26    6.40
de Melo et al. (2019a)/ResNet-50                 8.25    6.30
He, Guo et al. (2021)/3D-CNN                     8.46    6.83
He, Guo, Tiwari, Su et al. (2021)/2D-CNN         9.17    7.36
He, Chan et al. (2021)/2D-CNN                    8.39    6.59
Niu et al. (2020)/2D-CNN                         8.97    7.32
Ours/2D-CNN                                      9.37    7.02

Table 7
Accuracy of DLF and HCF architectures for visual-based ADE on the test set of AVEC2014. HCF denotes hand-crafted features, while DLF denotes deep-learned features.

HCF:
References                                       RMSE    MAE
Baseline (Valstar et al., 2014)/LGBP-TOP, SVR    10.86   8.86
He et al. (2018)/MRLBP-TOP, DPFV, SVR            9.01    7.21
Dhall and Goecke (2015)/LBP-TOP, SVR             8.91    7.08

DLF:
References                                       RMSE    MAE
Zhu et al. (2017)/Optical Flow, 2D-CNN           9.55    7.47
Al Jazaery and Guo (2021)/C3D, RNN               9.20    7.22
Uddin et al. (2020)/LSTM                         8.78    6.86
Zhou et al. (2020)/2D-CNN                        8.39    6.21
de Melo et al. (2019b)/C3D                       8.31    6.59
de Melo et al. (2019a)/ResNet-50                 8.23    6.13
He, Guo et al. (2021)/3D-CNN                     8.42    6.78
He, Guo, Tiwari, Su et al. (2021)/2D-CNN         9.03    7.26
He, Chan et al. (2021)/2D-CNN                    8.30    6.51
Niu et al. (2020)/2D-CNN                         8.60    6.43
Ours/2D-CNN                                      9.24    6.95
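The RMSE and MAE columns in Tables 6 and 7 follow Eqs. (7) and (8), and the ablation study additionally reports the PCC. The following minimal pure-Python sketch shows how all three metrics are computed; the BDI-II scores below are synthetic values chosen for illustration only.

```python
import math

def mae(labels, preds):
    """Mean absolute error, Eq. (7)."""
    n = len(labels)
    return sum(abs(l - p) for l, p in zip(labels, preds)) / n

def rmse(labels, preds):
    """Root mean square error, Eq. (8)."""
    n = len(labels)
    return math.sqrt(sum((l - p) ** 2 for l, p in zip(labels, preds)) / n)

def pcc(x, y):
    """Pearson correlation coefficient, as reported in the ablation study."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Synthetic BDI-II scores for five hypothetical video clips.
labels = [3, 14, 22, 30, 9]
preds = [5, 12, 25, 28, 10]
print(mae(labels, preds))   # 2.0
print(rmse(labels, preds))  # sqrt(4.4) ~ 2.0976
print(pcc(labels, preds))
```

Lower MAE/RMSE and higher PCC are better, which is why the tables are read column by column rather than row by row.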

Fig. 5. Comparison with other ADE methods in AVEC2014 challenge (Gupta et al., 2014; Jain, Crowley, Dey, & Lux, 2014; Jan, Meng, Gaus, Zhang, & Turabzadeh,
2014; Mitra et al., 2014; Pérez Espinosa et al., 2014; Sidorov & Minker, 2014; Williamson, Quatieri, Helfer, Ciccarelli, & Mehta, 2014). A denotes audio modality, and
V represents the video modality.
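The two label-handling mechanisms summarized in the conclusion below — the square ranking regularization of the SRRM and the threshold-based re-label operation of the RM — can be sketched as follows. This is a speculative reconstruction from the text's description: the fixed half/half split, the margin, and the exact re-label rule are illustrative assumptions, not the authors' implementation (only the 0.12 threshold is taken from Section 4.2.1).

```python
# Speculative sketch of the SRRM/RM mechanisms; the partition scheme,
# margin, and re-label rule are assumptions for illustration.

def square_rank_loss(attention_weights, margin=0.0):
    """Rank a mini-batch by attention weight, split it into a high and a
    low partition, and penalize the squared violation of
    mean(high) - mean(low) >= margin."""
    w = sorted(attention_weights, reverse=True)
    half = len(w) // 2
    high_mean = sum(w[:half]) / half
    low_mean = sum(w[half:]) / (len(w) - half)
    violation = max(0.0, margin - (high_mean - low_mean))
    return violation ** 2  # squaring keeps the penalty smooth near zero

def relabel(probs, labels, threshold=0.12):
    """Re-label a sample when the top predicted class outscores its given
    label by more than the threshold (assumed rule)."""
    new_labels = []
    for p, y in zip(probs, labels):
        top = max(range(len(p)), key=p.__getitem__)
        new_labels.append(top if p[top] - p[y] > threshold else y)
    return new_labels

# Well-separated attention weights incur no ranking penalty ...
print(square_rank_loss([0.9, 0.8, 0.2, 0.1]))  # 0.0
# ... and a confident disagreement triggers a re-label.
print(relabel([[0.1, 0.7, 0.2]], [0]))         # [1]
```

In a real training loop this loss term would be added to the classification loss, and the re-label step would be activated only after the warm-up epochs, matching the schedule described in Section 4.2.1.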

5. Conclusion

In this paper, we consider the regression of AVEC2013 and AVEC2014 as a classification problem for ADE. Therefore, we address the problem by considering the label distribution for ADE. Specifically, an end-to-end framework called the SAN is proposed to re-label the uncertain labels for ADE. The architecture consists of four modules: the DFEM, SAM, SRRM, and RM. More importantly, the square ranking regularization loss of the SRRM is proposed to divide the training mini-batch into two partitions: a high partition and a low partition. In addition, the square operation of the SRRM makes the proposed model robust and ensures that it converges easily. Then, the re-label operation can be performed for ADE. An extensive experimental and theoretical analysis of the performance of the proposed model on
AVEC2013 and AVEC2014 demonstrated the efficiency of our proposed method. The main advantages of our approach are that it can leverage the label distribution to learn the low-partition samples, which can be re-weighted by the SRRM for efficient and effective ADE. More importantly, the proposed architecture only considers the existing uncertain label problem and does not focus on designing DL architectures; this provides a new perspective for audiovisual-based ADE.

In our next study, we will explore the use of multi-modal deep features for ADE. In addition, we will experiment with more discriminative patterns and regression models with DL technology. Lastly, we will try to use the proposed method to assist clinicians in assessing depressed subjects.

Table 8
Statistical significance tests of predicted BDI-II values for AVEC2013.
Statistical significance test method      P-value
Shapiro–Wilk normality test               0.004
Dickey–Fuller unit root test              0.003
Analysis of variance test (ANOVA)         1.000
Chi-squared test                          1.000
Mann–Whitney U test                       0.500

Table 9
Statistical significance tests of predicted BDI-II values for AVEC2014.
Statistical significance test method      P-value
Shapiro–Wilk normality test               0.003
Dickey–Fuller unit root test              0.002
Analysis of variance test (ANOVA)         1.000
Chi-squared test                          1.000
Mann–Whitney U test                       0.500

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the Shaanxi Provincial Social Science Foundation (grant 2021K015), the Shaanxi Provincial Natural Science Foundation (grants 2021JQ-824 and 2022JM-380), the Special Construction Fund for Key Disciplines of Shaanxi Provincial Higher Education, and the Scientific Research Program Funded by Shaanxi Provincial Education Department (Program Nos. 19JS028 and 20JG030). This work was also supported by the Academy of Finland (grants 336033, 315896), Business Finland (grant 884/31/2018), and EU H2020 (grant 101016775).

References

Al Jazaery, M., & Guo, G. (2021). Video-based depression level analysis by encoding deep spatiotemporal features. IEEE Transactions on Affective Computing, 12(1), 262–268.
Baltrušaitis, T., Robinson, P., & Morency, L.-P. (2016). OpenFace: An open source facial behavior analysis toolkit. In 2016 IEEE winter conference on applications of computer vision (WACV) (pp. 1–10). IEEE.
Carneiro de Melo, W., Granger, E., & Bordallo Lopez, M. (2021). MDN: A deep maximization-differentiation network for spatio-temporal depression detection. IEEE Transactions on Affective Computing, 1.
Chao, L., Tao, J., Yang, M., & Li, Y. (2015). Multi task sequence learning for depression scale prediction from video. In 2015 International conference on affective computing and intelligent interaction (ACII) (pp. 526–531). IEEE.
Cummins, N., Joshi, J., Dhall, A., Sethu, V., Goecke, R., & Epps, J. (2013). Diagnosis of depression by behavioural signals: A multimodal approach. In Proceedings of the 3rd ACM international workshop on audio/visual emotion challenge (pp. 11–20). Barcelona, Spain: ACM.
de Melo, W. C., Granger, E., & Hadid, A. (2019a). Depression detection based on deep distribution learning. In ICIP.
de Melo, W. C., Granger, E., & Hadid, A. (2019b). Combining global and local convolutional 3D networks for detecting depression from facial expressions. In FG.
de Melo, W. C., Granger, E., & Hadid, A. (2020). A deep multiscale spatiotemporal network for assessing depression from facial dynamics. IEEE Transactions on Affective Computing.
de Melo, W. C., Granger, E., & Lopez, M. B. (2020). Encoding temporal information for automatic depression recognition from facial analysis. In ICASSP 2020—2020 IEEE international conference on acoustics, speech and signal processing (pp. 1080–1084). IEEE.
Dhall, A., & Goecke, R. (2015). A temporally piece-wise Fisher vector approach for depression analysis. In 2015 International conference on affective computing and intelligent interaction (ACII) (pp. 255–259). IEEE.
Du, Z., Li, W., Huang, D., & Wang, Y. (2019). Encoding visual behaviors with attentive temporal convolution for depression prediction. In 2019 14th IEEE international conference on automatic face & gesture recognition (FG 2019) (pp. 1–7). IEEE.
Gao, Q., Liu, J., & Ju, Z. (2020). Robust real-time hand detection and localization for space human–robot interaction based on deep learning. Neurocomputing, 390, 198–206.
Gupta, R., Malandrakis, N., Xiao, B., Guha, T., Van Segbroeck, M., Black, M., et al. (2014). Multimodal prediction of affective dimensions and depression in human-computer interactions. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 33–40). Orlando, Florida, USA: ACM.
Hawton, K., i Comabella, C. C., Haw, C., & Saunders, K. (2013). Risk factors for suicide in individuals with depression: A systematic review. Journal of Affective Disorders, 147(1–3), 17–28.
He, L., & Cao, C. (2018). Automated depression analysis using convolutional neural networks from speech. Journal of Biomedical Informatics, 83, 103–111.
He, L., Chan, J. C.-W., & Wang, Z. (2021). Automatic depression recognition using CNN with attention mechanism from videos. Neurocomputing, 422, 165–175.
He, L., Guo, C., Tiwari, P., Pandey, H. M., & Dang, W. (2021). Intelligent system for depression scale estimation with facial expressions and case study in industrial intelligence. International Journal of Intelligent Systems.
He, L., Guo, C., Tiwari, P., Su, R., Pandey, H. M., & Dang, W. (2021). DepNet: An automated industrial intelligent system using deep learning for video-based depression analysis. International Journal of Intelligent Systems.
He, L., Jiang, D., & Sahli, H. (2015). Multimodal depression recognition with dynamic visual and audio cues. In 2015 International conference on affective computing and intelligent interaction (ACII) (pp. 260–266). IEEE.
He, L., Jiang, D., & Sahli, H. (2018). Automatic depression analysis using dynamic facial appearance descriptor and Dirichlet process Fisher encoding. IEEE Transactions on Multimedia, 21(6), 1476–1486.
He, L., Niu, M., Tiwari, P., Marttinen, P., Su, R., Jiang, J., et al. (2022). Deep learning for depression recognition with audiovisual cues: A review. Information Fusion, 80, 56–86.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
Hu, W., Huang, Y., Zhang, F., & Li, R. (2019). Noise-tolerant paradigm for training face recognition CNNs. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11887–11896).
Jain, V., Crowley, J. L., Dey, A. K., & Lux, A. (2014). Depression estimation using audiovisual features and Fisher vector encoding. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 87–91). Orlando, Florida, USA: ACM.
Jan, A., Meng, H., Gaus, Y. F. A., Zhang, F., & Turabzadeh, S. (2014). Automatic depression scale prediction using facial expression dynamics and regression. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 73–80). Orlando, Florida, USA: ACM.
Jiang, D., Li, G., Sun, Y., Hu, J., Yun, J., & Liu, Y. (2021). Manipulator grabbing position detection with information fusion of color image and depth image using deep learning. Journal of Ambient Intelligence and Humanized Computing, 1–14.
Jiang, D., Li, G., Tan, C., Huang, L., Sun, Y., & Kong, J. (2021). Semantic segmentation for multiscale target based on object recognition using the improved faster-RCNN model. Future Generation Computer Systems, 123, 94–104.
Kang, Y., Jiang, X., Yin, Y., Shang, Y., & Zhou, X. (2017). Deep transformation learning for depression diagnosis from facial images. In Chinese conference on biometric recognition (pp. 13–22). Springer.
Lanillos, P., Oliva, D., Philippsen, A., Yamashita, Y., Nagai, Y., & Cheng, G. (2020). A review on neural network models of schizophrenia and autism spectrum disorder. Neural Networks, 122, 338–363.
Maj, M., Stein, D. J., Parker, G., Zimmerman, M., Fava, G. A., De Hert, M., et al. (2020). The clinical characterization of the adult patient with depression aimed at personalization of management. World Psychiatry, 19(3), 269–293.
Mathers, C. D., & Loncar, D. (2006). Projections of global mortality and burden of disease from 2002 to 2030. PLoS Medicine, 3(11), Article e442.
Mehrabian, A. (2017). Communication without words. In Communication theory (pp. 193–200). Routledge.
Meng, H., Huang, D., Wang, H., Yang, H., Ai-Shuraifi, M., & Wang, Y. (2013). Depression recognition based on dynamic facial and vocal expression features using partial least square regression. In Proceedings of the 3rd ACM international workshop on audio/visual emotion challenge (pp. 21–30).
Mitra, V., Shriberg, E., McLaren, M., Kathol, A., Richey, C., Vergyri, D., et al. (2014). The SRI AVEC2014 evaluation system. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 93–101). Orlando, Florida, USA: ACM.
Niu, M., Liu, B., Tao, J., & Li, Q. (2021). A time–frequency channel attention and vectorization network for automatic depression level prediction. Neurocomputing.
Niu, M., Tao, J., Liu, B., & Fan, C. (2019). Automatic depression level detection via lp-norm pooling. In INTERSPEECH (pp. 4559–4563).
Niu, M., Tao, J., Liu, B., Huang, J., & Lian, Z. (2020). Multimodal spatiotemporal representation for automatic depression level detection. IEEE Transactions on Affective Computing.
Pérez Espinosa, H., Escalante, H. J., Villaseñor Pineda, L., Montes-y Gómez, M., Pinto-Avedaño, D., & Reyez-Meza, V. (2014). Fusing affective dimensions and audio-visual features from segmented video for depression recognition: INAOE-BUAP's participation at AVEC'14 challenge. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 49–55). Orlando, Florida, USA: ACM.
Senn, W., Wyler, K., Streit, J., Larkum, M., Lüscher, H.-R., Mey, H., et al. (1996). Dynamics of a random neural network with synaptic depression. Neural Networks, 9(4), 575–588.
Sidorov, M., & Minker, W. (2014). Emotion recognition and depression diagnosis by acoustic and visual features: A multimodal approach. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 81–86). Orlando, Florida, USA: ACM.
Song, S., Jaiswal, S., Shen, L., & Valstar, M. (2020). Spectral representation of behaviour primitives for depression analysis. IEEE Transactions on Affective Computing, 1.
Uddin, M. A., Joolee, J. B., & Lee, Y.-K. (2020). Depression level prediction using deep spatiotemporal features and multilayer Bi-LSTM. IEEE Transactions on Affective Computing.
Valstar, M., Schuller, B., Smith, K., Almaev, T., Eyben, F., Krajewski, J., et al. (2014). AVEC 2014: 3D dimensional affect and depression recognition challenge. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 3–10). Orlando, FL, USA: ACM.
Valstar, M., Schuller, B., Smith, K., Eyben, F., Jiang, B., Bilakhia, S., et al. (2013). AVEC2013: The continuous audio/visual emotion and depression recognition challenge. In Proceedings of the 3rd ACM international workshop on audio/visual emotion challenge (pp. 3–10).
Wang, K., Peng, X., Yang, J., Lu, S., & Qiao, Y. (2020). Suppressing uncertainties for large-scale facial expression recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6897–6906).
Wen, L., Li, X., Guo, G., & Zhu, Y. (2015). Automated depression diagnosis based on facial dynamic analysis and sparse coding. IEEE Transactions on Information Forensics and Security, 10(7), 1432–1441.
Williamson, J. R., Quatieri, T. F., Helfer, B. S., Ciccarelli, G., & Mehta, D. D. (2014). Vocal and facial biomarkers of depression based on motor incoordination and timing. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 65–72). Orlando, Florida, USA: ACM.
Williamson, J. R., Quatieri, T. F., Helfer, B. S., Horwitz, R., Yu, B., & Mehta, D. D. (2013). Vocal biomarkers of depression based on motor incoordination. In Proceedings of the 3rd ACM international workshop on audio/visual emotion challenge (pp. 41–48).
Yang, Z., Du Jiang, Y. S., Tao, B., Tong, X., Jiang, G., Xu, M., et al. (2021). Dynamic gesture recognition using surface EMG signals based on multi-stream residual network. Frontiers in Bioengineering and Biotechnology, 9.
Zhou, X., Jin, K., Shang, Y., & Guo, G. (2020). Visually interpretable representation learning for depression recognition from facial images. IEEE Transactions on Affective Computing, 11(3), 542–552. http://dx.doi.org/10.1109/TAFFC.2018.2828819.
Zhu, Y., Shang, Y., Shao, Z., & Guo, G. (2017). Automated depression diagnosis based on deep networks to encode facial appearance and dynamics. IEEE Transactions on Affective Computing, 9(4), 578–584.