Reducing Noisy Annotations For Depression Estimation From Facial Images

Neural Networks
journal homepage: www.elsevier.com/locate/neunet

Article history:
Received 25 November 2021
Received in revised form 17 April 2022
Accepted 25 May 2022
Available online 3 June 2022

Keywords:
Depression
Self-adaptation network (SAN)
Affective computing
Noisy labels

Abstract

Depression has been considered the most dominant mental disorder over the past few years. To help clinicians effectively and efficiently estimate the severity scale of depression, various automated systems based on deep learning have been proposed. To estimate the severity of depression, i.e., the depression severity score (Beck Depression Inventory-II, BDI-II), various deep architectures have been designed to perform regression using the Euclidean loss. However, they do not consider the label distribution, and they do not learn the relationships between the facial images and BDI-II scores, which can result in noisy labeling for automatic depression estimation (ADE). To mitigate this problem, we propose an automated deep architecture, namely the self-adaptation network (SAN), to improve this uncertain labeling for ADE. Specifically, the architecture consists of four modules: (1) ResNet-18 and ResNet-50 are adopted in the deep feature extraction module (DFEM) to extract informative deep features; (2) a self-attention module (SAM) is adopted to learn the weights from the mini-batch; (3) a square ranking regularization module (SRRM) is proposed to create high partitions and low partitions; and (4) a re-label module (RM) is used to re-label the uncertain annotations for ADE in the low partitions. We conduct extensive experiments on depression databases (i.e., AVEC2013 and AVEC2014) and obtain a performance comparable to that of other ADE methods in assessing the severity of depression. More importantly, the proposed method can learn valuable depression patterns from facial videos and obtain a performance comparable to that of other methods for depression recognition.

© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction

Due to the increasing pressures of life, a growing number of people are suffering from depression. According to a report released by the World Health Organization (WHO), depression will become the most common mental disorder by 2030 (Lanillos et al., 2020; Mathers & Loncar, 2006). In severe cases, depression leads to suicide (Hawton, i Comabella, Haw, & Saunders, 2013). It was found that about 50% of suicides are related to depression (Hawton et al., 2013; Senn et al., 1996). Unfortunately, there are no impactful clinical patterns for the diagnosis of depression, which makes the diagnosis of depression complicated and subjective (Maj et al., 2020).

Therefore, various researchers from the affective computing field have attempted to use audiovisual cues to help psychologists or clinicians diagnose the severity of depression. Traditional machine learning approaches for estimating the severity of depression often consist of the following three steps: (i) extracting hand-crafted features, (ii) aggregating the learned features, and

∗ Corresponding author.
E-mail addresses: [email protected] (L. He), [email protected] (P. Tiwari), [email protected] (C. Lv).
https://doi.org/10.1016/j.neunet.2022.05.025
0893-6080/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
L. He, P. Tiwari, C. Lv et al. Neural Networks 153 (2022) 120–129
Fig. 1. Illustration of the proposed pipeline for ADE. Training images are used to train the deep models, and testing images are used to validate the performance of
the deep models.
Fig. 2. The detailed flow of the proposed framework SAN for visual-based ADE. The SAN consists of four modules: (1) a deep feature extraction module (DFEM); (2) a self-attention module (SAM); (3) a square ranking regularization module (SRRM); (4) a re-label module (RM). Due to privacy concerns, we mask the samples with a white rectangle. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
3.1. Architecture overview

Fig. 1 depicts the main structure of the proposed method. We first use the training data to train the deep models, and then we use the test data to validate the deep models. The detailed architecture of the method is illustrated in Fig. 2. First, the OpenFace toolkit (Baltrušaitis, Robinson, & Morency, 2016) is used to pre-process the videos into facial images. Then, ResNet-18 and ResNet-50 are each considered as potential backbone networks for feature extraction; they are used to extract informative features for ADE. After the feature extraction procedure, the SAM (cf. Algorithm 1) provides a learned weight for every image using a sigmoid function and a fully-connected (FC) layer. To reduce the effect of uncertain facial images, a square ranking regularization module is adopted to rank and regularize the learned weights. In the SRRM, the attention weights are ranked in descending order of importance and grouped into two partitions, i.e., a high partition and a low partition. To further separate the means of the weights in the high and low partitions, a square ranking regularization loss is proposed. Lastly, a re-labeling operation is performed by the RM to alter the ambiguous labels of the images in the low partition. In fact, our goal is to re-adjust the labels so that they approach the true values (i.e., BDI-II scores summed over a certain period of time) of the severity of the depression experienced by the participants. Thus, some of the certain labels can be re-validated, and uncertain images can also be re-labeled to improve the performance of the deep ADE model in classifying the facial images. More importantly, the introduced deep models can be run in an end-to-end manner for ADE.

3.2. Extraction of deep learned features

Due to the small size of depression databases, transfer learning is used for ADE. The goal of transfer learning is to fine-tune a model that is pre-trained on a source task to fit a new target task. For instance, one can consider fine-tuning an image classification model that is pre-trained on the ImageNet dataset for ADE using a small depression database. There exist two strategies for transferring the model from the source task to the target task. The first is to consider the pre-trained deep model as a feature extraction tool whose learned weights are not adapted for the new task; in general, a classifier is added on top of the deep architecture to perform classification. The second strategy is to fine-tune the entire network, or a subset of it, for the new task. Thus, the weights of the pre-trained deep models are viewed as the initial values for the new study, and they are updated during the training procedure.

For the ADE task, due to the small size of the depression databases, we use pre-trained deep models (ResNet-18 and ResNet-50²) as feature extractors to obtain the discriminative features.

² For convenience, we take ResNet-18 as an example in the present paper.
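The first transfer strategy above (a frozen pre-trained backbone with a new task head on top) can be sketched in PyTorch. This is a minimal illustration, not the authors' released code: the tiny stand-in backbone and the single-output regression head are our own simplifications (in practice the backbone would be an ImageNet- or MS-Celeb-1M-pre-trained ResNet-18 with its final FC layer replaced).

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained backbone (in practice: a pre-trained
# ResNet-18 with its original classification head removed).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=7, stride=2, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# Strategy 1: freeze the pre-trained weights and train only a new head.
for p in backbone.parameters():
    p.requires_grad = False

head = nn.Linear(8, 1)  # illustrative regression head for a BDI-II score
model = nn.Sequential(backbone, head)

x = torch.randn(4, 3, 224, 224)   # a mini-batch of 224x224 face crops
score = model(x)                  # only `head` receives gradients
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(score.shape, trainable)     # torch.Size([4, 1]) ['1.weight', '1.bias']
```

Under the second strategy, one would instead leave `requires_grad=True` on some or all backbone parameters so the pre-trained weights serve as initialization and are updated during training.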
Fig. 3. Illustration of the ResNet-18 architecture (He et al., 2016). The ResNet-18 architecture consists of 16 convolution layers, 2 down-sampling layers, and 2 fully
connected layers.
To extract informative features, in this study, ResNet-18 is adopted for feature extraction (see the purple dashed rectangle in Fig. 3). As illustrated in Fig. 3, 16 convolution layers, 2 down-sampling layers, and 2 fully connected layers make up ResNet-18. For ResNet-18, the input size of the first convolution layer is 224 × 224, with a kernel size of 7 × 7; 3 × 3 is the kernel size of the other layers. Following the last convolution layer, an eigenvector is generated by a fully connected layer. Meanwhile, ResNet-50 has essentially the same architecture as ResNet-18; the main difference is that it has more layers.

3.3. Self-attention feature learning

After the deep discriminative features are extracted, the self-attention mechanism is used to learn the importance weights. In our task, as shown in Fig. 2 (the blue dashed rectangle), we expect certain facial images to have higher weights, while uncertain images should have lower weights. Formally, let Fea = [y1, y2, ..., yN] ∈ R^{D×N} be the learned deep features from N facial images, where Fea represents the input features. The SAM takes Fea as input and outputs the attention weights of every feature. To learn more effective contexts that will be used in the following steps, a linear fully-connected layer and an activation function, i.e., the sigmoid activation function, are adopted. Mathematically, the operation can be written as (cf. Eq. (1)):

β_j = φ(W_a^T x_j)    (1)

where β_j denotes the learned weight of the jth feature, W_a denotes the learned attention weights of the FC layer, and φ represents the activation function, i.e., the sigmoid activation function. By using the self-attention mechanism, the different scale weights are learned; this provides the foundation for the SRRM and RM.

Motivated by Hu, Huang, Zhang, and Li (2019) and Wang et al. (2020), the logit-weighted loss function is adopted. Mathematically, the loss function ℓ_att of the SAM can be expressed as follows:

ℓ_att = −(1/N) ∑_{j=1}^{N} log( e^{α_j W_{Z_j}^T x_j} / ∑_{k=1}^{C} e^{α_j W_k^T x_j} )    (2)

where W_k represents the kth classifier.

3.4. Square ranking regularization

Using the attention mechanism, the self-attention weights (which range from 0 to 1) were learned during the attention learning step. To further suppress the contributions of the uncertain samples, a square ranking regularization module is proposed to rank the weights in descending order of importance. Then, the SRRM divides the weights into two partitions with a threshold ε. After performing the regularization operation, the mean value of the high-partition weights is higher than that of the low-partition weights by some margin. Therefore, to perform discriminative segmentation, the ranking regularization loss ℓ_SRRM can be written as follows (cf. Eq. (3)):

ℓ_SRRM = max{0, θ1 − (β_High − β_low)}    (3)

and

β_High = (1/M) ∑_{j=0}^{M} β_j,    β_low = (1/(N−M)) ∑_{j=M}^{N} β_j    (4)

where θ1 denotes a margin value that can be learned from the deep models, and β_High and β_low represent the average values of the high partition with (ε ∗ N = M) data samples and the low partition with (N − M) data samples, respectively. Finally, we learn the deep discriminative features by jointly minimizing the loss functions ℓ_SRRM and ℓ_att as follows:

ℓ_all = λ ℓ_SRRM + (1 − λ) ℓ_att    (5)

where λ (λ > 0) represents a trade-off threshold between the loss functions ℓ_SRRM and ℓ_att.

3.5. Relabeling noisy annotations

In the SRRM, every mini-batch is segmented into two partitions, i.e., a high partition and a low partition. By conducting a set of experiments on the proposed framework, we find that there are uncertain data samples with low learned weights. To address this problem, we consider that these data samples have ambiguous labels for model learning.

Thus, to achieve the goal of the present work and to address the problem of noisy labels in ADE, we only consider the low-partition samples, which have weights of lower importance. Consequently, softmax probabilities are adopted. For each image, the maximum value of the predicted probability P_max and the predicted probability of the provided label P_pre are compared. Then, the low-partition samples are given a pseudo-label P_pse when P_max is greater than P_pre by a threshold value. Mathematically, the RM can be written as (cf. Eq. (6)):

P_pse = { P_index,  if P_max − P_pre > θ2;  P_glabel,  otherwise }    (6)

where P_pse represents the new pseudo label, θ2 denotes a control threshold, P_max represents the maximum value of the predicted probability, P_pre denotes the predicted probability of the original label, P_glabel represents the original label, and P_index is the index of the maximum value of the predicted probability.
4. Experiments

We evaluate the introduced framework for ADE. Furthermore, experiments of different scales are conducted on public databases to validate the proposed method. Our goals for the experimental studies are as follows:

4.1. Datasets

To validate the proposed method, two datasets are used, i.e., AVEC2013 and AVEC2014; BDI-II is used to label the audiovisual data samples in the two databases.

AVEC2013 is a subset of a large audiovisual depressive language corpus (AViD-Corpus); it consists of 340 video clips recorded by 292 subjects participating in a human–computer interaction (HCI) task with a microphone and a webcam. The mean age of the participants is 31.5 years, and the age range is between 18 and 63 years. Audio and video clips were recorded during these HCI tasks. The sampling rate of the audio is 41 kHz (16 bit). The video was collected at 30 frames/s, and the frames have dimensions of 640 × 480 pixels. H.264 is adopted as the codec. The format of the videos is MP4. These recordings are grouped into three partitions: there are 50 recordings in the training, development, and test sets, respectively.

AVEC2014 is a subset of the AVEC2013 database. Unlike the task of AVEC2013, AVEC2014 contains two HCI tasks (Freeform and Northwind):

1. Freeform — Participants respond to one of a number of questions, such as ''What is your favorite dish?'', or discuss a sad childhood memory (German).
2. Northwind — Participants read aloud an excerpt of the fable ''Die Sonne und der Wind'' (The North Wind and the Sun) (German).

In total, the corpus consists of 300 videos with lengths ranging from 6 s to 4 min. There are a total of 84 subjects (the mean age is 31.5 years, with a standard deviation of 12.3 years). Training, development, and test sets with 50 recordings each are included in the database. In particular, we combine the training and the development sets to train the proposed deep models, and we use the test set to validate the discriminative models.

4.2. Experimental settings and evaluation standards

In this part, we present the experimental settings and evaluation standards for the proposed architecture for ADE.

4.2.1. Evaluation settings

In this study, we adopt the OpenFace architecture to process the facial region and perform landmark localization on both databases (i.e., AVEC2013 and AVEC2014). The size of the facial images is set to 224 × 224. To improve the performance of ADE, we manually check the frames to guarantee that the facial regions are correctly identified for ADE.

We utilize the PyTorch toolbox to train the SAN models. The backbone networks used for ADE are ResNet-18 and ResNet-50 (the MS-Celeb-1M face recognition dataset (He et al., 2016) is used to pre-train the deep models). To extract the discriminative features for ADE, the last pooling layers of ResNet-18 and ResNet-50 are used. The batch sizes are set to 64 and 128 for ResNet-18 and ResNet-50, respectively. To further validate the effectiveness of the proposed method, we divide the training samples into two groups using the threshold 0.8, i.e., the high-partition samples are the top 80%, and the low-partition samples are the bottom 20%. The threshold θ1 is either set to 0.07 or learned automatically during the training stage. The ADAM parameters have the following values: the learning rate is set to 1e−3, with γ equal to 0.9. The dropout rate is set to 0.5. We carry out the experiments using two Titan-X GPUs (each with 12 GB of memory). The number of training epochs is empirically set to 30. The RM is utilized after the 10th epoch with a threshold of 0.12. The training procedure takes 5 h.

4.2.2. Evaluation standards

In order to make an equitable comparison with previous studies using the two depression databases (i.e., AVEC2013 and AVEC2014), the mean absolute error (MAE) and the root mean square error (RMSE) are used. Formally, the MAE and RMSE can be expressed as follows:

MAE = (1/N) ∑_{j=1}^{N} |l_j − Pre_j|,    (7)

RMSE = √( (1/N) ∑_{j=1}^{N} (l_j − Pre_j)² ),    (8)

where N represents the number of participants, l_j is the provided BDI-II score, and Pre_j represents the predicted value of the jth video clip.

4.3. Results

In this part, we first perform an ablation study to evaluate the performance of the proposed method. Then, we compare the proposed architecture with the existing methods for visual ADE to show its promising performance.

4.3.1. Ablation study

The accuracy of each module of the introduced architecture is verified on the databases (i.e., AVEC2013 and AVEC2014). To further evaluate the proposed scheme, experiments of different scales are performed using the ResNet-18 and ResNet-50 deep architectures. The Pearson correlation coefficient (PCC) is also computed in the experiments.

Table 2
Accuracy of ADE on the test set of AVEC2013 using ResNet-18.

Method  RMSE  MAE  PCC
A1: DFEM  9.78  7.83  0.70
B1: DFEM + SAM + SRRM  9.68  7.20  0.71
C1: DFEM + SAM + SRRM + RM  9.37  7.02  0.78

As shown in Tables 2 and 3, according to the RMSE metric, all the combinations obtain comparable
Fig. 4. Comparison with other ADE methods on the AVEC2013 challenge (Cummins et al., 2013; Meng et al., 2013; Williamson et al., 2013). A denotes the audio modality, and V denotes the video modality.
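The MAE and RMSE metrics from Eqs. (7) and (8), used throughout these comparisons, are straightforward to compute. The sketch below uses made-up labels and predictions, not numbers from the tables.

```python
import math

def mae(labels, preds):
    """Mean absolute error over N participants (Eq. (7))."""
    n = len(labels)
    return sum(abs(l - p) for l, p in zip(labels, preds)) / n

def rmse(labels, preds):
    """Root mean square error over N participants (Eq. (8))."""
    n = len(labels)
    return math.sqrt(sum((l - p) ** 2 for l, p in zip(labels, preds)) / n)

# Illustrative BDI-II labels and per-video predictions (not paper data).
bdi = [10.0, 25.0, 4.0, 33.0]
pred = [12.0, 20.0, 7.0, 30.0]
print(mae(bdi, pred))   # 3.25
print(rmse(bdi, pred))  # ≈ 3.428
```

Note that the RMSE penalizes large per-video errors more heavily than the MAE, which is why the two metrics can rank methods differently in Tables 6 and 7.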
Table 6
Accuracy of DLF and HCF architectures for visual-based ADE on the test set of AVEC2013. HCF denotes hand-crafted features, while DLF denotes the deep learned features.

Features  References  RMSE  MAE
HCF  Baseline (Valstar et al., 2013)/LPQ, SVR  13.61  10.88
HCF  Wen et al. (2015)/LPQ-TOP, SVR  10.27  8.22
HCF  He et al. (2018)/MRLBP-TOP, DPFV, SVR  9.20  7.55
DLF  Zhu et al. (2017)/Optical Flow, 2D-CNN  9.82  7.58
DLF  Al Jazaery and Guo (2021)/C3D, RNN  9.28  7.37
DLF  Uddin et al. (2020)/LSTM  8.93  7.04
DLF  Zhou et al. (2020)/2D-CNN  8.19  6.30
DLF  de Melo et al. (2019b)/C3D  8.26  6.40
DLF  de Melo et al. (2019a)/ResNet-50  8.25  6.30
DLF  He, Guo et al. (2021)/3D-CNN  8.46  6.83
DLF  He, Guo, Tiwari, Su et al. (2021)/2D-CNN  9.17  7.36
DLF  He, Chan et al. (2021)/2D-CNN  8.39  6.59
DLF  Niu et al. (2020)/2D-CNN  8.97  7.32
DLF  Ours/2D-CNN  9.37  7.02
Table 7
Accuracy of DLF and HCF architectures for visual-based ADE on the test set of AVEC2014. HCF denotes hand-crafted features, while DLF denotes deep learned features.

Features  References  RMSE  MAE
HCF  Baseline (Valstar et al., 2014)/LGBP-TOP, SVR  10.86  8.86
HCF  He et al. (2018)/MRLBP-TOP, DPFV, SVR  9.01  7.21
HCF  Dhall and Goecke (2015)/LBP-TOP, SVR  8.91  7.08
DLF  Zhu et al. (2017)/Optical Flow, 2D-CNN  9.55  7.47
DLF  Al Jazaery and Guo (2021)/C3D, RNN  9.20  7.22
DLF  Uddin et al. (2020)/LSTM  8.78  6.86
DLF  Zhou et al. (2020)/2D-CNN  8.39  6.21
DLF  de Melo et al. (2019b)/C3D  8.31  6.59
DLF  de Melo et al. (2019a)/ResNet-50  8.23  6.13
DLF  He, Guo et al. (2021)/3D-CNN  8.42  6.78
DLF  He, Guo, Tiwari, Su et al. (2021)/2D-CNN  9.03  7.26
DLF  He, Chan et al. (2021)/2D-CNN  8.30  6.51
DLF  Niu et al. (2020)/2D-CNN  8.60  6.43
DLF  Ours/2D-CNN  9.24  6.95
Fig. 5. Comparison with other ADE methods on the AVEC2014 challenge (Gupta et al., 2014; Jain, Crowley, Dey, & Lux, 2014; Jan, Meng, Gaus, Zhang, & Turabzadeh, 2014; Mitra et al., 2014; Pérez Espinosa et al., 2014; Sidorov & Minker, 2014; Williamson, Quatieri, Helfer, Ciccarelli, & Mehta, 2014). A denotes the audio modality, and V denotes the video modality.
Table 8
Statistical significance tests of predicted BDI-II values for AVEC2013.

Statistical significance test method  P-value
Shapiro–Wilk normality test  0.004
Dickey–Fuller unit root test  0.003
Analysis of variance test (ANOVA)  1.000
Chi-squared test  1.000
Mann–Whitney U test  0.500

Table 9
Statistical significance tests of predicted BDI-II values for AVEC2014.

Statistical significance test method  P-value
Shapiro–Wilk normality test  0.003
Dickey–Fuller unit root test  0.002
Analysis of variance test (ANOVA)  1.000
Chi-squared test  1.000
Mann–Whitney U test  0.500

AVEC2013 and AVEC2014 demonstrated the efficiency of our proposed method. The main advantages of our approach are that it can leverage the label distribution to learn the low-partition samples, which can be re-weighted by the SRRM for efficient and effective ADE. More importantly, the proposed architecture only considers the existing uncertain label problem and does not focus on designing DL architectures; this provides a new perspective for audiovisual-based ADE.

In our next study, we will explore the use of multi-modal deep features for ADE. In addition, we will experiment with more discriminative patterns and regression models with DL technology. Lastly, we will try to use the proposed method to assist clinicians in assessing depressed subjects.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the Shaanxi Provincial Social Science Foundation (grant 2021K015), the Shaanxi Provincial Natural Science Foundation (grant 2021JQ-824), the Shaanxi Provincial Natural Science Foundation (grant 2022JM-380), the Special Construction Fund for Key Disciplines of Shaanxi Provincial Higher Education, the Scientific Research Program Funded by Shaanxi Provincial Education Department (Program No. 19JS028), and the Scientific Research Program Funded by Shaanxi Provincial Education Department (Program No. 20JG030). This work was also supported by the Academy of Finland (grants 336033, 315896), Business Finland (grant 884/31/2018), and EU H2020 (grant 101016775).

References

Al Jazaery, M., & Guo, G. (2021). Video-based depression level analysis by encoding deep spatiotemporal features. IEEE Transactions on Affective Computing, 12(1), 262–268.
Baltrušaitis, T., Robinson, P., & Morency, L.-P. (2016). OpenFace: An open source facial behavior analysis toolkit. In 2016 IEEE winter conference on applications of computer vision (WACV) (pp. 1–10). IEEE.
Carneiro de Melo, W., Granger, E., & Bordallo Lopez, M. (2021). MDN: A deep maximization-differentiation network for spatio-temporal depression detection. IEEE Transactions on Affective Computing, 1.
Chao, L., Tao, J., Yang, M., & Li, Y. (2015). Multi task sequence learning for depression scale prediction from video. In 2015 International conference on affective computing and intelligent interaction (ACII) (pp. 526–531). IEEE.
Cummins, N., Joshi, J., Dhall, A., Sethu, V., Goecke, R., & Epps, J. (2013). Diagnosis of depression by behavioural signals: A multimodal approach. In Proceedings of the 3rd ACM international workshop on audio/visual emotion challenge (pp. 11–20). Barcelona, Spain: ACM.
de Melo, W. C., Granger, E., & Hadid, A. (2019a). Depression detection based on deep distribution learning. In ICIP.
de Melo, W. C., Granger, E., & Hadid, A. (2019b). Combining global and local convolutional 3d networks for detecting depression from facial expressions. In FG.
de Melo, W. C., Granger, E., & Hadid, A. (2020). A deep multiscale spatiotemporal network for assessing depression from facial dynamics. IEEE Transactions on Affective Computing.
de Melo, W. C., Granger, E., & Lopez, M. B. (2020). Encoding temporal information for automatic depression recognition from facial analysis. In ICASSP 2020—2020 IEEE international conference on acoustics, speech and signal processing (pp. 1080–1084). IEEE.
Dhall, A., & Goecke, R. (2015). A temporally piece-wise fisher vector approach for depression analysis. In 2015 International conference on affective computing and intelligent interaction (ACII) (pp. 255–259). IEEE.
Du, Z., Li, W., Huang, D., & Wang, Y. (2019). Encoding visual behaviors with attentive temporal convolution for depression prediction. In 2019 14th IEEE international conference on automatic face & gesture recognition (FG 2019) (pp. 1–7). IEEE.
Gao, Q., Liu, J., & Ju, Z. (2020). Robust real-time hand detection and localization for space human–robot interaction based on deep learning. Neurocomputing, 390, 198–206.
Gupta, R., Malandrakis, N., Xiao, B., Guha, T., Van Segbroeck, M., Black, M., et al. (2014). Multimodal prediction of affective dimensions and depression in human–computer interactions. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 33–40). Orlando, Florida, USA: ACM.
Hawton, K., i Comabella, C. C., Haw, C., & Saunders, K. (2013). Risk factors for suicide in individuals with depression: A systematic review. Journal of Affective Disorders, 147(1–3), 17–28.
He, L., & Cao, C. (2018). Automated depression analysis using convolutional neural networks from speech. Journal of Biomedical Informatics, 83, 103–111.
He, L., Chan, J. C.-W., & Wang, Z. (2021). Automatic depression recognition using CNN with attention mechanism from videos. Neurocomputing, 422, 165–175.
He, L., Guo, C., Tiwari, P., Pandey, H. M., & Dang, W. (2021). Intelligent system for depression scale estimation with facial expressions and case study in industrial intelligence. International Journal of Intelligent Systems.
He, L., Guo, C., Tiwari, P., Su, R., Pandey, H. M., & Dang, W. (2021). DepNet: An automated industrial intelligent system using deep learning for video-based depression analysis. International Journal of Intelligent Systems.
He, L., Jiang, D., & Sahli, H. (2015). Multimodal depression recognition with dynamic visual and audio cues. In 2015 International conference on affective computing and intelligent interaction (ACII) (pp. 260–266). IEEE.
He, L., Jiang, D., & Sahli, H. (2018). Automatic depression analysis using dynamic facial appearance descriptor and dirichlet process fisher encoding. IEEE Transactions on Multimedia, 21(6), 1476–1486.
He, L., Niu, M., Tiwari, P., Marttinen, P., Su, R., Jiang, J., et al. (2022). Deep learning for depression recognition with audiovisual cues: A review. Information Fusion, 80, 56–86.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
Hu, W., Huang, Y., Zhang, F., & Li, R. (2019). Noise-tolerant paradigm for training face recognition CNNs. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11887–11896).
Jain, V., Crowley, J. L., Dey, A. K., & Lux, A. (2014). Depression estimation using audiovisual features and Fisher vector encoding. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 87–91). Orlando, Florida, USA: ACM.
Jan, A., Meng, H., Gaus, Y. F. A., Zhang, F., & Turabzadeh, S. (2014). Automatic depression scale prediction using facial expression dynamics and regression. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 73–80). Orlando, Florida, USA: ACM.
Jiang, D., Li, G., Sun, Y., Hu, J., Yun, J., & Liu, Y. (2021). Manipulator grabbing position detection with information fusion of color image and depth image using deep learning. Journal of Ambient Intelligence and Humanized Computing, 1–14.
Jiang, D., Li, G., Tan, C., Huang, L., Sun, Y., & Kong, J. (2021). Semantic segmentation for multiscale target based on object recognition using the improved faster-RCNN model. Future Generation Computer Systems, 123, 94–104.
Kang, Y., Jiang, X., Yin, Y., Shang, Y., & Zhou, X. (2017). Deep transformation learning for depression diagnosis from facial images. In Chinese conference on biometric recognition (pp. 13–22). Springer.
Lanillos, P., Oliva, D., Philippsen, A., Yamashita, Y., Nagai, Y., & Cheng, G. (2020). A review on neural network models of schizophrenia and autism spectrum disorder. Neural Networks, 122, 338–363.
Maj, M., Stein, D. J., Parker, G., Zimmerman, M., Fava, G. A., De Hert, M., et al. (2020). The clinical characterization of the adult patient with depression aimed at personalization of management. World Psychiatry, 19(3), 269–293.
Mathers, C. D., & Loncar, D. (2006). Projections of global mortality and burden of disease from 2002 to 2030. PLoS Medicine, 3(11), Article e442.
Mehrabian, A. (2017). Communication without words. In Communication theory (pp. 193–200). Routledge.
Meng, H., Huang, D., Wang, H., Yang, H., Ai-Shuraifi, M., & Wang, Y. (2013). Depression recognition based on dynamic facial and vocal expression features using partial least square regression. In Proceedings of the 3rd ACM international workshop on audio/visual emotion challenge (pp. 21–30).
Mitra, V., Shriberg, E., McLaren, M., Kathol, A., Richey, C., Vergyri, D., et al. (2014). The SRI AVEC2014 evaluation system. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 93–101). Orlando, Florida, USA: ACM.
Niu, M., Liu, B., Tao, J., & Li, Q. (2021). A time–frequency channel attention and vectorization network for automatic depression level prediction. Neurocomputing.
Niu, M., Tao, J., Liu, B., & Fan, C. (2019). Automatic depression level detection via lp-norm pooling. In INTERSPEECH (pp. 4559–4563).
Niu, M., Tao, J., Liu, B., Huang, J., & Lian, Z. (2020). Multimodal spatiotemporal representation for automatic depression level detection. IEEE Transactions on Affective Computing.
Pérez Espinosa, H., Escalante, H. J., Villaseñor Pineda, L., Montes-y Gómez, M., Pinto-Avedaño, D., & Reyez-Meza, V. (2014). Fusing affective dimensions and audio-visual features from segmented video for depression recognition: INAOE-BUAP's participation at AVEC'14 challenge. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 49–55). Orlando, Florida, USA: ACM.
Senn, W., Wyler, K., Streit, J., Larkum, M., Lüscher, H.-R., Mey, H., et al. (1996). Dynamics of a random neural network with synaptic depression. Neural Networks, 9(4), 575–588.
Sidorov, M., & Minker, W. (2014). Emotion recognition and depression diagnosis by acoustic and visual features: A multimodal approach. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 81–86). Orlando, Florida, USA: ACM.
Song, S., Jaiswal, S., Shen, L., & Valstar, M. (2020). Spectral representation of behaviour primitives for depression analysis. IEEE Transactions on Affective Computing, 1.
Uddin, M. A., Joolee, J. B., & Lee, Y.-K. (2020). Depression level prediction using deep spatiotemporal features and multilayer Bi-LSTM. IEEE Transactions on Affective Computing.
Valstar, M., Schuller, B., Smith, K., Almaev, T., Eyben, F., Krajewski, J., et al. (2014). AVEC 2014: 3D dimensional affect and depression recognition challenge. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 3–10). Orlando, FL, USA: ACM.
Valstar, M., Schuller, B., Smith, K., Eyben, F., Jiang, B., Bilakhia, S., et al. (2013). AVEC2013: The continuous audio/visual emotion and depression recognition challenge. In Proceedings of the 3rd ACM international workshop on audio/visual emotion challenge (pp. 3–10).
Wang, K., Peng, X., Yang, J., Lu, S., & Qiao, Y. (2020). Suppressing uncertainties for large-scale facial expression recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6897–6906).
Wen, L., Li, X., Guo, G., & Zhu, Y. (2015). Automated depression diagnosis based on facial dynamic analysis and sparse coding. IEEE Transactions on Information Forensics and Security, 10(7), 1432–1441.
Williamson, J. R., Quatieri, T. F., Helfer, B. S., Ciccarelli, G., & Mehta, D. D. (2014). Vocal and facial biomarkers of depression based on motor incoordination and timing. In Proceedings of the 4th international workshop on audio/visual emotion challenge (pp. 65–72). Orlando, Florida, USA: ACM.
Williamson, J. R., Quatieri, T. F., Helfer, B. S., Horwitz, R., Yu, B., & Mehta, D. D. (2013). Vocal biomarkers of depression based on motor incoordination. In Proceedings of the 3rd ACM international workshop on audio/visual emotion challenge (pp. 41–48).
Yang, Z., Du Jiang, Y. S., Tao, B., Tong, X., Jiang, G., Xu, M., et al. (2021). Dynamic gesture recognition using surface EMG signals based on multi-stream residual network. Frontiers in Bioengineering and Biotechnology, 9.
Zhou, X., Jin, K., Shang, Y., & Guo, G. (2020). Visually interpretable representation learning for depression recognition from facial images. IEEE Transactions on Affective Computing, 11(3), 542–552. http://dx.doi.org/10.1109/TAFFC.2018.2828819.
Zhu, Y., Shang, Y., Shao, Z., & Guo, G. (2017). Automated depression diagnosis based on deep networks to encode facial appearance and dynamics. IEEE Transactions on Affective Computing, 9(4), 578–584.