0% found this document useful (0 votes)
38 views17 pages

Fair ML for Depression Detection

Uploaded by

ramakanth rama
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
38 views17 pages

Fair ML for Depression Detection

Uploaded by

ramakanth rama
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

Proceedings of Machine Learning Research 259:1–15, 2024 Machine Learning for Health (ML4H) 2024

U-Fair: Uncertainty-based Multimodal Multitask Learning


for Fairer Depression Detection
Jiaee Cheong∗ [email protected]
University of Cambridge & the Alan Turing Institute, United Kingdom.
Aditya Bangar [email protected]
Indian Institute of Technology, Kanpur, India.
Sinan Kalkan [email protected]
arXiv:2501.09687v1 [cs.LG] 16 Jan 2025

Dept. of Comp. Engineering and ROMER Center for Robotics and AI,
Middle East Technical University (METU), Turkiye.
Hatice Gunes [email protected]
University of Cambridge, United Kingdom.

Abstract 1. Introduction
Machine learning bias in mental health is
becoming an increasingly pertinent challenge.
Mental health disorders (MHDs) are becoming in-
Despite promising efforts indicating that multi- creasingly prevalent world-wide (Wang et al., 2007)
task approaches often work better than unitask Machine learning (ML) methods have been success-
approaches, there is minimal work investigat- fully applied to many real-world and health-related
ing the impact of multitask learning on per- areas (Sendak et al., 2020). The natural exten-
formance and fairness in depression detection sion of using ML for MHD analysis and detection
nor leveraged it to achieve fairer prediction out- has proven to be promising (Long et al., 2022; He
comes. In this work, we undertake a systematic et al., 2022; Zhang et al., 2020). On the other hand,
investigation of using a multitask approach to ML bias is becoming an increasing source of con-
improve performance and fairness for depres- cern (Buolamwini and Gebru, 2018; Barocas et al.,
sion detection. We propose a novel gender-
2017; Xu et al., 2020; Cheong et al., 2021, 2022,
based task-reweighting method using uncer-
tainty grounded in how the PHQ-8 question-
2023a). Given the high stakes involved in MHD
naire is structured. Our results indicate that, analysis and prediction, it is crucial to investigate
although a multitask approach improves per- and mitigate the ML biases present. A substantial
formance and fairness compared to a unitask amount of literature has indicated that adopting a
approach, the results are not always consis- multitask learning (MTL) approach towards depres-
tent and we see evidence of negative transfer sion detection demonstrated significant improvement
and a reduction in the Pareto frontier, which is across classification-based performances (Li et al.,
concerning given the high-stake healthcare set- 2022; Zhang et al., 2020). Most of the existing work
ting. Our proposed approach of gender-based rely on the standardised and commonly used eight-
reweighting with uncertainty improves perfor- item Patient Health Questionnaire depression scale
mance and fairness and alleviates both chal-
(PHQ-8) (Kroenke et al., 2009) to obtain the ground-
lenges to a certain extent. Our findings on each
PHQ-8 subitem task difficulty are also in agree-
truth labels on whether a subject is considered de-
ment with the largest study conducted on the pressed. A crucial observation is that in order to
PHQ-8 subitem discrimination capacity, thus arrive at the final classification (depressed vs non-
providing the very first tangible evidence link- depressed), a clinician has to first obtain the scores
ing ML findings with large-scale empirical pop- of each of the PHQ-8 sub-criterion and then sum
ulation studies conducted on the PHQ-8. them up to arrive at the final binary classification
(depressed vs non-depressed). Details on how the fi-
∗ This work was undertaken while Jiaee Cheong was a visit- nal score is derived from the PHQ-8 questionnaire
ing PhD student at METU. can be found in Section 3.1.

© 2024 J. Cheong, A. Bangar, S. Kalkan & H. Gunes.


U-Fair: Uncertainty-based Multimodal Multitask Learning for Fairer Depression Detection

Task Losses
Task 1 ℒ1 8
1
ℒ𝐹 = ෍ ℒ𝑡 + log 𝜎𝑡𝐹
ℒ2 𝜎𝑡𝐹 2

Attentional Fusion Module


Task 2
Visual
Modality
CONV-
2D
BiLSTM FC … Task 3 ℒ3
𝑡=1 U-Fair
Loss
Audio
Modality
CONV-
1D
BiLSTM FC … Task 4 ℒ4

ℒU−Fair
Task 5 ℒ5 = ℒ𝐹 + ℒ𝑀
Text
Modality
CONV-
1D
BiLSTM FC … Task 6 ℒ6
8
Concatenation of the 1
extracted visual, audio
Task 7 ℒ7 ℒ𝑀 = ෍ ℒ𝑡 + log 𝜎𝑡𝑀
𝜎𝑡𝑀 2
and textual features Task 8 ℒ8 𝑡=1

Figure 1: Our proposed method is rooted in the observation that each gender may have different PHQ-8
distributions and different levels of task difficulty across the t1 to t8 tasks. We propose accounting
for this gender difference in PHQ-8 distributions via U-Fair.

Moreover, each gender may display different PHQ- approach towards building relevant ML for healthcare
8 task distribution which may results in different solutions, we propose a novel method, U-Fair, which
PHQ-8 distribution and variance. Although inves- accounts for the gender difference in PHQ-8 distri-
tigation on the relationship between the PHQ-8 and bution and leverages on uncertainty as a MTL task
gender has been explored in other fields such as psy- reweighing mechanism to achieve better gender fair-
chiatry (Thibodeau and Asmundson, 2014; Vetter ness for depression detection. Our key contributions
et al., 2013; Leung et al., 2020), this has not been in- are as follow:
vestigated nor accounted for in any of the existing ML • We conduct the first analysis to investigate how
for depression detection methods. Moreover, existing MTL impacts fairness in depression detection by
work has demonstrated the risk of a fairness-accuracy using each PHQ-8 subcriterion as a task. We
trade-off (Pleiss et al., 2017) and how mainstream show that a simplistic baseline MTL approach
MTL objectives might not correlate well with fair- runs the risk of incurring negative transfer and
ness goals (Wang et al., 2021b). No work has inves- may not improve on the Pareto frontier. A
tigated how a MTL approach impacts performance Pareto frontier can be understood as the set of
across fairness for the task of depression detection. optimal solutions that strike a balance among
In addition, prior works have demonstrated the in- different objectives such that there is no better
tricate relationship between ML bias and uncertainty solution beyond the frontier.
(Mehta et al., 2023; Tahir et al., 2023; Kaiser et al., • We propose a simple yet effective approach
2022; Kuzucu et al., 2024). Uncertainty broadly that leverages gender-based aleatoric uncer-
refers to confidence in predictions. Within ML re- tainty which improves the fairness-accuracy
search, two types of uncertainty are commonly stud- trade-off and alleviates the negative transfer phe-
ied: data (or aleatoric) and model (or epistemic) un- nomena and improves on the Pareto-frontier be-
certainties. Aleatoric uncertainty refers to the inher- yond a unitask method.
ent randomness in the experimental outcome whereas • We provide the very first results connecting the
epistemic uncertainty can be attributed to a lack of empirical results obtained via ML experiments
knowledge (Gal, 2016). A particularly relevant theme with the empirical findings obtained via the
is that ML bias can be attributed to uncertainty in largest study conducted on the PHQ-8. Inter-
some models or datasets (Kuzucu et al., 2024) and estingly, our results highlight the intrinsic rela-
that taking into account uncertainty as a bias mit- tionship between task difficulty as quantified by
igation strategy has proven effective (Tahir et al., aleatoric uncertainty and the discrimination ca-
2023; Kaiser et al., 2022). A growing body of lit- pacity of each item of the PHQ-8 subcriterion.
erature has also highlighted the importance of taking
uncertainty into account within a range of tasks (Naik
et al., 2024; Han et al., 2024; Baltaci et al., 2023; 2. Literature Review
Cetinkaya et al., 2024) and healthcare settings (Grote
and Keeling, 2022; Chua et al., 2023). Motivated by Gender difference in depression manifestation has
the above and the importance of a clinician-centred long been studied and recognised within fields such as

2
U-Fair: Uncertainty-based Multimodal Multitask Learning for Fairer Depression Detection

Approach Evaluation
Study Problem Multimodal Uncertainty NFM Measures PF NT ND
Zanna et al. (2022) Anxiety ✕ ✓ 2 ✕ ✕ 1
Li et al. (2023a) Healthcare prediction ✕ ✕ 2 ✕ ✕ 1
Li et al. (2023b) Organ transplant ✕ ✕ 2 ✕ ✕ 1
Ban and Ji (2024) Resource allocation ✕ ✕ 2 ✓ ✕ 3
Li et al. (2024) Risk factor prediction ✕ ✕ 2 ✕ ✕ 1
U-Fair (Ours) Depression detection ✓(AVT) ✓ 4 ✓ ✕ 2

Table 1: Comparative Summary with existing MTL Fairness studies. Abbreviations (sorted): A: Audio.
NFM: Number of Fairness Measures. NT: Negative Transfers. ND: Number of Datasets. PF:
Pareto Frontier. T: Text. V: Visual.

medicine (Barsky et al., 2001) and psychology (Hall In concurrence, although MTL has proven to be
et al., 2022). Anecdotal evidence has also often sup- effective at improving fairness for other tasks such
ported this view. Literature indicates that females as healthcare predictive modelling (Li et al., 2023a),
and males tend to show different behavioural symp- organ transplantation (Li et al., 2023b) and resource
toms when depressed (Barsky et al., 2001; Ogrod- allocation (Ban and Ji, 2024), this approach has been
niczuk and Oliffe, 2011). For instance, certain acous- underexplored for the task of depression detection.
tic features (e.g. MFCC) are only statistically signifi-
Comparative Summary: Our work differs from
cantly different between depressed and healthy males
the above in the following ways (see Table 1). First,
(Wang et al., 2019). On the other hand, compared
our work is the first to leverage an MTL approach to
to males, depressed females are more emotionally ex-
improve gender fairness in depression detection. Sec-
pressive and willing to reveal distress via behavioural
ond, we utilise an MTL approach where each task
cues (Barsky et al., 2001; Jansz et al., 2000).
corresponds to each of the PHQ-8 subtasks (Kroenke
Recent works have indicated that ML bias is et al., 2009) in order to exploit gender-specific differ-
present within mental health analysis (Zanna et al., ences in PHQ-8 distribution to achieve greater fair-
2022; Bailey and Plumbley, 2021; Cheong et al., ness. Third, we propose a novel gender-based uncer-
2024a,b; Cameron et al., 2024; Spitale et al., 2024). tainty MTL loss reweighing to achieve fairer perfor-
Zanna et al. (2022) proposed an uncertainty-based mance across gender for
approach to address the bias present in the TILES
dataset. Bailey and Plumbley (2021) demonstrated
the effectiveness of using an existing bias mitigation 3. Methodology: U-Fair
method, data re-distribution, to mitigate the gender In this section, we introduce U-Fair, which uses
bias present in the DAIC-WOZ dataset. Cheong et al. aleatoric-uncertainties for demographic groups to
(2023b, 2024a) demonstrated that bias exists in exist- reweight their losses.
ing mental health algorithms and datasets and sub-
sequently proposed a causal multimodal method to
3.1. PHQ-8 Details
mitigate the bias present.
One of the standardised and most commonly used de-
MTL is noted to be particularly effective when the pression evaluation method is the PHQ-8 developed
tasks are correlated (Zhang and Yang, 2021). Ex- by Kroenke et al. (2009). In order to arrive at the
isting works using MTL for depression detection has final classification (depressed vs non-depressed), the
proven fruitful. Ghosh et al. (2022) adopted a MTL protocol is to first obtain the subscores of each of the
approach by training the network to detect three PHQ-8 subitem as follows:
closely related tasks: depression, sentiment and emo-
• PHQ-1: Little interest or pleasure in doing
tion. Wang et al. (2022) proposed a MTL approach
things,
using word vectors and statistical features. Li et al.
(2022) implemented a similar strategy by using de- • PHQ-2: Feeling down, depressed, or hopeless,
pression and three other auxiliary tasks: topic, emo- • PHQ-3: Trouble falling or staying asleep, or
tion and dialog act. Gupta et al. (2023) adopted a sleeping too much,
multimodal, multiview and MTL approach where the • PHQ-4: Feeling tired or having little energy,
subtasks are depression, sentiment and emotion. • PHQ-5: Poor appetite or overeating,

3
U-Fair: Uncertainty-based Multimodal Multitask Learning for Fairer Depression Detection

• PHQ-6: Feeling that you are a failure, of fairness in depression detection, where the goal is
• PHQ-7: Trouble concentrating on things, to predict a correct outcome y i ∈ Y from input xi ∈
X based on the available dataset D for individual
• PHQ-8: Moving or speaking so slowly that other
i ∈ I. In our setup, Y = 1 denotes the PHQ-8 binary
people could have noticed.
outcome corresponding to “depressed” and Y = 0
Each PHQ-8 subcategory is scored between 0 to denotes otherwise. Only gender was provided as a
3, with the final PHQ-8 total score (TS) ranging be- sensitive attribute S.
tween 0 to 24. The PHQ-8 binary outcome is ob-
tained via thresholding. A PHQ-8 TS of ≥ 10 belongs 3.3. Unitask Approach
to the depressed class (Y = 1) whereas TS ≤ 10 be-
For our single task approach, we use a Kullback-
longs to the non-depressed class (Y = 0).
Leibler (KL) Divergence loss as follows:
Most existing works focused on predicting the fi-
nal binary class (Y ) (Zheng et al., 2023; Bailey and X 
pt (x)

Plumbley, 2021). Some focused on predicting the LST L = pt (x) log . (3)
qt (x)
PHQ-8 total score and further obtained the binary t∈T
classification via thresholding according to the for-
pt (x) is the soft ground-truth label for each task t and
mal definition (Williamson et al., 2016; Gong and
qt (x) is the probability of the 4 different score classes
Poellabauer, 2017). Others adopted a bimodal setup
y ∈ {0, 1, 2, 3} as explained in Section 3.1.
with 2 different output heads to predict the PHQ- t
8 total score as well as the PHQ-8 binary outcome
(Valstar et al., 2016; Al Hanai et al., 2018). 3.4. Multitask Approach
For our baseline multitask approach, we extend the
3.2. Problem Formulation loss function in Equation 3 to arrive at the following
generalisation:
In our work, in alignment with how the PHQ-8 works,
we adopt the approach where each PHQ-8 subcat- LM T L =
X
wt Lt . (4)
egory is treated as a task t. The architecture is t∈T
adapted from Wei et al. (2022). For each individ-
ual i ∈ I, we have 8 different prediction heads for Lt is the single task loss LST L for each t as defined
each of the tasks, [t1 , ..., t8 ] ∈ T , to predict the in Equation 3. We set wt = 1 in our experiments.
score yti ∈ {0, 1, 2, 3} for each task or sub PHQ-8
category. The ground-truth labels for each task t is 3.5. Baseline Approach
transformed into a Gaussian-based soft-distribution
pt (x), as soft labels provide more information for the To compare between the generic multitask approach
model to learn from (Yuan et al., 2024). x is the input in Equation 4 and an uncertainty-based loss reweight-
feature provided to the model. Each of the classifi- ing approach, we use the commonly used multitask
cation heads are trained to predict the probability learning method by Kendall et al. (2018) as the base-
qt (x) of the 4 different score classes yti ∈ {0, 1, 2, 3}. line uncertainty weighting (UW) appraoch. The un-
During inference, the final yti ∈ {0, 1, 2, 3} is obtained certainty MTL loss across tasks is thus defined by:
by selecting the score with the maximum probability. X 1 
The PHQ-8 Total Score T S and final PHQ-8 binary LU W = Lt + log σt , (5)
σt2
classification Ŷ for each individual i ∈ I are derived t∈T
from each subtask via:
where Lt is the single task loss as defined in Equa-
X8 tion 3. σt is the learned weight of loss for each task t
TS = yt , (1) and can be interpreted as the aleatoric uncertainty of
t=1 the task. A task with a higher aleatoric uncertainty
will thus lead to a larger single task loss Lt thus pre-
and
venting the trained model to optimise on that task.
Ŷ = 1 if T S ≥ 10, else Ŷ = 0. (2)
The higher σt , the more difficult the task t. log σt
Ŷ thus denotes the final predicted class calculated penalizes the model from arbitrarily increasing σt to
based on the summation of yt . We study the problem reduce the overall loss (Kendall et al., 2018).

4
U-Fair: Uncertainty-based Multimodal Multitask Learning for Fairer Depression Detection

3.6. Proposed Loss: U-Fair modality are concatenated in parallel to form a fea-
ture map as input to the subsequent fusion layer. We
To achieve fairness across the different PHQ-8 tasks,
have 8 different attention fusion layers connected to
we propose the idea of task prioritisation based on
the 8 output heads which corresponds to the t1 to
the model’s task-specific uncertainty weightings. Mo-
t8 tasks. For all loss functions, we train the models
tivated by literature highlighting the existence of gen-
with the Adam optimizer (Kingma and Ba, 2014) at
der difference in depression manifestation (Barsky
a learning rate of 0.0002 and a batch size of 32. We
et al., 2001), we propose a novel gender based un-
train the network for a maximum of 150 epochs and
certainty reweighting approach and introduce U-Fair
apply early stopping.
Loss which is defined as follows:
!
1 XX 1 s s
4.2. Evaluation Measures
LU −F air = 2 Lt + log σt . (6)
|S| s
(σt ) To evaluate performance, we use F1, recall, preci-
s∈S t∈T
sion, accuracy and unweighted average recall (UAR)
For our setting, s can either be male s1 or female s0 in accordance with existing work (Cheong et al.,
and |S| = 2. Thus, we have the uncertainty weighted 2023c). To evaluate group fairness, we use the most
task loss for each gender, and sum them up to arrive commonly-used definitions according to (Hort et al.,
at our proposed loss function LM M F air . 2022). s1 denotes the male majority group and s0
This methodology has two key benefits. First, fair- denotes the female minority group for both datasets.
ness is optimised implicitly as we train the model to
optimise for task-wise prediction accuracy. As a re- • Statistical Parity, or demographic parity, is
sult, by not constraining the loss function to blindly based purely on predicted outcome Ŷ and inde-
optimise for fairness at the cost of utility or accuracy, pendent of actual outcome Y :
we hope to reduce the negative impact on fairness and
improve the Pareto frontier with a constraint-based P (Ŷ = 1|s0 )
MSP = . (7)
fairness optimisation approach (Wang et al., 2021b). P (Ŷ = 1|s1 )
Second, as highlighted by literature in psychiatry (Le-
ung et al., 2020; Thibodeau and Asmundson, 2014), According to MSP , in order for a classifier to be
each task has different levels of uncertainty in relation deemed fair, P (Ŷ = 1|s1 ) = P (Ŷ = 1|s0 ).
to each gender. By adopting a gender based uncer-
tainty loss-reweighting approach, we account for such • Equal opportunity states that both demo-
uncertainty in a principled manner, thus encouraging graphic groups s0 and s1 should have equal True
the network to learn a better joint-representation due Positive Rate (TPR).
to the MTL and the gender-base aleatoric uncertainty
loss reweighing approach. P (Ŷ = 1|Y = 1, s0 )
MEOpp = . (8)
P (Ŷ = 1|Y = 1, s1 )

4. Experimental Setup According to this measure, in order for a clas-


sifier to be deemed fair, P (Ŷ = 1|Y = 1, s1 ) =
We outline the implementation details and evaluation P (Ŷ = 1|Y = 1, s0 ).
measures here. We use DAIC-WOZ (Valstar et al.,
2016) and E-DAIC (Ringeval et al., 2019) for our ex- • Equalised odds can be considered as a gener-
periments. Further details about the datasets can be alization of Equal Opportunity where the rates
found within the Appendix. are not only equal for Y = 1, but for all values
of Y ∈ {1, ...k}, i.e.:
4.1. Implementation Details
P (Ŷ = 1|Y = i, s0 )
MEOdd = . (9)
We adopt an attention-based multimodal architec- P (Ŷ = 1|Y = i, s1 )
ture adapted from Wei et al. (2022) featuring late
fusion of extracted representations from the three According to this measure, in order for a clas-
different modalities (audio, visual, textual) as illus- sifier to be deemed fair, P (Ŷ = 1|Y = i, s1 ) =
trated in Figure 1. The extracted features from each P (Ŷ = 1|Y = i, s0 ), ∀i ∈ {1, ...k}.

5
U-Fair: Uncertainty-based Multimodal Multitask Learning for Fairer Depression Detection

• Equal Accuracy states that both subgroups s0 Measure Approach Binary Outcome
and s1 should have equal rates of accuracy. Unitask 0.66
Multitask 0.70
Acc
MACC,s0 Baseline UW 0.82
MEAcc = . (10) U-Fair (Ours) 0.80
MACC,s1
Unitask 0.47
Multitask 0.53
For all fairness measures, the ideal score of 1 thus F1

Performance Measures
Baseline UW 0.29
indicates that both measures are equal for s0 and s1 U-Fair (Ours) 0.54
and is thus considered “perfectly fair”. We adopt the Unitask 0.44
approach of existing work which considers 0.80 and Multitask 0.50
Precision
1.20 as the lower and upper fairness bounds respec- Baseline UW 0.22
tively (Zanna et al., 2022). Values closer to 1 are U-Fair (Ours) 0.56
fairer, values further form 1 are less fair. For all bi- Unitask 0.50
nary classification, the “default” threshold of 0.5 is Multitask 0.57
Recall
Baseline UW 0.43
used in alignment with existing works (Wei et al.,
U-Fair (Ours) 0.60
2022; Zheng et al., 2023).
Unitask 0.60
Multitask 0.65
UAR
Baseline UW 0.64
5. Results U-Fair (Ours) 0.63
For both datasets, we normalise the fairness results Unitask 0.47
to facilitate visualisation in Figures 2 and 3. Multitask 0.86
MSP
Baseline UW 1.23
U-Fair (Ours) 1.06
5.1. Uni vs Multitask Unitask 0.45
Fairness Measures

Multitask 0.78
For DAIC-WOZ (DW), we see from Table 2, we find MEOpp
Baseline UW 1.70
that a multitask approach generally improves results U-Fair (Ours) 1.46
compared to a unitask approach (Section 3.3). The Unitask 0.54
baseline loss re-weighting approach from Equation 5 Multitask 0.76
MEOdd
managed to further improve performance. For exam- Baseline UW 1.31
ple, we see from Table 2 that the overall classification U-Fair (Ours) 1.17
accuracy improved from 0.70 within a vanilla MTL Unitask 1.44
approach to 0.82 using the baseline uncertainty-based Multitask 0.94
MEAcc
task reweighing approach. Baseline UW 1.25
U-Fair (Ours) 0.95
However, this observation is not consistent for E-
DAIC (ED). With reference to Table 3, a unitask Table 2: Results for DAIC-WOZ. Full table results
approach seems to perform better. We see evidence for DW, Table 6, is available within the Ap-
of negative transfer, i.e. the phenomena where learn- pendix. Best values are highlighted in bold.
ing multiple tasks concurrently result in lower per-
formance than a unitask approach. We hypothe-
5.2. Uncertainty & the Pareto Frontier
sise that this is because ED is a more challenging
dataset. When adopting a multitask approach, the Our proposed loss reweighting approach seems to ad-
model completely relies on the easier tasks thus neg- dress the negative transfer and Pareto frontier chal-
atively impacting the learning of the other tasks. lenges. Although accuracy dropped slightly from 0.82
Moreover, performance improvement seems to to 0.80, fairness largely improved compared to the
come at a cost. This may be due to the fairness- baseline UW approach (Equation 5). We see from
accuracy trade-off (Wang et al., 2021b). For instance Table 2 that fairness improved across MSP , MEOpp ,
in DW, we see that the fairness scores MSP , MEOpp , MEOdd and MAcc from 1.23, 1.70, 1.31, 1.25 to 1.06,
MOdd and MAcc reduced from 0.86, 0.78, 0.94 and 1.46, 1.17 and 0.95 for DW.
0.76 to 1.23, 1.70, 1.31 and 1.25 respectively. This is For ED, the baseline UW which adopts a task based
consistent with the analysis across the Pareto frontier difficulty reweighting mechanism seems to some-
depicted in Figures 2 and 3. what mitigate the task-based negative transfer which

6
U-Fair: Uncertainty-based Multimodal Multitask Learning for Fairer Depression Detection

1.00 1.00
1.00 1.00
0.75 0.75 0.75 0.75

EOpp
EOdd

SP
0.50
EAcc

0.50 0.50 0.50


0.25 0.25 0.25 0.25

0.00 0.00 0.00 0.00


0.00 0.25 0.50 0.75 0.00 0.25 0.50 0.75 0.00 0.25 0.50 0.75 0.00 0.25 0.50 0.75
Accuracy Accuracy Accuracy Accuracy

(a) MEAcc vs Acc (b) MEOdd vs Acc (c) MEOpp vs Acc (d ) MSP vs Acc
Figure 2: Fairness-Accuracy Pareto Frontier across the DAIC-WOZ results. Upper right indicates better
Pareto optimality, i.e. better fairness-accuracy trade-off. Orange: Unitask. Green: Multitask.
Blue: Multitask UW. Red: U-Fair. Abbreviations: Acc: accuracy.

1.00 0.8
1.00 1.0
0.75 0.8 0.6
0.75
0.6

EOpp
EAcc

0.50
EOdd

0.50 0.4

SP
0.4
0.25 0.25 0.2
0.2
0.00 0.00 0.0 0.0
0.00 0.25 0.50 0.75 0.00 0.25 0.50 0.75 0.00 0.25 0.50 0.75 0.00 0.25 0.50 0.75
Accuracy Accuracy Accuracy Accuracy

(a) MEAcc vs Acc (b) MEOdd vs Acc (c) MEOpp vs Acc (d ) MSP vs Acc
Figure 3: Fairness-Accuracy Pareto Frontier across the E-DAIC results. Upper right indicates better Pareto
optimality, i.e. better fairness-accuracy trade-off. Orange: Unitask. Green: Multitask. Blue:
Multitask UW. Red: U-Fair. Abbreviations: Acc: accuracy.

improves the unitask performance but not overall For DW, with reference to Figures 4(a) and 4(b),
performance nor fairness measures. Our proposed we see that there is a difference in task difficulty. Task
method which takes into account the gender differ- 4 and 6 is easier for females whereas task 7 is easier for
ence may have somewhat addressed this task-based males. For ED, with reference to Figures 4(c), 4(d )
negative transfer. In concurrence, U-Fair also ad- and Table 5, Task 4 seems to be easier for females
dressed the initial bias present. We see from Ta- whereas task 7 seems easier for males. Thus, adopt-
ble 3 that fairness improved across all fairness mea- ing a gender-based uncertainty reweighting approach
sures. The scores improved from 3.86, 2.31, 8.21, 0.92 might have ensured that the tasks are more appropri-
to 1.67, 1.00, 5.00 and 0.94 across MSP , MEOpp , ately weighed leading towards better performance for
MEOdd and MAcc . both genders whilst mitigating the negative transfer
and Pareto frontier challenges.
The Pareto frontier across all four measures illus-
5.3. Task Difficulty & Discrimination
trated in Figures 2 and 3 demonstrated that our
Capacity
proposed method generally provides better accuracy-
fairness trade-off across most fairness measures for A particularly relevant and exciting finding is that
both datasets. With reference to Figure 2, we see that each PHQ-8 subitem’s task difficulty agree with its
U-Fair, generally provides a slightly better Pareto op- discrimination capacity as evidenced by the rigorous
timality compared to other methods. This improve- study conducted by de la Torre et al. (2023). This
ment in the Pareto frontier is especially pronounced largest study to date assessed the internal structure,
for Figure 3(c). The difference in the Pareto frontier reliability and cross-country validity of the PHQ-8
between our proposed method and other compared for the assessment of depressive symptoms. Discrim-
methods is greater in ED (Fig 3), the more challeng- ination capacity is defined as the ability of item to
ing dataset, compared to that in DW (Fig 2). distinguish whether a person is depressed or not.

7
U-Fair: Uncertainty-based Multimodal Multitask Learning for Fairer Depression Detection

Measure Approach Binary Outcome each task. The higher σt , the more difficult the task
Unitask 0.55 t. In other words, the lower the value of σ12 , the more
Multitask 0.58 difficult the task. For instance, in their study, PHQ-
Acc
Baseline UW 0.87 1, 2 and 6 were the items that has the greatest ability
U-Fair (Ours) 0.90
to discriminate whether a person is depressed. This
Unitask 0.51
Multitask 0.45 is in alignment with our results where PHQ-1,2 and
F1 8 are easier across both datasets. PHQ-3 and PHQ-5
Performance Measures

Baseline UW 0.27
U-Fair (Ours) 0.45 are the least discriminatory or more difficult tasks as
Unitask 0.36 evidenced by the values highlighted in red.
Multitask 0.32
Precision
Baseline UW 0.28 1
σ2
U-Fair (Ours) 0.46 DC DW-F DW-M ED-F ED-M
Unitask 0.87 PHQ-1 3.06 1.50 1.41 1.69 1.69
Multitask 0.80 PHQ-2 3.42 1.41 1.47 1.38 1.41
Recall PHQ-3 1.91 0.62 0.64 0.51 0.58
Baseline UW 0.26
PHQ-4 2.67 0.82 0.68 0.91 0.60
U-Fair (Ours) 0.45
PHQ-5 2.22 0.61 0.69 0.51 0.58
Unitask 0.63 PHQ-6 2.86 0.73 0.59 0.63 0.60
Multitask 0.67 PHQ-7 2.55 0.75 0.80 0.61 0.89
UAR
Baseline UW 0.60 PHQ-8 2.43 1.58 1.72 1.69 1.70
U-Fair (Ours) 0.70
Unitask 0.65 Table 5: Discrimination capacity (DC) vs σ12 . Lower
Multitask 1.25 1
MSP
Baseline UW 3.86 σ 2 values implies higher task difficulty.
Green: top 3 highest scores. Red: bot-
U-Fair (Ours) 1.67
Unitask 0.57
tom 2 lowest scores. Our results are in har-
Fairness Measures

Multitask 0.81 mony with the largest and most comprehen-


MEOpp sive study on the PHQ-8 conducted by de la
Baseline UW 2.31
U-Fair (Ours) 1.00 Torre et al. (2023). DW: DAIC-WOZ. ED:
Unitask 0.75 E-DAIC. F: Female. M: Male.
Multitask 1.41
MEOdd
Baseline UW 8.21 6. Discussion and Conclusion
U-Fair (Ours) 5.00
Unitask 0.83
Our experiments unearthed several interesting in-
Multitask 0.65
sights. First, overall, there are certain gender-based
MEAcc
Baseline UW 0.92
differences across the different PHQ-8 distribution
U-Fair (Ours) 0.94
labels as evidenced in Figure 4. In addition, each
Table 3: Results for E-DAIC. Full table results for task have slightly different degree of task uncertainty
ED, Table 7, is available within the Ap- across gender. This may be due to a gender differ-
pendix. Best values are highlighted in bold. ence in PHQ-8 questionnaire profiling or inadequate
data curation. Thus, employing a gender-aware ap-
proach may be a viable method to improve fairness
Method Prec. Rec. F1
Ma et al. (2016) 0.35 1.00 0.52
and accuracy for depression detection.
Song et al. (2018) 0.32 0.86 0.46 Second, though a multitask approach generally
Williamson et al. (2016) - - 0.53 performs better than a unitask approach, this comes
Song et al. (2018) 0.60 0.43 0.50 with several caveats. We see from Table 5 that each
U-Fair (Ours) 0.52 0.60 0.57 task has a different level of difficulty. Naively using
all tasks may worsen performance and fairness com-
Table 4: Comparison with other models which used pared to a unitask approach if we do not account for
extracted features for DAIC-WOZ. Best re- task-based uncertainty. This is in agreement with ex-
sults highlighted in bold. isting literature which indicates that there can be a
With reference to Table 5, it is noteworthy that mix of positive and negative transfers across tasks (Li
the task difficulty captured by σ12 in our experiments et al., 2023c) and tasks have to be related for perfor-
corresponds to the discrimination capacity (DC) of mance to improve (Wang et al., 2021a).

8
U-Fair: Uncertainty-based Multimodal Multitask Learning for Fairer Depression Detection

1.8 1.8 1.8 1.8


1/(σ1M )2
1.6 1.6 1.6 1.6 1/(σ2M )2
1.4 1.4 1.4 1.4 1/(σ3M )2
1/(σ4M )2

1/(σM )2

1/(σ M )2
1/(σF )2

1/(σF )2
1.2 1.2 1.2 1.2 1/(σ5M )2
1/(σ6M )2
1 1 1 1
1/(σ7M )2
0.8 0.8 0.8 0.8 1/(σ8M )2
0.6 0.6 0.6 0.6

0.4 0.4 0.4 0.4


0 200 400 600 800 1,000 0 200 400 600 800 1,000 0 200 400 600 800 1,000 0 200 400 600 800 1,000
Iterations Iterations Iterations Iterations

(a) DAIC-WOZ:Female (b) DAIC-WOZ: Male (c) E-DAIC: Female (d ) E-DAIC: Male
Figure 4: Task-based weightings for both gender and datasets.

Third, understanding, analysing and improving upon the fairness-accuracy Pareto frontier within the task of depression requires a nuanced and careful use of measures and datasets in order to avoid the fairness-accuracy trade-off. Moreover, there is a growing amount of research indicating that, with appropriate methodology and metrics, these trade-offs are not always present (Dutta et al., 2020; Black et al., 2022; Cooper et al., 2021) and can be mitigated with careful selection of models (Black et al., 2022) and evaluation methods (Wick et al., 2019). Our results are in agreement with existing works indicating that state-of-the-art bias mitigation methods are typically only effective at removing epistemic discrimination (Wang et al., 2023), i.e. the discrimination made during model development, but not aleatoric discrimination. In order to address aleatoric discrimination, i.e. the bias inherent within the data distribution, and to improve the Pareto frontier, better data curation is required (Dutta et al., 2020). Though our results are unable to provide a significant improvement on the Pareto frontier, we believe that this work presents the first step in this direction and would encourage future work to look into this.

In sum, we present a novel gender-based uncertainty multitask loss reweighting mechanism. We showed that our proposed multitask loss reweighting is able to improve fairness with a lesser fairness-accuracy trade-off. Our findings also revealed the importance of accounting for negative transfers and for more effort to be channelled towards improving the Pareto frontier in depression detection research.

ML for Healthcare Implication: Producing a thorough review of strategies to improve fairness is not within the scope of this work. Instead, the key goal is to advance ML for healthcare solutions that are grounded in the framework used by clinicians. In our settings, this corresponds to using each PHQ-8 subcriterion as an individual subtask within our MTL-based approach and using a gender-based uncertainty reweighting mechanism to account for the gender difference in PHQ-8 label distribution. By replicating the inferential process used by clinicians, this work attempts to bridge ML methods with the symptom-based profiling system used by clinicians. Future work can also make use of this property during inference in order to improve the trustworthiness of the machine learning or decision-making model (Huang and Ma, 2022).

In the process of doing so, our proposed method also provides the elusive first evidence that each PHQ-8 subitem's task difficulty aligns with its discrimination capacity, as evidenced from data collected from the largest PHQ-8 population-based study to date (de la Torre et al., 2023). We hope this piece of work will encourage other ML and healthcare researchers to further investigate methods that could bridge ML experimental results with empirical real-world healthcare findings to ensure their reliability and validity.

Limitations: We only investigated gender fairness due to the limited availability of other sensitive attributes in both datasets. Future work can consider investigating this approach across different sensitive attributes such as race and age, the intersectionality of sensitive attributes, and other healthcare challenges such as cognitive impairment or cancer diagnosis. Moreover, we have adopted our experimental approach in alignment with the train-validation-test split provided by the dataset owners as well as other existing works. Future work can consider adopting a cross-validation approach. Other interesting directions include investigating this challenge as an ordinal regression problem (Diaz and Marathe, 2019). Future work can also consider repeating the experiments using datasets collected from other countries and diving deeper into the cultural intricacies of the different PHQ-8 subitems, investigating the effects of the different modalities and their relation to a multitask approach, as well as investigating other important topics such as interpretability and explainability to advance responsible (Wiens et al., 2019) and ethical machine learning for healthcare (Chen et al., 2021).
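As background for the fairness measures reported in Tables 6 and 7 (MSP, MEOpp, MEOdd, MEAcc), these are ratio-style comparisons of a performance metric across gender subgroups, where a value of 1 indicates parity; their exact definitions are given in the paper's metrics section. A generic sketch of the recipe, using recall and equal opportunity as the example (the function names and the zero-denominator convention are illustrative assumptions):

```python
def recall(tp: int, fn: int) -> float:
    """True-positive rate for one subgroup."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def equal_opportunity_ratio(recall_female: float, recall_male: float) -> float:
    """Ratio of subgroup recalls: 1.0 means a depressed individual is equally
    likely to be detected regardless of gender, which is the spirit of
    equal-opportunity-style measures."""
    if recall_male == 0.0:
        return 0.0  # illustrative convention for an otherwise undefined ratio
    return recall_female / recall_male

# Toy example: 6 of 10 depressed women vs 4 of 10 depressed men detected.
ratio = equal_opportunity_ratio(recall(6, 4), recall(4, 6))  # 0.6 / 0.4 = 1.5
```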


Acknowledgments

Funding: J. Cheong is supported by the Alan Turing Institute doctoral studentship, the Leverhulme Trust and further acknowledges resource support from METU. A. Bangar contributed to this while undertaking a remote visiting studentship at the Department of Computer Science and Technology, University of Cambridge. H. Gunes' work is supported by the EPSRC/UKRI project ARoEq under grant ref. EP/R030782/1. Open access: The authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising. Data access: This study involved secondary analyses of existing datasets. All datasets are described and cited accordingly.

References

Tuka Al Hanai, Mohammad M Ghassemi, and James R Glass. Detecting depression with audio/text sequence modeling of interviews. In Interspeech, pages 1716-1720, 2018.

Andrew Bailey and Mark D Plumbley. Gender bias in depression detection using audio features. EUSIPCO 2021, 2021.

Zeynep Sonat Baltaci, Kemal Oksuz, Selim Kuzucu, Kivanc Tezoren, Berkin Kerim Konar, Alpay Ozkan, Emre Akbas, and Sinan Kalkan. Class uncertainty: A measure to mitigate class imbalance. arXiv preprint arXiv:2311.14090, 2023.

Hao Ban and Kaiyi Ji. Fair resource allocation in multi-task learning. arXiv preprint arXiv:2402.15638, 2024.

Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness in machine learning. NeurIPS Tutorial, 1:2, 2017.

Arthur J Barsky, Heli M Peekna, and Jonathan F Borus. Somatic symptom reporting in women and men. Journal of General Internal Medicine, 16(4):266-275, 2001.

Emily Black, Manish Raghavan, and Solon Barocas. Model multiplicity: Opportunities, concerns, and solutions. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 850-863, 2022.

Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In FAccT, pages 77-91. PMLR, 2018.

Joseph Cameron, Jiaee Cheong, Micol Spitale, and Hatice Gunes. Multimodal gender fairness in depression prediction: Insights on data from the USA & China. arXiv preprint arXiv:2408.04026, 2024.

Bedrettin Cetinkaya, Sinan Kalkan, and Emre Akbas. RankED: Addressing imbalance and uncertainty in edge detection using ranking-based losses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3239-3249, 2024.

Irene Y Chen, Emma Pierson, Sherri Rose, Shalmali Joshi, Kadija Ferryman, and Marzyeh Ghassemi. Ethical machine learning in healthcare. Annual Review of Biomedical Data Science, 4(1):123-144, 2021.

Jiaee Cheong, Sinan Kalkan, and Hatice Gunes. The hitchhiker's guide to bias and fairness in facial affective signal processing: Overview and techniques. IEEE Signal Processing Magazine, 38(6), 2021.

Jiaee Cheong, Sinan Kalkan, and Hatice Gunes. Counterfactual fairness for facial expression recognition. In European Conference on Computer Vision, pages 245-261. Springer, 2022.

Jiaee Cheong, Sinan Kalkan, and Hatice Gunes. Causal structure learning of bias for fair affect recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 340-349, 2023a.

Jiaee Cheong, Selim Kuzucu, Sinan Kalkan, and Hatice Gunes. Towards gender fairness for mental health prediction. In IJCAI 2023, pages 5932-5940, US, 2023b. IJCAI.

Jiaee Cheong, Micol Spitale, and Hatice Gunes. "It's not fair!" - fairness for a small dataset of multimodal dyadic mental well-being coaching. In ACII, pages 1-8, USA, September 2023c.

Jiaee Cheong, Sinan Kalkan, and Hatice Gunes. FairReFuse: Referee-guided fusion for multi-modal causal fairness in depression detection. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, pages 7224-7232, August 2024a. AI for Good.

Jiaee Cheong, Micol Spitale, and Hatice Gunes. Small but fair! Fairness for multimodal human-human and robot-human mental wellbeing coaching, 2024b.

Michelle Chua, Doyun Kim, Jongmun Choi, Nahyoung G Lee, Vikram Deshpande, Joseph Schwab, Michael H Lev, Ramon G Gonzalez, Michael S Gee, and Synho Do. Tackling prediction uncertainty in machine learning for healthcare. Nature Biomedical Engineering, 7(6):711-718, 2023.

A Feder Cooper, Ellen Abrams, and Na Na. Emergent unfairness in algorithmic fairness-accuracy trade-off research. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 46-54, 2021.

Jorge Arias de la Torre, Gemma Vilagut, Amy Ronaldson, Jose M Valderas, Ioannis Bakolis, Alex Dregan, Antonio J Molina, Fernando Navarro-Mateu, Katherine Pérez, Xavier Bartoll-Roca, et al. Reliability and cross-country equivalence of the 8-item version of the Patient Health Questionnaire (PHQ-8) for the assessment of depression: results from 27 countries in Europe. The Lancet Regional Health-Europe, 31, 2023.

Raul Diaz and Amit Marathe. Soft labels for ordinal regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4738-4747, 2019.

Sanghamitra Dutta, Dennis Wei, Hazar Yueksel, Pin-Yu Chen, Sijia Liu, and Kush Varshney. Is there a trade-off between fairness and accuracy? A perspective using mismatched hypothesis testing. In International Conference on Machine Learning, pages 2803-2813. PMLR, 2020.

Yarin Gal. Uncertainty in deep learning. 2016.

Soumitra Ghosh, Asif Ekbal, and Pushpak Bhattacharyya. A multitask framework to detect depression, sentiment and multi-label emotion from suicide notes. Cognitive Computation, 14(1), 2022.

Yuan Gong and Christian Poellabauer. Topic modeling based multi-modal depression detection. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, pages 69-76, 2017.

Thomas Grote and Geoff Keeling. Enabling fairness in healthcare through machine learning. Ethics and Information Technology, 24(3):39, 2022.

Shelley Gupta, Archana Singh, and Jayanthi Ranjan. Multimodal, multiview and multitasking depression detection framework endorsed with auxiliary sentiment polarity and emotion detection. International Journal of System Assurance Engineering and Management, 14(Suppl 1), 2023.

Melissa Hall, Laurens van der Maaten, Laura Gustafson, Maxwell Jones, and Aaron Adcock. A systematic study of bias amplification. arXiv preprint arXiv:2201.11706, 2022.

Mengjie Han, Ilkim Canli, Juveria Shah, Xingxing Zhang, Ipek Gursel Dino, and Sinan Kalkan. Perspectives of machine learning and natural language processing on characterizing positive energy districts. Buildings, 14(2):371, 2024.

Lang He, Mingyue Niu, Prayag Tiwari, Pekka Marttinen, Rui Su, Jiewei Jiang, Chenguang Guo, Hongyu Wang, Songtao Ding, Zhongmin Wang, et al. Deep learning for depression recognition with audiovisual cues: A review. Information Fusion, 80:56-86, 2022.

Max Hort, Zhenpeng Chen, Jie M Zhang, Federica Sarro, and Mark Harman. Bias mitigation for machine learning classifiers: A comprehensive survey. arXiv preprint arXiv:2207.07068, 2022.

Guanjie Huang and Fenglong Ma. TrustSleepNet: A trustable deep multimodal network for sleep stage classification. In 2022 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), pages 01-04. IEEE, 2022.

Jeroen Jansz et al. Masculine identity and restrictive emotionality. Gender and Emotion: Social Psychological Perspectives, pages 166-186, 2000.

Patrick Kaiser, Christoph Kern, and David Rügamer. Uncertainty-aware predictive modeling for fair data-driven decisions, 2022.

Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, pages 7482-7491, 2018.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2014.

Kurt Kroenke, Tara W Strine, Robert L Spitzer, Janet BW Williams, Joyce T Berry, and Ali H Mokdad. The PHQ-8 as a measure of current depression in the general population. Journal of Affective Disorders, 114(1-3):163-173, 2009.

Selim Kuzucu, Jiaee Cheong, Hatice Gunes, and Sinan Kalkan. Uncertainty as a fairness measure. Journal of Artificial Intelligence Research, 81:307-335, 2024.

Doris YP Leung, Yim Wah Mak, Sau Fong Leung, Vico CL Chiang, and Alice Yuen Loke. Measurement invariances of the PHQ-9 across gender and age groups in Chinese adolescents. Asia-Pacific Psychiatry, 12(3):e12381, 2020.

Can Li, Sirui Ding, Na Zou, Xia Hu, Xiaoqian Jiang, and Kai Zhang. Multi-task learning with dynamic re-weighting to achieve fairness in healthcare predictive modeling. Journal of Biomedical Informatics, 143:104399, 2023a.

Can Li, Dejian Lai, Xiaoqian Jiang, and Kai Zhang. FERI: A multitask-based fairness achieving algorithm with applications to fair organ transplantation. arXiv preprint arXiv:2310.13820, 2023b.

Can Li, Xiaoqian Jiang, and Kai Zhang. A transformer-based deep learning approach for fairly predicting post-liver transplant risk factors. Journal of Biomedical Informatics, 149:104545, 2024.

Chuyuan Li, Chloé Braud, and Maxime Amblard. Multi-task learning for depression detection in dialogs. arXiv preprint arXiv:2208.10250, 2022.

Dongyue Li, Huy Nguyen, and Hongyang Ryan Zhang. Identification of negative transfers in multitask learning using surrogate models. Transactions on Machine Learning Research, 2023c.

Nannan Long, Yongxiang Lei, Lianhua Peng, Ping Xu, and Ping Mao. A scoping review on monitoring mental health using smart wearable devices. Mathematical Biosciences and Engineering, 19(8), 2022.

Xingchen Ma, Hongyu Yang, Qiang Chen, Di Huang, and Yunhong Wang. DepAudioNet: An efficient deep model for audio based depression classification. In 6th International Workshop on Audio/Visual Emotion Challenge, 2016.

Raghav Mehta, Changjian Shui, and Tal Arbel. Evaluating the fairness of deep learning uncertainty estimates in medical image analysis, 2023.

Lakshadeep Naik, Sinan Kalkan, and Norbert Krüger. Pre-grasp approaching on mobile robots: a pre-active layered approach. IEEE Robotics and Automation Letters, 2024.

John S Ogrodniczuk and John L Oliffe. Men and depression. Canadian Family Physician, 57(2):153-155, 2011.

Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. On fairness and calibration. NeurIPS, 30, 2017.

Fabien Ringeval, Björn Schuller, Michel Valstar, Nicholas Cummins, Roddy Cowie, and Maja Pantic. AVEC'19: Audio/visual emotion challenge and workshop. In ICMI, pages 2718-2719, 2019.

Mark Sendak, Madeleine Clare Elish, Michael Gao, Joseph Futoma, William Ratliff, Marshall Nichols, Armando Bedoya, Suresh Balu, and Cara O'Brien. "The human body is a black box": supporting clinical decision-making with deep learning. In FAccT, pages 99-109, 2020.

Siyang Song, Linlin Shen, and Michel Valstar. Human behaviour-based automatic depression analysis using hand-crafted statistics and deep learned spectral features. In FG 2018, pages 158-165. IEEE, 2018.

Micol Spitale, Jiaee Cheong, and Hatice Gunes. Underneath the numbers: Quantitative and qualitative gender fairness in LLMs for depression prediction. arXiv preprint arXiv:2406.08183, 2024.

Anique Tahir, Lu Cheng, and Huan Liu. Fairness through aleatoric uncertainty. In CIKM, 2023.

Michel A Thibodeau and Gordon JG Asmundson. The PHQ-9 assesses depression similarly in men and women from the general population. Personality and Individual Differences, 56:149-153, 2014.

Michel Valstar, Jonathan Gratch, Björn Schuller, Fabien Ringeval, Denis Lalanne, Mercedes Torres Torres, Stefan Scherer, Giota Stratou, Roddy Cowie, and Maja Pantic. AVEC 2016: Depression, mood, and emotion recognition workshop and challenge. pages 3-10, 2016.

Marion L Vetter, Thomas A Wadden, Christopher Vinnard, Reneé H Moore, Zahra Khan, Sheri Volger, David B Sarwer, and Lucy F Faulconbridge. Gender differences in the relationship between symptoms of depression and high-sensitivity CRP. International Journal of Obesity, 37(1):S38-S43, 2013.

Hao Wang, Luxi He, Rui Gao, and Flavio Calmon. Aleatoric and epistemic discrimination: Fundamental limits of fairness interventions. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Jialu Wang, Yang Liu, and Caleb Levy. Fair classification with group-dependent label noise. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 526-536, 2021a.

Jingying Wang, Lei Zhang, Tianli Liu, Wei Pan, Bin Hu, and Tingshao Zhu. Acoustic differences between healthy and depressed people: a cross-situation study. BMC Psychiatry, 19:1-12, 2019.

Philip S Wang, Sergio Aguilar-Gaxiola, Jordi Alonso, Matthias C Angermeyer, Guilherme Borges, Evelyn J Bromet, Ronny Bruffaerts, Giovanni De Girolamo, Ron De Graaf, Oye Gureje, et al. Use of mental health services for anxiety, mood, and substance disorders in 17 countries in the WHO World Mental Health Surveys. The Lancet, 370(9590):841-850, 2007.

Yiding Wang, Zhenyi Wang, Chenghao Li, Yilin Zhang, and Haizhou Wang. Online social network individual depression detection using a multitask heterogenous modality fusion approach. Information Sciences, 609, 2022.

Yuyan Wang, Xuezhi Wang, Alex Beutel, Flavien Prost, Jilin Chen, and Ed H Chi. Understanding and improving fairness-accuracy trade-offs in multi-task learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1748-1757, 2021b.

Ping-Cheng Wei, Kunyu Peng, Alina Roitberg, Kailun Yang, Jiaming Zhang, and Rainer Stiefelhagen. Multi-modal depression estimation based on sub-attentional fusion. In European Conference on Computer Vision, pages 623-639. Springer, 2022.

Michael Wick, Jean-Baptiste Tristan, et al. Unlocking fairness: a trade-off revisited. Advances in Neural Information Processing Systems, 32, 2019.

Jenna Wiens, Suchi Saria, Mark Sendak, Marzyeh Ghassemi, Vincent X Liu, Finale Doshi-Velez, Kenneth Jung, Katherine Heller, David Kale, Mohammed Saeed, et al. Do no harm: a roadmap for responsible machine learning for health care. Nature Medicine, 25(9):1337-1340, 2019.

James R Williamson, Elizabeth Godoy, Miriam Cha, Adrianne Schwarzentruber, Pooya Khorrami, Youngjune Gwon, Hsiang-Tsung Kung, Charlie Dagli, and Thomas F Quatieri. Detecting depression using vocal, facial and semantic communication cues. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pages 11-18, 2016.

Tian Xu, Jennifer White, Sinan Kalkan, and Hatice Gunes. Investigating bias and fairness in facial expression recognition. In Computer Vision - ECCV 2020 Workshops: Glasgow, UK, August 23-28, 2020, Proceedings, Part VI 16, pages 506-523. Springer, 2020.

Hua Yuan, Yu Shi, Ning Xu, Xu Yang, Xin Geng, and Yong Rui. Learning from biased soft labels. Advances in Neural Information Processing Systems, 36, 2024.

Khadija Zanna, Kusha Sridhar, Han Yu, and Akane Sano. Bias reducing multitask learning on mental health prediction. In ACII, pages 1-8. IEEE, 2022.

Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586-5609, 2021.

Ziheng Zhang, Weizhe Lin, Mingyu Liu, and Marwa Mahmoud. Multimodal deep learning framework for mental disorder recognition. In 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pages 344-350. IEEE, 2020.

Wenbo Zheng, Lan Yan, and Fei-Yue Wang. Two birds with one stone: Knowledge-embedded temporal convolutional transformer for depression detection and emotion recognition. IEEE Transactions on Affective Computing, 2023.

Appendix A. Experimental Setup


A.1. Datasets
For both DAIC-WOZ and E-DAIC, we work with the extracted features and follow the standard train-validate-test split provided by the dataset owners, who also provide the ground truths for each PHQ-8 sub-criterion as well as the final binary classification for both datasets.

DAIC-WOZ (Valstar et al., 2016) contains audio recordings, extracted visual features and transcripts collected in a lab-based setting from 100 males and 85 females.

E-DAIC (Ringeval et al., 2019) contains acoustic recordings and extracted visual features of 168 males and 103 females.
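Both datasets ground-truth each PHQ-8 subitem as well as the overall binary label. As a reminder of the standard PHQ-8 scoring this builds on (Kroenke et al., 2009), the following sketches the label construction; the subitem binarisation threshold below is an illustrative assumption, not necessarily the one used by the dataset owners:

```python
def phq8_labels(subitem_scores):
    """Derive per-subitem binary subtask targets and the overall binary
    depression label from the eight PHQ-8 subitem scores (each in 0..3)."""
    assert len(subitem_scores) == 8 and all(0 <= s <= 3 for s in subitem_scores)
    total = sum(subitem_scores)
    # A subitem is treated as 'present' if endorsed at least
    # 'more than half the days' (score >= 2) -- an illustrative assumption.
    subtasks = [int(s >= 2) for s in subitem_scores]
    depressed = int(total >= 10)  # standard PHQ-8 cutoff (Kroenke et al., 2009)
    return subtasks, depressed

# Example: two moderately endorsed subitems plus six mild ones sum to 10,
# crossing the conventional depression threshold.
subtasks, depressed = phq8_labels([2, 2, 1, 1, 1, 1, 1, 1])
```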


Measure    Approach       PHQ-1  PHQ-2  PHQ-3  PHQ-4  PHQ-5  PHQ-6  PHQ-7  PHQ-8  Binary Outcome
Acc        Unitask        0.87   0.51   0.62   0.57   0.57   0.51   0.79   0.94   0.66
Acc        Multitask      0.72   0.68   0.57   0.62   0.64   0.68   0.74   0.89   0.70
Acc        Baseline UW    0.81   0.70   0.64   0.60   0.66   0.62   0.72   0.87   0.82
Acc        U-Fair (Ours)  0.68   0.66   0.47   0.43   0.43   0.49   0.60   0.74   0.80
F1         Unitask        0.25   0.41   0.44   0.33   0.33   0.53   0.44   0.40   0.47
F1         Multitask      0.32   0.29   0.50   0.44   0.32   0.48   0.45   0.29   0.53
F1         Baseline UW    0.40   0.30   0.51   0.42   0.33   0.31   0.43   0.25   0.29
F1         U-Fair (Ours)  0.29   0.33   0.44   0.43   0.27   0.33   0.39   0.00   0.54
Precision  Unitask        1.00   0.27   0.47   0.31   0.26   0.37   0.67   0.50   0.44
Precision  Multitask      0.25   0.25   0.43   0.39   0.29   0.47   0.50   0.25   0.50
Precision  Baseline UW    0.38   0.27   0.50   0.37   0.31   0.33   0.45   0.20   0.22
Precision  U-Fair (Ours)  0.21   0.27   0.36   0.30   0.19   0.27   0.32   0.00   0.56
Recall     Unitask        0.14   0.89   0.41   0.36   0.45   0.93   0.33   0.33   0.50
Recall     Multitask      0.43   0.33   0.59   0.50   0.36   0.50   0.42   0.33   0.57
Recall     Baseline UW    0.43   0.33   0.53   0.50   0.36   0.29   0.42   0.33   0.43
Recall     U-Fair (Ours)  0.43   0.44   0.59   0.71   0.45   0.43   0.50   0.00   0.60
UAR        Unitask        0.93   0.60   0.58   0.51   0.52   0.64   0.74   0.73   0.60
UAR        Multitask      0.57   0.54   0.57   0.57   0.54   0.62   0.66   0.60   0.65
UAR        Baseline UW    0.65   0.56   0.61   0.57   0.56   0.52   0.62   0.62   0.64
UAR        U-Fair (Ours)  0.58   0.58   0.49   0.51   0.44   0.47   0.56   0.40   0.63
MSP        Unitask        0.00   1.44   1.92   1.60   0.86   1.44   4.79   0.96   0.47
MSP        Multitask      1.92   0.96   1.80   1.20   3.51   1.10   3.83   2.88   0.86
MSP        Baseline UW    2.88   1.15   1.92   1.06   2.16   1.34   1.15   1.44   1.23
MSP        U-Fair (Ours)  0.72   0.64   1.28   1.15   1.12   0.66   0.86   0.77   1.06
MEOpp      Unitask        0.00   1.50   2.00   1.67   0.90   1.50   5.00   1.00   0.45
MEOpp      Multitask      2.00   1.00   1.88   1.25   3.67   1.14   4.00   3.00   0.78
MEOpp      Baseline UW    3.00   1.20   2.00   1.11   2.25   1.40   1.20   1.50   1.70
MEOpp      U-Fair (Ours)  0.75   0.67   1.33   1.20   1.17   0.69   0.90   0.80   1.46
MEOdd      Unitask        0.00   1.44   1.90   2.83   1.25   1.53   0.00   0.00   0.54
MEOdd      Multitask      0.00   1.60   1.83   1.28   9.00   1.88   4.00   0.00   0.76
MEOdd      Baseline UW    0.00   0.00   2.29   1.49   3.50   2.25   1.50   2.74   1.31
MEOdd      U-Fair (Ours)  0.80   0.80   1.43   1.16   1.33   0.75   1.00   0.00   1.17
MEAcc      Unitask        0.91   0.81   0.89   0.56   1.20   0.81   1.01   0.96   1.44
MEAcc      Multitask      0.96   1.09   0.89   0.89   0.55   1.23   1.01   0.87   0.94
MEAcc      Baseline UW    0.96   1.30   0.84   0.72   0.69   1.03   1.08   0.91   1.25
MEAcc      U-Fair (Ours)  1.09   1.16   0.80   0.96   0.64   1.28   1.11   1.14   0.95

Table 6: Full experimental results for DAIC-WOZ across the different PHQ-8 subitems. Best values are
highlighted in bold.


Measure    Approach       PHQ-1  PHQ-2  PHQ-3  PHQ-4  PHQ-5  PHQ-6  PHQ-7  PHQ-8  Binary Outcome
Acc        Unitask        0.80   0.66   0.59   0.66   0.59   0.61   0.63   0.89   0.55
Acc        Multitask      0.68   0.54   0.48   0.43   0.52   0.54   0.48   0.54   0.58
Acc        Baseline UW    0.75   0.63   0.61   0.73   0.73   0.63   0.59   0.89   0.87
Acc        U-Fair (Ours)  0.77   0.61   0.61   0.54   0.71   0.71   0.71   0.93   0.90
F1         Unitask        0.27   0.24   0.49   0.60   0.47   0.45   0.49   0.25   0.51
F1         Multitask      0.18   0.32   0.47   0.43   0.40   0.38   0.38   0.07   0.45
F1         Baseline UW    0.22   0.36   0.54   0.48   0.29   0.09   0.08   0.00   0.27
F1         U-Fair (Ours)  0.13   0.21   0.39   0.43   0.33   0.33   0.27   0.00   0.45
Precision  Unitask        0.29   0.21   0.38   0.45   0.34   0.33   0.33   0.25   0.36
Precision  Multitask      0.14   0.22   0.33   0.30   0.29   0.28   0.25   0.04   0.32
Precision  Baseline UW    0.20   0.27   0.41   0.54   0.43   0.10   0.07   0.00   0.28
Precision  U-Fair (Ours)  0.14   0.18   0.35   0.33   0.40   0.36   0.27   0.00   0.46
Recall     Unitask        0.25   0.27   0.69   0.88   0.71   0.69   0.91   0.25   0.87
Recall     Multitask      0.25   0.55   0.81   0.75   0.64   0.62   0.82   0.25   0.80
Recall     Baseline UW    0.25   0.55   0.81   0.44   0.21   0.08   0.09   0.00   0.26
Recall     U-Fair (Ours)  0.13   0.27   0.44   0.63   0.29   0.31   0.27   0.00   0.45
UAR        Unitask        0.58   0.51   0.60   0.69   0.60   0.60   0.65   0.60   0.63
UAR        Multitask      0.50   0.52   0.58   0.53   0.55   0.55   0.58   0.47   0.67
UAR        Baseline UW    0.54   0.59   0.67   0.64   0.56   0.43   0.40   0.48   0.60
UAR        U-Fair (Ours)  0.50   0.48   0.56   0.56   0.57   0.57   0.55   0.50   0.70
MSP        Unitask        0.26   2.78   0.81   1.12   0.94   1.44   1.03   0.52   0.65
MSP        Multitask      5.67   2.63   1.19   1.40   0.98   1.44   1.24   0.41   1.25
MSP        Baseline UW    1.55   1.29   2.58   2.47   2.06   2.32   5.67   0.00   3.86
MSP        U-Fair (Ours)  2.06   2.83   1.26   2.67   3.61   1.29   1.29   0.00   1.67
MEOpp      Unitask        0.17   1.80   0.53   0.72   0.61   0.93   0.67   0.33   0.57
MEOpp      Multitask      3.67   1.70   0.77   0.90   0.63   0.93   0.80   0.26   0.81
MEOpp      Baseline UW    1.00   0.83   1.67   1.60   1.33   1.50   3.67   0.00   2.31
MEOpp      U-Fair (Ours)  1.33   1.83   0.82   1.73   2.33   0.83   0.83   0.00   1.00
MEOdd      Unitask        0.35   3.65   1.39   1.38   1.00   1.46   1.40   0.74   0.75
MEOdd      Multitask      7.00   3.42   1.29   1.63   1.03   1.53   1.43   0.41   1.41
MEOdd      Baseline UW    3.00   1.76   4.20   6.11   2.00   0.00   0.00   0.00   8.21
MEOdd      U-Fair (Ours)  2.80   3.42   2.22   3.67   3.60   2.25   1.90   0.00   5.00
MEAcc      Unitask        1.13   0.74   1.45   0.84   1.14   0.96   0.71   1.08   0.83
MEAcc      Multitask      0.63   0.39   0.77   0.41   0.94   0.77   0.54   1.77   0.65
MEAcc      Baseline UW    1.05   0.71   0.48   0.99   0.89   0.81   0.88   1.12   0.92
MEAcc      U-Fair (Ours)  0.96   0.64   1.22   0.47   0.83   0.74   1.03   1.05   0.94

Table 7: Full experimental results for E-DAIC across the different PHQ-8 subitems. Best values are high-
lighted in bold.
