18 s2.0 S294971912400027X Main
∗ Corresponding author.
E-mail addresses: [email protected] (T. Ahmed), [email protected] (S. Ivan), [email protected] (A. Munir),
[email protected] (S. Ahmed).
https://doi.org/10.1016/j.nlp.2024.100079
Received 8 January 2024; Received in revised form 3 April 2024; Accepted 1 May 2024
2949-7191/© 2024 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC license
(http://creativecommons.org/licenses/by-nc/4.0/).
T. Ahmed, S. Ivan, A. Munir et al. Natural Language Processing Journal 7 (2024) 100079
depression for a particular individual so that healthcare professionals can obtain a more accurate idea about the patient’s mental state. Recent advances in Machine Learning (ML), particularly Natural Language Processing (NLP), have opened up new avenues for automated and objective depression severity assessment utilizing large-scale language models (Martínez-Castaño et al., 2020; Khan et al., 2023c). These models, particularly transformer-based architectures such as BERT, have developed the ability to recognize complex linguistic patterns, semantic links, and contextual nuances after having been trained on enormous amounts of textual data and have thus been shown to be quite promising for this particular task.

This work examines the viability and efficacy of employing different variants of transformer-based architectures in depression severity detection and offers a quantifiable evaluation that will help clinicians, researchers, and other mental health professionals to prioritize patients, track their progress through treatment, and facilitate targeted interventions. The contributions of this work are as follows:

1. An ensemble-based pipeline exploiting three variants of transformer-based models, namely vanilla BERT, BERTweet, and ALBERT, has been proposed for predicting the severity of depression into four categories: non-depressed, mild, moderate, and severely depressed.
2. We utilized a wide range of data preprocessing techniques to improve the quality and relevance of the input and enhance the overall effectiveness of the proposed pipeline.
3. A detailed explainability analysis of the predictions from both local and global perspectives has been provided to shed light on the decision-making strategy of the proposed framework.
4. Furthermore, to the best of our knowledge, we are the first to explore the extent to which a Large Language Model (LLM) like ‘ChatGPT’ can perform this task without fine-tuning compared to the proposed architecture.

The remainder of the paper is organized as follows: Section 2 discusses the relevant literature in this field. Section 3 provides a description of the dataset used in our experiments. In Section 4, we present our proposed methodology, followed by the results and findings of our experiment in Section 5. Finally, Section 6 presents the concluding remarks and discusses the future scope of our work.

2. Literature review

The task of detecting depression from linguistic data can be formulated as either a two-class classification problem, as in ‘depression’ or ‘not depression’, or, as in our case, as a multiclass classification problem, where the different classes correspond to levels of severity of depression. Although it is very challenging to use computational linguistics techniques to replace in-person mental illness diagnosis completely, they can be used as an additional tool for tracking patients’ progress and depression levels during diagnosis and can enable doctors to intervene more skillfully and effectively. Various works in the past have attempted to classify depression from different types of physiological or anatomical data. Patel et al. (2015) worked on detecting depression from neuroimaging data. The authors formulated their task as a binary classification problem and used supervised Machine Learning (ML) models like Support Vector Machine (SVM) with linear and nonlinear kernels and Relevance Vector Regression. In a similar work, Gao et al. (2018) used MRI data to diagnose Major Depressive Disorder (MDD) using popular ML-based algorithms such as SVM, Linear Discriminant Analysis (LDA), Relevance Vector Machine (RVM), Decision Trees (DT), and Neural Networks. However, both of these works have a notable limitation, which is the absence of depression screening scales.

Mahdy et al. (2020) analyzed social media data from Facebook using the LIWC software to detect depression-relevant factors using DT, K-Nearest Neighbor (KNN), SVM, and an ensemble model. Their experimental results showed that DT performed better as a standalone classifier than as part of the ensemble. However, their work was constrained by the limited number of attributes of the LIWC software. A study by He et al. (2021) examined five AI techniques, including the Bayesian model, Logistic Regression (LR), DT, SVM, and Deep Learning (DL), as well as three primary approaches to brain analysis for psychiatric diseases, including MRI, EEG, and kinesics diagnosis. However, this work used only classic shallow learning algorithms, which leaves us to wonder whether deeper and more advanced architectures would have performed better in this regard.

The practicality of applying ML-based approaches in healthcare has been demonstrated by their capacity to analyze massive amounts of diverse data and deliver insightful clinical information (Khan et al., 2022). These methods have been shown to effectively guide the understanding of mental health issues and help experts in accurate decision-making (Dwyer et al., 2018; Kulkarni et al., 2024b). The results of these predictions can aid in the early detection of people with high-risk medical problems, such as depression (Sidey-Gibbons and Sidey-Gibbons, 2019). The key domains for extracting observations linked to mental health conditions through ML-based methods can be broadly categorized into sensor data, text data, structured data, and multimodal technology interactions (Thieme et al., 2020). The sensor data can be evaluated with the aid of auditory signals and specialized devices that may take readings from the patient. The text data are usually obtained from social media platforms, instant messages, and clinical records. Structured data are obtained from more rigorously designed documents such as questionnaires, standard screening scales, and medical health records. Finally, multimodal technology interactions incorporate the data from human interactions with common technological devices, robots, and virtual agents. Among these, the bulk of research uses sensor data from mobile devices or smart devices (Aldabbas et al., 2022; Nisar et al., 2021) and textual data from Twitter (Chen et al., 2018; Joshi et al., 2018) to identify mood disorders.

Diagnostic information can be gleaned from the patient’s psychiatric records by analyzing textual data (Diederich et al., 2007). Moreover, the severity of mental diseases and suicidal behaviors can be predicted through the use of text message data analysis (Nobles et al., 2018) and clinical health records (Adamou et al., 2018; Tran et al., 2013). In recent times, researchers have been placing more emphasis on using textual data from social media platforms for user sentiment analysis, which is evident from various works related to automated cyberbullying detection (Ahmed et al., 2021; Akhter et al., 2023), hate speech detection (Davidson et al., 2017; Chakravarthi et al., 2023), etc. Researchers have obtained considerable success in these tasks by applying different approaches such as Logistic Regression, SVM, and even transformer-based architectures like BERT, DistilBERT, RoBERTa, etc. It is worth mentioning that compared to traditional models like Logistic Regression and SVM, the more advanced transformer-based models have performed significantly better in recent years. Thus, our work finds application in using textual data for detecting depression severity using advanced transformer-based models and their ensembles. The application of transformer-based approaches for detecting the severity of depression is a relatively new and less explored idea. The in-depth analysis provided in this paper will help medical practitioners assess their patients better and provide valuable insights for future researchers striving to contribute to this field of study.

3. Dataset

We have utilized the ‘DEPTWEET’ dataset, curated by Kabir et al. (2022), for conducting our experiments. The dataset consists of 40 191 crowdsourced tweets with corresponding labels and their associated confidence scores. The proponents of this dataset established a typology for the social media contents (in this case, the texts from the tweets), which is built upon a psychological theory for detecting the severity of depression. Based on their labeling typology, they assigned
denoted by the ‘#’ symbol in social media posts and comments, are frequently used to categorize or group related content. However, hashtags do not typically have significant meaning in the context of classification tasks and may bias the judgment of the classifier. As a result, we decided to remove them to eliminate potential distractions and reduce the dimensionality of the data. Similarly, users on social networking sites frequently mention other users by including their usernames preceded by the ‘@’ symbol. While mentions are useful in user interactions and conversations, they rarely help with text classification tasks. By removing mentions, we remove noise and reduce data complexity, allowing the models to focus on the content of the posts and comments. The presence of URLs or links to external websites, articles, or media is another common occurrence in social media posts and comments. While URLs can add context or information, they are not always necessary for text classification. As a result, we eliminate them before passing the text to the classifier for training or inference. Contractions, such as “don’t” rather than “do not”, are frequently used in informal language within social media posts and comments. Expanding contractions into their full form ensures that the models recognize the full words and may capture more meaningful information. This step helps to maintain consistency and ensures that the models accurately capture the intended semantics. In addition, we addressed repeated punctuation in the text. Multiple consecutive repetitions of punctuation marks, such as “!!!” or “??”, are commonly used for emphasis in social media posts and comments, but reducing repetitions to a single instance eliminates potential bias introduced by excessive punctuation usage and mitigates the risk of overfitting. This ensures that the models are influenced by meaningful textual information rather than exaggerated punctuation.

In social media posts and comments, numerical expressions and digits are frequently used. However, unless specific tasks necessitate the capture of numerical data, numbers are usually unimportant in text classification. We simplify the text and reduce noise by removing numbers. Emojis are graphical symbols that are used to express emotions, ideas, or concepts in social media posts and comments. While emojis can add context or emotion, they are not always necessary for classification tasks. Similarly, emoticons, which are textual representations of facial expressions or emotions that use characters like “:)” or “:D”, are widely used in social media communication. We also removed emojis and emoticons in the data preprocessing phase. Finally, we addressed the presence of extra spaces within the text and replaced them with a single space. We did not remove stop words from the input sequences as we used contextual transformer-based models that provide context to the user’s intent (Qiao et al., 2019). Table 1 shows some examples of how our data is preprocessed.

4.2. Classifier networks

We have leveraged the benefits of transfer learning by using three classifier networks pretrained on generalized tasks from different domains. Recent works have shown that the choice of two to five independent and diverse models strikes a good balance between prediction accuracy and training time (Mohammed and Kora, 2023). We selected three models as this results in acceptable training time and avoids ties during the model voting. We have experimented with BERT (Devlin et al., 2019), BERTweet (Nguyen et al., 2020), and ALBERT (Lan et al., 2020), all of which are based on the transformer architecture initially proposed by Vaswani et al. (2017). All three models used here utilize a bidirectional architecture, which allows the analysis of the input text from both left and right directions. BERT is chosen as the standard baseline model, as it is considered a breakthrough in Large Language Models (LLMs). An alternative to BERT, RoBERTa, was discarded due to the significantly larger model size and, consequently, the slower training time of the model. Since we are only considering social media posts made on Twitter, BERTweet was a logical choice in our architecture, having been pretrained solely on a large collection of tweets. We included ALBERT as a representative lightweight model. Although DistilBERT (Sanh et al., 2020) is a popular choice as a compact model, it was omitted since it suffers from performance degradation due to its smaller size. Such degradation is not as significant in ALBERT.

The BERT model is considered a breakthrough in Large Language Models (LLMs). Unlike the original encoder–decoder transformer, it uses only the encoder stack. We have utilized the base variant of BERT, which consists of 12 transformer-based encoder layers. For a given input text, the model can generate feature vectors for each position of the input, which are then utilized in different language-based tasks. Initially, the model is pretrained for both Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Here, MLM allows BERT to understand bi-directional contexts while NSP facilitates the understanding of consecutive sentences. The pretraining dataset of BERT consists of 16 GB of text data obtained from Wikipedia and the Toronto BookCorpus dataset.

BERTweet is a language model pretrained on a large dataset of English tweets. It follows the pretraining technique introduced by Liu et al. (2019) in RoBERTa. The pretraining dataset consists of 845M tweets collected between 2012 and 2019, with an additional 5M tweets related to the COVID-19 pandemic. The total size of the dataset is around 80 GB. BERTweet is pretrained for a comparatively longer time using larger batch sizes and focuses on only the MLM task. ALBERT is a lighter version of the original BERT model that aims to speed up the training process and lower memory consumption by applying parameter reduction techniques while suffering little to no performance degradation. One of the key contributions of ALBERT is that it allows for weight sharing between the different layers of the network. The datasets used for pretraining ALBERT are the same ones used for the BERT model.

4.3. Ensemble classifiers

We have experimented with two different heterogeneous ensembling methods for categorizing depression in the input texts. In heterogeneous methods, the classifier networks are finetuned using the same dataset. Here, we have used majority voting (hard voting) and probability averaging (soft voting) to perform the ensemble.

In the case of majority voting, we take the label predictions from the three models and consider the label that occurred the maximum number of times as our final prediction. The problem of an equal number of votes for the different classes is mostly alleviated by using an odd number of networks in our architecture. For probability averaging, the probability scores of each of the classes were taken from each network, which are then used to compute a weighted average probability score for
Table 1
Examples of data preprocessing.

Original sample: Considering how lonely and depressed I’ve been for the past 19 months my internet fam (especially here on Twitter) has kept me alive. Follow me because you didn’t totally hate my noisy techno once and end up helping my obliterated mental health a little...
Preprocessed sample: Considering how lonely and depressed i have been for the past months my internet fam especially here on twitter has kept me alive follow me because you did not totally hate my noisy techno once and end up helping my obliterated mental health a little

Original sample: //hey guys! So i finally got my first fursuit mask done! Here is a vid of me using it! https://t.co/CHJsjk1Cq1
Preprocessed sample: Hey guys so i finally got my first fursuit mask done here is a vid of me using it

Original sample: I’m just more disappointed that they RUSHED Melissa back to set instead of allowing her to be a mom. She had JUST given birth and BOOM, back on set in 2.5 months and the episodes show how much she’s drained so much!
Preprocessed sample: I am just more disappointed that they rushed melissa back to set instead of allowing her to be a mom she had just given birth and boom back on set in months and the episodes show how much she is drained so much
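The cleaning steps of Section 4.1 that Table 1 illustrates can be sketched with Python's standard `re` module. This is a minimal illustration and not the authors' code: the contraction map, the emoticon pattern, and the emoji code-point ranges are small hypothetical stand-ins, and the function name `preprocess` is ours. The sketch follows only the steps named in the text; the samples in Table 1 suggest that lowercasing and further punctuation stripping were also applied, and those are not reproduced here.

```python
import re

# Hypothetical, deliberately small contraction map for illustration only.
CONTRACTIONS = {
    "don't": "do not",
    "didn't": "did not",
    "i'm": "i am",
    "i've": "i have",
    "she's": "she is",
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
CONTRACTION_RE = re.compile(r"\b\w+'\w+\b")
# Western-style emoticons such as :) ;-) :D =P (a rough, assumed pattern).
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][\-']?[\)\(\]\[dDpP/\\|](?!\w)")
# Common emoji and pictograph code-point ranges (not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
REPEATED_PUNCT_RE = re.compile(r"([!?.,])\1+")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
WHITESPACE_RE = re.compile(r"\s+")


def expand_contractions(text):
    """Replace known contractions with their full forms (case-insensitive lookup)."""
    def replace(match):
        word = match.group(0)
        return CONTRACTIONS.get(word.lower(), word)
    return CONTRACTION_RE.sub(replace, text)


def preprocess(text):
    """Apply the cleaning steps in the order they are described in Section 4.1."""
    text = text.replace("\u2019", "'")         # normalise curly apostrophes
    text = URL_RE.sub(" ", text)               # drop URLs
    text = MENTION_RE.sub(" ", text)           # drop @mentions
    text = HASHTAG_RE.sub(" ", text)           # drop #hashtags
    text = expand_contractions(text)           # don't -> do not
    text = EMOTICON_RE.sub(" ", text)          # drop emoticons
    text = EMOJI_RE.sub(" ", text)             # drop emojis
    text = REPEATED_PUNCT_RE.sub(r"\1", text)  # "!!!" -> "!"
    text = NUMBER_RE.sub(" ", text)            # drop numbers
    return WHITESPACE_RE.sub(" ", text).strip()


print(preprocess("@bob I don't care!!! see https://t.co/abc #sad 19 :)"))
# -> I do not care! see
```

URLs are removed first so that later patterns (mentions, emoticons, numbers) cannot match fragments inside a link; stop words are left untouched, as in the paper.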
each of the classes. Finally, the class with the highest probability score is taken as the predicted label. For an input sample $x$, the class label can be mathematically written as:

\[ \mathrm{class}(x) = \arg\max_{c} \frac{1}{3} \sum_{k=1}^{3} P_{M_k}(y = c \mid x) \tag{4} \]

Here, $M_k$ denotes the $k$th classifier and $P_{M_k}(y = c \mid x)$ denotes the probability of $y$ obtaining the class $c$ for a sample $x$. Our initial experiments suggested that soft voting performs better since it allows the final prediction to be equally influenced by all the models of the ensemble.

4.4. Interpretability for ensemble architectures

Local explanation techniques such as LIME (Local Interpretable Model-Agnostic Explanations) (Ribeiro et al., 2016) are essential for interpreting the decision-making process of highly complex and opaque text classifiers. Although these transformer-based models demonstrate exceptional capabilities in capturing the contextual and semantic information in textual data, understanding the inner mechanisms becomes challenging due to their inherent complexity. LIME serves as a valuable tool in addressing this challenge by generating human-comprehensible explanations with insights for individual predictions. To gain a concise understanding of ensemble architectures comprising multiple transformer-based text classifiers, we introduce modifications to the existing LIME algorithm, resulting in a summarized interpretability framework.

Our objective is to explain the decision made by an ensemble model when given an input instance, denoted as $x$, for which it generates a prediction $f(x)$. The ensemble model consists of three transformer-based text classifiers: BERT, BERTweet, and ALBERT. To provide a combined explanation, we employ a surrogate model, specifically a linear regressor, which is easier to interpret. The surrogate model approximates the behavior of the black-box models, focusing on the instance of interest. We utilize word-level perturbation on the input instance $x$ in our proposed explainer. This strategy involves randomly replacing words with synonyms, adding or removing words, shuffling word positions, masking or removing certain words while keeping the rest intact, and perturbing n-grams to capture local linguistic structures. We generate 5000 distinct perturbed samples, denoted as $x_i$, where $i \in [1, 5000]$, from the input instance $x$.

The black-box models then evaluate these perturbed instances to obtain their class-wise probabilities. Assuming the prediction for sample $x_i$ made by the $k$th transformer-based classifier is $p_i = (y_i^k)$, where $i \in [1, 5000]$ and $k \in [1, 3]$, we train the linear regressor using the perturbed features and the class-wise probabilities as the ground truth. The coefficients of the surrogate model indicate the importance of each feature in the local decision-making process of the $k$th model. To obtain the ensemble feature importance of the input instance $x$, we train the regressor once for each classifier $k$, altering the class-wise probabilities in each iteration. Finally, we calculate the average of the coefficients to determine the overall importance of features for the ensemble architecture’s decision regarding the input instance $x$.

Furthermore, we provide a comprehensive explanation on a global scale. In this context, we aim to identify and highlight the most significant tokens for each class within a given dataset. To obtain a global explanation, we aggregate the token importance values from multiple samples within each class by calculating the average importance score for each token across the samples belonging to a specific class. By doing so, we obtain a comprehensive understanding of the most relevant tokens for distinguishing between the different classes in the dataset.

To mathematically formulate the interpretability of the proposed architecture, we consider the set of class labels in the dataset denoted by $C$, the set of samples in the dataset by $S$, and the set of tokens in the input sample represented by $T$. Let us denote the token importance matrix as $M$, where $M[t, s, c]$ represents the importance of the token $t$ in sample $s$ belonging to class $c$. For each token $t$ in $T$, and each sample $s$ in $S$, we determine the token importance $M[t, s, c]$ for each class $c$ following the method described in Fig. 3. Subsequently, for each class $c$ in $C$, we calculate the global explanation by averaging the token importance across all samples corresponding to that class as shown in Eq. (5).

\[ E[t, c] = \frac{1}{\lvert S_c \rvert} \sum_{s \in S_c} M[t, s, c] \tag{5} \]

Here, $S_c$ represents the subset of samples in $S$ belonging to class $c$, and $\lvert S_c \rvert$ denotes the number of samples in $S_c$. $E[t, c]$ represents the global explanation score for token $t$ in class $c$.
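The two aggregation steps of Section 4.4 can be sketched in plain Python: averaging the surrogate coefficients over the classifiers for one instance, and then averaging token importances over the samples of a class as in Eq. (5). This is an illustrative sketch only. It assumes the per-classifier surrogate coefficients have already been fitted elsewhere and are available as `{token: coefficient}` dictionaries; the function names are hypothetical, and treating a token that is absent from a dictionary as having importance 0 is our assumption, not something the paper states.

```python
from collections import defaultdict


def ensemble_importance(per_model_coefs):
    """Average token coefficients over the surrogate fits (one dict per classifier).

    Assumption: a token missing from one classifier's dict contributes 0.
    """
    k = len(per_model_coefs)
    tokens = set().union(*per_model_coefs)
    return {t: sum(coefs.get(t, 0.0) for coefs in per_model_coefs) / k for t in tokens}


def global_explanation(samples):
    """Eq. (5): E[t, c] = (1 / |S_c|) * sum over s in S_c of M[t, s, c].

    `samples` is a list of (class_label, {token: importance}) pairs, where each
    dict holds the ensemble importances of one sample. Tokens that do not occur
    in a sample are treated as having importance 0 for that sample.
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for label, importances in samples:
        counts[label] += 1
        for token, value in importances.items():
            totals[label][token] += value
    return {
        label: {token: total / counts[label] for token, total in token_totals.items()}
        for label, token_totals in totals.items()
    }


# Toy numbers, invented purely for illustration.
local = ensemble_importance([
    {"sad": 0.6, "tired": 0.2},   # surrogate fitted on classifier 1
    {"sad": 0.3},                 # classifier 2
    {"sad": 0.3, "tired": 0.1},   # classifier 3
])
scores = global_explanation([
    ("severe", {"sad": 0.4}),
    ("severe", {"sad": 0.2, "alone": 0.6}),
    ("mild", {"tired": 0.1}),
])
```

Keeping the two steps separate mirrors the text: the first function yields a local explanation for one instance, and the second consumes many such local explanations to rank tokens per class.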
Table 2
Experimental results (↑: higher is better for all metrics).

Architecture          Accuracy    Accuracy^a  Precision   Precision^a Recall      Recall^a    F1-score    F1-score^a  AUC–ROC     AUC–ROC^a
BERT                  0.8337      0.8432      0.8235      0.8359      0.8337      0.8432      0.8279      0.8392      0.9633      0.9723
BERTweet              0.8398      0.8503      0.8287      0.8463      0.8398      0.8503      0.8324      0.8475      0.9663      0.9748
ALBERT                0.8326      0.8385      0.7975      0.8223      0.8326      0.8385      0.8058      0.8246      0.9558      0.9713
Weighted hard-voting  0.8061      0.8060      0.6497      0.6497      0.8061      0.8060      0.7195      0.7195      –           –
Weighted soft-voting  0.8492      0.8549      0.8259      0.8474      0.8492      0.8549      0.8323      0.8502      0.9740      0.9759

^a Performance using the preprocessed data as described in Section 4.1.
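The hard-voting and soft-voting rows of Table 2 correspond to the two combination rules of Section 4.3. The sketch below shows both rules in plain Python under stated assumptions: the function names, the label order, and the example probabilities are hypothetical, the tie-breaking rule for hard voting is our choice (the paper does not specify one), and equal weights reproduce the 1/3 factor of Eq. (4).

```python
from collections import Counter

LABELS = ["non-depressed", "mild", "moderate", "severe"]  # assumed label order


def hard_vote(predictions):
    """Majority vote over per-model label indices.

    Assumed tie-break: among the most frequent labels, return the one
    predicted by the earliest model in the list.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


def soft_vote(probabilities, weights=None):
    """Weighted average of class probabilities followed by arg max.

    With weights=None every model gets weight 1/len(probabilities),
    which matches Eq. (4) for three models.
    """
    n_models = len(probabilities)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities))
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=averaged.__getitem__)
    return best, averaged


# Invented probabilities: two models lean weakly to class 0, one strongly to class 1.
probs = [
    [0.50, 0.45, 0.03, 0.02],
    [0.48, 0.47, 0.03, 0.02],
    [0.10, 0.80, 0.05, 0.05],
]
hard_label = hard_vote([max(range(4), key=p.__getitem__) for p in probs])
soft_label, averaged = soft_vote(probs)
print(LABELS[hard_label], LABELS[soft_label])
```

In this toy case the two rules disagree: hard voting follows the two narrow majorities, while soft voting lets the single confident model tip the average, which is the kind of discarded confidence information the discussion of Table 2 refers to.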
5. Results and discussion

In order to evaluate the efficacy of the proposed pipeline, we have investigated the experimental results using the DEPTWEET dataset described in Section 3. In this section, we present and justify our results along with a thorough explainability analysis.

5.1. Evaluation metrics

In the context of multi-class text classification using an ensemble of transformer-based models, the evaluation metrics of Model Size, Parameter Count, Accuracy, Precision, Recall, F1-score, AUC–ROC, Inference Time, and Maximum Memory Allocated play a crucial role in assessing and understanding the performance and practicality of the proposed architecture. These metrics are not only relevant for research purposes but also have real-life implications in various scenarios.

Accuracy alone can be misleading in our depression classification task since the dataset is heavily imbalanced. If a model predicts all individuals as non-depressed (the majority class), it may achieve a high accuracy score, potentially over 80%. However, this score does not reflect the model’s ability to correctly identify individuals with depression across different severity levels. Precision, which measures the proportion of correctly predicted positive instances out of all predicted positive instances, may also be misleading. If the model predicts individuals as non-depressed, it may achieve high precision for the non-depressed class due to its larger representation in the dataset. Similarly, recall, which measures the proportion of correctly predicted positive instances out of all actual positive instances, can be misleading. In this case, the recall may be lower for the severe, mild, and moderate depression classes due to their smaller sample sizes, even if the model correctly identifies some instances within these classes. F1-score, however, considers both precision and recall and provides a balanced estimation. It takes into account the model’s ability to correctly classify individuals across all depression severity levels. For example, if the model correctly identifies 40 out of the 100 severe depression cases with a precision of 80% and a recall of 40%, the F1-score (2 × 0.8 × 0.4 / (0.8 + 0.4) ≈ 0.53) balances both metrics and offers a more accurate representation of the performance.

When evaluating the performance across different classes, the Area Under the Receiver Operating Characteristic Curve (AUC–ROC) is a useful metric since it considers both the true positive rate and the false positive rate. It provides a comprehensive evaluation of correctness in ranking instances, making it applicable to unbalanced datasets. The discriminating strength and capacity to distinguish between classes of the model improve as the AUC–ROC increases.

5.2. Performance of baseline architectures

In this section, we evaluate the performance of different baseline architectures on the task of multi-class depression classification. We compare the results of BERT, BERTweet, ALBERT, and the two ensemble approaches: weighted hard-voting and weighted soft-voting. The evaluation metrics include accuracy, precision, recall, F1-score, and AUC–ROC.

The results, as shown in Table 2, indicate that BERTweet achieves the highest accuracy (0.8398) among the standalone baseline architectures. However, when considering ensemble methods, the weighted soft-voting approach outperforms all other architectures, achieving an accuracy of 0.8492. By considering the probabilities assigned to each class by individual models, the weighted soft-voting approach takes into account the varying levels of confidence and uncertainty present in their predictions. This flexibility enables the ensemble to make more nuanced decisions and avoid amplifying biases that may exist in the individual models. This indicates the effectiveness of combining multiple models to improve classification performance. On the preprocessed data, the weighted soft-voting ensemble also achieves the highest precision, recall, F1-score, and AUC–ROC values compared to other architectures; on the raw data, BERTweet is marginally ahead in precision (0.8287 vs. 0.8259) and F1-score (0.8324 vs. 0.8323). This demonstrates its ability to handle imbalanced datasets and make accurate predictions across different classes of depression severity. However, weighted hard-voting performs significantly worse compared to our weighted soft-voting approach, as soft-voting effectively utilizes the full range of probabilistic information provided by the models, whereas hard voting only focuses on final decisions, potentially discarding valuable information.

Furthermore, we have observed the class-wise AUC–ROC scores for BERT, ALBERT, and BERTweet as demonstrated in Fig. 4. Notably, our proposed architecture exhibited superior performance compared to the individual models as shown in Fig. 4(d).

5.3. Comparison with state-of-the-art methods

In this section, we compare our results with the state-of-the-art works presented in the literature for multi-class depression classification. Specifically, we compare our proposed model with the performance reported by Kabir et al. (2022) using various baseline architectures, including traditional ML-based approaches, DL-based approaches, and transformer-based approaches.

Kabir et al. (2022) conducted experiments using Support Vector Machines (SVM), Bidirectional Long Short-Term Memory (BiLSTM), BERT, and DistilBERT as their baseline models for depression classification. Among these models, DistilBERT showed the best performance in terms of class-wise AUC–ROC values, which makes it a relevant point of comparison for our study. To ensure a fair and meaningful comparison, we focus on evaluating and comparing the AUC–ROC scores, which the original authors reported in their work. By considering this metric, we can assess the discrimination and predictive capabilities of our proposed model in relation to the baseline performance.

We observe significant enhancements in the performance of our proposed model compared to the DistilBERT-based classifier proposed by Kabir et al. (2022) across different depression severity classes as shown in Table 3. We showed in our experiments that BERT and ALBERT outperform DistilBERT; hence, it was not included in our ensemble pipeline. Our model achieved an AUC–ROC of 0.9256 for the ‘non-depressed’ class, representing an improvement of 13.68%. In the ‘mild’ depression class, our model achieved an AUC–ROC of 0.8829, showing a substantial enhancement of 13.57%. Similarly, for the ‘moderate’ depression class, our model achieved an AUC–ROC of 0.9370, reflecting an improvement of 14.91%. Lastly, in the ‘severe’ depression class, our model achieved an impressive AUC–ROC of 0.9848, representing a significant improvement of 11.88%. Based on these comparisons, it is evident that our proposed model outperforms the baseline approach used by Kabir et al. (2022) in terms of class-wise discrimination for depression severity classification. The substantial
Table 4
Comparison of architecture metrics (↓: lower is better for all metrics). The two ensembles share the same values.

Architecture          Model size (MB)  Parameter count  Inference time (ms)  Maximum memory allocated during inference (MB)
BERT                  427.76           109 485 316      1.70                 984.40
BERTweet              527.05           134 903 044      1.44                 1077.15
ALBERT                45.67            11 686 660       1.52                 690.42
Weighted hard-voting  1000.48          256 075 020      4.66                 1077.15
Weighted soft-voting  1000.48          256 075 020      4.66                 1077.15
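The ensemble row of Table 4 is numerically consistent with simple aggregates of the three member models: model size, parameter count, and inference time match the sums, and maximum memory matches the largest single-model peak. The short check below reproduces the row under that reading. The aggregation rules are inferred from the numbers rather than stated in the text shown here, and the dictionary keys and function name are hypothetical.

```python
# Per-model figures copied from Table 4.
MODELS = {
    "BERT": {"size_mb": 427.76, "params": 109_485_316, "latency_ms": 1.70, "peak_mem_mb": 984.40},
    "BERTweet": {"size_mb": 527.05, "params": 134_903_044, "latency_ms": 1.44, "peak_mem_mb": 1077.15},
    "ALBERT": {"size_mb": 45.67, "params": 11_686_660, "latency_ms": 1.52, "peak_mem_mb": 690.42},
}


def ensemble_footprint(models):
    """Aggregate member-model metrics under the assumed rules:
    sizes, parameters, and latencies add up (models run one after another),
    while peak memory is the maximum over the members."""
    rows = list(models.values())
    return {
        "size_mb": round(sum(r["size_mb"] for r in rows), 2),
        "params": sum(r["params"] for r in rows),
        "latency_ms": round(sum(r["latency_ms"] for r in rows), 2),
        "peak_mem_mb": max(r["peak_mem_mb"] for r in rows),
    }


footprint = ensemble_footprint(MODELS)
print(footprint)
```

Under this reading the ensemble costs roughly the sum of its parts in storage and latency, while its peak memory is bounded by the largest member, BERTweet.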
the maximum improvement achieved in the ‘moderate’ class. Fig. 5(b) provides a general overview where we compare the two models based on the four metrics. Our model shows improvement across all the metrics, with significant improvements in accuracy, recall, and F1 score. While ChatGPT demonstrates promising results in generative tasks, the nature of depression classification necessitates a deeper understanding of the specific domain and relevant contextual cues. ChatGPT, being a generic language model, may not capture the task-specific patterns and nuances essential for accurate classification. On the other hand, our proposed fine-tuned domain-specific model is specifically tailored for depression classification, leveraging the benefits of fine-tuning on depression-related data. This enables our models to learn and adapt to the intricacies of the depression domain, such as identifying symptoms, analyzing language patterns indicative of depressive states, and capturing nuances in different levels of depression severity.

5.5. Comparison of architecture metrics

In the comparison of architecture metrics presented in Table 4, we quantitatively assess several key metrics: model size, parameter count, inference time, and maximum memory allocated during inference. These metrics allow us to make informed judgments about the different baseline architectures and the ensemble models.

To deploy models in the real world, it is essential to keep model size and parameter count in mind, as resources like memory and storage space are typically scarce. Researchers may evaluate the viability of deploying models on diverse platforms, from cloud servers to edge devices, by assessing the size and parameter count of the models. The parameter count is only one of several contributors to a model’s overall size, which is why the two are not strictly proportional. The total size of a model includes not only the model itself but also its metadata, optimizer state, and data structures. Other factors, such as compression methods, data type, and framework optimizations, may also affect the overall size of the model. Real-time applications rely heavily on predictions that are both accurate and timely, making inference time an essential factor. The time it takes to evaluate each instance and provide predictions is critical if we want to deploy a real-time depression detection chatbot or integrate it into current social media platforms so that we can provide timely and engaged user experiences. A lower inference time for incoming text input translates directly into a more efficient and responsive model. The scalability and efficiency of the deployed system also depend on monitoring the maximum memory allocated during model inference. Monitoring peaks in memory use helps researchers spot and fix memory bottlenecks that may otherwise compromise the efficacy and reliability of the model. When distributing models to devices with limited memory, or when running many models in parallel, efficient memory management becomes crucial.

While model size and parameter count provide valuable insights into the complexity and capacity of the models, their importance may be overshadowed by the practical considerations of inference time and memory allocation during real-life deployment. The focus shifts towards optimizing the efficiency and responsiveness of the models, enabling them to handle high volumes of data and deliver fast predictions without exceeding memory limitations. Furthermore, advancements in model architectures, compression techniques, and model optimization algorithms have enabled the development of compact models with reduced parameter counts and model sizes without sacrificing performance significantly. Therefore, model size and parameter count, although still important, have become less critical compared to inference time and memory allocation in many real-life scenarios.

Starting with model size, we observe that ALBERT has the smallest size, occupying only 45.67 MB. BERT follows with a size of 427.76 MB, and BERTweet is the largest standalone model at 527.05 MB. However, it is important to note that the ensemble models have a larger model size of 1000.48 MB. This indicates that the ensemble models require more storage capacity compared to individual architectures. Moving on to parameter count, ALBERT again demonstrates its parameter-efficient design with the lowest count of 11,686,660 parameters. BERT has a higher parameter count of 109,485,316, and BERTweet exceeds both with 134,903,044 parameters. The ensemble models have a significantly larger parameter count of 256,075,020. These numbers illustrate the trade-off between model complexity and ensemble performance. Regarding inference time, we find that BERTweet achieves the lowest inference time of 1.44 ms, followed by ALBERT with 1.52 ms. BERT takes slightly more time at 1.70 ms. However, when employing ensemble models, inference time increases significantly to 4.66 ms. These
Fig. 6. Local explanation from the soft-voting classifier for a sample predicted as ‘Non-Depressed’.
values highlight the impact of model aggregation on inference speed, with ensemble models requiring more time due to the combination and decision-making processes involved. Lastly, we analyze the maximum memory allocated during inference. BERTweet consumes the most memory, allocating 1077.15 MB, followed by BERT at 984.40 MB. ALBERT utilizes the least memory, with 690.42 MB. Interestingly, both ensemble models also allocate 1077.15 MB of memory, which matches the memory requirement of BERTweet. This suggests that the peak memory of an ensemble is determined by its most memory-intensive constituent model.

Quantifying these architecture metrics provides valuable insights into the characteristics and trade-offs of each architecture and the use of ensemble models. While ALBERT demonstrates compact size, low parameter count, and efficient memory utilization, it may sacrifice some performance compared to larger models like BERT and BERTweet. However, ensemble models offer the potential for improved accuracy and robustness by leveraging diverse models’ strengths. The decision to utilize an ensemble model should consider the specific use cases and the balance between performance gains and resource requirements.

5.6. Interpretability and error analysis

In this work, the terms ‘interpretability’ and ‘explainability’ are used interchangeably and refer to the same concept of understanding a model’s decisions. One of the primary contributions of our research work lies in implementing both local and global explainability techniques for an ensemble of transformer models. The goal of interpretability in our study is to provide insights into the decision-making process of the ensemble architecture and shed light on the factors influencing its predictions. In this subsection, we elaborate on the findings and implications of our analysis.

5.6.1. Local explainability

Local explainability allows us to delve into the prediction process of the ensemble architecture at an individual sample level. To achieve this, we selected representative samples from each class and visualized the importance of different tokens to reveal how much each token contributes to the prediction, and whether it supports the predicted class, remains neutral, or counts against it. This examination enables us to gain a better understanding of the decision-making process of the model of interest. Fig. 6 shows the local explanation for a sample predicted as belonging to the ‘Non-Depressed’ class. Observing the importance of each token, we see that despite the sentence containing the word ‘depressed’, the ensemble model successfully predicted the correct class. The local explanations for samples with different severity of depression are presented in Fig. 7.

During our analysis, we encountered a misclassification scenario in which the best-performing standalone model, BERTweet, misclassified a sample from the ‘severe’ class, as depicted in Fig. 8. Further analysis indicated that BERTweet did not assign sufficient weights to the crucial tokens for predicting the correct class. However, when we visualized the local explainability for the ensemble architecture (Fig. 7(c)), we observed that the model placed greater importance on the tokens that were significant to the ground truth class. Consequently, the ensemble model correctly classified the sample. From the visual representation given in Fig. 7, it is evident that these local interpretability insights showcase the effectiveness of the ensemble architecture in capturing class-specific token importance, leading to improved performance and a more accurate prediction process.

5.6.2. Global explainability

Moving beyond local explainability, we also employed global explainability techniques to gain a broader understanding of the decision patterns of our proposed model. Global explainability involves identifying important class-wise tokens and analyzing their relevance to the respective classes. Notably, the analysis provided by Fig. 9 proved instrumental in conducting an overall performance and error analysis. By determining the tokens that carry the most weight for each class, we could observe distinct patterns highly relevant to the specific characteristics and symptoms of the class of interest.

Our proposed ensemble architecture, employing soft-voting, demonstrated satisfactory performance on the dataset (Section 3), surpassing previous benchmarks. An analysis of the confusion matrix, presented in Fig. 10, reveals a significant decrease in performance for the ‘mild’ and ‘moderate’ classes. Upon examining the empirical results and evaluating the dataset samples, we hypothesize that two primary factors contribute most to this performance drop.

Firstly, there is a noticeable imbalance in the number of samples across different severity classes. The ‘non-depressed’ class contains a substantially larger number of samples compared to the ‘mild’, ‘moderate’, and ‘severe’ classes. This imbalance arises as most user comments and posts published worldwide daily do not exhibit any degree of depression. However, for deep transformer-based networks to generalize well, it is crucial to have an ample number of class-specific samples. Therefore, we hypothesize that increasing the sample size of the minority classes will lead to improved performance. Secondly, the high similarity between the ‘non-depressed’ and ‘mild’ classes, as well as the ‘mild’ and ‘moderate’ classes, contributes to the performance decline. The contextual differences between samples from these classes are minimal, requiring the classifier to possess a robust contextual understanding to distinguish between them accurately. For example,
Fig. 7. Local explanation from the soft-voting classifier for samples of ‘Mild’, ‘Moderate’, and ‘Severe’ levels of depression.
in Fig. 9(b), the most influential token in the ‘mild’ class is ‘depressed’. However, a sample from the ‘non-depressed’ class (Fig. 6) also contains the word ‘depressed’. In such cases, the model must grasp the context (identifying the underlying sarcasm in this case) to classify the sample correctly. This high inter-class similarity is evident in the samples from the ‘mild’ (Fig. 7(a)) and ‘moderate’ (Fig. 7(b)) classes, which are difficult to tell apart even for humans without domain expertise. Additionally, there are several common tokens among the highly weighted tokens from the ‘mild’ and ‘moderate’ classes, as depicted in Figs. 9(b) and 9(c).

The interpretability and error analysis conducted in this study has several important implications. Firstly, the local explainability analysis highlights the advantage of ensemble architectures over standalone models in capturing nuanced token importances and making accurate predictions. The ensemble’s ability to incorporate diverse perspectives from individual models enhances its overall performance and enables a more comprehensive understanding of the data. Secondly, the global explainability analysis allows us to uncover class-specific patterns and gain insights into the features and themes contributing to the prediction of each class. This knowledge can aid in better understanding depression severity and facilitate more informed decision-making in the context of mental health assessment.

6. Conclusion

Depression, when left untreated, can lead to various complications, such as an increased risk of self-harm, substance abuse, and other mental health disorders. Hence, early detection of depression is crucial in mental health diagnosis, as it allows prompt intervention
Fig. 8. Local explanation for a misclassified sample from the Bertweet model.
and treatment, preventing the condition from worsening. This work presents a deep learning-based pipeline for classifying depression severity in social media posts into four categories: mild, moderate, severe, and non-depressed. We have utilized a wide range of preprocessing techniques along with an ensemble of three transformer-based models for effective prediction and achieved state-of-the-art performance on the ‘DEPTWEET’ dataset. We also provided a thorough explainability analysis to understand the decision-making process of the proposed pipeline. Moreover, recognizing the widespread use of LLMs such as ChatGPT, we demonstrated the capabilities of ChatGPT in a zero-shot context for this particular task, underscoring the distinction in performance between general-purpose LLMs and models tailored to specific tasks. In the future, multimodal analysis can be conducted by combining text analysis with other modalities, such as images, videos, or audio from social media posts, to improve the accuracy of depression detection and severity assessment. The imbalanced distribution of samples was a prime reason behind misclassifications, which can be mitigated by collecting more data in the future. Context-level data augmentation may prove useful in this regard. Moreover, extended analysis beyond a single social media platform may provide a more comprehensive view of individuals’ mental health and allow for a more accurate assessment. Finally, it is important to note that while these methods can provide indications of potential depressive symptoms, they should not replace professional diagnosis or clinical assessment. The analysis of social media posts should be used as a supplementary tool to aid mental
Kulkarni, D., Ghosh, A., Girdhari, A., Liu, S., Vance, L.A., Unruh, M., Sarkar, J., 2024a. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C.,
Enhancing pre-trained contextual embeddings with triplet loss as an effective fine- Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L.,
tuning method for extracting clinical features from electronic health record derived Simens, M., Askell, A., Welinder, P., Christiano, P.F., Leike, J., Lowe, R.,
mental health clinical notes. Natl. Lang. Process. J. 6, 100045. http://dx.doi.org/ 2022. Training language models to follow instructions with human feedback. In:
10.1016/j.nlp.2023.100045, URL: https://www.sciencedirect.com/science/article/ Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (Eds.), In:
pii/S2949719123000420. Advances in Neural Information Processing Systems, vol. 35, Curran Associates,
Kulkarni, D., Ghosh, A., Girdhari, A., Liu, S., Vance, L.A., Unruh, M., Sarkar, J., 2024b. Inc., pp. 27730–27744.
Enhancing pre-trained contextual embeddings with triplet loss as an effective fine- Parveen, N., Chakrabarti, P., Hung, B.T., Shaik, A., 2023. Twitter sentiment analysis
tuning method for extracting clinical features from electronic health record derived using hybrid gated attention recurrent network. J. Big Data 10, 1–29.
mental health clinical notes. Natl. Lang. Process. J. 6, 100045. http://dx.doi.org/ Patel, M.J., Khalaf, A.M., Aizenstein, H.J., 2015. Studying depression using imaging
10.1016/j.nlp.2023.100045, URL: https://www.sciencedirect.com/science/article/ and machine learning methods. NeuroImage : Clin. 10, 115–123.
pii/S2949719123000420. Qiao, Y., Xiong, C., Liu, Z., Liu, Z., 2019. Understanding the behaviors of BERT in
KVTKN, P., Ramakrishnudu, T., 2023. Semi-supervised approach for tweet-level ranking. ArXiv abs/1904.07531.
stress detection. Natl. Lang. Process. J. 4, 100019. http://dx.doi.org/10. Ribeiro, M.T., Singh, S., Guestrin, C., 2016. ‘‘Why should I trust you?’’: Explaining the
1016/j.nlp.2023.100019, URL: https://www.sciencedirect.com/science/article/pii/ predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International
S294971912300016X. Conference on Knowledge Discovery and Data Mining.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R., 2020. ALBERT: Safa, R., Bayat, P., Moghtader, L., 2021. Automatic detection of depression symptoms
A lite BERT for self-supervised learning of language representations. arXiv:1909. in twitter using multimodal analysis. J. Supercomput. 78, 4709–4744.
11942. Sanh, V., Debut, L., Chaumond, J., Wolf, T., 2020. DistilBERT, a distilled version of
Laskar, M.T.R., Bari, M.S., Rahman, M., Bhuiyan, M.A.H., Joty, S., Huang, J.X., 2023. A BERT: smaller, faster, cheaper and lighter.
systematic study and comprehensive evaluation of ChatGPT on benchmark datasets. Shahzad, M., Freeman, C., Rahimi, M., Alhoori, H., 2023. Predicting Facebook sen-
arXiv:2305.18486. timents towards research. Natl. Lang. Process. J. 3, 100010. http://dx.doi.org/10.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettle- 1016/j.nlp.2023.100010, URL: https://www.sciencedirect.com/science/article/pii/
moyer, L., Stoyanov, V., 2019. RoBERTa: A robustly optimized BERT pretraining S2949719123000079.
approach. arXiv preprint arXiv:1907.11692. Sidey-Gibbons, J.A.M., Sidey-Gibbons, C.J., 2019. Machine learning in medicine: a
Mahdy, N., Magdi, D.A., Dahroug, A., Rizka, M.A., 2020. Comparative study: Dif- practical introduction. BMC Med. Res. Methodol. 19.
ferent techniques to detect depression using social media. In: Ghalwash, A.Z., TaghiBeyglou, B., Rudzicz, F., 2024. Context is not key: Detecting Alzheimer’s disease
El Khameesy, N., Magdi, D.A., Joshi, A. (Eds.), Internet of Things—Applications with both classical and transformer-based neural language models. Natl. Lang.
and Future. Springer Singapore, Singapore, pp. 441–452. Process. J. 6, 100046. http://dx.doi.org/10.1016/j.nlp.2023.100046, URL: https:
Martínez-Castaño, R., Htait, A., Azzopardi, L., Moshfeghi, Y., 2020. Early risk detection //www.sciencedirect.com/science/article/pii/S2949719123000432.
of self-harm and depression severity using BERT-based transformers. In: Conference Thieme, A., Belgrave, D., Doherty, G., 2020. Machine learning in mental health: A
and Labs of the Evaluation Forum. systematic review of the HCI literature to support effective ML system design. ACM
Trans. Comput.-Hum. Interact. 27.
Mohammed, A., Kora, R., 2023. A comprehensive review on ensemble deep learn-
Tran, T., Phung, D.Q., Luo, W., Harvey, R., Berk, M., Venkatesh, S., 2013. An integrated
ing: Opportunities and challenges. J. King Saud Univ. - Comput. Inf. Sci. 35
framework for suicide risk prediction. In: Proceedings of the 19th ACM SIGKDD
(2), 757–774. http://dx.doi.org/10.1016/j.jksuci.2023.01.014, URL: https://www.
international conference on Knowledge discovery and data mining.
sciencedirect.com/science/article/pii/S1319157823000228.
Uysal, A.K., Günal, S., 2014. The impact of preprocessing on text classification. Inf.
Nguyen, D.Q., Vu, T., Nguyen, A.T., 2020. BERTweet: A pre-trained language model
Process. Manage. 50, 104–112.
for English Tweets. In: Proceedings of the 2020 Conference on Empirical Methods
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L.u.,
in Natural Language Processing: System Demonstrations. pp. 9–14.
Polosukhin, I., 2017. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H.,
Nisar, D.-E.-M., Amin, R., Shah, N.-U.-H., Ghamdi, M.A.A., Almotiri, S.H., Al-
Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Attention is All you Need. In:
ruily, M., 2021. Healthcare techniques through deep learning: Issues, challenges
Advances in Neural Information Processing Systems, vol. 30, Curran Associates,
and opportunities. IEEE Access 9, 98523–98541.
Inc..
Nobles, A.L., Glenn, J.J., Kowsari, K., Teachman, B.A., Barnes, L.E., 2018. Identification
Yazdavar, A.H., Al-Olimat, H.S., Ebrahimi, M., Bajaj, G., Banerjee, T., Thirunarayan, K.,
of imminent suicide risk among Young adults using text messages. In: Proceedings
Pathak, J., Sheth, A., 2017. Semi-supervised approach to monitoring clinical
of the 2018 CHI Conference on Human Factors in Computing Systems.
depressive symptoms in social media. In: Proceedings of the 2017 IEEE/ACM
Ofek, N., Katz, G., Shapira, B., Bar-Zev, Y., 2015. Sentiment analysis in transcribed
International Conference on Advances in Social Networks Analysis and Mining
utterances. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining.
2017.
Organization, W.H., et al., 1993. The ICD-10 Classification of Mental and Behavioural
Zafar, A., Chitnis, D.S., 2020. Survey of depression detection using social networking
Disorders: Diagnostic Criteria for Research. World Health Organization.
sites via data mining. In: 2020 10th International Conference on Cloud Computing,
Organization, W.H., et al., 2017. Depression and Other Common Mental Disorders:
Data Science & Engineering (Confluence). pp. 88–93.
Global Health Estimates. Technical Report, World Health Organization.
13