
Natural Language Processing Journal 7 (2024) 100079

Contents lists available at ScienceDirect

Natural Language Processing Journal


journal homepage: www.elsevier.com/locate/nlp

Decoding depression: Analyzing social network insights for depression severity assessment with transformers and explainable AI

Tasnim Ahmed a,b, Shahriar Ivan a,∗, Ahnaf Munir a, Sabbir Ahmed a
a Islamic University of Technology, Board Bazar, Gazipur, 1704, Bangladesh
b Queen’s University, Kingston, K7L 2N8, Ontario, Canada

ARTICLE INFO

Keywords: Mental health; Depression severity estimation; Ensemble voting; Explainability analysis; ChatGPT evaluation; Social media analysis

ABSTRACT

Depression is a mental state characterized by recurrent feelings of melancholy, hopelessness, and disinterest in activities, having a significant negative influence on everyday functioning and general well-being. Millions of users express their thoughts and emotions on social media platforms, which can be used as a rich source of data for early detection of depression. In this connection, this work leverages an ensemble of transformer-based architectures for quantifying the severity of depression from social media posts into four categories: non-depressed, mild, moderate, and severe. At first, a diverse range of preprocessing techniques is employed to enhance the quality and relevance of the input. Then, the preprocessed samples are passed through three variants of transformer-based models, namely vanilla BERT, BERTweet, and ALBERT, for generating predictions, which are combined using a weighted soft-voting approach. We conduct a comprehensive explainability analysis to gain deeper insights into the decision-making process, examining both local and global perspectives. Furthermore, to the best of our knowledge, we are the first to explore the extent to which a Large Language Model (LLM) like ‘ChatGPT’ can perform this task. Evaluation of the model on the publicly available ‘DEPTWEET’ dataset produces state-of-the-art performance with 13.5% improvement in AUC–ROC score.

∗ Corresponding author.
E-mail addresses: [email protected] (T. Ahmed), [email protected] (S. Ivan), [email protected] (A. Munir), [email protected] (S. Ahmed).

https://doi.org/10.1016/j.nlp.2024.100079
Received 8 January 2024; Received in revised form 3 April 2024; Accepted 1 May 2024

2949-7191/© 2024 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

1. Introduction

Social media sites have become a part of our daily lives in recent years, offering an unparalleled amount of user-generated material. Social media data has become a great resource for numerous applications in mental health diagnosis as millions of users express their thoughts, emotions, and experiences online (Kayalvizhi et al., 2022; Shahzad et al., 2023; KVTKN and Ramakrishnudu, 2023). The detection of depression, a common mental health condition that affects millions of people globally (Johnson et al., 2018), is one area of focus in particular. Researchers have been investigating the use of data obtained from social media platforms as a means of early identification and intervention for depression (Zafar and Chitnis, 2020), and they accomplish this task by examining the language and patterns displayed in posts on these platforms.

Depression is a complicated and multifaceted disorder marked by persistent feelings of sadness, loss of interest or pleasure, and a variety of physical and cognitive symptoms. It is one of the most prevalent and, at the same time, treatable mental health conditions regularly observed by health care providers and mental health specialists (Kroenke et al., 2001; Kulkarni et al., 2024a). Increasing levels of depression commonly lead to suicidal thoughts in an individual (Akhter et al., 2023). It is estimated that around 800,000 people all over the world die from suicide annually, most of which result from depression, and as such, it necessitates a comprehensive response from the academic and research communities (Ezawa et al., 2021; Organization et al., 2017).

Traditional techniques of diagnosing depression frequently rely on subjective and time-consuming clinical interviews or self-reported assessments (Alshawwa et al., 2019). Additionally, those who are depressed might not always ask for assistance or may be unaware of their condition, which could result in delayed diagnosis and treatment. The emergence of social media platforms, particularly microblogging platforms like Facebook, Twitter, etc., has created a new way to utilize non-clinical data, which can be quite useful for the overall assessment of patients (TaghiBeyglou and Rudzicz, 2024; Khan et al., 2023a; Bucci et al., 2019; Khan et al., 2023b). There has been a massive data flow resulting from the increasing rate of internet access, as well as people spontaneously sharing their struggle, pain, and suffering anonymously on these platforms (Ofek et al., 2015). Moreover, along with detecting cases of depression, it is equally important to assess the severity of
depression for a particular individual so that healthcare professionals can obtain a more accurate idea about the patient’s mental state. Recent advances in Machine Learning (ML), particularly Natural Language Processing (NLP), have opened up new avenues for automated and objective depression severity assessment utilizing large-scale language models (Martínez-Castaño et al., 2020; Khan et al., 2023c). These models, particularly transformer-based architectures such as BERT, have developed the ability to recognize complex linguistic patterns, semantic links, and contextual nuances after having been trained on enormous amounts of textual data and are thus shown to be quite promising for this particular task.

This work examines the viability and efficacy of employing different variants of transformer-based architectures in depression severity detection and offers a quantifiable evaluation that will help clinicians, researchers, and other mental health professionals to prioritize patients, track their progress through treatment, and facilitate targeted interventions. The contributions of this work are as follows:

1. An ensemble-based pipeline exploiting three variants of transformer-based models, namely vanilla BERT, BERTweet, and ALBERT, has been proposed for predicting the severity of depression into four categories: non-depressed, mild, moderate, and severely depressed.
2. We utilized a wide range of data preprocessing techniques to improve the quality and relevance of the input and enhance the overall effectiveness of the proposed pipeline.
3. A detailed explainability analysis of the predictions from both local and global perspectives has been provided to shed light on the decision-making strategy of the proposed framework.
4. Furthermore, to the best of our knowledge, we are the first to explore the extent to which a Large Language Model (LLM) like ‘ChatGPT’ can perform this task without fine-tuning compared to the proposed architecture.

The remainder of the paper is organized as follows: Section 2 discusses the relevant literature in this field. Section 3 provides a description of the dataset used in our experiments. In Section 4, we present our proposed methodology, followed by the results and findings of our experiment in Section 5. Finally, Section 6 presents the concluding remarks and the future scope of our work.

2. Literature review

The task of detecting depression from linguistic data can be formulated as either a two-class classification problem, as in ‘depression’ or ‘not depression’, or, as in our case, a multiclass classification problem, where the different classes correspond to levels of severity of depression. Although it is very challenging to use computational linguistics techniques to replace in-person mental illness diagnosis completely, such techniques can serve as an additional tool for tracking patients’ progress and depression levels during diagnosis and enable doctors to intervene more skillfully and effectively. Various works in the past have attempted to classify depression from different types of physiological or anatomical data. Patel et al. (2015) worked on detecting depression from neuroimaging data. The authors formulated their task as a binary classification problem and used supervised Machine Learning (ML) models like Support Vector Machine (SVM) with linear and nonlinear kernels and Relevance Vector Regression. In a similar work, Gao et al. (2018) used MRI data to diagnose Major Depressive Disorder (MDD) using popular ML-based algorithms such as SVM, Linear Discriminant Analysis (LDA), Relevance Vector Machine (RVM), Decision Trees (DT), and Neural Networks. However, both of these works have a notable limitation, which is the absence of depression screening scales.

Mahdy et al. (2020) analyzed social media data from Facebook using the LIWC software to detect depression-relevant factors using DT, K-Nearest Neighbor (KNN), SVM, and an ensemble model. Their experimental results showed that DT performed better as a standalone classifier than as part of the ensemble. However, their work was constrained by the limited number of attributes of the LIWC software. A study by He et al. (2021) examined five AI techniques, including the Bayesian model, Logistic Regression (LR), DT, SVM, and Deep Learning (DL), as well as three primary ways for brain analysis for psychiatric diseases, including MRI, EEG, and kinesics diagnosis. However, this work used only classic shallow learning algorithms, which leaves us to wonder whether deeper and more advanced architectures would have performed better in this regard.

The practicality of applying ML-based approaches in healthcare has been demonstrated by their capacity to analyze massive amounts of diverse data and deliver insightful clinical information (Khan et al., 2022). These methods have been shown to effectively guide the understanding of mental health issues and help experts in accurate decision-making (Dwyer et al., 2018; Kulkarni et al., 2024b). The results of these predictions can aid in the early detection of people with high-risk medical problems, such as depression (Sidey-Gibbons and Sidey-Gibbons, 2019). The key domains for extracting observations linked to mental health conditions through ML-based methods can be broadly categorized into sensor data, text data, structured data, and multimodal technology interactions (Thieme et al., 2020). The sensor data can be evaluated with the aid of auditory signals and specialized devices that may take readings from the patient. The text data are usually obtained from social media platforms, instant messages, and clinical records. Structured data are obtained from more rigorously designed documents such as questionnaires, standard screening scales, and medical health records. Finally, multimodal technology interactions incorporate the data from human interactions with common technological devices, robots, and virtual agents. Among these, the bulk of research uses sensor data from mobile devices or smart devices (Aldabbas et al., 2022; Nisar et al., 2021) and textual data from Twitter (Chen et al., 2018; Joshi et al., 2018) to identify mood disorders.

Diagnostic information can be gleaned from the patient’s psychiatric records by analyzing textual data (Diederich et al., 2007). Moreover, the severity of mental diseases and suicidal behaviors can be predicted through the use of text message data analysis (Nobles et al., 2018) and clinical health records (Adamou et al., 2018; Tran et al., 2013). In recent times, researchers have been placing more emphasis on using textual data from social media platforms for user sentiment analysis, which is evident from various works related to automated cyberbullying detection (Ahmed et al., 2021; Akhter et al., 2023), hate speech detection (Davidson et al., 2017; Chakravarthi et al., 2023), etc. Researchers have obtained considerable success in these tasks by applying different approaches such as Logistic Regression, SVM, and even transformer-based architectures like BERT, DistilBERT, RoBERTa, etc. It is worth mentioning that compared to traditional models like Logistic Regression and SVM, the more advanced transformer-based models have performed significantly better in recent years. Thus, our work finds application in using textual data for detecting depression severity using advanced transformer-based models and their ensembles. The application of transformer-based approaches for detecting the severity of depression is a relatively new and less explored idea. The in-depth analysis provided in this paper will help medical practitioners assess their patients better and provide valuable insights for future researchers striving to contribute to this field of study.

3. Dataset

We have utilized the ‘DEPTWEET’ dataset, curated by Kabir et al. (2022), for conducting our experiments. The dataset consists of 40,191 crowdsourced tweets with corresponding labels and their associated confidence scores. The proponents of this dataset established a typology for the social media contents (in this case, the texts from the tweets), which is built upon a psychological theory for detecting the severity of depression. Based on their labeling typology, they assigned
a higher-level classification to each tweet in the dataset, such as (1) Non-depressed, (2) Mild depression, (3) Moderate depression, and (4) Severe depression, as well as an associated confidence score (from 0.5 to 1) for each of the labels. To measure the severity of depression from a particular tweet, the authors used a well-established clinical assessment method known as the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) (Arbanas, 2015). According to this manual, clinical depression can be diagnosed based on a set of symptoms that have persisted for a significant amount of time (Yazdavar et al., 2017). Based on this idea, the authors used a set of questionnaires provided by the Patient Health Questionnaire (PHQ-9) (Kroenke et al., 2001), which is widely used for diagnosing the severity of depression. A set of nine symptoms related to different mental disorders, such as suicidal thoughts, lack of interest, sleep disorder, etc., had been extracted using this set of questionnaires, where the frequency of these symptoms indicated the severity level of depression (Organization et al., 1993).

Kabir et al. (2022) originally curated a total of 44,100 tweets, from which 1399 samples were discarded by the authors as they were damaged (i.e., either the tweet ID or text was altered) during the annotation process, while 2510 samples were removed due to ambiguous annotation (receiving different labels from three different annotators). Thus, the final dataset we worked with consisted of 40,191 samples, each identified as a tuple having the following fields: tweetId, repliesCount, retweetsCount, likesCount, target, label, and confidence score.

The overall class distribution of the dataset is illustrated in Fig. 1. As evident from the dataset distribution, one of the main challenges of this dataset is the data imbalance, since the data is highly biased towards the ‘Non-depressed’ class, and there is a scarcity of samples from the ‘Severe’ class. The subtle differences between the more difficult samples of the ‘Mild’ class and the ‘Moderate’ class make it quite difficult for traditional state-of-the-art language models such as LSTM and BiLSTM to perform well on this type of data since they are not particularly context-aware. In contrast, Transformer-based architectures have performed quite well in such scenarios since they can capture the subtle meaning behind various words by observing the context where these words are used (Ahmed et al., 2022).

Fig. 1. Dataset distribution.

The decision to opt for Transformer-based models over LSTM or BiLSTM is based on some distinct advantages. Firstly, Transformer models are better equipped to avoid bias from imbalanced data. This is primarily due to their self-attention mechanism, which allows them to effectively capture relationships between distant tokens in a sequence without being constrained by sequential dependencies. In contrast, LSTM and BiLSTM architectures rely on recurrent connections, which might struggle to effectively learn from imbalanced data distributions, especially when long-term dependencies are involved. Secondly, in terms of capturing long-term context-aware features, Transformers outperform LSTM and BiLSTM models by excelling in modeling sequences with long-range dependencies. The self-attention mechanism allows Transformers to attend to all positions in the input sequence simultaneously, enabling them to capture contextual information across the entire sequence more efficiently, whereas LSTM and BiLSTM models face limitations in capturing long-range dependencies due to the vanishing gradient problem. The seminal paper on Transformers by Vaswani et al. (2017) discusses this aspect in greater depth, emphasizing that the lengths of the paths traversed by forward and backward signals significantly impact the models’ ability to learn long-range dependencies. The shorter the path between any combination of positions in the input and output sequences, the easier it is to learn such dependencies, as supported by an earlier study by Hochreiter et al. (2001).

4. Proposed methodology

In this study, we have investigated a classification task that predicts the severity of depression using an ensemble of transformer-based models on social media posts and comments. To formulate the task, we consider the context of words denoted by $W$, the sample (tweet) represented by $C$, the prediction provided by the proposed architecture labeled as $P$, and the ground truth indicated by $G_t$. Our dataset $D$ comprises pairs of $\{C, G_t\}$. The objective of our study is to determine the specific context of words $W$ that leads to the prediction $P$ generated by our proposed architecture when considering both the words $W$ and the sample $C$. Considering $f$ as the function representing our proposed architecture, the prediction $P$ is generated as:

$$P = f(W, C) \tag{1}$$

Therefore, the objective of training our classifier is to find the $W$ that minimizes:

$$\min_{W} \; g(f(W, C), G_t) \tag{2}$$

Here, $g$ is the cross-entropy loss function that measures the dissimilarity between the predicted probability distribution and the true label distribution.

Our optimization objective is to find the $W$ that minimizes the expected loss over the dataset $D$:

$$\min_{W} \; \mathbb{E}_{(C, G_t) \in D} \; g(f(W, C), G_t) \tag{3}$$

Here, $\mathbb{E}$ denotes the expectation over the samples and predictions in the dataset, and $W$ is in a very high dimensional discrete space.

An overview of our pipeline is depicted in Fig. 2. At first, the input text undergoes data preprocessing steps before tokenization. The tokenized data is then fed to the three standalone models. An ensemble of their outputs is performed using either max voting or probability averaging to determine the category of depression of the input text.

4.1. Data preprocessing

To increase the quality and relevance of the input, we applied a variety of carefully selected data preprocessing approaches. The following data cleaning techniques have been found useful in several studies for various text classification tasks (Uysal and Günal, 2014) such as sentiment analysis (Parveen et al., 2023), depression classification (Safa et al., 2021), etc.

The first step was to lowercase all of the text. This normalization technique ensures consistency across the dataset and allows the models to focus on semantic meaning rather than being influenced by word cases. By treating all words as lowercase, we create consistency and reduce potential noise caused by capitalization variations. Hashtags,
denoted by the ‘#’ symbol in social media posts and comments, are frequently used to categorize or group related content. However, hashtags do not typically have significant meaning in the context of classification tasks and may bias the judgment of the classifier. As a result, we decided to remove them to eliminate potential distractions and reduce the dimensionality of the data. Similarly, users on social networking sites frequently mention other users by including their usernames followed by the ‘@’ symbol. While mentions are useful in user interactions and conversations, they rarely help with text classification tasks. By removing mentions, we remove noise and reduce data complexity, allowing the models to focus on the content of the posts and comments. The presence of URLs or links to external websites, articles, or media is another common occurrence in social media posts and comments. While URLs can add context or information, they are not always necessary for text classification. As a result, we eliminate them before passing the text to the classifier for training or inference. Contractions, such as “don’t” rather than “do not”, are frequently used in informal language within social media posts and comments. Expanding contractions into their full form ensures that the models recognize the full words and may capture more meaningful information. This step helps to maintain consistency and ensures that the models accurately capture the intended semantics. In addition, we addressed repeated punctuation in the text. Multiple consecutive repetitions of punctuation marks, such as “!!!” or “??”, are commonly used for emphasis in social media posts and comments, but reducing repetitions to a single instance eliminates potential bias introduced by excessive punctuation usage and mitigates the risk of overfitting. This ensures that the models are influenced by meaningful textual information rather than exaggerated punctuation.

Fig. 2. Overview of the proposed pipeline.

In social media posts and comments, numerical expressions and digits are frequently used. However, unless specific tasks necessitate the capture of numerical data, numbers are usually unimportant in text classification. We simplify the text and reduce noise by removing numbers. Emojis are graphical symbols that are used to express emotions, ideas, or concepts in social media posts and comments. While emojis can add context or emotion, they are not always necessary for classification tasks. Similarly, emoticons, which are textual representations of facial expressions or emotions that use characters like “:)” or “:D”, are widely used in social media communication. We also removed emojis and emoticons in the data preprocessing phase. Finally, we addressed the presence of extra spaces within the text and replaced them with a single space. We did not remove stop words from the input sequences as we used contextual transformer-based models that provide context to the user’s intent (Qiao et al., 2019). Table 1 shows some examples regarding how our data is preprocessed.

4.2. Classifier networks

We have leveraged the benefits of transfer learning by using three classifier networks pretrained on generalized tasks from different domains. Recent works have shown that the choice of two to five independent and diverse models strikes a good balance between prediction accuracy and training time (Mohammed and Kora, 2023). We selected three models as this results in acceptable training time and avoids ties during the model voting. We have experimented with BERT (Devlin et al., 2019), BERTweet (Nguyen et al., 2020), and ALBERT (Lan et al., 2020), all of which are based on the transformer architecture initially proposed by Vaswani et al. (2017). All three models used here utilize a bidirectional architecture, which allows the analysis of the input text from both left and right directions. The BERT model is chosen as the standard baseline model that is considered a breakthrough in Large Language Models (LLMs). An alternative to BERT, RoBERTa, was discarded due to the significantly larger model size and, consequently, the slower training time of the model. Since we are only considering social media posts made on Twitter, BERTweet was a logical choice in our architecture, having been pretrained solely on a large collection of tweets. We included ALBERT as a representative lightweight model. Although DistilBERT (Sanh et al., 2020) is a popular choice as a compact model, it was omitted since it suffers from performance degradation due to its smaller size. Such degradation is not as significant in ALBERT.

The BERT model is considered a breakthrough in Large Language Models (LLMs); it is built from the encoder stack of the transformer architecture. We have utilized the base variant of BERT, which consists of 12 transformer-based encoder layers. For a given input text, the model can generate feature vectors for each position of the input, which are then utilized in different language-based tasks. Initially, the model is pretrained for both Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Here, MLM allows BERT to understand bi-directional contexts while NSP facilitates the understanding of consecutive sentences. The pretraining dataset of BERT consists of 16 GB of text data obtained from Wikipedia and the Toronto BookCorpus dataset.

BERTweet is a language model pretrained on a large dataset of English tweets. It follows the pretraining technique introduced by Liu et al. (2019) in RoBERTa. The pretraining dataset consists of 845M tweets collected between 2012 and 2019, with an additional 5M tweets related to the COVID-19 pandemic. The total size of the dataset is around 80 GB. BERTweet is pretrained for a comparatively longer time using larger batch sizes and focuses on only the MLM task. ALBERT is a lighter version of the original BERT model that aims to speed up the training process and lower memory consumption by applying parameter reduction techniques while suffering little to no performance degradation. One of the key contributions of ALBERT is that it allows for weight sharing between the different layers of the network. The datasets used for pretraining ALBERT are the same ones used for the BERT model.

4.3. Ensemble classifiers

We have experimented with two different heterogeneous ensembling methods for categorizing depression in the input texts. In heterogeneous methods, the classifier networks are finetuned using the same dataset. Here, we have used majority voting (hard voting) and probability averaging (soft voting) to perform the ensemble.

In the case of majority voting, we take the label predictions from the three models and consider the label that occurred the maximum number of times as our final prediction. The problem of an equal number of votes for the different classes is mostly alleviated by using an odd number of networks in our architecture. For probability averaging, the probability scores of each of the classes were taken from each network, which are then used to compute a weighted average probability score for
4
T. Ahmed, S. Ivan, A. Munir et al. Natural Language Processing Journal 7 (2024) 100079

Table 1
Examples of data preprocessing.
Original sample Preprocessed sample
Considering how lonely and depressed I’ve been for the past 19 months Considering how lonely and depressed i have been for the past months my
my internet fam (especially here on Twitter) has kept me alive. Follow internet fam especially here on twitter has kept me alive follow me because you
me because you didn’t totally hate my noisy techno once and end up did not totally hate my noisy techno once and end up helping my obliterated
helping my obliterated mental health a little... mental health a little
//hey guys! So i finally got my first fursuit mask done! Here is a vid Hey guys so i finally got my first fursuit mask done here is a vid of me using it
of me using it! https://t.co/CHJsjk1Cq1
I’m just more disappointed that they RUSHED Melissa back to set I am just more disappointed that they rushed melissa back to set instead of
instead of allowing her to be a mom. She had JUST given birth and allowing her to be a mom she had just given birth and boom back on set in
BOOM, back on set in 2.5 months and the episodes show how much months and the episodes show how much she is drained so much
she’s drained so much!

Fig. 3. Local explainability pipeline for ensemble architectures.

each of the classes. Finally, the class with the highest probability score We generate 5000 distinct perturbed samples, denoted as 𝑥𝑖 , where
is taken as the predicted label. For an input sample 𝑥, the class label 𝑖 ∈ [1, 5000], from the input instance 𝑥.
can be mathematically written as: The black-box models then evaluate these perturbed instances to
obtain their class-wise probabilities. Assuming the prediction for sam-
1∑
3
𝑐𝑙𝑎𝑠𝑠(𝑥) = arg max 𝑃 (𝑦 = 𝑐 ∣ 𝑥) (4) ple 𝑥𝑖 made by the 𝑘th transformer-based classifier is 𝑝𝑖 = (𝑦𝑘𝑖 ), where
𝑐 3 𝑘=1 𝑀𝑘
Here, $M_k$ denotes the $k$th classifier and $P_{M_k}(y = c \mid x)$ denotes the probability of $y$ obtaining the class $c$ for a sample $x$. Our initial experiments suggested that soft voting performs better since it allows the final prediction to be equally influenced by all the models of the ensemble.

4.4. Interpretability for ensemble architectures

Local explanation techniques such as LIME (Local Interpretable Model-Agnostic Explanations) (Ribeiro et al., 2016) are essential for interpreting the decision-making process of highly complex and opaque text classifiers. Although these transformer-based models demonstrate exceptional capabilities in capturing the contextual and semantic information in textual data, understanding the inner mechanisms becomes challenging due to their inherent complexity. LIME serves as a valuable tool in addressing this challenge by generating comprehensible explanations for humans with insights for individual predictions. To gain a concise understanding of ensemble architectures comprising multiple transformer-based text classifiers, we introduce modifications to the existing LIME algorithm, resulting in a summarized interpretability framework.

Our objective is to explain the decision made by an ensemble model when given an input instance, denoted as $x$, and generate a prediction $f(x)$. The ensemble model consists of three transformer-based text classifiers: BERT, BERTweet, and ALBERT. To provide a combined explanation, we employ a surrogate model, specifically a linear regressor, which is easier to interpret. The surrogate model approximates the behavior of the black-box models, focusing on the instance of interest. We utilize word-level perturbation on the input instance $x$ in our proposed explainer. This strategy involves randomly replacing words with synonyms, adding or removing words, shuffling word positions, masking or removing certain words while keeping the rest intact, and perturbing n-grams to capture local linguistic structures.

$i \in [1, 5000]$ and $k \in [1, 3]$, we train the linear regressor using the perturbed features and the class-wise probabilities as the ground truth. The coefficients of the surrogate model indicate the importance of each feature in the local decision-making process of the $k$th model. To obtain the ensemble feature importance of the input instance $x$, we train the regressor $k$ times, altering the class-wise probabilities in each iteration. Finally, we calculate the average of the coefficients to determine the overall importance of features for the ensemble architecture's decision regarding the input instance $x$.

Furthermore, we provide a comprehensive explanation on a global scale. In this context, we aim to identify and highlight the most significant tokens for each class within a given dataset. To obtain a global explanation, we aggregate the token importance values from multiple samples within each class by calculating the average importance score for each token across the samples belonging to a specific class. By doing so, we obtain a comprehensive understanding of the most relevant tokens for distinguishing between the different classes in the dataset.

To mathematically formulate the interpretability of the proposed architecture, we consider the set of class labels in the dataset denoted by $C$, the set of samples in the dataset by $S$, and the set of tokens in the input sample represented by $T$. Let us denote the token importance matrix as $M$, where $M[t, s, c]$ represents the importance of the token $t$ in sample $s$ belonging to class $c$. For each token $t$ in $T$, and each sample $s$ in $S$, we determine the token importance $M[t, s, c]$ for each class $c$ following the method described in Fig. 3. Subsequently, for each class $c$ in $C$, we calculate the global explanation by averaging the token importance across all samples corresponding to that class as shown in Eq. (5).

\[ E[t, c] = \frac{1}{|S_c|} \sum_{s \in S_c} M[t, s, c] \tag{5} \]

Here, $S_c$ represents the subset of samples in $S$ belonging to class $c$, and $|S_c|$ denotes the number of samples in $S_c$. $E[t, c]$ represents the global explanation score for token $t$ in class $c$.
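The two averaging steps described above (averaging the surrogate coefficients over the models, then averaging per class as in Eq. (5)) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `ensemble_importance` and `global_explanation`, the dictionary-based data layout, and the toy coefficient values are hypothetical, and a token that does not occur in a sample is assumed to contribute an importance of zero.

```python
from collections import defaultdict


def ensemble_importance(per_model_coefs):
    """Average the surrogate coefficients of one sample over all models.

    per_model_coefs: one dict per model, mapping token -> coefficient.
    A token missing from a model's dict is treated as coefficient 0.
    """
    totals = defaultdict(float)
    for coefs in per_model_coefs:
        for token, weight in coefs.items():
            totals[token] += weight
    n_models = len(per_model_coefs)
    return {token: total / n_models for token, total in totals.items()}


def global_explanation(sample_importances, sample_labels):
    """Eq. (5): E[t, c] = (1 / |S_c|) * sum over s in S_c of M[t, s, c].

    sample_importances: one dict per sample, mapping token -> importance.
    sample_labels: the class label of each sample, in the same order.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for importance, label in zip(sample_importances, sample_labels):
        counts[label] += 1
        for token, weight in importance.items():
            sums[label][token] += weight
    return {
        label: {token: total / counts[label] for token, total in tokens.items()}
        for label, tokens in sums.items()
    }


# Hypothetical coefficients from three surrogate regressors for one sample.
per_model = [
    {"sad": 0.6, "tired": 0.2},
    {"sad": 0.3, "tired": 0.4},
    {"sad": 0.3},
]
local = ensemble_importance(per_model)

# Hypothetical second sample of the same class, already ensemble-averaged.
scores = global_explanation([local, {"sad": 0.2, "alone": 0.6}], ["mild", "mild"])
for token, value in sorted(scores["mild"].items()):
    print(f"{token}: {value:.2f}")
```

Sorting the per-class scores in descending order would give a ranking comparable in spirit to a "top tokens per class" view; whether the original pipeline treats absent tokens as zero or skips them is not stated in the text.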

T. Ahmed, S. Ivan, A. Munir et al. Natural Language Processing Journal 7 (2024) 100079

Table 2
Experimental results (↑ indicates that higher is better).

Architecture          Accuracy ↑  Accuracy^a ↑  Precision ↑  Precision^a ↑  Recall ↑  Recall^a ↑  F1-score ↑  F1-score^a ↑  AUC–ROC ↑  AUC–ROC^a ↑
BERT                  0.8337      0.8432        0.8235       0.8359         0.8337    0.8432      0.8279      0.8392        0.9633     0.9723
BERTweet              0.8398      0.8503        0.8287       0.8463         0.8398    0.8503      0.8324      0.8475        0.9663     0.9748
ALBERT                0.8326      0.8385        0.7975       0.8223         0.8326    0.8385      0.8058      0.8246        0.9558     0.9713
Weighted hard-voting  0.8061      0.8060        0.6497       0.6497         0.8061    0.8060      0.7195      0.7195        –          –
Weighted soft-voting  0.8492      0.8549        0.8259       0.8474         0.8492    0.8549      0.8323      0.8502        0.9740     0.9759

a Performance using the preprocessed data as described in Section 4.1.
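In every row of Table 2 the recall equals the accuracy, which is what support-weighted averaging of per-class recall produces. The sketch below illustrates that identity and shows how a degenerate classifier that always predicts the majority class scores under these metrics. It assumes, without confirmation from the source, that the reported precision, recall, and F1-score are support-weighted averages; the function name `weighted_metrics` and the class counts are hypothetical and are not the DEPTWEET distribution.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        # Per-class scores; a class that is never predicted gets precision 0.
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n         # support share of this class
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


# Hypothetical imbalanced label distribution (1000 samples).
y_true = (["non-depressed"] * 806 + ["mild"] * 100
          + ["moderate"] * 60 + ["severe"] * 34)
# Degenerate baseline: always predict the majority class.
y_pred = ["non-depressed"] * len(y_true)

acc, prec, rec, f1 = weighted_metrics(y_true, y_pred)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

With a majority share near 0.806, this baseline gives a weighted precision of about 0.65 and a weighted F1-score of about 0.72 despite an accuracy above 0.80. These values are close to the weighted hard-voting row of Table 2, although the source does not state how that row arose.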

5. Results and discussion

In order to evaluate the efficacy of the proposed pipeline, we have investigated the experimental results using the DEPTWEET dataset described in Section 3. In this section, we present and justify our results along with a thorough explainability analysis.

5.1. Evaluation metrics

In the context of multi-class text classification using an ensemble of transformer-based models, the evaluation metrics of Model Size, Parameter Count, Accuracy, Precision, Recall, F1-score, AUC–ROC, Inference Time, and Maximum Memory Allocated play a crucial role in assessing and understanding the performance and practicality of the proposed architecture. These metrics are not only relevant for research purposes but also have real-life implications in various scenarios.

Accuracy alone can be misleading in our depression classification task since the dataset is heavily imbalanced. If a model predicts all individuals as non-depressed (the majority class), it may achieve a high accuracy score, potentially over 80%. However, this score does not reflect the model's ability to correctly identify individuals with depression across different severity levels. Precision, which measures the proportion of correctly predicted positive instances out of all predicted positive instances, may also be misleading. If the model predicts most individuals as non-depressed, it may achieve high precision for the non-depressed class due to its larger representation in the dataset. Similarly, recall, which measures the proportion of correctly predicted positive instances out of all actual positive instances, can be misleading. In this case, the recall may be lower for the severe, mild, and moderate depression classes due to their smaller sample sizes, even if the model correctly identifies some instances within these classes. F1-score, however, considers both precision and recall and provides a balanced estimation. It takes into account the model's ability to correctly classify individuals across all depression severity levels. For example, if the model correctly identifies 40 out of the 100 severe depression cases with a precision of 80% and a recall of 40%, the F1-score, 2 × 0.8 × 0.4 / (0.8 + 0.4) ≈ 0.53, balances both metrics and offers a more accurate representation of the performance.

When evaluating the performance across different classes, the Area Under the Receiver Operating Characteristic Curve (AUC–ROC) is a useful metric since it considers both the true positive rate and the false positive rate. It provides a comprehensive evaluation of correctness in ranking instances, making it applicable to imbalanced datasets. The model's discriminating strength, that is, its capacity to distinguish between classes, improves as the AUC–ROC increases.

5.2. Performance of baseline architectures

In this section, we evaluate the performance of different baseline architectures on the task of multi-class depression classification. We compare the results of BERT, BERTweet, ALBERT, and the two ensemble approaches: weighted hard-voting and weighted soft-voting. The evaluation metrics include accuracy, precision, recall, F1-score, and AUC–ROC.

The results, as shown in Table 2, indicate that BERTweet achieves the highest accuracy among the standalone baseline architectures (0.8398 on the raw data and 0.8503 on the preprocessed data). However, when considering ensemble methods, the weighted soft-voting approach outperforms all other architectures, achieving an accuracy of 0.8492 (0.8549 with preprocessing). By considering the probabilities assigned to each class by individual models, the weighted soft-voting approach takes into account the varying levels of confidence and uncertainty present in their predictions. This flexibility enables the ensemble to make more nuanced decisions and avoid amplifying biases that may exist in the individual models. This indicates the effectiveness of combining multiple models to improve classification performance. The weighted soft-voting ensemble also achieves the highest recall and AUC–ROC in both settings and, on the preprocessed data, the highest precision and F1-score; on the raw data, BERTweet is marginally ahead in precision (0.8287 vs. 0.8259) and F1-score (0.8324 vs. 0.8323). This suggests that the ensemble can handle imbalanced datasets and make accurate predictions across different classes of depression severity. However, weighted hard-voting performs considerably worse than our weighted soft-voting approach, as soft-voting effectively utilizes the full range of probabilistic information provided by the models, whereas hard voting only focuses on final decisions, potentially discarding valuable information.

Furthermore, we have observed the class-wise AUC–ROC scores for BERT, ALBERT, and BERTweet as demonstrated in Fig. 4. Notably, our proposed architecture exhibited superior performance compared to the individual models as shown in Fig. 4(d).

5.3. Comparison with state-of-the-art methods

In this section, we compare our results with the state-of-the-art works presented in the literature for multi-class depression classification. Specifically, we compare our proposed model with the performance reported by Kabir et al. (2022) using various baseline architectures, including traditional ML-based approaches, DL-based approaches, and transformer-based approaches.

Kabir et al. (2022) conducted experiments using Support Vector Machines (SVM), Bidirectional Long Short-Term Memory (BiLSTM), BERT, and DistilBERT as their baseline models for depression classification. Among these models, DistilBERT showed the best performance in terms of class-wise AUC–ROC values, which makes it a relevant point of comparison for our study. To ensure a fair and meaningful comparison, we focus on evaluating and comparing the AUC–ROC scores, which the original authors reported in their work. By considering this metric, we can assess the discrimination and predictive capabilities of our proposed model in relation to the baseline performance.

We observe significant enhancements in the performance of our proposed model compared to the DistilBERT-based classifier proposed by Kabir et al. (2022) across different depression severity classes as shown in Table 3. We showed in our experiments that BERT and ALBERT outperform DistilBERT; hence, it was not included in our ensemble pipeline. Our model achieved an AUC–ROC of 0.9256 for the 'non-depressed' class, an improvement of 13.68 percentage points. In the 'mild' depression class, our model achieved an AUC–ROC of 0.8829, an improvement of 13.57 percentage points. Similarly, for the 'moderate' depression class, our model achieved an AUC–ROC of 0.9370, an improvement of 14.91 percentage points. Lastly, in the 'severe' depression class, our model achieved an AUC–ROC of 0.9848, an improvement of 11.88 percentage points. Based on these comparisons, it is evident that our proposed model outperforms the baseline approach used by Kabir et al. (2022) in terms of class-wise discrimination for depression severity classification. The substantial
improvements in AUC–ROC scores across all classes demonstrate the effectiveness and superiority of our proposed architecture. This improvement in performance can be attributed to efficient data preprocessing techniques and an ensemble of carefully selected transformer-based standalone classifiers. These techniques improved the quality of the input data, captured meaningful patterns, and handled the unique characteristics of social media language. The combination of these advancements resulted in better class-wise discrimination and superior performance in depression severity classification.

Fig. 4. Comparison among AUC–ROC for the different models.

Table 3
Performance comparison with the state-of-the-art methods for multi-class depression classification in terms of AUC–ROC score.

Class          DistilBERT (Kabir et al., 2022)  Our model  Improvement (percentage points)
Non-depressed  0.7888                           0.9256     +13.68
Mild           0.7472                           0.8829     +13.57
Moderate       0.7879                           0.9370     +14.91
Severe         0.8660                           0.9848     +11.88

5.4. Comparison with LLM

Large language models (LLMs) such as GPT-3.5 have recently started leveraging in-context learning techniques to overcome the need for fine-tuning on task-specific data. Ouyang et al. (2022) further improved this method by introducing a Reinforcement Learning (RL) framework incorporating human feedback. These technologies resulted in the development of ChatGPT, which has exhibited significant performance in different NLP tasks such as question-answering, code generation, text summarization, etc. (Laskar et al., 2023). For instance, Jahan et al. (2024) have emphasized the capabilities of ChatGPT in diverse biomedical tasks, including relation extraction, classification, question-answering, and summarization, without the need for fine-tuning. Their study specifically highlights the performance of ChatGPT in datasets with smaller training sets. However, the same study provides empirical evidence in the biomedical domain that fine-tuned models tailored to specific domains surpass the performance of zero-shot ChatGPT in classification tasks, even when expert prompting is employed. This is consistent with our experimental findings. In this study, we acknowledge the prevalence of LLMs like ChatGPT. While these LLMs excel in general tasks due to their extensive pretraining, they may fall short in specialized domains. However, we intentionally included the performance of ChatGPT as a benchmark for comparison due to its wide accessibility. The availability of such models to a broad audience signifies the need to understand their limitations. Our evaluation of the performance of ChatGPT serves as a reference point for practitioners who might consider using such models in real-world scenarios. We hypothesized that showcasing this performance without fine-tuning would highlight the gap between general-purpose LLMs and task-specific models.

We compare the performance of our proposed model with the performance of zero-shot ChatGPT (gpt-3.5-turbo, accessed June 2023) on depression classification tasks in terms of accuracy, precision, recall, and F1-score. Results presented in Fig. 5(a) show that our model significantly outperforms ChatGPT. We observe a performance improvement of at least 46.62% in terms of F1-score across the four classes, with
the maximum improvement achieved in the 'moderate' class. Fig. 5(b) provides a general overview where we compare the two models based on the four metrics. Our model shows improvement across all the metrics with significant improvements in accuracy, recall, and F1-score. While ChatGPT demonstrates promising results in generative tasks, the nature of depression classification necessitates a deeper understanding of the specific domain and relevant contextual cues. ChatGPT, being a generic language model, may not capture the task-specific patterns and nuances essential for accurate classification. On the other hand, our proposed fine-tuned domain-specific model is specifically tailored for depression classification, leveraging the benefits of fine-tuning on depression-related data. This enables our model to learn and adapt to the intricacies of the depression domain, such as identifying symptoms, analyzing language patterns indicative of depressive states, and capturing nuances in different levels of depression severity.

Fig. 5. Performance comparison with LLM.

5.5. Comparison of architecture metrics

In the comparison of architecture metrics presented in Table 4, we quantitatively assess several key metrics: model size, parameter count, inference time, and maximum memory allocated during inference. These metrics allow us to make informed judgments about the different baseline architectures and the ensemble models.

Table 4
Comparison of architecture metrics (↓ indicates that lower is better). The two ensemble approaches share the same values.

Architecture          Model size ↓ (MB)  Parameter count ↓  Inference time ↓ (ms)  Maximum memory allocated during inference ↓ (MB)
BERT                  427.76             109,485,316        1.70                   984.40
BERTweet              527.05             134,903,044        1.44                   1077.15
ALBERT                45.67              11,686,660         1.52                   690.42
Weighted hard-voting  1000.48            256,075,020        4.66                   1077.15
Weighted soft-voting  1000.48            256,075,020        4.66                   1077.15

To deploy models in the real world, it is essential to keep model size and parameter count in mind, as resources like memory and storage space are typically scarce. Researchers may evaluate the viability of deploying models on diverse platforms, from cloud servers to edge devices, by assessing the size and parameter count of the models. The number of parameters is one of many aspects of a model's overall size, which is why the two are not directly related. The total size of a model includes not only the model itself but also its metadata, optimizer state, and data structures. In addition to the number of parameters, additional factors, such as compression methods, data type, framework optimizations, and so on, might affect the overall size of the model.

Real-time applications rely heavily on timely predictions, making Inference Time an essential factor. The time it takes to evaluate each instance and provide predictions is critical if we want to deploy a real-time depression detection chatbot or integrate it into current social media platforms so that we can provide timely and engaged user experiences. The model's enhanced efficiency and responsiveness are the result of a reduction in the inference time required to process incoming text input. The scalability and efficiency of the deployed system rely on keeping an eye on the Maximum Memory Allocated during model inference. Monitoring memory use peaks helps practitioners spot and fix memory bottlenecks that may otherwise compromise the efficacy and reliability of the model. When distributing models to devices with limited memory, or when running many models in parallel, efficient memory management becomes crucial.

While Model Size and Parameter Count provide valuable insights into the complexity and capacity of the models, their importance may be overshadowed by the practical considerations of inference time and memory allocation during real-life deployment. The focus shifts towards optimizing the efficiency and responsiveness of the models, enabling them to handle high volumes of data and deliver fast predictions without exceeding memory limitations. Furthermore, advancements in model architectures, compression techniques, and model optimization algorithms have enabled the development of compact models with reduced parameter counts and model sizes without sacrificing performance significantly. Therefore, model size and parameter count, although still important, have become less critical compared to inference time and memory allocation in many real-life scenarios.

Starting with model size, we observe that ALBERT has the smallest size, occupying only 45.67 MB. BERT follows with a size of 427.76 MB, and BERTweet has the largest size of 527.05 MB. However, it is important to note that the ensemble models have a larger model size of 1000.48 MB, the sum of the three constituent models. This indicates that the ensemble models require more storage capacity compared to individual architectures. Moving on to parameter count, ALBERT again demonstrates its parameter-efficient design with the lowest count of 11,686,660 parameters. BERT has a higher parameter count of 109,485,316, and BERTweet exceeds both with 134,903,044 parameters. The ensemble models have a significantly larger parameter count of 256,075,020. These numbers illustrate the trade-off between model complexity and ensemble performance. Regarding inference time, we find that BERTweet achieves the lowest inference time of 1.44 ms, followed by ALBERT with 1.52 ms. BERT takes slightly more time at 1.70 ms. However, when employing ensemble models, inference time increases significantly to 4.66 ms. These
values highlight the impact of model aggregation on inference speed, with ensemble models requiring more time due to the combination and decision-making processes involved. Lastly, we analyze the maximum memory allocated during inference. BERTweet consumes the highest memory, allocating 1077.15 MB, followed by BERT at 984.40 MB. ALBERT utilizes the least memory, with 690.42 MB. Interestingly, both ensemble models also allocate 1077.15 MB of memory, which aligns with the memory requirements of BERTweet. This suggests that ensemble models inherit the memory consumption characteristics of their most demanding constituent model.

Quantifying these architecture metrics provides valuable insights into the characteristics and trade-offs of each architecture and the use of ensemble models. While ALBERT demonstrates compact size, low parameter count, and efficient memory utilization, it may sacrifice some performance compared to larger models like BERT and BERTweet. However, ensemble models offer the potential for improved accuracy and robustness by leveraging diverse models' strengths. The decision to utilize an ensemble model should consider the specific use cases and the balance between performance gains and resource requirements.

5.6. Interpretability and error analysis

In this work, the terms 'interpretability' and 'explainability' have been used interchangeably and refer to the same concept of understanding. One of the primary contributions of our research work lies in implementing both local and global explainability techniques for an ensemble of transformer models. The goal of interpretability in our study is to provide insights into the decision-making process of the ensemble architecture and shed light on the factors influencing its predictions. In this subsection, we elaborate on the findings and implications of our analysis.

5.6.1. Local explainability

Local explainability allows us to delve into the prediction process of the ensemble architecture at an individual sample level. To achieve this, we selected representative samples from each class and visualized the importance of different tokens to reveal how much each token contributes to the prediction, whether it remains neutral or negatively impacts the predicted class. This examination enables us to gain a better understanding of the decision-making process of the model of interest.

Fig. 6 shows the local explanation for a sample predicted as the 'Non-Depressed' class. Observing the importance of each token, we see that despite the sentence containing the word 'depressed', the ensemble model successfully predicted the correct class. The local explanations for samples with different severity of depression are presented in Fig. 7.

Fig. 6. Local explanation from the soft-voting classifier for a sample predicted as 'Non-Depressed'.

During our analysis, we encountered a misclassification scenario in which the best-performing standalone model, BERTweet, misclassified a sample from the 'severe' class as depicted in Fig. 8. Further analysis indicated that BERTweet did not assign sufficient weights to the crucial tokens for predicting the correct class. However, when we visualized the local explainability for the ensemble architecture (Fig. 7(c)), we observed that the model placed greater importance on the tokens that were significant to the ground truth class. Consequently, the ensemble model correctly classified the sample. From the visual representation given in Fig. 7, it is evident that these local interpretability insights showcase the effectiveness of the ensemble architecture in capturing class-specific token importance, leading to improved performance and a more accurate prediction process.

5.6.2. Global explainability

Moving beyond local explainability, we also employed global explainability techniques to gain a broader understanding of the decision patterns of our proposed model. Global explainability involves identifying important class-wise tokens and analyzing their relevance to the respective classes. Notably, the analysis provided by Fig. 9 proved instrumental in conducting an overall performance and error analysis. By determining the tokens that carry the most weight for each class, we could observe distinct patterns highly relevant to the specific characteristics and symptoms of the class of interest.

Our proposed ensemble architecture, employing soft-voting, demonstrated satisfactory performance on the dataset (Section 3), surpassing previous benchmarks. An analysis of the confusion matrix, presented in Fig. 10, reveals a significant decrease in performance for the 'mild' and 'moderate' classes. Upon examining the empirical results and evaluating the dataset samples, we hypothesize that two primary factors contribute most to this performance drop.

Firstly, there is a noticeable imbalance in the number of samples across different severity classes. The 'non-depressed' class contains a substantially larger number of samples compared to the 'mild', 'moderate', and 'severe' classes. This imbalance arises as most user comments and posts published worldwide daily do not exhibit any degree of depression. However, for enhanced generalization ability of deep transformer-based networks, it is crucial to have an ample number of class-specific samples. Therefore, we hypothesize that increasing the sample size will lead to improved performance. Secondly, the high similarity between the 'non-depressed' and 'mild' classes, as well as the 'mild' and 'moderate' classes, contributes to the performance decline. The contextual differences between samples from these classes are minimal, requiring the classifier to possess a robust contextual understanding to distinguish between them accurately. For example,
in Fig. 9(b), the most influential token in the 'mild' class is 'depressed'. However, a sample from the 'non-depressed' class (Fig. 6) also contains the word 'depressed'. In such cases, the model must grasp the context (identifying the underlying sarcasm in this case) to classify the sample correctly. This high inter-class similarity is evident in the samples from the 'mild' (Fig. 7(a)) and 'moderate' (Fig. 7(b)) classes, which are difficult to tell apart even for humans who are not domain experts. Additionally, there are several common tokens among the highly weighted tokens from the 'mild' and 'moderate' classes, as depicted in Figs. 9(b) and 9(c).

Fig. 7. Local explanation from the soft-voting classifier for samples of 'Mild', 'Moderate', and 'Severe' levels of depression.

The interpretability and error analysis conducted in this study offers several important implications. Firstly, the local explainability analysis highlights the advantage of ensemble architectures over standalone models in capturing nuanced token importances and making accurate predictions. The ensemble's ability to incorporate diverse perspectives from individual models enhances its overall performance and enables a more comprehensive understanding of the data. Secondly, the global explainability analysis allows us to uncover class-specific patterns and gain insights into the features and themes contributing to the prediction of each class. This knowledge can aid in better understanding depression severity and facilitate more informed decision-making in the context of mental health assessment.

6. Conclusion

Depression, when left untreated, can lead to various complications, such as an increased risk of self-harm, substance abuse, and other mental health disorders. Hence, early detection of depression is crucial in mental health diagnosis, as it allows prompt intervention
and treatment, preventing the condition from worsening. This work presents a deep learning-based pipeline for categorizing the severity of depression expressed in social media posts into four categories: mild, moderate, severe, and non-depressed. We have utilized a wide range of preprocessing techniques along with an ensemble of three transformer-based models for effective prediction and achieved state-of-the-art performance on the 'DEPTWEET' dataset. We also provided a thorough explainability analysis to understand the decision-making process of the proposed pipeline. Moreover, recognizing the widespread use of LLMs such as ChatGPT, we demonstrated the capabilities of ChatGPT in a zero-shot context for this particular task, underscoring the distinction in performance between general-purpose LLMs and models tailored to specific tasks.

Fig. 8. Local explanation for a misclassified sample from the BERTweet model.

Fig. 9. Global explainability: Top 10 important tokens for each class.

In the future, multimodal analysis can be conducted by combining text analysis with other modalities, such as images, videos, or audio from social media posts, to improve the accuracy of depression detection and severity assessment. The imbalanced distribution of samples was a prime reason behind misclassifications, which can be mitigated by collecting more data in the future. Context-level data augmentation may prove useful in this regard. Moreover, extended analysis beyond a single social media platform may provide a more comprehensive view of individuals' mental health and allow for a more accurate assessment. Finally, it is important to note that while these methods can provide indications of potential depressive symptoms, they should not replace professional diagnosis or clinical assessment. The analysis of social media posts should be used as a supplementary tool to aid mental
Chakravarthi, B.R., Priyadharshini, R., Banerjee, S., Jagadeeshan, M.B., Kumare-


san, P.K., Ponnusamy, R., Benhur, S., McCrae, J.P., 2023. Detecting abusive
comments at a fine-grained level in a low-resource language. Natl. Lang. Process.
J. 3, 100006. http://dx.doi.org/10.1016/j.nlp.2023.100006, URL: https://www.
sciencedirect.com/science/article/pii/S2949719123000031.
Chen, X., Sykora, M.D., Jackson, T.W., Elayan, S., 2018. What about mood swings: Iden-
tifying depression on Twitter with temporal measures of emotions. In: Companion
Proceedings of the the Web Conference 2018.
Davidson, T., Warmsley, D., Macy, M.W., Weber, I., 2017. Automated hate speech
detection and the problem of offensive language. In: International Conference on
Web and Social Media.
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2019. BERT: Pre-training of deep
bidirectional transformers for language understanding. In: Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
Association for Computational Linguistics, Minneapolis, Minnesota, pp. 4171–4186.
Diederich, J., Al-Ajmi, A., Yellowlees, P.M., 2007. Ex-ray: Data mining and mental
health. Appl. Soft Comput. 7, 923–928.
Dwyer, D.B., Falkai, P.G., Koutsouleris, N., 2018. Machine learning approaches for
clinical psychology and psychiatry. Annu. Rev. Clin. Psychol. 14, 91–118.
Ezawa, I.D., Bartels, G.C., Strunk, D.R., 2021. Getting down to business: an examination
health professionals in identifying individuals who may require further evaluation and support.

Fig. 10. Confusion matrix of DEPTWEET dataset for the proposed architecture.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

CRediT authorship contribution statement

Tasnim Ahmed: Writing – original draft, Visualization, Methodology, Formal analysis, Conceptualization. Shahriar Ivan: Writing – original draft, Visualization, Investigation. Ahnaf Munir: Writing – original draft, Visualization, Validation. Sabbir Ahmed: Writing – review & editing, Supervision, Project administration, Investigation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References Khan, W., Daud, A., Khan, K., Muhammad, S., Haq, R., 2023c. Exploring the frontiers
of deep learning and natural language processing: A comprehensive overview of
Adamou, M., Antoniou, G., Greasidou, E., Lagani, V., Charonyktakis, P., Tsamardinos, I., key challenges and emerging trends. Natl. Lang. Process. J. 4, 100026. http://dx.
2018. Mining free-text medical notes for suicide risk assessment. In: Proceedings doi.org/10.1016/j.nlp.2023.100026, URL: https://www.sciencedirect.com/science/
of the 10th Hellenic Conference on Artificial Intelligence. article/pii/S2949719123000237.
Ahmed, T., Ivan, S., Kabir, M., Mahmud, H., Hasan, K., 2022. Performance analy-
Khan, A., Kamal, F., Chowdhury, M.A., Ahmed, T., Laskar, M.T.R., Ahmed, S.,
sis of transformer-based architectures and their ensembles to detect trait-based
2023a. BanglaCHQ-summ: An abstractive summarization dataset for medical
cyberbullying. Soc. Netw. Anal. Min. 12, 1–17.
queries in bangla conversational speech. In: Alam, F., Kar, S., Chowdhury, S.A.,
Ahmed, T., Kabir, M., Ivan, S., Mahmud, H., Hasan, K., 2021. Am I being bullied on
Sadeque, F., Amin, R. (Eds.), Proceedings of the First Workshop on Bangla
social media? An ensemble approach to categorize cyberbullying. In: 2021 IEEE
Language Processing (BLP-2023). Association for Computational Linguistics, Singa-
International Conference on Big Data (Big Data). pp. 2442–2453.
pore, pp. 85–93. http://dx.doi.org/10.18653/v1/2023.banglalp-1.10, URL: https:
Akhter, A., Acharjee, U.K., Talukder, M.A., Islam, M.M., Uddin, M.A., 2023. A robust
//aclanthology.org/2023.banglalp-1.10.
hybrid machine learning model for Bengali cyber bullying detection in social media.
Khan, A., Kamal, F., Nower, N., Ahmed, T., Ahmed, S., Chowdhury, T., 2023b.
Natl. Lang. Process. J. 4, 100027. http://dx.doi.org/10.1016/j.nlp.2023.100027,
NERvous about my health: Constructing a bengali medical named entity recognition
URL: https://www.sciencedirect.com/science/article/pii/S2949719123000249.
dataset. In: Bouamor, H., Pino, J., Bali, K. (Eds.), Findings of the Association
Aldabbas, H., Albashish, D., Khatatneh, K., Amin, R., 2022. An architecture of IoT-aware
for Computational Linguistics: EMNLP 2023. Association for Computational Lin-
healthcare smart system by leveraging machine learning. Int. Arab J. Inf. Technol.
guistics, Singapore, pp. 5768–5774. http://dx.doi.org/10.18653/v1/2023.findings-
19, 160–172.
Alshawwa, I.A., Elkahlout, M., El-Mashharawi, H.Q., Abu-Naser, S.S., 2019. An expert emnlp.383, URL: https://aclanthology.org/2023.findings-emnlp.383.
system for depression diagnosis. Int. J. Acad. Health Med. Res. (IJAHMR) 3 (4), Khan, A.A., Kamal, F., Nower, N., Ahmed, T., Chowdhury, T.M., 2022. An evaluation
20–27. of transformer-based models in personal health mention detection. In: 2022 25th
Arbanas, G., 2015. Diagnostic and statistical manual of mental disorders (DSM-5). International Conference on Computer and Information Technology. ICCIT, pp. 1–6.
Alcohol. Psychiatry Res. 51, 61–64. http://dx.doi.org/10.1109/ICCIT57492.2022.10054937.
Bucci, S., Schwannauer, M., Berry, N., 2019. The digital revolution and its impact on Kroenke, K., Spitzer, R.L., Williams, J.B., 2001. The PHQ-9: validity of a brief depression
mental health care. Psychol. Psychother. 92 2, 277–297. severity measure. J. Gen. Intern. Med. 16 9, 606–613.

12
T. Ahmed, S. Ivan, A. Munir et al. Natural Language Processing Journal 7 (2024) 100079

Kulkarni, D., Ghosh, A., Girdhari, A., Liu, S., Vance, L.A., Unruh, M., Sarkar, J., 2024a. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C.,
Enhancing pre-trained contextual embeddings with triplet loss as an effective fine- Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L.,
tuning method for extracting clinical features from electronic health record derived Simens, M., Askell, A., Welinder, P., Christiano, P.F., Leike, J., Lowe, R.,
mental health clinical notes. Natl. Lang. Process. J. 6, 100045. http://dx.doi.org/ 2022. Training language models to follow instructions with human feedback. In:
10.1016/j.nlp.2023.100045, URL: https://www.sciencedirect.com/science/article/ Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (Eds.), In:
pii/S2949719123000420. Advances in Neural Information Processing Systems, vol. 35, Curran Associates,
Kulkarni, D., Ghosh, A., Girdhari, A., Liu, S., Vance, L.A., Unruh, M., Sarkar, J., 2024b. Inc., pp. 27730–27744.
Enhancing pre-trained contextual embeddings with triplet loss as an effective fine- Parveen, N., Chakrabarti, P., Hung, B.T., Shaik, A., 2023. Twitter sentiment analysis
tuning method for extracting clinical features from electronic health record derived using hybrid gated attention recurrent network. J. Big Data 10, 1–29.
mental health clinical notes. Natl. Lang. Process. J. 6, 100045. http://dx.doi.org/ Patel, M.J., Khalaf, A.M., Aizenstein, H.J., 2015. Studying depression using imaging
10.1016/j.nlp.2023.100045, URL: https://www.sciencedirect.com/science/article/ and machine learning methods. NeuroImage : Clin. 10, 115–123.
pii/S2949719123000420. Qiao, Y., Xiong, C., Liu, Z., Liu, Z., 2019. Understanding the behaviors of BERT in
KVTKN, P., Ramakrishnudu, T., 2023. Semi-supervised approach for tweet-level ranking. ArXiv abs/1904.07531.
stress detection. Natl. Lang. Process. J. 4, 100019. http://dx.doi.org/10. Ribeiro, M.T., Singh, S., Guestrin, C., 2016. ‘‘Why should I trust you?’’: Explaining the
1016/j.nlp.2023.100019, URL: https://www.sciencedirect.com/science/article/pii/ predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International
S294971912300016X. Conference on Knowledge Discovery and Data Mining.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R., 2020. ALBERT: Safa, R., Bayat, P., Moghtader, L., 2021. Automatic detection of depression symptoms
A lite BERT for self-supervised learning of language representations. arXiv:1909. in twitter using multimodal analysis. J. Supercomput. 78, 4709–4744.
11942. Sanh, V., Debut, L., Chaumond, J., Wolf, T., 2020. DistilBERT, a distilled version of
Laskar, M.T.R., Bari, M.S., Rahman, M., Bhuiyan, M.A.H., Joty, S., Huang, J.X., 2023. A BERT: smaller, faster, cheaper and lighter.
systematic study and comprehensive evaluation of ChatGPT on benchmark datasets. Shahzad, M., Freeman, C., Rahimi, M., Alhoori, H., 2023. Predicting Facebook sen-
arXiv:2305.18486. timents towards research. Natl. Lang. Process. J. 3, 100010. http://dx.doi.org/10.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettle- 1016/j.nlp.2023.100010, URL: https://www.sciencedirect.com/science/article/pii/
moyer, L., Stoyanov, V., 2019. RoBERTa: A robustly optimized BERT pretraining S2949719123000079.
approach. arXiv preprint arXiv:1907.11692. Sidey-Gibbons, J.A.M., Sidey-Gibbons, C.J., 2019. Machine learning in medicine: a
Mahdy, N., Magdi, D.A., Dahroug, A., Rizka, M.A., 2020. Comparative study: Dif- practical introduction. BMC Med. Res. Methodol. 19.
ferent techniques to detect depression using social media. In: Ghalwash, A.Z., TaghiBeyglou, B., Rudzicz, F., 2024. Context is not key: Detecting Alzheimer’s disease
El Khameesy, N., Magdi, D.A., Joshi, A. (Eds.), Internet of Things—Applications with both classical and transformer-based neural language models. Natl. Lang.
and Future. Springer Singapore, Singapore, pp. 441–452. Process. J. 6, 100046. http://dx.doi.org/10.1016/j.nlp.2023.100046, URL: https:
Martínez-Castaño, R., Htait, A., Azzopardi, L., Moshfeghi, Y., 2020. Early risk detection //www.sciencedirect.com/science/article/pii/S2949719123000432.
of self-harm and depression severity using BERT-based transformers. In: Conference Thieme, A., Belgrave, D., Doherty, G., 2020. Machine learning in mental health: A
and Labs of the Evaluation Forum. systematic review of the HCI literature to support effective ML system design. ACM
Trans. Comput.-Hum. Interact. 27.
Mohammed, A., Kora, R., 2023. A comprehensive review on ensemble deep learn-
Tran, T., Phung, D.Q., Luo, W., Harvey, R., Berk, M., Venkatesh, S., 2013. An integrated
ing: Opportunities and challenges. J. King Saud Univ. - Comput. Inf. Sci. 35
framework for suicide risk prediction. In: Proceedings of the 19th ACM SIGKDD
(2), 757–774. http://dx.doi.org/10.1016/j.jksuci.2023.01.014, URL: https://www.
international conference on Knowledge discovery and data mining.
sciencedirect.com/science/article/pii/S1319157823000228.
Uysal, A.K., Günal, S., 2014. The impact of preprocessing on text classification. Inf.
Nguyen, D.Q., Vu, T., Nguyen, A.T., 2020. BERTweet: A pre-trained language model
Process. Manage. 50, 104–112.
for English Tweets. In: Proceedings of the 2020 Conference on Empirical Methods
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L.u.,
in Natural Language Processing: System Demonstrations. pp. 9–14.
Polosukhin, I., 2017. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H.,
Nisar, D.-E.-M., Amin, R., Shah, N.-U.-H., Ghamdi, M.A.A., Almotiri, S.H., Al-
Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Attention is All you Need. In:
ruily, M., 2021. Healthcare techniques through deep learning: Issues, challenges
Advances in Neural Information Processing Systems, vol. 30, Curran Associates,
and opportunities. IEEE Access 9, 98523–98541.
Inc..
Nobles, A.L., Glenn, J.J., Kowsari, K., Teachman, B.A., Barnes, L.E., 2018. Identification
Yazdavar, A.H., Al-Olimat, H.S., Ebrahimi, M., Bajaj, G., Banerjee, T., Thirunarayan, K.,
of imminent suicide risk among Young adults using text messages. In: Proceedings
Pathak, J., Sheth, A., 2017. Semi-supervised approach to monitoring clinical
of the 2018 CHI Conference on Human Factors in Computing Systems.
depressive symptoms in social media. In: Proceedings of the 2017 IEEE/ACM
Ofek, N., Katz, G., Shapira, B., Bar-Zev, Y., 2015. Sentiment analysis in transcribed
International Conference on Advances in Social Networks Analysis and Mining
utterances. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining.
2017.
Organization, W.H., et al., 1993. The ICD-10 Classification of Mental and Behavioural
Zafar, A., Chitnis, D.S., 2020. Survey of depression detection using social networking
Disorders: Diagnostic Criteria for Research. World Health Organization.
sites via data mining. In: 2020 10th International Conference on Cloud Computing,
Organization, W.H., et al., 2017. Depression and Other Common Mental Disorders:
Data Science & Engineering (Confluence). pp. 88–93.
Global Health Estimates. Technical Report, World Health Organization.
