Enhancing Depression Diagnosis with
Chain-of-Thought Prompting
Elysia Shi Adithri Manda
Arcadia High School Monroe Township High School
[email protected] [email protected]
London Chowdhury Runeema Arun
Home-school North Creek High School
[email protected] [email protected]
Kevin Zhu Michael Lam
Algoverse AI Research Algoverse AI Research
[email protected] [email protected]
arXiv:2408.14053v2 [cs.CL] 27 Aug 2024
Abstract
When using AI to detect signs of depressive disorder, AI models habitually draw premature
conclusions. We hypothesize that using chain-of-thought (CoT) prompting to evaluate Patient
Health Questionnaire-8 (PHQ-8) scores will improve the accuracy of the scores determined by
AI models. In our findings, when the models reasoned with CoT, the estimated PHQ-8 scores were
consistently closer on average to the accepted true scores reported by each participant than
when CoT was not used. Our goal is to expand AI models’ understanding of the intricacies of
human conversation, allowing them to assess a patient’s feelings and tone more effectively and
therefore to discern symptoms of mental disorders more accurately. Ultimately, we hope to
augment AI models’ abilities so that they can be widely accessible and used in the medical
field.
1 Introduction
Depression is a damaging mental disorder affecting hundreds of millions of people worldwide [16].
It is characterized by persistent feelings of sadness, loss of interest in previously enjoyed
activities, and deterioration of the individual’s well-being, and it can lead to self-harm, abuse,
and in too many cases, suicide. Conventional diagnostic methods begin with either standardized
questionnaires, which are often subjective and time-consuming, or mental status examinations,
which are inaccessible to many people for reasons that include economic and insurance status
[3, 1, 5].
The growing use of AI chatbots has made waves in mental health diagnostics, specifically concerning
depressive disorders. Machine learning (ML) models have the potential to precisely distinguish subtle
signals in human behavior. Large language models (LLMs) such as GPT-4 and BERT demonstrate
an increasing ability to analyze clinical data in ways that can improve mental health
diagnostics [2]. These models can analyze large amounts of data, including clinical notes,
patient interviews, and social media posts, to identify patterns that indicate depression.
Preprint. Under review.
1.1 Related works
ML models, most notably LLMs, have assisted clinical psychologists in detecting symptoms of
depression. For example, reports from the MIT Media Lab [12] indicate that LLMs can accurately
detect signs of depression in social media posts. Moreover, LLMs have been used to analyze text from
a variety of sources and reveal patterns that align with symptoms to indicate mental health issues
[8, 13]. Specifically, BERT was applied to electronic health records, achieving a precision of 90
percent in predicting depressive episodes of patients [9].
Despite these advancements, several gaps remain. These models frequently raise concerns about
data privacy, bias, and ethical implications [14]. Additionally, a major struggle lies in the
interpretability of ML models, including LLMs. These models often function as "black boxes"
that cannot explain the reasoning behind their predictions. When diagnosing depression, they
often reach conclusions without showing the human-understandable thought process that
characterizes a human diagnosis [4].
To address these issues, our research incorporates CoT prompting to enhance the accuracy and
robustness of conclusions [15]. CoT prompting aims to improve the model’s decision-making by
eliciting a step-by-step logical sequence before the final answer. By integrating CoT
reasoning, we aim to make the model’s PHQ-8 scoring both more accurate and easier to interpret.
2 Methods
Our study uses the qualitative data of the DAIC-WOZ dataset [7, 6] as the input for the model’s
assessment of depression symptoms. We describe the selection and pre-processing of the dataset,
the setup of the control and experimental models, and the incorporation of CoT prompting, and
then we evaluate whether the model determined a PHQ-8 score closer to the actual score of the
patient recorded in the data.
2.1 Data collection and processing
We used the dataset crafted by the University of Southern California called DAIC-WOZ. The DAIC-
WOZ dataset is a collection of interviews with an animated virtual interviewer, Ellie, who was
controlled by a human interviewer in a separate room. Ellie would ask patients a set of questions and
transcribe the responses. The patient’s facial expression data was also recorded. Each interviewee
would subsequently fill out the PHQ-8; their PHQ-8 score, a valid diagnostic and severity measure
for depressive disorders, would later be calculated according to the PHQ-8 score rubric [11].
For this experiment, we assumed that each patient truthfully filled out the PHQ-8. The questionnaire
has a set of questions, as shown in Table 1. The patient rates how well each phrase applied to
their life over the last 2 weeks on a scale of 0 to 3: 0 for "not at all", 1 for "several days",
2 for "more than half the days", and 3 for "nearly every day". The values for the eight questions
are then summed to give the total PHQ-8 score, which ranges from 0 to 24. The higher the score,
the greater the severity of the interviewee’s depression, if present at all. The data we used for
the experiment comprised participants’ PHQ-8 scores and interview transcripts.
Question Score
1. Little interest or pleasure in activities 0–3
2. Feeling down or depressed 0–3
3. Trouble sleeping 0–3
4. Feeling tired or low energy 0–3
5. Poor appetite or overeating 0–3
6. Feeling bad about yourself 0–3
7. Trouble concentrating 0–3
8. Moving or speaking slowly or restlessness 0–3
Table 1: Sample PHQ-8 Rubric
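The scoring rule described above (eight items, each rated 0 to 3, summed into a total) can be sketched in a few lines of Python. This is a minimal illustration rather than the scoring code used in the study; the names PHQ8_ITEMS, RESPONSE_VALUES, and phq8_total are hypothetical, and the example responses are invented for demonstration only.

```python
# Minimal sketch of PHQ-8 scoring as described in Section 2.1 and Table 1.
# All identifiers here are illustrative assumptions, not the authors' code.

PHQ8_ITEMS = [
    "Little interest or pleasure in activities",
    "Feeling down or depressed",
    "Trouble sleeping",
    "Feeling tired or low energy",
    "Poor appetite or overeating",
    "Feeling bad about yourself",
    "Trouble concentrating",
    "Moving or speaking slowly or restlessness",
]

# Mapping from the response wording to the 0-3 item score.
RESPONSE_VALUES = {
    "not at all": 0,
    "several days": 1,
    "more than half the days": 2,
    "nearly every day": 3,
}


def phq8_total(item_scores):
    """Sum eight item scores (each 0-3) into a total between 0 and 24."""
    if len(item_scores) != len(PHQ8_ITEMS):
        raise ValueError("PHQ-8 requires exactly eight item scores")
    for score in item_scores:
        if score not in (0, 1, 2, 3):
            raise ValueError(f"item score out of range: {score!r}")
    return sum(item_scores)


# Invented example responses, one per question in Table 1.
example_responses = [
    "several days",
    "not at all",
    "more than half the days",
    "nearly every day",
    "not at all",
    "several days",
    "several days",
    "not at all",
]
example_scores = [RESPONSE_VALUES[r] for r in example_responses]
print(phq8_total(example_scores))
```

Validating the item count and range before summing guards against a model response that skips a question or returns a value outside the rubric.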
This study followed an experimental design. We used the same pre-trained model, OpenAI’s
GPT-3.5 Turbo, for both the control and experimental tests. In the control test, the model was
given participant interview transcript data and instructed to assign scores for each PHQ-8
category and calculate the total score. In the experimental test, the model was given the same
transcript data and instructions, but was additionally prompted to use CoT reasoning.
2.2 Procedure
We created a control group (Assigner A), which we first fed the PHQ-8 rubric. A transcript was
then passed to the model before prompting for a total PHQ-8 score as well as a score on each rubric
question. The experimental group (Assigner B) was passed the same rubric and transcripts but with
the addition of CoT prompting. We used zero-shot CoT prompting [10]. This method involves telling
the model: “Let’s think step by step” in addition to the original prompt.
Figure 1: Illustration of zero-shot chain of thought prompting [10]
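The difference between the two assigners can be sketched as a prompt-construction step. Only the trigger phrase "Let's think step by step" comes from the procedure above; the instruction wording, the section labels, and the names COT_TRIGGER and build_prompt are assumptions made for illustration, and the sketch deliberately stops before any model call.

```python
# Minimal sketch of how the control (Assigner A) and experimental (Assigner B)
# prompts might differ. The exact prompt text used in the study is not given,
# so everything except the zero-shot CoT trigger phrase is an assumption.

COT_TRIGGER = "Let's think step by step"


def build_prompt(rubric, transcript, use_cot):
    """Assemble a scoring prompt; append the CoT trigger for Assigner B."""
    parts = [
        "PHQ-8 rubric:",
        rubric,
        "",
        "Interview transcript:",
        transcript,
        "",
        "Assign a score from 0 to 3 for each PHQ-8 question "
        "and report the total PHQ-8 score.",
    ]
    if use_cot:
        # Zero-shot CoT: the only change is this extra line.
        parts.append(COT_TRIGGER)
    return "\n".join(parts)


# Placeholder inputs; real rubric and transcript text would be passed here.
rubric_text = "1. Little interest or pleasure in activities (0-3) ..."
transcript_text = "Ellie: How are you doing today? Participant: ..."

prompt_a = build_prompt(rubric_text, transcript_text, use_cot=False)
prompt_b = build_prompt(rubric_text, transcript_text, use_cot=True)
print(prompt_b)
```

Keeping both prompts identical apart from the trigger line means any difference in scores can be attributed to the CoT instruction rather than to other wording changes.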
3 Results
As shown in Figure 2, Assigner B, which used CoT reasoning, generated scores closer to the
patients’ accepted true PHQ-8 scores across all categories of behavior: its average point
difference was lower than Assigner A’s in every category.
Figure 2: Average point difference of Assigner A scores and the true scores compared to the average
point difference of Assigner B scores and the true scores
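The text does not define "average point difference" formally. One plausible reading, assumed here, is the mean absolute difference between the assigned scores and the true scores. The function name average_point_difference and all numbers below are hypothetical and are not the study's data.

```python
# Minimal sketch of the comparison metric, assuming "average point difference"
# means the mean absolute difference between assigned and true scores.
from statistics import mean


def average_point_difference(assigned, true_scores):
    """Mean absolute difference between assigned and true scores."""
    if len(assigned) != len(true_scores) or not assigned:
        raise ValueError("score lists must be non-empty and of equal length")
    return mean(abs(a - t) for a, t in zip(assigned, true_scores))


# Invented example values for three participants (not results from the paper).
true_totals = [4, 10, 2]
assigner_a_totals = [8, 6, 5]   # control, no CoT
assigner_b_totals = [5, 9, 4]   # experimental, with CoT

diff_a = average_point_difference(assigner_a_totals, true_totals)
diff_b = average_point_difference(assigner_b_totals, true_totals)
print(f"Assigner A: {diff_a:.2f}, Assigner B: {diff_b:.2f}")
```

The same function applies per question as well as to totals, which is how a per-category comparison like the one in Figure 2 could be produced.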
4 Discussion
4.1 Interpretation
Our research aimed to measure and improve the accuracy of AI models in detecting symptoms of
depression and assigning PHQ-8 scores through chain-of-thought prompting and reasoning.
The results show that, on average, implementing CoT prompting produced PHQ-8 scores closer to
the actual scores than the same model produced without this prompting. In a simple statistical
test, however, Assigner B’s average point difference was not shown to be significantly lower
than Assigner A’s; more in-depth statistical analysis is needed to determine whether the
difference is statistically significant. Thus the hypothesis is supported by the observed
averages, though not yet confirmed statistically: integrating CoT prompting into models appears
to enhance their performance when surveying for depression indicators.
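The statistical test used is not named in the text. As one possible follow-up analysis, and purely as an assumption, a paired sign-flip permutation test could compare per-participant absolute errors from the two assigners. The function name paired_sign_flip_p_value and the error values below are hypothetical; this is a sketch of a standard technique, not the authors' analysis.

```python
# Minimal sketch of an exact one-sided paired sign-flip permutation test.
# Null hypothesis: the sign of each paired difference (error_A - error_B) is
# equally likely to be positive or negative, i.e. CoT makes no difference.
# Enumerating all 2**n sign patterns is only practical for small n.
from itertools import product


def paired_sign_flip_p_value(errors_a, errors_b):
    """Share of sign patterns whose summed difference is >= the observed sum."""
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("error lists must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = sum(diffs)
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if sum(s * d for s, d in zip(signs, diffs)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Invented absolute errors for three participants (not the study's data).
errors_without_cot = [4, 4, 3]
errors_with_cot = [1, 1, 2]
print(paired_sign_flip_p_value(errors_without_cot, errors_with_cot))
```

With only three pairs the smallest achievable p-value is 1/8, which illustrates why a small sample can show a consistent direction of improvement without reaching conventional significance thresholds.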
The improvement in accuracy is likely a product of the model’s ability to process and consider
each aspect of the patient’s responses in a more structured and logical manner. This approach
mirrors how human clinicians approach depression diagnosis: looking for individual symptoms and
breaking them down to assess their contribution to the overall disorder. This paper’s findings
support the hypothesis that a logical reasoning process lets models analyze and interpret
qualitative data more effectively, specifically data from patient interviews, which leads to
more accurate decisions when determining signs of depression.
4.2 Implications
The findings in this paper have significant implications for mental health diagnostics and the
potential use of AI in clinical practice. In our experiment, CoT prompting improved the accuracy
of PHQ-8 score predictions, which makes it a potentially valuable tool for supporting mental
health diagnoses and for making mental health resources more accessible to the public.
Several gaps in this research highlight the need for further investigation before this approach
can be applied in diagnostic settings. Since this study focused on the DAIC-WOZ dataset, which
does not capture the full diversity of patient responses, future studies should use larger and
more varied datasets to ensure generalizability. Additionally, the questions in this dataset were insufficient for a robust
evaluation. While CoT prompting enhances interpretation, its underlying mechanisms require further
exploration to optimize chain-of-thought reasoning in different models.
Addressing these gaps is crucial for advancing the application of these findings, whether clinical or
not, to ensure the model’s optimal implementation. This experiment utilized a simpler GPT model;
future research should explore CoT prompting with other models and advanced GPT versions. Testing
variations like few-shot CoT could further optimize CoT usage. Integrating CoT prompting with
advancements like explainable AI could enhance AI’s effectiveness and acceptance in diagnosis.
Policymakers and providers must develop guidelines and training programs to ensure the safe and
effective use of AI diagnostic tools as AI’s potential in healthcare grows.
4.3 Limitations
Our study makes strong assumptions about the accuracy of the interviews in the DAIC-WOZ dataset
and depends on that data being free of errors, inaccuracies, and irrelevant information. We
also assume that the model is adequately specified for the task, even though the linguistic
patterns in the interviews used with the model may differ across groups of people. We further
assume that CoT prompting improves model performance, when the added complexity could instead
introduce new sources of error, increase the computational difficulty, and make performance
worse. Additionally, the use of AI in mental health diagnostics raises ethical concerns
regarding data privacy, potential biases in data, and the implications of automated
decision-making in patient care. By addressing and reflecting on these limitations, we aim to
be transparent in a way that builds credibility and trust in AI models used for depression
diagnosis.
5 Conclusion
Our research demonstrates that utilizing CoT prompting with AI models improves the accuracy
of PHQ-8 score predictions for assessing symptoms of depression. By using logical reasoning
processes, our model can more effectively analyze qualitative data, leading to more reliable and
easier-to-interpret outcomes. This approach addresses the important challenges of AI use in mental
health diagnostics, specifically transparency and ethical considerations, so that models that use
CoT prompting can be utilized to make mental health care more accessible, in and out of clinical
environments.
References
[1] V. Authors. Formal psychological assessment in evaluating depression: A new methodology
to build exhaustive and irredundant adaptive questionnaires. PLOS ONE, 2016. URL https:
//journals.plos.org/plosone/article?id=10.1371/journal.pone.0153712.
[2] V. Authors. Systematic review of machine learning for predicting depression based on EHRs:
A comparison with traditional diagnostic methods. BMC Medical Informatics and Decision
Making, 2022. URL https://bmcmedinformdecismak.biomedcentral.com/articles/
10.1186/s12911-022-01999-3.
[3] P. J. Batterham, M. Sunderland, N. Carragher, and A. L. Calear. Self-Report Scales for Common
Mental Disorders. Cambridge University Press, 2016.
[4] S. L. Cichosz, M. N. Stausholm, C. Hansen, J. M. Møller, J. Kallehauge, L. Clemmensen, and
O. Hejlesen. Artificial intelligence in healthcare: review and prediction case studies. Healthcare
(Basel, Switzerland), 8(4):433, 2020.
[5] W. M. Compton and L. B. Cottler. The diagnostic interview schedule (DIS). In M. Hersen,
M. Hilsenroth, and D. L. Segal, editors, Personality assessment, Volume 2: Comprehensive
handbook of psychological assessment, pages 153–162. Wiley, 2004.
[6] D. DeVault, R. Artstein, G. Benn, T. Dey, E. Fast, A. Gainer, K. Georgila, J. Gratch, A. Hartholt,
M. Lhommet, G. Lucas, S. Marsella, F. Morbini, A. Nazarian, S. Scherer, G. Stratou, A. Suri,
D. Traum, R. Wood, Y. Xu, A. Rizzo, and L.-P. Morency. SimSensei kiosk: A virtual human
interviewer for healthcare decision support. In Proceedings of the 13th International Conference
on Autonomous Agents and Multiagent Systems (AAMAS’14), Paris, 2014.
[7] J. Gratch, R. Artstein, G. M. Lucas, G. Stratou, S. Scherer, A. Nazarian, R. Wood, J. Boberg,
D. DeVault, S. Marsella, and D. R. Traum. The distress analysis interview corpus of human and
computer interviews. In LREC, pages 3123–3128, 2014.
[8] A. Johnson, P. Kumar, and D. Lee. Real-world applications of LLMs in mental health diagnostics.
arXiv preprint arXiv:2401.02984v1, 2024. https://ar5iv.org/html/2401.02984v1.
[9] S.-J. Kim, J.-H. Lee, and E.-J. Park. Application of BERT to electronic health records for
predicting depressive episodes. Journal of Medical Artificial Intelligence, 5(2):45–52, 2021.
[10] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot
reasoners. arXiv preprint arXiv:2205.11916, 2022.
[11] K. Kroenke, T. W. Strine, R. L. Spitzer, J. B. W. Williams, J. T. Berry, and A. H. Mokdad.
The PHQ-8 as a measure of current depression in the general population. Journal of Affective
Disorders, 114(1-3):163–173, 2009.
[12] MIT Media Lab. Objective assessment of depression and its improvement. https://www.
media.mit.edu/projects/objective-assessment-of-depression/overview/. Accessed:
2024-06-24.
[13] J. Liu, E. Smith, and R. Patel. Large language models and mental health diagnostics: Opportu-
nities and risks. arXiv preprint arXiv:2307.14385, 2023.
[14] S. Miller, W. Zhang, and C. Hernandez. Review of methodologies in LLM research for mental
health diagnostics. arXiv preprint arXiv:2403.14814, 2024. https://arxiv.org/abs/2403.14814.
[15] J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V.
Le. Chain of thought prompting elicits reasoning in large language models. arXiv preprint
arXiv:2201.11903, 2022.
[16] World Health Organization. Depression. World Health Organization, 2021. https://www.
who.int/news-room/fact-sheets/detail/depression.