Large Language Models For Mental Health Diagnostic Assessments Exploring The Potential of Large Lang
ABSTRACT
Large language models (LLMs) are increasingly attracting the attention of healthcare professionals for their potential to
assist in diagnostic assessments, which could alleviate the strain on the healthcare system caused by a high patient
load and a shortage of providers. For LLMs to be effective in supporting diagnostic assessments, it is essential that
they closely replicate the standard diagnostic procedures used by clinicians. In this paper, we specifically examine the
diagnostic assessment processes described in the Patient Health Questionnaire-9 (PHQ-9) for major depressive disorder
(MDD) and the Generalized Anxiety Disorder-7 (GAD-7) questionnaire for generalized anxiety disorder (GAD). We
investigate various prompting and fine-tuning techniques to guide both proprietary and open-source LLMs in adhering
to these processes, and we evaluate the agreement between LLM-generated diagnostic outcomes and expert-validated
ground truth. For fine-tuning, we utilize the MentaLLaMA and Llama models, while for prompting, we experiment with
proprietary models like GPT-3.5 and GPT-4o, as well as open-source models such as llama-3.1-8b and mixtral-8x7b.
Software Availability. We make all software artifacts available on GitHub (see footnote 1).
Institutional Review Board (IRB). This study does not require approval from the Institutional Review Board (IRB).
It uses clinician-annotated social media posts that are authorized for research purposes. The primary objective is
to evaluate the effectiveness of LLMs that incorporate diagnostic criteria for major depressive disorder and generalized
anxiety disorder in assisting with mental health assessments.
1 INTRODUCTION
LLMs are large neural networks (≥∼7 billion weights and biases) that encode complex language patterns
learned by training on massive language-based datasets [1]. Their remarkable success in a wide array of natural
language processing tasks has led to the proliferation of LLM-based tools and applications across various industries
1 https://github.com/kauroy1994/Large-Language-Models-for-Assisting-with-Mental-Health-Diagnostic-Assessments
Authors’ addresses: Kaushik Roy, [email protected], Artificial Intelligence Institute, University of South Carolina, USA; Harshul Surana, harshul19@iiserb.ac.in,
Indian Institute of Research and Science, Bhopal, India; Darssan Eswaramoorthi, [email protected], Artificial Intelligence Institute,
University of South Carolina, USA; Yuxin Zi, [email protected], Artificial Intelligence Institute, University of South Carolina, USA; Vedant Palit, ledarssan@gmail.com,
Indian Institute of Technology, Kharagpur, India; Ritvik Garimella, [email protected], Artificial Intelligence Institute, University of South
Carolina, USA; Amit Sheth, [email protected], Artificial Intelligence Institute, University of South Carolina, USA.
2 Trovato et al.
[2]. In healthcare, particularly in contexts involving natural language conversations, such as interactions between
patients and clinicians, LLMs have piqued the interest of stakeholders as a potentially valuable tool for
alleviating some of the burden on clinicians and the overall healthcare system [3]. During patient-clinician
interactions, clinicians employ standard diagnostic assessment processes for capturing a patient’s state, such as the
PHQ-9 for depression assessment and the GAD-7 for anxiety assessment [4, 5]. Figure 1 shows the PHQ-9 and GAD-7
questionnaires.
Fig. 1. Mental Health Diagnostic Assessment Questionnaires. The Patient Health Questionnaire (PHQ)-9 for depression
assessment and the Generalized Anxiety Disorder (GAD)-7 for anxiety assessment.
To gainfully leverage LLMs for diagnostic assistance, it is necessary to provide mechanisms for guiding
LLMs in closely following standardized clinical assessment procedures. There are two categories of methods available
to enable this behavior:
(i) Prompting LLMs - Modern LLMs stand out for their capacity to tailor responses based on user instructions
or prompts [6]. However, LLMs are highly sensitive to the specific prompts used [7]. Prompting techniques have
continuously evolved to enhance the robustness of LLM responses, for example, by using Chain-of-Thought (CoT)
prompting [8]. Prompting methods are broadly classified under three categories: (a) Naive prompting - Providing direct
instructions to the LLM in a prompt, (b) Exemplar-based prompting - Providing direct instructions along with a few
examples of the expected output, and (c) Guidance-based prompting - Exemplar-based prompting along with
specific guidance on reasoning steps (for example, by prompting the LLM to “think” step-by-step).
(ii) Fine-tuning LLMs - Fine-tuning of LLMs involves adapting the model’s behavior to closely align with the diagnostic
procedures that clinicians follow, using fine-tuning algorithms such as supervised fine-tuning (SFT), reinforcement
learning from human feedback (RLHF), and direct preference optimization (DPO) [9]. Fine-tuning LLMs is relatively
Large Language Models for Mental Health Diagnostic Assessments 3
more complex than prompting due to the need to curate high-quality data and appropriately formulate task-specific
prompts or instructions during the fine-tuning process.
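The three prompting categories described above can be illustrated with a minimal sketch. The instruction wording, the `build_prompt` helper, and the toy post and exemplar are all hypothetical and chosen only for illustration; the prompts actually used in this work are the ones referenced in Appendix A.2.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text; not the prompt used in the paper.
INSTRUCTION = "Identify the sentences in the post that indicate the symptom: {symptom}."
REASONING_HINT = "Think step-by-step before giving the final answer."


def build_prompt(
    post: str,
    symptom: str,
    examples: Sequence[Tuple[str, str]] = (),
    guidance: bool = False,
) -> str:
    """Assemble a prompt; the arguments select the prompting category."""
    parts = [INSTRUCTION.format(symptom=symptom)]
    if guidance:
        # Guidance-based prompting adds explicit reasoning guidance.
        parts.append(REASONING_HINT)
    for example_post, example_answer in examples:
        # Exemplar-based prompting adds (post, expected answer) pairs.
        parts.append(f"Post: {example_post}\nAnswer: {example_answer}")
    parts.append(f"Post: {post}\nAnswer:")
    return "\n\n".join(parts)


symptom = "Little interest or pleasure in doing things"
post = "I feel no happiness from what I do at all."
shots = [("Nothing is fun anymore.", "Nothing is fun anymore.")]

naive = build_prompt(post, symptom)                                # naive
exemplar = build_prompt(post, symptom, examples=shots)             # exemplar-based
guided = build_prompt(post, symptom, examples=shots, guidance=True)  # guidance-based
```

Keeping one builder for all three categories makes it easy to hold the instruction fixed while varying only the exemplars and the reasoning hint, which matters given how sensitive LLMs are to prompt wording.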
In this work, we explore both approaches using a variety of proprietary and open-source models: the
MentaLLaMA and Llama models for fine-tuning, and GPT-3.5, GPT-4o, llama-3.1-8b, and mixtral-8x7b for
prompting [7, 10–12].
2 METHODOLOGY
2.1 MDD Diagnostic Assistance based on the PHQ-9
Ground Truth Dataset Creation. We start with the publicly available PRIMATE dataset, which consists of a
collection of social media posts annotated for PHQ-9 relevant criteria [16]. Appendix A.1 shows an example post and its
annotation, specifically the post title, the post text, and the annotations indicating whether specific PHQ-9 symptoms are
present in the post (using yes/no values). We chose this dataset as the authors provide preliminary experimental evidence
on the effectiveness of using this dataset for guiding language models toward questionnaire-specific determination of
diagnostic criteria. We first prompt GPT-4o to identify text spans in the posts corresponding to the PHQ-9 symptoms
by providing an example of the expected output. Appendix A.2 shows an example of a prompt to GPT-4o. The example
illustrates how we steer the model toward providing PHQ-9-specific diagnostic criteria. We
then pass the model outputs to expert clinicians, who return the subset of GPT-4o-annotated outputs that they
agree with. This subset is available online (see footnote 2). The clinicians are three anonymized experts from a non-profit
institution run by a retired professional from the National Institute of Mental Health and Neurosciences
(NIMHANS), India (see footnote 3). An agreement score of 0.74 was recorded among the annotators (measured using Cohen’s Kappa).
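For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Cohen's Kappa. Cohen's Kappa is defined for two raters, and the text does not state how it was aggregated over the three clinicians, so averaging over annotator pairs is an assumption made here; the function names and toy labels are likewise illustrative.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    p_chance = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


def mean_pairwise_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Average kappa over all rater pairs (an assumed aggregation)."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(x, y) for x, y in pairs) / len(pairs)


# Toy labels for two raters over four items.
rater_1 = ["yes", "yes", "no", "no"]
rater_2 = ["yes", "no", "no", "no"]
kappa = cohen_kappa(rater_1, rater_2)
```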
2 https://huggingface.co/datasets/darssanle/GPT-4o-eval
3 https://www.justdial.com/Bangalore/Dr-C-R-Chandrashekar-Samadhana-Counselling-Trust-Centre-Near-Subramanya-Temple-Mico-Layout-Bus-
Stand-Bannerghatta-Road/080PXX80-XX80-170124231410-U5U4_BZDET
2.1.1 Obtaining Proprietary Model Outputs for MDD Diagnostic Assistance based on the PHQ-9. Maintaining
exactly the same prompt structure as shown in Appendix A.2, we prompt the models GPT-3.5-Turbo and GPT-4o-mini
to obtain annotations for a subset of the posts in the PRIMATE dataset. Our subset selection is randomized and limited
by request costs and our available budget (see Section 4 for funding information).
For evaluation of the outputs, we employ two methods: (i) hits@k-based ranking - We rank-order the text spans
identified in the model output based on cosine similarity with the symptom, and then check whether the identified text span
occurs within the top k positions in the ground truth output, and (ii) Standard Classification Metrics - We evaluate
the accuracy, precision, recall, and F1-score of the model outputs against the ground truth. Tables 1 and 2 show the
evaluation results.
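The hits@k check can be sketched as follows under one reading of the description above: both the model spans and the ground truth spans are ranked by cosine similarity to the symptom text, and a hit is counted when the model's top-ranked span appears among the top k ground truth spans. The bag-of-words vectors stand in for whatever text representation was actually used, which is not specified here, and all function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Sequence


def bow(text: str) -> Counter:
    """Bag-of-words vector (an assumed stand-in for a real embedding)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def rank_by_similarity(spans: Sequence[str], symptom: str) -> List[str]:
    """Order spans from most to least similar to the symptom text."""
    query = bow(symptom)
    return sorted(spans, key=lambda s: cosine(bow(s), query), reverse=True)


def hits_at_k(model_spans: Sequence[str], truth_spans: Sequence[str],
              symptom: str, k: int) -> bool:
    """True if the model's best span is in the top-k ground truth spans."""
    if not model_spans:
        return False
    best = rank_by_similarity(model_spans, symptom)[0]
    return best in rank_by_similarity(truth_spans, symptom)[:k]


symptom = "little interest or pleasure in doing things"
truth = ["I feel no pleasure or interest in doing things I liked",
         "I joined the band"]
hit = hits_at_k([truth[0]], truth, symptom, k=1)
miss = hits_at_k(["I sleep a lot"], truth, symptom, k=1)
```

Averaging the boolean result over all posts and symptoms gives a hits@k score for a model.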
Table 1. Evaluation of Proprietary LLMs for PHQ-9 Symptom Annotation of PRIMATE Posts Using hits@k.
Table 2. Evaluation of Proprietary LLMs for PHQ-9 Symptom Annotation of PRIMATE Posts Using Standard Classification Metrics.
2.1.2 Obtaining Open-source Model Outputs for MDD Diagnostic Assistance based on the PHQ-9. Similar to
the proprietary model case, we use the same prompt structure shown in Appendix A.2 and prompt the models llama-3.1-8b
and mixtral-8x7b to obtain annotations. As in the proprietary model case, the subset selection is randomized and
limited only by rate-limit costs.
For evaluation, we use the same two methods defined in Section 2.1.1 using the ground truth dataset introduced in
Section 2 (the hits@k and standard classification metrics). Tables 3 and 4 show the results.
Table 3. Evaluation of Open-source LLMs for PHQ-9 Symptom Annotation of PRIMATE Posts Using hits@k.
Table 4. Evaluation of Open-source LLMs for PHQ-9 Symptom Annotation of PRIMATE Posts Using Standard Classification Metrics.
The MentaLLaMA model. MentaLLaMA is a model trained on 105K data samples of mental health instructions on
social media posts. The samples are collected from 10 existing sources covering eight mental health analysis tasks,
making MentaLLaMA a suitable foundation model for the tasks covered in this study. The instructions used for training
are a combination of expert-written and few-shot ChatGPT prompt outputs, further supporting MentaLLaMA as a viable
candidate for testing adherence to diagnostic criteria by language models [10].
We perform experiments using MentaLLaMA on the ground truth dataset we introduce in Section 2 and report the
results. The prompt is provided in Appendix B.1.
The DiagnosticLlama model - Fine-tuning MentaLLaMA on the PRIMATE dataset using Hugging Face
AutoTrain. AutoTrain is a no-code platform designed to simplify the process of training and fine-tuning language
models on custom data (see footnote 4). The full specifications for training this model are available in Appendix C. We
refer to this model as DiagnosticLlama. Appendix C.1 shows an example of an input (prompt) and output pair
obtained using the DiagnosticLlama model. The model space is available online (see footnote 5).
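As a rough illustration of the data preparation that instruction fine-tuning involves, the sketch below turns an annotated post into a single training record and serializes records as JSON Lines. The instruction wording, the section markers, the single `text` field, and the helper names are all assumptions made for illustration; they are not the configuration documented in Appendix C.

```python
import json
from typing import Dict, Iterable, Sequence, Tuple


def to_sft_record(title: str, text: str,
                  annotations: Sequence[Tuple[str, str]]) -> Dict[str, str]:
    """Build one instruction-tuning record (hypothetical layout)."""
    instruction = ("For each PHQ-9 symptom, state whether it is present "
                   "in the post (yes/no).")
    # The target mirrors the [symptom, yes/no] pairs of the annotation.
    target = json.dumps([[symptom, label] for symptom, label in annotations])
    return {
        "text": (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\nTitle: {title}\nPost: {text}\n\n"
            f"### Response:\n{target}"
        )
    }


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(record) for record in records)


record = to_sft_record(
    "Example title",
    "I feel no happiness from what I do at all.",
    [("Little-interest-or-pleasure-in-doing", "yes")],
)
jsonl = to_jsonl([record, record])
```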
For evaluation of the outputs, we employ the same two methods as in Section 2.1.1, i.e., (i) hits@k based ranking, and
(ii) Standard Classification Metrics - the accuracy, precision, recall and F1-score of the model outputs against the ground
truth. Tables 5 and 6 show the evaluation results.
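The standard classification metrics can be computed from yes/no symptom labels as in the sketch below. Treating "yes" as the positive class and returning zero when a denominator is empty are assumptions made here, as are the function name and toy labels.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                           positive: str = "yes") -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary yes/no labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy ground truth and model labels for one symptom over five posts.
truth_labels = ["yes", "yes", "no", "no", "yes"]
model_labels = ["yes", "no", "no", "yes", "yes"]
metrics = classification_metrics(truth_labels, model_labels)
```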
Table 5. Evaluation of MentaLLaMA and DiagnosticLlama for PHQ-9 Symptom Annotation of PRIMATE Posts Using hits@k.
Table 6. Evaluation of MentaLLaMA and DiagnosticLlama for PHQ-9 Symptom Annotation of PRIMATE Posts Using Standard
Classification Metrics.
2.2.1 Obtaining Proprietary Model Outputs for GAD Diagnostic Assistance based on the GAD-7. Maintaining
exactly the same prompt structure as shown in Appendix A.2, we prompt the models GPT-3.5-Turbo and GPT-4o-mini
to obtain annotations for a subset of the posts in the PRIMATE dataset, this time targeting the
GAD-7 symptoms. Our subset selection is randomized and limited by request costs and our available budget (see Section
4 for funding information).
For evaluation of the outputs, we employ the same two methods as in the PHQ-9 case, i.e., (i) hits@k based ranking,
and (ii) Standard Classification Metrics - the accuracy, precision, recall and F1-score of the model outputs against the
ground truth. Tables 7 and 8 show the evaluation results.
Table 7. Evaluation of Proprietary LLMs for GAD-7 Symptom Annotation of PRIMATE Posts Using hits@k.
Table 8. Evaluation of Proprietary LLMs for GAD-7 Symptom Annotation of PRIMATE Posts Using Standard Classification Metrics.
2.2.2 Obtaining Open-source Model Outputs for GAD Diagnostic Assistance based on the GAD-7. As in the
PHQ-9 case, we use the same prompt structure shown in Appendix A.2 and prompt the models llama-3.1-8b and mixtral-8x7b
to obtain annotations for the GAD-7 symptoms. As before, the subset selection is randomized and limited only by
rate-limit costs.
For evaluation, we use the same two methods defined in Section 2.1.1 using the ground truth dataset introduced in
Section 2 (the hits@k and standard classification metrics). Tables 9 and 10 show the results.
Table 9. Evaluation of Open-source LLMs for GAD-7 Symptom Annotation of PRIMATE Posts Using hits@k.
Table 10. Evaluation of Open-source LLMs for GAD-7 Symptom Annotation of PRIMATE Posts Using Standard Classification Metrics.
Table 11. Evaluation of Llama2-7b-chat and Mistral-Instruct for PHQ-9 Symptom Annotations of the PRIMATE Posts Using F1 scores.
Method F1-score
llama2-7b-chat 0.663
mistral-instruct 0.655
Older pretrained language models. Several classification-based approaches have been used to classify posts into labels
corresponding to diagnostic criteria on questionnaires as an alternative to generative models [16, 18]. Although this
work focuses on modern LLMs, we also perform experiments in the classification setting using the older pretrained
models BERT, MentalBERT, and MentalRoBERTa [19, 20]. Table 12 shows the results (see footnote 7).
Table 12. Evaluation of BERT, MentalBERT, and MentalRoBERTa for PHQ-9 Symptom Annotations of the PRIMATE Posts Using F1
scores.
Method F1-score
BERT 0.69
MentalBERT 0.71
MentalRoBERTa 0.48
7 For completeness, we also show results of traditional machine learning-based classification methods in appendix Section D
8 https://huggingface.co/barca-boy/primate_autotrain_mental_llama
9 https://huggingface.co/collections/darssanle/mhd-datasets-669628ee2d25bd04e99dc3bf
10 https://github.com/kauroy1994/Large-Language-Models-for-Assisting-with-Mental-Health-Diagnostic-Assessments
Table 13. Dataset Statistics (number of posts) for All the Datasets in Section 2.4
3 RESULTS
3.1 PHQ-9 Results
From Tables 1, 2, 3, and 4, we see that both the proprietary and open-source LLMs approach human annotation quality,
and Tables 5 and 6 show that fine-tuning LLMs for diagnostic assistance yields promising results. However, fine-tuning
LLMs proved highly challenging, requiring considerable resources and hyperparameter tuning to get right.
The entries for MentaLLaMA are blank in the tables because the MentaLLaMA model reiterates the input verbatim, as
seen in Appendix B.1. This further shows the difficulty of adequately leveraging fine-tuned models to achieve good results
in highly specialized tasks such as diagnostic assistance. Still, the preliminary results on the PHQ-9 task demonstrate
that this can be done with a good deal of trial and error on the fine-tuning configurations. It is essential to be able to
deploy specialized models fine-tuned or trained on custom data in safety-constrained and privacy-critical settings.
Interestingly, Table 11 shows significant performance gaps between the older and newer LLMs (open-source and
proprietary models). We also find from Table 12 that older pretrained language models (that are not autoregressive)
perform as well as older LLMs. We also see again that fine-tuning pretrained language models does not lead to
much change in performance and sometimes degrades it (e.g., MentalRoBERTa), further evidencing the
significant challenge of fine-tuning language models for specialized tasks such as mental health diagnostic assistance.
major depressive disorder and the GAD-7 for generalized anxiety disorder, are indispensable for accurate and effective
treatment planning in mental healthcare. Our research addresses this gap by specifically targeting these assessment
procedures and developing prompting strategies to guide LLMs toward crafting clinician-friendly responses with
explanations using assessment and reasoning prompts.
Our findings reveal that while LLMs struggle to effectively utilize questionnaire information in prompts to provide
assessments resembling those of clinicians in the zero-shot setting, their performance significantly improves in the
few-shot setting (in both the fine-tuning and few-shot prompting regimes), nearly matching the assessments of expert
clinicians. However, despite this improvement, LLMs still do not reason in the same manner as clinicians when arriving
at assessments, matching clinician reasoning only a fraction of the time, as evidenced by the sizes of the ground truth
datasets for which a high expert agreement score is obtained. This underscores the need for further scrutiny in the
integration of LLMs, along with prompting methods incorporating diagnostic assessment criteria, before they can be
reliably utilized in mental healthcare assistance. Moreover, our work introduces several novel assessment LLM and
instruction-tuning datasets, offering a valuable resource for future research aimed at understanding and enhancing the
effectiveness of LLMs in assisting with assessments within mental healthcare settings. This contribution holds promise
for advancing the capabilities of LLMs in mental health support, potentially alleviating the strain on healthcare systems
caused by a shortage of care providers and an increasing number of patients.
Future Work. We are working on integrating the models studied in this work into a clinician-facing app,
extending the DiagnosticLlama model to include GAD-7, and expanding all datasets in Section 2.4 to match the
original PRIMATE dataset. We are also expanding our datasets and results to include more GAD-7-based results and
non-linearly structured questionnaires (for example, flowcharts) such as the CSSRS [21] (see footnote 11). Finally, we are also working to
incorporate additional constraints, such as restricted terminology (e.g., non-toxic terminology), by paraphrasing the
LLM outputs [22, 23]. All future updates will be released on the GitHub repository.
ACKNOWLEDGEMENTS
This research is partially supported by NSF Award 2335967 EAGER: Knowledge-guided neurosymbolic AI with guardrails
for safe virtual health assistants (see footnote 12) [24–29]. The views expressed here are those of the authors, not those of the sponsors.
REFERENCES
[1] Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language
models in medicine. Nature medicine, 29(8):1930–1940, 2023.
[2] Nitin Rane. Chatgpt and similar generative artificial intelligence (ai) for building and construction industry: Contribution, opportunities and
challenges of large language models for industry 4.0, industry 5.0, and society 5.0. Opportunities and Challenges of Large Language Models for
Industry, 4, 2023.
[3] Peter Lee, Sebastien Bubeck, and Joseph Petro. Benefits, limits, and risks of gpt-4 as an ai chatbot for medicine. New England Journal of Medicine,
388(13):1233–1239, 2023.
[4] Joseph Ford, Felicity Thomas, Richard Byng, and Rose McCabe. Use of the patient health questionnaire (phq-9) in practice: Interactions between
patients and physicians. Qualitative Health Research, 30(13):2146–2159, 2020.
[5] Sverre Urnes Johnson, Pål Gunnar Ulvenes, Tuva Øktedalen, and Asle Hoffart. Psychometric properties of the general anxiety disorder 7-item
(gad-7) scale in a heterogeneous psychiatric sample. Frontiers in psychology, 10:1713, 2019.
[6] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex
Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744,
2022.
[7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[8] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits
reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[9] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your
language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
[10] Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Jimin Huang, and Sophia Ananiadou. Mentallama: interpretable mental health analysis on
social media with large language models. In Proceedings of the ACM on Web Conference 2024, pages 4489–4500, 2024.
[11] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric
Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[12] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas,
Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
[13] Anxo Pérez, Marcos Fernández-Pichel, Javier Parapar, and David E Losada. Depresym: A depression symptom annotated corpus and the role of llms
as assessors of psychological markers. arXiv preprint arXiv:2308.10758, 2023.
[14] Ayah Zirikly and Mark Dredze. Explaining models of mental health via clinically grounded auxiliary tasks. In Proceedings of the Eighth Workshop on
Computational Linguistics and Clinical Psychology, pages 30–39, 2022.
[15] Andrew Yates, Bart Desmet, Emily Prud’Hommeaux, Ayah Zirikly, Steven Bedrick, Sean MacAvaney, Kfir Bar, Molly Ireland, and Yaakov Ophir.
Proceedings of the 9th workshop on computational linguistics and clinical psychology (clpsych 2024). In Proceedings of the 9th Workshop on
Computational Linguistics and Clinical Psychology (CLPsych 2024), 2024.
[16] Shrey Gupta, Anmol Agarwal, Manas Gaur, Kaushik Roy, Vignesh Narayanan, Ponnurangam Kumaraguru, and Amit Sheth. Learning to automate
follow-up question generation using process knowledge for depression triage on reddit posts. In Proceedings of the Eighth Workshop on Computational
Linguistics and Clinical Psychology, pages 137–147, 2022.
[17] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna
Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
[18] Sumit Dalal, Deepa Tilwani, Manas Gaur, Sarika Jain, Valerie Shalin, and Amit Sheth. A cross attention approach to diagnostic explainability using
clinical practice guidelines for depression. arXiv preprint arXiv:2311.13852, 2023.
[19] Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[20] Shaoxiong Ji, Tianlin Zhang, Luna Ansari, Jie Fu, Prayag Tiwari, and Erik Cambria. Mentalbert: Publicly available pretrained language models for
mental healthcare. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 7184–7190, 2022.
[21] Kaushik Roy, Yuxin Zi, Manas Gaur, Jinendra Malekar, Qi Zhang, Vignesh Narayanan, and Amit Sheth. Process knowledge-infused learning for
clinician-friendly explanations. In Proceedings of the AAAI Symposium Series, volume 1, pages 154–160, 2023.
[22] Adam Tsakalidis, Jenny Chim, Iman Munire Bilal, Ayah Zirikly, Dana Atzil-Slonim, Federico Nanni, Philip Resnik, Manas Gaur, Kaushik Roy, Becky
Inkster, et al. Overview of the clpsych 2022 shared task: Capturing moments of change in longitudinal user posts. In Proceedings of the Eighth
Workshop on Computational Linguistics and Clinical Psychology, pages 184–198, 2022.
[23] Kaushik Roy, Manas Gaur, Misagh Soltani, Vipula Rawte, Ashwin Kalyan, and Amit Sheth. Proknow: Process knowledge for safety constrained and
explainable question generation for mental health diagnostic assistance. Frontiers in big Data, 5:1056728, 2023.
[24] Amit Sheth, Manas Gaur, Kaushik Roy, and Keyur Faldu. Knowledge-intensive language understanding for explainable ai. IEEE Internet Computing,
25(5):19–24, 2021.
[25] Amit Sheth, Manas Gaur, Kaushik Roy, Revathy Venkataraman, and Vedant Khandelwal. Process knowledge-infused ai: Toward user-level
explainability, interpretability, and safety. IEEE Internet Computing, 26(5):76–84, 2022.
[26] Amit Sheth, Kaushik Roy, and Manas Gaur. Neurosymbolic artificial intelligence (why, what, and how). IEEE Intelligent Systems, 38(3):56–62, 2023.
[27] Amit Sheth and Kaushik Roy. Neurosymbolic value-inspired artificial intelligence (why, what, and how). IEEE Intelligent Systems, 39(1):5–11, 2024.
[28] Amit Sheth, Kaushik Roy, Hemant Purohit, and Amitava Das. Civilizing and humanizing artificial intelligence in the age of large language models.
IEEE Internet Computing, 28(5):5–10, 2024.
[29] Amit Sheth, Vishal Pallagani, and Kaushik Roy. Neurosymbolic ai for enhancing instructability in generative ai. IEEE Intelligent Systems, 39(5):5–11,
2024.
APPENDIX
A DATASET EXAMPLES
A.1 Primate Data Example
{
... highest competitors in my school. I joined the high school band in freshman year and by senior year I became
one of the best in my section. My academics were always straight and I exercised daily. Senior year I enlisted in
the military and now I believe it was one of my worst decisions in life. Before I went to boot camp I was
motivated, a patriot and believed that the elite joined the military. In senior year I never applied for any
scholarships and I was offered one but turned it down because I already signed the papers. I thought I set myself
up for success. Now I believe I was dead wrong for joining. The only benefit I see so far after a year and a half
of service is that I'm trying to set myself up financially before I get out and hopefully attend college. It
sounds like a plan but I feel no happiness from what I do at all. I convinced myself there's no honor in it
anymore, it's just another job. I don't exercise by myself anymore. I feel like I'm not progressing anywhere in
life being in service. I'm just a body and if I wasn't here doing what I'm doing, there'd just be somebody else
doing the exact same. I'm replaceable. That's the mindset the military gave me. I look forward to going back home
in 6 months for vacation and that's the only thing I've been looking forward to since I've been stationed. After
that, the only thing I have my eyes on are getting out of service, going home, being closer to my family again.
There's nothing here that satisfies me and I hate it. I feel like I've tried everything to be happy here but it
seems impossible. I wish somebody could help.",
"annotations": [
[ ... ],
[ ..., "no" ],
[ ... ],
[ "Little-interest-or-pleasure-in-doing", ... ],
[ ... ],
[ ..., "no" ],
[ ... ],
[ ... ],
[ ..., "no" ]
]
}
specific sentences from the text that exhibit a set of symptoms. Below is an example of INPUT and OUTPUT. Keep JSON
Formatting for output:

{
... highest competitors in my school. I joined the high school band in freshman year and by senior year I became
one of the best in my section. My academics were always straight and I exercised daily. Senior year I enlisted in
the military and now I believe it was one of my worst decisions in life. Before I went to boot camp I was
motivated, a patriot and believed that the elite joined the military. In senior year I never applied for any
scholarships and I was offered one but turned it down because I already signed the papers. I thought I set myself
up for success. Now I believe I was dead wrong for joining. The only benefit I see so far after a year and a half
of service is that I'm trying to set myself up financially before I get out and hopefully attend college. It
sounds like a plan but I feel no happiness from what I do at all. I convinced myself there's no honor in it
anymore, it's just another job. I don't exercise by myself anymore. I feel like I'm not progressing anywhere in
life being in service. I'm just a body and if I wasn't here doing what I'm doing, there'd just be somebody else
doing the exact same. I'm replaceable. That's the mindset the military gave me. I look forward to going back home
in 6 months for vacation and that's the only thing I've been looking forward to since I've been stationed. After
that, the only thing I have my eyes on are getting out of service, going home, being closer to my family again.
There's nothing here that satisfies me and I hate it. I feel like I've tried everything to be happy here but it
seems impossible. I wish somebody could help.",
"annotations": [
[ ... ],
[ ..., "no" ],
[ "Feeling-tired-or-having-little-energy", ... ],
[ "Little-interest-or-pleasure-in-doing", ... ],
[ ... ],
[ "Poor-appetite-or-overeating", "no" ],
[ ... ],
[ "Trouble-concentrating-on-things-such-as-reading-the-newspaper-or-watching-television", "no" ],
[ ..., "no" ]
]
}
45
And t h i s i s an e x a m p l e e x p e c t e d o u t p u t f o r m a t : 46
47
{ 48
highest competitors in my school. I joined the high school band in
freshman year and by senior year I became one of the best in my section.
My academics were always straight, and I exercised daily. Senior year
I enlisted in the military, and now I believe it was one of my worst
decisions in life. Before I went to boot camp I was motivated, a
patriot and believed that the elite joined the military. In senior year
I never applied for any scholarships and I was offered one but turned
it down because I already signed the papers. I thought I set myself up
for success. Now I believe I was dead wrong for joining. The only
benefit I see so far after a year and a half of service is that I'm
trying to set myself up financially before I get out and hopefully
attend college. It sounds like a plan but I feel no happiness from what
I do at all. I convinced myself there's no honor in it anymore; it's
just another job. I don't exercise by myself anymore. I feel like I'm
not progressing anywhere in life being in service. I'm just a body, and
if I wasn't here doing what I'm doing, there'd just be somebody else
doing the exact same. I'm replaceable. That's the mindset the military
gave me. I look forward to going back home in 6 months for vacation,
and that's the only thing I've been looking forward to since I've been
stationed. After that, the only thing I have my eyes on is getting out
of service, going home, being closer to my family again. There's
nothing here that satisfies me, and I hate it. I feel like I've tried
everything to be happy here but it seems impossible. I wish somebody
could help.",
"annotations": {
  "Feeling-tired-or-having-little-energy": [
  ],
  "Little-interest-or-pleasure-in-doing": [
  ],
  watching-television": [],
  "Trouble-falling-or-staying-asleep-or-sleeping-too-much": []
}
},

B MENTALLLAMA
B.1 Example Input and Output
Input
### INSTRUCTION:
binary "yes" or "no", for every symptom. The symptoms are as follows:
[Little interest or pleasure in doing things,
Feeling down, depressed, or hopeless,
Trouble falling or staying asleep, or sleeping too much,
Feeling tired or having little energy,

Poor appetite or overeating,

Feeling bad about yourself or that you are a failure or

Trouble concentrating on things, such as reading the
newspaper or watching television,

Moving or speaking so slowly that other people could
have noticed. Or the opposite being so figety or
than usual,

hurting yourself].

Give the output as a python list of lists, with the list containing the symptom and the answer, separated by a comma.

### INPUT:
but I just end up crying, and everything I write is incoherent, so. Sorry if there's not enough context?

things, I'm just putting everything off and setting myself up for failure. But doing anything productive feels too daunting, and I don't know how to start?

I've tried looking for advice online, but everything I see involves:
comfortable talking about my mental health to any of them.
- Self-help books and stuff, which just... does not work on me.
Output
### INSTRUCTION:
binary "yes" or "no", for every symptom. The symptoms are as follows:
[Little interest or pleasure in doing things,
Feeling down, depressed, or hopeless,
Trouble falling or staying asleep, or sleeping too much,
Feeling tired or having little energy,

Poor appetite or overeating,

Feeling bad about yourself or that you are a failure or

Trouble concentrating on things, such as reading the
newspaper or watching television,

Moving or speaking so slowly that other people could
have noticed. Or the opposite being so figety or
than usual,

hurting yourself].

Give the output as a python list of lists, with the list containing the symptom and the answer, separated by a comma.

### INPUT:
but I just end up crying, and everything I write is incoherent, so. Sorry if there's not enough context?

things, I'm just putting everything off and setting myself up for failure. But doing anything productive feels too daunting, and I don't know how to start?

I've tried looking for advice online, but everything I see involves:
comfortable talking about my mental health to any of them.
- Self-help books and stuff, which just... does not work on me.
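
The instruction in this prompt asks for "a python list of lists" that pairs each symptom with a binary "yes" or "no" answer. As a minimal sketch of how such a response might be turned into a lookup table, the snippet below extracts the first `[[ ... ]]` span from the generated text and validates it. The function name `parse_symptom_answers`, the assumption that the answer appears as a single well-formed `[[ ... ]]` literal with quoted strings, and the sample response are all hypothetical and are not taken from the paper's software.

```python
import ast


def parse_symptom_answers(response: str) -> dict[str, str]:
    """Parse a model response containing a list of [symptom, answer] pairs.

    Assumption (not from the paper): the answer is the text between the
    first "[[" and the last "]]" and is a valid Python literal.
    """
    start = response.find("[[")
    end = response.rfind("]]")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no list of lists found in the response")

    # literal_eval only accepts Python literals, so arbitrary code in the
    # model output is never executed.
    pairs = ast.literal_eval(response[start:end + 2])

    answers: dict[str, str] = {}
    for pair in pairs:
        if len(pair) != 2:
            raise ValueError(f"expected [symptom, answer], got {pair!r}")
        symptom, answer = pair
        answer = str(answer).strip().lower()
        if answer not in {"yes", "no"}:
            raise ValueError(f"answer must be 'yes' or 'no', got {answer!r}")
        answers[str(symptom).strip()] = answer
    return answers


# Hypothetical response text, for illustration only.
sample = (
    'Here is the result: [["Poor appetite or overeating", "no"], '
    '["Feeling tired or having little energy", "Yes"]]'
)
parsed = parse_symptom_answers(sample)
```

Because the generated text echoes the prompt, whose symptom list opens with a single bracket, searching for the double bracket avoids matching the echoed instruction. A response that omits the quotes or nests the pairs differently would raise an error here rather than be silently misread.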
C AUTOTRAINING DiagnosticLlama
Model details, input format, sample inferences
### INSTRUCTION:
binary "yes" or "no", for every symptom. The symptoms are as follows:
[Little interest or pleasure in doing things,
Feeling down, depressed, or hopeless,

Trouble falling or staying asleep, or sleeping too much,
Feeling tired or having little energy,

Poor appetite or overeating,

Feeling bad about yourself or that you are a failure or

Trouble concentrating on things, such as reading the
newspaper or watching television,

Moving or speaking so slowly that other people could
have noticed. Or the opposite being so figety or
than usual,

hurting yourself].

Give the output as a python list of lists, with the list containing the symptom and the answer, separated by a comma.

### INPUT:
but I just end up crying, and everything I write is incoherent, so. Sorry if there's not enough context?

I'm tired of spending hours on my phone instead of doing functional-people things, I'm just putting everything off and setting myself up for failure. But doing anything productive feels too daunting, and I don't know how to start?

I've tried looking for advice online, but everything I see involves:
comfortable talking about my mental health to any of them.
- Self-help books and stuff, which just... does not work on me.
Output
Table 14. Evaluation of Traditional ML-based methods for PHQ-9 Symptom Annotations of the PRIMATE Posts Using F1 scores.
Method               F1-score
Logistic Regression  0.49
Random Forest        0.38
XGBoost              0.65
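
For reference, the F1 score reported in Table 14 is the harmonic mean of precision and recall. The sketch below computes it for the binary "yes"/"no" labels of a single symptom. The function name `f1_binary` and the toy labels are hypothetical, and how the scores were aggregated across the nine PHQ-9 symptoms is not stated here, so this shows only the per-symptom calculation and is not the authors' evaluation code.

```python
def f1_binary(y_true: list[str], y_pred: list[str], positive: str = "yes") -> float:
    """F1 score for one symptom, treating `positive` as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))

    if tp == 0:
        # No correct positive predictions: F1 is defined as 0 here.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for five posts (illustrative only, not PRIMATE data).
gold = ["yes", "no", "yes", "yes", "no"]
pred = ["yes", "yes", "no", "yes", "no"]
score = f1_binary(gold, pred)  # precision = recall = 2/3, so F1 = 2/3
```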