Background: The last two years have seen the advent of large language models (LLMs), a form of artificial intelligence (AI) that has shown promise in natural language understanding. LLMs have particular potential in psychiatry, as language is fundamental to both diagnosis and treatment in behavioral health. Recent studies have demonstrated that the majority (78.4%) of patients are willing to use ChatGPT for self-diagnosis. Empirically, we have found that patients are increasingly using such tools to help understand their mental health symptoms and the causes of those symptoms. The prevalence of this use motivates a detailed exploration of these models’ performance in psychiatric diagnosis. Today, five major companies maintain state-of-the-art LLMs readily available to the general public. Here, we evaluate the capabilities of these models to make psychiatric diagnoses using standardized case vignettes.
Methods: Twenty-eight full-text case vignettes and their associated diagnoses were retrieved from the DSM-5-TR Clinical Cases book. The latest LLM from each of the five major companies was selected for evaluation: OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, Mistral’s Large 2, and Meta’s Llama 3.1 405B. A prompt was developed that instructed the models to list all DSM-5-TR diagnoses applicable to the vignette. To facilitate comparison, diagnoses were simplified by removing specifiers, modifiers, specific substances, and severity levels. For each model, the positive predictive value and sensitivity were calculated for every vignette from the predicted diagnoses and then averaged across vignettes for a final result. A two-factor ANOVA was used to determine whether performance differed significantly across the LLMs and across the vignettes.
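The per-vignette scoring and averaging described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the case-insensitive exact matching of diagnosis labels, the choice to score an empty prediction list as a PPV of 0, and the two example cases are all assumptions made for the sketch. Diagnoses are assumed to have already been simplified (specifiers and severity removed).

```python
from statistics import mean


def vignette_metrics(predicted, reference):
    """Return (ppv, sensitivity) for a single vignette.

    predicted: diagnoses listed by the model (already simplified)
    reference: author-designated diagnoses (already simplified)
    """
    # Assumption: labels match if equal after trimming and lowercasing.
    pred = {d.strip().lower() for d in predicted}
    ref = {d.strip().lower() for d in reference}
    true_pos = len(pred & ref)
    # Assumption: an empty prediction or reference set scores 0.
    ppv = true_pos / len(pred) if pred else 0.0
    sensitivity = true_pos / len(ref) if ref else 0.0
    return ppv, sensitivity


def model_metrics(cases):
    """Average per-vignette PPV and sensitivity over (predicted, reference) pairs."""
    per_vignette = [vignette_metrics(p, r) for p, r in cases]
    return (
        mean(ppv for ppv, _ in per_vignette),
        mean(sens for _, sens in per_vignette),
    )


# Hypothetical cases, for illustration only (not taken from the study).
example_cases = [
    (["Major depressive disorder", "Generalized anxiety disorder"],
     ["Major depressive disorder"]),
    (["Schizophrenia"],
     ["Schizophrenia", "Alcohol use disorder"]),
]
mean_ppv, mean_sensitivity = model_metrics(example_cases)
print(mean_ppv, mean_sensitivity)
```

In the first hypothetical case the extra predicted diagnosis lowers PPV while sensitivity stays at 1; in the second the missed diagnosis lowers sensitivity while PPV stays at 1.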
Results: The five LLMs exhibited a mean sensitivity (i.e., the proportion of author-designated diagnoses that were correctly predicted) between 71% and 75% and a mean positive predictive value (PPV, i.e., the proportion of predicted diagnoses that were correct) between 50% and 65%. The ANOVA found no significant differences between the models (p=0.48). Only seven vignettes generated identical predictions across all five models, two of which were entirely incorrect and four of which were entirely correct.
Conclusion: The state-of-the-art LLMs from the five major vendors all exhibited strong out-of-the-box diagnostic performance, demonstrating substantial inherent psychiatric reasoning capabilities without task-specific training. However, they also exhibited substantial overdiagnosis, producing an average of 0.5 to 1 incorrect diagnoses per correct diagnosis. This finding raises concern that patients using these models for self-diagnosis may be presented with excessive pathologization of their concerns, and it demonstrates the need for further refinement of these models before they can be used without expert clinician oversight.
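The overdiagnosis figure can be related to the reported PPV range with a short calculation. The sketch below assumes pooled counts, where the number of incorrect diagnoses per correct diagnosis equals (1 - PPV) / PPV; because the study averages PPV per vignette, this relationship holds only approximately for the reported means. The function name is hypothetical.

```python
def incorrect_per_correct(ppv: float) -> float:
    """Incorrect predictions per correct prediction, given a pooled PPV.

    With TP correct and FP incorrect predictions, PPV = TP / (TP + FP),
    so FP / TP = (1 - PPV) / PPV.
    """
    if not 0 < ppv <= 1:
        raise ValueError("ppv must be in the interval (0, 1]")
    return (1 - ppv) / ppv


# The endpoints of the reported PPV range (50% and 65%).
for ppv in (0.50, 0.65):
    print(f"PPV {ppv:.2f}: {incorrect_per_correct(ppv):.2f} incorrect per correct")
```

Under this assumption, a PPV of 50% corresponds to one incorrect diagnosis per correct diagnosis, and a PPV of 65% to roughly 0.54, in line with the range stated in the conclusion.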
@inproceedings{sarma2025apa_robo,
author = {Sarma, K. V. and Hanss, K. E. and Glowinski, A. L. and Krystal, A. and Halls, A. J. M. and Butte, A. J.},
title = {The Robo-Doctor is Always In: Assessing and Comparing the Psychiatric Diagnostic Capabilities of ChatGPT and other Large Language Models},
booktitle = {APA Annual Meeting},
year = {2025},
}