Can Large Language Model-based AI Reason about Behavioral Health? Preliminary Evaluation of a Decision Tree-Based LLM Algorithm for Psychiatric Case Diagnosis

Abstract

Background: Large Language Models (LLMs) have recently garnered enthusiasm for their potential use in mental health care. However, the promise of LLMs depends on the quality of the knowledge and reasoning capability encoded within the models and on the specific prompting approach used for any inquiry. Early work has suggested that combining LLMs with expert knowledge may enable effective automated reasoning. Here, we perform a preliminary evaluation of a decision tree-based approach to prompting LLMs, drawn from structured diagnostic pathways, to provide DSM diagnoses for standardized psychiatric scenarios and for real-world psychiatric intake assessment notes.

Methods: Twenty-eight full-text case diagnosis vignettes and their associated author-designated diagnoses were retrieved from the DSM-5-TR Clinical Cases book using stratified random sampling without replacement. Stratification was performed by diagnostic group chapter, excluding chapters 11, 14, 18, and 19 because these DSM categories were not represented in the structured diagnostic pathways. Twenty-one randomly selected de-identified outpatient psychiatric intake assessment notes were retrieved from the UCSF Information Commons electronic medical record database. The history of present illness (HPI) and diagnoses were extracted from each note, with manual censoring of all DSM diagnoses within the HPI. Twenty-eight decision trees were extracted from the DSM-5-TR Handbook of Differential Diagnoses and implemented as gpt-4-turbo LLM prompts, each as an iterated series of yes/no prompts leading to a diagnosis node that contained zero, one, or two diagnoses. Every case was processed through all 28 trees to create a list of candidate diagnoses, which the LLM then refined pairwise by determining whether one diagnosis was more appropriate than the other or both were equally appropriate. Once a final diagnosis list was obtained, the positive predictive value (PPV) and true positive rate (TPR) were calculated on a per-case basis and averaged across all vignettes and HPIs.
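The pipeline described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class and function names (`Leaf`, `Question`, `walk_tree`, `candidate_diagnoses`, `refine_pairwise`) are hypothetical, the two toy trees are invented for illustration and are not taken from the Handbook, and the LLM calls are replaced by plain callables (`keyword_ask` and a lambda). The abstract does not state how pairwise verdicts are turned into a final list; the sketch assumes a diagnosis is dropped whenever another is judged more appropriate.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable, List, Union


@dataclass
class Leaf:
    """Diagnosis node holding zero, one, or two diagnoses."""
    diagnoses: List[str] = field(default_factory=list)


@dataclass
class Question:
    """Yes/no decision node; each branch leads to another node."""
    text: str
    yes: Union["Question", Leaf]
    no: Union["Question", Leaf]


# Hypothetical callables standing in for the LLM prompts.
AskFn = Callable[[str, str], bool]          # (case_text, question) -> True for "yes"
CompareFn = Callable[[str, str, str], str]  # -> "first", "second", or "equal"


def walk_tree(root: Union[Question, Leaf], case_text: str, ask: AskFn) -> List[str]:
    """Follow yes/no answers from the root until a diagnosis node is reached."""
    node = root
    while isinstance(node, Question):
        node = node.yes if ask(case_text, node.text) else node.no
    return list(node.diagnoses)


def candidate_diagnoses(trees, case_text: str, ask: AskFn) -> List[str]:
    """Run the case through every tree and pool the unique diagnoses in order."""
    candidates: List[str] = []
    for tree in trees:
        for dx in walk_tree(tree, case_text, ask):
            if dx not in candidates:
                candidates.append(dx)
    return candidates


def refine_pairwise(candidates: List[str], case_text: str, compare: CompareFn) -> List[str]:
    """Assumed rule: drop a diagnosis if any other is judged more appropriate."""
    dropped = set()
    for first, second in combinations(candidates, 2):
        verdict = compare(case_text, first, second)
        if verdict == "first":
            dropped.add(second)
        elif verdict == "second":
            dropped.add(first)
    return [dx for dx in candidates if dx not in dropped]


def keyword_ask(case_text: str, question: str) -> bool:
    """Toy stand-in for an LLM yes/no prompt: substring match only."""
    return question.lower() in case_text.lower()


# Invented toy trees, for illustration only (not clinical content).
mood_tree = Question(
    "low mood",
    yes=Question("mania",
                 yes=Leaf(["bipolar I disorder"]),
                 no=Leaf(["major depressive disorder"])),
    no=Leaf(),
)
worry_tree = Question("worry", yes=Leaf(["generalized anxiety disorder"]), no=Leaf())

case = "Patient reports low mood and constant worry."
candidates = candidate_diagnoses([mood_tree, worry_tree], case, keyword_ask)
final = refine_pairwise(candidates, case, lambda c, a, b: "equal")
print(final)
```

In a real system, `ask` and `compare` would each wrap a model call and parse its reply. Keeping them as injected callables lets the traversal and refinement logic be tested without a model.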

Results: All inputs were successfully processed by the LLM-based decision tree model. The case vignettes had a mean of 1.5 author-designated diagnoses per vignette, and the model predicted a mean of 1.7 diagnoses. The real-world intakes had a mean of 2.1 author-designated diagnoses per note, and the model predicted a mean of 2.4 diagnoses. For the diagnosis of case vignettes, the model had a mean positive predictive value [PPV, true positives / (true positives + false positives)] of 78.6% (SD 37.1%) and a mean true positive rate [TPR, true positives / (true positives + false negatives)] of 76.2% (SD 36.7%). For the diagnosis of real-world intake HPIs, the mean PPV was 61.5% (SD 41.0%), and the mean TPR was 57.0% (SD 39.2%).
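The per-case PPV and TPR defined above, and their averaging across cases, can be sketched in Python as follows. This is an illustrative sketch, not the study's scoring code. The function names are hypothetical, diagnoses are matched by exact string equality (the abstract does not say how matches were adjudicated), a case with no predicted diagnoses is assumed to score a PPV of 0, and the example data are invented rather than drawn from the study.

```python
from statistics import mean
from typing import List, Set, Tuple


def case_ppv_tpr(predicted: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Return (PPV, TPR) for one case from predicted and reference diagnosis sets."""
    tp = len(predicted & reference)   # true positives
    fp = len(predicted - reference)   # false positives
    fn = len(reference - predicted)   # false negatives
    # Assumption: an empty denominator scores 0 rather than being excluded.
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return ppv, tpr


def mean_ppv_tpr(cases: List[Tuple[Set[str], Set[str]]]) -> Tuple[float, float]:
    """Average the per-case PPV and TPR across all cases."""
    scores = [case_ppv_tpr(predicted, reference) for predicted, reference in cases]
    return mean(s[0] for s in scores), mean(s[1] for s in scores)


# Invented example: (predicted diagnoses, reference diagnoses) per case.
toy_cases = [
    ({"A", "B"}, {"A"}),   # one extra prediction: PPV 0.5, TPR 1.0
    ({"C"}, {"C", "D"}),   # one missed diagnosis: PPV 1.0, TPR 0.5
]
mean_ppv, mean_tpr = mean_ppv_tpr(toy_cases)
print(mean_ppv, mean_tpr)
```

Because the metrics are averaged per case rather than pooled over all diagnoses, a case with a single wrong prediction contributes a PPV of 0. With only one or two diagnoses per case, per-case scores cluster at 0%, 50%, and 100%, which is consistent with the large standard deviations reported.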

Conclusions: The LLM-based decision tree model exhibited high performance in the diagnosis of case vignettes and moderate performance in the diagnosis of real-world intake HPIs. There was significant inter-case performance variability. We attribute the performance limitations to the diagnostic uncertainty inherent within the field of psychiatry (as demonstrated by the DSM field trials). Real-world intake assessment is further limited by the reality that insufficient information often exists at a first visit to confirm a diagnosis. There may also be psychiatric domain-specific limitations of the model’s performance that could be detected by a higher-powered study. Given these challenges, we see the positive LLM performance on this task as a signal that the combination of structured clinical knowledge with LLM technology could enable high-quality automated psychiatric reasoning capabilities, and further study is indicated. Such future efforts might better focus on the ability of LLMs to evaluate symptoms and behaviors, rather than diagnoses, and to make appropriate psychotherapeutic recommendations based on these findings. Such a model could be deployed as a clinical decision support system to support psychotropic prescribing by primary care providers and allied health professionals, enhancing access to appropriate mental health interventions.

BibTeX

@inproceedings{sarma2024acnp,
  author = {Sarma, K. V. and Hanss, K. E. and Glowinski, A. L. and Krystal, A. and Halls, A. J. M. and Butte, A. J.},
  title = {Can Large Language Model-based AI Reason about Behavioral Health? Preliminary Evaluation of a Decision Tree-Based LLM Algorithm for Psychiatric Case Diagnosis},
  booktitle = {ACNP 63rd Annual Meeting: Poster Abstracts P1-P304},
  journal = {Neuropsychopharmacology},
  volume = {49},
  number = {1},
  pages = {65--235},
  year = {2024},
  doi = {10.1038/s41386-024-02011-0},
  note = {ACNP Abstract P44}
}
