Improving the Performance of LLM-Based Semi-Automated Psychiatric Case Diagnosis using Decision Tree-Based Prompting

Abstract

Introduction: The rapid advancement of artificial intelligence (AI)-based technologies over the last decade has led to dramatic innovation in healthcare technology. Recent advances in generative AI have led to the development of modern large language models (LLMs), such as OpenAI’s ChatGPT. The promise of LLMs for text processing, information retrieval, and reasoning, however, is predicated on the quality of the knowledge encoded within the models by the training process. Because these models are trained on large-scale corpora collected from publicly available written and internet literature, their applicability to the specialized tasks found in the practice of psychiatry and behavioral health may be limited. Further, LLMs are highly sensitive to the specific prompting approach used for any inquiry, motivating investigation into the most effective approaches for mental health. Here, we demonstrate a decision tree-based approach to prompting LLMs to provide semi-automated case diagnoses for standardized psychiatric scenarios, drawing from previously developed structured diagnostic pathways within psychiatry.

Methods: For initial test prompting, validation, and refinement, 10 full-text diagnosis vignettes and associated diagnoses were randomly retrieved from the DSM-5-TR Clinical Cases manual. For evaluation, an additional 28 full-text case diagnosis vignettes and associated diagnoses were retrieved using stratified random sampling without replacement, stratified by chapter heading (corresponding to the DSM-5-TR diagnostic category), excluding chapters 11, 14, 18, and 19 as these categories were not represented in the structured diagnostic pathway handbook. The gpt-4-turbo (gpt-4-0125-preview) model was used via the Azure OpenAI API with a temperature of 0, top_p of 1, and a response format of json_object, using zero-shot prompting exclusively. To evaluate baseline performance, a basic prompt requested a list of applicable DSM-5-TR diagnoses. A decision tree (DT)-based prompting model was then implemented based on the DSM-5-TR Handbook of Differential Diagnosis: 28 decision trees (each pertaining to a specific symptom or mental status exam abnormality) were extracted and each implemented as a series of yes/no prompts, with an initial screening prompt to determine applicability and iterative questions leading to a diagnosis, no diagnosis, or the next question. The final diagnosis list for each vignette was the union of diagnoses from all applicable trees. Candidate diagnoses were then simplified (removing specifiers, modifiers, and specific substances; consolidating neurocognitive disorders; combining specified/unspecified diagnoses; and applying Boolean reduction) and, in the post-processing condition, compared pairwise to eliminate diagnoses better explained by another. For each vignette, precision and recall were calculated against the case diagnosis list, averaged across vignettes, and compared between experiments using the two-tailed paired t-test.
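
The decision tree traversal described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the `Tree` class, the node layout, and the `ask` callable (a stand-in for one yes/no prompt to the LLM) are hypothetical, and the sketch assumes that each tree ends in at most one diagnosis. The toy tree is invented for demonstration and is not taken from the DSM-5-TR Handbook of Differential Diagnosis.

```python
from typing import Callable, Dict, Iterable, Optional, Set, Union

# Hypothetical types: an outcome is the id of the next question (int),
# a diagnosis label (str), or None for "no diagnosis".
Outcome = Union[int, str, None]
Ask = Callable[[str, str], bool]  # (vignette, yes/no question) -> answer


class Tree:
    """One decision tree: a screening question plus yes/no question nodes."""

    def __init__(self, screening: str, nodes: Dict[int, dict], root: int = 0):
        self.screening = screening
        self.nodes = nodes
        self.root = root


def run_tree(tree: Tree, vignette: str, ask: Ask) -> Optional[str]:
    """Walk one tree and return a diagnosis label, or None."""
    # The screening prompt decides whether the tree applies at all.
    if not ask(vignette, tree.screening):
        return None
    outcome: Outcome = tree.root
    # Each answer leads to a diagnosis, no diagnosis, or the next question.
    while isinstance(outcome, int):
        node = tree.nodes[outcome]
        outcome = node["yes"] if ask(vignette, node["question"]) else node["no"]
    return outcome


def diagnose(trees: Iterable[Tree], vignette: str, ask: Ask) -> Set[str]:
    """Union of the diagnoses reached across all applicable trees."""
    found: Set[str] = set()
    for tree in trees:
        diagnosis = run_tree(tree, vignette, ask)
        if diagnosis is not None:
            found.add(diagnosis)
    return found


# Toy tree, invented for illustration only.
toy_tree = Tree(
    screening="Does the vignette describe depressed mood?",
    nodes={
        0: {"question": "Is the mood change attributable to a substance?",
            "yes": "Substance/medication-induced depressive disorder",
            "no": 1},
        1: {"question": "Has the depressed mood lasted at least two weeks?",
            "yes": "Major depressive disorder",
            "no": None},
    },
)

# Canned answers replace the LLM call so the sketch runs offline.
canned = {
    toy_tree.screening: True,
    "Is the mood change attributable to a substance?": False,
    "Has the depressed mood lasted at least two weeks?": True,
}


def stub_ask(vignette: str, question: str) -> bool:
    return canned[question]


result = diagnose([toy_tree], "toy vignette text", stub_ask)
print(result)
```

In practice `ask` would wrap a chat-completion request configured as in the Methods (temperature 0, JSON output) and parse the yes/no answer from the response. That wrapper is omitted here because its prompt wording is not given in the text.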

Results: Use of the DT-based approach was associated with higher recall (without reaching statistical significance), and use of pairwise post-processing was associated with significantly higher precision. Basic prompting without post-processing yielded recall 69.6% and precision 39.1%; basic prompting with post-processing yielded recall 66.0% (p=.33) and precision 53.9% (p<.01); decision tree prompting without post-processing yielded recall 80.4% (p=.23) and precision 35.0% (p=.50); and decision tree prompting with post-processing yielded recall 78.6% (p=.25) and precision 65.2% (p<.01).
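
As a rough illustration of how per-vignette precision and recall, their averages, and a paired comparison could be computed, the following sketch uses only the Python standard library. The function names are hypothetical, and scoring an empty prediction list as precision 0 is an assumption. It returns only the paired t statistic; a two-tailed p-value would additionally need the t distribution, for example via `scipy.stats.ttest_rel`.

```python
from math import sqrt
from statistics import mean, stdev


def precision_recall(predicted, reference):
    """Set-based precision and recall for one vignette."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    # Assumption: an empty set scores 0.0 rather than being undefined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


def average_scores(pairs):
    """Mean precision and mean recall over (predicted, reference) pairs."""
    scores = [precision_recall(p, r) for p, r in pairs]
    return mean([s[0] for s in scores]), mean([s[1] for s in scores])


def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (same vignettes, same order).

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Toy labels, invented for illustration only.
example = precision_recall({"A", "B", "C"}, {"A", "D"})
averaged = average_scores([({"A", "B"}, {"A"}), ({"C"}, {"C", "D"})])
t_value = paired_t_statistic([3, 5, 7], [2, 3, 4])
```

The paired statistic is computed on per-vignette scores from two experiments run on the same vignettes, which matches the paired design described in the Methods.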

Discussion and Conclusions: We found that the tested LLM showed promise in performing case diagnosis using the provided standardized vignettes. In all cases, the model performed better on recall than on precision, tending more towards overdiagnosis than underdiagnosis. The decision tree approach appeared to improve recall, but the improvement did not reach statistical significance in this low-powered study. The use of pairwise post-processing prompts significantly increased precision in both the basic and DT prompting approaches. Overall, our results suggest that LLMs have promise in detecting and reasoning about psychiatric symptoms and diagnoses. To our knowledge, this is the first attempt to make diagnoses automatically from case vignettes, so no prior comparator is available. Further work is required to evaluate the performance of LLMs on real-world inputs, such as clinical notes or patient interviews, and to compare their diagnostic performance to other approaches or to human performance. Such further work could also make use of larger corpora to enable better calibration and statistical analysis.
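
The pairwise post-processing step can be sketched as follows, under stated assumptions. The function name, the order in which pairs are visited, and the rule that a diagnosis already eliminated cannot eliminate others are all hypothetical choices, since the text does not specify them. The `better_explained_by` callable stands in for one pairwise LLM prompt, and the toy rule below is invented for demonstration, not a clinical claim.

```python
from itertools import permutations
from typing import Callable, Iterable, List


def prune_pairwise(
    candidates: Iterable[str],
    better_explained_by: Callable[[str, str], bool],
) -> List[str]:
    """Drop each diagnosis that the judge says is better explained by another.

    better_explained_by(a, b) should return True when diagnosis `a`
    is better explained by diagnosis `b`.
    """
    kept = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    removed = set()
    for a, b in permutations(kept, 2):
        # Assumption: an eliminated diagnosis takes no further part.
        if a in removed or b in removed:
            continue
        if better_explained_by(a, b):
            removed.add(a)
    return [d for d in kept if d not in removed]


# Toy judge, invented for illustration only.
toy_rules = {
    ("Adjustment disorder", "Major depressive disorder"),
    ("Insomnia disorder", "Major depressive disorder"),
}


def toy_judge(a: str, b: str) -> bool:
    return (a, b) in toy_rules


pruned = prune_pairwise(
    ["Major depressive disorder", "Adjustment disorder", "Insomnia disorder"],
    toy_judge,
)
```

Because this step can only remove candidates, it can raise precision but never recall, which is consistent with the pattern reported in the Results.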

BibTeX

@inproceedings{sarma2024decision_tree,
  author = {Sarma, K. V. and Hanss, K. E. and Glowinski, A. L. and Butte, A. J. and Halls, A. J. M.},
  title = {Improving the Performance of LLM-Based Semi-Automated Psychiatric Case Diagnosis using Decision Tree-Based Prompting},
  booktitle = {American Medical Informatics Association Annual Meeting},
  year = {2024},
}
