Can Artificial Intelligence Make the Diagnosis? Evaluating the Accuracy of Large Language Models in Diagnosing Child and Adolescent Psychiatry Clinical Cases

Abstract

Objectives: With a severe shortage of Child and Adolescent Psychiatry (CAP) providers, Large Language Models (LLMs) could serve as a novel tool to diagnose psychiatric illness, possibly with accuracy comparable to that of clinicians, and potentially free from undue bias. However, their effectiveness hinges on successfully engineering LLMs for clinical reasoning. Few studies have evaluated LLM performance in unstructured clinical reasoning. Here, we seek to evaluate the ability of LLMs to make diagnoses from unstructured CAP vignettes, and the effectiveness of prompt engineering in enhancing the model’s clinical reasoning.

Methods: Clinical cases (n = 22) were extracted from the DSM-5 Casebook and Treatment Guide for Child Mental Health. Cases were presented to the gpt-4-turbo LLM using two prompting schemes: (1) instructions to generate diagnoses for the case ("basic reasoning"); and (2) enhanced instructions with guidance to review proposed diagnoses and narrow to those most relevant ("enhanced reasoning"). Performance metrics were: % of cases with at least one correct answer, sensitivity (true positives / (true positives + false negatives)), and positive predictive value (PPV = true positives / (true positives + false positives)). The model was evaluated on its ability to report both a diagnosis from the correct DSM diagnostic category and the correct diagnosis itself.
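
As an illustration of the three metrics defined above, a minimal Python sketch follows. It assumes each case is represented as a pair of sets (the diagnoses proposed by the model and the reference diagnoses) and that true positives, false positives, and false negatives are pooled across all cases before the ratios are taken; the abstract does not state how counts were aggregated, so that pooling is an assumption. The function name `score_cases` and the toy labels are hypothetical and are not study data.

```python
import math


def score_cases(cases):
    """Score (predicted, reference) pairs of diagnosis sets.

    Assumption: counts are pooled over all cases (micro-averaged).
    Returns the fraction of cases with at least one correct answer,
    plus sensitivity and positive predictive value (PPV).
    """
    tp = fp = fn = 0
    cases_with_hit = 0
    n_cases = 0
    for predicted, reference in cases:
        predicted, reference = set(predicted), set(reference)
        matched = predicted & reference
        tp += len(matched)                # correct diagnoses proposed
        fp += len(predicted - reference)  # proposed but not in the reference
        fn += len(reference - predicted)  # in the reference but not proposed
        cases_with_hit += bool(matched)
        n_cases += 1

    def ratio(numerator, denominator):
        # Undefined ratios are reported as NaN rather than raising.
        return numerator / denominator if denominator else math.nan

    return {
        "any_correct": ratio(cases_with_hit, n_cases),
        "sensitivity": ratio(tp, tp + fn),
        "ppv": ratio(tp, tp + fp),
    }


# Toy, made-up cases for illustration only.
toy_cases = [
    ({"ADHD", "ODD"}, {"ADHD"}),  # one hit, one extra diagnosis
    ({"MDD"}, {"MDD", "GAD"}),    # one hit, one missed diagnosis
    ({"ASD"}, {"OCD"}),           # no hit
]
result = score_cases(toy_cases)
```

Under this framing, a prompt that proposes fewer diagnoses per case lowers the false positive count and so tends to raise PPV, at the risk of dropping some correct diagnoses and lowering sensitivity.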

Results: For category, basic reasoning yielded a correct result in 95% of cases, a sensitivity of 79%, and PPV of 52%. Enhanced reasoning yielded a correct result in 95% of cases, a sensitivity of 74%, and PPV of 67%. For diagnosis, basic reasoning yielded a correct result in 81% of cases, a sensitivity of 56% and PPV of 37%. Enhanced reasoning yielded a correct result in 73% of cases, a sensitivity of 48%, and PPV of 51%.

Conclusion: Our results show that LLMs have a promising capability to reason about unstructured CAP vignettes, with performance competitive with prior estimates of diagnostic accuracy (30-50%) among non-psychiatric physicians. Our enhanced prompting scheme substantially reduced overdiagnosis with a mild reduction in sensitivity, demonstrating the value of encoding clinical reasoning directly into model queries. Continued efforts to develop prompts and queries specialized for CAP may offer further avenues to improve the efficacy of LLMs within our field.

BibTeX

@inproceedings{hanss2024ai_diagnosis,
  author = {Hanss, K. E. and Sarma, K. V. and Halls, A. J. M. and Gorrell, S. and Reilly, E.},
  title = {Can Artificial Intelligence Make the Diagnosis? Evaluating the Accuracy of Large Language Models in Diagnosing Child and Adolescent Psychiatry Clinical Cases},
  booktitle = {American Academy of Child and Adolescent Psychiatry Annual Meeting},
  year = {2024},
}
