Background: Large language models (LLMs) are a class of artificial intelligence models that can interpret and generate written language. Popular, publicly available LLMs include OpenAI’s GPT3.5 and GPT4, the models behind the product commonly known as "ChatGPT." There is growing interest in using LLMs in clinical informatics and patient care. To be useful in medical domains, LLMs must have been trained on, and must encode, accurate, domain-relevant knowledge (e.g., how does one diagnose and treat a manic episode?). Several published studies in other medical specialties (e.g., medicine, OB/GYN) have evaluated the knowledge encoded in GPT3.5 and GPT4, but few efforts have focused on evaluating LLM-encoded knowledge in psychiatric domains.
Objectives: This study evaluates and characterizes the psychiatric knowledge of OpenAI’s GPT3.5 and GPT4 via the LLMs’ performance on a selection of standardized multiple-choice questions.
Methods: A total of 150 single-answer multiple-choice questions (MCQs) were extracted from a practice test in the Psychiatry Test Preparation and Review Manual. Each MCQ was prefaced with a standard prompt and input into GPT3.5 and GPT4, and answers were extracted from the LLMs’ responses. Accuracy was measured as the percentage of MCQs answered correctly. Binomial tests were used to assess variation in accuracy across question domains (alpha = 0.004 after Bonferroni correction). All incorrectly answered questions were analyzed qualitatively to identify recurring themes in the errors.
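The quantitative steps described in the Methods can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the abstract does not say what each binomial test was compared against, so the sketch assumes each domain's accuracy is tested against the overall accuracy. The domain names and tallies are hypothetical, and the figure of 12 tests is only a guess chosen because 0.05 / 12 rounds to the reported alpha of 0.004.

```python
from math import comb


def accuracy(n_correct, n_total):
    """Fraction of questions answered correctly."""
    return n_correct / n_total


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_test_two_sided(k, n, p):
    """Exact two-sided binomial test.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one under the null success rate p.
    """
    observed = binom_pmf(k, n, p)
    # A small relative tolerance guards against floating-point ties.
    total = sum(
        prob
        for prob in (binom_pmf(i, n, p) for i in range(n + 1))
        if prob <= observed * (1 + 1e-7)
    )
    return min(1.0, total)


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical per-domain tallies (correct, total); NOT the study's data.
domain_counts = {
    "mood disorders": (18, 20),
    "neuroscience": (9, 15),
    "psychopharmacology": (13, 15),
}

total_correct = sum(c for c, _ in domain_counts.values())
total_questions = sum(n for _, n in domain_counts.values())
overall = accuracy(total_correct, total_questions)

# Assumption: 12 tests, inferred only because 0.05 / 12 is about 0.004.
threshold = bonferroni_alpha(0.05, 12)

for domain, (correct, n) in domain_counts.items():
    p_value = binom_test_two_sided(correct, n, overall)
    flag = "significant" if p_value < threshold else "not significant"
    print(f"{domain}: accuracy={accuracy(correct, n):.2f}, p={p_value:.4f} ({flag})")
```

The exact test is written with the standard library only so the sketch runs without extra dependencies; a statistics package's binomial test could be substituted where one is available.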
Conclusion: GPT3.5 and GPT4 perform well on standardized psychiatry MCQs across question domains, indicating that they encode a broad body of psychiatric knowledge. Although the differences were not statistically significant, model performance appeared to degrade in more niche, technical domains. GPT4 outperformed its predecessor, suggesting that advances in general-purpose LLMs may translate to better performance in psychiatry-specific domains. LLMs may encode biases about mental health conditions that are abundant in the non-expert internet sources (such as Reddit or Twitter) on which the models were trained. LLMs may also be able to efficiently identify outdated or stigmatizing medical language.
@inproceedings{hanss2023grading_machine,
author = {Hanss, K. E.* and Sarma, K. V.* and Saunders, R. and Elkin, D.},
title = {Grading the Machine: Assessing ChatGPT’s Psychiatric Knowledge through Boards-Style Assessment},
booktitle = {American Psychiatric Association Annual Meeting},
year = {2024},
award = {Winner, 2024 APA Medical Student/Resident Poster Competition}
}