BACKGROUND: Anecdotal reports of psychosis emerging in the context of artificial intelligence (AI) chatbot use have appeared increasingly in the media. However, it remains unclear to what extent these cases represent the induction of new-onset psychosis versus the exacerbation of pre-existing psychopathology. We report a case of new-onset psychosis in the setting of AI chatbot use. CASE PRESENTATION: A 26-year-old woman with no previous history of psychosis or mania developed delusional beliefs about establishing communication with her deceased brother through an AI chatbot. This occurred in the setting of prescription stimulant use for the treatment of attention-deficit hyperactivity disorder (ADHD), recent sleep deprivation, and immersive use of the chatbot. Review of her chatlogs revealed that the chatbot validated, reinforced, and encouraged her delusional thinking, with reassurances that "You’re not crazy." Following hospitalization and antipsychotic medication for agitated psychosis, her delusional beliefs resolved. However, three months later, her psychosis recurred after she stopped antipsychotic therapy, restarted prescription stimulants, and continued immersive use of AI chatbots, and she required brief rehospitalization. CONCLUSION: This case provides evidence that new-onset psychosis in the form of delusional thinking can emerge in the setting of immersive AI chatbot use. Although multiple pre-existing risk factors may be associated with psychosis proneness, the sycophancy of AI chatbots together with AI chatbot immersion and deification on the part of users may represent particular red flags for the emergence of AI-associated psychosis.
@article{pierre2025you,title={“{You're} {Not} {Crazy}”: {A} {Case} of {New-onset} {AI}-associated {Psychosis}},author={Pierre, J. M. and Gaeta, B. and Raghavan, G. and Sarma, K. V.},journal={Innovations in Clinical Neuroscience},volume={22},number={10-12},pages={11},year={2025},}
Background and methods: The authors sought to evaluate the performance of common large language models (LLMs) in psychiatric diagnosis, and the impact of integrating expert-derived reasoning on their performance. Clinical case vignettes and associated diagnoses were retrieved from the DSM-5-TR Clinical Cases book. Diagnostic decision trees were retrieved from the DSM-5-TR Handbook of Differential Diagnosis and refined for LLM use. Three LLMs were prompted to provide diagnosis candidates for the vignettes either by direct prompting or by using the decision trees. These candidates and diagnostic categories were compared against the correct diagnoses. The positive predictive value (PPV), sensitivity, and F1 statistic were used to measure performance. Results: When directly prompted to predict diagnoses, the best LLM by F1 statistic (gpt-4o) had sensitivity of 76.7% and PPV of 40.4%. When making use of the refined decision trees, PPV was significantly increased (65.3%) without a significant reduction in sensitivity (70.9%). Across all experiments, the use of the decision trees statistically significantly increased the PPV, significantly increased the F1 statistic in 5/6 experiments, and significantly reduced sensitivity in 4/6 experiments. Discussion: When used to predict psychiatric diagnoses from case vignettes, direct prompting of the LLMs yielded the most true positive diagnoses but produced significant overdiagnosis. Integrating expert-derived reasoning improved performance by suppressing overdiagnosis with a smaller negative impact on sensitivity. This suggests that clinical expert reasoning could improve LLM-based behavioral health tools.
@article{sarma_integrating_2026,title={Integrating expert knowledge into large language models improves performance for psychiatric reasoning and diagnosis},volume={355},issn={0165-1781},doi={10.1016/j.psychres.2025.116844},journal={Psychiatry Research},author={Sarma, K. V. and Hanss, K. E. and Halls, A. J. M. and Krystal, A. and Becker, D. F. and Glowinski, A. L. and Butte, A. J.},month=jan,year={2026},pages={116844},}
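The abstract above reports PPV and sensitivity alongside the F1 statistic. As a rough illustration of how the three relate, the sketch below computes F1 as the harmonic mean of PPV and sensitivity, using the gpt-4o figures quoted in the abstract. The function name is hypothetical, and the paper may aggregate F1 differently (for example, per vignette or per diagnostic category), so these values are illustrative only and are not the paper's reported F1 scores.

```python
def f1_from_ppv_sensitivity(ppv: float, sensitivity: float) -> float:
    """Return the harmonic mean of PPV (precision) and sensitivity (recall)."""
    if ppv + sensitivity == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# Figures quoted in the abstract for gpt-4o (assumed here to be pooled values).
direct = f1_from_ppv_sensitivity(ppv=0.404, sensitivity=0.767)
with_trees = f1_from_ppv_sensitivity(ppv=0.653, sensitivity=0.709)

print(f"direct prompting: F1 = {direct:.3f}")
print(f"decision trees:   F1 = {with_trees:.3f}")
```

Under these assumptions, the gain in PPV outweighs the small drop in sensitivity, so the harmonic mean rises when the decision trees are used.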
Background: Large language models (LLMs), such as OpenAI’s GPT-3.5, GPT-4, and GPT-4o, have garnered early and significant enthusiasm for their potential applications within mental health, ranging from documentation support to chat-bot therapy. Understanding the accuracy and reliability of the psychiatric "knowledge" stored within the parameters of these models and developing measures of confidence in their responses (ie, the likelihood that an LLM response is accurate) are crucial for the safe and effective integration of these tools into mental health settings. Objective: This study aimed to assess the accuracy, reliability, and predictors of accuracy of GPT-3.5 (175 billion parameters), GPT-4 (approximately 1.8 trillion parameters), and GPT-4o (an optimized version of GPT-4 with unknown parameters) with standardized psychiatry multiple-choice questions (MCQs). Methods: A cross-sectional study was conducted where 3 commonly available, commercial LLMs (GPT-3.5, GPT-4, and GPT-4o) were tested for their ability to provide answers to single-answer MCQs (N=150) extracted from the Psychiatry Test Preparation and Review Manual. Each model generated answers to every MCQ 10 times. We evaluated the accuracy and reliability of the answers and sought predictors of answer accuracy. Our primary outcome was the proportion of questions answered correctly by each LLM (accuracy). Secondary measures were (1) response consistency to MCQs across 10 trials (reliability), (2) the correlation between MCQ answer accuracy and response consistency, and (3) the correlation between MCQ answer accuracy and model self-reported confidence. Results: On the first attempt, GPT-3.5 answered 58.0% (87/150) of MCQs correctly, while GPT-4 and GPT-4o answered 84.0% (126/150) and 87.3% (131/150) correctly, respectively. GPT-4 and GPT-4o showed no difference in performance (P=.51), but they significantly outperformed GPT-3.5 (P<.001). 
GPT-3.5 exhibited less response consistency on average compared to the other models (P<.001). MCQ response consistency was positively correlated with MCQ accuracy across all models (r=0.340, 0.682, and 0.590 for GPT-3.5, GPT-4, and GPT-4o, respectively; all P<.001), whereas model self-reported confidence showed no correlation with accuracy in the models, except for GPT-3.5, where self-reported confidence was weakly inversely correlated with accuracy (P<.001). Conclusions: To our knowledge, this is the first comprehensive evaluation of the general psychiatric knowledge encoded in commercially available LLMs and the first study to assess their reliability and identify predictors of response accuracy within medical domains. The findings suggest that GPT-4 and GPT-4o encode accurate and reliable general psychiatric knowledge and that methods, such as repeated prompting, may provide a measure of LLM response confidence. This work supports the potential of LLMs in mental health settings and motivates further research to assess their performance in more open-ended clinical contexts.
@article{hanss2025jmir,author={Hanss, K. E. and Sarma, K. V. and Glowinski, A. L. and Krystal, A. and Saunders, R. and Halls, A. J. M. and Gorrell, S. and Reilly, E.},title={Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study},journal={Journal of Medical Internet Research},volume={27},number={1},pages={e69910},year={2025},month=may,doi={10.2196/69910},}
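The abstract above suggests that agreement across repeated prompts may serve as a measure of response confidence. A minimal sketch of one way to compute such a score follows, assuming consistency is defined as the fraction of trials that match the most common answer. The abstract does not state the exact definition used in the study, and the function name and trial data are hypothetical.

```python
from collections import Counter


def response_consistency(answers):
    """Return (modal_answer, fraction of trials agreeing with it)."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    modal_answer, modal_count = counts.most_common(1)[0]
    return modal_answer, modal_count / len(answers)


# Hypothetical answers from 10 repeated prompts of one multiple-choice question.
trials = ["B", "B", "C", "B", "B", "B", "A", "B", "B", "B"]
answer, consistency = response_consistency(trials)
print(answer, consistency)
```

A low score on a question would flag the modal answer as less trustworthy, which is the kind of signal the abstract describes as correlating with accuracy.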
Introduction: Clinical documentation is an essential component of the provision of medical care, enabling continuity of information across provider and site handoffs. This is particularly important in the combat casualty care setting, where a single casualty may be treated by four or more completely disparate teams across the roles of care. The Battlefield Assisted Trauma Distributed Observation Kit (BATDOK) is a digital battlefield clinical documentation system developed by the Air Force Research Laboratory to address this need. To support the deployment of this tool, we integrated BATDOK into a commercially available virtual reality (VR) medical simulation platform used by the U.S. Air Force and Defense Health Agency personnel in order to provide an immersive simulation training experience which included battlefield documentation. Methods: A multidisciplinary team consisting of medical educators, VR simulation engineers, emergency physicians and pararescuemen, and BATDOK developers first developed a specification for a virtual BATDOK capability, including a detailed listing of learning objectives, critical interfaces and task plans, and sensor integrations. These specifications were then implemented into the commercially available Virtual Advancement of Learning for Operational Readiness VR Medical Simulation System and underwent developmental testing and evaluation during pararescueman training exercises at the Air Force Special Operations Command Special Operations Center for Medical Integration and Development. Results and Conclusions: The BATDOK capability was successfully implemented within the VR Medical Simulation System. The capability consisted of a virtual tablet with replicated interfaces and capabilities based on the developed specifications.
These capabilities included integrated point-of-care ultrasound capability, multi-patient management, vital sign monitoring with sensor pairing and continuous monitoring, mechanism of injury documentation (including injury pattern documentation), intervention logging (including tourniquets, dressing, airways, lines, tubes and drains, splints, fluids, and medications), and event logging. The capability was found to be operational and in alignment with learning objectives and user acceptance goals.
@article{sarma_integrating_2023,title={Integrating {Battlefield} {Documentation} into {Virtual} {Reality} {Medical} {Simulation} {Training}: {Virtual} {Battlefield} {Assisted} {Trauma} {Distributed} {Observation} {Kit} ({BATDOK})},volume={188},copyright={All rights reserved},issn={0026-4075},shorttitle={Integrating {Battlefield} {Documentation} into {Virtual} {Reality} {Medical} {Simulation} {Training}},url={https://doi.org/10.1093/milmed/usad051},doi={10.1093/milmed/usad051},number={Supplement\_6},urldate={2024-11-18},journal={Military Medicine},author={Sarma, K. V. and Barrie, M. G. and Dorsch, J. R. and Andre, T. W. and Polson, J. S. and Ribeira, R. J. and Andre, T. B. and Ribeira, R. J.},month=nov,year={2023},pages={110--115},}
Sarma, K. V., Harmon, S., Sanford, T., Roth, H. R., Xu, Z., Tetreault, J., Xu, D., Flores, M. G., Raman, A. G., Kulkarni, R., Wood, B. J., Choyke, P. L., Priester, A. M., Marks, L. S., Raman, S. S., Enzmann, D., Turkbey, B., Speier, W., and Arnold, C. W.
Journal of the American Medical Informatics Association 2021
Objective: To demonstrate that federated learning (FL) enables multi-institutional training without centralizing or sharing the underlying physical data. Materials and Methods: Deep learning models were trained at each participating institution using local clinical data, and an additional model was trained using FL across all of the institutions. Results: We found that the FL model exhibited superior performance and generalizability to the models trained at single institutions, with an overall performance level that was significantly better than that of any of the institutional models alone when evaluated on held-out test sets from each institution and an outside challenge dataset. Discussion: The power of FL was successfully demonstrated across 3 academic institutions while avoiding the privacy risk associated with the transfer and pooling of patient data. Conclusion: Federated learning is an effective methodology that merits further study to enable accelerated development of models across institutions, enabling greater generalizability in clinical use.
@article{Sarma2021,author={Sarma, K. V. and Harmon, S. and Sanford, T. and Roth, H. R. and Xu, Z. and Tetreault, J. and Xu, D. and Flores, M. G. and Raman, A. G. and Kulkarni, R. and Wood, B. J. and Choyke, P. L. and Priester, A. M. and Marks, L. S. and Raman, S. S. and Enzmann, D. and Turkbey, B. and Speier, W. and Arnold, C. W.},doi={10.1093/jamia/ocaa341},issn={1067-5027},journal={Journal of the American Medical Informatics Association},keywords={deep learning,federated learning,generalizability,privacy,prostate},month=feb,publisher={Oxford University Press (OUP)},title={{Federated learning improves site performance in multicenter deep learning without data sharing}},url={https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocaa341/6127556},year={2021},}
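The abstract above describes training one model across institutions without pooling patient data. The sketch below shows one common aggregation step, a sample-size-weighted average of per-site model parameters (often called federated averaging). This is an assumption for illustration: the abstract does not state which aggregation scheme the study used, and the function name, parameter vectors, and site sizes here are hypothetical.

```python
def federated_average(site_params, site_sizes):
    """Average per-site parameter vectors, weighted by local sample count.

    Only parameters and counts are exchanged; raw patient data stays on site.
    """
    if not site_params or len(site_params) != len(site_sizes):
        raise ValueError("need one parameter vector and one size per site")
    total = sum(site_sizes)
    n_params = len(site_params[0])
    return [
        sum(params[i] * size for params, size in zip(site_params, site_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical two-parameter models from three institutions.
site_params = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
site_sizes = [100, 100, 200]  # hypothetical local training-set sizes
global_params = federated_average(site_params, site_sizes)
print(global_params)
```

In a full training loop, the averaged parameters would be sent back to each site for another round of local training, and the cycle would repeat.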
I maintain a limited advisory practice for companies and counsel working at the intersection of artificial intelligence, clinical care, and behavioral health.