Pilot Evaluation of an LLM-based Virtual Patient for Immersive Medical Simulation

Abstract

Background: The recent uptick in accessibility of generative AI models has rapidly accelerated innovative applications; the natural language capabilities of large language models (LLMs) pose an exciting avenue of opportunity in medical simulation, particularly as conversation engines for autonomous, virtual patient encounters, enabling realistic, immersive simulation with dynamic responses while reducing facilitator load. Conversational agents built off LLMs have shown moderate performance in recall, reasoning, and believability, but there remains a dearth of data demonstrating their utility in medical simulation contexts. This study sought to determine the usability of a prototype conversational agent simulating a patient encounter. We hypothesized that persona conditioning and prompt engineering techniques could achieve benchmark levels of perceived clarity, context awareness, and overall conversational quality among healthcare practitioners.

Methods: A conversational agent was developed using a manually derived persona prompt with automated scenario context prompting using GPT-4 as the foundation model. The agent was given a prompt to play the role of a patient reporting to the emergency department for chest pain and given a structured set of prompts regarding presenting complaint, past medical history, drug history, family history, and social history. A total of 12 participants were recruited from four institutions to assess the usability of the conversational AI agent. The study entailed participants receiving a prompt indicating the overall medical simulation scenario, the role of the conversational AI agent, and instructions for interaction. The participants then used a web-based interface to interact with the agent according to the prompt. Following the scenario, participants completed an evaluation survey about their experience with the agent, answering questions with 5-point Likert scales.

Results: Participants included healthcare professionals across several disciplines, including nursing students, practicing nurses, physician assistants, and other clinicians with an average of 10 years of experience in the field. Conversations had a median length of 45 messages between the agent and the end-user; transcript analysis revealed the agent was able to accurately recall and report its present complaint, history of symptoms, and medical, social, and family history in 98% of chat queries. Responses reported highly positive perceptions of clarity (mean Likert 4.83), context awareness (mean Likert 4.67), and accuracy of responses (mean Likert 4.42), and moderately positive perceptions of the agent handling conversational ambiguity (mean Likert 4.25), providing appropriate amounts of information (mean Likert 4.08), and engaging in an ongoing conversation (mean Likert 4.25).

Conclusions: This research study demonstrates that the LLM approach can feasibly support scenario-appropriate simulated patient conversations. Post-hoc analysis of conversation transcripts revealed that the agents achieved high levels of linguistic consistency and believability, with a hallucination rate of under 2% for factually incorrect information based on the persona profile conditioning prompt. Surprisingly, users reported the lowest scores for information provided, prompting further study about the conciseness and delivery of the bot. Future areas of study could include evaluating the conversational bot for generating speech, refinement of the persona prompts, and a wider usability study across a broader healthcare practitioner population.

BibTeX

@inproceedings{polson2025imsh_vpatient,
  author = {Polson, J. S. and Sarma, K. V.},
  title = {Pilot Evaluation of an LLM-based Virtual Patient for Immersive Medical Simulation},
  booktitle = {International Meeting on Simulation in Healthcare (IMSH)},
  year = {2025},
}

copy bibtex