A Pilot Evaluation of Prompt Engineering for Autonomous Agents in Immersive Medical Simulation

Abstract

Background: Simulation-based education often relies on live facilitators to portray patients and contextual roles, which limits scalability and standardization. Large language models (LLMs) could automate character dialogue and behavior, but unconstrained use may result in hallucinations, role inconsistency, or deviation from learning objectives. Current simulation studies lack strategies for reliably integrating generative AI without compromising scenario fidelity, or for operationalizing these tools in educational settings. This study examined whether prompt engineering, informed by scenario design experts, could constrain LLM behavior to produce aligned, controlled interactions with little to no facilitator input. We hypothesized that these constraints would yield adherence sufficient to maintain learner immersion while lowering facilitator load.

Methods: We evaluated LLM agents embedded in existing virtual reality (VR) medical simulation scenarios on a commercial platform. GPT-4o agents were assigned to roles such as patients, patients' relatives, and clinical personnel. Each agent was configured with a persona prompt and bound to scenario-specific scaffolds that limited its dialogue and action options to predetermined responses. This structure aimed to preserve alignment with the simulation’s clinical and educational goals. Initial simulated transcripts were reviewed to assess adherence to prompt constraints and accuracy relative to facilitator-chosen responses. A comparative usability study measured learner immersion and facilitator workload across sessions using either automated agents or traditional facilitator-driven interactions. Both qualitative and quantitative data were collected to evaluate system performance and user experience.
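
The scaffold idea in the Methods can be illustrated with a minimal Python sketch. It is not the study's implementation: the `Scaffold` class, the `build_prompt` and `constrained_reply` functions, the option ids, and the sample lines are all hypothetical, and the model is represented by any callable that maps a prompt string to a reply string, so no real LLM API is assumed. The sketch shows one way a persona prompt plus a fixed option list could keep an agent's output within predetermined responses, with a fallback when the model replies off-script.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Scaffold:
    """Hypothetical scaffold: the only responses allowed in one scenario state."""
    state: str
    options: Dict[str, str]  # option id -> predetermined line or action
    fallback: str            # option id used when the model's reply is invalid

    def __post_init__(self) -> None:
        # The fallback must itself be one of the predetermined options.
        if self.fallback not in self.options:
            raise ValueError("fallback must be a key of options")


def build_prompt(persona: str, scaffold: Scaffold, learner_utterance: str) -> str:
    """Combine the persona, the scenario state, and the allowed option list."""
    menu = "\n".join(f"{key}: {text}" for key, text in scaffold.options.items())
    return (
        f"{persona}\n"
        f"Scenario state: {scaffold.state}\n"
        f"Learner said: {learner_utterance}\n"
        "Reply with exactly one option id from this list and nothing else:\n"
        f"{menu}"
    )


def constrained_reply(
    model: Callable[[str], str],
    persona: str,
    scaffold: Scaffold,
    learner_utterance: str,
) -> str:
    """Return a predetermined response; never pass free-form model text through."""
    raw = model(build_prompt(persona, scaffold, learner_utterance)).strip()
    choice = raw if raw in scaffold.options else scaffold.fallback
    return scaffold.options[choice]


if __name__ == "__main__":
    # Illustrative persona and options only; not taken from the study.
    persona = "You are the patient's worried relative. Stay in role."
    scaffold = Scaffold(
        state="triage",
        options={
            "ask_status": "Is she going to be okay?",
            "give_history": "She has had chest pain since this morning.",
        },
        fallback="ask_status",
    )
    # A stub stands in for the LLM call.
    print(constrained_reply(lambda prompt: "give_history", persona, scaffold, "Tell me what happened."))
```

Because the function returns text only from `options`, a reply outside the list cannot reach the learner; the trade-off of this design is that any flexibility has to be authored into the option list itself.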

Results: During initial validation, agents produced dialogue consistent with their assigned roles and adhered to scenario logic. Transcript review found no evidence of hallucinated content, off-role responses, or narrative inconsistency. In the usability phase, learners who interacted with LLM-driven agents reported immersion and realism ratings comparable to those in facilitator-led sessions. Facilitators noted reduced need for live moderation and lower cognitive load. Observers and learners provided informal feedback indicating that simulations retained flow and effectiveness even when facilitated by prompt-constrained agents. These results suggest that well-structured prompts can guide LLMs to deliver consistent, high-fidelity interactions suitable for training or assessment use cases.

Conclusions: This study demonstrates that prompt engineering is an effective strategy for constraining LLM behavior in medical simulation. Embedding role-specific guidance and limiting output options allowed agents to operate autonomously while maintaining alignment with scenario objectives. These findings support the use of LLM agents in simulations where facilitator time is limited or standardization is required. Prompt-constrained agents may enable repeatable training modules, asynchronous learning experiences, and novel assessment formats. Future work should evaluate educational impact, explore hybrid scripted-generative designs, and refine prompt development techniques. Structured LLM integration offers a practical pathway to scalable, consistent, and low-burden simulation experiences.

BibTeX

@inproceedings{polson2026imsh_prompt,
  author = {Polson, J. S. and Sarma, K. V.},
  title = {A Pilot Evaluation of Prompt Engineering for Autonomous Agents in Immersive Medical Simulation},
  booktitle = {International Meeting on Simulation in Healthcare (IMSH)},
  year = {2026},
}
