AI-Generated Research Experiences

Harnessing the Power of AI For Ecologically Valid Insight

Jeff Brodscholl, Ph.D.
Greymatter Behavioral Sciences

How do we get an accurate picture of what people think, feel, or do in the real world when their thoughts, feelings, and actions in that world are not available for direct observation?

This question captures one of the biggest sources of tension for anyone involved in applied behavioral research in industry. It has implications for how we assess not only what people do currently when they’re out in the world, but also what they may be inclined to do in the future, and even what they have done in the past. It also has implications for the way we go about uncovering the drivers of people’s behavior, where our insights depend on seeing how real-world behavior varies with its circumstances and capturing the relevant mental processes and experiences that are triggered by those circumstances.

Behind the worry is the well-worn fact that people’s reports about what they do and why often deviate substantially from what’s discovered once their behavior and thought processes are assessed in their natural habitats. Ideally, we’d circumvent this issue by simply conducting all our research in these naturalistic settings, yet there’s a reason why we don’t: it’s often cost-prohibitive, impractical, or unscalable to do so, or the related methods lack the ingredients to support the inferences we seek owing to other assessment challenges, insufficient data capture, and lack of control over the surrounding context. That leaves us with research conducted under more controllable, attainable, yet also artificial settings, raising questions about what we can do to bridge the gap between those settings and real life to make the results of these studies valid and projectable.

Enter GenAI. Now, to state the obvious, the insights community has hardly been lacking in efforts to bring AI into the world of research. To date, most of these efforts have focused on ways to use AI to extract additional insight from project-specific datasets, as well as to learn something new from pre-existing research data pools and secondary data streams. Others have attempted to replace human study participants with synthetic respondents, or have turned to large language models to deliver chatbot-based tools for generating literature reviews or bench research summaries.

Yet GenAI also offers a way to address the validity and projectability problem: delivering stimuli and tasks that are more likely to activate research respondents in a manner closer to the way they would be activated in the real world. This is about GenAI’s potential to contribute to the orchestration of the right, ecologically valid research experiences, where the definition of “right” depends on the criteria for creating study conditions with the psychological properties to be a proper analog to the real-world conditions for which they provide a stand-in. This potential application of AI doesn’t get as much attention as the uses above, yet I think it has enough promise for industry researchers to be worthy of further discussion.

I’ll use this post to discuss some of the deeper thinking behind why I find this use case particularly compelling. I’ll share some of the lessons I’ve learned thus far in my own exploration of GenAI for research task and stimulus development, focusing in particular on GenAI videos and chatbot-based simulators for use in the types of insights projects sometimes encountered in the life sciences industry. I’ll then pivot to what I see as a major advantage of these applications for encouraging methodological innovation, and discuss some of the practical issues I think are important to keep in mind if you’re going to bring these applications into the methodological toolkit.

The Ecological Validity Imperative

To give the argument some force, it’s worth stepping back to ask a question about a key principle we need to observe whenever we conduct behavioral research, whether in survey, interview, or other form. In short, what is it that we really need any research method to do if it’s going to turn our studies into engines for valid, generalizable insight?

The answer lies with a fundamental fact about people, which is that we are inescapably creatures of context. Stating this goes beyond echoing the obvious fact that our mood can sometimes be swayed by the weather or that even the mildest person can go into a fit of rage when provoked by the right circumstances. It’s about acknowledging just how deeply our behavior depends on the real, imagined, or implied features of our world, and the extent to which nitty-gritty events, embedded in the moment-to-moment flow of experience, can unconsciously influence us – a product of the way our minds work and the types of real-world problems our brains have evolved to solve [1]. This reality about people is pervasive: It underlies our susceptibility to nudges, affects whether we achieve insights into novel problems, and renders key behaviors, such as learned skills and expressions of our personality, more context-specific than we might anticipate. It’s a factor in both everyday behaviors and the performance of skilled experts, including the clinical decision-making of medical professionals and the actions of first responders. And it puts the onus on researchers to know which real-world contextual features might matter enough to require representation in a research setting – an implication that can weigh heavily once we take a moment to look at how rich these real-world contexts can be.

That people show such intense sensitivity to context goes a long way toward explaining why findings from non-naturalistic research settings can be prone to poor generalization. Standard interviews and surveys simply aren’t well suited to taking account of these kinds of real-life contextual nuances. They thrive instead on abstraction, on eliciting responses that are based on gist representations about the world and oneself, or on a mental picture that is constructed in a reflective environment different from the one in which the object of reflection would normally be encountered. Insights professionals recognize these challenges, and it’s one of the reasons they sometimes turn to a range of projective, neuromarketing, choice exercise, and live simulation techniques to try to overcome them. Yet these methods can introduce their own problems by focusing on a measurement problem at the expense of proper stage-setting, eliciting thought processes that don’t fit the real-world ones we need to investigate, creating a context that’s a poor analog for real life by being engaging in the wrong ways, or, alternatively, being so sterile and stylized as to render their results noisy or open to bias.

What’s needed, then, are tools that can help us achieve the one thing that matters most in this case: Ecological validity, or the ability of findings to generalize to real-life settings because study conditions adequately mimic the critical features of those settings [1,2]. That’s difficult. It requires careful analysis and a good grasp of behavioral principles to determine the features a task or elicitation context needs to have, and the form they need to take, if it’s going to be a good analog for the real-world circumstances we look to mimic. It also requires the means to implement those features in the manner that is desired. And that’s where things can get tricky: Store shelves may be easy to simulate, and app prototypes easy to present in UX tests, but the pathway forward can get murky when you need to investigate a behavior that unfolds over many steps and takes place in a context that is complex, multidimensional, and highly case-specific – the ideal solution in an automated environment being 3D simulators with all the bells and whistles of immersive realism and dynamic interactivity, but often at a cost in time, effort, and money that puts them out of reach.

This is where I think GenAI can start to sing.

Solving the Ecological Validity Problem With GenAI

GenAI's potential value starts to show when we consider some of the tools insights professionals use when they look to help life sciences industry teams understand the subtler dynamics behind healthcare innovation usage. These cases can arise when a treatment's use is underperforming expectations and the reasons for it aren’t obvious, or when a brand is expected to face headwinds and there’s an openness to finding outside-the-box strategies and tactics to try to overcome them. Such cases may call for a deeper exploration of the reasoning of prescribers, the drivers of patient treatment behaviors, or the dynamics of doctor-patient dialogs to uncover key determinants of market performance, lying at the level of customer behavior, that cannot be adequately inferred from standard self-reports or secondary data sources (e.g., electronic health records).

In other words, they are exactly the types of cases where ecologically valid ways of conducting survey- or interview-based research become critical, leading insights professionals to turn to methods that can bring a critical healthcare moment to life in a study in the hope of digging out the thoughts, feelings, and behaviors that might impact real-world treatment use and be targets for effective patient- and provider-centric action. Some of the more popular methods they rely upon to serve this purpose include:

  • Roleplay exercises in which respondents, adopting their real-world roles as physicians, patients, or caregivers, interact with one another or with confederates as they might when at a critical moment in treatment decision-making or in a patient’s healthcare journey;
  • Case vignettes, typically in written form, that describe a patient case with the temporal flow and contextual nuance that allows a range of clinical and patient-centric case features to be available to a physician as they would if they were treating the patient in real life;
  • Videos with trained actors that allow respondents to hear a physician or patient deliver information in their own voice, with all the gestures, nonverbal cues, and ways of saying things that would come along with it.

Each of these methods has its strengths and drawbacks. Roleplay exercises provide an opportunity to observe and probe around highly engaging face-to-face interactions, but do so at the risk of poor control, restriction to small-sample studies, and failure to fully account for the way the artificiality of the setup can unduly impact the behavior of the roleplay participants. Case vignettes impose standardization and control but forego interactivity and depend for their impact on respondents’ ability to construct a rich image of themselves in a setting that can only be conveyed in a limited amount of written text. Videos improve upon written vignettes by packing in a broad range of dynamically unfolding verbal, aural, and visual cues, but are more expensive to produce and require a production process that works against agility.

Every one of these methods, though, has an analogous GenAI application that can function as an adequate substitute for it. More importantly, the substitutes can deliver key improvements by bringing dynamic, interactive tasks under better control with increased realism and fewer artifacts while providing greater room for stimulus enhancement and experimentation. And they can do so with the advantages of increased scalability, greater flexibility, and lower cost than is achievable with the traditional non-AI versions.

GenAI Videos

GenAI videos have had a reputation for slop, and while that was pretty much the status quo for quite a while, it isn’t anymore. The biggest changes came in 2025 with the launch of models that can provide realistic, properly synched speech for AI characters, along with visual capabilities that allow for everything from minimalist documentary-style realism to more highly stylized drama and action sequences. Most of these models have been designed to generate short clips that are clearly meant for easy social media sharing; not surprisingly, some of their behaviors, such as their camera movements and emotion expression, betray training and model tuning fit for the types of cinematic experiences that are likely to be attention-getting on platforms like YouTube and Instagram. Yet some of the best AI video platforms provide tools to begin new clips from previously generated ones to build longer-lasting shots and create continuity, while being responsive to a range of cinematography, audio, shot, editing, scene composition, setting, and action parameters with the right prompting. (They also support non-English dialog without resorting to obvious overdubbing, making them suitable for developing videos for research in non-English-speaking markets.) In my own tests, I was able to emulate a realistic doctor-patient dialog, seen from the physician’s point of view, with minimal camera movement and naturalistic lighting, ambient noise, and scenery while being able to control:

  • Character demographics, such as age and race/ethnicity, with voice characteristics that fit without caricature or stereotyping;
  • Emotional states, such as sadness and dejection, fear, uncertainty, assertiveness, and anger, and their timing; and
  • Verbal and nonverbal behaviors, such as pauses and glances, that, when combined with the right dialog and emotion prompts, are able to betray intentions and emotional states in subtle ways.

The resulting videos can be easily edited in a range of sophisticated, well-supported third-party postproduction tools to allow for fancier maneuvering – for instance, the insertion of b-roll clips to simulate looking at one thing while voiceover continues to be addressed to another character now out of view. Best of all, they put video production in the hands of a single user and allow videos to be created with far less effort and expense than would be typical of a standard short video produced by a professional crew – all key to supporting experimentation and encouraging more frequent usage.
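
For a sense of what “the right prompting” involves, below is an illustrative sketch of the kind of structured video prompt I’m describing, stored as a Python string so it can be versioned alongside other study materials. The scenario, dialog, and parameter labels are hypothetical assumptions on my part, and the exact conventions accepted will vary by platform.

```python
# Illustrative only: a structured text-to-video prompt of the kind described
# above. The scenario, dialog, and parameter labels are hypothetical; exact
# prompt conventions differ across video platforms.
CONSULT_CLIP_PROMPT = """
Shot: fixed camera at eye level, physician's point of view across a desk in
a small exam room; naturalistic window lighting; faint hallway ambience.
Character: male patient, late 60s, seated, slightly hunched; calm but
guarded affect; voice matches age without caricature.
Action: patient glances down briefly before answering, then makes eye
contact; no exaggerated gestures.
Dialog (patient, unhurried, mild hesitation): "The tiredness comes and
goes. It's... probably nothing. I've been sleeping fine."
Audio: no music; room tone only; speech synched to lips.
Style: documentary realism; minimal camera movement; no cuts.
Duration: about 8 seconds.
"""
```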

Fig 1: Two 16-second clips from a 01:10 GenAI video, developed in August 2025, of a consultation with a patient who is currently on chemotherapy for prostate cancer. In the original video, the second clip appears after a 14-second segment in which abnormally high liver enzyme values, appearing in an electronic health record seen from the physician's point of view, are revealed. The model adds a nonverbal gesture at the 30-second mark, consistent with prompt instructions, that provides an ambiguous cue as to whether the patient may be downplaying a symptom consistent with an emerging hepatic issue. Dialog for the first 16-second clip is adapted from [16].

Fig 2: GenAI clips, developed in August 2025, showing various indirect and direct expressions of pushback and emotional distress in response to a chemotherapy recommendation. Moments such as these challenge physicians to take the right patient-centric actions, and their responses can have implications for immediate- and longer-term treatment usage that are less likely to be captured in studies focused on choices made for highly abstract and stylized patient profiles.

Chatbot-Based Simulators

Videos provide an absorbing way to convey the informationally rich, dynamic aspects of everyday experience, but they also remove the opportunity for interactivity, leaving the viewer’s involvement to their reactions to what they see and hear and to any inclination they might have to imagine what it would be like to be a part of the action. Here, simulators powered by generative large language models (LLMs) can make an important contribution. These simulators allow us to capitalize on some of the same interactive features found in automated virtual, augmented, and mixed-reality simulations without some of the burdens those more robust simulators can impose. They have increasingly made a showing in medical education, where LLMs such as ChatGPT have proven to be quite good at simulating patients for the purpose of practicing history-taking [3-10], conducting psychotherapy sessions [11,12], and discussing fraught topics [13], as well as simulating patients and care teams for training in first-responder communications [14] or in scenarios involving critically ill patients on hospital wards [15]. They’re now increasingly being leveraged within VR, AR, and mixed-reality simulators themselves to power the natural-language behavior of simulated characters with greater realism in a more efficient, streamlined manner [6,13-15].

Behind these use cases is the ability of LLMs to persuasively play the role of a character in a roleplay scenario, where the LLM can be easily instructed with a plain language prompt to assume a role that’s defined loosely or with specific instructions about what behaviors to perform and what contextual information to assume when formulating a response to the user’s statements and questions. Prompts containing this information can include specific examples from which the models can learn to guide their in-role behavior, but most of the LLMs’ power derives from their ability to combine their own prior learning with what they’re asked to assume and what they encounter in their interactions with the user to generate a realistic estimate of what the simulated agent would say or do if the agent were interacting with the user in the real world. User interactions may occur through custom apps that are designed to call a third-party LLM such as ChatGPT, and these apps may include capabilities that allow the interaction to occur through speech as opposed to typed text. Yet, while these applications can be designed to be as complex as a traditional VR, AR, or mixed-reality simulation, they don’t have to be: They can be pared down to the standard, ChatGPT-like text interface and still allow their impact to derive from the purely linguistic aspects of the exchange between the user and the simulated character.
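
To make this concrete, here is a minimal sketch of what such a text-only simulator can look like, assuming the OpenAI Python SDK. The model name, persona wording, and helper function are my own illustrative choices, not a reference implementation of anything in the studies cited above.

```python
# Minimal sketch of a chatbot-based simulated patient, assuming the OpenAI
# Python SDK (pip install openai) and an OPENAI_API_KEY in the environment.
# The persona text and model choice are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

PERSONA = (
    "You are roleplaying a patient awaiting follow-up on an ambiguous "
    "screening result. Stay in character at all times. Speak in short, "
    "natural turns, and do not volunteer clinical knowledge the patient "
    "would not plausibly have."
)

def run_simulation() -> None:
    """Run a typed doctor-patient exchange until the user enters 'quit'."""
    messages = [{"role": "system", "content": PERSONA}]
    while (turn := input("Physician: ").strip()).lower() != "quit":
        messages.append({"role": "user", "content": turn})
        reply = client.chat.completions.create(
            model="gpt-4o",  # assumption; any chat-capable model would do
            messages=messages,
        )
        patient_turn = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": patient_turn})
        print(f"Patient: {patient_turn}")

if __name__ == "__main__":
    run_simulation()
```

The same loop generalizes directly: swap the console I/O for a web front end or a survey platform’s hooks, and the conversation history carried in `messages` preserves the simulated character’s continuity from turn to turn.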

How realistic are these simulations, and how easy is it to develop the right prompts for them? I took a test drive using a prompt to simulate a patient who needed to have a conversation with her physician about an ambiguous cancer screening result. I used the prompts from three published patient simulators as models to figure out the best way to prompt the LLM to generate verbalizations consistent with assumptions I wished to make about the patient’s current mental and emotional state along with her broader dispositions and psychosocial profile. I was interested in seeing whether the LLM would behave the way patients are known to behave in doctor-patient dialogs, and whether the behavior would vary in ways expected of changes in the patient’s personality and demographic background. Initial runs revealed failures in goal satisfaction and prioritization, as well as patterns of emotional escalation and de-escalation that weren’t always psychologically plausible. Some of these issues were hard to control, but others were amenable to correction, and the overall result proved to be quite good:

  • The LLM generated responses with content, tone, and style that tracked appropriately with the nature of my communications, including where they were clear and compassionate versus where they were vague and lacked empathy;
  • The LLM also produced responses consistent with background assumptions the LLM was instructed to make about the patient’s demographic background and personality;
  • Where the LLM veered from realism (e.g., in being overly verbose, assuming too much medical knowledge, or indexing too heavily on direct references to inner experiences), I was able to achieve full or partial corrections with additional refinements to the instruction prompt but without having to resort to over-engineering.

Fig 3: Conversation between a physician roleplayed by the author (left column) and a chatbot-based simulated patient (right column) discussing an ambiguous mammogram finding. The setup prompt instructed the LLM to assume that the patient was well-educated, persistent, and a fighter, but also intolerant of ambiguity and prone to anxiety, with emotion that was easy to trigger but unlikely to be expressed directly. Note the frequent use of "reflecting back" on the patient's part, which was never requested but would be somewhat more likely with a patient with this SES profile. Note, too, the reference in the 4th line to "that's exactly what makes me nervous" – a residual effect of the LLM's tendency to misjudge what patients would likely find most anxiety-inducing in this case (i.e., the ambiguous mammogram result).

Most striking, though, was what I noticed in my own behavior: an increasing reluctance to behave in overtly insensitive ways with the simulated patient even when I needed to do so to test the full range of simulated reactions. It felt only proper, and somehow necessary, to take a caring approach to the simulated patient even though I knew it was just an LLM I was interacting with! I also found myself reacting viscerally to the tone of the simulated patient – for instance, wanting to back away from ambiguity to deal effectively with the patient’s need for clarity when they got a little “hot” about it, but being lax in my communication with a patient whose response to vague, surface-level information was indirect and passive. And while I felt “present” with the simulated patient, at no time did I feel implicitly observed by anyone other than the character with whom I was interacting. The degree to which I felt these influences wasn’t expected, but it’s exactly what we’d hope for if the goal is an experience that accurately emulates a key aspect of the world and feels real enough to impact behavior accordingly. As a bonus, it was possible to conduct the exploration with little more than a $20 monthly subscription and a few afternoons’ worth of time, the result being a simulator that was easy to envision implementing in an online survey thanks to the LLM provider’s strong support for API integration.


Fig 4: Re-run of the test conversation, with the physician roleplay involving fewer verbalizations and poor sensitivity, and with the LLM instructed to assume the role of a patient who came from a modest SES background, was dysthymic, leaned toward a passive, avoidant style, and found it challenging to organize their thoughts and feelings into words. Contrast the responses with those obtained in the test shown in Fig. 3. Note the continued cues of distress despite the absence of overt patient pushback, rendered by the model in a tone consistent with what has often been reported in observational research on doctor-patient conversations (e.g., [17,18]).

The GenAI Contribution: Key Benefits, Watchouts, and Ways to Make It Work

By now, it should be obvious why these two GenAI use cases would be a good fit for addressing the challenge that motivated this post. GenAI videos pack a great deal of information into each unit of time and mental processing, playing to our dependence on dynamic visual cues to inform our understanding of the people and events that are relevant to our ability to successfully complete tasks and goals. Likewise, chatbots tap into how much of our interaction with the world occurs through language and provide the opportunity to partake in the types of real-world exchanges that achieve their impact on us through the power of linguistic information alone. Both have limitations: videos remove the opportunity for interactivity, while chatbots are reduced to text-based exchanges absent steps to transform them into richer audiovisual form. Yet both are at a point where they’re able to deliver an impressive amount of psychological realism, doing so relatively easily and inexpensively with expanded opportunities for deployment in multiple modalities (e.g., large-scale online surveys) and reduced barriers to exploration and experimentation.

This last point is worth expanding on. The low barrier to entry with GenAI tools isn’t just a potential boon to anyone seeking cost reductions or greater efficiencies. It’s also a driver of innovation. One of the big problems with AR, VR, and mixed-reality technologies is that, while they may be a gold standard for automating the delivery of highly realistic research experiences, their time, cost, and effort requirements don’t make them particularly friendly to trial-and-error experimentation or to project-specific customization. That poses a significant barrier to third-party insights agencies that are very sales-forward and tooled to produce deliverables at high speed to support specific, well-defined project types within historically tight timelines and budgets. Being able to sit down with ChatGPT, Claude, Sora, or Veo 3 and play with prompts to explore what works and what doesn’t within a few reasonable time blocks gives us a way to overcome this obstacle. It permits the kind of exploration we need to be doing to craft stimuli and tasks that will best suit project objectives, and to find the maneuvers that will make them as good as possible at emulating the real-world conditions to which we want our insights to generalize – the latter being essential to almost any bit of methodological artistry, where we need the room to explore and test lots of different options before we hit on the one that works the way we intend. And it becomes possible to do all of this without investments that can be difficult to de-risk.

That said, it’s not as if GenAI videos and simulators don’t have their own challenges. Current GenAI models don’t act and reason the way we do. Most really are more like “stochastic parrots” than like humans who are able to reason from causal models and something like a true “theory of mind”. As such, they can be prone to hallucinations and other strange behavior that can be hard to predict and bring under control. Likewise, their behavior can be steered in certain directions, but their architecture and training history do betray themselves in a tendency to “want” to do certain things whether or not we find the resulting behavior realistic or desirable. Their behavior can also be susceptible to the unintended, hidden implications of prompt information configurations, which may become obvious only upon after-the-fact analysis. And they have been built with all the usual commercial tradeoffs that can make them perform poorly in ways that won’t be a problem for the average user (e.g., videos losing voice continuity or degrading from clip to clip when scene builders are pushed to their limits) but do become noticeable once they are brought to bear for more specialized purposes. All these limitations do emerge in the applications I’ve described.

Fortunately, there are ways to deal with some of these challenges. Prompts that are engineered to be specific, clear, and crisp, with content and organization developed thoughtfully and tuned to the way LLMs work, can go a long way toward channeling LLM behavior in desired directions. With chatbot-based simulators, realistic behavioral and psychological profiles can be successfully assigned to simulated characters provided the prompts are constructed to respect the fact that they will be read by a model that has no real capacity for empathy and only a shaky understanding of how human minds work. I also found that I achieved my best breakthroughs when I adopted an approach that used a combination of serendipity and outside-the-LLM workarounds to align results with an initial vision while allowing some of the details to float. That was particularly useful in video creation, where, in addition to adopting a particular prompt-writing style, I developed a list of GenAI quirks to anticipate, figured out which maneuvers to avoid and what substitutes would be adequate, tolerated multiple takes, and found ways to deal with certain persistent problems either in postproduction or via the use of other tools (e.g., manually constructed images containing text, which GenAI does a poor job of emulating). All of this requires considerable upfront planning, room for trial-and-error testing, and tolerance for residual uncertainty, and thus it goes without saying that any incorporation of GenAI into research experience design needs to be supported by the right process steps to develop appropriate budgets and timelines, obtain needed inputs and alignments, pressure-test videos and simulations with prospective respondents and content experts, and manage expectations as the development process unfolds.
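
As an example of what respecting the way LLMs work can look like in practice, here is a hypothetical persona prompt sketched along the lines just described, with behavior specified as concrete, observable rules rather than trait labels alone. The details are invented for illustration and are not the prompts from my own tests.

```python
# Hypothetical persona prompt for a simulated patient. Note that behavior is
# specified in observable terms the model can imitate directly, rather than
# asking it to derive behavior from a psychological construct.
PATIENT_PROFILE_PROMPT = """
Role: You are Maria, 52, meeting her physician to discuss an ambiguous
mammogram result.

Background (assume, but do not recite): college-educated; persistent;
intolerant of ambiguity; anxiety is easily triggered but rarely expressed
directly.

Behavioral rules:
- Keep each turn to 1-3 sentences, in plain, non-clinical vocabulary.
- Express worry indirectly: repeated clarifying questions, reflecting the
  physician's words back, hesitations. Do not name feelings outright.
- Escalate only if the physician is vague or dismissive; de-escalate
  gradually when given clear, concrete next steps.
- Never break character or mention being an AI.
"""
```

Writing the profile as behavioral rules is one way to honor the point above: the model can follow described behavior far more reliably than it can infer that behavior from “prone to anxiety” on its own.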

Finally, there is one requirement that GenAI-based research experiences place on insights teams that doesn’t have to do with technical or operational matters. It comes down to domain knowledge. In short, the very level of contextual detail that GenAI videos and simulators capture also requires substantial upfront knowledge of the real-world circumstances they are meant to emulate if they are going to perform as intended. This is true of any methodology that looks to achieve high levels of realism, of course, yet it’s a requirement that can easily break a team that hasn’t been conditioned to it. Some might see that as a deal-breaker, but I see it as an opportunity: A team with the knowledge, skills, and resources to command a domain well enough to know where every paperclip is stored – and with the discernment to know which circumstances are the most strategically important to represent – is also likely to have the depth of knowledge to deliver excellence when it comes to translating business objectives into study design and transforming study results into deep, actionable insights. It’s a perfect example of AI not replacing brain power but enhancing it.

Conclusion

The examples I used in my discussion of the two GenAI applications reflect the types of problems I’ve often needed to address in my own work, but they should hardly be taken to be exhaustive. Both videos and chatbot-based simulators can be used to emulate any kind of real-life experience, social or nonsocial, that is in keeping with the way they represent the world or the type of behavior they allow people to perform. They can be treated as building blocks rather than standalone capabilities, cobbled together to create an experience stream composed of visual, nonvisual, interactive, and noninteractive segments, with branching logic added to create different adventures based on researcher-determined assignment or the respondent’s own answers to questions. They can be tooled to serve any number of research purposes, including not only uncovering people’s thoughts, feelings, and behavioral tendencies in richly defined current or future scenarios, but also assessing people’s perceptions of themselves and their world, using reactions to stimuli that may convey far more than can be captured in the wording of rating scales. These purposes can, in turn, serve a range of strategic objectives, from understanding what people do now, and why, given something specific about the world as it is, to understanding how they might react to a future world involving knowledge of, or an encounter with, a new product, service, communication, or experience. And they may be developed with any level of rigor, from stimuli and tasks that pass a basic sniff test to ones that are more heavily piloted and validated.
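
As a loose sketch of what such branching might look like under the hood, the following illustrative data structure routes respondents through video, question, and chat segments. The segment names, media types, and routing rule are all assumptions for demonstration; in practice, this logic would live in a survey platform or a custom app.

```python
# Illustrative branched experience stream: segments of mixed type connected
# by simple routing rules. All names and assets are hypothetical.
EXPERIENCE_STREAM = {
    "intro_video": {"type": "video", "asset": "consult_intro.mp4",
                    "next": "screener_question"},
    "screener_question": {
        "type": "question",
        "prompt": "How often do you see patients like the one shown?",
        # The respondent's answer routes them to different follow-on segments.
        "branches": {"weekly_or_more": "pushback_chat",
                     "less_often": "case_vignette"},
    },
    "pushback_chat": {"type": "chat_simulation", "persona": "distressed_patient"},
    "case_vignette": {"type": "text", "asset": "vignette_a.html"},
}

def next_segment(current: str, answer: str | None = None) -> str | None:
    """Return the ID of the next segment, following a branch if one applies."""
    node = EXPERIENCE_STREAM[current]
    if answer is not None and "branches" in node:
        return node["branches"].get(answer)
    return node.get("next")
```

Here, `next_segment("screener_question", "weekly_or_more")` would return `"pushback_chat"`, sending respondents who frequently see such patients into the interactive roleplay while others receive the static vignette.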

These applications aren’t perfect, and the methods they support are, like any other, merely a way to move closer to solving the ecological validity problem rather than an outright solution to it. Their implementation requires care: The studies into which they are added must be designed so that their introduction feels natural and seamless to the respondent; follow-up questions and behavioral elicitations must have the same quality of seamlessness to avoid disruption; and they must be used ethically, with all required disclosures made without inadvertently taking the punch out of them.

Yet, their ability to more closely capture the experience of real-world situations and behavioral contexts, along with their flexibility, low cost, and ease of use, creates all the excuse one needs to experiment with them and find ways to begin including them in the primary research arsenal. Given the vital importance of ecological validity in making primary research findings credible and generalizable, I find them to be an exciting development and think they’re worth the time and effort for any industry applied researcher to try to build into their toolkit.

References (Where We Got Some of This)

  1. Vigliocco, G., Convertino, L., DeFelice, S., Gregorians, L., Kewenig, V., Mueller, M.A.E., Veselic, S., Musolesi, M., Hudson-Smith, A., Tyler, N., Flouri, E., & Spiers, H.J. (2024). Ecological brain: Reframing the study of human behaviour and cognition. Royal Society Open Science, 11, 240762. https://doi.org/10.1098/rsos.240762.
  2. Kihlstrom, J.F. (2021). Ecological validity and “ecological validity”. Perspectives on Psychological Science, 16, 466-471. https://doi.org/10.1177/1745691620966791.
  3. Aster, A., Ragaller, S.V., Raupach, T., & Marx, A. (2025). ChatGPT as virtual patient: Written empathic responses during medical history taking. Medical Science Educator, 35, 1513-1522. https://doi.org/10.1007/s40670-025-02342-7.
  4. Holderried, F., Stegemann-Philipps, C., Herschbach, L., Moldt, J., Nevins, A., Griewatz, J., Holderried, M., Herrmann-Werner, A., Festl-Wietek, T., & Mahling, M. (2024). A generative pretrained transformer (GPT)-powered chatbot as a simulated patient to practice history taking: Prospective, mixed methods study. JMIR Medical Education, 10, e53961. https://doi.org/10.2196/53961.
  5. Jones, B., Desu, A., & Honig, C.D.F. (2025). Artificial intelligence chatbots as virtual patients in dental education: A constructivist approach to classroom implementation. European Journal of Dental Education. https://doi.org/10.1111/eje.13135.
  6. Laverde, N., Grevisse, C., Jaramillo, S., & Manrique, R. (2025). Integrating large language model-based agents into a virtual patient chatbot for clinical anamnesis training. Computational and Structural Biotechnology Journal, 27, 2481-2491. https://doi.org/10.1016/j.csbj.2025.05.025.
  7. Liu, X., Wu, C., Lai, R., Lin, H., Xu, Y., Lin, Y., & Zhang, W. (2023). ChatGPT: When the artificial intelligence meets standardized patients in clinical training. Journal of Translational Medicine, 21, 447. https://doi.org/10.1186/s12967-023-04314-0.
  8. Oncu, S., Torun, F., & Ulku, H.H. (2025). AI-powered standardized patients: Evaluating ChatGPT-4o’s impact on clinical case management in intern physicians. BMC Medical Education, 25, 278. https://doi.org/10.1186/s12909-025-06877-6.
  9. Thesen, T., Alilonu, N.A., & Stone, S. (2024). AI patient actor: An open-access generative-AI app for communication training in health professionals. Medical Science Educator, 35, 25-27. https://doi.org/10.1007/s40670-024-02250-2.
  10. Yuan, Y., He, J., Wang, F., Li, Y., Guan, C., & Jiang, C. (2025). AI agent as a simulated patient for history-taking training in clinical clerkship: An example in stomatology. Global Medical Education, 2, 171-177. https://doi.org/10.1515/gme-2024-0025.
  11. Sanz, A., Tapia, J.L., Garcia-Carpintero, E., Rocabado, J.F., & Pedrajas, L.M. (2025). ChatGPT simulated patient: Use in clinical training in psychology. Psicothema, 37, 23-32. https://doi.org/10.70478/psicothema.2025.37.21.
  12. Wang, R., Milani, S., Chiu, J.C., Zhi, J., Eack, S.M., Labrum, T., Murphy, S.M., Jones, N., Hardy, K., Shen, H., Fang, F., & Chen, Z.Z. (2024). Patient-psi: Using large language models to simulate patients for training mental health professionals. arXiv, 2405.19660v2. https://doi.org/10.48550/arXiv.2405.19660.
  13. Weisman, D., Sugarman, A., Huang, Y.M., Gelberg, L., Ganz, P.A., & Comulada, W.S. (2025). Development of a GPT-4-powered virtual simulated patient and communication training platform for medical students to practice discussing abnormal mammogram results with patients: Multiphase study. JMIR Formative Research, 9, e65670. https://doi.org/10.2196/65670.
  14. Gutierrez Maquillon, R., Uhl, J., Schrom-Feiertag, H., & Tscheligi, M. (2024). Integrating GPT-based AI into virtual patients to facilitate communication training among medical first responders: Usability study of mixed reality simulation. JMIR Formative Research, 8, e58623. https://doi.org/10.2196/58623.
  15. Liaw, S.Y., Tan, J.Z., Lim, S., Zhou, W., Yap, J., Ratan, R., Ooi, S.L., Wong, S.J., Seah, B., & Chua, W.L. (2023). Artificial intelligence in virtual reality simulation for interprofessional communication training: Mixed method study. Nurse Education Today, 122, 105718. https://doi.org/10.1016/j.nedt.2023.105718.
  16. Donnelly, T., Haimowitz, B., Simpson, P., Bell, S., Valant, V., McCracken, D., & Leonard, J. (2025, September 1). Strategies for improving the quality and impact of doctor-patient conversations. Quirk's Media. https://www.quirks.com/articles/strategies-for-improving-the-quality-and-impact-of-doctor-patient-conversations.
  17. Beach, W.A., Easter, D.W., Good, J.S., & Pigeron, E. (2005). Disclosing and responding to cancer "fears" during oncology interviews. Social Science & Medicine, 60, 893-910. https://doi.org/10.1016/j.socscimed.2004.06.031.
  18. Beach, W.A., & Dozier, D.M. (2015). Fears, uncertainties, and hopes: Patient-initiated actions and doctors' responses during oncology interviews. Journal of Health Communication, 20, 1243-1254. https://doi.org/10.1080/10810730.2015.1018644.

© 2026 Grey Matter Behavioral Sciences, LLC. All rights reserved.