Elad Yacobson, Ofra Amir, Ayelet Baram-Tsabari

Methodology

37 Israeli students from the Technion and the Weizmann Institute of Science participated in the study. Each student engaged in two voice conversations with an LLM (ChatGPT 4o). The scenario simulated a casual encounter with a layperson in a medical doctor's waiting room, in which the students were asked to explain their research to the LLM (see Figure 1 for an example excerpt of a conversation between a student and the LLM). Students first completed the initial conversation. They then received feedback from ChatGPT and were asked to reflect on that feedback and consider how to improve. Immediately following this feedback, they participated in a second conversation based on the same scenario. Each conversation lasted between five and eight minutes and was transcribed by ChatGPT. After both conversations, students completed a reflective questionnaire about their experience with the training tool.

Two independent expert evaluators separately rated 36 of the 74 transcripts (49%), and their scores reached 87% agreement (Cohen's κ = 0.70). The remaining 38 transcripts were scored by one of the two experts. To assess changes in dialogic performance, we conducted a Wilcoxon signed-rank test comparing students' scores across the four dimensions (Content, Interpersonal Rapport, Perspective Taking & Listening, Integrity & Humility) in their first and second conversations.

Figure 1. Example excerpt from a student's dialogue with ChatGPT.

Results

Regarding RQ1, the results demonstrated a statistically significant improvement in students' overall dialogic performance from the first to the second conversation. Students improved on three of the four dimensions: Content (p = .015), Interpersonal Rapport (p < .001), and Perspective Taking & Listening (p = .007).
The only dimension that did not show a statistically significant gain was Integrity & Humility (p = .38), which also received the lowest scores in both conversations relative to the other dimensions. The overall average dialogic score increased from 0.20 to 0.30 (p < .001) on a −1 to 1 scale, indicating a robust training effect. These results suggest that even a single feedback-guided interaction can lead to measurable improvements in key dialogic competencies.

Regarding RQ2, participants' responses to the post-task survey revealed high levels of satisfaction with the training experience. Most students (29 of 37) rated the simulation as helpful (ratings of 4 or 5 out of 5), and 75% reported that they felt they had improved their communication skills in the
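The inter-rater agreement reported in the Methodology can be illustrated with a minimal sketch of Cohen's kappa, which corrects raw percent agreement for agreement expected by chance. The ratings below are invented for illustration only, not the evaluators' actual scores:

```python
# Cohen's kappa for two raters scoring the same items.
# Toy data only (NOT the study's ratings): scores on a -1/0/1 scale.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b.get(label, 0)
                   for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = [1, 0, 1, -1, 0, 1, 1, 0, -1, 1]
b = [1, 0, 1, -1, 1, 1, 0, 0, -1, 1]
print(round(cohens_kappa(a, b), 2))  # → 0.68
```

Here the two toy raters agree on 8 of 10 items (80% raw agreement), but kappa discounts the agreement their marginal rating frequencies would produce by chance.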
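The pre/post comparison above rests on the Wilcoxon signed-rank test, a nonparametric paired test on the per-student score differences. A minimal pure-Python sketch follows, using a normal approximation without tie or continuity corrections; the paired scores are invented, not the study's data:

```python
# Two-sided Wilcoxon signed-rank test, normal approximation.
# Simplified sketch: zero differences are dropped and no tie/continuity
# correction is applied. The scores below are invented toy data.
import math

def wilcoxon_signed_rank(x, y):
    # Round to suppress float noise so tied |differences| compare equal.
    diffs = [round(b - a, 9) for a, b in zip(x, y) if b != a]
    n = len(diffs)
    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j + 2) / 2  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    w = min(w_plus, w_minus)
    # Normal approximation to the null distribution of W.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w, p

first  = [0.10, 0.25, 0.05, 0.30, 0.15, 0.20, 0.35, 0.10, 0.25, 0.20]
second = [0.25, 0.35, 0.20, 0.40, 0.30, 0.25, 0.45, 0.30, 0.35, 0.40]
w, p = wilcoxon_signed_rank(first, second)
print(f"W = {w}, p = {p:.4f}")  # every toy student improved, so p is small
```

In the toy data every second-conversation score exceeds its paired first-conversation score, so the negative-rank sum W is 0 and the test rejects the null of no change; in practice an off-the-shelf implementation such as `scipy.stats.wilcoxon` (with exact small-sample p-values and tie handling) would be used instead of this sketch.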