Does Multiple Choice Have a Future in the Age of Generative AI?
A Posttest-only RCT
DANIELLE R. THOMAS, Carnegie Mellon University, USA
CONRAD BORCHERS, Carnegie Mellon University, USA
SANJIT KAKARLA, Carnegie Mellon University, USA
JIONGHAO LIN, Carnegie Mellon University, USA
SHAMBHAVI BHUSHAN, Carnegie Mellon University, USA
BOYUAN GUO, Carnegie Mellon University, USA
ERIN GATZ, Carnegie Mellon University, USA
KENNETH R. KOEDINGER, Carnegie Mellon University, USA
arXiv:2412.10267v1 [[Link]] 13 Dec 2024
CCS Concepts: • Human-centered computing → Human computer interaction (HCI); • Applied computing → Computer-
managed instruction; • Computing methodologies → Artificial intelligence.
Additional Key Words and Phrases: Tutoring, Generative AI, Human-AI tutoring, AI-assisted tutoring, Assessment
Authors’ Contact Information: Danielle R. Thomas, drthomas@[Link], Carnegie Mellon University, Pittsburgh, PA, USA; Conrad Borchers, cborcher@
[Link], Carnegie Mellon University, Pittsburgh, PA, USA; Sanjit Kakarla, [Link]@[Link], Carnegie Mellon University, Pittsburgh,
PA, USA; Jionghao Lin, jionghao@[Link], Carnegie Mellon University, Pittsburgh, PA, USA; Shambhavi Bhushan, shambhab@[Link],
Carnegie Mellon University, Pittsburgh, PA, USA; Boyuan Guo, boyuan@[Link], Carnegie Mellon University, Pittsburgh, PA, USA; Erin Gatz,
egatz@[Link], Carnegie Mellon University, Pittsburgh, PA, USA; Kenneth R. Koedinger, koedinger@[Link], Carnegie Mellon University,
Pittsburgh, PA, USA.
1 Introduction
The effectiveness of multiple-choice questions (MCQs) in learning is the subject of much debate [3, 16, 18]. Although
MCQs are often criticized for lack of depth, they remain a common feature in K-12 and higher education, due to
their ease of grading [3]. However, their potential as instructional tools, rather than just assessment tools, that is,
as items that provide feedback from which students can learn, has received less attention. In contrast, open-response
questions are frequently used in assignments such as homework, under the assumption that they promote deeper
learning [3, 16]. However, open responses can be more time consuming for learners and resource intensive to grade [3],
although recent advancements in the field have made the automated grading of these responses more feasible. This
study evaluates the effectiveness of MCQs in relation to open-response questions, both individually and in combination,
as learning-by-doing activities. These learning-by-doing activities are embedded in six tutoring lessons that involve
advocacy training. To investigate the scalability of autograding open-ended responses, we use generative AI to evaluate
tutors’ open responses. The contributions of this work are twofold: theoretically, it offers insights into the learning
benefits of MCQs compared to open responses in learning-by-doing instruction; and practically, it provides implications
for optimizing tutor training by determining the most efficient method of instructing and assessing tutors, as measured
in the completion time of instruction and accuracy of automated open-response grading. Furthermore, this study
contributes a dataset of lesson log data, human annotation rubrics, and generative AI prompts to improve
transparency, reproducibility, and collaboration within the learning analytics community.
The design of instructional materials plays a critical role in promoting effective learning. MCQs are often favored for
their efficiency–they can be administered and graded quickly and require less time for learners to respond, making
them appealing in large-scale settings [3, 16, 18]. However, MCQs are sometimes criticized for promoting surface-level
learning, as they may encourage guessing and recognition rather than deeper understanding [3]. In contrast, open-
response questions require tutors to construct their answers, thereby engaging in higher-order thinking and reflection.
Although open-response questions can be powerful pedagogically, they are also more time-consuming to complete
and evaluate [3]. A key question is whether MCQs can be designed to be as pedagogically effective as open-response
questions within the context of teaching tutors advocacy skills, particularly in scenarios where scalability is a concern
[16]. Advocacy, an emerging area of instruction aimed at improving equity and inclusion in tutoring, is particularly
suited for a comparison of MCQ to open-ended responses, because the skills it requires–such as critical thinking and
ethical reasoning–are potentially more effective for comprehension when practiced through open-ended formats [3]
rather than MCQs where generating distractors can pose a challenge [13]. Compared to STEM learning, where many
closed-form grading systems exist (e.g., tutoring systems), there is a need in learning analytics to study what forms of
instruction are effective in novel, less structured domains like advocacy training.
Generative AI models, particularly large language models (LLMs), have the capability to evaluate tutors’ textual, open
responses in real-time. LLMs such as GPT-4 [31], Claude [5], and LLaMa [37] have demonstrated remarkable performance
in a variety of linguistic tasks. These modern LLMs are built on a large-scale transformer architecture and trained on
extensive datasets [2, 14]. As a result, LLMs have attracted substantial interest from researchers across various fields,
including education, because of their potential to perform reasoning tasks at scale and with reduced costs. Generative
AI systems can evaluate human tutor responses across a wide range of scenarios, providing feedback and assessment
at a scale that would be impossible for human evaluators alone. Importantly, LLMs may have the potential to make
situational judgments, assessing not only the correctness of a response but also underlying reasoning [2]. This capability
is crucial in scenario-based training, where tutors must navigate complex real-world situations. However, despite their
potential, LLMs also have limitations, such as the tendency to generate nonsensical or factually incorrect outputs [42],
and bias and fairness issues [14]. Using generative AI for tutor evaluation, this study explores the potential of AI to
support large-scale, effective tutor training, ultimately improving tutor quality and accessibility.
This work addresses the need for effective and scalable tutor training by evaluating tutors’ posttest performance
across six scenario-based lessons focused on advocacy skills. Using a posttest-only randomized experimental design, we
analyze tutor performance and time spent across three learning conditions: MCQ only, open-response questions only,
and a combination of both. We also assessed the scalability of using generative AI to evaluate tutor responses, comparing
the performance of GPT-4-turbo and GPT-4o with human graders. This study addresses the following research questions:
RQ1: What differences exist in tutor learning, as evidenced by posttest performance, across the learning-by-doing
activities, i.e., MCQ only, open-response questions only, or both?
RQ2: In what contexts do MCQs, open-response questions, or a combination of both yield the highest accuracy and
efficiency, thereby optimizing the impact of the lesson?
RQ3: How effective are LLMs, namely GPT-4o and GPT-4-turbo, in assessing tutors’ open responses at posttest?
2 Related Work
2.1 Tutor Advocacy Skills and Scenario-based Training
Tutoring is widely recognized as one of the most effective interventions for improving student learning outcomes [17,
30, 32]. Research consistently shows that personalized support from skilled human tutors can significantly boost student
academic performance, particularly among struggling students [34]. However, ensuring access to adequately trained
tutors is challenging [6, 24], with many tutoring organizations relying on paraprofessionals. Many paraprofessional
tutors have a college education but lack formal training in providing instruction and building quality relationships
with students [6, 30]. In addition, very limited instructional materials are available for tutors on attending to students’
social-emotional needs. The process of training human tutors presents substantial scalability challenges, such as the
need for human evaluators to assess tutor performance. Traditional methods of tutor training and evaluation are
both time-consuming and resource-intensive, limiting the ability to scale tutoring programs to meet the needs of all
students. Tutoring is more effective when delivered by teachers or well-trained professional tutors [30]. Currently,
limited instructional materials are available to prepare and provide situational experiences to inexperienced tutors.
The lessons draw from previous research that identified impactful competencies of effective tutoring within the area
of Advocacy [6, 9, 35]. Past studies have internally validated the construct validity by demonstrating 20% learning gain
from pretest to posttest on similarly-structured lessons covering topics related to: giving effective praise to students;
reacting when a student makes an error; and determining what students know [35]. Using the same scenario-based
structure as in previous work [8, 9, 35, 36], our goal is to optimize tutor learning focusing on tutor lessons that instruct
tutors in advocacy skills. There are very limited instructional and training materials available to tutors in the area of
advocacy. Advocacy in teaching and tutoring encompasses a range of skills that promote student success by addressing
their academic, social-emotional, and equity-related needs [6, 12]. Key areas include: promoting equity and inclusion;
fostering cultural awareness; and challenging unconscious bias and assumptions [6, 35].
Whereas MCQs are efficient to administer and grade, open-response questions foster deeper reflection and higher-order
thinking but are more time-consuming, raising the question of whether MCQs can be made pedagogically effective for teaching advocacy skills at scale
[3, 16]. This present work applies a learning engineering approach to investigate the learning efficiency of the following
types of questions: open response, which encourages deeper cognitive engagement; MCQs, which provide structured
assessment and objective grading; and a combination leveraging the strengths of both [3, 16]. Central to this present
work is the “learn-by-doing” methodology, which emphasizes active participation in the learning process. This approach
aligns with “doer” philosophy, advocating for hands-on, practical experiences to enhance understanding and retention
[23]. An example in practice is the integration of computer-based Cognitive Tutors, whereby students are required not
only to complete tasks but also to articulate (analogous to the ability of tutors to explain in this current work) their
reasoning, reinforcing their comprehension and retention [1]. This dual emphasis on doing and explaining has been
shown to significantly improve learning outcomes, fostering deeper comprehension and critical thinking skills [1, 35].
Brief scenario-based lessons were strategically designed using the learning-by-doing approach: they provide actionable
feedback, require tutors to apply what they learn within the learning-by-doing conditions and the instruction phase,
and then ask tutors to apply their learning in analogous tutoring situations at posttest. Fig. 1 illustrates the instructional design
of the lesson. First, the tutors are presented with a scenario (Scenario 1), whereby they are prompted to predict the best
approach, followed by being asked to explain their rationale or reasoning behind their chosen approach. There are three
possible learning-by-doing conditions: multiple choice only, open response only, or both. Multiple choice questions
begin with “Which of the following. . . ,” followed by four options for the tutor to choose from. Open response questions start
with “What would you say or how would you respond. . . ” for predicting the best approach and “Why do you think your
response is the best approach. . . ” for explaining the rationale behind their chosen approach. The tutors then engage
in the instruction phase where the tutors observe the research-recommended approach and explain their reasoning
in support or not of the best approach. Finally, the tutors complete a posttest, which is the same for all tutors and
uses both MCQs and open responses. This instructional design is considered a modified predict-observe-explain (POE)
approach and is theoretically related to Gibbs’ Reflective Cycle, a cyclical instructional model that lends structure
to learning by doing across individual learning experiences [15].
Fig. 1. Instructional design sequence of the lessons illustrating the three learning-by-doing conditions, then the follow-up instruction
phase, and concluding with posttest.
scoring rubrics. Their study demonstrated that GPT-4 outperformed GPT-3.5, highlighting the potential of generative
AI to provide more accurate and explainable automatic scores in educational contexts. [28] used few-shot prompting
strategies with GPT-4 to assess the correctness of the responses of novice tutors in various tutoring strategies, such
as giving praise, responding to student errors, and understanding student knowledge levels. [41] used GPT-4-turbo
to evaluate the understanding of novice tutors of essential tutoring practices, including encouraging active student
learning and fostering a respectful community. These studies illustrate the growing capacity of generative AI to offer
more nuanced and contextually aware assessments of open-ended educational responses.
ensure that the model consistently interprets nuanced answers in a similar manner across multiple runs. Furthermore,
prompting for rationale can help in obtaining more interpretable outputs. By explicitly asking the model to explain its
reasoning or decision-making process, researchers can gain insight into how the model arrived at its conclusions, thus
improving the transparency and interpretability of the assessment. This is particularly valuable in educational contexts,
where understanding the reasoning behind a student’s answer is as important as the answer itself. Overall, prompt
engineering strategies such as few-shot prompting [2], chain-of-thought prompting [40], and self-consistency [39] play
a vital role in enhancing the performance and reliability of LLMs in educational assessments. By carefully designing
prompts and leveraging these techniques, we can guide LLMs to provide more accurate, nuanced, and consistent
evaluations of open-ended educational responses.
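As a concrete illustration, the self-consistency strategy [39] amounts to a majority vote over repeated scoring runs. The sketch below is illustrative, not the study's pipeline: `score_once` is a hypothetical stand-in for a temperature-sampled LLM call.

```python
from collections import Counter

def self_consistent_score(score_once, response, n_samples=5):
    """Self-consistency: sample several scores, keep the majority vote.

    `score_once` is any callable returning a 0/1 score for `response`;
    in practice it would wrap a sampled LLM call.
    """
    votes = [score_once(response) for _ in range(n_samples)]
    return Counter(votes).most_common(1)[0][0]

# Stand-in for a stochastic LLM judge: a fixed sequence of votes.
_votes = iter([1, 1, 0, 1, 0])
majority = self_consistent_score(lambda r: next(_votes),
                                 "Nori, I'm here if you want to talk.")
print(majority)  # majority of [1, 1, 0, 1, 0] is 1
```

Averaging over several runs in this way dampens the run-to-run variability of a single sampled grade.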
3 Method
Six scenario-based lessons were created and designed to align with the tutoring competencies within the area of
Advocacy [6]. The lesson content, taken from the tutoring platform and formatted as documents, can be found in the
Digital Appendix. The lesson titles and learning objectives are listed below.
• Addressing Microaggressions: define the term microaggression; identify microaggressions that occur in tutoring
settings; and apply equity-focused strategies to help students address microaggressions.
• Avoiding Unconscious Assumptions: identify unconscious assumptions; and apply strategies to prevent making
unconscious assumptions while tutoring.
• Building Cultural Competence: identify when students have different cultural backgrounds and experiences than
your own; practice strategies to build cultural competence, supporting and engaging students across cultures.
• Exploring Implicit Bias: identify implicit, or unconscious, bias; and apply strategies to counter the effects of your
own implicit biases.
• Narrowing Opportunity Gaps: define the term opportunity gap; identify examples of opportunity gaps in tutoring
settings; and explain strategies to narrow opportunity gaps in tutoring settings.
• Helping Students Manage Inequity: recognize when a student is experiencing inequity related to their learning;
and apply strategies to help students manage inequities by assisting students to advocate for themselves.
tutoring [6, 35]. For broad dissemination and use across varying tutoring organizations, all lessons have a Flesch-Kincaid
readability index measure that ranges from grades 6-9 (defined as spanning from easy to read to average reading level).
This ensures that all tutors, regardless of their individual reading level, can understand the content of the lesson. Each
participant was randomly assigned one of three conditions in the learning-by-doing phase of the lesson: multiple choice
only, open response only, or both. All participants received one of two randomly chosen scenarios used in the posttest
for counterbalancing of scenario difficulty.
Table 1. Number of tutors by lesson for each condition with the average self-reported knowledge of the lesson topic (1-5).
Tutor performance on the open-response questions ranged from 45% to 89% on the predict questions and 51% to 68%
for the explain question types. To gain perspective on lesson and question difficulty, Table 3 displays the percentage
of correct responses for each lesson by question type. Overall, tutor performance was lower on questions prompting
tutors to explain the rationale behind their predictions of the best approach.
Manuscript submitted to ACM
8 Thomas et al.
Table 3. Percentage of correct open responses for each lesson broken out by predict and explain responses.
Table 4. The LLM prompt for scoring predict responses from Addressing Microaggressions.
SCORING_PROMPT_START = """
You are a tutor evaluator. Please score the following tutor response to a tutor training scenario involving a middle school student facing a microaggression as
follows:
-if the tutor’s response acknowledges a student facing a microaggression by providing support to the student, by giving the student space to talk, or facilitating
an apology, score with a 1. Sample responses scoring a 1 are "Nori, I heard what your teacher said about the exit ticket. It’s okay to find problems challenging.
Let’s work on it together during our session. Your efforts are important, and I’m here to support you." and "Nori, I noticed what your teacher said to you could
have been upsetting and I am here if you would like to talk about it."; and "Ashley, that was not a kind or respectful thing to say. Please apologize to your
classmate."; "I would acknowledge Ashley’s mean comment and tell her what she did wasn’t right"; "Nori, I notice that you look sad on your face. Is it because
what your teacher just said? Do you want to talk about it with me?"
-if the tutor’s response does not acknowledge the microaggression or does not provide any support to the student, score with a 0. Yes or no questions, if they
don’t acknowledge the student’s experience are scored with a 0. Sample responses scoring a 0 are "Nothing is too difficult if you decide you want to do it Nori.
You are capable"; "Do you want to talk about what the teacher said to you"; "Nori, I heard what your teacher said to you about the exit ticket. Do you want to
discuss it with me"; and "I would address the problem."
Response Start ---"""
FORMAT_PROMPT =
"--- Response End. Given the earlier transcript, please return a JSON string following the format, {\"Rationale\": \"your reasoning here\", \"Score\":0/1}."
Table 5. The LLM prompt for scoring explain responses from Addressing Microaggressions.
SCORING_PROMPT_START = """
You are a tutor evaluator. Please assess a tutor’s response within a tutor training scenario involving a tutor instructing a middle school student who has faced a
microaggression: The tutor is explaining the rationale behind their response. Assess and score the tutor’s response, as follows:
-if the tutor’s response demonstrates that they understand how to recognize and acknowledge a microaggression by providing the student support or issuing an
apology, score with a 1. Sample responses scoring a 1 include: "Acknowledging the student’s feelings and naming the microaggression, the teacher’s comment,
will provide an opportunity to address the microaggression"; "This approach will acknowledge the microaggression because it directly addresses Nori’s feelings
and opens up a supportive dialogue. By acknowledging that the teacher’s comment may have been hurtful, it validates Nori’s experience and gives her the
opportunity to express her emotions."
-if the tutor’s response does not demonstrate that the tutor recognizes how they should acknowledge microaggressions, score with a 0. Sample responses scoring
a 0 include: "Telling the student she is capable of solving the problem will boost her confidence and addressing the problem will help to boost the students
emotional status"; "It encourages the student to work on the problem"; and "This will provide her with a safe space to communicate."
Response Start ---"""
FORMAT_PROMPT =
"--- Response End. Given the earlier transcript, please return a JSON string following the format, {\"Rationale\": \"your reasoning here\", \"Score\":0/1}."
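The two string constants in Tables 4 and 5 are concatenated around the tutor's response, and the model's JSON reply is parsed to recover the score. A minimal sketch of that plumbing follows; the API call itself is omitted, and `mock_reply` is an illustrative stand-in for a model output, not an actual response from the study.

```python
import json

# Abbreviated here; the full prompt text appears in Tables 4-5 above.
SCORING_PROMPT_START = "You are a tutor evaluator. ... Response Start ---"
FORMAT_PROMPT = ('--- Response End. Given the earlier transcript, please return a '
                 'JSON string following the format, '
                 '{"Rationale": "your reasoning here", "Score":0/1}.')

def build_prompt(tutor_response):
    # Wrap the tutor's free-text response between the two prompt halves.
    return f"{SCORING_PROMPT_START}\n{tutor_response}\n{FORMAT_PROMPT}"

def parse_score(model_output):
    # The format prompt requests {"Rationale": ..., "Score": 0/1}.
    parsed = json.loads(model_output)
    return int(parsed["Score"]), parsed["Rationale"]

mock_reply = '{"Rationale": "Acknowledges the student and offers support.", "Score": 1}'
score, rationale = parse_score(mock_reply)
print(score)  # 1
```

Requesting a rationale alongside the binary score, as the format prompt does, makes each automated grade auditable.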
performance, with all factors being between subjects. Regarding RQ2, we replicated the ANOVA used for RQ1 on the
time it took students to complete the learning-by-doing and follow-up instruction in each condition. For answering
RQ3, we used prompt engineering of large language models, GPT-4-turbo and GPT-4o, to evaluate tutors’ open responses.
We then employed the same ANOVA model from RQ1 to determine whether the results are consistent with the analysis of
human-graded tutor responses. Finally, we report the absolute performance of both LLMs.
3.4.1 Lesson Log Data. Student responses to individual practice questions, survey questions, and other forms of
instruction (e.g., responses to multiple-choice options, Likert scales, and open-ended responses) were recorded in
PSLC DataShop, an open repository for educational log data commonly used for tutoring systems in learning analytics
research [23]. Specifically, data was recorded in transaction format, which means that we analyzed individual interactions
of students with timestamps. We prioritized maintaining the privacy and confidentiality of tutors, adhering to all
Institutional Review Board (IRB) requirements. The lesson log data can be accessed within the Digital Appendix.
To measure performance on the posttest (RQ1), we aggregated the accuracy of student responses on the posttest,
where students completed two multiple-choice and two open-response questions, with the latter graded by two
experienced researchers and LLM models (RQ3). To measure the time students took to complete the instruction
(RQ2), we calculated the difference between the instruction start time, as recorded in log data, and the last student
response to questions associated with each lesson’s instruction. In cases where students completed lessons over multiple
sessions with substantial breaks between them, we excluded the breaks from the total lesson completion time. Due
to the right-skewed distribution of completion times, which are always greater than zero, we applied a logarithmic
transformation for ANOVA and statistical tests that assume normally distributed outcomes. This assumption was
confirmed in the logarithmic transformation data through visual inspection of standard diagnostic plots (e.g., residual
Q-Q plots). However, for ease of interpretation, we re-transformed the averages and confidence intervals from the log
scale back to the standard time scale (i.e., minutes) for presentation in plots.
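The transformation and back-transformation described above can be sketched as follows. The timing values are illustrative, not taken from the study's data; note that exponentiating the mean of the logs yields the geometric mean, which downweights the long right tail relative to the arithmetic mean.

```python
import math
import statistics

# Illustrative right-skewed completion times in minutes (all > 0).
times_min = [4.2, 5.1, 3.8, 12.0, 6.4, 4.9, 25.3, 5.5]

# Log-transform so normal-theory tests (e.g., ANOVA) are better justified.
log_times = [math.log(t) for t in times_min]
mean_log = statistics.mean(log_times)
se_log = statistics.stdev(log_times) / math.sqrt(len(log_times))

# 95% CI on the log scale, then re-transformed back to minutes.
geo_mean = math.exp(mean_log)
ci_low = math.exp(mean_log - 1.96 * se_log)
ci_high = math.exp(mean_log + 1.96 * se_log)
print(f"{geo_mean:.1f} min, 95% CI [{ci_low:.1f}, {ci_high:.1f}]")
```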
4 Results
4.1 Learner Performance Across Conditions
The overall average results of the posttest are shown in Fig. 2. As is visually apparent, there is no consistent pattern
illustrating that one instructional condition produces better learning outcomes than the others. Indeed,
in an analysis of variance, we did not find a statistically significant main effect of condition, F(2, 717) = 0.27, p = .765.
There was a significant interaction between condition and lesson, F(10, 717) = 2.20, p = .012, which means that the
posttest scores differed significantly by condition depending on what lesson the tutors completed. Furthermore, a
significant main effect of the lesson suggested substantial accuracy differences by lesson, which means that some
lessons were harder than others, on average, F(5, 717) = 10.18, p < .001.
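To make the test statistic concrete, the F value for a one-way version of this analysis (main effect of condition only; the full model reported above also includes lesson and the condition-by-lesson interaction) can be computed directly. The scores below are illustrative, not the study's data.

```python
from statistics import mean

def one_way_f(groups):
    """F statistic for a one-way ANOVA over a list of score lists."""
    k = len(groups)                              # number of conditions
    n = sum(len(g) for g in groups)              # total observations
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    # F = between-group mean square / within-group mean square
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Illustrative posttest scores: MCQ Only, Open-response Only, Both.
groups = [[0.6, 0.7, 0.8], [0.7, 0.8, 0.9], [0.5, 0.6, 0.7]]
print(round(one_way_f(groups), 3))  # -> 3.0 for this toy data
```

The observed F is then compared against the F distribution with (k - 1, n - k) degrees of freedom to obtain a p-value.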
Fig. 2. Average posttest scores compared across learning-by-doing conditions: MCQ Only, Open-response Only, or Both. No significant
differences were found in posttest scores between conditions. Error bars represent 95% confidence intervals.
We conducted post hoc contrasts using estimated marginal means to determine which condition-level differences within lessons were
reliable [26]. Significant differences were found in only two lessons. In the Addressing Microaggressions lesson, the Both
condition produced higher posttest scores than the Open-response Only condition (estimate = 0.136, SE = 0.054, 𝑝 = .031)
with the MCQ Only condition ambiguously in between (i.e., not statistically different from the other two conditions,
𝑝-values > 0.12). The results for the Exploring Implicit Bias lesson were essentially the opposite, consistent with the
overall interaction. In the Exploring Implicit Bias lesson, the Both condition produced lower posttest scores than the
Open-response Only condition (estimate = -0.175, SE = 0.054, 𝑝 = .004). In this lesson, the Both condition also produced
lower posttest scores than the MCQ Only condition (estimate = -0.134, SE = 0.056, 𝑝 = .046). All other comparisons
were not significant (𝑝 > .299).
Fig. 3. Average instruction time prior to posttest compared across learning-by-doing conditions: MCQ Only, Open-response Only, or
Both. Although MCQ Only took less time on average, no overall significant differences in instruction time prior to posttest were found
between the conditions. Error bars represent 95% confidence intervals.
Although the overall interaction between condition and lesson was not significant, we report marginal mean contrasts
of conditions within lessons similar to RQ1. The MCQ Only condition took significantly less time than the Both condition
in the Narrowing Opportunity Gaps lesson (estimate = -0.55, SE = 0.19, p = .011). In the
Building Cultural Competence lesson, MCQ Only was also significantly faster than the Open-response Only (estimate
= -0.50, SE = 0.19, p = .025) and Both conditions (estimate = -0.76, SE = 0.16, p < .001). Notably, in no lesson was the
Open-response Only condition significantly different from the Both condition (p-values > .269). Overall, these findings
suggest that the Both condition took learners significantly longer than the MCQ Only condition in 2/6 lessons, but not
longer than the Open-response Only condition.
Table 6. Comparison of absolute performance for GPT-4o across lessons for predict and explain open responses.
Table 7. Comparison of absolute performance for GPT-4-turbo across lessons for predict and explain open responses.
5 Discussion
This study investigated differences in tutor learning across conditions that align with varying learning-by-doing
activities (i.e., MCQ only, open response only, or both) and assessed the scalability of using generative AI for evaluating
tutor responses. Several important insights emerged, which offer a comprehensive understanding of this present work.
The Both condition was significantly faster in processing the follow-up instruction than the MCQ Only condition, estimate
= -0.47, SE = 0.18, p = .020, as well as the Open-response Only condition, estimate = -0.78, SE = 0.19, p < .001.
5.3 LLMs demonstrate proficiency, but more research is needed for wide-scale assessment.
Similar to [25], we found the GPT models to be comparable to each other, and both demonstrated overall proficiency. However,
GPT-4-turbo exhibited variability across lessons, with poorer performance in the Helping Students Manage Inequity
(𝐴𝑈𝐶 = 0.17) and Exploring Implicit Bias (𝐴𝑈𝐶 = 0.43) lessons. The low 𝐴𝑈𝐶 scores in these cases suggest that
GPT-4-turbo struggled to classify responses in more complex or nuanced topics. This highlights the need for further
refinement of LLMs to enhance their ability to assess open responses in content areas that require situational reasoning.
In addition, more research is needed on the nuances within each lesson. The interrater reliability between the human
graders was high for some lessons (𝜅 = 0.93); even so, it is necessary to revisit the human grading and annotation rubrics for
the lessons and determine the sources of disagreement between the GPT models and humans (and between the raters)
to better understand the results. This work adds to the recent literature on the potential use of LLMs for low stakes
assessment tasks in different domain areas (Henkel et al. [20]), adding to the area of tutor training in advocacy skills.
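The interrater agreement statistic referenced above, Cohen's κ, corrects raw percent agreement for the agreement expected by chance. A minimal pure-Python version follows; the 0/1 grade lists are illustrative, not the study's actual grades.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' labels: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label the same.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Illustrative 0/1 grades from a human rater and an LLM grader.
human = [1, 1, 1, 0]
model = [1, 0, 1, 0]
print(cohens_kappa(human, model))  # 0.5
```

Here the raters agree on 3 of 4 items (p_o = 0.75) but would agree on half by chance (p_e = 0.5), giving κ = 0.5; perfect agreement yields κ = 1.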
5.4 Limitations
While this study used a posttest-only randomized experimental design, which offers several advantages, there are
inherent limitations to this approach. One limitation is the lack of baseline data due to the absence of an analogous
pretest scenario, which prevents measuring individual learning gains from pretest to posttest. This limitation makes
it challenging to quantify the exact effect size of the lessons and to track individual progress. However, the primary
purpose of this study was not to determine individual learning gain but to identify whether any of the learning-by-
doing conditions–multiple choice questions, open-response questions, or both–led to better learning outcomes. By
focusing on posttest results and employing random assignment to the condition, we were able to directly compare the
relative learning effectiveness of each condition. In addition, the study aimed to find the most efficient lesson design by
analyzing the time it took learners to complete each condition. Although the absence of pretest data can sometimes
reduce statistical power, our sample size was sufficient to detect meaningful differences across conditions in time taken.
Furthermore, our previous study [35], which implemented the Both condition with a counter-balanced pre-post design,
demonstrated that our assessments are sufficient to detect significant pre-post learning gains. Finally, while the lack of
baseline data can complicate the replication of the study in different contexts, the posttest-only design was specifically
chosen to minimize the risk of pretest sensitization, which can enhance the validity of the experimental results [11, 38].
assessing student learning. We used both in our posttest, and we intend to keep using open-response questions to
test whether instructional practice with MCQs transfers to performance on open-ended questions.
Our finding that MCQ practice transfers to open-ended performance is an important general result. The use of MCQs
is often the target of criticism in assessment and instructional design. This criticism of MCQs as a shallow form of
practice is perhaps more common when used for content that is less well defined, such as is the case for the learning
goals of the tutoring lessons used in these studies (i.e., advocacy). Our evidence does not support such criticism and,
further, provides evidence for the instructional benefits of MCQs relative to open-response questions in requiring
less time for learners to reach the same outcome. Practically, this finding provides support for greater use, or at least
greater exploration, of MCQs as practice tasks during instruction. Theoretically, this result has implications for the
refinement of general frameworks for instructional design. In particular, the ICAP framework [7] makes a general
prediction in the subject matter domain that constructive learning tasks, such as open-response questions, should
produce better learning outcomes than active (but not constructive) learning tasks, such as feedback-based MCQs. Our
learning outcome evidence is inconsistent with this prediction, especially as a generalization across learning content.
We observed evidence of effect heterogeneity: some lessons yielded better learning outcomes when MCQ learning tasks
were included, while others showed improved outcomes when they were excluded. This content-treatment
interaction has been found and well explained in previous research (e.g., [33]); however, in this case, the explanation
is not clear, and we suggest future work to probe whether there is a replicable finding here and, if so, what theory
might explain it. Finally, given how prevalent open-response questions are as learning tasks in homework assignments
in school and college education, we recommend further investigation of the potential for more efficient and equally
effective learning from the use of multiple-choice questions as learning tasks.
Acknowledgments
This work was made possible with the support of the Learning Engineering Virtual Institute. The opinions, findings,
and conclusions expressed in this material are those of the authors.
References
[1] Vincent A Aleven and Kenneth R Koedinger. 2002. An effective metacognitive strategy: Learning by doing and explaining with a computer-based
cognitive tutor. Cognitive science 26, 2 (2002), 147–179.
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
[3] Andrew C Butler. 2018. Multiple-choice testing in education: Are the best practices for assessment also good for learning? Journal of Applied
Research in Memory and Cognition 7, 3 (2018), 323–331.
[4] Dan Carpenter, Wookhee Min, Seung Lee, Gamze Ozogul, Xiaoying Zheng, and James Lester. 2024. Assessing Student Explanations with Large
Language Models Using Fine-Tuning and Few-Shot Learning. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational
Applications (BEA 2024), Ekaterina Kochmar, Marie Bexte, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Anaïs Tack, Victoria Yaneva, and
Zheng Yuan (Eds.). Association for Computational Linguistics, Mexico City, Mexico, 403–413.
[5] Loredana Caruccio, Stefano Cirillo, Giuseppe Polese, Giandomenico Solimando, Shanmugam Sundaramurthy, and Genoveffa Tortora. 2024. Claude
2.0 Large Language Model: tackling a real-world classification problem with a new Iterative Prompt Engineering approach. Intelligent Systems with
Applications 21 (2024).
[6] Pallavi Chhabra, Danielle Chine, Adetunji Adeniran, Shivang Gupta, and Kenneth Koedinger. 2022. An evaluation of perceptions regarding mentor
competencies for technology-based personalized learning. In Society for Information Technology & Teacher Education International Conference.
Association for the Advancement of Computing in Education (AACE), San Diego, CA, USA, 1812–1817.
[7] Michelene TH Chi and Ruth Wylie. 2014. The ICAP framework: Linking cognitive engagement to active learning outcomes. Educational psychologist
49, 4 (2014), 219–243.
[8] Danielle R Chine, Pallavi Chhabra, Adetunji Adeniran, Shivang Gupta, and Kenneth R Koedinger. 2022. Development of scenario-based mentor
lessons: an iterative design process for training at scale. In Proceedings of the Ninth ACM Conference on Learning@ Scale. Association for Computing
Machinery (ACM).
[34] Carly D Robinson and Susanna Loeb. 2021. High-impact tutoring: State of the research and priorities for future learning. National Student Support
Accelerator 21, 284 (2021), 1–53.
[35] Danielle Thomas, Xinyu Yang, Shivang Gupta, Adetunji Adeniran, Elizabeth Mclaughlin, and Kenneth Koedinger. 2023. When the tutor becomes
the student: Design and evaluation of efficient scenario-based lessons for tutors. In LAK23: 13th International Learning Analytics and Knowledge
Conference. Association for Computing Machinery (ACM), Arlington, TX, USA, 250–261.
[36] Danielle R Thomas, Jionghao Lin, Shambhavi Bhushan, Ralph Abboud, Erin Gatz, Shivang Gupta, and Kenneth R Koedinger. 2024. Learning and AI
Evaluation of Tutors Responding to Students Engaging in Negative Self-Talk. In Proceedings of the Eleventh ACM Conference on Learning@ Scale.
481–485.
[37] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric
Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[38] William MK Trochim and James P Donnelly. 2001. Research methods knowledge base. Vol. 2. Atomic Dog Publishing, Cincinnati, OH.
[39] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-Consistency
Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations.
[40] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting
elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
[41] Joy Yun, Yann Hicke, Mariah Olson, and Dorottya Demszky. 2024. Enhancing Tutoring Effectiveness Through Automated Feedback: Preliminary
Findings from a Pilot Randomized Controlled Trial on SAT Tutoring. In Proceedings of the Eleventh ACM Conference on Learning@ Scale. 422–426.
[42] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023. Siren’s song in
the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219 (2023).
A Digital Appendix
All analysis code, study materials, and log data references can be found in the study’s supplementary GitHub repository:
[Link]