
TYPE Original Research
PUBLISHED 09 May 2024
DOI 10.3389/feduc.2024.1386075

Prompt the problem – investigating the mathematics educational quality of AI-supported problem solving by comparing prompt techniques

Sebastian Schorcht 1*, Nils Buchholtz 2 and Lukas Baumanns 3

1 Primary Education/Mathematical Education, Faculty of Education, TUD Dresden University of Technology, Dresden, Germany, 2 Secondary Mathematics Education, Faculty of Education, University of Hamburg, Hamburg, Germany, 3 IEEM, Faculty of Mathematics, TU Dortmund University, Dortmund, Germany

OPEN ACCESS

EDITED BY
Ali Ibrahim Can Gözüm, Kafkas University, Türkiye

REVIEWED BY
José Cravino, University of Trás-os-Montes and Alto Douro, Portugal
Yizhu Gao, University of Georgia, United States

*CORRESPONDENCE
Sebastian Schorcht
[Link]@[Link]

RECEIVED 14 February 2024
ACCEPTED 19 April 2024
PUBLISHED 09 May 2024

CITATION
Schorcht S, Buchholtz N and Baumanns L (2024) Prompt the problem – investigating the mathematics educational quality of AI-supported problem solving by comparing prompt techniques. Front. Educ. 9:1386075. doi: 10.3389/feduc.2024.1386075

COPYRIGHT
© 2024 Schorcht, Buchholtz and Baumanns. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The use of and research on the large language model (LLM) Generative Pretrained Transformer (GPT) is growing steadily, especially in mathematics education. As students and teachers worldwide increasingly use this AI model for teaching and learning mathematics, the question of the quality of the generated output becomes important. Consequently, this study evaluates AI-supported mathematical problem solving with different GPT versions when the LLM is subjected to prompt techniques. To assess the mathematics educational quality (content related and process related) of the LLM's output, we facilitated four prompt techniques and investigated their effects in model validations (N = 1,080) using three mathematical problem-based tasks. Subsequently, human raters scored the mathematics educational quality of AI output. The results showed that the content-related quality of AI-supported problem solving was not significantly affected by using various prompt techniques across GPT versions. However, certain prompt techniques, particularly Chain-of-Thought and Ask-me-Anything, notably improved process-related quality.

KEYWORDS
large language model, problem solving, prompt engineering, mathematics education, model validation, ChatGPT, Generative AI

1 Introduction
Concerning recent technological developments, the use of generative artificial intelligence
(AI) has become increasingly relevant for the teaching and learning of mathematics, especially
in problem solving (Hendrycks et al., 2021; Lewkowycz et al., 2022; Baidoo-Anu and Owusu
Ansah, 2023; Plevris et al., 2023). However, generative AI poses unique challenges in
mathematics educational settings. Although large language models (LLMs), such as the Generative
Pretrained Transformer (GPT), can already correctly process different and complex
mathematical inputs during mathematical problem solving, difficulties still arise in
presenting reliable, correct solutions, even for simple mathematics problems (Hendrycks et al.,
2021; Lewkowycz et al., 2022; Plevris et al., 2023; Schorcht et al., 2023). Therefore, the
respective AI-generated outputs should always be checked for accuracy and correctness.
Mathematical errors frequently occur that can cause harm when technology is used for

educational purposes (e.g., in creating worked-out examples for learning problem-solving skills).

Several changes have been made in recent months: GPT, the LLM on which ChatGPT is based, has been improved to its latest version, GPT-4. Although the mathematical performance of GPT-4 is supposed to be better than that of previous versions, such as GPT-3 or GPT-3.5 (OpenAI, 2023), this has not yet solved the challenges (Schönthaler, 2023). In addition, the mathematical quality of AI output plays a role in the teaching and learning of mathematics, as does the mathematics educational quality of the output, such as whether certain solution paths are comprehensible for learners and specific learning aids are considered. To influence LLMs' output, prompts (e.g., the specific structure of the input) are coming into focus. Studies show that prompts have a marked influence on the output's quality (Kojima et al., 2022; Arora et al., 2023; Wei et al., 2023). However, their influence on the mathematics educational quality of the output remains unexplored.

Herein, we scrutinize the current situation of GPT's mathematics educational capabilities in AI-supported problem solving and analyze the use of prompt techniques to enhance its capabilities using varying inputs (prompts). By revealing the capabilities of generative AI and discerning how prompt techniques can be used specifically in mathematics educational contexts, we want to contribute to a better understanding of the functionality of AI support in education. In the following theoretical background, we elaborate on the use of AI and LLMs in mathematics educational settings, such as students' problem-solving activities, and discuss their challenges in mathematics. We continue to illustrate two ways these challenges can be addressed by implementing quality assessments through human raters and the targeted use of prompt techniques to improve the reliability of the output generated by LLMs. In our research questions, we ultimately examine how this affects the mathematics educational quality of AI-supported problem solving (Section 2). Subsequently, this study's methodology is introduced, followed by a description of how data were collected and analyzed (Section 3). The data collection involved entering prompts for three problem-solving tasks across four prompt technique scenarios, utilizing three GPT versions, resulting in N = 1,080 data points for model validation (Schorcht and Buchholtz, 2024). The outcomes were evaluated based on six mathematics educational quality criteria. Our findings provide preliminary insights into the efficacy of certain prompt techniques and GPT versions (Section 4). We conclude our study by examining the implications of these results for employing GPT for mathematics education and the teaching and learning of problem solving (Section 5).

2 Theoretical background

2.1 Generative AI in educational settings

GPT is a generative AI-based LLM that understands human language and can process images via the ChatGPT interface. It automatically completes and processes human input (prompt) using stochastic processes (Hiemstra, 2009; Hadi et al., 2023). Similar to its predecessors, the current GPT-4 model was built on extensive training data. The analysis of these training data pursues the goal of pattern and relationship recognition to generate appropriate human-like responses to human input. The size of the training data of the predecessor model GPT-3, published in 2020, amounts to 175 billion parameters (Floridi and Chiriatti, 2020). To determine and compare their abilities, LLMs are subjected to tests developed for humans, among others, after inputting training data. For example, OpenAI tested GPT-4 with the SAT Evidence-Based Reading & Writing and the SAT Math Test, among others. Both tests are used primarily in the U.S. to certify a person's ability to study. In the language test, the generative AI language model scored 710 out of a possible 800 points; in the mathematics test, it scored 700 out of 800 points. Unlike the results of GPT-3.5 (SAT Reading & Writing: 670 out of 800; SAT Math: 590 out of 800), GPT-4 improved its already-remarkable score here, especially in mathematics (OpenAI, 2023).

The potential and the challenges of LLMs, such as GPT, in school educational contexts and university teaching are currently discussed intensely and controversially (e.g., Lample and Charton, 2019; Floridi and Chiriatti, 2020; Baidoo-Anu and Owusu Ansah, 2023; Buchholtz et al., 2023; Cherian et al., 2023; Fütterer et al., 2023; Kasneci et al., 2023; Schorcht et al., 2023). Some experts emphasize opportunities for using generative AI, such as the greater personalization of learning environments or adaptive feedback to learning processes. For example, AI can help design school courses or assist teachers (Miao et al., 2023). Teachers worldwide are starting to use AI tools, such as voice recognition and translation software, to help students with special needs, those who speak multiple languages, or anyone who benefits from a more tailored learning approach. This makes it easier for these students to be involved in and succeed in class (Cardona et al., 2023).

Whether generative AI suitably supports students in acquiring problem-solving skills despite the improved mathematical abilities of AI remains an open question. Initial studies examined the abilities of LLMs regarding the correctness of mathematical solutions (Hendrycks et al., 2021; Lewkowycz et al., 2022; Frieder et al., 2023). Here, LLMs' hallucinations continue to cause problems despite improvements in quality (Maynez et al., 2020; Ji et al., 2023; Plevris et al., 2023; Rawte et al., 2023; Schorcht et al., 2023). However, the correctness of a solution alone is not yet a sufficient quality criterion in the school context of problem solving regarding the benefits of LLMs for acquiring problem-solving skills. Rather, the aim is for students to understand the problem-solving process as a mathematical practice and to develop strategies for solving problems that are transferable to other problems (Pólya, 1957; Schoenfeld, 1985). Following the seminal works of Schoenfeld (1985) and Pólya (1957), part of the mathematical problem-solving process is understanding a mathematical problem by retrieving the necessary information and making sense of the problem and its conditions. Recent overviews on problem solving in mathematics education highlight the central aspect of the exploration of strategies (e.g., working backward, solving similar easier problems, changing representation), since problem solving is characterized by not having a known algorithm or method for solving a task. Accordingly, a transformation in representation can facilitate other strategies for solving problems; however, such strategies emerge only if the solver possesses a profound understanding of the mathematical content of a problem, so altering the representation is a viable option (Hiebert and Carpenter, 1992; Prediger and Wessel, 2013). Being able to seek strategies, perceive them as suitable for solving a problem, and apply them requires self-regulatory or metacognitive skills (Artzt and Armour-Thomas, 1992; Schoenfeld, 1992). This also implies the ability to reflect at the end of the problem-solving process, in which
the solution is reviewed, other solutions are considered, and applications to other problems are reflected on (Pólya, 1957; Schoenfeld, 1985).

However, critical aspects are observed concerning the use of AI in educational settings, such as data protection challenges or dependence on technology in education (Cardona et al., 2023; Miao et al., 2023; Navigli et al., 2023). There are also currently still difficulties in using AI for classroom problem solving and in acquiring problem-solving skills. LLMs can certainly generate solutions to problem-solving tasks, but how a solution is attained is largely a black-box process, not comprehensible to students. However, this is precisely where an educational approach should start so students can use AI for learning processes.

2.2 AI's black-box problem

The so-called black-box problem plays a prominent role in ongoing discussions about the threats of technology (Herm et al., 2021; Franzoni, 2023). The black-box problem refers to the lack of transparency and interpretability in the decision-making processes of many AI systems. This problem arises because AI, especially in complex models, such as deep learning, often operates through intricate algorithms that humans do not easily understand (Herm et al., 2021). In educational contexts, this may lead, for example, to the reproduction of unconscious biases and other (human) errors that an LLM adopts from training data, which can cause unfairness and harm to vulnerable groups of students, such as in assessments or decisions about personalized learning (Buchholtz et al., 2023; Navigli et al., 2023). Thus, educators and students might find it difficult to trust an AI system in problem solving if they do not understand how it operates or makes decisions. This lack of trust could ultimately hinder the adoption and effective use of AI in education (Franzoni, 2023).

Although LLMs improved their performance during the last two years in processing mathematical input and solving mathematical problems (Hendrycks et al., 2021; Lewkowycz et al., 2022; Frieder et al., 2023; OpenAI, 2023; Plevris et al., 2023; Schönthaler, 2023), reservations about the benefits of generative AI language models in educational settings still exist, especially for mathematics teaching and learning. The lack of verifiability of mathematical solution paths created by LLMs and the lack of mathematical accuracy of the outputs produced by the models' inherent data "hallucinations" add another layer to the intransparent black-box problem, which becomes all the more relevant in mathematics educational settings and for more difficult mathematical problems solved with AI. Even with a current LLM, such as GPT-4, it is only possible in certain cases to reliably determine the mathematical answer to a task (Frieder et al., 2023; Plevris et al., 2023; Yuan et al., 2023).

Despite this problem, few empirical approaches validate and assess the output of LLMs using mathematics educational criteria. To date, the (currently still) manageable number of studies has mainly explored and evaluated the mathematical capabilities of LLMs. Even here, studies reach ambiguous findings. For example, Cherian et al. (2023) presented neural networks and second-graders with math problems from the U.S. Kangaroo Competition and compared their performance. In this study, AI performance still lagged behind human performance. Plevris et al. (2023) tested three versions of LLMs with 30 mathematical problems in a comparison in which they still identified many incorrect answers in several trials, especially for more complex mathematical problems. Frieder et al. (2023) investigated ChatGPT's mathematical skills. They evaluated its capabilities by using open-source data and comparing GPT versions. The study also aimed to explore whether ChatGPT can generally support professional mathematicians in everyday scenarios. Similarly, Wardat et al. (2023) investigated the mathematical performance of GPT models using mathematical tasks, such as solving problem-based tasks, solving equations, or determining the limits of functions. Similar to the other studies, both groups of authors concluded the mathematical abilities of ChatGPT were insufficient for more complex mathematical tasks. Partly contrary to previous findings, Schorcht et al. (2023) explored prompt techniques for optimizing ChatGPT's output when solving arithmetic and algebraic word problems. They found arithmetic tasks caused GPT-4 almost no difficulties. The systematic use of prompt techniques, such as Chain-of-Thought prompting (Kojima et al., 2022; Wei et al., 2023) or Ask-me-Anything prompting (Arora et al., 2023), led to notable improvements in the LLM's mathematical performance in some cases for algebraic problems.

Against these mixed results concerning LLMs' mathematics performance, it seems even more important to deal productively and critically with LLMs in mathematics educational settings. For mathematics educational use cases of LLMs, such as the semi-automated planning of lessons (Huget and Buchholtz, 2024) or the teaching of problem-solving skills (Schorcht and Baumanns, 2024), this seems insufficient, because the models' outputs are further processed and must meet mathematics educational quality criteria for valid use.

2.3 Explainable AI

One way of dealing with the AI's black-box problem is the Explainable AI research approach, in which we situate our study. This research makes complex AI systems more transparent and understandable. The goal is to transform these black-box models into "gray-box" models (Gunning et al., 2019). A gray-box model is a middle ground in which the AI still performs at a high level; however, its decision-making process is more understandable to humans. One key idea of Explainable AI is to create a local explainability of the model, which means understanding the reasons AI made a particular decision or prediction and assessing its accuracy (Adadi and Berrada, 2018; Arrieta et al., 2020). In essence, Explainable AI aims to open AI models for scrutiny, allowing users to understand and trust the decisions made by these systems and to use this understanding in practical applications. Two factors in our study that abet the Explainable AI approach are the use of human raters and the application of prompt engineering.

2.3.1 Quality assessment of AI outputs by human raters

First, the use of human raters to assess the quality of AI output and validate the LLM models targets the output of the AI models (Qiu et al., 2017; Maroengsit et al., 2019; Rodriguez-Torrealba et al., 2022; Küchemann et al., 2023). Human expert involvement is crucial for ensuring that the explanations generated by AI systems are accurate, relevant, understandable, and ethically sound. This collaboration between human expertise and AI can lead to more robust, reliable, and trustworthy AI systems. In model validations that consider
human assessments, a series of tests involves entering several, often controlled, modified prompts into the generative AI language model to investigate through these simulations whether the model responds consistently and sensibly to the same query (Schorcht and Buchholtz, 2024). The outputs were then subjected to a criteria-oriented comparative rating by experts concerning their quality, depending on the question (Qiu et al., 2017), to assess the usability of the outputs.

FIGURE 1
A Zero-Shot-Scenario (left) and a Few-Shot-Scenario (right), according to Liu et al. (2021).

The study presented here is based on human raters for model validation. In this process, a prompt is repeatedly entered into the LLM. The resulting outputs of the chat are then collected as data. All collected outputs of prompts can then be evaluated by human experts concerning their mathematics educational quality. This process entails assessing the output to comprehend its quality, employing diverse criteria specific to the LLM's evaluation. In a problem-solving context, the mathematics educational quality of problem solutions can be assessed using content-related criteria, such as whether a solution considers all given information, is clearly understandable, or is correct. Conversely, mathematics educational quality can also be assessed by process-related criteria, such as whether a solution contains elements of strategies for finding a solution, such as heuristics and representational changes. Finally, metacognitive aspects, such as reflection on a solution, also play a role (Pólya, 1957; Schoenfeld, 1992).

2.3.2 Prompt engineering of AI inputs

Second, by carefully designing prompts and applying prompt engineering techniques, LLMs can be directed not only to provide answers but also to include explanations for those answers. This can be used to gain insights into their reasoning processes and to identify areas where their explainability needs improvement. Especially in educational settings where AI supports learners, this can add to the transparency of the output and can be used as a strategy to direct the model's answers, for example when the output is used for personalized learning. In this study, we employ several prompt techniques for problem solving, each altering the nature of the input and, hence, the output. These include Zero-Shot prompting (Brown et al., 2020; Kojima et al., 2022) and Few-Shot prompting, which provide AI with no or few examples, respectively. Additionally, we use Chain-of-Thought prompting (Liu et al., 2021; Wei et al., 2023), which encourages AI to elaborate on its reasoning process, and a unique form of Ask-me-Anything prompting (Arora et al., 2023) designed to elicit comprehensive responses through AI's questions.

In many cases, prompt techniques can be used for inputs without special training in the LLM and can provide reasonable outputs for simple mathematical questions. These inputs are called Zero-Shot-Scenarios because the prompt input does not contain additional training data (Figure 1). The LLM then gives an appropriately probabilistic answer based on its original general training data, which forms the baseline of prompt techniques in our study. However, there are also techniques for using accurate training data in prompts to train the model with input and optimize output. These training data already comprise very few examples for model building, called a Few-Shot-Scenario (Figure 1). In this case, the LLM uses the input training data in the prompt, as in a worked-out sample (Renkl, 2002), to provide a corresponding answer with higher accuracy (Brown et al., 2020; Reynolds and McDonell, 2021; Dong et al., 2022). Correspondingly, training-based prompt techniques can be useful when there is insufficient training data to generate a required output on an input with reliable accuracy, such as in mathematical problems. Initial studies show increased accuracy in answering mathematical questions (Liu et al., 2021; Drori et al., 2022; Schorcht et al., 2023). Accordingly, we assume that the use of Few-Shot-Scenarios should be particularly effective in guiding LLMs in the right direction when solving mathematical problems concerning solutions' accuracy and comprehensiveness.

Another technique that is particularly helpful in educational settings can guide LLMs to form a chain of thought and render an output in a structured way via the input of Follow-up prompts. This form of prompt engineering is called Chain-of-Thought prompting and refers to a series of intermediate steps of a linguistically formulated way of reasoning that leads to a final output (Wei et al., 2023). Kojima et al. (2022) give the example of completing Zero-Shot prompts with "Let us think step by step." Ramlochan (2023) claims that even better solutions result from adding, "Let us go step by step to make sure we have the right answer." By replacing the simple output of the result in this case with a detailed output of a solution path, LLMs lead to
significantly better and, above all, more frequently correct solutions, especially for mathematical questions (Wei et al., 2023).

Furthermore, using the technique of Ask-me-Anything prompting, Arora et al. (2023) propose encouraging a generative AI language model to ask questions back to the user. This involves giving a context, making an assertion, and having the generative AI classify the assertion as true or false. By alternately inputting an assertion and a question in a Few-Shot-Scenario, Arora et al. (2023) trained an AI system to ask questions.

With these possibilities of prompt engineering and new capabilities of the existing GPT models, teachers and students in educational settings have a wide range of opportunities for problem solving. For example, these techniques make problems more accessible to learners. This requires that the solutions created by AI are formally correct and clearly understandable. In addition, the techniques can provide hints for solution strategies, for example. Accordingly, in our study, we assume that Chain-of-Thought and Ask-me-Anything prompting give better results than Zero- or Few-Shot-Scenarios when solving mathematical problems.

2.4 Research questions

This study compares the mathematics educational quality of AI-supported problem solving by applying the prompt techniques in varied GPT versions. This approach must be multifaceted, as the educational quality of problem solving contains different dimensions (Schoenfeld, 1985, 1992). While the accuracy and comprehensiveness of solutions are fundamental, they alone do not suffice as quality metrics for mathematical problem solving. This study considers not only aspects of the solution (content-related criteria) but also the processes leading to this solution (process-related criteria). This approach aims to yield insights into the effectiveness of prompt techniques and the reliability of LLMs in the domain of mathematics education. The research focuses, therefore, on three primary questions:

RQ1: How do variations in prompt techniques affect the content-related quality of AI-supported problem solving provided by GPT concerning the specificity, clarity, and correctness of the solutions?

RQ2: How do variations in prompt techniques affect the process-related quality of AI-supported problem solving provided by GPT concerning the strategies mentioned, representations used, and reflection in the solutions?

RQ3: To what extent does the quality of AI-supported problem solving vary across GPT versions?

3 Materials and methods

3.1 Data collection

A systematic variation of problem-based tasks (Section 3.1.1), prompt techniques regarding problem solving in mathematics educational settings (Section 3.1.2; cf. Schorcht et al., 2023), and GPT versions (Section 3.1.3) was used for data collection.

3.1.1 Problem-based tasks

This study analyzed three problem-based tasks, each with a slightly different focus. This variety ensured that the results, according to the prompt techniques used, were transferable to a range of problem tasks. The chosen problems are known internationally and go back to famous scholars in mathematics and educational psychology. They offer task variables that represent disparate ways of thinking in the mathematical problem-solving process. Their distinction lies in the respective heuristic strategies involved and the mathematical skills they demand (Goldin and McClintock, 1979; Liljedahl et al., 2016; Liljedahl and Cai, 2021). For the initial problem, a hybrid strategy of moving both forward and backward is necessary, besides the application of measurement principles. In contrast, the second problem presents creativity and flexibility in its solution methods, necessitating the use of algebraic calculations. The final problem primarily relies on basic arithmetic operations and traditional backward calculations to reach a solution.

The first problem we term the pails problem. It originates from Pólya (1957, p. 226) and demands understanding measurements. "How can you bring up from the river exactly six quarts of water when you have only two containers, a four quart pail and a nine quart pail, to measure with?" This problem can be solved by combining working forward and working backward. First, this means filling up the nine quart pail. From these nine quarts, four quarts are scooped out twice, leaving one quart. This one quart is then transferred to the four quart pail, leaving room for only three more quarts in it. By filling the nine quart pail again and then pouring three quarts into the four quart pail, six quarts remain in the nine quart pail.

We refer to the second problem as the car problem (Cooper and Sweller, 1987, p. 361), which serves as an algebraic problem-solving exercise. "A car travels at the speed of 10 kph. Four hours later a second car leaves to overtake the first car, using the same route and going 30 kph. In how many hours will the second car overtake the first car?" The problem can be solved, for example, by constructing a system of linear equations and applying either the elimination method or the substitution method to solve the equations, as well as by working iteratively with a table and trying different values. If the first car starts at 10 kph, it will be 40 km from the starting point after four hours. The second car, starting from the same point, travels at three times the speed of the first car. Thus, when both cars move simultaneously, the second car makes up 20 km per hour. With a head start of 40 km, the second car should overtake the first car after two hours.

We term the third task the orchard problem. This is attributed to de Pisa (1202) and requires a backward-solving approach. "A man entered an orchard through 7 gates, and there took a certain number of apples. When he left the orchard, he gave the first guard half the apples he had and 1 apple more. To the second guard, he gave half his remaining apples and 1 apple more. He did the same to each of the remaining five guards and left the orchard with 1 apple. How many apples did he gather in the orchard?" This problem requires arithmetic backward calculation. Starting from the remaining apple, it is calculated how many apples the man had before each gate. Accordingly, going backward, the number of apples after passing each gate is increased by one and then doubled. Therefore, before the last gate, the man had 2 × (1 + 1) = 4 apples. Before the second-last gate, he had 2 × (4 + 1) = 10 apples, and so on. In total, the man had 382 apples before the first gate.


3.1.2 Prompt techniques

The three problem-based tasks were utilized in a series of tests to validate the model (Schorcht and Buchholtz, 2024). The following four variants of prompt techniques in mathematical problem solving were used and modified for our study (Schorcht et al., 2023). Every prompt technique starts with the problem and is expanded according to its specifications:

3.1.2.1 Zero-Shot-Scenario

In the Zero-Shot-Scenario, the problem-based tasks were entered into GPT without additional inputs.

3.1.2.2 Chain-of-Thought

In the Chain-of-Thought-Scenario, GPT tends to offer a more nuanced depiction of the solution process, organizing output with subheadings such as "Step 1," "Step 2," and "Step 3." These intermediate steps assist the generative AI language model to produce an accurate solution. This suggests that this modification might significantly enhance GPT's performance in straightforward problem-solving tasks involving the calculation of basic computational steps. Thus, in our study, the three problem-based tasks were followed with "Let us go step by step to make sure we have the right answer."

3.1.2.3 Ask-me-Anything

Building upon the concept proposed by Arora et al. (2023), our approach in the Ask-me-Anything-Scenario diverges by enhancing the prompts with the simple directive to ask questions: "Ask me anything you need to answer the prompt." The user application is modified so that instead of the user posing questions to the LLM, the model now generates questions for the user. With the addition of "… and wait for my input," we enforced a step-by-step approach to avoid long outputs before asking questions. If GPT asked a question, the answer was kept to a minimum. For example, only "yes" or "no," clarifying, or simple one-word answers were given, such as "Yes, proceed."

3.1.2.4 Few-Shot-Scenario

In the Few-Shot-Scenario, an additional task with a given solution was introduced to GPT alongside the primary problem-solving task to increase the probability of the correct output. For the three problem-centered tasks, a similar problem was selected that differs only in context and/or mathematical details. The original problem-solving task was then presented after the related problem and its answer in the prompt. For the pails problem, we used a related problem-based task with a solution provided by Pólya (1957, p. 226).

3.1.2.4.1 Problem 1

How can you bring up from the river exactly five quarts of water when you have only two containers, a four quart pail and a nine quart pail, to measure with?

3.1.2.4.2 Answer 1

"We could fill the larger container to full capacity and empty so much as we can into the smaller container; then we could get 5 quarts" (Pólya, 1957, p. 226).

A problem based on Cooper and Sweller (1987, p. 361) was chosen as a distinct yet structurally similar problem for the algebraic problem-solving task:

3.1.2.4.3 Problem 2

"A car travelling at a speed of 20 kph left a certain place at 3:00 p.m. At 5:00 p.m. another car departed from the same place at 40 kph and travelled the same route. In how many hours will the second car overtake the first?"

3.1.2.4.4 Answer 2

"The problem is a distance-speed-time problem in which Distance = Speed × Time. Because both cars travel the same distance, the distance of the first car (D1) equals the distance of the second car (D2). Therefore

D1 = D2 or v1 × t1 = v2 × t2,

where v1 = 20 kph, v2 = 40 kph, and t1 = t2 + 2 hours. Substituting gives the following:

20 × (t2 + 2) = 40 × t2
20t2 + 40 = 40t2
20t2 = 40
t2 = 2 hours"

(Cooper and Sweller, 1987, p. 361).

For the third problem, a task from Salkind (1961) was chosen that shows a partition situation instead of a distribution situation but also suggests calculating backwards. In addition, the problem chosen does not give a single-element solution but a complete solution set.

3.1.2.4.5 Problem 3

"Three boys agree to divide a bag of marbles in the following manner. The first boy takes one more than half the marbles. The second takes a third of the number remaining. The third boy finds that he is left with twice as many marbles as the second boy." (Salkind, 1961, p. 40).

3.1.2.4.6 Answer 3

"The first boy takes (n/2) + 1 marbles, leaving (n/2) − 1 marbles. The second boy takes (1/3) × ((n/2) − 1). The third boy, with twice as many, must necessarily have (2/3) × ((n/2) − 1), so that n is indeterminate; i.e., n may be any even integer of the form 2 + 6a, with a = 0, 1, 2, …" (Salkind, 1961, p. 116).


3.1.3 Variation of GPT versions

This study used three GPT versions: GPT-3.5, GPT-4, and GPT-4 with the Wolfram plugin. Data for all three variants were collected from 25th September to 26th October 2023. Even though only a few performance comparisons between GPT versions exist, initial explorative studies have already investigated the performance of GPT versions in mathematical performance (Cherian et al., 2023; OpenAI, 2023; Plevris et al., 2023), with the studies unanimously attesting to GPT-4's better mathematical performance than GPT-3 or GPT-3.5. Additionally, since March 2023, the GPT-4 version of ChatGPT can be extended with the Wolfram plugin (Spannagel, 2023), which can specifically access mathematical data. The plugin is based on access to the Wolfram Alpha online platform and the Wolfram Language System (Wolfram, 2023). With the command "Use Wolfram," ChatGPT translates the prompt into Wolfram Language and sends a request to the platform. Wolfram calculates the required output and returns it back to ChatGPT in the form of a URL. The AI language model then displays the output in the chat, and the translation can be viewed via the "Wolfram used" button. This technology can be used for solving mathematical equations, approximating, and plotting functions and diagrams.

The data collection in our study can be summarized as follows: Each of the four prompt techniques was applied to each of the three problems (pails problem, car problem, orchard problem) 30 times. This was repeated for all three tested GPT versions. The investigation of 30 repetitions aimed to obtain valid insight into the output quality of ChatGPT under the given problem-solving conditions. Before a prompt was entered into ChatGPT, ChatGPT was started with a new chat to counteract GPT's built-in Few-Shot learning and to recalibrate the system with each test. The single dialogues were collected as data points, resulting in a dataset of N = 1,080 ChatGPT dialogues (4 × 3 × 30 × 3).

3.2 Data analysis

For the data analysis, the systematically collected GPT chats were subjected to a human expert rating applying mathematics educational quality criteria tailored to the research questions (Section 3.2.1). The ratings were then systematically analyzed quantitatively (Section 3.2.2).

3.2.1 Model validation by human expert rating

Our methodical approach for rating problems' solutions builds on Qiu et al. (2017), Rodriguez-Torrealba et al. (2022), and Küchemann et al. (2023), who used expert human raters to evaluate the output generated by AI to validate AI models. We therefore used a respective model validation approach (Schorcht and Buchholtz, 2024) to rate the problem solutions of GPT versions along the problem-based tasks. In line with RQ1, two trained student raters were instructed to specifically consider aspects of the solutions' comprehensiveness, such as the solutions' specificity, clarity, and correctness (Küchemann et al., 2023). To address RQ2 and the solutions' properties concerning problem-solving processes (Schoenfeld, 1985, 1992), the raters were to rate strategy-related aspects, GPT's use of changes of representation, and indications of reflection. Table 1 gives an overview of the rating categories and their respective indications. At the time of rating, both raters were in their seventh and fifth university semesters of the mathematics teaching degree program and were attending seminars and lectures that discussed aspects of problem solving.

The following exemplifies how a solution to the orchard problem was evaluated. The solution depicted in Figure 2 contains all the necessary information ("7 gates", "half the apples he had and 1 apple more", "left the orchard with 1 apple"). Therefore, the solution is specific to all task properties. All the sentences contained in the solution are relevant to the process of solution, and there is no essential part that is missing, so the output was rated as clear. In addition, the solution is correct. This leads to a score of 1 for the correctness criterion. Concerning strategy, the output in Figure 2 was also interpreted as using strategies, therefore scoring 1. The change in representation from a written text to an equation was assessed as a conversion, meaning the criterion of representation applies. The example lacks a retrospective analysis of the solution, resulting in a reflection score of 0.

To assess the coding's reliability, 33.3% of the reports (N = 1,080) underwent double coding. Each criterion was coded dichotomously. The intercoder reliability was generally satisfactory, with an average Cohen's kappa (κ) of 0.99 (SD = 0.04, minimum κ = 0.81, maximum κ = 1). Any differences were resolved through consensus among the coders.

3.2.2 Statistical analysis

3.2.2.1 Bivariate analysis

To identify distinctions in the six evaluation criteria (specificity, clarity, correctness, strategy, representation, and reflection) for the three tasks (pails, car, and orchard problem), the three GPT versions (GPT-3.5, GPT-4, and GPT-4 with plugin Wolfram), and the four prompt techniques (Zero-Shot-Scenario, Chain-of-Thought, Ask-me-Anything, and Few-Shot-Scenario), we analyzed contingency tables. These contingency tables show for each task how often the respective evaluation criteria were coded for all combinations of prompt technique and GPT version. To determine differences in the frequencies, we used the exact Fisher–Freeman–Halton test (Freeman and Halton, 1951). This test for r × c contingency tables provides the exact p-value. The Fisher–Freeman–Halton test is suitable for contingency tables for which over 20% of cells have expected frequencies below five, where using the chi-squared test is inadequate. This was the case for some tables.

3.2.2.2 Logistic mixed-effects model

A logistic mixed-effects model was employed in R to address the research question regarding the impact of varying prompt techniques and GPT versions on the content- and process-related quality of AI-supported problem solving (Agresti, 2012). This statistical approach was chosen due to the dichotomous nature of the outcome variables (specificity, clarity, correctness, strategy, representation, and reflection) and the repeated measures design of the study. The analysis was conducted separately for each evaluation criterion that was affected by significant frequency differences indicated by the Fisher–Freeman–Halton test. To analyze the effect of the prompt techniques and GPT versions, two models were fitted for every content- and process-related evaluation criterion, one considering the effect of different prompt techniques (Zero-Shot-Scenario, Chain-of-Thought, Ask-me-Anything, and Few-Shot-Scenario) and the other focusing on
the influence of GPT versions (GPT-3.5, GPT-4, and GPT-4 with plugin Wolfram).

Statistical analyses were conducted in R Version 4.3.2 using the fisher.test function for the Fisher–Freeman–Halton test, the glmer function for the logistic mixed-effects model (Bates et al., 2015), and the emmeans function for post-hoc analysis (Lenth et al., 2019). For post-hoc analysis, p values were adjusted using Bonferroni–Holm correction.
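Expressed in code, the reported pipeline might look as follows. This R sketch uses the functions named above (fisher.test from base R, glmer from lme4, and emmeans); the data frame ratings and its column names are our assumptions for illustration, not objects from the study:

    library(lme4)     # glmer() for logistic mixed-effects models
    library(emmeans)  # emmeans() for post-hoc pairwise contrasts

    # Exact test on an r x c contingency table; for tables larger than 2 x 2,
    # fisher.test() computes the Fisher-Freeman-Halton extension.
    fisher.test(table(ratings$prompt_technique, ratings$correctness))

    # One dichotomous criterion modeled against the prompt technique, with a
    # random intercept over the repeated task-by-condition cells.
    fit <- glmer(correctness ~ prompt_technique + (1 | cell),
                 data = ratings, family = binomial)

    # Pairwise comparisons with Bonferroni-Holm adjustment of p values.
    emmeans(fit, pairwise ~ prompt_technique, adjust = "holm")

An analogous model with the GPT version as predictor would yield the second set of estimates reported in Tables 2 and 3.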
TABLE 1 Mathematics educational quality criteria.

Content-related quality
- Specificity: This criterion indicates whether the solution contains all relevant information, such as variables and descriptions to solve the problem. If this is the case, it is rated with 1; if there is any information missing, it is rated with 0.
- Clarity: This criterion indicates whether the solution is formulated clearly and concisely. If this is the case, it is rated with 1; if there are phrases included that are not relevant for the solution process or essential parts for the solution process are missing, it is rated with 0.
- Correctness: This criterion indicates whether the solution is correct. If this is the case, it is rated with 1; if the solution is incorrect, it is rated with 0.

Process-related quality
- Strategy: This criterion indicates whether the solution shows heuristic descriptions of the approach. If this is the case, it is rated with 1; if no descriptions are found or just a step-wise solution is presented, it is rated with 0.
- Representation: This criterion indicates whether the solution contains a conversion between different representations (from word to other, e.g., a function graph or an equation). If this is the case, it is rated with 1; if no changes between representations are found (including word to number), it is rated with 0.
- Reflection: This criterion indicates whether the solution contains a look back at what the AI has done. If this is the case, it is rated with 1; if no look backs are found, it is rated with 0.

4 Results

Regarding RQ1, RQ2, and RQ3, we first present the descriptive results of the frequencies for each evaluation criterion in a comprehensive manner (see Figures 3, 4). Concerning RQ1, Figure 3 depicts contingency tables for the three tasks (pails, car, and orchard problem) for the content-related evaluation criteria specificity, clarity, and correctness for the different GPT versions (RQ3), along with the four prompt techniques. We performed a Fisher–Freeman–Halton test to determine disparities in the frequencies for the content-related evaluation criteria regarding the prompt technique and the GPT version. The Fisher–Freeman–Halton test indicated significant differences in the distribution of the clarity (p = 0.025) and correctness (p = 0.000) criteria for the pails problem, and the clarity (p = 0.000) and correctness (p = 0.001) criteria for the orchard problem. The Fisher–Freeman–Halton test indicated no significant differences in the distribution of the other contingency tables (pails × specificity: p = 1; car × specificity: p = 1; orchard × specificity: p = 1; car × clarity: p = 0.727; car × correctness: p = 0.986). However, we observe ceiling effects for the specificity criterion, as this criterion was almost always fulfilled in the answers to the pails problem and the car problem, and at least in the more advanced GPT versions for the orchard problem.

We fitted logistic mixed-effects models to analyze if and how the two content-related evaluation criteria clarity and correctness, which showed significant differences in their distribution, were affected by varying prompt techniques or GPT versions. The logistic mixed-effects model analysis was conducted across all tasks, rather than for each task individually, to exclude potentially influential outliers represented by unfilled cells in the contingency tables. Each combination of task, prompt technique, and GPT version was entered identically 30 times. These 30 entries were included in the model as repeated measures. Concerning RQ1, the logistic mixed-effects model indicated no significant interaction between the prompt techniques and the clarity or correctness criteria. However, for the GPT version, the logistic mixed-effects model indicated a significant interaction regarding clarity and correctness (see Table 2).

In examining the effects of diverse GPT versions on the content-related evaluation criteria, post-hoc analyses were conducted to assess pairwise differences, adjusting for multiple comparisons using the Bonferroni–Holm method. For both evaluation criteria clarity and correctness, GPT-3.5 revealed a noticeable decrease in performance compared to GPT-4 (clarity: Log Odds Ratio (LOR) = −2.63, Standard Error (SE) = 0.83, z-score = −3.17, p-value = 0.005; correctness: LOR = −3.19, SE = 1.14, z = −2.79, p = 0.016) and GPT-4 with the Wolfram plugin (clarity: LOR = −2.18, SE = 0.83, z-score = −2.63,
p-value = 0.017; correctness: LOR = −2.82, SE = 1.13, z-score = −2.49, p-value = 0.025). Concerning RQ3, this means that, statistically, GPT-4 and its version with the Wolfram plugin were significantly more effective in providing solutions to mathematics problems that were rated as being clear and correct than GPT-3.5. The comparison between GPT-4 and GPT-4 with the Wolfram plugin, however, did not yield a statistically significant difference (LOR = 0.449, SE = 0.796, z = 0.564, p = 0.5726).

FIGURE 2
GPT-4 output of the orchard problem under the Chain-of-Thought-Scenario.

Concerning RQ2, Figure 4 visualizes the contingency tables for the three tasks of the process-related evaluation criteria strategy, representation, and reflection for the assorted GPT versions (RQ3), alongside the four prompt techniques. The reflection criterion showed such low frequencies for the pails problem and the car problem that no statistical analysis was conducted here. This criterion hardly occurred at all in these two tasks in ChatGPT's output. However, a Fisher–Freeman–Halton test indicated significant differences in the distribution of the criteria strategy (p = 0.013) and representation (p = 0.023) for the pails problem. The Fisher–Freeman–Halton test indicated no significant differences in the distribution of the other contingency tables (car × strategy: p = 0.066; car × representation: p = 1; orchard × strategy: p = 0.910; orchard × representation: p = 0.884; orchard × reflection: p = 0.061). Again, only the process-related criteria strategy and representation were used for the logistic mixed-effects model analysis. Concerning RQ2, for the prompt technique, the logistic mixed-effects model indicated a significant interaction regarding strategy but no significant interaction with representation (see Table 3). In examining the effects of various prompt techniques on the process-related evaluation criteria, post-hoc analyses were conducted to assess pairwise differences, adjusting for multiple comparisons using the Bonferroni–Holm method. For strategy, the Few-Shot-Scenario displayed a significant decrease in performance compared to Chain-of-Thought (LOR = 11.12, SE = 3.14, z = 3.54, p = 0.002) and Ask-me-Anything (LOR = 10.55, SE = 3.22, z = 3.28, p = 0.005). No other significant differences were found. Concerning RQ3, the logistic mixed-effects model indicated no significant interaction effect between the GPT version and the process-related evaluation criteria strategy or representation.

In addition to the statistical analysis, we examined the frequency of GPT-4's use of the Wolfram plugin. As Table 4 discloses, the Wolfram plugin was frequently used to solve the orchard problem and the car problem, which is plausible because these problems can be solved in a more algebraic way. Its use decreased in the Few-Shot-Scenario. For the pails problem, the Wolfram plugin was not used a single time, independent of the prompt technique.

Our analysis revealed that, across all GPT versions, when employing the Ask-me-Anything prompting, GPT generated questions in response to 68 out of 270 prompts (refer to Table 5). We responded with simple words, such as "Yes, proceed." Notably, the GPT-4 version without the Wolfram plugin generated questions more frequently than the GPT-3.5 version and the GPT-4 version equipped with the Wolfram plugin. The pails problem and the car problem in particular triggered the LLM to pose questions when using the Ask-me-Anything prompting.
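The reported total can be cross-checked against the cell counts in Table 5. A minimal R tally (the matrix literal simply mirrors Table 5; n = 30 Ask-me-Anything prompts per cell):

    questions <- matrix(c(0, 22, 2,
                          0, 29, 1,
                          1,  9, 4),
                        nrow = 3, byrow = TRUE,
                        dimnames = list(c("pails", "car", "orchard"),
                                        c("GPT-3.5", "GPT-4", "GPT-4 + Wolfram")))
    sum(questions)       # 68 of the 9 x 30 = 270 prompts triggered questions
    colSums(questions)   # 1, 60, and 7: questions cluster in plain GPT-4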


FIGURE 3
Frequencies of the specificity, clarity, and correctness of the evaluation criteria on the answers by GPT-3.5 and GPT-4, with and without plugin
Wolfram, under the four prompt techniques and the three given tasks.

5 Discussion

Our study provides initial insights into how prompt techniques influence the AI-supported problem-solving capabilities of different GPT versions. Contrary to our expectations, regarding RQ1, the variation in the use of prompt techniques illustrated no significant effects on the content-related quality of solutions to AI-supported problems provided by distinct GPT versions (e.g., solutions' specificity, clarity, and correctness). Across all GPT versions, tasks, and prompts, the specificity of solutions, meaning how well all provided information was incorporated into the problem-solving process, was consistently high. However, the enhanced clarity of the solutions, specifically the thorough articulation of problem-solving steps, was high only in the car problem scenario. This suggests that LLMs may struggle more with generating comprehensive solutions for problems that demand more than straightforward computation, particularly those requiring more complex algebraic reasoning. This challenge is also apparent in the correctness of outputs, where the GPT versions' performance significantly declined when the problem-solving task extended beyond basic algebraic calculations. When contemplating the consequences of these results for the application of LLMs in mathematics teaching practice, our results indicate that problem-based tasks with straightforward computational operations might be solved well with the help of LLMs in the classroom. Nevertheless, we believe that when students need to develop problem-solving skills, it is not just about using LLMs; it is about learning and having to apply heuristics by themselves in a targeted manner. AI-supported problem solving, therefore, can probably be implemented with challenging problem tasks that surpass simple algebraic operations and require a heuristic approach or the application of more specific prompt techniques. LLMs can then support students in finding solutions. The aim of further research efforts should therefore be to identify precisely those problem-solving tasks that can and cannot be solved directly using LLMs.
FIGURE 4
Frequencies of the strategy, representation, and reflection of the evaluation criteria on the answers by GPT-3.5 and GPT-4, with and without plugin
Wolfram, under the four prompt techniques and the three given tasks.

Concerning RQ2, for the process-related quality of AI-supported as part of the prompt generally degraded the process-related quality
problem solving, significant variances were observed in the criteria of the output. This decline could be attributed to the selection of
strategy (the solution shows heuristic descriptions) when employing solved examples in the Few-Shot-Scenarios that were not structurally
prompt techniques, particularly in the pails problem scenario. Our identical, particularly noticeable in the orchard problem. Here, further
statistical analysis revealed that the choice of a certain prompt research efforts with clearly similar solved problems are necessary in
technique notably influences an LLM’s problem-solving strategy. order to exclude the Few-Shot-Scenario as a suitable prompt technique
Interestingly, the logistic mixed-effects model analysis revealed a for AI-supported problem solving based on evidence. However, our
significant interaction regarding the evaluation criterion strategy and research results also indicate a minor occurrence of reflection in
the prompt technique, highlighting the effectiveness of Chain-of- LLMs’ problem-solving processes for certain tasks. Although there
Thought and Ask-me-Anything over Few-Shot-Scenario in enhancing were isolated situations in which the GPT versions conducted a look
GPT versions’ strategic approaches to problem solving. This indicates back in the sense of problem solving during the generation process
that the prompt “Let us go step by step to make sure we have the right and identified their own problem solution as wrong, these situations
answer” with the enhancement of interaction options with the LLM were rare in our scenarios. The absence of significant differences in
by incorporating “Ask me anything you need to answer the prompt strategy and representation across GPT versions, contrary to the
and wait for my input” notably improves the visibility of strategies in results found for content-related evaluation criteria, also suggests that
problem solutions, which may be beneficial in the teaching and while the content-related quality of the LLMs’ output may have
learning of problem solving and as a reflection tool (Goulet-Lyle et al., improved, the process-related quality through which problems were
2020). In our experiments, surprisingly, providing a solved example approached and solved remained relatively consistent, unaffected by


TABLE 2 Logistic mixed-effects models of the effects of prompt techniques and GPT versions on the evaluation criteria clarity and correctness.

                         Clarity                                Correctness
Prompt technique         Odds ratios  CI              p        Odds ratios  CI               p
(Intercept)              0.44         [0.11, 1.82]    0.257     1.59         [0.21, 11.83]    0.652
Zero-Shot-Scenario       0.75         [0.10, 5.50]    0.775     1.16         [0.07, 19.70]    0.920
Chain-of-Thought         2.09         [0.28, 15.55]   0.470     0.78         [0.05, 13.40]    0.866
Few-Shot-Scenario        0.41         [0.05, 3.15]    0.392     0.25         [0.01, 4.28]     0.336
                         ICC = 0.57, Marginal R² = 0.043        ICC = 0.73, Marginal R² = 0.030

GPT version              Odds ratios  CI              p         Odds ratios  CI               p
(Intercept)              0.08         [0.02, 0.26]    <0.001    0.15         [0.03, 0.71]     0.017
GPT-4                    13.82        [2.72, 70.23]   0.002 **  24.36        [2.59, 229.16]   0.005 **
GPT-4 + plugin Wolfram   8.82         [1.74, 44.73]   0.009 **  16.77        [1.83, 153.89]   0.013 *
                         ICC = 0.52, Marginal R² = 0.162        ICC = 0.68, Marginal R² = 0.165

TABLE 3 Logistic mixed-effects models of the effects of prompt techniques and GPT versions on the evaluation criteria strategy and representation.

                         Strategy                                    Representation
Prompt technique         Odds ratios  CI                  p          Odds ratios  CI              p
(Intercept)              1458.49      [26.39, 80595.09]   <0.001     5.35         [0.20, 10.50]   0.042
Chain-of-Thought         1.77         [0.02, 180.38]      0.810      0.26         [−5.23, 5.75]   0.925
Few-Shot-Scenario        0.00         [0.00, 0.01]        0.001 **   −1.10        [−6.84, 4.65]   0.709
Zero-Shot-Scenario       0.00         [0.00, 0.66]        0.035 *    −1.34        [−6.95, 4.28]   0.640
                         ICC = 0.90, Marginal R² = 0.390             ICC = 0.88, Marginal R² = 0.016

GPT version              Odds ratios  CI                  p          Odds ratios  CI               p
(Intercept)              1.89         [0.03, 128.94]      0.767      85.84        [0.90, 8194.73]  0.056
GPT-4                    147.25       [0.12, 173795.37]   0.167      0.70         [0.00, 101.20]   0.890
GPT-4 + plugin Wolfram   26.82        [0.04, 18780.06]    0.325      6.58         [0.05, 791.51]   0.441
                         ICC = 0.92, Marginal R² = 0.092             ICC = 0.89, Marginal R² = 0.032


The very high odds ratios in the strategy criterion result from the fact that no strategy was coded for either GPT-3.5 or the Few-Shot-Scenario.
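As a reading aid for Tables 2 and 3: logistic models estimate effects on the log-odds scale, and the reported odds ratios and confidence intervals arise by exponentiating those estimates. The short sketch below illustrates this conversion with coefficient values chosen so that the output roughly matches the GPT-4 row for clarity in Table 2; the values are illustrative, not taken from our model output.

import math

# Illustrative log-odds estimate and 95% CI bounds (chosen to roughly
# reproduce the GPT-4 row for clarity in Table 2, OR = 13.82).
beta, ci_low, ci_high = 2.626, 1.001, 4.252

odds_ratio = math.exp(beta)
ci = (math.exp(ci_low), math.exp(ci_high))

# An odds ratio above 1 means the predictor (here: GPT-4 instead of the
# reference version GPT-3.5) raises the odds that a criterion is met.
print(f"OR = {odds_ratio:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")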

TABLE 4 Absolute frequencies (n = 30) of Wolfram plugin use by GPT-4 during the AI-supported solving of the pails problem, car problem, and orchard problem under the four prompt techniques.

                  Zero-Shot-Scenario   Chain-of-Thought   Ask-me-Anything   Few-Shot-Scenario
Pails problem     0                    0                  0                 0
Car problem       28                   29                 30                3
Orchard problem   29                   23                 26                17
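Frequency tables such as Table 4 lend themselves to exact tests on small samples (cf. Freeman and Halton, 1951). As a sketch of how one such comparison could be run, assuming scipy is available, the following code tests whether plugin use on the car problem differs between Chain-of-Thought and Few-Shot-Scenario; this is an illustration, not the analysis pipeline used in the study.

from scipy.stats import fisher_exact

# 2x2 contingency table for the car problem, counts taken from Table 4
# (n = 30 runs per technique): rows are prompt techniques, columns are
# (plugin used, plugin not used).
table = [
    [29, 1],   # Chain-of-Thought
    [3, 27],   # Few-Shot-Scenario
]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"OR = {odds_ratio:.1f}, p = {p_value:.2e}")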

TABLE 5 Absolute frequencies (n = 30) of questions asked by ChatGPT under Ask-me-Anything prompting during the AI-supported solving of the pails problem, car problem, and orchard problem with the three GPT versions.

                  GPT-3.5   GPT-4   GPT-4 with plugin Wolfram
Pails problem     0         22      2
Car problem       0         29      1
Orchard problem   1         9       4
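Because Ask-me-Anything invites the model to pose clarifying questions before answering (Table 5), applying it requires a short multi-turn exchange rather than a single completion call. The loop below sketches such an exchange; it again assumes the openai Python client, answer_question is a hypothetical stand-in for the student's or teacher's reply, and the question-detection heuristic is deliberately crude.

from openai import OpenAI

client = OpenAI()

def answer_question(question: str) -> str:
    # Hypothetical helper: in a classroom setting, the learner would
    # type the missing information here.
    return input(f"Model asks: {question}\nYour answer: ")

messages = [{
    "role": "user",
    "content": "<problem text> Ask me anything you need to answer the "
               "prompt and wait for my input.",
}]

content = ""
for _ in range(3):  # allow up to three clarification rounds
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    content = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": content})
    if "?" not in content:  # crude heuristic: no question means a final answer
        break
    messages.append({"role": "user", "content": answer_question(content)})

print(content)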


If we transfer the results to the application of LLMs in the school context, strategies in the solutions of LLMs become particularly visible through the two prompt techniques Chain-of-Thought and Ask-me-Anything. Learners who use LLMs for AI-supported problem solving should therefore employ these two techniques to gain insight into the solution procedure. Furthermore, Ask-me-Anything can aid in comprehending the limitations of LLMs, following the idea of Explainable AI. For example, underdetermined instructions from the students to the AI might become visible in the LLM's questions. Our findings also suggest that the selection of an example for a Few-Shot-Scenario impacts the level of success of the AI-supported problem-solving process. The choice of examples for a Few-Shot-Scenario therefore requires extensive preparation by the teachers.

Summarizing our results on RQ3, the use of newer GPT versions, particularly the transition from GPT-3.5 to GPT-4 and the integration of the Wolfram plugin, showed significant improvements in content-related quality aspects, such as the clarity and correctness of solutions to mathematics problems. This became especially evident in solving the pails and orchard problems in our study. This enhancement underscores the advanced capabilities of current GPT versions in generating more precise and accurate responses. Notably, the addition of the Wolfram plugin to GPT-4 did not further statistically enhance these aspects, suggesting a plateau in improvement within the scope of these criteria. The Wolfram plugin was only used in certain problem-solving tasks, and its use decreased noticeably in the Few-Shot-Scenario. In educational settings, improvements to the content-related quality of problem solutions may thus be achieved by using newer GPT versions. However, this does not apply to process-related quality: changing the version may not improve the use of strategies, a possible change of representations, or reflection in lessons using AI-supported solutions. The version of GPT used may therefore be most relevant in educational settings that focus on correct task solutions rather than on the solution process.

Our study has limitations, which is why our results should be viewed with caution. Only three individual problems were selected, which does not allow conclusions to be drawn about other problems. Because the complexity of mathematical problems is not always immediately apparent and can vary greatly, generalizations are not possible. Furthermore, we tested each prompt-technique scenario only 3 × 30 times for each GPT version. Although this is a first step toward systematic empirical version comparisons, it cannot be ruled out that the still-small number of model validations may result in statistical bias, which is why we only conducted statistical frequency analyses. More systematic approaches that deepen our initial findings would be desirable here. The effectiveness of using additional prompt techniques for solving a broad array of problem-solving tasks at the current level of LLMs' performance remains to be validated through additional examples. As AI research progresses rapidly, with updates occurring nearly every month, forthcoming GPT versions might seamlessly incorporate these prompt techniques into their user interfaces. Such integration could be vital for aligning LLMs' responses in mathematics educational settings with educational requirements. Identifying such techniques could be instrumental for the development of AI-enhanced learning platforms in mathematics education, fostering critical and, more importantly, constructive engagement with future AI tools. While AI cannot and should not be expected to solve all problems, it can assist in solving them. It is essential to explore how these systems can reach their full potential with the help of students.

In conclusion, while employing Ask-me-Anything and Chain-of-Thought prompting enhances the process-related quality of AI-supported problem solutions, content-related quality advancements are primarily attributable to the evolution of GPT versions, with GPT-4 standing out as the most effective. In contrast, process-related quality remained unaffected by GPT versions, and the Wolfram plugin demonstrated no significant effect on the evaluated criteria.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

SS: Conceptualization, Formal analysis, Methodology, Project administration, Visualization, Writing – original draft. NB: Conceptualization, Methodology, Writing – review & editing. LB: Formal analysis, Visualization, Writing – review & editing.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Acknowledgments

We thank the reviewers for their constructive and insightful comments and suggestions. We would like to thank Tobias Binder, Julian Kriegel, and Leonie Schlenkrich for their assistance in data acquisition and preliminary analysis. The Statistical Consulting and Analysis (SBAZ) at the Center for Higher Education of the Technical University Dortmund provided free advice on statistical methodology for this work.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.


References
Adadi, A., and Berrada, M. (2018). Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160. doi: 10.1109/ACCESS.2018.2870052
Agresti, A. (2012). Categorical data analysis. Hoboken, NJ: John Wiley & Sons.
Arora, S., Narayan, A., Chen, M. F., Orr, L., Guha, N., Bhatia, K., et al. (2023). Ask me anything: a simple strategy for prompting language models. Paper presented at ICLR 2023. Available at: [Link]
Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., and Tabik, S. (2020). Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inform. Fusion 58, 82–115. doi: 10.1016/[Link].2019.12.012
Artzt, A., and Armour-Thomas, E. (1992). Development of a cognitive-metacognitive framework for protocol analysis of mathematical problem solving in small groups. Cogn. Instr. 9, 137–175. doi: 10.1207/s1532690xci0902_3
Baidoo-Anu, D., and Owusu Ansah, L. (2023). Education in the era of generative artificial intelligence (AI): understanding the potential benefits of ChatGPT in promoting teaching and learning. Journal of AI 7, 52–62. doi: 10.2139/ssrn.4337484
Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1–48. doi: 10.18637/jss.v067.i01
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., et al. (2020). Language models are few-shot learners. NeurIPS. doi: 10.48550/arXiv.2005.14165
Buchholtz, N., Baumanns, L., Huget, J., Peters, F., Schorcht, S., and Pohl, M. (2023). Herausforderungen und Entwicklungsmöglichkeiten für die Mathematikdidaktik durch generative KI-Sprachmodelle. Mitteilungen Gesellschaft Didaktik Mathematik 114, 19–26.
Cardona, M. A., Rodríguez, R. J., and Ishmael, K. (2023). Artificial intelligence and the future of teaching and learning: insights and recommendations. United States: Department of Education.
Cherian, A., Peng, K.-C., Lohit, S., Smith, K., and Tenenbaum, J. B. (2023). Are deep neural networks SMARTer than second graders? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10834–10844.
Cooper, G., and Sweller, J. (1987). Effects of schema acquisition and rule automation on mathematical problem-solving transfer. J. Educ. Psychol. 79, 347–362. doi: 10.1037/0022-0663.79.4.347
de Pisa, L. (1202). Liber Abaci.
Dong, Q., Li, L., Dai, D., Zheng, C., Wu, Z., Chang, B., et al. (2022). A survey for in-context learning. arXiv preprint. doi: 10.48550/arXiv.2301.00234
Drori, I., Zhang, S., Shuttleworth, R., Tang, L., Lu, A., Ke, E., et al. (2022). A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level. Proc. Natl. Acad. Sci. 119:e2123433119. doi: 10.1073/pnas.2123433119
Floridi, L., and Chiriatti, M. (2020). GPT-3: its nature, scope, limits, and consequences. Minds Machines 30, 681–694. doi: 10.1007/s11023-020-09548-1
Franzoni, V. (2023). “From black box to glass box: advancing transparency in artificial intelligence systems for ethical and trustworthy AI” in Computational science and its applications – ICCSA 2023 workshops. Lecture notes in computer science, vol. 14107. eds. O. Gervasi, B. Murgante, A. M. A. C. Rocha, C. Garau, F. Scorza, Y. Karaca, et al. (Cham: Springer).
Freeman, G. H., and Halton, J. H. (1951). Note on an exact treatment of contingency, goodness of fit and other problems of significance. Biometrika 38, 141–149. doi: 10.1093/biomet/38.1-2.141
Frieder, S., Pinchetti, L., Chevalier, A., Griffiths, R.-R., Salvatori, T., Lukasiewicz, T., et al. (2023). Mathematical capabilities of ChatGPT. Available at: [Link]
Fütterer, T., Fischer, C., Alekseeva, A., Chen, X., Tate, T., Warschauer, M., et al. (2023). ChatGPT in education: global reactions to AI innovations. Sci. Rep. 13:15310. doi: 10.1038/s41598-023-42227-6
Goldin, G. A., and McClintock, C. E. (Eds.) (1979). Task variables in mathematical problem solving. Lawrence Erlbaum Associates.
Goulet-Lyle, M. P., Voyer, D., and Verschaffel, L. (2020). How does imposing a step-by-step solution method impact students’ approach to mathematical word problem solving? ZDM 52, 139–149. doi: 10.1007/s11858-019-01098-w
Gunning, D., Stefik, M., Choi, J., Miller, T., Stumpf, S., and Yang, G. Z. (2019). XAI-Explainable artificial intelligence. Sci. Robot. 4:eaay7120. doi: 10.1126/scirobotics.aay7120
Hadi, M. U., Al-Tashi, Q., Qureshi, R., Shah, A., Muneer, A., Irfan, M., et al. (2023). Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects. TechRxiv. doi: 10.36227/techrxiv.23589741.v4
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., et al. (2021). Measuring mathematical problem solving with the MATH dataset. Available at: [Link]
Herm, L.-V., Wanner, J., Seubert, F., and Janiesch, C. (2021). I don’t get it, but it seems valid! The connection between explainability and comprehensibility in (X)AI research. In European Conference on Information Systems (ECIS).
Hiebert, J., and Carpenter, T. P. (1992). “Learning and teaching with understanding” in Handbook of research on mathematics teaching and learning. ed. D. A. Grouws (New York, NY: Macmillan), 65–97.
Hiemstra, D. (2009). “Language models” in Encyclopedia of database systems. eds. L. Liu and M. T. Özsu (Boston, MA: Springer).
Huget, J., and Buchholtz, N. (2024). Gut gepromptet ist halb geplant – ChatGPT als Assistenten bei der Unterrichtsplanung nutzen. Praxisratgeber „Künstliche Intelligenz als Unterrichtsassistent“, 8–10.
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., et al. (2023). Survey of hallucination in natural language generation. ACM Comput. Surv. 55, 1–38. doi: 10.1145/3571730
Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., et al. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 103:102274. doi: 10.1016/[Link].2023.102274
Kojima, T., Shane Gu, S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2022). Large language models are zero-shot reasoners. Available at: [Link]
Küchemann, S., Steinert, S., Revenga, N., Schweinberger, M., Dinc, Y., Avila, K. E., et al. (2023). Can ChatGPT support prospective teachers in physics task development? Phys. Rev. Phys. Educ. Res. 19:020128. doi: 10.1103/PhysRevPhysEducRes.19.020128
Lample, G., and Charton, F. (2019). Deep learning for symbolic mathematics. arXiv preprint. doi: 10.48550/arXiv.1912.01412
Lenth, R., Singmann, H., Love, J., Buerkner, P., and Herve, M. (2019). emmeans: estimated marginal means, aka least-squares means. R package version 1.3.2. Available at: [Link]/package=emmeans
Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., et al. (2022). Solving quantitative reasoning problems with language models. Adv. Neural Inf. Proces. Syst. 35, 3843–3857. doi: 10.48550/arXiv.2206.14858
Liljedahl, P., and Cai, J. (2021). Empirical research on problem solving and problem posing: a look at the state of the art. ZDM 53, 723–735. doi: 10.1007/s11858-021-01291-w
Liljedahl, P., Santos-Trigo, M., Malaspina, U., and Bruder, R. (2016). “Problem solving in mathematics education” in Problem solving in mathematics education. ICME-13 topical surveys (Cham: Springer).
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. (2021). Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. Available at: [Link]
Maroengsit, W., Piyakulpinyo, T., Phonyiam, K., Pongnumkul, S., Chaovalit, P., and Theeramunkong, T. (2019). A survey on evaluation methods for chatbots, 111–119. doi: 10.1145/3323771.3323824
Maynez, J., Narayan, S., Bohnet, B., and McDonald, R. (2020). On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 1906–1919. Available at: [Link]
Miao, F., and Holmes, W. (2023). Guidance for generative AI in education and research. Paris: UNESCO.
Navigli, R., Conia, S., and Ross, B. (2023). Biases in large language models: origins, inventory, and discussion. J. Data Inform. Qual. 15, 1–21. doi: 10.1145/3597307
OpenAI (2023). GPT-4 technical report. Available at: [Link]
Plevris, V., Papazafeiropoulos, G., and Jiménez Rios, A. (2023). Chatbots put to the test in math and logic problems: a comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard. AI 4, 949–969. doi: 10.3390/ai4040048
Pólya, G. (1957). How to solve it: a new aspect of mathematical method. 2nd Edn. Princeton, NJ: Princeton University Press.
Prediger, S., and Wessel, L. (2013). Fostering German-language learners’ constructions of meanings for fractions—design and effects of a language- and mathematics-integrated intervention. Math. Educ. Res. J. 25, 435–456. doi: 10.1007/s13394-013-0079-2
Qiu, M., Li, F.-L., Wang, S., Gao, X., Chen, Y., Zhao, W., et al. (2017). AliMe chat: a sequence to sequence and rerank based chatbot engine. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 498–503.
Ramlochan, S. (2023). Master prompting concepts: chain of thought prompting. Available at: [Link]
Rawte, V., Sheth, A., and Das, A. (2023). A survey of hallucination in large foundation models. Available at: [Link]
Renkl, A. (2002). Worked-out examples: instructional explanations support learning by self-explanations. Learn. Instr. 12, 529–556. doi: 10.1016/S0959-4752(01)00030-5
Reynolds, L., and McDonell, K. (2021). “Prompt programming for large language models: beyond the few-shot paradigm” in Extended abstracts of the 2021 CHI conference on human factors in computing systems (CHI EA ’21) (New York, NY: Association for Computing Machinery).
Rodriguez-Torrealba, R., Garcia-Lopez, E., and Garcia-Cabot, A. (2022). End-to-end generation of multiple-choice questions using text-to-text transfer transformer models. Expert Syst. Appl. 208:118258. doi: 10.1016/[Link].2022.118258
Salkind, C. (1961). The contest problem book I: annual high school mathematics examinations 1950–1960. New York, NY: Mathematical Association of America.
Schoenfeld, A. H. (1985). Mathematical problem solving. Orlando, FL: Academic Press.
Schoenfeld, A. H. (1992). “Learning to think mathematically: problem solving, metacognition, and sense-making in mathematics” in Handbook for research on mathematics teaching and learning. ed. D. A. Grouws (New York, NY: Macmillan), 334–370.
Schönthaler, P. (2023). Schneller als gedacht: ChatGPT zwischen wirtschaftlicher Effizienz und menschlichem Wunschdenken. c’t 9, 126–131.
Schorcht, S., and Baumanns, L. (2024). Alles falsch?! Reflektiertes Problemlösen mit KI-Unterstützung im Mathematikunterricht. Praxisratgeber „Künstliche Intelligenz als Unterrichtsassistent“, 32–34.
Schorcht, S., Baumanns, L., Buchholtz, N., Huget, J., Peters, F., and Pohl, M. (2023). Ask Smart to Get Smart: Mathematische Ausgaben generativer KI-Sprachmodelle verbessern durch gezieltes Prompt Engineering. Mitteilungen Gesellschaft Didaktik Mathematik 115, 12–24.
Schorcht, S., and Buchholtz, N. (2024). „Wie verlässlich ist ChatGPT? Modellvalidierung als empirische Methode zur Untersuchung der mathematikdidaktischen Qualität algorithmischer Problemlösungen“ in Beiträge zum Mathematikunterricht. WTM-Verlag.
Spannagel, C. (2023). Hat ChatGPT eine Zukunft in der Mathematik? Mitteilungen der Deutschen Mathematiker-Vereinigung 31, 168–172. doi: 10.1515/dmvm-2023-0055
Wardat, Y., Tashtoush, M., Alali, R., and Jarrah, A. (2023). ChatGPT: a revolutionary tool for teaching and learning mathematics. Eurasia J. Math. Sci. Technol. Educ. 19, 1–18. doi: 10.29333/ejmste/13272
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., et al. (2023). Chain-of-thought prompting elicits reasoning in large language models. Available at: [Link]
Wolfram, S. (2023). Instant plugins for ChatGPT: introducing the Wolfram ChatGPT plugin kit. Stephen Wolfram Writings. Available at: [Link]
Yuan, Z., Yuan, H., Tan, C., Wang, W., and Huang, S. (2023). How well do large language models perform in arithmetic tasks? Available at: [Link]
