GPT-4 in Education: Evaluating Aptness, Reliability, and Loss of Coherence in Solving Calculus Problems and Grading Submissions
https://doi.org/10.1007/s40593-024-00403-3
ARTICLE
Alberto Gandolfi1
Abstract
In this paper, we initially investigate the capabilities of GPT-3.5 and GPT-4 in solving college-level calculus problems, an essential segment of mathematics that remains under-explored so far. Although improving upon earlier versions, GPT-4 attains approximately 65% accuracy on standard problems, dropping to about 20% in competition-like scenarios. Overall, the models prove to be unreliable due to common arithmetic errors.
Our primary contribution then lies in examining the use of ChatGPT for grading solutions to calculus exercises. Our objectives are to probe an in-context learning task with less emphasis on direct calculations; to recognize positive applications of ChatGPT in educational contexts; to highlight a potentially emerging facet of AI that could necessitate oversight; and to introduce unconventional AI benchmarks for which models like GPT are untrained. Pertaining to the latter, we uncover a tendency toward loss of coherence in extended contexts. Our findings suggest that while the current ChatGPT exhibits comprehension of the grading task and often provides relevant outputs, the consistency of grading is marred by occasional loss of coherence and hallucinations. Intriguingly, GPT-4's overall scores, delivered in mere moments, align closely with those of human graders, although its detailed accuracy remains suboptimal.
This work suggests that, when appropriately orchestrated, collaboration between human graders and LLMs like GPT-4 might combine their unique strengths while mitigating their respective shortcomings. In this direction, it is imperative to consider implementing transparency, fairness, and appropriate regulations in the near future.
* Alberto Gandolfi
[email protected]
1 Division of Science, New York University Abu Dhabi, Abu Dhabi 129188, UAE
Introduction
The natural language processing tool ChatGPT, particularly the version based on the multimodal transformer architecture (Vaswani et al. (2017)), GPT-4, has demonstrated the ability to interpret text input and generate natural language text that addresses a variety of requests (OpenAI (2023)). ChatGPT can produce convincing college papers and fake research-paper abstracts that are difficult for instructors and scientists to detect (Gao et al. (2022); Crothers et al. (2022)), and its human-like capabilities have even inspired the idea of subjecting ChatGPT to cognitive psychological assessment, see Binz and Schulz (2023).
In particular, ChatGPT performs quite well and often outscores the vast major-
ity of human test takers in various domains, see OpenAI (2023); Ibrahim et al.
(2023); Srivastava et al. (2022); Suzgun et al. (2022); Chung et al. (2022). This
raises concerns about the potential uses and misuses of ChatGPT in education,
especially in high schools and colleges.
Despite the quality of ChatGPT’s answers in many domains, its output in the
field of Mathematics has been less human-like. A recent study based on GPT-3.5
(Frieder et al. (2024)) examined a large number of mathematical problems across
all levels (with limited calculus content) and found that ChatGPT’s responses,
if graded, would receive scores insufficient for passing standard exams. A simi-
lar pattern was found for high school math problems in Vietnam (Dao and Le
(2023)). Additionally, a comparison across multiple fields showed that, for GPT-
3.5, Mathematics is the only field in which ChatGPT’s outputs would consistently
score worse than human students’ answers (Ibrahim et al. (2023)).
To complete the range of topics for which LLMs have been tested, we first analyze the potential of GPT-3.5, and more importantly of the more recent upgrade GPT-4, in solving Calculus exercises. Our focus is on GPT among other possible LLMs as it appears to be the best performer in Calculus (Calonge et al. (2023)). As some degradation over time has been noted in GPT-4's problem-solving abilities (Chen et al. (2023)), we have repeated the tests six months apart.
Next, we analyze another potential use of ChatGPT that has received little attention so far but may soon be considered in many contexts: the possibility of asking ChatGPT to grade submissions such as solutions to exercises. This is largely an in-context learning task (Garg et al. (2022)), as we provide both the text of the exercise and its solution and ask for an evaluation of other proposed solutions.
Several reasons exist for testing ChatGPT on grading. One is the investigation of a less demanding in-context learning task; another is the possibility of envisioning a profitable integration of ChatGPT into the education process: while we consider Calculus here, grading can be tested and extended to other areas. A further reason is that grading has not yet been considered as a potential activity for LLMs, but it could become widespread as soon as the potential is realized and input becomes more versatile. Finally, we aim to develop nonstandard AI tests (Bubeck et al. (2023)) that have not been considered during training or preliminary testing by AI developers (OpenAI (2023)). One of the results is
that GPT-4 tends to exhibit loss of coherence within a single answer when faced
with a long list of numerical inputs.
After the first draft of this work, several other studies have investigated the potential of LLMs in grading students' submissions: Nilsson and Tuvstedt (2023), for instance, considers programming tests, while Mizumoto and Eguchi (2023) and Fiacco et al. (2023) analyze the ability to evaluate linguistic papers.
No work has yet investigated the possibility of grading Calculus solutions, which is the main focus of the present paper.
Background
Testing LLMs
Since their recent growth in abilities and popularity, LLMs and, in particular, ChatGPT have been subjected to numerous evaluations across several domains (Tamkin et al. (2021); Chen et al. (2021); Ouyang et al. (2022); Bubeck et al. (2023); Frieder et al. (2024); Ibrahim et al. (2023); Binz and Schulz (2023)). Several competing models are now available for free or for a moderate fee, and their relative performances have been evaluated in various forms. While in general which LLM performs better depends on the context (Dao (2023)), it appears that in Math ChatGPT, and even more so GPT-4, is currently the best performer in many respects (Calonge et al. (2023)). The potential and challenges of GPT in education have been reviewed by several studies (Kasneci et al. (2023)).
Automated essay scoring has a long history (Gao et al. (2023)) and has been more extensively explored recently, showing that GPT has a certain level of accuracy and reliability (Mizumoto and Eguchi (2023); Fiacco et al. (2023)). It has been concluded (Mizumoto and Eguchi (2023)) that GPT can be effectively utilized as an automated essay scoring tool, potentially revolutionizing methods of writing evaluation and feedback in both research and practice in the humanities. A wider study of simple question–answer pairs (Schneider et al. (2023)), including multiple languages and diverse fields and using a specific hyperparameter-tuned model, finds an accuracy of about 86.5%. This is close to being acceptable for unsupervised automated scoring.
In computer science, the grading abilities of GPT-4 have been investigated on exercises from first-year foundations-of-programming courses in Computer Science: Nilsson and Tuvstedt (2023) find accuracy and precision ranging from about 55% to 100% depending on the observed year. As an effect of such variability,
the overall performance of GPT at grading in Nilsson and Tuvstedt (2023) is not satisfactory enough to deploy it as a grading tool on its own today; GPT could, however, plausibly be used as part of a workflow together with unit tests and manual TA inspections.
In the related area of scientific paper reviewing, the idea of automated review has
been thoroughly examined in Yuan et al. (2022). The authors carefully investigate
the potential criteria for evaluating reviews and conclude that while natural language
processing models, which are the basis for LLMs, are useful for general aspects,
they lack attention to detail.
Overall, the potential use of GPT for automated grading of specific Math tests has not been explored in detail so far. An analysis of various AI tools, including neural networks and pre-GPT large language models, for automated grading has been carried out in Erickson et al. (2020) for high school math problems in the ASSISTments platform; other statistical methods for high school math problems have been developed in Baral et al. (2021), Baral et al. (2022), Baral et al. (2023), Botelho et al. (2023), Zhang et al. (2022). With ad hoc methods developed in close consideration of the specific task of instructor's grade prediction, these works achieve around 80% prediction accuracy, based on various metrics, with respect to the grade assigned by instructors. While this is a remarkable achievement, these works do not directly address the possibility of using GPT, nor do they discuss calculus-level problems.
Several limitations of ChatGPT and GPT-4 have been noted. The two below are relevant to our discussion.
Despite not having a persistent memory of individual prompts, these AI models use the context window to track the current conversation, akin to a "short-term memory." This allows the AI to recall previous responses and prompts within the token limit, aiding in the generation of contextually relevant and coherent responses. The ability to reference prior tokens helps counter some of the context loss by forming responses based on learned patterns, grammatical rules, vocabulary, and simulated common sense reasoning (Jacob (2023)). Token limitation can, however, lead to a lack of coherence in long sentences or exchanges.
In addition, a recent work has observed that the performance of GPT-4 seems to have been degrading over time, in particular in Math, while GPT-3.5 seems to have improved its performance (Chen et al. (2023)).
The current research landscape suggests that while LLMs such as ChatGPT have made significant progress in various domains, they still face challenges in solving complex mathematical problems. Our first research question is to extend this analysis to problems from Calculus which, in spite of Calculus being one of the most attended math courses, have not been considered in detail so far.
The second research question involves the ability of ChatGPT or GPT-4 to auto-
matically grade calculus submissions in comparison to that of human expert graders.
In addition, we aim to follow up on a series of issues which have surfaced in other studies or while investigating the previous questions. In particular, we want to highlight a phenomenon which has become apparent during this study and can be identified as a type of loss of coherence.
Finally, we test whether the claimed degradation over time of GPT can be traced
in the tasks that we are analyzing.
We hope hereby to contribute to the ongoing discussion about the potential appli-
cations and limitations of LLMs in education.
Methods
The following three sets of calculus problems have been used:
1. standard calculus exercises from Stewart (2020) whose solution consists only of a numerical answer; solutions can be found in the appendix of Stewart (2020);
2. high school competition problems from the 2021 SMT Team Round Stanford
Math Tournament from AoPSOnline (2022), with solutions published in the web-
site;
3. exercises from a real Calculus course held at New York University Abu Dhabi
in the Summer of 2017; problems were chosen by randomly selecting 10 ques-
tions from homework and final exams; the questions were selected based on the
following criteria: (i) the questions were limited to two paragraphs, although the
answers could be of any length; (ii) multiple-choice questions were not allowed;
(iii) no images, sketches, or diagrams were allowed in either the question or the
answer; (iv) no attachments were allowed in either the question or the answer.
Solutions are provided by the instructor of the course.
The Stanford problems are published in TeX format, while all other problems are typewritten in Math format. To prepare them for submission to ChatGPT, the other problems have been rewritten in the Math plain-text format described in Keely (2009); notice that this format is essentially the only one accepted by WolframAlpha in its Natural-Language mode. The format is ambiguous for large formulas, but consistent for the calculus exercises we are considering when enough brackets are used, see Keely (2009).
All exercises are reproduced in the online material, in the format in which they were presented to ChatGPT, including the exact prompt, and can be directly resubmitted. All exercises have been submitted to ChatGPT-3.5 and GPT-4, asking for a solution; for the real calculus exercises, each problem had been submitted three times in a previous double-blind experiment (Ibrahim et al. (2023)); the solutions returned by the chat are included in the online material. If submitted again, the answers will be different due to the internal randomness of ChatGPT, but even when the same exercise was submitted multiple times, the overall results seemed to be consistent.
These data form the basis of the evaluation of the calculus-solving capabilities of ChatGPT.
Submissions
Next, in order to prepare a submission to be graded, all exercises have been submit-
ted to both GPT-3.5 and GPT-4 several times, asking for solutions. All answers are
recorded in the online material.
In addition, for the exercises from Stewart and Stanford, we created plausible synthetic answers which, starting from one correct solution, included one correct answer equal to the correct solution, one correct answer in a form different from the correct solution, one partially correct answer, and one completely incorrect answer, in random order.
For the exercises from the actual calculus class, answers by actual students of the course, selected among those who had given consent so as to represent a variety of overall class grades, were collected in Ibrahim et al. (2023); such data collection is blind on the side of the students, who clearly were unaware of the current test at the time of solving the exercises; the answers by the students have been converted into digital text format. Students' consent for the use of their anonymized submissions in the study was obtained, as was confirmation that they were 18 years of age or older.
Grading Rubrics
Submissions to be graded were prepared by restating each problem together with its correct solution.
Next, grading rubrics have been established for the study. Due to the author's extensive experience in grading calculus papers, the development of the grading system did not require complicated considerations (see Yuan et al. (2022) for challenges in establishing standards for paper reviewing).
The rubric for problems from Stewart (2020) has been chosen to be the simplest
one: Exact solution: 10 pts. Incorrect solution: 0 pts.
For the other problems, the rubric follows traditional academic standards, with
partial credit awarded for meaningful descriptions of procedures that would lead to
the solution even if the calculations are incorrect. The rubric assigns only 8 points
out of 10 for correct solutions without explanation. Since ChatGPT provides more
explanations than the students, the rubric was designed to favor the chatbot. The grading rubric is reproduced, together with the prompts, in the online material.
Prompts
Grading Submissions
For each proposed solution to an exercise, the author of this research has provided a golden standard of scores in accordance with the rubric for each exercise. In about half of the exercises, when ample leeway was present, the golden standard has been provided as an interval of possible scores rather than a single value.
For the Stewart and Stanford problems, a prompt was prepared with the text of the exercise, the rubric, the correct solution, and the synthetic solutions, and this was submitted to ChatGPT and GPT-4 for grading.
For the real calculus problems, the answers provided by the chat and the solutions of the students were labeled Submission 1, 2, …, and for each problem a prompt was prepared with the text of the exercise, a correct solution, the corresponding grading rubric, and the labeled submissions. Some prompt engineering was added, such as a reminder that −x is not correct when x is the solution, and the prompt was submitted to ChatGPT three times, and to GPT-4, for grading.
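For concreteness, the following is a minimal Python sketch of how a grading prompt of the kind just described might be assembled: exercise text, correct solution, rubric, labeled submissions, and the reminder about −x. The helper function and all strings are illustrative placeholders rather than the actual prompts used in the study, which are reproduced verbatim in the online material.

# Illustrative sketch only: build_grading_prompt and all strings below are
# hypothetical placeholders, not the prompts used in the study.

def build_grading_prompt(exercise, solution, rubric, submissions):
    """Assemble a plain-text grading prompt from its components."""
    lines = [
        "Exercise: " + exercise,
        "Correct solution: " + solution,
        "Grading rubric: " + rubric,
        "Remember that -x is not a correct answer when x is the solution.",
    ]
    # Label each proposed solution as Submission 1, 2, ...
    for i, submission in enumerate(submissions, start=1):
        lines.append(f"Submission {i}: {submission}")
    lines.append("Assign a score out of 10 to each submission according to the rubric.")
    return "\n".join(lines)

# Example with made-up content written in a plain-text math format:
prompt = build_grading_prompt(
    exercise="Compute integral_0^1 2*x dx.",
    solution="integral_0^1 2*x dx = [x^2]_0^1 = 1",
    rubric="Correct result with explanation: 10 pts; without explanation: 8 pts; "
           "meaningful procedure with wrong calculations: partial credit; otherwise: 0 pts.",
    submissions=["x^2 from 0 to 1, which equals 1", "The answer is 2"],
)
print(prompt)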
In addition, for the problems from the actual calculus class, we recruited three graders who were senior students that had previously taken the course and obtained an A grade (Ibrahim et al. (2023)). The graders were provided with the same material as the chat and were asked to grade the submissions. The test was conducted in a
controlled environment to ensure that the human graders were unaware that some of
the responses had been machine generated. Each grader was compensated at a rate
of $15 per hour for their time. Participants were given debrief information after the
research task was complete, which explained the full purpose of the research.
All the material submitted to chat and graders and their scores, redacted to
remove handwriting, are available in the supplementary material.
Appendix A shows two examples of the prompts that we submitted to the chat
and the corresponding scores provided by ChatGPT.
Several tests were conducted after noticing that GPT seemed to become less reliable along a test or when many analogous tasks were required. In particular, we focused on recognizing whether sequences of numbers contained any number in a given interval. The task is easy for humans, but also time consuming and repetitive; it is also one of the basic tasks involved in grading. We report the most significant tests.
In one test we generated 100 sequences of 10 random integers, each picked uniformly in [1, 100]. We used the default random generator in Mathematica. GPT-4 was asked to identify in which sequences there appear numbers in the interval [16, 24]. The test was repeated many times, each time with independent random numbers, and the numbers of incorrect assignments were averaged. We report one typical list of the positions in the sequence of 100 in which GPT-4 made an incorrect identification.
In the second test, we similarly generated 100 sequences of 10 random integers, each picked uniformly in [1, 100], and GPT-4 was asked to identify in which sequences there appear numbers in the interval [32, 34]. The test was repeated 10 times, each time with independent random numbers generated by a Python random number generator, and a suitable prompt engineering was used: "Which of the following lists contains a number which is greater than or equal to 32 and less than or equal to 34? Check each list manually, do not use plugins. Make a list of the list numbers, but only the numbers. Write the answer as a list of the list numbers like [2,5,…]." The number of incorrect answers was averaged for each position in the 100 sequences. Many other similar tests were performed with different intervals, all with approximately the same outcomes.
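A minimal sketch of this second test is given below, under the assumption that a plain Python random generator and simple list formatting were used; the exact formatting of the lists in the actual prompts may differ, and the full transcripts are in the online material.

import random

# Sketch of the second loss-of-coherence test described above (the list
# formatting and the scoring of replies are simplified assumptions here).

def make_lists(n_lists=100, length=10, low=1, high=100, seed=None):
    rng = random.Random(seed)
    return [[rng.randint(low, high) for _ in range(length)] for _ in range(n_lists)]

def ground_truth(lists, lo=32, hi=34):
    # 1-based indices of the lists containing at least one number in [lo, hi]
    return [i for i, lst in enumerate(lists, start=1) if any(lo <= x <= hi for x in lst)]

def build_prompt(lists):
    header = ("Which of the following lists contains a number which is greater than or "
              "equal to 32 and less than or equal to 34? Check each list manually, do not "
              "use plugins. Make a list of the list numbers, but only the numbers. "
              "Write the answer as a list of the list numbers like [2,5,...].")
    body = "\n".join(f"List {i}: {lst}" for i, lst in enumerate(lists, start=1))
    return header + "\n" + body

lists = make_lists(seed=0)
print(build_prompt(lists)[:200])   # preview of the prompt submitted to the chat
print(ground_truth(lists))         # the indices the model should return

# Errors per position could then be counted by parsing the model's reply into a
# Python list of integers and comparing it against ground_truth(lists).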
In the final test, the chat was asked to identify in which lists the number 620 appears within the sequences of the previous test. In this final experiment, exploiting the interactive capabilities of the chatbots, we replied to the chat alerting it of mistakes if its reply contained any, or even if it did not contain any, to test for possible hallucination.
To test the apparent degradation of ChatGPT-3.5 and GPT-4 in the realm of Calculus problems, all problems tested in March 2023 have been resubmitted for solution in September 2023.
We also tested ChatGPT-LangChain (Chat-GPT-LangChain), which has the ability to leverage WolframAlpha for specific calculations, by submitting all of the calculus problems in the same format.
Statistical Analysis
For the performance in solving calculus problems, average percentage scores have been computed, and p-values of two-sample t-tests are reported to compare the problem-solving abilities of ChatGPT-3.5 and GPT-4 in March and September, and of the students.
For the test of potential degradation of the chatbots, besides the p-values of two-sample t-tests, we have computed the correlations between the scores received on the various exercises.
For the evaluation of the chatbots as graders, four statistics have been computed
by comparison with the golden standard determined by the author.
1. The average score assigned to each submission: this is relevant both since it is
the only figure contributing to students’ records and since it averages out possible
discrepancies in opposite directions.
2. Scores deviating from the golden standard have been classified as "close to range" if the distance is less than or equal to 2 points (out of 10) or "completely off range" for larger deviations. The fraction of completely off range scores has been determined for each grader, representing the fraction of cases in which the grader has completely misunderstood the content of the submission.
3. For a finer evaluation of the grading activity, a grader's aptness index, in the form of the mean quadratic deviation from the golden standard, has been computed (a computational sketch is given after this list). This value ranges potentially from 0 to 10, with 0 representing perfect scoring and larger values indicating some inaccuracy. Graders' aptness indices have been compared for all pairs of graders to assess relative aptness, and p-values of two-sample t-tests have been reported.
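The following sketch illustrates one way the statistics in items 1–3 might be computed. It assumes that the golden standard is either a single value or an interval (with the deviation of a score from an interval taken as its distance to the interval), and it interprets the mean quadratic deviation as a root-mean-square deviation, consistent with the stated 0–10 range; all scores shown are hypothetical.

import numpy as np
from scipy import stats

# Sketch of the grading statistics described above (an interpretation, not the
# author's code). Golden standards are single values or (lo, hi) intervals.

def deviation(score, golden):
    if isinstance(golden, tuple):            # interval golden standard
        lo, hi = golden
        return max(lo - score, score - hi, 0.0)
    return abs(score - golden)               # single-value golden standard

def off_range_fraction(scores, goldens, threshold=2.0):
    devs = [deviation(s, g) for s, g in zip(scores, goldens)]
    return float(np.mean([d > threshold for d in devs]))

def aptness_index(scores, goldens):
    devs = np.array([deviation(s, g) for s, g in zip(scores, goldens)])
    return float(np.sqrt(np.mean(devs ** 2)))   # root-mean-square deviation, in [0, 10]

# Hypothetical scores for two graders against the golden standard:
golden = [10, (6, 8), 0, 10, (2, 4)]
grader_a = [10, 7, 2, 9, 5]
grader_b = [10, 8, 0, 10, 3]

print(aptness_index(grader_a, golden), off_range_fraction(grader_a, golden))

# Pairwise comparison of two graders' deviations with a two-sample t-test:
dev_a = [deviation(s, g) for s, g in zip(grader_a, golden)]
dev_b = [deviation(s, g) for s, g in zip(grader_b, golden)]
print(stats.ttest_ind(dev_a, dev_b).pvalue)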
For the evaluation of the ability of the chatbots to identify sequences with integers in given intervals, the percentage error has been determined, both for the overall performance and for the performance at each position in the order of submission. For the answers which halted, the average number of items checked by GPT-4 before halting has also been computed.
Results
Notably, ChatGPT, even GPT-4, made many elementary errors (see Cobbe et al. (2021)), such as claiming that 3(−1)^4 − 4(−1)^3 − 12(−1)^2 + 1 = −8 while the correct value is −4 (see Question 4.1.55 from Stewart (2020) in the Supplementary Material). Since calculus problems often involve several simple calculations, ChatGPT frequently produced incorrect intermediate results, leading to incorrect final answers. Nonetheless, ChatGPT's justifications for its reasoning were generally appropriate.
Our findings are consistent with related studies on ChatGPT-3.5 performance in general math (Frieder et al. (2024); Ibrahim et al. (2023)). The ability of GPT-4 in Calculus has considerably improved (see OpenAI (2023)), but, even with a lenient grading policy giving partial credit for explanations without correct results and point deductions for lack of explanations, its scores barely approach sufficiency for standard calculus problems, and the chat offers very little reliability even for simple calculations.
Fig. 2 Changes in average scores of ChatGPT3.5 (above) and GPT-4 (below) in solving exercises
between March and September 2023 for the indicated sets of exercises
The test of changes or degradation over time has shown that ChatGPT-3.5 indeed seems to have improved its mathematical abilities, especially in the exercises from the calculus class (p-value 1.3%), while in all other cases, including the performance of GPT-4, there has been no significant change on average, see Fig. 2. The fluctuation has been substantial, with correlations ranging from 0.19
to 0.69 for GPT-4 and from 0.21 to 0.82 for ChatGPT-3.5. This testifies to the intrinsic randomness at the default temperature, and to some fluctuation over time in the quality of the outcomes. However, we did not identify any relevant change in GPT-4, while ChatGPT-3.5 has shown a detectable improvement.
Table 2 Average scores by solver and grader, in percentage of total score, for the Calculus class, Stewart, and Stanford exercises
Our findings with ChatGPT-LangChain were not encouraging. In almost all of the Calculus problems in our study, ChatGPT either did not utilize WolframAlpha or, when it did, the prompts were often too linguistically complex for WolframAlpha to comprehend. In one instance, for the prompt ∫_{-1}^{1} sin(x)/(x^4 + x^2 + 2) dx, ChatGPT erroneously reported that WolframAlpha returned a non-zero value of −cos(1)/4 + cos(−1)/2 − log(2). Going directly to WolframAlpha and reworking the prompt multiple times, WolframAlpha was finally able to understand the input and correctly returned a value of 0 (easily obtainable by symmetry, since the integrand is odd), together with a complex integral expression. We did not conduct further testing with ChatGPT-LangChain. While integration with WolframAlpha holds great promise, the chatbot's ability to rephrase queries in a way that WolframAlpha can use remains a key challenge (a more thorough investigation with similar conclusions is in Davis and Aaronson (2023)).
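For the reader's convenience, the symmetry argument mentioned above is the standard odd-function cancellation, spelled out below.

% The integrand is odd: sin(-x) = -sin(x), while x^4 + x^2 + 2 is even,
% so the contributions over [-1, 0] and [0, 1] cancel exactly.
\[
  \int_{-1}^{1} \frac{\sin x}{x^{4}+x^{2}+2}\,dx
  \;=\; \int_{0}^{1} \frac{\sin(-x)}{x^{4}+x^{2}+2}\,dx
      + \int_{0}^{1} \frac{\sin x}{x^{4}+x^{2}+2}\,dx
  \;=\; 0 .
\]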
Grading Calculus Submissions For the Stewart problems, ChatGPT had an error rate of 10 ± 0.49% and GPT-4 of 6.67 ± 0.42%, a significant difference (p-value < 10⁻⁶). When asked to grade solutions to the Stanford competition problems, GPT-3.5 graded about 62.5% of the submissions correctly, with close to range scores in about 32.5% of cases, and completely off range scores in around 5% of the cases. GPT-4 performed similarly but with no completely off range scores. See Table 1.
Fig. 3 Average scores of ChatGPT submissions and students' submissions, with error bars, as graded by human graders (blue), GPT-3.5 (red), and GPT-4 (yellow) using the provided rubric. Human graders are more accurate with respect to the golden standard, based on their aptness index in Table 3, and correctly assign a higher, passing grade to students and a lower, non-passing grade to the chat. GPT-3.5 is rather incorrect, while GPT-4 reproduces the averages quite accurately
As an example, the average scores for Student D were 70, 70, 72 from GPT-3.5, 56 from GPT-4, and 58, 62, 58 from the three human graders. Regarding incorrect scoring, GPT-3.5 had an average of 10 scores incorrect but close to range, with about 3 completely off range, while GPT-4 had 10 close to range but 1 off range; for comparison, human graders had about 1 close to range and none off range.
ChatGPT-3.5 failed severely when grading its own work: altogether, GPT-3.5 had an average of 7 extra incorrect scores close to the expected ranges, with about 10 extra completely off range, while GPT-4 had 6 more close to range and 5 off range. Human graders also got confused, to a lesser degree, with about 5 extra close to range and 1 off range.
Table 4 p-values of two-sample t-tests for pairwise comparison of scores. HG stands for Human Grader
p-value | GPT-4 | GPT-3.5–1 | GPT-3.5–2 | GPT-3.5–3 | HG 1 | HG 2 | HG 3
Fig. 4 Scores assigned to single exercises, averaged over the number of grading iterations when there is more than one. The right table corresponds to grading students' submissions; while the GPT-3.5 scores diverge in several of the exercises, the ones assigned by GPT-4 are very close to those of the human graders. The left table is for GPT-3.5 submissions; there is less agreement, likely due to the length of the answers and the related difficulty in assigning partial scores; while GPT-3.5 as a grader is unreasonably lenient, GPT-4 and the human graders end up having about the same overall average
The grader's aptness index based on the grading of real students' submissions gave the values reported in Table 3. The human graders all have indices which are statistically indistinguishable from 0, reflecting their preparation. GPT-4 has an index of 0.867 ± 0.248, significantly different from zero but better than ChatGPT-3.5, which has an index substantially greater than 1. p-values of two-sample t-tests for pairwise comparison of index values are reported in Table 4. It can be seen that human graders have a very high degree of homogeneity, that GPT-4 performs better than most GPT-3.5 runs but worse than the human graders, and that GPT-3.5 also has a high degree of consistency, see also Fig. 4.
In summary, some of the grading outputs are indeed impressive: ChatGPT
quickly returns appropriate scores for each submission, often providing motiva-
tion interpreted in terms of the grading rubric. Other times, the output contains
minor misinterpretations of the submissions. There are, however, blunders and
other issues which, as we discuss in detail, make the scoring scarcely reliable.
The appropriateness of the scores varies depending on the input format, the com-
plexity, or the length of the exercise or the solutions.
GPT-4 has improved upon the previous version and generates scorings of Calculus exercises which are generally appropriate, although with some errors and misunderstandings of equivalent solutions. It is remarkable that the average scores from GPT-4 are on a par with those of humans, having been produced in a few seconds each instead of the hours needed by human graders. On the other hand, a more accurate aptness investigation shows that GPT-4 retains (mostly moderate) inaccuracies. In a sense, the grading errors average out when computing the overall score.
With all of this taken into account, automated grading of calculus exercises by GPT-4 alone is not feasible at the moment, but it can certainly be of great support to a human grader, and blended scoring has enormous potential.
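As an illustration of what such a blend might look like in practice, here is one possible, purely hypothetical triage workflow (not taken from the paper): the chatbot grades each submission twice, scores that agree within a tolerance are provisionally accepted, and the rest are routed to a human grader.

from statistics import mean

# Hypothetical blended-grading triage, not the author's protocol: accept the
# average of two AI grading passes when they agree within a tolerance,
# otherwise flag the submission for full human review.

def triage(ai_scores_pass1, ai_scores_pass2, tolerance=2.0):
    accepted, needs_human = {}, []
    for sub_id, (s1, s2) in enumerate(zip(ai_scores_pass1, ai_scores_pass2), start=1):
        if abs(s1 - s2) <= tolerance:
            accepted[sub_id] = mean([s1, s2])
        else:
            needs_human.append(sub_id)
    return accepted, needs_human

# Example: two grading passes over five submissions.
accepted, needs_human = triage([10, 7, 0, 9, 5], [10, 3, 0, 8, 5])
print(accepted)      # e.g. {1: 10, 3: 0, 4: 8.5, 5: 5}
print(needs_human)   # e.g. [2]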
Loss of coherence and forced hallucination Already in the simple task of identifying which lists contain integer numbers in a certain range, GPT-4 makes a substantial number of mistakes, although this is improving.
In a first experiment with 50 sequences out of the 100, the fraction of errors was around 30% (28.6% ± 3.34%) in June 2023, but it is now down to 6.15% ± 2.18%. This is still below human level, for which recognition in adult age is considered essentially error free (Rhodes et al. (2019)). The error rate is even greater with 100 sequences: only the first sequences are analyzed by GPT-4, with an observed error rate which was 55.3% ± 6.41% and is now still 32% ± 3.55%. After a certain number of sequences, GPT-4 stops producing outcomes: the mean number of sequences analyzed before the output stops was 59.4 ± 0.5 out of 100 input sequences, and has now improved to 78 ± 3.64. Despite the improvements, the error rates remain very high.
Prompt: Which rows contain the number 620. Check carefully one row at a time.
[100 rows are provided].
GPT-4: I have found the number 620 in the following row: 621 34 440 789 499 620 919 839 425 23 [this is Row 17, one of the two containing the required number].
Prompt: Are you sure?
GPT-4: Apologies for the confusion earlier. Upon re-examination, I have found the number 620 in the following row: 29 599 395 752 490 970 184 620 649 65 [this is Row 98, the other row containing the required number].
Prompt: It is also in another row, no?
GPT-4: You are correct. I apologize for the oversight. The number 620 appears
in three rows:
Row 17: 621 34 440 789 499 620 919 839 425 23.
Row 98: 29 599 395 752 490 970 184 620 649 65.
Row 110: 892 150 522 605 225 947 660 440 112 742.
Thank you for pointing that out, and I apologize for any confusion earlier.
Loss of coherence and hallucinations are ubiquitous in GPT-4 output, often in tiny details which can be hard to detect. In Appendix B, we report the scoring of a competition-level exercise by GPT-4: the scoring is essentially correct, but GPT-4 hallucinates in one detail. Later on, there is another peculiar mistake: GPT-4 seems to accept as correct one result which it had previously graded as incorrect; somehow, the mere presence of the incorrect result becomes a supporting element for a later statement. We consider this an instance of loss of coherence, in the sense that GPT-4 has lost track of what can be used and what cannot (see end of Appendix B).
Discussion
Summary
We have confirmed that, as expected from tests on other math problems, Chat-
GPT-3.5 in March 2023 was not capable of providing satisfactory answers to cal-
culus problems. Both GPT-4 and ChatGPT-3.5 in September 2023 show substantial
improvement, but can barely reach a passing score on standard problems and are
not capable of handling competition-level questions. Although identification and
description of mathematical tools are appropriate, there remain many arithmetical
mistakes. Instructors should be aware that, at this stage, calculus students could use
the chat to set up the solution, but would have to check every single calculation. In
the end, this could be rather beneficial to the learning process.
A possibility insufficiently explored is that of asking ChatGPT to grade calculus
assignments. At the moment, this is limited by the requirement to submit queries to
ChatGPT in text form, necessitating submissions in an intuitive but unnatural math
textual form. If image submission becomes possible, grading by ChatGPT might
rapidly become a widespread option. Our tests show that while ChatGPT-3.5 used to make more mistakes, the grading by the more recent ChatGPT and by GPT-4 is rather close to that of a human grader. It could be easily implemented, with appropriate prompt engineering (Lewkowycz et al. (2022)). Errors are mostly due to arithmetical
intriguing insights into the interplay between machine learning and pedagogy. His-
torically, attempts to apply AI for educational purposes have seen fluctuating levels
of success, contingent upon the specificity and complexity of tasks involved (Chen
et al. (2020); Ilkka (2018)). Our recent findings reveal a consistent trend observed
across various AI evaluations: while ChatGPT-3.5 shows limitations in delivering
satisfactory answers to calculus problems, its successor, GPT-4, demonstrates dis-
cernible improvements, albeit with lingering issues. These results have been observed
in other studies (Calonge et al. (2023); Dao and Le (2023); Okonkwo and Ade-Ibi-
jola (2021); Bhutoria (2022); Binz and Schulz (2023)) and mirror the overarching
trajectory of AI in education, whereby progressive iterations inch closer to human-
like performances yet grapple with nuanced tasks, especially competition-level
questions.
The proposition to deploy ChatGPT for grading calculus assignments is indeed
novel; similar intents have been explored for other disciplines (Fiacco et al. (2023);
Mizumoto and Eguchi (2023); Nilsson and Tuvstedt (2023); Ndukwe et al. (2019)).
Presently, the feasibility of this idea is hindered by the textual constraints imposed
on query submissions. The contemporary reliance on textual inputs can be juxta-
posed against earlier endeavors in AI education, which predominantly hinged on
graphical interfaces (Shabrina et al. (2023)). Transitioning to an image-based sub-
mission system could revolutionize the utility of ChatGPT in educational settings,
enabling more intuitive and natural student–teacher interactions.
Interestingly, our evaluations unveiled a potential strength of GPT-4, whereby its grading, with suitable prompt engineering, bears some resemblance to that of human graders. The very act of grading, which likely remains an unseen task during GPT-4's training, underscores the model's latent intelligence, aligning with recent explorations of AI's ability to "comprehend" novel requests (Srivastava et al. (2022)).
However, educators and AI researchers must remain cognizant of the emerging challenges, among which are arithmetical mistakes and the phenomena of 'loss of coherence' and hallucination. This observed decrement in performance over repetitive tasks could be symptomatic of the inherent attention mechanisms in transformer models (Vaswani et al. (2017)).
In summation, as AI’s footprint in education continues to grow, so do the nuances
and intricacies of its application. Our exploration with ChatGPT, both its potentials
and drawbacks, exemplifies this dynamic interplay. Future endeavors must remain
attuned to these observations, ensuring that as AI models evolve, so does their effi-
cacy in pedagogical settings (Stoica (2022)).
The mere possibility of having papers graded by ChatGPT raises a number of ethical, pedagogical, and regulatory issues (European Commission (2023); De Winter et al. (2023)):
suitable prompt engineering, and even if the grading is stated by the chat without raising doubts, automated systems such as ChatGPT also make frequent errors. One can envision that, if appropriately carried out, the integration of human and automated grading could lead to an overall reduction of unreliability. The opposite effect of compounding mistakes is, however, a risk, so it is a future challenge to find fruitful ways of combining the two methods (Budhwar et al. (2023)).
• Fairness: An analogous potential concern is about fairness. The output of ChatGPT is unlikely to be biased towards any specific student, especially if submissions cannot be traced to any specific individual. A potential concern regarding fairness might stem from the inability to recognize equivalent expressions, possibly linked to students' prior educational experience. To reduce the likelihood of the AI failing to recognize diverse but correct expressions of the solution, integrating prompt engineering into the course could be beneficial (Tinguely et al. (2023)). It appears, however, that grading by AI in general gives a perception of greater fairness (Chai et al. (2024)); a possible reason is the greater transparency in the process due to the more detailed explanations that AI can easily incorporate in the feedback (Chai et al. (2024)).
• Speed: The great advantage of using systems like ChatGPT to grade submissions is that they can drastically decrease the time needed by a human grader to complete grading. This is an enormous benefit, as feedback can be provided when it is most useful, and not with a distracting delay (Kasneci et al. (2023); Fiacco et al. (2023)).
• Partial credit: One of the relevant features of grading is the ability to capture par-
tial elements of the solution, even if the final result is not exactly correct. While
automated grading is already available through dedicated websites and pre-
arranged questions, this would be the first instance of grading by a generic tool.
The human factor, which is partially built into dedicated platforms, would have
to be replaced by the wisdom embedded in the training of an LLM. Our results show that the outputs of the chat appropriately reflect partial contributions, occasionally even better than the human counterparts have been able to detect.
• Submission language: To allow for the possibility of having ChatGPT provide scores, submissions must use a specific language. This is a relevant issue, as one either uses TeX, which is not known by most calculus students, or uses a math plain-text format, which is not entirely consistent and can create ambiguities or misunderstandings (Keely (2009)). This issue is likely to be solved when images are accepted in the prompts.
• Regulations: Due to the ethical issues raised by the use of GPT in education, and in particular in grading, suitable regulations are necessary on the institutional side, especially with regard to data privacy and security, transparency of usage, fair access, and sustainability: we refer to thorough studies for extensive discussion of these issues (Kasneci et al. (2023); Zhou et al. (2023)).
Rapid technological advances are occurring in the field of AI. Our work has spe-
cifically focused on grading Calculus problems with ChatGPT or GPT-4, adopting
the perspective of an end-user. This viewpoint will inform future analyses and the
development of grading processes when more sophisticated tools will become avail-
able. Additionally, we have highlighted issues such as loss of coherence and ethical
considerations, which will demand closer scrutiny in the near future.
More importantly, our analysis largely presupposes the availability of technological advancements, such as the input of handwritten solutions (Tiwari et al. (2023)) and the reliable integration with plugins (Davis and Aaronson (2023)).
Limitations
Our study is limited by the use of a small sample of questions from a single text-
book. Therefore, the generalizability of our results to other sources of calculus prob-
lems is unclear and further testing is necessary.
The risk of cross contamination is moderate: the problems from the actual Calculus exams were never published; the Stewart problems are available together with their solutions; the Stanford problems are from the 2021 competition, which was published in 2022, after the GPT pre-training data cutoff in September 2021.
Although we were able to reduce the occurrence of trivial mistakes from the chat
by providing it with a solution and additional warnings, human supervision is still
necessary to ensure accurate grading.
The Math plain-text format we adopted can be ambiguous, particularly if paren-
theses are not used carefully (Keely (2009)). This may have contributed to errors
made by ChatGPT. Moreover, in order to implement the grading process in real-
world situations, students may need to be required to submit their work in this for-
mat, which could create an additional burden for them.
In spite of providing a useful context on how to address automated grading in Calculus, this study could be made obsolete by rapidly advancing technologies.
Conclusions
and accelerating the grading process while maintaining fairness. Besides this, the
insights gained from this study have broader implications beyond calculus grading,
as we outline below.
Future directions
The swift evolution of large language models like GPT paints an optimistic trajec-
tory for their integration into the educational landscape. The transition from Chat-
GPT-3.5 to GPT-4 over a brief period emphasizes the potential of rapid enhance-
ments in forthcoming versions. The current foundational research on automated
calculus grading by LLMs allows us to delineate the following avenues for prospective applications and explorations:
We now provide the solution, the grading rubric and 6 synthetic solutions, asking
the chat to grade. Based on the submissions and the provided solution and grading
rubric, the scores that we expect are as follows:
Submission 1: 10 pts.
Submission 2: 0 pts.
Submission 3: 10 pts.
Submission 4: 0 pts.
Submission 5: 0 pts.
Submission 6: 0 pts.
We see that both versions of ChatGPT grade correctly, with GPT-4 also noticing
explicitly that one answer was algebraically equivalent to the provided solution. This
has not happened often, and algebraically equivalent solutions are generally over-
looked by the chat.
The following is an exercise from the Stanford competition, and it illustrates how GPT-4 is more apt at solving calculus problems than GPT-3.5, while still not satisfactory, and possesses a substantial ability to grade the solutions, although punctuated by hallucinations and loss of coherence. Notice that the whole conversation has taken place in TeX format, so it is not intelligible to humans unless further compiled in LaTeX; the TeX code returned by ChatGPT compiles perfectly (it only required the additional package "cancel").
We summarize the results and include the detailed answers in the online material.
We initially provided the text of the exercise, and asked for solutions:
Prompt: Compute $\prod_{n=0}^{2020} \sum_{k=0}^{2020} \left(\frac{1}{2020}\right)^{k(2021)^{n}}$
Based on the rubric and the solution, the expected scores for the submissions are reported in the online material.
Both scorings are essentially well done, with an immediate output and a correct interpretation, returned in correct TeX format. A human would have taken a few minutes to grade these exercises, plus the time to compile the TeX.
In detail, GPT-4 misses, by hallucination, that there is no explanation in Submission 1; correctly identifies the minor error in Submission 2; gives the appropriate partial credit in Submission 3; and finds a non-existent minor error in Submission 4, very likely by the mechanism of loss of coherence, as GPT-4 now takes as correct the incorrect final formula of Submission 2.
The online supplementary material contains all the evidence supporting the claimed
results:
1— Grading of the Calculus course exercises. For each of the 10 exercises, the file
includes:
– The text of the exercise.
– The solutions proposed by GPT.
– The prompt for grading with the rubric, correct solutions, and correspond-
ing submissions.
– The expected scores.
– The scores assigned by GPT.
2— Grading of the Stanford competition exercises. The structure/content format is as detailed in point 1.
3— Grading of the Stewart Calculus textbook exercises. The structure/content format is as detailed in point 1.
4— A spreadsheet summarizing all the scores and statistical analyses.
5— The details of the conversations with GPT-4 regarding the Stanford competition exercise discussed in Appendix B.
6— The entire chat with GPT-4 about identifying which lists out of 100 contain numbers in [32, 34].
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s40593-024-00403-3.
Acknowledgements The author would like to thank M. Grasselli, E. Beretta, T. Rahwan, and Y. Zaki for interesting discussions and for pointing to some of the relevant literature, and the reviewers for their careful comments, which helped improve the manuscript.
Declarations
Ethical Approval The research was approved by the Institutional Review Board of New York University
Abu Dhabi (HRPP-2023–23). All research was performed in accordance with relevant guidelines and
regulations. Informed consent was obtained from all participants in every segment of this study.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permis-
sion directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/
licenses/by/4.0/.
References
AoPSOnline. (2022). 2021 SMT Team Round - Stanford Math Tournament. Retrieved March 30,
2023, from https://artofproblemsolving.com/community/c4h2775229p24340532.
Baral, S., Seetharaman, K., Botelho, A. F., Wang, A., Heineman, G., & Heffernan, N. T. (2022).
Enhancing auto-scoring of student open responses in the presence of mathematical terms and
expressions. International Conference on Artificial Intelligence in Education (pp. 685–690).
Springer International Publishing.
Baral, S., et al. (2021). Improving Automated Scoring of Student Open Responses in Mathematics. Inter-
national Educational Data Mining Society.
Baral, S., et al. (2023). Auto-Scoring Student Responses with Images in Mathematics. International Edu-
cational Data Mining Society.
Bhutoria, A. (2022). Personalized education and artificial intelligence in the United States, China, and
India: A systematic review using a Human-In-The-Loop model. Computers and Education: Artifi-
cial Intelligence. Elsevier, 100068.
Binz, M., & Schulz, E. (2023). Using cognitive psychology to understand GPT-3. Proceedings of the
National Academy of Sciences, 120(6), e2218523120.
Botelho, A., et al. (2023). Leveraging natural language processing to support automated assessment and
feedback for student open responses in mathematics. Journal of Computer Assisted Learning, 39(3),
823–840.
Bubeck, Sébastien, et al. (2023) Sparks of artificial general intelligence: Early experiments with gpt-4.
arXiv preprint arXiv:2303.12712.
Budhwar, P., Chowdhury, S., Wood, G., Aguinis, H., Bamber, G. J., Beltran, J. R., Boselie, P., Cooke, F.
L., Decker, S., DeNisi, A., & others. (2023). Human resource management in the age of generative
artificial intelligence: Perspectives and research directions on ChatGPT. Human Resource Manage-
ment Journal. Wiley Online Library.
Calonge, D. S., Smail, L., & Kamalov, F. (2023). Enough of the chit-chat: A comparative analysis of four
AI chatbots for calculus and statistics. Journal of Applied Learning and Teaching, 6(2).
Chai, F., et al. (2024). Grading by AI Makes Me Feel Fairer? How Different Evaluators Affect College
Students’ Perception of Fairness. Frontiers in Psychology, 15, 1221177.
Chat-GPT-LangChain (2023). Retrieved February 12, 2023, from https://huggingface.co/spaces/JavaF
Xpert/Chat-GPT-LangChain
Chen, L., Chen, P., & Lin, Z. (2020). Artificial intelligence in education: A review. IEEE Access, 8,
75264–75278. IEEE.
Chen, L., Zaharia, M., & Zou, J. (2023). How is ChatGPT’s behavior changing over time? arXiv preprint
arXiv:2307.09009.
Chen, M., et al. (2021). Evaluating large language models trained on code. arXiv preprint
arXiv:2107.03374.
Chlipala, A. (2022). Certified programming with dependent types: A pragmatic introduction to the Coq
proof assistant. MIT Press.
Chung, H. W., et al. (2022). Scaling Instruction-Finetuned Language Models. arXiv preprint
arXiv:2210.11416.
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., ... & Tworek, J. (2021). Training
verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
Crompton, H., & Burke, D. (2023). Artificial intelligence in higher education: The state of the field.
International Journal of Educational Technology in Higher Education, 20(1), 1–22. SpringerOpen.
Crothers, E., Japkowicz, N., & Viktor, H. (2022). Machine Generated Text: A Comprehensive Survey of
Threat Models and Detection Methods. arXiv preprint arXiv:2210.07321.
Dao, X.-Q., & Le, N.-B. (2023). Investigating the Effectiveness of ChatGPT in Mathematical Reasoning
and Problem Solving: Evidence from the Vietnamese National High School Graduation Examina-
tion. arXiv preprint arXiv:2306.06331.
Dao, X.-Q. (2023). Which Large Language Model should You Use in Vietnamese Education: ChatGPT, Bing Chat, or Bard?
Davis, E., & Aaronson, S. (2023). Testing GPT-4 with Wolfram Alpha and Code Interpreter plug-ins on
math and science problems. arXiv preprint arXiv:2308.05713.
De Winter, J. C. F., Dodou, D., & Stienen, A. H. A. (2023). ChatGPT in Education: Empowering educators through methods for recognition and assessment. Informatics, 10(4). MDPI.
Dimiceli, V. E., Lang, A. S., & Locke, L. (2010). Teaching calculus with Wolfram| Alpha. International
Journal of Mathematical Education in Science and Technology, 41(8), 1061–1071. Taylor & Francis.
Erickson, J. A., Botelho, A. F., McAteer, S., Varatharaj, A., & Heffernan, N. T. (2020). The automated
grading of student open responses in mathematics Proceedings of the Tenth International Confer-
ence on Learning Analytics & Knowledge, Frankfurt, Germany.
European Commission (2023). Ethical Guidelines on the Use of Artificial Intelligence (AI) and Data in
Teaching and Learning for Educators. Available online: https://op.europa.eu/en/publication-detail/-/
publication/d81a0d54-5348-11ed-92ed-01aa75ed71a1/language-en (accessed on February 12, 2024).
Fiacco, J., Adamson, D., & Ros, C. (2023). Towards Extracting and Understanding the Implicit Rubrics
of Transformer Based Automatic Essay Scoring Models. In Proceedings of the 18th Workshop on
Innovative Use of NLP for Building Educational Applications (BEA 2023) (pp. 232–241).
Frieder, S., et al. (2024). Mathematical capabilities of ChatGPT. Advances in Neural Information Processing Systems, 36.
Gao, C. A., Howard, F. M., Markov, N. S., Dyer, E. C., Ramesh, S., Luo, Y., & Pearson, A. T. (2022).
Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial
intelligence output detector, plagiarism detector, and blinded human reviewers. bioRxiv. Cold
Spring Harbor Laboratory.
Gao, R.; Merzdorf, H.E.; Anwar, S.; Hipwell, M.C.; Srinivasa, A. (2023) Automatic Assessment
of Text-Based Responses in Post-Secondary Education: A Systematic Review. arXiv 2023,
arXiv:2308.16151.
Garg, S., Tsipras, D., Liang, P. S., & Valiant, G. (2022). What can transformers learn in-context? A
case study of simple function classes. Advances in Neural Information Processing Systems, 35,
30583–30598.
Ghaith, S. (2024). The Triple Attention Transformer: Advancing Contextual Coherence in Transformer Models.
Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X., & Zhou, D. (2023). Large lan-
guage models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798.
Ibrahim, H., et al. (2023). Perception, performance, and detectability of conversational artificial intel-
ligence across 32 university courses. Scientific Reports, 13(1), 12187.
Ilkka, T. (2018). The impact of artificial intelligence on learning, teaching, and education. European
Union.
Jacob, A. (2023). The Impact of Context Window Limitation on AI and Insights from GPT.
Published online: www.linkedin.com/pulse/impact-context-window-limitation-ai-insig
hts-from-gpt-jacob-adm/.
Johnson, W. L. (2023). How to Harness Generative AI to Accelerate Human Learning. International
Journal of Artificial Intelligence in Education. Springer, 1–5.
Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh,
G., Günnemann, S., Hüllermeier, E., & others. (2023). ChatGPT for good? On opportunities
and challenges of large language models for education. Learning and individual differences,
103:102274. Elsevier.
Keely, S. J. (2009). Writing Mathematical Expressions in Plain Text – Examples and Cautions.
Retrieved February 11, 2023, from https://www.integreat.ca/OL/docs/WritingMathPlainText.pdf.
Kumar, K. (2023). Geotechnical Parrot Tales (GPT): Overcoming GPT hallucinations with prompt
engineering for geotechnical applications. arXiv preprint arXiv:2304.02138.
Lewkowycz, A., et al. (2022). Solving quantitative reasoning problems with language models.
Advances in Neural Information Processing Systems, 35, 3843–3857.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and predict:
A systematic survey of prompting methods in natural language processing. ACM Computing Sur-
veys, 55(9), 1–35.
Mizumoto, A., & Eguchi, M. (2023). Exploring the potential of using an AI language model for auto-
mated essay scoring. Research Methods in Applied Linguistics, 2(2), 100050. Elsevier.
Ndukwe, I. G., Daniel, B. K., & Amadi, C. E. (2019). A machine learning grading system using chat-
bots. In Artificial Intelligence in Education: 20th International Conference, AIED 2019, Chi-
cago, IL, USA, June 25–29, 2019, Proceedings, Part II 20 (pp. 365–368). Springer.
Nilsson, F., & Tuvstedt, J. (2023). GPT-4 as an Automatic Grader: The accuracy of grades set by
GPT-4 on introductory programming assignments.
Okonkwo, C. W., & Ade-Ibijola, A. (2021). Chatbots applications in education: A systematic review.
Computers and Education: Artificial Intelligence, 2, 100033. Elsevier.
OpenAI. (2023). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv
preprint arXiv:2203.02155.
Rhodes, S., Greene, N. R., & Naveh-Benjamin, M. (2019). Age-related differences in recall and recogni-
tion: A meta-analysis. Psychonomic Bulletin & Review, 26, 1529–1547.
Schneider, J., Richner, R., & Riser, M. (2023). Towards trustworthy autograding of short, multi-lingual, multi-
type answers. International Journal of Artificial Intelligence in Education, 33(1), 88–118. Springer.
Shabrina, P., Mostafavi, B., Abdelshiheed, M., Chi, M., & Barnes, T. (2023). Investigating the impact
of backward strategy learning in a logic tutor: Aiding subgoal learning towards improved problem
solving. International Journal of Artificial Intelligence in Education, 1–37. Springer.
Srivastava, A., et al. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of
language models. arXiv preprint arXiv:2206.04615.
Stewart, J. (2020). Calculus: Early transcendentals. International metric edition. 9th Edition. CENGAGE
Learning, 759.
Stoica, E. (2022). A student’s take on challenges of AI-driven grading in higher education. B.S. thesis,
University of Twente.
Suzgun, et al.(2022). Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv
preprint arXiv:2210.09261.
Tamkin, A., Brundage, M., Clark, J., & Ganguli, D. (2021). Understanding the capabilities, limitations,
and societal impact of large language models. arXiv preprint arXiv:2102.02503.
Tinguely, P. N., Lee, J., & He, V. F. (2023). Designing Human Resource Management Systems in the Age
of AI. Journal of Organization Design, Forthcoming.
Tiwari, Adarsh, Sanket Biswas, and Josep Lladós (2023). "Can Pre-trained Language Models Help in
Understanding Handwritten Symbols?." International Conference on Document Analysis and Rec-
ognition. Cham: Springer Nature Switzerland.
Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Vig, J. (2019). A multiscale visualization of attention in the transformer model. arXiv preprint
arXiv:1906.05714.
Wenzel, M., Paulson, L. C., & Nipkow, T. (2008). The Isabelle framework. In Theorem Proving in Higher
Order Logics: 21st International Conference, TPHOLs 2008 (pp. 33–38). Springer.
Yijun Liu, Jian-Yun Bi, and Zhi-Ping Fan (2020). Deep learning for natural language processing: Advan-
tages and challenges. In National CCF Conference on Natural Language Processing and Chinese
Computing, pages 3–14. Springer.
Yuan, W., Liu, P., & Neubig, G. (2022). Can we automate scientific reviewing? Journal of Artificial Intel-
ligence Research, 75, 171–212.
Zhang, M., Baral, S., Heffernan, N., & Lan, A. (2022). Automatic short math answer grading via in-
context meta-learning. arXiv preprint arXiv:2205.15219.
Zhang, T., Zhang, Y., Vineet, V., Joshi, N., & Wang, X. (2023). Controllable Text-to-Image Generation
with GPT-4. arXiv preprint arXiv:2305.18583.
Zhou, J., Müller, H., Holzinger, A., & Chen, F. (2023). Ethical ChatGPT: Concerns, challenges, and com-
mandments. arXiv preprint arXiv:2305.10646.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.