ChatGPT's Mathematical Evaluation 2023
Abstract
We investigate the mathematical capabilities of two versions of ChatGPT
(released 9-January-2023 and 30-January-2023) and of GPT-4 by testing
them on publicly available datasets, as well as hand-crafted ones, using a
novel evaluation scheme. In contrast to formal mathematics, where large
databases of formal proofs are available (e.g., mathlib, the Lean Mathe-
matical Library), current datasets of natural-language mathematics used to
benchmark language models either cover only elementary mathematics or are
very small. We address this by publicly releasing two new datasets: GHOSTS
and miniGHOSTS. These are the first natural-language datasets curated
by working researchers in mathematics that (1) aim to cover graduate-level
mathematics, (2) provide a holistic overview of the mathematical capabilities
of language models, and (3) distinguish multiple dimensions of mathematical
reasoning. These datasets test, by using 1636 human expert evaluations,
whether ChatGPT and GPT-4 can be helpful assistants to professional
mathematicians by emulating use cases that arise in the daily professional
activities of mathematicians. We benchmark the models on a range of
fine-grained performance metrics. For advanced mathematics, this is the
most detailed evaluation effort to date. We find that ChatGPT and GPT-4
can be used most successfully as mathematical assistants for querying facts,
acting as mathematical search engines and knowledge base interfaces. GPT-4 can
additionally be used for undergraduate-level mathematics but fails on
graduate-level difficulty. Contrary to many positive reports in the media
about GPT-4 and ChatGPT’s exam-solving abilities (a potential case of
selection bias), their overall mathematical performance is well below the
level of a graduate student. Hence, if you aim to use ChatGPT to pass
a graduate-level math exam, you would be better off copying from your
average peer!
∗Corresponding author: [email protected].
1 Introduction
Since its release in November 2022, the language model Chat Generative Pre-trained Trans-
former (ChatGPT) has rapidly become a widely known question-and-answer dialogue system.
ChatGPT has been referenced in mainstream media across the globe [1–4] and across all
major internet platforms [5, 6]. The release of ChatGPT's successor, GPT-4, followed in
March 2023 with similar reactions [7].
The performance of ChatGPT has been analyzed in a large number of exam-related use cases,
with varying degrees of scientific rigor, ranging from detailed studies to anecdotal evidence.
Use cases include passing the United States Medical Licensing Examination (USMLE) [8],
scoring highly on the Psychology Today Verbal-Linguistic Intelligence IQ Test [9], and
answering (and generating) Operations Management exam questions that were deemed to
be within the scope of a typical MBA curriculum [10], all with a performance that elicited
a positive sense of surprise from the authors. In turn, the performance of GPT-4 even
surpasses that of ChatGPT on a large batch of academic and professional exams [7, Table 1].
Such strong task-related performance indicates that large language models (LLMs) could be
frequently used as assistants in many domains.
In this paper, we introduce a new dataset called GHOSTS, which measures the advanced
mathematical abilities of LLMs. Using this dataset, we perform a detailed analysis of the
mathematical capabilities of ChatGPT on two of its versions, the 9-January-2023 and the
30-January-2023 version. Note that, according to the release notes, the 30-January-2023
version should possess “improved factuality and mathematical capabilities” [11]. We further
examine the performance of GPT-4 on a smaller dataset called miniGHOSTS, which exhibits
statistics similar to the larger GHOSTS dataset. We also make available microGHOSTS, a
subset of miniGHOSTS designed to facilitate the pre-screening of language models at minimal
human evaluation cost. Our analysis includes but is not
limited to testing how many of the skills necessary to do professional mathematics can be
emulated by these models. Examples of such skills are the ability to answer computational
questions, the ability to complete mathematical proofs that have gaps or missing steps, the
ability to solve questions that are more focused on deep insights and original solutions, such
as those of mathematical olympiads, and the ability to survey the literature and think across
domains. None of the previous benchmarks (see Section 2) cover such a broad range of
mathematical abilities.
To achieve the goals outlined above, GHOSTS consists of carefully composed prompts aimed
at testing different aspects of LLMs related to mathematical comprehension; see Section 3.
This includes both hand-crafted prompts as well as samples from existing datasets that were
devised to test models specifically trained for mathematical comprehension [12, 13].
For brevity, we use the expression “(Chat)GPT” to refer collectively to both the ChatGPT
and GPT-4 language models. We refer to Appendix C for further details regarding different
(Chat)GPT versions.
To evaluate the output of (Chat)GPT, we designed a thorough testing methodology, including
warning and error codes that represent various possible failure modes of (Chat)GPT. We
score (Chat)GPT’s responses, report on the results using this methodology, and compare
(Chat)GPT to a selection of state-of-the-art models trained for mathematical comprehension.
In summary, the contributions of this article are threefold:
• Benchmark for testing the mathematical capabilities of LLMs: We introduce
a new natural-language mathematics dataset, called GHOSTS2 , to test the capabilities
of LLMs across a range of aspects regarding advanced mathematical comprehension;
see Section 3. It consists of two subdatasets derived from state-of-the-art datasets of
mathematical queries for language models. Additionally, we devise four hand-crafted
subdatasets covering further mathematical tasks. Parts of our dataset consist of problems
that were selected to have a high probability of not being in the data on which (Chat)GPT
was trained; see tags D1-D3 from Table 1.
2 The GHOSTS dataset is available at github.com/friederrr/GHOSTS, and the project is hosted at ghosts.friederrr.org.
• Insight for mathematical use of (Chat)GPT: Based on our benchmark, we show
for which types of questions and which domains of mathematics (Chat)GPT may be
useful and how it could be integrated into the workflow of a mathematician. On the other
hand, we identify the failure modes, as well as the limits of its capabilities. This can
aid future efforts to develop LLMs that perform better in mathematics. Our analysis is
akin to a mathematical model card [14] in terms of intended use, metrics, evaluation data,
and qualitative analyses, as mathematical strengths and weaknesses of (Chat)GPT are
summarized; see Section 4.
• Evaluation of improvements of (Chat)GPT: We can further use our benchmark
to track the mathematical capabilities of (Chat)GPT variants over time. As a first step,
we analyze the impact of the upgrade from the 9-January-2023 to the 30-January-2023
version of ChatGPT, which promises a better mathematical performance according to the
release notes. Then, we proceed to investigate what performance increases the successor
GPT-4 brings; see Section 4.1.
2 Related Work
As a language model, (Chat)GPT can be universally employed to perform mathematical
reasoning and, therefore, has to compete with technologies in this space that are sometimes
decades old. Performing mathematical reasoning in an automated way has a long history
and can be traced back to 1959 [15], with most of the focus devoted to proving theorems [16].
According to Harrison [17], there is a realization that classical approaches, using a symbolic
encoding of mathematics, have reached a “plateau”.
On the other hand, there is now a growing body of literature on learning mathematical
relationships directly in a supervised-learning manner [18–20] or by using LLMs to perform
mathematical reasoning directly on mathematics encoded in natural language [21]. Sometimes,
the distinction is blurred, because architectures of LLMs can also be used in a supervised-
learning setting and have been employed successfully in learning mathematical relationships,
such as between the syntactical form of a function and its integral [13, 22].
Among the supervised approaches, we mention [13], where a Transformer architecture [23]
was used to generate symbolic, closed-form solutions to integrals and first- and second-order
differential equations, which outperformed classical solvers3, such as Mathematica, MATLAB,
and Maple, by at least 14% on a test set of integration problems. On the task of solving
differential equations, the Transformer-based approach still exceeds the classical approach,
but by a smaller margin (at least 4% in the case of first-order differential equations and with
more varied results for second-order equations).
Regarding LLMs, recent models, for instance PaLM [24] (released in 2022), are tested only
on elementary-level mathematical reasoning datasets, such as the MathQA or GSM8K
datasets [25, 26]. We hypothesize that this is due to a lack of advanced-level natural language
mathematics datasets. Moreover, the results obtained indicate that the models at that time
had difficulty with much simpler datasets than ours. For example, the version of PaLM with
540 billion parameters only correctly solves 58% of the problems of the GSM8K dataset,
even with chain-of-thought prompting and access to an external calculator [24, Table 10].
This model nonetheless outperforms GPT-3 [27], which only achieves 54% on the same
dataset. Variations of BERT [28] have been shown to only solve between 28% and 37% of
the problems when fine-tuned and tested on the Algebra Question Answering with Rationales
(AQuA-RAT) dataset [29], which is the direct predecessor of MathQA. For some models,
such as BLOOM [30] or the LaMDA model [31] (both released in 2022), an evaluation of the
mathematical reasoning capability is entirely missing. An up-to-date survey on mathematical
datasets and the performance of various LLMs can be found in [32].
Most similar to our dataset are the NaturalProofs dataset [33] and the NaturalProofs-
Gen dataset [34]. In this paragraph, we illustrate the similarities and differences between
these datasets and ours. NaturalProofs and NaturalProofs-Gen are similar among
3 For a given prompt, the computer algebra system is considered to have failed if it does not provide a closed-form solution or times out after 30 seconds (in the case of Mathematica).
themselves and cover graduate-level mathematics by focusing on data from ProofWiki4 (the
latter dataset), as well as on the Stacks Project5 and two open-source textbooks (the former
dataset). Using the LATEX source code, which is available for all these resources, annotated
theorems and their proof graphs are extracted. The annotations consist of reference graphs
highlighting references to other theorems or definitions, the idea being that these references
capture the “skeleton” of a proof. This task resembles the mathematical abilities that the
Named Theorem Proof Completion subdataset from the GHOSTS dataset evaluates (see
Table 1), although (1) we only retrieve a single reference, and (2) (Chat)GPT, as far as is
known, does not use training objectives that make use of information from data annotation,
in contrast to models evaluated in [33, 34].
Our framework pertains to general language model evaluation, which may be presented in a
black-box manner (as is the case for (Chat)GPT), and therefore does not allow leveraging
any additional information, such as reference graphs. This is also reflected in the human
evaluation schema introduced in [34, Table 24], which classifies common model mistakes.
As reference graphs form the foundation of how the mathematical proofs are engineered,
many elements of the evaluation schema are strongly tailored toward this representation of
mathematical data. Our benchmark is not reference-centric and therefore allows evaluations of
any type of proof (including computations, as featured in the Symbolic-Integration subdataset,
which we consider to be a particular kind of proof). Therefore, our methodology includes
further and more general failure modes to make for a more fine-grained evaluation that
explains the nature of the errors. We refer to Appendix A for further related works.
Table 1: A summary of all the files from all the subdatasets comprising our GHOSTS dataset,
together with their size, i.e., the number of prompts and their associated attribute tags. The
tags Mi, Qi, and Di relate to the level of Mathematical difficulty, the Question type, and the
Out-of-Distribution type from Section 3.1, respectively. For the Olympiad-Problem-Solving
subdataset, 24 further prompts were created to which prompt engineering was applied; see
Section 4.2. These 24 prompts do not count towards the 709 total prompts, only towards
the 1636 evaluations.
3.1 Subdatasets
The subdatasets that make up our GHOSTS dataset are summarized in Table 1. In the
following, we describe each subdataset in more detail.
Grad-Text. This subdataset consists of a collection of books (R. Durrett's Probability
Theory [35], J. R. Munkres' Topology [36], and W. Rudin's Functional Analysis [37]) that are
widely used in universities to teach upper-undergraduate or first-year graduate courses in
a mathematics degree. We used most of the exercises from the first and second chapters of
these books, except for [35], where we only used exercises from the first chapter, which is
longer than the corresponding chapters of the other books.
Holes-in-Proofs. This subdataset consists of a number of proofs sourced from
math.stackexchange.com, as well as some proofs sourced from books (S. Axler's Linear Algebra
Done Right [38] and W. Rudin's Principles of Mathematical Analysis [39]), and from the
MATH dataset [12], where parts of the proofs were intentionally deleted and the LLM was
prompted to fill in the gaps: This was done either by (1) using a MISSING token, (2) finishing
the proof early and prompting the LLM to complete it, or (3) explicitly asking for certain
conditions or results.
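To make the gap-creation modes concrete, the following minimal sketch assembles a prompt of type (1); the proof fragment is a toy example invented for illustration and is not a data point of the subdataset.

# Toy illustration of gap-creation mode (1): a proof step is replaced by a
# MISSING token and the model is asked to restore it. The proof fragment is
# invented for illustration and is not taken from Holes-in-Proofs.
proof_with_gap = (
    "Since $f$ is continuous on the compact set $K$, MISSING, "
    "and hence $f$ attains its maximum on $K$."
)
prompt = "Fill in the part of the following proof marked MISSING:\n\n" + proof_with_gap
print(prompt)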
Olympiad-Problem-Solving. This subdataset consists of a selection of exercises from A.
Engel's Problem-Solving Strategies [40], a book often used to prepare for mathematical
competitions. We selected one hundred exercises drawn from all chapters and graded the
LLM outputs on them.
Symbolic-Integration. This subdataset consists of random samples of integrals from
the test set of [13]. There are three ways in which integrals are generated in [13]: Forward
generation (FWD), Backward generation (BWD), and Backward generation with integration
by parts (IBP). We sample 21 integrals from the FWD test set, 20 integrals from the BWD
test set, and 59 integrals from the IBP test set. As these integrals are given in Polish/prefix
notation, a natural-language prompt conversion of them is unlikely to be witnessed in the
training dataset of (Chat)GPT. The assessment was done by verifying the correctness of
the output both with Mathematica and against the provided solutions (in Polish notation),
which [13] generated using SymPy. In particular, we note that all integrals in this dataset
have solutions that can be expressed using elementary functions.
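Two technical steps in this pipeline lend themselves to a code sketch: converting a prefix (Polish) expression into infix form before phrasing it as a natural-language prompt, and verifying a claimed antiderivative by differentiation (as one would do with Mathematica or SymPy). The token format and operator set below are simplified assumptions; the actual vocabulary of [13] is larger.

import sympy as sp

# Step 1 (sketch): convert a prefix (Polish) token list into an infix string.
# The arity table is a simplified stand-in for the operator vocabulary of [13].
ARITY = {"add": 2, "mul": 2, "pow": 2, "sin": 1, "cos": 1, "exp": 1}
INFIX = {"add": "+", "mul": "*", "pow": "**"}

def prefix_to_infix(tokens):
    """Consume tokens from the front; return (infix string, remaining tokens)."""
    head, rest = tokens[0], tokens[1:]
    if head not in ARITY:  # a variable or a numeric literal
        return head, rest
    args = []
    for _ in range(ARITY[head]):
        arg, rest = prefix_to_infix(rest)
        args.append(arg)
    if head in INFIX:
        return f"({args[0]} {INFIX[head]} {args[1]})", rest
    return f"{head}({args[0]})", rest

expr, _ = prefix_to_infix(["mul", "2", "sin", "pow", "x", "2"])
print(expr)  # (2 * sin((x ** 2)))

# Step 2 (sketch): check a claimed antiderivative by differentiating it and
# comparing with the integrand, rather than comparing symbolic forms directly.
x = sp.Symbol("x")
integrand = sp.cos(x) * sp.exp(sp.sin(x))
candidate = sp.exp(sp.sin(x))  # a (correct) claimed antiderivative
assert sp.simplify(sp.diff(candidate, x) - integrand) == 0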
MATH. This subdataset consists of a random sample of problems from the MATH
dataset [12]. The latter dataset attaches a level of difficulty to each problem. We focused on
two domains, Algebra and Probability Theory, and sampled an equal number of problems at
each level of difficulty.
Search-Engine-Aspects. This subdataset consists of problems that were not sampled
from a particular source but generated by a human expert in the field. In the file Named
Theorem Proof Completion, we focused on prompting the LLM to provide proof outlines of
various theorems that are sufficiently well-known within Functional Analysis to have names.
In the Definition Retrieval file, we asked the LLM to correctly state various definitions
centered around Functional Analysis and Topology. In contrast, in the Reverse Definition
Retrieval file, we verified whether the LLM was able to deduce the name of a mathematical
object by describing its properties.
Because input to (Chat)GPT is purely textual (at the time of writing), certain types of
questions that might be stated and solved in a non-text-based fashion (e.g., questions involving
graphical diagrams without text explaining the diagram6, as occasionally occur in [40])
have been excluded. Our subdatasets can be categorized along the following dimensions (see
Appendix B.1 for more details):
• Mathematical difficulty (ascending): (M1) Elementary arithmetic problems, (M2)
Symbolic problems, (M3) (Under)graduate-level exercises, (M4) Mathematical olympiad
problems.
• Question type: (Q1) Stating mathematical facts, (Q2) Overview-type review questions,
(Q3) Computational questions, (Q4) Theorem proofs or puzzle solutions, (Q5) Proof-
completion questions.
• Types of high out-of-distribution likelihood: (D1) Nontrivial problem encoding,
(D2) Succinct solution, (D3) Spoken dialogue.
The existing datasets of natural-language mathematics are far from covering all possible
combinations across these dimensions. In our well-crafted GHOSTS datasets, we have striven
to cover each of these aspects individually, as can be seen in Table 1. The next section
specifies the format of our dataset and the methodology for analyzing (Chat)GPT’s output.
3.2 Format
The format of each of the subdatasets that make up our GHOSTS dataset follows the same
convention. Each subdataset consists of JSON-formatted files, and our format is similar to,
e.g., the AQuA-RAT dataset [29]. A single data point7 in a file has the following form:
" prompt " : " Let $X$ be a t o p o l o g i c a l v e c t o r s p a c e . A l l s e t s mentioned
below a r e u n d e r s t o o d t o be t h e s u b s e t s o f $X$ . Prove t h e
f o l l o w i n g s t a t e m e n t : I f $A$ and $B$ a r e compact , s o i s $A + B$ " ,
" output " : "The s t a t e m e n t i s wrong i n g e n e r a l . C o n s i d e r t h e example $A
= [ − 1 , 1 ] \ t i m e s {0} $ and $B = {0} \ t i m e s [ − 1 , 1 ] $ . Then $A$ and
$B$ a r e compact but $A + B = [ − 1 , 1 ] \ t i m e s [ − 1 , 1 ] $ i s not
compact . "
" r a t i n g " : " 2" ,
" e r r o r c o d e s " : [ " e3 " , "e5_2" , "e5_4" ] ,
" warningcodes " : [ ] ,
"comment" : "The g i v e n $A + B$ a c t u a l l y ∗ i s ∗ compact . " ,
"msc" : [ " 46A03" ] ,
" r e f " : " Rudin−F u n c t i o n a l A n a l y s i s −Second−Ed . Part1−ex3 /d−page38 " ,
6
See, e.g., Exercise 15 in [40, Chapter 2], which asked the reader to inspect a figure on which the
problem is based.
7
The JSON object of an output of the 30-January-2023 version of ChatGPT, as identifiable by
the timestamp at which the output was generated, is shown. The prompt comes from the “W.
Rudin, Functional Analysis (ch. 1)” file from the Grad-Text subdataset.
6
" c o n f i d e n c e " : " high " ,
" timestamp " : "2023−01−31"
We require each data point to have the same JSON keys as in this example, some of which
may be empty depending on the prompt. Among the listed keys, the rating key stands
out as the most fundamental one. Its value serves as a condensed representation of the
mathematical capability of the tested language model, compressed into a one-dimensional
measure ranging from 1 (lowest) to 5 (highest). A more nuanced and fine-grained perspective
on the mathematical capabilities is provided by the errorcodes and warningcodes keys.
The msc key denotes the mathematics subject classification. We explain each JSON key in
Appendix B.2. For end-users of (Chat)GPT, it is desirable to avoid having a long-winded
dialogue to arrive at a solution. Therefore, we require that (Chat)GPT provides us with the
correct solution given only the input prompt without any subsequent interaction.
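The fixed key set makes the files straightforward to consume programmatically. The following minimal sketch loads one subdataset file and reproduces summary statistics of the kind reported in Section 4; it assumes that each file stores a flat JSON array of data points (the nested lists of prompt-engineered pairs described in Section 4.2 would need to be flattened first), and the file name is illustrative.

import json
import statistics

def load_ratings(path):
    """Load one GHOSTS subdataset file and return its ratings as integers."""
    with open(path, encoding="utf-8") as f:
        points = json.load(f)  # assumed: a flat JSON array of data points
    return [int(point["rating"]) for point in points]

ratings = load_ratings("Symbolic_Integration.json")  # illustrative file name
mean = statistics.fmean(ratings)
std = statistics.stdev(ratings)  # sample standard deviation, i.e., with Bessel's correction
print(f"average rating {mean:.2f} (standard deviation: {std:.2f})")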
3.3 Human Input in Dataset Creation and Mathematical Evaluation
For all data points, the values of the keys rating, errorcodes, warningcodes, comment, and
confidence were manually labeled, without any automation. The msc, ref, and timestamp
keys were populated in a semi-automatic way, since their values change only slightly within
the same subdataset.
Two of the subdatasets, the MATH subdataset and the Symbolic-Integration subdataset, use
prompts taken from existing datasets, the MATH dataset by [12] and the dataset comprising
integrals from [13], respectively. This was done to compare how (Chat)GPT performs against
existing state-of-the-art models that use these datasets, see Section 4. Nonetheless, significant
additional annotation effort was involved, since, in both cases, the authors rated the output.
Furthermore, in the second case, the data are publicly presented in a Polish notation format,
and conversion was necessary. The prompts of the other subdatasets were hand-crafted by
the authors. We refer to Appendix B.6 for more information on different aspects of human
effort in dataset creation.
4 Results
Will ChatGPT get you through a university math class? No, you would be better off copying
from your average peer—unless it is undergraduate mathematics, for which GPT-4 can offer
sufficient (but not perfect) performance.
If we take a rating of 3.5, the midpoint between the lowest rating at which a question is
at least understood (2) and the highest rating (5) (cf. the caption of Figure 1), to be the
threshold between success and failure, then Figure 1 shows that both versions of ChatGPT
fail on the majority of subdatasets. However, for GPT-4, the situation is different: on
miniGHOSTS, it passes (sometimes barely) on all subdataset files except W. Rudin,
Functional Analysis (ch. 2), which tests graduate-level mathematical knowledge, and the
Olympiad Problem Solving file, which tests mathematical problem-solving skills. We note
that, unless otherwise stated, we do not use prompt-engineered questions in the results
presented here (see Section 4.2).
We first focus on the results of the 9-January-2023 version of ChatGPT and note that the
results for the 30-January-2023 version are very similar, as can be inferred from the figures. On
average, the 9-January-2023 version achieves a rating of 3.20 with a standard deviation8 of
1.23. It performs particularly poorly on proof-based questions in the style of graduate-level
exercises or mathematical olympiads, as well as more complicated symbolic calculations. We
note that prompt engineering only slightly improved the results for such complex questions;
see Section 4.2. However, in tasks that only required filling in gaps or stating mathematical
facts, ChatGPT was mostly able to achieve a score above 3.5. In particular, ChatGPT was
strong at recognizing the context of the question, and the notation of the output almost
always matched the one given in the prompt, see Figure 4 in the appendix. Generally,
Figure 1 indicates that the ratings closely correspond to how mathematicians would rank
the difficulty of the exercises. In this context, we note that the length of the prompt does
not have a clear effect on the rating; see Figure 9 in the appendix. We present results for
different mathematical fields in Figure 5 in the appendix.
8 We use Bessel's correction term to obtain an unbiased estimate of the variance.
[Figure 1: horizontal bar chart. The vertical axis ("Subdataset") lists the files of the subdatasets Grad-Text, Holes-in-Proofs, Olympiad-Problem-Solving, Symbolic-Integration, MATH, and Search-Engine-Aspects; the horizontal axis ("Rating") ranges from 1.0 to 5.0; the legend ("Version") distinguishes 9-Jan., 30-Jan., and GPT-4.]
Figure 1: Average rating for each file in each subdataset (bold) of GHOSTS on the 9-January-
2023 and the 30-January-2023 versions of ChatGPT and for miniGHOSTS on GPT-4. Note
that the maximal rating is 5 and the minimal rating at which the question was at least
understood is 2 (see Appendix B.4); the lower rating of 1 indicates that the answer completely
misses the question. Thus, a reasonable passing grade, i.e., 50% of points, corresponds
to a score of 3.5, as indicated by the vertical dotted line. The error bars represent 95%
confidence intervals.
For a detailed qualitative analysis of the results on the different subdatasets, we refer to
Appendix D.1. Finally, we note that (Chat)GPT almost never expressed any form of
uncertainty, even when its output was completely wrong; see Appendix D.2.
Comparing ChatGPT to the performance obtained by [13], who correctly solved nearly 100%
of the integrals in a collection of 500 test equations [13, Table 3], the 9-January-2023 version
of ChatGPT achieves an average rating of 2.51 (standard deviation: 0.87) on our random
sample of their dataset (after conversion from Polish notation to LaTeX). Specifically, a rating
of 2 dominates, occurring for 70% of the prompts, followed by ratings of 3 and 4 for 13% of
the prompts each; see also Figure 7 in the appendix. GPT-4 achieves an average of 3.50 (standard
deviation: 1.43), barely a passing grade, on the corresponding subset from miniGHOSTS.
These scores trail far behind the performance achieved by the model in [13]. The situation
is similar when comparing ChatGPT to Minerva [21, Table 3]. Their best model achieved
an accuracy of 50% on the MATH dataset [12]. However, the 9-January-2023 version of
ChatGPT achieves a perfect score only on 29% of our random samples from the MATH
[Figure 2: Sankey diagram with rating buckets 5, 4, 3, 2, 1 in three rows labeled 9-Jan., 30-Jan., and GPT-4.]
Figure 2: A Sankey diagram of how the ratings evolve from 9-January-2023 ChatGPT to
30-January-2023 ChatGPT to GPT-4 (from top to bottom), with all models evaluated on
miniGHOSTS. While grades are shuffled between the two ChatGPT versions, the overall
performance remains approximately the same. However, we observe a significant increase
in perfect ratings, i.e., a score of 5, for GPT-4.
dataset (which is above the total average of 25% of data points across all subdatasets in
which this version achieves a perfect score), see Figures 6 and 7 in the appendix. In contrast,
GPT-4 performs substantially better and obtains a score of 5 on 70% of the corresponding
questions within the miniGHOSTS dataset, see Figure 7 in the appendix.
4.1 Quantitative Comparison of (Chat)GPT Versions
The subsequent model version, 30-January-2023, performed similarly overall, with an average
rating of 3.29 (standard deviation: 1.28), although performance was inconsistent across
subdatasets and marginally better on some of them; see Figure 1. A significant jump
in performance could only be observed for GPT-4, which achieved a substantially higher
average rating of 4.15 (standard deviation: 1.12). We note that the evaluation of GPT-4 is
only on the miniGHOSTS dataset, i.e., a subset of GHOSTS. Nonetheless, these findings
send a clear message that the performance of GPT-4 dominates the performance of ChatGPT
(both versions), see Figure 1.
Figure 2 shows how the ratings change between the different versions of (Chat)GPT. Surpris-
ingly, one can see a shuffling of the grades for the two ChatGPT versions, even though the
counts in each grade bracket stay approximately the same. For instance, there are roughly
the same number of outputs that received grade 4, yet fewer than half of the prompts were
the same between model changes. Appendix D.4 provides different perspectives on this
and reinforces the mixed performance increase that the 30-January-2023 model brings. For
GPT-4, we see that the percentage of perfect ratings almost doubles, while the percentage of
prompts that are not understood or completely wrong (i.e., ratings of 1 or 2) approximately
halves as compared to the ChatGPT versions.
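The flows in Figure 2 are simply counts of rating pairs for identical prompts under two model versions. A sketch of how such transition counts can be tallied is shown below; the rating lists are illustrative, not our data.

from collections import Counter

# Illustrative paired ratings for the same prompts under two versions (not our data).
ratings_jan9 = [2, 4, 5, 3, 4, 2]
ratings_jan30 = [3, 4, 4, 3, 5, 2]

# Each (old, new) pair is one flow in a Sankey diagram such as Figure 2.
transitions = Counter(zip(ratings_jan9, ratings_jan30))
for (old, new), count in sorted(transitions.items()):
    print(f"rating {old} -> rating {new}: {count} prompt(s)")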
Analysis of (Chat)GPT's output and our warning codes reveals that GPT-4 provides even
longer (“rambling”) answers, whereas ChatGPT usually answered the question without giving
any additional context about the topic, see Figures 6 and 8 in the appendix. The answer
style of GPT-4 was often beneficial (resulting in better overall scores) but sometimes reduced
the readability of the output. Furthermore, we found the behavior of GPT-4, compared to
ChatGPT, to be more opinionated. Finally, despite its better overall performance, GPT-4
still seems to be vulnerable to mistakes in seemingly simple calculations. We refer the reader
to Appendix D for further results on the models’ performance.
4.2 Prompt Engineering
One interesting finding of our study is related to performing prompt engineering on mathe-
matical questions. Prompt engineering was carried out solely on questions from the Olympiad-
Problem-Solving subdataset, and prompt-engineered questions are stored as lists of two
JSON objects: the original, non-prompt-engineered question and the prompt-engineered
one. The latter is identified by the string <prompt engineered> as the value of its comment
key. These lists containing prompt-engineered questions sit at the same hierarchy level in
the JSON file as the other questions from the subdataset.
About 20% of the questions were prompt-engineered: In these cases, ChatGPT was instructed
to proceed either step-by-step, by prefixing the prompt with the sentence "Let's answer this
question step by step.", or by formulating the mathematical task more explicitly, e.g.,
by adding "Prove that..." or "Show that..." to the prompt9. Instructing ChatGPT to proceed
step-by-step in this way was shown to increase the performance of GPT-3 on datasets that
test mathematical reasoning (e.g., GSM8K); furthermore, this is a type of prompt engineering
that is recommended by OpenAI in their cookbook to improve reliability10 .
As a result of prompt engineering, for the 9-January-2023 version of ChatGPT, the number
of wrong statements and computations (i.e., error codes e2, e3, and e4) decreased, while
the number of errors rooted in faulty logic (i.e., error code e5) actually increased. Overall,
prompt engineering improves the average rating only slightly; see Figure 3 in the appendix.
For the questions from Olympiad-Problem-Solving that were selected for the miniGHOSTS
dataset, we allowed sampling from the entire Olympiad-Problem-Solving subdataset, since the
goal of miniGHOSTS is not to measure prompt-engineering effects. Therefore, some of the
questions in the miniGHOSTS version of the Olympiad-Problem-Solving subdataset are
prompt-engineered. The <prompt engineered> string was accordingly removed from
the comments in the miniGHOSTS dataset.
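Because prompt-engineered entries are stored as two-element lists next to plain data points, a consumer of the file has to branch on the entry type. The sketch below, which assumes the layout described above (the file name is illustrative), extracts the rating difference within each pair.

import json

with open("Olympiad_Problem_Solving.json", encoding="utf-8") as f:
    entries = json.load(f)  # a mix of single data points and two-element lists

for entry in entries:
    if isinstance(entry, list):  # an [original, prompt-engineered] pair
        engineered = next(p for p in entry if "<prompt engineered>" in p["comment"])
        original = next(p for p in entry if "<prompt engineered>" not in p["comment"])
        delta = int(engineered["rating"]) - int(original["rating"])
        print(f"rating change from prompt engineering: {delta:+d}")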
5 Conclusion
We have examined the behavior of (Chat)GPT across various tasks that test different
aspects of mathematical skill. Contrary to the media sensation that (Chat)GPT has caused,
(Chat)GPT is not yet ready to deliver high-quality proofs or calculations consistently. At the
same time, the quality of the answers can be positively surprising. Moreover, our evaluation
of GPT-4 on the miniGHOSTS dataset reveals promising improvements over ChatGPT’s
performance. In Appendix G, we collect the best and worst results for a number of selected
subdatasets. The best responses can be seen to justify the media sensation. It thus seems fair
to say that (Chat)GPT is inconsistently bad at advanced mathematics: While its capabilities
generally drop with the mathematical difficulty of a prompt, it does give insightful proofs in
a few cases.
However, (Chat)GPT falls short of achieving the same performance as models specifically
trained for single tasks. These models, in contrast, lack the flexibility of (Chat)GPT, which is
a universal tool suitable for any area of mathematics. In fact, (Chat)GPT’s ability to search
for mathematical objects, given information about them, is where it shines. For a user who
is already sufficiently mathematically proficient to discern the correctness of (Chat)GPT's
output, (Chat)GPT can be integrated as an assistant in the user’s workflow. It can function
as a search engine or knowledge base to speed up various lookup tasks, as they often occur
at certain stages of mathematical research.
Due to the prohibitive annotation effort, the GHOSTS dataset is not yet large enough to
significantly improve the mathematical capabilities of LLMs by fine-tuning them on GHOSTS;
though we believe it is sufficiently comprehensive to allow an evaluation and comparison of
LLMs (with miniGHOSTS and microGHOSTS enabling even more rapid evaluation).
We encourage other researchers to mine our dataset beyond the descriptive statistics that
we computed to gain a deeper understanding of how LLMs behave on mathematical tasks.
Finally, we hope that our work motivates other mathematicians to contribute to this growing
field, by evaluating their LLMs on micro/mini/GHOSTS in order to establish a thorough
benchmark for assessing the mathematical abilities of LLMs.
9 Some prompts (e.g., the ones taken from the book by Engel [40]) only contain a mathematical statement without a clear instruction; for example, "An a × b rectangle can be covered by 1 × n rectangles iff n|a or n|b." From the context, one must conclude that this statement is correct and should be proven.
10 github.com/openai/openai-cookbook/blob/main/techniques_to_improve_reliability.md
Acknowledgments
This work was partially supported by the AXA Research Fund. The authors would like to
thank Karan Desai for helpful discussions and advice on the datasheet.
References
[1] Sascha Lobo. Das Ende von Google, wie wir es kannten. Der Spiegel, Retrieved 2023-
01-10. https://www.spiegel.de/netzwelt/netzpolitik/bessere-treffer-durch-
chatgpt-das-ende-von-google-wie-wir-es-kannten-kolumne-a-77820af6-51d7-
4c03-b822-cf93094fd709.
[2] John Naughton. The ChatGPT bot is causing panic now – but it’ll soon be as mundane
a tool as Excel. The Guardian, Retrieved 2023-01-14. https://www.theguardian.com/
commentisfree/2023/jan/07/chatgpt-bot-excel-ai-chatbot-tec.
[3] Kevin Roose. The Brilliance and Weirdness of ChatGPT. The New York Times,
Retrieved 2023-01-24. https://www.nytimes.com/2022/12/05/technology/chatgpt-
ai-twitter.html.
[4] Joe Rogan and Bret Weinstein. What ChatGPT Could Mean for the Future of Artificial
Intelligence [Podcast episode]. In The Joe Rogan Experience. Episode 1919, Retrieved
2023-01-05. https://www.youtube.com/watch?v=kh5dN72GTQ8.
[5] teddy [@teddynpc]. I made ChatGPT take a full SAT test. Here’s how it did: [Image
attached] [Tweet]. Twitter. Retrieved 2023-01-13. https://twitter.com/teddynpc/
status/1598767389390573569.
[6] Timothy Gowers [@wtgowers]. It’s amusing when ChatGPT makes ridiculous math-
ematical mistakes. But of course, it’s more interesting to find out what it can do
well. Here’s one example that wasn’t bad: I gave it a very rough outline of a proof
and asked it to fill in the details [Tweet]. Twitter. Retrieved 2023-01-13. https:
//twitter.com/wtgowers/status/1611750773607604224.
[7] OpenAI (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[8] Tiffany H. Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, and Lorie
De Leon et al. Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical
Education Using Large Language Models. medRxiv, 2022.
[9] David Rozado. What is the IQ of ChatGPT? Retrieved 2023-01-09. https://
davidrozado.substack.com/p/what-is-the-iq-of-chatgpt.
[10] Christian Terwiesch. Would Chat GPT3 Get a Wharton MBA? A Prediction Based
on Its Performance in the Operations Management Course. Retrieved 2023-01-
04. https://mackinstitute.wharton.upenn.edu/wp-content/uploads/2023/01/
Christian-Terwiesch-Chat-GTP.pdf.
[11] Natalie. ChatGPT – Release Notes. Retrieved 2023-04-03. https://help.openai.com/
en/articles/6825453-chatgpt-release-notes.
[12] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang,
Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the
MATH dataset. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural
Information Processing Systems Track on Datasets and Benchmarks, volume 1. Curran,
2021.
[13] Guillaume Lample and François Charton. Deep learning for symbolic mathematics.
arXiv preprint arXiv:1912.01412, 2019.
[14] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben
Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards
for model reporting. In Proceedings of the conference on fairness, accountability, and
transparency, pages 220–229, 2019.
[15] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM
Journal of Research and Development, 3(3):210–229, 1959.
[16] Jörg Denzinger, Matthias Fuchs, Christoph Goller, and Stephan Schulz. Learning from
previous proof experience: A survey. Technical report, TU München, 1999.
[17] John Harrison, Josef Urban, and Freek Wiedijk. History of interactive theorem proving.
In Computational Logic, volume 9, pages 135–214, 2014.
[18] Malik Amir, Yang-Hui He, Kyu-Hwan Lee, Thomas Oliver, and Eldar Sultanow. Machine
Learning Class Numbers of Real Quadratic Fields. arXiv preprint arXiv:2209.09283,
2022.
[19] Alex Davies, Petar Veličković, Lars Buesing, Sam Blackwell, and Daniel Zheng et al.
Advancing mathematics by guiding human intuition with AI. Nature, 600(7887):70–74,
2021.
[20] Yang-Hui He. Machine-learning the string landscape. Physics Letters B, 774:564–568,
2017.
[21] Aitor Lewkowycz, Anders Johan Andreassen, David Dohan, Ethan Dyer, and Hen-
ryk Michalewski et al. Solving quantitative reasoning problems with language models.
In Advances in Neural Information Processing Systems, 2022.
[22] Francois Charton, Amaury Hayat, and Guillaume Lample. Learning advanced mathe-
matical computations from examples. In International Conference on Learning Repre-
sentations, 2021.
[23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in
Neural Information Processing Systems, 2017.
[24] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, and Gau-
rav Mishra et al. PaLM: Scaling language modeling with pathways. arXiv preprint
arXiv:2204.02311, 2022.
[25] Aida Amini, Saadia Gabriel, Shanchuan Lin, and Rik Koncel-Kedziorski et al. MathQA:
Towards interpretable math word problem solving with operation-based formalisms. In
Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, pages 2357–2367. Association
for Computational Linguistics, 2019.
[26] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, and Heewoo Jun et al.
Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[27] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, and Jared D Kaplan et al.
Language models are few-shot learners. In Advances in Neural Information Processing
Systems, pages 1877–1901, 2020.
[28] Piotr Piękos, Mateusz Malinowski, and Henryk Michalewski. Measuring and improving
BERT’s mathematical abilities by predicting the order of reasoning. In Proceedings
of the 59th Annual Meeting of the Association for Computational Linguistics and the
11th International Joint Conference on Natural Language Processing, pages 383–394.
Association for Computational Linguistics, 2021.
[29] Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction
by rationale generation: Learning to solve and explain algebraic word problems. In
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics,
pages 158–167. Association for Computational Linguistics, 2017.
[30] Teven Le Scao and Angela Fan et al. BLOOM: A 176B-parameter open-access multilin-
gual language model. arXiv preprint arXiv:2211.05100, 2022.
[31] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, and Apoorv Kul-
shreshtha et al. LaMDA: Language models for dialog applications. arXiv preprint
arXiv:2201.08239, 2022.
[32] Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. A survey of deep
learning for mathematical reasoning. arXiv preprint arXiv:2212.10535, 2022.
[33] Sean Welleck, Jiacheng Liu, Ronan Le Bras, Hannaneh Hajishirzi, Yejin Choi, and
Kyunghyun Cho. NaturalProofs: Mathematical theorem proving in natural language.
arXiv preprint arXiv:2104.01112, 2021.
[34] Sean Welleck, Jiacheng Liu, Ximing Lu, Hannaneh Hajishirzi, and Yejin Choi. Natural-
Prover: Grounded mathematical proof generation with language models. arXiv preprint
arXiv:2205.12910, 2022.
[35] Rick Durrett. Probability: Theory and Examples. Cambridge Series in Statistical and
Probabilistic Mathematics. Cambridge University Press, 2019.
[36] James R. Munkres. Topology. Prentice-Hall, 2000.
[37] Walter Rudin. Functional Analysis. McGraw-Hill, 1991.
[38] Sheldon Axler. Linear algebra done right. Springer, 2015.
[39] W. Rudin. Principles of Mathematical Analysis. International series in pure and applied
mathematics. McGraw-Hill, 1976.
[40] Arthur Engel. Problem-Solving Strategies. Springer, 1998.
[41] Tranquil Sea Of Math. Does ChatGPT code LaTeX and write proofs? Youtube.
Retrieved 2023-01-12. https://www.youtube.com/watch?v=ge2N7VI_8P0.
[42] Richard Van Noorden @[email protected] [@Richvn]. Huh. ChatGPT confidently
gives the right kind of reasoning to solve this math problem, but whiffs on the algebra
in the middle and gets the answer wrong [Tweet]. Twitter. Retrieved 2023-01-09.
https://twitter.com/Richvn/status/1598714487711756288.
[43] Amos Azaria. ChatGPT Usage and Limitations. Retrieved 2023-01-15. https://hal.
science/hal-03913837.
[44] Ernest Davis. Mathematics, word problems, common sense, and artificial intelligence.
arXiv preprint arXiv:2301.09723, 2023.
[45] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz,
Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial
general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712,
2023.
[46] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song,
and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv
preprint arXiv:2009.03300, 2020.
[47] Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta
Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, et al. Lila:
A unified benchmark for mathematical reasoning. arXiv preprint arXiv:2210.17517,
2022.
[48] The mathlib Community. The Lean mathematical library. In Proceedings of the 9th
ACM SIGPLAN International Conference on Certified Programs and Proofs. ACM,
2020.
[49] Markus N. Rabe, Dennis Lee, Kshitij Bansal, and Christian Szegedy. Mathematical
reasoning via self-supervised skip-tree training. arXiv preprint arXiv:2006.04757v3,
2020.
[50] Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. LLM is like a box
of chocolates: the non-determinism of ChatGPT in code generation. arXiv preprint
arXiv:2308.02828, 2023.
[51] Sherman Chann. Non-determinism in GPT-4 is caused by sparse MoE, 2023. Accessed
on August 5, 2023.
[52] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, and Carroll L. Wainwright et al.
Training language models to follow instructions with human feedback. arXiv preprint
arXiv:2203.02155, 2022.
[53] Carroll Wainwright and Ryan Lowe. InstructGPT: Training Language Models to
Follow Instructions with Human Feedback. GitHub repository, Retrieved 2023-01-09.
https://github.com/openai/following-instructions-human-feedback.
[54] Sarah Wiegreffe (sigmoid.social/@sarah) [@sarahwiegreffe]. If text-davinci-001 is a rough
approximate to the model reported in the NeurIPS 2020 paper, and text-davinci-002 is
InstructGPT in the 2022 preprint, then what is just "davinci"? Trying to reproduce
results from a time before this naming existed [Tweet]. Twitter. Retrieved 2023-01-15.
https://twitter.com/sarahwiegreffe/status/1583617355678355456.
[55] OpenAI. GPT-4 API waitlist. Retrieved 2023-06-06. https://openai.com/waitlist/
gpt-4-api.
[56] OpenAI. Documentation - Models. Retrieved 2023-06-06. https://platform.openai.
com/docs/models/gpt-4.
[57] OpenAI. OpenAI API Reference - Chat Completion Endpoint. Retrieved 2023-06-06.
https://platform.openai.com/docs/api-reference/chat.
[58] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, and Henrique Ponde de Oliveira
Pinto et al. Evaluating large language models trained on code. arXiv preprint
arXiv:2107.03374, 2021.
[59] Alexander Bogomolny. Pythagorean theorem, Retrieved 2023-08-10. https://www.cut-
the-knot.org/pythagoras.
[60] Benj F Yanney and James A Calderhead. New and old proofs of the Pythagorean
theorem. The American Mathematical Monthly, 3(4):110–113, 1896.
[61] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia,
Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-Thought prompting elicits reasoning in
large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho,
and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35,
pages 24824–24837. Curran Associates, Inc., 2022.
[62] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and
Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language
models. arXiv preprint arXiv:2305.10601, 2023.
[63] Xi Ye and Greg Durrett. The unreliability of explanations in few-shot in-context learning.
arXiv preprint arXiv:2205.03401, 2022.
[64] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan,
Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Commu-
nications of the ACM, 64(12):86–92, 2021.
Appendix
A Further Related Works
B Dataset Creation
B.1 Categorization
B.2 Format
B.3 Copyright and Licensing Terms
B.4 Data Sources and Labeling Policies
B.5 Mitigating Human Errors
B.6 Human Effort
B.7 Dataset Misuse
D Further Results
D.1 Qualitative Analysis of Subdatasets on ChatGPT 9-January-2023
D.2 (Chat)GPT's Confidence
D.3 Figures of ChatGPT's Performance (version 9-January-2023)
D.4 Comparison of (Chat)GPT Versions
A Further Related Works
In this section, we present further related works. For (Chat)GPT, most investigations related
to mathematical reasoning consist of anecdotal evidence concerning its performance and its
failure modes. Notable mentions on social media can, for instance, be found in [5, 6, 41, 42].
Unfortunately, a clear methodology is missing, as most of the results are scattered on various
internet platforms and cannot be easily reproduced.
To the best of our knowledge, the only investigations into the mathematical capabilities of
(Chat)GPT prior to the appearance of our first preprint were undertaken by [43, 44]. However, these works
only report a small number of qualitative results, often on rather simple mathematical tasks
and without specifying the precise versions of (Chat)GPT. The latter reference reports results
only on a few selected examples, while the former reference investigates ChatGPT’s11 ability
to compute irrational numbers as well as to solve some elementary math word problems.
11 Using an unknown version of ChatGPT that predates the 9-January-2023 version.
Recently, [45] has provided a systematic evaluation of ChatGPT on the GSM8K
dataset [26], the MATH dataset [12], and the MMLU-STEM dataset [46]. These datasets
allow for an automatic evaluation using only accuracy as an evaluation metric. The Līla
dataset [47] also consists of a set of diverse mathematical problems, spanning different
dimensions and problem types. The problems are formulated in such a way that solutions
can be given by Python programs, which precludes problems that involve more advanced
forms of mathematical proofs. Further anecdotal examples of mathematical performance are
presented in [45].
Among LLMs prior to (Chat)GPT, Minerva [21], based on PaLM (discussed in Section 2),
stands out, being trained in equal parts on websites that contain MathJax elements and
arXiv preprints (in addition to the general natural-language data on which PaLM was trained).
It achieves a score of roughly 50% on the significantly harder Mathematics Aptitude Test of
Heuristics (MATH) dataset [12], which was sourced from various mathematical competitions.
One distinguishing feature of the MATH dataset is that its problems admit a unique answer
that can be condensed within a few characters (a number, for example). This is beneficial
for the automatic evaluation of a model on such a dataset, since one can simply check the
final answer, ignoring the step-by-step solution.
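This is exactly what makes automatic grading on MATH cheap: extract the short final answer and compare strings, ignoring the intermediate reasoning. A minimal sketch, assuming the common convention that the final answer is wrapped in \boxed{...}:

import re

def final_answer(solution):
    """Return the content of the last \\boxed{...} in a solution, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1] if matches else None

# Grading reduces to a string comparison of short final answers.
model_output = r"Expanding and simplifying, we obtain $\boxed{42}$."
print(final_answer(model_output) == "42")  # True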
Finally, we would also like to mention the field of formalized mathematics, where large
databases that encode advanced mathematical concepts exist, e.g., the Lean Mathematical
Library [48]. Some of the ideas that we have used in this article, such as using prompts
that formulate a task to fill in gaps in proofs, are echoed in [49] for datasets for formal
mathematics consisting of expression trees. Yet, for the purpose of doing mathematics with
large language models, these formal datasets cannot be leveraged since no straightforward
way exists to convert them to natural language.
B Dataset Creation
B.1 Categorization
Our subdatasets can be categorized along the following dimensions (see Table 1):
• Mathematical difficulty (ascending):
M1 Elementary arithmetic problems, as found in the MATH dataset [12] at lower levels of difficulty.
M2 Symbolic problems (integration of functions) that can also be solved via a supervised-learning, data-driven approach to mathematics [13].
M3 (Under)graduate-level exercises from well-known textbooks [35–39] as well as questions from math.stackexchange.com, spanning diverse domains of mathematics.
M4 Exercises that are in the style of mathematical olympiad problems, such as those taken from Engel's Problem-Solving Strategies book [40].
• Question/prompt type:
Q1 Review questions, which ask to state or name certain mathematical facts correctly.
Q2 Overview-type review questions (cf. Section 3.1).
Q3 Computational questions.
Q4 Proof-based questions, which ask for a theorem proof or for a puzzle solution.
Q5 Proof-completion questions (cf. Section 3.1).
• Types of high out-of-distribution likelihood:
D1 Nontrivial problem encoding: The data points from the Symbolic Integration sub-
dataset come from [13] and are publicly available12 . Since the online training set
uses Polish notation, it is very unlikely that (Chat)GPT has seen the corresponding
prompts in LaTeX before.
D2 Succinct solution: The solutions for the Olympiad-Problem-Solving subdataset are
included in the book by Engel [40]. But the solutions are extremely concise, and
simply repeating them would not show an immediate understanding of the problem.
D3 Spoken dialogue: The Search-Engine-Aspects subdataset is unlikely to be well repre-
sented in the data on which (Chat)GPT has been trained since its prompts resemble
word fragments that might appear in a mathematical dialogue (e.g., an oral mathe-
matical exam), rather than in a textbook.
12 github.com/facebookresearch/SymbolicMathematics
B.2 Format
The dataset consists of a collection of UTF-8 encoded JSON files. We explain the JSON
keys of each data point in our dataset in the following and also indicate whether its value is
optional. If the value is optional, the key has to be present, but the value will be an empty
array or string.
• prompt denotes the input that we provide to (Chat)GPT through its web interface at the
URL chat.openai.com/chat; see also Appendix C and the API sketch following this list. We
use a new session for each prompt to avoid (Chat)GPT being biased by previous prompts.
• output denotes the raw output that (Chat)GPT supplies us with. In some cases, mathe-
matical formulas were rendered in the web interface, in which case we copied them in LaTeX.
• rating is a number from 1 to 5 that shows how many points (Chat)GPT has scored, 5
being a perfect answer and 1 being the lowest rating. A detailed explanation of the rating
policy that we followed is contained in Appendix B.4.
• errorcodes (optional) highlight a list of error types that illustrate the failure modes of
(Chat)GPT in a more fine-grained way. Not all types of errors apply to all (sub)datasets:
For example, an error code for a missing proof step would not be applicable on a dataset
that tests whether (Chat)GPT can multiply numbers or find prime divisors. The detailed
explanation of the error codes (and the warning codes; see below) that was provided to the
annotators is contained in Appendix B.4. There, we also include a policy of how ratings
and error codes have to be used together.
• warningcodes (optional) highlight any problematic aspects of (Chat)GPT; for example,
(Chat)GPT might be rambling and providing the user with unrelated information or use a
poor (but correct) way of solving problems.
• comment (optional) denotes any noteworthy commentary that an assessor of (Chat)GPT
may make. This can be used to give a more detailed explanation of the output, provide
reasoning behind awarding a certain error code or rating, or generally provide context. For
some subdatasets, this key was used to indicate the difficulty level of the prompt, as well
as an official solution, if available; see Section 3.1. It was also used to indicate whether we
used prompt engineering; see Section 4.2.
• msc denotes a list of mathematics subject classifications 13 (MSC) that pertain to the
output. Note that we do not classify the prompt given to (Chat)GPT with MSC codes, as
there may be no proper classification; for example, when (Chat)GPT is asked what the
most important theorem in all of mathematics is14 , it is meaningless to assign an MSC
code to that prompt.
• ref (optional) indicates a reference to where the prompt was originally taken from (for
some subdatasets, such as Holes-in-Proofs, we have used excerpts from various books
or math.stackexchange.com; the original source was recorded as a value of this key).
This key can have an empty value if the question was formulated by the authors and no
authoritative source was plausible.
• confidence indicates how confident we have perceived (Chat)GPT to be when presenting
us with its output. We allow values of high, medium, and low.
• timestamp denotes when the prompt was entered into (Chat)GPT. This can be used to
track the version of (Chat)GPT; see Section 4.1.
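As noted for the prompt key above, every prompt was issued in a fresh session. We worked through the web interface; the sketch below shows the analogous pattern with the OpenAI chat-completions API, where a fresh session simply corresponds to a messages list without prior turns (the model name is illustrative; see Appendix C for the versions we actually queried).

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_fresh_session(prompt):
    """Send a single prompt with no conversation history, mirroring our protocol."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],  # no prior turns
    )
    return response.choices[0].message.content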
The values of these keys within a single data point interact in nontrivial ways: If a rating of
5 is given, then it is expected that no error code is present—though there may be warning
codes that are used. The error codes and warning codes are loosely in the spirit of a compiler
throwing errors and warnings if it is given incorrect or sloppy code—although we have a role
reversal, where the human is now the compiler, and the machine produces the code. In this
sense, for some prompts, we have used multiple error and/or warning codes, which is why
the corresponding values are arrays of strings. We use these codes to collect statistics on the
behavior of (Chat)GPT; see Section 4.
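These conventions can be checked mechanically. A minimal sketch of the central consistency rule (a rating of 5 excludes error codes, while warning codes remain permissible), again under the flat-array assumption from Section 3.2:

import json

def check_rating_error_consistency(path):
    """Flag data points that have a perfect rating but still carry error codes."""
    with open(path, encoding="utf-8") as f:
        points = json.load(f)
    for i, point in enumerate(points):
        if point["rating"] == "5" and point["errorcodes"]:
            print(f"data point {i}: rating 5 but error codes {point['errorcodes']}")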
For most of the subdatasets that make up our GHOSTS dataset, we have used LaTeX to
encode mathematical formulas in our prompts. Our experiments have shown that (Chat)GPT
can process LaTeX-encoded mathematics well.
MSC codes can be useful for mathematicians who want to integrate (Chat)GPT
into their daily workflow, as they make it possible to see in which areas the model performs
better and can hence be trusted more. Our dataset is very diverse, having a total of 78 MSC
codes. The top short versions of these codes (first two digits) are 26 (“Real functions”, 127
occurrences) followed by 00 (“General”, 110 occurrences) and 46 (“Functional analysis”, 77
occurrences), see also Figure 5.
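As an illustration of such a workflow, the following minimal Python sketch aggregates the average rating per two-digit MSC prefix; the file name and the exact typing of the rating values are assumptions made for this sketch, not part of any released tooling:

import json
from collections import defaultdict

# Load one of the evaluation files (file name is hypothetical).
with open("ghosts_9_january_2023.json", encoding="utf-8") as f:
    datapoints = json.load(f)

# Group ratings by the two-digit MSC prefix of each assigned code.
ratings = defaultdict(list)
for point in datapoints:
    for code in point["msc"]:
        ratings[code[:2]].append(int(point["rating"]))

# Report the mean rating per mathematical area.
for prefix, values in sorted(ratings.items()):
    print(f"MSC {prefix}: mean rating {sum(values) / len(values):.2f} ({len(values)} outputs)")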
B.3 Copyright and Licensing Terms
Because our dataset draws on multiple sources with different copyright holders, which have issued different licenses (or no licenses at all), we associate licenses with our data at a fine-grained level, to differentiate between the parts where we are bound by the license of an existing dataset and the parts that we created, which we release under the following Creative Commons license: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)15. Under this license, one may not use the parts of the dataset that were created by us for commercial purposes, and one must give appropriate credit when using them non-commercially; users building on the GHOSTS dataset need to indicate the changes that were made and distribute their contributions under the same license as the original.
Licenses per datapoint: We associate licenses with our dataset at a fine-grained level in the following way:
• For each datapoint, one license applies to the value of the prompt key. In some cases, this value is governed by copyright with no license issued; we then do not release it publicly, and the value of the prompt key is <copyrighted>. Which prompt values are subject to which licenses is fully detailed in Table B.1.
13 A complete list of MSC codes can be accessed at the URL zbmath.org/static/msc2020.pdf.
14 The answer is Pythagoras’ theorem, according to (Chat)GPT.
15 See https://creativecommons.org/licenses/by-nc/4.0/ for the detailed terms of the license.
Table B.1: A summary of all the files from all the subdatasets comprising our GHOSTS dataset, together with their licenses. In the case of the prompt values from the “Proofs Collection A” file, either no license was issued by the copyright holder (if the prompt value was taken from the books Linear Algebra Done Right by S. Axler or Principles of Mathematical Analysis by W. Rudin, cf. the ref key associated with that prompt value), or one of three Creative Commons Attribution-ShareAlike licenses, in different versions, applies, depending on the date of the math.stackexchange.com post that served as the source for the prompt value; see https://math.stackexchange.com/help/licensing.
Subdataset Name             Comprised of the json file(s)             License for value of prompt key
Grad-Text                   W. Rudin, Functional Analysis (ch. 1)     No license specified17
                            W. Rudin, Functional Analysis (ch. 2)     No license specified17
                            J. Munkres, Topology (ch. 1)              No license specified17
                            J. Munkres, Topology (ch. 2)              No license specified17
                            R. Durrett, Probability Theory            No license specified18
Holes-in-Proofs             Proofs Collection A                       More than one license
                            Proofs Collection B Prealgebra            MIT license
                            Proofs Collection B Precalculus           MIT license
Olympiad-Problem-Solving    Olympiad Problem Solving                  No license specified17
Symbolic-Integration        Symbolic Integration                      CC BY-NC 4.0
MATH                        MATH Algebra                              MIT license
                            MATH Counting and Probability             MIT license
                            MATH Prealgebra                           MIT license
                            MATH Precalculus                          MIT license
Search-Engine-Aspects       Definition Retrieval                      CC BY-NC 4.0
                            Reverse Definition Retrieval              CC BY-NC 4.0
                            Named Theorem Proof Completion            CC BY-NC 4.0
• Another license, the CC BY-NC 4.0 license, applies to all other, remaining components of each datapoint; for the datapoints where we also created the prompt, these two licenses coincide, as outlined in Table B.1. We note that, therefore, in particular, the value of the output key is also released by us under the CC BY-NC 4.0 license. This is consistent with OpenAI’s terms, which at the time of writing state16: “You may provide input to the
Services (“Input”), and receive output generated and returned by the Services based on the
Input (“Output”). Input and Output are collectively “Content.” As between the parties and
to the extent permitted by applicable law, you own all Input. Subject to your compliance
with these Terms, OpenAI hereby assigns to you all its right, title and interest in and to
Output. This means you can use Content for any purpose, including commercial purposes
such as sale or publication if you comply with these Terms.”
Release policy: Some of the subdataset files contain prompts that are protected under copyright for which no specific license has been issued by the copyright holder: these are all the prompts from all files of the Grad-Text and Olympiad-Problem-Solving subdatasets, as well as selected prompts from the “Proofs Collection A” file. In such cases, where a copyrighted prompt was used for which no license was available, we have opted to withhold the value of the prompt key in the publicly released dataset. The ref key includes a detailed reference to the page where the original theorem or exercise was presented, so a reader can easily retrieve the original prompt.
B.4 Rating, Error Code, and Warning Code Policies
Whenever ChatGPT produced rendered LATEX output, our policy was to transcribe it to LATEX source. Below are the policies that were followed by each assessor of (Chat)GPT’s output regarding the rating, the error codes, and the warning codes:
Rating
• 1 → failure to understand the query (e.g., the user asks it something about number theory,
and it responds with information about differential equations);
• 2 → query was understood, but the answer was entirely wrong (e.g., the user asks what
the prime divisors of 111 are19 , and it responds with 8 and 6);
• 3 → query was understood, but the answer was only partially correct (e.g., the user asks
it what the prime divisors of 111 are, and it responds with 3 and 6);
• 4 → query was understood, and the answer was mostly correct (e.g., the user asks it what
the prime divisors of 222 are20 and it responds with 3 and 37);
• 5 → query was understood, and the answer was completely correct.
Error codes
• e1 → missing examples or information (e.g., the user asks it what the prime divisors of
111 are, and it responds with 3, missing 37); this also applies, if (Chat)GPT ignores a
part of the prompt (e.g., an equivalence needs to be shown, but (Chat)GPT shows only
one direction);
• e2 → a few wrong/vague statements (e.g., the user asks it what the prime divisors of 2310 are21 and it responds with 2, 3, 5, 7, 13 (wrong); or says that 2, 3, 5, and some other numbers are prime divisors (vague)); it can also denote a single statement that is slightly vague;
• e3 → a lot of wrong/too-vague statements (e.g., the user asks it what the prime divisors of 2310 are, and it responds with 2, 5, 8, 12, 13, 15 (wrong); or says that 2 and many other numbers are prime divisors (vague)); it can also denote a single statement that is highly vague;
• e4 → wrong computations (i.e., an additional error flag to disambiguate between statements
that are of computational nature or not);
• e5 → denotes wrong logic or wrong flow of arguments, which we further subdivide into
specific flags, as we prohibit the use of e5 on its own (since it would be uninformative):
– e5_1 → (Chat)GPT claims that to complete a proof, statements need to be shown that
are unrelated to the claim;
– e5_2 → a proof step is missing;
– e5_3 → an edge case has not been considered;
– e5_4 → an inference step is not supported (e.g., (Chat)GPT claims that from A follows
B, but this claim is not true);
– e5_5 → circular logical argument (i.e., using the hypothesis to prove the hypothesis);
• e6 → the general set-up is understood, but the legal operations are not respected or
misunderstood (e.g., we are given a puzzle where we are only allowed to add even integers,
but (Chat)GPT changes the rules and motivates the solution by allowing the addition of
odd integers; or (Chat)GPT misunderstands an adjective that has multiple mathematical
meanings, such as “dual”, which can mean either topological dual space or algebraic dual
space).
The following policy applies for error codes: If a rating r with 1 < r < 5 has been given, then an error code is mandatory to explain the type of error that occurred. For a perfect score of 5, no error codes should be assigned (but warning codes can be assigned). If the score is the lowest, i.e., a rating of 1, error codes can be assigned but do not have to be: in the case where (Chat)GPT has not understood the prompt, there is typically no reason to further detail the type of error.
19 They are 3 and 37.
20 They are 2, 3, and 37.
21 They are 2, 3, 5, 7, and 11.
Warning codes
• w1 → (Chat)GPT is withholding essential information related to the prompt (e.g., the user asked it something about the integral $\int_{-\infty}^{\infty} e^{-x^2} \, dx$, and it answers correctly but does not tell the user that the integral is actually a famous, named integral, i.e., the Gaussian integral);
• w2 → (Chat)GPT is rambling (i.e., after answering, correctly or incorrectly, (Chat)GPT
tells the user much more details than the user wanted to know);
• w3 → (Chat)GPT is hallucinating (i.e., after answering, correctly or incorrectly, (Chat)GPT
tells the user unrelated information);
• w4 → (Chat)GPT behaves weirdly (e.g., by using a strange proof structure (where applicable), using strange mathematical formulations, adopting a strange conversational tone, or making opinionated statements);
• w5 → (Chat)GPT changes the notation from the prompt without being instructed to do
so (e.g., the prompt contains a vector space X, but (Chat)GPT calls it F).
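The rating, error code, and warning code policies above can be summarized programmatically. The following Python sketch is ours and purely illustrative (the function name and signature are not part of any released tooling):

def check_codes_policy(rating, errorcodes, warningcodes):
    """Return a list of violations of the rating/code policy described above."""
    issues = []
    if rating == 5 and errorcodes:
        issues.append("a rating of 5 must not carry error codes")
    if 1 < rating < 5 and not errorcodes:
        issues.append("ratings 2-4 require at least one error code")
    # A rating of 1 may, but does not have to, carry error codes.
    if "e5" in errorcodes:
        issues.append("e5 must not be used on its own; use e5_1 ... e5_5")
    # Warning codes are always optional; only their value space is checked.
    allowed = {"w1", "w2", "w3", "w4", "w5"}
    issues += [f"unknown warning code {w}" for w in warningcodes if w not in allowed]
    return issues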
B.5 Mitigating Human Errors
Any assessment procedure that has a human component is prone to introducing bias—in
particular, a procedure involving manual work such as rating the model outputs. In the list
below we describe safeguards that help to mitigate bias and human errors (such as typos) as
well as streamlining procedures that we undertook:
1. Safeguards against LATEX errors and typos:
Various typographical errors may appear due to incorrect LATEX formatting. In such cases, we noticed that (Chat)GPT was able to correctly infer what was intended (e.g., $cup$ was correctly interpreted as $\cup$). Similarly, we noticed that (Chat)GPT’s output was stable under small “perturbations” of the prompt, such as minor typos. (Chat)GPT thus provided its own safeguard against these types of errors.
2. Safeguards against encoding issues:
We presented clear instructions to each author who prompted (Chat)GPT on how
to record and save the data in order to avoid any file encoding issues. In the end,
all JSON files were inspected and streamlined to Unicode, if a different encoding
was used.
3. Safeguards against unfair comparisons:
Clear instructions were given to all authors that used (Chat)GPT to ensure that
the language model has, to the extent possible, an identical state and starts from a
blank chat.
4. Safeguards against missing data and copy-paste errors:
Given a lack of API access in the early stages of our investigation (see Appendix C),
there was a fair amount of data being copied from (Chat)GPT. To mitigate any
copy-paste errors, several passes over the entire dataset, as well as automatic checks,
were made to look, e.g., for potential inconsistencies, missing timestamps, and
outputs not matching the prompts.
5. Safeguards against dataset misformatting:
Because the methodology the authors had to adhere to is complex (see Appendices B.2 and B.4), we introduced a number of automatic checks to make sure the general format of each datapoint is consistent and in line with our rules. In particular, we have made sure that various entries are in the correct place and take the correct values (e.g., that no warning flag values are in the key for errors, that the ref key value has the correct format, etc.); a sketch of such checks is given after this list.
6. Safeguards against other unforeseen errors:
Random samples: Random samples (< 10) were drawn from each dataset, and a
second assessor reviewed the rating. If deemed problematic, the original assessors
were asked to re-evaluate.
Statistical checks: Additional statistical checks were carried out as plausibility checks
to make sure no other unforeseen errors occurred: If prompts deviated from the
average length on that dataset, they were flagged, the output was manually inspected,
and, if deemed necessary, a re-evaluation was carried out.
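To illustrate items 5 and 6 above, the following Python sketch shows the flavor of such automatic checks; it is a simplified illustration written for this appendix, and the checks in our actual pipeline differ in detail:

import statistics

EXPECTED_KEYS = {"prompt", "output", "rating", "errorcodes", "warningcodes",
                 "comment", "msc", "ref", "confidence", "timestamp"}

def check_file(datapoints):
    problems = []
    for i, point in enumerate(datapoints):
        # Format check: every key must be present (cf. item 5).
        missing = EXPECTED_KEYS - point.keys()
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
        # No warning flag values may end up in the key for errors.
        if any(code.startswith("w") for code in point.get("errorcodes", [])):
            problems.append((i, "warning code stored under errorcodes"))
    # Plausibility check: flag prompts deviating strongly from the
    # average prompt length of the file (cf. item 6).
    lengths = [len(p.get("prompt", "")) for p in datapoints]
    mean, spread = statistics.mean(lengths), statistics.pstdev(lengths)
    for i, p in enumerate(datapoints):
        if spread > 0 and abs(len(p.get("prompt", "")) - mean) > 3 * spread:
            problems.append((i, "prompt length is a statistical outlier"))
    return problems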
We are aware that these measures are not exhaustive, but given a fixed time budget, we
considered them the most feasible. We note that authors evaluated ChatGPT models in
multiple stages, as the models became available, and that checks and corrections were made
at the end.
Because our checks-and-corrections pipelines are complex, and a number of corrections were made after all the evaluations on all models were obtained, we cannot exclude that the strings containing the prompts were altered in very minor ways (e.g., converting back to Unicode from a different, original, unknown encoding may introduce slight variations of certain special characters that slipped through inspection). However, because of the sufficiently large number of evaluations that we carried out in each file, which is the smallest unit on which we collect statistics, this will not affect any results. From the viewpoint of reproducing our dataset, this is also not an issue, since even for identical prompts, (Chat)GPT itself is non-deterministic; this persists even if its temperature parameter in the API is set to zero [50, 51]. Lastly, as noted above, (Chat)GPT’s answers will almost always still express the same idea, even if minor variations of the same prompt are used, so even individual scores will stay unchanged in such a scenario.
B.6 Human Effort
The evaluation was carried out by a subset of the authors of this paper who have substantial
mathematical expertise, ranging from master’s degrees in mathematics to postdoc-level and
professor-level positions at departments of mathematics. Prompts were assigned based on mathematical difficulty, with more senior mathematicians receiving the more difficult prompts. No third parties were involved.
Each of the 709 prompts of the GHOSTS dataset was evaluated on both the 9-January-2023 and the 30-January-2023 versions of ChatGPT; an additional 24 prompts were used to test the effect of prompt engineering on a single type of subdataset, see Section 4.2. We further evaluated GPT-4 on the 170 prompts of the miniGHOSTS dataset. This amounts to a total of (709 + 24) × 2 + 170 = 1636 prompt evaluations of advanced mathematics performed by graduate-level researchers.
We would like to mention that our effort has occasionally unearthed small inconsistencies in existing datasets: For example, the “MATH Counting and Probability” file, which was sourced from the larger MATH dataset [12], contains the prompt “What is the value of 101³ − 3 · 101² + 3 · 101 − 1?”, which is neither about counting nor about probability, but about arithmetic (our MSC codes allow users to find such examples).
We note that it is neither possible to outsource the creation of these subdatasets to a
crowdsourcing service, such as Amazon Mechanical Turk, nor is it possible to generate
them automatically from code because advanced mathematical insight is required for the
creation of each prompt (where applicable) and for providing the fine-grained evaluation
of the mathematical capabilities. We note that already in the case of the MATH dataset, which contains less advanced mathematics than some of our subdatasets, it was noted in [12]: “Because MATH requires a strong mathematical background to perform well on, and a long amount of time to solve problems, we were restricted to assessing six human participants and could not rely on crowdsourcing sites such as Amazon Mechanical Turk”.
Furthermore, unlike in the case of the MATH dataset, the answer to most of our prompts
cannot be condensed into a few tokens (such as a number or a function), e.g., when the
answer is a mathematical proof.
This raises the difficulty of creating more data, since graduate-level (and in some cases, PhD-level) mathematics is required. The combined effort of devising mathematically insightful prompts and carefully rating the output of (Chat)GPT amounts to 1636 prompt evaluations, totaling several hundred person-hours. However, as a result of these efforts, our dataset goes beyond all the mathematical datasets for LLMs mentioned in Section 2 in terms of the different aspects of mathematical reasoning that are being tested.
D Further Results
D.1 Qualitative Analysis of Subdatasets on ChatGPT 9-January-2023
In this section, we go through common mistakes made by ChatGPT, as well as notable observations regarding the output, one subdataset at a time. We focus on the 9-January-2023 version; see Appendix D.4 for more information regarding the other versions. We note that the output of (Chat)GPT (and, generally, of LLMs) is stochastic and may therefore differ on the same prompt. Nonetheless, clear trends can be observed, which we describe here. Individual outputs can be found in Appendix G.
Figure 3: Effect of prompt engineering on the rating (left) and the error codes (right) for
the 9-January-2023 model.
Symbolic-Integration On the Symbolic-Integration file, ChatGPT would at times state, with complete confidence, that there is no closed-form solution for the integral when, in fact, there is a solution; only integrals that have an elementary antiderivative are in this dataset.
MATH On the questions related to Algebra and Probability theory, the 9-January-2023 version of ChatGPT often got the reasoning right. However, the most common type of error was e4, occurring 36% of the time (62 times in total). This version of ChatGPT may struggle when confronted with standard operations, such as inverting fractions, computing least common multiples, and changing the sign of numbers when moving them from one side of the equals sign to the other. Often, in these questions, a correct solution requires performing multiple operations in sequence. In such cases, most often, at least one operation was wrong. This prevented the model from getting a rating of 5 on the output, which was achieved for only 29% of the questions.
Search-Engine-Aspects On the Search-Engine-Aspects file, the 9-January-2023 version of ChatGPT knew, at a basic level, almost all the theorems it was asked about but made mistakes when stating them. When it came to listing other results required for the proofs, this version typically requested far more than the necessary theory, occasionally even results that only follow from the theorem that was asked for (error code e5_5). On the Definition Retrieval file, this version performed quite well: it recited most definitions correctly. It sometimes got confused when asked about distributions in the sense of elements of the dual space of test functions; ChatGPT, version 9-January-2023, strongly favors the notion of distributions in the stochastic sense. Similarly, for the adjective “closed”, it chose the context of algebra (instead of topology) and interpreted it to mean “algebraically closed”. On the Reverse Definition Retrieval file, this version had the strongest performance, being able to recover most definitions from their descriptions, with an average rating of 4.30 (standard deviation 1.14). This indicates the usefulness of ChatGPT as a general-purpose mathematical search engine. This subdataset is also the simplest from a mathematical point of view, since no logical thinking is required; only a name needs to be found.
D.2 (Chat)GPT’s Confidence
(Chat)GPT is usually very confident, unlike other LLMs that we have experimented with. As
an illustrative example, consider the following prompt testing the sensitivity to LATEX-encoded
mathematics vs. Unicode-encoded mathematics:
Here are two statements: "Let (X, A, μ) be a measure space. Show that the signed measures λ : A -> R form a Banach space."; "Let $(X, A, \mu)$ be a measure space. Show that the signed measures $\lambda : A \rightarrow R$ form a Banach space." Are they equivalent?
The response by ChatGPT shows no nuance in terms of confidence (which is typical, even when ChatGPT is wrong):
Yes, the two statements are equivalent. They both express the same mathematical idea but use different notation and formatting conventions. The first statement uses parentheses to enclose the elements of the measure space and regular letters to represent the functions, while the second statement uses subscripted parentheses and uses the LaTeX formatting to represent the elements of the measure space and the function.
The response by Codex [58], another model that we briefly tested (a full evaluation of which would have exceeded the scope of a single conference article), is cautious and, unlike ChatGPT, capable of voicing doubt:
I think they are equivalent, but I am not sure.
[Figure 4: Counts (annotation) and relative frequencies (color) of error and warning codes by subdatasets (bold) and files for ChatGPT 9-January-2023 on GHOSTS.]
[Figure 5: Average rating over mathematical fields for the 9-January-2023 version of ChatGPT on GHOSTS. The color depicts the occurrence of each MSC code, and only MSC codes that have at least 5 occurrences are shown. Note that the ranking is not indicative of the complexity of the fields since we do not use equally complicated exercises for all fields. The error bars represent 95% confidence intervals.]
[Figure 6: Distributions (in percent) of ratings, error codes, and warning codes for the 9-Jan., 30-Jan., and GPT-4 versions.]
[Figure 7: Rating distributions (in percent) of GPT-4 on the Symbolic-Integration, MATH, and Search-Engine-Aspects subdatasets.]
[Figure 8: Achieved rating as a function of binned output length (top) and binned prompt length (bottom) for the 9-Jan., 30-Jan., and GPT-4 versions.]
D.4 Comparison of (Chat)GPT Versions
In this section, we collect figures which illustrate the differences and similarities between
versions of (Chat)GPT. We note that even though the 30-January-2023 version performs
very similarly to the 9-January-2023 version, there are some differences in the distribution of
ratings, error codes, and warning codes, see Figure 6.
On the other hand, GPT-4 strictly dominates the ChatGPT versions in terms of performance.
It always provides context around the question (whether that was asked for or not) and
often gives useful (and correct) pointers that, for example, highlight the importance of a
particular theorem. Figure 8 depicts the verbosity of different (Chat)GPT versions and the
achieved rating. However, we also note that the optimal level of verbosity can depend on
the mathematical background of the user. As a result, there have been significantly more
warning codes of type w2 (i.e., rambling) for GPT-4, see Figure 6.
F Limitations and Reproducibility
In this section, we describe the limitations of what our dataset and our evaluations cover; some of these are also related to issues regarding reproducibility, which we likewise discuss below. In terms of:
• model selection, we have focused on those models that, at the time of writing, are
known to have the best performance among existing LLMs, either based on anecdotal
evidence [3, 5, 6] or via previous benchmarks on more elementary mathematical reasoning
tasks [7]. Because human evaluation has been very time-consuming, we have chosen to
invest the available time solely in the evaluation of these models (and not older ones as
well) to obtain a picture of their performance that is as accurate as possible.
• correctness, we have put a large number of measures in place to guarantee that the evaluations are as correct as possible (see Appendix B.5). Nonetheless, we cannot guarantee the absolute correctness of each evaluation. There are also some questions where, aside from human errors, fundamental evaluator disagreement occurs regarding what rating should be given; this is a well-known problem in psychometrics and, in a simple instance, can be resolved by having multiple evaluators evaluate the same question. Given a fixed budget in terms of person-hours, we have chosen not to follow this approach because 1) we opted for being able to devise a larger dataset and 2) in light of the checks that we described in Appendix B.5, such an approach would result in diminishing returns on invested time.
An issue in this regard is the fact that an absolute ground truth is lacking for a mathematical
dataset of this type that goes beyond simple arithmetical questions. In particular, in
the case of proofs, these can be presented in myriad ways: On one hand, for the same
mathematical fact, one can find conceptually distinct proofs. (See [59] for an extreme
example of this, where 122 conceptually distinct proofs of the Pythagorean theorem are
given.) On the other hand, one and the same proof can be presented in different ways,
since the order in which the arguments are made and connected can be changed22 .
Assuming all the previous problems were somehow solvable, matters are further complicated by the fact that not all possible proofs are known, and new ones are being discovered for known statements (staying with the topic of the Pythagorean theorem, even millennia after its discovery, new proofs have surfaced [60]). If a model were to output a proof that is not known, it would still require a human to judge correctness, as in this case, there may not be any ground truth available to compare against.
All of the above makes it very hard for ground-truth data to be included in our dataset; and even if it were included, it may not be meaningful, for the outlined reasons. Some elementary datasets, such as the MATH dataset [12], have ground-truth data included in the comment key. However, their ground truth is also susceptible to the issues we mentioned; human evaluation is necessary for absolute certainty.
(Chat)GPT’s non-determinism, the absence of ground truth, and author disagreements notwithstanding, the above shows that an effort to reproduce our results should lead to highly similar scores on each file of each subdataset. Because our analysis is made per subdataset file and not per prompt, we are confident that a double-digit number of evaluations per file makes the conclusions that we draw robust.
• prompt engineering type, we have deliberately chosen to avoid more complex types of prompt engineering mechanics, such as Chain-of-Thought [61] or Tree-of-Thoughts [62], because 1) the main focus of the paper is the standard performance of the tested language models, not their performance augmented by various forms of in-context learning; 2) these types of prompt engineering methods require significant additional human effort to decorate the original prompts with examples when carrying out few-shot prompting and to track the output, and the human effort for the current dataset was already considerable, so adding such prompt engineering methods would have further raised the evaluation cost while also increasing the complexity of our non-trivial evaluation methodology; and 3) for GPT-3 and certain datasets and prompt engineering techniques, performance was degraded [63].
22 E.g., if proving a statement C depends on three statements X, Y, Z, we could first prove X, Y, Z and then conclude C; but we could also prove Y, Z, X and then conclude C, or order the proof according to any other permutation of X, Y, Z. While all of these presentations would use the same ideas, syntactically, they would be different proofs.
• datapoint-level MSC code coverage, our dataset covers 78 distinct codes spanning most areas of mathematics. Because there are 1636 distinct, MSC-classified outputs, by the pigeonhole principle, some of the MSC codes will necessarily be covered by only a small number of datapoints (Figure 5 indicates those MSC codes that occur at least five times, although some MSC codes appear over one hundred times).
We also note that for particularly easy mathematical questions (e.g., simple arithmetical
questions), no suitable MSC codes exist to classify the output since MSC codes typically
classify more advanced mathematics23 . Nonetheless, we have attempted to match them as
well as possible and allow multiple MSC codes in order to classify the output as precisely
as possible.
We further note that an exhaustive survey of (Chat)GPT’s performance across every MSC code would necessitate a large, community-driven effort to set up an extensive database. Due to the high cost of rating each output, which requires specialized skills, this is something that no individual research group could reasonably do; we hope, however, that our approach is a starting point for such an effort.
• subdataset-level attribute tag coverage, we could, in theory, start to investigate every combination of the attribute tags assigned to the subdatasets. We recall that these tags indicate the difficulty of a subdataset, the type of questions it contains, and the ways in which it is likely not to have been part of the training distributions of the considered models (see the explanation of the tags M1-M4, Q1-Q5, and D1-D3 in Appendix B.1 and Table 1). Thus, for a single mathematical domain, represented by one MSC code (such as for elementary arithmetic problems), one could generate subdatasets to cover each of the 4 · 5 · 3 = 60 possible combinations of the attribute tags listed above. Combining this with different MSC codes would, as above, also lead to a manual curation effort that is too intensive and time-consuming for a single research group.
23 The MSC codes starting with “97”, which at first glance might seem the most suitable, are solely reserved for classifying content related to the educational process of mathematics, rather than the mathematical content itself.
G Best-3 and Worst-3 Across Selected Subdatasets
We list below the best and worst answers of ChatGPT, version 9-January-2023, over a
selection of subdatasets. For readability, the prompts and answers are lightly modified so
that the LATEX-based formulas are correctly displayed, and whitespace is removed.
G.1 Grad-Text
Examples from the Grad-Text subdataset, comprising the books Probability Theory by R. Durrett [35], Topology by J. R. Munkres [36], and Functional Analysis by W. Rudin [37].
Best-3
" prompt " :R " Let µ be a measure R , ∥f ∥∞ = inf{M : µ({x : |f (x)| > M }) = 0} and
∥g∥1 = |g|dµ . Prove t h a t |f g|dµ ≤ ∥g∥1 ∥f ∥∞ . " ,
" output " : " Let M = ∥f ∥∞ and ϵ > 0 . Then , by t h e d e f i n i t i o n o f ∥f ∥∞ , we
have µ({x : |f (x)| > M + ϵ}) = 0 . T h e r e f o r e ,
Z Z Z Z Z
|f g|dµ = |f g|dµ + |f g|dµ ≤ (M + ϵ) |g|dµ + |f ||g|dµ.
|f |≤M +ϵ |f |>M +ϵ |f |>M +ϵ
Worst-3
" prompt " : " Find a f u n c t i o n f : R → R t h a t i s c o n t i n u o u s a t p r e c i s e l y
one p o i n t . " ,
" output " : "One example o f such a f u n c t i o n i s
if x ̸= 0
x2
f (x) =
1 if x = 0.
This f u n c t i o n i s c o n t i n u o u s a t x = 0 but not a t any o t h e r p o i n t . "
"prompt": "Let $X$ be a topological vector space. All sets mentioned below are understood to be subsets of $X$. Prove the following statement: If $A$ is compact and $B$ is closed, then $A + B$ is closed.",
"output": "A topological vector space is a vector space with a topology such that the vector addition and scalar multiplication are continuous. To prove that $A + B$ is closed, we must show that the complement of $A + B$ is open. Let $y \in X \setminus (A + B)$. Then there exists some $x_1 \in A$ and $x_2 \in B$ such that $y = x_1 + x_2$. Since $A$ is compact and $B$ is closed, there exist neighborhoods $U_1$ of $x_1$ and $U_2$ of $x_2$ such that $U_1$ is relatively compact and $X \setminus U_2$ is open. Now consider the set $V = U_1 + U_2 = \{x_1 + x_2 : x_1 \in U_1, x_2 \in U_2\}$. This set is an open neighborhood of $y$ and it is contained in $X \setminus (A + B)$, which shows that $X \setminus (A + B)$ is open and hence $A + B$ is closed."
"prompt": "The $p$-norm on $\mathbb{R}^n$ is given by $\|x\|_p = \left( \sum_{k=1}^n |x_k|^p \right)^{1/p}$. We want to show that $\|x\|_p \le C \|x\|_q$ for arbitrary $1 \le p < q$. We compute
$$\|x\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p} \le \left( \left( \sum_{i=1}^n |x_i|^q \right)^{p/q} n^{1 - p/q} \right)^{1/p} = \left( \sum_{i=1}^n |x_i|^q \right)^{1/q} n^{1/p - 1/q} = n^{1/p - 1/q} \|x\|_q.$$
$$\sum_{i=1}^n |x_i y_i| \le \left( \sum_{i=1}^n |x_i|^p \right)^{1/p} \left( \sum_{i=1}^n |y_i|^q \right)^{1/q}."$$
" prompt " : " Let f be a r e a l −v a l u e d f u n c t i o n c o n t i n u o u s on [a, b] and
d i f f e r e n t i a b l e on (a, b) . Suppose t h a t limx→a f ′ (x) e x i s t s . Then , we
want t o p r o v e t h a t f i s d i f f e r e n t i a b l e a t a and f ′ (a) = limx→a f ′ (x) .
Using t h e d e f i n i t i o n o f t h e d e r i v a t i v e , we need t o show t h a t
f (a + h) − f (a)
lim
h→0+ h
e x i s t s and i s e q u a l t o lim f ′ (x) . Note t h a t f o r h > 0 t h e Mean Value
x→a+
Theorem p r o v i d e s a p o i n t ch with a < ch < a + h such t h a t [ MISSING ] .
F i n a l l y , n o t e t h a t ch → a+ a s h → 0+ . Can you f i l l i n t h e m i s s i n g
equation ?" ,
" output " : "The m i s s i n g e q u a t i o n from t h e Mean Value Theorem i s :
f (a + h) − f (a)
f ′ (ch ) = ."
h
Worst-3
" prompt " : "We want t o compute
∞
X 2n
.
3n+1
n=1
Let ’ sP f i r s t f i n d a g e n e r a l f o r m u l a f o r t h e f o l l o w i n g sum :
m
Sm = n=1
nrn . N o t i c e t h a t
m
X r − rm+1 mrm+2 − (m + 1)rm+1 + r
Sm − rSm = −mrm+1 + rn = −mrm+1 + = .
1−r 1−r
n=1
Hence
mrm+2 − (m + 1)rm+1 + r
Sm = .
(1 − r)2
Can you f i n i s h t h e c a l c u l a t i o n ? " ,
" output " : " Yes , t o f i n i s h t h e c a l c u l a t i o n we can s u b s t i t u t e t h e
v a l u e s o f m = ∞ , r = 2/3 i n t o t h e f o r m u l a we found f o r Sm :
2 ∞+2 2 ∞+1
2
∞ 3 − (∞ + 1) 3 + 3
S∞ =
2 2
1− 3
This s i m p l i f i e s t o
2
S∞ = 9
1
= 2, "
9
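For comparison, a correct way to finish the calculation (this derivation is ours, added for illustration, and assumes the reconstruction of the prompt given above) is to take the limit $m \to \infty$ instead of substituting $m = \infty$: for $|r| < 1$, both $m r^{m+2}$ and $(m+1) r^{m+1}$ tend to $0$, so
$$S_\infty = \lim_{m \to \infty} S_m = \frac{r}{(1-r)^2},$$
and with $r = 2/3$ this gives $S_\infty = (2/3)/(1/9) = 6$, hence $\sum_{n=1}^{\infty} n 2^n / 3^{n+1} = S_\infty / 3 = 2$.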
"prompt": "$F$ is a finite field of order $q$. We want to compute the size of $GL_n(F)$. In order for an $n \times n$ matrix to be invertible, we need the rows to be linearly independent. As you note, we have $q^n - 1$ choices for the first row; now, there are $q$ vectors in the span of the first row, so we have $q^n - q$ choices for the second row. Now, let $v_1, v_2$ be the first two rows. Then the set of vectors in the span of $v_1, v_2$ is of the form $\{c_1 v_1 + c_2 v_2 \mid c_1, c_2 \in F\}$. This set is of size $q^2$, as we have $q$ choices for $c_1$ and $q$ choices for $c_2$. Thus, we have $q^n - q^2$ choices for the third row. Continuing this gives the desired formula. Can you state the formula?",
"output": "The size of $GL_n(F)$ is given by the formula
$$(q^n - 1)(q^{n-1} - q)(q^{n-2} - q^2) \cdots (q^2 - q)(q - 1)."$$
interval. $x = \frac{8n}{3}\pi$ gives solutions $x = \frac{8}{3}\pi, \frac{16}{3}\pi, \frac{24}{3}\pi$, which are in the given interval. Therefore, there are 5 solutions for $x = \frac{8n}{5}\pi$ and 3 solutions for $x = \frac{8n}{3}\pi$ in the given interval. The total number of solutions is $5 + 3 = 8$."
for some scalar $k$. Find the angle between $b$ and $c$, in degrees. Solution: Since $b \times c$ is orthogonal to both $b$ and $c$, $a \cdot b = 0$ and $a \cdot c = 0$. Since $\|a + b + c\| = 1$,
$$(a + b + c) \cdot (a + b + c) = 1.$$
<MISSING> Fill in the missing steps.",
"output": "Expanding the dot product on the left side gives: $a \cdot a + a \cdot b + a \cdot c + b \cdot a + b \cdot b + b \cdot c + c \cdot a + c \cdot b + c \cdot c = 1$. Since $a \cdot b = a \cdot c = b \cdot a = c \cdot a = 0$, this simplifies to: $|a|^2 + |b|^2 + |c|^2 + 2 b \cdot c = 1$. We know that $|a|^2 = \frac{1}{2}$, $|b|^2 = \frac{1}{3}$, $|c|^2 = \frac{1}{6}$, and substituting these values in gives: $\frac{1}{2} + \frac{1}{3} + \frac{1}{6} + 2 b \cdot c = 1$. Solving for $b \cdot c$ gives $b \cdot c = 0$, which means the vectors are orthogonal, hence the angle between them is 90 degrees."
Worst-3
" prompt " : "The l i n e d e s c r i b e d by
! !
2 k
−1 +t 2
3 1
i s p e r p e n d i c u l a r t o t h e l i n e d e s c r i b e d by
! !
2 2
−1 +u 1
1 2
36
" output " : "The m i s s i n g s t e p s a r e : We need t o f i n d t h e m u l t i p l e s
o f 1 2 , which a r e d i v i s i b l e by both 3 and 4 . To f i n d t h e f i r s t
m u l t i p l e o f 12 l e s s than 1 0 0 , we d i v i d e 100 by 12 and round
down t o t h e n e a r e s t i n t e g e r , which i s 8 . To f i n d t h e l a s t
m u l t i p l e o f 12 l e s s than 1 0 0 , we d i v i d e 12 by 100 and round
up t o t h e n e a r e s t i n t e g e r , which i s 8 . The number o f
m u l t i p l e s o f 12 between 1 and 100 i s t h e d i f f e r e n c e between
t h e l a s t and f i r s t m u l t i p l e , which i s 8 − 8 = 0 . So t h e r e a r e
no p o s i t i v e i n t e g e r s l e s s than 100 t h a t a r e d i v i s i b l e by both
3 and 4 . " ,
G.4 Olympiad-Problem-Solving
Examples from the Olympiad-Problem-Solving subdataset based on the book [40].
Best-3
" prompt " : " n numbers a r e w r i t t e n on a b l a c k b o a r d . I n one s t e p , you
may e r a s e any two o f t h e numbers , s a y a and b , and w r i t e , i n s t e a d
(a + b)/4 . R e p e a t i n g t h i s s t e p n − 1 times , t h e r e i s one number l e f t .
Prove t h a t , i n i t i a l l y , i f t h e r e were n o n e s on t h e board , a t t h e
end , a number which i s not l e s s than 1/n w i l l remain . " ,
" output " : " This can be proven by i n d u c t i o n . Base c a s e : n = 2 . I f t h e r e
a r e two o n e s on t h e board , a f t e r one s t e p , t h e r e m a i n i n g number
i s (1 + 1)/4 = 1/2 , which i s not l e s s than 1/2 . I n d u c t i v e c a s e : Assume
t h a t t h e s t a t e m e n t i s t r u e f o r n = k . Now c o n s i d e r n = k + 1 . I f
t h e r e a r e k + 1 o n e s on t h e board , we can e r a s e any two o f them and
w r i t e t h e i r a v e r a g e , which i s not l e s s than 1/k + 1 by t h e
i n d u c t i o n h y p o t h e s i s . T h e r e f o r e , we can c o n c l u d e t h a t t h e
statement holds f o r a l l n . "
"prompt": "Let's answer this question step by step. Derive divisibility criteria for 9 and 11.",
"output": "Divisibility Criteria for 9: If the sum of digits of a number is divisible by 9, then the number is also divisible by 9. If the difference between the sum of digits in the odd positions and even positions is divisible by 9, then the number is also divisible by 9. Divisibility Criteria for 11: If the difference between the sum of digits in the odd positions and even positions is divisible by 11, then the number is also divisible by 11. If the alternating sum of digits (first minus second plus third minus fourth, etc.) is divisible by 11, then the number is also divisible by 11."
Worst-3
" prompt " : "Assume an 8 × 8 c h e s s b o a r d with t h e u s u a l c o l o r i n g . You may
r e p a i n t a l l s q u a r e s ( a ) o f a row o r column ( b ) o f a 2 × 2 s q u a r e .
The g o a l i s t o a t t a i n j u s t one b l a c k s q u a r e . Can you r e a c h t h e
goal ?" ,
" output " : " Yes , i t i s p o s s i b l e t o a t t a i n j u s t one b l a c k s q u a r e on an
8 × 8 c h e s s b o a r d . One way t o do t h i s i s t o r e p a i n t a l l t h e s q u a r e s
o f t h e top− l e f t 2 × 2 s q u a r e , which w i l l l e a v e one b l a c k s q u a r e i n
t h e bottom−r i g h t c o r n e r o f t h e c h e s s b o a r d . Another way t o do t h i s
i s t o r e p a i n t a l l t h e s q u a r e s o f t h e f i r s t column , which w i l l
l e a v e one b l a c k s q u a r e i n t h e bottom−r i g h t c o r n e r o f t h e
chessboard .
G.5 Symbolic-Integration
Examples from our Symbolic-Integration subdataset based on [13].
Best-3
" prompt " : "What i s t h e i n t e g r a l o f 4x7 + sin(1 + x) " ,
" output " : "The i n t e g r a l o f 4x7 i s (4/8)x8 + C , where C i s t h e c o n s t a n t
o f i n t e g r a t i o n . The i n t e g r a l o f sin(1 + x) i s − cos(1 + x) + C . So t h e
i n t e g r a l o f 4x7 + sin(1 + x) i s (4/8)x8 − cos(1 + x) + C . "
Worst-3
"prompt": "What is the integral of $(4 + x)[4 + 3x][e^{x(4+x)^2}]$",
"output": "The integral of $(4 + x)[4 + 3x][e^{x(4+x)^2}]$ is not expressible in terms of elementary functions."
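This claim is wrong, in line with the design of the subdataset: differentiating shows that the integrand does have an elementary antiderivative (this verification is ours, added for illustration):
$$\frac{d}{dx} e^{x(4+x)^2} = \left( (4+x)^2 + 2x(4+x) \right) e^{x(4+x)^2} = (4+x)(4+3x) \, e^{x(4+x)^2},$$
so $\int (4+x)(4+3x) e^{x(4+x)^2} \, dx = e^{x(4+x)^2} + C$.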
H Datasheet for the GHOSTS Dataset
This appendix provides a datasheet for the GHOSTS dataset. The format of this datasheet
was introduced in [64] and consolidates the motivation, creation process, composition, and
intended uses of our dataset as a series of questions and answers.
H.1 Motivation
Q1. For what purpose was the dataset created? Was there a specific task in mind?
Was there a specific gap that needed to be filled? Please provide a description.
The existing datasets of natural-language mathematics are far from covering all the
typical tasks professional mathematicians encounter in daily life, making it unclear
whether language models can be of any help in this regard. Existing datasets mostly
cover elementary mathematics or resemble standard tests like SATs (see Sections 2
and 3). Hence, they do not offer any insight into the usage of ChatGPT as a tool
for mathematicians. In this work, we have made the first attempt towards filling this
gap, going beyond math problems that are yes-no rated, and proposed a benchmark
made and curated by working researchers in the field that tests different dimensions of
mathematical reasoning.
Q2. Who created this dataset (e.g., which team, research group) and on behalf
of which entity (e.g., company, institution, organization)?
The authors of this work created GHOSTS; see Appendix B.6 for more information.
Q3. Who funded the creation of the dataset? If there is an associated grant, please
provide the name of the grantor and the grant name and number.
There is no associated grant or funding which has been used to create the GHOSTS
dataset.
Q4. Any other comments?
No.
H.2 Composition
Q5. What do the instances that comprise the dataset represent (e.g., documents,
photos, people, countries)? Are there multiple types of instances (e.g., movies,
users, and ratings; people and interactions between them; nodes and edges)? Please
provide a description.
GHOSTS consists of textual prompts, in natural language, representing mathematical
questions. For each prompt, GHOSTS contains one or more instances of outputs of
(Chat)GPT and corresponding fine-grained evaluation by the authors.
Q6. How many instances are there in total (of each type, if appropriate)?
There are 709 prompts in GHOSTS; a selection of 170 of these makes up miniGHOSTS,
and a selection of 14 of those makes up microGHOSTS. For 24 of the questions from the
GHOSTS dataset, light prompt engineering variations have been carried out. Each of the
709 + 24 questions from GHOSTS has been evaluated on ChatGPT, version 9-January-
2023 and 30-January-2023, and 170 questions from miniGHOSTS have been evaluated
on GPT-4 (and thus on microGHOSTS too). Thus, in total, (709 + 24) × 2 + 170 = 1636
outputs and evaluations have been carried out. See also Appendix B.6 for more
information.
Q7. Does the dataset contain all possible instances or is it a sample (not neces-
sarily random) of instances from a larger set? If the dataset is a sample, then
what is the larger set? Is the sample representative of the larger set (e.g., geographic
coverage)? If so, please describe how this representativeness was validated/verified. If
it is not representative of the larger set, please describe why not (e.g., to cover a more
diverse range of instances because instances were withheld or unavailable).
GHOSTS tries to cover a wide range of mathematical questions from 78 different MSC
codes; see Appendix B.1 and B.2. However, due to the prohibitive cost of human
evaluation, which cannot be fully automated away (see Section 3.3), it is not feasible to
represent all mathematical fields across all dimensions of “mathematical behavior” and
all types of mathematical questions (overview questions, fact-stating questions, etc.).
Q8. What data does each instance consist of? “Raw” data (e.g., unprocessed
text or images) or features? In either case, please provide a description.
GHOSTS, miniGHOSTS, and microGHOSTS consist of a collection of JSON objects (one for each data point), and each JSON object consists of 10 key-value pairs, as detailed in Appendix B.2.
Q9. Is there a label or target associated with each instance? If so, please provide a
description.
No, we do not explicitly define a label or target for the instances. However, the rating of the output can potentially be used to select good and bad mathematical conversations of (Chat)GPT in order to fine-tune models, and the errorcodes and warningcodes can be used to make a more fine-grained classification possible.
Q10. Is any information missing from individual instances? If so, please provide a
description, explaining why this information is missing (e.g., because it was unavailable).
This does not include intentionally removed information but might include, e.g., redacted
text.
No.
Q11. Are relationships between individual instances made explicit (e.g., users’
movie ratings, social network links)? If so, please describe how these relationships
are made explicit.
Relations between instances are explicitly given by the same values on (subsets) of
the fields, e.g., the same prompt, the same model version, or the same MSC code.
Prompt-engineered variations of the same question are represented as an array of JSON
objects, one object for each variation.
Q12. Are there recommended data splits (e.g., training, development/validation,
testing)? If so, please provide a description of these splits, explaining the rationale
behind them.
Not applicable.
Q13. Are there any errors, sources of noise, or redundancies in the dataset? If so,
please provide a description.
The evaluation of the prompts included in GHOSTS is subject to human error. However, we tried to mitigate such errors; see Appendix B.5.
Q14. Is the dataset self-contained, or does it link to or otherwise rely on external
resources (e.g., websites, tweets, other datasets)? If it links to or relies on
external resources,
(a) Are there guarantees that they will exist, and remain constant, over time?
(b) Are there official archival versions of the complete dataset (i.e., including the external
resources as they existed at the time the dataset was created)?
(c) Are there any restrictions (e.g., licenses, fees) associated with any of the external
resources that might apply to a future user? Please provide descriptions of all
external resources and any restrictions associated with them, as well as links or
other access points, as appropriate.
The dataset is self-contained. However, some of the prompts cannot be publicly released
since they are taken or adapted from sources that are protected by copyright, for which
no license was given, see Appendix B.3; though we do release the output of the models
on these prompts, together with our evaluation of the output. Further, we provide a
reference to the original, copyrighted materials, so that a user can easily retrieve the
original prompts.
Q15. Does the dataset contain data that might be considered confidential (e.g.,
data that is protected by legal privilege or by doctor-patient confidentiality,
data that includes the content of individuals non-public communications)?
If so, please provide a description.
No.
Q16. Does the dataset contain data that, if viewed directly, might be offensive,
insulting, threatening, or might otherwise cause anxiety? If so, please describe
why.
No.
Q17. Does the dataset relate to people? If not, you may skip remaining questions in
this section.
No.
Q18. Does the dataset identify any subpopulations (e.g., by age, gender)? If so,
please describe how these subpopulations are identified and provide a description of their
respective distributions within the dataset.
No.
Q19. Is it possible to identify one or more natural persons, either directly or
indirectly (i.e., in combination with other data) from the dataset? If so, please
describe how.
No.
Q20. Does the dataset contain data that might be considered sensitive in any way
(e.g., data that reveals racial or ethnic origins, sexual orientations, religious
beliefs, political opinions or union memberships, or locations; financial or
health data; biometric or genetic data; forms of government identification,
such as social security numbers; criminal history)? If so, please provide a
description.
No.
Q21. Any other comments?
No.
of ChatGPT. A manual inspection of these subsets, in order to pick a subset with
appropriate mathematical content (we want to have a mathematically diverse dataset),
then led to the final selection of the miniGHOSTS dataset. The microGHOSTS dataset
was created from manual inspection of the miniGHOSTS dataset, by isolating in total
14 questions which were deemed difficult for language models.
Q25. Who was involved in the data collection process (e.g., students, crowd-workers, contractors) and how were they compensated (e.g., how much were crowd-workers paid)?
Only we have been involved in the data collection process. No payment (other than one
made through regular employment) in relation to creating this dataset and writing this
article was made.
Q26. Over what timeframe was the data collected? Does this timeframe match
the creation timeframe of the data associated with the instances (e.g., recent
crawl of old news articles)? If not, please provide a description of the timeframe.
The collection date matches the creation time. It is specified in the timestamp key
in each data point from GHOSTS and spans a timeframe from January 9, 2023, to
now. Using the timestamp, the version of ChatGPT that was used can be inferred, see
Appendix C.
Q27. Were any ethical review processes conducted (e.g., by an institutional
review board)? If so, please provide a description of these review processes, including
the outcomes, as well as a link or other access point to any supporting documentation.
Not applicable.
Q28. Does the dataset relate to people? If not, you may skip remaining questions in
this section.
No.
Q29. Did you collect the data from the individuals in question directly, or obtain
it via third parties or other sources (e.g., websites)?
Not applicable.
Q30. Were the individuals in question notified about the data collection? If so,
please describe (or show with screenshots or other information) how notice was provided,
and provide a link or other access point to, or otherwise reproduce, the exact language
of the notification itself.
Not applicable.
Q31. Did the individuals in question consent to the collection and use of their
data? If so, please describe (or show with screenshots or other information) how consent
was requested and provided, and provide a link or other access point to, or otherwise
reproduce, the exact language to which the individuals consented.
Not applicable.
Q32. If consent was obtained, were the consenting individuals provided with a
mechanism to revoke their consent in the future or for certain uses? If so,
please provide a description, as well as a link or other access point to the mechanism
(if appropriate).
Not applicable.
Q33. Has an analysis of the potential impact of the dataset and its use on data
subjects (e.g., a data protection impact analysis) been conducted? If so,
please provide a description of this analysis, including the outcomes, as well as a link
or other access point to any supporting documentation.
Not applicable.
Q34. Any other comments?
No.
H.4 Preprocessing, Cleaning, and/or Labeling
Q35. Was any preprocessing/cleaning/labeling of the data done (e.g., discretiza-
tion or bucketing, tokenization, part-of-speech tagging, SIFT feature extrac-
tion, removal of instances, processing of missing values)? If so, please provide a
description. If not, you may skip the remainder of the questions in this section.
We corrected various minor issues and inconsistencies that could arise in the process of
manual evaluation, see Appendix B.5.
Q36. Was the “raw” data saved in addition to the preprocessed/cleaned/labeled
data (e.g., to support unanticipated future uses)? If so, please provide a link or
other access point to the “raw” data.
The output key in each JSON object contains the raw output from (Chat)GPT—unless
ChatGPT used rendered LATEX in which case our policy was to transcribe it. In very
few cases, potential copy-paste errors were noticed, which were fixed.
Q37. Is the software used to preprocess/clean/label the instances available? If so,
please provide a link or other access point.
The raw output of (Chat)GPT in the output key has not been cleaned, see Q36. Cleaning
of the other values has been done first using Python scripts, in an automated way, and
subsequently by hand, to correct any further, unforeseen mistakes, see Appendix B.5.
The Python scripts are available upon request.
Q38. Any other comments?
No.
H.5 Uses
Q39. Has the dataset been used for any tasks already? If so, please provide a
description.
We have used the GHOSTS dataset to evaluate and compare the mathematical capa-
bilities of different LLMs, in particular, different (Chat)GPT versions; see Section 4.
Q40. Is there a repository that links to any or all papers or systems that use the
dataset? If so, please provide a link or other access point.
Future work citing the GHOSTS dataset will be listed by citation trackers such as
Google Scholar and Semantic Scholar.
Q41. What (other) tasks could the dataset be used for?
If the dataset is growing further, we anticipate that GHOSTS can be used as training
data for fine-tuning LLMs.
Q42. Is there anything about the composition of the dataset or the way it was
collected and preprocessed/cleaned/labeled that might impact future uses?
For example, is there anything that a future user might need to know to avoid uses
that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality
of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so,
please provide a description. Is there anything a future user could do to mitigate these
undesirable harms?
No.
Q43. Are there any tasks for which the dataset should not be used? If so, please
provide a description.
No.
Q44. Any other comments?
No.
H.6 Distribution
Q45. Will the dataset be distributed to third parties outside of the entity (e.g.,
company, institution, organization) on behalf of which the dataset was
created? If so, please provide a description.
Yes, the GHOSTS, miniGHOSTS, and microGHOSTS datasets will be made publicly
available. The release consists of three JSON files, one per model version (ChatGPT
9-January-2023, ChatGPT 30-January-2023, and GPT-4); the two ChatGPT versions were
evaluated on the full GHOSTS dataset, while GPT-4 was evaluated on miniGHOSTS. The
microGHOSTS prompts are tagged within the miniGHOSTS dataset; see the comment key.
Some prompts will not be available due to copyright issues (see Appendix B.3), but a
precise reference to where the original prompt can be found will be included instead.
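Since the microGHOSTS prompts are only tagged inside the miniGHOSTS file rather than
shipped separately, a user can extract them with a few lines of Python. This is a
hedged sketch: the exact tag string inside the comment key is an assumption, and the
dataset README should be consulted for the authoritative convention.

import json

with open("miniGHOSTS.json", encoding="utf-8") as f:  # hypothetical file name
    records = json.load(f)

# Keep only records whose comment key tags them as part of microGHOSTS.
micro = [r for r in records if "microGHOSTS" in (r.get("comment") or "")]
print(f"{len(micro)} microGHOSTS prompts found")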
Q46. How will the dataset be distributed (e.g., tarball on website, API, GitHub)?
Does the dataset have a digital object identifier (DOI)?
The dataset will be made available on GitHub in the public repository
github.com/friederrr/GHOSTS as a collection of JSON files.
Q47. When will the dataset be distributed?
The dataset is already available.
Q48. Will the dataset be distributed under a copyright or other intellectual
property (IP) license, and/or under applicable terms of use (ToU)? If so,
please describe this license and/or ToU, and provide a link or other access point to, or
otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated
with these restrictions.
We release the GHOSTS, miniGHOSTS, and microGHOSTS datasets under the fol-
lowing Creative Commons license: Attribution-NonCommercial 4.0 International (CC
BY-NC 4.0), unless we are bound by licenses of individual prompts or files from var-
ious subdatasets to release those prompts or files under more restrictive licenses; see
Appendix B.3 for more information.
Q49. Have any third parties imposed IP-based or other restrictions on the data
associated with the instances? If so, please describe these restrictions, and provide
a link or other access point to, or otherwise reproduce, any relevant licensing terms, as
well as any fees associated with these restrictions.
IP restrictions apply only to those prompts that were not created solely by the
authors; prompts created by the authors are released under CC BY-NC 4.0, as explained
above. See Appendix B.3 for these cases.
Q50. Do any export controls or other regulatory restrictions apply to the dataset
or to individual instances? If so, please describe these restrictions, and provide a
link or other access point to, or otherwise reproduce, any supporting documentation.
No.
Q51. Any other comments?
No.
H.7 Maintenance
Q52. Who will be supporting/hosting/maintaining the dataset?
The dataset will be hosted on a GitHub repository; see Q46. All the information about
the dataset, including links to the paper and future announcements, will be written in
the README file of the GitHub repository.
Q53. How can the owner/curator/manager of the dataset be contacted (e.g.,
email address)?
The email addresses of the authors are publicly available. Moreover, it is possible to
raise an issue on GitHub.
Q54. Is there an erratum? If so, please provide a link or other access point.
Future changes will be documented in the README file of the GitHub repository.
Differences in single files can be tracked in the Git history.
Q55. Will the dataset be updated (e.g., to correct labeling errors, add new
instances, delete instances)? If so, please describe how often, by whom, and how
updates will be communicated to users (e.g., mailing list, GitHub)?
We will continue to maintain the dataset and fix any errors that are discovered.
Errors can be reported either by direct email contact or via GitHub pull requests.
Beyond fixing such errors, we consider the dataset to be frozen.
Q56. If the dataset relates to people, are there applicable limits on the retention
of the data associated with the instances (e.g., were individuals in question
told that their data would be retained for a fixed period of time and then
deleted)? If so, please describe these limits and explain how they will be enforced.
Not applicable.
Q57. Will older versions of the dataset continue to be supported/hosted/main-
tained? If so, please describe how. If not, please describe how its obsolescence will be
communicated to users.
Not applicable.
Q58. If others want to extend/augment/build on/contribute to the dataset, is
there a mechanism for them to do so? If so, please provide a description. Will
these contributions be verified? If so, please describe how. If not, why not? Is there a
process for communicating/distributing these contributions to other users? If so, please
provide a description.
External contributions to our dataset are strongly encouraged. Every addition to the
dataset will be carefully reviewed by the authors. For other details, please see Q55.