Large Language Models for Mathematical Reasoning:

Progresses and Challenges

Janice Ahn♠ Rishu Verma♠ Renze Lou♠ Di Liu♢


Rui Zhang♠ and Wenpeng Yin♠

♠ The Pennsylvania State University; ♢ Temple University
{jfa5672, wenpeng}@psu.edu; [email protected]

arXiv:2402.00157v3 [cs.CL] 5 Apr 2024

Abstract

Mathematical reasoning serves as a cornerstone for assessing the fundamental cognitive capabilities of human intelligence. In recent times, there has been a notable surge in the development of Large Language Models (LLMs) geared towards the automated resolution of mathematical problems. However, the landscape of mathematical problem types is vast and varied, with LLM-oriented techniques undergoing evaluation across diverse datasets and settings. This diversity makes it challenging to discern the true advancements and obstacles within this burgeoning field. This survey endeavors to address four pivotal dimensions: i) a comprehensive exploration of the various mathematical problems and their corresponding datasets that have been investigated; ii) an examination of the spectrum of LLM-oriented techniques that have been proposed for mathematical problem-solving; iii) an overview of factors and concerns affecting LLMs in solving math; and iv) an elucidation of the persisting challenges within this domain. To the best of our knowledge, this survey stands as one of the first extensive examinations of the landscape of LLMs in the realm of mathematics, providing a holistic perspective on the current state, accomplishments, and future challenges in this rapidly evolving field.

1 Introduction

Mathematical reasoning is crucial to human intelligence, driving ongoing efforts in the AI community to autonomously tackle math challenges. This pursuit inherently calls for an augmentation of AI capabilities, delving into the intricate realms of textual comprehension, image interpretation, tabular analysis, symbolic manipulation, operational logic, and a nuanced grasp of world knowledge. As the AI landscape evolves, the endeavor to empower machines with a comprehensive understanding of diverse mathematical facets becomes not only a testament to technological prowess but also a pivotal stride towards achieving a more generalized and adept AI.

In recent times, the landscape of AI has been reshaped by the ascendancy of Large Language Models (LLMs) as formidable tools for automating intricate tasks. Notably, LLMs have proven to be potent assets in unraveling the nuances of mathematical problem-solving (Romera-Paredes et al., 2023; Imani et al., 2023). Their language capabilities fuel focused exploration in utilizing them for mathematical reasoning, uncovering fresh insights into the synergy between language and logic.

However, amid this progress, the current state of LLM-oriented research in mathematics presents a complex panorama. Diverse mathematical problem types pose a formidable challenge, exacerbated by the varied evaluation metrics, datasets, and settings employed in the assessment of LLM-oriented techniques (Testolin, 2023; Lu et al., 2023c). The lack of a unified framework hampers our ability to gauge the true extent of progress achieved and impedes a coherent understanding of the challenges that persist in this evolving field.

This survey endeavors to cast a spotlight on the multifaceted landscape of LLMs in the realm of mathematics. We traverse four crucial dimensions: a meticulous exploration of math problem types and the datasets associated with them; an in-depth analysis of the evolving techniques employed by LLMs in mathematical problem-solving; an examination of factors that affect LLMs solving math problems; and a critical discussion of the persisting challenges that loom over this burgeoning field.

To our knowledge, this survey marks one of the first comprehensive examinations of LLMs specifically tailored for mathematics. By weaving together insights from various dimensions, we aim to provide a holistic understanding of the current state of affairs in LLM-driven mathematical reasoning, shedding light on achievements, challenges, and the uncharted territories that await exploration in this captivating intersection of language and logic.

2 Related Work

To the best of our knowledge, the existing literature on summarizing mathematical research, particularly within the context of LLMs, remains limited. Notably, Frieder et al. (2023a) compared two ChatGPT versions (9-January-2023 and 30-January-2023) and GPT-4 on four math-related problems: producing proofs, filling holes in proofs, acting as a mathematical search engine, and computation. More importantly, they summarized some insightful strategies regarding how LLMs can help mathematicians and advocated a more collaborative approach, incorporating human expertise and LLM automation, for theorem proving. Chang et al. (2023) conducted a comprehensive evaluation of LLMs, incorporating an examination of their performance in mathematical problem-solving, albeit with a relatively brief exploration of the mathematical field. Conversely, both Testolin (2023) and Lu et al. (2023c) delved into the application of Deep Learning in the domain of mathematical reasoning. Our work distinguishes itself on three fronts: firstly, we concentrate on LLMs, providing a more in-depth analysis of their various advancements; secondly, beyond merely reporting progress, we engage in a thorough discussion of the challenges inherent in this trajectory; and thirdly, we extend our scrutiny to encompass the perspective of mathematics pedagogy. In doing so, we contribute a nuanced perspective that seeks to broaden the understanding of LLMs in the context of mathematical research.

The only work contemporaneous with ours is Liu et al. (2023b). In comparison, our contribution lies in: i) not only introducing various methods but also paying more attention to the various factors affecting model performance; and ii) taking a broader perspective on the progress of LLMs in the field of mathematics, elucidating not only the AI perspective but also the perspective of education, and emphasizing that the pursuit of model performance alone, while neglecting human factors, deserves attention.

3 Math Problems & Datasets

This section concisely overviews prominent mathematical problem types and associated datasets, spanning ARITHMETIC, MATH WORD PROBLEMS, GEOMETRY, AUTOMATED THEOREM PROVING, and MATH IN VISION-LANGUAGE CONTEXT.

3.1 Arithmetic

This category of problems entails pure mathematical operations and numerical manipulation, devoid of the need for the model to interpret text, images, or other contextual elements. An illustrative example is presented below, where "Q" denotes the question and "A" the answer.

Q: 21 + 97
A: 118

The dataset MATH-401 (Yuan et al., 2023) contains 401 arithmetic expressions organized into 17 groups.
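Such arithmetic items are typically scored by exact numerical match against the reference expression. The following minimal Python sketch is purely illustrative (it is not the official evaluator of any benchmark mentioned here); it evaluates the reference expression with a small AST-based calculator and compares it to a model's answer:

```python
import ast
import operator

# Whitelisted binary operators for the reference expressions.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def eval_expression(expr: str) -> float:
    """Safely evaluate a pure arithmetic expression such as '21 + 97'."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"Unsupported expression element: {ast.dump(node)}")
    return _eval(ast.parse(expr, mode="eval"))

def is_correct(expr: str, model_answer: str, tol: float = 1e-6) -> bool:
    """Exact-match scoring with a small numeric tolerance."""
    try:
        return abs(eval_expression(expr) - float(model_answer)) <= tol
    except ValueError:
        return False

print(is_correct("21 + 97", "118"))  # True for the example above
```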
3.2 Math Word Problems

MATH WORD PROBLEMS (MWP) are mathematical exercises or scenarios presented in the form of written or verbal descriptions rather than the straightforward equations of ARITHMETIC. These problems require individuals to decipher the information provided, identify relevant mathematical concepts, and formulate equations or expressions to solve the given problem. MWP often reflect real-world situations, allowing individuals to apply mathematical principles to practical contexts. Solving these problems typically involves critical thinking, problem-solving skills, and the application of mathematical operations to find a solution.

MWP invariably comprise a question (Q) and its corresponding final answer (A) (referred to as Question-Answer). However, the presence or absence of additional clues can give rise to various versions of these problems. Variations may emerge based on factors such as the availability of an equation (E; referred to as Question-Equation-Answer) or the provision of a step-by-step rationale (R; Question-Rationale-Answer) to guide the problem-solving process.

NAME | SIZE | LEVEL | NOTE
Question-Answer
CMATH (Wei et al., 2023) | 1.7K | E | Chinese; grades 1-6
SAT-MATH (Zhong et al., 2023) | 220 | H | Multi-choice
SVAMP (Patel et al., 2021) | 1K | E | Three types of variations
ASDIV (Miao et al., 2020) | 2.3K | E | Problem type and grade level annotated
Question-Equation-Answer
MAWPS (Koncel-Kedziorski et al., 2016) | 3.3K | E | Extension of ADDSUB, MULTIARITH, etc.
PARAMAWPS (Raiyan et al., 2023) | 16K | E | Paraphrased, adversarial MAWPS
SINGLEEQ (Koncel-Kedziorski et al., 2015) | 508 | E |
ADDSUB (Hosseini et al., 2014) | 395 | E | Only addition and subtraction
MULTIARITH (Roy and Roth, 2015) | 600 | E | Multi-step reasoning
DRAW-1K (Upadhyay and Chang, 2017) | 1K | E |
MATH23K (Wang et al., 2017) | 23K | E | Chinese
APE210K (Zhao et al., 2020) | 210K | E | Chinese
K6 (Yang et al., 2023) | 600 | E | Chinese; grades 1-6
CM17K (Qin et al., 2021) | 17K | M, H | Chinese; grades 6-12
CARP (Zhang et al., 2023a) | 4.9K | M | Chinese
GSM8K (Cobbe et al., 2021) | 8.5K | M | Linguistically diverse
Question-Rationale-Answer
MATH (Hendrycks et al., 2021) | 12.5K | H | Problems are put into difficulty levels 1-5
PRM800K (Lightman et al., 2023) | 12K | H | MATH w/ step-wise labels
MATHQA (Amini et al., 2019) | 37K | C | GRE examinations; has quality concerns
AQUA (Ling et al., 2017) | 100K | C | GRE & GMAT questions
ARB (Sawada et al., 2023) | 105 | C | Contest problems and university math proofs
GHOSTS (Frieder et al., 2023b) | 709 | C |
THEOREMQA-MATH (Chen et al., 2023b) | 442 | C | Theorem as rationale
LILA (Mishra et al., 2022) | 132K | H* | Incorporates 20 existing datasets
MATH-INSTRUCT (Yue et al., 2023) | 260K | H* | Instruction-following style
TABMWP (Lu et al., 2023b) | 38K | H* | Tabular MWP; below the college level

Table 1: Datasets for Math Word Problems.
Level: E = Elementary, M = Middle School, H = High School, C = College, H* = Hybrid.

Question-Answer. An instance of this type of MWP consists of a question (Q) and the final answer (A), such as:

Q: Lily received $20 from her mum. After spending $10 on a storybook and $2.5 on a lollipop, how much money does she have left?
A: $7.5

Question-Equation-Answer. Compared with Question-Answer, this MWP type also provides the solution equation, such as:

Q: Jack had 8 pens and Mary had 5 pens. Jack gave 3 pens to Mary. How many pens does Jack have now?
E: 8 - 3
A: 5 (optional)

Question-Rationale-Answer. This type of MWP includes answers and reasoning paths, akin to the Chain-of-Thought method, which explicates reasoning steps rather than defining problem types (Wei et al., 2022). The rationale guides correct problem-solving and serves as a valuable reference for model training, including fine-tuning and few-shot learning.

Q: Beth bakes 4, 2 dozen batches of cookies in a week. If these cookies are shared amongst 16 people equally, how many cookies does each person consume?
R: Beth bakes 4 2-dozen batches of cookies for a total of 4 * 2 = <<4*2=8>>8 dozen cookies. There are 12 cookies in a dozen and she makes 8 dozen cookies for a total of 12 * 8 = <<12*8=96>>96 cookies. She splits the 96 cookies equally amongst 16 people so they each eat 96 / 16 = <<96/16=6>>6 cookies.
A: 6
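The <<4*2=8>> markers in the rationale above follow the GSM8K calculator-annotation convention; for evaluation, most works compare only the final numeric answer. The sketch below is illustrative only (not an official evaluation script of any dataset): it strips such annotations and extracts the final number from a generated rationale.

```python
import re
from typing import Optional

CALC_PATTERN = re.compile(r"<<[^>]*>>")        # GSM8K-style calculator annotations
NUMBER_PATTERN = re.compile(r"-?\d+(?:\.\d+)?")

def strip_annotations(rationale: str) -> str:
    """Remove <<...>> calculator annotations, keeping the surrounding text."""
    return CALC_PATTERN.sub("", rationale)

def extract_final_answer(rationale: str) -> Optional[str]:
    """Heuristically take the last number mentioned in the rationale as the answer."""
    numbers = NUMBER_PATTERN.findall(strip_annotations(rationale))
    return numbers[-1] if numbers else None

rationale = ("Beth bakes 4 2-dozen batches of cookies for a total of 4 * 2 = <<4*2=8>>8 "
             "dozen cookies. There are 12 cookies in a dozen and she makes 8 dozen cookies "
             "for a total of 12 * 8 = <<12*8=96>>96 cookies. She splits the 96 cookies "
             "equally amongst 16 people so they each eat 96 / 16 = <<96/16=6>>6 cookies.")
print(extract_final_answer(rationale))  # -> '6'
```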
Table 1 lists most of the datasets, summarized in three categories: Question-Answer, Question-Equation-Answer, and Question-Rationale-Answer. In addition to these three conventional MWP styles, recent work has studied MWP over given tables and even MWP generation.

Tabular MWP. TABMWP (Lu et al., 2023b) is the first dataset to study MWP over tabular context on open domains and is the largest in terms of data size. Each problem in TABMWP is accompanied by a tabular context, which is represented in three formats: an image, a semi-structured text, and a structured table.

BEADS | $/KILOGRAM
heart-shaped | 3
rectangular | 2
spherical | 2
oval | 2

Table 2: Table for the tabular MWP example.

T: Table 2
Q: Henrik bought 2.5 kilograms of oval beads. How much did he spend? (Unit: $)
A: 5

MWP Generation. Instead of deriving the answer for a given math question, this type of mathematical reasoning tries to generate MWP questions. For example, Wang et al. (2021) fine-tuned GPT-2 (Radford et al., 2019) on equation-to-MWP instances for MWP generation. The effectiveness of GPT-3's question-generation capabilities was assessed by Zong and Krishnamachari (2023), who instructed the model to generate a question similar to a provided MWP question. Deb et al. (2023) analyzed a group of LLMs (GPT-4, GPT-3.5, PaLM-2 (Anil et al., 2023), and LLaMA (Touvron et al., 2023a)) and found a significant drop in accuracy for backward reasoning compared to forward reasoning. Norberg et al. (2023) used GPT-4 to rewrite human-written MWP, reporting optimal readability, lexical diversity, and cohesion scores, although GPT-4 rewrites incorporated more low-frequency words.

3.3 Geometry

Compared with MWP, GEOMETRY problems involve a distinct set of challenges. While MWP often require logical reasoning and arithmetic operations, geometry problems demand a spatial understanding of shapes, sizes, and their interrelationships. Solving geometry problems typically entails applying geometric principles, theorems, and formulas to analyze and deduce properties of geometric figures. Furthermore, current geometry approaches mainly rely on symbolic methods and predefined search heuristics, highlighting the specialized strategies required in this domain (Trinh et al., 2024). This contrast in problem-solving approaches highlights the multifaceted nature of mathematical challenges and the varied skill sets required in different mathematical domains. An example can be seen below, and Table 3 lists mainstream datasets.

[Figure: a geometric figure labeled with side lengths a, b, c and height h]
Q: a=7 inches; b=24 inches; c=25 inches; h=5.4 inches; What is its area? (Unit: square inches)
A: 24.03

NAME | SIZE
GEOSHADER (Alvin et al., 2017) | 102
GEOS (Seo et al., 2015) | 186
GEOS++ (Sachan et al., 2017) | 1.4K
GEOS-OS (Sachan and Xing, 2017) | 2.2K
GEOMETRY3K (Lu et al., 2021) | 3K
GEOQA (Chen et al., 2021a) | 5K
UNIGEO (Chen et al., 2022) | 14.5K

Table 3: Geometry datasets.

3.4 Automated theorem proving

In the specialized area of Automated Theorem Proving (ATP), the inherent challenges are unique and encompass a wide spectrum, akin to those found in distinct mathematical fields. ATP's core focus is on autonomously constructing proofs for specified conjectures, requiring a blend of logical analysis and a profound grasp of formal languages, supported by an extensive knowledge base. Its application is crucial in areas like the validation and development of both software and hardware systems.
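To make the target representation concrete, the following toy example (our own illustration, not drawn from any of the benchmarks discussed below) shows the kind of formally stated theorem, written here in Lean 4, that ATP systems are asked to prove automatically:

```lean
-- A toy formal statement and proof in core Lean 4 (no extra libraries assumed).
-- ATP benchmarks pose statements like this, typically far harder, and ask the
-- proving system to synthesize the proof term or tactic script on its own.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```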
For example, the MINIF2F dataset (Zheng et al., 2022) stands out in ATP, featuring a series of complex Olympiad-level mathematical problems designed to evaluate theorem-proving systems including Metamath (Yu et al., 2023), Lean (Han et al., 2022), and Isabelle (Wenzel et al., 2008). In a similar vein, the HOList benchmark (Bansal et al., 2019), with its comprehensive array of theorem statements from various corpora, sets a sequential proving challenge for ATP systems, where each theorem must be proved using only the lemmas preceding it. Additionally, the COQGYM dataset (Yang and Deng, 2019) provides a broad ATP environment, showcasing a rich collection of more than 71,000 proofs penned by humans, all within the framework of the Coq proof assistant. These datasets illustrate the diverse methodologies and skill sets necessary in ATP, reflecting the multifaceted nature of solving mathematical problems.

3.5 Math in vision-language context

CHARTQA (Masry et al., 2022), with 9.6K human-written questions and 23.1K model-generated questions, explores a variety of complex reasoning questions that involve several logical and arithmetic operations over charts. MATHVISTA (Lu et al., 2023a), with 6K examples, features seven types of mathematical reasoning: algebraic reasoning, arithmetic reasoning, geometry reasoning, logical reasoning, numeric common sense, scientific reasoning, and statistical reasoning. In addition, fine-grained metadata are available, including question type, answer type, language, source, category, task, grade level, and visual context.

4 Methodologies

We summarize these methods into three progressive levels: i) prompting frozen LLMs, ii) strategies enhancing frozen LLMs, and iii) fine-tuning LLMs.

4.1 Prompting frozen LLMs

We organize prior work by typical LLMs.

GPT-3. Zong and Krishnamachari (2023) evaluated the use of GPT-3, a 175B-parameter transformer model, for three related challenges pertaining to math word problems: i) classifying word problems, ii) extracting equations from word problems, and iii) generating word problems.

ChatGPT. Shakarian et al. (2023) reported the first independent evaluation of ChatGPT on MWP and found that ChatGPT's performance changes dramatically based on the requirement to show its work. Cheng and Zhang (2023) assessed ChatGPT, OpenAI's conversational chatbot and LLM, on its performance in elementary-grade arithmetic and logic problems, and found that ChatGPT performed better than previous models such as InstructGPT (Ouyang et al., 2022) and Minerva (Lewkowycz et al., 2022).

GPT-4. Wu et al. (2023) adapted and evaluated several existing prompting methods for use with GPT-4, including a vanilla prompt, a Program-of-Thoughts prompt (Chen et al., 2023a), and a Program Synthesis prompt (Drori et al., 2022). The study by Gu (2023) investigated the capability of GPT-4 to actively engage in math-oriented brainstorming sessions. This includes tasks like identifying new research problems, refining problem formulations, and suggesting potential methods or unconventional solutions, all achieved through iterative ideation with a human partner, a common practice in collaborative brainstorming with other professionals.

GPT-4V & Bard. Lu et al. (2023a) presented MATHVISTA, a benchmark for evaluating mathematical reasoning in visual contexts, and conducted a comprehensive, quantitative evaluation of three LLMs (i.e., ChatGPT, GPT-4, Claude-2 (Bai et al., 2022)), two proprietary large multimodal models (LMMs) (i.e., GPT-4V, Bard), and seven open-source LMMs, with Chain-of-Thought and Program-of-Thought prompting.

Multiple. Wei et al. (2023) evaluated a variety of popular LLMs, including both commercial and open-source options, aiming to provide a benchmark tool for assessing the following question: to what grade level of Chinese elementary school math do the abilities of popular LLMs correspond?

4.2 Strategies enhancing frozen LLMs

Preprocessing the math question. An et al. (2023a) explored ChatGPT on the SVAMP dataset and observed that substituting numerical expressions with English expressions can elevate performance.

More advanced prompts. Chain-of-Thought prompting (Wei et al., 2022) was the first to steer LLMs toward step-by-step math reasoning; Self-Consistency (Wang et al., 2023) samples multiple Chain-of-Thought reasoning paths and leverages a consistency mechanism to discover a more probable answer. Zhou et al. (2023a) proposed a novel and effective prompting method, explicit code-based self-verification, to further boost the mathematical reasoning potential of the GPT-4 Code Interpreter. This method employs a zero-shot prompt that encourages the GPT-4 Code Interpreter to use code to self-verify its answers.
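As an illustration of the Self-Consistency idea described above, the sketch below samples several reasoning paths and majority-votes on the extracted final answers. The generate_cot_answer argument stands in for any sampled LLM call (temperature > 0); it is a placeholder, not a specific API.

```python
from collections import Counter
from typing import Callable

def self_consistency_answer(question: str,
                            generate_cot_answer: Callable[[str], str],
                            num_samples: int = 10) -> str:
    """Sample several chain-of-thought completions and return the majority answer.

    `generate_cot_answer` is assumed to return only the final answer string
    extracted from one sampled reasoning path (e.g., '7.5').
    """
    votes = Counter(generate_cot_answer(question) for _ in range(num_samples))
    answer, _count = votes.most_common(1)[0]
    return answer

# Usage sketch with a stub in place of a real sampled LLM call:
# answer = self_consistency_answer("Lily received $20 ...", my_llm_sampler, num_samples=20)
```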
Using external tools. Yamauchi et al. (2023) employed an external tool, specifically the Python REPL, to correct errors in Chain-of-Thought reasoning. Their demonstration highlighted that integrating Chain-of-Thought and the Python REPL through a markup language improves the reasoning capabilities of ChatGPT. In a related context, He-Yueya et al. (2023) introduced an approach that merges an LLM, Codex (Chen et al., 2021b), capable of progressively formalizing word problems into variables and equations, with an external symbolic solver adept at solving the generated equations. Program-of-Thought (Chen et al., 2023a) separates the computational aspect from the reasoning by utilizing a language model (primarily Codex) to articulate the reasoning procedure as a program. The actual computation is delegated to an external computer, responsible for executing the generated programs to arrive at the desired answer.
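The following minimal sketch illustrates the Program-of-Thought / external-tool pattern just described: the model is prompted to emit a short Python program rather than a final answer, and a separate interpreter executes it. Here generate_program is a placeholder for whichever model is used, and the restricted exec environment is a sketch only, not a hardened sandbox.

```python
PROGRAM_PROMPT = (
    "Write Python code that computes the answer to the problem and stores it "
    "in a variable named `answer`. Output only code.\n\nProblem: {question}"
)

def solve_with_program(question: str, generate_program) -> object:
    """Ask the model for a program, then run it and read back `answer`."""
    code = generate_program(PROGRAM_PROMPT.format(question=question))
    namespace: dict = {}
    exec(code, {"__builtins__": {}}, namespace)  # sketch only; not a real sandbox
    return namespace.get("answer")

# Usage sketch: the model might return e.g. "answer = 20 - 10 - 2.5" for the
# Lily problem above, and solve_with_program would then yield 7.5.
```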
Improving the whole interaction. Wu et al. (2023) introduced MathChat, a conversational framework designed for chat-based LLMs. In this framework, math problems from the MATH dataset are resolved through a simulated conversation between the model and a user proxy agent.

Considering more comprehensive factors in evaluation. While accuracy is crucial in evaluating LLMs for math problem-solving, it should not be the sole metric. Other important dimensions include: i) confidence provision: Imani et al. (2023)'s MathPrompter boosts LLM performance and confidence by generating algebraic expressions, providing diverse prompts, and evaluating consensus among multiple runs; and ii) verifiable explanations: Gaur and Saunshi (2023) used concise, verifiable explanations to assess LLM reasoning, revealing their proficiency in zero-shot solving of symbolic MWP and their ability to produce succinct explanations.

4.3 Fine-tuning LLMs

Learning to select in-context examples. As indicated by prior research, few-shot GPT-3's performance is susceptible to instability and may decline to near chance levels due to the reliance on in-context examples. This instability becomes more pronounced when dealing with intricate problems such as TABMWP. In addressing this issue, Lu et al. (2023b) introduced PROMPTPG, which can autonomously learn to select effective in-context examples through policy gradient interactions with the GPT-3 API, eliminating the need for manually designed heuristics.

Generating intermediate steps. Nye et al. (2021) initiated the fine-tuning of decoder-only LLMs ranging from 2M to 137B parameters in size. Their approach involved training these models to solve integer addition and polynomial evaluation by generating intermediate computation steps into a designated "scratchpad." In a related effort, Zhang et al. (2023b) introduced a fine-tuning strategy for GPT-2 or T5, enabling them to produce step-by-step solutions with a combination of textual and mathematical tokens leading to the final answer. Additionally, Yang et al. (2023) applied a step-by-step strategy in fine-tuning a series of GLM models (Zeng et al., 2023), specifically tailored for solving distinct Chinese mathematical problems. Minerva, developed by Lewkowycz et al. (2022), enhances LLMs' ability to generate intermediate steps in complex math problems. Its fine-tuning on diverse datasets enables nuanced, step-by-step problem-solving, demonstrating advanced handling of intricate mathematical concepts.

Learning an answer verifier. OpenAI researchers, per Cobbe et al. (2021), fine-tuned a 175B GPT-3 model as a verifier, assigning probabilities to solution candidates. In exploring reexamination processes for MWP solving, Bin et al. (2023) introduced Pseudo-Dual Learning, involving solving and reexamining modules. For MWP solving, Zhu et al. (2023) developed a cooperative reasoning-induced PLM, with GPT-J (Wang and Komatsuzaki, 2021) generating paths and DeBERTa-large (He et al., 2021) supervising evaluation. Google researchers, per Liu et al. (2023c), observed improved correctness in LLMs given multiple attempts, which hints that LLMs might generate correct solutions while struggling to differentiate between accurate and inaccurate ones. They sequentially fine-tuned their PaLM 2 model (Anil et al., 2023) as a solution generator, evaluator, and generator again.
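A verifier of the kind described above is typically used at inference time to rerank sampled solutions (best-of-n). The sketch below shows only that selection step; sample_solutions and verifier_score are placeholders for a fine-tuned generator and verifier, not references to any released implementation.

```python
from typing import Callable, List, Tuple

def best_of_n(question: str,
              sample_solutions: Callable[[str, int], List[str]],
              verifier_score: Callable[[str, str], float],
              n: int = 16) -> Tuple[str, float]:
    """Sample n candidate solutions and return the one the verifier scores highest.

    verifier_score(question, solution) is assumed to return the verifier's
    estimated probability that the solution is correct.
    """
    candidates = sample_solutions(question, n)
    scored = [(sol, verifier_score(question, sol)) for sol in candidates]
    return max(scored, key=lambda pair: pair[1])

# Usage sketch:
# best_solution, confidence = best_of_n("Beth bakes 4, 2 dozen batches ...",
#                                       my_generator, my_verifier, n=32)
```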
Learning from enhanced datasets. Emulating the error-driven learning process observed in human learning, An et al. (2023b) conducted fine-tuning on various open-source LLMs within the LLaMA (Touvron et al., 2023a), LLaMA-2 (Touvron et al., 2023b), CodeLLaMA (Rozière et al., 2023), WizardMath (Luo et al., 2023), MetaMath (Yu et al., 2023), and Llemma (Azerbayev et al., 2023) families. This fine-tuning utilized mistake-correction data pairs generated by GPT-4. To mitigate over-reliance on knowledge distillation from LLM teachers, Liang et al. (2023a) fine-tuned LLaMA-7B on existing mathematical problem datasets that exhibit diverse annotation styles. In a related approach, Raiyan et al. (2023) demonstrated that training on linguistic variants of problem statements and implementing a voting mechanism for candidate predictions enhance the mathematical reasoning and overall robustness of the model.

Teacher-student knowledge distillation. Liang et al. (2023b) utilized GPT-3 to coach a more efficient MWP solver (a RoBERTa-based encoder-decoder (Liu et al., 2019)). They shifted the focus from explaining existing exercises to identifying the student model's learning needs and generating new, tailored exercises. The resulting smaller model achieves competitive accuracy on the SVAMP dataset with significantly fewer parameters compared to state-of-the-art LLMs.

Fine-tuning on many datasets. Mishra et al. (2022) conducted fine-tuning on a series of GPT-Neo-2.7B causal language models (Black et al., 2021) using LILA, a composite of 20 existing math datasets. Similarly, Yue et al. (2023) created MATH-INSTRUCT, a meticulously curated instruction-tuning dataset. Comprising 13 math datasets with intermediate Chain-of-Thought and Program-of-Thought rationales, this dataset was used to fine-tune LLaMA (Touvron et al., 2023a,b; Rozière et al., 2023) models across different scales. The resulting models demonstrate unprecedented potential in cross-dataset generalization.

Math solver ensemble. Yao et al. (2023) incorporated a problem-typing subtask that combines the strengths of a tree-based solver and an LLM solver (ChatGLM-6B (Zeng et al., 2023)).

5 Analysis

5.1 LLMs' robustness in math

Patel et al. (2021) provided strong evidence that pre-LLM MWP solvers, mostly LSTM-equipped encoder-decoder models, rely on shallow heuristics to achieve high performance on some simple benchmark datasets, and then introduced a more challenging dataset, SVAMP, created by applying carefully chosen variations to examples sampled from preceding datasets. Stolfo et al. (2023) observed that, among non-instruction-tuned LLMs, the larger ones tend to be more sensitive to changes in the ground-truth result of a MWP, but not necessarily more robust. However, a different behavior exists in the instruction-tuned GPT-3 models, which show a remarkable improvement in both sensitivity and robustness, although the robustness reduces when problems get more complicated. Wei et al. (2023) assessed the robustness of several top-performing LLMs by augmenting the original problems in the curated CMATH dataset with distracting information. Their findings reveal that GPT-4 can maintain robustness while other models fail.

Zhou et al. (2023b) proposed a new dataset, ROBUSTMATH, to evaluate the robustness of LLMs' math-solving ability. Extensive experiments show that (i) adversarial samples from higher-accuracy LLMs are also effective for attacking LLMs with lower accuracy; (ii) complex MWPs (e.g., with more solving steps, longer text, or more numbers) are more vulnerable to attack; and (iii) the robustness of LLMs can be improved by using adversarial samples in few-shot prompts.

5.2 Factors influencing LLMs in math

The comprehensive evaluation conducted by Yuan et al. (2023) encompasses OpenAI's GPT series, including GPT-4, ChatGPT, and GPT-3.5, along with various open-source LLMs. This analysis methodically examines the elements that impact the arithmetic skills of LLMs, covering aspects such as tokenization, pre-training, prompting techniques, interpolation and extrapolation, scaling laws, Chain of Thought (CoT), and In-Context Learning (ICL).

Tokenization. This research underscores tokenization's critical role in LLMs' arithmetic performance (Yuan et al., 2023). Models like T5, lacking specialized tokenization for arithmetic, are less effective than those with more suitable methods, such as Galactica (Taylor et al., 2022) and LLaMA, which show superior accuracy in arithmetic tasks. This indicates that token frequency in pre-training and the method of tokenization are key to arithmetic proficiency.
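The tokenization differences mentioned above are easy to inspect directly. The snippet below is a small, hedged illustration using Hugging Face tokenizers; the checkpoints named here are only examples, and the exact token splits depend on the specific model used.

```python
from transformers import AutoTokenizer

# Example checkpoints only; any comparable models could be substituted.
for name in ["t5-small", "EleutherAI/gpt-neo-2.7B"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    tokens = tokenizer.tokenize("12345 + 6789 = 19134")
    # Tokenizers that split numbers into irregular multi-digit chunks tend to be
    # associated with weaker arithmetic than digit-by-digit splitting.
    print(name, tokens)
```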
Pre-training corpus. Enhanced arithmetic skills in LLMs correlate with the inclusion of code and LaTeX in the pre-training data (Yuan et al., 2023). Galactica, which heavily utilizes LaTeX, excels in arithmetic tasks, while models like Code-DaVinci-002, better at reasoning, lag in arithmetic, highlighting a distinction between arithmetic and reasoning skills.

Prompts. The nature of input prompts greatly affects LLMs' arithmetic performance (Liu et al., 2023a; Lou et al., 2023). Without prompts, performance drops (Yuan et al., 2023). Models like ChatGPT, which respond well to instructional system-level messages, demonstrate the importance of prompt type. Instruction tuning in pre-training also emerges as a significant factor (Yue et al., 2023).

Model scale. There is a noted correlation between parameter count and arithmetic capability in LLMs (Yuan et al., 2023). Larger models generally perform better, but a performance plateau is observed, as shown by Galactica's similar outcomes at 30B and 120B parameters. Larger scale also does not always mean superior performance, with smaller models like ChatGPT occasionally outperforming larger ones.

5.3 Perspectives of mathematics pedagogy

While machine learning emphasizes LLMs' problem-solving abilities in mathematics, in practical education their primary role is to aid learning. Thus, the focus shifts from mere mathematical performance to a crucial consideration of LLMs' understanding of students' needs, capabilities, and learning methods.

Advantages of deploying LLMs in math education. Educators have observed the following benefits of leveraging LLMs for math education. (i) LLMs foster critical thinking and problem-solving skills, as they provide comprehensive solutions and promote rigorous error analysis (Matzakos et al., 2023); (ii) educators and students prefer LLM-generated hints because of their detailed, sequential format and clear, coherent narratives (Gattupalli et al., 2023); (iii) LLMs introduce a conversational style in problem-solving, an invaluable asset in math education (Gattupalli et al., 2023); (iv) the impact of LLMs extends beyond mere computational assistance, offering deep insights and understanding spanning diverse disciplines like Algebra, Calculus, and Statistics (Rane, 2023).

Disadvantages of deploying LLMs in math education. (i) Potential for misinterpretation. Misinterpretation of students' queries or errors in providing explanations by LLMs could lead to confusion. Inaccurate responses might result in the reinforcement of misconceptions, impacting the quality of education (Yen and Hsu, 2023). (ii) Limited understanding of individual learning styles. LLMs may struggle to cater to diverse learning styles, as they primarily rely on algorithms and might not fully grasp the unique needs of each student. Some learners may benefit more from hands-on activities or visual aids that LLMs may not adequately address. Gattupalli et al. (2023) reported that hints produced by GPT-4 can be excessively intricate for younger students who have shorter attention spans. (iii) Privacy and data security issues. Deploying LLMs involves collecting and analyzing substantial amounts of student data. Privacy concerns may arise if proper measures are not in place to safeguard this data from unauthorized access or misuse.

6 Challenges

Data-driven & limited generalization. The prevailing trend in current research revolves around the curation of extensive datasets. Despite this emphasis, there is a noticeable lack of robust generalization across various datasets, grade levels, and types of math problems. Examining how humans acquire math-solving skills suggests that machines may need to embrace continual learning to enhance their capabilities.

LLMs' brittleness in math reasoning. The fragility of LLMs in mathematical reasoning is evident across three dimensions. Firstly, when presented with questions expressed in varying textual forms (comprising words and numbers), LLMs exhibit inconsistent performance. Secondly, for identical questions, an LLM may yield different final answers through distinct reasoning paths during multiple trials. Lastly, pre-trained math-oriented LLMs are susceptible to attacks from adversarial inputs, highlighting their vulnerability in the face of manipulated data.

Human-oriented math interpretation. Current LLM-oriented math reasoning, such as chain-of-thought, does not take into account the needs and comprehension abilities of users, such as students. As an example, Yen and Hsu (2023) discovered that GPT-3.5 had a tendency to misinterpret students' questions in conversation, resulting in a failure to deliver adaptive feedback. Additionally, research conducted by Gattupalli et al. (2023) revealed that GPT-4 frequently overlooks the practical comprehension abilities of younger students.
It tends to generate overly intricate hints that even confuse those students. Consequently, there is a pressing need for increased AI research that actively incorporates human factors into its design, ensuring future developments align more closely with the nuanced requirements of users.

7 Conclusion

This survey on LLMs for mathematics delves into various aspects of LLMs in mathematical reasoning, including their capabilities and limitations. The paper discusses different types of math problems, datasets, and the persisting challenges in the domain. It highlights the advancements in LLMs, their application in educational settings, and the need for a human-centric approach in math education. We hope this paper will guide and inspire future research in the LLM community, fostering further advancements and practical applications in diverse mathematical contexts.
References

Chris Alvin et al. 2017. Synthesis of solutions for shaded area geometry problems. In Proceedings of FLAIRS, pages 14–19.
Aida Amini et al. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of NAACL-HLT, pages 2357–2367.
Jisu An, Junseok Lee, and Gahgene Gweon. 2023a. Does ChatGPT comprehend the place value in numbers when solving math word problems? In Proceedings of the AIED 2023 Workshop "Towards the Future of AI-augmented Human Tutoring in Math Learning", CEUR Workshop Proceedings, volume 3491, pages 49–58.
Shengnan An et al. 2023b. Learning from mistakes makes LLM better reasoner. CoRR, abs/2310.20689.
Rohan Anil et al. 2023. PaLM 2 technical report. CoRR, abs/2305.10403.
Zhangir Azerbayev et al. 2023. Llemma: An open language model for mathematics.
Yuntao Bai et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. CoRR, abs/2204.05862.
Kshitij Bansal et al. 2019. HOList: An environment for machine learning of higher-order theorem proving.
Yi Bin et al. 2023. Solving math word problems with reexamination. CoRR, abs/2310.09590.
Sid Black et al. 2021. GPT-Neo: Large scale autoregressive language modeling with Mesh-TensorFlow.
Yupeng Chang et al. 2023. A survey on evaluation of large language models. CoRR, abs/2307.03109.
Jiaqi Chen et al. 2022. UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression. In Proceedings of EMNLP, pages 3313–3323.
Jiaqi Chen et al. 2021a. GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. In Findings of ACL/IJCNLP, pages 513–523.
Mark Chen et al. 2021b. Evaluating large language models trained on code. CoRR, abs/2107.03374.
Wenhu Chen et al. 2023a. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research.
Wenhu Chen et al. 2023b. TheoremQA: A theorem-driven question answering dataset. In Proceedings of EMNLP, pages 7889–7901.
Vincent Cheng and Yu Zhang. 2023. Analyzing ChatGPT's mathematical deficiencies: Insights and contributions. In Proceedings of ROCLING 2023, pages 188–193.
Karl Cobbe et al. 2021. Training verifiers to solve math word problems. CoRR, abs/2110.14168.
Aniruddha Deb et al. 2023. Fill in the blank: Exploring and enhancing LLM capabilities for backward reasoning in math word problems. CoRR, abs/2310.01991.
Iddo Drori et al. 2022. A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level. Proceedings of the National Academy of Sciences, 119(32):e2123433119.
Simon Frieder et al. 2023a. Large language models for mathematicians. Internationale Mathematische Nachrichten, 254:1–20.
Simon Frieder et al. 2023b. Mathematical capabilities of ChatGPT. CoRR, abs/2301.13867.
Sai Gattupalli et al. 2023. Exploring pre-service teachers' perceptions of large language models-generated hints in online mathematics learning.
Vedant Gaur and Nikunj Saunshi. 2023. Reasoning in large language models through symbolic math word problems. In Findings of ACL, pages 5889–5903.
Sophia Gu. 2023. LLMs as potential brainstorming partners for math and science problems. CoRR, abs/2310.10677.
Jesse Michael Han et al. 2022. Proof artifact co-training for theorem proving with language models. In Proceedings of ICLR.
Pengcheng He et al. 2021. DeBERTa: Decoding-enhanced BERT with disentangled attention. In Proceedings of ICLR.
Joy He-Yueya et al. 2023. Solving math word problems by combining language models with symbolic solvers. CoRR, abs/2304.09102.
Dan Hendrycks et al. 2021. Measuring mathematical problem solving with the MATH dataset. In Proceedings of NeurIPS.
Mohammad Javad Hosseini et al. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of EMNLP, pages 523–533.
Shima Imani, Liang Du, and Harsh Shrivastava. 2023. MathPrompter: Mathematical reasoning using large language models. In Proceedings of ACL, pages 37–42.
Rik Koncel-Kedziorski et al. 2015. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597.
Rik Koncel-Kedziorski et al. 2016. MAWPS: A math word problem repository. In Proceedings of NAACL, pages 1152–1157.
Aitor Lewkowycz et al. 2022. Solving quantitative reasoning problems with language models.
Zhenwen Liang et al. 2023a. MinT: Boosting generalization in mathematical reasoning via multi-view fine-tuning. CoRR, abs/2307.07951.
Zhenwen Liang et al. 2023b. Let GPT be a math tutor: Teaching math word problem solvers with customized exercise generation. CoRR, abs/2305.14386.
Hunter Lightman et al. 2023. Let's verify step by step. CoRR, abs/2305.20050.
Wang Ling et al. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of ACL, pages 158–167.
Pengfei Liu et al. 2023a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35.
Wentao Liu et al. 2023b. Mathematical language models: A survey. CoRR, abs/2312.07622.
Yinhan Liu et al. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Yixin Liu et al. 2023c. Improving large language model fine-tuning for solving math problems. CoRR, abs/2310.10047.
Renze Lou, Kai Zhang, and Wenpeng Yin. 2023. Is prompt all you need? No. A comprehensive and broader view of instruction learning. arXiv preprint arXiv:2303.10475.
Pan Lu et al. 2023a. MathVista: Evaluating math reasoning in visual contexts with GPT-4V, Bard, and other large multimodal models. CoRR, abs/2310.02255.
Pan Lu et al. 2021. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In Proceedings of ACL/IJCNLP, pages 6774–6786.
Pan Lu et al. 2023b. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In Proceedings of ICLR.
Pan Lu et al. 2023c. A survey of deep learning for mathematical reasoning. In Proceedings of ACL, pages 14605–14631.
Haipeng Luo et al. 2023. WizardMath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. CoRR, abs/2308.09583.
Ahmed Masry et al. 2022. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of ACL, pages 2263–2279.
Nikolaos Matzakos, Spyridon Doukakis, and Maria Moundridou. 2023. Learning mathematics with large language models: A comparative study with computer algebra systems and other tools. International Journal of Emerging Technologies in Learning (iJET), 18(20):51–71.
Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of ACL, pages 975–984.
Swaroop Mishra et al. 2022. LILA: A unified benchmark for mathematical reasoning. In Proceedings of EMNLP, pages 5807–5832.
Kole Norberg et al. 2023. Rewriting math word problems with large language models. In Proceedings of the AIED 2023 Workshop on Empowering Education with LLMs, CEUR Workshop Proceedings, volume 3487, pages 163–172.
Maxwell I. Nye et al. 2021. Show your work: Scratchpads for intermediate computation with language models. CoRR, abs/2112.00114.
Long Ouyang et al. 2022. Training language models to follow instructions with human feedback. In NeurIPS.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of NAACL-HLT, pages 2080–2094.
Jinghui Qin et al. 2021. Neural-symbolic solver for math word problems with auxiliary tasks. In Proceedings of ACL/IJCNLP, pages 5870–5881.
Alec Radford et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Syed Rifat Raiyan et al. 2023. Math word problem solving by generating linguistic variants of problem statements. CoRR, abs/2306.13899.
Nitin Rane. 2023. Enhancing mathematical capabilities through ChatGPT and similar generative artificial intelligence: Roles and challenges in solving mathematical problems. SSRN Electronic Journal.
Bernardino Romera-Paredes et al. 2023. Mathematical discoveries from program search with large language models. Nature, pages 1–3.
Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of EMNLP, pages 1743–1752.
Baptiste Rozière et al. 2023. Code Llama: Open foundation models for code. CoRR, abs/2308.12950.
Mrinmaya Sachan, Avinava Dubey, and Eric P. Xing. 2017. From textbooks to knowledge: A case study in harvesting axiomatic knowledge from textbooks to solve geometry problems. In Proceedings of EMNLP, pages 773–784.
Mrinmaya Sachan and Eric P. Xing. 2017. Learning to solve geometry problems from natural language demonstrations in textbooks. In Proceedings of *SEM, pages 251–261.
Tomohiro Sawada et al. 2023. ARB: Advanced reasoning benchmark for large language models. CoRR, abs/2307.13692.
Min Joon Seo et al. 2015. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of EMNLP, pages 1466–1476.
Paulo Shakarian et al. 2023. An independent evaluation of ChatGPT on mathematical word problems (MWP). In Proceedings of the AAAI 2023 Spring Symposium on Challenges Requiring the Combination of Machine Learning and Knowledge Engineering (AAAI-MAKE 2023), CEUR Workshop Proceedings, volume 3433.
Alessandro Stolfo et al. 2023. A causal framework to quantify the robustness of mathematical reasoning with language models. In Proceedings of ACL, pages 545–561.
Ross Taylor et al. 2022. Galactica: A large language model for science. CoRR, abs/2211.09085.
Alberto Testolin. 2023. Can neural networks do arithmetic? A survey on the elementary numerical skills of state-of-the-art deep learning models. CoRR, abs/2303.07735.
Hugo Touvron et al. 2023a. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971.
Hugo Touvron et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
Trieu Trinh et al. 2024. Solving olympiad geometry without human demonstrations. Nature.
Shyam Upadhyay and Ming-Wei Chang. 2017. Annotating derivations: A new evaluation strategy and dataset for algebra word problems. In Proceedings of EACL, pages 494–504.
Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 billion parameter autoregressive language model.
Xuezhi Wang et al. 2023. Self-consistency improves chain of thought reasoning in language models. In Proceedings of ICLR.
Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of EMNLP, pages 845–854.
Zichao Wang, Andrew S. Lan, and Richard G. Baraniuk. 2021. Math word problem generation with mathematical consistency and problem context constraints. In Proceedings of EMNLP, pages 5986–5999.
Jason Wei et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of NeurIPS.
Tianwen Wei et al. 2023. CMATH: Can your language model pass Chinese elementary school math test? CoRR, abs/2306.16636.
Makarius Wenzel, Lawrence C. Paulson, and Tobias Nipkow. 2008. The Isabelle framework. In Theorem Proving in Higher Order Logics (TPHOLs 2008), pages 33–38. Springer.
Yiran Wu et al. 2023. An empirical study on challenging math problem solving with GPT-4. CoRR, abs/2306.01337.
Ryutaro Yamauchi et al. 2023. LPML: LLM-prompting markup language for mathematical reasoning. CoRR, abs/2309.13078.
Kaiyu Yang and Jia Deng. 2019. Learning to prove theorems via interacting with proof assistants.
Zhen Yang et al. 2023. GPT can solve mathematical problems without a calculator. CoRR, abs/2309.03241.
Jie Yao, Zihao Zhou, and Qiufeng Wang. 2023. Solving math word problem with problem type classification. In Proceedings of NLPCC, volume 14304, pages 123–134.
An-Zi Yen and Wei-Ling Hsu. 2023. Three questions concerning the use of large language models to facilitate mathematics learning. CoRR, abs/2310.13615.
Longhui Yu et al. 2023. MetaMath: Bootstrap your own mathematical questions for large language models. CoRR, abs/2309.12284.
Zheng Yuan et al. 2023. How well do large language models perform in arithmetic tasks? CoRR, abs/2304.02015.
Xiang Yue et al. 2023. MAmmoTH: Building math generalist models through hybrid instruction tuning. CoRR, abs/2309.05653.
Aohan Zeng et al. 2023. GLM-130B: An open bilingual pre-trained model. In Proceedings of ICLR.
Beichen Zhang et al. 2023a. Evaluating and improving tool-augmented computation-intensive math reasoning. arXiv preprint arXiv:2306.02408.
Mengxue Zhang et al. 2023b. Interpretable math word problem solution generation via step-by-step planning. In Proceedings of ACL, pages 6858–6877.
Wei Zhao et al. 2020. Ape210K: A large-scale and template-rich dataset of math word problems.
Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. 2022. MiniF2F: A cross-system benchmark for formal olympiad-level mathematics.
Wanjun Zhong et al. 2023. AGIEval: A human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364.
Aojun Zhou et al. 2023a. Solving challenging math word problems using GPT-4 Code Interpreter with code-based self-verification. CoRR, abs/2308.07921.
Zihao Zhou et al. 2023b. MathAttack: Attacking large language models towards math solving ability. CoRR, abs/2309.01686.
Xinyu Zhu et al. 2023. Solving math word problems via cooperative reasoning induced language models. In Proceedings of ACL, pages 4471–4485.
Mingyu Zong and Bhaskar Krishnamachari. 2023. Solving math word problems concerning systems of equations with GPT-3. In Proceedings of AAAI, pages 15972–15979.
