Mathematical Language Models: A Survey

WENTAO LIU∗ , School of Computer Science and Technology, East China Normal University, China
HANGLEI HU∗ , Department of Educational Information Technology, East China Normal University, China
JIE ZHOU† , School of Computer Science and Technology, East China Normal University, China

YUYANG DING, School of Computer Science and Technology, East China Normal University, China
JUNSONG LI, School of Computer Science and Technology, East China Normal University, China
JIAYI ZENG, School of Computer Science and Technology, East China Normal University, China
MENGLIANG HE, School of Computer Science and Technology, East China Normal University, China
QIN CHEN, School of Computer Science and Technology, East China Normal University, China
BO JIANG, Department of Educational Information Technology, East China Normal University, China
and Lab of Artificial Intelligence for Education, East China Normal University, China
AIMIN ZHOU† , School of Computer Science and Technology, East China Normal University, China and Lab
of Artificial Intelligence for Education, East China Normal University, China
LIANG HE, School of Computer Science and Technology, East China Normal University, China and Lab of
Artificial Intelligence for Education, East China Normal University, China
In recent years, there has been remarkable progress in leveraging Language Models (LMs), encompassing Pre-
trained Language Models (PLMs) and Large-scale Language Models (LLMs), within the domain of mathematics.
This paper conducts a comprehensive survey of mathematical LMs, systematically categorizing pivotal research
endeavors from two distinct perspectives: tasks and methodologies. The landscape reveals a large number of
proposed mathematical LLMs, which we further delineate into instruction learning, tool-based methods,
fundamental chain-of-thought (CoT) techniques, and advanced CoT methodologies. In addition, our survey entails the compilation
of over 60 mathematical datasets, including training datasets, benchmark datasets, and augmented datasets.
Addressing the primary challenges and delineating future trajectories within the field of mathematical LMs,
this survey is positioned as a valuable resource, poised to facilitate and inspire future innovation among
researchers invested in advancing this domain.
∗ Both authors contributed equally to this research.
† Corresponding authors.

Authors’ addresses: Wentao Liu, School of Computer Science and Technology, East China Normal University, Shanghai,
China; Hanglei Hu, Department of Educational Information Technology, East China Normal University, Shanghai, China; Jie
Zhou, School of Computer Science and Technology, East China Normal University, Shanghai, China, [email protected];
Yuyang Ding, School of Computer Science and Technology, East China Normal University, Shanghai, China; Junsong Li,
School of Computer Science and Technology, East China Normal University, Shanghai, China; Jiayi Zeng, School of Computer
Science and Technology, East China Normal University, Shanghai, China; Mengliang He, School of Computer Science
and Technology, East China Normal University, Shanghai, China; Qin Chen, School of Computer Science and Technology,
East China Normal University, Shanghai, China; Bo Jiang, Department of Educational Information Technology, East China
Normal University, Shanghai, China and Lab of Artificial Intelligence for Education, East China Normal University, Shanghai,
China; Aimin Zhou, School of Computer Science and Technology, East China Normal University, Shanghai, China and Lab
of Artificial Intelligence for Education, East China Normal University, Shanghai, China, [email protected]; Liang
He, School of Computer Science and Technology, East China Normal University, Shanghai, China and Lab of Artificial
Intelligence for Education, East China Normal University, Shanghai, China.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM 0004-5411/2018/8-ART111
https://doi.org/XXXXXXX.XXXXXXX


CCS Concepts: • Computing methodologies → Natural language generation; Scene understanding; Cognitive robotics; Cognitive science; Intelligent agents.
Additional Key Words and Phrases: Mathematics, Language Models, Pre-trained, LLMs, Survey
ACM Reference Format:
Wentao Liu, Hanglei Hu, Jie Zhou, Yuyang Ding, Junsong Li, Jiayi Zeng, Mengliang He, Qin Chen, Bo Jiang,
Aimin Zhou, and Liang He. 2018. Mathematical Language Models: A Survey. J. ACM 37, 4, Article 111
(August 2018), 34 pages. https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION
Mathematics is the queen of sciences.
– Carl Friedrich Gauss
Mathematics stands as a foundational skill integral to human intelligence, wielding significance
across diverse fields such as natural sciences, engineering, medicine, finance, computer science, and
social sciences. Within the field of natural language processing (NLP), the development of computer
models aimed at autonomously resolving mathematical word problems has captivated researchers
since as early as 1963 [15, 16, 38, 41]. We believe that addressing this problem is a potential way
toward illuminating pathways for general reasoning mechanisms, consequently advancing the
pursuit of general artificial intelligence (AI).
One main traditional solution to math word problems is statistical learning-based methodologies
[62, 85, 121, 220]. Notably, machine learning techniques [85, 148, 151] alongside semantic parsing
methods [83, 161] have been employed to address this challenge, showcasing promising outcomes
on certain datasets. The evolution of deep learning has prompted considerable interest in crafting
neural networks capable of resolving mathematical problems [28, 183]. Notably, recent years have
witnessed significant advancements in the domain of mathematical AI [112, 185], propelled by
the emergence of powerful language models (LMs) [78, 172]. These language models, comprising
pre-trained language models (PLMs) and large-scale language models (LLMs), have assumed a
central role in reshaping the landscape of mathematical exploration and practical applications. The
paramount focus of this comprehensive survey lies in assessing the impact of these models on the
field of mathematics. The survey endeavors to provide a thorough overview of extant research and
its consequential implications.
PLMs such as BERT [32], RoBERTa [102], BART [89], GPT-1 [144] and GPT-2 [145] undergo pre-
training on extensive textual corpora to assimilate worldly knowledge. To enhance mathematical
performance, certain endeavors focus on either pre-training or fine-tuning PLMs using mathematical
datasets [26, 39, 47]. For instance, methodologies like GenBERT [47], NF-NSM [39], MathBERT [136]
and LISA [70] incorporate numerical data or mathematical formulas into PLMs to augment their
capabilities. Moreover, specialized modules or tailored loss functions are devised, leveraging existing
PLMs to learn mathematical reasoning and operations [74, 94, 214, 224].
The recent advent of LLMs, exemplified by OpenAI’s GPT-4 [131], has catalyzed an unforeseen
surge in innovation, underscoring the multifaceted potential of AI within the domain of mathematics.
These models ([172, 173]) have demonstrated remarkable success across diverse Natural Language
Processing (NLP) tasks, leveraging in-context learning [18, 21, 118] and instruction learning [101,
195]. Recent studies by Wang et al. [182] indicate that LLMs featuring over 100 billion parameters
(e.g., GPT-3 [18] with 175 billion, PaLM [23] with 540 billion) exhibit the capability to address
intricate tasks by employing a chain-of-thought (CoT) mechanism [185] when furnished with a
limited set of reasoning examples as demonstrations. Advancements in CoT frameworks [20, 43, 107,
210] have been tailored to enhance mathematical performance, incorporating tools and programs
[34, 44, 58]. An exemplar in this domain is LLEMMA [9], an openly accessible LLM specifically


Mathematical Tasks
  Mathematical Calculation (§2.1)
    Arithmetic Representation: DigitRNN [168], DigitCNN [178], GenBERT [47], NumBERT [208]
    Arithmetic Calculation: ScratchpadGPT [130], Goat [101], MathGLM [195]
  Mathematical Reasoning (§2.2)
    Math Word Problem Solving: MathPrompter [66], MetaMath [200], WizardMath [110], MathAttack [223], LLEMMA [9]
    Math Question Answering: GEOS [158], DROP [35], Mathematics [153], Lila [119]
    Theorem Proving: DeepMath [67], ASTactic [194], NaturalProofs [186], INT [190]

Fig. 1. Taxonomy of mathematical tasks.

designed for mathematical tasks. LLEMMA showcases the capacity to generate self-contained
textual solutions for mathematical problems and formulate formal proofs.
We mainly review the most related surveys about mathematical reasoning and LMs [117].
Lu et al. [109] delineate the landscape of deep learning applications specifically pertaining to
mathematical reasoning. Qiao et al. [141] predominantly delve into reasoning mechanisms centered
around language models, encompassing arithmetic, commonsense, logical, symbolic, and multi-
modal reasoning. Chu et al. [24] offer an exploration of studies focused on chain-of-thought (CoT)
reasoning. Additionally, Qiu et al. [143] and Zhao et al. [213] contribute comprehensive reviews
elucidating the realm of existing PLMs and LLMs, respectively.
Diverging from these perspectives, our work delves into the burgeoning domain of mathematical
language models, systematically summarizing the diverse range of studies and innovations in
this field. In light of the demonstrated proficiency of LMs in comprehending, generating, and
manipulating mathematical expressions, a multitude of advanced methodologies have emerged
within the field of mathematics leveraging LMs. By shedding light on the current landscape, tasks,
datasets, challenges, and future prospects, we aim to inspire and inform those who aspire to harness
the power of language models to revolutionize mathematics and its myriad applications.
Specifically, our categorization of mathematical tasks (§2) contains two primary domains: mathe-
matical calculation (§2.1), consisting of arithmetic representation and arithmetic calculation, and
mathematical reasoning (§2.2), consisting of problem-solving and theorem proving. Furthermore,
our taxonomy of existing algorithms segregates them into PLMs-based approaches (§3), including
autoregression (§3.1) and non-autoregression (§3.2) LMs, and LLMs-based methodologies (§4),
which include instruction learning (§4.1), tool-based strategies (§4.2), fundamental CoT techniques
(§4.3), and advanced CoT methodologies (§4.4). Moreover, we provide a comprehensive compilation
of over 60 mathematical datasets (§5), systematically categorized into training (§5.1), benchmark
(§5.2), and augmented datasets (§5.3). This classification aims to facilitate research by delineating
the utility and applicability of these datasets within distinct research contexts. Lastly, our survey
meticulously addresses the primary challenges and explores the future directions of this domain
(§6), including faithfulness, multi-modality, uncertainty, evaluation, creation, application, and data
scarcity. In conclusion (§7), we believe that this comprehensive survey will serve as a valuable asset
for researchers and practitioners alike, seeking to push the boundaries of innovation in the field of
mathematical language models.

2 MATHEMATICAL TASKS
In this section, we summarize the existing mathematical tasks into mathematical calculation and
mathematical reasoning (Figure 1).


2.1 Mathematical Calculation


The advent of Language Models (LMs) has ushered in a new era of exploration into their computa-
tional capabilities, particularly in arithmetic. Initially, these models demonstrated basic computa-
tional abilities by representing numbers in textual formats. With the increasing capabilities of LMs,
it has been observed that they can acquire arithmetic skills through methods such as fine-tuning,
even without carefully crafted numeric representations.
2.1.1 Arithmetic Representation. In the early stages of research, numerical values were either
omitted or oversimplified: treated as ordinary text or categorized as “unknown” (UNK). This
approach, however, proved inadequate for numerical reasoning tasks. For instance, BERT performs
five times worse when the answer is a number rather than a textual span on the DROP
question-answering benchmark [35].
Recent literature proposes various methods for numerical representation. Geva et al. [47] in-
troduce GenBERT, which tokenizes numbers at the digit level and undergoes fine-tuning with
arithmetic word problems and simple arithmetic tasks. Zhang et al. [209] experiment with a re-
trained BERT model by converting numbers into scientific notation (e.g., 314.1 as 3141[EXP]2).
Spithourakis and Riedel [168], along with Wallace et al. [178], explore the integration of digit
embeddings into a singular embedding that represents the entire number. Berg-Kirkpatrick and
Spokoyny [11] propose a method with digit-RNN and exponent embeddings. This method specifi-
cally emphasizes the exponent while disregarding the mantissa. Goat [101] introduces Consistent
Tokenization, enhancing the relationships between similar numerical values.
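To make these representation strategies concrete, the following minimal Python sketch (our own illustration, not code from GenBERT, Goat, or Zhang et al. [209]; all function names are hypothetical) shows a digit-level tokenizer and a mantissa/[EXP] encoding of the kind described above.

```python
# Illustrative sketch only: digit-level tokenization (GenBERT-style) and a
# mantissa/[EXP] encoding (as in converting 314.1 into 3141[EXP]2).
def digit_level_tokens(text: str) -> list:
    """Split every integer or decimal number into individual digit tokens."""
    tokens = []
    for word in text.split():
        if word.replace(".", "", 1).isdigit():
            tokens.extend(list(word))            # "748" -> ["7", "4", "8"]
        else:
            tokens.append(word)
    return tokens

def scientific_notation_token(value: float, precision: int = 4) -> str:
    """Encode a number as mantissa digits plus an [EXP] exponent token."""
    mantissa, exponent = f"{value:e}".split("e")
    digits = mantissa.replace(".", "").rstrip("0")[:precision] or "0"
    return f"{digits}[EXP]{int(exponent)}"

print(digit_level_tokens("Add 748 and 57"))      # ['Add', '7', '4', '8', 'and', '5', '7']
print(scientific_notation_token(314.1))          # 3141[EXP]2
```

Both variants make the magnitude and digit structure of a number explicit to the tokenizer instead of leaving it to a subword vocabulary.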
2.1.2 Arithmetic Calculation. There has been significant research into the arithmetic capabilities
of LLMs. Nogueira et al. [128] and Wang et al. [180] assessed addition and subtraction tasks. Muffo
et al. [123] evaluated two-digit multiplication. Yuan et al. [202] evaluated the arithmetic operation
capabilities of different models, including GPT-4 [131], Galactica [170], and LLaMA [172].
Traditionally, it was presumed that LLMs could not accurately perform complex arithmetic,
especially multiplication involving more than eight digits. However, recent approaches challenge
this assumption. Zhou et al. [219] apply specialized prompt engineering to enhance addition ca-
pabilities but note limitations in multiplication beyond seven digits. Jelassi et al. [68] investigate
length generalization in basic arithmetic tasks using techniques like relative position embeddings
and training set priming. ScratchpadGPT [130] demonstrates the effectiveness of pre-generating a
Chain of Thought (CoT) before producing an answer in 8-digit addition tasks. Goat [101] utilizes
supervised instruction for fine-tuning elementary arithmetic operations with large integers. Math-
GLM [195] excels in intricate arithmetic tasks by pre-training on datasets that decompose complex
arithmetic expressions into simpler steps. Through this process, it gradually generates answers and
learns the rules of computation.
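As a concrete illustration of this step-wise decomposition, the sketch below (an assumption about the data format in the spirit of MathGLM, not the authors' actual pipeline) rewrites a flat arithmetic expression as the sequence of simpler steps a model could be trained to emit.

```python
# Illustrative sketch: turn "35 * 46 + 78" into the intermediate steps
# "35 * 46 = 1610" and "1610 + 78 = 1688" that a model could learn to generate.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}
SYMBOLS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}

def decompose(expression: str) -> list:
    steps = []
    def evaluate(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            left, right = evaluate(node.left), evaluate(node.right)
            value = OPS[type(node.op)](left, right)
            steps.append(f"{left} {SYMBOLS[type(node.op)]} {right} = {value}")
            return value
        raise ValueError("unsupported expression")
    evaluate(ast.parse(expression, mode="eval").body)
    return steps

print(decompose("35 * 46 + 78"))   # ['35 * 46 = 1610', '1610 + 78 = 1688']
```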

2.2 Mathematical Reasoning


Mathematical reasoning is a pivotal aspect of artificial intelligence that facilitates the understanding
and solving of complex mathematical problems. The integration of LLMs in this domain has been
significant, owing to their ability to interpret, process, and generate complex natural text. This
section delves into the current research and developments within this field, focusing on two main
areas: math problem solving and theorem proving.
2.2.1 Math Problem Solving. In artificial intelligence, math problem solving refers to the process
of using algorithms, computational models, and, increasingly, LLMs to understand, explain, and
solve mathematical problems. This line of work
spans all levels from basic arithmetic to advanced mathematics, including but not limited to algebra,


geometry, statistics, and calculus. In this subsection, we condense it into Math Word Problems
(MWPs) and Math Question Answering (MQA).
Math Word Problem Solving. Math Word Problems (MWPs) refer to problems that present mathe-
matical concepts and calculations in the form of written descriptions. Such questions usually include
one or more story situations describing a series of events or conditions from which the solver needs
to extract relevant mathematical information and apply appropriate mathematical principles to
solve the problem. Over the years, using various efficient and intelligent algorithms to solve MWPs
has also been a research focus in the fields of computer science and artificial intelligence [15].
Recent research has highlighted the growing capabilities of LLMs in the field of mathematical
word problem solving, emphasizing the trend toward more nuanced and sophisticated AI-driven
mathematical analysis. MathPrompter [66] uses the GPT-3 DaVinci LLM to solve MWPs with
excellent results, demonstrating the potential of LLMs to not only explain but also generate complex
mathematical reasoning, reflecting a human-like understanding of complex problem sets.
Yuan et al. [201] investigated the interaction between various factors such as pre-training loss,
amount of supervised data, and augmented data to improve the mathematical inference performance
of LLMs. They found that pre-training loss is a more reliable performance metric and empirically
found a log-linear relation between data amount and model performance, but it is worth mentioning
that this improvement diminishes with better pre-training models. At the same time, they proposed
Rejection sampling Fine-Tuning (RFT) to enhance performance by using supervised models to
generate and collect correct reasoning paths, thereby achieving better inference generalization
and more significant improvements. Meanwhile, the proficiency with which LLMs perform basic
arithmetic tasks is critical to establishing the baseline capabilities of these models [202]. This
assessment is integral to understanding the limitations and potential of LLMs in future work when
faced with mathematical challenges that require precise and logical calculations.
MetaMath [200] further extends the practicality of the LLMs, proposing a paradigm whereby the
LLMs generate their mathematical problems, thereby creating a self-sustaining learning environ-
ment that encourages continuous improvement of the model’s problem-solving acuity. Similarly,
WizardMath [110] explores new ways to enhance LLMs’ mathematical reasoning capabilities
by reinforcing evolutionary instructions, indicating an important step towards autonomous self-
improvement of artificial intelligence models. In contrast, MathAttack [223] introduces a critical
perspective by exploring the sensitivity of LLMs to specialized adversarial input designed to test
their mathematical problem-solving abilities. Such investigations are crucial to developing more
resilient and reliable LLMs that can withstand and adapt to a variety of problem-solving situations.
LLEMMA [9], an open language model dedicated to mathematics, outlines ongoing efforts to make
LLMs accessible and adaptable to a wide range of mathematical inquiries, from basic problem
solving to advanced theorem proving.
Collectively, these studies depict the current research hotspots and the promising future
directions of LLMs in the MWP field. They emphasize
that LLMs can have a transformative impact on the field of mathematics, providing innovative tools
for educators, students, and researchers. As these models continue to develop, they are expected to
revolutionize the way we approach and solve mathematical problems, ushering in a new era of
mathematical reasoning and assisted education through artificial intelligence.
Math Question Answering. Math Question Answering (MQA) refers to the computational task
where a system is required to understand and automatically solve math-related questions presented
in natural language. These questions can range from simple arithmetic problems to complex high
school or college-level mathematics, including algebra, calculus, and geometry. The challenge
for MQA systems is to accurately interpret the text, convert it into an appropriate mathematical


representation, perform the necessary computations, and generate a correct and often step-by-step
solution, mimicking the problem-solving process of a human.
Sachan et al. [153] introduce a framework for evaluating neural architectures through a suite
of math problems, highlighting the importance of algebraic generalization. This is complemented
by GEOS [158], an innovative system that solves SAT geometry problems by integrating text
understanding with diagram interpretation. The introduction of datasets and systems like Inter-
GPS [105], IconQA [108], and PGDP5K [56, 205, 206] further advances the field, offering new
benchmarks for abstract diagram understanding and visual language reasoning. Additionally,
Scharpf et al. [154] indicate that unsupervised methods for formula labeling contribute to the
automated understanding of mathematical documents. Collectively, these works demonstrate the
evolving landscape of MQA, where the synergy between natural language processing, computer
vision, and symbolic reasoning opens new avenues for educational technology and AI research.

2.2.2 Theorem Proving. Theorem proving (TP) refers to the process of demonstrating that a
statement is correct based on existing axioms and facts, which is a long-term challenge in AI [127].
In the context of LLMs, researchers have begun to exploit the vast knowledge encoded in these
models, together with limited manual guidance, to tackle this task. LLMs like GPT-3 [18] can
process natural text to understand the premises of a theorem and apply logical reasoning rules
to construct the required proof; the correctness of the process can then be checked by calling
relevant tools or through manual inspection. These approaches aim to leverage the computational
power and broad knowledge base of LLMs to automate the theorem-proving process, a task that
previously required professional mathematicians.
In the field of AI, research on theorem proving has progressed from data sets containing large num-
bers of human-written proofs (such as CoqGym [194]) to complex models that can autonomously
generate proof strategies (such as ASTactic [194]). This improvement is reflected in the application
of some language models that are based on Transformer [177] to TP tasks, especially in recent
research such as GPT-f [138]. Most notably, some of the proofs generated by these models have
been formally recognized by the mathematical community. Jiang et al. [72] further refined this
process by using informal proofs to guide TP models. NaturalProofs [186] extends these capabilities
by leveraging the language of natural mathematics to create a rich corpus of model training and
evaluation. At the same time, DeepMath [67] and INT [190] have promoted the development of
this field by demonstrating the effectiveness of neural sequence models for premise selection and
evaluating the LLMs’ generalization ability in theorem proving.
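For readers unfamiliar with the target of tactic prediction, the toy Lean snippet below (our own illustration, not drawn from CoqGym, GPT-f, or any cited benchmark) shows the kind of formal goal and short tactic proof that such models are trained to produce.

```lean
-- Toy example of a formal goal and a one-line tactic proof, the kind of target
-- that tactic-prediction models aim to generate (illustration only).
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```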
Ongoing research is delving into the integration of language models and interactive proof assis-
tants within the realm of mathematics. This involves generating proofs through tactic prediction
[53, 71, 87, 139], automating formalization processes [72, 191], and developing unified tools [188].
Due to the substantial computational demands of exploration, language models applied in this
domain have traditionally been limited in size. However, recent advancements have showcased
potential in employing larger models [40, 72]. LLEMMA [9] demonstrates few-shot proof auto-
formalization and tactic prediction, offering a vast dataset of formal mathematics and an openly
accessible model for further investigation in these directions. These endeavors establish crucial
theoretical and practical frameworks for automating and enhancing tactic prediction.
The above line of progressive research shows that utilizing the computational power and extensive
knowledge base of LLMs is the main development direction for future TP tasks. Stronger models
(such as GPT-4 [131]) exhibit better reasoning capabilities without any human intervention, but one
caveat must be kept in mind: the hallucination problem inherent to neural text generation models
[113], i.e., generated content that is nonsensical or unfaithful to the provided source content [69].


Pre-trained Language Models (§3)
  Autoregression LMs (§3.1): GPT-𝑓 [138], EPT [80], Generate & Rank [159], Thor [71], HTPS [86], Galactica [170], MATH-PLM [60], LISA [70], PACT [53], Minerva [90], LIME [192]
  Non-Autoregression LMs (§3.2): Aristo [26], GenBERT [47], NF-NSM [39], MWP-BERT [95], MathBERT [136], TAGOP [224], MT2Net [214], DeductReasoner [74], BERT-TD [94]
Large Language Models (§4)
  Instruction Learning (§4.1)
    Instruction Building: Auto-explanation [125], ProofNet [8], WizardLM [193], WizardMath [110]
    Instruction Tuning: MathGLM [195], Goat [101], Calculon [123], PaLM 2-L-Math [103], LLEMMA [9]
    In-context Learning: ScratchpadGPT [130], Codex-math [34], In-Context Sampling [100], ComplexPrompt [43], ART [210]
  Tool-based Methods (§4.2)
    Single-tool Methods: SymbLLM [58], PoT [20], Codex-math [34], MathPrompter [66], PAL [44], Code-scratchpad [176]
    Multi-tool Methods: Toolformer [155], Chameleon [106], ART [132], ToolkenGPT [55], CRITIC [49], TALM [133], Tool-Documentation [63]
  Fundamental CoT Methods (§4.3)
    Foundation of CoT: Answer Rationales [97], MathQA [4], CoT [185], Complexity-based CoT [43], LVSS [96], RFT [201], Thought Propagation [199], AFT [181], PoT [20], AOT [157], MAmmoTH [203]
    Construction of CoT: Zero-shot CoT [82], Auto-CoT [210], Complexity-based CoT [43], PromptPG-CoT [107], AutoMate CoT [165], BoostedPrompt [137]
  Advanced CoT Methods (§4.4)
    Verify-based Methods: Code-based self-verification [217], VerifyCoT [98], DIVERSE [93], Verify-and-Edit [211], Retrieval-CoT [57], SCREWS [163]
    Ensemble-based Methods: Self-Consistency [182], Diversity-of-Thought [124], Complexity-based CoT [43], DIVERSE [93], Self-check [115], MCR [198], Rank-verifier [27], GRACE [79]
    Planning-based Methods: ToT [196], TouT [122], ToT-system [104], GoT [14, 88, 197], RESPROMPT [73], RAP [54], LATS [218], LLM+P [99], LLM+DP [29], Self-refine [111, 169], ISR-LLM [222], Reflexion [162]
    Socratic Teaching Methods: Socratic Models [204], SOCRATIC QUESTIONING [140], Socratic prompt [19], SocratiQ [2], multi-turn Socratic advice [5]

Fig. 2. Taxonomy of language models for mathematics.

[Figure 3 contrasts the two model families on the example sentence “I have a good idea”: the attention patterns of autoregressive LMs (causal decoder and encoder-decoder) restrict each position to preceding tokens, and generation proceeds token by token, whereas non-autoregressive LMs attend bidirectionally and recover masked positions (e.g., “[mask] have a [mask] idea”) in parallel.]

Fig. 3. The distinctions between Autoregression LMs and Non-Autoregression LMs.

The occurrence of hallucinations at any step in the reasoning process will lead to errors in the final
reasoning results. Therefore, solving the hallucination problem is also particularly important in
future work related to TP.

3 PLMS-BASED METHODS
Based on the transformer architecture [177] with self-attention mechanisms, the “pre-training
and fine-tuning" paradigm has revolutionized the field of natural language processing. Within the
context of Pre-trained Language Models (PLMs), numerous approaches have been proposed to
tackle text generation problems [129]. Autoregressive LMs (ALMs, e.g., GPT-1 [144] and T5 [147])
and Non-Autoregressive LMs (NALMs, e.g., BERT [78] and RoBERTa [102]) are the two primary
strategies for mathematics (Figure 2).


3.1 Autoregression LMs


The distinctions between Autoregression LMs (ALMs) and Non-Autoregressive LMs (NALMs) in
sequence generation tasks are depicted in Figure 3. Furthermore, ALMs can be categorized into
two architectures: causal decoder (e.g., GPT-1 [144] and GPT-2 [145]) and encoder-decoder (e.g., T5
[147]). The generation mechanisms of these two structures for sequences are depicted on the left and
in the middle of Figure 3, respectively. These two architectures have not only significantly propelled
the advancement of PLMs but have also become the predominant foundational architectures for
LLMs.
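The two generation regimes can be summarized with a minimal sketch (illustrative Python only; `next_token` and `fill_masks` are hypothetical stand-ins for a causal and a masked language model, respectively).

```python
# Illustrative pseudocode: autoregressive (left-to-right) generation versus
# non-autoregressive (parallel masked) prediction.
def autoregressive_generate(prompt_tokens, next_token, max_new_tokens=5):
    """Causal decoding: every new token is conditioned on all previous tokens."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))        # greedy pick from p(x_t | x_<t)
    return tokens

def non_autoregressive_fill(masked_tokens, fill_masks):
    """Masked prediction: all [mask] positions are recovered in one parallel step."""
    return fill_masks(masked_tokens)

# Tiny demo with a rule-based stub in place of a real model:
demo = autoregressive_generate(["I", "have"],
                               lambda t: "a" if t[-1] == "have" else "idea", 2)
print(demo)   # ['I', 'have', 'a', 'idea']
```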
To enhance the performance of PLMs in mathematical problems, many improved methods based
on ALMs have been proposed. GPT-𝑓 [138] presents an automated prover and proof assistant to
explore the performance of LMs to automated theorem proving. To resolve challenges related to
expression fragmentation and operand-context separation, Kim et al. [80] introduce the Expression-
Pointer Transformer (EPT), a pure neural model that employs both an ‘Expression’ token and
operand-context pointers during the generation of solution equations. Generate & Rank [159]
introduces a novel ranking task for MWPs, a multi-task framework built upon a generative PLM,
effectively addressing the challenge of minor mistakes in mathematical expressions. Thor [71]
introduces a framework that integrates language models and automated TP. The latter is employed
to selectively choose relevant premises from an extensive library to assist the language model
in TP. HTPS [86] presents HyperTree Proof Search, a novel algorithm leveraging online training
from past proof searches, enhancing model performance and surpassing the previous SOTA GPT-𝑓 .
Galactica [170] proposes a working memory token approach that can achieve strong performance
over existing methods on mathematical MMLU [59] and MATH [60] benchmarks.
Furthermore, MATH-PLM [60] introduces a new dataset named MATH, consisting of challenging
competition mathematics problems. It is found that relying solely on increasing budgets and model
parameter counts would be impractical for achieving robust mathematical reasoning. LISA [70]
gathers an extensive collection of lemmas and theorems extracted from the Isabelle standard
library and the Archive of Formal Proofs (AFP). Using this vast corpus, it builds LMs that prove
theorems effectively within the AFP. PACT [53] extracts abundant self-supervised data from kernel-
level proof terms, enabling joint training alongside the standard tactic prediction objective. This
effectively enhances the model’s ability for TP. Minerva [90] trains a language model, initially
pre-trained on general natural language data, through additional training on technical content to
address Quantitative Reasoning Problems. It achieves a SOTA performance on technical benchmarks
including MATH [60]. LIME [192] introduces a novel pre-training methodology that specifically
learns inductive bias for mathematical reasoning. Observations indicate that models trained using
LIME exhibit superior performance compared to standard transformers.

3.2 Non-Autoregression LMs


Unlike ALMs, NALMs enable the model to generate all parts of a sequence simultaneously, without
depending on previously generated content, as depicted on the right side of Figure 3. Notably,
models such as BERT [78], RoBERTa [102], and related architectures utilize masked word representa-
tions, leveraging pre-trained context-aware embeddings as comprehensive semantic features. This
significantly elevates the performance benchmarks of NLP tasks.
In the field of mathematics, researchers have proposed various methods for designing models to
address mathematical reasoning and computational problems. For example, Aristo [26] fine-tunes
BERT using scientific curriculum data, yielding promising results on science exams. GenBERT
[47] and NF-NSM [39] enhance the numerical reasoning capabilities of models by incorporating
numerical data into the training process of PLMs. MWP-BERT [95] further enhances the model’s


capacity to represent and calculate numerical values by incorporating numeric attributes into
symbol placeholders. MathBERT [136] employs additional joint training of text and formulas to
effectively capture the semantic-level structural information of formulas. TAGOP [224], MT2Net
[214], and DeductReasoner [74] utilize BERT or RoBERTa to extract the fundamental arithmetic
relationships between quantities, enabling mathematical reasoning and operations. BERT-TD [94]
utilizes semantic encoding and contrastive learning to cluster problems with similar prototype
equations, thereby enhancing the understanding of MWP patterns.

4 LLMS-BASED METHODS
Large-scale language models (LLMs) are designed for processing and generating text akin to human
communication [131, 172]. Mathematics is also a form of language that communicates complex
concepts and relationships through a structured system of symbols and notation, akin to the rules
of a spoken language. Thus, a language model that grasps these mathematical rules can “speak"
the language of mathematics, proving to be a valuable asset for mathematicians. In numerous
ways, a sophisticated language model like GPT-4 [131] becomes an invaluable tool in the field of
mathematics [9, 110]. We classify the existing studies into four parts: instruction learning, tool-based
methods, fundamental CoT methods, and advanced CoT methods (Figure 2).

4.1 Instruction Learning


Numerous approaches have been proposed to enhance the mathematical performance of models
by instruction learning. These approaches are categorically delineated as instruction building,
instruction tuning, and in-context learning, based on their distinctive characteristics.
Instruction Building. A semi-supervised approach, as presented in Auto-explanation [125], lever-
ages LLMs to create datasets for automating the scoring of mathematical self-explanations in
the context of mathematics education. A challenging benchmark ProofNet [8] is introduced to
build a system of automated theorem proving. Meanwhile, prompt retrieval and distilled back
translation methods are introduced for statement autoformalization. WizardLM [193] presents
a groundbreaking method called Evol-Instruct, designed to autonomously generate high-quality
instructions by LLMs themselves. The conceptual flow is depicted in Figure 4. The instructional
process commences with a straightforward initial instruction and evolves it into diverse forms through
operations such as deepening, increasing reasoning, adding constraints, concretizing, complicating the
input (e.g., with a formula), and in-breadth evolving. Furthermore, WizardMath [110] proposes a
reinforced Evol-Instruct method (RLEIF) to build a more complex instruction dataset. This method
effectively enhances the mathematical
reasoning capabilities of Llama-2 by supervised fine-tuning and PPO training [156].
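A minimal sketch of one in-depth evolution round is shown below; `EVOLVE_TEMPLATE` and `call_llm` are our own hypothetical names, and the prompt wording is illustrative rather than WizardLM's actual template.

```python
# Illustrative sketch of one Evol-Instruct-style "in-depth evolving" round.
EVOLVE_TEMPLATE = (
    "Rewrite the following instruction into a more complex version by applying "
    "the operation: {operation}. Keep the rewritten instruction answerable.\n"
    "Instruction: {instruction}\nRewritten instruction:"
)

def evolve_instruction(instruction: str, operation: str, call_llm) -> str:
    """One evolution round driven by an LLM (call_llm is a stub for a model API)."""
    return call_llm(EVOLVE_TEMPLATE.format(operation=operation, instruction=instruction))

# For example, evolving "1+1=?" with the operation "increase reasoning" might yield
# "What is the value of x, if x^3 + 2x + 3 = 7?" (cf. Figure 4).
```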
Instruction Tuning. Instruction tuning stands as a potent method to elevate the prowess of large
models, aiding in steering the model towards generating outputs that align with human intent [207].
Compared to the pre-training phase of LLMs, instruction tuning demands notably less instruction
data and computational resources, making it a prevalent strategy for enhancing a model’s domain-
specific capabilities. For instance, MathGLM [195] showcases that even a language model with 2
billion parameters can adeptly execute multi-digit arithmetic operations with limited training data.
Goat [101] discerns patterns in the tokenization of different numbers and proposes a consistent
approach for their tokenization. Utilizing the LLaMA model, Goat ensures consistent tokenization
for various numbers, illustrated in Figure 5. Similarly, Calculon [123] refines the model’s arithmetic
skills by employing a digit decomposition technique to construct a fine-tuning dataset, as depicted
in Figure 6.
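The following sketch shows how a Calculon-style training observation could be constructed (an assumption modeled on Figure 6, not the authors' exact format; the function names are ours).

```python
# Illustrative sketch of a Calculon-style addition observation built from a
# place-value decomposition (cf. Figure 6); limited to four-digit numbers.
PLACES = ["units", "tens", "hundreds", "thousands"]

def place_value(n: int) -> str:
    assert 0 <= n < 10_000, "sketch limited to four-digit numbers"
    return ", ".join(f"{digit} {name}" for digit, name in zip(str(n)[::-1], PLACES))

def addition_observation(a: int, b: int) -> str:
    return (f"Compute {a} plus {b}. "
            f"Translate from number to decomposition: {a} = {place_value(a)}. "
            f"Translate from number to decomposition: {b} = {place_value(b)}. "
            f"Translate from decomposition to number: {a + b}")

print(addition_observation(1201, 1302))
```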
In PaLM 2-L-Math [103], three fine-tuning strategies were explored, revealing key insights: 1) the
quality and style of step-by-step solutions significantly influence model performance; 2) combining
solution re-ranking and majority voting yields a more substantial performance enhancement compared
to using them individually; additionally, multi-task fine-tuning, separating solution generation and
evaluation tasks, surpasses the baseline achieved by fine-tuning alone.
[Fig. 4. Example of Evol-Instruct taken from WizardLM [193]: starting from the initial instruction “1+1=?”, In-depth Evolving operations (add constraints, deepening, concretizing, increase reasoning, complicate input) and In-breadth Evolving produce progressively more complex or more diverse instructions, e.g., “What is the value of x, if x^3 + 2x + 3 = 7?”.]

[Fig. 5. Consistent tokenization for similar numbers in Goat [101]: LLaMA splits numbers such as 74815, 7481, 748, 74, and 7 into consistent digit-level tokens, whereas models such as GPT-4, Bloom, OPT, Pythia, and GPT-J tokenize the same numbers into inconsistent multi-digit chunks; the accompanying note observes that ChatGLM also splits each digit into an individual token.]

[Fig. 6. Number decomposition taken from Calculon [123]: the observation “Compute 1201 plus 1302” is rewritten as a pipeline that translates each number into a units/tens/hundreds/thousands decomposition, sums the decompositions, and translates the result back into the number 2503.]

Moving beyond the enhancement of mathematical capabilities through fine-tuning, LLEMMA [9]
introduces the Proof-Pile-2 dataset, an amalgamation of mathematical texts and code. Continual
pre-training with Code Llama empowers the model to leverage Python interpreters and formal
theorem provers, demonstrating impressive performance on the MATH benchmark.

[Fig. 7. Use of a “scratchpad” to improve arithmetic calculation in ScratchpadGPT [130]: for the input “2 9 + 5 7”, the target writes each carry step inside <scratch> ... </scratch> (e.g., “added 9 + 7 = 6 carry 1”) before emitting the final answer 8 6.]

In-context Learning. In-context Learning (ICL) [18, 77] empowers LLMs to execute target tasks by
presenting specific task examples as conditions during inference, without updating model parameters.
Inspired by this, ScratchpadGPT [130] enhances its proficiency in multi-step computations by
mandating the model to output intermediate computation steps into a designated “scratchpad”.
Codex-math [34] refines Codex by generating programs through code and integrates few-shot
learning to automatically create programs for solving mathematical problems. This method has
substantially improved the previous state-of-the-art accuracy in automatic solutions by 73.1%
across various benchmarks like MIT's mathematics courses, Columbia University's Computational
Linear Algebra, and the MATH benchmark [60]. Notably, few-shot learning contributed to a 10%
accuracy boost. Recent studies also highlight the variability in in-context learning performance
across different chosen examples [100]. Fu et al. [43] and Zhang et al. [210] selected intricate and
diverse examples to enhance reasoning performance.

4.2 Tool-based Methods


LLMs are designed to use tools, such as codes and calculators, to enhance their problem-solving
abilities [133, 155].
Single-tool Methods. To improve the performance of mathematical reasoning, math-specific tools
such as symbolic solvers and programs are utilized for LLMs [20, 58]. For example, SymbLLM [58]
solved math word problems by combining language models with symbolic solvers. They focus
on automatically generating high-quality, step-by-step solutions to mathematical word problems,
especially those encountered in mathematical applications.
Furthermore, previous studies have explored the process of converting mathematical problems
into code [20, 34]. PoT [20] proposes a fusion of CoT with programs, while Drori et al. [34]
showcased the effectiveness of a neural network pre-trained on text and fine-tuned on code,
particularly leveraging OpenAI’s Codex transformer. This approach successfully solves, explains
and generates math problems at the university level. MathPrompter [66] employs a zero-shot chain-
of-thought prompting technique to generate multiple algebraic expressions or Python functions
in varied ways to solve the same math problem, enhancing reliance on output results. PAL [44]
introduces an innovative approach to bolster the performance of pre-trained language models
(PLMs) in mathematical problem-solving. This involves utilizing LLMs to comprehend natural
language problems and generate programs as intermediate reasoning steps.
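The program-as-reasoning idea behind PoT and PAL can be sketched as follows; `call_llm` is a hypothetical stub, and the emitted program is only one plausible completion, not output from any cited system.

```python
# Illustrative sketch of program-aided reasoning (PoT/PAL style): the model is
# asked to emit Python, and the interpreter, not the LLM, does the arithmetic.
PROMPT = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "# Write Python that stores the final result in a variable named answer."
)

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call; one plausible generated program:
    return ("initial_balls = 5\n"
            "cans = 2\n"
            "balls_per_can = 3\n"
            "answer = initial_balls + cans * balls_per_can\n")

namespace = {}
exec(call_llm(PROMPT), namespace)   # run the generated program
print(namespace["answer"])          # 11
```

Offloading the final computation to the interpreter avoids the arithmetic slips that free-form text generation is prone to.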
In addition to solving math problems, programs can also play a role in tutoring math. For instance,
Upadhyay explores the role of large language models in tutoring systems [176], introducing a “code
scratchpad" alongside the traditional “language scratchpad" to enhance the model’s performance in
tutoring steps, particularly using a grade school mathematics dataset.


Multi-tool Methods. To facilitate the seamless integration of LLMs with diverse tools, several
multi-tool approaches have been proposed to enable LLMs to learn how to use multiple tools
simultaneously. Toolformer [155] adopts a self-supervised training approach to enable the utilization
of different tools such as search engines, calculators and translation systems via simple API calls.
This is accomplished by fine-tuning on a vast collection of sampled API calls and filtering based
on their ability to reduce perplexity on subsequent tokens. Chameleon [106], on the other hand,
enhances LLMs with plug-and-play modules designed for compositional reasoning. It creates
programs by combining various tools, including LLMs, off-the-shelf vision models, web search
engines, Python functions, and heuristic-based modules, to tackle complex reasoning tasks. The
Automatic Reasoning and Tool-use (ART) framework [132] leverages frozen LLMs to automatically
generate intermediate reasoning steps as a program. This seamlessly incorporates external tools
to support computations that surpass the core capabilities of LLMs. Meanwhile, ToolkenGPT [55]
adopts a strategy of learning “toolken" embeddings to represent each tool as a token. This empowers
LLMs to effortlessly utilize tools similar to generating word tokens. It caters to a broader range of
tools and utilizes extensive demonstration data to learn toolkit embeddings.
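A minimal sketch of the inline tool-call pattern popularized by Toolformer (our own simplified illustration, not the paper's implementation) is shown below: the wrapper scans the generation for bracketed API calls and substitutes each with the tool's output.

```python
# Illustrative sketch of inline tool calls: bracketed API calls in the generated
# text are replaced by tool outputs before the text is returned.
import re

def calculator(expression: str) -> str:
    if not set(expression) <= set("0123456789+-*/(). "):
        raise ValueError("unsupported expression")
    return str(eval(expression))               # toy arithmetic-only evaluator

TOOLS = {"Calculator": calculator}

def resolve_tool_calls(generation: str) -> str:
    pattern = re.compile(r"\[(\w+)\((.*?)\)\]")
    return pattern.sub(lambda m: TOOLS[m.group(1)](m.group(2)), generation)

draft = "The total is 357 * 486 = [Calculator(357*486)]."
print(resolve_tool_calls(draft))               # The total is 357 * 486 = 173502.
```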
Furthermore, CRITIC [49] significantly enhances the outputs of LLMs by enabling them to verify
and self-correct through interactions with external tools. Inspired by human cognition and critical
thinking, CRITIC continuously refines text generated by LLMs through iterative interactions with
tools like search engines and code interpreters. Tool Augmented Language Models (TALM) [133]
seamlessly integrate language models with non-differentiable tools. They employ an iterative
“self-play" method to bootstrap performance based on a limited number of tool demonstrations. In
contrast, Tool-Documentation [63] opted for tools based on documentation rather than relying on
demonstrations. Current practices involve teaching LLMs through a few-shot demonstration of tool
usage, a process prone to bias and overfitting. The alternative proposed by Tool-Documentation,
utilizing tool documentation that provides descriptions of individual tool usage, is argued to
be a more effective approach. ToRA [50] employs a set of Tool-integrated Reasoning Agents,
which integrate computational libraries and symbolic solvers, among other mathematical tools, to
effectively solve complex mathematical problems.

4.3 Fundamental CoT Methods


A multitude of fundamental Chain-of-Thought (CoT) methods integrated with LLMs have been
proposed to enhance the mathematical reasoning abilities of LLMs.

Foundation of CoT. In the initial stages of research, a limited body of work has leveraged the
principles of chain-of-thought (CoT) to enhance the mathematical capabilities of language models.
Notably, Ling et al. [97] propose a methodology involving the generation of a series of concise
steps, termed "Answer Rationales," to guide the resolution of algebraic word problems. MathQA [4]
suggests decomposing Math Word Problems into multiple steps corresponding to programmatic
operations for resolution.
Utilizing the in-context learning ability of LLMs, CoT [185] explicitly introduces the concept of the
chain of thought for the first time and substantiates its efficacy in enhancing reasoning abilities. A
CoT example is shown in Figure 8, where the highlighted intermediate reasoning guides the model
to the correct final answer. This finding is further corroborated by Complexity-based CoT [43].
Following a similar line of thought, LVSS [96] presents a Process Reward Model (PRM) and compares
it with the Outcome Reward Model (ORM); empirical findings affirm that process supervision
outperforms outcome supervision, leading to notable performance improvements on the MATH [60]
benchmark. Consequently, several methods [20, 157, 181, 199, 201, 203] leverage CoT methodologies
to enhance the mathematical performance of models.

Fig. 8. Example of CoT prompting in Wei et al. [185]: with standard prompting the model outputs an incorrect
final answer, whereas chain-of-thought prompting elicits intermediate reasoning steps (highlighted) that lead
to the correct answer.

RFT [201] proposes the application of rejection sampling finetuning (RFT) to gather more optimal
reasoning paths, thereby enhancing mathematical reasoning performance. Thought Propagation
[199] considers gaining insights from analogous problems to assist in addressing the current issue.
Wang et al. [181] proposed an Alignment Fine-Tuning (AFT) paradigm that aligns the model to prioritize
the generation of CoTs with superior performance. Additionally, PoT [20] introduces a Program of
Thoughts, combining CoT and programming, and AOT [157] enhances reasoning abilities through
algorithmic-style demonstrations of CoT. Furthermore, MAmmoTH [203] integrates CoT and PoT
[20] rationales to instruct large language models in utilizing code tools for solving mathematical
problems.
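To make the Program-of-Thoughts idea concrete, the following minimal sketch shows a prompt that asks the model for executable Python and a harness that runs the returned program. The prompt wording, the hard-coded completion, and the `ans` variable convention are illustrative assumptions, not the exact protocol of PoT or MAmmoTH.

```python
# Minimal Program-of-Thoughts sketch: the LM is asked to write Python whose
# variable `ans` holds the numeric answer; we then execute that program.
POT_PROMPT = """Question: Roger has 5 tennis balls. He buys 2 more cans of
tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
# Write Python code that computes the answer and stores it in `ans`.
"""

# A plausible model completion (hard-coded here for illustration).
model_completion = """
initial_balls = 5
cans = 2
balls_per_can = 3
ans = initial_balls + cans * balls_per_can
"""

def run_program(program: str) -> float:
    """Execute the generated program in an isolated namespace and read `ans`."""
    namespace: dict = {}
    exec(program, namespace)  # caution: only run model-generated code in a sandbox
    return namespace["ans"]

print(run_program(model_completion))  # -> 11
```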
Construction of CoT. To streamline the process of CoT creation, various approaches [43, 82, 107,
137, 157, 165, 181, 199, 210] have been introduced for its automatic generation.
Zero-shot CoT [82] introduces the simple prompt “Let’s think step by step,” which effectively
enhances the model’s reasoning capabilities. Auto-CoT [210] selects representative questions
through clustering, answers them using “Let’s think step by step,” and concatenates the resulting
question–answer pairs one by one to automatically build CoT demonstrations. In the case of Complexity-based CoT [43],
the demonstration is chosen by simply selecting the reasoning chain with the highest number
of inference steps sampled from the model. Using reinforcement learning, PromptPG-CoT [107]
trains a model with policy gradient to assist GPT-3 in selecting suitable CoT demonstrations.
Similarly, AutoMate CoT [165] employs a variance-reduced policy gradient strategy to estimate
the significance of each example in a black box language model, thereby selecting more effective
examples. BoostedPrompt [137] presents an iterative prompt ensemble method that enhances
prompts when the current demonstration faces challenges in handling specific problems.
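The sketch below illustrates, under simplifying assumptions, an Auto-CoT-style construction pipeline: questions are clustered, one representative per cluster is answered with a zero-shot “Let’s think step by step” rationale, and the pairs are concatenated into demonstrations. The TF-IDF features and the `generate_rationale` stub are stand-ins for the sentence embeddings and LLM calls used in practice.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def generate_rationale(question: str) -> str:
    # Placeholder for an LLM call such as: llm(question + "\nA: Let's think step by step.")
    return "Let's think step by step. ..."

def build_demonstrations(questions: list[str], n_clusters: int = 2) -> str:
    embeddings = TfidfVectorizer().fit_transform(questions)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    demos = []
    for cluster in range(n_clusters):
        # Take the first question assigned to this cluster as its representative.
        representative = next(q for q, l in zip(questions, labels) if l == cluster)
        demos.append(f"Q: {representative}\nA: {generate_rationale(representative)}")
    return "\n\n".join(demos)

questions = [
    "Roger has 5 tennis balls and buys 2 cans of 3 balls. How many balls now?",
    "The cafeteria had 23 apples, used 20 and bought 6. How many remain?",
    "A train travels 60 km/h for 2 hours. How far does it go?",
    "Liz had 9 kittens and gave some to Joan, keeping 5. How many did Joan get?",
]
print(build_demonstrations(questions))
```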

4.4 Advanced CoT Methods


To further enhance the capability of CoT in LLMs, advanced CoT methods have been proposed,
including Verify-based Methods, Ensemble-based Methods, Planning-based Methods, and Socratic
Teaching Methods.
Verify-based Methods. LLMs often produce incorrect reasoning steps, which can lead to a series
of cascading errors in their solutions. Implementing verification and refining the reasoning process
based on feedback can significantly reduce these errors. This approach is akin to human reflection
and involves critically evaluating each step of the reasoning process. Various verify-based methods
[57, 93, 98, 163, 211, 217] have been proposed to address these issues.
Zhou et al. [217] propose a code-based self-verification approach, utilizing a zero-shot prompt to
encourage the GPT-4 Code Interpreter to employ code for self-verifying its responses, enhancing
its mathematical reasoning capabilities. To mitigate the challenge of validating the entire deductive


reasoning process, VerifyCoT [98] introduces a deductive reasoning form, ensuring that each
reasoning step strictly relies on the preceding steps. Furthermore, DIVERSE [93] independently
verifies each reasoning step and employs a voting mechanism to eliminate incorrect answers. Both Verify-
and-Edit [211] and Retrieval-CoT [57] utilize external retrieval tools to support the model in
validating reasoning rationales. The key difference is that the former edits rationales using retrieved
information, while the latter helps the model self-rethink to improve performance on complex
reasoning tasks. Both methods effectively reduce factual mistakes during the reasoning process.
Shridhar et al. [163] summarize the use of the revisions approach in the SCREWS framework, which
includes a selection module to choose between the original and modified reasoning steps.
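As a simplified illustration of code-based self-verification, the sketch below substitutes a candidate answer back into the problem's constraint and checks consistency with a symbolic library. The equation string and the helper function are illustrative assumptions rather than the prompts used by the methods above.

```python
import sympy as sp

def verify_candidate(equation: str, symbol: str, candidate: float) -> bool:
    """Check whether `candidate` satisfies an equation written as 'lhs = rhs'."""
    x = sp.Symbol(symbol)
    lhs, rhs = (sp.sympify(side) for side in equation.split("="))
    return sp.simplify(lhs.subs(x, candidate) - rhs.subs(x, candidate)) == 0

# "Liz had 9 kittens and has 5 left; how many did Joan get?"  ->  9 - x = 5
print(verify_candidate("9 - x = 5", "x", 4))  # True: the candidate is consistent
print(verify_candidate("9 - x = 5", "x", 3))  # False: verification rejects it
```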
Ensemble-based Methods. Due to the inherent stochastic nature of language models, which output
probability distributions over predicted words, they may randomly generate incorrect reasoning
steps and outcomes. To tackle this challenge, some methods [27, 43, 79, 93, 115, 124, 182, 198]
leverage the concept of ensemble learning, employing techniques such as voting and ranking to
eliminate uncertainties in the reasoning process.
Self-Consistency [182] employs multiple reasoning paths and selects the final response through
a simple majority vote. Similarly, Diversity-of-Thought [124] generates diverse reasoning paths by
altering prompts and aggregates the responses via a majority vote. Complexity-based CoT [43]
favors answers derived from more intricate reasoning paths. DIVERSE [93] uses a weighted voting
mechanism to filter out incorrect answers. Nevertheless, these voting-based methods often over-
look the potentially useful information within unsuccessful CoT reasoning and lack an efficient
integration of multiple reasoning chains to improve performance. Self-check [115] addresses this
by incorporating reasoning steps into the voting mechanism, ensuring both consistent answers
and reliable reasoning. MCR [198] goes a step further by consolidating information across multiple
reasoning chains and extracting the most relevant facts, enabling a more comprehensive analysis
that leads to successful reasoning.
Additionally, there are a few methods based on ranking. Rank-verifier [27] proposes using a
ranking system to judge the correctness of model completions. Furthermore, GRACE [79] leverages
a discriminator trained via contrastive learning to rank each reasoning step.
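A minimal sketch of the majority-vote idea behind Self-Consistency is shown below. The sampled reasoning paths are hard-coded stand-ins for repeated stochastic LLM generations, and the answer-extraction regex is an illustrative assumption.

```python
import re
from collections import Counter

def sample_reasoning_paths(question: str, n: int = 5) -> list[str]:
    # Placeholder for n stochastic LLM generations with temperature > 0.
    return [
        "5 + 2 * 3 = 11. The answer is 11.",
        "Roger has 5 + 6 = 11 balls. The answer is 11.",
        "2 cans of 3 is 6; 5 + 6 = 11. The answer is 11.",
        "5 + 2 + 3 = 10. The answer is 10.",   # an erroneous path
        "The answer is 11.",
    ]

def extract_answer(path: str):
    match = re.search(r"The answer is (-?\d+(?:\.\d+)?)", path)
    return match.group(1) if match else None

def self_consistency(question: str) -> str:
    answers = [a for a in map(extract_answer, sample_reasoning_paths(question)) if a]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("Roger has 5 tennis balls ..."))  # -> "11"
```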
Planning-based Methods. The original structure of the Chain of Thought (CoT) is sequential,
facing limitations when handling highly complex problems and lacking the ability for retrospective
correction. To make the CoT chain structure more systematic and intricate, certain planning-based
methods [14, 29, 54, 73, 88, 99, 104, 111, 122, 162, 169, 196, 197, 218, 222] have been proposed. These
methods achieve this by altering the organizational structure of reasoning steps or incorporating
mechanisms for refinement and reflection.
The Tree-of-Thought (ToT) [196] proposes organizing reasoning paths into a tree structure,
where each node represents a reasoning step, and edges denote dependencies between nodes. Dur-
ing the inference process, ToT can use self-assessment to determine future actions. This structure
facilitates both forward and backward exploration, employing either depth-first or breadth-first
search techniques. TouT [122] effectively employs Monte Carlo Dropout to assess uncertainty
scores associated with diverse local responses of language models at intermediate steps, enhancing
response precision through integration with global search algorithms. Long et al. [104] introduced
a multi-module ToT system, where the state of the ToT thought chain process can be stored and
retraced through its storage module. Furthermore, Yao et al. [197], Besta et al. [14], and Lei et al.
[88] introduced a novel thought structure called the Graph of Thought (GoT). Similar to ToT, each
graph node represents a distinct thought step or state, with edges indicating dependencies between
nodes. These methods can effectively integrate diverse reasoning thoughts, fostering collabora-
tive outcomes and improving reasoning by incorporating feedback loops. This contributes to an


overall enhancement in the inference capacity of the thought network. Additionally, RESPROMPT
[73] introduces Residual Connection Prompting, which converts the input prompt into complex
reasoning graphs.
Inspired by Monte Carlo tree search, RAP [54] utilizes the Monte Carlo tree search algorithm to
traverse tree structures during the reasoning process. This approach effectively balances global
and local searches to obtain a high-reward reasoning path. On the other hand, LATS [218] employs
LLMs as agents, value functions, and optimizers, repurposing their inherent strengths to enhance
decision-making. Liu et al. [99] and Dagan et al. [29] introduced LLM+P and LLM+DP, incorporating
the generation of Planning Domain Definition Language (PDDL) to break down complex problems
and execute plans using specialized models.
Furthermore, the self-refine approach, as proposed by Madaan et al. [111] and Sun et al. [169],
focuses on error correction and summarizing past experiences. Specifically, Self-Refine performs
pre-execution validation, providing feedback to the model for refinement in case of errors. Zhou
et al. [222] then took a step further by merging the iterative self-refine with PDDL in ISR-LLM.
Additionally, Shinn et al. [162] introduced Reflexion to rectify errors stemming from previous
actions.
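The following sketch captures the breadth-first flavor of Tree-of-Thoughts under simplifying assumptions: candidate thoughts are proposed for each partial solution, scored by a value function, and only the most promising states are kept. The `propose_thoughts` and `evaluate_state` callables stand in for LLM calls, and the beam width and depth are illustrative parameters rather than values prescribed by the original method.

```python
from typing import Callable

def tree_of_thoughts_bfs(
    question: str,
    propose_thoughts: Callable[[str], list],
    evaluate_state: Callable[[str], float],
    beam_width: int = 3,
    max_depth: int = 4,
) -> str:
    frontier = [question]                      # each state = question + partial reasoning
    for _ in range(max_depth):
        candidates = [
            state + "\n" + thought
            for state in frontier
            for thought in propose_thoughts(state)
        ]
        if not candidates:
            break
        # Keep only the `beam_width` most promising partial solutions.
        frontier = sorted(candidates, key=evaluate_state, reverse=True)[:beam_width]
    return max(frontier, key=evaluate_state)

# Toy stand-ins: propose two fixed next steps and prefer longer reasoning.
best = tree_of_thoughts_bfs(
    "Q: 23 apples, 20 used, 6 bought. How many now?",
    propose_thoughts=lambda s: ["23 - 20 = 3", "3 + 6 = 9. The answer is 9."],
    evaluate_state=lambda s: len(s),
)
print(best)
```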

Socratic Teaching Methods. The Socratic teaching method [126], derived from the philosophies of
the ancient Greek thinker Socrates, provides a unique perspective when considering the integration of
LMs in mathematics education [30]. This method, emphasizing questioning and dialogue, aims to
guide learners toward profound thinking [17, 31]. In the realm of mathematics education, employing
this method can aid students in grasping concepts with clarity, fostering more rigorous logical
application, and promoting reflection and self-discovery.
For LLMs, research shows that the Socratic teaching method can also improve their abilities to a
certain extent. For example, the framework of Socratic Models (SMs) [204] significantly broadens
the scope of interactive capabilities. SMs facilitate the amalgamation of multiple LLMs to tackle
novel multi-modal tasks using structured Socratic dialogues. In Chang et al. [19], techniques
including definition, dialectics, cross-examination, induction, and counterfactual reasoning are
utilized. These techniques contribute to enhancing the LLMs’ comprehension of user intentions,
thereby improving the model’s effectiveness in addressing user queries. The inductive, deductive,
and analogical reasoning techniques in the Socratic teaching method establish a comprehensive
analytical framework for solving mathematical problems [1, 19], ensuring a strong and concise
ability to address the complexities of mathematical problem-solving. Moreover, contributions from
Ang et al. [5] and Al-Hossami et al. [2] provided invaluable datasets and evaluation methodologies for Socratic
questioning. These contributions underscore the practical applications and potential impact of
Socratic methods in counseling and coaching, showcasing their diverse applicability.
Overall, the integration of Socratic teaching techniques in LLMs offers novel perspectives and
tools for mathematics education. Unlike CoT [185], Socratic questioning explicitly guides the
thinking process, stimulating effective recursive thinking and displaying greater resilience to errors
in the thinking process [140]. Ranging from basic conversational guidance to intricate problem-
solving, these methods demonstrate the effective fusion of modern technology with classical ideas
to bolster the learning and comprehension abilities of LLMs.

5 DATASETS
To train and evaluate the arithmetic and mathematical abilities of language models, various math
word problem datasets [27, 60, 85, 153] have been designed. In this paper, we organize the frequently
used datasets for mathematical language models by dividing them into training, benchmark, and
data-augmentation datasets (Table 1).


Dataset #Train #Val #Test #Total Language Task Type Solution


VERBPHYSICS [42] 733 1,096 1,828 3,657 EN Calculation Training Formula
Clinical [167, 168] 11,170 1,625 3,220 16,015 EN Calculation Training Formula
Scientific [168] 14,694 2,037 4,231 20,962 EN Calculation Training Formula
DoQ [37] 587 5,418 6,007 12,012 EN Calculation Training Formula
DROP [35] 77,409 9,536 9,622 96,567 EN Calculation Training Text
AddSub [62] - - - 395 EN MWP Training Formula
SingleOp [151] 265 107 159 531 EN MWP Training Formula
SingleEq [83] - - - 508 EN MWP Training Formula
MultiArith [148] 420 - 180 600 EN MWP Training Formula
Alg514 [85] - - - 514 EN MWP Training Formula
Math23k [183] 22,162 - 1000 23,162 CH MWP Training Formula
AQuA [97] 97,467 254 254 97,975 EN MWP Training Text
MathQA [4] 29,837 4,475 28,985 37,297 EN MWP Training Formula
GSM8K [27] 7,473 - 1,319 8,792 EN MWP Training Text
SVAMP [135] 700 - 300 1,000 EN MWP Training Formula
DRAW [174] - - - 1,000 EN MWP Training Formula
Dolphin1878 [161] - 374 1,504 1,878 EN MWP Training Formula
AllArith [149] - - - 831 EN MWP Training Formula
DRAW-1K [175] - - - 1,000 EN MWP Training Formula
HMWP [142] - - - 5,470 CH MWP Training Formula
ArMATH [3] - - - 6,000 Arabic MWP Training Formula
TabMWP [107] - - - 38,431 EN MWP Training Text
Ape210K [212] - - - 210,488 CH MWP Training Formula
TAL-SCQ5K-CH 3,000 - 2,000 5,000 CH MWP Training Text
TAL-SCQ5K-EN 3,000 - 2,000 5,000 EN MWP Training Text
FinQA [22] 6,251 883 1,147 8,281 EN MWP Training Formula
REALFP [76] 185 185 558 928 EN MWP Training Formula
SYNTHFP [76] 10,000 1,000 1,000 12,000 EN MWP Training Formula
TAT-QA [224] - - - 16,552 EN MWP Training Text
MultiHiertt [214] 7,830 1,044 1,566 10,440 EN MWP Training Formula
MATHPILE [184] - - - 903,180 EN MWP Training Text
OpenWebMath [134] - - - - EN MWP Training Formula
MML [51] - - - 57,882 EN TP Training Formula
HolStep [75] 2,013,046 - 196,030 2,209,076 EN TP Training Formula
Feit-Thompson [64] - - - 83,478 EN TP Training Formula
CoqGym [194] - - - 71,000 EN TP Training Formula
HOList [10] - - - 29,462 EN TP Training Formula
IsarStep [92] 820,000 5,000 5,000 830,000 EN TP Training Formula
LISA [70] - - - 183,000 EN TP Training Formula
NaturalProofs [186] - - - 32,000 EN TP Training Text
LeanStep [53] - - - 21,606,000 EN TP Training Formula
NumGLUE [120] - - - 101,835 EN Calculation Benchmark Text
Dolphin18K [65] - - - 18,460 EN MWP Benchmark Text
MAWPS [84] - - - 3,320 EN MWP Benchmark Formula
ASDiv [116] - - - 2,305 EN MWP Benchmark Formula
MATH [60] 7,500 - 5,000 12,500 EN MWP Benchmark Text
MGSM [160] - - - - Multilingual MWP Benchmark Text
Mathematics [153] 2,000,000 - 100,000 2,100,000 EN MWP Benchmark Formula
MMLU-Math [59] - - - 906 EN MWP Benchmark Formula
AGIEval [216] - - - 469/220 CH/EN MWP Benchmark Formula
MATH 401 [202] - - - 401 EN MWP Benchmark Formula
INT [190] - - - - EN TP Benchmark Formula
miniF2F [215] - 244 244 488 EN TP Benchmark Formula
Aggregate [150] - - - 1,492 EN MWP Augmented Formula
MathQA-Python [7] 19,209 2,822 2,822 23,914 EN MWP Augmented Code
Math50K [91] - - - 50,000 EN MWP Augmented Text
PRM800K [96] - - - 2,868 EN MWP Augmented Text
MetaMathQA [200] - - - 395,000 EN MWP Augmented Text
Lila [119] - - - 134,000 EN MWP Augmented Code
PEN [81] - - - 3,581 EN MWP Augmented Formula
miniF2F+informal [72] - 244 244 488 EN TP Augmented Formula
NaturalProofs-Gen [187] 12,500 1,000 1,000 14,500 EN TP Augmented Text
Table 1. Statistics of the mathematical datasets. Solution denotes the format of the output, such as text,
formula, or code.


5.1 Training Datasets


5.1.1 Mathematical Calculation. Several studies have introduced datasets aiming to identify nu-
merical information within text [37, 167, 168], specifically targeting the prediction of attribute
values. The Clinical Data [167, 168] consists of clinical records sourced from the London Chest
Hospital. Each patient record includes a text report and structured KB tuples detailing 20 potential
numeric attributes like age and gender. Scientific Data [168] encompasses paragraphs extracted
from Cornell’s ARXIV repository, featuring over half a million converted papers across 37 scientific
sub-fields. Additionally, Elazar et al. [37] proposed the Distributions over Quantities (DoQ) dataset,
comprising empirical counts of scalar attribute values linked to more than 350K nouns, adjectives,
and verbs across 10 different attributes.
Furthermore, certain studies consider the relational knowledge between various numbers. For
instance, the VERBPHYSICS [42] dataset aggregates crowdsourced information about actions
and objects, encompassing the relative knowledge of grounded object pairs and implications of
actions associated with those objects. More intricately, the DROP [35] dataset necessitates Discrete
Reasoning Over the content of Paragraphs. In this benchmark of 55,000 adversarially-created
questions, a system is tasked with resolving references within a question, potentially spanning
multiple input positions, and performing discrete operations like addition, counting, or sorting.
5.1.2 Math Word Problems. A large number of datasets are proposed for MWPs. Hosseini et al.
[62] curated the AddSub dataset, primarily focusing on addition and subtraction problems (Figure
9). This dataset identifies relevant variables and their respective values to translate sentences into
problem statements, represented as equations. Another dataset, SingleOp [151], encompasses a
wider range of mathematical operations including multiplication and division. Its purpose is to
facilitate reasoning about quantities articulated in natural language. DRAW [174] comprises 1000
algebra word problems semi-automatically annotated for evaluating automatic solvers. It features
gold coefficient alignments crucial for uniquely identifying equation system derivations. The
Alg514 [85] dataset comprises 514 linear algebra problems structured around 28 equation templates.
Meanwhile, MultiArith [148] is designed to handle arithmetic problems involving multiple steps
and operations, without predefined templates. MATHPILE [184] is a diverse and high-quality
math-centric corpus comprising about 9.5 billion tokens. OpenWebMath [134] is an open dataset
inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl.
Several studies have introduced methods focused on predicting equations through semantic pars-
ing. The SingleEq [83] dataset revolves around grade-school algebra word problems. It comprises
samples that can be correlated to a single equation involving multiple mathematical operations on
one variable, structured as a parsing tree (Figure 10). In contrast, the Dolphin1878 [161] dataset
contains 1,878 word problems in mathematics, wherein each sample might encompass multiple
equations and variables depicted through multiple trees. AllArith [149] offers a concise represen-
tation of dependencies among number units mentioned in a given problem, referred to as Unit
Dependency Graphs (UDGs). To augment dataset size, Math23k [183] gathers 23,161 problems
labeled with structured equations and corresponding answers. GSM8K [27] comprises 8.5K high-
quality grade school math problems crafted by human problem writers. These problems typically
involve 2 to 8 steps for resolution, predominantly requiring a sequence of elementary calculations
using basic arithmetic operations to arrive at the final solution.
HMWP [142] aims to enhance dataset diversity by extracting math word problems from a Chinese
K12 problem bank. This effort sought to validate the universality of math word problem solvers
and expand research in MWPs to better align with real-world scenarios. The dataset encompasses
three types of MWPs: arithmetic word problems, equation set problems, and non-linear equation
problems. In total, it comprises 5,491 MWPs, including 2,955 single-variable linear MWPs, 1,636
two-variable linear MWPs, and 900 single-variable nonlinear MWPs. Additionally, MathQA [4]
presents a diverse collection of 37,000 English multiple-choice math word problems spanning
multiple mathematical domains (Figure 11).

Fig. 9. An example of AddSub, taken from [62].

Fig. 10. An example of SingleEQ, taken from [151].

Existing research endeavors to enhance the comprehensibility and precision of intricate reasoning
processes by providing fine-grained annotations [22, 96, 97, 175, 214]. DRAW-1K [175] introduces
a novel dataset featuring 1,000 general algebra word problems, each annotated with derivations.
AQUA [97] structures each question into four components: problem description, multiple-choice
answer options, rationale, and correct option label (Figure 12). FinQA [22] presents an expert-annotated
dataset comprising 8,281 financial QA pairs, meticulously annotated with numerical reasoning
processes to ensure comprehensive explication. PRM800K [96] assigns labels (positive, negative, or
neutral) to each step in solving MATH problems sampled by a large-scale generator. The training
set encompasses 800K step-level labels across 75K solutions to 12K problems.

Additionally, Patel et al. [135] introduced the SVAMP challenge set to establish a more robust
evaluation framework for methods designed to tackle elementary-level math word problems.
Alghamdi et al. [3] contributed the inaugural large-scale dataset, ArMATH, for Arabic MWPs. This
dataset comprises 6,000 samples of primary-school math problems written in Modern Standard
Arabic. Kalyan et al. [76] proposed two datasets, REALFP and SYNTHFP, for a novel task called Fermi
Problems (FPs). FPs entail questions whose answers can only be approximately estimated due to
the impracticality or impossibility of precise computation. This task necessitates the amalgamation
of various reasoning abilities, commonsense knowledge, and the creative synthesis of
problem-solving strategies.

Fig. 11. An example of MathQA, taken from [4].

Fig. 12. An example of AQUA, taken from [97].

Recently, mathematical reasoning datasets have incorporated multiple input modalities such as
textual and tabular data [107, 214, 224]. For instance, TabMWP [107] presents a new dataset
comprising 38,431 open-domain grade-level problems that necessitate mathematical reasoning
across both textual and tabular data. Zhu et al. [224] introduced a large-scale QA dataset, TAT-QA,
encompassing both tabular and textual data. This dataset often requires numerical reasoning—such
as addition, subtraction, multiplication, division, counting, comparison/sorting, and their
combinations—based on real financial reports to infer answers. Additionally, MultiHiertt [214] is a
proposed dataset consisting of QA pairs derived from Multi Hierarchical Tabular and Textual data
(Figure 13). Each document in this dataset contains multiple tables and longer unstructured texts,
accompanied by fine-grained annotations of reasoning processes and supporting facts to elucidate
complex numerical reasoning.

Fig. 13. An example of MultiHiertt, taken from [214].

5.1.3 Theorem Proving. Current research delves into exploring mathematical theorem datasets (e.g.,
INT [190], Feit-Thompson [64], and IsarStep [92]) to develop novel machine learning-based theorem-
proving strategies. For instance, the format of IsarStep is shown in Figure 14. Huang et al. [64]
introduced the Feit-Thompson dataset, encompassing 1,602 lemmas, expanding into 83,478 proof
states for the Feit-Thompson theorem [48]. IsarStep [92] constitutes a non-synthetic dataset extracted
from the largest repository of proofs handwritten by human experts in a theorem prover. Furthermore,
NaturalProofs [186] comprises 32,000 theorem statements and proofs, 14,000 definitions, and
2,000 other types of pages (e.g., axioms, corollaries). This dataset draws from three domains:
broad-coverage data sourced from ProofWiki, deep-coverage data from the Stacks project, and
low-resource data extracted from mathematics textbooks.

Fig. 14. An example of IsarStep, taken from [92].

Several formal mathematical libraries [51, 114, 152] encompass a range of formal languages
tailored for theorem proving (TP). Examples include MML [51], Coq [13], Lean [53], Isabelle
[189], and HOL Light [166]. For instance, MML [51], also known as the Mizar Mathematical Library,
was established to systematically construct a centralized, reusable knowledge base reflecting the
standard foundations of mathematics, namely classical first-order logic and set theory. Yang et al.
[194] compiled the CoqGym dataset, comprising 71,000 human-written proofs from 123 projects
developed using the Coq proof assistant. LeanStep [53] serves as the tactic proof dataset for the Lean
theorem prover, featuring high-level human-written tactics alongside kernel-level proof terms. LISA
[70], one of the largest proof corpora for interactive theorem provers, contains 183,000 theorems
and 2.16 million proof steps. This dataset is extracted from the Archive of Formal Proofs and the
standard library of Isabelle. Kaliszyk et al. [75] introduced the HolStep dataset rooted in Higher-
Order Logic (HOL) proofs. HOList [10] includes almost 30,000 theorems and proofs across three
corpora: core, complex, and flyspeck. The core corpus contains fundamental theorems necessary
for defining tactics, while the complex corpus consists of theorems related to complex calculus.
Flyspeck contains the majority of the lemmas and theorems related to the Kepler conjecture [52].
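To give a flavor of the statement-and-proof pairs these formal corpora contain, here is a small Lean 4 example; it is illustrative only and is not drawn from any of the cited libraries.

```lean
-- A toy formal statement with its proof: 2 * n is even for every natural n.
theorem two_mul_is_even (n : Nat) : ∃ c : Nat, 2 * n = 2 * c :=
  ⟨n, rfl⟩
```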

5.2 Benchmark Datasets


Various benchmark datasets have been introduced to assess the performance of different math-
ematical algorithms, such as MAWPS [84]. Huang et al. [65] developed the Dolphin18K dataset,
a large-scale and diverse collection containing over 18,000 annotated math word problems. The
Mathematics dataset [153] encompasses problems spanning arithmetic, algebra, probability, and
calculus. These problems involve sequential questions and answers presented in a free-form tex-
tual input/output format. For assessing arithmetic understanding, Mishra et al. [120] introduced
NumGLUE, a multi-task benchmark that evaluates AI systems across eight distinct tasks. miniF2F
[215] is a dataset designed to create a unified cross-system benchmark for neural theorem proving.
It comprises 488 formal Olympiad-level mathematics problem statements, encompassing Metamath,
Lean, Isabelle, and HOL Light.
Several datasets consider varying difficulty levels across educational stages, ranging from elemen-
tary school [116, 160] to high school [60]. Miao et al. [116] introduced the ASDiv corpus, comprising
2,305 math word problems (MWPs) categorized by problem type and elementary school grade
level. The Multilingual Grade School Math Benchmark (MGSM) [160] leverages problems from
GSM8K, translating them into 10 languages with the assistance of human annotators. The MATH
dataset [60] contains 12,500 problems sourced from high school math competitions. Each problem
within MATH is equipped with a step-by-step solution, enabling models to learn to generate answer
derivations and explanations.
Furthermore, several datasets serve to evaluate foundation models [202, 216]. AGIEval [216]
functions as a human-centric benchmark tailored to assess foundation models’ general abilities in
tasks related to human cognition and problem-solving. This dataset is collected from standardized
human-centric exams like college entrance exams, law school admission tests, math competitions,
and lawyer qualification tests. Yuan et al. [202] introduced the arithmetic dataset MATH 401,
designed to test the latest large language models such as GPT-4, ChatGPT, InstructGPT, Galactica,
and LLaMA. It features various arithmetic expressions and offers a comprehensive analysis of the
capabilities of large language models.
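For benchmarks with free-form textual solutions such as GSM8K, accuracy is commonly computed by extracting the final number from each generation and comparing it with the reference answer. The sketch below shows one such procedure; the regular expressions and the reliance on the "####" marker of GSM8K gold solutions are illustrative assumptions rather than an official evaluation script.

```python
import re

def extract_final_number(text: str):
    if "####" in text:                      # GSM8K-style gold solutions
        text = text.split("####")[-1]
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def exact_match_accuracy(predictions: list, references: list) -> float:
    hits = sum(
        extract_final_number(p) == extract_final_number(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

preds = ["5 + 2 * 3 = 11. The answer is 11.", "23 - 20 + 6 = 8."]
refs = ["Roger has 5 + 6 = 11 balls.\n#### 11", "They have 3 + 6 = 9 apples.\n#### 9"]
print(exact_match_accuracy(preds, refs))  # -> 0.5
```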

5.3 Augmented Datasets


Augmented datasets serve to enhance existing datasets by incorporating additional samples or
information. Roy et al. [150] introduced Aggregate by augmenting the AllArith dataset with 661
word problems from Perturb. Perturb contains problems that were manually pruned either for not


yielding the desired solution 𝑎 or being too dissimilar from the input problem 𝑝. Large-scale language
models like ChatGPT and GPT-4 have been utilized to expand datasets. For instance, MetaMathQA
[200] rephrases questions from multiple perspectives without introducing extra knowledge to
bootstrap mathematical questions. The Math50K dataset [91] comprises 50,000 problem-solution
pairs generated using GPT-4. These datasets leverage advanced language models to enrich the
existing collection of problems and solutions.
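The sketch below illustrates, under simplifying assumptions, how an LLM can be used to bootstrap additional question–answer pairs by rephrasing seed problems from several perspectives. The prompt templates and the `llm` stub are hypothetical and do not reproduce the exact MetaMathQA or Math50K pipelines.

```python
PERSPECTIVES = [
    "Rephrase the following math problem without changing its answer:\n{q}",
    "Rewrite the following problem as a question about the unknown quantity:\n{q}",
]

def llm(prompt: str) -> str:
    # Placeholder for a call to a large language model such as GPT-4.
    return "Rephrased: " + prompt.splitlines()[-1]

def augment(seed_pairs: list) -> list:
    augmented = []
    for question, answer in seed_pairs:
        for template in PERSPECTIVES:
            # Each rephrased question keeps the original gold answer.
            augmented.append({"question": llm(template.format(q=question)),
                              "answer": answer})
    return augmented

seeds = [("Roger has 5 balls and buys 2 cans of 3. How many balls now?", "11")]
for sample in augment(seeds):
    print(sample)
```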
Several studies have expanded existing datasets by including supplementary information, such
as code programs [7, 119] and explanations [81]. MathQA-Python [7] is a Python adaptation of the
MathQA benchmark, comprising 23,914 problems. This dataset aims to evaluate models’ capabilities
in synthesizing code from intricate textual descriptions. Lila [119] is a comprehensive mathematical
reasoning benchmark composed of 23 diverse tasks. It extends 20 existing datasets by gathering
task instructions and solutions as Python programs, offering not only correct answers but also
explainable solutions. PEN [81] focuses on providing plausible and accurate explanations for solving
algebraic word problems present in three benchmark datasets: ALG514, DRAW-1K, and MAWPS.
Moreover, some studies have contributed statements and proofs based on existing datasets for
theorem proving [72, 187]. For example, miniF2F+informal [72] is a dataset consisting of manually
curated informal statements and proofs aligned with formal statements from the miniF2F dataset.
NaturalProofs-Gen [187] adapts data from NATURALPROOFS, including theorem statements,
proofs, definitions, and additional pages sourced from ProofWiki. Each proof follows a multi-step
structure and references a variable number of sources.

6 CHALLENGES AND FURTHER DIRECTIONS


Faithfulness. Mathematical LLMs are prone to hallucination and faithfulness issues, where the
generated output may lack factual accuracy or logical grounding [69, 146]. This phenomenon leads
to erroneous or misleading mathematical results, undermining the reliability of the model’s outputs.
Some studies have explored addressing this problem by integrating external knowledge [57],
reinforcement learning from human feedback [96], tools [133, 155], and verify-based methods
[57, 93, 98, 163, 211, 217]. However, the mitigation of hallucination remains limited, which affects
the trustworthiness and utility of mathematical language models in practical applications and
scholarly pursuits.
Multi-modal. In particular, math problems (e.g., geometry problems [46]) involve not only textual
information but also various modalities such as diagrams, graphs, or mathematical symbols [108].
While existing models excel in processing textual information, interpreting and reasoning across
multiple modalities concurrently remains a formidable task. Few multi-modal mathematical datasets
and methods have been proposed for this task [108, 214]. Bridging the gap between textual representations
and visual/mathematical elements necessitates robust mechanisms that can effectively capture and
synthesize information from disparate sources, ultimately leading to comprehensive and accurate
problem-solving capabilities. In fact, the multi-modal information in mathematics is much more
complex than general multi-modal tasks like vision question answering [6] and image captioning
[61]. Achieving proficiency in handling multi-modal mathematical problems stands as a pivotal yet
intricate objective in advancing the competency of mathematical LLMs.
Uncertainty. The uncertainty of LLMs [36, 45] leads to the ambiguity and variability problem
of mathematical problems. While these models excel in deterministic tasks, their handling of
uncertainty, such as probabilistic reasoning or dealing with incomplete or vague information, poses a
significant challenge. Mathematical problems often entail nuanced interpretations, fuzzy constraints,
or scenarios where a single precise solution might not exist. Several studies investigated this problem
via controlled generation technologies [221]. However, ensuring that LLMs can navigate and


appropriately account for these uncertainties while providing accurate and contextually relevant
solutions remains a complex task.
Evaluation. It is still a challenge to evaluate mathematical LMs with robust and comprehensive
evaluation metrics that adequately capture the models’ performance across various mathematical
tasks. Traditional evaluation metrics in natural language processing might not effectively capture
the intricacies of mathematical reasoning and problem-solving. Designing evaluation benchmarks
[27, 60, 85, 153] and metrics [25, 171, 179, 202] that encompass a wide spectrum of mathematical
tasks, spanning from basic arithmetic to complex theorem proving, while accounting for linguistic
fluency and mathematical accuracy, remains a significant challenge. Addressing these challenges
is crucial to ascertain the reliability, efficacy, and generalizability of mathematical LMs, fostering
advancements in this burgeoning field.
Creation. While previous models exhibit remarkable capabilities in understanding and manipu-
lating existing mathematical concepts, their ability to autonomously devise and rigorously prove
entirely new theorems presents a formidable hurdle. The development of novel mathematical
theorems demands not only profound mathematical reasoning but also creative and insightful
problem-solving abilities, aspects that necessitate a deeper understanding of abstract mathematical
concepts beyond what the models have been trained on. Davies et al. [30] applied machine learning
techniques to discover potential patterns and relationships among mathematical entities. Recently,
Romera-Paredes et al. [12] proposed FunSearch, which yielded the first discoveries for established open
problems made using LLMs. Bridging the gap between the models’ proficiency in assimilating existing
mathematical knowledge and their capability to generate novel, verifiable, and impactful theo-
rems represents a significant frontier in leveraging LLMs for the advancement of mathematical
knowledge and discovery.
Application. While mathematical LMs exhibit promising potential in autonomously solving math
problems, their deployment in educational settings as teaching aids or tutors necessitates addressing
several pivotal challenges. Tailoring these models to serve as effective educational tools demands
not only mathematical proficiency but also adeptness in pedagogy and instructional methodologies.
Few studies have applied Socratic questioning to mathematical teaching [164, 204]. Customizing LLMs
to cater to diverse learning styles, adapting explanations to different proficiency levels, and fostering
an interactive and engaging learning environment remain interesting open directions.
Data scarcity. The training data significantly influences the performance of language models,
particularly in LLMs [33]. High-quality and diverse training data can assist the model in enhancing
its mathematical reasoning capabilities [110]. As discussed in Section 4.1, while there have been
limited studies on constructing instruction data through LLMs, these efforts have only considered
building from a small set of mathematical reasoning datasets, such as GSM8k [27] and MATH [60].
High-quality and diverse mathematical instruction data remains scarce. There is a need to explore
additional forms and construction methods for mathematical training data.
Additionally, the generation of mathematical training datasets in a multimodal context is also a
promising direction.

7 CONCLUSIONS
The survey elucidates the pivotal role of mathematical LMs in reshaping the landscape of math-
ematical problem-solving, leveraging a spectrum of models, from pre-trained language models
(PLMs) to large-scale language models (LLMs), to address diverse mathematical tasks. Our tax-
onomical delineation of mathematical tasks and methods provides a systematic framework for
comprehending the intricacies of LMs-based methodologies, distinguishing between arithmetic


calculation, mathematical reasoning, and various algorithmic approaches employed in these models.
The compilation of over 60 diverse mathematical datasets, categorized meticulously into training,
benchmark, and augmented datasets, underscores the pivotal role of data in advancing mathe-
matical research, facilitating informed research endeavors within distinct mathematical contexts.
Moreover, by critically addressing challenges such as faithfulness, multi-modality, uncertainty,
evaluation, theorem creation, application, and Data scarcity, this survey paves the way for future
investigations aimed at refining and advancing mathematical LMs capabilities. By shedding light
on the current state-of-the-art, challenges, and avenues for future exploration, we envision this
comprehensive overview to be a cornerstone in driving innovation and shaping the trajectory of
mathematical LMs research, ultimately contributing to the evolving landscape of mathematics and
artificial intelligence.

REFERENCES
[1] Erfan Al-Hossami, Razvan Bunescu, Justin Smith, and Ryan Teehan. 2023. Can Language Models Employ the Socratic
Method? Experiments with Code Debugging. arXiv:2310.03210 [cs.CL]
[2] Erfan Al-Hossami, Razvan Bunescu, Ryan Teehan, Laurel Powell, Khyati Mahajan, and Mohsen Dorodchi. 2023.
Socratic questioning of novice debuggers: A benchmark dataset and preliminary evaluations. In Proceedings of the
18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023). 709–726.
[3] Reem Alghamdi, Zhenwen Liang, and Xiangliang Zhang. 2022. ArMATH: a Dataset for Solving Arabic Math Word
Problems. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. 351–362.
[4] Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019.
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. In Proceedings of the
2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers). 2357–2367.
[5] Beng Heng Ang, Sujatha Das Gollapalli, and See Kiong Ng. 2023. Socratic Question Generation: A Novel Dataset, Mod-
els, and Evaluation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational
Linguistics. 147–165.
[6] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi
Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision.
2425–2433.
[7] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie
Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732
(2021).
[8] Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W Ayers, Dragomir Radev, and Jeremy Avi-
gad. 2023. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint
arXiv:2302.12433 (2023).
[9] Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng,
Stella Biderman, and Sean Welleck. 2023. Llemma: An Open Language Model for Mathematics. arXiv preprint
arXiv:2310.10631 (2023).
[10] Kshitij Bansal, Sarah Loos, Markus Rabe, Christian Szegedy, and Stewart Wilcox. 2019. Holist: An environment for
machine learning of higher order logic theorem proving. In International Conference on Machine Learning. PMLR,
454–463.
[11] Taylor Berg-Kirkpatrick and Daniel Spokoyny. 2020. An empirical investigation of contextualized number prediction.
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber,
Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, 4754–4764. https://doi.org/
10.18653/v1/2020.emnlp-main.385
[12] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, et al. 2023. Mathematical discoveries
from program search with large language models. Nature (2023).
[13] Yves Bertot and Pierre Castéran. 2013. Interactive theorem proving and program development: Coq’Art: the calculus of
inductive constructions. Springer Science & Business Media.
[14] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann,
Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2023. Graph of Thoughts: Solving
Elaborate Problems with Large Language Models. http://arxiv.org/abs/2308.09687 arXiv:2308.09687 [cs].
[15] Daniel Bobrow et al. 1964. Natural language input for a computer problem solving system. (1964).
[16] Diane J Briars and Jill H Larkin. 1984. An integrated model of skill in solving elementary word problems. Cognition
and instruction 1, 3 (1984), 245–296.
[17] Thomas C Brickhouse and Nicholas D Smith. 2009. Socratic teaching and Socratic method. (2009).
[18] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural
information processing systems 33 (2020), 1877–1901.
[19] Edward Y Chang. 2023. Prompting large language models with the socratic method. In 2023 IEEE 13th Annual
Computing and Communication Workshop and Conference (CCWC). IEEE, 0351–0360.
[20] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. 2022. Program of thoughts prompting: Disentangling
computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588 (2022).
[21] Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. 2022. Meta-learning via Language Model In-context
Tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers). 719–730.
[22] Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane,
Ting-Hao Huang, Bryan R Routledge, et al. 2021. FinQA: A Dataset of Numerical Reasoning over Financial Data. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 3697–3711.
[23] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham,
Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways.
arXiv preprint arXiv:2204.02311 (2022).
[24] Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin,
and Ting Liu. 2023. A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future. arXiv:2309.15402 [cs.CL]
[25] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang,
Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint
arXiv:2210.11416 (2022).
[26] Peter Clark, Oren Etzioni, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal,
Carissa Schoenick, Oyvind Tafjord, Niket Tandon, Sumithra Bhakthavatsalam, Dirk Groeneveld, Michal Guerquin,
and Michael Schmitz. 2021. From ’F’ to ’a’ on the N.Y. Regents Science Exams: An Overview of the Aristo Project.
arXiv:1909.01958 [cs]
[27] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry
Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve
Math Word Problems. arXiv:2110.14168 [cs.LG]
[28] Jelle Couperus. 2023. Large Language Models and Mathematical Understanding. Master’s thesis.
[29] Gautier Dagan, Frank Keller, and Alex Lascarides. 2023. Dynamic Planning with a LLM. arXiv preprint arXiv:2308.06391
(2023).
[30] Alex Davies, Petar Veličković, Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Tomašev, Richard Tanburn, Peter
Battaglia, Charles Blundell, András Juhász, et al. 2021. Advancing mathematics by guiding human intuition with AI.
Nature 600, 7887 (2021), 70–74.
[31] Peggy Cooper Davis and Elizabeth Ehrenfest Steinglass. 1997. A dialogue about Socratic teaching. NYU Rev. L. & Soc.
Change 23 (1997), 249.
[32] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. arXiv:1810.04805 [cs]
[33] Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou.
2023. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. http://arxiv.org/abs/
2305.14233 arXiv:2305.14233 [cs].
[34] Iddo Drori, Sarah Zhang, Reece Shuttleworth, Leonard Tang, Albert Lu, Elizabeth Ke, Kevin Liu, Linda Chen, Sunny
Tran, Newman Cheng, et al. 2022. A neural network solves, explains, and generates university math problems by
program synthesis and few-shot learning at human level. Proceedings of the National Academy of Sciences 119, 32
(2022), e2123433119.
[35] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A
Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. In Proceedings of NAACL-HLT.
2368–2378.
[36] Jinhao Duan, Hao Cheng, Shiqi Wang, Chenan Wang, Alex Zavalny, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu.
2023. Shifting attention to relevance: Towards the uncertainty estimation of large language models. arXiv preprint
arXiv:2307.01379 (2023).
[37] Yanai Elazar, Abhijit Mahabal, Deepak Ramachandran, Tania Bedrax-Weiss, and Dan Roth. 2019. How Large Are
Lions? Inducing Distributions over Quantitative Attributes. In Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics. 3973–3983.
[38] Edward A Feigenbaum, Julian Feldman, et al. 1963. Computers and thought. Vol. 7. New York McGraw-Hill.
[39] Yu Feng, Jing Zhang, Xiaokang Zhang, Lemao Liu, Cuiping Li, and Hong Chen. 2022. Injecting Numerical Reasoning
Skills into Knowledge Base Question Answering Models. arXiv:2112.06109 [cs]
[40] Emily First, Markus N Rabe, Talia Ringer, and Yuriy Brun. 2023. Baldur: whole-proof generation and repair with
large language models. arXiv preprint arXiv:2303.04910 (2023).
[41] Charles R Fletcher. 1985. Understanding and solving arithmetic word problems: A computer simulation. Behavior
Research Methods, Instruments, & Computers 17, 5 (1985), 565–571.
[42] Maxwell Forbes and Yejin Choi. 2017. Verb Physics: Relative Physical Knowledge of Actions and Objects. In Proceedings
of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 266–276.
[43] Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2023. Complexity-Based Prompting for Multi-step
Reasoning. In The Eleventh International Conference on Learning Representations.
[44] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023.
Pal: Program-aided language models. In International Conference on Machine Learning. PMLR, 10764–10799.
[45] Jakob Gawlikowski, Cedrique Rovile Njieutcheu Tassi, Mohsin Ali, Jongseok Lee, Matthias Humt, Jianxiang Feng,
Anna Kruspe, Rudolph Triebel, Peter Jung, Ribana Roscher, et al. 2023. A survey of uncertainty in deep neural
networks. Artificial Intelligence Review 56, Suppl 1 (2023), 1513–1589.
[46] Herbert Gelernter, James R Hansen, and Donald W Loveland. 1960. Empirical explorations of the geometry theorem
machine. In Papers presented at the May 3-5, 1960, western joint IRE-AIEE-ACM computer conference. 143–149.
[47] Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting Numerical Reasoning Skills into Language Models.
arXiv:2004.04487 [cs]
[48] Georges Gonthier, Andrea Asperti, Jeremy Avigad, Yves Bertot, Cyril Cohen, François Garillot, Stéphane Le Roux,
Assia Mahboubi, Russell O’Connor, Sidi Ould Biha, et al. 2013. A machine-checked proof of the odd order theorem. In
International conference on interactive theorem proving. Springer, 163–179.
[49] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2023. CRITIC: Large
Language Models Can Self-Correct with Tool-Interactive Critiquing. arXiv:2305.11738 [cs.CL]
[50] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yujiu Yang, Minlie Huang, Nan Duan, Weizhu Chen, et al. 2023. Tora: A
tool-integrated reasoning agent for mathematical problem solving. arXiv preprint arXiv:2309.17452 (2023).
[51] Adam Grabowski, Artur Korniłowicz, and Adam Naumowicz. 2015. Four Decades of Mizar: Foreword. Journal of
Automated Reasoning 55 (2015), 191–198.
[52] Thomas Hales, Mark Adams, Gertrud Bauer, Tat Dat Dang, John Harrison, Hoang Le Truong, Cezary Kaliszyk, Victor
Magron, Sean McLaughlin, Tat Thang Nguyen, et al. 2017. A formal proof of the Kepler conjecture. In Forum of
mathematics, Pi, Vol. 5. Cambridge University Press, e2.
[53] Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward Ayers, and Stanislas Polu. 2022. Proof Artifact Co-Training for
Theorem Proving with Language Models. In International Conference on Learning Representations.
[54] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning
with language model is planning with world model. arXiv preprint arXiv:2305.14992 (2023).
[55] Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. 2023. ToolkenGPT: Augmenting Frozen Language Models
with Massive Tools via Tool Embeddings. arXiv:2305.11554 [cs.CL]
[56] Yihan Hao, Mingliang Zhang, Fei Yin, and Linlin Huang. 2022. PGDP5K: A Diagram Parsing Dataset for Plane
Geometry Problems. 2022 26th International Conference on Pattern Recognition (ICPR) (2022), 1763–1769.
[57] Hangfeng He, Hongming Zhang, and Dan Roth. 2022. Rethinking with Retrieval: Faithful Large Language Model
Inference. http://arxiv.org/abs/2301.00303 arXiv:2301.00303 [cs].
[58] Joy He-Yueya, Gabriel Poesia, Rose E Wang, and Noah D Goodman. 2023. Solving math word problems by combining
language models with symbolic solvers. arXiv preprint arXiv:2304.09102 (2023).
[59] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021.
Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning
Representations (ICLR) (2021).
[60] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob
Steinhardt. 2021. Measuring Mathematical Problem Solving With the MATH Dataset. In Thirty-fifth Conference on
Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[61] MD Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. 2019. A comprehensive survey of deep
learning for image captioning. ACM Computing Surveys (CsUR) 51, 6 (2019), 1–36.
[62] Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to Solve
Arithmetic Word Problems with Verb Categorization. In Proceedings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), Alessandro Moschitti, Bo Pang, and Walter Daelemans (Eds.). Association for
Computational Linguistics, Doha, Qatar, 523–533. https://doi.org/10.3115/v1/D14-1058
[63] Cheng-Yu Hsieh, Si-An Chen, Chun-Liang Li, Yasuhisa Fujii, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna,
and Tomas Pfister. 2023. Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models.
arXiv:2308.00675 [cs.CL]
[64] Daniel Huang, Prafulla Dhariwal, Dawn Song, and Ilya Sutskever. 2019. GamePad: A Learning Environment for
Theorem Proving. In International Conference on Learning Representations.
[65] Danqing Huang, Shuming Shi, Chin-Yew Lin, Jian Yin, and Wei-Ying Ma. 2016. How well do computers solve math
word problems? large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers). 887–896.
[66] Shima Imani, Liang Du, and Harsh Shrivastava. 2023. Mathprompter: Mathematical reasoning using large language
models. arXiv preprint arXiv:2303.05398 (2023).
[67] Geoffrey Irving, Christian Szegedy, Alexander A Alemi, Niklas Eén, François Chollet, and Josef Urban. 2016. Deepmath-
deep sequence models for premise selection. Advances in neural information processing systems 29 (2016).
[68] Samy Jelassi, Stéphane d’Ascoli, Carles Domingo-Enrich, Yuhuai Wu, Yuanzhi Li, and François Charton. 2023. Length
Generalization in Arithmetic Transformers. https://doi.org/10.48550/arXiv.2306.15400
[69] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and
Pascale Fung. 2023. Survey of hallucination in natural language generation. Comput. Surveys 55, 12 (2023), 1–38.
[70] Albert Qiaochu Jiang, Wenda Li, Jesse Michael Han, and Yuhuai Wu. 2021. LISA: Language models of ISAbelle proofs.
In 6th Conference on Artificial Intelligence and Theorem Proving. 378–392.
[71] Albert Q. Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygóźdź, Piotr Miłoś, Yuhuai Wu,
and Mateja Jamnik. 2022. Thor: Wielding Hammers to Integrate Language Models and Automated Theorem Provers.
arXiv:2205.10893 [cs.AI]
[72] Albert Qiaochu Jiang, Sean Welleck, Jin Peng Zhou, Timothee Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik,
Guillaume Lample, and Yuhuai Wu. 2023. Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal
Proofs. In The Eleventh International Conference on Learning Representations.
[73] Song Jiang, Zahra Shakeri, Aaron Chan, Maziar Sanjabi, Hamed Firooz, Yinglong Xia, Bugra Akyildiz, Yizhou Sun,
Jinchao Li, Qifan Wang, et al. 2023. Resprompt: Residual connection prompting advances multi-step reasoning in
large language models. arXiv preprint arXiv:2310.04743 (2023).
[74] Zhanming Jie, Jierui Li, and Wei Lu. 2022. Learning to Reason Deductively: Math Word Problem Solving as Complex
Relation Extraction. arXiv:2203.10316 [cs]
[75] Cezary Kaliszyk, François Chollet, and Christian Szegedy. 2017. HolStep: A Machine Learning Dataset for Higher-order
Logic Theorem Proving. In International Conference on Learning Representations.
[76] Ashwin Kalyan, Abhinav Kumar, Arjun Chandrasekaran, Ashish Sabharwal, and Peter Clark. 2021. How Much Coffee
Was Consumed During EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI. In Conference on Empirical
Methods in Natural Language Processing (EMNLP 2021). Association for Computational Linguistics, 7318–7328.
[77] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford,
Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
[78] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Proceedings of NAACL-HLT. 4171–4186.
[79] Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, and Lu Wang. 2023. Discriminator-Guided
Multi-step Reasoning with Language Models. arXiv preprint arXiv:2305.14934 (2023).
[80] Bugeun Kim, Kyung Seo Ki, Donggeon Lee, and Gahgene Gweon. 2020. Point to the Expression: Solving Algebraic
Word Problems using the Expression-Pointer Transformer Model. In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 3768–3779.
https://doi.org/10.18653/v1/2020.emnlp-main.308
[81] Bugeun Kim, Kyung Seo Ki, Sangkyu Rhim, and Gahgene Gweon. 2022. EPT-X: An Expression-Pointer Transformer
model that generates eXplanations for numbers. In Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers). 4442–4458.
[82] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language
Models are Zero-Shot Reasoners. Advances in Neural Information Processing Systems 35 (2022).
[83] Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. 2015. Parsing
algebraic word problems into equations. Transactions of the Association for Computational Linguistics 3 (2015),
585–597.
[84] Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. MAWPS: A math
word problem repository. In Proceedings of the 2016 conference of the north american chapter of the association for
computational linguistics: human language technologies. 1152–1157.
[85] Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to Automatically Solve Algebra
Word Problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), Kristina Toutanova and Hua Wu (Eds.). Association for Computational Linguistics, Baltimore, Maryland,
271–281. https://doi.org/10.3115/v1/P14-1026
[86] Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien
Rodriguez, and Timothée Lacroix. [n. d.]. HyperTree Proof Search for Neural Theorem Proving. ([n. d.]).
[87] Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril,
Gabriel Ebner, and Xavier Martinet. 2022. Hypertree proof search for neural theorem proving. Advances in Neural
Information Processing Systems 35 (2022), 26337–26349.
[88] Bin Lei, Chunhua Liao, Caiwen Ding, et al. 2023. Boosting logical reasoning in large language models through a new
framework: The graph of thought. arXiv preprint arXiv:2308.08614 (2023).
[89] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov,
and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation,
Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics. 7871–7880.
[90] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose
Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra.
2022. Solving Quantitative Reasoning Problems With Language Models. Advances in Neural Information Processing Systems 35 (2022).
[91] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL:
Communicative Agents for "Mind" Exploration of Large Language Model Society. In Thirty-seventh Conference on
Neural Information Processing Systems.
[92] Wenda Li, Lei Yu, Yuhuai Wu, and Lawrence C Paulson. 2021. IsarStep: a Benchmark for High-level Mathematical
Reasoning. In International Conference on Learning Representations.
[93] Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2023. Making language
models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers). 5315–5333.
[94] Zhongli Li, Wenxuan Zhang, Chao Yan, Qingyu Zhou, Chao Li, Hongzhi Liu, and Yunbo Cao. 2022. Seeking Patterns,
Not Just Memorizing Procedures: Contrastive Learning for Solving Math Word Problems. arXiv:2110.08464 [cs]
[95] Zhenwen Liang, Jipeng Zhang, Lei Wang, Wei Qin, Yunshi Lan, Jie Shao, and Xiangliang Zhang. 2022. MWP-BERT:
Numeracy-augmented Pre-Training for Math Word Problem Solving. arXiv:2107.13435 [cs]
[96] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman,
Ilya Sutskever, and Karl Cobbe. 2023. Let’s Verify Step by Step. arXiv:2305.20050 [cs.LG]
[97] Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program Induction by Rationale Generation:
Learning to Solve and Explain Algebraic Word Problems. In Proceedings of the 55th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers). 158–167.
[98] Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, and Hao Su. 2023. Deductive
Verification of Chain-of-Thought Reasoning. arXiv preprint arXiv:2306.03872 (2023).
[99] Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. 2023. Llm+ p:
Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477 (2023).
[100] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022. What Makes Good
In-Context Examples for GPT-3?. In Deep Learning Inside Out: 3rd Workshop on Knowledge Extraction and Integration
for Deep Learning Architectures, DeeLIO 2022. Association for Computational Linguistics (ACL), 100–114.
[101] Tiedong Liu and Bryan Kian Hsiang Low. 2023. Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks.
http://arxiv.org/abs/2305.14201 arXiv:2305.14201 [cs].
[102] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer,
and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs]
[103] Yixin Liu, Avi Singh, C. Daniel Freeman, John D. Co-Reyes, and Peter J. Liu. 2023. Improving Large Language Model
Fine-tuning for Solving Math Problems. http://arxiv.org/abs/2310.10047 arXiv:2310.10047 [cs].
[104] Jieyi Long. 2023. Large Language Model Guided Tree-of-Thought. arXiv preprint arXiv:2305.08291 (2023).
[105] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-chun Zhu. 2021. Inter-GPS:
Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning. In Proceedings of the 59th
Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural
Language Processing (Volume 1: Long Papers). 6774–6786.
[106] Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao.
2023. Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models. arXiv:2304.09842 [cs.CL]
[107] Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin
Kalyan. 2023. Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning. In The
Eleventh International Conference on Learning Representations.
[108] Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. 2021.
IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning. In Thirty-fifth
Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[109] Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. 2023. A Survey of Deep Learning for Mathematical
Reasoning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics,
Toronto, Canada, 14605–14631. https://doi.org/10.18653/v1/2023.acl-long.817
[110] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng
Chen, and Dongmei Zhang. 2023. Wizardmath: Empowering mathematical reasoning for large language models via
reinforced evol-instruct. arXiv preprint arXiv:2308.09583 (2023).
[111] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri,
Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. arXiv preprint
arXiv:2303.17651 (2023).
[112] Nikolaos Matzakos, Spyridon Doukakis, and Maria Moundridou. 2023. Learning Mathematics with Large Language
Models: A Comparative Study with Computer Algebra Systems and Other Tools. International Journal of Emerging
Technologies in Learning (Online) 18, 20 (2023), 51.
[113] Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On Faithfulness and Factuality in
Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics,
Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online,
1906–1919. https://doi.org/10.18653/v1/2020.acl-main.173
[114] Norman Megill and David A Wheeler. 2019. Metamath: a computer language for mathematical proofs. Lulu. com.
[115] Ning Miao, Yee Whye Teh, and Tom Rainforth. 2023. Selfcheck: Using llms to zero-shot check their own step-by-step
reasoning. arXiv preprint arXiv:2308.00436 (2023).
[116] Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A Diverse Corpus for Evaluating and Developing English
Math Word Problem Solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
975–984.
[117] Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre,
Ilana Heintz, and Dan Roth. 2023. Recent Advances in Natural Language Processing via Large Pre-Trained Language
Models: A Survey. ACM Comput. Surv. 56, 2, Article 30 (sep 2023), 40 pages. https://doi.org/10.1145/3605943
[118] Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2022. MetaICL: Learning to Learn In Context.
In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies. 2791–2809.
[119] Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind
Tafjord, Ashish Sabharwal, Peter Clark, et al. 2022. LILA: A Unified Benchmark for Mathematical Reasoning. In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 5807–5832.
[120] Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta Baral, and Ashwin Kalyan.
2022. NUMGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks. In 60th Annual Meeting of
the Association for Computational Linguistics, ACL 2022. Association for Computational Linguistics (ACL), 3505–3523.
[121] Arindam Mitra and Chitta Baral. 2016. Learning to use formulas to solve simple arithmetic problems. In Proceedings
of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2144–2153.
[122] Shentong Mo and Miao Xin. 2023. Tree of Uncertain Thoughts Reasoning for Large Language Models. arXiv preprint
arXiv:2309.07694 (2023).
[123] Matteo Muffo, Aldo Cocco, and Enrico Bertino. [n. d.]. Evaluating Transformer Language Models on Arithmetic
Operations Using Number Decomposition. ([n. d.]).
[124] Ranjita Naik, Varun Chandrasekaran, Mert Yuksekgonul, Hamid Palangi, and Besmira Nushi. 2023. Diversity of
Thought Improves Reasoning Abilities of Large Language Models. arXiv preprint arXiv:2310.07088 (2023).
[125] Ryosuke Nakamoto, Brendan Flanagan, Taisei Yamauchi, Yiling Dai, Kyosuke Takami, and Hiroaki Ogata. 2023.
Enhancing Automated Scoring of Math Self-Explanation Quality using LLM-Generated Datasets: A Semi-Supervised
Approach. (2023).
[126] Leonard Nelson. 1980. The socratic method. Thinking: The Journal of Philosophy for Children 2, 2 (1980), 34–38.
[127] A. Newell, J. C. Shaw, and H. A. Simon. 1957. Empirical Explorations of the Logic Theory Machine: A Case Study
in Heuristic (IRE-AIEE-ACM ’57 (Western)). Association for Computing Machinery, New York, NY, USA, 218–230.
https://doi.org/10.1145/1455567.1455605
[128] Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. 2021. Investigating the limitations of transformers with simple
arithmetic tasks. arXiv preprint arXiv:2102.13019 (2021).
[129] Kimia Noorbakhsh, Modar Sulaiman, Mahdi Sharifi, Kallol Roy, and Pooyan Jamshidi. 2021. Pretrained Language
Models are Symbolic Mathematics Solvers too! arXiv preprint arXiv:2110.03501 (2021).
[130] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David
Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. 2021. Show Your Work:
Scratchpads for Intermediate Computation with Language Models. http://arxiv.org/abs/2112.00114 arXiv:2112.00114
[cs].
[131] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[132] Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro.
2023. ART: Automatic multi-step reasoning and tool-use for large language models. arXiv:2303.09014 [cs.CL]
[133] Aaron Parisi, Yao Zhao, and Noah Fiedel. 2022. TALM: Tool Augmented Language Models. arXiv:2205.12255 [cs.CL]
[134] Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. 2023. OpenWebMath: An Open Dataset of
High-Quality Mathematical Web Text. arXiv:2310.06786 [cs.AI]
[135] Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP Models really able to Solve Simple Math Word
Problems?. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies. 2080–2094.
[136] Shuai Peng, Ke Yuan, Liangcai Gao, and Zhi Tang. 2021. MathBERT: A Pre-Trained Model for Mathematical Formula
Understanding. arXiv:2105.00377 [cs]
[137] Silviu Pitis, Michael R. Zhang, Andrew Wang, and Jimmy Ba. 2023. Boosted Prompt Ensembles for Large Language
Models. http://arxiv.org/abs/2304.05970 arXiv:2304.05970 [cs].
[138] Stanislas Polu and Ilya Sutskever. 2020. Generative Language Modeling for Automated Theorem Proving.
arXiv:2009.03393 [cs.LG]
[139] Stanislas Polu and Ilya Sutskever. 2020. Generative Language Modeling for Automated Theorem Proving. CoRR
abs/2009.03393 (2020). arXiv:2009.03393 https://arxiv.org/abs/2009.03393
[140] Jingyuan Qi, Zhiyang Xu, Ying Shen, Minqian Liu, dingnan jin, Qifan Wang, and Lifu Huang. 2023. The Art of
SOCRATIC QUESTIONING: Recursive Thinking with Large Language Models. In Conference on Empirical Methods in
Natural Language Processing. https://api.semanticscholar.org/CorpusID:264935025
[141] Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun
Chen. 2023. Reasoning with Language Model Prompting: A Survey. In Proceedings of the 61st Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki
Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 5368–5393. https://doi.org/10.18653/v1/
2023.acl-long.294
[142] Jinghui Qin, Lihui Lin, Xiaodan Liang, Rumin Zhang, and Liang Lin. 2020. Semantically-Aligned Universal Tree-
Structured Solver for Math Word Problems. In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP). 3780–3789.
[143] Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for
natural language processing: A survey. Science China Technological Sciences 63, 10 (2020), 1872–1897.
[144] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding by
Generative Pre-Training. (June 2018), 12.
[145] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are
Unsupervised Multitask Learners. (Feb. 2019), 24.
[146] Vipula Rawte, Amit Sheth, and Amitava Das. 2023. A survey of hallucination in large foundation models. arXiv
preprint arXiv:2309.05922 (2023).
[147] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and
Peter J Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67.
[148] Subhro Roy and Dan Roth. 2015. Solving General Arithmetic Word Problems. In Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Processing. 1743–1752.
[149] Subhro Roy and Dan Roth. 2017. Unit dependency graph and its application to arithmetic word problem solving. In
Proceedings of the AAAI conference on artificial intelligence, Vol. 31.
[150] Subhro Roy and Dan Roth. 2018. Mapping to declarative knowledge for word problem solving. Transactions of the
Association for Computational Linguistics 6 (2018), 159–172.
[151] Subhro Roy, Tim Vieira, and Dan Roth. 2015. Reasoning about quantities in natural language. Transactions of the
Association for Computational Linguistics 3 (2015), 1–13.
[152] Piotr Rudnicki. 1992. An overview of the Mizar project. In Proceedings of the 1992 Workshop on Types for Proofs and
Programs. Citeseer, 311–330.
[153] David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing Mathematical Reasoning Abilities
of Neural Models. In International Conference on Learning Representations.
[154] Philipp Scharpf, Moritz Schubotz, and Bela Gipp. 2022. Mining mathematical documents for question answering via
unsupervised formula labeling. In Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries. 1–11.
[155] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and
Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761 [cs.CL]
[156] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization
algorithms. arXiv preprint arXiv:1707.06347 (2017).
[157] Bilgehan Sel, Ahmad Al-Tawaha, Vanshaj Khattar, Lu Wang, Ruoxi Jia, and Ming Jin. 2023. Algorithm of thoughts:
Enhancing exploration of ideas in large language models. arXiv preprint arXiv:2308.10379 (2023).
[158] Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. 2015. Solving Geometry Prob-
lems: Combining Text and Diagram Interpretation. In Proceedings of the 2015 Conference on Empirical Methods in
Natural Language Processing, Lluís Màrquez, Chris Callison-Burch, and Jian Su (Eds.). Association for Computational
Linguistics, Lisbon, Portugal, 1466–1476. https://doi.org/10.18653/v1/D15-1171
[159] Jianhao Shen, Yichun Yin, Lin Li, Lifeng Shang, Xin Jiang, Ming Zhang, and Qun Liu. 2021. Generate & Rank: A
Multi-task Framework for Math Word Problems. http://arxiv.org/abs/2109.03034 arXiv:2109.03034 [cs].
[160] Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi
Tay, Sebastian Ruder, Denny Zhou, et al. 2022. Language models are multilingual chain-of-thought reasoners. In The
Eleventh International Conference on Learning Representations.
[161] Shuming Shi, Yuehui Wang, Chin-Yew Lin, Xiaojiang Liu, and Yong Rui. 2015. Automatically solving number word
problems by semantic parsing and reasoning. In Proceedings of the 2015 conference on empirical methods in natural
language processing. 1132–1142.
[162] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. 2023. Reflexion: Language
agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems.
[163] Kumar Shridhar, Harsh Jhamtani, Hao Fang, Benjamin Van Durme, Jason Eisner, and Patrick Xia. 2023. SCREWS: A
Modular Framework for Reasoning with Revisions. arXiv preprint arXiv:2309.13075 (2023).
[164] Kumar Shridhar, Jakub Macina, Mennatallah El-Assady, Tanmay Sinha, Manu Kapur, and Mrinmaya Sachan. 2022.
Automatic Generation of Socratic Subquestions for Teaching Math Word Problems. arXiv preprint arXiv:2211.12835
(2022).
[165] KaShun Shum, Shizhe Diao, and Tong Zhang. 2023. Automatic Prompt Augmentation and Selection with Chain-of-
Thought from Labeled Data. http://arxiv.org/abs/2302.12822 arXiv:2302.12822 [cs].
[166] Konrad Slind and Michael Norrish. 2008. A brief overview of HOL4. In International Conference on Theorem Proving
in Higher Order Logics. Springer, 28–32.
[167] Georgios Spithourakis, Isabelle Augenstein, and Sebastian Riedel. 2016. Numerically Grounded Language Models for
Semantic Error Correction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
987–992.
[168] GP Spithourakis and S Riedel. 2018. Numeracy for language models: Evaluating and improving their ability to
predict numbers. In ACL 2018-56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the
Conference (Long Papers), Vol. 56. Association for Computational Linguistics, 2104–2115.
[169] Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. 2023. AdaPlanner: Adaptive Planning from
Feedback with Language Models. arXiv preprint arXiv:2305.16653 (2023).
[170] Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton,
Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A Large Language Model for Science. http://arxiv.org/abs/2211.
09085 arXiv:2211.09085 [cs, stat].
[171] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia
Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. Lamda: Language models for dialog applications. arXiv preprint
arXiv:2201.08239 (2022).
[172] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste
Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models.
arXiv preprint arXiv:2302.13971 (2023).
[173] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models.
arXiv preprint arXiv:2307.09288 (2023).
[174] Shyam Upadhyay and Ming-Wei Chang. 2015. Draw: A challenging and diverse algebra word problem set. Technical
Report. Citeseer.
[175] Shyam Upadhyay and Ming-Wei Chang. 2017. Annotating Derivations: A New Evaluation Strategy and Dataset
for Algebra Word Problems. In Proceedings of the 15th Conference of the European Chapter of the Association for
Computational Linguistics: Volume 1, Long Papers. 494–504.
[176] Shriyash Upadhyay, Etan Ginsberg, and Chris Callison-Burch. 2023. Improving Mathematics Tutoring With A Code
Scratchpad. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA
2023), Ekaterina Kochmar, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Nitin Madnani, Anaïs Tack,
Victoria Yaneva, Zheng Yuan, and Torsten Zesch (Eds.). Association for Computational Linguistics, Toronto, Canada,
20–28. https://doi.org/10.18653/v1/2023.bea-1.2
[177] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[178] Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019. Do NLP Models Know Numbers?
Probing Numeracy in Embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 5307–5315.
[179] Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 billion parameter autoregressive language model.
[180] Cunxiang Wang, Boyuan Zheng, Yuchen Niu, and Yue Zhang. 2021. Exploring Generalization Ability of Pretrained
Language Models on Arithmetic and Logical Reasoning. In CCF International Conference on Natural Language
Processing and Chinese Computing. 758–769.
[181] Peiyi Wang, Lei Li, Liang Chen, Feifan Song, Binghuai Lin, Yunbo Cao, Tianyu Liu, and Zhifang Sui. 2023. Making
large language models better reasoners with alignment. arXiv preprint arXiv:2309.02144 (2023).
[182] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and
Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh
International Conference on Learning Representations.
[183] Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of the
2017 conference on empirical methods in natural language processing. 845–854.
[184] Zengzhi Wang, Rui Xia, and Pengfei Liu. 2023. Generative AI for Math: Part I–MathPile: A Billion-Token-Scale
Pretraining Corpus for Math. arXiv preprint arXiv:2312.17120 (2023).
[185] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022.
Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing
Systems 35 (2022), 24824–24837.
[186] Sean Welleck, Jiacheng Liu, Ronan Le Bras, Hannaneh Hajishirzi, Yejin Choi, and Kyunghyun Cho. 2021. NaturalProofs:
Mathematical Theorem Proving in Natural Language. In Thirty-fifth Conference on Neural Information Processing
Systems Datasets and Benchmarks Track (Round 1).
[187] Sean Welleck, Jiacheng Liu, Ximing Lu, Hannaneh Hajishirzi, and Yejin Choi. 2022. Naturalprover: Grounded
mathematical proof generation with language models. Advances in Neural Information Processing Systems 35 (2022),
4913–4927.
[188] Sean Welleck and Rahul Saha. 2023. llmstep: LLM proofstep suggestions in Lean. https://github.com/wellecks/llmstep.
[189] Makarius Wenzel, Lawrence C Paulson, and Tobias Nipkow. 2008. The isabelle framework. In Theorem Proving in
Higher Order Logics: 21st International Conference, TPHOLs 2008, Montreal, Canada, August 18-21, 2008. Proceedings 21.
Springer, 33–38.
[190] Yuhuai Wu, Albert Jiang, Jimmy Ba, and Roger Baker Grosse. 2021. INT: An Inequality Benchmark for Evaluating
Generalization in Theorem Proving. In International Conference on Learning Representations.
[191] Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy.
2022. Autoformalization with Large Language Models. In NeurIPS. http://papers.nips.cc/paper_files/paper/2022/
hash/d0c6bc641a56bebee9d985b937307367-Abstract-Conference.html
[192] Yuhuai Wu, Markus Rabe, and Wenda Li. [n. d.]. LIME: Learning Inductive Bias for Primitives of Mathematical
Reasoning. ([n. d.]).
[193] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023.
WizardLM: Empowering Large Language Models to Follow Complex Instructions. http://arxiv.org/abs/2304.12244
arXiv:2304.12244 [cs].
[194] Kaiyu Yang and Jia Deng. 2019. Learning to prove theorems via interacting with proof assistants. In International
Conference on Machine Learning. PMLR, 6984–6994.
[195] Zhen Yang, Ming Ding, Qingsong Lv, Zhihuan Jiang, Zehai He, Yuyi Guo, Jinfeng Bai, and Jie Tang. 2023. GPT Can
Solve Mathematical Problems Without a Calculator. arXiv preprint arXiv:2309.03241 (2023).
[196] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023.
Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023).
[197] Yao Yao, Zuchao Li, and Hai Zhao. 2023. Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Large
Language Models. arXiv preprint arXiv:2305.16582 (2023).
[198] Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, and Jonathan Berant. 2023. Answering questions by
meta-reasoning over multiple chains of thought. arXiv preprint arXiv:2304.13007 (2023).
[199] Junchi Yu, Ran He, and Rex Ying. 2023. Thought Propagation: An Analogical Approach to Complex Reasoning with
Large Language Models. arXiv preprint arXiv:2310.03965 (2023).
[200] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian
Weller, and Weiyang Liu. 2023. Metamath: Bootstrap your own mathematical questions for large language models.
arXiv preprint arXiv:2309.12284 (2023).
[201] Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. 2023. Scaling relationship
on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825 (2023).
[202] Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, and Songfang Huang. 2023. How well do Large Language
Models perform in Arithmetic tasks? arXiv preprint arXiv:2304.02015 (2023).
[203] Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2023. Mammoth:
Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653 (2023).
[204] Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari,
Aveek Purohit, Michael Ryoo, Vikas Sindhwani, et al. 2022. Socratic models: Composing zero-shot multimodal
reasoning with language. arXiv preprint arXiv:2204.00598 (2022).
[205] Ming-Liang Zhang, Fei Yin, Yi-Han Hao, and Cheng-Lin Liu. 2022. Plane Geometry Diagram Parsing. In Proceedings
of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22. 1636–1643. https://doi.org/10.
24963/ijcai.2022/228
[206] Ming-Liang Zhang, Fei Yin, and Cheng-Lin Liu. 2023. A Multi-Modal Neural Geometric Solver with Textual Clauses
Parsed from Diagram. In IJCAI.
[207] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei
Wu, and Guoyin Wang. 2023. Instruction Tuning for Large Language Models: A Survey. http://arxiv.org/abs/2308.10792
arXiv:2308.10792 [cs].
[208] Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan Roth. 2020. Do language embeddings capture
scales?. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu
(Eds.). Association for Computational Linguistics, 4889–4896. https://doi.org/10.18653/v1/2020.findings-emnlp.439
[209] Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan Roth. 2020. Do Language Embeddings
capture Scales?. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for
NLP. 292–299.
[210] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2023. Automatic Chain of Thought Prompting in Large
Language Models. In The Eleventh International Conference on Learning Representations.
[211] Ruochen Zhao, Xingxuan Li, Shafiq Joty, Chengwei Qin, and Lidong Bing. 2023. Verify-and-edit: A knowledge-
enhanced chain-of-thought framework. arXiv preprint arXiv:2305.03268 (2023).
[212] Wei Zhao, Mingyue Shang, Yang Liu, Liang Wang, and Jingming Liu. 2020. Ape210k: A large-scale and template-rich
dataset of math word problems. arXiv preprint arXiv:2009.11506 (2020).
[213] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie
Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
[214] Yilun Zhao, Yunxiang Li, Chenying Li, and Rui Zhang. 2022. MultiHiertt: Numerical Reasoning over Multi Hierarchical
Tabular and Textual Data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers). 6588–6600.
[215] Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. 2023. miniF2F: a cross-system benchmark for formal Olympiad-
level mathematics. In International Conference on Learning Representations.
[216] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan
Duan. 2023. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. arXiv:2304.06364 [cs.CL]
[217] Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie
Zhan, et al. 2023. Solving challenging math word problems using gpt-4 code interpreter with code-based self-
verification. arXiv preprint arXiv:2308.07921 (2023).
[218] Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2023. Language agent tree
search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406 (2023).
[219] Hattie Zhou, Azade Nova, Hugo Larochelle, Aaron Courville, Behnam Neyshabur, and Hanie Sedghi. 2022. Teaching
Algorithmic Reasoning via In-Context Learning. https://doi.org/10.48550/arXiv.2211.09066
[220] Lipu Zhou, Shuaixiang Dai, and Liwei Chen. 2015. Learn to solve algebra word problems using quadratic programming.
In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 817–822.
[221] Wangchunshu Zhou, Yuchen Eleanor Jiang, Ethan Wilcox, Ryan Cotterell, and Mrinmaya Sachan. 2023. Controlled
Text Generation with Natural Language Instructions. In International Conference on Machine Learning, ICML 2023,
23-29 July 2023, Honolulu, Hawaii, USA (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma
Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 42602–42613.
https://proceedings.mlr.press/v202/zhou23g.html
[222] Zhehua Zhou, Jiayang Song, Kunpeng Yao, Zhan Shu, and Lei Ma. 2023. ISR-LLM: Iterative Self-Refined Large
Language Model for Long-Horizon Sequential Task Planning. arXiv preprint arXiv:2308.13724 (2023).
[223] Zihao Zhou, Qiufeng Wang, Mingyu Jin, Jie Yao, Jianan Ye, Wei Liu, Wei Wang, Xiaowei Huang, and Kaizhu Huang.
2023. MathAttack: Attacking Large Language Models Towards Math Solving Ability. arXiv:2309.01686 [cs.CL]
[224] Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng
Chua. 2021. TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume 1: Long Papers). 3277–3287.