Large Language Models For Mathematical Reasoning - Progresses and Challenges
Abstract

There has been a notable surge in the development of Large Language Models (LLMs) geared towards the automated resolution of mathematical problems. However, the landscape of mathematical problem types is vast and varied, with LLM-oriented techniques undergoing evaluation across diverse datasets and settings. This diversity makes it challenging to discern the true advancements and obstacles within this burgeoning field. This survey endeavors to address four pivotal dimensions: i) a comprehensive exploration of the various mathematical problems and their corresponding datasets that have been investigated; ii) an examination of the spectrum of LLM-oriented techniques that have been proposed for mathematical problem-solving; iii) an overview of factors and concerns affecting LLMs in solving math; and iv) an elucidation of the persisting challenges within this domain. To the best of our knowledge, this survey stands as one of the first extensive examinations of the landscape of LLMs in the realm of mathematics, providing a holistic perspective on the current state, accomplishments, and future challenges in this rapidly evolving field.

1 Introduction

Mathematical reasoning is crucial to human intelligence, driving ongoing efforts in the AI community to autonomously tackle math challenges. This pursuit inherently calls for an augmentation of AI capabilities, delving into the intricate realms of textual comprehension, image interpretation, tabular analysis, symbolic manipulation, operational logic, and a nuanced grasp of world knowledge. As the AI landscape evolves, the endeavor to empower machines with a comprehensive understanding of diverse mathematical facets becomes not only a testament to technological prowess but also a pivotal step forward. Recent years have witnessed the emergence of Large Language Models (LLMs) as formidable tools for automating intricate tasks. Notably, LLMs have proven to be potent assets in unraveling the nuances of mathematical problem-solving (Romera-Paredes et al., 2023; Imani et al., 2023). Their language capabilities fuel focused exploration in utilizing them for mathematical reasoning, uncovering fresh insights into the synergy between language and logic.

However, amid this progress, the current state of LLM-oriented research in mathematics presents a complex panorama. Diverse mathematical problem types pose a formidable challenge, exacerbated by the varied evaluation metrics, datasets, and settings employed in the assessment of LLM-oriented techniques (Testolin, 2023; Lu et al., 2023c). The lack of a unified framework hampers our ability to gauge the true extent of progress achieved and impedes a coherent understanding of the challenges that persist in this evolving field.

This survey endeavors to cast a spotlight on the multifaceted landscape of LLMs in the realm of mathematics. We traverse four crucial dimensions: a meticulous exploration of math problem types and the datasets associated with them; an in-depth analysis of the evolving techniques employed by LLMs in mathematical problem-solving; an examination of the factors that affect LLMs in solving math problems; and a critical discussion of the persisting challenges that loom over this burgeoning field.

To our knowledge, this survey marks one of the first comprehensive examinations of LLMs specifically tailored for mathematics. By weaving together insights from various dimensions, we aim to provide a holistic understanding of the current state of affairs in LLM-driven mathematical reasoning, shedding light on achievements, challenges, and the uncharted territories that await exploration in this captivating intersection of language and logic.
To the best of our knowledge, the existing literature summarizing mathematical research, particularly within the context of LLMs, remains limited. Notably, Frieder et al. (2023a) compared two ChatGPT versions (9-January-2023 and 30-January-2023) and GPT-4 on four math-related tasks: producing proofs, filling holes in proofs, acting as a mathematical search engine, and performing computation. More importantly, they summarized insightful strategies for how LLMs can help mathematicians and advocated a more collaborative approach, incorporating human expertise and LLM automation, for theorem proving. Chang et al. (2023) conducted a comprehensive evaluation of LLMs, including an examination of their performance in mathematical problem-solving, albeit with a relatively brief exploration of the mathematical field. Conversely, both Testolin (2023) and Lu et al. (2023c) delved into the application of deep learning in the domain of mathematical reasoning. Our work distinguishes itself on three fronts: firstly, we concentrate on LLMs, providing a more in-depth analysis of their various advancements; secondly, beyond merely reporting progress, we engage in a thorough discussion of the challenges inherent in this trajectory; and thirdly, we extend our scrutiny to encompass the perspective of mathematics pedagogy. In doing so, we contribute a nuanced perspective that seeks to broaden the understanding of LLMs in the context of mathematical research.

The only work contemporaneous with ours is Liu et al. (2023b). In comparison, our contribution lies in: i) not only introducing various methods but also paying closer attention to the factors affecting model performance; and ii) taking a broader perspective on the progress of LLMs in the field of mathematics, viewing it not only from the AI perspective but also from the perspective of education, and emphasizing that pursuing model performance alone while neglecting human factors is a concern that deserves attention.

3 Math Problems & Datasets

This section concisely overviews prominent mathematical problem types and associated datasets, spanning ARITHMETIC, MATH WORD PROBLEMS, GEOMETRY, AUTOMATED THEOREM PROVING, and MATH IN VISION CONTEXT.

3.1 Arithmetic

This category of problems entails pure mathematical operations and numerical manipulation, devoid of the need for the model to interpret text, images, or other contextual elements. An illustrative example is presented below, where "Q" denotes the question and "A" the answer.

Q: 21 + 97
A: 118

The dataset MATH-401 (Yuan et al., 2023) contains 401 arithmetic expressions across 17 groups.
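Evaluation for this problem type typically reduces to exact-match between the model's output and the gold value. The sketch below illustrates such a check; the query_llm callable, the prompt format, and the answer normalization are illustrative assumptions rather than the protocol of any cited benchmark.

```python
# A minimal evaluation sketch for the ARITHMETIC setting described above.
# `query_llm` is a hypothetical callable standing in for any LLM API; the
# prompt format and normalization are illustrative assumptions.
from typing import Callable, List, Tuple


def arithmetic_accuracy(items: List[Tuple[str, str]],
                        query_llm: Callable[[str], str]) -> float:
    """Exact-match accuracy over (expression, gold_answer) pairs."""
    correct = 0
    for expression, gold in items:
        prediction = query_llm(f"Q: {expression}\nA:")
        # Models often wrap the number in prose ("The answer is 118."),
        # so keep the last whitespace-separated token and strip punctuation.
        tokens = prediction.strip().split()
        normalized = tokens[-1].strip(".,$ ") if tokens else ""
        correct += int(normalized == gold)
    return correct / len(items) if items else 0.0


# Example with the item shown in the text:
# arithmetic_accuracy([("21 + 97", "118")], query_llm=my_model)
```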
3.2 Math Word Problems

MATH WORD PROBLEMS (MWP) are mathematical exercises or scenarios presented in the form of written or verbal descriptions rather than the straightforward equations of ARITHMETIC. These problems require individuals to decipher the information provided, identify the relevant mathematical concepts, and formulate equations or expressions to solve the given problem. MWP often reflect real-world situations, allowing individuals to apply mathematical principles to practical contexts. Solving these problems typically involves critical thinking, problem-solving skills, and the application of mathematical operations to find a solution.

MWP invariably comprise a question (Q) and its corresponding final answer (A) (referred to as Question-Answer). However, the presence or absence of additional clues can give rise to various versions of these problems. Variations may emerge based on factors such as the availability of an equation (E; referred to as Question-Equation-Answer) or the provision of a step-by-step rationale (R; Question-Rationale-Answer) to guide the problem-solving process.

Question-Answer. An instance of this type of MWP consists of a question (Q) and the final answer (A), such as:

Q: Lily received $20 from her mum. After spending $10 on a storybook and $2.5 on a lollipop, how much money does she have left?
A: $7.5
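To make these format variants concrete, the sketch below encodes the Lily problem in each of the three forms. The class name, field names, and the illustrative equation and rationale strings are assumptions for exposition, not the schema of any particular dataset.

```python
# An illustrative encoding of the three MWP instance formats, using the Lily
# problem above. Field names and the equation/rationale strings are
# assumptions for illustration only.
from dataclasses import dataclass
from typing import Optional


@dataclass
class MWPInstance:
    question: str                    # Q: natural-language problem statement
    answer: str                      # A: final answer
    equation: Optional[str] = None   # E: present in Question-Equation-Answer data
    rationale: Optional[str] = None  # R: present in Question-Rationale-Answer data


QUESTION = ("Lily received $20 from her mum. After spending $10 on a storybook "
            "and $2.5 on a lollipop, how much money does she have left?")

qa = MWPInstance(question=QUESTION, answer="$7.5")
qea = MWPInstance(question=QUESTION, answer="$7.5",
                  equation="x = 20 - 10 - 2.5")
qra = MWPInstance(question=QUESTION, answer="$7.5",
                  rationale="Lily spends $10 + $2.5 = $12.5 in total, "
                            "leaving $20 - $12.5 = $7.5.")
```

The Question-Equation-Answer and Question-Rationale-Answer groups in the dataset overview below differ only in which of these optional fields they supply.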
NAME | SIZE | LEVEL | NOTE

Question-Answer
CMATH (Wei et al., 2023) | 1.7K | E | Chinese; grade 1-6
SAT-MATH (Zhong et al., 2023) | 220 | H | Multi-choice
SVAMP (Patel et al., 2021) | 1K | E | Three types of variations
ASDiv (Miao et al., 2020) | 2.3K | E | Problem type and grade level annotated

Question-Equation-Answer
MAWPS (Koncel-Kedziorski et al., 2016) | 3.3K | E | Extension of AddSub, MultiArith, etc.
ParaMAWPS (Raiyan et al., 2023) | 16K | E | Paraphrased, adversarial MAWPS
SingleEQ (Koncel-Kedziorski et al., 2015) | 508 | E |
AddSub (Hosseini et al., 2014) | 395 | E | Only addition and subtraction
MultiArith (Roy and Roth, 2015) | 600 | E | Multi-step reasoning
DRAW-1K (Upadhyay and Chang, 2017) | 1K | E |
Math23K (Wang et al., 2017) | 23K | E | Chinese
Ape210K (Zhao et al., 2020) | 210K | E | Chinese
K6 (Yang et al., 2023) | 600 | E | Chinese; grade 1-6
CM17K (Qin et al., 2021) | 17K | M, H | Chinese; grade 6-12
CARP (Zhang et al., 2023a) | 4.9K | M | Chinese
GSM8K (Cobbe et al., 2021) | 8.5K | M | Linguistically diverse

Question-Rationale-Answer
MATH (Hendrycks et al., 2021) | 12.5K | H | Problems are put into difficulty levels 1-5
PRM800K (Lightman et al., 2023) | 12K | H | MATH w/ step-wise labels
MathQA (Amini et al., 2019) | 37K | C | GRE examinations; has quality concerns
AQuA (Ling et al., 2017) | 100K | C | GRE & GMAT questions
ARB (Sawada et al., 2023) | 105 | C | Contest problems and university math proofs
GHOSTS (Frieder et al., 2023b) | 709 | C |
TheoremQA-MATH (Chen et al., 2023b) | 442 | C | Theorem as rationale
LILA (Mishra et al., 2022) | 132K | H | Incorporates 20 existing datasets
MathInstruct (Yue et al., 2023) | 260K | H | Instruction-following style
TabMWP (Lu et al., 2023b) | 38K | H | Tabular MWP; below the College level
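For the Question-Rationale-Answer datasets listed above, scoring usually requires extracting the final answer from a free-form generated rationale before comparing it with the gold answer. The sketch below shows one common heuristic; the "####" marker convention (used, for example, by GSM8K) and the last-number fallback are assumptions to adapt per dataset, not a universal rule.

```python
# A hedged sketch of final-answer extraction for Question-Rationale-Answer
# style outputs. Both conventions below are assumptions to adapt per dataset.
import re


def extract_final_answer(text: str) -> str:
    marked = re.search(r"####\s*(.+)", text)
    if marked:  # explicit answer marker present (e.g., GSM8K-style solutions)
        return marked.group(1).strip()
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return numbers[-1].replace(",", "") if numbers else ""  # fallback: last number


# extract_final_answer("She spends $12.5, so $20 - $12.5 = $7.5 is left.")  # -> "7.5"
```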
References

Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, and Weizhu Chen. 2023b. Learning from mistakes makes LLM better reasoner. CoRR, abs/2310.20689.
Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. PaLM 2 technical report. CoRR, abs/2305.10403.
Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. 2022. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression. In Proceedings of EMNLP, pages 3313–3323.
Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. 2021a. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning. In Findings of ACL/IJCNLP, volume ACL/IJCNLP 2021, pages 513–523.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021b. Evaluating large language models trained on code. CoRR, abs/2107.03374.
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2023a. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research.
Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. 2023b. Theoremqa: A theorem-driven question answering dataset. In Proceedings of EMNLP, pages 7889–7901.
Vincent Cheng and Yu Zhang. 2023. Analyzing ChatGPT's mathematical deficiencies: Insights and contributions. In Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023), pages 188–193.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. CoRR, abs/2110.14168.
Aniruddha Deb, Neeva Oza, Sarthak Singla, Dinesh Khandelwal, Dinesh Garg, and Parag Singla. 2023. Fill in the blank: Exploring and enhancing LLM capabilities for backward reasoning in math word problems. CoRR, abs/2310.01991.
Iddo Drori, Sarah Zhang, Reece Shuttleworth, Leonard Tang, Albert Lu, Elizabeth Ke, Kevin Liu, Linda Chen, Sunny Tran, Newman Cheng, et al. 2022. A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level. Proceedings of the National Academy of Sciences, 119(32):e2123433119.
Simon Frieder, Julius Berner, Philipp Petersen, and Thomas Lukasiewicz. 2023a. Large language models for mathematicians. Internationale Mathematische Nachrichten, 254:1–20.
Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier, and Julius Berner. 2023b. Mathematical capabilities of chatgpt. CoRR, abs/2301.13867.
Sai Gattupalli, William Lee, Danielle Allessio, Danielle Crabtree, Ivon Arroyo, and Beverly Woolf. 2023. Exploring pre-service teachers' perceptions of large language models-generated hints in online mathematics learning.
Vedant Gaur and Nikunj Saunshi. 2023. Reasoning in large language models through symbolic math word problems. In Findings of ACL, pages 5889–5903.
Sophia Gu. 2023. Llms as potential brainstorming partners for math and science problems. CoRR, abs/2310.10677.
Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward W. Ayers, and Stanislas Polu. 2022. Proof artifact co-training for theorem proving with language models. In Proceedings of ICLR.
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. Deberta: decoding-enhanced bert with disentangled attention. In Proceedings of ICLR.
Joy He-Yueya, Gabriel Poesia, Rose E. Wang, and Noah D. Goodman. 2023. Solving math word problems by combining language models with symbolic solvers. CoRR, abs/2304.09102.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. In Proceedings of NeurIPS.
Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of EMNLP, pages 523–533. ACL.
Shima Imani, Liang Du, and Harsh Shrivastava. 2023. Mathprompter: Mathematical reasoning using large language models. In Proceedings of ACL, pages 37–42.
Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. 2015. Parsing algebraic word problems into equations. Trans. Assoc. Comput. Linguistics, 3:585–597.
Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. MAWPS: A math word problem repository. In Proceedings of NAACL, pages 1152–1157.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. 2022. Solving quantitative reasoning problems with language models.
Zhenwen Liang, Dian Yu, Xiaoman Pan, Wenlin Yao, Qingkai Zeng, Xiangliang Zhang, and Dong Yu. 2023a. Mint: Boosting generalization in mathematical reasoning via multi-view fine-tuning. CoRR, abs/2307.07951.
Zhenwen Liang, Wenhao Yu, Tanmay Rajpurohit, Peter Clark, Xiangliang Zhang, and Ashwin Kalyan. 2023b. Let GPT be a math tutor: Teaching math word problem solvers with customized exercise generation. CoRR, abs/2305.14386.
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. CoRR, abs/2305.20050.
Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of ACL, pages 158–167.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35.
Wentao Liu, Hanglei Hu, Jie Zhou, Yuyang Ding, Junsong Li, Jiayi Zeng, Mengliang He, Qin Chen, Bo Jiang, Aimin Zhou, and Liang He. 2023b. Mathematical language models: A survey. CoRR, abs/2312.07622.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Yixin Liu, Avi Singh, C. Daniel Freeman, John D. Co-Reyes, and Peter J. Liu. 2023c. Improving large language model fine-tuning for solving math problems. CoRR, abs/2310.10047.
Renze Lou, Kai Zhang, and Wenpeng Yin. 2023. Is prompt all you need? No. A comprehensive and broader view of instruction learning. arXiv preprint arXiv:2303.10475.
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. 2023a. Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models. CoRR, abs/2310.02255.
Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. 2021. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. In Proceedings of ACL/IJCNLP, pages 6774–6786.
Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. 2023b. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In Proceedings of ICLR.
Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. 2023c. A survey of deep learning for mathematical reasoning. In Proceedings of ACL, pages 14605–14631.
Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. CoRR, abs/2308.09583.
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq R. Joty, and Enamul Hoque. 2022. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of ACL, pages 2263–2279.
Nikolaos Matzakos, Spyridon Doukakis, and Maria Moundridou. 2023. Learning mathematics with large language models: A comparative study with computer algebra systems and other tools. International Journal of Emerging Technologies in Learning (iJET), 18(20):51–71.
Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing english math word problem solvers. In Proceedings of ACL, pages 975–984.
Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. 2022. LILA: A unified benchmark for mathematical reasoning. In Proceedings of EMNLP, pages 5807–5832.
Kole Norberg, Husni Almoubayyed, Stephen E. Fancsali, Logan De Ley, Kyle Weldon, April Murphy, and Steven Ritter. 2023. Rewriting math word problems with large language models. In Proceedings of the Workshop on Empowering Education with LLMs - the Next-Gen Interface and Content Generation 2023, co-located with the 24th International Conference on Artificial Intelligence in Education (AIED 2023), Tokyo, Japan, July 7, 2023, volume 3487 of CEUR Workshop Proceedings, pages 163–172.
Maxwell I. Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. 2021. Show your work: Scratchpads for intermediate computation with language models. CoRR, abs/2112.00114.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In NeurIPS.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of NAACL-HLT, pages 2080–2094.
Jinghui Qin, Xiaodan Liang, Yining Hong, Jianheng Tang, and Liang Lin. 2021. Neural-symbolic solver for math word problems with auxiliary tasks. In Proceedings of ACL/IJCNLP, pages 5870–5881.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Syed Rifat Raiyan, Md. Nafis Faiyaz, Shah Md. Jawad Kabir, Mohsinul Kabir, Hasan Mahmud, and Md Kamrul Hasan. 2023. Math word problem solving by generating linguistic variants of problem statements. CoRR, abs/2306.13899.
Nitin Rane. 2023. Enhancing mathematical capabilities through chatgpt and similar generative artificial intelligence: Roles and challenges in solving mathematical problems. SSRN Electronic Journal.
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. 2023. Mathematical discoveries from program search with large language models. Nature, pages 1–3.
Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of EMNLP, pages 1743–1752.
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton-Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2023. Code llama: Open foundation models for code. CoRR, abs/2308.12950.
Mrinmaya Sachan, Avinava Dubey, and Eric P. Xing. 2017. From textbooks to knowledge: A case study in harvesting axiomatic knowledge from textbooks to solve geometry problems. In Proceedings of EMNLP, pages 773–784.
Mrinmaya Sachan and Eric P. Xing. 2017. Learning to solve geometry problems from natural language demonstrations in textbooks. In Proceedings of *SEM @ACM, pages 251–261.
Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J. Nay, Kshitij Gupta, and Aran Komatsuzaki. 2023. ARB: advanced reasoning benchmark for large language models. CoRR, abs/2307.13692.
Min Joon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. 2015. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of EMNLP, pages 1466–1476.
Paulo Shakarian, Abhinav Koyyalamudi, Noel Ngu, and Lakshmivihari Mareedu. 2023. An independent evaluation of chatgpt on mathematical word problems (MWP). In Proceedings of the AAAI 2023 Spring Symposium on Challenges Requiring the Combination of Machine Learning and Knowledge Engineering (AAAI-MAKE 2023), Hyatt Regency, San Francisco Airport, California, USA, March 27-29, 2023, volume 3433 of CEUR Workshop Proceedings.
Alessandro Stolfo, Zhijing Jin, Kumar Shridhar, Bernhard Schölkopf, and Mrinmaya Sachan. 2023. A causal framework to quantify the robustness of mathematical reasoning with language models. In Proceedings of ACL, pages 545–561.
Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A large language model for science. CoRR, abs/2211.09085.
Alberto Testolin. 2023. Can neural networks do arithmetic? A survey on the elementary numerical skills of state-of-the-art deep learning models. CoRR, abs/2303.07735.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
Trieu Trinh, Yuhuai Wu, Quoc Le, He He, and Thang Luong. 2024. Solving olympiad geometry without human demonstrations. Nature.
Shyam Upadhyay and Ming-Wei Chang. 2017. Annotating derivations: A new evaluation strategy and dataset for algebra word problems. In Proceedings of EACL, pages 494–504.
Ben Wang and Aran Komatsuzaki. 2021. Gpt-j-6b: A 6 billion parameter autoregressive language model.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In Proceedings of ICLR.
Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of EMNLP, pages 845–854.
Zichao Wang, Andrew S. Lan, and Richard G. Baraniuk. 2021. Math word problem generation with mathematical consistency and problem context constraints. In Proceedings of EMNLP, pages 5986–5999.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of NeurIPS.
Tianwen Wei, Jian Luan, Wei Liu, Shuang Dong, and Bin Wang. 2023. CMATH: can your language model pass chinese elementary school math test? CoRR, abs/2306.16636.
Makarius Wenzel, Lawrence C Paulson, and Tobias Nipkow. 2008. The isabelle framework. In Theorem Proving in Higher Order Logics: 21st International Conference, TPHOLs 2008, Montreal, Canada, August 18-21, 2008. Proceedings 21, pages 33–38. Springer.
Yiran Wu, Feiran Jia, Shaokun Zhang, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, Qingyun Wu, and Chi Wang. 2023. An empirical study on challenging math problem solving with GPT-4. CoRR, abs/2306.01337.
Ryutaro Yamauchi, Sho Sonoda, Akiyoshi Sannai, and Wataru Kumagai. 2023. LPML: llm-prompting markup language for mathematical reasoning. CoRR, abs/2309.13078.
Kaiyu Yang and Jia Deng. 2019. Learning to prove theorems via interacting with proof assistants.
Zhen Yang, Ming Ding, Qingsong Lv, Zhihuan Jiang, Zehai He, Yuyi Guo, Jinfeng Bai, and Jie Tang. 2023. GPT can solve mathematical problems without a calculator. CoRR, abs/2309.03241.
Jie Yao, Zihao Zhou, and Qiufeng Wang. 2023. Solving math word problem with problem type classification. In Proceedings of NLPCC, volume 14304, pages 123–134.
An-Zi Yen and Wei-Ling Hsu. 2023. Three questions concerning the use of large language models to facilitate mathematics learning. CoRR, abs/2310.13615.
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2023. Metamath: Bootstrap your own mathematical questions for large language models. CoRR, abs/2309.12284.
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, and Songfang Huang. 2023. How well do large language models perform in arithmetic tasks? CoRR, abs/2304.02015.
Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2023. Mammoth: Building math generalist models through hybrid instruction tuning. CoRR, abs/2309.05653.
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. GLM-130B: an open bilingual pre-trained model. In Proceedings of ICLR.
Beichen Zhang, Kun Zhou, Xilin Wei, Wayne Xin Zhao, Jing Sha, Shijin Wang, and Ji-Rong Wen. 2023a. Evaluating and improving tool-augmented computation-intensive math reasoning. arXiv preprint arXiv:2306.02408.
Mengxue Zhang, Zichao Wang, Zhichao Yang, Weiqi Feng, and Andrew S. Lan. 2023b. Interpretable math word problem solution generation via step-by-step planning. In Proceedings of ACL, pages 6858–6877.
Wei Zhao, Mingyue Shang, Yang Liu, Liang Wang, and Jingming Liu. 2020. Ape210k: A large-scale and template-rich dataset of math word problems.
Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. 2022. Minif2f: a cross-system benchmark for formal olympiad-level mathematics.
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. 2023. Agieval: A human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364.
Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, and Hongsheng Li. 2023a. Solving challenging math word problems using GPT-4 code interpreter with code-based self-verification. CoRR, abs/2308.07921.
Zihao Zhou, Qiufeng Wang, Mingyu Jin, Jie Yao, Jianan Ye, Wei Liu, Wei Wang, Xiaowei Huang, and Kaizhu Huang. 2023b. Mathattack: Attacking large language models towards math solving ability. CoRR, abs/2309.01686.
Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Yongfeng Huang, Ruyi Gan, Jiaxing Zhang, and Yujiu Yang. 2023. Solving math word problems via cooperative reasoning induced language models. In Proceedings of ACL, pages 4471–4485.
Mingyu Zong and Bhaskar Krishnamachari. 2023. Solving math word problems concerning systems of equations with GPT-3. In Proceedings of AAAI, pages 15972–15979.