AI in Math Word Problem Solving
Keyur Faldu* (Embibe), Amit Sheth (University of South Carolina), Prashant Kikani (Embibe), Manas Gaur (University of South Carolina), Aditi Avasthi (Embibe)
rule-based and pattern matching algorithms would not be scalable to learn general-purpose mathematical reasoning ability.

2.2 Semantic Parsing

A semantic parsing-based system attempts to learn the semantic structure of the input problem and transform it into mathematical expressions or a set of equations. It represents problems in an object-oriented structure, much like text-to-SQL generation, following methods in natural language processing (Liguda and Pfeiffer, 2012) (Koncel-Kedziorski et al., 2015). Underlying tree dependencies of mathematical semantics are captured by context-free grammar rules, which are analogous to dependency parse trees from the perspective of mathematical semantics (Shi et al., 2015).

2.3 Statistical and Machine Learning Approaches

Prior work on statistical and machine learning approaches developed automated methods, both rule-based and semantic parsing, that process training data to derive a set of templates or learn the semantic structure of the input. They range from text classification of MWPs with equation templates, to extracting and mapping quantities with equations, to extracting entities and classifying their relationships (Kushman et al., 2014). These methods use machine learning algorithms like support vector machines, probabilistic models, or margin classifiers for concept prediction, equation prediction, or slot filling. They leverage sentence semantics or verb categorization to map mathematical operations to segments of an MWP (Hosseini et al., 2014) (Amnueypornsakul and Bhat, 2014). The set of candidate hypotheses can be further reduced by handling noun slots and number slots separately (Zhou et al., 2015). A few other techniques process narratives to fill concept-specific slots first and derive equations from them using domain-specific rules (Mitra and Baral, 2016) (Roy and Roth, 2018).
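To make verb categorization concrete, the minimal sketch below maps verbs to state changes over a quantity, loosely in the spirit of (Hosseini et al., 2014); the verb lexicon, the event extraction, and the example problem are illustrative assumptions, not the cited system.

```python
# Toy sketch of verb categorization for addition-subtraction MWPs,
# loosely in the spirit of Hosseini et al. (2014). The verb lexicon
# and the pre-extracted events below are illustrative assumptions.
VERB_CATEGORY = {
    "had": "observation",   # sets the initial quantity
    "gave": "negative",     # transfer out of the owner's set
    "lost": "negative",
    "bought": "positive",   # transfer into the owner's set
    "found": "positive",
}

def solve_transfer_problem(events):
    """Each event is (verb, quantity); fold state changes into a total."""
    total = 0
    for verb, qty in events:
        category = VERB_CATEGORY[verb]
        if category == "observation":
            total = qty
        elif category == "positive":
            total += qty
        elif category == "negative":
            total -= qty
    return total

# "Sam had 8 puppies. He gave 2 to his friends. How many are left?"
print(solve_transfer_problem([("had", 8), ("gave", 2)]))  # -> 6
```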
3 Mathematical Reasoning in Non-Neural Approaches

Mathematical reasoning ability is the key to solving mathematical problems. There were a few attempts made in this direction using reasoning-oriented subtasks like quantity entailment for semantic parsing (Roy et al., 2015), comprehending numerical information, quantity alignment prediction for slot filling, handling quantity slots and noun slots separately, ignoring extraneous information present in narratives, etc. (Bakman, 2007) (Mitra and Baral, 2016) (Roy and Roth, 2018) (Zhou et al., 2015).
Non-neural methods that consider rule-based, pattern matching, semantic parsing, and machine learning-based approaches are limited to specific areas of mathematics like addition-subtraction, arithmetic, linear algebra, calculus, or quadratic equations. The performance of such methods is constrained by the coverage and diversity of MWPs in the training corpus. The performance of such approaches could not scale on out-of-corpus MWPs, mainly because of the relatively small-sized training corpus (Koncel-Kedziorski et al., 2016) (Kushman et al., 2014). (Huang et al., 2016) attempted these techniques on a large-scale dataset, Dolphin18K, having 18,000 annotated linear and non-linear MWPs. Non-neural methods performed much worse on diverse and large-scale datasets than their reported performance on their corresponding small and specific datasets, and their performance improved sub-linearly with more extensive training data (Huang et al., 2016). This made a solid case to build more generic systems with better reasoning ability.

3.1 Domain Knowledge or External Knowledge

Classifying MWPs into subtypes and handling them using mathematical domain knowledge is another way to further push such systems' capabilities. It becomes easier to reason about a problem classified to a subtype, as it reduces the candidate math laws, axioms, and symbolic rules and contextualizes semantics related to the specific area of math (Mitra and Baral, 2016) (Roy and Roth, 2018) (Bakman, 2007) (Amnueypornsakul and Bhat, 2014). For example, addition and subtraction problems can be grouped in three classes of mathematical concepts, (i) change, (ii) part-whole, and (iii) compare, as shown in Table 1 (Mitra and Baral, 2016). Semantic parsing techniques form rules to extract semantic information specific to these subtypes. Table 1 illustrates subtypes of a problem and how each problem subtype would have different semantic slots to apply math rules. The part-whole concept has two slots, one for the whole that accepts a single variable and the other for its parts that accepts a set of variables of size at least two. The change concept has four slots, namely start, end, gains, and losses, which respectively denote the original value of a variable, the final value of that variable, and the sets of increments and decrements that happen to the original value of the variable. The comparison concept has three slots, namely the large quantity, the small quantity, and their difference (Mitra and Baral, 2016). Similarly, other such mutually exclusive groups were identified by researchers to categorize MWPs, like (i) join-separate, (ii) part-whole, and (iii) compare (Amnueypornsakul and Bhat, 2014); (i) change, (ii) combine, and (iii) compare (Bakman, 2007); or (i) transfer, (ii) dimensional analysis, (iii) part-whole relation, and (iv) explicit math (Roy and Roth, 2018).

Commonsense and linguistic knowledge could be useful to derive the semantic structure and represent the meaning of MWPs. (Briars and Larkin, 1984) investigates how commonsense knowledge could be helpful to solve complex problems narrating real-world situations. Relevant knowledge includes whether an object defined in the MWP is a member of both a set and its superset, and whether subsets can be exchanged in the context of deriving an answer for the MWP. For example, the math word problem shown in Figure 2 illustrates objects "orange" and "apple." Both belong to a superset "fruits," and they can be exchanged if the unknown quantity in the question is agnostic to the type of fruit. The Dolphin language representation of MWPs requires extracting nodes and their relationships, where nodes could be constants, classes, or functions. Commonsense knowledge is used to classify entities sharing common semantic properties (Shi et al., 2015). Linguistic knowledge like verb sense and verb entailment could help in efficiently parsing MWP narratives to extract semantic slots and fill their values. Researchers have explored such external knowledge from WordNet (Hosseini et al., 2014) or E-HowNet (Chen et al., 2005) (Lin et al., 2015).
Addition-Subtraction Math Word Problem | Subtype | Semantic Slots
Sam's dog had 8 puppies. He gave 2 to his friends. He now has 6 puppies. How many puppies did he have to start with? | Change | (i) start (ii) end (iii) gains (iv) losses
Tom went to 4 hockey games this year, but missed 7. He went to 9 games last year. How many hockey games did Tom go to in all? | Part-Whole | (i) whole (ii) parts
Bill has 9 marbles. Jim has 7 more marbles than Bill. How many marbles does Jim have? | Compare | (i) large quantity (ii) small quantity (iii) difference

Table 1: Classifying math word problems into subtypes, and the semantic slots required for solving MWPs in each subtype.
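The slot structure of Table 1 can be illustrated with a minimal sketch; the slot names follow the table, while the solving logic and function signatures are illustrative assumptions rather than the cited systems.

```python
# Minimal sketch of the subtype slots in Table 1. Slot names follow the
# table; the solving logic below is an illustrative assumption.
def solve_change(start=None, end=None, gains=(), losses=()):
    """start + sum(gains) - sum(losses) = end; solve for the missing slot."""
    delta = sum(gains) - sum(losses)
    if start is None:
        return end - delta          # unknown original value
    return start + delta            # unknown final value

def solve_part_whole(whole=None, parts=()):
    return sum(parts) if whole is None else whole - sum(parts)

def solve_compare(small=None, large=None, difference=None):
    if large is None:
        return small + difference
    if small is None:
        return large - difference
    return large - small

# The three rows of Table 1 (note the "missed 7" distractor in row two):
print(solve_change(end=6, losses=(2,)))       # puppies to start with -> 8
print(solve_part_whole(parts=(4, 9)))         # games in all -> 13
print(solve_compare(small=9, difference=7))   # Jim's marbles -> 16
```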
4 Neural Approaches

The recent advancement in deep learning approaches has opened up new possibilities. There is a flurry of research that aims to apply neural networks to solve MWPs. Sequence-to-sequence architectures, transformers, graph neural networks, convolutions, and attention mechanisms are a few such architectures and techniques of neural approaches. To empower neural approaches, larger and more diverse datasets have also been created (Amini et al., 2019), (Wang et al., 2017), (Zhang et al., 2020), (Saxton et al., 2019), (Lample and Charton, 2019). Also, the potential of neural approaches for solving complex problems on calculus and integration has raised expectations about their future promise. These approaches could be further categorized based on their problem formulation: (i) predict answers directly, (ii) generate intermediate math expressions, or (iii) retrieve a template. However, it is evident that neural networks are black boxes; it is hard to interpret their functioning and explain their decisions (Gaur et al., 2021). Attempts to interpret their functioning reveal how volatile their reasoning ability is and how they lack generalizability (Patel et al., 2021).

In the following sections, we aim to categorize the current state-of-the-art development. Then, we navigate problem formulations, datasets and analysis, approaches and architectural design choices, data augmentation methods, and interpretability and reasoning ability.

4.1 Problem Formulation

We can categorize the problem formulations to solve MWPs in three different ways: (i) predicting the answer directly, (ii) generating a set of equations or mathematical expressions and inferring answers by executing them, and (iii) retrieving the most suitable template from a pool of templates derived from training data and augmenting it with numerical quantities to compute the answer. An example is shown in Figure 2.

The first approach, predicting the answers directly, could demonstrate the inherent ability of the neural model to learn complex mathematical transformations (Csáji et al., 2001); however, the black-box nature of such models suffers from poor interpretability and explainability (Gaur et al., 2021). It has been shown that sequence-to-sequence (seq-to-seq) neural models could compute free-form answers for complex problems with high accuracy when trained on large datasets (Saxton et al., 2019) (Lample and Charton, 2019). It has been analyzed how models behave when just numerical parameters are changed vs. mathematical concepts in the test set. (Lample and Charton, 2019) achieved a near-perfect accuracy of 99%, mainly attributed to the learning ability of neural models over a huge dataset. (Ran et al., 2019) proposes a numerically aware graph neural network to directly predict the type of answer and the actual answer.

The second approach deals with generating an expression tree and executing it to compute the answer. Such models have relatively better interpretability and explainability, and the generated expressions provide scaffolding for the reasoning ability of the model. Seq-to-seq neural models built from LSTMs, Transformers, GNNs, and tree decoders have proven to be useful (Wang et al., 2017) (Amini et al., 2019) (Xie and Sun, 2019) (Qin et al., 2020) (Liang et al., 2021). The key challenge for methods in this approach is the need for expert-annotated datasets, as they need expression trees as the labels for each problem in addition to its answer (Amini et al., 2019) (Wang et al., 2017) (Koncel-Kedziorski et al., 2016).

The third approach derives template equations from training data, retrieves the most similar template, and substitutes numerical parameters. This approach suffers from limited generalization ability, as the set of templates is limited to the training set. Such methods would fail to solve diverse and out-of-corpus MWPs, as they would not be able to retrieve the template needed to solve them. It was one of the popular statistical machine learning approaches (Kushman et al., 2014) (Koncel-Kedziorski et al., 2015) (Mitra and Baral, 2016), but there are also a few neural approaches, which have studied its value in an ensemble setup or as a standalone model with larger training data (Wang et al., 2017) (Robaidek et al., 2018).

Generally, these neural approaches follow an encoder-decoder architecture to predict expression trees or directly compute the answer, specifically when the answer could be an expression in itself.
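To make the second formulation concrete, the sketch below shows only the execution step: evaluating a decoded expression over number slots against the quantities of a problem. The prefix token format, the toy problem, and the evaluator are illustrative assumptions, not a specific cited solver.

```python
import operator

# Minimal sketch of the "generate then execute" formulation: a decoder is
# assumed to emit a prefix expression over number slots n0, n1, ...; the
# neural decoder itself is omitted, and only the execution step is shown.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def evaluate_prefix(tokens, quantities):
    """Evaluate a prefix expression such as ['*', 'n0', '+', 'n1', 'n2']."""
    def helper(it):
        tok = next(it)
        if tok in OPS:
            return OPS[tok](helper(it), helper(it))
        return quantities[int(tok[1:])]   # 'n1' -> quantities[1]
    return helper(iter(tokens))

# "A pen costs 3 dollars. How much do 5 pens and 2 notebooks at
# 4 dollars each cost?"  Quantities in reading order: [3, 5, 2, 4].
tokens = ["+", "*", "n0", "n1", "*", "n2", "n3"]
print(evaluate_prefix(tokens, [3, 5, 2, 4]))  # -> 23
```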
Dataset | Mathematics Area | Language | Type | Annotation | Size
AI2 (Hosseini et al., 2014) | Arithmetic | English | Curated | Equation / Answer | 395
IL (Roy et al., 2015) | Arithmetic | English | Curated | Equation / Answer | 562
ALGES (Koncel-Kedziorski et al., 2015) | Arithmetic | English | Curated | Equation / Answer | 508
AllArith (Roy and Roth, 2017) | Arithmetic | English | Derived | Equation / Answer | 831
Alg514 (Kushman et al., 2014) | Algebraic (linear) | English | Curated | Equation / Answer | 514
Dolphin1878 (Shi et al., 2015) | Algebraic (linear) | English | Curated | Equation / Answer | 1,878
DRAW (Upadhyay and Chang, 2015) | Algebraic (linear) | English | Curated | Equation / Answer / Template | 1,000
Dolphin18K (Huang et al., 2016) | Algebraic (linear, nonlinear) | English | Curated | Equation / Answer | 18,000
MAWPS (Koncel-Kedziorski et al., 2016) | Arithmetic, Algebraic | English | Curated | Equation / Answer | 3,320
AQuA (Ling et al., 2017) | Arithmetic, Algebraic (linear, nonlinear) | English | Curated | Rationale / MCQ choices / Answer | 100,000
MathQA (Amini et al., 2019) | Arithmetic, Algebraic | English | Derived | Equation / MCQ choices / Answer | 37,000
MATH (Hendrycks et al., 2021) | Algebra, Number Theory, Probability, Geometry, Calculus | English | Curated | Step-by-step solution / Answer | 12,500
ASDiv (Miao et al., 2021) | Arithmetic, Algebraic | English | Derived | Equation / Answer + Grade / Problem-type | 2,305
SVAMP (Patel et al., 2021) | Arithmetic | English | Derived | Equation / Answer | 1,000
Math23K (Wang et al., 2017) | Algebraic (linear) | Chinese | Curated | Equation / Answer | 23,161
HMWP (Qin et al., 2020) | Algebraic (linear + nonlinear) | Chinese | Curated | Equation / Answer | 5,491
Ape210K (Zhao et al., 2020) | Algebraic (linear) | Chinese | Curated | Equation / Answer | 210,488
DeepMind Mathematics (Saxton et al., 2019) | Algebra, Probability, Calculus | English | Synthetic | Answer | 2,000,000
Integration & Differentiation Synthetic Dataset (Lample and Charton, 2019) | Integration, Differentiation | English | Synthetic | Answer | 160,000,000
AMPS (Hendrycks et al., 2021) | Algebra, Calculus, Geometry, Statistics, Number Theory | English | Synthetic | Step-by-step solution / Answer | 5,000,000

Table 2: Categorization of MWP datasets available for training AI models based on (a) Mathematics Area, (b) Language, (c) Type, and (d) Annotation.
In an expression tree, operators would be parent nodes, and operands would be their children nodes, which would recursively expand further. Encoders in such architectures vary from GRU/LSTM-based (Xie and Sun, 2019) to transformer-based (Liang et al., 2021). As tree decoders help the model learn the tree semantics of output expressions, such models perform better than seq-to-seq models on datasets of limited size. The expected output could be represented as a computational tree, which is then traversed and evaluated to compute the answer.

Graph-to-tree architectures further exploit the graph semantics present in MWPs (Ran et al., 2019) (Zhang et al., 2020) (Li et al., 2019). Graph semantics captures the relationships between different entities, which can be thought of as relationships among numerical quantities, their descriptions, and unknown quantities. The goal-driven tree-structured model is designed to generate expressions using computational goals inferred from an MWP; it uses a graph transformer network as the encoder to incorporate a quantity comparison graph and a quantity cell graph (Zhang et al., 2020). It leverages self-attention blocks for incorporating these instances of graphs. The numerical reasoning module attempts to learn comparative relations between numerical quantities by connecting them in a graph. Encoder representations are appended with representations learned by the numerical reasoning module to improve the performance (Ran et al., 2019). An MWP can be segmented into quantity spans and a question span. Specific self-attention head blocks, which could attend to just the quantity span, the question span, or both, could be used to learn graph relationships between the quantities and the question present in an MWP (Li et al., 2019).

4.4 Design Choices

In this section, we aim to highlight several design decisions for the above architectures. First, the seq-to-seq model suffers from generating spurious numbers or predicting numbers at the wrong position. Copy and alignment mechanisms can be used to avoid this problem (Huang et al., 2018a). Second, it has become a common practice to keep the decoder vocabulary limited to numbers, constants, operators, and other helpful information present in the question (Xie and Sun, 2019) (Zhang et al., 2020).
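The number-mapping convention behind this limited decoder vocabulary can be sketched as follows; the placeholder scheme (n0, n1, ...) is common across the cited systems, but the tokenization and vocabulary details here are simplifying assumptions.

```python
import re

# Illustrative sketch of the number-mapping convention: numerals in the
# problem are replaced with slots n0, n1, ..., and the decoder vocabulary
# is restricted to operators, constants, and those slots. Details vary
# across the cited systems; this is an assumption-level example.
def normalize_numbers(text):
    quantities, pieces = [], []
    for token in text.split():
        if re.fullmatch(r"\d+(\.\d+)?", token):
            pieces.append(f"n{len(quantities)}")
            quantities.append(float(token))
        else:
            pieces.append(token)
    return " ".join(pieces), quantities

problem = "Sam had 8 puppies and gave 2 to his friends ."
masked, nums = normalize_numbers(problem)
print(masked)   # Sam had n0 puppies and gave n1 to his friends .
print(nums)     # [8.0, 2.0]

# Decoder vocabulary: operators + constants + number slots of this problem.
decoder_vocab = ["+", "-", "*", "/", "1", "PI"] + [f"n{i}" for i in range(len(nums))]
```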
There are multiple ways of generating output expressions, and maximum likelihood training would compromise learning if the predicted and expected expressions are different but semantically the same. Reinforcement learning would solve this problem by rewarding the learning strategy based on the final answer (Huang et al., 2018a).

Tree-based decoders have received lots of attention because of their ability to invoke tree relationships present in mathematical expressions. Tree decoders attend to both parents and siblings to generate the next token. A bottom-up representation of a sibling's subtree could further help to derive better outcomes (Qin et al., 2020). For an efficient implementation of a tree decoder, the stack data structure can be used to store and retrieve hidden representations of the parent and sibling (Liu et al., 2019).
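A minimal sketch of stack-based pre-order tree decoding is given below; the `predict` function stands in for the neural scorer, and the goal bookkeeping is an illustrative assumption rather than the exact mechanism of (Xie and Sun, 2019) or (Liu et al., 2019).

```python
# Schematic pre-order tree decoding with an explicit stack, following the
# intuition described above (Xie and Sun, 2019; Liu et al., 2019). The
# `predict` callable stands in for the neural scorer and is an assumption.
OPERATORS = {"+", "-", "*", "/"}

def decode_expression(predict, root_goal):
    """Expand goals depth-first; each operator pushes two child goals."""
    output, stack = [], [root_goal]
    while stack:
        goal = stack.pop()
        token = predict(goal)          # neural net chooses operator/operand
        output.append(token)
        if token in OPERATORS:
            left, right = goal + ".l", goal + ".r"   # derived sub-goals
            stack.append(right)        # right child decoded after left
            stack.append(left)
    return output

# A canned predictor that yields the tree (* n0 (+ n1 n2)).
canned = {"root": "*", "root.l": "n0", "root.r": "+",
          "root.r.l": "n1", "root.r.r": "n2"}
print(decode_expression(canned.get, "root"))  # ['*', 'n0', '+', 'n1', 'n2']
```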
Tree regularization transforms encoder and subtree representations and minimizes their L2 distance as a regularization term of the loss function (Qin et al., 2020).

Graph-based encoders aim to learn different types of relationships among the constituents of MWPs. There are several attempts to construct different types of graphs. (Ran et al., 2019) inserts a numerical reasoning module that uses numerically aware graph neural networks, where two types of edges, less-than-equal-to and greater-than, connect numerical quantities present in the question. SMART, a situation model for algebra story problems, aims to build a graph using attributed grammar, which connects nodes with their attributes using relationships extracted from the problem (Hong et al., 2021b). Self-attention blocks of transformers could also be used to model the graph relationships (Zhang et al., 2020) (Li et al., 2019). The goal-driven tree-structured model incorporated a quantity cell graph and a quantity comparison graph (Zhang et al., 2020). The quantity cell graph connects a numerical quantity of the input with its descriptive words or entities, whereas the quantity comparison graph is similar to the concept used by (Ran et al., 2019). Segmenting the question into a quantity span and a question span, and using self-attention blocks to derive global attention, quantity-related attention, quantity pair attention, and question-related attention, is another such approach to derive better semantic representations (Li et al., 2019).
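The quantity comparison graph can be sketched as below; the directed edge labels follow the description above, while the adjacency encoding is an assumption rather than the exact construction of (Ran et al., 2019) or (Zhang et al., 2020).

```python
# Illustrative construction of a quantity comparison graph: directed edges
# between quantity nodes depending on their relative magnitude, as described
# for numerically aware GNNs (Ran et al., 2019) and the quantity comparison
# graph (Zhang et al., 2020). The edge encoding is an assumption.
def quantity_comparison_graph(quantities):
    """Return directed edges labeled 'greater' or 'less_equal'."""
    edges = []
    for i, qi in enumerate(quantities):
        for j, qj in enumerate(quantities):
            if i == j:
                continue
            label = "greater" if qi > qj else "less_equal"
            edges.append((i, j, label))
    return edges

# Quantities extracted from a problem in reading order.
for edge in quantity_comparison_graph([8, 2, 6]):
    print(edge)   # e.g. (0, 1, 'greater'), (1, 0, 'less_equal'), ...
```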
Multitasking is a prevalent paradigm to train the same model for multiple tasks. It enriches the semantic representations of models and avoids overfitting. Auxiliary tasks could also be part of such a setup. Auxiliary tasks like commonsense prediction, number quantity prediction, and number location prediction could be helpful (Qin et al., 2021). The commonsense prediction task aims to predict the commonsense knowledge required to solve an MWP, like the number of legs of a horse or the number of days in a week. Similarly, masked language modeling pretraining on the mathematical corpus Ape210K in a self-supervised way helps its downstream application to math word problems (Liang et al., 2021).
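A multitask objective of this kind can be sketched as a weighted sum of per-task losses from a shared encoder; the task names follow (Qin et al., 2021), but the weights and the scalar-loss interface are illustrative assumptions.

```python
# Schematic multitask objective: the main expression-generation loss is
# combined with auxiliary losses such as number quantity and number
# location prediction, as discussed above (Qin et al., 2021). The task
# names, loss values, and weights are illustrative assumptions.
def multitask_loss(losses, weights=None):
    """losses: dict of task name -> scalar loss from a shared encoder."""
    weights = weights or {task: 1.0 for task in losses}
    return sum(weights[task] * value for task, value in losses.items())

batch_losses = {
    "expression_generation": 1.32,   # main task
    "number_quantity": 0.41,         # auxiliary: how many numbers occur
    "number_location": 0.27,         # auxiliary: where they occur
    "commonsense": 0.55,             # auxiliary: required commonsense facts
}
print(multitask_loss(batch_losses, {"expression_generation": 1.0,
                                    "number_quantity": 0.2,
                                    "number_location": 0.2,
                                    "commonsense": 0.2}))
```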
4.5 Data Augmentation and Weak Supervision

Large datasets have empowered neural models to learn complex mathematical concepts like integration and differentiation (Lample and Charton, 2019) (Saxton et al., 2019), but such models often overfit on smaller datasets (Miao et al., 2021) (Patel et al., 2021). Data augmentation is a popular preprocessing technique to increase the size of training data. Generally, data augmentation creates variants of training records by applying domain-specific augmentation techniques. Several techniques were proposed for the data augmentation of MWP datasets. Reverse operation-based augmentation techniques swap an unknown quantity with a known quantity in an MWP, and the expression trees are modified to reflect the change (Liu et al., 2020). Different traversal orders of expression trees and other preprocessing techniques like lemmatization, POS tagging, sentence reordering, and stop word removal, and their impact on the performance of models, have been studied by (Griffith and Kalita, 2021). Data processing and augmentation could also help to generate adversarial datasets, which could test the reasoning ability of models (Patel et al., 2021) (Miao et al., 2021).
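A toy version of reverse operation-based augmentation is sketched below; the question templates and the single subtraction pattern are illustrative assumptions, while the swap-and-invert idea follows (Liu et al., 2020).

```python
# Toy sketch of reverse operation-based augmentation (Liu et al., 2020):
# a known quantity and the unknown are swapped, and the equation is
# algebraically inverted to match. The question templates are assumptions.
def reverse_augment(quantities, answer):
    """Original: x = n0 - n1. Augmented variant asks for n1: n1 = n0 - x."""
    n0, n1 = quantities
    original = {
        "question": f"Sam had {n0} puppies and gave {n1} away. How many are left?",
        "equation": "x = n0 - n1",
        "answer": n0 - n1,
    }
    augmented = {
        "question": f"Sam had {n0} puppies and now has {answer}. How many did he give away?",
        "equation": "x = n0 - ans",
        "answer": n0 - answer,
    }
    return original, augmented

orig, aug = reverse_augment((8, 2), answer=6)
print(orig["question"], "->", orig["answer"])   # ... -> 6
print(aug["question"], "->", aug["answer"])     # ... -> 2
```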
Weak supervision is another popular technique that generates large datasets with noisy labels. The model training process could cancel the noise and learn actual patterns. For example, generating expression trees for an MWP dataset that has only questions and answers could be done using weak supervision, where the expression trees may not always be accurate. Still, the answer they evaluate to would match the desired answer. Learning by fixing is a technique where expression trees are generated iteratively by fixing operators and operands till the tree evaluates to the desired answer (Hong et al., 2021a). Such a technique helps the model learn more diverse ways of solving mathematical problems.
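The weak-supervision idea can be sketched as a search over operator assignments until the expression evaluates to the labeled answer; brute-force enumeration here stands in for the guided, model-driven fixing of (Hong et al., 2021a).

```python
from itertools import product

# Toy version of weak supervision over expression trees: only the final
# answer is labeled, and operator choices are "fixed" until the expression
# evaluates to it (cf. learning by fixing, Hong et al., 2021a). Brute-force
# enumeration here stands in for their guided, model-driven search.
def find_consistent_expression(quantities, answer):
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    n0, n1, n2 = quantities
    for op1, op2 in product(ops, repeat=2):
        value = ops[op2](ops[op1](n0, n1), n2)
        if value == answer:
            return f"(n0 {op1} n1) {op2} n2"
    return None

# Quantities [3, 5, 2] with labeled answer 13: a consistent tree is found.
print(find_consistent_expression([3, 5, 2], 13))  # -> (n0 * n1) - n2
```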
5 Mathematical Reasoning in Neural Approaches

Mathematical reasoning is core to human intelligence. As explained before, it is a complex phenomenon interfacing natural language understanding and visual understanding to invoke mathematical transformations. Over the past few decades, there has been a flurry of research to develop models with mathematical reasoning ability; however, the progress is limited. Larger datasets definitely help models learn to solve complex and niche problems (Lample and Charton, 2019) (Saxton et al., 2019), but a common-purpose model to solve a diverse variety of math problems is still a distant reality (Patel et al., 2021) (Miao et al., 2021). Smaller datasets overfit the models, and hence their reasoning ability is questionable. Non-neural models could not improve performance with larger datasets (Huang et al., 2016). On the other hand, neural models achieve much better performance on smaller datasets and attain near-perfect performance over very large datasets; however, their black-box structure hinders interpreting their reasoning ability. There is a decent consensus among researchers to solve problems by first generating intermediate expression trees and then evaluating them using solvers to predict the answer (Wang et al., 2017) (Amini et al., 2019) (Xie and Sun, 2019). As opposed to directly predicting the final answer, generating intermediate expression trees helps understand models' reasoning ability better. Currently, the efforts are mainly to solve MWPs of primary or secondary schools, as shown in Table 2. The current state of the art is still very far from building mathematical intelligence that could also infer visual narratives along with natural language and use domain-specific knowledge.

Models with the desired reasoning ability would not only be more generalizable; such ability would also help make systems interpretable and explainable. On the other hand, interpretable and explainable systems are easier to comprehend, which helps to progress further by expanding the scope of reasoning to more complex mathematical concepts. The field of mathematics is full of axioms, rules, and formulas, and it would also be important to plug explicit and definite knowledge into such systems.
5.1 Interpretability and Explainability

Deep learning models directly predicting answers from an input problem lack interpretability and explainability, as models often end up learning shallow heuristic patterns to arrive at the answer (Huang et al., 2016) (Patel et al., 2021). That is where it becomes important to generate intermediate representations or explanations and then infer the answer using them. It not only improves interpretability but also provides scaffolding to streamline reasoning ability and hence generalizability. Such intermediate representations could be of different forms, and they could be interspersed with natural language explanations to aid the explainability of the model further.

5.1.1 Intermediate Representations

The semantic parsing approach attempts to derive semantic structure from the problem narratives. Still, such approaches are not generalizable because of the inherent limitations of statistical and classical machine learning techniques in representing meaning. However, deriving intermediate representations and inferring the solution from them would remain one of the niche areas to explore in the context of neural approaches. Transforming problem narratives to expressions or equations and then evaluating them to get an answer is one of the most important approaches, suitable for elementary-level MWPs (Wang et al., 2017). There were other attempts at deriving a representation language (Shi et al., 2015), using logic forms (Liang et al., 2018), and using intermediate meaning representations (Huang et al., 2018b). It would require much more effort to derive intermediate representations for comprehensive and complex mathematical forms and expressions.
5.1.2 Interspersed Natural Language Explanations

Natural language explanations that describe the mathematical rationale would reflect the system's reasoning ability and also make the system easier to comprehend and explain. For example, (Ling et al., 2017) uses an LSTM to focus on generating an answer rationale, which is a natural language explanation interspersed with algebraic expressions to arrive at a decision. The AQuA dataset contains 100,000 such algebraic problems with answer rationales. Such answer rationales not only improve interpretability but also provide scaffolding which helps the model learn mathematical reasoning better. Language models like WT5, which produce explanations along with predictions, could be an inspiration (Narang et al., 2020). It would be essential to extend such approaches to more complex mathematical concepts beyond algebra. Natural language explanations interspersed with mathematical expressions would be a critical area for researchers to focus on for building more comprehensive systems. Such interspersed explanations would open up the possibility of helping students understand how to solve the problem more effectively, and would foster users' trust in the system.
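An AQuA-style record with an interspersed rationale might look like the following; the field names and the problem itself are illustrative assumptions, not an actual dataset entry (Ling et al., 2017).

```python
# An AQuA-style training record (cf. Ling et al., 2017): the rationale
# interleaves natural language with algebraic steps. Field names and the
# problem are illustrative assumptions, not an actual dataset entry.
example = {
    "question": "A train travels 60 km in 1.5 hours. What is its average speed?",
    "options": ["A) 30 km/h", "B) 40 km/h", "C) 45 km/h", "D) 90 km/h"],
    "rationale": (
        "Average speed is distance divided by time. "
        "speed = 60 / 1.5 = 40. "
        "So the answer is B."
    ),
    "correct": "B",
}
print(example["rationale"])
```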
5.2 Infusing Explicit and Definitive Knowledge

Mathematical problem solving requires identifying relevant external knowledge and applying it together with mathematical rules to arrive at an answer. External knowledge could be commonsense world knowledge, particular domain knowledge, or complex mathematical formulas. There is some progress on infusing commonsense world knowledge (Liang et al., 2021) or predicting it using auxiliary tasks (Qin et al., 2021); however, the remaining two dimensions are yet to be addressed. Such external knowledge is incorporated using rule-based, pattern-matching approaches in non-neural models; on the other hand, neural models leverage data augmentation and auxiliary task training. (Zhang et al., 2020) (Xie and Sun, 2019) developed novel neural architectures to represent semantics captured in the problems using graph and tree relations. These examples are of the shallow knowledge infusion category. It would be intriguing to explore deep knowledge infusion, wherein the model architecture would transform and leverage external knowledge vector spaces during its training (Wang et al., 2020) (Faldu et al., 2021). Furthermore, it would be promising to build an evaluation benchmark for such knowledge-intensive mathematical problems to encourage and streamline efforts to solve math problems (Sheth et al., 2021).
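Shallow infusion of commonsense quantities can be sketched as a lookup that resolves implicit quantities before equation generation; the fact table and resolution interface are illustrative assumptions, not a cited system's API.

```python
# Shallow knowledge infusion, sketched: implicit quantities are resolved
# from an external commonsense table before equation generation. The fact
# table and resolution step are illustrative assumptions.
COMMONSENSE_QUANTITIES = {
    ("week", "days"): 7,
    ("horse", "legs"): 4,
    ("dozen", "items"): 12,
}

def resolve_implicit_quantity(entity, unit):
    """Look up a quantity the problem narrative never states explicitly."""
    return COMMONSENSE_QUANTITIES.get((entity, unit))

# "How many legs do 3 horses have?" -> explicit 3, implicit legs-per-horse.
legs_per_horse = resolve_implicit_quantity("horse", "legs")
print(3 * legs_per_horse)  # -> 12
```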
5.3 Reinforcement Learning

Mathematical reasoning ability comprises applying a series of mathematical transformations. Rather than learning mathematical transformations, the system could focus on synthesizing the computational graph. Learning algorithms like gradient descent work on iterative reduction of the error. The choices of these mathematical transformations are not differentiable, and hence it is unclear whether gradient descent is the optimal strategy. Learning a policy to construct such a computation graph using reinforcement learning could be helpful (Palermo et al., 2021) (Huang et al., 2018a). The exponential search space of mathematical concepts is another key problem where reinforcement learning could be useful (Wang et al., 2018b). Reinforcement learning reduces a problem to a state transition problem with a reward function. It initializes the state from a given problem and, based on the semantic information extracted from the problem, derives actions for updating the state with a reward or penalty. It learns a policy to maximize the reward as it traverses through several intermediate states. For an MWP, the computation graph of expression trees denotes the state, and an action denotes the choice of mathematical transformation and its mapping with operands from the input. The evaluation of the computation graph would give an answer, and a penalty or reward would be given based on the comparison between the computed answer and the expected ground truth. Researchers have also leveraged reinforcement learning on difficult math domains like automated theorem proving (Crouse et al., 2021) (Kaliszyk et al., 2018). It is still an early stage of the application of reinforcement learning for solving math word problems. It could bring the next set of opportunities to build systems capable of mathematical reasoning ability.
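The state-action-reward framing described above can be sketched as a toy environment: the state is a partial prefix expression, actions append operators or number slots, and the terminal reward compares the evaluated answer with the gold answer. All environment details are illustrative assumptions (cf. Huang et al., 2018a; Wang et al., 2018b).

```python
# Toy environment for the reinforcement learning framing described above.
# State: partial prefix expression; action: append an operator or number
# slot; terminal reward: match with the gold answer. All details here are
# illustrative assumptions, not a cited system.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}

def needed(tokens):
    """Operands still required to complete a prefix expression."""
    need = 1
    for tok in tokens:
        need += 1 if tok in OPS else -1
    return need

def step(state, action, quantities, gold_answer):
    state = state + [action]
    if needed(state) == 0:                      # expression complete
        it = iter(state)
        def ev():
            tok = next(it)
            return OPS[tok](ev(), ev()) if tok in OPS else quantities[int(tok[1:])]
        reward = 1.0 if ev() == gold_answer else -1.0
        return state, reward, True
    return state, 0.0, False

# Episode: build "- n0 n1" for quantities [8, 2] with gold answer 6.
s, done = [], False
for a in ["-", "n0", "n1"]:
    s, r, done = step(s, a, [8, 2], 6)
print(s, r, done)   # ['-', 'n0', 'n1'] 1.0 True
```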
6 Outlook

The mathematical reasoning process involves understanding natural language and visual narratives, extracting entities with explicit and implicit quantities, building a self-consistent symbolic representation, and invoking mathematical laws, axioms, and symbolic rules.

The ability to solve MWPs in an automated fashion could lead to many applications in the education domain for content creation and delivering learning outcomes (Donda et al., 2020) (Faldu et al., 2020a).

We surveyed sets of different approaches and their claims to have reasonably solved MWPs on specific datasets. However, we have also highlighted studies with "concrete evidence" that existing MWP solvers tend to rely on shallow heuristics to achieve their high performance, which questions these models' capabilities to solve even the simplest of MWPs robustly (Huang et al., 2016) (Patel et al., 2021) (Miao et al., 2021).

Non-neural approaches do not improve linearly with larger training data (Huang et al., 2016); on the other hand, neural approaches have shown promise to solve even complex MWPs when trained with a very large corpus (Lample and Charton, 2019) (Saxton et al., 2019). However, the availability of extensive training data with diverse MWPs is a crucial challenge. The inherent limitation of neural models in terms of interpretability and explainability could be addressed partly by predicting expression trees instead of directly predicting answers (Gaur et al., 2021). Such expression trees could be converted into equations, which are then evaluated to infer the final answer. Further, expression trees also provide scaffolding to assist mathematical reasoning ability and open up design choices like graph-based encoders and tree-based decoders, which help models learn from inadequately sized training corpora. Interspersed natural language explanations and infusing explicit knowledge could be of significant value, as they not only improve the explainability of the system but also guide the model to traverse through intermediate states, anchoring mathematical reasoning (Ling et al., 2017) (Gaur et al., 2021). Interspersed explanations not only solve the math word problems but also engage users on how to solve them.

We highlight that solving MWPs would require the inherent ability of mathematical reasoning, which comprises natural language understanding, image understanding, and the ability to invoke domain knowledge of mathematical laws, axioms, and theorems. There is a massive scope for complementing neural approaches with external expertise/knowledge and developing design choices as we extend the problem to complex mathematical areas.

References
Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319.

Bussaba Amnueypornsakul and Suma Bhat. 2014. Machine-guided solution to mathematical word problems. In Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing, pages 111–119.

Yefim Bakman. 2007. Robust understanding of word problems with extraneous information. arXiv preprint math/0701393.

Albert Bandura. 2008. Observational learning. The International Encyclopedia of Communication.

Daniel G. Bobrow. 1964. Natural language input for a computer problem solving system.

Diane J. Briars and Jill H. Larkin. 1984. An integrated model of skill in solving elementary word problems. Cognition and Instruction, 1(3):245–296.

Eugene Charniak. 1968. Calculus word problems. Ph.D. thesis, Massachusetts Institute of Technology.

Keh-Jiann Chen, Shu-Ling Huang, Yueh-Yin Shih, and Yi-Jun Chen. 2005. Extended-HowNet: A representational framework for concepts. In Proceedings of OntoLex 2005 – Ontologies and Lexical Resources.

Maxwell Crouse, Ibrahim Abdelaziz, Bassem Makni, Spencer Whitehead, Cristina Cornelio, Pavan Kapanipathi, Kavitha Srinivas, Veronika Thost, Michael Witbrock, and Achille Fokoue. 2021. A deep reinforcement learning approach to first-order logic theorem proving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6279–6287.

Balázs Csanád Csáji et al. 2001. Approximation with artificial neural networks. Faculty of Sciences, Eötvös Loránd University, Hungary, 24(48):7.

Denise Dellarosa. 1986. A computer simulation of children's arithmetic word-problem solving. Behavior Research Methods, Instruments, & Computers, 18(2):147–154.

Soma Dhavala, Chirag Bhatia, Joy Bose, Keyur Faldu, and Aditi Avasthi. 2020. Auto generation of diagnostic assessments and their quality evaluation. International Educational Data Mining Society.

Chintan Donda, Sayan Dasgupta, Soma S. Dhavala, Keyur Faldu, and Aditi Avasthi. 2020. A framework for predicting, interpreting, and improving learning outcomes. arXiv preprint arXiv:2010.02629.

Keyur Faldu, Aditi Avasthi, and Achint Thomas. 2020a. Adaptive learning machine for score improvement and parts thereof. US Patent 10,854,099.

Keyur Faldu, Amit Sheth, Prashant Kikani, and Hemang Akabari. 2021. KI-BERT: Infusing knowledge context for better language and domain understanding. arXiv preprint arXiv:2104.08145.

Keyur Faldu, Achint Thomas, and Aditi Avasthi. 2020b. System and method for recommending personalized content using contextualized knowledge base. US Patent App. 16/586,512.

Edward A. Feigenbaum, Julian Feldman, et al. 1963. Computers and Thought. New York: McGraw-Hill.

Charles R. Fletcher. 1985. Understanding and solving arithmetic word problems: A computer simulation. Behavior Research Methods, Instruments, & Computers, 17(5):565–571.

Manas Gaur, Keyur Faldu, and Amit Sheth. 2021. Semantics of the black-box: Can knowledge graphs help make deep learning systems more interpretable and explainable? IEEE Internet Computing, 25(1):51–59.

Kaden Griffith and Jugal Kalita. 2021. Solving arithmetic word problems with transformers and preprocessing of problem text. arXiv preprint arXiv:2106.00893.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

Yining Hong, Qing Li, Daniel Ciao, Siyuan Huang, and Song-Chun Zhu. 2021a. Learning by fixing: Solving math word problems with weak supervision. In AAAI Conference on Artificial Intelligence.

Yining Hong, Qing Li, Ran Gong, Daniel Ciao, Siyuan Huang, and Song-Chun Zhu. 2021b. SMART: A situation model for algebra story problems via attributed grammar. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13009–13017.

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533.

Danqing Huang, Jing Liu, Chin-Yew Lin, and Jian Yin. 2018a. Neural math word problem solver with reinforcement learning. In Proceedings of the 27th International Conference on Computational Linguistics, pages 213–223.

Danqing Huang, Shuming Shi, Chin-Yew Lin, Jian Yin, and Wei-Ying Ma. 2016. How well do computers solve math word problems? Large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 887–896.

Danqing Huang, Jin-Ge Yao, Chin-Yew Lin, Qingyu Zhou, and Jian Yin. 2018b. Using intermediate representations to solve math word problems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 419–428.

Daniel Kahneman. 2011. Thinking, Fast and Slow. Macmillan.

Cezary Kaliszyk, Josef Urban, Henryk Michalewski, and Mirek Olšák. 2018. Reinforcement learning of theorem proving. arXiv preprint arXiv:1805.07563.

Daniel Khashabi, Erfan Sadeqi Azer, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2019. On the capabilities and limitations of reasoning for natural language understanding. arXiv preprint arXiv:1901.02522.

Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. 2015. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597.

Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. MAWPS: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152–1157.

Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 271–281.

Guillaume Lample and François Charton. 2019. Deep learning for symbolic mathematics. arXiv preprint arXiv:1912.01412.

Jierui Li, Lei Wang, Jipeng Zhang, Yan Wang, Bing Tian Dai, and Dongxiang Zhang. 2019. Modeling intra-relation in math word problems with different functional multi-head attentions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6162–6167.

Chao-Chun Liang, Yu-Shiang Wong, Yi-Chung Lin, and Keh-Yih Su. 2018. A meaning-based statistical English math word problem solver. arXiv preprint arXiv:1803.06064.

Zhenwen Liang, Jipeng Zhang, Jie Shao, and Xiangliang Zhang. 2021. MWP-BERT: A strong baseline for math word problems. arXiv preprint arXiv:2107.13435.

Christian Liguda and Thies Pfeiffer. 2012. Modeling math word problems with augmented semantic networks. In International Conference on Application of Natural Language to Information Systems, pages 247–252. Springer.

Yi-Chung Lin, Chao-Chun Liang, Kuang-Yi Hsu, Chien-Tsung Huang, Shen-Yun Miao, Wei-Yun Ma, Lun-Wei Ku, Churn-Jung Liau, and Keh-Yih Su. 2015. Designing a tag-based statistical math word problem solver with reasoning and explanation. International Journal of Computational Linguistics & Chinese Language Processing, 20(2), December 2015 – Special Issue on Selected Papers from ROCLING XXVII.

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146.

Johan Lithner. 2000. Mathematical reasoning in task solving. Educational Studies in Mathematics, pages 165–190.

Qianying Liu, Wenyu Guan, Sujian Li, Fei Cheng, Daisuke Kawahara, and Sadao Kurohashi. 2020. Reverse operation based data augmentation for solving math word problems. arXiv preprint arXiv:2010.01556.

Qianying Liu, Wenyv Guan, Sujian Li, and Daisuke Kawahara. 2019. Tree-structured decoding for solving math word problems. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2370–2379.

Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2021. A diverse corpus for evaluating and developing English math word problem solvers. arXiv preprint arXiv:2106.15772.

Arindam Mitra and Chitta Baral. 2016. Learning to use formulas to solve simple arithmetic problems. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2144–2153.

Sharan Narang, Colin Raffel, Katherine Lee, Adam Roberts, Noah Fiedel, and Karishma Malkan. 2020. WT5?! Training text-to-text models to explain their predictions. arXiv preprint arXiv:2004.14546.

Joseph Palermo, Johnny Ye, and Alok Singh. 2021. A reinforcement learning environment for mathematical reasoning via program synthesis. arXiv preprint arXiv:2107.07373.

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? arXiv preprint arXiv:2103.07191.

Jinghui Qin, Xiaodan Liang, Yining Hong, Jianheng Tang, and Liang Lin. 2021. Neural-symbolic solver for math word problems with auxiliary tasks. arXiv preprint arXiv:2107.01431.

Jinghui Qin, Lihui Lin, Xiaodan Liang, Rumin Zhang, and Liang Lin. 2020. Semantically-aligned universal tree-structured solver for math word problems. arXiv preprint arXiv:2010.06823.

Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. NumNet: Machine reading comprehension with numerical reasoning. arXiv preprint arXiv:1910.06701.

Benjamin Robaidek, Rik Koncel-Kedziorski, and Hannaneh Hajishirzi. 2018. Data-driven methods for solving algebra word problems. arXiv preprint arXiv:1804.10718.

Subhro Roy and Dan Roth. 2017. Unit dependency graph and its application to arithmetic word problem solving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.

Subhro Roy and Dan Roth. 2018. Mapping to declarative knowledge for word problem solving. Transactions of the Association for Computational Linguistics, 6:159–172.

Subhro Roy, Tim Vieira, and Dan Roth. 2015. Reasoning about quantities in natural language. Transactions of the Association for Computational Linguistics, 3:1–13.

Maarten Sap, Vered Shwartz, Antoine Bosselut, Yejin Choi, and Dan Roth. 2020. Commonsense reasoning for natural language processing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 27–33.

David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557.

Amit Sheth, Manas Gaur, Kaushik Roy, and Keyur Faldu. 2021. Knowledge-intensive language understanding for explainable AI. IEEE Internet Computing, (01):1–1.

Shuming Shi, Yuehui Wang, Chin-Yew Lin, Xiaojiang Liu, and Yong Rui. 2015. Automatically solving number word problems by semantic parsing and reasoning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1132–1142.

Shane Storks, Qiaozi Gao, and Joyce Y. Chai. 2019. Commonsense reasoning for natural language understanding: A survey of benchmarks, resources, and approaches. arXiv preprint arXiv:1904.01172, pages 1–60.

Shyam Upadhyay and Ming-Wei Chang. 2015. DRAW: A challenging and diverse algebra word problem set. Technical report, Citeseer.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018a. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Lei Wang, Dongxiang Zhang, Lianli Gao, Jingkuan Song, Long Guo, and Heng Tao Shen. 2018b. MathDQN: Solving arithmetic word problems via deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Guihong Cao, Daxin Jiang, Ming Zhou, et al. 2020. K-Adapter: Infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808.

Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854.

Qinzhuo Wu, Qi Zhang, Zhongyu Wei, and Xuan-Jing Huang. 2021. Math word problem solving with explicit numerical values. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5859–5869.

Zhipeng Xie and Shichao Sun. 2019. A goal-driven tree-structured neural model for math word problems. In IJCAI, pages 5299–5305.

Ma Yuhui, Zhou Ying, Cui Guangzuo, Ren Yun, and Huang Ronghuai. 2010. Frame-based calculus of solving arithmetic multi-step addition and subtraction word problems. In 2010 Second International Workshop on Education Technology and Computer Science, volume 2, pages 476–479. IEEE.

Jipeng Zhang, Lei Wang, Roy Ka-Wei Lee, Yi Bin, Yan Wang, Jie Shao, and Ee-Peng Lim. 2020. Graph-to-tree learning for solving math word problems. Association for Computational Linguistics.

Wei Zhao, Mingyue Shang, Yang Liu, Liang Wang, and Jingming Liu. 2020. Ape210K: A large-scale and template-rich dataset of math word problems. arXiv preprint arXiv:2009.11506.

Lipu Zhou, Shuaixiang Dai, and Liwei Chen. 2015. Learn to solve algebra word problems using quadratic programming. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 817–822.