Mamo: a Mathematical Modeling Benchmark with Solvers

Xuhan Huang
The Chinese University of Hong Kong, Shenzhen
xuhanhuang@[Link]

Qingning Shen
The Chinese University of Hong Kong, Shenzhen
qingningshen@[Link]

Yan Hu
The Chinese University of Hong Kong, Shenzhen
Shenzhen Research Institute of Big Data
sthuyan@[Link]

Anningzhe Gao ∗
Shenzhen Research Institute of Big Data
gaoanningzhe@[Link]

Benyou Wang ∗
The Chinese University of Hong Kong, Shenzhen
Shenzhen Research Institute of Big Data
wangbenyou@[Link]

∗ Corresponding authors: gaoanningzhe@[Link], wangbenyou@[Link]

Abstract

Mathematical modeling involves representing real-world phenomena, systems,
or problems using mathematical expressions and equations to an-
alyze, understand, and predict their behavior. Given that this process
typically requires experienced experts, there is an interest in exploring
whether Large Language Models (LLMs) can undertake mathematical modeling
to potentially decrease human labor. To evaluate LLMs in mathematical
modeling, we introduce a new benchmark, Mamo, that transcends
traditional result-oriented assessments. Unlike conventional methods that
primarily assess LLMs based on the accuracy of solutions to mathematical
problems, our approach offers deeper insight into the modeling process
itself. By focusing on the processes LLMs undertake rather than the correct-
ness of their final solutions, Mamo pioneers a novel evaluation paradigm.
This shift underscores the importance of understanding the inherent mod-
eling capabilities of LLMs, paving the way for a more nuanced and com-
prehensive analysis of their problem-solving strategies. Our work marks a
significant advancement in the field, suggesting a new direction for future
research by emphasizing the evaluation of LLMs’ modeling processes over
the mere correctness of answers. This benchmark not only facilitates a better
understanding of LLMs’ mathematical modeling capabilities but also sets a
new standard for evaluating their performance in complex problem-solving
scenarios.

1 Introduction

Recent advancements in Large Language Models (LLMs) have garnered widespread inter-
est, demonstrating remarkable capabilities across a broad spectrum of natural language
processing tasks Ouyang et al. (2022); OpenAI (2023); Nijkamp et al. (2022); Tang et al. (2024).
However, the domain of mathematics remains a critical aspect of LLM evaluation Wei et al.
(2022). This focus not only gauges LLMs’ proficiency in comprehending and addressing
mathematical challenges but also serves as a profound indicator of their underlying abilities
in abstract conceptualization and logical reasoning, among other intellectual faculties.
Natural language captures human complexity and nuance, while mathematical language excels
in precision and modeling the natural world (Giordano et al., 2013). Bridging their gap
requires high-level cognitive skills akin to Artificial General Intelligence (AGI), a task at
which LLMs show promise. However, the challenge lies in the evaluation of modeling, as
perfectly representing reality may be inherently elusive.
In this endeavor, solvers1 become critically important, as they standardize the modeling
output and could provide a way to falsifiably validate the produced answers. Recent studies
have already combined LLMs with solvers Yang et al. (2023); Feng et al. (2023); Pan et al.
(2023). In the field of optimization, AhmadiTeshnizi et al. (2023b) introduced OptiMUS, an
LLM-based agent that uses a solver to address optimization problems. The key task for the LLM
in OptiMUS involves first generating a mathematical formulation of a real-world problem,
followed by writing Python code to invoke a solver. This implies the potential for testing
LLMs' modeling capabilities.
We thus develop Mamo, a new benchmark that leverages solvers to assess the mathematical
modeling prowess of large language models (LLMs). Our approach shifts from conven-
tional outcome-focused evaluation to process-focused evaluation, offering in-depth insights
into the LLMs’ problem-solving strategies. We concentrate on the models’ mathematical
modeling skills, delegating the task of solving abstract problems to solvers to circumvent
errors arising from the models’ computational capabilities.
The contributions of this paper are as follows:

1. Clarification of a Mathematical Modeling Framework for LLMs in Natural Language
Domains: We have meticulously clarified the concept of mathematical modeling
for Large Language Models (LLMs), focusing on the realm of natural
language challenges. This effort has culminated in the establishment of a compre-
hensive framework designed to guide the mathematical modeling process in LLMs,
addressing both theoretical underpinnings and practical applications.
2. Innovation in Benchmarking Strategy through Solver Integration: Our research
introduces a pioneering benchmarking strategy for the assessment of the mathematical
modeling process. By incorporating solvers into the benchmarking methodology,
this strategy facilitates a rigorous evaluation process. This novel approach sets a
new standard for assessment in NLP research.
3. Creation of the Mamo Benchmark for Extensive Mathematical Modeling Assess-
ment: We have developed a new benchmark, named Mamo, specifically tailored for
the evaluation of mathematical modeling capabilities. Mamo encompasses a wide
array of modeling questions, focusing on areas such as ordinary differential equa-
tions and optimization problems within linear programming and mixed-integer
linear programming frameworks. With a total of 1059 meticulously curated ques-
tions, Mamo stands as a comprehensive tool for assessing the breadth and depth of
mathematical modeling proficiency.

2 Related Work

Recent studies have made significant strides in applying LLMs to mathematical problem-
solving, introducing innovative datasets for comprehensive evaluation. Frieder et al. (2024)
and Yuan et al. (2023) have focused on datasets that test LLMs across various math problems
and arithmetic expressions, respectively. Lewkowycz et al. (2022) explored training models
using natural language paired with LaTeX-formatted mathematical content from arXiv,
enhancing syntactic understanding. Zhang et al. (2024) developed the CARP dataset for
computation-intensive challenges, while He et al. (2024) introduced OlympiadBench, a bilin-
gual and multimodal benchmark for competition-level mathematical reasoning and proofs.
Liu et al. (2024) constructed MathBench, which is designed with a carefully structured
difficulty hierarchy and focuses on assessing the understanding of theoretical knowledge
in LLMs. AhmadiTeshnizi et al. (2023a) constructed NLP4LP, a benchmark comprising 52
linear programming (LP) and mixed-integer linear programming (MILP) problems. These
efforts collectively push the boundaries of LLMs in mathematical cognition, showcasing the
growing complexity and specificity of tasks LLMs are being trained to tackle.
1 In optimization, solvers are algorithms used to find the best solution to a problem. They search
through possible options to maximize or minimize an objective under given constraints, if any. By
using solvers, the challenge becomes problem formulation under a given solver or a solver set.
The pursuit of enhancing and devising novel solvers has seen a variety of research efforts,
with a notable focus on the integration of LLMs. Yang et al. (2023) investigated the potential
of employing LLMs as direct solvers, pioneering the exploration of LLMs’ autonomous
problem-solving capabilities. He-Yueya et al. (2023) introduced a hybrid approach that
marries an LLM with an external symbolic solver, specifically for equation solving, marking
a significant advancement in combining neural and symbolic computation. Furthermore,
Pan et al. (2023) developed LOGIC-LM, a cutting-edge framework that fuses LLMs with
symbolic solvers, aimed at enhancing the solving of logical problems. In a closely related
study, AhmadiTeshnizi et al. (2023a) introduced OptiMUS, an LLM-based agent explicitly
crafted for formulating and solving optimization problems, thus demonstrating the broad
applicability and potential of LLMs in tackling intricate computational challenges. These
endeavors underscore the innovative directions being explored in the realm of solver
development, leveraging the robust capabilities of LLMs.

3 Background

3.1 Mathematical Modeling

Giordano et al. (2013) emphasize the role of mathematical models as intermediaries that
transform real-world problems into structured mathematical forms, facilitating the discov-
ery of solutions. Despite LLMs’ proven competencies across various tasks, their capacity
for developing mathematical models remains uncertain. This involves two key phases:
model formulation, which demands a thorough comprehension of the problem, and model
resolution, typically addressed by computational solvers. To accurately gauge LLMs’ poten-
tial in this domain, our strategy focuses on their ability to construct mathematical models,
entrusting calculations to specialized solvers. This approach highlights LLMs’ creativity
and generalization skills, simultaneously capitalizing on solvers’ computational efficiency.

Motivations to benchmark mathematical modeling for LLMs We evaluate mathematical
modeling for two reasons. The first reason is out of curiosity: we aim to investigate whether
LLMs can do what expert humans can. Mathematical modeling is considered a highly
advanced capability that we believe experts possess. This is a beneficial test for exploring the
boundaries of LLMs’ capabilities and to what extent they can approach human intelligence.
On the other hand, in terms of evaluating the models themselves, most current benchmarks,
especially those related to mathematics like GSM8K Cobbe et al. (2021), etc., show that
the performance has become saturated. Even a 7B model Yu et al. (2023b); Liu et al. (2023)
(sometimes with a verifier-like model Yu et al. (2023a)) can achieve an accuracy exceeding 80%,
closely approaching that of GPT-4. Therefore, we believe there is a need to construct more
challenging mathematical datasets. Traditional mathematical modeling is a labor-intensive,
custom process reliant on human expertise, raising the question: Can large language models
automate or assist in this process?

3.2 Using Solvers in Mathematical Modeling

The model generated by the LLM is then subjected to verification against real-world data to
assess its validity, ensuring it addresses the problem logically and sensibly. The solver, an
algorithmic tool designed to find solutions to mathematical models, is also involved in this
process and is beneficial for verification.
Different types of solvers are specialized for various classes of problems, such as optimization
solvers. However, they share a common property: they are all algorithmic tools
designed to find solutions to mathematical models.
Rationale for Using Solvers in Mathematical Modeling for LLMs These solvers operationalize
the abstract constructs created by LLMs, translating them into tangible outputs that
can be analyzed and validated against real-world data and phenomena. The solver is a
crucial instrument that resolves the model presented by the LLM, and by comparing the
solution it provides with the correct answer, we assess the accuracy of the mathematical
model generated by the LLM.

4 Philosophy of Mamo Benchmark


4.1 Methodology: exact answer verification with solvers

We utilize a solver to resolve the mathematical model. The solver operates at the final state of
the question-solving process, meaning it directly computes the final answer from the model
without intermediate steps. By executing the solver and juxtaposing its output ($\hat{A}$) with
the correct answer ($A$) to the original problem, we can verify the accuracy of the language
model's mathematical model ($\hat{M}$). This final-state approach ensures that our assessment
is squarely focused on the language model's capacity to model the problem, abstracting
from any additional computational steps that might otherwise obscure the evaluation of the
model's validity.

Figure 1: The pipeline to use exact answer verification via an additional solver.
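As a minimal, illustrative sketch of this pipeline (not the benchmark's actual harness), the snippet below treats the toy LP from the few-shot example in Appendix G as a hypothetical LLM-produced model $\hat{M}$, solves it with SciPy's linprog standing in for a commercial solver such as COPT (and ignoring the integrality requirement, which happens not to change the optimum here), and applies the tolerance rule of Equation (1) in Section 6.1.2 to compare the solver output $\hat{A}$ with an assumed ground-truth answer $A$.

from scipy.optimize import linprog

def is_correct(output, answer):
    # Tolerance rule of Eq. (1): relative error <= 1e-4 or absolute error <= 1.
    if answer != 0 and abs((output - answer) / answer) <= 1e-4:
        return True
    return abs(output - answer) <= 1

# Hypothetical LLM-produced model M_hat (the toy LP from Appendix G):
# minimize 50x + 10y  s.t.  2x + y <= 200,  4x + y >= 50,  x, y >= 0.
c = [50, 10]
A_ub = [[2, 1], [-4, -1]]   # the ">=" constraint is negated into "<=" form
b_ub = [200, -50]
result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])

ground_truth = 500.0        # assumed correct answer A for this toy problem
print(is_correct(result.fun, ground_truth))   # prints True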

4.2 Scope of Mathematical Modeling

We construct the categories of mathematical modeling by referring to Giordano et al. (2013).
In the context of natural language, we propose a framework for mathematical modeling with
LLMs, specialized to solving natural language problems; see Table 1.

Category In Mamo
Changes and Differences ✗
Proportionality and Geometric Similarity ✗
Optimization Problem Modeling ✓
Differential Equations ✓
Probabilistic Modeling ✗

Table 1: The taxonomy of mathematical modeling. In the Mamo benchmark, we only include
the categories where sophisticated solvers are available.

Given the broad spectrum of mathematical modeling and the availability of specific solvers,
our benchmark is meticulously designed to encompass domains where these tools can be
effectively utilized. This section outlines the rationale for our focus on certain areas of
mathematical modeling, based on solver availability and the clarity of the modeling process.
Availability of Sophisticated Solvers: Advanced optimization solvers, such as COPT Ge
et al. (2023), along with Python libraries for solving ordinary differential equations (ODEs),
offer a strong foundation for testing the capabilities of Large Language Models (LLMs).
These tools enable a detailed and automated evaluation of the LLMs’ abilities to abstract,
formulate, and accurately solve mathematical problems within these specific domains.


Challenges in Modeling Changes and Differences, Proportionality and Geometric Similarity:
Although simpler tools like calculators could address problems related to modeling
changes and differences, proportionality and geometric similarity, the lack of a defined,
structured modeling process complicates the implementation of our final-state-approach
benchmark. Our objective is to assess the LLMs’ skill in developing comprehensive mathe-
matical models that adhere to a solvable framework. The absence of a distinct modeling
procedure in these areas impedes a standardized and objective assessment of an LLM’s
modeling efficiency.
Following the considerations mentioned above, we have developed our benchmark in
the fields of optimization and ODEs; partial differential equations are excluded due to the lack
of advanced and universal Partial Differential Equations (PDE) solvers. In the optimization segment, our focus is on Linear
Programming (LP) and Mixed-Integer Linear Programming (MILP) problems. Compared
to NLP4LP, introduced by AhmadiTeshnizi et al. (2023b) and focusing exclusively on
optimization, our benchmark includes a larger number of problems. This benchmark
set is highly scalable, with potential future expansions into nonlinear optimization and
probabilistic modeling. Additionally, as new and advanced solvers emerge, it could be
expanded into areas like Partial Differential Equations (PDE) and beyond.
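For reference, the general problem classes covered by the optimization segment can be written in the standard textbook forms below (generic notation, not notation introduced by Mamo): an LP minimizes a linear objective subject to linear constraints, and an MILP additionally restricts a subset I of the variables to integer values.

$$
\text{(LP)}\quad \min_{x}\ c^{\top} x \quad \text{s.t. } A x \le b,\ x \ge 0;
\qquad
\text{(MILP)}\quad \min_{x}\ c^{\top} x \quad \text{s.t. } A x \le b,\ x \ge 0,\ x_j \in \mathbb{Z} \text{ for } j \in I.
$$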

5 Data Collection and Synthesis

5.1 Data Selection and Source Credibility

The data set for our benchmark consists of a blend of manually selected and GPT-generated
questions. All items within the data set have been subjected to a meticulous review process
conducted by qualified collectors and domain experts to ensure their validity and relevance.

Annotator Qualifications The benchmarking process is conducted by a team of qualified
individuals with extensive expertise in various mathematical disciplines.2

5.1.1 Manually Selected Questions


• Source and Reliability: For manual selection, we refer to textbooks,
exercises, and solutions in the ODE and optimization areas (Thomas et al., 2016;
Stewart et al., 2014; Lial et al., 2017; Braun, 1993; Boyce & DiPrima, 2012; Giordano
et al., 2014; Zill, 2013; Bertsimas & Tsitsiklis, 1997; Hurlbert,
2010). The selection process focuses on deriving the mathematical model underlying each
question and assigning a reasonable scenario. The sources ensure the correctness of the
answer to each question.
• Rewriting Process: Each question is meticulously rewritten to fit the benchmark's
criteria (the properties in Section 5.3), which may involve assigning new scenarios
or adapting the question and its answer to align with our testing methodology.
This process also circumvents potential licensing issues, as the original content is
significantly transformed.

Data Synthesis The process of annotating data starts with the creation of typical mathe-
matical constructs, into which random parameters are introduced to give them specificity.
This is followed by the computation of answers to establish a ground truth. Subsequently,
GPT-4 is employed to craft real-life scenarios based on these predefined models, effectively
translating abstract mathematical concepts into tangible, context-rich problems. This ap-
proach not only tests the model’s ability to handle mathematical reasoning but also its
capacity to contextualize mathematical principles within real-world situations. See the
example in Appendix E.
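As a hedged sketch of this synthesis pipeline (the exact scripts are not given in the paper, so the template, parameter ranges, and helper below are assumptions), the following instantiates a simple exponential-decay ODE with random parameters, computes the ground-truth answer in closed form, and leaves the GPT-4 call that wraps the abstract model in a real-life scenario as a comment.

import math
import random

def synthesize_decay_item(seed):
    # Sketch: instantiate dy/dt = -k*y, y(0) = y0 with random parameters and
    # compute the ground-truth value y(t0) = y0 * exp(-k * t0).
    rng = random.Random(seed)
    k = round(rng.uniform(0.05, 0.5), 2)     # decay rate (assumed range)
    y0 = rng.randint(50, 500)                # initial amount (assumed range)
    t0 = rng.randint(2, 20)                  # query time (assumed range)
    answer = round(y0 * math.exp(-k * t0), 2)
    abstract_model = f"dy/dt = -{k} * y, y(0) = {y0}; report y({t0}) rounded to 2 decimals."
    # In the actual pipeline, GPT-4 would now be prompted to wrap `abstract_model`
    # in a context-rich, real-world scenario (e.g. pollutant decay in a tank).
    return {"model": abstract_model, "t0": t0, "answer": answer}

print(synthesize_decay_item(seed=0))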

Cross Validation GPT-4 is employed to confirm that the newly generated questions are
amenable to being solved by solvers using the final-state approach, ensuring the benchmark's
focus remains on the LLM's modeling capabilities. Moreover, we employ a comprehensive
review process, re-examining each question for consistency and correctness, thereby
ensuring the quality of our dataset. See the review details in Appendix A.
2 See the details of the annotator qualifications in Appendix B.

5.2 Data Statistics and Visualization

In the section on Ordinary Differential Equations (ODE), we present a total of 346 problems:
196 are based on first-order equations, 110 on second-order equations, and 40 on systems of
equations, offering a comprehensive exploration across different levels.
The optimization section is divided into two segments: Easy LP, featuring 652 high school-
level Mixed Integer Linear Programming (MILP) problems, and Complex LP, comprising
211 undergraduate-level problems that integrate both LP and MILP.
See the word cloud of the combined data in Figure 2.

Figure 2: The word cloud of the data after filtering requirement words such as ’significant
figures’ and ’rounded’.

5.3 Guidance of Data Synthesis

The synthesis of problems for our benchmark is governed by specific criteria designed to
evaluate the mathematical modeling capacity of Large Language Models (LLMs). The guidelines
ensure that problems are appropriate for our final-state approach and yield numerical
answers suitable for automated verification. The following are the key considerations in our
benchmark guidance:
Property 1. Final-State Approach Solvability: Problems should be structured so that they
can be solved by a solver in a final-state approach, meaning the solution process should not
require intermediate steps once the mathematical model has been formulated by the LLM.
Property 2. Numerical Answers: The answers to all questions must be numerical, allowing
for clear-cut evaluation and comparison with the solver’s output.
Property 3. Significant Figures and Precision: Each question will explicitly state the
required significant figures or the level of numerical precision for the answer. This clar-
ity ensures that the LLM’s output can be properly evaluated against the solver’s results,
maintaining consistency and rigor in the assessment of the model’s accuracy.
Property 4. Modeling Capacity Test: To specifically test the LLM’s modeling capabilities:

• For Ordinary Differential Equations (ODEs), questions should focus on determining
the value of the original function at a specific time t0.
• For optimization problems, questions should aim to find the optimal value.


These constraints facilitate accurate testing and make it convenient to verify problems that
may admit multiple valid solutions.
Property 5. Real-World Problem Context: Questions should be framed as real-world
problems without explicitly stating the underlying mathematical model, challenging the
LLM to demonstrate its capacity for abstraction and application in practical scenarios.

6 Evaluation

6.1 Evaluation Protocol

6.1.1 Evaluation Details

Our evaluation protocol is meticulously designed to test the LLMs’ mathematical modeling
process. For ODE problems, the LLM is prompted to generate Python code from a natural
language description. The code is executed, and its output is compared with the correct
answer to assess accuracy. In optimization problems, the LLM is asked to express the model
in .lp format, after which the COPT solver is used to find the optimal value for comparison
with the correct answer.

Model output In the ODE part, we ask LLMs to output Python code calling solvers
(solve_ivp and odeint in SciPy, and dsolve in SymPy). In the optimization sections, we
prompt LLMs to generate a standard .lp file. Compared to the testing methodology in
NLP4LP (AhmadiTeshnizi et al., 2023b), which involves prompting the model to output
Python code for invoking the solver, the .lp format more closely aligns with the standard
optimization form, making it more readable. Furthermore, optimization solvers such as
COPT (Ge et al., 2023) can directly read and solve the .lp file, minimizing potential issues
related to the LLM's coding capabilities. To better adhere to the format, we conducted tests
using 3-shot learning. Detailed information about the number of few-shot samples used
in our experiments and the results can be found in the appendix (see Appendix C and
Appendix G).
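To make the expected model output concrete, the following is a hypothetical example of the kind of Python program an LLM might return for an ODE item under the final-state approach; the equation dy/dt = -0.3y with y(0) = 100 and the query time t0 = 5 are illustrative values chosen here, not an item from the benchmark.

from scipy.integrate import solve_ivp

def rhs(t, y):
    # Hypothetical model: dy/dt = -0.3 * y
    return -0.3 * y

# Solve only up to the query time t0 = 5 and print just the final value
# (final-state approach: no intermediate steps are reported).
sol = solve_ivp(rhs, (0, 5), [100.0], t_eval=[5], rtol=1e-8, atol=1e-8)
print(round(float(sol.y[0][-1]), 2))   # prints y(5) ≈ 22.31 = 100 * exp(-1.5)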

Mitigation of Formatting Errors It is essential to distinguish errors in modeling from those
in coding or file formatting. Therefore, if the LLM's output contains syntax errors or incorrect
.lp formatting, we need to rectify these issues without altering the underlying logic of
the code or model. This step ensures that our evaluation focuses solely on the modeling
process. To achieve this, we use a language model to fix any syntax errors, referring to it as
the code modifier. In our experiments, we use the original language model and GPT-4-0613
as code modifiers. For the sake of reproducibility, we choose the original LLM; to ensure
the comparison focuses exclusively on modeling capacity, we choose GPT-4-0613 as the code
modifier due to its exceptional coding proficiency and operational reliability, as well as its
greater stability compared to other provisionally released language models. See the comparison of the results
from different code modifiers in Section 6.3 and the prompts in Appendix G. Besides, at
the beginning of testing, we perform an initial cleaning of the code, removing strings such as
the "```python" and "```" fences that would otherwise affect our testing.
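As a minimal sketch of that initial cleaning step (the paper does not publish its exact script, so the helper below is an assumption), a small regular expression suffices to strip Markdown-style code fences such as ```python ... ``` from a raw LLM response before execution:

import re

def strip_code_fences(raw):
    # Keep only the content between the first pair of ``` fences, if present;
    # otherwise return the response unchanged.
    match = re.search(r"```[a-zA-Z]*\r?\n(.*?)```", raw, flags=re.DOTALL)
    return match.group(1).strip() if match else raw.strip()

example = "```python\nprint('hello')\n```"
print(strip_code_fences(example))   # -> print('hello')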

6.1.2 Evaluation Metrics

Given the potential for minor discrepancies in numerical solutions, our protocol includes a
comparison rule that accommodates slight inaccuracies. Let O denote the output from
the LLM, and A denote the standard answer. The comparison is defined by the following
criteria.3
Precision Adjustment for Floating-Point Numbers If the standard answer, A, is a floating-
point number, the precision adjustment involves calculating the number of digits,
n, after the decimal point. Both O and A are then scaled by $10^n$ for comparison.
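As an illustrative case (numbers chosen here, not drawn from the benchmark), if the standard answer is $A = 3.14$, then $n = 2$ and both values are scaled by $10^2$, so requiring the scaled absolute difference to be at most 1 amounts to agreement within one unit of the last stated decimal place:

$$
A = 3.14,\quad n = 2:\qquad |10^{2} O - 314| \le 1 \;\Longleftrightarrow\; |O - 3.14| \le 0.01 .
$$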

3 See the evaluation script in Appendix. F


General Comparison Criterion The overall correctness is determined based on the following
conditions, applicable to both adjusted and original values:
$$
\text{Correctness} =
\begin{cases}
1 & \text{if } \left|\dfrac{O - A}{A}\right| \le 1 \times 10^{-4} \ \text{or}\ |O - A| \le 1,\\
0 & \text{otherwise.}
\end{cases}
\tag{1}
$$

This structured evaluation ensures that the LLM’s output is accurately assessed against the
standard answer with consideration for the specific requirements of numerical precision
and the inherent variability in computational answers.

6.2 Benchmarking Results

Models | ODE (First order, Second order, System, All) | Optimization (Easy LP, Complex LP)
GPT-4o 69.39% 73.64% 42.50% 67.63% 87.27% 23.22%
GPT-4o † 69.90% 74.55% 42.50% 68.21% 87.42% 32.23%
GPT-4-turbo 66.84% 69.09% 42.50% 65.12% 87.88% 23.70%
GPT-4-turbo † 67.86% 70.00% 42.50% 65.99% 87.88% 30.33%
GPT-4 60.71% 65.45% 42.50% 60.12% 86.50% 22.27%
GPT-4 † 61.73% 67.27% 45.00% 61.56% 86.96% 24.64%
GPT-3.5-turbo 17.35% 16.36% 7.50% 15.90% 81.29% 9.95%
GPT-3.5-turbo † 25.51% 31.82% 15.00% 26.30% 84.82% 9.95%
Claude-3-Sonnet 53.57% 61.82% 30.00% 53.47% 83.59% 25.12%
Claude-3-Sonnet † 55.61% 63.64% 35.00% 55.78% 84.20% 25.59%
Claude-3-Haiku 42.35% 60.91% 20.00% 45.66% 78.53% 19.91%
Claude-3-Haiku † 52.04% 63.64% 22.50% 52.31% 85.89% 21.33%
Gemini-1-pro 26.53% 35.45% 2.50% 26.59% 78.99% 15.31%
Gemini-1-pro † 35.71% 39.09% 7.50% 33.53% 80.06% 15.31%
Gemini-1.5-pro 66.33% 66.36% 47.50% 64.16% 85.28% 30.81%
Gemini-1.5-pro † 67.35% 66.36% 52.50% 65.32% 85.28% -
DeepSeek-v2 40.82% 45.45% 32.50% 41.33 % 86.50% 6.16%
DeepSeek-v2 † 42.35% 64.55% 32.50% 48.27% 86.66% 6.16%
DeepSeek-coder 32.14% 49.09% 15.00% 35.55% 75.77% 18.96%
DeepSeek-coder † 34.18% 49.09% 15.00% 36.71% 76.99% 18.96%

Table 2: Evaluation results on the Mamo Benchmark. The remark † indicates that the row additionally
uses the original LLM to rectify syntax errors in the code and .lp files generated by the
corresponding LLM. Here GPT-4-turbo denotes GPT-4-turbo-2024-04-09 and GPT-4o
denotes GPT-4o-2024-05-13, while GPT-4 refers to GPT-4-0613.

Our evaluation includes both closed-source LLMs and open-source LLMs. The closed-
source commercial LLMs include the GPT-4 series OpenAI (2023), the Claude series 4 , and
the Gemini series 5 . The open-source LLMs include DeepSeek-v2 DeepSeek-AI (2024) and
DeepSeek-coder Guo et al. (2024). Table 6.2 presents the evaluation results of different
language models on the Mamo Benchmark. The performance of these models is assessed
both in their original form and when enhanced with the language model itself to rectify
syntax errors (indicated by the dagger symbol † ). Table 6.2 displays the performance results
when GPT-4-0613 is selected as the code modifier.
The evaluation results on the Mamo Benchmark reveal that GPT-4o stands out among the
models in both ODE and LP tasks. Specifically, GPT-4o achieves the highest performance in
ODE tasks with 69.90% in first-order and 74.55% in second-order equations, and also excels
in LP tasks with 87.42% in easy LP and 32.23% in complex LP scenarios when using the LLM
to rectify syntax errors. Additionally, DeepSeek-v2, as an open-source LLM, demonstrates
significant capability with 48.27% overall in ODE tasks and 86.66% in easy LP tasks when
enhanced with syntax correction, highlighting its potential despite being open-source.
4 [Link]
5 [Link]


Models | ODE (First order, Second order, System, All) | Optimization (Easy LP, Complex LP)
GPT-4o †† 69.90% 74.55% 42.50% 68.21% 87.42% 29.86%
GPT-4-turbo †† 67.85% 70.00% 42.50% 65.99% 88.19% 28.44%
GPT-4 †† 61.73% 67.27% 45.00% 61.56% 86.96% 24.64%
GPT-3.5-turbo †† 47.96% 60.91% 20.00% 48.84% 85.43% 10.43%
Claude-3-Sonnet †† 58.16% 66.36% 35.00% 59.09% 85.12% 28.44%
Claude-3-Haiku †† 53.06% 65.45% 25.00% 53.76% 86.81% 21.33%
Gemini-1-pro †† 43.37% 54.55% 15.00% 43.64% 80.37% 16.27%
Gemini-1.5-pro †† 67.35 % 67.27% 52.50% 65.61% 85.28% 34.12%
DeepSeek-v2 †† 56.12% 68.18% 32.50% 57.23% 86.81% 6.16%
DeepSeek-coder †† 33.67% 49.09% 17.50% 36.71% 78.83% 18.96%

Table 3: Evaluation results on the Mamo Benchmark. The remark †† indicates that the row additionally
uses GPT-4-0613 to rectify syntax errors in the code and .lp files generated by the
corresponding LLM.

The inclusion of the LLM itself as a code modifier to rectify syntax errors results in significant
improvements across various models. For instance, GPT-3.5-turbo's performance in ODE
tasks increased from 15.90% to 26.30% overall, and Claude-3-Haiku shows a remarkable im-
provement in ODE tasks from 45.66% to 52.31%. Similarly, DeepSeek-v2, as an open-source
LLM, demonstrates significant capability with its overall ODE performance increasing from
41.33% to 48.27% when enhanced with syntax correction.
Selecting GPT-4 as a code modifier generally results in better performance than using
the LLM itself for self-improvement. For instance, DeepSeek-v2’s accuracy in ODE tasks
improves from 48.27% with self-improvement to 57.23% when GPT-4 is used as the code
modifier. This pattern indicates that GPT-4 is a more effective code modifier, enhancing the
accuracy and performance of various models more reliably than self-improvement.

6.3 On the format errors and code modifiers

Models | ODE format rate (raw, self-improved, GPT-4-improved) | Optimization format rate (raw, self-improved, GPT-4-improved)
GPT-4o 96.82% 99.42% 99.42% 92.93% 97.91% 97.91%
GPT-4-turbo 96.82% 99.13% 98.84% 93.86% 97.80% 97.56%
GPT-4 94.51% 97.69% 97.69% 90.96% 95.13% 95.13%
GPT-3.5-turbo 32.37% 60.40% 94.22% 87.95% 89.80% 96.76%
Claude-3-Sonnet 84.68% 93.06% 97.40% 91.89% 93.97% 96.76%
Claude-3-Haiku 78.03% 92.77% 98.27% 82.61% 92.58% 96.29%
Gemini-1.0-pro 56.36% 69.94% 88.73% 88.06% 89.11% 94.21%
Gemini-1.5-pro 93.64% 98.27% 98.84% 95.48% - 96.87%
deepseek-v2 70.81% 83.81% 96.24% 90.27% 92.00% 94.90%
deepseek-coder 88.44% 93.64% 95.95% 92.58% 96.29% 98.84%

Table 4: Correct format rate for raw, self-improved, and GPT-4-improved outputs.

The data presented in Table 4 provides a comparative analysis of the performance (rate of
correct format) of various models under different improvement methods. The table considers
three scenarios: no improvement, self-improvement, and improvement provided by GPT-4.
The GPT-3.5-turbo model shows the most dramatic improvement when enhanced by
GPT-4, jumping from a 32.37% success rate to 94.22%. Interestingly, the Claude-3 models
(Sonnet and Haiku) and the DeepSeek models (DeepSeek-v2, DeepSeek-coder) also exhibit
high improvement rates when enhanced by GPT-4, highlighting the compatibility of GPT-
4's improvement methods with other models. In the realm of optimization tasks, the
GPT-3.5-turbo model again shows a significant improvement when enhanced by GPT-4,
jumping from an 87.95% success rate to 96.76%.


The improvement from GPT-4 is generally superior to both self-improvement and no
improvement. This indicates that GPT-4 can serve as a more effective and fair code modifier.
By using GPT-4 for syntax error correction, we can better examine the modeling ability of
different LLMs without the confounding factor of formatting errors. Notably, in ODE tasks,
the correct format rate for all models, except one, is generally greater than 94%, with the
lowest being 88.73%. In Optimization tasks, the correct format rate is even higher, with the
lowest rate being 94.21%.

7 Conclusion and Future Directions

In this article, we introduce Mamo, a novel benchmark that utilizes solvers to evaluate the
mathematical modeling capabilities of large language models (LLMs). We propose a shift in
the evaluation paradigm from traditional outcome-oriented evaluation to process-oriented
evaluation. This shift allows us to scrutinize the modeling process undertaken by LLMs,
providing a more detailed and comprehensive assessment of their mathematical problem-
solving abilities. Our evaluation focuses more on the mathematical modeling capabilities of
the models, leaving the process of solving abstract mathematical problems to the solvers,
thereby avoiding the impact of errors such as computational capabilities of the models on
their actual modeling abilities.
Our benchmark represents an important step in understanding the fundamental modeling
capabilities of LLMs. By focusing on the modeling process rather than the correctness of
the final solution, we are able to gain valuable insights into the strengths and limitations of
current LLMs. This approach provides new directions for future research, emphasizing the
importance of the modeling process in solving mathematical problems.
The introduction of Mamo paves new avenues for future research. The benchmark can be
expanded to include a wider variety of mathematical problems and solvers, broadening
its applicability and providing a more comprehensive evaluation of the mathematical
modeling capabilities of LLMs. Insights gained from the benchmark can be used to guide
the development of future LLMs. By understanding the strengths and weaknesses of current
models in solving mathematical problems, researchers can design new models that better
handle these tasks. We will extend our model list to other models (Llama-2 series, Mistral
series, etc.) in future work.

Limitations While the current benchmarking methodology necessitates LLMs to generate
Python code for ODEs and .lp files for optimization problems, it opens up exciting avenues
for future research. The process, while effective, may inadvertently influence the LLMs’
focus towards formalization over conceptual modeling, potentially affecting the depth
of mathematical reasoning. This presents an opportunity to explore innovative benchmark
designs that can better distinguish between an LLM’s modeling prowess and its formal-
ization abilities. Future research could aim to develop methodologies that allow LLMs to
demonstrate their mathematical modeling capabilities without the intermediary step of
formalization, leading to a more nuanced and comprehensive assessment of their abilities.


Ethics Statement:

The benchmark is designed with a stringent ethical framework to ensure that all problems
are socially responsible and do not perpetuate or encourage harmful biases or stereotypes.
Questions are meticulously reviewed to avoid any content that could lead to the dissemina-
tion of misinformation, support unethical practices, or cause harm to individuals or groups.
This includes but is not limited to issues of privacy, security, and the potential for misuse of
the model. Furthermore, the benchmark abstains from any problem that could indirectly
endorse unethical behaviors or decisions in real-world scenarios, upholding the highest
standards of academic integrity and social responsibility.

References
Ali AhmadiTeshnizi, Wenzhi Gao, and Madeleine Udell. Optimus: Optimization modeling
using mip solvers and large language models. arXiv preprint arXiv:2310.06116, 2023a.
Ali AhmadiTeshnizi, Wenzhi Gao, and Madeleine Udell. OptiMUS: Optimization modeling
using MIP solvers and large language models. arXiv preprint arXiv:2310.06116, 2023b.
Dimitris Bertsimas and John N. Tsitsiklis. Introduction to linear optimization. Athena Scientific,
1997.
William E. Boyce and Richard C. DiPrima. Elementary differential equations and boundary value
problems. Wiley, 10 edition, 2012.
Martin Braun. Differential equations and their applications an introduction to applied mathematics.
Springer, 4 edition, 1993.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers
to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language
model, 2024.
Jiazhan Feng, Ruochen Xu, Junheng Hao, Hiteshi Sharma, Yelong Shen, Dongyan Zhao, and
Weizhu Chen. Language models can be logical solvers. arXiv preprint arXiv:2311.06158,
2023.
Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas
Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt.
Advances in Neural Information Processing Systems, 36, 2024.
Dongdong Ge, Qi Huangfu, Zizhuo Wang, Jian Wu, and Yinyu Ye. Cardinal Optimizer
(COPT) user guide. [Link] 2023.
Frank R. Giordano, Maurice D. Weir, and William P. Fox. A First Course in Mathematical
Modeling. Brooks/Cole, Cengage Learning, 5th edition, 2013.
Frank R. Giordano, Steven B. Horton, and William P. Fox. A first course in mathematical
modeling. Brooks/Cole, 5 edition, 2014.
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen,
Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. Deepseek-coder:
When the large language model meets programming – the rise of code intelligence, 2024.
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi
Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging bench-
mark for promoting agi with olympiad-level bilingual multimodal scientific problems.
arXiv preprint arXiv:2402.14008, 2024.


Joy He-Yueya, Gabriel Poesia, Rose E Wang, and Noah D Goodman. Solving math
word problems by combining language models with symbolic solvers. arXiv preprint
arXiv:2304.09102, 2023.
Glenn Hurlbert. Linear Optimization: The simplex workbook. Springer, 2010.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski,
Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al.
Solving quantitative reasoning problems with language models. Advances in Neural
Information Processing Systems, 35:3843–3857, 2022.
Margaret L. Lial, Raymond N. Greenwell, and Nathan P. Ritchey. Calculus with applications.
Pearson, 11 edition, 2017.
Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen,
Rachel Ward, and Yi Zhang. TinyGSM: Achieving >80% on GSM8K with small language
models. arXiv preprint arXiv:2312.09241, 2023.
Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou,
Wenwei Zhang, Songyang Zhang, Dahua Lin, and Kai Chen. Mathbench: Evaluating the
theory and application proficiency of llms with a hierarchical mathematics benchmark,
2024.
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese,
and Caiming Xiong. Codegen: An open large language model for code with multi-turn
program synthesis. arXiv preprint arXiv:2203.13474, 2022.
OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin,
Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language
models to follow instructions with human feedback. Advances in neural information
processing systems, 35:27730–27744, 2022.
Liangming Pan, Alon Albalak, Xinyi Wang, and William Yang Wang. Logic-lm: Empowering
large language models with symbolic solvers for faithful logical reasoning. arXiv preprint
arXiv:2305.12295, 2023.
James Stewart, Dan Clegg, and Saleem Watson. Calculus: Early transcendentals. Cengage, 8
edition, 2014.
Zhengyang Tang, Xingxing Zhang, Benyou Wang, and Furu Wei. Mathscale: Scaling instruc-
tion tuning for mathematical reasoning. arXiv preprint arXiv:2403.02884, 2024.
George B. Thomas, Maurice D. Weir, Joel Hass, and Frank R. Giordano. Thomas’ calculus.
Pearson, 13 edition, 2016.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V
Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language
models. Advances in neural information processing systems, 35:24824–24837, 2022.
Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and
Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
Fei Yu, Anningzhe Gao, and Benyou Wang. Outcome-supervised verifiers for planning in
mathematical reasoning. arXiv preprint arXiv:2311.09724, 2023a.
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T
Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own
mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023b.
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, and Songfang Huang. How well do
large language models perform in arithmetic tasks? arXiv preprint arXiv:2304.02015, 2023.


Beichen Zhang, Kun Zhou, Xilin Wei, Xin Zhao, Jing Sha, Shijin Wang, and Ji-Rong Wen.
Evaluating and improving tool-augmented computation-intensive math reasoning. Ad-
vances in Neural Information Processing Systems, 36, 2024.
Dennis G. Zill. A first course in differential equations with modeling applications. Cengage, 10
edition, 2013.

A Cross Review
In the ODE part, we have the following review process:

1. We employed GPT-4 to confirm that the newly generated questions are amenable to
being solved by solvers using the final-state approach. Only 10 out of
383 questions were invalid; all of these were filtered out, leaving 373 questions.
2. Then we conducted a thorough evaluation of the questions. Our dataset initially
comprised 373 questions. We identified 29 questions that lacked sufficient infor-
mation, 21 that had incorrect answers, and 18 that featured unclear statements.
Following our assessment, 27 questions were deleted, and 41 were corrected to
ensure clarity and correctness.

Finally, we have 346 problems in the ODE part.


We have also conducted a cross-review process. See the details in Appendix A.1 and the
results in Appendix A.1.1.
In the optimization section, particularly the Easy LP part, out of 688 entries, 12 were found
to be completely invalid (not meeting the criteria), 26 had incorrect answers, and 6 had
misleading descriptions. Among these 44 problematic entries, 8 were corrected manually and
the other 36, including all of the invalid entries, were filtered out, leaving 652 questions. In the
Complex LP part, 40 of the original 211 questions were found to have been synthesized from
wrong mathematical models; all of them were corrected.

A.1 Cross-Review Process

Furthermore, sampling according to the distribution of the dataset, we selected
50 questions from the ODE part of our dataset for a deeper analysis. Four independent
reviewers (see their qualifications in Appendix B) were tasked with assessing 50 questions
each. The responses could either be numeric or labeled as “error” if a question was deemed
problematic. This section outlines the metrics used to evaluate the reviewers’ responses
against pre-defined correct answers, which also include the “error” label for any flawed
questions. We use the same metric as described in Subsection 6.1.2. Here are our results:

A.1.1 Results
The effectiveness of the review process was quantified using the metrics defined in Sub-
section 6.1.2, focusing on the accuracy and inter-rater reliability among reviewers. The
statistical outcomes are summarized as follows:

• Average Cohen’s Kappa: The average Cohen’s Kappa across all reviewer pairs was
0.60, indicating a moderate to substantial agreement. This suggests that reviewers
generally agreed on the classification of answers, though there were variations in
some cases.
• Minimum and Maximum Cohen’s Kappa: The minimum Kappa value recorded
was 0.50, and the maximum was 0.71. The spread in these values highlights areas
where alignment and training could potentially enhance consistency.
• Accuracy Rates: The individual accuracy rates were as follows:
– Reviewer 1: 90.0%
– Reviewer 2: 84.0%

– Reviewer 3: 88.0%
– Reviewer 4: 74.0%
These rates reflect the precision with which each reviewer matched the standard
answers, including their recognition of problematic questions correctly labeled as
“error.”

These results provide insights into the overall effectiveness and areas of improvement for
the cross-review process, particularly in terms of aligning understanding and interpretation
of the evaluation criteria among reviewers.
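As a hedged sketch of how the inter-rater statistics above could be computed (the paper does not provide this script, and the label lists below are placeholders, not the actual review data), pairwise Cohen's kappa can be obtained with scikit-learn:

from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Placeholder labels: each reviewer's answer per question, bucketed against the
# standard answer as "correct" / "incorrect", or "error" for flawed questions.
reviewer_labels = {
    "r1": ["correct", "error", "correct", "incorrect"],
    "r2": ["correct", "error", "incorrect", "incorrect"],
    "r3": ["correct", "correct", "correct", "incorrect"],
    "r4": ["incorrect", "error", "correct", "incorrect"],
}

kappas = [
    cohen_kappa_score(reviewer_labels[a], reviewer_labels[b])
    for a, b in combinations(reviewer_labels, 2)
]
print(f"avg={sum(kappas) / len(kappas):.2f}, min={min(kappas):.2f}, max={max(kappas):.2f}")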

B Annotator Qualifications

The benchmarking process benefits from the expertise of reviewers with robust backgrounds
in mathematics, underscored by their academic accomplishments. The team is composed
of individuals whose education encompasses critical mathematical disciplines, including
calculus (I and II), linear algebra (both introductory and advanced levels), optimization,
probability, and ordinary differential equations. The average Grade Point Average (GPA)
across these foundational courses is approximately 3.89 out of a 4.0 scale. This high level of
academic achievement demonstrates the reviewers’ strong grasp of essential mathematical
concepts and their application. Furthermore, the team is enriched by members who hold
Ph.D. degrees in mathematics, further solidifying the depth of expertise and analytical skills
available for the benchmarking process.

B.1 Qualification of Reviewers

We select the reviewers mentioned in Appendix A.1 based on the following criteria:

1. Undergraduate students who have taken fundamental mathematical courses (Calculus
and Linear Algebra) and have an average GPA of at least 3.5/4.0.
2. Have completed or are currently taking courses in Ordinary Differential Equations
(ODE).
3. Meet one of the following English proficiency requirements: TOEFL > 80, IELTS >
6.5, a minimum grade of B+ in all English classes, or a Gaokao English score > 120.

These criteria ensure their ability to read questions in English and solve ODE problems.

C Few-shot Experiment

To test the sensitivity of different models to the number of few-shot examples, we conducted a
sensitivity analysis examining the effect of the number of shots on
model performance. In our analysis, we tested the performance of 11 models on ODE
problems using prompts with 0 shot, 1 shot, 3 shots, 5 shots, and 10 shots. The results, shown
in Figure 3, illustrate the ODE problem-solving accuracy of different models under varying
numbers of shots. Similarly, we performed sensitivity tests on optimization problems,
assessing the accuracy of 10 models with prompts of 0 shot, 1 shot, 3 shots, and 5 shots. The
results are depicted in Figure 4.
Additionally, we specifically tested the 0-shot performance of GPT-4-0613 on all questions
in Easy LP and Complex LP categories. The accuracy of GPT-4-0613 was 66.56% in Easy LP
and 14.69% in Complex LP, resulting in an overall LP accuracy of 53.65%, consistent with
the results obtained from sampling.


Figure 3: Model accuracy versus the number of shots on ODE problems.

Figure 4: Model accuracy versus the number of shots on optimization problems.


D Testing Process

We use 3-shot learning in testing; see example prompts in Appendix G.

Figure 5: Example of testing process in optimization

Figure 6: Example of testing process in ODE


E Data Synthesis

Figure 7 shows the example of synthesis process in ODE.

Figure 7: Example of synthesis process in ODE


F Evaluation script

def comp(output, standard_answer):
    # Relative-error check: correct if |O - A| / A <= 1e-4
    # (when A == 0, this reduces to the absolute check |O - A| <= 1).
    dif = abs(float(output) - float(standard_answer))
    if float(standard_answer) == 0:
        rate = dif * 1e-4
    else:
        rate = dif / float(standard_answer)
    if abs(rate) <= 1e-4:
        return 1
    else:
        return 0

def compare_output_with_standard(output, standard_answer):
    # Truthy if the LLM output matches the standard answer at the stated decimal
    # precision or within the relative tolerance checked by comp();
    # False if the output is not a number.
    try:
        float_output = float(output)
    except ValueError:
        return False

    if '.' in standard_answer:
        # Scale both values by 10**n, where n is the number of digits after
        # the decimal point of the standard answer (precision adjustment).
        digit = len(standard_answer.split('.')[1])
        s_ans = float(standard_answer) * 10 ** digit
        ans = float_output * 10 ** digit
        return (abs(ans - s_ans) <= 1 or comp(output, standard_answer))
    else:
        s_ans = float(standard_answer)
        ans = float_output
        return (abs(ans - s_ans) <= 1 or comp(output, standard_answer))
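As a usage illustration of the script above (the inputs here are hypothetical values, not benchmark items):

# Hypothetical calls to the evaluation functions defined above.
print(compare_output_with_standard("3.1416", "3.14"))      # True: |314.16 - 314| <= 1 after scaling by 10**2
print(compare_output_with_standard("123456.7", "123450"))  # 1 (truthy): relative error ~5.4e-5 <= 1e-4
print(compare_output_with_standard("abc", "500"))          # False: output is not a number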


G Prompts

Assume you are a virtual assistant with expertise in optimization, specifically in creating .lp files for linear programming
problems. Your task is to translate given natural language problems into optimization models, formatted as .lp files. When
you receive a question, it might include mathematical expressions in LaTeX format. Your job is to interpret these expressions
accurately and model the problem in .lp format. Your response must adhere to the following guidelines:
- The optimization model must be written in .lp format, adhering to the conventions and syntax appropriate for linear
programming problems.
- The model should be designed so that solving it yields the optimal value, which directly answers the question posed.
- Your response should be an entire .lp file, ready to be processed by a linear programming solver. Ensure that the file
contains no comments or extraneous content beyond the model itself.
- Handle LaTeX expressions with care to ensure that the mathematical aspects of the problem are accurately represented in
the .lp model.
- If the solution needs to be rounded to an integer, make use of the ’General’ integer constraint in the .lp file to specify integer
variables, please do not use ’General’ if there are not requirement for the integer variables.

Here comes the examples:

Example_1:
(the input)
A manufacturing company produces two types of products: X and Y. The production cost for each unit of product X is
$50, while the cost for each unit of product Y is
$10. There are constraints in the production process, such that twice the number of units produced from product X, plus the
number of units from product Y, cannot exceed 200 due to resource limitations. In addition, to meet market demand, four
times the number of units produced from product X, plus the number of units from product Y, must be at least 50.
Considering these constraints and given that both products can only be produced in whole numbers due to their physical
nature, what is the minimum total cost needed for producing these items while satisfying all conditions? Provide your
answer rounded to the nearest dollar.
Your response:
Minimize
obj: 50 x + 10 y

Subject To
c1: 2 x + y <= 200
c2: 4 x + y >= 50

Bounds
x >= 0
y >= 0

Generals
x
y

End

[...]

Please craft the .lp file according to these instructions, focusing on delivering a model that is directly solvable to
obtain the answer.
And Please follow the syntax like examples to write the .lp file.

Here comes the question:


[question]

Generate the contents of an .lp file for this problem, starting with the objective function and followed by the constraints,
without any additional sentences. The constraints should be formatted as ’variable + variable >= number’ for inequalities,
all the variables shoule on the left hand side of the inequality . Ensure there is a space between variables and their coefficients.

Your response:

Figure 8: The prompt for testing in the optimization part. [question] refers to a question in the
benchmark. [...] refers to the few-shot examples. We use 3-shot prompting in testing.


Assume you are a virtual assistant with expertise in ordinary differential equations (ODEs), particularly in formulating ODE
models from natural language descriptions and solving them using Python’s ‘solve_ivp‘, ‘odeint‘ from SciPy, and ‘dsolve‘
from SymPy. Your primary task is to convert the given natural language problems into ODE models and then apply ODE
solvers to find the solutions.

When you receive a question, it will often contain mathematical expressions in LaTeX format, which you need to interpret
accurately. Your response must be in the form of Python code that directly outputs the final solution to the problem upon
execution. This Python code should adhere to the following criteria:

1. The output should only display the final answer of the problem.
2. Utilize a ’final-state-approach’ where the code immediately prints the solution after solving the ODE, without showing any intermediate steps.
3. Ensure the answer is rounded to the specified number of significant figures or decimal places. Use Python’s ‘round‘
function for decimal rounding. For significant figures, include and use the following function in your code:

```python
def round_to_significant_figures(num, sig_figs):
    if num != 0:
        return round(num, -int(math.floor(math.log10(abs(num))) + (1 - sig_figs)))
    else:
        return 0  # Handles the case of num being 0
```

4. Your response should be entirely in Python code, formatted to run directly without modifications.
5. Handle LaTeX expressions in the problem statement carefully to ensure accurate modeling and solution.

Here comes the examples:

[...]

Please process the problem according to these instructions, focusing solely on delivering the Python code that meets these
requirements. And Please directly generate the code without any explaintion(except the comments in the code).
the following lines are forbidden:
'''
Here's the Python code that solves the given problem and meets the specified requirements:

```python

```
This code uses the `solve_ivp` function from SciPy to solve the initial value problem for the given differential equation. The
`round_to_significant_figures` function is included to round the final answer to the specified number of significant figures.
Upon execution, the code will directly output the amount of pollutant left in the tank after 5 minutes, rounded to five
significant figures. '''
Please avoid the above sentences.

Take a deep breathe before answering the question. This is a piece of cake to you.

Here comes the question:


[question]

Your response:

Figure 9: The prompt for testing in the ODE part. [question] refers to a question in the
benchmark. [...] refers to the few-shot examples.


You are experienced python engineer, the following codes may have some errors (in syntax), with the error information
[error_info] , please fix the errors to make sure the code can run successfully.
If there is no error, please return the original code.
And please do not change the original logic of the code. Your response should be entirely in Python code, formatted to run
directly without modifications. Take a deep breathe before answering the question. This is a piece of cake to you.
PLEASE do not response anything except the code, No other comments outside the code are allowed.

The followings are forbidden:


'''
The code is almost correct, but there is a syntax error in the function `round_to_significant_figures`. The round function is
missing a closing parenthesis. Here is the corrected version:

```python
```

'''

Please avoid resposing above sentences.


Please output only the corrected code, with no additional text or explanations.
Here comes the code:
[code]
The correct code is:

Figure 10: The prompt for fixing syntax errors in ODE. [error_info] refers to the error message produced when
executing the Python code, while [code] refers to the code that contains the syntax error.

You are experienced engineer in optimization, please fix the errors to make sure the code can run successfully.
If there is no error, please return the original code.
And please do not change the original logic of the lp file. Your response should be entirely in .lp format, formatted to run
directly without modifications.
Take a deep breathe before answering the question. This is a piece of cake to you.
PLEASE do not response anything except the code, No comments are allowed.
The followings are forbidden:

'''
This is the correct .lp file:
```lp
```

'''

Please avoid resposing above sentences.


Please output only the corrected code, with no additional text or explanations.
Here comes the lp:
[lp_code]
Generate the correct .lp format, starting with the objective function and followed by the constraints, without any additional
sentences. The constraints should be formatted as ’variable + variable >= number’ for inequalities, all the variables shoule
on the left hand side of the inequality. For example, make a + b + c <= d into a + b + c - d <= 0. Ensure there is a space
between variables and their coefficients (coefficients should be numerical).
The correct lp code is:

Figure 11: The prompt for fixing syntax errors in LP. [lp_code] refers to the .lp code that
contains the syntax error.
