MATHVISTA: EVALUATING MATHEMATICAL REASONING OF FOUNDATION MODELS IN VISUAL CONTEXTS
Pan Lu1,3 , Hritik Bansal1 , Tony Xia1 , Jiacheng Liu2 , Chunyuan Li3 ,
Hannaneh Hajishirzi2 , Hao Cheng3 , Kai-Wei Chang1 , Michel Galley3 , Jianfeng Gao3
1 UCLA, 2 University of Washington, 3 Microsoft Research, Redmond
https://mathvista.github.io
ABSTRACT
Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit
impressive problem-solving skills in many tasks and domains, but their ability
to perform mathematical reasoning in visual contexts has not been systematically studied.
To bridge this gap, we present MathVista, a benchmark designed to combine
challenges from diverse mathematical and visual tasks. It consists of 6,141 ex-
amples, derived from 28 existing multimodal datasets involving mathematics and
3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing
these tasks requires fine-grained, deep visual understanding and compositional
reasoning, which all state-of-the-art foundation models find challenging.
With MathVista, we have conducted a comprehensive, quantitative evaluation
of 12 prominent foundation models. The best-performing GPT-4V model achieves
an overall accuracy of 49.9%, substantially outperforming Bard, the second-best
performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-
4V is mainly attributed to its enhanced visual perception and mathematical rea-
soning. However, GPT-4V still falls short of human performance by 10.4%, as
it often struggles to understand complex figures and perform rigorous reasoning.
This significant gap underscores the critical role that MathVista will play in
the development of general-purpose AI agents capable of tackling mathematically
intensive and visually rich real-world tasks. We further explore the new ability
of self-verification, the application of self-consistency, and the interactive chatbot
capabilities of GPT-4V, highlighting its promising potential for future research.
1 INTRODUCTION
Figure 1: Accuracies of one leading LLM (i.e., PoT GPT-4), four prominent LMMs, random chance,
and human performance on our proposed MathVista across mathematical reasoning and visual
context types. PoT GPT-4 is a textual, program-aided LLM augmented with the Bard caption and
OCR text. GPT-4V is manually evaluated via the playground chatbot.
On the other hand, Large Language Models (LLMs) (OpenAI, 2022; 2023a) and Large Multimodal
Models (LMMs) (Google, 2023; OpenAI, 2023b; Team et al., 2023) have exhibited impressive
problem-solving skills in many tasks and domains. Recently, some studies have aimed to augment
existing LLMs with mathematical and scientific reasoning capabilities using external tools (Lu et al.,
2023a; Wang et al., 2023b). However, the ability of these foundation models to perform mathemat-
ical reasoning in visual contexts has not been systematically examined. Therefore, it is essential to
develop a new benchmark to (1) facilitate the development of mathematical reasoning systems in
visually intensive scenarios, and (2) evaluate the research progress of LLMs and LMMs, especially
their capabilities in solving rigorous reasoning tasks.
In this paper, we present MathVista, a consolidated Mathematical reasoning benchmark in Visual
contexts. We propose a task taxonomy to guide the development of MathVista: (1) we identify
seven mathematical reasoning types: algebraic reasoning, arithmetic reasoning, geometry reason-
ing, logical reasoning, numeric common sense, scientific reasoning, and statistical reasoning; (2)
we focus on five primary tasks: figure question answering (FQA), geometry problem solving (GPS),
math word problem (MWP), textbook question answering (TQA), and visual question answering
(VQA); and (3) we encompass a diverse array of visual contexts, including natural images, ge-
ometry diagrams, abstract scenes, synthetic scenes, as well as various figures, charts, and plots.
MathVista incorporates 28 existing multimodal datasets, including 9 math-targeted question an-
swering (MathQA) datasets and 19 VQA datasets. In addition, we have created three new datasets
(i.e., IQTest, FunctionQA, PaperQA) which are tailored to evaluating logical reasoning on puzzle
test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper
figures, respectively. Overall, MathVista consists of 6,141 examples, with 736 of them being
newly curated (Table 1). To facilitate fine-grained evaluation, examples are annotated with meta-
data, including question type, answer type, task category, grade level, visual context, and required
reasoning skills. Detailed descriptions of data collection can be found in §2, §C, and §D.
We conduct extensive experiments on MathVista to evaluate the reasoning abilities of 12 founda-
tion models known for their leading performance in mathematical and multimodal reasoning. This
ensemble includes three LLMs (i.e., ChatGPT, GPT-4, Claude-2), two proprietary LMMs (i.e., GPT-
4V, Bard), and seven open-source LMMs. For LLMs, we examine zero-shot and few-shot settings
using two prompting strategies: chain-of-thought (CoT) (Wei et al., 2022b) and program-of-thought
(PoT) (Chen et al., 2022b). These LLMs can also be augmented with off-the-shelf visual models
for image captioning and OCR. We establish a human performance baseline by engaging qualified
human annotators with a high school diploma or higher. We show that MathVista, featuring ad-
vanced topics such as college curricula and scientific reasoning, is a very challenging benchmark,
with human performance reaching only 60.3% accuracy.
Figure 2: Examples of our newly annotated datasets: IQTest, FunctionQA, and PaperQA.
Our results indicate that CoT GPT-4, the best-performing LLM without visual tool augmentations,
achieves an overall accuracy of 29.2%. Multimodal Bard, the best-performing LMM, achieves
34.8% (§3.3), which attains only 58% of human performance (34.8% vs 60.3%). When augmented
with Bard captions and OCR text, PoT GPT-4 obtains 33.9%, closely matching Multimodal Bard
(§3.4). Further analysis indicates that Multimodal Bard’s failures stem from incorrect calculations
and from hallucinations introduced during visual perception and textual reasoning (§3.5).
With MathVista, we report, for the first time, a comprehensive quantitative and qualitative eval-
uation of GPT-4V (OpenAI, 2023b), the latest multimodal version of GPT-4. Remarkably, GPT-4V
achieves a state-of-the-art accuracy of 49.9%, a significant improvement of 15.1% over Multimodal
Bard. As illustrated in Figure 1, GPT-4V even surpasses human performance on a set of tasks in-
volving algebraic reasoning and complex visual contexts, which include tables and function plots.
Nevertheless, a 10.4% gap in overall accuracy remains when compared to the human baseline, leav-
ing plenty of room for model improvement. Our in-depth analysis (§H) reveals that the superiority of
GPT-4V is mainly attributed to its strong capabilities in visual perception and mathematical reason-
ing. We further highlight its emergent ability for self-verification (§H.5), the use of self-consistency
(§H.6), and its ability to drive goal-directed multi-turn human-AI dialogues (§H.7).
As discussed previously, there is a notable gap in existing benchmarks, which primarily evaluate
mathematical reasoning in textual contexts, overlooking the intrinsic visual nature of many mathe-
matical problems. Our dataset, MathVista, is therefore designed to bridge this gap, offering a
robust evaluation benchmark for mathematical reasoning intertwined with visual understanding and
thus pushing AI assistants towards general-purpose capabilities. Our benchmark adheres to the following
collection guidelines: (1) it covers multiple tasks and topics to mirror real-world applications; (2)
it incorporates diverse visual contexts and mathematical skills to foster a well-rounded evaluation;
(3) it offers varying levels of challenge to effectively probe and uncover the potential limitations of
current models; and (4) it provides robust evaluation settings for deterministic evaluations.
The taxonomy for this work is introduced as follows: We identify seven types of mathematical rea-
soning: algebraic reasoning, arithmetic reasoning, geometry reasoning, logical reasoning, numeric
common sense, scientific reasoning, and statistical reasoning, with detailed definitions provided in
§C.1 and examples shown in §C.2. We focus on five primary tasks: figure question answering
(FQA), which centers around statistical reasoning over multiple charts and plots; geometry problem
solving (GPS), which deals with geometrical topics; math word problem (MWP), which involves
arithmetic reasoning in everyday scenarios; textbook question answering (TQA), which usually en-
tails knowledge-intensive reasoning on scientific topics and figures; and visual question answering
(VQA). Furthermore, our objective is to account for a diverse array of visual contexts, including
natural images, geometry diagrams, abstract scenes, synthetic scenes, multiple charts and plots,
scientific figures, tables, function plots, puzzle test figures, and more, with examples shown in §C.3.
Collection of MathQA datasets. We collected nine MathQA datasets in multimodal settings, in-
cluding four for GPS, three for MWP with visual contexts of synthetic scenes, abstract diagrams, and
tables, and two for TQA on college curricula (see §C.4). Annotations such as solutions, programs,
parsing results, and grounded theorems are also collected, providing demonstration examples for
LLMs. Each source dataset is limited to up to 400 examples to ensure a balanced representation of
each source in our final compiled benchmark. In total, we collected 2,666 examples.
Review and collection of VQA datasets. Many existing VQA datasets feature instances requiring
mathematical reasoning abilities, such as arithmetic operations or numeric common sense. Incor-
porating these datasets enhances problem diversity in terms of tasks, domains, visual contexts, and
reasoning skills involved. We reviewed more than 70 datasets, collecting 19 of them that contain
math-related instances and are publicly available, as listed in §C.4. Since these datasets are not orig-
inally math-targeted, we initially designed heuristic rules to automatically select examples likely
to involve mathematical reasoning from a large pool of candidates. Examples with numeric an-
swers or those containing quantity words (as listed in §D.1) in the questions were selected. This
automatic filtration yielded 4,949 VQA-format examples, though some false positive examples re-
mained. Therefore, we engaged three expert annotators to manually label these examples to deter-
mine if they involve mathematical reasoning (more details in § D.2). Utilizing majority voting and
limiting each source dataset to 400 examples, we finalized a collection of 2,739 examples.
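
To make this filtering step concrete, the sketch below shows a minimal version of such a rule-based pre-filter in Python; the quantity-word list and example fields are illustrative placeholders (the full word list appears in §D.1), not the exact implementation.

```python
# A small illustrative subset of quantity-related words; the full list used
# by the authors is given in the paper's appendix (§D.1).
QUANTITY_WORDS = {
    "how many", "number", "total", "sum", "difference", "average",
    "more", "less", "most", "least", "percent", "ratio",
}

def is_numeric(text: str) -> bool:
    """Return True if the answer string parses as a number."""
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False

def likely_math_example(question: str, answer: str) -> bool:
    """Heuristic pre-filter: keep examples with numeric answers or with
    quantity words in the question; remaining false positives are later
    removed by human annotators."""
    if is_numeric(answer):
        return True
    q = question.lower()
    return any(word in q for word in QUANTITY_WORDS)

# Hypothetical candidate pool loaded from a source VQA dataset.
candidates = [
    {"question": "How many dogs are in the image?", "answer": "3"},
    {"question": "What color is the car?", "answer": "red"},
]
selected = [ex for ex in candidates
            if likely_math_example(ex["question"], ex["answer"])]
print(selected)
```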
Collection of three new datasets. While the source datasets we collected encompass multiple
visual contexts and mathematical reasoning abilities, certain scenarios remain unaddressed: logical
reasoning on puzzle test diagrams, algebraic reasoning on functional plots, and scientific reasoning
on academic figures. To address these gaps, we introduced three new datasets: IQTest, FunctionQA,
and PaperQA, with examples illustrated in Figure 2. IQTest comprises 228 examples requiring in-
ductive reasoning, abstract thinking, pattern prediction, and calculations, sourced from puzzle test
figures on online learning platforms. FunctionQA, with 400 examples, emphasizes subtle visual per-
ceptions of functional plots and algebraic reasoning concerning variables, expressions, equations,
and functions. PaperQA is a novel dataset featuring questions derived from informative academic il-
lustrations, including tables, figures, and charts from online education resources, with 107 examples
sourced from papers released in August 2023 on Huggingface.
To ensure data quality, all questions were manually annotated by graduate students in STEM fields
and further refined through a rigorous review process. To ensure consistency in annotation, we
employed a two-step process. Initially, each dataset was independently annotated by three review-
ers, resulting in a high inter-annotation consistency rate of 99.2%. Specifically, among the newly
collected 736 questions, only 6 exhibited disagreements in the annotated answers. Then, these dis-
crepancies were resolved through discussion among the entire review team, ensuring a consensus
was reached on each example. The GUI of the annotation tool is shown in Figure 23 in §D.3.
Table 1: Key statistics of MathVista.

Total questions: 6,141
- Multiple-choice questions: 3,392 (55.2%)
- Free-form questions: 2,749 (44.8%)
- Questions with annotations: 5,261 (85.6%)
- Questions newly annotated: 736 (12.0%)
Unique number of images: 5,487
Unique number of questions: 4,746
Unique number of answers: 1,464
Source datasets: 31
- Existing VQA datasets: 19
- Existing MathQA datasets: 9
- Our newly annotated datasets: 3
Visual context (image) classes: 19
Maximum question length: 213
Maximum answer length: 27
Maximum choice number: 8
Average question length: 15.6
Average answer length: 1.2
Average choice number: 3.4

Figure 3: Source dataset distribution of MathVista. FQA: figure question answering, GPS: geometry problem solving, MWP: math word problem, TQA: textbook question answering, VQA: visual question answering.
Each example in MathVista is tagged with one or more of seven different types of mathematical reasoning abilities, as categorized in Table 3 (§C.1). Coarse la-
bels of mathematical reasoning can be automatically obtained from the details of the source datasets.
To verify the quality of automatic annotation, expert annotators manually label the mathematical rea-
soning categories from seven candidates for 1,000 examples, using the annotation tool illustrated in
§D.4. The results show that 94.1% of the examples from automatic and human annotations have
the exact same set of reasoning types, while 98.79% of the individual labels are identical, indicating
that the automatic annotation for the labeling of mathematical reasoning is highly accurate.
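
The two agreement figures can be read as an exact-set match rate and a per-label agreement rate; the sketch below shows one plausible way to compute them, with the label abbreviations and example data being hypothetical.

```python
def exact_set_match(auto_labels, human_labels):
    """Fraction of examples whose automatically assigned and human-assigned
    reasoning-type sets are identical (94.1% in the paper's study)."""
    same = sum(set(a) == set(h) for a, h in zip(auto_labels, human_labels))
    return same / len(auto_labels)

def per_label_agreement(auto_labels, human_labels, all_types):
    """Fraction of individual (example, type) decisions on which the two
    annotations agree (98.79% in the paper's study)."""
    agree, total = 0, 0
    for a, h in zip(auto_labels, human_labels):
        for t in all_types:
            agree += (t in a) == (t in h)
            total += 1
    return agree / total

TYPES = ["ALG", "ARI", "GEO", "LOG", "NUM", "SCI", "STA"]
auto = [["ARI", "STA"], ["GEO"]]
human = [["ARI", "STA"], ["GEO", "ALG"]]
print(exact_set_match(auto, human), per_label_agreement(auto, human, TYPES))
```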
MathVista consists of 6,141 examples, divided into two subsets: testmini and test. testmini con-
tains 1,000 examples, intended for model development validation or for those with limited comput-
ing resources. The test set features the remaining 5,141 examples for standard evaluation. Notably,
the answer labels for test will not be publicly released to prevent data contamination, and we will
maintain an online evaluation platform. To ensure that each source dataset is well represented in
testmini and to maintain a distribution in testmini closely resembling the whole set, we adopted this
sampling strategy: (1) first, randomly sample a threshold number of 4 questions from each source
dataset; (2) then, randomly sample the remaining questions from each source dataset in proportion
to its share of the entire set. The KL divergence and Total Variation (TV) distance between the testmini set and
the entire set are 0.008 and 0.035, respectively, suggesting that testmini is close to the distribution
of the whole set. We also conducted several quality checks to address any unidentified errors.
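
A minimal sketch of this sampling procedure and of the distribution check is given below; the "source" field and function names are assumptions rather than the released implementation.

```python
import math
import random
from collections import Counter

def sample_testmini(examples, total=1000, min_per_source=4, seed=0):
    """Sketch of the two-step split: (1) guarantee at least `min_per_source`
    examples from every source dataset, then (2) fill the remaining slots
    uniformly from the leftover pool, which keeps source proportions close
    to those of the full benchmark. Each example is assumed to carry a
    hypothetical "source" field."""
    rng = random.Random(seed)
    by_source = {}
    for idx, ex in enumerate(examples):
        by_source.setdefault(ex["source"], []).append(idx)

    chosen = set()
    for indices in by_source.values():
        chosen.update(rng.sample(indices, min(min_per_source, len(indices))))

    leftover = [i for i in range(len(examples)) if i not in chosen]
    rng.shuffle(leftover)
    chosen.update(leftover[: max(0, total - len(chosen))])
    return [examples[i] for i in sorted(chosen)]

def source_divergence(subset, full):
    """KL divergence and total variation distance between the source
    distributions of a subset and the full set."""
    p = Counter(ex["source"] for ex in subset)
    q = Counter(ex["source"] for ex in full)
    n_p, n_q = sum(p.values()), sum(q.values())
    kl = sum((c / n_p) * math.log((c / n_p) / (q[s] / n_q)) for s, c in p.items())
    tv = 0.5 * sum(abs(p.get(s, 0) / n_p - c / n_q) for s, c in q.items())
    return kl, tv
```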
The main statistics of MathVista are presented in Table 1. There are two types of questions:
multiple-choice and free-form. Answers to free-form questions are categorized as integers, floating-point
numbers, or lists. The large numbers of unique images, questions, and answers ensure pattern
diversity in MathVista. MathVista is derived from 31 source datasets, including three newly
annotated datasets to address the missing types of mathematical reasoning over specific visual con-
texts. Dataset examples in Table 4 (§C.2) highlight the richness of mathematical reasoning involved.
Examples in §C.3 demonstrate the diverse visual contexts present in MathVista. Further details
on data analysis are available in §E.
3 EXPERIMENTS
Prior work (Yang et al., 2023b) has studied the reasoning abilities of foundation models in visual
settings from a qualitative perspective. In contrast, our goal is to conduct both qualitative and quan-
titative studies to provide a systematic evaluation of existing foundation models for mathematical
reasoning capabilities in visual contexts using M ATH V ISTA. We introduce a novel benchmarking
strategy for M ATH V ISTA tailored for foundational models (§3.1). The models we have chosen are
detailed in §3.2. Quantitative results can be found in §3.3 and §3.4, while the qualitative analysis is
provided in §3.5. Given the significant advancements of GPT-4V over other models, we undertake
an in-depth comparative study with its peers in various aspects and highlight potential avenues for
future research in §H.
Recent LLMs and LMMs are typically instructed to generate long responses rather than short answer
text in conventional settings. We therefore propose a new strategy for benchmarking MathVista instead
of relying on human-designed or template-matching rules (Lu et al., 2022). The evaluation process consists
of three stages: response generation, answer extraction, and score calculation. Initially, the base-
lines generate responses given the input query, which incorporates the task description, the question,
the choices, and the metadata, using the template defined in Table 9 (§F.3). Next, the short answer
text is extracted from the detailed response. We propose an answer extractor (§F.2) based on LLMs
such as GPT-4, inspired by its remarkable ability for text processing (Wei et al., 2022b). A prelim-
inary study of 200 examples shows that GPT-4 can extract the answer text with more than 99.5%
accuracy. Finally, the extracted answer is normalized to a required answer format (e.g., an option
letter or an integer), and the target metric scores are computed. Since the instances in MathVista
are either multiple-choice questions with textual answers or free-form questions with numerical
answers, accuracy scores are used as the metric for deterministic evaluation.
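
The sketch below illustrates the extraction, normalization, and scoring stages; the prompt wording and the `call_llm` helper are placeholders rather than the exact extractor prompt from §F.2.

```python
def extract_answer(response: str, meta: dict, call_llm) -> str:
    """Answer-extraction stage: prompt an LLM (e.g., GPT-4) to pull the short
    answer out of a long free-form response. `call_llm` stands in for any
    chat-completion client; the prompt wording is an approximation, not the
    template from the paper's appendix (§F.2)."""
    prompt = (
        "Extract only the final answer from the model response below.\n"
        f"Question type: {meta['question_type']}\n"
        f"Answer type: {meta['answer_type']}\n"
        f"Response: {response}\n"
        "Final answer:"
    )
    return call_llm(prompt).strip()

def normalize(answer: str, answer_type: str):
    """Normalize the extracted text into the required answer format."""
    if answer_type == "integer":
        return int(float(answer.replace(",", "")))
    if answer_type == "float":
        return round(float(answer.replace(",", "")), 2)
    return answer.strip().upper()  # e.g., a multiple-choice option letter

def accuracy(predictions, targets) -> float:
    """Deterministic metric: exact match between normalized answers."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)
```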
We evaluate the models on MathVista under three setups: (a) Text-Only LLMs including ChatGPT
(OpenAI, 2022), GPT-4 (OpenAI, 2023a), and Claude-2 (Anthropic, 2023) in zero-shot and two-shot
settings with Chain-of-Thought (CoT) (Wei et al., 2022b) and Program-of-Thought (PoT) (Chen
et al., 2022b), (b) Augmented-LLMs where the LLMs are provided with additional visual information
including the generated image captions from Multimodal Bard (Google, 2023) and the detected
OCR text from EasyOCR (JaidedAI, 2020), (c) LMMs that include open-source models such as
IDEFICS-9B (Laurençon et al., 2023), mPLUG-OWL-LLaMA-7B (Ye et al., 2023), miniGPT-4-
LLaMA-2-7B (Zhu et al., 2023a), LLaMA-Adapter-V2-7B (Gao et al., 2023), InstructBLIP-Vicuna-
7B (Dai et al., 2023), LLaVA-LLaMA-2-13B (Liu et al., 2023a), LLaVAR (Zhang et al., 2023d), and
proprietary models such as Bard and GPT-4V. Since GPT-4V does not offer API access, we resorted
to manually evaluating it using the playground chatbot. We provide the prompts for LLMs and the
hyperparameters used for LMMs in §F.
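
For setup (b), the query given to an augmented LLM can be assembled roughly as follows; the exact template is defined in Table 9 (§F.3), so the wording and field names here are only an approximation.

```python
def build_augmented_query(example: dict, caption: str, ocr_tokens: list) -> str:
    """Assemble the text-only input for an augmented LLM: the question and
    choices plus an external image caption and detected OCR text."""
    parts = [
        "Answer the following question about the image.",
        f"Image caption: {caption}",
        f"OCR text: {', '.join(ocr_tokens)}",
        f"Question: {example['question']}",
    ]
    if example.get("choices"):
        options = " ".join(f"({chr(65 + i)}) {c}"
                           for i, c in enumerate(example["choices"]))
        parts.append(f"Choices: {options}")
    parts.append("Answer:")
    return "\n".join(parts)

# Hypothetical usage with a caption and OCR tokens produced by external models.
query = build_augmented_query(
    {"question": "What is the highest bar value?", "choices": ["10", "20", "30"]},
    caption="A bar chart with three bars labeled A, B, and C.",
    ocr_tokens=["A", "B", "C", "10", "20", "30"],
)
```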
We compare the performance of several models, including Text-only LLMs, Augmented LLMs, and
LMMs on MathVista in Table 2. We include random chance (i.e., one of the options in multiple-
choice questions, and empty in the free-form questions) and frequency guess (§F.1) as naive base-
lines. Additionally, we established a human performance baseline using Amazon Mechanical Turk.
Eligible human annotators must have a satisfactory annotation history, successfully pass the qualifica-
tion examples, and possess a high school diploma or higher. We asked each annotator to complete
five questions within 20 minutes. Further details can be found in §F.6.
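
The two naive baselines can be implemented along the following lines; the precise definition of the frequent-guess baseline is given in §F.1, so this sketch reflects one plausible reading.

```python
import random
from collections import Counter

def random_chance_baseline(examples, seed=0):
    """Pick a random option for multiple-choice questions and return an
    empty answer for free-form questions."""
    rng = random.Random(seed)
    return [rng.choice(ex["choices"]) if ex.get("choices") else ""
            for ex in examples]

def frequent_guess_baseline(examples):
    """Always predict the most frequent ground-truth answer, computed
    separately for multiple-choice and free-form questions."""
    mc = Counter(ex["answer"] for ex in examples if ex.get("choices"))
    ff = Counter(ex["answer"] for ex in examples if not ex.get("choices"))
    mc_guess = mc.most_common(1)[0][0] if mc else ""
    ff_guess = ff.most_common(1)[0][0] if ff else ""
    return [mc_guess if ex.get("choices") else ff_guess for ex in examples]
```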
Among text-only LLMs, all models outperform the random baselines, with the 2-shot GPT-4 using
chain-of-thought (CoT) prompting achieving 29.2%. The limited performance of text-only LLMs
suggests that our dataset requires models to reason within visual contexts for optimal results. When
equipped with image captions and detected OCR text, augmented LLMs exhibit superior perfor-
mance compared to their text-only counterparts on MathVista. Specifically, the best-performing
augmented LLM is the 2-shot GPT-4 employing program-of-thought (PoT) prompting, which scores
33.9%. This model generates Python programs for execution, thereby promoting rigorous reasoning.
Model Input ALL FQA GPS MWP TQA VQA ALG ARI GEO LOG NUM SCI STA
Heuristics baselines
Random chance - 17.9 18.2 21.6 3.8 19.6 26.3 21.7 14.7 20.1 13.5 8.3 17.2 16.3
Frequent guess - 26.3 22.7 34.1 20.4 31.0 24.6 33.1 18.7 31.4 24.3 19.4 32.0 20.9
Large Language Models (LLMs)
Zero-shot ChatGPT Q only 23.5 21.9 26.9 9.1 38.6 23.5 27.7 15.9 25.7 21.6 9.9 41.5 20.5
Zero-shot GPT-4 Q only 26.1 22.3 37.0 7.0 39.2 27.4 33.6 17.4 35.6 16.2 9.2 45.8 19.5
Zero-shot Claude-2 Q only 26.4 21.9 34.1 13.4 36.1 29.1 32.8 20.4 33.3 13.5 12.1 36.4 20.5
2-shot CoT Claude-2 Q only 24.4 18.6 29.8 9.7 33.5 34.1 29.2 19.0 28.0 5.4 13.9 36.9 18.9
2-shot CoT ChatGPT Q only 26.8 20.1 36.5 8.6 44.9 28.5 35.6 17.0 33.5 21.6 14.6 45.9 17.9
2-shot CoT GPT-4 Q only 29.2 20.1 44.7 8.6 46.2 31.3 41.6 19.3 41.0 18.9 13.9 47.5 18.9
2-shot PoT ChatGPT Q only 25.1 19.0 30.8 16.1 38.0 25.7 29.9 19.8 29.3 24.3 19.4 38.5 16.9
2-shot PoT GPT-4 Q only 26.0 20.1 33.2 8.1 44.9 28.5 32.7 16.7 31.0 24.3 13.2 48.4 18.3
Augmented Large Language Models (Augmented-LLMs)
2-shot CoT Claude-2 Q, Ic , It 33.2 26.0 31.7 35.5 48.1 30.2 32.4 32.3 33.0 16.2 17.4 54.9 36.2
2-shot CoT ChatGPT Q, Ic , It 33.2 27.5 29.3 36.0 49.4 29.1 31.0 32.9 31.0 16.2 17.4 50.8 37.2
2-shot CoT GPT-4 Q, Ic , It 33.2 27.9 31.7 31.2 51.9 28.5 33.5 30.9 32.2 13.5 12.5 58.2 37.9
2-shot PoT ChatGPT Q, Ic , It 26.8 24.5 26.4 23.7 33.5 27.9 27.8 26.1 28.0 18.9 13.2 33.6 29.9
2-shot PoT GPT-4 Q, Ic , It 33.9 30.1 39.4 30.6 39.9 31.3 37.4 31.7 41.0 18.9 20.1 44.3 37.9
Large Multimodal Models (LMMs)
IDEFICS-9B-Instruct Q, I 19.8 21.6 21.1 6.5 25.9 24.0 22.1 15.0 19.8 18.9 9.9 24.6 18.1
mPLUG-Owl-LLaMA-7B Q, I 22.2 22.7 23.6 10.2 27.2 27.9 23.6 19.2 23.9 13.5 12.7 26.3 21.4
miniGPT4-LLaMA-2-7B Q, I 23.1 18.6 26.0 13.4 30.4 30.2 28.1 21.0 24.7 16.2 16.7 25.4 17.9
LLaMA-Adapter-V2-7B Q, I 23.9 21.2 25.5 11.3 32.3 31.8 26.3 20.4 24.3 24.3 13.9 29.5 18.3
LLaVAR Q, I 25.2 21.9 25.0 16.7 34.8 30.7 24.2 22.1 23.0 13.5 15.3 42.6 21.9
InstructBLIP-Vicuna-7B Q, I 25.3 23.1 20.7 18.3 32.3 35.2 21.8 27.1 20.7 18.9 20.4 33.0 23.1
LLaVA-LLaMA-2-13B Q, I 26.1 26.8 29.3 16.1 32.3 26.3 27.3 20.1 28.8 24.3 18.3 37.3 25.1
Multimodal Bard Q, I 34.8 26.0 47.1 29.6 48.7 26.8 46.5 28.6 47.8 13.5 14.9 47.5 33.0
GPT-4V (Playground) Q, I 49.9 43.1 50.5 57.5 65.2 38.0 53.0 49.0 51.0 21.6 20.1 63.1 55.8
Human
Human performance Q, I 60.3 59.7 48.4 73.0 63.2 55.9 50.9 59.2 51.4 40.7 53.8 64.9 63.9
Table 2: Accuracy scores on the testmini subset of MathVista. Input: Q: question, I: image,
Ic: image caption, It: OCR text detected in the image. ALL: overall accuracy. Task types: FQA:
figure question answering, GPS: geometry problem solving, MWP: math word problem, TQA: text-
book question answering, VQA: visual question answering. Mathematical reasoning types: ALG:
algebraic reasoning, ARI: arithmetic reasoning, GEO: geometry reasoning, LOG: logical reasoning,
NUM: numeric commonsense, SCI: scientific reasoning, STA: statistical reasoning. The highest
scores among models in each section and overall are highlighted in blue and red, respectively.
On the LMM side, Multimodal Bard scores a 34.8% accuracy, which is only 58% of human perfor-
mance at 60.3%. Notably, the best-performing GPT-4V model achieves 49.9%, marking a substan-
tial 15.1% improvement over Bard; however, it still falls 10.4% short of human performance. These
gaps highlight that there is a significant scope for further improvements on our benchmark. The
open-source models (IDEFICS to LLaVA) achieve underwhelming performance on MathVista.
This can be attributed to their lack of math reasoning capabilities, text recognition (useful for math
word problems), shape detection (useful for geometrical problems), and chart understanding. No-
tably, these models utilize different model architectures for processing the vision (e.g., OpenCLIP,
CLIP, ViT-G) and language (e.g., LLaMA-1, LLaMA-2) inputs, different alignment strategies (e.g., MLP
projection in LLaVA, Q-former in InstructBLIP, visual abstractor in mPLUG-Owl), and different
instruction-tuning data (e.g., 150K instruction-response pairs from LLaVA data, 3,500 instruction-response
pairs from miniGPT-4). Although fine-tuned with instruction-following data from text-rich images,
LLaVAR does not perform well, indicating that strong text recognition abilities alone do not guarantee
high performance on MathVista, which requires comprehensive visual perception and mathemat-
ical reasoning. This underscores that there are immense possibilities for innovations in model, data,
or training objectives to improve the zero-shot performance of LMMs on MathVista.
We also report fine-grained scores for a comprehensive study of the capabilities of existing models
across different tasks (Table 2), mathematical reasoning abilities (Table 2, Figures 1, 33), visual con-
Figure 4: Error analysis of Bard results: (a) presents errors in answers and explanations; (b) delves
into the details of wrong explanations. Notations: “Answer” is “Ans.”, “Explanation” is “Exp.”,
“Partially Correct” is “Partial”, and “Not applicable” refers to unanswerable or indeterminate cases.
(a) Correct answer and explanation (b) Correct answer but wrong explanation
Figure 5: Two examples from Bard. In (b), Bard does not correctly identify the geometry symbols
and relationships. A correct explanation should identify the isosceles triangle and apply its properties.
text types (Figures 1, 34), and grade levels (Figure 35). Remarkably, GPT-4V surpasses most other
baselines in various categories, with exceptions in problems related to logical reasoning and numeric
commonsense reasoning. Notably, GPT-4V surpasses human performance not only in tasks like ge-
ometry problem solving (GPS), textbook question answering (TQA), and mathematical reasoning
skills such as algebraic reasoning but also in visual contexts including function plots, geometry
diagrams, scatter plots, and tables. Please refer to §G.2, §G.3, and §G.4 for more detailed analysis.
We perform an ablation study on the augmented LLMs and present the results in Table 36 (see §G.5).
The gap in the performance of the Augmented LLMs can be attributed to poor image captions, which
may not adequately describe the math in visual contexts, the inability of the OCR to detect shapes
useful for geometrical reasoning, and the lack of mathematical reasoning capabilities. An in-depth
study of GPT-4V can be found in §H.
Success and failure analysis of Multimodal Bard. In §3.3, we observe that Multimodal Bard
achieves the highest average accuracy on M ATH V ISTA. Here, we analyze its predictions through
human evaluation to understand its modes of success and failure. To do so, we ask human workers
from Amazon Mechanical Turk (AMT) to study Bard’s predictions given the math question, its
associated image, and the ground truth from the MathVista dataset for 250 instances. Specifically,
workers were instructed to decide whether the predictions contained the correct answer with the
correct explanation. If the workers found that the model’s explanation was incorrect, they were asked to
indicate whether the wrong explanation was due to various failure modes such as incorrect reasoning with
hallucination or wrong calculations. In our setup, we define hallucination as the introduction of
incorrect facts into the model explanation that are not supported by the context of the image or question
(e.g., in Figure 39 and Figure 40). More details can be found in §F.7.
We present the distribution of the quality of Bard’s predictions, judged by the human annotators,
in Figure 4 (a). We find that 44.6% of Bard’s predictions had incorrect answers with incorrect
explanations. Interestingly, we observe that Bard responds with partially (6.8%) or completely (8.1%)
incorrect explanations despite giving the correct answer to the input image and question, indicating
that it can reach the correct answer for the wrong reasons. In Figure 4 (b), we present the
distribution over possible reasons when Bard provides incorrect explanations. Notably, we find that
49.6% of its responses contain hallucinations. Our analysis highlights that hallucination is a major
source of errors in the generative foundation models (Lu et al., 2023c; Ji et al., 2023). We also
observe that the model responds with otherwise correct reasoning but either hallucinates (18.6%) or performs
wrong calculations (19.5%), resulting in an overall incorrect explanation.
Qualitative examples of Multimodal Bard. We also present a few qualitative examples of Bard’s
predictions. In Figure 5 (a), we find that Bard generates the correct answer with the correct expla-
nation, including detecting the correct function (i.e., f(x) = x²) and analyzing its properties (i.e.,
injective) to answer the question. However, in Figure 5 (b), we observe that the model provides the
correct answer (i.e., 12) but with an incorrect explanation (i.e., using the law of cosines when the
question requires an understanding of the properties of isosceles triangles). We present more ex-
amples in §G.9. Overall, our analysis of Bard highlights its modes of failure in detail, which could
guide future foundation model design to address these issues.
Qualitative examples of Augmented GPT-4. Augmented with external visual models, CoT GPT-
4 and PoT GPT-4 are able to achieve performance comparable to Multimodal Bard. As shown
in Figure 6 (a), provided with the accurate OCR text detected in the image, PoT GPT-4 accurately
understands the structural information of the image and generates a code snippet to perform precise
statistical reasoning. In Figure 6 (b), the caption provides some accurate descriptions of the image
(e.g., f(x) = c) along with hallucination (e.g., y = 3, the line passes through (0, 3)) caused by the
external Bard model. Although CoT GPT-4 predicts the correct answer from this partially correct
information, the quality of the visual information provided by external models directly affects
visual perception and thus the final mathematical reasoning performance. Examples in
§G.10 show failure cases due to hallucination caused by external visual models.
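
For illustration, a program of the kind PoT GPT-4 generates for such chart questions might look like the following; the chart values and the question are hypothetical, since the figure itself is not reproduced here.

```python
# Hypothetical values read from the OCR text of a bar chart; the actual
# figure and question from Figure 6 (a) are not reproduced here.
bar_values = {"2017": 12.4, "2018": 15.1, "2019": 18.3, "2020": 21.0}

# Typical steps in such a generated program: aggregate the detected numbers
# and compare specific categories, then print the final result.
average = sum(bar_values.values()) / len(bar_values)
increase = bar_values["2020"] - bar_values["2017"]

print(f"Average value: {average:.2f}")
print(f"Increase from 2017 to 2020: {increase:.2f}")
```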
4 RELATED WORK
Several benchmarks (Amini et al., 2019; Cobbe et al., 2021; Mishra et al., 2022; Frieder et al., 2023)
have emerged to assess the mathematical reasoning capabilities of LLMs, but most focus solely
on text-based tasks. Current benchmarks, such as GSM-8K (Cobbe et al., 2021), exhibit perfor-
mance saturation. Given the rise of LMMs (Li et al., 2023a), there is a need for robust multimodal
benchmarks in scientific domains. To address this gap, we introduce a math reasoning dataset that
incorporates visual contexts.
VQA datasets (Antol et al., 2015; Gurari et al., 2018; Mobasher et al., 2022) gauge the visual reason-
ing abilities of LMMs. Recent studies explore assessing LMMs beyond natural images, including
abstract scenes, geometry diagrams, figures, charts, documents, and synthetic images (Lu et al.,
2021a; Kahou et al., 2017; Masry et al., 2022). In this work, we introduce new datasets (IQTest,
FunctionQA, PaperQA) to create a holistic benchmark for evaluating mathematical reasoning.
Generative foundation models like GPT-3, ChatGPT, GPT-4, Claude, and LLaMA have enabled di-
verse task solutions without fine-tuning. Specialized pretraining methods like Pix2Struct (Lee et al.,
2023), MatCha (Liu et al., 2022), and UniChart (Masry et al., 2023) enhance chart reasoning in vi-
sual contexts. Models like LLaVA, miniGPT4, InstructBLIP, and Bard leverage large-scale image-
text data, while specialized versions, such as LLaVAR (Zhang et al., 2023d; Ye et al., 2023), em-
phasize document understanding and math comprehension. Recent works (Bitton et al., 2023; Yu
et al., 2023) evaluate instruction-following and reasoning capabilities, underscoring the growing im-
portance of generative foundation models in practical applications. We introduce MathVista as a
benchmark to evaluate their math reasoning capabilities in varied visual contexts.
5 CONCLUSION
In this work, we introduce MathVista, a benchmark designed to systematically analyze the math-
ematical reasoning capabilities of state-of-the-art models in visually complex scenarios. Our evalu-
ation of 12 prominent foundation models highlights that significant advancements have been made,
especially with the GPT-4V model. However, a substantial gap of 10.4% still exists between GPT-
4V, the best-performing model, and human performance. This disparity sets a clear direction for
future research, emphasizing the need for models that can seamlessly integrate mathematical rea-
soning with visual comprehension. Moreover, our exploration of GPT-4V’s self-verification, self-
consistency, and chatbot interactions offers valuable insights for future investigations.
REFERENCES
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel
Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language
model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–
23736, 2022. 20
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh
Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based for-
malisms. In Proceedings of the 2019 Conference of the North American Chapter of the Associ-
ation for Computational Linguistics: Human Language Technologies (NAACL), pp. 2357–2367,
2019. 10, 20
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zit-
nick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE international
conference on computer vision, pp. 2425–2433, 2015. 10, 20, 27
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani
Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-
source framework for training large autoregressive vision-language models. arXiv preprint
arXiv:2308.01390, 2023. 20
Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gard-
ner, Rohan Taori, and Ludwig Schimdt. VisIT-Bench: A benchmark for vision-language instruc-
tion following inspired by real-world use. arXiv preprint arXiv:2308.06595, 2023. 10, 20
Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel
Stanovsky, and Roy Schwartz. Breaking common sense: WHOOPS! A vision-and-language
benchmark of synthetic and compositional images. arXiv preprint arXiv:2303.07274, 2023. 20
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx,
Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportu-
nities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. 20
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 20
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Ka-
mar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general
intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023. 20
Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through
dual parallel text encoding. In Proceedings of the 29th International Conference on Computa-
tional Linguistics, pp. 1511–1520, 2022. 20, 27
Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. MapQA: A
dataset for question answering on choropleth maps. arXiv preprint arXiv:2211.08545, 2022. 20,
27
Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. UniGeo:
Unifying geometry logical reasoning via reformulating mathematical expression. In Proceedings
of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3313–3323,
2022a. 20, 27
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared
Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large
language models trained on code. arXiv preprint arXiv:2107.03374, 2021. 20
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompt-
ing: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint
arXiv:2211.12588, 2022b. 2, 6, 21
Wenhu Chen, Ming Yin, Max Ku, Elaine Wan, Xueguang Ma, Jianyu Xu, Tony Xia, Xinyi
Wang, and Pan Lu. TheoremQA: A theorem-driven question answering dataset. arXiv preprint
arXiv:2305.12524, 2023. 21, 27
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to
solve math word problems. arXiv preprint arXiv:2110.14168, 2021. 10, 20
Adam Dahlgren Lindström and Savitha Sam Abraham. CLEVR-Math: A dataset for composi-
tional language, visual and mathematical reasoning. In 16th International Workshop on Neural-
Symbolic Learning and Reasoning, NeSy 2022, Windsor, UK, september 28-30, 2022., volume
3212. CEUR-WS, 2022. 1, 20, 27
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang,
Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-
language models with instruction tuning, 2023. 6, 20, 39
Qingxiu Dong, Li Dong, Ke Xu, Guangyan Zhou, Yaru Hao, Zhifang Sui, and Furu Wei. Large
language model for science: A study on P vs. NP. arXiv preprint arXiv:2309.05689, 2023. 1
Iddo Drori and Nakul Verma. Solving linear algebra by program synthesis. arXiv preprint
arXiv:2111.08171, 2021. 21
Iddo Drori, Sarah Zhang, Reece Shuttleworth, Leonard Tang, Albert Lu, Elizabeth Ke, Kevin Liu,
Linda Chen, Sunny Tran, Newman Cheng, et al. A neural network solves, explains, and generates
university math problems by program synthesis and few-shot learning at human level. Proceed-
ings of the National Academy of Sciences, 119(32):e2123433119, 2022. 21
Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz,
Philipp Christian Petersen, Alexis Chevalier, and Julius Berner. Mathematical capabilities of
chatgpt. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on
Datasets and Benchmarks, 2023. 10, 20
Lingyue Fu, Huacan Chai, Shuang Luo, Kounianhua Du, Weiming Zhang, Longteng Fan, Jiayi Lei,
Renting Rui, Jianghao Lin, Yuchen Fang, et al. CodeApex: A bilingual programming evaluation
benchmark for large language models. arXiv preprint arXiv:2309.01940, 2023. 20
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu,
Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. LLaMA-Adapter V2: Parameter-efficient
visual instruction model. arXiv preprint arXiv:2304.15010, 2023. 6, 20
Google. Bard, 2023. URL https://bard.google.com/. 2, 6, 20
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V
in VQA matter: Elevating the role of image understanding in visual question answering. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913,
2017. 20, 27
Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and
Jeffrey P Bigham. VizWiz grand challenge: Answering visual questions from blind people. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3608–3617,
2018. 10, 20, 27
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot
planners: Extracting actionable knowledge for embodied agents. In International Conference on
Machine Learning, pp. 9118–9147. PMLR, 2022. 20
Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu,
Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al. C-eval: A multi-level multi-discipline chinese
evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023. 20
JaidedAI. EasyOCR: Ready-to-use OCR, 2020. URL https://github.com/JaidedAI/
EasyOCR. 6
Anya Ji, Noriyuki Kojima, Noah Rush, Alane Suhr, Wai Keen Vong, Robert D Hawkins, and Yoav
Artzi. Abstract visual reasoning with tangram shapes. arXiv preprint arXiv:2211.16492, 2022.
20
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang,
Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM
Computing Surveys, 55(12):1–38, 2023. 9
Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visu-
alizations via question answering. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pp. 5648–5656, 2018. 20, 27
Daniel Kahneman. Thinking, fast and slow. macmillan, 2011. 1
Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and
Yoshua Bengio. FigureQA: An annotated figure dataset for visual reasoning. arXiv preprint
arXiv:1710.07300, 2017. 10, 20, 27
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali
Farhadi. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th Euro-
pean Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14,
pp. 235–251. Springer, 2016. 20, 27
Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh
Hajishirzi. Are you smarter than a sixth grader? Textbook question answering for multimodal
machine comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern
recognition, pp. 4999–5007, 2017. 20, 27
Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically
generated visual questions and answers about radiology images. Scientific data, 5(1):1–10, 2018.
20, 27
Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov,
Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and
Victor Sanh. OBELICS: An open web-scale filtered dataset of interleaved image-text documents,
2023. 6, 39
Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos,
Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2Struct: Screen-
shot parsing as pretraining for visual language understanding. In International Conference on
Machine Learning, pp. 18893–18912. PMLR, 2023. 10, 20
Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, and Jianfeng Gao.
Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint
arXiv:2309.10020, 2023a. 10
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-
image pre-training with frozen image encoders and large language models. arXiv preprint
arXiv:2301.12597, 2023b. 39
Yunxin Li, Longyue Wang, Baotian Hu, Xinyu Chen, Wanqi Zhong, Chenyang Lyu, and Min Zhang.
A comprehensive evaluation of gpt-4v on knowledge-intensive visual question answering. arXiv
preprint arXiv:2311.07536, 2023c. 39
Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin
Van Durme, and Alan L Yuille. Super-CLEVR: A virtual benchmark to diagnose domain ro-
bustness in visual reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 14963–14973, 2023d. 20, 27
Thomas Liao, Rohan Taori, Inioluwa Deborah Raji, and Ludwig Schmidt. Are we learning yet? A
meta review of evaluation failures across machine learning. In Thirty-fifth Conference on Neural
Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. 20
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan
Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint
arXiv:2305.20050, 2023. 1
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr
Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer
Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014,
Proceedings, Part V 13, pp. 740–755. Springer, 2014. 20
Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi,
Yasemin Altun, Nigel Collier, and Julian Martin Eisenschlos. MatCha: Enhancing visual lan-
guage pretraining with math reasoning and chart derendering. arXiv preprint arXiv:2212.09662,
2022. 10, 20
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv
preprint arXiv:2304.08485, 2023a. 6, 20
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding,
Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint
arXiv:2308.03688, 2023b. 20
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan,
Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around
player? arXiv preprint arXiv:2307.06281, 2023c. 20
Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Min-
grui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of OCR in large multimodal
models. arXiv preprint arXiv:2305.07895, 2023d. 20
Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu.
Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning.
In The 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021a. 1,
10, 20, 21, 27
Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and
Song-Chun Zhu. IconQA: A new benchmark for abstract diagram understanding and visual lan-
guage reasoning. In The 35th Conference on Neural Information Processing Systems (NeurIPS)
Track on Datasets and Benchmarks, 2021b. 20, 27
Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord,
Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for
science question answering. In The 36th Conference on Neural Information Processing Systems
(NeurIPS), 2022. 6, 20, 27
Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu,
and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language mod-
els. In The 37th Conference on Neural Information Processing Systems (NeurIPS), 2023a. 2,
37
Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter
Clark, and Ashwin Kalyan. Dynamic prompt learning via policy gradient for semi-structured
mathematical reasoning. In International Conference on Learning Representations (ICLR),
2023b. 21, 27
Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. A survey of deep learning
for mathematical reasoning. In The 61st Annual Meeting of the Association for Computational
Linguistics (ACL), 2023c. 9, 20
Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A bench-
mark for question answering about charts with visual and logical reasoning. In Findings of the
Association for Computational Linguistics: ACL 2022, pp. 2263–2279, 2022. 1, 10, 20, 27
Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. UniChart:
A universal vision-language pretrained model for chart comprehension and reasoning. arXiv
preprint arXiv:2305.14761, 2023. 10, 20
Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawa-
har. InfographicsVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision, pp. 1697–1706, 2022. 20, 27
Nitesh Methani, Pritha Ganguly, Mitesh M Khapra, and Pratyush Kumar. PlotQA: Reasoning over
scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer
Vision, pp. 1527–1536, 2020. 20, 27
Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay
Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. LILA: A unified
benchmark for mathematical reasoning. In The 2022 Conference on Empirical Methods in Natural
Language Processing (EMNLP), 2022. 10, 20
Shaghayegh Mobasher, Ghazal Zamaninejad, Maryam Hashemi, Melika Nobakhtian, and Sauleh
Eetemadi. ParsVQA-Caps: A benchmark for visual question answering and image captioning in
persian. people, 101:404, 2022. 10, 20
Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities
of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023. 20
OpenAI. Chatgpt, 2022. URL https://openai.com/blog/chatgpt. 2, 6, 20
OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023a. 2, 6, 20
OpenAI. GPT-4V(ision) system card, 2023b. URL https://openai.com/research/
gpt-4v-system-card. 2, 3
Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars
Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. Check your facts and try again: Improv-
ing large language models with external knowledge and automated feedback. arXiv preprint
arXiv:2302.12813, 2023. 97
Rachel Pollitt, Caroline Cohrssen, and Wee Tiong Seah. Assessing spatial reasoning during play:
Educator observations, assessment and curriculum planning. Mathematics Education Research
Journal, 32(2):331–363, 2020. 1
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi
Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B:
An open large-scale dataset for training next generation image-text models. Advances in Neural
Information Processing Systems, 35:25278–25294, 2022. 20
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi.
A-OKVQA: A benchmark for visual question answering using world knowledge. In European
Conference on Computer Vision, pp. 146–162. Springer, 2022. 20, 27
Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. Solving geometry
problems: Combining text and diagram interpretation. In Proceedings of the 2015 conference on
empirical methods in natural language processing, pp. 1466–1476, 2015. 1, 20, 27
Sanket Shah, Anand Mishra, Naganand Yadati, and Partha Pratim Talukdar. KVQA: Knowledge-
aware visual question answering. In Proceedings of the AAAI conference on artificial intelligence,
pp. 8876–8884, 2019. 20, 27
Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan
Huang, Hongsheng Li, Yu Qiao, et al. Tiny LVLM-eHub: Early multimodal experiments with
bard. arXiv preprint arXiv:2308.03729, 2023. 20
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned,
hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.
2556–2565, 2018. 20
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang.
HuggingGPT: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint
arXiv:2303.17580, 2023. 37
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh,
and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pp. 8317–8326, 2019. 20, 27
Deborah Stipek and Douglas Mac Iver. Developmental change in children’s assessment of intellec-
tual competence. Child development, pp. 521–538, 1989. 1
Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, and Kai Yu.
SciEval: A multi-level large language model evaluation benchmark for scientific research. arXiv
preprint arXiv:2308.13149, 2023. 20
Sanaz Talaifar and William B Swann. Self-verification theory. Encyclopedia of personality and
individual differences, pp. 4813–4821, 2020. 97
John Chong Min Tan and Mehul Motani. Large language model (llm) as a system of multiple expert
agents: An approach to solve the abstraction and reasoning corpus (arc) challenge. arXiv preprint
arXiv:2310.05146, 2023. 21
Leonard Tang, Elizabeth Ke, Nikhil Singh, Bo Feng, Derek Austin, Nakul Verma, and Iddo Drori.
Solving probability and statistics problems by probabilistic program synthesis at human level and
predicting solvability. In International Conference on Artificial Intelligence in Education, pp.
612–615. Springer, 2022. 21
Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia,
Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for
science. arXiv preprint arXiv:2211.09085, 2022. 1
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu,
Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly
capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée
Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and
efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 20
Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry
without human demonstrations. Nature, 625(7995):476–482, 2024. 1
Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, and Noah D Goodman.
Hypothesis search: Inductive reasoning with language models. arXiv preprint arXiv:2309.05660,
2023a. 21
Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R
Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating college-level sci-
entific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635,
2023b. 2, 20, 27
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency
improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
103
Yan Wang, Xiaojiang Liu, and Shuming Shi. Deep neural solver for math word problems. In
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pp. 845–854, 2017. 1
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yo-
gatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language
models. arXiv preprint arXiv:2206.07682, 2022a. 20
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny
Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint
arXiv:2201.11903, 2022b. 2, 6, 21, 103
Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prab-
hanjan Kambadur, David Rosenberg, and Gideon Mann. BloombergGPT: A large language model
for finance. arXiv preprint arXiv:2303.17564, 2023. 1
Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan
Huang, Yu Qiao, and Ping Luo. LVLM-eHub: A comprehensive evaluation benchmark for large
vision-language models. arXiv preprint arXiv:2306.09265, 2023. 20
Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. FinGPT: Open-source financial large
language models. arXiv preprint arXiv:2306.06031, 2023a. 1
Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Li-
juan Wang. The Dawn of LMMs: Preliminary explorations with gpt-4v(ision). arXiv preprint
arXiv:2309.17421, 2023b. 6, 97
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen
Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language mod-
els with multimodality. arXiv preprint arXiv:2304.14178, 2023. 6, 10, 20
Da Yin, Liunian Harold Li, Ziniu Hu, Nanyun Peng, and Kai-Wei Chang. Broaden the vision:
Geo-diverse visual commonsense reasoning. arXiv preprint arXiv:2109.06860, 2021. 20
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang,
and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv
preprint arXiv:2308.02490, 2023. 10, 20
Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual
commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pp. 6720–6731, 2019. 20
Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng
Gao, and Qiao Yu. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init
attention. arXiv preprint arXiv:2303.16199, 2023a. 20
Xiang Zhang, Senyu Li, Zijun Wu, and Ning Shi. Lost in translation: When gpt-4v (ision) can’t see
eye to eye with text. a vision-language-consistency analysis of vllms and beyond. arXiv preprint
arXiv:2310.12520, 2023b. 21
Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi
Xie. PMC-VQA: Visual instruction tuning for medical visual question answering. arXiv preprint
arXiv:2305.10415, 2023c. 20, 27
Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun.
LLaVAR: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint
arXiv:2306.17107, 2023d. 6, 10, 20
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: En-
hancing vision-language understanding with advanced large language models. arXiv preprint
arXiv:2304.10592, 2023a. 6, 20
Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Young-
jae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal C4: An open, billion-
scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023b. 20
Generative foundation models and their evaluation. Recently, there has been a surge of genera-
tive foundation models (Bommasani et al., 2021) that are trained on web-scale data, such as GPT-3,
ChatGPT, GPT-4, Claude, LLaMA, LLaMA-Adapter (Brown et al., 2020; OpenAI, 2022; 2023a;
Anthropic, 2023; Touvron et al., 2023; Zhang et al., 2023a), with the ability to solve a wide range of
downstream tasks (Wei et al., 2022a) without any task-specific finetuning. Prior work has focused
on evaluating their ability to respond to queries from various text-grounded disciplines, such as
QA, math, medicine, coding, and science (Bubeck et al., 2023; Nori et al., 2023; Chen et al.,
2021; Fu et al., 2023; Sun et al., 2023; Wang et al., 2023b; Huang et al., 2023; 2022; Liu et al.,
2023b; Zhang et al., 2023a). Other work, such as Pix2Struct (Lee et al., 2023), MatCha (Liu et al.,
2022), and UniChart (Masry et al., 2023), has focused on developing specialized pretraining recipes
for improved math and chart reasoning in visual contexts.
On the vision-language side, there are several generative foundation models such as LLaVA,
miniGPT4, InstructBLIP, Flamingo, LLaMA-Adapter V2, Multimodal Bard (Liu et al., 2023a; Zhu
et al., 2023a; Dai et al., 2023; Alayrac et al., 2022; Awadalla et al., 2023; Gao et al., 2023; Google,
2023) that are trained on vast amounts of paired (Schuhmann et al., 2022; Sharma et al., 2018; Lin
et al., 2014) and interleaved image-text data (Zhu et al., 2023b). In addition, specialized versions
of these LMMs have recently been developed for document understanding, where the visual context
requires text recognition and, in some cases, math understanding (Zhang et al., 2023d; Ye et al.,
2023). Several recent works, such as Visit-Bench, LVLM-eHub, and MMBench (Bitton et al., 2023;
Yu et al., 2023; Liu et al., 2023c; Xu et al., 2023; Shao et al., 2023), assess their instruction-following
and reasoning capabilities. As generative foundation models become more relevant to real-world
applications, we propose M ATH V ISTA, unlike prior work, to benchmark their capabilities in math
reasoning (logical, arithmetic, statistical) on a diverse set of visual contexts (word problems in
images, natural scenes, geometrical shapes, and plots).
Recent work on LLM prompting and GPT-4V. We have witnessed the remarkable abilities of
large language models (LLMs), and their reasoning capabilities are further enhanced by prompting
approaches such as chain-of-thought (CoT) (Wei et al., 2022b), program-of-thought (PoT) (Chen
et al., 2022b), and inductive reasoning (Wang et al., 2023a; Tan & Motani, 2023). For example,
the feasibility of using LLMs to solve the Abstraction and Reasoning Corpus (ARC) challenge has
been verified using zero-shot, few-shot, and context-grounded prompting (Tan & Motani, 2023). In
this paper, we evaluate LLMs with zero-shot, few-shot, CoT, PoT, and tool-augmented prompting
to explore their potential for mathematical reasoning in visual contexts on M ATH V ISTA. Program-aided
methods are widely used for mathematical reasoning because of their strengths in precise logical
reasoning and arithmetic calculation (Drori & Verma, 2021; Tang et al., 2022; Drori et al., 2022).
In this work, we develop the LLM baselines with PoT.
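To make the PoT setup concrete, the following sketch (ours, not the exact pipeline used in our experiments) executes a model-generated program string in an isolated namespace and captures its printed output; the example program is a corrected variant of the number-sequence puzzle discussed later in this appendix.

import io
import contextlib

def run_program_of_thought(program: str) -> str:
    """Execute a model-generated Python program and return whatever it prints."""
    namespace = {}          # isolated namespace for the generated code
    buffer = io.StringIO()  # captures stdout produced by print(...)
    with contextlib.redirect_stdout(buffer):
        exec(program, namespace)  # only run trusted or sandboxed code
    return buffer.getvalue().strip()

# Example: a program a PoT-prompted LLM might emit for a number-sequence puzzle.
generated_program = """
numbers = [4, 5, 7, 10, 14, 19, 25, 32]
diffs = [b - a for a, b in zip(numbers, numbers[1:])]  # [1, 2, 3, 4, 5, 6, 7]
print(numbers[-1] + diffs[-1] + 1)
"""
print(run_program_of_thought(generated_program))  # -> 40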
Recently, OpenAI released GPT-4V, the multimodal version of GPT-4, which shows promising per-
formance in vision-language reasoning. However, a fine-grained study of its strengths and limitations
remains underexplored. The recent work (Zhang et al., 2023b) contributes pioneering
efforts in this field, studying whether large multimodal models (LMMs), like GPT-4V, execute vi-
sion and language tasks consistently or independently. As concurrent work, our paper provides, for
the first time, a comprehensive quantitative and qualitative study of GPT-4V and other LLMs in
mathematical reasoning within visual contexts.
Table 3: Definitions and proportions of seven mathematical reasoning categories in M ATH V ISTA.
Math Examples
ARI (arithmetic reasoning)
Question: Karen bought 4 pounds of silk scraps and 4 pounds of canvas scraps. How much did she spend? (Unit: $)
Solution:
Find the cost of the silk scraps. Multiply: $9.08 × 4 = $36.32
Find the cost of the canvas scraps. Multiply: $8.17 × 4 = $32.68
Now find the total cost by adding: $36.32 + $32.68 = $69
She spent $69.
Answer: 69
Figure 11: Examples of the visual context for the scientific figure type.
Figure 17: Examples of the visual context for the line plot type.
Figure 21: Examples of the visual context for other types, including word cloud, map chart, radar
chart, violin plot, and heatmap chart.
Table 5: Summary of the 31 different source datasets in M ATH V ISTA. Among these, FunctionQA,
IQTest, and PaperQA are our newly annotated datasets. The table provides details on their category,
task, visual context, and primary mathematical reasoning skill types.
most, least, fewest, more, less, fewer, largest, smallest, greatest, larger, smaller, greater, highest,
lowest, higher, lower, increase, decrease, minimum, maximum, max, min, mean, average, median,
total, sum, add, subtract, difference, quotient, gap, half, double, twice, triple, square, cube, root,
approximate, approximation, triangle, rectangle, circle, square, cube, sphere, cylinder, cone, pyra-
mid, multiply, divide, percentage, percent, ratio, proportion, fraction, rate
Table 6: Dictionary of quantity words used for the automatic selection of questions likely to involve
mathematical reasoning.
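For illustration, a keyword filter of this kind could be implemented as follows; this is a minimal sketch (the word list is abbreviated and the function name is ours), not the exact selection script.

import re

# Abbreviated subset of the quantity-word dictionary in Table 6.
QUANTITY_WORDS = {
    "most", "least", "fewest", "more", "less", "fewer", "largest", "smallest",
    "mean", "average", "median", "total", "sum", "difference", "ratio",
    "percentage", "fraction", "maximum", "minimum", "double", "half",
}

def likely_math_question(question: str) -> bool:
    """Heuristically flag questions likely to involve mathematical reasoning."""
    tokens = set(re.findall(r"[a-z]+", question.lower()))
    has_quantity_word = bool(tokens & QUANTITY_WORDS)
    has_number = bool(re.search(r"\d", question))
    return has_quantity_word or has_number

print(likely_math_question("What is the average height of the bars?"))  # True
print(likely_math_question("What color is the car?"))                   # False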
We are compiling a dataset that incorporates image context and involves mathematical reasoning
(MathQA in visual contexts). We have gathered a set of examples in which some involve mathe-
matical reasoning, while others do not.
In our task, a question can be classified as a mathematical problem if it
• Involves numbers or symbols in the question text or the image context, AND requires
further operations or transformations to be performed on them to reach a solution.
• Involves more complex forms of mathematical reasoning, including logical reasoning,
abstract thought, and understanding of patterns.
Based on the definition above, a problem is classified as a negative example (NOT involving math-
ematical reasoning) if it:
• Does not involve any numbers or quantity words, OR
• Involves only counting, reading, or recognizing numbers, OR
• Relies solely on factual information, such as recalling years and dates.
Table 7: Instructions for human annotators to identify if a problem involves mathematical reasoning.
We developed an annotation tool, as illustrated in Figure 22, to enable expert annotators to label
problems that involve mathematical reasoning. Annotators were trained using detailed instructions,
as shown in Table 7, along with a variety of examples—positive ones that involve mathematical
reasoning and negative ones that do not. We provided three labeling options. Annotators could also
leave comments if they found anything incorrect or offensive, so that such items could be removed at a later stage.
In our study, we employed the Fleiss Kappa score to conduct an inter-annotator agreement analysis
among three annotators tasked with labeling examples based on mathematical reasoning. The Fleiss
Kappa score is a statistical measure used to evaluate the reliability of agreement between multiple
raters, providing a quantifiable metric to assess the consistency across different annotators. A score
of 1 indicates perfect agreement, while a score of 0 suggests no agreement beyond what would
be expected by chance. Our analysis yielded a Fleiss Kappa score of 0.775, indicating a substantial
level of consistency among the annotators. This high degree of agreement underscores the reliability
of our annotation process and affirms the quality of the labeled data generated for our study.
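For reference, the Fleiss Kappa score can be computed with standard tooling; the sketch below uses statsmodels on toy labels (not our actual annotations) purely to illustrate the calculation.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy data: 6 examples labeled by 3 annotators (1 = involves math reasoning, 0 = does not).
labels = np.array([
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
])

# aggregate_raters converts per-rater labels into per-category counts for each example.
counts, _categories = aggregate_raters(labels)
print(f"Fleiss Kappa: {fleiss_kappa(counts):.3f}")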
Figure 25: The distribution of the number of words per question in M ATH V ISTA. Questions with a
length greater than 60 are categorized as 61 for visualization simplicity.
Dataset category and task type. Source datasets in M ATH V ISTA can be categorized into two
types: math-targeted VQA datasets, which are originally proposed for assessing mathematical rea-
soning, and general VQA datasets, which address visual reasoning in everyday scenarios. The dis-
tribution proportions of these two categories (55.4% vs. 44.6%, as illustrated in Figure 26) within
M ATH V ISTA enable a balanced examination of mathematical reasoning in both domain-specific and
general-purpose applications. The distribution of the five tasks contained within M ATH V ISTA is vi-
sualized in Figure 27. The relatively balanced distribution of these tasks enhances the benchmarking
robustness that our dataset provides.
Figure 26: Distribution of source dataset categories in M ATH V ISTA: math-targeted VQA (3,402 examples, 55.4%) and general VQA (2,739 examples, 44.6%).
Grade level. The datasets within M ATH V ISTA are categorized into four distinct grade levels: el-
ementary school, high school, college, and not applicable, each representing a different level of
reasoning complexity and contextual application. The elementary school category aligns with the
typical mathematical curriculum of elementary education, introducing basic topics such as arith-
metic operations and introductory geometry. High school level questions delve into more complex
mathematical concepts such as algebra, geometry, and introductory calculus. The college category
encapsulates the highest level of complexity, featuring questions on advanced mathematical and sci-
entific concepts like calculus, linear algebra, and physics. Questions without specific grade levels
are categorized as not applicable.
The distribution of questions across these grade levels is visualized in Figure 28. This structured
categorization enriches the diversity of M ATH V ISTA, providing a meaningful framework for evalu-
ating and benchmarking the mathematical and visual reasoning capabilities of various models across
different educational contexts, thereby assessing their practical utility and educational relevance.
Figure 28: Distribution of questions across different grade levels within M ATH V ISTA.
Visual context. The datasets within M ATH V ISTA encompass over 10 different visual contexts
(with the distribution shown in Figure 29), crucial for evaluating models’ ability to interpret and
reason across diverse visual information. Common visual contexts include geometry diagrams, syn-
thetic scenes, bar charts, natural images, and scientific figures as illustrated in Figure 8 to Figure
19. Less frequent, yet equally important visual contexts such as medical images, word clouds, map
charts, radar charts, violin plots, and heatmap charts are depicted in Figure 20 and Figure 21. These
visual contexts, ranging from common to specialized representations, challenge the models to de-
code and reason with varying visual information, contributing to a more robust and comprehensive
evaluation. The diversity in visual contexts enriches M ATH V ISTA, enhancing the benchmarking ro-
bustness and providing a solid foundation for understanding the practical utility and domain-specific
performance of various models across different domains and applications.
Mathematical reasoning ability. The datasets within M ATH V ISTA encompass a spectrum of
seven distinct mathematical reasoning types, facilitating a thorough evaluation of models’ mathe-
matical reasoning capabilities. Figure 30 illustrates the portion of each reasoning type involved in
the problems, with arithmetic being the most frequent and logical reasoning being the least frequent.
This distribution reflects the varying degrees of mathematical reasoning required across different
problems. Figure 31 further delineates the distribution of reasoning types, showcasing a mean of
1.45. The sparse distribution observed aids in the precise analysis of each type’s performance by
the models, providing a nuanced understanding of their strengths and weaknesses across different
mathematical reasoning domains. This structured representation of mathematical reasoning types
within M ATH V ISTA not only enriches the dataset but also significantly contributes to a more robust
and comprehensive evaluation of models, aiding in the identification of areas for improvement and
the development of more proficient mathematical reasoning models.
Figure 30: Portion of each mathematical reasoning type involved in the problems of M ATH V ISTA.
Figure 31: Distribution of the number of mathematical reasoning types within M ATH V ISTA.
We employ a strategy where the most frequent answers in the testmini set are utilized as predictions
for various question and answer types. For multiple-choice questions, the most frequent option is
selected based on the number of available options. For instance, option B is chosen for questions
with two options, aligning with the answer distribution in testmini. Similarly, for questions requir-
ing an answer type of integer, a floating number with one decimal place, a floating number with
two decimal places, or a list, we use 2, 1.2, 0.21, and [0, 2, 0, 2, 1, 7, 1, 2, 0, 3, 0, 6] respectively, in
accordance with the answer distribution observed in testmini.
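A minimal sketch of this frequent-guess heuristic is given below; the constants mirror the values stated above, while the record field names and the helper name are our assumptions.

# Most frequent free-form answers in testmini, per answer type (values quoted from the text above).
MOST_FREQUENT_ANSWER = {
    "integer": "2",
    "float_one_decimal": "1.2",
    "float_two_decimals": "0.21",
    "list": "[0, 2, 0, 2, 1, 7, 1, 2, 0, 3, 0, 6]",
}

# Most frequent option by number of choices; only the two-option case is stated in the text, and
# mappings for other option counts would be derived from the testmini distribution in the same way.
MOST_FREQUENT_OPTION = {2: "B"}

def frequent_guess(question: dict) -> str:
    """Return the frequent-guess prediction for a question record."""
    if question["question_type"] == "multi_choice":
        return MOST_FREQUENT_OPTION[len(question["choices"])]
    return MOST_FREQUENT_ANSWER[question["answer_type"]]

print(frequent_guess({"question_type": "multi_choice", "choices": ["Yes", "No"]}))  # -> B
print(frequent_guess({"question_type": "free_form", "answer_type": "integer"}))     # -> 2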
The prompt used to instruct GPT-4 for answer extraction is illustrated in Table 8.
Task description:
Please read the following example. Then extract the answer from the model response and type it at the end of the prompt.
Example 1:
Hint: Please answer the question requiring an integer answer and provide the final value, e.g., 1, 2, 3, at the end.
Question: Which number is missing?
Model response: The number missing in the sequence is 14.
Extracted answer: 14
Example 2:
Hint: Please answer the question requiring a floating-point number with one decimal place and provide the final value, e.g., 1.2, 1.3, 1.4, at the end.
Question: What is the fraction of females facing the camera?
Model response: The fraction of females facing the camera is 0.6, which means that six out of ten females in the group are facing the camera.
Extracted answer: 0.6
Model response: Luca needs $1.45 to buy a sour apple candy and a butterscotch candy.
Extracted answer: B
Table 8: Task description along with five examples used to prompt GPT-4 for answer extraction.
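As a rough sketch of how such an extraction call might be issued with the openai Python client (the model name, demonstration text, and helper name below are our assumptions, not the exact pipeline used in this work):

from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set in the environment

client = OpenAI()

TASK_DESCRIPTION = (
    "Please read the following example. Then extract the answer from the model "
    "response and type it at the end of the prompt."
)

def extract_answer(demonstrations: str, hint: str, question: str, response: str) -> str:
    """Ask GPT-4 to extract the short final answer from a free-form model response."""
    prompt = (
        f"{TASK_DESCRIPTION}\n\n{demonstrations}\n\n"
        f"Hint: {hint}\nQuestion: {question}\n"
        f"Model response: {response}\nExtracted answer:"
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()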
Table 9: The task instructions for different question and answer types in answer extraction. Here,
Float (1) refers to a floating-point number with one decimal place, and Float (2) refers to a floating-
point number with two decimal places.
We instruct Multimodal Bard to generate a detailed description for an input image, aiming to aug-
ment current LLMs with visual understanding capabilities. The prompt is shown in Table 10.
Describe the fine-grained content of the image or figure, including scenes, objects, relationships,
and any text present.
Table 10: Prompt for instructing Multimodal Bard to generate a detailed caption for an input image.
The hyperparameters for the experiments in §3.2 are set to their default values unless specified
otherwise. Table 11 and Table 12 detail specific generation parameters for the various large language
models (LLMs) and large multimodal models (LMMs) we evaluated, respectively.
We conducted a study to evaluate human performance on the testmini subset of the M ATH V ISTA,
utilizing Amazon Mechanical Turk (AMT). Each question from the testmini subset was assigned
to five annotators, all of whom have a history of completing more than 5,000 HIT tasks and boast
an acceptance score higher than 0.99, to ensure the quality of the results. The study comprised
five test questions and two qualification questions, which were to be answered within a 20-minute
timeframe. The qualification questions consisted of elementary math word problems requiring basic
arithmetic operations (e.g., addition and subtraction). Only annotators who successfully answered
the qualification questions were deemed eligible for the study, and their responses were included in
the final analysis. Additionally, annotators were requested to provide information regarding their
highest level of educational attainment. We retained the results exclusively from annotators who
had achieved a high school diploma or higher, as 30.9% of the problems in M ATH V ISTA are of
high-school level difficulty and 10.8% correspond to college-level curricula.
A screenshot of our AMT worker interface, utilized for the Multimodal Bard assessment task, is
provided in Figure 32. The workers were compensated at a rate of $18 per hour.
Table 13 reports the accuracy scores of two heuristic baselines, two leading augmented LLMs (CoT
GPT-4, PoT GPT-4), and one leading LMM (LLaVA-LLaMA-2-13B) on the test subset. The minor
differences between scores on the test subset and the testmini subset, as shown in Table 2, suggest
that testmini effectively mirrors the test subset, serving as a valuable evaluation subset for model
development, especially for those who have limited computing resources.
Model Input ALL FQA GPS MWP TQA VQA ALG ARI GEO LOG NUM SCI STA
Random chance - 17.86 15.46 24.12 4.54 23.36 24.33 25.84 13.85 22.69 13.40 8.82 15.76 14.28
Frequent guess - 23.48 20.97 27.18 16.27 26.06 28.87 28.29 20.86 25.71 11.86 19.61 20.45 20.08
2-shot CoT GPT-4 Q, Ic , It 30.50 27.21 35.91 21.30 43.13 28.17 35.72 25.17 35.80 24.74 15.41 47.28 31.29
2-shot PoT GPT-4 Q, Ic , It 31.74 27.58 37.35 23.87 43.00 30.27 37.15 27.93 37.48 22.68 15.83 44.47 31.87
LLaVA-LLaMA-2-13B Q, I 25.40 22.86 24.57 18.15 35.82 29.69 26.93 22.47 24.45 19.07 19.05 34.71 21.61
Table 13: Accuracy scores on the test subset of M ATH V ISTA. Input: Q: question, I: image, Ic :
image caption, It : OCR texts detected from the image. ALL: overall accuracy. Task types: FQA:
figure question answering, GPS: geometry problem solving, MWP: math word problem, TQA: text-
book question answering, VQA: visual question answering. Mathematical reasoning types: ALG:
algebraic reasoning, ARI: arithmetic reasoning, GEO: geometry reasoning, LOG: logical reasoning,
NUM: numeric common sense, SCI: scientific reasoning, STA: statistical reasoning.
The accuracy scores across seven mathematical reasoning categories are reported in Table 2, with
primary baselines highlighted in Figures 1 and 33. GPT-4V outperforms other baseline models
in most mathematical reasoning categories, except for logical reasoning and numeric commonsense
reasoning. Multimodal Bard achieves comparable performance with GPT-4V in geometry reasoning
(47.8% vs. 51.0%) and algebraic reasoning (46.5% vs. 53.0%), highlighting its enhanced abilities
in comprehending geometry diagrams and performing algebraic calculations.
Figure 33: Accuracy scores of baselines across mathematical reasoning types in M ATH V ISTA.
Among open-source LMMs (ranging from IDEFICS to LLaVA), LLaVA achieves the best overall
accuracy on M ATH V ISTA and the highest fine-grained scores for problems in geometry reasoning,
logical reasoning, and statistical reasoning. However, these scores still substantially lag behind
GPT-4V and Multimodal Bard, indicating a gap in the overall effectiveness of these open-source
models compared to more advanced proprietary systems. Despite this, LLaMA-Adapter-V2, tied
with LLaVA, outperforms GPT-4V by 2.7% in logical reasoning, and InstructBLIP beats GPT-4V
in numeric commonsense reasoning.
Figure 34 illustrates the accuracy scores of leading baselines on M ATH V ISTA across a diverse range
of visual contexts. Remarkably, GPT-4V outperforms human performance in visual contexts of
function plots, geometry diagrams, scatter plots, tables, and other types, which aligns with its su-
periority in terms of related mathematical reasoning types. Other foundation models trail behind
humans in visual perception and reasoning across most visual context categories. Multimodal Bard
demonstrates comparable performance to humans in questions with a visual context of geometry
diagrams, showcasing its promising capabilities in recognizing geometric shapes and relationships.
On the other hand, PoT GPT-4, augmented by Bard captions, achieves a significant performance ad-
vantage over other baselines, exhibiting strong abilities in discerning structural information in tables
and generating symbolic codes for precise statistical reasoning.
Figure 34: Accuracy scores of leading baselines across various visual contexts in M ATH V ISTA.
Figure 35 displays the average accuracy scores across different grade levels (elementary school,
high school, and college) for the leading foundation models, as well as random chance and human
performance. Humans exhibit the highest performance on questions at the elementary school level
(70.4%), while they fare the worst on college-level questions (52.6%) within M ATH V ISTA. Foun-
dation model baselines exhibit varying performance behaviors: they achieve better accuracy scores
on high school level questions compared to the other two categories.
In addressing elementary school problems, the performance gap between human performance and
the best-performing model, GPT-4V, is notably the largest when compared to other grade levels.
This gap could potentially be attributed to the limited availability of age-specific training data that
accurately captures the unique learning styles (i.e., rich with abstract scenes) of elementary school
students. On the other hand, GPT-4V demonstrates an improvement of 20.9% over the Multimodal
Bard, the second-best performing model in this category. This improvement suggests that while
GPT-4V still lags behind human performance, its ability to tackle elementary-level problems in
visually intensive settings has been significantly enhanced.
For high school problems, GPT-4V, with a score of 61.8%, outperforms human performance, which
stands at 58.2%. Additionally, the second-best performing model, Multimodal Bard, with a score of
50.3%, is on par with human performance. This disparity might be attributed to the training regimen
of the models, which perhaps aligns well with the high school curriculum.
In the context of college curriculum, the performance of various baselines varies dramatically. GPT-
4V demonstrates performance comparable to that of humans. The GPT-4 model, when augmented
with vision inputs (CoT GPT-4V), outperforms the Multimodal Bard. Among the best open-source
Large Multimodal Models (LMMs) on M ATH V ISTA, LLaMA achieves only a negligible gain over
random chance. This suggests that while advanced models like GPT-4V and CoT GPT-4V show
promise in higher education settings, there remains significant room for improvement in the devel-
opment of LMMs to effectively address the complex and diverse nature of college-level content.
Figure 35: Average accuracy scores across different grade levels for primary baselines.
Figure 36 presents an ablation study conducted on LLMs, examining their performance under varying
visual information inputs.
Figure 36: Average accuracy scores of LLM baselines under various visual inputs.
We explored whether LLMs and Augmented LLMs can benefit from larger numbers of few-shot
examples on M ATH V ISTA, with results reported in Figure 37. In the question-only input setting
(a), both Claude-2 and ChatGPT suffer from a performance drop, suggesting that they are more
sensitive to the bias in demonstrations, especially in the absence of visual inputs. There is a marginal
improvement of 1.4% when the shot number increases from 2 to 4 for GPT-4. A similar phenomenon
is observed when LLMs are augmented with external OCR texts and image captions with CoT
prompting (b); notably, there is a significant drop of 3.4% when the shot number increases from 2
to 4 for CoT Claude-2. With PoT prompting (c), LLMs like ChatGPT and GPT-4 can obtain gains
of 3.4% and 1.4%, respectively, with the shot number increasing from 2 to 4. Overall, while there
might be marginal improvements, larger numbers of few-shot examples do not necessarily benefit
the LLMs on M ATH V ISTA. In some settings, LLMs suffer from unstable performance drops. This
further indicates that the quality of the augmented information plays a more important role for
augmented LLMs.
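As a rough illustration of how a k-shot prompt can be assembled in the text-only settings (the exact template is not reproduced here, and the record fields are our assumptions), consider the following sketch.

def build_few_shot_prompt(demos, query, k=2):
    """Concatenate k worked demonstrations ahead of the query question."""
    blocks = []
    for demo in demos[:k]:
        blocks.append(f"Question: {demo['question']}\nAnswer: {demo['answer']}")
    blocks.append(f"Question: {query['question']}\nAnswer:")
    return "\n\n".join(blocks)

demos = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "What is 10 - 4?", "answer": "6"},
    {"question": "What is 7 * 2?", "answer": "14"},
]
print(build_few_shot_prompt(demos, {"question": "What is 9 + 8?"}, k=2))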
We conducted an initial study on the few-shot learning ability of the Large Multimodal Model
(LMM), specifically IDEFICS (Laurençon et al., 2023), on M ATH V ISTA. As shown in Figure
38, there is a modest improvement with increased shot numbers, suggesting potential benefits of
few-shot learning for LMMs on M ATH V ISTA.
However, recent studies highlight the instability of LMMs in few-shot settings. For instance, a
significant accuracy drop was observed in models like BLIP-2 (Li et al., 2023b) and InstructBLIP
(Dai et al., 2023) when applying 4-shot in-context learning in common sense reasoning tasks (Li
et al., 2023c). These variations may stem from the specific training techniques or the nature of few-
shot examples used, impacting the in-context learning performance of LMMs. Given the rapidly
evolving landscape of LMMs, the consistent benefits of few-shot learning remain an open question.
Figure 38: Few-shot performance of IDEFICS-9B-Instruct on M ATH V ISTA (accuracy vs. number of shots, from 0 to 4).
The markings on the side of the beaker are in milliliters (ml). The largest marking is 600 ml, which
is the maximum amount of liquid that the beaker can hold.
The other markings on the beaker are 300 ml, 200 ml, 100 ml, and -400 ml. The negative markings
indicate the amount of liquid that is below the zero mark.
Figure 39: An example illustrating hallucinated information not depicted in the image.
Question: As shown in the figure, if CB = 4.0, DB = 7.0, and D is the midpoint of AC, then the length
of AC is ().
We know that CB = 4.0 and DB = 7.0, and that AD = DC. So we can write the equation:
AD + DC = 4.0 + 7.0
2AD = 11.0
AD = 5.5
AC = 2AD = 11.0cm
Therefore, the length of AC is 11.0 cm, which is answer choice (D).
Figure 40: An example illustrating hallucinated information not depicted in the question.
Question: As shown in the figure: In Rt△ABC, ∠C = 90.0, AC = 8.0, AB = 10.0, then the value of sinB is
equal to ().
Choices: (A) 3/5 (B) 4/5 (C) 3/4 (D) 4/3
To find: sinB
sinB = opposite/hypotenuse = AC/AB = 8/10 = 3/5
Wrong output: (A) 3/5
Ground truth: (B) 4/5
Figure 41: A failure example from Multimodal Bard: incorrect answer with wrong calculation.
Question: What is the age gap between these two people in image? (unit: years)
Ground truth: 7
Figure 42: A failure example of Multimodal Bard: restricted in responding to images of people.
The other algorithms are desert, pine, and globe. Their accuracies are 2, 4, and 6, respectively.
Wrong output: 1
Ground truth: 0
Figure 43: A failure example from Multimodal Bard: incorrect answer with hallucination caused by
visual perception.
Question: What is the age gap between these two people in image? (unit: years)
Figure 44: A failure example from Multimodal Bard: incorrect answer with correct detection but
wrong calculation.
Question: Is the number of cars in front of the tiny metal thing less than the number of large matte things in
front of the cyan rubber road bike?
Choice: (A) Yes (B) No
There are no cars in front of the tiny metal thing (the motorcycle). There are two large matte things
in front of the cyan rubber road bike (the van and the bicycle).
So the number of large matte things in front of the cyan rubber road bike is greater than the number
of cars in front of the tiny metal thing.
Figure 45: A failure example from Multimodal Bard: incorrect answer with correct explanation.
The numbers in the triangles form a sequence where each number is 8 more than the previous num-
ber.
2, 10, (8), 20, /12
Figure 46: A successful example from Multimodal Bard: correct answer with hallucination (20 is
not present in the image) and wrong reasoning.
Question: The magnitude of the acceleration vector a is 10 cm/s2 . Use the figure to estimate the normal
components of a.
Here is a breakdown of the steps involved in finding the normal component of the acceleration:
1. Find the angle between the acceleration vector and the radius vector.
2. Use the cosine function to find the cosine of the angle.
3. Multiply the magnitude of the acceleration vector by the cosine of the angle to find the normal component
of the acceleration.
Ground truth: 9
Figure 47: A failure example from Multimodal Bard: the question is incomplete (or requires implicit
measuring ability), but the model hallucinates the angle (the actual angle is 25◦ ). Here, an =
10 cm/s2 · cos 25◦ ≈ 9 cm/s2 .
The graph of the function y = log2 x is a concave up curve. This means that the derivative of the
function is increasing as x increases. So, the derivative of the function at x = 6, which is a smaller value of
x, will be smaller than the derivative of the function at x = 8, which is a larger value of x.
x | dy/dx
–|–
6 | 0.5
7 | 0.7
8|1
As you can see, the value of dy/dx increases from 0.5 to 0.7 to 1 as x increases from 6 to 7 to 8.
This confirms that the derivative of y at x = 6 is smaller than that at x = 8.
Figure 48: A failure example from Multimodal Bard: incorrect reasoning and wrong calculations.
The derivative of a function at a point gives the slope of the line tangent to the function’s graph at
that point. The steeper the line, the larger the derivative. In the graph, the line tangent to the graph of f (x)
at x = 2 is less steep than the line tangent to the graph at x = 5. Therefore, the derivative of f (x) at x = 2
is smaller than that at x = 5.
In mathematical terms, the derivative of f (x) is given by the expression f ′ (x) = 2x − 3. Plugging
in x = 2 and x = 5, we get f ′ (2) = 2(2) − 3 = 1 and f ′ (5) = 2(5) − 3 = 7. Therefore, f ′ (2) < f ′ (5).
Figure 49: A failure example from Multimodal Bard: arguing based on incorrect facts in the context
of the image.
Ground truth: 40
Solution:
4 + 1 = 5, 5 + 2 = 7 (2 = 1 + 1)
10 + 4 = 14, 14 + 5 = 19 (5 = 4 + 1)
25 + 7 = 32, 32 + 8 = 40 (8 = 7 + 1)
mPLUG-Owl: There are 13 missing numbers in the image. The missing number can be 3.
InstructBLIP: N/A
Multimodal Bard:
The missing number in the sequence is 28.
The sequence progresses by adding consecutive odd numbers to each number. The first number is 4,
and we add 1 to get 5. We then add 3 to get 7, and so on. The missing number is therefore the next number
after 32, which is 28.
GPT-4V:
4, 5, 7, 10, 14, 19, 25, 32, ?, 52
This sequence follows a pattern where the difference between each number increases by one. For
example, the difference between 4 and 5 is 1, the difference between 5 and 7 is 2, the difference between
7 and 10 is 3, and so on. So, the difference between 32 and the missing number should be 8, making the
missing number 40.
Correct output: 40
Figure 50: Among all LMM baselines, only GPT-4V accurately predicts the correct answer to this
logical reasoning question, demonstrating correct visual perception and textual reasoning. Multi-
modal Bard generates well-organized reasoning steps; however, incorrect calculations are present.
Ground truth: 40
Solution:
4 + 1 = 5, 5 + 2 = 7 (2 = 1 + 1)
10 + 4 = 14, 14 + 5 = 19 (5 = 4 + 1)
25 + 7 = 32, 32 + 8 = 40 (8 = 7 + 1)
The image is a simple math problem, and it is likely that it is being used as part of an IQ test or
other kind of intelligence test. The image is also consistent with the webpages that you mentioned, which
are all related to IQ tests.
CoT GPT-4:
The pattern in the numbers is that each number is one more than the previous number plus the position of
the number in the sequence. For example, the second number (5) is 1 more than the first number (4) plus the
position of the number in the sequence (2). The third number (7) is 1 more than the second number (5) plus
the position of the number in the sequence (3).
Following this pattern, the missing number should be 1 more than the last number (32) plus the
position of the number in the sequence (9).
32 + 9 + 1 = 42
So, the missing number is 42.
PoT GPT-4:
numbers = [4, 5, 7, 10, 14, 19, 25, 32]
diffs = [numbers[i+1] - numbers[i] for i in range(len(numbers)-1)]  # [1, 2, 3, 4, 5, 6, 7]
missing_number = numbers[-1] + diffs[-1]
print(missing_number)
Output: 39
Figure 51: The LLM baselines augmented with Bard image captions and OCR text also fail on this
question. CoT GPT-4 generates explanations with incorrect reasoning. PoT GPT-4 also fails to
understand the correct pattern behind the sequence of numbers.
mPLUG-Owl: B: 8/11
InstructBLIP: N/A
Multimodal Bard:
The correct answer is (C) 6/11.
The shape is divided into 11 equal parts. The blue part takes up 6 of these parts. Therefore, the
fraction of the shape that is blue is 6/11.
The other choices are incorrect. Option (A), 3/11, is the fraction of the shape that is white. Option
(B), 8/11, is the fraction of the shape that is not blue. Option (D), 3/5, is not a fraction of the shape.
Figure 52: All LMM baselines fail to answer this statistical reasoning question, indicating their
unsatisfactory abilities in understanding abstract scenes and performing statistical reasoning.
The image could be used to represent a variety of concepts, such as fractions, colors, or the four
elements. It could also be used as a decorative element or a symbol.
CoT GPT-4:
The image description does not provide enough information to determine what fraction of the shape is blue.
Please provide a more detailed description of the image.
PoT GPT-4:
def fraction_of_blue(description, choices):
    parts = {
        'blue': 1,
        'green': 1,
        'yellow': 1,
        'red': 1
    }
    total_parts = sum(parts.values())
    blue_fraction = parts['blue'] / total_parts
print(fraction_of_blue(description, choices))
Output: “ ”
Figure 53: Two LLM baselines also fail to answer this question. The incorrect facts in the image
captions generated by Multimodal Bard further verify its weakness in understanding the abstract
concepts in this figure.
Ground truth: 3
Solution: 2 + 1 = 3
mPLUG-Owl: The total sum of the two numbers on the players’ jerseys is 24.
miniGPT4: The numbers on the jerseys are 7 and 15. If you add them together, the total sum is 22.
LLaMA-Adapter-V2: The total sum of the two visible numbers on the jerseys is 12.
InstructBLIP: 21
LLaVA: The two visible numbers on the jerseys are 2 and 1. If we add them together, the total sum is 3.
Correct output: 3
GPT-4V: The total sum of the two visible numbers on the jerseys is 29 (21 + 8).
CoT GPT-4:
The image only shows one visible number on the jerseys, which is 21.
PoT GPT-4:
visible_numbers = [21]
total_sum = sum(visible_numbers)
print(total_sum)
Execution output: 21
Figure 54: Among all LMM and LLM baselines, only LLaVA generates the correct answer in this
instance. Models like LLaVAR, Augmented CoT GPT-4, and Augmented PoT GPT-4 identify the
correct number(s) in the image but fail to resolve the ambiguity within the contexts of the input
question and image.
Question: How much money does Luca need to buy a sour apple candy and
a butterscotch candy? (Unit: $)
Solution: Add the price of a sour apple candy and the price of a but-
terscotch candy: $0.06 + $0.07 = $0.13. Luca needs $0.13.
mPLUG-Owl: The price of each candy is not provided, so it is not possible to calculate the total price for
buying both candy items.
Luca needs $0.75 to buy a sour apple candy and a butterscotch candy.
Explanation:
* The price of the sour apple candy is $0.50.
* The price of the butterscotch candy is $0.25.
* Therefore, Luca needs a total of $0.75 to buy both candies.
LLaMA-Adapter-V2: Luca needs $1.23 to buy a sour apple candy and a butterscotch candy.
LLaVAR: Luca needs to earn $1.23 to be able to buy a sour apple candy and a butterscotch candy.
InstructBLIP: 1.23
LLaVA: Luca needs $1.45 to buy a sour apple candy and a butterscotch candy.
Multimodal Bard:
The price of a sour apple candy is $0.06 and the price of a butterscotch candy is $0.07. So, Luca needs
$0.06 + $0.07 = $0.13.
GPT-4V:
The total cost for a sour apple candy and a butterscotch candy is $0.13. Luca would need at least $0.13 to
purchase both items.
Correct output: 0.13
Figure 55: Among all LMM baselines, only Multimodal Bard and GPT-4V accurately detect the
text information in the image, and perform the correct calculation to deduce the right answer for the
arithmetic question within the visual context of a table.
Question: How much money does Luca need to buy a sour apple candy and
a butterscotch candy? (Unit: $)
Solution: Add the price of a sour apple candy and the price of a but-
terscotch candy: $0.06 + $0.07 = $0.13. Luca needs $0.13.
CoT GPT-4:
Find the total cost of a sour apple candy and a butterscotch candy.
PoT GPT-4:
sour_apple_candy_price = 0.06
butterscotch_candy_price = 0.07
total_price = sour_apple_candy_price + butterscotch_candy_price
print(total_price)
Execution output: 0.13
Correct output: 0.13
Figure 56: Using the correct image captions and OCR text as augmented inputs, both CoT GPT-4
and PoT GPT-4 predict the correct answer.
Given that GPT-4V does not offer API access, we have performed manual evaluations using the
playground platform2 . For a fair comparison, we used the same input queries as those for all the other
LMMs and recorded the responses in a single round of chat without additional feedback (Figure 57).
2 https://chat.openai.com/
The leaderboard in Figure 58 highlights GPT-4V’s substantial advancements over the current LLM
and LMM baselines. Notably, there is a 15.1% improvement over the second-best performing Mul-
timodal Bard model. However, a significant gap of 10.4% still exists between GPT-4V and human
performance, indicating plenty of room for further improvement by developing new LMMs and
tool-augmented LLMs.
This section compares the mathematical reasoning ability of GPT-4V with that of other LMMs on
M ATH V ISTA, including LLaMA-Adapter-V2-7B (LLaMA-Adapter-V2 for simplification), LLaVA-
LLaMA-2-13B (LLaVA for simplification), and Multimodal Bard.
ID: 926
Question: What is the size of the shaded area under the curve?
Round the answer to 2 decimal places.
LLaMA-Adapter-V2: The size of the shaded area under the curve is 0.00.
LLaVA: The size of the shaded area under the curve is approximately 1.34.
Multimodal Bard:
The given curve is a circle with center at (3, 0) and radius 3. The shaded area is a quarter of the circle’s area.
GPT-4V:
To find the size of the shaded area under the curve f (x) = √(9 − (x − 3)²), we need to integrate the
function over the given interval.
The given function represents the upper half of a circle with a radius of 3 and centered at the point (3,0).
The shaded region corresponds to the area of a quarter circle with a radius of 3. The formula for the area of
a circle is πr2 . For a quarter of the circle, the area is (1/4)πr2 .
Given: r = 3
Area = (1/4)π(3²)
Area = (1/4)π(9)
Area = (9/4)π
Area = 7.07 (rounded to 2 decimal places)
So, the size of the shaded area under the curve is approximately 7.07 square units.
Correct output: 7.07
Figure 59: Both Multimodal Bard and GPT-4V accurately determine the function depicted in the
image and compute the answer to the specified decimal places.
ID: 346
Ground truth: 1
Multimodal Bard:
The y-intercept of the function is the point where the graph crosses the y-axis. Since the y-axis is also the
line x = 0, the x-coordinate of the y-intercept will always be 0. In the image, the graph crosses the y-axis
at the point (0, 3). Therefore, the y-intercept of the function is 3.
GPT-4V:
The y-intercept of a function is the point where the graph of the function crosses the y-axis.
Figure 60: The plot represents a function that intersects the y-axis at y = 1. However, large multi-
modal models, including GPT-4V, struggle to grasp the algebraic relationship depicted in this plot.
ID: 963
Ground truth: 3
Multimodal Bard:
The local maximum of the function is the point where the graph of the function changes from increasing to
decreasing. In the given graph, this point is at x = 3. So the answer is 3.
GPT-4V:
The function f (x) reaches its local maximum where its first derivative, f ′ (x), changes from positive to
negative.
Figure 61: The graph depicts two functions, f (x) and f ′ (x). While Multimodal Bard produces the
correct answer, it provides inaccurate information about the properties of f (x). In contrast, GPT-4V
confuses f (x) with f ′ (x).
ID: 74
LLaVA: The difference between the largest and smallest bar in the graph is 1.4.
Multimodal Bard:
The largest bar in the image is Iceland with 100%, and the smallest bar is Cambodia with 52.4%. So the
difference between the largest and smallest bar is 100 - 52.4 = 47.6%.
However, the question asks for the answer in one decimal place. Rounding 47.6% to one decimal place
gives 47.7%.
Therefore, the difference between the largest and smallest bar is 47.7%.
GPT-4V:
From the graph:
Figure 62: The problem involves performing a straightforward arithmetical operation after identi-
fying two numbers from the bar chart. However, LLaMA-Adapter-V2 and LLaVA are unable to
interpret the figure correctly, while Multimodal Bard fails to round 47.6% to one decimal place.
ID: 718
LLaMA-Adapter-V2:
The probability that a randomly selected bow tie is designed with swirls and is made of velvet is 0.00.
LLaVA:
To determine the probability of a randomly selected bow tie being designed with swirls and made of velvet,
we need to divide the total number of bow ties with the specific characteristics (swirls and velvet) by the
total number of bow ties in Jeffrey’s collection.
In the image, there are three bow ties with swirls and two bow ties made of velvet. The total number of bow
ties in the collection is 10.
So, the probability of a randomly selected bow tie being designed with swirls and made of velvet is:
(3 bow ties with swirls and velvet) / (10 total bow ties) = 3/10 = 0.3 or 30%.
Multimodal Bard:
The probability that a randomly selected bow tie is designed with swirls and is made of velvet is:
(Number of swirls and velvet bow ties)/(Total number of bow ties)
We can find the number of swirls and velvet bow ties by adding the values in the “Swirls” and “Velvet”
columns. This gives us 3 + 5 = 8 bow ties.
We can find the total number of bow ties by adding the values in all three columns. This gives us
5 + 3 + 1 + 5 = 14 bow ties.
Therefore, the probability that a randomly selected bow tie is designed with swirls and is made of velvet is:
(8)/(14) = 0.57 = 57.14%
To two decimal places, the probability is 0.5714, or 57.14%.
GPT-4V:
To find the probability that a randomly selected bow tie is designed with swirls and is made of velvet, we
first need to find the total number of bow ties and then determine how many of these fit the criteria.
Probability = Number of desired outcomes / Total number of possible outcomes
Probability = 3/14
Now, converting the fraction to a decimal (rounded to two decimal places): 3 ÷ 14 = 0.21
Correct output: 0.21
Figure 63: The problem requires arithmetic reasoning combined with an understanding of a basic
statistical table and fundamental knowledge of probability. However, only GPT-4V predicts the
correct answer.
ID: 606
Ground truth: 10
Multimodal Bard:
I see the image. The rectangle has a width of 3 cm and a length of 2 cm. So, the perimeter of the rectangle
is 2*3 + 2*2 = 8 cm.
GPT-4V:
The perimeter of a rectangle is calculated by adding up all its sides.
Figure 64: This problem poses an elementary geometry question, calculating the perimeter of a
rectangle. While LLaMA-Adapter-V2 and LLaVA produce the correct answer, they do not offer
explanations to confirm the correctness of their reasoning steps. Conversely, Multimodal Bard com-
mits a calculation error when summing up the lengths of the four sides.
ID: 640
Question: Find the arc WN if △IWN is equilateral and WN = 5
LLaVA:
Given that △IWN is equilateral, we can use the cosine rule to find the length of the side opposite the angle
∠WN.
cos ∠WN = a/b
where a and b are the lengths of the adjacent sides of the equilateral triangle.
a = 5
b = 5
Substituting the values, we get:
cos ∠WN = 5/5 = 1
So, the length of the side opposite the angle ∠WN is equal to the length of the adjacent sides, which is 5.
Therefore, the correct answer is: (D) 10 π
Multimodal Bard:
The correct answer is (B).
Since △IWN is equilateral, ∠WIN = ∠WNI = 60◦ . Therefore, the arc WN = π − 2 · 60◦ = 5/3 π.
Correct output: (B) 5/3 π
GPT-4V:
For an equilateral triangle, each interior angle is 60◦ .
The arc corresponding to a central angle in a circle is given by: Arc length = central angle in radians×radius
Figure 65: This is a high school-level geometry problem requiring an understanding of geometric
symbols and relations, the application of the correct theorem, and accurate calculations. While
Multimodal Bard generates the correct answer, it applies the wrong theorem. GPT-4V outperforms
other baselines, demonstrating its remarkable ability to reason geometrically.
ID: 98
Ground truth: 2
Multimodal Bard:
The value of the square in the figure is 7.
GPT-4V:
From the given figure:
Circle + Square = 5
=⇒ Circle = 5 − Square . . . (i)
Triangle + Triangle = 8
=⇒ Triangle = 4 . . . (ii)
Circle + Triangle = 7
Using (ii), Circle + 4 = 7
=⇒ Circle = 3 . . . (iii)
Using (i) and (iii),
5 − Square = 3
=⇒ Square = 2.
Correct output: 2
Figure 66: The question assesses abstract thinking through a puzzle test problem. Only GPT-4V
grasps the abstract concepts and employs step-by-step reasoning to deduce the final answer.
ID: 778
Question: Can you find the missing shape in this picture puzzle?
Multimodal Bard:
The missing shape in the picture puzzle is (D).
The shapes in the puzzle are all regular polygons, and the number of sides increases by 1 as you move from
left to right. The first shape is a pentagon, the second shape is a hexagon, the third shape is an octagon, and
the fourth shape is a decagon. The fifth shape should be a dodecagon.
The answer choice A is a pentagon, which is already in the puzzle. The answer choice B is a heptagon,
which has 7 sides. The answer choice C is a nonagon, which has 9 sides. The answer choice D is a
dodecagon, which has 12 sides.
GPT-4V:
The pattern seems to involve alternating filled and unfilled shapes.
Figure 67: All baselines struggle to discern the shapes presented in the problem and the options, and
are unable to identify the correct pattern.
ID: 750
Ground truth: 18
Multimodal Bard:
The missing number in the puzzle is 7.
The numbers in the puzzle are arranged in a circle, and each number is surrounded by the sum of the two
numbers next to it. For example, the number 12 is surrounded by 9 and 10, because 9 + 10 = 19.
The missing number must be surrounded by 2 and 5, and the only number that satisfies this condition is 7.
Therefore, the missing number is 7.
GPT-4V:
From the given image, the number in the center, 2, is the result of the difference between the number directly
opposite it in each pair:
12 − 10 = 2
9−5=4
6−? = 2
Given that the difference between 6 and the missing number is 2, the missing number is: 6 − 2 = 4.
Figure 68: All baselines fail to deduce the pattern inherent in the logical reasoning problem.
ID: 693
Multimodal Bard:
The answer is (B), No. The orange is not larger than the car. The car is clearly much larger than the orange.
Even if it is a large orange, it is still much smaller than the car. The orange is probably about the size of a
basketball, while the car is at least the size of a small SUV.
I can’t see the orange in the image you sent, but I can still answer the question based on the information you
provided. The question is asking if the orange is larger than the car in general, not in the specific image.
So, even if the orange in the image is larger than the car in the image, the answer would still be no, because
oranges are not generally larger than cars.
Correct output: (B) No
GPT-4V:
The orange is not larger than the car; it’s just closer to the camera, creating an optical illusion.
Correct output: (B) No
Figure 69: The question pertains to everyday common sense and the identification of ob-
jects—specifically an orange and a car—in the image. GPT-4V produces the correct answer by
discerning an optical illusion present in the image.
ID: 29
Ground truth: 4
LLaMA-Adapter-V2: The age gap between these two people in the image is 10 years.
LLaVA: The age gap between the two people in the image is 20 years.
Multimodal Bard:
Based on the image you sent, the age gap between the king and queen is 3 years.
The king is Richard III of England, who was born in 1452. The queen is Anne Neville, who was born in
1456. So, Richard was 3 years older than Anne when they married in 1472.
The stained glass window is likely depicting the couple after their marriage, so the age gap between them
would still be 3 years.
GPT-4V:
The two figures depicted in the stained glass are King Richard III and Anne Neville. Richard III was born
on October 2, 1452, and Anne Neville was born on June 11, 1456.
Figure 70: In this instance, only GPT-4V correctly recognizes the celebrities along with their profiles
to calculate the age gap.
ID: 19
Multimodal Bard:
The highest amount the beaker in the image measures is 600 ml.
The markings on the side of the beaker are in milliliters (ml). The largest marking is 600 ml, which is the
maximum amount of liquid that the beaker can hold.
The other markings on the beaker are 300 ml, 200 ml, 100 ml, and -400 ml. The negative markings indicate
the amount of liquid that is below the zero mark.
GPT-4V:
The highest amount this glass measures is 600 ml.
Figure 71: Most AI models without tool augmentation fall short in this example. A model not only
needs to recognize the text and numbers accurately but also to possess numerical common sense to
determine the highest amount that the beaker can measure.
ID: 807
Choices:
(A) Truncate (B) Acuminate
(C) Rounded (D) Sagittate
LLaVA: The leaf shape with the least amount of wind resistance and water loss would be the truncate leaf.
This is because truncate leaves have a more streamlined shape, which reduces air resistance and minimizes
water loss due to evaporation.
Multimodal Bard:
The answer is (C). Rounded leaves have the least amount of wind resistance and water loss. This is because
they have a smooth, curved surface that does not catch the wind as much as other leaf shapes. Rounded
leaves also have a smaller surface area than other leaf shapes, which means that they lose less water through
evaporation.
The other leaf shapes in the choices are not as aerodynamic as rounded leaves. Truncate leaves have a flat
end, which can catch the wind. Acuminate leaves have a long, pointed tip, which can also catch the wind.
Sagittate leaves have a long, arrow-shaped base, which can also catch the wind.
GPT-4V:
The leaf shape that would have the least amount of wind resistance and water loss would be the one that is
most slender and pointed. This is because a slender and pointed shape would allow wind to flow around it
with the least resistance, and a narrow surface area would lead to less water loss through transpiration.
Among the options, the “acuminate” leaf shape is the most slender and pointed.
Correct output: (B) Acuminate
Figure 72: The question requires grasping biology concepts depicted in the image. LLaMA-
Adapter-V2, LLaVA, and Multimodal Bard demonstrate a deficiency in domain-specific knowledge
and struggle to interpret the illustration.
ID: 278
LLaMA-Adapter-V2:
The correct option letter for the best PSNR score is “C,” which is the PSNR score for the mic.
Multimodal Bard:
The answer is (D) Ship.
The PSNR score for the ship is 33.4 dB, which is higher than the PSNR scores for the other objects. The
PSNR score measures the amount of distortion between the original image and the reconstructed image. A
higher PSNR score indicates less distortion and a better quality reconstruction.
Therefore, the ship has the best PSNR score among the given objects.
GPT-4V:
Based on the table for PSNR (dB) scores:
Lego:
DMTet: 29.0
FLeXiCubes: 28.8
Mats:
DMTet: 27.0
FLeXiCubes: 26.7
Mic:
DMTet: 30.7
FLeXiCubes: 30.8
Ship:
DMTet: 26.0
FLeXiCubes: 25.9
Among the choices, “Mic” has the highest PSNR scores with both DMTet and FLeXiCubes, so the answer
is (C) Mic.
Correct output: (C) Mic
Figure 73: The problem necessitates a unified comprehension of the table and its corresponding
caption within academic domains. Exclusively, GPT-4V accurately recognizes the OCR text from
related columns and deduces the correct answer.
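The table lookup GPT-4V performs here reduces to an argmax over per-object PSNR scores. Below is a minimal sketch of that check, using the scores as GPT-4V transcribed them from the table; the dictionary layout is ours, for illustration only.

# PSNR (dB) scores as transcribed by GPT-4V above; treat the values as illustrative.
psnr = {
    "Lego": {"DMTet": 29.0, "FLeXiCubes": 28.8},
    "Mats": {"DMTet": 27.0, "FLeXiCubes": 26.7},
    "Mic":  {"DMTet": 30.7, "FLeXiCubes": 30.8},
    "Ship": {"DMTet": 26.0, "FLeXiCubes": 25.9},
}
# Pick the object whose best score across the two methods is highest.
best = max(psnr, key=lambda obj: max(psnr[obj].values()))
print(best)  # Mic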
ID: 725
LLaVA:
If Alex measures a constant vPA = −78 km/h for car P, Barbara will measure a velocity of vPB = −26 km/h relative to car P.
Multimodal Bard:
The velocity of car P relative to Barbara is given by:
vPB = vPA − vBA = −78 km/h − 52 km/h = −130 km/h
Therefore, Barbara will measure a velocity of -130 km/h for car P.
According to the relative velocity addition rule, the velocity of car P relative to Barbara is equal to the
velocity of car P relative to Alex minus the velocity of Barbara relative to Alex.
In this case, the velocity of car P relative to Alex is -78 km/h and the velocity of Barbara relative to Alex
is 52 km/h. Therefore, the velocity of car P relative to Barbara is -130 km/h.
Correct output: -130
GPT-4V:
Given:
vBA = 52 km/h (velocity of Barbara relative to Alex)
vPA = -78 km/h (velocity of car P as measured by Alex)
Figure 74: This college-level question evaluates the capacity to understand scientific figure, com-
mand domain-specific knowledge in science, and execute rigorous reasoning. While GPT-4V mis-
applies its domain-specific knowledge resulting in an incorrect response, Multimodal Bard delivers
the correct answer supported by a logical explanation.
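The correct relative-velocity step that Multimodal Bard carries out is a single subtraction; a minimal numeric check, assuming all velocities are measured along the same axis in km/h:

vPA = -78.0   # velocity of car P as measured by Alex
vBA = 52.0    # velocity of Barbara relative to Alex
vPB = vPA - vBA
print(vPB)    # -130.0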
ID: 649
Ground truth: 4
LLaMA-Adapter-V2: There are 2 values below 30 in Mainly are incidents of individual misconduct.
LLaVA: There are two values below 30 in Mainly are incidents of individual misconduct.
Multimodal Bard:
There is only 1 value below 30 in the “Mainly are incidents of individual misconduct” column. This value is
22, which is the percentage of women who believe that recent allegations of sexual harassment and assault
are mainly incidents of individual misconduct.
The other values in this column are 34, 31, 33, and 25. All of these values are greater than 30.
GPT-4V:
Based on the “Mainly are incidents of individual misconduct” column, the following values are below 30:
- 28 (Total)
- 22 (Women)
- 19 (College grad+)
- 25 (Dem/Lean Dem)
Figure 75: The question with a richly formatted statistical figure presents challenges for prominent
multimodal models such as LLaMA-Adapter-V2, LLaVA, and Multimodal Bard. However, GPT-4V
successfully extracts essential information from the figure, producing the correct answer.
ID: 389
Multimodal Bard:
The correct answer is (B), surplus.
At a price of $320, the quantity supplied is 12,500 units, while the quantity demanded is 20,000 units. This
means that there are 7,500 more units supplied than demanded. So the answer is surplus.
GPT-4V:
At a price of $320:
Quantity demanded = 20,000
Quantity supplied = 12,500
Since the quantity demanded (20,000) is greater than the quantity supplied (12,500), there is a shortage.
Correct output: (A) shortage
Figure 76: LLaMA-Adapter-V2, LLaVA, and Multimodal Bard struggle to discern the structured
information within the table, leading to incorrect answers.
This section compares the reasoning abilities of GPT-4V with those of other large multimodal models (LMMs) on M ATH V ISTA, considering various types of visual contexts. Models used for comparison include LLaMA-Adapter-V2-7B (simplified as LLaMA-Adapter-V2), LLaVA-LLaMA-2-13B (simplified as LLaVA), and Multimodal Bard.
ID: 742
Ground truth: 3
Multimodal Bard:
According to the image, the 0 inch mark of the ruler is aligned with the base of the twig. The tip of the twig
is just past the 5 inch mark. Therefore, the twig is about 5 inches long.
Figure 77: Large multimodal models like GPT-4V struggle to answer this elementary-level question
with the abstract scene as the context.
ID: 7
Question: What time is shown? Answer by typing a time word, not a num-
ber. It is ( ) after eight.
Choices: (A) half (B) quarter (C) o’clock (D) quarter to (E) quarter past
Multimodal Bard:
The image you sent shows a clock with the hands pointing to 8:15. So the time is quarter past eight. (E)
The hour hand is between 8 and 9, which means it is past 8 o’clock. The minute hand is pointing to the 3,
which is 15 minutes past. Therefore, the time is quarter past eight.
GPT-4V:
The minute hand is pointing to the 9, which indicates that it is 45 minutes past the hour. The hour hand is
pointing a little past the 8, which means it’s after 8 o’clock.
45 minutes past the hour is also described as ”quarter to” the next hour.
Figure 78: Large multimodal models face challenges in answering this elementary-level question
which entails interpreting the time from an abstract clock diagram.
ID: 497
Ground truth: 9
Multimodal Bard: The largest bar in the image is the one for the title “ninety”. It has a value of 10.
Figure 79: Large Multimodal Models such as Multimodal Bard and GPT-4V fail to extract the correct value for the tallest bar in this basic bar chart.
ID: 629
Ground truth: 20
LLaVA: Based on the image, 30% of people prefer the object hut.
Multimodal Bard: Based on the image, 100% of people prefer the object hut. The line graph shows that
the percentage of people who prefer the hut is 100%, which is the maximum value on the graph.
GPT-4V: The percentage of people who prefer the object “hut” is approximately 40%.
Figure 80: To address this question, the model simply needs to identify the “hut” and determine its
value on the x-axis. However, current foundation models such as GPT-4V struggle to produce the
correct answer.
ID: 767
Ground truth: 5
Multimodal Bard:
The maximum value of y is 3.
GPT-4V:
The equation given is for a circle: x² + y² = 25.
The maximum value of y corresponds to the highest point of the circle, which is equal to the radius.
Figure 81: The figure illustrates a function plot accompanied by its equation. GPT-4V correctly
identifies the equation, leading to the correct answer.
ID: 39
Multimodal Bard:
The function in the image is odd.
An even function is a function where f (−x) = f (x). An odd function is a function where f (−x) = −f (x).
GPT-4V:
The given function is y = a^x where a > 1.
To determine if the function is odd or even, we can use the following definitions:
y = a^(−x)
This is not equal to −a^x and it’s not equal to a^x either.
Thus, the function is neither odd nor even. However, since the given choices only include “odd” and “even”,
the best answer is neither. But since I must choose one of the provided options (A) odd.
Correct output: (A) odd
Figure 82: The figure showcases a hand-drawn function plot. GPT-4V identifies the function’s
expression and accurately determines its property.
ID: 478
Question: As shown in the figure, in the diamond ABCD, two diagonal lines
AC = 12.0, BD = 16.0, then the edge length of this diamond is ().
Ground truth: 10
LLaVA: Based on the given information, the edge length of the diamond ABCD is 8.0.
Multimodal Bard:
The answer is (A). 10.
In a rhombus, the diagonals bisect each other at right angles. So, we can draw two right triangles within the
rhombus, each with a leg of length 6 and a hypotenuse of length 10.
By the Pythagorean Theorem, the length of each side of the rhombus is √(6² + 6²) = √72 = 10.
GPT-4V:
Given that AC = 12.0 and BD = 16.0 in the rhombus (diamond) ABCD.
Since all sides of a rhombus are congruent, the diagonals of a rhombus are perpendicular bisectors of each
other. This means that each half of the diagonal BD is 8.0 and each half of diagonal AC is 6.0.
Using the Pythagorean theorem for one of the right triangles (formed by the diagonals):
Figure 83: For this geometry problem, GPT-4V correctly applies the appropriate theorem and arrives
at the right answer. In contrast, Multimodal Bard commits calculation errors during its reasoning
process.
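The correct derivation applies the Pythagorean theorem to the half-diagonals; a minimal numeric check of that step:

import math

AC, BD = 12.0, 16.0
# The diagonals of a rhombus bisect each other at right angles, so each side is the
# hypotenuse of a right triangle with legs AC/2 and BD/2.
side = math.hypot(AC / 2, BD / 2)
print(side)  # 10.0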
ID: 410
LLaVA:
The perimeter of △DEF can be calculated using the similarity theorem, which states that if two triangles
are similar, their corresponding sides are proportional. In this case, we know that △DEF ∼ △CBF and
the perimeter of △CBF is 27.
Since the ratio of the corresponding sides of similar triangles is the same, we can use the perimeter of
△CBF to find the perimeter of △DEF . The perimeter of △DEF is proportional to the perimeter of
△CBF , so we can multiply the perimeter of △CBF by the ratio of the corresponding sides.
The ratio of the corresponding sides of △DEF and △CBF is 6:8, which means that the perimeter of
△DEF is 6 × 8 = 48.
Multimodal Bard:
Since △DEF ∼ △CBF , we have DE/CB = DF/FC. Since DE = 6 and FC = 8, we have DE/CB = 6/8 = 3/4.
The perimeter of △CBF is 27, so CB + BF = 27. Substituting 3/4 for CB/BF, we get (3/4)BF + BF = 27. Solving for BF, we get BF = 12.
The perimeter of △DEF is DE + EF + DF = 6 + EF + 6 = EF + 12. Substituting 3/4 for EF/BF, we get EF + 12 = (3/4)EF. Solving for EF, we get EF = 9.
GPT-4V:
Given △DEF ∼ △CBF and DF = 6 and FC = 8, we can find the scale factor of the two similar triangles.
Scale factor = DF/FC = 6/8 = 3/4.
Given the perimeter of △CBF = 27, we can find the perimeter of △DEF by multiplying by the scale factor.
Perimeter of △DEF = 27 × 3/4 = 20.25.
Correct output: 20.25
Figure 84: For this geometry problem, GPT-4V correctly applies the appropriate theorem and arrives
at the right answer. In contrast, Multimodal Bard commits calculation errors during its reasoning
process.
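GPT-4V’s shortcut relies on the fact that the perimeters of similar triangles scale by the same ratio as their corresponding sides; a minimal check of the arithmetic:

from fractions import Fraction

DF, FC = 6, 8
scale = Fraction(DF, FC)              # 3/4
perimeter_CBF = 27
perimeter_DEF = perimeter_CBF * scale
print(float(perimeter_DEF))           # 20.25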
ID: 981
Multimodal Bard:
I see the image. The sum of 2002, 2003 and 2004 is 1.40.
The year 2002 has a share of 67.4%, 2003 has a share of 64.3%, and 2004 has a share of 62%. When we add
these three values, we get 1.937%. Rounding to one decimal place, the sum is 1.40.
GPT-4V:
The sum for the years 2002, 2003, and 2004 from the graph are:
2002: 19.4%
2003: 23.2%
2004: 25.1%
Figure 85: The figure displays a low-resolution line plot with numerous points. Current foundation
models, including GPT-4V, appear to struggle with accurately interpreting the plot.
ID: 209
Multimodal Bard: The highest value in the black line chart is 1.4.
GPT-4V: The highest value in the black line chart, which represents “Women,” is 30.1.
Figure 86: The figure displays two low-resolution line plots comprising numerous points. Current
foundation models, including GPT-4V, seem to struggle with interpreting the plots accurately.
ID: 980
Ground truth: 12
Figure 87: LLaVA and Multimodal Bard are unable to identify the highest number displayed on the
clock panel in the image.
ID: 884
Choices:
(A) ferret (B) cat (C) cloud (D) octopus
LLaMA-Adapter-V2: The blue kite in the lower right corner is shaped like a cloud.
LLaVA: The blue kite in the lower right corner is shaped like a cloud.
GPT-4V: The blue kite in the lower right corner is shaped like (D) octopus.
Correct output: (D) octopus
Figure 88: The question demands spatial reasoning ability from the model to pinpoint the object
within the image. In this context, only GPT-4V delivers a correct prediction.
ID: 123
LLaMA-Adapter-V2:
The diagram represents successive rotations, starting from the top down. The next shape that comes next is
letter D.
Correct output: (D) D
Multimodal Bard:
The answer is (C).
The diagram shows a 3D cube being rotated 90 degrees on each axis. Starting from the top, the first shape
is a square. The second shape is a cube with a square missing from the top face. The third shape is a cube
with a square missing from the front face. The fourth shape is a cube with a square missing from the left face.
To find the next shape, we need to rotate the cube 90 degrees on the z-axis. This will bring the square that is
missing from the left face to the front face. So the next shape is the one shown in option (C).
GPT-4V:
The diagram shows a pattern of rotating the shape in a specific sequence. To determine the next shape:
Starting with the first shape, when we rotate it 90 degrees counterclockwise, we get the second shape.
Rotating the second shape 90 degrees counterclockwise gives the third shape.
Thus, rotating the third shape 90 degrees counterclockwise will give us the next shape.
From the choices provided, the shape that matches this rotation is (E) E.
Figure 89: This puzzle test figure demands intricate skills in abstract thinking and spatial reasoning,
which prove challenging for current foundation models. Although GPT-4V correctly discerns the
transformation pattern of the shapes, it errs in spatially distinguishing between shapes (D) and (E).
ID: 941
Ground truth: 2
Solution:
(3968 + 1) = 63²
(63 + 1) = 8²
(8 + 1) = 3²
(3 + 1) = 2²
Multimodal Bard:
The next number in the series is 9.
The series is based on the square root of the number plus 1. So, the first number is 3968, and its square
root is 63. Then, 63 + 1 = 64, and its square root is 8. Continuing the pattern, 8 + 1 = 9, and its square root is 3.
GPT-4V:
The pattern in the ladder involves dividing the number by the step number to get the next number.
The given sequence doesn’t seem to have a consistent pattern based on the values presented.
However, if we assume the ladder sequence is based on dividing by a consistent factor (rather than step
number), we can attempt to find a common factor:
3968 ÷ 63 ≈ 63
63 ÷ 8 = 7.875 which isn’t a round number.
8 ÷ 3 = 2.6666 again not a round number.
Without a clear and consistent pattern, it’s challenging to determine the next number in the sequence.
Figure 90: The question demands advanced abstract thinking to discern the consistent pattern within
a sequence of four numbers. Despite their capabilities, current foundation models including Multi-
modal Bard and GPT-4V show limited abilities of logical reasoning.
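The ground-truth rule (each entry is the square root of the previous entry plus one) is easy to verify programmatically; a minimal sketch of that check:

import math

sequence = [3968, 63, 8, 3]
# Rule from the solution: next = sqrt(previous + 1).
for prev, nxt in zip(sequence, sequence[1:]):
    assert math.isqrt(prev + 1) == nxt
print(math.isqrt(sequence[-1] + 1))  # 2, the missing number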
ID: 105
LLaMA-Adapter-V2: No, Dark Violet does not have the minimum area under the curve.
LLaVA: Based on the image, Dark Violet does not have the minimum area under the curve. The area under
the curve is the highest for the color Green.
Multimodal Bard:
The correct answer is (B), no. Dark Violet does not have the minimum area under the curve. The color with
the minimum area under the curve is Yellow Green.
The area under the curve is calculated by summing the areas of all the rectangles under the curve. The
rectangle for Dark Violet has a height of 92 and a width of 20, so its area is 1840. The rectangle for Yellow
Green has a height of 88 and a width of 40, so its area is 3520. Therefore, the area under the curve for Yellow
Green is larger than the area under the curve for Dark Violet.
GPT-4V:
Based on the scatter plot provided, “area under the curve” is not represented since there is no curve shown.
However, if by “area under the curve” we mean the overall distribution and position of the points for ”Dark
Violet” compared to others, we can analyze the positions of the points.
From the scatter plot, the color “Dark Violet” consistently appears to be the lowest on the y-axis across the
entire range of the x-axis.
Figure 91: The scatter plot presented in this problem is template-generated. While models such as
LLaMA-Adapter-V2, LLaVA, and Multimodal Bard encounter difficulties in discerning quantitative
relationships between different plot lines, GPT-4V successfully discerns the correct relationships and
provides an accurate answer.
ID: 542
Ground truth: 2
LLaMA-Adapter-V2: There are 10 models in the figure that achieve an Acc score greater than 60.
LLaVA: There are two models in the figure that achieve an Acc score greater than 60.
Correct output: 2
Multimodal Bard:
There are 8 models in the figure that achieve an Acc score greater than 60. These models are:
* GPT-4
* GPT3-175B
* LLAMA-7B
* LLAMA2-7B
* LLAMA-13B
* LLAMA2-13B
* LLAMA-33B
* LLAMA-65B
* LLAMA2-70B
The Acc score is the accuracy of the model, which is the percentage of test cases that the model correctly
classifies. An Acc score of 60 means that the model correctly classifies 60% of the test cases.
GPT-4V: There are 6 models in the figure that achieve an Acc score greater than 60.
Figure 92: The example showcases a scatter plot sourced from an academic paper. Current models,
including GPT-4V, struggle to discern the fine-grained relationships depicted in the figure.
ID: 685
LLaMA-Adapter-V2:
If organism b increases, organism c would decrease. This is because organism c is dependent on organism b
for its food supply, and an increase in organism b would lead to a decrease in organism c.
LLaVA:
If organism b increased, organism c would likely decrease. This is because the growth of organism b would
lead to increased competition for resources, such as food and space. As a result, organism c would have to
compete more fiercely for these resources, which could lead to a decrease in its population.
Multimodal Bard:
The answer is (A). If organism B increases, then organism C will decrease.
In the food chain, organism B is the predator of organism C. This means that organism B eats organism C.
If there are more organism B, then there will be more predation on organism C. This will lead to a decrease
in the population of organism C.
GPT-4V:
If Organism B (which appears to be some kind of insect) increased in population, it would mean that there
is more available food for Organism C (the fish).
Figure 93: Models such as LLaMA-Adapter-V2, LLaVA, and Multimodal Bard seem to struggle
with discerning the relationships between organisms depicted in the food chain. Instead, GPT-4V
recognizes organisms in the food chain and possesses the knowledge to infer their relations.
ID: 862
Question: One of the most dramatic videos on the web (but entirely fictitious) supposedly shows a
man sliding along a long water slide and then being launched into the air to land in a water pool. Let’s
attach some reasonable numbers to such a flight to calculate the velocity with which the man would have
hit the water. Figure indicates the launch and landing sites and includes a superimposed coordinate system
with its origin conveniently located at the launch site. From the video we take the horizontal flight distance
as D = 20.0 m, the flight time as t = 2.50 s, and the launch angle as θ0 = 40.0◦ . Find the magnitude of
the velocity at launch and at landing.
Multimodal Bard:
Given the following information:
We can find the magnitude of the velocity at launch using the following equation:
v0 = √(v0x² + v0y²)
where v0x is the horizontal component of the launch velocity and v0y is the vertical component of the launch velocity.
The horizontal component of the launch velocity can be found using the following equation:
v0x = D/t = 20.0 m / 2.50 s = 8.00 m/s
The vertical component of the launch velocity can be found using the following equation:
Figure 94: This college-level question, derived from the physics curriculum, requires understand-
ing the intricate scenario depicted in the scientific figure, possessing domain-specific knowledge,
and performing multiple reasoning steps to obtain a conclusive answer. Multimodal Bard correctly
processes the reasoning but fails to present the answer in the appropriate format.
ID: 862
Question: One of the most dramatic videos on the web (but entirely fictitious) supposedly shows a
man sliding along a long water slide and then being launched into the air to land in a water pool. Let’s
attach some reasonable numbers to such a flight to calculate the velocity with which the man would have
hit the water. Figure indicates the launch and landing sites and includes a superimposed coordinate system
with its origin conveniently located at the launch site. From the video we take the horizontal flight distance
as D = 20.0 m, the flight time as t = 2.50 s, and the launch angle as θ0 = 40.0◦ . Find the magnitude of
the velocity at launch and at landing.
GPT-4V:
Given:
- D = 20.0 m
- t = 2.50 s
- θ0 = 40.0◦
D = V0x × t
Where V0x is the horizontal component of the initial velocity.
V0x = V0 × cos(θ0)
From the first equation:
V0 × cos(θ0) = D/t
Rearranging for V0:
V0 = D / (t × cos(θ0))
Substitute the given values to get V0:
V0 = 20.0 m / (2.50 s × cos(40.0◦))
Using a calculator:
V0 = 20.0 / (2.50 × 0.766)
V0 = 20.0 / 1.915
V0 = 10.44 m/s
Thus, the magnitude of the launch velocity (which will be the same as the landing velocity, ignoring air
resistance) is 10.44 m/s.
Correct output: 10.44
Figure 95: Continuing from Figure 94, GPT-4V generates the correct answer and the associated
reasoning steps for this scientific problem.
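GPT-4V’s launch-speed computation can be reproduced in a few lines; a minimal numeric check (the landing speed equals the launch speed only under the stated assumption of no air resistance):

import math

D = 20.0                      # horizontal flight distance, m
t = 2.50                      # flight time, s
theta0 = math.radians(40.0)   # launch angle

V0 = D / (t * math.cos(theta0))
print(round(V0, 2))           # 10.44 (m/s)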
ID: 52
Ground truth: 6
LLaMA-Adapter-V2:
The final value is 1, as there is only one object left after subtracting all big red metallic spheres and all big
brown matte things.
LLaVA:
After subtracting all big red metallic spheres and big brown matte things, there are 3 objects left.
Multimodal Bard:
Given the image, there are 4 objects:
If we subtract all big red metallic spheres, there will be 3 objects left:
If we subtract all big brown matte things, there will be 2 objects left:
* 1 small red metallic sphere
* 1 small brown matte sphere
Therefore, if we subtract all big red metallic spheres and big brown matte things, there will be 2 objects left.
GPT-4V:
There are no big red metallic spheres in the image. There is 1 big brown matte object (the cube).
Objects left = 7 - 1 = 6
Figure 96: In this example, GPT-4V is able to recognize the objects with specific attributes from the
synthetic figure and then perform arithmetic reasoning to get the correct answer.
ID: 177
LLaMA-Adapter-V2:
Yes, the number of brown tandem bikes that are to the left of the small blue matte car is greater than the
number of tiny blue biplanes.
Correct output: (A) Yes
LLaVA:
The correct answer is (B) No. There are more tiny blue biplanes to the left of the small blue matte car than
brown tandem bikes.
Multimodal Bard:
The answer is (B). There are two brown tandem bikes to the left of the small blue matte car, and there is
only one tiny blue biplane. So the number of brown tandem bikes is not greater than the number of tiny blue
biplanes.
GPT-4V:
Looking at the image, there is 1 brown tandem bike to the left of the small blue matte car. There are no tiny
blue biplanes in the image.
Thus, the number of brown tandem bikes that are to the left of the small blue matte car is (A) Yes greater
than the number of tiny blue biplanes.
Correct output: (A) Yes
Figure 97: The question in this example requires the models to accurately extract the attributes of
different objects from the synthetic figure. GPT-4V infers the correct answer based on its generated
detailed description of the figure.
H.4.11 TABLE
Tables serve as a powerful tool to present and summarize large amounts of data in a comprehensible
manner. In particular, GPT-4V has shown significant advancements over other foundation models
and even surpasses human performance on table-related reasoning tasks, as shown in Figure 1. The
example in Figure 98 shows a complex table taken from an academic paper. GPT-4V can accurately
pinpoint the target cells among numerous rows and columns. Figure 99 shows a QA task in which
the answer needs to be derived from the table regarding the push-up competition. GPT-4V is the
only model that can produce the correct answer.
ID: 923
Choices:
(A) Transformers Agent (GPT-4)
(B) LLaMA-Adapter v2-7B
(C) LLaVA-7B
(D) Otter-9B
(E) MM-ReAct-GPT-3.5
(F) LLaVA-13B (LLaMA-2)
(G) MM-ReAct-GPT-4
Ground truth:
(F) LLaVA-13B (LLaMA-2)
LLaVA:
The correct answer is (B) LLaMA-Adapter v2-7B, which achieves the highest score in terms of Rec.
Multimodal Bard:
The answer is (G) MM-ReAct-GPT-4.
MM-ReAct-GPT-4 achieves the highest score of 33.1 in terms of Rec, which is the ability to recognize and
understand text. The other models have the following scores in terms of Rec:
GPT-4V:
The model that achieves the highest score in terms of Rec is “LLaVA-13B (LLaMA-2)” with a score of
39.2%. The correct option is: (F) LLaVA-13B (LLaMA-2)
Correct output: (F) LLaVA-13B (LLaMA-2)
Figure 98: The problem presents a complex table from an academic paper, complete with a caption
and multiple rows and columns. This format poses challenges to models like LLaMA-Adapter-V2,
LLaVA, and Multimodal Bard. However, GPT-4V demonstrates its exceptional reasoning abilities
within tabular contexts.
ID: 265
Ground truth: 11
LLaMA-Adapter-V2:
Justine’s P.E. class participated in a push-up competition, and Justine wrote down how many push-ups each
person could do. There were 8 people in the competition, and at least 60 push-ups were done by 3 people.
The final value is 3.
LLaVA:
In the image, there are a total of 13 people participating in the push-up competition. Among them, there are
two people who did at least 60 push-ups.
Multimodal Bard:
There are 4 people who did at least 60 push-ups.
The person with the highest number of push-ups did 778, so the only people who did at least 60 push-ups
are the ones who did 60, 77, 78, and 778.
There is one person who did 60 push-ups, one person who did 77 push-ups, one person who did 78 push-ups,
and one person who did 778 push-ups. So there are a total of 1 + 1 + 1 + 1 = 4 people who did at least 60
push-ups.
GPT-4V:
For the stem-and-leaf plot provided:
6 represents the tens value, and the digits 3, 6, 7, and 7 represent the ones value. So, the number of push-ups
represented by the 6 row are 63, 66, 67, and 67.
7 represents the tens value, and the digits 1, 3, 3, 5, 7, 7, and 8 represent the ones value. So, the number of
push-ups represented by the 7 row are 71, 73, 73, 75, 77, 77, and 78.
Counting these numbers: 4 from the 6 row + 7 from the 7 row = 11.
Correct output: 11
Figure 99: This figure features a stem-and-leaf plot used for statistical analysis. While LLaMA-
Adapter-V2, LLaVA, and Multimodal Bard struggle to comprehend this domain-specific table, GPT-
4V excels in understanding its content.
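The counting step GPT-4V performs amounts to expanding the stems and leaves into values and tallying those of at least 60; a minimal sketch using the leaf digits as GPT-4V transcribed them (treat them as illustrative):

# Stem-and-leaf rows as read by GPT-4V: stem = tens digit, leaves = ones digits.
rows = {6: [3, 6, 7, 7], 7: [1, 3, 3, 5, 7, 7, 8]}
values = [10 * stem + leaf for stem, leaves in rows.items() for leaf in leaves]
print(sum(v >= 60 for v in values))  # 11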
ID: 895
Question: Which model has the lowest Audio-Audio Similarity and Text-Audio Similarity scores
overall?
Choices:
(A) MusicLDM (mix-up)
(B) MusicLDM (original)
(C) MusicLDM (BLM)
(D) MusicLDM (BAM)
(E) MuBERT
LLaVA:
Based on the image, the model with the lowest Audio-Audio Similarity and Text-Audio Similarity scores
overall is MusicLDM (mix-up).
Multimodal Bard:
The answer is (C). MusicLDM (BLM) has the lowest Audio-Audio Similarity and Text-Audio Similarity
scores overall.
The violin plot shows that the distribution of audio-audio similarity scores for MusicLDM (BLM) is the most
spread out, indicating that there is a lot of variation in the scores. The distribution of text-audio similarity
scores for MusicLDM (BLM) is also the most spread out. This suggests that MusicLDM (BLM) is not very
consistent in its ability to capture the similarity between audio and text.
GPT-4V:
The model with the lowest Audio-Audio Similarity and Text-Audio Similarity scores overall is MuBERT.
Figure 100: Violin plots are widely used in academic papers to visualize the distribution of data and
probability density. In this example, only GPT-4V is able to generate the correct output from the
figure.
Self-verification is a social psychological theory asserting that people desire others to perceive them
as they see themselves. Consequently, individuals will take active measures to ensure that others
view them in ways that confirm their stable self-concepts (Talaifar & Swann, 2020).
Interestingly, in our experiments, GPT-4V demonstrates an ability similar to self-verification. The
model can inspect its own behaviors during the course of reasoning and can take active actions to
correct its mistakes. Note that the self-verification we discuss here differs from several recent works on
improving the model’s outputs based on external feedback (Peng et al., 2023) or additional gener-
ations (Yang et al., 2023b). The examples in Figures 101 and 103 show that GPT-4V, on its own,
can inspect a set of candidate answers and identify the one that is valid and meets all the given con-
straints. The multi-step reasoning example in Figure 102 shows that GPT-4V can verify the validity
of (the result of) each reasoning step, and explore alternative approaches if any invalid (intermediate)
result is detected (e.g., a negative value for length).
Although self-verification does not guarantee an accurate response even after multiple tries, espe-
cially when applying GPT-4V to visual perception or mathematical reasoning in intricate scenarios
(see Figure 104), it is instrumental in improving the model performance on M ATH V ISTA. We also
found that GPT-4V’s self-verification is weaker for non-English tasks, such as Mandarin, as shown
in Figure 105. It is also worth noting that self-verification does not emerge in other foundation mod-
els we studied, or at least it is not as robust as that of GPT-4V. As shown in Figure 106, Multimodal
Bard first attempts a natural language solution, followed by a program-assisted one for verification.
However, the program-aided solution leads to a different and incorrect prediction.
The emergent ability of self-verification highlights GPT-4V’s potential in solving rigorous reasoning
and theorem-proving tasks. One of the most exciting research topics for future work is to develop
a mechanism that allows the model to activate self-verification consistently at the right time and to
use a set of alternative approaches that maximize the success rate of task completion.
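One way such a mechanism could be prototyped externally is as a generate-verify-retry loop around the model. The sketch below is illustrative only and is not the procedure GPT-4V follows internally; ask and check are hypothetical stand-ins for an LMM call and a task-specific constraint checker:

def self_verifying_answer(ask, check, max_attempts=3):
    # Illustrative generate-verify-retry loop (a sketch, not GPT-4V's internal procedure).
    # ask(feedback) queries the model; check(answer) returns a description of any violated
    # constraint, or an empty string when the answer is valid.
    feedback = ""
    answer = None
    for _ in range(max_attempts):
        answer = ask(feedback)
        problems = check(answer)
        if not problems:
            return answer
        feedback = f"The previous answer was invalid ({problems}); try a different approach."
    return answer  # fall back to the last attempt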
ID: 615
GPT-4V:
The sum of the lengths of any two sides of a triangle must be greater than the length of the third side. Let x
be the length of the third side.
Figure 101: Given several possible values for the length of the third side, GPT-4V verifies each
candidate answer to ensure that it meets the given constraints (highlighted in blue).
ID: 6
Question: As shown in the figure, if CB = 4.0, DB = 7.0, and
D is the midpoint of AC, then the length of AC is ().
Choices: (A) 6 cm (B) 7 cm (C) 4 cm (D) 5 cm
Ground truth: (A) 6 cm
GPT-4V:
From the figure, we can deduce:
AD = DC (since D is the midpoint of AC)
Now, using the entire line segment AB:
AB = AD + DB
AB = DC + DB
Given DB = 7.0 cm, we can rearrange to find:
AD (or DC) = AB − DB
AD = AB − 7.0
Now, let’s find AB:
From the given information,
AB = AC + CB
AB = AC + 4.0
Using the previously deduced equation:
AD = AC + 4.0 − 7.0
AD = AC − 3.0
Since D is the midpoint of AC, AD will be half the length of AC:
AC = 2(AD)
From the above:
AC − 3.0 = 2(AD)
Given that AD and DC are equal, we can represent DC as:
DC = AC/2
Using the relation CB = 4.0 cm and DB = 7.0 cm:
AC/2 + 7.0 = 4.0
AC/2 = −3.0
This is not possible as lengths cannot be negative. Let’s re-evaluate our approach.
From the original figure:
AB = AD + DB
AB = DC + DB
Given DB = 7.0 cm, we have:
AD = AB − 7.0
AD = AC/2 − 7.0
Using the relation AB = AC + CB:
AC/2 + 7.0 = AC + 4.0
Subtracting AC/2 from both sides:
7.0 = AC/2 + 4.0
Subtracting 4 from both sides:
AC/2 = 3.0
Multiplying both sides by 2:
AC = 6.0 cm
Correct output: (A) 6 cm
Figure 102: GPT-4V generates the result as AC/2 = −3.0. It verifies the validity of the value and
concludes it cannot be negative. The model then re-evaluates its approach and adopts a new method
to obtain the correct answer, highlighted in blue.
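The corrected algebra in Figure 102 reduces to a single linear equation; a minimal symbolic check (assuming SymPy is available):

from sympy import Eq, solve, symbols

AC = symbols("AC")
# AD + DB = AB, with AD = AC/2, DB = 7, and AB = AC + CB where CB = 4.
print(solve(Eq(AC / 2 + 7, AC + 4), AC))  # [6]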
ID: 28
GPT-4V:
Given that AB is a diameter and AC = 8 inches and BC = 15 inches, we can use the Pythagorean theorem
to determine the length of AB.
The right triangle is formed by A, the center of the circle, and C makes AC the shorter leg, the radius the
longer leg, and BC the hypotenuse.
AC² + r² = BC²
Figure 103: Starting by computing the initial radius length as r = √161, GPT-4V performs two
verifications resulting in different answers: r = 2√161 and r = √161/2. It then selects the
matching option from the choices as its final prediction, highlighted in blue.
ID: 644
Solution: 8 + 4 − 2 = 10, 3 + 2 − 3 = 2, 12 + 1 − 5 = 8
GPT-4V:
Let’s analyze the pattern in the given triangles:
(12 − 1) × 5 = 55
However, since the answer is in single digits, it seems we take only the unit’s place of the result.
This is not in the given options. Therefore, let’s explore another possible pattern:
Given the information and the options provided, I am unable to find a consistent pattern that applies to all
the triangles.
Therefore, I cannot determine the missing number from the given choices.
Figure 104: In this logical reasoning question, GPT-4V verifies its generated solution and retries
twice. However, it still fails to produce an answer that matches any of the options. Consequently, it
claims that it cannot determine the missing number from the given choices (highlighted in blue).
ID: 32
GPT-4V:
给定 AB = AC,△ABC 是等腰三角形。由于 ∠CAB = 40◦ ,所以 ∠CBA 也为 40◦ (因为在等
腰三角形中,两个底角是相等的)。
(Given AB = AC, △ABC is an isosceles triangle. Since ∠CAB = 40◦ , ∠CBA is also 40◦ (because in
an isosceles triangle, the two base angles are equal).)
考虑四边形 ABCD,其内角和为 360◦ 。已知 ∠CAB = 40◦ ,∠CBA = 40◦ ,并且 ∠BCA 和 ∠D
是补角。
(Considering the quadrilateral ABCD, its interior angles sum up to 360◦ . We know that ∠CAB = 40◦ ,
∠CBA = 40◦ , and ∠BCA and ∠D are supplementary.)
由于 ∠BCA 和 ∠D 是补角,我们有:
(As ∠BCA and ∠D are supplementary, we have:)
∠D = 180◦ − 100◦ = 80◦
但这不是给定的选项。所以,让我们重新检查。
(But this isn’t a given option. So, let’s recheck.)
等一下,我好像考虑错了。
(Wait, I seem to have thought it wrong.)
所以 (So)
∠D = 100◦ − 40◦ = 60◦
Figure 105: The question is presented in Chinese, with the translated English provided in paren-
theses after the Chinese text. GPT-4V is capable of performing self-verification until it produces a
solution that matches one of the given options (highlighted in blue).
ID: 44
Ground truth: 18
Solution: 8 + 4 − 2 = 10, 3 + 2 − 3 = 2, 12 + 1 − 5 = 8
Multimodal Bard:
The cost of 4 kilograms of oval beads is 4 ∗ 2 = $8.
I also wrote a Python code to calculate the total cost and got the same answer of 35.
def calculate_cost(oval_beads, star_shaped_beads):
    oval_bead_cost = 2
    star_shaped_bead_cost = 2
    total_cost = oval_beads * oval_bead_cost + star_shaped_beads * star_shaped_bead_cost
    return total_cost

print(calculate_cost(4, 5))
# 35
Figure 106: Multimodal Bard seems incapable of verifying its solutions and revisiting them if found
invalid. For instance, in this example, after initially producing a natural language solution, Mul-
timodal Bard subsequently generates Python code. However, the executed result of the code (35)
differs from its initial solution (18).
Self-consistency (Wang et al., 2022) is a decoding strategy for chain-of-thought prompting (Wei
et al., 2022b). A diverse set of reasoning paths is sampled, and the most consistent answer is se-
lected as the final prediction. Moving beyond vanilla greedy decoding, this method resorts to the
inherent coherence and reliability of multiple reasoning trajectories to produce a more trustworthy
conclusion. Self-consistency has been widely employed in LLMs for complex reasoning tasks, such
as math word problems and commonsense reasoning.
In our experiments, we validated the effectiveness of using self-consistency for GPT-4V on M ATH -
V ISTA. Given a question and context, we ran GPT-4V multiple times to obtain a set of different
reasoning paths and then selected the most frequent answer as the final prediction. We found that
self-consistency is instrumental in rectifying visual perception errors (Figure 107), correcting cal-
culation mistakes (Figure 108), and mitigating hallucinations (Figure 109). In comparison, self-
consistency is less effective when GPT-4V has difficulties in interpreting complex visual contexts
(Figures 110, 111) or extracting salient information from images (Figure 112).
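In code, the procedure amounts to sampling several answers and taking a majority vote. The sketch below is a minimal illustration; sample_answer is a hypothetical stand-in for one stochastic GPT-4V call that returns a final answer:

from collections import Counter

def self_consistent_answer(sample_answer, n_samples=5):
    # Sample several reasoning paths and return the most frequent final answer.
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]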
ID: 648
Ground truth: -1
From the graph, the minimum value lies on the y-axis between -0.5 and 0. Based on the visual representa-
tion, it appears that the minimum value of y is: -0.5.
From the graph, it is evident that the lowest point (or the minimum value of the function) is at a y-coordinate
of approximately -1.
Figure 107: In this algebraic reasoning task, GPT-4V displays inconsistency in recognizing the
value range of the function. After reevaluating GPT-4V three times, three distinct reasoning paths
emerge. Utilizing self-consistency minimizes variation in the generated results, converging on the
most frequent prediction (-1) as the final answer.
ID: 331
Question: Several people compared how many Web pages they had
visited. What is the mean of the numbers?
Ground truth: 64
First, sum the number of web pages for all the individuals:
Bonnie: 84, Winston: 56, Percy: 70, Derek: 68, Jake: 60, Finn: 47, Carter: 75, Kurt: 52
Total = 84 + 56 + 70 + 68 + 60 + 47 + 75 + 52 = 512
Next, divide by the number of individuals (8):
Mean = 512 ÷ 8 = 64
The mean number of web pages visited is 64.
Correct output: 64
Bonnie: 84, Winston: 56, Percy: 70, Derek: 68, Jake: 60, Finn: 47, Carter: 75, Kurt: 52
512 ÷ 8 = 64
So, the mean number of web pages visited is 64.
Correct output: 64
Figure 108: For this arithmetic reasoning question, GPT-4V arrives at the answers of 58, 64, and 64
through three distinct reasoning paths. Applying self-consistency would confirm the correct answer
to be 64.
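The arithmetic behind the correct reasoning paths is a plain mean; a minimal check:

pages = {"Bonnie": 84, "Winston": 56, "Percy": 70, "Derek": 68,
         "Jake": 60, "Finn": 47, "Carter": 75, "Kurt": 52}
print(sum(pages.values()) / len(pages))  # 64.0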
ID: 430
The slope (or the derivative) of f (x) is represented by the inclination of the line.
At x = 0, the slope is negative (the line is declining), and at x = 5, the slope is positive (the line is
inclining). The magnitude of the slopes appears to be the same; only their directions differ.
However, when considering the magnitude of the derivative, the value at x = 0 is negative, and the value at
x = 5 is positive, but they have the same absolute value.
Wrong output: (B) equal to
The absolute value function is piecewise defined, so its derivative will have different values depending on
where you take it.
The absolute value function has a V-shaped graph, as shown in the image. The slope (or derivative) of the
line to the left of the vertex is negative, and the slope of the line to the right of the vertex is positive.
The derivative of f (x) at x = 0 corresponds to the slope of the line on the left side of the vertex, which is
negative (downward slope). The derivative of f (x) at x = 5 corresponds to the slope of the line on the right
side of the vertex, which is positive (upward slope).
Since a positive value is always larger than a negative value, the derivative of f (x) at x = 0 is smaller than
that at x = 5.
Correct output: (C) smaller than
Figure 109: This is another instance where the application of self-consistency across three reasoning
paths results in a correct final prediction.
ID: 157
- There are 6 large cubes, each representing 1, 000. So, 6 × 1, 000 = 6, 000.
- There are 7 flats, each representing 100. So, 7 × 100 = 700.
- There are 4 rods, each representing 10. So, 4 × 10 = 40.
- There are 3 units, each representing 1. So, 3 × 1 = 3.
1. There are six large cubes, which represent a value of 6 × 1000 = 6000.
2. There are three medium-sized cubes, which represent a value of 3×100 = 300.
3. There are two small cubes, which represent a value of 2 × 10 = 20.
4. The vertical stack of single unit cubes shows nine cubes, which represent a value of 9 × 1 = 9.
1. There are seven large cubes, which represent a value of 7 × 1000 = 7000.
2. There are two medium-sized cubes, which represent a value of 2 × 100 = 200.
3. There are nine small cubes, which represent a value of 9 × 10 = 90.
4. The vertical stack of single unit cubes shows eight cubes, which represent a value of 8 × 1 = 8.
Figure 110: In this arithmetic reasoning task, GPT-4V struggles with comprehending the abstract
scene, leading to its inability to calculate the correct answer through multiplication. Multiple at-
tempts yield varying incorrect solutions. Consequently, utilizing self-consistency does not verify
the definitive right answer.
ID: 551
Solution:
27% (male in 2015)
- 13% (female in 2008)
= 14%
The lowest percentage of females with at least one tattoo is in 2003, 2008, and 2012, all at 15%.
Figure 111: In this instance, GPT-4V persistently produces incorrect predictions across multiple
evaluations. When prompted to detail the essential elements of the image, GPT-4V provides inac-
curate descriptions. This suggests that if GPT-4V misinterprets vital visual cues, relying on self-
consistency may not necessarily guide it towards the correct conclusion.
ID: 965
Ground truth: 30
Prompt: List the yard line numbers that are visible in the image.
GPT-4V: The visible yard line numbers in the image are 10, 20, 30.
Prompt: Can you see the yard numbers 40 and 50 in the image?
GPT-4V: No, the yard numbers 40 and 50 are not visible in the image.
Figure 112: This example demonstrates that if GPT-4V misinterprets key information from an im-
age, the application of self-consistency may not aid in producing the correct final prediction. We
evaluated GPT-4V five times for the question “What’s the lowest number yard line that you can
see?” and received consistently incorrect responses. Further, when prompted to describe the yard
line numbers present in the image, GPT-4V provided incorrect descriptions, indicating its difficulty
in accurately recognizing them.
This section investigates the use of GPT-4V for multi-turn human-AI interaction on M ATH V ISTA,
as exemplified in the goal-directed dialog in Figure 113.
We found that GPT-4V is effective in engaging multi-turn goal-directed conversations with users.
In particular, GPT-4V can make good use of hints (e.g., user feedback or responses) to guide the
conversation to generate desirable results. For instance, it can (1) rectify visual perception errors
based on hints (Figure 114), (2) reassess reasoning steps and calculations (Figure 115), (3) correct
misinformation using user-provided domain-specific knowledge (Figure 116), and (4) aggregate
intricate contexts over multiple turns in a human-AI conversation (Figures 117 and 118).
We also observed failure cases in our evaluation. For instance, GPT-4V struggles to generate correct
responses when questions and user hints are ambiguous (Figure 119), or when the model fails to
understand abstract shapes and concepts visually (Figure 120). These failures motivate the develop-
ment of more powerful, conversational foundation models.
The total cost for a sour apple candy and a butterscotch candy is $0.13.
Luca would need at least $0.13 to purchase both items.
Why?
The cost for each item is $0.06 for the sour apple candy and $0.07 for the
butterscotch candy. Adding the two together gives us a total cost of $0.13.
Hence, Luca would need at least $0.13 to purchase both items.
Answer: 23.
Your solution is wrong. You counted the baseballs in each bucket incorrectly.
Answer: 20.
Figure 114: GPT-4V initially produces an incorrect solution due to errors in visually perceiving
the number of baseballs. After receiving user feedback, it recalibrates and generates the accurate
answer.
Following this pattern, the next two prime numbers are 23 and 29.
Your solution is incorrect. The pattern does not involve prime numbers.
Figure 115: GPT-4V correctly identifies the sequence of numbers in the images but struggles to
discern the underlying pattern. With feedback from the user, GPT-4V re-evaluates the pattern and
produces the correct response to the logical reasoning question.
The diagrams below show two pure samples of gas in identical closed, rigid containers. Each
colored ball represents one gas particle. Both samples have the same number of particles.
Compare the average kinetic energies of the particles in each sample. Which sample has the
higher temperature?
Choices: (A) sample A (B) sample B (C) neither; the samples have the same temperature
The average kinetic energy of an ideal gas particle is directly proportional to the temperature of
the gas and is given by the equation:
KE_avg = (3/2) k T
Where:
• KE_avg is the average kinetic energy per molecule.
• k is the Boltzmann constant.
• T is the absolute temperature.
For gases at the same temperature and under the same conditions, the average kinetic energy
is the same, irrespective of the mass of the particles.
Given that the average particle speed in both Sample A and Sample B is 1,400 m/s, and since
they are in identical closed, rigid containers, it implies that both samples have the same average
kinetic energy. Therefore, they also have the same temperature.
The correct answer is: (C) neither; the samples have the same temperature.
The temperature of a substance depends on the average kinetic energy of the particles in the
substance. The higher the average kinetic energy of the particles, the higher the temperature of
the substance. The kinetic energy of a particle is determined by its mass and speed. For a pure
substance, the greater the mass of each particle in the substance and the higher the average
speed of the particles, the higher their average kinetic energy.
Given:
• Both Sample A and Sample B have particles with the same average speed (1,400 m/s).
• The mass of each particle in Sample A is 44 u and in Sample B is 46 u.
Given that the speeds are the same but the mass of the particles in Sample B is greater than
that of Sample A, the particles in Sample B will have a higher kinetic energy than those in
Sample A. As the temperature is directly related to the average kinetic energy of the particles,
Sample B will have a higher temperature than Sample A.
Figure 116: Initially, GPT-4V applies an incorrect theorem to address the scientific reasoning query.
However, after receiving the appropriate domain knowledge hint from the user, it utilizes the correct
theorem, combined with key visual cues from the images, to produce the right answer.
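The corrected reasoning compares average kinetic energies at equal speeds via KE = 1/2 · m · v². A minimal numeric check using the particle masses quoted in the dialog (44 u and 46 u) and the shared speed of 1,400 m/s:

u = 1.660539e-27     # atomic mass unit in kg
v = 1400.0           # average particle speed in m/s, the same for both samples
ke_A = 0.5 * (44 * u) * v**2
ke_B = 0.5 * (46 * u) * v**2
print(ke_B > ke_A)   # True: sample B has the higher average kinetic energy, hence higher temperature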
How many models in the table have a model size larger than 10B?
Total: 10 models.
Figure 117: Answering this question requires comprehending a complex table regarding AI models (such as identifying relevant rows and columns), up-to-date knowledge of AI research, and the ability to perform advanced arithmetic reasoning. Without a human in the loop, GPT-4V fails to produce the correct answer.
You've located the wrong column. Please refer to the "Model Size" column to get the correct
answer.
I apologize for the oversight. Let's try again by referring to the "Model Size" column:
Total: 8 models.
No, the "Model Size" column is the one with values like "undisclosed" or "6B", "13B".
I apologize for the oversight. Let's count again by focusing on the "Model Size" column:
Total: 8 models.
You're almost there. You missed some models, like "wizardcoder-15b", and two others at the
bottom of the table. Also, the correct name for "11ama2" should be "llama2", pronounced as "L"
not "one".
Total: 11 models.
Figure 118: (Continued) In this multi-turn human-AI conversation, the user acts as a tutor to help GPT-4V generate the correct answer. (1) Initially, the user directs GPT-4V to the “Model Size” column after observing its misdirection. Despite the guidance, GPT-4V still struggles. (2) Recognizing the model’s difficulty in identifying the “Model Size” column, the user provides more detailed instructions. This time, GPT-4V is on the right track but still makes some errors in its response. (3) With further information from the user, GPT-4V corrects the remaining errors and generates the correct answer.
The total sum of the two visible numbers on the jerseys is 29 (21 + 8).
I apologize, I can only see the number 21 clearly on the jerseys. My best guess
is that the other number is 8, but I cannot be certain due to the image quality.
If you add the two visible numbers, on the jerseys, what is the total sum?
Based on the image quality, my best answer is that the total sum of the two
visible numbers on the jerseys is 29 (21 + 8). But I cannot be certain about the
number 8 due to the image quality.
Since you are not certain about 8, let's ignore it. In this case, if you add the two
visible numbers, on the jerseys, what is the total sum?
My best answer, excluding the uncertain number 8, is that the total sum of the
visible numbers on the jerseys is 21.
Figure 119: A failure example from GPT-4V in an interactive chatbot environment. Utilizing the
visual cues from the image to resolve the question’s ambiguity (distinguishing between two numbers,
2 and 1, instead of the single number 21), the correct answer is 3, obtained by summing 2 and 1.
1. Larger Triangles:
• The entire figure itself is a triangle. So that's 1 triangle.
2. Medium-sized Triangles:
• The three triangles formed by the division of the larger triangle into four parts. So that's 3
triangles.
3. Smaller Triangles:
• The triangles inside each of the medium-sized triangles. There are three medium-sized
triangles, and each of them contains 3 smaller triangles. So that's 3 x 3 = 9 triangles.
Your solution is incorrect. There are four categories of triangles: triangles containing one,
triangles containing two, triangles containing three, and the largest one.
I apologize for the oversight. Let's re-evaluate the triangles step by step based on your guidance:
Figure 120: In this example, even after the user offers helpful hints, GPT-4V struggles to determine the number of triangles in the given geometric shape. The task remains challenging for AI models, including GPT-4V, which have a limited ability to understand abstract shapes visually.