
Chain-of-Thought Reasoning in Tabular Language Models

Mingyu Zheng1,2†, Yang Hao3, Wenbin Jiang3, Zheng Lin1,2‡,
Yajuan Lyu3, Qiaoqiao She3, Weiping Wang1

1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3 Baidu Inc, Beijing, China
{zhengmingyu,linzheng,wangweiping}@iie.ac.cn
{haoyang03,jiangwenbin,lvyajuan,sheqiaoqiao}@baidu.com

Abstract

The tabular mathematical reasoning task requires models to perform multi-step operations, including information look-up and numerical calculations, based on heterogeneous data from tables and questions. Existing solutions tend to extend chain-of-thought (CoT) reasoning into powerful large language models (LLMs) to promote multi-hop mathematical reasoning. However, it can be extremely difficult to apply such LLM-based approaches under scenarios of privatization deployment or limited resources. To address this problem, we revisit small-scale tabular language models (TaLMs) and extend chain-of-thought reasoning into TaLMs for the first time. Specifically, we propose a novel framework, TaCo, which coordinates two TaLMs responsible for CoT generation and answer inference, respectively. Besides, our framework can be combined with an external calculator to enhance accurate numerical calculations. On the TABMWP dataset, TaCo outperforms the state-of-the-art ChatGPT by 9.55% (82.60%→92.15% in accuracy) with far fewer parameters (0.8B).1

1 The code will be released at https://github.com/SpursGoZmy/TaCo
† This work was done during an internship at Baidu Inc.
‡ Corresponding author: Zheng Lin.

Figure 1: An example from the TABMWP dataset. To solve the problem, the model needs to perform multi-step mathematical reasoning based on the table and the question.
1 Introduction

The tabular mathematical reasoning task aims at answering math questions based on heterogeneous tabular and textual data, which can provide users with insights from tables containing valuable figures (Lu et al., 2023b; Zhu et al., 2021; Chen et al., 2021b). This task highlights the demand for multi-step mathematical reasoning including information look-up and numerical calculations. For example, given the table and the question in Figure 1, we first need to count how many numbers are in the table, then add all the numbers together to get the sum of baskets, and finally compute the mean of the sum.

Considering the inherent demand for multi-step operations, existing studies tend to extend chain-of-thought (CoT) reasoning (Wei et al., 2022; Wang et al., 2023a; Kojima et al., 2022; Zhang et al., 2022) into powerful Large Language Models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; Thoppilan et al., 2022; Chen et al., 2021a) to promote multi-hop mathematical reasoning. As depicted in Figure 2 (b), this paradigm prompts LLMs with several in-context examples containing CoT demonstrations to elicit intermediate reasoning steps before inferring the final answer.

Though the combination of LLMs and CoT has achieved great performance, such LLM-based methods may not be feasible in some real-world scenarios. For instance, it is financially expensive to satisfy the high computational requirements, storage capacity and bandwidth demanded by LLMs, which makes it a challenge for individual users or small organizations to utilize LLMs in their applications (Strubell et al., 2019; Bender et al., 2021). In consideration of data security, enterprises may also seek privatization deployments where private data is not allowed to be processed by third-party LLM APIs. What's more, despite the fact that many pre-trained tabular language models have been developed (Liu et al., 2022; Herzig et al., 2020; Wang et al., 2021; Dong et al., 2022), their CoT reasoning ability has not been thoroughly investigated and could be inadequate for solving the tabular mathematical reasoning task. As a result, an alternative approach, with lower costs and competitive CoT reasoning ability, is needed.
To accomplish this goal, we revisit small-scale tabular language models (TaLMs) and explore chain-of-thought reasoning in TaLMs for the first time. Specifically, we propose a novel framework named TaCo, which coordinates two TaLMs that are responsible for CoT generation and answer inference, respectively. Given the input table and question, the first TaLM is fine-tuned to generate intermediate reasoning steps. Based on the original input and the generated reasoning steps, the second TaLM is fine-tuned to infer the final answer. To alleviate the weakness of TaLMs in solving mathematical expressions, TaCo is also combined with an external calculator which is used to perform math calculations and fix incorrect results in the output reasoning steps.

To verify the effectiveness of the proposed method, we conduct comprehensive experiments on the TABMWP (Lu et al., 2023b) dataset, which is the latest math word problem benchmark over tabular data and provides detailed chain-of-thoughts to solve the problems step by step. Experimental results reveal that TaCo explores a new and promising paradigm for tabular mathematical reasoning, which is illustrated in Figure 2 (c). Compared with traditional fine-tuned TaLMs, TaCo improves the accuracy of the recent TAPEX model by 29.76%. Compared with LLM-based approaches, TaCo outperforms the state-of-the-art ChatGPT by 9.55% (82.60%→92.15%) with far fewer parameters (0.8B). Moreover, we conduct ablation studies to analyse the contributions of different parts of the framework. A detailed error analysis is also performed to provide insights for future improvements.

To summarize, we conclude our contributions as follows:

• To the best of our knowledge, we explore chain-of-thought reasoning in TaLMs for the first time, and advocate a new and promising paradigm for tabular mathematical reasoning, especially under scenarios where LLM-based methods are not feasible.

• We propose a novel framework, TaCo, which coordinates two TaLMs responsible for CoT generation and answer inference, respectively. It is also integrated with a calculator to enhance accurate numerical calculations.

• Our method can boost the performance of small-scale TaLMs and surpasses the state-of-the-art ChatGPT by 9.55% on the TABMWP benchmark with far fewer parameters (0.8B).

2 Pilot Experiment

Before diving into the specific method, we present a pilot experiment on the TABMWP dataset to answer two important questions: (i) Do existing pre-trained generative TaLMs possess chain-of-thought reasoning ability? (ii) Can generative TaLMs benefit from chain-of-thoughts when predicting the final answer? We select the state-of-the-art TAPEX model (Liu et al., 2022) for experiments, which is based on the encoder-decoder language model BART (Lewis et al., 2020) and is additionally pre-trained on tabular data. We consider two model sizes: TAPEX-base (140M) and TAPEX-large (400M).

Experiments are conducted in three different settings, i.e., vanilla, zero-shot CoT and gold CoT. For the "vanilla" setting, the pre-trained TAPEX model f(·) autoregressively generates the answer a based on the table t and the question q, i.e., a = f(t, q). For the "zero-shot CoT" setting, we follow Kojima et al. (2022) to evaluate the CoT reasoning of TAPEX. Specifically, a trigger sentence p1 is appended to the question in order to ask TAPEX to output intermediate reasoning steps s, i.e., s = f(t, q, p1). Then, given the original input and the generated CoT, another trigger sentence p2 is appended to make TAPEX output the final answer a, i.e., a = f(t, q, p1, s, p2). For p1, we try various templates such as "Let's think step by step" and report the best results. For p2, we intuitively select "As a result, the answer is" as the trigger sentence. For the "gold CoT" setting, we replace the generated reasoning steps with annotated ones, and the other procedures are the same as in "zero-shot CoT".
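The three pilot settings differ only in how the input sequence fed to TAPEX is assembled. The sketch below illustrates that protocol under stated assumptions: generate_fn stands in for the pre-trained TAPEX model, and the exact order in which the question, the triggers, and the flattened table are concatenated is an assumption rather than the authors' released code.

```python
# Minimal sketch of the three pilot settings (vanilla, zero-shot CoT, gold CoT).
# `generate_fn` is a stand-in for the pre-trained TAPEX model: it takes a flattened
# input string and returns generated text. The trigger sentences follow the paper.
from typing import Callable

P1 = "Let's think step by step"      # trigger for intermediate reasoning steps
P2 = "As a result, the answer is"    # trigger for the final answer

def vanilla_answer(generate_fn: Callable[[str], str], table: str, question: str) -> str:
    # a = f(t, q)
    return generate_fn(f"{question} {table}")

def zero_shot_cot_answer(generate_fn: Callable[[str], str], table: str, question: str) -> str:
    # s = f(t, q, p1); a = f(t, q, p1, s, p2)
    steps = generate_fn(f"{question} {P1} {table}")
    return generate_fn(f"{question} {P1} {steps} {P2} {table}")

def gold_cot_answer(generate_fn: Callable[[str], str], table: str, question: str, gold_steps: str) -> str:
    # Same as zero-shot CoT, except the generated steps are replaced with annotated ones.
    return generate_fn(f"{question} {P1} {gold_steps} {P2} {table}")
```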
Figure 2: Different paradigms for tabular mathematical reasoning.

Pre-trained TaLMs              Acc-Dev   Acc-Test
TAPEX-base (vanilla)           15.66     15.69
TAPEX-large (vanilla)          18.41     18.59
TAPEX-base (zero-shot CoT)     15.30     15.25
TAPEX-large (zero-shot CoT)    18.25     17.94
TAPEX-base (gold CoT)          40.54     39.99
TAPEX-large (gold CoT)         47.48     48.01

Table 1: Pilot experimental results of pre-trained TAPEX under different settings. "Acc-Dev" and "Acc-Test" represent accuracy on the development set and the test set, respectively.

From the results in Table 1, we can see that TAPEX in the "zero-shot CoT" setting performs even worse than the vanilla one, which shows that the small-scale TAPEX is not a decent zero-shot reasoner like LLMs and does not possess CoT reasoning ability. This is also consistent with findings from previous CoT studies (Wei et al., 2022; Ho et al., 2023). After inspecting the model outputs, we find that the pre-trained TAPEX model cannot follow the instruction to generate reasoning steps. In most cases, it directly generates the answer or illogical text. However, given the annotated "gold CoT", the model achieves a remarkable performance gain. For instance, the accuracy of TAPEX-large on the test set increases from 18.59% to 48.01%. This demonstrates that CoT reasoning steps are beneficial to TAPEX when inferring the correct answer, and it encourages us to further elicit the CoT reasoning ability of TaLMs by fine-tuning.

3 Method

Based on the observations in Section 2, we propose the TaCo framework for tabular mathematical reasoning. It includes two training stages: (i) CoT generation and (ii) answer inference, where two generative TaLMs with the same architecture are fine-tuned independently with different inputs and outputs. In this section, we introduce the framework with the TAPEX model as the selected backbone, but it should be noted that TaCo is compatible with arbitrary generative TaLMs to boost their performance. The overview of the TaCo framework is illustrated in Figure 3.

3.1 CoT Generation

In the CoT generation stage, a TAPEX model is fine-tuned to generate a solution which consists of multiple reasoning steps to solve the problem. Given an input table T with M rows {R_i}_{i=1}^{M} and N column headers {c_j}_{j=1}^{N}, TAPEX linearizes the table into a flattened text sequence T* = [HEAD] : c_1 | ... | c_N [ROW] 1 : R_1 | [ROW] 2 : R_2 | ... | R_M, where [HEAD] and [ROW] are special tokens used to indicate the region of column headers and rows, respectively. The number after [ROW] represents the row index and the vertical bar "|" separates headers or cells in different columns. For instance, the table in Figure 1 will be linearized into the following sequence:

col : Day | Number of baskets row 1 : Thursday | 49 row 2 : Friday | 48 ... row 6 : Tuesday | 49
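The linearization above amounts to simple string processing. The following sketch, written against a pandas DataFrame, is one possible implementation, not the authors' code (the paper only specifies the target format); as noted in Appendix A, pseudo headers such as "Column header 1" are inserted when a table has no header row.

```python
import pandas as pd

def linearize_table(table: pd.DataFrame) -> str:
    """Flatten a table into the TAPEX-style sequence
    'col : c1 | ... | cN row 1 : ... row M : ...'
    ([HEAD] and [ROW] are rendered as 'col' and 'row', matching the example above)."""
    headers = [str(c) if str(c).strip() else f"Column header {i + 1}"
               for i, c in enumerate(table.columns)]
    parts = ["col : " + " | ".join(headers)]
    for i, (_, row) in enumerate(table.iterrows(), start=1):
        parts.append(f"row {i} : " + " | ".join(str(v) for v in row))
    return " ".join(parts)

# Example with the table from Figure 1 (rows abbreviated):
table = pd.DataFrame({"Day": ["Thursday", "Friday", "Tuesday"],
                      "Number of baskets": [49, 48, 49]})
print(linearize_table(table))
# col : Day | Number of baskets row 1 : Thursday | 49 row 2 : Friday | 48 row 3 : Tuesday | 49
```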
The resulting sequence T* will be concatenated with the textual context, which includes a question Q and a trigger sentence P. Based on the concatenated input, the probability of generating the target solution S is computed as follows:

p(S | T*, Q, P) = ∏_{i=1}^{L} p_θ(S_i | T*, Q, P, S_{<i})    (1)

where L is the length of the target solution. We select "Let's think step by step" as the trigger sentence P since it gives the best performance in the pilot experiments.

Figure 3: Overview of the TaCo framework, with the table and the question in Figure 1 as a running example.

After generating a potential solution S̄, we find that S̄ often contains some numerical calculation errors. This is often the case with language models, because TaLMs and even LLMs are not suitable for actually solving mathematical expressions (Chen et al., 2022). Take the generated solution in Figure 3 as an example. Though the model generates plausible reasoning steps, the calculation results among these steps are all wrong (in red color), e.g., "49 + 48 + 51 + 54 + 37 + 49 = 312". Such calculation errors will accumulate to the last reasoning step and seriously mislead the answer inference model into predicting a false answer.

To mitigate the influence of calculation mistakes, we introduce an arithmetic calculator g(·) to solve mathematical expressions of "+, -, ×, ÷" in the generated solution S̄ and output the corrected solution Ŝ = g(S̄). Concretely, we extract equation strings in S̄ using regular expressions and calculate their results using the Python eval function. Since multiple equations may exist in one solution and one equation could also refer to the results of previous equations, the calculation result of each equation is propagated to the following equations by string replacement. As we can see from Figure 3, the original wrong results in S̄ are successfully fixed and replaced with correct results (in green color), e.g., "49 + 48 + 51 + 54 + 37 + 49 = 288".
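A minimal version of this correction step could look as follows. It follows the description above (regular expressions plus Python's eval, with corrected results propagated to later equations by string replacement), but the concrete equation pattern and number formatting are assumptions about details the paper does not spell out.

```python
import re

# Matches a chain such as "49 + 48 + 51 = 148" built from +, -, *, / (× and ÷ are normalized first).
EQUATION = re.compile(r"(\d+(?:\.\d+)?(?:\s*[+\-*/]\s*\d+(?:\.\d+)?)+)\s*=\s*(\d+(?:\.\d+)?)")

def fix_calculations(solution: str) -> str:
    """Recompute every 'expression = result' span in a generated solution, left to right."""
    corrected = solution.replace("×", "*").replace("÷", "/")
    pos = 0
    while True:
        match = EQUATION.search(corrected, pos)
        if match is None:
            break
        expression, old_result = match.group(1), match.group(2)
        # The expression contains only digits, spaces and + - * / by construction of the regex.
        new_result = f"{eval(expression):g}"
        if new_result != old_result:
            # Rewrite this result and every later mention of the wrong number, so that
            # follow-up equations that reuse it are evaluated on the corrected value.
            head, tail = corrected[:match.start(2)], corrected[match.start(2):]
            corrected = head + tail.replace(old_result, new_result)
        pos = match.start(2) + len(new_result)  # continue scanning after this equation
    return corrected.replace("*", "×").replace("/", "÷")  # map operators back for readability

print(fix_calculations("49 + 48 + 51 + 54 + 37 + 49 = 312. The mean is 312 ÷ 6 = 52."))
# 49 + 48 + 51 + 54 + 37 + 49 = 288. The mean is 288 ÷ 6 = 48.
```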
3.2 Answer Inference

In the answer inference stage, another TAPEX model is fine-tuned to generate the final answer based on the original input and the annotated solution S. Similar to the CoT generation stage, the probability of generating the target answer A is computed by:

p(A | T*, Q, P, S) = ∏_{i=1}^{N} p_θ(A_i | T*, Q, P, S, A_{<i})    (2)

where N is the length of the target answer. During the inference phase, the annotated solution is replaced with the corrected solution Ŝ to output the predicted answer Ā. Both the CoT generation model and the answer inference model are trained with a standard language modeling objective.
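At inference time, the two fine-tuned TaLMs and the calculator form a generate-correct-answer pipeline. The sketch below shows one way to chain them with the HuggingFace transformers API; the checkpoint paths are placeholders (the fine-tuned TaCo weights are not assumed to be public), fix_calculations refers to the calculator sketch above, and the decoding settings follow Section 4.2 (beam size 3 for CoT generation, greedy decoding for answer inference).

```python
import pandas as pd
from transformers import TapexTokenizer, BartForConditionalGeneration

TRIGGER = "Let's think step by step"

# Placeholder checkpoint paths for the two fine-tuned TaLMs.
tokenizer = TapexTokenizer.from_pretrained("microsoft/tapex-large")
cot_model = BartForConditionalGeneration.from_pretrained("path/to/taco-cot-generation")
ans_model = BartForConditionalGeneration.from_pretrained("path/to/taco-answer-inference")

def taco_predict(table: pd.DataFrame, question: str) -> str:
    # Stage 1: generate a chain-of-thought solution (beam search, beam size 3).
    enc = tokenizer(table=table, query=f"{question} {TRIGGER}",
                    return_tensors="pt", truncation=True)
    cot_ids = cot_model.generate(**enc, num_beams=3, max_new_tokens=256)
    solution = tokenizer.batch_decode(cot_ids, skip_special_tokens=True)[0]

    # Correct numerical results with the external calculator (see the sketch in Section 3.1).
    solution = fix_calculations(solution)

    # Stage 2: infer the final answer from the original input plus the corrected solution (greedy).
    enc = tokenizer(table=table, query=f"{question} {TRIGGER} {solution}",
                    return_tensors="pt", truncation=True)
    ans_ids = ans_model.generate(**enc, max_new_tokens=64)
    return tokenizer.batch_decode(ans_ids, skip_special_tokens=True)[0]
```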
4 Experiments

4.1 Dataset and Evaluation Metric

Experiments are conducted on the TABMWP (Lu et al., 2023b) dataset, a recent large-scale benchmark which is constructed from grade-level math curricula and contains 38,431 math word problems with tabular context. Besides the gold answers, TABMWP also provides detailed step-by-step solutions to the problems, which can be utilized as chain-of-thoughts to fine-tune TaLMs. There are two question types in TABMWP: 28,719 free-text questions with integer answers (INT) and decimal answers (DEC), and 9,712 multi-choice questions with extractive text answers (EXTR), Boolean text answers (BOOL) and other text answers (OTH). Statistics of each split are shown in Table 2. The test set contains 7,686 questions in total. Among them, 74.08% are INT (4,529) and DEC (1,165) questions, and the remaining 25.92% are multi-choice questions with BOOL, EXTR (987) and OTH (105) answers. Thus, INT and DEC questions are more essential for the overall accuracy. Given the predicted answer and the ground truth, we employ exact match accuracy as the metric and use the official evaluation script to evaluate the model performance.

                    Train    Dev     Test    Total
# of questions      23,059   7,686   7,686   38,431
# of free-text      17,135   5,710   5,694   28,719
# of multi-choice   5,744    1,976   1,992   9,712
# of tables         22,620   7,546   7,549   37,644
# of solutions      21,623   7,365   7,378   35,442

Table 2: Dataset statistics of TABMWP.

4.2 Implementation Details

Implementations. Our framework is implemented with PyTorch (Paszke et al., 2019). We mainly employ TAPEX (Liu et al., 2022) as the backbone TaLM in the proposed framework. We also replace TAPEX with UnifiedQA (Khashabi et al., 2020) for the ablation study. Various model sizes are included to present a more valid evaluation across different model capacities. Both the CoT generation model and the answer inference model are optimized by AdamW (Loshchilov and Hutter, 2019). We use the validation set for model selection, manually tune hyper-parameters, and evaluate the best model on the test set. For CoT generation, we adopt beam search decoding with a beam size of 3. For answer inference, we adopt greedy decoding. Hyper-parameter configurations for the best-performing models and more implementation details are shown in Table 6 and Table 7.

Baselines. (1) Pre-trained and fine-tuned language models: We develop TAPEX (Liu et al., 2022) and UnifiedQA (Khashabi et al., 2020) in both pre-trained and fine-tuned settings to predict the final answer. TAPEX is the state-of-the-art BART-based (Lewis et al., 2020) TaLM which is pre-trained on tabular data to mimic a SQL executor. UnifiedQA is a T5-based (Raffel et al., 2020) QA model which is pre-trained on 8 QA datasets of multiple formats. We consider three model sizes for UnifiedQA: small (60M), base (220M) and large (770M). Given the flattened table and the question, both TAPEX and UnifiedQA can generate the answer text autoregressively. (2) Large language models: We consider GPT-3 (Brown et al., 2020), Codex (Chen et al., 2021a) and ChatGPT with standard few-shot and zero-shot prompting. ChatGPT is based on the gpt-3.5-turbo engine. The numbers of in-context examples and the selection strategies for few-shot prompting are listed in Table 8. (3) Large language models with CoT prompting: Besides standard prompting, we also consider the above LLMs with chain-of-thought prompting. PromptPG (Lu et al., 2023b) utilizes the policy gradient method to select in-context examples for test samples when constructing the prompt for LLMs. PoT (Chen et al., 2022) proposes the "program-of-thoughts", which exploits Codex to generate text and Python programs for math computations. The generated program is executed by a program interpreter to output the final answer. The "Heuristic guess" is a baseline from the TABMWP paper. For multi-choice questions, it randomly selects one of the given options with even probabilities. For free-text questions, it randomly chooses one number from the question or the table as the prediction.

4.3 Main Results

Table 3 demonstrates the main experimental results on the TABMWP dataset. For the TAPEX, UnifiedQA and ChatGPT baselines, we report results based on our implementation. For other baselines, we report published results from the original papers (Lu et al., 2023b; Chen et al., 2022).

From the results in Table 3, we can find that: (1) With two TAPEX-large models as backbones, the TaCo framework establishes a new state-of-the-art accuracy of 92.15% on the TABMWP test set, outperforming the previous best model, ChatGPT with CoT prompting, by 9.55%, which demonstrates the effectiveness of the proposed method. Notably, compared with LLMs such as GPT-3 and Codex, the TaCo framework has far fewer parameters (0.8B), which brings lower costs for application deployments. (2) Compared with LLM-based approaches with standard few-shot prompting, fine-tuned TAPEX and UnifiedQA can achieve competitive results. For instance, the fine-tuned TAPEX-large even performs better than GPT-3 and Codex. However, when combined with CoT prompting, LLM-based methods are significantly better than fine-tuned small-scale language models, which shows that CoT prompting plays an important role in the tabular mathematical reasoning task. By contrast, the TaCo framework extends CoT reasoning into TaLMs for the first time, and improves the performance of the TAPEX-base and TAPEX-large models by 29.19% and 29.76%, respectively.
Model   Acc-Dev   Acc-Test   FREE   MC   INT   DEC   EXTR   BOOL   OTH   1-6   7-8
Heuristic baselines
Heuristic guess - 15.29 6.71 39.81 8.37 0.26 30.80 51.22 26.67 17.55 12.27
Human performance - 90.22 84.61 93.32 84.95 83.29 97.18 88.69 96.20 94.27 81.28
Pre-trained LM
TAPEX-base 15.66 15.69 7.29 39.71 8.63 2.06 34.95 47.11 20.95 18.6 11.81
TAPEX-large 18.41 18.59 8.80 46.59 10.62 1.72 46.91 48.11 30.48 22.65 13.18
UnifiedQA-small 10.71 12.18 1.18 43.62 1.37 0.43 38.7 49.78 37.14 15.57 7.65
UnifiedQA-base 12.10 14.56 4.60 43.02 5.28 1.97 37.08 50.11 38.1 17.14 11.11
UnifiedQA-large 14.00 14.06 3.37 44.63 4.02 0.86 40.53 50.22 35.24 17.21 9.87
Fine-tuned LM
TAPEX-base 57.10 56.39 48.33 79.42 56.33 17.25 90.37 67.78 76.19 65.17 44.67
TAPEX-large 62.28 62.39 55.50 82.08 64.21 21.63 96.47 65.78 77.14 71.32 50.47
UnifiedQA-small 35.79 34.82 27.99 54.32 33.94 4.89 52.99 53.89 70.48 42.23 24.93
UnifiedQA-base 51.89 51.08 42.10 76.76 49.83 12.02 89.16 63.33 75.24 59.03 40.48
UnifiedQA-large 59.35 59.26 51.62 81.12 60.68 16.39 92.20 69.44 77.14 67.11 48.80
LLM
GPT-3 (zero-shot) - 56.96 53.57 66.67 55.55 45.84 78.22 55.44 54.29 63.37 48.41
GPT-3 - 57.13 54.69 64.11 58.36 40.40 75.95 52.41 53.02 63.10 49.16
Codex - 59.40 - - - - - - - - -
ChatGPT 64.12 65.52 65.84 64.61 66.55 63.09 74.67 54.67 55.24 69.75 59.88
LLM+CoT
GPT-3 (zero-shot) - 57.61 54.36 66.92 55.82 48.67 78.82 55.67 51.43 63.62 49.59
GPT-3 - 62.92 60.76 69.09 60.04 63.58 76.49 61.19 67.30 68.62 55.31
Codex - 65.20 - - - - - - - - -
PromptPG - 68.23 66.17 74.11 64.12 74.16 76.19 72.81 65.71 71.20 64.27
Codex-SC - 75.40 - - - - - - - - -
PoT - 73.20 - - - - - - - - -
PoT-SC - 81.80 - - - - - - - - -
ChatGPT 82.49 82.60 80.89 87.50 79.36 86.87 81.86 94.00 84.76 82.68 82.51
Ours
TaCo (TAPEX-base) 86.12±0.13 85.58±0.14 85.53 85.74 85.29 86.44 93.31 77.89 81.90 87.43 83.12
TaCo (TAPEX-large) 92.91±0.17 92.15±0.13 91.69 93.47 92.54 88.41 96.05 91.44 86.67 92.37 91.86

Table 3: Accuracy (%) on the development set and test set of TABMWP. We also report detailed accuracy on
different types of questions in test set. FREE: free-text questions; MC: multi-choice questions. INT: integer answers;
DEC: decimal answers; EXTR: extractive text answers; BOOL: Boolean text answers; OTH: other text answers.
The best results are marked in bold. ± stands for standard deviation over 3 repeated experiments. If not otherwise
specified, LLM baselines are in few-shot setting. “-SC” represents using self-consistency decoding strategy (Wang
et al., 2023a).

(3) Among the different baselines, the model performance on free-text questions is obviously worse than that on multi-choice questions, with an average difference of 21%. The reason is that, compared with multi-choice questions, free-text questions usually require more complicated numerical calculations and also do not directly provide answer options in the input. Detailed evidence is presented in Appendix B. Nevertheless, from pre-trained LMs to LLM+CoT and to the proposed TaCo framework, the performance gap between the two question types gradually decreases. For instance, the accuracy gap of the TaCo (TAPEX-large) framework (1.78%) is much lower than that of fine-tuned TAPEX-large (26.58%). This shows that our method can obtain better generalization across the two types of questions. (4) Considering questions of various answer types, the TaCo framework beats other baselines on questions with integer (INT) and decimal (DEC) answers, which may result from the utilization of the external calculator. ChatGPT with CoT prompting outperforms other methods, including the human baseline, on questions with Boolean text answers, which may be attributed to its strong general semantic understanding ability, for example, judging yes/no questions based on previously generated reasoning steps. (5) Not surprisingly, all the models perform worse on questions from grades 7-8 than on those from grades 1-6 due to the increasing difficulty. Among them, the proposed framework achieves the best accuracy among all baselines on the harder questions from grades 7-8.

4.4 Ablation Study

We conduct ablation experiments to systematically investigate the effect of the external calculator, the progressive two-stage paradigm and the TaLM backbone in the TaCo framework.
QT → S → A represents the proposed two-stage paradigm, which first generates the solution S and then arrives at the final answer A based on the input question Q, table T and generated solution S. QT → SA and QT → AS represent one-stage paradigms, which generate the solution and the answer in different orders. QT → A stands for the vanilla fine-tuning paradigm that directly predicts the answer.

Settings                                          Dev     Test    Average Drop↓   FREE    MC
ours
TaCo (base)                                       86.12   85.58   -               85.53   85.74
TaCo (large)                                      92.91   92.15   -               91.69   93.47
w/o calculator
QT → S → A (base)                                 65.21   64.35   21.07           56.23   84.55
QT → S → A (large)                                75.60   74.58   17.44           67.77   93.03
w/o two-stage paradigm
QT → SA (base)                                    78.22   77.66   7.91            77.15   79.12
QT → SA (large)                                   84.73   84.25   8.04            83.95   85.14
QT → AS (base)                                    75.18   74.34   11.09           71.88   81.38
QT → AS (large)                                   81.45   81.41   11.10           80.21   84.84
w/o two-stage paradigm and calculator
QT → SA (base)                                    59.69   59.41   26.30           50.86   83.84
QT → SA (large)                                   69.57   68.85   23.32           63.79   83.33
QT → AS (base)                                    56.43   54.85   30.21           45.64   81.17
QT → AS (large)                                   63.80   63.41   28.93           56.06   84.44
w/o two-stage paradigm, calculator and solution
QT → A (base)                                     57.10   56.39   29.11           48.33   79.42
QT → A (large)                                    62.28   62.39   30.20           55.50   82.08

Table 4: Ablation study of the external calculator and the proposed two-stage paradigm. "base" and "large" stand for model sizes of the TAPEX backbone. "FREE" and "MC" are test-set accuracies on free-text and multi-choice questions.

Effect of External Calculator. As shown in Table 4, there is a drastic performance drop for the TaCo framework (e.g., 92.15% → 74.58%) when removing the external calculator. With further observation, we find that the performance decline mainly comes from free-text questions, which demand more numerical calculations. For instance, the accuracy of TaCo (TAPEX-large) plummets from 91.69% to 67.77%. This demonstrates the great significance of using the external calculator to reduce calculation errors in the generated solutions. Otherwise, the answer inference model is likely to be misled by the incorrect solution and arrive at the wrong answer.

Effect of Two-stage Paradigm. When we change the two-stage paradigm to one-stage ones, the model performance drops by about 9.5%, which reveals the contribution of the two-stage paradigm. We think it is challenging for a single small-scale TaLM to generate correct reasoning steps and the final answer simultaneously. As a result, we delegate the CoT generation and the answer inference to two TaLMs, respectively. More importantly, one-stage paradigms cannot fully utilize the corrected CoT to change the original (wrong) answer. By contrast, the two-stage paradigm brings a second chance to re-contemplate the improved reasoning steps before making the final judgement. A similar two-stage paradigm has also been explored in recent works (Press et al., 2023; Zhao et al., 2023), where one LLM generates the CoT to be improved and the same LLM is then asked to infer the final answer based on the improved CoT.

Comparing the two one-stage paradigms, we notice that QT → SA performs better than QT → AS. This shows that it may be more suitable for TaLMs to infer the final answer according to the produced reasoning steps, rather than to give explanations based on the predicted final answer. If we remove both the two-stage paradigm and the external calculator, the model performance suffers an even steeper decline. But it is still better than that of traditional fine-tuned models in the QT → A paradigm, which validates the value of intermediate reasoning steps for TaLMs.

Model             Dev     Test    FREE    MC
w/ TAPEX
TAPEX-base        86.12   85.58   85.53   85.74
TAPEX-large       92.91   92.15   91.69   93.47
w/ UnifiedQA
UnifiedQA-small   48.32   48.17   46.45   53.06
UnifiedQA-base    66.32   65.46   60.70   79.07
UnifiedQA-large   77.44   76.96   73.50   86.85
fine-tuned
UnifiedQA-small   35.79   34.82   27.99   54.32
UnifiedQA-base    51.89   51.08   42.10   76.76
UnifiedQA-large   59.35   59.26   51.62   81.12

Table 5: Experimental results of the TaCo framework with TAPEX and UnifiedQA as the backbone, respectively. "FREE" and "MC" are test-set accuracies on free-text and multi-choice questions; the "fine-tuned" rows list the vanilla fine-tuned UnifiedQA baselines for reference.

Effect of TaLM Backbone. To investigate the performance of TaCo with different backbones, we replace TAPEX with UnifiedQA as the backbone model. The related experimental results are presented in Table 5. When the backbone changes from TAPEX to UnifiedQA, the TaCo framework suffers a sharp performance drop on both free-text and multi-choice questions. For instance, even with more parameters (1.54B), the accuracy of TaCo with UnifiedQA-large on the test set (76.96%) is much lower than that with TAPEX-large (92.15%), which indicates the advantages of pre-trained tabular language models. Unlike UnifiedQA, which is solely pre-trained on unstructured textual data, TAPEX is additionally pre-trained on tabular data and thus has a better understanding of table structures. As more powerful generative TaLMs emerge, they can be integrated into the TaCo framework to improve their performance on the tabular mathematical reasoning task.
4.5 Error Analysis and Case Study

As illustrated in Figure 6, for a problem that involves two multiplications and one addition, the TaCo framework successfully generates correct intermediate reasoning chains and finally predicts the right answer.

There are 473 free-text questions (78%) and 130 multi-choice questions (22%) for which TaCo (TAPEX-large) gives wrong predictions. We randomly selected 100 questions of each type for error analysis. Figure 4 depicts the error distributions by question type. More error instances are presented and discussed in Appendix C.

For free-text questions, error cases fall into the following four categories. (1) Counting operation (49%): the question requires the model to count numbers as the final answer, which is challenging for generative language models. (2) Fraction calculation (36%): the model fails to conduct fraction-related calculations such as reducing a fraction, which may be alleviated with a more advanced calculator. (3) Wrong formula (11%): the CoT generation model outputs wrong formulas in the reasoning steps. (4) Function-related problem (4%): the model fails to solve problems related to functions, e.g., computing the slope of a function based on the table data.

For multi-choice questions, error cases can be divided into the following five types. (1) Number comparison (44%): the model cannot determine which number is larger or smaller. (2) Time calculation (21%): the model needs to perform time calculations, such as computing the elapsed time between 9:15 A.M. and 11:20 A.M. (3) Max/Min operation (19%): the question demands finding the biggest or smallest number in a group. (4) False CoT (9%): the CoT generation model gives wrong or hallucinated reasoning steps, e.g., using numbers that do not exist in the table or the question when generating formulas. (5) Commonsense (7%): commonsense knowledge is needed to answer the question, which is a weakness of small-scale language models.

Figure 4: Error distributions of different question types.

5 Related Work

CoT prompting for LLMs. By providing a few in-context examples (or demonstrations) which contain chain-of-thoughts, CoT prompting can encourage LLMs to output intermediate reasoning steps before predicting the final answer (Wei et al., 2022). Existing CoT studies mainly focus on two directions. (1) Improving the quality of CoT demonstrations, for instance, selecting better in-context examples for CoT prompting according to question diversity (Zhang et al., 2022), solution complexity (Fu et al., 2023), or example similarity (Rubin et al., 2022). (2) Exploring new representations of CoT reasoning steps. Besides the typical natural language format, researchers have also proposed chain-of-thoughts in other formats, for instance, program-of-thoughts (Chen et al., 2022), tree-of-thoughts (Yao et al., 2023a), and graph-of-thoughts (Yao et al., 2023b). Among them, CoT in program languages has emerged as a powerful approach for LLMs to invoke external tools (Qin et al., 2023). Recently, Lu et al. (2023a) proposed the Chameleon framework that augments LLMs with various tools like search engines and Python executors. We treat it as a contemporary work of our paper and list its results in Appendix D.

Pre-trained TaLMs. Inspired by the success of pre-training on natural language text, various TaLMs have been proposed for pre-training on semi-structured tabular data (Dong et al., 2022). Existing TaLMs mainly inherit the architectures of traditional language models and can be classified into three types: (1) encoder-based TaLMs like TAPAS (Herzig et al., 2020), MATE (Eisenschlos et al., 2021) and TUTA (Wang et al., 2021); (2) encoder-decoder TaLMs such as TAPEX (Liu et al., 2022) and STTP (Xing and Wan, 2021); (3) decoder-based TaLMs like TableGPT (Gong et al., 2020). In previous studies, TaLMs are usually fine-tuned to directly generate final answers or simple formulas. By contrast, we are the first to explore the combination of CoT reasoning and pre-trained TaLMs.
6 Conclusion

We extend CoT reasoning into small-scale TaLMs for the first time, and provide an effective approach for the tabular mathematical reasoning task, especially under scenarios where LLMs are not accessible. Specifically, we propose a novel framework named TaCo, which coordinates two TaLMs responsible for CoT generation and answer inference, respectively. By introducing an external calculator, we further augment TaCo with accurate math computing ability. With two TAPEX-large models as backbones, TaCo outperforms the state-of-the-art ChatGPT on the TABMWP dataset by 9.55% (82.60%→92.15%) with far fewer parameters (0.8B).

Limitations

Though the proposed method achieves great performance with fewer parameters, the fine-tuning of the CoT generation model and the answer inference model depends on annotated chain-of-thoughts and gold answers. As a result, the chain-of-thought reasoning ability of TaCo could be limited to the tabular mathematical reasoning task. In future research, one can utilize open-source LLMs to generate more diverse chain-of-thoughts for more table-related tasks (Wang et al., 2023b; Ho et al., 2023), which may further extend the generalization ability of TaLMs and reduce the cost of manual annotation.

In the aspect of external tools, compared with frameworks which enable LLMs to access various tools (Shen et al., 2023; Lu et al., 2023a), TaCo only utilizes a calculator to complete common arithmetic calculations, i.e., "+, -, ×, ÷". More advanced external tools may be integrated to enhance the capability of the framework. We believe that tool learning with small-scale language models is a valuable future direction, especially for particular scenarios where LLMs are not available.

Ethics Statement

This paper proposes a two-stage framework for the tabular mathematical reasoning task, and models are trained and evaluated on the public TABMWP dataset. Thus, the authors foresee no ethical concerns with the research in this paper.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61976207) and the National Social Science Foundation of China (No. 21AZD145).

References

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 610–623, New York, NY, USA. Association for Computing Machinery.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021a. Evaluating large language models trained on code.
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2022. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks.

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. 2021b. FinQA: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling language modeling with pathways.

Haoyu Dong, Zhoujun Cheng, Xinyi He, Mengyu Zhou, Anda Zhou, Fan Zhou, Ao Liu, Shi Han, and Dongmei Zhang. 2022. Table pre-training: A survey on model architectures, pre-training objectives, and downstream tasks. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 5426–5435. International Joint Conferences on Artificial Intelligence Organization. Survey Track.

Julian Martin Eisenschlos, Maharshi Gor, Thomas Müller, and William W. Cohen. 2021. MATE: Multi-view attention for table transformer efficiency.

Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2023. Complexity-based prompting for multi-step reasoning.

Heng Gong, Yawei Sun, Xiaocheng Feng, Bing Qin, Wei Bi, Xiaojiang Liu, and Ting Liu. 2020. TableGPT: Few-shot table-to-text generation with table structure reconstruction and content matching. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1978–1988, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Eisenschlos. 2020. TaPas: Weakly supervised table parsing via pre-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4320–4333, Online. Association for Computational Linguistics.

Namgyu Ho, Laura Schmid, and Se-Young Yun. 2023. Large language models are reasoning teachers.

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1896–1907, Online. Association for Computational Linguistics.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, and Jian-Guang Lou. 2022. TAPEX: Table pre-training via learning a neural SQL executor. In International Conference on Learning Representations.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization.

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. 2023a. Chameleon: Plug-and-play compositional reasoning with large language models.

Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. 2023b. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In International Conference on Learning Representations (ICLR).

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library.
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. 2023. Measuring and narrowing the compositionality gap in language models.

Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yuzhang Zhu, Zhenning Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao, Yuxiang Huang, Junxi Yan, Xu Han, Xian Sun, Dahai Li, Jason Phang, Cheng Yang, Tongshuang Wu, Heng Ji, Zhiyuan Liu, and Maosong Sun. 2023. Tool learning with foundation models.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer.

Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning to retrieve prompts for in-context learning.

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le. 2022. LaMDA: Language models for dialog applications.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023a. Self-consistency improves chain of thought reasoning in language models.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023b. Self-instruct: Aligning language models with self-generated instructions.

Zhiruo Wang, Haoyu Dong, Ran Jia, Jia Li, Zhiyi Fu, Shi Han, and Dongmei Zhang. 2021. TUTA: Tree-based transformers for generally structured table pre-training. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD '21, pages 1780–1790, New York, NY, USA. Association for Computing Machinery.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.

Xinyu Xing and Xiaojun Wan. 2021. Structure-aware pre-training for table-to-text generation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2273–2278, Online. Association for Computational Linguistics.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023a. Tree of thoughts: Deliberate problem solving with large language models.

Yao Yao, Zuchao Li, and Hai Zhao. 2023b. Beyond chain-of-thought, effective graph-of-thought reasoning in large language models.

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models.

Ruochen Zhao, Xingxuan Li, Shafiq Joty, Chengwei Qin, and Lidong Bing. 2023. Verify-and-edit: A knowledge-enhanced chain-of-thought framework.

Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. CoRR, abs/2105.07624.
A More Implementation Details

In our experiments, we employ TAPEX and UnifiedQA as backbones of the TaCo framework. When linearizing a table into a flattened sequence, if there are no column headers in the original table, pseudo column headers will be inserted, e.g., "Column header 1". The hyper-parameter configurations of the TAPEX and UnifiedQA backbones and their model sizes are shown in Table 6 and Table 7, respectively. Our experiments are all performed on a 32G NVIDIA V100 GPU.

For LLM-based baselines, we list the numbers of few-shot examples and selection strategies in Table 8. For the ChatGPT baseline, we randomly select 4 examples from the train set for each question type. For fair comparison, we use the same prompt format as PromptPG (Lu et al., 2023b) to construct in-context examples, which is demonstrated in Figure 5.

Figure 5: The format of in-context examples for the ChatGPT baseline (ID:19324).

Parameters                  TAPEX-base (140M)   TAPEX-large (400M)
Learning Rate               3e-5                3e-5
Batch Size                  16                  32
Weight Decay                0.01                0.01
Max Grad Norm               1.0                 1.0
Warmup                      Linear              Linear
Warmup Fraction             0.1                 0.1
Epochs for Stage 1          20                  25
Epochs for Stage 2          15                  20
Training Time for Stage 1   3 hours             8 hours
Training Time for Stage 2   2 hours             6 hours

Table 6: Hyper-parameter configurations for the TAPEX backbone.

Parameters                  UnifiedQA-small (60M)   UnifiedQA-base (220M)   UnifiedQA-large (770M)
Learning Rate               5e-5                    5e-5                    5e-5
Batch Size                  16                      16                      48
Weight Decay                0.01                    0.01                    0.01
Max Grad Norm               1.0                     1.0                     1.0
Warmup                      Linear                  Linear                  Linear
Warmup Fraction             0.1                     0.1                     0.1
Epochs for Stage 1          15                      20                      25
Epochs for Stage 2          15                      15                      20
Training Time for Stage 1   2 hours                 8 hours                 15 hours
Training Time for Stage 2   2 hours                 5 hours                 12 hours

Table 7: Hyper-parameter configurations for the UnifiedQA backbone.

Method        # few-shot examples   Selection strategy     Acc-Test
GPT-3         2                     Random selection       57.13
Codex         4                     Manual construction    59.40
GPT-3+CoT     2                     Random selection       62.92
Codex+CoT     4                     Manual construction    65.20
PromptPG      2                     Policy Gradient        68.23
PoT           4                     Manual construction    73.20
ChatGPT       4                     Random selection       65.52
ChatGPT+CoT   4                     Random selection       82.60

Table 8: Number of in-context examples and selection strategies of LLM baselines.

B The complexity of CoT generation

Table 3 reveals a significant performance difference between free-text questions and multi-choice questions. To shed more light on the TABMWP dataset, we quantitatively analyze the complexity of CoT generation for the two question types. Specifically, we compute the number of required numerical calculations in the gold CoT (including +, -, ×, ÷, counting, min, max), the number of reasoning steps (we treat each line in the gold CoT as one reasoning step for simplicity) and the length of the gold CoT. The statistical results in Table 9 demonstrate that, in the TABMWP dataset, CoT generation for free-text questions is more complex than for multi-choice questions. Based on our observations, at least 18% of multi-choice questions (mainly of the EXTR and OTH answer types) do not need numerical calculations, but almost all free-text questions need numerical calculations.
Question Types   # of numerical calculations (median/mean)   # of reasoning steps (median/mean)   Length of CoT (median/mean)
free-text        2.00/2.15                                   4.00/5.18                            196.00/239.15
multi-choice     1.00/1.78                                   2.00/3.84                            180.00/253.21

Table 9: The quantitative analysis of the complexity of the CoT generation for two question types.
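The statistics in Table 9 can be approximated with plain string processing over the gold solutions. The sketch below is a rough reimplementation under the assumptions stated in its comments (a simple operator regex, one reasoning step per non-empty line, character count as length); the exact rules used for counting "counting", "min" and "max" operations are not specified in the paper, so the keyword list is an assumption.

```python
import re
from statistics import mean, median

# Rough proxy for the analysis in Appendix B: arithmetic operators plus a few
# keywords that hint at counting/min/max operations in the gold solution text.
OP_PATTERN = re.compile(r"[+\-×÷*/]|\b(?:count|fewest|least|greatest|largest|smallest)\b", re.I)

def cot_complexity(gold_cot: str) -> dict:
    return {
        "num_calculations": len(OP_PATTERN.findall(gold_cot)),
        "num_steps": len([line for line in gold_cot.splitlines() if line.strip()]),
        "cot_length": len(gold_cot),
    }

def summarize(gold_cots: list[str]) -> dict:
    # Returns (median, mean) per statistic, mirroring the layout of Table 9.
    stats = [cot_complexity(c) for c in gold_cots]
    return {key: (median(s[key] for s in stats), mean(s[key] for s in stats))
            for key in ("num_calculations", "num_steps", "cot_length")}
```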

C Error Instances and More Analysis

In this section, we present detailed error instances to analyze the weaknesses of the TaCo framework, shown in Figure 7 to Figure 10. We find that most of the errors are caused by the limited ability of the external tool and by the representation of chain-of-thoughts. Take the error instance in Figure 7 as an example. To correctly answer the question in Figure 7, the model should find the numbers in the table which are greater than 53, and then count how many numbers are found. However, as the CoT generation model is fine-tuned to generate chain-of-thoughts in simple natural language, it is difficult for the model to describe the above process in a short and straightforward expression, which makes it hard to invoke external tools. If we could represent chain-of-thoughts in program languages like Python, the solution of this question would be much clearer. For instance, one can write a line of Python code, "Ans = Count(61,61,65,65,66,70,66,78)", and implement a Python function "Count()" as an external tool to get the accurate result. The same methodology could be applied to error instances which demand other abilities such as fraction calculation, min/max operations and time calculation. Besides, lacking commonsense knowledge also increases the difficulty for models to comprehend tables and questions, e.g., reading the bus schedule in Figure 10.
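The program-style chain-of-thought suggested above could be executed as follows. This is only an illustration of the idea, not part of the TaCo code, and Count is the hypothetical helper function named in the text.

```python
import re

def Count(*numbers):
    """Hypothetical external tool from the example above: count how many numbers are passed in."""
    return len(numbers)

def execute_program_cot(line: str):
    """Evaluate a single program-style CoT line such as
    'Ans = Count(61,61,65,65,66,70,66,78)' and return the value bound to Ans."""
    match = re.fullmatch(r"\s*Ans\s*=\s*(.+)", line)
    if match is None:
        raise ValueError(f"not a program-style CoT line: {line!r}")
    # Only expose whitelisted tools to the evaluated expression.
    return eval(match.group(1), {"__builtins__": {}}, {"Count": Count, "min": min, "max": max})

print(execute_program_cot("Ans = Count(61,61,65,65,66,70,66,78)"))  # 8
```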
D Results of Chameleon framework

Recently, Lu et al. (2023a) proposed a compositional reasoning framework named Chameleon, which treats LLMs as a natural language planner to utilize a variety of tools including vision models, web search engines, Python functions and so on. As shown in Table 10, based on the powerful GPT-4 and multiple external tools, Chameleon achieves the best accuracy of 98.78% on the TABMWP test set. However, the proposed TaCo framework still achieves a competitive result of 92.15% with fewer parameters.

We also apply the same calculator to the output of ChatGPT and use regular expressions to extract the final answer from the output. There is a slight performance increase from 82.60% to 83.07%. After inspecting the error cases of ChatGPT, we found that most errors resulted from wrong reasoning steps rather than calculation mistakes. Compared with small-scale TaLMs, the numerical calculation ability of ChatGPT is much better, which may be attributed to the potential use of more advanced external tools behind the ChatGPT system.

Method                 Acc-Test     FREE    MC
ChatGPT CoT            82.03        78.43   92.32
ChatGPT PoT            89.49        90.24   87.35
GPT-4 CoT              90.81        88.48   97.49
GPT-4 PoT              96.93        97.40   95.58
Chameleon (ChatGPT)    93.28        93.13   93.72
Chameleon (GPT-4)      98.78        98.95   98.29
TaCo (Ours)            92.15±0.13   91.69   93.47

Table 10: Accuracy of Chameleon on the TABMWP test set. "FREE" and "MC" are accuracies on free-text and multi-choice questions.
Figure 6: A correct instance (ID:752), where TaCo generates the right solution and answer.

Figure 7: An error instance of counting operation (ID:449), where TaCo cannot correctly count how many numbers satisfy the requirements.

Figure 8: An error instance of fraction calculation (ID:1711), where TaCo makes mistakes when reducing a fraction.

Figure 9: An error instance of number comparison (ID:1434), where TaCo cannot correctly judge which is the larger
number between 72.00 and 74.00.

Figure 10: An error instance of time calculation (ID:2766), where TaCo fails to compute the elapsed time between
11:00 A.M. and 12:00 P.M.

