Chain-of-Thought Reasoning
Abstract

The tabular mathematical reasoning task requires models to perform multi-step operations, including information look-up and numerical calculation, over heterogeneous data from tables and questions. Existing solutions tend to extend chain-of-thought (CoT) reasoning to powerful large language models (LLMs) to promote multi-hop mathematical reasoning. However, such LLM-based approaches can be extremely difficult to apply under privatization deployment or limited-resource scenarios. To address this problem, we revisit small-scale tabular language models (TaLMs) and extend chain-of-thought reasoning to TaLMs for the first time. Specifically, we propose a novel framework, TaCo, which coordinates two TaLMs responsible for CoT generation and answer inference, respectively. Besides, our framework can be combined with an external calculator to enhance accurate numerical calculation. On the TABMWP dataset, TaCo outperforms the state-of-the-art ChatGPT by 9.55% (82.60%→92.15% in accuracy) with far fewer parameters (0.8B).¹

¹The code will be released at https://github.com/SpursGoZmy/TaCo
†This work was done during an internship at Baidu Inc.
‡Corresponding author: Zheng Lin.

Figure 1: An example from the TABMWP dataset. To solve the problem, the model needs to perform multi-step mathematical reasoning based on the table and the question.
1 Introduction

The tabular mathematical reasoning task aims at answering math questions based on heterogeneous tabular and textual data, which can provide users with insights from tables containing valuable figures (Lu et al., 2023b; Zhu et al., 2021; Chen et al., 2021b). This task highlights the demand for multi-step mathematical reasoning, including information look-up and numerical calculation. For example, given the table and the question in Figure 1, we first need to count how many numbers are in the table, then add all the numbers together to get the sum of baskets, and finally divide the sum by the count to obtain the mean.

Considering the inherent demand for multi-step operations, existing studies tend to extend chain-of-thought (CoT) reasoning (Wei et al., 2022; Wang et al., 2023a; Kojima et al., 2022; Zhang et al., 2022) to powerful Large Language Models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; Thoppilan et al., 2022; Chen et al., 2021a) to promote multi-hop mathematical reasoning. As depicted in Figure 2 (b), this paradigm prompts LLMs with several in-context examples containing CoT demonstrations to elicit intermediate reasoning steps before inferring the final answer.

Though the combination of LLMs and CoT has achieved great performance, such LLM-based methods may not be feasible in some real-world scenarios. For instance, it is financially expensive to satisfy the high computational requirements, storage capacity, and bandwidth demanded by LLMs, which makes it a challenge for individual users or small organizations to utilize LLMs in their applications (Strubell et al., 2019; Bender et al., 2021).
In consideration of data security, enterprises may also seek privatization deployments where private data is not allowed to be processed by third-party LLM APIs. What's more, despite the fact that many pre-trained tabular language models have been developed (Liu et al., 2022; Herzig et al., 2020; Wang et al., 2021; Dong et al., 2022), their CoT reasoning ability has not been thoroughly investigated and could be inadequate for solving the tabular mathematical reasoning task. As a result, an alternative approach, with lower costs and competitive CoT reasoning ability, is needed.

To accomplish this goal, we revisit small-scale tabular language models (TaLMs) and explore chain-of-thought reasoning in TaLMs for the first time. Specifically, we propose a novel framework named TaCo, which coordinates two TaLMs that are responsible for CoT generation and answer inference, respectively. Given the input table and question, the first TaLM is fine-tuned to generate intermediate reasoning steps. Based on the original input and the generated reasoning steps, the second TaLM is fine-tuned to infer the final answer. To alleviate the weakness of TaLMs in solving mathematical expressions, TaCo is also combined with an external calculator, which performs math calculations and fixes incorrect results in the output reasoning steps.
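The calculator is not further specified in this excerpt; the following is a minimal sketch of the post-processing idea described above, assuming the generated steps contain plain binary expressions such as "49 + 48 = 95". The function name and the regular expression are illustrative, not taken from the released TaCo code.

import re

# Matches simple binary expressions like "49 + 48 = 95" inside a reasoning step.
EXPR = re.compile(r"(\d+(?:\.\d+)?)\s*([+\-*/×÷])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)")

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "×": lambda a, b: a * b,
       "/": lambda a, b: a / b, "÷": lambda a, b: a / b}

def fix_calculations(step: str) -> str:
    """Recompute every 'a op b = c' expression and overwrite a wrong right-hand side."""
    def repair(match):
        a, op, b = float(match.group(1)), match.group(2), float(match.group(3))
        value = OPS[op](a, b)
        text = str(int(value)) if value == int(value) else f"{value:g}"
        return f"{match.group(1)} {op} {match.group(3)} = {text}"
    return EXPR.sub(repair, step)

# A generated step with a wrong sum is corrected in place:
print(fix_calculations("There are 49 + 48 = 95 baskets in total."))
# -> There are 49 + 48 = 97 baskets in total.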
To verify the effectiveness of the proposed method, we conduct comprehensive experiments on the TABMWP dataset (Lu et al., 2023b), the latest math word problem benchmark over tabular data, which provides detailed chain-of-thoughts that solve each problem step by step. Experimental results reveal that TaCo opens a new and promising paradigm for tabular mathematical reasoning, which is illustrated in Figure 2 (c). Compared with traditional fine-tuned TaLMs, TaCo improves the accuracy of the recent TAPEX model by 29.76%. Compared with LLM-based approaches, TaCo outperforms the state-of-the-art ChatGPT by 9.55% (82.60%→92.15%) with far fewer parameters (0.8B). Moreover, we conduct ablation studies to analyze the contributions of different parts of the framework. A detailed error analysis is also performed to provide insights for future improvements.

To summarize, our contributions are as follows:

• To the best of our knowledge, we explore chain-of-thought reasoning in TaLMs for the first time, and advocate a new and promising paradigm for tabular mathematical reasoning, especially under scenarios where LLM-based methods are not feasible.

• We propose a novel framework, TaCo, which coordinates two TaLMs responsible for CoT generation and answer inference, respectively. It is also integrated with a calculator to enhance accurate numerical calculation.

• Our method boosts the performance of small-scale TaLMs and surpasses the state-of-the-art ChatGPT by 9.55% on the TABMWP benchmark with far fewer parameters (0.8B).

Figure 2: Different paradigms for tabular mathematical reasoning.

2 Pilot Experiment

Before diving into the specific method, we present a pilot experiment on the TABMWP dataset to answer two important questions: (i) Do existing pre-trained generative TaLMs possess chain-of-thought reasoning ability? (ii) Can generative TaLMs benefit from chain-of-thoughts when predicting the final answer? We select the state-of-the-art TAPEX model (Liu et al., 2022) for experiments, which is based on the encoder-decoder language model BART (Lewis et al., 2020) and is additionally pre-trained on tabular data. We consider two model sizes: TAPEX-base (140M) and TAPEX-large (400M).
Experiments are conducted in three different settings, i.e., vanilla, zero-shot CoT, and gold CoT. For the "vanilla" setting, the pre-trained TAPEX model f(·) autoregressively generates the answer a based on the table t and the question q, i.e., a = f(t, q). For the "zero-shot CoT" setting, we follow Kojima et al. (2022) to evaluate the CoT reasoning of TAPEX. Specifically, a trigger sentence p1 is appended to the question in order to ask TAPEX to output intermediate reasoning steps s, i.e., s = f(t, q, p1). Then, given the original input and the generated CoT, another trigger sentence p2 is appended to make TAPEX output the final answer a, i.e., a = f(t, q, p1, s, p2). For p1, we try various templates such as "Let's think step by step" and report the best results. For p2, we intuitively select "As a result, the answer is" as the trigger sentence. For the "gold CoT" setting, we replace the generated reasoning steps with annotated ones; the other procedures are the same as in "zero-shot CoT".
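For concreteness, the three settings can be expressed as follows, assuming a single text-to-text interface generate(text) -> str around the pre-trained TAPEX model; the exact way the trigger sentences, table, and question are concatenated is our assumption, since the paper only gives the functional forms a = f(t, q), s = f(t, q, p1), and a = f(t, q, p1, s, p2).

# Sketch of the three pilot settings; `generate` wraps the pre-trained TAPEX model
# and the concatenation format is illustrative.
P1 = "Let's think step by step"      # one of the tried trigger templates for reasoning steps
P2 = "As a result, the answer is"    # trigger for the final answer

def vanilla(generate, t: str, q: str) -> str:
    # a = f(t, q)
    return generate(f"{q} {t}")

def zero_shot_cot(generate, t: str, q: str) -> tuple[str, str]:
    # s = f(t, q, p1), then a = f(t, q, p1, s, p2)
    s = generate(f"{q} {P1} {t}")
    a = generate(f"{q} {P1} {s} {P2} {t}")
    return s, a

def gold_cot(generate, t: str, q: str, gold_s: str) -> str:
    # Same as zero-shot CoT, with the generated steps replaced by annotated ones.
    return generate(f"{q} {P1} {gold_s} {P2} {t}")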
Pre-trained TaLMs              Acc-Dev   Acc-Test
TAPEX-base (vanilla)           15.66     15.69
TAPEX-large (vanilla)          18.41     18.59
TAPEX-base (zero-shot CoT)     15.30     15.25
TAPEX-large (zero-shot CoT)    18.25     17.94
TAPEX-base (gold CoT)          40.54     39.99
TAPEX-large (gold CoT)         47.48     48.01

Table 1: Pilot experimental results of pre-trained TAPEX under different settings. "Acc-Dev" and "Acc-Test" denote accuracy on the development set and the test set, respectively.
From the results in Table 1, we can see that TAPEX in the "zero-shot CoT" setting performs even worse than the vanilla one, which shows that the small-scale TAPEX is not a decent zero-shot reasoner like LLMs and does not possess CoT reasoning ability. This is also consistent with findings from previous CoT studies (Wei et al., 2022; Ho et al., 2023). After inspecting the model outputs, we find that the pre-trained TAPEX model cannot follow the instruction to generate reasoning steps. In most cases, it directly generates the answer or illogical text. However, given the annotated "gold CoT", the model achieves a remarkable performance gain. For instance, the accuracy of TAPEX-large on the test set increases from 18.59% to 48.01%. This demonstrates that CoT reasoning steps are beneficial to TAPEX when inferring the correct answer, and it encourages us to further elicit the CoT reasoning ability of TaLMs by fine-tuning.

The TaCo framework consists of two stages, (i) CoT generation and (ii) answer inference, where two generative TaLMs with the same architecture are fine-tuned independently with different inputs and outputs. In this section, we introduce the framework with the TAPEX model as the selected backbone, but it should be noted that TaCo is compatible with arbitrary generative TaLMs to boost their performance. The overview of the TaCo framework is illustrated in Figure 3.
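Figure 3 is not reproduced in this excerpt; the sketch below summarizes the two-stage inference flow it depicts, assuming each fine-tuned TaLM is wrapped as a callable model(text) -> str. The wrapper names are ours, and fix_calculations refers to the calculator sketch shown in the introduction.

def taco_inference(cot_model, answer_model, table: str, question: str) -> tuple[str, str]:
    # Stage 1: the first TaLM generates a multi-step solution from the table and question.
    solution = cot_model(f"{question} {table}")
    # The external calculator recomputes the arithmetic inside the generated steps.
    solution = fix_calculations(solution)
    # Stage 2: the second TaLM infers the final answer from the original input plus the solution.
    answer = answer_model(f"{question} {solution} {table}")
    return solution, answer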
3.1 CoT Generation

In the CoT generation stage, a TAPEX model is fine-tuned to generate a solution which consists of multiple reasoning steps to solve the problem. Given an input table T with M rows {R_1, ..., R_M} and N column headers {c_1, ..., c_N}, TAPEX linearizes the table into a flattened text sequence T* = [HEAD] : c_1 | ... | c_N [ROW] 1 : R_1 | [ROW] 2 : R_2 | ... | R_M, where [HEAD] and [ROW] are special tokens used to indicate the region of column headers and rows, respectively. The number after [ROW] represents the row index, and the vertical bar "|" separates headers or cells in different columns. For instance, the table in Figure 1 will be linearized into the following sequence:

col : Day | Number of baskets row 1 : Thursday | 49 row 2 : Friday | 48 ... row 6 : Tuesday | 49
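The linearization can be implemented directly; the short function below reproduces the textual form of the example sequence above (the function name and the list-based table representation are our own choices, with the [HEAD]/[ROW] special tokens rendered as the "col :"/"row i :" strings used in the example).

def linearize_table(headers: list[str], rows: list[list[str]]) -> str:
    """Flatten a table into the TAPEX-style sequence illustrated above."""
    parts = ["col : " + " | ".join(headers)]
    for i, row in enumerate(rows, start=1):
        parts.append(f"row {i} : " + " | ".join(row))
    return " ".join(parts)

# The table from Figure 1; only the rows quoted in the text are spelled out here.
print(linearize_table(["Day", "Number of baskets"],
                      [["Thursday", "49"], ["Friday", "48"]]))
# -> col : Day | Number of baskets row 1 : Thursday | 49 row 2 : Friday | 48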
Table 3: Accuracy (%) on the development set and test set of TABMWP. We also report detailed accuracy on different types of questions in the test set. FREE: free-text questions; MC: multi-choice questions. INT: integer answers; DEC: decimal answers; EXTR: extractive text answers; BOOL: Boolean text answers; OTH: other text answers. The best results are marked in bold. ± stands for the standard deviation over 3 repeated experiments. If not otherwise specified, LLM baselines are in the few-shot setting. "-SC" denotes the self-consistency decoding strategy (Wang et al., 2023a).
(3) Among different baselines, model performance on free-text questions is obviously worse than that on multi-choice questions, with an average difference of 21%. The reason is that, compared with multi-choice questions, free-text questions usually require more complicated numerical calculations and do not directly provide answer options in the input. Detailed evidence is presented in Appendix B. Nevertheless, from pre-trained LMs to LLM+CoT and to the proposed TaCo framework, the performance gap between the two question types gradually decreases. For instance, the accuracy gap of the TaCo (TAPEX-large) framework (1.78%) is much lower than that of fine-tuned TAPEX-large (26.58%). This shows that our method obtains better generalization across the two question types.
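For reference, the 1.78% gap quoted for TaCo (TAPEX-large) is simply the difference between its multi-choice and free-text accuracies on the test set; the same per-type numbers also appear in Tables 4 and 5:

\mathrm{gap} = \mathrm{Acc}_{\mathrm{MC}} - \mathrm{Acc}_{\mathrm{FREE}} = 93.47\% - 91.69\% = 1.78\%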
(4) Considering questions of various answer types, the TaCo framework beats the other baselines on questions with integer (INT) and decimal (DEC) answers, which may result from the utilization of the external calculator. ChatGPT with CoT prompting outperforms other methods, including the human baseline, on questions with Boolean text answers, which may be attributed to its strong general semantic understanding ability, for example, when judging yes/no questions based on previously generated reasoning steps. (5) Not surprisingly, all models perform worse on questions from grades 7-8 than on those from grades 1-6 due to the increasing difficulty. Among them, the proposed framework achieves the best accuracy among all baselines on the harder questions from grades 7-8.

4.4 Ablation Study

We conduct ablation experiments to systematically investigate the effect of the external calculator, the progressive two-stage paradigm, and the TaLM backbone.
Settings                                          Dev     Test    Average Drop↓   FREE    MC
ours
  TaCo (base)                                     86.12   85.58   -               85.53   85.74
  TaCo (large)                                    92.91   92.15   -               91.69   93.47
w/o calculator
  QT → S → A (base)                               65.21   64.35   21.07           56.23   84.55
  QT → S → A (large)                              75.60   74.58   17.44           67.77   93.03
w/o two-stage paradigm
  QT → SA (base)                                  78.22   77.66   7.91            77.15   79.12
  QT → SA (large)                                 84.73   84.25   8.04            83.95   85.14
  QT → AS (base)                                  75.18   74.34   11.09           71.88   81.38
  QT → AS (large)                                 81.45   81.41   11.10           80.21   84.84
w/o two-stage paradigm and calculator
  QT → SA (base)                                  59.69   59.41   26.30           50.86   83.84
  QT → SA (large)                                 69.57   68.85   23.32           63.79   83.33
  QT → AS (base)                                  56.43   54.85   30.21           45.64   81.17
  QT → AS (large)                                 63.80   63.41   28.93           56.06   84.44
w/o two-stage paradigm, calculator and solution
  QT → A (base)                                   57.10   56.39   29.11           48.33   79.42
  QT → A (large)                                  62.28   62.39   30.20           55.50   82.08

Table 4: Ablation results of the TaCo framework on TABMWP (FREE: free-text questions; MC: multi-choice questions).
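The "Average Drop↓" column is consistent with the mean accuracy drop of an ablated setting relative to the full TaCo model, averaged over the Dev and Test sets; for example, for QT → S → A (base):

\mathrm{Drop} = \tfrac{1}{2}\big[(86.12 - 65.21) + (85.58 - 64.35)\big] = \tfrac{1}{2}(20.91 + 21.23) = 21.07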
Model                 Dev     Test    FREE    MC
w/ TAPEX
  TAPEX-base          86.12   85.58   85.53   85.74
  TAPEX-large         92.91   92.15   91.69   93.47
w/ UnifiedQA
  UnifiedQA-small     48.32   48.17   46.45   53.06
  UnifiedQA-base      66.32   65.46   60.70   79.07
  UnifiedQA-large     77.44   76.96   73.50   86.85
fine-tuned
  UnifiedQA-small     35.79   34.82   27.99   54.32
  UnifiedQA-base      51.89   51.08   42.10   76.76
  UnifiedQA-large     59.35   59.26   51.62   81.12

Table 5: Experiment results of the TaCo framework with TAPEX and UnifiedQA as backbone, respectively.
TAPEX is additionally pre-trained on tabular data and thus has a better understanding of table structures. As more powerful generative TaLMs emerge, they can be integrated into the TaCo framework to further improve performance on the tabular mathematical reasoning task.

4.5 Error Analysis and Case Study

As illustrated in Figure 6, for a problem that involves two multiplications and one addition, the TaCo framework successfully generates the correct intermediate reasoning chain and finally predicts the right answer.

There are 473 free-text questions (78%) and 130 multi-choice questions (22%) for which TaCo (TAPEX-large) gives wrong predictions. We randomly selected 100 questions of each type for error analysis. Figure 4 depicts the error distributions by question type. More error instances are presented and discussed in Appendix C.

For free-text questions, error cases fall into the following four categories. (1) Counting operation (49%): the question requires the model to count numbers as the final answer, which is challenging for generative language models. (2) Fraction calculation (36%): the model fails to conduct fraction-related calculations such as reducing a fraction, which may be alleviated with an advanced calculator. (3) Wrong formula (11%): the CoT generation model outputs wrong formulas in the reasoning steps. (4) Function-related problems (4%): the model fails to solve problems that involve functions.

5 Related Work

CoT prompting for LLMs. By providing a few in-context examples (or demonstrations) which contain chain-of-thoughts, CoT prompting can encourage LLMs to output intermediate reasoning steps before predicting the final answer (Wei et al., 2022). Existing CoT studies mainly focus on two directions. (1) Improving the quality of CoT demonstrations, for instance, selecting better in-context examples for CoT prompting according to question diversity (Zhang et al., 2022), solution complexity (Fu et al., 2023), or example similarity (Rubin et al., 2022). (2) Exploring new representations of CoT reasoning steps. Besides the typical natural language format, researchers have also proposed chain-of-thoughts in other formats, for instance, program-of-thoughts (Chen et al., 2022), tree-of-thoughts (Yao et al., 2023a), and graph-of-thoughts (Yao et al., 2023b). Among them, CoT in programming languages has emerged as a powerful approach for LLMs to invoke external tools (Qin et al., 2023). Recently, Lu et al. (2023a) proposed the Chameleon framework, which augments LLMs with various tools like search engines and Python executors. We treat it as contemporaneous work and list its results in Appendix D.

Pre-trained TaLMs. Inspired by the success of pre-training on natural language text, various TaLMs have been proposed for pre-training on semi-structured tabular data (Dong et al., 2022). Existing TaLMs mainly inherit the architectures of traditional language models and can be classified into three types.
(1) Encoder-based TaLMs, like TAPAS (Herzig et al., 2020), MATE (Eisenschlos et al., 2021), and TUTA (Wang et al., 2021). (2) Encoder-decoder TaLMs, such as TAPEX (Liu et al., 2022) and STTP (Xing and Wan, 2021). (3) Decoder-based TaLMs, like TableGPT (Gong et al., 2020). In previous studies, TaLMs are usually fine-tuned to directly generate final answers or simple formulas. By contrast, we are the first to explore the combination of CoT reasoning and pre-trained TaLMs.

6 Conclusion

We extend CoT reasoning to small-scale TaLMs for the first time, and provide an effective approach for the tabular mathematical reasoning task, especially under scenarios where LLMs are not accessible. Specifically, we propose a novel framework named TaCo, which coordinates two TaLMs responsible for CoT generation and answer inference, respectively. By introducing an external calculator, we further augment TaCo with accurate math computing ability. With two TAPEX-large models as backbones, TaCo outperforms the state-of-the-art ChatGPT on the TABMWP dataset by 9.55% (82.60%→92.15%) with far fewer parameters (0.8B).

Limitations

Though the proposed method achieves great performance with fewer parameters, the fine-tuning of the CoT generation model and the answer inference model depends on annotated chain-of-thoughts and gold answers. As a result, the chain-of-thought reasoning ability of TaCo could be limited to the tabular mathematical reasoning task. In future research, one can utilize open-source LLMs to generate more diverse chain-of-thoughts for more table-related tasks (Wang et al., 2023b; Ho et al., 2023), which may further extend the generalization ability of TaLMs and reduce the cost of manual annotation.

Regarding external tools, compared with frameworks that enable LLMs to access various tools (Shen et al., 2023; Lu et al., 2023a), TaCo only utilizes a calculator to complete common arithmetic calculations, i.e., "+, -, ×, ÷". More advanced external tools may be integrated to enhance the capability of the framework. We believe that tool learning with small-scale language models is a valuable future direction, especially for scenarios where LLMs are not available.

Ethics Statement

This paper proposes a two-stage framework for the tabular mathematical reasoning task, and models are trained and evaluated on the public TABMWP dataset. Thus, the authors foresee no ethical concerns with the research in this paper.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61976207) and the National Social Science Foundation of China (No. 21AZD145).

References

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 610–623, New York, NY, USA. Association for Computing Machinery.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021a. Evaluating large language models trained on code.
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. 2022. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks.

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. 2021b. FinQA: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. Palm: Scaling language modeling with pathways.

Haoyu Dong, Zhoujun Cheng, Xinyi He, Mengyu Zhou, Anda Zhou, Fan Zhou, Ao Liu, Shi Han, and Dongmei Zhang. 2022. Table pre-training: A survey on model architectures, pre-training objectives, and downstream tasks. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 5426–5435. International Joint Conferences on Artificial Intelligence Organization. Survey Track.

Julian Martin Eisenschlos, Maharshi Gor, Thomas Müller, and William W. Cohen. 2021. Mate: Multi-view attention for table transformer efficiency.

Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2023. Complexity-based prompting for multi-step reasoning.

Heng Gong, Yawei Sun, Xiaocheng Feng, Bing Qin, Wei Bi, Xiaojiang Liu, and Ting Liu. 2020. TableGPT: Few-shot table-to-text generation with table structure reconstruction and content matching. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1978–1988, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Eisenschlos. 2020. TaPas: Weakly supervised table parsing via pre-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4320–4333, Online. Association for Computational Linguistics.

Namgyu Ho, Laura Schmid, and Se-Young Yun. 2023. Large language models are reasoning teachers.

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1896–1907, Online. Association for Computational Linguistics.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, and Jian-Guang Lou. 2022. TAPEX: Table pre-training via learning a neural SQL executor. In International Conference on Learning Representations.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization.

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. 2023a. Chameleon: Plug-and-play compositional reasoning with large language models.

Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. 2023b. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In International Conference on Learning Representations (ICLR).

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Pytorch: An imperative style, high-performance deep learning library.
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. 2023. Measuring and narrowing the compositionality gap in language models.

Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yuzhang Zhu, Zhenning Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao, Yuxiang Huang, Junxi Yan, Xu Han, Xian Sun, Dahai Li, Jason Phang, Cheng Yang, Tongshuang Wu, Heng Ji, Zhiyuan Liu, and Maosong Sun. 2023. Tool learning with foundation models.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer.

Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning to retrieve prompts for in-context learning.

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in nlp.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le. 2022. Lamda: Language models for dialog applications.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023a. Self-consistency improves chain of thought reasoning in language models.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023b. Self-instruct: Aligning language models with self-generated instructions.

Zhiruo Wang, Haoyu Dong, Ran Jia, Jia Li, Zhiyi Fu, Shi Han, and Dongmei Zhang. 2021. Tuta: Tree-based transformers for generally structured table pre-training. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD '21, pages 1780–1790, New York, NY, USA. Association for Computing Machinery.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.

Xinyu Xing and Xiaojun Wan. 2021. Structure-aware pre-training for table-to-text generation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2273–2278, Online. Association for Computational Linguistics.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023a. Tree of thoughts: Deliberate problem solving with large language models.

Yao Yao, Zuchao Li, and Hai Zhao. 2023b. Beyond chain-of-thought, effective graph-of-thought reasoning in large language models.

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models.

Ruochen Zhao, Xingxuan Li, Shafiq Joty, Chengwei Qin, and Lidong Bing. 2023. Verify-and-edit: A knowledge-enhanced chain-of-thought framework.

Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. CoRR, abs/2105.07624.
A More Implementation Details

In our experiments, we employ TAPEX and UnifiedQA as backbones of the TaCo framework. When linearizing a table into a flattened sequence, if there are no column headers in the original table, pseudo column headers are inserted, e.g., "Column header 1". The hyper-parameter configurations of the TAPEX and UnifiedQA backbones and their model sizes are shown in Table 6 and Table 7, respectively. Our experiments are all performed on a 32G NVIDIA V100 GPU.

For LLM-based baselines, we list the number of few-shot examples and the selection strategy in Table 8. For the ChatGPT baseline, we randomly select 4 examples from the training set for each question type. For a fair comparison, we use the same prompt format as PromptPG (Lu et al., 2023b) to construct in-context examples, which is demonstrated in Figure 5.

Parameters                   TAPEX base (140M)   TAPEX large (400M)
Learning Rate                3e-5                3e-5
Batch Size                   16                  32
Weight Decay                 0.01                0.01
Max Grad Norm                1.0                 1.0
Warmup                       Linear              Linear
Warmup Fraction              0.1                 0.1
Epochs for Stage 1           20                  25
Epochs for Stage 2           15                  20
Training Time for Stage 1    3 hours             8 hours
Training Time for Stage 2    2 hours             6 hours

Table 6: Hyper-parameter configurations for the TAPEX backbone.

Parameters                   UnifiedQA small (60M)   UnifiedQA base (220M)   UnifiedQA large (770M)
Learning Rate                5e-5                    5e-5                    5e-5
Batch Size                   16                      16                      48
Weight Decay                 0.01                    0.01                    0.01
Max Grad Norm                1.0                     1.0                     1.0
Warmup                       Linear                  Linear                  Linear
Warmup Fraction              0.1                     0.1                     0.1
Epochs for Stage 1           15                      20                      25
Epochs for Stage 2           15                      15                      20
Training Time for Stage 1    2 hours                 8 hours                 15 hours
Training Time for Stage 2    2 hours                 5 hours                 12 hours

Table 7: Hyper-parameter configurations for the UnifiedQA backbone.
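The optimizer and training framework are not named in this appendix (AdamW and PyTorch appear in the references); the snippet below shows one possible realization of the TAPEX-base settings from Table 6, with the library choices and the steps-per-epoch placeholder being our assumptions.

import torch
from transformers import BartForConditionalGeneration, get_linear_schedule_with_warmup

# Assumed stack: PyTorch + Hugging Face Transformers; TAPEX is a BART-based model.
model = BartForConditionalGeneration.from_pretrained("microsoft/tapex-base")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)

steps_per_epoch = 1000                       # placeholder; depends on dataset size and batch size (16)
num_training_steps = steps_per_epoch * 20    # 20 epochs for stage 1 (Table 6)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # linear warmup, fraction 0.1
    num_training_steps=num_training_steps,
)

# Inside the training loop, gradients are clipped to the maximum norm of 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)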
languages like Python, the solution of this ques-
tion would be much more clear. For instance,
Method
# few-shot Selection
Acc-Test one can write a line of Python code: “Ans =
examples strategy
GPT-3 2 Random selection 57.13 Count(61,61,65,65,66,70,66,78)”, and imple-
Codex 4 Manual construction 59.40 ment a Python function “Count()” as an external
GPT-3+CoT 2 Random selection 62.92
Codex+CoT 4 Manual construction 65.20 tool to get the accurate result. The same method-
PromptPG 2 Policy Gradient 68.23
PoT 4 Manual construction 73.20 ology could be applied to error instances which
ChatGPT 4 Random selection 65.52 demand other abilities such as fraction calculation,
ChatGPT+CoT 4 Random selection 82.60
min/max operation and time calculation. Besides,
Table 8: Number of in-context examples and selection lacking commonsense knowledge also increases
strategies of LLM baselines. the difficulty for models to comprehend tables and
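A rough sketch of these statistics, assuming the gold CoT is plain text with one reasoning step per line; the counting rules below are our simplification of the description (e.g., "-" will also match hyphens), and the length is measured in whitespace-separated tokens.

import re

CALC = re.compile(r"[+\-×÷*/]|\bcount\b|\bmin\b|\bmax\b", re.IGNORECASE)

def cot_complexity(gold_cot: str) -> dict:
    """Rough complexity statistics for a single gold chain-of-thought."""
    steps = [line for line in gold_cot.splitlines() if line.strip()]
    return {
        "num_calculations": len(CALC.findall(gold_cot)),  # +, -, ×, ÷, counting, min, max
        "num_steps": len(steps),                          # each non-empty line is one step
        "length": len(gold_cot.split()),                  # whitespace tokens
    }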
C Error Instances and More Analysis

In this section, we present detailed error instances to analyze the weaknesses of the TaCo framework, shown in Figure 7 to Figure 10. We find that most errors are caused by the limited ability of the external tool and by the natural-language representation of chain-of-thoughts. Take the error instance in Figure 7 as an example. To correctly answer the question in Figure 7, the model should find the numbers in the table which are greater than 53 and then count how many numbers are found. However, as the CoT generation model is fine-tuned to generate chain-of-thoughts in simple natural language, it is difficult for the model to describe this process in a short and straightforward expression, which makes it hard to invoke external tools. If we could represent chain-of-thoughts in programming languages like Python, the solution of this question would be much clearer. For instance, one can write a line of Python code, "Ans = Count(61,61,65,65,66,70,66,78)", and implement a Python function "Count()" as an external tool to get the accurate result. The same methodology could be applied to error instances which demand other abilities such as fraction calculation, min/max operations, and time calculation. Besides, lacking commonsense knowledge also increases the difficulty for models to comprehend tables and questions, e.g., reading the bus schedule in Figure 10.
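The suggested program-of-thought style can be made concrete as follows; Count is the hypothetical tool function named above, and the numbers are those quoted for the Figure 7 example.

def Count(*numbers: float) -> int:
    """Hypothetical external tool: count how many values were passed in."""
    return len(numbers)

# The single program line suggested in the text for the Figure 7 error case:
Ans = Count(61, 61, 65, 65, 66, 70, 66, 78)
print(Ans)  # -> 8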
Figure 5: The format of in-context examples for ChatGPT baseline (ID:19324).
Table 9: The quantitative analysis of the complexity of the CoT generation for two question types.
Figure 6: A correct instance where TaCo generates the right solution and answer (ID:752).
Figure 7: An error instance of a counting operation (ID:449), where TaCo cannot correctly count how many numbers satisfy the requirement.
Figure 8: An error instance of fraction calculation (ID:1711), where TaCo makes mistakes when reducing a fraction.
Figure 9: An error instance of number comparison (ID:1434), where TaCo cannot correctly judge which is the larger
number between 72.00 and 74.00.
Figure 10: An error instance of time calculation (ID:2766), where TaCo fails to compute the elapsed time between
11:00 A.M. and 12:00 P.M.