
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Cheng-Yu Hsieh^1*, Chun-Liang Li^2, Chih-Kuan Yeh^3, Hootan Nakhost^2, Yasuhisa Fujii^3, Alexander Ratner^1, Ranjay Krishna^1, Chen-Yu Lee^2, Tomas Pfister^2

^1 University of Washington, ^2 Google Cloud AI Research, ^3 Google Research

[email protected]

arXiv:2305.02301v2 [cs.CL] 5 Jul 2023

Abstract

Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so while requiring less training data than finetuning or distillation. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with far fewer labeled/unlabeled training examples. Second, compared to few-shot prompted LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our finetuned 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of the available data on a benchmark, whereas standard finetuning the same T5 model struggles to match even by using 100% of the dataset.[1]

* Work done while the author was a student researcher at Google Cloud AI Research.
[1] Source code is available at: https://github.com/google-research/distilling-step-by-step.

Figure 1: While large language models (LLMs) offer strong zero/few-shot performance, they are challenging to serve in practice. Traditional ways of training small task-specific models, on the other hand, require large amounts of training data. We propose Distilling step-by-step, a new paradigm that extracts rationales from LLMs as informative task knowledge into training small models, which reduces both the deployed model size as well as the data required for training.

1 Introduction

Despite the impressive few-shot ability offered by large language models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; Thoppilan et al., 2022; Hoffmann et al., 2022; Smith et al., 2022b; Zhang et al., 2022), these models are challenging to deploy in real world applications due to their sheer size. Serving a single 175 billion parameter LLM requires at least 350GB of GPU memory using specialized infrastructure (Zheng et al., 2022). To make matters worse, today's state-of-the-art LLMs are composed of over 500B parameters (Chowdhery et al., 2022), requiring significantly more memory and compute. Such computational requirements are far beyond affordable for most product teams, especially for applications that require low latency performance.

To circumvent these deployment challenges of large models, practitioners often choose to deploy smaller specialized models instead. These smaller models are trained using one of two common paradigms: finetuning or distillation. Finetuning updates a pretrained smaller model (e.g., BERT (Devlin et al., 2018) or T5 (Raffel et al., 2020)) using downstream human annotated data (Howard and Ruder, 2018). Distillation trains the same smaller models with labels generated by a larger LLM (Tang et al., 2019; Wang et al., 2021; Smith et al., 2022a; Arora et al., 2022). Unfortunately, these paradigms reduce model size at a cost: to achieve comparable performance to LLMs, finetuning requires expensive human labels, and distillation requires large amounts of unlabeled data which can be hard to obtain (Tang et al., 2019; Liang et al., 2020).

In this work, we introduce Distilling step-by-step, a new simple mechanism for training smaller models with less training data. Our mechanism reduces the amount of training data required for both finetuning and distillation of LLMs into smaller model sizes. Core to our mechanism is changing our perspective from viewing LLMs as a source of noisy labels to viewing them as agents that can reason: LLMs can produce natural language rationales justifying their predicted labels (Wei et al., 2022; Kojima et al., 2022). For example, when asked "Jesse's room is 11 feet long and 15 feet wide. If she already has 16 square feet of carpet. How much more carpet does she need to cover the whole floor?", an LLM can be prompted by the chain-of-thought (CoT) technique (Wei et al., 2022) to provide intermediate rationales "Area = length × width. Jesse's room has 11 × 15 square feet." that better connect the input to the final answer "(11 × 15) − 16". These rationales can contain relevant task knowledge, such as "Area = length × width", that may otherwise require many training examples for small task-specific models to learn. We thus utilize these extracted rationales as additional, richer information to train small models through a multi-task training setup, with both label prediction and rationale prediction tasks (Raffel et al., 2020; Narang et al., 2020).

Distilling step-by-step allows us to learn task-specific smaller models that outperform LLMs using over 500× fewer model parameters, and it does so with far fewer training examples compared to traditional finetuning or distillation (Figure 1). Our results show three promising empirical conclusions across 4 NLP benchmarks. First, compared to both finetuning and distillation, our resulting models achieve better performance with over 50% fewer training examples on average across datasets (and up to an over 85% reduction). Second, our models outperform LLMs with much smaller model sizes (up to 2000× smaller), drastically reducing the computation cost required for model deployment. Third, we simultaneously reduce the model size as well as the amount of data required to outperform LLMs. We surpass the performance of 540B parameter LLMs using a 770M T5 model; this smaller model only uses 80% of a labeled dataset that would otherwise be required if using an existing finetuning method. When only unlabeled data is present, our small models still perform on par with or better than LLMs. We outperform the 540B PaLM's performance with only an 11B T5 model. We further show that when a smaller model performs worse than an LLM, Distilling step-by-step can more efficiently leverage additional unlabeled data to match the LLM performance compared to the standard distillation approach.

2 Related work

Our work distills task-specific knowledge of LLMs into smaller specialist models by leveraging the emergent reasoning capabilities of today's LLMs. We draw on knowledge distillation research and methods that learn from both human-generated rationales and LLM-generated rationales.

Knowledge distillation from large models. Knowledge distillation has been successfully used to transfer knowledge from larger, more competent teacher models into smaller student models affordable for practical applications (Buciluǎ et al., 2006; Hinton et al., 2015; Beyer et al., 2022; West et al., 2021; Fu et al., 2023). It supports learning from limited labeled data, since the larger teacher model is often used to generate a training dataset with noisy pseudo labels (Chen et al., 2020; Iliopoulos et al., 2022; Wang et al., 2021; Smith et al., 2022a; Arora et al., 2022; Agrawal et al., 2022). The one limitation that knowledge distillation often faces is its reliance on large amounts of unlabelled data required to create a useful noisy training dataset. Although prior work has explored using data augmentation techniques to reduce this hunger for data (Tang et al., 2019; Liang et al., 2020; Srinivas and Fleuret, 2018; Milli et al., 2019), we propose an alternative approach: we reduce the need for large unlabeled data by distilling not just labels but also the teacher's rationales.

Learning with human rationales. While utilizing LLM-generated rationales is a new exciting area of investigation, using human-generated rationales has a rich history (Hase and Bansal, 2021). For instance, human rationales can be used to regularize model behavior (Ross et al., 2017); they can be used as additional inputs to guide a model's predictions (Rajani et al., 2019); they can be used to improve overall model performance (Zaidan et al., 2007; Zhang et al., 2016; Camburu et al., 2018; Hancock et al., 2019; Pruthi et al., 2022); and
human rationales can be used as gold standard labels to make models more interpretable by generating similar rationales (Wiegreffe et al., 2021; Narang et al., 2020; Eisenstein et al., 2022). Unfortunately, human rationales are expensive.

Learning with LLM generated rationales. Today's LLMs are capable of explaining their predictions by generating high-quality reasoning steps (Wei et al., 2022; Kojima et al., 2022). These reasoning steps have been used to augment input prompts to LLMs, improving their few-shot or zero-shot performance (Wei et al., 2022; Kojima et al., 2022; Wang et al., 2022b); reasoning steps have also been used as additional finetuning data to "self-improve" LLMs (Zelikman et al., 2022; Huang et al., 2022). Unfortunately, regardless of how LLMs are improved, their large size limits their utility in most test-time applications.

By contrast, we leverage generated rationales as informative supervision to train smaller task-specific models, i.e., models that can be deployed without incurring large computation or memory costs. Several concurrent works have also proposed a similar idea to ours, that of using extracted rationales as supervision (Wang et al., 2022a; Ho et al., 2022; Magister et al., 2022; Li et al., 2023). Amongst them, PINTO (Wang et al., 2022a) relies on an LLM to generate rationales at test-time, and thus does not fully solve deployment challenges. Compared with Ho et al. (2022) and Magister et al. (2022), we go beyond their experiments to provide a granular study by varying training dataset size, exploring downstream model sizes, and demonstrating the effectiveness of our method on fully unlabeled datasets.

Figure 2: Overview of Distilling step-by-step. We first utilize CoT prompting to extract rationales from an LLM (Section 3.1). We then use the generated rationales to train small task-specific models within a multi-task learning framework where we prepend task prefixes to the input examples and train the model to output differently based on the given task prefix (Section 3.2).

3 Distilling step-by-step

We propose a new paradigm, Distilling step-by-step, that leverages the ability of LLMs to reason about their predictions to train smaller models in a data-efficient way. Our overall framework is illustrated in Figure 2. Our paradigm has two simple steps: First, given an LLM and an unlabeled dataset, we prompt the LLM to generate output labels along with rationales to justify the labels. Rationales are natural language explanations that provide support for the model's predicted label (see Figure 2). Second, we leverage these rationales in addition to the task labels to train smaller downstream models. Intuitively, rationales provide richer, more detailed information about why an input is mapped to a specific output label, and often contain relevant task knowledge that may be hard to infer solely from the original inputs.

3.1 Extracting rationales from LLMs

Recent studies observe one intriguing emerging property of LLMs: their ability to generate rationales that support their predictions (Wei et al., 2022; Kojima et al., 2022). While these studies have largely focused on how to elicit such reasoning capability from LLMs (Nye et al., 2021; Wei et al., 2022; Kojima et al., 2022), we use the generated rationales in training smaller downstream models.
Figure 3: We use few-shot CoT prompting that contains both an example rationale (highlighted in green) and a label (highlighted in blue) to elicit rationales from an LLM on new input examples.

Specifically, we utilize Chain-of-Thought (CoT) prompting (Wei et al., 2022) to elicit and extract rationales from LLMs. As illustrated in Figure 3, given an unlabeled dataset x_i ∈ D, we first curate a prompt template p that articulates how the task should be solved. Each prompt is a triplet (x^p, r^p, y^p), where x^p is an example input, y^p is its corresponding label, and r^p is a user-provided rationale that explains why x^p can be categorized as y^p. We append each input x_i to p and use it as an input to prompt the LLM to generate rationales and labels for each x_i ∈ D. With the demonstrations seen in p, the LLM is able to mimic the triplet demonstration to generate the rationale r̂_i and output ŷ_i for x_i.
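To make the extraction step concrete, below is a minimal Python sketch of the prompting-and-parsing loop described above. The demonstration wording, the "So the answer is" delimiter, and the llm_generate() helper are illustrative assumptions rather than the paper's exact prompt format (the actual format is shown in Figure 3).

# Minimal sketch of rationale extraction via few-shot CoT prompting.
# Assumes demonstrations are (x_p, r_p, y_p) triplets and the teacher LLM
# mimics the demonstrated "rationale ... So the answer is <label>." pattern.

def build_cot_prompt(demonstrations, new_input):
    """Assemble a few-shot CoT prompt from demonstration triplets plus a new input."""
    parts = []
    for x_p, r_p, y_p in demonstrations:
        parts.append(f"Q: {x_p}\nA: {r_p} So the answer is {y_p}.\n")
    parts.append(f"Q: {new_input}\nA:")
    return "\n".join(parts)

def extract_rationale_and_label(llm_output):
    """Split the LLM continuation into a rationale r_hat and a label y_hat."""
    rationale, _, label = llm_output.rpartition("So the answer is")
    return rationale.strip(), label.strip(" .")

def pseudo_label_dataset(unlabeled_inputs, demonstrations, llm_generate):
    """For every unlabeled x_i, prompt the teacher LLM and keep both outputs.
    llm_generate() stands in for whatever API serves the teacher (e.g., PaLM)."""
    records = []
    for x_i in unlabeled_inputs:
        output = llm_generate(build_cot_prompt(demonstrations, x_i))
        r_hat, y_hat = extract_rationale_and_label(output)
        records.append({"input": x_i, "rationale": r_hat, "label": y_hat})
    return records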
3.2 Training smaller models with rationales

We first describe the current framework for learning task-specific models. With this framework in place, we extend it to incorporate rationales into the training process. Formally, we denote a dataset as D = {(x_i, y_i)}_{i=1}^{N}, where each x_i represents an input and y_i is the corresponding desired output label. While our framework supports inputs and outputs of any modality, our experiments limit x and y to be natural language. This text-to-text framework (Raffel et al., 2020) encompasses a variety of NLP tasks: classification, natural language inference, question answering and more.

Standard finetuning and task distillation. The most common practice to train a task-specific model is to finetune a pretrained model with supervised data (Howard and Ruder, 2018). In the absence of human-annotated labels, task-specific distillation (Hinton et al., 2015; Tang et al., 2019) uses LLM teachers to generate pseudo noisy training labels ŷ_i in place of y_i (Wang et al., 2021; Smith et al., 2022a; Arora et al., 2022). For both scenarios, the smaller model f is trained to minimize the label prediction loss:

\mathcal{L}_{\text{label}} = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i), \hat{y}_i),   (1)

where \ell is the cross-entropy loss between the predicted and target tokens. Note that for ease of exposition, we overload ŷ_i in Eq. 1 to be either the human-annotated label y_i for the standard finetuning case, or the LLM-predicted label ŷ_i for the model distillation case.

Multi-task learning with rationales. To create a more explicit connection between the x_i's and the ŷ_i's, we use extracted rationales r̂_i as additional supervision. There are several ways to incorporate rationales into the downstream model's training process. One straightforward approach is to feed r̂_i as an additional input, as proposed by other concurrent research (Rajani et al., 2019; Wang et al., 2022a). In other words, f(x_i, r̂_i) → ŷ_i is trained with both text and rationale [x, r] as inputs:

\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i, \hat{r}_i), \hat{y}_i).   (2)

Unfortunately, this design requires an LLM to first generate a rationale before f can make a prediction. The LLM is still necessary during deployment, which limits its deployability.

In this work, instead of using rationales as additional model inputs, we frame learning with rationales as a multi-task problem. Specifically, we train the model f(x_i) → (ŷ_i, r̂_i) to not only predict the task labels but also generate the corresponding rationales given the text inputs:

\mathcal{L} = \mathcal{L}_{\text{label}} + \lambda \mathcal{L}_{\text{rationale}},   (3)

where L_label is the label prediction loss in Eq. 1 and L_rationale is the rationale generation loss:

\mathcal{L}_{\text{rationale}} = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i), \hat{r}_i).   (4)

The rationale generation loss enables the model to learn to generate the intermediate reasoning steps for the prediction, and could therefore guide the model in better predicting the resultant label. This is our proposed Distilling step-by-step. Compared with Eq. 2, the rationale r̂_i is not required at test time, which removes the need for an LLM at test time.
We prepend "task prefixes" ([label], [rationale]) to the input examples and train the smaller model to output ŷ_i when [label] is provided and to produce r̂_i with [rationale] (Raffel et al., 2020).
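Below is a minimal sketch of this multi-task objective (Eq. 3) for a T5 student from the huggingface/transformers package used in Appendix A.1. The prefix strings, the batching details, and the default lambda value of 1.0 are illustrative assumptions; only the overall structure (the same input encoded once per task prefix, with a weighted sum of the label loss in Eq. 1 and the rationale loss in Eq. 4) follows the description above.

# Sketch of the multi-task loss with [label] / [rationale] task prefixes.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def multitask_loss(x_batch, y_hat_batch, r_hat_batch, lam=1.0):
    """Distilling step-by-step training loss: L = L_label + lam * L_rationale."""

    def seq2seq_loss(prefix, targets):
        # Prepend the task prefix to every input, then compute the usual
        # token-level cross-entropy against the target sequence.
        enc = tokenizer([prefix + x for x in x_batch],
                        return_tensors="pt", padding=True,
                        truncation=True, max_length=1024)
        dec = tokenizer(targets, return_tensors="pt", padding=True, truncation=True)
        labels = dec.input_ids.masked_fill(
            dec.input_ids == tokenizer.pad_token_id, -100)  # ignore pad tokens
        return model(input_ids=enc.input_ids,
                     attention_mask=enc.attention_mask,
                     labels=labels).loss

    loss_label = seq2seq_loss("[label] ", y_hat_batch)          # Eq. 1
    loss_rationale = seq2seq_loss("[rationale] ", r_hat_batch)  # Eq. 4
    return loss_label + lam * loss_rationale                    # Eq. 3

At test time, only the [label]-prefixed input is fed to the model, so rationale generation adds no inference cost.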
4 Experiments

We empirically validate the effectiveness of Distilling step-by-step. First, we show that when compared to standard finetuning and task distillation approaches, Distilling step-by-step achieves better performance with far fewer training examples, substantially improving the data efficiency of learning small task-specific models (Sec. 4.1). Second, we show that Distilling step-by-step surpasses the performance of LLMs with much smaller model sizes, drastically lowering the deployment cost compared to LLMs (Sec. 4.2). Third, we investigate the minimum resources required, w.r.t. both the number of training examples and the model size, for Distilling step-by-step to outperform LLMs. We show that Distilling step-by-step outperforms LLMs by using less data and smaller models, simultaneously improving both data- and deployability-efficiency (Sec. 4.3). Finally, we perform ablation studies to understand the influence of different components and design choices in the Distilling step-by-step framework (Sec. 4.4).

Setup. In the experiments, we consider the 540B PaLM model (Chowdhery et al., 2022) as the LLM. For task-specific downstream models, we use T5 models (Raffel et al., 2020), where we initialize the models with pretrained weights obtained from publicly available sources (https://huggingface.co/). For CoT prompting, we follow Wei et al. (2022) when available, and curate our own examples for new datasets. We include more implementation details in Appendix A.1.

Datasets. We conduct the experiments on 4 popular benchmark datasets across 3 different NLP tasks: e-SNLI (Camburu et al., 2018) and ANLI (Nie et al., 2020) for natural language inference; CQA (Talmor et al., 2019; Rajani et al., 2019) for commonsense question answering; and SVAMP (Patel et al., 2021) for arithmetic math word problems. We include more dataset details in Appendix A.2.

4.1 Reducing training data

We compare Distilling step-by-step to the two most common methods for learning task-specific models: (1) STANDARD FINETUNING when human-labeled examples are available, and (2) STANDARD TASK DISTILLATION when only unlabeled examples are available. Specifically, standard finetuning refers to the prevailing pretrain-then-finetune paradigm that finetunes a model with ground-truth labels via standard label supervision (Howard and Ruder, 2018). On the other hand, when only unlabeled examples are available, standard task distillation learns the task-specific model by treating a teacher LLM's predicted labels as ground truths (Hinton et al., 2015; Chen et al., 2020; Wang et al., 2021; Smith et al., 2022a; Arora et al., 2022).

In the following set of experiments, we fix the task-specific models to be 220M T5-Base models, and compare the task performance achieved by different methods under varying numbers of available training examples.

Distilling step-by-step outperforms standard finetuning with far fewer labeled examples. When finetuned with human-labeled examples, Figure 4 shows that Distilling step-by-step consistently achieves better performance than standard finetuning across varying numbers of labeled examples used. Furthermore, we see that Distilling step-by-step can achieve the same performance as standard finetuning with far fewer labeled examples. In particular, by using only 12.5% of the full e-SNLI dataset, Distilling step-by-step can outperform standard finetuning trained with 100% of the full dataset. Similarly, we achieve 75%, 25%, and 20% reductions in the training examples required to outperform standard finetuning on ANLI, CQA, and SVAMP respectively.

Distilling step-by-step outperforms standard distillation with far fewer unlabeled examples. When only unlabeled data is available, we compare Distilling step-by-step to standard task distillation. In Figure 5, we observe an overall similar trend to the finetuning setup. Specifically, we see that Distilling step-by-step outperforms standard task distillation on all 4 datasets under different amounts of unlabeled data used. We also see that Distilling step-by-step requires much less unlabeled data to outperform standard task distillation. For instance, we need only 12.5% of the full unlabeled dataset to outperform the performance achieved by standard task distillation using 100% of the training examples on the e-SNLI dataset.
Figure 4: We compare Distilling step-by-step and Standard finetuning using 220M T5 models on varying sizes of human-labeled datasets. On all datasets, Distilling step-by-step is able to outperform Standard finetuning, trained on the full dataset, by using far fewer training examples (e.g., 12.5% of the full e-SNLI dataset).

Figure 5: Similar to the plots above, we compare Distilling step-by-step and Standard task distillation using 220M T5 models on varying sizes of unlabeled datasets. Distilling step-by-step is able to outperform Standard task distillation by using only a small subset of the full unlabeled dataset (e.g., 12.5% on the ANLI dataset).

4.2 Reducing model size

In the following set of experiments, we hold the training set size fixed (using 100% of the datasets), and compare varying sizes of small T5 models trained with Distilling step-by-step and standard approaches to LLMs. Specifically, we consider 3 different sizes of T5 models, i.e., 220M T5-Base, 770M T5-Large, and 11B T5-XXL. For LLMs, we include two baseline methods: (1) FEW-SHOT COT (Wei et al., 2022), and (2) PINTO TUNING (Wang et al., 2022a). Few-shot CoT directly utilizes CoT demonstrations to prompt the 540B PaLM to generate intermediate steps before predicting the final labels, without any further finetuning of the LLM. PINTO tuning refers to our extension of Wang et al. (2022a) to handle tasks beyond question-answering, which are not studied by Wang et al. (2022a). Here, we finetune a 220M T5-Base model on top of the outputs generated from the PaLM model, which can be viewed as a finetuning method for LLMs with additional parameters (Zhang et al., 2020; Lester et al., 2021).

We present the experimental results under the two broad scenarios of having access to labeled datasets or unlabeled datasets in Figure 6 and Figure 7, respectively. We plot each method by its deployed model size for prediction (x-axis) and its corresponding task performance (y-axis).

Distilling step-by-step improves over standard baselines across varying model sizes used. In Figure 6 and Figure 7 respectively, we see that Distilling step-by-step consistently improves over standard finetuning and standard distillation across all sizes of T5 models. The improvements are most pronounced on ANLI, where Distilling step-by-step outperforms standard finetuning and distillation by an average of 8% and 13% on task accuracy respectively.

Distilling step-by-step outperforms LLMs by using much smaller task-specific models. In Figure 6, when human-labeled datasets are available, Distilling step-by-step can always outperform Few-shot CoT and PINTO tuning on all 4 datasets considered, by using much smaller T5 models.
Figure 6: We perform Distilling step-by-step and Standard finetuning, using the full human-labeled datasets, on varying sizes of T5 models and compare their performance to LLM baselines, i.e., Few-shot CoT and PINTO Tuning. Distilling step-by-step is able to outperform the LLM baselines by using much smaller models, e.g., an over 700× smaller model on ANLI. Standard finetuning fails to match the LLM's performance using the same model size.

Figure 7: Using unlabeled datasets, we perform Distilling step-by-step and Standard task distillation on varying sizes of T5 models and compare them to Few-shot CoT. Distilling step-by-step outperforms Few-shot CoT by using 2000× smaller models on e-SNLI and 45× smaller models on ANLI and CQA. On SVAMP, by adding unlabeled examples from ASDiv, we close the gap to Few-shot CoT, whereas Standard distillation still struggles to catch up.

For instance, we can achieve better performance than the 540B PaLM model's Few-shot CoT with a 220M (over 2000× smaller) T5 model on e-SNLI, 770M (over 700× smaller) T5 models on ANLI and SVAMP, and an 11B (over 45× smaller) T5 model on CQA. These results hold true even when further finetuning the 540B PaLM model on available labeled data with PINTO tuning.[3]

[3] We note that PETuning methods may outperform PINTO tuning. However, they require massive resources in both training and deployment, which is not the focus of this work.

In Figure 7, by only utilizing unlabeled examples, Distilling step-by-step also outperforms the teacher LLM on 3 out of 4 datasets. Specifically, Distilling step-by-step surpasses the 540B PaLM model's Few-shot CoT performance by using an 11B T5 model with less than 3% of PaLM's size. On SVAMP, where the distilled model underperforms, we hypothesize that the performance gap is due to the relatively small number of data points in the dataset (i.e., 800). In reaction, we propose to augment the dataset with additional unlabeled examples to close the performance gap, as shown next.

Unlabeled data augmentation further improves Distilling step-by-step. We augment the SVAMP training set with unlabeled examples from the ASDiv dataset (Miao et al., 2020). The ASDiv dataset contains a total of 2,305 examples, where each example is a math word problem similar to the ones in SVAMP. In Figure 7, on SVAMP, we show the performance of Distilling step-by-step and standard task distillation using the 11B T5 model after augmenting the training set with ASDiv. We see that the data augmentation greatly improves the performance of both Distilling step-by-step and standard task distillation. However, even with the added unlabeled examples, standard task distillation still underperforms Few-shot CoT. On the other hand, Distilling step-by-step is able to exploit the value of the added examples much more efficiently, achieving the same performance level as Few-shot CoT, again using a T5 model less than 3% the size of the 540B PaLM.
Figure 8: We show the minimum size of T5 models and the least amount of human-labeled examples required for Distilling step-by-step to outperform the LLM's Few-shot CoT, found by a coarse-grained search. Distilling step-by-step is able to outperform Few-shot CoT using not only much smaller models, but also far fewer training examples compared to Standard finetuning. On ANLI, we outperform the LLM CoT with a 770M model using only 80% of the dataset, whereas Standard finetuning struggles to match it even using 100% of the dataset.

Figure 9: Similar to Figure 8 but using only unlabeled examples, Distilling step-by-step is able to outperform Few-shot CoT using much smaller models and far fewer examples compared to Standard task distillation. On SVAMP, the x-axis corresponds to the portion of the ASDiv dataset used for augmenting the original SVAMP dataset, i.e., x = 0 is without augmentation and x = 100 corresponds to adding the full ASDiv dataset.

4.3 Outperforming LLMs using minimum model size and least training data

Here, using the LLM's performance as an anchor point, we explore the most efficient resource requirements, in terms of both the number of training examples and the deployed model size, that Distilling step-by-step and standard finetuning/distillation need to outperform the LLM. We present the results, again under the human-labeled setting and the unlabeled setting, in Figure 8 and Figure 9 respectively. We visualize the results by plotting the different resultant models by (1) the number of training examples used (x-axis), (2) the final task performance achieved (y-axis), and (3) the size of the model (visualized by the size of the shaded area).

Distilling step-by-step outperforms LLMs with much smaller models by using less data. On all datasets in Figure 8, we see that Distilling step-by-step outperforms PaLM's Few-shot CoT with much smaller T5 models using only a subset of the available training examples. Specifically, on e-SNLI, Distilling step-by-step can achieve better performance than Few-shot CoT with a model over 2000× smaller (220M T5) and only 0.1% of the full dataset. In Figure 9, where only unlabeled datasets are available, we observe the same trend: Distilling step-by-step can, in most cases, outperform Few-shot CoT with a smaller model as well as less data. For instance, on ANLI, Distilling step-by-step outperforms the LLM with a 45× smaller model and 50% of the full unlabeled set.

Standard finetuning and distillation require more data and larger models. Finally, in Figure 8 and Figure 9, we see that standard finetuning and distillation often need either more data or larger models to match the LLM's performance. For instance, on e-SNLI in Figure 8, we observe that Distilling step-by-step outperforms the LLM using only 0.1% of the dataset while standard finetuning requires more data to match the performance. Furthermore, on ANLI in Figure 8, we observe that Distilling step-by-step can outperform PaLM using a 770M model with only 80% of the training set, while standard finetuning struggles to match the LLM even using the full dataset and thus requires a larger model to close the performance gap.
Table 1: Distilling step-by-step works with different sizes of LLMs. When rationales are extracted from a 20B GPT-NeoX model, Distilling step-by-step is still able to provide a performance lift compared to standard finetuning on 220M T5 models.

Method                    LLM    e-SNLI   ANLI    CQA     SVAMP
Standard finetuning       N/A    88.38    43.58   62.19   62.63
Distilling step-by-step   20B    89.12    48.15   63.25   63.00
Distilling step-by-step   540B   89.51    49.58   63.29   65.50

Table 2: Our proposed multi-task training framework consistently leads to better performance than treating rationale and label predictions as a single task. Single-task training can at times lead to worse performance than standard finetuning.

Method                    e-SNLI   ANLI    CQA     SVAMP
Standard finetuning       88.38    43.58   62.19   62.63
Single-task training      88.88    43.50   61.37   63.00
Multi-task training       89.51    49.58   63.29   65.50

4.4 Further ablation studies

So far, we have focused on showing the effectiveness of Distilling step-by-step in reducing the training data required for finetuning or distilling smaller task-specific models. In this section, we perform further studies to understand the influence of different components in the Distilling step-by-step framework. Specifically, we study (1) how different LLMs, from which the rationales are extracted, affect the effectiveness of Distilling step-by-step, and (2) how the multi-task training approach compares to other potential design choices in training small task-specific models with LLM rationales. Here, we fix the small task-specific models to be 220M T5 models, and utilize 100% of the data on all datasets.

Distilling step-by-step works with different sizes of decently trained LLMs. In addition to using the 540B PaLM as the LLM, here we consider a relatively smaller LLM, the 20B GPT-NeoX model (Black et al., 2022), from which we extract rationales for Distilling step-by-step. In Table 1, we see that when coupled with LLMs of different sizes, Distilling step-by-step can still provide performance improvements compared to standard finetuning. However, the performance lift is smaller when rationales are extracted from the 20B GPT-NeoX model instead of from the 540B PaLM. This can be due to the fact that the larger PaLM model provides higher-quality rationales that are more beneficial for learning the task.

Multi-task training is much more effective than single-task rationale and label joint prediction. There are different possible ways to train task-specific models with LLM-rationales as output supervision. One straightforward approach is to concatenate the rationale r̂_i and label ŷ_i into a single sequence [r̂_i, ŷ_i] and treat the entire sequence as the target output in training small models, as considered in (Magister et al., 2022; Ho et al., 2022):

\mathcal{L}_{\text{single}} = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i), [\hat{r}_i, \hat{y}_i]).   (5)

In Table 2, we compare this single-task training approach to our proposed multi-task training approach for utilizing LLM-rationales. We see that not only does multi-task training consistently lead to better performance, but single-task training with LLM-rationales can at times lead to worse performance than standard finetuning, e.g., on ANLI and CQA. In fact, similar results have also been observed in (Wiegreffe et al., 2021; Magister et al., 2022; Ho et al., 2022): simply treating rationale and label predictions as a single joint task may harm the model's performance on label prediction. This validates our use of the multi-task training approach, and highlights the need to treat the rationales carefully so as to unleash their actual benefits.
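For concreteness, here is a small sketch contrasting the two target formats compared in this ablation. The separator text and the prefix strings are assumptions for illustration only; the key difference is whether rationale and label share one target sequence (Eq. 5) or form two prefixed training examples (Eq. 3).

# Single-task (Eq. 5) vs. multi-task (Distilling step-by-step) target formats.

def single_task_example(x, r_hat, y_hat):
    # Rationale and label concatenated into one target sequence.
    return {"input": x, "target": f"{r_hat} So the answer is {y_hat}."}

def multi_task_examples(x, r_hat, y_hat):
    # Two prefixed examples sharing the same input; at test time only the
    # [label]-prefixed variant is used.
    return [
        {"input": "[label] " + x, "target": y_hat},
        {"input": "[rationale] " + x, "target": r_hat},
    ]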
5 Discussion

We propose Distilling step-by-step to extract rationales from LLMs as informative supervision in training small task-specific models. We show that Distilling step-by-step reduces the training dataset required to curate task-specific smaller models; it also reduces the model size required to achieve, and even surpass, the original LLM's performance. Distilling step-by-step thus offers a resource-efficient training-to-deployment paradigm compared to existing methods. Further studies demonstrate the generalizability of Distilling step-by-step and validate its design choices. Finally, we discuss the limitations, future directions, and ethics statement of our work below.
Limitations

There are a number of limitations with our approach. First, we require users to produce a few example demonstrations (∼10-shot for all tasks) in order to use the few-shot CoT (Wei et al., 2022) prompting mechanism. This limitation can be improved by using recent advances that suggest that rationales can be elicited without any user-annotated demonstrations (Kojima et al., 2022). Second, training task-specific models with rationales incurs a slight training-time computation overhead. However, at test time, our multi-task design naturally avoids the computation overhead since it allows one to only predict labels without generating the rationales. Finally, while we observe success using LLM rationales, there is evidence that LLMs exhibit limited reasoning capability on more complex reasoning and planning tasks (Valmeekam et al., 2022). Future work should characterize how rationale quality affects Distilling step-by-step.

Ethics statement

It is worth noting that the behavior of our downstream smaller models is subject to biases inherited from the larger teacher LLM. We envision that the same research progress in reducing anti-social behaviors in LLMs can also be applied to improve smaller language models.

References

Priyanka Agrawal, Chris Alberti, Fantine Huot, Joshua Maynez, Ji Ma, Sebastian Ruder, Kuzman Ganchev, Dipanjan Das, and Mirella Lapata. 2022. QAmeleon: Multilingual QA with only 5 examples. arXiv preprint arXiv:2211.08264.

Simran Arora, Avanika Narayan, Mayee F Chen, Laurel J Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher Ré. 2022. Ask me anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441.

Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. 2022. Knowledge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10925–10934.

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of the ACL Workshop on Challenges & Perspectives in Creating Large Language Models.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. Advances in Neural Information Processing Systems, 31.

Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. 2020. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33:22243–22255.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jacob Eisenstein, Daniel Andor, Bernd Bohnet, Michael Collins, and David Mimno. 2022. Honest students from untrusted teachers: Learning an interpretable question-answering pipeline from a pretrained language model. arXiv preprint arXiv:2210.02498.

Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. 2023. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726.

Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. 2019. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415.

Peter Hase and Mohit Bansal. 2021. When can models learn from explanations? A formal framework for understanding the roles of explanation data. arXiv preprint arXiv:2102.02201.

Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7).
Namgyu Ho, Laura Schmid, and Se-Young Yun. 2022. Large language models are reasoning teachers. arXiv preprint arXiv:2212.10071.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2022. Large language models can self-improve. arXiv preprint arXiv:2210.11610.

Fotis Iliopoulos, Vasilis Kontonis, Cenk Baykal, Gaurav Menghani, Khoa Trinh, and Erik Vee. 2022. Weighted distillation with unlabeled examples. In Advances in Neural Information Processing Systems.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, and Yejin Choi. 2023. Symbolic chain-of-thought distillation: Small models can also "think" step-by-step. arXiv preprint arXiv:2306.14050.

Kevin J Liang, Weituo Hao, Dinghan Shen, Yufan Zhou, Weizhu Chen, Changyou Chen, and Lawrence Carin. 2020. MixKD: Towards efficient distillation of large-scale language models. arXiv preprint arXiv:2011.00593.

Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. 2022. Teaching small language models to reason. arXiv preprint arXiv:2212.08410.

Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975–984.

Smitha Milli, Ludwig Schmidt, Anca D Dragan, and Moritz Hardt. 2019. Model reconstruction from model explanations. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 1–9.

Sharan Narang, Colin Raffel, Katherine Lee, Adam Roberts, Noah Fiedel, and Karishma Malkan. 2020. WT5?! Training text-to-text models to explain their predictions. arXiv preprint arXiv:2004.14546.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online. Association for Computational Linguistics.

Danish Pruthi, Rachit Bansal, Bhuwan Dhingra, Livio Baldini Soares, Michael Collins, Zachary C Lipton, Graham Neubig, and William W Cohen. 2022. Evaluating explanations: How much do explanations from the teacher aid students? Transactions of the Association for Computational Linguistics, 10:359–375.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932–4942, Florence, Italy. Association for Computational Linguistics.

Andrew Slavin Ross, Michael C Hughes, and Finale Doshi-Velez. 2017. Right for the right reasons: Training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717.

Ryan Smith, Jason A Fries, Braden Hancock, and Stephen H Bach. 2022a. Language models in the loop: Incorporating prompting into weak supervision. arXiv preprint arXiv:2205.02318.
Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. 2022b. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990.

Suraj Srinivas and François Fleuret. 2018. Knowledge transfer with Jacobian matching. In International Conference on Machine Learning, pages 4723–4731. PMLR.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.

Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling task-specific knowledge from BERT into simple neural networks. arXiv preprint arXiv:1903.12136.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.

Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2022. Large language models still can't plan (a benchmark for LLMs on planning and reasoning about change). arXiv preprint arXiv:2206.10498.

Peifeng Wang, Aaron Chan, Filip Ilievski, Muhao Chen, and Xiang Ren. 2022a. PINTO: Faithful language reasoning using prompt-generated rationales. arXiv preprint arXiv:2211.01562.

Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. Want to reduce labeling cost? GPT-3 can help. arXiv preprint arXiv:2108.13487.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022b. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.

Peter West, Chandra Bhagavatula, Jack Hessel, Jena D Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, and Yejin Choi. 2021. Symbolic knowledge distillation: From general language models to commonsense models. arXiv preprint arXiv:2110.07178.

Sarah Wiegreffe, Ana Marasović, and Noah A. Smith. 2021. Measuring association between labels and free-text rationales. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10266–10284, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Omar Zaidan, Jason Eisner, and Christine Piatko. 2007. Using "annotator rationales" to improve machine learning for text categorization. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 260–267, Rochester, New York. Association for Computational Linguistics.

Eric Zelikman, Yuhuai Wu, and Noah D Goodman. 2022. STaR: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465.

Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik. 2020. Side-tuning: A baseline for network adaptation via additive side networks. In European Conference on Computer Vision, pages 698–714. Springer.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

Ye Zhang, Iain Marshall, and Byron C. Wallace. 2016. Rationale-augmented convolutional neural networks for text classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 795–804, Austin, Texas. Association for Computational Linguistics.

Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Joseph E Gonzalez, et al. 2022. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. arXiv preprint arXiv:2201.12023.
A Experiment detail

A.1 Implementation

We perform our experiments on cloud A100×16 GPU instances. We train the T5 models with the following hyperparameters, using publicly available packages from https://github.com/huggingface/transformers:

• T5-Base (220M) and T5-Large (770M): We train the models with learning rate = 5 × 10^-5, batch size = 64, max input length = 1024, for a maximum of 10,000 steps.

• T5-XXL (11B): We train the models with learning rate = 5 × 10^-5, batch size = 32, max input length = 1024, for a maximum of 4,000 steps.

We report all the results over 4 random runs, and include the standard error in the presented plots.
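For reference, the hyperparameters above map onto a Trainer configuration roughly as sketched below. The paper does not specify its exact training script; argument choices beyond the stated hyperparameters (e.g., output_dir) and the assumption that the reported batch size is a per-device value are illustrative.

# Sketch of a training configuration for T5-Base/T5-Large with the
# huggingface/transformers Trainer API; for T5-XXL use batch size 32
# and 4,000 steps as stated above.
from transformers import Seq2SeqTrainingArguments

MAX_INPUT_LENGTH = 1024  # applied when tokenizing inputs (truncation=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="distilling-step-by-step-t5",
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    max_steps=10_000,
    predict_with_generate=True,
)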
A.2 Datasets

We provide more detailed descriptions of the datasets used in our experiments. We include the sources from which we obtained the datasets as well as their original sources released by the authors. We refer readers to these sources for their license or terms for use and/or distribution. To the best of our knowledge, the datasets used do not contain information that names or uniquely identifies individual people, or offensive content.

• e-SNLI: The dataset was originally released in (Camburu et al., 2018), and made publicly available at https://github.com/OanaMariaCamburu/e-SNLI. We obtain the dataset from https://huggingface.co/datasets/esnli.

• ANLI: The dataset was originally released in (Nie et al., 2020), and made publicly available at https://github.com/facebookresearch/anli. We obtain the dataset from https://huggingface.co/datasets/anli. We use the R1 split in our experiments.

• CQA: The dataset was originally released in (Talmor et al., 2019), and made publicly available at https://www.tau-nlp.sites.tau.ac.il/commonsenseqa. It was then augmented with human-labeled explanations by (Rajani et al., 2019), which are available at https://github.com/salesforce/cos-e. We obtain the dataset used in our experiments from https://huggingface.co/datasets/cos_e.

• SVAMP: The dataset was originally released in (Patel et al., 2021). We obtain the dataset from https://github.com/arkilpatel/SVAMP.

• ASDiv: The dataset was originally released in (Miao et al., 2020). We obtain the dataset from https://github.com/chaochun/nlu-asdiv-dataset.

Table 3: Dataset statistics used in our experiments.

Dataset   Train     Validation   Test
e-SNLI    549,367   9,842        9,824
ANLI      16,946    1,000        1,000
CQA       8,766     975          1,221
SVAMP     720       80           200

For each dataset, we randomly subsample 10% of the original training set to serve as a validation set when a validation set is not originally provided. For CQA, we use the original validation set as our test set, since the ground-truth labels are not available for the original test set. We provide the dataset statistics in Table 3.
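The following is a small sketch of the validation-split convention described above, using the Hugging Face datasets package (the datasets are obtained from https://huggingface.co/datasets/... as listed above). The fixed random seed is an assumption; the paper does not state one.

# Sketch: carve out a 10% validation split when one is not provided.
from datasets import load_dataset

def load_with_validation(name, **kwargs):
    ds = load_dataset(name, **kwargs)
    if "validation" not in ds:
        split = ds["train"].train_test_split(test_size=0.1, seed=0)
        ds["train"], ds["validation"] = split["train"], split["test"]
    return ds

esnli = load_with_validation("esnli")
# For CQA (cos_e), the original validation split serves as the test set,
# since ground-truth labels for the official test set are unavailable.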
