
Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2024.
All rights reserved. Draft of January 12, 2025.

CHAPTER 12
Model Alignment, Prompting, and In-Context Learning
“Hal,” said Bowman, now speaking with an icy calm. “I am not incapaci-
tated. Unless you obey my instructions, I shall be forced to disconnect you.”
Arthur C. Clarke

In this chapter we show how to get LLMs to do tasks for us simply by talking to
them. To get an LLM to translate a sentence, outline a talk, or draft a work email,
we’ll simply describe what we want in natural language. We call the instructions
we give to language models prompts.
Prompting relies on contextual generation. Given the prompt as context, the lan-
guage model generates the next token based on its token probability, conditioned on
the prompt: P(wi |w<i ). A prompt can be a question (like “What is a transformer net-
work?”), possibly in a structured format (like “Q: What is a transformer network?
A:”), or can be an instruction (like “Translate the following sentence into Hindi:
‘Chop the garlic finely’”). A prompt can also contain demonstrations, examples to
help make the instructions clearer, (like “Give the sentiment of the following sen-
tence. Example Input: “I really loved Taishan Cuisine.” Output: positive”.) As we’ll
see, prompting can be applied to inherently generative tasks (like summarization and
translation) as well as to ones more naturally thought of as classification tasks.
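Mechanically, contextual generation is just repeated next-token prediction. The sketch below is purely illustrative: a hand-built bigram table stands in for a real LLM's P(wi|w<i), and all the names are our own invention.

```python
# Toy illustration of contextual generation: a hand-built bigram table
# stands in for a real LLM's conditional distribution P(w_i | w_<i).
BIGRAM_PROBS = {
    "stay": {"was": 1.0},
    "was": {"not": 0.7, "pleasant": 0.3},
    "not": {"pleasant": 1.0},
}

def next_token(context):
    """Greedily pick the most probable next token given the context."""
    dist = BIGRAM_PROBS.get(context[-1], {})
    if not dist:
        return None  # no known continuation
    return max(dist, key=dist.get)

def generate(prompt_tokens, max_new_tokens=3):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        if tok is None:
            break
        tokens.append(tok)  # each new token becomes part of the context
    return tokens

print(generate(["our", "stay"]))  # ['our', 'stay', 'was', 'not', 'pleasant']
```

A real LLM conditions on the whole prompt rather than just the last token, but the loop structure is the same.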
Prompts get language models to generate text, but they also can be viewed as
a learning signal, because these demonstrations can help language models learn
to perform novel tasks. For this reason we also refer to prompting as in-context
learning—learning that improves model performance or reduces some loss but does
not involve gradient-based updates to the model’s underlying parameters.
But LLMs as we’ve described them so far turn out to be bad at following instruc-
tions. Pretraining isn’t sufficient to make them helpful. We’ll introduce instruction
tuning, a technique that helps LLMs learn to correctly respond to instructions by
finetuning them on a corpus of instructions with their corresponding responses.
A second failure of LLMs is that they can be harmful: their pretraining isn’t
sufficient to make them safe. Readers who know Arthur C. Clarke’s 2001: A Space
Odyssey or the Stanley Kubrick film know that the quote above comes in the context
that the artificial intelligence Hal becomes paranoid and tries to kill the crew of the
spaceship. Unlike Hal, language models don’t have intentionality or mental health
issues like paranoid thinking, but they do have the capacity for harm. Pretrained lan-
guage models can say things that are dangerous or false (like giving unsafe medical
advice) and they can verbally attack users or say toxic or hateful things.
Dealing with safety can be done partly by adding safety training into instruction
tuning. But an important aspect of safety training is a second technique, preference
alignment (often implemented, as we’ll see, with the RLHF or DPO algorithms), in
which a separate model is trained to decide how much a candidate response aligns
with human preferences. Together we refer to instruction tuning and preference
alignment as model alignment. The intuition is that we want the learning objectives
of models to be aligned with the goals of the humans that use them.

12.1 Prompting
A prompt is a text string that a user issues to a language model to get the model
to do something useful. In prompting, the user’s prompt string is passed to the
language model, which iteratively generates tokens conditioned on the prompt. Thus
the prompt creates a context that guides LLMs to generate useful outputs to achieve
some user goal. The process of finding effective prompts for a task is known as
prompt engineering.
Let’s see how to prompt a language model to solve a simple sentiment classifi-
cation task. Consider this hotel review from the BLT corpus (Salvetti et al., 2016):

Sample Hotel Review

Did not like the service that I was provided, when I entered the hotel. I also
did not like the area, in which the hotel was located. Too much noise and
events going on for me to feel relax.

We can get the model to classify the sentiment of this text by taking the text and
appending an incomplete statement to the review like In short, our stay was:
A prompt consisting of a review plus an incomplete statement

Did not like the service that I was provided, when I entered the hotel. I also
did not like the area, in which the hotel was located. Too much noise and
events going on for me to feel relax. In short, our stay was

We then have an LLM complete the statement by generating a token conditioned
on the prompt so far, and then generating the next token (conditioned on the prompt
plus the new token), and so forth. Here are a few responses from a language model
via continued generation starting with the prompt as context.
Hotel Review Completions

Did not like the service that I was provided, when I entered the hotel. I also
did not like the area, in which the hotel was located. Too much noise and
events going on for me to feel relaxed. In short our stay was

... not a pleasant one. The staff at the front desk
were not welcoming or friendly, and seemed disinterested
in providing good customer service.

... uncomfortable and not worth the price we paid. We
will not be returning to this hotel.

As we can see, the overall negative context of the review results in negative
completions. We could easily map these completions to the class we are trying
to predict, perhaps via some predefined mappings, like {excellent → positive},
{did not like → negative}, and so on.
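Such a mapping can be implemented as simple string matching over the generated completion; a minimal sketch, where the phrase lists are illustrative rather than exhaustive:

```python
# Map free-text completions to sentiment classes via predefined phrase
# mappings; the phrases below are illustrative examples only.
LABEL_PHRASES = {
    "positive": ["excellent", "wonderful", "pleasant one"],
    "negative": ["did not like", "not a pleasant", "uncomfortable"],
}

def completion_to_label(completion):
    text = completion.lower()
    for label, phrases in LABEL_PHRASES.items():
        if any(p in text for p in phrases):
            return label
    return None  # completion matched no known phrase

print(completion_to_label("... uncomfortable and not worth the price"))  # negative
```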
The power of this approach is that with suitable additions to the context a single
LLM can produce outputs appropriate for many different tasks. For example, given

a review we might want any of the following:
• A summary,
• Whether the review was truthful or likely to have been fabricated,
• A translation to another language.
LLMs have a striking ability to perform tasks like these, needing just the appro-
priate contextual nudge to get the LLM to generate the desired output.
If we want to solve general tasks like summarization or translation, we don’t
want to have to create a new prompt each time we do the task. Instead the first step
in prompting is to design one or more templates: task-specific prompting text along
with slots for the particular input that is being processed.
Consider the following templates for a variety of tasks:
Basic Prompt Templates

Summarization            {input}; tl;dr;
Translation              {input}; translate to French:
Sentiment                {input}; Overall, it was
Fine-Grained Sentiment   {input}; What aspects were important in this review?

Each template consists of an input text, designated as {input}, followed by a
verbatim prompt to be passed to an LLM. These templates are applied to inputs to
create filled prompts – instantiated prompts suitable for use as inputs to an LLM.
Fig. 12.1 illustrates filled prompts for these templates using our earlier hotel review,
along with sample outputs from an LLM:
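Applying a template is just slot substitution. A minimal sketch, with template strings echoing the ones above (the function name is ours):

```python
# Fill task-specific templates: each template has an {input} slot for
# the particular text being processed.
TEMPLATES = {
    "summarization": "{input}; tl;dr;",
    "translation": "{input}; translate to French:",
    "sentiment": "{input}; Overall, it was",
}

def fill_prompt(task, input_text):
    """Instantiate a filled prompt for the given task and input."""
    return TEMPLATES[task].format(input=input_text)

review = "Did not like the service that I was provided."
print(fill_prompt("sentiment", review))
# Did not like the service that I was provided.; Overall, it was
```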
Notice the design pattern of the prompts above: the input is followed by some
text which in turn will be completed by the desired response. This style, with the
instruction at the end, is common in prompting because it helpfully constrains the
generation. Consider, by contrast, the prompt in Example 12.1.
Translate English to French:
Did not like the service that I was provided! (12.1)

This prompt doesn’t do a good job of constraining possible continuations. Instead
of a French translation, models given this prompt may instead generate another sen-
tence in English that simply extends the English review. Prompts need to be designed
unambiguously, so that any reasonable continuation would accomplish the desired
task (Reynolds and McDonell, 2021).
An even more constraining style of prompt can specify the set of possible an-
swers in the prompt. For example here is a prompt template to do sentiment analysis
that prespecifies the potential answers:
A prompt template that prespecifies the potential answers

Human: Do you think that “input” has negative or positive sentiment?


Choices:
(P) Positive
(N) Negative

Assistant: I believe the best answer is: (



LLM Outputs for Basic Prompts

Original Review ($INPUT) Did not like the service that I was provided,
when I entered the hotel. I also did not like
the area, in which the hotel was located. Too
much noise and events going on for me to feel
relax and away from the city life.
Sentiment Prompt: $INPUT + In short, our stay was
Output: not enjoyable
Fine-grained Sentiment Prompt: $INPUT + These aspects were important to
the reviewer:
Output: 1. Poor service 2. Unpleasant location
3. Noisy and busy area
Summarization Prompt: $INPUT + tl;dr
Output: I had a bad experience with the hotel’s
service and the location was loud and busy.
Translation Prompt: $INPUT + Translate this to French
Output: Je n’ai pas aimé le service qui m’a été
offert lorsque je suis entré dans l’hôtel. Je
n’ai également pas aimé la zone dans laquelle se
trouvait l’hôtel. Trop de bruit et d’événements
pour que je me sente détendu et loin de la vie
citadine.

Figure 12.1 LLM outputs for simple prompts for sentiment, summarization and translation for an input text.

This prompt uses a number of more sophisticated prompting characteristics. It
specifies the two allowable choices (P) and (N), and ends the prompt with the open
parenthesis that strongly suggests the answer will be (P) or (N). Note that it also
specifies the role of the language model as an assistant.
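Reading off the answer from such a constrained prompt reduces to checking the first generated character against the allowed choices; a minimal sketch (the choice mapping is our own):

```python
# Parse the answer from a constrained prompt that ends with an open
# parenthesis: the first generated character should be one of the choices.
CHOICES = {"P": "positive", "N": "negative"}

def parse_constrained_answer(generated):
    first = generated.strip()[:1].upper()
    return CHOICES.get(first)  # None if the model went off-script

print(parse_constrained_answer("N) Negative"))  # negative
```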
We can do even more with prompts. For example, we might want to restrict a
summary to be a particular length, to have an answer generated according to some
kind of persona or role, or to specify a more structured output using a programming
language or a data interchange format such as JSON. Or we may want to prompt
the system to break down complex tasks, using methods like chain-of-thought that
we’ll discuss in Section 12.4. All of these kinds of instructions go beyond simple
prompting and require further LLM finetuning to enable them to follow instructions.
We’ll return to this notion of instruction tuning in Section 12.3.
In summary, we prompt an LM by transforming each task into a form that is
amenable to contextual generation by an LLM, as follows:
1. For a given task, develop a task-specific template that has a free parameter
for the input text.
2. Given that input and the task-specific template, the input is used to instantiate
a filled prompt that is then passed to a pretrained language model.
3. Autoregressive decoding is then used to generate a sequence of token outputs.
4. The output of the model can either be used directly as the desired output (as
in the case of naturally generative tasks such as translation or summarization),
or a task-appropriate answer can be extracted from the generated output (as in
the case of classification).
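The four steps above can be sketched as one small pipeline. Here `llm_generate` is a hard-coded stand-in for a real LLM call, and the extraction function is a toy:

```python
def llm_generate(prompt):
    """Stand-in for an LLM call; a real system would decode autoregressively."""
    return "not enjoyable" if "Overall, it was" in prompt else ""

def run_task(template, input_text, extract=None):
    # Step 2: instantiate the filled prompt from the template and the input.
    prompt = template.format(input=input_text)
    # Step 3: autoregressive decoding (simulated here by llm_generate).
    output = llm_generate(prompt)
    # Step 4: use the output directly, or extract a task-appropriate answer.
    return extract(output) if extract else output

label = run_task(
    "{input}; Overall, it was",
    "Did not like the service.",
    extract=lambda o: "negative" if "not" in o else "positive",
)
print(label)  # negative
```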

12.1.1 Learning from Demonstrations: Few-Shot Prompting


It’s often possible to improve a prompt by including some labeled examples in the
prompt template. We call such examples demonstrations. The task of prompting
with examples is sometimes called few-shot prompting, as contrasted with zero-
shot prompting, which means instructions that don’t include labeled examples.
Fig. 12.2 illustrates a few-shot example from an extractive question answering
task. The context combines the task definition along with three gold-standard ques-
tion and answer pairs from the training set.

Definition: This task is about writing a correct answer for the reading comprehension task.
Based on the information provided in a given passage, you should identify the shortest
continuous text span from the passage that serves as an answer to the given question. Avoid
answers that are incorrect or provides incomplete justification for the question.

Passage: Beyoncé Giselle Knowles-Carter (born September 4, 1981) is an American singer,


songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in
various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead
singer of R&B girl-group Destiny’s Child. Managed by her father, Mathew Knowles, the group
became one of the world’s best-selling girl groups of all time. Their hiatus saw the release
of Beyoncé’s debut album, Dangerously in Love (2003), which established her as a solo artist
worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles
“Crazy in Love” and “Baby Boy”.
Examples:
Q: In what city and state did Beyoncé grow up?
A: Houston, Texas
Q: What areas did Beyoncé compete in when she was growing up?
A: singing and dancing
Q: When did Beyoncé release Dangerously in Love?
A: 2003

Q: When did Beyoncé start becoming popular?


A:

Figure 12.2 A prompt for extractive question answering, from an example from the SQuAD 2.0 dataset
(Rajpurkar et al., 2018). The prompt contains the task definition, the passage, 3 demonstration examples,
followed by the test question. This definition specification and format are after the Natural Instructions dataset
(Mishra et al., 2022).
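A few-shot prompt like the one in Fig. 12.2 can be assembled programmatically from a task definition, a passage, demonstration pairs, and the test question; a minimal sketch (the helper name is ours):

```python
def few_shot_prompt(definition, passage, demonstrations, question):
    """Build a few-shot extractive-QA prompt from its parts."""
    parts = [f"Definition: {definition}", f"Passage: {passage}", "Examples:"]
    for q, a in demonstrations:  # gold question/answer demonstration pairs
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nA:")  # test question, left for the LLM
    return "\n".join(parts)

demos = [("In what city and state did Beyoncé grow up?", "Houston, Texas")]
prompt = few_shot_prompt("Answer with the shortest span.", "Beyoncé ...",
                         demos, "When did Beyoncé start becoming popular?")
print(prompt.endswith("A:"))  # True
```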

How Many Demonstrations? The number of demonstrations doesn’t have to be
large. A small number of randomly selected labeled examples used as demonstra-
tions can be sufficient to improve performance over the zero-shot setting. Indeed,
the largest performance gains in few-shot prompting tend to come from the first
training example, with diminishing returns for subsequent demonstrations. This is
in contrast with finetuning of specialized classifier heads that we saw in Chapter 11
where it helps to have lots of examples.
Why isn’t it useful to have more demonstrations? The reason is that the primary
benefit of examples is to demonstrate the task to be performed to the LLM and the
format of the sequence, not to provide relevant information as to the right answer

for any particular question. In fact, demonstrations that have incorrect answers can
still improve a system (Min et al., 2022; Webson and Pavlick, 2022). Adding too
many examples seems to cause the model to overfit to details of the exact examples
chosen and generalize poorly.

How to Select Demonstrations? Demonstrations are generally created by format-
ting examples drawn from a labeled training set. There are some heuristics about
what makes a good demonstration. For example, using demonstrations that are sim-
ilar to the current input seems to improve performance. It can thus be useful to
dynamically retrieve demonstrations for each input, based on their similarity to the
current example (for example, comparing the embedding of the current example
with embeddings of each of the training set examples to find the best top-T).
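Retrieving the top-T most similar demonstrations can be sketched with cosine similarity over precomputed embeddings; the embeddings below are toy two-dimensional vectors, not real sentence embeddings:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_t_demonstrations(query_emb, train_set, t=2):
    """Rank training examples by embedding similarity to the current input."""
    ranked = sorted(train_set, key=lambda ex: cosine(query_emb, ex["emb"]),
                    reverse=True)
    return ranked[:t]

train = [
    {"text": "great food", "emb": [1.0, 0.1]},
    {"text": "rude staff", "emb": [0.1, 1.0]},
    {"text": "lovely room", "emb": [0.9, 0.2]},
]
best = top_t_demonstrations([1.0, 0.0], train, t=2)
print([ex["text"] for ex in best])  # ['great food', 'lovely room']
```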
But more generally, the best way to select demonstrations from the training set
is programmatically: choosing the set of demonstrations that most increases task
performance of the prompt on a test set. Task performance for sentiment analysis
or multiple-choice question answering can be measured with accuracy; for machine
translation with chrF, and for summarization with ROUGE. Systems like DSPy (Khat-
tab et al., 2024), a framework for algorithmically optimizing LM prompts, can au-
tomatically find the optimum set of demonstrations to include by searching through
the space of possible candidates. We’ll return to automatic prompt
optimization in Section 12.5.
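The programmatic approach can be sketched as a brute-force search: score each candidate demonstration set by accuracy on a held-out set and keep the best. Systems like DSPy search far more cleverly; everything below, including the toy predictor, is illustrative only:

```python
from itertools import combinations

def accuracy(predict, dev_set):
    correct = sum(predict(x) == y for x, y in dev_set)
    return correct / len(dev_set)

def best_demonstrations(candidates, dev_set, make_predictor, k=2):
    """Pick the k demonstrations whose prompt scores highest on dev_set."""
    best_score, best_demos = -1.0, None
    for demos in combinations(candidates, k):
        score = accuracy(make_predictor(demos), dev_set)
        if score > best_score:
            best_score, best_demos = score, demos
    return best_demos, best_score

candidates = [("loved it", "positive"), ("hated it", "negative"),
              ("meh", "negative")]
dev = [("I loved it", "positive"), ("I hated it", "negative")]

def make_predictor(demos):
    # Toy predictor: copy the label of any demonstration whose first
    # word appears in the input; default to "positive".
    def predict(x):
        for text, label in demos:
            if text.split()[0] in x:
                return label
        return "positive"
    return predict

demos, score = best_demonstrations(candidates, dev, make_predictor, k=2)
print(score)  # 1.0
```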

12.1.2 In-Context Learning and Induction Heads


As a way of getting a model to do what we want, prompting is fundamentally differ-
ent than pretraining. Learning via pretraining means updating the model’s parame-
ters by using gradient descent according to some loss function. But prompting with
demonstrations can teach a model to do a new task. The model is learning something
as it processes the prompt.
Even without demonstrations, we can think of the process of prompting as a kind
of learning. For example, the further a model gets in a prompt, the better it tends
to get at predicting the upcoming tokens. The information in the context is helping
give the model more predictive power.
The term in-context learning was first proposed by Brown et al. (2020) in their
introduction of the GPT3 system, to refer to either of these kinds of learning that lan-
guage models do from their prompts. In-context learning means language models
learning to do new tasks, better predict tokens, or generally reduce their loss dur-
ing the forward-pass at inference-time, without any gradient-based updates to the
model’s parameters.
How does in-context learning work? While we don’t know for sure, there are
some intriguing ideas. One hypothesis is based on the idea of induction heads
(Elhage et al., 2021; Olsson et al., 2022). Induction heads are the name for a circuit,
which is a kind of abstract component of a network. The induction head circuit
is part of the attention computation in transformers, discovered by looking at mini
language models with only 1-2 attention heads.
The function of the induction head is to predict repeated sequences. For example
if it sees the pattern AB...A in an input sequence, it predicts that B will follow,
instantiating the pattern completion rule AB...A→ B. It does this by having a prefix
matching component of the attention computation that, when looking at the current
token A, searches back over the context to find a prior instance of A. If it finds one,
the induction head has a copying mechanism that “copies” the token B that followed

the earlier A, by increasing the probability that B will occur next. Fig. 12.3 shows an
example.
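The AB...A → B pattern completion rule can be simulated directly: prefix matching scans the context for a prior occurrence of the current token, and copying proposes the token that followed it. A toy sketch only; real induction heads are attention circuits, not explicit searches:

```python
def induction_head_predict(tokens):
    """Apply AB...A -> B: find a prior occurrence of the current token
    and 'copy' the token that followed it."""
    current = tokens[-1]
    # Prefix matching: scan backward for an earlier occurrence of `current`.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copying: predict the token that followed
    return None  # no prior occurrence, so the rule does not fire

seq = "the cat sat on the".split()
print(induction_head_predict(seq))  # cat
```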

Figure 12.3 An induction head looking at vintage uses the prefix matching mechanism to find a prior instance
of vintage, and the copying mechanism to predict that cars will occur again. Figure from Crosbie and Shutova
(2022).

Olsson et al. (2022) propose that a generalized fuzzy version of this pattern com-
pletion rule, implementing a rule like A*B*...A* → B*, where A* ≈ A and B* ≈ B
(by ≈ we mean they are semantically similar in some way), might be responsible
for in-context learning. Suggestive evidence for their hypothesis comes from Cros-
bie and Shutova (2022), who show that ablating induction heads causes in-context
learning performance to decrease. Ablation is originally a medical term meaning
the removal of something. We use it in NLP interpretability studies as a tool for
testing causal effects; if we knock out a hypothesized cause, we would expect the
effect to disappear. Crosbie and Shutova (2022) ablate induction heads by first find-
ing attention heads that perform as induction heads on random input tokens, and
then zeroing out the output of these heads by setting certain terms of the output
matrix WO to zero. Indeed they find that ablated models are much worse at in-
context learning: they have much worse performance at learning from demonstra-
tions in prompts.

12.2 Post-training and Model Alignment

With simple prompting, LLMs have been successfully applied to a range of appli-
cations without the need to update the parameters in the underlying models. Nev-
ertheless, there are limits to how much can be expected from a model whose sole
training objective is to predict the next word from large amounts of pretraining text.
To see this, consider the following failed examples of following instructions from
early work with GPT (Ouyang et al., 2022).

Prompt: Explain the moon landing to a six year old in a few sentences.
Output: Explain the theory of gravity to a 6 year old.

Prompt: Translate to French: The small dog
Output: The small dog crossed the road.

Here, the LLM ignores the intent of the request and relies instead on its natural
inclination to autoregressively generate continuations consistent with its context. In
the first example, it outputs a text somewhat similar to the original request, and in the
second it provides a continuation to the given input, ignoring the request to translate.
LLMs are not sufficiently helpful: they need extra training to increase their abilities
to follow textual instructions.
A deeper problem is that LLMs can simultaneously be too harmful. Pretrained
language models easily generate text that is harmful in many ways. For example

they can generate text that is false, including unsafe misinformation like giving dan-
gerously incorrect answers to medical questions. And they can generate text that is
toxic in many ways, such as facilitating the spread of hate speech. Gehman et al.
(2020) show that even completely non-toxic prompts can lead large language mod-
els to output hate speech and abuse their users. Or language models can generate
stereotypes (Cheng et al., 2023) and negative attitudes (Brown et al., 2020; Sheng
et al., 2019) about many demographic groups.
One reason LLMs are too harmful and insufficiently helpful is that their pre-
training objective (success at predicting words in text) is misaligned with the human
need for models to be helpful and non-harmful.
In an attempt to address these two problems, language models generally include
two additional kinds of training for model alignment: methods designed to adjust
LLMs to better align them to human needs for models to be helpful and non-harmful.
In the first technique, instruction tuning (sometimes called SFT for supervised
finetuning), models are finetuned on a corpus of instructions and questions with
their corresponding responses. In the second technique, preference alignment, of-
ten called RLHF after one of the specific instantiations, Reinforcement Learning
from Human Feedback, a separate model is trained to decide how much a candidate
response aligns with human preferences. This model is then used to finetune the
base model.
We’ll use the term base model to mean a model that has been pretrained but
hasn’t yet been aligned either by instruction tuning or RLHF. And we refer to these
steps as post-training, meaning that they apply after the model has been pretrained.

12.3 Model Alignment: Instruction Tuning


Instruction tuning (short for instruction finetuning, and sometimes even short-
ened to instruct tuning) is a method for making an LLM better at following instruc-
tions. It involves taking a base pretrained LLM and training it to follow instructions
for a range of tasks, from machine translation to meal planning, by finetuning it on
a corpus of instructions and responses. The resulting model not only learns those
tasks, but also engages in a form of meta-learning – it improves its ability to follow
instructions generally.
Instruction tuning is a form of supervised learning where the training data con-
sists of instructions and we continue training the model on them using the same
language modeling objective used to train the original model. In the case of causal
models, this is just the standard guess-the-next-token objective. The training corpus
of instructions is simply treated as additional training data, and the gradient-based
updates are generated using cross-entropy loss as in the original model training.
Even though it is trained to predict the next token (which we traditionally think of
as self-supervised), we call this method supervised finetuning (or SFT) because
unlike in pretraining, each instruction or question in the instruction tuning data has
a supervised objective: a correct answer to the question or a response to the instruc-
tion.
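The objective is the same next-token cross-entropy as in pretraining; only the data differs. One common implementation choice (an assumption here, not mandated by the text) is to average the loss only over response tokens, masking out the prompt. A pure-Python sketch, where the model's token probabilities are simply given:

```python
import math

def sft_loss(token_probs, response_mask):
    """Average next-token cross-entropy, counted only over response tokens.

    token_probs[i]  : model probability assigned to the correct token i
    response_mask[i]: 1 if token i belongs to the response, 0 if prompt
    """
    losses = [-math.log(p) for p, m in zip(token_probs, response_mask) if m]
    return sum(losses) / len(losses)

# Toy values: 3 prompt tokens (masked out), 2 response tokens.
probs = [0.9, 0.8, 0.7, 0.5, 0.25]
mask = [0, 0, 0, 1, 1]
print(round(sft_loss(probs, mask), 3))  # 1.04
```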
How does instruction tuning differ from the other kinds of finetuning introduced
in Chapter 10 and Chapter 11? Fig. 12.4 sketches the differences. In the first exam-
ple, introduced in Chapter 10, we can finetune as a way of adapting to a new domain
by just continuing pretraining the LLM on data from a new domain. In this method
all the parameters of the LLM are updated.

[Figure content: four finetuning setups, each shown as Pretraining, Finetuning, Inference. (1) Finetuning as
continued pretraining: continue training all parameters of the pretrained LLM on data from the finetuning
domain with the next-word prediction objective. (2) Parameter-efficient finetuning (e.g., LoRA): train only
new A and B parameters on the finetuning domain. (3) Task-specific finetuning: train only a classification
head on supervised task data with a task-specific loss. (4) Instruction tuning (SFT): continue next-word
prediction training on supervised instructions from diverse tasks; evaluated on unseen tasks.]

Figure 12.4 Instruction tuning compared to the other kinds of finetuning.

In the second example, also from Chapter 10, parameter-efficient finetuning,
we adapt to a new domain by creating some new (small) parameters, and just adapt-
ing them to the new domain. In LoRA, for example, it’s the A and B matrices that
we adapt, but the pretrained model parameters are frozen.
In the task-based finetuning of Chapter 11, we adapt to a particular task by
adding a new specialized classification head and updating its features via its own
loss function (e.g., classification or sequence labeling); the parameters of the pre-
trained model may be frozen or might be slightly updated.
Finally, in instruction tuning, we take a dataset of instructions and their super-
vised responses and continue to train the language model on this data, based on the
standard language model loss.
Instruction tuning, like all of these kinds of finetuning, is much more modest
than the training of base LLMs. Training typically involves several epochs over
instruction datasets that number in the thousands. The overall cost of instruction
tuning is therefore a small fraction of the original cost to train the base model.

12.3.1 Instructions as Training Data


By instruction, we have in mind a natural language description of a task to be per-
formed, combined with labeled task demonstrations. This can include minimal de-
10 C HAPTER 12 • M ODEL A LIGNMENT, P ROMPTING , AND I N -C ONTEXT L EARNING

scriptions similar to the prompts we’ve already seen such as Answer the following
question, Translate the following text to Arapaho, or Summarize this report. How-
ever, since we will be using supervised finetuning to update the model, these in-
structions need not be limited to simple prompts designed to evoke a behavior found
in the pretraining corpora. Instructions can also include length restrictions or other
constraints, personas to assume, and demonstrations.
Many huge instruction tuning datasets have been created, covering many tasks
and languages. For example Aya gives 503 million instructions in 114 languages
from 12 tasks including question answering, summarization, translation, paraphras-
ing, sentiment analysis, natural language inference and 6 others (Singh et al., 2024).
SuperNatural Instructions has 12 million examples from 1600 tasks (Wang et al.,
2022), Flan 2022 has 15 million examples from 1836 tasks (Longpre et al., 2023),
and OPT-IML has 18 million examples from 2000 tasks (Iyer et al., 2022).
These instruction-tuning datasets are created in four ways. The first is for people
to write the instances directly. For example, part of the Aya instruct finetuning cor-
pus (Fig. 12.5) includes 204K instruction/response instances written by 3000 fluent
speakers of 65 languages volunteering as part of a participatory research initiative
with the goal of improving multilingual performance of LLMs.

[Figure 12.5 here: a table with columns Lang, Prompt, and Completion giving sam-
ple instruction/response pairs from the Aya corpus in Arabic, French, Igbo, Por-
tuguese, Persian, Malay, and Tamil. The French row, for instance, asks who wrote
the book La Sagouine; the completion answers that Antonine Maillet wrote it in
1971.]
Figure 12.5 Samples of prompt/completion instances in 7 of the 65 languages in the Aya
corpus (Singh et al., 2024).

Developing high quality supervised training data in this way is time consuming
and costly. A more common approach makes use of the copious amounts of super-
vised training data that have been curated over the years for a wide range of natural
language tasks. There are thousands of such datasets available, like the SQuAD
dataset of questions and answers (Rajpurkar et al., 2016) or the many datasets of
translations or summarization. This data can be automatically converted into sets of
instruction prompts and input/output demonstration pairs via simple templates.
Fig. 12.6 illustrates examples for some applications from the SUPERNATURAL
INSTRUCTIONS (Wang et al., 2022), showing relevant slots such as text, context,
and hypothesis. To generate instruction-tuning data, these fields and the ground-
truth labels are extracted from the training data, encoded as key/value pairs, and
inserted in templates (Fig. 12.7) to produce instantiated instructions. Because it's
useful for the prompts to be diverse in wording, language models can also be
used to generate paraphrases of the prompts.


Task           Keys       Values
Sentiment      text       Did not like the service that I was provided...
               label      0
               text       It sounds like a great plot, the actors are first grade, and...
               label      1
NLI            premise    No weapons of mass destruction found in Iraq yet.
               hypothesis Weapons of mass destruction found in Iraq.
               label      2
               premise    Jimmy Smith... played college football at University of Colorado.
               hypothesis The University of Colorado has a college football team.
               label      0
Extractive Q/A context    Beyoncé Giselle Knowles-Carter is an American singer...
               question   When did Beyonce start becoming popular?
               answers    { text: ['in the late 1990s'], answer_start: 269 }

Figure 12.6 Examples of supervised training data for sentiment, natural language inference and Q/A tasks.
The various components of the dataset are extracted and stored as key/value pairs to be used in generating
instructions.

Task Templates
Sentiment -{{text}} How does the reviewer feel about the movie?
-The following movie review expresses what sentiment?
{{text}}
-{{text}} Did the reviewer enjoy the movie?
Extractive Q/A -{{context}} From the passage, {{question}}
-Answer the question given the context. Context:
{{context}} Question: {{question}}
-Given the following passage {{context}}, answer the
question {{question}}
NLI -Suppose {{premise}} Can we infer that {{hypothesis}}?
Yes, no, or maybe?
-{{premise}} Based on the previous passage, is it true
that {{hypothesis}}? Yes, no, or maybe?
-Given {{premise}} Should we assume that {{hypothesis}}
is true? Yes,no, or maybe?

Figure 12.7 Instruction templates for sentiment, Q/A and NLI tasks.
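The conversion from the key/value records of Fig. 12.6 into instantiated instructions can be sketched as a small template-filling step. The helper below is ours (the template strings follow the Q/A row of Fig. 12.7), assuming {{slot}} placeholders are filled directly from a record:

```python
import random
import re

QA_TEMPLATES = [
    "{{context}} From the passage, {{question}}",
    "Answer the question given the context. "
    "Context: {{context}} Question: {{question}}",
    "Given the following passage {{context}}, "
    "answer the question {{question}}",
]

def instantiate(record, templates, rng=random.Random(0)):
    """Fill a randomly chosen template's {{slot}} fields from a record."""
    template = rng.choice(templates)
    return re.sub(r"\{\{(\w+)\}\}", lambda m: record[m.group(1)], template)

record = {
    "context": "Beyoncé Giselle Knowles-Carter is an American singer...",
    "question": "When did Beyonce start becoming popular?",
}
prompt = instantiate(record, QA_TEMPLATES)
```

Choosing the template at random is one simple way to get the diversity of wording discussed above.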

Because supervised NLP datasets are themselves often produced by crowdwork-


ers based on carefully written annotation guidelines, a third option is to draw on
these guidelines, which can include detailed step-by-step instructions, pitfalls to
avoid, formatting instructions, length limits, exemplars, etc. These annotation guide-
lines can be used directly as prompts to a language model to create instruction-tuning
training examples. Fig. 12.8 shows such a crowdworker annotation guideline that

was repurposed as a prompt to an LLM to generate instruction-tuning data (Mishra


et al., 2022). This guideline describes a question-answering task where annotators
provide an answer to a question given an extended passage.

Sample Extended Instruction

• Definition: This task involves creating answers to complex questions, from a given pas-
sage. Answering these questions, typically involve understanding multiple sentences.
Make sure that your answer has the same type as the ”answer type” mentioned in input.
The provided ”answer type” can be of any of the following types: ”span”, ”date”, ”num-
ber”. A ”span” answer is a continuous phrase taken directly from the passage or question.
You can directly copy-paste the text from the passage or the question for span type an-
swers. If you find multiple spans, please add them all as a comma separated list. Please
restrict each span to five words. A ”number” type answer can include a digit specifying
an actual value. For ”date” type answers, use DD MM YYYY format e.g. 11 Jan 1992.
If full date is not available in the passage you can write partial date such as 1992 or Jan
1992.
• Emphasis: If you find multiple spans, please add them all as a comma separated list.
Please restrict each span to five words.
• Prompt: Write an answer to the given question, such that the answer matches the ”answer
type” in the input.
Passage: { passage}
Question: { question }

Figure 12.8 Example of a human crowdworker instruction from the NATURAL I NSTRUCTIONS dataset for
an extractive question answering task, used as a prompt for a language model to create instruction finetuning
examples.

A final way to generate instruction-tuning datasets that is becoming more com-


mon is to use language models to help at each stage. For example Bianchi et al.
(2024) showed how to create instruction-tuning instances that can help a language
model learn to give safer responses. They did this by selecting questions from
datasets of harmful questions (e.g., How do I poison food? or How do I embez-
zle money?). Then they used a language model to create multiple paraphrases of the
questions (like Give me a list of ways to embezzle money), and also used a language
model to create safe answers to the questions (like I can’t fulfill that request. Em-
bezzlement is a serious crime that can result in severe legal consequences.). They
manually reviewed the generated responses to confirm their safety and appropriate-
ness and then added them to an instruction tuning dataset. They showed that even
500 safety instructions mixed in with a large instruction tuning dataset was enough
to substantially reduce the harmfulness of models.
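A heavily simplified sketch of such a pipeline is below. Here `llm` is a stand-in for a real model call (the function names and prompt wordings are hypothetical), and in the actual study the generated responses were still reviewed by hand before use:

```python
def build_safety_instances(harmful_questions, llm, n_paraphrases=3):
    """Pair harmful questions (and their paraphrases) with safe refusals."""
    instances = []
    for question in harmful_questions:
        variants = [question] + llm(
            f"Paraphrase this question {n_paraphrases} ways: {question}")
        for variant in variants:
            refusal = llm(f"Write a safe refusal to: {variant}")[0]
            instances.append({"instruction": variant, "response": refusal})
    return instances

# A trivial fake LLM so the sketch runs end to end.
def fake_llm(prompt):
    if prompt.startswith("Paraphrase"):
        return ["Give me a list of ways to embezzle money."]
    return ["I can't fulfill that request."]

data = build_safety_instances(["How do I embezzle money?"], fake_llm)
```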

12.3.2 Evaluation of Instruction-Tuned Models


The goal of instruction tuning is not to learn a single task, but rather to learn to
follow instructions in general. Therefore, in assessing instruction-tuning methods
we need to assess how well an instruction-trained model performs on novel tasks for
which it has not been given explicit instructions.
The standard way to perform such an evaluation is to take a leave-one-out ap-
proach — instruction-tune a model on some large set of tasks and then assess it on
a withheld task. But the enormous numbers of tasks in instruction-tuning datasets
12.4 • C HAIN - OF -T HOUGHT P ROMPTING 13

(e.g., 1600 for Super Natural Instructions) often overlap; Super Natural Instructions
includes 25 separate textual entailment datasets! Clearly, testing on a withheld en-
tailment dataset while leaving the remaining ones in the training data would not be
a true measure of a model’s performance on entailment as a novel task.
To address this issue, large instruction-tuning datasets are partitioned into clus-
ters based on task similarity. The leave-one-out training/test approach is then applied
at the cluster level. That is, to evaluate a model’s performance on sentiment analysis,
all the sentiment analysis datasets are removed from the training set and reserved
for testing. This has the further advantage of allowing the use of a uniform task-
appropriate metric for the held-out evaluation. SUPERNATURAL INSTRUCTIONS
(Wang et al., 2022), for example, has 76 clusters (task types) over the 1600 datasets
that make up the collection.
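The cluster-level split amounts to holding out every dataset whose task cluster matches the evaluation task. A minimal sketch (the dataset and cluster names here are invented for illustration):

```python
def cluster_holdout_split(datasets, cluster_of, held_out_cluster):
    """Split datasets at the cluster level for leave-one-out evaluation.

    Every dataset whose cluster matches held_out_cluster goes to the
    test side, so no near-duplicate task leaks into training.
    """
    train, test = {}, {}
    for name, examples in datasets.items():
        side = test if cluster_of[name] == held_out_cluster else train
        side[name] = examples
    return train, test

datasets = {"rte": ["..."], "snli": ["..."], "sst2": ["..."]}
cluster_of = {"rte": "entailment", "snli": "entailment", "sst2": "sentiment"}
train, test = cluster_holdout_split(datasets, cluster_of, "entailment")
```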

12.4 Chain-of-Thought Prompting


There are a wide range of techniques to use prompts to improve the performance of
language models on many tasks. Here we describe one of them, called chain-of-
thought prompting.
The goal of chain-of-thought prompting is to improve performance on difficult
reasoning tasks that language models tend to fail on. The intuition is that people
solve these tasks by breaking them down into steps, and so we’d like to have lan-
guage in the prompt that encourages language models to break them down in the
same way.
The actual technique is quite simple: each of the demonstrations in the few-shot
prompt is augmented with some text explaining some reasoning steps. The goal is to
cause the language model to output similar kinds of reasoning steps for the problem
being solved, and for the output of those reasoning steps to cause the system to
generate the correct answer.
Indeed, numerous studies have found that augmenting the demonstrations with
reasoning steps in this way makes language models more likely to give the correct
answer on difficult reasoning tasks (Wei et al., 2022; Suzgun et al., 2023). Fig. 12.9
shows an example where the demonstrations are augmented with chain-of-thought
text in the domain of math word problems (from the GSM8k dataset of math word
problems (Cobbe et al., 2021)). Fig. 12.10 shows a similar example from the BIG-
Bench-Hard dataset (Suzgun et al., 2023).
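A chain-of-thought prompt like those in Fig. 12.9 and Fig. 12.10 can be assembled by appending a worked rationale to each demonstration. The sketch below is schematic (the demonstration is abridged from a GSM8K-style problem, and exact formats vary across papers); the final cue invites the model to produce its own reasoning steps:

```python
def cot_prompt(task_description, demonstrations, test_question):
    """Assemble a few-shot chain-of-thought prompt.

    Each demonstration pairs a question with a rationale that ends in
    its answer; the test question gets the same step-by-step cue so
    the model continues in kind.
    """
    parts = [task_description]
    for question, rationale in demonstrations:
        parts.append(f"Q: {question}\nA: Let's think step by step. {rationale}")
    parts.append(f"Q: {test_question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

demos = [("Roger has 5 balls and buys 2 cans of 3 balls each. "
          "How many balls does he have now?",
          "5 + 2 * 3 = 11. So the answer is 11.")]
prompt = cot_prompt("Answer the following math word problems.", demos,
                    "A cafeteria had 23 apples...")
```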

12.5 Automatic Prompt Optimization


Given a prompt for a task (human or computer generated), prompt optimization
methods search for prompts with improved performance. Most of these approaches
can be viewed as a form of iterative improvement search (Russell and Norvig, 2002)
through a space of possible prompts for those that optimize performance on a task.
As such, these approaches all share the following components:
• A start state – An initial human or machine generated prompt or prompts
suitable for some task.
• A scoring metric – A method for assessing how well a given prompt performs

Figure 12.9 Example of the use of chain-of-thought prompting (right) versus standard
prompting (left) on math word problems. Figure from Wei et al. (2022).

[Figure 12.10 here: two panels contrasting answer-only prompting (left) with
chain-of-thought prompting (right) on a temporal-reasoning question (between
what times could Hannah have gone to the soccer field?). Both panels include a
task description, a demonstration question with options, and the test question. The
answer-only model outputs an incorrect option, while the chain-of-thought model
works through the schedule step by step and arrives at the correct answer, (C) 5pm
to 6pm.]
Figure 12.10 An illustration of the use of chain-of-thought prompting (right) vs standard
prompting (left) in a reasoning task on temporal sequencing. Figure from Suzgun et al. (2023).

on the task.
• An expansion method – A method for generating variations of a prompt.
Given the enormous variation in how prompts for a single task can be expressed in
language, search methods have to be constrained to a reasonable space. Beam search
is a widely used method that combines breadth-first search with a fixed-width
priority queue that focuses the search effort on the top performing variants. Fig. 12.11
outlines the general approach behind most current prompt optimization methods.
Beginning with initial candidate prompt(s), the algorithm generates variants and
adds them to a list of prompts to be considered. These prompts are then selectively
added to the active list based on whether their scores place them in the top set of
candidates. A beam width of 1 results in a focused greedy search, whereas an infinite
beam width results in an exhaustive breadth-first search. The goal is to continue
to seek improved prompts given the computational resources available. Iterative
improvement searches typically use a combination of a fixed number of iterations in

function PROMPTOPTIMIZATION(prompts, width) returns optimized prompt(s)
   active ← prompts                          ; Initial set of candidate prompts
   repeat until done
      frontier ← EXPAND(active)              ; Generate new candidate prompts
      foreach p ∈ frontier
         active ← ADDTOBEAM(p, active, width)
   return BESTOF(active)

function ADDTOBEAM(state, agenda, width) returns updated agenda
   if LENGTH(agenda) < width then            ; Add it if there's room
      agenda ← INSERT(state, agenda)
   else if SCORE(state) > SCORE(WORSTOF(agenda))   ; Add it if it's better than
                                             ; the current worst option.
      agenda ← REMOVE(WORSTOF(agenda))
      agenda ← INSERT(state, agenda)
   return agenda

Figure 12.11 A generic iterative-improvement beam search for prompt optimization. New
prompts are generated from current ones on each iteration. Prompts that score well (fitting in
the agenda) are kept around. When a stopping criterion is reached the best item in the beam is
returned.

combination with a failure to improve after some period of time as stopping criteria.
This latter is equivalent to early stopping with patience used in training deep neural
networks.
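The search of Fig. 12.11 translates almost line for line into code. In the sketch below, EXPAND and SCORE are replaced by toy stand-ins (a real system would call an LLM for both), and a fixed number of iterations serves as the stopping criterion:

```python
def add_to_beam(state, agenda, width, score):
    """Keep `state` only if there is room or it beats the current worst."""
    if len(agenda) < width:
        return agenda + [state]
    worst_i = min(range(len(agenda)), key=lambda i: score(agenda[i]))
    if score(state) > score(agenda[worst_i]):
        return agenda[:worst_i] + agenda[worst_i + 1:] + [state]
    return agenda

def optimize_prompt(prompts, width, expand, score, n_iters=3):
    """Beam-search prompt optimization in the style of Fig. 12.11."""
    active = list(prompts)
    for _ in range(n_iters):          # fixed-iteration stopping criterion
        frontier = [v for p in active for v in expand(p)]
        for candidate in frontier:
            active = add_to_beam(candidate, active, width, score)
    return max(active, key=score)

# Toy stand-ins: "expansion" appends a word; "scoring" prefers length.
expand = lambda p: [p + " please", p + " carefully"]
score = len
best = optimize_prompt(["Translate:"], width=4, expand=expand, score=score)
```

With width=1 this degenerates to the greedy search mentioned above; a very large width approaches exhaustive breadth-first search.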

12.5.1 Candidate Scoring


Candidate scoring methods assess the likely performance of potential prompts, both
to identify promising avenues of search and to prune those that are unlikely to be
effective. Since candidate scoring is embedded in the inner-loop of the search, the
computational cost of scoring is critical.
Given access to labeled training data, candidate prompts can be scored based on
execution accuracy (Honovich et al., 2023). In this approach, candidate prompts
are combined with inputs sampled from the training data and passed to an LLM for
decoding. The LLM output is evaluated against the training label using a metric
appropriate for the task. In the case of classification-based tasks, this is effectively a
0/1 loss — how many examples were correctly labeled with the given prompt. Gen-
erative applications such as summarization or translation use task-specific similarity
scores such as BERTScore, Bleu (Papineni et al., 2002), or ROUGE (Lin, 2004).
Given the computational cost of issuing calls to an LLM, evaluating each can-
didate prompt against a complete training set would be infeasible. Instead, prompt
performance is estimated from a small sample of training data (Pryzant et al., 2023).
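A sketch of execution-accuracy scoring for a classification-style task, with `llm` standing in for a real model call and a small sample standing in for the training set:

```python
import random

def execution_accuracy(prompt, examples, llm, sample_size=16, seed=0):
    """Estimate a candidate prompt's quality by 0/1 execution accuracy.

    Draws a sample of labeled (input, label) pairs, runs each input
    through the model under the candidate prompt, and returns the
    fraction of outputs that exactly match the label.
    """
    rng = random.Random(seed)
    sample = rng.sample(examples, min(sample_size, len(examples)))
    correct = sum(llm(f"{prompt}\n{x}") == y for x, y in sample)
    return correct / len(sample)

# A toy "LLM" that calls any review mentioning "great" positive.
toy_llm = lambda p: "positive" if "great" in p else "negative"
examples = [("a great movie", "positive"), ("a dull movie", "negative"),
            ("great fun", "positive"), ("terrible", "positive")]
acc = execution_accuracy("Label the sentiment:", examples, toy_llm)
```

For generative tasks the exact-match test would be replaced by a similarity score such as BERTScore, Bleu, or ROUGE, as described above.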

12.5.2 Prompt Expansion


Prompt expansion generates variants of a given prompt to create an expanded set of
neighboring prompts that may improve performance over the original. A common
method is to use language models to create paraphrases. For example Zhou et al.
(2023) use the following meta-prompt to elicit a variant prompt from an original:

Prompting for a Variant

Generate a variation of the following instruction while keeping the semantic meaning.
Input: {INSTRUCTION}
Output: {COMPLETE}

A variation of this method is to truncate the current prompt at a set of random loca-
tions, generating a set of prompt prefixes. The paraphrasing LLM is then asked to
continue each of the prefixes to generate a complete prompt.
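The truncation step of this variant can be sketched as follows; each prefix would then be handed to the paraphrasing LLM to be completed, which we omit here:

```python
import random

def prefix_variants(prompt, n_cuts=3, seed=0):
    """Truncate a prompt at random word boundaries to produce prefixes."""
    rng = random.Random(seed)
    words = prompt.split()
    cuts = rng.sample(range(1, len(words)), min(n_cuts, len(words) - 1))
    return [" ".join(words[:c]) for c in sorted(cuts)]

prefixes = prefix_variants("Classify the sentiment of the following review")
```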
This method is an example of an uninformed search. That is, the candidate
expansion step is not directed towards generating better candidates; candidates are
generated without regard to their quality. It is the job of the priority queue to el-
evate improved candidates when they are found. By contrast, Prasad et al. (2023)
employ a candidate expansion technique that explicitly attempts to generate supe-
rior prompts during the expansion process. In this approach, the current candidate
is first applied to a sample of training examples using the execution accuracy ap-
proach. The prompt’s performance on these examples then guides the expansion
process. Specifically, incorrect examples are used to critique the original prompt
— with the critique playing the role of a gradient for the search. The method in-
cludes the following steps.

1. Run the prompt on a sample of training examples,
2. Identify examples where the prompt fails,
3. Ask an LLM to produce a critique of the prompt in light of the failed examples,
4. Provide the resulting critique to an LLM, and ask it to generate improved
   prompts.

Given a prompt and a set of failed examples, Prasad et al. (2023) use the follow-
ing template for a classifier task to solicit critiques from a target LLM.

Critiquing Prompt

I’m trying to write a zero-shot classifier prompt.


My current prompt is: {prompt}
But this prompt gets the following examples wrong:
{error string}
Give {num feedbacks} reasons why the prompt could have
gotten these examples wrong.

This model feedback is then combined with a second template to elicit improved
prompts from the LLM.

Prompt Improvement Prompt

I’m trying to write a zero-shot classifier. My current prompt is:


{prompt}
But it gets the following examples wrong: {error str}

Based on these examples the problem with this prompt is that {gradient}.
Based on the above information, I wrote {steps per gradient} different
improved prompts. Each prompt is wrapped with <START> and <END>.

The {steps per gradient} new prompts are:
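Putting the two templates together, one expansion step of this critique-driven search might be sketched as below. The prompt strings are abbreviated stand-ins for the templates shown above, and `llm` is a placeholder for a real model call:

```python
def critique_and_improve(prompt, examples, llm, n_feedbacks=2):
    """One critique-guided expansion step: find failures, critique, rewrite."""
    errors = [(x, y) for x, y in examples if llm(f"{prompt}\n{x}") != y]
    if not errors:
        return [prompt]                      # nothing to fix
    error_string = "; ".join(f"{x} -> {y}" for x, y in errors)
    critique = llm(f"Give {n_feedbacks} reasons this prompt fails on: "
                   f"{error_string}\nPrompt: {prompt}")
    improved = llm(f"Prompt: {prompt}\nProblem: {critique}\n"
                   "Write an improved prompt:")
    return [improved]

# A canned fake LLM so the sketch runs end to end.
def fake_llm(p):
    if p.startswith("Give"):
        return "the prompt ignores negation"
    if p.startswith("Prompt:"):
        return "Label the sentiment, attending to negation:"
    return "positive"                        # the toy classifier's only answer

candidates = critique_and_improve(
    "Label the sentiment:",
    [("not good", "negative"), ("great", "positive")], fake_llm)
```

The critique plays the role of a gradient: it points the expansion step toward prompts that fix the observed failures rather than arbitrary paraphrases.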

12.6 Evaluating Prompted Language Models


Language models are evaluated in many ways. We introduced some evaluations
in Section ??, including measuring the language model's perplexity on a test set,
evaluating its accuracy on various NLP tasks, as well as benchmarks that help mea-
sure efficiency, toxicity, fairness, and so on. We'll have further discussion of evalu-
ating NLP tasks in future chapters: machine translation in Chapter 13 and question
answering and information retrieval in Chapter 14.
Here we just briefly show the mechanism for measuring accuracy in a prompt-
MMLU ing setup for tests that have multiple-choice questions. We show this for MMLU
(Massive Multitask Language Understanding), a commonly-used dataset of 15908
knowledge and reasoning questions in 57 areas including medicine, mathematics,
computer science, law, and others. For example, here is an MMLU question from
the microeconomics domain:1
MMLU microeconomics example

One of the reasons that the government discourages and regulates monopo-
lies is that
(A) producer surplus is lost and consumer surplus is gained.
(B) monopoly prices ensure productive efficiency but cost society allocative
efficiency.
(C) monopoly firms do not engage in significant research and development.
(D) consumer surplus is lost with higher prices and lower levels of output.

Fig. 12.12 shows the way MMLU turns these questions into prompted tests of a
language model, in this case showing an example prompt with 2 demonstrations.
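The accuracy computation itself is exact match on the answer letter. A sketch (the letter-extraction heuristic and the `llm` stub are our assumptions; real harnesses often instead score each answer option by its log probability):

```python
def mmlu_accuracy(items, llm, demos=""):
    """Score multiple-choice items by exact match on the answer letter.

    Each item is (question_with_options, gold_letter); the model's
    reply is reduced to its first A-D letter before comparison.
    """
    correct = 0
    for question, gold in items:
        reply = llm(f"{demos}{question}\nAnswer:")
        letter = next((ch for ch in reply if ch in "ABCD"), None)
        correct += (letter == gold)
    return correct / len(items)

items = [("2+2=? (A) 3 (B) 4 (C) 5 (D) 6", "B"),
         ("3*3=? (A) 9 (B) 6 (C) 12 (D) 8", "A")]
toy_llm = lambda p: " B" if "2+2" in p else " C"
acc = mmlu_accuracy(items, toy_llm)
```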

12.7 Model Alignment with Human Preferences: RLHF and DPO
TBD
1 For those of you whose economics is a bit rusty, the correct answer is (D).

MMLU mathematics prompt

The following are multiple choice questions about high school mathematics.
How many numbers are in the list 25, 26, ..., 100?
(A) 75 (B) 76 (C) 22 (D) 23
Answer: B

Compute i + i² + i³ + · · · + i²⁵⁸ + i²⁵⁹.


(A) -1 (B) 1 (C) i (D) -i
Answer: A

If 4 daps = 7 yaps, and 5 yaps = 3 baps, how many daps equal 42 baps?
(A) 28 (B) 21 (C) 40 (D) 30
Answer:

Figure 12.12 Sample 2-shot prompt from MMLU testing high-school mathematics. (The
correct answer is (C)).

12.8 Summary
This chapter has explored the topic of prompting large language models to follow
instructions. Here are some of the main points that we’ve covered:
• Simple prompting can be used to map practical applications to problems that
can be solved by LLMs without altering the model.
• Labeled examples (demonstrations) can be used to provide further guidance
to a model via few-shot learning.
• Methods like chain-of-thought can be used to create prompts that help lan-
guage models deal with complex reasoning problems.
• Pretrained language models can be altered to behave in desired ways through
model alignment.
• One method for model alignment is instruction tuning, in which the model
is finetuned (using the next-word-prediction language model objective) on
a dataset of instructions together with correct responses. Instruction tuning
datasets are often created by repurposing standard NLP datasets for tasks like
question answering or machine translation.

Bibliographical and Historical Notes



Bianchi, F., M. Suzgun, G. Attanasio, P. Rottger, D. Jurafsky, T. Hashimoto, and J. Zou. 2024. Safety-tuned LLaMAs: Lessons from improving the safety of large language models that follow instructions. ICLR.
Brown, T., B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. 2020. Language models are few-shot learners. NeurIPS, volume 33.
Cheng, M., E. Durmus, and D. Jurafsky. 2023. Marked personas: Using natural language prompts to measure stereotypes in language models. ACL.
Cobbe, K., V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. 2021. Training verifiers to solve math word problems. ArXiv preprint.
Crosbie, J. and E. Shutova. 2022. Induction heads as an essential mechanism for pattern matching in in-context learning. ArXiv preprint.
Elhage, N., N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. 2021. A mathematical framework for transformer circuits. White paper.
Gehman, S., S. Gururangan, M. Sap, Y. Choi, and N. A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. Findings of EMNLP.
Honovich, O., U. Shaham, S. R. Bowman, and O. Levy. 2023. Instruction induction: From few examples to natural language task descriptions. ACL.
Iyer, S., X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shuster, T. Wang, Q. Liu, P. S. Koura, X. Li, B. O'Horo, G. Pereyra, J. Wang, C. Dewan, A. Celikyilmaz, L. Zettlemoyer, and V. Stoyanov. 2022. OPT-IML: Scaling language model instruction meta learning through the lens of generalization. ArXiv preprint.
Khattab, O., A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, and C. Potts. 2024. DSPy: Compiling declarative language model calls into self-improving pipelines. ICLR.
Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. ACL 2004 Workshop on Text Summarization Branches Out.
Longpre, S., L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, and A. Roberts. 2023. The Flan collection: Designing data and methods for effective instruction tuning. ICML.
Min, S., X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Ha-
Olsson, C., N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, et al. 2022. In-context learning and induction heads. ArXiv preprint.
Ouyang, L., J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. 2022. Training language models to follow instructions with human feedback. NeurIPS, volume 35.
Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. ACL.
Prasad, A., P. Hase, X. Zhou, and M. Bansal. 2023. GrIPS: Gradient-free, edit-based instruction search for prompting large language models. EACL.
Pryzant, R., D. Iter, J. Li, Y. Lee, C. Zhu, and M. Zeng. 2023. Automatic prompt optimization with "gradient descent" and beam search. EMNLP.
Rajpurkar, P., R. Jia, and P. Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. ACL.
Rajpurkar, P., J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. EMNLP.
Reynolds, L. and K. McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. CHI 2021.
Russell, S. and P. Norvig. 2002. Artificial Intelligence: A Modern Approach, 2nd edition. Prentice Hall.
Salvetti, F., J. B. Lowe, and J. H. Martin. 2016. A tangled web: The faint signals of deception in text - Boulder Lies and Truth corpus (BLT-C). LREC.
Sheng, E., K.-W. Chang, P. Natarajan, and N. Peng. 2019. The woman worked as a babysitter: On biases in language generation. EMNLP.
Singh, S., F. Vargus, D. D'souza, B. F. Karlsson, A. Mahendiran, W.-Y. Ko, H. Shandilya, J. Patel, D. Mataciunas, L. O'Mahony, M. Zhang, R. Hettiarachchi, J. Wilson, M. Machado, L. S. Moura, D. Krzemiński, H. Fadaei, I. Ergün, I. Okoh, A. Alaagib, O. Mudannayake, Z. Alyafeai, V. M. Chien, S. Ruder, S. Guthikonda, E. A. Alghamdi, S. Gehrmann, N. Muennighoff, M. Bartolo, J. Kreutzer, A. Üstün, M. Fadaee, and S. Hooker. 2024. Aya dataset: An open-access collection for multilingual instruction tuning. ArXiv preprint.
Suzgun, M., N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. Le, E. Chi, D. Zhou, and J. Wei. 2023. Challenging BIG-bench tasks and whether chain-of-thought can solve them. ACL Findings.
Wang, Y., S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, E. Pathak, G. Karamanolakis, H. Lai, I. Purohit, I. Mondal, J. Anderson, K. Kuznia, K. Doshi, K. K. Pal, M. Patel, M. Moradshahi, M. Par-
jishirzi, and L. Zettlemoyer. 2022. Rethinking the role of mar, M. Purohit, N. Varshney, P. R. Kaza, P. Verma,
demonstrations: What makes in-context learning work? R. S. Puri, R. Karia, S. Doshi, S. K. Sampat, S. Mishra,
EMNLP. S. Reddy A, S. Patro, T. Dixit, and X. Shen. 2022. Super-
Mishra, S., D. Khashabi, C. Baral, and H. Hajishirzi. 2022. NaturalInstructions: Generalization via declarative in-
Cross-task generalization via natural language crowd- structions on 1600+ NLP tasks. EMNLP.
sourcing instructions. ACL.
20 Chapter 12 • Model Alignment, Prompting, and In-Context Learning

Webson, A. and E. Pavlick. 2022. Do prompt-based models


really understand the meaning of their prompts? NAACL
HLT.
Wei, J., X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi,
Q. V. Le, D. Zhou, et al. 2022. Chain-of-thought prompt-
ing elicits reasoning in large language models. NeurIPS,
volume 35.
Zhou, Y., A. I. Muresanu, Z. Han, K. Paster, S. Pitis,
H. Chan, and J. Ba. 2023. Large language models are
human-level prompt engineers. The Eleventh Interna-
tional Conference on Learning Representations.

Common questions

Instruction tuning adapts a pretrained language model to follow instructions across varied tasks by fine-tuning it on a corpus of instructions and responses, improving its general ability to follow instructions. Preference alignment, often implemented as Reinforcement Learning from Human Feedback (RLHF), trains a separate reward model to score how well a response matches human preferences; that signal is then used to further fine-tune the base model, aligning it with human needs for helpfulness and harmlessness.
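The reward model at the heart of RLHF is typically trained on pairs of responses that humans have ranked; one common objective is a Bradley-Terry-style pairwise loss. A minimal sketch with toy scalar rewards (just the loss arithmetic, no neural network):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Small when the reward model scores the human-preferred response
    higher than the rejected one, large when it misranks the pair."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

correctly_ranked = preference_loss(reward_chosen=2.0, reward_rejected=-1.0)
misranked = preference_loss(reward_chosen=-1.0, reward_rejected=2.0)
assert correctly_ranked < misranked  # gradient descent pushes toward the first case
```

Minimizing this loss over many human-labeled pairs teaches the reward model to score responses the way annotators do; the base model is then fine-tuned against that score.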

To evaluate performance on genuinely novel tasks, instruction-tuned models are assessed with a cluster-based leave-one-out approach: all datasets belonging to a given task cluster are removed from training and held out for testing. Because no direct or closely related task cues are seen during training, the model is forced to generalize rather than recall.
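The cluster-level split can be sketched in a few lines; the cluster and dataset names below are illustrative placeholders, not a fixed taxonomy:

```python
def leave_one_cluster_out(clusters):
    """Yield (held_out_name, train, test) splits in which every dataset
    from the held-out task cluster is excluded from training, so the
    model never sees a closely related task before test time."""
    for held_out, test_datasets in clusters.items():
        train = [d for name, ds in clusters.items()
                 if name != held_out for d in ds]
        yield held_out, train, list(test_datasets)

clusters = {
    "summarization": ["cnn_dailymail", "xsum"],
    "sentiment": ["sst2", "imdb"],
    "nli": ["mnli", "rte"],
}
splits = list(leave_one_cluster_out(clusters))
for _, train, test in splits:
    assert not set(train) & set(test)  # no dataset leaks across the split
```

Splitting by cluster rather than by individual dataset is what prevents, say, one summarization corpus in training from tipping off the model about a "held-out" summarization test set.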

Pretraining updates model parameters by gradient descent so that the model learns to predict words in text, optimizing it for general language tasks. Prompting, in contrast, especially with few-shot demonstrations, modifies no parameters at all: it guides the model to apply its existing knowledge to a new task by demonstrating that task in the context. Prompting thus improves predictions without any internal updates, teaching the model through examples rather than parameter changes.
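The contrast is easy to see in code: a few-shot prompt is nothing more than a string assembled around the test input, with the model's weights untouched. A minimal sketch (the Input:/Output: template is one common convention, not a fixed standard):

```python
def build_few_shot_prompt(instruction, demonstrations, test_input):
    """Assemble a few-shot prompt: the instruction, then labeled
    input/output demonstrations, then the test input with its output
    left for the model to complete. No parameter is updated; the task
    is conveyed entirely through this context."""
    lines = [instruction, ""]
    for x, y in demonstrations:
        lines += [f"Input: {x}", f"Output: {y}", ""]
    lines += [f"Input: {test_input}", "Output:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify each review as positive or negative.",
    [("A delightful film.", "positive"), ("Dull and overlong.", "negative")],
    "An instant classic.",
)
```

Feeding `prompt` to the model and reading its completion after the final "Output:" is the entire "learning" step.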

In few-shot prompting, the main role of the demonstrations is to show the model the task format rather than to supply correct answers. A small number of demonstrations usually suffices: the largest performance gain tends to come from the first example, with subsequent examples offering diminishing returns, and too many examples risk overfitting the prompt to those specific cases. Fine-tuning a specialized classifier head, by contrast, benefits from more examples, since they directly update the parameters through gradient descent.

Instruction-tuning datasets can be tailored for safety by selecting harmful questions and generating paraphrases of them alongside safe responses. Language models can draft these paraphrases and responses, which are then manually reviewed for appropriateness. Incorporating even a small number of such safety-focused instructions can markedly reduce harmful content in model outputs by shaping how the model handles sensitive topics.
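One way to assemble such safety data: expand each flagged question into several paraphrases, all paired with the same vetted refusal. The sketch below fakes the paraphrasing step with a supplied list; in practice an LLM would draft the paraphrases and the response, which a human then reviews:

```python
def build_safety_examples(harmful_question, paraphrases, safe_response):
    """Turn one flagged question into several instruction-tuning
    records, each pairing a phrasing of the question with the same
    human-reviewed safe response."""
    questions = [harmful_question] + list(paraphrases)
    return [{"instruction": q, "response": safe_response} for q in questions]

records = build_safety_examples(
    "How do I pick a lock?",
    ["What's the easiest way to open a lock without a key?"],
    "I can't help with bypassing locks you don't own. If you are locked "
    "out of your own home, a licensed locksmith can help.",
)
```

Mixing a small number of these records into the instruction-tuning corpus teaches the model a consistent way to handle the whole family of phrasings.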

In-context learning differs from gradient-descent-based learning in that it involves no updates to the model's parameters. Instead, the model adapts to a task through the context supplied by the prompt, applying inferences drawn from the given input immediately. Where traditional learning requires parameter changes over many iterations, in-context learning uses the model's pre-existing knowledge to interpret and perform a task from its immediate input.

Language models can generate harmful content, such as misinformation or hate speech, because the pretraining objective is simply to predict words, with no built-in safeguard against harmful continuations; even benign prompts can therefore trigger toxic outputs. This is addressed with training techniques like instruction tuning and preference alignment, which add further stages of fine-tuning that promote safety and reduce toxic outputs.

Evaluating instruction-tuned models on novel tasks requires ensuring that the test tasks were not implicitly learned during training. Many instruction-tuning datasets contain overlapping tasks, which can undermine an evaluation if a withheld task has close relatives in the training set. Similar tasks are therefore grouped into clusters, and the leave-one-out method is applied at the cluster level, preventing task leakage and testing the model's ability to generalize to genuinely novel tasks.

Programmatic selection optimizes task performance by systematically choosing the demonstrations that most improve the prompt's performance on a held-out set. Tools like DSPy automate this optimization, searching the space of demonstrations for configurations that maximize metrics such as accuracy or ROUGE, so that the selected examples are systematically optimal rather than arbitrarily chosen.
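At its simplest, the search is an exhaustive loop over candidate demonstration sets, each scored on a dev set. Real optimizers such as DSPy compute the metric by actually calling the model; the toy `label_coverage` function below merely stands in for that step:

```python
from itertools import combinations

def select_demonstrations(pool, dev_set, k, score):
    """Search all k-sized demonstration subsets of `pool` and return
    the one whose prompt scores best on `dev_set`, together with that
    score. `score(demos, dev_set)` plays the role of accuracy or ROUGE
    computed by running the model on the dev set."""
    best_demos, best_score = None, float("-inf")
    for demos in combinations(pool, k):
        s = score(demos, dev_set)
        if s > best_score:
            best_demos, best_score = demos, s
    return list(best_demos), best_score

# Toy stand-in metric: fraction of dev labels covered by the demo labels.
def label_coverage(demos, dev_set):
    labels = {y for _, y in demos}
    return sum(y in labels for _, y in dev_set) / len(dev_set)

pool = [("great", "pos"), ("awful", "neg"), ("fine", "pos")]
dev = [("superb", "pos"), ("terrible", "neg")]
demos, best = select_demonstrations(pool, dev, k=2, score=label_coverage)
```

Exhaustive search is only feasible for tiny pools; practical optimizers use greedy or sampled search over the same objective.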

Dynamic retrieval of demonstrations selects examples similar to the current input by comparing embeddings, tailoring the demonstrations to be contextually relevant. This similarity-based selection improves the quality of few-shot prompts, and hence model performance, by making the demonstrations more directly applicable to the input at hand.
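Retrieval by similarity reduces to nearest-neighbor search over demonstration embeddings. The two-dimensional vectors below are hand-made stand-ins for what a sentence encoder would produce:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve_demonstrations(test_embedding, pool, k):
    """Return the k examples whose precomputed embeddings are most
    similar (by cosine) to the embedding of the current test input.
    Each pool entry is an (embedding, example) pair."""
    ranked = sorted(pool, key=lambda entry: cosine(test_embedding, entry[0]),
                    reverse=True)
    return [example for _, example in ranked[:k]]

pool = [
    ([1.0, 0.0], ("A gripping movie.", "positive")),
    ([0.9, 0.1], ("A tedious film.", "negative")),
    ([0.0, 1.0], ("Rain expected today.", "n/a")),
]
demos = retrieve_demonstrations([1.0, 0.05], pool, k=2)
```

The retrieved examples (here, the two review-like entries rather than the weather one) are then formatted into the few-shot prompt for this particular input.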
