Prompting in Machine Translation
All rights reserved. Draft of January 12, 2025.
CHAPTER
In this chapter we show how to get LLMs to do tasks for us simply by talking to
them. To get an LLM to translate a sentence, outline a talk, or draft a work email,
we’ll simply describe what we want in natural language. We call these instructions
we give to language models prompts.
Prompting relies on contextual generation. Given the prompt as context, the lan-
guage model generates the next token based on its token probability, conditioned on
the prompt: P(w_i | w_{<i}). A prompt can be a question (like “What is a transformer net-
work?”), possibly in a structured format (like “Q: What is a transformer network?
A:”), or can be an instruction (like “Translate the following sentence into Hindi:
‘Chop the garlic finely’”). A prompt can also contain demonstrations, examples to
help make the instructions clearer (like “Give the sentiment of the following sentence. Example Input: ‘I really loved Taishan Cuisine.’ Output: positive”). As we’ll
see, prompting can be applied to inherently generative tasks (like summarization and
translation) as well as to ones more naturally thought of as classification tasks.
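The contextual-generation loop can be sketched with a toy stand-in for the model: a hand-built probability table plays the role of the network's P(w_i | w_{<i}), and generation greedily appends the most probable next token. The table entries below are invented purely for illustration.

```python
# A toy sketch of contextual generation. The "model" is a hand-built
# table of conditional probabilities P(w_i | w_{<i}), keyed here on just
# the last token of the context; a real LLM conditions on the whole prompt.
TOY_PROBS = {
    "was": {"not": 0.5, "very": 0.3, "fine": 0.2},
    "not": {"enjoyable": 0.6, "good": 0.4},
}

def greedy_continue(prompt_tokens, n_tokens):
    """Greedily append the most probable next token, conditioned on the prompt."""
    tokens = list(prompt_tokens)
    for _ in range(n_tokens):
        dist = TOY_PROBS.get(tokens[-1])
        if dist is None:                  # no known continuation for this context
            break
        tokens.append(max(dist, key=dist.get))
    return tokens

print(greedy_continue(["our", "stay", "was"], 2))
# ['our', 'stay', 'was', 'not', 'enjoyable']
```

A real model would compute each distribution with a forward pass over the full context rather than a table lookup.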
Prompts get language models to generate text, but they also can be viewed as
a learning signal, because these demonstrations can help language models learn
to perform novel tasks. For this reason we also refer to prompting as in-context learning—learning that improves model performance or reduces some loss but does not involve gradient-based updates to the model’s underlying parameters.
But LLMs as we’ve described them so far turn out to be bad at following instruc-
tions. Pretraining isn’t sufficient to make them helpful. We’ll introduce instruction tuning, a technique that helps LLMs learn to correctly respond to instructions by finetuning them on a corpus of instructions with their corresponding responses.
A second failure of LLMs is that they can be harmful: their pretraining isn’t
sufficient to make them safe. Readers who know Arthur C. Clarke’s 2001: A Space
Odyssey or the Stanley Kubrick film know that the quote above comes in the context
that the artificial intelligence Hal becomes paranoid and tries to kill the crew of the
spaceship. Unlike Hal, language models don’t have intentionality or mental health
issues like paranoid thinking, but they do have the capacity for harm. Pretrained lan-
guage models can say things that are dangerous or false (like giving unsafe medical
advice) and they can verbally attack users or say toxic or hateful things.
Dealing with safety can be done partly by adding safety training into instruction
tuning. But an important aspect of safety training is a second technique, preference alignment (often implemented, as we’ll see, with the RLHF or DPO algorithms), in which a separate model is trained to decide how much a candidate response aligns with human preferences. Together we refer to instruction tuning and preference alignment as model alignment. The intuition is that we want the learning objectives of models to be aligned with the goals of the humans that use them.
CHAPTER 12 • Model Alignment, Prompting, and In-Context Learning
12.1 Prompting
A prompt is a text string that a user issues to a language model to get the model
to do something useful. In prompting, the user’s prompt string is passed to the
language model, which iteratively generates tokens conditioned on the prompt. Thus
the prompt creates a context that guides LLMs to generate useful outputs to achieve
some user goal. The process of finding effective prompts for a task is known as prompt engineering.
Let’s see how to prompt a language model to solve a simple sentiment classification task. Consider this hotel review from the BLT corpus (Salvetti et al., 2016):
Did not like the service that I was provided, when I entered the hotel. I also
did not like the area, in which the hotel was located. Too much noise and
events going on for me to feel relax.
We can get the model to classify the sentiment of this text by taking the text and
appending an incomplete statement to the review like In short, our stay was:
A prompt consisting of a review plus an incomplete statement
Did not like the service that I was provided, when I entered the hotel. I also
did not like the area, in which the hotel was located. Too much noise and
events going on for me to feel relax. In short, our stay was
As we can see, the overall negative context of the review results in negative
completions. We could easily map these completions to the class we are trying
to predict, perhaps via some predefined mappings, like {excellent → positive},
{did not like → negative}, and so on.
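The mapping from free-text completions to class labels can be sketched as a simple lookup. The phrases and labels below are illustrative examples, not a standard mapping set.

```python
# A minimal sketch of mapping model completions to sentiment classes
# via predefined phrase-to-label mappings like {did not like -> negative}.
COMPLETION_TO_CLASS = {
    "excellent": "positive",
    "loved": "positive",
    "did not like": "negative",
    "not enjoyable": "negative",
}

def completion_to_label(completion):
    """Return the label of the first predefined phrase found in the completion."""
    text = completion.lower()
    for phrase, label in COMPLETION_TO_CLASS.items():
        if phrase in text:
            return label
    return "unknown"

print(completion_to_label("Not enjoyable at all"))   # negative
```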
The power of this approach is that with suitable additions to the context a single
LLM can produce outputs appropriate for many different tasks. For example, given
Original Review ($INPUT) Did not like the service that I was provided,
when I entered the hotel. I also did not like
the area, in which the hotel was located. Too
much noise and events going on for me to feel
relax and away from the city life.
Sentiment Prompt: $INPUT + In short, our stay was
Output: not enjoyable
Fine-grained Sentiment Prompt: $INPUT + These aspects were important to
the reviewer:
Output: 1. Poor service 2. Unpleasant location
3. Noisy and busy area
Summarization Prompt: $INPUT + tl;dr
Output: I had a bad experience with the hotel’s
service and the location was loud and busy.
Translation Prompt: $INPUT + Translate this to French
Output: Je n’ai pas aimé le service qui m’a été
offert lorsque je suis entré dans l’hôtel. Je
n’ai également pas aimé la zone dans laquelle se
trouvait l’hôtel. Trop de bruit et d’événements
pour que je me sente détendu et loin de la vie
citadine.
Figure 12.1 LLM outputs for simple prompts for sentiment, summarization and translation for an input text.
Definition: This task is about writing a correct answer for the reading comprehension task.
Based on the information provided in a given passage, you should identify the shortest
continuous text span from the passage that serves as an answer to the given question. Avoid
answers that are incorrect or provides incomplete justification for the question.
Figure 12.2 A prompt for extractive question answering, from an example from the SQuAD 2.0 dataset
(Rajpurkar et al., 2018). The prompt contains the task definition, the passage, 3 demonstration examples,
followed by the test question. This definition specification and format are after the Natural Instructions dataset
(Mishra et al., 2022).
for any particular question. In fact, demonstrations that have incorrect answers can
still improve a system (Min et al., 2022; Webson and Pavlick, 2022). Adding too
many examples seems to cause the model to overfit to details of the exact examples
chosen and generalize poorly.
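Assembling a prompt from an instruction, demonstrations, and a test input is mostly string concatenation; the Input:/Output: layout below is one common convention, not a fixed standard.

```python
# Sketch: build a few-shot prompt from labeled demonstrations.
def build_few_shot_prompt(instruction, demonstrations, test_input):
    parts = [instruction]
    for demo_input, demo_output in demonstrations:
        parts.append(f"Input: {demo_input}\nOutput: {demo_output}")
    # leave the final Output: slot empty for the model to complete
    parts.append(f"Input: {test_input}\nOutput:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    "Give the sentiment of the following sentence.",
    [("I really loved Taishan Cuisine.", "positive")],
    "Did not like the service that I was provided.",
)
print(prompt)
```

As the text notes, only a few demonstrations are usually needed; adding many more risks overfitting to their details.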
the earlier A, by increasing the probability that B will occur next. Fig. 12.3 shows an
example.
Figure 12.3 An induction head. In the sequence “...vintage looking cars ... vintage”, at vintage an induction head uses the prefix matching mechanism to identify the initial occurrence of “vintage”, and the copying mechanism to predict that cars will occur again. Figure from Crosbie and Shutova (2022).
Olsson et al. (2022) propose that a generalized fuzzy version of this pattern completion rule, implementing a rule like A*B*...A* → B*, A* ≈ A and B* ≈ B (by ≈ we mean they are semantically similar in some way), might be responsible for in-context learning. Suggestive evidence for their hypothesis comes from Crosbie and Shutova (2022), who show that ablating induction heads causes in-context learning performance to decrease. Ablation is originally a medical term meaning the removal of something. We use it in NLP interpretability studies as a tool for testing causal effects; if we knock out a hypothesized cause, we would expect the effect to disappear. Crosbie and Shutova (2022) ablate induction heads by first finding attention heads that perform as induction heads on random input tokens, and then zeroing out the output of these heads by setting certain terms of the output matrix W_O to zero. Indeed they find that ablated models are much worse at in-context learning: they have much worse performance at learning from demonstrations in the prompts.
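The exact pattern completion rule AB...A → B can be sketched directly over a token list: scan backwards for an earlier occurrence of the current token (prefix matching) and predict the token that followed it (copying). This illustrates only the behavioral rule, not the attention-head mechanics.

```python
# Sketch of the induction pattern-completion rule AB...A -> B.
def induction_predict(tokens):
    """Predict the next token by copying what followed the current token last time."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan earlier context, right to left
        if tokens[i] == current:               # prefix matching
            return tokens[i + 1]               # copying
    return None                                # no earlier occurrence

print(induction_predict(["the", "cat", "sat", "on", "the", "cat"]))  # sat
```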
12.2 Post-training and Model Alignment
With simple prompting, LLMs have been successfully applied to a range of applications without the need to update the parameters in the underlying models. Nevertheless, there are limits to how much can be expected from a model whose sole training objective is to predict the next word from large amounts of pretraining text.
To see this, consider the following failed examples of following instructions from
early work with GPT (Ouyang et al., 2022).
Prompt: Explain the moon landing to a six year old in a few sentences.
Output: Explain the theory of gravity to a 6 year old.
Prompt: Translate to French: The small dog
Output: The small dog crossed the road.
Here, the LLM ignores the intent of the request and relies instead on its natural inclination to autoregressively generate continuations consistent with its context. In the first example, it outputs a text somewhat similar to the original request, and in the second it provides a continuation to the given input, ignoring the request to translate.
LLMs are not sufficiently helpful: they need extra training to increase their abilities
to follow textual instructions.
A deeper problem is that LLMs can simultaneously be too harmful. Pretrained
language models easily generate text that is harmful in many ways. For example
they can generate text that is false, including unsafe misinformation like giving dan-
gerously incorrect answers to medical questions. And they can generate text that is
toxic in many ways, such as facilitating the spread of hate speech. Gehman et al.
(2020) show that even completely non-toxic prompts can lead large language mod-
els to output hate speech and abuse their users. Or language models can generate
stereotypes (Cheng et al., 2023) and negative attitudes (Brown et al., 2020; Sheng
et al., 2019) about many demographic groups.
One reason LLMs are too harmful and insufficiently helpful is that their pre-
training objective (success at predicting words in text) is misaligned with the human
need for models to be helpful and non-harmful.
In an attempt to address these two problems, language models generally include two additional kinds of training for model alignment: methods designed to adjust LLMs to better align them to human needs for models to be helpful and non-harmful.
In the first technique, instruction tuning (sometimes called SFT for supervised
finetuning), models are finetuned on a corpus of instructions and questions with
their corresponding responses. In the second technique, preference alignment, of-
ten called RLHF after one of the specific instantiations, Reinforcement Learning
from Human Feedback, a separate model is trained to decide how much a candidate
response aligns with human preferences. This model is then used to finetune the
base model.
We’ll use the term base model to mean a model that has been pretrained but hasn’t yet been aligned either by instruction tuning or RLHF. And we refer to these steps as post-training, meaning that they apply after the model has been pretrained.
Figure 12.4 Ways of adapting a pretrained LLM: continued finetuning, which trains all parameters on data from the finetuning domain with the next-word-prediction objective; parameter-efficient finetuning (e.g., LoRA), which trains only new parameters A and B on the finetuning domain; task-specific finetuning (as with MLM-pretrained models), which trains only a classification head on the finetuning task with a supervised task loss; and instruction tuning (SFT), which tunes on diverse supervised instruction tasks with the next-word-prediction objective and is evaluated on unseen tasks.
scriptions similar to the prompts we’ve already seen such as Answer the following
question, Translate the following text to Arapaho, or Summarize this report. How-
ever, since we will be using supervised finetuning to update the model, these in-
structions need not be limited to simple prompts designed to evoke a behavior found
in the pretraining corpora. Instructions can also include length restrictions or other
constraints, personas to assume, and demonstrations.
Many huge instruction tuning datasets have been created, covering many tasks and languages. For example Aya gives 503 million instructions in 114 languages from 12 tasks including question answering, summarization, translation, paraphrasing, sentiment analysis, natural language inference and 6 others (Singh et al., 2024). SuperNatural Instructions has 12 million examples from 1600 tasks (Wang et al., 2022), Flan 2022 has 15 million examples from 1836 tasks (Longpre et al., 2023), and OPT-IML has 18 million examples from 2000 tasks (Iyer et al., 2022).
These instruction-tuning datasets are created in four ways. The first is for people to write the instances directly. For example, part of the Aya instruct finetuning corpus (Fig. 12.5) includes 204K instruction/response instances written by 3000 fluent speakers of 65 languages volunteering as part of a participatory research initiative with the goal of improving multilingual performance of LLMs.
Figure 12.5 Examples from the Aya instruction tuning corpus:
Lang | Prompt | Completion
fra | Qui a écrit le livre La Sagouine? | Antonine Maillet a écrit La Sagouine en 1971. Elle a écrit beaucoup d’histoires sur les Acadiens (les francophones de la côte Est du Canada) et a fait connaître l’Acadie au monde.
ibo | Kedu ka ndị Igbo si eme emume ji ọhụrụ? | Emume ji ọhụrụ n’omenala Igbo, bụ emume na-egosi njedebe oge ọrụ ugbo, na-amalite n’emume ekele ma tinyekwa oriri na egwu. Ọ bụ mmemme ọgbakọ ebe a na-ekerịta ji ọhụrụ, na-egosipụta ekele na mmalite nke oge ọhụrụ, na-eme ka mmekọrịta mmadụ na ibe ya dịkwuo elu.
por | Qual é a capital do estado da Bahia? | A capital do estado da Bahia, no Brasil, é Salvador. Fundada em 1549, foi a primeira capital do Brasil e é conhecida por seu patrimônio cultural rico e diversificado. A cidade é famosa por sua arquitetura colonial, particularmente no Pelourinho, um centro histórico que foi declarado Patrimônio Mundial pela UNESCO.
pes | ﺩﺭ ﺑﺮﺝ ﮐﺒﻮﺗﺮﺧﺎﻧﻪ ﻣﻴﺒﺪ ﭼﻨﺪ ﻻﻧﻪ ﮐﺒﻮﺗﺮ ﻭﺟﻮﺩ ﺩﺍﺭﺩ؟ | ﮐﺒﻮﺗﺮﺧﺎﻧﻪ ﻣﻴﺒﺪ ﺑﺮﺟﯽ ﺍﺳﺘﻮﺍﻧﻪﺍﯼ ﺍﺳﺖ ﮐﻪ ﺩﺍﺧﻞ ﺁﻥ ﭼﻬﺎﺭﻫﺰﺍﺭ ﻻﻧﻪ ﮐﺒﻮﺗﺮ ﻭﺟﻮﺩ ﺩﺍﺭﺩ ﻭ ﻧﻮﺍﺭ ﺳﻔﻴﺪ ﺭﻧﮕﯽ ﺑﺮﺍﯼ ﺟﺬﺏ ﮐﺒﻮﺗﺮﺍﻥ ﺩﻭﺭﺗﺎﺩﻭﺭ ﺑﺮﺝ ﮐﺸﻴﺪﻩ ﺷﺪﻩ ﺍﺳﺖ. ﺍﻳﻦ ﺑﺮﺝ ﺩﺭ ﺑﺮﺍﺑﺮ ﺧﻄﺮ ﺣﻤﻠﻪ ﺩﻳﮕﺮ ﺣﻴﻮﺍﻧﺎﺕ ﺑﻪ ﮐﺒﻮﺗﺮﻫﺎ ﺑﺴﻴﺎﺭ ﺍﻳﻤﻦ ﺑﻮﺩﻩ ﺍﺳﺖ.
msa | Apakah nasi lemak? | Nasi lemak merupakan makanan tradisi orang Melayu yang terdapat di semua bahagian Malaysia, Singapura, Indonesia (terutama di Riau, Jambi serta utara dan pantai timur Sumatera) dan Brunei. Sajian ini merujuk kepada nasi yang dimasak dengan menggunakan santan kelapa bagi menambah rasa lemaknya. Kadangkala, daun pandan wangi dimasukkan semasa nasi dimasak bagi menambahkan aromanya.
tam | ெசயற்ைக நுண்ணற வு என்றால் என்ன? | ெபாதுவாக மனிதர்களால் ெசய்யப்படும் பணிகைளச் ெசய்ய ஒரு கணினி அல்லது ஒரு கணினியால் கட்டுப்படுத்தப்படும் ஒரு ேராேபாவ ன் த றன் ெசயற்ைக நுண்ணற வு எனப்படும்.
Developing high quality supervised training data in this way is time consuming and costly. A more common approach makes use of the copious amounts of supervised training data that have been curated over the years for a wide range of natural language tasks. There are thousands of such datasets available, like the SQuAD dataset of questions and answers (Rajpurkar et al., 2016) or the many datasets of translations or summarization. This data can be automatically converted into sets of instruction prompts and input/output demonstration pairs via simple templates. Fig. 12.6 illustrates examples for some applications from the SUPERNATURAL INSTRUCTIONS dataset (Wang et al., 2022), showing relevant slots such as text, context, and hypothesis. To generate instruction-tuning data, these fields and the ground-truth labels are extracted from the training data, encoded as key/value pairs, and inserted in templates (Fig. 12.7) to produce instantiated instructions. Because it’s useful for the prompts to be diverse in wording, language models can also be
Figure 12.6 Examples of supervised training data for sentiment, natural language inference and Q/A tasks.
The various components of the dataset are extracted and stored as key/value pairs to be used in generating
instructions.
Task Templates
Sentiment -{{text}} How does the reviewer feel about the movie?
-The following movie review expresses what sentiment?
{{text}}
-{{text}} Did the reviewer enjoy the movie?
Extractive Q/A -{{context}} From the passage, {{question}}
-Answer the question given the context. Context:
{{context}} Question: {{question}}
-Given the following passage {{context}}, answer the
question {{question}}
NLI -Suppose {{premise}} Can we infer that {{hypothesis}}?
Yes, no, or maybe?
-{{premise}} Based on the previous passage, is it true
that {{hypothesis}}? Yes, no, or maybe?
-Given {{premise}} Should we assume that {{hypothesis}}
is true? Yes, no, or maybe?
Figure 12.7 Instruction templates for sentiment, Q/A and NLI tasks.
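The template instantiation step described above amounts to substituting the extracted key/value pairs into the {{field}} slots of templates like those in Fig. 12.7. A minimal sketch:

```python
import re

# Fill {{field}} slots in an instruction template with extracted key/value pairs.
def instantiate(template, fields):
    return re.sub(r"\{\{(\w+)\}\}", lambda m: fields[m.group(1)], template)

template = "Suppose {{premise}} Can we infer that {{hypothesis}}? Yes, no, or maybe?"
fields = {"premise": "the game is sold out.",
          "hypothesis": "no tickets are left"}
print(instantiate(template, fields))
# Suppose the game is sold out. Can we infer that no tickets are left? Yes, no, or maybe?
```

The premise/hypothesis values here are invented; in practice they come from the supervised training instances.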
• Definition: This task involves creating answers to complex questions, from a given passage. Answering these questions, typically involve understanding multiple sentences. Make sure that your answer has the same type as the "answer type" mentioned in input. The provided "answer type" can be of any of the following types: "span", "date", "number". A "span" answer is a continuous phrase taken directly from the passage or question. You can directly copy-paste the text from the passage or the question for span type answers. If you find multiple spans, please add them all as a comma separated list. Please restrict each span to five words. A "number" type answer can include a digit specifying an actual value. For "date" type answers, use DD MM YYYY format e.g. 11 Jan 1992. If full date is not available in the passage you can write partial date such as 1992 or Jan 1992.
• Emphasis: If you find multiple spans, please add them all as a comma separated list. Please restrict each span to five words.
• Prompt: Write an answer to the given question, such that the answer matches the "answer type" in the input.
Passage: { passage}
Question: { question }
Figure 12.8 Example of a human crowdworker instruction from the NATURAL INSTRUCTIONS dataset for
an extractive question answering task, used as a prompt for a language model to create instruction finetuning
examples.
(e.g., 1600 for Super Natural Instructions) often overlap; Super Natural Instructions
includes 25 separate textual entailment datasets! Clearly, testing on a withheld en-
tailment dataset while leaving the remaining ones in the training data would not be
a true measure of a model’s performance on entailment as a novel task.
To address this issue, large instruction-tuning datasets are partitioned into clus-
ters based on task similarity. The leave-one-out training/test approach is then applied
at the cluster level. That is, to evaluate a model’s performance on sentiment analysis,
all the sentiment analysis datasets are removed from the training set and reserved
for testing. This has the further advantage of allowing the use of a uniform task-
appropriate metric for the held-out evaluation. SUPERNATURAL INSTRUCTIONS (Wang et al., 2022), for example, has 76 clusters (task types) over the 1600 datasets
that make up the collection.
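The cluster-level leave-one-out split can be sketched as follows; the cluster assignments below are made-up miniatures of the 76 clusters just mentioned.

```python
# Sketch of cluster-level leave-one-out: to evaluate on one task cluster,
# hold out every dataset in that cluster and train on all the rest.
CLUSTERS = {
    "sentiment": ["sst2", "imdb", "yelp"],
    "entailment": ["rte", "snli", "mnli"],
    "qa": ["squad", "triviaqa"],
}

def leave_one_cluster_out(held_out):
    train = [d for task, datasets in CLUSTERS.items()
             if task != held_out for d in datasets]
    test = CLUSTERS[held_out]
    return train, test

train_sets, test_sets = leave_one_cluster_out("entailment")
print(test_sets)                 # ['rte', 'snli', 'mnli']
print("rte" in train_sets)       # False
```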
Figure 12.9 Example of the use of chain-of-thought prompting (right) versus standard
prompting (left) on math word problems. Figure from Wei et al. (2022).
Figure 12.10 An illustration of the use of chain-of-thought prompting (right) vs standard prompting (left) in a reasoning task on temporal sequencing. Both prompting setups (answer-only and CoT) include task descriptions and options in the input prompt. Figure from Suzgun et al. (2023).
• An expansion method – A method for generating variations of a prompt.
Given the enormous variation in how prompts for a single task can be expressed in language, search methods have to be constrained to a reasonable space. Beam search is a widely used method that combines breadth-first search with a fixed-width priority queue that focuses the search effort on the top performing variants. Fig. 12.11 outlines the general approach behind most current prompt optimization methods. Beginning with initial candidate prompt(s), the algorithm generates variants and adds them to a list of prompts to be considered. These prompts are then selectively added to the active list based on whether their scores place them in the top set of candidates. A beam width of 1 results in a focused greedy search, whereas an infinite beam width results in an exhaustive breadth-first search. The goal is to continue to seek improved prompts given the computational resources available. Iterative improvement searches typically use a fixed number of iterations in
Figure 12.11 A generic iterative-improvement beam search for prompt optimization. New prompts are generated from current ones on each iteration. Prompts that score well (fitting in the agenda) are kept. When a stopping criterion is reached, the best item in the beam is returned.
combination with a failure to improve after some period of time as stopping criteria. The latter is equivalent to early stopping with patience used in training deep neural networks.
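The search in Fig. 12.11 can be sketched as below. The expand and score functions are toy stand-ins: a real system would paraphrase prompts with an LLM and score them by task accuracy on dev data, while the toy score here simply counts words so the example is self-contained.

```python
# Sketch of the iterative-improvement beam search of Fig. 12.11.
def beam_search_prompts(initial_prompts, expand, score,
                        beam_width=4, iterations=3):
    beam = sorted(initial_prompts, key=score, reverse=True)[:beam_width]
    for _ in range(iterations):                  # fixed-iteration stopping criterion
        candidates = list(beam)
        for prompt in beam:
            candidates.extend(expand(prompt))    # generate variants of each prompt
        # fixed-width priority queue: keep only the top-scoring candidates
        beam = sorted(set(candidates), key=score, reverse=True)[:beam_width]
    return beam[0]

best = beam_search_prompts(
    ["Classify the sentiment."],
    expand=lambda p: [p + " Answer positive or negative.", p + " Be concise."],
    score=lambda p: len(p.split()),              # toy score: longer is "better"
)
print(best)
```

A beam_width of 1 reduces this to the greedy search described in the text.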
Generate a variation of the following instruction while keeping the semantic meaning.
Input: {INSTRUCTION}
Output: {COMPLETE}
A variation of this method is to truncate the current prompt at a set of random locations, generating a set of prompt prefixes. The paraphrasing LLM is then asked to continue each of the prefixes to generate a complete prompt.
This method is an example of an uninformed search. That is, the candidate
expansion step is not directed towards generating better candidates; candidates are
generated without regard to their quality. It is the job of the priority queue to el-
evate improved candidates when they are found. By contrast, Prasad et al. (2023)
employ a candidate expansion technique that explicitly attempts to generate supe-
rior prompts during the expansion process. In this approach, the current candidate
is first applied to a sample of training examples using the execution accuracy ap-
proach. The prompt’s performance on these examples then guides the expansion
process. Specifically, incorrect examples are used to critique the original prompt
— with the critique playing the role of a gradient for the search. The method in-
cludes the following steps.
3. Ask an LLM to produce a critique of the prompt in light of the failed examples.
Given a prompt and a set of failed examples, Prasad et al. (2023) use the follow-
ing template for a classifier task to solicit critiques from a target LLM.
Critiquing Prompt
This model feedback is then combined with a second template to elicit improved
prompts from the LLM.
Based on these examples the problem with this prompt is that {gradient}.
Based on the above information, I wrote {steps per gradient} different
improved prompts. Each prompt is wrapped with <START> and <END>.
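The critique-and-improve loop can be sketched end to end. Everything here is illustrative: call_llm is a stub standing in for a real model API, the critique question is a paraphrase, and the improvement template follows the pattern shown above with {n} in place of {steps per gradient}.

```python
import re

IMPROVE_TEMPLATE = (
    "Based on these examples the problem with this prompt is that {gradient}.\n"
    "Based on the above information, I wrote {n} different improved prompts.\n"
    "Each prompt is wrapped with <START> and <END>."
)

def improve_prompt(prompt, failed_examples, call_llm, n=2):
    # 1) solicit a critique (the "gradient") from the failed examples
    critique = call_llm(f"Prompt: {prompt}\nFailed examples: {failed_examples}\n"
                        "Why did this prompt fail?")
    # 2) ask for improved prompts conditioned on the critique
    response = call_llm(IMPROVE_TEMPLATE.format(gradient=critique, n=n))
    # 3) pull the improved prompts out of their <START>...<END> wrappers
    return re.findall(r"<START>(.*?)<END>", response, flags=re.DOTALL)

def fake_llm(text):
    """Canned responses so the sketch runs without a model behind it."""
    if "Why did this prompt fail?" in text:
        return "it is too vague"
    return "<START>Label the review's sentiment.<END><START>Is this review positive?<END>"

print(improve_prompt("Classify this.", ["ex1"], fake_llm))
```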
One of the reasons that the government discourages and regulates monopo-
lies is that
(A) producer surplus is lost and consumer surplus is gained.
(B) monopoly prices ensure productive efficiency but cost society allocative
efficiency.
(C) monopoly firms do not engage in significant research and development.
(D) consumer surplus is lost with higher prices and lower levels of output.
Fig. 12.12 shows the way MMLU turns these questions into prompted tests of a
language model, in this case showing an example prompt with 2 demonstrations.
The following are multiple choice questions about high school mathematics.
How many numbers are in the list 25, 26, ..., 100?
(A) 75 (B) 76 (C) 22 (D) 23
Answer: B
If 4 daps = 7 yaps, and 5 yaps = 3 baps, how many daps equal 42 baps?
(A) 28 (B) 21 (C) 40 (D) 30
Answer:
Figure 12.12 Sample 2-shot prompt from MMLU testing high-school mathematics. (The
correct answer is (C)).
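The construction in Fig. 12.12 is easy to reproduce programmatically. A minimal sketch follows; the dict-based question format here is an assumption for illustration, not the official MMLU evaluation harness.

```python
def mmlu_prompt(subject, demos, question):
    """Format MMLU-style questions as a few-shot prompt.

    Each item is a dict with 'question', 'choices' (a list of 4 strings),
    and, for demonstrations, 'answer' (one of 'A'-'D').
    """
    letters = "ABCD"

    def fmt(item, with_answer):
        choices = " ".join(f"({letters[i]}) {c}"
                           for i, c in enumerate(item["choices"]))
        ans = f" {item['answer']}" if with_answer else ""
        return f"{item['question']}\n{choices}\nAnswer:{ans}"

    header = f"The following are multiple choice questions about {subject}."
    parts = [header] + [fmt(d, True) for d in demos] + [fmt(question, False)]
    return "\n".join(parts)
```

The prompt ends with a bare "Answer:" so that the language model's next token can be read off as its chosen letter.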
12.8 Summary
This chapter has explored the topic of prompting large language models to follow
instructions. Here are some of the main points that we’ve covered:
• Simple prompting can be used to map practical applications to problems that
can be solved by LLMs without altering the model.
• Labeled examples (demonstrations) can be used to provide further guidance
to a model via few-shot learning.
• Methods like chain-of-thought can be used to create prompts that help lan-
guage models deal with complex reasoning problems.
• Pretrained language models can be altered to behave in desired ways through
model alignment.
• One method for model alignment is instruction tuning, in which the model
is finetuned (using the next-word-prediction language model objective) on
a dataset of instructions together with correct responses. Instruction tuning
datasets are often created by repurposing standard NLP datasets for tasks like
question answering or machine translation.
Bianchi, F., M. Suzgun, G. Attanasio, P. Rottger, D. Jurafsky, T. Hashimoto, and J. Zou. 2024. Safety-tuned LLaMAs: Lessons from improving the safety of large language models that follow instructions. ICLR.
Brown, T., B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. 2020. Language models are few-shot learners. NeurIPS, volume 33.
Cheng, M., E. Durmus, and D. Jurafsky. 2023. Marked personas: Using natural language prompts to measure stereotypes in language models. ACL.
Cobbe, K., V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. 2021. Training verifiers to solve math word problems. ArXiv preprint.
Crosbie, J. and E. Shutova. 2022. Induction heads as an essential mechanism for pattern matching in in-context learning. ArXiv preprint.
Elhage, N., N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. 2021. A mathematical framework for transformer circuits. White paper.
Gehman, S., S. Gururangan, M. Sap, Y. Choi, and N. A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. Findings of EMNLP.
Honovich, O., U. Shaham, S. R. Bowman, and O. Levy. 2023. Instruction induction: From few examples to natural language task descriptions. ACL.
Iyer, S., X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shuster, T. Wang, Q. Liu, P. S. Koura, X. Li, B. O'Horo, G. Pereyra, J. Wang, C. Dewan, A. Celikyilmaz, L. Zettlemoyer, and V. Stoyanov. 2022. OPT-IML: Scaling language model instruction meta learning through the lens of generalization. ArXiv preprint.
Khattab, O., A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, and C. Potts. 2024. DSPy: Compiling declarative language model calls into self-improving pipelines. ICLR.
Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. ACL 2004 Workshop on Text Summarization Branches Out.
Longpre, S., L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, and A. Roberts. 2023. The Flan collection: Designing data and methods for effective instruction tuning. ICML.
Min, S., X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? EMNLP.
Mishra, S., D. Khashabi, C. Baral, and H. Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. ACL.
Olsson, C., N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, et al. 2022. In-context learning and induction heads. ArXiv preprint.
Ouyang, L., J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. 2022. Training language models to follow instructions with human feedback. NeurIPS, volume 35.
Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. ACL.
Prasad, A., P. Hase, X. Zhou, and M. Bansal. 2023. GrIPS: Gradient-free, edit-based instruction search for prompting large language models. EACL.
Pryzant, R., D. Iter, J. Li, Y. Lee, C. Zhu, and M. Zeng. 2023. Automatic prompt optimization with "gradient descent" and beam search. EMNLP.
Rajpurkar, P., R. Jia, and P. Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. ACL.
Rajpurkar, P., J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. EMNLP.
Reynolds, L. and K. McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. CHI 2021.
Russell, S. and P. Norvig. 2002. Artificial Intelligence: A Modern Approach, 2nd edition. Prentice Hall.
Salvetti, F., J. B. Lowe, and J. H. Martin. 2016. A tangled web: The faint signals of deception in text - Boulder Lies and Truth corpus (BLT-C). LREC.
Sheng, E., K.-W. Chang, P. Natarajan, and N. Peng. 2019. The woman worked as a babysitter: On biases in language generation. EMNLP.
Singh, S., F. Vargus, D. D'souza, B. F. Karlsson, A. Mahendiran, W.-Y. Ko, H. Shandilya, J. Patel, D. Mataciunas, L. O'Mahony, M. Zhang, R. Hettiarachchi, J. Wilson, M. Machado, L. S. Moura, D. Krzemiński, H. Fadaei, I. Ergün, I. Okoh, A. Alaagib, O. Mudannayake, Z. Alyafeai, V. M. Chien, S. Ruder, S. Guthikonda, E. A. Alghamdi, S. Gehrmann, N. Muennighoff, M. Bartolo, J. Kreutzer, A. Üstün, M. Fadaee, and S. Hooker. 2024. Aya dataset: An open-access collection for multilingual instruction tuning. ArXiv preprint.
Suzgun, M., N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. Le, E. Chi, D. Zhou, and J. Wei. 2023. Challenging BIG-bench tasks and whether chain-of-thought can solve them. ACL Findings.
Wang, Y., S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, E. Pathak, G. Karamanolakis, H. Lai, I. Purohit, I. Mondal, J. Anderson, K. Kuznia, K. Doshi, K. K. Pal, M. Patel, M. Moradshahi, M. Parmar, M. Purohit, N. Varshney, P. R. Kaza, P. Verma, R. S. Puri, R. Karia, S. Doshi, S. K. Sampat, S. Mishra, S. Reddy A, S. Patro, T. Dixit, and X. Shen. 2022. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. EMNLP.
Instruction tuning adapts a pretrained language model to follow instructions across a variety of tasks by finetuning it on a corpus of instructions and responses, enhancing its general instruction-following ability. Preference alignment, often implemented as Reinforcement Learning from Human Feedback (RLHF), involves training a separate model to evaluate how well a response aligns with human preferences. This feedback is then used to further finetune the base model, aligning it with human needs for helpfulness and harmlessness.
To measure performance on novel tasks, instruction-tuned models are assessed using a cluster-based leave-one-out approach, in which all datasets belonging to a given task cluster are removed from training and held out for testing. This maximizes the novelty of the evaluated tasks by ensuring that no direct or closely related task cues are seen during training, compelling the model to generalize its learning strategies.
Pretraining updates model parameters using gradient descent so that the model learns to predict words in a text, optimizing it for general language tasks. In contrast, prompting, especially with few-shot demonstrations, does not modify model parameters but guides the model to apply its existing knowledge to new tasks through task demonstration. Prompting serves to boost predictive power without internal updates, essentially teaching the model through examples rather than parameter changes.
In few-shot prompting, the main utility of demonstrations is to show the task format to the model, rather than providing the correct answers. A small number of demonstrations can suffice because the largest performance gain tends to come from the first example, with subsequent examples offering diminishing returns. Adding too many examples risks overfitting the model to the specific examples, reducing its generalization ability. In contrast, finetuning specialized classifier heads benefits from more examples to accurately update model parameters through gradient descent.
Instruction-tuning datasets can be tailored for safety by selecting harmful questions and generating paraphrases alongside safe responses. Language models can be used to create these paraphrases and responses, which are then manually reviewed for appropriateness. Incorporating even a small number of these safety-focused instructions can significantly reduce harmful content in model outputs by shaping the model's handling of sensitive topics.
In-context learning differs from gradient-descent-based learning by not involving any updates to the model's parameters. Instead, it acclimates the model to a task through the context provided by prompts, allowing it to apply inferences drawn from the given input immediately. Unlike traditional learning that requires parameter changes over many iterations, in-context learning utilizes the model's pre-existing knowledge to interpret and perform tasks based on immediate input.
Language models can generate harmful content, such as misinformation or hate speech, because pretraining optimizes word prediction alone, with no mechanism that discourages harmful content. Consequently, even benign prompts can trigger negative outputs. This issue is addressed through training techniques like instruction tuning and preference alignment, which introduce additional stages of finetuning to promote safety and reduce toxic outputs.
Evaluating instruction-tuned models on novel tasks requires ensuring that test tasks were not implicitly learned during training. Many instruction-tuning datasets contain overlapping tasks, which can undermine evaluations if withheld tasks have similar datasets in the training set. This necessitates grouping similar tasks into clusters and applying the leave-one-out method at the cluster level, preventing task leakage and ensuring the model's ability to generalize to genuinely novel tasks.
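Cluster-level leave-one-out splitting can be sketched as follows; the cluster and dataset names in the usage example are illustrative only.

```python
def leave_one_cluster_out(clusters):
    """Yield (held_out_cluster, train_datasets, test_datasets) splits.

    `clusters` maps a task-cluster name to the list of datasets in it;
    every dataset in the held-out cluster is excluded from training,
    so no similar task is seen before testing.
    """
    for held_out, test_datasets in clusters.items():
        train_datasets = [d for name, ds in clusters.items()
                          if name != held_out for d in ds]
        yield held_out, train_datasets, list(test_datasets)
```

Each split trains one model with an entire task cluster withheld, so the evaluation on that cluster reflects generalization rather than memorized task cues.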
Programmatic selection optimizes task performance by systematically choosing demonstrations that most significantly enhance prompt efficacy on test sets. Tools like DSPy automate this optimization, searching the demonstration space to find configurations that maximize performance metrics like accuracy or ROUGE scores, thus ensuring selected examples are systematically optimal rather than arbitrarily chosen.
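A brute-force version of this search can illustrate the idea, though real tools such as DSPy are far more sophisticated. In this sketch, `score_fn` stands in for evaluating an LLM prompt built from a candidate demonstration subset on a development set.

```python
from itertools import combinations

def best_demos(pool, dev_set, score_fn, k=2):
    """Return the k-subset of `pool` that maximizes `score_fn` on `dev_set`.

    `score_fn(demos, dev_set)` returns a metric such as accuracy or ROUGE
    for a prompt built from `demos`; here it is a placeholder for a real
    LLM evaluation loop.
    """
    return max(combinations(pool, k),
               key=lambda demos: score_fn(list(demos), dev_set))
```

Exhaustive search is only feasible for small pools; practical optimizers sample or search the space greedily instead of enumerating every subset.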
Dynamic retrieval of demonstrations involves selecting examples similar to the current input by comparing embeddings, thus tailoring demonstrations that are more contextually relevant. This similarity-based selection improves the quality of few-shot prompts, thereby enhancing model performance by making the task demonstration more directly applicable to the task at hand.
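Similarity-based retrieval reduces to a nearest-neighbor lookup over embedding vectors. A minimal sketch with cosine similarity follows; producing the embeddings themselves (with any sentence encoder) is assumed to happen elsewhere.

```python
import math

def retrieve_demos(input_vec, demo_vecs, k=2):
    """Return indices of the k demonstrations whose embedding vectors
    are most cosine-similar to the input's embedding vector."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = (math.sqrt(sum(a * a for a in u))
                * math.sqrt(sum(b * b for b in v)))
        return dot / norm  # assumes no zero-length vectors
    sims = [(cosine(input_vec, v), i) for i, v in enumerate(demo_vecs)]
    return [i for _, i in sorted(sims, reverse=True)[:k]]
```

The returned indices select the labeled examples whose inputs most resemble the current input, and those examples become the demonstrations in the few-shot prompt.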