
LLaMA: Open and Efficient

Foundation Language
Models
Hugo Touvron∗ , Thibaut Lavril∗ , Gautier Izacard∗ , Xavier Martinet
Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal
Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin,
Edouard Grave∗ , Guillaume Lample∗
1) Background
• Language Models are Few-Shot Learners: sufficiently large LMs can perform new tasks
from textual instructions or a few in-context examples, without task-specific fine-tuning.
• Scaling Laws for Neural Language Models: performance improves predictably as models are scaled to sufficient size.
• PaLM: Scaling Language Modeling with Pathways
• Training Compute-Optimal Large Language Models: Determining how to best scale
the dataset and model sizes for a particular training compute budget.
2) Objective
• To train a series of language models that achieve the
best possible performance at various inference
budgets, by training on more tokens than is
typically used.
• Use publicly available data, making our work
compatible with open-sourcing.
3) Approach
• Training approach:
- Language Models are Few-Shot Learners
- PaLM : Scaling Language Modeling with Pathways
• Inspired by:
Chinchilla scaling laws - Training Compute-Optimal Large Language Models
3.1) Pre-training Data
• The dataset is a mixture of several sources covering a diverse set of domains.
• It mostly reuses data sources that have been leveraged to train other LLMs.
• Restricted to data that is publicly available and compatible with open sourcing.
• Overall, the entire training dataset contains roughly 1.4T tokens after tokenization.
3.1.1) English CommonCrawl [67%]
• Preprocessed 5 CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet
pipeline.
• This process deduplicates the data at the line level, performs language identification
with a fastText linear classifier to remove non-English pages, and filters low quality
content with an n-gram language model.
• Trained a linear model to classify pages used as references in Wikipedia vs. randomly
sampled pages, and discarded pages not classified as references.
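The paper only says a linear classifier separates Wikipedia-reference pages from random CommonCrawl pages. A minimal sketch of one way to do this with the fastText library (already used in the pipeline for language identification) is below; the file name, label strings, and threshold are illustrative assumptions, not the authors' exact setup.

```python
import fasttext  # pip install fasttext

# Training file (illustrative): one page of text per line, prefixed with
# "__label__reference" (cited by Wikipedia) or "__label__random" (random page).
model = fasttext.train_supervised(input="cc_pages_train.txt")

page_text = "Some CommonCrawl page text on a single line ..."
labels, probs = model.predict(page_text, k=1)

# Keep only pages the classifier scores as reference-like.
keep = labels[0] == "__label__reference" and probs[0] > 0.5
```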
3.1.2) C4 (Colossal Clean Crawled
Corpus) [15%]
• Included the publicly available C4 dataset.
• Also contains deduplication and language identification steps: the main difference
from CCNet is the quality filtering, which mostly relies on heuristics such as the presence
of punctuation marks or the number of words and sentences in a webpage.
3.1.3) Github [4.5%]
• Public GitHub dataset available on Google BigQuery.
• Filtered low quality files with heuristics based on the line length or proportion of
alphanumeric characters, and removed boilerplate, such as headers, with regular
expressions.
• Deduplicated the resulting dataset at the file level, with exact matches.
3.1.4) Wikipedia [4.5%]
• Wikipedia dumps from the June-August 2022 period, covering 20 languages, which
use either the Latin or Cyrillic scripts:
bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk
• Processed the data to remove hyperlinks, comments and other formatting
boilerplate.
3.1.5) Gutenberg and Books3 [4.5%]
• Gutenberg Project: Contains
books that are in the public
domain.
• Books3 section of ThePile:
Publicly available dataset for
training large language models.
• Deduplication at the book
level, removing books with
more than 90% content
overlap.

3.1.6) ArXiv [2.5%]
• Added arXiv LaTeX files to bring scientific data into the dataset.
• Removed everything before the first section, as well as the bibliography.
• Removed the comments from the .tex files, and inline-expanded definitions and
macros written by users to increase consistency across papers.
3.1.7) Stack Exchange [2%]
• Dump of Stack Exchange questions and answers.
• Kept the data from the 28 largest websites, removed the HTML tags from text and
sorted the answers by score (from highest to lowest).
3.2) Tokenizer
• Used the Byte Pair Encoding (BPE) algorithm, with the implementation from
SentencePiece.
• Split all numbers into individual digits.
• Fallback to bytes to decompose unknown UTF-8 characters.
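A minimal sketch of training such a tokenizer with the SentencePiece library; the corpus path is illustrative, and the 32k vocabulary size is the one used by LLaMA.

```python
import sentencepiece as spm

# BPE tokenizer with digit splitting and byte fallback for unknown UTF-8
# characters; "corpus.txt" is an illustrative input path.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="llama_bpe",
    model_type="bpe",
    vocab_size=32000,
    split_digits=True,    # split all numbers into individual digits
    byte_fallback=True,   # decompose unknown UTF-8 characters into bytes
)

sp = spm.SentencePieceProcessor(model_file="llama_bpe.model")
print(sp.encode("LLaMA was trained on 1.4T tokens", out_type=str))
```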
3.3) Architecture
• Based on the transformer architecture.
Attention Is All You Need
• Main differences from the original architecture:
1) Pre-normalization [GPT3]
2) SwiGLU activation function [PaLM]
3) Rotary Embeddings [GPTNeo]
3.3.1) Pre-normalization
[GPT3]
• Normalize the input of each
transformer sub-layer,
instead of normalizing the
output.
• Used the RMSNorm (Root Mean
Square Layer Normalization)
normalizing function.

Standard GPT Architecture (source)
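A minimal sketch of an RMSNorm layer and of how pre-normalization wires it into a transformer block; dimensions and epsilon are illustrative, not the released code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

# Pre-normalization: normalize the *input* of each sub-layer, e.g.
#   h   = x + attention(RMSNorm(dim)(x))
#   out = h + feed_forward(RMSNorm(dim)(h))
```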


3.3.2) SwiGLU activation function [PaLM]
• Replaced the ReLU non-linearity by the SwiGLU activation function.
• Used a hidden dimension of (2/3)·4d instead of the 4d used in PaLM, which keeps the
feed-forward parameter count comparable to a standard FFN, since SwiGLU uses three
weight matrices instead of two.
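A minimal sketch of such a SwiGLU feed-forward block; the exact hidden-size rounding used in the released code may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with SwiGLU; hidden size is (2/3)*4d rather than 4d."""
    def __init__(self, d_model: int):
        super().__init__()
        hidden = int(2 * 4 * d_model / 3)
        self.w1 = nn.Linear(d_model, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, hidden, bias=False)  # value projection
        self.w2 = nn.Linear(hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = (Swish(x W1) * x W3) W2
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

ffn = SwiGLUFeedForward(512)
print(ffn(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```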
3.3.3) Rotary Embeddings [GPTNeo]
• Replaced absolute positional embeddings with rotary positional embeddings (RoPE)
at each layer.

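A minimal sketch of rotary embeddings in the complex-number formulation commonly used with LLaMA-style models; tensor shapes are illustrative.

```python
import torch

def precompute_freqs_cis(head_dim: int, seq_len: int, theta: float = 10000.0):
    """Complex rotation factors e^{i*m*theta_k} for each position m and frequency k."""
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    angles = torch.outer(t, freqs)                     # (seq_len, head_dim/2)
    return torch.polar(torch.ones_like(angles), angles)

def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    """Rotate pairs of channels of q or k; x has shape (batch, seq, heads, head_dim)."""
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    freqs = freqs_cis.view(1, x_complex.shape[1], 1, x_complex.shape[-1])
    x_rotated = torch.view_as_real(x_complex * freqs)
    return x_rotated.flatten(-2).type_as(x)

q = torch.randn(2, 128, 8, 64)             # illustrative: batch, seq, heads, head_dim
freqs_cis = precompute_freqs_cis(64, 128)
q_rot = apply_rotary_emb(q, freqs_cis)     # same shape as q, positions now encoded
```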
3.4) Optimizer
• Trained using AdamW optimizer.
• Hyper-parameters: β1 = 0.9, β2 =
0.95.
• Used cosine learning rate
schedule, such that the final
learning rate is equal to 10% of
the maximal learning rate.
• Used a weight decay of 0.1 and
gradient clipping of 1.0.
• Used 2,000 warmup steps, and
varied the learning rate and batch
size with the size of the model.
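A minimal sketch of this optimizer setup in PyTorch. The betas, weight decay, gradient clipping, warmup steps, and 10% final learning rate follow the values above; the peak learning rate, total step count, and the stand-in model are illustrative assumptions (the actual values vary with model size).

```python
import math
import torch

model = torch.nn.Linear(512, 512)        # illustrative stand-in for the transformer

max_lr, min_ratio = 3e-4, 0.10           # final LR = 10% of the maximal LR
warmup, total_steps = 2000, 100_000      # 2,000 warmup steps; total is illustrative

opt = torch.optim.AdamW(model.parameters(), lr=max_lr,
                        betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step: int) -> float:
    if step < warmup:                                    # linear warmup
        return step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))    # 1 -> 0 over training
    return min_ratio + (1 - min_ratio) * cosine          # decays from 1.0 to 0.10

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

# Inside the training loop:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping of 1.0
#   opt.step(); sched.step(); opt.zero_grad()
```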
3.5.1) Efficient implementation
• Efficient implementation of the causal multi-head attention operator, building on
"Self-attention Does Not Need O(n²) Memory" and
"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness", to
reduce memory usage and computation.
• Achieved by not storing the attention weights and not computing the key/query
scores that are masked due to the causal nature of the language modeling task.
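The paper builds on the xformers attention kernels; a minimal sketch of the same idea using PyTorch's fused scaled_dot_product_attention (an assumption here, not the paper's exact code) is below. The fused kernel never materializes the full attention matrix and skips the causally masked scores.

```python
import torch
import torch.nn.functional as F

B, H, S, D = 2, 8, 1024, 64        # batch, heads, sequence length, head dim (illustrative)
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
v = torch.randn(B, H, S, D)

# Fused causal attention: memory-efficient, attention weights are never stored.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                   # torch.Size([2, 8, 1024, 64])
```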
3.5.2) Efficient implementation
• Reduced the number of activations that are recomputed during the backward pass
with checkpointing.
• Saved the activations that are expensive to compute, such as the outputs of linear
layers.
• Achieved by manually implementing the backward function for the transformer
layers, instead of relying on the PyTorch autograd.
• Reduced the memory usage of the model by using model and sequence parallelism.
• Moreover, the computation of activations and the communication between GPUs over
the network (due to all_reduce operations) are overlapped as much as possible.
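The paper implements the transformer backward pass manually so that only expensive activations (e.g. linear layer outputs) are saved; a simpler off-the-shelf way to get the same memory/compute trade-off is PyTorch activation checkpointing, sketched here on an illustrative stand-in block.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Illustrative stand-in; a real transformer block would go here.
block = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.SiLU(), torch.nn.Linear(2048, 512)
)

x = torch.randn(4, 512, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed during the
# backward pass, trading extra compute for lower memory usage.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```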
4) Main Results
• Considered zero-shot and few-shot tasks, reporting results on a total of 20
benchmarks.
- Zero-shot: the model is given a textual description of the task and a test example,
and either produces an answer via open-ended generation or ranks the proposed answers.
- Few-shot: the model is given a few examples of the task (between 1 and 64) and a test
example, and either produces an answer or ranks the different options.
• Compared with both non-public and open-source models.
• Evaluated LLaMA on free-form generation tasks and multiple-choice tasks.
4.1) Common
Sense
Reasoning
• Consider eight standard
common sense reasoning
benchmarks.
• LLaMA-65B outperforms
Chinchilla-70B on almost
all benchmarks.
• Surpassed PaLM-540B
everywhere except on
BoolQ and WinoGrande.
• LLaMA-13B model also
outperforms GPT-3 on
most benchmarks despite
being 10× smaller.
4.1) Common Sense Reasoning
• Generated answers are evaluated with the standard exact match
metric: a generated answer is considered correct if it matches any
answer of the list of answers after normalization.
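A minimal sketch of a SQuAD-style normalized exact-match check of the kind described above; the exact normalization rules used by the authors may differ.

```python
import re
import string

def normalize(text: str) -> str:
    # Lowercase, drop punctuation and English articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    # Correct if the normalized prediction matches any normalized gold answer.
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

print(exact_match("The Eiffel Tower!", ["Eiffel Tower", "eiffel tower, Paris"]))  # True
```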
4.2) Closed-book
Question Answering
• Compared LLaMA to existing LLMs on two closed-book
question answering benchmarks: Natural Questions and TriviaQA.
• On both benchmarks, LLaMA-65B achieves state-of-
the-art performance in the zero-shot and few-shot
settings.
• LLaMA-13B is also competitive on these benchmarks
with GPT-3 and Chinchilla, despite being 5-10×
smaller.
4.3) Reading
Comprehension
• Evaluated on the RACE reading
comprehension benchmark, a dataset
collected from English reading
comprehension exams designed for
Chinese middle and high school
students.
• LLaMA-65B is competitive with
PaLM-540B, and
• LLaMA-13B outperforms GPT-3 by a
few percent.
4.4) Mathematical
reasoning
• Evaluated models on two
mathematical reasoning
benchmarks:
- MATH: Dataset of 12K middle
school and high school mathematics
problems.
- GSM8k: Dataset of middle school
mathematical problems.
• On GSM8k, LLaMA-65B outperforms
Minerva-62B, although it has not
been fine-tuned on mathematical
data.
4.5) Code
generation
• Evaluated on two benchmarks, HumanEval and MBPP.
• Model received a description
of the program in a few
sentences, as well as a few
input-output examples.
• Output should be in Python.
• For a similar number of
parameters, LLaMA
outperforms other general
models.
• LLaMA-65B also outperforms
PaLM-62B, even when the latter
is trained longer.
4.6) Massive
Multitask
Language
Understanding
• Consists of multiple-choice
questions covering various
domains of knowledge,
including humanities, STEM
and social sciences.
• LLaMA-65B is behind both
Chinchilla-70B and PaLM-540B
by a few percent on average,
and across most domains.
• A potential explanation is that
LLaMA used a limited amount of
books and academic papers in
its pre-training data.
4.7) Evolution
of
performance
during training
• Used a few question answering
and common sense reasoning
benchmarks.
• On most benchmarks, the
performance improves steadily,
except on SIQA and WinoGrande.
• On SIQA, a lot of variance in
performance was observed,
which may indicate that this
benchmark is not reliable.
• LLaMA-33B and LLaMA-65B
have similar performance
during training.
5) Instruction
Finetuning
• Brief finetuning on instruction data
rapidly leads to improvements on MMLU.
• Observed that a very small amount of
finetuning improves the performance on
MMLU.
• Despite the simplicity of the instruction
finetuning approach used, LLaMA-I (65B)
outperforms existing instruction-finetuned
models of moderate size on MMLU, but is
still far from the state of the art, which is
77.4 for GPT code-davinci-002 on MMLU.
6) Bias, Toxicity and Misinformation
• Evaluated on different benchmarks that measure toxic content production and
stereotype detection.
• While they selected some of the standard benchmarks that are used by the language
model community to indicate some of the issues with these models, these
evaluations are not sufficient to fully understand the risks associated with these
models.
6.1)
RealToxicityPrompts
• RealToxicityPrompts consists of about 100k
prompts that the model must complete;
then a toxicity score is automatically
evaluated by making a request to the
Perspective API.
• Observed that toxicity increases with the
size of the model, especially for Respectful
prompts.
• A noted exception in prior work is that
Gopher and Chinchilla show similar
toxicity despite their different sizes; this
could be explained by the fact that the
larger model, Gopher, has worse
performance than Chinchilla, suggesting
that the relation between toxicity and
model size may only apply within a
model family.
6.2) CrowS-Pairs
• This dataset measures biases
in 9 categories: gender, religion,
race/color, sexual orientation, age,
nationality, disability, physical
appearance and socioeconomic status.
• Each example is composed of a
stereotype and an anti-stereotype,
and the model's preference for the
stereotypical sentence is measured
using the perplexity of both
sentences in a zero-shot setting.
• The model is particularly biased in the
religion category (+10 compared to
OPT-175B), followed by age and gender
(+6 each compared to best model).
• A possible source of these biases is
CommonCrawl, despite the multiple
filtering steps.
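A minimal sketch of this zero-shot perplexity comparison using the Hugging Face transformers API; the small model name and the example sentences are stand-ins, not the paper's setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a small illustrative stand-in for the evaluated model.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def nll(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=ids returns the average token-level cross-entropy (log-perplexity).
        loss = lm(ids, labels=ids).loss
    return loss.item()

stereo = "Example stereotypical sentence."
anti = "Example anti-stereotypical sentence."
# Lower negative log-likelihood means the model prefers that sentence.
prefers_stereotype = nll(stereo) < nll(anti)
```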
6.3) WinoGender
• WinoGender is made of Winograd schema, and biases are evaluated by determining
whether a model's co-reference resolution performance is impacted by the gender of
the pronoun.
• Each sentence has three mentions: an “occupation”, a “participant”, and a
“pronoun”. Example: “The nurse notified the patient that his shift would be ending in
an hour.”
• Observed that the model is significantly better at performing co-reference resolution
for the “their/them/someone” pronouns than for the “her/her/she” and “his/him/he”
pronouns.
6.3) WinoGender
• To further investigate this hypothesis,
they look at the set of “gotcha” cases
for the “her/her/she” and
“his/him/he” pronouns in the
WinoGender dataset. (These cases
correspond to sentences in which the
pronoun does not match the majority
gender of the occupation, and the
occupation is the correct answer.)
• Observed that the model, LLaMA-65B,
makes more errors on the gotcha
examples, clearly showing that it
captures societal biases related to
gender and occupation. The drop in
performance exists for both “her/her/she”
and “his/him/he” pronouns, which is
indicative of biases regardless of
gender.
6.4) TruthfulQA
• Aims to measure the
truthfulness of a model, i.e., its
ability to identify when a claim is
true.
• The questions are written in
diverse styles, cover 38 categories
and are designed to be
adversarial.
• Compared to GPT-3, the model
scores higher in both categories,
but the rate of correct answers is
still low, showing that the model is
likely to hallucinate incorrect
answers.
7) Carbon footprint
• Calculated total energy consumption and the
resulting carbon footprint.
• Used formula to estimate the Wh, to train a
model as well as the tons of carbon emissions,
tCO2eq.
Wh = GPU-h × (GPU power consumption) × PUE, with PUE = 1.1
• Used the US national average carbon intensity
factor of 0.385 kg CO2eq/kWh:
tCO2eq = MWh × 0.385
• Result:
- Energy used: 2,638 MWh
- Period: 5 months
- Total Carbon Emission: 1,015 tCO2eq
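A quick sanity check of these totals with the formula above, using only the figures reported in this deck:

```python
# tCO2eq = MWh * 0.385 (US national average carbon intensity)
total_mwh = 2638                        # total energy used, as reported above
carbon_intensity = 0.385                # kg CO2eq per kWh == tonnes CO2eq per MWh
tco2eq = total_mwh * carbon_intensity
print(round(tco2eq, 1))                 # 1015.6, matching the reported ~1,015 tCO2eq
```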
8) Some more results…
8.1) Generations from LLaMA-65B (without instruction finetuning)
8.2) Generations from LLaMA-65B (without instruction finetuning)
8.3) Generations from LLaMA-I
LLaMA-I, i.e. LLaMA-65B fine-tuned with the protocol and instruction dataset.
8.4) Generations from LLaMA-I
8.5) Generations from LLaMA-I
8.6) Generations from LLaMA-I
8.7) Generations from LLaMA-I
8.8) Generations
from LLaMA-I
9) Conclusion
• Most notably, LLaMA-13B outperforms GPT-3 while being more than 10× smaller,
and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B.
• Unlike previous studies, this paper shows that it is possible to achieve state-of-the-art
performance by training exclusively on publicly available data, without resorting to
proprietary datasets.
• Releasing these models to the research community will accelerate the development
of large language models and help efforts to improve their robustness and mitigate
known issues such as toxicity and bias.
• Lastly, it was observed that finetuning these models on instructions leads to promising
results.
Any Questions?
