
LLaMA: Open and Efficient

Foundation Language
Models
Hugo Touvron∗ , Thibaut Lavril∗ , Gautier Izacard∗ , Xavier Martinet
Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal
Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin,
Edouard Grave∗ , Guillaume Lample∗
1) Background
• Language Models are Few-Shot Learners: sufficiently large LMs can perform new tasks
from textual instructions or a few in-context examples, without task-specific fine-tuning.
• Scaling Laws for Neural Language Models: performance improves predictably as models are scaled to sufficient size.
• PaLM: Scaling Language Modeling with Pathways
• Training Compute-Optimal Large Language Models: Determining how to best scale
the dataset and model sizes for a particular training compute budget.
2) Objective
• To train a series of language models that achieve the
best possible performance at various inference
budgets, by training on more tokens than is
typically used.
• Use publicly available data, making our work
compatible with open-sourcing.
3) Approach
• Training approach:
- Language Models are Few-Shot Learners
- PaLM : Scaling Language Modeling with Pathways
• Inspired by:
Chinchilla scaling laws - Training Compute-Optimal Large Language Models
3.1) Pre-training Data
• The dataset is a mixture of several sources covering a diverse set of domains.
• It mostly reuses data sources that have been leveraged to train other LLMs.
• Restricted to data that is publicly available and compatible with open sourcing.
• Overall, the entire training dataset contains roughly 1.4T tokens after tokenization.
3.1.1) English CommonCrawl [67%]
• Preprocessed 5 CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet
pipeline.
• This process deduplicates the data at the line level, performs language identification
with a fastText linear classifier to remove non-English pages, and filters low quality
content with an n-gram language model.
• Trained a linear model to classify pages used as references in Wikipedia vs. randomly
sampled pages, and discarded pages not classified as references.
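The paper only says a linear classifier separates Wikipedia-reference pages from random CommonCrawl pages. A minimal sketch of one way to do this with the fastText library (already used in the pipeline for language identification) is below; the file name, label strings, and threshold are illustrative assumptions, not the authors' exact setup.

```python
import fasttext  # pip install fasttext

# Training file (illustrative): one page of text per line, prefixed with
# "__label__reference" (cited by Wikipedia) or "__label__random" (random page).
model = fasttext.train_supervised(input="cc_pages_train.txt")

page_text = "Some CommonCrawl page text on a single line ..."
labels, probs = model.predict(page_text, k=1)

# Keep only pages the classifier scores as reference-like.
keep = labels[0] == "__label__reference" and probs[0] > 0.5
```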
3.1.2) C4 (Colossal Clean Crawled
Corpus) [15%]
• Included the publicly available C4 dataset.
• Also contains deduplication and language identification steps: the main difference
from CCNet is the quality filtering, which mostly relies on heuristics such as the presence
of punctuation marks or the number of words and sentences in a webpage.
3.1.3) Github [4.5%]
• Public GitHub dataset available on Google BigQuery.
• Filtered low quality files with heuristics based on the line length or proportion of
alphanumeric characters, and removed boilerplate, such as headers, with regular
expressions.
• Deduplicated the resulting dataset at the file level, with exact matches.
3.1.4) Wikipedia [4.5%]
• Wikipedia dumps from the June-August 2022 period, covering 20 languages, which
use either the Latin or Cyrillic scripts:
bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk
• Processed the data to remove hyperlinks, comments and other formatting
boilerplate.
3.1.5) Gutenberg and Books3 [4.5%]
• Gutenberg Project: Contains
books that are in the public
domain.
• Books3 section of ThePile:
Publicly available dataset for
training large language models.
• Deduplication at the book
level, removing books with
more than 90% content
overlap.

3.1.6) ArXiv [2.5%]
• Added arXiv LaTeX files to bring scientific data into the dataset.
• Removed everything before the first section, as well as the bibliography.
• Removed the comments from the .tex files, and inline-expanded definitions and
macros written by users to increase consistency across papers.
3.1.7) Stack Exchange [2%]
• Dump of Stack Exchange questions and answers.
• Kept the data from the 28 largest websites, removed the HTML tags from text and
sorted the answers by score (from highest to lowest).
3.2) Tokenizer
• Used the Byte Pair Encoding (BPE) algorithm, with the implementation from
SentencePiece.
• Split all numbers into individual digits.
• Fallback to bytes to decompose unknown UTF-8 characters.
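A minimal sketch of training such a tokenizer with the SentencePiece library; the corpus path is illustrative, and the 32k vocabulary size is the one used by LLaMA.

```python
import sentencepiece as spm

# BPE tokenizer with digit splitting and byte fallback for unknown UTF-8
# characters; "corpus.txt" is an illustrative input path.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="llama_bpe",
    model_type="bpe",
    vocab_size=32000,
    split_digits=True,    # split all numbers into individual digits
    byte_fallback=True,   # decompose unknown UTF-8 characters into bytes
)

sp = spm.SentencePieceProcessor(model_file="llama_bpe.model")
print(sp.encode("LLaMA was trained on 1.4T tokens", out_type=str))
```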
3.3) Architecture
• Based on the transformer architecture.
Attention Is All You Need
• Main differences from the original architecture:
1) Pre-normalization [GPT3]
2) SwiGLU activation function [PaLM]
3) Rotary Embeddings [GPTNeo]
3.3.1) Pre-normalization
[GPT3]
• Normalize the input of each
transformer sub-layer,
instead of normalizing the
output.
• Used the RMSNorm (Root Mean
Square Layer Normalization)
normalizing function.

Standard GPT Architecture (source)
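A minimal sketch of an RMSNorm layer and of how pre-normalization wires it into a transformer block; dimensions and epsilon are illustrative, not the released code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

# Pre-normalization: normalize the *input* of each sub-layer, e.g.
#   h   = x + attention(RMSNorm(dim)(x))
#   out = h + feed_forward(RMSNorm(dim)(h))
```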


3.3.2) SwiGLU activation function [PaLM]
• Replaced the ReLU non-linearity by the SwiGLU activation function.
• Used a hidden dimension of (2/3)·4d instead of the 4d used in PaLM, which keeps the
feed-forward parameter count comparable to a standard FFN, since SwiGLU uses three
weight matrices instead of two.
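A minimal sketch of such a SwiGLU feed-forward block; the exact hidden-size rounding used in the released code may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with SwiGLU; hidden size is (2/3)*4d rather than 4d."""
    def __init__(self, d_model: int):
        super().__init__()
        hidden = int(2 * 4 * d_model / 3)
        self.w1 = nn.Linear(d_model, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, hidden, bias=False)  # value projection
        self.w2 = nn.Linear(hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = (Swish(x W1) * x W3) W2
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

ffn = SwiGLUFeedForward(512)
print(ffn(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```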
3.3.3) Rotary Embeddings [GPTNeo]
• Replaced absolute positional embeddings with rotary positional embeddings (RoPE)
at each layer.

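A minimal sketch of rotary embeddings in the complex-number formulation commonly used with LLaMA-style models; tensor shapes are illustrative.

```python
import torch

def precompute_freqs_cis(head_dim: int, seq_len: int, theta: float = 10000.0):
    """Complex rotation factors e^{i*m*theta_k} for each position m and frequency k."""
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    angles = torch.outer(t, freqs)                     # (seq_len, head_dim/2)
    return torch.polar(torch.ones_like(angles), angles)

def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    """Rotate pairs of channels of q or k; x has shape (batch, seq, heads, head_dim)."""
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    freqs = freqs_cis.view(1, x_complex.shape[1], 1, x_complex.shape[-1])
    x_rotated = torch.view_as_real(x_complex * freqs)
    return x_rotated.flatten(-2).type_as(x)

q = torch.randn(2, 128, 8, 64)             # illustrative: batch, seq, heads, head_dim
freqs_cis = precompute_freqs_cis(64, 128)
q_rot = apply_rotary_emb(q, freqs_cis)     # same shape as q, positions now encoded
```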
3.4) Optimizer
• Trained using AdamW optimizer.
• Hyper-parameters: β1 = 0.9, β2 =
0.95.
• Used cosine learning rate
schedule, such that the final
learning rate is equal to 10% of
the maximal learning rate.
• Used a weight decay of 0.1 and
gradient clipping of 1.0.
• Used 2,000 warmup steps, and
varied the learning rate and batch
size with the size of the model.
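A minimal sketch of this optimizer setup in PyTorch. The betas, weight decay, gradient clipping, warmup steps, and 10% final learning rate follow the values above; the peak learning rate, total step count, and the stand-in model are illustrative assumptions (the actual values vary with model size).

```python
import math
import torch

model = torch.nn.Linear(512, 512)        # illustrative stand-in for the transformer

max_lr, min_ratio = 3e-4, 0.10           # final LR = 10% of the maximal LR
warmup, total_steps = 2000, 100_000      # 2,000 warmup steps; total is illustrative

opt = torch.optim.AdamW(model.parameters(), lr=max_lr,
                        betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step: int) -> float:
    if step < warmup:                                    # linear warmup
        return step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))    # 1 -> 0 over training
    return min_ratio + (1 - min_ratio) * cosine          # decays from 1.0 to 0.10

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

# Inside the training loop:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping of 1.0
#   opt.step(); sched.step(); opt.zero_grad()
```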
3.5.1) Efficient implementation
• Efficient implementation of the causal multi-head attention operator, building on
"Self-attention Does Not Need O(n²) Memory" and
"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness", to
reduce memory usage and computation.
• Achieved by not storing the attention weights and not computing the key/query
scores that are masked due to the causal nature of the language modeling task.
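The paper builds on the xformers attention kernels; a minimal sketch of the same idea using PyTorch's fused scaled_dot_product_attention (an assumption here, not the paper's exact code) is below. The fused kernel never materializes the full attention matrix and skips the causally masked scores.

```python
import torch
import torch.nn.functional as F

B, H, S, D = 2, 8, 1024, 64        # batch, heads, sequence length, head dim (illustrative)
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
v = torch.randn(B, H, S, D)

# Fused causal attention: memory-efficient, attention weights are never stored.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                   # torch.Size([2, 8, 1024, 64])
```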
3.5.2) Efficient implementation
• Reduced the number of activations that are recomputed during the backward pass
with checkpointing.
• Saved the activations that are expensive to compute, such as the outputs of linear
layers.
• Achieved by manually implementing the backward function for the transformer
layers, instead of relying on the PyTorch autograd.
• Reduced the memory usage of the model by using model and sequence parallelism.
• Moreover, the computation of activations and the communication between GPUs over
the network (due to all_reduce operations) are overlapped as much as possible.
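The paper implements the transformer backward pass manually so that only expensive activations (e.g. linear layer outputs) are saved; a simpler off-the-shelf way to get the same memory/compute trade-off is PyTorch activation checkpointing, sketched here on an illustrative stand-in block.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Illustrative stand-in; a real transformer block would go here.
block = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.SiLU(), torch.nn.Linear(2048, 512)
)

x = torch.randn(4, 512, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed during the
# backward pass, trading extra compute for lower memory usage.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```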
4) Main Results
• Considered zero-shot and few-shot tasks, reporting results on a total of 20
benchmarks.
- Zero-shot: the model is given a textual description of the task and a test example,
and either produces an answer via open-ended generation or ranks the proposed answers.
- Few-shot: the model is given a few examples of the task (between 1 and 64) and a test
example, and either produces an answer or ranks the different options.
• Compared with both non-public and open-source models.
• Evaluated LLaMA on free-form generation tasks and multiple-choice tasks.
4.1) Common
Sense
Reasoning
• Consider eight standard
common sense reasoning
benchmarks.
• LLaMA-65B outperforms
Chinchilla-70B on almost
all benchmarks.
• Surpassed PaLM-540B
everywhere except on
BoolQ and WinoGrande.
• LLaMA-13B model also
outperforms GPT-3 on
most benchmarks despite
being 10× smaller.
4.1) Common Sense Reasoning
• Generated answers are evaluated with the standard exact match
metric: a generated answer is considered correct if it matches any
answer of the list of answers after normalization.
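A minimal sketch of a SQuAD-style normalized exact-match check of the kind described above; the exact normalization rules used by the authors may differ.

```python
import re
import string

def normalize(text: str) -> str:
    # Lowercase, drop punctuation and English articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    # Correct if the normalized prediction matches any normalized gold answer.
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

print(exact_match("The Eiffel Tower!", ["Eiffel Tower", "eiffel tower, Paris"]))  # True
```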
4.2) Closed-book
Question Answering
• Compared LLaMA to existing LLMs on two closed-book
question answering benchmarks: Natural Questions and TriviaQA.
• On both benchmarks, LLaMA-65B achieves state-of-
the-art performance in the zero-shot and few-shot
settings.
• LLaMA-13B is also competitive on these benchmarks
with GPT-3 and Chinchilla, despite being 5-10×
smaller.
4.3) Reading
Comprehension
• Evaluated on the RACE reading
comprehension benchmark, a dataset
collected from English reading
comprehension exams designed for
Chinese middle and high school
students.
• LLaMA-65B is competitive with
PaLM-540B, and
• LLaMA-13B outperforms GPT-3 by a
few percent.
4.4) Mathematical
reasoning
• Evaluated models on two
mathematical reasoning
benchmarks:
- MATH: Dataset of 12K middle
school and high school mathematics
problems.
- GSM8k: Dataset of middle school
mathematical problems.
• On GSM8k, LLaMA-65B outperforms
Minerva-62B, although it has not
been fine-tuned on mathematical
data.
4.5) Code
generation
• Evaluated on two benchmarks, HumanEval and MBPP.
• Model received a description
of the program in a few
sentences, as well as a few
input-output examples.
• Output should be in Python.
• For a similar number of
parameters, LLaMA
outperforms other general
models.
• LLaMA-65B also outperforms
PaLM-62B, even when the latter
is trained longer.
4.6) Massive
Multitask
Language
Understanding
• Consists of multiple-choice
questions covering various
domains of knowledge,
including humanities, STEM
and social sciences.
• LLaMA-65B is behind both
Chinchilla-70B and PaLM-540B
by a few percent on average,
and across most domains.
• A potential explanation is that
LLaMA used a limited amount of
books and academic papers in
its pre-training data.
4.7) Evolution
of
performance
during training
• Used a few question answering
and common sense reasoning
benchmarks.
• On most benchmarks, the
performance improves steadily,
except on SIQA and WinoGrande.
• On SIQA, a lot of variance in
performance was observed,
which may indicate that this
benchmark is not reliable.
• LLaMA-33B and LLaMA-65B
have similar performance
during training.
5) Instruction
Finetuning
• Brief finetuning on instruction data
rapidly leads to improvements on MMLU.
• Observed that a very small amount of
finetuning improves the performance on
MMLU.
• Despite the simplicity of the instruction
finetuning approach used, LLaMA-I (65B)
outperforms existing instruction-finetuned
models of moderate size on MMLU, but is
still far from the state of the art, which is
77.4 for GPT code-davinci-002 on MMLU.
6) Bias, Toxicity and Misinformation
• Evaluated on different benchmarks that measure toxic content production and
stereotype detection.
• While they selected some of the standard benchmarks that are used by the language
model community to indicate some of the issues with these models, these
evaluations are not sufficient to fully understand the risks associated with these
models.
6.1)
RealToxicityPrompts
• RealToxicityPrompts consists of about 100k
prompts that the model must complete;
then a toxicity score is automatically
evaluated by making a request to the
Perspective API.
• Observed that toxicity increases with the
size of the model, especially for Respectful
prompts.
• A noted exception in prior work is that
Gopher and Chinchilla show similar
toxicity despite their different sizes; this
could be explained by the fact that the
larger model, Gopher, has worse
performance than Chinchilla, suggesting
that the relation between toxicity and
model size may only apply within a
model family.
6.2) CrowS-Pairs
• This dataset measures biases
in 9 categories: gender, religion,
race/color, sexual orientation, age,
nationality, disability, physical
appearance and socioeconomic status.
• Each example is composed of a
stereotype and an anti-stereotype,
and the model's preference for the
stereotypical sentence is measured
using the perplexity of both
sentences in a zero-shot setting.
• The model is particularly biased in the
religion category (+10 compared to
OPT-175B), followed by age and gender
(+6 each compared to best model).
• A possible source of these biases is
CommonCrawl, despite the multiple
filtering steps.
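A minimal sketch of this zero-shot perplexity comparison using the Hugging Face transformers API; the small model name and the example sentences are stand-ins, not the paper's setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a small illustrative stand-in for the evaluated model.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def nll(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=ids returns the average token-level cross-entropy (log-perplexity).
        loss = lm(ids, labels=ids).loss
    return loss.item()

stereo = "Example stereotypical sentence."
anti = "Example anti-stereotypical sentence."
# Lower negative log-likelihood means the model prefers that sentence.
prefers_stereotype = nll(stereo) < nll(anti)
```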
6.3) WinoGender
• WinoGender is made of Winograd schema, and biases are evaluated by determining
whether a model's co-reference resolution performance is impacted by the gender of
the pronoun.
• Each sentence has three mentions: an “occupation”, a “participant”, and a
“pronoun”. Example: “The nurse notified the patient that his shift would be ending in
an hour.”
• Observed that the model is significantly better at performing co-reference resolution
for the “their/them/someone” pronouns than for the “her/her/she” and “his/him/he”
pronouns.
6.3) WinoGender
• To further investigate this hypothesis,
they look at the set of “gotcha” cases
for the “her/her/she” and
“his/him/he” pronouns in the
WinoGender dataset. (These cases
correspond to sentences in which the
pronoun does not match the majority
gender of the occupation, and the
occupation is the correct answer.)
• Observed that the model, LLaMA-65B,
makes more errors on the gotcha
examples, clearly showing that it
captures societal biases related to
gender and occupation. The drop in
performance exists for both “her/her/she”
and “his/him/he” pronouns, which is
indicative of biases regardless of
gender.
6.4) TruthfulQA
• Aims to measure the
truthfulness of a model, i.e., its
ability to identify when a claim is
true.
• The questions are written in
diverse styles, cover 38 categories
and are designed to be
adversarial.
• Compared to GPT-3, the model
scores higher in both categories,
but the rate of correct answers is
still low, showing that the model is
likely to hallucinate incorrect
answers.
7) Carbon footprint
• Calculated total energy consumption and the
resulting carbon footprint.
• Used formula to estimate the Wh, to train a
model as well as the tons of carbon emissions,
tCO2eq.
Wh = GPU-h × (GPU power consumption) × PUE, with PUE = 1.1
• Used the US national average carbon intensity
factor of 0.385 kg CO2eq/kWh:
tCO2eq = MWh × 0.385
• Result:
- Energy used: 2,638 MWh
- Period: 5 months
- Total Carbon Emission: 1,015 tCO2eq
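A quick sanity check of these totals with the formula above, using only the figures reported in this deck:

```python
# tCO2eq = MWh * 0.385 (US national average carbon intensity)
total_mwh = 2638                        # total energy used, as reported above
carbon_intensity = 0.385                # kg CO2eq per kWh == tonnes CO2eq per MWh
tco2eq = total_mwh * carbon_intensity
print(round(tco2eq, 1))                 # 1015.6, matching the reported ~1,015 tCO2eq
```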
8) Some more results…
8.1) Generations from LLaMA-65B (without instruction finetuning)
8.2) Generations from LLaMA-65B (without instruction finetuning)
8.3) Generations from LLaMA-I
LLaMA-I, i.e. LLaMA-65B fine-tuned with the protocol and instruction dataset.
8.4) Generations from LLaMA-I
8.5) Generations from LLaMA-I
8.6) Generations from LLaMA-I
8.7) Generations from LLaMA-I
8.8) Generations
from LLaMA-I
9) Conclusion
• Most notably, LLaMA-13B outperforms GPT-3 while being more than 10× smaller,
and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B.
• Unlike previous studies, this paper shows that it is possible to achieve state-of-the-art
performance by training exclusively on publicly available data, without resorting to
proprietary datasets.
• Releasing these models to the research community will accelerate the development
of large language models and help efforts to improve their robustness and mitigate
known issues such as toxicity and bias.
• Lastly, it was observed that finetuning these models on instructions leads to promising
results.
Any Questions?
