LLMs Course Andrew

Generative AI mainly focuses on chat bots.

Next-word prediction is the very important concept
that underlies important capabilities like:
- basic chat bots
- summarizing a document / file
- translation
- writing code
- augmented LLMs that connect to APIs to
  get information they were not trained on

The LLMs' parameter count increases with

the complexity of the language.


Prompt -> model -> completion:
the prompt contains the input,
the LLM is the model,
and the completion is the output.

Context window
The context window covers both sides: the prompt and the completion.
The size of input text you can pass, as you saw
before, depends on
the generative model.

Generation was RNN-based until 2017, when the transformers came out:

efficient on long sentences and easy to parallelize.

The transformer is far more powerful than the RNN.

The transformer learns to get the relevant meaning, not like

the image above, where each word only attends to its neighbour,

but like this image, where calculating attention between
all the words and each other makes the model
learn the relevance between all words.
That is called self-attention, which is computing the
attention weights between every word and all the others,
including itself.

Transformer

A transformer is a combination of distinct parts
that work in conjunction with each other.

Before passing the words to the model, we need to tokenize the
words:
which is converting the words to numbers, which are
the indexes of the words in a certain
dictionary (the vocabulary).
You can use several techniques, like one index
representing a whole word or one index representing a sub-word,
like here.

Then it is passed to an
embedding layer.

Rule: the tokenizer that was used during training

must be the same one we use to test the model
or during inference.
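A minimal sketch of tokenization with the Hugging Face transformers library; the model name "bert-base-uncased" is only an illustrative choice, and the exact IDs depend on that model's vocabulary.

```python
# Tokenize text into sub-word IDs: words -> indexes in the model's dictionary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative model

text = "The teacher teaches the student"
ids = tokenizer.encode(text, add_special_tokens=False)   # words -> dictionary indexes
tokens = tokenizer.convert_ids_to_tokens(ids)             # indexes -> sub-word strings

print(list(zip(tokens, ids)))   # exact IDs depend on the vocabulary
```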

The embedding layer maps tokens into a vector embedding space,

which is a high-dimensional space in which each
sub-word index is represented by a vector,
like word2vec.

In the original transformer the dimension of these
vectors is 512, so we can't
visualize it directly.
But for simplicity, imagine the dimension is just 3.

[Diagram: words such as "student" plotted as points in a simplified 3-D embedding space]

Related words are located near
each other in the embedding
space.
To get the distance between
them you use the angle
between their vectors (cosine similarity),
which gives the model the ability
to understand the language.
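A toy sketch of "distance as angle": cosine similarity between word vectors. The 3-dimensional vectors here are hypothetical, chosen only for readability; real models use hundreds of dimensions (e.g. 512 in the original transformer).

```python
import numpy as np

student = np.array([0.9, 0.1, 0.3])   # hypothetical embedding
teacher = np.array([0.8, 0.2, 0.4])   # hypothetical embedding
banana  = np.array([0.1, 0.9, 0.7])   # hypothetical embedding

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: 1.0 means the same direction
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(student, teacher))  # close to 1: small angle, related words
print(cosine_similarity(student, banana))   # smaller value: larger angle, less related
```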
Besides the token embedding, you need
to add the positional encoding.
Once you have summed
the positional encoding with the token embeddings, the model processes
all the tokens
in parallel.
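A minimal sketch of the sinusoidal positional encoding from the original transformer paper, summed with the token embeddings before the attention layers (random embeddings stand in for a real embedding layer).

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                   # (seq_len, 1) token positions
    i = np.arange(d_model)[None, :]                     # (1, d_model) dimension indices
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])                # odd dimensions use cosine
    return pe

token_embeddings = np.random.randn(10, 512)             # 10 tokens, d_model = 512
model_input = token_embeddings + positional_encoding(10, 512)  # summed, then processed in parallel
```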
Multi-head self-attention (MSA)

Self-attention makes the model attend to different parts

of the input sequence to better capture the
context of the sentence, and that doesn't happen
only once:

the number of heads is typically between

12 and 100.

That means multiple heads process the sequence

independently and in parallel with each other;
each head learns a different aspect of the language,
as you see here with the
different attention patterns
per head.
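A minimal sketch of scaled dot-product self-attention: every token attends to every other token, producing an attention-weight matrix over the whole sequence. Multi-head attention runs several of these in parallel, each with its own learned Q/K/V projections; the weights here are random stand-ins.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v                     # project tokens to Q, K, V
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # token-to-token relevance
    weights = F.softmax(scores, dim=-1)                     # attention weights, rows sum to 1
    return weights @ v, weights

seq_len, d_model = 6, 64
x = torch.randn(seq_len, d_model)                           # stand-in token embeddings
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
output, attn = self_attention(x, w_q, w_k, w_v)
print(attn.shape)   # (6, 6): one weight for every pair of tokens
```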

The output of the MSA proceeds through an MLP (feed-forward network),

as shown here.

The output of it is a vector of logits, proportional

to the probability score for each token in the vocabulary.
Last, the softmax layer outputs probabilities
for each word in the vocabulary; the highest one
should be the next word.
But
there are a variety of methods
used to change how the final token is
chosen from the softmax output.
I think the feed-forward network here works to output numbers
that represent what was learned from the attention scores.

The output of the encoder is a deep representation of the input
tokens, their structure and meaning.

This representation is passed to the decoder:

the decoder starts with a start-of-sequence token
together with the encoder output and its context understanding.
That means the output of
the encoder carries the context meaning of the language, which makes
the decoder able to predict the next word:
each output token of the
decoder is predicted based on the
encoder representation and the tokens generated so far.
Often the LLM models don't output a suitable
result on the first try, so one of the tricks
is to pass some information through the prompt
about the task you want the model to carry out.
This powerful strategy of improving the
model outputs
by providing examples inside the context
window is called in-context learning.

Instruction

The instruction is to produce the

sentiment; with the big LLMs this works as
zero-shot inference.
But that doesn't work well
with small models like
GPT-2.
How is that solved? With what is called one-shot
inference: you pass a full example first
and then pass the item
you need to classify, as in the sketch below.
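A sketch of the difference between a zero-shot and a one-shot prompt for the sentiment task; the wording is illustrative, not taken from any specific dataset.

```python
# Zero-shot: instruction only, no completed example.
zero_shot = """Classify this review:
I loved this movie!
Sentiment:"""

# One-shot: one fully worked example, then the item to classify.
one_shot = """Classify this review:
I loved this movie!
Sentiment: Positive

Classify this review:
I don't like this chair.
Sentiment:"""

# A large model can often answer the zero-shot prompt directly; a smaller model
# such as GPT-2 usually needs the completed example in the one-shot prompt.
```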

But remember: you

have a limited context
window.
So you can engineer your prompt to encourage your
model, up to a point that depends on its size;
beyond that you will need to fine-tune it.
There are a lot of parameters that influence the model's
inference behaviour;
each model exposes some inference parameters that
shape its output.

Max new tokens:
the model can't generate
more than this number of
tokens.

As you see here, you can change the max-tokens setting,

but the completion can still be shorter than the limit,

because generation also stops when a stop (end-of-sequence) token appears.

You may notice here that when

you
request 200 tokens, the output shown is still lower than 150.
Greedy decoding selects the token with the
highest probability at every step,
but it may fall into
repeated words (a greedy-decoding sketch follows).
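A minimal sketch of greedy decoding with a capped number of new tokens, using the Hugging Face generate() API; "gpt2" is only an illustrative small model choice.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The dog lay on the rug as", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=150,   # hard cap; generation can stop earlier at an end-of-sequence token
    do_sample=False,      # greedy decoding: always pick the highest-probability token
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```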

To avoid these repetitions and get more

natural replies, you need other
controls,
like random sampling, which is the easiest way to introduce
some variability:

it selects the next word randomly based on the probability

distribution.

So in some models you can switch from

greedy to random sampling,
controlled by some parameters:
top-k and
top-p, which help to get the best
sample.
k determines the number of words that have the
highest probability to sample from; if k = 3,
it means sampling only among the 3
highest-scoring tokens.

p makes you sample within a certain

range of cumulative probability, not based
on the number of words.

Temperature
We can say the temperature is used to determine the
shape of the probability distribution,
as you see here in the distributions,
in contrast to concentrating on a certain word.

Temperature = 1 is the default, which

leaves the distribution as the model produced it.

When you decrease the temperature, the spread of the probability distribution

narrows over fewer tokens, which reduces
the selection so the model will mostly pick the most likely tokens.

But when you increase the temperature above 1, the distribution
flattens, spreading probability over more tokens and making
the output more random. See the sketch below.
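A sketch of the sampling controls discussed above, again with the Hugging Face generate() API; "gpt2" and the parameter values are only illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The dog lay on the rug as", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,     # switch from greedy to random sampling
    top_k=3,            # sample only from the 3 highest-probability tokens
    top_p=0.9,          # ...and only within 90% cumulative probability mass
    temperature=0.7,    # < 1 sharpens the distribution, > 1 flattens it
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```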
Developers use some hubs to make it easier to
browse models.
The advantage of a hub is the inclusion of model
cards,
which describe important details like the best
use cases for each model,
how it was trained, and
known limitations.
Your selection of a model depends on how the model was trained,
because what a model is good at depends on how the

model was trained.

Pretraining: you need to collect tons of documents and

pass them

to the LLM;

the encoder generates deep representations (embeddings)

for each word based on its order and meaning.

Often you need a data-quality filter that removes harmful, biased or bad

tokens. That typically means the model uses
only about 1-3% of the originally scraped tokens for pretraining,
which increases the
quality.
As you know, there are several structures for transformers:
encoder-only, decoder-only, and encoder-decoder.

Encoder-only models, also called
autoencoding models,

are trained on a kind of input data called masked input sequences;

the objective is called denoising.
The representations of autoencoders are called bidirectional,

which means the models

have context understanding of the whole sentence,
not just the meaning of the words that come
before the masked
word.
Good use cases:
sentiment analysis,
named entity recognition,
word classification.
Examples: BERT, RoBERTa.
Decoder-only models, also called autoregressive models:

here the model's purpose is to predict the next token,

sometimes called full (causal) language
modeling.
To predict the next word,
the model iterates over the sequence with a
unidirectional context.

Decoder models are often used for text generation.

Decoders show strong performance with zero-shot
ability.
Examples: GPT,
Bloom.

Encoder-decoder models like T5 use span corruption: masking spans of the sentence so that
the model has to predict the masked span, as in the sketch below.
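A toy illustration of span corruption (the T5-style denoising objective): a contiguous span is replaced by a sentinel in the input, and the model is trained to produce the masked spans as the target. The `<X>` / `<Y>` placeholders are illustrative stand-ins for the sentinel tokens a real tokenizer would use.

```python
original = "The teacher teaches the student about the transformer"

corrupted_input = "The teacher <X> the student about <Y> transformer"  # spans replaced by sentinels
target = "<X> teaches <Y> the"                                          # model must reconstruct the spans
```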

One of the issues that challenges the computation

is
CUDA out-of-memory errors.
CUDA: Compute Unified Device Architecture;
TensorFlow and PyTorch use CUDA for matrix multiplication and other
parallel processing on the GPU.

A huge amount of memory is needed just for 1B

parameters:
roughly 4 GB to store the weights at 32-bit precision,
and around 24 GB of GPU RAM to actually train the model
(gradients, optimizer states, activations).

So they go for quantization to store the weights in fewer bits:

FP32 (full precision, 4 bytes),

FP16 (half precision, 2 bytes),
INT8 (1 byte),
to reduce the memory by quantization.

AI researchers often depend on BFLOAT16 (bfloat16), which keeps the
dynamic range of FP32 in only 2 bytes.
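A quick back-of-the-envelope sketch of why quantization matters: approximate memory needed just to store 1 billion parameters at different precisions (training needs several times more for gradients, optimizer states, and activations).

```python
params = 1_000_000_000

bytes_per_param = {"FP32": 4, "FP16 / BFLOAT16": 2, "INT8": 1}
for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype}: {params * nbytes / 1e9:.0f} GB")
# FP32: 4 GB, FP16/BFLOAT16: 2 GB, INT8: 1 GB -- hence the move to bfloat16 / int8
```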

The goal is to reduce the memory required to train models;

sometimes that means distributing the training of models across
GPUs.

DDP (Distributed Data Parallel) is one of the solutions; it is used to split

the batches across GPUs and then sync the gradients to
update the model copies as one update. But this solution
is only usable if the model parameters can be contained on one GPU.
But what if not?
Another technique is called model sharding;
one of the most important techniques is
Fully Sharded Data Parallel
(FSDP).

FSDP is motivated by a technique proposed by Microsoft in 2019 called ZeRO,
which stands for Zero
Redundancy Optimizer.
The goal of ZeRO is distributing, or sharding,
the model states across GPUs with zero data
overlapping.
That allows you to scale model training
even if your model doesn't fit on one
GPU.
There are three levels (stages) of ZeRO:

stage 1 shards the optimizer states, stage 2 also shards the gradients,
and stage 3 shards the model parameters as well;
together these make up the bulk of the training memory cost.

Here, in FSDP,
before the forward pass we gather the needed model weights
so that the corresponding part of the model on a
certain GPU can compute and be updated.

You can configure the sharding level from the sharding-strategy parameter,

but you should notice that it is a trade-off between
the communication between GPUs and
memory. A minimal configuration sketch follows.
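A minimal sketch of configuring the sharding level in PyTorch FSDP; the tiny linear layer is a placeholder for a large transformer, and a real setup also needs torch.distributed process-group initialization and multiple GPUs before this will run.

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

model = nn.Linear(512, 512)  # stand-in for a large transformer

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer states (ZeRO-3 style)
    # ShardingStrategy.SHARD_GRAD_OP -> shard grads + optimizer states only (ZeRO-2 style)
    # ShardingStrategy.NO_SHARD      -> plain DDP-like replication
)
```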
Why and how do developers scale the models?

You may increase the size of the data

or the number of model parameters,
but these two choices are linearly
related
to the cost in GPUs and
training time.

To understand the discussion ahead, you need to know the

unit of compute that quantifies the required resources,
which is the petaflop/s-day: the number of floating-point operations
a system sustaining one petaflop per second can perform in a day.

So a petaflop/s-day is the cumulative work of 1 petaflop per second

sustained

during a full day of system performance.

1 petaflop/s-day
can be translated to roughly 8 NVIDIA V100 GPUs,

or about 2 NVIDIA A100 GPUs,
running at full efficiency for one full
day.
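A quick sanity check on the compute unit: 1 petaflop/s-day is the number of floating-point operations done by a machine sustaining 10^15 FLOP/s for a full day.

```python
flops_per_second = 1e15            # 1 petaflop per second
seconds_per_day = 24 * 60 * 60

one_petaflop_s_day = flops_per_second * seconds_per_day
print(f"{one_petaflop_s_day:.2e} floating-point operations")   # ~8.64e19
```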

As we can notice, the relationship between

the loss and petaflop/s-days is proportional (a power-law scaling),

and the same holds for the impact of growing the dataset size and the number of parameters.


There is a paper called Chinchilla that figured
that out:

the Chinchilla paper decided the compute-optimal size for an LLM model and

the optimal size of its training dataset:
roughly 20 training tokens per model parameter.

The goal of the Chinchilla paper is to make all the model

parameters trained with enough data, neither under-trained
nor over-parameterized. See the sketch below.
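A rough sketch of the Chinchilla compute-optimal rule of thumb: the training set should contain roughly 20 tokens per model parameter.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    # compute-optimal training-set size under the ~20 tokens/parameter rule of thumb
    return n_params * tokens_per_param

print(f"{chinchilla_optimal_tokens(70e9):.1e} tokens")   # ~1.4e12 tokens for a 70B-parameter model
```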
Just one situation will need training from scratch:
when you need your model to understand language that uses different
word meanings than the ones we use in our daily situations,
like specialized domains (legal, medical, finance),

because the model learns the language

from the initial
training.
Fine-tune
with instruction prompts

As we talked about before regarding in-context learning

and how the number of shots affects the response,

this method has some drawbacks:

it doesn't work well for small models, and the extra shots take up

space in the context window.

So there is a solution for this, which is fine-tuning.

Fine-tuning is supervised learning, in which the training data

is labeled: the examples are
prompt-completion pairs.
Instruction fine-tuning is one of the solutions used to
improve the model on a
specific task:

you make the prompt-completion pairs consist of an instruction plus

the desired completion, e.g. starting with "Classify this sentiment: ...".
So the data that you use contains the needed instruction that
you want your model to adapt to, as in the template sketch below.
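A sketch of turning a raw example into an instruction prompt/completion pair of the kind used for instruction fine-tuning; the wording of the template is illustrative.

```python
template = (
    "Classify this review:\n"
    "{review}\n"
    "Sentiment: {label}"
)

example = {"review": "I loved this movie!", "label": "Positive"}
print(template.format(**example))   # one instruction-formatted training example
```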

The fine-tuning that updates all your model's weights is called

full fine-tuning.
Full fine-tuning also needs
a lot of memory for the gradients, optimizer states, etc.

To go through LLM fine-tuning you need to

prepare your data: actually, almost any data
on the internet can be used, but it should first
be put in the form of prompt-instruction templates.

The training loop then fine-tunes the model on the data

drawn from your templates.
The fine-tuned model is called an instruct model.

Fine-tuning the model on a

single task:

good results can be achieved with relatively few examples,

often 500-1000 examples,
not like pretraining, which needs billions of examples.
There is an issue with training a model on a single task,
which is that it can lead to catastrophic forgetting.

Catastrophic forgetting
happens because full fine-tuning
modifies the weights of the original LLM; while these changes
lead to great performance on the single fine-tuning task,
they degrade performance on the other tasks.
To avoid it, several options can help:
maybe there is no issue at all if you don't need the model to generalize;
fine-tune on multiple tasks, which needs on the order of 50,000-100,000
examples; or use PEFT.

Parameter-Efficient Fine-Tuning (PEFT)

PEFT is a set of techniques used to preserve the weights
of the original model and train only a small number of task-specific parameters
that are responsible for the new changes.

Multitask instruction fine-tuning

is an extension of single-task fine-tuning, where

the training data is comprised of input-output examples for
multiple tasks.

One of the examples used to fine-tune models on various

tasks is the
FLAN family of models
(Fine-tuned LAnguage Net).
FLAN is a specific set of instructions used to fine-tune different
models;
the authors called it
"the metaphorical dessert to the main course of
pretraining".

FLAN-T5 is the instruct version of T5, and the

same for FLAN-PaLM.

FLAN-T5 is a good general-purpose instruct LLM.

As you can see here, the generated and reference phrases may be equivalent but don't have the
same
words.

What metrics can we use for evaluation?

ROUGE score: Recall-Oriented Understudy for Gisting Evaluation;
used for text summarization;
compares a summary to one or more reference summaries written by humans.

BLEU score: Bilingual Evaluation Understudy;
used for text translation;
compares a generated translation to human translations.
Example sentence: "The dog lay on the rug as I sipped a cup of tea."

ROUGE-1 precision measures how many of the generated unigrams match the
reference, divided by the number of unigrams in the output:

precision = unigram matches / unigrams in output,
and ROUGE-1 recall divides the matches by the unigrams in the reference.

But there are other kinds of ROUGE-N (and ROUGE-L) that can better reflect

the accuracy of the meaning itself. See the sketch below.
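A minimal sketch of ROUGE-1 precision and recall as unigram overlap; real implementations also handle stemming, ROUGE-2 bigrams, ROUGE-L, etc. The example sentence pair is illustrative.

```python
def rouge1(reference: str, generated: str):
    ref, gen = reference.lower().split(), generated.lower().split()
    # clipped unigram matches between generated and reference
    matches = sum(min(ref.count(w), gen.count(w)) for w in set(gen))
    return {"precision": matches / len(gen), "recall": matches / len(ref)}

print(rouge1("It is cold outside", "It is very cold outside"))
# precision = 4/5, recall = 4/4 -- matching unigrams over generated / reference length
```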
As you know, full fine-tuning needs a lot of storage for all the
parameters, so there is a solution, which is
PEFT:
Parameter-Efficient Fine-Tuning.
PEFT updates only a part of the weights and
freezes the others,
mainly updating the weights that affect the
task that you need to adapt your model to
(like certain weights of a layer).

Another kind of technique doesn't touch the original weights at all:

it just adds new parameters or a new layer.

So PEFT can be performed on a single GPU,

and it suffers less from catastrophic forgetting.

With the added weights being small,
the instruct model size is about the same size as the original
one.

There are several methods that you can use in PEFT,

but it is a trade-off between some aspects:

(A) Selective
(B) Reparameterization
(C) Additive

(A) Selective: just select a subset of the model parameters and train only those;

but the researchers found its results are mixed,
and so is its efficiency in general.

(C) Additive: adapters are like adding new layers onto the encoder components;

soft prompts are like adding learnable weights to the input

embeddings, or retraining the embedding weights.
LoRA:
Low-Rank Adaptation for LLMs,
falls in the reparameterization category.

As you know, the transformer consists of FFN and MSA blocks,

and the researchers found out it's enough to apply LoRA
just to the self-attention (MSA) pretrained weights.

Also, LoRA trains 2 small low-rank matrices whose product

is a matrix with the same dimensions as the original one. By
adding that product to the original weights, the adaptation you need
is done without any inference latency.
For example,
in the transformer paper
the attention weight matrix has dimensions 512 x 64 = 32,768 parameters.

So with LoRA at rank 8 you have

2 low-rank matrices:
A: 8 x 64 = 512 parameters,
B: 512 x 8 = 4,096 parameters,
which is roughly an 86% reduction in trainable parameters, as worked out below.
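The worked example above as plain arithmetic: a 512 x 64 attention weight matrix versus its rank-8 LoRA decomposition.

```python
d, k, r = 512, 64, 8

full = d * k                 # 32,768 trainable parameters in the original matrix
lora = r * k + d * r         # 8*64 + 512*8 = 512 + 4,096 = 4,608 parameters
print(full, lora, f"{1 - lora / full:.0%} fewer trainable parameters")   # ~86% reduction
```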
So the training becomes cheap:
because of LoRA you often don't need the
distributed setups like DDP or FSDP; a single GPU can be enough.

So you can train multiple different rank-decomposition matrices,

one per task, and switch between them whenever

you need a certain task.
Here, as you see, comparing LoRA against full fine-tuning:
yes, full fine-tuning scores a bit higher than LoRA,
but not by too much.

You may ask: what should I choose for the rank?

Based on the results in the previous tables, the

best rank is between 8 and 32. A configuration sketch follows.
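A minimal sketch of applying LoRA with the Hugging Face peft library, using a rank in the 8-32 range discussed above; the base model and target module names are illustrative choices (they fit T5-style attention layers).

```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForSeq2SeqLM

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")  # illustrative model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor for the update
    target_modules=["q", "v"],  # apply LoRA only to the attention projections
    lora_dropout=0.05,
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()   # only a small fraction of weights is trainable
```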
Prompt tuning with soft prompts

Prompt tuning is different from prompt engineering:

prompt engineering is about how to handle the wording of the prompt, or

adding more examples or shots, like the concept of in-context learning (ICL);

it needs a lot of manual effort to write and try
different prompts.

The key idea of prompt tuning:
you add additional trainable tokens to the original
input

and leave it to the training process to determine their optimal values.

As you know, the

embeddings of each token

exist at a unique point in a
multidimensional space.
But the soft prompts are not fixed vectors: you can think of them as
virtual tokens that can take on any value in continuous space,
and the fine-tuning process learns what those values should be.

In full fine-tuning, the dataset consists of prompts and their completions

and the model parameters get updated;
in contrast, with soft prompts the model parameters are frozen and just
the added embeddings are what gets updated.

For sure, the use of this technique is efficient, because it just trains
a small number of parameters, as in the sketch below.
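A minimal sketch of prompt tuning with the peft library: the base model is frozen and only a handful of virtual-token embeddings are trained; the model choice and token count are illustrative.

```python
from peft import PromptTuningConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")   # illustrative base model

config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,   # 20 trainable soft-prompt vectors prepended to the input
)

peft_model = get_peft_model(base_model, config)
peft_model.print_trainable_parameters()   # only the soft-prompt embeddings are trainable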

So you can prepare different soft prompts for different tasks and

switch between them;
soft prompts are also very tiny on disk.

As you can see, the results differ based on the different tuning

techniques.
Note:
the soft prompts do not have to represent words in the vocabulary,
but using k-nearest-neighbours the authors found that the vocabulary tokens closest
to each soft prompt form semantic clusters, which means the soft prompts
learn word-like representations.