LLMs Course Andrew

Generative AI mainly focuses on chat bots.

Next-word prediction is the very important concept
that underlies important capabilities like:
- basic chat bots
- summarizing a document / file
- translation
- writing code
- augmented LLMs that connect to APIs to
  get information they were not trained on

The LLMs' parameter count increases with

the complexity of the language.


Prompt -> model -> completion:
the prompt contains the input,
the LLM is the model,
and the completion is the output.

Context window
The context window covers both sides: the prompt and the completion.
The size of input text you can pass, as you saw
before, depends on
the generative model.

Generation was RNN-based until 2017, when the transformers came out:

efficient on long sentences and easy to parallelize.

The transformer is far more powerful than the RNN.

The transformer learns to get the relevant meaning, not like

the image above, where each word only attends to its neighbour,

but like this image, where calculating attention between
all the words and each other makes the model
learn the relevance between all words.
That is called self-attention, which is computing the
attention weights between every word and all the others,
including itself.

Transformer

A transformer is a combination of distinct parts
that work in conjunction with each other.

Before passing the words to the model, we need to tokenize the
words:
which is converting the words to numbers, which are
the indexes of the words in a certain
dictionary (the vocabulary).
You can use several techniques, like one index
representing a whole word or one index representing a sub-word,
like here.

Then it is passed to an
embedding layer.

Rule: the tokenizer that was used during training

must be the same one we use to test the model
or during inference.
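A minimal sketch of tokenization with the Hugging Face transformers library; the model name "bert-base-uncased" is only an illustrative choice, and the exact IDs depend on that model's vocabulary.

```python
# Tokenize text into sub-word IDs: words -> indexes in the model's dictionary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative model

text = "The teacher teaches the student"
ids = tokenizer.encode(text, add_special_tokens=False)   # words -> dictionary indexes
tokens = tokenizer.convert_ids_to_tokens(ids)             # indexes -> sub-word strings

print(list(zip(tokens, ids)))   # exact IDs depend on the vocabulary
```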

The embedding layer maps tokens into a vector embedding space,

which is a high-dimensional space in which each
sub-word index is represented by a vector,
like word2vec.

In the original transformer the dimension of these
vectors is 512, so we can't
visualize it directly.
But for simplicity, imagine the dimension is just 3.

[Diagram: words such as "student" plotted as points in a simplified 3-D embedding space]

Related words are located near
each other in the embedding
space.
To get the distance between
them you use the angle
between their vectors (cosine similarity),
which gives the model the ability
to understand the language.
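A toy sketch of "distance as angle": cosine similarity between word vectors. The 3-dimensional vectors here are hypothetical, chosen only for readability; real models use hundreds of dimensions (e.g. 512 in the original transformer).

```python
import numpy as np

student = np.array([0.9, 0.1, 0.3])   # hypothetical embedding
teacher = np.array([0.8, 0.2, 0.4])   # hypothetical embedding
banana  = np.array([0.1, 0.9, 0.7])   # hypothetical embedding

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: 1.0 means the same direction
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(student, teacher))  # close to 1: small angle, related words
print(cosine_similarity(student, banana))   # smaller value: larger angle, less related
```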
Besides the token embedding, you need
to add the positional encoding.
Once you have summed
the positional encoding with the token embeddings, the model processes
all the tokens
in parallel.
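A minimal sketch of the sinusoidal positional encoding from the original transformer paper, summed with the token embeddings before the attention layers (random embeddings stand in for a real embedding layer).

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                   # (seq_len, 1) token positions
    i = np.arange(d_model)[None, :]                     # (1, d_model) dimension indices
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])                # odd dimensions use cosine
    return pe

token_embeddings = np.random.randn(10, 512)             # 10 tokens, d_model = 512
model_input = token_embeddings + positional_encoding(10, 512)  # summed, then processed in parallel
```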
Multi-head self-attention (MSA)

Self-attention makes the model attend to different parts

of the input sequence to better capture the
context of the sentence, and that doesn't happen
only once:

the number of heads is typically between

12 and 100.

That means multiple heads process the sequence

independently and in parallel with each other;
each head learns a different aspect of the language,
as you see here with the
different attention patterns
per head.
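A minimal sketch of scaled dot-product self-attention: every token attends to every other token, producing an attention-weight matrix over the whole sequence. Multi-head attention runs several of these in parallel, each with its own learned Q/K/V projections; the weights here are random stand-ins.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v                     # project tokens to Q, K, V
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # token-to-token relevance
    weights = F.softmax(scores, dim=-1)                     # attention weights, rows sum to 1
    return weights @ v, weights

seq_len, d_model = 6, 64
x = torch.randn(seq_len, d_model)                           # stand-in token embeddings
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
output, attn = self_attention(x, w_q, w_k, w_v)
print(attn.shape)   # (6, 6): one weight for every pair of tokens
```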

The output of the MSA proceeds through an MLP (feed-forward network),

as shown here.

The output of it is a vector of logits, proportional

to the probability score for each token in the vocabulary.
Last, the softmax layer outputs probabilities
for each word in the vocabulary; the highest one
should be the next word.
But
there are a variety of methods
used to change how the final token is
chosen from the softmax output.
I think the feed-forward network here works to output numbers
that represent what was learned from the attention scores.

The output of the encoder is a deep representation of the input
tokens, their structure and meaning.

This representation is passed to the decoder:

the decoder starts with a start-of-sequence token
together with the encoder output and its context understanding.
That means the output of
the encoder carries the context meaning of the language, which makes
the decoder able to predict the next word:
each output token of the
decoder is predicted based on the
encoder representation and the tokens generated so far.
Often the LLM models don't output a suitable
result on the first try, so one of the tricks
is to pass some information through the prompt
about the task you want the model to carry out.
This powerful strategy of improving the
model outputs
by providing examples inside the context
window is called in-context learning.

Instruction

The instruction is to produce the

sentiment; with the big LLMs this works as
zero-shot inference.
But that doesn't work well
with small models like
GPT-2.
How is that solved? With what is called one-shot
inference: you pass a full example first
and then pass the item
you need to classify, as in the sketch below.
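A sketch of the difference between a zero-shot and a one-shot prompt for the sentiment task; the wording is illustrative, not taken from any specific dataset.

```python
# Zero-shot: instruction only, no completed example.
zero_shot = """Classify this review:
I loved this movie!
Sentiment:"""

# One-shot: one fully worked example, then the item to classify.
one_shot = """Classify this review:
I loved this movie!
Sentiment: Positive

Classify this review:
I don't like this chair.
Sentiment:"""

# A large model can often answer the zero-shot prompt directly; a smaller model
# such as GPT-2 usually needs the completed example in the one-shot prompt.
```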

But remember: you

have a limited context
window.
So you can engineer your prompt to encourage your
model, up to a point that depends on its size;
beyond that you will need to fine-tune it.
There are a lot of parameters that influence the model's
inference behaviour;
each model exposes some inference parameters that
shape its output.

Max new tokens:
the model can't generate
more than this number of
tokens.

As you see here, you can change the max-tokens setting,

but the completion can still be shorter than the limit,

because generation also stops when a stop (end-of-sequence) token appears.

You may notice here that when

you
request 200 tokens, the output shown is still lower than 150.
Greedy decoding selects the token with the
highest probability at every step,
but it may fall into
repeated words (a greedy-decoding sketch follows).
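A minimal sketch of greedy decoding with a capped number of new tokens, using the Hugging Face generate() API; "gpt2" is only an illustrative small model choice.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The dog lay on the rug as", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=150,   # hard cap; generation can stop earlier at an end-of-sequence token
    do_sample=False,      # greedy decoding: always pick the highest-probability token
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```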

To avoid these repetitions and get more

natural replies, you need other
controls,
like random sampling, which is the easiest way to introduce
some variability:

it selects the next word randomly based on the probability

distribution.

So in some models you can switch from

greedy to random sampling,
controlled by some parameters:
top-k and
top-p, which help to get the best
sample.
k determines the number of words that have the
highest probability to sample from; if k = 3,
it means sampling only among the 3
highest-scoring tokens.

p makes you sample within a certain

range of cumulative probability, not based
on the number of words.

Temperature
We can say the temperature is used to determine the
shape of the probability distribution,
as you see here in the distributions,
in contrast to concentrating on a certain word.

Temperature = 1 is the default, which

leaves the distribution as the model produced it.

When you decrease the temperature, the spread of the probability distribution

narrows over fewer tokens, which reduces
the selection so the model will mostly pick the most likely tokens.

But when you increase the temperature above 1, the distribution
flattens, spreading probability over more tokens and making
the output more random. See the sketch below.
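A sketch of the sampling controls discussed above, again with the Hugging Face generate() API; "gpt2" and the parameter values are only illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The dog lay on the rug as", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,     # switch from greedy to random sampling
    top_k=3,            # sample only from the 3 highest-probability tokens
    top_p=0.9,          # ...and only within 90% cumulative probability mass
    temperature=0.7,    # < 1 sharpens the distribution, > 1 flattens it
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```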
Developers use some hubs to make it easier to
browse models.
The advantage of a hub is the inclusion of model
cards,
which describe important details like the best
use cases for each model,
how it was trained, and
known limitations.
Your selection of a model depends on how the model was trained,
because what a model is good at depends on how the

model was trained.

Pretraining: you need to collect tons of documents and

pass them

to the LLM;

the encoder generates deep representations (embeddings)

for each word based on its order and meaning.

Often you need a data-quality filter that removes harmful, biased or bad

tokens. That typically means the model uses
only about 1-3% of the originally scraped tokens for pretraining,
which increases the
quality.
As you know, there are several structures for transformers:
encoder-only, decoder-only, and encoder-decoder.

Encoder-only models, also called
autoencoding models,

are trained on a kind of input data called masked input sequences;

the objective is called denoising.
The representations of autoencoders are called bidirectional,

which means the models

have context understanding of the whole sentence,
not just the meaning of the words that come
before the masked
word.
Good use cases:
sentiment analysis,
named entity recognition,
word classification.
Examples: BERT, RoBERTa.
Decoder-only models, also called autoregressive models:

here the model's purpose is to predict the next token,

sometimes called full (causal) language
modeling.
To predict the next word,
the model iterates over the sequence with a
unidirectional context.

Decoder models are often used for text generation.

Decoders show strong performance with zero-shot
ability.
Examples: GPT,
Bloom.

Encoder-decoder models like T5 use span corruption: masking spans of the sentence so that
the model has to predict the masked span, as in the sketch below.
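A toy illustration of span corruption (the T5-style denoising objective): a contiguous span is replaced by a sentinel in the input, and the model is trained to produce the masked spans as the target. The `<X>` / `<Y>` placeholders are illustrative stand-ins for the sentinel tokens a real tokenizer would use.

```python
original = "The teacher teaches the student about the transformer"

corrupted_input = "The teacher <X> the student about <Y> transformer"  # spans replaced by sentinels
target = "<X> teaches <Y> the"                                          # model must reconstruct the spans
```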

One of the issues that challenges the computation

is
CUDA out-of-memory errors.
CUDA: Compute Unified Device Architecture;
TensorFlow and PyTorch use CUDA for matrix multiplication and other
parallel processing on the GPU.

A huge amount of memory is needed just for 1B

parameters:
roughly 4 GB to store the weights at 32-bit precision,
and around 24 GB of GPU RAM to actually train the model
(gradients, optimizer states, activations).

So they go for quantization to store the weights in fewer bits:

FP32 (full precision, 4 bytes),

FP16 (half precision, 2 bytes),
INT8 (1 byte),
to reduce the memory by quantization.

AI researchers often depend on BFLOAT16 (bfloat16), which keeps the
dynamic range of FP32 in only 2 bytes.
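A quick back-of-the-envelope sketch of why quantization matters: approximate memory needed just to store 1 billion parameters at different precisions (training needs several times more for gradients, optimizer states, and activations).

```python
params = 1_000_000_000

bytes_per_param = {"FP32": 4, "FP16 / BFLOAT16": 2, "INT8": 1}
for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype}: {params * nbytes / 1e9:.0f} GB")
# FP32: 4 GB, FP16/BFLOAT16: 2 GB, INT8: 1 GB -- hence the move to bfloat16 / int8
```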

The goal is to reduce the memory required to train models;

sometimes that means distributing the training of models across
GPUs.

DDP (Distributed Data Parallel) is one of the solutions; it is used to split

the batches across GPUs and then sync the gradients to
update the model copies as one update. But this solution
is only usable if the model parameters can be contained on one GPU.
But what if not?
Another technique is called model sharding;
one of the most important techniques is
Fully Sharded Data Parallel
(FSDP).

FSDP is motivated by a technique proposed by Microsoft in 2019 called ZeRO,
which stands for Zero
Redundancy Optimizer.
The goal of ZeRO is distributing, or sharding,
the model states across GPUs with zero data
overlapping.
That allows you to scale model training
even if your model doesn't fit on one
GPU.
There are three levels (stages) of ZeRO:

stage 1 shards the optimizer states, stage 2 also shards the gradients,
and stage 3 shards the model parameters as well;
together these make up the bulk of the training memory cost.

Here, in FSDP,
before the forward pass we gather the needed model weights
so that the corresponding part of the model on a
certain GPU can compute and be updated.

You can configure the sharding level from the sharding-strategy parameter,

but you should notice that it is a trade-off between
the communication between GPUs and
memory. A minimal configuration sketch follows.
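A minimal sketch of configuring the sharding level in PyTorch FSDP; the tiny linear layer is a placeholder for a large transformer, and a real setup also needs torch.distributed process-group initialization and multiple GPUs before this will run.

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

model = nn.Linear(512, 512)  # stand-in for a large transformer

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer states (ZeRO-3 style)
    # ShardingStrategy.SHARD_GRAD_OP -> shard grads + optimizer states only (ZeRO-2 style)
    # ShardingStrategy.NO_SHARD      -> plain DDP-like replication
)
```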
Why and how do developers scale the models?

You may increase the size of the data

or the number of model parameters,
but these two choices are linearly
related
to the cost in GPUs and
training time.

To understand the discussion ahead, you need to know the

unit of compute that quantifies the required resources,
which is the petaflop/s-day: the number of floating-point operations
a system sustaining one petaflop per second can perform in a day.

So a petaflop/s-day is the cumulative work of 1 petaflop per second

sustained

during a full day of system performance.

1 petaflop/s-day
can be translated to roughly 8 NVIDIA V100 GPUs,

or about 2 NVIDIA A100 GPUs,
running at full efficiency for one full
day.
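A quick sanity check on the compute unit: 1 petaflop/s-day is the number of floating-point operations done by a machine sustaining 10^15 FLOP/s for a full day.

```python
flops_per_second = 1e15            # 1 petaflop per second
seconds_per_day = 24 * 60 * 60

one_petaflop_s_day = flops_per_second * seconds_per_day
print(f"{one_petaflop_s_day:.2e} floating-point operations")   # ~8.64e19
```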

As we can notice, the relationship between

the loss and petaflop/s-days is proportional (a power-law scaling),

and the same holds for the impact of growing the dataset size and the number of parameters.


There is a paper called Chinchilla that figured
that out:

the Chinchilla paper decided the compute-optimal size for an LLM model and

the optimal size of its training dataset:
roughly 20 training tokens per model parameter.

The goal of the Chinchilla paper is to make all the model

parameters trained with enough data, neither under-trained
nor over-parameterized. See the sketch below.
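A rough sketch of the Chinchilla compute-optimal rule of thumb: the training set should contain roughly 20 tokens per model parameter.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    # compute-optimal training-set size under the ~20 tokens/parameter rule of thumb
    return n_params * tokens_per_param

print(f"{chinchilla_optimal_tokens(70e9):.1e} tokens")   # ~1.4e12 tokens for a 70B-parameter model
```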
Just one situation will need training from scratch:
when you need your model to understand language that uses different
word meanings than the ones we use in our daily situations,
like specialized domains (legal, medical, finance),

because the model learns the language

from the initial
training.
Fine-tune
with instruction prompts

As we talked about before regarding in-context learning

and how the number of shots affects the response,

this method has some drawbacks:

it doesn't work well for small models, and the extra shots take up

space in the context window.

So there is a solution for this, which is fine-tuning.

Fine-tuning is supervised learning, in which the training data

is labeled: the examples are
prompt-completion pairs.
Instruction fine-tuning is one of the solutions used to
improve the model on a
specific task:

you make the prompt-completion pairs consist of an instruction plus

the desired completion, e.g. starting with "Classify this sentiment: ...".
So the data that you use contains the needed instruction that
you want your model to adapt to, as in the template sketch below.
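A sketch of turning a raw example into an instruction prompt/completion pair of the kind used for instruction fine-tuning; the wording of the template is illustrative.

```python
template = (
    "Classify this review:\n"
    "{review}\n"
    "Sentiment: {label}"
)

example = {"review": "I loved this movie!", "label": "Positive"}
print(template.format(**example))   # one instruction-formatted training example
```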

The fine-tuning that updates all your model's weights is called

full fine-tuning.
Full fine-tuning also needs
a lot of memory for the gradients, optimizer states, etc.

To go through LLM fine-tuning you need to

prepare your data: actually, almost any data
on the internet can be used, but it should first
be put in the form of prompt-instruction templates.

The training loop then fine-tunes the model on the data

drawn from your templates.
The fine-tuned model is called an instruct model.

Fine-tuning the model on a

single task:

good results can be achieved with relatively few examples,

often 500-1000 examples,
not like pretraining, which needs billions of examples.
There is an issue with training a model on a single task,
which is that it can lead to catastrophic forgetting.

Catastrophic forgetting
happens because full fine-tuning
modifies the weights of the original LLM; while these changes
lead to great performance on the single fine-tuning task,
they degrade performance on the other tasks.
To avoid it, several options can help:
maybe there is no issue at all if you don't need the model to generalize;
fine-tune on multiple tasks, which needs on the order of 50,000-100,000
examples; or use PEFT.

Parameter-Efficient Fine-Tuning (PEFT)

PEFT is a set of techniques used to preserve the weights
of the original model and train only a small number of task-specific parameters
that are responsible for the new changes.

Multitask instruction fine-tuning

is an extension of single-task fine-tuning, where

the training data is comprised of input-output examples for
multiple tasks.

One of the examples used to fine-tune models on various

tasks is the
FLAN family of models
(Fine-tuned LAnguage Net).
FLAN is a specific set of instructions used to fine-tune different
models;
the authors called it
"the metaphorical dessert to the main course of
pretraining".

FLAN-T5 is the instruct version of T5, and the

same for FLAN-PaLM.

FLAN-T5 is a good general-purpose instruct LLM.

As you can see here, the generated and reference phrases may be equivalent but don't have the
same
words.

What metrics can we use for evaluation?

ROUGE score: Recall-Oriented Understudy for Gisting Evaluation;
used for text summarization;
compares a summary to one or more reference summaries written by humans.

BLEU score: Bilingual Evaluation Understudy;
used for text translation;
compares a generated translation to human translations.
Example sentence: "The dog lay on the rug as I sipped a cup of tea."

ROUGE-1 precision measures how many of the generated unigrams match the
reference, divided by the number of unigrams in the output:

precision = unigram matches / unigrams in output,
and ROUGE-1 recall divides the matches by the unigrams in the reference.

But there are other kinds of ROUGE-N (and ROUGE-L) that can better reflect

the accuracy of the meaning itself. See the sketch below.
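A minimal sketch of ROUGE-1 precision and recall as unigram overlap; real implementations also handle stemming, ROUGE-2 bigrams, ROUGE-L, etc. The example sentence pair is illustrative.

```python
def rouge1(reference: str, generated: str):
    ref, gen = reference.lower().split(), generated.lower().split()
    # clipped unigram matches between generated and reference
    matches = sum(min(ref.count(w), gen.count(w)) for w in set(gen))
    return {"precision": matches / len(gen), "recall": matches / len(ref)}

print(rouge1("It is cold outside", "It is very cold outside"))
# precision = 4/5, recall = 4/4 -- matching unigrams over generated / reference length
```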
As you know, full fine-tuning needs a lot of storage for all the
parameters, so there is a solution, which is
PEFT:
Parameter-Efficient Fine-Tuning.
PEFT updates only a part of the weights and
freezes the others,
mainly updating the weights that affect the
task that you need to adapt your model to
(like certain weights of a layer).

Another kind of technique doesn't touch the original weights at all:

it just adds new parameters or a new layer.

So PEFT can be performed on a single GPU,

and it suffers less from catastrophic forgetting.

With the added weights being small,
the instruct model size is about the same size as the original
one.

There are several methods that you can use in PEFT,

but it is a trade-off between some aspects:

(A) Selective
(B) Reparameterization
(C) Additive

(A) Selective: just select a subset of the model parameters and train only those;

but the researchers found its results are mixed,
and so is its efficiency in general.

(C) Additive: adapters are like adding new layers onto the encoder components;

soft prompts are like adding learnable weights to the input

embeddings, or retraining the embedding weights.
LoRA:
Low-Rank Adaptation for LLMs,
falls in the reparameterization category.

As you know, the transformer consists of FFN and MSA blocks,

and the researchers found out it's enough to apply LoRA
just to the self-attention (MSA) pretrained weights.

Also, LoRA trains 2 small low-rank matrices whose product

is a matrix with the same dimensions as the original one. By
adding that product to the original weights, the adaptation you need
is done without any inference latency.
For example,
in the transformer paper
the attention weight matrix has dimensions 512 x 64 = 32,768 parameters.

So with LoRA at rank 8 you have

2 low-rank matrices:
A: 8 x 64 = 512 parameters,
B: 512 x 8 = 4,096 parameters,
which is roughly an 86% reduction in trainable parameters, as worked out below.
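The worked example above as plain arithmetic: a 512 x 64 attention weight matrix versus its rank-8 LoRA decomposition.

```python
d, k, r = 512, 64, 8

full = d * k                 # 32,768 trainable parameters in the original matrix
lora = r * k + d * r         # 8*64 + 512*8 = 512 + 4,096 = 4,608 parameters
print(full, lora, f"{1 - lora / full:.0%} fewer trainable parameters")   # ~86% reduction
```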
So the training becomes cheap:
because of LoRA you often don't need the
distributed setups like DDP or FSDP; a single GPU can be enough.

So you can train multiple different rank-decomposition matrices,

one per task, and switch between them whenever

you need a certain task.
Here, as you see, comparing LoRA against full fine-tuning:
yes, full fine-tuning scores a bit higher than LoRA,
but not by too much.

You may ask: what should I choose for the rank?

Based on the results in the previous tables, the

best rank is between 8 and 32. A configuration sketch follows.
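A minimal sketch of applying LoRA with the Hugging Face peft library, using a rank in the 8-32 range discussed above; the base model and target module names are illustrative choices (they fit T5-style attention layers).

```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForSeq2SeqLM

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")  # illustrative model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor for the update
    target_modules=["q", "v"],  # apply LoRA only to the attention projections
    lora_dropout=0.05,
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()   # only a small fraction of weights is trainable
```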
Prompt tuning with soft prompts

Prompt tuning is different from prompt engineering:

prompt engineering is about how to handle the wording of the prompt, or

adding more examples or shots, like the concept of in-context learning (ICL);

it needs a lot of manual effort to write and try
different prompts.

The key idea of prompt tuning:
you add additional trainable tokens to the original
input

and leave it to the training process to determine their optimal values.

As you know, the

embeddings of each token

exist at a unique point in a
multidimensional space.
But the soft prompts are not fixed vectors: you can think of them as
virtual tokens that can take on any value in continuous space,
and the fine-tuning process learns what those values should be.

In full fine-tuning, the dataset consists of prompts and their completions

and the model parameters get updated;
in contrast, with soft prompts the model parameters are frozen and just
the added embeddings are what gets updated.

For sure, the use of this technique is efficient, because it just trains
a small number of parameters, as in the sketch below.
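A minimal sketch of prompt tuning with the peft library: the base model is frozen and only a handful of virtual-token embeddings are trained; the model choice and token count are illustrative.

```python
from peft import PromptTuningConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")   # illustrative base model

config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,   # 20 trainable soft-prompt vectors prepended to the input
)

peft_model = get_peft_model(base_model, config)
peft_model.print_trainable_parameters()   # only the soft-prompt embeddings are trainable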

So you can prepare different soft prompts for different tasks and

switch between them;
soft prompts are also very tiny on disk.

As you can see, the results differ based on the different tuning

techniques.
Note:
the soft prompts do not have to represent words in the vocabulary,
but using k-nearest-neighbours the authors found that the vocabulary tokens closest
to each soft prompt form semantic clusters, which means the soft prompts
learn word-like representations.