LLMs Course (Andrew)
Context window
The context window covers both the prompt and the completion: the amount of input text the model can see at once depends on the generative model.
In the transformer, the input is passed to an embedding layer. The dimension of the embedding vectors is 512, so we can't visualize them in real life.
But for simplicity, if the dimension were just 3, you could see how related words (e.g. "student") are located near each other in the embedding space.
To get the distance between them, you use the angle (cosine similarity).
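A small sketch of that angle idea, using made-up 3-dimensional embeddings (the vectors and words are illustrative, not from a real model):

```python
import numpy as np

# Hypothetical 3-dim embeddings; real models use hundreds of dimensions (e.g. 512).
embeddings = {
    "student": np.array([0.9, 0.1, 0.3]),
    "teacher": np.array([0.8, 0.2, 0.4]),
    "banana":  np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 = small angle (similar words).
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["student"], embeddings["teacher"]))  # high (small angle)
print(cosine_similarity(embeddings["student"], embeddings["banana"]))   # lower (large angle)
```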
Temperature: the model outputs a probability distribution over the tokens, and we can say the temperature is used to determine the shape of that probability distribution.
As you see here, with a low temperature the distribution is strongly peaked, concentrating the probability on certain words; in contrast, with a higher temperature the distribution is flatter and the probability is spread over more words.
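A minimal sketch of how temperature reshapes the distribution before sampling (the logits are made-up values):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Lower temperature -> sharper (more peaked) distribution;
    # higher temperature -> flatter distribution over more tokens.
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - np.max(scaled))  # subtract the max for numerical stability
    return exp / exp.sum()

logits = [4.0, 2.0, 1.0, 0.5]                 # hypothetical scores for 4 candidate tokens
print(softmax_with_temperature(logits, 0.5))  # peaked: almost all mass on the top token
print(softmax_with_temperature(logits, 1.0))  # the "raw" softmax distribution
print(softmax_with_temperature(logits, 2.0))  # flatter: probability spread out
```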
Developers use hubs to browse and download AI models.
The advantage of a hub is the inclusion of model cards, which describe important details like:
- best use cases for each model
- how it was trained
- known limitations
Your selection of a model depends on how that model was trained, because what the model can be used for depends on its training.
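For example, on the Hugging Face Hub you can read a model's card online and then pull the checkpoint programmatically; a sketch assuming the transformers library and the bigscience/bloom-560m checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The model card at https://huggingface.co/bigscience/bloom-560m documents
# intended use cases, training data, and known limitations.
checkpoint = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("The context window of an LLM is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```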
Causal language modeling (sometimes called full language modeling): the model predicts the next word, iterating over the text token by token with a unidirectional context. Example: BLOOM.
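A toy sketch of that autoregressive loop, with a stand-in next_token_distribution function instead of a real model:

```python
import numpy as np

def next_token_distribution(tokens):
    # Stand-in for a decoder-only LLM: returns a probability distribution over a tiny
    # vocabulary, given only the tokens to the left (unidirectional context).
    vocab_size = 5
    rng = np.random.default_rng(len(tokens))  # deterministic toy "model"
    logits = rng.normal(size=vocab_size)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

tokens = [2, 4]                      # prompt token ids
for _ in range(3):                   # generate 3 more tokens, one at a time
    probs = next_token_distribution(tokens)
    next_id = int(np.argmax(probs))  # greedy choice; sampling would use the temperature trick above
    tokens.append(next_id)
print(tokens)
```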
Another pre-training objective masks parts of the input sentence so that the model learns to predict the masked spans.
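A rough sketch of the masking idea (the sentinel token name and example sentence are made up):

```python
def mask_span(tokens, start, length, sentinel="<mask_0>"):
    # Replace a span of the input with a single sentinel token; the model's
    # training target is to reconstruct the masked span.
    masked_input = tokens[:start] + [sentinel] + tokens[start + length:]
    target = [sentinel] + tokens[start:start + length]
    return masked_input, target

tokens = ["the", "teacher", "teaches", "the", "student"]
masked_input, target = mask_span(tokens, start=1, length=2)
print(masked_input)  # ['the', '<mask_0>', 'the', 'student']
print(target)        # ['<mask_0>', 'teacher', 'teaches']
```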
FP16 / BF16 / INT8: you can reduce the memory needed per parameter by quantization.
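A back-of-the-envelope sketch of why quantization helps, assuming a hypothetical 1-billion-parameter model:

```python
# Approximate memory needed just to store the weights of a 1B-parameter model.
n_params = 1_000_000_000
bytes_per_param = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1}

for dtype, nbytes in bytes_per_param.items():
    gb = n_params * nbytes / 1e9
    print(f"{dtype}: ~{gb:.0f} GB")
# FP32: ~4 GB, FP16/BF16: ~2 GB, INT8: ~1 GB
# (Training needs several times more, for gradients and optimizer states.)
```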
Much of AI research relies on bfloat16 (BF16).
ZeRO (Zero Redundancy Optimizer) was proposed by Microsoft in 2019.
The goal of ZeRO is distributing (sharding) the model states across GPUs with zero data overlapping. That allows you to scale model training even when your model doesn't fit on one GPU.
There are several levels (stages) of ZeRO. PyTorch FSDP builds on this: before the forward pass all model weights are synced (gathered) across GPUs, and each GPU then updates only the part of the model it is responsible for.
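A minimal PyTorch FSDP sketch, assuming a distributed process group has already been initialized (e.g. launched with `torchrun`) and a GPU is available:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# FSDP shards parameters, gradients, and optimizer states across GPUs (ZeRO-style);
# weights are gathered just before the forward/backward pass and re-sharded after,
# so each rank only permanently holds its own shard.
model = FSDP(model.cuda())

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
```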
Compute for pre-training is measured in petaflop/s-days.
Instruction fine-tuning is one of the solutions used to improve the model on a specific, single task.
Catastrophic forgetting happens because full fine-tuning modifies the weights of the original LLM; while this leads to great performance on the single fine-tuning task, it degrades performance on the other tasks.
To avoid it, several things can help:
- maybe there is no issue at all, if you don't need the model to generalize beyond the single task
- fine-tune on multiple tasks, which needs around 50-100 thousand examples
Multitask instruction fine-tuning
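A sketch of how raw examples can be turned into instruction-formatted prompt/completion pairs for this kind of fine-tuning (the template wording and example data are illustrative):

```python
# Each training example pairs an instruction-style prompt with the expected completion.
template = "Summarize the following conversation.\n\n{dialogue}\n\nSummary:\n"

examples = [
    {"dialogue": "A: The server is down again. B: I'll restart it now.",
     "summary": "B will restart the server that went down."},
]

training_pairs = [
    {"prompt": template.format(dialogue=ex["dialogue"]), "completion": ex["summary"]}
    for ex in examples
]
print(training_pairs[0]["prompt"])
# For multitask instruction fine-tuning, you mix many such templates across tasks
# (summarization, translation, QA, ...) into one training set.
```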
Evaluation metrics: BLEU (used for translation) and ROUGE score (used for summarization).
ROUGE-1 precision measures how many of the generated words match the reference:
ROUGE-1 precision = unigram matches / unigrams in output
But it doesn't capture the accuracy of the meaning itself.
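A toy ROUGE-1 precision sketch using plain unigram counts (real evaluations usually rely on a library such as `rouge_score`; this simplified version doesn't clip repeated matches):

```python
def rouge1_precision(generated, reference):
    gen_tokens = generated.lower().split()
    ref_tokens = set(reference.lower().split())
    # unigram matches / unigrams in the generated output
    matches = sum(1 for tok in gen_tokens if tok in ref_tokens)
    return matches / len(gen_tokens)

generated = "it is very cold outside today"
reference = "it is cold outside"
print(rouge1_precision(generated, reference))  # 4 matches / 6 generated unigrams ~ 0.67
```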
As you know, full fine-tuning needs a lot of storage for all the trained parameters, so there is a solution: PEFT (Parameter-Efficient Fine-Tuning).
PEFT updates only a part of the weights and freezes the others: mainly the weights that affect the task you need to adapt your model to (e.g. certain weights of a layer, or small newly added weights). So the instruction-tuned model stays the same size as the original one.
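A sketch of one popular PEFT method, LoRA, using the Hugging Face `peft` library (model name and hyperparameters are illustrative):

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# LoRA: freeze the original weights and train small low-rank matrices added to them.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                 # rank of the added low-rank matrices
    lora_alpha=32,
    lora_dropout=0.05,
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of the total
```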
Prompt tuning (soft prompts)
You add additional trainable tokens to the original input. Normal input tokens each exist at a unique, fixed point in the multidimensional embedding space, but the soft prompts are not fixed vectors: you can think of them as virtual tokens that can take on any value in the continuous embedding space.
Unlike full fine-tuning, the underlying model's weights stay frozen. For sure this technique is efficient, because it just trains a small number of parameters.
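A sketch of prompt tuning with the Hugging Face `peft` library (checkpoint name and number of virtual tokens are illustrative):

```python
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

# The base model's weights stay frozen; only the virtual-token embeddings are trained.
config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,                       # 20 trainable soft-prompt vectors
    prompt_tuning_init=PromptTuningInit.RANDOM,  # start them at random continuous values
)
peft_model = get_peft_model(base_model, config)
peft_model.print_trainable_parameters()  # only the soft prompt embeddings are trainable
```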
Note: the soft prompts don't have to represent words in the vocabulary. But using KNN, they found that the words nearest to each soft prompt cluster together with similar meanings, which means the soft prompt has learned a meaningful representation.