RL, DQN & PG

Supervised learning
- Data: (x, y), where x is the data and y is its label.
- Goal: learn a function that can map the relationship between x and y, which can then be used to predict y for unseen data.
- Examples: classification (cat vs dog), semantic segmentation, object detection, regression, image captioning; anything in which the target is known.

Unsupervised learning
- Data: just x, with no labels.
- Goal: learn some underlying hidden structure of the data.
- Examples: clustering, dimensionality reduction, feature learning, density estimation.
Reinforcement learning
It is a learning paradigm in which learning happens by exploration: the agent interacts with the environment repeatedly, without any prior knowledge or labels w.r.t. the environment, relying completely on a hit-and-trial strategy to learn how to behave optimally within that environment.
[Diagram: agent-environment interaction loop, with state s_t, reward r_t, action a_t and next state s_{t+1}]
B. Basic RL setup / framework
Agent and Environment interact in a loop: depending upon its current state s_t, the agent takes some action a_t; the environment, in return, gives the agent a reward r_t and some next state s_{t+1}. These steps keep on going until the episode ends (episodic training).
Markov property
The future is independent of the past, given the present:
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
An MDP is defined by the tuple (S, A, P, R, γ):
- S: set of states
- A: set of actions
- P: state transition probability matrix
- R: reward function
- γ: discount factor, i.e. how much future rewards count as compared to the current one.
Interacting with the MDP produces an episode s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T.
State transition matrix P
A Markov process / chain is a memoryless random process: starting from some given seed / random state, the next states are sampled iteratively as per the given matrix P, following the Markov property. Full game dynamics can be encoded within P, e.g. a robot / agent interacting with the gaming environment by following the game rules.
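A minimal sketch of this iterative sampling, assuming a made-up 3-state transition matrix P:

```python
import numpy as np

# Sampling a Markov chain from a transition matrix P.
# The 3-state matrix below is purely illustrative.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])      # P[s, s'] = probability of moving from s to s'

rng = np.random.default_rng(0)
state = rng.integers(len(P))         # random seed state
chain = [state]
for _ in range(10):                  # iteratively sample the next state from row P[state]
    state = rng.choice(len(P), p=P[state])
    chain.append(state)
print(chain)                         # one sampled sequence of states
```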
Reward
The amount of reward the agent is going to get from the environment when, at state s_t, the agent takes an action a_t:
r_t = r(s_t, a_t)
But in an MDP we are not interested only in immediate rewards. An optimal behaviour must maximize the expected Discounted Cumulative Future Reward (DCFR):
max E[ Σ_{t≥0} γ^t r_t ]
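As a quick illustration of the quantity being maximized, here is a tiny sketch that evaluates the DCFR of an arbitrary, made-up reward sequence:

```python
# Discounted cumulative future reward for one episode: Sum_{t>=0} gamma^t * r_t.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))   # 1 + 0.9^3 * 5 = 4.645
```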
How an MDP operates and can be used to obtain episodes (MDP execution):
- At time t = 0, the environment samples a random initial state s_0 ~ p(s_0), the initial-state probability distribution.
- Using some policy π (which we need to figure out so as to maximize E[DCFR]), the agent takes an action a_t at state s_t.
- Depending upon the agent's state and the action taken, the environment returns the immediate reward r_t = R(s_t, a_t) and the next state s_{t+1} ~ P(s' | s_t, a_t), with s ∈ S and a ∈ A.
Policy
A policy is just a mapping of states to actions, enabling the agent to take an action at any state:
π: state → action
It can be deterministic or stochastic.
Goal of the MDP
Find the policy that maximizes the DCFR. The actual returns are always in the form of the DCFR:
G_t = discounted cumulative future reward from time step t
G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ...
γ close to 0: myopic evaluation, more importance to immediate returns.
γ close to 1: far-sighted, care about all future rewards.
Requirement of the discount
The future is uncertain and we care more about rewards that are closer in time; since we are inaccurate at predicting far-future rewards, they are weighted down (e.g. a reward of 10 received 5 steps ahead contributes only γ^5 · 10).
Following an overall policy π, trajectories are generated as s_0 ~ p(s_0), a_t ~ π(a | s_t), s_{t+1} ~ P(s' | s_t, a_t).

Value function
The value of a state s is the expected DCFR obtained from state s when following the policy π to choose all future actions:
V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]
D.2 Q-value function Q^π(s, a)
Let us assume that we have S = {s1, s2, s3} and A = {a1, a2} for some MDP. Then we can define a Q-value for every state-action pair: Q(s1, a1), Q(s1, a2), Q(s2, a1), Q(s2, a2), Q(s3, a1), Q(s3, a2).
Q^π(s, a) is the expected DCFR when action a has been chosen at state s and the policy π is followed thereafter:
Q^π(s, a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
It is used for policy optimization: it tells the agent which action to follow at each state.
D.3 Optimal value function and Q-function
V*(s) = max_π V^π(s)
Q*(s, a) = max_π Q^π(s, a)
Bellman optimality: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') ], i.e. the best value of (s, a) is the immediate reward plus the discounted value of the best action at the next state.
Since at state s, after taking an action a, the agent may reach any one of a set of next states s' ∈ S, expected values are considered in order to address this randomness in the environment; the Q-function is therefore defined recursively.
Just see it once more, as we are going to use it: taking action a at state s may lead to a set of next states, e.g. s' ∈ {s1, s2, s3}, as shown. The expected value E_{s'}[·] is basically the average, over whatever next states we observe after taking the decision a at state s, of the best achievable value max_{a'} Q*(s', a'):
Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') ]
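A toy sketch of this recursive backup (value iteration on the Q-function), assuming an arbitrary made-up 3-state, 2-action MDP:

```python
import numpy as np

# Repeatedly apply the Bellman optimality backup
#   Q*(s,a) = E_{s'}[ r(s,a) + gamma * max_{a'} Q*(s',a') ]
# on a small random MDP until the Q-values stop changing.
n_s, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] transition probabilities
R = rng.random((n_s, n_a))                         # R[s, a] immediate rewards

Q = np.zeros((n_s, n_a))
for _ in range(200):
    Q = R + gamma * P @ Q.max(axis=1)              # P-weighted average = expectation over s'
print(np.round(Q, 2))                              # approximate Q*
print(Q.argmax(axis=1))                            # greedy action per state
```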
D.4 Optimal policy π*
MDP policies depend on the current state only (not on the history), hence the policies are stationary (time-independent):
π(a | s) = P[A_t = a | S_t = s]
The optimal policy π* ensures that, at every state, the action chosen is the one that leads to the maximum expected return. Later we will need to compute it: solving for it exactly with dynamic programming is not scalable, hence we approximate it (using a Q-network or a policy network).
Policy evaluation: iterative update
[Diagram: a small MDP with states s1..s4, their transitions, and the values V^π(s1), V^π(s2), V^π(s3), V^π(s4)]
Assuming a_t = π(s_t), repeatedly apply the Bellman backup for the policy:
V_{k+1}(s) = R(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) V_k(s')
In closed form: V_{k+1} = R + γ P V_k
where V is the value vector for all states, R the reward vector and P the transition probability matrix; the value vector at the (k+1)-th iteration is computed as per the k-th iteration of the MDP.
E.2 Policy evaluation example
Let us evaluate a random policy in a small grid world.
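A minimal sketch of such an evaluation, assuming the classic 4x4 grid world (terminal corner cells, reward -1 per move, uniform random policy); the layout and rewards here are assumptions for illustration:

```python
import numpy as np

# Iterative policy evaluation V_{k+1}(s) = R(s) + gamma * sum_{s'} P(s'|s) V_k(s')
# for a uniform random policy on a 4x4 grid world with terminal corners.
N, gamma = 4, 1.0
terminal = {(0, 0), (N - 1, N - 1)}
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

V = np.zeros((N, N))
for _ in range(200):                                # repeat the backup until it converges
    V_new = np.zeros_like(V)
    for i in range(N):
        for j in range(N):
            if (i, j) in terminal:
                continue
            total = 0.0
            for di, dj in moves:                    # each move taken with probability 1/4
                ni = min(max(i + di, 0), N - 1)     # bumping into a wall keeps the agent in place
                nj = min(max(j + dj, 0), N - 1)
                total += 0.25 * (-1.0 + gamma * V[ni, nj])
            V_new[i, j] = total
    V = V_new
print(np.round(V, 1))                               # value of the random policy
```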
F.2 Deep Q-learning for Q-fn approximation
Q(s, a; θ) ≈ Q*(s, a)
Use a neural network with parameters θ to estimate the Q-value for any (s, a) pair (deep Q-learning).
We need a Q-fn approximator that satisfies the Bellman optimality equation, which is enforced at each iterative step.
Intuition: at iteration i, regress the network's prediction towards the Bellman target
y_i = r + γ max_{a'} Q(s', a'; θ_{i-1})
L_i(θ_i) = E_{(s,a,r,s')}[ (y_i - Q(s, a; θ_i))^2 ]
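A minimal PyTorch sketch of this Bellman regression loss; the tiny MLP, the 4-dimensional states and the 2 actions are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Regress Q(s,a;theta) towards the bootstrapped target r + gamma * max_a' Q(s',a').
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

def bellman_loss(s, a, r, s_next, done, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; theta)
    with torch.no_grad():                                       # target treated as a constant
        y = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, y)                      # (y - Q(s, a; theta))^2

# Fake batch of transitions, just to show the shapes:
s, s2 = torch.randn(8, 4), torch.randn(8, 4)
a, r, d = torch.randint(0, 2, (8,)), torch.randn(8), torch.zeros(8)
print(bellman_loss(s, a, r, s2, d))
```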
Solving for the optimal policy directly

G.1 Introduction to policy gradient
Instead of using a DNN to approximate the Q-value function, why can't we directly learn the suitable policy π_θ, parameterized by θ?
Class of parameterized policies: Π = { π_θ : θ ∈ R^m }
For each policy, define its value, i.e. the expected DCFR obtained when acting with it:
J(θ) = E[ Σ_{t≥0} γ^t r_t | π_θ ]
Given any policy π_θ, one can extract how much DCFR it collects on average: that is J(θ).
GOAL: the DNN needs to optimize for the parameters
θ* = argmax_θ J(θ)
DQN: learning the Q-fn
Using the same network both to act and to produce the bootstrapped targets may introduce instability, so a separate target network θ⁻ is introduced: it is just a delayed copy of the main network θ.
Setup: environment, main network θ, target network θ⁻, replay buffer.
1. Choose an action with an epsilon-greedy policy over the main network's Q-values, a = argmax_a Q(s, a; θ) with probability 1-ε (random otherwise), and step the environment.
2. Store the transition (s_i, a_i, r_i, s'_i) in the replay buffer.
3. Once we have enough samples, sample a random batch of transitions, e.g. (s_1, a_1, r_1, s'_1), (s_2, a_2, r_2, s'_2), (s_3, a_3, r_3, s'_3), ...
4. The current state and action (s_i, a_i) go to the main network for its predicted Q-value; the next state s'_i goes to the target network for the target-value computation using bootstrapping (Bellman unrolling):
   y_i = r_i + γ max_{a'} Q(s'_i, a'; θ⁻)
5. It is then time for the main-network update: θ ← θ - α ∇_θ L(θ), with L(θ) = (y_i - Q(s_i, a_i; θ))^2.
6. Copy the parameters of the main network to the target network after some iterations.
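Putting the pieces together, here is a condensed sketch of one DQN learning step (epsilon-greedy action, replay buffer, delayed target network), reusing the Bellman loss from above; the network sizes, buffer capacity and hyper-parameters are illustrative assumptions:

```python
import random
from collections import deque
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

main_net, target_net = make_net(), make_net()
target_net.load_state_dict(main_net.state_dict())     # theta^- starts as a copy of theta
opt = torch.optim.Adam(main_net.parameters(), lr=1e-3)
buffer = deque(maxlen=10_000)                          # replay buffer of (s, a, r, s', done)
gamma, eps = 0.99, 0.1

def act(state):
    if random.random() < eps:                          # epsilon-greedy exploration
        return random.randrange(2)
    with torch.no_grad():
        return main_net(state).argmax().item()

def learn_step(batch_size=32):
    if len(buffer) < batch_size:                       # wait until we have enough samples
        return
    batch = random.sample(buffer, batch_size)          # random batch breaks correlations
    s = torch.stack([t[0] for t in batch])
    a = torch.tensor([t[1] for t in batch])
    r = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    s2 = torch.stack([t[3] for t in batch])
    d = torch.tensor([t[4] for t in batch], dtype=torch.float32)
    q = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                              # bootstrapped target from theta^-
        y = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, y)
    opt.zero_grad(); loss.backward(); opt.step()       # theta <- theta - alpha * grad L(theta)

# Every C iterations: target_net.load_state_dict(main_net.state_dict())
```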
Policy Gradient
In RL we can therefore learn either of two parameterizations:
- a Q-fn parameterization θ (DQN), or
- a policy parameterization θ, where we directly learn the policy.
Intention: in PG we use a NN to approximate the optimal policy π*.
- Initialize θ randomly.
- We feed the state as input and get a probability distribution over actions as output.
- Also store the (s, a, r, s') tuples until the end of the episode as training data.
- If the agent wins (collects a high return), the actions it took are reinforced, as formalized in the loss below.
G.2 REINFORCE algorithm: loss function
A trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, ...) is sampled from an episode of game play: the agent starts from some initial game state and follows the policy π_θ with parameters θ. Let r(τ) denote the total reward of the trajectory and p(τ; θ) its probability under π_θ.
Assuming all trajectories were equally probable we would have J(θ) = (1/n) Σ_i r(τ_i); considering instead all the trajectories that one can encounter, weighted by their probability,
J(θ) = ∫ r(τ) p(τ; θ) dτ
θ* = argmax_θ J(θ)
Forward pass:  J(θ) = ∫ r(τ) p(τ; θ) dτ
Backward pass: ∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ; θ) dτ
This is intractable: the gradient of the expectation depends on ∇_θ p(τ; θ), which we cannot evaluate over all trajectories.
Trick (log-derivative):
∇_θ p(τ; θ) = p(τ; θ) ∇_θ log p(τ; θ)
Putting this into the equation above:
∇_θ J(θ) = ∫ r(τ) p(τ; θ) ∇_θ log p(τ; θ) dτ = E_{τ ~ p(τ;θ)}[ r(τ) ∇_θ log p(τ; θ) ]
i.e. the gradient becomes an expectation, which can be estimated by sampling trajectories.
G.4 Issue with the gradient and its computation: sampling of trajectories
All trajectories τ_1, τ_2, ..., τ_n for some given θ are not equiprobable; instead they follow a distribution, say p(τ; θ).
For some trajectory τ = (s_0, a_0, s_1, a_1, s_2, a_2, ...), let us compute its probability given the policy π_θ (parameterized by θ and learned by some deep policy network maximizing the reward):
p(τ; θ) = Π_{t≥0} P(s_{t+1} | s_t, a_t) · π_θ(a_t | s_t)
Estimating this probability requires the transition probabilities P(s_{t+1} | s_t, a_t), i.e. the probability of the environment taking the agent to s_{t+1} on taking action a_t at state s_t, and these we do not know.
We learn π_θ by maximizing J(θ). Now the question is: can we compute ∇_θ J(θ) without the transition probabilities P? For backpropagation only the gradient of J(θ) is required, and, as shown next, it does not depend upon P.
As already shown,
∇_θ J(θ) = E_{τ ~ p(τ;θ)}[ r(τ) ∇_θ log p(τ; θ) ]
log p(τ; θ) = Σ_{t≥0} [ log P(s_{t+1} | s_t, a_t) + log π_θ(a_t | s_t) ]
Just differentiating w.r.t. θ: the first term is independent of θ and the second depends on θ, so
∇_θ log p(τ; θ) = Σ_{t≥0} ∇_θ log π_θ(a_t | s_t)
Hence
∇_θ J(θ) = E_{τ ~ p(τ;θ)}[ r(τ) Σ_{t≥0} ∇_θ log π_θ(a_t | s_t) ]
         ≈ (1/N) Σ_{i=1}^{N} r(τ_i) Σ_t ∇_θ log π_θ(a_t^i | s_t^i)
i.e. a Monte-Carlo estimate over N sampled trajectories, with no transition probabilities needed.
Note that r(τ) is only available after playing out the whole game / episode, and the same scalar scales the log-probabilities of every action in that trajectory. Limitations / issues of the REINFORCE algorithm:
- high variance of the gradient estimate, since returns averaged over whole episodes vary a lot;
- rewards arrive only at the episode level, so good and bad actions within the episode are not distinguished;
- the sampled transitions within a trajectory are correlated;
- exploration is difficult and the optimization may get stuck in a local minimum.
Algorithm for policy gradient (REINFORCE)
1. Sample a trajectory τ using the current policy π_θ.
2. Compute ∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) · r(τ).
3. Update the network parameters: θ ← θ + α ∇_θ J(θ).
Actually, we are using the policy to generate the trajectory and then computing ∇_θ J(θ) to update the policy itself; this will in turn improve the policy after each iteration. Hence the returns vary greatly between iterations, introducing high variance in the gradient updates.
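A minimal PyTorch sketch of one such REINFORCE update; the small policy network, the 4-dimensional states and the fake episode are assumptions for illustration:

```python
import torch
import torch.nn as nn

# One REINFORCE step: ascend r(tau) * sum_t grad log pi_theta(a_t | s_t).
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_update(states, actions, rewards):
    """states: (T, 4) tensor, actions: (T,) long tensor, rewards: list of floats."""
    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    episode_return = sum(rewards)                  # r(tau): whole-episode reward
    loss = -(episode_return * log_probs).sum()     # minimizing -J(theta) ascends J(theta)
    opt.zero_grad(); loss.backward(); opt.step()   # theta <- theta + alpha * grad J(theta)

# Fake episode of length 5:
T = 5
reinforce_update(torch.randn(T, 4), torch.randint(0, 2, (T,)), [1.0] * T)
```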
Policy gradient with Reward-to-Go
For vanilla PG:
∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) · R(τ),  where R(τ) = Σ_t r_t.
Let us define the Reward-to-Go R_t as the sum of the rewards of the trajectory starting from the state s_t:
R_t = Σ_{t'=t}^{T-1} r_{t'}
An action is only responsible for the rewards obtained after it was taken, so instead of R(τ) use R_t:
∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) · R_t
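A small sketch of computing the reward-to-go for an arbitrary reward sequence (one right-to-left pass):

```python
# R_t = sum_{t' >= t} r_{t'} for every timestep of one episode.
def rewards_to_go(rewards):
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

print(rewards_to_go([1.0, 0.0, 2.0, 3.0]))   # [6.0, 5.0, 5.0, 3.0]
```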
Baseline
A baseline is a value that can give us the expected return from the state the agent is in; the simplest baseline can be the average of the observed returns, b = E[R_t]. Using the value of the state as the baseline:
∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) · (R_t - V_φ(s_t))
Now, since the value of a state is just a floating-point number, V_φ can be trained by minimizing an MSE, with R_t the actual return and V_φ(s_t) the predicted return:
J(φ) = (1/2) Σ_t (R_t - V_φ(s_t))^2
We need to minimize J(φ), hence φ ← φ - β ∇_φ J(φ).
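A minimal sketch of fitting such a value baseline by MSE regression onto the rewards-to-go; the small MLP and the 4-dimensional states are assumptions:

```python
import torch
import torch.nn as nn

# Train V_phi(s) to predict the observed reward-to-go R_t.
value_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(value_net.parameters(), lr=1e-2)

def fit_baseline(states, rtg):
    """states: (T, 4) tensor, rtg: (T,) tensor of rewards-to-go."""
    pred = value_net(states).squeeze(1)            # predicted return V_phi(s_t)
    loss = 0.5 * ((rtg - pred) ** 2).mean()        # J(phi) = 1/2 * (R_t - V_phi(s_t))^2
    opt.zero_grad(); loss.backward(); opt.step()   # phi <- phi - beta * grad J(phi)
    return loss.item()

print(fit_baseline(torch.randn(4, 4), torch.tensor([6.0, 5.0, 5.0, 3.0])))
```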
Advantage fn A_t^i (of the i-th episode at step t)
When Q(s, a) > V(s), the action is better than the expected average:
A_t^i = Q(s_t^i, a_t^i) - V(s_t^i)
∇_θ J(θ) ≈ Σ_t (Q(s_t, a_t) - V(s_t)) ∇_θ log π_θ(a_t | s_t)
The advantage acts as the scaling factor for the log-likelihood. Minimization of the advantage function will also enforce Bellman optimality.
We may not have the full trajectory (or many trajectories): the value we can get from the value network V_φ, and the Q-value can be computed by one-step bootstrapping of the value at any step. Hence the advantage function becomes
A(s, a) = r + γ V(s') - V(s)
Actor-Critic algorithm
Policy net (actor head): observation → π(a | s)
Value net (critic head): observation → V(s)
Sample trajectories under the current policy: for trajectories i = 1, ..., M and timesteps t = 1, 2, 3, ..., collect the states s_t^i, actions a_t^i and rewards r_t^i, and compute the reward-to-go and the advantage for every step.
Initialize the policy network θ (actor) and the value network φ (critic).
For each training iteration 1, 2, ... do
  (After every iteration the policy is updated, so new trajectories need to be generated under it.)
  For each step t = 1, 2, ... do
    Compute the advantage of the t-th step of the i-th episode:
    A_t^i = R_t^i - V_φ(s_t^i)
    It depends upon the actual (discounted) future rewards R_t and the critic network's prediction V_φ(s_t); minimization of A_t enforces the Bellman equation.
  Actor update (gradient ascent), accumulating the per-timestep policy-gradient estimates of all episodes, each scaled by A_t^i:
  θ ← θ + α Σ_i Σ_t A_t^i ∇_θ log π_θ(a_t^i | s_t^i)
  Critic update, minimizing the gap between the predicted value function and the discounted accumulated future rewards for all episodes and all timesteps (this enforces the Bellman constraint):
  φ ← φ - β ∇_φ (1/2) Σ_i Σ_t (R_t^i - V_φ(s_t^i))^2
End for
The critic only learns the values of the states that are actually visited, avoiding the problem of deep Q-learning of having to learn Q-values for all state-action pairs. Also, a bootstrapped one-step target can be used instead of the full return:
R_t ≈ r_t + γ V(s')
Now we don't need to wait till the end of the episode to compute the reward (see the sketch below):
- Get an action a_0 ~ π(a | s_0).
- Get a reward r_0 and the next state s_1.
- Get the value of s_1 as V_φ(s_1).
- Get Q(s_0, a_0) = r_0 + γ V_φ(s_1).
- Get the advantage A(s_0, a_0) = r_0 + γ V_φ(s_1) - V_φ(s_0).
This makes the method sample efficient.
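A compact PyTorch sketch of one such one-step actor-critic update, built from the steps above; the two small networks, the 4-dimensional state and the hyper-parameters are assumptions for illustration:

```python
import torch
import torch.nn as nn

# One-step (TD) actor-critic update from a single transition (s0, r0, s1).
actor = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
critic = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def ac_step(s0, r0, s1, done):
    dist = torch.distributions.Categorical(logits=actor(s0))
    a0 = dist.sample()                               # a0 ~ pi(a | s0)
    v0 = critic(s0).squeeze(-1)                      # V_phi(s0)
    with torch.no_grad():                            # bootstrapped one-step target
        q = r0 + gamma * (1 - done) * critic(s1).squeeze(-1)   # Q(s0,a0) = r0 + gamma*V_phi(s1)
        adv = q - v0                                 # A(s0,a0) = Q(s0,a0) - V_phi(s0)
    actor_loss = -(adv * dist.log_prob(a0))          # ascend A * grad log pi_theta(a0|s0)
    critic_loss = 0.5 * (q - v0) ** 2                # pull V_phi(s0) towards the target
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

# Fake transition; in practice r0 and s1 come from the environment after acting with a0.
ac_step(torch.randn(4), torch.tensor(1.0), torch.randn(4), torch.tensor(0.0))
```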