RL DQN PG

This document provides an overview of machine learning paradigms, focusing on supervised, unsupervised, and reinforcement learning. It explains Markov Decision Processes (MDPs) and the importance of policies and value functions in reinforcement learning. Additionally, it discusses the mathematical formulation of reinforcement learning and examples such as the Cart-Pole problem and the Grid World MDP.

This lecture is adapted from the Stanford lecture series (CS 231n) by Serena Yeung.
A. Introduction to learning paradigms

These are the major paradigms of machine learning.

Supervised learning: data (x, y), where y is the label. Goal: learn a function that can map x to y. Examples: classification (cat vs. dog), semantic segmentation, object detection, regression (e.g. pose estimation), image captioning; in short, anything in which the target is known.

Unsupervised learning: data x only, with no labels. Goal: learn some underlying hidden structure or relationship in the data that can be used to explore it. Examples: clustering, dimensionality reduction, feature learning, density estimation (estimating the data probability distribution).

Reinforcement learning: a learning paradigm in which learning happens by exploration. The agent interacts with the environment repeatedly, without labels and without any prior knowledge of the environment, relying completely on a trial-and-error strategy to learn how to behave optimally within that environment.

[Diagram: the Agent, at state S_t, takes an action A_t; the Environment returns a reward R_t and the next state S_{t+1}.]

Let us see how to read this state-transition diagram. The agent is at some state s_t at time t. It is interacting with the environment via a predefined set of actions. The action at any state is chosen by using only the current state: it is assumed that the state completely characterizes the environment, i.e. no history is required (the Markov assumption).

At any state, taking a particular action is going to fetch a reward. The final goal of the agent is to learn how to take an appropriate (optimal) action so as to maximize the reward, more precisely to maximize the future discounted reward.
Reinforcement learning can be formulated as a Markov Decision Process (MDP).

There are two major classes of RL algorithms:
1. Q-learning: learn the Q-function Q(s, a) and act according to it.
2. Policy gradients: learn the policy directly.
B. Basic RL setup / framework

The agent, being at some state s_t, takes an action a_t. The environment in return, depending upon the agent's state and action, gives back a reward r_t and the next state s_{t+1}.

Episodic training: these steps keep repeating in a loop, and the resulting sequence is considered an episode. The episode continues until the environment provides a terminal (final) state to the RL agent, after which the current episode is terminated.

Hence one needs to generate data (episodes) by allowing the agent to explore the environment in episodes, and later use this episodic data for training the agent to learn how to behave optimally with respect to the environment.
Example 1: the Cart-Pole problem

Objective: balance a pole on top of a movable cart.

State: the physical state can be encoded as the pole angle θ, the angular speed ω, the cart position x, and the horizontal velocity (more state variables are possible, e.g. the masses m and M).

Action: the horizontal force applied to the cart.

Reward: 1 at each time step if the pole is upright.
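To make the episodic setup concrete, here is a minimal sketch of interacting with a Cart-Pole environment under a random placeholder policy. The gymnasium package and the environment name "CartPole-v1" are assumptions (they are not part of these notes); the learned policies discussed later would replace the random action choice.

```python
import gymnasium as gym  # assumption: gymnasium is installed

env = gym.make("CartPole-v1")           # state: [x, x_dot, theta, theta_dot]
obs, info = env.reset(seed=0)           # the environment samples an initial state

episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()  # placeholder policy: random push left/right
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward            # +1 per step while the pole stays upright
    done = terminated or truncated      # a terminal state ends the episode

print("episode return:", episode_return)
env.close()
```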
Example 2: an Atari game-playing agent

Objective: finish the game with the highest score.

State: the game state can be the raw pixel inputs of the game screen (images of the game).

Action: game-controller actions (e.g. left, right, up, down).

Reward: at the end of the episode, if the agent wins, then all the state-action pairs of the episode get a positive reward; else they get a negative (or zero) reward.
Mathematical formulation of RL

RL can be mathematically formulated using Markov Decision Processes (MDPs).

Markov property: the future is independent of the past, given the present.

Definition: a state S_t is Markov if and only if

    P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]

The state captures all relevant information from the history; the state is a sufficient statistic of the future.
Markov Decision Process

An MDP is a decision process: an action is chosen at each state. It is an environment in which all states are Markov. An MDP can be characterized by the tuple

    (S, A, P, R, γ)

where
    S: the set of possible states {s_1, s_2, ..., s_N}
    A: the set of possible actions {a_1, a_2, ..., a_M}
    P: the state-transition probability (dynamics)
    R: the reward function, r_t = R(s_t, a_t)
    γ: the discount factor, i.e. how much a future reward counts compared to the current reward.
State-transition matrix P

A Markov process (chain) is a memoryless random process: states are sampled iteratively as per the given transition probabilities, starting from some given (seed/random) state and following the Markov property. The full game dynamics can be encoded within P; a robot or a game-playing agent interacting with the environment is effectively following these game rules.

    P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]
Immediate reward

The immediate reward is the amount of reward an agent is going to get from the environment when, at state s_t, the agent takes an action a_t:

    r_t = R(s_t, a_t)

But in an MDP we are not interested only in immediate rewards. An optimal behaviour must maximize the expected Discounted Cumulative Future Reward (DCFR):

    max E[ Σ_{t ≥ 0} γ^t r_t ]
How the MDP operates and can be used to obtain episodes (MDP execution)

At time t = 0, the environment samples an initial state s_0 ~ p(s_0), where p(s_0) is the initial-state probability distribution. Then, from t = 0 until the episode terminates:
- The agent selects an action a_t. How? Using some policy π that we need to figure out, so as to maximize E[DCFR].
- Depending upon the agent's state and the action taken, the environment returns the immediate reward r_t as well as the next state s_{t+1}.

For an MDP, the dynamics are governed by

    P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]
    R^a_s = E[ r_t | S_t = s, A_t = a ]

and, given a discount factor γ ∈ [0, 1], taking action a_t at state s_t yields s_{t+1} ~ P and a reward according to R^a_s.

But how to take an action? The policy.

A policy is just a mapping from states to actions, enabling the agent to take an action at any state (state to action). It can be deterministic, a = π(s), or stochastic (random):

    π(a | s) = P[A_t = a | S_t = s]
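A tiny illustration of the two kinds of policy on a made-up state/action space (the names "s1", "left", etc. are invented purely for this sketch):

```python
import random

# Deterministic policy: a fixed lookup table, a = pi(s)
deterministic_pi = {"s1": "right", "s2": "left"}

# Stochastic policy: a distribution over actions for each state, pi(a|s)
stochastic_pi = {"s1": {"left": 0.2, "right": 0.8},
                 "s2": {"left": 0.9, "right": 0.1}}

def sample_action(pi, s):
    """Draw an action a ~ pi(.|s) from a stochastic policy."""
    probs = pi[s]
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(deterministic_pi["s1"])              # always 'right'
print(sample_action(stochastic_pi, "s1"))  # 'right' with probability 0.8
```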
Goal of the MDP

The objective of the MDP is to find an optimal policy π* that maximizes the DCFR; the actual returns are always in the form of the DCFR.

G_t: the discounted cumulative future reward from time step t (all upcoming rewards are discounted):

    G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_{k ≥ 0} γ^k r_{t+k},   γ ∈ (0, 1]

γ close to 0: "myopic" evaluation, more importance given to immediate returns.
γ close to 1: "far-sighted" evaluation, care about all future rewards.
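A small sketch of computing the discounted return G_t from a list of per-step rewards (the reward values are made up):

```python
def discounted_return(rewards, gamma=0.99):
    """G = sum_k gamma^k * r_k, i.e. G_t computed for t = 0."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [1.0, 1.0, 1.0, 1.0]          # e.g. four upright Cart-Pole steps
print(discounted_return(rewards, 0.9))  # 1 + 0.9 + 0.81 + 0.729 = 3.439
```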

Requirement of the discount

- The future is uncertain, and we care less about returns far in the future. For example, if I invest 1K INR today, how much do I need to care about my return after 50 years? Since we predict using an inaccurate model of the environment, the trust over the model's predictions can decrease exponentially with the horizon.
- It is mathematically convenient to discount rewards.
- It avoids infinite returns in cyclic Markov processes.

Grid World MDP (example)

States: the grid locations. Actions: right, left, up, down.

Reward: r = -1, a negative reward for each transition. This ensures the agent learns how to reach the target fast.

Objective: starting from any randomly chosen block, reach a terminal state in the least number of actions.

Given the above 2-D Grid World MDP, we need to figure out, out of all the policies, some optimal policy that can maximize the DCFR.

[Figure: a random policy (choosing a direction uniformly at random in every cell) next to the optimal policy, whose arrows point towards the terminal states.]

Policy philosophy: for immediate neighbours of a terminal state, one should move in the direction that takes you to the terminal state; for outer states, move in the direction that can take you closest to one of the terminal-neighbouring states.

D. Solution of the MDP: the optimal policy

Policy π = planning: it does not matter in which state s_t the agent is at time t; the policy is going to suggest what action a_t the agent needs to take in order to get the DCFR maximized, a_t ~ π(· | s_t).

Such MDPs are realized as stochastic frameworks, with an initial probability distribution and a transition probability distribution. Hence, in order to handle the randomness, we always talk about the expected value of the reward when maximizing.

Exploring the policies: given that states, actions and rewards follow their respective probability distributions, the overall objective is

    π* = argmax_π E[ Σ_{t ≥ 0} γ^t r_t | π ],
    with s_0 ~ p(s_0),  a_t ~ π(· | s_t),  s_{t+1} ~ P(· | s_t, a_t)

There are a few attributes associated with a given policy:

Value function: how good is a given state? If the agent just follows the given policy from that state, how much DCFR is expected? That expected DCFR is the value of the state for the given policy.

Q-value function: how good is a (state, action) pair for a given policy? Instead of directly following the policy at the state, the agent first takes the given action and then follows the policy; the Q-value of (s, a) is that expected DCFR.
D.1 Value function

Defined as the expected DCFR if the agent follows π from the state until the episode terminates (reaching a terminal state).

Given a policy π and an initial state s_0 ~ p(s_0), an episode is

    s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...   (until a terminal state is reached)

    V^π(s) = E[ Σ_{t ≥ 0} γ^t r_t | s_0 = s, π ]

The value of a state is the expected cumulative discounted reward from that state when all future actions are chosen by following the policy.
D.2 Q-value function

Suppose we have states {s_1, s_2, s_3} and actions {a_1, a_2} for some MDP; at any state the policy gives π(a_1 | s), π(a_2 | s), and so on for every (state, action) pair.

Q^π(s, a): how good is it to take action a at state s? Basically, it is the expected DCFR when action a has been chosen at state s and the policy π is followed afterwards:

    Q^π(s, a) = E[ Σ_{t ≥ 0} γ^t r_t | s_0 = s, a_0 = a, π ]

It is used for policy optimization: which action to take at a state. Equivalently, unrolling the sum until the final step of the episode,

    Q^π(s, a) = E[ r_0 + γ r_1 + γ² r_2 + ... + γ^T r_T | s_0 = s, a_0 = a, π ]
D.3 Optimal value function and Q-function

Optimal state-value function: the maximum value function over all the policies,

    V*(s) = max_π V^π(s)

Optimal action-value (Q-value) function: the maximum Q-value function over all the policies,

    Q*(s, a) = max_π Q^π(s, a)

i.e. the maximum DCFR one can achieve from any given (state, action) pair.
[Figure: starting from state s, the agent takes action a, receives a reward, reaches the next state s', and the episode continues until a terminal state.]

It is assumed that Q*(s', a') is known when the agent is at the next state s'. Then the best the agent can do from (s, a) is

    Q*(s, a) = r + γ max_{a'} Q*(s', a')

Since at state s, after taking action a, the agent will land in a next state s' only with some probability (e.g. s_1 with probability 0.3, s_2 with probability 0.4, ...), it may reach a whole set of possible next states, so we take the expectation:

    Q*(s, a) = E_{s' ~ P(·|s,a)}[ r + γ max_{a'} Q*(s', a') ]

Both Q* and V* satisfy the Bellman equation; both functions are defined recursively. Expected values are considered in order to address the randomness in the process.

Just see it once more, as we are going to use it for deep Q-learning:

    Q*(s, a) = E_{s' ~ P(·|s,a)}[ r + γ max_{a'} Q*(s', a') ]

The expected value is basically the average of r + γ·(value of the best next action a') over all the next states s' we may observe after taking the decision a at state s.
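A minimal sketch of this Bellman backup on a tiny, made-up tabular MDP (the states, transition probabilities and rewards below are invented purely for illustration); repeatedly applying the backup is exactly value iteration on Q:

```python
import numpy as np

# Made-up MDP: 3 states, 2 actions; P[s, a, s'] and R[s, a] are invented numbers.
gamma = 0.9
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
              [[0.0, 1.0, 0.0], [0.3, 0.3, 0.4]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])   # state 2 is absorbing (terminal)
R = np.array([[-1.0, -1.0],
              [-1.0, -1.0],
              [ 0.0,  0.0]])

Q = np.zeros((3, 2))
for _ in range(100):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * E_{s'}[ max_a' Q(s',a') ]
    Q = R + gamma * P @ Q.max(axis=1)

print(np.round(Q, 2))        # converged Q* of the toy MDP
print(Q.argmax(axis=1))      # greedy (optimal) action in each state
```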
D.4 Optimal policy

MDP policies depend on the current state only (not on the history). Hence the policies are stationary (time-independent):

    π(a | s) = P[A_t = a | S_t = s]

The policy governs the dynamics of the agent, and the best (optimal) policy needs to ensure the maximum future reward. Note that in Markov decision processes we do not care about the historical reward accumulated so far; we only consider the DCFR (discounted cumulative future reward). For small MDPs the optimal action at every state can be computed exactly with dynamic programming, but DP is not scalable, hence later we approximate it (e.g. using a policy network).

E. How to solve for the optimal policy using DP

Firstly we need to figure out how to evaluate a policy (policy evaluation). This will tell us how good that policy is. Later, using the Bellman equation, policy iteration or value iteration can be done in order to update the current policy.

Iterative update: basically, the policy or value-function estimate at step k is used to compute the policy/value for the (k+1)-th step using dynamic programming.
E.1 Policy evaluation

[Figure: a small grid world; for the state under evaluation, the current values V_k(s') of the neighbouring states are combined according to the policy π(a|s) and the transition probabilities.]

Assuming the initialized values V_k from the previous iteration are known, the expected immediate reward and the discounted expected value of the next state are added:

    V_{k+1}(s) = Σ_a π(a|s) [ R^a_s + γ Σ_{s'} P^a_{ss'} V_k(s') ]

In closed (matrix) form:

    V_{k+1} = R^π + γ P^π V_k

where V_{k+1} is the value vector for all states at the (k+1)-th iteration, R^π is the reward function and P^π the transition probability under the policy, applied to the values of the k-th iteration.

E.2 Policy evaluation example: let us evaluate a random policy in a small grid world.
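A minimal sketch of iterative policy evaluation for the uniform-random policy in the classic 4x4 grid world (assumptions: the two opposite corner cells are terminal, the reward is -1 per transition, γ = 1, and moves that would leave the grid keep the agent in place; this mirrors the standard textbook example rather than a specific figure in these notes):

```python
import numpy as np

N = 4                                          # 4x4 grid; a state is (row, col)
terminals = {(0, 0), (N - 1, N - 1)}           # assumed terminal corners
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
gamma = 1.0

def step(s, a):
    """Deterministic transition: move, but stay in place when hitting a wall."""
    r, c = s[0] + a[0], s[1] + a[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s

V = np.zeros((N, N))
for _ in range(200):                           # sweeps of V_{k+1} = R^pi + gamma * P^pi V_k
    V_new = np.zeros((N, N))
    for r in range(N):
        for c in range(N):
            s = (r, c)
            if s in terminals:
                continue                       # terminal values stay at 0
            # uniform-random policy: pi(a|s) = 1/4 for each action
            V_new[s] = sum(0.25 * (-1.0 + gamma * V[step(s, a)]) for a in actions)
    V = V_new

print(np.round(V, 1))                          # converges to the familiar -14 ... -22 values
```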
F.2 Deep Q-learning for Q-function approximation

Instead of computing Q*(s, a) exactly, we use some neural network as a function approximator to approximate/estimate Q*(s, a) for all (s, a):

    Q(s, a; θ) ≈ Q*(s, a)

i.e. we use the network parameters θ to estimate the Q-value for any (s, a) pair (deep Q-learning).

Intuition: we need a Q-function approximator that satisfies the Bellman equation, so we train it by enforcing Bellman optimality at each iterative step: the target for Q(s, a; θ) is r + γ max_{a'} Q(s', a'; θ), and the loss is the squared difference between the network's prediction and this target.
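A minimal PyTorch sketch of this idea (PyTorch is an assumption, not something the notes specify): a small Q-network for a 4-dimensional state with 2 discrete actions (Cart-Pole sized), trained on the squared Bellman error of a single made-up transition.

```python
import torch
import torch.nn as nn

# Assumed sizes: 4-dim state, 2 discrete actions.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

# One made-up transition (s, a, r, s', done), just to show the loss.
s, a = torch.randn(1, 4), torch.tensor([1])
r, s2, done = torch.tensor([1.0]), torch.randn(1, 4), torch.tensor([0.0])

q_sa = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)        # prediction Q(s, a; theta)
with torch.no_grad():                                      # Bellman target
    target = r + gamma * (1.0 - done) * q_net(s2).max(dim=1).values

loss = nn.functional.mse_loss(q_sa, target)                # enforce Bellman optimality
opt.zero_grad()
loss.backward()
opt.step()
```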
Solving for the optimal policy directly

G.1 Introduction to policy gradients

Instead of using a DNN to approximate the Q-value function, why can't we directly learn a suitable policy π_θ, parameterized by θ?

Class of parameterized policies:  Π = { π_θ, θ ∈ R^m }

The value of a policy, parameterized by θ, is the expected DCFR under that policy:

    J(θ) = E[ Σ_{t ≥ 0} γ^t r_t | π_θ ]

i.e. given any policy π_θ, how much DCFR can one extract on average; that is the value J(θ).

GOAL: the DNN needs to optimize for the parameters θ that realize the optimal policy,

    θ* = argmax_θ J(θ)

i.e. gradient ascent of J over θ. This leads to REINFORCE.
Deep Q-Learning (DQN)

Learning the Q-function using the same network for both the prediction and the target may introduce instability, hence DQN uses a separate target network in addition to the main network.

1. The agent interacts with the environment using the main network Q(s, a; θ), choosing actions with an epsilon-greedy policy over the predicted Q-values.
2. Each transition (s_i, a_i, r_i, s'_i) is stored in an experience replay buffer.
3. Once there are enough samples, a random batch of transitions is sampled from the buffer.
4. The current state and action (s_i, a_i) go to the main network for the predicted Q-value; (r_i, s'_i) go to the target network (parameters θ') for the target-value computation via bootstrapping, i.e. unrolling Bellman:

       y_i = r_i + γ max_{a'} Q(s'_i, a'; θ')

5. Main-network update: minimize the loss L(θ) = (y_i - Q(s_i, a_i; θ))² and take a gradient step, θ ← θ - α ∇_θ L(θ).
6. The target network is just a delayed copy of the main network: copy the parameters of the main network into the target network after some number of iterations.
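A compact sketch of this training step, again in PyTorch (an assumption) with an assumed replay buffer of (s, a, r, s', done) tensors; the network sizes and hyperparameters are illustrative only.

```python
import copy
import random
import torch
import torch.nn as nn

def make_q_net():
    return nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

main_net = make_q_net()
target_net = copy.deepcopy(main_net)              # delayed copy of the main network
opt = torch.optim.Adam(main_net.parameters(), lr=1e-3)
replay, gamma, batch_size, sync_every = [], 0.99, 32, 500
# replay holds (s, a, r, s2, done) tuples of tensors (a as long, done as float)

def epsilon_greedy(state, eps=0.1):
    if random.random() < eps:
        return random.randrange(2)                               # explore
    with torch.no_grad():
        return int(main_net(state.unsqueeze(0)).argmax())        # exploit

def train_step(step):
    if len(replay) < batch_size:
        return                                    # wait until there are enough samples
    batch = random.sample(replay, batch_size)     # random batch from the buffer
    s, a, r, s2, done = map(torch.stack, zip(*batch))
    q_sa = main_net(s).gather(1, a.view(-1, 1)).squeeze(1)       # Q(s, a; theta)
    with torch.no_grad():                                        # Bellman target, theta'
        y = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, y)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % sync_every == 0:                    # periodically copy theta -> theta'
        target_net.load_state_dict(main_net.state_dict())
```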
Policy Gradients

In RL we need to learn the optimal policy. In continuous action spaces (like car driving or robot movement), DQN faces issues (the max over actions becomes hard). With policy parameterization, instead of Q-function parameterization, we directly learn the policy.

Intuition: in PG we use a neural network π_θ to approximate the optimal policy π*.

1. Initialize θ randomly. We feed the state as input and the network returns the action probabilities.
2. Initially the network is not trained, so the output action probabilities are more or less random. Still, we select the action based on the output action-probability distribution.
3. Also store (s, a, r, s') until the end of the episode. This will become our training data.
4. If the agent wins that episode, assign higher probability to all the actions of that episode; else reduce their probability. This is REINFORCE.

How to prove that the update in step 4 is the right one? As derived in the next sections.
G.2 REINFORCE algorithm: loss function

A trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, ...) is sampled from an episode of game play, with the agent starting from some initial game state and following the policy π_θ (parameterized by θ). The trajectories are not all equally probable; they follow a distribution p(τ | θ).

The reward of a trajectory is r(τ) = Σ_t r_t, and the expected reward, considering all trajectories one can encounter while the agent starts from some initial state and follows the π_θ policy, is

    J(θ) = E_{τ ~ p(τ|θ)}[ r(τ) ] = ∫ r(τ) p(τ|θ) dτ

    θ* = argmax_θ J(θ)

The REINFORCE algorithm uses J(θ) in its forward pass as a gain (objective) function.
G.3 REINFORCE algorithm: backward pass

Forward pass:    J(θ) = ∫ r(τ) p(τ|θ) dτ

Backward pass:   ∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ|θ) dτ

This is intractable: it is the gradient of an expectation whose distribution itself depends on θ. Use the log-derivative trick,

    ∇_θ p(τ|θ) = p(τ|θ) ∇_θ log p(τ|θ)                         ... (A)

Putting (A) into the gradient,

    ∇_θ J(θ) = ∫ r(τ) ∇_θ log p(τ|θ) p(τ|θ) dτ
             = E_{τ ~ p(τ|θ)}[ r(τ) ∇_θ log p(τ|θ) ]           ... (B)

i.e. the gradient becomes an expectation of a gradient, which can be estimated by sampling trajectories.
G.4 Issue with the gradient and its computation

There can be infinitely many trajectories possible for any parameterized policy π_θ estimated by a DNN. The objective J(θ) has been defined as an expectation over all such trajectories, hence it can essentially be computed by Monte Carlo sampling of trajectories. All trajectories τ_1, τ_2, ..., for some given θ, are not equiprobable; instead they follow a distribution, say p(τ|θ).

Let us compute the probability of some trajectory τ = (s_0, a_0, s_1, a_1, s_2, a_2, ...), given the policy π_θ learned by a deep policy network maximizing the reward J(θ):

    p(τ|θ) = Π_{t ≥ 0} P(s_{t+1} | s_t, a_t) π_θ(a_t | s_t)

where P(s_{t+1} | s_t, a_t) is the transition probability of the environment and π_θ(a_t | s_t) is the probability of taking action a_t at state s_t.

The major problem is that J(θ) requires p(τ|θ), which in turn requires the transition probabilities P, which are unknown; our policy network is estimating π_θ by maximizing J(θ). Now the question is: can we compute ∇_θ J(θ) without the transition probabilities?

For backpropagation only the gradient ∇_θ J(θ) is required, and it does not depend on P. As already shown,

    ∇_θ J(θ) = E_{τ ~ p(τ|θ)}[ r(τ) ∇_θ log p(τ|θ) ]           ... (B)

The log-likelihood of such a sampled trajectory is (using the product form of p(τ|θ))

    log p(τ|θ) = Σ_t [ log P(s_{t+1} | s_t, a_t) + log π_θ(a_t | s_t) ]

Just differentiating with respect to θ: the first term is independent of θ, the second depends on θ, hence

    ∇_θ log p(τ|θ) = Σ_t ∇_θ log π_θ(a_t | s_t)                ... (D)

So the derivative of the log-likelihood of the trajectory τ, which is required for the computation of the gradient of the reward function, is independent of the transition probabilities. This also implies that ∇_θ J(θ) is independent of P. Putting (D) into (B),

    ∇_θ J(θ) = E_{τ ~ p(τ|θ)}[ r(τ) Σ_t ∇_θ log π_θ(a_t | s_t) ]

and with a sampled trajectory (a Monte Carlo estimate), writing G = r(τ) for the episode return,

    ∇_θ J(θ) ≈ G ∇_θ log π_θ(a_0|s_0) + G ∇_θ log π_θ(a_1|s_1) + G ∇_θ log π_θ(a_2|s_2) + ...


G.5 Issues: credit assignment, variance and exploration

    ∇_θ J(θ) ≈ G Σ_t ∇_θ log π_θ(a_t | s_t)

The estimator is unbiased, but it has a credit-assignment problem: the single scalar return of the whole episode scales the log-probability of every action in it, so in a winning game-playing episode both the good and the bad moves get reinforced equally.

Limitations/issues of the REINFORCE algorithm:
- High gradient variance: the return is computed over whole episodes and differs greatly from episode to episode.
- Exploration: with a poor initial policy the agent may explore badly and get stuck in local minima.
- The samples within an episode are correlated.
Algorithm for Policy Gradient (REINFORCE)

1. Initialize the network parameters θ.
2. Generate N trajectories {τ_i} following the policy π_θ.
3. Compute the return of each trajectory, R(τ_i).
4. Compute the gradient:

       ∇_θ J(θ) ≈ Σ_i Σ_t ∇_θ log π_θ(a_t | s_t) R(τ_i)

5. Update the network parameters:  θ ← θ + α ∇_θ J(θ)
6. Repeat steps 2-5 for several iterations.

θ is getting updated after each iteration, hence the policy changes (becomes better), and we then generate new data (episode trajectories) for training. It is an on-policy method. A sketch of this loop is given below.
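A minimal REINFORCE sketch on Cart-Pole, following the steps above with a single trajectory per update (N = 1) for brevity. PyTorch, gymnasium and all the hyperparameters here are assumptions, not part of the notes.

```python
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Categorical

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # pi_theta(a|s)
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for iteration in range(200):
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:                              # step 2: generate a trajectory with pi_theta
        dist = Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, r, terminated, truncated, _ = env.step(int(action))
        rewards.append(r)
        done = terminated or truncated
    R = sum((gamma ** t) * r for t, r in enumerate(rewards))   # step 3: return R(tau)
    loss = -R * torch.stack(log_probs).sum()     # steps 4-5: ascend R * sum_t grad log pi
    opt.zero_grad(); loss.backward(); opt.step()
```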
How to manage the high variance in the gradient?

Actually we are using the policy to generate the trajectories and then computing ∇_θ J(θ) to update the policy itself. This in turn changes the policy after each iteration, hence the returns vary greatly, introducing high variance in the gradient updates.
Policy gradient with reward-to-go

For vanilla PG,

    ∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) R(τ),   where R(τ) = Σ_t r_t

Let us define the reward-to-go R_t as the sum of the rewards of the trajectory starting from the state s_t:

    R_t = Σ_{t'=t}^{T-1} r_{t'}

An action taken at time t can only influence the rewards from time t onwards, so instead of R(τ) use R_t:

    ∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) R_t
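A short sketch of computing the reward-to-go vector from a list of per-step rewards (a discount factor is included as an extra assumption; set gamma = 1 for the undiscounted version written above):

```python
def rewards_to_go(rewards, gamma=1.0):
    """R_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}, computed backwards in one pass."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(rewards_to_go([1, 1, 1, 1]))   # [4.0, 3.0, 2.0, 1.0]
```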

Policy gradient with a baseline

Trajectory lengths are different, and the raw reward-to-go favours early rewards. Can we normalize the reward-to-go in order to reduce the variance? Subtract a baseline b from it:

    ∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) (R_t - b)

The baseline is a value that can give us the expected return from the state the agent is in. The simplest baseline can be the average return,

    b = E[ R(τ) ] ≈ (1/N) Σ_i R(τ_i)

The baseline can be any function, but it should not depend on the action taken, so that it does not bias the gradient.

Another obvious choice is the value function V^π(s_t): the value of a state is the expected return an agent would obtain starting from that state and following the policy (the Q-function and the advantage function are other options, see below).

How can we learn the baseline function? Just like approximating the policy using a θ-network, one can use a value network V_φ(s):

    ∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) (R_t - V_φ(s_t))

Now, since the value of a state is a floating-point number, φ can be trained by minimizing the MSE between the actual return R_t and the predicted return V_φ(s_t):

    J(φ) = (1/2) Σ_t ( R_t - V_φ(s_t) )²

We need to minimize J(φ), hence  φ ← φ - β ∇_φ J(φ).
Advantage function A_t (of the t-th step in an episode)

How good is it to take an action a at state s?

    Q(s, a) ≥ V(s)  means the action is better than the expected (average) behaviour.

    A_t = Q(s_t, a_t) - V(s_t)

    ∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) ( Q(s_t, a_t) - V(s_t) )

i.e. the advantage is the scaling function for the log-likelihood. Driving the advantage error to zero also enforces Bellman optimality.

Computation of the advantage function seems to need neural networks (a) for the value function and (b) for the Q-function. This is not optimal; it is computationally inefficient/expensive.

The Q-function can be approximated via Monte Carlo (random sampling), i.e. the reward-to-go:

    Q(s_t, a_t) ≈ R_t   (all future rewards)

but we may not have the full trajectory (or enough trajectories). Instead it can be approximated by using the definition of the Q-function and Bellman optimality:

    Q(s_t, a_t) ≈ r_t + γ V(s_{t+1})

This we can get from the value network, so the Q-value at any step can now be computed using only the value function. Hence the advantage function is

    A(s, a) ≈ r + γ V(s') - V(s)
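A tiny sketch of this one-step advantage estimate, with a made-up value function and made-up states purely for illustration:

```python
def advantage(r, s, s_next, value_fn, gamma=0.99, terminal=False):
    """A(s, a) ~= r + gamma * V(s') - V(s); V(s') is taken as 0 if s' is terminal."""
    bootstrap = 0.0 if terminal else gamma * value_fn(s_next)
    return r + bootstrap - value_fn(s)

V = {"s0": 1.0, "s1": 2.0}                 # placeholder value function
print(advantage(0.5, "s0", "s1", V.get))   # 0.5 + 0.99*2.0 - 1.0 = 1.48
```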
Actor-Critic algorithm

A policy network (actor head) maps the observation to π_θ(a|s), and a value network (critic head) maps the observation to V_φ(s).

Sample trajectories under the current policy:

    τ_i = (s_1^i, a_1^i, r_1^i, s_2^i, a_2^i, r_2^i, ..., s_T^i, a_T^i, r_T^i),   i = 1, ..., M

where s_t^i, a_t^i, r_t^i are the state of the agent, the action taken, and the reward received at the t-th step of the i-th episode. For each step we also record the advantage estimate

    A_t^i = r_t^i + γ V_φ(s_{t+1}^i) - V_φ(s_t^i)

ALGORITHM: Actor-Critic

1. Initialize the policy network π_θ (actor) and the value network V_φ (critic).
2. For each training iteration i = 1, 2, ... do
   (after every iteration the policy will be updated, so new episodes need to be generated)
   2.1  Sample trajectories {(s_t, a_t, r_t)} under the current policy π_θ.
   2.2  Reset the gradient accumulator for the actor network: ∇θ ← 0.
   2.3  For each episode (trajectory) j = 1, 2, ... and for each step t = 1, 2, ... do:
        2.3.1  Advantage of the t-th step,
                   A_t = r_t + γ V_φ(s_{t+1}) - V_φ(s_t)
               A_t does not depend on the actual future rewards of the episode; the critic network supplies the bootstrap V_φ. Minimizing the squared A_t enforces the Bellman equation.
        2.3.2  Accumulate the policy-gradient update from the j-th episode's t-th time step, scaled by A_t:
                   ∇θ ← ∇θ + A_t ∇_θ log π_θ(a_t | s_t)
   2.4  Update the actor by gradient ascent:  θ ← θ + α ∇θ
   2.5  Update the critic: accumulate, for all episodes and all time steps, the loss between the discounted accumulated reward and the predicted value (equivalently, minimize the advantage/TD error, which enforces the Bellman constraint), and take a gradient-descent step  φ ← φ - β ∇_φ J(φ).
3. End for.

This avoids the problem of deep Q-learning of having to learn Q-values for all state-action pairs: the critic only learns state values, also for the newly generated data. A sketch of one such update is given below.
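A minimal PyTorch sketch of one actor-critic update from a finished episode (PyTorch, the network sizes and the Cart-Pole-like dimensions are assumptions; the critic bootstrap at the last step is set to zero, i.e. the episode is assumed to end in a terminal state):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))    # pi_theta(a|s)
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))   # V_phi(s)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def update_from_episode(states, actions, rewards):
    """One actor-critic update from one episode (three equal-length lists)."""
    s = torch.as_tensor(states, dtype=torch.float32)        # (T, 4)
    a = torch.as_tensor(actions)                            # (T,)
    r = torch.as_tensor(rewards, dtype=torch.float32)       # (T,)

    v = critic(s).squeeze(-1)                               # V_phi(s_t)
    v_next = torch.cat([v[1:], torch.zeros(1)]).detach()    # V_phi(s_{t+1}), 0 at the end
    advantage = r + gamma * v_next - v                      # A_t (step 2.3.1)

    # Steps 2.3.2 / 2.4: actor update, each log-prob scaled by the (detached) advantage.
    log_pi = Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(advantage.detach() * log_pi).sum()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Step 2.5: critic update, minimizing the squared advantage / TD error.
    critic_loss = advantage.pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
```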

Are these algorithms sample efficient? Not really.

Instead of generating the full trajectory and only then computing the return/reward, can we do something more efficient? We can approximate R_t (or Q(s, a)) as

    R_t ≈ r + γ V(s')

Now we don't need to wait till the end of the episode to compute the reward: one can compute the gradient at each step and update the policy-network parameters online.
Let us see the flow for one step:

1. Initialize θ and φ.
2. Get a random starting state s_0.
3. Get an action a_0 ~ π_θ(· | s_0).
4. Get the reward r_0 and the next state s_1.
5. Get the value of s_0 as V_φ(s_0).
6. Get Q(s_0, a_0) ≈ r_0 + γ V_φ(s_1).
7. Get the advantage:  A(s_0, a_0) = r_0 + γ V_φ(s_1) - V_φ(s_0).
8. Get the policy gradient:

       ∇_θ J(θ) = ∇_θ log π_θ(a_0 | s_0) · ( r_0 + γ V_φ(s_1) - V_φ(s_0) )

9. Update the actor network by gradient ascent (accumulation):  θ ← θ + α ∇_θ J(θ).
10. Get the critic loss  J(φ) = ( r_0 + γ V_φ(s_1) - V_φ(s_0) )², get the value gradient, and update the critic network by gradient descent:  φ ← φ - β ∇_φ J(φ).

Both updates are done at each step of the episode, which makes the method sample efficient.
