RL, DQN & PG

Supervised learning
- Data: (x, y), where x is the data and y is its label.
- Goal: learn a function that can map the relationship between x and y, which can then be used to predict y for unseen data.
- Examples: classification (cat vs dog), semantic segmentation, object detection, regression, image captioning; anything in which the target is known.

Unsupervised learning
- Data: just x, with no labels.
- Goal: learn some underlying hidden structure of the data.
- Examples: clustering, dimensionality reduction, feature learning, density estimation.
Reinforcement learning
It is a learning paradigm in which learning happens by exploration: the agent interacts with the environment repeatedly, without any prior knowledge or labels w.r.t. the environment, relying completely on a hit-and-trial strategy to learn how to behave optimally within that environment.
[Diagram: agent-environment interaction loop, with state s_t, reward r_t, action a_t and next state s_{t+1}]
B. Basic RL setup / framework
Agent and Environment interact in a loop: depending upon its current state s_t, the agent takes some action a_t; the environment, in return, gives the agent a reward r_t and some next state s_{t+1}. These steps keep on going until the episode ends (episodic training).
Markov property
The future is independent of the past, given the present:
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
An MDP is defined by the tuple (S, A, P, R, γ):
- S: set of states
- A: set of actions
- P: state transition probability matrix
- R: reward function
- γ: discount factor, i.e. how much future rewards count as compared to the current one.
Interacting with the MDP produces an episode s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T.
State transition matrix P
A Markov process / chain is a memoryless random process: starting from some given seed / random state, the next states are sampled iteratively as per the given matrix P, following the Markov property. Full game dynamics can be encoded within P, e.g. a robot / agent interacting with the gaming environment by following the game rules.
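A minimal sketch of this iterative sampling, assuming a made-up 3-state transition matrix P:

```python
import numpy as np

# Sampling a Markov chain from a transition matrix P.
# The 3-state matrix below is purely illustrative.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])      # P[s, s'] = probability of moving from s to s'

rng = np.random.default_rng(0)
state = rng.integers(len(P))         # random seed state
chain = [state]
for _ in range(10):                  # iteratively sample the next state from row P[state]
    state = rng.choice(len(P), p=P[state])
    chain.append(state)
print(chain)                         # one sampled sequence of states
```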
Reward
The amount of reward the agent is going to get from the environment when, at state s_t, the agent takes an action a_t:
r_t = r(s_t, a_t)
But in an MDP we are not interested only in immediate rewards. An optimal behaviour must maximize the expected Discounted Cumulative Future Reward (DCFR):
max E[ Σ_{t≥0} γ^t r_t ]
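As a quick illustration of the quantity being maximized, here is a tiny sketch that evaluates the DCFR of an arbitrary, made-up reward sequence:

```python
# Discounted cumulative future reward for one episode: Sum_{t>=0} gamma^t * r_t.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))   # 1 + 0.9^3 * 5 = 4.645
```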
How an MDP operates and can be used to obtain episodes (MDP execution):
- At time t = 0, the environment samples a random initial state s_0 ~ p(s_0), the initial-state probability distribution.
- Using some policy π (which we need to figure out so as to maximize E[DCFR]), the agent takes an action a_t at state s_t.
- Depending upon the agent's state and the action taken, the environment returns the immediate reward r_t = R(s_t, a_t) and the next state s_{t+1} ~ P(s' | s_t, a_t), with s ∈ S and a ∈ A.
Policy
A policy is just a mapping of states to actions, enabling the agent to take an action at any state:
π: state → action
It can be deterministic or stochastic.
Goal of the MDP
Find the policy that maximizes the DCFR. The actual returns are always in the form of the DCFR:
G_t = discounted cumulative future reward from time step t
G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ...
γ close to 0: myopic evaluation, more importance to immediate returns.
γ close to 1: far-sighted, care about all future rewards.
Requirement of the discount
The future is uncertain and we care more about rewards that are closer in time; since we are inaccurate at predicting far-future rewards, they are weighted down (e.g. a reward of 10 received 5 steps ahead contributes only γ^5 · 10).
Following an overall policy π, trajectories are generated as s_0 ~ p(s_0), a_t ~ π(a | s_t), s_{t+1} ~ P(s' | s_t, a_t).

Value function
The value of a state s is the expected DCFR obtained from state s when following the policy π to choose all future actions:
V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]
D.2 Q-value function Q^π(s, a)
Let us assume that we have S = {s1, s2, s3} and A = {a1, a2} for some MDP. Then we can define a Q-value for every state-action pair: Q(s1, a1), Q(s1, a2), Q(s2, a1), Q(s2, a2), Q(s3, a1), Q(s3, a2).
Q^π(s, a) is the expected DCFR when action a has been chosen at state s and the policy π is followed thereafter:
Q^π(s, a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
It is used for policy optimization: it tells the agent which action to follow at each state.
D.3 Optimal value function and Q-function
V*(s) = max_π V^π(s)
Q*(s, a) = max_π Q^π(s, a)
Bellman optimality: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') ], i.e. the best value of (s, a) is the immediate reward plus the discounted value of the best action at the next state.
Since at state s, after taking an action a, the agent may reach any one of a set of next states s' ∈ S, expected values are considered in order to address this randomness in the environment; the Q-function is therefore defined recursively.
Just see it once more, as we are going to use it: taking action a at state s may lead to a set of next states, e.g. s' ∈ {s1, s2, s3}, as shown. The expected value E_{s'}[·] is basically the average, over whatever next states we observe after taking the decision a at state s, of the best achievable value max_{a'} Q*(s', a'):
Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') ]
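A toy sketch of this recursive backup (value iteration on the Q-function), assuming an arbitrary made-up 3-state, 2-action MDP:

```python
import numpy as np

# Repeatedly apply the Bellman optimality backup
#   Q*(s,a) = E_{s'}[ r(s,a) + gamma * max_{a'} Q*(s',a') ]
# on a small random MDP until the Q-values stop changing.
n_s, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] transition probabilities
R = rng.random((n_s, n_a))                         # R[s, a] immediate rewards

Q = np.zeros((n_s, n_a))
for _ in range(200):
    Q = R + gamma * P @ Q.max(axis=1)              # P-weighted average = expectation over s'
print(np.round(Q, 2))                              # approximate Q*
print(Q.argmax(axis=1))                            # greedy action per state
```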
D.4 Optimal policy π*
MDP policies depend on the current state only (not on the history), hence the policies are stationary (time-independent):
π(a | s) = P[A_t = a | S_t = s]
The optimal policy π* ensures that, at every state, the action chosen is the one that leads to the maximum expected return. Later we will need to compute it: solving for it exactly with dynamic programming is not scalable, hence we approximate it (using a Q-network or a policy network).
Policy evaluation: iterative update
[Diagram: a small MDP with states s1..s4, their transitions, and the values V^π(s1), V^π(s2), V^π(s3), V^π(s4)]
Assuming a_t = π(s_t), repeatedly apply the Bellman backup for the policy:
V_{k+1}(s) = R(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) V_k(s')
In closed form: V_{k+1} = R + γ P V_k
where V is the value vector for all states, R the reward vector and P the transition probability matrix; the value vector at the (k+1)-th iteration is computed as per the k-th iteration of the MDP.
E.2 Policy evaluation example
Let us evaluate a random policy in a small grid world.
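A minimal sketch of such an evaluation, assuming the classic 4x4 grid world (terminal corner cells, reward -1 per move, uniform random policy); the layout and rewards here are assumptions for illustration:

```python
import numpy as np

# Iterative policy evaluation V_{k+1}(s) = R(s) + gamma * sum_{s'} P(s'|s) V_k(s')
# for a uniform random policy on a 4x4 grid world with terminal corners.
N, gamma = 4, 1.0
terminal = {(0, 0), (N - 1, N - 1)}
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

V = np.zeros((N, N))
for _ in range(200):                                # repeat the backup until it converges
    V_new = np.zeros_like(V)
    for i in range(N):
        for j in range(N):
            if (i, j) in terminal:
                continue
            total = 0.0
            for di, dj in moves:                    # each move taken with probability 1/4
                ni = min(max(i + di, 0), N - 1)     # bumping into a wall keeps the agent in place
                nj = min(max(j + dj, 0), N - 1)
                total += 0.25 * (-1.0 + gamma * V[ni, nj])
            V_new[i, j] = total
    V = V_new
print(np.round(V, 1))                               # value of the random policy
```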
F.2 Deep Q-learning for Q-fn approximation
Q(s, a; θ) ≈ Q*(s, a)
Use a neural network with parameters θ to estimate the Q-value for any (s, a) pair (deep Q-learning).
We need a Q-fn approximator that satisfies the Bellman optimality equation, which is enforced at each iterative step.
Intuition: at iteration i, regress the network's prediction towards the Bellman target
y_i = r + γ max_{a'} Q(s', a'; θ_{i-1})
L_i(θ_i) = E_{(s,a,r,s')}[ (y_i - Q(s, a; θ_i))^2 ]
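A minimal PyTorch sketch of this Bellman regression loss; the tiny MLP, the 4-dimensional states and the 2 actions are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Regress Q(s,a;theta) towards the bootstrapped target r + gamma * max_a' Q(s',a').
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

def bellman_loss(s, a, r, s_next, done, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; theta)
    with torch.no_grad():                                       # target treated as a constant
        y = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, y)                      # (y - Q(s, a; theta))^2

# Fake batch of transitions, just to show the shapes:
s, s2 = torch.randn(8, 4), torch.randn(8, 4)
a, r, d = torch.randint(0, 2, (8,)), torch.randn(8), torch.zeros(8)
print(bellman_loss(s, a, r, s2, d))
```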
Solving for the optimal policy directly

G.1 Introduction to policy gradient
Instead of using a DNN to approximate the Q-value function, why can't we directly learn the suitable policy π_θ, parameterized by θ?
Class of parameterized policies: Π = { π_θ : θ ∈ R^m }
For each policy, define its value, i.e. the expected DCFR obtained when acting with it:
J(θ) = E[ Σ_{t≥0} γ^t r_t | π_θ ]
Given any policy π_θ, one can extract how much DCFR it collects on average: that is J(θ).
GOAL: the DNN needs to optimize for the parameters
θ* = argmax_θ J(θ)
DQN: learning the Q-fn
Using the same network both to act and to produce the bootstrapped targets may introduce instability, so a separate target network θ⁻ is introduced: it is just a delayed copy of the main network θ.
Setup: environment, main network θ, target network θ⁻, replay buffer.
1. Choose an action with an epsilon-greedy policy over the main network's Q-values, a = argmax_a Q(s, a; θ) with probability 1-ε (random otherwise), and step the environment.
2. Store the transition (s_i, a_i, r_i, s'_i) in the replay buffer.
3. Once we have enough samples, sample a random batch of transitions, e.g. (s_1, a_1, r_1, s'_1), (s_2, a_2, r_2, s'_2), (s_3, a_3, r_3, s'_3), ...
4. The current state and action (s_i, a_i) go to the main network for its predicted Q-value; the next state s'_i goes to the target network for the target-value computation using bootstrapping (Bellman unrolling):
   y_i = r_i + γ max_{a'} Q(s'_i, a'; θ⁻)
5. It is then time for the main-network update: θ ← θ - α ∇_θ L(θ), with L(θ) = (y_i - Q(s_i, a_i; θ))^2.
6. Copy the parameters of the main network to the target network after some iterations.
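Putting the pieces together, here is a condensed sketch of one DQN learning step (epsilon-greedy action, replay buffer, delayed target network), reusing the Bellman loss from above; the network sizes, buffer capacity and hyper-parameters are illustrative assumptions:

```python
import random
from collections import deque
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

main_net, target_net = make_net(), make_net()
target_net.load_state_dict(main_net.state_dict())     # theta^- starts as a copy of theta
opt = torch.optim.Adam(main_net.parameters(), lr=1e-3)
buffer = deque(maxlen=10_000)                          # replay buffer of (s, a, r, s', done)
gamma, eps = 0.99, 0.1

def act(state):
    if random.random() < eps:                          # epsilon-greedy exploration
        return random.randrange(2)
    with torch.no_grad():
        return main_net(state).argmax().item()

def learn_step(batch_size=32):
    if len(buffer) < batch_size:                       # wait until we have enough samples
        return
    batch = random.sample(buffer, batch_size)          # random batch breaks correlations
    s = torch.stack([t[0] for t in batch])
    a = torch.tensor([t[1] for t in batch])
    r = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    s2 = torch.stack([t[3] for t in batch])
    d = torch.tensor([t[4] for t in batch], dtype=torch.float32)
    q = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                              # bootstrapped target from theta^-
        y = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, y)
    opt.zero_grad(); loss.backward(); opt.step()       # theta <- theta - alpha * grad L(theta)

# Every C iterations: target_net.load_state_dict(main_net.state_dict())
```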
Policy Gradient
In RL we can therefore learn either of two parameterizations:
- a Q-fn parameterization θ (DQN), or
- a policy parameterization θ, where we directly learn the policy.
Intention: in PG we use a NN to approximate the optimal policy π*.
- Initialize θ randomly.
- We feed the state as input and get a probability distribution over actions as output.
- Also store the (s, a, r, s') tuples until the end of the episode as training data.
- If the agent wins (collects a high return), the actions it took are reinforced, as formalized in the loss below.
G.2 REINFORCE algorithm: loss function
A trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, ...) is sampled from an episode of game play: the agent starts from some initial game state and follows the policy π_θ with parameters θ. Let r(τ) denote the total reward of the trajectory and p(τ; θ) its probability under π_θ.
Assuming all trajectories were equally probable we would have J(θ) = (1/n) Σ_i r(τ_i); considering instead all the trajectories that one can encounter, weighted by their probability,
J(θ) = ∫ r(τ) p(τ; θ) dτ
θ* = argmax_θ J(θ)
Forward pass:  J(θ) = ∫ r(τ) p(τ; θ) dτ
Backward pass: ∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ; θ) dτ
This is intractable: the gradient of the expectation depends on ∇_θ p(τ; θ), which we cannot evaluate over all trajectories.
Trick (log-derivative):
∇_θ p(τ; θ) = p(τ; θ) ∇_θ log p(τ; θ)
Putting this into the equation above:
∇_θ J(θ) = ∫ r(τ) p(τ; θ) ∇_θ log p(τ; θ) dτ = E_{τ ~ p(τ;θ)}[ r(τ) ∇_θ log p(τ; θ) ]
i.e. the gradient becomes an expectation, which can be estimated by sampling trajectories.
G.4 Issue with the gradient and its computation: sampling of trajectories
All trajectories τ_1, τ_2, ..., τ_n for some given θ are not equiprobable; instead they follow a distribution, say p(τ; θ).
For some trajectory τ = (s_0, a_0, s_1, a_1, s_2, a_2, ...), let us compute its probability given the policy π_θ (parameterized by θ and learned by some deep policy network maximizing the reward):
p(τ; θ) = Π_{t≥0} P(s_{t+1} | s_t, a_t) · π_θ(a_t | s_t)
Estimating this probability requires the transition probabilities P(s_{t+1} | s_t, a_t), i.e. the probability of the environment taking the agent to s_{t+1} on taking action a_t at state s_t, and these we do not know.
We learn π_θ by maximizing J(θ). Now the question is: can we compute ∇_θ J(θ) without the transition probabilities P? For backpropagation only the gradient of J(θ) is required, and, as shown next, it does not depend upon P.
As already shown,
∇_θ J(θ) = E_{τ ~ p(τ;θ)}[ r(τ) ∇_θ log p(τ; θ) ]
log p(τ; θ) = Σ_{t≥0} [ log P(s_{t+1} | s_t, a_t) + log π_θ(a_t | s_t) ]
Just differentiating w.r.t. θ: the first term is independent of θ and the second depends on θ, so
∇_θ log p(τ; θ) = Σ_{t≥0} ∇_θ log π_θ(a_t | s_t)
Hence
∇_θ J(θ) = E_{τ ~ p(τ;θ)}[ r(τ) Σ_{t≥0} ∇_θ log π_θ(a_t | s_t) ]
         ≈ (1/N) Σ_{i=1}^{N} r(τ_i) Σ_t ∇_θ log π_θ(a_t^i | s_t^i)
i.e. a Monte-Carlo estimate over N sampled trajectories, with no transition probabilities needed.
Note that r(τ) is only available after playing out the whole game / episode, and the same scalar scales the log-probabilities of every action in that trajectory. Limitations / issues of the REINFORCE algorithm:
- high variance of the gradient estimate, since returns averaged over whole episodes vary a lot;
- rewards arrive only at the episode level, so good and bad actions within the episode are not distinguished;
- the sampled transitions within a trajectory are correlated;
- exploration is difficult and the optimization may get stuck in a local minimum.
Algorithm for policy gradient (REINFORCE)
1. Sample a trajectory τ using the current policy π_θ.
2. Compute ∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) · r(τ).
3. Update the network parameters: θ ← θ + α ∇_θ J(θ).
Actually, we are using the policy to generate the trajectory and then computing ∇_θ J(θ) to update the policy itself; this will in turn improve the policy after each iteration. Hence the returns vary greatly between iterations, introducing high variance in the gradient updates.
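A minimal PyTorch sketch of one such REINFORCE update; the small policy network, the 4-dimensional states and the fake episode are assumptions for illustration:

```python
import torch
import torch.nn as nn

# One REINFORCE step: ascend r(tau) * sum_t grad log pi_theta(a_t | s_t).
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_update(states, actions, rewards):
    """states: (T, 4) tensor, actions: (T,) long tensor, rewards: list of floats."""
    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    episode_return = sum(rewards)                  # r(tau): whole-episode reward
    loss = -(episode_return * log_probs).sum()     # minimizing -J(theta) ascends J(theta)
    opt.zero_grad(); loss.backward(); opt.step()   # theta <- theta + alpha * grad J(theta)

# Fake episode of length 5:
T = 5
reinforce_update(torch.randn(T, 4), torch.randint(0, 2, (T,)), [1.0] * T)
```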
Policy gradient with Reward-to-Go
For vanilla PG:
∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) · R(τ),  where R(τ) = Σ_t r_t.
Let us define the Reward-to-Go R_t as the sum of the rewards of the trajectory starting from the state s_t:
R_t = Σ_{t'=t}^{T-1} r_{t'}
An action is only responsible for the rewards obtained after it was taken, so instead of R(τ) use R_t:
∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) · R_t
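A small sketch of computing the reward-to-go for an arbitrary reward sequence (one right-to-left pass):

```python
# R_t = sum_{t' >= t} r_{t'} for every timestep of one episode.
def rewards_to_go(rewards):
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

print(rewards_to_go([1.0, 0.0, 2.0, 3.0]))   # [6.0, 5.0, 5.0, 3.0]
```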
Baseline
A baseline is a value that can give us the expected return from the state the agent is in; the simplest baseline can be the average of the observed returns, b = E[R_t]. Using the value of the state as the baseline:
∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) · (R_t - V_φ(s_t))
Now, since the value of a state is just a floating-point number, V_φ can be trained by minimizing an MSE, with R_t the actual return and V_φ(s_t) the predicted return:
J(φ) = (1/2) Σ_t (R_t - V_φ(s_t))^2
We need to minimize J(φ), hence φ ← φ - β ∇_φ J(φ).
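A minimal sketch of fitting such a value baseline by MSE regression onto the rewards-to-go; the small MLP and the 4-dimensional states are assumptions:

```python
import torch
import torch.nn as nn

# Train V_phi(s) to predict the observed reward-to-go R_t.
value_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(value_net.parameters(), lr=1e-2)

def fit_baseline(states, rtg):
    """states: (T, 4) tensor, rtg: (T,) tensor of rewards-to-go."""
    pred = value_net(states).squeeze(1)            # predicted return V_phi(s_t)
    loss = 0.5 * ((rtg - pred) ** 2).mean()        # J(phi) = 1/2 * (R_t - V_phi(s_t))^2
    opt.zero_grad(); loss.backward(); opt.step()   # phi <- phi - beta * grad J(phi)
    return loss.item()

print(fit_baseline(torch.randn(4, 4), torch.tensor([6.0, 5.0, 5.0, 3.0])))
```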
Advantage fn A_t^i (of the i-th episode at step t)
When Q(s, a) > V(s), the action is better than the expected average:
A_t^i = Q(s_t^i, a_t^i) - V(s_t^i)
∇_θ J(θ) ≈ Σ_t (Q(s_t, a_t) - V(s_t)) ∇_θ log π_θ(a_t | s_t)
The advantage acts as the scaling factor for the log-likelihood. Minimization of the advantage function will also enforce Bellman optimality.
We may not have the full trajectory (or many trajectories): the value we can get from the value network V_φ, and the Q-value can be computed by one-step bootstrapping of the value at any step. Hence the advantage function becomes
A(s, a) = r + γ V(s') - V(s)
Actor-Critic algorithm
Policy net (actor head): observation → π(a | s)
Value net (critic head): observation → V(s)
Sample trajectories under the current policy: for trajectories i = 1, ..., M and timesteps t = 1, 2, 3, ..., collect the states s_t^i, actions a_t^i and rewards r_t^i, and compute the reward-to-go and the advantage for every step.
Initialize the policy network θ (actor) and the value network φ (critic).
For each training iteration 1, 2, ... do
  (After every iteration the policy is updated, so new trajectories need to be generated under it.)
  For each step t = 1, 2, ... do
    Compute the advantage of the t-th step of the i-th episode:
    A_t^i = R_t^i - V_φ(s_t^i)
    It depends upon the actual (discounted) future rewards R_t and the critic network's prediction V_φ(s_t); minimization of A_t enforces the Bellman equation.
  Actor update (gradient ascent), accumulating the per-timestep policy-gradient estimates of all episodes, each scaled by A_t^i:
  θ ← θ + α Σ_i Σ_t A_t^i ∇_θ log π_θ(a_t^i | s_t^i)
  Critic update, minimizing the gap between the predicted value function and the discounted accumulated future rewards for all episodes and all timesteps (this enforces the Bellman constraint):
  φ ← φ - β ∇_φ (1/2) Σ_i Σ_t (R_t^i - V_φ(s_t^i))^2
End for
The critic only learns the values of the states that are actually visited, avoiding the problem of deep Q-learning of having to learn Q-values for all state-action pairs. Also, a bootstrapped one-step target can be used instead of the full return:
R_t ≈ r_t + γ V(s')
Now we don't need to wait till the end of the episode to compute the reward (see the sketch below):
- Get an action a_0 ~ π(a | s_0).
- Get a reward r_0 and the next state s_1.
- Get the value of s_1 as V_φ(s_1).
- Get Q(s_0, a_0) = r_0 + γ V_φ(s_1).
- Get the advantage A(s_0, a_0) = r_0 + γ V_φ(s_1) - V_φ(s_0).
This makes the method sample efficient.
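A compact PyTorch sketch of one such one-step actor-critic update, built from the steps above; the two small networks, the 4-dimensional state and the hyper-parameters are assumptions for illustration:

```python
import torch
import torch.nn as nn

# One-step (TD) actor-critic update from a single transition (s0, r0, s1).
actor = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
critic = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def ac_step(s0, r0, s1, done):
    dist = torch.distributions.Categorical(logits=actor(s0))
    a0 = dist.sample()                               # a0 ~ pi(a | s0)
    v0 = critic(s0).squeeze(-1)                      # V_phi(s0)
    with torch.no_grad():                            # bootstrapped one-step target
        q = r0 + gamma * (1 - done) * critic(s1).squeeze(-1)   # Q(s0,a0) = r0 + gamma*V_phi(s1)
        adv = q - v0                                 # A(s0,a0) = Q(s0,a0) - V_phi(s0)
    actor_loss = -(adv * dist.log_prob(a0))          # ascend A * grad log pi_theta(a0|s0)
    critic_loss = 0.5 * (q - v0) ** 2                # pull V_phi(s0) towards the target
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

# Fake transition; in practice r0 and s1 come from the environment after acting with a0.
ac_step(torch.randn(4), torch.tensor(1.0), torch.randn(4), torch.tensor(0.0))
```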