
Reinforcement learning

Episode 4

Approximate reinforcement learning

1
Recap: Q-learning
One approach: action Q-values

Q(s, a) = E_{s'} [ r(s, a) + γ · V(s') ]

The action value Q(s,a) is the expected total reward G the agent gets from
state s by taking action a and following policy π from the next state onward.

π(s) = argmax_a Q(s, a)

2

Recap: Q-learning
One approach: action Q-values

Q(s, a) = E_{s'} [ r(s, a) + γ · V(s') ]

We can replace P(s' | s, a) with sampling:

Q(s_t, a_t) ← α · (r_t + γ · max_{a'} Q(s_{t+1}, a')) + (1 − α) · Q(s_t, a_t)

π(s) = argmax_a Q(s, a)

3


Q-learning as MSE minimization

Given <s, a, r, s'>, minimize

L = [ Q(s_t, a_t) − Q_true(s_t, a_t) ]^2

L ≈ [ Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a')) ]^2

How to optimize?

4
Q-learning as MSE minimization

Given <s, a, r, s'>, minimize

L = [ Q(s_t, a_t) − Q_true(s_t, a_t) ]^2

L ≈ [ Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a')) ]^2

For tabular Q(s,a):

∇L = 2 · [ Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a')) ]

5
Q-learning as MSE minimization

Given <s, a, r, s'>, minimize

L = [ Q(s_t, a_t) − Q_true(s_t, a_t) ]^2

L ≈ [ Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a')) ]^2

For tabular Q(s,a):

∇L ≈ 2 · [ Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a')) ]

Something's sooo wrong!

6


Q-learning as MSE minimization

Given <s, a, r, s'>, minimize

L = [ Q(s_t, a_t) − Q_true(s_t, a_t) ]^2            (treat Q_true as const)

L ≈ [ Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a')) ]^2            (treat the target as const)

For tabular Q(s,a):

∇L ≈ 2 · [ Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a')) ]

7
Q-learning as MSE minimization

For tabular Q(s,a):

L ≈ [ Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a')) ]^2

∇L ≈ 2 · [ Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a')) ]

Gradient descent step:

Q(s_t, a_t) := Q(s_t, a_t) − α · 2 · [ Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a')) ]

8
Q-learning as MSE minimization

For tabular Q(s,a):

L ≈ [ Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a')) ]^2

∇L ≈ 2 · [ Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a')) ]

Gradient descent step:

Q(s_t, a_t) := Q(s_t, a_t) · (1 − 2α) + 2α · (r_t + γ · max_{a'} Q(s_{t+1}, a'))

9
Q-learning as MSE minimization

For tabular Q(s,a):

L ≈ [ Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a')) ]^2

∇L ≈ 2 · [ Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a')) ]

Gradient descent step:

Q(s_t, a_t) := Q(s_t, a_t) · (1 − 2α) + 2α · (r_t + γ · max_{a'} Q(s_{t+1}, a'))

= the moving average formula (define α' = 2α)

10
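A minimal sketch of this equivalence (my own illustration; the state/action indexing and hyperparameter values are assumptions): one gradient step on the squared TD error is exactly the classic moving-average Q-learning update with learning rate α' = 2α.

```python
import numpy as np

def tabular_q_update(Q, s, a, r, s_next, alpha=0.05, gamma=0.99):
    """One update of a tabular Q, stored as an (n_states, n_actions) array."""
    target = r + gamma * np.max(Q[s_next])   # TD target, treated as a constant
    # gradient step on L = (Q[s,a] - target)^2, i.e. a moving average with alpha' = 2*alpha
    Q[s, a] = Q[s, a] * (1 - 2 * alpha) + 2 * alpha * target
```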
Real world

How many states are there, approximately?

11
Real world

|S| ≈ 2^(8·210·160) = 729179546432630...   (a 210×160 screen with 8 bits per pixel: a number with 80917 digits :)

12
Problem:
State space is usually large, sometimes continuous.
And so is the action space.

However, states do have structure: similar states have similar action outcomes.

13
Problem:
State space is usually large, sometimes continuous.
And so is the action space.

Two solutions:
– Binarize the state space (last week)
– Approximate the agent with a function (cross-entropy method)

Which one would you prefer for Atari?

14
Problem:
State space is usually large, sometimes continuous.
And so is the action space.

Two solutions:
– Binarize the state space: too many bins or handcrafted features
– Approximate the agent with a function: let's pick this one

15
From tables to approximations

● Before:
– For all states, for all actions, remember Q(s,a)

● Now:
– Approximate Q(s,a) with some function
– e.g. a linear model over state features

argmin_{w,b} ( Q(s_t, a_t) − [ r_t + γ · max_{a'} Q(s_{t+1}, a') ] )^2

Trivia: should we use a classification or a regression model?
(e.g. logistic regression vs linear regression)

16
From tables to approximations

● Before:
– For all states, for all actions, remember Q(s,a)

● Now:
– Approximate Q(s,a) with some function
– e.g. a linear model over state features

argmin_{w,b} ( Q(s_t, a_t) − [ r_t + γ · max_{a'} Q(s_{t+1}, a') ] )^2

● Solve it as a regression problem!

17
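As an illustration of the linear-model case (a sketch with assumed feature sizes and hyperparameters, not the lecture's code): a per-action linear Q-function trained as a regression on the TD target.

```python
import numpy as np

n_features, n_actions = 8, 3
gamma, lr = 0.99, 0.01
W = np.zeros((n_actions, n_features))    # one weight vector per action
b = np.zeros(n_actions)

def q_values(state):
    return W @ state + b                 # Q(s, a) for every action

def td_update(s, a, r, s_next, done):
    target = r + (0.0 if done else gamma * np.max(q_values(s_next)))  # treated as const
    error = q_values(s)[a] - target
    # gradient of (Q(s,a) - target)^2 with respect to W[a] and b[a]
    W[a] -= lr * 2 * error * s
    b[a] -= lr * 2 * error
```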
MDP again

[Diagram: the agent observes an observation from the environment and applies an action; the environment returns the next observation.]
Approximate Q-learning

[Diagram: image → model (W = params) → Q-values Q(s,a0), Q(s,a1), Q(s,a2)]

Q̂(s_t, a_t) = r + γ · max_{a'} Q(s_{t+1}, a')

Objective:

L = ( Q(s_t, a_t) − [ r + γ · max_{a'} Q(s_{t+1}, a') ] )^2

Gradient step:

w_{t+1} = w_t − α · ∂L/∂w
Approximate Q-learning

[Diagram: image → model (W = params) → Q-values Q(s,a0), Q(s,a1), Q(s,a2)]

Q̂(s_t, a_t) = r + γ · max_{a'} Q(s_{t+1}, a')

Objective:

L = ( Q(s_t, a_t) − [ r + γ · max_{a'} Q(s_{t+1}, a') ] )^2      (consider the target const)

Gradient step:

w_{t+1} = w_t − α · ∂L/∂w_t
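A short sketch of this objective (my own PyTorch example; the network shape, hyperparameters, and tensor layouts are assumptions). "Consider the target const" is implemented by computing the target under torch.no_grad(), so no gradient flows through it.

```python
import torch
import torch.nn as nn

n_features, n_actions, gamma = 4, 3, 0.99
model = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))

def td_loss(s, a, r, s_next, done):
    # s, s_next: float tensors (batch, n_features); a: long tensor (batch,);
    # r, done: float tensors (batch,)
    q = model(s).gather(1, a[:, None]).squeeze(1)     # Q(s_t, a_t) for the taken actions
    with torch.no_grad():                             # the target is treated as a constant
        target = r + gamma * model(s_next).max(dim=1).values * (1 - done)
    return ((q - target) ** 2).mean()
```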
Approximate SARSA

[Same diagram: image → model (W = params) → Q-values Q(s,a0), Q(s,a1), Q(s,a2)]

Objective:

L = ( Q(s_t, a_t) − Q̂(s_t, a_t) )^2      (consider Q̂ const)

Q-learning:
Q̂(s_t, a_t) = r + γ · max_{a'} Q(s_{t+1}, a')

SARSA:
Q̂(s_t, a_t) = r + γ · Q(s_{t+1}, a_{t+1})

Expected Value SARSA:
Q̂(s_t, a_t) = r + γ · ???
Approximate SARSA

[Same diagram: image → model (W = params) → Q-values Q(s,a0), Q(s,a1), Q(s,a2)]

Objective:

L = ( Q(s_t, a_t) − Q̂(s_t, a_t) )^2      (consider Q̂ const)

Q-learning:
Q̂(s_t, a_t) = r + γ · max_{a'} Q(s_{t+1}, a')

SARSA:
Q̂(s_t, a_t) = r + γ · Q(s_{t+1}, a_{t+1})

Expected Value SARSA:
Q̂(s_t, a_t) = r + γ · E_{a'∼π(a'|s_{t+1})} Q(s_{t+1}, a')
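The three targets differ only in how they use the next-state Q-values. A small sketch (my own illustration; the inputs are assumed: q_next is the vector Q(s_{t+1}, ·), probs_next are the policy probabilities π(a'|s_{t+1})):

```python
def q_learning_target(r, q_next, gamma=0.99):
    return r + gamma * q_next.max()                 # greedy over next actions

def sarsa_target(r, q_next, a_next, gamma=0.99):
    return r + gamma * q_next[a_next]               # the action actually taken next

def expected_sarsa_target(r, q_next, probs_next, gamma=0.99):
    return r + gamma * (probs_next * q_next).sum()  # expectation under the policy
```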
Deep RL 101

[Diagram: observation → whatever you found in your favorite deep learning toolkit (Dense → Dense → Dense) → Q-values → ε-greedy rule → apply action]

– Q-values is a dense layer with no nonlinearity.
– Pick actions with the ε-greedy rule (tune ε or use a probabilistic rule).
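A minimal sketch of the ε-greedy rule from the diagram (my own illustration; q_values is assumed to be a 1-D array of Q(s, a) for the current state):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore: uniformly random action
    return int(np.argmax(q_values))               # exploit: greedy action
```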
Architectures

Left: given (s, a), predict a single Q(s,a).
Right: given s, predict all Q-values at once: Q(s,a0), Q(s,a1), Q(s,a2).

24
Architectures

Left: given (s, a), predict a single Q(s,a).
Right: given s, predict all Q-values at once: Q(s,a0), Q(s,a1), Q(s,a2).

Trivia: in which situation does the left model work better? And the right?

25
Architectures

Left: given (s, a), predict Q(s,a)
– needs one forward pass for each action
– works if the action space is large
– efficient when not all actions are available from each state

Right: given s, predict all Q-values
– needs one forward pass for all actions (faster)

26
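A sketch of the two options (my own PyTorch illustration; layer sizes, feature dimensions, and the action encoding are assumptions):

```python
import torch.nn as nn

n_state_features, n_action_features, n_actions = 16, 4, 6

# Left: input is a state concatenated with an encoding of one action; output is a single Q(s, a)
left = nn.Sequential(nn.Linear(n_state_features + n_action_features, 64),
                     nn.ReLU(), nn.Linear(64, 1))

# Right: input is a state; output is Q(s, a) for every action at once
right = nn.Sequential(nn.Linear(n_state_features, 64),
                      nn.ReLU(), nn.Linear(64, n_actions))
```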
What kind of network digests images well?

27
Deep learning approach: DQN

28
DQN

[Diagram: image (i, w, h, 3) → dimshuffle to (i, 3, w, h) → Conv0 → Conv1 → ... → Dense → Q-values → ε-greedy rule → apply action]

– Change the axis order to fit in with lasagne convolutions.
– Any neural network you can think of: conv, pool, dense, dropout, batchnorm, ...
– Q-values is a dense layer with no nonlinearity.
– ε-greedy rule (tune ε or use a probabilistic rule).
DQN

[Same diagram as the previous slide]

– Dropout and batchnorm are a bit tricky here (more on that later).
t-SNE makes every slide 40% better

● Embedding of the pre-last layer activations

31

● Color = V(s) = max_a Q(s,a)

32
How bad is it if the agent spends
the next 1000 ticks under the left rock?
(while training)

33
Problem

● Training samples are not "i.i.d."
● The model forgets parts of the environment it hasn't visited for some time
● Drops on the learning curve
● Any ideas?
Multiple agent trick

Idea: throw in several agents with shared weights W.

[Diagram: a parameter server holds W; agent0, agent1, agent2 each interact with env0, env1, env2]


Multiple agent trick

Idea: throw in several agents with shared weights W.

[Diagram: a parameter server holds W; agent0, agent1, agent2 each interact with env0, env1, env2]

● Chances are, they will be exploring different parts of the environment
● More stable training
● Requires a lot of interaction

Trivia: your agent is a real robot car. Any problems?
Experience replay

Idea: store several past interactions <s,a,r,s'> in a replay buffer
and train on random subsamples.

Any +/- ?

[Diagram: Agent → interaction <s,a,r,s'> → replay buffer → random training batches → Agent]

37
Experience replay

Idea: store several past interactions <s,a,r,s'> in a replay buffer
and train on random subsamples.

● Atari DQN: >10^5 interactions stored
● Closer to i.i.d. training: the pool contains several sessions
● Older interactions were obtained under a weaker policy

Better versions coming next week.

38
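A minimal replay buffer sketch (my own illustration; the capacity and batch size are assumptions, not the values used in the lecture):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)       # oldest interactions are dropped first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        return tuple(map(list, zip(*batch)))       # lists of s, a, r, s', done
```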
Summary so far

To make the data closer to i.i.d., use one or several of:
– experience replay
– multiple agents
– an infinitely small learning rate :)

Advanced stuff coming next lecture.

39
An important question

● You approximate Q(s,a) with a neural network
● You use experience replay when training

Trivia: which of these algorithms will fail?
– Q-learning
– SARSA
– Expected Value SARSA
– CEM

40
An important question

● You approximate Q(s,a) with a neural network
● You use experience replay when training
– The agent trains off-policy, on an older version of itself

Trivia: which of these algorithms will fail?
– Q-learning
– SARSA
– Expected Value SARSA
– CEM

Off-policy methods work; on-policy methods are super-slow (fail).

41
When training with on-policy methods,
– use no (or small) experience replay
– compensate with parallel game sessions

42
Deep learning meets MDP

– Dropout, noise
● Used in experience replay only: works like the usual dropout
● Used when interacting: a special kind of exploration
● You may want to decrease p over time

– Batchnorm
● Faster training, but may break the moving averages
● Experience replay: may break down if the buffer is too small
● Parallel agents: may break down with too few agents
(same problem of the data being non-i.i.d.)
Final problem

Left or right?
44
Problem:
Most practical cases are partially observable:
the agent's observation does not hold all information about the process state
(e.g. a human's field of view).

Any ideas?

45
Problem:
Most practical cases are partially observable:
the agent's observation does not hold all information about the process state
(e.g. a human's field of view).

● However, we can try to infer hidden states from sequences of observations:

s_t ≈ m_t : P(m_t | o_t, m_{t−1})

● Intuitively, that's the agent's memory state.

46
Partially observable MDP

[Diagram: the agent observes an observation and applies an action; the environment's hidden state, for which the Markov assumption holds [but no one cares], is not observed directly]

47
N-gram heuristic

Idea:
s_t ≠ o(s_t)
s_t ≈ ( o(s_{t−n}), a_{t−n}, ..., o(s_{t−1}), a_{t−1}, o(s_t) )

e.g. ball movement in Breakout

• One frame    • Several frames

48


4-frame DQN

[Diagram: keep a stack of 4 images (image t, t−1, t−2, t−3); on each step, push the new image and delete the last frame; the stack goes through Conv0 → Conv1 → ... → Dense → Q-values → ε-greedy rule → apply action]

– Any neural network you can think of: conv, pool, dense, dropout, batchnorm, ...
– Q-values is a dense layer with no nonlinearity.
– ε-greedy rule (tune ε or use a probabilistic rule).
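A small sketch of the frame-stacking trick from the diagram (my own illustration; the frame count and shapes are assumptions):

```python
from collections import deque
import numpy as np

class FrameStack:
    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)   # pushing a new frame drops the oldest one

    def reset(self, first_frame):
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.observation()

    def push(self, frame):
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        return np.stack(self.frames, axis=0)   # shape (4, h, w), fed to the network
```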


N-gram heuristic

Idea:
s_t ≠ o(s_t)
s_t ≈ ( o(s_{t−n}), a_{t−n}, ..., o(s_{t−1}), a_{t−1}, o(s_t) )

e.g. ball movement in Breakout

• One frame    • Several frames

50


Alternatives

N-grams:
• Nth-order Markov assumption
• Works for velocity/timers
• Fails for anything longer than N frames
• Impractical for large N

Alternative approach:
• Infer hidden variables given the observation sequence
• Kalman filters, recurrent neural networks
• More on that in a few lectures

51
Seminar

52
Autocorrelation

● The reference is based on predictions:

r + γ · max_{a'} Q(s_{t+1}, a')

● Any error in the Q approximation is propagated to its neighbors

● If some Q(s,a) is mistakenly overestimated, neighboring Q-values
will also be increased in a cascade

● Worst case: divergence

● Any ideas?
Target networks

Idea: use an older network snapshot to compute the reference:

L = ( Q(s_t, a_t) − [ r + γ · max_{a'} Q_old(s_{t+1}, a') ] )^2

● Update Q_old periodically

● Slows down training

54
Target networks

Idea: use an older network snapshot to compute the reference:

L = ( Q(s_t, a_t) − [ r + γ · max_{a'} Q_old(s_{t+1}, a') ] )^2

● Update Q_old periodically

● Slows down training

● Smooth version: use a moving average of the weights θ:

θ_old := (1 − α) · θ_old + α · θ_new

55
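A sketch of the smooth target update above (my own PyTorch example; the mixing rate α is an assumption):

```python
import copy
import torch

def make_target(model):
    target = copy.deepcopy(model)            # older snapshot Q_old
    for p in target.parameters():
        p.requires_grad_(False)
    return target

@torch.no_grad()
def soft_update(target, model, alpha=0.01):
    # theta_old := (1 - alpha) * theta_old + alpha * theta_new
    for p_old, p_new in zip(target.parameters(), model.parameters()):
        p_old.mul_(1 - alpha).add_(alpha * p_new)
```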
