4a - Approximate Reinforcement Learning
Episode 4
Recap: Q-learning

One approach: action Q-values.
The action value Q(s,a) is the expected total reward G the agent gets from
state s by taking action a and following policy π from the next state onward.

L ≈ [Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a'))]²

How to optimize?
Q-learning as MSE minimization

L ≈ [Q(s_t, a_t) − (r_t + γ · max_{a'} Q(s_{t+1}, a'))]²
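A minimal sketch of this loss and the corresponding tabular update, under assumed toy sizes and an assumed discount factor (not from the slides):

```python
import numpy as np

# Hypothetical toy setup: tabular Q-values for 5 states and 2 actions.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
gamma = 0.99   # assumed discount factor

def td_loss(s, a, r, s_next):
    """Squared TD error for one transition (s, a, r, s')."""
    target = r + gamma * Q[s_next].max()       # r + γ·max_a' Q(s', a')
    return (Q[s, a] - target) ** 2

def q_learning_update(s, a, r, s_next, alpha=0.1):
    """Tabular Q-learning step: move Q(s,a) toward the one-step target."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
```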
Real world

|S| ≈ 2^(8·210·160) = 729179546432630... (80917 digits :)
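A quick sanity check of that digit count (assuming one 8-bit value per pixel on a 210×160 screen, as in the formula above):

```python
from math import floor, log10

# Atari screen: 210 x 160 pixels, 8 bits each, so |S| = 2^(8*210*160) distinct screens.
exponent = 8 * 210 * 160                  # 268800
n_digits = floor(exponent * log10(2)) + 1
print(exponent, n_digits)                 # 268800 80917
```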
Problem:
State space is usually large, sometimes continuous.
And so is the action space.

Two solutions:
– Binarize state space (last week): too many bins or handcrafted features
– Approximate agent with a function (crossentropy method)

Which one would you prefer for Atari?
● Before:
– For all states, for all actions, remember Q(s,a)
● Now:
– Approximate Q(s,a) with some function
– e.g. a linear model over state features

argmin_{w,b} (Q(s_t, a_t) − [r_t + γ · max_{a'} Q(s_{t+1}, a')])²
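A minimal sketch of such a linear Q-model; the feature extractor, feature dimension and action count are hypothetical:

```python
import numpy as np

n_features, n_actions = 8, 3               # hypothetical sizes
W = np.zeros((n_actions, n_features))      # one weight row per action
b = np.zeros(n_actions)

def featurize(state):
    """Hypothetical feature extractor φ(s); here the state is already a float vector."""
    return np.asarray(state, dtype=np.float64)

def q_values(state):
    """Q(s, a) = w_a · φ(s) + b_a: a linear model over state features."""
    return W @ featurize(state) + b
```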
Approximate Q-learning

[Figure: agent-environment loop (observation → model with parameters W → action → environment → next observation).]

Objective:
L = (Q(s_t, a_t) − [r + γ · max_{a'} Q(s_{t+1}, a')])²
(the bracketed target term is considered constant when differentiating)

Gradient step:
w_{t+1} = w_t − α · ∂L/∂w_t
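A sketch of this gradient step for the linear model above, treating the bracketed target as a constant (α and γ are assumed values):

```python
gamma, alpha = 0.99, 0.01    # assumed hyperparameters

def approx_q_learning_step(s, a, r, s_next):
    """One step of w_{t+1} = w_t − α·∂L/∂w_t with the target treated as a constant."""
    target = r + gamma * q_values(s_next).max()    # "consider const": no gradient through this
    td_error = q_values(s)[a] - target
    # L = td_error², so ∂L/∂W[a] = 2·td_error·φ(s) and ∂L/∂b[a] = 2·td_error
    W[a] -= alpha * 2 * td_error * featurize(s)
    b[a] -= alpha * 2 * td_error
```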
Approximate SARSA

Objective:
L = (Q(s_t, a_t) − [r + γ · Q(s_{t+1}, a_{t+1})])²   (bracketed term considered constant)

Q-learning:
L = (Q(s_t, a_t) − [r + γ · max_{a'} Q(s_{t+1}, a')])²   (bracketed term considered constant)
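For comparison, a sketch of the two reference values using the q_values helper above (function names are hypothetical):

```python
def sarsa_target(r, s_next, a_next):
    """SARSA: bootstrap from the action actually taken in the next state."""
    return r + gamma * q_values(s_next)[a_next]

def q_learning_target(r, s_next):
    """Q-learning: bootstrap from the greedy action in the next state."""
    return r + gamma * q_values(s_next).max()
```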
Architectures

[Figure: observation → a few Dense layers (whatever you found in your favorite deep learning toolkit) → Q-values as a dense layer with no nonlinearity; actions are chosen by an ϵ-greedy rule (tune ϵ or use a probabilistic rule).]
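A minimal sketch of the ϵ-greedy rule from the figure (the default ϵ is an assumed value):

```python
import numpy as np

def epsilon_greedy(q_values_for_state, epsilon=0.1):
    """Pick a random action with probability ε, otherwise the greedy action."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values_for_state))
    return int(np.argmax(q_values_for_state))
```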
Deep learning approach: DQN
DQN

[Figure: image (i,w,h,3) → dimshuffle to (i,3,w,h) to fit Lasagne convolutions → Conv0, Conv1, ... (any neural network you can think of: conv, pool, dense, dropout, batchnorm) → Q-values as a dense layer with no nonlinearity → ϵ-greedy rule (tune ϵ or use a probabilistic rule) → apply action.]
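A minimal sketch of such a network in PyTorch (the slides use Lasagne; the layer sizes here are assumptions, not the original architecture):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network: image in, one Q-value per action out (no final nonlinearity)."""
    def __init__(self, n_actions, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, n_actions),    # Q-values: dense layer, no nonlinearity
        )

    def forward(self, images):
        # PyTorch wants (batch, channels, height, width); permute if images come as (batch, h, w, 3),
        # which plays the role of the "dimshuffle" in the figure.
        if images.shape[-1] == 3:
            images = images.permute(0, 3, 1, 2)
        return self.net(images)

# usage: q = DQN(n_actions=6)(torch.zeros(1, 210, 160, 3))   # -> shape (1, 6)
```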
● Any ideas?
Multiple agent trick

Idea: throw in several agents with shared weights W, coordinated via a parameter server.

[Figure: several agents interacting in parallel, all reading and writing the shared parameters W on a parameter server.]

Any +/- ?
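A rough sketch of the trick above under assumed interfaces: make_env() and its step() signature are hypothetical, and q_values() is the shared model sketched earlier:

```python
import numpy as np

n_agents = 4
envs = [make_env() for _ in range(n_agents)]     # assumed env factory
states = [env.reset() for env in envs]

def collect_parallel_step(epsilon=0.1):
    """Every agent takes one step using the same shared parameters W; returns their transitions."""
    transitions = []
    for i, env in enumerate(envs):
        q = q_values(states[i])                  # shared W: all agents query the same model
        a = np.random.randint(len(q)) if np.random.rand() < epsilon else int(np.argmax(q))
        s_next, r, done = env.step(a)            # assumed step() signature
        transitions.append((states[i], a, r, s_next, done))
        states[i] = env.reset() if done else s_next
    # simultaneous transitions from different agents are less correlated than one agent's trajectory
    return transitions
```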
Experience replay

[Figure: the interaction loop writes <s,a,r,s'> tuples into a replay buffer; training batches are sampled from that buffer.]

● Older interactions were obtained under a weaker policy
Better versions coming next week
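A minimal replay-buffer sketch with uniform sampling (capacity and batch size are assumed values):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of <s, a, r, s', done> tuples with uniform sampling."""
    def __init__(self, capacity=10_000):           # assumed capacity
        self.storage = deque(maxlen=capacity)      # oldest (weaker-policy) samples fall out

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.storage, batch_size)
        return list(zip(*batch))                   # (states, actions, rewards, next_states, dones)
```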
Summary so far

Use parallel agents or an experience replay buffer to make the training data closer to i.i.d.
An important question

Which of these methods can be trained on data from an experience replay buffer?
– Q-learning
– CEM
– SARSA
– Expected Value SARSA
Deep learning meets MDP

– Dropout, noise
● Used in experience replay only: behaves like the usual dropout
● Used when interacting: a special kind of exploration
● You may want to decrease p over time.
– Batchnorm
● Faster training, but may break the moving averages
● Experience replay: may break down if the buffer is too small
● Parallel agents: may break down with too few agents
(same problem of the data being non-i.i.d.)
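A small sketch of how the two dropout modes above are usually toggled in PyTorch (the tiny network here is an assumed stand-in):

```python
import torch.nn as nn

# Assumed tiny Q-network with dropout, just to show the mode switch.
q_net = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),
    nn.Dropout(p=0.1),                 # p can be decreased over time
    nn.Linear(64, 2),                  # Q-values, no final nonlinearity
)

q_net.train()   # training on replayed batches: dropout behaves as usual
# ... loss.backward(); optimizer.step() ...

q_net.train()   # keeping dropout active while interacting adds noise, i.e. extra exploration
q_net.eval()    # or disable it for purely greedy evaluation
```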
Final problem

Left or right?
Problem:
Most practical cases are partially observable:
the agent's observation does not hold all information about the process state
(e.g. a human's field of view).

Any ideas?
Partially observable MDP

[Figure: agent-environment loop where the agent only observes part of the hidden state; the Markov assumption holds for the hidden state (but no one cares).]
N-gram heuristic

Idea:
s_t ≠ o(s_t)
s_t ≈ (o(s_{t−n}), a_{t−n}, ..., o(s_{t−1}), a_{t−1}, o(s_t))

e.g. ball movement in Breakout
[Figure: frame-stacking DQN: each new image is pushed into a stack of the last 4 images (t, t−1, t−2, t−3), which is fed through Conv0, Conv1, ... (any neural network you can think of: conv, pool, dense, dropout, batchnorm) to a dense Q-values layer with no nonlinearity.]
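A minimal frame-stacking sketch for the heuristic above (4 frames and channel-axis stacking are assumed choices):

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keep the last n observations and present them as one 'state'."""
    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)

    def reset(self, first_obs):
        for _ in range(self.frames.maxlen):
            self.frames.append(first_obs)          # pad with the first frame
        return self.state()

    def push(self, obs):
        self.frames.append(obs)                    # oldest frame falls out
        return self.state()

    def state(self):
        return np.concatenate(list(self.frames), axis=-1)   # stack along the channel axis
```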
N-grams:
• Nth-order Markov assumption

Alternative approach:
• Infer hidden variables given the observation sequence
• Kalman filters, recurrent neural networks
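A sketch of the recurrent alternative: a GRU whose hidden state plays the role of the unobserved s_t (sizes are assumptions):

```python
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    """Infer a hidden state from the observation sequence with a GRU, then predict Q-values."""
    def __init__(self, obs_dim, n_actions, hidden=64):    # sizes are assumed
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, h0=None):
        # obs_seq: (batch, time, obs_dim); the GRU's hidden state stands in for s_t.
        out, h = self.rnn(obs_seq, h0)
        return self.head(out[:, -1]), h     # Q-values for the last step, plus the hidden state
```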
Autocorrelation

Problem: Q(s_t, a_t) and the reference r + γ · max_{a'} Q(s_{t+1}, a') are computed with the same network, so every update also shifts the target it is chasing.

● Any ideas?
Target networks

Idea: use an older network snapshot to compute the reference

L = (Q(s_t, a_t) − [r + γ · max_{a'} Q_old(s_{t+1}, a')])²

● Smooth version: use a moving average of the weights
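A sketch of both variants in PyTorch (the stand-in network shape, the update period, and τ are assumed values):

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # stand-in Q-network
target_net = copy.deepcopy(q_net)      # older snapshot used to compute the reference

def hard_update():
    """Every N steps, copy the online weights into the target network."""
    target_net.load_state_dict(q_net.state_dict())

def soft_update(tau=0.01):
    """Smooth version: keep the target as a moving average of the online weights."""
    with torch.no_grad():
        for p_targ, p in zip(target_net.parameters(), q_net.parameters()):
            p_targ.mul_(1 - tau).add_(tau * p)

# The reference r + γ·max_a' Q_old(s', a') is then computed with target_net, not q_net.
```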