lecture-06
Supervised Learning
Data: (x, y)
x is data, y is label
Goal: Learn a function to map x -> y
Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc.

Unsupervised Learning
Data: x
Just data, no labels!
Goal: Learn some underlying hidden structure of the data
Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.
[Figures: 1-d and 2-d density estimation examples; 2-d density images are CC0 public domain]
Today: Reinforcement Learning

Reinforcement Learning
Problems involving an agent interacting with an environment, which provides numeric reward signals.
The agent and the environment interact in a loop:
- the environment gives the agent a state st
- the agent takes an action at
- the environment returns a reward rt and the next state st+1
Goal: learn how to take actions in order to maximize reward.
Cart-Pole Problem
Objective: balance a pole on top of a movable cart.
State: pole angle, angular speed, cart position, horizontal velocity. Action: horizontal force applied to the cart. Reward: 1 at each time step if the pole is upright.
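As a rough illustration of the agent-environment loop, here is a minimal sketch in Python. It assumes a Gym-style API (reset/step) via the gymnasium package; the "CartPole-v1" environment and the random action choice are placeholders for a real task and a real agent.

```python
# Minimal sketch of the agent-environment loop, assuming a Gym-style API
# (env.reset() / env.step(action)). The random policy is a placeholder for an agent.
import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset()                      # environment gives the initial state s_t
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()      # agent picks an action a_t (here: at random)
    state, reward, terminated, truncated, _ = env.step(action)  # env returns r_t and s_{t+1}
    total_reward += reward
    done = terminated or truncated
print("episode return:", total_reward)
```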
Markov Decision Process
- Mathematical formulation of the RL problem
- Markov property: the current state completely characterises the state of the world

Defined by the tuple (S, A, R, P, γ):
- S: set of possible states
- A: set of possible actions
- R: distribution of reward given a (state, action) pair
- P: transition probability, i.e. distribution over the next state given a (state, action) pair
- γ: discount factor

We want to find the optimal policy π* that maximizes the expected sum of discounted rewards. Formally:
π* = argmax_π E[ Σ_{t≥0} γ^t r_t | π ], with s_0 ~ p(s_0), a_t ~ π(· | s_t), s_{t+1} ~ p(· | s_t, a_t)
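To make the tuple concrete, here is a tiny made-up two-state MDP written out explicitly in Python; the states, actions, rewards, and transition probabilities are invented purely for illustration.

```python
# A tiny, made-up MDP written out explicitly as the tuple (S, A, R, P, gamma).
S = ["cool", "hot"]            # set of possible states
A = ["slow", "fast"]           # set of possible actions

# R[s][a]: expected reward for taking action a in state s
R = {
    "cool": {"slow": 1.0, "fast": 2.0},
    "hot":  {"slow": 1.0, "fast": -10.0},
}

# P[s][a][s']: probability of landing in s' after taking action a in state s
P = {
    "cool": {"slow": {"cool": 1.0, "hot": 0.0}, "fast": {"cool": 0.5, "hot": 0.5}},
    "hot":  {"slow": {"cool": 0.5, "hot": 0.5}, "fast": {"cool": 0.0, "hot": 1.0}},
}

gamma = 0.9                    # discount factor
```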
Definitions: Value function and Q-value function
Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …

How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s:
V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]

How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
Q^π(s, a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
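To make the definition concrete, the value of a state can be estimated by Monte Carlo: sample trajectories under the policy from s and average their discounted returns. In the sketch below, `rollout` is a hypothetical stand-in for whatever simulator and policy produce the reward sequence of one trajectory.

```python
# Sketch: estimate V^pi(s0) by averaging discounted returns over sampled trajectories.
# `rollout(s0)` is a hypothetical helper that runs the policy pi from s0 and returns
# the list of rewards [r0, r1, ..., rT] collected along one trajectory.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_value(rollout, s0, num_trajectories=1000, gamma=0.9):
    returns = [discounted_return(rollout(s0), gamma) for _ in range(num_trajectories)]
    return sum(returns) / len(returns)

# Example with a fake rollout that always yields the same rewards:
print(estimate_value(lambda s: [1.0, 1.0, 1.0], s0=None))  # ~ 1 + 0.9 + 0.81 = 2.71
```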
Bellman equation
The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair:
Q*(s, a) = max_π E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]

Q* satisfies the Bellman equation:
Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

Intuition: if the optimal state-action values for the next time-step Q*(s', a') are known, then the optimal strategy is to take the action that maximizes the expected value of r + γ Q*(s', a').

The optimal policy π* corresponds to taking the best action in any state as specified by Q*.
Solving for the optimal policy
Value iteration algorithm: use the Bellman equation as an iterative update:
Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]
Q_i converges to Q* as i → ∞. The problem: this is not scalable, since we must compute Q(s, a) for every (state, action) pair. Solution: use a function approximator to estimate Q(s, a).
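Here is a tabular sketch of this iterative update on the same kind of tiny made-up MDP as above (restated inline so the snippet runs on its own): repeatedly apply the Bellman update until the Q-values stop changing, then read off the greedy policy.

```python
# Tabular Q-value iteration on a tiny made-up MDP: repeatedly apply
#   Q_{i+1}(s, a) = sum_{s'} P(s'|s,a) * (R(s,a) + gamma * max_{a'} Q_i(s', a'))
# until convergence. R(s, a) is the expected reward for the (state, action) pair.
S = ["cool", "hot"]
A = ["slow", "fast"]
R = {"cool": {"slow": 1.0, "fast": 2.0}, "hot": {"slow": 1.0, "fast": -10.0}}
P = {
    "cool": {"slow": {"cool": 1.0, "hot": 0.0}, "fast": {"cool": 0.5, "hot": 0.5}},
    "hot":  {"slow": {"cool": 0.5, "hot": 0.5}, "fast": {"cool": 0.0, "hot": 1.0}},
}
gamma = 0.9

Q = {s: {a: 0.0 for a in A} for s in S}
for _ in range(1000):
    newQ = {
        s: {a: sum(P[s][a][s2] * (R[s][a] + gamma * max(Q[s2].values())) for s2 in S)
            for a in A}
        for s in S
    }
    converged = max(abs(newQ[s][a] - Q[s][a]) for s in S for a in A) < 1e-6
    Q = newQ
    if converged:
        break

print(Q)                                          # (approximately) optimal Q-values
print({s: max(Q[s], key=Q[s].get) for s in S})    # greedy policy read off from Q
```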
Solving for the optimal policy: Q-learning
Q-learning: use a function approximator, here a deep neural network with weights θ, to estimate the action-value function, Q(s, a; θ) ≈ Q*(s, a). Remember: we want to find a Q-function that satisfies the Bellman equation.

Forward pass
Loss function: L_i(θ_i) = E_{s,a}[ (y_i - Q(s, a; θ_i))^2 ]
where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i-1}) | s, a ]
Iteratively try to make the Q-value close to the target value y_i it should have if the Q-function corresponds to the optimal Q* (and optimal policy π*).

Backward pass
Gradient update (with respect to Q-function parameters θ):
∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (r + γ max_{a'} Q(s', a'; θ_{i-1}) - Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ]
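A sketch of one such update step in PyTorch is below. It assumes `q_net` is some differentiable Q-function Q(s, ·; θ) returning one value per action, and `target_net` is a copy holding the older parameters θ_{i-1} used to compute the target y_i; the tensor shapes and the (1 - dones) mask at episode ends are illustrative assumptions. Computing the target under `torch.no_grad()` mirrors the fact that y_i is treated as a fixed target rather than differentiated through.

```python
# Sketch of one Q-learning update step in PyTorch.
# q_net(states) is assumed to return Q(s, a; theta) for every action, shape [batch, num_actions];
# target_net holds the older parameters used for the target y_i.
import torch
import torch.nn.functional as F

def q_learning_step(q_net, target_net, optimizer,
                    states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a; theta) for the actions actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target y_i = r + gamma * max_a' Q(s', a'; theta_old); no gradient flows through it
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * next_q   # (1 - dones) zeroes bootstrap at episode end

    loss = F.mse_loss(q_values, y)    # L(theta) = E[(y - Q(s, a; theta))^2]
    optimizer.zero_grad()
    loss.backward()                   # gradient with respect to Q-function parameters theta
    optimizer.step()
    return loss.item()
```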
[Mnih et al. NIPS Workshop 2013; Nature 2015]
Q-network Architecture
Q(s, a; θ): a neural network with weights θ.

Input: state st
Network (from input to output):
- 16 8x8 conv filters, stride 4
- 32 4x4 conv filters, stride 2
- FC-256
- FC-4 (Q-values)

The last FC layer has a 4-dimensional output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st, a4). The number of actions is between 4 and 18, depending on the Atari game.
A single feedforward pass computes the Q-values for all actions from the current state => efficient!
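Here is a sketch of this architecture in PyTorch, assuming the 84x84x4 preprocessed frame-stack input used in the cited paper; padding, initialization, and preprocessing details are left out.

```python
# Sketch of the DQN Q-network: Q(s, .; theta) for an Atari-style input.
# Assumes state s_t is a stack of 4 preprocessed 84x84 grayscale frames;
# the 4-action output is just the example from the slide.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, num_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 16 8x8 conv filters, stride 4
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 32 4x4 conv filters, stride 2
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # FC-4: Q(s_t, a_1), ..., Q(s_t, a_4)
        )

    def forward(self, state):                            # state: [batch, 4, 84, 84]
        return self.net(state)                           # one pass gives Q-values for all actions

q_net = QNetwork(num_actions=4)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)            # torch.Size([1, 4])
```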
Putting it together: Deep Q-Learning with Experience Replay
[Mnih et al. NIPS Workshop 2013; Nature 2015]
- Initialize state (starting game screen pixels) at the beginning of each episode
- At each timestep, select an action, observe the reward and next state, and store the transition in replay memory
- Experience Replay: sample a random minibatch of transitions from replay memory and perform a gradient descent step on the Q-learning loss
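Putting those steps together, a heavily simplified training loop might look like the sketch below. It is not the paper's exact procedure: it uses a small MLP Q-network on CartPole instead of the Atari conv net, a single network for both prediction and target, and illustrative values for epsilon, buffer size, and batch size.

```python
# Heavily simplified sketch of deep Q-learning with experience replay.
# Environment, network size, and hyperparameters are illustrative assumptions.
import random
from collections import deque
import torch
import torch.nn.functional as F
import gymnasium as gym

env = gym.make("CartPole-v1")
q_net = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                       # replay memory
gamma, epsilon, batch_size = 0.99, 0.1, 32

for episode in range(200):
    state, _ = env.reset()                          # initialize state at the start of each episode
    done = False
    while not done:
        # epsilon-greedy action selection: explore with probability epsilon, else act greedily
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = q_net(torch.tensor(state).float()).argmax().item()
        next_state, reward, term, trunc, _ = env.step(action)
        done = term or trunc
        replay.append((state, action, reward, next_state, float(done)))   # store transition
        state = next_state

        if len(replay) >= batch_size:
            # experience replay: sample a random minibatch and take a gradient step
            s, a, r, s2, d = map(lambda x: torch.tensor(x).float(),
                                 zip(*random.sample(list(replay), batch_size)))
            q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                y = r + gamma * (1 - d) * q_net(s2).max(dim=1).values
            loss = F.mse_loss(q, y)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
```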
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Video by Károly Zsolnai-Fehér. Reproduced with permission.
Policy Gradients
What is a problem with Q-learning? The Q-function can be very complicated!
Example: a robot grasping an object has a very high-dimensional state => hard to learn the exact value of every (state, action) pair.
But the policy can be much simpler: just close your hand.
Can we learn a policy directly, e.g. finding the best policy from a collection of policies?
Policy Gradients
Formally, let's define a class of parametrized policies: Π = { π_θ, θ ∈ R^m }
For each policy, define its value:
J(θ) = E[ Σ_{t≥0} γ^t r_t | π_θ ]
We want to find the optimal policy θ* = argmax_θ J(θ). How can we do this? Gradient ascent on the policy parameters!
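For instance, a parametrized policy π_θ(a | s) could be a small network that outputs a categorical distribution over actions; the state dimension, action count, and layer sizes below are arbitrary placeholders.

```python
# A parametrized policy pi_theta(a | s): a small network producing a categorical
# distribution over actions. Sizes are arbitrary placeholders.
import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, state_dim=4, num_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))  # pi_theta(. | s)

pi = Policy()
dist = pi(torch.zeros(4))      # distribution over actions in some state s
a = dist.sample()              # a_t ~ pi_theta(. | s_t)
log_prob = dist.log_prob(a)    # log pi_theta(a_t | s_t), used by the REINFORCE gradient below
```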
REINFORCE algorithm
Mathematically, we can write the objective as an expectation over trajectories:
J(θ) = E_{τ ~ p(τ; θ)}[ r(τ) ] = ∫ r(τ) p(τ; θ) dτ
where r(τ) is the reward of a trajectory τ = (s0, a0, r0, s1, …).

Differentiating gives ∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ; θ) dτ, which is intractable directly. Using the identity ∇_θ p(τ; θ) = p(τ; θ) ∇_θ log p(τ; θ), we can rewrite it as an expectation we can estimate by sampling trajectories:
∇_θ J(θ) = E_{τ ~ p(τ; θ)}[ r(τ) ∇_θ log p(τ; θ) ]

Can we compute those quantities without knowing the transition probabilities?
We have: p(τ; θ) = Π_{t≥0} p(s_{t+1} | s_t, a_t) π_θ(a_t | s_t)
Thus: log p(τ; θ) = Σ_{t≥0} [ log p(s_{t+1} | s_t, a_t) + log π_θ(a_t | s_t) ]
And when differentiating:
∇_θ log p(τ; θ) = Σ_{t≥0} ∇_θ log π_θ(a_t | s_t), which doesn't depend on the transition probabilities!

Therefore, when sampling a trajectory τ, we can estimate the gradient with
∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)
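As a sketch, this Monte Carlo estimator can be implemented by sampling a trajectory with the current policy, summing log π_θ(a_t | s_t) along it, and weighting that sum by the trajectory reward r(τ). The Gym-style environment and the tiny policy network below are illustrative assumptions, not part of the lecture.

```python
# Sketch of the REINFORCE estimator: grad J(theta) ~ sum_t r(tau) * grad log pi_theta(a_t | s_t).
# The environment and the small policy network are illustrative placeholders.
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(500):
    state, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:                                   # sample a trajectory tau under pi_theta
        dist = torch.distributions.Categorical(logits=policy(torch.tensor(state).float()))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))       # log pi_theta(a_t | s_t)
        state, reward, term, trunc, _ = env.step(action.item())
        rewards.append(reward)
        done = term or trunc

    r_tau = sum(rewards)                              # r(tau): total reward of the trajectory
    loss = -r_tau * torch.stack(log_probs).sum()      # minimizing this ascends J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```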
Intuition
Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)
Interpretation:
- If r(τ) is high, push up the probabilities of the actions seen
- If r(τ) is low, push down the probabilities of the actions seen
Might seem simplistic to say that if a trajectory is good then all its actions were good. But in expectation, it averages out!
However, this also suffers from high variance because credit assignment is really hard. Can we help the estimator?
Variance reduction
Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)
One way to reduce variance is to subtract a baseline from the reward. What is important then? Whether a reward is better or worse than what you expect to get.
Actor-Critic
Combine Policy Gradients and Q-learning by training both an actor (the policy) and a critic (a Q-function):
- The actor decides which action to take, and the critic tells the actor how good its action was and how it should adjust
- This also alleviates the task of the critic, as it only has to learn the values of (state, action) pairs generated by the policy
- Can also incorporate Q-learning tricks, e.g. experience replay
- Remark: the advantage function A^π(s, a) = Q^π(s, a) - V^π(s) measures how much an action was better than expected
Actor-Critic Algorithm
Initialize policy (actor) parameters θ and critic parameters φ
For iteration = 1, 2, … do
    Sample m trajectories under the current policy
    For i = 1, …, m do
        For t = 1, …, T do
            Accumulate the advantage-weighted policy gradient for the actor and the value-error gradient for the critic
        End for
    End for
    Update θ by gradient ascent and φ by gradient descent
End for
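Below is a minimal sketch of one such update. It is a common simplified variant that uses a state-value critic V_φ(s) and the one-step TD error r + γ V_φ(s') - V_φ(s) as the advantage estimate, rather than the Q-function critic described above; all network sizes and the dummy tensors are placeholders.

```python
# Minimal sketch of one actor-critic update (state-value critic variant, not the exact
# Q-function-critic algorithm from the slides).
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # policy pi_theta(a | s)
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))  # value V_phi(s)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def actor_critic_step(state, action, reward, next_state, done, gamma=0.99):
    v = critic(state)
    with torch.no_grad():
        target = reward + gamma * (1 - done) * critic(next_state)
    advantage = (target - v).detach()           # how much better the action was than expected

    dist = torch.distributions.Categorical(logits=actor(state))
    actor_loss = -(advantage * dist.log_prob(action)).mean()  # push up better-than-expected actions
    critic_loss = (target - v).pow(2).mean()                  # regress V_phi toward the target

    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()

# Example call with dummy tensors (batch of one 4-d state, 2 discrete actions):
actor_critic_step(torch.zeros(1, 4), torch.tensor([0]), torch.tensor([1.0]),
                  torch.zeros(1, 4), torch.tensor([0.0]))
```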
REINFORCE in action: Recurrent Attention Model (RAM)
[Mnih et al. 2014]
Objective: image classification, by taking a sequence of "glimpses" that selectively focus on regions of the image.
Glimpsing is a non-differentiable operation => learn a policy for how to take glimpse actions using REINFORCE.
Given the state of glimpses seen so far, use an RNN to model the state and output the next action (the location of the next glimpse).
[Figure: the RNN is unrolled over glimpses of the input image; at each step it emits the next glimpse location (x1, y1), (x2, y2), …, (x5, y5), and a final softmax over classes produces the prediction, e.g. y = 2.]
This approach has also been used in many other tasks, including fine-grained image recognition, image captioning, and visual question answering!
- Guarantees:
  - Policy Gradients: converges to a local optimum of J(θ), which is often good enough!
  - Q-learning: zero guarantees, since you are approximating the Bellman equation with a complicated function approximator