Artificial Intelligence
Lecture 11 – Reinforcement Learning II
Dr. Shivanjali Khare
[email protected]
Reinforcement Learning
• We still assume an MDP:
• A set of states s ∈ S
• A set of actions (per state) a ∈ A
• A model T(s, a, s')
• A reward function R(s, a, s')
• Still looking for a policy π(s) (a minimal representation of these pieces is sketched below)
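As a concrete reminder of what these pieces look like, here is a minimal sketch of an MDP plus a fixed policy in Python; the tiny two-state chain, its probabilities, and its rewards are made up purely for illustration.

# Minimal MDP sketch: states S, per-state actions A, transition model T(s, a, s'),
# reward function R(s, a, s'), and a fixed policy pi(s). All numbers are made up.
STATES = ["A", "B"]
ACTIONS = {"A": ["stay", "go"], "B": ["stay"]}

# T[(s, a)] lists (s', probability) pairs
T = {
    ("A", "stay"): [("A", 1.0)],
    ("A", "go"):   [("B", 0.9), ("A", 0.1)],
    ("B", "stay"): [("B", 1.0)],
}

# R[(s, a, s')] is the reward received on that transition
R = {
    ("A", "stay", "A"): 0.0,
    ("A", "go", "B"):   1.0,
    ("A", "go", "A"):   0.0,
    ("B", "stay", "B"): 0.0,
}

# A fixed policy: one action per state
pi = {"A": "go", "B": "stay"}
print(T[("A", pi["A"])])   # successors of taking pi(A) in A: [('B', 0.9), ('A', 0.1)]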
Evaluate a fixed policy π:
  Model-based: PE on approx. MDP
  Model-free: Value Learning
Analogy: Expected Age
• Idea: Take samples of outcomes s' (by doing the action!) and average
[Diagram: from state s, take action π(s), observe sampled successors s1', s2', s3']
• Almost! But we can’t rewind time to get sample after sample from state s.
Model-Free Learning
• Model-free (temporal difference) learning
• Experience the world through episodes: (s, a, r, s', a', r', s'', …)
• Update estimates after each transition (s, a, r, s')
• Over time, the updates will mimic Bellman updates
Temporal Difference Learning
• Temporal difference learning of values
• Policy still fixed, still doing evaluation!
• Move values toward the value of whatever successor occurs: running average
Sample of V(s):  sample = R(s, π(s), s') + γ V^π(s')
Update to V(s):  V^π(s) ← (1 − α) V^π(s) + α · sample
Same update:     V^π(s) ← V^π(s) + α (sample − V^π(s))
Example: Temporal Difference Learning
Grid: A on top, B C D in the middle row, E below. Values over three snapshots:
  Snapshot 1 (initial):             A=0, B=0,  C=0, D=8, E=0
  Snapshot 2 (after one update):    A=0, B=−1, C=0, D=8, E=0
  Snapshot 3 (after a second one):  A=0, B=−1, C=3, D=8, E=0
Assume: γ = 1, α = 1/2
(the two updates are reproduced in the sketch below)
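A minimal sketch of this example in Python: the update is V(s) ← (1 − α) V(s) + α (r + γ V(s')), and the two transitions below (each with reward −2) are inferred from the table, since −1 = ½·0 + ½·(−2 + 0) and 3 = ½·0 + ½·(−2 + 8).

# Temporal-difference policy evaluation on the grid above
gamma, alpha = 1.0, 0.5
V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}

def td_update(V, s, r, s_next):
    sample = r + gamma * V[s_next]               # sample of V(s)
    V[s] = (1 - alpha) * V[s] + alpha * sample   # running average toward the sample

# Two observed transitions (rewards of -2 inferred from the snapshots above)
td_update(V, "B", -2.0, "C")   # V(B): 0 -> -1   (snapshot 2)
td_update(V, "C", -2.0, "D")   # V(C): 0 -> 3    (snapshot 3)
print(V)   # {'A': 0.0, 'B': -1.0, 'C': 3.0, 'D': 8.0, 'E': 0.0}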
Q-Learning
• Caveats:
• You have to explore enough
• You have to eventually make the learning rate small enough (see the sketch below)
• … but not decrease it too quickly
• Basically, in the limit, it doesn’t matter how you select actions (!)
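For concreteness, here is a sketch of the tabular Q-learning update these caveats refer to, with a learning rate that shrinks as a state-action pair is visited more often; the discount, the decay schedule, and the example transition are illustrative choices, not prescribed by the slides.

from collections import defaultdict

gamma = 0.9               # discount (illustrative)
Q = defaultdict(float)    # Q[(s, a)], defaults to 0
N = defaultdict(int)      # visit counts, used to decay the learning rate

def q_update(s, a, r, s_next, next_actions):
    # Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]
    N[(s, a)] += 1
    alpha = 1.0 / N[(s, a)]   # decays, but not too quickly: sum of 1/n diverges, sum of 1/n^2 converges
    target = r + gamma * max(Q[(s_next, a2)] for a2 in next_actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

# Example use on a made-up transition
q_update("s0", "east", -1.0, "s1", ["east", "west"])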
Input Policy
[Grid diagram: A on top, B C D in the middle row, E below]
• Act according to the current optimal policy…
• …but also explore! (one simple rule for this is sketched below)
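One common way to "act according to the current optimal, but also explore" is an ε-greedy rule: act randomly with a small probability ε, otherwise act greedily on the current Q-values. This is a sketch with made-up Q-values; the slides do not prescribe this exact rule.

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # With probability epsilon take a random action, otherwise the greedy one
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

# Made-up Q-values for illustration
Q = {("B", "east"): 1.0, ("B", "west"): -0.5}
print(epsilon_greedy(Q, "B", ["east", "west"]))   # usually "east"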
Exploration vs. Exploitation
Video of Demo Q-learning – Manual Exploration – Bridge Grid
How to Explore?
• When to explore?
• Random actions: explore a fixed amount
• Better idea: explore areas whose badness is not (yet) established, eventually stop exploring
• Exploration function
• Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n
Regular Q-Update:   Q(s, a) ← (1 − α) Q(s, a) + α [R(s, a, s') + γ max_a' Q(s', a')]
Modified Q-Update:  Q(s, a) ← (1 − α) Q(s, a) + α [R(s, a, s') + γ max_a' f(Q(s', a'), N(s', a'))]
• Note: this propagates the “bonus” back to states that lead to unknown states as well! (see the sketch below)
[demo – RL pacman]
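A sketch of the exploration function and the modified update above; the constants γ, α, k are illustrative, and the +1 in the denominator is a guard added here so unvisited pairs do not divide by zero.

from collections import defaultdict

gamma, alpha, k = 0.9, 0.5, 2.0   # illustrative constants
Q = defaultdict(float)            # Q[(s, a)]
N = defaultdict(int)              # visit counts N[(s, a)]

def f(u, n):
    # Optimistic utility: large when the visit count n is still small
    return u + k / (n + 1)

def modified_q_update(s, a, r, s_next, next_actions):
    # Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma max_a' f(Q(s',a'), N(s',a'))]
    N[(s, a)] += 1
    target = r + gamma * max(f(Q[(s_next, a2)], N[(s_next, a2)]) for a2 in next_actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

modified_q_update("s0", "east", 0.0, "s1", ["east", "west"])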
Example: Pacman
• Using a feature representation, we can write a q function (or value function) for any state using a few weights:
  V(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
  Q(s, a) = w1 f1(s, a) + w2 f2(s, a) + … + wn fn(s, a)
• Disadvantage: states may share features but actually be very different in value!
Approximate Q-Learning
Exact Q’s:        Q(s, a) ← Q(s, a) + α [difference]
Approximate Q’s:  w_i ← w_i + α [difference] f_i(s, a)
  where difference = [r + γ max_a' Q(s', a')] − Q(s, a)
• Intuitive interpretation:
• Adjust weights of active features
• E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features (see the sketch below)
[Demo: approximate Q-learning pacman (L11D10)]
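A sketch of the approximate Q-learning update above: Q(s, a) is a weighted sum of features, and after each transition every active feature’s weight is nudged by α · difference · f_i(s, a). The feature extractor and its two feature names below are invented for illustration; a real one would inspect the game state.

gamma, alpha = 0.9, 0.01
weights = {"dist-to-food": 0.0, "ghosts-one-step-away": 0.0}

def features(s, a):
    # Hypothetical feature extractor f_i(s, a); fixed values here just to keep the sketch runnable
    return {"dist-to-food": 0.5, "ghosts-one-step-away": 1.0}

def q_value(s, a):
    return sum(weights[i] * v for i, v in features(s, a).items())

def approx_q_update(s, a, r, s_next, next_actions):
    # difference = [r + gamma max_a' Q(s',a')] - Q(s,a);  w_i <- w_i + alpha * difference * f_i(s,a)
    target = r + gamma * max(q_value(s_next, a2) for a2 in next_actions)
    difference = target - q_value(s, a)
    for i, v in features(s, a).items():
        weights[i] += alpha * difference * v

# If something unexpectedly bad happens (very negative r), the active features get blamed:
approx_q_update("s0", "north", -10.0, "s1", ["north", "south"])
print(weights)   # both weights move down, so similar states are dispreferred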
Video of Demo Approximate Q-Learning – Pacman
DeepMind Atari: approximate Q-learning with neural nets
Q-Learning and Least Squares
Linear Approximation: Regression
[Figure: linear regression with one feature (a fitted line) and with two features (a fitted plane)]
Prediction: ŷ = w0 + w1 f1(x)
Prediction: ŷ = w0 + w1 f1(x) + w2 f2(x)
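As a tiny illustration of these predictions, here is a sketch with invented weights and feature values (w0, w1, w2 and the f values are not from the slides).

w0, w1, w2 = 10.0, 2.5, -1.0   # made-up weights
f1_x, f2_x = 4.0, 3.0          # made-up feature values of some input x

y_hat_one_feature = w0 + w1 * f1_x                # 10 + 2.5*4   = 20.0
y_hat_two_features = w0 + w1 * f1_x + w2 * f2_x   # 10 + 10 - 3  = 17.0
print(y_hat_one_feature, y_hat_two_features)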
Optimization: Least Squares
[Figure: observations vs. a fitted line; each error or “residual” is the vertical gap between an observation and its prediction]
total error = Σ_i (y_i − ŷ_i)² = Σ_i ( y_i − Σ_k w_k f_k(x_i) )²
Minimizing Error
Imagine we had only one point x, with features f(x), target value y, and weights w:
  error(w) = ½ ( y − Σ_k w_k f_k(x) )²           ← squared gap between “target” and “prediction”
  w_m ← w_m + α ( y − Σ_k w_k f_k(x) ) f_m(x)    ← gradient step on each weight
Approximate Q update: the target is [r + γ max_a' Q(s', a')] and the prediction is Q(s, a), so
  w_m ← w_m + α [ r + γ max_a' Q(s', a') − Q(s, a) ] f_m(s, a)
(a numeric sketch follows below)
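A small numeric sketch of the one-point case (all numbers invented): compute the squared error, then take the gradient step, which has the same shape as the approximate Q update with [r + γ max_a' Q(s', a')] in the role of the target y.

alpha = 0.1
w = [0.5, -0.2]    # weights (made up)
f = [1.0, 2.0]     # features f(x) of the single point x (made up)
y = 3.0            # target value (made up)

prediction = sum(wk * fk for wk, fk in zip(w, f))   # 0.5*1 + (-0.2)*2 = 0.1
error = 0.5 * (y - prediction) ** 2                 # squared error at this point

# Gradient step on each weight: w_m <- w_m + alpha * (target - prediction) * f_m(x)
w = [wk + alpha * (y - prediction) * fk for wk, fk in zip(w, f)]
print(error, w)   # the new weights move the prediction closer to y (0.1 -> 1.55)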
Overfitting: Why Limiting Capacity Can Help
[Figure: a degree-15 polynomial fit to a small set of points, swinging roughly between −15 and 30 over x from 0 to 20]
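A rough sketch of the point this figure makes, using numpy (the data, degrees, and test point are all made up): a high-degree polynomial can track the training points closely yet behave erratically away from them, while a low-degree fit stays tame. numpy may warn that the degree-15 fit is poorly conditioned.

import numpy as np

rng = np.random.default_rng(0)
x = np.arange(0, 20, dtype=float)                  # 20 training inputs
y = 1.5 * x + rng.normal(0.0, 2.0, size=x.shape)   # noisy linear data (made up)

low = np.polyfit(x, y, deg=1)     # limited capacity
high = np.polyfit(x, y, deg=15)   # enough capacity to chase the noise

x_test = 21.0                     # just outside the training range
print("degree 1 prediction: ", np.polyval(low, x_test))    # close to 1.5 * 21
print("degree 15 prediction:", np.polyval(high, x_test))   # typically far off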
Policy Search
Policy Search
• Problem: often the feature-based policies that work well (win games, maximize utilities) aren’t the ones that approximate V / Q best
• E.g., your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
• Q-learning’s priority: get Q-values close (modeling)
• Action selection priority: get ordering of Q-values right (prediction)
• We’ll see this distinction between modeling and prediction again later in the course
• Solution: learn policies that maximize rewards, not the values that predict them
• Policy search: start with an OK solution (e.g., Q-learning), then fine-tune by hill climbing on feature weights
Policy Search
• Simplest policy search:
• Start with an initial linear value function or Q-function
• Nudge each feature weight up and down and see if your policy is better than before (see the sketch after this list)
• Problems:
• How do we tell the policy got better?
• Need to run many sample episodes!
• If there are a lot of features, this can be impractical
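A sketch of this simplest scheme, assuming a hypothetical evaluate_policy(weights) helper that runs many sample episodes with a policy greedy in those feature weights and returns the average reward (the expensive step noted above); the stub below just returns a noisy score so the sketch runs on its own.

import random

def evaluate_policy(weights, episodes=100):
    # Hypothetical: run `episodes` sample episodes and return the average reward.
    # Stubbed with a noisy made-up score so this file is self-contained.
    return -sum(w * w for w in weights) + random.gauss(0.0, 0.1)

def hill_climb(weights, step=0.1, iterations=50):
    best_score = evaluate_policy(weights)
    for _ in range(iterations):
        i = random.randrange(len(weights))       # pick one feature weight
        for delta in (+step, -step):             # nudge it up and down
            candidate = list(weights)
            candidate[i] += delta
            score = evaluate_policy(candidate)   # needs many sample episodes in practice
            if score > best_score:               # keep the nudge only if the policy got better
                weights, best_score = candidate, score
    return weights

print(hill_climb([1.0, -2.0, 0.5]))   # weights drift toward the stub's optimum (all zeros)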
Evaluate a fixed policy π:
  Model-based: PE on approx. MDP
  Model-free: Value Learning
Discussion: Model-Based vs Model-Free RL
RL: Helicopter Flight
[OpenAI]
Conclusion