DSA5102_lecture12
Li Qianxiao
Department of Mathematics
Last time
We introduced
• The basic formulation of reinforcement learning as
Markov decision processes (MDP)
• Bellman’s equations characterizing the solution of finite
MDPs
Today, we will look at the various solution methods for reinforcement learning that one can derive from these analyses.
Model-Based vs Model-Free
What is a model?
By a model, we mean a precise description of the dynamics of the system, i.e. the transition probability $p(s', r \mid s, a)$ is completely known to us.
Model-Based RL Algorithms
Model-based algorithms rely on using the knowledge of $p(s', r \mid s, a)$ to solve the Bellman (optimality) equations.
https://bair.berkeley.edu/blog/2019/12/12/mbpo/
Model-Free Algorithms
Model-free algorithms, on the other hand, rely on approximate solutions via sampling/Monte-Carlo methods.
Key idea: trade variance and noise for generality and scalability
Model-Based Algorithms
Review of Bellman’s Equations
• Bellman's Optimality Equation for the Value Function
  $v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_*(s')]$, where $\gamma \in [0, 1)$ is the discount factor
• Relationship of $v_*$ and $q_*$
  $v_*(s) = \max_a q_*(s, a)$, where $q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_*(s')]$
Writing the right-hand side of the first equation as $(T v)(s)$, value iteration computes $v_{k+1} = T v_k$. Does $v_k$ converge?
Contraction Mapping Theorem
Let $(X, \|\cdot\|)$ be a complete normed space and $T : X \to X$ be a contraction, i.e. there exists $\gamma \in [0, 1)$ such that $\|T(x) - T(y)\| \le \gamma \|x - y\|$ for all $x, y \in X$. Then $T$ has a unique fixed point $x^*$, and the iterates $x_{k+1} = T(x_k)$ converge to $x^*$ from any starting point $x_0$.
Hence:
• Given any $v_0$, we can derive the iterates $v_{k+1} = T v_k$, which converge to the unique fixed point $v_*$, since the Bellman optimality operator is a $\gamma$-contraction
• For any policy $\pi$, $v_\pi \le v_*$, with equality if and only if $\pi$ is optimal
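To make the recursion $v_{k+1} = T v_k$ concrete, here is a minimal value-iteration sketch in Python. The 2-state toy MDP, its numbers, and the P[s][a] = [(prob, next_state, reward)] layout are assumptions for illustration only, not the lecture's example.

```python
# Minimal value iteration on a made-up 2-state, 2-action MDP.
# P[s][a] lists (prob, next_state, reward) triples; gamma is the discount factor.
import numpy as np

P = {
    0: {0: [(1.0, 0, 1.0)], 1: [(0.5, 0, 0.0), (0.5, 1, 2.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.5)]},
}
gamma = 0.9

v = np.zeros(2)
for k in range(1000):
    # One application of the Bellman optimality operator T.
    v_new = np.array([
        max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s])
        for s in sorted(P)
    ])
    if np.max(np.abs(v_new - v)) < 1e-8:   # contraction => geometric convergence
        break
    v = v_new

print("v* ≈", v)
```

The stopping test relies exactly on the contraction property: each application of $T$ shrinks the distance to $v_*$ by at least a factor $\gamma$.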
Policy Iteration Algorithm
Convergence of Policy Iteration
• By the policy improvement result, $v_{\pi_{k+1}} \ge v_{\pi_k}$ at every iteration
• $v_{\pi_{k+1}} = v_{\pi_k}$ if and only if $\pi_k$ is optimal
• There are only a finite number of deterministic policies, so policy iteration terminates at an optimal policy after finitely many steps
Demo: Policy Iteration
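The lecture's demo itself is not reproduced here; the following is a minimal policy-iteration sketch using the same assumed toy MDP layout as before (the numbers and the P[s][a] = [(prob, next_state, reward)] format are illustrative only).

```python
# Minimal policy iteration: exact policy evaluation + greedy improvement.
import numpy as np

P = {
    0: {0: [(1.0, 0, 1.0)], 1: [(0.5, 0, 0.0), (0.5, 1, 2.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.5)]},
}
gamma, n_states = 0.9, 2
pi = {s: 0 for s in P}                       # arbitrary initial deterministic policy

while True:
    # Policy evaluation: solve the linear system (I - gamma * P_pi) v = r_pi.
    P_pi, r_pi = np.zeros((n_states, n_states)), np.zeros(n_states)
    for s in P:
        for p, s2, r in P[s][pi[s]]:
            P_pi[s, s2] += p
            r_pi[s] += p * r
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

    # Policy improvement: act greedily with respect to q_pi.
    q = {s: {a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s]}
         for s in P}
    pi_new = {s: max(q[s], key=q[s].get) for s in P}
    if pi_new == pi:                         # no change => current policy is optimal
        break
    pi = pi_new

print("optimal policy:", pi, "values:", v)
```

Because there are only finitely many deterministic policies and each improvement step cannot decrease the values, this loop terminates after finitely many iterations.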
Model-Free Algorithms
Why model-free?
Often, in applications one cannot solve the problems using the
model-based approach.
Possible reasons:
• Computing/estimating $p(s', r \mid s, a)$ is prohibitively expensive
• Reward and state evolutions are from some black-box
system
Examples
Which of the following can be solved in a model-based manner?
• Maze
• Recycling robot
• Tic tac toe
• Super Mario
• Alpha Go
The Basic Set-up in the Model-Free Case
In the model-free setting, we are given an environment simulator
Its function is to output, given the current state $s$ and action $a$, the next state $s'$ and the associated reward $r$.
• If we know $p(s', r \mid s, a)$, expectations such as $\mathbb{E}[r + \gamma v(s') \mid s, a]$ can be computed numerically.
• If we don't know $p(s', r \mid s, a)$, but we can sample from it, we can compute this by Monte-Carlo.
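As a minimal illustration of this set-up, here is a sketch of a black-box step function and a Monte-Carlo estimate of an expectation of the form $\mathbb{E}[r + \gamma v(s') \mid s, a]$. The simulator's internal dynamics and the value estimates are invented for illustration; in practice the simulator is a game engine, a robot, a market, and so on.

```python
# Monte-Carlo estimation of E[r + gamma * v(s')] using only samples
# from a black-box simulator (its internals are hidden from the agent).
import random

def step(s, a):
    """Black box: return (next_state, reward) drawn from p(s', r | s, a)."""
    if a == 0:
        return s, 1.0                       # stay put, small reward
    s2 = random.choice([0, 1])              # risky action: random next state
    return s2, (2.0 if s2 == 1 else 0.0)

gamma = 0.9
v = [0.5, 1.0]                              # some current value estimates (assumed)

n, total = 10_000, 0.0
for _ in range(n):
    s2, r = step(0, 1)                      # sample the transition at (s, a) = (0, 1)
    total += r + gamma * v[s2]
print("MC estimate:", total / n)            # approximates the model-based expectation
```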
Monte-Carlo Policy Evaluation
Recall that, given a policy $\pi$, in the model-based case we evaluate it via the Bellman equation
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_\pi(s')]$.
Without the model, we instead estimate the return $v_\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$ by Monte-Carlo!
• Using the black-box simulator, draw episodes of states and rewards $(s_0, a_0, r_1, s_1, a_1, r_2, \dots)$, where $a_t \sim \pi(\cdot \mid s_t)$.
• Estimate $v_\pi(s)$ via Monte-Carlo by averaging the returns observed after visits to $s$.
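A minimal first-visit Monte-Carlo policy-evaluation sketch is below. The episodic toy environment (a short chain that terminates at state 2) and the uniformly random policy are assumptions for illustration only.

```python
# First-visit Monte-Carlo evaluation of v_pi by averaging sampled returns.
import random
from collections import defaultdict

gamma = 0.9

def run_episode(policy):
    """Simulate one episode; return the list [(s_0, r_1), (s_1, r_2), ...]."""
    s, traj = 0, []
    while s != 2:                                       # state 2 is terminal
        a = policy(s)
        s2 = min(2, s + 1) if (a == 1 or random.random() < 0.3) else s
        r = 1.0 if s2 == 2 else 0.0
        traj.append((s, r))
        s = s2
    return traj

policy = lambda s: random.choice([0, 1])                # the policy pi to evaluate
returns = defaultdict(list)

for _ in range(5000):
    traj = run_episode(policy)
    first_visit = {}
    for t, (s, _) in enumerate(traj):                   # first time each state is visited
        first_visit.setdefault(s, t)
    G = 0.0
    for t in reversed(range(len(traj))):                # accumulate returns backwards
        s, r = traj[t]
        G = r + gamma * G
        if first_visit[s] == t:                         # only the first visit contributes
            returns[s].append(G)

v_pi = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print("v_pi ≈", v_pi)
```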
Monte-Carlo Policy Improvement
Estimate the action value function $q_\pi(s, a)$ via Monte-Carlo, then improve the policy by acting greedily with respect to it, $\pi'(s) = \arg\max_a q_\pi(s, a)$ (a minimal sketch follows the diagram below).
[Diagram: policy evaluation and policy improvement steps alternate, driving the policy/value pair towards the optimal policy/value.]
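To illustrate the improvement step itself, here is a minimal sketch: given Monte-Carlo estimates q[s][a] of the action-value function, the new policy acts greedily, with epsilon-greedy exploration so that all actions keep being sampled. The numerical q values below are made-up placeholders.

```python
# Epsilon-greedy policy improvement from Monte-Carlo action-value estimates.
import random

q = {0: {0: 1.2, 1: 3.4}, 1: {0: 0.7, 1: 2.1}}   # assumed MC estimates of q_pi
epsilon = 0.1

def improved_policy(s):
    if random.random() < epsilon:                 # explore: keep visiting all actions
        return random.choice(list(q[s]))
    return max(q[s], key=q[s].get)                # exploit: greedy w.r.t. q_pi

print([improved_policy(s) for s in sorted(q)])
```

Exploration matters here: without it, the Monte-Carlo estimates for actions the greedy policy never takes would never be updated.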
Generalized Policy Iteration
[Diagram: generalized policy iteration, with evaluation and improvement steps interleaved in any order, converging to the optimal policy/value.]
Temporal Differencing Methods
Temporal differencing (TD) methods exploit such partial, per-step updates in the model-free setting: instead of waiting for a full Monte-Carlo return at the end of an episode, the current estimate is updated towards a bootstrapped target such as $r_{t+1} + \gamma v(s_{t+1})$.
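As a concrete TD example, here is a minimal tabular Q-learning sketch (TD(0) is the analogous prediction-only version). The three-state chain simulator, step size, and exploration rate are assumptions for illustration only.

```python
# Tabular Q-learning: after every single step, move Q(s, a) towards
# r + gamma * max_a' Q(s', a'), instead of waiting for the episode's full return.
import random
from collections import defaultdict

gamma, alpha, epsilon = 0.9, 0.1, 0.1
Q = defaultdict(lambda: defaultdict(float))

def step(s, a):
    """Black-box simulator for a 3-state chain; state 2 is terminal."""
    s2 = min(2, s + 1) if a == 1 else max(0, s - 1)
    return s2, (1.0 if s2 == 2 else 0.0)

for episode in range(2000):
    s = 0
    while s != 2:
        # Epsilon-greedy action selection.
        if random.random() < epsilon or not Q[s]:
            a = random.choice([0, 1])
        else:
            a = max(Q[s], key=Q[s].get)
        s2, r = step(s, a)
        target = r + gamma * (max(Q[s2].values()) if Q[s2] else 0.0)
        Q[s][a] += alpha * (target - Q[s][a])     # the one-step TD update
        s = s2

print({s: dict(qa) for s, qa in Q.items()})
```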
Applications of RL
• Game AI
• Recommender Systems
• Process Optimization
• Autonomous Driving
• Many more!
Limitations of RL
Despite its success, there are some important limitations of the
(model-free) RL methodology!
• Need for efficient simulator
• Need for large exploration
• Need for proper reward definition
Example: The Recycling Robot
Actions
• Search for cans
• Pick up or drop cans
• Stop and wait
• Go back and charge
Rewards:
• +10 for each can picked up
• -1 for each meter moved
• -1000 for running out of battery
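As a small concrete illustration of how such a reward definition could be encoded, here is a sketch that reproduces the numbers listed above; the function signature and the per-step event representation are assumptions, not part of the lecture.

```python
# Reward signal for the recycling robot, following the numbers listed above.
def reward(cans_picked_up, meters_moved, battery_depleted):
    r = 10.0 * cans_picked_up       # +10 for each can picked up
    r -= 1.0 * meters_moved         # -1 for each meter moved
    if battery_depleted:
        r -= 1000.0                 # -1000 for running out of battery
    return r

print(reward(cans_picked_up=2, meters_moved=5, battery_depleted=False))  # 15.0
```

Getting such numbers right is exactly the "proper reward definition" issue listed above: the relative sizes of the penalties determine how cautious the learned behaviour will be.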
Summary
We introduced two classes of algorithms for the solution of MDPs/RL problems:
• Model-based
  • Value iteration
  • Policy iteration
• Model-free
  • Monte-Carlo estimates to replace expectations
  • Monte-Carlo Policy Iteration, TD(0), Q-Learning