
Foundations of Machine Learning

DSA 5102 • Lecture 12

Li Qianxiao
Department of Mathematics
Last time
We introduced
• The basic formulation of reinforcement learning as Markov decision processes (MDPs)
• Bellman's equations characterizing the solution of finite MDPs
Today, we will look at the various solution methods for reinforcement learning that one can derive from these analyses.
Model-Based vs Model-Free
What is a model?
By a model, we mean a precise description of the dynamics in the system.

In other words, we know how the system should behave if we apply an action $a$ at state $s$.

In the language of MDPs, this means that the transition probability $p(s', r \mid s, a)$ is completely known to us.
Model-Based RL Algorithms
Model-based algorithms rely on using knowledge of $p(s', r \mid s, a)$ to solve Bellman's (optimality) equations.

https://bair.berkeley.edu/blog/2019/12/12/mbpo/
Model-Free Algorithms
Model-free algorithms, on the other hand, rely on approximate solutions via sampling/Monte-Carlo methods.

Key idea: trade variance and noise for generality and scalability
Model-Based Algorithms
Review of Bellman's Equations
• Bellman's optimality equation for the value function:
$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\, \big[ r + \gamma v_*(s') \big]$
• Optimal action value:
$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\, \big[ r + \gamma \max_{a'} q_*(s', a') \big]$
• Relationship of $v_*$ and $q_*$:
$v_*(s) = \max_a q_*(s, a)$
• Optimal (deterministic) policy:
$\pi_*(s) \in \arg\max_a q_*(s, a)$


Value Iteration
The first algorithm, known as value iteration, proceeds as follows:
• Solve Bellman's optimality equation for $v_*$
• Read off an optimal policy via $\pi_*(s) \in \arg\max_a q_*(s, a)$

This requires iterating an estimate of the value function, hence its name.
Value Iteration
Observe that we can write Bellman's optimality equation as
$v = T(v),$
where $T : \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{S}|}$ is a non-linear function given by
$[T(v)](s) = \max_a \sum_{s', r} p(s', r \mid s, a)\, \big[ r + \gamma v(s') \big].$
Recall that in matrix notation, we can write
$T(v) = \max_a \big( r_a + \gamma P_a v \big),$
with the maximum taken element-wise, where $P_a$ is the transition matrix and $r_a$ the expected reward vector under action $a$.


Fixed Point Iteration
Since the optimal value function satisfies
$v_* = T(v_*),$
i.e. it is a fixed point of $T$, we can solve for it via fixed point iteration:
$v^{(k+1)} = T(v^{(k)}), \qquad k = 0, 1, 2, \dots$
If this converges to some $\bar v$ as $k \to \infty$, then

• $\bar v$ is a fixed point of $T$, i.e. $\bar v = T(\bar v)$
• By uniqueness of optimal value functions, $\bar v = v_*$

Does $v^{(k)}$ converge?
Contraction Mapping Theorem
Let $(X, \|\cdot\|)$ be a complete normed space and $T : X \to X$ be a contraction, i.e.
$\|T(x) - T(y)\| \le \rho\, \|x - y\| \quad \text{for all } x, y \in X \text{ and some } \rho \in [0, 1).$
Then, there exists a unique $x_* \in X$ such that $T(x_*) = x_*$. Moreover,
$x^{(k)} \to x_*$
for any sequence such that $x^{(k+1)} = T(x^{(k)})$ and $x^{(0)}$ is arbitrary.


Convergence of Value Iteration
The Bellman optimality operator $T$ is a contraction in the sup norm with factor $\gamma < 1$, so by the contraction mapping theorem, value iteration converges to $v_*$ from any initialization (Theorem 4.10 in the notes).
The Value Iteration Algorithm
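The algorithm box did not survive extraction, so here is a minimal tabular sketch of value iteration. The function name and the array layout P[a, s, s'] (transition probabilities) and R[a, s] (expected rewards) are illustrative choices, not taken from the notes.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Tabular value iteration.

    P[a, s, s'] : probability of moving to s' from s under action a
    R[a, s]     : expected immediate reward for taking action a in state s
    Returns an estimate of v_* and a greedy deterministic policy.
    """
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(max_iter):
        # Bellman optimality operator: T(v)(s) = max_a [ R(a,s) + gamma * sum_s' P(a,s,s') v(s') ]
        v_new = (R + gamma * P @ v).max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:   # sup-norm stopping criterion
            v = v_new
            break
        v = v_new
    # Read off a greedy policy from the final value estimate
    policy = (R + gamma * P @ v).argmax(axis=0)
    return v, policy
```

Stopping once the sup-norm change drops below a tolerance matches the contraction argument above: each sweep applies $T$ once.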
Rate of Convergence of Value Iteration
How fast does value iteration converge?
That is, given $\epsilon > 0$, what's the minimal $k$ such that $\|v^{(k)} - v_*\| \le \epsilon$?

Note that we have
$\|v^{(k+1)} - v_*\| = \|T(v^{(k)}) - T(v_*)\| \le \gamma\, \|v^{(k)} - v_*\|.$
Continuing this process, we have
$\|v^{(k)} - v_*\| \le \gamma^k\, \|v^{(0)} - v_*\|.$
Hence, to achieve an error of $\epsilon$ we need
$k \ge \frac{\log\big(\|v^{(0)} - v_*\| / \epsilon\big)}{\log(1/\gamma)} = O\big(\log(1/\epsilon)\big).$
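For concreteness, a tiny numeric check of this bound (the values of gamma and eps, and the assumption that the initial error is 1, are illustrative, not from the slides):

```python
import math

gamma, eps = 0.9, 1e-3      # illustrative discount factor and target accuracy
# assuming ||v0 - v*|| = 1, the bound gives the number of iterations needed
k = math.ceil(math.log(1 / eps) / math.log(1 / gamma))
print(k)                    # 66
```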


Policy Iteration
Value iteration
• Converges exponentially
• But still “theoretically” needs an infinite number of steps
to converge
However, our MDP is finite!

Do we have a “finite” algorithm?


Recall: Policy Improvement Theorem
For any two (deterministic) policies $\pi, \pi'$, if
$q_\pi(s, \pi'(s)) \ge v_\pi(s) \quad \text{for all } s,$
then we must have
$v_{\pi'}(s) \ge v_\pi(s) \quad \text{for all } s.$
In addition, if the first inequality is strict for some $s$, then the second inequality is strict for at least one $s$.

Hence:
• Given any $\pi$, we can derive an improved policy $\pi'(s) \in \arg\max_a q_\pi(s, a)$
• $v_{\pi'} \ge v_\pi$, with equality if and only if $\pi$ is optimal
Policy Iteration Algorithm
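As with value iteration, the algorithm box is missing from the extracted text; below is a minimal sketch using the same illustrative P[a, s, s'] / R[a, s] layout as before, with exact policy evaluation via a linear solve.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration (same illustrative P[a, s, s'] / R[a, s] layout as above)."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)          # start from an arbitrary deterministic policy
    while True:
        # Policy evaluation: solve v = R_pi + gamma * P_pi v exactly as a linear system
        P_pi = P[policy, np.arange(n_states)]       # (n_states, n_states)
        R_pi = R[policy, np.arange(n_states)]       # (n_states,)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to q_pi
        q = R + gamma * P @ v                       # (n_actions, n_states)
        new_policy = q.argmax(axis=0)
        if np.array_equal(new_policy, policy):      # no change: the policy is optimal
            return v, policy
        policy = new_policy
```

Each sweep performs one exact evaluation and one greedy improvement; by the policy improvement theorem above, the loop terminates after finitely many sweeps.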
Convergence of Policy Iteration
• By the policy improvement result, $v_{\pi^{(k+1)}} \ge v_{\pi^{(k)}}$ at every iteration
• $v_{\pi^{(k+1)}} = v_{\pi^{(k)}}$ if and only if $\pi^{(k)}$ is optimal
• There are only a finite number of deterministic policies, so policy iteration terminates after finitely many iterations
Demo: Policy Iteration
Model-Free Algorithms
Why model-free?
Often, in applications, one cannot solve the problem using the model-based approach.

Possible reasons:
• Computing/estimating $p(s', r \mid s, a)$ is prohibitively expensive
• Rewards and state evolution come from some black-box system
Examples
Which of the following can be solved in a model-based manner?
• Maze
• Recycling robot
• Tic tac toe
• Super Mario
• Alpha Go
The Basic Set-up in the Model-Free Case
In the model-free setting, we are given an environment simulator.

Its function is to output, given the current state $s$ and action $a$, the next state $s'$ and the associated reward $r$.

In other words, the environment simulator performs the sampling
$(s', r) \sim p(\cdot, \cdot \mid s, a).$
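A minimal sketch of what such a black-box simulator interface might look like in code; the class and method names are illustrative assumptions, and the tabular P and R are hidden inside the simulator rather than exposed to the agent.

```python
import numpy as np

class EnvironmentSimulator:
    """A minimal tabular black-box simulator (illustrative names, not from the lecture).

    The agent never reads P and R directly; it only calls step() and observes
    the sampled next state and the reward.
    """

    def __init__(self, P, R, rng=None):
        self.P, self.R = P, R                       # P[a, s, s'], R[a, s] as before
        self.rng = rng or np.random.default_rng()

    def reset(self):
        # Assumption: episodes start in a uniformly random state
        return int(self.rng.integers(self.P.shape[1]))

    def step(self, s, a):
        """Sample s' ~ p(. | s, a); return it with the reward (expected reward, for simplicity)."""
        s_next = self.rng.choice(self.P.shape[-1], p=self.P[a, s])
        return int(s_next), float(self.R[a, s])
```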


Monte-Carlo Methods
Consider the problem of computing
$\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx.$
• If we know $p$, we can compute the integral numerically.
• If we don't know $p$, but we can sample from it, we can compute this by Monte-Carlo:
$\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i), \qquad x_i \sim p.$
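A quick illustration of the Monte-Carlo idea on a toy integral; the choice of a standard normal $p$ and $f(x) = x^2$ is just an example, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: estimate E[f(X)] for X ~ N(0, 1) and f(x) = x^2 (true value: 1)
samples = rng.standard_normal(100_000)
estimate = np.mean(samples ** 2)
print(estimate)      # close to 1, with O(1/sqrt(N)) Monte-Carlo error
```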
Monte-Carlo Policy Evaluation
Recall that given $\pi$, in the model-based case we evaluate it via
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, \big[ r + \gamma v_\pi(s') \big],$
where the computation of the right-hand side depends on $p(s', r \mid s, a)$.

In the model-free case, we turn to the original definition of the value function:
$v_\pi(s) = \mathbb{E}_\pi\big[ G_t \mid s_t = s \big], \qquad G_t = \sum_{k \ge 0} \gamma^k r_{t+k+1}.$
Monte-Carlo Policy Evaluation
We approximate the expectation
$v_\pi(s) = \mathbb{E}_\pi\big[ G_t \mid s_t = s \big]$
by Monte-Carlo!
• Using the black-box simulator, draw episodes of states and rewards
$s_0, a_0, r_1, s_1, a_1, r_2, \dots,$
where $a_t \sim \pi(\cdot \mid s_t)$ and $(s_{t+1}, r_{t+1}) \sim p(\cdot, \cdot \mid s_t, a_t)$.
• Estimate $v_\pi(s)$ via Monte-Carlo, i.e. by averaging the returns observed after visits to $s$.
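A first-visit Monte-Carlo evaluation sketch, assuming the illustrative simulator interface from before (env.reset(), env.step(s, a), and a callable policy(s)); truncating episodes at a fixed horizon is an added simplification for the discounted, continuing setting.

```python
import numpy as np
from collections import defaultdict

def mc_policy_evaluation(env, policy, n_episodes, horizon, gamma=0.9):
    """First-visit Monte-Carlo estimate of v_pi."""
    returns = defaultdict(list)
    for _ in range(n_episodes):
        # Roll out one episode under the policy
        s = env.reset()
        trajectory = []
        for _ in range(horizon):
            a = policy(s)
            s_next, r = env.step(s, a)
            trajectory.append((s, r))
            s = s_next
        # Accumulate discounted returns backwards; keep the first-visit return per state
        G, first_visit = 0.0, {}
        for s_t, r_t in reversed(trajectory):
            G = r_t + gamma * G
            first_visit[s_t] = G            # overwritten until only the earliest visit remains
        for s_t, G_t in first_visit.items():
            returns[s_t].append(G_t)
    # Monte-Carlo estimate: average the observed returns for each visited state
    return {s: float(np.mean(g)) for s, g in returns.items()}
```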
Monte-Carlo Policy Improvement
Estimate the action value function via Monte-Carlo:
$q_\pi(s, a) = \mathbb{E}_\pi\big[ G_t \mid s_t = s, a_t = a \big] \approx \text{average of sampled returns starting from } (s, a).$
Improve the policy via
$\pi'(s) \in \arg\max_a q_\pi(s, a).$
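The improvement step itself is just the argmax above; a small sketch follows, where q_est and the optional epsilon-greedy randomization are illustrative additions (the slide's formula is the pure argmax, and some exploration is a common device for keeping all state-action pairs visited).

```python
import numpy as np

def improved_policy(q_est, n_actions, epsilon=0.0, rng=None):
    """Return a policy acting greedily with respect to a Monte-Carlo estimate q_est.

    q_est[(s, a)] holds the estimated q_pi(s, a); epsilon > 0 adds optional random
    exploration on top of the greedy rule.
    """
    rng = rng or np.random.default_rng()

    def policy(s):
        if epsilon > 0 and rng.random() < epsilon:        # occasional exploration
            return int(rng.integers(n_actions))
        values = [q_est.get((s, a), 0.0) for a in range(n_actions)]
        return int(np.argmax(values))                     # pi'(s) = argmax_a q_pi(s, a)

    return policy
```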


Model-Free Policy Iteration
Policy Iteration
[Diagram: alternating Policy Evaluation (policy → value) and Policy Improvement (value → policy), converging to the optimal policy/value]
Generalized Policy Iteration
[Diagram: policy and value updated in alternation, allowing partial evaluation/improvement steps, converging to the optimal policy/value]
Temporal Differencing Methods
Temporal differencing (TD) methods exploit such partial updates in the model-free setting.

Given policy $\pi$, suppose we are at state $s_t$:

• Sample action $a_t \sim \pi(\cdot \mid s_t)$
• Sample $(s_{t+1}, r_{t+1})$ using the environment simulator
• Update
$v(s_t) \leftarrow v(s_t) + \alpha\, \big[ r_{t+1} + \gamma v(s_{t+1}) - v(s_t) \big]$

Suppose such iteration converges; then, taking expectation over the environment,
$v(s) = \mathbb{E}_\pi\big[ r_{t+1} + \gamma v(s_{t+1}) \mid s_t = s \big],$
which is Bellman's equation for $v_\pi$.
Why is it called temporal differencing?
We can rewrite the update as
$v(s_t) \leftarrow (1 - \alpha)\, v(s_t) + \alpha\, \big[ r_{t+1} + \gamma v(s_{t+1}) \big].$
• $r_{t+1} + \gamma v(s_{t+1})$ is the updated value function prediction given the new sample drawn from the environment
• The update compares the temporal difference
$\delta_t = r_{t+1} + \gamma v(s_{t+1}) - v(s_t)$
• If the new prediction "over-values" the current estimate $v(s_t)$ (i.e. $\delta_t > 0$), then increase it
• If it "under-values" $v(s_t)$ (i.e. $\delta_t < 0$), then decrease it
TD(0) Algorithm for Policy Evaluation
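A sketch of TD(0) policy evaluation under the same illustrative env.reset() / env.step(s, a) / policy(s) interface as above; alpha is the step size from the update rule.

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, n_episodes, horizon, alpha=0.1, gamma=0.9):
    """TD(0) estimate of v_pi, updating after every sampled transition."""
    v = defaultdict(float)                  # value estimates, initialized to 0
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(horizon):
            a = policy(s)                   # sample a_t ~ pi(. | s_t)
            s_next, r = env.step(s, a)      # sample (s_{t+1}, r_{t+1}) from the simulator
            # TD(0) update: move v(s) towards the bootstrapped target r + gamma * v(s')
            v[s] += alpha * (r + gamma * v[s_next] - v[s])
            s = s_next
    return dict(v)
```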
Q-Learning Algorithm
The well-known Q-learning algorithm takes this a step further and tries to also find the optimal action value function, which satisfies
$q_*(s, a) = \mathbb{E}\big[ r_{t+1} + \gamma \max_{a'} q_*(s_{t+1}, a') \mid s_t = s, a_t = a \big].$
Similarly, we replace the expectation by samples for the update:
$q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha\, \big[ r_{t+1} + \gamma \max_{a'} q(s_{t+1}, a') - q(s_t, a_t) \big].$


The Q-Learning Algorithm
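A tabular Q-learning sketch; the epsilon-greedy behaviour policy used to generate the samples is a common choice assumed here, not something prescribed by the update rule itself.

```python
import numpy as np
from collections import defaultdict

def q_learning(env, n_actions, n_episodes, horizon,
               alpha=0.1, gamma=0.9, epsilon=0.1, rng=None):
    """Tabular Q-learning with an epsilon-greedy behaviour policy (illustrative interface)."""
    rng = rng or np.random.default_rng()
    q = defaultdict(lambda: np.zeros(n_actions))    # q[s][a], initialized to 0
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(horizon):
            # Behaviour policy: explore with probability epsilon, otherwise act greedily
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(q[s]))
            s_next, r = env.step(s, a)
            # Off-policy update towards the sampled target r + gamma * max_a' q(s', a')
            q[s][a] += alpha * (r + gamma * np.max(q[s_next]) - q[s][a])
            s = s_next
    return {s: qs.copy() for s, qs in q.items()}
```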
Demo: Model-Free Algorithms
More Applications of RL

• Game AI
• Recommender Systems
• Process Optimization
• Autonomous Driving
• Many more!
Limitations of RL
Despite its success, there are some important limitations of the
(model-free) RL methodology!
• Need for an efficient simulator
• Need for extensive exploration
• Need for a proper reward definition
Example: The Recycling Robot
Actions
• Search for cans
• Pick up or drop cans
• Stop and wait
• Go back and charge

Rewards:
• +10 for each can picked up
• -1 for each meter moved
• -1000 for running out of battery
Summary
We introduced two classes of algorithms for the solution of MDPs/RL problems:
• Model-based
• Value iteration
• Policy iteration
• Model-free
• Monte-Carlo estimates to replace expectations
• Monte-Carlo Policy Iteration, TD(0), Q-Learning
