Lecture: starts Tuesday 9:35am
Course website: introml.odl.mit.edu
Who's talking? Prof. Iddo Drori
Questions? Piazza
Materials: Will be available on course website
Today’s Plan: State Machines and
Markov Decision Processes (MDPs)
• State machine
• Observation vs. State
• Markov decision process (MDP)
• Policy: value, optimal
• Finite horizon value iteration, optimal policy
• Return
• Bellman equations: expectation, optimality
• Infinite horizon value iteration
State Machine
• S = set of possible states = {standing, moving}
[Diagram: states standing and moving]
State Machine
• S = set of possible states = {standing, moving}
• s0 ∈ S = initial state = standing
State Machine
• S = set of possible states = {standing, moving}
• X = set of possible inputs = {slow, fast}
• s0 ∈ S = initial state = standing
State Machine
• S = set of possible states = {standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S x X ➞ S transition function
• s0 ∈ S = initial state = standing
State Machine
• S = set of possible states = {standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S x X ➞ S transition function
• s0 ∈ S = initial state = standing
• s1 = f(s0, fast) = ?
State Machine
• S = set of possible states = {standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S x X ➞ S transition function
• s0 ∈ S = initial state = standing
• s1 = f(s0, fast) = moving
State Machine
• S = set of possible states = {standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S x X ➞ S transition function
• s0 ∈ S = initial state = standing
• s1 = f(s0, fast) = moving
• We may not observe the states {standing, moving} directly; instead we see, for example, a sensor measurement
Example: Observation vs. State
[Diagram: observation O vs. state S]
State Machine
• S = set of possible states = {fallen, standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S x X ➞ S transition function
• s0 ∈ S = initial state = standing
• s1 = f(s0, fast) = moving
• We do not observe the states directly
• Y = set of possible outputs
• g: S ➞ Y output function
• y1 = g(s1) = moving
State Machine
• S = set of possible states = {fallen, standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S x X ➞ S transition function
• Y = set of possible outputs
• g: S ➞ Y output function
• s0 ∈ S = initial state = standing
• Iteratively compute for t ≥ 1 (see the code sketch below):
  s_t = f(s_{t−1}, x_t),  y_t = g(s_t)
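To make the iteration concrete, here is a minimal Python sketch of this state machine. The transition behavior assumed here (fast always leads to moving, slow always to standing) and the helper name transduce are illustrative assumptions; only s1 = f(standing, fast) = moving is stated on the slides.

```python
# Minimal state machine: S = {standing, moving}, inputs X = {slow, fast},
# transition f: S x X -> S, output g: S -> Y.

def f(s, x):
    """Assumed transition: fast -> moving, slow -> standing."""
    return "moving" if x == "fast" else "standing"

def g(s):
    """Output function; here the identity (we observe the state)."""
    return s

def transduce(inputs, s0="standing"):
    """Iteratively compute s_t = f(s_{t-1}, x_t), y_t = g(s_t) for t >= 1."""
    s, outputs = s0, []
    for x in inputs:
        s = f(s, x)
        outputs.append(g(s))
    return outputs

print(transduce(["fast", "fast", "slow"]))  # ['moving', 'moving', 'standing']
```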
State Machine Example
• Reads binary string
• The string has an even number of zeros iff the machine ends at state S1 (see the code sketch below)
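A sketch of this example, assuming S1 is the start state and means "even number of zeros seen so far" (the slide only states the invariant; the state roles here are an assumption).

```python
# Two-state machine reading a binary string.
# Invariant (from the slide): the string has an even number of zeros
# iff the machine ends at state S1. Assumed roles: S1 = "even number of
# zeros so far" (start state), S0 = "odd number of zeros so far".

def step(state, bit):
    if bit == "0":                      # a zero flips the parity
        return "S0" if state == "S1" else "S1"
    return state                        # a one leaves the parity unchanged

def run(string, start="S1"):
    state = start
    for bit in string:
        state = step(state, bit)
    return state

print(run("1001"))  # two zeros (even) -> S1
print(run("011"))   # one zero (odd)   -> S0
```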
State and Reward
Markov Model
Markov Process
• set of possible states S = {fallen, standing, moving}
• set of possible actions A = {slow, fast}
• transition model T: S x A x S ➞ ℝ
• transition function is stochastic: defines a probability distribution over the next state given the previous state and action
• output is the state (g is the identity)
[Diagram: three-state transition diagram with a probability on each (state, action) edge]
Transition P(s,a,s’)
[Table: transition probabilities P(s, a, s’) for s, s’ ∈ {fallen, standing, moving} and a ∈ {slow, fast}, read off the diagram]
Markov Decision Process (MDP)
• set of possible states S = {fallen, standing, moving}
• set of possible actions A = {slow, fast}
• transition model T: S x A x S ➞ ℝ
• reward function R: S x A ➞ ℝ, reward based on state & action:
  R(fallen, slow) = 1
  R(fallen, fast) = 0
  R(standing, slow) = 1
  R(standing, fast) = 2
  R(moving, slow) = 1
  R(moving, fast) = -1
Markov Decision Process (MDP)
• S = set of possible states {fallen, standing, moving}
• A = set of possible actions {slow, fast}
• T: S x A x S ➞ ℝ transition model
• R: S x A ➞ ℝ reward function
• γ = discount factor
Markov Decision Process (MDP)
• set of possible states S = {fallen, standing, moving}
• set of possible actions A = {slow, fast}
• transition model T: S x A x S ➞ ℝ
• rewards may be probabilistic, specified together with the transition function
• γ = discount factor
[Diagram: transition diagram with a probability and a reward on each edge]
Reward R(s,a)
s ∈ {standing, moving, fallen}, a ∈ {slow, fast}

R(s, a)      slow   fast
fallen        1      0
standing      1      2
moving        1     -1

• Horizon 1: the optimal myopic policy takes, in each state, the action with the largest immediate reward
Markov Decision Process (MDP)
• Defined by the tuple (S, A, T, R, γ) (spelled out in the code sketch below):
  state space S, action space A, transition function T, reward function R, discount factor γ
• At every time step t the agent finds itself in state s ∈ S and selects action a ∈ A
• The agent transitions to the next state s’ and selects a new action
• In contrast, in reinforcement learning the agent does not know T and R, and learns by sampling the environment
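A sketch of how the tuple (S, A, T, R, γ) for the robot example might be written down in Python. The reward values come from the earlier slide; the transition probabilities and the value of γ below are illustrative placeholders, not the values in the lecture's diagram.

```python
# MDP tuple (S, A, T, R, gamma) for the robot example.
# Rewards are from the slides; transition probabilities and gamma are
# illustrative placeholders (the real ones are in the lecture diagram).

S = ["fallen", "standing", "moving"]
A = ["slow", "fast"]
gamma = 0.9  # discount factor (illustrative value)

# R[s][a]: reward for taking action a in state s (values from the slides)
R = {
    "fallen":   {"slow": 1, "fast": 0},
    "standing": {"slow": 1, "fast": 2},
    "moving":   {"slow": 1, "fast": -1},
}

# T[s][a][s']: probability of moving to s' from s under a (placeholders)
T = {
    "fallen":   {"slow": {"fallen": 0.6, "standing": 0.4, "moving": 0.0},
                 "fast": {"fallen": 1.0, "standing": 0.0, "moving": 0.0}},
    "standing": {"slow": {"fallen": 0.0, "standing": 0.0, "moving": 1.0},
                 "fast": {"fallen": 0.4, "standing": 0.0, "moving": 0.6}},
    "moving":   {"slow": {"fallen": 0.0, "standing": 0.8, "moving": 0.2},
                 "fast": {"fallen": 0.2, "standing": 0.0, "moving": 0.8}},
}

# Sanity check: each T[s][a] is a probability distribution over next states
for s in S:
    for a in A:
        assert abs(sum(T[s][a].values()) - 1.0) < 1e-9
```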
Policy
• π: S ➞ A
• Rule book: what action to take in each state?
• In state s ∈ {fallen, standing, moving}, take action a ∈ {slow, fast}
Policy
Which policy is best?
• π: S ➞ A
• Example robot policies:
  – πA: always slow
  – πB: always fast
  – πC: if fallen slow, else fast
  – πD: if moving fast, else slow
Stochastic Policy
• π: S ➞ A
• Policy may be stochastic: randomness in agent actions
• For example (sampled in the code sketch below)
  – πE: from all states, with probability 0.3 do slow, with probability 0.7 do fast
• This is in addition to the transitions being stochastic
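A small sketch of the stochastic policy πE: in every state, slow with probability 0.3 and fast with probability 0.7 (the function name is illustrative).

```python
import random

def pi_E(state):
    """Stochastic policy: in every state, slow w.p. 0.3, fast w.p. 0.7."""
    return "slow" if random.random() < 0.3 else "fast"

# Sample a few actions from state "standing"
actions = [pi_E("standing") for _ in range(10)]
print(actions)
```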
State-Action Diagram
• In state s
• Take action a given by policy π: S ➞ A
• Transition to state s’
• Repeat
• Tree of (s, a, r, s’)
[Diagram: s → take action → (s, a) → where the transition function will take us → s’]
What is the Value of a Policy?
• π: S ➞ A
• What is the value of a policy?
• Depends on the number of steps
• Renting the robot for h time steps, after which it will be destroyed
What is the Value of a Policy?
• π: S ➞ A
• h: horizon, number of time steps left
• V^π_h(s): value (expected reward) with policy π starting at s
• Example robot policies
  – πA: always slow
  – πB: always fast
  – πC: if fallen slow, else fast
  – πD: if moving fast, else slow
What is the Value of a Policy?
• π: S ➞ A
• h: horizon, number of time steps left
• V^π_h(s): value (expected reward) with policy π starting at s
• By induction on the number of steps left to go, h
• Base case: no steps remaining, so no matter what state we’re in the value is
  V^π_0(s) = 0
What is the Value of a Policy?
• π: S ➞ A
• h: horizon, number of time steps left
• V^π_h(s): value (expected reward) with policy π starting at s
• Value of the policy at state s at horizon h is the reward in s plus the next state’s expected horizon h−1 value
• For h = 1:
  V^π_1(s) = R(s, π(s))
What is the Value of a Policy?
• π: S ➞ A
• h: horizon, number of time steps left
• V^π_h(s): value (expected reward) with policy π starting at s
• Value of the policy at state s at horizon h is the reward in s plus the next state’s expected horizon h−1 value
• For h = 2:
  V^π_2(s) = R(s, π(s)) + Σ_{s’} T(s, π(s), s’) V^π_1(s’)
What is the Value of a Policy?
• π: S ➞ A
• h: horizon, number of time steps left
• V^π_h(s): value (expected reward) with policy π starting at s
• Value of the policy at state s at horizon h is the reward in s plus the next state’s expected horizon h−1 value
• For any h (see the code sketch below):
  V^π_h(s) = R(s, π(s)) + Σ_{s’} T(s, π(s), s’) V^π_{h−1}(s’)
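A sketch of this recursion: finite horizon policy evaluation by induction on h, with V^π_0 = 0. The dictionary encoding and the tiny two-state MDP are illustrative, not the lecture's numbers.

```python
# Finite-horizon policy evaluation:
#   V_0(s) = 0
#   V_h(s) = R(s, pi(s)) + sum_{s'} T(s, pi(s), s') * V_{h-1}(s')

def evaluate_policy(S, T, R, pi, horizon):
    V = {s: 0.0 for s in S}                      # base case: V_0 = 0
    for _ in range(horizon):
        V = {s: R[s][pi[s]] +
                sum(p * V[s2] for s2, p in T[s][pi[s]].items())
             for s in S}
    return V

# Tiny illustrative 2-state MDP (placeholder numbers)
S = ["a", "b"]
T = {"a": {"go": {"a": 0.5, "b": 0.5}}, "b": {"go": {"a": 1.0}}}
R = {"a": {"go": 1.0}, "b": {"go": 0.0}}
pi = {"a": "go", "b": "go"}
print(evaluate_policy(S, T, R, pi, horizon=3))
```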
Updating State Value Function
V_h(s) = max_a [ R(s, a) + Σ_{s’} T(s, a, s’) V_{h−1}(s’) ]
Finite Horizon Value Iteration Algorithm
• Compute V_h: start from horizon 0 and store the values
• Use V_{h−1} to compute V_h (see the code sketch below)
• For n = |S|, m = |A|, horizon h, computation time O(nmh)
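A sketch of the algorithm under the same dictionary encoding: start from horizon 0, store the values, and use V_{h−1} to compute V_h (here via Q_h and a max over actions). The example MDP is an illustrative placeholder.

```python
# Finite horizon value iteration:
#   V_0(s) = 0
#   Q_h(s, a) = R(s, a) + sum_{s'} T(s, a, s') * V_{h-1}(s')
#   V_h(s)    = max_a Q_h(s, a)

def finite_horizon_value_iteration(S, A, T, R, horizon):
    V = {s: 0.0 for s in S}          # V_0
    Q = {}
    for _ in range(horizon):
        Q = {s: {a: R[s][a] + sum(p * V[s2] for s2, p in T[s][a].items())
                 for a in A}
             for s in S}
        V = {s: max(Q[s].values()) for s in S}
    return V, Q

# Tiny illustrative 2-state MDP (placeholder numbers)
S, A = ["a", "b"], ["stay", "go"]
T = {"a": {"stay": {"a": 1.0}, "go": {"b": 1.0}},
     "b": {"stay": {"b": 1.0}, "go": {"a": 1.0}}}
R = {"a": {"stay": 0.0, "go": 1.0}, "b": {"stay": 2.0, "go": 0.0}}
V, Q = finite_horizon_value_iteration(S, A, T, R, horizon=3)
print(V)
```

Each of the h passes updates every (state, action) pair once, matching the loop structure described on the slide.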
Updating Action Value Function
Q_h(s, a) = R(s, a) + Σ_{s’} T(s, a, s’) max_{a’} Q_{h−1}(s’, a’)
Finite Horizon Optimal Policy
• Given Q_h, the optimal finite horizon policy is (see the code sketch below):
  π*_h(s) = arg max_a Q_h(s, a)
• May be multiple optimal policies
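Given a Q_h table, the optimal finite horizon action is an argmax over actions; ties are one reason multiple optimal policies can exist. A one-line sketch with an illustrative, hand-written table:

```python
def optimal_action(Q_h, s):
    """pi*_h(s) = argmax_a Q_h(s, a); ties mean multiple optimal policies."""
    return max(Q_h[s], key=Q_h[s].get)

# Example with an illustrative Q_h table
Q_h = {"standing": {"slow": 1.0, "fast": 2.0}}
print(optimal_action(Q_h, "standing"))  # fast
```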
Return and Discount Factor
• Return: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^∞ γ^k R_{t+k+1}
• Discount factor γ ∈ (0, 1)
• If γ = 0 the agent is myopic, maximizing only immediate rewards
• As γ → 1 the agent becomes farsighted
State Value Function
• V^π(s) = E_π[ G_t | S_t = s ]: expected return when starting in state s and following policy π
Action-Value Function
• Q^π(s, a) = E_π[ G_t | S_t = s, A_t = a ]: expected return when starting in state s, taking action a, and thereafter following policy π
Value Functions
• For all states s:
Returns at Successive Time Steps
• Recursive relationship (see the code sketch below):
  G_t = R_{t+1} + γ G_{t+1}
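A small numeric sketch of the discounted return and its recursive relationship, using an illustrative reward sequence and γ = 0.9:

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1}, computed backwards
# using the recursion G_t = R_{t+1} + gamma * G_{t+1}.

gamma = 0.9
rewards = [1, 2, -1, 1]          # illustrative reward sequence R_1..R_4

G = 0.0
returns = []
for r in reversed(rewards):      # work backwards from the last reward
    G = r + gamma * G
    returns.append(G)
returns.reverse()                # returns[t] == G_t

direct = sum(gamma**k * r for k, r in enumerate(rewards))
print(returns[0], direct)        # the two computations of G_0 agree
```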
4 Bellman Equations
• Expectation for state value (linear):
  V^π(s) = R(s, π(s)) + γ Σ_{s’} T(s, π(s), s’) V^π(s’)
• Expectation for action value (linear):
  Q^π(s, a) = R(s, a) + γ Σ_{s’} T(s, a, s’) Q^π(s’, π(s’))
• Optimality for state value function (non-linear):
  V*(s) = max_a [ R(s, a) + γ Σ_{s’} T(s, a, s’) V*(s’) ]
• Optimality for action value function (non-linear):
  Q*(s, a) = R(s, a) + γ Σ_{s’} T(s, a, s’) max_{a’} Q*(s’, a’)
Infinite Horizon
• We don’t know when the game will be over
• Problem: the total reward may be infinite, so we cannot select one policy over another
• Solution: find the policy that maximizes the infinite horizon discounted value
  V^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t, π(s_t)) | s_0 = s ]
• t denotes the number of steps from the starting state
Policy Evaluation
• Expected infinite horizon value of state s under policy π:
  V^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t, π(s_t)) | s_0 = s ]
• t denotes the number of steps from the starting state
• n = |S| linear equations:
  V^π(s) = R(s, π(s)) + γ Σ_{s’} T(s, π(s), s’) V^π(s’) for each s ∈ S
• Solve as a system of linear equations (see the code sketch below)
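A sketch of policy evaluation as a linear system, V^π = R^π + γ T^π V^π, solved with numpy; the dictionary encoding and the tiny MDP below are illustrative placeholders.

```python
import numpy as np

# Solve V = R_pi + gamma * T_pi V  <=>  (I - gamma * T_pi) V = R_pi

def policy_evaluation(S, T, R, pi, gamma):
    n = len(S)
    idx = {s: i for i, s in enumerate(S)}
    T_pi = np.zeros((n, n))
    R_pi = np.zeros(n)
    for s in S:
        a = pi[s]
        R_pi[idx[s]] = R[s][a]
        for s2, p in T[s][a].items():
            T_pi[idx[s], idx[s2]] = p
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    return dict(zip(S, V))

# Tiny illustrative MDP and policy (placeholder numbers)
S = ["a", "b"]
T = {"a": {"go": {"a": 0.5, "b": 0.5}}, "b": {"go": {"a": 1.0}}}
R = {"a": {"go": 1.0}, "b": {"go": 0.0}}
pi = {"a": "go", "b": "go"}
print(policy_evaluation(S, T, R, pi, gamma=0.9))
```

For γ < 1 the matrix I − γ T^π is invertible, so the system has a unique solution, matching the uniqueness statement on the later slide.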
State-Value Function for Policy
• Expected return starting in s and following policy π satisfies a recursive relationship: the Bellman equation for V^π
• Relationship between the value of a state and the values of its successor states
Bellman Equation for State-Value Function
• V^π is the unique solution to its Bellman equation
• Linear equation; in vector notation: V^π = R^π + γ T^π V^π
• Value of the start state s is the expected reward plus the discounted expected value of the next state
• The Bellman equation averages over all possibilities, weighting each by its probability of occurring
[Diagram: backup diagram s → take action → (s, a) → where the transition function will take us → s’]
Bellman Equation for Action-Value Function
• Linear equation:
  Q^π(s, a) = R(s, a) + γ Σ_{s’} T(s, a, s’) Q^π(s’, π(s’))
  (expectation over where the transition function will take us, and over what action we will take next)
Finding Optimal Policy
• In the infinite horizon case there exists a stationary optimal policy π* (at least one) such that for all s ∈ S and all other policies π:
  V^{π*}(s) ≥ V^π(s)
• Stationary: does not change over time
Infinite Horizon Value Iteration
• Q*(s, a): expected infinite horizon discounted value of being in state s, taking action a, then executing the optimal policy π*
• n = |S|, m = |A|: nm non-linear equations with a unique solution
Bellman Optimality Equation for Q*
• Non-linear equation:
  Q*(s, a) = R(s, a) + γ Σ_{s’} T(s, a, s’) max_{a’} Q*(s’, a’)
  (expectation over where the transition function will take us; maximize over the actions we can take)
• Once we have Q* we can act optimally
Finding Optimal Policy
• If we know the optimal action-value function then we can derive an optimal policy:
  π*(s) = arg max_a Q*(s, a)
• The optimal policy may not be unique
Infinite Horizon Value Iteration Algorithm
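The algorithm itself did not survive extraction, so here is a minimal sketch of one standard form of infinite horizon value iteration: iterate the Bellman optimality update on Q until it stops changing, then act greedily. The encoding, tolerance, and example MDP are illustrative assumptions.

```python
# Infinite horizon value iteration on Q:
#   Q(s, a) <- R(s, a) + gamma * sum_{s'} T(s, a, s') * max_{a'} Q(s', a')
# repeated until the update changes Q by less than a tolerance.

def value_iteration(S, A, T, R, gamma, tol=1e-8):
    Q = {s: {a: 0.0 for a in A} for s in S}
    while True:
        Q_new = {s: {a: R[s][a] + gamma * sum(p * max(Q[s2].values())
                                              for s2, p in T[s][a].items())
                     for a in A}
                 for s in S}
        delta = max(abs(Q_new[s][a] - Q[s][a]) for s in S for a in A)
        Q = Q_new
        if delta < tol:
            return Q

def greedy_policy(Q):
    """pi*(s) = argmax_a Q*(s, a): act optimally once we have Q*."""
    return {s: max(Q[s], key=Q[s].get) for s in Q}

# Tiny illustrative MDP (placeholder numbers)
S, A = ["a", "b"], ["stay", "go"]
T = {"a": {"stay": {"a": 1.0}, "go": {"b": 1.0}},
     "b": {"stay": {"b": 1.0}, "go": {"a": 1.0}}}
R = {"a": {"stay": 0.0, "go": 1.0}, "b": {"stay": 2.0, "go": 0.0}}
Q_star = value_iteration(S, A, T, R, gamma=0.9)
print(greedy_policy(Q_star))  # {'a': 'go', 'b': 'stay'}
```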