Lecture: starts Tuesday 9:35am
Course website: introml.odl.mit.edu
Who's talking? Prof. Iddo Drori
Questions? Piazza
Materials: Will be available on course website
Today’s Plan: State Machines and
Markov Decision Processes (MDPs)
• State machine
• Observation vs. State
• Markov decision process (MDP)
• Policy: value, optimal
• Finite horizon value iteration, optimal policy
• Return
• Bellman equations: expectation, optimality
• Infinite horizon value iteration
State Machine
• S = set of possible states = {standing, moving}
[Diagram: states standing and moving]
State Machine
• S = set of possible states = {standing, moving}
• s0 ∈ S = initial state = standing
State Machine
• S = set of possible states = {standing, moving}
• X = set of possible inputs = {slow, fast}
• s0 ∈ S = initial state = standing
State Machine
• S = set of possible states = {standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S x X ➞ S transition function
• s0 ∈ S = initial state = standing
State Machine
• S = set of possible states = {standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S x X ➞ S transition function
• s0 ∈ S = initial state = standing
• s1 = f(s0, fast) = ?
State Machine
• S = set of possible states = {standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S x X ➞ S transition function
• s0 ∈ S = initial state = standing
• s1 = f(s0, fast) = moving
State Machine
• S = set of possible states = {standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S x X ➞ S transition function
• s0 ∈ S = initial state = standing
• s1 = f(s0, fast) = moving
• We may not observe the states {standing, moving} directly; instead we see, for example, a sensor measurement
Example: Observation vs. State
[Diagram: observation O vs. state S]
State Machine
• S = set of possible states = {fallen, standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S x X ➞ S transition function
• s0 ∈ S = initial state = standing
• s1 = f(s0, fast) = moving
• We do not observe the states directly
• Y = set of possible outputs
• g: S ➞ Y output function
• y1 = g(s1) = moving
State Machine
• S = set of possible states = {fallen, standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S x X ➞ S transition function
• Y = set of possible outputs
• g: S ➞ Y output function
• s0 ∈ S = initial state = standing
• Iteratively compute for t ≥ 1 (see the code sketch below):
  s_t = f(s_{t−1}, x_t),  y_t = g(s_t)
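To make the iteration concrete, here is a minimal Python sketch of this state machine. The transition behavior assumed here (fast always leads to moving, slow always to standing) and the helper name transduce are illustrative assumptions; only s1 = f(standing, fast) = moving is stated on the slides.

```python
# Minimal state machine: S = {standing, moving}, inputs X = {slow, fast},
# transition f: S x X -> S, output g: S -> Y.

def f(s, x):
    """Assumed transition: fast -> moving, slow -> standing."""
    return "moving" if x == "fast" else "standing"

def g(s):
    """Output function; here the identity (we observe the state)."""
    return s

def transduce(inputs, s0="standing"):
    """Iteratively compute s_t = f(s_{t-1}, x_t), y_t = g(s_t) for t >= 1."""
    s, outputs = s0, []
    for x in inputs:
        s = f(s, x)
        outputs.append(g(s))
    return outputs

print(transduce(["fast", "fast", "slow"]))  # ['moving', 'moving', 'standing']
```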
State Machine Example
• Reads binary string
• The string has an even number of zeros iff the machine ends at state S1 (see the code sketch below)
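A sketch of this example, assuming S1 is the start state and means "even number of zeros seen so far" (the slide only states the invariant; the state roles here are an assumption).

```python
# Two-state machine reading a binary string.
# Invariant (from the slide): the string has an even number of zeros
# iff the machine ends at state S1. Assumed roles: S1 = "even number of
# zeros so far" (start state), S0 = "odd number of zeros so far".

def step(state, bit):
    if bit == "0":                      # a zero flips the parity
        return "S0" if state == "S1" else "S1"
    return state                        # a one leaves the parity unchanged

def run(string, start="S1"):
    state = start
    for bit in string:
        state = step(state, bit)
    return state

print(run("1001"))  # two zeros (even) -> S1
print(run("011"))   # one zero (odd)   -> S0
```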
State and Reward
Markov Model
Markov Process
• set of possible states S = {fallen, standing, moving}
• set of possible actions A = {slow, fast}
• transition model T: S x A x S ➞ ℝ
• transition function is stochastic: defines a probability distribution over the next state given the previous state and action
• output is the state (g is the identity)
[Diagram: three-state transition diagram with a probability on each (state, action) edge]
Transition P(s,a,s’)
[Table: transition probabilities P(s, a, s’) for s, s’ ∈ {fallen, standing, moving} and a ∈ {slow, fast}, read off the diagram]
Markov Decision Process (MDP)
• set of possible states S = {fallen, standing, moving}
• set of possible actions A = {slow, fast}
• transition model T: S x A x S ➞ ℝ
• reward function R: S x A ➞ ℝ, reward based on state & action:
  R(fallen, slow) = 1
  R(fallen, fast) = 0
  R(standing, slow) = 1
  R(standing, fast) = 2
  R(moving, slow) = 1
  R(moving, fast) = -1
Markov Decision Process (MDP)
• S = set of possible states {fallen, standing, moving}
• A = set of possible actions {slow, fast}
• T: S x A x S ➞ ℝ transition model
• R: S x A ➞ ℝ reward function
• γ = discount factor
Markov Decision Process (MDP)
• set of possible states S = {fallen, standing, moving}
• set of possible actions A = {slow, fast}
• transition model T: S x A x S ➞ ℝ
• rewards may be probabilistic, specified together with the transition function
• γ = discount factor
[Diagram: transition diagram with a probability and a reward on each edge]
Reward R(s,a)
s ∈ {standing, moving, fallen}, a ∈ {slow, fast}

R(s, a)      slow   fast
fallen        1      0
standing      1      2
moving        1     -1

• Horizon 1: the optimal myopic policy takes, in each state, the action with the largest immediate reward
Markov Decision Process (MDP)
• Defined by the tuple (S, A, T, R, γ) (spelled out in the code sketch below):
  state space S, action space A, transition function T, reward function R, discount factor γ
• At every time step t the agent finds itself in state s ∈ S and selects action a ∈ A
• The agent transitions to the next state s’ and selects a new action
• In contrast, in reinforcement learning the agent does not know T and R, and learns by sampling the environment
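A sketch of how the tuple (S, A, T, R, γ) for the robot example might be written down in Python. The reward values come from the earlier slide; the transition probabilities and the value of γ below are illustrative placeholders, not the values in the lecture's diagram.

```python
# MDP tuple (S, A, T, R, gamma) for the robot example.
# Rewards are from the slides; transition probabilities and gamma are
# illustrative placeholders (the real ones are in the lecture diagram).

S = ["fallen", "standing", "moving"]
A = ["slow", "fast"]
gamma = 0.9  # discount factor (illustrative value)

# R[s][a]: reward for taking action a in state s (values from the slides)
R = {
    "fallen":   {"slow": 1, "fast": 0},
    "standing": {"slow": 1, "fast": 2},
    "moving":   {"slow": 1, "fast": -1},
}

# T[s][a][s']: probability of moving to s' from s under a (placeholders)
T = {
    "fallen":   {"slow": {"fallen": 0.6, "standing": 0.4, "moving": 0.0},
                 "fast": {"fallen": 1.0, "standing": 0.0, "moving": 0.0}},
    "standing": {"slow": {"fallen": 0.0, "standing": 0.0, "moving": 1.0},
                 "fast": {"fallen": 0.4, "standing": 0.0, "moving": 0.6}},
    "moving":   {"slow": {"fallen": 0.0, "standing": 0.8, "moving": 0.2},
                 "fast": {"fallen": 0.2, "standing": 0.0, "moving": 0.8}},
}

# Sanity check: each T[s][a] is a probability distribution over next states
for s in S:
    for a in A:
        assert abs(sum(T[s][a].values()) - 1.0) < 1e-9
```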
Policy
• π: S ➞ A
• Rule book: what action to take in each state?
• In state s ∈ {fallen, standing, moving}, take action a ∈ {slow, fast}
Policy
Which policy is best?
• π: S ➞ A
• Example robot policies:
  – πA: always slow
  – πB: always fast
  – πC: if fallen slow, else fast
  – πD: if moving fast, else slow
Stochastic Policy
• π: S ➞ A
• Policy may be stochastic: randomness in agent actions
• For example (sampled in the code sketch below)
  – πE: from all states, with probability 0.3 do slow, with probability 0.7 do fast
• This is in addition to the transitions being stochastic
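A small sketch of the stochastic policy πE: in every state, slow with probability 0.3 and fast with probability 0.7 (the function name is illustrative).

```python
import random

def pi_E(state):
    """Stochastic policy: in every state, slow w.p. 0.3, fast w.p. 0.7."""
    return "slow" if random.random() < 0.3 else "fast"

# Sample a few actions from state "standing"
actions = [pi_E("standing") for _ in range(10)]
print(actions)
```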
State-Action Diagram
• In state s
• Take action a given by policy π: S ➞ A
• Transition to state s’
• Repeat
• Tree of (s, a, r, s’)
[Diagram: s → take action → (s, a) → where the transition function will take us → s’]
What is the Value of a Policy?
• π: S ➞ A
• What is the value of a policy?
• Depends on the number of steps
• Renting the robot for h time steps, after which it will be destroyed
What is the Value of a Policy?
• π: S ➞ A
• h: horizon, number of time steps left
• V^π_h(s): value (expected reward) with policy π starting at s
• Example robot policies
  – πA: always slow
  – πB: always fast
  – πC: if fallen slow, else fast
  – πD: if moving fast, else slow
What is the Value of a Policy?
• π: S ➞ A
• h: horizon, number of time steps left
• V^π_h(s): value (expected reward) with policy π starting at s
• By induction on the number of steps left to go, h
• Base case: no steps remaining, so no matter what state we’re in the value is
  V^π_0(s) = 0
What is the Value of a Policy?
• π: S ➞ A
• h: horizon, number of time steps left
• V^π_h(s): value (expected reward) with policy π starting at s
• Value of the policy at state s at horizon h is the reward in s plus the next state’s expected horizon h−1 value
• For h = 1:
  V^π_1(s) = R(s, π(s))
What is the Value of a Policy?
• π: S ➞ A
• h: horizon, number of time steps left
• V^π_h(s): value (expected reward) with policy π starting at s
• Value of the policy at state s at horizon h is the reward in s plus the next state’s expected horizon h−1 value
• For h = 2:
  V^π_2(s) = R(s, π(s)) + Σ_{s’} T(s, π(s), s’) V^π_1(s’)
What is the Value of a Policy?
• π: S ➞ A
• h: horizon, number of time steps left
• V^π_h(s): value (expected reward) with policy π starting at s
• Value of the policy at state s at horizon h is the reward in s plus the next state’s expected horizon h−1 value
• For any h (see the code sketch below):
  V^π_h(s) = R(s, π(s)) + Σ_{s’} T(s, π(s), s’) V^π_{h−1}(s’)
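A sketch of this recursion: finite horizon policy evaluation by induction on h, with V^π_0 = 0. The dictionary encoding and the tiny two-state MDP are illustrative, not the lecture's numbers.

```python
# Finite-horizon policy evaluation:
#   V_0(s) = 0
#   V_h(s) = R(s, pi(s)) + sum_{s'} T(s, pi(s), s') * V_{h-1}(s')

def evaluate_policy(S, T, R, pi, horizon):
    V = {s: 0.0 for s in S}                      # base case: V_0 = 0
    for _ in range(horizon):
        V = {s: R[s][pi[s]] +
                sum(p * V[s2] for s2, p in T[s][pi[s]].items())
             for s in S}
    return V

# Tiny illustrative 2-state MDP (placeholder numbers)
S = ["a", "b"]
T = {"a": {"go": {"a": 0.5, "b": 0.5}}, "b": {"go": {"a": 1.0}}}
R = {"a": {"go": 1.0}, "b": {"go": 0.0}}
pi = {"a": "go", "b": "go"}
print(evaluate_policy(S, T, R, pi, horizon=3))
```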
Updating State Value Function
V_h(s) = max_a [ R(s, a) + Σ_{s’} T(s, a, s’) V_{h−1}(s’) ]
Finite Horizon Value Iteration Algorithm
• Compute V_h: start from horizon 0 and store the values
• Use V_{h−1} to compute V_h (see the code sketch below)
• For n = |S|, m = |A|, horizon h, computation time O(nmh)
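A sketch of the algorithm under the same dictionary encoding: start from horizon 0, store the values, and use V_{h−1} to compute V_h (here via Q_h and a max over actions). The example MDP is an illustrative placeholder.

```python
# Finite horizon value iteration:
#   V_0(s) = 0
#   Q_h(s, a) = R(s, a) + sum_{s'} T(s, a, s') * V_{h-1}(s')
#   V_h(s)    = max_a Q_h(s, a)

def finite_horizon_value_iteration(S, A, T, R, horizon):
    V = {s: 0.0 for s in S}          # V_0
    Q = {}
    for _ in range(horizon):
        Q = {s: {a: R[s][a] + sum(p * V[s2] for s2, p in T[s][a].items())
                 for a in A}
             for s in S}
        V = {s: max(Q[s].values()) for s in S}
    return V, Q

# Tiny illustrative 2-state MDP (placeholder numbers)
S, A = ["a", "b"], ["stay", "go"]
T = {"a": {"stay": {"a": 1.0}, "go": {"b": 1.0}},
     "b": {"stay": {"b": 1.0}, "go": {"a": 1.0}}}
R = {"a": {"stay": 0.0, "go": 1.0}, "b": {"stay": 2.0, "go": 0.0}}
V, Q = finite_horizon_value_iteration(S, A, T, R, horizon=3)
print(V)
```

Each of the h passes updates every (state, action) pair once, matching the loop structure described on the slide.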
Updating Action Value Function
Q_h(s, a) = R(s, a) + Σ_{s’} T(s, a, s’) max_{a’} Q_{h−1}(s’, a’)
Finite Horizon Optimal Policy
• Given Q_h, the optimal finite horizon policy is (see the code sketch below):
  π*_h(s) = arg max_a Q_h(s, a)
• May be multiple optimal policies
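Given a Q_h table, the optimal finite horizon action is an argmax over actions; ties are one reason multiple optimal policies can exist. A one-line sketch with an illustrative, hand-written table:

```python
def optimal_action(Q_h, s):
    """pi*_h(s) = argmax_a Q_h(s, a); ties mean multiple optimal policies."""
    return max(Q_h[s], key=Q_h[s].get)

# Example with an illustrative Q_h table
Q_h = {"standing": {"slow": 1.0, "fast": 2.0}}
print(optimal_action(Q_h, "standing"))  # fast
```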
Return and Discount Factor
• Return: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^∞ γ^k R_{t+k+1}
• Discount factor γ ∈ (0, 1)
• If γ = 0 the agent is myopic, maximizing only immediate rewards
• As γ → 1 the agent becomes farsighted
State Value Function
• V^π(s) = E_π[ G_t | S_t = s ]: expected return when starting in state s and following policy π
Action-Value Function
• Q^π(s, a) = E_π[ G_t | S_t = s, A_t = a ]: expected return when starting in state s, taking action a, and thereafter following policy π
Value Functions
• For all states s:
Returns at Successive Time Steps
• Recursive relationship (see the code sketch below):
  G_t = R_{t+1} + γ G_{t+1}
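A small numeric sketch of the discounted return and its recursive relationship, using an illustrative reward sequence and γ = 0.9:

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1}, computed backwards
# using the recursion G_t = R_{t+1} + gamma * G_{t+1}.

gamma = 0.9
rewards = [1, 2, -1, 1]          # illustrative reward sequence R_1..R_4

G = 0.0
returns = []
for r in reversed(rewards):      # work backwards from the last reward
    G = r + gamma * G
    returns.append(G)
returns.reverse()                # returns[t] == G_t

direct = sum(gamma**k * r for k, r in enumerate(rewards))
print(returns[0], direct)        # the two computations of G_0 agree
```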
4 Bellman Equations
• Expectation for state value (linear):
  V^π(s) = R(s, π(s)) + γ Σ_{s’} T(s, π(s), s’) V^π(s’)
• Expectation for action value (linear):
  Q^π(s, a) = R(s, a) + γ Σ_{s’} T(s, a, s’) Q^π(s’, π(s’))
• Optimality for state value function (non-linear):
  V*(s) = max_a [ R(s, a) + γ Σ_{s’} T(s, a, s’) V*(s’) ]
• Optimality for action value function (non-linear):
  Q*(s, a) = R(s, a) + γ Σ_{s’} T(s, a, s’) max_{a’} Q*(s’, a’)
Infinite Horizon
• We don’t know when the game will be over
• Problem: the total reward may be infinite, so we cannot select one policy over another
• Solution: find the policy that maximizes the infinite horizon discounted value
  V^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t, π(s_t)) | s_0 = s ]
• t denotes the number of steps from the starting state
Policy Evaluation
• Expected infinite horizon value of state s under policy π:
  V^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t, π(s_t)) | s_0 = s ]
• t denotes the number of steps from the starting state
• n = |S| linear equations:
  V^π(s) = R(s, π(s)) + γ Σ_{s’} T(s, π(s), s’) V^π(s’) for each s ∈ S
• Solve as a system of linear equations (see the code sketch below)
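A sketch of policy evaluation as a linear system, V^π = R^π + γ T^π V^π, solved with numpy; the dictionary encoding and the tiny MDP below are illustrative placeholders.

```python
import numpy as np

# Solve V = R_pi + gamma * T_pi V  <=>  (I - gamma * T_pi) V = R_pi

def policy_evaluation(S, T, R, pi, gamma):
    n = len(S)
    idx = {s: i for i, s in enumerate(S)}
    T_pi = np.zeros((n, n))
    R_pi = np.zeros(n)
    for s in S:
        a = pi[s]
        R_pi[idx[s]] = R[s][a]
        for s2, p in T[s][a].items():
            T_pi[idx[s], idx[s2]] = p
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    return dict(zip(S, V))

# Tiny illustrative MDP and policy (placeholder numbers)
S = ["a", "b"]
T = {"a": {"go": {"a": 0.5, "b": 0.5}}, "b": {"go": {"a": 1.0}}}
R = {"a": {"go": 1.0}, "b": {"go": 0.0}}
pi = {"a": "go", "b": "go"}
print(policy_evaluation(S, T, R, pi, gamma=0.9))
```

For γ < 1 the matrix I − γ T^π is invertible, so the system has a unique solution, matching the uniqueness statement on the later slide.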
State-Value Function for Policy
• Expected return starting in s and following policy π satisfies a recursive relationship: the Bellman equation for V^π
• Relationship between the value of a state and the values of its successor states
Bellman Equation for State-Value Function
• V^π is the unique solution to its Bellman equation
• Linear equation; in vector notation: V^π = R^π + γ T^π V^π
• Value of the start state s is the expected reward plus the discounted expected value of the next state
• The Bellman equation averages over all possibilities, weighting each by its probability of occurring
[Diagram: backup diagram s → take action → (s, a) → where the transition function will take us → s’]
Bellman Equation for Action-Value Function
• Linear equation:
  Q^π(s, a) = R(s, a) + γ Σ_{s’} T(s, a, s’) Q^π(s’, π(s’))
  (expectation over where the transition function will take us, and over what action we will take next)
Finding Optimal Policy
• In the infinite horizon case there exists a stationary optimal policy π* (at least one) such that for all s ∈ S and all other policies π:
  V^{π*}(s) ≥ V^π(s)
• Stationary: does not change over time
Infinite Horizon Value Iteration
• Q*(s, a): expected infinite horizon discounted value of being in state s, taking action a, then executing the optimal policy π*
• n = |S|, m = |A|: nm non-linear equations with a unique solution
Bellman Optimality Equation for Q*
• Non-linear equation:
  Q*(s, a) = R(s, a) + γ Σ_{s’} T(s, a, s’) max_{a’} Q*(s’, a’)
  (expectation over where the transition function will take us; maximize over the actions we can take)
• Once we have Q* we can act optimally
Finding Optimal Policy
• If we know the optimal action-value function then we can derive an optimal policy:
  π*(s) = arg max_a Q*(s, a)
• The optimal policy may not be unique
Infinite Horizon Value Iteration Algorithm
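The algorithm itself did not survive extraction, so here is a minimal sketch of one standard form of infinite horizon value iteration: iterate the Bellman optimality update on Q until it stops changing, then act greedily. The encoding, tolerance, and example MDP are illustrative assumptions.

```python
# Infinite horizon value iteration on Q:
#   Q(s, a) <- R(s, a) + gamma * sum_{s'} T(s, a, s') * max_{a'} Q(s', a')
# repeated until the update changes Q by less than a tolerance.

def value_iteration(S, A, T, R, gamma, tol=1e-8):
    Q = {s: {a: 0.0 for a in A} for s in S}
    while True:
        Q_new = {s: {a: R[s][a] + gamma * sum(p * max(Q[s2].values())
                                              for s2, p in T[s][a].items())
                     for a in A}
                 for s in S}
        delta = max(abs(Q_new[s][a] - Q[s][a]) for s in S for a in A)
        Q = Q_new
        if delta < tol:
            return Q

def greedy_policy(Q):
    """pi*(s) = argmax_a Q*(s, a): act optimally once we have Q*."""
    return {s: max(Q[s], key=Q[s].get) for s in Q}

# Tiny illustrative MDP (placeholder numbers)
S, A = ["a", "b"], ["stay", "go"]
T = {"a": {"stay": {"a": 1.0}, "go": {"b": 1.0}},
     "b": {"stay": {"b": 1.0}, "go": {"a": 1.0}}}
R = {"a": {"stay": 0.0, "go": 1.0}, "b": {"stay": 2.0, "go": 0.0}}
Q_star = value_iteration(S, A, T, R, gamma=0.9)
print(greedy_policy(Q_star))  # {'a': 'go', 'b': 'stay'}
```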