CS 188: Artificial Intelligence

Markov Decision Processes II

Instructor: Pieter Abbeel

University of California, Berkeley
[These slides adapted from Dan Klein and Pieter Abbeel]
Recap: Defining MDPs
o Markov decision processes:
o Set of states S
o Start state s0
o Set of actions A
o Transitions P(s’|s,a) (or T(s,a,s’))
o Rewards R(s,a,s’) (and discount γ)
o MDP quantities so far:
o Policy = Choice of action for each state
o Utility = sum of (discounted) rewards
o Values = expected future utility from a state (max node)
o Q-Values = expected future utility from a q-state (chance node)
Example: Grid World
§ A maze-like problem
§ The agent lives in a grid
§ Walls block the agent’s path

§ Noisy movement: actions do not always go as planned
§ 80% of the time, the action North takes the agent
North (if there is no wall there)
§ 10% of the time, North takes the agent West; 10% East
§ If there is a wall in the direction the agent would have
been taken, the agent stays put
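
A minimal sketch of the noisy movement model described above. It assumes a hypothetical text encoding of the grid ('#' for walls, '.' for free cells), treats the grid edge like a wall, and assumes every action slips to its two perpendicular directions the way North slips to West and East. The names `GRID`, `MOVES`, `SIDEWAYS`, and `transition` are illustrative and not from the slides.

```python
# Hypothetical 3x4 grid: '#' is a wall, '.' is a free cell; row 0 is the top row.
GRID = [
    "....",
    ".#..",
    "....",
]

# Each direction as a (row_delta, col_delta) step.
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
# The two perpendicular "slip" directions for each intended direction.
SIDEWAYS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}


def transition(grid, pos, action, noise=0.2):
    """Return {next_position: probability} for taking `action` at `pos`."""
    outcomes = {}
    left, right = SIDEWAYS[action]
    # Intended direction with prob 1 - noise, each slip direction with prob noise / 2.
    for direction, prob in ((action, 1 - noise), (left, noise / 2), (right, noise / 2)):
        dr, dc = MOVES[direction]
        r, c = pos[0] + dr, pos[1] + dc
        off_grid = not (0 <= r < len(grid) and 0 <= c < len(grid[0]))
        blocked = off_grid or grid[r][c] == "#"
        nxt = pos if blocked else (r, c)  # blocked moves leave the agent in place
        outcomes[nxt] = outcomes.get(nxt, 0.0) + prob
    return outcomes


# From the bottom-left corner, North succeeds 80% of the time, the West slip
# hits the edge (agent stays put), and the East slip moves one cell right.
print(transition(GRID, (2, 0), "N"))
```

Returning a dictionary of probabilities (rather than sampling a move) keeps the function usable as the T(s,a,s’) model that planning algorithms need.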

§ The agent receives rewards

§ Small “living” reward each step (can be negative)
§ Big rewards come at the end (good or bad)

§ Goal: maximize sum of rewards

Solving MDPs
Optimal Quantities

§ The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally

§ The value (utility) of a q-state (s,a):
Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

§ The optimal policy:
π*(s) = optimal action from state s

[Diagram: expectimax tree in which s is a state, (s, a) is a q-state, and (s,a,s’) is a transition]
Value Iteration
o Start with V0(s) = 0: no time steps left means an expected reward sum of zero
o Given vector of Vk(s) values, do one ply of expectimax from each state:

$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_k(s') \right]$$

o Repeat until convergence, which yields V*

o Complexity of each iteration: O(S²A)

o Theorem: will converge to unique optimal values

o Basic idea: approximations get refined towards optimal values
o Policy may converge long before values do
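
The value iteration procedure on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the three-state MDP, the dictionary layout `T[s][a] = [(probability, next_state, reward), ...]`, the discount of 0.9, and the stopping tolerance are all hypothetical choices, not part of the slides.

```python
# Hypothetical toy MDP (not from the slides). T[s][a] is a list of
# (probability, next_state, reward) triples; a state with no actions is terminal.
T = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},
}


def q_value(T, V, s, a, gamma):
    """Expected value of the chance node (s, a): sum over s' of T * [R + gamma * V(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])


def value_iteration(T, gamma=0.9, tol=1e-10, max_iters=10_000):
    V = {s: 0.0 for s in T}  # V_0(s) = 0 for every state
    for _ in range(max_iters):
        # One ply of expectimax from each state, using only the old vector V_k.
        V_new = {
            s: max((q_value(T, V, s, a, gamma) for a in T[s]), default=0.0)
            for s in T
        }
        delta = max(abs(V_new[s] - V[s]) for s in T)
        V = V_new
        if delta < tol:  # values have (numerically) stopped changing
            break
    return V


V_star = value_iteration(T)
print(V_star)
```

Each sweep touches every (s, a, s’) triple once, which is where the O(S²A) per-iteration cost comes from.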
Value Iteration
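o Bellman equations characterize the optimal values:

$$V^*(s) = \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$$

o Value iteration computes them:

$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_k(s') \right]$$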

o Value iteration is just a fixed point solution method

o … though the Vk vectors are also interpretable as time-limited values
Value Iteration (again)
o Init:

$$\forall s: \; V(s) = 0$$

o Iterate:

$$\forall s: \; V_{new}(s) = \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V(s') \right]$$

$$V = V_{new}$$

Note: can even directly assign to V(s), which will not compute the sequence of Vk but will still converge to V*
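
A sketch of the in-place variant mentioned in the note, where each new value is written straight into V(s) and is immediately visible to later states in the same sweep. The toy MDP, its dictionary layout, the discount of 0.9, and the tolerance are hypothetical choices made for illustration.

```python
# Hypothetical toy MDP (not from the slides). T[s][a] is a list of
# (probability, next_state, reward) triples; a state with no actions is terminal.
T = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},
}


def value_iteration_in_place(T, gamma=0.9, tol=1e-10, max_sweeps=10_000):
    V = {s: 0.0 for s in T}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in T:
            if not T[s]:  # terminal state: value stays 0
                continue
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])
                for a in T[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # assign directly; no separate V_new vector is kept
        if delta < tol:
            break
    return V


V_in_place = value_iteration_in_place(T)
print(V_in_place)
```

The intermediate vectors no longer correspond to the time-limited values Vk, but the fixed point reached is the same.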
The Bellman Equations

How to be optimal:
Step 1: Take correct first action

Step 2: Keep being optimal
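
The two steps can be written out using the quantities defined earlier in the lecture. Step 1 is the max over the first action, and Step 2 is the appearance of V* again inside the expectation. This is the standard form of the Bellman equations, stated in the T, R, γ notation of the recap slide.

```latex
\begin{align*}
V^*(s)   &= \max_a Q^*(s,a) \\
Q^*(s,a) &= \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right] \\
V^*(s)   &= \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]
\end{align*}
```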

[Figures: Gridworld value estimates after k = 0, 1, 2, …, 12 and k = 100 iterations of value iteration. Noise = 0.2, Discount = 0.9, Living reward = 0]
Policy Extraction
Computing Actions from Values
o Let’s imagine we have the optimal values V*(s)

o How should we act?

o It’s not obvious!

o We need to do a mini-expectimax (one step):

$$\pi^*(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$$

o This is called policy extraction, since it gets the policy implied by the values
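
A minimal sketch of policy extraction as a one-step look-ahead. The toy MDP, its dictionary layout, and the discount of 0.9 are hypothetical; the values in `V_STAR` are the optimal values of that toy MDP under that discount, supplied here as given inputs.

```python
# Hypothetical toy MDP (not from the slides). T[s][a] is a list of
# (probability, next_state, reward) triples; a state with no actions is terminal.
T = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},
}

# Optimal values of the toy MDP for gamma = 0.9 (assumed already computed).
V_STAR = {"cool": 15.5, "warm": 14.5, "overheated": 0.0}


def extract_policy(T, V, gamma=0.9):
    """For each state, pick the action with the best one-step look-ahead value."""
    policy = {}
    for s in T:
        if not T[s]:
            policy[s] = None  # terminal state: nothing to choose
            continue
        policy[s] = max(
            T[s],
            key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]),
        )
    return policy


policy = extract_policy(T, V_STAR)
print(policy)
```

Note that the extraction needs the transition model and rewards, not just the values; that is why acting from values alone is "not obvious".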
Computing Actions from Q-Values
o Let’s imagine we have the optimal q-values Q*(s,a)

o How should we act?

o Completely trivial to decide!

$$\pi^*(s) = \arg\max_a Q^*(s,a)$$

o Important lesson: actions are easier to select from q-values than values!
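
For contrast, a sketch of action selection from q-values: it is a plain argmax per state and uses no transition model at all. The table `Q_STAR` holds hypothetical q-values for a made-up two-state problem, and the function name is illustrative.

```python
# Hypothetical optimal q-values, keyed by (state, action).
Q_STAR = {
    ("cool", "slow"): 14.95,
    ("cool", "fast"): 15.5,
    ("warm", "slow"): 14.5,
    ("warm", "fast"): -10.0,
}


def policy_from_q(Q):
    """Return, for each state, the action with the largest q-value."""
    best = {}
    for (s, a), q in Q.items():
        if s not in best or q > Q[(s, best[s])]:
            best[s] = a
    return best


greedy = policy_from_q(Q_STAR)
print(greedy)
```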
Let’s think.
o Take a minute, think about value iteration.
o Write down the biggest question you have about it.
Policy Methods
Problems with Value Iteration
o Value iteration repeats the Bellman updates:

$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_k(s') \right]$$

o Problem 1: It’s slow: O(S²A) per iteration

o Problem 2: The “max” at each state rarely changes

o Problem 3: The policy often converges long before the values

[Figures: Gridworld values and policy after k = 12 and k = 100 iterations. Noise = 0.2, Discount = 0.9, Living reward = 0]
Policy Iteration
o Alternative approach for optimal values:
o Step 1: Policy Evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
o Step 2: Policy Improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values
o Repeat steps until policy converges

o This is Policy Iteration

o It’s still optimal!
o Can converge (much) faster under some conditions
Policy Evaluation
Fixed Policies
[Diagrams: two expectimax trees side by side. Left, “Do the optimal action”: s, a, (s, a), (s,a,s’), s’. Right, “Do what π says to do”: s, π(s), (s, π(s)), (s, π(s), s’), s’]

o Expectimax trees max over all actions to compute the optimal values
o If we fixed some policy π(s), then the tree would be simpler: only one action per state
o … though the tree’s value would depend on which policy we fixed
Utilities for a Fixed Policy
o Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy

o Define the utility of a state s, under a fixed policy π:
Vπ(s) = expected total discounted rewards starting in s and following π

o Recursive relation (one-step look-ahead / Bellman equation):

$$V^{\pi}(s) = \sum_{s'} T(s,\pi(s),s') \left[ R(s,\pi(s),s') + \gamma V^{\pi}(s') \right]$$
Policy Evaluation
o How do we calculate the V’s for a fixed policy π?

o Idea 1: Turn recursive Bellman equations into updates (like value iteration)

$$V_0^{\pi}(s) = 0$$

$$V_{k+1}^{\pi}(s) \leftarrow \sum_{s'} T(s,\pi(s),s') \left[ R(s,\pi(s),s') + \gamma V_k^{\pi}(s') \right]$$

o Efficiency: O(S²) per iteration

o Idea 2: Without the maxes, the Bellman equations are just a linear system
o Solve with Matlab (or your favorite linear system solver)
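
Both ideas can be sketched side by side in Python. Everything here is an assumption made for illustration: the toy MDP and its dictionary layout, the fixed policy "always slow", the discount of 0.9, and the small pure-Python Gauss-Jordan routine standing in for "your favorite linear system solver".

```python
# Hypothetical toy MDP (not from the slides). T[s][a] is a list of
# (probability, next_state, reward) triples; a state with no actions is terminal.
T = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},
}
# Hypothetical fixed policy to evaluate (None marks the terminal state).
ALWAYS_SLOW = {"cool": "slow", "warm": "slow", "overheated": None}


def evaluate_iterative(T, policy, gamma=0.9, tol=1e-10, max_iters=100_000):
    """Idea 1: repeat the fixed-policy Bellman update until values stop changing."""
    V = {s: 0.0 for s in T}
    for _ in range(max_iters):
        V_new = {}
        for s in T:
            a = policy[s]
            V_new[s] = 0.0 if a is None else sum(
                p * (r + gamma * V[s2]) for p, s2, r in T[s][a]
            )
        delta = max(abs(V_new[s] - V[s]) for s in T)
        V = V_new
        if delta < tol:
            break
    return V


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate_exact(T, policy, gamma=0.9):
    """Idea 2: solve (I - gamma * P_pi) v = r_pi as a linear system."""
    states = list(T)
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [0.0] * n
    for s in states:
        a = policy[s]
        if a is None:
            continue  # terminal row stays "v = 0"
        for p, s2, r in T[s][a]:
            A[idx[s]][idx[s2]] -= gamma * p  # move gamma * P_pi to the left side
            b[idx[s]] += p * r               # expected immediate reward
    return dict(zip(states, solve_linear(A, b)))


V_iter = evaluate_iterative(T, ALWAYS_SLOW)
V_exact = evaluate_exact(T, ALWAYS_SLOW)
print(V_iter, V_exact)
```

In this toy problem "always slow" earns a reward of 1 on every step, so both methods should agree on a value of 1 / (1 - 0.9) = 10 for the two non-terminal states.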
Example: Policy Evaluation
[Figures: state values under two fixed policies, “Always Go Right” and “Always Go Forward”]
Policy Iteration
Policy Iteration

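o Evaluation: For fixed current policy π_i, find values with policy evaluation:
o Iterate until values converge:

$$V_{k+1}^{\pi_i}(s) \leftarrow \sum_{s'} T(s,\pi_i(s),s') \left[ R(s,\pi_i(s),s') + \gamma V_k^{\pi_i}(s') \right]$$

o Improvement: For fixed values, get a better policy using policy extraction
o One-step look-ahead:

$$\pi_{i+1}(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^{\pi_i}(s') \right]$$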
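
The evaluation and improvement steps can be combined into a short policy iteration loop. This is a sketch under stated assumptions: the toy MDP and its dictionary layout, the discount of 0.9, the starting policy (first listed action in each state), and the small threshold `eps` used to avoid flip-flopping between tied actions are all hypothetical choices.

```python
# Hypothetical toy MDP (not from the slides). T[s][a] is a list of
# (probability, next_state, reward) triples; a state with no actions is terminal.
T = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},
}


def q_value(T, V, s, a, gamma):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])


def evaluate(T, policy, gamma, tol=1e-10, max_iters=100_000):
    """Policy evaluation: only the policy's action is considered in each state."""
    V = {s: 0.0 for s in T}
    for _ in range(max_iters):
        V_new = {
            s: 0.0 if policy[s] is None else q_value(T, V, s, policy[s], gamma)
            for s in T
        }
        delta = max(abs(V_new[s] - V[s]) for s in T)
        V = V_new
        if delta < tol:
            break
    return V


def improve(T, V, policy, gamma, eps=1e-9):
    """Policy improvement: switch only to actions that are strictly better."""
    new_policy = {}
    for s in T:
        best = policy[s]
        for a in T[s]:
            if q_value(T, V, s, a, gamma) > q_value(T, V, s, best, gamma) + eps:
                best = a
        new_policy[s] = best
    return new_policy


def policy_iteration(T, gamma=0.9, max_rounds=1_000):
    # Start from the first listed action in each state (None if terminal).
    policy = {s: next(iter(T[s]), None) for s in T}
    V = evaluate(T, policy, gamma)
    for _ in range(max_rounds):
        new_policy = improve(T, V, policy, gamma)
        if new_policy == policy:  # policy is stable, so we are done
            break
        policy = new_policy
        V = evaluate(T, policy, gamma)
    return policy, V


pi_star, V_pi = policy_iteration(T)
print(pi_star, V_pi)
```

On this toy problem the loop starts from "always slow" and stabilizes after a single policy change.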
Comparison
o Both value iteration and policy iteration compute the same thing (all optimal values)

o In value iteration:
o Every iteration updates both the values and (implicitly) the policy
o We don’t track the policy, but taking the max over actions implicitly recomputes it

o In policy iteration:
o We do several passes that update utilities with fixed policy (each pass is fast because we
consider only one action, not all of them)
o After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
o The new policy will be better (or we’re done)

o Both are dynamic programs for solving MDPs

Summary: MDP Algorithms
o So you want to….
o Compute optimal values: use value iteration or policy iteration
o Compute values for a particular policy: use policy evaluation
o Turn your values into a policy: use policy extraction (one-step lookahead)

o These all look the same!

o They basically are – they are all variations of Bellman updates
o They all use one-step lookahead expectimax fragments
o They differ only in whether we plug in a fixed policy or max over actions
The Bellman Equations

How to be optimal:
Step 1: Take correct first action

Step 2: Keep being optimal

Double Bandits
Double-Bandit MDP
o Actions: Blue, Red
o States: Win (W), Lose (L)
o No discount, 100 time steps
o Both states have the same value
o Blue: pays $1 with probability 1.0
o Red: pays $2 with probability 0.75, $0 with probability 0.25
Offline Planning
o Solving MDPs is offline planning
o You determine all quantities through computation
o You need to know the details of the MDP
o You do not actually play the game!

o Setting: no discount, 100 time steps, both states have the same value

Action      Value
Play Red    150
Play Blue   100

[Diagram: double-bandit MDP. Blue pays $1 with probability 1.0; Red pays $2 with probability 0.75 and $0 with probability 0.25]
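
The two values in the table follow from a short expected-value calculation: with no discount and 100 time steps, always playing one arm is worth 100 times that arm's expected payout. A small sketch, where the `PAYOUTS` layout and the function name are illustrative:

```python
# Payout distribution of each arm as (probability, dollars) pairs,
# taken from the double-bandit diagram.
PAYOUTS = {
    "blue": [(1.0, 1.0)],
    "red": [(0.75, 2.0), (0.25, 0.0)],
}


def value_of_always_playing(arm, steps=100):
    """Undiscounted expected total payout of playing `arm` for `steps` rounds."""
    expected_per_play = sum(p * dollars for p, dollars in PAYOUTS[arm])
    return steps * expected_per_play


red_value = value_of_always_playing("red")
blue_value = value_of_always_playing("blue")
print(red_value, blue_value)
```

Because the payouts do not depend on whether the current state is Win or Lose, both states have the same value, and the optimal plan is simply to play Red every time.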
Let’s Play!

$2 $2 $0 $2 $2
$2 $2 $0 $0 $0
Online Planning
o Rules changed! Red’s win chance is different.

[Diagram: double-bandit MDP in which Red’s payout probabilities are unknown (??); Blue still pays $1 with probability 1.0]
Let’s Play!

$0 $0 $0 $2 $0
$2 $0 $0 $0 $0
What Just Happened?
o That wasn’t planning, it was learning!
o Specifically, reinforcement learning
o There was an MDP, but you couldn’t solve it with just computation
o You needed to actually act to figure it out

o Important ideas in reinforcement learning that came up

o Exploration: you have to try unknown actions to get information
o Exploitation: eventually, you have to use what you know
o Regret: even if you learn intelligently, you make mistakes
o Sampling: because of chance, you have to try things repeatedly
o Difficulty: learning can be much harder than solving a known MDP
Next Time: Reinforcement Learning!
