Artificial Intelligence and
Intelligent Agents (F29AI)
MDP II: Policies, Search & Utility
Arash Eshghi
Based on slides from Ioannis Konstas @HWU, Verena Rieser @HWU, Dan Klein @UC Berkeley
Markov Decision Processes
• An MDP is defined by:
• A set of states s ∈ S
• A set of actions a ∈ A
• A transition function T(s, a, s’)
• Probability that action a taken in state s leads to s’, i.e., P(s’ | s, a)
• Also called “the model”
• A reward function R(s, a, s’)
• Sometimes just R(s) or R(s’)
• A start state (or distribution)
• Maybe a terminal state
• MDPs are a family of non-deterministic search problems
• One way to solve them is with expectimax search – but we will
have a new tool soon
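As a concrete illustration of these ingredients, here is a minimal sketch of an MDP container in Python; the class and method names are illustrative, not part of the slides.

import random

class MDP:
    """Bundle of the ingredients that define an MDP (names are illustrative)."""
    def __init__(self, states, actions, transition, reward, start, terminals=()):
        self.states = states        # set of states S
        self.actions = actions      # set of actions A
        self.T = transition         # T(s, a) -> list of (s', P(s' | s, a)) pairs ("the model")
        self.R = reward             # R(s, a, s') -> reward for that transition
        self.start = start          # start state (or a draw from a start distribution)
        self.terminals = set(terminals)

    def sample_next_state(self, s, a):
        """Draw s' ~ P(. | s, a): the non-deterministic outcome of taking a in s."""
        next_states, probs = zip(*self.T(s, a))
        return random.choices(next_states, weights=probs, k=1)[0]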
Policies
• In deterministic single-agent search problems, we wanted an optimal
plan, or sequence of actions, from start to a goal
• In an MDP, we want an optimal policy π*: S → A
• A policy π gives an action for each state
• An optimal policy maximizes expected utility if followed
• An explicit policy defines a reflex agent
• Expectimax didn’t compute entire policies
• Expectimax computed actions
for a single state only!
[Figure: gridworld optimal policy when R(s, a, s’) = -0.03 for all non-terminal states s]
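Concretely, a policy can be stored as a plain dictionary from states to actions and executed reflexively; the gridworld coordinates and actions below are made up for illustration.

# A policy π maps every state to an action (states and actions here are illustrative).
policy = {
    (0, 0): "North", (0, 1): "North", (0, 2): "East",
    (1, 0): "West",                   (1, 2): "East",
    (2, 0): "West",  (2, 1): "West",  (2, 2): "East",
}

def reflex_agent(state):
    """An explicit policy defines a reflex agent: look up the current state and act."""
    return policy[state]

print(reflex_agent((0, 2)))   # -> East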
Example Optimal Policies
[Figure: four gridworld optimal policies, one each for R(s) = -0.01, R(s) = -0.3, R(s) = -0.4, and R(s) = -2.0]
Example: Racing
• A robot car wants to travel far, quickly
• Three states: Cool, Warm, Overheated
• Two actions: Slow, Fast
• Going faster gets double reward
• Break-down: Game over!
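As a sketch, the racing MDP can be written down as transition and reward tables. The slide only names the states and actions, so the probabilities and reward values below follow the usual version of this Berkeley example and should be read as assumptions.

# Racing MDP tables (numbers are assumed, following the standard version of this example).
# T[(s, a)] lists (s', probability) pairs; R[(s, a)] is the reward for taking a in s.
T = {
    ("Cool", "Slow"): [("Cool", 1.0)],
    ("Cool", "Fast"): [("Cool", 0.5), ("Warm", 0.5)],
    ("Warm", "Slow"): [("Cool", 0.5), ("Warm", 0.5)],
    ("Warm", "Fast"): [("Overheated", 1.0)],    # break-down: game over!
}
R = {
    ("Cool", "Slow"): 1.0,
    ("Cool", "Fast"): 2.0,     # going faster gets double reward
    ("Warm", "Slow"): 1.0,
    ("Warm", "Fast"): -10.0,   # assumed penalty for overheating
}
TERMINALS = {"Overheated"}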
Racing Search Tree
MDP Search Trees
Each MDP state projects an expectimax-like search tree
• s is a state
• (s, a) is a q-state
• (s, a, s’) is a transition; T(s, a, s’) = P(s’ | s, a) is its probability and R(s, a, s’) its reward
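To make the tree concrete, here is a depth-limited, expectimax-style evaluation over the racing tables T, R, TERMINALS sketched above (undiscounted and depth-limited; the function names are mine, and this is not yet the "new tool" promised earlier).

def q_value(s, a, depth):
    """Value of the q-state (s, a): expectation over the transitions (s, a, s')."""
    return sum(p * (R[(s, a)] + state_value(sp, depth - 1)) for sp, p in T[(s, a)])

def state_value(s, depth):
    """Value of a state node: maximise over its q-state children, as in the tree."""
    if s in TERMINALS or depth == 0:
        return 0.0
    return max(q_value(s, a, depth) for a in ("Slow", "Fast"))

print(state_value("Cool", 3))   # expected return of the best 3-step behaviour from Cool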
Utilities of Sequences
• What preferences should an agent have over reward
sequences?
• More or less? [1,2,2] or [2,3,4]
• Now or later? [0,0,1] or [1,0,0]
Discounting (gamma)
• It’s reasonable to maximise the sum of rewards
• It’s also reasonable to prefer rewards now to rewards later
• One solution: values of rewards decay exponentially!
• A reward is worth 1 now, γ one step from now, and γ² two steps from now
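Written out as a formula, the discounted utility of a reward sequence (standard definition; γ is the discount factor):

U([r_0, r_1, r_2, \dots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots = \sum_{t \ge 0} \gamma^t r_t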
Discounting
• How to discount?
• Each time we descend a level,
we multiply in the discount once
• Why discount?
• Sooner rewards probably do
have higher utility than later
rewards
• Also helps our algorithms
converge
• Example: discount of 0.5
• U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 = 2.75
• U([3,2,1]) = 3*1 + 0.5*2 + 0.25*1 = 4.25
• So U([1,2,3]) < U([3,2,1])
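The same computation in code, reproducing the γ = 0.5 example above (the function name is illustrative):

def discounted_utility(rewards, gamma):
    """Sum the rewards, multiplying in the discount once per level descended."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], 0.5))   # 1*1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1], 0.5))   # 3*1 + 0.5*2 + 0.25*1 = 4.25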