
Artificial Intelligence and Intelligent Agents (F29AI)

MDP II: Policies, Search & Utility


Arash Eshghi

Based on slides from Ioannis Konstas @HWU, Verena Rieser @HWU, Dan Klein @UC Berkeley
Markov Decision Processes

• An MDP is defined by:


• A set of states s ∈ S
• A set of actions a ∈ A
• A transition function T(s, a, s’)
• The probability that action a taken in s leads to s’,
i.e., P(s’ | s, a)
• Also called “the model”
• A reward function R(s, a, s’)
• Sometimes just R(s) or R(s’)
• A start state (or distribution)
• Maybe a terminal state

• MDPs are a family of non-deterministic search problems


• One way to solve them is with expectimax search – but we will
have a new tool soon
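As a rough illustration of these ingredients (not part of the original slides), the sketch below shows one way the MDP components could be written as a Python interface; the type names and method signatures are my own assumptions.

```python
# A minimal MDP interface matching the definition above.
# Sketch only -- names and signatures are illustrative, not lecture code.

from typing import List, Tuple

State = str
Action = str

class MDP:
    def states(self) -> List[State]:
        """The set of states S."""
        raise NotImplementedError

    def actions(self, s: State) -> List[Action]:
        """The actions A available in state s."""
        raise NotImplementedError

    def transitions(self, s: State, a: Action) -> List[Tuple[State, float]]:
        """Pairs (s', T(s, a, s')), i.e. P(s' | s, a) -- 'the model'."""
        raise NotImplementedError

    def reward(self, s: State, a: Action, s2: State) -> float:
        """The reward R(s, a, s')."""
        raise NotImplementedError

    def is_terminal(self, s: State) -> bool:
        raise NotImplementedError
```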
Policies
• In deterministic single-agent search problems, we wanted an optimal
plan, or sequence of actions, from start to a goal

• In an MDP, we want an optimal policy π*: S → A


• A policy π gives an action for each state
• An optimal policy maximizes expected utility if followed
• An explicit policy defines a reflex agent

• Expectimax didn’t compute entire policies


• Expectimax computed actions
for a single state only!

[Gridworld figure] Optimal policy when R(s, a, s’) = -0.03 for all non-terminal states s
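As a small sketch (not from the slides), an explicit policy can be stored as a table mapping each state to an action; the state and action names below are placeholders.

```python
# A policy as an explicit lookup table from states to actions.
# State/action names are illustrative placeholders.

policy = {
    "cool": "fast",
    "warm": "slow",
}

def act(state: str) -> str:
    """A reflex agent defined by an explicit policy: just look the action up."""
    return policy[state]

# Expectimax, by contrast, recomputes the best action only for the single
# state it is called on; it never produces a table like `policy` above.
print(act("warm"))  # -> slow
```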
Example Optimal Policies

[Four Gridworld panels showing the optimal policy for R(s) = -0.01, R(s) = -0.3, R(s) = -0.4, and R(s) = -2.0]


Example: Racing
• A robot car wants to travel far, quickly
• Three states: Cool, Warm, Overheated
• Two actions: Slow, Fast
• Going faster gets double reward
• Break-down: Game over!
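One way the racing MDP could be written out as transition and reward tables is sketched below. The exact probabilities and reward magnitudes are assumptions for illustration; the slides only state that Fast earns double the reward of Slow and that breaking down ends the game.

```python
# Illustrative transition and reward tables for the racing MDP.
# Probabilities and reward values are assumed, not given in the slides.

# T[(s, a)] -> list of (s', P(s' | s, a)) pairs
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("overheated", 1.0)],  # going fast while warm breaks down
}

# R[(s, a)] -> reward; Fast is worth double Slow
R = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("warm", "slow"): 1.0,
    ("warm", "fast"): 2.0,
}

TERMINAL = {"overheated"}  # break-down: game over
```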
Racing Search Tree
MDP Search Trees
Each MDP state projects an expectimax-like search tree

• s is a state
• (s, a) is a q-state
• (s, a, s’) is a transition
• T(s, a, s’) = P(s’ | s, a) is its probability, and R(s, a, s’) its reward
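A depth-limited, expectimax-style value computation over such a tree might look like the sketch below. The helper names are my own; `transitions(s, a)` and `reward(s, a, s2)` are assumed to behave like T and R above.

```python
# Expectimax-style search over an MDP search tree (depth-limited sketch).
# State nodes take a max over actions; q-state nodes take an expectation
# over transitions (s, a, s').

def value(s, depth, actions, transitions, reward, is_terminal):
    """Value of a state node: max over its q-state children."""
    if is_terminal(s) or depth == 0:
        return 0.0
    return max(q_value(s, a, depth, actions, transitions, reward, is_terminal)
               for a in actions(s))

def q_value(s, a, depth, actions, transitions, reward, is_terminal):
    """Value of a q-state (s, a): expectation over transitions (s, a, s')."""
    return sum(p * (reward(s, a, s2) +
                    value(s2, depth - 1, actions, transitions, reward, is_terminal))
               for s2, p in transitions(s, a))
```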
Utilities of Sequences
• What preferences should an agent have over reward
sequences?

• More or less? [1,2,2] or [2,3,4]

• Now or later? [0,0,1] or [1,0,0]


Discounting (gamma)
• It’s reasonable to maximise the sum of rewards
• It’s also reasonable to prefer rewards now to rewards later
• One solution: values of rewards decay exponentially!

A reward is worth 1 now, γ one step from now, and γ² two steps from now:
U([r0, r1, r2, …]) = r0 + γ·r1 + γ²·r2 + …
Discounting
• How to discount?
• Each time we descend a level,
we multiply in the discount once
• Why discount?
• Sooner rewards probably do
have higher utility than later
rewards
• Also helps our algorithms
converge
• Example: discount of 0.5
• U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 = 2.75
• U([3,2,1]) = 3*1 + 0.5*2 + 0.25*1 = 4.25
• U([1,2,3]) < U([3,2,1])
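A short sketch (not lecture code) that computes discounted utility and reproduces the γ = 0.5 example above:

```python
# Discounted utility of a reward sequence: each step down the tree multiplies
# in the discount once, so reward r_t is worth gamma**t * r_t.

def discounted_utility(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], 0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1], 0.5))  # 3 + 0.5*2 + 0.25*1 = 4.25
# With gamma = 0.5 the agent prefers [3, 2, 1] over [1, 2, 3].
```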
