Slides
Rattapoom Waranusast
These slides are adapted from the CS188 Introduction to Artificial Intelligence (Fall 2019) class at UC Berkeley.
The original slides were created by Dan Klein and Pieter Abbeel of UC Berkeley (ai.berkeley.edu).
A maze-like problem
• The agent lives in a grid
• Walls block the agent’s path
• Noisy movement: actions do not always go as planned
  – 80% of the time, the action North takes the agent North (if there is no wall there)
  – 10% of the time, North takes the agent West; 10% East
  – If there is a wall in the direction the agent would have been taken, the agent stays put
• The agent receives rewards each time step
  – Small “living” reward each step (can be negative)
  – Big rewards come at the end (good or bad)
• Goal: maximize sum of rewards
Source: D. Klein, P. Abbeel, ai.berkeley.edu
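The noisy movement model above can be made concrete with a small sketch. Everything below (the function name, the grid-of-strings encoding with '#' for walls) is my own illustration, not part of the original slides:

LEFT_OF  = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT_OF = {"N": "E", "E": "S", "S": "W", "W": "N"}
OFFSET   = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def noisy_transitions(grid, state, action):
    """Return (next_state, probability) pairs: 80% intended direction, 10% to each side."""
    outcomes = {}
    for direction, prob in [(action, 0.8), (LEFT_OF[action], 0.1), (RIGHT_OF[action], 0.1)]:
        (r, c), (dr, dc) = state, OFFSET[direction]
        nr, nc = r + dr, c + dc
        # A wall (or the grid edge) in that direction means the agent stays put.
        if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])) or grid[nr][nc] == "#":
            nr, nc = r, c
        outcomes[(nr, nc)] = outcomes.get((nr, nc), 0.0) + prob
    return sorted(outcomes.items())

For example, noisy_transitions(["....", ".#..", "...."], (0, 0), "N") gives [((0, 0), 0.9), ((0, 1), 0.1)]: both the intended move North and the slip West hit the boundary, so the agent stays put with probability 0.9.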
• “Markov” generally means that given the present state, the future and the past are independent
  (Andrey Markov, 1856–1922)
• This is just like search, where the successor function could only depend on the current state (not the history)
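For an MDP this conditional independence can be written out explicitly (a standard statement of the Markov property, not taken verbatim from the slide text):

P(S_{t+1} = s’ | S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, ..., S_0 = s_0) = P(S_{t+1} = s’ | S_t = s_t, A_t = a_t)

That is, once the current state (and action) is known, earlier states and actions add no further information about where the agent goes next.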
• s is a state
• (s, a) is a q-state
• (s, a, s’) is called a transition
  – T(s, a, s’) = P(s’ | s, a) is the transition probability
  – R(s, a, s’) is the reward for that transition
Utilities of Sequences
• What preferences should an agent have over reward sequences?
• More or less? [1, 2, 2] or [2, 3, 4]
• Now or later? [0, 0, 1] or [1, 0, 0]
Discounting
• It’s reasonable to maximize the sum of rewards
• It’s also reasonable to prefer rewards now to rewards later
• One solution: values of rewards decay exponentially
• How to discount?
  – Each time we descend a level, we multiply in the discount once
• Why discount?
  – Sooner rewards probably do have higher utility than later rewards
  – Also helps our algorithms converge
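For instance, with discount γ = 0.5 the sequence [1, 2, 3] is worth 1 + 0.5·2 + 0.25·3 = 2.75, while [3, 2, 1] is worth 3 + 0.5·2 + 0.25·1 = 4.25. A small sketch of this computation (the helper name is mine, not from the slides):

def discounted_return(rewards, gamma):
    """Sum rewards, scaling the reward received t steps from now by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1, 2, 3], gamma=0.5))   # 2.75
print(discounted_return([3, 2, 1], gamma=0.5))   # 4.25: earlier rewards count for more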
• Discounted utility: U([r0, r1, r2, ...]) = r0 + γ·r1 + γ²·r2 + ...

• Quiz 3: For which discount γ are West and East equally good when in state d?
Infinite Utilities?
• Problem: What if the game lasts forever? Do we get infinite rewards?
• Solutions:
  – Finite horizon (similar to depth-limited search):
    Terminate episodes after a fixed T steps (e.g. life)
    Gives nonstationary policies (π depends on the time left)
  – Discounting: use 0 < γ < 1
    Smaller γ means a smaller “horizon” – shorter-term focus
  – Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “overheated” for racing)

Recap: Defining MDPs
• Markov decision processes:
  – Set of states S
  – Start state s0
  – Set of actions A
  – Transitions P(s’|s,a) (or T(s,a,s’))
  – Rewards R(s,a,s’) (and discount γ)
• MDP quantities so far:
  – Policy = choice of action for each state
  – Utility = sum of (discounted) rewards
Source: D. Klein, P. Abbeel, ai.berkeley.edu
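A quick check of why discounting keeps utilities finite (a standard geometric-series argument, not spelled out in the slide text): if every step reward is bounded by R_max and 0 < γ < 1, then

U([r0, r1, r2, ...]) = Σ_t γ^t·r_t ≤ R_max·(1 + γ + γ² + ...) = R_max / (1 − γ),

so the discounted utility of even an infinite episode is bounded.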
Optimal Quantities
• s is a state; (s, a) is a q-state
• The value (utility) of a state s:
  V*(s) = expected utility starting in s and acting optimally
• The value (utility) of a q-state (s, a):
  Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
• The optimal policy:
  π*(s) = optimal action from state s
(Gridworld demo snapshots: Noise = 0, Discount = 1, Living reward = 0. Source: D. Klein, P. Abbeel, ai.berkeley.edu)
Values of States
• Recursive definition of value:
  V*(s) = max_a Q*(s, a)
  Q*(s, a) = Σ_s’ T(s, a, s’) · [ R(s, a, s’) + γ·V*(s’) ]
  where T(s, a, s’) is the transition function and R(s, a, s’) is the reward
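These one-step relationships translate directly into code. Below is a minimal sketch, assuming T and R are stored as plain dictionaries (the names q_value and state_value and the dictionary encoding are my own illustration):

# T[(s, a)] is a list of (s_prime, prob) pairs; R[(s, a, s_prime)] is a number.
def q_value(s, a, V, T, R, gamma):
    """Q(s, a): expected reward plus discounted value of the landing state."""
    return sum(p * (R[(s, a, sp)] + gamma * V[sp]) for sp, p in T[(s, a)])

def state_value(s, available_actions, V, T, R, gamma):
    """V(s): best q-value over the available actions (0 for a terminal state)."""
    return max((q_value(s, a, V, T, R, gamma) for a in available_actions(s)), default=0.0)

Plugging V* in for V would make these the optimal-value equations above; in practice V is computed iteratively, which is exactly what value iteration does below.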
• Idea: Do a depth-limited expectimax computation, but with increasing depths until the change is small
• Note: deep parts of the tree eventually don’t matter if γ < 1
Source: D. Klein, P. Abbeel, ai.berkeley.edu
Time-Limited Values
• Define Vk(s) to be the optimal value of s if the game ends in k more time steps
• Equivalently, it’s what a depth-k expectimax would give from s

k = 0: no time steps remain, so no state can reach an exit and every value is 0
(Gridworld demo: Noise = 0.2, Discount = 0.9, Living reward = 0. Source: D. Klein, P. Abbeel, ai.berkeley.edu)
(Gridworld demo snapshots for k = 1 through k = 12: Noise = 0.2, Discount = 0.9, Living reward = 0)
k = 100: the values have converged (they no longer change from one iteration to the next)

Computing Time-Limited Values
• Compute each layer Vk bottom-up from Vk−1, caching the results instead of re-expanding the tree
Value Iteration
• Start with V0(s) = 0: no time steps left means an expected reward sum of zero
• Given a vector of Vk(s) values, do one ply of expectimax from each state:
  Vk+1(s) = max_a Σ_s’ T(s, a, s’) · [ R(s, a, s’) + γ·Vk(s’) ]
• Repeat until convergence
• Complexity of each iteration: O(S²A)
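A compact sketch of this loop, assuming the MDP is given as plain dictionaries (the function name, the tolerance parameter, and the dictionary encoding are my own, consistent with the earlier sketch):

def value_iteration(states, actions, T, R, gamma, tol=1e-6, max_iters=1000):
    """
    states:  iterable of states
    actions: dict mapping each state to its list of actions ([] for terminal states)
    T:       dict mapping (s, a) to a list of (s_prime, probability) pairs
    R:       dict mapping (s, a, s_prime) to a reward
    Returns a dict of (approximately) converged state values.
    """
    V = {s: 0.0 for s in states}                       # V0(s) = 0
    for _ in range(max_iters):
        newV = {}
        for s in states:
            newV[s] = max(                             # one ply of expectimax from s
                (sum(p * (R[(s, a, sp)] + gamma * V[sp]) for sp, p in T[(s, a)])
                 for a in actions[s]),
                default=0.0,                           # terminal state: no actions, value 0
            )
        if max(abs(newV[s] - V[s]) for s in states) < tol:   # repeat until convergence
            return newV
        V = newV
    return V

Each sweep touches every state, every action, and every successor state, which is where the O(S²A) per-iteration cost comes from.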
Example: Value Iteration (the racing MDP: states Cool, Warm, Overheated; actions Slow and Fast)
• Assume no discount!
• Starting from V0 and applying the update twice:
  V0: Cool = 0,   Warm = 0,   Overheated = 0
  V1: Cool = 2,   Warm = 1,   Overheated = 0
  V2: Cool = 3.5, Warm = 2.5, Overheated = 0
• For example:
  V1(Cool) = max[ Slow: 1 + 0, Fast: 0.5·(2 + 0) + 0.5·(2 + 0) ] = 2
  V2(Cool) = max[ Slow: 1 + 2, Fast: 0.5·(2 + 2) + 0.5·(2 + 1) ] = 3.5
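The same numbers can be checked with a few lines of code, assuming the standard racing dynamics used in the Berkeley slides (Slow from Cool: reward +1, stay Cool; Fast from Cool: +2, 50/50 Cool or Warm; Slow from Warm: +1, 50/50 Cool or Warm; Fast from Warm: −10, go to Overheated, which is terminal). The dictionary encoding matches the earlier sketches and is my own illustration:

states = ["cool", "warm", "overheated"]
actions = {"cool": ["slow", "fast"], "warm": ["slow", "fast"], "overheated": []}
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("overheated", 1.0)],
}
R = {
    ("cool", "slow", "cool"): 1,
    ("cool", "fast", "cool"): 2, ("cool", "fast", "warm"): 2,
    ("warm", "slow", "cool"): 1, ("warm", "slow", "warm"): 1,
    ("warm", "fast", "overheated"): -10,
}
gamma = 1.0   # "Assume no discount!"

V = {s: 0.0 for s in states}                      # V0 = (0, 0, 0)
for k in range(2):                                # apply the Vk+1 update twice
    V = {
        s: max((sum(p * (R[(s, a, sp)] + gamma * V[sp]) for sp, p in T[(s, a)])
                for a in actions[s]), default=0.0)
        for s in states
    }
print(V)   # {'cool': 3.5, 'warm': 2.5, 'overheated': 0.0}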