NPTEL
Video Course on Machine Learning
Professor Carl Gustaf Jansson, KTH
Week 5: Machine Learning enabled by prior Theories
Video 5.4 Reinforcement Learning – Part 3 Q-learning
Q Learning
Q-learning is a model-free off-policy TD reinforcement learning algorithm.
The goal of Q-learning is to learn a policy, which tells an agent what action to take under what circumstances.
For any finite Markov decision process (FMDP), Q-learning finds a policy that is optimal in the sense that it
maximizes the expected value of the total reward over any and all successive steps, starting from the current
state.
Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time
and a partly-random policy.
"Q" names the function Q(s,a) that can be said to stand for the "quality" of an action a taken in a given state s.
Suppose we have the optimal Q-function Q*(s, a); then the optimal policy in state s is argmax_a Q*(s, a).
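For illustration, a minimal sketch of this policy extraction in Python (assuming, hypothetically, that the Q-function is stored as a dictionary keyed by (state, action) pairs):

def greedy_action(Q, s, actions):
    # the policy derived from Q picks, in state s, the action with the largest Q(s, a)
    return max(actions, key=lambda a: Q[(s, a)])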
Q-learning Algorithm
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a for s using a policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'
    until s is terminal
With α = 1 the updating formula simplifies to
Q(s, a) ← r + γ max_a' Q(s', a')
and with α = 1 and γ = 1 it simplifies further to
Q(s, a) ← r + max_a' Q(s', a')
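A minimal Python sketch of the algorithm above (not the lecture's own code; the environment interface with reset() and step(s, a) returning (r, s', done) is an assumption made for illustration):

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=1.0, gamma=0.5, epsilon=0.1):
    Q = defaultdict(float)                       # Q(s, a), initialized to 0

    def greedy(s):
        # action with the largest Q-value in state s
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()                          # initialize s
        done = False
        while not done:                          # for each step of the episode
            # epsilon-greedy choice: explore with probability epsilon, otherwise exploit
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            r, s_next, done = env.step(s, a)     # take action a, observe r, s'
            # off-policy TD target: value of the best action in the next state
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q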
Example

[Figure: a 4 x 5 grid world in which individual moves are rewarded with r = 8, r = 0 or r = -8]
States and Actions

States s, arranged in a 4 x 5 grid:
 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
16 17 18 19 20

Actions a: N (north), S (south), E (east), W (west)
Assume that α=1 and γ = 0.5
Initializing the Q(s, a) function

Action\State    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
     N          0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     S          0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     W          0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     E          0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
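In code this initialization is simply a zero-filled table; a one-line sketch using the dictionary representation assumed earlier:

# every Q(s, a) starts at 0, exactly as in the table above
Q = {(s, a): 0.0 for s in range(1, 21) for a in ["N", "S", "E", "W"]}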
An Episode

[Figure: the path taken by the agent through the 4 x 5 grid during the first episode]
Calculating new Q(s, a) values
1st step:
2nd step:
3rd step:
4th step:
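The step-by-step calculations all use the simplified update Q(s, a) ← r + 0.5 max_a' Q(s', a'). Because every entry of the table is still 0, steps with r = 0 leave it unchanged (0 + 0.5 · 0 = 0); judging from the table below, only the final step of the episode, moving north from state 7 into a penalty cell, has any effect: Q(7, N) ← -8 + 0.5 · 0 = -8.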
The Q(s, a) function after the first episode

Action\State    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
     N          0    0    0    0    0    0   -8    0    0    0    0    0    0    0    0    0    0    0    0    0
     S          0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     W          0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     E          0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
A second episode

[Figure: the path taken by the agent through the 4 x 5 grid during the second episode]
Calculating new Q(s, a) values
1st step:
2nd step:
3rd step:
4th step:
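As in the first episode, intermediate steps with r = 0 leave the table unchanged, since no successor state has a positive Q-value yet. Judging from the resulting table, the second episode ended by moving east from state 9 into the goal cell, giving Q(9, E) ← 8 + 0.5 · 0 = 8.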
The Q(s, a) function after the second episode

Action\State    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
     N          0    0    0    0    0    0   -8    0    0    0    0    0    0    0    0    0    0    0    0    0
     S          0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     W          0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     E          0    0    0    0    0    0    0    0    8    0    0    0    0    0    0    0    0    0    0    0
The Q(s, a) function after a few episodes

Action\State    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
     N          0    0    0    0    0    0   -8   -8   -8    0    0    1    2    4    0    0    0    0    0    0
     S          0    0    0    0    0    0  0.5    1    2    0    0   -8   -8   -8    0    0    0    0    0    0
     W          0    0    0    0    0    0   -8    1    2    0    0   -8  0.5    1    0    0    0    0    0    0
     E          0    0    0    0    0    0    2    4    8    0    0    1    2   -8    0    0    0    0    0    0
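Note how the goal reward has propagated backwards with γ = 0.5: Q(9, E) = 8 next to the goal, then Q(8, E) = 0 + 0.5 · 8 = 4 and Q(7, E) = 0 + 0.5 · 4 = 2, and likewise Q(14, N) = 4, Q(13, N) = Q(13, E) = 2 and Q(12, N) = Q(12, E) = 1, while every action that steps into a penalty cell keeps the value -8.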
One of the optimal policies

Action\State    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
     N          0    0    0    0    0    0   -8   -8   -8    0    0    1    2    4    0    0    0    0    0    0
     S          0    0    0    0    0    0  0.5    1    2    0    0   -8   -8   -8    0    0    0    0    0    0
     W          0    0    0    0    0    0   -8    1    2    0    0   -8  0.5    1    0    0    0    0    0    0
     E          0    0    0    0    0    0    2    4    8    0    0    1    2   -8    0    0    0    0    0    0
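The policy is read off the table by taking, in each state, the action with the largest Q-value: for example E in states 7, 8 and 9 (values 2, 4 and 8) and N in state 14 (value 4), which leads the agent towards the goal.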
An optimal policy graphically

[Figure: the first optimal policy drawn as arrows on the 4 x 5 grid]
Another of the optimal policies

Action\State    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
     N          0    0    0    0    0    0   -8   -8   -8    0    0    1    2    4    0    0    0    0    0    0
     S          0    0    0    0    0    0  0.5    1    2    0    0   -8   -8   -8    0    0    0    0    0    0
     W          0    0    0    0    0    0   -8    1    2    0    0   -8  0.5    1    0    0    0    0    0    0
     E          0    0    0    0    0    0    2    4    8    0    0    1    2   -8    0    0    0    0    0    0
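This second policy exists because some best actions tie: in state 12 both N and E have value 1, and in state 13 both N and E have value 2, so either choice gives the same expected return.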
Another optimal policy graphically

[Figure: the second optimal policy drawn as arrows on the 4 x 5 grid]
Thanks for your attention!
The next lecture 5.5 will be on the topic:
Case Based Reasoning