NPTEL: AI Assignment-11 Solutions
Reinforcement Learning
Solution Q1) B, Follows from slides
Solution Q2) BC, Follows from slides
Solution Q3) AD
Solution Q4) A, Follows from the equation of feature-based Q-learning
Solution Q5) BC
a. Incorrect. Temporal difference (TD) learning is a model-free reinforcement learning
technique: unlike model-based approaches, it doesn't require knowledge of the
underlying model of the environment.
b. Correct. Follows from slides
c. Correct. In temporal difference learning, the value of a state is updated incrementally
using the TD error: the current estimate is nudged toward a target value formed from
the observed reward and the estimated value of the next state.
d. Incorrect. The TD error is the difference between the target value (based on the
observed reward and the next-state value) and the current estimate of the state's
value. It is not the difference between the old and new values; rather, it represents
the discrepancy between what was expected and what was actually observed.
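The TD(0) update described in (c) and (d) can be sketched in a few lines. This is an illustrative example with made-up states, rewards, and step size, not values from any assignment question:

```python
# Tabular TD(0) value update: nudge V(s) toward the target r + gamma * V(s').
def td0_update(V, s, r, s_next, alpha=0.5, gamma=1.0):
    # TD error = target minus current estimate
    td_error = r + gamma * V[s_next] - V[s]
    V[s] = V[s] + alpha * td_error
    return V

V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", r=2.0, s_next="B")
# Target = 2 + 1*1.0 = 3; V(A) moves halfway (alpha = 0.5) from 0 toward 3.
print(V["A"])  # 1.5
```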
Solution Q6) ACD
- Factual, discussed in videos
Solution Q7) 3
The state-action pair (B2, R) is seen 3 times, and all 3 times we end up in state B3; hence
x = T(B2, R, B3) = 3/3 = 1
The state-action pair (B3, U) is seen 3 times, and only 1 of those times do we end up in
state C3; hence y = T(B3, U, C3) = 1/3
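The count-based estimate used here can be sketched as follows. The (B2, R) and (B3, U) counts mirror the solution; the two (B3, U) outcomes other than C3 are placeholders, since the solution states only the counts, not the other landing states:

```python
from collections import Counter, defaultdict

# Tally observed transitions per (state, action) pair.
counts = defaultdict(Counter)
transitions = [
    ("B2", "R", "B3"), ("B2", "R", "B3"), ("B2", "R", "B3"),  # 3 of 3 -> B3
    ("B3", "U", "C3"), ("B3", "U", "B3"), ("B3", "U", "B2"),  # 1 of 3 -> C3
]
for s, a, s_next in transitions:
    counts[(s, a)][s_next] += 1

def T_hat(s, a, s_next):
    """Empirical transition probability: count(s,a,s') / count(s,a)."""
    c = counts[(s, a)]
    return c[s_next] / sum(c.values())

print(T_hat("B2", "R", "B3"))  # x = 3/3 = 1.0
print(T_hat("B3", "U", "C3"))  # y = 1/3
```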
Solution Q8) 46
- We visit A1 twice: first in the first simulation, from where the reward collected
before reaching a terminal state is -9 + 100 = 91, and then in the second simulation,
from where it is -5 - 100 = -105. Hence w = (91 - 105)/2 = -7.
- Similarly, we visit B1 twice: first in the first simulation, from where the reward
collected is -8 + 100 = 92, and then in the second simulation, from where it is
-4 - 100 = -104. Hence x = (92 - 104)/2 = -6.
- We visit B2 thrice: twice in the first simulation and once in the second. The rewards
collected before reaching a terminal state are -7 + 100 = 93, -3 + 100 = 97, and
-3 - 100 = -103. Hence y = (93 + 97 - 103)/3 = 29.
- We visit B3 thrice: twice in the first simulation and once in the second. The rewards
collected are -6 + 100 = 94, -2 + 100 = 98, and -2 - 100 = -102.
Hence z = (94 + 98 - 102)/3 = 30.
- w + x + y + z = -7 - 6 + 29 + 30 = 46
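The averaging above is every-visit direct (Monte Carlo) evaluation: V(s) is the mean of the returns observed from each visit to s. A minimal sketch using the per-visit returns worked out in the bullets:

```python
# Per-visit returns for each state, as computed in the solution bullets.
returns = {
    "A1": [91, -105],
    "B1": [92, -104],
    "B2": [93, 97, -103],
    "B3": [94, 98, -102],
}

# Direct evaluation: average the returns observed from each visit.
values = {s: sum(g) / len(g) for s, g in returns.items()}
print(values)                # {'A1': -7.0, 'B1': -6.0, 'B2': 29.0, 'B3': 30.0}
print(sum(values.values()))  # w + x + y + z = 46.0
```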
Solution Q9) -0.16
The state-action pair (c, RIGHT) is experienced twice, and hence Q(c, RIGHT) will be
updated twice (with α = 0.8, as the arithmetic below implies).
At the first update, a collision happens, so a reward of -1 is received:
Q(c, RIGHT) = (1 - α)·Q(c, RIGHT) + α(R + Q(c, RIGHT))
= 0.2 × 0 + 0.8 × (-1)
= -0.8
At the second update:
Q(c, RIGHT) = (1 - α)·Q(c, RIGHT) + α(R + Q(d, UP))
= 0.2 × (-0.8) + 0.8 × 0
= -0.16
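The two updates can be checked numerically. Here γ = 1 and the zero bootstrap values Q(c, RIGHT) = 0 and Q(d, UP) = 0 are assumptions read off from the solution's arithmetic:

```python
alpha = 0.8
Q = 0.0  # initial Q(c, RIGHT)

# First update: collision gives R = -1; the bootstrap term Q(c, RIGHT) is 0.
Q = (1 - alpha) * Q + alpha * (-1 + 0.0)
print(round(Q, 2))  # -0.8

# Second update: R = 0 and the bootstrap term Q(d, UP) is 0.
Q = (1 - alpha) * Q + alpha * (0 + 0.0)
print(round(Q, 2))  # -0.16
```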
Solution Q10) 0.025
Q(c,RIGHT)= -0.16
Q(c,UP)=Q(c,DOWN)=Q(c,LEFT)=0
Since Q(c, UP) = Q(c, DOWN) = Q(c, LEFT) = 0 > Q(c, RIGHT) = -0.16, the greedy choice
under the epsilon-greedy policy is one of UP, DOWN, LEFT. RIGHT can be taken only
through the exploration branch, i.e., with probability epsilon/4; for epsilon = 0.1,
this equals 0.025.
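A quick sketch of the probability computation, assuming epsilon = 0.1 (the value consistent with the stated answer) and 4 available actions:

```python
epsilon = 0.1
Q = {"UP": 0.0, "DOWN": 0.0, "LEFT": 0.0, "RIGHT": -0.16}

# Epsilon-greedy: with probability epsilon, pick uniformly among all 4 actions;
# otherwise pick a greedy (max-Q) action. RIGHT is not greedy, so it can only
# be chosen through the uniform exploration branch.
p_right = epsilon / len(Q)
print(p_right)  # 0.025
```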