AI Assignment 11: RL Solutions

The document provides solutions to 10 questions related to reinforcement learning. The solutions discuss concepts like temporal difference learning, state-action pairs, Q-learning updates, and epsilon-greedy policies.

NPTEL: AI Assignment-11 Solutions

Reinforcement Learning

Solution Q1) B, Follows from slides

Solution Q2) BC, Follows from slides

Solution Q3) AD

Solution Q4) A, Follows from the equation of feature-based Q-learning

Solution Q5) BC

a. Incorrect. Temporal difference (TD) learning is a model-free reinforcement learning
technique: it does not require knowledge of the underlying model of the environment,
unlike model-based approaches.
b. Correct. Follows from slides.
c. Correct. In temporal difference learning, the value of a state is updated incrementally
based on the TD error. The value estimate is adjusted towards a target value that
combines the observed reward with the estimated value of the next state.
d. Incorrect. The TD error is defined as the difference between the current estimate of a
state's value and the target value (based on the observed reward and the next-state
value). It is not the difference between the old and new values; rather, it represents
the discrepancy between what was expected and what was actually observed.
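The TD(0) update described in (c) and (d) can be sketched as follows. The states, reward, and step size here are illustrative assumptions, not values from the assignment:

```python
# TD(0) value update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
def td0_update(V, s, r, s_next, alpha=0.5, gamma=1.0):
    td_error = r + gamma * V[s_next] - V[s]  # observed target minus current estimate
    V[s] = V[s] + alpha * td_error           # move the estimate towards the target
    return td_error

V = {"A": 0.0, "B": 0.0}
err = td0_update(V, "A", r=-1.0, s_next="B")  # one observed transition A -> B
```

Note that the update is model-free, matching (a): it uses only the sampled transition, never a transition function.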

Solution Q6) ACD
- Factual, discussed in the videos

Solution Q7) 3
The state-action pair (B2, R) is seen 3 times, and all 3 times we end up in state B3; hence
x = T(B2, R, B3) = 3/3 = 1.
The state-action pair (B3, U) is seen 3 times, and only 1 time do we end up in state C3;
hence y = T(B3, U, C3) = 1/3.
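This counting is a maximum-likelihood estimate of the transition model. A minimal sketch; the two unspecified (B3, U) destinations are placeholders, since the solution does not name them:

```python
from collections import Counter

def estimate_T(transitions):
    """Empirical T(s, a, s') = count(s, a, s') / count(s, a)."""
    sa_counts = Counter((s, a) for s, a, _ in transitions)
    sas_counts = Counter(transitions)
    return {(s, a, s2): n / sa_counts[(s, a)]
            for (s, a, s2), n in sas_counts.items()}

obs = [
    ("B2", "R", "B3"), ("B2", "R", "B3"), ("B2", "R", "B3"),
    ("B3", "U", "C3"),
    # the other two (B3, U) outcomes are not specified in the solution;
    # "other" is a placeholder destination
    ("B3", "U", "other"), ("B3", "U", "other"),
]
T = estimate_T(obs)
```

Here T[("B2", "R", "B3")] recovers x = 1 and T[("B3", "U", "C3")] recovers y = 1/3.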

Solution Q8) 46
- We visit A1 twice: the first time in the first simulation, from where the reward collected
before reaching a terminal state is -9 + 100 = 91; the second time in the second
simulation, from where it is -5 - 100 = -105. Hence w = (91 - 105)/2 = -7.
- Similarly, we visit B1 twice: the first time in the first simulation, from where the reward
collected before reaching a terminal state is -8 + 100 = 92; the second time in the
second simulation, from where it is -4 - 100 = -104. Hence x = (92 - 104)/2 = -6.
- We visit B2 thrice: twice in the first simulation and once in the second. The rewards
collected before reaching a terminal state are -7 + 100 = 93, -3 + 100 = 97, and
-3 - 100 = -103. Hence y = (93 + 97 - 103)/3 = 29.
- We visit B3 thrice: twice in the first simulation and once in the second. The rewards
collected before reaching a terminal state are -6 + 100 = 94, -2 + 100 = 98, and
-2 - 100 = -102. Hence z = (94 + 98 - 102)/3 = 30.
- w + x + y + z = -7 - 6 + 29 + 30 = 46.
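The per-state averaging above is every-visit Monte Carlo (direct) utility estimation. A sketch that reproduces the arithmetic, using the returns listed in the solution:

```python
# Returns observed from each visit to a state (from the two simulations above)
returns = {
    "A1": [91, -105],       # -> w
    "B1": [92, -104],       # -> x
    "B2": [93, 97, -103],   # -> y
    "B3": [94, 98, -102],   # -> z
}

# Each state's value estimate is the average of its observed returns
averages = {s: sum(rs) / len(rs) for s, rs in returns.items()}
total = sum(averages.values())  # w + x + y + z
```

Running this gives averages of -7, -6, 29, and 30, summing to 46 as computed above.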

Solution Q9) -0.16


The state-action pair (c, RIGHT) is experienced twice, and hence Q(c, RIGHT) is
updated twice. The arithmetic below implies α = 0.8 and γ = 1.

At the first update, a collision happens, so a reward of -1 is received:

Q(c, RIGHT) = (1 - α)·Q(c, RIGHT) + α·(R + Q(c, RIGHT))
            = 0.2 × 0 + 0.8 × (-1)
            = -0.8

At the second update:

Q(c, RIGHT) = (1 - α)·Q(c, RIGHT) + α·(R + Q(d, UP))
            = 0.2 × (-0.8) + 0.8 × 0
            = -0.16
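The two updates can be checked with a small Q-learning step (α = 0.8 and γ = 1, as implied by the arithmetic above; the next states are assumed from the solution's targets):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.8, gamma=1.0):
    # Q-learning: target is reward plus best next-state value
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

actions = ["UP", "DOWN", "LEFT", "RIGHT"]
Q = {(s, a): 0.0 for s in ["c", "d"] for a in actions}

q_update(Q, "c", "RIGHT", r=-1, s_next="c", actions=actions)  # collision: stay in c
q_update(Q, "c", "RIGHT", r=0, s_next="d", actions=actions)   # second visit: reach d
```

After the first call Q(c, RIGHT) is -0.8, and after the second it is -0.16, matching the derivation.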

Solution Q10) 0.025

Q(c, RIGHT) = -0.16
Q(c, UP) = Q(c, DOWN) = Q(c, LEFT) = 0
The greedy step of the epsilon-greedy policy will pick one of UP, DOWN, LEFT, since they
have the highest Q-values. RIGHT is taken only during exploration, with probability
epsilon/4 = 0.1/4 = 0.025.
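The ε/4 figure follows from the standard ε-greedy action distribution; a minimal sketch with ε = 0.1, as implied by the answer 0.025:

```python
def action_probs(Q_s, epsilon=0.1):
    """Epsilon-greedy distribution over actions: explore uniformly with
    probability epsilon, otherwise pick a greedy (highest-Q) action,
    breaking ties uniformly."""
    actions = list(Q_s)
    best = max(Q_s.values())
    greedy = [a for a in actions if Q_s[a] == best]
    p = {a: epsilon / len(actions) for a in actions}  # exploration share
    for a in greedy:
        p[a] += (1 - epsilon) / len(greedy)           # greedy share
    return p

p = action_probs({"UP": 0.0, "DOWN": 0.0, "LEFT": 0.0, "RIGHT": -0.16})
```

Here p["RIGHT"] is ε/4 = 0.025, since RIGHT is never the greedy choice.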
