
MDP Solving

GUETTICHE Mourad

1. Value iteration
The Bellman equation gives us a recursive definition of the optimal value:
V*(s) = max_{a∈A} Σ_{s'∈S} P(s,a,s') (R(s,a,s') + γ V*(s'))
The algorithm consists of:
Initializing V_0(s) = 0 for every state s.
Iteratively computing V*(s) via dynamic programming until convergence.

1. Value iteration
State-value algorithm
for each s ∈ S:
    Initialize V_0(s) = 0.
End.
Repeat until convergence:
    for each s ∈ S:
        V(s) = max_{a∈A} { Σ_{s'∈S} P(s,a,s') (R(s,a,s') + γ V(s')) }
    End.
1. Value iteration
Policy extraction:
for each s ∈ S:
    π(s) = argmax_{a∈A} { Σ_{s'∈S} P(s,a,s') (R(s,a,s') + γ V(s')) }
End.
Algorithm complexity per iteration: O(|S|²|A|).
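As an illustration, here is a minimal NumPy sketch of value iteration with greedy policy extraction. It assumes the MDP is given as dense arrays P[s, a, s'] and R[s, a, s'] (the array layout, function name, tolerance tol and default γ are illustrative choices, not taken from the slides).

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    # P[s, a, s2] = P(s, a, s'),  R[s, a, s2] = R(s, a, s')
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                        # V_0(s) = 0 for all states
    while True:
        # Q[s, a] = sum_{s'} P(s, a, s') * (R(s, a, s') + gamma * V(s'))
        Q = np.sum(P * (R + gamma * V[None, None, :]), axis=2)
        V_new = Q.max(axis=1)                     # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:       # stop when values have converged
            break
        V = V_new
    policy = Q.argmax(axis=1)                     # greedy policy extraction
    return V_new, policy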

1. Value iteration

V_{i+1}(s) = max_{a∈A} Σ_{s'∈S} P(s,a,s') (R(s,a,s') + γ V_i(s'))
Example with γ = 0.9, P(s,a,s') = 0.8, R(s,a,s') = 0.
[Grid-world figure: intermediate state values 0.52, 0.72 and 0.43, with terminal rewards +1 and -1.]
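Presumably the displayed values come from this update with a single successor reached with probability 0.8: the state next to the +1 exit gets V = 0.8 * (0 + 0.9 * 1) = 0.72, and its neighbour V = 0.8 * (0 + 0.9 * 0.72) ≈ 0.52.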

2. Policy iteration
Policy iteration algorithm
Initialize π randomly.
Repeat until no change in π:
    (Policy evaluation) Repeat until convergence:
        for each s ∈ S:
            V^{π_k}(s) = Σ_{s'∈S} P(s,π_k(s),s') (R(s,π_k(s),s') + γ V^{π_k}(s'))
        End.
End.
2. Policy iteration
(Policy improvement) for each s ∈ S:
    π_{k+1}(s) = argmax_{a∈A} { Σ_{s'∈S} P(s,a,s') (R(s,a,s') + γ V^{π_k}(s')) }
End.
Algorithm complexity per iteration: O(|S|³ + |A||S|²).
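For comparison, a minimal NumPy sketch of policy iteration, under the same assumed P[s, a, s'] and R[s, a, s'] array layout as the value iteration sketch above (names and defaults are again illustrative choices):

import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-6, seed=0):
    n_states, n_actions, _ = P.shape
    rng = np.random.default_rng(seed)
    policy = rng.integers(n_actions, size=n_states)       # initialize pi randomly
    while True:
        # Policy evaluation: iterate V(s) = sum_{s'} P(s, pi(s), s') (R + gamma V(s'))
        V = np.zeros(n_states)
        P_pi = P[np.arange(n_states), policy]              # (|S|, |S|) transitions under pi
        R_pi = R[np.arange(n_states), policy]              # (|S|, |S|) rewards under pi
        while True:
            V_new = np.sum(P_pi * (R_pi + gamma * V[None, :]), axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        # Policy improvement: pi(s) = argmax_a sum_{s'} P(s, a, s') (R + gamma V(s'))
        Q = np.sum(P * (R + gamma * V[None, None, :]), axis=2)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):             # no change in pi: stop
            return V, policy
        policy = new_policy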

3. Q-Learning
We will learn about epsilon-greedy Q-learning, a well-known reinforcement learning algorithm. We will also mention some basic reinforcement learning concepts, such as temporal difference and off-policy learning, along the way. Then we will examine the exploration vs. exploitation tradeoff and epsilon-greedy action selection.

3.1 Q-Learning Algorithm
We create and fill a table storing values of state-action pairs. The table is called Q, or the Q-table, interchangeably.
Q(S, A) in our Q-table is the value of the state-action pair for state S and action A. R stands for the reward, t denotes the current time step, and t+1 denotes the next one. Alpha (α) and gamma (γ) are learning parameters.

3.1 Q-Learning Algorithm
In this case, the values of state-action pairs are calculated iteratively by the formula:

Q(S_t, A_t) = Q(S_t, A_t) + α [ R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]


This is called the action-value function or Q-function.
The function approximates the value of selecting a certain action
in a certain state.
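For instance, with illustrative numbers (not from the slides): if α = 0.5, γ = 0.9, Q(S_t, A_t) = 0, R_{t+1} = 1 and max_a Q(S_{t+1}, a) = 0.5, the update gives Q(S_t, A_t) = 0 + 0.5 * (1 + 0.9 * 0.5 - 0) = 0.725.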

3.1 Q-Learning Algorithm
The output of the algorithm is the calculated Q(S, A) values. A Q-table for N states and M actions looks like this:

        A1           A2           ...    Am
S1      Q(S1,A1)     Q(S1,A2)     ...    Q(S1,Am)
S2      Q(S2,A1)     Q(S2,A2)     ...    Q(S2,Am)
...
Sn      Q(Sn,A1)     Q(Sn,A2)     ...    Q(Sn,Am)

3.2 Q-Learning properties
- Q-learning is a model-free algorithm. We can think of model-free algorithms as trial-and-error methods. The agent explores the environment and learns from the outcomes of its actions directly, without constructing an internal model of the Markov Decision Process. In the beginning, the agent only knows the possible states and actions of the environment; it then discovers the state transitions and rewards through exploration.
- Temporal difference. In Q-learning, the Q-values stored in the Q-table are partially updated using an estimate of future value. Hence, there is no need to wait for the final reward of an episode before updating earlier state-action pair values.
3.2 Q-Learning properties

- Q-learning is an off-policy algorithm. An off-policy algorithm approximates the optimal action-value function independently of the policy being followed: in its update, the algorithm (usually) uses the next action with the best estimated reward. In this case, the action selection is not based on a possibly longer and better path, making it a short-sighted learning algorithm.

4 Epsilon-Greedy Q-Learning Algorithm
Initialization:
    Initialize Q(s, A) arbitrarily.
For each episode:
    Initialize state s.
    For each step of the episode:
        A = SELECT-ACTION(Q, s, epsilon)
        s', r, done, info = env.step(A)
        Q(s, A) = Q(s, A) + α [ r + γ max_a Q(s', a) - Q(s, A) ]
        s = s'
        If done:
            break
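Below is a minimal Python sketch of this loop. It assumes an environment object with the classic Gym-style interface shown on the slide (env.reset() returning an integer state, env.step(a) returning (s', r, done, info)); the function names, hyperparameter defaults and use of NumPy are illustrative choices.

import numpy as np

def select_action(Q, s, epsilon, n_actions, rng):
    # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
    if rng.uniform(0.0, 1.0) < epsilon:
        return int(rng.integers(n_actions))       # random action (exploration)
    return int(np.argmax(Q[s]))                   # best known action (exploitation)

def q_learning(env, n_states, n_actions, n_episodes=5000,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))           # initialize Q(s, A) (here: zeros)
    for _ in range(n_episodes):
        s = env.reset()                           # initialize state s
        done = False
        while not done:
            a = select_action(Q, s, epsilon, n_actions, rng)
            s_next, r, done, info = env.step(a)
            # Q(s, A) <- Q(s, A) + alpha * (r + gamma * max_a Q(s', a) - Q(s, A))
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q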

5 Action selection
Exploration vs. Exploitation Tradeoff. The agent initially has no or only limited knowledge of the environment. The agent can choose to explore, selecting an action with an unknown outcome to get more information about the environment. Or, it can choose to exploit, selecting an action based on its prior knowledge of the environment to get a good reward.

5 Action selection
Epsilon-Greedy Action Selection. In epsilon-greedy action selection, the agent uses both exploitation, to take advantage of prior knowledge, and exploration, to look for new options:
n = random.uniform(0, 1)
If n < epsilon:
    A = random action                   (exploration, with probability ε)
Else:
    A = best known action, max_a Q[S, a]    (exploitation, with probability 1 - ε)

6 Q-learning Parameters

Alpha (α): Alpha is a real number between zero and one (0 < α ≤ 1). If we set alpha to zero, the agent learns nothing from new actions. Conversely, if we set alpha to 1, the agent completely ignores prior knowledge and only values the most recent information. Higher alpha values make the Q-values change faster.
Gamma (γ): Gamma is the discount factor. If we set gamma to zero, the agent completely ignores future rewards. On the other hand, if we set gamma to 1, the algorithm looks for high rewards in the long term.

6 Q-learning Parameters

Epsilon (ε): In the beginning, the agent knows nothing about the environment, so it should be more likely to explore new things than to exploit its knowledge. As time steps go by, the agent gets more and more information about how the environment works, and it should then be more likely to exploit its knowledge than to explore new things. To handle this, we use a threshold that decays every episode following the exponential decay formula
ε = ε_0 * exp(-λt), where λ is called the decay constant.
At every time step t, we sample a variable uniformly over [0, 1]. If the variable is smaller than the threshold, the agent explores the environment; otherwise, it exploits its knowledge.
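A small sketch of this decay schedule, with illustrative values for ε_0 and λ (the slide does not fix them):

import numpy as np

def decayed_epsilon(t, epsilon_0=1.0, lam=0.001):
    # Exponential decay: epsilon = epsilon_0 * exp(-lambda * t)
    return epsilon_0 * np.exp(-lam * t)

# The exploration threshold shrinks as training progresses:
for t in (0, 1000, 5000):
    print(t, round(decayed_epsilon(t), 3))       # 1.0, ~0.368, ~0.007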

Thank you
