
MDP Solving

GUETTICHE Mourad

1. Value iteration
The Bellman equation gives us a recursive definition of the optimal value:
V*(s) = max_{a∈A} Σ_{s'∈S} P(s,a,s') (R(s,a,s') + γ V*(s'))
The algorithm consists of:
Initializing V_0(s) = 0 for every state s.
Iteratively computing V*(s) via dynamic programming until convergence.

1. Value iteration
State-value algorithm
for each s ∈ S:
    Initialize V_0(s) = 0.
End.
Repeat until convergence:
    for each s ∈ S:
        V(s) = max_{a∈A} { Σ_{s'∈S} P(s,a,s') (R(s,a,s') + γ V(s')) }
    End.
1. Value iteration
Policy extraction:
for each s ∈ S:
    π(s) = argmax_{a∈A} { Σ_{s'∈S} P(s,a,s') (R(s,a,s') + γ V(s')) }
End.
Algorithm complexity per iteration: O(|S|²|A|).
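As an illustration, here is a minimal NumPy sketch of value iteration with greedy policy extraction. It assumes the MDP is given as dense arrays P[s, a, s'] and R[s, a, s'] (the array layout, function name, tolerance tol and default γ are illustrative choices, not taken from the slides).

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    # P[s, a, s2] = P(s, a, s'),  R[s, a, s2] = R(s, a, s')
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                        # V_0(s) = 0 for all states
    while True:
        # Q[s, a] = sum_{s'} P(s, a, s') * (R(s, a, s') + gamma * V(s'))
        Q = np.sum(P * (R + gamma * V[None, None, :]), axis=2)
        V_new = Q.max(axis=1)                     # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:       # stop when values have converged
            break
        V = V_new
    policy = Q.argmax(axis=1)                     # greedy policy extraction
    return V_new, policy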

1. Value iteration

V_{i+1}(s) = max_{a∈A} Σ_{s'∈S} P(s,a,s') (R(s,a,s') + γ V_i(s'))
Example with γ = 0.9, P(s,a,s') = 0.8, R(s,a,s') = 0.
[Grid-world figure: intermediate state values 0.52, 0.72 and 0.43, with terminal rewards +1 and -1.]
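Presumably the displayed values come from this update with a single successor reached with probability 0.8: the state next to the +1 exit gets V = 0.8 * (0 + 0.9 * 1) = 0.72, and its neighbour V = 0.8 * (0 + 0.9 * 0.72) ≈ 0.52.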

2. Policy iteration
Policy iteration algorithm
Initialize π randomly.
Repeat until no change in π:
    (Policy evaluation) Repeat until convergence:
        for each s ∈ S:
            V^{π_k}(s) = Σ_{s'∈S} P(s,π_k(s),s') (R(s,π_k(s),s') + γ V^{π_k}(s'))
        End.
End.
2. Policy iteration
(Policy improvement) for each s ∈ S:
    π_{k+1}(s) = argmax_{a∈A} { Σ_{s'∈S} P(s,a,s') (R(s,a,s') + γ V^{π_k}(s')) }
End.
Algorithm complexity per iteration: O(|S|³ + |A||S|²).
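For comparison, a minimal NumPy sketch of policy iteration, under the same assumed P[s, a, s'] and R[s, a, s'] array layout as the value iteration sketch above (names and defaults are again illustrative choices):

import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-6, seed=0):
    n_states, n_actions, _ = P.shape
    rng = np.random.default_rng(seed)
    policy = rng.integers(n_actions, size=n_states)       # initialize pi randomly
    while True:
        # Policy evaluation: iterate V(s) = sum_{s'} P(s, pi(s), s') (R + gamma V(s'))
        V = np.zeros(n_states)
        P_pi = P[np.arange(n_states), policy]              # (|S|, |S|) transitions under pi
        R_pi = R[np.arange(n_states), policy]              # (|S|, |S|) rewards under pi
        while True:
            V_new = np.sum(P_pi * (R_pi + gamma * V[None, :]), axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        # Policy improvement: pi(s) = argmax_a sum_{s'} P(s, a, s') (R + gamma V(s'))
        Q = np.sum(P * (R + gamma * V[None, None, :]), axis=2)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):             # no change in pi: stop
            return V, policy
        policy = new_policy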

3. Q-Learning
We will learn about epsilon-greedy Q-learning, a well-known reinforcement learning algorithm. We will also mention some basic reinforcement learning concepts, such as temporal difference and off-policy learning, along the way. Then we will examine the exploration vs. exploitation tradeoff and epsilon-greedy action selection.

3.1 Q-Learning Algorithm
We create and fill a table storing values of state-action pairs. The table is called Q, or the Q-table, interchangeably.
Q(S, A) in our Q-table is the value of the state-action pair for state S and action A. R stands for the reward, t denotes the current time step, and t+1 denotes the next one. Alpha (α) and gamma (γ) are learning parameters.

3.1 Q-Learning Algorithm
In this case, the values of state-action pairs are calculated iteratively by the formula:

Q(S_t, A_t) = Q(S_t, A_t) + α [ R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]


This is called the action-value function or Q-function.
The function approximates the value of selecting a certain action
in a certain state.
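For instance, with illustrative numbers (not from the slides): if α = 0.5, γ = 0.9, Q(S_t, A_t) = 0, R_{t+1} = 1 and max_a Q(S_{t+1}, a) = 0.5, the update gives Q(S_t, A_t) = 0 + 0.5 * (1 + 0.9 * 0.5 - 0) = 0.725.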

3.1 Q-Learning Algorithm
The output of the algorithm is the calculated Q(S, A) values. A Q-table for N states and M actions looks like this:

        A1           A2           ...    Am
S1      Q(S1,A1)     Q(S1,A2)     ...    Q(S1,Am)
S2      Q(S2,A1)     Q(S2,A2)     ...    Q(S2,Am)
...
Sn      Q(Sn,A1)     Q(Sn,A2)     ...    Q(Sn,Am)

3.2 Q-Learning properties
- Q-learning is a model-free algorithm. We can think of model-free algorithms as trial-and-error methods. The agent explores the environment and learns from the outcomes of its actions directly, without constructing an internal model of the Markov Decision Process. In the beginning, the agent only knows the possible states and actions of the environment; it then discovers the state transitions and rewards through exploration.
- Temporal difference. In Q-learning, the Q-values stored in the Q-table are partially updated using an estimate of future value. Hence, there is no need to wait for the final reward of an episode before updating earlier state-action pair values.
3.2 Q-Learning properties

- Q-learning is an off-policy algorithm. An off-policy algorithm approximates the optimal action-value function independently of the policy being followed: in its update, the algorithm (usually) uses the next action with the best estimated reward. In this case, the action selection is not based on a possibly longer and better path, making it a short-sighted learning algorithm.

4 Epsilon-Greedy Q-Learning Algorithm
Initialization:
    Initialize Q(s, A) arbitrarily.
For each episode:
    Initialize state s.
    For each step of the episode:
        A = SELECT-ACTION(Q, s, epsilon)
        s', r, done, info = env.step(A)
        Q(s, A) = Q(s, A) + α [ r + γ max_a Q(s', a) - Q(s, A) ]
        s = s'
        If done:
            break
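Below is a minimal Python sketch of this loop. It assumes an environment object with the classic Gym-style interface shown on the slide (env.reset() returning an integer state, env.step(a) returning (s', r, done, info)); the function names, hyperparameter defaults and use of NumPy are illustrative choices.

import numpy as np

def select_action(Q, s, epsilon, n_actions, rng):
    # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
    if rng.uniform(0.0, 1.0) < epsilon:
        return int(rng.integers(n_actions))       # random action (exploration)
    return int(np.argmax(Q[s]))                   # best known action (exploitation)

def q_learning(env, n_states, n_actions, n_episodes=5000,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))           # initialize Q(s, A) (here: zeros)
    for _ in range(n_episodes):
        s = env.reset()                           # initialize state s
        done = False
        while not done:
            a = select_action(Q, s, epsilon, n_actions, rng)
            s_next, r, done, info = env.step(a)
            # Q(s, A) <- Q(s, A) + alpha * (r + gamma * max_a Q(s', a) - Q(s, A))
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q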

5 Action selection
Exploration vs. Exploitation Tradeoff. The agent initially has no or only limited knowledge of the environment. The agent can choose to explore, selecting an action with an unknown outcome to get more information about the environment. Or, it can choose to exploit, selecting an action based on its prior knowledge of the environment to get a good reward.

5 Action selection
Epsilon-Greedy Action Selection. In epsilon-greedy action selection, the agent uses both exploitation, to take advantage of prior knowledge, and exploration, to look for new options:
n = random.uniform(0, 1)
If n < epsilon:
    A = random action                   (exploration, with probability ε)
Else:
    A = best known action, max_a Q[S, a]    (exploitation, with probability 1 - ε)

6 Q-learning Parameters

Alpha (α): Alpha is a real number between zero and one (0 < α ≤ 1). If we set alpha to zero, the agent learns nothing from new actions. Conversely, if we set alpha to 1, the agent completely ignores prior knowledge and only values the most recent information. Higher alpha values make the Q-values change faster.
Gamma (γ): Gamma is the discount factor. If we set gamma to zero, the agent completely ignores future rewards. On the other hand, if we set gamma to 1, the algorithm looks for high rewards in the long term.

6 Q-learning Parameters

Epsilon (ε): In the beginning, the agent knows nothing about the environment, so it should be more likely to explore new things than to exploit its knowledge. As time steps go by, the agent gets more and more information about how the environment works, and it should then be more likely to exploit its knowledge than to explore new things. To handle this, we use a threshold that decays every episode following the exponential decay formula
ε = ε_0 * exp(-λt), where λ is called the decay constant.
At every time step t, we sample a variable uniformly over [0, 1]. If the variable is smaller than the threshold, the agent explores the environment; otherwise, it exploits its knowledge.
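A small sketch of this decay schedule, with illustrative values for ε_0 and λ (the slide does not fix them):

import numpy as np

def decayed_epsilon(t, epsilon_0=1.0, lam=0.001):
    # Exponential decay: epsilon = epsilon_0 * exp(-lambda * t)
    return epsilon_0 * np.exp(-lam * t)

# The exploration threshold shrinks as training progresses:
for t in (0, 1000, 5000):
    print(t, round(decayed_epsilon(t), 3))       # 1.0, ~0.368, ~0.007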

Thank you
