Sequence #15
Deep Reinforcement Learning
The Godzilla Attack!!
Questions and answers:
[Link]
Accompanied by:
The IDRIS AI support (dream) team
Directed by:
Agathe, Baptiste and Yanis - UGA/DAPI
Thibaut, Kamel - IDRIS
[Link]
Fidle information list
[Link]
AI exchange list
[Link]
List of the ESR* "Software developers" group
[Link]
List of the ESR* "Calcul" (computing) group
(*) ESR: Enseignement Supérieur et Recherche, i.e. French universities and public academic research organizations
Reinforcement Learning
what are we talking about?
Tabular Reinforcement Learning
Bellman Equation (1960’s)
SARSA and Q-Learning (1990’s)
Deep Reinforcement Learning
Deep Q-Network (2013)
On-Policy Gradient (2015)
Off-Policy Gradient (2015)
State of the Art and Perspective
when, where, and for what purpose to use it?
Going Forward & Resources
● Reinforcement Learning: An Introduction - R. S. Sutton and A. G. Barto
● Grokking Deep Reinforcement Learning - M. Morales
● Welcome to the HuggingFace🤗 Deep Reinforcement Learning Course
● OpenAI Spinning Up
● Berkeley’s Deep Reinforcement Learning course
● More resources
● Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning🤖
Reinforcement Learning
what are we talking about?
Tabular Reinforcement Learning
Bellman Equation (1960’s)
SARSA and Q-Learning (1990’s)
Deep Reinforcement Learning
Deep Q-Network (2013)
On-Policy Gradient (2015)
Off-Policy Gradient (2015)
State of the Art and Perspective
when, where, and for what purpose to use it?
Deep Reinforcement Learning / Tabular Reinforcement Learning
[ Figure: the AI landscape - Artificial Intelligence ⊃ Machine Learning ⊃ {Supervised Learning, Unsupervised Learning, Reinforcement Learning}, with Deep Learning as the subset of Machine Learning that overlaps all three ]
[ Live Policy Learning ]
Large environment with many states
[ Reward ]
Trial and error learning: "You win or you learn!"
The reward is hard to design! The biggest issue of RL!
[ Applications ]
[ Gymnasium ]
OpenAI Gym
Unity Gym
Isaac Gym
...
[ Python Implementation ]
Dopamine on Tensorflow & Jax
Stable Baseline3 on Pytorch
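For concreteness, a minimal Stable-Baselines3 usage sketch (standard SB3 API, assuming gymnasium and stable_baselines3 are installed; "CartPole-v1" is just an example environment, not one from the slides):

import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)   # MLP actor-critic policy
model.learn(total_timesteps=10_000)        # train

obs, _ = env.reset()
done = False
while not done:                            # run the learned policy for one episode
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated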
Reinforcement Learning
what are we talking about?
Tabular Reinforcement Learning
Bellman Equation (1960’s)
SARSA and Q-Learning (1990’s)
Deep Reinforcement Learning
Deep Q-Network (2013)
On-Policy Gradient (2015)
Off-Policy Gradient (2015)
State of the Art and Perspective
when, where, and for what purpose to use it?
[ Optimal Control ]
Perfectly known environment: fully observable
Optimal Control - MDP & Grid World
MDP: Markov Decision Process
Markov chain
Optimal Control - Bellman Equations
Q table and V table
Q(s,a): Action-State Value function; V(s): State Value function
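For reference, the Bellman expectation equations these two tables satisfy, written out in the standard textbook form (Sutton & Barto):

V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s,a) \bigl[ R(s,a) + \gamma\, V^{\pi}(s') \bigr]

Q^{\pi}(s,a) = \sum_{s'} P(s' \mid s,a) \Bigl[ R(s,a) + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s',a') \Bigr]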
Optimal Control - MDP & Grid World
[ Figure: grid world with terminal cells rewarded +1.00 and -1.00, and a Start cell ]
s: state, a: action, s': next state, a': next action
R(s,a): Reward
V(s): State Value function, the expected return for a given state
Q(s,a): Action-State Value function, the expected return for a given action from a given state
Optimal Control - MDP & Grid World
[ Stochastic actions: for one action, several next states are likely, e.g. 0.84 for one state and 0.08 for each of two others ]
Optimal Control – Optimal Policy
Discount rate γ, here γ = 0.96
[ Figure: Q table and V table under the optimal policy; Q(s,a): Action-State Value function, V(s): State Value function ]
Optimal Control - Discount rate
[ Figure: reward trajectory with γ = 0.90 ]
Which discount rate γ do you apply for yourself, day by day?
● 0.1
● 0.5
● 0.9
Calculating the Q table or the V table is very complex! How can it be done?
Optimal Control - Dynamic Programming
1. Initialisation
2. Policy Evaluation
3. Policy Improvement
If the policy is stable, then stop and return V ≈ v* and π ≈ π*; else go to 2.
[ Figure: grid-world V tables, initialised at 0 and converging to values such as 0.85, 0.89, 0.93 next to the +1/-1 terminal cells ]
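A compact sketch of this loop on a toy MDP (the 4-state chain, its transitions, and its rewards are invented here purely for illustration; this is not the slide's grid world):

import numpy as np

n_states, n_actions, gamma, theta = 4, 2, 0.96, 1e-6
# P[(s, a)] -> list of (prob, next_state, reward): a deterministic chain where
# a=1 moves right, a=0 moves left, and entering the last state yields reward 1.
P = {(s, a): [(1.0,
               min(s + 1, 3) if a == 1 else max(s - 1, 0),
               1.0 if (a == 1 and s == 2) else 0.0)]
     for s in range(n_states) for a in range(n_actions)}

policy = np.zeros(n_states, dtype=int)        # 1. Initialisation
while True:
    V = np.zeros(n_states)
    while True:                               # 2. Policy Evaluation
        delta = 0.0
        for s in range(n_states):
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, policy[s])])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    stable = True                             # 3. Policy Improvement
    for s in range(n_states):
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
             for a in range(n_actions)]
        best = int(np.argmax(q))
        if best != policy[s]:
            policy[s], stable = best, False
    if stable:                                # V ≈ v*, π ≈ π*
        break
print(V, policy)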
[ Reinforcement Learning ]
Live Learning
Embodied Agent
Partially Observable Environment
Imagine an endless grid world, or one too big for optimal control techniques
[ Exploitation or Exploration ]
Exploration vs Exploitation Trade-off
[ ε-greedy policy ]
With probability ε: exploration, choose an action randomly.
With probability 1-ε: exploitation, select the current best action (be greedy).
Decrease ε over time!!
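A sketch of the corresponding action choice (table sizes and the decay schedule are arbitrary examples):

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon):
    # With probability ε explore (random action), otherwise exploit (be greedy).
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

Q = np.zeros((16, 4))                        # toy 16-state, 4-action Q table
for step in range(10_000):
    epsilon = max(0.05, 1.0 - step / 5_000)  # decrease ε over time, with a floor
    action = epsilon_greedy(Q, state=0, epsilon=epsilon)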
Monte Carlo Learning
Gt: the discounted return over the reward trajectory, until the game is over:
Gt = R(t+1) + γ·R(t+2) + γ²·R(t+3) + ...
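Computed backwards over a finished episode, this is a one-liner per step (the reward list is a made-up example):

gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, -1.0]   # toy reward trajectory until the game is over

returns, G = [], 0.0
for r in reversed(rewards):            # accumulate the discounted sum backwards
    G = r + gamma * G
    returns.append(G)
returns.reverse()                      # returns[t] is now Gt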
Temporal Difference Learning
TD target: Gt ≈ R(t+1) + γ·V(S(t+1)), where the self-estimation V(S(t+1)) is the bootstrapped "tail".
N-step Temporal Difference Learning
TD target: Gt ≈ R(t+1) + γ·R(t+2) + ... + γⁿ⁻¹·R(t+n) + γⁿ·V(S(t+n)), with the self-estimation tail γⁿ·V(S(t+n)).
N-step Temporal Difference Learning
[ Bootstrapping ]
Monte Carlo, with no tail: high variance!! Bootstrapping with a tail: high bias.
[ On-Policy Learning ]
From live, risky actions
SARSA - On-policy
The ε-greedy policy both selects the action A and provides the next action A' used in the update target:
Q(S,A) ← Q(S,A) + α·[R + γ·Q(S',A') − Q(S,A)]
[ Off-Policy Learning ]
From previously learned action values
Q-learning - Off-policy
Actions are taken with the ε-greedy policy, but the update target uses max_a Q(S',a):
Q(S,A) ← Q(S,A) + α·[R + γ·max_a Q(S',a) − Q(S,A)]
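The two tabular updates side by side, as a sketch (α is the learning rate; both assume a NumPy Q table):

import numpy as np

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.96):
    # On-policy: the target uses A', the action the ε-greedy policy actually takes next.
    td_target = r + gamma * Q[s2, a2]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.96):
    # Off-policy: the target uses max_a Q(S',a), whatever action is taken next.
    td_target = r + gamma * np.max(Q[s2])
    Q[s, a] += alpha * (td_target - Q[s, a])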
To have in Mind
● V(state) & Q(state, action) value functions, and the discount rate γ
● Exploration strategy: ε-greedy
● Temporal Difference, TD(n) with a bootstrapped tail, and Monte Carlo without a tail
● On-policy / Off-policy concepts
Reinforcement Learning
what are we talking about?
Tabular Reinforcement Learning
Bellman Equation (1960’s)
SARSA and Q-Learning (1990’s)
Deep Reinforcement Learning
Deep Q-Network (2013)
On-Policy Gradient (2015)
Off-Policy Gradient (2015)
State of the Art and Perspective
when, where, and for what purpose to use it?
RL \ Deep RL
Q-learning: a table maps each (state, action) pair to a Q value.
Deep Q-learning: a network maps the state to one Q value per action (Q value action 1, Q value action 2, Q value action 3, ...).
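A sketch of the deep side of the diagram in PyTorch (layer sizes and dimensions are arbitrary examples):

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim=4, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),          # one Q value per action
        )

    def forward(self, state):
        return self.net(state)

q_values = QNetwork()(torch.randn(1, 4))       # shape (1, n_actions)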
High Dimensional Benefit
Tabular representation is limited, but deep learning enables the use of high-dimensional data: images, text, ...
Approximation Function Benefit
Imagine this state-value function: V = [-2.5, -1.1, 0.7, 3.2, 7.6]
With a Q-table, each value is independent; with function approximation, the underlying relationship between the states can be learned and exploited.
With a Q-table, an update only changes one entry; with function approximation, an update changes multiple states.
DQN: Deep Q-Network
Playing Atari with Deep Reinforcement Learning (2013)
[ Figure: the Freeway game ]
DQN Algorithm
Act using the ε-greedy(Q) policy (exploitation: the action with the max Q-value).
Store each transition (s, a, r, s', done) in the experience replay memory D, and train the policy from it.
[ Off-Policy Learning ]
with Experience Replay Buffer
Training Policy
DQN – Experience Replay Buffer
Off-policy
[ Experience Replay Buffer ]
Random sampling
DQN Algorithm
[ Loss function ]
Initialize the Q network, then store each transition (s, a, r, s', done) in the experience replay memory D.
TD label: y = r + γ·max_a' Q(s',a'); TD error: y − Q(s,a); the loss is the squared TD error.
Prioritized DQN
Off-policy
[ Prioritizing Experience Replay Buffer ]
Prioritize the examples with a larger TD error, and so with more information.
Probability sampling: rank according to the magnitude of the TD error.
[ Bootstrapping Issue ]
The loss target is itself an estimate: Q values are overestimated because of the max function, and approximation errors diffuse between states.
DQN - Target network
Target network / online network solution: the loss target is computed with a frozen target network (a stabilized tail), whose weights are copied from the online network every C steps.
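Putting the replay buffer, the TD loss, and the target-network copy together, a sketch of one optimisation step (pure PyTorch; hyperparameters are arbitrary, QNetwork is the sketch from earlier, and transitions are assumed stored as tensors with int64 actions and float done flags):

import random
from collections import deque
import torch
import torch.nn.functional as F

buffer = deque(maxlen=100_000)                 # experience replay memory D
online, target = QNetwork(), QNetwork()
target.load_state_dict(online.state_dict())
optimizer = torch.optim.Adam(online.parameters(), lr=1e-4)
gamma, C = 0.99, 1_000

def train_step(step):
    s, a, r, s2, done = map(torch.stack, zip(*random.sample(buffer, 32)))
    q = online(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a) actually taken
    with torch.no_grad():                                   # stabilized tail
        y = r + gamma * target(s2).max(dim=1).values * (1 - done)
    loss = F.mse_loss(q, y)                                 # squared TD error
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if step % C == 0:                                       # every C steps
        target.load_state_dict(online.state_dict())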
Double DQN
To fix the Q-value overestimation issue and to smooth the learning, action selection and action evaluation are decoupled:
y = r + γ·Q_target(s', argmax_a' Q_online(s',a')) ≤ r + γ·max_a' Q_target(s',a')
This stabilizes the tail.
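As a sketch, the only change relative to the DQN step above is the target computation (the online network selects, the target network evaluates):

import torch

def double_dqn_target(online, target, r, s2, done, gamma=0.99):
    with torch.no_grad():
        best_a = online(s2).argmax(dim=1, keepdim=True)     # selection: online net
        q_eval = target(s2).gather(1, best_a).squeeze(1)    # evaluation: target net
        return r + gamma * q_eval * (1 - done)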
Dueling-DDQN
Q(s,a) = V(s) + A(s,a)
Q(s,a): Action-State Value function
V(s): State Value function
A(s,a): Advantage Value function
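A sketch of the dueling head (hidden size and action count are arbitrary; subtracting the mean advantage is the usual identifiability trick):

import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, hidden=64, n_actions=3):
        super().__init__()
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s,a)

    def forward(self, features):
        v, a = self.value(features), self.advantage(features)
        return v + a - a.mean(dim=1, keepdim=True)     # Q(s,a) = V + A - mean(A)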
[ Dueling-DDQN ]
[ Stochastic Reward ]
Distributional RL
Distributional Q-learning evaluates:
● the expected reward
● and the risk of reaching it
Distributional DQNs
Distributional RL aims to model the distribution over returns…
Two families: the categorical method and the quantile method.
Rainbow DQN
● DQN
● DDQN
● Prioritized DDQN
● Dueling DDQN
● Distributional DQN
● Noisy DQN (adds noise for exploration)
[ Figure: median human-normalized performance across 57 Atari games ]
Rainbow: Combining Improvements in Deep Reinforcement Learning (2017)
[ Limit of DQNs ]
The environment's action space must be discrete!!!
Any questions?
[ On-Policy Learning ]
Train directly from the live policy, without an experience replay buffer
Policy Gradient / DQN
Discrete control: action probability. Continuous control: action intensity.
State → Policy Function → action
[ Continuous Control ]
The environment's action space can be continuous!!!
Policy Gradient
Instead of minimizing a loss, maximize an objective function: the expected return J(θ) of the policy π_θ.
Gradient ascent with learning rate α: θ ← θ + α·∇θ J(θ)
Gradient ascent = gradient descent on the negated objective: (-obj).backward()
[Link]
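The (-obj).backward() trick from the slide, demonstrated on a toy objective (the quadratic is invented for illustration; any differentiable objective works the same way):

import torch

theta = torch.zeros(4, requires_grad=True)
optimizer = torch.optim.SGD([theta], lr=0.1)   # learning rate α

for _ in range(100):
    obj = -(theta - 2.0).pow(2).sum()          # toy objective, maximal at θ = 2
    optimizer.zero_grad()
    (-obj).backward()                          # ascent on obj = descent on -obj
    optimizer.step()
# after the loop, theta ≈ 2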
REINFORCE - Pre-DQN Policy Gradient
Good stuff is made more likely or more intense; bad stuff is made less likely or less intense.
[ Figure: policy-gradient arrows pushing the policy π toward higher-return actions ]
REINFORCE - Pre-DQN Policy Gradient
[ Monte-Carlo policy gradient ]
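A sketch of the resulting loss for one finished episode (the log_probs list would come from dist.log_prob(action) calls during the rollout):

import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    # Monte-Carlo returns Gt, computed backwards over the episode.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    # Maximise sum_t log π(a_t|s_t) · Gt by minimising its negation.
    return -(torch.stack(log_probs) * returns).sum()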
REINFORCE - Pre-DQN Policy Gradient
[ High Variance ]
Big issues: huge variance & local maxima!!
[ Figure: CartPole ]
[ Solution: Bootstrapping ]
Monte Carlo, with no tail: high variance!! Bootstrapping with a tail: high bias.
REINFORCE with Baseline
[ From the Monte Carlo return to a baselined TD(n-step) estimate ]
REINFORCE with Baseline
With the baseline subtracted, the advantage function is 0-centered: A(s,a) = Q(s,a) − V(s)
[ Actor Critic ]
REINFORCE with Baseline, or VPG (Vanilla Policy Gradient)
The critic's TD(n-step) value estimate is trained by gradient descent; the actor is trained by gradient ascent.
No ε-greedy, hence no built-in exploration
A3C - Asynchronous Advantage Actor-Critic
Asynchronous multi-workers for exploration
A2C - Synchronous Advantage Actor-Critic
Synchronous multi-workers for exploration.
In A3C, thread-specific agents play with different versions of the policy; A2C's synchronous updates make training more cohesive and can make convergence faster.
How to Minibatch Actor-Critic?
Collect a set of trajectories by running the policy in the different environments, in batches over N epochs.
At each timestep of each trajectory, compute the return of each TD(n-step) candidate: t=1, ..., t=n-3, t=n-2, t=n-1.
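A sketch of those TD(n-step) return candidates for one trajectory (values holds the critic's estimates V(s_t), with one extra entry for the final state, set to 0 if terminal):

import numpy as np

def n_step_returns(rewards, values, n, gamma=0.99):
    # values has len(rewards) + 1 entries; values[-1] = 0 for a terminal state.
    T = len(rewards)
    G = np.zeros(T)
    for t in range(T):
        h = min(t + n, T)                       # truncate at the trajectory end
        G[t] = sum(gamma ** (k - t) * rewards[k] for k in range(t, h))
        G[t] += gamma ** (h - t) * values[h]    # bootstrapped tail
    return G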
[ Unstable Training ]
Naively maximizing the objective can produce "destructively large policy updates."
[ The Surrogate Objective Function ]
Maximize the surrogate objective E_t[ r_t(θ)·A_t ], where r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t).
If policy iteration stays conservative, via constraints or penalties, then going small step by small step is equivalent to the legacy objective function, and avoids "destructively large policy updates."
Trust Region Policy Optimization - TRPO
The trust-region constraint on the "surrogate" objective function: keep the Kullback-Leibler divergence between the new and the old policy inside a trust region.
Computationally expensive with big models!!
Proximal Policy Optimization - PPO
The clipped surrogate objective function: maximize E_t[ min( r_t(θ)·A_t , clip(r_t(θ), 1-ε, 1+ε)·A_t ) ].
If A > 0, the clip stops r from increasing beyond 1+ε; if A < 0, it stops r from decreasing below 1-ε. This keeps updates small enough to ensure sufficient exploration for many agents.
[ Figure: line search vs clipped search ]
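A sketch of the clipped loss on a batch (ε here is the clip range, e.g. 0.2, not the exploration ε):

import torch

def ppo_clip_loss(log_prob, old_log_prob, advantage, clip_eps=0.2):
    ratio = torch.exp(log_prob - old_log_prob)    # r(θ) = π_θ / π_θ_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()  # maximise by minimising -L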
Any questions?
[ Off-Policy Learning ]
Training Policy
For Continuous Control
Deep Deterministic Policy Gradient
DDPG is an off-policy, TD(1-step) learning algorithm.
DDPG is the Deep Q-Network approach for continuous control (action intensity).
[ Actor Critic ]
[ Q-Critic ]
The critic sends the gradient of Q to the actor: the sampled policy gradient.
Deep Deterministic Policy Gradient
[ Figure: state → actor → action intensity; (state, action intensity) → critic → Q; the actor is trained by gradient ascent on Q ]
Deep Deterministic Policy Gradient
Noisy exploration
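A sketch of the two DDPG updates (network shapes are arbitrary; target networks and the replay buffer are omitted for brevity, and y is a precomputed TD target):

import torch
import torch.nn as nn
import torch.nn.functional as F

actor = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_step(s, a, y):
    # Critic: gradient descent on the squared TD error against the target y.
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = F.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: gradient ascent on Q(s, μ(s)), i.e. descent on its negation.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()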
TD3: Twin Delayed Deep Deterministic
To fix the Q-value overestimation and variance issues: use twin critics, plus delayed updates of the target and policy networks.
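A sketch of the twin-critic target (the network arguments are hypothetical handles; target-policy smoothing noise and the delayed updates are omitted):

import torch

def td3_target(target_actor, target_critic1, target_critic2, r, s2, done, gamma=0.99):
    with torch.no_grad():
        sa2 = torch.cat([s2, target_actor(s2)], dim=1)
        q_min = torch.min(target_critic1(sa2), target_critic2(sa2)).squeeze(1)
        return r + gamma * q_min * (1 - done)   # pessimistic twin-critic estimate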
Soft Actor Critic
SAC incorporates the entropy of the policy into the reward to encourage exploration.
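The standard maximum-entropy objective behind that sentence (α is the temperature weighting the entropy bonus):

J(\pi) = \mathbb{E}_{\pi}\Bigl[ \sum_{t} r(s_t, a_t) + \alpha\, \mathcal{H}\bigl( \pi(\cdot \mid s_t) \bigr) \Bigr]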
D4PG
Distributed Distributional Deep Deterministic Policy Gradients
DDPG + Multi-Workers + Distributional RL
Multi-Agent DDPG
Collaborative and/or adversarial: the critics take the actions of every actor as input.
Reinforcement Learning
what are we talking about?
Tabular Reinforcement Learning
Bellman Equation (1960’s)
SARSA and Q-Learning (1990’s)
Deep Reinforcement Learning
Deep Q-Network (2013)
On-Policy Gradient (2015)
Off-Policy Gradient (2015)
State of the Art and Perspective
when, where, and for what purpose to use it?
Which algorithm should I use?
Single process, discrete control: DQN and Distributional DQN. DQN is usually slower to train (regarding wall-clock time) but is the most sample-efficient (because of its replay buffer).
Single process, continuous control: SAC, TD3 and TQC. Please use the hyperparameters from the RL Zoo for best results.
Multiprocessed, discrete control: PPO and A2C.
Multiprocessed, continuous control: PPO, TRPO and A2C. Please use the hyperparameters from the RL Zoo for best results.
Source: Stable-Baselines3
Kinds of DRL Algorithms
Source: OpenAI Spinning Up
Model-Based DRL
Model-Based DRL - Dreamer
Dream to Control: Learning Behaviors by Latent Imagination (2020)
Which algorithm should I use?
Typical sample budgets range from Evolutionary methods (~100M time steps), through Policy Gradient (~10M) and Actor-Critic / Q-learning (~1M), down to Model-based methods (~100K).
Off-policy (DDPG & co): better sample efficiency, more computationally expensive, experience replay; suits stochastic rewards (Distributional RL) and multi-agent settings (adversarial, collaborative).
On-policy (PPO & co): less sample-efficient, less computationally expensive, multiple live experiences; suits dynamic environments (no replay).
Are the 2020s a Deep Reinforcement Learning winter?
TRANSFORMERS
Augmented Random Search (2018)
Sample-efficient and computationally cheap.
Exploration in the policy space:
● apply several +/- noises 𝛿 to the weights,
● collect the rewards r[+] and r[-],
● update the weights: Θ += α(r[+] – r[-])·𝛿
Augmented with:
● dividing by the standard deviation 𝞼ᵣ,
● normalizing the states,
● using the top-performing directions.
Largely outperforms PPO and DDPG on MuJoCo locomotion environments!!
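A sketch of one ARS update for a linear policy (the rollout hook is a hypothetical user-supplied function returning an episodic reward; hyperparameters are arbitrary):

import numpy as np

def ars_update(theta, rollout, n_dirs=8, nu=0.05, alpha=0.02):
    # Sample perturbation directions and evaluate the policy at Θ ± ν·δ.
    deltas = [np.random.randn(*theta.shape) for _ in range(n_dirs)]
    r_plus = np.array([rollout(theta + nu * d) for d in deltas])
    r_minus = np.array([rollout(theta - nu * d) for d in deltas])
    sigma_r = np.concatenate([r_plus, r_minus]).std() + 1e-8
    # Θ += α (r[+] - r[-]) · δ, averaged over directions and scaled by σr.
    step = sum((rp - rm) * d for rp, rm, d in zip(r_plus, r_minus, deltas))
    return theta + alpha / (n_dirs * sigma_r) * step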
Curriculum Learning
RL vs Supervised Learning
(Self-)Supervised learning: a dataset of experience; hard to collect a representative labeled dataset; hard to deal with a dynamic environment.
Reinforcement learning: lifelong learning; slow live training; a reward function that is hard to design.
RT2 – Massive Self-Supervised Learning
Imitation Learning
For tasks that are impossible to learn with Reinforcement Learning from scratch!!
Pre-train on labeled video, then fine-tune with RL! (Jun. 2022)
RLHF - Preference Alignment Fine-Tuning
Physical-Deep Reinforcement Learning
Thank you very much!!
Any questions?
Thursday, April 11, 2024, 2 pm
Next, on Fidle:
"L'IA comme un outil" (AI as a tool)
Thursday, May 2 · Thursday, May 16 · Thursday, May 30