
Session 15

Deep Reinforcement Learning

The Godzilla Attack!!

Questions and answers:
[Link]
Accompanied by:
the AI Support (dream) Team of IDRIS
Directed by:
Agathe, Baptiste and Yanis - UGA/DAPI
Thibaut, Kamel - IDRIS

[Link]
Fidle information list

[Link]
AI exchange list

[Link]
List of ESR* « Software developers » group

[Link]
List of ESR* « Calcul » group

(*) ESR stands for Enseignement Supérieur et Recherche: French universities and public academic research organizations.

Reinforcement Learning
what are we talking about?

Tabular Reinforcement Learning


Bellman Equation (1960’s)
SARSA and Q-Learning (1990’s)

Deep Reinforcement Learning


Deep Q-Network (2013)
On-Policy Gradient (2015)
Off-Policy Gradient (2015)

State of the Art and Perspective


when, where, and for what purpose to use it?

Going Forward & Resources

● Reinforcement Learning: An Introduction - R. S. Sutton and A. G. Barto


● Grokking Deep Reinforcement Learning - M. Morales
● Welcome to the HuggingFace🤗 Deep Reinforcement Learning Course
● OpenAI Spinning Up
● Berkeley’s Deep Reinforcement Learning course
● More resources

● Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning🤖

Reinforcement Learning
what are we talking about?

Tabular Reinforcement Learning


Bellman Equation (1960’s)
SARSA and Q-Learning (1990’s)

Deep Reinforcement Learning


Deep Q-Network (2013)
On-Policy Gradient (2015)
Off-Policy Gradient (2015)

State of the Art and Perspective


when, where, and for what purpose to use it?

Deep Reinforcement Learning \ Tabular Reinforcement Learning

[Venn diagram: within Artificial Intelligence, Machine Learning contains Supervised, Unsupervised, and Reinforcement Learning; Deep Learning intersects each of them.]

[ Live (Online) Policy Learning ]

Large environment with many states

[ Reward ]
Trial and Error Learning
« You Win or You Learn! »

Hard to design!

The biggest issue of RL!

[ Applications ]

[ Gymnasium ]
OpenAI Gym
Unity Gym
Isaac Gym
...

[ Python Implementations ]

Dopamine on TensorFlow & JAX

Stable Baselines3 on PyTorch

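As a minimal illustration of these libraries (a sketch assuming Gymnasium and Stable Baselines3 are installed; CartPole is just an example environment):

import gymnasium as gym
from stable_baselines3 import PPO

# Train a PPO agent on a toy environment for a few steps
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)
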
Reinforcement Learning
what are we talking about?

Tabular Reinforcement Learning


Bellman Equation (1960’s)
SARSA and Q-Learning (1990’s)

Deep Reinforcement Learning


Deep Q-Network (2013)
On-Policy Gradient (2015)
Off-Policy Gradient (2015)

State of the Art and Perspective


when, where, and for what purpose to use it?

[ Optimal Control ]

Perfectly known environment:

Fully observable

Optimal Control - MDP & Grid World
MDP: Markov Decision Process

Markov chain

Optimal Control - Bellman Equations

Q table: Q(s,a), the action-state value function.
V table: V(s), the state value function.

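The equations themselves appear on the slide only as images; in standard form (a reconstruction consistent with the notation above) they read:

V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s,a)\,\big[ R(s,a) + \gamma\, V^{\pi}(s') \big]

Q^{\pi}(s,a) = \sum_{s'} P(s' \mid s,a)\,\big[ R(s,a) + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s',a') \big]
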
Optimal Control - MDP & Grid World

[Grid-world diagram: transitions from state S under action a to next state S', with reward R(s,a); terminal cells valued 1.00 and -1.00, plus a Start cell.]

s : state        s' : next state
a : action       a' : next action
R(s,a) : reward

V(s) : state value function, the expected return for a given state
Q(s,a) : action-state value function, the expected return for a given action from a given state

Optimal Control - MDP & Grid World

[ Stochastic action ]
For one action, several next states are likely (e.g. probability 0.84 for the intended move, 0.08 for each side slip).

Optimal Control - Optimal Policy

Discount rate 𝛾 = 0.96

Q table: Q(s,a), the action-state value function.
V table: V(s), the state value function.

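The optimal policy acts greedily with respect to the optimal values; in standard form (a reconstruction, not in the extracted text):

V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s,a)\,\big[ R(s,a) + \gamma\, V^{*}(s') \big], \qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s,a)
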
Optimal Control - Discount rate

Reward trajectory with 𝛾 = 0.90

Which discount rate do you apply for yourself, day by day?
● 0.1
● 0.5
● 0.9

Calculating the Q table or the V table is very complex!

How to do that?

Optimal Control - Dynamic Programming

1. Initialisation
2. Policy Evaluation
3. Policy Improvement

If the policy is stable, then stop and return V ≈ v* and π ≈ π*; else go to 2.

[Grid-world figures: state values after policy evaluation (e.g. 0.85, 0.89, 0.93 leading to the +1 terminal, -1 for the trap) and the improved greedy policy.]

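A minimal policy-iteration sketch of this loop (an illustration under assumptions: the MDP is given as P[s][a] = list of (prob, next_state, reward, done) tuples; names are not from the slides):

import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.96, tol=1e-8):
    V = np.zeros(n_states)
    pi = np.zeros(n_states, dtype=int)
    while True:
        # 2. Policy evaluation: iterate the Bellman expectation until convergence
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(p * (r + gamma * V[s2] * (not done))
                        for p, s2, r, done in P[s][pi[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # 3. Policy improvement: act greedily with respect to the new V
        stable = True
        for s in range(n_states):
            qs = [sum(p * (r + gamma * V[s2] * (not done))
                      for p, s2, r, done in P[s][a]) for a in range(n_actions)]
            best = int(np.argmax(qs))
            if best != pi[s]:
                stable = False
            pi[s] = best
        if stable:              # policy stable: V ≈ v*, π ≈ π*
            return V, pi
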
[ Reinforcement Learning ]

Live (online) learning
Embodied agent
Partially observable environment

Imagine an endless Grid World, or one too big for Optimal Control techniques.

[ Exploitation or Exploration ]

Exploration vs Exploitation Trade-off

[ ε-greedy policy ]

With probability ε (exploration): choose an action randomly.
With probability 1-ε (exploitation): select the current best action (be greedy).

Decrease ε over time!!

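A minimal sketch of ε-greedy selection with decay (illustrative names; the slides give no code; Q is assumed to map (state, action) pairs to values, e.g. a defaultdict(float)):

import random

def epsilon_greedy(Q, state, n_actions, eps):
    """Pick a random action with probability eps, else the greedy one."""
    if random.random() < eps:
        return random.randrange(n_actions)                       # exploration
    return max(range(n_actions), key=lambda a: Q[(state, a)])    # exploitation

# Decrease epsilon over time, e.g. exponential decay toward a floor:
eps, eps_min, decay = 1.0, 0.05, 0.999
for step in range(10_000):
    # ... act with epsilon_greedy(...), then learn ...
    eps = max(eps_min, eps * decay)
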
Monte Carlo Learning

Gt : the reward trajectory until the game is over:

G_t = R_{t+1} + 𝛾 R_{t+2} + 𝛾² R_{t+3} + … + 𝛾^(T-t-1) R_T

Temporal Difference Learning

TD Target: the estimated value of the next state replaces the rest of the trajectory (the self-estimation "tail").

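For reference, the TD(0) update the slide illustrates, in standard form (a reconstruction):

V(S_t) \leftarrow V(S_t) + \alpha\,\big[ R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t) \big]

The bracketed term is the TD-error; the TD target R_{t+1} + 𝛾 V(S_{t+1}) bootstraps on the current estimate.
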
N-step Temporal Difference Learning

TD Target: G_{t:t+n} = R_{t+1} + 𝛾 R_{t+2} + … + 𝛾^(n-1) R_{t+n} + 𝛾^n V(S_{t+n})

Self-estimation tail: 𝛾^n V(S_{t+n})

N-step Temporal Difference Learning

[ Bootstrapping ]

With a self-estimation tail: high bias.
With no tail (full Monte Carlo return): high variance!!

[ On-Policy Learning ]

Learn from live, risky actions.

SARSA - On-policy

The same 𝜀-greedy policy both chooses the actions and provides the next action A' used in the update target.

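The SARSA update in standard form (a reconstruction; A' is the action the 𝜀-greedy policy actually takes next, which is what makes it on-policy):

Q(S,A) \leftarrow Q(S,A) + \alpha\,\big[ R + \gamma\, Q(S',A') - Q(S,A) \big]
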
[ Off-Policy Learning ]

Learn from previously learned action values.

Q-learning - Off-policy

Actions are chosen by an 𝜀-greedy policy, but the update target uses the greedy value maxa Q(S', a).

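A minimal tabular Q-learning sketch (a hedged illustration assuming a Gymnasium-style discrete environment; names are illustrative):

from collections import defaultdict
import random

def q_learning(env, episodes=500, alpha=0.1, gamma=0.96, eps=0.1):
    Q = defaultdict(float)                     # Q[(state, action)]
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy
            if random.random() < eps:
                a = env.action_space.sample()
            else:
                a = max(range(env.action_space.n), key=lambda x: Q[(s, x)])
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # off-policy target: greedy max over next actions
            target = r + (0 if terminated else gamma * max(
                Q[(s2, x)] for x in range(env.action_space.n)))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
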
To Have in Mind

● V(state) & Q(state, action) value functions, and the discount rate 𝛾
● Exploration strategy: Ԑ-greedy
● Temporal Difference TD(n), with a tail; Monte Carlo, without a tail
● On-policy vs off-policy concepts

Reinforcement Learning
what are we talking about?

Tabular Reinforcement Learning


Bellman Equation (1960’s)
SARSA and Q-Learning (1990’s)

Deep Reinforcement Learning


Deep Q-Network (2013)
On-Policy Gradient (2015)
Off-Policy Gradient (2015)

State of the Art and Perspective


when, where, and for what purpose to use it?

RL \ Deep RL

Q-learning: a table maps (state, action) to a single Q value.

Deep Q-learning: a network takes the state and outputs one Q value per action (Q value action 1, Q value action 2, Q value action 3).

High-Dimensional Benefit

Tabular representation is limited, but deep learning enables the use of high-dimensional data: images, text, ...

Approximation Function Benefit

Imagine this state-value function: V = [-2.5, -1.1, 0.7, 3.2, 7.6]

With a Q-table, each value is independent: an update only changes one state's value.

With function approximation, the underlying relationship between the states can be learned and exploited: an update changes the values of multiple states.

DQN : Deep Q-learning Network

Playing Atari with Deep Reinforcement Learning (2013)

Freeway game

DQN Algorithm

Using an ε-greedy (Q) policy.
Exploitation: the action with the max Q-value.

Store the transition (s, a, r, s', done) in the experience replay memory D.

The training policy is then updated from samples of D.

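A minimal replay-memory sketch (illustrative, not the authors' code):

import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer D of (s, a, r, s', done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size=32):
        # uniform random sampling breaks temporal correlations
        return random.sample(self.buffer, batch_size)
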
[ Off-Policy Learning ]
with an Experience Replay Buffer

The training policy differs from the (older) policy that filled the buffer.

DQN – Experience Replay Buffer
Off-policy

[ Experience Replay Buffer ]

Random sampling

DQN Algorithm

[ Loss function ]

Initialize the Q network.
Store transitions (s, a, r, s', done) in the experience replay memory D.
Minimize the TD-error between the Q network's prediction and the TD-label y.

Prioritized DQN
Off-policy

[ Prioritized Experience Replay Buffer ]

Prioritize the examples with a larger TD-error, and so with more information.

Probability sampling: transitions are ranked according to how big the TD-error is.

[ Bootstrapping Issue ]

The loss target contains a max over the network's own Q estimates, so approximation errors diffuse between states.

DQN - Target network

Target network / online network solution: the target network provides a stabilized tail, and is refreshed from the online network every C steps.

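A minimal sketch of the two usual target-update schemes (illustrative helpers over PyTorch modules; the soft variant with 𝜏 << 1 is the one the next slide seems to refer to):

def hard_update(target_net, online_net):
    """Copy the online weights into the target network (every C steps)."""
    target_net.load_state_dict(online_net.state_dict())

def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging with tau << 1."""
    for p_t, p in zip(target_net.parameters(), online_net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)
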
Double DQN

To fix the Q-value overestimation issue and to smooth the learning:
the target uses a stabilized tail, with a soft target-network update coefficient << 1.

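The Double DQN target in standard form (a reconstruction): the online network selects the next action, the target network evaluates it:

y = r + \gamma\, Q_{\text{target}}\big(s',\ \arg\max_{a'} Q_{\text{online}}(s', a')\big)
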
Dueling-DDQN

Q(s,a) : action-state value function
V(s) : state value function
A(s,a) : advantage value function

[ Dueling-DDQN ]

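The standard dueling aggregation (a reconstruction; subtracting the mean advantage keeps V and A identifiable):

Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a')
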
[ Stochastic Reward ]

Distributional RL

Distributional Q-learning evaluates:
● the expected reward
● and the risk of reaching it

Distributional DQNs

Distributional RL aims to model the distribution over returns.

Two approaches: the categorical method and the quantile method.

Rainbow DQN

● DQN
● DDQN
● Prioritized DDQN
● Dueling DDQN
● Distributional DQN
● Noisy DQN (adds noise for exploration)

Rainbow DQN: median human-normalized performance across 57 Atari games.

Rainbow: Combining Improvements in Deep Reinforcement Learning (2017)


[ Limit of DQNs ]
The environment's action space must be discrete!!!

Any questions?

[ On-Policy Learning ]

Train directly from the live policy, without an experience replay buffer.

Policy Gradient / DQN

A policy function maps the state directly to actions:
● discrete control: action probabilities
● continuous control: action intensities

[ Continuous Control ]
The environment's action space can be continuous!!!

Policy Gradient

Maximize the objective function (with a learning rate).

Gradient ascent = gradient descent on the negative objective:
(-obj).backward()

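A minimal PyTorch sketch of that trick (all names illustrative; a real agent would accumulate log-probabilities over a whole episode):

import torch

policy = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Softmax(dim=-1))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(1, 4)                 # dummy observation
probs = policy(state)                     # action probabilities
action = torch.multinomial(probs, 1).item()
G = 1.0                                   # placeholder return

obj = torch.log(probs[0, action]) * G     # objective to MAXIMIZE
opt.zero_grad()
(-obj).backward()                         # ascend by descending on -obj
opt.step()
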
REINFORCE - Pre-DQN Policy Gradient

Following the gradient on the policy π, good stuff is made more likely (or more intense), and bad stuff is made less likely (or less intense).

REINFORCE - Pre-DQN Policy Gradient

[ Monte-Carlo policy gradient ]

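In standard notation (a reconstruction), the Monte-Carlo policy gradient is:

\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi}\big[ G_t\, \nabla_{\theta} \log \pi_{\theta}(A_t \mid S_t) \big]

where G_t is the full Monte-Carlo return, hence the variance issue discussed next.
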
REINFORCE - Pre-DQN Policy Gradient

[ High Variance ]
Big issues: huge variance & local maxima!!

CartPole

[ Solution: Bootstrapping ]

With a self-estimation tail: high bias. With no tail: high variance!!

REINFORCE with Baseline

Subtract a baseline from the Monte Carlo return; the target can also be a TD(n-step) return.

REINFORCE with Baseline

Subtracting the baseline yields the advantage function, which is 0-centered.

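In standard form (a reconstruction), using the state value as the baseline gives the 0-centered advantage:

A(s,a) = Q(s,a) - V(s), \qquad \nabla_{\theta} J(\theta) = \mathbb{E}\big[ \big(G_t - V(S_t)\big)\, \nabla_{\theta} \log \pi_{\theta}(A_t \mid S_t) \big]
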
[ Actor Critic ]

REINFORCE with Baseline, or VPG

The critic (the TD(n-step) value estimate) is trained by gradient descent; the actor (the policy) is trained by gradient ascent.

No ε-greedy
No Exploration

A3C - Asynchronous Advantage Actor-Critic

Asynchronous multi-workers provide the exploration.

A2C - Synchronous Advantage Actor-Critic

Asynchronous: thread-specific agents would be playing with policies of different versions.

Synchronous multi-workers for exploration: training is more cohesive, potentially making convergence faster.

How to Minibatch Actor-Critic?

Batches over N epochs:
Collect a set of trajectories by running the policy in the different environments.
At each timestep in each trajectory, compute the return of each TD(n-step) trajectory candidate: t = 1, ..., t = n-3, t = n-2, t = n-1.

[ Unstable Training ]

Maximizing the objective naively can cause "destructively large policy updates."

[ The Surrogate Objective Function ]

Maximize the surrogate objective. If the policy iteration stays conservative, with:
• constraints,
• penalties,
then, small step by small step, it is equivalent to the legacy objective function, and so it avoids "destructively large policy updates."

Trust Region Policy Optimization - TRPO

The trust-region constraint on the « surrogate » objective function: a Kullback-Leibler divergence bound keeps the new policy inside the trust region.

Computationally expensive with big models!!

Proximal Policy Optimization - PPO

The Clipped Surrogate Objective Function: maximize a clipped ratio term instead of running a line search.

If A > 0, the clipped ratio limits the increase; if A < 0, it limits the decrease. This ensures enough exploration for many agents.

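The clipped surrogate objective in the PPO paper's standard notation (a reconstruction):

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\big[ \min\big( r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \big], \qquad r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
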
Any questions?

[ Off-Policy Learning ]

A training policy for continuous control.

Deep Deterministic Policy Gradient

DDPG is an off-policy TD(1-step) learning algorithm.

DDPG is a Deep Q-learning Network for continuous control (action intensity).

[ Actor Critic ]

[ Q-Critic ]

The critic sends the gradient of Q to the actor: the sampled policy gradient.

Deep Deterministic Policy Gradient

[Figure: the actor maps a state to an action intensity; the critic takes the state and the action intensity and returns Q. The actor is trained by GRADIENT ASCENT through the critic.]

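A minimal sketch of that gradient-ascent step (illustrative names; actor and critic are assumed to be PyTorch modules with actor(s) -> action and critic(s, a) -> Q):

def actor_update(actor, critic, actor_opt, states):
    actions = actor(states)
    # maximize Q(s, actor(s)): descend on its negative
    loss = -critic(states, actions).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
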
Deep Deterministic Policy Gradient

Noisy exploration: noise added to the deterministic actions.

TD3 : Twin Delayed Deep Deterministic

To fix the Q-value overestimation issue and the variance issue: use twin critics.

Delayed updates of the target and policy networks.

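The twin-critic target in standard form (a reconstruction; ã is the noise-smoothed target action):

y = r + \gamma \min\big( Q_{\theta'_1}(s', \tilde{a}),\ Q_{\theta'_2}(s', \tilde{a}) \big)
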
Soft Actor Critic

SAC incorporates the entropy measure of the policy into the reward to encourage exploration.

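The entropy-regularized objective in standard form (a reconstruction; α trades off reward and entropy):

J(\pi) = \mathbb{E}_{\pi}\Big[ \sum_t r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]
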
D4PG
Distributed Distributional Deep Deterministic Policy Gradients

DDPG + multi-workers + distributional RL

Multi-Agent DDPG

Collaborative and/or adversarial agents.

The critics take the actions of every actor.

Reinforcement Learning
what are we talking about?

Tabular Reinforcement Learning


Bellman Equation (1960’s)
SARSA and Q-Learning (1990’s)

Deep Reinforcement Learning


Deep Q-Network (2013)
On-Policy Gradient (2015)
Off-Policy Gradient (2015)

State of the Art and Perspective


when, where, and for what purpose to use it?

Which algorithm should I use?

Single process:
● Discrete control: DQN and Distributional DQN. DQN is usually slower to train (regarding wall-clock time) but is the most sample efficient (because of its replay buffer).
● Continuous control: SAC, TD3 and TQC.

Multiprocessed:
● Discrete control: PPO and A2C.
● Continuous control: PPO, TRPO and A2C.

Please use the hyperparameters in the RL Zoo for best results.

Source: Stable Baselines3

Kinds of DRL Algorithms

Source: OpenAI Spinning Up


Model-Based DRL

Model-Based DRL - Dreamer

Dream to Control: Learning Behaviors by Latent Imagination (2020)


Which algorithm should I use?

Typical sample efficiency (time steps needed to learn):
● Model-based: ~100K time steps
● Q-learning: ~1M time steps
● Actor-Critic / Policy Gradient: ~10M time steps
● Evolutionary: ~100M time steps

DDPG & co (off-policy): better sample efficiency, more computationally expensive; experience replay; stochastic (Distributional RL).
PPO & co (on-policy): less sample efficient, less computationally expensive; multiple live experiences (no replay); dynamic.

Multi-agents: adversarial, collaborative.

Are the 2020's a Deep Reinforcement Learning winter?

TRANSFORMERS

Augmented Random Search (2018)

Sample efficient and computationally cheap. Largely outperforms PPO and DDPG on MuJoCo locomotion environments!!

Exploration in the policy space:
● apply several +/- noises 𝛿
● collect r[+] , r[-]
● update the weights: Θ += α(r[+] – r[-])·𝛿

Augmented with:
● dividing by the standard deviation 𝞼ᵣ,
● normalizing the states,
● using the top-performing directions.

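A minimal sketch of the basic update above (an illustration; evaluate(θ) is assumed to run one episode with a linear policy parameterized by θ and return its reward):

import numpy as np

def ars_step(theta, evaluate, n_dirs=8, nu=0.05, alpha=0.02):
    """One basic random-search update over the policy weights."""
    deltas = [np.random.randn(*theta.shape) for _ in range(n_dirs)]
    r_pos = np.array([evaluate(theta + nu * d) for d in deltas])
    r_neg = np.array([evaluate(theta - nu * d) for d in deltas])
    # Augmentation: scale by the standard deviation of the collected rewards
    sigma_r = np.concatenate([r_pos, r_neg]).std() + 1e-8
    step = sum((rp - rn) * d for rp, rn, d in zip(r_pos, r_neg, deltas))
    return theta + alpha / (n_dirs * sigma_r) * step
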
Curriculum Learning

RL vs Supervised Learning

(Self-)Supervised learning: a dataset of experience.
● Hard to collect a representative labeled dataset.
● Hard to deal with a dynamic environment.

vs

Reinforcement learning: lifelong learning.
● Slow live training.
● Reward function hard to design.

RT2 – Massive Self-Supervised Learning

Imitation Learning

For tasks that are impossible with Reinforcement Learning from scratch!!

Pre-train on labeled video, then fine-tune with RL! (June 2022)

RLHF - Preference Alignment Fine-Tuning

Physical-Deep Reinforcement Learning

Thank you very much!!
Any questions?

Next, on Fidle:

Thursday, April 11, 2024, 2 pm: « L'IA comme un outil » (AI as a tool)

Then: Thursday, May 2, Thursday, May 16, Thursday, May 30
