Deep RL Tutorial Small
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Reinforcement Learning in a nutshell
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Deep Representations
[Figure: a deep representation composes many functions, x → h1 → ... → hn → y → loss l, with weights w1, ..., wn]
I Linear transformations hk+1 = W hk
I Non-linear activation functions hk+2 = f(hk+1)
I A loss function on the output, e.g.
I Mean-squared error l = ||y* − y||²
I Log likelihood l = log P[y*]
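A minimal numpy sketch of this composition and both losses, assuming a two-layer network with a ReLU activation; the sizes, data, and target class are illustrative choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 4-dimensional input, 8 hidden units, 2 outputs.
x = rng.normal(size=4)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(2, 8))

h1 = W1 @ x                        # linear transformation  hk+1 = W hk
h2 = np.maximum(h1, 0.0)           # non-linear activation  hk+2 = f(hk+1), here ReLU
y = W2 @ h2                        # network output

y_star = np.array([1.0, -1.0])     # target y*

# Mean-squared error  l = ||y* - y||^2
mse = np.sum((y_star - y) ** 2)

# Log likelihood  l = log P[y*], with a softmax turning outputs into probabilities.
p = np.exp(y - y.max()) / np.sum(np.exp(y - y.max()))
log_lik = np.log(p[0])             # assuming the observed class is index 0
```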
Training Neural Networks by Stochastic Gradient Descent
I Sample gradient of expected loss L(w) = E[l]:
∂l/∂w ∼ E[∂l/∂w] = ∂L(w)/∂w
I Adjust w down the sampled gradient:
Δw ∝ ∂l/∂w
[Figure: scanned excerpt: the critic returns an error which, combined with the function approximator, gives an error function; the partial differential of this error function (the gradient) is used to update the internal variables of the function approximator (gradient descent)]
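A hedged sketch of this update on a toy linear regression problem, so the sampled gradient can be written by hand; the data, step size, and number of steps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

w_true = np.array([2.0, -3.0])      # illustrative target parameters
w = np.zeros(2)                     # parameters being trained
alpha = 0.1                         # step size

for step in range(2000):
    x = rng.normal(size=2)                     # sample one input
    y_star = w_true @ x + 0.01 * rng.normal()  # noisy target
    y = w @ x                                  # output (here just a linear model)
    l = (y_star - y) ** 2                      # sampled loss l
    dl_dw = -2.0 * (y_star - y) * x            # sampled gradient dl/dw
    w -= alpha * dl_dw                         # adjust w down the sampled gradient

# In expectation the sampled gradient equals dL(w)/dw, so w converges towards w_true.
```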
Weight Sharing
Recurrent neural network shares weights between time-steps
[Figure: recurrent network unrolled over time-steps, applying the same weights w to xt and xt+1]
Convolutional neural network shares weights between local regions
[Figure: convolutional layers h1, h2 built from input x by reusing the same weights w1, w2 at every location]
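A minimal numpy sketch of weight sharing in the recurrent case: the same weight matrices are reused at every time-step. The sizes and inputs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 3-dimensional inputs, 5 hidden units, 2 outputs.
W_x = 0.1 * rng.normal(size=(5, 3))   # input-to-hidden weights, shared across time
W_h = 0.1 * rng.normal(size=(5, 5))   # hidden-to-hidden weights, shared across time
W_y = 0.1 * rng.normal(size=(2, 5))   # hidden-to-output weights, shared across time

xs = rng.normal(size=(4, 3))          # a sequence of 4 inputs x_t
h = np.zeros(5)                       # initial hidden state
outputs = []
for x_t in xs:
    h = np.tanh(W_x @ x_t + W_h @ h)  # the same W_x, W_h applied at each step
    outputs.append(W_y @ h)           # the same W_y produces every output y_t
```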
Outline
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Many Faces of Reinforcement Learning
[Figure: Venn diagram placing reinforcement learning at the intersection of machine learning (computer science), optimal control (engineering), reward systems (neuroscience), operations research (mathematics), classical/operant conditioning (psychology), and game theory/rationality (economics)]
Agent and Environment
Experience is a sequence of observations, actions, and rewards: o1, r1, a1, ..., at−1, ot, rt
The state is a summary of that experience: st = f(o1, r1, a1, ..., at−1, ot, rt)
In a fully observed environment: st = f(ot)
Major Components of an RL Agent
[Figure: agent-environment loop; at each step the agent receives observation ot and reward rt and emits action at]
An RL agent may include one or more of these components: a value function, a policy, and a model of the environment.
Value-based RL
I Estimate the optimal value function Q*(s, a)
I This is the maximum value achievable under any policy
Policy-based RL
I Search directly for the optimal policy π*
I This is the policy achieving maximum future reward
Model-based RL
I Build a model of the environment
I Plan (e.g. by lookahead) using model
Deep Reinforcement Learning
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Q-Networks
Represent the value function by a Q-network with weights w:
Q(s, a, w) ≈ Q*(s, a)
[Figure: two architectures: a network that takes (s, a) and outputs a single Q-value, and a network that takes s and outputs one Q-value per action]
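A minimal sketch of the second architecture in the figure (state in, one Q-value per action out), using a small fully-connected network in numpy; the sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATE, N_HIDDEN, N_ACTIONS = 4, 16, 3   # illustrative sizes

# The weights w of the Q-network.
W1 = 0.1 * rng.normal(size=(N_HIDDEN, N_STATE))
W2 = 0.1 * rng.normal(size=(N_ACTIONS, N_HIDDEN))

def q_values(s):
    """Q(s, ., w): one output per action, so max_a Q(s, a, w) needs one forward pass."""
    h = np.maximum(W1 @ s, 0.0)
    return W2 @ h

s = rng.normal(size=N_STATE)
q = q_values(s)
greedy_action = int(np.argmax(q))         # argmax_a Q(s, a, w)
```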
Q-Learning
Experience is a stream of transitions
s1, a1, r2, s2
s2, a2, r3, s3
s3, a3, r4, s4
...
st, at, rt+1, st+1
each written generically as (s, a, r, s').
Minimise the mean-squared error between the Q-network and the Q-learning target by stochastic gradient descent (sketched below):
l = (r + γ max_{a'} Q(s', a', w) − Q(s, a, w))²
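A hedged sketch of this update, using a linear Q-function so the (semi-)gradient of the loss can be written by hand; the stored-transition list, feature sizes, hyper-parameters, and random data are illustrative assumptions.

```python
import random
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES, N_ACTIONS = 8, 4            # illustrative sizes
GAMMA, ALPHA = 0.99, 0.01               # discount factor and step size

# Linear Q-function standing in for a deep Q-network: Q(s, a, w) = w[a] . s
w = np.zeros((N_ACTIONS, N_FEATURES))
transitions = []                        # stored transitions (s, a, r, s')

def q(s):
    return w @ s                        # vector of Q(s, a, w) over all actions

def q_learning_update(batch_size=32):
    for s, a, r, s_next in random.sample(transitions, min(batch_size, len(transitions))):
        target = r + GAMMA * np.max(q(s_next))   # r + gamma max_a' Q(s', a', w)
        td_error = target - q(s)[a]
        # l = (target - Q(s, a, w))^2; treating the target as a constant, the
        # gradient w.r.t. w[a] is -2 * td_error * s, so step down that gradient.
        w[a] += ALPHA * td_error * s

# Illustrative usage with random transitions standing in for a real environment.
for _ in range(1000):
    transitions.append((rng.normal(size=N_FEATURES), int(rng.integers(N_ACTIONS)),
                        float(rng.normal()), rng.normal(size=N_FEATURES)))
q_learning_update()
```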
[Figure: loop over state st, action at, reward rt]
DQN in Atari
DQN paper:
www.nature.com/articles/nature14236
DQN computes the Q-learning error with older, fixed parameters w⁻ in the target:
r + γ max_{a'} Q(s', a', w⁻) − Q(s, a, w)
Improvements since Nature DQN
I Double DQN: remove the upward bias caused by max_a Q(s, a, w) (sketched below)
I Current Q-network w is used to select actions
I Older Q-network w⁻ is used to evaluate actions
l = (r + γ Q(s', argmax_{a'} Q(s', a', w), w⁻) − Q(s, a, w))²
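A hedged sketch of the double DQN target next to the standard target, using a placeholder linear Q-function; the parameters and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES, N_ACTIONS = 8, 4   # illustrative sizes
GAMMA = 0.99

w = rng.normal(size=(N_ACTIONS, N_FEATURES))        # current Q-network weights w
w_minus = rng.normal(size=(N_ACTIONS, N_FEATURES))  # older Q-network weights w-

def q(s, params):
    """Placeholder linear Q-function standing in for a deep Q-network."""
    return params @ s

def dqn_target(r, s_next):
    # Standard target: the older network both selects and evaluates the action,
    # so the max introduces an upward bias.
    return r + GAMMA * np.max(q(s_next, w_minus))

def double_dqn_target(r, s_next):
    # Double DQN: the current network w selects the action ...
    a_star = int(np.argmax(q(s_next, w)))
    # ... and the older network w- evaluates it.
    return r + GAMMA * q(s_next, w_minus)[a_star]
```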
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Deep Policy Networks
Represent the policy by a deep network with weights u:
a = π(a|s, u) or a = π(s, u)
Adjust u by stochastic gradient ascent, using for a stochastic policy (sketched below)
∂l/∂u = ∂ log π(a|s, u)/∂u · Q(s, a, w)
or, for a deterministic policy,
∂l/∂u = ∂Q(s, a, w)/∂a · ∂a/∂u
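A hedged sketch of the stochastic case for a linear softmax policy, with a placeholder linear critic Q(s, a, w) assumed to be learned elsewhere; the sizes and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES, N_ACTIONS = 6, 3   # illustrative sizes
ALPHA = 0.01                   # illustrative step size

w = rng.normal(size=(N_ACTIONS, N_FEATURES))   # placeholder critic weights for Q(s, a, w)

def pi(u, s):
    """Stochastic policy pi(a|s, u): softmax over linear scores."""
    z = u @ s
    p = np.exp(z - z.max())
    return p / p.sum()

def policy_gradient_step(u, s):
    p = pi(u, s)
    a = int(rng.choice(N_ACTIONS, p=p))
    # d log pi(a|s, u) / du for a linear softmax policy:
    # row b of the gradient is (1[b == a] - pi(b|s, u)) * s.
    grad_log_pi = -np.outer(p, s)
    grad_log_pi[a] += s
    # Scale by the critic's estimate Q(s, a, w) and step up the gradient,
    # so actions the critic rates highly become more probable.
    return u + ALPHA * grad_log_pi * (w @ s)[a]

u = np.zeros((N_ACTIONS, N_FEATURES))           # policy weights u
u = policy_gradient_step(u, rng.normal(size=N_FEATURES))
```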
Asynchronous Advantage Actor-Critic (A3C)
I Estimate the state-value function
V(s, v) ≈ E[rt+1 + γ rt+2 + ... | s]
I Q-value estimated by an n-step sample (sketched below)
qt = rt+1 + γ rt+2 + ... + γ^{n−1} rt+n + γ^n V(st+n, v)
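A minimal sketch of the n-step return and the resulting advantage; the rewards and critic values below are illustrative numbers, not data from the slides.

```python
GAMMA = 0.99   # illustrative discount factor

def n_step_return(rewards, v_bootstrap, gamma=GAMMA):
    """q_t = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n} + gamma^n * V(s_{t+n}, v)."""
    q = v_bootstrap
    for r in reversed(rewards):
        q = r + gamma * q
    return q

# Illustrative 5-step rollout: rewards r_{t+1}..r_{t+5} and the critic's values.
rewards = [0.0, 0.0, 1.0, 0.0, 0.5]
v_tail = 2.0                       # V(s_{t+n}, v), bootstrapped from the critic
v_now = 1.8                        # V(s_t, v)

q_t = n_step_return(rewards, v_tail)
advantage = q_t - v_now            # q_t - V(s_t, v): scales the actor's update in A3C
```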
Deep Reinforcement Learning in Labyrinth
[Figure: recurrent agent processing observations ot−1, ot, ot+1]
Demo:
www.youtube.com/watch?v=nMR5mjCFZCw&feature=youtu.be
The actor is updated in the direction suggested by the critic:
∂lu/∂u = ∂Q(s, a, w)/∂a · ∂a/∂u
I In other words, the critic provides the loss function for the actor (sketched below)
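A hedged sketch of this chain rule with a linear deterministic actor a = π(s, u) and a placeholder critic whose gradient ∂Q/∂a has a closed form; everything named here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES, ACTION_DIM = 6, 2   # illustrative sizes
ALPHA = 0.01                    # illustrative step size

W_q = rng.normal(size=(ACTION_DIM, N_FEATURES))   # placeholder critic: Q(s, a, w) = a . (W_q s)

def actor(s, u):
    """Deterministic policy a = pi(s, u), here linear."""
    return u @ s

def dQ_da(s, a):
    """Gradient dQ(s, a, w)/da of the placeholder critic; linear in a, so independent of a."""
    return W_q @ s

def dpg_step(u, s):
    a = actor(s, u)                     # current action a = pi(s, u)
    grad_u = np.outer(dQ_da(s, a), s)   # chain rule: dQ/da * da/du, with da_i/du_ij = s_j
    return u + ALPHA * grad_u           # adjust the policy in the direction that improves Q

u = 0.1 * rng.normal(size=(ACTION_DIM, N_FEATURES))  # actor weights u
u = dpg_step(u, rng.normal(size=N_FEATURES))
```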
DPG in Simulated Physics
I Physics domains are simulated in MuJoCo
I End-to-end learning of control policy from raw pixels s
I Input state s is stack of raw pixels from last 4 frames
I Two separate convnets are used for Q and π
I Policy is adjusted in the direction that most improves Q
[Figure: actor network π(s) outputs action a, which is fed into the critic network Q(s, a)]
DPG in Simulated Physics Demo
The average policy network π(a|s, u) is updated with gradient
∂l/∂u = ∂ log π(a|s, u)/∂u
I Actions a are a sampled mix of the policy network and the best response (sketched below)
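A hedged sketch of such a mixture, with placeholder average-policy and best-response networks; the mixing probability and exploration rate are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 3
ETA = 0.1        # illustrative probability of playing the best response
EPSILON = 0.05   # illustrative exploration rate for the best response

def average_policy(s):
    """Placeholder for the average policy network pi(a|s, u)."""
    return np.ones(N_ACTIONS) / N_ACTIONS

def best_response_q(s):
    """Placeholder for the best-response Q-network Q(s, ., w)."""
    return rng.normal(size=N_ACTIONS)

def act(s):
    if rng.random() < ETA:
        # Best response: epsilon-greedy with respect to Q.
        if rng.random() < EPSILON:
            return int(rng.integers(N_ACTIONS))
        return int(np.argmax(best_response_q(s)))
    # Otherwise sample from the average policy network.
    return int(rng.choice(N_ACTIONS, p=average_policy(s)))
```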
Neural FSP in Texas Hold'em Poker
I Heads-up limit Texas Hold'em
I NFSP with raw inputs only (no prior knowledge of Poker)
I vs SmooCT (3x medal winner 2015, handcrafted knowledge)
[Figure: performance in mbb/h over 3.5e7 iterations for SmooCT and for NFSP's best-response, greedy-average, and average strategies]
Outline
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Learning Models of the Environment
AlphaGo paper:
www.nature.com/articles/nature16961