Deep RL Tutorial Small
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Reinforcement Learning in a nutshell
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Deep Representations
[Figure: a deep representation composes many functions, x → h1 → ... → hn → y → loss l, with weights w1, ..., wn]
I Linear transformations hk+1 = W hk
I Non-linear activation functions hk+2 = f(hk+1)
I A loss function on the output, e.g.
I Mean-squared error l = ||y* − y||²
I Log likelihood l = log P[y*]
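A minimal numpy sketch of this composition and both losses, assuming a two-layer network with a ReLU activation; the sizes, data, and target class are illustrative choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 4-dimensional input, 8 hidden units, 2 outputs.
x = rng.normal(size=4)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(2, 8))

h1 = W1 @ x                        # linear transformation  hk+1 = W hk
h2 = np.maximum(h1, 0.0)           # non-linear activation  hk+2 = f(hk+1), here ReLU
y = W2 @ h2                        # network output

y_star = np.array([1.0, -1.0])     # target y*

# Mean-squared error  l = ||y* - y||^2
mse = np.sum((y_star - y) ** 2)

# Log likelihood  l = log P[y*], with a softmax turning outputs into probabilities.
p = np.exp(y - y.max()) / np.sum(np.exp(y - y.max()))
log_lik = np.log(p[0])             # assuming the observed class is index 0
```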
Training Neural Networks by Stochastic Gradient Descent
I Sample gradient of expected loss L(w) = E[l]:
∂l/∂w ∼ E[∂l/∂w] = ∂L(w)/∂w
I Adjust w down the sampled gradient:
Δw ∝ ∂l/∂w
[Figure: scanned excerpt: the critic returns an error which, combined with the function approximator, gives an error function; the partial differential of this error function (the gradient) is used to update the internal variables of the function approximator (gradient descent)]
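A hedged sketch of this update on a toy linear regression problem, so the sampled gradient can be written by hand; the data, step size, and number of steps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

w_true = np.array([2.0, -3.0])      # illustrative target parameters
w = np.zeros(2)                     # parameters being trained
alpha = 0.1                         # step size

for step in range(2000):
    x = rng.normal(size=2)                     # sample one input
    y_star = w_true @ x + 0.01 * rng.normal()  # noisy target
    y = w @ x                                  # output (here just a linear model)
    l = (y_star - y) ** 2                      # sampled loss l
    dl_dw = -2.0 * (y_star - y) * x            # sampled gradient dl/dw
    w -= alpha * dl_dw                         # adjust w down the sampled gradient

# In expectation the sampled gradient equals dL(w)/dw, so w converges towards w_true.
```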
Weight Sharing
Recurrent neural network shares weights between time-steps
[Figure: recurrent network unrolled over time-steps, applying the same weights w to xt and xt+1]
Convolutional neural network shares weights between local regions
[Figure: convolutional layers h1, h2 built from input x by reusing the same weights w1, w2 at every location]
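A minimal numpy sketch of weight sharing in the recurrent case: the same weight matrices are reused at every time-step. The sizes and inputs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 3-dimensional inputs, 5 hidden units, 2 outputs.
W_x = 0.1 * rng.normal(size=(5, 3))   # input-to-hidden weights, shared across time
W_h = 0.1 * rng.normal(size=(5, 5))   # hidden-to-hidden weights, shared across time
W_y = 0.1 * rng.normal(size=(2, 5))   # hidden-to-output weights, shared across time

xs = rng.normal(size=(4, 3))          # a sequence of 4 inputs x_t
h = np.zeros(5)                       # initial hidden state
outputs = []
for x_t in xs:
    h = np.tanh(W_x @ x_t + W_h @ h)  # the same W_x, W_h applied at each step
    outputs.append(W_y @ h)           # the same W_y produces every output y_t
```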
Outline
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Many Faces of Reinforcement Learning
[Figure: Venn diagram placing reinforcement learning at the intersection of machine learning (computer science), optimal control (engineering), reward systems (neuroscience), operations research (mathematics), classical/operant conditioning (psychology), and game theory/rationality (economics)]
Agent and Environment
Experience is a sequence of observations, actions, and rewards: o1, r1, a1, ..., at−1, ot, rt
The state is a summary of that experience: st = f(o1, r1, a1, ..., at−1, ot, rt)
In a fully observed environment: st = f(ot)
Major Components of an RL Agent
[Figure: agent-environment loop; at each step the agent receives observation ot and reward rt and emits action at]
An RL agent may include one or more of these components: a value function, a policy, and a model of the environment.
Value-based RL
I Estimate the optimal value function Q*(s, a)
I This is the maximum value achievable under any policy
Policy-based RL
I Search directly for the optimal policy π*
I This is the policy achieving maximum future reward
Model-based RL
I Build a model of the environment
I Plan (e.g. by lookahead) using model
Deep Reinforcement Learning
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Q-Networks
Represent the value function by a Q-network with weights w:
Q(s, a, w) ≈ Q*(s, a)
[Figure: two architectures: a network that takes (s, a) and outputs a single Q-value, and a network that takes s and outputs one Q-value per action]
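A minimal sketch of the second architecture in the figure (state in, one Q-value per action out), using a small fully-connected network in numpy; the sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATE, N_HIDDEN, N_ACTIONS = 4, 16, 3   # illustrative sizes

# The weights w of the Q-network.
W1 = 0.1 * rng.normal(size=(N_HIDDEN, N_STATE))
W2 = 0.1 * rng.normal(size=(N_ACTIONS, N_HIDDEN))

def q_values(s):
    """Q(s, ., w): one output per action, so max_a Q(s, a, w) needs one forward pass."""
    h = np.maximum(W1 @ s, 0.0)
    return W2 @ h

s = rng.normal(size=N_STATE)
q = q_values(s)
greedy_action = int(np.argmax(q))         # argmax_a Q(s, a, w)
```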
Q-Learning
Experience is a stream of transitions
s1, a1, r2, s2
s2, a2, r3, s3
s3, a3, r4, s4
...
st, at, rt+1, st+1
each written generically as (s, a, r, s').
Minimise the mean-squared error between the Q-network and the Q-learning target by stochastic gradient descent (sketched below):
l = (r + γ max_{a'} Q(s', a', w) − Q(s, a, w))²
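A hedged sketch of this update, using a linear Q-function so the (semi-)gradient of the loss can be written by hand; the stored-transition list, feature sizes, hyper-parameters, and random data are illustrative assumptions.

```python
import random
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES, N_ACTIONS = 8, 4            # illustrative sizes
GAMMA, ALPHA = 0.99, 0.01               # discount factor and step size

# Linear Q-function standing in for a deep Q-network: Q(s, a, w) = w[a] . s
w = np.zeros((N_ACTIONS, N_FEATURES))
transitions = []                        # stored transitions (s, a, r, s')

def q(s):
    return w @ s                        # vector of Q(s, a, w) over all actions

def q_learning_update(batch_size=32):
    for s, a, r, s_next in random.sample(transitions, min(batch_size, len(transitions))):
        target = r + GAMMA * np.max(q(s_next))   # r + gamma max_a' Q(s', a', w)
        td_error = target - q(s)[a]
        # l = (target - Q(s, a, w))^2; treating the target as a constant, the
        # gradient w.r.t. w[a] is -2 * td_error * s, so step down that gradient.
        w[a] += ALPHA * td_error * s

# Illustrative usage with random transitions standing in for a real environment.
for _ in range(1000):
    transitions.append((rng.normal(size=N_FEATURES), int(rng.integers(N_ACTIONS)),
                        float(rng.normal()), rng.normal(size=N_FEATURES)))
q_learning_update()
```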
[Figure: loop over state st, action at, reward rt]
DQN in Atari
DQN paper:
www.nature.com/articles/nature14236
DQN computes the Q-learning error with older, fixed parameters w⁻ in the target:
r + γ max_{a'} Q(s', a', w⁻) − Q(s, a, w)
Improvements since Nature DQN
I Double DQN: remove the upward bias caused by max_a Q(s, a, w) (sketched below)
I Current Q-network w is used to select actions
I Older Q-network w⁻ is used to evaluate actions
l = (r + γ Q(s', argmax_{a'} Q(s', a', w), w⁻) − Q(s, a, w))²
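A hedged sketch of the double DQN target next to the standard target, using a placeholder linear Q-function; the parameters and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES, N_ACTIONS = 8, 4   # illustrative sizes
GAMMA = 0.99

w = rng.normal(size=(N_ACTIONS, N_FEATURES))        # current Q-network weights w
w_minus = rng.normal(size=(N_ACTIONS, N_FEATURES))  # older Q-network weights w-

def q(s, params):
    """Placeholder linear Q-function standing in for a deep Q-network."""
    return params @ s

def dqn_target(r, s_next):
    # Standard target: the older network both selects and evaluates the action,
    # so the max introduces an upward bias.
    return r + GAMMA * np.max(q(s_next, w_minus))

def double_dqn_target(r, s_next):
    # Double DQN: the current network w selects the action ...
    a_star = int(np.argmax(q(s_next, w)))
    # ... and the older network w- evaluates it.
    return r + GAMMA * q(s_next, w_minus)[a_star]
```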
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Deep Policy Networks
Represent the policy by a deep network with weights u:
a = π(a|s, u) or a = π(s, u)
Adjust u by stochastic gradient ascent, using for a stochastic policy (sketched below)
∂l/∂u = ∂ log π(a|s, u)/∂u · Q(s, a, w)
or, for a deterministic policy,
∂l/∂u = ∂Q(s, a, w)/∂a · ∂a/∂u
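A hedged sketch of the stochastic case for a linear softmax policy, with a placeholder linear critic Q(s, a, w) assumed to be learned elsewhere; the sizes and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES, N_ACTIONS = 6, 3   # illustrative sizes
ALPHA = 0.01                   # illustrative step size

w = rng.normal(size=(N_ACTIONS, N_FEATURES))   # placeholder critic weights for Q(s, a, w)

def pi(u, s):
    """Stochastic policy pi(a|s, u): softmax over linear scores."""
    z = u @ s
    p = np.exp(z - z.max())
    return p / p.sum()

def policy_gradient_step(u, s):
    p = pi(u, s)
    a = int(rng.choice(N_ACTIONS, p=p))
    # d log pi(a|s, u) / du for a linear softmax policy:
    # row b of the gradient is (1[b == a] - pi(b|s, u)) * s.
    grad_log_pi = -np.outer(p, s)
    grad_log_pi[a] += s
    # Scale by the critic's estimate Q(s, a, w) and step up the gradient,
    # so actions the critic rates highly become more probable.
    return u + ALPHA * grad_log_pi * (w @ s)[a]

u = np.zeros((N_ACTIONS, N_FEATURES))           # policy weights u
u = policy_gradient_step(u, rng.normal(size=N_FEATURES))
```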
Asynchronous Advantage Actor-Critic (A3C)
I Estimate the state-value function
V(s, v) ≈ E[rt+1 + γ rt+2 + ... | s]
I Q-value estimated by an n-step sample (sketched below)
qt = rt+1 + γ rt+2 + ... + γ^{n−1} rt+n + γ^n V(st+n, v)
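A minimal sketch of the n-step return and the resulting advantage; the rewards and critic values below are illustrative numbers, not data from the slides.

```python
GAMMA = 0.99   # illustrative discount factor

def n_step_return(rewards, v_bootstrap, gamma=GAMMA):
    """q_t = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n} + gamma^n * V(s_{t+n}, v)."""
    q = v_bootstrap
    for r in reversed(rewards):
        q = r + gamma * q
    return q

# Illustrative 5-step rollout: rewards r_{t+1}..r_{t+5} and the critic's values.
rewards = [0.0, 0.0, 1.0, 0.0, 0.5]
v_tail = 2.0                       # V(s_{t+n}, v), bootstrapped from the critic
v_now = 1.8                        # V(s_t, v)

q_t = n_step_return(rewards, v_tail)
advantage = q_t - v_now            # q_t - V(s_t, v): scales the actor's update in A3C
```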
Deep Reinforcement Learning in Labyrinth
[Figure: recurrent agent processing observations ot−1, ot, ot+1]
Demo:
www.youtube.com/watch?v=nMR5mjCFZCw&feature=youtu.be
The actor is updated in the direction suggested by the critic:
∂lu/∂u = ∂Q(s, a, w)/∂a · ∂a/∂u
I In other words, the critic provides the loss function for the actor (sketched below)
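A hedged sketch of this chain rule with a linear deterministic actor a = π(s, u) and a placeholder critic whose gradient ∂Q/∂a has a closed form; everything named here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES, ACTION_DIM = 6, 2   # illustrative sizes
ALPHA = 0.01                    # illustrative step size

W_q = rng.normal(size=(ACTION_DIM, N_FEATURES))   # placeholder critic: Q(s, a, w) = a . (W_q s)

def actor(s, u):
    """Deterministic policy a = pi(s, u), here linear."""
    return u @ s

def dQ_da(s, a):
    """Gradient dQ(s, a, w)/da of the placeholder critic; linear in a, so independent of a."""
    return W_q @ s

def dpg_step(u, s):
    a = actor(s, u)                     # current action a = pi(s, u)
    grad_u = np.outer(dQ_da(s, a), s)   # chain rule: dQ/da * da/du, with da_i/du_ij = s_j
    return u + ALPHA * grad_u           # adjust the policy in the direction that improves Q

u = 0.1 * rng.normal(size=(ACTION_DIM, N_FEATURES))  # actor weights u
u = dpg_step(u, rng.normal(size=N_FEATURES))
```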
DPG in Simulated Physics
I Physics domains are simulated in MuJoCo
I End-to-end learning of control policy from raw pixels s
I Input state s is stack of raw pixels from last 4 frames
I Two separate convnets are used for Q and π
I Policy is adjusted in the direction that most improves Q
[Figure: actor network π(s) outputs action a, which is fed into the critic network Q(s, a)]
DPG in Simulated Physics Demo
The average policy network π(a|s, u) is updated with gradient
∂l/∂u = ∂ log π(a|s, u)/∂u
I Actions a are a sampled mix of the policy network and the best response (sketched below)
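A hedged sketch of such a mixture, with placeholder average-policy and best-response networks; the mixing probability and exploration rate are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 3
ETA = 0.1        # illustrative probability of playing the best response
EPSILON = 0.05   # illustrative exploration rate for the best response

def average_policy(s):
    """Placeholder for the average policy network pi(a|s, u)."""
    return np.ones(N_ACTIONS) / N_ACTIONS

def best_response_q(s):
    """Placeholder for the best-response Q-network Q(s, ., w)."""
    return rng.normal(size=N_ACTIONS)

def act(s):
    if rng.random() < ETA:
        # Best response: epsilon-greedy with respect to Q.
        if rng.random() < EPSILON:
            return int(rng.integers(N_ACTIONS))
        return int(np.argmax(best_response_q(s)))
    # Otherwise sample from the average policy network.
    return int(rng.choice(N_ACTIONS, p=average_policy(s)))
```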
Neural FSP in Texas Hold'em Poker
I Heads-up limit Texas Hold'em
I NFSP with raw inputs only (no prior knowledge of Poker)
I vs SmooCT (3x medal winner 2015, handcrafted knowledge)
[Figure: performance in mbb/h over 3.5e7 iterations for SmooCT and for NFSP's best-response, greedy-average, and average strategies]
Outline
Value-Based Deep RL
Policy-Based Deep RL
Model-Based Deep RL
Learning Models of the Environment
AlphaGo paper:
www.nature.com/articles/nature16961