dqn-atari
reinforcement learning
Jiang Guo
2016.04.19
Towards General Artificial Intelligence
• Playing Atari with Deep Reinforcement Learning. ArXiv (2013)
• 7 Atari games
• The first step towards “General Artificial Intelligence”
• Suppose you want to teach an agent (e.g. a neural network) to play this game
• Supervised training (have expert players play a million games)? That’s not how we learn!
• Reinforcement learning
Reinforcement Learning
• Unlike Supervised Learning, there is no target label for each training example; the agent only receives reward signals
RL is like Life!
Markov Decision Process
$s_0, a_0, r_1, s_1, a_1, r_2, \dots, s_{n-1}, a_{n-1}, r_n, s_n$
(diagram: the agent–environment loop of states, actions, and rewards, ending in a terminal state)
State Representation
Think about the Breakout game
• How to define a state?
• Location of the paddle
• Location/direction of the ball
• Presence/absence of each individual brick
• Or simply use the raw screen pixels
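Below is a minimal sketch (not from the slides) contrasting the two options: a hand-crafted feature vector versus a stack of recent preprocessed screens, which is what the DQN paper uses; the shapes and variable names are illustrative.

```python
import numpy as np

# Hand-crafted state: paddle position, ball position/velocity, brick flags (illustrative sizes)
paddle_x, ball_x, ball_y, ball_vx, ball_vy = 0.5, 0.3, 0.7, 0.01, -0.02
bricks = np.ones(6 * 18)                          # presence/absence of each brick
handcrafted_state = np.concatenate(([paddle_x, ball_x, ball_y, ball_vx, ball_vy], bricks))

# Pixel state: a stack of the last 4 preprocessed grayscale screens (DQN uses 84x84x4)
frame = np.zeros((84, 84), dtype=np.float32)      # one preprocessed screen
pixel_state = np.stack([frame] * 4, axis=-1)      # shape (84, 84, 4)
```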
MDP
• Future reward
$R = r_1 + r_2 + r_3 + \dots + r_n$
$R_t = r_t + r_{t+1} + r_{t+2} + \dots + r_n$
• Discounted future reward (environment is stochastic)
$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{n-t} r_n$
$\quad\;\; = r_t + \gamma (r_{t+1} + \gamma (r_{t+2} + \dots))$
$\quad\;\; = r_t + \gamma R_{t+1}$
• A good strategy for an agent would be to always choose an action that maximizes
the (discounted) future reward
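As a quick illustration (not from the slides), the recursion $R_t = r_t + \gamma R_{t+1}$ can be computed backwards over an episode; the reward values and $\gamma$ below are made up.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = r_t + gamma * R_{t+1} for every step t, working backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))
# [4.645, 4.05, 4.5, 5.0]
```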
Action-Value Function (Q-Function)
• We define a function $Q(s, a)$ representing the maximum discounted future reward when we perform action $a$ in state $s$ (and continue optimally thereafter):
$Q(s_t, a_t) = \max R_{t+1}$
• $\pi$ is the policy: acting greedily, $\pi(s) = \arg\max_a Q(s, a)$
Q-Learning
• How do we get the Q-function?
• Bellman Equation
$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$
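A minimal tabular sketch of how the Bellman equation is used as an iterative update (the dictionary-based Q-table, learning rate, and names are illustrative, not from the slides):

```python
from collections import defaultdict

Q = defaultdict(float)            # Q[(state, action)] -> estimated Q-value
alpha, gamma = 0.1, 0.99          # learning rate and discount factor

def q_update(s, a, r, s_next, actions):
    """Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```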
Value Iteration
Q-Learning
• In practice, Value Iteration is impractical
• It only works for very limited state/action spaces
• It cannot generalize to unobserved states
(diagram: a Q-network that takes the state and outputs one Q-value per action)
Deep Q-Network
$\frac{\partial L(w)}{\partial w} = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w)\right) \frac{\partial Q(s, a, w)}{\partial w}\right]$
• Optimize objective end-to-end by SGD
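A sketch of one such SGD step, assuming a small PyTorch Q-network that outputs one Q-value per action (the network size, batched tensors, and hyper-parameters are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # toy Q-network
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_sgd_step(s, a, r, s_next, done):
    """One SGD step on L = 1/2 * (r + gamma * max_a' Q(s', a') - Q(s, a))^2."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a) for the taken actions
    with torch.no_grad():                                       # Bellman target is held fixed
        target = r + gamma * q_net(s_next).max(dim=1).values * (1 - done)
    loss = 0.5 * (target - q_sa).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()                                             # computes dL/dw as in the formula above
    optimizer.step()
    return loss.item()
```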
Learning Stability
• Non-linear function approximator (Q-Network) is not very stable
• 𝜖-greedy policy
• With probability 𝜖 select a random action (Exploration)
• Otherwise select $a = \arg\max_{a'} Q(s, a')$ (Exploitation)
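A minimal sketch of $\epsilon$-greedy action selection (function and argument names are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # Exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])      # Exploitation
```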
Experience Replay
• To remove correlations, build data-set from agent’s own experience
$L = \frac{1}{2}\, \mathbb{E}_{s,a,r,s' \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)^2\right]$
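A sketch of such an experience memory $D$ (the capacity and batch size are illustrative choices):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions are dropped first

    def push(self, s, a, r, s_next, done):
        """Store one transition from the agent's own experience."""
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        """Draw uncorrelated transitions (s, a, r, s') ~ D for the loss above."""
        return random.sample(self.buffer, batch_size)
```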
• $\epsilon$-greedy policy
• Experience memory (replay buffer)
• Target network (see the sketch below)
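A sketch of the target-network idea, reusing the q_net from the earlier sketch: a periodically synced copy supplies the Bellman target so it does not chase the network being trained (the sync interval is an illustrative choice):

```python
import copy

target_net = copy.deepcopy(q_net)       # frozen copy used for r + gamma * max_a' Q_target(s', a')
SYNC_EVERY = 10_000                     # gradient steps between parameter syncs

def maybe_sync(step):
    """Copy the online network's weights into the target network every SYNC_EVERY steps."""
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```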
Effect of Experience Replay and Target Q-Network
A short review
• Reinforcement Learning
• Function approximators for end-to-end Q-learning
• Deep Learning
• Extract high-level feature representations from high-dimensional raw sensory
data