
Human-level control through deep reinforcement learning
Jiang Guo
2016.04.19
Towards General Artificial Intelligence
• Playing Atari with Deep Reinforcement Learning. ArXiv (2013)
• 7 Atari games
• The first step towards “General Artificial Intelligence”

• DeepMind was acquired by Google (2014)

• Human-level control through deep reinforcement learning. Nature (2015)
• 49 Atari games
• Google patented “Deep Reinforcement Learning”
Key Concepts
• Reinforcement Learning
• Markov Decision Process
• Discounted Future Reward
• Q-Learning
• Deep Q Network
• Exploration-Exploitation
• Experience Replay
• Deep Q-learning Algorithm
Reinforcement Learning
• Example: breakout (one of the Atari games)

• Suppose you want to teach an agent (e.g. a neural network) to play this game
• Supervised training (have expert players play a million times)? That’s not how we learn!
• Reinforcement learning
Reinforcement Learning
• Supervised Learning: a target label for each training example
• Reinforcement Learning: sparse and time-delayed labels
• Unsupervised Learning: no labels at all

(Example Atari games: Pong, Breakout, Space Invaders, Seaquest, Beam Rider)


RL is Learning from Interaction

RL is like Life!
Markov Decision Process

𝑠0 , 𝑎0 , 𝑟1 , 𝑠1 , 𝑎1 , 𝑟2 , … , 𝑠𝑛−1 , 𝑎𝑛−1 , 𝑟𝑛 , 𝑠𝑛
(Diagram: the agent-environment loop with states, actions, and rewards; 𝑠𝑛 is the terminal state)
State Representation
Think about the Breakout game
• How to define a state?
• Location of the paddle
• Location/direction of the ball
• Presence/absence of each individual brick

Let’s make it more universal!

Screen pixels
Value Function
𝑠0 , 𝑎0 , 𝑟1 , 𝑠1 , 𝑎1 , 𝑟2 , … , 𝑠𝑛−1 , 𝑎𝑛−1 , 𝑟𝑛 , 𝑠𝑛

• Future reward
𝑅 = 𝑟1 + 𝑟2 + 𝑟3 + ⋯ + 𝑟𝑛
𝑅𝑡 = 𝑟𝑡 + 𝑟𝑡+1 + 𝑟𝑡+2 + ⋯ + 𝑟𝑛
• Discounted future reward (environment is stochastic)
𝑅𝑡 = 𝑟𝑡 + 𝛾𝑟𝑡+1 + 𝛾²𝑟𝑡+2 + ⋯ + 𝛾ⁿ⁻ᵗ𝑟𝑛
= 𝑟𝑡 + 𝛾(𝑟𝑡+1 + 𝛾(𝑟𝑡+2 + ⋯ ))
= 𝑟𝑡 + 𝛾𝑅𝑡+1
• A good strategy for an agent would be to always choose an action that maximizes
the (discounted) future reward
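As a minimal illustration (not from the slides), the recursion 𝑅𝑡 = 𝑟𝑡 + 𝛾𝑅𝑡+1 can be computed by walking backwards over a reward sequence; the rewards and the discount factor below are made-up example values.

# Minimal sketch: computing discounted returns R_t = r_t + gamma * R_{t+1}
# The reward sequence and gamma are made-up example values.
def discounted_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):     # walk backwards so R_t uses R_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([0, 0, 1, 0, 5], gamma=0.9))
# -> roughly [4.09, 4.55, 5.05, 4.5, 5.0] (up to floating-point rounding)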
Value-Action Function
• We define a function 𝑄(𝑠, 𝑎) representing the maximum discounted future reward when we perform action 𝑎 in state 𝑠:
𝑄(𝑠𝑡, 𝑎𝑡) = max 𝑅𝑡+1

• Q-function: represents the “Quality” of a certain action in a given state


• Imagine you have the magical Q-function
𝜋(𝑠) = argmax𝑎 𝑄(𝑠, 𝑎)

• 𝜋 is the policy
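A tiny sketch (illustrative only, with a made-up Q-table for one state) of the greedy policy 𝜋(𝑠) = argmax𝑎 𝑄(𝑠, 𝑎):

# Illustrative only: a made-up Q-table and the greedy policy pi(s) = argmax_a Q(s, a)
Q = {("s0", "left"): 0.1, ("s0", "right"): 0.7, ("s0", "stay"): 0.3}

def greedy_action(Q, state, actions):
    return max(actions, key=lambda a: Q[(state, a)])    # pick the action with the highest Q-value

print(greedy_action(Q, "s0", ["left", "right", "stay"]))   # -> right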
Q-Learning
• How do we get the Q-function?
• Bellman Equation
𝑄(𝑠, 𝑎) = 𝑟 + 𝛾 max𝑎′ 𝑄(𝑠′, 𝑎′)

Value Iteration
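A minimal tabular sketch of the iterative Q-learning update implied by the Bellman equation; the environment interface (env.reset, env.step, env.actions) and the hyperparameters are assumptions for illustration, not part of the slides.

# Minimal sketch of tabular Q-learning driven by the Bellman equation.
# The env interface and hyperparameters are illustrative assumptions.
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(state, action)] -> value
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:        # epsilon-greedy action selection
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            # Bellman backup: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r + (0.0 if done else gamma * max(Q[(s2, a_)] for a_ in env.actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q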
Q-Learning
• In practice, tabular Value Iteration is impractical
• It only works for very limited state/action spaces
• It cannot generalize to unobserved states

• Think about the Breakout game
• State: screen pixels
• Image size: 84 × 84 (resized), grayscale with 256 gray levels
• 4 consecutive images stacked
• That gives 256^(84×84×4) rows in the Q-table!
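To see how hopeless a Q-table is here, a quick back-of-the-envelope computation of the number of possible states, using the 84 × 84 × 4 grayscale input from the slide:

# Back-of-the-envelope: how many rows would the Q-table need?
import math

pixels = 84 * 84 * 4                       # 4 stacked 84x84 grayscale frames
log10_rows = pixels * math.log10(256)      # log10 of 256^(84*84*4)
print(f"the Q-table would need ~10^{log10_rows:.0f} rows")   # -> ~10^67970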
Function Approximator
• Use a function (with parameters) to approximate the Q-function
𝑄(𝑠, 𝑎; 𝜃) ≈ 𝑄∗(𝑠, 𝑎)
• Linear
• Non-linear: Q-network
(Diagram: two network designs. Left: the network takes a state 𝑠 and an action 𝑎 and outputs a single Q-value. Right: the network takes only the state 𝑠 and outputs one Q-value per action, e.g. Q-value 1, Q-value 2, Q-value 3. DQN uses the right-hand design, so the Q-values of all actions come from a single forward pass.)
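A minimal sketch of the "linear" option: approximate 𝑄(𝑠, 𝑎; 𝜃) as a dot product between a parameter vector and a feature vector 𝜑(𝑠, 𝑎). The feature map and the 4-dimensional state below are placeholder assumptions for illustration.

# Minimal sketch of a linear Q-function approximator: Q(s, a; theta) = theta . phi(s, a)
import numpy as np

NUM_ACTIONS = 3

def phi(state, action):
    # Placeholder feature map: state features concatenated with a one-hot action encoding
    one_hot = np.zeros(NUM_ACTIONS)
    one_hot[action] = 1.0
    return np.concatenate([np.asarray(state, dtype=float), one_hot])

theta = np.zeros(4 + NUM_ACTIONS)          # assume a 4-dimensional state (e.g. paddle/ball coordinates)

def q_linear(state, action):
    return float(theta @ phi(state, action))

print(q_linear([0.2, 0.5, 0.1, -0.3], action=1))   # -> 0.0 with untrained theta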
Deep Q-Network

Deep Q-Network used in the DeepMind paper:

Note: No Pooling Layer!
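A rough PyTorch sketch of the network: a stack of convolutions with no pooling, followed by fully connected layers, taking 4 stacked 84 × 84 frames and outputting one Q-value per action. The layer sizes follow the commonly cited Nature-paper configuration, so treat the exact numbers as an approximation rather than a verified reproduction.

# Sketch of the DQN architecture (PyTorch); layer sizes are the commonly cited ones.
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4 stacked 84x84 frames in
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # note: no pooling layers
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),                            # one Q-value per action
        )

    def forward(self, x):            # x: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.net(x)

q_net = DQN(num_actions=4)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)   # -> torch.Size([1, 4])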


Estimating the Q-Network
• Objective Function
• Recall the Bellman Equation: 𝑄(𝑠, 𝑎) = 𝑟 + 𝛾 max𝑎′ 𝑄(𝑠′, 𝑎′)

• Here, we use simple squared error:

𝐿 = 𝔼[(𝑟 + 𝛾 max𝑎′ 𝑄(𝑠′, 𝑎′) − 𝑄(𝑠, 𝑎))²]

where 𝑟 + 𝛾 max𝑎′ 𝑄(𝑠′, 𝑎′) is the target.
• Leading to the following Q-learning gradient

𝜕𝐿(𝑤)/𝜕𝑤 = 𝔼[(𝑟 + 𝛾 max𝑎′ 𝑄(𝑠′, 𝑎′) − 𝑄(𝑠, 𝑎)) · 𝜕𝑄(𝑠, 𝑎; 𝑤)/𝜕𝑤]
• Optimize objective end-to-end by SGD
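A hedged PyTorch sketch of this objective on a batch of transitions, reusing the q_net sketched above; the batch tensors (s, a, r, s2, done) are assumed to come from the replay memory, and the optimizer setup is an illustrative choice.

# Sketch of the squared-error Q-learning objective and one SGD step (PyTorch).
import torch
import torch.nn.functional as F

def dqn_loss(q_net, s, a, r, s2, done, gamma=0.99):
    # s, s2: float tensors (batch, 4, 84, 84); a: long tensor (batch,); r, done: float tensors (batch,)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)                  # Q(s, a)
    with torch.no_grad():                                                  # treat the target as a constant
        target = r + gamma * (1.0 - done) * q_net(s2).max(dim=1).values   # r + gamma * max_a' Q(s', a')
    return F.mse_loss(q_sa, target)                                        # E[(target - Q(s, a))^2]

# One SGD step:
# optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
# loss = dqn_loss(q_net, s, a, r, s2, done)
# optimizer.zero_grad(); loss.backward(); optimizer.step()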
Learning Stability
• Non-linear function approximator (Q-Network) is not very stable

Deep Learning vs. Reinforcement Learning:
• Deep Learning: data samples are i.i.d. / Reinforcement Learning: states are highly correlated
• Deep Learning: the underlying data distribution is fixed / Reinforcement Learning: the data distribution changes

Two remedies:
1. Exploration-Exploitation
2. Experience Replay
Exploration-Exploitation Dilemma
• During training, how do we choose an action at time 𝑡?

• Exploration: random guessing

• Exploitation: choose the best action according to the Q-value

• 𝜖-greedy policy
• With probability 𝜖 select a random action (Exploration)
• Otherwise select 𝑎 = argmax𝑎′ 𝑄(𝑠, 𝑎′) (Exploitation)
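A minimal sketch of the 𝜖-greedy rule, assuming a q_values(state) callable that returns one Q-value per action (e.g. a forward pass of the Q-network):

# Minimal sketch of epsilon-greedy action selection.
import random

def epsilon_greedy(q_values, state, num_actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(num_actions)                       # Exploration: random action
    values = q_values(state)                                       # one Q-value per action
    return max(range(num_actions), key=lambda a: values[a])        # Exploitation: argmax_a' Q(s, a')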
Experience Replay
• To remove correlations, build a dataset from the agent’s own experience

1. Take action 𝑎𝑡 according to 𝝐-greedy policy


2. During gameplay, store transition < 𝑠𝑡 , 𝑎𝑡 , 𝑟𝑡+1 , 𝑠𝑡+1 > in replay memory 𝐷
3. Sample random mini-batch of transitions < 𝑠, 𝑎, 𝑟, 𝑠 ′ > from 𝐷
4. Optimize MSE between Q-network and Q-learning targets

𝐿 = 𝔼𝑠,𝑎,𝑟,𝑠′~𝐷 [ ½ (𝑟 + 𝛾 max𝑎′ 𝑄(𝑠′, 𝑎′) − 𝑄(𝑠, 𝑎))² ]
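A sketch of the replay memory 𝐷 and mini-batch sampling corresponding to steps 2-3 above; the capacity and batch size are arbitrary example values.

# Sketch of a replay memory D and random mini-batch sampling (steps 2-3 above).
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)             # oldest transitions are dropped automatically

    def store(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))          # step 2: store <s_t, a_t, r_t+1, s_t+1>

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # step 3: random mini-batch
        return list(zip(*batch))                         # tuples of states, actions, rewards, next states, dones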
(Diagram: the full training loop: the agent acts with an 𝜖-greedy policy, stores transitions in the experience memory, and computes Q-learning targets with a separate target network)
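The target network in the diagram is a periodically synced copy of the Q-network used to compute the targets 𝑟 + 𝛾 max𝑎′ 𝑄̂(𝑠′, 𝑎′), which further stabilizes learning. A hedged sketch of the sync step, reusing the q_net from the earlier PyTorch sketch; the sync interval is an arbitrary example value.

# Sketch of the target network: a frozen copy of the online Q-network, synced every C steps.
import copy

target_net = copy.deepcopy(q_net)           # start as an exact copy of the online Q-network
SYNC_EVERY = 10_000                         # arbitrary example value for C

def maybe_sync(step):
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())   # copy online weights into the target network

# In dqn_loss above, the targets would then be computed with target_net(s2) instead of q_net(s2).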
Effect of Experience Replay and Target Q-Network
A short review
• Reinforcement Learning
• Function approximators for end-to-end Q-learning
• Deep Learning
• Extract high-level feature representations from high-dimensional raw sensory data

Reinforcement Learning + Deep Learning = AI
(David Silver)
