Lecture 06: Reinforcement Learning

The document provides an overview of supervised, unsupervised, and reinforcement learning, focusing on reinforcement learning's goal of maximizing reward through agent-environment interaction. It details the Markov Decision Process, value functions, Q-learning, and the architecture of Q-networks used in deep Q-learning, discusses the importance of experience replay for training Q-networks efficiently, and covers policy gradient methods (REINFORCE, variance reduction, and actor-critic).


So far… Supervised Learning

Data: (x, y)
x is data, y is label

Goal: Learn a function to map x -> y

Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc.

(Example figure: an image of a cat labeled "Classification: Cat")


So far… Unsupervised Learning

Data: x
Just data, no labels!

Goal: Learn some underlying hidden structure of the data

Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.

(Figures: 1-d and 2-d density estimation examples)

Today: Reinforcement Learning

Problems involving an agent interacting with an environment, which provides numeric reward signals

Goal: Learn how to take actions in order to maximize reward

Overview

- What is Reinforcement Learning?
- Markov Decision Processes
- Q-Learning
- Policy Gradients

Reinforcement Learning

The agent-environment loop:
- The environment provides the agent with a state st
- The agent chooses an action at
- The environment returns a reward rt and the next state st+1
- The loop repeats

Cart-Pole Problem

Objective: Balance a pole on top of a movable cart

State: angle, angular speed, position, horizontal velocity
Action: horizontal force applied on the cart
Reward: 1 at each time step if the pole is upright
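As a concrete illustration (not from the original slides), here is a minimal random-agent loop for a cart-pole environment. It assumes the `gymnasium` package and its `CartPole-v1` environment are available; API details vary slightly across gym versions.

```python
# Minimal sketch: random agent on CartPole, assuming the `gymnasium` package is installed.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)        # obs = [position, velocity, angle, angular velocity]

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()                        # random push: left or right
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                                    # +1 for each step the pole stays up
    done = terminated or truncated

print("episode return:", total_reward)
env.close()
```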


Robot Locomotion

Objective: Make the robot move forward

State: Angle and position of the joints
Action: Torques applied on the joints
Reward: 1 at each time step if upright + moving forward
Atari Games

Objective: Complete the game with the highest score

State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step
Go

Objective: Win the game!

State: Position of all pieces
Action: Where to put the next piece down
Reward: 1 if win at the end of the game, 0 otherwise


How can we mathematically formalize the RL problem?

Recall the loop: the agent sees state st and reward rt, takes action at, and the environment returns reward rt and the next state st+1.
Markov Decision Process
- Mathematical formulation of the RL problem
- Markov property: the current state completely characterises the state of the world

Defined by (S, A, R, P, γ):

S: set of possible states
A: set of possible actions
R: distribution of reward given (state, action) pair
P: transition probability, i.e. distribution over next state given (state, action) pair
γ: discount factor
Markov Decision Process
- At time step t=0, environment samples initial state s0 ~ p(s0)
- Then, for t=0 until done:
  - Agent selects action at
  - Environment samples reward rt ~ R( . | st, at)
  - Environment samples next state st+1 ~ P( . | st, at)
  - Agent receives reward rt and next state st+1

- A policy π is a function from S to A that specifies what action to take in each state
- Objective: find the policy π* that maximizes the cumulative discounted reward Σ_{t≥0} γ^t rt
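To make the objective concrete, the sketch below samples a trajectory from a hand-made two-state MDP under a fixed policy and accumulates the discounted return Σ_{t≥0} γ^t rt; the states, rewards, and transition probabilities are invented purely for illustration.

```python
# Sketch: sample a trajectory from a toy MDP and accumulate the discounted return.
import random

# Hypothetical 2-state MDP (illustrative numbers only).
P = {  # transition probabilities: (state, action) -> {next_state: prob}
    ("A", "stay"): {"A": 0.9, "B": 0.1},
    ("A", "go"):   {"A": 0.2, "B": 0.8},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"):   {"A": 0.5, "B": 0.5},
}
R = {("A", "stay"): 0.0, ("A", "go"): 1.0, ("B", "stay"): 2.0, ("B", "go"): 0.0}
gamma = 0.9

def policy(state):                      # a deterministic policy: S -> A
    return "go" if state == "A" else "stay"

def rollout(s0="A", horizon=50):
    s, discounted_return = s0, 0.0
    for t in range(horizon):
        a = policy(s)                                  # agent selects action a_t
        r = R[(s, a)]                                  # reward r_t (deterministic here)
        discounted_return += (gamma ** t) * r          # accumulate gamma^t * r_t
        nxt = P[(s, a)]                                # s_{t+1} ~ P(. | s_t, a_t)
        s = random.choices(list(nxt), weights=list(nxt.values()))[0]
    return discounted_return

print(rollout())
```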
A simple MDP: Grid World

states: the grid cells (★ marks the terminal states in the figure)
actions = { right, left, up, down }

Set a negative "reward" for each transition (e.g. r = -1)

Objective: reach one of the terminal states (greyed out) in the least number of actions
A simple MDP: Grid World

(Figure: arrows drawn in each cell compare a Random Policy with the Optimal Policy; ★ marks the terminal states.)

The optimal policy π*
We want to find the optimal policy π* that maximizes the sum of rewards.

How do we handle the randomness (initial state, transition probability, …)?

Maximize the expected sum of rewards!

Formally: π* = arg max_π E[ Σ_{t≥0} γ^t rt | π ], with s0 ~ p(s0), at ~ π(·|st), st+1 ~ P(·|st, at)
Definitions: Value function and Q-value function
Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …

How good is a state?
The value function at state s is the expected cumulative reward from following the policy from state s:
V^π(s) = E[ Σ_{t≥0} γ^t rt | s0 = s, π ]

How good is a state-action pair?
The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
Q^π(s, a) = E[ Σ_{t≥0} γ^t rt | s0 = s, a0 = a, π ]
Bellman equation
The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair:
Q*(s, a) = max_π E[ Σ_{t≥0} γ^t rt | s0 = s, a0 = a, π ]

Q* satisfies the following Bellman equation:
Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

Intuition: if the optimal state-action values for the next time-step Q*(s', a') are known, then the optimal strategy is to take the action that maximizes the expected value of r + γ Q*(s', a').

The optimal policy π* corresponds to taking the best action in any state as specified by Q*.
Solving for the optimal policy
Value iteration algorithm: use the Bellman equation as an iterative update:
Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]

Qi will converge to Q* as i -> infinity

What's the problem with this?

Not scalable. Must compute Q(s,a) for every state-action pair. If the state is, e.g., the current game state in pixels, it is computationally infeasible to compute this for the entire state space!

Solution: use a function approximator to estimate Q(s,a), e.g. a neural network!
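Before moving to function approximation, the tabular update is easy to write down. The sketch below runs Q-value iteration on a small deterministic grid world in the spirit of the earlier example; the 4x4 layout, terminal cells, and r = -1 step cost are assumptions for illustration.

```python
# Sketch: tabular Q-value iteration, Q_{i+1}(s,a) = E[ r + gamma * max_a' Q_i(s',a') ].
# The 4x4 layout, corner terminals, and step cost are assumptions for illustration.
from itertools import product

H, W = 4, 4
states = list(product(range(H), range(W)))
terminals = {(0, 0), (H - 1, W - 1)}               # greyed-out terminal states (assumed corners)
actions = {"right": (0, 1), "left": (0, -1), "up": (-1, 0), "down": (1, 0)}
gamma, step_reward = 1.0, -1.0                     # r = -1 for each transition

def step(s, a):
    """Deterministic transition: move one cell if possible, else stay in place."""
    r, c = s
    dr, dc = actions[a]
    return (min(max(r + dr, 0), H - 1), min(max(c + dc, 0), W - 1))

Q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(100):                               # Bellman updates until convergence
    Q_new = {}
    for s, a in Q:
        if s in terminals:
            Q_new[(s, a)] = 0.0                    # no future reward from a terminal state
        else:
            s2 = step(s, a)
            Q_new[(s, a)] = step_reward + gamma * max(Q[(s2, a2)] for a2 in actions)
    if max(abs(Q_new[k] - Q[k]) for k in Q) < 1e-9:
        Q = Q_new
        break
    Q = Q_new

# Optimal value V*(s) = max_a Q*(s,a): minus the number of moves to the nearest terminal.
for r in range(H):
    print([max(Q[((r, c), a)] for a in actions) for c in range(W)])
```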


Solving for the optimal policy: Q-learning
Q-learning: use a function approximator to estimate the action-value function:
Q(s, a; θ) ≈ Q*(s, a)
where θ are the function parameters (weights).

If the function approximator is a deep neural network => deep q-learning!
Solving for the optimal policy: Q-learning
Remember: we want to find a Q-function that satisfies the Bellman equation:
Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

Forward Pass
Loss function: L_i(θ_i) = E_{s,a}[ (y_i - Q(s, a; θ_i))^2 ]
where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i-1}) | s, a ]

Backward Pass
Gradient update (with respect to the Q-function parameters θ):
θ_{i+1} = θ_i + α E_{s,a,s'}[ (y_i - Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ]

Iteratively try to make the Q-value close to the target value (y_i) it should have, if the Q-function corresponds to the optimal Q* (and the optimal policy π*).
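As a hedged sketch of how this forward/backward pass might look in code (not the original implementation): assume `q_net` and `target_net` map a batch of states to per-action Q-values, standing in for Q(·;θ_i) and the older Q(·;θ_{i-1}), and that the minibatch tensors come from elsewhere. The done mask is a standard extra detail not shown on the slide.

```python
# Sketch: one Q-learning update. L(theta) = E[(y - Q(s,a;theta))^2] with the target
# y = r + gamma * max_a' Q(s',a'; theta_old). q_net / target_net and the batch tensors
# are assumed to exist elsewhere (e.g. filled from a replay buffer).
import torch
import torch.nn.functional as F

def q_learning_step(q_net, target_net, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch    # shapes: [B,...],[B],[B],[B,...],[B]

    # Q(s, a; theta): pick out the value of the action that was actually taken.
    q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Target y = r + gamma * max_a' Q(s', a'; theta_old); no gradient flows through it.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones.float()) * next_q   # done masks future reward

    loss = F.mse_loss(q_values, targets)     # mean of (y_i - Q(s,a;theta_i))^2 over the batch

    optimizer.zero_grad()
    loss.backward()                          # gradient with respect to the Q-network weights theta
    optimizer.step()
    return loss.item()
```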
[Mnih et al. NIPS Workshop 2013; Nature 2015]

Case Study: Playing Atari Games

Objective: Complete the game with the highest score

State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step
[Mnih et al. NIPS Workshop 2013; Nature 2015]

Q-network Architecture

Q(s, a; θ): neural network with weights θ

FC-4 (Q-values)
FC-256
32 4x4 conv, stride 2
16 8x8 conv, stride 4

Input: state st
Current state st: 84x84x4 stack of the last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

Familiar conv layers and FC layers.
The last FC layer has a 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st, a4). The number of actions is between 4 and 18, depending on the Atari game.
A single feedforward pass computes the Q-values for all actions from the current state => efficient!
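The architecture above is specific enough to write down directly; below is a PyTorch reconstruction sketch of the described network (not the authors' code): 84x84x4 input, 16 8x8 filters with stride 4, 32 4x4 filters with stride 2, FC-256, and a final FC layer with one Q-value per action.

```python
# Sketch of the described Q-network (a reconstruction, not the original code):
# input 84x84x4, conv 16 8x8/4, conv 32 4x4/2, FC-256, FC with one Q-value per action.
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    def __init__(self, num_actions: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),                                # 32*9*9 = 2592 features
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # FC: Q(s_t, a_1), ..., Q(s_t, a_K)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: batch of states, shape [B, 4, 84, 84] (stack of the last 4 grayscale frames)
        return self.head(self.features(x))

# One forward pass yields the Q-values for all actions of the current state.
q = AtariQNetwork(num_actions=4)(torch.zeros(1, 4, 84, 84))
print(q.shape)   # torch.Size([1, 4])
```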
[Mnih et al. NIPS Workshop 2013; Nature 2015]

Training the Q-network: Loss function (from before)

Remember: we want to find a Q-function that satisfies the Bellman equation.

Forward Pass
Loss function: L_i(θ_i) = E_{s,a}[ (y_i - Q(s, a; θ_i))^2 ]
where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i-1}) | s, a ]

Backward Pass
Gradient update (with respect to the Q-function parameters θ), as before.

Iteratively try to make the Q-value close to the target value (y_i) it should have, if the Q-function corresponds to the optimal Q* (and the optimal policy π*).
[Mnih et al. NIPS Workshop 2013; Nature 2015]

Training the Q-network: Experience Replay

Learning from batches of consecutive samples is problematic:
- Samples are correlated => inefficient learning
- The current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

Address these problems using experience replay:
- Continually update a replay memory table of transitions (st, at, rt, st+1) as game (experience) episodes are played
- Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples (see the sketch below)

Each transition can also contribute to multiple weight updates => greater data efficiency
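A minimal replay memory along these lines could look like the sketch below; the capacity and the deque-based storage are implementation choices assumed here, not taken from the paper.

```python
# Sketch: replay memory storing (s_t, a_t, r_t, s_{t+1}, done) transitions and
# returning random minibatches instead of consecutive samples.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)      # old transitions are evicted automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)    # breaks temporal correlation
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```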
[Mnih et al. NIPS Workshop 2013; Nature 2015]

Putting it together: Deep Q-Learning with Experience Replay

- Initialize replay memory and Q-network
- Play M episodes (full games)
- Initialize the state (starting game screen pixels) at the beginning of each episode
- For each timestep t of the game:
  - With small probability, select a random action (explore); otherwise select the greedy action from the current policy
  - Take the action (at), and observe the reward rt and next state st+1
  - Store the transition (st, at, rt, st+1) in replay memory
  - Experience Replay: sample a random minibatch of transitions from the replay memory and perform a gradient descent step

https://www.youtube.com/watch?v=V1eYniJ0Rnk
Video by Károly Zsolnai-Fehér. Reproduced with permission.
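The sketch below mirrors these steps in code on a toy stand-in for the Atari setup, assuming gymnasium and PyTorch are installed; the CartPole environment, MLP Q-network, ε value, and other hyperparameters are illustrative assumptions rather than the published configuration.

```python
# Sketch: deep Q-learning with experience replay on CartPole (illustrative hyperparameters).
import random
from collections import deque

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]          # 4 state variables
n_actions = int(env.action_space.n)               # 2 actions: push left / push right

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
memory = deque(maxlen=50_000)                     # replay memory of transitions
gamma, epsilon, batch_size = 0.99, 0.1, 64

for episode in range(200):                        # play M episodes (full games)
    state, _ = env.reset()                        # initialize state at the start of the episode
    done = False
    while not done:                               # for each timestep t of the game
        if random.random() < epsilon:             # with small probability: random action (explore)
            action = env.action_space.sample()
        else:                                     # otherwise: greedy action from the current policy
            with torch.no_grad():
                action = int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        memory.append((state, action, reward, next_state, float(done)))   # store transition
        state = next_state

        if len(memory) >= batch_size:             # experience replay: random minibatch + SGD step
            batch = random.sample(memory, batch_size)
            s, a, r, s2, d = (torch.as_tensor(np.array(x), dtype=torch.float32)
                              for x in zip(*batch))
            q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                target = r + gamma * (1.0 - d) * q_net(s2).max(dim=1).values
            loss = F.mse_loss(q, target)          # TD error against the target y
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```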
Policy Gradients
What is a problem with Q-learning?
The Q-function can be very complicated!

Example: a robot grasping an object has a very high-dimensional state => hard to learn the exact value of every (state, action) pair

But the policy can be much simpler: just close your hand.
Can we learn a policy directly, e.g. by finding the best policy from a collection of policies?
Policy Gradients
Formally, let's define a class of parametrized policies: Π = { π_θ, θ ∈ R^m }

For each policy, define its value: J(θ) = E[ Σ_{t≥0} γ^t rt | π_θ ]

We want to find the optimal policy θ* = arg max_θ J(θ)

How can we do this?

Gradient ascent on the policy parameters!
REINFORCE algorithm
Mathematically, we can write the expected reward as:
J(θ) = E_{τ~p(τ;θ)}[ r(τ) ] = ∫ r(τ) p(τ; θ) dτ
where r(τ) is the reward of a trajectory τ = (s0, a0, r0, s1, …)

Now let's differentiate this:
∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ; θ) dτ
Intractable! The gradient of an expectation is problematic when p depends on θ.

However, we can use a nice trick:
∇_θ p(τ; θ) = p(τ; θ) ( ∇_θ p(τ; θ) / p(τ; θ) ) = p(τ; θ) ∇_θ log p(τ; θ)

If we inject this back:
∇_θ J(θ) = ∫ ( r(τ) ∇_θ log p(τ; θ) ) p(τ; θ) dτ = E_{τ~p(τ;θ)}[ r(τ) ∇_θ log p(τ; θ) ]
which we can estimate with Monte Carlo sampling.
REINFORCE algorithm
Can we compute those quantities without knowing the transition probabilities?

We have: p(τ; θ) = Π_{t≥0} P(s_{t+1} | st, at) π_θ(at | st)

Thus: log p(τ; θ) = Σ_{t≥0} [ log P(s_{t+1} | st, at) + log π_θ(at | st) ]

And when differentiating: ∇_θ log p(τ; θ) = Σ_{t≥0} ∇_θ log π_θ(at | st)
Doesn't depend on the transition probabilities!

Therefore, when sampling a trajectory τ, we can estimate ∇_θ J(θ) with
∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(at | st)
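In code, this Monte Carlo estimator amounts to weighting the log-probabilities of the sampled actions by the trajectory reward and letting autodiff produce the gradient. The sketch below uses a made-up categorical policy network and random placeholder states purely to show the shape of the computation.

```python
# Sketch: REINFORCE estimator, grad J(theta) ~= sum_t r(tau) * grad log pi_theta(a_t | s_t).
# The policy network, states, and trajectory reward are illustrative placeholders.
import torch
import torch.nn as nn

obs_dim, n_actions, T = 4, 2, 20
policy_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))

states = torch.randn(T, obs_dim)                   # s_t from one sampled trajectory (faked here)
dist = torch.distributions.Categorical(logits=policy_net(states))
actions = dist.sample()                            # a_t ~ pi_theta(. | s_t)
log_probs = dist.log_prob(actions)                 # log pi_theta(a_t | s_t), one per timestep
trajectory_reward = 1.0                            # r(tau): total reward of the whole trajectory

# Minimizing this surrogate makes autograd produce minus the REINFORCE estimator,
# i.e. a gradient-ascent step on J(theta) when fed to an optimizer.
loss = -(trajectory_reward * log_probs).sum()
loss.backward()
print(policy_net[0].weight.grad.shape)             # gradient w.r.t. the first layer's weights
```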


Intuition
Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(at | st)

Interpretation:
- If r(τ) is high, push up the probabilities of the actions seen
- If r(τ) is low, push down the probabilities of the actions seen

Might seem simplistic to say that if a trajectory is good then all its actions were good. But in expectation, it averages out!

However, this also suffers from high variance because credit assignment is really hard. Can we help the estimator?
Variance reduction
Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(at | st)

First idea: push up the probabilities of an action seen, only by the cumulative future reward from that state:
∇_θ J(θ) ≈ Σ_{t≥0} ( Σ_{t'≥t} r_{t'} ) ∇_θ log π_θ(at | st)

Second idea: use a discount factor γ to ignore delayed effects:
∇_θ J(θ) ≈ Σ_{t≥0} ( Σ_{t'≥t} γ^{t'-t} r_{t'} ) ∇_θ log π_θ(at | st)


Variance reduction: Baseline
Problem: the raw value of a trajectory isn't necessarily meaningful. For example, if rewards are all positive, you keep pushing up the probabilities of actions.

What is important then? Whether a reward is better or worse than what you expect to get.

Idea: introduce a baseline function dependent on the state.

Concretely, the estimator is now:
∇_θ J(θ) ≈ Σ_{t≥0} ( Σ_{t'≥t} γ^{t'-t} r_{t'} - b(st) ) ∇_θ log π_θ(at | st)
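A small numerical sketch of the reward-to-go and baseline ideas (the rewards, γ, and the simple average baseline are made up for illustration):

```python
# Sketch: per-timestep weights (sum_{t'>=t} gamma^{t'-t} r_{t'}) - b(s_t) that scale
# grad log pi_theta(a_t | s_t). Rewards, gamma, and the constant baseline are made up.
gamma = 0.99
rewards = [1.0, 1.0, 0.0, 1.0, 1.0]           # r_0 ... r_{T-1} from one sampled trajectory

# Discounted reward-to-go, computed backwards in time.
returns, running = [], 0.0
for r in reversed(rewards):
    running = r + gamma * running
    returns.append(running)
returns.reverse()

baseline = sum(returns) / len(returns)        # e.g. a simple average-return baseline
advantages = [g - baseline for g in returns]  # positive => push the action's probability up
print(advantages)
```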
How to choose the baseline?

A simple baseline: a constant moving average of rewards experienced so far from all trajectories

The variance reduction techniques seen so far are typically used in "Vanilla REINFORCE"
How to choose the baseline?
A better baseline: we want to push up the probability of an action from a state if this action was better than the expected value of what we should get from that state.

Q: What does this remind you of?

A: Q-function and value function!

Intuitively, we are happy with an action at in a state st if Q^π(st, at) - V^π(st) is large. On the contrary, we are unhappy with an action if it's small.

Using this, we get the estimator:
∇_θ J(θ) ≈ Σ_{t≥0} ( Q^π(st, at) - V^π(st) ) ∇_θ log π_θ(at | st)

Actor-Critic Algorithm
Problem: we don’t know Q and V. Can we learn them?

Yes, using Q-learning! We can combine Policy Gradients and Q-learning


by training both an actor (the policy) and a critic (the Q-function).

- The actor decides which action to take, and the critic tells the actor
how good its action was and how it should adjust
- Also alleviates the task of the critic as it only has to learn the values
of (state, action) pairs generated by the policy
- Can also incorporate Q-learning tricks e.g. experience replay
- Remark: we can define by the advantage function how much an
action was better than expected
Actor-Critic Algorithm
Initialize the policy parameters θ and critic parameters φ
For iteration = 1, 2, … do
    Sample m trajectories under the current policy
    For i = 1, …, m do
        For t = 1, …, T do
            Compute the advantage estimate for (st, at)
            Accumulate the policy gradient, weighting ∇_θ log π_θ(at | st) by the advantage
            Accumulate the critic gradient for the value estimate at st
        End for
    End for
    Update θ by gradient ascent and φ by gradient descent
End for
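A simplified single-trajectory version of such an update could look like the sketch below (an illustration, not the lecture's exact algorithm): the critic is a state-value network regressed toward observed returns, and the actor is updated with advantage-weighted log-probabilities.

```python
# Sketch: one actor-critic update on a single sampled trajectory (simplified illustration).
# actor = policy pi_theta, critic = state-value function V_phi; the data here is faked
# with random tensors purely so the example runs end to end.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, T, gamma = 4, 2, 20, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

states = torch.randn(T, obs_dim)                            # s_t along the trajectory
dist = torch.distributions.Categorical(logits=actor(states))
actions = dist.sample()                                     # a_t ~ pi_theta(. | s_t)
rewards = torch.rand(T)                                     # r_t (placeholder values)

# Discounted returns G_t, used as regression targets for the critic.
returns = torch.zeros(T)
running = 0.0
for t in reversed(range(T)):
    running = rewards[t] + gamma * running
    returns[t] = running

values = critic(states).squeeze(1)                          # V_phi(s_t)
advantages = (returns - values).detach()                    # A_t ~= G_t - V_phi(s_t)

actor_loss = -(advantages * dist.log_prob(actions)).sum()   # advantage-weighted REINFORCE term
critic_loss = F.mse_loss(values, returns)                   # push V_phi(s_t) toward G_t

actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()
```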
REINFORCE in action: Recurrent Attention Model (RAM)
[Mnih et al. 2014]

Objective: Image Classification

Take a sequence of "glimpses" selectively focusing on regions of the image, to predict the class
- Inspiration from human perception and eye movements
- Saves computational resources => scalability
- Able to ignore clutter / irrelevant parts of the image

State: glimpses seen so far
Action: (x, y) coordinates (center of the glimpse) of where to look next in the image
Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise

Glimpsing is a non-differentiable operation => learn the policy for how to take glimpse actions using REINFORCE.
Given the state of glimpses seen so far, use an RNN to model the state and output the next action.
REINFORCE in action: Recurrent Attention Model (RAM)
[Mnih et al. 2014]

(Figure: the input image is processed by a recurrent network that emits a sequence of glimpse locations (x1, y1), (x2, y2), …, (x5, y5); after the final glimpse, a softmax over classes predicts the label, e.g. y = 2.)

REINFORCE in action: Recurrent Attention Model (RAM)
[Mnih et al. 2014]

RAM has also been used in many other tasks, including fine-grained image recognition, image captioning, and visual question-answering!


More policy gradients: AlphaGo
[Silver et al., Nature 2016]

Overview:
- Mix of supervised learning and reinforcement learning
- Mix of old methods (Monte Carlo Tree Search) and recent ones (deep RL)

How to beat the Go world champion:
- Featurize the board (stone color, move legality, bias, …)
- Initialize the policy network with supervised training from professional Go games, then continue training using policy gradients (play against itself from random previous iterations, +1 / -1 reward for winning / losing)
- Also learn a value network (critic)
- Finally, combine the policy and value networks in a Monte Carlo Tree Search algorithm to select actions by lookahead search
Summary
- Policy gradients: very general, but suffer from high variance, so they require a lot of samples. Challenge: sample-efficiency
- Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration

- Guarantees:
- Policy Gradients: converges to a local maximum of J(θ), often good enough!
- Q-learning: no guarantees, since you are approximating the Bellman equation with a complicated function approximator
