Artificial Intelligence: Lecture 11 - Reinforcement Learning II Dr. Shivanjali Khare

The lecture covers advanced concepts in Reinforcement Learning (RL), focusing on model-free and model-based approaches, including Q-learning and temporal difference learning. It discusses the importance of exploration versus exploitation, the use of feature-based representations for generalization, and policy search techniques to optimize rewards. The session concludes with a transition to the next part of the course, which will address uncertainty and learning.


Artificial Intelligence
Lecture 11 – Reinforcement Learning II
Dr. Shivanjali Khare
Reinforcement Learning
• We still assume an MDP:
• A set of states s ∈ S
• A set of actions (per state) A
• A model T(s,a,s’)
• A reward function R(s,a,s’)
• Still looking for a policy π(s)

• New twist: don’t know T or R, so must try out actions

• Big idea: Compute all averages over T using sample outcomes

The Story So Far: MDPs and RL

Known MDP: Offline Solution

Goal                          Technique
Compute V*, Q*, π*            Value / policy iteration
Evaluate a fixed policy π     Policy evaluation

Unknown MDP: Model-Based

Goal                          Technique
Compute V*, Q*, π*            VI/PI on approx. MDP
Evaluate a fixed policy π     PE on approx. MDP

Unknown MDP: Model-Free

Goal                          Technique
Compute V*, Q*, π*            Q-learning
Evaluate a fixed policy π     Value Learning
Analogy: Expected Age

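Goal: Compute expected age of cs188 students

Known P(A):
$E[A] = \sum_a P(a) \cdot a$

Without P(A), instead collect samples [a1, a2, … aN]

Unknown P(A): “Model Based”
$\hat{P}(a) = \frac{\text{num}(a)}{N}$, then $E[A] \approx \sum_a \hat{P}(a) \cdot a$
Why does this work? Because eventually you learn the right model.

Unknown P(A): “Model Free”
$E[A] \approx \frac{1}{N} \sum_i a_i$
Why does this work? Because samples appear with the right frequencies.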
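The two estimators in the analogy can be sketched in a few lines. This is an illustrative sketch only: the sample list and both function names are made up for the example and do not come from the lecture.

```python
from collections import Counter

def model_based_expected_age(samples):
    """Estimate P(a) from counts first, then take the expectation under it."""
    n = len(samples)
    p_hat = {age: count / n for age, count in Counter(samples).items()}
    return sum(p * age for age, p in p_hat.items())

def model_free_expected_age(samples):
    """Average the samples directly, never building a model of P(A)."""
    return sum(samples) / len(samples)

ages = [20, 21, 21, 22, 20, 23, 21, 19]  # hypothetical survey data
mb = model_based_expected_age(ages)
mf = model_free_expected_age(ages)
```

On the same samples the two estimates coincide up to floating-point rounding. The difference is in what gets stored: the model-based version keeps an estimated distribution, while the model-free version keeps only a running average.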
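Sample-Based Policy Evaluation?
• We want to improve our estimate of V by computing these averages:

$V^{\pi}_{k+1}(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}_k(s') \right]$

• Idea: Take samples of outcomes s’ (by doing the action!) and average:

$\text{sample}_i = R(s, \pi(s), s'_i) + \gamma V^{\pi}_k(s'_i)$, for i = 1, …, n
$V^{\pi}_{k+1}(s) \leftarrow \frac{1}{n} \sum_i \text{sample}_i$

[Diagram: state s, q-state (s, π(s)), and sampled successors s’1, s’2, s’3]

Almost! But we can’t rewind time to get sample after sample from state s.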
Model-Free Learning

• Model-free (temporal difference) learning
• Experience world through episodes
• Update estimates each transition
• Over time, updates will mimic Bellman updates

[Diagram: s, then (s, a), reward r, then s’, then (s’, a’), then s’’]
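Temporal Difference Learning

• Temporal difference learning of values
• Policy still fixed, still doing evaluation!
• Move values toward value of whatever successor occurs: running average

[Diagram: s, then (s, π(s)), then s’]

Sample of V(s): $\text{sample} = R(s, \pi(s), s') + \gamma V^{\pi}(s')$

Update to V(s): $V^{\pi}(s) \leftarrow (1 - \alpha) V^{\pi}(s) + \alpha \cdot \text{sample}$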
Example: Temporal Difference Learning

States (grid layout):

    A
  B C D
    E

Observed Transitions: B, east, C, -2 then C, east, D, -2

Values                   A    B    C    D    E
Initial                  0    0    0    8    0
After B, east, C, -2     0   -1    0    8    0
After C, east, D, -2     0   -1    3    8    0

Assume: γ = 1, α = 1/2
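The example can be replayed with a short sketch of the TD value update. The function name `td_update` and the dictionary layout are assumptions made for illustration. The starting values, transitions, γ and α are the ones given on the slide.

```python
def td_update(values, s, reward, s_next, alpha, gamma):
    """Apply one temporal difference update to values[s] in place."""
    sample = reward + gamma * values[s_next]
    values[s] = (1 - alpha) * values[s] + alpha * sample
    return values[s]

# Initial value estimates from the example grid.
values = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}
# Observed transitions as (state, action, next state, reward).
transitions = [("B", "east", "C", -2), ("C", "east", "D", -2)]
for s, action, s_next, reward in transitions:
    td_update(values, s, reward, s_next, alpha=0.5, gamma=1.0)
```

After the first transition V(B) becomes -1, and after the second V(C) becomes 3, matching the three value grids in the example.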
Q-Learning

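• Q-Learning: sample-based Q-value iteration

• Learn Q(s,a) values as you go

• Receive a sample (s,a,s’,r)
• Consider your old estimate: $Q(s, a)$
• Consider your new sample estimate: $\text{sample} = R(s, a, s') + \gamma \max_{a'} Q(s', a')$ (no longer policy evaluation!)
• Incorporate the new estimate into a running average: $Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \cdot \text{sample}$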
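A minimal sketch of the tabular Q-learning update described above. The function name, the `(state, action)` dictionary keys, and the numbers are assumptions for illustration, not the lecture's code.

```python
from collections import defaultdict

def q_learning_update(q, s, a, r, s_next, next_actions, alpha, gamma):
    """Blend the new sample into Q(s, a) as a running average."""
    # Value of the best action available from s_next (0 if s_next is terminal).
    best_next = max((q[(s_next, a2)] for a2 in next_actions), default=0.0)
    sample = r + gamma * best_next
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * sample
    return q[(s, a)]

q = defaultdict(float)          # unseen (state, action) pairs start at 0
q[("C", "east")] = 4.0          # hypothetical current estimates
q[("C", "west")] = 1.0
new_q = q_learning_update(q, "B", "east", -2, "C", ["east", "west"],
                          alpha=0.5, gamma=1.0)
```

The sample uses the maximum over next actions rather than the action the agent actually takes next. That is why the estimate tracks the optimal Q-values even when behavior is exploratory.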

[Demo: Q-learning – gridworld (L10D2)]

[Demo: Q-learning – crawler (L10D3)]
Video of Demo Q-Learning -- Gridworld
Video of Demo Q-Learning -- Crawler
Q-Learning Properties

• Amazing result: Q-learning converges to optimal policy -- even if you’re acting suboptimally!

• This is called off-policy learning

• Caveats:
• You have to explore enough
• You have to eventually make the learning rate small enough
• … but not decrease it too quickly
• Basically, in the limit, it doesn’t matter how you select actions (!)
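The two learning-rate caveats are usually stated as the standard stochastic-approximation conditions. The notation $\alpha_t$, for the learning rate used on the t-th update of a given (s, a) pair, is introduced here and does not appear on the slide.

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty
\qquad \text{and} \qquad
\sum_{t=1}^{\infty} \alpha_t^2 < \infty
```

For example, $\alpha_t = 1/t$ satisfies both. A constant $\alpha$ violates the second condition (it never becomes small enough), and $\alpha_t = 1/t^2$ violates the first (it decreases too quickly).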

[Demo: Q-learning – auto – cliff grid (L11D1)]

Active Reinforcement Learning
Model-Free Learning

• act according to current optimal (based on Q-Values)

• but also explore…
Model-Based Learning

[Diagram: input policy π on a grid with states A; B, C, D; E]

• act according to current optimal
• also explore!
Exploration vs. Exploitation
Video of Demo Q-learning – Manual Exploration – Bridge Grid
How to Explore?

• Several schemes for forcing exploration

• Simplest: random actions (ε-greedy)
• Every time step, flip a coin
• With (small) probability ε, act randomly
• With (large) probability 1-ε, act on current policy

• Problems with random actions?

• You do eventually explore the space, but keep thrashing around once learning is done
• One solution: lower ε over time
• Another solution: exploration functions
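The coin-flip scheme above, together with one possible way of lowering ε over time, might look like the sketch below. The function names and the geometric decay schedule with its parameters are assumptions chosen for illustration. The lecture does not prescribe a particular schedule.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values maps each legal action to its current Q-value estimate.
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)          # explore
    return max(actions, key=q_values.get)   # exploit current policy

def decayed_epsilon(step, start=1.0, decay=0.99, floor=0.01):
    """Hypothetical schedule: shrink epsilon geometrically, down to a floor."""
    return max(floor, start * decay ** step)

q_values = {"north": 0.2, "east": 1.5, "south": -0.3}
greedy_choice = epsilon_greedy(q_values, epsilon=0.0)
```

With `epsilon=0.0` the agent always takes the greedy action (`"east"` here). Keeping a small floor on ε trades a little thrashing for continued exploration.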
[Demo: Q-learning – manual exploration – bridge grid (L11D2)]
[Demo: Q-learning – epsilon-greedy -- crawler (L11D3)]
Video of Demo Q-learning – Epsilon-Greedy – Crawler
Exploration Functions

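• When to explore?
• Random actions: explore a fixed amount
• Better idea: explore areas whose badness is not (yet) established, eventually stop exploring

• Exploration function
• Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. $f(u, n) = u + k/n$

Regular Q-Update: $Q(s, a) \leftarrow_{\alpha} R(s, a, s') + \gamma \max_{a'} Q(s', a')$
Modified Q-Update: $Q(s, a) \leftarrow_{\alpha} R(s, a, s') + \gamma \max_{a'} f(Q(s', a'), N(s', a'))$

(Here $x \leftarrow_{\alpha} v$ is shorthand for the running average $x \leftarrow (1 - \alpha) x + \alpha v$, and N(s’, a’) is the visit count.)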

• Note: this propagates the “bonus” back to states that lead to unknown states as well!
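A sketch of a Q-update that uses an exploration function. It assumes the optimistic form f(u, n) = u + k / (n + 1). The "+ 1" is this sketch's own choice so that unvisited pairs (n = 0) stay finite. The names `exploration_value`, `modified_q_update`, and the constant `k` are likewise illustrative.

```python
from collections import defaultdict

def exploration_value(u, n, k=1.0):
    """Optimistic utility: the bonus k / (n + 1) shrinks as visits accumulate."""
    return u + k / (n + 1)

def modified_q_update(q, counts, s, a, r, s_next, next_actions,
                      alpha, gamma, k=1.0):
    """Q-update whose sample uses optimistic values for the successor's actions."""
    counts[(s, a)] += 1
    best_next = max(
        (exploration_value(q[(s_next, a2)], counts[(s_next, a2)], k)
         for a2 in next_actions),
        default=0.0,
    )
    sample = r + gamma * best_next
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * sample
    return q[(s, a)]

q = defaultdict(float)
counts = defaultdict(int)
q[("C", "east")] = 1.0
counts[("C", "east")] = 9      # well explored: bonus 2 / 10 = 0.2
# ("C", "west") is unvisited: value 0 plus bonus 2 / 1 = 2.0
new_q = modified_q_update(q, counts, "B", "east", 0.0, "C", ["east", "west"],
                          alpha=0.5, gamma=1.0, k=2.0)
```

Here the unvisited action at C wins the max, so its bonus flows back into Q(B, east). This is the propagation effect the note describes.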

[Demo: exploration – Q-learning – crawler – exploration function (L11D4)]

Video of Demo Q-learning – Exploration Function – Crawler
Regret

• Even if you learn the optimal policy, you still make mistakes along the way!
• Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards
• Minimizing regret goes beyond learning to be optimal – it requires optimally learning to be optimal
• Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret
Approximate Q-Learning
Generalizing Across States

• Basic Q-Learning keeps a table of all q-values

• In realistic situations, we cannot possibly learn about every single state!
• Too many states to visit them all in training
• Too many states to hold the q-tables in memory

• Instead, we want to generalize:
• Learn about some small number of training states from experience
• Generalize that experience to new, similar situations
• This is a fundamental idea in machine learning, and we’ll see it over and over again

[demo – RL pacman]
Example: Pacman

[Figures: three Pacman board states, captioned in order]
• Let’s say we discover through experience that this state is bad:
• In naïve q-learning, we know nothing about this state:
• Or even this one!

[Demo: Q-learning – pacman – tiny – watch all (L11D5)]

[Demo: Q-learning – pacman – tiny – silent train (L11D6)]
[Demo: Q-learning – pacman – tricky – watch all (L11D7)]
Video of Demo Q-Learning Pacman – Tiny – Watch All
Video of Demo Q-Learning Pacman – Tiny – Silent Train
Video of Demo Q-Learning Pacman – Tricky – Watch All
Feature-Based Representations

• Solution: describe a state using a vector of features (properties)
• Features are functions from states to real numbers (often 0/1) that capture important properties of the state
• Example features:
• Distance to closest ghost
• Distance to closest dot
• Number of ghosts
• 1 / (dist to dot)^2
• Is Pacman in a tunnel? (0/1)
• … etc.
• Is it the exact state on this slide?
• Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
Linear Value Functions

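• Using a feature representation, we can write a q function (or value function) for any state using a few weights:

$V(s) = w_1 f_1(s) + w_2 f_2(s) + \dots + w_n f_n(s)$
$Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + \dots + w_n f_n(s, a)$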

• Advantage: our experience is summed up in a few powerful numbers

• Disadvantage: states may share features but actually be very different in value!
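A linear Q-function is a weighted sum of feature values. The sketch below shows this with two hypothetical Pacman features. The feature names, weights, and values are made up for illustration.

```python
def linear_q(weights, features):
    """Q(s, a) = sum of w_i * f_i(s, a) over the features that are present."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# Hypothetical weights and feature values for one Pacman q-state.
weights = {"dist_to_dot": 4.0, "dist_to_ghost": -1.0}
features = {"dist_to_dot": 0.5, "dist_to_ghost": 1.0}
q_value = linear_q(weights, features)   # 4.0 * 0.5 + (-1.0) * 1.0
```

Any two q-states with these same feature values get the same Q-value of 1.0, whatever else differs between them. That is both the generalization benefit and the disadvantage noted above.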
Approximate Q-Learning

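• Q-learning with linear Q-functions:

transition = (s, a, r, s’)
$\text{difference} = \left[ r + \gamma \max_{a'} Q(s', a') \right] - Q(s, a)$
Exact Q’s: $Q(s, a) \leftarrow Q(s, a) + \alpha \, [\text{difference}]$
Approximate Q’s: $w_i \leftarrow w_i + \alpha \, [\text{difference}] \, f_i(s, a)$

• Intuitive interpretation:
• Adjust weights of active features
• E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features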

• Formal justification: online least squares
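The weight update for linear Q-functions can be sketched as follows. The feature names, starting weights, reward, and learning rate are hypothetical values chosen for illustration. `max_next_q` stands in for the maximum over a’ of Q(s’, a’), which the caller is assumed to compute.

```python
def linear_q(weights, features):
    """Q(s, a) as a weighted sum of feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def approx_q_update(weights, features, reward, max_next_q, alpha, gamma):
    """Shift each active feature's weight in proportion to the TD difference."""
    # The difference is computed once, before any weight changes.
    difference = (reward + gamma * max_next_q) - linear_q(weights, features)
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return difference

weights = {"dist_to_dot": 4.0, "dist_to_ghost": -1.0}
features = {"dist_to_dot": 0.5, "dist_to_ghost": 1.0}   # f(s, a) for the action taken
# Something unexpectedly bad happens: large negative reward, terminal next state.
difference = approx_q_update(weights, features, reward=-500.0, max_next_q=0.0,
                             alpha=0.004, gamma=1.0)
```

The difference is -501, so both weights move down: `dist_to_dot` from 4.0 to about 2.998 and `dist_to_ghost` from -1.0 to about -3.004. A feature whose value is 0 for this q-state would be left unchanged, which is what "adjust weights of active features" means.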

Example: Q-Pacman

[Demo: approximate Q-learning pacman (L11D10)]
Video of Demo Approximate Q-Learning -- Pacman
DeepMind Atari () approximate Q-learning with neural nets
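Q-Learning and Least Squares
Linear Approximation: Regression

[Figure: two regression plots, one fit over a single feature and one over two features]

Prediction: $\hat{y} = w_0 + w_1 f_1(x)$
Prediction: $\hat{y} = w_0 + w_1 f_1(x) + w_2 f_2(x)$
Optimization: Least Squares

$\text{total error} = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \left( y_i - \sum_k w_k f_k(x_i) \right)^2$

[Figure: an observation, its prediction, and the error or “residual” between them]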
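Minimizing Error

Imagine we had only one point x, with features f(x), target value y, and weights w:

$\text{error}(w) = \frac{1}{2} \left( y - \sum_k w_k f_k(x) \right)^2$
$\frac{\partial \, \text{error}(w)}{\partial w_i} = - \left( y - \sum_k w_k f_k(x) \right) f_i(x)$
$w_i \leftarrow w_i + \alpha \left( y - \sum_k w_k f_k(x) \right) f_i(x)$

Approximate q update explained:

$w_i \leftarrow w_i + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] f_i(s, a)$

Here $r + \gamma \max_{a'} Q(s', a')$ is the “target” and $Q(s, a)$ is the “prediction”.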
Overfitting: Why Limiting Capacity Can Help

[Figure: degree 15 polynomial fit to data points, x-axis from 0 to 20]
Policy Search
Policy Search
• Problem: often the feature-based policies that work well (win games, maximize utilities) aren’t the ones that approximate V / Q best
• E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
• Q-learning’s priority: get Q-values close (modeling)
• Action selection priority: get ordering of Q-values right (prediction)
• We’ll see this distinction between modeling and prediction again later in the course

• Solution: learn policies that maximize rewards, not the values that predict them

• Policy search: start with an ok solution (e.g. Q-learning) then fine-tune by hill climbing on feature weights
Policy Search
• Simplest policy search:
• Start with an initial linear value function or Q-function
• Nudge each feature weight up and down and see if your policy is better than before

• Problems:
• How do we tell the policy got better?
• Need to run many sample episodes!
• If there are a lot of features, this can be impractical

• Better methods exploit lookahead structure, sample wisely, change multiple parameters…
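The simplest policy search, nudging one weight at a time, can be sketched as coordinate-wise hill climbing. Everything here is an illustrative assumption: the function names, the step size, and especially `toy_score`. It stands in for "run many sample episodes and average the reward", which is the expensive part in a real problem.

```python
def hill_climb(weights, evaluate, step=0.1, rounds=20):
    """Nudge each weight up and down, keeping a change only if the score improves.

    evaluate(weights) is a caller-supplied scoring function; in RL it would
    average returns over many sample episodes.
    """
    best = dict(weights)
    best_score = evaluate(best)
    for _ in range(rounds):
        improved = False
        for name in list(weights):
            for delta in (step, -step):
                candidate = dict(best)
                candidate[name] += delta
                score = evaluate(candidate)
                if score > best_score:
                    best, best_score, improved = candidate, score, True
        if not improved:
            break   # no single nudge helps any more
    return best, best_score

# Stand-in for policy evaluation: score peaks at w1 = 1, w2 = -2.
def toy_score(w):
    return -((w["w1"] - 1.0) ** 2 + (w["w2"] + 2.0) ** 2)

tuned, score = hill_climb({"w1": 0.0, "w2": 0.0}, toy_score, step=0.5, rounds=50)
```

Each round calls `evaluate` twice per feature, which makes the cost concrete: with many features and noisy episode-based scores, the number of episodes needed grows quickly.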
The Story So Far: MDPs and RL

Known MDP: Offline Solution

Goal                          Technique
Compute V*, Q*, π*            Value / policy iteration
Evaluate a fixed policy π     Policy evaluation

Unknown MDP: Model-Based (*use features to generalize)

Goal                          Technique
Compute V*, Q*, π*            VI/PI on approx. MDP
Evaluate a fixed policy π     PE on approx. MDP

Unknown MDP: Model-Free (*use features to generalize)

Goal                          Technique
Compute V*, Q*, π*            Q-learning
Evaluate a fixed policy π     Value Learning
Discussion: Model-Based vs Model-Free RL
RL: Helicopter Flight

[Andrew Ng] [Video: HELICOPTER]

RL: Learning Locomotion

[Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016] [Video: GAE]

RL: Learning Soccer

[Bansal et al, 2017]

RL: Learning Manipulation

[Levine, Finn, Darrell, Abbeel, JMLR 2016]

RL: NASA SUPERball

[Geng*, Zhang*, Bruce*, Caluwaerts, Vespignani, Sunspiral, Abbeel, Levine, ICRA 2017]

Pieter Abbeel -- UC Berkeley | Gradescope | Covariant.AI
RL: In-Hand Manipulation
OpenAI: Dactyl

Trained with domain randomization

[OpenAI]
Conclusion

• We’re done with Part I: Search and Planning!

• We’ve seen how AI methods can solve problems in:
• Search
• Constraint Satisfaction Problems
• Games
• Markov Decision Problems
• Reinforcement Learning

• Next up: Part II: Uncertainty and Learning!
