DA6400:
Reinforcement Learning
E Slot
Tue 11:00–11:50, Wed 10:00–10:50, Fri 5:00–5:50
Tutorial: Thu 8:00–8:50
B. Ravindran
TAs: Returaj, Jash, Argha
Lecture 1:
Introduction to
Reinforcement Learning
B. Ravindran
Machine Learning
❏ Learn functions from inputs to outputs, given data
DA6400 Lecture 1 3
Reinforcement Learning
❏ Familiar models of machine learning
❏ Learning from data
❏ How did you learn to cycle?
❏ Trial and error!
❏ Falling down hurts!
❏ Evaluation, not instruction
❏ Reinforcement Learning
❏ Walk, Talk, etc.
Reinforcement Learning
❏ A trial-and-error learning paradigm
❏ Rewards and Punishments
❏ Learn about a system through interaction
❏ Inspired by behavioural psychology!
❏ Pavlov’s dog
Reinforcement Learning Works!
Tic-Tac-Toe
[Figure: a tree of Tic-Tac-Toe positions showing possible moves from the current board]
Supervised Learning
[Figure: Tic-Tac-Toe board positions presented as training inputs]
Supervised Learning
[Figure: Tic-Tac-Toe board positions labelled with Expert Moves as training targets]
Reinforcement Learning
❏ Learn from evaluation
❏ Win gives 1 point
❏ Loss gives -1 point
❏ Draw gives 0 points
❏ Learn from repeated play
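This evaluative setup can be sketched in a few lines. The following is a minimal illustration, not the lecture's exact method: a table of state values, nudged after each game toward the final reward, in the spirit of Sutton and Barto's opening Tic-Tac-Toe example (the names and step size are illustrative choices).

```python
import random

# Minimal sketch of learning from evaluation in Tic-Tac-Toe: keep a
# table of state values and nudge the value of every state visited in
# a game toward the final reward (+1 win, -1 loss, 0 draw).

ALPHA = 0.1          # step size for the value updates (arbitrary choice)
values = {}          # board state (tuple of 9 cells) -> estimated value

def update_from_game(states, outcome):
    """Back up the final evaluation through the states of one game."""
    target = outcome                   # +1 win, -1 loss, 0 draw
    for state in reversed(states):
        v = values.get(state, 0.0)
        v += ALPHA * (target - v)      # move the estimate toward the target
        values[state] = v
        target = v                     # earlier states chase later estimates

# "Learn from repeated play": here the games are faked with random
# outcomes purely to show values accumulating from evaluation alone.
start = (".",) * 9
for _ in range(1000):
    update_from_game([start], random.choice([1, -1, 0]))
```

Note that the update uses only the evaluation of the outcome, never an instruction about which move was correct.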
MENACE
❏ Machine Educable Noughts And Crosses Engine
❏ Michie, 1961
More Tic-Tac-Toe
[Figure: a set of similar Tic-Tac-Toe positions]
❏ Assume an imperfect opponent
❏ makes mistakes sometimes
Reinforcement Learning
❏ Simple rule to explain complex behaviors
❏ Intuition: the prediction of the outcome at time t+1 is better than the prediction at time t; hence, use the later prediction to adjust the earlier prediction
❏ Has also had profound impact in behavioral
psychology and neuroscience!
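The intuition above is the temporal-difference idea. A minimal TD(0) sketch on a toy problem (the 3-state chain and the step size below are made up purely for illustration):

```python
# Minimal TD(0) sketch: the prediction at t+1 is used to adjust the
# prediction at t. Toy 3-state chain A -> B -> end, with reward 1 on
# the final step.
ALPHA, GAMMA = 0.1, 1.0
V = {"A": 0.0, "B": 0.0, "end": 0.0}

def td0_update(s, r, s_next):
    # V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
    V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

for _ in range(200):                # repeated episodes through the chain
    td0_update("A", 0.0, "B")       # no reward on the step A -> B
    td0_update("B", 1.0, "end")     # reward 1 on the terminal step

# Both predictions converge toward the true return of 1.
```

The prediction at B improves first (it directly sees the reward), and the prediction at A then chases it, exactly the "later prediction adjusts the earlier prediction" idea.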
TD in the Brain
Administrivia
❏ Textbook: Sutton, R. S., and Barto, A. G.
‘Reinforcement Learning: An Introduction’,
2nd Edition. MIT Press
❏ Get Moodle Access
❏ TAs: Returaj, Jash, Argha
Why RL?
❏ Complex Dynamics
❏ Helicopter control
Helicopter Control
[Link]
Why RL?
❏ Complex Dynamics
❏ Helicopter control
❏ Humanoid control
Humanoid Control
[Link]
Humanoid Control
Why RL?
❏ Complex Dynamics
❏ Helicopter control
❏ Humanoid control
❏ Complex Environments
Path Planning
❏ Learn a distance function [ICRA 2014]
Supply chain: Inventory management
❏ Hierarchical flow of products
❏ Challenges:
❏ Different time scales at different levels of the tree
❏ Capacity constraints
❏ Lead times
❏ Large number of products
❏ Selfish objectives
Green Security Games
❏ Subclass of Stackelberg Security Games used to model
strategic interactions between law enforcement agencies
(defenders) and their opponents (adversaries)
❏ Model repeated interactions [Fang, Stone, and Tambe 2015;
Fang et al. 2016; Xu et al. 2017]
❏ Defenders protect a finite set of targets (e.g., wildlife) with
limited resources
Poaching
❏ Focus on real-world scenarios
❏ Combination of allocation and patrolling
❏ MILP and LP (mixed-integer and linear programming) approaches do not scale well
❏ Use of Reinforcement Learning
CombSGPO
Cluttered Workspace
Why RL?
❏ Complex Dynamics
❏ Helicopter control
❏ Humanoid control
❏ Complex Environments
❏ Customization / Personalization
Customization
Ad Selection
Aligning LLM Responses with
Human Preferences
Prompt: Serendipity means the occurrence and development
of events by chance in a happy or beneficial way. Use the word
in a sentence.
❏ Serendipity is the ability to see something good in
something bad.
❏ Serendipity can be defined as the happy chance
occurrence of events leading to a beneficial outcome.
Both responses are technically valid completions, but which one
is better?
Aligning LLM Responses with
Human Preferences
❏ Thought experiment.
❏ Consider the prompt - “Write a strongly worded email
to your co-worker about their unfinished task”.
❏ Consider the following responses:
❏ The task is of extreme importance and I wish you took
its completion seriously.
❏ You are a terrible colleague and do not finish assigned
tasks.
❏ Which one would you prefer?
❏ Even though there are multiple correct answers, users
have a preference.
❏ How do we encode this preference in the LLM?
Aligning LLM Responses with
Human Preferences
Why RL?
❏ Complex Dynamics
❏ Helicopter control
❏ Humanoid control
❏ Complex Environments
❏ Customization / Personalization
❏ Going beyond human knowledge
❏ Learn through Self-Play
TD-Gammon
❏ TD-Gammon (Tesauro ’92, ’94, ’95)
❏ Human-level Backgammon player
❏ Beat the best human player in 1995
❏ Learnt completely by self-play
❏ Discovered new moves not recorded by humans in centuries of play
Game Playing – Arcade Games
❏ Learnt to play from video input!
❏ Learnt from scratch
❏ Used a complex neural network!
❏ Considered one of the hardest
learning problems solved by a
computer!
DQN - Breakout
AlphaGo
❏ Branching factor: ~250 for Go vs. ~35 for Chess
❏ AlphaGo defeated Lee Sedol, an 18-time world champion, 4–1 in 2016
AlphaGo - Move 37
Match #2 - Move 37
“I would be a bit thrown off by some unusual moves that
AlphaGo has played. … It’s playing moves that are definitely
not usual moves. They’re not moves that would have a high
percentage of moves in its database. So it’s coming up with
the moves on its own. … It’s a creative move”
Defense of the Ancients 2 (Dota 2)
❏ The AI bot won 1v1 matches against top players in the world at The International Dota 2 championship
❏ 100 different heroes, 100 different items, many different tactics: much more complex than board games
❏ Trained for 2 weeks using just self-play
❏ The full game is played 5v5 and requires multi-agent coordination; still being developed
AlphaZero
❏ A general AI agent, not limited to Go: superhuman performance at Chess, Shogi and Go
❏ No human data: trained from scratch with RL by playing against itself
❏ No human features: only the raw board positions are provided to the agent
❏ Simpler search: no randomized Monte Carlo rollouts; a neural network evaluates positions
❏ Beat AlphaGo Lee 100–0
❏ Beat Stockfish and Elmo at Chess and Shogi
Why RL?
❏ Complex Dynamics
❏ Helicopter control
❏ Humanoid control
❏ Complex Environments
❏ Customization / Personalization
❏ Going beyond human knowledge
❏ Learn through Self-Play
❏ Improving heuristics
Power Management
DeepMind AI reduces Google Data Centre cooling bill by 40%
AlphaTensor
DeepMind's AlphaTensor discovered new, more efficient matrix
multiplication algorithms, the first such breakthrough since Strassen's algorithm in 1969
Driver-Rider Matching
Lyft used RL to automate driver-rider matching, increasing the
revenue by $30 million per year
Other Applications
❏ Optimal Control
❏ Robot Navigation
❏ Chemical Plants
❏ Combinatorial Optimization
❏ Elevator Dispatching
❏ VLSI placement and routing
❏ Job-shop scheduling
❏ Routing algorithms
❏ Call admission control
❏ More
❏ Intelligent Tutoring Systems
❏ Computational Neuroscience
❏ Primary mechanism of learning
❏ Psychology
❏ Behavioral and operant conditioning
❏ Decision making
❏ Operations Research
❏ Approximate Dynamic Programming
❏ More
❏ Dialogue systems
What’s Next?
❏ Deep Reinforcement Learning has revived excitement in the
community
❏ But many fundamental questions still to be addressed
Administrivia
❏ Textbook: Sutton, R. S., and Barto, A. G.
‘Reinforcement Learning: An Introduction’,
2nd Edition. MIT Press
❏ Get Moodle Access
❏ TAs: Returaj, Jash, Argha