
DA6400:

Reinforcement Learning
E Slot
Tue: 11:00 – 11:50, Wed: 10:00 – 10:50, Fri: 5:00 – 5:50
Tutorial: Thu: 8:00 – 8:50

B. Ravindran
TAs: Returaj, Jash, Argha
Lecture 1:
Introduction to
Reinforcement Learning

B. Ravindran
Machine Learning
❏ Learn functions mapping inputs to outputs from data

Reinforcement Learning
❏ Familiar models of machine learning
❏ Learning from data

❏ How did you learn to cycle?


❏ Trial and error!
❏ Falling down hurts!
❏ Evaluation, not instruction
❏ Reinforcement Learning

❏ Walk, Talk, etc.

Reinforcement Learning
❏ A trial-and-error learning paradigm
❏ Rewards and Punishments
❏ Learn about a system through interaction
❏ Inspired by behavioural psychology!
❏ Pavlov’s dog

Reinforcement Learning Works!

Tic-Tac-Toe
[Figure: game tree of tic-tac-toe positions branching over successive X and O moves]
Supervised Learning
[Figure: tic-tac-toe board positions presented as labeled training examples]
Supervised Learning
[Figure: tic-tac-toe board positions paired with expert moves as labels]
Reinforcement Learning
❏ Learn from evaluation
❏ Win gives 1 point
❏ Loss gives -1 point
❏ Draw gives 0 points

❏ Learn from repeated play (a minimal sketch follows)
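To make this concrete, here is a minimal sketch of the classic tic-tac-toe learner from Sutton and Barto's opening chapter: play repeatedly, score terminal boards with the win/loss/draw rewards above, and nudge each visited position's value toward its successor's. The board encoding, the random opponent, and the exploration rate are illustrative assumptions, not details from the slides.

```python
import random

# Lines that decide a tic-tac-toe game; a state is a 9-tuple of 'X', 'O', ' '.
LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(s):
    for a, b, c in LINES:
        if s[a] != ' ' and s[a] == s[b] == s[c]:
            return s[a]
    return None

def reward(s):
    w = winner(s)                       # win = 1, loss = -1, draw = 0 (per the slide)
    return 1.0 if w == 'X' else (-1.0 if w == 'O' else 0.0)

V = {}                                  # learner's value table for 'X'

def value(s):
    if winner(s) or ' ' not in s:
        return reward(s)                # terminal positions score themselves
    return V.setdefault(s, 0.0)

def play_one_game(alpha=0.1, epsilon=0.1):
    s, player = (' ',) * 9, 'X'
    trajectory = [s]                    # positions after each of our moves
    while winner(s) is None and ' ' in s:
        moves = [i for i in range(9) if s[i] == ' ']
        if player == 'X' and random.random() > epsilon:
            i = max(moves, key=lambda m: value(s[:m] + ('X',) + s[m+1:]))
        else:                           # exploratory move, or the random opponent
            i = random.choice(moves)
        s = s[:i] + (player,) + s[i+1:]
        if player == 'X':
            trajectory.append(s)
        player = 'O' if player == 'X' else 'X'
    trajectory.append(s)                # make sure the terminal position is included
    for prev, nxt in zip(trajectory, trajectory[1:]):
        V[prev] = value(prev) + alpha * (value(nxt) - value(prev))

for _ in range(5000):                   # learn from repeated play
    play_one_game()
```

There are no expert labels anywhere: the only feedback is the terminal evaluation, which is exactly the contrast with the supervised setup two slides back.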
MENACE
❏ Machine Educable Noughts And Crosses Engine
❏ Michie and Chambers ’60

DA6400 Lecture 1 11
More Tic-Tac-Toe
[Figure: a sequence of tic-tac-toe positions played against an imperfect opponent]
❏ Assume an imperfect opponent
❏ makes mistakes sometimes
Reinforcement Learning
❏ Simple rule to explain complex behaviors
❏ Intuition: the prediction of the outcome at time t+1 is better than the prediction at time t; hence use the later prediction to adjust the earlier one (see the sketch below)
❏ Has also had profound impact in behavioral psychology and neuroscience!
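Written as an update rule, that intuition is the temporal-difference (TD) update. A minimal sketch, assuming a tabular value function; the names V, alpha, and gamma are mine, not the slide's:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """Move V[s] toward the later, better prediction r + gamma * V[s_next]."""
    td_error = r + gamma * V[s_next] - V[s]   # later prediction minus earlier one
    V[s] += alpha * td_error
    return V

V = {'A': 0.0, 'B': 0.5}
td0_update(V, 'A', r=0.0, s_next='B')         # V['A'] nudged toward V['B']: now 0.05
```

The dopamine findings alluded to on the next slide are commonly read as the brain computing something like this td_error.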

TD in the Brain

Administrivia
❏ Textbook: Sutton, R. S., and Barto, A. G.
‘Reinforcement Learning: An Introduction’,
2nd Edition. MIT Press
❏ Get Moodle Access
❏ TAs: Returaj, Jash, Argha

Why RL?
❏ Complex Dynamics
❏ Helicopter control

Helicopter Control

[Link]

Why RL?
❏ Complex Dynamics
❏ Helicopter control
❏ Humanoid control

Humanoid Control

[Link]

Humanoid Control

Why RL?
❏ Complex Dynamics
❏ Helicopter control
❏ Humanoid control
❏ Complex Environments

Path Planning

❏ Learn distance function [ICRA 2014]

Supply chain: Inventory management

❏ Hierarchical flow of products
❏ Challenges:
❏ Different time scales at different levels of the tree
❏ Capacity constraints
❏ Lead times
❏ Large number of products
❏ Selfish objectives

Green Security Games
❏ Subclass of Stackelberg Security Games used to model strategic interactions between law enforcement agencies (defenders) and their opponents (adversaries)
❏ Model repeated interactions [Fang, Stone, and Tambe 2015; Fang et al. 2016; Xu et al. 2017]
❏ Defenders protect a finite set of targets (e.g., wildlife) with limited resources

Poaching
❏ Focus on real-world scenarios
❏ Combination of allocation and patrolling
❏ MILP and LP approaches do not scale well
❏ Use of Reinforcement Learning

CombSGPO

Cluttered Workspace

Why RL?
❏ Complex Dynamics
❏ Helicopter control
❏ Humanoid control
❏ Complex Environments
❏ Customization / Personalization

Customization

Ad Selection

Aligning LLM Responses with Human Preferences
Prompt: Serendipity means the occurrence and development
of events by chance in a happy or beneficial way. Use the word
in a sentence.

❏ Serendipity is the ability to see something good in something bad.
❏ Serendipity can be defined as the happy chance occurrence of events leading to a beneficial outcome.

Both responses are technically completions, but which one is better?

Aligning LLM Responses with Human Preferences
❏ Thought experiment.
❏ Consider the prompt - “Write a strongly worded email
to your co-worker about their unfinished task”.
❏ Consider the following responses:
❏ The task is of extreme importance and I wish you took
its completion seriously.
❏ You are a terrible colleague and do not finish assigned
tasks.
❏ Which one would you prefer?
❏ Even though there are multiple correct answers, users
have a preference.
❏ How do we encode this preference in the LLM?
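One standard answer, sketched here as an assumption rather than anything spelled out on the slide, is RLHF: fit a reward model to human preference pairs, then fine-tune the LLM with RL against that reward. The core of the reward-model step is a Bradley–Terry style pairwise loss; the scores below are made-up stand-ins for what a reward model might assign the two emails above.

```python
import numpy as np

def preference_loss(r_preferred, r_rejected):
    """Pairwise loss: push the reward model to score the human-preferred
    response above the rejected one."""
    # P(preferred beats rejected) = sigmoid(r_preferred - r_rejected)
    p = 1.0 / (1.0 + np.exp(-(r_preferred - r_rejected)))
    return -np.log(p)

# Illustrative reward-model scores for the two candidate emails.
print(preference_loss(r_preferred=1.3, r_rejected=-0.4))  # ~0.17: already ranked correctly
```

Minimizing this loss over many labeled pairs is how the users' preference gets encoded as a scalar reward that an RL step can then optimize.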
Aligning LLM Responses with Human Preferences

Why RL?
❏ Complex Dynamics
❏ Helicopter control
❏ Humanoid control
❏ Complex Environments
❏ Customization / Personalization
❏ Going beyond human knowledge
❏ Learn through Self-Play

TD-Gammon
❏ TD-Gammon (Tesauro '92, '94, '95)
❏ Human-level Backgammon player
❏ Beat the best human player in 1995
❏ Learnt completely by self-play
❏ New moves not recorded by humans in centuries of play

Game Playing – Arcade Games
❏ Learnt to play from video input!
❏ Learnt from scratch

❏ Used a complex neural network! (a sketch of its training target follows)
❏ Considered one of the hardest learning problems solved by a computer!
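The agent being described is DQN. Its "complex neural network" is trained to reduce the one-step Q-learning error on game-score rewards; a minimal numpy sketch, where the array shapes and the 4-action space stand in for the real network and Atari controls:

```python
import numpy as np

rng = np.random.default_rng(0)

def q_learning_target(r, q_next, gamma=0.99, terminal=False):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    return r if terminal else r + gamma * np.max(q_next)

q_s = rng.normal(size=4)          # stand-in for Q(s, .) over 4 joystick actions
q_s_next = rng.normal(size=4)     # stand-in for Q(s', .) on the next frame stack
a, r = 2, 1.0                     # action taken and the change in game score

y = q_learning_target(r, q_s_next)
td_loss = (y - q_s[a]) ** 2       # squared TD error the network is trained to shrink
print(td_loss)
```

Everything the full system adds (convolutional network, replay buffer, target network) refines how this one loss is estimated and stabilized.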

DQN - Breakout

AlphaGo
❏ Branching factor: ~250 for Go vs. ~35 for Chess
❏ AlphaGo defeated 18-time world champion Lee Sedol 4–1

AlphaGo - Move 37
Match #2 - Move 37
“I would be a bit thrown off by some unusual moves that
AlphaGo has played. … It’s playing moves that are definitely
not usual moves. They’re not moves that would have a high
percentage of moves in its database. So it’s coming up with
the moves on its own. … It’s a creative move”

Defence of The Ancients 2 (DoTA 2)
❏ The AI bot won 1v1 matches against top players in the world at the International DoTA Championships
❏ 100 different heroes. 100 different items. Many different tactics. Much more complex than board games
❏ Trained for 2 weeks by using just self-play
❏ The full game is played 5v5. Multi-agent coordination required. Still being developed

AlphaZero
❏ A general AI agent; not limited to Go. Superhuman performance on Chess, Shogi and Go
❏ No human data: trained from scratch with RL by playing against itself
❏ No human features: only the raw board positions are provided to the agent
❏ Simpler search: no randomized Monte Carlo rollouts; a neural network evaluates positions instead (see the sketch below)
❏ Beat AlphaGo Lee by 100 – 0
❏ Beat Stockfish and Elmo on Chess and Shogi
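To illustrate the "simpler search" bullet: where classic MCTS scores a leaf by playing random moves to the end of the game, AlphaZero-style search asks a learned value network instead. A minimal sketch of just that difference; the state interface and value_net are assumptions for illustration, not the published implementation:

```python
import random

def rollout_evaluate(state, legal_moves, apply_move, is_terminal, outcome):
    """Classic MCTS leaf evaluation: a randomized rollout to the end."""
    while not is_terminal(state):
        state = apply_move(state, random.choice(legal_moves(state)))
    return outcome(state)        # e.g., +1 / 0 / -1 from the leaf player's view

def network_evaluate(state, value_net):
    """AlphaZero-style leaf evaluation: one forward pass, no rollout."""
    return value_net(state)      # the network predicts the eventual outcome
```

In both cases the returned value is backed up along the visited search path; only how the leaf is scored changes.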

Why RL?
❏ Complex Dynamics
❏ Helicopter control
❏ Humanoid control
❏ Complex Environments
❏ Customization / Personalization
❏ Going beyond human knowledge
❏ Learn through Self-Play
❏ Improving heuristics

Power Management

DeepMind AI reduces Google Data Centre cooling bill by 40%

AlphaTensor

DeepMind's AlphaTensor discovered new efficient matrix multiplication methods, the first breakthrough since 1969

Driver-Rider Matching

Lyft used RL to automate driver-rider matching, increasing revenue by $30 million per year

Other Applications
❏ Optimal Control
❏ Robot Navigation
❏ Chemical Plants
❏ Combinatorial Optimization
❏ Elevator Dispatching
❏ VLSI placement and routing
❏ Job-shop scheduling
❏ Routing algorithms
❏ Call admission control
❏ More
❏ Intelligent Tutoring Systems
❏ Computational Neuroscience
❏ Primary mechanism of learning
❏ Psychology
❏ Behavioral and operant conditioning
❏ Decision making
❏ Operations Research
❏ Approximate Dynamic Programming
❏ More
❏ Dialogue systems

What’s Next?

❏ Deep Reinforcement Learning has revived excitement in the community
❏ But many fundamental questions still to be addressed
