L13 Reinforcement Learning

Machine Learning

(Học máy – IT3190E)

Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology

2023
Contents 2

¡ Introduction
¡ Supervised learning
¡ Unsupervised learning
¡ Reinforcement learning
¡ Practical advice
Reinforcement Learning problem 3

¡ Goal: Learn to choose actions that maximize
    r0 + γr1 + γ²r2 + ⋯ , where 0 ≤ γ < 1
  (γ is the discount factor for future rewards)


(Mitchell, 1997)
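A small numeric sketch (not part of the original slides) of the discounted sum above; the reward list and γ = 0.9 are made-up values.

```python
# Hypothetical rewards r_0, r_1, r_2 and discount factor gamma (made-up values).
def discounted_return(rewards, gamma=0.9):
    """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81*2.0 = 2.62
```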
Characteristics of Reinforcement learning 4

¡ What makes Reinforcement Learning (RL) different from other machine learning paradigms?
v There is no explicit supervisor, only a reward signal
v Training examples are of form ((S, A), R)
v Feedback is often delayed
v Time really matters (sequential, not independent data)
v Agent's actions affect the subsequent data it receives
¡ Examples of RL
v Play games better than humans
v Manage an investment portfolio
v Make a humanoid robot walk
v …
Reward 5

¡ A reward Rt is a scalar feedback signal


¡ Indicates how well agent is doing at step t
¡ The agent's job is to maximize cumulative reward
¡ Reinforcement learning is based on the reward
hypothesis:
v All goals can be described by the maximization of expected
cumulative reward
Examples of reward 6

¡ Play games better than humans


v + reward for increasing score
v - reward for decreasing score

¡ Manage an investment portfolio


v + reward for each $ in bank

¡ Make a humanoid robot walk


v + reward for forward motion
v - reward for falling over
Sequential decision making 7

¡ Goal: Select actions to maximize total future reward


¡ Actions may have long term consequences
¡ Reward may be delayed
¡ It may be better to sacrifice an immediate reward to
gain more long-term reward
¡ Examples:
v A financial investment (may take months to mature)
v Blocking opponent moves (might help winning chances, after
many moves from now)
Agent and Environment (1) 8

[Diagram: the agent receives observation Ot and reward Rt, and emits action At]

n At each step t, the agent:
  q Executes action At
  q Receives observation Ot
  q Receives scalar reward Rt
Agent and Environment (2) 9

[Diagram: the agent–environment interaction loop]

n At each step t, the agent:
  q Executes action At
  q Receives observation Ot
  q Receives scalar reward Rt
n At each step t, the environment:
  q Receives action At
  q Emits observation Ot+1
  q Emits scalar reward Rt+1
n t increments at environment step
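To make this loop concrete, here is a minimal runnable sketch of the interaction cycle described above; the Environment and Agent classes are toy stand-ins invented for illustration, not anything defined in the slides.

```python
import random

class Environment:
    """Toy environment (illustrative only): a 10-step episode with a counter as observation."""
    def reset(self):
        self.t = 0
        return 0.0                               # initial observation O_1
    def step(self, action):
        self.t += 1                              # t increments at environment step
        obs = float(self.t)                      # emit observation O_{t+1}
        reward = 1.0 if action == 1 else 0.0     # emit scalar reward R_{t+1}
        done = self.t >= 10
        return obs, reward, done

class Agent:
    """Toy agent: ignores the observation/reward and acts randomly."""
    def act(self, obs, reward):
        return random.choice([0, 1])             # execute action A_t

env, agent = Environment(), Agent()
obs, reward, done = env.reset(), 0.0, False
while not done:
    action = agent.act(obs, reward)              # agent executes A_t given O_t, R_t
    obs, reward, done = env.step(action)         # environment emits O_{t+1}, R_{t+1}
```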
History and State 10

¡ The history is the sequence of observations, actions, rewards:
    Ht = O1, R1, A1, …, A_{t−1}, Ot, Rt
v All observable variables up to time t
v The sensorimotor stream of the agent
¡ What happens next depends on the history:
v The agent selects actions
v The environment selects observations/rewards
¡ State is the information used to determine what
happens next
¡ Formally, state is a function of the history:
𝑆𝑡 = 𝑓(𝐻𝑡)
Environment state 11

[Diagram: agent–environment loop, with the environment state S^e_t inside the environment]

n The environment state S^e_t is the environment's private representation
  q The information the environment uses to pick the next observation or reward
n The environment state is not usually visible to the agent
Agent state 12

[Diagram: agent–environment loop, with the agent state S^a_t inside the agent]

n The agent state S^a_t is the agent's internal representation
  q The information the agent uses to pick the next action
  q It is the information used by reinforcement learning algorithms
n It can be a function of history: S^a_t = f(Ht)
Information state 13

¡ An information state (a.k.a. Markov state) contains all


useful information from the history
¡ A state St is Markov if and only if:
    P(S_{t+1} | St) = P(S_{t+1} | S1, …, St)
  v The future is independent of the past given the present:
    H_{1:t} → St → H_{t+1:∞}
  v Once the state is known, the history may be thrown away
  v The state is a sufficient statistic of the future
  v The environment state S^e_t is Markov
  v The history Ht is Markov
Fully observable environments 14

[Diagram: the agent directly observes the environment state St]

n Full observability: Agent directly observes the environment state
    Ot = S^a_t = S^e_t
n Agent state = Environment state = Information state
n Formally, this is a Markov decision process (MDP)
Partially observable environments 15

¡ Partial observability: Agent indirectly observes the environment:
v E.g., a robot with camera vision isn't told its absolute location
v E.g., a trading agent only observes current prices
v E.g., a poker playing agent only observes public cards

¡ Now, Agent state ≠ Environment state


¡ Formally this is a partially observable Markov decision
process (POMDP)
¡ Agent must construct its own state representation S^a_t:
  v E.g., by using the complete history: S^a_t = Ht
  v E.g., by using a recurrent neural network: S^a_t = σ(S^a_{t−1} W_s + Ot W_o)
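As a rough illustration of the recurrent-network case, the sketch below applies the update S^a_t = σ(S^a_{t−1} W_s + Ot W_o); the dimensions and random weights are assumptions for illustration and do not come from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

state_dim, obs_dim = 8, 4                        # hypothetical sizes
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim))    # recurrent weights
W_o = rng.normal(size=(obs_dim, state_dim))      # observation weights

def update_agent_state(s_prev, o_t):
    """One recurrent update of the agent state: S_t = sigma(S_{t-1} W_s + O_t W_o)."""
    return sigmoid(s_prev @ W_s + o_t @ W_o)

s = np.zeros(state_dim)                          # initial agent state
for o in rng.normal(size=(5, obs_dim)):          # a short stream of observations
    s = update_agent_state(s, o)
```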
Major components of an RL agent 16

An RL agent may include one or more of these components:

¡ Policy: Agent's behavior function

¡ Value function: How good is each state and/or action

¡ Model: Agent's representation of the environment


Policy 17

¡ A policy is the agent's behavior


¡ It is a map from state to action
¡ Deterministic policy: 𝑎 = 𝜋(𝑠)
¡ Stochastic policy: π(a|s) = P(At = a | St = s)
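A tiny sketch contrasting the two policy types on a made-up two-state, two-action problem (the state/action names and probabilities are hypothetical).

```python
import random

deterministic_policy = {"s1": "a1", "s2": "a2"}      # a = pi(s)

stochastic_policy = {                                # pi(a|s) = P(A_t = a | S_t = s)
    "s1": {"a1": 0.8, "a2": 0.2},
    "s2": {"a1": 0.5, "a2": 0.5},
}

def sample_action(policy, s):
    """Sample an action from the stochastic policy in state s."""
    actions, probs = zip(*policy[s].items())
    return random.choices(actions, weights=probs)[0]

print(deterministic_policy["s1"])                    # always a1
print(sample_action(stochastic_policy, "s1"))        # a1 with prob 0.8, a2 with prob 0.2
```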
Value function 18

¡ Value function is a prediction of future reward


¡ Used to evaluate the goodness/badness of states
¡ And therefore, to select between actions
    v_π(s) = E_π[ R_{t+1} + γR_{t+2} + γ²R_{t+3} + … | St = s ]
  where R_{t+1}, R_{t+2}, … are generated by following policy π starting at state s
¡ For each policy π, we have a value v_π(s)
¡ We want to find the optimal policy π* such that
    v*(s) = max_π v_π(s), ∀s
Model 19

¡ A model predicts what the environment will do next

¡ P predicts the next state:
    P^a_{ss'} = P(S_{t+1} = s' | St = s, At = a)

¡ R predicts the next (immediate) reward:
    R^a_s = E[ R_{t+1} | St = s, At = a ]
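A minimal tabular sketch of these two model components, with hypothetical states, actions, and values (nothing here comes from the maze example that follows).

```python
# P[s][a] maps to a distribution over next states; R[s][a] is the expected immediate reward.
P = {
    "s1": {"right": {"s2": 1.0}},
    "s2": {"right": {"s2": 1.0}},
}
R = {
    "s1": {"right": -1.0},
    "s2": {"right": 0.0},
}

def predict(s, a):
    """Model prediction: (next-state distribution, expected immediate reward)."""
    return P[s][a], R[s][a]

print(predict("s1", "right"))   # ({'s2': 1.0}, -1.0)
```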
Maze example 20

¡ Rewards: -1 per
time-step
¡ Actions: N, E, S, W
¡ States: Agent's
location

(https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf)
Maze example: Policy 21

¡ Arrows represent
policy p(𝑠) for each
state s

(https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf)
Maze example: Value function 22

¡ Numbers represent
value 𝑣p(𝑠) of each
state s

(https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf)
Maze example: Model 23

¡ Agent may have an internal model of the environment
¡ Dynamics: How actions change the state
¡ Rewards: How much reward from each state
¡ Grid layout represents the transition model P^a_{ss'}
¡ Numbers represent the immediate reward R^a_s from each state s (same for all actions a)

(https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf)
Categorizing RL agents (1) 24

¡ Value-based
v No policy
v Value function

¡ Policy-based
v Policy
v No value function

¡ Actor critic
v Policy
v Value function
Categorizing RL agents (2) 25

¡ Model-free
v Policy and/or Value function
v No model

¡ Model-based
v Policy and/or Value function
v Model
Exploration and Exploitation (1) 26

¡ Reinforcement learning is like trial-and-error learning


¡ The agent should discover a good policy from its experiences of the environment, without losing too much reward along the way
Exploration and Exploitation (2) 27

¡ Exploration finds more information about the environment


¡ Exploitation exploits known information to maximize reward
¡ It is usually important to both explore and exploit
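One standard way to balance the two (not named on this slide, but widely used) is ε-greedy action selection: explore with a small probability ε, otherwise exploit the current value estimates. A sketch with made-up values:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping action -> estimated value in the current state."""
    if random.random() < epsilon:                     # explore: try a random action
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)            # exploit: pick the best-looking action

print(epsilon_greedy({"left": 0.2, "right": 0.7}))    # usually "right", occasionally "left"
```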
Exploration and Exploitation: Examples 28

¡ Restaurant selection
v Exploitation: Go to your favorite restaurant
v Exploration: Try a new restaurant

¡ Online banner advertisements


v Exploitation: Show the most successful advertisement
v Exploration: Show a different advertisement

¡ Game playing
v Exploitation: Play the move you believe is best
v Exploration: Play an experimental move
Q-Learning: What to learn 29

¡ We might try to have the agent learn the value function v*
¡ It could then do a lookahead search to choose the best action from any state s, because
    π(s) = argmax_a [ r(s, a) + γ v*(δ(s, a)) ]
  v δ: S×A → S maps a given action a and state s to the next state
  v r: S×A → R provides the reward of action a from state s
¡ A problem:
v This works well if the agent knows the functions δ and r
v But when it doesn't, it can't choose actions this way
Q-Function 30

¡ Define a new function very similar to v*:
    Q(s, a) = r(s, a) + γ v*(δ(s, a))
  v Q(s, a) shows how good it is to perform action a when in state s
  v whereas v*(s) shows how good it is for the agent to be in state s

¡ If agent learns Q, it can choose optimal action


    π(s) = argmax_a [ r(s, a) + γ v*(δ(s, a)) ] = argmax_a Q(s, a)

¡ Q is the value function the agent will learn


Training rule to learn Q 31

¡ Note that Q and v* are closely related:
    v*(s) = max_{a'} Q(s, a')

¡ Which allows us to write Q recursively as


    Q(s_t, a_t) = r(s_t, a_t) + γ v*(δ(s_t, a_t))
                = r(s_t, a_t) + γ v*(s_{t+1})
                = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

¡ Let Q* denote the learner's (agent's) current approximation to Q, and consider the training rule
    Q*(s, a) ← r(s, a) + γ max_{a'} Q*(s', a')
  v where s' is the state resulting from applying action a in state s
Q-Learning for deterministic worlds 32

For each s, a, initialize the table entry Q*(s, a) ← 0
Observe the current state s
Do forever:
  v Select an action a and execute it
  v Receive immediate reward r
  v Observe the new state s'
  v Update the table entry for Q*(s, a) as follows:
        Q*(s, a) ← r + γ max_{a'} Q*(s', a')
  v s ← s'
(Note: this assumes a finite action space and a finite state space)
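A minimal runnable sketch of this table-based algorithm on a made-up deterministic chain world (states 0..4, state 4 terminal); the world, rewards, and random action selection are assumptions for illustration only.

```python
import random
from collections import defaultdict

GAMMA = 0.9
ACTIONS = [-1, +1]                        # move left / move right

def step(s, a):
    """Deterministic world: next state delta(s, a) and reward r(s, a)."""
    s_next = max(0, min(4, s + a))
    r = 1.0 if s_next == 4 else 0.0       # reward only on reaching the goal state 4
    return s_next, r

Q = defaultdict(float)                    # table entries Q*(s, a), initialized to 0
for episode in range(200):
    s = 0                                 # observe current state s
    while s != 4:
        a = random.choice(ACTIONS)        # select an action a and execute it
        s_next, r = step(s, a)            # receive reward r, observe new state s'
        Q[(s, a)] = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)   # update rule
        s = s_next                        # s <- s'

print(Q[(3, +1)])                         # converges to 1.0; Q[(2, +1)] to 0.9, etc.
```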
Updating Q* 33

¡ Q*(s1, a_right) ← r + γ · max_{a'} Q*(s2, a')
                 ← 0 + 0.9 · max{63, 81, 100}
                 ← 90
  (the values 63, 81, 100 are the current Q* estimates for the actions available from s2)
¡ Note that if rewards are non-negative, then
    ∀ s, a, n:  Q*_{n+1}(s, a) ≥ Q*_n(s, a)
    ∀ s, a, n:  0 ≤ Q*_n(s, a) ≤ Q(s, a)
  v where Q*_n is the value at iteration n
(Mitchell, 1997)
Q-Learning for non-deterministic worlds 34

¡ What if reward and next state are non-deterministic?


¡ We redefine v* and Q by taking expected values:
    v*(s) = E[ r_t + γ r_{t+1} + γ² r_{t+2} + ⋯ ]
    Q(s, a) = E[ r(s, a) + γ v*(δ(s, a)) ]
            = Σ_{s', r} P(s', r | s, a) [ r + γ v*(s') ]

¡ Q-learning generalizes to non-deterministic worlds


  v Alter the training rule at iteration n to:
        Q*_n(s, a) ← (1 − α_n) · Q*_{n−1}(s, a) + α_n [ r + γ max_{a'} Q*_{n−1}(s', a') ]
  v where α_n is sometimes known as the learning rate
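A one-step sketch of this α_n-weighted update with made-up numbers; a fixed α is used here for simplicity, whereas in practice α_n is typically decreased over iterations.

```python
GAMMA, ALPHA = 0.9, 0.5                   # made-up discount factor and learning rate

def q_update(q, s, a, r, s_next, actions):
    """q maps (state, action) -> current estimate Q*_{n-1}(s, a)."""
    target = r + GAMMA * max(q.get((s_next, b), 0.0) for b in actions)
    q[(s, a)] = (1 - ALPHA) * q.get((s, a), 0.0) + ALPHA * target

q = {}
q_update(q, s=0, a=+1, r=1.0, s_next=1, actions=[-1, +1])
print(q[(0, +1)])                         # 0.5 * 0.0 + 0.5 * (1.0 + 0.9 * 0.0) = 0.5
```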


References 35

• D. Silver. Lecture 1: Introduction to Reinforcement Learning (https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf).
• T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
