Unit-5

The document discusses Reinforcement Learning (RL), a machine learning approach that learns from feedback in an environment without labeled data, aiming to maximize rewards through actions. It covers basic components of RL, including agents, environments, actions, states, rewards, and penalties, as well as algorithms like Q-Learning and Deep Q-Learning, which utilize neural networks for complex environments. Additionally, it introduces Genetic Algorithms as a search-based optimization technique inspired by natural selection, detailing its components and processes.


Machine Learning Techniques (KCS 055)

Reinforcement Learning
• It is a feedback-based machine learning approach.
• The agent learns from changes occurring in the environment, without any labelled data.
• Goal: to perform actions based on observations of the environment and obtain the maximum positive reward.
• Example: Chessboard
  – Goal: to win the game
  – Feedback: based on the right choice of move
Reinforcement Learning

• The agent learns from its own experience, as there is no labelled data.
• It is used to solve problems where decision making is sequential and the goal is long-term, such as game playing and robotics.
Basic Components Of Reinforcement
Learning

• Agent → A hardware/software/computer program, e.g. an AI robot or a robotic car.
• Environment → The situation or surroundings of the agent, e.g. a road or highway.
• Action → The movement of the agent inside the environment, e.g. move right/left/up/down.
• State → The situation returned by the environment after each action.
Basic Components Of Reinforcement
Learning

• Reward → Positive feedback.
• Penalty → Negative feedback.
• Policy → The agent's strategy for choosing its next action.
• Policy Map → The mapping the agent uses to select an action in each state.
Steps in Reinforcement Learning
Take an action → Get feedback (reward/penalty) → Remain in the same state or change state
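As a rough illustration of this loop, here is a minimal Python sketch of one episode of agent-environment interaction. It assumes a hypothetical Gym-style environment object (`env` with `reset()`/`step()`) and a `choose_action` function; these names are illustrative and not part of the slides.

```python
# Minimal sketch of the agent-environment feedback loop, assuming a
# hypothetical Gym-style environment with reset()/step() methods.
def run_episode(env, choose_action, max_steps=100):
    state = env.reset()                 # start in an initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(state)   # agent picks an action
        next_state, reward, done = env.step(action)  # environment gives feedback
        total_reward += reward          # reward or penalty accumulates
        state = next_state              # remain in / change state
        if done:                        # episode ends (goal reached or failure)
            break
    return total_reward
```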
Two types of Reinforcement Learning:

Positive Reinforcement Learning
• Recurrence of a behavior due to positive rewards.
• Such rewards increase the strength and frequency of a specific behavior and encourage the agent to execute similar actions in the future.

Negative Reinforcement Learning
• Negative rewards are used as a deterrent to weaken a behavior and to avoid it.
• These rewards decrease the strength and frequency of a specific behavior.
Markov Decision Problem
Q-Learning Algorithm

• Model-free reinforcement learning algorithm.

• Learns the value of an action in a particular state.

• The ‘Q’ stands for quality of actions.

• The quality represents the usefulness of a given action.


Q-Learning Algorithm

• States(s): the current position of the agent in the environment.

• Action(a): a step taken by the agent in a particular state.

• Rewards: for every action, the agent receives a reward and penalty.

• Episodes: the end of a stage, after which the agent cannot take a new action. It happens when the agent has achieved the goal or has failed.
Q-Learning Algorithm

• Q(St+1, a): the expected optimal Q-value of taking an action in the next state.
• Q(St, At): the current estimate, which is updated towards Q(St+1, a).
• Q-Table: the agent maintains a table of Q-values for all sets of states and actions.
• Temporal Difference (TD): used to estimate the expected value of Q(St+1, a) from the current state and action and the previous state and action.
Temporal Difference

• Model-free learning.
• A combination of the Monte Carlo (MC) method and the Dynamic Programming (DP) method.
• Monte Carlo (MC) ideas:
  – Learns directly from raw experience, i.e. without a model.
  – No predefined model.
Temporal Difference

• Dynamic Programming (DP) idea:
  – Updates estimates from partially learned estimates, rather than waiting for the final outcome.
• Two properties of Temporal Difference learning:
  – It does not require the model to be known in advance.
  – It can also be applied to non-episodic tasks.
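As a minimal illustration of these ideas, the TD(0) update nudges the state-value estimate V(s) towards the bootstrapped target r + γ·V(s′) after every step, without waiting for the final outcome and without a model. The learning rate and discount factor in this sketch are illustrative values.

```python
from collections import defaultdict

# TD(0) state-value update sketch: move V(s) towards the bootstrapped target.
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    td_target = reward + gamma * V[next_state]   # bootstrapped estimate (DP idea)
    td_error = td_target - V[state]              # temporal-difference error
    V[state] += alpha * td_error                 # move the estimate towards the target
    return V

V = defaultdict(float)                           # value estimates, default 0.0
V = td0_update(V, state="s0", reward=1.0, next_state="s1")
```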
Temporal Difference
Steps followed:

• Exploration: explore all possible paths.
• Exploitation: identify the best possible path.

Initialize Q-table → Choose an Action → Perform the Action → Measure Reward → Update Q-table

A number of iterations result in a good Q-table.


Q function
• Based on the Bellman equation.
• Takes two inputs: a state (s) and an action (a).
Updating Q-table
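The update equation shown on this slide is not reproduced in the extracted text; as a minimal sketch, the standard Bellman-based Q-learning update can be written as follows (the learning rate `alpha` and discount factor `gamma` are illustrative values):

```python
import numpy as np

# Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
# Q is assumed to be a 2-D array of shape (n_states, n_actions).
def update_q(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    best_next = np.max(Q[next_state])                         # best future Q-value
    td_error = reward + gamma * best_next - Q[state, action]  # temporal-difference error
    Q[state, action] += alpha * td_error
    return Q
```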
Q Table
• Example: in a game
• Actions: up, down, right, left
• States: Start, End, Idle, Hole, etc.

Reference: https://www.datacamp.com/tutorial/introduction-q-learning-beginner-tutorial
Q-Learning Algorithm
Q Table
• Step 1: Initialize the Q-table.
Q Table
• Step 2: Choose an action. At the start, the agent will choose a random action (down or right); on the second run, it will use the updated Q-table to select the action.
Q Table
• Step 3: Perform an action. Initially, the agent is in exploration mode and chooses a random action to explore the environment. The Epsilon-Greedy Strategy is a simple method to balance exploration and exploitation: epsilon is the probability of choosing to explore, and the agent exploits when the chance of exploring is small.
Q Table
• Step 3 (continued): At the start, the epsilon rate is higher, meaning the agent is in exploration mode. As the agent explores the environment, epsilon decreases and the agent starts to exploit it. With every iteration of exploration, the agent becomes more confident in its estimates of the Q-values.
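A minimal sketch of the epsilon-greedy strategy described above is shown below; the decay rate and minimum epsilon are illustrative choices, not values from the slides.

```python
import random
import numpy as np

# With probability epsilon the agent explores (random action);
# otherwise it exploits the current Q-table.
def epsilon_greedy(Q, state, epsilon, n_actions):
    if random.random() < epsilon:
        return random.randrange(n_actions)      # explore
    return int(np.argmax(Q[state]))             # exploit

def decay_epsilon(epsilon, decay=0.995, min_epsilon=0.05):
    return max(min_epsilon, epsilon * decay)    # exploration shrinks over time
```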
Q Table
• Step 4: Update the Q-table. We update Q(St, At) using the update equation, which combines the previous estimate of the Q-value, the learning rate, and the Temporal Difference error. The Temporal Difference error is calculated from the immediate reward, the discounted maximum expected future reward, and the former Q-value estimate.
• The process is repeated multiple times until the Q-table converges and the Q-value function is maximized.
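Putting the four steps together, the following sketch assumes a hypothetical environment with discrete integer states and actions (Gym-style `reset()`/`step()`) and reuses the `update_q`, `epsilon_greedy` and `decay_epsilon` helpers sketched above; the episode count and hyperparameters are illustrative.

```python
import numpy as np

# Tabular Q-learning training loop sketch (hypothetical discrete environment).
def train_q_table(env, n_states, n_actions, n_episodes=1000):
    Q = np.zeros((n_states, n_actions))          # Step 1: initialize Q-table
    epsilon = 1.0
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, n_actions)  # Step 2: choose
            next_state, reward, done = env.step(action)            # Step 3: perform
            Q = update_q(Q, state, action, reward, next_state)     # Step 4: update
            state = next_state
        epsilon = decay_epsilon(epsilon)         # shift from exploration to exploitation
    return Q
```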
Deep Q learning

• Q-Learning creates an exact matrix that the working agent can “refer to” to maximize its reward in the long run.
• This is only practical for very small environments and quickly loses its feasibility as the number of states and actions in the environment increases.
Deep Q learning

• The solution for the above problem comes from the realization
that the values in the matrix only have relative importance i.e. the
values only have importance with respect to the other values.
• Thus, this thinking leads us to Deep Q-Learning which uses a deep
neural network to approximate the values.
• The basic working step for Deep Q-Learning is that the initial state
is fed into the neural network, and it returns the Q-value of all
possible actions as an output.
Deep Q learning

• It is a variant of Q-Learning that uses a deep neural network to represent the Q-function, rather than a simple table of values.
• It can handle environments with a large number of states and actions, as well as learn from high-dimensional inputs such as images or sensor data.
• Most important features:
  – Experience replay
  – Target networks
Deep Q learning
(Experience replay)

• Experience replay is a technique where the agent


stores a subset of its experiences (state, action,
reward, next state) in a memory buffer and samples
from this buffer to update the Q-function.
• This helps to decorrelate the data and make the
learning process more stable.
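A minimal sketch of such a replay buffer is shown below, using Python's `deque`; the capacity and batch size are illustrative values, not taken from the slides.

```python
import random
from collections import deque

# Experience replay: store transitions and sample random mini-batches,
# which decorrelates the training data.
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)     # oldest experiences are discarded

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # random, decorrelated batch
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```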
Deep Q learning
(Target Networks)

• Target networks are used to stabilize the Q-function


updates.
• In this technique, a separate network is used to
compute the target Q-values, which are then used to
update the Q-function network.
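As a framework-free illustration of this idea, the sketch below assumes `q_net` and `target_net` are objects exposing `predict()`, `get_weights()` and `set_weights()`; a real implementation would normally build the networks with a deep-learning library instead.

```python
import copy
import numpy as np

# Compute Bellman targets with the (frozen) target network.
def compute_targets(target_net, rewards, next_states, dones, gamma=0.99):
    targets = []
    for r, s_next, done in zip(rewards, next_states, dones):
        if done:
            targets.append(r)                                   # no future reward
        else:
            targets.append(r + gamma * np.max(target_net.predict(s_next)))
    return np.array(targets)

# Periodically copy the online network's weights into the target network.
def sync_target(q_net, target_net):
    target_net.set_weights(copy.deepcopy(q_net.get_weights()))
```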
Q learning vs Deep Q learning
Deep Q learning
1. Initialize the main and target neural networks.
2. Use an epsilon-greedy exploration strategy.
3. Use the Bellman equation to update the network weights.
Genetic Algorithm
• A search-based optimization technique.
• Based on the principles of genetics and natural selection.
• It keeps evolving better solutions over successive generations until it reaches a stopping criterion.
Basic Terminologies of
Genetic Algorithm
• Gene: a single bit of a bit string.
• Chromosome: a possible solution (a bit string, i.e. a collection of genes).
• Population: a set of solutions.
Basic Terminologies of
Genetic Algorithm
• Allele: the value a gene takes on for a particular chromosome.
• Gene Pool: the set of all possible gene values (alleles) in the population.
Basic Terminologies of
Genetic Algorithm
• Crossover: the process of taking two parent bit strings (solutions) and producing new child bit strings (offspring) from them.
Basic Terminologies of
Genetic Algorithm
• Three types of crossover:
  – Single-point crossover: data bits are swapped between the two parent strings after a chosen crossover point.
  – Two-point crossover: bits between two chosen points are swapped.
  – Uniform crossover: individual bits are swapped at random with equal probability.
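As a rough illustration, the three operators can be sketched on equal-length bit strings as follows; the crossover points are chosen at random, and the Python string representation is an assumption made for the example.

```python
import random

# Single-point crossover: swap everything after a random cut point.
def single_point(p1, p2):
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

# Two-point crossover: swap the segment between two random points.
def two_point(p1, p2):
    a, b = sorted(random.sample(range(1, len(p1)), 2))
    return (p1[:a] + p2[a:b] + p1[b:],
            p2[:a] + p1[a:b] + p2[b:])

# Uniform crossover: swap each bit with equal probability.
def uniform(p1, p2):
    c1, c2 = [], []
    for g1, g2 in zip(p1, p2):
        if random.random() < 0.5:
            g1, g2 = g2, g1
        c1.append(g1)
        c2.append(g2)
    return "".join(c1), "".join(c2)
```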
Basic Terminologies of
Genetic Algorithm
• Mutation: a small random change in a chromosome. It is used to introduce diversity into the genetic population.
• Types of mutation:
  – Bit flip mutation
  – Swap mutation
  – Random resetting
  – Scramble mutation
  – Inversion mutation
Basic Terminologies of
Genetic Algorithm

• Bit Flip Mutation: One or more random bits are selected and flipped.
Basic Terminologies of
Genetic Algorithm

• Random Resetting: an extension of the bit-flip method, used for integer representations.
Basic Terminologies of
Genetic Algorithm
• Swap Mutation: two positions on the chromosome are selected at random and their values are interchanged.
Basic Terminologies of
Genetic Algorithm
• Scramble Mutation: a subset of genes is chosen and their values are shuffled randomly.
Basic Terminologies of
Genetic Algorithm
• Inversion Mutation: a subset of genes is chosen, and the order of the genes in that subset is reversed.
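As a rough illustration, three of the mutation operators above can be sketched on a chromosome represented as a Python list of genes; the mutation rate is an illustrative value.

```python
import random

# Bit flip: flip each bit independently with a small probability.
def bit_flip(chrom, rate=0.1):
    return [1 - g if random.random() < rate else g for g in chrom]

# Swap: interchange the values at two random positions.
def swap(chrom):
    i, j = random.sample(range(len(chrom)), 2)
    chrom = chrom[:]                                 # copy before modifying
    chrom[i], chrom[j] = chrom[j], chrom[i]
    return chrom

# Inversion: reverse the order of a randomly chosen subset of genes.
def inversion(chrom):
    i, j = sorted(random.sample(range(len(chrom)), 2))
    return chrom[:i] + chrom[i:j + 1][::-1] + chrom[j + 1:]
```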
Flow chart of GA

Start → Initial Population of Solutions → Selection of the Best Individual Solutions → Crossover → Mutation → Evolution of the next generation → Terminate? (No: repeat from Selection; Yes: output the Optimal Solution and Stop)
Fitness Function
• Determines the fitness of an individual solution (bit string).
• Fitness refers to the ability of an individual to compete with other individuals.
• An individual solution is selected based on its fitness score.
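A minimal sketch of fitness-proportionate (roulette-wheel) selection, matching the Sf(x) = f(x)/Σf(x) scheme used in the worked example later in the slides, is given below; it assumes non-negative fitness values.

```python
import random

# Individuals with higher fitness scores are more likely to be chosen as parents.
def roulette_select(population, fitness_fn, k=2):
    scores = [fitness_fn(ind) for ind in population]
    total = sum(scores)
    probs = [s / total for s in scores]                 # Sf(x) = f(x) / sum f(x)
    return random.choices(population, weights=probs, k=k)

# Example usage with the f(x) = x^2 fitness on binary chromosomes.
parents = roulette_select(["01101", "11000", "01000", "10011"],
                          fitness_fn=lambda s: int(s, 2) ** 2)
```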
Advantages and Disadvantages of GA

Advantages
• It has a wide solution space.
• It is easier to discover the global optimum.
• Multiple GAs can run together on the same CPU.

Disadvantages
• The fitness function calculation is a limitation.
• Convergence of a GA can be too fast or too slow.
• There are limits on selecting the parameters.
Example-1

• Let the population of chromosomes in a genetic algorithm be represented in binary. The fitness strength of a chromosome with decimal value x is given by Sf(x) = f(x) / Σ f(x), where f(x) = x².
• The population is given by P, where:
  P = {(01101), (11000), (01000), (10011)}
Step 1: Selection

P      Value in decimal   f(x) = x²   Sf(x) = f(x)/Σf(x)   Expected count N·Sf(x)
01101  13                 169         169/1170 = 0.14      4 × 0.14 = 0.56
11000  24                 576         576/1170 = 0.49      4 × 0.49 = 1.96
01000  8                  64          64/1170  = 0.06      4 × 0.06 = 0.24
10011  19                 361         361/1170 = 0.31      4 × 0.31 = 1.24
Total                     1170
Step 2: Crossover

P (initial)   Crossover point   After crossover   Value in decimal   f(x) = x²
0110|1        4                 01100             12                 144
1100|0        4                 11001             25                 625
11|000        2                 11011             27                 729
10|011        2                 10000             16                 256
Total                                                                1754
Step 3: Mutation

After crossover   After mutation   Value in decimal   f(x) = x²
01100             11100            28                 784
11001             11001            25                 625
11011             11011            27                 729
10000             10100            20                 400
Total                                                 2538
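The following small sketch reproduces the Step 1 selection calculations and the decimal values after mutation; note that the expected counts here use unrounded probabilities, so they differ slightly from the rounded figures in the table above.

```python
# Step 1: selection probabilities and expected counts for Example-1.
population = ["01101", "11000", "01000", "10011"]
fitness = [int(c, 2) ** 2 for c in population]        # 169, 576, 64, 361
total = sum(fitness)                                  # 1170
for chrom, f in zip(population, fitness):
    sf = f / total                                    # selection probability Sf(x)
    print(chrom, int(chrom, 2), f, round(sf, 2), round(len(population) * sf, 2))

# Step 3: decimal values and fitness of the mutated population.
mutated = ["11100", "11001", "11011", "10100"]
print([(int(c, 2), int(c, 2) ** 2) for c in mutated])  # [(28, 784), (25, 625), (27, 729), (20, 400)]
```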
Example - 2

Suppose a genetic algorithm uses chromosomes of the form x = “a


b c d e f g h” with a fixed length of eight genes. Each gene can be
any digit between 0 and 9. Let the fitness of individual x be
calculated as: f(x) = (a+b)-(c+d)+(e+f)-(g+h). Let the initial
population consist of four individuals with the following
chromosomes:
x1 = 6 5 4 1 3 5 3 2
x2 = 8 7 1 2 6 6 0 1
x3 = 2 3 9 2 1 2 8 5
x4 = 4 1 8 5 2 0 9 4
Example - 2

a. Evaluate the fitness of each individual, showing all your workings, and arrange them
in order with the fittest first and the least fit last.
b. Perform the following crossover operations.
i. Cross the fittest two individuals using one-point crossover at the middle point.
ii. Cross the second and third fittest individuals using a two-point crossover
(points b and f).
iii. Cross the first and third fittest individuals (ranked 1st and 3rd) using a uniform
crossover.
c. Suppose the new population consists of the six offspring individuals received by the
crossover operations in the above question. Evaluate the fitness of the new
population, showing all the workings. Has the overall fitness improved?
x = “a b c d e f g h”
f(x) = (a+b)-(c+d)+(e+f)-(g+h)
x1 = 6 5 4 1 3 5 3 2
x2 = 8 7 1 2 6 6 0 1
x3 = 2 3 9 2 1 2 8 5
x4 = 4 1 8 5 2 0 9 4
a. Evaluate the fitness of each individual, showing all your
workings, and arrange them in order with the fittest first
and the least fit last.
Sol: f(x1) = (6+5)-(4+1)+(3+5)-(3+2) = 9
     f(x2) = (8+7)-(1+2)+(6+6)-(0+1) = 23
     f(x3) = (2+3)-(9+2)+(1+2)-(8+5) = -16
     f(x4) = (4+1)-(8+5)+(2+0)-(9+4) = -19
The order, fittest first, is: x2, x1, x3, x4.
x = “a b c d e f g h”
f(x) = (a+b)-(c+d)+(e+f)-(g+h)
x1 = 6 5 4 1 3 5 3 2
x2 = 8 7 1 2 6 6 0 1
x3 = 2 3 9 2 1 2 8 5
x4 = 4 1 8 5 2 0 9 4
i. Cross the fittest two individuals using one-point crossover at the middle point.
Sol: x2 = 8 7 1 2 | 6 6 0 1  →  o1 = 8 7 1 2 3 5 3 2
     x1 = 6 5 4 1 | 3 5 3 2  →  o2 = 6 5 4 1 6 6 0 1
x = “a b c d e f g h”
f(x) = (a+b)-(c+d)+(e+f)-(g+h)
x1 = 6 5 4 1 3 5 3 2
x2 = 8 7 1 2 6 6 0 1
x3 = 2 3 9 2 1 2 8 5
x4 = 4 1 8 5 2 0 9 4
ii. Cross the second and third fittest individuals using a two-point crossover (points b and f).
Sol: x1 = 6 5 | 4 1 3 5 | 3 2  →  o3 = 6 5 9 2 1 2 3 2
     x3 = 2 3 | 9 2 1 2 | 8 5  →  o4 = 2 3 4 1 3 5 8 5
x = “a b c d e f g h”
f(x) = (a+b)-(c+d)+(e+f)-(g+h)
x1 = 6 5 4 1 3 5 3 2
x2 = 8 7 1 2 6 6 0 1
x3 = 2 3 9 2 1 2 8 5
x4 = 4 1 8 5 2 0 9 4
iii. Cross the first and third fittest individuals (ranked 1st and 3rd) using a uniform crossover.
Sol: x2 = 8 7 1 2 6 6 0 1  →  o5 = 2 7 1 2 6 2 0 1
     x3 = 2 3 9 2 1 2 8 5  →  o6 = 8 3 9 2 1 6 8 5
x = “a b c d e f g h”
f(x) = (a+b)-(c+d)+(e+f)-(g+h)
o1 = 8 7 1 2 3 5 3 2
o2 = 6 5 4 1 6 6 0 1
o3 = 6 5 9 2 1 2 3 2
o4 = 2 3 4 1 3 5 8 5
o5 = 2 7 1 2 6 2 0 1
o6 = 8 3 9 2 1 6 8 5

c. Suppose the new population consists of the six offspring individuals produced by the crossover operations above. Evaluate the fitness of the new population, showing all the workings. Has the overall fitness improved?
Sol: f(o1) = (8+7)-(1+2)+(3+5)-(3+2) = 15
     f(o2) = (6+5)-(4+1)+(6+6)-(0+1) = 17
     f(o3) = (6+5)-(9+2)+(1+2)-(3+2) = -2
     f(o4) = (2+3)-(4+1)+(3+5)-(8+5) = -5
     f(o5) = (2+7)-(1+2)+(6+2)-(0+1) = 13
     f(o6) = (8+3)-(9+2)+(1+6)-(8+5) = -6
Yes, the overall fitness has improved: the total fitness rises from -3 for the parents to 32 for the offspring.
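The fitness calculations above can be checked with a short sketch of the Example-2 fitness function; the tuples below are the parent and offspring chromosomes from the worked solution.

```python
# Fitness function of Example-2: f(x) = (a+b) - (c+d) + (e+f) - (g+h).
def fitness(ch):
    a, b, c, d, e, f, g, h = ch
    return (a + b) - (c + d) + (e + f) - (g + h)

parents = [(6,5,4,1,3,5,3,2), (8,7,1,2,6,6,0,1), (2,3,9,2,1,2,8,5), (4,1,8,5,2,0,9,4)]
offspring = [(8,7,1,2,3,5,3,2), (6,5,4,1,6,6,0,1), (6,5,9,2,1,2,3,2),
             (2,3,4,1,3,5,8,5), (2,7,1,2,6,2,0,1), (8,3,9,2,1,6,8,5)]

print([fitness(x) for x in parents])      # [9, 23, -16, -19], total -3
print([fitness(x) for x in offspring])    # [15, 17, -2, -5, 13, -6], total 32
```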
Reference Books
• Tom M. Mitchell, Machine Learning, McGraw-Hill Education (India) Private Limited, 2013.
• Ethem Alpaydin, Introduction to Machine Learning (Adaptive Computation and Machine Learning), The MIT Press, 2004.
• Stephen Marsland, Machine Learning: An Algorithmic Perspective, CRC Press, 2009.
• Bishop, C., Pattern Recognition and Machine Learning. Berlin: Springer-Verlag.
Text Books
• Saikat Dutt, Subramanian Chandramouli, Amit Kumar Das, Machine Learning, Pearson.
• Andreas C. Müller and Sarah Guido, Introduction to Machine Learning with Python.
• John Paul Mueller and Luca Massaron, Machine Learning for Dummies.
• Dr. Himanshu Sharma, Machine Learning, S.K. Kataria & Sons, 2022.
