
Introduction to Machine Learning

- Machine learning enables computers to learn from examples and experience


- It involves building algorithms that can learn from and make predictions on data
- It is a subset of artificial intelligence that focuses on enabling computers to learn from data
- Machine learning has a wide range of applications, from image recognition to speech
recognition to predicting stock prices

Applications of Machine Learning:


1. Image Recognition: identifying objects and people in photos and videos
2. Natural Language Processing: understanding and generating human language
3. Recommendation Systems: suggesting products or content based on a user's interests and behavior
4. Fraud Detection: identifying and preventing fraudulent activity
5. Predictive Analytics: predicting future trends and events based on past data
6. Autonomous Vehicles: enabling cars and other vehicles to operate on their own, by learning from their environment
7. Medical Diagnostics: assisting doctors with diagnosing medical conditions, by analyzing patient data
8. Financial Analysis: predicting stock prices and other financial trends based on market data
9. Robotics: enabling robots to learn and perform tasks on their own, by learning from their environment
10. Gaming: creating intelligent opponents and game environments that can adapt to the player's behavior

Challenges in Machine Learning:


1. Data quality: bad or poor quality data can negatively affect model performance
2. Data quantity: insufficient data can limit the effectiveness of a model
3. Bias and fairness: models may learn biases based on their training data, leading to unfair or discriminatory outcomes
4. Overfitting and underfitting: models may be too complex or not complex enough, leading to poor performance on unseen data
5. Model interpretation: complex models may be difficult to interpret, making it hard to understand how predictions are made

WELL-POSED LEARNING PROBLEMS:


Definition: A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E.

To have a well-defined learning problem, three features need to be identified:


1.The class of tasks
2.The measure of performance to be improved
3.The source of experience
Examples
1.Checkers game: A computer program that learns to play checkers might improve its
performance as measured by its ability to win at the class of tasks involving playing
checkers games, through experience obtained by playing games against itself.
A checkers learning problem:
•Task T: playing checkers
•Performance measure P: percent of games won against opponents
•Training experience E: playing practice games against itself
2.A handwriting recognition learning problem:
•Task T: recognizing and classifying handwritten words within images
•Performance measure P: percent of words correctly classified
•Training experience E: a database of handwritten words with given
classifications

3.A robot driving learning problem:


•Task T: driving on public four-lane highways using vision sensors
•Performance measure P: average distance travelled before an error (as judged
by human overseer)
•Training experience E: a sequence of images and steering commands recorded
while observing a human driver

1.2 DESIGNING A LEARNING SYSTEM:

The basic design issues and approaches to machine learning are illustrated by designing a
program that learns to play checkers, with the goal of entering it in the world checkers
tournament.
1.Choosing the Training Experience
2.Choosing the Target Function
3.Choosing a Representation for the Target Function
4.Choosing a Function Approximation Algorithm
1.Estimating training values
2.Adjusting the weights
5.The Final Design

1.Choosing the Training Experience

•The first design choice is to choose the type of training experience from which the
system will learn.
•The type of training experience available can have a significant impact on success or
failure of the learner.

There are three attributes that impact the success or failure of the learner:

1.Whether the training experience provides direct or indirect feedback regarding the
choices made by the performance system.
For example, in checkers game:
In learning to play checkers, the system might learn from direct training examples
consisting of individual checkers board states and the correct move for each.
Alternatively, the system might have available only indirect training examples, consisting of
the move sequences and final outcomes of various games played. In that case, information
about the correctness of specific moves early in the game must be inferred indirectly from
the fact that the game was eventually won or lost.
Here the learner faces an additional problem of credit assignment, or determining the
degree to which each move in the sequence deserves credit or blame for the final
outcome. Credit assignment can be a particularly difficult problem because the game can
be lost even when early moves are optimal, if these are followed later by poor moves.
Hence, learning from direct training feedback is typically easier than learning from
indirect feedback.

2.The degree to which the learner controls the sequence of training examples
For example, in checkers game:
The learner might depend on the teacher to select informative board states and to
provide the correct move for each.
Alternatively, the learner might itself propose board states that it finds particularly
confusing and ask the teacher for the correct move.
The learner may have complete control over both the board states and (indirect) training
classifications, as it does when it learns by playing against itself with
no teacher present.

3.How well it represents the distribution of examples over which the final system
performance P must be measured
For example, in checkers game:
In checkers learning scenario, the performance metric P is the percent of games the
system wins in the world tournament.
If its training experience E consists only of games played against itself, there is a danger
that this training experience might not be fully representative of the distribution of
situations over which it will later be tested.
In practice, it is often necessary to learn from a distribution of examples that is somewhat
different from the one on which the final system will be evaluated, and such situations are
problematic because mastery of one distribution of examples does not necessarily lead to
strong performance over another.

2.Choosing the Target Function


The next design choice is to determine exactly what type of knowledge will be learned and how
this will be used by the performance program.
Let’s consider a checkers-playing program that can generate the legal moves from any board
state.
The program needs only to learn how to choose the best move from among these legal moves.
Since the program must learn to choose among the legal moves, the most obvious choice for the
type of information to be learned is a program, or function, that chooses the best move for any
given board state.

1.Let ChooseMove be the target function and the notation is

ChooseMove : B → M
which indicate that this function accepts as input any board from the set of legal board
states B and produces as output some move from the set of legal moves M
ChooseMove is an obvious choice for the target function in the checkers example, but this
function will turn out to be very difficult to learn given the kind of indirect training experience
available to our system.

2.An alternative target function is an evaluation function that assigns a numerical score
to any given board state
Let the target function be V, with the notation
V : B → R

which denotes that V maps any legal board state from the set B to some real value. We intend
for this target function V to assign higher scores to better board states. If the system
can successfully learn such a target function V, then it can easily use it to select the best
move from any current board position.

Let us define the target value V(b) for an arbitrary board state b in B, as follows:
•If b is a final board state that is won, then V(b) = 100
•If b is a final board state that is lost, then V(b) = -100
•If b is a final board state that is drawn, then V(b) = 0
•If b is not a final state in the game, then V(b) = V(b' ),
Where b' is the best final board state that can be achieved starting from b and playing optimally
until the end of the game.

3.Choosing a Representation for the Target Function

Let’s choose a simple representation: for any given board state, the learned function V̂ will be
calculated as a linear combination of the following board features:
• xl: the number of black pieces on the board
• x2: the number of red pieces on the board
• x3: the number of black kings on the board
• x4: the number of red kings on the board
• x5: the number of black pieces threatened by red (i.e., which can be captured on red's
next turn)
• x6: the number of red pieces threatened by black
Thus, the learning program will represent V̂(b) as a linear function of the form

V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6

Where,
•w0 through w6 are numerical coefficients, or weights, to be chosen by the learning algorithm.
•Learned values for the weights w1 through w6 will determine the relative importance of the various board features in determining the value of the board.
•The weight w0 will provide an additive constant to the board value.
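
To make the representation concrete, here is a hedged Python sketch of this linear evaluation function; the feature values and weights below are invented placeholders, not learned values from the notes.

# A minimal sketch of the linear evaluation function V̂(b) for the checkers
# example; feature extraction is assumed to happen elsewhere, and the weights
# are arbitrary placeholders rather than learned values.
def v_hat(features, weights):
    # features: [x1, ..., x6] board features; weights: [w0, w1, ..., w6]
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

board_features = [3, 0, 1, 0, 0, 0]               # 3 black pieces, 1 black king
weights = [0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]  # hypothetical values
print(v_hat(board_features, weights))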

4.Choosing a Function Approximation Algorithm


In order to learn the target function f we require a set of training examples, each describing a
specific board state b and the training value Vtrain(b) for b.
Each training example is an ordered pair of the form (b, Vtrain(b)).
For instance, the following training example describes a board state b in which black has won
the game (note x2 = 0 indicates that red has no remaining pieces) and for which the target
function value Vtrain(b) is therefore +100.
((x1=3, x2=0, x3=1, x4=0, x5=0, x6=0), +100)

Function Approximation Procedure

1.Derive training examples from the indirect training experience available to the learner
2.Adjusts the weights wi to best fit these training examples
1.Estimating training values
A simple approach for estimating training values for intermediate board states is to
assign the training value Vtrain(b) for any intermediate board state b to be
V̂(Successor(b)),

Where,
•V̂ is the learner's current approximation to V
•Successor(b) denotes the next board state following b for which it is again the
program's turn to move
Rule for estimating training values

Vtrain(b) ← V̂(Successor(b))

2.Adjusting the weights


Specify the learning algorithm for choosing the weights wi to best fit the set of training
examples {(b, Vtrain(b))}
A first step is to define what we mean by the best fit to the training data.
One common approach is to define the best hypothesis, or set of weights, as that which
minimizes the squared error E between the training values and the values predicted by
the hypothesis.
Several algorithms are known for finding weights of a linear function that minimize E.
One such algorithm is called the least mean squares, or LMS training rule. For each
observed training example it adjusts the weights a small amount in the direction that
reduces the error on this training example

LMS weight update rule :- For each training example (b, Vtrain(b))
Use the current weights to calculate V̂(b)
For each weight wi, update it as

wi ← wi + η (Vtrain(b) − V̂(b)) xi

Here η is a small constant (e.g., 0.1) that moderates the size of the weight update.

Working of weight update rule


•When the error (Vtrain(b) − V̂(b)) is zero, no weights are changed.
•When (Vtrain(b) − V̂(b)) is positive (i.e., when V̂(b) is too low), then each weight
is increased in proportion to the value of its corresponding feature. This will raise the
value of V̂(b), reducing the error.
•If the value of some feature xi is zero, then its weight is not altered regardless of
the error, so that the only weights updated are those whose features actually occur
on the training example board
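
As a hedged illustration of this rule (not part of the original notes), the update can be written in a few lines of Python; the starting weights, learning rate and training example below are made-up placeholders.

# A minimal sketch of the LMS weight update for the checkers evaluation function.
def lms_update(weights, features, v_train, eta=0.1):
    # Prediction with the current weights: V̂(b) = w0 + sum(wi * xi)
    v_hat = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    error = v_train - v_hat
    # w0 is updated as if its feature value were always 1.
    updated = [weights[0] + eta * error]
    updated += [w + eta * error * x for w, x in zip(weights[1:], features)]
    return updated

weights = [0.0] * 7                             # w0 .. w6, arbitrary start
features, v_train = [3, 0, 1, 0, 0, 0], 100.0   # one training example (b, Vtrain(b))
weights = lms_update(weights, features, v_train)
print(weights)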
5.The Final Design
The final design of the checkers learning system can be described by four distinct program modules
that represent the central components in many learning systems:

1. The Performance System is the module that must solve the given performance task by using the learned target function(s). It takes an instance of a new problem (new game) as input and produces a trace of its solution (game history) as output.

2. The Critic takes as input the history or trace of the game and produces as output a set of training examples of the target function.

3. The Generalizer takes as input the training examples and produces an output hypothesis that is its estimate of the target function. It generalizes from the specific training examples, hypothesizing a general function that covers these examples and other cases beyond the training examples.

4. The Experiment Generator takes as input the current hypothesis and outputs a new problem (i.e., initial board state) for the Performance System to explore. Its role is to pick new practice problems that will maximize the learning rate of the overall system.

Issues in Machine Learning

The field of machine learning is concerned with answering questions such as the following:

•What algorithms exist for learning general target functions from specific training examples? In what settings will particular algorithms converge to the desired function, given sufficient training data? Which algorithms perform best for which types of problems and representations?

•How much training data is sufficient? What general bounds can be found to relate the confidence in learned hypotheses to the amount of training experience and the character of the learner's hypothesis space?

•When and how can prior knowledge held by the learner guide the process of generalizing from examples? Can prior knowledge be helpful even when it is only approximately correct?

•What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?

•What is the best way to reduce the learning task to one or more function approximation problems? Put another way, what specific functions should the system attempt to learn? Can this process itself be automated?

•How can the learner automatically alter its representation to improve its ability to represent and learn the target function?

Association Rules Learning


•Association rule learning is an unsupervised learning technique that checks for the dependency of one data item on another data item and maps these dependencies so that they can be exploited, for example to make retailing more profitable.

•It tries to find interesting relations or associations among the variables of a dataset. It is based on different rules for discovering the interesting relations between variables in the database.

•Association rule learning is one of the important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc.

•Market basket analysis is a technique used by large retailers to discover associations between items. We can understand it with the example of a supermarket, where products that are frequently purchased together are placed together.


How does Association Rule Learning work?

•Association rule learning works on the concept of if/then statements, such as: if A, then B.

•Here the "if" element is called the antecedent, and the "then" statement is called the consequent.

•Relationships in which we find an association between two single items are known as single cardinality. Association rule learning is all about creating such rules, and as the number of items increases, the cardinality increases accordingly. To measure the associations between thousands of data items, several metrics are used. These metrics are given below:

Support, Confidence and Lift
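
As a hedged illustration (not from the original notes), the three metrics can be computed for a single candidate rule over a tiny, made-up list of transactions; the rule "if bread then milk" is purely an example.

# A minimal sketch of Support, Confidence and Lift for a rule "if A then B"
# over a toy list of market-basket transactions (hypothetical data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
antecedent, consequent = {"bread"}, {"milk"}

n = len(transactions)
count_a = sum(1 for t in transactions if antecedent <= t)
count_b = sum(1 for t in transactions if consequent <= t)
count_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)

support = count_ab / n                  # how often A and B occur together
confidence = count_ab / count_a         # P(B | A)
lift = confidence / (count_b / n)       # confidence relative to P(B)
print(support, confidence, lift)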

SUPERVISED LEARNING

Supervised learning is the type of machine learning in which machines are trained using well "labelled" training data, and on the basis of that data, machines predict the output. Labelled data means that some input data is already tagged with the correct output. In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y). In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.

How Supervised Learning Works?


In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on test data (a held-out subset of the dataset that was not used for training), and it then predicts the output. The working of supervised learning can be easily understood by the following example:

Suppose we have a dataset of different types of shapes, which includes squares, rectangles, triangles, and polygons. The first step is to train the model on each shape:

o If the given shape has four sides, and all the sides are equal, it will be labelled as a square.
o If the given shape has three sides, it will be labelled as a triangle.
o If the given shape has six equal sides, it will be labelled as a hexagon.

Now, after training, we test our model using the test set, and the task of the model is to identify the shape. The machine is already trained on all types of shapes, so when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.

Steps Involved in Supervised Learning:

1. First, determine the type of training dataset.
2. Collect/gather the labelled training data.
3. Split the dataset into a training set, a test set, and a validation set.
4. Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
5. Determine a suitable algorithm for the model, such as a support vector machine, decision tree, etc.
6. Execute the algorithm on the training dataset. Sometimes we also need a validation set to tune the control parameters; this is a subset of the training data.
7. Evaluate the accuracy of the model by providing the test set. If the model predicts the correct outputs, the model is accurate.
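
These steps map naturally onto a few lines of scikit-learn. The following is a hedged sketch that uses the library's bundled Iris dataset and a decision tree purely for illustration; any labelled dataset and algorithm could be substituted.

# A minimal sketch of the supervised-learning workflow: split labelled data,
# train a classifier, and evaluate accuracy on the held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                     # labelled training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)             # step 3: split the data

model = DecisionTreeClassifier()                      # step 5: choose an algorithm
model.fit(X_train, y_train)                           # step 6: train
print(accuracy_score(y_test, model.predict(X_test)))  # step 7: evaluate on test set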

Types of supervised Machine learning Algorithms:

Supervised learning can be further divided into two types of problems:

1) Regression

2) Classification

1. Regression

Regression algorithms are used if there is a relationship between the input

variable and the output variable. It is used for the prediction of continuous

variables, such as Weather forecasting, Market Trends, etc. Below are some

popular Regression algorithms which come under supervised learning:

1. Linear Regression
2. Regression Trees
3. Non-Linear Regression
4. Bayesian Linear Regression
5. Polynomial Regression

2.Classification

Classification algorithms are used when the output variable is categorical, which means the output falls into discrete classes such as Yes or No, Male or Female, True or False, etc. Image classification, fraud detection, and spam filtering are some examples of classification. The popular algorithms for classification are:

• Random Forest
• Decision Trees
• Logistic Regression
• Support vector Machines

Advantages of Supervised learning:

1. With the help of supervised learning, the model can predict the output on the basis of prior experience.
2. In supervised learning, we can have an exact idea about the classes of objects.
3. Supervised learning models help us solve various real-world problems such as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:

1. Supervised learning models are not suitable for handling very complex tasks.
2. Supervised learning cannot predict the correct output if the test data is different from the training dataset.
3. Training requires a lot of computation time.
4. In supervised learning, we need enough knowledge about the classes of objects.

Regression vs. Classification in supervised Learning:

Regression and Classification algorithms are Supervised Learning

algorithms. Both the algorithms are used for prediction in Machine learning and

work with the labelled datasets. But the difference between both is how they are

used for different machine learning problems.

The main difference between regression and classification algorithms is that regression algorithms are used to predict continuous values such as price, salary, age, etc., while classification algorithms are used to predict/classify discrete values such as Male or Female, True or False, Spam or Not Spam, etc.


Classification:

Classification is a process of finding a function which helps in dividing

the dataset into classes based on different parameters. In Classification, a

computer program is trained on the training dataset and based on that training, it

categorizes the data into different classes. The task of the classification

algorithm is to find the mapping function to map the input(x) to the discrete

output(y).

Example: The best example to understand the Classification problem is Email

Spam Detection. The model is trained on the basis of millions of emails on

different parameters, and whenever it receives a new email, it identifies whether

the email is spam or not. If the email is spam, then it is moved to the Spam

folder.

Types of ML Classification Algorithms:

Classification Algorithms can be further divided into the following types:


I. Logistic Regression
II. K-Nearest Neighbours
III. Support Vector Machines
IV. Kernel SVM
V. Naïve Bayes
VI. Decision Tree Classification
VII. Random Forest Classification

Regression:

Regression is a process of finding the correlations between dependent and


independent variables. It helps in predicting the continuous variables such as
prediction of Market Trends, prediction of House prices, etc.
The task of the Regression algorithm is to find the mapping function to map the
input variable(x) to the continuous output variable(y).
Example: Suppose we want to do weather forecasting, so for this, we will use
the Regression algorithm. In weather prediction, the model is trained on the past
data, and once the training is completed, it can easily predict the weather for
future days.

Types of Regression Algorithm:


I. Simple Linear Regression
II. Multiple Linear Regression
III. Polynomial Regression
IV. Support Vector Regression
V. Decision Tree Regression
VI. Random Forest Regression

Unsupervised Learning:
Unsupervised learning is a branch of machine learning where the model is trained on

unlabeled data, meaning that it doesn't receive explicit input-output pairs. Instead, it must

discover the underlying structure or patterns in the data by itself. This is in contrast to

supervised learning, where the model is provided with labeled data to learn from.

Unsupervised learning tasks include clustering, dimensionality reduction, and generative

modeling.

Reinforcement Learning

Reinforcement Learning (RL) is a branch of machine learning that focuses on how agents can learn to
make decisions through trial and error to maximize cumulative rewards. RL allows machines to learn
by interacting with an environment and receiving feedback based on their actions. This feedback
comes in the form of rewards or penalties.

Reinforcement Learning revolves around the idea that an agent (the learner or decision-maker)
interacts with an environment to achieve a goal. The agent performs actions and receives feedback
to optimize its decision-making over time.

• Agent: The decision-maker that performs actions.

• Environment: The world or system in which the agent operates.


• State: The situation or condition the agent is currently in.

• Action: The possible moves or decisions the agent can make.

• Reward: The feedback or result from the environment based on the agent’s action.
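
To make the agent/environment/state/action/reward vocabulary concrete, here is a hedged sketch of the interaction loop with a tiny invented environment; it shows only the feedback cycle described above, not a full RL algorithm.

# A toy environment: the agent starts at position 0 on a short line and earns
# +1 for reaching position 3, with a small penalty for every other step.
import random

class LineWorld:
    def __init__(self):
        self.state = 0
    def step(self, action):                   # action: -1 (left) or +1 (right)
        self.state = max(0, min(3, self.state + action))
        reward = 1.0 if self.state == 3 else -0.1
        done = self.state == 3
        return self.state, reward, done       # new state + reward feedback

env = LineWorld()
total_reward, done = 0.0, False
while not done:
    action = random.choice([-1, 1])           # a random policy stands in for the agent
    state, reward, done = env.step(action)
    total_reward += reward                    # the signal a real agent would learn from
print(total_reward)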

UNIT 2: Data Pre-processing

What is Data Preprocessing in Machine Learning?

Data preprocessing is the process of evaluating, filtering, manipulating, and encoding data so that a
machine learning algorithm can understand it and use the resulting output. The major goal of data
preprocessing is to eliminate data issues such as missing values, improve data quality, and make the
data useful for machine learning purposes.

Why is Data Preprocessing important?

The majority of real-world datasets for machine learning are highly susceptible to missing, inconsistent, and noisy data because of their heterogeneous origin. Applying data mining algorithms to this noisy data would not give quality results, as they would fail to identify patterns effectively. Data preprocessing is therefore important to improve the overall data quality.

•Duplicate or missing values may give an incorrect view of the overall statistics of data.

•Outliers and inconsistent data points often tend to disturb the model’s overall learning,

leading to false predictions.

Quality decisions must be based on quality data. Data Preprocessing is important to get this

quality data, without which it would just be a Garbage In, Garbage Out scenario.

Here are the reasons why data preprocessing is so important for machine learning
projects:
It Improves Data Quality:-

Data preprocessing is the fast track to improving data quality since many of its steps mirror activities
you’ll find in any data quality management process, such as data cleansing, data profiling, data
integration, and more.

It Normalizes and Scales Data:-

Dependent and independent variables change on separate scales, or one changes linearly while
another changes exponentially. Salary, for example, might be a multiple-figure digit, whereas age is
expressed in double digits. Normalizing and scaling help to modify data in a way that allows
computers to extract a meaningful link between these variables.

It Eliminates Duplicate Records:-

When two records appear to repeat, an algorithm must identify whether the same metric was
captured twice or whether the data reflects separate occurrences. In rare circumstances, a record
may have minor discrepancies due to an erroneously reported field. Techniques for finding, deleting,
or connecting duplicates help to address such data quality issues automatically.

It Handles Outliers :-

Outliers are extreme values that deviate markedly from the rest of the data and can skew model training; preprocessing helps detect and limit their influence (for example with clustering-based detection, as described under data cleaning below). In addition, when practitioners merge many data sources to construct a new machine learning model, techniques such as principal component analysis lower the number of dimensions in the training data set and produce a more efficient representation.

It Helps in Enhancing Model Performance :-

Preprocessing often entails developing new features or modifying existing ones to better capture the
underlying problem and enhance model performance. This might include encoding category
variables, developing interaction terms, and retrieving pertinent data from text or timestamps.

4 Steps in Data Preprocessing

Now, let's discuss more in-depth four main stages of data preprocessing

Data Cleaning:-

Data Cleaning is particularly done as part of data preprocessing to clean the data by filling

missing values, smoothing the noisy data, resolving the inconsistency, and removing outliers.

1. Missing values
Here are a few ways to solve this issue:

•Ignore those tuples

This method should be considered when the dataset is huge and numerous missing values are

present within a tuple.

•Fill in the missing values

There are many methods to achieve this, such as filling in the values manually, predicting the

missing values using regression method, or numerical methods like attribute mean.

2. Noisy Data

It involves removing a random error or variance in a measured variable. It can be done with the

help of the following techniques:

•Binning

It is the technique that works on sorted data values to smoothen any noise present in it. The data

is divided into equal-sized bins, and each bin/bucket is dealt with independently. All data in a

segment can be replaced by its mean, median or boundary values.

•Regression

This data mining technique is generally used for prediction. It helps to smoothen noise by fitting

all the data points in a regression function. The linear regression equation is used if there is only

one independent attribute; else Polynomial equations are used.

•Clustering

Creation of groups/clusters from data having similar values. The values that don't lie in the

cluster can be treated as noisy data and can be removed.

3. Removing outliers

Clustering techniques group together similar data points. The tuples that lie outside the cluster

are outliers/inconsistent data.
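
The cleaning steps above (filling missing values with an attribute mean, then smoothing sorted values in equal-sized bins) can be sketched with pandas; the small DataFrame here is invented purely for illustration.

# A minimal sketch of two data-cleaning steps: mean imputation of missing
# values, then binning with bin-mean smoothing, on toy data.
import pandas as pd

df = pd.DataFrame({"age": [22, 25, None, 40, 35, None, 28, 61]})

# 1. Fill in the missing values with the attribute mean.
df["age"] = df["age"].fillna(df["age"].mean())

# 2. Binning: sort the values into 3 bins and replace each value by its bin mean.
df["bin"] = pd.qcut(df["age"], q=3, labels=False)
df["age_smoothed"] = df.groupby("bin")["age"].transform("mean")
print(df)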


Data Integration

Data Integration is one of the data preprocessing steps that are used to merge the data present in

multiple sources into a single larger data store like a data warehouse.

Data Integration is needed especially when we are aiming to solve a real-world scenario like

detecting the presence of nodules from CT Scan images. The only option is to integrate the

images from multiple medical nodes to form a larger database

Data Preprocessing: Best practices

Here's a short recap of everything we've learnt about data preprocessing:

•The first step in Data Preprocessing is to understand your data. Just looking at your

dataset can give you an intuition of what things you need to focus on.

•Use statistical methods or pre-built libraries that help you visualize the dataset and give a

clear image of how your data looks in terms of class distribution.

•Summarize your data in terms of the number of duplicates, missing values, and outliers

present in the data.

•Drop the fields you think have no use for the modeling or are closely related to other

attributes. Dimensionality reduction is one of the very important aspects of Data

Preprocessing.

•Do some feature engineering and figure out which attributes contribute most towards

model training.

Data Transformation

One of the most important stages in the preparation phase is data transformation, which changes
data from one format to another. Some algorithms require that the input data be changed – if you
fail to finish this process, you may receive poor model performance or even introduce bias.

For example, the KNN model uses distance measurements to determine which neighbors are closest
to a particular record. If you have a feature with a particularly high scale relative to the other
features in your model, your model will likely employ this feature more than the others, resulting in a
bias.

Data Reduction

Sometimes, datasets are too large or contain too many features. Data reduction helps simplify the
dataset without losing important information. Techniques include:

• Dimensionality reduction: Reducing the number of features using methods like Principal
Component Analysis (PCA).

• Feature selection: Identifying and keeping only the most relevant features to the problem.
Feature Scaling
Scaling is a broader term that encompasses both normalization and standardization. While
normalization aims for a specific range (0-1), scaling adjusts the spread or variability of your
data.
Feature scaling is a technique to standardize the independent features present in the data. It is performed during data pre-processing to handle highly varying values. If feature scaling is not done, a machine learning algorithm tends to treat greater values as more important and smaller values as less important, regardless of the units of the values; for example, a value of 10 recorded in metres and a value of 10 recorded in centimetres would be treated the same even though they represent very different quantities. The following paragraphs cover the different techniques used to perform feature scaling.
Normalization

Normalization is a process that transforms your data's features to a standard scale, typically between 0 and 1. This is achieved by adjusting each feature's values based on its minimum and maximum values, using min-max normalization: X' = (X − X_min) / (X_max − X_min). The goal is to ensure that no single feature dominates the others due to its magnitude.

A closely related variant, mean normalization, subtracts the mean of the whole feature from each entry instead of the minimum, and then divides the result by the difference between the maximum and the minimum values.

Why Normalize?

• Improved Model Convergence: Algorithms like gradient descent often converge faster when
features are on a similar scale.

• Fairness Across Features: In distance-based algorithms (e.g., k-nearest neighbors),


normalization prevents features with larger ranges from disproportionately influencing
results.

• Enhanced Interpretability: Comparing and interpreting feature importances is easier when


they're on the same scale.

Standardization

This method of scaling is basically based on the central tendencies and variance of the data.

1. First, we calculate the mean and standard deviation of the data we would like to standardize.

2. Then we subtract the mean value from each entry and divide the result by the standard deviation.

This gives a distribution of the data with a mean equal to zero and a standard deviation equal to 1.

Standardization: Here, each feature is transformed to have a mean of 0 and a standard deviation of
1. This is achieved by subtracting the mean value and dividing by the standard deviation of the
feature.

The formula for standardization is:

Z = (x − μ) / σ

Where:

• Z is the standardized score (also called a z-score)


• x is the original value you want to standardize

• μ (mu) is the mean of the data set

• σ (sigma) is the standard deviation of the data set
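
Both techniques are available off the shelf in scikit-learn. The sketch below applies min-max normalization and standardization to a tiny made-up feature matrix so the two outputs can be compared.

# A minimal sketch comparing normalization (min-max) and standardization
# (z-score) on a toy feature matrix whose columns have very different scales.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1000.0, 0.5],
              [2000.0, 0.1],
              [3000.0, 0.9]])                 # e.g. salary vs. a small ratio (made up)

print(MinMaxScaler().fit_transform(X))        # each column rescaled to [0, 1]
print(StandardScaler().fit_transform(X))      # each column: mean 0, std 1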

Why do we need to split data into training and testing sets?
While training a machine learning model we are trying to find a
pattern that best represents all the data points with minimum error.
While doing so, two common errors come up. These
are overfitting and underfitting.

Underfitting
Underfitting is when the model is not even able to represent the
data points in the training dataset. In the case of under-fitting,
you will get a low accuracy even when testing on the training
dataset.
Underfitting usually means that your model is too simple to
capture the complexities of the dataset.

Overfitting

Overfitting is the case when your model represents the training


dataset a little too accurately. This means that your model fits too
closely. In the case of overfitting, your model will not be able to
perform well on new unseen data. Overfitting is usually a sign of
model being too complex.
UNIT -3 :- Regression
Regression in machine learning refers to a supervised learning technique where the
goal is to predict a continuous numerical value based on one or more independent features.
It finds relationships between variables so that predictions can be made. we have two types
of variables present in regression:
• Dependent Variable (Target): The variable we are trying to predict e.g house price.
• Independent Variables (Features): The input variables that influence the prediction
e.g locality, number of rooms.

A regression analysis problem is one where the output variable is a real or continuous value,
such as "salary" or "weight". Many different regression models can be used, but the simplest
among them is linear regression.

Types of Regression
1. Simple Linear Regression:-
Linear regression is one of the simplest and most widely used statistical models. This
assumes that there is a linear relationship between the independent and dependent
variables. This means that the change in the dependent variable is proportional to
the change in the independent variables. For example predicting the price of a house
based on its size.
2. Multiple Linear Regression:-
Multiple linear regression extends simple linear regression by using multiple
independent variables to predict target variable. For example predicting the
price of a house based on multiple features such as size, location, number of
rooms, etc.
3. Polynomial Regression
Polynomial regression is used to model with non-linear relationships between the
dependent variable and the independent variables. It adds polynomial terms to the
linear regression model to capture more complex relationships. For example when
we want to predict a non-linear trend like population growth over time we use
polynomial regression.

4. Ridge & Lasso Regression


Ridge & lasso regression are regularized versions of linear regression that
help avoid overfitting by penalizing large coefficients. When there’s a risk of
overfitting due to too many features we use these type of regression
algorithms.
5. Support Vector Regression (SVR)
SVR is a type of regression algorithm that is based on the Support Vector Machine
(SVM) algorithm. SVM is a type of algorithm that is used for classification tasks but it can
also be used for regression tasks. SVR works by finding a hyperplane that minimizes the
sum of the squared residuals between the predicted and actual values.
6. Decision Tree Regression
A decision tree uses a tree-like structure to make decisions, where each branch of the tree
represents a decision and the leaves represent outcomes. For example, we use decision tree
regression to predict customer behaviour based on features like age, income, etc.

7. Random Forest Regression


Random Forest is an ensemble method that builds multiple decision trees, each trained on a
different subset of the training data. The final prediction is made by averaging the predictions
of all the trees. It can be used, for example, to predict customer churn or sales from historical data.

Regression Evaluation Metrics


Evaluation in machine learning measures the performance of a model. Here are some
popular evaluation metrics for regression:
• Mean Absolute Error (MAE): The average absolute difference between the predicted
and actual values of the target variable.
• Mean Squared Error (MSE): The average squared difference between the predicted
and actual values of the target variable.
• Root Mean Squared Error (RMSE): Square root of the mean squared error.
• Huber Loss: A hybrid loss function that transitions from MAE to MSE for larger errors,
providing balance between robustness and MSE’s sensitivity to outliers.
• R2 – Score: Higher values indicate better fit ranging from 0 to 1.

Applications of Regression
• Predicting prices: Used to predict the price of a house based on its size, location and
other features.
• Forecasting trends: Model to forecast the sales of a product based on historical sales
data.
• Identifying risk factors: Used to identify risk factors for heart patient based on
patient medical data.
• Making decisions: It could be used to recommend which stock to buy based on
market data.
Advantages of Regression
• Easy to understand and interpret.
• Robust to outliers.
• Can handle linear relationships easily.
Disadvantages of Regression
• Assumes linearity.
• Sensitive to situations where two or more independent variables are highly correlated
with each other, i.e. multicollinearity.

• May not be suitable for highly complex relationships.

Simple Linear Regression in Python

Simple linear regression models the relationship between a dependent variable and a single
independent variable. Below, we explore simple linear regression and its implementation in
Python using libraries such as NumPy, Pandas, and scikit-learn.
Understanding Simple Linear Regression
Simple Linear Regression aims to describe how one variable i.e the dependent variable
changes in relation with reference to the independent variable. For example consider a
scenario where a company wants to predict sales based on advertising expenditure. By using
simple linear regression the company can determine if an increase in advertising leads to
higher sales or not.
A scatter plot of advertising expenditure against sales, with the fitted line, illustrates this
relationship using simple linear regression.

The relationship between the dependent and independent variables is


represented by the simple linear equation:
y=mx+b
Here:
• y is the predicted value (dependent variable).
• m is the slope of the line
• x is the independent variable.
• b is the y-intercept (the value of y when x is 0).
In this equation m signifies the slope of the line indicating how much y
changes for a one-unit increase in x, a positive m suggests a direct
relationship while a negative m indicates an inverse relationship.
To better understand this relationship we can express it in a more statistical
context using the linear regression formula:
Y = β0 + β1X

In simple linear regression the parameters β0 and β1 play crucial roles in defining the
relationship between the independent variable X and the dependent variable Y in the
regression equation.

Intercept β0:
• The intercept β0 represents the predicted value of the dependent variable Y when the
independent variable X is zero. In other words, it is the point where the regression line
crosses the y-axis.
• It provides a baseline value for Y and helps us understand the expected outcome when
there is no influence from X. For example, if you were predicting sales based on
advertising expenditure, it would indicate the estimated sales when no money is spent
on advertising.

Slope β1:
• A positive β1 value suggests that as X increases, Y also increases, indicating a direct
relationship.
• Conversely, a negative β1 value indicates that an increase in X leads to a decrease in Y,
indicating an inverse relationship.

For instance, if β1 = 2, this would mean that for every additional unit spent on advertising,
sales are expected to increase by 2 units.
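
A hedged sketch of fitting such a line with scikit-learn follows; the advertising/sales numbers are invented for illustration, and the learned intercept and coefficient play the roles of β0 and β1 above.

# A minimal sketch of simple linear regression: fit y = b0 + b1*x on toy
# advertising-spend vs. sales data (made-up numbers).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[10], [20], [30], [40], [50]])   # advertising expenditure
y = np.array([25, 45, 62, 80, 103])            # sales

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])        # estimated b0 and b1
print(model.predict([[60]]))                   # predicted sales for a new spend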

Multiple Linear Regression :- Multiple linear regression is an extended version of simple
linear regression. It attempts to model the relationship between two or more features and a
target by fitting a linear equation to predict one dependent variable.

Steps for Multiple Linear Regression

The steps to perform multiple linear regression are almost the same as for simple linear
regression; the difference lies in the evaluation. We can use it to find out which factor has the
highest impact on the predicted output and how the different variables relate to each other.
The equation for multiple linear regression is:
y = β0 + β1X1 + β2X2 + ... + βnXn
• y is the dependent variable
• X1, X2, ..., Xn are the independent variables
• β0 is the intercept
• β1, β2, ..., βn are the slopes
The goal of the algorithm is to find the best fit line equation that can predict the values
based on the independent variables. The regression model learns a function from the
dataset (with known X and Y values) and uses it to predict Y values for unknown X.
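
A hedged sketch with two invented features (house size and number of rooms) predicting price; scikit-learn's LinearRegression handles multiple features with the same API as the simple case.

# A minimal sketch of multiple linear regression: price predicted from
# two features, house size and number of rooms (toy data).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[50, 2], [80, 3], [120, 4], [60, 2], [150, 5]])   # size, rooms
y = np.array([110, 170, 250, 130, 320])                         # price

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)     # intercept b0 and the slopes b1, b2
print(model.predict([[100, 3]]))         # price estimate for a new house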
Polynomial Regression :-
Polynomial regression is a form of regression analysis in which the relationship between the
independent variables and the dependent variable is modeled as an nth degree polynomial.
Polynomial regression models are usually fit with the method of least squares; under the
Gauss-Markov theorem, the least squares method minimizes the variance of the coefficients.
Polynomial regression is a special case of linear regression where we fit a polynomial equation
to data with a curvilinear relationship between the dependent and independent variables.

Why do we need Polynomial Regression?


Let’s consider a case of Simple Linear Regression.
• We make our model and find that it performs very badly.
• We compare the actual values with the best fit line we predicted, and it appears that the
actual values follow some kind of curve in the graph while our straight line is nowhere near
cutting through the middle of the points.
• This is where polynomial regression comes into play: it fits a line that follows the
pattern (curve) of the data.

• Polynomial regression does not require the relationship between the independent and
dependent variables to be linear in the data set; this is also one of the main differences
between linear and polynomial regression.
• Polynomial regression is generally used when the points in the data are not captured by
the linear regression model, i.e. when linear regression fails to describe the result well.
As we increase the degree of the model, its performance on the training data tends to increase.
However, increasing the degree of the model also increases the risk of over-fitting the data.
How to find the right degree of the equation?
In order to find the right degree for the model to prevent over-fitting or under-fitting, we
can use:
1. Forward Selection:
This method increases the degree until it is significant enough to define the best
possible model.
2. Backward Selection:
This method decreases the degree until it is significant enough to define the best
possible model.
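
A hedged sketch of polynomial regression with scikit-learn's PolynomialFeatures; the quadratic toy data and the choice of degree 2 are assumptions made only for illustration.

# A minimal sketch of polynomial regression: expand x into polynomial
# features, then fit an ordinary linear regression on the expanded features.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = 1.0 + 2.0 * X.ravel() + 0.5 * X.ravel() ** 2      # curvilinear toy data

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[4.0]]))     # prediction from the fitted curve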
A Cost Function is a function that measures the performance of a machine learning model
for given data. It is basically the calculation of the error between predicted values and
expected values, presented in the form of a single real number.
Many people get confused between the cost function and the loss function. In simple terms,
the cost function is the average error over the n samples in the data, while the loss function is
the error for an individual data point. In other words, the loss function is for one training
example; the cost function is for the entire training set.

• The cost function of polynomial regression can also be taken to be the Mean Squared Error;
however, there will be a slight change in the equation.

• Polynomial regression can reduce the cost returned by the cost function. It gives your
regression line a curvilinear shape and makes it fit the underlying data better. By applying a
higher-order polynomial, you can fit your regression line to your data more precisely.

We know that the ideal value of the cost function is 0, or somewhere close to 0.
To approach this ideal cost, we can perform gradient descent, which updates the weights and
in turn minimizes the error.
Gradient Descent for Polynomial Regression

Gradient descent is an optimization algorithm used to find the values of parameters


(coefficients) of a function that minimizes a cost function (cost).
To update m and b values in order to reduce Cost function (minimizing MSE value) and achieving the
best fit line you can use the Gradient Descent. The idea is to start with random m and b values and
then iteratively updating the values, reaching minimum cost.

Steps followed by the Gradient Descent to obtain lower cost function:

→ Initially, the values of m and b will be 0, and a learning rate (α) is introduced to the function.
The value of the learning rate (α) is taken to be very small, something between 0.01 and 0.0001.

The learning rate is a tuning parameter in an optimization algorithm that determines the step size
at each iteration while moving toward a minimum of a cost function.

→ Then the partial derivative of the cost function is calculated with respect to the slope (m), and
the derivative is likewise calculated with respect to the intercept (b).

Readers familiar with calculus will recognise how these derivatives are taken; if not, it is enough
to understand intuitively what is happening behind the scenes.

→ After the derivatives are calculated, the slope (m) and intercept (b) are updated with the
following equations:
m = m - α * (derivative of the cost with respect to m)
b = b - α * (derivative of the cost with respect to b)
The derivatives with respect to m and b are calculated as above, and α is the learning rate.
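
A hedged sketch of this loop for simple linear regression with an MSE cost; the toy data, learning rate, and iteration count are illustrative assumptions.

# A minimal sketch of gradient descent for y = m*x + b with an MSE cost.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])      # roughly y = 2x + 1 (toy data)

m, b, lr = 0.0, 0.0, 0.01                     # start at 0 with a small α
for _ in range(5000):
    error = (m * x + b) - y
    dm = (2 / len(x)) * np.sum(error * x)     # partial derivative of MSE w.r.t. m
    db = (2 / len(x)) * np.sum(error)         # partial derivative of MSE w.r.t. b
    m -= lr * dm                              # m = m - α * derivative of m
    b -= lr * db                              # b = b - α * derivative of b
print(m, b)                                   # approaches roughly 2 and 1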

Evaluating Regression Models Performance :-


Evaluation metrics for regression are essential for assessing the performance of regression
models, which predict continuous outcomes. These metrics help in measuring how well a
regression model can predict continuous values such as prices, ratings, or fees. Here are
some common evaluation metrics for regression:
1. Mean Absolute Error (MAE)
This is the simplest metric used to analyse the loss over the whole dataset. The error is the
difference between the predicted and actual values, so MAE is defined as the average of the
absolute errors: we take the modulus of each error, sum them, and divide by the total number
of data points. It is always a non-negative value. The formula of MAE is:
MAE = (1/n) * Σ |y_i − ŷ_i|

Advantages:
• MAE is in the same unit as the output variable.
• Robust to outliers.
Disadvantages:
• Not differentiable, requiring alternative optimization methods

2. Mean Squared Error(MSE)


The most commonly used metric is Mean Squared Error, or MSE. It is a function used to
calculate the loss: we find the difference between the predicted and actual values, square it,
and then average over all data points present in the dataset. MSE is always non-negative
because we square the values, and the smaller the MSE, the better the performance of our
model. The formula of MSE is:
MSE = (1/n) * Σ (y_i − ŷ_i)²

Advantages:
• Differentiable, suitable for optimization.
• Penalizes larger errors more.
Disadvantages:
The output is in squared units, which can be less interpretable.
3. Root Mean Squared Error(RMSE)
RMSE is a popular metric and is the extended version of MSE. It indicates how much the data
points are spread around the best-fit line. It is the square root of the MSE: RMSE = √MSE.
A lower value means that the data points lie closer to the best-fit line.

Advantages:
• Output in the same unit as the output variable.
• Easier to interpret.
Disadvantages:
• Not as robust to outliers as MAE

4. R-squared (R²)
R-squared (R²), also known as the coefficient of determination, measures the proportion of the
variance in the dependent variable that is predictable from the independent variables. It
provides a baseline to compare models and is independent of the context.
Advantages:
• Provides a baseline for model comparison.
• Independent of context.
Disadvantages:
• Can be misleading with irrelevant features
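
All of these metrics (with RMSE taken as the square root of MSE) are available in scikit-learn; the sketch below evaluates some made-up predictions purely for illustration.

# A minimal sketch of computing MAE, MSE, RMSE and R² for some predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5])       # toy predictions

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                           # RMSE = square root of MSE
r2 = r2_score(y_true, y_pred)
print(mae, mse, rmse, r2)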

Correlation:- A statistical tool that helps in the study of the relationship


between two variables is known as Correlation. It also helps in understanding
the economic behaviour of the variables.

The correlation coefficient is used to determine how strong the relationship between two
variables is. It yields a value between -1 and 1, in which:
• -1 indicates a strong negative relationship
• 1 indicates a strong positive relationship
• Zero implies no relationship at all
Understanding Correlation Coefficient
• A correlation coefficient of -1 means there is a decrease of a fixed proportion in one variable
for every positive increase in the other, like the amount of gas in a tank decreasing in perfect
(negative) correlation with the speed.
• A correlation coefficient of 1 means there is a positive increase of a fixed proportion in one
variable for every positive increase in the other, like shoe size going up in perfect correlation
with foot length.
• A correlation coefficient of 0 means that for every increase, there is neither a positive nor a
negative change in the other; the two just aren't related.
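
A hedged sketch of computing the (Pearson) correlation coefficient with NumPy on two invented variables; the value lands close to +1 because the toy data are almost perfectly linearly related.

# A minimal sketch of computing the Pearson correlation coefficient.
import numpy as np

foot_length = np.array([24.0, 25.5, 26.0, 27.2, 28.1])
shoe_size = np.array([38, 40, 41, 42, 44])               # toy data

r = np.corrcoef(foot_length, shoe_size)[0, 1]            # value between -1 and 1
print(r)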

Scatter Plot:-
Scatter plot is one of the most important data visualization techniques and it is considered
one of the Seven Basic Tools of Quality. A scatter plot is used to plot the relationship
between two variables, on a two-dimensional graph that is known as Cartesian Plane on
mathematical grounds.
It is generally used to plot the relationship between one independent variable and one
dependent variable, where an independent variable is plotted on the x-axis and a dependent
variable is plotted on the y-axis so that you can visualize the effect of the independent
variable on the dependent variable. These plots are known as Scatter Plot Graph or Scatter
Diagram.

Applications of Scatter Plot


As already mentioned, a scatter plot is a very useful data visualization technique. A few
applications of Scatter Plots are listed below.
• Correlation Analysis: Scatter plot is useful in the investigation of the correlation
between two different variables. It can be used to find out whether two variables
have a positive correlation, negative correlation or no correlation.
• Outlier Detection: Outliers are data points, which are different from the rest of the
data set. A Scatter Plot is used to bring out these outliers on the surface.
• Cluster Identification: In some cases, scatter plots can help identify clusters or groups
within the data.
UNIT 4 Classification
Logistic Regression:- Logistic regression is a supervised machine learning algorithm used
for classification tasks, where the goal is to predict the probability that an instance belongs to
a given class or not. Logistic regression is a statistical algorithm that analyzes the relationship
between two data factors. This section covers the fundamentals of logistic regression and its types.
Logistic regression is used for binary classification, where we use the sigmoid function, which
takes the independent variables as input and produces a probability value between 0 and 1.
For example, with two classes, Class 0 and Class 1: if the value of the logistic function for an
input is greater than 0.5 (the threshold value), then it belongs to Class 1; otherwise it belongs
to Class 0. It is referred to as regression because it is an extension of linear regression, but it
is mainly used for classification problems.

Key Points:
• Logistic regression predicts the output of a categorical dependent variable.
Therefore, the outcome must be a categorical or discrete value.
• It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the exact
value as 0 and 1, it gives the probabilistic values which lie between 0 and 1.
• In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic
function, which predicts two maximum values (0 or 1)

TYPES
1. Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as “cat”, “dogs”, or “sheep”
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as “low”, “Medium”, or “High”.
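
A hedged sketch of binomial logistic regression with scikit-learn on the library's bundled breast-cancer dataset, chosen only because it is a convenient two-class example; predict_proba returns the sigmoid probabilities discussed above.

# A minimal sketch of binary (binomial) logistic regression.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)       # larger max_iter helps convergence
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))              # accuracy on held-out data
print(clf.predict_proba(X_test[:3]))          # class probabilities in [0, 1]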

Decision Tree
A decision tree is a simple diagram that shows different choices and their possible results,
helping you make decisions easily. This section covers what decision trees are, how they work,
their advantages and disadvantages, and their applications.

Understanding Decision Tree


A decision tree is a graphical representation of different options for solving a problem
and shows how different factors are related. It has a hierarchical tree structure that starts
with one main question at the top, called the root node, which branches out into different
possible outcomes, where:
• Root Node is the starting point that represents the entire dataset.
• Branches: these are the lines that connect nodes and show the flow from one
decision to another.
• Internal Nodes are points where decisions are made based on the input features.
• Leaf Nodes: these are the terminal nodes at the end of branches that represent final
outcomes or predictions.
Now, let’s take an example to understand the decision tree. Imagine you want to decide
whether to drink coffee based on the time of day and how tired you feel. First the tree
checks the time of day. If it is morning, it asks whether you are tired: if you are tired it
suggests drinking coffee, and if not it says there is no need. Similarly, in the afternoon the
tree again asks if you are tired: if you are, it recommends drinking coffee, and if not it
concludes no coffee is needed. A minimal sketch of this logic is given below.
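The following toy function mirrors the coffee decision tree described above; the function name
and rule values are illustrative only.

```python
# A toy hand-written decision tree mirroring the coffee example above.
def should_drink_coffee(time_of_day: str, tired: bool) -> bool:
    # Root node: check the time of day.
    if time_of_day == "morning":
        # Internal node: are you tired? Leaf nodes: coffee if tired, otherwise not.
        return tired
    elif time_of_day == "afternoon":
        return tired
    # Any other time of day: no coffee in this toy example.
    return False

print(should_drink_coffee("morning", tired=True))     # True  -> drink coffee
print(should_drink_coffee("afternoon", tired=False))  # False -> no coffee
```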

Classification of Decision Tree


• Classification trees: these are designed to predict categorical outcomes, meaning they
classify data into different classes. For example, they can determine whether an email
is “spam” or “not spam” based on various features of the email.
• Regression trees: these are used when the target variable is continuous. They predict
numerical values rather than categories. For example, a regression tree can estimate
the price of a house based on its size, location, and other features.

Advantages of Decision Trees


• Simplicity and Interpretability: Decision trees are straightforward and easy to
understand. You can visualize them like a flowchart which makes it simple to see how
decisions are made.
• Versatility: they can be used for different types of tasks and work well for both
classification and regression.
• No Need for Feature Scaling: They don’t require you to normalize or scale your data.
• Handles Non-linear Relationships: It is capable of capturing non-linear relationships
between features and target variables.
Disadvantages of Decision Trees
• Overfitting: overfitting occurs when a decision tree captures noise and details in the
training data and therefore performs poorly on new data.
• Instability: slight variations in the input can lead to significantly different trees and
predictions, which makes the model unreliable.
• Bias towards Features with More Levels: decision trees can become biased towards
features with many categories, focusing too much on them during decision-making.
This can cause the model to miss other important features, leading to less accurate
predictions.

Decision tree induction is a popular technique in data mining because it is easy
to understand and interpret, and it can handle both numerical and categorical data.
Additionally, decision trees can handle large amounts of data, and they can be
updated with new data as it becomes available. However, decision trees can be
prone to overfitting, where the model becomes too complex and does not generalize
well to new data. As a result, data scientists often use techniques such as pruning to
simplify the tree and improve its performance.

Advantages of Decision Tree Induction


1. Easy to understand and interpret: Decision trees are a visual and intuitive model that
can be easily understood by both experts and non-experts.
2. Handle both numerical and categorical data: Decision trees can handle a mix of
numerical and categorical data, which makes them suitable for many different types
of datasets.
3. Can handle large amounts of data: Decision trees can handle large amounts of data
and can be updated with new data as it becomes available.
4. Can be used for both classification and regression tasks: Decision trees can be used
for both classification, where the goal is to predict a discrete outcome, and
regression, where the goal is to predict a continuous outcome.
Disadvantages of Decision Tree Induction
1. Prone to overfitting: Decision trees can become too complex and may not generalize
well to new data. This can lead to poor performance on unseen data.
2. Sensitive to small changes in the data: Decision trees can be sensitive to small
changes in the data, and a small change in the data can result in a significantly
different tree.
3. Biased towards attributes with many levels: Decision trees can be biased towards
attributes with many levels, and may not perform well on attributes with a small
number of levels.
Random forest :-
Random Forest is a powerful and versatile machine learning algorithm that can
handle both classification and regression tasks. It works by creating multiple
decision trees during training and aggregating their results to make a final
prediction. This ensemble method reduces overfitting and improves accuracy.
Key Features of Random Forest
• Handles Missing Data: automatically handles missing values during
training, eliminating the need for manual imputation.
• Feature Importance: the algorithm ranks features based on their importance
in making predictions, offering valuable insights for feature selection and
interpretability.
• Scalability: scales well with large and complex data without significant
performance degradation.
• Versatility: can be applied to both classification tasks (e.g., predicting
categories) and regression tasks (e.g., predicting continuous values).
Advantages of Random Forest
• Random Forest provides very accurate predictions even with large
datasets.
• Random Forest can handle missing data well without compromising
accuracy.
• It does not require normalization or standardization of the dataset.
• Combining multiple decision trees reduces the risk of overfitting.
Limitations of Random Forest
• It can be computationally expensive especially with a large number of
trees.
• It’s harder to interpret the model compared to simpler models like
decision trees.
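A minimal Random Forest sketch, assuming scikit-learn is available; the synthetic dataset is
generated only for illustration and is not part of these notes.

```python
# Minimal Random Forest sketch (assumes scikit-learn is installed).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic classification data, for illustration only.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 decision trees are built on bootstrap samples and their votes aggregated.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
# feature_importances_ gives the feature ranking mentioned above.
print("Feature importances:", model.feature_importances_)
```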

Naïve Bayes Algorithm


The Naïve Bayes algorithm is a probabilistic classifier based on Bayes'
Theorem. It is widely used for classification tasks due to its simplicity and
efficiency. Despite its "naive" assumption of feature independence, it performs
well in various applications such as spam filtering, sentiment analysis, and
document classification. It is called "naive" because it assumes that the
presence of one feature does not affect the other features.
Advantages of Naive Bayes Classifier
• Easy to implement and computationally efficient.
• Effective in cases with a large number of features.
• Performs well even with limited training data.
• It performs well in the presence of categorical features.
• For numerical features, the data is assumed to come from a normal
distribution (as in Gaussian Naive Bayes).
Disadvantages of Naive Bayes Classifier
• Assumes that features are independent, which may not always hold in
real-world data.
• Can be influenced by irrelevant attributes.
• May assign zero probability to unseen events, leading to poor
generalization.
Applications of Naive Bayes Classifier
• Spam Email Filtering: Classifies emails as spam or non-spam based on
features.
• Text Classification: Used in sentiment analysis, document categorization,
and topic classification.
• Medical Diagnosis: Helps in predicting the likelihood of a disease based
on symptoms.
• Credit Scoring: Evaluates creditworthiness of individuals for loan
approval.
• Weather Prediction: Classifies weather conditions based on various
factors.
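A minimal Naive Bayes sketch for the spam-filtering application listed above, assuming
scikit-learn is available; the email snippets and labels are invented purely for illustration.

```python
# Minimal Naive Bayes text-classification sketch (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now", "limited offer click here",       # spam-like
    "meeting at 10 am tomorrow", "please review the report",  # ham-like
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)  # bag-of-words counts

# MultinomialNB treats word counts as conditionally independent given the class.
model = MultinomialNB()
model.fit(X, labels)

test = vectorizer.transform(["free offer, click now"])
print(model.predict(test), model.predict_proba(test))
```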
K-Nearest Neighbours (K-NN) Algorithm
K-Nearest Neighbors (KNN) is a simple way to classify things by looking at what is
nearby. Imagine a streaming service wants to predict whether a new user is likely to
cancel their subscription (churn) based on their age. It checks the ages of its
existing users and whether they churned or stayed. If most of the K users closest in
age to the new user cancelled their subscription, KNN will predict that the new
user might churn too. The key idea is that users with similar ages tend to have
similar behaviors, and KNN uses this closeness to make decisions.
In the k-Nearest Neighbours (k-NN) algorithm k is just a number that
tells the algorithm how many nearby points (neighbours) to look at
when it makes a decision.

An anomaly is something that differs from what is typical, normal, or
expected. It can be an irregularity or an outlier that stands out from the
usual pattern. Anomalies are important because they often signal
unusual or unexpected events, such as errors, fraud, or rare incidents.

Support Vector Machine (SVM):-


Support Vector Machines (SVMs) are a type of supervised machine
learning algorithm used for classification and regression tasks. They are widely
used in various fields, including pattern recognition, image analysis, and natural
language processing.
SVMs work by finding the optimal hyperplane that separates data points into
different classes.
Key Terms
1.Hyperplane
A hyperplane is a decision boundary that separates data points into different
classes in a high-dimensional space. In two-dimensional space, a hyperplane is
simply a line that separates the data points into two classes. In three-
dimensional space, a hyperplane is a plane that separates the data points into
two classes. Similarly, in N-dimensional space, a hyperplane has (N-1)-
dimensions.
It can be used to make predictions on new data points by evaluating which side
of the hyperplane they fall on. Data points on one side of the hyperplane are
classified as belonging to one class, while data points on the other side of the
hyperplane are classified as belonging to another class.
2.Margin
A margin is the distance between the decision boundary (hyperplane) and the
closest data points from each class. The goal of SVMs is to maximize this
margin while minimizing classification errors. A larger margin indicates a
greater degree of confidence in the classification, as it means that there is a
larger gap between the decision boundary and the closest data points from
each class. The margin is a measure of how well-separated the classes are in
feature space. SVMs are designed to find the hyperplane that maximizes this
margin, which is why they are sometimes referred to as maximum-margin
classifiers.

3.Support Vectors
They are the data points that lie closest to the decision boundary (hyperplane)
in a Support Vector Machine (SVM). These data points are important because
they determine the position and orientation of the hyperplane, and thus have a
significant impact on the classification accuracy of the SVM. In fact, SVMs are
named after these support vectors because they “support” or define the
decision boundary. The support vectors are used to calculate the margin, which
is the distance between the hyperplane and the closest data points from each
class. The goal of SVMs is to maximize this margin while minimizing
classification errors.
We have a famous dataset called ‘Iris’. There are four features (columns or
independent variables) in this dataset, but for simplicity we shall only look
at two of them: ‘Petal Length’ and ‘Petal Width’. These points are then
plotted on a 2D plane.
Why do we use Support Vector Machines for Anomaly Detection?
We use Support Vector Machine for anomaly detection because of the
following reasons:
1. Effective for High-Dimensional Data: SVMs perform well in high-
dimensional spaces, making them suitable for datasets with many
features, such as those commonly encountered in anomaly detection
tasks.
2. Robust to Overfitting: SVMs are less prone to overfitting, which is crucial
in anomaly detection where the goal is to generalize well to unseen
anomalies.
3. Optimal Separation: SVMs aim to find the hyperplane that maximally
separates the normal data points from the anomalies, making them
effective in identifying outliers.
4. One-Class SVM: The One-Class SVM variant is specifically designed for
anomaly detection, learning to distinguish normal data points from
outliers without the need for labeled anomalies.
5. Kernel Trick: SVMs can use kernel functions to map data into a higher-
dimensional space, allowing for non-linear separation of anomalies from
normal data.
6. Handling Imbalanced Data: Anomaly detection datasets are often highly
imbalanced, with normal data points outnumbering anomalies. SVMs
can handle this imbalance well.
7. Interpretability: SVMs provide clear decision boundaries, which can help
in interpreting why a particular data point is classified as an anomaly
The following questions are addressed next (a minimal sketch covering the first three
points is given after this list):
• How to train a one-class support vector machine (SVM) model.
• How to predict anomalies from a one-class SVM model.
• How to change the default threshold for anomaly prediction.
• How to visualize the prediction results.
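A minimal one-class SVM sketch, assuming scikit-learn and NumPy are available; the synthetic
data (a dense normal cluster plus a few scattered outliers) and the custom cut-off value are
illustrative assumptions, not part of these notes.

```python
# Minimal one-class SVM anomaly-detection sketch (assumes scikit-learn, NumPy).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # "normal" points
outliers = rng.uniform(low=-6, high=6, size=(10, 2))     # scattered anomalies
X = np.vstack([normal, outliers])

# Train on the data; nu is an upper bound on the fraction of training outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(X)

# predict() returns +1 for inliers and -1 for anomalies (default threshold 0).
pred = model.predict(X)

# Changing the threshold: work with the decision_function() scores directly.
scores = model.decision_function(X)
custom_pred = np.where(scores < -0.1, -1, 1)  # -0.1 is an arbitrary illustrative cut-off

print("anomalies (default threshold):", int(np.sum(pred == -1)))
print("anomalies (custom threshold):", int(np.sum(custom_pred == -1)))
```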

Problems with this method

The main problem here is the heavy load of calculations to be performed.

Above, the points were taken to be centred on the origin in a concentric manner.
Suppose the points were not concentric but could still be separated by the RBF.
We would then need to take each point in the dataset as a reference in turn and
find the distance of every other point with respect to that point.

This means calculating n*(n-1)/2 distances (n-1 other points with respect to each
of the n points, but once the distance 1–2 is calculated, the distance 2–1 does not
need to be calculated again).

Taking the time complexity of a square root as O(log n) and of powers and
additions as O(1), performing all n*(n-1)/2 calculations would require
O(n² log n) time.
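A small sketch of this pairwise-distance count, assuming NumPy is available; the random
points are for illustration only.

```python
# Counting the n*(n-1)/2 pairwise distances discussed above (assumes NumPy).
import numpy as np

rng = np.random.default_rng(0)
n = 100
points = rng.normal(size=(n, 2))

count = 0
for i in range(n):
    for j in range(i + 1, n):  # each unordered pair is computed exactly once
        _ = np.sqrt(np.sum((points[i] - points[j]) ** 2))
        count += 1

print(count, n * (n - 1) // 2)  # both equal 4950 for n = 100
```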

Numerical example for Linear SVM:

Q. Positively labelled data points: (3,1), (3,-1), (6,1), (6,-1). Negatively labelled
data points: (1,0), (0,1), (0,-1), (-1,0).
Solution: for all negatively labelled points the output is -1 and for all positively
labelled points the output is +1.
Each point is augmented with a bias term by appending a 1:
(3,1)  -> (3,1,1)   | (3,-1) -> (3,-1,1)
(6,1)  -> (6,1,1)   | (6,-1) -> (6,-1,1)
(1,0)  -> (1,0,1)   | (0,1)  -> (0,1,1)
(0,-1) -> (0,-1,1)  | (-1,0) -> (-1,0,1)
From the graph we can see that one negative point and two positive points form the
support vectors. Denote their augmented forms by
s1' = (1,0,1), s2' = (3,1,1), s3' = (3,-1,1)
with multipliers α1, α2, α3.
Generalized equations:
α1*(s1'·s1') + α2*(s2'·s1') + α3*(s3'·s1') = -1   → 1
α1*(s1'·s2') + α2*(s2'·s2') + α3*(s3'·s2') = +1   → 2
α1*(s1'·s3') + α2*(s2'·s3') + α3*(s3'·s3') = +1   → 3

On substituting the points:
eq 1 => α1*(1,0,1)·(1,0,1) + α2*(3,1,1)·(1,0,1) + α3*(3,-1,1)·(1,0,1) = -1
eq 2 => α1*(1,0,1)·(3,1,1) + α2*(3,1,1)·(3,1,1) + α3*(3,-1,1)·(3,1,1) = 1
eq 3 => α1*(1,0,1)·(3,-1,1) + α2*(3,1,1)·(3,-1,1) + α3*(3,-1,1)·(3,-1,1) = 1
which gives
2α1 + 4α2 + 4α3 = -1    → 4
4α1 + 11α2 + 9α3 = 1    → 5
4α1 + 9α2 + 11α3 = 1    → 6
On solving equations 4, 5 and 6 we get
α1 = -3.5, α2 = 0.75 and α3 = 0.75
To find the hyperplane:
w' = Σ αi * si' = -3.5*(1,0,1) + 0.75*(3,1,1) + 0.75*(3,-1,1) = (1, 0, -2)
The first two components give the weight vector and the last component gives the bias:
w = (1,0) and b = -2.
The decision boundary w·x + b = 0 is therefore x1 - 2 = 0, i.e. the vertical line x1 = 2,
which separates the two classes with the maximum margin.
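A quick check of the worked example, assuming scikit-learn is available; a linear SVM with a
very large C approximates the hard-margin solution and should recover w ≈ (1, 0) and b ≈ -2.

```python
# Verifying the linear SVM example above with scikit-learn (assumed installed).
import numpy as np
from sklearn.svm import SVC

X = np.array([[3, 1], [3, -1], [6, 1], [6, -1],   # positively labelled points
              [1, 0], [0, 1], [0, -1], [-1, 0]])  # negatively labelled points
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

model = SVC(kernel="linear", C=1e6)  # very large C approximates a hard margin
model.fit(X, y)

print("w =", model.coef_[0])         # approximately [1. 0.]
print("b =", model.intercept_[0])    # approximately -2.0
print("support vectors:\n", model.support_vectors_)
```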

What is kNN
kNN Characteristics
• A non-parametric classification method, meaning that no parameters of
the population distribution are estimated
• It is a supervised ML algorithm, meaning we need data with known
classes.
• It is a type of lazy learning, because it doesn't create a model.
• It predicts directly based on training data.

Bivariate vs Univariate Analysis
• In univariate analysis we look at only one independent variable together with the
dependent variable; this limits accuracy.
• In bivariate analysis we look at two variables at the same time.
• kNN takes data with known classes and makes a prediction for a new data
point.
• For classification we prefer representing our data in a higher-dimensional
space.

In a Nutshell
1. Determine the distance between the new observation and all the data
points in the training set.
2. Sort the distances.
3. Identify K closest neighbours.
4. Determine the class of the new observation by a majority vote among the k
nearest neighbours (a minimal sketch of these steps is given below).
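A minimal from-scratch sketch of the four steps above, assuming NumPy is available; the tiny
age/churn training set is invented for illustration.

```python
# From-scratch kNN following the four steps listed above (assumes NumPy).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Distance between the new observation and every training point.
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # 2. and 3. Sort the distances and keep the K closest neighbours.
    nearest = np.argsort(distances)[:k]
    # 4. Majority vote among the K nearest neighbours.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[25], [30], [35], [45], [50], [55]])   # ages (illustrative)
y_train = np.array(["churn", "churn", "churn", "stay", "stay", "stay"])

print(knn_predict(X_train, y_train, np.array([28]), k=3))  # -> "churn"
```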

Finding the Distance

Euclidean Distance

The Euclidean distance in two dimensions can be seen as applying Pythagoras'
theorem to the right-angled triangle formed by the two points and the axes:
d = √((x2 − x1)² + (y2 − y1)²)

Evaluation - LOOCV
• One way to validate kNN predictions is to use leave-one-out cross
validation (LOOCV) on the existing data with a known class.
• We can then use accuracy for evaluation.

Limitations
Big Datasets
• If we have very big data sets, the LOOCV might take a long time to run.
• Instead of using LOOCV, we may use the hold-out method, where we split
the dataset into training and test sets.
Dataset Imbalance
• When groups are of equal size, kNN is unbiased; when one class is much larger
than the others, the majority class tends to dominate the vote and bias the
predictions towards it.

Cluster Analysis
Introduction
Cluster Analysis is a technique used in Data Mining to group a set of data objects
into clusters. A cluster is a collection of data objects that are similar to each other
within the same cluster but different from objects in other clusters. This method
is useful for exploring data, identifying patterns, and classifying information
without prior knowledge of predefined categories.

Clustering is considered an unsupervised learning method because it does not
require labeled data. Instead, it identifies inherent structures and relationships
within the data based on similarity or dissimilarity measures.
Applications of Clustering
Clustering is widely used in various fields, including:
1. Pattern Recognition
• Identifies underlying patterns in datasets.
• Used in AI, machine learning, and fraud detection.
2. Spatial Data Analysis
• Used in Geographic Information Systems (GIS) to create thematic maps by
clustering feature spaces.
• Helps in detecting spatial clusters for applications like urban planning and
environmental monitoring.
3. Image Processing
• Helps in image segmentation by grouping similar pixels together.
• Used in facial recognition, medical imaging, and satellite image analysis.
4. Economic Science and Market Research
• Helps in customer segmentation to develop targeted marketing strategies.
• Used to analyze customer buying behavior and improve business decision-
making.
5. Web-based Applications
• Document Classification: Groups similar documents together for easier
retrieval.
• Web Log Data Analysis: Helps discover groups of users with similar
browsing behaviors.
6. Other Real-World Applications
• Marketing: Identifies distinct customer groups to develop targeted
advertising campaigns.
• Land Use Identification: Helps categorize land based on usage and
geographical data.
• Insurance: Groups policyholders based on risk levels to adjust insurance
premiums.
• City Planning: Classifies residential areas based on house type, location,
and value.
• Earthquake Studies: Clusters earthquake epicenters to analyze seismic
Patterns.

Hierarchical Clustering
Hierarchical clustering creates a hierarchy of clusters from the finest level (individual
points) to a single large cluster. The hierarchy is visualized using a dendrogram.

Types of Hierarchical Clustering


There are two types of hierarchical clustering:
A. Agglomerative Hierarchical Clustering (AHC) – Bottom-Up Approach
•Each data point starts as an individual cluster.
•The closest clusters are iteratively merged based on a distance metric.
•The process continues until all points belong to a single cluster.
•Common linkage criteria:
1. Single-Linkage: Merges clusters based on the closest point.
2. Complete-Linkage: Merges clusters based on the farthest point.
3. Average-Linkage: Uses the average distance between clusters.
4. Ward’s Method: Minimizes variance within clusters.
Advantages & Disadvantages of AHC
Advantages
✅ No need to specify k beforehand.
✅ Produces a hierarchy of clusters (dendrogram).
✅ Can handle non-spherical clusters (unlike K-Means).
Disadvantages
❌ Computationally expensive, O(n²) – slow for large datasets.
❌ Once merged, clusters cannot be split (no backtracking).
❌ Sensitive to noisy data and outliers.
Applications of Agglomerative Hierarchical Clustering
📌 Bioinformatics – gene sequencing, protein classification
📌 Market Segmentation – customer group analysis
📌 Image Processing – pattern recognition
📌 Social Network Analysis – community detection

B. Divisive Hierarchical Clustering (DHC) – Top-Down Approach


•Starts with all data points in a single cluster.
•The cluster is recursively split into smaller groups.
•The process continues until each data point is its own cluster.
•Less commonly used than Agglomerative clustering.
Advantages of Hierarchical Clustering
✅ No need to specify the number of clusters beforehand.
✅ Provides a dendrogram to help determine the optimal number of clusters.
✅ Works well for small datasets.
✅ Can capture nested cluster structures.
Disadvantages of Hierarchical Clustering
❌ Computationally expensive – O(n²) time complexity.
❌ Cannot handle large datasets efficiently.
❌ Once a cluster is merged/split, it cannot be undone.
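A minimal agglomerative clustering sketch, assuming SciPy and Matplotlib are available; the
six 2-D points are invented for illustration and Ward's method is used as the linkage criterion.

```python
# Minimal agglomerative hierarchical clustering sketch (assumes SciPy, Matplotlib).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5.2], [9, 1], [9.2, 1.1]])

# Ward's method merges the pair of clusters that minimizes within-cluster variance.
Z = linkage(X, method="ward")

# Cut the hierarchy into 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# The dendrogram visualizes the order and distance of the merges.
dendrogram(Z)
plt.show()
```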

2.Non-Hierarchical Clustering (Partitioning Methods)


Non-hierarchical clustering divides data into a fixed number of clusters (k) based
on similarity.
Types of Non-Hierarchical Clustering
1. Single-Pass Methods
• Each data point is assigned to a cluster in a single pass.
• Fast but less accurate.
2. Reallocation Methods (Iterative Methods)
• Data points are moved between clusters to optimize the clustering.
• Examples:
  - K-Means Clustering
  - K-Medoids Clustering (PAM)
  - Fuzzy C-Means (Soft Clustering)

Partitioning Clustering Algorithms


Introduction
Partitioning algorithms are a class of clustering techniques that divide a dataset D of n
objects into k distinct and non-overlapping clusters. These algorithms work by
optimizing an objective function, usually based on minimizing intra-cluster distances
while maximizing inter-cluster differences.
Partitioning methods attempt to construct an optimal clustering by dividing the dataset
into k clusters, where:
•Each object belongs to exactly one cluster
•The clustering structure optimizes a chosen criterion
Since finding the global optimum requires examining all possible partitions, which is
computationally infeasible, heuristic approaches like k-means and k-medoids are
commonly used.

Types of Partitioning Algorithms


The two primary partitioning clustering algorithms are:
1. K-Means Clustering
• Each cluster is represented by its centroid (mean value of all points in the
cluster).
2. K-Medoids Clustering (PAM - Partitioning Around Medoids)
• Each cluster is represented by an actual data point called a medoid, which is the
most centrally located object in the cluster.
Concept
K-means clustering is a simple and efficient method that partitions a dataset into k
clusters based on their mean (centroid). It is one of the most widely used clustering
techniques in machine learning and data mining.
Steps of the K-Means Algorithm
1. Initialize k centroids (randomly select k points from the dataset).
2. Assign each object to the nearest centroid based on Euclidean distance.
3. Recalculate the centroids by computing the mean of all points in each cluster.
4. Repeat steps 2 and 3 until no objects change clusters (i.e., the centroids no longer
move).
Example of K-Means
Consider the following dataset of points:
Object | X | Y
A      | 2 | 3
B      | 5 | 4
C      | 9 | 6
D      | 4 | 7
E      | 8 | 1
1. Suppose we choose k = 2 (two clusters) and randomly initialize centroids.
2. Assign points to the nearest centroid.
3. Compute new centroids.
4. Repeat until assignments remain unchanged (a minimal sketch using this data is
given below).
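A minimal K-Means sketch on the five points above, assuming scikit-learn is available; k = 2
as in the example, and the multiple restarts (n_init) only mitigate the sensitivity to initial
centroids noted below.

```python
# Minimal K-Means sketch using the table above (assumes scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

# Objects A-E from the example table.
X = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1]])

# k = 2 clusters; n_init=10 restarts reduce dependence on the initial centroids.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster labels:", labels)              # cluster index for each of A-E
print("Centroids:", kmeans.cluster_centers_)  # mean of the points in each cluster
```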
Strengths of K-Means
•Efficient for large datasets: Computational complexity is O(tkn), where:
1. t= number of iterations
2. k = number of clusters
3. n = number of data points
•Works well with spherical clusters (clusters with similar density and size).
•Easy to implement and interpret.

Weaknesses of K-Means
•Requires predefining the number of clusters (k).
•Sensitive to initial centroid selection (different initializations can yield different
results).
•Not suitable for categorical data (since mean values are undefined for categorical
attributes).
•Sensitive to noise and outliers, which can distort the centroids.
•Fails with non-convex clusters (e.g., clusters with irregular shapes).

Notes:-A dendrogram is a tree-like diagram that illustrates how clusters are merged
(AGNES) or split (DIANA)

Density-based methods:
To discover clusters with arbitrary shape, density-based clustering methods
have been developed. These typically regard clusters as dense regions of
objects in the data space which are separated by regions of low density
(representing noise).
DBSCAN: A density-based clustering method based on connected
regions with sufficiently high density
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is
a density- based clustering algorithm. The algorithm grows regions with
sufficiently high density into clusters, and discovers clusters of arbitrary
shape in spatial databases with noise. It defines a cluster as a maximal set of
density-connected points.

➢ The basic ideas of density-based clustering involve a number of new
definitions. We present these definitions intuitively, and then follow up with
an example.

✓ The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object.

✓ If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the
object is called a core object.

✓ Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is within
the ε-neighborhood of q, and q is a core object.

✓ An object p is density-reachable from object q with respect to ε and MinPts in a set of objects, D, if there
is a chain of objects p1, ..., pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi
with respect to ε and MinPts, for 1 ≤ i < n, pi ∈ D.

✓ An object p is density-connected to object q with respect to ε and MinPts in a set of objects, D, if there is
an object o ∈ D such that both p and q are density-reachable from o with respect to ε and MinPts.

Density-reachability is the transitive closure of direct density-reachability, and this relationship is
asymmetric. Only core objects are mutually density-reachable. Density-connectivity, however, is a
symmetric relation.
Example 8.5 Consider Figure 8.9 for a given ε represented by the radius of the circles, and, say, let
MinPts = 3.

➢ Based on the above definitions (density-reachability and density-connectivity in
density-based clustering):
✓ Of the labeled points, M, P, O, and R are core objects since each is in an ε-
neighborhood containing at least three points.
✓ M is directly density-reachable from P, and Q is directly density-reachable from M.
✓ Based on the previous observation, Q is (indirectly) density-reachable from P.
However, P is not density-reachable from Q. Similarly, R and S are density-reachable
from O.
✓ O, R and S are all density-connected.

➢ A density-based cluster is a set of density-connected objects that is maximal with
respect to density-reachability. Every object not contained in any cluster is
considered to be noise.
➢ “How does DBSCAN find clusters?” DBSCAN checks the ε-neighborhood of each
point in the database. If the ε-neighborhood of a point p contains more than MinPts
points, a new cluster with p as a core object is created. DBSCAN then iteratively
collects directly density-reachable objects from these core objects, which may involve
the merging of a few density-reachable clusters.
➢ The process terminates when no new point can be added to any cluster.
In this process, for an object p0 in N (the set of candidate objects collected so far)
that carries the label “unvisited,” DBSCAN marks it as “visited” and checks its
ε-neighborhood. If the ε-neighborhood of p0 has at least MinPts objects, those
objects are added to N. DBSCAN continues adding objects to the current cluster C
until C can no longer be expanded, that is, until N is empty. At this point, cluster C
is completed and is output.
To find the next cluster, DBSCAN randomly selects an unvisited object from the
remaining ones. The clustering process continues until all objects are visited. The
pseudocode of the DBSCAN algorithm is given in Figure 10.15.
The computational complexity of DBSCAN is O(n log n) when a spatial index is
used, where n is the number of database objects. The algorithm is sensitive to the
user-defined parameters ε and MinPts.
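A minimal DBSCAN sketch, assuming scikit-learn is available; the two-moons data is synthetic
and chosen because its clusters have arbitrary (non-spherical) shape, and the eps / min_samples
values are illustrative.

```python
# Minimal DBSCAN sketch (assumes scikit-learn is installed).
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Synthetic data with two crescent-shaped clusters plus some noise.
X, _ = make_moons(n_samples=300, noise=0.07, random_state=0)

# eps is the radius of the ε-neighborhood; min_samples plays the role of MinPts.
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# Points labelled -1 are treated as noise, as described above.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:", list(labels).count(-1))
```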
