ml notes
The basic design issues and approaches to machine learning are illustrated by designing a program to learn to play checkers, with the goal of entering it in the world checkers tournament. The design involves the following choices:
1.Choosing the Training Experience
2.Choosing the Target Function
3.Choosing a Representation for the Target Function
4.Choosing a Function Approximation Algorithm
1.Estimating training values
2.Adjusting the weights
5.The Final Design
•The first design choice is to choose the type of training experience from which the
system will learn.
•The type of training experience available can have a significant impact on success or
failure of the learner.
There are three attributes that impact the success or failure of the learner:
1.Whether the training experience provides direct or indirect feedback regarding the
choices made by the performance system.
For example, in checkers game:
In learning to play checkers, the system might learn from direct training examples consisting of individual checkers board states and the correct move for each. Alternatively, it might have available only indirect training examples consisting of the move sequences and final outcomes of various games played. In the indirect case, information about the correctness of specific moves early in the game must be inferred indirectly from the fact that the game was eventually won or lost.
Here the learner faces an additional problem of credit assignment, or determining the
degree to which each move in the sequence deserves credit or blame for the final
outcome. Credit assignment can be a particularly difficult problem because the game can
be lost even when early moves are optimal, if these are followed later by poor moves.
Hence, learning from direct training feedback is typically easier than learning from
indirect feedback.
2.The degree to which the learner controls the sequence of training examples
For example, in checkers game:
The learner might depend on the teacher to select informative board states and to provide the correct move for each.
Alternatively, the learner might itself propose board states that it finds particularly
confusing and ask the teacher for the correct move.
The learner may have complete control over both the board states and (indirect) training
classifications, as it does when it learns by playing against itself with
no teacher present.
3.How well it represents the distribution of examples over which the final system
performance P must be measured
For example, in checkers game:
In checkers learning scenario, the performance metric P is the percent of games the
system wins in the world tournament.
If its training experience E consists only of games played against itself, there is a danger
that this training experience might not be fully representative of the distribution of
situations over which it will later be tested.
In practice, it is often necessary to learn from a distribution of examples that is somewhat different from the one on which the final system will be evaluated; such situations are problematic because mastery of one distribution of examples does not necessarily ensure strong performance over another.
2.Choosing the Target Function
The next design choice is to determine exactly what type of knowledge will be learned and how this will be used by the performance program.
1.A natural choice is a function that chooses the best move for any given board state:
ChooseMove : B → M
which indicates that this function accepts as input any board from the set of legal board states B and produces as output some move from the set of legal moves M.
ChooseMove is one choice for the target function in the checkers example, but this function turns out to be very difficult to learn given the kind of indirect training experience available to our system.
2.An alternative target function is an evaluation function that assigns a numerical score
to any given board state
Let the target function be V, with the notation
V : B → R,
which denotes that V maps any legal board state from the set B to some real value. We intend for this target function V to assign higher scores to better board states. If the system can successfully learn such a target function V, then it can easily use it to select the best move from any current board position.
Let us define the target value V(b) for an arbitrary board state b in B, as follows:
•If b is a final board state that is won, then V(b) = 100
•If b is a final board state that is lost, then V(b) = -100
•If b is a final board state that is drawn, then V(b) = 0
•If b is not a final state in the game, then V(b) = V(b'),
where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game.
3.Choosing a Representation for the Target Function
Let us choose a simple representation: for any given board state, the learned function V̂ will be calculated as a linear combination of the following board features:
• x1: the number of black pieces on the board
• x2: the number of red pieces on the board
• x3: the number of black kings on the board
• x4: the number of red kings on the board
• x5: the number of black pieces threatened by red (i.e., which can be captured on red's
next turn)
• x6: the number of red pieces threatened by black
Thus, the learning program will represent V̂(b) as a linear function of the form
V̂(b) = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6
Where,
•w0 through w6 are numerical coefficients, or weights, to be chosen by the learning
algorithm.
•Learned values for the weights w1 through w6 will determine the relative importance
of the various board features in determining the value of the board
•The weight w0 will provide an additive constant to the board value
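To make this representation concrete, here is a minimal Python sketch (added for illustration, not from the original notes) that computes V̂(b) as the weighted sum above; the feature values are hypothetical, since the notes do not specify a board encoding.
```python
# Minimal sketch of the linear evaluation function V_hat(b) = w0 + sum_i wi * xi.

def v_hat(features, weights):
    """Linear evaluation: weights = [w0, w1, ..., w6], features = [x1, ..., x6]."""
    score = weights[0]                      # w0: additive constant
    for w_i, x_i in zip(weights[1:], features):
        score += w_i * x_i                  # wi * xi for each board feature
    return score

# Example: x1..x6 = black pieces, red pieces, black kings, red kings,
# black pieces threatened, red pieces threatened (illustrative values).
features = [12, 12, 0, 0, 1, 2]
weights = [0.0, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]   # arbitrary illustrative weights
print(v_hat(features, weights))
```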
4.Choosing a Function Approximation Algorithm
To learn the target function V̂, the system must:
1.Derive training examples from the indirect training experience available to the learner
2.Adjust the weights wi to best fit these training examples
1.Estimating training values
A simple approach for estimating training values for intermediate board states is to assign the training value Vtrain(b) for any intermediate board state b to be V̂(Successor(b)),
Where,
•V̂ is the learner's current approximation to V
•Successor(b) denotes the next board state following b for which it is again the program's turn to move
Rule for estimating training values:
Vtrain(b) ← V̂(Successor(b))
2.Adjusting the weights
LMS weight update rule: For each training example (b, Vtrain(b)):
•Use the current weights to calculate V̂(b)
•For each weight wi, update it as
wi ← wi + η · (Vtrain(b) − V̂(b)) · xi
Here η is a small constant (e.g., 0.1) that moderates the size of the weight update.
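A minimal sketch of this LMS update in Python (illustrative only, not the notes' original program); the feature values and v_hat helper are assumptions for the example.
```python
# Minimal LMS weight-update sketch for one training example (b, Vtrain(b)).
# weights = [w0, ..., w6], features = [x1, ..., x6], eta moderates the step size.

def v_hat(features, weights):
    # Linear evaluation V_hat(b) = w0 + sum_i wi * xi
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, v_train, eta=0.1):
    error = v_train - v_hat(features, weights)          # Vtrain(b) - V_hat(b)
    new_weights = [weights[0] + eta * error]            # w0 update (x0 = 1)
    new_weights += [w + eta * error * x for w, x in zip(weights[1:], features)]
    return new_weights

weights = [0.0] * 7
weights = lms_update(weights, [12, 12, 0, 0, 1, 2], v_train=0.0)
print(weights)
```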
5.The Final Design
1.The Performance System is the module that must solve the given performance task by using the learned target function(s). It takes an instance of a new problem (new game) as input and produces a trace of its solution (game history) as output.
2.The Critic takes as input the history or trace of the game and produces as output a set of training examples of the target function.
3.The Generalizer takes as input the training examples and produces an output hypothesis that is its estimate of the target function. It generalizes from the specific training examples, hypothesizing a general function that covers these examples and other cases beyond the training examples.
4.The Experiment Generator takes as input the current hypothesis and outputs a new problem (i.e., initial board state) for the Performance System to explore. Its role is to pick new practice problems that will maximize the learning rate of the overall system.
The field of machine learning, and much of this book, is concerned with answering questions such as the following:
•What algorithms exist for learning general target functions from specific training examples? In what settings will particular algorithms converge to the desired function, given sufficient training data? Which algorithms perform best for which types of problems and representations?
•How much training data is sufficient? What general bounds can be found to relate the confidence in learned hypotheses to the amount of training experience and the character of the learner's hypothesis space?
•When and how can prior knowledge held by the learner guide the process of generalizing from examples? Can prior knowledge be helpful even when it is only approximately correct?
•What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?
•What is the best way to reduce the learning task to one or more function approximation problems? Put another way, what specific functions should the system attempt to learn?
•How can the learner automatically alter its representation to improve its ability to represent and learn the target function?
ASSOCIATION RULE LEARNING
•Association rule learning is an unsupervised learning technique that checks for the dependency of one data item on another data item and maps them accordingly, so that the relationship can be exploited.
•Market basket analysis is a well-known application of this technique, used by big retailers to discover associations between items that customers buy together.
•Association rule learning works on the concept of an If-Then statement, such as If A then B.
•Here the If element is called the antecedent, and the Then statement is called the consequent.
•These types of relationships, where we can find some association or relation between two items, are expressed as association rules.
SUPERVISED LEARNING
Supervised learning is the type of machine learning in which machines are trained using well-"labelled" training data, and on the basis of that data, machines predict the output. Labelled data means that some input data is already tagged with the correct output. In supervised learning, the training data provided to the machines works as a supervisor that teaches the machine to predict the output correctly: both the input data and the corresponding correct output are supplied to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y). Real-world applications include risk assessment, image classification, fraud detection, spam filtering, etc.
In supervised learning, the model learns about each type of data during training. Once the training process is completed, the model is tested on test data (a held-out subset of the dataset), and then it predicts the output. The working of supervised learning can be easily understood by the following example: suppose we have a dataset of different types of shapes, which includes square, rectangle, triangle, and polygon. The first step is that we need to train the model for each shape:
• If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
• If the given shape has three sides, then it will be labelled as a Triangle.
• If the given shape has six equal sides, then it will be labelled as a Hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape. The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides and predicts the output.
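As a rough illustration of this workflow (not part of the original notes), the sketch below trains a classifier on a hypothetical labelled "shapes" dataset whose features are the number of sides and an all-sides-equal flag; scikit-learn is assumed to be available.
```python
# Sketch: supervised learning on a hypothetical labelled "shapes" dataset.
from sklearn.tree import DecisionTreeClassifier

# Features: [number_of_sides, all_sides_equal (1/0)]; labels are the shape names.
X_train = [[4, 1], [4, 0], [3, 0], [3, 1], [6, 1]]
y_train = ["square", "rectangle", "triangle", "triangle", "hexagon"]

model = DecisionTreeClassifier().fit(X_train, y_train)   # training phase

# Testing phase: the model predicts the label of an unseen shape.
print(model.predict([[4, 1]]))   # expected: ['square']
```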
Supervised learning can be further divided into two types of problems:
1) Regression
2) Classification
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the output variable. Regression is used for the prediction of continuous variables, such as weather forecasting, market trends, etc. Below are some popular regression algorithms that come under supervised learning:
1. Linear Regression
2. Regression Trees
3. Non-Linear Regression
4. Bayesian Linear Regression
5. Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means there are two classes, such as Yes-No, Male-Female, True-False, etc. Popular classification algorithms include:
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
Advantages of supervised learning:
1. With the help of supervised learning, the model can predict the output on the basis of prior experience.
2. In supervised learning, we can have an exact idea about the classes of objects.
3. Supervised learning models help us solve various real-world problems, such as fraud detection, spam filtering, etc.
Disadvantages of supervised learning:
1. Supervised learning models are not suitable for handling complex tasks.
2. Supervised learning cannot predict the correct output if the test data is different from the training dataset.
3. Training requires a lot of computation time.
4. In supervised learning, we need enough knowledge about the classes of objects.
Regression and Classification are both supervised learning algorithms. Both are used for prediction in machine learning and work with labelled datasets, but they differ in how they are used for different machine learning problems: regression algorithms are used to predict continuous values such as price, salary, age, etc., while classification algorithms are used to predict/classify discrete values such as Male or Female, True or False, Spam or Not Spam, etc.
Classification:
In classification, a computer program is trained on the training dataset and, based on that training, it categorizes the data into different classes. The task of the classification algorithm is to find the mapping function that maps the input (x) to the discrete output (y). Example: spam detection, where the model identifies whether an email is spam or not. If the email is spam, then it is moved to the Spam folder.
Regression:
In regression, the algorithm is trained on the training dataset and learns a mapping function that maps the input (x) to a continuous output (y); it is used to predict continuous values such as price, salary, or age.
Unsupervised Learning:
Unsupervised learning is a branch of machine learning where the model is trained on
unlabeled data, meaning that it doesn't receive explicit input-output pairs. Instead, it must
discover the underlying structure or patterns in the data by itself. This is in contrast to
supervised learning, where the model is provided with labeled data to learn from.
It is commonly used for tasks such as clustering, association rule mining, dimensionality reduction, and density modeling.
Reinforcement Learning
Reinforcement Learning (RL) is a branch of machine learning that focuses on how agents can learn to
make decisions through trial and error to maximize cumulative rewards. RL allows machines to learn
by interacting with an environment and receiving feedback based on their actions. This feedback
comes in the form of rewards or penalties.
Reinforcement Learning revolves around the idea that an agent (the learner or decision-maker)
interacts with an environment to achieve a goal. The agent performs actions and receives feedback
to optimize its decision-making over time.
Key terms:
• Agent: the learner or decision-maker that takes actions.
• Environment: everything the agent interacts with while pursuing its goal.
• Action: a choice made by the agent at a given step.
• Reward: the feedback or result from the environment based on the agent’s action.
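To make the agent-environment loop concrete, here is a tiny self-contained Python sketch (an illustration added here, not from the notes): an agent learns by trial and error which of two actions yields the higher average reward.
```python
# Tiny reinforcement-learning-style loop: the agent tries actions, receives rewards,
# and keeps running estimates of each action's value (a simple bandit-style example).
import random

true_rewards = {"left": 0.2, "right": 0.8}          # hidden environment payoffs
estimates = {"left": 0.0, "right": 0.0}
counts = {"left": 0, "right": 0}

for step in range(1000):
    # Exploration vs. exploitation: mostly pick the best-known action, sometimes explore.
    if random.random() < 0.1:
        action = random.choice(list(estimates))
    else:
        action = max(estimates, key=estimates.get)
    reward = 1.0 if random.random() < true_rewards[action] else 0.0   # feedback
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]  # update estimate

print(estimates)   # the estimate for "right" should approach 0.8
```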
DATA PREPROCESSING
Data preprocessing is the process of evaluating, filtering, manipulating, and encoding data so that a
machine learning algorithm can understand it and use the resulting output. The major goal of data
preprocessing is to eliminate data issues such as missing values, improve data quality, and make the
data useful for machine learning purposes.
The majority of real-world datasets for machine learning are highly susceptible to being missing, inconsistent, and noisy due to their heterogeneous origin. Applying data mining algorithms to this noisy data would not give quality results, as they would fail to identify patterns effectively. Data preprocessing is, therefore, important to improve the overall data quality.
•Duplicate or missing values may give an incorrect view of the overall statistics of the data.
•Outliers and inconsistent data points often tend to disturb the model’s overall learning, leading to inaccurate predictions.
Quality decisions must be based on quality data. Data Preprocessing is important to get this
quality data, without which it would just be a Garbage In, Garbage Out scenario.
Here are the reasons why data preprocessing is so important for machine learning
projects:
It Improves Data Quality:-
Data preprocessing is the fast track to improving data quality since many of its steps mirror activities
you’ll find in any data quality management process, such as data cleansing, data profiling, data
integration, and more.
It Normalizes and Scales Data:-
Dependent and independent variables often change on separate scales, or one changes linearly while another changes exponentially. Salary, for example, might be a multiple-figure number, whereas age is expressed in double digits. Normalizing and scaling modify the data in a way that allows computers to extract a meaningful relationship between these variables.
It Eliminates Duplicate Records:-
When two records appear to repeat, an algorithm must identify whether the same metric was captured twice or whether the data reflects separate occurrences. In rare circumstances, a record may have minor discrepancies due to an erroneously reported field. Techniques for finding, deleting, or connecting duplicates help to address such data quality issues automatically.
It Handles Outliers:-
Preprocessing also identifies data points that deviate sharply from the rest of the data so that they can be examined, corrected, or removed before model training.
It Reduces Dimensionality:-
Data practitioners sometimes need to merge many data sources to construct a new machine learning model. Principal component analysis, for example, is an important technique for lowering the number of dimensions in the training data set and producing a more efficient representation.
It Enables Feature Engineering:-
Preprocessing often entails developing new features or modifying existing ones to better capture the underlying problem and enhance model performance. This might include encoding categorical variables, developing interaction terms, and extracting pertinent information from text or timestamps.
Now, let's discuss in more depth the four main stages of data preprocessing: data cleaning, data integration, data transformation, and data reduction.
Data Cleaning:-
Data Cleaning is particularly done as part of data preprocessing to clean the data by filling
missing values, smoothing the noisy data, resolving the inconsistency, and removing outliers.
1. Missing values
Here are a few ways to solve this issue:
• Ignore the tuples (rows): This method should be considered when the dataset is huge and numerous missing values are present within a tuple.
• Fill in the missing values: There are many methods to achieve this, such as filling in the values manually, predicting the missing values using a regression method, or using numerical methods like the attribute mean.
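For instance, a short sketch of both approaches with pandas (dropping incomplete rows vs. filling with the attribute mean); the column names and values are made up for illustration.
```python
# Sketch: two common ways to handle missing values in a DataFrame.
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 31, 40], "salary": [50000, 62000, np.nan, 58000]})

dropped = df.dropna()                              # option 1: ignore (drop) incomplete rows
filled = df.fillna(df.mean(numeric_only=True))     # option 2: fill with the attribute mean

print(filled)
```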
2. Noisy Data
It involves removing random error or variance in a measured variable. It can be done with the following techniques:
•Binning
This technique works on sorted data values to smoothen any noise present in them. The data is divided into equal-sized bins, and each bin/bucket is dealt with independently. All data in a bin can then be smoothed by its mean, median, or boundary values.
•Regression
This data mining technique is generally used for prediction. It helps to smoothen noise by fitting all the data points to a regression function. The linear regression equation is used if there is only one independent attribute; otherwise, multiple linear regression is used.
•Clustering
This involves the creation of groups/clusters from data having similar values. The values that don't lie in any cluster can be treated as noisy data and removed.
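As a small illustration of smoothing by bin means (an example added here, assuming pandas; the values are made up), each value is replaced by the mean of its equal-frequency bin.
```python
# Sketch: smoothing noisy values by bin means (equal-frequency binning).
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = pd.qcut(values, q=3)                          # split sorted data into 3 equal-sized bins
smoothed = values.groupby(bins).transform("mean")    # replace each value by its bin mean
print(smoothed.tolist())
```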
3. Removing outliers
Clustering techniques group together similar data points. The tuples that lie outside the clusters can be treated as outliers and removed or inspected.
Data Integration:-
Data Integration is one of the data preprocessing steps used to merge data present in multiple sources into a single larger data store, like a data warehouse.
Data Integration is needed especially when we aim to solve a real-world scenario like detecting the presence of nodules from CT scan images. Here the only option is to integrate the images from multiple sources into a single, larger database.
•The first step in Data Preprocessing is to understand your data. Just looking at your dataset can give you an intuition of what things you need to focus on.
•Use statistical methods or pre-built libraries to visualize the dataset and get a clear picture of how your data looks.
•Summarize your data in terms of the number of duplicates, missing values, and outliers present in it.
•Drop the fields you think have no use for the modeling or are closely related to other attributes; reducing dimensionality in this way is an important part of Data Preprocessing.
•Do some feature engineering and figure out which attributes contribute most towards model training.
Data Transformation
One of the most important stages in the preparation phase is data transformation, which changes
data from one format to another. Some algorithms require that the input data be changed – if you
fail to finish this process, you may receive poor model performance or even introduce bias.
For example, the KNN model uses distance measurements to determine which neighbors are closest
to a particular record. If you have a feature with a particularly high scale relative to the other
features in your model, your model will likely employ this feature more than the others, resulting in a
bias.
Data Reduction
Sometimes, datasets are too large or contain too many features. Data reduction helps simplify the
dataset without losing important information. Techniques include:
• Dimensionality reduction: Reducing the number of features using methods like Principal
Component Analysis (PCA).
• Feature selection: Identifying and keeping only the most relevant features to the problem.
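As a quick illustration (not from the notes), the sketch below applies PCA with scikit-learn to reduce a hypothetical 4-feature dataset to 2 components.
```python
# Sketch: dimensionality reduction with Principal Component Analysis (PCA).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))              # hypothetical dataset: 100 samples, 4 features

pca = PCA(n_components=2)                  # keep the 2 directions of highest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)       # fraction of variance kept by each component
```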
Feature Scaling
Scaling is a broad term that encompasses both normalization and standardization. While normalization rescales features to a specific range (typically 0-1), standardization adjusts the spread or variability of the data (zero mean, unit variance).
Feature Scaling is a technique to standardize the independent features present in the data. It is performed during data pre-processing to handle highly varying values. If feature scaling is not done, a machine learning algorithm tends to treat greater values as more important and smaller values as less important, regardless of the units of the values. For example, it would treat 10 m and 10 cm as the same, since both have the magnitude 10. Below are the different techniques used to perform feature scaling.
Normalization
Normalization is a process that transforms your data's features to a standard scale, typically between 0 and 1. This is achieved by adjusting each feature's values based on its minimum and maximum values:
X' = (X − Xmin) / (Xmax − Xmin)
The goal is to ensure that no single feature dominates the others due to its magnitude.
Mean normalization is more or less the same as the previous method, but instead of the minimum value we subtract the mean of the feature from each entry and then divide the result by the difference between the maximum and the minimum value:
X' = (X − mean(X)) / (Xmax − Xmin)
Why Normalize?
• Improved Model Convergence: Algorithms like gradient descent often converge faster when
features are on a similar scale.
Standardization
This method of scaling is based on the central tendency and variance of the data:
1. First, calculate the mean and standard deviation of the feature we would like to standardize.
2. Then subtract the mean value from each entry and divide the result by the standard deviation.
This produces data with a mean equal to zero and a standard deviation equal to 1.
Standardization: Here, each feature is transformed to have a mean of 0 and a standard deviation of
1. This is achieved by subtracting the mean value and dividing by the standard deviation of the
feature.
Z = (x − μ) / σ
Where Z is the standardized value, x is the original feature value, μ is the mean of the feature, and σ is its standard deviation.
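A brief sketch contrasting min-max normalization and standardization with scikit-learn (assumed available; the data is made up for illustration).
```python
# Sketch: min-max normalization (0-1 range) vs. standardization (mean 0, std 1).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0, 50000.0], [20.0, 62000.0], [35.0, 58000.0]])  # e.g. age, salary

X_minmax = MinMaxScaler().fit_transform(X)       # X' = (X - Xmin) / (Xmax - Xmin)
X_standard = StandardScaler().fit_transform(X)   # Z = (x - mu) / sigma

print(X_minmax)
print(X_standard.mean(axis=0), X_standard.std(axis=0))  # ~0 mean, ~1 std per feature
```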
Underfitting
Underfitting is when the model is not even able to represent the
data points in the training dataset. In the case of under-fitting,
you will get a low accuracy even when testing on the training
dataset.
Underfitting usually means that your model is too simple to
capture the complexities of the dataset.
Overfitting
Overfitting is the opposite problem: the model fits the training data too closely, capturing noise as well as the underlying pattern, so it achieves high accuracy on the training set but poor accuracy on unseen test data. Overfitting usually means that your model is too complex for the amount of data available.
A regression analysis problem is one in which the output variable is a real or continuous value, such as "salary" or "weight". Many different regression models can be used, but the simplest among them is linear regression.
Types of Regression
1. Simple Linear Regression:-
Linear regression is one of the simplest and most widely used statistical models. It assumes that there is a linear relationship between the independent and dependent variables, meaning that the change in the dependent variable is proportional to the change in the independent variable. For example, predicting the price of a house based on its size.
2. Multiple Linear Regression:-
Multiple linear regression extends simple linear regression by using multiple independent variables to predict the target variable. For example, predicting the price of a house based on multiple features such as size, location, number of rooms, etc.
3. Polynomial Regression
Polynomial regression is used to model non-linear relationships between the dependent variable and the independent variables. It adds polynomial terms to the linear regression model to capture more complex relationships. For example, when we want to predict a non-linear trend like population growth over time, we use polynomial regression.
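A minimal polynomial regression sketch with scikit-learn (an illustration added here, using synthetic data): polynomial features are generated and then fitted with ordinary linear regression.
```python
# Sketch: polynomial regression = linear regression on polynomial features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

X = np.linspace(0, 5, 30).reshape(-1, 1)
y = 2 + 0.5 * X.ravel() ** 2 + np.random.default_rng(0).normal(0, 0.3, 30)  # quadratic trend

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

print(model.predict([[6.0]]))   # extrapolated prediction for x = 6
```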
Applications of Regression
• Predicting prices: Used to predict the price of a house based on its size, location and
other features.
• Forecasting trends: Model to forecast the sales of a product based on historical sales
data.
• Identifying risk factors: Used to identify risk factors for heart patients based on patient medical data.
• Making decisions: It could be used to recommend which stock to buy based on
market data.
Advantages of Regression
• Easy to understand and interpret.
• Robust variants exist that can handle outliers.
• Can handle both linear and (with extensions such as polynomial terms) non-linear relationships.
Disadvantages of Regression
• Assumes linearity.
• Sensitive to situations where two or more independent variables are highly correlated with each other, i.e., multicollinearity.
• Polynomial Regression does not require the relationship between the independent and dependent variables to be linear in the data set. This is also one of the main differences between Linear and Polynomial Regression.
• Polynomial Regression is generally used when the points in the data are not captured by the Linear Regression Model, i.e., when Linear Regression fails to describe the result well.
As we increase the degree of the model, its performance on the training data tends to increase. However, increasing the degree also increases the risk of over-fitting, while too low a degree risks under-fitting the data.
How to find the right degree of the equation?
In order to find the right degree for the model to prevent over-fitting or under-fitting, we
can use:
1. Forward Selection:
This method increases the degree until it is significant enough to define the best
possible model.
2. Backward Selection:
This method decreases the degree until it is significant enough to define the best
possible model.
Cost Function is a function that measures the performance of a Machine Learning
model for given data.
Cost Function is basically the calculation of the error between predicted values and
expected values and presents it in the form of a single real number.
Many people get confused between the Cost Function and the Loss Function. To put it simply, the Cost Function is the average error over the n samples in the data, while the Loss Function is the error for an individual data point. In other words, the Loss Function is for one training example; the Cost Function is for the entire training set.
• Polynomial regression can reduce your costs returned by the cost function. It gives your
regression line a curvilinear shape and makes it more fitting for your underlying data. By
applying a higher order polynomial, you can fit your regression line to your data more
precisely.
Now, we know that the ideal value of the Cost Function is 0, or somewhere close to 0. In order to approach this ideal Cost Function, we can perform gradient descent, which updates the weights and in turn minimizes the error.
Gradient Descent for Polynomial Regression
→ Initially, the values of m and b will be 0, and the learning rate (α) will be introduced to the function. The value of the learning rate (α) is taken to be very small, typically between 0.0001 and 0.01.
The learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a cost function.
→ Then the partial derivatives of the cost function are calculated with respect to the slope (m) and the intercept (b).
Readers familiar with calculus will understand how the derivatives are taken; if you don't know calculus, don't worry – just understand how this works, and it will be more than enough to grasp intuitively what is happening behind the scenes.
→ After the derivatives are calculated,The slope(m) and intercept(b) are updated with the help of the
following equation.
m = m - α*derivative of m
b = b - α*derivative of b
The derivatives of m and b are as calculated above, and α is the learning rate.
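A compact sketch of this update loop in Python (added for illustration): a plain linear fit y = m·x + b with a mean-squared-error cost, which is the same mechanism used once polynomial terms are added.
```python
# Sketch: gradient descent for y = m*x + b, minimizing mean squared error.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.1, 10.8]            # roughly y = 2x + 1

m, b = 0.0, 0.0
alpha = 0.01                               # learning rate
n = len(xs)

for _ in range(5000):
    # Partial derivatives of the MSE cost with respect to m and b.
    dm = (-2.0 / n) * sum(x * (y - (m * x + b)) for x, y in zip(xs, ys))
    db = (-2.0 / n) * sum((y - (m * x + b)) for x, y in zip(xs, ys))
    m = m - alpha * dm                     # m = m - α * derivative of m
    b = b - alpha * db                     # b = b - α * derivative of b

print(m, b)                                # should approach ~2 and ~1
```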
1. Mean Absolute Error (MAE)
MAE is the average of the absolute differences between the predicted values and the actual values: MAE = (1/n) Σ |y − ŷ|.
Advantages:
• MAE is in the same unit as the output variable.
• Robust to outliers.
Disadvantages:
• Not differentiable at zero, requiring alternative optimization methods.
2. Mean Squared Error (MSE)
MSE is the average of the squared differences between the predicted values and the actual values: MSE = (1/n) Σ (y − ŷ)².
Advantages:
• Differentiable, suitable for optimization.
• Penalizes larger errors more.
Disadvantages:
• The output is in squared units, which can be less interpretable.
3. Root Mean Squared Error (RMSE)
RMSE is a popular metric and is the extended version of MSE: it is the square root of the MSE. It indicates how much the data points are spread around the best-fit line. A lower value means that the data points lie closer to the best-fit line.
Advantages:
• Output in the same unit as the output variable.
• Easier to interpret.
Disadvantages:
• Not as robust to outliers as MAE
• R-squared (R²):-
• R-squared (R²), also known as the coefficient of determination, measures the proportion
of the variance in the dependent variable that is predictable from the independent
variables. It provides a baseline to compare models and is independent of the context.
Advantages:
• Provides a baseline for model comparison.
• Independent of context.
Disadvantages:
• Can be misleading with irrelevant features
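A short sketch computing these four metrics with scikit-learn (the values are illustrative only).
```python
# Sketch: computing MAE, MSE, RMSE, and R^2 for a set of predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.1, 9.6])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)

print(mae, mse, rmse, r2)
```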
Scatter Plot:-
A scatter plot is one of the most important data visualization techniques and is considered one of the Seven Basic Tools of Quality. A scatter plot is used to plot the relationship between two variables on a two-dimensional graph known mathematically as the Cartesian plane.
It is generally used to plot the relationship between one independent variable and one
dependent variable, where an independent variable is plotted on the x-axis and a dependent
variable is plotted on the y-axis so that you can visualize the effect of the independent
variable on the dependent variable. These plots are known as Scatter Plot Graph or Scatter
Diagram.
Logistic Regression
Key Points:
• Logistic regression predicts the output of a categorical dependent variable.
Therefore, the outcome must be a categorical or discrete value.
• It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the exact
value as 0 and 1, it gives the probabilistic values which lie between 0 and 1.
• In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic
function, which predicts two maximum values (0 or 1)
TYPES
1. Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as “cat”, “dogs”, or “sheep”
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as “low”, “Medium”, or “High”.
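A small binomial logistic regression sketch with scikit-learn (hypothetical data): the model outputs probabilities between 0 and 1 and thresholds them into a 0/1 class.
```python
# Sketch: binomial logistic regression producing probabilities and a 0/1 class.
from sklearn.linear_model import LogisticRegression

X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]   # e.g. hours studied
y = [0, 0, 0, 1, 1, 1]                            # fail (0) / pass (1)

clf = LogisticRegression().fit(X, y)

print(clf.predict_proba([[3.5]]))   # probabilities for class 0 and class 1
print(clf.predict([[3.5]]))         # thresholded class label
```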
Decision Tree
A decision tree is a simple diagram that shows different choices and their possible results, helping you make decisions easily. The following notes cover what decision trees are, how they work, their advantages and disadvantages, and their applications.
3.Support Vectors
They are the data points that lie closest to the decision boundary (hyperplane)
in a Support Vector Machine (SVM). These data points are important because
they determine the position and orientation of the hyperplane, and thus have a
significant impact on the classification accuracy of the SVM. In fact, SVMs are
named after these support vectors because they “support” or define the
decision boundary. The support vectors are used to calculate the margin, which
is the distance between the hyperplane and the closest data points from each
class. The goal of SVMs is to maximize this margin while minimizing
classification errors.
We have a famous dataset called ‘Iris’. There are four features (columns or independent variables) in this dataset, but for simplicity we shall only look at two of them: ‘Petal length’ and ‘Petal width’. These points are then plotted on a 2D plane.
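A compact sketch of this setup using scikit-learn's built-in Iris data (added for illustration), keeping only petal length and petal width and fitting a linear SVM; the support vectors are the points closest to the decision boundary.
```python
# Sketch: linear SVM on two Iris features (petal length, petal width).
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, 2:4]          # columns 2 and 3: petal length, petal width
y = iris.target

clf = SVC(kernel="linear").fit(X, y)

print(clf.support_vectors_[:5])   # points that define the decision boundary
print(clf.predict([[4.5, 1.5]]))  # class prediction for a new flower
```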
Why do we use Support Vector Machines for Anomaly Detection?
We use Support Vector Machine for anomaly detection because of the
following reasons:
1. Effective for High-Dimensional Data: SVMs perform well in high-
dimensional spaces, making them suitable for datasets with many
features, such as those commonly encountered in anomaly detection
tasks.
2. Robust to Overfitting: SVMs are less prone to overfitting, which is crucial
in anomaly detection where the goal is to generalize well to unseen
anomalies.
3. Optimal Separation: SVMs aim to find the hyperplane that maximally
separates the normal data points from the anomalies, making them
effective in identifying outliers.
4. One-Class SVM: The One-Class SVM variant is specifically designed for
anomaly detection, learning to distinguish normal data points from
outliers without the need for labeled anomalies.
5. Kernel Trick: SVMs can use kernel functions to map data into a higher-
dimensional space, allowing for non-linear separation of anomalies from
normal data.
6. Handling Imbalanced Data: Anomaly detection datasets are often highly
imbalanced, with normal data points outnumbering anomalies. SVMs
can handle this imbalance well.
7. Interpretability: SVMs provide clear decision boundaries, which can help
in interpreting why a particular data point is classified as an anomaly
In this section we seek answers to the following questions:
• How to train a one-class support vector machine (SVM) model.
• How to predict anomalies from a one-class SVM model.
• How to change the default threshold for anomaly prediction.
• How to visualize the prediction results
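A minimal One-Class SVM anomaly-detection sketch with scikit-learn (synthetic data; the nu value and the custom threshold are illustrative assumptions, not prescribed settings).
```python
# Sketch: One-Class SVM for anomaly detection on synthetic 2D data.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 2))           # "normal" training data only
X_test = np.vstack([rng.normal(0, 1, size=(5, 2)),   # normal points
                    np.array([[5.0, 5.0], [-6.0, 4.0]])])  # obvious anomalies

clf = OneClassSVM(kernel="rbf", nu=0.05).fit(X_train)   # nu ~ expected outlier fraction

print(clf.predict(X_test))           # +1 = normal, -1 = anomaly
scores = clf.decision_function(X_test)
print(scores < -0.1)                 # custom (stricter) threshold instead of the default 0
```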
What is kNN
k-Nearest Neighbours (kNN) is a supervised classification method that assigns a new observation to a class based on the classes of the k closest observations in the training data.
kNN Characteristics
• A non-parametric classification method, meaning that no parameters of
the population distribution are estimated
• It is a supervised ML algorithm, meaning we need data with known
classes.
• It is a type of lazy learning, because it doesnʼt create a model.
• It predicts directly based on training data.
In a Nutshell
1. Determine the distance between the new observation and all the data
points in the training set.
2. Sort the distances.
3. Identify K closest neighbours.
4. Determine the class of the new observation based on the group majority of
the k nearest neighbours.
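The four steps above map directly onto a short Python sketch (illustrative, using Euclidean distance and a plain majority vote).
```python
# Sketch: k-nearest-neighbour classification by hand (Euclidean distance, majority vote).
from collections import Counter
import math

def knn_predict(train_X, train_y, new_point, k=3):
    # 1. Distance from the new observation to every training point.
    distances = [(math.dist(x, new_point), label) for x, label in zip(train_X, train_y)]
    # 2.-3. Sort the distances and keep the k closest neighbours.
    nearest = sorted(distances)[:k]
    # 4. Majority vote among the k nearest neighbours.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 7), (6, 7)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, (2, 2)))   # expected: 'A'
```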
Evaluation - LOOCV
• One way to validate kNN predictions is to use leave-one-out cross-validation (LOOCV) on the existing data with a known class.
• We can then use accuracy for evaluation.
Limitations
Big Datasets
• If we have very big data sets, the LOOCV might take a long time to run.
• Instead of using LOOCV, we may use the hold-out method, where we split
the dataset into training and test sets.
Dataset Imbalance
• When groups are of equal size, kNN is unbiased; when one class is much larger than the others, the majority class tends to dominate the vote and bias the predictions.
Cluster Analysis
Introduction
Cluster Analysis is a technique used in data mining to group a set of data objects into clusters. A cluster is a collection of data objects that are similar to each other within the same cluster but different from the objects in other clusters. This method is useful for exploring data, identifying patterns, and classifying information without prior knowledge of predefined categories.
Hierarchical Clustering
Hierarchical clustering creates a hierarchy of clusters from the finest level (individual
points) to a single large cluster. The hierarchy is visualized using a dendrogram.
Disadvantages
❌ Computationally expensive, O(n²) – slow for large datasets.
Weaknesses of K-Means
•Requires predefining the number of clusters (k).
•Sensitive to initial centroid selection (different initializations can yield different
results).
•Not suitable for categorical data (since mean values are undefined for categorical
attributes).
•Sensitive to noise and outliers, which can distort the centroids.
•Fails with non-convex clusters (e.g., clusters with irregular shapes).
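For reference, a minimal K-means run with scikit-learn on synthetic data (an illustration added here; note that k must be chosen up front, which is the first weakness listed above).
```python
# Sketch: K-means clustering on synthetic 2D data with a predefined k.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),        # cluster around (0, 0)
               rng.normal(5, 0.5, (50, 2))])        # cluster around (5, 5)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)    # learned centroids
print(kmeans.labels_[:5])         # cluster assignment of the first few points
```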
Notes:-A dendrogram is a tree-like diagram that illustrates how clusters are merged
(AGNES) or split (DIANA)
Density-based methods:
To discover clusters with arbitrary shape, density-based clustering methods
have been developed. These typically regard clusters as dense regions of
objects in the data space which are separated by regions of low density
(representing noise).
DBSCAN: A density-based clustering method based on connected
regions with sufficiently high density
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. The algorithm grows regions with sufficiently high density into clusters, and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points.
✓ The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object.
✓ If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.
✓ Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is within the ε-neighborhood of q, and q is a core object.
✓ An object p is density-reachable from object q with respect to ε and MinPts in a set of objects, D, if there is a chain of objects p1, ..., pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi with respect to ε and MinPts, for 1 ≤ i ≤ n, pi ∈ D.
✓ An object p is density-connected to object q with respect to ε and MinPts in a set of objects, D, if there is an object o ∈ D such that both p and q are density-reachable from o with respect to ε and MinPts.
Density reachability is the transitive closure of direct density reachability, and this relationship is asymmetric. Only core objects are mutually density reachable. Density connectivity, however, is a symmetric relation.
Example 8.5 Consider Figure 8.9 for a given ε represented by the radius of the circles, and, say, let
MinPts = 3.
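A short DBSCAN sketch with scikit-learn (added for illustration, on synthetic data): eps plays the role of ε and min_samples the role of MinPts; the chosen values are illustrative.
```python
# Sketch: DBSCAN clustering; eps ~ epsilon radius, min_samples ~ MinPts.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (40, 2)),     # dense region 1
               rng.normal(4, 0.3, (40, 2)),     # dense region 2
               np.array([[10.0, 10.0]])])       # an isolated noise point

labels = DBSCAN(eps=0.8, min_samples=3).fit_predict(X)

print(set(labels))    # cluster ids; -1 marks noise (the isolated point)
```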