ADITYA COLLEGE OF ENGINEERING & TECHNOLOGY
Machine Learning
UNIT-4
Unsupervised Learning Techniques
By
B Manikyala Rao M.Tech(Ph.d)
Senior Assistant Professor
Dept of Computer Science & Engineering
Aditya College of Engineering & Technology
Surampalem
Clustering in Machine Learning
➢Clustering is a way of grouping data points into clusters of similar
data points. Objects that resemble each other stay in one group, which
has little or no similarity with the other groups.
➢It is an unsupervised learning method: no supervision is provided to the
algorithm, and it works on an unlabeled dataset.
➢After clustering, each cluster or group is given a cluster ID. An ML
system can use this ID to simplify the processing of large and
complex datasets.
Example: Consider a shopping mall, where items with similar usage are
grouped together. T-shirts are in one section and trousers in another;
similarly, in the produce section, apples, bananas, mangoes, etc. are
kept in separate groups so that we can find things easily. Clustering
works in the same way.
Machine Learning B Manikyala Rao
Clustering
Clustering is used in a wide variety of tasks. Some of the
most common uses are:
• Market segmentation
• Statistical data analysis
• Social network analysis
• Image segmentation
• Anomaly detection, etc.
Apart from these general uses, Amazon uses clustering in its
recommendation system to suggest products based on past
searches, and Netflix uses it to recommend movies and web series
based on watch history.
Types of Clustering Methods
• Partitioning Clustering: It is a type of clustering that divides the data
into non-hierarchical groups. It is also known as the centroid-based
method. The most common example of partitioning clustering is
the K-Means clustering algorithm.
• Density-Based Clustering
• Distribution Model-Based Clustering
• Hierarchical Clustering
• Fuzzy Clustering
Applications of Clustering
• In Identification of Cancer Cells: Clustering algorithms are widely used
to identify cancerous cells. They divide cancerous and non-cancerous
data into different groups.
• In Search Engines: Search engines also use clustering. Results are
returned based on the objects closest to the search query, which is done
by grouping similar data objects together, far from dissimilar objects.
The accuracy of the results depends on the quality of the clustering
algorithm used.
• Customer Segmentation: It is used in market research to segment
customers based on their choices and preferences.
• In Biology: It is used to classify different species of
plants and animals with image recognition techniques.
• In Land Use: It is used to identify areas of similar land use in a GIS
database. This helps determine the purpose for which a particular
piece of land is most suitable.
K-Means Clustering Algorithm
• K-Means clustering is an unsupervised learning algorithm that groups an
unlabeled dataset into clusters. Here K is the number of predefined
clusters to be created: if K=2 there will be two clusters, for K=3 three
clusters, and so on.
• It is an iterative algorithm that divides the unlabeled dataset into K
clusters in such a way that each data point belongs to only one group of
points with similar properties.
• The algorithm takes the unlabeled dataset as input, divides it into K
clusters, and repeats the process until the clusters stop changing.
The value of K must be chosen in advance.
The k-means clustering algorithm mainly performs two tasks:
• Determines the best positions for the K center points (centroids) by an iterative process.
• Assigns each data point to its closest centroid. The data points nearest
to a particular centroid form a cluster.
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the steps below:
• Step-1: Choose K, the number of clusters.
• Step-2: Select K random points as initial centroids. (They need not be points from the input dataset.)
• Step-3: Assign each data point to its closest centroid, which forms the K clusters.
• Step-4: Compute the mean of the points in each cluster and place the cluster's new centroid there.
• Step-5: Repeat Step-3, i.e., reassign each data point to the new closest centroid.
• Step-6: If any reassignment occurred, go to Step-4; otherwise go to FINISH.
• Step-7: The model is ready.
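The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name k_means, the choice of initial centroids from the data points, and the toy data are assumptions made for the example.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means following the steps above (illustrative, not optimized)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as the initial centroids (assumption)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 6: stop as soon as no point changes cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy data: two well-separated groups of 50 points each
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centroids, labels = k_means(X, k=2)
print(centroids)
```

On data like this, the two centroids should end up near the centers of the two groups; which group receives label 0 depends on the random initialization.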
Suppose we have two variables, M1 and M2. The x-y scatter plot of these two variables is given
below:
• Let's take K=2, i.e., we will try to group the data points into two different clusters.
• We need to choose K random points as centroids to form the clusters. These points can be
either points from the dataset or any other points. Here we select the two points below as
centroids, which are not part of our dataset. Consider the image below:
Algorithm
Now we assign each data point of the scatter plot to its closest centroid, using the distance
between two points (e.g., Euclidean distance). For K=2 this amounts to drawing the line midway
between the two centroids (their perpendicular bisector, called the median line here). Consider the image below:
From the image, it is clear that points to the left of the line are nearer to K1, the blue centroid, and
points to the right of the line are nearer to the yellow centroid. Let's color them blue and yellow for
clear visualization.
To find better clusters, we repeat the process with new centroids.
To choose them, we compute the center of gravity (mean) of the points
in each cluster and move the centroids there, as below:
Next, we reassign each data point to its nearest new centroid. For this, we
repeat the same process of finding a median line. The median line will
be as in the image below:
• From the above image, we can see that one yellow point is on the left side of the line and two blue
points are to the right of it. So, these three points are reassigned to the other centroids.
• As reassignment has taken place, we go back to Step-4, which is finding new centroids.
• We repeat the process by finding the center of gravity of each cluster, so the new centroids will
be as shown in the image below:
With the new centroids, we again draw the median line and reassign the data points. So, the
image will be:
In the above image, no data point lies on the wrong side of the line, which
means the clusters have converged and our model is formed. Consider the image below:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will
be as shown in the image below:
Limits of K Means
The most important limitations of simple k-means are:
• The user has to specify k (the number of clusters) in advance.
• k-means can only handle numerical data.
• k-means assumes roughly spherical clusters with roughly equal
numbers of observations in each.
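Implementation:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Toy data for illustration (an assumption; any numeric feature matrix X works here)
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)
k = 5
kmeans = KMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)  # cluster index (0 to k-1) for each row of X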
Semi-Supervised Cluster Analysis
• Semi-supervised clustering is a method that partitions unlabeled data by making use
of domain knowledge. This knowledge is generally expressed as pairwise constraints between
instances or as an additional set of labeled instances.
• The quality of unsupervised clustering can be significantly improved using some weak
form of supervision, for instance pairwise constraints (i.e., pairs of
objects labeled as belonging to the same or different clusters). Such a clustering procedure
that depends on user feedback or guidance constraints is known as semi-supervised
clustering.
Methods for semi-supervised clustering can be divided into two
classes, which are as follows:
Constraint-based semi-supervised clustering − It relies on user-provided
labels or constraints to guide the algorithm toward a more appropriate data
partitioning. This includes modifying the objective function based on the constraints, or
initializing and constraining the clustering process based on the labeled objects.
Distance-based semi-supervised clustering − It employs an adaptive
distance measure that is trained to satisfy the labels or constraints in the supervised
data. Several adaptive distance measures have been used, including string-edit
distance trained using Expectation-Maximization (EM), and Euclidean distance modified
by a shortest-distance algorithm.
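One simple constraint-based variant is to initialize the clustering from the labeled objects. The sketch below seeds k-means with the mean of a few labeled instances per class; the toy data, the choice of five labeled instances per class, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: three well-separated groups (assumption, for illustration only)
X, y_true = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [0, 10]],
                       cluster_std=1.0, random_state=0)

# Pretend that only five instances per class carry a label
labeled_idx = np.concatenate([np.where(y_true == c)[0][:5] for c in range(3)])
y_labeled = y_true[labeled_idx]

# Seed one centroid per class from the mean of its labeled instances
seeds = np.array([X[labeled_idx][y_labeled == c].mean(axis=0) for c in range(3)])

# n_init=1 because the initial centroids are given explicitly
kmeans = KMeans(n_clusters=3, init=seeds, n_init=1).fit(X)
print(kmeans.cluster_centers_)
```

Because each centroid starts from a labeled class, cluster i can be read as class i, which plain k-means with random initialization does not guarantee.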
• An interesting clustering method is CLTree (CLustering based on decision
TREEs). It integrates unsupervised clustering with the idea of supervised
classification, and is an instance of constraint-based semi-supervised clustering. It
turns a clustering task into a classification task by treating the set of points
to be clustered as belonging to one class, labeled “Y,” and adding a set of
relatively uniformly distributed “nonexistence points” with a different class label,
“N.”
• The problem of partitioning the data space into data (dense) regions and empty
(sparse) regions then becomes a classification problem. The original points
are treated as the set of “Y” points, and a collection of
uniformly distributed “N” points (shown as “o” points) is added to them.
• The original clustering problem is thus turned into a classification problem,
which works out a scheme that distinguishes “Y” and “N” points. A decision tree
induction method can be used to partition the two-dimensional space. Two
clusters are identified, which consist of the “Y” points only.
• Adding a large number of “N” points to the original data can
introduce unnecessary computational overhead. Moreover, it is unlikely that
any points added would truly be uniformly distributed in a very high-dimensional
space, as this can require an exponential number of points.
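The idea of turning clustering into classification can be illustrated with an ordinary decision tree. This is a simplified sketch of the idea, not the CLTree algorithm itself: the toy blobs, the number of “N” points, and the use of scikit-learn's DecisionTreeClassifier are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# "Y" points: the data to be clustered (two dense blobs, assumed for illustration)
Y_points = np.vstack([rng.normal([2.0, 2.0], 0.3, size=(200, 2)),
                      rng.normal([8.0, 8.0], 0.3, size=(200, 2))])
# "N" points: synthetic points spread uniformly over a box that contains the data
N_points = rng.uniform(0.0, 10.0, size=(400, 2))

X = np.vstack([Y_points, N_points])
labels = np.array([1] * len(Y_points) + [0] * len(N_points))  # 1 = "Y", 0 = "N"

# The tree splits the plane into axis-aligned regions; regions it predicts
# as "Y" are dense, so they are the candidate clusters
tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X, labels)
print(tree.predict([[2.0, 2.0], [8.0, 8.0]]))
```

Physically adding the “N” points, as done here, is exactly the overhead that the last bullet above warns about.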
Using clustering for image segmentation
• Image segmentation is the task of partitioning an image into multiple segments. In semantic segmentation, all pixels
that are part of the same object type get assigned to the same segment.
• For example, in a self-driving car’s vision system, all pixels that are part of a pedestrian’s image might be assigned to
the “pedestrian” segment.
• Here, we are going to do something much simpler: color segmentation. We will simply assign pixels to the same
segment if they have a similar color. In some applications, this may be sufficient, for example if you want to analyze
satellite images to measure how much total forest area there is in a region, color segmentation may be just fine.
• First, let’s load the image using Matplotlib’s imread() function:
from matplotlib.image import imread
image = imread("path")  # replace "path" with the location of the image file
image.shape  # (533, 800, 3)
• The image is represented as a 3D array: the first dimension’s size is the height, the second is the width, and the third is
the number of color channels, in this case red, green and blue (RGB).
• The following code reshapes the array to get a long list of RGB colors, then it clusters these colors using K-Means. For
example, it may identify a color cluster for all shades of green. Next, for each color (e.g., dark green), it looks for the
mean color of the pixel’s color cluster.
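from sklearn.cluster import KMeans
X = image.reshape(-1, 3)  # one row per pixel, one column per color channel
kmeans = KMeans(n_clusters=8).fit(X)
segmented_img = kmeans.cluster_centers_[kmeans.labels_]  # each pixel takes its cluster's mean color
segmented_img = segmented_img.reshape(image.shape)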
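Since the snippet above needs an image file, here is a self-contained variant that runs on a small synthetic image. The image contents (a greenish half and a bluish half with a little noise) and the choice of two clusters are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 40x60 RGB image: left half greenish, right half bluish, plus noise
rng = np.random.default_rng(0)
image = np.zeros((40, 60, 3))
image[:, :30] = [0.1, 0.7, 0.2]
image[:, 30:] = [0.1, 0.2, 0.8]
image = np.clip(image + rng.normal(0.0, 0.03, size=image.shape), 0.0, 1.0)

X = image.reshape(-1, 3)  # one row per pixel
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Replace every pixel by the mean color of its cluster, then restore the shape
segmented_img = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
n_colors = len(np.unique(segmented_img.reshape(-1, 3), axis=0))
print(segmented_img.shape, n_colors)
```

The segmented image has the same shape as the original but contains only as many distinct colors as there are clusters.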
Using Clustering for Preprocessing
• Clustering can be an efficient approach to dimensionality reduction, in particular as a
preprocessing step before a supervised learning algorithm.
• Let’s tackle the digits dataset, a simple MNIST-like dataset containing 1,797
grayscale 8×8 images representing the digits 0 to 9. First, let’s load the dataset:
from sklearn.datasets import load_digits
X_digits, y_digits = load_digits(return_X_y=True)
Now, let’s split it into a training set and a test set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits)
Next, let’s fit a Logistic Regression model:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)
Let’s evaluate its accuracy on the test set:
log_reg.score(X_test, y_test)
# 0.9666666666666667 (the exact value varies with the random train/test split)
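The baseline above does not yet use clustering. One way to add it as a preprocessing step is a pipeline in which K-Means first replaces each image by its distances to the cluster centroids, and Logistic Regression is then trained on those distances. The sketch below assumes 50 clusters and fixed random seeds; both are illustrative choices, not values taken from the slides.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_digits, y_digits, random_state=42)

pipeline = Pipeline([
    # KMeans.transform() maps each image to its distances to the 50 centroids
    ("kmeans", KMeans(n_clusters=50, n_init=10, random_state=42)),
    ("log_reg", LogisticRegression(max_iter=5000, random_state=42)),
])
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(score)
```

The number of clusters is a hyperparameter here; it can be tuned like any other, for example with a grid search over the pipeline.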
DBSCAN
• For each instance, the algorithm counts how many instances are located within a small
distance ε (epsilon) from it. This region is called the instance’s ε-neighborhood.
• If an instance has at least min_samples instances in its ε-neighborhood (including
itself), then it is considered a core instance. In other words, core instances are those that
are located in dense regions.
• All instances in the neighborhood of a core instance belong to the same cluster. This may
include other core instances, therefore a long sequence of neighboring core instances
forms a single cluster.
• Any instance that is not a core instance and does not have one in its neighborhood is
considered an anomaly.
• Let’s test it on the moons dataset:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=1000, noise=0.05)
dbscan = DBSCAN(eps=0.05, min_samples=5)
dbscan.fit(X)
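After fitting, the result can be inspected through the attributes of the fitted estimator. The sketch below repeats the fit so that it runs on its own; the random_state value and the summary variable names are assumptions added for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.05, min_samples=5).fit(X)

labels = dbscan.labels_                    # cluster index per instance, -1 = anomaly
n_clusters = len(set(labels) - {-1})       # number of clusters found
n_anomalies = int(np.sum(labels == -1))    # instances not assigned to any cluster
n_core = len(dbscan.core_sample_indices_)  # indices of the core instances
print(n_clusters, n_anomalies, n_core)
```

If too many instances come out as anomalies or the moons are split into many small clusters, increasing eps widens each neighborhood and usually merges them.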
Dimensionality Reduction
• The number of input features, variables, or columns in a
dataset is known as its dimensionality, and the process of reducing
these features is called dimensionality reduction.
• In many cases a dataset contains a huge number of input features,
which makes the predictive modeling task more complicated. Because
it is very difficult to visualize or make predictions from a training
dataset with a large number of features, dimensionality
reduction techniques are needed in such cases.
• It is a way of converting a higher-dimensional
dataset into a lower-dimensional
dataset while ensuring that it
provides similar information.
The Curse of Dimensionality
• Handling high-dimensional data is very difficult in practice, a problem commonly known as
the curse of dimensionality. As the dimensionality of the input dataset increases, any
machine learning algorithm and model becomes more complex. As the number of
features increases, the number of samples needed to cover the feature space also grows rapidly, and the
chance of overfitting increases. A machine learning model trained on high-dimensional
data with too few samples tends to overfit and perform poorly.
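One way to see the effect is to measure how far apart random points are as the number of dimensions grows. The sketch below is an illustration under assumed settings (points drawn uniformly from the unit hypercube, 10,000 random pairs per dimension); the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pair_distance(dim, n_pairs=10_000):
    """Average Euclidean distance between two random points in the unit hypercube."""
    a = rng.random((n_pairs, dim))
    b = rng.random((n_pairs, dim))
    return np.linalg.norm(a - b, axis=1).mean()

for dim in (2, 10, 100, 1000):
    print(dim, round(mean_pair_distance(dim), 2))
```

The average distance keeps growing with the dimension (the expected squared distance is d/6 for d dimensions), so a fixed number of samples becomes ever sparser and new instances tend to lie far from every training instance.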
Benefits of applying Dimensionality Reduction:
• Reducing the number of features reduces the space required to store the dataset.
• Less computation and training time is required with fewer features.
• A reduced number of features makes the data easier to visualize.
• It removes redundant features (if present) by taking care of multicollinearity.
Disadvantages of Dimensionality Reduction:
• Some information may be lost due to dimensionality reduction.
• In the PCA dimensionality reduction technique, the number of principal components
to keep is sometimes not known in advance.
1. Feature Selection: Feature selection is the process of selecting a subset of the relevant features and leaving out the irrelevant features
present in a dataset in order to build a model of high accuracy. In other words, it is a way of selecting the optimal features from the input dataset.
Three methods are used for feature selection:
A. Filter Methods:
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some common techniques of the filter
method are:
• Correlation
• Chi-Square Test
• ANOVA
• Information Gain, etc.
B. Wrapper Methods: The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In
this method, a subset of features is fed to the ML model and its performance is evaluated. The performance decides whether to add
features or remove them to increase the accuracy of the model. This method is more accurate than the filter method but more complex and costly to run.
Some common techniques of wrapper methods are:
• Forward Selection
• Backward Selection
• Bi-directional Elimination
C. Embedded Methods: Embedded methods evaluate the importance of each feature during the training iterations of the machine learning
model itself. Some common techniques of embedded methods are:
• LASSO
• Elastic Net
• Ridge Regression, etc.
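The three families can be tried side by side with scikit-learn. This is a minimal sketch: the diabetes regression dataset, the number of features to keep, and the LASSO strength alpha=1.0 are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import (SelectFromModel, SelectKBest,
                                       SequentialFeatureSelector, f_regression)
from sklearn.linear_model import Lasso, LinearRegression

X, y = load_diabetes(return_X_y=True)  # 442 samples, 10 features

# A. Filter: rank features by a univariate, correlation-based F-test
X_filter = SelectKBest(f_regression, k=3).fit_transform(X, y)

# B. Wrapper: forward selection guided by a model's cross-validated score
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                direction="forward")
X_wrapper = sfs.fit_transform(X, y)

# C. Embedded: LASSO shrinks some coefficients to exactly zero while training
X_embedded = SelectFromModel(Lasso(alpha=1.0)).fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```

The wrapper is the slowest of the three because it refits the model for every candidate feature at every step, which matches the trade-off described above.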
2. Feature Extraction: Feature extraction is the process of transforming
a space with many dimensions into a space with fewer
dimensions. This approach is useful when we want to keep most of the
information but use fewer resources while processing
it.
Some common feature extraction techniques are:
• Principal Component Analysis
• Linear Discriminant Analysis
• Kernel PCA
• Quadratic Discriminant Analysis
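Three of these techniques are available in scikit-learn as transformers. The sketch below applies them to the iris dataset; the dataset, the target of two output dimensions, and the RBF kernel are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Unsupervised: only X is used
X_pca = PCA(n_components=2).fit_transform(X)
X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)

# Supervised: LDA also uses the class labels y to find separating directions
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_kpca.shape, X_lda.shape)
```

Unlike feature selection, none of the output columns is one of the original features; each one is a new feature computed from all of them.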
Approaches of Dimension Reduction
There are two main approaches to reducing dimensionality:
1. Projection
2. Manifold Learning
1. Projection:
• In most real-world problems, training instances are not spread out
uniformly across all dimensions.
• Many features are almost constant, while others are highly
correlated. As a result, all training instances actually lie within (or
close to) a much lower-dimensional subspace of the high-dimensional
space.
• Notice that all training instances lie close to a plane: this is a lower-dimensional
(2D) subspace of the high-dimensional (3D) space. Now
if we project every training instance perpendicularly onto this
subspace (as represented by the short lines connecting the instances
to the plane), we get a new 2D dataset.
• We have just reduced the dataset’s dimensionality from 3D to 2D.
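The projection described above can be sketched with NumPy. The synthetic 3D data (points scattered near a tilted plane) and the use of SVD to find the best-fitting plane are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3D points lying close to the plane z = 0.3*x + 0.2*y
xy = rng.uniform(-1.0, 1.0, size=(200, 2))
z = 0.3 * xy[:, 0] + 0.2 * xy[:, 1] + rng.normal(0.0, 0.02, size=200)
X = np.column_stack([xy, z])

X_centered = X - X.mean(axis=0)
# Rows of Vt are orthonormal directions sorted by decreasing variance
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
W2 = Vt[:2].T            # basis of the best-fitting plane
X_2d = X_centered @ W2   # perpendicular projection onto that plane

print(X_2d.shape, s)
```

The third singular value is much smaller than the first two, which confirms that almost nothing is lost by dropping the third direction.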
• Projection is not always the best approach to dimensionality
reduction. In many cases the subspace may twist and turn, such as in
the famous Swiss roll toy dataset.
2. Manifold Learning:
• The Swiss roll is an example of a 2D manifold. Put simply, a 2D
manifold is a 2D shape that can be bent and twisted in a higher-dimensional
space.
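The difference between projecting and unrolling can be seen on scikit-learn's Swiss roll generator, which returns both the 3D points and each point's position t along the roll. The sample size, seed, and variable names below are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

X, t = make_swiss_roll(n_samples=1000, noise=0.0, random_state=42)
# X has 3 columns; t is each point's position along the rolled-up direction

# Projection: dropping the third axis squashes different layers of the roll together
X_projected = X[:, :2]

# Unrolling: (t, second axis) are the sheet's own 2D coordinates
X_unrolled = np.column_stack([t, X[:, 1]])

print(X.shape, X_projected.shape, X_unrolled.shape)
```

Here t is known because the data is synthetic; a manifold learning algorithm has to estimate such coordinates from the 3D points alone.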
Principal Component Analysis
• Principal Component Analysis is an unsupervised learning algorithm
that is used for the dimensionality reduction in machine learning.
• It is a statistical process that converts the observations of correlated
features into a set of linearly uncorrelated features with the help of
orthogonal transformation. These new transformed features are
called the Principal Components.
• PCA generally tries to find a lower-dimensional surface onto which
the high-dimensional data can be projected.
Steps for PCA algorithm
• Getting the dataset
First, we take the input dataset and divide it into two parts X and Y,
where X is the training set and Y is the validation set.
• Representing data into a structure
Now we represent the dataset as a two-dimensional matrix of the independent
variables X. Each row corresponds to a data item and each column to a feature.
The number of columns is the dimensionality of the dataset.
• Standardizing the data
In this step, we standardize the dataset. Without standardization, PCA treats
features with high variance as more important than features with
lower variance.
If the importance of a feature should be independent of its variance, then
we subtract each column's mean and divide each data item in the column by the
column's standard deviation. We name the resulting matrix Z.
• Calculating the Covariance of Z
To calculate the covariance of Z, we transpose the matrix Z and multiply the
transpose by Z. Up to a constant factor of 1/(n−1), where n is the number of rows,
the output is the covariance matrix of Z.
• Calculating the Eigenvalues and Eigenvectors
Now we calculate the eigenvalues and eigenvectors of the resulting covariance
matrix. The eigenvectors of the covariance matrix are the directions of the axes carrying the most
information (variance), and the eigenvalue of each eigenvector measures the variance along that direction.
• Sorting the Eigenvectors
In this step, we sort the eigenvalues in decreasing order, i.e., from largest to smallest,
and sort the corresponding eigenvectors (the columns of matrix P) in the same order.
The resulting matrix is named P*.
• Calculating the new features, or Principal Components
Here we calculate the new features by multiplying Z by P*, i.e., Z* = Z P*.
In the resulting matrix Z*, each observation is a linear combination of the original
features, and the columns of Z* are uncorrelated with each other.
• Removing less important or unimportant features from the new dataset
With the new feature set in hand, we decide what to keep and what to
remove: we keep only the important features (the components with the largest eigenvalues) in the new
dataset and discard the rest.
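The steps above can be followed directly in NumPy. This is a minimal sketch on assumed toy data (three features, two of them strongly correlated); the names Z, P_star and Z_star mirror Z, P* and Z* in the text, and keeping two components is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples, 3 features, the last two strongly correlated
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, b + 0.1 * rng.normal(size=200)])

# Standardize: zero mean and unit variance per column
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance matrix of Z (features x features)
cov = Z.T @ Z / (len(Z) - 1)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # sort from largest to smallest
eigvals, P_star = eigvals[order], eigvecs[:, order]

Z_star = Z @ P_star                  # all principal components
explained = eigvals / eigvals.sum()  # share of variance per component
Z_reduced = Z_star[:, :2]            # keep the two most informative components

print(explained)
```

The explained shares give a practical way to decide how many components to keep: add components until the cumulative share reaches a chosen level, such as 95% of the variance.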