ML Unit 4 Notes - NJ
UNIT-4
Cluster analysis, also known as clustering, is a method of data mining that groups similar data points
together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the
data points within each group are more similar to each other than to data points in other groups. This
process is often used for exploratory data analysis and can help identify patterns or relationships
within the data that may not be immediately obvious. There are many different algorithms used for
cluster analysis, such as k-means, hierarchical clustering, and density-based clustering. The choice
of algorithm will depend on the specific requirements of the analysis and the nature of the data being
analyzed.
Cluster Analysis is the process of finding similar groups of objects in order to form clusters. It is an unsupervised machine learning technique that acts on unlabelled data. A group of data points comes together to form a cluster in which all the objects belong to the same group. The given data is divided into different groups by combining similar objects into a group. This group is nothing but a cluster: a collection of similar data objects grouped together.
For example, consider a dataset of vehicles that contains information about different vehicles such as cars, buses, bicycles, etc. Because this is unsupervised learning, there are no class labels such as Car or Bike; all the data is mixed together and unstructured. The task is to convert this unlabelled data into labelled data, and this can be done using clusters. The main idea of cluster analysis is to arrange the data points into clusters, e.g., a cars cluster containing all the cars, a bikes cluster containing all the bikes, and so on. Simply put, it is the partitioning of similar objects applied to unlabelled data.
Properties of Clustering :
1. Clustering Scalability: Nowadays there is a vast amount of data, and applications have to deal with huge databases. In order to handle extensive databases, the clustering algorithm should be scalable; an algorithm that cannot scale to large data may produce inappropriate or misleading results.
2. High Dimensionality: The algorithm should be able to handle high-dimensional data as well as low-dimensional, small-sized data.
3. Algorithm Usability with multiple data kinds: A clustering algorithm should be capable of dealing with different types of data, such as discrete, categorical, interval-based and binary data.
4. Dealing with unstructured data: There would be some databases that contain missing values,
and noisy or erroneous data. If the algorithms are sensitive to such data then it may lead to poor
quality clusters. So it should be able to handle unstructured data and give some structure to the data
by organising it into groups of similar data objects. This makes the job of the data expert easier in
order to process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and usable.
The interpretability reflects how easily the data is understood.
Clustering Methods:
The clustering methods can be classified into the following categories:
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method: It is used to partition the data in order to form clusters. If “n” partitions are made on “p” objects of the database, then each partition is represented by a cluster and n ≤ p. The two conditions which need to be satisfied by this partitioning method are: (i) each partition (cluster) must contain at least one object, and (ii) each object must belong to exactly one partition.
Hierarchical Method: This method builds a hierarchy of clusters, either bottom-up by merging clusters (agglomerative approach) or top-down by splitting them (divisive approach).
• Divisive Approach: The divisive approach is also known as the top-down approach. In this approach, we start with all the data objects in a single cluster, which is divided into smaller clusters by continuous iteration. The iteration continues until a termination condition is met or until each cluster contains only one object.
• Once a group is split or merged, the operation can never be undone; hierarchical clustering is therefore a rigid and not very flexible method. The two approaches which can be used to improve the quality of hierarchical clustering in data mining are:
• One should carefully analyse the object linkages at every partitioning of the hierarchical clustering.
• One can integrate hierarchical agglomeration with another clustering approach: first the objects are grouped into micro-clusters, and then macro-clustering is performed on the micro-clusters.
Density-Based Method: The density-based method mainly focuses on density. In this method, a given cluster keeps growing as long as the density in its neighbourhood exceeds some threshold, i.e., for each data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
Grid-Based Method: In the grid-based method, the object space is quantized into a finite number of cells that form a grid structure. A major advantage of the grid-based method is its fast processing time, which depends only on the number of cells in each dimension of the quantized space rather than on the number of data objects.
Model-Based Method: In the model-based method, a model is hypothesized for each cluster in order to find the data that best fits that model. The clustering of the density function is used to locate the clusters for a given model. It reflects the spatial distribution of the data points and also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers and noise into account. Therefore, it yields robust clustering methods.
Common partitioning-based clustering algorithms include:
• K-Means
• K-Medoids
• K-Modes
1. K-Means Clustering
K-Means Clustering is a classical approach to Clustering. K-Means iteratively relocates the cluster
centers by computing the mean of a cluster.
• The distance between each data point and the cluster centers is calculated (Euclidean distance is commonly used), and each data point is assigned to the cluster whose center is closest to it.
• After all the data points are assigned to a cluster, the algorithm computes the mean of each cluster’s data points and relocates the cluster centers to the corresponding means. This process is repeated until the cluster centers no longer change.
The advantage of using K-Means is scalability: K-Means performs well on huge data. Its main disadvantage is sensitivity to outliers, which have a severe impact when computing the mean of a cluster. The clustering results also vary with the value of k and the initial choice of cluster centers. K-Means works well only for roughly spherical clusters and fails to perform well on arbitrary shapes of data.
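As a quick illustration of the steps above, here is a minimal sketch using scikit-learn's KMeans on synthetic data; the blob dataset, k = 3 and the random seeds are illustrative assumptions, not part of these notes.

# Minimal K-Means sketch (illustrative data and parameters)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Toy data with 3 roughly spherical groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# n_clusters = k; several restarts (n_init) reduce sensitivity to the initial centers
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)      # assign each point to its closest center
print(kmeans.cluster_centers_)      # final (relocated) cluster means
print(kmeans.inertia_)              # within-cluster sum of squared distances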
2. K-Modes Clustering
K-Means works well for continuous data. What about categorical data? K-Modes Clustering comes
to the rescue. The algorithm is very similar to K-Means, but instead of calculating the mean of a cluster, K-Modes calculates its mode (the most frequently occurring value), as sketched after the steps below.
• Initially, K-Modes chooses k cluster centers randomly.
• The similarity between each data point and the cluster centers is calculated, and each data point is assigned to the cluster with which it has the highest similarity.
• After all the data points are assigned to a cluster, the algorithm computes the mode of each cluster’s data points and relocates the cluster centers to the corresponding modes.
• This process is continued until the cluster centers no longer change.
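The sketch below runs K-Modes on a toy categorical dataset. It assumes the third-party kmodes package (not part of scikit-learn) is installed; the data and parameter values are illustrative.

# K-Modes sketch, assuming `pip install kmodes` (third-party package)
import numpy as np
from kmodes.kmodes import KModes

# Toy categorical dataset: (colour, size, shape)
X = np.array([
    ["red",   "small", "round"],
    ["red",   "small", "square"],
    ["blue",  "large", "round"],
    ["blue",  "large", "square"],
    ["green", "small", "round"],
])

km = KModes(n_clusters=2, init="Huang", n_init=5, random_state=42)
labels = km.fit_predict(X)      # assign each row to the cluster with the most similar mode
print(km.cluster_centroids_)    # the modes (most frequent category per attribute) of each cluster
print(labels)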
Density-based methods
1. DBSCAN
Terminologies of DBSCAN
• If a data point q is within the radius ϵ of another data point p then the data point q is
considered to be ϵ-neighborhood (epsilon neighborhood) of the data point p.
• A data point p is said to be a core object if its ϵ-neighborhood contains at least MinPts (Minimum Points) data points. For example, if MinPts = 5, p is said to be a core object if the ϵ-neighborhood of p contains at least 5 data points.
• A data point p is said to be directly density-reachable from a point q if p is within the ϵ-neighborhood of q and q is a core object.
How does it work? DBSCAN checks the ϵ-neighborhood of each data point. If p is a core object, i.e., the number of data points in its ϵ-neighborhood is at least MinPts, a new cluster is formed around the core object p. DBSCAN then collects directly density-reachable data points iteratively and may merge a few neighbouring clusters in the process. The process continues until there are no more points left to analyse.
The disadvantage of using DBSCAN is the presence of hyperparameters ϵ and MinPts. The results
produced vary according to the values opted for hyperparameters.
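The following minimal sketch runs DBSCAN with scikit-learn; the eps (ϵ) and min_samples (MinPts) values and the two-moons toy data are illustrative assumptions.

# DBSCAN sketch (illustrative hyperparameters)
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Non-spherical toy data where K-Means typically struggles
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5)   # eps = epsilon radius, min_samples = MinPts
labels = db.fit_predict(X)            # label -1 marks noise points
print(set(labels))                    # discovered clusters plus -1 for noise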
Gaussian Mixture Models (GMM)
Gaussian Mixture Models (GMMs) play a pivotal role in density estimation and clustering. Recognized as a robust statistical tool in machine learning and data science, GMMs excel in estimating density and clustering data.
Imagine blending multiple Gaussian distributions to form a single model. This is precisely what a
Gaussian Mixture Model does.
At its heart, GMM operates on the principle that a complex, multi-modal distribution can be
approximated by a combination of simpler Gaussian distributions, each representing a different
cluster within the data.
The essence of GMM lies in its ability to determine cluster characteristics such as mean, variance,
and weight.
The mean of each Gaussian component gives us a central point, around which the data points are
most densely clustered.
The variance, on the other hand, provides insight into the spread or dispersion of the data points
around this mean. A smaller variance indicates that the data points are closely clustered around the
mean, while a larger variance suggests a more spread-out cluster.
The weights in a GMM are particularly significant. They represent the proportion of the dataset that
belongs to each Gaussian component.
In a sense, these weights embody the strength or dominance of each cluster within the overall
mixture. Higher weights imply that a greater portion of the data aligns with that particular Gaussian
distribution, signifying its greater prominence in the model.
This triad of parameters — mean, variance, and weight — enables GMMs to model the data with
remarkable flexibility. By adjusting these parameters, a GMM can shape itself to fit a wide variety
of data distributions, whether they are tightly clustered, widely dispersed, or overlapping with one
another.
One of the most powerful aspects of GMMs is their capacity to compute the probability of each data
point belonging to a particular cluster.
This is achieved through a process known as ‘soft clustering’, as opposed to ‘hard clustering’
methods like K-Means.
In soft clustering, instead of forcefully assigning a data point to a single cluster, GMM assigns
probabilities that indicate the likelihood of that data point belonging to each of the Gaussian
components.
Algorithms
Model Representation
A GMM consists of several Gaussian components, each defined by its mean vector, covariance matrix, and weight, providing a comprehensive representation of the data distribution. The probability density function of a GMM is the sum of its components, each weighted accordingly.
Notation:
• K: Number of Gaussian components
• N: Number of data points
• D: Dimensionality of the data
GMM Parameters:
• Means (μ): Center locations of Gaussian components.
• Covariance Matrices (Σ): Define the shape and spread of each component.
• Weights (π): Probability of selecting each component.
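Using the notation above, the mixture density can be written in the standard form (a standard formulation, not copied from these notes):

p(x_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0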
Model Training
Training a GMM involves estimating its parameters from the available data. The Expectation-Maximization (EM) technique is often employed, alternating between the Expectation (E) and Maximization (M) steps until convergence.
Expectation-Maximization
During the E step, the model calculates the probability of each data point belonging to each Gaussian
component. The M step then adjusts the model’s parameters based on these probabilities.
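In standard notation (assumed here, not taken verbatim from these notes), with \gamma_{nk} denoting the responsibility of component k for data point x_n and N_k = \sum_{n=1}^{N} \gamma_{nk}:

E-step: \gamma_{nk} = \dfrac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

M-step: \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\,(x_n - \mu_k)(x_n - \mu_k)^{\top}, \qquad \pi_k = \frac{N_k}{N}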
Post-training, GMMs cluster data points based on the highest posterior probability. They are also
used for density estimation, assessing the probability density at any point in the feature space.
The code sketch below generates some sample data from two different normal distributions and uses a Gaussian Mixture Model from Scikit-learn to fit this data. It then predicts which cluster each data point belongs to and visualizes the data points with their respective clusters. The centers of the Gaussian components are marked with red ‘X’ symbols. The resulting plot provides a visual representation of how the GMM has clustered the data.
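A sketch consistent with this description is given below; the sample sizes, means and random seeds are illustrative assumptions.

# GMM sketch: two normal distributions, 2-component GaussianMixture, red 'X' at the component means
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Sample data from two different 2-D normal distributions
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.8, size=(200, 2)),
    rng.normal(loc=[4, 4], scale=0.8, size=(200, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
labels = gmm.predict(X)   # hard assignment: component with the highest posterior probability

plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.scatter(gmm.means_[:, 0], gmm.means_[:, 1], c="red", marker="X", s=200)
plt.title("GMM clustering of the sample data")
plt.show()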
After fitting the Gaussian Mixture Model to the data, a new data point at coordinates [2,2] is defined.
The predict_proba method of the GMM object is then used to calculate the probability of this new
data point belonging to each of the two clusters.
The resulting probabilities are printed, and the data points, Gaussian centers, and the new data point
are plotted for visualization.
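Continuing the sketch above, the soft-clustering query for the new point could look as follows (the exact probabilities depend on the fitted model):

# Probability of the new point [2, 2] belonging to each Gaussian component
new_point = np.array([[2, 2]])
probs = gmm.predict_proba(new_point)   # soft clustering: one probability per component
print(probs)

plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.scatter(gmm.means_[:, 0], gmm.means_[:, 1], c="red", marker="X", s=200)
plt.scatter(new_point[:, 0], new_point[:, 1], c="black", marker="*", s=200)
plt.show()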
Use Cases of Gaussian Mixture Models
GMMs find application in a diverse range of fields:
• Anomaly Detection: Identifying unusual data patterns.
• Image Segmentation: Grouping pixels in images based on color or texture.
• Speech Recognition: Assisting in the recognition of phonemes in audio data.
• Handwriting Recognition: Simulating different handwriting styles.
• Customer Segmentation: Grouping customers with similar behaviors or preferences.
• Data Clustering: Finding natural groups in data.
• Computer Vision: Object detection and background removal.
• Bioinformatics: Analyzing gene expression data.
• Recommendation Systems: Personalizing user experiences.
• Medical Imaging: Tissue classification and abnormality detection.
• Finance: Asset price modeling and risk management.
Advantages and Disadvantages of Gaussian Mixture Models
Advantages
• Flexibility in Data Representation: GMMs adeptly represent complex data structures.
• Probabilistic Approach: They provide probabilities for cluster assignments, aiding in
uncertainty estimation.
• Soft Clustering: GMMs offer probabilistic cluster assignments, allowing for more nuanced
data analysis.
• Effective in Overlapping Clusters: They accurately model data with overlapping clusters.
• Density Estimation Capabilities: Useful in understanding the underlying distribution of
data.
• Handling Missing Data: GMMs can estimate parameters even with incomplete data sets.
• Outlier Detection: Identifying data points that do not conform to the general pattern.
• Scalability and Simplicity: Effective in handling large datasets and relatively easy to
implement.
• Interpretable Parameters: Provides meaningful insights into cluster characteristics.
Disadvantages
• Challenges in Determining Component Number: Misjudgment in component number can
lead to overfitting or underfitting.
• Initialization Sensitivity: The outcome is influenced by the initial parameter settings.
• Assumption of Gaussian Distribution: Not always applicable if data do not adhere to
Gaussian distributions.
• Curse of Dimensionality: High-dimensional data can complicate the model.
• Convergence Issues: Problems arise when dealing with singular covariance matrices.
• Resource Intensive for Large Datasets: Computing and memory requirements can be
substantial.
BIRCH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) incrementally builds a Clustering Feature (CF) tree, a compact summary of the data, and then clusters the leaf entries of that tree. Its main parameters, used in the sketch that follows, are:
• threshold : the maximum radius that a sub-cluster in a leaf node of the CF tree may have; a new point starts a new sub-cluster if merging it would exceed this radius.
• n_clusters : the number of clusters to be returned after the entire BIRCH algorithm is complete, i.e., the number of clusters after the final clustering step. If set to None, the final clustering step is not performed and the intermediate clusters are returned.
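A minimal BIRCH sketch with scikit-learn; the threshold, branching_factor and n_clusters values below are illustrative assumptions.

# BIRCH sketch (illustrative parameter values)
from sklearn.datasets import make_blobs
from sklearn.cluster import Birch

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)          # build the CF tree, then run the final clustering step
print(len(birch.subcluster_centers_))  # sub-clusters stored in the leaves of the CF tree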
Affinity Propagation
Affinity Propagation is based on the concept of “message-passing” between data points to identify
cluster centres and assign data points to these centres automatically. It utilizes “exemplars,” which
are typical data points representing other data points within the same cluster. The objective of the
algorithm is to identify the most representative exemplars of the overall data and employ them to
cluster the data into compatible groups. This algorithm is especially suitable for data with numerous
clusters or data exhibiting intricate, non-linear distribution patterns.
The core of the algorithm is represented by three matrices: the Similarity Matrix (S), the Responsibility Matrix (R) and the Availability Matrix (A).
1. Similarity Matrix (S): The matrix that stores how similar every pair of data points is. S(i, k) indicates how well-suited data point ‘k’ is to be the exemplar for data point ‘i’, and is often taken as the negative squared Euclidean distance between them.
2. Responsibility Matrix (R): The matrix used to represent the suitability of one data point to be the exemplar (cluster centre) for another data point. R(i, k) is a value that indicates how well-suited data point ‘k’ is to be the exemplar for data point ‘i’.
3. Availability Matrix (A): The availability matrix is used to represent the “availability” of
each data point to serve as an exemplar for other data points. The calculation reflects the
competition between different data points to serve as exemplars and helps in determining
the most suitable exemplars for the clusters.
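The sketch below runs scikit-learn's AffinityPropagation on toy data; the damping value and the blob data are illustrative assumptions.

# Affinity Propagation sketch (illustrative data and damping)
from sklearn.datasets import make_blobs
from sklearn.cluster import AffinityPropagation

X, _ = make_blobs(n_samples=150, centers=4, random_state=42)

ap = AffinityPropagation(damping=0.9, random_state=42)
labels = ap.fit_predict(X)              # the number of clusters is found automatically
print(ap.cluster_centers_indices_)      # indices of the exemplar points
print(len(set(labels)), "clusters found")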
Mean-Shift Clustering
Mean shift is a clustering algorithm used in unsupervised learning that assigns data points to clusters iteratively by shifting points towards the mode (in the context of mean shift, the mode is the region of highest density of data points). For this reason, it is also known as the mode-seeking algorithm. The mean-shift algorithm has applications in the fields of image processing and computer vision.
Given a set of data points, the algorithm iteratively shifts each data point towards the nearest region of high density; the direction of each shift is determined by where most of the nearby points lie. In each iteration, every data point therefore moves closer to where most of the points are, which is, or will become, the cluster center. When the algorithm stops, each point is assigned to a cluster.
Unlike the popular K-Means cluster algorithm, mean-shift does not require specifying the number
of clusters in advance. The number of clusters is determined by the algorithm with respect to the
data.
Note: The downside to Mean Shift is that it is computationally expensive O(n²).
Mean-shift clustering is a non-parametric, density-based clustering algorithm that can be used to
identify clusters in a dataset. It is particularly useful for datasets where the clusters have arbitrary
shapes and are not well-separated by linear boundaries.
The basic idea behind mean-shift clustering is to shift each data point towards the mode (i.e., the
highest density) of the distribution of points within a certain radius. The algorithm iteratively
performs these shifts until the points converge to a local maximum of the density function. These
local maxima represent the clusters in the data.
The mean-shift clustering algorithm can be summarized as follows:
1. Initialize the data points as cluster centroids.
2. Repeat the following steps until convergence or a maximum number of iterations is reached:
• For each data point, calculate the mean of all points within a certain radius (i.e., the “kernel”) centered at the data point.
• Shift the data point to that mean.
3. Identify the cluster centroids as the points that no longer move after convergence.
4. Return the final cluster centroids and the assignments of data points to clusters.
One of the main advantages of mean-shift clustering is that it does not require the number of
clusters to be specified beforehand. It also does not make any assumptions about the distribution
of the data, and can handle arbitrary shapes and sizes of clusters. However, it can be sensitive to
the choice of kernel and the radius of the kernel.
Mean-Shift clustering can be applied to various types of data, including image and video
processing, object tracking and bioinformatics.
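A minimal mean-shift sketch with scikit-learn follows; estimating the bandwidth (kernel radius) from the data is an illustrative choice.

# Mean-shift sketch (bandwidth estimated from the data)
from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift, estimate_bandwidth

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=42)

bandwidth = estimate_bandwidth(X, quantile=0.2)   # radius of the kernel
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)                        # the number of clusters is not specified in advance
print(ms.cluster_centers_)                        # modes found by the algorithm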
Agglomerative Clustering
Agglomerative clustering is a bottom-up approach. It starts by treating each individual data point as a single cluster; clusters are then merged continuously based on similarity until one big cluster containing all objects is formed. It is good at identifying small clusters.
The steps for agglomerative clustering are as follows:
1. Compute the proximity matrix using a distance metric.
2. Use a linkage function to group objects into a hierarchical cluster tree based on the
computed distance matrix from the above step.
3. Data points with close proximity are merged together to form a cluster.
4. Repeat steps 2 and 3 until a single cluster remains.
The proximity matrix for n points x1, …, xn stores the pairwise distance d(xi, xj) between every pair of points.
In order to group the data points in a cluster, a linkage function is used where the values in the
proximity matrix are taken and the data points are grouped based on similarity. The newly
formed clusters are linked to each other until they form a single cluster containing all the data
points.
The most common linkage methods are as follows:
• Complete linkage: The maximum of all pairwise distances between elements in each pair of clusters is used to measure the distance between two clusters.
• Single linkage: The minimum of all pairwise distances between elements in each pair of clusters is used to measure the distance between two clusters.
• Average linkage: The average of all pairwise distances between elements in each pair of clusters is used to measure the distance between two clusters.
• Centroid linkage: The distance between the centroids of the two clusters is considered before merging.
• Ward’s Method: It uses squared error to compute the similarity of the two clusters for
merging.
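As an illustration of the linkage methods above, the sketch below builds a hierarchy with SciPy; the toy data and the choice of Ward linkage are assumptions for the example.

# Agglomerative (hierarchical) clustering sketch with SciPy
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# linkage computes the proximity (distance) matrix internally and merges
# the closest clusters step by step until a single cluster remains
Z = linkage(X, method="ward")   # also try "single", "complete", "average", "centroid"
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the hierarchy into 2 flat clusters
print(labels)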
Divisive Clustering
Divisive clustering works just the opposite of agglomerative clustering. It starts by considering all the data points as one big cluster and then keeps splitting it into smaller clusters until every data point is in its own cluster. Thus, it is good at identifying large clusters. It follows a top-down approach and can be more efficient than agglomerative clustering when a complete hierarchy is not needed, but because of its implementation complexity it does not have a predefined implementation in any of the major machine learning frameworks.
Steps in Divisive Clustering
1. Consider all the data points as a single cluster.
2. Split it into clusters using any flat-clustering method, say K-Means.
3. Choose the best cluster among the resulting clusters to split further, e.g., the one that has the largest Sum of Squared Errors (SSE).
4. Repeat steps 2 and 3 until each data point is in its own cluster or the desired number of clusters is reached, as sketched in the code below.
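Since the major frameworks do not ship a divisive implementation, the following is a minimal illustrative sketch of these steps, using K-Means for each split and the SSE to pick the cluster to split; the function names here are hypothetical.

# Minimal divisive (bisecting) clustering sketch, not a library routine
import numpy as np
from sklearn.cluster import KMeans

def sse(points):
    # Sum of squared errors of a cluster around its mean
    return float(((points - points.mean(axis=0)) ** 2).sum())

def divisive_clustering(X, n_clusters):
    clusters = [X]                                    # start with all points in one cluster
    while len(clusters) < n_clusters:
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        target = clusters.pop(worst)                  # split the cluster with the largest SSE
        labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(target)
        clusters.extend([target[labels == 0], target[labels == 1]])
    return clusters

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (30, 2)),
               rng.normal(4, 0.5, (30, 2)),
               rng.normal(8, 0.5, (30, 2))])
for i, c in enumerate(divisive_clustering(X, 3)):
    print(f"cluster {i}: {len(c)} points")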
Hierarchical clustering can be used for several applications, ranging from customer segmentation to object
recognition.