Clustering in Data Mining

Clustering in data mining is a technique that groups similar data points together based on
their features and characteristics.

It can also be described as the process of grouping a set of objects so that objects in the
same group (called a cluster) are more similar to each other than to those in other groups
(clusters).

It is an unsupervised learning technique that aims to identify similarities and patterns in a
dataset.

Clustering algorithms typically require choosing the number of clusters, a similarity (or
distance) measure, and a clustering method.

These algorithms aim to group data points in a way that maximizes similarity within each
group and minimizes similarity between different groups.

A cluster can have the following properties -

 The data points within a cluster are similar to each other based on some pre-defined
criteria or similarity measures.
 The clusters are distinct from each other, and the data points in one cluster are
different from those in another cluster.
 The data points within a cluster are closely packed together.
 A cluster is often represented by a centroid or a center point that summarizes the
properties of the data points within the cluster.
 A cluster can have any number of data points, but a good cluster should not be too
small or too large.

Requirements of clustering in data mining:

Clustering is a critical technique in the data mining process, and a useful clustering
algorithm should meet the following requirements -
 Scalability
Clustering algorithms in data mining can handle large datasets efficiently, making it possible
to extract useful insights and knowledge from massive amounts of data.
 High Dimensionality
Clustering algorithms in data mining can efficiently handle high-dimensional datasets,
making it possible to find patterns and relationships that may not be apparent in lower
dimensions.
 Discovery of Clusters with Arbitrary Shape
Clustering algorithms in data mining can discover clusters that have different shapes and
sizes, making it possible to identify groups of data points that share common properties or
features.
 Interpretability
Clustering results can be easily interpreted by humans, making it possible to extract useful
insights and knowledge from the data.
 Ability to Deal with Different Kinds of Data
Clustering algorithms in data mining can handle different types of data, such as categorical,
numerical, and binary, making it possible to cluster a wide range of data types.

Clustering Methods
Clustering methods can be classified into the following categories −

 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method

Partitioning Method: This method makes partitions on the data in order to form clusters. If
"n" partitions are made on "p" objects of the database, then each partition is represented by
a cluster, where n ≤ p. Two conditions must be satisfied by this partitioning clustering
method:
 Each object must belong to exactly one group.
 Each group must contain at least one object.
The partitioning method includes a technique called iterative relocation, in which objects
are moved from one group to another to improve the partitioning.
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data
objects is created. Hierarchical methods are classified on the basis of how the hierarchical
decomposition is formed. There are two approaches for creating the hierarchical
decomposition:
 Agglomerative Approach: The agglomerative approach is also known as the bottom-up
approach. Initially, each object forms its own separate group. The method then keeps
merging the objects or groups that are close to one another, i.e., that exhibit similar
properties. This merging process continues until the termination condition holds.
 Divisive Approach: The divisive approach is also known as the top-down approach. In
this approach, we start with all the data objects in the same cluster. The cluster is then
divided into smaller clusters by continuous iteration. The iteration continues until the
termination condition is met or until each cluster contains one object.

Density-Based Method: The density-based method mainly focuses on density. In this
method, a given cluster keeps growing as long as the density in its neighbourhood exceeds
some threshold, i.e., for each data point within a given cluster, the neighbourhood of a given
radius has to contain at least a minimum number of points.
Grid-Based Method: In the grid-based method, the object space is quantized into a finite
number of cells that form a grid structure. The major advantage of this method is its fast
processing time, which depends only on the number of cells in each dimension of the
quantized space, not on the number of data objects.
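To make the quantization idea concrete, here is a minimal sketch in Python; the 10x10 grid, the density threshold of 10 points, and the random toy data are all illustrative assumptions, and real grid-based algorithms (such as STING or CLIQUE) additionally store statistics per cell and merge dense neighbouring cells.

import numpy as np

# Toy 2-D dataset; any numeric object space would work the same way.
rng = np.random.default_rng(0)
points = rng.random((500, 2))

# Quantize the object space into a finite number of cells (a 10x10 grid).
counts, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=10)

# Cells whose point count exceeds a density threshold are cluster candidates.
dense_cells = np.argwhere(counts >= 10)
print(len(dense_cells), "dense cells out of", counts.size)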

Model-Based Method: In the model-based method, a model is hypothesized for each of the
clusters in order to find the data that best fit that model. Clusters are located using a
density function, which reflects the spatial distribution of the data points. This method also
provides a way to automatically determine the number of clusters based on standard
statistics, taking outliers or noise into account, and therefore yields robust clustering
methods.
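As a hedged sketch (not from the original text), model-based clustering can be illustrated with a Gaussian mixture model in scikit-learn, where the BIC serves as one "standard statistic" for automatically choosing the number of clusters; the toy data and the candidate range 1-5 are assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(200, 2)                         # toy 2-D data
bics = []
for k in range(1, 6):                              # candidate numbers of clusters
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics.append(gm.bic(X))                         # lower BIC = better model fit
best_k = int(np.argmin(bics)) + 1                  # pick k with the lowest BIC
labels = GaussianMixture(n_components=best_k, random_state=0).fit_predict(X)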
Constraint-Based Method: Constraint-based clustering is performed by incorporating
application- or user-oriented constraints. A constraint expresses the user's expectations or
the properties of the desired clustering results. Constraints, which can be specified by the
user or by the application requirements, provide an interactive way of communicating with
the clustering process.

Applications Of Cluster Analysis:

 It is widely used in image processing, data analysis, and pattern
recognition.
 It helps marketers find distinct groups in their customer base and characterize those
groups by their purchasing patterns.
 It can be used in the field of biology, by deriving animal and plant
taxonomies and identifying genes with the same capabilities.
 It also helps in information discovery by classifying documents on the
web.

Advantages of Cluster Analysis:

1. It can help identify patterns and relationships within a dataset that may not be
immediately obvious.
2. It can be used for exploratory data analysis and can help with feature selection.
3. It can be used to reduce the dimensionality of the data.
4. It can be used for anomaly detection and outlier identification.
5. It can be used for market segmentation and customer profiling.

Disadvantages of Cluster Analysis:

1. It can be sensitive to the choice of initial conditions and the number of clusters.
2. It can be sensitive to the presence of noise or outliers in the data.
3. It can be difficult to interpret the results of the analysis if the clusters are not
well-defined.
4. It can be computationally expensive for large datasets.
5. The results of the analysis can be affected by the choice of clustering algorithm used.
6. It is important to note that the success of cluster analysis depends on the data, the
goals of the analysis, and the ability of the analyst to interpret the results.

Partitioning Method
 K-means Clustering
K-means clustering is a partitioning method that divides the data points into k
clusters, where k is a pre-defined number. It works by iteratively moving the centroid
of each cluster to the mean of the data points assigned to it until convergence. K-
means aims to minimize the sum of squared distances between each data point and its
assigned cluster centroid.

K-Means Clustering Algorithm-

The K-Means clustering algorithm involves the following steps:

Step 1: Choose the number of clusters, K.

Step 2: Randomly select K data points as cluster centers.

Step 3: Using the Euclidean distance formula, measure the distance between each data
point and each cluster center.

Step 4: Assign each data point to the cluster whose center is nearest to it.

Step 5: Re-compute the centers of the newly formed clusters. The center of a cluster is
computed by taking the mean of all the data points contained in that cluster.

Step 6: Keep repeating Steps 3 to 5 until any of the following stopping criteria is met -

 The data points remain in the same clusters

 The maximum number of iterations is reached

 The centers of the newly formed clusters do not change
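The steps above can be sketched in a few lines of Python with NumPy. This is a minimal illustration rather than a production implementation: the function name kmeans and its arguments are made up for this example, and it assumes Euclidean distance and that no cluster ever becomes empty.

import numpy as np

def kmeans(points, centers, max_iter=100):
    """Run K-means from the given initial centers; return (centers, labels)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):                      # Step 6: repeat Steps 3-5
        # Step 3: Euclidean distance from every point to every center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Step 4: assign each point to its nearest center
        labels = dists.argmin(axis=1)
        # Step 5: each new center is the mean of the points assigned to it
        new_centers = np.array([points[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        if np.allclose(new_centers, centers):      # stop: centers unchanged
            break
        centers = new_centers
    return centers, labels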

Figure – K-means clustering flowchart
Example
Let's consider the data points P1(1,3), P2(2,2), P3(5,8), P4(8,5), P5(3,9), P6(10,7),
P7(3,3), P8(9,4), P9(3,7).

We take K = 3 and assume that the initial cluster centers are P7(3,3), P9(3,7), and P8(9,4),
labelled C1, C2, and C3. We will find the new centroids after 2 iterations for the above data
points.


Step 1
Find the distance between each data point and the centroids; each data point is then
assigned to the cluster whose centroid is nearest.

Iteration 1
Calculate the distance between the data points and the centers (C1, C2, C3).

For P1,

C1P1 =>(3,3)(1,3) => sqrt[(1–3)²+(3–3)²] => sqrt[4] =>2

C2P1 =>(3,7)(1,3)=> sqrt[(1–3)²+(3–7)²] => sqrt[20] =>4.5

C3P1 =>(9,4)(1,3) => sqrt[(1–9)²+(3–4)²]=> sqrt[65] =>8.1

For P2,

C1P2 =>(3,3)(2,2) => sqrt[(2–3)²+(2–3)²] => sqrt[2] =>1.4

C2P2 =>(3,7)(2,2)=> sqrt[(2–3)²+(2–7)²] => sqrt[26] =>5.1

C3P2 =>(9,4)(2,2) => sqrt[(2–9)²+(2–4)²]=> sqrt[53] =>7.3

For P3,

C1P3 =>(3,3)(5,8) => sqrt[(5–3)²+(8–3)²] => sqrt[29] =>5.4

C2P3 =>(3,7)(5,8)=> sqrt[(5–3)²+(8–7)²] => sqrt[5] =>2.2

C3P3 =>(9,4)(5,8) => sqrt[(5–9)²+(8–4)²]=> sqrt[32] =>5.7

Similarly for the other distances.


Cluster 1 => P1(1,3) , P2(2,2) , P7(3,3)

Cluster 2 => P3(5,8) , P5(3,9) , P9(3,7)

Cluster 3 => P4(8,5) , P6(10,7) , P8(9,4)

Now, we re-compute the new cluster centers; each new center is computed by taking the

mean of all the points contained in that particular cluster.

New center of Cluster 1 => ((1+2+3)/3 , (3+2+3)/3) => (2, 2.7)

New center of Cluster 2 => ((5+3+3)/3 , (8+9+7)/3) => (3.7, 8)

New center of Cluster 3 => ((8+10+9)/3 , (5+7+4)/3) => (9, 5.3)

Iteration 1 is over. Now, let us take the new center points and repeat the same steps:

calculate the distance between the data points and the new center points with the

Euclidean formula and find the cluster groups.


Iteration 2

Calculate the distance between the data points and the new centers (C1, C2, C3).

C1(2,2.7) , C2(3.7,8) , C3(9,5.3)

C1P1 =>(2,2.7)(1,3) => sqrt[(1–2)²+(3–2.7)²] => sqrt[1.09] =>1.0

C2P1 =>(3.7,8)(1,3)=> sqrt[(1–3.7)²+(3–8)²] => sqrt[32.29] =>5.7

C3P1 =>(9,5.3)(1,3) => sqrt[(1–9)²+(3–5.3)²]=> sqrt[69.29] =>8.3

Similarly for the other distances.

Cluster 1 => P1(1,3) , P2(2,2) , P7(3,3)

Cluster 2 => P3(5,8) , P5(3,9) , P9(3,7)

Cluster 3 => P4(8,5) , P6(10,7) , P8(9,4)


Center of Cluster 1 => ((1+2+3)/3 , (3+2+3)/3) => (2, 2.7)

Center of Cluster 2 => ((5+3+3)/3 , (8+9+7)/3) => (3.7, 8)

Center of Cluster 3 => ((8+10+9)/3 , (5+7+4)/3) => (9, 5.3)

We got the same centroids and cluster groups, which indicates that the algorithm has

converged. K-Means stops iterating because the same clusters repeat, so there is no need to

continue; the last iteration gives the final cluster groups for this dataset.
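As a quick check, the kmeans sketch from earlier can be run on these nine points with the same starting centers (this usage assumes that function and the numpy import are defined as above):

pts = [(1,3), (2,2), (5,8), (8,5), (3,9), (10,7), (3,3), (9,4), (3,7)]
init = [(3,3), (3,7), (9,4)]               # C1 = P7, C2 = P9, C3 = P8
centers, labels = kmeans(pts, init)
print(np.round(centers, 1))                # [[2.  2.7] [3.7 8. ] [9.  5.3]]
print(labels)                              # [0 0 1 2 1 2 0 2 1]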

Figure – Comparing iterations 1 and 2: the centroids move between the two iterations.

 Hierarchical Clustering
Hierarchical clustering in data mining is a method that builds a tree-like hierarchy of
clusters, either by merging smaller clusters into larger ones (agglomerative or bottom-
up) or by splitting larger clusters into smaller ones (divisive or top-down). It does not
require a pre-defined number of clusters.

Types of Hierarchical Clustering


Basically, there are two types of hierarchical clustering:
1. Agglomerative Clustering
2. Divisive clustering
1. Agglomerative Clustering
Initially, consider every data point as an individual cluster and, at every
step, merge the nearest pairs of clusters (it is a bottom-up method). At
first, every data point is considered an individual entity or cluster. At
every iteration, clusters merge with other clusters until one cluster is
formed.
The algorithm for Agglomerative Hierarchical Clustering is:
 Consider every data point as an individual cluster
 Calculate the similarity of each cluster with all the other clusters
(compute the proximity matrix)
 Merge the clusters that are highly similar or close to each other
 Recalculate the proximity matrix for the merged clusters
 Repeat Steps 3 and 4 until only a single cluster remains.
Let’s see a graphical representation of this algorithm using a
dendrogram.
Note: This is just a demonstration of how the algorithm works; no
calculation has been performed below, and all the proximities among the
clusters are assumed.
Let’s say we have six data points A, B, C, D, E, and F.

Agglomerative Hierarchical clustering

 Step-1: Consider each alphabet as a single cluster and calculate the
distance of one cluster from all the other clusters.
 Step-2: In the second step, comparable clusters are merged together to
form a single cluster. Let’s say cluster (B) and cluster (C) are very
similar to each other, so we merge them in the second step; similarly
for clusters (D) and (E). At last, we get the clusters [(A), (BC),
(DE), (F)].
 Step-3: We recalculate the proximity according to the algorithm and
merge the two nearest clusters ([(DE), (F)]) together to form new
clusters as [(A), (BC), (DEF)].
 Step-4: Repeating the same process, the clusters DEF and BC are
comparable and are merged together to form a new cluster. We’re now
left with the clusters [(A), (BCDEF)].
 Step-5: At last, the two remaining clusters are merged together to form
a single cluster [(ABCDEF)].
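For reference, SciPy's hierarchy module implements this bottom-up procedure. The sketch below assumes made-up 2-D coordinates for the six points A to F, so the merge order only approximately mirrors the walkthrough above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical coordinates for the six points A, B, C, D, E, F.
X = np.array([[0, 0], [1, 0], [1.2, 0.3],
              [5, 5], [5.2, 5.1], [9, 9]])
Z = linkage(X, method='single')                   # merge nearest clusters, bottom-up
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the dendrogram into 3 clusters
print(labels)                                     # A, B, C together; D, E together; F alone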
2. Divisive Hierarchical clustering
We can say that Divisive Hierarchical clustering is precisely
the opposite of Agglomerative Hierarchical clustering. In Divisive
Hierarchical clustering, we take all of the data points as a single
cluster and, in every iteration, separate out the data points that are
not comparable to the rest of their cluster. In the end, we are left
with N clusters, one for each data point.

Hierarchical clustering has several advantages over other clustering
methods
 The ability to handle non-convex clusters and clusters of different sizes
and densities.
 The ability to handle missing data and noisy data.
 The ability to reveal the hierarchical structure of the data, which can be
useful for understanding the relationships among the clusters.
Drawbacks of Hierarchical Clustering
 The need for a criterion to stop the clustering process and determine
the final number of clusters.
 The computational cost and memory requirements of the method can
be high, especially for large datasets.
 The results can be sensitive to the initial conditions, linkage criterion,
and distance metric used.
In summary, Hierarchical clustering is a method of data mining that
groups similar data points into clusters by creating a hierarchical
structure of the clusters.
 This method can handle different types of data and reveal the
relationships among the clusters. However, it can have a high
computational cost, and the results can be sensitive to the linkage
criterion and distance metric used.
Density-based clustering
Density-based clustering refers to a method that is based on a local cluster criterion, such
as density-connected points.
There are two parameters used in density-based clustering: Eps, the radius of the
neighbourhood around a point, and MinPts, the minimum number of points required within
that radius.
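A brief sketch with scikit-learn's DBSCAN, the classic density-based algorithm, shows both parameters in action; eps corresponds to the neighbourhood radius and min_samples to the minimum point count, and the toy points are illustrative.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3],             # a dense group
              [8, 7], [8, 8], [25, 80]])          # another group plus an outlier
labels = DBSCAN(eps=3, min_samples=2).fit_predict(X)
print(labels)                                     # [0 0 0 1 1 -1]; -1 marks noise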
