Unit 5: Clustering
What is Cluster Analysis?
n Cluster: a collection of data objects that are similar (or related) to one another within the same group
n Cluster analysis is a multivariate method which aims to classify a sample of subjects (or objects), on the basis of a set of measured variables, into a number of different groups such that similar subjects are placed in the same group.
n An example where this might be used is in psychiatry, where characterizing patients on the basis of clusters of symptoms can be useful in identifying an appropriate form of therapy.
n In marketing, it may be useful to identify distinct groups of potential customers so that, for example, advertising can be appropriately targeted.
WARNING ABOUT CLUSTER ANALYSIS
What is Clustering in Data Mining?
n Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters
n Cluster: a collection of data objects that are similar to one another
Clustering for Data Understanding and Applications
n Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
n Information retrieval: document clustering
n Land use: identification of areas of similar land use in an earth observation database
n Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
n City planning: identifying groups of houses according to their house type, value, and geographical location
n Earthquake studies: observed earthquake epicenters should be clustered along continent faults
n Climate: understanding Earth's climate; finding patterns in atmospheric and ocean data
n Economic science: market research
Clustering as a Preprocessing Tool (Utility)
n Summarization:
n Preprocessing for regression, PCA, classification, and
association analysis
n Compression:
n Image processing: vector quantization
n Finding K-nearest Neighbors
n Localizing search to one or a small number of clusters
n Outlier detection
n Outliers are often viewed as those “far away” from any
cluster
Basic Steps to Develop a Clustering Task
n Feature selection / preprocessing
n Select the information relevant to the task of interest
n Aim for minimal information redundancy
n May need to do normalization/standardization
n Distance/similarity measure
n Measures the similarity of two feature vectors
n Clustering criterion
n Expressed via a cost function or some rules
n Clustering algorithm
n Choice of algorithm
n Validation of the results
n Interpretation of the results with applications
Distance or Similarity Measures
n Common distance measures:
n Manhattan distance: dist(X, Y) = Σ_i |x_i − y_i|
n Euclidean distance: dist(X, Y) = sqrt( Σ_i (x_i − y_i)² )
n Cosine similarity: sim(X, Y) = Σ_i (x_i × y_i) / ( sqrt(Σ_i x_i²) × sqrt(Σ_i y_i²) ), with dist(X, Y) = 1 − sim(X, Y)
n A small code sketch of these measures is given below.
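As a small illustration (an addition, not part of the original slides), these three measures can be written directly in Python; the functions below assume numeric feature vectors of equal length.

import math

def manhattan(x, y):
    # Sum of absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_sim(x, y):
    # Dot product divided by the product of the vector norms
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

def cosine_dist(x, y):
    # dist(X, Y) = 1 - sim(X, Y), as defined above
    return 1.0 - cosine_sim(x, y)

print(manhattan([2, 10], [5, 8]))   # 5
print(euclidean([2, 10], [5, 8]))   # 3.605...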
Quality: What Is Good Clustering?
Approaches to cluster analysis
n There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:
n Non-hierarchical (partitioning) methods
n Density-based approach:
n Based on connectivity and density functions
n Hierarchical approach:
n Create a hierarchical decomposition of the set of data (or objects) using some criterion
n Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
n Model-based approach:
n A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to these models
n Grid-based approach:
n Based on a multiple-level granularity structure
n Constraint-based clustering:
n The user may give inputs on constraints
n Use domain knowledge to determine input parameters
n Interpretability and usability
n Others:
n Discovery of clusters with arbitrary shape
n High dimensionality
5.2 K-means Clustering
Partitioning Algorithms: Basic Concept
n Partitioning method: Partitioning a database D of n objects into a set of k
clusters, such that the sum of squared distances is minimized (where ci is
the centroid or medoid of cluster Ci)
E = Σ_{i=1..k} Σ_{p ∈ Ci} (p − ci)²
n Given k, find a partition of k clusters that optimizes the chosen partitioning
criterion
n Global optimal: exhaustively enumerate all partitions
n Heuristic methods: k-means and k-medoids algorithms
n k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by
the center of the cluster
n k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
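The k-means procedure itself appears only as a figure in the original slides. The following is a minimal Python sketch of the standard method (Lloyd's algorithm), written for illustration: points are repeatedly assigned to their nearest centroid, and each centroid is then recomputed as the mean of its cluster, until nothing changes.

import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    # points: list of numeric tuples, e.g. [(2, 10), (5, 8), ...]
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # pick k initial seeds
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = []
        for j in range(k):
            if clusters[j]:
                new_centroids.append(tuple(sum(c) / len(clusters[j])
                                           for c in zip(*clusters[j])))
            else:
                new_centroids.append(centroids[j])    # keep the seed of an empty cluster
        if new_centroids == centroids:                # converged: no change
            break
        centroids = new_centroids
    return clusters, centroids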
An Example of K-Means Clustering
K = 3
Suppose that the initial seeds (centers of each cluster) are A1, A4 and A7.
Run the k-means algorithm for 1 epoch only. At the end of this epoch show:
a) The new clusters (i.e. the examples belonging to each cluster)
b) The centers of the new clusters
Solution:
a)
d(a,b) denotes the Euclidean distance between a and b. It is obtained directly from the distance matrix or calculated as follows:
d(a,b) = sqrt( (x_b − x_a)² + (y_b − y_a)² )
seed1 = A1 = (2,10), seed2 = A4 = (5,8), seed3 = A7 = (1,2)
epoch 1 – start:
new clusters: 1: {A1}, 2: {A3, A4, A5, A6, A8}, 3: {A2, A7}
d) How many more iterations are needed to converge?
Draw the result for each epoch.
Example
Subject    A     B
   1      1.0   1.0
   2      1.5   2.0
   3      3.0   4.0
   4      5.0   7.0
   5      3.5   5.0
   6      4.5   5.0
   7      3.5   4.5
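As a hedged illustration (the slides do not state which value of k or which library is intended here), the seven subjects above can be clustered with k = 2 using scikit-learn's KMeans:

import numpy as np
from sklearn.cluster import KMeans

# The Subject table above as a 7 x 2 array of (A, B) values
subjects = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
                     [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(subjects)
print(km.labels_)            # cluster index assigned to each subject
print(km.cluster_centers_)   # the two cluster centroids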
K-Means Example: Document Clustering
Initial arbitrary assignment (k = 3): C1 = {D1, D2}, C2 = {D3, D4}, C3 = {D5, D6}

Document-term matrix:
      T1  T2  T3  T4  T5
D1     0   3   3   0   2
D2     4   1   0   1   2
D3     0   4   0   0   2
D4     0   3   0   3   3
D5     0   1   3   0   1
D6     2   2   0   0   4
D7     1   0   3   2   0
D8     3   1   0   0   2

Cluster centroids:
      T1   T2   T3   T4   T5
C1   4/2  4/2  3/2  1/2  4/2
C2   0/2  7/2  0/2  3/2  5/2
C3   2/2  3/2  3/2  0/2  5/2

Now compute the similarity (or distance) of each item to each cluster, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure).

       D1    D2    D3    D4    D5    D6    D7    D8
C1   29/2  29/2  24/2  27/2  17/2  32/2  15/2  24/2
C2   31/2  20/2  38/2  45/2  12/2  34/2   6/2  17/2
C3   28/2  21/2  22/2  24/2  17/2  30/2  11/2  19/2
Example (Continued)
For each document, reallocate the document to the cluster to which it has the highest similarity (the largest entry in its column of the similarity matrix above). After the reallocation we have the following new clusters. Note that the previously unassigned D7 and D8 have been assigned, and that D1 and D6 have been reallocated from their original assignment.
Example (Continued)
New clusters: C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}

Now compute the new cluster centroids using the original document-term matrix:

      T1    T2    T3    T4    T5
C1   8/3   2/3   3/3   3/3   4/3
C2   2/4  12/4   3/4   3/4  11/4
C3   0/1   1/1   3/1   0/1   1/1

This leads to a new cluster-document similarity matrix similar to the previous one. Again, the items are reallocated to the clusters with highest similarity.

       D1     D2     D3     D4     D5     D6     D7     D8
C1    7.67  15.01   5.34   9.00   5.00  12.00   7.67  11.34
C2   16.75  11.25  17.50  19.50   8.00   6.68   4.25  10.00
C3   14.00   3.00   6.00   6.00  11.00   9.34   9.00   3.00

Note: This process is now repeated with the new clusters. However, the next iteration in this example will show no change to the clusters, thus terminating the algorithm.
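For readers who want to check the arithmetic, the NumPy snippet below (an addition, not from the slides) reproduces the second-iteration centroids and dot-product similarities shown above.

import numpy as np

# Document-term matrix D1..D8 from the example above
docs = np.array([[0, 3, 3, 0, 2],   # D1
                 [4, 1, 0, 1, 2],   # D2
                 [0, 4, 0, 0, 2],   # D3
                 [0, 3, 0, 3, 3],   # D4
                 [0, 1, 3, 0, 1],   # D5
                 [2, 2, 0, 0, 4],   # D6
                 [1, 0, 3, 2, 0],   # D7
                 [3, 1, 0, 0, 2]])  # D8

# Clusters after the first reallocation (0-based document indices)
clusters = {"C1": [1, 6, 7], "C2": [0, 2, 3, 5], "C3": [4]}

for name, members in clusters.items():
    centroid = docs[members].mean(axis=0)   # mean term vector of the cluster
    sims = docs @ centroid                  # dot-product similarity to every document
    print(name, centroid, np.round(sims, 2))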
K-Means Algorithm
n Strengths of k-means:
n Relatively efficient: O(tkn), where n is # of objects, k is # of clusters, and t is # of iterations. Normally k, t << n
n Comment: it often terminates at a local optimum rather than the global optimum
Comments on the K-Means Method
n Dissimilarity calculations
What Is the Problem of the K-Means Method?
PAM: A Typical K-Medoids Algorithm
1. Arbitrarily choose k objects as the initial medoids
2. Assign each remaining object to the nearest medoid
3. Repeat until no change:
   - Randomly select a non-medoid object, O_random
   - Compute the total cost of swapping a current medoid O with O_random
   - Swap O and O_random if the quality of the clustering is improved
[Figure: scatter plots (axes 0–10) illustrating these steps; total cost = 20]
The K-Medoid Clustering Method
n PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
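A naive Python sketch of the swap-based search described above (an illustration, not the slides' code, using Euclidean distance) also makes the scaling problem visible: every pass evaluates a swap of each medoid with each non-medoid object.

import math

def total_cost(points, medoids):
    # Sum, over all points, of the distance to the nearest medoid
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, initial_medoids):
    medoids = list(initial_medoids)
    improved = True
    while improved:
        improved = False
        best_cost = total_cost(points, medoids)
        for i in range(len(medoids)):
            for o in points:
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]   # swap medoid i with o
                cost = total_cost(points, candidate)
                if cost < best_cost:                              # keep the best improving swap
                    best_cost, medoids, improved = cost, candidate, True
    return medoids

Each pass requires on the order of k(n − k) cost evaluations, which is why PAM is practical only for small data sets.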
A Disk Version of k-means
BIRCH
5.3 Hierarchical Clustering
Hierarchical Clustering
n Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
[Figure: agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, merging {a}, {b}, {c}, {d}, {e} into {a,b} and {d,e}, then {c,d,e}, and finally {a,b,c,d,e}; divisive clustering (DIANA) proceeds in the reverse direction, from Step 4 back to Step 0]
Hierarchical Clustering Algorithms
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters, until only one cluster remains
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there
are k clusters)
AGNES (Agglomerative Nesting)
n Introduced in Kaufmann and Rousseeuw (1990)
n Implemented in statistical packages, e.g., Splus
n Use the single-link method and the dissimilarity matrix
n Merge nodes that have the least dissimilarity
n Go on in a non-descending fashion
n Eventually all nodes belong to the same cluster
Hierarchical Agglomerative Clustering: Example
[Figure: six two-dimensional points (1–6) and the resulting dendrogram; leaf order 3, 6, 4, 1, 2, 5, with merge heights between about 0.05 and 0.4]
DIANA (Divisive Analysis)
Distance between Clusters
Single Link Method
n The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster
n It can find arbitrarily shaped clusters, but
n It may cause the undesirable “chain effect” due to noisy points
[Figure: two natural clusters are split into two]
Distance between two clusters:
D_sl(Ci, Cj) = min{ d(x, y) : x ∈ Ci, y ∈ Cj }
Complete Link Method
n The distance between two clusters is the distance between the two furthest data points in the two clusters
n It is sensitive to outliers because they are far away
Distance between two clusters:
D_cl(Ci, Cj) = max{ d(x, y) : x ∈ Ci, y ∈ Cj }
Average link and centroid methods
n Average link: a compromise between
n the sensitivity of complete-link clustering to outliers, and
n the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects
n In this method, the distance between two clusters is the average of all pairwise distances between the data points in the two clusters.
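As an illustration (assuming SciPy is available; the slides do not prescribe a library), single-, complete- and average-link clusterings of the Subject A/B data from the earlier example can be computed as follows.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Subject A/B data from the earlier k-means example
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                      # merge history (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
    print(method, labels)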
Extensions to Hierarchical Clustering
n Major weakness of agglomerative clustering methods
Density-Based Methods
Density-Based Clustering Methods
n Can handle noise
n Need only one scan of the data
n “How can we find dense regions in density-based clustering?”
n The density of an object o can be measured by the number of objects close to o
Density-Based Clustering: Basic Concepts
n Two parameters:
n Eps: maximum radius of the neighbourhood
n MinPts: minimum number of points in an Eps-neighbourhood of that point
n N_Eps(p) = {q belongs to D | dist(p, q) ≤ Eps}
n Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
n p belongs to N_Eps(q)
n core point condition: |N_Eps(q)| ≥ MinPts
[Figure: points p and q, with MinPts = 5]
Density-Reachable and Density-Connected
n Density-reachable:
n A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that p(i+1) is directly density-reachable from p(i)
n Density-connected:
n A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
n Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
n Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border and outlier points, with Eps = 1 cm and MinPts = 5]
DBSCAN: The Algorithm
n Arbitrarily select a point p
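The remaining steps of the algorithm are only sketched in the original slides. Below is a minimal, naive Python implementation of DBSCAN that follows the definitions above (an illustration with an O(n²) neighbourhood search; eps and min_pts stand for Eps and MinPts).

import math

def region_query(points, p, eps):
    # N_Eps(p): indices of all points within distance eps of point p (including p itself)
    return [q for q in range(len(points)) if math.dist(points[p], points[q]) <= eps]

def dbscan(points, eps, min_pts):
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for p in range(len(points)):
        if labels[p] is not UNVISITED:
            continue
        neighbours = region_query(points, p, eps)
        if len(neighbours) < min_pts:          # not a core point
            labels[p] = NOISE
            continue
        labels[p] = cluster_id                 # start a new cluster from core point p
        queue = list(neighbours)
        while queue:
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster_id         # border point reached from a core point
            if labels[q] is not UNVISITED:
                continue
            labels[q] = cluster_id
            q_neighbours = region_query(points, q, eps)
            if len(q_neighbours) >= min_pts:   # q is also a core point: keep expanding
                queue.extend(q_neighbours)
        cluster_id += 1
    return labels                              # cluster index per point, -1 for noise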
5.5 Issues: Evaluation, Scalability, Comparison
Determine the Number of Clusters
n Empirical method
n # of clusters ≈ sqrt(n/2) for a dataset of n points
n Elbow method
n Use the turning point in the curve of the sum of within-cluster variance
n Cross-validation method
n E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set
n For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k's, and find the # of clusters that fits the data best
n A small code sketch of the elbow heuristic is given below.
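A small sketch of the elbow heuristic (an addition, assuming scikit-learn is available; any k-means implementation would do): run k-means for a range of k values and look for the bend in the within-cluster sum of squares.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(200, 2)   # placeholder data; substitute your own

for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # inertia_ = total within-cluster sum of squared distances
# Plot k against inertia and pick the k at the bend ("elbow") of the curve.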
Measuring Clustering Quality
Measuring Clustering Quality: Extrinsic Methods