Chap15 Cluster Analysis
Cluster Analysis
Clustering: The Main Idea
• Goal: Form groups (clusters) of similar records
How to Define Inter-Cluster Similarity
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
Cluster Similarity: MIN or Single Link
• Similarity of two clusters is based on the two
most similar (closest) points in the different
clusters
- Determined by one pair of points, i.e., by one link in
the distance graph.
Hierarchical Clustering: MIN
Cluster Similarity: MAX or Complete Linkage
• Similarity of two clusters is based on
the two least similar (most distant)
points in the different clusters
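The MIN and MAX definitions above can be compared with a minimal sketch. The two clusters are hypothetical points made up for illustration, and Euclidean distance is an assumption; the slides do not fix a distance measure.

```python
from math import dist  # Euclidean distance, Python 3.8+

def single_link(a, b):
    """MIN: distance between the two closest points, one from each cluster."""
    return min(dist(p, q) for p in a for q in b)

def complete_link(a, b):
    """MAX: distance between the two most distant points, one from each cluster."""
    return max(dist(p, q) for p in a for q in b)

# Hypothetical 2-D clusters (illustration only).
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

min_d = single_link(cluster_a, cluster_b)    # closest pair: (1, 0) and (4, 0)
max_d = complete_link(cluster_a, cluster_b)  # farthest pair: (0, 0) and (6, 0)
print(min_d, max_d)
```

Both measures scan the same set of cross-cluster pairs; they differ only in which pair they keep.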
Hierarchical Clustering: MAX
Cluster Similarity: Group Average
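• The distance (proximity) between two clusters is the average of the pairwise distances between points in the two clusters:
$$\mathrm{proximity}(\mathrm{Cluster}_i, \mathrm{Cluster}_j) = \frac{\sum_{p_i \in \mathrm{Cluster}_i,\; p_j \in \mathrm{Cluster}_j} \mathrm{proximity}(p_i, p_j)}{|\mathrm{Cluster}_i| \cdot |\mathrm{Cluster}_j|}$$
• Need to use the average rather than the total, since the total grows with the number of point pairs and so favors large clusters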
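As a rough sketch of the group-average definition, the function below sums every cross-cluster pairwise distance and divides by the number of pairs. The points and the choice of Euclidean distance are assumptions for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+

def group_average(a, b):
    """Average of all pairwise distances between points of cluster a and cluster b."""
    total = sum(dist(p, q) for p in a for q in b)
    return total / (len(a) * len(b))  # divide by |a| * |b| pairs

# Hypothetical 2-D clusters (illustration only).
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

avg_d = group_average(cluster_a, cluster_b)  # pairwise distances: 4, 6, 3, 5
print(avg_d)
```

Dividing by the pair count is what keeps the measure comparable across clusters of different sizes.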
Hierarchical Clustering: Group Average
Hierarchical Clustering: Limitations
• Once a decision is made to combine two clusters, it
cannot be undone
• No objective function is directly minimized
• High time and space complexity: the proximity matrix needs O(N²) space, and the basic algorithm takes O(N³) time
• Different schemes have problems with one or more of
the following:
- Sensitivity to noise and outliers
- Biased towards globular clusters
- Difficulty handling different sized clusters and convex shapes
- Breaking large clusters
The Hierarchical Clustering Steps (Using Agglomerative Method)
• Records 12 & 21 are closest and form the first cluster
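The agglomerative procedure can be sketched in a few lines: start with every record as its own cluster, then repeatedly merge the closest pair. This is a naive illustration, not the implementation behind the slides. The function name `agglomerate`, the sample points, and the use of Euclidean distance are all assumptions.

```python
from math import dist  # Euclidean distance, Python 3.8+

def agglomerate(points, linkage=min):
    """Naive agglomerative clustering; returns the merge history.

    linkage=min gives single link (MIN), linkage=max gives complete link (MAX).
    """
    clusters = {i: [i] for i in range(len(points))}  # cluster id -> member indices
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        # Find the pair of clusters with the smallest inter-cluster distance.
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                d = linkage(dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters.pop(b)  # merge b into a
    return merges

# Hypothetical records (illustration only).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 3.0)]
history = agglomerate(points)
for left, right, d in history:
    print(left, right, round(d, 3))
```

Each entry of `history` corresponds to one join in a dendrogram: the two groups merged and the distance at which they merged.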
Reading the Dendrogram
• See process of clustering: Lines connected lower
down are merged earlier
- 10 and 13 will be merged next, after 12 & 21
• Determining number of clusters: For a given “distance between clusters”, draw a horizontal line at that height; each vertical line it crosses corresponds to one cluster
- E.g., at distance of 4.6 (red line in next slide), data can be
reduced to 2 clusters -- The smaller of the two is circled
- At distance of 3.6 (green line) data can be reduced to 6
clusters, including the circled cluster
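Cutting the dendrogram can be expressed as a count: every merge at or below the cut height has already happened, and each merge reduces the number of clusters by one. The sketch below assumes merge heights that never decrease (true for single, complete, and group-average linkage). The heights are hypothetical and are not taken from the dendrogram in the slides.

```python
def clusters_at_cut(n_records, merge_heights, cut):
    """Number of clusters left when the dendrogram is cut at height `cut`."""
    merged = sum(1 for h in merge_heights if h <= cut)  # merges already done
    return n_records - merged

# Hypothetical dendrogram: 6 records, so 5 merges in total.
heights = [1.0, 1.5, 2.2, 3.0, 4.1]

low_cut = clusters_at_cut(6, heights, 2.5)   # 3 merges at or below 2.5
high_cut = clusters_at_cut(6, heights, 3.5)  # 4 merges at or below 3.5
print(low_cut, high_cut)
```

Raising the cut height can only keep or reduce the number of clusters, which matches reading the dendrogram from bottom to top.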
Validating Clusters
Interpretation
• Goal: obtain meaningful and useful clusters
• Caveats:
- Random chance can often produce apparent clusters
- Different cluster methods produce different results
• Solutions:
- Obtain summary statistics
- Also review clusters in terms of variables not used in
clustering
- Label the cluster (e.g. clustering of financial firms in
2008 might yield label like “midsize, sub-prime loser”)
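One simple form of summary statistics is the mean of each variable within each cluster, which often suggests a label. The sketch below is an illustration under assumed inputs: `rows`, `labels`, and the helper `cluster_means` are hypothetical names, and the data is made up.

```python
from collections import defaultdict
from statistics import mean

def cluster_means(rows, labels):
    """Mean of each variable (column) within each cluster label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {label: [mean(col) for col in zip(*members)]
            for label, members in grouped.items()}

# Hypothetical records with two variables each, and their cluster labels.
rows = [[1.0, 10.0], [3.0, 30.0], [10.0, 0.0]]
labels = [0, 0, 1]

summary = cluster_means(rows, labels)
print(summary)
```

The same function can be applied to variables that were not used in the clustering, which is the second check suggested above.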
Desirable Cluster Features
Importance of Choosing Initial Centroids …
[Figure: scatter plots of the cluster assignments at successive iterations (panels include Iteration 1 and Iteration 2); axes x and y]
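The effect of the initial centroids can be reproduced with a small sketch of the basic K-means loop (assign each point to its nearest centroid, then recompute centroids). The four points and both sets of starting centroids are hypothetical, chosen so that the two runs converge to different solutions. Euclidean distance is assumed.

```python
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(points, centroids, max_iter=100):
    """Basic K-means from given initial centroids; returns (centroids, SSE)."""
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its old centroid).
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
               for i, g in enumerate(groups)]
        if new == centroids:
            break
        centroids = new
    sse = sum(min(dist(p, c) for c in centroids) ** 2 for p in points)
    return centroids, sse

# Hypothetical data: corners of a 4-by-1 rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

_, sse_good = kmeans(points, [(0.0, 0.5), (4.0, 0.5)])  # splits left / right
_, sse_bad = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])   # splits bottom / top
print(sse_good, sse_bad)
```

Both runs stop because the centroids no longer move, yet the second one is stuck in a much worse solution. This is why K-means is commonly run from several starting points.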
Evaluating K-means Clusters
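Most common measure is the Sum of Squared Error (SSE)
For each point, the error is the distance to the nearest cluster centroid
To get SSE, we square these errors and sum them:
$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$
where x is a data point in cluster C_i and m_i is the centroid (representative point) of cluster C_i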
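The SSE can be computed directly from a clustering, as in the minimal sketch below. It assumes that each cluster's representative point is its mean and that the distance is Euclidean. The clusters are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def sse(clusters):
    """Sum over clusters of squared distances from each point to its centroid."""
    total = 0.0
    for cluster in clusters:
        m = centroid(cluster)
        total += sum(dist(m, x) ** 2 for x in cluster)
    return total

# Hypothetical clustering with K = 2 (illustration only).
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],              # centroid (1, 0): squared errors 1 + 1
    [(5.0, 5.0), (5.0, 7.0), (5.0, 9.0)],  # centroid (5, 7): squared errors 4 + 0 + 4
]
total_sse = sse(clusters)
print(total_sse)
```

Given two clusterings of the same data with the same K, the one with the smaller SSE is usually preferred.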
We chose k = 3
4 of the 8 variables are shown
Distance Between Clusters

Distance between cluster   Cluster-1     Cluster-2     Cluster-3
Cluster-1                  0             5.03216253    3.16901457
Cluster-2                  5.03216253    0             3.76581196
Cluster-3                  3.16901457    3.76581196    0

Cluster      #Obs   Average distance in cluster
Cluster-1    12     1748.348058
Cluster-2    3      907.6919822
Cluster-3    7      3625.242085
Overall      22     2230.906692