
Chapter 14

Cluster Analysis
Clustering: The Main Idea
• Goal: Form groups (clusters) of similar records
• Used for segmenting markets into groups of similar customers
• Example: Claritas segmented US neighborhoods based on demographics
  & income: “Furs & station wagons,” “Money & Brains”, …
Other Applications
• Periodic table of the elements
• Classification of species
• Grouping securities in portfolios
• Grouping firms for structural analysis of
economy
• Army uniform sizes
Example: Public Utilities
• Goal: find clusters of similar utilities
• Data: 22 firms, 8 variables
- Fixed-charge covering ratio
- Rate of return on capital
- Cost per kilowatt capacity
- Annual load factor
- Growth in peak demand
- Sales
- % nuclear
- Fuel costs per kwh
Company Fixed_charge RoR Cost Load_factor Demand Sales Nuclear Fuel_Cost
Arizona 1.06 9.2 151 54.4 1.6 9077 0 0.628
Boston 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
Central 1.43 15.4 113 53 3.4 9212 0 1.058
Commonwealth 1.02 11.2 168 56 0.3 6423 34.3 0.7
Con Ed NY 1.49 8.8 192 51.2 1 3300 15.6 2.044
Florida 1.32 13.5 111 60 -2.2 11127 22.5 1.241
Hawaiian 1.22 12.2 175 67.6 2.2 7642 0 1.652
Idaho 1.1 9.2 245 57 3.3 13082 0 0.309
Kentucky 1.34 13 168 60.4 7.2 8406 0 0.862
Madison 1.12 12.4 197 53 2.7 6455 39.2 0.623
Nevada 0.75 7.5 173 51.5 6.5 17441 0 0.768
New England 1.13 10.9 178 62 3.7 6154 0 1.897
Northern 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
Oklahoma 1.09 12 96 49.8 1.4 9673 0 0.588
Pacific 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4
Puget 1.16 9.9 252 56 9.2 15991 0 0.62
San Diego 0.76 6.4 136 61.9 9 5714 8.3 1.92
Southern 1.05 12.6 150 56.7 2.7 10140 0 1.108
Texas 1.16 11.7 104 54 -2.1 13507 0 0.636
Wisconsin 1.2 11.8 148 59.9 3.5 7287 41.1 0.702
United 1.04 8.6 204 61 3.5 6650 0 2.116
Virginia 1.07 9.3 174 54.3 5.9 10093 26.6 1.306
Sales & Fuel Cost
• Scatter plot of Sales vs. Fuel Cost: 3 rough clusters can be seen
  - High fuel cost, low sales
  - Low fuel cost, high sales
  - Low fuel cost, low sales
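As a rough illustration, this kind of plot can be reproduced with matplotlib; the subset of firms below is copied from the table above (the full 22 rows could be used instead):

```python
import matplotlib.pyplot as plt

# A subset of the 22 utilities from the table above: (Sales, Fuel_Cost)
firms = {
    "Arizona": (9077, 0.628), "Boston": (5088, 1.555),
    "Nevada": (17441, 0.768), "Puget": (15991, 0.620),
    "New England": (6154, 1.897), "United": (6650, 2.116),
    "Texas": (13507, 0.636), "Oklahoma": (9673, 0.588),
}

sales = [s for s, _ in firms.values()]
fuel = [f for _, f in firms.values()]

plt.scatter(sales, fuel)
for name, (x, y) in firms.items():
    plt.annotate(name, (x, y), fontsize=8)
plt.xlabel("Sales")
plt.ylabel("Fuel cost per kWh")
plt.title("Sales & Fuel Cost")
plt.show()
```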


Extension to
More Than 2 Dimensions

• In prior example, clustering was done by eye


• Multiple dimensions require formal algorithm
with
- A distance measure
- A way to use the distance measure in forming
clusters
• We will consider two algorithms:
hierarchical and non-hierarchical
Hierarchical
Clustering
Hierarchical Clustering
• A way to use the distance measure in
forming clusters
• Produces a set of nested clusters
organized as a hierarchical tree
• Can be visualized as a dendrogram
  - A tree-like diagram that records the sequences of merges or splits
Strengths of Hierarchical Clustering
• Do not have to assume any particular
number of clusters
- Any desired number of clusters can be
obtained by ‘cutting’ the dendrogram at the
proper level
• They may correspond to meaningful
taxonomies
- Example in biological sciences (e.g., animal
kingdom, phylogeny reconstruction, …)
Types of Hierarchical
Clustering
• Agglomerative:
- Start with the points as individual clusters
- At each step, merge the closest pair of clusters until only
one cluster (or k clusters) left
• Divisive:
- Start with one, all-inclusive cluster
- At each step, split a cluster until each cluster contains a
point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or
distance matrix
- Merge or split one cluster at a time
Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward
1. Let each data point be a cluster
2. Compute the distance matrix
3. Repeat
Merge the two closest clusters
Update the distance matrix
Until only a single cluster remains
• Key operation is the computation of the distance of two
clusters
- Different approaches to defining the distance between clusters
distinguish the different algorithms
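The basic algorithm above maps directly onto SciPy's hierarchical clustering routines. A minimal sketch, using random data as a stand-in for the normalized utility records ('average' linkage is one of several choices discussed below):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Toy stand-in for the 22 normalized utility records (rows = records)
rng = np.random.default_rng(0)
X = rng.normal(size=(22, 8))

d = pdist(X, metric="euclidean")      # step 2: pairwise distance matrix
Z = linkage(d, method="average")      # step 3: merge closest clusters until one remains

# Each row of Z records one merge: (cluster i, cluster j, merge distance, new size)
print(Z[:5])

dendrogram(Z)                         # visualize the full merge sequence
plt.show()
```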
Starting Situation
• Start with clusters of individual points and a
distance matrix
Intermediate Situation
• After some merging steps, we have
some clusters
Intermediate Situation
• We want to merge the two closest clusters
(C2 and C5) and update the distance matrix.
After Merging
• The question is “How do we update the
distance matrix?”
Measuring Distance
Between Records
Distance Between Two Records
• Euclidean Distance is most popular:

  $d_{ij} = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \cdots + (x_{ip}-x_{jp})^2}$

  where records i and j are measured on p variables
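A minimal sketch of this distance for two records from the utilities table, using only the raw Sales and Fuel_Cost columns; it also previews why normalization (next slides) matters:

```python
import math

# Arizona and Boston on two raw variables from the table: Sales, Fuel_Cost
arizona = [9077, 0.628]
boston = [5088, 1.555]

d = math.sqrt(sum((a - b) ** 2 for a, b in zip(arizona, boston)))
print(d)  # ~3989: the Sales gap swamps the Fuel_Cost gap on the raw scales
```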
Normalizing
• Problem: Raw distance measures are highly influenced by scale of
  measurements
• Solution: normalize (standardize) the data first
  - Subtract mean, divide by std. deviation
  - Also called z-scores
Example: Normalization
• For 22 utilities:
  - Avg. sales = 8,914
  - Std. dev. = 3,550
• Normalized score for Arizona sales:
  (9,077 - 8,914)/3,550 = 0.046
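The same calculation for the full Sales column, sketched in NumPy (the mean and standard deviation come out to roughly the rounded values quoted above):

```python
import numpy as np

# Sales column for all 22 utilities, in the order of the table above
sales = np.array([9077, 5088, 9212, 6423, 3300, 11127, 7642, 13082, 8406,
                  6455, 17441, 6154, 7179, 9673, 6468, 15991, 5714, 10140,
                  13507, 7287, 6650, 10093])

# z-score: subtract the mean, divide by the standard deviation
z = (sales - sales.mean()) / sales.std(ddof=1)

print(sales.mean(), sales.std(ddof=1))  # approximately 8,914 and 3,550
print(round(z[0], 3))                   # Arizona: approximately 0.046
```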
For Categorical Data: Similarity
• To measure the distance between two records in terms of 0/1 variables,
  create a table of counts:

                 Record j = 0   Record j = 1
  Record i = 0        a              b
  Record i = 1        c              d

• Similarity metrics based on this table:


- Matching similarity = (a+d)/(a+b+c+d)
- Jaccard similarity = d/(b+c+d)
  Use Jaccard in cases where a matching “1” is much stronger evidence of
  similarity than a matching “0”
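A small sketch of both measures for two hypothetical 0/1 records (the vectors below are made up for illustration):

```python
def binary_similarities(x, y):
    """Matching and Jaccard similarity for two equal-length 0/1 records."""
    a = sum(xi == 0 and yi == 0 for xi, yi in zip(x, y))  # both 0
    b = sum(xi == 0 and yi == 1 for xi, yi in zip(x, y))  # x is 0, y is 1
    c = sum(xi == 1 and yi == 0 for xi, yi in zip(x, y))  # x is 1, y is 0
    d = sum(xi == 1 and yi == 1 for xi, yi in zip(x, y))  # both 1
    matching = (a + d) / (a + b + c + d)
    jaccard = d / (b + c + d) if (b + c + d) else 0.0
    return matching, jaccard

# Two hypothetical records measured on six 0/1 variables
print(binary_similarities([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1]))
# -> (0.666..., 0.5)
```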
Other Distance Measures
• Correlation-based similarity
• Statistical distance (Mahalanobis)
• Manhattan distance (absolute
differences)
• Maximum coordinate distance
• Gower’s similarity (for mixed variable
types: continuous & categorical)
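Most of these are available in scipy.spatial.distance; a short sketch on two raw records from the utilities table (Mahalanobis additionally needs the inverse covariance matrix of the full data set, and Gower's similarity is not in SciPy but is available in third-party packages):

```python
import numpy as np
from scipy.spatial import distance

u = np.array([1.06, 9.2, 151.0, 54.4])   # Arizona, first four raw variables
v = np.array([0.89, 10.3, 202.0, 57.9])  # Boston, first four raw variables

print(distance.euclidean(u, v))    # baseline for comparison
print(distance.cityblock(u, v))    # Manhattan: sum of absolute differences
print(distance.chebyshev(u, v))    # maximum coordinate distance
print(distance.correlation(u, v))  # 1 - Pearson correlation of the profiles
# Mahalanobis (statistical distance) also needs the inverse covariance
# matrix VI of the full data set: distance.mahalanobis(u, v, VI)
```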
Measuring Distance Between
Clusters
How to Define Inter-Cluster Similarity

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
Cluster Similarity: MIN or Single Link
• Similarity of two clusters is based on the two
most similar (closest) points in the different
clusters
- Determined by one pair of points, i.e., by one link in
the distance graph.
Hierarchical Clustering: MIN
Cluster Similarity: MAX or Complete
Linkage
• Similarity of two clusters is based on
the two least similar (most distant)
points in the different clusters
Hierarchical Clustering: MAX
Cluster Similarity: Group Average
• Distance between two clusters is the average of the pairwise distances
  between points in the two clusters:

  $\text{proximity}(\text{Cluster}_i, \text{Cluster}_j) =
  \dfrac{\sum_{p_i \in \text{Cluster}_i} \sum_{p_j \in \text{Cluster}_j}
  \text{proximity}(p_i, p_j)}{|\text{Cluster}_i| \times |\text{Cluster}_j|}$
• Need to use average connectivity for scalability
since total distance favors large clusters
Hierarchical Clustering: Group Average
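A sketch of how the linkage choice alone can change the result: the same synthetic 2-D data is cut into three clusters under single (MIN), complete (MAX), group average, and centroid linkage, using SciPy's names for these schemes:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three synthetic blobs in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, (10, 2)),
               rng.normal([5, 0], 0.5, (10, 2)),
               rng.normal([0, 5], 0.5, (10, 2))])

# Only the linkage method changes between runs
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)              # 'single' = MIN, 'complete' = MAX
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, np.bincount(labels)[1:])     # resulting cluster sizes
```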
Hierarchical Clustering: Limitations
• Once a decision is made to combine two clusters, it
cannot be undone
• No objective function is directly minimized
• High time and space complexity (storing the full distance matrix alone
  requires O(N²) space)
• Different schemes have problems with one or more of
the following:
- Sensitivity to noise and outliers
- Biased towards globular clusters
- Difficulty handling different sized clusters and convex shapes
- Breaking large clusters
The Hierarchical Clustering Steps (Using
Agglomerative Method)

• Dendrogram, from bottom up, illustrates the process

Records 12 & 21
are closest &
form first cluster
Reading the Dendrogram
• See process of clustering: Lines connected lower
down are merged earlier
- 10 and 13 will be merged next, after 12 & 21
• Determining number of clusters: for a given “distance between
  clusters,” a horizontal line drawn at that height cuts the dendrogram
  into clusters that are at least that far apart
- E.g., at distance of 4.6 (red line in next slide), data can be
reduced to 2 clusters -- The smaller of the two is circled
- At distance of 3.6 (green line) data can be reduced to 6
clusters, including the circled cluster
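Cutting the dendrogram at a given distance corresponds to SciPy's fcluster with criterion='distance'. A sketch with random stand-in data; with the actual normalized utility table, thresholds of 4.6 and 3.6 would give the 2 and 6 clusters described above, while the counts printed here depend on the random data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = rng.normal(size=(22, 8))             # stand-in for the normalized utilities

Z = linkage(X, method="average")

# 'Cutting' the dendrogram: every merge above the threshold is undone,
# so a lower threshold yields more clusters
for threshold in (4.6, 3.6):
    labels = fcluster(Z, t=threshold, criterion="distance")
    print(threshold, "->", labels.max(), "clusters")
```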
Validating Clusters
Interpretation
• Goal: obtain meaningful and useful clusters
• Caveats:
- Random chance can often produce apparent clusters
- Different cluster methods produce different results
• Solutions:
- Obtain summary statistics
- Also review clusters in terms of variables not used in
clustering
- Label the cluster (e.g. clustering of financial firms in
2008 might yield label like “midsize, sub-prime loser”)
Desirable Cluster Features

• Stability – are clusters and cluster assignments sensitive to slight
  changes in inputs? Are cluster assignments in partition B similar to
  partition A?
• Separation – check ratio of between-cluster variation to
  within-cluster variation (higher is better)
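One way to check both properties, sketched with scikit-learn on synthetic data (the 5% perturbation, k = 3, and the use of the adjusted Rand index to compare partitions are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))

# Stability: cluster, perturb the inputs slightly, cluster again, compare
labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    X + rng.normal(scale=0.05, size=X.shape))
print("agreement (adjusted Rand):", adjusted_rand_score(labels_a, labels_b))

# Separation: ratio of between-cluster to within-cluster variation
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
within = km.inertia_                            # within-cluster sum of squares
total = ((X - X.mean(axis=0)) ** 2).sum()       # total sum of squares
print("between/within ratio:", (total - within) / within)
```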
Nonhierarchical Clustering:
K-Means Clustering
K-Means Clustering Algorithm
1. Choose # of clusters desired, k
2. Start with a partition into k clusters
   - Often based on random selection of k centroids
3. At each step, move each record to the cluster with the closest centroid
4. Recompute centroids, repeat step 3
5. Stop when moving records would increase within-cluster dispersion
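A minimal NumPy sketch of these steps; it stops when the centroids stop moving, i.e., when no further reassignment would reduce within-cluster dispersion (the empty-cluster guard is a practical detail, not part of the slide's outline):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: start from k randomly chosen records as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: move each record to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute the centroids (keep the old one if a cluster empties)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # Step 5: stop once no record moves (centroids unchanged)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
labels, centroids = kmeans(X, k=2)
print(np.bincount(labels), centroids)
```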
K-means Algorithm:
Choosing k and Initial Partitioning
• Choose k based on how the results will be used
  - e.g., “How many market segments do we want?”
• Also experiment with slightly different k’s
• Initial partition into clusters can be random, or based on domain
  knowledge
  - If random partition, repeat the process with different random
    partitions
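With scikit-learn, the repeated random partitions are handled by the n_init parameter; a sketch that also tries a few values of k on stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))    # stand-in for the normalized records

# Experiment with slightly different k's; n_init repeats the random
# initial partition several times and keeps the best (lowest-SSE) run
for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```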
Importance of Choosing Initial Centroids

[Figure: six scatter plots of the same 2-D data set (x vs. y) at
iterations 1–6 of k-means, from one choice of initial centroids]
Importance of Choosing Initial Centroids …

[Figure: scatter plots of the same 2-D data set (x vs. y) at iterations
1–5 of k-means, starting from a different choice of initial centroids]
Evaluating K-means Clusters
• Most common measure is Sum of Squared Error (SSE)
  - For each point, the error is the distance to the nearest cluster
  - To get SSE, we square these errors and sum them over the K clusters:

    $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \text{dist}^2(m_i, x)$

  - x is a data point in cluster C_i and m_i is the representative point
    for cluster C_i
  - Can show that m_i corresponds to the center (mean) of the cluster
• Given two clusterings, we can choose the one with the smallest error
• A good clustering with smaller K can have a lower SSE than a poor
  clustering with higher K
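A small sketch of the SSE computation on stand-in data (scikit-learn's KMeans reports the same quantity as its inertia_ attribute):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances of each point to its own cluster's centroid."""
    return sum(((X[labels == i] - m) ** 2).sum()
               for i, m in enumerate(centroids))

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
labels = rng.integers(0, 3, size=50)                        # some clustering
centroids = np.array([X[labels == i].mean(axis=0) for i in range(3)])
print(sse(X, labels, centroids))
```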
XLMiner Output: Cluster Centroids

Cluster      Fixed_charge   RoR    Cost   Load_factor
Cluster-1    0.89           10.3   202    57.9
Cluster-2    1.43           15.4   113    53
Cluster-3    1.06           9.2    151    54.4
We chose k = 3
4 of the 8 variables are shown
Distance Between Clusters

Distance between clusters   Cluster-1     Cluster-2     Cluster-3
Cluster-1                   0             5.03216253    3.16901457
Cluster-2                   5.03216253    0             3.76581196
Cluster-3                   3.16901457    3.76581196    0

Clusters 1 and 2 are relatively well separated from each other, while
cluster 3 is not as well separated
Within-Cluster Dispersion
Data summary (In Original coordinates)

Cluster     #Obs   Average distance in cluster
Cluster-1   12     1748.348058
Cluster-2   3      907.6919822
Cluster-3   7      3625.242085
Overall     22     2230.906692

Clusters 1 and 2 are relatively tight, cluster 3 very loose


Conclusion: Clusters 1 & 2 well defined, not so for cluster 3

Next step: try again with k=2 or k=4
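One plausible reading of "average distance in cluster" is the average distance of each record to its cluster's centroid, measured in the original units; a sketch on stand-in data (the assignment vector here is arbitrary, just to show the computation):

```python
import numpy as np

def average_distance_in_cluster(X, labels):
    """Average Euclidean distance of each record to its cluster's centroid."""
    result = {}
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        result[c] = np.linalg.norm(members - centroid, axis=1).mean()
    return result

# Stand-in data: 22 records, 8 variables, an arbitrary 3-cluster assignment
rng = np.random.default_rng(5)
X = rng.normal(size=(22, 8))
labels = rng.integers(0, 3, size=22)
print(average_distance_in_cluster(X, labels))
```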


Applications

• Data Exploration and Understanding


• Market Segmentation
• Multiple Regression / Classification
models
• Characterization of Normality in Novelty
Detection
Summary
• Cluster analysis is an exploratory tool. Useful only when it
produces meaningful clusters
• Hierarchical clustering gives visual representation of different
levels of clustering
- On the other hand, due to its non-iterative nature, it can be unstable,
  can vary highly depending on the settings, and is computationally
  expensive
• Non-hierarchical is computationally cheap and more stable;
requires user to set k
• Can use both methods
• Be wary of chance results; data may not have definitive “real”
clusters
