
Cluster Analysis

Clustering: The Main Idea


Goal: form groups (clusters) of similar records
Used for segmenting markets into groups of similar customers
Example: Claritas segmented US neighborhoods into clusters that capture dominant lifestyles, such as "Furs & Station Wagons" and "Money & Brains", to target products and services

Applications

Mendeleyev's periodic table of the elements
Classification of species
Grouping securities in portfolios
Grouping firms for structural analysis of the economy
Designing army uniform sizes

Example: Public Utilities


Goal: find clusters of similar utilities
Data: 22 firms, 8 variables:

Fixed-charge covering ratio
Rate of return on capital
Cost per kilowatt capacity
Annual load factor
Growth in peak demand
Sales
% nuclear
Fuel cost per kWh

Company        Fixed_charge  RoR   Cost  Load_factor  Demand_growth  Sales  Nuclear  Fuel_Cost
Arizona        1.06          9.2   151   54.4          1.6           9077   0        0.628
Boston         0.89          10.3  202   57.9          2.2           5088   25.3     1.555
Central        1.43          15.4  113   53            3.4           9212   0        1.058
Commonwealth   1.02          11.2  168   56            0.3           6423   34.3     0.7
Con Ed NY      1.49          8.8   192   51.2          1             3300   15.6     2.044
Florida        1.32          13.5  111   60           -2.2           11127  22.5     1.241
Hawaiian       1.22          12.2  175   67.6          2.2           7642   0        1.652
Idaho          1.1           9.2   245   57            3.3           13082  0        0.309
Kentucky       1.34          13    168   60.4          7.2           8406   0        0.862
Madison        1.12          12.4  197   53            2.7           6455   39.2     0.623
Nevada         0.75          7.5   173   51.5          6.5           17441  0        0.768
New England    1.13          10.9  178   62            3.7           6154   0        1.897
Northern       1.15          12.7  199   53.7          6.4           7179   50.2     0.527
Oklahoma       1.09          12    96    49.8          1.4           9673   0        0.588
Pacific        0.96          7.6   164   62.2         -0.1           6468   0.9      1.4
Puget          1.16          9.9   252   56            9.2           15991  0        0.62
San Diego      0.76          6.4   136   61.9          9             5714   8.3      1.92
Southern       1.05          12.6  150   56.7          2.7           10140  0        1.108
Texas          1.16          11.7  104   54           -2.1           13507  0        0.636
Wisconsin      1.2           11.8  148   59.9          3.5           7287   41.1     0.702
United         1.04          8.6   204   61            3.5           6650   0        2.116
Virginia       1.07          9.3   174   54.3          5.9           10093  26.6     1.306

Sales & Fuel Cost: 3 rough clusters can be seen

High fuel cost, low sales
Low fuel cost, high sales
Low fuel cost, low sales

Extension to More Than 2 Dimensions


In the prior example, clustering was done by eye
Multiple dimensions require a formal algorithm with:

A distance measure
A way to use the distance measure in forming clusters

We will consider two algorithms: hierarchical and non-hierarchical

Hierarchical Clustering

Hierarchical Methods
Agglomerative Methods

Begin with n clusters (each record is its own cluster)
Keep joining the closest records/clusters until only one cluster (the entire data set) remains
Most popular approach

Divisive Methods

Start with one all-inclusive cluster
Repeatedly divide it into smaller clusters

A dendrogram shows the cluster hierarchy

Measuring Distance
Between records Between clusters

Measuring Distance Between Records

Properties of Distance

Denote by d_ij the distance between records X_i and X_j
Non-negative: d_ij ≥ 0
Self-proximity: d_ii = 0
Symmetry: d_ij = d_ji

Distance Between Two Records


Euclidean distance is the most popular:

d_ij = sqrt[ (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2 ]

Other Distance Measures

Manhattan distance (sum of absolute differences):
d_ij = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

Maximum coordinate distance:
d_ij = max_m |x_im - x_jm|, over the measurements m = 1, ..., p

Euclidean Distance

(Figure: two records X_i = (x_i1, x_i2) and X_j = (x_j1, x_j2) plotted in the plane; the coordinate differences form the legs a and b of a right triangle, and the Euclidean distance is the hypotenuse d.)

Distance in three dimensions

With coordinate differences a, b, c:
Euclidean: d = sqrt(a^2 + b^2 + c^2)
Manhattan: a + b + c
Maximum coordinate: max(a, b, c)
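Not part of the original slides: a minimal Python sketch of the three distance measures, using two made-up records with three numeric variables.

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def max_coordinate(x, y):
    return np.max(np.abs(x - y))

# Two hypothetical records with three numeric variables each
xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 6.0, 3.0])

print(euclidean(xi, xj))       # 5.0
print(manhattan(xi, xj))       # 7.0
print(max_coordinate(xi, xj))  # 4.0
```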

Normalizing
Problem: Raw distance measures are highly influenced by scale of measurements
Solution: normalize (standardize) the data first

Subtract the mean, divide by the standard deviation
The resulting values are also called z-scores

Example: Normalization
For the 22 utilities:
Average sales = 8,914
Standard deviation of sales = 3,550
Arizona sales = 9,077
Normalized (z) score for Arizona sales: (9,077 - 8,914) / 3,550 = 0.046
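The arithmetic above can be checked with a short sketch (not in the original slides); the Sales values come from the utilities table earlier in the deck, and the sample standard deviation (ddof=1) is the one that reproduces the 3,550 on the slide.

```python
import numpy as np

# Sales for the 22 utilities, in the order of the table above
sales = np.array([9077, 5088, 9212, 6423, 3300, 11127, 7642, 13082, 8406, 6455,
                  17441, 6154, 7179, 9673, 6468, 15991, 5714, 10140, 13507, 7287,
                  6650, 10093])

mean = sales.mean()          # ~8,914
std = sales.std(ddof=1)      # ~3,550 (sample standard deviation)

z_scores = (sales - mean) / std
print(round(z_scores[0], 3))  # Arizona: ~0.046
```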

For Categorical Data: Similarity


Example: two records measured on six binary (0/1) variables

Variable:  1  2  3  4  5  6
x1:        1  0  1  1  0  1
x2:        0  1  1  1  0  1

To measure the distance between records with 0/1 variables, create a 2x2 table of counts:

           x2 = 0    x2 = 1
x1 = 0     a = 1     b = 1
x1 = 1     c = 1     d = 3

Similarity metrics based on this table:
Matching coef. = (a + d) / (a + b + c + d) = 4/6 = 0.67
Jaccard's coef. = d / (b + c + d) = 3/5 = 0.6

Use Jaccard's coefficient in cases where a matching 1 is much stronger evidence of similarity than a matching 0 (e.g., both records own a Corvette)
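A small sketch (added here for illustration) that reproduces the counts and the two coefficients from the example vectors above:

```python
import numpy as np

x1 = np.array([1, 0, 1, 1, 0, 1])
x2 = np.array([0, 1, 1, 1, 0, 1])

a = np.sum((x1 == 0) & (x2 == 0))  # both 0 -> 1
b = np.sum((x1 == 0) & (x2 == 1))  # 1
c = np.sum((x1 == 1) & (x2 == 0))  # 1
d = np.sum((x1 == 1) & (x2 == 1))  # both 1 -> 3

matching = (a + d) / (a + b + c + d)  # 4/6 ≈ 0.67
jaccard = d / (b + c + d)             # 3/5 = 0.6
print(matching, jaccard)
```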

Distance Measure for Mixed Data

Gower's similarity measure is a weighted average of the per-variable distances, computed after scaling each variable to a [0, 1] scale:

d_ij = ( Σ_m w_ijm s_ijm ) / ( Σ_m w_ijm )

w_ijm = 0 if the value of variable x_m is not known for one of the pair of records; otherwise w_ijm = 1

Distance Measure for Mixed Data

s_ijm is the difference measure between x_im and x_jm for each individual variable x_m
If x_m is a continuous variable: s_ijm = |x_im - x_jm| / (max(x_m) - min(x_m))
If x_m is a categorical variable: s_ijm = 0 if x_im = x_jm, and s_ijm = 1 if x_im ≠ x_jm

Example
Clothing   Size    Color   Price   Discount
A          small   red     35.00   15%
B          large   red     25.00   N/A

Assuming the price range is 10 - 50:
Size: small ≠ large, so s = 1;  Color: red = red, so s = 0;  Price: |35 - 25| / (50 - 10) = 0.25
Discount is missing for B, so that variable receives weight 0
S(A, B) = (1 + 0 + 0.25) / 3 ≈ 0.42
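A sketch of the slide's Gower-style calculation for the clothing example; the function name and the assumed discount range of 0-1 are illustrative, not from the slides:

```python
# Minimal sketch of the Gower-style weighted average used on the slide.
def gower_distance(record_a, record_b, var_types, ranges):
    num, den = 0.0, 0.0
    for m, (xa, xb) in enumerate(zip(record_a, record_b)):
        if xa is None or xb is None:           # missing value -> weight 0
            continue
        if var_types[m] == "numeric":
            s = abs(xa - xb) / (ranges[m][1] - ranges[m][0])
        else:                                   # categorical
            s = 0.0 if xa == xb else 1.0
        num += s
        den += 1.0
    return num / den

A = ["small", "red", 35.0, 0.15]
B = ["large", "red", 25.0, None]               # discount unknown for B
types = ["categorical", "categorical", "numeric", "numeric"]
ranges = [None, None, (10, 50), (0, 1)]         # assumed price range 10-50, discount 0-1

print(gower_distance(A, B, types, ranges))      # (1 + 0 + 0.25) / 3 ≈ 0.417
```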

Measuring Distance Between Clusters

Minimum Distance (Cluster A to Cluster B)

Also called single linkage
The distance between two clusters is the distance between the closest pair of records A_i and B_j

Maximum Distance (Cluster A to Cluster B)

Also called complete linkage
The distance between two clusters is the distance between the pair of records A_i and B_j that are farthest from each other

Average Distance

Also called average linkage
The distance between two clusters is the average of all possible pairwise distances

Centroid Distance

The distance between two clusters is the distance between the two cluster centroids
A centroid is the vector of variable averages over all records in a cluster
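To make the four linkage definitions concrete, a short sketch (not from the slides) computing each one for two small made-up clusters, using scipy's cdist for the pairwise Euclidean distances:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters of records (rows), already normalized
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 4.0], [5.0, 4.0]])

pairwise = cdist(A, B)                 # all |A| x |B| Euclidean distances

single   = pairwise.min()              # minimum distance (single linkage)
complete = pairwise.max()              # maximum distance (complete linkage)
average  = pairwise.mean()             # average linkage
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # centroid distance

print(single, complete, average, centroid)
```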

The Hierarchical Clustering Steps (Using Agglomerative Method)


1. Start with n clusters (each record is its own cluster)
2. Merge the two closest records into one cluster
3. At each successive step, merge the two clusters closest to each other

The dendrogram, read from the bottom up, illustrates the process
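Outside XLMiner, the same agglomerative procedure can be sketched with scipy; the example below uses only the Sales and Fuel_Cost columns of the first six utilities from the table, normalized to z-scores, and is illustrative rather than a reproduction of the slides' output:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Sales and Fuel_Cost for the first six utilities in the table (Arizona ... Florida)
data = np.array([
    [9077, 0.628],
    [5088, 1.555],
    [9212, 1.058],
    [6423, 0.700],
    [3300, 2.044],
    [11127, 1.241],
])
labels = ["Arizona", "Boston", "Central", "Commonwealth", "Con Ed NY", "Florida"]

# Normalize (z-scores) so scale differences do not dominate the distances
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Agglomerative clustering; 'single' = minimum distance ('complete', 'average', 'centroid' also available)
Z = linkage(z, method="single", metric="euclidean")

dendrogram(Z, labels=labels)
plt.show()

# Cut the tree at a chosen distance to get cluster memberships
print(fcluster(Z, t=1.5, criterion="distance"))
```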

Hierarchical Clustering

(Figures: records plotted in two dimensions, showing clusters forming step by step.)

Within-cluster maximum distance < d
Between-cluster minimum distance > d

Records 12 & 21 are closest and form the first cluster
Records 10 & 13 are merged next to form the second cluster

Reading the Dendrogram


See the process of clustering: lines connected lower down were merged earlier

Determining the number of clusters: for a chosen distance between clusters, draw a horizontal line at that height; the branches it intersects define the clusters at that level

E.g., at a distance of 4.6 (red line in the figure), the data can be reduced to 2 clusters; the smaller of the two is circled
At a distance of 3.6 (green line), the data can be reduced to 6 clusters, including the circled cluster

Validating Clusters

Interpretation
Goal: obtain meaningful and useful clusters
Caveats:
(1) Random chance can often produce apparent clusters
(2) Different clustering methods produce different results

Solutions:
Obtain summary statistics for each cluster
Also review the clusters in terms of variables not used in the clustering
Label the clusters (e.g., a clustering of financial firms in 2008 might yield a label like "midsize sub-prime losers")

Desirable Cluster Features


Stability: are the clusters and cluster assignments sensitive to slight changes in the inputs? Are cluster assignments in partition B similar to those in partition A?

Separation: check the ratio of between-cluster variation to within-cluster variation (higher is better)
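One way to quantify the separation check is the ratio of between-cluster to within-cluster sums of squares; the sketch below (added for illustration, with made-up data) computes it directly:

```python
import numpy as np

def separation_ratio(X, labels):
    """Ratio of between-cluster to within-cluster variation (sums of squares)."""
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return between / within

# Hypothetical example: two well-separated groups give a high ratio
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
print(separation_ratio(X, labels))
```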

Nonhierarchical Clustering: K-Means Clustering

K-Means Clustering Algorithm


1. Choose the number of clusters desired, k
2. Start with a partition into k clusters (often based on a random selection of k centroids)
3. At each step, move each record to the cluster with the closest centroid
4. Recompute the centroids and repeat step 3
5. Stop when moving records would only increase within-cluster dispersion
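A minimal sketch of the algorithm above, written out by hand rather than using XLMiner or a library; the data, seed, and helper names are illustrative:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: assign records to the nearest centroid, recompute, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 2: random initial centroids
    for _ in range(n_iter):
        # step 3: assign each record to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute centroids (keep the old one if a cluster ends up empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # step 5: stop when the centroids (and hence assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical normalized data with three visible groups
X = np.vstack([np.random.default_rng(1).normal(loc, 0.2, size=(5, 2))
               for loc in ([0, 0], [3, 3], [0, 3])])
labels, centroids = kmeans(X, k=3)
print(labels)
```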

K-Means Clustering (k=3)

(Sequence of figures: the records plotted in two dimensions, each labeled with its current cluster (1, 2, or 3) at successive iterations; assignments change as records move to the nearest centroid and the centroids are recomputed, until the assignments stabilize.)

K-means Algorithm: Choosing k and Initial Partitioning


Choose k based on how the results will be used
e.g., how many market segments do we want?
Also experiment with slightly different values of k
The initial partition into clusters can be random, or based on domain knowledge
If using a random initial partition, repeat the process with several different random partitions
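Using a library, experimenting with several values of k and several random starts might look like the sketch below (scikit-learn's KMeans, with n_init controlling the number of random initial partitions; the data here is random placeholder data, not the utilities):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical normalized data matrix (rows = records)
rng = np.random.default_rng(0)
X = rng.normal(size=(22, 8))

# Try a few values of k, each with several random starts (n_init), and compare
# within-cluster dispersion (inertia = sum of squared distances to centroids)
for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    print(k, round(km.inertia_, 1))
```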

XLMiner Output: Cluster Centroids


Cluster      Fixed_charge   RoR    Cost   Load_factor
Cluster-1    0.89           10.3   202    57.9
Cluster-2    1.43           15.4   113    53
Cluster-3    1.06           9.2    151    54.4

We chose k = 3
Only 4 of the 8 variables are shown

Distance Between Clusters


Distance between clusters:

            Cluster-1     Cluster-2     Cluster-3
Cluster-1   0             5.03216253    3.16901457
Cluster-2   5.03216253    0             3.76581196
Cluster-3   3.16901457    3.76581196    0

Clusters 1 and 2 are relatively well separated from each other, while cluster 3 is not as well separated

Within-Cluster Dispersion
Data summary (in original coordinates)

Cluster     #Obs   Average distance in cluster
Cluster-1   12     1748.348058
Cluster-2   3      907.6919822
Cluster-3   7      3625.242085
Overall     22     2230.906692

Clusters 1 and 2 are relatively tight; cluster 3 is very loose
Conclusion: clusters 1 and 2 are well defined, cluster 3 is not

Summary

Cluster analysis is an exploratory tool; it is useful only when it produces meaningful clusters
Hierarchical clustering gives a visual representation of clustering at different levels
On the other hand, due to its non-iterative nature, it can be unstable, can vary greatly depending on settings, and is computationally expensive
Non-hierarchical (k-means) clustering is computationally cheap and more stable, but requires the user to set k
The two methods can also be used together
Be wary of chance results; the data may not contain definitive real clusters

http://dataminingbook.com/datasets
