10. Cluster Analysis
Example application: grouping neighborhoods into segments that capture the dominant lifestyles, such as "Furs & Station Wagons" and "Money & Brains", for marketing products and services
Applications
Mendeleyev's periodic table of the elements
Classification of species
Grouping securities in portfolios
Grouping firms for structural analysis of an economy
Designing army uniform sizes
Example: 22 public utilities, each measured on eight variables:
Fixed-charge covering ratio
Rate of return on capital
Cost per kilowatt capacity
Annual load factor
Growth in peak demand
Sales (kWh)
Percent nuclear
Fuel cost per kWh
Company        Fixed_charge   RoR   Cost  Load_factor  Demand_growth   Sales  Nuclear  Fuel_Cost
Arizona                1.06   9.2    151         54.4            1.6    9077      0.0      0.628
Boston                 0.89  10.3    202         57.9            2.2    5088     25.3      1.555
Central                1.43  15.4    113         53.0            3.4    9212      0.0      1.058
Commonwealth           1.02  11.2    168         56.0            0.3    6423     34.3      0.700
Con Ed NY              1.49   8.8    192         51.2            1.0    3300     15.6      2.044
Florida                1.32  13.5    111         60.0           -2.2   11127     22.5      1.241
Hawaiian               1.22  12.2    175         67.6            2.2    7642      0.0      1.652
Idaho                  1.10   9.2    245         57.0            3.3   13082      0.0      0.309
Kentucky               1.34  13.0    168         60.4            7.2    8406      0.0      0.862
Madison                1.12  12.4    197         53.0            2.7    6455     39.2      0.623
Nevada                 0.75   7.5    173         51.5            6.5   17441      0.0      0.768
New England            1.13  10.9    178         62.0            3.7    6154      0.0      1.897
Northern               1.15  12.7    199         53.7            6.4    7179     50.2      0.527
Oklahoma               1.09  12.0     96         49.8            1.4    9673      0.0      0.588
Pacific                0.96   7.6    164         62.2           -0.1    6468      0.9      1.400
Puget                  1.16   9.9    252         56.0            9.2   15991      0.0      0.620
San Diego              0.76   6.4    136         61.9            9.0    5714      8.3      1.920
Southern               1.05  12.6    150         56.7            2.7   10140      0.0      1.108
Texas                  1.16  11.7    104         54.0           -2.1   13507      0.0      0.636
Wisconsin              1.20  11.8    148         59.9            3.5    7287     41.1      0.702
United                 1.04   8.6    204         61.0            3.5    6650      0.0      2.116
Virginia               1.07   9.3    174         54.3            5.9   10093     26.6      1.306
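A minimal pandas sketch for loading this table; the file name Utilities.csv and the Company index column are assumptions (the book's datasets are linked at the end of this deck):

```python
import pandas as pd

# Load the 22-utilities table (file name is an assumption)
utilities = pd.read_csv('Utilities.csv', index_col='Company')
print(utilities.describe())  # note the scales: Sales is in the thousands
```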
Hierarchical Clustering
Hierarchical Methods
Agglomerative Methods
Begin with n clusters (each record is its own cluster)
Keep joining records into clusters until one cluster is left (the entire data set)
The most popular approach
Divisive Methods
Start with one all-inclusive cluster Repeatedly divide into smaller clusters
Measuring Distance
Between records
Between clusters
Properties of Distance
Denote by dij the distance between records Xi and Xj. Properties:
Non-negativity: dij ≥ 0
Self-proximity: dii = 0
Symmetry: dij = dji
Euclidean Distance
For records Xi = (xi1, xi2) and Xj = (xj1, xj2) in two dimensions, the Euclidean distance is the length of the hypotenuse of the right triangle with legs a = |xi1 - xj1| and b = |xi2 - xj2|:

dij = sqrt((xi1 - xj1)^2 + (xi2 - xj2)^2)

More generally, for p variables: dij = sqrt((xi1 - xj1)^2 + ... + (xip - xjp)^2)
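As a concrete check, a small numpy sketch computing the raw Euclidean distance between Arizona and Boston from the table above; note how the Sales column dominates the result, which motivates the normalization discussed next:

```python
import numpy as np

# Raw measurements for Arizona and Boston from the utilities table
arizona = np.array([1.06, 9.2, 151, 54.4, 1.6, 9077, 0.0, 0.628])
boston  = np.array([0.89, 10.3, 202, 57.9, 2.2, 5088, 25.3, 1.555])

# Euclidean distance: square root of the sum of squared differences
d = np.sqrt(np.sum((arizona - boston) ** 2))
print(round(d, 2))  # ~3989.41, driven almost entirely by the Sales gap
```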
Normalizing
Problem: Raw distance measures are highly influenced by scale of measurements
Solution: normalize (standardize) the data first
Example: Normalization
For the 22 utilities: avg. sales = 8,914, std. dev. = 3,550. Arizona sales: 9,077. Normalized score for Arizona sales:
(9,077-8,914)/3,550 = 0.046
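A minimal sketch of the same calculation; in practice every column is normalized this way (e.g., (df - df.mean()) / df.std() on a pandas DataFrame):

```python
# Z-score normalization of Arizona's sales, using the summary stats above
avg, std = 8914, 3550
arizona_sales = 9077
print(round((arizona_sales - avg) / std, 3))  # 0.046
```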
To measure the distance between records with 0/1 variables, create a 2x2 table of counts across the variables:

              record j
               0    1
record i   0   a    b
           1   c    d

In this example (six 0/1 variables in total): a = 1, b = 1, c = 1, d = 3.
Similarity metrics based on this table:
Matching coef. = (a+d)/(a+b+c+d) = 4/6 = 0.67
Jaccard's coef. = d/(b+c+d) = 3/5 = 0.60
Use Jaccard's coefficient in cases where a matching 1 is much stronger evidence of similarity than a matching 0 (e.g., both records own a Corvette)
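A sketch computing both coefficients for a hypothetical pair of records whose counts match the table above (a=1, b=1, c=1, d=3):

```python
import numpy as np

# Two records measured on six 0/1 variables
x = np.array([0, 0, 1, 1, 1, 1])
y = np.array([0, 1, 0, 1, 1, 1])

a = np.sum((x == 0) & (y == 0))  # matching 0s: 1
b = np.sum((x == 0) & (y == 1))  # mismatches:  1
c = np.sum((x == 1) & (y == 0))  # mismatches:  1
d = np.sum((x == 1) & (y == 1))  # matching 1s: 3

matching = (a + d) / (a + b + c + d)  # 4/6 = 0.67
jaccard  = d / (b + c + d)            # 3/5 = 0.60
print(matching, jaccard)
```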
Gower's similarity measure is a weighted average of the distances computed for each variable, after scaling each variable to a [0, 1] scale:

S(i,j) = Σm Wijm·Sijm / Σm Wijm

where:
Wijm = 0 if the value of variable xm is not known for one of the pair of records; otherwise Wijm = 1
Sijm is the difference measure between xim and xjm for each individual variable xm:
If xm is a continuous variable, Sijm = |xim - xjm| / (max(xm) - min(xm))
If xm is a categorical variable, Sijm = 0 if xim = xjm and Sijm = 1 if xim ≠ xjm
Example
Clothing    A         B
size        small     large
color       red       red
price       $35.00    $25.00
discount    15%       N/A
S(A,B) = (1 + 0 + (35-25)/(50-10))/3 = 1.25/3 ≈ 0.417
(size differs: 1; color matches: 0; price is scaled by its range, here assumed to be [10, 50]; discount is missing for B, so its weight W = 0 and it drops out of both numerator and denominator)
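A sketch of the same Gower calculation; the helper function and the [10, 50] price range are assumptions for illustration:

```python
def gower(pairs):
    """pairs: list of (S_ijm, W_ijm) per variable; weighted average."""
    return sum(s * w for s, w in pairs) / sum(w for _, w in pairs)

pairs = [
    (1.0, 1),                       # size: small vs large (unequal)
    (0.0, 1),                       # color: red vs red (equal)
    (abs(35 - 25) / (50 - 10), 1),  # price: continuous, scaled by range
    (0.0, 0),                       # discount: missing for B, weight 0
]
print(round(gower(pairs), 3))  # (1 + 0 + 0.25) / 3 = 0.417
```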
Minimum Distance
Also called single linkage. The distance between two clusters is the distance between the pair of records Ai and Bj that are closest.
Maximum Distance
Also called complete linkage. The distance between two clusters is the distance between the pair of records Ai and Bj that are farthest from each other.
Average Distance
Also called average linkage. The distance between two clusters is the average of all possible pairwise distances between records in the two clusters.
Centroid Distance
The distance between two clusters is the distance between the two cluster centroids. A centroid is the vector of variable averages for all records in a cluster.
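A sketch contrasting the four between-cluster distances on two toy 2-D clusters (the points are illustrative, not from the utilities data):

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 4.0], [4.0, 4.0]])

pairwise = cdist(A, B)         # all |A| x |B| pairwise distances
print(pairwise.min())          # minimum distance (single linkage)
print(pairwise.max())          # maximum distance (complete linkage)
print(pairwise.mean())         # average distance (average linkage)
print(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))  # centroid distance
```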
The agglomerative algorithm:
1. Start with n clusters (each record is its own cluster)
2. Merge the two closest records into one cluster
3. At each successive step, merge the two clusters closest to each other
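A scipy sketch of the agglomerative algorithm on the normalized utilities data ('utilities' as loaded earlier); the method argument selects the linkage defined above:

```python
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Normalize, then build the merge tree
norm = (utilities - utilities.mean()) / utilities.std()
Z = linkage(norm.values, method='single')  # or 'complete', 'average', 'centroid'

dendrogram(Z, labels=norm.index.tolist())
plt.show()
```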
Hierarchical Clustering
Within-cluster max distance < d; between-cluster min distance > d
Determining the number of clusters: drawing a horizontal line across the dendrogram at a given inter-cluster distance cuts the tree into clusters that are at least that far apart.
E.g., cutting at a distance of 4.6 reduces the data to 2 clusters; cutting at a distance of 3.6 reduces it to 6 clusters.
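Cutting the tree programmatically: a sketch with scipy's fcluster, reusing Z from the linkage sketch above (the exact heights that yield 2 and 6 clusters depend on the chosen linkage and normalization):

```python
from scipy.cluster.hierarchy import fcluster

labels2 = fcluster(Z, t=4.6, criterion='distance')  # -> 2 clusters
labels6 = fcluster(Z, t=3.6, criterion='distance')  # -> 6 clusters
print(labels6)  # cluster membership for each utility
```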
Validating Clusters
Interpretation
Goal: obtain meaningful and useful clusters. Caveats:
(1) Random chance can often produce apparent clusters
(2) Different clustering methods produce different results
Solutions:
Obtain summary statistics per cluster
Also review clusters in terms of variables not used in the clustering
Label the clusters (e.g., clustering financial firms in 2008 might yield a label like "midsize sub-prime loser")
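A one-line sketch of the first two solutions, per-cluster means of the original variables (labels6 from the dendrogram cut above; any other assignment works the same way):

```python
# Summary statistics by cluster, in original (unnormalized) units
print(utilities.assign(cluster=labels6).groupby('cluster').mean())
```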
K-Means Clustering (non-hierarchical)
1. Choose the number of clusters, k
2. Start with an initial partition of the records into k clusters
3. At each step, move each record to the cluster with the closest centroid
4. Recompute the centroids; repeat step 3
5. Stop when moving records increases within-cluster dispersion
Also experiment with slightly different values of k. The initial partition into clusters can be random, or based on domain knowledge.
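A scikit-learn sketch of k-means on the standardized utilities data; k = 3 matches the three-cluster summary below, and n_init reruns the algorithm from several random initial partitions, keeping the best result:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(utilities)
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
print(km.labels_)    # cluster membership for each utility
print(km.inertia_)   # within-cluster dispersion (sum of squared distances)
```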
Clusters 1 and 2 are relatively well separated from each other, while cluster 3 is not.
Within-Cluster Dispersion
Data summary (in original coordinates):

Cluster      #Obs   Avg. distance within cluster
Cluster-1      12   1,748.3
Cluster-2       3     907.7
Cluster-3       7   3,625.2
Overall        22   2,230.9
Clusters 1 and 2 are relatively tight; cluster 3 is very loose. Conclusion: clusters 1 and 2 are well defined, cluster 3 is not.
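A sketch of one way to compute this dispersion measure, the average pairwise distance within each cluster in original coordinates (the table's exact definition may differ, e.g. average distance to the centroid):

```python
import numpy as np
from scipy.spatial.distance import pdist

# km.labels_ from the k-means sketch above
for k in np.unique(km.labels_):
    members = utilities.values[km.labels_ == k]
    avg = pdist(members).mean() if len(members) > 1 else 0.0
    print(f"cluster {k}: avg within-cluster distance = {avg:.1f}")
```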
Summary
Cluster analysis is an exploratory tool; it is useful only when it produces meaningful clusters.
Hierarchical clustering gives a visual representation of different levels of clustering. On the other hand, because it is non-iterative it can be unstable, its results can vary highly depending on settings (distance measure, linkage method), and it is computationally expensive.
Non-hierarchical clustering is computationally cheap and more stable, but requires the user to set k. The two methods can be used together. Be wary of chance results; the data may not have definitive real clusters.
http://dataminingbook.com/datasets