10. Cluster Analysis
Example application: grouping neighborhoods into segments that capture the dominant lifestyles, such as "Furs & Station Wagons" and "Money & Brains", for marketing products and services
Applications
Mendeleyev's periodic table of the elements
Classification of species
Grouping securities in portfolios
Grouping firms for structural analysis of an economy
Designing army uniform sizes
Example: 22 public utilities, each measured on eight variables:
Fixed-charge covering ratio
Rate of return on capital
Cost per kilowatt capacity
Annual load factor
Growth in peak demand
Sales (kWh)
Percent nuclear
Fuel cost per kWh
Company        Fixed_charge   RoR   Cost  Load_factor  Demand_growth   Sales  Nuclear  Fuel_Cost
Arizona                1.06   9.2    151         54.4            1.6    9077      0.0      0.628
Boston                 0.89  10.3    202         57.9            2.2    5088     25.3      1.555
Central                1.43  15.4    113         53.0            3.4    9212      0.0      1.058
Commonwealth           1.02  11.2    168         56.0            0.3    6423     34.3      0.700
Con Ed NY              1.49   8.8    192         51.2            1.0    3300     15.6      2.044
Florida                1.32  13.5    111         60.0           -2.2   11127     22.5      1.241
Hawaiian               1.22  12.2    175         67.6            2.2    7642      0.0      1.652
Idaho                  1.10   9.2    245         57.0            3.3   13082      0.0      0.309
Kentucky               1.34  13.0    168         60.4            7.2    8406      0.0      0.862
Madison                1.12  12.4    197         53.0            2.7    6455     39.2      0.623
Nevada                 0.75   7.5    173         51.5            6.5   17441      0.0      0.768
New England            1.13  10.9    178         62.0            3.7    6154      0.0      1.897
Northern               1.15  12.7    199         53.7            6.4    7179     50.2      0.527
Oklahoma               1.09  12.0     96         49.8            1.4    9673      0.0      0.588
Pacific                0.96   7.6    164         62.2           -0.1    6468      0.9      1.400
Puget                  1.16   9.9    252         56.0            9.2   15991      0.0      0.620
San Diego              0.76   6.4    136         61.9            9.0    5714      8.3      1.920
Southern               1.05  12.6    150         56.7            2.7   10140      0.0      1.108
Texas                  1.16  11.7    104         54.0           -2.1   13507      0.0      0.636
Wisconsin              1.20  11.8    148         59.9            3.5    7287     41.1      0.702
United                 1.04   8.6    204         61.0            3.5    6650      0.0      2.116
Virginia               1.07   9.3    174         54.3            5.9   10093     26.6      1.306
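A minimal pandas sketch for loading this table; the file name Utilities.csv and the Company index column are assumptions (the book's datasets are linked at the end of this deck):

```python
import pandas as pd

# Load the 22-utilities table (file name is an assumption)
utilities = pd.read_csv('Utilities.csv', index_col='Company')
print(utilities.describe())  # note the scales: Sales is in the thousands
```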
Hierarchical Clustering
Hierarchical Methods
Agglomerative Methods
Begin with n clusters (each record is its own cluster)
Keep joining records into clusters until one cluster is left (the entire data set)
The most popular approach
Divisive Methods
Start with one all-inclusive cluster Repeatedly divide into smaller clusters
Measuring Distance
Between records
Between clusters
Properties of Distance
Denote by dij the distance between records Xi and Xj. Properties:
Non-negativity: dij ≥ 0
Self-proximity: dii = 0
Symmetry: dij = dji
Euclidean Distance
For records Xi = (xi1, xi2) and Xj = (xj1, xj2) in two dimensions, the Euclidean distance is the length of the hypotenuse of the right triangle with legs a = |xi1 - xj1| and b = |xi2 - xj2|:

dij = sqrt((xi1 - xj1)^2 + (xi2 - xj2)^2)

More generally, for p variables: dij = sqrt((xi1 - xj1)^2 + ... + (xip - xjp)^2)
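As a concrete check, a small numpy sketch computing the raw Euclidean distance between Arizona and Boston from the table above; note how the Sales column dominates the result, which motivates the normalization discussed next:

```python
import numpy as np

# Raw measurements for Arizona and Boston from the utilities table
arizona = np.array([1.06, 9.2, 151, 54.4, 1.6, 9077, 0.0, 0.628])
boston  = np.array([0.89, 10.3, 202, 57.9, 2.2, 5088, 25.3, 1.555])

# Euclidean distance: square root of the sum of squared differences
d = np.sqrt(np.sum((arizona - boston) ** 2))
print(round(d, 2))  # ~3989.41, driven almost entirely by the Sales gap
```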
Normalizing
Problem: Raw distance measures are highly influenced by scale of measurements
Solution: normalize (standardize) the data first
Example: Normalization
For the 22 utilities: avg. sales = 8,914, std. dev. = 3,550. Arizona sales: 9,077. Normalized score for Arizona sales:
(9,077-8,914)/3,550 = 0.046
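A minimal sketch of the same calculation; in practice every column is normalized this way (e.g., (df - df.mean()) / df.std() on a pandas DataFrame):

```python
# Z-score normalization of Arizona's sales, using the summary stats above
avg, std = 8914, 3550
arizona_sales = 9077
print(round((arizona_sales - avg) / std, 3))  # 0.046
```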
To measure the distance between records with 0/1 variables, create a 2x2 table of counts across the variables:

              record j
               0    1
record i   0   a    b
           1   c    d

In this example (six 0/1 variables in total): a = 1, b = 1, c = 1, d = 3.
Similarity metrics based on this table:
Matching coef. = (a+d)/(a+b+c+d) = 4/6 = 0.67
Jaccard's coef. = d/(b+c+d) = 3/5 = 0.60
Use Jaccard's coefficient in cases where a matching 1 is much stronger evidence of similarity than a matching 0 (e.g., both records own a Corvette)
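A sketch computing both coefficients for a hypothetical pair of records whose counts match the table above (a=1, b=1, c=1, d=3):

```python
import numpy as np

# Two records measured on six 0/1 variables
x = np.array([0, 0, 1, 1, 1, 1])
y = np.array([0, 1, 0, 1, 1, 1])

a = np.sum((x == 0) & (y == 0))  # matching 0s: 1
b = np.sum((x == 0) & (y == 1))  # mismatches:  1
c = np.sum((x == 1) & (y == 0))  # mismatches:  1
d = np.sum((x == 1) & (y == 1))  # matching 1s: 3

matching = (a + d) / (a + b + c + d)  # 4/6 = 0.67
jaccard  = d / (b + c + d)            # 3/5 = 0.60
print(matching, jaccard)
```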
Gower's similarity measure is a weighted average of the distances computed for each variable, after scaling each variable to a [0, 1] scale:

S(i,j) = Σm Wijm·Sijm / Σm Wijm

where:
Wijm = 0 if the value of variable xm is not known for one of the pair of records; otherwise Wijm = 1
Sijm is the difference measure between xim and xjm for each individual variable xm:
If xm is a continuous variable, Sijm = |xim - xjm| / (max(xm) - min(xm))
If xm is a categorical variable, Sijm = 0 if xim = xjm and Sijm = 1 if xim ≠ xjm
Example
Clothing    A         B
size        small     large
color       red       red
price       $35.00    $25.00
discount    15%       N/A
S(A,B) = (1 + 0 + (35-25)/(50-10))/3 = 1.25/3 ≈ 0.417
(size differs: 1; color matches: 0; price is scaled by its range, here assumed to be [10, 50]; discount is missing for B, so its weight W = 0 and it drops out of both numerator and denominator)
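A sketch of the same Gower calculation; the helper function and the [10, 50] price range are assumptions for illustration:

```python
def gower(pairs):
    """pairs: list of (S_ijm, W_ijm) per variable; weighted average."""
    return sum(s * w for s, w in pairs) / sum(w for _, w in pairs)

pairs = [
    (1.0, 1),                       # size: small vs large (unequal)
    (0.0, 1),                       # color: red vs red (equal)
    (abs(35 - 25) / (50 - 10), 1),  # price: continuous, scaled by range
    (0.0, 0),                       # discount: missing for B, weight 0
]
print(round(gower(pairs), 3))  # (1 + 0 + 0.25) / 3 = 0.417
```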
Minimum Distance
Also called single linkage. The distance between two clusters is the distance between the pair of records Ai and Bj that are closest.
Maximum Distance
Also called complete linkage. The distance between two clusters is the distance between the pair of records Ai and Bj that are farthest from each other.
Average Distance
Also called average linkage. The distance between two clusters is the average of all possible pairwise distances between records in the two clusters.
Centroid Distance
The distance between two clusters is the distance between the two cluster centroids. A centroid is the vector of variable averages for all records in a cluster.
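A sketch contrasting the four between-cluster distances on two toy 2-D clusters (the points are illustrative, not from the utilities data):

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 4.0], [4.0, 4.0]])

pairwise = cdist(A, B)         # all |A| x |B| pairwise distances
print(pairwise.min())          # minimum distance (single linkage)
print(pairwise.max())          # maximum distance (complete linkage)
print(pairwise.mean())         # average distance (average linkage)
print(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))  # centroid distance
```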
The agglomerative algorithm:
1. Start with n clusters (each record is its own cluster)
2. Merge the two closest records into one cluster
3. At each successive step, merge the two clusters closest to each other
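A scipy sketch of the agglomerative algorithm on the normalized utilities data ('utilities' as loaded earlier); the method argument selects the linkage defined above:

```python
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Normalize, then build the merge tree
norm = (utilities - utilities.mean()) / utilities.std()
Z = linkage(norm.values, method='single')  # or 'complete', 'average', 'centroid'

dendrogram(Z, labels=norm.index.tolist())
plt.show()
```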
Hierarchical Clustering
Within-cluster max distance < d; between-cluster min distance > d
Determining the number of clusters: drawing a horizontal line across the dendrogram at a given inter-cluster distance cuts the tree into clusters that are at least that far apart.
E.g., cutting at a distance of 4.6 reduces the data to 2 clusters; cutting at a distance of 3.6 reduces it to 6 clusters.
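Cutting the tree programmatically: a sketch with scipy's fcluster, reusing Z from the linkage sketch above (the exact heights that yield 2 and 6 clusters depend on the chosen linkage and normalization):

```python
from scipy.cluster.hierarchy import fcluster

labels2 = fcluster(Z, t=4.6, criterion='distance')  # -> 2 clusters
labels6 = fcluster(Z, t=3.6, criterion='distance')  # -> 6 clusters
print(labels6)  # cluster membership for each utility
```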
Validating Clusters
Interpretation
Goal: obtain meaningful and useful clusters. Caveats:
(1) Random chance can often produce apparent clusters
(2) Different clustering methods produce different results
Solutions:
Obtain summary statistics per cluster
Also review clusters in terms of variables not used in the clustering
Label the clusters (e.g., clustering financial firms in 2008 might yield a label like "midsize sub-prime loser")
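A one-line sketch of the first two solutions, per-cluster means of the original variables (labels6 from the dendrogram cut above; any other assignment works the same way):

```python
# Summary statistics by cluster, in original (unnormalized) units
print(utilities.assign(cluster=labels6).groupby('cluster').mean())
```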
K-Means Clustering (non-hierarchical)
1. Choose the number of clusters, k
2. Start with an initial partition of the records into k clusters
3. At each step, move each record to the cluster with the closest centroid
4. Recompute the centroids; repeat step 3
5. Stop when moving records increases within-cluster dispersion
Also experiment with slightly different values of k. The initial partition into clusters can be random, or based on domain knowledge.
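A scikit-learn sketch of k-means on the standardized utilities data; k = 3 matches the three-cluster summary below, and n_init reruns the algorithm from several random initial partitions, keeping the best result:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(utilities)
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
print(km.labels_)    # cluster membership for each utility
print(km.inertia_)   # within-cluster dispersion (sum of squared distances)
```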
Clusters 1 and 2 are relatively well separated from each other, while cluster 3 is not.
Within-Cluster Dispersion
Data summary (in original coordinates):

Cluster      #Obs   Avg. distance within cluster
Cluster-1      12   1,748.3
Cluster-2       3     907.7
Cluster-3       7   3,625.2
Overall        22   2,230.9
Clusters 1 and 2 are relatively tight; cluster 3 is very loose. Conclusion: clusters 1 and 2 are well defined, cluster 3 is not.
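A sketch of one way to compute this dispersion measure, the average pairwise distance within each cluster in original coordinates (the table's exact definition may differ, e.g. average distance to the centroid):

```python
import numpy as np
from scipy.spatial.distance import pdist

# km.labels_ from the k-means sketch above
for k in np.unique(km.labels_):
    members = utilities.values[km.labels_ == k]
    avg = pdist(members).mean() if len(members) > 1 else 0.0
    print(f"cluster {k}: avg within-cluster distance = {avg:.1f}")
```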
Summary
Cluster analysis is an exploratory tool; it is useful only when it produces meaningful clusters.
Hierarchical clustering gives a visual representation of different levels of clustering. On the other hand, because it is non-iterative it can be unstable, its results can vary highly depending on settings (distance measure, linkage method), and it is computationally expensive.
Non-hierarchical clustering is computationally cheap and more stable, but requires the user to set k. The two methods can be used together. Be wary of chance results; the data may not have definitive real clusters.
http://dataminingbook.com/datasets