
Chapter 14

Cluster Analysis
Clustering: The Main Idea
• Goal: Form groups (clusters) of similar records
• Used for segmenting markets into groups of similar customers
• Example: Claritas segmented US neighborhoods based on demographics
  & income: “Furs & station wagons,” “Money & Brains”, …
Other Applications
• Periodic table of the elements
• Classification of species
• Grouping securities in portfolios
• Grouping firms for structural analysis of
economy
• Army uniform sizes
Example: Public Utilities
• Goal: find clusters of similar utilities
• Data: 22 firms, 8 variables
- Fixed-charge covering ratio
- Rate of return on capital
- Cost per kilowatt capacity
- Annual load factor
- Growth in peak demand
- Sales
- % nuclear
- Fuel costs per kwh
Company Fixed_charge RoR Cost Load_factor Demand Sales Nuclear Fuel_Cost
Arizona 1.06 9.2 151 54.4 1.6 9077 0 0.628
Boston 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
Central 1.43 15.4 113 53 3.4 9212 0 1.058
Commonwealth 1.02 11.2 168 56 0.3 6423 34.3 0.7
Con Ed NY 1.49 8.8 192 51.2 1 3300 15.6 2.044
Florida 1.32 13.5 111 60 -2.2 11127 22.5 1.241
Hawaiian 1.22 12.2 175 67.6 2.2 7642 0 1.652
Idaho 1.1 9.2 245 57 3.3 13082 0 0.309
Kentucky 1.34 13 168 60.4 7.2 8406 0 0.862
Madison 1.12 12.4 197 53 2.7 6455 39.2 0.623
Nevada 0.75 7.5 173 51.5 6.5 17441 0 0.768
New England 1.13 10.9 178 62 3.7 6154 0 1.897
Northern 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
Oklahoma 1.09 12 96 49.8 1.4 9673 0 0.588
Pacific 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4
Puget 1.16 9.9 252 56 9.2 15991 0 0.62
San Diego 0.76 6.4 136 61.9 9 5714 8.3 1.92
Southern 1.05 12.6 150 56.7 2.7 10140 0 1.108
Texas 1.16 11.7 104 54 -2.1 13507 0 0.636
Wisconsin 1.2 11.8 148 59.9 3.5 7287 41.1 0.702
United 1.04 8.6 204 61 3.5 6650 0 2.116
Virginia 1.07 9.3 174 54.3 5.9 10093 26.6 1.306
Sales & Fuel Cost
• Scatter plot of Sales vs. Fuel Cost: 3 rough clusters can be seen
  - High fuel cost, low sales
  - Low fuel cost, high sales
  - Low fuel cost, low sales
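As a rough illustration, this kind of plot can be reproduced with matplotlib; the subset of firms below is copied from the table above (the full 22 rows could be used instead):

```python
import matplotlib.pyplot as plt

# A subset of the 22 utilities from the table above: (Sales, Fuel_Cost)
firms = {
    "Arizona": (9077, 0.628), "Boston": (5088, 1.555),
    "Nevada": (17441, 0.768), "Puget": (15991, 0.620),
    "New England": (6154, 1.897), "United": (6650, 2.116),
    "Texas": (13507, 0.636), "Oklahoma": (9673, 0.588),
}

sales = [s for s, _ in firms.values()]
fuel = [f for _, f in firms.values()]

plt.scatter(sales, fuel)
for name, (x, y) in firms.items():
    plt.annotate(name, (x, y), fontsize=8)
plt.xlabel("Sales")
plt.ylabel("Fuel cost per kWh")
plt.title("Sales & Fuel Cost")
plt.show()
```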


Extension to
More Than 2 Dimensions

• In prior example, clustering was done by eye


• Multiple dimensions require formal algorithm
with
- A distance measure
- A way to use the distance measure in forming
clusters
• We will consider two algorithms:
hierarchical and non-hierarchical
Hierarchical
Clustering
Hierarchical Clustering
• A way to use the distance measure in
forming clusters
• Produces a set of nested clusters
organized as a hierarchical tree
• Can be visualized as a dendrogram
  - A tree-like diagram that records the sequences of merges or splits
Strengths of Hierarchical Clustering
• Do not have to assume any particular
number of clusters
- Any desired number of clusters can be
obtained by ‘cutting’ the dendrogram at the
proper level
• They may correspond to meaningful
taxonomies
- Example in biological sciences (e.g., animal
kingdom, phylogeny reconstruction, …)
Types of Hierarchical
Clustering
• Agglomerative:
- Start with the points as individual clusters
- At each step, merge the closest pair of clusters until only
one cluster (or k clusters) left
• Divisive:
- Start with one, all-inclusive cluster
- At each step, split a cluster until each cluster contains a
point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or
distance matrix
- Merge or split one cluster at a time
Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward
1. Let each data point be a cluster
2. Compute the distance matrix
3. Repeat
Merge the two closest clusters
Update the distance matrix
Until only a single cluster remains
• Key operation is the computation of the distance of two
clusters
- Different approaches to defining the distance between clusters
distinguish the different algorithms
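The basic algorithm above maps directly onto SciPy's hierarchical clustering routines. A minimal sketch, using random data as a stand-in for the normalized utility records ('average' linkage is one of several choices discussed below):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Toy stand-in for the 22 normalized utility records (rows = records)
rng = np.random.default_rng(0)
X = rng.normal(size=(22, 8))

d = pdist(X, metric="euclidean")      # step 2: pairwise distance matrix
Z = linkage(d, method="average")      # step 3: merge closest clusters until one remains

# Each row of Z records one merge: (cluster i, cluster j, merge distance, new size)
print(Z[:5])

dendrogram(Z)                         # visualize the full merge sequence
plt.show()
```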
Starting Situation
• Start with clusters of individual points and a
distance matrix
Intermediate Situation
• After some merging steps, we have
some clusters
Intermediate Situation
• We want to merge the two closest clusters
(C2 and C5) and update the distance matrix.
After Merging
• The question is “How do we update the
distance matrix?”
Measuring Distance
Between Records
Distance Between Two Records
• Euclidean Distance is most popular:

  $d_{ij} = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \cdots + (x_{ip}-x_{jp})^2}$

  where records i and j are measured on p variables
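A minimal sketch of this distance for two records from the utilities table, using only the raw Sales and Fuel_Cost columns; it also previews why normalization (next slides) matters:

```python
import math

# Arizona and Boston on two raw variables from the table: Sales, Fuel_Cost
arizona = [9077, 0.628]
boston = [5088, 1.555]

d = math.sqrt(sum((a - b) ** 2 for a, b in zip(arizona, boston)))
print(d)  # ~3989: the Sales gap swamps the Fuel_Cost gap on the raw scales
```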
Normalizing
• Problem: Raw distance measures are highly influenced by scale of
  measurements
• Solution: normalize (standardize) the data first
  - Subtract mean, divide by std. deviation
  - Also called z-scores
Example: Normalization
• For 22 utilities:
  - Avg. sales = 8,914
  - Std. dev. = 3,550
• Normalized score for Arizona sales:
  (9,077 - 8,914)/3,550 = 0.046
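The same calculation for the full Sales column, sketched in NumPy (the mean and standard deviation come out to roughly the rounded values quoted above):

```python
import numpy as np

# Sales column for all 22 utilities, in the order of the table above
sales = np.array([9077, 5088, 9212, 6423, 3300, 11127, 7642, 13082, 8406,
                  6455, 17441, 6154, 7179, 9673, 6468, 15991, 5714, 10140,
                  13507, 7287, 6650, 10093])

# z-score: subtract the mean, divide by the standard deviation
z = (sales - sales.mean()) / sales.std(ddof=1)

print(sales.mean(), sales.std(ddof=1))  # approximately 8,914 and 3,550
print(round(z[0], 3))                   # Arizona: approximately 0.046
```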
For Categorical Data: Similarity
• To measure the distance between two records in terms of 0/1 variables,
  create a table of counts:

                 Record j = 0   Record j = 1
  Record i = 0        a              b
  Record i = 1        c              d

• Similarity metrics based on this table:


- Matching similarity = (a+d)/(a+b+c+d)
- Jaccard similarity = d/(b+c+d)
  Use Jaccard in cases where a matching “1” is much stronger evidence of
  similarity than a matching “0”
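A small sketch of both measures for two hypothetical 0/1 records (the vectors below are made up for illustration):

```python
def binary_similarities(x, y):
    """Matching and Jaccard similarity for two equal-length 0/1 records."""
    a = sum(xi == 0 and yi == 0 for xi, yi in zip(x, y))  # both 0
    b = sum(xi == 0 and yi == 1 for xi, yi in zip(x, y))  # x is 0, y is 1
    c = sum(xi == 1 and yi == 0 for xi, yi in zip(x, y))  # x is 1, y is 0
    d = sum(xi == 1 and yi == 1 for xi, yi in zip(x, y))  # both 1
    matching = (a + d) / (a + b + c + d)
    jaccard = d / (b + c + d) if (b + c + d) else 0.0
    return matching, jaccard

# Two hypothetical records measured on six 0/1 variables
print(binary_similarities([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1]))
# -> (0.666..., 0.5)
```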
Other Distance Measures
• Correlation-based similarity
• Statistical distance (Mahalanobis)
• Manhattan distance (absolute
differences)
• Maximum coordinate distance
• Gower’s similarity (for mixed variable
types: continuous & categorical)
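Most of these are available in scipy.spatial.distance; a short sketch on two raw records from the utilities table (Mahalanobis additionally needs the inverse covariance matrix of the full data set, and Gower's similarity is not in SciPy but is available in third-party packages):

```python
import numpy as np
from scipy.spatial import distance

u = np.array([1.06, 9.2, 151.0, 54.4])   # Arizona, first four raw variables
v = np.array([0.89, 10.3, 202.0, 57.9])  # Boston, first four raw variables

print(distance.euclidean(u, v))    # baseline for comparison
print(distance.cityblock(u, v))    # Manhattan: sum of absolute differences
print(distance.chebyshev(u, v))    # maximum coordinate distance
print(distance.correlation(u, v))  # 1 - Pearson correlation of the profiles
# Mahalanobis (statistical distance) also needs the inverse covariance
# matrix VI of the full data set: distance.mahalanobis(u, v, VI)
```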
Measuring Distance Between
Clusters
How to Define Inter-Cluster Similarity

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
Cluster Similarity: MIN or Single Link
• Similarity of two clusters is based on the two
most similar (closest) points in the different
clusters
- Determined by one pair of points, i.e., by one link in
the distance graph.
Hierarchical Clustering: MIN
Cluster Similarity: MAX or Complete
Linkage
• Similarity of two clusters is based on
the two least similar (most distant)
points in the different clusters
Hierarchical Clustering: MAX
Cluster Similarity: Group Average
• Distance between two clusters is the average of the pairwise distances
  between points in the two clusters:

  $\text{proximity}(\text{Cluster}_i, \text{Cluster}_j) =
  \dfrac{\sum_{p_i \in \text{Cluster}_i} \sum_{p_j \in \text{Cluster}_j}
  \text{proximity}(p_i, p_j)}{|\text{Cluster}_i| \times |\text{Cluster}_j|}$
• Need to use average connectivity for scalability
since total distance favors large clusters
Hierarchical Clustering: Group Average
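A sketch of how the linkage choice alone can change the result: the same synthetic 2-D data is cut into three clusters under single (MIN), complete (MAX), group average, and centroid linkage, using SciPy's names for these schemes:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three synthetic blobs in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, (10, 2)),
               rng.normal([5, 0], 0.5, (10, 2)),
               rng.normal([0, 5], 0.5, (10, 2))])

# Only the linkage method changes between runs
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)              # 'single' = MIN, 'complete' = MAX
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, np.bincount(labels)[1:])     # resulting cluster sizes
```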
Hierarchical Clustering: Limitations
• Once a decision is made to combine two clusters, it
cannot be undone
• No objective function is directly minimized
• High time and space complexity (storing the full distance matrix alone
  requires O(N²) space)
• Different schemes have problems with one or more of
the following:
- Sensitivity to noise and outliers
- Biased towards globular clusters
- Difficulty handling different sized clusters and convex shapes
- Breaking large clusters
The Hierarchical Clustering Steps (Using
Agglomerative Method)

• Dendrogram, from bottom up, illustrates the process

Records 12 & 21
are closest &
form first cluster
Reading the Dendrogram
• See process of clustering: Lines connected lower
down are merged earlier
- 10 and 13 will be merged next, after 12 & 21
• Determining number of clusters: for a given “distance between
  clusters,” a horizontal line drawn at that height cuts the dendrogram
  into clusters that are at least that far apart
- E.g., at distance of 4.6 (red line in next slide), data can be
reduced to 2 clusters -- The smaller of the two is circled
- At distance of 3.6 (green line) data can be reduced to 6
clusters, including the circled cluster
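Cutting the dendrogram at a given distance corresponds to SciPy's fcluster with criterion='distance'. A sketch with random stand-in data; with the actual normalized utility table, thresholds of 4.6 and 3.6 would give the 2 and 6 clusters described above, while the counts printed here depend on the random data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = rng.normal(size=(22, 8))             # stand-in for the normalized utilities

Z = linkage(X, method="average")

# 'Cutting' the dendrogram: every merge above the threshold is undone,
# so a lower threshold yields more clusters
for threshold in (4.6, 3.6):
    labels = fcluster(Z, t=threshold, criterion="distance")
    print(threshold, "->", labels.max(), "clusters")
```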
Validating Clusters
Interpretation
• Goal: obtain meaningful and useful clusters
• Caveats:
- Random chance can often produce apparent clusters
- Different cluster methods produce different results
• Solutions:
- Obtain summary statistics
- Also review clusters in terms of variables not used in
clustering
- Label the cluster (e.g. clustering of financial firms in
2008 might yield label like “midsize, sub-prime loser”)
Desirable Cluster Features

• Stability – are clusters and cluster assignments sensitive to slight
  changes in inputs? Are cluster assignments in partition B similar to
  partition A?
• Separation – check ratio of between-cluster variation to
  within-cluster variation (higher is better)
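One way to check both properties, sketched with scikit-learn on synthetic data (the 5% perturbation, k = 3, and the use of the adjusted Rand index to compare partitions are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))

# Stability: cluster, perturb the inputs slightly, cluster again, compare
labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    X + rng.normal(scale=0.05, size=X.shape))
print("agreement (adjusted Rand):", adjusted_rand_score(labels_a, labels_b))

# Separation: ratio of between-cluster to within-cluster variation
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
within = km.inertia_                            # within-cluster sum of squares
total = ((X - X.mean(axis=0)) ** 2).sum()       # total sum of squares
print("between/within ratio:", (total - within) / within)
```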
Nonhierarchical Clustering:
K-Means Clustering
K-Means Clustering Algorithm
1. Choose # of clusters desired, k
2. Start with a partition into k clusters
   - Often based on random selection of k centroids
3. At each step, move each record to the cluster with the closest centroid
4. Recompute centroids, repeat step 3
5. Stop when moving records would increase within-cluster dispersion
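A minimal NumPy sketch of these steps; it stops when the centroids stop moving, i.e., when no further reassignment would reduce within-cluster dispersion (the empty-cluster guard is a practical detail, not part of the slide's outline):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: start from k randomly chosen records as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: move each record to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute the centroids (keep the old one if a cluster empties)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # Step 5: stop once no record moves (centroids unchanged)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
labels, centroids = kmeans(X, k=2)
print(np.bincount(labels), centroids)
```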
K-means Algorithm:
Choosing k and Initial Partitioning
• Choose k based on how the results will be used
  - e.g., “How many market segments do we want?”
• Also experiment with slightly different k’s
• Initial partition into clusters can be random, or based on domain
  knowledge
  - If random partition, repeat the process with different random
    partitions
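With scikit-learn, the repeated random partitions are handled by the n_init parameter; a sketch that also tries a few values of k on stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))    # stand-in for the normalized records

# Experiment with slightly different k's; n_init repeats the random
# initial partition several times and keeps the best (lowest-SSE) run
for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```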
Importance of Choosing Initial Centroids

[Figure: six scatter plots of the same 2-D data set (x vs. y) at
iterations 1–6 of k-means, from one choice of initial centroids]
Importance of Choosing Initial Centroids …

[Figure: scatter plots of the same 2-D data set (x vs. y) at iterations
1–5 of k-means, starting from a different choice of initial centroids]
Evaluating K-means Clusters
• Most common measure is Sum of Squared Error (SSE)
  - For each point, the error is the distance to the nearest cluster
  - To get SSE, we square these errors and sum them over the K clusters:

    $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \text{dist}^2(m_i, x)$

  - x is a data point in cluster C_i and m_i is the representative point
    for cluster C_i
  - Can show that m_i corresponds to the center (mean) of the cluster
• Given two clusterings, we can choose the one with the smallest error
• A good clustering with smaller K can have a lower SSE than a poor
  clustering with higher K
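A small sketch of the SSE computation on stand-in data (scikit-learn's KMeans reports the same quantity as its inertia_ attribute):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances of each point to its own cluster's centroid."""
    return sum(((X[labels == i] - m) ** 2).sum()
               for i, m in enumerate(centroids))

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
labels = rng.integers(0, 3, size=50)                        # some clustering
centroids = np.array([X[labels == i].mean(axis=0) for i in range(3)])
print(sse(X, labels, centroids))
```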
XLMiner Output: Cluster Centroids

Cluster      Fixed_charge   RoR    Cost   Load_factor
Cluster-1    0.89           10.3   202    57.9
Cluster-2    1.43           15.4   113    53
Cluster-3    1.06           9.2    151    54.4
We chose k = 3
4 of the 8 variables are shown
Distance Between Clusters

Distance between clusters   Cluster-1     Cluster-2     Cluster-3
Cluster-1                   0             5.03216253    3.16901457
Cluster-2                   5.03216253    0             3.76581196
Cluster-3                   3.16901457    3.76581196    0

Clusters 1 and 2 are relatively well separated from each other, while
cluster 3 is not as well separated
Within-Cluster Dispersion
Data summary (In Original coordinates)

Cluster     #Obs   Average distance in cluster
Cluster-1   12     1748.348058
Cluster-2   3      907.6919822
Cluster-3   7      3625.242085
Overall     22     2230.906692

Clusters 1 and 2 are relatively tight, cluster 3 very loose


Conclusion: Clusters 1 & 2 well defined, not so for cluster 3

Next step: try again with k=2 or k=4
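One plausible reading of "average distance in cluster" is the average distance of each record to its cluster's centroid, measured in the original units; a sketch on stand-in data (the assignment vector here is arbitrary, just to show the computation):

```python
import numpy as np

def average_distance_in_cluster(X, labels):
    """Average Euclidean distance of each record to its cluster's centroid."""
    result = {}
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        result[c] = np.linalg.norm(members - centroid, axis=1).mean()
    return result

# Stand-in data: 22 records, 8 variables, an arbitrary 3-cluster assignment
rng = np.random.default_rng(5)
X = rng.normal(size=(22, 8))
labels = rng.integers(0, 3, size=22)
print(average_distance_in_cluster(X, labels))
```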


Applications

• Data Exploration and Understanding


• Market Segmentation
• Multiple Regression / Classification
models
• Characterization of Normality in Novelty
Detection
Summary
• Cluster analysis is an exploratory tool. Useful only when it
produces meaningful clusters
• Hierarchical clustering gives visual representation of different
levels of clustering
- On the other hand, due to its non-iterative nature, it can be unstable,
  can vary highly depending on the settings, and is computationally
  expensive
• Non-hierarchical is computationally cheap and more stable;
requires user to set k
• Can use both methods
• Be wary of chance results; data may not have definitive “real”
clusters
