L18 K Means

This document summarizes the K-means clustering algorithm. It begins by listing other clustering algorithms and then defines K-means, explaining that it is a partitional clustering approach that assigns data points to clusters based on proximity to centroid points. It discusses issues with K-means such as sensitivity to initial centroid positions and limitations in handling clusters of differing sizes, densities, or non-globular shapes. The document provides examples and discusses strategies for addressing empty clusters and updating centroids incrementally.

BITS Pilani, Hyderabad Campus
Dr. Aruna Malapati, Asst Professor, Department of CSIS

K-Means Clustering
Today’s Learning Objectives

• List the clustering algorithms

• Define K-Means clustering algorithm

• List and resolve issues with K-Means clustering

BITS Pilani, Hyderabad Campus


Clustering Algorithms

• K-means and its variants

• Hierarchical clustering

• Density-based clustering



K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple

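The basic algorithm on the slide above can be sketched in a few lines of NumPy. This is a minimal illustration under my own conventions, not the lecture's reference code; the function name `kmeans` and the choice of initializing centroids at K random data points are assumptions.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Basic K-means (illustrative sketch): assign each point to the
    closest centroid, then recompute each centroid as a cluster mean."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments will no longer change
        centroids = new_centroids
    return centroids, labels
```

On two well-separated groups of points, the loop settles after a handful of iterations with one centroid per group.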


Importance of Choosing Initial Centroids

[Figure: six panels showing K-means converging over Iterations 1–6 from one choice of initial centroids; each panel plots the data on x ∈ [−2, 2], y ∈ [0, 3].]



K-Means Clustering (Bishop, Section 9.1, p. 454)

• Given a data set {x1, . . . , xN} where each xn is a D-dimensional Euclidean variable.
• Our goal is to partition the data set into some number K of clusters.
• Let μk, where k = 1, . . . , K, be a prototype associated with the kth cluster (representing the centre of that cluster).
• Our goal is then to find an assignment of data points to clusters, as well as a set of vectors {μk}, such that the sum of the squares of the distances of each data point to its closest vector μk is a minimum.



K-Means Clustering

• For each data point xn, we introduce a corresponding set of binary indicator variables rnk ∈ {0, 1}, where k = 1, . . . , K, describing which of the K clusters the data point xn is assigned to: if xn is assigned to cluster k, then rnk = 1 and rnj = 0 for j ≠ k.



K-means Clustering

• We can then define an objective function (the distortion measure), which represents the sum of the squares of the distances of each data point to its assigned vector μk:

J = Σn=1..N Σk=1..K rnk ||xn − μk||²

• Our goal is to find values for the {rnk} and the {μk} so as to minimize J.

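The distortion measure J defined above can be evaluated directly from the indicator variables. A minimal sketch, where the function name `objective_J` and the encoding of rnk via a `labels` array are my own conventions:

```python
import numpy as np

def objective_J(X, mu, labels):
    """J = sum_n sum_k r_nk ||x_n - mu_k||^2, with r_nk the one-hot
    cluster indicators implied by the integer `labels` array."""
    N, K = len(X), len(mu)
    r = np.zeros((N, K))
    r[np.arange(N), labels] = 1  # r_nk = 1 iff x_n is assigned to cluster k
    # Squared Euclidean distance from every point to every prototype.
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return float((r * sq_dists).sum())
```

Assigning each point to its nearest prototype gives a smaller J than any other assignment, which is exactly what the assignment step of K-means exploits.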


Importance of Choosing Initial Centroids

[Figure: a single panel showing the clustering reached at Iteration 5 from a poor initialization, with clusters labelled 1–4; data plotted on x ∈ [−2, 2], y ∈ [0, 3].]



Solution to Random Initialization

• Perform multiple runs, each with a different choice of initial centroids, and select the set of clusters with the minimum SSE.
• The success of this strategy depends on the data set and the number of clusters chosen.

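The multiple-runs strategy can be sketched as follows. The helper `kmeans_once` is a hypothetical stand-in for a single K-means run (not code from the slides); the point is the outer loop, which keeps the run with the lowest SSE.

```python
import numpy as np

def kmeans_once(X, k, seed):
    # One K-means run: random data points as initial centroids, then
    # alternating assignment and mean-update steps (illustrative sketch).
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(100):
        labels = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        c = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else c[j]
                      for j in range(k)])
    # SSE: squared distance of each point to its assigned centroid.
    sse = float(((X - c[labels]) ** 2).sum())
    return c, labels, sse

def kmeans_best_of(X, k, n_runs=10):
    """Run K-means from several random initializations and keep the
    run with the smallest SSE, as suggested on the slide."""
    runs = [kmeans_once(X, k, seed) for seed in range(n_runs)]
    return min(runs, key=lambda run: run[2])
```

Each seed gives a different initialization, so poor local minima from one run can be discarded in favour of a better run.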


Handling Empty Clusters

• The basic K-means algorithm can yield empty clusters.
• Several strategies for choosing a replacement centroid:
  – Choose the point that contributes most to the SSE.
  – Choose a point from the cluster with the highest SSE.
• If there are several empty clusters, the above can be repeated several times.



Updating Centroids Incrementally

• In the basic K-means algorithm, centroids are updated after all points have been assigned to a centroid.
• An alternative is to update the centroids after each assignment (incremental approach):
  – Each assignment updates zero or two centroids (none if the point keeps its cluster; both the old and the new cluster's centroids if it moves).
  – More expensive per pass.
  – Never produces an empty cluster.
  – Can use "weights" to change the impact of each point.

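One simple flavour of the incremental idea is online K-means, where the winning centroid moves toward each point as soon as the point is assigned. This sketch (names and details are my own) updates one centroid per assignment; the slide's bookkeeping variant, which tracks reassignments, updates zero or two.

```python
import numpy as np

def incremental_kmeans(X, centroids, n_passes=5):
    """Online K-means sketch: after each point is assigned, immediately
    nudge the winning centroid toward it. The count-based step size
    mu += (x - mu) / count keeps each centroid a running mean, treating
    the incoming centroid as one pseudo-observation."""
    centroids = centroids.astype(float).copy()
    for _ in range(n_passes):
        counts = np.ones(len(centroids))  # reset running counts each pass
        for x in X:
            # Assign the point to its nearest centroid...
            j = np.linalg.norm(centroids - x, axis=1).argmin()
            # ...and update that centroid right away.
            counts[j] += 1
            centroids[j] += (x - centroids[j]) / counts[j]
    return centroids
```

Because every assignment moves a centroid, a centroid can never be starved of points for an entire pass, which is why the incremental approach never yields an empty cluster.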


Pre-processing and Post-processing

• Pre-processing
  – Normalize the data
  – Eliminate outliers

• Post-processing
  – Eliminate small clusters that may represent outliers
  – Split ‘loose’ clusters, i.e., clusters with relatively high SSE
  – Merge clusters that are ‘close’ and that have relatively low SSE
  – These steps can also be used during the clustering process (e.g., ISODATA)



Bisecting K-means

• A variant of K-means that can produce either a partitional or a hierarchical clustering: repeatedly split one cluster into two using K-means with K = 2.

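The bisecting variant can be sketched as: start from one cluster and repeatedly split the worst (highest-SSE) cluster with a 2-means run. The names below and the deterministic farthest-point seeding of the 2-means step are my own choices, not the lecture's.

```python
import numpy as np

def two_means(X, n_iters=20):
    # One K-means run with k=2, deterministically seeded with the first
    # point and the point farthest from it (an illustrative choice).
    far = np.linalg.norm(X - X[0], axis=1).argmax()
    c = np.array([X[0], X[far]], dtype=float)
    for _ in range(n_iters):
        labels = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        c = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else c[j]
                      for j in range(2)])
    return labels

def bisecting_kmeans(X, k):
    """Bisecting K-means sketch: start with one cluster and repeatedly
    bisect the cluster with the largest SSE until k clusters remain."""
    clusters = [np.arange(len(X))]  # list of index arrays into X
    while len(clusters) < k:
        # Pick the cluster with the highest SSE to bisect.
        sses = [((X[c] - X[c].mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        labels = two_means(X[target])
        clusters += [target[labels == 0], target[labels == 1]]
    return clusters
```

Recording the sequence of splits, rather than only the final partition, yields the hierarchical clustering mentioned on the slide.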


Bisecting K-means Example



Limitations of K-means

• K-means has problems when clusters are of differing

– Sizes

– Densities

– Non-globular shapes

• K-means has problems when the data contains outliers.



Limitations of K-means:
Differing Sizes

[Figure: original points (left) and the K-means result with 3 clusters (right).]



Limitations of K-means:
Differing Density

[Figure: original points (left) and the K-means result with 3 clusters (right).]



Limitations of K-means:
Non-globular Shapes

[Figure: original points (left) and the K-means result with 2 clusters (right).]



Problems with K-Means Clustering

• K-means works best for clusters that are roughly Gaussian (convex and globular); it cannot reliably find complex or non-convex clusters.
• The algorithm is very sensitive to initialization, so one must be careful when initializing the cluster means.
• The algorithm can get stuck in a local optimum, finding clusters different from those originally wanted; this, too, is affected by the initialization of the cluster means.



K-medoids Clustering
Algorithm



PAM (Partitioning Around Medoids) (1987)

• PAM (Kaufman and Rousseeuw, 1987) uses real objects (medoids) to represent the clusters:
  1. Select k representative objects arbitrarily.
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCih.
  3. If TCih < 0, replace i with h.
  4. Assign each non-selected object to the most similar representative object.
  5. Repeat steps 2–4 until there is no change.

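The swap-based search above can be sketched on a precomputed pairwise distance matrix. This is a simplified, brute-force version: TCih is evaluated by recomputing the full cost of the candidate medoid set, which is more expensive than Kaufman and Rousseeuw's incremental bookkeeping; function names are my own.

```python
import numpy as np

def total_cost(D, medoids):
    # Cost of a medoid set: each object contributes its distance
    # to the nearest medoid.
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    """Simplified PAM sketch on a pairwise distance matrix D: greedily
    swap a medoid i with a non-medoid h whenever the swap lowers the
    total cost (i.e. TC_ih < 0), until no improving swap remains."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for i in list(medoids):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                # Accept the swap i -> h if it strictly lowers the cost.
                if total_cost(D, candidate) < total_cost(D, medoids):
                    medoids = candidate
                    improved = True
    # Final assignment: each object goes to its most similar medoid.
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```

Because medoids are actual data objects and only a distance matrix is needed, the same sketch works for non-Euclidean dissimilarities, which is one reason PAM resists outliers better than K-means.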


A Typical K-Medoids
Algorithm (PAM)



Computational Complexity of K-Means

• In each iteration:
  – It costs O(Kn) to compute the distances between each of the n examples and the K cluster means.
  – It costs O(n) to update the cluster means by adding each example to one cluster.
• Assuming t iterations are performed before the algorithm terminates, the overall computational complexity is O(tKn).



K-Means/Median/Mode/Medoid
Clustering complexity



Take-home Message

• The K-means algorithm is a simple yet popular method for clustering analysis.
• Its performance is determined by the initialization and by an appropriate choice of distance measure.
• There are several variants of K-means that address its weaknesses:
  – K-Medoids: resistance to noise and/or outliers
  – K-Modes: extension to categorical data clustering
  – CLARA: extension to deal with large data sets
  – Mixture models (EM algorithm): handle uncertainty of cluster membership
