L18 K Means

This document summarizes the K-means clustering algorithm. It begins by listing other clustering algorithms and then defines K-means, explaining that it is a partitional clustering approach that assigns data points to clusters based on proximity to centroid points. It discusses issues with K-means such as sensitivity to initial centroid positions and limitations in handling clusters of differing sizes, densities, or non-globular shapes. The document provides examples and discusses strategies for addressing empty clusters and updating centroids incrementally.

BITS Pilani, Hyderabad Campus
Dr. Aruna Malapati, Asst Professor, Department of CSIS

K-Means Clustering
Today’s Learning Objectives

• List the clustering algorithms

• Define K-Means clustering algorithm

• List and resolve issues with K-Means clustering

BITS Pilani, Hyderabad Campus


Clustering Algorithms

• K-means and its variants

• Hierarchical clustering

• Density-based clustering



K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple

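The basic algorithm on the slide above can be sketched in a few lines of NumPy. This is a minimal illustration under my own conventions, not the lecture's reference code; the function name `kmeans` and the choice of initializing centroids at K random data points are assumptions.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Basic K-means (illustrative sketch): assign each point to the
    closest centroid, then recompute each centroid as a cluster mean."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments will no longer change
        centroids = new_centroids
    return centroids, labels
```

On two well-separated groups of points, the loop settles after a handful of iterations with one centroid per group.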


Importance of Choosing Initial Centroids

[Figure: six panels showing K-means converging over Iterations 1–6 from one choice of initial centroids; each panel plots the data on x ∈ [−2, 2], y ∈ [0, 3].]



K-Means Clustering (Bishop, Section 9.1, p. 454)

• Given a data set {x1, . . . , xN} where each xn is a D-dimensional Euclidean variable.
• Our goal is to partition the data set into some number K of clusters.
• Let μk, where k = 1, . . . , K, be a prototype associated with the kth cluster (representing the centre of that cluster).
• Our goal is then to find an assignment of data points to clusters, as well as a set of vectors {μk}, such that the sum of the squares of the distances of each data point to its closest vector μk is a minimum.



K-Means Clustering

• For each data point xn, we introduce a corresponding set of binary indicator variables rnk ∈ {0, 1}, where k = 1, . . . , K, describing which of the K clusters the data point xn is assigned to: if xn is assigned to cluster k, then rnk = 1 and rnj = 0 for j ≠ k.



K-means Clustering

• We can then define an objective function (the distortion measure), which represents the sum of the squares of the distances of each data point to its assigned vector μk:

J = Σn=1..N Σk=1..K rnk ||xn − μk||²

• Our goal is to find values for the {rnk} and the {μk} so as to minimize J.

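The distortion measure J defined above can be evaluated directly from the indicator variables. A minimal sketch, where the function name `objective_J` and the encoding of rnk via a `labels` array are my own conventions:

```python
import numpy as np

def objective_J(X, mu, labels):
    """J = sum_n sum_k r_nk ||x_n - mu_k||^2, with r_nk the one-hot
    cluster indicators implied by the integer `labels` array."""
    N, K = len(X), len(mu)
    r = np.zeros((N, K))
    r[np.arange(N), labels] = 1  # r_nk = 1 iff x_n is assigned to cluster k
    # Squared Euclidean distance from every point to every prototype.
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return float((r * sq_dists).sum())
```

Assigning each point to its nearest prototype gives a smaller J than any other assignment, which is exactly what the assignment step of K-means exploits.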


Importance of Choosing Initial Centroids

[Figure: a single panel showing the clustering reached at Iteration 5 from a poor initialization, with clusters labelled 1–4; data plotted on x ∈ [−2, 2], y ∈ [0, 3].]



Solution to Random Initialization

• Perform multiple runs, each with a different choice of initial centroids, and select the set of clusters with the minimum SSE.
• The success of this strategy depends on the data set and the number of clusters chosen.

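The multiple-runs strategy can be sketched as follows. The helper `kmeans_once` is a hypothetical stand-in for a single K-means run (not code from the slides); the point is the outer loop, which keeps the run with the lowest SSE.

```python
import numpy as np

def kmeans_once(X, k, seed):
    # One K-means run: random data points as initial centroids, then
    # alternating assignment and mean-update steps (illustrative sketch).
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(100):
        labels = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        c = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else c[j]
                      for j in range(k)])
    # SSE: squared distance of each point to its assigned centroid.
    sse = float(((X - c[labels]) ** 2).sum())
    return c, labels, sse

def kmeans_best_of(X, k, n_runs=10):
    """Run K-means from several random initializations and keep the
    run with the smallest SSE, as suggested on the slide."""
    runs = [kmeans_once(X, k, seed) for seed in range(n_runs)]
    return min(runs, key=lambda run: run[2])
```

Each seed gives a different initialization, so poor local minima from one run can be discarded in favour of a better run.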


Handling Empty Clusters

• The basic K-means algorithm can yield empty clusters.
• Several strategies for choosing a replacement centroid:
  – Choose the point that contributes most to the SSE.
  – Choose a point from the cluster with the highest SSE.
• If there are several empty clusters, the above can be repeated several times.



Updating Centroids Incrementally

• In the basic K-means algorithm, centroids are updated after all points have been assigned to a centroid.
• An alternative is to update the centroids after each assignment (incremental approach):
  – Each assignment updates zero or two centroids (none if the point keeps its cluster; both the old and the new cluster's centroids if it moves).
  – More expensive per pass.
  – Never produces an empty cluster.
  – Can use "weights" to change the impact of each point.

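One simple flavour of the incremental idea is online K-means, where the winning centroid moves toward each point as soon as the point is assigned. This sketch (names and details are my own) updates one centroid per assignment; the slide's bookkeeping variant, which tracks reassignments, updates zero or two.

```python
import numpy as np

def incremental_kmeans(X, centroids, n_passes=5):
    """Online K-means sketch: after each point is assigned, immediately
    nudge the winning centroid toward it. The count-based step size
    mu += (x - mu) / count keeps each centroid a running mean, treating
    the incoming centroid as one pseudo-observation."""
    centroids = centroids.astype(float).copy()
    for _ in range(n_passes):
        counts = np.ones(len(centroids))  # reset running counts each pass
        for x in X:
            # Assign the point to its nearest centroid...
            j = np.linalg.norm(centroids - x, axis=1).argmin()
            # ...and update that centroid right away.
            counts[j] += 1
            centroids[j] += (x - centroids[j]) / counts[j]
    return centroids
```

Because every assignment moves a centroid, a centroid can never be starved of points for an entire pass, which is why the incremental approach never yields an empty cluster.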


Pre-processing and Post-processing

• Pre-processing
  – Normalize the data
  – Eliminate outliers

• Post-processing
  – Eliminate small clusters that may represent outliers
  – Split ‘loose’ clusters, i.e., clusters with relatively high SSE
  – Merge clusters that are ‘close’ and that have relatively low SSE
  – These steps can also be used during the clustering process (e.g., ISODATA)



Bisecting K-means

• A variant of K-means that can produce either a partitional or a hierarchical clustering: repeatedly split one cluster into two using K-means with K = 2.

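The bisecting variant can be sketched as: start from one cluster and repeatedly split the worst (highest-SSE) cluster with a 2-means run. The names below and the deterministic farthest-point seeding of the 2-means step are my own choices, not the lecture's.

```python
import numpy as np

def two_means(X, n_iters=20):
    # One K-means run with k=2, deterministically seeded with the first
    # point and the point farthest from it (an illustrative choice).
    far = np.linalg.norm(X - X[0], axis=1).argmax()
    c = np.array([X[0], X[far]], dtype=float)
    for _ in range(n_iters):
        labels = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        c = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else c[j]
                      for j in range(2)])
    return labels

def bisecting_kmeans(X, k):
    """Bisecting K-means sketch: start with one cluster and repeatedly
    bisect the cluster with the largest SSE until k clusters remain."""
    clusters = [np.arange(len(X))]  # list of index arrays into X
    while len(clusters) < k:
        # Pick the cluster with the highest SSE to bisect.
        sses = [((X[c] - X[c].mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        labels = two_means(X[target])
        clusters += [target[labels == 0], target[labels == 1]]
    return clusters
```

Recording the sequence of splits, rather than only the final partition, yields the hierarchical clustering mentioned on the slide.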


Bisecting K-means Example



Limitations of K-means

• K-means has problems when clusters are of differing

– Sizes

– Densities

– Non-globular shapes

• K-means has problems when the data contains outliers.



Limitations of K-means:
Differing Sizes

[Figure: original points (left) and the K-means result with 3 clusters (right).]



Limitations of K-means:
Differing Density

[Figure: original points (left) and the K-means result with 3 clusters (right).]



Limitations of K-means:
Non-globular Shapes

[Figure: original points (left) and the K-means result with 2 clusters (right).]



Problems with K-Means Clustering

• K-means works best for clusters that are roughly Gaussian (convex and globular); it cannot reliably find complex or non-convex clusters.
• The algorithm is very sensitive to initialization, so one must be careful when initializing the cluster means.
• The algorithm can get stuck in a local optimum, finding clusters different from those originally wanted; this, too, is affected by the initialization of the cluster means.



K-medoids Clustering
Algorithm



PAM (Partitioning Around Medoids) (1987)

• PAM (Kaufman and Rousseeuw, 1987) uses real objects (medoids) to represent the clusters:
  1. Select k representative objects arbitrarily.
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCih.
  3. If TCih < 0, replace i with h.
  4. Assign each non-selected object to the most similar representative object.
  5. Repeat steps 2–4 until there is no change.

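The swap-based search above can be sketched on a precomputed pairwise distance matrix. This is a simplified, brute-force version: TCih is evaluated by recomputing the full cost of the candidate medoid set, which is more expensive than Kaufman and Rousseeuw's incremental bookkeeping; function names are my own.

```python
import numpy as np

def total_cost(D, medoids):
    # Cost of a medoid set: each object contributes its distance
    # to the nearest medoid.
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    """Simplified PAM sketch on a pairwise distance matrix D: greedily
    swap a medoid i with a non-medoid h whenever the swap lowers the
    total cost (i.e. TC_ih < 0), until no improving swap remains."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for i in list(medoids):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                # Accept the swap i -> h if it strictly lowers the cost.
                if total_cost(D, candidate) < total_cost(D, medoids):
                    medoids = candidate
                    improved = True
    # Final assignment: each object goes to its most similar medoid.
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```

Because medoids are actual data objects and only a distance matrix is needed, the same sketch works for non-Euclidean dissimilarities, which is one reason PAM resists outliers better than K-means.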


A Typical K-Medoids
Algorithm (PAM)



Computational Complexity of K-Means

• In each iteration:
  – It costs O(Kn) to compute the distances between each of the n examples and the K cluster means.
  – It costs O(n) to update the cluster means by adding each example to one cluster.
• Assuming t iterations are performed before the algorithm terminates, the overall computational complexity is O(tKn).



K-Means/Median/Mode/Medoid
Clustering complexity



Take-home Message

• The K-means algorithm is a simple yet popular method for clustering analysis.
• Its performance is determined by the initialization and by an appropriate choice of distance measure.
• There are several variants of K-means that address its weaknesses:
  – K-Medoids: resistance to noise and/or outliers
  – K-Modes: extension to categorical data clustering
  – CLARA: extension to deal with large data sets
  – Mixture models (EM algorithm): handle uncertainty of cluster membership
