
Clustering Analysis

What is Cluster Analysis?


Cluster analysis groups data objects based only on information found in the data
that describes the objects and their relationships.
The main purpose of clustering techniques is to partition a set of entities into
different groups, called clusters.

Goal of Cluster Analysis


The objects within a group should be similar to one another and different from the
objects in other groups.

Types of Clustering
Partitioning and Hierarchical Clustering

Hierarchical Clustering
A set of nested clusters organized as a hierarchical tree

Partitioning Clustering
A division of data objects into non-overlapping subsets (clusters) such that each
data object is in exactly one subset

What is K-means?

Partitional clustering approach


Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, K, must be specified

Goal of K-means:
To find the best division of n entities into k groups, so that the total distance
between each group's members and its corresponding centroid, the representative of
the group, is minimized.

Basic Algorithm of K-means


Select K points as the initial centroids
repeat
    Form K clusters by assigning all points to the closest centroid
    Recompute the centroid of each cluster
until the centroids don't change
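The steps above can be sketched in pure Python. This is a minimal illustration
under names of our choosing, not a production implementation (it picks the initial
centroids at random from the data, as the notes below describe):

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-means sketch; points is a list of n-dimensional tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 1: select K points as initial centroids
    for _ in range(max_iters):
        # Step 2: form K clusters by assigning each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # Step 3: recompute each centroid as the mean of the points in its cluster
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # Step 4: stop when centroids don't change
            break
        centroids = new_centroids
    return centroids, clusters
```

Note that most of the movement happens in the first few iterations, so the loop
usually exits well before max_iters.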

Details of K-means
Initial centroids are often chosen randomly
The centroid is the mean of the points in the cluster
Closeness is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means will converge for common similarity measures mentioned above
Most of the convergence happens in the first few iterations.

Euclidean Distance
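The Euclidean distance between two n-dimensional points p and q is
d(p, q) = sqrt((p1 - q1)^2 + ... + (pn - qn)^2). A minimal sketch (the function
name is ours):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two n-dimensional points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```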

Update Centroid
We use the following equation to calculate the n-dimensional centroid point from k
n-dimensional points:

centroid_j = (x(1)_j + x(2)_j + ... + x(k)_j) / k,  for each dimension j = 1 ... n
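This per-coordinate mean can be sketched directly (the helper name is ours):

```python
def centroid(points):
    """Mean of k n-dimensional points: each coordinate of the centroid is the
    mean of that coordinate over the k points."""
    k = len(points)
    return tuple(sum(coords) / k for coords in zip(*points))
```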

Evaluating K-means Clusters:


The most common measure is the Sum of Squared Error (SSE)

For each point, the error is the distance to the nearest centroid
To get SSE, we square these errors and sum them

SSE = Σ(i=1..K) Σ(x ∈ Ci) dist²(mi, x)

where x is a data point in cluster Ci and mi is the representative point for cluster Ci


One can show that mi corresponds to the center (mean) of the cluster
Given two sets of clusters, we can choose the one with the smallest error
One easy way to reduce SSE is to increase K, the number of clusters
However, a good clustering with a smaller K can have a lower SSE than a poor
clustering with a higher K
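Given the clusters and their centroids, SSE can be computed directly. A minimal
sketch (names are ours):

```python
def sse(clusters, centroids):
    """Sum of Squared Error: for each point, square its distance to its
    cluster's centroid, then sum over all points in all clusters."""
    total = 0.0
    for cluster, m in zip(clusters, centroids):
        for x in cluster:
            total += sum((a - b) ** 2 for a, b in zip(x, m))
    return total
```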

How to choose K?
Scree plot (Elbow Method)

(or)

Use another clustering method, like EM

Run the algorithm on the data with several different values of K and compare the
results
Use prior knowledge about the characteristics of the problem
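One simple way to automate reading the elbow off a scree plot is a
second-difference heuristic over precomputed SSE values. Here sse_by_k is assumed
to map each candidate K to the SSE of a K-means run at that K; this is one
heuristic among several, not the definitive method:

```python
def elbow_k(sse_by_k):
    """Pick the 'elbow' K: the K at which the marginal decrease in SSE
    collapses the most. Needs SSE for at least three K values."""
    ks = sorted(sse_by_k)
    # decrease in SSE achieved by each additional cluster
    drops = [sse_by_k[ks[i]] - sse_by_k[ks[i + 1]] for i in range(len(ks) - 1)]
    # how sharply the decrease falls off after each K (second difference)
    bends = [drops[i] - drops[i + 1] for i in range(len(drops) - 1)]
    return ks[bends.index(max(bends)) + 1]
```

For example, SSE values of 200, 60, 50, 45, 42 for K = 1..5 bend sharply at K = 2,
so the heuristic picks 2.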

K-means with a Simple Example:


http://mnemstudio.org/clustering-k-means-example-1.htm
Algorithms Used in K-means:
Lloyd's Algorithm:

Initially, k random observations are chosen to serve as the centroids of the k
clusters. Then the following steps are repeated until the centroids converge.
The Euclidean distance between each observation and the chosen centroids
is calculated
The observations closest to each centroid are tagged within k buckets
The mean of all the observations in each bucket serves as the new centroid
The new centroids replace the old centroids and the iteration goes back to
step 1 if the old and new centroids have not converged

The conditions to converge are the following: the old and the new centroids are
exactly identical, the difference between the centroids is small (of the order of
10^-3), or the maximum number of iterations (10 or 100) is reached.
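The convergence test described above can be sketched as follows (the function name
and the default tolerance of 10^-3 follow the text; both are illustrative):

```python
def converged(old, new, tol=1e-3):
    """Lloyd's stopping test: centroids are exactly identical, or every
    centroid has moved less than tol (Euclidean distance) since the last
    iteration. A separate max-iteration cap would guard the outer loop."""
    if old == new:
        return True
    return all(
        sum((a - b) ** 2 for a, b in zip(o, n)) ** 0.5 < tol
        for o, n in zip(old, new)
    )
```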

MacQueen's Algorithm:
This is an online version where the first k instances are chosen as the centroids
Then each instance is placed in the bucket whose centroid is closest to it,
and that centroid is recalculated
Repeat this step until every instance has been placed in the appropriate bucket
The algorithm makes only a single pass, so the loop runs once over all instances
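The recalculation step this online version relies on is a running-mean update:
folding one new point into a centroid without revisiting earlier points. A sketch
(the function name and signature are ours):

```python
def online_update(centroid, count, x):
    """Incrementally fold point x into a centroid that currently summarizes
    `count` points: new mean = old mean + (x - old mean) / new count."""
    n = count + 1
    new_centroid = tuple(m + (xi - m) / n for m, xi in zip(centroid, x))
    return new_centroid, n
```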
Hartigan-Wong Algorithm:
Assign all the points/instances to random buckets and calculate the
respective centroids
Starting from the first instance, find the nearest centroid and assign the
point to that bucket. If the bucket changed, recalculate the two affected
centroids, i.e. the centroid of the newly assigned bucket and the centroid
of the old bucket, as those are the two centroids affected by the change
Loop through all the points and get the new centroids
Do a second iteration of steps 2 and 3, which performs a sort of clean-up
operation and reassigns stray points to the correct buckets.
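The steps above can be sketched as follows. This is a simplified illustration
that reassigns points by nearest centroid rather than by the exact
SSE-improvement criterion of the published Hartigan-Wong algorithm, and all names
are ours:

```python
import random

def hartigan_wong_sketch(points, k, sweeps=2, seed=0):
    """Simplified Hartigan-Wong-style pass: random initial buckets, then
    per-point reassignment, recomputing only the two centroids affected
    by each move. The second sweep cleans up stray points."""
    rng = random.Random(seed)
    # Step 1: assign every point to a random bucket
    labels = [rng.randrange(k) for _ in points]

    def bucket_centroid(i):
        members = [p for p, l in zip(points, labels) if l == i]
        if not members:  # empty bucket: reseed from a random data point
            return points[rng.randrange(len(points))]
        return tuple(sum(c) / len(members) for c in zip(*members))

    centroids = [bucket_centroid(i) for i in range(k)]
    for _ in range(sweeps):
        for idx, p in enumerate(points):
            # Step 2: move the point to the bucket of its nearest centroid
            best = min(range(k),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            old = labels[idx]
            if best != old:
                labels[idx] = best
                # Step 3: recompute only the two centroids affected by the move
                centroids[old] = bucket_centroid(old)
                centroids[best] = bucket_centroid(best)
    return labels, centroids
```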
