Clustering Analysis: What Is Cluster Analysis?
Types of Clustering
Partitioning and Hierarchical Clustering
Hierarchical Clustering
A set of nested clusters organized as a hierarchical tree
Partitioning Clustering
A division of data objects into non-overlapping subsets (clusters) such that each data
object is in exactly one subset
What is K-means?
Goal of K-means:
To find the best division of n entities into k groups, so that the total distance between
each group's members and its corresponding centroid, the representative of the group, is
minimized.
Details of K-means
Initial centroids are often chosen randomly
The centroid is the mean of the points in the cluster
Closeness is measured by Euclidean distance, cosine similarity,
correlation, etc.
K-means will converge for common similarity measures mentioned above
Most of the convergence happens in the first few iterations.
Euclidean Distance
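As a concrete reference, here is a minimal Python helper for the Euclidean distance between two n-dimensional points (the function name is illustrative):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two n-dimensional points x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

For example, the distance between (0, 0) and (3, 4) is 5.0.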
Update Centroid
We use the following equation to calculate the n-dimensional centroid point from k
n-dimensional points: for each dimension d = 1, ..., n,

centroid_d = (1/k) * (x_{1,d} + x_{2,d} + ... + x_{k,d})

i.e., the centroid is the per-dimension mean of the points.
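The centroid update can be sketched in Python; `centroid` is an illustrative helper that averages k n-dimensional points dimension by dimension:

```python
def centroid(points):
    """Per-dimension mean of k n-dimensional points (tuples of equal length)."""
    k = len(points)
    return tuple(sum(dim) / k for dim in zip(*points))
```

For example, the centroid of (0, 0) and (2, 4) is (1.0, 2.0).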
For each point, the error is the distance to the nearest cluster
To get SSE, we square the errors and sum them
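A short sketch of the SSE computation described above, assuming points and centroids are tuples of equal dimension (the function name is illustrative):

```python
def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        # Squared Euclidean distance to the closest centroid.
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
    return total
```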
How to choose K?
Scree plot (or Elbow Method): plot SSE against each candidate value of k and pick the
"elbow", the value of k beyond which adding more clusters yields only a small decrease in SSE.
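The elbow method can be illustrated with a toy sketch: run k-means for several values of k and inspect how the SSE curve flattens. The `kmeans_sse` helper below is an illustrative, self-contained implementation, not a library call:

```python
import random

def kmeans_sse(points, k, iters=20, seed=0):
    """Run a basic k-means and return the final SSE (toy helper for the elbow plot)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        buckets = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            buckets[j].append(p)
        # Recompute centroids; keep the old one if a bucket is empty.
        centroids = [tuple(sum(d) / len(b) for d in zip(*b)) if b else centroids[i]
                     for i, b in enumerate(buckets)]
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
               for p in points)

# SSE decreases as k grows; look for the "elbow" where the drop flattens.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
curve = [kmeans_sse(data, k) for k in (1, 2, 3)]
```

On this toy data the SSE drops sharply from k=1 to k=2 (the two obvious clusters) and only slightly after that, suggesting k=2.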
Lloyd's Algorithm:
Initially, k random observations are chosen to serve as the centroids of the k
clusters. Then the following steps occur in iterations until the centroids converge.
The Euclidean distance between each observation and the current centroids
is calculated
Each observation is assigned to the bucket of the centroid it is closest to,
giving k buckets
The mean of all the observations in each bucket serves as the new centroid
The new centroids replace the old centroids, and the iteration goes back to
step 1 if the old and new centroids have not converged
The conditions to converge are the following: the old and the new centroids are
exactly identical, the difference between the old and new centroids is small (on the order of 10^-3), or the maximum number of iterations (e.g., 10 or 100) is reached.
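The iteration described above (assignment, centroid update, convergence check) can be sketched as a self-contained Python function; the tolerance and iteration cap mirror the convergence conditions mentioned, and the names are illustrative:

```python
import random

def lloyd_kmeans(points, k, tol=1e-3, max_iter=100, seed=42):
    """Standard k-means: iterate assignment and centroid updates until the
    centroids move less than tol, or max_iter iterations are reached."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Step 1: assign each observation to its closest centroid (Euclidean).
        buckets = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            buckets[j].append(p)
        # Step 2: the mean of each bucket becomes the new centroid.
        new_centroids = []
        for i, b in enumerate(buckets):
            if b:
                new_centroids.append([sum(d) / len(b) for d in zip(*b)])
            else:
                new_centroids.append(centroids[i])  # keep old centroid for empty bucket
        # Step 3: stop when old and new centroids are (nearly) identical.
        shift = max(sum((a - b) ** 2 for a, b in zip(c, n)) ** 0.5
                    for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, buckets
```

On well-separated data this typically converges within the first few iterations, consistent with the note above.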
MacQueen's Algorithm:
This is an online version in which the first k instances are chosen as the centroids
Each subsequent instance is placed in a bucket depending on which centroid is
closest to that instance, and that centroid is recalculated immediately
This step repeats until every instance is placed in the appropriate bucket
The algorithm makes only a single pass, looping once over the n instances
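MacQueen's single-pass update can be sketched as follows; after each assignment only the affected centroid is recalculated, via a running-mean update (function and variable names are illustrative):

```python
def macqueen_kmeans(points, k):
    """MacQueen's online k-means: the first k instances seed the centroids;
    each later instance is assigned to the nearest centroid, which is then
    updated immediately as a running mean. One pass over the data."""
    centroids = [list(p) for p in points[:k]]
    counts = [1] * k
    assignments = list(range(k))  # the seed instances define the first k buckets
    for p in points[k:]:
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        counts[j] += 1
        # Running-mean update: c_new = c_old + (p - c_old) / n_j
        centroids[j] = [c + (a - c) / counts[j] for c, a in zip(centroids[j], p)]
        assignments.append(j)
    return centroids, assignments
```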
Hartigan- Wong Algorithm:
Assign all the points/instances to random buckets and calculate the
respective centroids
Starting from the first instance, find the nearest centroid and assign the
instance to that bucket. If the bucket changed, recalculate the two new centroids, i.e. the
centroid of the newly assigned bucket and the centroid of the old bucket,
as those are the two centroids affected by the change
Loop through all the points to get the new centroids
Do a second iteration of steps 2 and 3, which performs a sort of clean-up
operation and reassigns stray points to the correct buckets.
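A rough sketch of the procedure above; the `passes=2` default reflects the second clean-up iteration, and the names and details are illustrative rather than a reference implementation:

```python
import random

def hartigan_wong_kmeans(points, k, passes=2, seed=0):
    """Hartigan-Wong sketch: start from random bucket assignments, then move
    each point to its nearest centroid one at a time, recomputing only the
    two affected centroids after each move. A second pass reassigns strays."""
    rng = random.Random(seed)
    assignments = [rng.randrange(k) for _ in points]

    def bucket_mean(i):
        members = [p for p, a in zip(points, assignments) if a == i]
        if not members:
            return None  # empty bucket has no centroid
        return [sum(d) / len(members) for d in zip(*members)]

    centroids = [bucket_mean(i) for i in range(k)]
    for _ in range(passes):
        for idx, p in enumerate(points):
            live = [i for i in range(k) if centroids[i] is not None]
            j = min(live,
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            old = assignments[idx]
            if j != old:
                assignments[idx] = j
                # Only the old and new buckets' centroids are affected.
                centroids[old] = bucket_mean(old)
                centroids[j] = bucket_mean(j)
    return centroids, assignments
```

Updating only the two affected centroids is what makes this variant cheaper per move than recomputing every centroid each iteration.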