AI20001
Essentials of Machine
Learning
Unsupervised Learning
Koustav Rudra
25/08/2025
Example: Face Clustering
Example: Search result clustering
Example: Google News
A data set with clear cluster structure
What are some of the issues for clustering?
What clustering algorithms can we use?
Issues for clustering
• Representation for clustering
• How do we represent an example? (features, etc.)
• Similarity/distance between examples
• Flat clustering or hierarchical?
• How many clusters do we want to create?
• Fixed a priori
• Data driven
Major Types of Clustering Algorithms
• Flat algorithms
• Usually start with a random partitioning
• Refine it iteratively
• Example: K-means clustering
• Produces a disjoint set of groups
• Hierarchical algorithms
• Bottom-up, agglomerative
• Top-down, divisive
Hard vs. soft clustering
• Hard clustering:
• Each example belongs to exactly one cluster
• Soft clustering:
• An example can belong to more than one cluster (probabilistic)
• A pair of sneakers may belong to two groups
• sports apparel and shoes
Flat Clustering: K-means Clustering Algorithm
K-means
• K-means is simple, efficient and widely used
• Main steps of k-means:
STEP 1: Start with k initial cluster centers (that is why it is called k-means)
STEP 2: Assign/cluster each member to the closest center
STEP 3: Recalculate centers as the mean of the points in a cluster, then go back to STEP 2
Steps 2 and 3 are iterative. When to stop? Some possibilities:
1. after a fixed # of iterations
2. when the centers do not change
K-means: an example
K-means: Initialize centers randomly
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center
No changes: Done
K-means
Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster
How do we do this?
K-means
Iterate:
• Assign/cluster each example to closest center
• Iterate over each point:
• - get distance to each cluster center
• - assign to closest center (hard cluster)
• Recalculate centers as the mean of the points in a cluster
K-means
Iterate:
• Assign/cluster each example to closest center
• Iterate over each point:
• get distance to each cluster center
• assign to closest center (hard cluster)
• Recalculate centers as the mean of the points in a cluster
What distance measure should we use?
Distance measure
Euclidean:
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
x and y are n-dimensional vectors:
$x = (x_1, x_2, \ldots, x_n)$
$y = (y_1, y_2, \ldots, y_n)$
K-means
Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster
Where are the cluster centers?
K-means
Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster
How do we calculate these?
K-means
Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster
Mean of the points in the cluster:
$\mu(C) = \frac{1}{|C|} \sum_{x \in C} x$
where the sum and the division by |C| are applied componentwise to the n-dimensional vectors.
K-means loss function
K-means tries to minimize what is called the “k-means” loss function:
$\text{loss} = \sum_{i=1}^{n} d(x_i, \mu_k)^2$, where $\mu_k$ is the cluster center for $x_i$
That is, the sum of the squared distances from each point to the
associated cluster center.
K-means algorithm
Randomly initialize K cluster centroids $\mu_1, \mu_2, \ldots, \mu_K$
Repeat {
  Cluster assignment: for i = 1 to n
    $c^{(i)}$ := index (from 1 to K) of the cluster centroid closest to $x^{(i)}$
  Move centroid: for k = 1 to K
    $\mu_k$ := average (mean) of the points assigned to cluster k
}
Running time of Kmeans
• In every iteration
• Assign data points to closest cluster center
• O(kn) time (k = # clusters, n = # data points)
• Change the cluster center to the average of its assigned points
• O(n)
K-means: Big Issues
• Value of k (# clusters)
• Convergence
• A fixed number of iterations
• partitions unchanged
• Cluster centers do not change
• Initial (seed) cluster centers
K-MEANS: VALUE OF K
Elbow method
• Run k-means with different values of k
• Plot the k-means loss vs. k
• Choose k where the curve shows an elbow shape
[Plot: k-means loss vs. k; the elbow of the curve marks a good value of k]
Silhouette Measure
• Key Idea for good clustering
• Small within cluster variance
• Large between cluster variance
Silhouette Measure
$C_i$ is the cluster containing data point i; d(i, j) = distance between points i and j
Within-cluster measure: $a(i)$ = average distance from i to the other points in $C_i$
Between-cluster measure: $b(i)$ = smallest average distance from i to the points of any other cluster
Silhouette measure: $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$, and s(i) = 0 if $|C_i| = 1$
Silhouette Plot
• Property of the Silhouette measure:
• A high score is better
Average Silhouette score: $\frac{1}{n} \sum_{i=1}^{n} S(i)$
The desirable value of k is the one that maximizes this average score.
Initial (seed) cluster centers
K-means: Initialize centers randomly
What would happen here?
K-means: Initialize centers randomly
Bad clustering
Choice of Initial Centroids
• Results can vary drastically based on random seed selection
• Slow convergence
• converges to sub-optimal clustering
• Common heuristics
• Random centers in the space
• Randomly pick from feature vectors
• Points least similar to any existing center (furthest centers heuristic)
• Try out multiple starting points
• Initialize with the results of another clustering method
Furthest centers heuristic
• $\mu_1$ = pick a random point (the first center)
• for i = 2 to k (# clusters):
• $\mu_i$ = the point that is furthest from all previously chosen centers
K-means: Initialize furthest from centers
Say, k = 3
• Pick a random point for the first center
• Which point will be chosen next?
• Next point?
K-means: Initialize furthest from centers
Furthest point from center
Any issues/concerns with this approach?
Furthest points concerns
If k = 4, which points will get chosen?
Doesn’t deal well with outliers
A Better Approach
• K-means++
• Centers are initialized using a probabilistic approach
• Other steps are exactly the same as the standard k-means algorithm
• Cluster centers initialization:
1. Choose one center $c_1$ randomly from the data X
2. For each $x \in X$, compute D(x), the distance of x from the closest center already chosen
3. Select a point $x \in X$ as a new center with probability $\frac{D(x)^2}{\sum_{x \in X} D(x)^2}$
4. Repeat steps 2 and 3 until k centers are chosen
Illustration: say we want to create 3 clusters
[Figure: the first center is marked, along with the most likely second center and the most likely third center]
Quiz: Given the two centers, what will D(x) be for the marked point x: the length of the black line or the red line?
Clustering Graph/Network Data
Graph/Network Data
• So far, we talked about data as n-dimensional points
• What about the following?
• Facebook friendship network
• Communities in question-answering website like Quora
• Protein interaction network
Facebook Friendship Graph/Network
Problem You Want to Solve
• Find coherent groups
• friend circles in Facebook
• Group similar objects
• So, essentially, it is again a grouping problem with a different type of data
Example: Protein Interaction Network
What is a Graph?
• A graph is composed of two things
1. Set of objects (called nodes of the graph)
2. Set of connections (called edges of the graph)
[Figure: a small example graph with one node and one edge labeled]
Types of Graph
• Un-weighted
• Edges do not have weight
• Edges simply say whether two objects are connected or not
• Weighted
• Edges have weight
• Examples:
• Email communication network
• How often do two people exchange emails?
• City and Road network
• Edge weight: the distance between two cities
Graph Data
• Graph data shows relationship between object pairs
• Therefore, a centroid is often not meaningful in a graph
• What does centroid mean in your Facebook friend network?
• How do you define the center in a graph?
Clustering Un-weighted Graph
Graph Clustering: Minimum Cut (or Mincut)
Min cut of a graph
• Partition the nodes into two groups $S_1$ and $S_2$ so that the # of edges between $S_1$ and $S_2$ is minimized
Example
For the above graph, the min cut partition is {a,b,e,f} and {c,d,g,h}
Graph Clustering using Mincut
• Use a min cut algorithm to break a graph into two sets
• Use the min cut algorithm to further break the smaller graphs
• Continue until the stopping condition is satisfied
Karger’s Min Cut Algorithm
The algorithm is based on edge contraction
• Repeat until just two nodes remain
1. Pick an edge at random
2. Collapse its two endpoints into a single node
Karger’s Min Cut Algorithm: Example
14 edges to choose from: pick b–f with prob 1/14
13 edges to choose from: pick g–h with prob 1/13
12 edges to choose from: pick d–gh with prob 2/12
10 edges to choose from: pick a–e with prob 1/10
9 edges to choose from: pick ae–bf with prob 4/9
5 edges to choose from: pick c–dgh with prob 3/5
DONE: just two nodes remain
Min cut value: # of parallel edges in the final two-node graph
For this example: the min cut is 2
Karger’s Min Cut Algorithm
• An Important Note
• It is a randomized algorithm
• Therefore, for a good result, do the following
1. Run Karger’s algorithm multiple times
(it will produce multiple cuts)
2. Take the cut with minimum value
Clustering Weighted Graph
Hierarchical Clustering algorithms
• Agglomerative (bottom-up)
• Start with each object being a single cluster
• Gradually merge two most similar clusters
• Divisive (top-down)
• Start out with all objects in the same cluster.
• Then in each step of the algorithm do the following
• Partition a cluster into two smaller clusters, maximizing the distance between them
Hierarchical Clustering: Important Notes
1. Does not require the number of clusters in advance
2. Needs a termination/readout condition
• Could be distance threshold
Rest of the Lecture
• Hierarchical Agglomerative Clustering (HAC)
• Chosen for simple reasons:
• It is simpler
• It is widely used
Hierarchical Agglomerative Clustering (HAC)
• Define a similarity function for determining the similarity of two instances.
• Start with each instance in its own cluster
• Then repeatedly join the two clusters that are most similar
• The history of merging forms a hierarchy (called a dendrogram).
Hierarchical Clustering
• The important question
How do you determine the “nearness” of clusters?
Closest pair of clusters
Many variants to defining closest pair of clusters
• Single-link
• Distance of the “closest” points (single-link)
• Complete-link
• Distance of the “furthest” points
• Average-link
• Average distance between pairs of elements
Single Link Agglomerative Clustering
• Use the maximum similarity of pairs:
$\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
[Figure: two clusters with pairwise point distances, including 4, 6, and 10]
Distance between Cluster 1 and 2: 3
Single Link Example
Problem with Single Link Clustering
Chain or elongated clusters
They are far apart, yet they are in the same cluster
Complete Link Agglomerative Clustering
• Use the minimum similarity of pairs:
$\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
[Figure: the same two clusters with pairwise point distances, including 4, 6, and 10]
Distance between Cluster 1 and 2: 10
• Makes “tighter,” spherical clusters that are typically preferable.
Complete Link Example
Example in Detail: Single Link Clustering
Spectral Clustering
Spectral Clustering
Spectral Clustering: Examples
Spectral Clustering
• Group points based on links in the graph
Creating Graph from n-dimensional data
• Use a Gaussian kernel to compute the similarity between objects i and j, e.g. $w_{ij} = \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)$
• Possible graphs:
• Fully connected
• Connect only the k nearest neighbours
Image Segmentation
Why not min cut?
Graph Partitioning
Useful Terminologies
Graph Cut
Normalized cut
Solving Normalized Cut
NP Hard!
Solving Normalized Cut: Approximate Solution
The second smallest eigenvector is the real valued solution to this problem!!
Two-way Normalized Cut
Partitioning using the 2nd Eigenvector
• Second eigenvector takes continuous values
• Difficult to find a clear threshold to split
• How to choose splitting threshold?
• Pick the median value as splitting point
• Look for the splitting point that has the minimum Normalized cut value:
1. Choose n possible splitting points.
2. Compute the Normalized cut value for each.
3. Pick the splitting point with the minimum value.
THANK YOU