
Unsupervised Learning

K-Means Clustering

By: Abdul Hameed

Unsupervised Learning

• It is a machine learning technique in which the user does not need to supervise the model.
• Instead, the model works on its own to discover patterns and information that were previously undetected.
• It deals mainly with unlabeled data.
Unsupervised Learning

[Figure: side-by-side comparison of supervised learning and unsupervised learning.]
Unsupervised Learning - Applications

• Clustering allows you to automatically split the dataset into groups according to
similarity. Often, however, cluster analysis overestimates the similarity between groups
and doesn’t treat data points as individuals. For this reason, cluster analysis is a poor
choice for applications like customer segmentation and targeting.
• Anomaly detection can automatically discover unusual data points in your dataset. This
is useful in pinpointing fraudulent transactions, discovering faulty pieces of hardware, or
identifying an outlier caused by a human error during data entry.
• Association rules discovery identifies sets of items that frequently occur together in your
dataset. Retailers often use it for basket analysis, because it allows analysts to discover
goods often purchased at the same time and develop more effective marketing and
merchandising strategies.
• Latent variable models are commonly used for data preprocessing, such as reducing the
number of features in a dataset (dimensionality reduction) or decomposing the dataset
into multiple components.
Clustering for Understanding

• Classes, or conceptually meaningful groups of objects that share common characteristics, play an important role in how people analyze and describe the world.
• Indeed, human beings are skilled at dividing objects into groups (clustering) and assigning particular objects to these groups (classification).
• For example, even relatively young children can quickly label the objects in a photograph as buildings, vehicles, people, animals, plants, etc.
• In the context of understanding data, clusters are potential classes, and cluster analysis is the study of techniques for automatically finding classes.
• Clustering is an unsupervised learning concept.
What is Cluster Analysis?

• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
• Intra-cluster distances are minimized; inter-cluster distances are maximized.
Clustering Example – news.google.com

Cluster formation methods

• Centroid based: the clusters are formed by the closeness of data points to the centroid of a cluster. The cluster center, i.e. the centroid, is constructed so that the distance of the data points to the center is minimized.
• Hierarchical based: clusters are constructed as a tree-type structure based on a hierarchy. There are two categories: Agglomerative (bottom-up approach) and Divisive (top-down approach).
• Density based: isolates various density regions based on the different densities present in the data space.
K-Means Clustering

Centroid based clusters

• An iterative clustering algorithm in which the clusters are formed by the closeness of data points to the centroid of a cluster. The cluster center, i.e. the centroid, is placed so that the distance of the data points to the center is minimized.
k-Means Clustering Algorithm
• An iterative algorithm that partitions the dataset, according to the features of its points, into K predefined, non-overlapping, distinct clusters or subgroups.
• It makes the data points within each cluster as similar as possible, while keeping the clusters themselves as far apart as possible.
• It assigns a data point to a cluster such that the sum of the squared distances between the data points and the cluster's centroid is at a minimum, where the cluster's centroid is the arithmetic mean of the data points in that cluster.
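Formally, the quantity being minimized is the within-cluster sum of squares. The slides do not write it out, but the standard k-means objective is:

```latex
% Within-cluster sum of squares (WCSS):
%   C_j  -- the set of points assigned to cluster j
%   mu_j -- the centroid (arithmetic mean) of cluster j
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```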
Step 1: Initialization

• Initialize random points as the centroids of the clusters. When initializing, take care that the number of centroids is less than the number of training data points. This algorithm is iterative, so the next two steps are performed repeatedly.
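A minimal sketch of this step in Python (illustrative; the helper name and the use of NumPy are my own, not from the slides):

```python
import numpy as np

def initialize_centroids(X, k, seed=None):
    """Pick k distinct data points at random to serve as initial centroids.

    As noted above, k must be less than the number of training points.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False)  # k distinct row indices
    return X[idx].astype(float)
```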
Step 2: Cluster Assignment

• After initialization, all data points are traversed and the distance between every centroid and every data point is calculated. Clusters are then formed by assigning each point to the centroid at minimum distance from it. In this example, the data is divided into two clusters.
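Continuing the sketch under the same assumptions:

```python
import numpy as np

def assign_clusters(X, centroids):
    """Assign each data point to its nearest centroid (Euclidean distance)."""
    # distances[i, j] = distance from point i to centroid j
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)  # index of the closest centroid per point
```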
Step 3: Moving Centroid

• The clusters formed in the previous step are not yet optimized, so optimized clusters must be formed by iteratively moving the centroids to new locations. Take the data points of one cluster, compute their average, and then move that cluster's centroid to this new location. Repeat the same step for all other clusters.
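The centroid update, continuing the same sketch:

```python
import numpy as np

def move_centroids(X, labels, k):
    """Move each centroid to the mean of the points assigned to it."""
    # Caveat: a cluster left with no points would yield NaN here; real
    # implementations re-seed empty clusters. Omitted for brevity.
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])
```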
Step 4: Optimization

• Steps 2 and 3 are performed iteratively until the centroids stop moving, i.e. they no longer change their positions and have become static. Once this happens, the k-means algorithm is said to have converged.
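Putting the helpers from the previous steps into a loop gives the whole algorithm (a sketch, not a production implementation):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=None):
    """Iterate assignment and centroid updates until the centroids are static."""
    centroids = initialize_centroids(X, k, seed)
    for _ in range(max_iter):
        labels = assign_clusters(X, centroids)
        new_centroids = move_centroids(X, labels, k)
        if np.allclose(new_centroids, centroids):  # converged: no movement
            break
        centroids = new_centroids
    return centroids, labels
```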
Step 5: Convergence

• The algorithm has now converged, and distinct clusters are formed and clearly visible. Note that this algorithm can give different results depending on how the clusters were initialized in the first step.
Euclidean Distance

• A distance measure between a pair of samples p and q in an n-dimensional feature space:

$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$

• For example, picture it as a "straight, connecting" line in a 2-D feature space.
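In code, this is a one-liner (a hypothetical helper used only for illustration):

```python
import numpy as np

def euclidean(p, q):
    """Straight-line distance between two points in n-dimensional space."""
    return np.linalg.norm(np.asarray(p, float) - np.asarray(q, float))

print(euclidean((1, 0), (1, 1)))  # 1.0 -- the p, q pair used in the demo below
```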
k-Means Demo – Step 1: Initialization

S#  X1  X2
A   1   1
B   1   0
C   0   2
D   2   4
E   3   5
k-Means Demo – Step 2: Cluster Assignment
Consider the point p(1, 0) and the centroid q(1, 1):
d(p, q) = √((1 − 1)² + (0 − 1)²) = 1

Data points:
ID  X1  X2
A   1   1
B   1   0
C   0   2
D   2   4
E   3   5

Initial centroids:
CID  X1  X2
C1   1   1
C2   0   2
k-Means Demo – Step 2: Cluster Assignment

Distance from each point to each centroid:

ID  X1  X2  C1   C2
A   1   1   0    1.4
B   1   0   1    2.2
C   0   2   1.4  0
D   2   4   3.2  2.8
E   3   5   4.5  4.2

Centroids:
CID  X1  X2
C1   1   1
C2   0   2
k-Means Demo – Step 2: Cluster Assignment

Each point is assigned to its nearest centroid (CID column):

ID  X1  X2  C1   C2   CID
A   1   1   0    1.4  1
B   1   0   1    2.2  1
C   0   2   1.4  0    2
D   2   4   3.2  2.8  2
E   3   5   4.5  4.2  2

Centroids:
CID  X1  X2
C1   1   1
C2   0   2
k-Means Demo – Step 3: Moving Centroid

ID  X1  X2  C1   C2   CID
A   1   1   0    1.4  1
B   1   0   1    2.2  1
C   0   2   1.4  0    2
D   2   4   3.2  2.8  2
E   3   5   4.5  4.2  2

Each centroid moves to the mean of the points assigned to it:

C1 = mean of {A, B} = ((1 + 1)/2, (1 + 0)/2) = (1, 0.5)
C2 = mean of {C, D, E} = ((0 + 2 + 3)/3, (2 + 4 + 5)/3) = (1.67, 3.67)

Old centroids:
CID  X1  X2
C1   1   1
C2   0   2

New centroids:
CID  X1    X2
C1   1     0.5
C2   1.67  3.67
k-Means Demo – Step 4: Optimization

Distances and assignments are recomputed with the new centroids; point C now switches to cluster 1:

ID  X1  X2  C1   C2   CID
A   1   1   0.5  2.7  1
B   1   0   0.5  3.7  1
C   0   2   1.8  2.4  1
D   2   4   3.6  0.5  2
E   3   5   4.9  1.9  2

Centroids used:
CID  X1    X2
C1   1     0.5
C2   1.67  3.67

The centroids then move again:

C1 = mean of {A, B, C} = ((1 + 1 + 0)/3, (1 + 0 + 2)/3) = (0.67, 1)
C2 = mean of {D, E} = ((2 + 3)/2, (4 + 5)/2) = (2.5, 4.5)
k-Means Demo – Step 5: Convergence

Distances and assignments are recomputed once more with the updated centroids; nothing changes, so the algorithm has converged:

ID  X1  X2  C1    C2    CID
A   1   1   0.33  3.81  1
B   1   0   1.05  4.74  1
C   0   2   1.20  3.54  1
D   2   4   3.28  0.71  2
E   3   5   4.63  0.71  2

Final centroids:
CID  X1    X2
C1   0.67  1
C2   2.5   4.5

No change, done!
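For reference, this worked example can be reproduced with scikit-learn by seeding it with the same initial centroids (a sketch; the slides themselves do not use any particular library):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]])  # points A..E
init = np.array([[1, 1], [0, 2]])  # the demo's starting centroids

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.cluster_centers_)  # approx. [[0.67, 1.0], [2.5, 4.5]]
print(km.labels_)           # [0, 0, 0, 1, 1] -> {A, B, C} and {D, E}
```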
Another Example

[Figure: six panels, Iteration 1 through Iteration 6, plotting the same 2-D dataset (x from -2 to 2, y from 0 to 3); the cluster assignments and centroid positions update at each iteration.]
k-Means Clustering: Discussion

• Finds a local optimum.
• Often converges quickly, but not always.
• The choice of initial points can have a large influence on the result (see the sketch below).
• Tends to find spherical clusters.
• Outliers can cause a problem.
• Different densities may cause a problem.
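Because of this sensitivity to initialization, k-means is usually run several times from different starting points and the best run is kept. scikit-learn supports this directly (a usage sketch with illustrative parameter values and toy data):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))  # toy 2-D data

# init="k-means++" spreads the initial centroids apart; n_init=10 runs the
# algorithm from 10 different initializations and keeps the run with the
# lowest within-cluster sum of squares (exposed as inertia_).
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.inertia_)  # objective value of the best run
```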
k-Means Clustering: Advantages

• k-means is relatively scalable and efficient in processing large data sets.
• The computational complexity of the algorithm is O(nkt), where:
  • n: the total number of objects
  • k: the number of clusters
  • t: the number of iterations
• Normally, k << n and t << n.
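To put that complexity in perspective (illustrative numbers, not from the slides): clustering n = 1,000,000 points into k = 10 clusters for t = 50 iterations takes on the order of n × k × t = 5 × 10⁸ distance computations, i.e. the cost grows only linearly with the dataset size.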
k-Means Clustering: Disadvantages

• Can be applied only when the mean of a cluster is defined.
• Users need to specify k in advance.
• k-means is not suitable for clusters of very different sizes.
• It is sensitive to noise and outlier data points, which can distort the mean value.
