Unit-8
Unsupervised Learning
Subject: Machine Learning (3170724)
Faculty: Dr. Ami Tusharkant Choksi
Associate professor, Computer Engineering Department,
Navyug Vidyabhavan Trust
C.K.Pithawala College of Engineering and Technology,
Surat, Gujarat State, India.
Website: www.ckpcet.ac.in
■ Cluster: a collection of data objects that are
■ similar (or related) to one another within the same group
■ dissimilar (or unrelated) to the objects in other groups
■ Summarization:
■ Preprocessing for regression, PCA, classification, and
association analysis
■ Compression:
■ Image processing: vector quantization
■ Finding K-nearest Neighbors
■ Localizing search to one or a small number of clusters
■ Outlier detection
■ Outliers are often viewed as those “far away” from any
cluster
■ Constraint-based clustering
■ User may give inputs on constraints
■ Use domain knowledge to determine input parameters
■ Interpretability and usability
■ Others
■ Discovery of clusters with arbitrary shape
■ High dimensionality
■ Hierarchical approach:
■ Create a hierarchical decomposition of the set of data (or objects)
■ Density-based approach:
■ Based on connectivity and density functions
■ Grid-based approach:
■ Based on a multiple-level granularity structure
■ Frequent pattern-based:
■ Based on the analysis of frequent patterns
■ User-guided or constraint-based:
■ Clustering by considering user-specified or application-specific
constraints
■ Typical methods: COD (obstacles), constrained clustering
■ Link-based clustering:
■ Objects are often linked together in various ways
● Partitioning methods,
● Hierarchical methods, and
● Density-based methods.
Step 1: Select K points in the data space and mark them as initial centroids
loop
Step 2: Assign each point in the data space to the nearest centroid to
form K clusters
Step 3: Measure the distance of each point in the cluster from the
centroid
Step 4: Calculate the Sum of Squared Error
(SSE) to measure the quality of the clusters. SSE is to be minimized.
Step 5: Identify the new centroid of each cluster on the basis of distance
between points
Step 6: Repeat Steps 2 to 5 to refine until the centroids do not change
end loop
A common rule of thumb is K ≈ √(n/2), but it does not work for all types of data sets.
A minimal code sketch of Steps 1 to 6 follows.
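A minimal NumPy sketch of Steps 1 to 6 (illustrative only; it ignores corner cases such as empty clusters and random restarts, and is not a production implementation):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Plain k-means following Steps 1 to 6 (X is an n x d array)."""
    rng = np.random.default_rng(seed)
    # Step 1: select K points from the data as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 2-3: distance of every point to every centroid,
        # then assignment of each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: Sum of Squared Error (SSE), the quantity k-means tries to minimize
        sse = ((X - centroids[labels]) ** 2).sum()
        # Step 5: new centroid = mean of the points assigned to each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids, sse

# n = 9 points, so the rule of thumb suggests K ~ sqrt(9/2) ~ 2
X = np.array([[2.], [4.], [10.], [12.], [3.], [20.], [30.], [11.], [25.]])
print(kmeans(X, k=2))
```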
K-Means clustering numerical example
We will apply k-means to the following one-dimensional data set with K = 2.
Data set {2, 4, 10, 12, 3, 20, 30, 11, 25}
Iteration 1
M1, M2 are the two randomly selected centroids/means where
M1= 4, M2=11
and the initial clusters are
C1= {4}, C2= {11}
Calculate the Euclidean distance as
D(x, a) = √((x − a)²) = |x − a|
D1 is the distance from M1
D2 is the distance from M2
As the distance table shows, data points 2 and 3 join cluster C1 and the remaining data
points are added to cluster C2.
Therefore
C1= {2, 3, 4}, C2= {10, 12, 20, 30, 11, 25}
Iteration 2
New means
M1= (2+3+4)/3= 3
M2= (10+12+20+30+11+25)/6= 18
New clusters (each point reassigned to the nearer mean)
C1= {2, 3, 4, 10}, C2= {12, 20, 30, 11, 25}
Iteration 3
New means
M1= (2+3+4+10)/4= 4.75
M2= (12+20+30+11+25)/5= 19.6
New clusters
C1= {2, 3, 4, 10, 11, 12}, C2= {20, 30, 25}
Iteration 4
New means
M1= (2+3+4+10+11+12)/6= 7
M2= (20+30+25)/3= 25
New clusters
C1= {2, 3, 4, 10, 11, 12}, C2= {20, 30, 25}
The data points in clusters C1 and C2 in iteration 4 are the same as in iteration 3:
no data point has moved to the other cluster, and the means/centroids of the clusters are
unchanged. This is the stopping condition for the algorithm. A short code sketch that
reproduces these iterations follows.
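A short NumPy sketch that reproduces the iterations above, with the same data set and the same initial means M1 = 4 and M2 = 11:

```python
import numpy as np

data = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
means = np.array([4.0, 11.0])          # M1, M2 from Iteration 1

for iteration in range(1, 10):
    # assign each point to the nearer mean (|x - M1| vs. |x - M2|)
    labels = np.abs(data[:, None] - means[None, :]).argmin(axis=1)
    new_means = np.array([data[labels == j].mean() for j in range(2)])
    print(f"Iteration {iteration}: "
          f"C1={sorted(data[labels == 0].tolist())}, "
          f"C2={sorted(data[labels == 1].tolist())}, "
          f"M1={new_means[0]:.2f}, M2={new_means[1]:.2f}")
    if np.allclose(new_means, means):  # stopping condition: means unchanged
        break
    means = new_means
```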
Strengths: The algorithm is very flexible and can therefore be adjusted for most scenarios
and complexities.
Weakness: Guessing a good starting number of clusters in the data requires some experience
from the user if the final outcome is to be efficient.
K=2
■ Dissimilarity calculations
What Is the Problem of the K-Means Method?
■ The k-means method is sensitive to outliers: a single object with an extremely large
value can substantially distort the cluster mean (see the SSE example later in this unit).
[Figure: two example plots on 0 to 10 axes]
PAM: A Typical K-Medoids Algorithm
Step 1: Arbitrarily choose k objects as the initial medoids
Step 2: Assign each remaining object to the nearest medoid
Step 3: Randomly select a non-medoid object O_random
Step 4: Compute the total cost of swapping a medoid O with O_random
Step 5: If the quality (total cost) is improved, perform the swap
Step 6: Repeat Steps 2 to 5 until no change
[Figure: 10 x 10 example plots showing the initial medoids, the assignment of the remaining
objects, and a swap; Total Cost = 20]
■ PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
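A naive NumPy sketch of the swap idea. Unlike the randomized O_random selection above, this sketch exhaustively tries every (medoid, non-medoid) swap, which makes the roughly O(k·n²) work per pass, and hence the scaling problem, easy to see:

```python
import numpy as np
from itertools import product

def pam(X, k, seed=0):
    """Naive k-medoids (PAM-style): keep swapping while the total cost improves."""
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))          # arbitrary initial medoids
    cost = dist[:, medoids].min(axis=1).sum()   # total cost = sum of distances to nearest medoid
    improved = True
    while improved:                             # "until no change"
        improved = False
        for i, o in product(range(k), range(n)):
            if o in medoids:
                continue
            trial = medoids.copy()
            trial[i] = o                        # candidate swap: medoid i <-> object o
            trial_cost = dist[:, trial].min(axis=1).sum()
            if trial_cost < cost:               # swap only if the total cost improves
                medoids, cost, improved = trial, trial_cost, True
    labels = dist[:, medoids].argmin(axis=1)
    return np.array(medoids), labels, cost

# Usage on the 1-D data set from the k-means example
X = np.array([[2.], [4.], [10.], [12.], [3.], [20.], [30.], [11.], [25.]])
print(pam(X, k=2))
```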
Dendrogram: Shows How Clusters are Merged
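A short SciPy sketch that builds and plots such a dendrogram; the small two-dimensional data set is made up purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data set (assumed here only so that the plot has something to show)
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)

# Agglomerative clustering: 'single' linkage merges the two closest clusters first
Z = linkage(X, method="single")

dendrogram(Z)                      # the tree shows the order in which clusters are merged
plt.xlabel("data point index")
plt.ylabel("merge distance")
plt.show()
```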
● Because the SSE of the second clustering is lower, k-means tends to put point 9 in
the same cluster as 1, 2, 3, and 6, even though the point is logically nearer to points
10 and 12.
● This skewness is introduced by the outlier point 25, which pulls the mean away from
the center of the cluster.
Distance between Clusters
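The slide's figure is not reproduced here; as a reminder, the commonly used cluster-to-cluster distances (single link, complete link, average link, centroid) can be written as:

```latex
\begin{aligned}
d_{\min}(X_i, X_j)          &= \min_{p \in X_i,\; q \in X_j} \lVert p - q \rVert
                            && \text{(single link)} \\
d_{\max}(X_i, X_j)          &= \max_{p \in X_i,\; q \in X_j} \lVert p - q \rVert
                            && \text{(complete link)} \\
d_{\mathrm{avg}}(X_i, X_j)  &= \frac{1}{|X_i|\,|X_j|} \sum_{p \in X_i} \sum_{q \in X_j} \lVert p - q \rVert
                            && \text{(average link)} \\
d_{\mathrm{mean}}(X_i, X_j) &= \lVert m_i - m_j \rVert, \quad m_i = \text{centroid of } X_i
                            && \text{(centroid)}
\end{aligned}
```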
Strengths, Weaknesses, and Extensions of Hierarchical Clustering
■ Major weaknesses of agglomerative clustering methods
■ Merge and split decisions, once made, can never be undone
■ They do not scale well: time complexity of at least O(n²), where n is the number of objects
Outlier Detection: Approaches and Challenges
■ Model normal objects and report those not matching the model as outliers, or
■ Model outliers and treat those not matching the model as normal
■ Challenges
■ Imbalanced classes, i.e., outliers are rare: boost the outlier class
■ Clustering-based detection is costly, since clustering must be done first, although there
are far fewer outliers than normal objects
■ Newer methods tackle outliers directly
Outlier Detection III: Semi-Supervised
Methods
■ Situation: In many applications, the amount of labeled data is small; labels may be
available for outliers only, normal objects only, or both
■ Semi-supervised outlier detection: Regarded as applications of semi-
supervised learning
■ If some labeled normal objects are available
■ Use the labeled examples and the proximate unlabeled objects to
train a model for normal objects
■ Those not fitting the model of normal objects are detected as outliers
■ If only some labeled outliers are available, a small number of labeled
outliers may not cover the possible outliers well
■ To improve the quality of outlier detection, one can get help from
models for normal objects learned from unsupervised methods
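A minimal sketch of the "labeled normals plus proximate unlabeled objects" idea. The one-class SVM, the distance threshold, and the toy data are illustrative assumptions, not something prescribed by the slides:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
labeled_normal = rng.normal(loc=0.0, scale=1.0, size=(20, 2))      # a few labeled normal objects
unlabeled = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                       rng.normal(8.0, 0.5, size=(5, 2))])         # 5 injected outliers

# Proximate unlabeled objects: those close to some labeled normal object
d = np.linalg.norm(unlabeled[:, None, :] - labeled_normal[None, :, :], axis=2).min(axis=1)
proximate = unlabeled[d < 2.0]

# Train a model of "normal" on the labeled normals plus the proximate unlabeled objects
model = OneClassSVM(gamma="scale", nu=0.05).fit(np.vstack([labeled_normal, proximate]))

# Objects that do not fit the normal model (prediction -1) are reported as outliers
pred = model.predict(unlabeled)
print("detected outliers:", unlabeled[pred == -1])
```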
Outlier Detection (1): Statistical Methods
■ Statistical methods (also known as model-based methods) assume
that the normal data follow some statistical model (a stochastic model)
■ The data not following the model are outliers.
■ Example (right figure): first use a Gaussian distribution
to model the normal data
■ For each object y in region R, estimate g_D(y), the probability density of y under the
fitted Gaussian; if g_D(y) is very low, y is reported as an outlier
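A minimal sketch of the parametric (Gaussian) case with made-up data; the three-standard-deviation cut-off stands in for "very low probability under the fitted model":

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(loc=29.0, scale=0.3, size=50), [24.0]])  # one extreme value

# Fit a Gaussian (a simple stochastic model of the normal data)
mu, sigma = x.mean(), x.std()

# Objects far out in the tails of the fitted Gaussian are reported as outliers
z = np.abs(x - mu) / sigma
print("outliers:", x[z > 3])
```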
Outlier Detection (2): Proximity-Based Methods
■ An object is an outlier if the nearest neighbors of the object are far
away, i.e., the proximity of the object significantly deviates from
the proximity of most of the other objects in the same data set
■ Example (right figure): Model the proximity of an
object using its 3 nearest neighbors
■ Objects in region R are substantially different
from other objects in the data set.
■ Thus the objects in R are outliers
■ The effectiveness of proximity-based methods highly relies on the
proximity measure.
■ In some applications, proximity or distance measures cannot be
obtained easily.
■ They often have difficulty finding a group of outliers that stay close to
each other
■ Two major types of proximity-based outlier detection
■ Distance-based vs. density-based
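A small NumPy sketch of the distance-based flavor, scoring each object by the distance to its k-th nearest neighbor; the data set and the cut-off are illustrative assumptions:

```python
import numpy as np

def knn_outlier_scores(X, k=3):
    """Proximity-based score: distance to the k-th nearest neighbor of each object."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d.sort(axis=1)          # column 0 is the distance of each object to itself (0)
    return d[:, k]          # distance to the k-th nearest neighbor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])   # one far-away object

scores = knn_outlier_scores(X, k=3)
threshold = np.percentile(scores, 99)        # illustrative cut-off
print("outliers:", X[scores > threshold])
```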
Outlier Detection (3): Clustering-Based Methods
■ Normal data belong to large and dense clusters, whereas
outliers belong to small or sparse clusters, or do not belong
to any clusters
■ Example (right figure): two clusters
■ All points not in R form a large cluster
■ The two points in R form a tiny cluster,
thus are outliers
■ Since there are many clustering methods, there are many
clustering-based outlier detection methods as well
■ Clustering is expensive: a straightforward adaptation of a
clustering method for outlier detection can be costly and
does not scale up well for large data sets
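A minimal sketch of the "tiny clusters are outliers" rule using k-means; the cluster count, the 5% size threshold, and the data are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),      # one large, dense cluster
               rng.normal(10, 0.2, size=(3, 2))])    # a tiny cluster far away

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
sizes = np.bincount(km.labels_)

# Objects that fall in very small clusters are reported as outliers
small_clusters = np.flatnonzero(sizes < 0.05 * len(X))
print("outliers:", X[np.isin(km.labels_, small_clusters)])
```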
Clustering High Dimensional Data
Graph-based clustering
Applications
References (1)
■ R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. SIGMOD'98
■ M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
■ M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points
to identify the clustering structure, SIGMOD’99.
■ Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", KDD'02
■ M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-
Based Local Outliers. SIGMOD 2000.
■ M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
■ M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
■ D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
■ D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB’98.
■ V. Ganti, J. Gehrke, R. Ramakrishnan. CACTUS: Clustering Categorical Data
Using Summaries. KDD'99.
References (2)
■ D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. In Proc. VLDB’98.
■ S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98.
■ S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for
categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March
1999.
■ A. Hinneburg, D. A. Keim: An Efficient Approach to Clustering in Large
Multimedia Databases with Noise. KDD'98.
■ A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall,
1988.
■ G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical
Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75,
1999.
■ L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
■ E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB’98.
References (3)
■ G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
■ R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
■ L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A
Review, SIGKDD Explorations, 6(1), June 2004
■ E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition
■ G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
■ A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based
Clustering in Large Databases, ICDT'01.
■ A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles,
ICDE'01
■ H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large
data sets, SIGMOD’02
■ W. Wang, Yang, R. Muntz, STING: A Statistical Information grid Approach to Spatial
Data Mining, VLDB’97
■ T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data clustering
method for very large databases. SIGMOD'96
■ X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous
Semantic Links”, VLDB'06
References (4)
■ Graph based clustering, https://towardsdatascience.com/how-to-cluster-in-high-
dimensions-4ef693bacc6
■ k-means clustering algorithm numerical example,
https://medium.datadriveninvestor.com/k-means-clustering-b89d349e98e6
■ K-medoid clustering numerical example, https://www.geeksforgeeks.org/ml-k-
medoids-clustering-with-example/