
8. Unsupervised Learning
Subject: Machine Learning (3170724)
Faculty: Dr. Ami Tusharkant Choksi
Associate Professor, Computer Engineering Department,
Navyug Vidyabhavan Trust
C.K.Pithawala College of Engineering and Technology,
Surat, Gujarat State, India.
Website: www.ckpcet.ac.in



Supervised vs. Unsupervised Learning
■ Supervised learning (classification)
■ Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
■ New data is classified based on the training set
■ Unsupervised learning (clustering)
■ The class labels of the training data are unknown
■ Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in the
data
What is Cluster Analysis?
■ Cluster: a collection of data objects
■ similar (or related) to one another within the same group
■ dissimilar (or unrelated) to the objects in other groups
■ Cluster analysis (or clustering, data segmentation, …)
■ Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
■ Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
■ Typical applications
■ As a stand-alone tool to get insight into data distribution
■ As a preprocessing step for other algorithms
Clustering for Data Understanding and
Applications
■ Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
■ Information retrieval: document clustering
■ Land use: Identification of areas of similar land use in an earth
observation database
■ Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
■ City-planning: Identifying groups of houses according to their house
type, value, and geographical location
■ Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
■ Climate: understanding Earth's climate, finding patterns in atmospheric and ocean data
■ Economic science: market research


Clustering as a Preprocessing Tool (Utility)

■ Summarization:
■ Preprocessing for regression, PCA, classification, and
association analysis
■ Compression:
■ Image processing: vector quantization
■ Finding K-nearest Neighbors
■ Localizing search to one or a small number of clusters
■ Outlier detection
■ Outliers are often viewed as those “far away” from any
cluster



Quality: What Is Good Clustering?
■ A good clustering method will produce high quality clusters
■ high intra-class similarity: cohesive within clusters
■ low inter-class similarity: distinctive between clusters
■ The quality of a clustering method depends on
■ the similarity measure used by the method
■ its implementation, and
■ its ability to discover some or all of the hidden patterns



Measure the Quality of Clustering
■ Dissimilarity/Similarity metric
■ Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
■ The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal, ratio, and vector variables
■ Weights should be associated with different variables
based on applications and data semantics
■ Quality of clustering:
■ There is usually a separate “quality” function that
measures the “goodness” of a cluster.
■ It is hard to define “similar enough” or “good enough”
■ The answer is typically highly subjective
Considerations for Cluster Analysis
■ Partitioning criteria
■ Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
■ Separation of clusters
■ Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
■ Similarity measure
■ Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
■ Clustering space
■ Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)



Requirements and Challenges
■ Scalability
■ Clustering all the data instead of only on samples

■ Ability to deal with different types of attributes


■ Numerical, binary, categorical, ordinal, linked, and mixture of these

■ Constraint-based clustering
■ User may give inputs on constraints
■ Use domain knowledge to determine input parameters
■ Interpretability and usability
■ Others
■ Discovery of clusters with arbitrary shape

■ Ability to deal with noisy data

■ Incremental clustering and insensitivity to input order

■ High dimensionality



Major Clustering Approaches (I)
■ Partitioning approach:
■ Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
■ Typical methods: k-means, k-medoids, CLARANS
■ Hierarchical approach:
■ Create a hierarchical decomposition of the set of data (or objects) using some criterion
■ Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
■ Density-based approach:
■ Based on connectivity and density functions
■ Typical methods: DBSCAN, OPTICS, DenClue
■ Grid-based approach:
■ Based on a multiple-level granularity structure
■ Typical methods: STING, WaveCluster, CLIQUE


Major Clustering Approaches (II)
■ Model-based:
■ A model is hypothesized for each of the clusters, and the aim is to find the best fit of the data to the given model
■ Typical methods: EM, SOM, COBWEB
■ Frequent pattern-based:
■ Based on the analysis of frequent patterns
■ Typical methods: p-Cluster
■ User-guided or constraint-based:
■ Clustering by considering user-specified or application-specific constraints
■ Typical methods: COD (obstacles), constrained clustering
■ Link-based clustering:
■ Objects are often linked together in various ways
■ Massive links can be used to cluster objects: SimRank, LinkClus


Partitioning Algorithms: Basic Concept

■ Partitioning method: partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci), i.e., E = Σi=1..k Σp∈Ci d(p, ci)²
■ Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
■ Global optimal: exhaustively enumerate all partitions
■ Heuristic methods: k-means and k-medoids algorithms
■ k-means (MacQueen'67, Lloyd'57/'82): each cluster is represented by the center of the cluster
■ k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
Different types of clustering techniques

● Partitioning methods,
● Hierarchical methods, and
● Density-based methods.



Different types of clustering techniques
● Partitioning methods:
○ Uses mean or medoid(etc.) to represent cluster center
○ Adopts distance-based approach to refine clusters
○ Finds mutually exclusive clusters of spherical or nearly spherical shape
○ Effective for data sets of small to medium size
● Hierarchical methods:
○ Creates hierarchical or tree-like structure through decomposition or
merger
○ Uses distance between the nearest or furthest points in neighboring
clusters as a guideline for refinement
○ Erroneous merges or splits cannot be corrected at subsequent levels
● Density based methods:
○ Useful for identifying arbitrarily shaped clusters
○ Guiding principle of cluster creation is the identification of dense
regions of objects in space which are separated by low-density regions
○ May filter out outliers
Partitioning methods

● Two of the most important algorithms for partitioning-based clustering are k-means and k-medoids.
○ In the k-means algorithm, the cluster prototype is the centroid, which is normally the mean of a group of points.
○ Similarly, the k-medoids algorithm identifies the medoid, which is the most representative point for a group of points.
● We can also infer that in most cases the centroid does not correspond to an actual data point, whereas a medoid is always an actual data point.


K-means - A centroid-based technique

Step 1: Select K points in the data space and mark them as initial centroids
loop
Step 2: Assign each point in the data space to the nearest centroid to form K clusters
Step 3: Measure the distance of each point in the cluster from the centroid
Step 4: Calculate the Sum of Squared Error (SSE) to measure the quality of the clusters; SSE is to be minimized
Step 5: Identify the new centroid of each cluster on the basis of the distances between points
Step 6: Repeat Steps 2 to 5 until the centroids do not change
end loop
A common rule of thumb is K ≈ √(n/2), but it does not work for all types of data sets.
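The following is a minimal NumPy sketch of these steps (Lloyd's algorithm); the function name, the random initialization, and the empty-cluster guard are illustrative assumptions, not part of the slide.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means following Steps 1-6 above (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K points from the data as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 2-3: distance of every point to every centroid, assign the nearest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: Sum of Squared Error of the current clustering
        sse = ((X - centroids[labels]) ** 2).sum()
        # Step 5: new centroid = mean of the points assigned to each cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 6: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids, sse
```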
K-Means clustering numerical example

We will apply k-means on the following 1 dimensional data set for K=2.
Data set {2, 4, 10, 12, 3, 20, 30, 11, 25}
Iteration 1
M1, M2 are the two randomly selected centroids/means where
M1= 4, M2=11
and the initial clusters are
C1= {4}, C2= {11}
Calculate the Euclidean distance as
D[x, a] = √((x − a)²)
D1 is the distance from M1
D2 is the distance from M2
K-Means clustering numerical example

[Table: distance of each data point from M1 = 4 and M2 = 11, and the resulting cluster assignment]


K-Means clustering numerical example

As we can see in the above table, two data points are added to cluster C1 and the other data points are added to cluster C2.

Therefore

C1= {2, 4, 3}

C2= {10, 12, 20, 30, 11, 25}

Iteration 2

Calculate new mean of datapoints in C1 and C2.

Therefore

M1= (2+3+4)/3= 3

M2= (10+12+20+30+11+25)/6= 18

Calculating distance and updating clusters based on table


K-Means clustering numerical example

[Table: distance of each data point from M1 = 3 and M2 = 18, and the resulting cluster assignment]


K-means clustering example

New Clusters

C1= {2, 3, 4, 10}

C2= {12, 20, 30, 11, 25}

Iteration 3

Calculate new mean of datapoints in C1 and C2.

Therefore

M1= (2+3+4+10)/4= 4.75

M2= (12+20+30+11+25)/5= 19.6

Calculating distance and updating clusters based on table below



K-means clustering example

[Table: distance of each data point from M1 = 4.75 and M2 = 19.6, and the resulting cluster assignment]


K-means clustering example

New Clusters

C1= {2, 3, 4, 10, 12, 11}

C2= {20, 30, 25}

Iteration 4

Calculate new mean of data points in C1 and C2.

Therefore

M1= (2+3+4+10+12+11)/6=7

M2= (20+30+25)/3= 25

Calculating distance and updating clusters based on table below



K-means clustering example

[Table: distance of each data point from M1 = 7 and M2 = 25, and the resulting cluster assignment]


K-means clustering example

New Clusters

C1= {2, 3, 4, 10, 12, 11}

C2= {20, 30, 25}

As we can see, the data points in clusters C1 and C2 in iteration 4 are the same as the data points in clusters C1 and C2 of iteration 3.

This means that none of the data points has moved to another cluster, and the means/centroids of these clusters remain constant. This is the stopping condition for our algorithm.
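As a check, the hedged snippet below (plain NumPy, written for this note and not taken from the slide) replays the iterations from the same initial means M1 = 4 and M2 = 11 and converges to the same clusters with means 7 and 25.

```python
import numpy as np

# 1-D data set and initial means from the worked example above
data = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
m1, m2 = 4.0, 11.0

while True:
    # assign each point to the nearer of the two means (Euclidean distance in 1-D)
    in_c1 = np.abs(data - m1) <= np.abs(data - m2)
    c1, c2 = data[in_c1], data[~in_c1]
    new_m1, new_m2 = c1.mean(), c2.mean()
    if new_m1 == m1 and new_m2 == m2:   # stopping condition: means unchanged
        break
    m1, m2 = new_m1, new_m2

print(sorted(c1.tolist()), sorted(c2.tolist()), m1, m2)
# -> [2.0, 3.0, 4.0, 10.0, 11.0, 12.0] [20.0, 25.0, 30.0] 7.0 25.0
```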



K-Means: Strengths and Weaknesses

Strengths:
● The principle used for identifying the clusters is very simple and involves very little statistical complexity
● The algorithm is very flexible and can thus be adjusted for most scenarios and complexities
● The performance and efficiency are very high and comparable to those of sophisticated algorithms in terms of dividing the data into useful clusters

Weaknesses:
● The algorithm involves an element of random chance and thus may not find the optimal set of clusters in some cases
● Guessing the number of clusters within the data as a starting point requires some experience from the user for the final outcome to be efficient


An Example of K-Means Clustering

K = 2

[Figure: the initial data set is arbitrarily partitioned into k groups; the cluster centroids are updated and objects reassigned, looping as needed until no object changes cluster]

■ Partition objects into k nonempty subsets
■ Repeat
■ Compute the centroid (i.e., mean point) of each partition
■ Assign each object to the cluster of its nearest centroid
■ Until no change
Comments on the K-Means Method
■ Strength: Efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
■ Comparison: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
■ Comment: Often terminates at a local optimum.
■ Weakness
■ Applicable only to objects in a continuous n-dimensional space
■ Using the k-modes method for categorical data
■ In comparison, k-medoids can be applied to a wide range of
data
■ Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
■ Sensitive to noisy data and outliers
■ Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means Method

■ Most variants of k-means differ in
■ Selection of the initial k means
■ Dissimilarity calculations
■ Strategies to calculate cluster means
■ Handling categorical data: k-modes
■ Replacing means of clusters with modes
■ Using new dissimilarity measures to deal with categorical objects
■ Using a frequency-based method to update modes of clusters
■ A mixture of categorical and numerical data: k-prototype method
What Is the Problem of the K-Means Method?

■ The k-means algorithm is sensitive to outliers!
■ An object with an extremely large value may substantially distort the distribution of the data
■ K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster

PAM: A Typical K-Medoids Algorithm
K = 2

[Figure: PAM on a sample data set, comparing total costs of 20 and 26 for two choices of medoids]

■ Arbitrarily choose k objects as the initial medoids
■ Assign each remaining object to its nearest medoid
■ Repeat
■ Randomly select a non-medoid object, O_random
■ Compute the total cost of swapping a medoid with O_random
■ If the quality is improved, perform the swap
■ Until no change


The K-Medoid Clustering Method
■ K-Medoids clustering: find representative objects (medoids) in clusters
■ PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
■ Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
■ PAM works effectively for small data sets but does not scale well to large data sets (due to its computational complexity)
■ Efficiency improvements on PAM
■ CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
■ CLARANS (Ng & Han, 1994): randomized re-sampling
Hierarchical Clustering
■ Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition

[Figure: agglomerative clustering (AGNES) merges a, b, c, d, e step by step — {a, b}, {d, e}, {c, d, e}, then {a, b, c, d, e} — while divisive clustering (DIANA) proceeds in the reverse order]
AGNES (Agglomerative Nesting)
■ Introduced in Kaufmann and Rousseeuw (1990)
■ Implemented in statistical packages, e.g., Splus
■ Use the single-link method and the dissimilarity matrix
■ Merge nodes that have the least dissimilarity
■ Go on in a non-descending fashion
■ Eventually all nodes belong to the same cluster

Dendrogram: Shows How Clusters are Merged

A dendrogram decomposes the data objects into several levels of nested partitioning (a tree of clusters).

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
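A small SciPy sketch of this idea; the toy points, the single-linkage choice, and the cut height are illustrative assumptions rather than part of the slide.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy 2-D data; 'single' linkage mirrors the single-link method used by AGNES
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])
Z = linkage(X, method="single")        # encodes the tree of nested merges

# Cutting the dendrogram at distance 2.0 gives the flat clusters at that level
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)                          # cluster id for each data object

# dendrogram(Z) would draw the merge tree (needs matplotlib to display)
```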

Next lecture



K-means clustering numerical example
using SSE and Outlier

● Data: 1, 2, 3, 6, 9, 10, 12, and 25.
● Point 25 is the outlier, and it affects the cluster formation negatively when the mean of the points is taken as the centroid.
● With K = 2, the initial clusters we arrive at are {1, 2, 3, 6} and {9, 10, 12, 25}.
● The mean of the cluster {1, 2, 3, 6} is (1+2+3+6)/4 = 3
● and the mean of the cluster {9, 10, 12, 25} is (9+10+12+25)/4 = 14.
● So, the SSE within the clusters is (1−3)² + (2−3)² + (3−3)² + (6−3)² + (9−14)² + (10−14)² + (12−14)² + (25−14)² = 180.
● If we compare this with the clusters {1, 2, 3, 6, 9} and {10, 12, 25},
● the mean of the cluster {1, 2, 3, 6, 9} is 21/5 = 4.2
● and the mean of the cluster {10, 12, 25} is 47/3 ≈ 15.67.
● So, the SSE within the clusters is (1−4.2)² + (2−4.2)² + (3−4.2)² + (6−4.2)² + (9−4.2)² + (10−15.67)² + (12−15.67)² + (25−15.67)² ≈ 175.47.
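A quick hedged check of these numbers in plain Python (the helper function is ours, written for this note):

```python
def sse(clusters):
    """Sum over clusters of the squared distances of each point to its cluster mean."""
    total = 0.0
    for pts in clusters:
        m = sum(pts) / len(pts)
        total += sum((p - m) ** 2 for p in pts)
    return total

print(sse([[1, 2, 3, 6], [9, 10, 12, 25]]))    # 180.0
print(sse([[1, 2, 3, 6, 9], [10, 12, 25]]))    # ~175.47
```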



K-means clustering numerical example
using SSE and Outlier

● Because the SSE of the second clustering is lower, k-means tends to put point 9 in the same cluster as 1, 2, 3, and 6, even though the point is logically nearer to points 10 and 12.
● This skewness is introduced by the outlier point 25, which shifts the mean away from the center of the cluster.



K-medoid clustering algorithm

● k-medoids provides a solution to this problem.
● Instead of considering the mean of the data points in a cluster, k-medoids considers k representative data points from the existing points in the data set as the centers of the clusters.
● It assigns the data points according to their distance from these centers to form k clusters.
● The medoids in this case are actual data points or objects from the data set, whereas in the k-means technique the mean of the data points within a cluster is used as the centroid.
● The SSE is calculated as SSE = Σi=1..k Σp∈Ci dist(p, Oi)²,
● where Oi is the representative point or object (medoid) of cluster Ci.
● Thus, the k-medoids method groups n objects into k clusters by minimizing the SSE. Because the medoids are actual representative data points, k-medoids is less influenced by outliers in the data.
● One practical implementation of the k-medoids principle is the Partitioning Around Medoids (PAM) algorithm.
K-medoid clustering algorithm

● a representative object-based technique


Step 1: Randomly choose k points in the data set as the initial representative points
loop
Step 2: Assign each of the remaining points to the cluster with the nearest representative point
Step 3: Randomly select a non-representative point o_r in each cluster
Step 4: Swap the representative point o_j with o_r and compute the new SSE after the swap
Step 5: If SSE_new < SSE_old, then swap o_j with o_r to form the new set of k representative objects
Step 6: Refine the k clusters on the basis of the nearest representative point; continue until there is no change
end loop
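A hedged NumPy sketch of this representative-object loop (a simplified PAM-style search using random single swaps; the function name and details are assumptions made for illustration):

```python
import numpy as np

def k_medoids(X, k, max_iters=100, seed=0):
    """Simplified k-medoids following the steps above (random swap search)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    medoids = rng.choice(n, size=k, replace=False)   # Step 1: initial representatives

    def assign(meds):
        # Step 2: each point joins the cluster of its nearest representative point
        d = np.linalg.norm(X[:, None, :] - X[meds][None, :, :], axis=2)
        return d.argmin(axis=1), (d.min(axis=1) ** 2).sum()   # labels, SSE

    labels, sse_old = assign(medoids)
    for _ in range(max_iters):
        improved = False
        for j in range(k):
            # Step 3: randomly pick a non-representative point o_r
            o_r = rng.choice(np.setdiff1d(np.arange(n), medoids))
            trial = medoids.copy()
            trial[j] = o_r                              # Step 4: tentative swap
            trial_labels, sse_new = assign(trial)
            if sse_new < sse_old:                       # Step 5: keep the better set
                medoids, labels, sse_old = trial, trial_labels, sse_new
                improved = True
        if not improved:                                # Step 6: stop when no change
            break
    return labels, X[medoids], sse_old
```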
DIANA (Divisive Analysis)

■ Introduced in Kaufmann and Rousseeuw (1990)


■ Implemented in statistical analysis packages, e.g., Splus
■ Inverse order of AGNES
■ Eventually each node forms a cluster on its own

Distance between Clusters

■ Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min d(tip, tjq)
■ Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max d(tip, tjq)
■ Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg d(tip, tjq)
■ Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = d(Ci, Cj)
■ Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = d(Mi, Mj)
■ Medoid: a chosen, centrally located object in the cluster
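These definitions translate directly into code; the sketch below (our own helper with toy clusters, not from the slide) computes the pairwise-distance-based measures with NumPy.

```python
import numpy as np

def cluster_distances(Ki, Kj):
    """Single, complete, average and centroid distances between two clusters."""
    Ki, Kj = np.asarray(Ki, float), np.asarray(Kj, float)
    D = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)  # all d(tip, tjq)
    return {
        "single (min)": float(D.min()),
        "complete (max)": float(D.max()),
        "average": float(D.mean()),
        "centroid": float(np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))),
    }

print(cluster_distances([[0, 0], [1, 0]], [[4, 3], [5, 3]]))
```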
Centroid, Radius and Diameter of a Cluster
(for numerical data sets)
■ Centroid: the “middle” of a cluster
■ Radius: square root of the average squared distance from the points of the cluster to its centroid
■ Diameter: square root of the average squared distance between all pairs of points in the cluster
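A small NumPy sketch of these three quantities, assuming the usual mean-squared-distance formulas behind the verbal definitions above:

```python
import numpy as np

def centroid_radius_diameter(points):
    """Centroid, radius and diameter of one cluster of numerical points."""
    P = np.asarray(points, float)
    n = len(P)
    centroid = P.mean(axis=0)
    # radius: sqrt of the average squared distance from the points to the centroid
    radius = np.sqrt(((P - centroid) ** 2).sum(axis=1).mean())
    # diameter: sqrt of the average squared distance over all (ordered) pairs
    sq = (np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2) ** 2).sum()
    diameter = np.sqrt(sq / (n * (n - 1)))
    return centroid, radius, diameter

c, r, d = centroid_radius_diameter([[0, 0], [2, 0], [0, 2], [2, 2]])
print(c, r, d)   # centroid [1. 1.], radius ~1.414, diameter ~2.309
```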

Strengths, Weaknesses, and Extensions of Hierarchical Clustering
■ Major weaknesses of agglomerative clustering methods
■ Can never undo what was done previously
■ Do not scale well: time complexity of at least O(n²), where n is the total number of objects
■ Integration of hierarchical & distance-based clustering
■ BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
■ CHAMELEON (1999): hierarchical clustering using dynamic modeling
Outlier Detection I: Supervised Methods
■ Two ways to categorize outlier detection methods:
■ Based on whether user-labeled examples of outliers can be obtained:

■ Supervised, semi-supervised vs. unsupervised methods

■ Based on assumptions about normal data and outliers:

■ Statistical, proximity-based, and clustering-based methods

■ Outlier Detection I: Supervised Methods


■ Modeling outlier detection as a classification problem

■ Samples examined by domain experts used for training & testing

■ Methods for Learning a classifier for outlier detection effectively:

■ Model normal objects & report those not matching the model as

outliers, or
■ Model outliers and treat those not matching the model as normal

■ Challenges

■ Imbalanced classes, i.e., outliers are rare: Boost the outlier class

and make up some artificial outliers


■ Catch as many outliers as possible, i.e., recall is more important

than accuracy (i.e., not mislabeling normal objects as outliers) 48


Dr. Ami Choksi @CKPCET Machine Learning (3170724) 48
Outlier Detection II: Unsupervised Methods
■ Assume the normal objects are somewhat “clustered” into multiple groups, each having some distinct features
■ An outlier is expected to be far away from any groups of normal objects
■ Weakness: Cannot detect collective outlier effectively
■ Normal objects may not share any strong patterns, but the collective
outliers may share high similarity in a small area
■ Ex. In some intrusion or virus detection, normal activities are diverse
■ Unsupervised methods may have a high false positive rate but still
miss many real outliers.
■ Supervised methods can be more effective, e.g., identifying attacks on some key resources
■ Many clustering methods can be adapted for unsupervised methods
■ Find clusters, then outliers: not belonging to any cluster

■ Problem 1: Hard to distinguish noise from outliers

■ Problem 2: Costly, since clustering is done first, yet there are far fewer outliers than normal objects
■ Newer methods: tackle outliers directly

Outlier Detection III: Semi-Supervised
Methods
■ Situation: In many applications, the number of labeled data is often
small: Labels could be on outliers only, normal objects only, or both
■ Semi-supervised outlier detection: Regarded as applications of semi-
supervised learning
■ If some labeled normal objects are available
■ Use the labeled examples and the proximate unlabeled objects to
train a model for normal objects
■ Those not fitting the model of normal objects are detected as outliers
■ If only some labeled outliers are available, a small number of labeled outliers may not cover the possible outliers well
■ To improve the quality of outlier detection, one can get help from
models for normal objects learned from unsupervised methods
Outlier Detection (1): Statistical Methods
■ Statistical methods (also known as model-based methods) assume
that the normal data follow some statistical model (a stochastic model)
■ The data not following the model are outliers.
■ Example (right figure): First use Gaussian distribution
to model the normal data
■ For each object y in region R, estimate gD(y), the probability that y fits the Gaussian distribution
■ If gD(y) is very low, y is unlikely to be generated by the Gaussian model and is thus an outlier
■ Effectiveness of statistical methods: highly depends on whether the assumption of the statistical model holds for the real data
■ There are rich alternatives to use various statistical models
■ E.g., parametric vs. non-parametric
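A hedged sketch of the Gaussian idea (the synthetic data, the planted outlier, and the likelihood threshold are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

# Fit a Gaussian to the data and flag objects with very low likelihood g_D(y)
rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(10.0, 2.0, size=200), [25.0]])  # 25.0 is planted

mu, sigma = data.mean(), data.std()
g = norm.pdf(data, loc=mu, scale=sigma)   # likelihood of each object under the model
outliers = data[g < 1e-4]                 # threshold chosen for illustration only

print(round(mu, 2), round(sigma, 2), outliers)
```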

Outlier Detection (2): Proximity-Based Methods
■ An object is an outlier if the nearest neighbors of the object are far away, i.e., the proximity of the object significantly deviates from the proximity of most of the other objects in the same data set
■ Example (right figure): Model the proximity of an
object using its 3 nearest neighbors
■ Objects in region R are substantially different
from other objects in the data set.
■ Thus the objects in R are outliers
■ The effectiveness of proximity-based methods highly relies on the
proximity measure.
■ In some applications, proximity or distance measures cannot be
obtained easily.
■ Often has difficulty finding a group of outliers that stay close to each other
■ Two major types of proximity-based outlier detection
■ Distance-based vs. density-based
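A distance-based sketch of this idea using scikit-learn's NearestNeighbors (the synthetic data, k = 3, and the scoring rule are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Score each object by the distance to its 3rd nearest neighbour:
# objects whose neighbourhood is far away get large scores (likely outliers).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),      # "normal" objects
               [[8.0, 8.0], [9.0, 7.5]]])            # two far-away objects

k = 3
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)     # column 0 is each point itself (distance 0)
scores = dists[:, k]            # distance to the k-th nearest neighbour

print(np.argsort(scores)[-2:])  # indices of the two most outlying objects (here 100, 101)
```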
Outlier Detection (3): Clustering-Based Methods
■ Normal data belong to large and dense clusters, whereas
outliers belong to small or sparse clusters, or do not belong
to any clusters
■ Example (right figure): two clusters
■ All points not in R form a large cluster
■ The two points in R form a tiny cluster,
thus are outliers
■ Since there are many clustering methods, there are many
clustering-based outlier detection methods as well
■ Clustering is expensive: a straightforward adaptation of a clustering method for outlier detection can be costly and does not scale up well to large data sets
Clustering High Dimensional Data

Graph-based clustering

■ Graph-based clustering (Spectral, SNN-cliq, Seurat) is perhaps the most robust approach for high-dimensional data, as it uses the distance on a graph, e.g., the number of shared neighbors, which is more meaningful in high dimensions than the Euclidean distance.

Graph Based Clustering

[Figure: graph-based clustering uses the distance on a graph — in the example, nodes A and F have 3 shared neighbors]
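A hedged sketch of the shared-neighbor idea on toy high-dimensional data (Spectral, SNN-cliq and Seurat build on similar graphs but differ in the details):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))                 # toy high-dimensional data

k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X)                     # row i: point i plus its k neighbours
neigh = [set(row[1:]) for row in idx]         # drop the point itself

def shared_neighbors(a, b):
    """Graph-based similarity: number of k-nearest neighbours a and b share."""
    return len(neigh[a] & neigh[b])

print(shared_neighbors(0, 1))
```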
Graph Based Clustering

However, to build the graph this method still uses the Euclidean distance. In addition, the number of clusters has to be implicitly specified a priori via the “resolution” hyperparameters. Changing the hyperparameters can easily result in fewer or more clusters, which is somewhat arbitrary and hence quite unsatisfactory, as there is no obvious way to define an objective function for automated tuning of the hyperparameters.

Graph based clustering

1. Detecting graph elements with “similar” properties is of great importance, especially in large networks, where it is crucial to identify specific patterns or structures quickly. The process of grouping together elements/entities that appear, based on some similarity measure, to be closer to each other (than to the remaining elements) is called cluster analysis.

Graph based clustering

2. The similarity measure is usually calculated based on topological criteria, e.g., the graph structure, on the location of the nodes, or on other characteristics, e.g., specific properties of the graph elements. Nodes that are considered similar based on this similarity value are grouped into so-called clusters. In other words, each cluster contains elements that share common properties and characteristics. The collection of all the clusters composes a clustering.

Applications

■ Cluster analysis has many application fields, such as data analysis, bioinformatics, biology, big data, business, and medicine. However, each of these applications may consider the "clustering notion" differently, which is reflected by the fact that there exist many clustering algorithms and clustering models.

References (1)
■ R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. SIGMOD'98
■ M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
■ M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points
to identify the clustering structure, SIGMOD’99.
■ F. Beil, M. Ester, and X. Xu. Frequent Term-Based Text Clustering. KDD'02
■ M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-
Based Local Outliers. SIGMOD 2000.
■ M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
■ M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
■ D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
■ D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB’98.
■ V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering Categorical Data Using Summaries. KDD'99.
References (2)
■ D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. In Proc. VLDB’98.
■ S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98.
■ S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for
categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March
1999.
■ A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD'98.
■ A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
■ G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical
Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75,
1999.
■ L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
■ E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB’98.
References (3)
■ G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.
■ R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
■ L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A
Review, SIGKDD Explorations, 6(1), June 2004
■ E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition
■ G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
■ A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based
Clustering in Large Databases, ICDT'01.
■ A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles,
ICDE'01
■ H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large
data sets, SIGMOD’02
■ W. Wang, Yang, R. Muntz, STING: A Statistical Information grid Approach to Spatial
Data Mining, VLDB’97
■ T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data clustering
method for very large databases. SIGMOD'96
■ X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous
Semantic Links”, VLDB'06

References (4)
■ Graph based clustering, https://towardsdatascience.com/how-to-cluster-in-high-dimensions-4ef693bacc6
■ k-means clustering algorithm numerical example, https://medium.datadriveninvestor.com/k-means-clustering-b89d349e98e6
■ K-medoid clustering numerical example, https://www.geeksforgeeks.org/ml-k-medoids-clustering-with-example/
