Lecture 8 - Clustering
Data Mining: Cluster Analysis: Basic Concepts and Methods
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary
25/09/2024
■ Dissimilarity/Similarity metric
■ Similarity is expressed in terms of a distance function d(i, j),
which is typically a metric
■ The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables
■ Weights should be associated with different variables based
on applications and data semantics
■ Quality of clustering:
■ There is usually a separate “quality” function that measures
the “goodness” of a cluster.
■ It is hard to define “similar enough” or “good enough”
■ The answer is typically highly subjective
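As a rough illustration of a distance function d(i, j) with per-variable weights, the sketch below computes a weighted Minkowski distance between two numeric objects. The function name `weighted_minkowski`, the sample objects, and the weights are assumptions made for illustration only; other variable types (boolean, categorical, ordinal) would need their own dissimilarity definitions.

```python
def weighted_minkowski(x, y, weights, p=2):
    """Weighted Minkowski distance d(i, j) between two numeric vectors.

    p=1 gives a weighted Manhattan distance, p=2 a weighted Euclidean one.
    """
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    total = sum(w * abs(a - b) ** p for a, b, w in zip(x, y, weights))
    return total ** (1.0 / p)


# Hypothetical objects described by two numeric variables.
i = (30.0, 50.0)
j = (34.0, 53.0)

# Equal weights: plain Euclidean distance.
d_equal = weighted_minkowski(i, j, weights=(1.0, 1.0))
# Application-specific weights: the second variable counts four times as much.
d_weighted = weighted_minkowski(i, j, weights=(1.0, 4.0))
```

Changing the weights changes which objects count as "close", which is one reason the resulting clusters depend on the application and its data semantics.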
■ High dimensionality
■ Partitioning approach:
■ Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
■ Typical methods: k-means, k-medoids, CLARANS
■ Hierarchical approach:
■ Create a hierarchical decomposition of the set of data (or
objects) using some criterion
■ Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
■ Density-based approach:
■ Based on connectivity and density functions
■ Grid-based approach:
■ Based on a multiple-level granularity structure
■ Model-based:
■ A model is hypothesized for each of the clusters, and the method
tries to find the best fit of the data to that model
■ Typical methods: EM, SOM, COBWEB
■ Frequent pattern-based:
■ Based on the analysis of frequent patterns
■ User-guided or constraint-based:
■ Clustering by considering user-specified or application-specific
constraints
■ Typical methods: COD (obstacles), constrained clustering
■ Link-based clustering:
■ Objects are often linked together in various ways
An Example of K-Means Clustering (K=2)
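A small k-means run with K = 2 can be sketched as follows. This is a minimal pure-Python version of the usual assign-then-update iteration, not the lecture's own example: the data points, the function names `sq_dist` and `kmeans`, and the choice of the first k points as initial centers are all assumptions made so the run is reproducible.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Plain k-means on a list of equal-length tuples."""
    # Assumption: the first k points are the initial centers; practical
    # implementations usually choose them at random.
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the partition is stable
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    # Sum of squared errors of the final partition.
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, sse


# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
labels, centers, sse = kmeans(points, k=2)
```

On this toy data the assignment stabilizes after one update. With less clearly separated data the result can depend on the initial centers, so several restarts are commonly used.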
■ Dissimilarity calculations
[Figure: steps of a k-medoids run. Arbitrarily choose k objects as
initial medoids; assign each remaining object to the nearest medoid;
do loop until no change: compute the total cost of swapping O and
O_random, swapping if quality is improved.]
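The k-medoids loop (choose initial medoids, assign objects to the nearest medoid, evaluate the total cost of a swap, keep the swap only if quality improves, repeat until no change) can be sketched as below. This is a hedged illustration, not the lecture's implementation: it tries every medoid/non-medoid pair and takes the best swap instead of drawing a random non-medoid, so that the result is reproducible. The Manhattan dissimilarity, the sample data, and the names `manhattan`, `total_cost`, and `k_medoids` are assumptions.

```python
def manhattan(a, b):
    """Manhattan distance, used here as the dissimilarity (an assumption)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def total_cost(points, medoids):
    """Sum of dissimilarities from every object to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)


def k_medoids(points, k):
    # Assumption: the first k objects stand in for the arbitrary initial choice.
    medoids = list(points[:k])
    cost = total_cost(points, medoids)
    improved = True
    while improved:  # do loop ... until no change
        improved = False
        best_medoids, best_cost = medoids, cost
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Total cost if medoid i is swapped with this non-medoid.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                trial_cost = total_cost(points, trial)
                if trial_cost < best_cost:
                    best_medoids, best_cost = trial, trial_cost
        if best_cost < cost:  # swap only if quality is improved
            medoids, cost = best_medoids, best_cost
            improved = True
    # Assign each object to its nearest medoid.
    labels = [
        min(range(k), key=lambda c: manhattan(p, medoids[c])) for p in points
    ]
    return medoids, labels, cost


# Hypothetical data: two groups of three objects.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
medoids, labels, cost = k_medoids(points, k=2)
```

Because medoids are actual objects rather than means, a single extreme object shifts them less than it would shift a k-means center. The price is that each pass evaluates many candidate swaps.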
Hierarchical Clustering
Density-Based Clustering Methods
■ Clustering based on density (local cluster criterion),
such as density-connected points
■ Major features:
■ Discover clusters of arbitrary shape
■ Handle noise
■ One scan
■ Need density parameters as termination condition
■ Several interesting studies:
■ DBSCAN: Ester et al. (KDD’96)
■ OPTICS: Ankerst et al. (SIGMOD’99)
■ DENCLUE: Hinneburg & Keim (KDD’98)
■ CLIQUE: Agrawal et al. (SIGMOD’98) (more grid-based)
[Figure: core and border points, with Eps = 1 cm and MinPts = 5]
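The core/border distinction can be made concrete with a short sketch. It assumes the usual DBSCAN convention that a point's Eps-neighborhood includes the point itself, and that a point is a core point when that neighborhood holds at least MinPts points. The function name `classify_points` and the sample coordinates are hypothetical; Eps = 1 and MinPts = 5 follow the values quoted above.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' in the DBSCAN sense."""
    # Eps-neighborhood of each point (the point itself is included).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(nbrs) >= min_pts for nbrs in neighbors]
    kinds = []
    for i, nbrs in enumerate(neighbors):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in nbrs):
            kinds.append("border")  # not dense itself, but within Eps of a core point
        else:
            kinds.append("noise")
    return kinds


# Hypothetical layout: a center with four neighbors at distance 1,
# plus one isolated point far away.
points = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1), (5, 5)]
kinds = classify_points(points, eps=1.0, min_pts=5)
```

This brute-force version compares every pair of points, which is fine for a handful of objects. Practical implementations rely on a spatial index to find neighborhoods faster.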
wavelet method
[Figure: Cluster C1 and Cluster C2]
measure
■ Correlation measures
■ Discretized Hubert's Γ statistic, normalized discretized
Hubert's Γ statistic