Clustering for New Discovery in Data
Houston Machine Learning Meetup
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
– Feature selection - Yan
• Supervised learning (4 sessions)
– Regression models - Yan
– SVM and kernel SVM - Yan
– Tree-based models - Dario
– Bayesian method - Xiaoyang
– Ensemble models - Yan
• Unsupervised learning (3 sessions)
– K-means clustering
– DBSCAN - Cheng
– Mean shift
– Agglomerative clustering - Kunal
– Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
– Neural network
– From neural network to deep learning
– Convolutional neural network
– Train deep nets with open-source tools
Roadmap: Application
• Business analytics
• Recommendation system
• Natural language processing
• Computer vision
• Energy industry
Agenda
• Introduction
• Application of clustering
• K-means
• DBSCAN
• Cluster validation
What is clustering?
Clustering: discovering the natural groupings of a set of objects/patterns in unlabeled data
Application: Recommendation
Application: Document Clustering
https://www.noggle.online/knowledgebase/document-clustering/
Application: Pizza Hut Centers
Clustering delivery locations
Application: Discovering Gene Functions
Important for discovering diseases and treatments
Clustering Algorithms
• K-Means (the "king" of clustering; many variants)
• DBSCAN (groups neighboring points by density)
• Mean shift (locates the maxima of a density estimate)
• Spectral clustering (cares about connectivity instead of proximity)
• Hierarchical clustering (a hierarchical structure, multiple levels)
• Expectation Maximization (k-means is a variant of EM)
• Latent Dirichlet Allocation (topic models for natural language processing)
…… (a minimal scikit-learn sketch of several of these follows)
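
As a quick orientation, here is a minimal sketch (assuming scikit-learn and its usual clustering API) that runs several of the algorithms above on the same toy dataset; the data, parameter values, and cluster counts are illustrative choices, not recommendations:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import (KMeans, DBSCAN, MeanShift,
                             SpectralClustering, AgglomerativeClustering)

# Toy data: 300 points drawn around 3 blob centers (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

models = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
    "mean shift": MeanShift(),
    "spectral": SpectralClustering(n_clusters=3, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}
for name, model in models.items():
    labels = model.fit_predict(X)
    # DBSCAN marks noise points with label -1; don't count noise as a cluster.
    print(name, "->", len(set(labels) - {-1}), "clusters")
```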
• K-Means
• DBSCAN
Cluster Validation
Cluster Validity
• For cluster analysis, the question is how to evaluate the "goodness" of the resulting clusters.
• Why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To determine the optimal number of clusters
Cluster Validity
• Numerical measures:
– External: measure the extent to which cluster labels match externally supplied class labels.
• Entropy (see the sketch after this list)
– Internal: measure the goodness of a clustering structure without reference to external information.
• Sum of Squared Error (SSE)
– Relative: compare two different clusterings.
• Often an external or internal measure is used for this purpose, e.g., SSE or entropy
• Visualization
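
As referenced above, entropy is one common external measure. A minimal sketch of one standard definition (cluster-size-weighted entropy of the true class labels inside each cluster; `labels` and `classes` are assumed to be integer arrays, and the function name is ours):

```python
import numpy as np

def clustering_entropy(labels, classes):
    """Weighted average entropy of class labels within clusters (lower is better)."""
    total = 0.0
    for k in np.unique(labels):
        members = classes[labels == k]            # true classes inside cluster k
        p = np.bincount(members) / len(members)   # class distribution in cluster k
        p = p[p > 0]                              # drop zeros to avoid log(0)
        total += len(members) / len(labels) * -(p * np.log2(p)).sum()
    return total
```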
Internal Measures: WSS and BSS
• Cluster Cohesion: measures how closely related the objects in a cluster are
– Example: SSE
• Cluster Separation: measures how distinct or well-separated a cluster is from other clusters
• Example: Squared Error
– Cohesion is measured by the within-cluster sum of squares (WSS, i.e., the SSE)
– Separation is measured by the between-cluster sum of squares (BSS)
– where |Ci| is the size of cluster i, mi is its centroid, and m is the overall mean

$$\mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - m_i)^2$$

$$\mathrm{BSS} = \sum_i |C_i| \, (m - m_i)^2$$
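
A minimal NumPy sketch of these two quantities, assuming `X` is an (n, d) data matrix and `labels` holds each row's cluster assignment (the function name is ours):

```python
import numpy as np

def wss_bss(X, labels):
    m = X.mean(axis=0)                          # overall mean of the data
    wss = bss = 0.0
    for k in np.unique(labels):
        Ck = X[labels == k]                     # points in cluster k
        mk = Ck.mean(axis=0)                    # centroid of cluster k
        wss += ((Ck - mk) ** 2).sum()           # cohesion: within-cluster SSE
        bss += len(Ck) * ((m - mk) ** 2).sum()  # separation: |Ck| x squared distance
    return wss, bss
```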
Internal Measures: WSS and BSS
• Example: SSE
– BSS + WSS = constant
Data: points 1, 2, 4, 5 on a line; overall mean m = 3, cluster means m1 = 1.5 and m2 = 4.5.

K = 1 cluster:
WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
BSS = 4 × (3 − 3)² = 0
Total = 10 + 0 = 10

K = 2 clusters ({1, 2} and {4, 5}):
WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
Total = 1 + 9 = 10
Internal Measures: WSS and BSS
• Can be used to estimate the number of clusters
(Figure: a clustered dataset, and its within-cluster SSE plotted against the number of clusters K; the "elbow" where the curve stops dropping sharply suggests a good value of K.)
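
A sketch of this "elbow" heuristic with scikit-learn, which exposes the fitted within-cluster SSE as `inertia_`; the toy dataset and the range of K are arbitrary illustrative choices:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 5 true blobs (illustrative only).
X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

ks = range(1, 16)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")   # look for the "elbow" in this curve
plt.xlabel("K")
plt.ylabel("SSE (WSS)")
plt.show()
```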
Internal Measures: Proximity graph measures
• Cluster cohesion is the sum of the weight of all links within a
cluster.
• Cluster separation is the sum of the weights between nodes in the
cluster and nodes outside the cluster.
(Figure: a proximity graph; links within a cluster illustrate cohesion, links from the cluster to outside nodes illustrate separation.)
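
A minimal sketch of these graph-based measures, assuming `W` is a symmetric (n, n) similarity-weight matrix with a zero diagonal and `labels` the cluster assignment (the function name is ours):

```python
import numpy as np

def graph_cohesion_separation(W, labels, k):
    inside = labels == k                               # nodes in cluster k
    cohesion = W[np.ix_(inside, inside)].sum() / 2.0   # each within-link appears twice in W
    separation = W[np.ix_(inside, ~inside)].sum()      # links crossing the cluster boundary
    return cohesion, separation
```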
Correlation between affinity matrix and incidence matrix
• Given an affinity (distance) matrix D = {d11, d12, …, dnn} and an incidence matrix C = {c11, c12, …, cnn} from the clustering, where cij = 1 if points i and j are in the same cluster and 0 otherwise
• The correlation r between D and C is given by

$$ r = \frac{\sum_{i,j=1}^{n} \left(d_{ij} - \bar{d}\right)\left(c_{ij} - \bar{c}\right)}{\sqrt{\sum_{i,j=1}^{n} \left(d_{ij} - \bar{d}\right)^2 \; \sum_{i,j=1}^{n} \left(c_{ij} - \bar{c}\right)^2}} $$
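
A sketch of this measure, assuming Euclidean distances; note it correlates all n² matrix entries, whereas one can equivalently restrict to the n(n−1)/2 distinct pairs (which shifts r slightly). The function name is ours:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def proximity_incidence_correlation(X, labels):
    labels = np.asarray(labels)
    D = squareform(pdist(X))                                # n x n distance matrix
    C = (labels[:, None] == labels[None, :]).astype(float)  # incidence: 1 if same cluster
    return np.corrcoef(D.ravel(), C.ravel())[0, 1]          # Pearson r over all entries
```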
Correlation with Incidence matrix
• The same correlation r, computed for clusterings of two datasets in the unit square
(Figure: left, data with well-separated clusters, r = −0.9235; right, random data, r = −0.5810. A strongly negative r means points in the same cluster tend to be close together, i.e., a better clustering.)
Visualization of similarity matrix
• Order the similarity matrix with respect to cluster labels and
inspect visually.
(Figure: data with well-separated clusters in the unit square, and its point-by-point similarity matrix with rows and columns ordered by cluster label; a crisp block-diagonal pattern appears, with similarity scaled from 0 to 1.)
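
A sketch of this visualization; the distance-to-similarity mapping below (1 − d/d_max) is just one simple choice, and the function name is ours:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

def plot_sorted_similarity(X, labels):
    D = squareform(pdist(X))
    S = 1.0 - D / D.max()                 # map distances into [0, 1] similarities
    order = np.argsort(labels)            # group points by cluster label
    plt.imshow(S[np.ix_(order, order)], cmap="viridis", vmin=0.0, vmax=1.0)
    plt.colorbar(label="Similarity")
    plt.xlabel("Points")
    plt.ylabel("Points")
    plt.show()
```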
Visualization of similarity matrix
• Clusters in random data are not so crisp
(Figure: random data in the unit square, and its similarity matrix ordered by cluster label; the block-diagonal pattern is much weaker than for well-separated clusters.)
Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part
of cluster analysis.
Without a strong effort in this direction, cluster analysis will remain a black art
accessible only to those true believers who have experience and great
courage.”
— Jain and Dubes, Algorithms for Clustering Data
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
– Feature selection - Yan
• Supervised learning (4 sessions)
– Regression models - Yan
– SVM and kernel SVM - Yan
– Tree-based models - Dario
– Bayesian method - Xiaoyang
– Ensemble models - Yan
• Unsupervised learning (3 sessions)
– K-means clustering
– DBSCAN - Cheng
– Mean shift
– Hierarchical clustering - Kunal
– Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
– Neural network
– From neural network to deep learning - Yan
– Convolutional neural network
– Train deep nets with open-source tools
Thank you
Slides will be posted on SlideShare:
http://www.slideshare.net/xuyangela