Clustering for New Discovery in Data
Houston Machine Learning Meetup
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
– Feature selection - Yan
• Supervised learning (4 sessions)
– Regression models - Yan
– SVM and kernel SVM - Yan
– Tree-based models - Dario
– Bayesian method - Xiaoyang
– Ensemble models - Yan
• Unsupervised learning (3 sessions)
– K-means clustering
– DBSCAN - Cheng
– Mean shift
– Agglomerative clustering - Kunal
– Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
– Neural network
– From neural network to deep learning
– Convolutional neural network
– Train deep nets with open-source tools
Roadmap: Application
• Business analytics
• Recommendation system
• Natural language processing
• Computer vision
• Energy industry
Agenda
• Introduction
• Application of clustering
• K-means
• DBSCAN
• Cluster validation
What is clustering?
Clustering: discovering the natural groupings of a set of objects/patterns in unlabeled data
Application: Recommendation
Application: Document Clustering
https://www.noggle.online/knowledgebase/document-clustering/
Application: Pizza Hut Centers
Clustering delivery locations
Application: Discovering Gene Functions
Important for discovering diseases and treatments
Clustering Algorithms
• K-Means (the "king" of clustering; many variants)
• DBSCAN (groups neighboring points by density)
• Mean shift (locates the maxima of a density estimate)
• Spectral clustering (cares about connectivity instead of proximity)
• Hierarchical clustering (a hierarchical structure, multiple levels)
• Expectation Maximization (k-means is a variant of EM)
• Latent Dirichlet Allocation (topic models for natural language processing)
…… (a minimal scikit-learn sketch of several of these follows)
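
As a quick orientation, here is a minimal sketch (assuming scikit-learn and its usual clustering API) that runs several of the algorithms above on the same toy dataset; the data, parameter values, and cluster counts are illustrative choices, not recommendations:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import (KMeans, DBSCAN, MeanShift,
                             SpectralClustering, AgglomerativeClustering)

# Toy data: 300 points drawn around 3 blob centers (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

models = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
    "mean shift": MeanShift(),
    "spectral": SpectralClustering(n_clusters=3, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}
for name, model in models.items():
    labels = model.fit_predict(X)
    # DBSCAN marks noise points with label -1; don't count noise as a cluster.
    print(name, "->", len(set(labels) - {-1}), "clusters")
```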
• K-Means
• DBSCAN
Cluster Validation
Cluster Validity
• For cluster analysis, the question is how to evaluate the "goodness" of the resulting clusters.
• Why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To determine the optimal number of clusters
Cluster Validity
• Numerical measures:
– External: measure the extent to which cluster labels match externally supplied class labels.
• Entropy (see the sketch after this list)
– Internal: measure the goodness of a clustering structure without reference to external information.
• Sum of Squared Error (SSE)
– Relative: compare two different clusterings.
• Often an external or internal measure is used for this purpose, e.g., SSE or entropy
• Visualization
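
As referenced above, entropy is one common external measure. A minimal sketch of one standard definition (cluster-size-weighted entropy of the true class labels inside each cluster; `labels` and `classes` are assumed to be integer arrays, and the function name is ours):

```python
import numpy as np

def clustering_entropy(labels, classes):
    """Weighted average entropy of class labels within clusters (lower is better)."""
    total = 0.0
    for k in np.unique(labels):
        members = classes[labels == k]            # true classes inside cluster k
        p = np.bincount(members) / len(members)   # class distribution in cluster k
        p = p[p > 0]                              # drop zeros to avoid log(0)
        total += len(members) / len(labels) * -(p * np.log2(p)).sum()
    return total
```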
Internal Measures: WSS and BSS
• Cluster Cohesion: measures how closely related the objects in a cluster are
– Example: SSE
• Cluster Separation: measures how distinct or well-separated a cluster is from other clusters
• Example: Squared Error
– Cohesion is measured by the within-cluster sum of squares (WSS, i.e., the SSE)
– Separation is measured by the between-cluster sum of squares (BSS)
– where |Ci| is the size of cluster i, mi is its centroid, and m is the overall mean

$$\mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - m_i)^2$$

$$\mathrm{BSS} = \sum_i |C_i| \, (m - m_i)^2$$
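
A minimal NumPy sketch of these two quantities, assuming `X` is an (n, d) data matrix and `labels` holds each row's cluster assignment (the function name is ours):

```python
import numpy as np

def wss_bss(X, labels):
    m = X.mean(axis=0)                          # overall mean of the data
    wss = bss = 0.0
    for k in np.unique(labels):
        Ck = X[labels == k]                     # points in cluster k
        mk = Ck.mean(axis=0)                    # centroid of cluster k
        wss += ((Ck - mk) ** 2).sum()           # cohesion: within-cluster SSE
        bss += len(Ck) * ((m - mk) ** 2).sum()  # separation: |Ck| x squared distance
    return wss, bss
```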
Internal Measures: WSS and BSS
• Example: SSE
– BSS + WSS = constant
Data: points 1, 2, 4, 5 on a line; overall mean m = 3, cluster means m1 = 1.5 and m2 = 4.5.

K = 1 cluster:
WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
BSS = 4 × (3 − 3)² = 0
Total = 10 + 0 = 10

K = 2 clusters ({1, 2} and {4, 5}):
WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
Total = 1 + 9 = 10
Internal Measures: WSS and BSS
• Can be used to estimate the number of clusters
(Figure: a clustered dataset, and its within-cluster SSE plotted against the number of clusters K; the "elbow" where the curve stops dropping sharply suggests a good value of K.)
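
A sketch of this "elbow" heuristic with scikit-learn, which exposes the fitted within-cluster SSE as `inertia_`; the toy dataset and the range of K are arbitrary illustrative choices:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 5 true blobs (illustrative only).
X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

ks = range(1, 16)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")   # look for the "elbow" in this curve
plt.xlabel("K")
plt.ylabel("SSE (WSS)")
plt.show()
```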
Internal Measures: Proximity graph measures
• Cluster cohesion is the sum of the weight of all links within a
cluster.
• Cluster separation is the sum of the weights between nodes in the
cluster and nodes outside the cluster.
(Figure: a proximity graph; links within a cluster illustrate cohesion, links from the cluster to outside nodes illustrate separation.)
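
A minimal sketch of these graph-based measures, assuming `W` is a symmetric (n, n) similarity-weight matrix with a zero diagonal and `labels` the cluster assignment (the function name is ours):

```python
import numpy as np

def graph_cohesion_separation(W, labels, k):
    inside = labels == k                               # nodes in cluster k
    cohesion = W[np.ix_(inside, inside)].sum() / 2.0   # each within-link appears twice in W
    separation = W[np.ix_(inside, ~inside)].sum()      # links crossing the cluster boundary
    return cohesion, separation
```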
Correlation between affinity matrix and incidence matrix
• Given an affinity (distance) matrix D = {d11, d12, …, dnn} and an incidence matrix C = {c11, c12, …, cnn} from the clustering, where cij = 1 if points i and j are in the same cluster and 0 otherwise
• The correlation r between D and C is given by

$$ r = \frac{\sum_{i,j=1}^{n} \left(d_{ij} - \bar{d}\right)\left(c_{ij} - \bar{c}\right)}{\sqrt{\sum_{i,j=1}^{n} \left(d_{ij} - \bar{d}\right)^2 \; \sum_{i,j=1}^{n} \left(c_{ij} - \bar{c}\right)^2}} $$
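
A sketch of this measure, assuming Euclidean distances; note it correlates all n² matrix entries, whereas one can equivalently restrict to the n(n−1)/2 distinct pairs (which shifts r slightly). The function name is ours:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def proximity_incidence_correlation(X, labels):
    labels = np.asarray(labels)
    D = squareform(pdist(X))                                # n x n distance matrix
    C = (labels[:, None] == labels[None, :]).astype(float)  # incidence: 1 if same cluster
    return np.corrcoef(D.ravel(), C.ravel())[0, 1]          # Pearson r over all entries
```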
Correlation with Incidence matrix
• The same correlation r, computed for clusterings of two datasets in the unit square
(Figure: left, data with well-separated clusters, r = −0.9235; right, random data, r = −0.5810. A strongly negative r means points in the same cluster tend to be close together, i.e., a better clustering.)
Visualization of similarity matrix
• Order the similarity matrix with respect to cluster labels and
inspect visually.
(Figure: data with well-separated clusters in the unit square, and its point-by-point similarity matrix with rows and columns ordered by cluster label; a crisp block-diagonal pattern appears, with similarity scaled from 0 to 1.)
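
A sketch of this visualization; the distance-to-similarity mapping below (1 − d/d_max) is just one simple choice, and the function name is ours:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

def plot_sorted_similarity(X, labels):
    D = squareform(pdist(X))
    S = 1.0 - D / D.max()                 # map distances into [0, 1] similarities
    order = np.argsort(labels)            # group points by cluster label
    plt.imshow(S[np.ix_(order, order)], cmap="viridis", vmin=0.0, vmax=1.0)
    plt.colorbar(label="Similarity")
    plt.xlabel("Points")
    plt.ylabel("Points")
    plt.show()
```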
Visualization of similarity matrix
• Clusters in random data are not so crisp
(Figure: random data in the unit square, and its similarity matrix ordered by cluster label; the block-diagonal pattern is much weaker than for well-separated clusters.)
Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part
of cluster analysis.
Without a strong effort in this direction, cluster analysis will remain a black art
accessible only to those true believers who have experience and great
courage.”
— Jain and Dubes, Algorithms for Clustering Data
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
– Feature selection - Yan
• Supervised learning (4 sessions)
– Regression models - Yan
– SVM and kernel SVM - Yan
– Tree-based models - Dario
– Bayesian method - Xiaoyang
– Ensemble models - Yan
• Unsupervised learning (3 sessions)
– K-means clustering
– DBSCAN - Cheng
– Mean shift
– Hierarchical clustering - Kunal
– Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
– Neural network
– From neural network to deep learning - Yan
– Convolutional neural network
– Train deep nets with open-source tools
Thank you
Slides will be posted on SlideShare:
http://www.slideshare.net/xuyangela