Chapter 7 of 'Data Mining: Concepts and Techniques' discusses cluster analysis, focusing on density-based clustering methods such as DBSCAN, OPTICS, and DENCLUE. It explains the fundamental concepts of density-reachability and density-connectivity, and how these methods can discover clusters of arbitrary shapes while handling noise. The chapter also highlights the algorithms and parameters essential for effective clustering in spatial databases.
Data Mining:
Concepts and Techniques
— Chapter 7 —
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
[Link],[Link]/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering
Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
Density-Based Clustering Methods
■ Clustering based on density (local cluster criterion), such as density-connected points
■ Major features:
    Discover clusters of arbitrary shape
    Handle noise
    One scan
    Need density parameters as termination condition
■ Several interesting studies:
    DBSCAN: Ester, et al. (1996)
    OPTICS: Ankerst, et al. (1999)
    DENCLUE: Hinneburg & Keim (1998)
Density-Based Clustering: Basic Concepts
■ Two parameters:
    Eps: maximum radius of the neighborhood
    MinPts: minimum number of points in an Eps-neighborhood of that point
■ $N_{Eps}(p) = \{ q \in D \mid d(p, q) \le Eps \}$
■ Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
    p belongs to $N_{Eps}(q)$
    core point condition: $|N_{Eps}(q)| \ge MinPts$
[Figure: example with MinPts = 5 and Eps = 1 cm]
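The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Euclidean distance is assumed for d, the sample `points`, the parameter values, and the function names are hypothetical, and the neighborhood is found by a brute-force scan rather than a spatial index.

```python
import math

def eps_neighborhood(points, p, eps):
    """Indices q with d(points[p], points[q]) <= eps (p itself is included)."""
    return [q for q, pt in enumerate(points) if math.dist(points[p], pt) <= eps]

def is_core(points, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p lies in N_Eps(q) and q satisfies the core point condition."""
    return p in eps_neighborhood(points, q, eps) and is_core(points, q, eps, min_pts)

# Hypothetical sample data: a tight group, one border point, one far point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.4, 0.5), (5.0, 5.0)]
forward = directly_density_reachable(points, 4, 3, eps=1.0, min_pts=4)
backward = directly_density_reachable(points, 3, 4, eps=1.0, min_pts=4)
```

Here point 4 is directly density-reachable from the core point 3, but not the other way around, because point 4 has too few neighbors to be a core point. The relation is not symmetric.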
Density-Based Clustering: Basic Concepts
■ Density-reachable:
    A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points $p_1, \ldots, p_n$, with $p_1 = q$ and $p_n = p$, such that $p_{i+1}$ is directly density-reachable from $p_i$
■ Density-connected:
    A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
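A minimal sketch of the two relations, assuming Euclidean distance and a brute-force neighborhood scan. The function names and the sample `points` are hypothetical. The chain of directly density-reachable points is followed with a breadth-first search that only expands core points.

```python
import math
from collections import deque

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (i itself included)."""
    return [j for j, pt in enumerate(points) if math.dist(points[i], pt) <= eps]

def density_reachable_set(points, q, eps, min_pts):
    """Indices of all points density-reachable from q (empty if q is not core)."""
    if len(region_query(points, q, eps)) < min_pts:
        return set()
    reached, queue = {q}, deque([q])
    while queue:
        cur = queue.popleft()
        nbrs = region_query(points, cur, eps)
        if len(nbrs) < min_pts:
            continue  # border point: it can end a chain but cannot extend it
        for n in nbrs:
            if n not in reached:
                reached.add(n)
                queue.append(n)
    return reached

def density_connected(points, p, q, eps, min_pts):
    """True if some point o has both p and q in its density-reachable set."""
    for o in range(len(points)):
        reached = density_reachable_set(points, o, eps, min_pts)
        if p in reached and q in reached:
            return True
    return False

# Hypothetical sample data: a chain of five points plus one isolated point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
from_middle = density_reachable_set(points, 2, eps=1.0, min_pts=3)
from_end = density_reachable_set(points, 0, eps=1.0, min_pts=3)
```

The two end points of the chain are border points, so nothing is density-reachable from them, yet they are density-connected to each other through the core points in the middle.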
Explanation on whiteboard
[Handwritten whiteboard notes; not legible in this scan]
DBSCAN: Density Based Spatial
Clustering of Applications with Noise
■ Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points
■ Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: example with an outlier marked; Eps = 1 cm, MinPts = 5]
DBSCAN: The Algorithm
■ Arbitrarily select an unprocessed point p
■ Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
■ If p is a core point, a cluster is formed containing p and all the density-reachable points from p. Mark these points as processed.
■ Mark p as processed.
■ Continue this process until all of the points have been processed.
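The steps above can be sketched as follows. This is an illustrative version, not the published implementation: it assumes Euclidean distance, scans all points for each neighborhood query instead of using a spatial index, and uses hypothetical names (`dbscan`, `NOISE`) and sample data.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    def region_query(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)      # None means "not yet processed"
    cluster = -1
    for p in range(len(points)):
        if labels[p] is not None:
            continue
        neighbors = region_query(p)
        if len(neighbors) < min_pts:
            labels[p] = NOISE          # may be relabeled later as a border point
            continue
        cluster += 1                   # p is a core point: start a new cluster
        labels[p] = cluster
        seeds = list(neighbors)
        while seeds:
            q = seeds.pop()
            if labels[q] == NOISE:
                labels[q] = cluster    # border point reached from a core point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = region_query(q)
            if len(q_neighbors) >= min_pts:
                seeds.extend(q_neighbors)  # q is core: keep expanding the cluster
    return labels

# Hypothetical sample data: two small squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

With these values the two squares come out as clusters 0 and 1, and the isolated point is labeled as noise.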
DBSCAN: Sensitive to Parameters
[Figure 8: DBSCAN results for DS1 with MinPts at 4 and Eps at (a) 0.5 and (b) 0.4]
[Figure 9: DBSCAN results for DS2 with MinPts at 4 and Eps at (a) 5.0, (b) 3.5, and (c) 3.0]
OPTICS: A Cluster-Ordering Method
■ OPTICS: Ordering Points To Identify the Clustering Structure
    Ankerst, Breunig, Kriegel, and Sander (1999)
    Produces a special order of the database w.r.t. its density-based clustering structure
    This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
    Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
    Can be represented graphically or using visualization techniques
OPTICS basic concepts
■ Core distance of p w.r.t. MinPts: the smallest distance eps' between p and an object in its eps-neighborhood such that p would be a core object for eps' and MinPts. Otherwise, undefined.
■ Reachability distance of p w.r.t. o:
    max(core-distance(o), d(o, p)) if o is a core object.
    Undefined otherwise
[Figure: example with MinPts = 5 and eps = 3 cm; r(p1, o) = 1.5 cm, r(p2, o) = 4 cm]
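A small sketch of the two distances, under stated assumptions: Euclidean distance, an object counts as a member of its own neighborhood (so the core distance is the distance to the MinPts-th nearest object including itself), and `None` stands for "undefined". The function names and sample data are hypothetical.

```python
import math

def core_distance(points, p, eps, min_pts):
    """Distance from p to its min_pts-th nearest object (p counts as the first),
    or None (undefined) if p is not a core object w.r.t. eps and min_pts."""
    dists = sorted(math.dist(points[p], q) for q in points)
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]

def reachability_distance(points, p, o, eps, min_pts):
    """max(core-distance(o), d(o, p)) if o is a core object, else None."""
    cd = core_distance(points, o, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[o], points[p]))

# Hypothetical sample data on a line.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (5, 0)]
cd_o = core_distance(points, 0, eps=2.5, min_pts=3)
r_near = reachability_distance(points, 1, 0, eps=2.5, min_pts=3)
r_far = reachability_distance(points, 3, 0, eps=2.5, min_pts=3)
r_undefined = reachability_distance(points, 0, 4, eps=2.5, min_pts=3)
```

A point closer to o than the core distance still gets the core distance as its reachability, which is what smooths out the reachability values inside a dense region.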
OPTICS
■ (1) Select a non-processed object o
■ (2) Find neighbors (eps-neighborhood)
■ Compute core distance for o
■ Write object o to ordered file and mark o as processed
■ If o is not a core object, restart at (1) (or continue from SeedList if it is not empty)
■ (o is a core object ...)
■ Put neighbors of o in SeedList and order
    If neighbor n is not yet in SeedList then add (n, reachability from o)
    else if reachability from o < current reachability, then update reachability + order SeedList w.r.t. reachability
■ Take new object from SeedList with smallest reachability and restart at (2)
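The loop above can be sketched as follows. This is a simplified illustration rather than the published algorithm: it assumes Euclidean distance, uses a plain dictionary as the SeedList and picks the minimum by scanning it (a priority queue would be the usual choice), counts an object as part of its own neighborhood, and returns the ordered file as a list. All names and the sample data are hypothetical.

```python
import math

def optics_order(points, eps, min_pts):
    """Return [(index, reachability)] in cluster order; None means undefined."""
    n = len(points)

    def dist(a, b):
        return math.dist(points[a], points[b])

    def neighbors(o):
        return [j for j in range(n) if dist(o, j) <= eps]

    def core_distance(o, nbrs):
        if len(nbrs) < min_pts:
            return None
        return sorted(dist(o, j) for j in nbrs)[min_pts - 1]

    processed = [False] * n
    order = []
    for start in range(n):                  # step (1): next non-processed object
        if processed[start]:
            continue
        seeds = {}                          # SeedList: object -> best reachability
        o, r_o = start, None
        while o is not None:
            processed[o] = True
            order.append((o, r_o))          # write o to the ordered output
            nbrs = neighbors(o)             # step (2)
            cd = core_distance(o, nbrs)
            if cd is not None:              # o is a core object: update SeedList
                for j in nbrs:
                    if processed[j]:
                        continue
                    r = max(cd, dist(o, j))
                    if j not in seeds or r < seeds[j]:
                        seeds[j] = r
            if seeds:                       # take the seed with smallest reachability
                o = min(seeds, key=seeds.get)
                r_o = seeds.pop(o)
            else:
                o = None
    return order

# Hypothetical sample data: two groups of three points on a line.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
order = optics_order(points, eps=2.0, min_pts=2)
```

In the resulting order, the first object of each group has an undefined reachability, which is what produces the peaks that separate clusters in a reachability plot.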
Example on whiteboard
[Handwritten whiteboard example; not legible in this scan]
[Figure: reachability plot; vertical axis: reachability-distance (with "undefined" marked), horizontal axis: cluster-order of the objects]
DENCLUE: Using Statistical Density
Functions
■ DENsity-based CLUstEring by Hinneburg & Keim (1998)
■ Using statistical density functions
■ Major features
    Solid mathematical foundation
    Good for data sets with large amounts of noise
    Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
    Significantly faster than DBSCAN
    But needs a large number of parameters
Denclue: Technical Essence
■ Uses grid cells but only keeps information about grid cells that actually contain data points, and manages these cells in a tree-based access structure
■ Influence function: describes the impact of a data point within its neighborhood
    $f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$
■ Overall density of the data space can be calculated as the sum of the influence functions of all data points
    $f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
■ Clusters can be determined mathematically by identifying density attractors. Density attractors are local maxima of the overall density function
    $\nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
    (Note the minus sign in each exponent.)
[Figure: Density attractor; (a) Data set]
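A minimal sketch of the Gaussian density, its gradient, and a fixed-step hill climb toward a density attractor. It assumes Euclidean distance and evaluates every data point directly, so it leaves out the grid cells and tree-based access structure. The step size, stopping rule, function names, and sample data are assumptions made for illustration.

```python
import math

def density(x, data, sigma):
    """Overall Gaussian density at x: sum of the influences of all data points."""
    return sum(math.exp(-math.dist(x, xi) ** 2 / (2 * sigma ** 2)) for xi in data)

def gradient(x, data, sigma):
    """Sum of (xi - x) weighted by each data point's Gaussian influence at x."""
    g = [0.0] * len(x)
    for xi in data:
        w = math.exp(-math.dist(x, xi) ** 2 / (2 * sigma ** 2))
        for d in range(len(x)):
            g[d] += (xi[d] - x[d]) * w
    return g

def hill_climb(x, data, sigma, step=0.1, max_iter=1000, tol=1e-6):
    """Follow the normalized gradient until the density stops increasing."""
    x = list(x)
    for _ in range(max_iter):
        g = gradient(x, data, sigma)
        norm = math.sqrt(sum(c * c for c in g))
        if norm < tol:
            break
        nxt = [xc + step * gc / norm for xc, gc in zip(x, g)]
        if density(nxt, data, sigma) <= density(x, data, sigma):
            break                      # the next step would not improve the density
        x = nxt
    return x

# Hypothetical sample data: two small groups of points.
data = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0), (5.0, 5.2)]
start = (0.5, 0.5)
attractor = hill_climb(start, data, sigma=0.5)
```

Because the step length is fixed, the returned point is only accurate to roughly one step; a smaller `step` gives a closer approximation of the attractor at the cost of more iterations.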
Denclue: Technical Essence
■ Significant density attractor for threshold k: density attractor with density larger than or equal to k
■ Center-defined cluster for a significant density attractor x for threshold k: points that are density attracted by x
    Points that are attracted to a density attractor with density less than k are called outliers
■ Set of significant density attractors X for threshold k: for each pair of density attractors x1, x2 in X there is a path from x1 to x2 such that each point on the path has density larger than or equal to k
■ Arbitrary-shape cluster for a set of significant density attractors X for threshold k: points that are density attracted to some density attractor in X
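As a small illustration of the center-defined case, the sketch below groups points by the attractor they climb to and applies the threshold k. It assumes the attractor of each point and the density at each attractor have already been computed. The function name, the input layout, and all numbers are hypothetical.

```python
def center_defined_clusters(attractor_of, attractor_density, k):
    """attractor_of[i]: id of the attractor that point i is attracted to (assumed given).
    attractor_density[a]: overall density at attractor a (assumed given).
    Returns ({attractor id: [point indices]}, [outlier indices])."""
    clusters, outliers = {}, []
    for i, a in enumerate(attractor_of):
        if attractor_density[a] >= k:      # significant density attractor
            clusters.setdefault(a, []).append(i)
        else:                              # attractor density below k: outlier
            outliers.append(i)
    return clusters, outliers

# Hypothetical inputs: six points, three attractors, threshold k = 2.0.
attractor_of = [0, 0, 1, 0, 2, 1]
attractor_density = {0: 3.2, 1: 2.1, 2: 0.4}
clusters, outliers = center_defined_clusters(attractor_of, attractor_density, k=2.0)
```

An arbitrary-shape cluster would additionally merge attractors that are linked by a path whose density stays at or above k; that merging step is not shown here.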
Center-Defined and Arbitrary-shape
clusters
[Figure 3: Example of center-defined clusters for different σ; legible panel labels include σ = 0.2 and (d) σ = 1.5]
[Figure 4: Example of arbitrary-shape clusters for different ξ; panel labels not reliably legible]