Fuzzy DBScan
All content following this page was uploaded by Dino Ienco on 08 June 2016.
1 Introduction
Density-based clustering algorithms have wide applicability in spatial data
mining. They apply a local criterion to group objects: clusters are regarded as
regions of the data space where the objects are dense, separated by regions of
low object density (noise). Among density-based clustering algorithms, DBSCAN
is very popular due both to its low complexity and to its ability to detect
clusters of any shape. This is a desirable characteristic when one has no
knowledge of the possible shapes of the clusters, or when the objects are
distributed heterogeneously, such as along the paths of a graph or a road
network.
Nevertheless, to drive the process this algorithm needs two numeric input
parameters, minPts and $\epsilon$, which together define the desired density
characteristics of the generated clusters. Specifically, minPts is a positive
integer specifying the minimum number of objects that must exist within a
maximum distance $\epsilon$ from any given object in order for the latter to
belong to a cluster.
Since DBSCAN is very sensitive to the setting of these input parameters, they
must be chosen with great care, which generally requires an exploratory phase
of trial and error to fix the right values.
Moreover, these input parameters must be set considering both the scale of the
dataset and the closeness of the objects, so as not to degrade either the speed
of the algorithm, which strongly depends on these values ??, or the
effectiveness of the results.
In fact, a common drawback of all crisp flat clustering algorithms used to group
objects whose distribution has a faint and smooth density profile is that they
draw crisp boundaries to separate clusters, and these boundaries are often
somewhat arbitrary.
There are also applications in which the position of the objects is ill-known,
such as databases of moving objects, whose locations are recorded at fixed
timestamps, or objects appearing in remote sensing images with a coarse spatial
resolution, where a pixel is much larger than the object and the exact position
of the object within the area of the pixel is therefore uncertain.
In this contribution we investigate several extensions of the DBSCAN algorithm,
defined within the framework of fuzzy set theory, whose aim is to detect
fuzzy clusters with desired density characteristics.
The extensions have two objectives. The first is to ease the parameter setting:
we propose distinct fuzzy extensions of the DBSCAN algorithm which do not
require precise values for both input parameters, but allow approximate values
to be specified by means of soft constraints, defined by fuzzy sets on the
basic domains of the input parameters themselves. The algorithm uses this
approximate input to generate fuzzy clusters, i.e., clusters whose elements are
associated with a numeric membership degree in [0,1]. Having a membership
degree associated with each pair <object, cluster>, it is then possible to
perform a sensitivity analysis and obtain distinct crisp partitions by
specifying distinct minimum thresholds on the membership degrees. This allows
the spatial distribution of the objects to be explored without several runs of
the clustering. The second objective is to improve the effectiveness of the
grouping of objects characterized by a fuzzy distribution. We will present
three possible fuzzy extensions and discuss the properties of the generated
fuzzy clusters.
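The sensitivity analysis mentioned above amounts to thresholding the membership
degrees once the fuzzy clusters are available. The following Python fragment is
a minimal sketch of that step; the dictionary layout, the function name
`crisp_partition` and the sample degrees are hypothetical and serve only as an
illustration.

```python
def crisp_partition(memberships, threshold):
    """Keep, for each cluster, the objects whose degree reaches the threshold.

    memberships maps a cluster name to a dict {object: membership degree}.
    This layout is an assumption of the sketch, not part of the paper.
    """
    return {
        cluster: sorted(obj for obj, degree in degrees.items() if degree >= threshold)
        for cluster, degrees in memberships.items()
    }


# Hypothetical output of a single fuzzy clustering run.
fuzzy_clusters = {
    "c0": {"a": 1.0, "b": 0.6, "c": 0.2},
    "c1": {"c": 0.4, "d": 0.9},
}

# Two crisp partitions obtained from the same run, without re-clustering.
strict = crisp_partition(fuzzy_clusters, 0.5)
loose = crisp_partition(fuzzy_clusters, 0.1)
```

Lowering the threshold lets object `c` appear in both clusters, which is the kind
of exploration the membership degrees are meant to support.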
For the sake of clarity, in the following we consider a set of objects
represented by distinct points defined in a two-dimensional spatial domain.
These objects can be either actual entities located in the real spatial domain,
such as cars, taxi cabs or airplanes, or virtual entities, such as web pages or
tweets represented in the virtual space of the terms they contain. DBSCAN can
be applied to group these objects based on their local densities in the space,
for example to identify traffic jams of cars on the roads, or web pages and
tweets that deal with close topics. DBSCAN assigns points of a spatial domain
defined on $\mathbb{R} \times \mathbb{R}$ to particular clusters, or designates
them as statistical noise if they are not sufficiently close to other points.
DBSCAN determines cluster assignments by assessing the local density at each
point using two parameters: the distance ($\epsilon$) and the minimum number of
points ($\mathit{minPts}$). A single point which meets the minimum density
criterion, namely that there are $\mathit{minPts}$ points located within
distance $\epsilon$, is designated a core point.
Formally, given a set $P$ of $N$ points $p_i = (x_i, y_i)$ with $x_i, y_i$
defined on the spatial domain, $p \in P$ is a core point if there exist at
least $\mathit{minPts}$ points $p_1, \dots, p_{\mathit{minPts}} \in P$ s.t.
$\|p_j - p\| < \epsilon$. Two core points $p_i$ and $p_j$ with $i \neq j$ s.t.
$\|p_i - p_j\| < \epsilon$ define a cluster $c$, $p_i, p_j \in c$, and are core
points of $c$, i.e., $p_i, p_j \in core(c)$. All non-core points within the
maximum distance $\epsilon$ from a core point are considered non-core members
of a cluster, and are boundary points: $p \notin core(c)$ is a boundary point
of $c$ if $\exists p_i \in core(c)$ with $\|p - p_i\| < \epsilon$. Finally,
points that are not part of a cluster are considered noise: $p \notin core(c)$
is noise if $\forall c$, $\nexists p_i \in core(c)$ with
$\|p - p_i\| < \epsilon$. In the following the classic DBSCAN algorithm is
described:
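As an illustration of the definitions above, the following Python sketch labels
each point with a cluster identifier or as noise. It is a minimal sketch and
not the authors' pseudocode: the names `region_query` and `dbscan`, the `NOISE`
label and the brute-force neighbour search are assumptions made for the
example, and a point is treated as core when at least `min_pts` other points
lie strictly within `eps`.

```python
from math import dist

NOISE = -1  # label used in this sketch for noise points (an assumed convention)


def region_query(points, i, eps):
    """Indices of the points strictly closer than eps to points[i] (i excluded)."""
    return [j for j, q in enumerate(points) if j != i and dist(points[i], q) < eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)           # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:        # not a core point: provisionally noise
            labels[i] = NOISE
            continue
        labels[i] = cluster_id               # core point: start a new cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:           # within eps of a core point: boundary point
                labels[j] = cluster_id
            if labels[j] is not None:        # already assigned, nothing to expand
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is itself core: keep expanding
                queue.extend(j_neighbours)
        cluster_id += 1
    return labels


sample = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 10), (11, 10), (10, 11), (11, 11),
          (50, 50)]
sample_labels = dbscan(sample, eps=1.5, min_pts=3)
```

On this hypothetical sample the two squares of four points form two clusters and
the isolated point is labelled as noise.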
3 Related Work
In the literature there have been a few extensions of the DBSCAN algorithm
designed to detect fuzzy clusters. In [1] the authors propose a fuzzy extension
of DBSCAN, named FN-DBSCAN (fuzzy neighborhood DBSCAN), whose main
characteristic is the use of a fuzzy neighborhood relation, whereas DBSCAN
uses a crisp neighborhood relation. This approach addresses the difficulty
users have in setting the values of the parameters when the number of points
to cluster is unknown and when their distances are on distinct scales.
Thus the authors first normalize the distances between pairs of points in
[0,1], and then allow distinct membership functions to be specified on the
distance to delimit the neighborhood of points, i.e., the decay of the
membership degree as a function of the distance from the point. Then, they
select as belonging to the fuzzy neighborhood of a point only those points
having a minimum membership degree. This extension of DBSCAN uses a level-based
neighborhood set instead of a distance-based neighborhood set, and it uses the
concept of fuzzy cardinality instead of classical cardinality for identifying
core points. This last choice causes the creation (within the same run of the
algorithm) of both fuzzy clusters with cores having many sparse points and
fuzzy clusters with cores having only a few close points. Thus the density
characteristic of the generated clusters is heterogeneous. A scalable
implementation of FN-DBSCAN, named SFN-DBSCAN, has been proposed in ??, where
an improvement of the efficiency of FN-DBSCAN is described.

Algorithm 2 expandCluster(p, neighborsPts, C, $\epsilon$, MinPts)
Require: p: the point just marked as visited
Require: neighborsPts: the neighborhood of p
Require: C: the current cluster
Require: $\epsilon$: the distance around a point to compute its density
Require: MinPts: density, in points, around a point to be considered a core point
  add p to cluster C
  for all p' ∈ neighborsPts do
    if p' is not visited then
      mark p' as visited
      neighborsPts' = regionQuery(p', $\epsilon$)
      if sizeof(neighborsPts') > MinPts then
        neighborsPts = neighborsPts ∪ neighborsPts'
      end if
    end if
    if p' is not yet member of any cluster then
      add p' to cluster C
    end if
  end for
  return C
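The fuzzy neighborhood and fuzzy cardinality attributed to FN-DBSCAN above can
be illustrated with a short Python sketch. The linear membership function
`max(0, 1 - k * d)`, the parameter names `d_max`, `k` and `alpha`, and their
default values are assumptions chosen for the example; the text only states
that distances are normalized, that membership decays with distance, and that
a minimum membership level delimits the neighborhood.

```python
from math import dist


def fuzzy_neighbourhood_cardinality(points, i, d_max, k=20.0, alpha=0.2):
    """Fuzzy cardinality of the level-alpha fuzzy neighbourhood of points[i].

    d_max normalises distances into [0, 1]. The linear decay and the values
    of k and alpha are assumptions of this sketch. points[i] itself is
    counted, with membership 1.
    """
    total = 0.0
    for q in points:
        d = dist(points[i], q) / d_max        # normalised distance
        membership = max(0.0, 1.0 - k * d)    # decays with the distance from points[i]
        if membership >= alpha:               # level-based neighbourhood set
            total += membership               # fuzzy cardinality: sum of the degrees
    return total


# Two hypothetical neighbourhoods around the origin, with d_max = 100:
few_close = [(0, 0), (1, 0), (-1, 0)]                      # 2 neighbours, degree 0.8 each
many_sparse = [(0, 0), (3, 0), (-3, 0), (0, 3), (0, -3)]   # 4 neighbours, degree 0.4 each

card_close = fuzzy_neighbourhood_cardinality(few_close, 0, d_max=100.0)
card_sparse = fuzzy_neighbourhood_cardinality(many_sparse, 0, d_max=100.0)
```

Both neighbourhoods reach a fuzzy cardinality of about 2.6, which is one way to
see why cores with many sparse points and cores with few close points can
coexist in the same run.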
The utility of a fuzzy DBSCAN has been pointed out in the paper ??, where the
authors use FN-DBSCAN in conjunction with the computation of the convex
hull of the generated fuzzy clusters to derive connected footprints of entities
with arbitrary shape. Having fuzzy clusters allows isoline footprints to be
generated.
A second paper is somewhat related to the motivations of our proposal, since it
tackles the problem of clustering a huge number of objects strongly affected by
noise ?? when the scale distributions of the objects are heterogeneous. Its
solution does not generate fuzzy clusters, but we report it since this work can
be the basis for our fuzzy extension. To remove noise, the authors first
compute the distance of each point from its k-neighbours and rank the distance
values in decreasing order; then they determine the threshold $\theta$ on the
distance which corresponds to the first minimum of the ordered values. All
points in the first ranked positions, having a distance above the threshold
$\theta$, are noise points and are removed, while the remaining points will
belong to a cluster. These latter points are clustered with the classic DBSCAN
by providing as input parameters $\mathit{minPts} = k$ and
$\epsilon = \theta$. As stated by the authors, the main problem of this
approach is that $\theta$ is somewhat arbitrarily chosen within a range of
possible values.
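The noise-removal step described above can be sketched as follows. This is only
an illustration under stated assumptions: "the distance from its k-neighbours"
is interpreted here as the distance to the k-th nearest neighbour, the function
names `k_distances` and `split_by_threshold` are hypothetical, and the
threshold `theta` is supplied by the caller, since the text notes that its
choice is somewhat arbitrary.

```python
from math import dist


def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour, in input order."""
    result = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return result


def split_by_threshold(points, k, theta):
    """Separate the points whose k-distance exceeds theta (noise) from the rest."""
    kd = k_distances(points, k)
    kept = [p for p, d in zip(points, kd) if d <= theta]
    noise = [p for p, d in zip(points, kd) if d > theta]
    return kept, noise


pts = [(0, 0), (1, 0), (0, 1), (20, 20)]
ranked = sorted(k_distances(pts, 2), reverse=True)  # decreasing order, as in the text
kept, noise = split_by_threshold(pts, k=2, theta=5.0)
```

The points in `kept` would then be passed to the classic DBSCAN with
minPts = k and $\epsilon = \theta$.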
Another motivation for defining a fuzzy DBSCAN is the clustering of objects
whose position is ill-known, as in the paper ??, where the authors propose a
fuzzy distance measure to define the probability that an object is directly
density-reachable from another object.
4 Fuzzification
This membership function gives the value 1 when the number $x$ of elements
in the neighbourhood of a point is greater than $Mpts_{Max}$, the value 0 when
$x$ is below $Mpts_{Min}$, and intermediate values when $x$ is between
$Mpts_{Min}$ and $Mpts_{Max}$.
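A membership function with this behaviour can be sketched in Python as below.
The linear interpolation between the two bounds is an assumption of the sketch,
since the text only says that intermediate values are returned; the name
`mu_min_pts` is likewise hypothetical.

```python
def mu_min_pts(x, mpts_min, mpts_max):
    """Soft constraint on the number x of neighbours of a point.

    Returns 0.0 at or below mpts_min, 1.0 at or above mpts_max, and a value
    in between otherwise. The linear ramp is an assumption of this sketch.
    Requires mpts_min < mpts_max.
    """
    if x >= mpts_max:
        return 1.0
    if x <= mpts_min:
        return 0.0
    return (x - mpts_min) / (mpts_max - mpts_min)


# Hypothetical bounds: at most 7 neighbours gives 0, at least 12 gives 1.
degrees = [mu_min_pts(x, 7, 12) for x in (5, 7, 9.5, 12, 20)]
```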
Let us redefine the fuzzy core. Consider a set $P$ of $N$ points
$p_i = (x_i, y_i)$ with $x_i, y_i$ defined on the spatial domain. Given a point
$p \in P$, if there exist $x$ points $p_i$ in the neighbourhood of $p$, i.e.,
with $|p_i - p| < \epsilon$, s.t. $\mu_{MinP}(x) > 0$, then $p$ is a fuzzy core
point with membership degree to the fuzzy core given by
$Fuzzycore(p) = \mu_{MinP}(x)$. If there exist two fuzzy core points $p_i, p_j$
($Fuzzycore(p_i) > 0$ and $Fuzzycore(p_j) > 0$) with $i \neq j$ s.t.
$|p_i - p_j| < \epsilon$, then they define a cluster $c$, $p_i, p_j \in c$, and
are fuzzy core points of $c$, i.e., $p_i, p_j \in fuzzycore(c)$, with
membership degrees $Fuzzycore_c(p_i)$ and $Fuzzycore_c(p_j)$.
A point $p$ of a cluster that is not a fuzzy core point, but is a boundary
point, is defined as follows: given $p$, if $\exists p_i \in fuzzycore(c)$,
i.e., with membership degree $Fuzzycore_c(p_i) > 0$, s.t.
$|p - p_i| < \epsilon$, then $p$ gets a membership degree to $c$ defined as:
$\mu_c(p) = Fuzzycore_c(p_i)$.
This definition allows fuzzy clusters with a fuzzy core to be generated, where
the membership degree reflects how many core points there are, i.e., the
cluster density. Notice that the points belonging to a cluster $c$ all get the
same membership value to the cluster. However, distinct clusters may have
distinct membership degrees, indicating their distinct density properties.
Moreover, a boundary point $p$ can partially belong to more than one cluster
at the same time, with distinct membership values $\mu_{c_i}(p)$, since
boundary points of a given cluster can be considered as candidate boundary
points of other clusters. This allows fuzzy clusters with overlapping
boundaries to be generated.
Finally, points $p$ that are not part of a cluster are considered noise: if
$\forall c$, $\nexists p_i \in fuzzycore(c)$ s.t. $|p_i - p| < \epsilon$, then
$p$ is noise.
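The roles defined above (fuzzy core, boundary, noise) can be illustrated with
the following Python sketch. It does not build the clusters themselves and is
not the authors' implementation: the function names, the linear form of
$\mu_{MinP}$, the brute-force neighbour count, and the choice of keeping the
largest degree when a boundary point is near several fuzzy cores are all
assumptions made for the example.

```python
from math import dist


def fuzzy_core_degrees(points, eps, mpts_min, mpts_max):
    """Fuzzy core degree of every point (0.0 means it is not a fuzzy core)."""
    degrees = []
    for i, p in enumerate(points):
        # x: number of other points strictly within eps of p
        x = sum(1 for j, q in enumerate(points) if j != i and dist(p, q) < eps)
        if x >= mpts_max:
            degrees.append(1.0)
        elif x <= mpts_min:
            degrees.append(0.0)
        else:
            degrees.append((x - mpts_min) / (mpts_max - mpts_min))  # assumed linear
    return degrees


def classify(points, eps, mpts_min, mpts_max):
    """Label each point 'core', 'boundary' or 'noise', with a membership degree."""
    core = fuzzy_core_degrees(points, eps, mpts_min, mpts_max)
    result = []
    for i, p in enumerate(points):
        if core[i] > 0:
            result.append(("core", core[i]))
            continue
        near = [core[j] for j, q in enumerate(points)
                if j != i and core[j] > 0 and dist(p, q) < eps]
        # Assumption: with several fuzzy cores in range, keep the largest degree.
        result.append(("boundary", max(near)) if near else ("noise", 0.0))
    return result


line = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (6, 0), (10, 0)]
roles = classify(line, eps=2.5, mpts_min=1, mpts_max=3)
```

In this hypothetical example the first point is a fuzzy core with degree 0.5,
the point at (6, 0) is a boundary point, and the point at (10, 0) is noise.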
If there exist two core points $p_i, p_j$ with $i \neq j$ and
$\mu_{dist}(p_i, p_j) = 1$, then $p_i, p_j$ belong to $c$, i.e., they define a
fuzzy cluster $c$, and are core points of $c$, i.e., $p_i, p_j \in core(c)$,
and they get a membership degree to the cluster given by $\mu_c(p) = 1$.
A point $p$ of a fuzzy cluster that is not a core point can be a boundary
point if it satisfies the following: $p \notin core(c)$ is a boundary point of
a cluster $c$ if $\exists p_i \in core(c)$ s.t. $\mu_{dist}(p_i, p) > 0$; $p$
gets a membership degree to cluster $c$ defined as:
5 Experiments
6 Conclusion
The paper reviewed some models for evaluating soft aggregations of selection
conditions with unequal importance weights in flexible queries to databases.
We outlined the drawbacks of these approaches, specifically the fact that they
model only the relative importance of the conditions. Further, we proposed a
generalization of the p-norm model [3] to allow other semantics of importance
weights: besides the relative importance semantics it can model the ideal (desired
or undesired) and the minimum (crisp or broad) acceptance levels of satisfaction
degrees of the conditions.
References
1. Nasibov, E.N., Ulutagay, G. Robustness of density-based clustering methods
with various neighborhood relations. Fuzzy Sets and Systems,
160(24):3601–3615, 2009.
2. Ester, M., Kriegel, H.P., Sander, J., Xu, X. A density-based algorithm for
discovering clusters in large spatial databases with noise. Proc. 2nd Int.
Conf. on Knowledge Discovery and Data Mining, 226–231, 1996.
3. Parker, J.K., Downs, J.A. Footprint generation using fuzzy-neighborhood
clustering. Geoinformatica, 17:285–299, 2013.
[Figure 1: two scatter plots, panels (a) and (b), with axes X and Y; points are labelled by cluster (cluster0 to cluster8 in panel (a), cluster0 to cluster6 in panel (b)).]
Fig. 1. Results of (a) DBSCAN and (b) Approx Fuzzy Core DBSCAN. For DBSCAN we
set Mpts = 9 and $\epsilon$ = 12, while for Approx Fuzzy Core DBSCAN the soft
constraint over the minimum number of points ranges from 7 to 12 and
$\epsilon$ is always equal to 12.
[Figure 2: scatter plot with axes X and Y; legend: fuzzycore(c) = 1.0; 1.0 > fuzzycore(c) >= 0.5; 0.5 > fuzzycore(c) > 0.]
Fig. 2. Inspection of a cluster generated with the Approx Fuzzy Core DBSCAN
approach (Mpts = (9, 12), $\epsilon$ = 12). For the light blue cluster
(Cluster2) shown in Figure 1b we visualize the fuzzy core points grouped into
three categories: fuzzy cores with membership equal to 1 (red cross), fuzzy
cores with membership less than 1 and greater than or equal to 0.5 (blue X),
and fuzzy cores with membership less than 0.5 and greater than 0 (green star).