A Divide and Conquer Approach For MST Based Clustering
Abstract—Due to their ability to detect clusters with irregular boundaries, minimum spanning tree-based clustering algorithms have
been widely used in practice. However, in such clustering algorithms, the nearest neighbor search in the construction of minimum
spanning trees is the main source of computation, and the standard solutions take O(N²) time. In this paper, we present a fast minimum
spanning tree-inspired clustering algorithm which, by using an efficient implementation of the cut and the cycle properties of
minimum spanning trees, can have much better performance than O(N²).
Index Terms—Clustering, graph algorithms, minimum spanning tree, divisive hierarchical clustering algorithm.
1 INTRODUCTION
of the vertices [18], [19]. However, these algorithms are computationally expensive.

MST-based clustering algorithms have been studied for decades. With the coming of the information explosion, computational efficiency has become a major issue for modern large databases, which typically consist of millions of data items. Particularly, with random access memory getting cheaper, larger and larger main memories make it possible to store the whole database for faster system response. As a result, very efficient MST-based in-memory clustering algorithms are needed. In the past, the k-d tree (for nearest neighbor search to avoid some distance computations) [20] and the Delaunay Triangulation [15] have been employed in the construction of the MST to reduce the time complexity to near O(N log N). Unfortunately, they work well only for dimensions no more than 5 [21]. Although many new index structures for nearest neighbor search in high-dimensional databases have been proposed recently (as summarized in [22]), their application to the MST problem has not been reported.

In this paper, we propose a new MST-inspired clustering approach that is both computationally efficient and competitive with state-of-the-art MST-based clustering techniques. Basically, our MST-inspired clustering technique tries to identify the relatively small number of inconsistent edges and remove them to form clusters before the complete MST is constructed. To be as general as possible, our algorithm has no specific requirements on the dimensionality of the data sets or the format of the distance measure, though euclidean distance is used as the edge weight in our experiments.

The rest of this paper is organized as follows: In Section 2, we review some existing work on MST-based clustering algorithms. We next present our proposed approach in Section 3. In Section 4, an empirical study is conducted to evaluate the performance of our algorithm with respect to some state-of-the-art MST-based clustering algorithms. Finally, conclusions are drawn and future work is discussed in Section 5.

2 RELATED WORK

2.1 Minimum Spanning Tree Algorithms
In traditional MST problems, a set of n vertices and a set of m edges in a connected graph are given. A "generic" minimum spanning tree algorithm grows the tree by adding one edge at a time [23]. Two popular ways to implement the generic algorithm are Kruskal's algorithm and Prim's algorithm. In Kruskal's algorithm, all the edges are sorted into nondecreasing order by their weights, and the construction of an MST starts with n trees, i.e., every vertex being its own tree. Then, for each edge considered in this nondecreasing order, the algorithm checks whether its two endpoints belong to the same tree. If they do (i.e., a cycle would be created), the edge is discarded. In Prim's algorithm, the construction of an MST starts with some root node t and the tree T greedily grows from t outward. At each step, among all the edges between the nodes in the tree T and those not yet in the tree, the node and the edge associated with the smallest weight are added to the tree T. In opposition to these "generic" minimum spanning tree algorithms, the "Reverse Delete" algorithm starts with the full graph and deletes edges in order of nonincreasing weights, based on the cycle property, as long as doing so does not disconnect the graph [24]. The cost of constructing an MST using these classical MST algorithms is O(m log n) [16], [17], [25]. More efficient algorithms promise close to linear time complexity under different assumptions [26], [27], [28].
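For illustration, a minimal C++ sketch of Kruskal's algorithm using a union-find structure is given below; the Edge struct and the function names are our own and are not taken from [17] or [23].

#include <algorithm>
#include <numeric>
#include <vector>

// Minimal union-find (disjoint-set) structure used to detect cycles.
struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    bool unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return false;       // same tree: adding this edge would create a cycle
        parent[a] = b;
        return true;
    }
};

struct Edge { int u, v; double w; };

// Kruskal's algorithm: examine edges in nondecreasing weight order and keep an edge
// only if its endpoints currently belong to different trees.
std::vector<Edge> kruskalMST(int n, std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.w < b.w; });
    UnionFind uf(n);
    std::vector<Edge> mst;
    for (const Edge& e : edges)
        if (uf.unite(e.u, e.v)) mst.push_back(e);   // discarded edges would close a cycle
    return mst;
}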
In an MST-based clustering algorithm, the inputs are a set of N data points and a distance measure defined upon them. Since every pair of points in the point set is associated with an edge, there are N(N−1)/2 such edges. The time complexity of Kruskal's algorithm, Prim's algorithm, and the "Reverse Delete" algorithm adapted for this case is O(N²) [29].

2.2 MST-Based Clustering Algorithms
With an MST being constructed, the next step is to define an edge inconsistency measure so as to partition the tree into clusters. Like many other clustering algorithms, the number of clusters is either given as an input parameter or determined by the algorithm itself. Under the ideal condition, that is, when the clusters are well separated and there exist no outliers, the inconsistent edges are simply the longest edges. However, in real-world tasks, outliers often exist, which makes the longest edges an unreliable indication of cluster separation. In these cases, all the edges that satisfy the inconsistency measure are removed and the data points in the smallest clusters are regarded as outliers. As a result, the definition of the inconsistent edges and the development of the terminating condition are two major issues that have to be addressed in all MST-based clustering algorithms, even when the number of clusters is given as an input parameter. Because the MST representation of a data set cannot be visualized for dimensionalities beyond 3, many inconsistency measures have been suggested in the literature. In Zahn's original work, the inconsistent edges are defined to be those whose weights are significantly larger than the average weight of the nearby edges in the tree [2]. The performance of this clustering algorithm is affected by the size of the nearby neighborhood.
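For concreteness, a small C++ sketch of a Zahn-style inconsistency test is given below; the neighborhood depth d and the factor c are illustrative parameters of our own choosing, not values prescribed in [2].

#include <cmath>
#include <utility>
#include <vector>

// The MST is stored as an adjacency list: for each vertex, a list of (neighbor, weight) pairs.
using Tree = std::vector<std::vector<std::pair<int, double>>>;

// Collect the weights of tree edges within 'depth' hops of 'start', never stepping back to 'skip'.
static void collectNearby(const Tree& t, int start, int skip, int depth,
                          std::vector<double>& out) {
    if (depth == 0) return;
    for (const auto& [next, w] : t[start]) {
        if (next == skip) continue;
        out.push_back(w);
        collectNearby(t, next, start, depth - 1, out);
    }
}

// Zahn-style test: the edge (u, v) of weight w is flagged as inconsistent if w is significantly
// larger (here, more than c standard deviations) than the mean weight of the nearby edges.
bool isInconsistent(const Tree& t, int u, int v, double w, int depth = 2, double c = 2.0) {
    std::vector<double> nearby;
    collectNearby(t, u, v, depth, nearby);
    collectNearby(t, v, u, depth, nearby);
    if (nearby.empty()) return false;
    double mean = 0.0, var = 0.0;
    for (double x : nearby) mean += x;
    mean /= nearby.size();
    for (double x : nearby) var += (x - mean) * (x - mean);
    return w > mean + c * std::sqrt(var / nearby.size());
}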
Being a more density-oriented approach, Chowdbury and Murthy's MST-based clustering technique [4] assumes that the boundary between any two clusters must belong to a valley region (a region where the density of the data points is the lowest compared to those of the neighboring regions), and the inconsistency measure is based on finding such valley regions. In their technique, the density of any point x is taken to be the number of data points present in an open disc around x of radius r, which, in their paper, is the sth root of the average edge weight of the MST, with s being the dimensionality of the data set. Both theoretically and experimentally, they show that, under a "smoothing" condition, the performance of their proposed clustering technique tends to that of the Bayes classifier as the number of data points in a given data set increases.

In order to achieve the best clustering results, Xu et al. believe that different MST-based clustering problems may need different objective functions [11]. In their paper, they describe three objective functions. With no predefined cluster number K, their first algorithm simply removes the longest edges consecutively. The objective function is
defined to minimize the change of the total weight of the current clusters from the previous one. By locating the transition point, their program can automatically choose the number of clusters for the user. With a predefined cluster number K, the objective function for their second and third clustering algorithms is defined to minimize the total distance between a cluster center and each data point in that cluster. In their paper, the center of a cluster is a position (which may or may not be a data point in their second objective function, but is replaced by K "best" representatives chosen from the data set in their third objective function) chosen to satisfy the objective function. Their iterative algorithm starts by removing K−1 arbitrary edges from the tree, creating a K-partition. Next, it repeatedly merges a pair of adjacent partitions and finds its optimal two-clustering solution. They observe that the algorithm quickly converges to a local minimum but can run in exponential time in the worst case.

To be less sensitive to outliers, Laszlo and Mukherjee present an MST-based clustering algorithm (referred to as the LM algorithm in this paper) that puts a constraint on the minimum cluster size rather than on the number of clusters [12]. In their algorithm, edges are cut in sorted nonincreasing order, and an edge is removed only if doing so results in two clusters whose sizes are both larger than the minimum cluster size. This algorithm was developed for the microaggregation problem, where the number of clusters in the data set can be determined by the constraints of the problem itself.
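A rough C++ sketch of this size-constrained cutting rule is given below, assuming the MST is supplied as an edge list; the names and the quadratic reachability check are ours, for illustration only, and do not reproduce the implementation in [12].

#include <algorithm>
#include <utility>
#include <vector>

struct Edge { int u, v; double w; };

// Count the vertices reachable from 'start' in the current forest, ignoring deleted edges
// and the candidate edge 'skip'.
static int reachable(const std::vector<std::vector<std::pair<int,int>>>& adj,
                     const std::vector<char>& alive, int skip,
                     int start, std::vector<char>& seen) {
    seen[start] = 1;
    int count = 1;
    for (auto [next, eid] : adj[start])
        if (eid != skip && alive[eid] && !seen[next])
            count += reachable(adj, alive, skip, next, seen);
    return count;
}

// LM-style cutting: scan the MST edges from longest to shortest and delete an edge only if
// both components produced by the deletion hold at least 'minSize' points.
// Returns the indices of the deleted edges.
std::vector<int> cutWithMinSize(int n, const std::vector<Edge>& mst, int minSize) {
    std::vector<std::vector<std::pair<int,int>>> adj(n);        // (neighbor, edge id)
    for (int i = 0; i < static_cast<int>(mst.size()); ++i) {
        adj[mst[i].u].push_back({mst[i].v, i});
        adj[mst[i].v].push_back({mst[i].u, i});
    }
    std::vector<int> order(mst.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return mst[a].w > mst[b].w; });

    std::vector<char> alive(mst.size(), 1);
    std::vector<int> cuts;
    for (int eid : order) {
        std::vector<char> seenU(n, 0), seenV(n, 0);
        int sideU = reachable(adj, alive, eid, mst[eid].u, seenU);
        int sideV = reachable(adj, alive, eid, mst[eid].v, seenV);
        if (sideU >= minSize && sideV >= minSize) {
            alive[eid] = 0;                         // the cut is allowed: both sides are large enough
            cuts.push_back(eid);
        }
    }
    return cuts;
}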
More recently, Vathy-Fogarassy et al. suggest three new cutting criteria for MST-based clustering [5]. Their goal is to decrease the number of heuristically defined parameters of existing algorithms so as to decrease the influence of the user on the clustering results. First, they suggest a global edge-removing threshold, called the attraction threshold T_at, which is calculated according to a physical model [5] and is therefore not user defined. The problem with this criterion is that it requires computing all pairwise distances among the data points. The second criterion is based on the idea that a point x is near another point y if point x is connected to point y by a path containing j or fewer edges in an MST. However, the selection of the parameter j can depend significantly on the user. Their third criterion is based on a validation function, called the fuzzy hypervolume validity index, aiming to solve the classic chaining problem. The idea comes from the fact that, after removing the inconsistent edges using the first two classical criteria, the clusters of the MST will be approximated by multivariate Gaussians. Their proposed clustering algorithm iteratively builds the possible clusters based on the first two criteria. In each iteration step, the cluster having the largest hypervolume is selected for cutting. If the cutting cannot be executed classically, it is performed based on the measure of the total fuzzy hypervolume until either a predefined number of clusters or the minimum number of objects in the largest partition is reached.

At almost the same time, Grygorash et al. propose two MST-based clustering algorithms, called the Hierarchical euclidean-distance-based MST clustering algorithm (HEMST) and the Maximum Standard Deviation Reduction clustering algorithm (MSDR), respectively [6]. Requiring the number of clusters to be given as an input, their HEMST first computes the average and the standard deviation of the edge weights in the entire EMST and uses their sum as a threshold. Next, edges with a weight larger than the threshold are removed, leading to a set of disjoint subtrees. For each cluster, a representative is identified as its centroid, resulting in a reduced data set. An EMST is next constructed on these representatives and the same tree partitioning procedure is followed until the number of clusters is equal to the preset number of clusters. With no input information about the number of clusters, their MSDR is actually a recursive two-partition optimization problem. In each step, it removes an edge only when the overall clusters' weight standard deviation reduction is maximized. This process continues until such reduction is within a threshold, and the desired number of clusters is obtained by finding the local minimum of the standard deviation reduction function. Since every edge in the tree is checked before a cut, the problem with MSDR is its high computational cost, particularly for very large data sets.
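As a small illustration of HEMST's first partitioning step, the C++ sketch below removes every EMST edge heavier than the mean plus one standard deviation of the edge weights; it covers only this single step, and the names are ours rather than code from [6]. The later steps (replacing each resulting subtree by its centroid and rebuilding an EMST on the representatives) would then be repeated until the preset number of clusters is reached.

#include <cmath>
#include <vector>

struct Edge { int u, v; double w; };

// One HEMST partitioning step: drop every edge whose weight exceeds
// (mean + standard deviation) of the edge weights in the current EMST.
std::vector<Edge> removeHeavyEdges(const std::vector<Edge>& emst) {
    if (emst.empty()) return {};
    double mean = 0.0;
    for (const Edge& e : emst) mean += e.w;
    mean /= emst.size();

    double var = 0.0;
    for (const Edge& e : emst) var += (e.w - mean) * (e.w - mean);
    double threshold = mean + std::sqrt(var / emst.size());

    std::vector<Edge> kept;                     // the surviving edges form disjoint subtrees
    for (const Edge& e : emst)
        if (e.w <= threshold) kept.push_back(e);
    return kept;
}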
3 AN MST-INSPIRED CLUSTERING ALGORITHM
Although MST-based clustering algorithms have been widely studied, in this section, we describe a new divide-and-conquer scheme to facilitate efficient MST-based clustering in modern large databases. Basically, it follows the idea of the "Reverse Delete" algorithm. Before proceeding, we give a formal proof of its correctness.

Theorem 1. Given a connected, edge-weighted graph, the "Reverse Delete" algorithm produces an MST.

Proof. First, we show that the algorithm produces a spanning tree. This is because the graph is given connected at the beginning and, when deleting edges in nonincreasing order, only the most expensive edge in any cycle is deleted, which eliminates the cycles but does not disconnect the graph, resulting in a connected graph containing no cycle at the end. To show that the obtained spanning tree is an MST, consider any edge removed by the algorithm. It can be observed that it must lie on some cycle (otherwise removing it would disconnect the graph) and it must be the most expensive one on that cycle (otherwise retaining it would violate the cycle property). Hence, the "Reverse Delete" algorithm produces an MST. ∎
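A compact C++ sketch of the "Reverse Delete" procedure is shown below; connectivity after each tentative deletion is checked with a simple depth-first search, which is adequate for illustration but is not how our algorithm is implemented.

#include <algorithm>
#include <vector>

struct Edge { int u, v; double w; };

// Check by depth-first search that all n vertices are still connected by the alive edges.
static bool stillConnected(int n, const std::vector<Edge>& edges,
                           const std::vector<char>& alive) {
    std::vector<std::vector<int>> adj(n);
    for (size_t i = 0; i < edges.size(); ++i)
        if (alive[i]) {
            adj[edges[i].u].push_back(edges[i].v);
            adj[edges[i].v].push_back(edges[i].u);
        }
    std::vector<char> seen(n, 0);
    std::vector<int> stack{0};
    seen[0] = 1;
    int visited = 1;
    while (!stack.empty()) {
        int x = stack.back(); stack.pop_back();
        for (int y : adj[x])
            if (!seen[y]) { seen[y] = 1; ++visited; stack.push_back(y); }
    }
    return visited == n;
}

// Reverse Delete: scan the edges from heaviest to lightest and delete an edge whenever
// the graph stays connected without it (the cycle property guarantees correctness).
std::vector<Edge> reverseDeleteMST(int n, std::vector<Edge> edges) {
    std::vector<int> order(edges.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return edges[a].w > edges[b].w; });

    std::vector<char> alive(edges.size(), 1);
    for (int eid : order) {
        alive[eid] = 0;                              // tentatively delete the heaviest remaining edge
        if (!stillConnected(n, edges, alive))
            alive[eid] = 1;                          // deletion would disconnect the graph: keep it
    }
    std::vector<Edge> mst;
    for (size_t i = 0; i < edges.size(); ++i)
        if (alive[i]) mst.push_back(edges[i]);
    return mst;
}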
For our MST-inspired clustering problem, it is straightforward that n = N and m = N(N−1)/2, and the standard solution has O(N² log N) time complexity. However, m = O(N²) is not always necessary. The design of a more efficient scheme is motivated by the following observations. First, MST-based clustering algorithms can be more efficient if the longest edges of an MST can be identified quickly, before most of the shorter ones are found. This is because, for some MST-based clustering problems, if we can find the longest edges in the MST very quickly, there is no need to compute the exact distance values associated with the shorter ones. Second, for other MST-based clustering algorithms, if the longest edges can be found quickly, Prim's algorithm can be applied more efficiently to each individual size-reduced cluster. For cases where the number of the longest edges that separate the potential clusters is much smaller than the number of the shorter edges, this divide-and-conquer approach will allow us to save tremendously on the number of distance computations.
Fig. 1. A two-dimensional five-cluster data set.
Fig. 2. Its spanning tree after the sequential initialization.
3.1 A Simple Idea
Given a set of s-dimensional data, i.e., each data item is a point in the s-dimensional space, there exists a distance between every pair of data items. Computing all the pairwise distances takes O(sN²) time, where N is the number of data items in the set. Suppose that, at the beginning, each data item is initialized to have a distance to another data item in the set. For example, since the data items are always stored sequentially, each data item can be assigned the distance between itself and its immediate predecessor—called a forward initialized tree—or successor—called a backward initialized tree. These initial distances, whatever they are, provide an upper bound for the distance of each data item to its neighbor in the MST. In the implementation, the data structure consists of two arrays, a distance array and an index array. The distance array is used to record the distance of each data point to some other data point in the sequentially stored data set. The index array records the index of the data item at the other end of the distance in the distance array.
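A minimal C++ sketch of this sequential initialization is given below, assuming the points are stored in one array and euclidean distance is used as the edge weight; the container and function names are ours.

#include <cmath>
#include <vector>

using Point = std::vector<double>;      // one s-dimensional data item

double euclidean(const Point& a, const Point& b) {
    double sum = 0.0;
    for (size_t d = 0; d < a.size(); ++d) sum += (a[d] - b[d]) * (a[d] - b[d]);
    return std::sqrt(sum);
}

// Sequential (forward) initialization: point i is tentatively paired with its predecessor i-1.
// dist[i] is an upper bound on the distance from point i to its MST neighbor, and index[i]
// records which point currently sits at the other end of that edge.
void sequentialInit(const std::vector<Point>& data,
                    std::vector<double>& dist, std::vector<int>& index) {
    const size_t n = data.size();
    dist.assign(n, 0.0);
    index.assign(n, 0);
    for (size_t i = 1; i < n; ++i) {
        dist[i] = euclidean(data[i], data[i - 1]);
        index[i] = static_cast<int>(i - 1);
    }
    // the first point has no predecessor; pairing it with its successor is our assumption
    if (n > 1) {
        dist[0] = euclidean(data[0], data[1]);
        index[0] = 1;
    }
}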
According to the working principle of the MST-based clustering algorithms, a database can be split into partitions by identifying and removing the longest inconsistent edges in the tree. Based on this observation, after the sequential initialization, we can search the distance array (i.e., the current spanning tree) for the edge that has the largest distance value, which we call the potential longest edge candidate. The next step is to check whether or not there exists another edge with a smaller weight crossing the two partitions now connected by this potential longest edge candidate. If the result shows that this potential longest edge candidate is the edge with the smallest weight crossing the two partitions, we have found a longest edge in the current spanning tree (ST) that agrees with the longest edge in the corresponding MST. Otherwise, we record the update and start another round of potential longest edge candidate identification in the current ST.

It can be seen that the quality of our fast algorithm depends on the quality of the initialization to quickly expose the longest edges. Though the sequential initialization gives us a spanning tree, when the data are randomly stored, such a tree could be far from optimal. This situation is illustrated by the two-dimensional five-cluster data set shown in Fig. 1; shown in Fig. 2 is its spanning tree after the sequential initialization (SI). In order to quickly identify the longest edges, we propose to follow the sequential initialization with multiple runs of a recursive procedure known as the divisive hierarchical clustering algorithm (DHCA) [30].

3.2 Divisive Hierarchical Clustering Algorithm
Essentially, given a data set, the DHCA starts with k randomly selected centers and then assigns each point to its closest center, creating k partitions. At each stage in the iteration, for each of these k partitions, DHCA recursively selects k random centers and continues the clustering process within each partition, forming at most k^n partitions at the nth stage. In our implementation, the procedure continues until the number of elements in a partition is below k+2, at which time the distance of each data item to the other data items in that partition can be updated with a smaller value by a brute-force nearest neighbor search. Such a strategy ensures that points that are close to each other in space are likely to be collocated in the same partition. However, because any data point in a partition is closer to its cluster center (not its nearest neighbor) than to the center of any other partition (in case the data point is equidistant to two or more centers, the partition to which the data point belongs is a random one), the data points on the clusters' boundaries can be misclassified into a wrong partition. Fortunately, such possibilities can be greatly reduced by multiple runs of DHCA. To summarize, we believe that the advantage of DHCA is that, after multiple runs, each point will be very close to its true nearest neighbor in the data set.
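The following is a simplified recursive C++ sketch of one DHCA pass under the stopping rule just described (a partition with fewer than k+2 points is finished off by brute force); the function and variable names are ours and do not reproduce the DHCA_ST member function of Table 2.

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

using Point = std::vector<double>;

static double dist2(const Point& a, const Point& b) {      // squared euclidean distance
    double sum = 0.0;
    for (size_t d = 0; d < a.size(); ++d) sum += (a[d] - b[d]) * (a[d] - b[d]);
    return sum;
}

// One DHCA pass: recursively split 'ids' around k randomly chosen centers; once a partition
// holds fewer than k+2 points, refine the dist/index arrays by brute force inside it.
void dhca(const std::vector<Point>& data, const std::vector<int>& ids, int k,
          std::vector<double>& dist, std::vector<int>& index, std::mt19937& rng) {
    if (static_cast<int>(ids.size()) < k + 2) {
        for (size_t a = 0; a < ids.size(); ++a)
            for (size_t b = a + 1; b < ids.size(); ++b) {
                int i = ids[a], j = ids[b];
                double d = std::sqrt(dist2(data[i], data[j]));
                if (d < dist[i]) { dist[i] = d; index[i] = j; }     // tighter upper bound found
                if (d < dist[j]) { dist[j] = d; index[j] = i; }
            }
        return;
    }
    std::vector<int> centers(ids);                  // choose k random centers from the partition
    std::shuffle(centers.begin(), centers.end(), rng);
    centers.resize(k);

    std::vector<std::vector<int>> part(k);          // assign every point to its closest center
    for (int i : ids) {
        int best = 0;
        double bestD = dist2(data[i], data[centers[0]]);
        for (int c = 1; c < k; ++c) {
            double d = dist2(data[i], data[centers[c]]);
            if (d < bestD) { bestD = d; best = c; }
        }
        part[best].push_back(i);
    }
    for (int c = 0; c < k; ++c) {
        if (part[c].size() == ids.size()) return;   // degenerate split (duplicate points): stop
        if (!part[c].empty()) dhca(data, part[c], k, dist, index, rng);
    }
}

Multiple independent runs of this pass, each with fresh random centers, reduce the chance that a boundary point never shares a partition with its true nearest neighbor.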
To demonstrate this fact, one can think of this problem as a set of independent Bernoulli trials, where one keeps running DHCA and classifying each data point to its closest randomly selected cluster center at each stage of the
TABLE 1. The Node Class
TABLE 2. The DHCA_ST Member Function
TABLE 3. The Sets of Data
Fig. 5. Probability distribution of DHCA trials. Data1 has 10 clusters, Data21 has two clusters, Data22 has four clusters, Data23 has six clusters, Data31 has two clusters, Data32 has four clusters, and Data33 has six clusters.
different workloads. Particularly, we want to show the impact of the size of the data set, the dimensionality, and the input value k of the DHCA on the performance of our algorithm. Finally, we will evaluate our algorithm on two real higher dimensional large data sets without any assumptions on the data distribution. The results will also be compared with those of the LM algorithm and the MSDR algorithm to check the technical soundness of this study. All the data sets are briefly summarized in Table 3.

We implemented all the algorithms in C++. All the experiments were performed on a computer with an Intel Core 2 Duo E6550 2.33 GHz CPU and 2 GB of RAM. The operating system running on this computer is Ubuntu Linux. We use the timer utilities defined in the C standard library to report the CPU time. In the experimental evaluation of our algorithm on the first set of data, we use the average total running time (RT) in seconds and the clustering accuracy as the performance metrics. Each result we show was obtained as the average value over 100 runs of the program for each data set. The results show the superiority of our MST-based algorithm over the k-means algorithm and other state-of-the-art MST-based clustering schemes. In the evaluation of our algorithm on the second set of data, two real-world data sets are used in the testing of our MST-inspired clustering algorithm. For these real-world data sets, two minimum spanning trees are constructed: one using our MST-inspired algorithm and the other using Prim's algorithm. For the second real-world data set (which is our working data), our adapted algorithm is applied and the clustering results are presented. The results show the performance of our MST-inspired clustering algorithm with respect to the state-of-the-art methods when large amounts of outliers exist in the data sets. In all the experiments, the total execution time accounts for all the phases of our MST-inspired clustering algorithm, including those spent on the sequential initialization, the DHCA updates, the MDHCA, and the rest.
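For reference, the CPU time could be reported with the C standard library clock() facility along the following lines; this generic sketch is not our actual measurement harness.

#include <cstdio>
#include <ctime>

// Report the processor time consumed by one clustering run.
int main() {
    std::clock_t start = std::clock();

    // ... run the clustering algorithm under test here ...

    std::clock_t end = std::clock();
    double seconds = static_cast<double>(end - start) / CLOCKS_PER_SEC;
    std::printf("CPU time: %.3f s\n", seconds);
    return 0;
}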
4.1 The Effectiveness of DHCA
We begin by testing the efficiency of DHCA for nearest neighbor search. Shown in Fig. 5 is the probability distribution of the random variable Y mentioned in Section 3.2 for the DHCA algorithm on seven of our data sets. From the figure, we can see that their performance agrees with our expectation.

4.2 Three Classic Clustering Problems
In this section, we investigate the relative performance of our proposed algorithm on some classic clustering tasks. To do so, we select one easy and two relatively difficult clustering problems (as shown in Fig. 6) and exclusively use two-dimensional data sets because of their visual convenience for judging the performance. Further, to clearly demonstrate that our proposed algorithm is a serious contestant to the well-known classical clustering algorithms, particularly on large data sets, we make the following two assumptions about these test data sets: 1) the clusters are well separated and 2) there are no outstanding outliers. This set of experiments

Fig. 6. Three classic clustering problems.
TABLE 4. The Global Minimums Versus the Local Minimums
TABLE 5. Illustration of the Skewness of Data
Fig. 15. Results of the adapted LM algorithm (a) original image, (b) on our MST, (c) on Prim’s MST.
Fig. 16. Results of the adapted MSDR algorithm (a) original image, (b) on our MST, (c) on Prim’s MST.
[14] M.F. Jiang, S.S. Tseng, and C.M. Su, "Two-Phase Clustering Process for Outliers Detection," Pattern Recognition Letters, vol. 22, pp. 691-700, 2001.
[15] J. Lin, D. Ye, C. Chen, and M. Gao, "Minimum Spanning Tree-Based Spatial Outlier Mining and Its Applications," Lecture Notes in Computer Science, vol. 5009/2008, pp. 508-515, Springer-Verlag, 2008.
[16] R. Prim, "Shortest Connection Networks and Some Generalization," Bell Systems Technical J., vol. 36, pp. 1389-1401, 1957.
[17] J. Kruskal, "On the Shortest Spanning Subtree and the Traveling Salesman Problem," Proc. Am. Math. Soc., pp. 48-50, 1956.
[18] L. Caccetta and S.P. Hill, "A Branch and Cut Method for the Degree-Constrained Minimum Spanning Tree Problem," Networks, vol. 37, no. 2, pp. 74-83, 2001.
[19] N. Paivinen, "Clustering with a Minimum Spanning Tree of Scale-Free-Like Structure," Pattern Recognition Letters, vol. 26, no. 7, pp. 921-930, Elsevier, 2005.
[20] J.L. Bentley and J.H. Friedman, "Fast Algorithms for Constructing Minimal Spanning Trees in Coordinate Spaces," IEEE Trans. Computers, vol. 27, no. 2, pp. 97-105, Feb. 1978.
[21] S.D. Bay and M. Schwabacher, "Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 29-38, 2003.
[22] H.V. Jagadish, B.C. Ooi, K.L. Tan, C. Yu, and R. Zhang, "iDistance: An Adaptive B+-Tree Based Indexing Method for Nearest Neighbor Search," ACM Trans. Database Systems (TODS), vol. 30, no. 2, pp. 364-397, 2005.
[23] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice Hall, 1988.
[24] J. Kleinberg and E. Tardos, Algorithm Design, pp. 142-149. Pearson-Addison Wesley, 2005.
[25] J. Nesetril, E. Milková, and H. Nesetrilová, "Otakar Boruvka on Minimum Spanning Tree Problem: Translation of Both the 1926 Papers, Comments, History," DMATH: Discrete Math., vol. 233, no. 1, pp. 3-36, 2001.
[26] M. Fredman and D. Willard, "Trans-Dichotomous Algorithms for Minimum Spanning Trees and Shortest Paths," Proc. 31st Ann. IEEE Symp. Foundations of Computer Science, pp. 719-725, 1990.
[27] H. Gabow, T. Spencer, and R. Tarjan, "Efficient Algorithms for Finding Minimum Spanning Trees in Undirected and Directed Graphs," Combinatorica, vol. 6, no. 2, pp. 109-122, 1986.
[28] D. Karger, P. Klein, and R. Tarjan, "A Randomized Linear-Time Algorithm to Find Minimum Spanning Trees," J. ACM, vol. 42, no. 2, pp. 321-328, 1995.
[29] P. Fränti, O. Virmajoki, and V. Hautamäki, "Fast PNN-Based Clustering Using K-Nearest Neighbor Graph," Proc. Third IEEE Int'l Conf. Data Mining, 2003.
[30] A. Ghoting, S. Parthasarathy, and M.E. Otey, "Fast Mining of Distance-Based Outliers in High Dimensional Data Sets," Proc. SIAM Int'l Conf. Data Mining (SDM), vol. 16, no. 3, pp. 349-364, 2006.
[31] S. Hettich and S.D. Bay, The UCI KDD Archive, Dept. of Information and Computer Science, Univ. of California, Irvine, http://kdd.ics.uci.edu/, 1999.

Xiaochun Wang received the BS degree from Beijing University and the PhD degree from the Department of Electrical Engineering and Computer Science, Vanderbilt University. Her research interests are in computer vision, signal processing, and pattern recognition.

Xiali Wang received the PhD degree from the Department of Computer Science, Northwest University, China, in 2005. He is a faculty member in the Department of Computer Science, Changan University, China. His research interests are in computer vision, signal processing, intelligent traffic system, and pattern recognition.

D. Mitchell Wilkes received the BSEE degree from Florida Atlantic, and the MSEE and PhD degrees from Georgia Institute of Technology. His research interests include digital signal processing, image processing and computer vision, structurally adaptive systems, sonar, as well as signal modeling. He is a member of the IEEE and a faculty member at the Department of Electrical Engineering and Computer Science, Vanderbilt University.