0% found this document useful (0 votes)
58 views

A Divide and Conquer Approach For MST Based Clustering

This document discusses minimum spanning tree (MST)-based clustering algorithms. It begins by introducing MSTs and their properties, such as the cut property and cycle property. It then discusses how MST-based clustering works, which involves constructing an MST, removing inconsistent edges to form clusters, and repeating until a termination condition is met. The document notes that constructing the MST is the main computational bottleneck, taking O(N2) time for N data points using classical algorithms like Prim's and Kruskal's. It proposes a new divide-and-conquer approach to improve this to better than O(N2) time.

Uploaded by

Gem Kartik
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
58 views

A Divide and Conquer Approach For MST Based Clustering

This document discusses minimum spanning tree (MST)-based clustering algorithms. It begins by introducing MSTs and their properties, such as the cut property and cycle property. It then discusses how MST-based clustering works, which involves constructing an MST, removing inconsistent edges to form clusters, and repeating until a termination condition is met. The document notes that constructing the MST is the main computational bottleneck, taking O(N2) time for N data points using classical algorithms like Prim's and Kruskal's. It proposes a new divide-and-conquer approach to improve this to better than O(N2) time.

Uploaded by

Gem Kartik
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO.

7, JULY 2009 945

A Divide-and-Conquer Approach for Minimum


Spanning Tree-Based Clustering
Xiaochun Wang, Xiali Wang, and D. Mitchell Wilkes, Member, IEEE

Abstract—Due to their ability to detect clusters with irregular boundaries, minimum spanning tree-based clustering algorithms have
been widely used in practice. However, in such clustering algorithms, the search for nearest neighbor in the construction of minimum
spanning trees is the main source of computation and the standard solutions take OðN 2 Þ time. In this paper, we present a fast minimum
spanning tree-inspired clustering algorithm, which, by using an efficient implementation of the cut and the cycle property of the
minimum spanning trees, can have much better performance than OðN 2 Þ.

Index Terms—Clustering, graph algorithms, minimum spanning tree, divisive hierarchical clustering algorithm.

1 INTRODUCTION

G IVEN a set of data points and a distance measure,


clustering is the process of partitioning the data set into
subsets, called clusters, so that the data in each subset share
edges on it. The degree of link of a vertex is the number of
edges that link to this vertex. A loop in a graph is a closed
path. A connected graph has one or more paths between
some properties in common. Usually, the common proper- every pair of points. A tree is a connected graph with no
ties are quantitatively evaluated by some measures of the closed loops. A spanning tree is a tree that contains every
optimality such as minimum intracluster distance or max- point in the data set. If a value is assigned to each edge in
imum intercluster distance, etc. Clustering, as an important the tree, the tree is called a weighted tree. For example, the
tool to explore the hidden structures of modern large weight for each edge can be the distance between its two
databases, has been extensively studied and many algo- end points. The weight of a tree is the total sum of the edge
rithms have been proposed in the literature. Because of the
weights in the tree. The minimum spanning trees are the
huge variety of the problems and data distributions,
spanning trees that have the minimal total weight. Two
different techniques, such as hierarchical, partitional, and
density- and model-based approaches, have been developed properties used to identify edges provably in an MST are
and no techniques are completely satisfactory for all the the cut property and the cycle property [1]. The cut
cases. For example, some classical algorithms rely on either property states that the edge with the smallest weight
the idea of grouping the data points around some “centers” crossing any two partitions of the vertex set must belong to
or the idea of separating the data points using some regular the MST. The cycle property states that the edge with the
geometric curves such as hyperplanes. As a result, they largest weight in any cycle in a graph cannot be in the MST.
generally do not work well when the boundaries of the As a result, when the weight associated with each edge
clusters are irregular. Sufficient empirical evidences have denotes a distance between the two end points, any edge in
shown that a minimum spanning tree representation is quite the minimum spanning tree will be the shortest distance
invariant to the detailed geometric changes in clusters’ between the two subtrees that are connected by that edge.
boundaries. Therefore, the shape of a cluster has little impact Therefore, removing the longest edge will theoretically
on the performance of minimum spanning tree (MST)-based result in a two-cluster grouping. Removing the next longest
clustering algorithms, which allows us to overcome many of edge will result in a three-cluster grouping, and so on. This
the problems faced by the classical clustering algorithms. corresponds to choosing the breaks where the maximum
The MST method is a graphical analysis of an arbitrary weights occur in the sorted edges.
set of data points. In such a graph, two points or vertices Based on this finding, MST-based clustering algorithms
can be connected either by a direct edge, or by a sequence were initially proposed by Zahn [2] and have so far been
of edges called a path. The length of a path is the number of extensively studied in the fields of pattern recognition [3],
[4], [5], [6], image processing [7], [8], [9], biological data
. X. Wang and D.M. Wilkes are with the Department of Electrical analysis [10], [11], microaggregation [12], chemistry [13],
Engineering and Computer Science, Vanderbilt University, VU Station and outlier detection [14], [15]. Usually, MST-based cluster-
B 351649, Nashville, TN 37235. ing algorithms consist of three steps: 1) a minimum
E-mail: [email protected], [email protected]. spanning tree is constructed (typically in quadratic time)
. X. Wang is with the Department of Computer Science, Changan
University, Xi’an, Shaanxi 710061, China. E-mail: [email protected]. using either the Prim’s algorithm [16] or the Kruskal’s [17]
Manuscript received 22 May 2008; revised 26 Oct. 2008; accepted 8 Jan. 2009; algorithm; 2) the inconsistent edges are removed to get a
published online 21 Jan. 2009. set of connected components (clusters); and 3) step 2 is
Recommended for acceptance by E. Keogh. repeated until some terminating condition is satisfied. MST
For information on obtaining reprints of this article, please send e-mail to: is only one of several spanning tree problems that arise in
[email protected], and reference IEEECS Log Number
TKDE-2008-05-0280. practice. There exist other spanning tree-based clustering
Digital Object Identifier no. 10.1109/TKDE.2009.37. algorithms that maximize or minimize the degrees of link
1041-4347/09/$25.00 ß 2009 IEEE Published by the IEEE Computer Society
946 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 7, JULY 2009

of the vertices [18], [19]. However, these algorithms are algorithms, “Reverse Delete” algorithm starts with the full
computationally expensive. graph and delete edges in order of nonincreasing weights
MST-based clustering algorithms have been studied for based on the cycle property as long as doing so does not
decades. With the coming of information explosion, compu- disconnect the graph [24]. The cost of constructing an MST
tational efficiency has become a major issue for modern large using these classical MST algorithms is Oðm log nÞ [16], [17],
databases which typically consist of millions of data items. [25]. More efficient algorithms promise close to linear time
Particularly, with random access memory getting cheaper, complexity under different assumptions [26], [27], [28].
larger and larger main memories become possible to store the In an MST-based clustering algorithm, the inputs are a
whole database for faster system response. As a result, very set of N data points and a distance measure defined upon
efficient MST-based in-memory clustering algorithms are in them. Since every pair of points in the point set is associated
need. In the past, k-d tree (for nearest neighbor search to with an edge, there are NðN  1Þ=2 such edges. The time
avoid some distance computation) [20] and the Delaunay complexity of the Kruskal’s algorithm, the Prim’s algorithm,
Triangulation [15] had been employed in the construction of and the “Reverse Delete” algorithm adapted for this case is
MST to reduce the time complexity to near O(N log N). OðN 2 Þ [29].
Unfortunately, they work well only for dimensions no more
2.2 MST-Based Clustering Algorithms
than 5 [21]. Although many new index structures for nearest
With an MST being constructed, the next step is to define an
neighbor search in high-dimensional databases have been
edge inconsistency measure so as to partition the tree into
proposed recently (as summarized in [22]), their applications
clusters. Like many other clustering algorithms, the number
to the MST problem have not been reported.
of clusters is either given as an input parameter or figured
In this paper, we propose a new MST-inspired clustering
out by the algorithms themselves. Under the ideal condi-
approach that is both computationally efficient and compe-
tion, that is, the clusters are well separated and there exist
tent with the state-of-the-art MST-based clustering techni-
no outliers, the inconsistent edges are just the longest edges.
ques. Basically, our MST-inspired clustering technique tries
However, in real-world tasks, outliers often exist, which
to identify the relatively small number of inconsistent edges
make the longest edges an unreliable indication of cluster
and remove them to form clusters before the complete
separations. In these cases, all the edges that satisfy the
MST is constructed. To be as general as possible, our
inconsistency measure are removed and the data points in
algorithm has no specific requirements on the dimension-
the smallest clusters are regarded as outliers. As a result,
ality of the data sets and the format of the distance measure, the definition of the inconsistent edges and the develop-
though euclidean distance is used as the edge weight in ment of the terminating condition are two major issues that
our experiments. have to be addressed in all MST-based clustering algo-
The rest of this paper is organized as follows: In Section 2, rithms, even when the number of clusters is given as an
we review some existing work on MST-based clustering input parameter. Due to the invisibility of the MST
algorithms. We next present our proposed approach in representation of a data set of dimensionalities beyond 3,
Section 3. In Section 4, an empirical study is conducted to many inconsistency measures have been suggested in the
evaluate the performances of our algorithm with respect to literature. In Zahn’s original work, the inconsistent edges
some state-of-the-art MST-based clustering algorithms. are defined to be those whose weights are significantly
Finally, conclusions are made and future work is discussed larger than the average weight of the nearby edges in the
in Section 5. tree [2]. The performance of this clustering algorithm is
affected by the size of the nearby neighborhood.
2 RELATED WORK Being a more density-oriented approach, Chowdbury
and Murthy’s MST-based clustering technique [4] assumes
2.1 Minimum Spanning Tree Algorithms
that the boundary between any two clusters must belong to
In traditional MST problems, a set of n vertices and a set a valley region (regions where the density of the data points
of m edges in a connected graph are given. A “generic” is the lowest compared to those of the neighboring regions)
minimum spanning tree algorithm grows the tree by adding and the inconsistency measure is based on finding such
one edge at a time [23]. Two popular ways to implement the valley regions. In their technique, the density of any point x
generic algorithm are the Kruskal’s algorithm and the is assumed to be the number of data points present in an
Prim’s algorithm. In the Kruskal’s algorithm, all the edges open disc around x of radius r, which, in their paper, is
are sorted into a nondecreasing order by their weights, and the sth root of the average edge weight of the MST with s
the construction of an MST starts with n trees, i.e., every being the dimensionality of the data set. Both theoretically
vertex being its own tree. Then for each edge to be added in and experimentally, they show that, under a “smoothing”
such a nondecreasing order, check whether its two end- condition, the performance of their proposed clustering
points belong to the same tree. If they do (i.e., a cycle will be technique tends to that of Bayes classifier as the number of
created), such an edge should be discarded. In the Prim’s data points in a given data set increases.
algorithm, the construction of an MST starts with some root In order to achieve the best clustering results, Xu et al.
node t and the tree T greedily grows from t outward. At believe that different MST-based clustering problems may
each step, among all the edges between the nodes in the need different objective functions [11]. In their paper, they
tree T and those not in the tree yet, the node and the edge describe three objective functions. With no predefined
associated with the smallest weight to the tree T are added. cluster number K, their first algorithm simply removes
In opposition to the “generic” minimum spanning tree longest edges consecutively. The objective function is
WANG ET AL.: A DIVIDE-AND-CONQUER APPROACH FOR MINIMUM SPANNING TREE-BASED CLUSTERING 947

defined to minimize the change of the total weight of the clustering algorithm (MSDR), respectively [6]. Requiring the
current clusters from the previous one. By locating number of clusters be given as an input, their HEMST first
the transition point, their program can automatically choose computes the average and the standard deviation of the edge
the number of the clusters for the user. With a predefined weights in the entire EMST and uses their sum as a threshold.
cluster number K, the objective function for their second and Next, edges with a weight larger than the threshold are
third clustering algorithms is defined to minimize the total removed, leading to a set of disjoint subtrees. For each
distance between a cluster center and each data point in that cluster, a representative is identified as its centroid, resulting
cluster. In their paper, the center of a cluster is a position in a reduced data set. An EMST is next constructed on these
(which may or may not be a data point in their second representatives and the same tree partitioning procedure is
objective function but are replaced by K “best” representa- followed until the number of clusters is equal to the preset
number of clusters. With no input information about the
tives chosen from the data set in their third objective
number of clusters, their MSDR is actually a recursive two-
function) chosen to satisfy the objective function. Their
partition optimization problem. In each step, it removes an
iterative algorithm starts by removing K  1 arbitrary edges
edge only when the overall clusters’ weight standard
from the tree, creating a K-partition. Next, it repeatedly deviation reduction is maximized. This process continues
merges a pair of adjacent partitions and finds its optimal until such reduction is within a threshold and the desired
two-clustering solution. They observe that the algorithm number of clusters is obtained by finding the local minimum
quickly converges to a local minimum but can run in an of the standard deviation reduction function. Since every
exponential time in the worst case. edge in the tree is checked before a cutting, the problem with
To be less sensitive to outliers, Laszlo and Mukherjee MSDR is its high computational cost, particularly for very
present an MST-based clustering algorithm (referred to as large data sets.
the LM algorithm in this paper) that puts a constraint on the
minimum cluster size rather than on the number of clusters
[12]. In their algorithm, when cutting edges in a sorted 3 AN MST-INSPIRED CLUSTERING ALGORITHM
nonincreasing order, only when those edges, removing Although MST-based clustering algorithms have been
which can result in two clusters whose sizes are both widely studied, in this section, we describe a new divide-
larger than the minimum cluster size, are reached can the and-conquer scheme to facilitate efficient MST-based
operation proceed. This algorithm was developed for the clustering in modern large databases. Basically, it follows
microaggregation problem, where the number of clusters in the idea of the “Reverse Delete” algorithm. Before proceed-
the data set can be figured out by the constraints of the ing, we give a formal proof of its correctness.
problem itself. Theorem 1. Given a connected, edge-weighted graph, the
More recently, Vathy-Fogarassy et al. suggest three new “Reverse Delete” algorithm produces an MST.
cutting criteria for the MST-based clustering [5]. Their goal
is to decrease the number of heuristically defined para- Proof. First, we show that the algorithm produces a
meters of existing algorithms so as to decrease the influence spanning tree. This is because the graph is given
of the user on the clustering results. First, they suggest connected at the beginning and, when deleting edges in
a global edge-removing threshold, called the attraction the nonincreasing order, only the most expensive edge in
threshold Tat , which is calculated according to a physical any cycle is deleted, which does eliminate the cycles but
model [5] and therefore, is not user defined. The problem not disconnect the graph, resulting in a connected graph
with this criterion is that it needs computing all pairwise containing no cycle at the end. To show that the obtained
distances among the data points. The second criterion is spanning tree is an MST, consider any edge removed by
based on the idea that a point x is nearby another point y if the algorithm. It can be observed that it must lie on some
point x is connected to point y by a path containing j or cycle (otherwise removing it would disconnect the graph)
fewer edges in an MST. However, the selection of the and it must be the most expensive one on it (otherwise
parameter j can significantly depend on the user. Their retaining it would violate the cycle property). Hence, the
third one is based on a validation function, called the fuzzy “Reverse Delete” algorithm produces an MST. u
t
hypervolume validity index, aiming to solve the classic
chaining problem. The idea comes from the fact that after For our MST-inspired clustering problem, it is straight-
removing the inconsistent edges using the first two classical forward that n ¼ N and m ¼ NðN  1Þ=2, and the standard
criteria, the clusters of the MST will be approximated by the solution has OðN 2 log NÞ time complexity. However, m ¼
multivariate Gaussians. Their proposed clustering algo- OðN 2 Þ is not always necessary. The design of a more efficient
rithm iteratively builds the possible clusters based on the scheme is motivated by the following observations. First, the
first two criteria. In each iteration step, the cluster having MST-based clustering algorithms can be more efficient if the
the largest hypervolume is selected for cutting. If the cutting longest edges of an MST can be identified quickly before
cannot be executed classically, it is performed based on the most of the shorter ones are found. This is because, for some
measure of the total fuzzy hypervolume until either a MST-based clustering problems, if we can find the longest
predefined number of clusters or the minimum number of edges in the MST very quickly, there is no need to compute
the objects in the largest partition is reached. the exact distance values associated with the shorter ones.
At almost the same time, Grygorash et al. propose Second, for other MST-based clustering algorithms, if the
two MST-based clustering algorithms, called the Hierarch- longest edges can be found quickly, the Prim’s algorithm can
ical euclidean-distance-based MST clustering algorithm be more efficiently applied to each individual size-reduced
(HEMST) and the Maximum Standard Deviation Reduction cluster. For cases where the number of the longest edges that
948 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 7, JULY 2009

Fig. 1. A two-dimensional five-cluster data set. Fig. 2. Its spanning tree after the sequential initialization.

separate the potential clusters can be much fewer than the It can be seen that the quality of our fast algorithm
number of the shorter edges, this divide-and-conquer depends on the quality of the initialization to quickly
approach will allow us to save the number of distance expose the longest edges. Though the sequential initializa-
computations tremendously. tion gives us a spanning tree, when the data are randomly
stored, such a tree could be far from being optimal. This
3.1 A Simple Idea
situation can be illustrated by a two-dimensional five-
Given a set of s-dimensional data, i.e., each data item is a
cluster data set shown in Fig. 1. Shown in Fig. 2 is its
point in the s-dimensional space, there exists a distance
spanning tree after the sequential initialization (SI). In order
between every pair of the data items. To compute all the
to quickly identify the longest edges, we propose to follow
pairwise distances, the time complexity is OðsN 2 Þ, where N the sequential initialization by multiple runs of a recursive
is the number of data items in the set. Suppose at the procedure known as the divisive hierarchical clustering
beginning, each data item is initialized to have a distance algorithm (DHCA) [30].
with another data item in the set. For example, since the data
items are always stored sequentially, each data item can be 3.2 Divisive Hierarchical Clustering Algorithm
assigned the distance between itself and its immediate Essentially, given a data set, the DHCA starts with
predecessor—called a forward initialized tree—or succes- k randomly selected centers, and then assigns each point
sor—called a backward initialized tree. These initial dis- to its closest center, creating k partitions. At each stage in
tances, whatever they are, provide an upper bound for the the iteration, for each of these k partitions, DHCA
distance of each data item to its neighbor in the MST. In the recursively selects k random centers and continues the
implementation, the data structure consists of two arrays, a clustering process within each partition to form at most
distance array and an index array. The distance array is used kn partitions for the nth stage. In our implementation, the
to record the distance of each data point to some other data procedure continues until the number of elements in a
point in the sequentially stored data set. The index array partition is below k þ 2, at which time, the distance of each
records the index of the data item at the other end of the data item to other data items in that partition can be
distance in the distance array. updated with a smaller value by a brute-force nearest
According to the working principle of the MST-based neighbor search. Such a strategy ensures that points that are
clustering algorithms, a database can be split into partitions close to each other in space are likely to be collocated in the
by identifying and removing the longest inconsistent edges same partition. However, because any data point in a
in the tree. Based on this finding, after the sequential partition is closer to its cluster center (not its nearest
initialization, we can do a search in the distance array (i.e., neighbor) than to the center of any other partition (in case,
the current spanning tree) for the edge that has the largest the data point is equidistant to two or more centers, the
distance value, which we call the potential longest edge partition to which the data point belongs is a random one),
candidate. Then the next step is to check whether or not the data points in the clusters’ boundaries can be
there exists another edge with a smaller weight crossing the misclassified into a wrong partition. Fortunately, such
two partitions connected now by this potential longest edge possibilities can be greatly reduced by multiple runs of
candidate. If the result shows that this potential longest DHCA. To summarize, we believe that the advantage of
edge candidate is the edge with the smallest weight DHCA is that, after multiple runs, each point will be very
crossing the two partitions, we find the longest edge in close to its true nearest neighbor in the data set.
the current spanning tree (ST) that agrees with the longest To demonstrate this fact, one can think of this problem as
edge in the corresponding MST. Otherwise, we record the a set of independent Bernoulli trials where one keeps
update and start another round of the potential longest running DHCA and classifying each data point to its closest
edge candidate identification in the current ST. randomly selected cluster center at each stage of the
WANG ET AL.: A DIVIDE-AND-CONQUER APPROACH FOR MINIMUM SPANNING TREE-BASED CLUSTERING 949

After the DHCA updates, to check whether the longest


edge in the current spanning tree is associated with the true
shortest edge that connects the two partitions, we realize
that, if it does, any other edge crossing the two partitions
(i.e., one end of the edge is in a partition different from that
of the other) should have a larger distance value. Based on
this idea, we use a flag array to mark all the points on one
side of the longest edge to be 1 and all the points on the
other side to be 0. Then the DHCA can be applied multiple
times with the partition centers being chosen only from
the data points marked either 1 or 0, but not both. We
call this procedure the marked DHCA (MDHCA). Only
when the current largest distance value in the distance
array and the one found using MDHCA converge to the
same value at the same location can the procedure stop as
completing the verification of the current longest edge
Fig. 3. Probability distribution of Bernoulli trials. candidate. Such a procedure is an efficient method to
implement the cycle property and our experimental
process, until it succeeds (i.e., it hits its nearest neighbor, or results show that the convergence to the global minimum
at least, its approximate nearest neighbor). Let p be the is very fast.
probability that a random data point hits its nearest As a further improvement, in addition to the distance
neighbor. Let Y be the random variable representing the array and the index array for the spanning tree, we keep an
number of trials needed for a random data point to hit its auxiliary distance array and index array, which will be
nearest neighbor. The probability of obtaining a success on updated with the smallest distance of each data item to
trial y is given by some other one during the forward (or backward) tree
updates without regard to their index orders in the index
P ðY ¼ yÞ ¼ qy1 p; ð1Þ array. Then before each MDHCA, each point is checked
with its approximate nearest neighbor in the auxiliary index
where q ¼ 1  p denotes the probability that a failure
array. If some data point and its approximate nearest
occurs. The relationship between p and P ðY ¼ yÞ is plotted
neighbor have a different mark and the edge connecting
in Fig. 3. From it, we can see that for a randomized process
them is smaller than the current longest edge, we update
(i.e., p ¼ 0:5), at most 50 DHCAs are enough for most of the
the spanning tree with this smaller edge to save the
data points to meet their nearest neighbor. overheard involved in the MDHCA. Only when this smaller
For our purpose, after the sequential initialization, a
edge is larger than the sum of the mean and one standard
spanning tree is constructed and each data item in the tree
deviation of the edge weights in the auxiliary distance array
has already had a distance. During the divisive hierarchical
will the MDHCA be run to do the cycle property check.
clustering process, each data item will have multiple
distance computations. To maintain the spanning tree 3.3 Our MST-Inspired Clustering Algorithm
structure, however, its distance upper bound may not be
Based on the methodology presented in the previous two
updated with the smallest distance among them but with a
sections, given a loose estimate of minimum and maximum
smaller distance value whose neighbor in the index array
has an index value smaller (for forward tree) or larger (for numbers of data items in each cluster, an iterative approach
backward tree) than its own index value in the array index. for our MST-inspired clustering algorithm can be summar-
Further, we would be more interested in those data points ized in the following:
whose distance upper bounds are potential longest edge
1. Start with a spanning tree built by the SI.
candidates than those whose distance upper bounds are too
2. Calculate the mean and the standard deviation of the
small to be given any further consideration. Therefore, after
edge weights in the current distance array and use
the sequential initialization, we can compute the mean and
their sum as the threshold. Partially refine the
the standard deviation of the edge weights from the distance
spanning tree by running our DHCA multiple times
array and use their sum as a threshold value. Then in each
step of the spanning tree updates using DHCA, before we until the percentage threshold difference between two
assign a data item to a cluster center, if its current distance consecutively updated distance arrays is below 106 .
upper bound is smaller than the threshold, we can ignore it 3. Identify and verify the longest edge candidates by
so as to save some computations. Only when the distance running MDHCA until two consecutive longest edge
upper bound is larger than the threshold, do we carry out a distances converge to the same value at the same
distance computation to classify it to its closest cluster center places.
and make the corresponding update. In other words, we 4. Remove this longest edge.
give the potential longest edge candidates more attention. 5. If the number of clusters in the data set is preset or if
However, we do not perform such limiting operation the difference between two consecutively removed
when we randomly choose the centers, which will give the longest edges has a percentage decrement larger
potential longest edge candidates more opportunities to than 50 percent of the previous one, we stop.
have a smaller distance upper bound. Otherwise go to Step 3.
950 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 7, JULY 2009

1. Get a loose estimate of the maximum and minimum


number of data points for each cluster.
2. Always cut the largest subcluster and cut an edge
only when the sizes of both clusters resulted by
cutting that edge are larger than the minimum
number of data points.
3. Terminates when the size of the largest cluster
becomes smaller than the estimated maximum
number of data points.
Our adapted MSDR algorithm is the following:

1. Calculate the mean and the standard deviation of the


edge weights in the distance array and use their sum
as the threshold.
2. Remove the longest edge that is larger than the
threshold and that links either a single point or a
Fig. 4. Updated spanning tree using DHCAs. very small number of data points to the MST.
3. Continue Steps 1 and 2 until the edge is reached, by
We stop Step 2 when the percentage threshold difference removing which, two large groups will form from
between two consecutive pruning thresholds, i.e., its the single largest group before that edge is cut.
percentage decrement, is below a threshold, say 106 in 4. Apply the MSDR algorithm on the denoised MST.
our implementation, because further DHCA-based distance
5. Assign the removed data points the same cluster
upper bound updates will not bring us more gains which
label as their nearest neighbor’s.
are worth the overhead of the DHCA. The spanning tree
after the DHCA updates for the one shown in Fig. 2 is To summarize, the numerical parameters the algorithm
manifested in Fig. 4. needs from the user include the data set, the loosely
The terminating condition presented in the above MST- estimated minimum and maximum numbers of data points
inspired clustering algorithm is under the assumptions that in each cluster, the input k to the DHCA and MDHCA, and
the clusters are well separated and there are no outstanding number of nearest neighbors to keep for each data item in
outliers. However, in many real-world problems, the clusters the auxiliary arrays, while the outputs will be the final
are not always well separated and noise in the form of distance and index arrays, and a labeling array that
outliers often exists. For these cases, some of the longest remembers the cluster label each data item belongs to.
edges do not correspond to any cluster separations or breaks To improve the readability of our proposed MST-
but are associated with the outliers. For such cases, we inspired clustering algorithm, the time complexity analysis
propose terminating conditions that are the adaptation and the detailed descriptions of some of the algorithms in a
results from the LM algorithm and the MSDR algorithm. pseudocode form are presented in the following.
Before doing that, we realize, at any step of our algorithm, we 3.4 Time Complexity Analysis
have a spanning tree upon which MST-inspired clustering
From the description in the previous sections, it can be seen
operation can be performed. The advantage of the LM
that our algorithm mainly consists of two phases. The first
algorithm is the avoidance of unnecessary large number of
phase includes the sequential initialization and the DHCA
small clusters. The problem with it is that the number of
spanning tree updating, and the second phase uses the
clusters is not usually known a priori, though, for the more
MDHCA to locate the longest edges and partitions the
general cases, a loose estimate of the maximum and
obtained approximate minimum spanning tree to form
minimum numbers of data points in each cluster is possible.
sensible clusters. We expect the original DHCAs (i.e., no
The advantage of the MSDR algorithm is that it can find the
thresholding involved) to scale as OðfN log NÞ, where f
optimal cluster separations, particularly for cases where
denotes the number of DHCA constructed before the
there exist some unknown hidden structures in the data set.
terminating condition is satisfied. Since in our implementa-
However, this optimization is based on an exhaustive search
tion, at each step of the spanning tree updating using the
of the breaking edges in the MST whose cutting will give the
DHCA, before we assign a data item to a cluster center, if its
maximum standard deviation reduction and, therefore, has
current distance upper bound is smaller than the threshold
asymptotic complexity exponential with respect to the size of (i.e., the sum of the mean and one standard deviation of the
the data set and does not perform well when a lot of outliers tree edge weights), we ignore it, the time complexity is
exist. For MST-based clustering algorithm, our remedies to actually OðdðxNÞ logðxNÞÞ, where x is between 0 and 1.
these problems are two terminating conditions developed for Therefore, as long as x is small enough, the time complexity
the LM algorithm and the MSDR algorithm to be less could be near linear on average. Though its worst time
sensitive to outliers. The first one is for the LM algorithm complexity could be OðN 2 Þ, the average time complexity of
when a loose estimate of the maximum and minimum the second phase is OðeN log NÞ, where e denotes the
number of data points in each cluster is possible and the other number of MDHCA constructed before the terminating
is for the MSDR algorithm when there exist some unknown condition is satisfied. Since, on average, the number of
hidden structures in the data set. Our adapted LM algorithm longest edges is much smaller than the data set size N, as
is the following: long as the spanning tree constructed in the first phase is
WANG ET AL.: A DIVIDE-AND-CONQUER APPROACH FOR MINIMUM SPANNING TREE-BASED CLUSTERING 951

TABLE 1 TABLE 2
The Node Class The DHCA_ST Member Function

very close to the true minimum spanning tree, we expect


our MST-inspired algorithm to scale as OðN log NÞ.

3.5 Pseudocode for Our Clustering Algorithm


The implementation of the DHCA in our approach is
through the design of a C++ data structure called Node. The
Node data structure has several member variables that
remember the indexes of the subset of the data items that
are clustered into it from its parent level, the indexes of its
randomly chosen k cluster centers from its own set for its
descendants, and a main member function that generates
k new nodes by clustering its own set into k subclusters. The
outputs of the Node data structure are at most k new Nodes
as the descendents of the current one.
The divisive hierarchical clustering process starts with
creating a Node instance, called the topNode. This topNode dist_knn[i].max is the kNNth nearest neighbor of data item i.
has every data item in the data set as its samples. From
these samples, this topNode randomly chooses k data
4 A PERFORMANCE STUDY
points as its clustering centers and assigns each sample to
its nearest one, generating k data subsets in the form of In this section, we present the results of an experimental
study performed to evaluate our MST-inspired clustering
k Nodes. Only when the number of samples in a Node is
algorithm. First, we perform multiple runs of the DHCA on
larger than a predefined cluster size will that Node be
several data sets to check its effectiveness in nearest neighbor
pushed to the back of the topNode, forming an array of
search. Next, we select three two-dimensional clustering
Nodes. This process continues recursively. With the new
problems (which, raised originally in [1] and illustrated in
Nodes being generated on the fly and pushed to the back of
Fig. 5, have clusters of different sizes and shapes) and
the Node array, they will be processed in order until no new compare the performance of our proposed algorithm to that
Nodes are generated and the end of the existing Node array of the k-means algorithm, the LM algorithm, and the MSDR
is reached. algorithm. For this comparison, we assume that the clusters
Totally, we need two variants of the DHCA procedure, are well separated and there are no outstanding outliers. By
DHCA for our spanning tree updating, and MDHCA for the making these two strict assumptions, we would like to show
cycle property implementation. The Node class is summar- that our proposed MST-based clustering algorithm can
ized in Table 1 and the DHCA_ST procedure is given in outperform the classical clustering algorithms in both the
Table 2. The DHCA_ CYC procedure is the same as DHCA_ST execution time and the classification results. Then, we focus
except for the ways to choose cluster centers and will not be our study on the behavior of the proposed MST-inspired
repeated here. clustering algorithm with various parameters and under
952 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 7, JULY 2009

TABLE 3
The Sets of Data

Fig. 5. Probability distribution of DHCA trials. Data1 has 10 clusters, Data21 has two clusters, Data 22 has four
clusters, Data23 has six clusters, Data31 has two clusters, Data32 has
four clusters, and Data33 has six clusters.
different workloads. Particularly, we want to show the
impact of the size of the data set, the dimensionality, and the
input value k to the DHCA on the performance of our the figure, we can see that their performance agrees with
algorithm. Finally, we will evaluate our algorithm on two real our expectation.
higher dimensional large data sets without any assumptions 4.2 Three Classic Clustering Problems
on the data distribution. The results will also be compared
In this section, we investigate the relative performance of our
with those of the LM algorithm and the MSDR algorithm to
proposed algorithm on some classic clustering tasks. To do
check the technical soundness of this study. All the data sets
are briefly summarized in Table 3. so, we select one easy and two relatively difficult clustering
We implemented all the algorithms in C++. All the problems (as shown in Fig. 6) and exclusively use two-
experiments were performed on a computer with Intel Core dimensional data sets because of their visual convenience for
2 Duo Processor E6550 2.33 GHz CPU and 2 GB RAM. The the performance judgement. Further, to clearly demonstrate
operating system running on this computer is Ubuntu that our proposed algorithm is a serious contestant for the
Linux. We use the timer utilities defined in the C standard well-known classical clustering algorithms particularly on
library to report the CPU time. In the experimental large data sets, we make the following two assumptions on
evaluation of our algorithm on the first set of data, we use these test data sets: 1) the clusters are well separated and
the average total running time (RT) in seconds, and the 2) there are no outstanding outliers. This set of experiments
clustering accuracy as the performance metric. Each result
we show was obtained as the average value over 100 runs of
the program for each data set. The results show the
superiority of our MST-based algorithm over the k-means
algorithm and other state-of-the-art MST-based clustering
schemes. In the evaluation of our algorithm on the second
set of data, two real-world data sets are used in the testing
of our MST-inspired clustering algorithm. For these real-
world data sets, two minimum spanning trees are con-
structed. One uses our MST-inspired algorithm and the
other uses the Prim’s algorithm. For the second real-world
data set (which is our working data), our adapted algorithm
is applied and the clustering results are presented. The
results show the performance of our MST-inspired cluster-
ing algorithm with respect to the state-of-the-art methods
when large amounts of outliers exist in the data sets. In all
the experiments, the total execution time accounts for all the
phases of our MST-inspired clustering algorithm, including
those spent on the sequential initialization, the DHCA
updates, the MDHCA, and the rest.
4.1 The Effectiveness of DHCA
We begin by testing the efficiency of DHCA on nearest
neighbor search. Shown in Fig. 5 is the probability
distribution of random variable Y mentioned in Section 3.2
for the DHCA algorithm on seven of our data sets. From Fig. 6. Three classic clustering problems.
WANG ET AL.: A DIVIDE-AND-CONQUER APPROACH FOR MINIMUM SPANNING TREE-BASED CLUSTERING 953

Fig. 8. Typical misclassifications of k-means algorithm.

Each graph shows five lines. Two of these lines represent


the expected execution time to find the small number of
longest edges given an N log N time algorithm and a linear
time algorithm. These lines are extrapolated from the
running time consumed to construct the MST using the
Prim’s algorithm. One line represents the running time of
the Prim’s algorithm. Clearly, the execution time of the
Prim’s MST algorithm increases with the data set sizes in a
quadratic form. One line represents the running time of our
Fig. 7. Experiment 1 results. (a) Ours versus k-means and LM. (b) Ours
algorithm. The rest represent the running time of the
versus Prim and MSDR.
k-means algorithm and that of the MSDR algorithm. From
the figure, it can be seen that our algorithm outperforms the
is run with the input value of k for the DHCA and MDHCA
LM’s algorithm and the MSDR algorithm in running time by
being set to 3. an order of magnitude and exhibits linear scalability on the
The first clustering problem (Data1) has N ¼ 10,000 data
first seven data sets listed in Table 3 with 100 percent correct
points and consists of 10 dense and center-surrounded
clusters’ classification. The k-means algorithm has a similar
clusters. The second clustering problem (Data21) has
running time performance but low correct cluster classifica-
10,512 data points and contains two clusters, with one tion rate (2 percent for Data1 and 0 percent for the rest six).
inside the other. The third clustering problem (Data31) has A typical misclassification of the k-means algorithm for
two clusters as well (N ¼ 11,521) but is formed as a curving Data23 is illustrated in Fig. 8.
irregularly shaped band. To make things a little complex,
we double and then triple the second and the third 4.3 The Impact of k on the Effectiveness of DHCA
clustering tasks to form their four-cluster versions (Data22 In this section, we study the effect of k, the number of
and Data32) and six-cluster versions (Data23 and Data33), partition centers at each stage of the DHCA, on the
respectively. The aim of doing so is to test the running time performance of our algorithm with and without a given
scalability of our algorithm with the size of the data sets. number of clusters. We varied k from 3 to 30. The results
We first investigate the relative performance of our (i.e., the average running time and its standard deviation
algorithm with respect to the k-means algorithm and the over 100 runs) on Data23 in terms of the running time are
LM algorithm when the number of clusters for each data set shown in Fig. 9a. The results (i.e., the average running time
is given. In this case, the running time performance of the and its standard deviation over 100 runs) on Data33 in
LM algorithm is the same as that of the Prim’s algorithm. terms of the running time are shown in Fig. 9b.
The results are presented in Fig. 7a. Next, we compare our Overall, for large k, our algorithm incurs larger number of
algorithm and the MSDR algorithm when the number of distance computations in the DHCA and the MDHCA
clusters is not given a priori. If the number of clusters K processes, and the running time increases with k. As k
in each data set is given, our clustering algorithm stops increases dramatically, the changes in the running time
when K  1 longest edges are detected. If K is not given increase almost linearly with k. For small k’s, the changes in
beforehand, our algorithm stops when the performance the running time are small. This is because, when k is small,
levels off, i.e., when there appears a big jump (between the the overhead of constructing the DHCA and the MDHCA
ðK  1Þth and the Kth longest edges in this experiment) as exceeds the small increases in the distance computations.
measured by above 50 percent of the ðK  1Þth longest Further, with a small increase in k, the total number of Nodes
edge. This is a direct result of our assumptions made upon in the whole execution of our algorithm may decrease a little,
the data distribution. The running time results are pre- which makes our algorithm have even a little better
sented in Fig. 7b. performance than using a smaller k. This phenomenon can
954 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 7, JULY 2009

Fig. 10. An illustration of the effect of dimensionality.

the approximate longest edges are significantly larger than


the smaller edges and very close to the exact longest edges.
The percentage of the approximate longest edges out of the
exact ones over 100 test runs for each data set and the
percentage of the average deviation of their distance values
from the values of the exact longest edges are shown in
Fig. 9. An illustration of the effect of input k to DHCA. (a) Data23 running
time (s) variations with input k to DHCA. (b) Data33 running time (s) Table 4. It can be seen from Table 4 that both the percentage
variations with input k to DHCA. disagreement and the distance deviation are low.

4.6 Performance of Our Algorithm on Real Data


be seen in Fig. 9 when k increases from 3 to 5. When k gets
larger and larger, more distance computations to the To further test our algorithm, we choose two real data sets.
partition centers need to be computed and the increases in The first one, called Corel Histogram, is downloaded from
the distance computations eventually dominate. the UCI KDD Archive [26]. Each item in this data set is a
color histogram of an image with 32 bins which correspond
4.4 The Impact of Dimensionality of the Data Sets to eight levels of hue and four levels of saturation. The
In this section, we study the effect of the dimensionality on second one is our working data produced from an indoor
the performance of our algorithm. We increase the environment of the Department of Electrical Engineering
dimensionality of each two-dimensional data item by and Computer Science at Vanderbilt University. Each item
padding the same values of the data to the end to form 4 contains a 10,000-dimensional color histogram and 41 more
to 200 dimensions. For this experiment, we still keep k to be texture measures and is highly sparse small-valued data
3 with no preset number of clusters. The results on Data23 (the value in each dimension is in the range 0-1). The data
and Data33 in terms of the average running time over set consists of 66,740 feature vectors which are extracted
100 runs are shown in Fig. 10. In the firgure, the red line
denotes the linearly increasing running time. From the
results, we can see that the overhead of DHCA is relatively
low when the time spent on the distance computations
dominates, which also demonstrates the superiority of
our algorithm.

4.5 Some Discussion on the Quality of Our Trees


Although our MST-inspired clustering algorithm works
reasonable well on these three clustering problems, the
issue of whether our spanning tree is a minimum spanning
tree still remains. From the experiment results, we observe
that the spanning tree constructed using our algorithm is
not the exact MST. Some of the longest edges are not the
global but local minimums, examples of which are shown
in Fig. 11. In the examples, the clearly observed longest
edges on the Fig. 11a are the global minimums while the
ones on the right are the local minimums. However, this Fig. 11. (a) An illustration of global minimums vs. (b) local minimums for
phenomenon does not produce any clustering errors since Data33.
WANG ET AL.: A DIVIDE-AND-CONQUER APPROACH FOR MINIMUM SPANNING TREE-BASED CLUSTERING 955

TABLE 4
The Global Minimums Versus The Local Minimums

from 20 color images along the hallway. Additionally, we


took another three more images along the hallway as the Fig. 13. An illustration of performances on our Data.
test images to manifest the performance of the clustering
algorithms on the training images. the averages over 10 test runs for each case. We have used
We first evaluate the performance of our algorithm with 3 for k of the DHCA and MDHCA in this set of experiments.
respect to the Prim’s MST-based clustering algorithm in Before analyzing the results, we first would like to point
terms of the percentage of the agreed distances and edges of out that these two data sets are highly skewed. To see it, we
ours out of those of the MST constructed using the Prim’s calculate the mean and the standard deviation of the tree
algorithm, the running time, and the clustering results. For weights from the MSTs constructed using both the Prim’s
these real data sets, we replace the auxiliary distances and algorithm and our approximate MST algorithm. Next, the
index arrays by an auxiliary data structure which remem-
edges in the trees are sorted into a nonincreasing order. By
bers the distances to and the indexes of kNN nearest
cutting the edges in such a nonincreasing order from trees
neighbors for each data item. The ratios of the running time
until to a point at which the edge cut is no larger than the
used by the Prim’s algorithm to construct the MST for each
data set to that used by our MST-inspired algorithm to sum of the mean and the standard deviation, only three
construct our approximate MST (that is, there is no large groups show up as listed in the second column of
terminating condition and the program stops when the Table 5 (here by large group, we mean those groups whose
second smallest edge is identified and verified), the number of data points is larger than 100). The largest
percentage of the agreed distances and edges, and the total groups have more than 80 percent data points of the
tree weight ratios for each data set with the number of original data sets. By cutting the sorted edges in the same
nearest neighbors (that is, the input parameter kNN to the way from these MSTs until the edge is no larger than the
DHCA_ST procedure) set to be 5, 10, 20, and 30 are shown mean, also three large groups show up as listed in the third
in Figs. 12 and 13. The “Number of data points” in the column of Table 5. And the largest of them include more
figure indicates the amount of longest edges in nonincreas- than 40 percent data points of the original data sets. So the
ing order that are identified by our algorithm. The results two real data sets we used are highly skewed. To illustrate
obtained for each data set using our approximate MST are this, a two-dimensional skewed data set is shown in Fig. 14.
Since our MST-inspired clustering algorithm relies on the
identification and the removal of the edges in a nonincreas-
ing order, it is not expected to perform well with the
existence of large number of outliers in the data set. This
can be seen from Fig. 12 that our algorithm takes a little
longer time than the Prim’s algorithm on Corel Histogram

TABLE 5
Illustration of the Skewness of Data

Fig. 12. An illustration of performances on Corel Histogram.


956 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 7, JULY 2009

estimate about the largest and the smallest number of


elements in each cluster as N=5 and N=20, respectively.
This assumption comes from the fact that for the hallway
environment, we would expect around 10 meaningful
categories of objects. The clustering results presented in
Figs. 15 and 16 are obtained from our approximate MST
used five-nearest neighbors on the three test images. We
list the original three test images in the first column of
these two figures, the results obtained on our MST in the
second column and the results obtained on Prim’s MST in
the last column. Each clustered group is assigned a
different color for the displaying purpose. It seems that
our adapted LM algorithm does a better job than the
Fig. 14. An illustration of a two-dimensional skewed data set. adapted MSDR algorithm since the former can differenti-
ate the blue railing from the wood panel wall while the
data. However, it has a better running time performance on latter cannot. But on the whole, it can be observed that our
our working data. We believe this is because our working algorithms work reasonable well.
data has a much higher dimensionality, that is, 10,041 In fact, due to the nature of our divide-and-conquer
versus 32 and therefore, as mentioned in Section 4.4, the strategy, we do not have to finish constructing the MST
time spent on the distance computation weights over the before doing the clustering. According to the adapted
overhead spent for DHCA and MDHCA and dominates in LM algorithm, we can stop the process as early as the
the whole process. number of data points in the largest group drops below the
Although it can be seen from Figs. 12 and 13 that our loosely estimated maximum number of data points in a
approximate MST is very close to the true MST in terms of cluster, which results in a five times faster clustering with an
total tree weight, unlike in two-dimensional case, there is equal clustering quality on our working data.
usually no way to visually judge the clustering results
directly from their MSTs in higher dimension (>3). To
have a sensible comparison about the quality of our 5 CONCLUSION
approximate MST with that of Prim’s, we applied our As a graph partition technique, the MST-based clustering
adapted LM algorithm and MSDR algorithm to both algorithms are of growing importance in detecting the
MSTs. When the number of clusters in the data sets is irregular boundaries. A central problem in such clustering
not known a priori, our algorithm assumes a loose algorithms is the classic quadratic time complexity on the

Fig. 15. Results of the adapted LM algorithm (a) original image, (b) on our MST, (c) on Prim’s MST.
WANG ET AL.: A DIVIDE-AND-CONQUER APPROACH FOR MINIMUM SPANNING TREE-BASED CLUSTERING 957

Fig. 16. Results of the adapted MSDR algorithm (a) original image, (b) on our MST, (c) on Prim’s MST.

construction of an MST. In this paper, we present a more REFERENCES


efficient method that can quickly identify the longest edges [1] I. Katriel, P. Sanders, and J.L. Traff, “A Practical Minimum
in an MST so as to save some computations. Our contribu- Spanning Tree Algorithm Using the Cycle Property,” Proc.
11th European Symp. Algorithms (ESA ’03), vol. 2832,
tion is the design of a new MST-inspired clustering pp. 679-690, 2003.
algorithm for large data sets (however, without any specific [2] C.T. Zahn, “Graph-Theoretical Methods for Detecting and
requirements on the distance measure used) by utilizing a Describing Gestalt Clusters,” IEEE Trans. Computers, vol. 20,
no. 1, pp. 68-86, Jan. 1971.
DHCA in an efficient implementation of the cut and the
[3] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis.
cycle property. Wiley-Interscience, 1973.
We conducted extensive experiments to evaluate our [4] N. Chowdhury and C.A. Murthy, “Minimum Spanning Tree-
algorithm against the k-means algorithm and two other Based Clustering Technique: Relationship with Bayes Classifier,”
Pattern Recognition, vol. 30, no. 11, pp. 1919-1929, 1997.
state-of-the-art MST-based clustering algorithms on three [5] A. Vathy-Fogarassy, A. Kiss, and J. Abonyi, “Hybrid Minimal
standard synthetic data sets and two real data sets. Spanning Tree and Mixture of Gaussians Based Clustering
The experimental results show that our proposed MST- Algorithm,” Foundations of Information and Knowledge Systems,
pp. 313-330, Springer, 2006.
inspired clustering algorithm is very effective and stable [6] O. Grygorash, Y. Zhou, and Z. Jorgensen, “Minimum Spanning
when applied to various clustering problems. Since there Tree-Based Clustering Algorithms,” Proc. IEEE Int’l Conf. Tools
with Artificial Intelligence, pp. 73-81, 2006.
often exist some structures in the data sets, our algorithm
[7] R.C. Gonzalez and P. Wintz, Digital Image Processing, second ed.
does not necessarily require but can automatically deter- Addison-Wesley, 1987.
mine the desired number of clusters by itself. [8] Y. Xu, V. Olman, and E.C. Uberbacher, “A Segmentation
In the future, we will further study the rich properties of Algorithm for Noisy Images: Design and Evaluation,” Pattern
Recognition Letters, vol. 19, pp. 1213-1224, 1998.
the existing MST algorithms and adapt our proposed MST- [9] Y. Xu and E.C. Uberbacher, “2D Image Segmentation Using
inspired clustering algorithm to more general and larger Minimum Spanning Trees,” Image and Vision Computing, vol. 15,
pp. 47-57, 1997.
data sets, particularly when the whole data set cannot fit
[10] D.J. States, N.L. Harris, and L. Hunter, “Computationally Efficient
into the main memory. Cluster Representation in Molecular Sequence Megaclassifica-
tion,” ISMB, vol. 1, pp. 387-394, 1993.
[11] Y. Xu, V. Olman, and D. Xu, “Clustering Gene Expression Data
ACKNOWLEDGMENTS Using a Graph-Theoretic Approach: An Application of Minimum
Spanning Trees,” Bioinformatics, vol. 18, no. 4, pp. 536-545, 2002.
The authors wish to thank the US National Science [12] M. Laszlo and S. Mukherjee, “Minimum Spanning Tree Partition-
Foundation for its valuable support of this work under ing Algorithm for Microaggregation,” IEEE Trans. Knowledge and
award 0325641 and all the anonymous reviewers for their Data Eng., vol. 17, no. 7, pp. 902-911, July 2005.
[13] M. Forina, M.C.C. Oliveros, C. Casolino, and M. Casale,
insightful comments which enabled their paper to be a “Minimum Spanning Tree: Ordering Edges to Identify Clustering
significant better product. Structure,” Analytical Chimica Acta, vol. 515, pp. 43-53, 2004.
958 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 7, JULY 2009

[14] M.F. Jiang, S.S. Tseng, and C.M. Su, “Two-Phase Clustering Xiaochun Wang received the BS degree from
Process for Outliers Detection,” Pattern Recognition Letters, vol. 22, Beijing University and the PhD degree from the
pp. 691-700, 2001. Department of Electrical Engineering and Com-
[15] J. Lin, D. Ye, C. Chen, and M. Gao, “Minimum Spanning Tree- puter Science, Vanderbilt University. Her re-
Based Spatial Outlier Mining and Its Applications,” Lecture Notes search interests are in computer vision, signal
in Computer Science, vol. 5009/2008, pp. 508-515, Springer-Verlag, processing, and pattern recognition.
2008.
[16] R. Prim, “Shortest Connection Networks and Some General-
ization,” Bell Systems Technical J., vol. 36, pp. 1389-1401, 1957.
[17] J. Kruskal, “On the Shortest Spanning Subtree and the Traveling
Salesman Problem,” Proc. Am. Math. Soc., pp. 48-50, 1956.
[18] L. Caccetta and S.P. Hill, “A Branch and Cut Method for the
Degree-Constrained Minimum Spanning Tree Problem,” Net- Xiali Wang received the PhD degree from the
works, vol. 37, no. 2, pp. 74-83, 2001. Department of Computer Science, Northwest
[19] N. Paivinen, “Clustering with a Minimum Spanning Tree of Scale- University, China, in 2005. He is a faculty
Free-Like Structure,” Pattern Recognition Letters, vol. 26, no. 7, member in the Department of Computer Science,
pp. 921-930, Elsevier, 2005. Changan University, China. His research inter-
[20] J.L. Bentley and J.H. Friedman, “Fast Algorithms for Constructing ests are in computer vision, signal processing,
Minimal Spanning Trees in Coordinate Spaces,” IEEE Trans. intelligent traffic system, and pattern recognition.
Computers, vol. 27, no. 2, pp. 97-105, Feb. 1978.
[21] S.D. Bay and M. Schwabacher, “Mining Distance-Based Outliers in
Near Linear Time with Randomization and a Simple Pruning
Rule,” Proc. Ninth ACM SIGKDD Int’l Conf. Knowledge Discovery
and Data Mining, pp. 29-38, 2003.
[22] H.V. Jagadish, B.C. Ooi, K.L. Tan, C. Yu, and R. Zhang, “iDistance:
An Adaptive B+-Tree Based Indexing Method for Nearest D. Mitchell Wilkes received the BSEE degree
Neighbor Search,” ACM Trans. Database System (TODS), vol. 30, from Florida Atlantic, and the MSEE and PhD
no. 2, pp. 364-397, 2005. degrees from Georgia Institute of Technology.
[23] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice His research interests include digital signal
Hall, 1988. processing, image processing and computer
[24] J. Kleinberg and E. Tardos, Algorithm Design, pp. 142-149. Pearson- vision, structurally adaptive systems, sonar,
Addison Wesley, 2005. as well as signal modeling. He is a member
[25] J. Nesetril, E. Milkov’a, and H. Nesetrilov’a, “Otakar Boruvka on of the IEEE and a faculty member at the
Minimum Spanning Tree Problem: Translation of Both the 1926 Department of Electrical Engineering and
Papers, Comments, History,” DMATH: Discrete Math., vol. 233, Computer Science, Vanderbilt University. He
no. 1, pp. 3-36, 2001. is a member of the IEEE.
[26] M. Fredman and D. Willard, “Trans-Dichotomous Algorithms for
Minimum Spanning Trees and Shortest Paths,” Proc. 31st Ann.
IEEE Symp. Foundations of Computer Science, pp. 719-725, 1990. . For more information on this or any other computing topic,
[27] H. Gabow, T. Spencer, and R. Tarjan, “Efficient Algorithms for please visit our Digital Library at www.computer.org/publications/dlib.
Finding Minimum Spanning Trees in Undirected and Directed
Graphs,” Combinatorica, vol. 6, no. 2, pp. 109-122, 1986.
[28] D. Karger, P. Klein, and R. Tarjan, “A Randomized Lineartime
Algorithm to Find Minimum Spanning Trees,” J. ACM, vol. 42,
no. 2, pp. 321-328, 1995.
[29] P. Fränti, O. Virmajoki, and V. Hautamäki, “Fast PNN-Based
Clustering Using K-Nearest Neighbor Graph,” Proc. Third IEEE
Int’l Conf. Data Mining, 2003.
[30] A. Ghoting, S. Parthasarathy, and M.E. Otey, “Fast Mining of
Distance-Based Outliers in High Dimensional Data Sets,” Proc.
SIAM Int’l Conf. Data Mining (SDM), vol. 16, no. 3, pp. 349-364,
2006.
[31] S. Hettich and S.D. Bay, The UCI KDD Archive, Dept. of
Information and Computer Science, Univ. of California, Irvine,
http://kdd.ics.uci.edu/, 1999.

You might also like