Let v, w ∈ V and e = (v, w) ∈ E. The structure of e is defined by the structural similarity of v and w, denoted by κ(e) = σ(v, w).

Vertices in the same cluster have a higher structural similarity than vertices from different clusters. Therefore, intra-cluster edges, i.e. the edges connecting vertices of the same cluster, have a larger edge structure than inter-cluster edges, i.e. the edges connecting vertices of different clusters. Our clustering algorithm aims to cluster vertices by identifying both intra- and inter-cluster edges. For this reason, we sort the edges in ascending order of edge structure and iteratively classify them. We use two sets for the classified edges: set B holds inter-cluster edges, and set W holds intra-cluster edges. At the beginning of the algorithm, all edges are initialized as intra-cluster edges and stored in set W. In each iteration, the edge with minimal edge structure is moved from W to B. If this changes the clusters, the modularity Q defined in (2.2) is updated for the changed clusters. Additionally, the edge structures are updated for the edges that are directly connected to the moved edge. The procedure terminates when all edges have been moved from W to B. The result of our clustering algorithm is a hierarchy of clusters and can be represented as a dendrogram. The optimal clustering can be found at the maximal Q. The pseudo-code of our algorithm is presented in Figure 1.

ALGORITHM DHSCAN(G = <V, E>)
    // all edges are initialized as intra-cluster edges
    W := E; B := ∅; i := 0; Qi := 0;
    while W ≠ ∅ do {
        // move the edge with minimal structure
        remove e := min_struct(W) from W;
        insert e into B;
        find all connected components in W;
        if (number of components increased) {
            i := i + 1;
            define each component in W as a cluster;
            plot level i of dendrogram;
            calculate Qi;
        }
    }
    // get the optimal clustering
    cut the dendrogram at the maximal Q value;
end DHSCAN.

Figure 1 Pseudo-code of the DHSCAN algorithm
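To make the procedure of Figure 1 concrete, the following is a minimal Python sketch of DHSCAN. It is written under two assumptions that are not spelled out in this section: σ(v, w) is taken to be the cosine-style structural similarity over closed neighborhoods used by SCAN [6], and the standard Newman-Girvan modularity is used as a stand-in for the modularity of (2.2). All function and variable names are illustrative only.

    from math import sqrt

    def structural_similarity(adj, v, w):
        # sigma(v, w): overlap of the closed neighborhoods of v and w,
        # normalized by their sizes (cosine form, as assumed above).
        nv, nw = adj[v] | {v}, adj[w] | {w}
        return len(nv & nw) / sqrt(len(nv) * len(nw))

    def modularity(adj, m, clusters):
        # Stand-in for Q of (2.2): Newman-Girvan modularity on the original graph,
        # Q = sum over clusters of (intra-cluster edges / m - (degree sum / 2m)^2).
        q = 0.0
        for cluster in clusters:
            inside = sum(1 for v in cluster for w in adj[v] if w in cluster) / 2
            degree = sum(len(adj[v]) for v in cluster)
            q += inside / m - (degree / (2 * m)) ** 2
        return q

    def components(adj):
        # Connected components of the graph whose adjacency is 'adj'.
        seen, comps = set(), []
        for v in adj:
            if v in seen:
                continue
            comp, stack = set(), [v]
            while stack:
                u = stack.pop()
                if u in comp:
                    continue
                comp.add(u)
                stack.extend(adj[u] - comp)
            seen |= comp
            comps.append(frozenset(comp))
        return comps

    def dhscan(vertices, edges):
        # Divisive procedure of Figure 1: repeatedly move the weakest edge from W
        # to B, record a dendrogram level whenever the number of components grows,
        # and return the partition with maximal modularity.
        full_adj = {v: set() for v in vertices}
        for v, w in edges:
            full_adj[v].add(w)
            full_adj[w].add(v)
        cur_adj = {v: set(nbrs) for v, nbrs in full_adj.items()}  # edges still in W
        m, W = len(edges), set(edges)
        levels, best_q, best_partition = [], float("-inf"), None
        prev_count = len(components(cur_adj))
        while W:
            v, w = min(W, key=lambda e: structural_similarity(cur_adj, *e))
            W.discard((v, w))            # move the weakest edge from W to B
            cur_adj[v].discard(w)
            cur_adj[w].discard(v)        # adjacent similarities now reflect the removal
            comps = components(cur_adj)
            if len(comps) > prev_count:  # a new split: one dendrogram level
                prev_count = len(comps)
                q = modularity(full_adj, m, comps)
                levels.append((q, comps))
                if q > best_q:
                    best_q, best_partition = q, comps
        return best_q, best_partition, levels

The sketch recomputes similarities lazily from the current adjacency instead of maintaining a sorted edge list, which keeps it short; a production implementation would update only the edges incident to the removed one, as described in the text above.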
… [1] and Books about US politics [8]. The performance of DHSCAN is compared with the modularity-based algorithm [4].

We use the adjusted Rand index (ARI) [9] as our measure of agreement between the clustering found by a particular algorithm and the true clustering of the network. It is defined as:

    ARI = \frac{\sum_{i,j}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{n_{i\cdot}}{2} + \sum_{j}\binom{n_{\cdot j}}{2}\right] - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right] / \binom{n}{2}}

where n_{ij} is the number of vertices in both cluster x_i and cluster y_j, and n_{i·} and n_{·j} are the numbers of vertices in clusters x_i and y_j, respectively. ARI ranges between 0 and 1: ARI = 0 means total disagreement between the clusterings, and ARI = 1 means perfect agreement.

The Zachary karate-club dataset was compiled by Zachary [7] during a two-year observation period and is widely studied in the literature [1], [4]. It is a social friendship network between members of a karate club at a US university. The network later split into two groups due to a conflict in club management. By analyzing the relationships between members of the club, we try to assign club members to the two distinct groups that were only observable after the actual split occurred. The result of the DHSCAN algorithm is shown on the graph in Figure 2: shapes (round and rectangle) denote the true classes of the club members, and colors denote the clusters found by DHSCAN. The result is exactly the same as the one obtained by Newman [4], with only one misclassified node, namely node 10.

The dendrogram for the Zachary karate dataset is shown as an example in Figure 3. The optimal clustering is obtained by cutting the dendrogram at the level of maximal modularity Qs defined in (2.2). For this particular example, a Qs value of 0.43 divides the entire dataset into two distinct groups with only one node (node 10) misclassified, and the corresponding ARI value is 0.88. The presented results for all three datasets are obtained in the same manner, by cutting the tree where Qs is maximized.
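As a companion to the ARI definition above, the following is a small Python sketch that evaluates the formula directly from two flat labelings via their contingency counts; the function name and the toy example are illustrative only.

    from collections import Counter
    from math import comb

    def adjusted_rand_index(true_labels, found_labels):
        # n_ij from the contingency table, n_i. row sums, n_.j column sums.
        n = len(true_labels)
        pairs = Counter(zip(true_labels, found_labels))
        rows = Counter(true_labels)
        cols = Counter(found_labels)

        sum_ij = sum(comb(c, 2) for c in pairs.values())
        sum_i = sum(comb(c, 2) for c in rows.values())
        sum_j = sum(comb(c, 2) for c in cols.values())
        expected = sum_i * sum_j / comb(n, 2)

        # ARI = (sum_ij - expected) / (0.5 * (sum_i + sum_j) - expected)
        return (sum_ij - expected) / (0.5 * (sum_i + sum_j) - expected)

    # Two labelings that induce the same two-way split agree perfectly (ARI = 1.0).
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))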
Figure 5 College football data

The clustering results of the DHSCAN algorithm are presented in Figure 6, which demonstrates a good match with the original conference system we are seeking. Most of the errors are in the lower-right part, where two separate conferences are merged together; we believe this is caused by confused edge structure. However, the achieved ARI value of 0.79 is still significantly larger than that of the modularity-based algorithm by Clauset et al. [4], which is 0.50.

Figure 6 Result of DHSCAN on college football data

We measure the accuracy of the clustering using the adjusted Rand index (ARI). The ARI values of DHSCAN and the modularity-based algorithm for the three datasets are presented in Table 1.

Table 1 Adjusted Rand index comparison

                       DHSCAN    Modularity-based
    Zachary karate      0.88          0.88
    Political books     0.64          0.64
    College football    0.79          0.50

The comparison above demonstrates an improved accuracy of DHSCAN over the modularity-based algorithm on the college football dataset. Both DHSCAN and the modularity-based method achieve equivalent results on the Zachary karate and political books datasets.

To demonstrate the efficiency of finding the optimal clustering of networks, we present the Qs and ARI values for each iteration of the algorithm on all three datasets in Figure 7, Figure 8 and Figure 9, respectively. The plots demonstrate that Qs and ARI are positively correlated: a high Qs value indicates a high ARI value, which in turn means a high-quality clustering result. In our experiments, the DHSCAN algorithm is halted as soon as the Qs values start to decline, because the maximal Qs value yields the highest ARI value as well. Note that DHSCAN achieves the optimal clustering quickly, after the removal of only 9 out of 73, 61 out of 421, and 197 out of 601 edges, indicating good efficiency and speed of clustering.

Figure 7 Qs - ARI behavior for Zachary karate data

5. Conclusion

In this paper we proposed a simple divisive hierarchical clustering algorithm, DHSCAN, which iteratively removes links in ascending order of a structural similarity measure. The network is divided into clusters by the removal of links. The iterative divisive procedure produces a dendrogram showing the hierarchical structure of the clusters. Additionally, the divisive procedure stops at the maximum of Newman's modularity. Therefore, our algorithm has two main advantages: (1) it can find the hierarchical structure of clusters; (2) it does not require any parameters.

As future work we will compare DHSCAN with more algorithms for clustering very large networks. Additionally, we will apply our algorithm to analyze very large biological networks such as metabolic and protein interaction networks.
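Relating back to Figures 7-9, the per-level Qs and ARI traces can be reproduced from the output of the earlier DHSCAN sketch. The snippet below is a hypothetical illustration only: it assumes matplotlib, the dhscan() and adjusted_rand_index() sketches given above, and a true_labels mapping from vertices to their true classes; it plots one point per recorded dendrogram level rather than per individual edge removal.

    import matplotlib.pyplot as plt

    def plot_qs_ari(levels, vertices, true_labels):
        # 'levels' is the list of (Qs, partition) pairs returned by dhscan().
        order = sorted(vertices)
        truth = [true_labels[v] for v in order]
        qs_values, ari_values = [], []
        for q, partition in levels:
            # Flatten the partition (a list of vertex sets) into a labeling.
            label = {v: i for i, cluster in enumerate(partition) for v in cluster}
            found = [label[v] for v in order]
            qs_values.append(q)
            ari_values.append(adjusted_rand_index(truth, found))
        plt.plot(qs_values, label="Qs")
        plt.plot(ari_values, label="ARI")
        plt.xlabel("Dendrogram level")
        plt.legend()
        plt.show()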
Figure 8 Qs - ARI behavior for political book data

Figure 9 Qs - ARI behavior for college football data

6. References

[1] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks”, Physical Review E 69, 026113 (2004).
[2] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, “A min-max cut algorithm for graph partitioning and data clustering”, Proc. of the 2001 IEEE International Conference on Data Mining, San Jose, CA, November 29 – December 2, 2001.
[3] J. Shi and J. Malik, “Normalized cuts and image segmentation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8, 2000.
[4] A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks”, Physical Review E 70, 066111 (2004).
[5] Z. Feng, X. Xu, N. Yuruk, and T. A. J. Schweiger, “A Novel Similarity-based Modularity Function for Graph Partitioning”, to be published in Proc. of the 9th International Conference on Data Warehousing and Knowledge Discovery, Regensburg, Germany, September 3-7, 2007.
[6] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: a Structural Clustering Algorithm for Networks”, to be published in Proc. of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, August 12-15, 2007.
[7] W. W. Zachary, “An information flow model for conflict and fission in small groups”, Journal of Anthropological Research 33, 452–473 (1977).
[8] http://www.orgnet.com/.
[9] L. Hubert and P. Arabie, “Comparing Partitions”, Journal of Classification, 193–218, 1985.
[10] http://www-personal.umich.edu/~mejn/netdata/.