
Seventh IEEE International Conference on Data Mining - Workshops

A Divisive Hierarchical Structural Clustering Algorithm for Networks

Nurcan Yuruk, Mutlu Mete, Xiaowei Xu
University of Arkansas at Little Rock
{nxyuruk, mxmete, xwxu}@ualr.edu

Thomas A. J. Schweiger
Acxiom Corporation
[email protected]

Abstract

Many systems in science, engineering and nature can be modeled as networks. Examples are the Internet, metabolic networks and social networks. Network clustering algorithms, which aim to find hidden structures in networks, are important for making sense of complex networked data. In this paper we present a new clustering method for networks. The proposed algorithm can find the hierarchical structure of clusters without requiring any input parameters. Experiments using real data demonstrate the outstanding performance of the new method.

1. Introduction

Many systems in science, engineering and nature can be modeled as networks consisting of nodes and links that represent real entities and relationships between entities. Examples are social networks, biological networks and the Internet. Network clustering aims to find clusters in the network, an important task for uncovering hidden structures in otherwise messy, hard-to-comprehend networks. A cluster can be a community, such as a clique of terrorists in a social network, or a group of molecules sharing similar functions in a biological network.

In this paper, we present DHSCAN, a Divisive Hierarchical Structural Clustering Algorithm for Networks that iteratively removes links in ascending order of a structural similarity measure. The network is divided into disconnected components by the removal of links. The iterative divisive procedure produces a dendrogram showing the hierarchical structure of the clusters. Additionally, the divisive procedure stops at the maximum of a similarity-based modularity that is a slightly modified version of Newman's modularity [1]. Therefore, our algorithm has two main advantages: (1) it can find the hierarchical structure of clusters; (2) it does not require any parameters.

This paper is organized as follows. After a brief review of related work in section 2, the proposed new algorithm is described in section 3. We evaluate the proposed algorithm using real networks whose expected cluster structures are known to us. The experiment results are presented in section 4. We conclude the paper with some future work in section 5.

2. Related work

The goal of network clustering is to partition the network into clusters. Due to the immense need, the network clustering problem has been studied in many science and engineering disciplines for many years. In this section we focus on recent and commonly used algorithms.

The min-max cut method [2] seeks to partition a graph G = {V, E} into two clusters A and B. The principle of min-max clustering is minimizing the number of connections between A and B while maximizing the number of connections within each. A cut is defined as the number of edges that would have to be removed to isolate the vertices in cluster A from those in cluster B. The min-max cut algorithm searches for the clustering that creates two clusters whose cut is minimized while the number of remaining edges is maximized.

A pitfall of this method is that, if one cuts out a single vertex from the graph, one will probably achieve the optimum. Therefore, in practice, the optimization must be accompanied by some constraint, such as requiring A and B to be of equal or similar size, i.e. |A| ≈ |B|. Such constraints are not always appropriate; for example, in social networks some communities are much larger than others.

To amend the issue, a normalized cut [3] was proposed, which normalizes the cut by the total number of connections between each cluster and the rest of the graph. Therefore, cutting out one vertex or some small part of the graph will no longer always yield an optimum.

Both min-max cut and normalized cut methods partition a graph into two clusters. To divide a graph into k clusters, one has to adopt a top-down approach, splitting the graph into two clusters, and then further splitting these clusters, and so on, until k clusters have

0-7695-3019-2/07 $25.00 © 2007 IEEE
DOI 10.1109/ICDMW.2007.73
been detected. There is no guarantee of the optimality of recursive clustering. There is no measure of the number of clusters that should be produced when k is unknown. There is no indicator to stop the bisection procedure.

Recently, modularity was proposed as a quality measure of network clustering [1]. For a clustering of a graph into k clusters, the modularity is defined as:

    Q_n = \sum_{s=1}^{k} [ l_s / L - (d_s / (2L))^2 ]    (2.1)

where L is the number of edges in the graph, l_s is the number of edges between vertices within cluster s, and d_s is the sum of the degrees of the vertices in cluster s. The modularity of a clustering of a graph is the fraction of all edges that lie within each cluster minus the fraction that would lie within each cluster if the graph's vertices were randomly connected. Optimal clustering is achieved when the modularity is maximized. Modularity is defined such that it is zero for two extreme cases: when all vertices are partitioned into a single cluster, and when the vertices are clustered at random. Note that the modularity measures the quality of any network clustering; normalized cut and min-max cut measure only the quality of a clustering into two clusters.

Finding the maximum Q_n is NP-complete. Instead of performing an exhaustive search, various optimization approaches have been proposed. For example, a greedy method based on a hierarchical agglomerative clustering algorithm is proposed in [4], which is faster than many competing algorithms: its running time on a graph with n vertices and m edges is O(md log n), where d is the depth of the dendrogram describing the hierarchical cluster structure.

Although the modularity-based algorithms can find good clusters, they fail to identify nodes playing special roles, such as hubs and outliers, in networks. Hubs connecting many clusters are responsible for spreading ideas or diseases in social networks. Outliers are nodes marginally connected to clusters. Recently we proposed a similarity-based modularity [5] defined as:

    Q_s = \sum_{i=1}^{k} [ IS_i / TS - (DS_i / TS)^2 ]    (2.2)

where k is the number of clusters, IS_i is the total similarity of vertices within cluster i, DS_i is the total similarity between vertices in cluster i and any vertices in the graph, and TS is the total similarity between any two vertices in the graph. The similarity of two vertices is defined by a structural similarity measure:

    \sigma(v, w) = |\Gamma(v) \cap \Gamma(w)| / \sqrt{|\Gamma(v)| \, |\Gamma(w)|}    (2.3)

where Γ(v) is the set of direct neighbors of v. A genetic algorithm is developed in [5] to find the optimal clustering of networks by maximizing the similarity-based modularity. Although the proposed algorithm can find both clusters and hubs in networks, it does not scale well to large networks.

Most recently, we proposed SCAN, a Structural Clustering Algorithm for Networks, in [6]. SCAN can efficiently find clusters, hubs as well as outliers in very large networks by visiting each node exactly once. However, it requires two parameters that may be difficult for users to determine.

3. The algorithm

In this section we present a Divisive Hierarchical Structural Clustering Algorithm for Networks (DHSCAN) that can find the hierarchical structure of clusters in networks without requiring any parameters.

We focus on a simple, undirected and unweighted graph G = {V, E}, where V is a set of vertices and E is a set of unordered pairs of distinct vertices, called edges.

Our method is based on common neighbors. Two vertices are assigned to a cluster according to how they share neighbors. This makes sense when you consider social communities: people who share many friends form a community, and the more friends they have in common, the more intimate the community.

The structure of a vertex can be described by its neighborhood. A formal definition of vertex structure is given as follows.

DEFINITION 1 (VERTEX STRUCTURE)
Let v ∈ V. The structure of v is defined by its neighborhood, denoted by Γ(v):

    \Gamma(v) = \{ w \in V \mid (v, w) \in E \} \cup \{ v \}

The structure similarity between vertices can be measured by the normalized number of common neighbors, also known as the cosine similarity measure commonly used in information retrieval.

DEFINITION 2 (STRUCTURE SIMILARITY)
Let v, w ∈ V. The structure similarity of v and w is defined by their common neighborhood normalized by the geometric mean of the neighborhood sizes, denoted by σ(v, w):

    \sigma(v, w) = |\Gamma(v) \cap \Gamma(w)| / \sqrt{|\Gamma(v)| \, |\Gamma(w)|}

Every edge can be represented by its two end vertices. Therefore, we can define the structure of an edge by the structure similarity of its two end vertices.

DEFINITION 3 (EDGE STRUCTURE)

Let v, w ∈ V and e = (v, w) ∈ E. The structure of e is defined by the structural similarity of v and w, denoted by κ(e) = σ(v, w).

The vertices in the same cluster have a higher structural similarity than vertices from different clusters. Therefore, intra-cluster edges, i.e. the edges connecting vertices of the same cluster, have a larger edge structure than inter-cluster edges, i.e. the edges connecting vertices of different clusters. Our clustering algorithm aims to cluster vertices by identifying both intra- and inter-cluster edges. For this reason, we sort the edges in ascending order of edge structure and iteratively classify them. We use two sets for the classified edges: set B for inter-cluster edges and set W for intra-cluster edges. At the beginning of the algorithm, all edges are initialized as intra-cluster edges and stored in set W. In each iteration, the edge with minimal edge structure is moved from W to B. If there is any change in terms of the clusters, the modularity Q defined in (2.2) is updated for the changed clusters. Additionally, the edge structures are updated for the edges directly connected to the moved edge. The procedure terminates when all edges have been moved from W to B. The result of our clustering algorithm is a hierarchy of clusters and can be represented as a dendrogram. The optimal clustering can be found at the maximal Q. The pseudo-code of our algorithm is presented in Figure 1.

ALGORITHM DHSCAN(G = <V, E>)
  // all edges are initialized as intra-cluster edges;
  W := E; B := ∅; i := 0; Qi := 0;
  while W ≠ ∅ do {
    // move the edge with minimal structure;
    remove e := min_struct(W) from W;
    insert e into B;
    find all connected components in W;
    if (number of components increased) {
      i := i + 1;
      define each component in W as a cluster;
      plot level i of the dendrogram;
      calculate Qi;
    }
  }
  // get the optimal clustering;
  cut the dendrogram at the maximal Q value;
end DHSCAN.

Figure 1 DHSCAN algorithm

4. Experiments

In this section, we evaluate the algorithm DHSCAN using three real datasets: the well-known Zachary karate dataset [7], the American College Football network [1] and the Books about US politics network [8]. The performance of DHSCAN is compared with the modularity-based algorithm [4].

We use the adjusted Rand index (ARI) [9] as our measure of agreement between the clustering results found by a particular algorithm and the true clustering of the network. It is defined as:

    ARI = [ \sum_{i,j} \binom{n_{ij}}{2} - \sum_i \binom{n_{i.}}{2} \sum_j \binom{n_{.j}}{2} / \binom{n}{2} ]
          / [ \frac{1}{2} ( \sum_i \binom{n_{i.}}{2} + \sum_j \binom{n_{.j}}{2} ) - \sum_i \binom{n_{i.}}{2} \sum_j \binom{n_{.j}}{2} / \binom{n}{2} ]

where n_{ij} is the number of vertices in both cluster x_i and cluster y_j, and n_{i.} and n_{.j} are the numbers of vertices in clusters x_i and y_j respectively. ARI ranges between 0 and 1: ARI = 0 means a total disagreement of the clusterings, and ARI = 1 means a perfect agreement.

The Zachary karate-club dataset was compiled by Zachary [7] during a two-year period of observation and is widely studied in the literature [1], [4]. It is a social friendship network between members of a karate club at a US university. The network later split into two groups due to a conflict in club management. By analyzing the relationships between members of the club, we try to assign club members to the two distinct groups that became observable only after the actual split occurred. The result of the DHSCAN algorithm is shown on the graph in Figure 2. Shapes (round and rectangle) denote the true classes of club members and colors denote the clusters found by DHSCAN. The result is exactly the same as the one obtained by Newman [4], with only one misclassified node, namely node 10.

Figure 2 Zachary karate data

DHSCAN is a divisive hierarchical clustering algorithm and produces a hierarchical clustering of nodes represented by a tree structure called a dendrogram. The dendrogram for the Zachary karate
dataset is shown as an example in Figure 3. The optimal clustering is obtained by cutting the dendrogram at the level of maximal modularity Qs defined in (2.2). For this particular example, the Qs value of 0.43 divides the entire dataset into two distinct groups, with only one node (node 10) misclassified and a corresponding ARI value of 0.88. The presented results for all three datasets are obtained in the same manner, by cutting the tree where Qs is maximized.

Figure 3 Dendrogram of Zachary karate data

The second example is the classification of books about US politics. We use the dataset of Books about US politics compiled by Valdis Krebs [8]. There is a link between two books if they are co-purchased frequently enough by the same buyers. The vertices have been given values "l", "n", or "c" to indicate whether they are "liberal", "neutral", or "conservative". These alignments were assigned separately by Newman [10]. The true classes corresponding to liberal, neutral and conservative are denoted by round, diamond and rectangle shapes respectively in Figure 4.

Figure 4 Political books data

The DHSCAN clustering results are represented by different colors: liberals are in blue, neutrals in gray and conservatives in red. The large overlap between shapes and colors indicates a good clustering yielded by DHSCAN. The achieved adjusted Rand index value of 0.64 is the same for both DHSCAN and the modularity-based clustering algorithm by Clauset et al. [4].

The last dataset used in our experiment is a social network task: detecting the communities (or conferences) of American college football teams [1]. The graph represents the schedule of Division I-A games for the 2000 season. The National Collegiate Athletic Association (NCAA) divides 115 college football teams into 12 conferences. In addition, there are 5 independent teams (Utah State, Navy, Notre Dame, Connecticut and Central Florida) that do not belong to any conference. The question is how to find the conferences from a graph that represents the schedule of games played by all teams. We presume that, because teams in the same conference are more likely to play with each other, the conference system can be mapped as a structure despite the significant amount of inter-conference play.

Figure 5 illustrates the college football network, where each vertex represents a team and an edge connects two teams if they played a game. Each conference is represented using a color and an integer as the conference ID.
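The adjusted Rand index used to score the results that follow can be computed directly from the contingency counts n_ij and their marginals. A minimal sketch (our helper, not the authors' code; `math.comb` requires Python 3.8+); note that the general formula can dip slightly below zero for clusterings worse than chance, although in practice it is reported in [0, 1]:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    # Contingency counts n_ij plus the row/column marginals n_i. and n_.j.
    n = len(labels_a)
    nij = Counter(zip(labels_a, labels_b))
    ni = Counter(labels_a)
    nj = Counter(labels_b)
    sum_ij = sum(comb(c, 2) for c in nij.values())
    sum_i = sum(comb(c, 2) for c in ni.values())
    sum_j = sum(comb(c, 2) for c in nj.values())
    expected = sum_i * sum_j / comb(n, 2)   # chance-level agreement
    max_index = (sum_i + sum_j) / 2
    return (sum_ij - expected) / (max_index - expected)

# Perfect agreement up to a renaming of the cluster labels gives ARI = 1.
print(adjusted_rand_index([0, 0, 1, 1], ['a', 'a', 'b', 'b']))  # → 1.0
```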
Figure 5 College football data

The clustering result of the DHSCAN algorithm is presented in Figure 6, which demonstrates a good match with the original conference system we are seeking. Most errors are in the lower-right part, where two separate conferences are merged together, which we believe is caused by confused edge structure. However, the achieved ARI value of 0.79 is still significantly larger than that of the modularity-based algorithm by Clauset et al. [4], which is 0.50.

Figure 6 Result of DHSCAN on college football data

We measure the accuracy of the clustering using the adjusted Rand index (ARI). The ARI of DHSCAN and the modularity-based algorithm for the three datasets is presented in Table 1.

Table 1 Adjusted Rand index comparison

                      DHSCAN    Modularity-based
  Zachary-karate       0.88         0.88
  Political books      0.64         0.64
  College football     0.79         0.50

The comparison above demonstrates the improved accuracy of DHSCAN over the modularity-based algorithm on the college football dataset. Both DHSCAN and the modularity-based method achieve equivalent results on the Zachary-karate and political books datasets.

To demonstrate the efficiency of finding the optimal clustering of networks, we present the plots of the Qs and ARI values for each iteration of the algorithm on all three datasets in Figure 7, Figure 8 and Figure 9 respectively. The plots demonstrate that Qs and ARI are positively correlated; thus a high Qs value indicates a high ARI value, which in turn means a high-quality clustering result. In our experiments, the DHSCAN algorithm is halted as the Qs values start to decline, because the maximal Qs value yields the highest ARI value as well. Note that DHSCAN achieves the optimal clustering quickly, after the removal of only 9 out of 73, 61 out of 421 and 197 out of 601 edges respectively, indicating good efficiency and speed of clustering.

Figure 7 Qs - ARI behavior for Zachary karate data

5. Conclusion

In this paper we proposed a simple divisive hierarchical clustering algorithm, DHSCAN, that iteratively removes links in ascending order of a structural similarity measure. The network is divided into clusters by the removal of links. The iterative divisive procedure produces a dendrogram showing the hierarchical structure of the clusters. Additionally, the divisive procedure stops at the maximum of the similarity-based modularity, a slightly modified version of Newman's modularity. Therefore, our algorithm has two main advantages: (1) it can find the hierarchical structure of clusters; (2) it does not require any parameters.

As future work we will compare DHSCAN with more algorithms for clustering very large networks. Additionally, we will apply our algorithm to analyze very large biological networks such as metabolic and protein interaction networks.
Figure 8 Qs - ARI behavior for political book data

Figure 9 Qs - ARI behavior for college football data

6. References

[1] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks", Phys. Rev. E 69, 026113 (2004).
[2] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, "A min-max cut algorithm for graph partitioning and data clustering", Proc. of 2001 IEEE International Conference on Data Mining, San Jose, CA, November 29 - December 2, 2001.
[3] J. Shi and J. Malik, "Normalized cuts and image segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8, 2000.
[4] A. Clauset, M. E. J. Newman, and C. Moore, "Finding community structure in very large networks", Physical Review E 70, 066111 (2004).
[5] Z. Feng, X. Xu, N. Yuruk and T. A. J. Schweiger, "A Novel Similarity-based Modularity Function for Graph Partitioning", to be published in Proc. of 9th International Conference on Data Warehousing and Knowledge Discovery, Regensburg, Germany, September 3-7, 2007.
[6] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: a Structural Clustering Algorithm for Networks", to be published in Proc. of 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, August 12-15, 2007.
[7] W. W. Zachary, "An information flow model for conflict and fission in small groups", Journal of Anthropological Research 33, 452-473 (1977).
[8] http://www.orgnet.com/.
[9] L. Hubert and P. Arabie, "Comparing Partitions", Journal of Classification, 193-218, 1985.
[10] http://www-personal.umich.edu/~mejn/netdata/.