Community Detection
Alexandre Vilcek
Abstract
Identifying communities, or clusters, in graphs is a task of great importance when analyzing network structures. A telecommunications provider, for instance, would like to identify communities of customers that place a large number of calls to each other, in order to create more effective, directed marketing campaigns. Another example would be a financial institution trying to identify and understand communities of customers that have a high volume of financial transactions with each other. Community detection is also widely applicable in life sciences research: for example, when studying protein-protein interaction networks.
In this project we will create a new processing pipeline for non-overlapping community detection in network structures based entirely on K-Means. We will show that this approach is similar to a traditional Deep Learning auto-encoder in its ability to learn useful representations of the original data in a lower-dimensional space, making the data clustering task easier to accomplish. We will then test its applicability to the specific challenges of community detection in networks and compare its performance with the traditional Spectral Clustering approach.
1.4 K-Means
In the fields of Machine Learning and Data Mining, K-Means is perhaps the best-known and most studied method for clustering analysis [12].
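As background, the method can be stated compactly. The following is a minimal sketch of Lloyd's algorithm, the standard iterative formulation of K-Means (the seeding strategy and toy data here are illustrative choices, not taken from the text):

```python
import math
import random

def kmeans(points, k, iters=25, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
    return labels, centroids

# Two well-separated point groups: the algorithm recovers them as clusters.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, _ = kmeans(points, k=2)
```

Because the objective is non-convex and the result depends on the initialization, practical runs use several random restarts, which is exactly the "number of random initializations" parameter varied in the experiments below.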
2. Previous Work
There is a large body of previous research investigating Spectral Clustering or derived approaches, applied both to general clustering problems and specifically to community detection in networks [8], [14], [18], [20].
3. Proposed Approach
3.1 Motivation
As the last step, we apply standard K-Means to the resulting lower-dimensional representation to find the desired communities of the graph g.
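As an illustration only, one plausible reading of this pipeline can be sketched end to end: build a node-similarity representation of the graph (the experiments below name Jaccard similarity) and then apply standard K-Means to it. The helper names, the deterministic seeding, and the toy graph are assumptions made for the sketch, not part of the paper's definition:

```python
import math

def jaccard_rows(adj):
    """Feature vector for each node: Jaccard similarity to every other node."""
    nbrs = {u: set(vs) | {u} for u, vs in adj.items()}  # closed neighborhoods
    nodes = sorted(adj)
    return nodes, [
        [len(nbrs[u] & nbrs[v]) / len(nbrs[u] | nbrs[v]) for v in nodes]
        for u in nodes
    ]

def kmeans(rows, k, iters=25):
    # Naive deterministic init (evenly spaced rows), enough for this sketch.
    step = max(1, (len(rows) - 1) // max(1, k - 1))
    cents = [list(rows[min(i * step, len(rows) - 1)]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for r in rows:
            groups[min(range(k), key=lambda c: math.dist(r, cents[c]))].append(r)
        for j, g in enumerate(groups):
            if g:
                cents[j] = [sum(col) / len(g) for col in zip(*g)]
    return [min(range(k), key=lambda c: math.dist(r, cents[c])) for r in rows]

# Two triangles joined by one edge: expected communities {0,1,2} and {3,4,5}.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
nodes, rows = jaccard_rows(adj)
labels = kmeans(rows, k=2)
```

The similarity rows play the role of the learned representation: nodes inside the same dense subgraph have nearly identical rows, so the final K-Means step separates the communities easily.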
3.3 Experiments
We performed a series of experiments to evaluate the accuracy,
computational time complexity, and scalability of Deep K-Means,
comparing it with an implementation of Spectral Clustering as
defined in [14], on the task of finding communities in network
structures.
For those experiments, we analyzed network data with ground-truth communities, both synthetic and real-world data.
3.3.1 Experiment #1
In this experiment we ran both Deep K-Means and Spectral
Clustering on the synthetic networks described earlier. The goal
of this experiment is to investigate whether the proposed algorithm can consistently provide better clustering performance than Spectral Clustering.
We measure accuracy as the normalized mutual information (NMI) between the detected communities A and the ground-truth communities B, where I(A, B) is the mutual information between A and B, which represents the amount of shared information between A and B and is given by:

I(A, B) = Σ_a Σ_b P(a, b) log( P(a, b) / ( P(a) P(b) ) )

For the Deep K-Means runs we used the following parameters:
Number of layers: 3
Maximum number of iterations for K-Means on each
layer: 25
Number of random initializations for K-Means on each
layer: 10
Similarity function: Jaccard
3.3.2 Experiment #2
And H(A) is the entropy of A, which represents the information contained in A and is given by:

H(A) = -Σ_a P(a) log P(a)
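These quantities can be computed directly from the empirical joint distribution of two labelings. A minimal sketch follows; note that the final normalization shown, 2·I/(H+H), is one common NMI variant and is an assumption here, since this copy of the text does not preserve which normalization the authors used:

```python
import math
from collections import Counter

def entropy(labels):
    """H(A) = -sum_a P(a) log P(a), from empirical label frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """I(A, B) = sum_ab P(a, b) log( P(a, b) / (P(a) P(b)) )."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
        for (x, y), c in pab.items()
    )

def nmi(a, b):
    """One common normalization: 2 I(A, B) / (H(A) + H(B))."""
    return 2 * mutual_information(a, b) / (entropy(a) + entropy(b))
```

Identical labelings score 1 and independent labelings score 0, which is what makes NMI convenient for comparing detected communities against ground truth.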
And the second part of the test with the following parameters:
3.3.5 Experiment #5
In this experiment we analyze the performance of Deep K-Means
and Spectral Clustering using the Football dataset [22].
For the auto-encoder pipeline of Deep K-Means we used a single layer without random restarts.
3.3.3 Experiment #3
In this experiment we empirically analyze the time complexity of
Deep K-Means compared to Spectral Clustering.
In Spectral Clustering we expect the time complexity to be dominated by the following steps, which occur sequentially. Here we define n as the number of nodes in the graph and k as the number of communities:
We ran Deep K-Means using a 1-layer auto-encoder and one K-Means initialization. We ran both Deep K-Means and Spectral Clustering 25 times and measured the NMI and execution time for each run.
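A measurement loop of this shape can be sketched as follows; the clustering method and NMI scorer are passed in as callables because their concrete implementations are not part of this snippet:

```python
import time
import statistics

def benchmark(method, score, graph, truth, runs=25):
    """Run `method` repeatedly, recording accuracy and wall-clock time per run."""
    nmis, times = [], []
    for _ in range(runs):
        start = time.perf_counter()
        labels = method(graph)                 # returns a community labeling
        times.append(time.perf_counter() - start)
        nmis.append(score(labels, truth))      # e.g. NMI against ground truth
    return statistics.mean(nmis), statistics.mean(times)
```

Averaging over repeated runs, as the experiment does, smooths out both the randomness of K-Means initialization and timing noise.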
3.4 Results
Number of layers: 1
Maximum number of iterations for K-Means on each
layer: 25
Number of random initializations for K-Means on each
layer: 1, 2, 4, 8, 32
Similarity function: Dice
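The Jaccard and Dice similarity functions named in these configurations differ only in how the overlap of two sets is normalized. A small illustration, using hypothetical neighbor sets for two nodes:

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b)

def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|)"""
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical neighbor sets of two nodes.
u = {1, 2, 3, 4}
v = {3, 4, 5}
# jaccard(u, v) = 2/5 = 0.4; dice(u, v) = 4/7 ≈ 0.571
```

Dice always scores at least as high as Jaccard for the same pair of sets, so switching between them rescales the similarity matrix without changing which neighborhoods look most alike.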
For each part of the test described above, and for each corresponding parameter configuration, we ran Deep K-Means 10 times for graphs g1 through g14 and measured the corresponding average accuracy and average execution time.
For each similarity function described above we ran Deep K-Means 10 times for graphs g1 through g14 and measured the corresponding average accuracy as the normalized mutual information compared to the ground-truth communities.
Fig. 9a: NMI accuracy for Deep K-Means for 1, 2, 3, and 4 auto-encoder layers; dashed lines represent the accuracy average across all graphs for the corresponding configuration
Fig. 9b: Execution time for Deep K-Means for graphs g1 through g10
Fig. 9c: Execution time for Deep K-Means for graphs g1 through g14, shown in logarithmic scale
Fig. 9e: Execution time for Deep K-Means for graphs g1 through g10
Fig. 9f: Execution time for Deep K-Means for graphs g1 through g14, shown in logarithmic scale
Fig. 9g: NMI accuracy for Spectral Clustering and Deep K-Means with 4 layers and 1 random restart
In Fig. 10a and Fig. 10b below we see the accuracy and
execution time, respectively, when running both Deep K-Means
and Spectral Clustering against the Football dataset [22]. As this
is a fairly simple and small network, running Deep K-Means in a
1-layer configuration yields good results. As with the synthetic
datasets in the previous experiments, we see Deep K-Means
outperforming Spectral Clustering.
References
[1] Adamic, Lada A., and Eytan Adar. "Friends and neighbors on the
web." Social networks 25.3 (2003): 211-230.
[2] Bengio, Yoshua. "Learning deep architectures for AI." Foundations
and trends in Machine Learning 2.1 (2009): 1-127.
4. Conclusion
In this work we proposed a new algorithm for non-overlapping
network community detection that leverages ideas from Deep
Learning pipelines for data embedding in lower-dimensional
spaces, which eases the task of clustering the data into
communities.
[14] Ng, Andrew Y., Michael I. Jordan, and Yair Weiss. "On spectral
clustering: Analysis and an algorithm." Advances in neural information
processing systems 2 (2002): 849-856.