Using Visualization To Improve Clustering Analysis On Heterogeneous Information Network - 2018
Wenbo Wang†, Yuwei Li†, Feng Wang‡, Xiaopei Liu†∗, Youyi Zheng§†∗
† ShanghaiTech University
‡ Jilin University
§ Zhejiang University
∗ Corresponding author.
Abstract—The exploration and analysis of data mining methodologies is an important task for effective knowledge discovery, especially in today's heterogeneous information networks. Previously presented approaches for mining optimization aim primarily at improving time complexity, space complexity, accuracy, and robustness. We extend the state-of-the-art method by concentrating on user-availability and algorithm understandability. Specifically, we use Rankclus, a classic clustering algorithm, as an example. Once the otherwise unseen computing processes are displayed in a visual form, the whole clustering process becomes transparent to users, which may help them understand more clearly and quickly how the algorithm is computed and how the objects influence one another. In addition, we use a density approach to intuitively simplify the discovery of data patterns, and through the visualized results, users can adjust algorithm parameters with or without professional training. Finally, we use two further visual techniques to improve the visualization quality: a heatmap matrix designed for checking the similarities of objects in the same cluster, and a DOItree implemented to further analyze the accuracy of the algorithms.
Index Terms—data mining, heterogeneous information networks, visualization, Rankclus
I. INTRODUCTION
The use of heterogeneous information networks (HINs) [1] [2] [3], such as social networks and the Web, has drawn extremely wide attention in recent years. Meanwhile, an increasing number of exciting discoveries and successful applications have been developed. They are able to find rich information hidden in heterogeneous links between entities in various fields, such as computer science, physics, and biology. Among these developments, clustering and ranking are two prominent analytical techniques toward a better understanding of information networks. Compared with homogeneous networks, algorithms for HINs face higher requirements, because HINs are ubiquitous and have typed nodes and links, which can lead to more informative discoveries. In 2009, Rankclus [4], an algorithm designed for the data exploration of HINs, was proposed by Sun and Han. They proposed the idea of exploring the rank distribution of each cluster to improve clustering results. This algorithm can provide a higher accuracy of classifying entities in today's complicated information networks. Based on their work, many alternatives have gradually been proposed, such as Netclus [5] and PathSim [6].
In the past years, Rankclus-based classification algorithms have been contributed to address the challenges in HINs, and we observe that most of these improvements derive from the perspective of algorithm design principles and features. For example, they consider improvements in time complexity, space complexity, accuracy, and robustness. Other researchers work from the perspective of data volume and structures. Although they all contribute to the development of information discovery in HINs, we find that two other important elements are ignored: user practicality and algorithm understandability. User practicality represents whether the algorithm can be understood and used properly by people without a professional background; algorithm understandability represents whether the algorithm can be read in an easy and clear way, rather than by reading code only. Based on this idea, we aim to let even a person new to the algorithms grasp the computing processes quickly, so that they can contribute to existing work without spending too much effort on the analysis. To fill this gap, we propose using representative ways to make data mining processes transparent. The algorithms can then be easily understood and adjusted, and hidden meaningful patterns can also be discovered. A well-recognized and useful representative approach is visualization.
Generally, visualization techniques [7] [8] represent abstract data to reinforce human cognition. They enable the generation, interpretation, and manipulation of information through spatial representations. In other words, visualization can help researchers understand complicated computing processes better and more quickly, and can also help them explore information and knowledge. Therefore, to optimize algorithm performance, visualization is a desirable choice from the user's perspective. In our work, we dynamically monitor the computing processes and utilize a riverstream metaphor [9] [10] [11] [12], with variable-width trends, to explain the
computing steps of clustering and ranking. To better understand the meaning of the various river streams and clearly discover their transformations, we introduce a density approach to keep the continuity and clarity of each stream. In addition, our visualization can help algorithm designers find suitable parameter values more quickly and easily. For example, they can find when the algorithm no longer has a jittering problem, and which parameter values are suitable at that moment. Besides, we adopt two further visualization techniques, Heatmap Matrix [13] and DOItrees [14], to strengthen the analytical ability. These two techniques cooperate with the main streams to discover hidden information patterns and relationships.
The main contributions of this work are:
• Using visualization to make the mining processes transparent, which helps users to better understand the process and adjust the algorithm parameters.
• Using a density approach to connect neighbouring objects, which directly and visually solves the problem of information discontinuity.
• Using Jaccard distance to measure the similarity of entities, which helps to further understand the relationships between entities in Rankclus.
• Combining the RiverStream, Heatmap Matrix, and DOItree visualization techniques so that they cooperate with each other, and applying them to analyze the river patterns generated during and after the whole computing process. This can effectively present results over time to users in an intuitive and manageable manner.
II. RELATED WORKS
This section reviews related works on optimization approaches of clustering algorithms and the application of visualization techniques to clustering in HINs.
Optimization Approaches of Clustering Algorithms. Clustering, as one of the most important problems in unsupervised learning, forms the basis for further knowledge mining. It is used to group a set of objects such that the objects in a group are more similar to each other than to the objects in other groups. It is widely acknowledged in data mining, and has also been popularly applied in various fields, such as personalised environments, electronic commerce, and search engines. As a result, finding ways to improve its computing efficiency has become an important issue. We first look at how clustering algorithms have been improved in homogeneous information networks, and then consider their development in heterogeneous information networks. Because homogeneous networks always contain the same type of objects and links, the related clustering algorithms developed quickly; they include hierarchical clustering, centroid-based clustering, distribution-based clustering, and density-based clustering. Taking centroid-based clustering as an example, a classic method is K-means [15], and many alternative algorithms have been proposed to cluster information based on this method, including KD-tree accelerated K-means [16], K-means accelerated with SVD [17], and the initialization algorithm K-means++ [18]. However, researchers found that a homogeneous information network usually extracts data from systems while ignoring the heterogeneity of objects and links, and sometimes considers only one type of relation among one type of object. This leads to inaccurate results. In order to improve the clustering quality, data can be modeled as heterogeneous information networks, which contain different types of objects and links, all of which are interconnected. In heterogeneous information networks, traditional clustering algorithms cannot obtain desirable classification results on their own, but they can work together with ranking algorithms to obtain more accurate classification results.
The first clustering algorithm proposed for HINs is Rankclus, which is designed for bi-typed information networks. Using conditional ordering and a mixed probability model, the algorithm can provide a clustering result with higher accuracy. Following this, in order to deal with the data entity variety problem, Netclus [5] was proposed in 2013; then the RankClass [19] algorithm was designed to let ranking and classification mutually enhance each other; and PathSim [6] was then developed to find a better meta-path from many path choices.
To summarize the existing optimization approaches for HIN clustering: one group of researchers concentrates on data volume and structures, while the others focus on optimizing the algorithm itself: (a) accuracy, which represents whether the algorithm can get the correct answers; (b) time complexity, which refers to the computational workload required to perform the algorithm; (c) robustness, which refers to the ability of an algorithm to respond to and process irrational data input; (d) space complexity, which refers to the memory space the algorithm needs to consume; (e) generalization ability, which refers to the ability of machine learning algorithms to adapt to new samples. However, none of them considers the possibility of improving optimization through usability, which comprises user readability and algorithm understandability. In other words, if the algorithm can be read more intuitively and the computing steps are more understandable, users may save time and effort in further analyzing and optimizing algorithms, which can also extend the working opportunities to more people, especially those new to this area. Based on the above possibilities, in this paper we propose the idea of using visualization to explain the Rankclus algorithm. Through our experiments, we confirm that visualization is effective in helping users understand algorithms.
[Figure: diagram with the labels Pre-processor, Algorithm, Algorithm analysis, Adjusting Parameters, Correctness Analysis, and Information Discovery.]
Visualization techniques in HINs. A major contribution of visualization is using simple graphs to help the human brain process complicated information. It explains data in a pictorial format rather than requiring users to pore over numbers. Prior research on visualization has confirmed that visualization effectively enables users to understand the analysis process. Users can also grasp difficult concepts and identify information patterns through visualization results. In the past years, visualization techniques have been widely applied in many research fields. Specifically, in heterogeneous information networks, visualization has been effectively used in building data mining models. Rather than seeing the model as a black box, visualization transfers model