Using Visualization To Improve Clustering Analysis On Heterogeneous Information Network - 2018

This document discusses using visualization techniques to improve understanding and optimization of clustering analysis algorithms for heterogeneous information networks. Specifically, it focuses on applying visualization to the Rankclus clustering algorithm. Key points: 1) Visualization can make data mining processes more transparent and help users better understand algorithm computations and parameter effects without extensive expertise. 2) The document proposes visualizing Rankclus using a riverstream metaphor to explain clustering and ranking steps, and a density approach to connect related objects clearly. 3) Additional visualization techniques like heatmaps and decision trees are used to analyze cluster similarities and algorithm accuracy. The overall goal is to improve user practicality and algorithm understandability through visualization.

Uploaded by

Serhiy Yehress

Using Visualization to Improve Clustering Analysis on Heterogeneous Information Network

Wenbo Wang†, Yuwei Li†, Feng Wang‡, Xiaopei Liu†∗, Youyi Zheng§†∗
† ShanghaiTech University
‡ Jilin University
§ Zhejiang University

{wangwb, liyw, liuxp}@shanghaitech.edu.cn, {wangfeng12}@mails.jlu.edu.cn, {zyy}@cad.zju.edu.cn

∗ Corresponding author.

Abstract—The exploration and analysis of data mining methodologies is an important task for effective knowledge discovery, especially in today's heterogeneous information networks. Previously presented approaches for mining optimization aim primarily at improving time complexity, space complexity, accuracy, and robustness. We extend the state-of-the-art method by concentrating on user-availability and algorithm understandability. Specifically, we use Rankclus, a classic clustering algorithm, as an example. Once the otherwise unseen computing processes are displayed in visual form, the whole clustering process becomes transparent to users, which may help them understand more clearly and quickly how the algorithm is computed and how the objects influence one another. In addition, we use a density approach to intuitively simplify the discovery of data patterns, and through the visualized results, users can adjust algorithm parameters with or without professional training. Finally, we use two further visual techniques to improve the visualization quality: a heatmap matrix designed for checking the similarities of objects in the same cluster, and a DOItree implemented to further analyze the accuracy of the algorithm.

Index Terms—data mining, heterogeneous information networks, visualization, Rankclus

I. INTRODUCTION

The use of heterogeneous information networks (HINs) [1] [2] [3], such as social networks and the Web, has drawn extremely wide attention in recent years. Meanwhile, an increasing number of exciting discoveries and successful applications have been developed. They are able to find rich information hidden in the heterogeneous links between entities in various fields, such as computer science, physics and biology. Among these developments, clustering and ranking are two prominent analytical techniques for better understanding an information network. Unlike algorithms for homogeneous networks, algorithms for HINs face higher requirements, because HINs are ubiquitous and have typed nodes and links, which can lead to more informative discoveries. In 2009, Rankclus [4], an algorithm designed for the data exploration of HINs, was proposed by Sun and Han. They proposed the idea of exploring the rank distribution of each cluster to improve clustering results. This algorithm can provide a higher accuracy in classifying entities in today's complicated information networks. Based on their work, many alternatives have gradually been proposed, such as NetClus [5] and PathSim [6].

In past years, Rankclus-based classification algorithms have been contributed to solve the challenges in HINs, and we find that most of these improvements were derived from the perspective of algorithm design principles and features. For example, they consider the improvement of time complexity, space complexity, accuracy, and robustness. Some researchers also work from the perspective of data volume and structures. Although all of them contribute to the development of information discovery in HINs, we find that two important elements are ignored: user practicality and algorithm understandability. User practicality represents whether the algorithm can be understood and used properly by people without a professional background; algorithm understandability represents whether the algorithm can be read in an easy and clear way, rather than by reading code only. Based on this basic idea, we develop the idea that even a person new to the algorithms can grasp the computing processes quickly. Then, they can contribute to existing work without wasting too much effort on the analysis. To fill this development gap, we propose using representative ways to make data mining processes transparent. The algorithms can then be easily understood and adjusted, and hidden meaningful patterns can also be discovered. A well-recognized and useful representative approach is visualization.

Generally, visualization techniques [7] [8] represent abstract data to reinforce human cognition. They enable the generation, interpretation, and manipulation of information through spatial representations. In other words, visualization can help researchers understand complicated computing processes better and more quickly, and they can also explore information knowledge. Therefore, to optimize algorithm performance, visualization is a desirable choice from the user's perspective. In our work, we dynamically monitor the computing processes and utilize a riverstream metaphor [9] [10] [11] [12], with variable-width trends, to explain the
computing steps of clustering and ranking. To better understand the meaning of the various river streams and clearly discover their transformations, we introduce a density approach that keeps each stream continuous and clear. In addition, our visualization can help algorithm designers find suitable parameter values more quickly and easily. For example, they can find when the algorithm no longer has a jittering problem and which values are suitable at that moment. Besides, we adopt two further visualization techniques, Heatmap Matrix [13] and DOItrees [14], to strengthen the analytical ability. These two techniques cooperate with the main streams to discover hidden information patterns and relationships.

The main contributions of this work are:
• Using visualization to make the mining processes transparent, which helps users better understand the process and adjust the algorithm parameters.
• Using a density approach to connect neighbouring objects, which directly and visually solves the problem of information discontinuity.
• Using Jaccard Distance to measure the similarity of entities, which gives a further understanding of the relationships between entities in Rankclus.
• Combining the RiverStream, Heatmap Matrix and DOItree visualization techniques so that they cooperate with each other, and applying them to analyze the river patterns generated during and after the whole computing process. This can effectively present results over time to users in an intuitive and manageable manner.

II. RELATED WORKS

This section reviews related work on optimization approaches for clustering algorithms and on the application of visualization techniques to clustering in HINs.

Optimization Approaches of Clustering Algorithms. Clustering, as one of the most important problems in unsupervised learning, forms the basis for further knowledge mining. It groups a set of objects so that the objects in a group are more similar to each other than to the objects in other groups. It is widely acknowledged in data mining and has been popularly applied in various fields, such as personalised environments, electronic commerce, and search engines. As a result, finding ways to improve its computing efficiency has become an important issue. We first look at how clustering algorithms have been improved for homogeneous information networks, and then consider their development for heterogeneous information networks. Because homogeneous networks always contain the same type of objects and links, the related clustering algorithms developed quickly; they include hierarchical clustering, centroid-based clustering, distribution-based clustering, and density-based clustering. Taking centroid-based clustering as an example, a classic method is K-means [15], and many alternatives based on this method have been proposed, such as KD-tree accelerated K-means [16], K-means sped up with SVD decomposition [17], and the initialization algorithm K-means++ [18]. However, researchers found that a homogeneous information network usually extracts data from systems by ignoring the heterogeneity of objects and links, and sometimes considers only one type of relation among one type of object. This leads to inaccurate results. To improve clustering quality, data can be modeled as heterogeneous information networks, which contain different types of objects and links, all of them interconnected. In heterogeneous information networks, traditional clustering algorithms alone cannot produce desirable classification results, but they can work together with ranking algorithms to obtain more accurate ones.

The first clustering algorithm proposed for HINs is Rankclus, which is designed for bi-typed information networks. Using conditional ordering and a mixed probability model, the algorithm provides clustering results with higher accuracy. Following this, to deal with the variety of data entities, Netclus [5] was proposed in 2013; then the RankClass [19] algorithm was designed to let ranking and classification mutually enhance each other; and PathSim [6] was then proposed to find a better meta-path among many path choices.

To summarize the existing optimization approaches for HIN clustering: one group concentrates on data volume and structures, while the others focus on optimizing the algorithm itself: (a) accuracy, which represents whether the algorithm gets the correct answers; (b) time complexity, which refers to the computational workload required to perform the algorithm; (c) robustness, which refers to the ability of an algorithm to respond to and process irrational input data; (d) spatial complexity, which refers to the memory space the algorithm needs to consume; (e) generalization ability, which refers to the ability of machine learning algorithms to adapt to new samples. However, none of them considered the possibility of improving optimization through usability, which covers user readability and algorithm understandability. In other words, if the algorithm can be read more intuitively and the computing steps are more understandable, users may save time and effort when further analyzing and optimizing algorithms, which also extends the working opportunities to more people, especially those new to this area. Based on these possibilities, in this paper we propose the idea of using visualization to explain the Rankclus algorithm. Through our experiments, we confirm that visualization is effective in helping users understand algorithms.

Visualization techniques in HINs. A major contribution of visualization is the use of simple graphs to help the human brain process complicated information. It explains data in a pictorial format rather than requiring the reader to pore over numbers. Prior research on visualization has confirmed that it effectively enables users to understand an analysis process. Users can also grasp difficult concepts and identify information patterns through visualization results. In past years, visualization techniques have been widely applied in many research fields. Specifically, in
heterogeneous information networks, visualization has been effectively used in building data mining models. Rather than treating the model as a black box, visualization turns model outputs into meaningful graphical results and allows the user to interact with them. We use two widely recognized visualization techniques to explain this idea: one is drill-through [20]; the other is linking and brushing [21]. Drill-through can reveal additional details in a sub-model, while linking and brushing highlights brushed data items in different representations. These two techniques greatly help users understand how the model relates to the original data, how the external contexts of the model are discovered, and how validation is enhanced. On the other hand, in HINs, visualization is applied for model comparison. For example, when algorithms need to be compared on standard measures such as computing time complexity, stability or computation size, visualization techniques such as bar charts or pie charts, combined with other visual metaphors such as colors and shapes [22] [23] [24], accomplish these tasks efficiently.

To differentiate the motivation of our work in using visualization in HINs, we propose a novel approach that extends the applicability of visualization: using visualization to explain algorithms and make the processing details transparent to users. In addition, through all the results generated during processing, users can also discover information patterns and do further analysis on either algorithms or datasets. Specifically, compared with existing approaches, our approach has three novelties. Firstly, we propose a visualization-based model to understand and analyze the Rankclus algorithm. Secondly, we are concerned that uncertain data may have a negative influence on the results: it may cause discontinuity problems and thereby affect pattern discovery in the analysis process. We therefore add a density approach to address this issue; this method maintains the visual effects without losing information patterns. Thirdly, Rankclus itself could not provide ranking results in each cluster. So, according to the fundamental requirements of users, we combine the Jaccard Similarity approach and the Heatmap Matrix to further understand the inner relationships.

III. NEW MODEL IN ALGORITHM OPTIMIZATION

Algorithm optimization is a technique in the field of computer science. It refers to improving the relevant performance of an algorithm, such as time complexity, space complexity, correctness, and robustness. With the arrival of the era of big data, more problems challenge analysts' working efficiency, so improving optimization algorithms also becomes an essential task.

Most researchers chose to deal with the shortcomings of the algorithm itself, either by running numerous complicated programming experiments or by applying various mathematical theories to the original theory. However, both approaches are only suitable for researchers who are expert in information technology and programming. Sometimes a strong mathematics background is also necessary. These limitations might decrease the efficiency of developing new algorithms and discovering algorithm issues, because there is a higher possibility that people who are not very experienced in this area have novel ideas about the problems and solve them from a different angle. Therefore, in this paper, we propose using information visualization to help people who are interested in algorithms but lack a professional background to understand an algorithm more quickly and clearly. Here is the new model.

Fig. 1. New Model for Algorithm Analysis with Information Visualization (diagram blocks: Algorithm Pre-processor; Algorithm Analysis, with Adjusting Parameters, Correctness Analysis and Information Discovery; Visualization Guidance, with Processing Flow, DOItree and Heatmap; Information Discovery)

Fig. 1 provides an overview of our model with its four main parts: algorithm pre-processor, algorithm analysis, visualization guidance, and information discovery. The algorithm pre-processor is tasked with extracting the basic idea of an algorithm; algorithm analysis and visualization guidance then cooperate to complete the main visual analysis. Specifically, in this part, we use three visualization techniques to explain the algorithm. The first is the riverStream visualization (processing flow), which reveals how the algorithm runs over the objects, how the parameters affect the algorithm, and how the objects affect each other. The second is a DOITree visualization, which can verify the correctness of the algorithm. This is especially suitable for pretesting on a small part of the dataset whose classification results are already known; through the DOITree visualization results, users can easily see whether the algorithm is running correctly. The third is a Heatmap visualization, which is used for identifying inner relationships among the objects in each cluster. These three visualization techniques work together to better represent the computing processes of an algorithm. Taking Rankclus as an example: if we set the iteration number to 5, the riverstream will be drawn 5 times, and at any iteration step we can check the result on the DOITree. This helps users understand how the results change at different iteration steps, and which step affects the results more. The last part of the model is information discovery. Users can interact with the results generated during the computing processes to discover further information, such as data patterns or changing streams.

IV. RANKCLUS VISUALIZATION

This section introduces how riverstream visualization helps to make the computing process of Rankclus transparent to users, and how algorithm performance is optimized
through the dynamic observation of visualization results. The remainder of Section IV is structured as follows. Section A introduces the Rankclus algorithm. Section B discusses how visualization guides users to understand and improve the clustering method. Section C explains the further analysis of this visualizer.

A. Rankclus Algorithm

Network datasets have already moved from homogeneous to heterogeneous structures, since data is generated in greater volume and variety and at higher speed. A representative heterogeneous network can be seen as a bi-typed directed graph [25] [26]. In this kind of graph, there are links among entities, which have either the same type or different types. To discover useful information from this kind of graph and better understand network properties, various analysis technologies have been developed. Among these, ranking and clustering are two of the most important approaches. In this paper, we work on a fundamental algorithm, Rankclus, which is also a classic clustering method for computer science bibliographic networks.

Here is the basic principle of Rankclus [27] [28]. Suppose there are two types of entities, $X$ and $Y$. The task is to rank $X_i$. The network is randomly divided into $K$ clusters, and each cluster is ranked; the ranking result then serves as a metric in the following computing steps. After this, we use a mixture model to transform each node into a $K$-dimensional vector and assign it to $X_k$, the cluster closest to this node. The above steps are repeated until the clustering results are stable. Specifically, as the algorithm iterates, the clustering results improve gradually and similar objects (nodes) get closer to each other. In addition, the more accurate the clustering results are, the more correct the ranking results will be. The two functions influence each other. The main steps to implement Rankclus are as follows.

Step 1: Ranking each cluster

In this step, there are two rules to follow to get better ranking results. The first rule is that if an author has a higher ranking, there is a higher possibility that he/she publishes papers in conferences with higher rankings; the second rule is that if a conference has a higher ranking, it might attract authors who rank higher. Specifically, these two rules illustrate that the value of each author is related to both the number of his/her publications and the weight of the conferences; the score of a conference depends both on the number of publications and on the paper qualities; and the paper qualities are related to the author ranking: the higher the author ranks, the higher the quality of the paper will be.

Mathematically, the principle can be expressed by the following formulas. The score of author $Y_j$, $\vec{r}_Y(j)$, can be computed by formula (1):

$$\vec{r}_Y(j) = \alpha \sum_{i=1}^{m} W_{YX}(j,i)\,\vec{r}_X(i) + (1-\alpha) \sum_{j'=1}^{n} W_{YY}(j,j')\,\vec{r}_Y(j') \tag{1}$$

where $W_{YX}(j,i)$ represents the number of publications of the $j$th author at the $i$th conference; $\vec{r}_X(i)$ represents the score of the $i$th conference; $W_{YY}(j,j')$ represents the number of publications of the $j$th author with the $j'$th coauthor; $\vec{r}_Y(j')$ represents the score of the $j'$th author; and $m$ and $n$ are the numbers of conferences and authors.

Then normalize $\vec{r}_Y(j)$ by formula (2); the result is denoted as $P_k(Y_j)$:

$$\vec{r}_Y(j) \leftarrow \frac{\vec{r}_Y(j)}{\sum_{j'=1}^{n} \vec{r}_Y(j')} \tag{2}$$

In the same way as $\vec{r}_Y(j)$, $\vec{r}_X(i)$ can be calculated by formula (3) and then normalized by formula (4); the result is denoted as $P_k(X_i)$:

$$\vec{r}_X(i) = \sum_{j=1}^{n} W_{XY}(i,j)\,\vec{r}_Y(j) \tag{3}$$

$$\vec{r}_X(i) \leftarrow \frac{\vec{r}_X(i)}{\sum_{i'=1}^{m} \vec{r}_X(i')} \tag{4}$$

Step 2: Clustering by mixture probability model

Through Step 1, we rank each cluster and get two conditional distributions for cluster $k$: one is $P_k(X_i)$, which is measured on the conferences; the other is $P_k(Y_j)$, which is measured on the authors. Then we use maximum likelihood estimation to estimate $p(z = k)$; refer to formulas (5)-(7).

$$p(z = k \mid y_j, x_i, \theta^{0}) \propto p(x_i, y_j \mid z = k)\, p(z = k \mid \theta^{0}) \tag{5}$$

$$p(z = k \mid y_j, x_i, \theta^{0}) \propto P_k(x_i)\, P_k(y_j)\, p^{0}(z = k) \tag{6}$$

$$p(z = k) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} W_{XY}(i,j)\, p(z = k \mid x_i, y_j, \theta^{0})}{\sum_{i=1}^{m} \sum_{j=1}^{n} W_{XY}(i,j)} \tag{7}$$

Then the algorithm calculates, for each object in $X$, the probability that it belongs to each cluster; see formula (8).

$$p(z = k \mid x_i) = \frac{p_k(x_i)\, p(z = k)}{\sum_{q=1}^{K} p_q(x_i)\, p(z = q)} \tag{8}$$

Step 3: Adjusting the clusters

In this step, the distance between each object $x$ in $X$ and the center of cluster $k$ is calculated, denoted as $d(x, X_k)$. Cosine similarity is used for measuring the distance. See formulas (9)-(10).

$$\vec{T}_X(k) = \frac{\sum_{x \in X_k} \vec{T}(x)}{|X_k|} \tag{9}$$

where $\vec{T}(x_i) = (p(z = 1 \mid x_i), p(z = 2 \mid x_i), \ldots, p(z = K \mid x_i))$. Then the distance $d$ can be calculated by formula (10).

$$d(x, X_k) = 1 - \mathrm{cosine}\big(\vec{T}(x), \vec{T}_X(k)\big) \tag{10}$$

Repeat the three steps until the algorithm converges.
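The ranking and assignment formulas above can be illustrated with a small sketch. It is not the authors' implementation: it assumes that W_YX is the transpose of W_XY, that each sum runs over the index of the other object type (as in the usual formulation of Rankclus), and that scores start uniform. The function names, the default `alpha`, and the iteration count are all hypothetical choices made for illustration.

```python
import math


def normalize(values):
    """Scale non-negative scores so that they sum to 1 (formulas (2) and (4))."""
    total = sum(values)
    if total == 0:
        return [0.0] * len(values)
    return [v / total for v in values]


def authority_ranking(w_xy, w_yy, alpha=0.95, iterations=50):
    """Rank conferences (X) and authors (Y) inside one cluster.

    w_xy[i][j]: number of papers of author j at conference i
                (W_YX is assumed to be the transpose of this matrix).
    w_yy[j][q]: number of papers co-authored by authors j and q.
    alpha and iterations are illustrative defaults, not values from the paper.
    """
    m, n = len(w_xy), len(w_xy[0])
    r_x = [1.0 / m] * m
    r_y = [1.0 / n] * n
    for _ in range(iterations):
        # Formula (1): author score from conference scores and co-author scores.
        new_y = [
            alpha * sum(w_xy[i][j] * r_x[i] for i in range(m))
            + (1 - alpha) * sum(w_yy[j][q] * r_y[q] for q in range(n))
            for j in range(n)
        ]
        r_y = normalize(new_y)  # formula (2)
        # Formula (3): conference score from the scores of its authors.
        new_x = [sum(w_xy[i][j] * r_y[j] for j in range(n)) for i in range(m)]
        r_x = normalize(new_x)  # formula (4)
    return r_x, r_y


def cluster_posterior(rank_per_cluster, prior, i):
    """Formula (8): p(z = k | x_i) for every cluster k.

    rank_per_cluster[k][i] is P_k(x_i); prior[k] is p(z = k).
    """
    weights = [rank_per_cluster[k][i] * prior[k] for k in range(len(prior))]
    return normalize(weights)


def cosine_distance(u, v):
    """Formula (10): one minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm
```

For instance, `authority_ranking([[3, 0, 1], [0, 2, 2]], [[0, 1, 0], [1, 0, 1], [0, 1, 0]])` ranks two conferences and three authors of a toy cluster. In a full iteration, the vector returned by `cluster_posterior` for each object would be compared with each cluster center via `cosine_distance`, and the object reassigned to the nearest center.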
B. Visual Discovery of Rankclus

The visualizer is able to transform complicated computing steps into comprehensible visualization results. In the visualization of Rankclus [29] [30] [31], a riverstream is presented to explain how the clustering results are generated and how the parameters act on each cluster. Meanwhile, users can interact with the evolving trends to adjust the parameters, and dynamically changing data patterns can also be discovered while each iteration is visualized. These stream-changing patterns can help users understand more about the dataset. In addition, we apply a DOItree to clearly verify whether Rankclus is able to provide accurate results; this technique always works together with the riverstream visualization. Finally, a Heatmap is designed to visually explain the inner correlations among the objects in each cluster. The cooperation of these three visualization techniques makes algorithm optimization and information discovery much easier and quicker. We describe them in turn.

a) Riverstream Visualization Panel: The RiverStream panel [10] visualizes the generating processes of the algorithm with a popular streamgraph metaphor; refer to Fig. 2. It is an aesthetically pleasing and readily comprehensible visualization scheme, which is well established for visually integrating multiple time series. In addition, this metaphor makes it possible to link the changes across different iterations with the cluster variations, and to visibly discover their differences without breaking the visual effects. As a result, a comprehensible visualization is generated. Users can also interact with the interface naturally and smoothly. To explain more details of the algorithm, we employ a visual property, bubble shape, to represent the density classification of all objects. The more objects a classification contains, the larger its bubble is. Fig. 3 gives a clear explanation of the relationship between bubble size and the number of objects. In this figure, we use bubbles to represent the conferences. We can see that four conferences are chosen to be displayed: IEEE Visualization, JAMIA, VAST and CVPR. They are all represented by streams with different shapes. It is obvious that IEEE Visualization, JAMIA and VAST have similar upward trends, while CVPR is going down in the same time period. This phenomenon can also be seen from the changing color of each stream: the colors of IEEE Visualization, JAMIA and VAST all change to light orange, while CVPR changes to dark orange. In the same figure, we can also see that the bubble sizes of the four conferences differ because their densities are different; they are 46.24

Fig. 2. An Example of RiverStream Visualization Panel

Fig. 3. An Example of Applying Bubble Visual Properties

In addition, the objects might have no value in a specific time period in one cluster, and the number of objects might be too large in another cluster. Both can make the results unclear. Therefore, we face two challenges: visualization discontinuity and visual clutter. In our work, we propose using a density approach to deal with these two issues without losing data accuracy. Specifically, density represents the percentage of each object among all objects. We calculate the density of all objects and categorize them by their values. If two objects have the same density, they are gathered into one category. From the visualization point of view, our results only display bubbles that stand for classifications of objects rather than displaying all objects. With this approach, the number of river streams is decreased and the visual clutter problem is solved. The effect is especially obvious when the data volume is big. Besides, we set a special entity for each entity bubble cluster: its value is null and its color is highlighted in blue. This design keeps each stream continuous and maintains a good visual effect. It also reconfirms the accuracy of the object classification results; refer to Fig. 4.

Thirdly, we use an interactive mechanism on the riverstream to learn how objects (each stream represents an object) are transformed over time and what the differences between them are. To better illustrate the degree of variation of each stream, we give each cluster its own colour. If an object is moving to a different cluster, the stream colour gradually changes to the colour that represents the destination cluster. Drawing gradient colours in this way improves the visualization result, because users can find the changing states of their desired objects more intuitively and transparently. Fig. 5 gives an illustration of the interactive mechanism on the riverstream. If the user clicks a stream, the color of the stream changes to green. In Fig. 5, it is clear to observe that
this trend only has values in 2004. It is also interesting to find that although the values are all in a null cluster, they are distributed to different clusters. For example, in 2002, the null point belongs to cluster 0; in 2003, it moves to cluster 2; while in 2005, it goes back to cluster 0. When this phenomenon appears, it means that this object could not rank in the top k, where k is the number of clusters. In Fig. 5, we can also find that the authors Beng Chin Ooi and Surajit Chaudhuri are both transferred from cluster 4 to cluster 2, which means that cluster 4 and cluster 2 are more similar to each other, because it is less likely that researchers in the same time period change their research interests to a same research area.

Fig. 4. An example of using null value to solve the visual discontinuity problem (figure label: Empty Category)

Fig. 5. An example of using color streams to differentiate the changing patterns of each cluster

b) DOITrees Visualization Panel: In reality, hierarchical structures provide ways to present complex structures in a simplified form. In the past decades, various 2D hierarchical structure visualization techniques have been proposed, such as Treemaps [32], Space-Optimized Tree [33], EncConTree [34], and SpaceTree [35]. Among these, Degree-of-Interest Trees (DOITrees) is one of the most popular techniques for large tree visualizations. It can provide simple and clear preview icons that summarize the complex structures. Therefore, in this paper, we choose DOITrees to display the different clustering results while the algorithm is calculating. This means that each iteration has a DOITree result, and when the algorithm has finished all calculations, the final result is also displayed in DOITree form. This visualization technique mainly cooperates with the riverstream to verify the correctness of an algorithm; refer to Fig. 6.

Fig. 6. An example of DOITree Visualization - time based results

c) Heatmap Visualization Panel: It is important to understand the clustering processes and to verify the correctness of the results through the cooperation of the riverstream and DOITree visualizations. It is also essential to discover the relationships among objects that are in the same cluster. In this paper, we apply Jaccard Distance to measure the similarity distance. The main reason is that this similarity measurement only concentrates on whether the objects are similar on the same features, and this matches our motivation to find differences among the objects that are in the same cluster.

To better understand the similarity measurement result, we apply the heatmap visualization technique; refer to Fig. 7. A heatmap is a graphical representation of data in which the individual values are contained in a matrix and the differences between values are represented by various colours. In our paper, in order to explain all relationships among objects, we use a matrix to express the similarities. Each square represents the similarity of two objects. As for the colour scheme, considering human perceptual advantages and disadvantages, we use a principle of colour contrast: the larger the similarity value, the brighter the colour. The colours are chosen from light blue to dark red; see the legend in Fig. 7.

d) The Cooperation of Three Visualization Techniques: Riverstream, DOITree and heatmap work together to finish the task of unveiling the processes of algorithms and discovering information patterns. In our Rankclus visualization, Riverstream is the main visualizer that represents the computing process of Rankclus. The user can choose the attribute to be visualized as the main object, either author or conference. Suppose we choose conference as the target. In the beginning, the user can set the iteration number and the cluster number; the visualizer then gives the clustering results for each iteration in stream format. During the whole computing process, the user can detect how frequently each cluster changes. Suppose that from iteration 5 onward the riverstream no longer changes. Then, users can adjust the iteration value to decrease the possibility of wasting computing resources. The changes in the riverstream might be small and not easy to find by eye. So we can use DOITree panel to display cluster
results of each iteration. The results will be displayed in text, and we can check whether the cluster result is affected if we change the iteration number. After we adjust the parameters for the algorithm, we are able to use heatmap and DOITree together to check the relationships among all entities; refer to Fig. 6 and Fig. 7.
Fig. 7. An example of Heatmap Visualization Panel
C. Further Analysis of Rankclus on Aminer
We apply this visualizer to the network dataset Aminer (https://www.aminer.cn/data) to understand and optimize the Rankclus algorithm. This dataset contains paper information, author information, paper citations, author collaborations, conference information, etc. In the test, we chose a data subset which contains paper information, author information and research fields; the test dataset details are illustrated in TABLE I.

TABLE I
ILLUSTRATION OF THE TEST DATASET
CATEGORY              CONFERENCE
DataMining            KDD, PKDD, ICDM
DataBase              SIGMOD Conference, VLDB, ICDE
Medical Informatics   Journal of Biomedical Informatics, IEEE Transactions on Information Technology in Biomedicine, JAMIA
Theory                FOCS, SODA
Visualization         Information Visualization, IEEE Visualization, VAST, CVPR
ATTRIBUTES (all categories): Conference Name, Paper Detail, Author Name, Publishing Time (2002-2005)

First, we shuffle the dataset to make sure that the conferences are not already grouped in the same cluster. Then we run the Rankclus algorithm to get clustering results. A stream can represent either an author or a conference, depending on the user requirements. Through the changes of the riverstreams, users can discover how conferences move among clusters, and how authors move among research fields. In addition, users can check the results in the DOITree and then compare the results with the correct cluster result in TABLE I. If the result is not correct, the users can adjust the value of alpha and the iteration number until the result is accurate. In this step, the users can visually understand how the algorithm parameters (alpha and iteration number) affect the algorithm, and which parameter values make the algorithm stable, that is, with less computing cost and without losing information accuracy. TABLE II shows the test results for the Rankclus algorithm on the Aminer dataset. Previous research on the Rankclus algorithm found that alpha should range from 0.95 to 1.00, and that a suitable iteration number is around 20. In our test, better visualization results are obtained when alpha equals 0.98 and the iteration value is 20, which agrees with the conclusion of the Rankclus researchers.

TABLE II
ANALYSIS OF ALGORITHM PARAMETERS
Alpha   Iteration times   Clustering Accuracy
0.99    15                Theory
        18, 19            Database
        20, 21, 22        -
0.98    15, 22            -
        18                Medical Informatics
        19                Visualization
        20                Database, Visualization
        21                Medical Informatics, Visualization
0.96    15, 22            -
        18, 19, 21        Theory
        20                Database, Medical Informatics
0.95    15, 19, 20        Theory
        18                -
        21                Database
        22                Medical Informatics

V. CONCLUSIONS
In this paper, we discussed the developments and challenges of algorithm optimization in HINs. In association with these challenges, we discussed the methodologies and techniques proposed in recent years. Compared with previous solutions, we have presented a visual analysis technique to help users understand and optimize algorithms. Specifically, our approach is unique in two aspects. First, it allows users who are not algorithm experts to understand better and more quickly how algorithms work on datasets; second, it enables users to interact with algorithms. This method supports an interactive analysis cycle with the cooperation of three visualization techniques. Through the case study on Aminer, we have demonstrated the usability of our idea in visually analysing algorithm computing processes.
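As a concrete illustration of the similarity computation behind the heatmap panel discussed earlier, the following minimal Python sketch computes pairwise Jaccard distances between objects and arranges the corresponding similarities in a matrix, one square per pair of objects. The feature sets (venues associated with each author), the author names and the function names are hypothetical and chosen only for illustration; the paper does not specify which features are compared or how the matrix is built.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two feature sets: 1 - |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention assumed here: two empty sets count as identical
    return 1.0 - len(a & b) / len(union)


def similarity_matrix(features):
    """Return (names, matrix) with matrix[i][j] = 1 - Jaccard distance."""
    names = sorted(features)
    n = len(names)
    # Every object is fully similar to itself, so the diagonal stays at 1.0.
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        sim = 1.0 - jaccard_distance(features[names[i]], features[names[j]])
        matrix[i][j] = matrix[j][i] = sim  # the matrix is symmetric
    return names, matrix


# Hypothetical toy data (invented for illustration, not taken from Aminer).
authors = {
    "author_a": {"SIGMOD", "VLDB", "ICDE"},
    "author_b": {"SIGMOD", "VLDB", "KDD"},
    "author_c": {"SODA", "FOCS"},
}
names, matrix = similarity_matrix(authors)
```

In this toy example author_a and author_b share two of four distinct venues, so their similarity is 0.5, while author_c shares no venue with either and gets 0. Under the colour principle described for the heatmap, the first pair would be drawn as a brighter square than the other two.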
Our design also has some limitations on the analysis of Rankclus. First of all, it does not clearly display how the attributes affect each other. Take Aminer as an example: the user can see the processes of author ranking and conference ranking, but cannot clearly discover how authors and conferences affect each other at the same time. As a consequence, users have to explore visual results thoroughly to find patterns. Second, in our current system it is still not easy to discover stream patterns when the dataset is extremely large.
Based on our idea of using visualization to guide the development of algorithms, in the future we will, on the one hand, improve the visualization of the existing Rankclus algorithm; on the other hand, we are interested in discovering suitable visualization techniques to explain neural network algorithms [36], and in providing a transparent way to see how neural networks learn.

ACKNOWLEDGMENT
We thank all reviewers for their valuable comments. This work was supported in part by the National Natural Science Foundation of China (No. 61502306) and the China Young 1000 Talents Program.

REFERENCES
[1] Jiawei Han, Yizhou Sun, Xifeng Yan, and Philip S Yu. Mining heterogeneous information networks. In Tutorial at the 2010 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'10), Washington, DC, 2010.
[2] Yizhou Sun and Jiawei Han. Mining heterogeneous information networks: principles and methodologies. Synthesis Lectures on Data Mining and Knowledge Discovery, 3(2):1–159, 2012.
[3] Yizhou Sun and Jiawei Han. Mining heterogeneous information networks: a structural analysis approach. ACM SIGKDD Explorations Newsletter, 14(2):20–28, 2013.
[4] Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, and Tianyi Wu. Rankclus: integrating clustering with ranking for heterogeneous information network analysis. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, pages 565–576. ACM, 2009.
[5] Elena Baralis, Andrea Bianco, Tania Cerquitelli, Luca Chiaraviglio, and Marco Mellia. Netcluster: A clustering-based framework to analyze internet passive measurements data. Computer Networks, 57(17):3300–3315, 2013.
[6] Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S Yu, and Tianyi Wu. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment, 4(11):992–1003, 2011.
[7] Usama M Fayyad, Andreas Wierse, and Georges G Grinstein. Information visualization in data mining and knowledge discovery. Morgan Kaufmann, 2002.
[8] Mike Cammarano, Xin Dong, Bryan Chan, Jeff Klingner, Justin Talbot, Alon Halevey, and Pat Hanrahan. Visualization of heterogeneous data. IEEE Transactions on Visualization and Computer Graphics, 13(6):1200–1207, 2007.
[9] Susan Havre, Elizabeth Hetzler, Paul Whitney, and Lucy Nowell. Themeriver: Visualizing thematic changes in large document collections. IEEE Transactions on Visualization and Computer Graphics, 8(1):9–20, 2002.
[10] Florian Heimerl, Qi Han, Steffen Koch, and Thomas Ertl. Citerivers: Visual analytics of citation patterns. IEEE Transactions on Visualization and Computer Graphics, 22(1):190–199, 2016.
[11] Dongning Luo, Jing Yang, Milos Krstajic, William Ribarsky, and Daniel Keim. Eventriver: Visually exploring text collections with temporal references. IEEE Transactions on Visualization and Computer Graphics, 18(1):93–105, 2012.
[12] Yingcai Wu, Shixia Liu, Kai Yan, Mengchen Liu, and Fangzhao Wu. Opinionflow: Visual analysis of opinion diffusion on social media. IEEE Transactions on Visualization and Computer Graphics, 20(12):1763–1772, 2014.
[13] Leland Wilkinson and Michael Friendly. The history of the cluster heat map. The American Statistician, 63(2):179–184, 2009.
[14] Quang Vinh Nguyen, Simeon Simoff, and Mao Lin Huang. Using visual cues on doitree for visualizing large hierarchical data. In Information Visualisation (IV), 2014 18th International Conference on, pages 1–6. IEEE, 2014.
[15] John A Hartigan and Manchek A Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.
[16] Dan Pelleg and Andrew Moore. Accelerating exact k-means algorithms with geometric reasoning. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 277–281. ACM, 1999.
[17] Chris Ding and Xiaofeng He. K-means clustering via principal component analysis. In Proceedings of the twenty-first international conference on Machine learning, page 29. ACM, 2004.
[18] Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scalable k-means++. Proceedings of the VLDB Endowment, 5(7):622–633, 2012.
[19] Ming Ji, Jiawei Han, and Marina Danilevsky. Ranking-based classification of heterogeneous information networks. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1298–1306. ACM, 2011.
[20] James Ahrens, Kristi Brislawn, Ken Martin, Berk Geveci, C Charles Law, and Michael Papka. Large-scale data visualization using parallel data streaming. IEEE Computer Graphics and Applications, 21(4):34–41, 2001.
[21] Daniel A Keim. Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics, 8(1):1–8, 2002.
[22] Nathan Kogan, Kathleen Connor, Augusta Gross, and Donald Fava. Understanding visual metaphor: Developmental and individual differences. Monographs of the Society for Research in Child Development, pages 1–78, 1980.
[23] Peter R Keller, Mary M Keller, Scott Markel, A John Mallinckrodt, and Susan McKay. Visual cues: practical data visualization. Computers in Physics, 8(3):297–298, 1994.
[24] Hermine Feinstein. Meaning and visual metaphor. Studies in Art Education, 23(2):45–55, 1982.
[25] Yizhou Sun and Jiawei Han. Integrating clustering with ranking in heterogeneous information networks analysis. In Link Mining: Models, Algorithms, and Applications, pages 439–473. Springer, 2010.
[26] Dipak R Pardhi and Akhilesh A Waoo. An efficient ranking based clustering algorithm. International Journal of Engineering and Advanced Technology (IJEAT), 1(1), 2011.
[27] Xing Le. Rankclus on directed graph and its application. China's Outstanding Master's Degree thesis, 7, 2013.
[28] Huajie Shao, Jinda Han, and Sida Li. Highsim: Highly effective similarity measurement in large heterogeneous information networks.
[29] Yintao Yu. Ivis: Search and visualization on heterogeneous information networks. 2011.
[30] TAO Jianwen. Rchig: an effective clustering algorithm with ranking. Journal of Software, 4(4), 2009.
[31] Zhiguo Zhu, Jingqin Su, and Liping Kong. Measuring influence in online social network based on the user-content bipartite graph. Computers in Human Behavior, 52:184–189, 2015.
[32] Ben Shneiderman and Martin Wattenberg. Ordered treemap layouts. In Information Visualization, 2001. INFOVIS 2001. IEEE Symposium on, pages 73–78. IEEE, 2001.
[33] Quang Vinh Nguyen and Mao Lin Huang. A space-optimized tree visualization. In Information Visualization, 2002. INFOVIS 2002. IEEE Symposium on, pages 85–92. IEEE, 2002.
[34] Mao Lin Huang, Quang Vinh Nguyen, Wei Lai, and Xiaodi Huang. Three-dimensional enccon tree. In Computer Graphics, Imaging and Visualisation, 2007. CGIV'07, pages 429–433. IEEE, 2007.
[35] Catherine Plaisant, Jesse Grosjean, and Benjamin B Bederson. Spacetree: Supporting exploration in large node link tree, design evolution and empirical evaluation. In The Craft of Information Visualization, pages 287–294. Elsevier, 2003.
[36] Martin T Hagan, Howard B Demuth, Mark H Beale, et al. Neural network design, volume 20. Pws Pub. Boston, 1996.
