Online Social Network Analysis
Also of interest
Online Social Network Analysis, Volume
B. Fang, Y. Jia,
ISBN ----, e-ISBN (PDF) ----,
e-ISBN (EPUB) ----
Trusted Computing
D. Feng,
ISBN ----, e-ISBN (PDF) ----,
e-ISBN (EPUB) ----
Online Social Network Analysis
Edited by
Binxing Fang, Yan Jia
Editors
Prof. Binxing Fang
Chinese Academy of Engineering
Building A, Tri-Tower
No. 66-1 Zhongguancun East Road
100190 Beijing, Haidian District
China
ISBN 978-3-11-059784-4
e-ISBN (PDF) 978-3-11-059943-5
e-ISBN (EPUB) 978-3-11-059793-6
www.degruyter.com
Preface: Information and Diffusion
Volume 3 of the book focuses on the third core factor, namely "information and diffusion," and consists of four chapters. Chapter 1 covers information retrieval in social networks, Chapter 2 covers the rules of information diffusion in social networks, Chapter 3 covers topic discovery and evolution, and Chapter 4 covers influence maximization algorithms.
We sincerely thank the following experts and scholars, who participated in the data collection, content arrangement, and contribution of results for this volume: Zhaoyun Ding, Xiaomeng Wang, Bin Wang, Yezheng Liu, Xiaodong Liu, Shenghong Li, Aiping Li, Lei Li, Shiyu Du, Peng Wu, Xiuzhen Chen, Wei Chen, Yang Yang, Lumin Zhang, Peng Shi, and Yuanchun Jiang.
Thanks to Associate Professor Shudong Li for carefully coordinating and arranging the writing of this volume, and to Weihong Han and Shuqiang Yang for reviewing and proofreading.
https://doi.org/10.1515/9783110599435-201
Contents
List of Contributors IX
Li Guo
1 Information retrieval in social networks 1
Hu Changjun
2 The rules of information diffusion in social networks 61
Xindong Wu
3 Topic discovery and evolution 107
Xiangke Liao
4 Algorithms of influence maximization 143
Index 165
List of Contributors
Prof. Xueqi Cheng
Institute of Computing Technology
Chinese Academy of Sciences
No. 6 Zhongguancun Kexueyuan South Road
100190 Beijing, China

Prof. Binxing Fang
Chinese Academy of Engineering
Building A, Tri-Tower
No. 66-1 Zhongguancun East Road
100190 Beijing, China

Prof. Jiayin Qi
Shanghai University of International Business and Economics
Room 338, Bocui Building
No. 1900 Wenxiang Road
201620 Shanghai, China

Prof. Xindong Wu
Hefei University of Technology
No. 193, Tunxi Road
230009 Hefei, China
https://doi.org/10.1515/9783110599435-202
Li Guo
1 Information retrieval in social networks
Information retrieval (IR) is a process of retrieving information, which satisfies users’
need, from massive unstructured datasets (such as natural language texts). IR is an
important tool that helps users to rapidly and effectively derive useful information
from massive data. With the drastic increase in the size of data and the increasingly
growing user needs in search services, IR has evolved from a tool that was only
designed for libraries into a network service indispensable in life, work, and study. In
addition to the search systems represented by the popular Google search engine,
some other common forms of IR systems include classification systems, recommen-
dation systems, and Q&A systems.
With the rapid popularization and continual development of social networking
services (SNS), IR not only has new resources and opportunities but is also con-
fronted by new problems and challenges. Acquiring information from emerging
resources such as social networks has gradually drawn attention from both industry
and academia. Compared with traditional webpages, social network texts have
different characteristics, such as the limit of text length, special expression form
(such as Hashtag1 in microblogs), and existence of social relations between authors.
These differences make it inappropriate to directly apply traditional IR technologies
to an SNS environment. Social network-oriented IR technology still faces many
problems and difficulties, and it is of great academic significance and application
value to conduct research in this field.
This chapter mainly introduces IR for social networks, and aims to present the
challenges faced by IR technology in its applications to new resources of social
networks, and also introduces some possible solutions to these problems.
Concretely, the three most representative IR applications – search, classification, and recommendation – are discussed in this chapter. This chapter is arranged as follows: Section 1.1 is the Introduction, which introduces the relevant concepts commonly used throughout the chapter, along with the challenges facing social network-oriented IR technology; Sections 1.2, 1.3, and 1.4, respectively, introduce the basic methods for content search, content classification, and recommendation in social networks, and the status quo of research in these areas; Section 1.5 provides the summary of this chapter and future prospects. In Sections 1.2 and 1.3, relevant research is introduced based on microblog data – one of the most representative SNSs, while Section 1.4 focuses on social networks developed from traditional
1 Hashtag here refers to the tag bracketed with “#” in microblog text, also called theme tag, which can
be regarded as a mark on a microblog made by the author. After development and evolution, it has
been used by some social networking sites to represent topics.
https://doi.org/10.1515/9783110599435-001
1.1 Introduction
IR is a process of retrieving information (generally documents), which satisfies users’
need for information, from massive unstructured datasets (generally texts that are
mostly stored in computers) [1].
Unstructured data refers to data without obvious structural markers which
differs from structured traditional databases. Natural language texts are the most
common unstructured data. User information need refers to an information theme
that the user wants to find, while a query is usually a piece of text that the user
submits to the retrieval system representing his information need or an object in any
other form (such as one or several keywords in the text search engine or sample
images in the image search engine).
The dataset being retrieved is referred to as a corpus or a collection. Each record in a
collection is called a document. A document is the basic object to be retrieved such as
a microblog message, a webpage, an image, or a sentence. It should be noted that the
document here is different from a file. A file may contain a number of documents, and
an IR document may be composed of several files.
During processing, a document is often converted into a format that can describe
the key characteristics of its most important content, and such a characteristic is
referred to as a “term” in IR, which is the basic retrieval unit in the retrieval system.
Terms in texts are generally expressed by keywords. For example, in the sentence
“social network oriented IR is very important,” the terms can be “IR,” “social,”
“networks,” “very,” and “important.” In practical applications, the final selected
terms depend on the applications.
The aim of IR is to search for and return relevant documents from the document set, and the degree of relatedness between a document and the query is termed relevance. With respect to IR systems, it is usually necessary to rank documents
based on the relevance between them and the query. To overcome the problem
that an original query cannot precisely represent the user need, the original
query can be modified, which is referred to as query expansion or query
reformulation. After the retrieval results are returned, either the user or the
system can apply explicit or implicit markers to some documents returned to
determine whether they are relevant. The original query can be modified based
on the marking results. This process is called relevance feedback. If we directly
assume that the top k documents of the returned results are relevant and per-
form feedback based on this assumption, the feedback is referred to as pseudo-
relevance feedback, and the top k documents are referred to as pseudo-relevant
documents.
Evaluation is one of the important tasks in IR. On comparing the results returned by
the system with the actual results, we can derive some evaluation metrics to measure the
retrieval results. The most basic evaluation indicators for IR are “precision” and “recall,”
with the former referring to the proportion of actually relevant documents in the
returned results, and the latter referring to the proportion of actually relevant documents
that are returned. The two are used to measure the correctness of the results returned
and the degree of coverage of returned results, respectively, on all correct results.
Example 1.1 Calculation of precision and recall. Provided that there are 100 docu-
ments relevant to a query, and a system returns 120 documents, in which 80 are
actually relevant to the query, the precision of the system in terms of this query is 80/
120 = 2/3, and the recall is 80/100 = 4/5.
Precision and recall are applicable for search and classification tasks in IR; the
dedicated evaluation metrics for recommendation are introduced in Section 1.4.
Precision and recall are extensively applied, however, in terms of search, the order
of returned results, which is of vital importance to user experience, is not considered
while calculating precision and recall. Therefore, in the application of information
search, the following metrics that consider the order of returned results are usually
used: P@k and mean average precision (MAP).
Definition 1.1 P@k (precision at k) refers to the precision of the top k results in the retrieval results; for example, P@5 and P@10, respectively, refer to the ratio of relevant documents in the top 5 and top 10 results. For a given query q, P@k is calculated based on the following equation:

$$P@k(q) = \frac{1}{k}\sum_{i=1}^{k}\mathrm{isrel}(d_i) \qquad (1.1)$$
Definition 1.2 Average precision (AP) is, given a query q, the average value of
precisions at the positions of all relevant document in the returned result. AP is
calculated based on the following equation:
$$\mathrm{AP}(q) = \frac{1}{|\mathrm{rel}(q)|}\sum_{i=1}^{|\mathrm{ret}(q)|}\mathrm{isrel}(d_i)\times P@i \qquad (1.2)$$
where rel(q) refers to the document collection actually relevant to q; ret(q) represents
all returned document collection for q; di is the ith document in the documents
returned; and isrel(di) is a Boolean function. If di is relevant to q, 1 will be returned;
otherwise 0 will be returned.
Example 1.2 Calculation of AP. Provided that the size of the document collection (rel(q))
relevant to a query is 5, of which 4 documents appear at positions 1, 4, 5, and 10 of the search result, respectively, the AP of the query will be:
(1/1+2/4+3/5+4/10+0)/5=0.5
It is clear that the higher the relevant documents are ranked in the returned results, the larger the AP.
Definition 1.3 MAP refers to the mean of the AP values of multiple queries. MAP is
used to evaluate the quality of an IR system.
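To make these evaluation metrics concrete, the following Python sketch computes precision, recall, AP, and MAP from a ranked result list and a set of relevant document IDs; the function names and toy data are illustrative, not taken from the chapter.

```python
def precision_recall(returned, relevant):
    """Precision and recall of an unordered result set."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    return hits / len(returned), hits / len(relevant)

def average_precision(ranked, relevant):
    """AP: sum of P@i over the ranks i of relevant documents,
    normalized by the total number of relevant documents (eq. 1.2)."""
    score, hits = 0.0, 0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / i          # P@i at a relevant position
    return score / len(relevant)

def mean_average_precision(runs):
    """MAP: mean AP over (ranked_list, relevant_set) pairs for several queries."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Example 1.2 revisited: 5 relevant docs, 4 retrieved at ranks 1, 4, 5, and 10
ranked = ['r1', 'x', 'x', 'r2', 'r3', 'x', 'x', 'x', 'x', 'r4']
relevant = {'r1', 'r2', 'r3', 'r4', 'r5'}
print(average_precision(ranked, relevant))   # (1/1 + 2/4 + 3/5 + 4/10) / 5 = 0.5
```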
Different from traditional IR in many aspects, social network-oriented IR has its own characteristics, which present both challenges and opportunities for traditional IR technology:
(1) SNS documents are normally very short, and their contents are sparse to some extent, which makes similarity hard to calculate accurately because co-occurring terms are scarce.
(2) The expressions in SNS documents are usually nonstandard, streaming, and dynamic. New, extremely colloquial terms mushroom in social networks. The SNS contents of the same theme may shift over time, and hence, the original expressions and calculation models have to account for such a shift. This problem is particularly serious for classification.
(3) SNS documents have their own structures and interaction characteristics, as well as specific ways of expression. For example, a microblog document is likely to contain a Hashtag and an external URL link. In general, SNS documents contain information about their authors, while social relationships between authors can be formed by following or interacting. These characteristics can be used during IR.
(4) The time attribute can be found in most SNS queries and documents, for
example, queries are always closely related to the current hot social events,
and documents present different distribution characteristics with the passage of
time. Using the time attribute to improve the effectiveness of SNS IR is an important research objective.
The remaining parts of this chapter will introduce IR in social networks based on the
three main IR applications or tasks. IR has an extremely wide coverage and is
correlated with other fields. It is impossible to introduce all fields in this chapter
due to the limitations of space. Readers interested in those topics can refer to social
network related papers published in academic conferences (such as SIGIR and CIKM)
in the field of IR.
1.2 Content search in social networks

Content search is one of the most classical applications of IR. In SNS, content search is urgently needed. For
example, a user enters “missing of MH370,” with the aim of getting information about
this event. Sometimes it is also possible to realize “Expert Location” based on social
networks. For example, if you search “machine learning,” some information about
relevant experts can also be retrieved on SNS. This is a specific search application in
SNS. Owing to space constraints, this topic is not discussed in this chapter. Interested
readers can refer to Chapter 8 in the book “Social Network Data Analytics” [2]. The Text
Retrieval Conference (TREC) added the sub-task of microblog search in 2011 to promote
SNS search, in particular microblog search, by providing standard queries and annotated data collections (Twitter).
The basic process of a traditional content search is as follows: Massive document
data constitutes a corpus for the search; the user creates a query that can represent
his information need; the query and documents are respectively processed and
converted into certain representations; the relevance is calculated using the IR
model; and the documents are returned to the user in descending order of the calculated relevance. It is clear from the above process that, in IR, the
processed objects (documents and queries) should be converted into certain forms of
representations first, following which the relevance between the objects can be
calculated. The conversion of the retrieved objects into expressions and the calcula-
tion of the relevance between them fall into the scope of IR models. There are
currently three classical IR models: the vector space model (VSM), the probabilistic model, and the statistical language model (SLM). Now we briefly introduce these models and the corresponding feedback models.
The VSM represents queries and documents as vectors of term weights and typically uses the term frequency (TF) and the inverse document frequency (IDF) for calculation. The IDF of a term (t) is generally calculated based on the following equation²:

$$\mathrm{IDF}_t = \log\frac{N}{\mathrm{DF}_t} \qquad (1.3)$$
where N is the number of documents in the document set and IDF is the ability of the
term to distinguish documents. The IDF of commonly used words, such as “of,” “a,”
and “the,” is small. That is to say, their ability to distinguish documents is very
limited. In VSM, the weight of a term is the product of TF and IDF.
Example 1.3 Calculation of TFIDF. Assume that the TF of a term in a document is 2,
and that it can be found in 10 out of 100 documents, the TFIDF of the term in the
document will be:
2 × log(100/10) = 2
Similarly, the weight of the query and other terms in the document can be calculated
to obtain the vector representations of the query and each document. The similarity is
calculated at last. For VSM, the cosine similarity between vectors is used to calculate
the relevance between a query and a document. The cosine similarity between a
query (q) and a document (d) is calculated using the following equation:
$$\mathrm{RSV}(d,q) = \frac{\sum_t \mathrm{TF}_{t,d}\,\mathrm{IDF}_t \times \mathrm{TF}_{t,q}\,\mathrm{IDF}_t}{\sqrt{\sum_t (\mathrm{TF}_{t,q}\,\mathrm{IDF}_t)^2}\ \sqrt{\sum_t (\mathrm{TF}_{t,d}\,\mathrm{IDF}_t)^2}} \qquad (1.4)$$
where TFt,d is the occurrence frequency (number of times) of the term (t) in the
document (d).
Example 1.4 Cosine similarity calculation. Assume that the vector representation of a
query is <2, 0, 1, 0> and that of a document is <1, 2, 2, 0>, the cosine similarity between
them will be:
$$\mathrm{RSV}(d,q) = \frac{2\times1+0\times2+1\times2+0\times0}{\sqrt{2^2+0^2+1^2+0^2}\times\sqrt{1^2+2^2+2^2+0^2}} = \frac{4}{\sqrt{45}} \approx 0.596$$
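The computations in Examples 1.3 and 1.4 can be reproduced with a short Python sketch; it uses the plain TF × log10(N/DF) weighting of eq. (1.3), and all names and the toy data are illustrative.

```python
import math
from collections import Counter

def tfidf_vector(term_freqs, doc_freqs, n_docs):
    """TF * IDF weight per term, with IDF = log10(N / DF) as in eq. (1.3)."""
    return {t: tf * math.log10(n_docs / doc_freqs[t])
            for t, tf in term_freqs.items() if doc_freqs.get(t, 0) > 0}

def cosine(v1, v2):
    """Cosine similarity between two sparse term-weight vectors (eq. 1.4)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Toy usage: query and document as bags of words
doc_freqs, n_docs = {'social': 10, 'network': 20, 'retrieval': 5}, 100
q = tfidf_vector(Counter(['social', 'retrieval']), doc_freqs, n_docs)
d = tfidf_vector(Counter(['social', 'network', 'network']), doc_freqs, n_docs)
print(cosine(q, d))
```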
In practice, there are several transformed calculation methods for TF and IDF. In
addition, the document length is also considered in some VSM weighting representa-
tions. Details of the above methods are described in references [5] and [6].
VSM is very simple and intuitive and gives good practical results. The idea of vector representation is widely used in various fields. Its shortcoming is the "term independence assumption," i.e., terms in different dimensions are independent of each other; this assumption obviously does not hold in practice. For example, in an article that contains the term "Yao Ming," the probability that "basketball" appears in that article obviously increases.
2 For the purpose of unification, log in this chapter refers to the logarithm with base 10.
The binary independence retrieval (BIR) model ranks a document D by the probability P(R=1|D) that it is relevant (R=1) to the query, which, by Bayes' theorem, is:

$$P(R=1|D) = \frac{P(D|R=1)P(R=1)}{P(D)} \qquad (1.5)$$
For the convenience of calculation, the BIR model uses the following log-odds to sort the documents⁴:

$$\mathrm{RSV}(D) = \log\frac{P(R=1|D)}{P(R=0|D)} \propto \log\frac{P(D|R=1)}{P(D|R=0)} \qquad (1.6)$$
where P(D|R=1) and P(D|R=0) denote the probability of generating the document D,
respectively, under the condition that R=1 (relevant) and R= 0 (irrelevant). ∝ denotes
order preserving, i.e., the order of the expression before ∝ is the same as that after ∝ .
In the BIR model, the document D is represented, based on the term collection {t_i | 1 ≤ i ≤ M}, as a binary vector (e_1, ..., e_M) drawn from a multivariate Bernoulli distribution⁵, so that

$$P(D|R=1) = \prod_{i=1}^{M} p_i^{e_i}(1-p_i)^{1-e_i}, \qquad P(D|R=0) = \prod_{i=1}^{M} q_i^{e_i}(1-q_i)^{1-e_i} \qquad (1.7)$$
3 http://www.soi.city.ac.uk/~andym/OKAPI-PACK/index.html.
4 Obviously, for two documents D1 and D2, if P(R=1|D1) > P(R=1|D2), then the same inequality holds for their log-odds. That is to say, the log-odds function is order-preserving.
where pi and qi are the probabilities of occurrence of the term ti in relevant and
irrelevant documents, respectively. ei is a variable with a value of 0 or 1 (if ti∈D, ei=1;
otherwise, ei=0) and denotes whether the term ti exists in the document D. Parameters
pi and qi can be estimated, and then the rank of each document can be obtained.
Example 1.5 BIR model calculation. Provided that the query is "Information Retrieval Textbook," a document D is "Retrieval Courseware," and the number of terms (M) is 5, the parameters p_i and q_i are shown in Table 1.1.
Consequently,
P(D|R=1)=(1−0.8)×0.9×(1−0.3)×(1−0.32)×0.15
P(D|R=0)= (1−0.3)×0.1×(1−0.35)×(1−0.33)×0.10
log(P(D|R=1)/P(D|R=0))=0.624
The basic BIR model does not consider important factors such as TF and document
length. Robertson et al. made improvements and proposed the well-known BM25
retrieval formula [8] as follows:
$$\mathrm{RSV}(d,q) = \sum_{t\in q}\ln\frac{N-\mathrm{DF}_t+0.5}{\mathrm{DF}_t+0.5}\times\frac{(k_1+1)\mathrm{TF}_{t,d}}{k_1\left((1-b)+b\frac{dl}{avdl}\right)+\mathrm{TF}_{t,d}}\times\frac{(k_3+1)\mathrm{TF}_{t,q}}{k_3+\mathrm{TF}_{t,q}} \qquad (1.8)$$
where dl is the length of the document; avdl is the average length of documents in the
document collection; TFt,d and TFt,q are the frequency of the term in the document
and query, respectively; and b, k1, and k3 are empirical parameters. The formula can
also be considered as a formula for calculating the inner product of vectors using
different TF and IDF calculation methods.
5 The multivariate Bernoulli distribution can be considered as the process of flipping M coins, where each coin corresponds to a term. The terms whose coins land face up constitute the document D.
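As an illustration of eq. (1.8), the following sketch scores one document against one query with BM25; the parameter defaults b = 0.75, k1 = 1.2, and k3 = 8 are common choices rather than values prescribed by the chapter, and the data structures are assumptions.

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, doc_freqs, n_docs,
               k1=1.2, b=0.75, k3=8.0):
    """BM25 retrieval status value of one document for one query (eq. 1.8)."""
    score = 0.0
    for term, tf_q in query_tf.items():
        tf_d = doc_tf.get(term, 0)
        df = doc_freqs.get(term, 0)
        if tf_d == 0 or df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))   # IDF-like factor
        doc_part = ((k1 + 1) * tf_d) / (
            k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_d)
        query_part = ((k3 + 1) * tf_q) / (k3 + tf_q)       # query TF saturation
        score += idf * doc_part * query_part
    return score
```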
The advantages of probabilistic models are that they are derived from probability theory and are more interpretable than the VSM. However, the assumption of term independence still exists in calculations using probabilistic models. Moreover, the parameters in the models need to be precisely estimated.
Definition 1.4 Statistical language model. A language can essentially be regarded as a certain probability distribution over its alphabet, which gives the possibility that any sequence of letters becomes a sentence (or any other language unit) of the language. This probability distribution is the SLM of the language. For any term sequence S = w_1 w_2 ... w_n in a language, its probability can be determined based on the following equation:
$$P(S) = \prod_{i=1}^{n} P(w_i | w_1 w_2 \ldots w_{i-1}) \qquad (1.9)$$
Estimating the probability P(w_i | w_1 w_2 ... w_{i−1}) from a given dataset (corpus) is the key problem of the SLM. It is generally impossible to obtain enough data to estimate P(w_i | w_1 w_2 ... w_{i−1}) directly; hence, the n-gram model is adopted, according to which the occurrence of a term is only related to the n−1 terms before it (these n−1 terms are also called the history of the term), that is,

$$P(w_i | w_1 w_2 \ldots w_{i-1}) \approx P(w_i | w_{i-n+1} \ldots w_{i-1}) \qquad (1.10)$$
When n=1, it is referred to as a unigram model, where the occurrence of any term is considered to be independent of other terms, i.e., it is assumed that all terms are independent of each other. A model that disregards term order is also called a bag-of-words (BOW) model. When n=2, it is referred to as a bigram model, where the occurrence of the current term is considered to be related only to the previous term.
The basic idea of SLM-based IR models is considering relevance as the sampling
probability in statistical models. The earliest model of this kind is the query like-
lihood model (QLM) proposed by Jay Ponte and Bruce Croft [13]. The basic idea of the model is that there is a language model Md for each document d in the document set, the query is a sampling result of this model, and documents can be ranked by the probability of sampling the query from the different document models. The basic
formula of the QLM is as follows:
$$\begin{aligned}\mathrm{RSV}(d,q) &= \log P(d|q) = \log\frac{P(q|d)P(d)}{P(q)} \propto \log\big(P(q|d)P(d)\big) \propto \log P(q|d) \\ &= \sum_{t\in q}\log P(t|M_d) = \sum_{t\in q}\mathrm{TF}_{t,q}\cdot\log P(t|M_d)\end{aligned} \qquad (1.11)$$
In the above derivation, QLM assumes that the prior probability P(d) of the document
follows a uniform distribution, i.e., P(d)=1/N (N is the number of documents), hence
this part can be removed. It should be noted that in some studies the prior probability
can be reserved because other distributional hypotheses are adopted. P(t|Md) is the
probability of considering the term t by the model Md of the document d during
sampling. This probability can be calculated by adopting the maximum likelihood
estimation (MLE) as:
$$P_{ml}(t|M_d) = \frac{\mathrm{TF}_{t,d}}{\sum_{t'}\mathrm{TF}_{t',d}} \qquad (1.12)$$
The above estimation is likely to result in zero probability, i.e., the probability that
the term t does not appear in the document is estimated to be 0. In this case, once a
term in a query does not appear in the document, the score of the document will be
zero. This is obviously inappropriate. Therefore, smoothing methods are commonly
used for correction. At present, the main smoothing methods include the Jelinek–Mercer (JM) smoothing method and the Dirichlet prior smoothing method. The formula for the JM smoothing method is as follows:

$$P(t|M_d) = (1-\lambda)P_{ml}(t|M_d) + \lambda P_{ml}(t|C) \qquad (1.13)$$
where P_ml(t|C) is the MLE value of t in the entire document collection C, and λ is a linear weighting coefficient between 0 and 1, which should be given beforehand. Consequently, for every term t in the document collection, it is necessary to calculate P_ml(t|C). For a document d, the retrieval status value RSV(d,q) can be obtained by first calculating P_ml(t|M_d) for every term t in the document and then deriving P(t|M_d) for every term t in the query q by linear combination. The process is easy to understand, but owing to space constraints, a concrete calculation example is not presented here; interested readers can try it themselves.
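As a concrete illustration of the query likelihood model with JM smoothing (eqs. (1.11)–(1.13)), the sketch below sums the logarithms of the smoothed term probabilities; the function signature and λ = 0.5 are illustrative choices, not values fixed by the chapter.

```python
import math
from collections import Counter

def qlm_jm_score(query_terms, doc_terms, collection_tf, collection_len, lam=0.5):
    """log P(q|d) under the unigram QLM with Jelinek-Mercer smoothing."""
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_ml_d = doc_tf[t] / doc_len if doc_len else 0.0       # eq. (1.12)
        p_ml_c = collection_tf.get(t, 0) / collection_len      # background model
        p_smoothed = (1 - lam) * p_ml_d + lam * p_ml_c         # eq. (1.13)
        if p_smoothed > 0:
            score += math.log10(p_smoothed)
    return score
```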
A series of other statistical modeling-based IR models have been developed based on the QLM, including the KL divergence model [14] and the translation model [15]. With regard to the KL divergence model, the KL divergence (relative entropy) between the query language model and the document language model is calculated to rank the documents.
For relevance feedback in the VSM, an optimal query vector can be defined as the difference between the centroids of the relevant and irrelevant document collections C_r and C_nr:

$$\vec{q}_{opt} = \frac{1}{|C_r|}\sum_{\vec{d}_j\in C_r}\vec{d}_j - \frac{1}{|C_{nr}|}\sum_{\vec{d}_j\in C_{nr}}\vec{d}_j \qquad (1.14)$$

where |C_r| and |C_nr| represent the sizes of the collections C_r and C_nr, respectively. The above formula indicates that, when all relevant and irrelevant documents are known, the optimum query vector that distinguishes them is the vector difference between the average vector (centroid vector) of all relevant documents and that of all irrelevant documents [1].
However, in practice, for a given query, the collection of relevant documents and
that of irrelevant documents are unknown beforehand. Although relevance feedback
is performed, the relevance of only part of the documents can be obtained. The Rocchio method is a method of gradually modifying the original query vector when the relevance of only part of the documents is known. Thus, assuming that $\vec{q}$ is the original query vector, and $D_r$ and $D_{nr}$ are, respectively, the collection of relevant documents and the collection of irrelevant documents judged so far, the modified query vector is:

$$\vec{q}_{opt} = \alpha\vec{q} + \frac{\beta}{|D_r|}\sum_{\forall \vec{d}_j\in D_r}\vec{d}_j - \frac{\gamma}{|D_{nr}|}\sum_{\forall \vec{d}_j\in D_{nr}}\vec{d}_j \qquad (1.15)$$
where, |Dr| and |Dnr| denote the sizes of document collections Dr and Dnr, respectively;
$\vec{d}_j$ is the vector of document $d_j$; obviously, $\frac{1}{|D_r|}\sum_{\forall \vec{d}_j\in D_r}\vec{d}_j$ and $\frac{1}{|D_{nr}|}\sum_{\forall \vec{d}_j\in D_{nr}}\vec{d}_j$, respectively,
denote the average vector of all document vectors in the collection of relevant
documents Dr and irrelevant documents Dnr; α, β, and γ are constants and non-
negative real numbers.
The above formula can be described as follows: The query vector after modification
is the linear weighted sum of the initial query vector, the average document vector of
the collection of relevant documents, and the average document vector of the collec-
tion of irrelevant documents, with the corresponding weighting coefficients being α, β,
and –γ. Essentially, the above formula makes the modified query continuously
approach the centroid vector of the collection of relevant documents, but gradually
deviates from the centroid vector of the collection of irrelevant documents. Due to the
subtraction operation in the above formula, the components of the final result vector
may be negative. In this case, the commonly used method is to set the value of the
components to 0, i.e., remove the terms that these components correspond to.
In practical applications, there are many value assignment methods for α, β, and γ,
and a commonly used method is α = 1, β = 0.75, and γ = 0.15. When γ > 0, the current
Rocchio query expansion method allows negative feedback; when γ = 0 and β > 0, the
current method only allows positive feedback. In addition to the above-mentioned
basic Rocchio formula, there are other transformed Rocchio formulas.
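The Rocchio update of eq. (1.15) can be sketched over sparse term-weight vectors as follows, using the α = 1, β = 0.75, γ = 0.15 values mentioned above and clipping negative components to zero as described; representing documents as {term: weight} dictionaries is an assumption of this sketch.

```python
from collections import defaultdict

def rocchio_update(query_vec, rel_docs, nonrel_docs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Modified query vector per eq. (1.15); all vectors are {term: weight} dicts."""
    new_q = defaultdict(float)
    for t, w in query_vec.items():
        new_q[t] += alpha * w
    if rel_docs:                                   # move toward the relevant centroid
        for d in rel_docs:
            for t, w in d.items():
                new_q[t] += beta * w / len(rel_docs)
    if nonrel_docs:                                # move away from the irrelevant centroid
        for d in nonrel_docs:
            for t, w in d.items():
                new_q[t] -= gamma * w / len(nonrel_docs)
    # negative components are set to zero, i.e., those terms are dropped
    return {t: w for t, w in new_q.items() if w > 0}
```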
$$p_i = \frac{r_i + \frac{1}{2}}{|V| + 1}, \qquad q_i = \frac{n_i - r_i + \frac{1}{2}}{N - |V| + 1} \qquad (1.17)$$
We can see that the original pi and qi in the above formulas do not appear in the query
updating formula, which is obviously different from the Rocchio method introduced
earlier. A query expansion method that uses the original pi is as follows:
$$p_i^{(t+1)} = \frac{r_i + \kappa\, p_i^{(t)}}{|V| + \kappa} \qquad (1.18)$$
where $p_i^{(t)}$ and $p_i^{(t+1)}$, respectively, denote the p_i value before and after updating. Essentially, the former p_i value is introduced as a Bayesian prior and is weighted by κ. Interested readers can refer to Chapter 11 of "Introduction to Information Retrieval" [1].
In the language modeling framework, the relevance model estimates the probability of a term t under the relevance model R from the query terms as:

$$P(t|R) \approx \frac{P(t, q_1, q_2, \ldots)}{\sum_{t'\in V} P(t', q_1, q_2, \ldots)} \qquad (1.19)$$
and

$$P(t, q_1, q_2, \ldots) = \sum_{d\in C_{prf}} P(d)\,P(t|M_d)\prod_i P(q_i|M_d)$$
where, q1 , q2 , ... are terms in the query and Cprf is the pseudo-relevant document
collection.
According to the above formula, the relevance between each term in the vocabulary and the query q can be computed, the terms with the highest values can be selected for expansion, and a new query can be obtained by linearly interpolating the original and expanded terms. The resulting query model P_new(t|M_q) is as follows:

$$P_{new}(t|M_q) = (1-\lambda)P_{origin}(t|M_q) + \lambda P(t|R) \qquad (1.20)$$
The relevance model-based sorting function may adopt the KL divergence model
[14] to calculate the KL divergence between P(t|Md) and P(t|Mq) and to obtain the
final sorting result.
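A minimal sketch of this relevance-model expansion (eqs. (1.19) and (1.20)) is given below, assuming uniform document priors P(d) over the pseudo-relevant set and smoothed document language models supplied as {term: probability} dictionaries; the small probability floor for unseen query terms is an illustrative choice.

```python
def relevance_model(query_terms, prf_doc_models):
    """P(t|R) over the pseudo-relevant documents, eq. (1.19), with uniform P(d)."""
    prior = 1.0 / len(prf_doc_models)
    vocab = {t for model in prf_doc_models for t in model}
    joint = {}
    for t in vocab:
        total = 0.0
        for model in prf_doc_models:
            q_likelihood = 1.0
            for q in query_terms:                  # prod_i P(q_i | M_d)
                q_likelihood *= model.get(q, 1e-6) # small floor instead of zero
            total += prior * model.get(t, 0.0) * q_likelihood
        joint[t] = total
    norm = sum(joint.values()) or 1.0
    return {t: v / norm for t, v in joint.items()}

def expand_query_model(original_model, rel_model, lam=0.5, top_k=10):
    """Interpolate the original query model with top-k relevance-model terms, eq. (1.20)."""
    top_terms = dict(sorted(rel_model.items(), key=lambda kv: -kv[1])[:top_k])
    terms = set(original_model) | set(top_terms)
    return {t: (1 - lam) * original_model.get(t, 0.0) + lam * top_terms.get(t, 0.0)
            for t in terms}
```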
As mentioned earlier, documents and user queries are the basis for the input of
the query model; therefore, in selecting the appropriate retrieval model, it is
imperative to consider the documents used for retrieval and the characteristics of
users’ query input. However, the SNS content search is different from traditional
text search in both the document and query. Consequently, corresponding mod-
ifications have to be made to query representation, document representation, and
relevance computing models in the process of searching. In the following section,
we will specifically introduce the SNS content search based on the above three
aspects.
$$\hat{\Theta}_{fb} = (1-\lambda)\hat{\Theta}_q^{ML} + \lambda\hat{\Theta}_r \qquad (1.22)$$
When a Hashtag co-occurs with more tags in the feedback results, its relevance will be larger. This relevance is introduced into the above-mentioned ranking formula, and feedback is then conducted once more (denoted as HFB1, HFB2, and HFB2a depending on the feedback method) using the new Hashtag collection calculated in this way. Table 1.2 gives the result of the final query feedback method.
From Table 1.2, we can see that Hashtags can provide useful information for query relevance feedback, and that feedback considering the relationship between Hashtags further improves the retrieval performance.
Other applications where internal resources are used: According to reference
[20], microblogs containing hyperlinks have more abundant information; there-
fore, increasing the weight of such microblogs during PRF will improve the retrieval
result.
Figure 1.1: Distribution of relevant documents of TREC queries MB036, MB078, MB020, and MB099 over time [23] (the vertical axes show the number of microblogs)
From Figure 1.1, we can see that the distribution of relevant documents of microblog
queries at each moment is far from even and shows obvious time sensitivity. During
expansion of time-sensitive queries, the time factor should be considered; at present,
some time-sensitive query expansion works have sprung up. These works can be
divided into two categories based on different ways of time integration: one is to analyze
the characteristics of the distribution of documents at certain time points to select the
appropriate time point as the basis of document selection before query expansion; the
other is to construct a graph using the time relationship between terms and obtain the
relevance score of each candidate word corresponding to the given query through
iterative computations, and finally, select the terms with high scores for expansion.
Example 1.7 Time-aware query expansion: In reference [24], it has been pointed out that there is, to a certain extent, a correlation between relevance and time. The author verified this hypothesis by comparing two time sequences (one formed by the feedback documents in the first-pass retrieval results and the other formed by the truly relevant documents). The author then counted the number of documents in the first-pass returned collection that appeared in each time frame t, selected the top k documents near the time frame in which the largest number of documents appeared, and finally used the Rocchio method to calculate the score of each term to select the expansion words. The experiment proved that this method can improve the retrieval effect.
where, D is the document collection for constructing the sequence, and ft(d) is
calculated as:
$$f_t(d) = \begin{cases}1, & \text{if } d \text{ is posted in the time frame } t\\ 0, & \text{otherwise}\end{cases}$$
The time frame can be 1 hour, 1 day, or 1 month, however in the experiment, the time
unit was 1 day.
Using the above method, the author constructed two time sequences; sequence X
represented the time sequence of real relevant documents whereas sequence Y
represented the time sequence formed by the retrieval results. To determine whether
the two sequences are related, the author introduced the cross correlation function
(CCF) for calculation.
$$\rho_{xy}(\tau) = \frac{E\left[(x_t - \mu_x)(y_{t+\tau} - \mu_y)\right]}{\sigma_x\sigma_y} \qquad (1.26)$$
where, μx and μy are the mean values of two time sequences, respectively, while
σx and σy are the corresponding standard deviations. τ represents the delay
time, 0 ≤ τ ≤ 15. Table 1.3 presents the CCF values of the two sequences.
From Table 1.3, it can be seen that the mean value of CCF is 0.7280 and its
variance is 0.0393. Therefore, the retrieved and real relevant documents are highly
related in terms of time distribution.
Table 1.3: Result of CCF values of two sequences from Giuseppe Amodeo [24]
The experiment shows that, after the queries are expanded using this method, the
retrieval performance can be effectively improved.
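The time-series construction and the cross-correlation check of eq. (1.26) can be sketched as follows, assuming each document carries a date field and the two series are aligned day by day; the helper names are illustrative.

```python
import numpy as np
from collections import Counter

def daily_series(dates, all_days):
    """Number of documents per day, i.e., the sum of f_t(d) for each time frame t."""
    counts = Counter(dates)
    return np.array([counts.get(day, 0) for day in all_days], dtype=float)

def ccf(x, y, tau):
    """Cross correlation rho_xy(tau) of eq. (1.26) at delay tau (0 <= tau < len(x))."""
    x_part = x[:len(x) - tau] if tau else x
    y_part = y[tau:]
    x_part = x_part - x_part.mean()
    y_part = y_part - y_part.mean()
    denom = x_part.std() * y_part.std()
    return float((x_part * y_part).mean() / denom) if denom else 0.0
```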
Apart from the above work, there are other studies regarding query expansion
using the time factor.
In reference [24], according to the time T of each document d and the
specified rules, the burst collection B (one or more) can be obtained from the
first search result collection R of queries, and the collection of bursts is denoted
as bursts (R); then, the score of each term can be calculated according to the
following formula:
$$P(t|q) = \frac{1}{|\mathrm{bursts}(R)|}\sum_{B\in \mathrm{bursts}(R)}\frac{1}{N_B}\sum_{d\in B}P(t|d)\,e^{-\gamma\left(\max_{d'\in B}\mathrm{Time}(d') - \mathrm{Time}(d)\right)} \qquad (1.28)$$
where Time(d) represents the document time and N_B the size of burst B in bursts(R).
Finally, the top k terms with the largest probability are selected as the expanded
query terms. The experimental result shows that this method can improve the
retrieval effect.
In reference [25], in the context of blog retrieval, and based on the idea of the
relevance model, a time-based PðtjqÞ generation model is defined, and the expanded
terms ranking ahead are selected through calculation. According to the paper, the
generation process of PðtjqÞ is: first, select a time moment T for the query q and then
select a t under the conditions of T and q; thus, the formula of the query term
generation model is:
$$P(t|q) = \sum_T P(t|T, q)\,P(T|q) \qquad (1.29)$$
The experiment performed on the TREC data collection Blog08 indicates that this
method improves the retrieval effect. Similar methods can also be applied to micro-
blog search.
In reference [26], with the background of microblog retrieval, a PageRank-based
query expansion method to calculate term weights was proposed. The method
included the following steps: first, extract n-gram from the first returned results as
the terms. Then, construct a directed graph for these terms, in which the terms are
taken as the nodes and term TF is the prior value of the node. The edge represents the
time correlation between terms while the weight represents the time-based relevance.
Finally, use the PageRank random walk to calculate the final value of each term, and
select some terms to expand the original query. The TREC Twitter data collection-
based experiment indicates that good results can be obtained when n = 1 and n = 2 are
selected for the n-gram.
The expansion of documents based on external resources is analogous to the query expansion described above, i.e., the original document is expanded based on external documents, which is not repeated here. Next, we will introduce the research on the expansion of documents based on internal resources.
Example 1.8 Expansion of documents based on internal resources. According to
reference [27], a short text only covers one topic in general. Therefore, in this
paper, a short text is regarded as a pseudo-query, which is searched in some
data collections, and the search results are taken as the object of document
expansion.
1) Constructing pseudo-queries
First, assume document D as a pseudo-query which is denoted as QD. Then, execute
query QD in collection C to get a ranking result of the top k documents: RD=D1, …, Dk,
and their retrieval scores; these scores, P(D|D_1), ..., P(D|D_k), are calculated using the QLM.
$$P(w|D') = P(w|d_1, \ldots, d_{|D|}) = \frac{P(w, d_1, \ldots, d_{|D|})}{P(d_1, \ldots, d_{|D|})} \qquad (1.30)$$
The denominator $P(d_1, \ldots, d_{|D|})$ does not depend on w, so only the joint distribution needs to be retained; namely:

$$P(w, d_1, \ldots, d_{|D|}) = \sum_{D_i\in C} P(D_i)\,P(w, d_1, \ldots, d_{|D|} \mid D_i) \qquad (1.31)$$
in which

$$P(w, d_1, \ldots, d_{|D|} \mid D_i) = P(w|D_i)\prod_{j=1}^{|D|} P(d_j|D_i)$$

so that

$$P(w, d_1, \ldots, d_{|D|}) = \sum_{D_i\in C} P(D_i)\,P(w|D_i)\prod_{j=1}^{|D|} P(d_j|D_i) \qquad (1.32)$$
The last factor in the aforesaid formula indicates that this joint distribution is, to a large degree, determined by the documents that are most similar to D. Therefore, the k most similar documents can be used to estimate $P(w|D')$. After $P(w|D')$ is obtained, the document model can be updated by interpolation:

$$P(w|D_{new}) = (1-\lambda)P(w|D) + \lambda P(w|D') \qquad (1.33)$$
In this paper, λ=0.5. It should be noted that the time factor is also used here to expand
the documents at the same time.
Furthermore, other works related to expansion of the short-text documents have
been reported. In reference [28], the content linked by the URLs contained in microblogs is used to expand the current microblogs. In some works, a translation model that integrates microblog Hashtags, URLs, and other features is proposed, wherein various factors can be conveniently considered to expand the
microblog document. The experimental result verifies that microblog documents can
be effectively represented by these features.
In addition, it is also possible to realize microblog expansion using information
about authors. In reference [29], a microblog retrieval method based on author modeling
method has been proposed. In the paper, information on authors is extracted from
the corpus of the TREC microblog first, and all microblog documents posted by each
author are sorted to form “new documents,” and then user’s language models are
constructed based on these documents. In addition, smoothing is carried out using
the information on authors to estimate the probability of new microblog document
terms. The experimental result shows that properly using the author-related informa-
tion can improve the microblog retrieval performance. In reference [30], the followers
of the author of a microblog document, as well as the microblog documents posted by those followers, are used to expand the document, with good results.
Here, n(t, d) is the term frequency of term t in document d, and N is the total number of microblog documents in the data collection.
In the context of microblog search, reference [32] verified the impact of factors
such as the TF which plays a promoting role in traditional retrieval and document
length normalization (DLN) on microblog retrieval. The ranking model adopted in the
paper is BM25 model, and the result indicated that the effect was only 0.0048 lower
than the best-result P@30 when the document’s TF was ignored; the effect was the
best when the document’s DLN factor was ignored. In conclusion, the paper points
out that both TF and DLN are involved in the construction of the language model;
therefore, how to reduce the negative impacts of these factors shall be considered
during the construction of the language model.
Furthermore, the impact of the time factor can be considered in representing
microblog documents. Relevant studies were mainly conducted in the context of
statistical language retrieval models, where the key task is to calculate the term
generation probability PðtjMd Þ. There are currently two methods of adding the time
factor in the PðtjMd Þ: one is to define the time weight for the term t by introducing the
time factor [33, 34]; the other is to introduce time during smoothing of the language
model probability estimation [35].
Example 1.10 Temporal language model. In reference [34], the objective was to
determine the document time through the Temporal Language Model (TLM).
Although the study is not about microblog retrieval, the idea of introducing time for
the term t can be used in the language model. It shall be noted that the data collection
at this time will be divided into multiple blocks based on the time granularity, denoted
as p, whose set will constitute P. The time granularity can be selected freely such as 1
month or 3 months. For the term t, the author defined a time weight called the temporal entropy (TE), which is calculated as follows:
$$\mathrm{TE}(t_i) = 1 + \frac{1}{\log N_P}\sum_{p\in P} P(p|t_i)\times\log P(p|t_i) \qquad (1.35)$$
where $P(p|t_i) = \frac{\mathrm{tf}(t_i, p)}{\sum_{k=1}^{N_P}\mathrm{tf}(t_i, p_k)}$; $N_P$ is the total number of data collection blocks, and $\mathrm{tf}(t_i, p)$ is the number of occurrences of $t_i$ in the block p. The time weights of terms can be
used to modify the term probability of the language model, thus realizing the
representation of microblog documents.
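A sketch of the temporal entropy weight of eq. (1.35) is given below, assuming per-block term frequencies tf(t, p) are available as a dictionary keyed by block; the base-10 logarithm follows the chapter's convention, and all names are illustrative.

```python
import math

def temporal_entropy(term, block_tf):
    """TE(t) = 1 + (1/log N_P) * sum_p P(p|t) log P(p|t), eq. (1.35).

    block_tf maps block id -> {term: frequency}; blocks are, e.g., months.
    """
    n_blocks = len(block_tf)
    total = sum(tf.get(term, 0) for tf in block_tf.values())
    if total == 0 or n_blocks < 2:
        return 1.0
    te = 1.0
    for tf in block_tf.values():
        p = tf.get(term, 0) / total            # P(p | t)
        if p > 0:
            te += p * math.log10(p) / math.log10(n_blocks)
    return te
```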
In reference [35], the improvement of the document language model lies in the
smoothing part; the paper holds that the value of smoothing weights varies with the
time of different documents; i.e., the newer the document, the smaller the λ value.
Thus, priority shall be given to the document itself. According to this hypothesis,
a new smoothing parameter λ_t is introduced into the smoothing formula, and two calculation formulas for it are given, in which n(d, t < t_d) is the number of documents in the entire dataset whose time is earlier than that of document d, n(C) is the number of documents in the entire dataset, and α and β are the parameters of the conjugate beta prior.
In addition, there are also many studies on short-text modeling: some studies [36–39] have focused on the classification, clustering, and summarization of microblog documents, while others have concentrated on training topic models on microblog data, where LDA [40] is adopted. In reference [40], the basic LDA, the author topic model, and the post-expansion LDA are compared, showing that retraining the topic model after clustering has a better effect. All these studies can be used to re-estimate P(t|d) to realize a more sufficient representation of documents.
In reference [22], a recency-based document prior defined by an exponential distribution is introduced:

$$P(d|T_d) = \gamma e^{-\gamma(T_C - T_d)} \qquad (1.37)$$

where T_d stands for the time of the document; T_C is the latest time in the document collection; and γ is the exponential distribution parameter, designated according to human experience. This work takes the SLM as the basis and, regarding the document prior P(d), replaces the original constant with the exponential distribution (namely, the above calculation formula), modifying the original language model ranking function. The experimental verification performed on TREC's news corpus showed that the time-aware ranking result is better than the ranking result without time. This method can be used in microblog retrieval.
In reference [35], the method proposed in reference [22] is improved; this study
pointed out that the significance of each document varies with different query
conditions. Based on this hypothesis, the method of estimating the exponential
distribution parameters using the pseudo-relevance feedback collection of queries
is proposed; the given query information is introduced by changing the γ in the above
formula into $\gamma_q$. The pseudo-relevance feedback collection of the query q is denoted as $C_{prf} = \{d_1, d_2, \ldots, d_k\}$, and $T_q = \{t_1, t_2, \ldots, t_k\}$ represents the moments corresponding to the documents in the collection $C_{prf}$; according to maximum likelihood estimation, the calculation formula can be obtained as:

$$\gamma_q^{ML} = 1/\bar{T}_q \qquad (1.38)$$

where $\bar{T}_q$ stands for the mean of the collection $T_q$, with the value $\bar{T}_q = \sum_{i=1}^{k} t_i / k$.
In reference [35], in addition to the verification made on two TREC news corpora,
a microblog corpus collection is constructed; the experiment is carried out in the
microblog environment, which verifies that the retrieval result of this algorithm is
better than that of the existing algorithms.
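The recency-based exponential prior and the rate estimate of eq. (1.38) can be sketched as follows; measuring document age in days before the newest document is an illustrative choice, not something fixed by the chapter.

```python
import math

def exp_recency_prior(doc_age_days, gamma):
    """P(d) proportional to gamma * exp(-gamma * age), with age = T_C - T_d (eq. 1.37)."""
    return gamma * math.exp(-gamma * doc_age_days)

def gamma_from_prf(prf_ages_days):
    """Maximum-likelihood rate gamma_q = 1 / mean age of the PRF documents (eq. 1.38)."""
    mean_age = sum(prf_ages_days) / len(prf_ages_days)
    return 1.0 / mean_age if mean_age > 0 else 1.0

# The log prior is simply added to the query-likelihood score log P(q|d):
# score(d, q) = log P(q|d) + log exp_recency_prior(age(d), gamma_q)
```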
Example 1.11 Time-based SLM: In reference [33], a time-based New Temporal
Language Model (NTLM) used for webpage searching is proposed, and the queries
that this model faces include texts and obvious time information; for example, “earth-
quakes from 2000 to 2010." One of the major contributions of this paper is the proposed concept of a time-based TF ("TTF" for short), wherein the time factor is added to the TF calculation formula to improve the retrieval effect through time information.
The reference time of a sentence is determined using the retrospection method; that is, the time of the most similar sentence is used, where the similarity is calculated as follows:

$$\mathrm{Sim}(S_1, S_2) = \frac{\sum_{i=1}^{n} k_i\times k_i'}{\sqrt{\sum_{i=1}^{n} k_i^2}\ \sqrt{\sum_{i=1}^{n} k_i'^2}} \qquad (1.39)$$
$$P_{ml}(w|M_d) = \frac{\mathrm{tf}_{w,d}}{dl_d} \qquad (1.40)$$

where $\mathrm{tf}_{w,d}$ stands for the number of occurrences of w in document d, while $dl_d$ stands for the length of document d. At the same time, the author proposed a term frequency based on the time factor, which is defined as follows:
$[T_s^q, T_e^q]$ represents the query time, with $T_s^q$ the beginning time and $T_e^q$ the ending time. $\mathrm{num}(w, d, [T_s^q, T_e^q])$ denotes the number of occurrences of the word w in document d that satisfy $[T_s^w, T_e^w] \in [T_s^q, T_e^q]$. The author defines $[T_s^w, T_e^w] \in [T_s^q, T_e^q]$ by the four cases below ($[T_s^w, T_e^w]$ is denoted as $T^w$ and $[T_s^q, T_e^q]$ as $T^q$).
(1) Contained: $T_s^q \le T_s^w$ and $T_e^q \ge T_e^w$.
(2) Contains: $T_s^q > T_s^w$ and $T_e^q < T_e^w$.
(3) Left overlapping: $T_s^q < T_s^w$ and $T_e^q \le T_e^w$.
(4) Right overlapping: $T_s^q \ge T_s^w$ and $T_e^q > T_e^w$.
The definition of the relationship between the two is shown in Figure 1.2.
Finally, in allusion to the above term weight sensitive to time, several smoothing
methods are proposed in the paper to improve the retrieval performance.
It is worth noting that TTF can be used in any TF-involved model, such as the
language model and BM25, and the experimental result also indicates that TTF can
help improve the retrieval effect.
During IR, PageRank (PR for short) indicates the significance of documents by
virtue of the links between webpages. References [41–44] are a series of studies where
time factor was added to the original PR algorithm.
In reference [42], it is mentioned that the time dimension information is
ignored in the traditional PageRank or HITS algorithm in analyzing the webpage
link, based on which the time factor is added in the original PR algorithm to obtain
a new algorithm called the Timed PageRank (TPR for short). In TPR, the webpage time is used to define two kinds of freshness, which are used
in the calculation formula of the transmission probability and skipping prob-
ability. The experimental result indicates that the effect is improved. In a similar
manner, in reference [41], the features of the last-modified tab of the webpage
are added in the PageRank after being analyzed, to be taken as the weight of the
edge between webpages.
In reference [44], the time weight of webpages is added in the score of webpages
obtained through the original PageRank algorithm to adjust the PR. Jing Wan and
Sixue Bai believe that a webpage has three properties: integrity, accuracy, and time-
activity. Time-activity represents the freshness of the webpage content, while outdated
information may bring negative impacts during IR, so the author proposed that time-
activity shall become one of the factors influencing the PR degree. Assuming a webpage is denoted as T, four factors are used to determine the activity of this page at a certain moment: the activity of the domain name that this
page belongs to, the activity of the creator of the domain that the page belongs to, the
degree to which users are interested in this page, and the contents of the text that the
page carries. The active scores of this page at different moments are calculated based
on these four factors to form its time activity vector, denoted as TR(T). The author classified pages into four categories (strong-temporal common quality, weak-temporal common quality, weak-temporal high quality, and no-temporal quality) based on the active score at each moment, and obtains the time vector TR(T) of the page; meanwhile, the author obtains the PageRank vector PR(T) of the page at the different moments, and the final prior of this webpage is calculated as TR(T) × PR(T).
1) Author score
If A is used to represent the collection of all authors, a mapping A→R+ can be
established, thus endowing each author a with a non-negative real number g(a),
which is called this author’s score. First, the score can be estimated through the
number of microblogs posted by the author thus far; the author whose underlying
assumption is active may release more meaningful microblogs. Therefore, the
author’s TweetRank can be calculated through the following formula:
The N(a) represents the number of microblogs posted by the author thus far. The
score can be estimated according to the numbers of followers and followees of author
a (FollowerRank), as follows:
$$\mathrm{FR}(a) = \frac{i(a)}{i(a) + o(a)} \qquad (1.45)$$
Here, i(a) is the in-degree (number of followers) of author a and o(a) is the out-degree (number of followees). Of
course, more complicated algorithms similar to PageRank can also be used to
calculate the author’s score.
2) Microblog score
Assuming D represents all microblogs and Q represents the set of all queries, the
retrieval is equivalent to making such a mapping as D×Q→R+; that is, to get a non-
negative real number f(d, q) for a given pair (d, q). In addition, use “auth” to stand for
the author of a document; i.e., auth(d). Therefore, from the perspective of the author’s
influence, the microblog score can be estimated using the following two formulas:
$$f_{TR}(d, q) = \mathrm{TR}(\mathrm{auth}(d)), \qquad f_{FR}(d, q) = \mathrm{FR}(\mathrm{auth}(d)) \qquad (1.46)$$
In addition, a score based on the microblog length is defined as:

$$f_{LR}(d, q) = \frac{l(t)}{\max_{s\in D_q^k} l(s)} \qquad (1.47)$$
where $D_q^k$ is the set of the top k results returned in the first-pass retrieval for query q, while l(t) and l(s) are the lengths of t and s, respectively. In addition, URLRank is calculated as below:
$$f_{UR}(d, q) = \begin{cases}c, & \text{if microblog } d \text{ contains a URL}\\ 0, & \text{otherwise}\end{cases} \qquad (1.48)$$
3) Combination score
With the combination of the above factors, the follower length rank (FLR) score and the follower length URLRank (FLUR) score can be obtained.
The experimental results of the paper indicate that the FLUR method is the best; that is, it is best to integrate the numbers of the author's followers and followees, the length of the microblog, and whether the microblog contains URLs.
In reference [46], how to use the “learning to rank” method to rank microblogs is
introduced. It is also believed that the microblog search results shall be ranked
according to some features rather than the time of the microblogs. To ensure an
accurate model, the authors consider many features of microblogs, and then remove
some useless ones through feature extraction, principal component analysis (PCA),
among other methods, to obtain a conclusion similar to reference [45], which holds that whether the microblog contains a URL, the author's authority, and the microblog's length exert the most important impacts on the ranking.
In reference [47], it is believed that the difference between queries shall be con-
sidered for the “learning to rank” method in the microblog ranking, so they proposed a
learning to rank pattern for query modeling; a semi-supervised Transductive Learning
algorithm is used to train the model. The learning to rank scheme generally needs labeled data. As the transductive learning algorithm is used, the ranking of new queries
without any labeled examples is also attempted in this paper. The experiment indicates
that this method can improve the current “learning to rank” method as well as improve
the ranking effect.
Due to the sparsity of short texts, various resources are used to expand the features (i.e., to expand the short text) in short-text classification. In addition, in terms of feature selection, most researchers directly adopt the feature selection algorithms used in text classification, while some researchers combine the characteristics of short texts and modify current feature selection algorithms or propose new ones, thus better reducing the short-text feature dimensionality and removing redundant or irrelevant features, with the hope of obtaining better classification results.
1. Feature expansion
Most microblog text contents are simple and concise, being only a sentence or a few
phrases sometimes, and always depending on the context. If the microblog text is too
short, the problem of serious data sparsity arises during text classification. To solve this
problem, researchers may expand microblogs by introducing external resources; in
addition to WordNet and Wikipedia, external resources such as search results obtained through search engines, news, MeSH, and the Open Directory Project (ODP) are
used to expand the contents of short texts. It should be noted that the expansion here is
also applicable to the microblog document representation mentioned above, except
that the final objective of feature expansion is classification.
Example 1.13 Short text classification based on Wikipedia feature expansion method.
In reference [48], to overcome the sparsity problem in short text classification, the
Wikipedia-based feature expansion method is used to carry out multi-class classifica-
tion of texts.
(1) Each document d can be represented as the TFIDF form; namely, ΦTFIDF(d)=
(TFIDFd(t1), … , TFIDFd(tm)), in which, m is the size of the term space.
(2) Each short text d is mapped into a collection of Wikipedia concepts
defined in advance through explicit semantic analysis (ESA). Define the collec-
tion of Wikipedia articles as W= {a1,. . .,an}, to obtain the ESA feature expression
of document d: ΦESA(d)=(as(d,a1),. . .,as(d,an)), in which, the “as” function
represents the association strength [49] between the document d and the con-
cept, and is calculated as follows:
$$as(d, a_i) = \sum_{t\in d} \mathrm{TF}_d(t)\cdot \mathrm{TFIDF}_{a_i}(t) \qquad (1.51)$$
(3) Perform classification through two steps: first, search in the whole classification
space by calculating the cosine similarity between the document and the class
center to get k candidate classes. Then, use trained SVM in the k classes to carry
out classification.
(4) The above processes can be carried out based on the two feature representations, and at last, the two resulting probabilities, P_TFIDF and P_ESA, calculated through the SVM classifier, are integrated to obtain the final classification probability.
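The ESA mapping in step (2) can be sketched as follows, assuming each Wikipedia concept is available as a precomputed TF-IDF vector; the association strength follows eq. (1.51), while the concept names and the toy text are purely illustrative.

```python
from collections import Counter

def esa_features(short_text_terms, concept_tfidf):
    """Map a short text d to association strengths with Wikipedia concepts.

    as(d, a_i) = sum_{t in d} TF_d(t) * TFIDF_{a_i}(t), eq. (1.51).
    concept_tfidf: {concept_name: {term: tfidf weight}}.
    """
    tf_d = Counter(short_text_terms)
    return {concept: sum(tf * vec.get(t, 0.0) for t, tf in tf_d.items())
            for concept, vec in concept_tfidf.items()}

# Toy usage with two hypothetical concepts
concepts = {'Machine_learning': {'learning': 2.1, 'model': 1.4},
            'Basketball': {'nba': 3.0, 'game': 1.1}}
print(esa_features(['learning', 'model', 'model'], concepts))
```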
2. Feature selection
How to screen the most representative features is one of the focal points and difficulties of current research on short text classification.
In reference [56], the method of using eight features of Twitter to classify micro-
blogs into five classes has been proposed. These five classes are news, events,
opinions, deals, and personal messages; and the eight extracted features are: author-
ship, the presence of abbreviations and slangs, the presence of event related phrases,
the presence of emotion words, the presence of emphasizing words (such as veeery,
meaning the high degree of “very”), the presence of symbols of currency or percent,
the presence of @ username at the beginning of the microblog, and the presence of @
username in the middle of the microblog.
The author represented each microblog with the above features, and designed a
Naive Bayes classifier to classify the microblogs. The experimental result indicates
that the above feature representation has significant improvements relative to the
feature representation method for common bag of words, and that the classification
has achieved an accuracy of more than 90%.
In reference [57], a simple, scalable, and non-parametric method has been proposed
for the classification of short texts, and its main idea is as follows: first, select some
guiding words representative to the topic as the query words, to query the texts to be
classified; then, make full use of the information search technology to return the query
result; finally, select 5~15 features by voting, to represent and classify the microblogs.
In reference [58], a feature selection method is proposed that first selects words with informative parts of speech as features and then uses HowNet to expand their semantic features, thereby improving the classification performance. In reference [59], short texts are classified according to features extracted from trending topics in Twitter, which achieved certain results.
In reference [62], a method is proposed for calculating the similarity of two documents that share no common terms, which can be used in microblog classification.
Here, J_c(w_i) is the set of positions of a word in d_{c,t}, and λ is the smoothing parameter used in the exponentially weighted moving average (EWMA).
Using suffix arrays (SAs) to classify microblogs has also been proposed in recent years. In reference [65], a string-kernel method with good performance in both space and time is proposed based on SAs and the kernel technique in SVM. In reference [66], a new logistic regression model built over all valid substrings is presented, which is also constructed on top of the SA technique.
In terms of algorithms for microblog classification, to deal with characteristics such as the fast-changing nature of microblog texts and the rapid obsolescence of training data, some scholars have introduced transfer learning, which is briefly summarized as follows.
In reference [73], assisted learning for partial observation (ALPOS) has been proposed to solve the short-text problem in microblogs; it is a transfer learning method based on feature representation. Under this method, a microblog text is viewed as part of an ordinary long text, and the reason the text is short is that some characters have not been observed. Long texts are used as the source (auxiliary) data to improve the effect of microblog classification. The method extends the framework of the self-learning method and requires that the source data and the target data share the same feature space and label space; labeled source data are also necessary.
The advantages of this transfer learning method lie in its simple operation and significant effect, whereas its disadvantages are that it is only applicable when the source domain and the target domain are quite similar, that it requires the source and target domains to have the same class space and feature space, and that the efficiency of its iteration-based training is not high. When some instances in the source domain conform to the distribution of the target domain, the instance-based transfer learning method is applicable.
In reference [74], a transfer learning method is proposed to solve the problem of transfer classification of short and sparse text (TCSST); it is an instance-based method in the inductive transfer learning setting and extends the TrAdaBoost framework. To address the sparsity of labeled data in the source data, a semi-supervised learning method is adopted to sample from the original data. The classifier is then trained within the TrAdaBoost framework on the original labeled data, the sampled originally unlabeled data, and the labeled target data; the method was experimentally verified on the 20-Newsgroups collection and a real seminar review dataset.
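To make the instance-based idea concrete, the sketch below shows only the core TrAdaBoost-style reweighting loop that TCSST builds on, not the full semi-supervised TCSST procedure: source instances that the current weak learner misclassifies are down-weighted, while misclassified target instances are up-weighted. The choice of weak learner, the number of rounds, and the clipping of the error rate are my own assumptions.

```python
# Simplified TrAdaBoost-style instance reweighting (core loop only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tradaboost_weights(Xs, ys, Xt, yt, rounds=10):
    n, m = len(Xs), len(Xt)
    X = np.vstack([Xs, Xt]); y = np.concatenate([ys, yt])
    w = np.ones(n + m) / (n + m)
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n) / rounds))   # fixed source factor
    for _ in range(rounds):
        clf = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        err = clf.predict(X) != y
        # weighted error on the target portion only
        eps = np.sum(w[n:] * err[n:]) / np.sum(w[n:])
        eps = min(max(eps, 1e-6), 0.499)
        beta_t = eps / (1.0 - eps)
        w[:n] *= np.power(beta_src, err[:n])                 # shrink bad source instances
        w[n:] *= np.power(beta_t, -err[n:].astype(float))    # boost bad target instances
        w /= w.sum()
    return w
```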
A common practice is to identify similar users or similar items from the rating data of a large number of users, and then use such similarities to recommend items to the current user; this practice is referred to as collaborative filtering.
Recommendation is a common application form of IR. In this application, a user
can be taken as an “inquiry,” an item as a “document;” therefore, it is a process to
obtain a “document” best matching the “inquiry” from these “documents.” Thus, the
basic model and algorithms of IR can be applied in a recommendation system,
especially a content-based recommendation system. A recommendation system
based on collaborative filtering is somewhat different from an ordinary IR system.
Therefore, in the presentation below, we will mainly present a recommendation
system based on collaborative filtering.
In general, a recommendation system contains the following three elements.
User: the target of a recommendation system. The recommendation effect can
vary greatly with the amount of user information mastered by the system. User
information includes a user’s personal attributes, such as age, gender, and occupa-
tion, as well as historical exchange data of the user with the system, which can more
directly reflect user preference.
Item: the content of a recommendation system. Here, item is a concept in a broad sense; it can be a commodity such as a book, a piece of music, or a film, as well as an information item such as a news story, an article, or a microblog post.
Rating of user preference for items: in websites such as Amazon or Netflix, scores
are normally used to indicate user’s preference for items. The commonly used scores
are divided into five levels (1~5), with ratings from low to high. Depending on
applications, the rating of preference degree can be in different forms, and sometimes
two levels (like or dislike) are used for the rating.
Recommendation algorithm: given user u and item i, estimate the rating r of the user for the item. The recommendation algorithm can be abstracted as a function f(u, i) = r. Essentially, it estimates the degree of matching between a user and an item; to obtain precise recommendation results, it is necessary to mine the deep characteristics of the user and the item, as well as the interactive relationship between them. Research on recommendation algorithms studies how to model the various latent factors that may affect recommendation.
As mentioned above, recommendation algorithms can be classified into two
major categories according to different types of information used.
(1) Content-based recommendation: the basic idea is to analyze the personal attributes of users and the attributes of items, and to calculate the degree of matching between a candidate item and the user in combination with the user's historical preferences, so that items with a high matching degree are recommended. This technique completes recommendations directly from the information of users and items, and was used extensively in early recommendation systems; however, because it establishes a direct relationship between item characteristics and user preferences, it can only push items similar to those in the user's history, cannot produce novel content, and scales poorly [75–80].
(2) Recommendation based on collaborative filtering: the evaluation of users on
items constitutes the user-item matrix (see Table 1.4). Based on the basic
assumption that “similar users have similar preferences,” items preferred by
users similar to the target user can be pushed to the latter. Collaborative
filtering does not take content into account and relies entirely on the historical interactions between users and items; therefore, it can cope with the frequent lack of user and item description information. Owing to its effectiveness, collaborative filtering is now extensively applied in various commercial systems, such as news recommendation in GroupLens, movie recommendation in MovieLens, and music recommendation in Ringo.
[Table 1.4: user–item rating matrix; rows are users, columns are items.]
In an actual system, to obtain better recommendation results, the two methods can be
combined, such as the MatchBox recommendation system of Microsoft [81]. In the
research community, collaborative filtering has attracted extensive attention for two reasons: first, its input (the user-item matrix) is more accessible to many systems, whereas users' personal attributes are difficult to obtain (except for a small number of social networking websites); second, it is of greater theoretical value to mine the user-item matrix by integrating multiple sources of
information. Due to limitations of space, this section mainly focuses on how social
network information is integrated in the collaborative filtering technology. In the
following section, the meaning of social recommendation will be presented first, and
then the memory-based social recommendation and model-based social recommen-
dation methods will be presented.
A social relation is defined over the user set; for example, in a social relation network G = <V, E>, V is the set of users and <u, v> ∈ E indicates an edge connecting user u and user v. The social relation underlying such a network can be expressed with a user-user matrix T ∈ R^{m×m}: if there is a relationship between user u and user v, then the matrix element T_{u,v} = 1; otherwise T_{u,v} = 0.
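A tiny sketch of building such a user-user relation matrix T from a list of edges follows; the users are assumed to be indexed 0 to m−1, and the relation is assumed to be mutual (for a directed relation such as Epinions trust, the symmetric assignment would simply be dropped).

```python
# Build the user-user relation matrix T from a list of (u, v) edges.
import numpy as np

def relation_matrix(edges, m):
    T = np.zeros((m, m))
    for u, v in edges:
        T[u, v] = 1        # T[u, v] = 1 if <u, v> is an edge, 0 otherwise
        T[v, u] = 1        # assuming an undirected (mutual) relation
    return T

print(relation_matrix([(0, 1), (1, 2)], m=4))
```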
Obviously, social network relation is just one of the social relations, and social rela-
tions extracted based on different applications are also different. For example, in the
Epinions website [82], users can not only evaluate various items but also score the
evaluations made by other users. In addition, the Epinions website uses the concept of
"trust": users can choose to directly "trust" or "block" certain users; in this manner, every user has his or her own set of trusted users and may in turn be trusted by other users, thus forming a complex trust network (see Figure 1.3).
Despite their diversified forms, social relations have the potential to improve the effect of collaborative filtering as long as they reflect the similarities between users. In evaluating prediction accuracy, $\tilde r_{u,i}$ and $r_{u,i}$ denote the predicted and true rating values, respectively, and T is the number of predicted entries.
(2) Similar items (as judged from the historical ratings they have received) receive similar ratings from new users.
Here, N(u) denotes the set of users most similar to user u, whose size |N(u)| must be specified in advance; $\bar r_u$ is the average of user u's ratings over all items; and sim(u, v) gives the similarity between user u and user v. The similarity calculation sim(u, v) is a critical step in the memory-based collaborative filtering algorithm, and the most commonly used measures are Pearson's correlation coefficient [40] and cosine similarity [85].
Pearson’s correlation coefficient is used to reflect the degree of linear correlation
of two variables, and the calculation formula is as follows:
$\mathrm{sim}(u, v) = \dfrac{\sum_{i \in I}(r_{u,i} - \bar r_u)(r_{v,i} - \bar r_v)}{\sqrt{\sum_{i \in I}(r_{u,i} - \bar r_u)^2}\,\sqrt{\sum_{i \in I}(r_{v,i} - \bar r_v)^2}}$  (1.56)
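A direct implementation of eq. (1.56) is sketched below; ratings are passed as dictionaries mapping item ids to scores, and the means are taken here over the co-rated items (a simplifying assumption, since the book does not fix the set I explicitly).

```python
# Pearson correlation between two users' ratings over their co-rated items.
import math

def pearson_sim(ratings_u, ratings_v):
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den = math.sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common)) * \
          math.sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    return num / den if den else 0.0
```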
[Table 1.5: example user–item rating matrix for users U1–U4 and items I1–I5.]
sim(U_1, U_2) = 0.6923
sim(U_1, U_3) = 0.5183
sim(U_1, U_4) = 0.3
Therefore, the rating of user U_1 for item I_2 can be calculated as follows:
$\tilde r_{U_1, I_2} = \bar r_{U_1} + \dfrac{\mathrm{sim}(U_1,U_2)(r_{U_2,I_2} - \bar r_{U_2}) + \mathrm{sim}(U_1,U_3)(r_{U_3,I_2} - \bar r_{U_3}) + \mathrm{sim}(U_1,U_4)(r_{U_4,I_2} - \bar r_{U_4})}{\mathrm{sim}(U_1,U_2) + \mathrm{sim}(U_1,U_3) + \mathrm{sim}(U_1,U_4)} = 4.3 + \dfrac{0.6923 \times (2-3) + 0.5183 \times (4-2.5) + 0.3 \times (5-3)}{0.6923 + 0.5183 + 0.3} = 4.7535$
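The following few lines reproduce this worked prediction; the similarities, the neighbors' ratings for I2, the neighbors' mean ratings, and U1's mean rating of 4.3 are all taken from the example above.

```python
# User-based prediction for U1 on item I2, with values from the worked example.
sims = {"U2": 0.6923, "U3": 0.5183, "U4": 0.3}
rating_i2 = {"U2": 2, "U3": 4, "U4": 5}
mean_rating = {"U2": 3, "U3": 2.5, "U4": 3}
mean_u1 = 4.3

num = sum(sims[v] * (rating_i2[v] - mean_rating[v]) for v in sims)
den = sum(sims.values())
print(mean_u1 + num / den)   # about 4.7535, as in the example
```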
Analogously, item-based collaborative filtering predicts
$\tilde r_{u,i} = \bar r_i + \dfrac{\sum_{j \in N(i)} \mathrm{sim}(i,j)\,(r_{u,j} - \bar r_j)}{\sum_{j \in N(i)} \mathrm{sim}(i,j)}$  (1.57)
In the same manner, N(i) denotes the set of items most similar to item i, whose size |N(i)| must be specified in advance, and $\bar r_j$ is the average rating received by item j from all users. The calculation of the item similarity sim(i, j) is analogous to that of the user similarity sim(u, v), and measures such as Pearson's correlation coefficient can likewise be used.
[Table 1.6: example trust relationships among users U1–U4.]
The basic idea of memory-based social recommendation methods is that users and their friends share similar preferences; therefore, the degree of trust between users is used directly in place of the rating similarity between users. Taking the user trust relationships in Table 1.6 as an example, the user-based collaborative filtering formula can be applied with these trust values substituted for the similarities.
This method relies on the trust relationships between users and their friends; however, as can be seen from the example above, there is no direct trust value between user U1 and user U3, and when such relationships are largely missing, the effectiveness of the algorithm suffers. Researchers therefore proposed establishing indirect trust relationships between users through trust propagation.
Other researchers proposed the AppleSeed model, which holds that users' trust relationships have a cumulative effect: even if each individual trust path between two users is very weak, a high trust weight can still be obtained as long as there are sufficiently many connecting paths. In the following, we take TidalTrust as an example to introduce the trust propagation process.
Example 1.17 The method to adjust score weight with trust relations (TidalTrust).
Based on the trust relationships between users in Table 1.6, the trust diffusion path
from user U1 to user U3 can be obtained, as shown in Figure 1.5.
Here, u_k ∈ F(i) denotes the neighboring friends of user i, and the neighboring friends whose trust degree is above the threshold are selected as the nodes of propagation. As can be seen from Figure 1.5, in this example U2 and U4 are the neighboring nodes of target user U3; the trust value from user U1 to U2 is 3, the trust value from U1 to U4 is 5, and U1 is directly connected only to U2 and U4, so U2 and U4 are marked 3 and 5, respectively. For reaching U3, the trust threshold for a node is 5; therefore, only U4 satisfies the condition. Thus, the trust degree between users U1 and U3 is obtained as follows:
$\mathrm{trust}(U_1, U_3) = \dfrac{5 \times 6}{5} = 6$
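A small sketch of this TidalTrust-style aggregation follows: the inferred trust from a source to a target is the trust-weighted average of the neighbors' trust in the target, restricted to neighbors whose direct trust from the source meets the threshold. The value 6 for U4's trust in U3 is read off the computation above; U2's trust in U3 is set to 0 as a placeholder because U2 does not meet the threshold and therefore does not contribute.

```python
# Trust-weighted aggregation over qualifying neighbors, reproducing trust(U1, U3) = 6.
def propagate_trust(direct_from_source, trust_in_target, threshold):
    qualified = [v for v, t in direct_from_source.items() if t >= threshold]
    num = sum(direct_from_source[v] * trust_in_target[v] for v in qualified)
    den = sum(direct_from_source[v] for v in qualified)
    return num / den if den else 0.0

# trust(U1,U2)=3, trust(U1,U4)=5; only U4 meets threshold 5, and trust(U4,U3)=6.
print(propagate_trust({"U2": 3, "U4": 5}, {"U2": 0, "U4": 6}, threshold=5))
```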
This trust value can then be substituted into the user-based collaborative filtering formula above to obtain the rating of U1 for item I2.
1. Neighborhood model
The neighborhood model was an improvement on the memory-based recommenda-
tion method; it was proposed in reference [99], with the basic idea to obtain similar
users on the basis of the score similarity between users, and then to predict users’
scores for items by using the scores given by similar users. The calculation formula is
as follows:
$\tilde r_{u,i} = \sum_{v \in V} r_{v,i}\, w_{u,v}$  (1.59)
where wu,v is the interpolation weight, which is used to integrate the score values of
different users. This idea is basically the same as that of the memory-based colla-
borative filtering algorithm, except that after selecting similar users, it does not use
similarity as the weight to integrate the score values of users; instead, a regression
model is used to obtain the values of these weights through learning and training,
and the optimized target function is expressed as follows:
$\min_{w} \sum_{v \in V} \Big( r_{v,i} - \sum_{j=1}^{n} r_{v,j}\, w_{u,v} \Big)^{2}$  (1.60)
The interpolation weight parameters obtained through learning can better fit the data
deviation and improve the recommendation effect.
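One concrete way to realize this learning step is sketched below, under my own simplifying reading: the target user's known ratings are regressed by least squares on the other users' ratings over the same items, and the fitted coefficients serve as the interpolation weights w_{u,v}. The toy rating matrix is made up for illustration.

```python
# Least-squares estimation of interpolation weights (toy example).
import numpy as np

R = np.array([[5., 3., 4., 4.],     # target user u (ratings to be fitted)
              [3., 1., 2., 3.],     # other users v in V
              [4., 3., 4., 3.],
              [1., 5., 5., 2.]])
target, others = R[0], R[1:]
# Solve min_w sum_j (r_{u,j} - sum_v w_{u,v} r_{v,j})^2
w, *_ = np.linalg.lstsq(others.T, target, rcond=None)
print(w)                             # learned interpolation weights w_{u,v}
print(w @ others)                    # fitted ratings for the target user
```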
Some scholars introduced bias information on the basis of this work [100, 101]:
$\tilde r_{u,i} = \mu + b_u + b_i + \sum_{j \in R(u)} (r_{u,j} - b_{u,j})\, w_{i,j} + \sum_{j \in N(u)} c_{i,j}$  (1.61)
In this model, μ is the global average rating, b_u the user bias, b_i the item bias, and R(u) the set of items already rated by user u; the parameter w_{i,j} measures the tightness between items; N(u) is the set of items on which user u has given implicit feedback; and c_{i,j} represents the bias of implicit-feedback item j on item i.
The matrix factorization model approximates the rating matrix as R ≈ U^T V, where U ∈ R^{k×m} and V ∈ R^{k×n} hold the k-dimensional latent feature vectors of users and items, respectively. The latent representations of users and items are obtained by minimizing the sum of squared residuals,
$\min_{U, V}\ \frac{1}{2}\sum_{u,i} I_{u,i}\big(R_{u,i} - U_u^T V_i\big)^2 + \frac{\lambda}{2}\big(\|U\|_F^2 + \|V\|_F^2\big),$
where I_{u,i} indicates whether R_{u,i} is observed and λ is the coefficient of the regularization term, which prevents overfitting of the
model. This formula can be solved using the gradient descent method [83, 93]. The RSVD model is in fact the basic framework of matrix factorization and leaves room for improvement. To better fit the user-item matrix, reference [104] introduced a global bias, a user bias, and an item bias, so that the scoring term can be expressed as
$R_{u,i} = \bar r + b_u + b_i + \sum_{k=1}^{K} U_{u,k} V_{i,k}$  (1.63)
where $\bar r$ is the global average rating, b_u the user bias, and b_i the item bias. The corresponding optimization objective is
$\min_{U, V, b}\ \frac{1}{2}\sum_{u=1}^{m}\sum_{i=1}^{n} I_{u,i}\Big(R_{u,i} - \bar r - b_u - b_i - U_u^T V_i\Big)^2 + \lambda_1 \|U\|_F^2 + \lambda_2 \|V\|_F^2$  (1.64)
where ||U||_F and ||V||_F are the Frobenius norms of matrices U and V, i.e., the square roots of the sums of squares of all their elements, and λ_1 and λ_2 are combination coefficients. Paterek's modeling idea generalizes very well: many potential factors may influence the recommendation result, such as time and geographic location, and they can all be added to the scoring term as bias terms to better fit the user-item ratings. Reference [105] considered matrix factorization from a probabilistic perspective and proposed the probabilistic matrix factorization (PMF) model, which assumes that the users' rating matrix R conforms to a normal distribution with mean P^T Q and variance σ_R^2, and that users with similar ratings have similar preferences.
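A minimal stochastic gradient descent sketch of the biased matrix factorization of eq. (1.63) is given below; each observed rating is approximated by the global mean plus user and item biases plus an inner product of k-dimensional latent vectors, with L2 regularization. The toy data, learning rate, and regularization coefficient are arbitrary choices of mine.

```python
# SGD for biased matrix factorization (toy data, illustrative hyperparameters).
import numpy as np

def factorize(triples, n_users, n_items, k=5, lr=0.01, lam=0.05, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    bu, bi = np.zeros(n_users), np.zeros(n_items)
    mu = np.mean([r for _, _, r in triples])            # global average rating
    for _ in range(epochs):
        for u, i, r in triples:
            e = r - (mu + bu[u] + bi[i] + U[u] @ V[i])   # prediction error
            bu[u] += lr * (e - lam * bu[u])
            bi[i] += lr * (e - lam * bi[i])
            U[u], V[i] = U[u] + lr * (e * V[i] - lam * U[u]), V[i] + lr * (e * U[u] - lam * V[i])
    return mu, bu, bi, U, V

ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
mu, bu, bi, U, V = factorize(ratings, n_users=3, n_items=3)
print(mu + bu[0] + bi[2] + U[0] @ V[2])    # predicted rating of user 0 for item 2
```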
Recommendation methods integrated with the social relations of users can make
effective use of user relations, and the currently prevailing methods can be classified
into three types.
(1) Co-factorization models: methods of this type factorize the user-item rating matrix and the user relation matrix at the same time, with both factorizations sharing the same user latent feature vectors. The basic assumption is that a user has the same latent representation in the rating-matrix space and in the relation-matrix space, which serves to merge the rating data with the user relation data. Representative models include SoRec and LOCABAL.
(2) Integration models: methods of this type take a weighted combination of the factorization term derived from the friends' ratings with the factorization term derived from the user's own ratings. The basic assumption is that users have preferences similar to those of their friends; therefore, the user's preference term is smoothed with the friends' preference terms. Representative models include STE and mTrust.
(3) Regularization models: methods of this type introduce regularization terms derived from user relations into the matrix factorization model, so that during optimization the distance between a user's and a friend's latent vectors also becomes part of the objective. The basic assumption is similar to that of the integration models: users have latent preferences similar to those of their friends. Representative models include SocialMF and Social Regularization.
Example 1.18 The graphical model method SoRec, which integrates social relations and rating records. Reference [99] proposed SoRec, which integrates users' social relations and rating records by mapping both onto the same latent feature space, thus mitigating data sparsity and improving precision. The graphical model of SoRec is shown in Figure 1.6.
As shown in Figure 1.6, the SoRec model factorizes the user rating matrix into the factors U_i and V_j, and the user relation matrix into the factors U_i and Z_k; U_i, as the shared factor, links the two matrices and thus realizes the integration of user relations. Its optimization objective is of the form
$\min_{U,V,Z}\ \frac{1}{2}\sum_{i,j} I^R_{i,j}\big(R_{i,j} - g(U_i^T V_j)\big)^2 + \frac{\lambda_T}{2}\sum_{i,k} I^T_{i,k}\big(T_{i,k} - g(U_i^T Z_k)\big)^2 + \frac{\lambda_U}{2}\|U\|_F^2 + \frac{\lambda_V}{2}\|V\|_F^2 + \frac{\lambda_Z}{2}\|Z\|_F^2$
where I^R and I^T indicate the observed entries of the rating matrix R and the relation matrix T, and g(·) is the logistic function.
Based on assumption 1, if two users have a friend relationship, their scores for
items should be close to each other. Reference [106] proposed the STE model,
which performs linear combination of the basic matrix factorizing model with
social networks, and makes final score prediction by weighted summation of
the basic score matrix and the trust score matrix. Its graphical model representation is shown in Figure 1.7.
As shown in Figure 1.7, the STE model first factorizes the user-to-item score matrix,
and then makes weighted merging of user’s bias term on an item and the bias term of
friend on the same item, with the weight value being the weight of user’s relation with
friend.
$R_{u,i} = g\Big(\alpha\, U_u^T V_i + (1-\alpha) \sum_{v \in N_u} T_{u,v}\, U_v^T V_i\Big)$  (1.66)
Here, R_{u,i} is the normalized user rating and g(·) is the logistic function. Clearly, reference [106] is only a shallow linear-combination model and does not exploit the bonding role of friend relations in depth. In both reference [106] and reference [99], it is the neighbors' latent feature space that affects a user's rating rather than the user's own latent feature space; therefore, the issue of trust propagation cannot be resolved.
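The STE-style prediction of eq. (1.66) is sketched below: the user's own latent preference is blended with the trust-weighted preferences of the user's friends and squashed by the logistic function. The latent vectors, trust weights, and the value of α are toy inputs of my own.

```python
# STE-style prediction: blend own and friends' latent preferences, then squash.
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def ste_predict(u, i, U, V, T, friends, alpha=0.6):
    own = U[u] @ V[i]
    social = sum(T[u, v] * (U[v] @ V[i]) for v in friends[u])
    return logistic(alpha * own + (1 - alpha) * social)

U = np.array([[0.2, 0.5], [0.4, 0.1], [0.3, 0.3]])        # user latent vectors
V = np.array([[0.6, 0.2], [0.1, 0.7]])                    # item latent vectors
T = np.array([[0, 0.8, 0.2], [0.8, 0, 0], [0.2, 0, 0]])   # trust weights T_{u,v}
friends = {0: [1, 2], 1: [0], 2: [0]}
print(ste_predict(0, 1, U, V, T, friends))                # normalized predicted rating
```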
Based on assumption 2, if two users are related in the social network, their latent vector representations after matrix factorization should be similar. Assumption 2 can be treated as a constraint on the matrix factorization, equivalent to adding an extra regularization term to the previous optimization objective, as indicated below:
$O(U, V) = \frac{1}{2}\big\|R - U^T V\big\|_F^2 + \lambda\big(\|U\|_F^2 + \|V\|_F^2\big) + \sigma L(U)$  (1.67)
where L(U) is the regular term of social networks to user implicit vector. The relevant
work can better fit the recommendation result by designing the mathematical form of
L(U). Specifically, reference [107] has proposed two forms of regularization as
follows:
$L(U) = \sum_{i=1}^{m} \bigg\| U_i - \dfrac{\sum_{f \in F^+(i)} \mathrm{sim}(i,f)\, U_f}{\sum_{f \in F^+(i)} \mathrm{sim}(i,f)} \bigg\|_F^2$  (1.68)
$L(U) = \sum_{i=1}^{m} \sum_{f \in F^+(i)} \mathrm{sim}(i,f)\, \big\| U_i - U_f \big\|_F^2$  (1.69)
where F^+(i) is the set of friends of user i and sim(i, f) is the similarity between user i and user f. Experiments indicate that regularization based on the second formula yields better results.
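The second regularizer, eq. (1.69), can be computed directly as below: for every user i and each friend f, the squared distance between their latent vectors is penalized in proportion to their similarity sim(i, f). The latent vectors, friend lists, and similarity values are toy inputs.

```python
# Value of the social regularization term L(U) of eq. (1.69).
import numpy as np

def social_regularizer(U, friends, sim):
    total = 0.0
    for i, flist in friends.items():
        for f in flist:
            total += sim[(i, f)] * np.sum((U[i] - U[f]) ** 2)
    return total

U = np.array([[0.2, 0.5], [0.4, 0.1], [0.3, 0.3]])
friends = {0: [1, 2], 1: [0], 2: [0]}
sim = {(0, 1): 0.9, (0, 2): 0.4, (1, 0): 0.9, (2, 0): 0.4}
print(social_regularizer(U, friends, sim))   # L(U), added to the objective with weight sigma
```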
In the PLSF model proposed by Shen [108], the regular term is constructed as follows. If there is a social relation between user i and user j, then e_{i,j} = 1; otherwise e_{i,j} = 0. The latent-vector similarity between users i and j is computed as the cosine similarity U_i^T U_j, and Loss(x, y) measures the discrepancy between the discrete binary variable x and the continuous variable y; a loss function of the corresponding form, such as the squared loss, can be used. The regular term requires that the user latent vectors obtained by matrix factorization be as consistent as possible with the relations actually observed between users.
Example 1.19 Matrix decomposition model SocialMF. Reference [91] proposes the
SocialMF model, and the regular term is defined as the calculation result of the
following formula:
$L(U) = \sum_{i=1}^{m} \Big\| U_i - \sum_{j \in N_i} T_{i,j}\, U_j \Big\|_F^2$  (1.71)
where T_{u,v} = 1 if user u and user v are related, and T_{u,v} = 0 otherwise. The formula above requires that a user's latent vector be as close as possible to the aggregate of the latent vectors of the user's neighbors.
To give the matrix factorization model a probabilistic interpretation, references [91] and [108] present corresponding probabilistic generative models for the user-item rating matrix.
Taking reference [91] as an example, suppose the latent vectors of both users and items follow Gaussian distributions, and the latent vector of each user is influenced by the user's neighbors; then the posterior probability of the user and item feature vectors can be expressed as the product of the likelihood of the rating matrix and the prior probabilities of the feature vectors, as indicated by the formula below:
$p(U, V \mid R, T, \sigma_R^2, \sigma_T^2, \sigma_U^2, \sigma_V^2) \propto p(R \mid U, V, \sigma_R^2)\, p(U \mid T, \sigma_U^2, \sigma_T^2)\, p(V \mid \sigma_V^2)$
$= \prod_{u=1}^{N}\prod_{i=1}^{M}\Big[\mathcal N\big(R_{u,i} \mid g(U_u^T V_i), \sigma_R^2\big)\Big]^{I^R_{u,i}} \times \prod_{u=1}^{N}\mathcal N\Big(U_u \,\Big|\, \sum_{v \in N_u} T_{u,v} U_v,\ \sigma_T^2 I\Big) \times \prod_{u=1}^{N}\mathcal N\big(U_u \mid 0, \sigma_U^2 I\big) \times \prod_{i=1}^{M}\mathcal N\big(V_i \mid 0, \sigma_V^2 I\big)$
Other work addresses social influence at the item level. Reference [109] built a user-interest spreading model based on friend relations, which explains both the friend relations and the interaction relations among users through random walks on the users' interest network, thereby combining the recommendation problem and the link prediction problem into one.
All the above studies use the known social network relations; however, under
many circumstances, the relations between users cannot be directly observed. To
solve this problem, many researchers have used the score similarity of users to
simulate user relations. Reference [110] directly used the user score similarity as
degree of trust, and used the trust transfer method to expand the closest neighbors.
Example 1.20 Measuring the degree of trust based on users' rating differences: references [111] and [112] proposed a trust metric based on the rating differences between users, as shown by the following formula:
$u(a, b) = \dfrac{1}{|R_a \cap R_b|} \sum_{i \in R_a \cap R_b} \dfrac{|r_{a,i} - r_{b,i}|}{\max(r)}$  (1.73)
This method calculates the degree of trust between users by averaging the normalized absolute differences of the ratings that user a and user b give to their commonly rated items.
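A direct implementation of eq. (1.73) is sketched below; the maximum rating is assumed to be 5, i.e., a 1-to-5 rating scale.

```python
# Mean absolute rating difference over co-rated items, scaled by the maximum rating.
def rating_difference(ratings_a, ratings_b, max_rating=5):
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return None                       # no co-rated items, value undefined
    return sum(abs(ratings_a[i] - ratings_b[i]) for i in common) / (len(common) * max_rating)

a = {"i1": 5, "i2": 3, "i3": 4}
b = {"i1": 4, "i2": 3, "i4": 2}
print(rating_difference(a, b))            # 0.1 over the two co-rated items
```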
Reference [93] proposed a calculation based on rating-error proportions: it first defines a correctness criterion for the ratings between users, turning the two users' ratings of the same item into a binary correct/incorrect judgment by setting an error threshold. Reference [113] extended this method and proposed a non-binary judgment that adjusts the effect of rating differences on user relations by introducing an error-penalizing factor. Reference [114] proposed a model that merges trust degree with similarity, combining the advantages of trust propagation and nearest-neighbor methods.
In addition, references [89] and [115] proposed improving the recommendation effect by treating user similarity and item similarity as implicit social relations, computing them with Pearson's correlation coefficient, and adding them to the matrix factorization model as regularization terms.
Reference [116] proposed the MCCRF model, based on conditional random fields, to establish links among users' ratings; the conditional probabilities between different ratings are estimated using user rating similarity and item similarity, and the model parameters are trained with the MCMC method.
1.5 Summary
This chapter presented the research development of IR in social networks based
on the three typical applications of search, classification, and recommendation.
In terms of search, we mainly summarize the current work from three aspects:
query representation, document representation, and similarity calculation. In
terms of classification, we mainly summarize the progress of short text classifi-
cation based on the two dimensions of features and algorithms. In terms of
recommendation, we mainly present how to add social information into tradi-
tional recommendation from the memory-based and model-based perspectives to
increase the effect of collaborative recommendation.
As shown in this chapter, IR in social networks mainly involves the representation of, and computation over, short texts. Because the texts are short, it is common practice to expand them when constructing their representations. Since social networks are rich in information, it is an interesting topic to explore how such information, especially social networking information, can be used to improve text representation. In terms of computational models, a model that can merge multiple features is usually adopted to introduce the characteristics of microblogs.
We believe that the future development trend of IR can be summarized as
follows:
(1) Integrating more social network-specific information: for example, information such as the age, gender, and geographic location of users in social networks has been proved useful in query auto-completion tasks [117]. Such information can be expected to benefit retrieval, classification, and recommendation in social networks more broadly.
References
[1] Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information
Retrieval. Cambridge University Press. 2008.
[2] Charu Aggarwal. Social Network Data Analytics. Springer, March, 2011.
[3] Hans Peter Luhn. A statistical approach to mechanized encoding and searching of literary
information. IBM Journal, 1957: 309–317.
[4] Gerard Salton, Andrew Wong, ChungShu Yang. A vector space model for automatic indexing.
Communications of ACM, 1975: 613–620.
[5] Gerard Salton, Christopher Buckley. Term-weighting approaches in automatic text retrieval.
Information Processing & Management, 1988, 24(5): 513–523.
[6] Amit Singhal, Christopher Buckley, Mandar Mitra. Pivoted document length normalization. In
Proceedings of the 34th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR), 1996: 21–29.
[7] Melvin Earl Maron, John L Kuhns. On relevance, probabilistic indexing and information retrie-
val. Journal of the ACM, 1960: 216–244.
[8] Stephen Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, Mike Gatford.
Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC). Gaithersburg,
USA, November 1994.
[9] James Callan, Bruce Croft, Stephen Harding. The INQUERY retrieval system. In Proceedings of
3th international conference on Database and Expert Systems Applications, 1992: 78–83.
[10] Fabio Crestani, Mounia Lalmas, Cornelis J. Van Rijsbergen, I Campbell. Is this document
relevant?. . . probably: A survey of probabilistic models in information retrieval. ACM
Computing Surveys (CSUR), 1998, 30 (4): 528–552.
[11] Christopher D. Manning, Hinrich Schütze. Foundations of Statistical Natural Language
Processing. MIT Press, 1999.
[12] Huang Changning. What is the main Chinese language information processing technology.
China Computer, 2002, 24.
[13] Jay M.Ponte, Bruce Croft. A language modeling approach to information retrieval. In
Proceedings of the 21st annual international ACM SIGIR conference on Research and
Development in Information Retrieval, New York, USA. 1998.
[14] Chengxiang Zhai, John Lafferty. Model-based feedback in the language modeling approach to
information retrieval. In Proceedings of the 10th ACM international conference on Information
and Knowledge Management (CIKM), pages 403–410, 2001. Atlanta, Georgia, USA.
[15] Adam Berger, John Lafferty. Information retrieval as statistical translation. In Proceeding of the
22rd international ACM SIGIR conference on Research and Development in Information
Retrieval (SIGIR), 1999: 222–229.
[16] Rocchio JJ, Relevance Feedback in Information Retrieval. Prentice-Hall, 1975: 313–323.
[17] Victor Lavrenko, Bruce Croft. Relevance based language models. In Proceedings of the 24th
international ACM SIGIR conference on Research and Development in Information Retrieval
(SIGIR), pages 120–127, 2001. New Orleans, Louisiana, USA.
[18] Wouter Weerkamp, Krisztian Balog, Maarten de Rijke. A Generative Blog Post Retrieval Model
that Uses Query Expansion based on External Collections. In Proceedings of the Joint
Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on
Natural Language Processing (ACL-IJCNLP), 2009.
[19] Miles Efron. Hashtag Retrieval in a Microblogging Environment. In Proceeding of the 33rd
international ACM SIGIR conference on Research and Development in Information Retrieval
(SIGIR), 2010.
[20] Cunhui Shi, Kejiang Ren, Hongfei Lin, Shaowu Zhang. DUTIR at TREC 2011 Microblog Track.
[21] Jaime Teevan, Daniel Ramage, Meredith Ringel Morris. #TwitterSearch: A Comparison of
Microblog Search and Web Search. In Proceedings of Web Search and Data Mining (WSDM),
2011: 9–12.
[22] Xiaoyan Li, W. Bruce Croft. Time-based language models. In Proceedings of the 12th ACM
international conference on Information and Knowledge Management (CIKM), 2003: 469–475.
[23] Wei Bingjie, Wang Bin. Combing Cluster and Temporal Information for Microblog Search [J].
Journal of Chinese Information Processing, 2013.
[24] Giuseppe Amodeo, Giambattista Amati, Giorgio Gambosi. On relevance, time and query
expansion. In Proceedings of the 20th ACM international conference on Information and
Knowledge Management (CIKM), pages 1973–1976, 2011, Glasgow, Scotland, UK.
[25] Mostafa Keikha, Shima Gerani, Fabio Crestani. Time-based relevance models. In Proceedings
of the 34th international ACM SIGIR conference on Research and Development in Information
Retrieval (SIGIR), 2011: 1087–1088.
[26] Stewart Whiting, Iraklis Klampanos, Joemon Jose. Temporal pseudo-relevance feedback in
microblog retrieval. In Proceedings of the 34th European conference on Advances in
Information Retrieval (ECIR), 2012.
[27] Miles Efron, Peter Organisciak, Katrina Fenlon. Improving retrieval of short texts through
document expansion. In Proceedings of the 35th international ACM SIGIR conference on
Research and Development in Information Retrieval (SIGIR), 2012: 911–920.
[28] Yubin Kim, Reyyan Yeniterzi, Jamie Callan. Overcoming Vocabulary Limitations in Twitter
Microblogs. In Proceedings of the Twenty-First Text REtrieval Conference (TREC), 2012.
[29] Li Rui, Wang Bin. Microblog Retrieval via Author Based Microblog Expansion. Journal of Chinese
Information Processing, 2014.
[30] Alexander Kotov, Eugene Agichtein. The importance of being socially-savvy: quantifying the
influence of social networks on microblog retrieval. In Proceedings of the 22nd ACM interna-
tional conference on Information and Knowledge Management (CIKM), 2013: 1905–1908.
[31] Kamran Massoudi, Manos Tsagkias, Maarten Rijke, Wouter Weerkamp. Incorporating Query
Expansion and Quality Indicators in Searching Microblog Posts. in Advances in Information
Retrieval, P. Clough, et al., Editors, 2011: 362–367.
[32] Paul Ferguson, Neil O’Hare, James Lanagan, Owen Phelan, Kevin McCarthy. An Investigation of
Term Weighting Approaches for Microblog Retrieval. in Advances in Information Retrieval, R.
Baeza-Yates, et al., Editors, 2012: 552–555.
[33] Xiaowen Li, Peiquan Jin, Xujian Zhao, Hong Chen, Lihua Yue. NTLM: A Time-Enhanced Language
Model Based Ranking Approach for Web Search Web Information Systems Engineering. WISE
2010 Workshops, D. Chiu, et al., Editors, 2011: 156–170.
[34] Nattiya Kanhabua, Kjetil Nørvåg. Using Temporal Language Models for Document Dating
Machine Learning and Knowledge Discovery in Databases. W. Buntine, et al., Editors, 2009:
738–741.
[35] Miles Efron, Gene Golovchinsky. Estimation methods for ranking recent information. In
Proceedings of the 34th international ACM SIGIR conference on Research and Development in
Information Retrieval (SIGIR), 2011: 495–504.
[36] BP Sharifi. Automatic Microblog Classification and Summarization. University of Colorado,
2010.
[37] Carter Simon, Tsagkias Manos and Weerkamp Wouter (2011) Twitter Hashtags: Joint
Translation and Clustering. In Proceedings of the ACM WebSci’11, 2011:14–17.
[38] Beaux Sharifi, Hutton MA, Kalita JK. Experiments in Microblog Summarization. In Proceedings
of IEEE Second International conference on Social Computing (SocialCom), 2010.
[39] Gustavo Laboreiro, Luís Sarmento, Jorge Teixeira, Eugénio Oliveira. Tokenizing micro-blogging
messages using a text classification approach. In Proceedings of the fourth workshop on
Analytics for noisy unstructured text data, 2010.
[40] Daniel Ramage, Susan Dumais, Daniel Liebling. Characterizing Microblogs with Topic Models.
In Proceedings of ICWSM, 2010.
[41] Einat Amitay, David Carmel, Michael Herscovici, Ronny Lempel, Aya Soffer. Trend detection
through temporal link analysis[J]. Journal of the American Society for Information Science and
Technology (JASIST), 2004, 55(14): 1270–1281.
[42] Philip Yu, Xin Li, Bing Liu. On the temporal dimension of search. In Proceedings of the 13th
international World Wide Web conference on Alternate track papers posters, 2004: 448–449.
[43] Klaus Berberich, Michalis Vazirgiannis, Gerhard Weikum. Time-Aware Authority Ranking[J].
Internet Mathematics, 2005, 2(3): 301–332.
[44] Jing Wan, Sixue Bai. An improvement of PageRank algorithm based on the time-activity- curve.
In Proceedings of IEEE International conference on Granular Computing(GRC), 2009.
[45] Rinkesh Nagmoti, Ankur Teredesai, Martine De Cock. Ranking Approaches for Microblog
Search. In Proceedings of 2010 IEEE/WIC/ACM International conference on Web Intelligence
and Intelligent Agent Technology (WI-IAT), 2010.
[46] Yajuan Duan, Long Jiang, Tao Qin, Ming Zhou, Heung-Yeung Shum. An empirical study on
learning to rank of tweets. In the 23th International Conference on Computational Linguistics
(COLING), 2010: 295–303.
[47] Xin Zhang, Ben He, Tiejian Luo, Baobin Li. Query-biased learning to rank for real-time Twitter
search. In Proceedings of the 21st ACM international conference on Information and Knowledge
Management (CIKM), 2012:1915–1919.
[48] Xinruo Sun, Haofen Wang, Yong Yu. Towards effective short text deep classification. In
Proceedings of the 34th international ACM SIGIR conference on Research and development in
Information Retrieval, 2011: 1143–1144.
[49] Evgeniy Gabrilovich, Shaul Markovitch. Computing semantic relatedness using Wikipedia-based
explicit semantic analysis. In Proceedings of 22nd AAAI Conference on Artificial Intelligence
(AAAI), 2007: 6–12.
[50] Xiang Wang, Ruhua Chen, Yan Jia, Bin Zhou. Short Text Classification using Wikipedia Concept
based Document Representation. In Proceedings of International conference on Information
Technology and Applications (ITA), 2013.
[51] Xuan-Hieu Phan, Le-Minh Nguyen, Susumu Horiguchi. Learning to classify short and sparse
text & web with hidden topics from large-scale data collections. In Proceedings of the 17th
International World Wide Web Conference (WWW), 2008: 91–100.
[52] Mengen Chen, Xiaoming Jin, Dou Shen. Short text classification improved by learning multi-granularity
topics. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), 2011:
1776–1781.
[53] Sheila Kinsella, Alexandre Passant, John G. Breslin. Topic classification in social media using
metadata from hyperlinked objects. In Proceedings of the 33th European conference on
Advances in Information Retrieval (ECIR), 2011: 201–206.
[54] Evgeniy Gabrilovich, Shaul Markovitch. Overcoming the Brittleness Bottleneck using
Wikipedia:Enhancing Text Categorization with Encyclopedic Knowledge. In Proceedings of
the 21st National Conference on Artificial Intelligence (NCAI), 2006: 1301–1306.
[55] Duan Yajuan, Wei Furu, Zhou Ming. Graph-based collective classification for tweets.
Proceedings of the 21st ACM international conference on Information and Knowledge
Management (CIKM), 2012: 2323–2326.
[56] Bharath Sriram, Dave Fuhry, Engin Demir, Hakan Ferhatosmanoglu, Murat Demirbas. Short text
classification in Twitter to improve information filtering. In Proceedings of SIGIR, 2010: 841–842.
[57] Aixi Sun. Short Text Classification Using Very Few Words. In Proceedings of SIGIR2012, 2012:
1145–1146.
[58] Zitao Liu, Wenchao Yu, Wei Chen, Shuran Wang, Fengyi Wu. (2010). Short Text Feature
Selection for Micro-Blog Mining. In Proceedings of the International conference on
Computational Intelligence and Software Engineering, 2010: 1–4.
[59] Danesh Irani, Steve Webb. Study of Trend-Stuffing on Twitter through Text Classification. In
Proceedings of CEAS 2010, July 13–14: 114–123.
[60] Quan Yuan, Gao Cong, Nadia Magnenat Thalmann. Enhancing naive bayes with various
smoothing methods for short text classification. In Proceedings of the 21st International World
Wide Web Conference (WWW), 2012: 645–646.
[61] Mehran Sahami, Timothy D. Heilman. A web-based kernel function for measuring the similarity
of short text snippets. In Proceedings of the 15th International World Wide Web Conference
(WWW), May 23–26, pages 377–386, 2006.
[62] Sarah Zelikovitz, Haym Hirsh. Improving short-text classification using unlabeled background
knowledge to assess document similarity. In Proceedings of the 17th International conference
on Machine Learning (ICML), 2000: 1191–1198.
[63] Thiago Salles, Leonardo Rocha, Gisele Pappa, et al. Temporally-aware algorithms for document
classification. In Proceedings of the 33rd international ACM SIGIR conference on Research and
Development in Information Retrieval (SIGIR), 2010: 307–314.
[64] Kyosuke Nishida, Takahide Hoshide Ko Fujimura. Improving Tweet Stream Classification by
Detecting Changes in Word Probability. In Proceedings of SIGIR2012, 2012: 971–980.
[65] Choon Hui Teo, S. V. N. Vishwanathan. Fast and space efficient string kernels using suffix
arrays. In Proceedings of the 23rd International conference on Machine Learning (ICML), 2006:
929–936.
[66] D. Okanohara, J. ichi Tsujii. Text categorization with all substring features. In Proceedings of
SDM, 2009: 838–846.
[67] Wenyuan Dai, Gui-Rong Xue, Qiang Yang, Yong Yu, Co-Clustering Based Classification for
Out- of-Domain Documents. In Proceedings 13th ACM SIGKDD international conference
Knowledge Discovery and Data Mining (SIGKDD), Aug. 2007.
[68] Rajat Raina, Andrew Y. Ng, Daphne Koller. Constructing Informative Priors Using Transfer
Learning. In Proceedings of the 23rd International conference on Machine Learning (ICML),
2006: 713–720.
[69] Pengcheng Wu, Thomas Dietterich. Improving SVM Accuracy by Training on Auxiliary Data
Sources. In Proceedings 21st International conference Machine Learning (ICML), July 2004.
[70] Andrew Arnold, Ramesh Nallapati, William W. Cohen, A Comparative Study of Methods for
Transductive Transfer Learning. In Proceedings of the Seventh IEEE International conference on
Data Mining Workshops, 2007: 77–82.
[71] Xiao Ling, Gui-Rong Xue, Wenyuan Dai, Yun Jiang, Qiang Yang, Yong Yu. Can Chinese Web
Pages be Classified with English Data Source. In Proceedings of the 17th International con-
ference on World Wide Web (WWW), 2008: 969–978.
[72] Sinno Jialin Pan, Vincent Wenchen Zheng, Qiang Yang, Derek Hao Hu. Transfer Learning for
WiFi-Based Indoor Localization. in Workshop Transfer Learning for Complex Task of the 23rd
Assoc. for the Advancement of Artificial Intelligence (AAAI) Conf. Artificial Intelligence, July
2008.
[73] Dan Zhang, Yan Liu, Richard D. Lawrence, Vijil Chenthamarakshan. Transfer Latent Semantic
Learning?: Microblog Mining with Less Supervision. In Proceedings of the 25th AAAI
Conference on Artificial Intelligence (AAAI), 2011: 561–566.
[74] Guodong Long, Ling Chen, Xingquan Zhu, Chengqi Zhang. TCSST: Transfer Classification of
Short & Sparse Text Using External Data Categories and Subject Descriptors. In Proceedings of
the 21st ACM international conference on Information and Knowledge Management (CIKM),
2012: 764–772.
[75] Paolo Avesani, Paolo Massa, Roberto Tiella. A trust-enhanced recommender system application:
Moleskiing. In Proceedings of the 2005 ACM symposium on applied computing (SAC), 2005.
[76] Paolo Massa, Paolo Avesani. Trust-aware recommender systems. In Proceedings of the 2007
ACM conference on recommender systems(RecSys2007), 2007.
[77] Matthew Richardson, Rakesh Agrawal, Pedro Domingos. Trust management for the semantic
web. The Semantic Web-ISWC 2003. Springer Berlin Heidelberg, 2003: 351–368.
[78] R. Guha, Ravi Kumar, Prabhakar Raghavan, Andrew Tomkins. Propagation of trust and distrust.
In Proceedings of the 13th international conference on World Wide Web (WWW), 2004: 403–412.
[79] Kailong Chen, Tianqi Chen, Guoqing Zheng, et al. Collaborative personalized tweet recom-
mendation. In Proceedings of the 35th international ACM SIGIR conference on Research and
Development in Information Retrieval (SIGIR), 2012: 661–670.
[80] Jilin Chen, Rowan Nairn, Les Nelson, et al. Short and tweet: experiments on recommending
content from information streams. In Proceedings of the SIGCHI Conference on Human Factors
in Computing Systems, 2010: 1185–1194.
[81] David Stern, Ralf Herbrich, Thore Graepel. Matchbox: Large Scale Online Bayesian
Recommendations. In Proceedings of the 18th International World Wide Web Conference
(WWW), 2009.
[82] http://www.epinions.com/.
[83] Paolo Massa, Bobby Bhattacharjee. Using trust in recommender systems: an experimental
analysis[J]. Trust Management. Springer Berlin Heidelberg, 2004: 221-235.
[84] Peng Cui, Fei Wang, Shaowei Liu, et al. Who should share what?: item-level social influence
prediction for users and posts ranking. In Proceedings of the 34th international ACM SIGIR
conference on Research and Development in Information Retrieval (SIGIR), 2011: 185–194.
[85] Ibrahim Uysal, Bruce Croft. User oriented tweet ranking: a filtering approach to microblogs. In
Proceedings of the 20th ACM international conference on Information and Knowledge
Management (CIKM), 2011: 2261–2264.
[86] Zi Yang, Jingyi Guo, Keke Cai, et al. Understanding retweeting behaviors in social networks. In
Proceedings of the 19th ACM international conference on Information and Knowledge
Management (CIKM), 2010: 1633–1636.
[87] Maksims Volkovs, Richard Zemel. Collaborative Ranking With 17 Parameters. In Proceedings of
the 26th Neural Information Processing Systems (NIPS), 2012: 2303–2311.
[88] Suhrid Balakrishnan, Sumit Chopra. Collaborative ranking. In Proceedings of the fifth ACM
international conference on Web Search and Data Mining(WSDM), 2012: 143–152.
[89] Hao Ma. An experimental study on implicit social recommendation. In Proceedings of the 36th
international ACM SIGIR conference on Research and Development in Information Retrieval
(SIGIR), 2013: 73–82.
[90] Punam Bedi, Harmeet Kaur, Sudeep Marwaha. Trust Based Recommender System for Semantic
Web. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI),
2007: 2677–2682.
[91] Mohsen Jamali, Martin Ester. A matrix factorization technique with trust propagation for
recommendation in social networks. In Proceedings of the fourth ACM conference on
Recommender systems (RecSys), 2010: 135–142.
[92] Sherrie Xiao, Izak Benbasat. The formation of trust and distrust in recommendation agents in
repeated interactions: a process-tracing analysis. In Proceedings of the 5th international
conference on Electronic commerce, 2003: 287–293.
[93] John O’Donovan, Barry Smyth. Trust in recommender systems. In Proceedings of the 10th
international conference on Intelligent user interfaces. ACM, 2005: 167–174.
[94] Ioannis Konstas, Vassilios Stathopoulos, Joemon M Jose. On social networks and collaborative
recommendation. In Proceedings of the 32nd international ACM SIGIR conference on Research
and Development in Information Retrieval (SIGIR), 2009: 195–202.
[95] Paolo Massa, Paolo Avesani. Trust-aware collaborative filtering for recommender systems.On
the Move to Meaningful Internet Systems 2004 : CoopIS, DOA, and ODBASE. Springer Berlin
Heidelberg, 2004: 492–508.
[96] Mohsen Jamali, Martin Ester. TrustWalker: A random walk model for combining trust-based and
item-based recommendation. In Proceedings of the 15th ACM SIGKDD international conference
on Knowledge Discovery and Data mining (SIGKDD), 2009: 397–406.
[97] Jennifer Ann Golbeck. Computing and Applying Trust in Web-based Social Networks[D].
University of Maryland College Park, 2005.
[98] Cai-Nicolas Ziegler. Towards decentralized recommender systems[D]. University of Freiburg,
2005.
[99] Hao Ma, Haixuan Yang, Michael R. Lyu, Irwin King. Sorec: Social recommendation using
probabilistic matrix factorization. In Proceedings of the 17th ACM conference on Information
and Knowledge Management (CIKM). ACM, 2008: 931–940.
[100] Yehuda Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering
model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge
Discovery and Data mining (SIGKDD), 2008: 426–434.
[101] Yehuda Koren. Factor in the neighbors: Scalable and accurate collaborative filtering[J]. ACM
Transactions on Knowledge Discovery from Data (TKDD), 2010, 4(1): 1.
[102] Quan Yuan, Li Chen, Shiwan Zhao. Factorization vs. regularization: Fusing heterogeneous
social relationships in top-n recommendation. In Proceedings of the fifth ACM conference on
Recommender systems (RecSYS), 2011, 245–252.
2.1 Introduction
Information diffusion is a process in which people transmit, receive, and respond to information using symbols and signals, achieving mutual understanding and influence by exchanging opinions, ideas, and emotions. Information diffusion in
online social networks specifically refers to the diffusion of information realized via
online social networks.
Information diffusion in online social networks has the following character-
istics: first, the release and reception of information is extremely simple and fast;
users are able to release and receive information through mobile phones and other
mobile devices anytime and anywhere; second, information diffuses like "nuclear fission": as soon as a message is released, it is pushed by the system to all of the author's followers, and once forwarded, it instantly spreads among another group of followers; third, everyone has the opportunity to become an opinion leader;
each internet user is allowed to play an important role in the occurrence, fermen-
tation, dissemination, and sensationalization of a sudden event; finally, it takes
on the form of “We Media;” from super stars to grassroots, everyone can build
their own “media,” where they can express themselves freely, create messages for
the public, and communicate ideas, while receiving information from all direc-
tions on the platform. In general, social networks have sped up the process and
expanded the breadth of information diffusion. As a result, people are able to
access richer information through social networks.
Studying the laws of information diffusion in social networks can deepen our understanding of social networking systems and social phenomena, and thereby further our understanding of the topologies, communication capabilities, and dynamic behaviors of complex networks. In addition, the study of information diffusion is also conducive to research on other problems such as model discovery, identification of influential nodes, and personalized recommendation.
The vigorous development of social networks provides a rich data source for researchers, allowing them to study the mechanisms of information diffusion and to understand its laws on the basis of massive real data; phased results have already been achieved. Information diffusion in social networks mainly involves the underlying network structure, the network users, and the diffusing information, among other factors, and relevant research has been carried out on
these factors. Research results based on the network structure mainly include the
independent cascade model, linear threshold model, and their extended versions.
Research based on the state of the users mainly include the epidemic model and the
influence diffusion model. Research based on the characteristics of information
mainly include the multi-source information diffusion model and competitive diffu-
sion model. Considering that explicit diffusion models cannot explain certain phenomena of information diffusion, some researchers have studied methods of information diffusion prediction, which use certain given data to predict the outcome of diffusion, such as the popularity of hot topics in social networks.
Considering numerous and jumbled sources of information in social networks,
some researchers studied the method of information source location to look for the
original sources of information and track their diffusion paths based on the distribu-
tion status of the existing information, thereby providing support for such applica-
tions as network security.
This chapter is organized as follows: Section 2.2 analyzes the main factors that affect the diffusion of information in social networks. Sections 2.3, 2.4, and 2.5 describe the diffusion models and application examples based on network structure, group status, and information characteristics, respectively. Section 2.6 introduces methods, as well as application examples, for predicting the state of information diffusion from given data. Section 2.7 describes information source location methods and analyzes a few cases. Finally, the challenges and prospects of research on information diffusion in social networks are discussed.
reach a consensus [3, 4], whereas weak ties are usually formed between individuals
who are not connected closely or frequently, and these individuals are usually
dissimilar from one another; therefore, weak ties can provide new information,
serving as sources of diverse information; thus, weak ties play a more important
role in a wide range of information diffusion than strong ties [1, 3, 5, 6]. The density
of networking indicates the degree to which participants in the social network are
interconnected. The closeness between individuals in a network reduces uncer-
tainty and generates a sense of belonging, which may enhance trust between the
members and facilitate the diffusion of information [3, 7].
2.2.3 Information
Different from social networks in the actual world which are formed based on factors
such as geographical location, common activities, and kinship, users in online social
networks communicate with each other and establish connections mainly through
releasing, sharing, commenting, and forwarding information. Therefore, information
in online social networks carries all the records of users’ online activities. The
information itself has distinctive characteristics such as timeliness, multi-source
concurrency, and subject diversity, which plays an indispensable role in the analysis
of information diffusion.
Multi-source information implies that users in a network acquire information not
only through links in the online social network but also through factors outside the
network. For many traditional media or online media, online social networks can
allow more users to access high-quality information, which is an effective way for
them to test new models of news report and explore new channels of communication;
for online social networks, participation of traditional media and online media leads
to massive external information, which, coupled with the real social networks of their
own, safeguards the quality and scope of information diffusion.
Some messages, especially those in the same category, can have mutual influ-
ence when diffused simultaneously in social networks, which distinguishes its rule of
diffusion from those of independent information [11]. In fact, the simultaneous
diffusion of multiple inherently related messages is widely seen in social networks;
for example, “Evergrande winning championship” and “Wang Feng’s confession,”
“Red Cross” and “Guo Mei Mei,” “flu,” and “Banlangen,” and so on. Studying the
multi-message diffusion mechanism in social networks holds great significance.
who had also participated in the same activity; on this basis, a threshold model of collective behavior was proposed [12]. Drawing upon the idea of such a threshold, researchers have conducted extensive studies, among which the linear threshold model is the most widely recognized.
In the linear threshold model, each node v is assigned a threshold θ(v) ∈ [0, 1], representing its susceptibility to influence. A node w adjacent to node v influences v with a non-negative weight b_{v,w}, and the sum of the weights b_{v,w} over all neighbors w of node v is less than or equal to 1.
An inactive node v is activated only if the sum of the influence of its active neighbor
nodes is greater than or equal to its threshold, as shown in Equation 2.1; i.e., the
decision of an individual in the network is subject to the decisions of all its neighbor
nodes, and the active neighbor nodes of node v can participate in the activation of v
multiple times. Algorithm 2.1 shows the implementation of the linear threshold model.
$\sum_{w\,:\,\text{active neighbor of}\ v} b_{v,w} \ \ge\ \theta_v$  (2.1)
[Figure 2.1: Diffusion process of the linear threshold model, starting from the initial users at time step 0 and proceeding through time steps 1–3.]
The diffusion process based on linear threshold model is shown in Figure 2.1:
Time Step 0: node a is activated.
Time Step 1: node a’s influence on node b is 0.5; node a’s influence on node c is 0.2.
At this point, the influence on node b is 0.5, greater than its threshold of 0.2; thus
node b is activated.
Time Step 2: node b’s influence on node c is 0.3; node b’s influence on node d is 0.5;
node a’s influence on node c is 0.2. At this point, node c is subject to the influence of
0.3 + 0.2 = 0.5, greater than its threshold of 0.4; thus, node c is activated.
Time Step 3: node c’s influence on node d is 0.1; node c’s influence on node e is 0.2;
node b’s influence on node d is 0.5. At this point, node d is subject to the influence
of 0.5 + 0.1 = 0.6, greater than its threshold of 0.5; thus, node d is activated.
Time Step 4: node d’s influence on node e is 0.2, node c’s influence on node e is 0.2.
At this point node e is subject to the influence of 0.2 + 0.2 = 0.4, less than its threshold
of 0.6. In this time step, no new node is activated; thus, the diffusion stops.
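A minimal simulation of this diffusion is sketched below, using the edge weights and thresholds stated in the example. One caveat: the worked example treats influence exactly equal to the threshold (node d at time step 2) as insufficient, so the comparison here is strict; eq. (2.1) itself uses greater-than-or-equal, and the tie is a boundary case.

```python
# Linear threshold diffusion on the example graph of Figure 2.1.
weights = {("a", "b"): 0.5, ("a", "c"): 0.2, ("b", "c"): 0.3, ("b", "d"): 0.5,
           ("c", "d"): 0.1, ("c", "e"): 0.2, ("d", "e"): 0.2}
threshold = {"b": 0.2, "c": 0.4, "d": 0.5, "e": 0.6}

active = {"a"}          # node a is the seed
step = 0
while True:
    newly = set()
    for v in threshold:
        if v in active:
            continue
        influence = sum(w for (u, x), w in weights.items() if x == v and u in active)
        if influence > threshold[v]:     # strict, following the worked example
            newly.add(v)
    if not newly:
        break
    step += 1
    active |= newly
    print(f"step {step}: activated {sorted(newly)}")
# Output: step 1: ['b'], step 2: ['c'], step 3: ['d']; node e never reaches 0.6.
```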
The independent cascades model (IC) is a probabilistic model [13, 14] initially
proposed by Goldenberg et al. in the research of a marketing model. The basic
assumption of the model is that whether node u succeeds in activating its neighbor node v is a random event with probability p_{u,v}. The probability that an inactive node is activated by a neighbor that has just entered the active state is independent of the earlier attempts of other neighbors to activate it. In addition, the model assumes that any node u in the network has only one chance to attempt to activate its neighbor node v: whether or not the attempt succeeds, node u exerts no further influence even though it remains active at later times. Such a node is called a noninfluential active node. The implementation of the independent cascade model is described in Algorithm 2.2.
The diffusion process of the independent cascade model based on Figure 2.2 is as
follows:
Time Step 0: node a is activated.
Time Step 1: node a attempts to activate b with a probability of 0.5, and attempts
to activate c with a probability of 0.2, assuming that node b is successfully activated
within this time step.
Time Step 2: node b attempts to activate c with a probability of 0.3 and attempts
to activate d with a probability of 0.5, assuming that node c and node d are success-
fully activated within this time step.
Time Step 3: node c attempts to activate e with a probability of 0.2, node d
attempts to activate e with a probability of 0.2, assuming that all the attempts
within this time step have failed, and no new nodes are activated; thus, the
diffusion ends.
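A corresponding sketch of the independent cascade process is given below. Again, this is a hedged illustration rather than the book's Algorithm 2.2; the activation probabilities are taken from the example, and a fixed random seed makes the run reproducible.

```python
# Minimal sketch of the independent cascade model (hedged reconstruction, not Algorithm 2.2).
# Each newly activated node gets exactly one chance to activate each inactive neighbor.
import random

def independent_cascade(prob, seeds, rng=random.Random(0)):
    """prob[u][v]: probability that u activates v on u's single attempt."""
    active = set(seeds)
    frontier = list(seeds)              # nodes activated in the previous time step
    while frontier:
        new_nodes = []
        for u in frontier:
            for v, p in prob.get(u, {}).items():
                if v not in active and rng.random() < p:
                    active.add(v)
                    new_nodes.append(v)
        frontier = new_nodes            # u then becomes a noninfluential active node
    return active

# Illustrative probabilities matching the example of Figure 2.2 (assumed values).
prob = {
    "a": {"b": 0.5, "c": 0.2},
    "b": {"c": 0.3, "d": 0.5},
    "c": {"e": 0.2},
    "d": {"e": 0.2},
}
print(independent_cascade(prob, seeds={"a"}))
```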
with potentially delayed diffusion [15]. Saito et al. used continuous time and added a time-delay parameter to each edge of the graph, extending the independent cascade model and the linear threshold model to the AsIC (asynchronous independent cascades) and AsLT (asynchronous linear threshold) models [18].
The above methods focus on the reasoning of diffusion behaviors, without taking
into account the influence of content on diffusion. Information diffusion is a complex
social and psychological activity, and the content of diffusion inevitably affects the
action between neighbor nodes. Galuba et al. analyzed the diffusion characteristics of URLs on Twitter in terms of the attractiveness of the URL, user influence, and the rate of diffusion, and used these parameters to construct a linear threshold diffusion model of URLs that predicts which users will mention which URLs. This model can be used for personalized URL recommendation, spam detection, and so on [20]. Guille et al. proposed the T-BaSIC (time-based asynchronous independent cascades) model, which builds on the AsIC model and uses Bayesian logistic regression over three dimensions (the semantic meaning of topics, the network structure, and time) to predict the probability of diffusion between nodes on Twitter over time. The experimental results show that the model predicts the dynamics of diffusion well [21].
Topology-based diffusion models, which extract the network structure and model the interactions between neighboring nodes, have the advantages of simplicity and easy extension, which is particularly valuable for large social networks or when only limited information is available. In addition, such methods draw on a number of disciplines, including graph theory, probability and statistics, sociology, and physics, and thus have a solid theoretical basis. They are suitable for studying cascade behavior, predicting diffusion paths, and providing personalized recommendations based on the degree to which users accept the information.
However, the topology-based diffusion model also has some drawbacks. First, in terms of timeliness, the social network topology available to researchers is static, a snapshot of the original network on which all explicit social relations established before acquisition are recorded; a connection established ten years ago and one established one second ago are collected at the same time, and a connection over which only a single notice was exchanged is treated in the model exactly like a connection carrying intimate communication between two friends. Second, in such a network topology the connection weights are generally equal or identically distributed, implying either that connected users have the same influence on each other or that the influence among users follows a simple probability function. Third, the role of factors outside the network is ignored; users in a social network are affected not only by neighbor nodes but also by traditional media and other external channels, which also drive their participation in information diffusion. Future work should therefore focus on improving existing methods by accounting for the topology of dynamic networks, analyzing user influence, considering external factors, and introducing the corresponding parameters.
1. SI model
The SI model describes diseases that cannot be cured after infection, or infectious diseases that cannot be effectively controlled because of a sudden outbreak, for example, the Black Death and SARS. In other words, in the SI model, once an individual is infected, it remains in the infected state permanently. S(i) and I(j) denote a susceptible individual i and an infected individual j, respectively. Assuming that a susceptible individual becomes infected with average probability β upon contact with an infected individual, the infection mechanism can be expressed by Equation 2.2:
S(i) + I(j) → I(i) + I(j)    with probability β                    (2.2)
At time t, the proportion of S-state individuals in the system is s(t), and that of the I-
state individuals is i(t). Based on this assumption, each infected individual can infect
βs(t) susceptible individuals. As the infected individuals have a proportion of i(t), a
total of βi(t)s(t) susceptible individuals are infected. The dynamical model of the SI
model can be described by the differential equations, as shown in Equation 2.3:
ds(t)/dt = −βi(t)s(t)
di(t)/dt = βi(t)s(t)                    (2.3)
2. SIS model
The SIS model is suitable for describing diseases such as colds and ulcers, for which no effective immunity is acquired after recovery. In the SIS diffusion model, the infected
individuals, as the source of infection, transmit the infectious disease to susceptible
individuals with a certain probability β, while the infected individuals return to the
susceptible state with a certain probability γ. On the other hand, the susceptible
individuals, once infected, become a new source of infection. The infection mechan-
ism can be described by Equation 2.4:
S(i) + I(j) → I(i) + I(j)    with probability β
I(i) → S(i)    with probability γ                    (2.4)
Assuming that the proportion of S-state individuals in the system at time t is s(t) and that of I-state individuals is i(t), and that the growth rate of the infected individuals is βi(t)s(t) − γi(t) when susceptible individuals are fully mixed with infected individuals, the dynamic behavior of the SIS model can be expressed by the differential equations shown in Equation 2.5:
ds(t)/dt = −βi(t)s(t) + γi(t)
di(t)/dt = βi(t)s(t) − γi(t)                    (2.5)
3. SIR model
The SIR model is suitable for diseases that, once caught, confer lifelong immunity, such as smallpox and measles. Assuming that in unit time the infected individuals come into contact with randomly selected individuals in all states at the average probability β, and recover and obtain immunity at an average probability γ, the infection mechanism is described in Equation 2.6:
S(i) + I(j) → I(i) + I(j)    with probability β
I(i) → R(i)    with probability γ                    (2.6)
Assuming that the proportions of individuals in the susceptible, infected, and recovered states at time t are s(t), i(t), and r(t), respectively, and given that the susceptible individuals are well mixed with the infected individuals, the growth rate of the infected individuals is βi(t)s(t) − γi(t), the decline rate of the susceptible individuals is βi(t)s(t), and the growth rate of the recovered individuals is γi(t). The dynamic behavior of the SIR model can then be described as Equation 2.7.
ds(t)/dt = −βi(t)s(t)
di(t)/dt = βi(t)s(t) − γi(t)                    (2.7)
dr(t)/dt = γi(t)
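As a simple illustration of how these dynamics can be explored numerically, the sketch below integrates the SIR equations of Equation 2.7 with a forward Euler step; β, γ, the initial infected fraction, and the step size are arbitrary illustrative values.

```python
# Forward-Euler integration of the SIR equations (Equation 2.7); beta, gamma, and the
# initial condition are illustrative values, not taken from the book.

def sir(beta, gamma, i0, dt=0.01, steps=5000):
    s, i, r = 1.0 - i0, i0, 0.0
    history = [(s, i, r)]
    for _ in range(steps):
        ds = -beta * i * s              # ds/dt = -beta*i*s
        di = beta * i * s - gamma * i   # di/dt = beta*i*s - gamma*i
        dr = gamma * i                  # dr/dt = gamma*i
        s, i, r = s + ds * dt, i + di * dt, r + dr * dt
        history.append((s, i, r))
    return history

traj = sir(beta=0.5, gamma=0.1, i0=0.01)
print("final s, i, r:", tuple(round(x, 3) for x in traj[-1]))
```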
The ideas of epidemic disease models are borrowed to divide nodes in a social
network into the “susceptible” (S) group, to whom the information is still unknown,
the “infected” (I) group, who have already received and keep transmitting the
information, and the “recovered” (R) group, who have received the information but
lost interest in transmitting it. The information diffusion is analyzed based on the
change of these different states [25–27].
Example 2.3 Example of epidemic information diffusion model in social networks
Saeed Abdullah and Xindong Wu studied the diffusion of information on Twitter
using the SIR model [25]. They believed that, similar to traditional epidemiology
which takes into account the birth rate, when the nodes in the infected state (Class I)
in a social network tweet about something, the fans will become a new susceptible
population, and their total number keeps growing (Table 2.1).

Table 2.1: Comparison of parameters in epidemiology and the Twitter dissemination model

Symbol | Meaning in epidemiology | Meaning in the Twitter dissemination model
S(t) | Susceptible individual set at time t | Set of users who can receive tweets from infected individuals at time t
I(t) | Infected individual set at time t | Set of individuals who tweet about certain topics at time t
R(t) | Recovered individual set at time t | Set of infected individuals who stopped tweeting about certain topics within a specific period of time
β | Infection rate | Diffusion rate
μ | Birth rate | Number of new fans each infected individual gains in unit time
γ | Recovery rate | 1/average infection time

Assuming that the new susceptible population is introduced by the infected individuals, they establish a dynamic equation, as shown in Equation 2.8:
dS/dt = −β S(t) I(t) + μ I(t)
dI/dt = β S(t) I(t) − γ I(t)                    (2.8)
dR/dt = γ I(t)
v(t + 1) = Σ_{u ∈ A(t)} I_u(t − t_u)                    (2.9)
where A(t) represents the set of influenced nodes, and node u is influenced at time tu
(tu ≤ t).
This model can be described as follows: suppose nodes u, v, and w are influenced at times t_u, t_v, and t_w, respectively, after which each generates an influence function I_u(t − t_u), I_v(t − t_v), and I_w(t − t_w). The quantity of a message being referred to in the system at time t, represented by v(t), is the sum of these three influence functions.
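The following sketch evaluates Equation 2.9 for a toy set of influenced nodes; the exponential-decay influence functions and the activation times are hypothetical choices used only to illustrate how v(t) is assembled from the individual I_u(t − t_u).

```python
# Sketch of Equation 2.9: the volume v(t+1) is the sum of the influence functions of all
# nodes influenced so far, each evaluated at its own elapsed time t - t_u.
# The exponential-decay influence functions used here are purely illustrative.
import math

def volume(t, influenced, influence_fn):
    """influenced: dict node -> activation time t_u; influence_fn(u, dt) = I_u(dt)."""
    return sum(influence_fn(u, t - t_u) for u, t_u in influenced.items() if t_u <= t)

influenced = {"u": 0, "v": 2, "w": 5}                    # A(t) with activation times
decay = {"u": 0.5, "v": 0.3, "w": 0.8}                   # hypothetical per-node decay rates
influence_fn = lambda u, dt: math.exp(-decay[u] * dt)    # I_u(t - t_u)

for t in range(10):
    print(t + 1, round(volume(t, influenced, influence_fn), 3))
```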
Yang et al. represent the influence function of a node in a nonparametric manner and estimate it by solving a non-negative least squares problem with a reflective Newton method [29]. This model can effectively evaluate the influence of
nodes and can be used to predict the dynamic changes of information diffusion
over time.
The population-based diffusion model describes the dynamic changes in infor-
mation diffusion by describing the state of acceptance of information by users in the
network and the redistribution of individuals between these states. Such models are
widely used in viral marketing, rumor spread analysis, information source location,
among others.
However, there are still some problems with the population-based diffusion model. In the epidemic diffusion models, individuals are classified into only three states (infected, susceptible, and immune), and each state persists for some time as the virus spreads. In social networks, however, an individual's status after receiving a message is highly susceptible to the influence of the surrounding environment or of other information, and such status also changes very fast. As for models based on individual influence, because of the enormous scale of social networks, the large number of nodes, and the fact that different opinion leaders can emerge in different scenarios, it remains a major challenge to establish an influence-based diffusion model by identifying the key nodes and estimating their influence.
Most existing social network information diffusion models assume that information
is not affected by factors outside the networks, and that it is only diffused among
nodes along the edges of social networks. However, in the real world, users in social
networks can access information through multiple channels.
When a node u in a social network releases information k and if none of its
neighbor nodes have released information related to k, it indicates that u is influ-
enced by some unobservable external factors, causing the unprecedented emergence
of information k; however, if its neighbor node has released the related information,
then the fact that u releases information k may be the result of being influenced by its
neighbors or some external factors.
Based on the above idea, Myers et al. argue that, in addition to acquiring information through network links, nodes in social networks also acquire information through external influence. They therefore proposed the model in [30], shown in Figure 2.3, in which the function λ_ext(t) describes the amount of information that a user receives through external influence. If its neighboring nodes have posted relevant information, the user also receives a link-based internal influence λ_int(t) from them. The function η(x) describes the probability that the user will post a microblog after being exposed to the information. Eventually, the user will either post a relevant microblog under this influence or cease reacting to the information.
The total influence on node i is shown in Equation 2.10:

P_exp^(i)(n; t) = C(t/dt, n) · [(Λ_int^(i)(t) + Λ_ext(t)) / (t/dt)]^n · [1 − (Λ_int^(i)(t) + Λ_ext(t)) / (t/dt)]^(t/dt − n)                    (2.10)

where Λ_int^(i)(t) is the expected value of the node obtaining information through internal influence, and Λ_ext(t) is the expected value of the node obtaining information through external influence.
Finally, the probability that user i will post microblogs after being exposed to the information is as shown in Equation 2.11:

F^(i)(t) = Σ_{n=1}^∞ P[i has n exposures] × P[i infected | i has n exposures]
         = Σ_{n=1}^∞ P_exp^(i)(n; t) × (1 − Π_{k=1}^n [1 − η(k)])                    (2.11)

where the function η(x) describes the probability that the user will post a microblog after having read a message.
Myers et al. estimated the parameters of the model by using artificial networks
and the infection time of some nodes. After applying this model to Twitter, they
found that 71% of the information on Twitter was diffused based on the internal
influence of the network, and the remaining 29% was triggered by factors outside
the network.
Assuming that I1, I2, and I12 indicate the number of nodes in each of the three
states, the number of nodes in each state changes with time as follows:
dI1/dt = β1 S(I1 + I12) + δ2 I12 − δ1 I1 − ε β2 I1(I2 + I12)                    (2.12)

dI2/dt = β2 S(I2 + I12) + δ1 I12 − δ2 I2 − ε β1 I2(I1 + I12)                    (2.13)

dI12/dt = ε β2 I1(I2 + I12) + ε β1 I2(I1 + I12) − (δ1 + δ2) I12                    (2.14)
If N is the total number of nodes, then the number of nodes in the S state is:
S = N − I1 − I2 − I12                    (2.15)
ε_critical = σ1σ2 / (σ2(σ1 − 1)),    if σ1 + σ2 ≥ 2
ε_critical = 2(1 + √(1 − σ1σ2)) / (σ1σ2),    if σ1 + σ2 < 2                    (2.16)
where ε reflects the interaction between the two viruses: when ε > ε_critical, both viruses can coexist; when ε = 0, the two viruses are immune to each other; when 0 < ε < 1, they compete with each other; when ε = 1, they do not affect each other; and when ε > 1, they promote each other's diffusion.
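Under the reconstruction of Equations 2.12–2.15 given above, the coupled dynamics can be explored with a simple forward Euler integration, as in the sketch below; all parameter values (β1, β2, δ1, δ2, ε, N) are illustrative assumptions.

```python
# Forward-Euler integration of the two-virus equations (Equations 2.12-2.15) as reconstructed
# above. All parameter values are illustrative; eps controls the interaction between viruses.

def two_virus(beta1, beta2, delta1, delta2, eps, N=10000.0, steps=20000, dt=0.01):
    I1, I2, I12 = 10.0, 10.0, 0.0
    for _ in range(steps):
        S = N - I1 - I2 - I12                                   # Equation 2.15
        dI1 = (beta1 * S * (I1 + I12) + delta2 * I12 - delta1 * I1
               - eps * beta2 * I1 * (I2 + I12))                 # Equation 2.12
        dI2 = (beta2 * S * (I2 + I12) + delta1 * I12 - delta2 * I2
               - eps * beta1 * I2 * (I1 + I12))                 # Equation 2.13
        dI12 = (eps * beta2 * I1 * (I2 + I12) + eps * beta1 * I2 * (I1 + I12)
                - (delta1 + delta2) * I12)                      # Equation 2.14
        I1, I2, I12 = I1 + dI1 * dt, I2 + dI2 * dt, I12 + dI12 * dt
    return I1, I2, I12

print(two_virus(beta1=2e-5, beta2=1.5e-5, delta1=0.05, delta2=0.05, eps=0.5))
```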
Beutel et al. selected two video streaming websites, Hulu and Blockbuster, and two browsers, Firefox and Google Chrome, as case studies. The search volumes of the relevant information were obtained from Google Insights, and the data were fitted with the SI1|2S model. The model fits the data well, which demonstrates its applicability.
Multi-source information diffusion analysis, by modeling the sources of information, helps us understand the mechanisms linking the real world and online social networks. In real scenarios, different pieces of information spread through social networks at the same time; research on the competitive diffusion of information helps to establish models that better reflect information diffusion in the real world, and thus improves our understanding of the laws of diffusion.

Existing research mainly starts from the scope, timing, and other factors of information diffusion. Beyond these factors, other properties of the information, such as its timing, content, and source, constitute inherent attributes of its diffusion. In what manner does the role of the information combine with the role of the users when it diffuses through a social network? Investigating this question will help us better understand the mechanism of diffusion.
Figure 2.5: Popularity curve of the post about the missing Malaysia Airlines flight
Some researchers believe that there is a correlation between historical popularity and
future popularity. A regression model that shows the correlation between historical
and future popularities is established by considering the popularity at a particular
point in the early stage or at a series of time points. The classical model is the SH
model proposed by Gabor Szabo and Bernardo Huberman in 2008 [32]. The results
were verified by posts on Digg and videos on YouTube.
To better observe the relationship between early and late popularity, Gabor Szabo and Bernardo Huberman preprocessed the data and found that if the popularity is log-transformed (ln), early popularity and late popularity show a strong linear correlation, and the random fluctuations can be expressed as additive noise. They also built scatter plots of the early and late popularity of the datasets. For the Digg posts, each point represents a post sample: the abscissa of each point represents the number of "likes" the post received 1 hour after it was released, while the ordinate represents the number of "likes" 30 days later. For the videos on YouTube, each point represents a video sample: the abscissa represents the number of views 7 days after the video was posted, whereas the ordinate represents the number of views 30 days later. From the scatter plots, the specific relationships between the early and late popularity of Digg posts and of YouTube videos can be found using the least squares method: ln y = ln x + 5.92 and ln y = ln x + 2.13, respectively.
The SH model established according to the above process is as follows:

ln N_s(t2) = ln r(t1, t2) + ln N_s(t1) + ε_s                    (2.17)

where N_s(t2) represents the popularity of an online content s at time t2, and ln N_s(t2) is the dependent variable, indicating late popularity; ln r(t1, t2) is the intercept; ln N_s(t1) is the independent variable, indicating early popularity; and ε_s is the additive noise, i.e., the error term.
As far as linear regression fitting is concerned, if the error term follows the normal distribution, then the fit is valid. Therefore, the normality of the residuals is checked with a quantile-quantile (QQ) plot, and the SH model is found to fit the relationship between early and late popularity acceptably well. The QQ plot is used to visually verify whether a set of data comes from a given distribution, and is often used to check for normality.
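A minimal sketch of fitting the SH relation on synthetic data is shown below. As in the original model, the slope of the log-log relation is fixed at 1, so only the intercept ln r(t1, t2) is estimated; the popularity values are generated for illustration and are not the Digg or YouTube data.

```python
# Minimal sketch of fitting the SH model ln N_s(t2) = ln r(t1, t2) + ln N_s(t1) + noise
# with least squares; the popularity values below are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
early = rng.integers(10, 1000, size=200).astype(float)        # popularity at t1
late = early * 30.0 * rng.lognormal(0.0, 0.3, size=200)       # synthetic popularity at t2

x, y = np.log(early), np.log(late)
ln_r = np.mean(y - x)                 # slope fixed to 1, only the intercept is estimated
predicted_late = np.exp(x + ln_r)     # SH prediction of late popularity
print("estimated ln r(t1, t2):", round(float(ln_r), 3))       # close to ln 30 ~= 3.4
```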
object, they found that the structural characteristics of early microblog forwarders
could reflect the final popularity of an online content.
The researchers first measured the correlation between the final popularity
and the link density, and that between the final popularity and the diffusion
depth of microblogs, and discovered a strong negative correlation between the
final popularity and the link density, and a strong positive correlation between
the final popularity and diffusion depth. This shows that groups of low link
density and high diffusion depth are more conducive to enhancing the popularity
of microblogs. Based on the above findings, the researchers improved the SH
model.
The improved model is shown in Equation 2.18:
where p̂_k(t_r) is the popularity at time t_r, i.e., the late popularity; p_k(t_i) is the popularity at time t_i, i.e., the early popularity; ρ_k(t_i) is the link density at time t_i; and α1, α2, α3 are parameters trained from the dataset.
In a second improved model based on the diffusion depth, p̂_k(t_r) is again the popularity at time t_r, i.e., the late popularity; ln p_k(t_i) is the popularity at time t_i, i.e., the early popularity; d_k(t_i) is the diffusion depth at time t_i; and β1, β2, β3 are the parameters trained from the dataset (Table 2.2).
The comparison of the network structure-based models and the SH model is shown in Table 2.2.
It can be seen from the table that the root mean squared error (RMSE) and the mean
absolute error (MAE) of the improved model are significantly reduced compared with
the SH model.
Some researchers believe that the promotion of the popularity of online contents
is closely related to the behaviors of social network users. Kristina Lerman and
Tad Hogg argued that users' social behavior, such as registering on a website,
dN_vote(t)/dt = r(v_f(t) + v_u(t) + v_friends(t))                    (2.20)
where N_vote(t) represents the number of "likes" a post has received at time t, i.e., its popularity; r represents the fun factor of the post, i.e., the probability that a user likes it after viewing it once; and v_f, v_u, and v_friends represent the rates at which users see the post via the front page, the "upcoming" section, and the "friends" interface, respectively.
The researchers assume that all users will first browse the front page after
visiting the Digg website, and then enter the “upcoming” section at a certain
probability. The posts posted by all users on Digg are grouped, with every 15 in
one group. The latest 15 posts will be on the first page, the next 15 on the
second page, and so on. Researchers use the function fpage(p) to indicate the
visibility of a post. If p has a value of 1.5, it implies that the post is in the
middle of the second page. fpage(p) decreases as p increases, whereas p
increases as time passes. The researchers use the inverse Gaussian distribution to represent the distribution of the number of pages viewed by users. Finally, they measure v_friends, i.e., the rate at which a post is viewed via the "friends" interface. Users can see the posts both submitted and liked by their friends. The researchers use the function s(t) to represent the number of friends of the likers who have not yet seen the post. Supposing a friend of a liker eventually sees the post at rate w, then v_friends = w·s(t).
So here we have the resulting rate equation, in which t is the length of time after the post was submitted and v is the rate at which users visit Digg. The rate of users visiting Digg, the probability of browsing the "upcoming" section, the rate at which the likers' fans visit Digg, the distribution of page views, and some other parameters are empirical values trained from the training set, whereas the fun factor and the number of the poster's fans vary from post to post (Table 2.3).
Parameter | Value
Fun factor | r
Number of the poster's fans | S
of time; "seasonality" refers to the repeated fluctuations of the time series within a period, i.e., the aforementioned periodicity. The "level," "trend," and "seasonality" are called the systematic part, whereas "noise" is the nonsystematic part. Time series forecasting aims to make predictions about the systematic part.
These different components together constitute the entire time series. They can be combined in an additive or a multiplicative mode:
Additive mode: Yt = level + trend + seasonality + noise
Multiplicative mode: Yt = level × trend × seasonality × noise
The seasonality components of a time series are subdivided into additive season and
multiplicative season. The behavior of the additive season is that the seasonal fluctua-
tion does not change with the overall trend and level of the time series. The behavior of
the multiplicative season is that the seasonal fluctuation changes with the overall trend
and level of the time series. The type of seasonality identified in the analysis determines whether to choose an additive or a multiplicative model: for time series with additive seasonality, the additive model is usually selected for fitting, whereas for time series with multiplicative seasonality, the multiplicative model is usually selected.
The basic steps of prediction using the time series method are as follows:
(1) data acquisition;
(2) visual analysis: analyze the data characteristics, select the data granularity, and draw the time series plot;
(3) data preprocessing: handle missing values, extreme values, etc.;
(4) data division: split the data into a training set and a validation set;
(5) application of prediction methods: select an appropriate model;
(6) evaluation of prediction performance: use the mean absolute percentage error (MAPE), relative absolute error (RAE), mean square error (MSE), and other measures for performance evaluation.
Y_t = P_t + (a_1 x_1 + a_2 x_2 + ... + a_{m−1} x_{m−1}) + E_t                    (2.24)
where t represents time; Y represents the actual value of the time series; P_t is a polynomial representing the trend and level terms; m is the length of the season; and x_1, x_2, ..., x_{m−1} are the dummy variables (for m periods there are m − 1 dummy variables, and the period without a corresponding dummy variable serves as the reference). Each dummy variable is either 0 or 1: if the time falls within a specific period, the dummy variable for that period is 1 and the others are 0. a_1, a_2, ..., a_{m−1} are the coefficients corresponding to the dummy variables, and E_t represents noise.
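The regression of Equation 2.24 can be sketched with ordinary least squares as follows, assuming a linear trend for P_t and a synthetic monthly series with season length m = 12; the series and all numbers are illustrative.

```python
# Sketch of the regression in Equation 2.24 with a linear trend P_t and m-1 seasonal
# dummy variables, fitted by least squares; the monthly series here is synthetic.
import numpy as np

m = 12                                                 # season length (months)
t = np.arange(5 * m)                                   # five seasons of observations
season = t % m
rng = np.random.default_rng(1)
y = 100 + 2.0 * t + 50.0 * (season == 11) + rng.normal(0, 5, t.size)

# Design matrix: intercept, linear trend, and dummies for seasons 1..m-1
# (season 0 is the reference period without a dummy variable).
X = np.column_stack([np.ones_like(t, dtype=float), t.astype(float)] +
                    [(season == k).astype(float) for k in range(1, m)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("trend coefficient:", round(coef[1], 2))
print("dummy for the last season:", round(coef[2 + 10], 1))   # coefficient a_11
```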
where t represents time, Y is the fitted value of the time series, L is the level component, T is the linear trend component, S is the seasonal factor, and E is the noise component.
where y_t represents the observed value at time t, m is the length of the season, and α, β, and γ are the smoothing parameters, which can be chosen by minimizing the MSE on the training set.
The multiplicative HW model performs exponential smoothing on the level, trend, and seasonality components. The smoothing parameters determine how heavily the most recent observations are weighted: the closer a parameter is to 1, the more weight is given to recent information.
Thus, the k-step-ahead prediction can be obtained as in Equation 2.29.
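Since Equations 2.25–2.29 are not reproduced here, the sketch below uses the standard textbook form of the multiplicative Holt-Winters (HW) recursions (an assumption, not necessarily the exact formulation used in this chapter) to smooth a toy hourly series with a 24-hour season and produce a k-step-ahead forecast.

```python
# Minimal multiplicative Holt-Winters smoother in the standard textbook form (assumed here;
# the chapter's Equations 2.25-2.29 are not reproduced in this extract).

def holt_winters_mul(y, m, alpha, beta, gamma, k):
    """Smooth series y with season length m and return a k-step-ahead forecast."""
    level = sum(y[:m]) / m                          # initial level: mean of the first season
    trend = (sum(y[m:2 * m]) - sum(y[:m])) / (m * m)
    season = [y[i] / level for i in range(m)]       # initial seasonal factors
    for t in range(m, len(y)):
        last_level = level
        level = alpha * y[t] / season[t % m] + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        season[t % m] = gamma * y[t] / level + (1 - gamma) * season[t % m]
    n = len(y)
    return [(level + (i + 1) * trend) * season[(n + i) % m] for i in range(k)]

# Toy hourly series with a 24-hour season (synthetic values).
series = [10 + t % 24 + 0.1 * t for t in range(24 * 7)]
forecast = holt_winters_mul(series, m=24, alpha=0.4, beta=0.2, gamma=0.05, k=3)
print([round(v, 1) for v in forecast])
```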
Example 2.4 Example of online content popularity prediction based on time series
The above sections described the basic idea of time series and two time series models. In the following, we describe a specific application of the time series approach to predicting the popularity of online content: predicting the popularity of topics on the Tianya Forum. Two types of hot topics were selected for analysis and prediction: long-term, nonemergent topics, such as Christmas and the US general election, and short-term, emergent topics, such as H7N9 and the Beijing haze.
1) Dataset collection
Source: Sina News Center (http://news.sina.com.cn) and Tianya Forum (http://www.
tianya.cn). Sina News Center is one of the major channels of Sina, with 24-hour
rolling coverage of domestic, international, and social news, and more than 10,000
pieces of news issued daily. Tianya Forum is the largest forum website in China,
founded in 1999, with registered users up to 85 million and daily visits of about 30
million. The Forum consists of politics, entertainment, economy, emotion, fashion,
and other sections. As of June 2013, the posts in the entertainment section had
exceeded 3.2 million, and the replies exceeded 170 million; the posts in the economy
section exceeded 1.4 million, and the replies exceeded 17 million.
The dataset was collected in two steps:
(1) select hot topics from Sina News Center;
(2) search on Tianya Forum for the hot topics screened out in the first step.
Here, the popularity of hot topics is defined as the quantity of all posts and
replies related to the topic within a certain period of time.
Hot topics in Sina News Center starting from July 5, 2004 were collected. Topics on
a specific date can be searched by date. A total of 10,000 topics were collected from the
international, social, entertainment, sports, science and technology, and other sec-
tions, which were then artificially divided into two types: emergent and nonemergent,
with 6,000 emergent-type topics and 4,000 nonemergent-type topics. Then, the topics
collected in the first step were used as keywords and searched for on Tianya Forum using its internal search function. Multiple search methods are available, including searching by relevance, by posting time, over the full text, over titles, and so on; we chose the relevance and title search methods. After further screening, we selected 7,000 hot topics, of which 3,000 were
nonemergent and 4,000 were emergent topics. The nonemergent topics collected were
data from January 2001 to December 2012, and the popularity of each topic was higher
than 15,000; the emergent topics collected were data from January 2011 to December
2013, and the popularity of each topic in the first 15 days was higher than 18,000.
Figure 2.6: Popularity of the “US general election” topic from January 2001 to December 2012
If the data granularity is too fine, many values will be 0. For example, nonemergent topics are not discussed much; if the granularity is set at the level of "hour," the popularity is calculated once an hour, many of the values may be 0, and seasonal characteristics are hard to identify. It is therefore better to select "month" as the granularity. Take the topic "Christmas" as an example: its value peaks in December every year and is much lower in the other months, showing strong seasonality with a season length of 12 months. However, if the data granularity is too coarse, some seasonal characteristics of a time series are likely to be overlooked. For example, consider an emergent topic that becomes hotly discussed: if "day" is selected as the granularity, it is not easy to observe any period, but if "hour" is used, the hourly number of replies within every 24 hours is found to follow people's daily habits, which is a seasonal characteristic with a season length of 24 hours. Figure 2.7 shows the frequency of replies on Tianya Forum in each period of a day. The statistics are based on 35,619,985 replies to 10,000 posts on Tianya Forum.
Figure 2.7: Frequency distribution of replies to hot posts on Tianya BBS during different periods of a day
Figure 2.8: The blue line indicates the time series of “Beijing haze” topic in the first 9 days. The red
line indicates the trend of the time series. The trend line is drawn with a moving average of a window
size of 24
Figure 2.9: Results of "Beijing haze" popularity prediction from the MLR model and the multiplicative HW model
trend terms are lower than the actual ones. The accuracy rate of the MLR model in
predicting the popularity on the 8th day is 79.1%, and the accuracy rate of the HW
model is 88.3%.
Take the topic of "Christmas" as an example. As shown in Figure 2.10, the peak arises in December each year; the trend line is drawn with the moving average method using a window size of 12. The data from January 2008 to December 2009 were used as the training set, and the data from January to December 2010 were used as the validation set. The results are shown in Figure 2.11.
Figure 2.10: The grey line indicates the time series of the “Christmas” topic from January 2003 to
December 2012. The black line indicates the trend of the time series. The trend line is drawn with a
moving average of a window size of 24. The popularity reaches its peak in December each year
A total of 3,000 samples were collected for the nonemergent topics and 4,000
samples for the emergent topics. The historical data of both have a time span of 3
periods. The multiplicative HW model shows an average accuracy rate of 80.4% in
predicting the trend of nonemergent topics, and an average accuracy rate of 84.7% in
predicting the trend of emergent topics, whereas those of the MLR model are 63.7%
and 71.4%, respectively.
The HW model is much more accurate than the MLR model. In addition, the results validate that the seasonality of the hot topics is multiplicative. The experimental results also show that the smoothing parameters usually remain stable within a certain range in both cases: α < 0.5, β < 0.5, γ < 0.07 (nonemergent) and α > 0.55, β > 0.5, γ < 0.07 (emergent). This indicates that short-term emergent topics have more unstable level and trend terms and more severe ups and downs, and therefore require frequent collection of the latest information; in other words, the dependence of their future popularity on historical data is weaker than that of long-term topics, whereas the historical data of long-term topics have a stronger impact on their future popularity.
Figure 2.11: Results of "Christmas" popularity prediction from the MLR model and the multiplicative HW model
used to form the validation set. The results are shown in Figure 2.12. According to the
results, the prediction for long-term nonemergent topics is the most accurate when
three periods of historical data are collected, with an average accuracy rate of 0.813;
the prediction for short-term emergent topics is the most accurate when two periods
of historical data are collected, with an average accuracy rate of 0.858. This also re-
verified the fact that the historical data of long-term topics has a stronger impact on
the future popularity, whereas the dependence of short-term emergent topics' popularity on historical data is weaker.

Figure 2.12: Average accuracy rates resulting from different numbers of seasons used as training sets, examined on the two types of topics

When the number of historical data periods exceeds three, the accuracy does not improve significantly, because data in one period already reflect the seasonal information and data in two periods reflect the trend information. As the level and trend are constantly changing, earlier historical data cannot reflect the current popularity. Therefore, the accuracy saturates after a time span of three periods.
4. Comparative analysis
To validate the time series method, the SH Model based on the sample set is
used for comparative analysis. Gabor Szabo and Bernardo Huberman found that
the early popularity and late popularity exhibit linear relationship in logarithmic
form: lny = lnx + b, where x is the early popularity, a known data; y is the late
popularity, to be predicted; b is a parameter trained from the sample set.
The multiplicative HW and SH models are used for comparison. Four thousand
emergent-type hot topics serve as the sample set of the SH model; the trained
parameter b = 0.267, and “Beijing haze” is the topic of prediction. Data in the first 7
days are used for early popularity, and data on the 8th day are used for late
popularity. The popularity on the 8th day predicted by the SH model is 9154, that
predicted by the time series is 5488, and the actual popularity is 4913. The error rates
of the two models on the two types of topics are shown in Table 2.4.
      Emergent    Non-emergent
SH    .%          .%
HW    .%          .%
The SH model is not ideal for predicting the popularity of hot topics because it relies on a sample set, while the development of each hot topic has its own characteristics; for example, each has different peak and base values, making it difficult to find a suitable sample set. The time series method, by contrast, does not depend on a sample set but on analyzing the historical data of the hot topic to be predicted.
It is easy to obtain historical data for a historical popularity-based prediction model, which makes it suitable for predicting the long-term popularity of online content. It can also be applied to predict the popularity of various kinds of online content, because posts, news, videos, and microblogs all have historical popularity data. However, predicting long-term popularity from historical data collected too early can lead to inaccurate predictions, whereas using historical data collected late can lead to delayed predictions.
The advantage of a model based on the network structure is that the network structure factor is taken into consideration, which makes it more accurate than a prediction model based only on historical popularity, while its limitation is the same as that of the
In the study of diffusion processes, determining the source of diffusion from the observed diffusion results is a fundamental problem. Research results on this issue can be applied to spam management, computer virus prevention, rumor prevention and control on social networks, and other areas.

Because of the convenience and strong interactivity of social networks, information on them can spread very fast and widely, which also leads to the uncontrolled diffusion of large amounts of false and illegal information. Identifying the source of malicious information by means of information source location technology and other effective methods is key to controlling the diffusion of false and illegal information on social networks. The basic goal of information source location is to determine the initial source of information diffusion.
According to existing research on information source location, the problem is defined as follows: determine the initial source node of information diffusion on a network, given the observed diffusion result, the attributes of the underlying network structure, the mode of information diffusion, and so on. Oftentimes, our observations of the diffusion results are incomplete, as we can only observe a part of the overall result, which adds to the difficulty of locating the information source. In addition, due to the diversity and uncertainty of the
identified the source node using an improved centrality measurement method. The
study mainly examined two different networks: the ER network and the scale-free
network, and experiments were conducted on both networks.
In reality, information diffusion exhibits different modes. For example, in a
computer virus diffusion network, when a new virus appears, the node that is
infected with the virus will diffuse it to all its neighbors; almost all of the neighbors
will be infected, and the process keeps on; however, in the information diffusion
process on social networks, there is a certain probability that an infected node infects
its neighbors. Normally not all of them are infected, and only a part of the nodes are
infected due to their interest in the information. Obviously, the performance of the
information source location method will be affected by the different characteristics of
different modes of diffusion. The study mainly considered the following three types
of diffusion modes and conducted experiments on each:
2. Diffusion
The random walk algorithm describes this diffusion mode: a node chooses one of its neighbor nodes to diffuse to, and each neighbor node has a certain probability of being chosen.
3. Contact process
This diffusion mode can be viewed as classic virus diffusion: each node infects each of its neighbors with a certain probability.
The source location method proposed in this study mainly relies on calculating the centrality of each node. In previous studies, the four major centrality measures are "degree," "closeness," "betweenness," and "eigenvector."
The first measure, degree, uses the classical definition from graph theory, that is, the number of edges incident to a node. Let d_ij denote the length of the shortest path between nodes i and j; then the average shortest distance l_i from node i is:

l_i = (1/(n − 1)) Σ_{j, j≠i} d_ij                    (2.30)
The second measure is closeness. As shown in the following equation, the closeness C_i of node i is the reciprocal of its average shortest distance:

C_i = 1 / l_i                    (2.31)
We can see that in the “closeness” method, the distance between one node and the
other nodes is used to measure the centrality of the node. Obviously, if the average
distance between the node and the other nodes is small, it approximates the “center”
of the network; in other words, it has higher centrality.
The third measure is betweenness. As shown in the following equation, the betweenness of node i is:

B_i = Σ_{s,t: s≠t, s≠i, t≠i} n_st^i / n_st                    (2.32)
where n_st^i is the number of shortest paths between node s and node t that pass through node i, and n_st is the total number of shortest paths between node s and node t. The betweenness measure thus assesses the centrality of a node by testing whether it lies on the shortest paths between other nodes. If a node is on the shortest paths between many other nodes, it acts more like a "hub" and has higher centrality.
The fourth measure is eigenvector centrality. It follows the principle that when a node is connected to other highly ranked nodes, its own importance increases. Let s_i represent the score of the i-th node and A denote the adjacency matrix of the network; the score of the i-th node is proportional to the sum of the scores of all its neighbor nodes. Therefore,
s_i = (1/λ) Σ_{j=1}^N A_ij s_j                    (2.33)
where λ is a constant. The above equation can be rewritten as:
A s = λ s                    (2.34)

The eigenvector associated with the largest eigenvalue of this equation gives the eigenvector centrality of each node.
In this study, the diffusion process on the network can be simulated in the
following manner: given that some seed nodes are assumed as the starting nodes,
the underlying network is sampled by the algorithm corresponding to the three
different types of diffusion mentioned earlier, to get a subgraph.
Comin et al. measured the centrality of the nodes in the subgraph obtained after sampling on the ER network and the scale-free network, and found that the degree of a node was almost unchanged after sampling, because degree is a purely local quantity. Therefore, the deviation caused by sampling can be eliminated by dividing the measured centrality value by the degree of the node. The unbiased betweenness is defined as follows:
B̂_i = B_i / (k_i)^r                    (2.35)
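As a hedged illustration of this source-scoring idea, the sketch below ranks the nodes of a toy observed subgraph by betweenness divided by degree; the networkx library, the ER test graph, and the choice r = 1 for the exponent in Equation 2.35 are assumptions made for the example.

```python
# Hedged sketch of source scoring on an observed diffusion subgraph: rank nodes by
# betweenness divided by degree (Equation 2.35, with the exponent r set to 1 here).
# networkx is assumed available; the graph is a toy ER network, not the book's data.
import networkx as nx

G = nx.erdos_renyi_graph(200, 0.05, seed=42)        # stand-in for the observed subgraph
betweenness = nx.betweenness_centrality(G)
degree = dict(G.degree())

r = 1.0                                             # exponent of Equation 2.35 (assumed)
unbiased = {v: betweenness[v] / (degree[v] ** r) for v in G if degree[v] > 0}
ranking = sorted(unbiased, key=unbiased.get, reverse=True)
print("top source candidates:", ranking[:5])
```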
The shortest distances from node a to nodes b, c, d, e, and f are 1, 1, 2, 2, and 3, respectively, so the average shortest distance of node a is l = (1 + 1 + 2 + 2 + 3)/5 = 1.8.
The closeness of node a is C = 1/1.8.
The betweenness of node a is B = 1/1 = 1, because there is only one shortest path between node c and node b, and that path passes through node a.
converted to the recovery state is q. Let Θ be the original infected node, and assume that the number of time steps experienced by the diffusion process is known and used as a parameter for inferring the source node.
Based on the above assumptions, the source location problem is defined as follows:
The random vector R = (R(1), R(2), ..., R(N)) represents the infection status of the nodes before a certain time threshold T. Each R(i) is a Bernoulli random variable: if node i is infected before time T, the corresponding value is 1; otherwise, it is 0.
Suppose we have observed the diffusion result of an SIR model with known (p, q) and T, and let the set of all candidate nodes be S = {θ_1, θ_2, ..., θ_N}, with a finite number of nodes. We obtain the following maximum likelihood problem:

θ̂_ML = argmax_{θ ∈ S} P(R = r* | Θ = θ)

where r* denotes the observed diffusion result.
Algorithm 2.3 represents a process that uses a maximum likelihood estimation algo-
rithm to perform calculations. The main idea of maximum likelihood estimation is,
with the experimental result being already known, to find the experimental condition
that is most favorable (i.e., of the largest likelihood) for getting the experimental
result through the algorithm. The experimental result here means the observed
diffusion result that is already known. At the same time, some conditions, including
the parameters (p, q) and T of the SIR model, are also known. The unknown experi-
mental condition is the source of diffusion.
The n in the parameters is the number of simulations run for a set of source node
candidates.
The similarity φ of two different diffusion results is judged in two ways (XNOR and Jaccard), denoted XNOR(r_1, r_2) and Jaccard(r_1, r_2), respectively.
Later, Nino et al. defined three different likelihood estimation functions: AUCDF, AvgTopK, and naive Bayesian. The first two use the similarity measures mentioned earlier, whereas the naive Bayesian method uses its own similarity calculation. Although the three algorithms differ, they share the same main idea: to calculate the likelihood, i.e., the probability of obtaining the observed result under different experimental conditions. In the following, we introduce these three algorithms.
Algorithm 2.4 represents the algorithm for the AUCDF likelihood estimation function.

Output: P̂(R = r* | Θ = θ) = 1 − AUCDF_θ
Algorithm 2.5 represents the algorithm for the AvgTopK likelihood estimation function.

Sort the ratings {φ(r*, R_{θ,J})} in descending order;
Average the k maximum ratings:

P̂(R = r* | Θ = θ) = (1/k) Σ_{i=1}^{k} {φ(r*, R_{θ,J})}_sorted

Output: likelihood P̂(R = r* | Θ = θ)
Algorithm 2.6 represents the algorithm for the naive Bayesian likelihood estimation function.

P̂(r*(k) = 1 | Θ = θ) = (m_k + 2)/(n + 2),    ∀k ∈ G

log(P̂(R = r | Θ = θ)) = Σ_{k: r(k)=1} log(P̂(r(k) = 1 | Θ = θ)) + Σ_{j: r(j)=0} log(1 − P̂(r(j) = 1 | Θ = θ))

Output: likelihood log(P̂(R = r* | Θ = θ))
Nino et al. tested the performance of the likelihood estimation algorithms on different network architectures. Once the algorithm outputs a ranked list of potential source nodes from the node set S, the rank of the actual source node in this list is checked. Experiments showed that the method performs well in various network environments.
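The simulation-based likelihood idea can be sketched as follows: for each candidate source θ, run n SIR simulations with the known (p, q) and T, compare each simulated infection set with the observed one using the Jaccard similarity, and average the k best scores (an AvgTopK-style estimate). This is an illustration of the general approach, not the authors' exact algorithms; the graph, parameters, and helper names are assumptions.

```python
# Simplified sketch of simulation-based source likelihood estimation (AvgTopK-style).
import random
import networkx as nx

def simulate_sir(G, source, p, q, T, rng):
    """Return the set of nodes infected at some point within T steps."""
    infected, recovered = {source}, set()
    for _ in range(T):
        newly = {v for u in infected for v in G[u]
                 if v not in infected and v not in recovered and rng.random() < p}
        recovered |= {u for u in infected if rng.random() < q}
        infected = (infected | newly) - recovered
    return infected | recovered

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def source_likelihoods(G, observed, p, q, T, n=200, k=20, seed=0):
    rng = random.Random(seed)
    scores = {}
    for theta in G:                                 # candidate source nodes
        sims = sorted((jaccard(simulate_sir(G, theta, p, q, T, rng), observed)
                       for _ in range(n)), reverse=True)
        scores[theta] = sum(sims[:k]) / k           # average of the k best similarities
    return scores

G = nx.erdos_renyi_graph(50, 0.1, seed=1)
true_source = 0
observed = simulate_sir(G, true_source, p=0.3, q=0.2, T=5, rng=random.Random(7))
scores = source_likelihoods(G, observed, p=0.3, q=0.2, T=5)
print("best candidate:", max(scores, key=scores.get), "true source:", true_source)
```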
Oftentimes, we can only observe part of the diffusion results. In a diffusion process that follows the SIR model, some nodes shift from the infected state to the recovered state, which makes information source location more difficult. In addition, diffusion results produced by multiple sources make it harder to determine the true sources of the information. This section introduces a method of multi-point source location based on sparsely observed infected nodes, which addresses the problems of incomplete observation and multiple-source diffusion. The method consists of three steps: first, detect the recovered nodes in the network using a reverse diffusion method; second, partition all infected nodes using a partitioning algorithm, thereby turning the multi-point source location problem into multiple independent single-point source location problems; finally, determine the most likely source node in each partition.
A new network is created for location analysis by implementing Algorithm 2.7, and is
called the expanded infection network.
P(Y | α, B) = ∫_π Σ_z ( Π_{p,q} P(Y(p,q) | z_{p→q}, z_{q→p}, B) P(z_{p→q} | π_p) P(z_{q→p} | π_q) Π_p P(π_p | α) ) dπ                    (2.38)
Figure 2.14: Cumulative probability distribution of the average distance between the real source node
and the calculated source node, the number of source nodes k=2 [41]
Figure 2.15: Cumulative probability distribution of the average distance between the real source node
and the calculated source node, the number of source nodes k=3 [41]
2.8 Summary
The study of information diffusion in social networks has become one of the most
challenging and promising fields of research. This chapter introduced a number of
mainstream models of information diffusion in social networks and explored the
methods for predicting the trend of information diffusion and tracing the source of
information.
Despite numerous studies and certain achievements in the field of information
diffusion in social networks, there are still many problems that require further
exploration.
(1) Model validation and evaluation methods: the existing model validation methods are mainly based on random data and computer-simulated data for verification and analysis. For a more scientific approach, a unified standard test set should be established by screening typical examples of diffusion, so as to assess the advantages and disadvantages of different diffusion models and to define the scope of application of the algorithms.
(2) The intrinsic rule of multi-factor coupling: most reported studies discuss information diffusion from a single perspective, such as the topology of the network over which the information diffuses or the rules of individual interaction. However, in reality, the diffusion of information is a typical evolution process of a complex system; describing information diffusion in online social networks more accurately requires the comprehensive consideration of multiple factors, including individual interaction, network structure, and information characteristics.
(3) Dynamic changes in social networks: most existing methods of information diffusion analysis are based on a static network topology; however, in real social networks, the network of relationships between users changes over time. It is necessary to incorporate dynamic change into information diffusion models. In addition, the existing algorithms are mostly serial or time-step based; large-scale parallel and distributed algorithms are needed to improve processing efficiency.
References
[1] Mark S. Granovetter. The strength of weak ties. American Journal of Sociology,
1973:1360–1380.
[2] Stratis Ioannidis, Augustin Chaintreau. On the strength of weak ties in mobile social networks.
In Proceedings of the Second ACM EuroSys Workshop on Social Network Systems. ACM, 2009:
19–25.
[3] Stephan Ten Kate, Sophie Haverkamp, Fariha Mahmood, Frans Feldberg. Social network
influences on technology acceptance: A matter of tie strength, centrality and density. In BLED
2010 Proceedings, 2010, 40.
[4] Paul S. Adler, Seok-Woo Kwon. Social capital: Prospects for a new concept. Academy of
Management Review, 2002, 27(1):17–40.
[5] Jichang Zhao, Junjie Wu, Xu Feng, Hui Xiong, Ke Xu. Information propagation in online social
networks: A tie-strength perspective. Knowledge and Information Systems, 2012,
32(3):589–608.
[6] Eytan Bakshy, Itamar Rosenn, Cameron Marlow, Lada Adamic. The role of social networks in
information diffusion. In Proceedings of the 21st international conference on World Wide Web.
ACM, 2012: 519–528.
[7] John Scott. Social network analysis: Developments, advances, and prospects. Social Network
Analysis and Mining, 2011, 1(1):21–26.
[8] Mor Naaman, Jeffrey Boase, Chih-Hui Lai. Is it really about me?: Message content in social
awareness streams. In Proceedings of the 2010 ACM conference on Computer supported
cooperative work. ACM, 2010: 189–192.
[9] Akshay Java, Xiaodan Song, Tim Finin, Belle Tseng. Why we twitter: Understanding microblog-
ging usage and communities. Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop
on Web mining and social network analysis. ACM, 2007: 56–65.
[10] Mike Thelwall, Kevan Buckley, Georgios Paltoglou. Sentiment in Twitter events. Journal of the
American Society for Information Science and Technology, 2011, 62(2):406–418.
[11] Seth Myers, Jure Leskovec. Clash of the contagions: Cooperation and competition in informa-
tion diffusion. ICDM. 2012, 12:539–548.
[12] Mark Granovetter. Threshold models of collective behavior. American Journal of Sociology,
1978, 1420–1443.
[13] Jacob Goldenberg, Barak Libai, Eitan Muller. Talk of the network: A complex systems look at the
underlying process of word-of-mouth. Marketing Letters, 2001, 12(3):211–223.
[14] Jacob Goldenberg, Barak Libai, Eitan Muller. Using complex systems analysis to advance
marketing theory development: Modeling heterogeneity effects on new product
growth through stochastic cellular automata. Academy of Marketing Science Review, 2001,
9(3):1–18.
[15] Daniel Gruhl, Ramanathan Guha, David Liben-Nowell, Andrew Tomkins. Information diffusion
through blogspace. In Proceedings of the 13th international conference on World Wide Web.
ACM, 2004: 491–501.
[16] Xiaodan Song, Yun Chi, Koji Hino, Belle L. Tseng. Information flow modeling based on diffusion
rate for prediction and ranking. In Proceedings of the 16th international conference on World
Wide Web. ACM, 2007: 191–200.
[17] Kazumi Saito, Masahiro Kimura, Kouzou Ohara, Hiroshi Motoda. Behavioral analyses of infor-
mation diffusion models by observed data of social network. Advances in Social Computing.
Springer, 2010: 149–158.
[18] Kazumi Saito, Masahiro Kimura, Kouzou Ohara, Hiroshi Motoda. Selecting information diffu-
sion models over social networks for behavioral analysis. Machine Learning and Knowledge
Discovery in Databases. Springer, 2010: 180–195.
[19] Luke Dickens, Ian Molloy, Jorge Lobo, Paul-Chen Cheng, Alessandra Russo. Learning stochastic
models of information flow. In Data Engineering (ICDE), 2012 IEEE 28th International
Conference on. IEEE, 2012: 570–581.
[20] Wojciech Galuba, Karl Aberer, Dipanjan Chakraborty, Zoran Despotovic, Wolfgang Kellerer.
Outtweeting the twitterers-predicting information cascades in microblogs. In Proceedings of
the 3rd conference on online social networks. USENIX Association, 2010: 3.
[21] Adrien Guille, Hakim Hacid. A predictive model for the temporal dynamics of information
diffusion in online social networks. In Proceedings of the 21st international conference com-
panion on World Wide Web. ACM, 2012: 1145–1152.
[22] William O. Kermack, Anderson G. McKendrick, Contributions to the mathematical theory of
epidemics, In Proceedings of the Royal Society of London, 1927, 115(772):700–721.
[23] William O. Kermack, Anderson G. McKendrick. Contributions to the mathematical theory of epidemics. II. The problem of endemicity. Proceedings of the Royal Society of London. Series A, 1932, 138(834): 55–83.
[24] Michelle Girvan, Mark Newman. Community structure in social and biological networks.
Proceedings of the National Academy of Sciences, 2002, 99(12):7821–7826.
[25] Saeed Abdullah, Xindong Wu. An epidemic model for news spreading on twitter. In Tools with
Artificial Intelligence (ICTAI), 2011 23rd IEEE International Conference on. IEEE, 2011: 163–169.
[26] Fei Xiong, Yun Liu, Zhen-jiang Zhang, Jiang Zhu, Ying Zhang. An information diffusion model
based on retweeting mechanism for online social media. Physics Letters A 2012,
376(30):2103–2108.
[27] Dechun Liu, Xi Chen. Rumor Propagation in Online Social Networks Like Twitter-A Simulation
Study. In Multimedia Information Networking and Security (MINES), 2011 Third International
Conference on. IEEE, 2011: 278–282.
[28] Jaewon Yang, Jure Leskovec. Modeling information diffusion in implicit networks. In Data
Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 2010: 599–608.
[29] Thomas F. Coleman, Yuying Li. A reflective Newton method for minimizing a quadratic function
subject to bounds on some of the variables. SIAM Journal on Optimization, 1996,
6(4):1040–1058.
[30] Seth Myers, Chenguang Zhu, Jure Leskovec. Information diffusion and external influence in
networks. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge
discovery and data mining. ACM, 2012: 33–41.
[31] Alex Beutel, B. Aditya Prakash, Roni Rosenfeld, Christos Faloutsos. Interacting viruses in
networks: can both survive? In Proceedings of the 18th ACM SIGKDD international conference
on Knowledge discovery and data mining. ACM, 2012: 426–434.
[32] Gabor Szabo, Bernardo A. Huberman. Predicting the popularity of online content.
Communications of the ACM, 2010, 53(8):80–88.
[33] Peng Bao, Hua-Wei Shen, Junming Huang, et al. Popularity prediction in microblogging net-
work: a case study on Sina Weibo. In Proceedings of the 22nd international conference on
World Wide Web companion. International World Wide Web Conferences Steering Committee,
2013: 177–178.
[34] Kristina Lerman, Tad Hogg. Using a model of social dynamics to predict popularity of news.
Proceedings of the 19th international conference on World Wide Web. ACM, 2010: 621–630.
[35] Changjun Hu, Ying Hu. Predicting the popularity of hot topics based on time series models.
APWEB, 2014.
[36] Vincenzo Fioriti, Marta Chinnici. Predicting the sources of an outbreak with a spectral techni-
que. arXiv preprint arXiv:1211.2333, 2012.
[37] Cesar Henrique Comin, Luciano Da Fontoura Costa. Identifying the starting point of a spreading
process in complex networks. Physical Review E, 2011, 84(5):56105.
[38] Andrey Y. Lokhov, Marc Mézard, Hiroki Ohta, et al. Inferring the origin of an epidemic with a dynamic message-passing algorithm. arXiv preprint arXiv:1303.5315, 2013.
[39] Nino Antulov-Fantulin, Alen Lancic, Hrvoje Stefancic, et al. Statistical inference framework for source detection of contagion processes on arbitrary network structures. arXiv preprint arXiv:1304.0018, 2013.
[40] B. Aditya Prakash, Jilles Vreeken, Christos Faloutsos. Spotting culprits in epidemics: How many and which ones? ICDM, 2012, 12: 11–20.
[41] Wenyu Zang, Peng Zhang, Chuan Zhou. Discovering multiple diffusion source nodes in social
Networks. ICCS, 2014.
1 Some research on topic discovery and evolution focuses on the discovery and evolution of a single event. However, from an algorithmic point of view, there is no substantial difference from most work on topics. Hence, in this book, we do not particularly distinguish between event discovery and topic discovery.
https://doi.org/10.1515/9783110599435-003
(5) The topic-related data from social networks are massive and relatively concentrated. In large social networks, the amount of data generated every day is enormous; for example, Facebook handles about 2.5 billion messages on average every day. However, these vast amounts of data are relatively concentrated in a few large social networks. With respect to Chinese networks, QQ, Sina Weibo, and WeChat cover almost all of the topic-related social network data.
(6) The topic-related data from social networks are dynamically and constantly
updated. Because of the interaction of users in social networks, a user’s attitude
about a particular topic may change with that of the surrounding friends.
Based on the above features, the methods of topic discovery and evolution in social networks, unlike traditional media monitoring, should be able to automatically discover and track the evolution of topics without human intervention. Moreover, because the topic data of social networks are multi-source, dynamic, and massive, discovering and tracking topics manually in social networks is almost impossible. Hence, it is necessary to design algorithms that enable computer programs to detect and track topics in social networks automatically.
As a relatively novel research subject, topic discovery and evolution in social networks differs substantially from its counterpart in traditional media and has not yet been studied in depth; research on this issue therefore remains at a relatively preliminary stage.
This chapter is organized as follows. In section 3.2, as one of the theoretical bases of topic-related research in social networks, the models and algorithms of topic discovery are introduced in detail, including topic model-based topic discovery, vector space model-based topic discovery, and term relationship graph-based topic discovery. In section 3.3, the models and algorithms of topic evolution, as another theoretical basis of topic-related research in social networks, are introduced, including simple topic evolution, topic model-based topic evolution, and adjacent time slice relation-based topic evolution. Section 3.4 is a brief summary of this chapter.
With regard to the features of Twitter data, we can propose targeted solutions to the problems of topic discovery. For example, for large-scale data we can use a distributed algorithm; for online demands we can apply an online machine learning algorithm; and for the brevity of the data we can adopt an aggregation strategy. Overall, we should pay special attention to the differences between social network data and traditional data in topic discovery; only in this way can we be guided toward reasonable and effective solutions for social network data.
Accordingly, the topic model has gradually been applied in a wide range of fields. It is now used not only in text processing but also in bioinformatics and image processing, achieving good results in all of these areas. Among all topic models, LDA has been the most widely used because of its advantages, such as a solid statistical foundation and the flexibility to adapt to different task requirements.
(Figure: graphical notation of the topic model, where D denotes a document and z denotes a topic.)
2 The difference between word and term: in a document represented by a term space, every term is a dimension, and the number of terms in a document refers to the number of distinct words; when counting words, however, repeated occurrences of the same word are counted multiple times.
the formula p(w_i|d) = N_i/N, where N_i represents the number of occurrences of the word w_i in the document, and N represents the total number of words in the document.
Therefore, if p(w|d) of a document is identified, we can quickly generate a document based on this probability distribution.
However, this process is not in line with our usual writing process. When
we write an article, in general, we first select a topic, and then select a number
of words related to this topic to enrich the article. In this manner, an article
is generated. Therefore, according to this approach, the completion of an article is
divided into two steps: first, selecting a topic of a document; and second, selecting
words related to the selected topic, repeating this process in turn, and generating one
word at a time until the number of words specified in the document is reached.
The topic model is a statistical model of the above-mentioned process; the difference is that a topic model assumes a document contains more than one topic. The generating process of a document under a topic model can be briefly described as follows:
Assume that we already know the topic distribution p(z|d) of a document as well as the term distribution p(w|z) of each topic.
When generating a word in a document, we first select a topic z according to the topic distribution and then select a term w according to the term distribution under that topic. Hence, the word generation in a document can be formulated from the probability perspective and simply expressed as:
p(w|d) = Σ_z p(w|z) p(z|d),   or equivalently   p(w, d) = p(d) Σ_z p(z|d) p(w|z)    (3.1)
Here, p(w|d) is known, while p(w|z) and p(z|d) are unknown. Assuming that there are M documents, that each document d has length N (i.e., it contains N words), and that there are K optional topics, then p(w|d) is an M × N matrix, p(z|d) is an M × K matrix, and p(w|z) is a K × N matrix.
From eq. (3.1), we can observe that, compared with generating the words of a document directly from a term distribution, this process inserts an intermediate layer – the topic layer – which is invisible in an actual document. Therefore, in some studies, this layer is called the latent structure of a document; intuitively, it is exactly what we want to express – the topic.
A simple example is presented below (where although the data is artificially
constructed, it is consistent with our understanding on real data).
Suppose that we have a document set D = {Document 1, Document 2, Document 3}, a term list V = {movie, music, tax, government, student, teacher, amount, art, principal}, and a topic set Z = {art, budget, education}. The optional words in each document come from V (the actual term set is undoubtedly much larger than V; the example is kept small for convenience of description), and the optional topics in each document come from the topic set Z. Here we artificially construct the proportion of words in each document; in practice, it can be obtained as the ratio of the frequency of a specific word to the total number of words.
In this example, the matrix p(w|d) constructed from documents and terms is as below, where the horizontal direction corresponds to different documents, the vertical direction corresponds to different terms, and each entry is the probability of a term in a document.
With eq. (3.1), the relationship between these matrices can be expressed as:

p(w|d) = Σ_z p(z|d) × p(w|z)
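To make the matrix view concrete, here is a minimal Python sketch; the numbers are invented for illustration (they are not the book's example table). A documents × topics matrix multiplied by a topics × terms matrix yields the documents × terms matrix p(w|d).

import numpy as np

# topic distribution of each document, p(z|d): 3 documents x 3 topics (art, budget, education)
P_zd = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.7, 0.2],
                 [0.1, 0.2, 0.7]])

# term distribution of each topic, p(w|z): 3 topics x 9 terms
# terms: movie, music, tax, government, student, teacher, amount, art, principal
P_wz = np.array([[0.3, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0],   # art
                 [0.0, 0.0, 0.4, 0.3, 0.0, 0.0, 0.3, 0.0, 0.0],   # budget
                 [0.0, 0.0, 0.0, 0.1, 0.3, 0.3, 0.0, 0.0, 0.3]])  # education

P_wd = P_zd @ P_wz        # documents x terms; every row still sums to 1
print(P_wd.round(3))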
Going back to the beginning: if we already know the topic distribution of a document and the term distribution of each topic, we can generate the words in the document. In reality, however, the situation is just the opposite: we can easily obtain the document set (the documents have already been written) and the term distribution of each document, while the unknown part is the topic information of the documents, namely, their latent structure. Therefore, knowing the outcome of the document generation process, we need to infer the intermediate parameters from it, i.e., the topic distribution p(z|d) of each document and the term distribution p(w|z) of each topic. Before describing the parameter estimation method in detail, we first introduce a formalized presentation and the specific algorithm steps of the LDA model.
Algorithm 3.1: The generation process of the LDA model for a document set
1: Extract the term distribution of every topic: φ ~ Dir(β)
2: For m = 1 to M do
3:   Extract the number of words N: N ~ Poisson(ξ), where ξ is the mean document length
4:   Extract the topic distribution of document m: θ_m ~ Dir(α)
5:   For n = 1 to N do
6:     Extract a topic z_{m,n} ~ Multi(θ_m)
7:     Extract a word w_{m,n} ~ Multi(φ_{z_{m,n}})
8:   End for
9: End for
During the document generation process in the LDA model, the topic–term distribution φ is generated first (Line 1 in Algorithm 3.1); φ is a corpus-level parameter and only needs to be sampled once, from a Dirichlet distribution with prior parameter β. For each document, we first determine the document length, namely the number of words N, from a Poisson distribution (Line 3), and then sample the document's topic distribution (Line 4). Next comes the step of generating every word in the document: a topic is sampled according to the topic distribution of the document (Line 6), and then a word is sampled according to the term distribution of that topic obtained in the former step (Line 7). These steps are repeated until all words in the document set are generated.
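To illustrate, the following is a minimal Python sketch of the generative process in Algorithm 3.1; the parameter values (K topics, W terms, M documents, α, β, and the mean document length ξ) are arbitrary choices for demonstration only.

import numpy as np

rng = np.random.default_rng(0)
K, W, M = 3, 9, 5                  # topics, vocabulary size, documents
alpha, beta, xi = 0.1, 0.01, 20    # Dirichlet priors and mean document length

phi = rng.dirichlet(np.full(W, beta), size=K)       # Line 1: term distribution of each topic
corpus = []
for m in range(M):                                  # Lines 2-9: generate each document
    N = rng.poisson(xi)                             # Line 3: number of words
    theta_m = rng.dirichlet(np.full(K, alpha))      # Line 4: topic distribution of document m
    doc = []
    for n in range(N):                              # Lines 5-8: generate each word
        z = rng.choice(K, p=theta_m)                # Line 6: sample a topic
        w = rng.choice(W, p=phi[z])                 # Line 7: sample a word from that topic
        doc.append(w)
    corpus.append(doc)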
In this process, we generally use the collapsed Gibbs sampling method to obtain the values of the hidden variables (the z values) and the probability distributions of the parameters in the model (the distributions θ_m and φ). The collapsed Gibbs sampling process can roughly be described as assigning every word in the document set to a corresponding topic. We introduce the Gibbs sampling process in detail below.
In each iteration, the topic assignment of every word is resampled according to the full conditional distribution

p(z_i = j | z_{-i}, w) ∝ [(n_{-i,j}^{(w_i)} + β) / (n_{-i,j}^{(·)} + Wβ)] · [(n_{-i,j}^{(d_i)} + α) / (n_{-i,·}^{(d_i)} + Kα)]

where the right-hand side is normalized over the K topics and W denotes the size of the vocabulary.
Here, β can be understood as the prior frequency with which a topic generates a word before any word in the corpus is observed, and α as the prior frequency with which a topic is sampled before any word in the document is observed. z_i = j means that word w_i is assigned to topic j, and z_{-i} denotes the assignments of all other words z_k (k ≠ i). n_{-i,j}^{(w_i)} is the number of words identical to w_i that are assigned to topic j; n_{-i,j}^{(·)} is the total number of words assigned to topic j; n_{-i,j}^{(d_i)} is the number of words in document d_i assigned to topic j; and n_{-i,·}^{(d_i)} is the total number of words in d_i that have been assigned to topics. All of these counts exclude the current assignment of word i.
(3) After a sufficient number of iterations of step (2), we can assume that the Markov chain has approached the target distribution, and the current values of z_i (i from 1 to N) are then recorded as a sample. To keep the autocorrelation small, subsequent samples are recorded only after a further number of iterations. Dropping the word index and letting w denote a word, for every single sample we estimate the values of φ and θ according to the following formulas:
φ_{z=j}^{(w)} = (n_j^{(w)} + β) / (n_j^{(·)} + Wβ),    θ_{d,z=j} = (n_j^{(d)} + α) / (n^{(d)} + Kα)

where n_j^{(w)} represents the number of times word w is assigned to topic j; n_j^{(·)} represents the total number of words assigned to topic j; n_j^{(d)} indicates the number of words in document d assigned to topic j; and n^{(d)} represents the total number of words in document d that have been assigned to topics.
The Gibbs sampling algorithm starts from an initial value; after enough iterations, the distribution of the samples can be considered close to the target distribution.
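For concreteness, here is a compact, unoptimized Python sketch of collapsed Gibbs sampling that follows the formulas above; the data layout (documents as lists of word ids) and the single-sample point estimates are our own simplifying assumptions.

import numpy as np

def gibbs_lda(docs, K, W, alpha, beta, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    n_wz = np.zeros((W, K))            # n_j^(w): how often word w is assigned to topic j
    n_z = np.zeros(K)                  # n_j^(.): total words assigned to topic j
    n_dz = np.zeros((len(docs), K))    # n_j^(d): topic counts inside document d
    z_assign = []

    for d, doc in enumerate(docs):     # random initialization of topic assignments
        zs = rng.integers(K, size=len(doc))
        z_assign.append(zs)
        for w, z in zip(doc, zs):
            n_wz[w, z] += 1; n_z[z] += 1; n_dz[d, z] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = z_assign[d][i]
                n_wz[w, z] -= 1; n_z[z] -= 1; n_dz[d, z] -= 1     # the "-i" counts
                # full conditional; the document-length denominator is constant in j,
                # so it is dropped before normalization
                p = (n_wz[w] + beta) / (n_z + W * beta) * (n_dz[d] + alpha)
                p /= p.sum()
                z = rng.choice(K, p=p)
                z_assign[d][i] = z
                n_wz[w, z] += 1; n_z[z] += 1; n_dz[d, z] += 1

    phi = (n_wz + beta) / (n_z + W * beta)                             # estimate of p(w|z)
    theta = (n_dz + alpha) / (n_dz.sum(1, keepdims=True) + K * alpha)  # estimate of p(z|d)
    return phi, theta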
Here, we use a simple example to illustrate the iterative process of Gibbs sampling. The five selected documents are Facebook posts about the lost flight MH370 (for details, see the Appendix), dated 20140309, 20140323, 20140325, 20140327, and 20140405. When the number of topics is set to 5, we obtain a series of Gibbs sampling results as follows:
(Figure: topic proportions (percentages) of the five documents after different numbers of Gibbs sampling iterations.)
3 http://mallet.cs.umass.edu/.
owing to certain features of social network data, relevant research shows that directly applying the topic model to social network data (such as short texts) does not produce the expected results [17]. Therefore, we need to study how to use the topic model in social networking environments.
Social network data includes data from blogs, microblogs (e.g., Twitter and Weibo), instant messaging (IM, such as QQ), and so on. Since Twitter was launched in 2006, it has attracted more and more users and has gradually become the most popular data source for social network researchers. Researchers have done a great deal of work with Twitter data; in particular, they have tried to apply topic models to Tweets to uncover more valuable latent topic information.
Because the content length limit of Tweets makes the direct application of the traditional topic model ineffective, some scholars have studied how to train the standard topic model in short-text environments. Combining Tweets with author information, they present three topic modeling schemes [13].
(1) MSG Model: MSG refers to Messages Generated by the Same User.
① Train LDA model in training corpus;
② In the training corpus, the information generated by the same user is aggre-
gated into a Training User Profile for the user;
③ In the testing corpus, the information generated by the same user is aggre-
gated into a Testing User Profile for the user;
④ Take Training User Profile, Testing User Profile and testing corpus as “new
documents”, and use the training model to infer the distribution of their topics.
(2) USER Model: the messages generated by the same user are aggregated into a User Profile, and the LDA model is trained on these aggregated profiles rather than on individual messages.
(3) TERM Model: as the name suggests, it aggregates all messages that contain a certain term.
① For each term in the training set, all information containing the term is
aggregated into a Training Term Profile;
② Train LDA model on the Training Term Profile;
The principle of TERM is based on the fact that Twitter users often use customized topic labels (words marked with #, called hashtags) that represent specific topics or events. With the established Term Profiles, the TERM model can directly obtain the topics related to these labels.
The MSG model trains the LDA model on single Tweets. As the length of each Tweet is limited, there is not enough information for the model to learn topic patterns; specifically, compared with long texts, the words of short texts are less discriminative. The TERM and USER models, however, use the aggregation strategy to train the model, and the results they obtain should be better.
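As a simple illustration of the USER and TERM aggregation strategies, the following Python sketch builds "profile documents" by concatenating all Tweets of the same user or all Tweets containing a given term; the data layout ((user, text) pairs) and the function names are assumptions of ours.

from collections import defaultdict

def user_profiles(tweets):
    """tweets: iterable of (user, text) pairs -> one aggregated document per user."""
    profiles = defaultdict(list)
    for user, text in tweets:
        profiles[user].append(text)
    return {user: " ".join(texts) for user, texts in profiles.items()}

def term_profiles(tweets, vocabulary):
    """vocabulary: a set of terms -> one aggregated document per term."""
    profiles = defaultdict(list)
    for _, text in tweets:
        for term in vocabulary & set(text.lower().split()):
            profiles[term].append(text)
    return {term: " ".join(texts) for term, texts in profiles.items()}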
Considering the differences between Twitter content and traditional media, some scholars have proposed an improved LDA model – Twitter-LDA [32]. Clearly, because Tweets are brief, the standard LDA model does not work well on Twitter. To overcome this problem, the above-mentioned MSG, USER, and TERM schemes use an aggregation strategy to merge a user's Tweets into a longer article, which is, by and large, an application of the Author-Topic Model (ATM) [21]. However, this can only discover topics at the author level, which does not help for a single Tweet. Hence, the standard LDA model needs to be improved into a model that works for a single Tweet.
In the Twitter-LDA model, we assume that there are K topics, each represented by a distribution over terms. φ_t represents the term distribution of topic t; φ_B represents the term distribution of background words; θ_u represents the topic distribution of user u; and π represents a Bernoulli distribution (used to control the choice between background words and topic words). When writing a Tweet, a user first selects a topic based on his or her topic distribution, and then chooses a group of words according to the selected topic or the background model. The Twitter-LDA model is shown in Figure 3.4, with the generation process as follows:
(a) Sample φ_B ~ Dir(β) and π ~ Dir(γ)
(b) For each topic t:
    sample its term distribution φ_t ~ Dir(β)
(c) For each user u:
    sample the user's topic distribution θ_u ~ Dir(α)
    for each Tweet s of user u:
    i. sample the Tweet's topic z_{u,s} ~ Multi(θ_u)
    ii. for each word n in s, sample an indicator y_{u,s,n} ~ Bernoulli(π); if y_{u,s,n} marks a background word, sample w_{u,s,n} ~ Multi(φ_B), otherwise sample w_{u,s,n} ~ Multi(φ_{z_{u,s}})
Labeled LDA has also been adopted to analyze Twitter content [19]: based on the features of Twitter content, each hashtag is regarded as a label of the Tweet, which to some degree exploits the topic information tagged by users. A drawback of this approach, however, is that the model cannot be directly applied to Tweets without any hashtag.
In addition to Twitter data, some scholars have studied chat data from Internet Relay Chat (IRC) [29]. Like Twitter data, IRC data contains misspellings, grammatical errors, abbreviations, and other noise. Together with features such as dynamic change, concise expression, and intersecting topics, this makes IRC data unsuitable for analysis with existing text mining methods. To remedy these problems, researchers use a latent feature of IRC data – the social relationships among chatters – to filter out the irrelevant parts of a discussion. This is similar to using a PageRank-like score to distinguish highly interactive web pages from irrelevant ones; note, however, that the score here is not used for ranking but to improve the topic model. The method can be summarized as follows: certain features of the IRC data are used to construct a user relationship graph, the features of this graph (such as indegree and outdegree) yield a score for each user, and the score is used to separate noise from topical information; according to these scores, the weight of noisy information is decreased and that of useful information is increased, so that the topic model performs better. (A small sketch of this user-graph scoring is given below.)
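A minimal sketch of the user-graph idea, assuming the chat log can be reduced to (sender, addressee) pairs: a directed interaction graph is built and each user receives a degree-based score that can later be used to down-weight likely noise. The graph library and scoring choice are ours, not necessarily those of [29].

import networkx as nx

def user_scores(interactions):
    """interactions: iterable of (sender, addressee) pairs; addressee may be None."""
    G = nx.DiGraph()
    for sender, addressee in interactions:
        G.add_node(sender)
        if addressee is not None:
            G.add_edge(sender, addressee)
    if G.number_of_nodes() == 0:
        return {}
    deg = {u: G.in_degree(u) + G.out_degree(u) for u in G}   # indegree + outdegree
    max_deg = max(deg.values()) or 1
    return {u: d / max_deg for u, d in deg.items()}          # normalized interaction score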
The topic model has received a significant amount of attention from scholars in various fields: it not only has a solid statistical foundation but is also simple to build and can easily be extended to different applications. Many scholars have tried to apply topic models to social networks. In addition to the models mentioned above, they have proposed a variety of other topic models, which provide effective means for analyzing social network data.
feature. Finally, the similarity between vectors can be measured with a cosine
function. This method is relatively simple and will not be introduced in detail here.
In the initial TDT project, topic discovery was understood as the classification of a large number of different news stories; the methods therefore focused on VSM-based clustering. For different application scenarios, researchers have proposed different solutions: for Retrospective Event Detection (RED), a Group Average Clustering (GAC) algorithm was proposed, while for online event detection a Single Pass Incremental Clustering (SPIC) algorithm was proposed. These methods adapt well to the needs of different scenarios and meet the task requirements.
The adoption of the vector space model for topic discovery is based on the assumption that documents on similar topics have similar content. Thus, after a document is transformed into a feature vector by the VSM, the key step is to measure the similarity between features. However, the clustering algorithms used in traditional TDT mainly target topic discovery in streams of news reports, and migrating these methods directly to the current social network environment has limitations. In Twitter, Weibo, and similar sites, the data generated by users are relatively short, informal language is used frequently, and specialized network terminology abounds, which causes a VSM that simply uses terms as features to suffer from data sparsity and other issues. Consequently, we need to improve the VSM method to make it applicable to the data generated by social networks. When constructing feature vectors with the VSM, much more attention must be paid to feature selection, the calculation of feature weights, and the similarity measurement between features. Simply put, the chosen features and their weights should be sufficient to represent the content of the current text and to reflect the differences between texts, and the similarity measure should accurately reflect the difference between features.
Regarding the problems of feature selection and the weight calculation of text
contents in social networks, researchers have done a lot of work. Here, we briefly
introduce these efforts.
In reference [1], Paige H. Adams and Craig H. Martell compare the effect of several feature selection and weight calculation methods on the similarity measurement of chat (i.e., instant messaging) data. The authors analyze chat data and identify the following characteristics:
(1) Topics tend to aggregate by time. New topics may come from the previous topics;
before dying or being replaced by new topics, the current topic will remain for a
while.
(2) Interleaving occurs between different topics. Messages containing different
topics may be mixed together.
(3) The participants of a specific topic may change. But, generally speaking, the core
participants of a topic dialogue tend to remain unchanged.
Based on the above data characteristics, the authors point out that the traditional TFIDF-based weight calculation needs to be improved:
Regarding feature (1), a time-distance penalization coefficient is introduced into the similarity measurement to increase the similarity of data that are close in time and reduce the similarity of data with a larger time difference.
As for feature (2), on the term-feature side, the authors use hypernyms (i.e., topic words with conceptually broader denotations) to address the problem that different wordings of the same topic are semantically similar but share few terms.
Regarding feature (3), the authors attach to each feature vector the nickname of the user who published the original document and give data published by the same user more weight, implying that information published by the same author has a higher probability of belonging to the same topic.
In reference [3], Hila Becker et al. analyzed the rich context (i.e., background or contextual content) available in social networks, including text messages and non-textual information such as user-labeled information (title, tags, etc.) and automatically generated information (e.g., the creation time of the content). With the help of this background information, the authors compared various techniques for measuring the similarity of documents from social media. They point out that any single feature of social media content is too noisy to complete classification and clustering tasks effectively, but the joint use of multiple features (including the document content and its context) can provide valuable information related to the topic or event.
In fact, although the context features of different social media platforms are not identical, many platforms still share some common features, such as the author name (the user who created the document), title (the name of the document), description (a brief summary of the document content), labels (a set of keywords describing the content), date and time (when the content was published), and location information. These features can help measure the similarity between documents. For different features, researchers provide different approaches:
(1) For textual features, feature vectors weighted by TFIDF can be used, and the similarity is calculated with a cosine function.
(2) For date and time features, the similarity is calculated as 1 − |t1 − t2|/y, where t1 and t2 respectively represent the publication times of document 1 and document 2, and y denotes the number of minutes in a year. If the time interval between two documents is more than a year, the similarity of the two documents is regarded as zero.
(3) For geographical features, the similarity is calculated as 1 − H(L1, L2), where L1 and L2 respectively represent the latitude and longitude of document 1 and document 2, and H(·) denotes the Haversine distance. (A short code sketch of these three similarity features follows.)
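The following short Python sketch implements the three similarity features just listed; the normalization of the geographical distance is our own assumption, since the text does not specify it.

import math

def tfidf_cosine(u, v):
    """u, v: TF-IDF vectors as dicts term -> weight."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def time_similarity(t1, t2):
    """t1, t2: publication times in minutes; y = number of minutes per year."""
    y = 60 * 24 * 365
    diff = abs(t1 - t2)
    return 0.0 if diff > y else 1.0 - diff / y

def geo_similarity(p1, p2, max_km=20015.0):
    """p1, p2: (latitude, longitude); Haversine distance scaled to [0, 1]."""
    (lat1, lon1), (lat2, lon2) = p1, p2
    f1, f2 = math.radians(lat1), math.radians(lat2)
    df, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(df / 2) ** 2 + math.cos(f1) * math.cos(f2) * math.sin(dl / 2) ** 2
    dist_km = 2 * 6371.0 * math.asin(math.sqrt(a))      # Haversine distance in km
    return 1.0 - dist_km / max_km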
For topic discovery based on clustering algorithms, the key point is to choose appropriate similarity measures that reflect the similarity between documents. Becker et al. proposed two approaches in reference [3]: an aggregation-based similarity measurement method and a classification-based similarity measurement method. The former runs a variety of clustering methods and determines the final result either by weighted voting among the clusterers or by a weighted combination of their similarity scores. The classification-based approach takes the similarity scores as classification features for predicting whether two documents belong to the same topic or event.
After considering feature selection and similarity measurement, we also need to consider the choice of clustering algorithm. Most data in social networks can be seen as a continuously incoming data stream that is both large-scale and real-time. Hence, we must choose a clustering algorithm that is scalable and requires no prior knowledge of the number of clusters [4]. Researchers have therefore proposed incremental clustering algorithms, which consider each piece of information in turn and decide the appropriate category based on the similarity between the information and the current clusters.
Here, we briefly introduce the single-pass incremental clustering algorithm, which scans the data only once and dynamically adjusts its parameters, making it well suited to online environments with continuously arriving data and a dynamically growing number of clusters. Its basic steps are: each arriving document is compared with the existing clusters; if its maximum similarity to a cluster exceeds a given threshold, it is assigned to that cluster and the cluster representation is updated; otherwise, it forms a new cluster by itself.
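A minimal Python sketch of the single-pass procedure under the assumptions just stated (TF-IDF vectors as dicts, cosine similarity, a fixed merging threshold, and cluster centroids maintained as running means):

import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass_cluster(stream, threshold=0.3):
    """stream: iterable of TF-IDF vectors (dict term -> weight)."""
    centroids, clusters = [], []
    for doc in stream:
        sims = [cosine(doc, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is None or sims[best] < threshold:
            centroids.append(dict(doc))          # start a new cluster
            clusters.append([doc])
        else:
            clusters[best].append(doc)           # merge into the most similar cluster
            n, c = len(clusters[best]), centroids[best]
            for t in set(c) | set(doc):          # update the centroid as a running mean
                c[t] = c.get(t, 0.0) + (doc.get(t, 0.0) - c.get(t, 0.0)) / n
    return clusters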
1. Basic idea
The basic idea of the term relationship graph-based topic discovery algorithm is to
first construct a term co-occurrence graph based on the co-occurrence relationship
between terms, then via the community detection algorithms commonly adopted
in social network analysis to find the community formed by terms (i.e., a set of terms
related to a certain topic), and finally to determine the topic of each document
according to the similarity between the found term sets and the original document.
2. Method description
The term co-occurrence graph based topic discovery can be largely divided into the
following three steps [25, 26]:
Step 1, construct a term co-occurrence graph according to the relationship of term co-
occurrence;
Step 2, run a community detection algorithm on the term co-occurrence graph to obtain communities (i.e., term sets), each of which describes a specific topic;
Step 3, specify a set of topic terms for each document in the original document set.
Next, we will conduct a detailed description for each of the above steps.
For edge screening, we can also calculate the conditional probability of one term appearing given another, and remove the edge between the corresponding nodes when both directional conditional probabilities fall below the threshold edge_min_prob. The conditional probability is calculated as

p(w_i | w_j) = DF_{i∩j} / DF_j = DFe_{i,j} / DF_j

where DF_j represents the document frequency of term w_j, DFe_{i,j} (= DF_{i∩j}) is the number of documents in which w_i and w_j co-occur, and p(w_i | w_j) represents the conditional probability that term w_i appears given that term w_j appears;
discovery task, theoretically the community discovery algorithm that can effectively
detect a term set related to a topic can be applied to the topic discovery.
Here, we denote a topic cluster by S and a document by d, and d ∩ S represents the intersection of the document's terms with the terms in the topic cluster. f(w) can be a simple Boolean function or some other function whose value reflects the similarity between the topic cluster and the document.
Certainly, we can also use a cosine function to measure the similarity between a topic and a document; the probability distribution of each topic in document d can then be calculated as:

p(z|d) = cos(d, S_z) / Σ_{z′∈Z} cos(d, S_{z′})
In general, different features of a topic carry different weights for that topic, so the similarity measure between a topic and a document can be improved on this basis. For example, we can use TF*IDF values to weight the different features (i.e., terms) under the current topic and thus obtain a more accurate topic distribution for the document.
Overall, term co-occurrence graph-based topic discovery builds on the relatively mature theory of term co-occurrence; it is intuitive and simple, and provides new ideas and methods for topic discovery. (A rough end-to-end sketch of this approach is given below.)
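As a rough end-to-end sketch of the three steps, the following Python code builds a term co-occurrence graph, detects term communities with a modularity-based algorithm, and assigns each document to the best-overlapping topic cluster. The thresholds and the particular community detector are our own choices for illustration, not necessarily those of [25, 26].

from itertools import combinations
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def cooccurrence_topics(docs, edge_min_df=2):
    """docs: list of documents, each given as a set of terms."""
    # Step 1: term co-occurrence graph; edge weight = number of co-occurring documents
    G = nx.Graph()
    for terms in docs:
        for a, b in combinations(sorted(terms), 2):
            w = G.get_edge_data(a, b, {"weight": 0})["weight"] + 1
            G.add_edge(a, b, weight=w)
    G.remove_edges_from([(a, b) for a, b, d in G.edges(data=True)
                         if d["weight"] < edge_min_df])        # simple edge screening
    # Step 2: community detection on the graph -> topic term clusters
    clusters = [set(c) for c in greedy_modularity_communities(G, weight="weight")]
    # Step 3: assign each document the cluster with the largest relative term overlap
    assignment = []
    for terms in docs:
        scores = [len(terms & S) / len(S) for S in clusters]
        assignment.append(max(range(len(clusters)), key=scores.__getitem__))
    return clusters, assignment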
In summary, topic discovery technology originates from DARPA's TDT project [2], which originally used VSM-based clustering methods. Later, with the development of topic models, scholars gradually began to conduct text analysis with topic models; meanwhile, other methods such as term co-occurrence analysis also perform well. Certainly, not all topic discovery methods are covered in this chapter. For example, new techniques from natural language processing, such as recent deep learning methods, also provide new technical means for topic discovery. As another example, focusing on the extensive user participation and rich interaction in Weibo, reference [9] proposes a group intelligence-based method for detecting new topics/events in Weibo streams. The main idea of this approach is to first decompose the traditional single-model topic/event detection scheme according to Weibo participants, then create a topic/event discriminant model based on language features and sequential characteristics for each participant, namely a personal model for each Weibo user, and finally determine new topics/events by voting with group intelligence. As mentioned earlier, these technologies and methods cannot simply be transplanted across different types of data; instead, we should analyze the features of the data and modify the traditional methods appropriately. Only in this way can the effectiveness of the adopted method and the accuracy of topic discovery be improved.
With the explosive growth of social networks, new social media represented by Weibo will surely receive ever more widespread attention. Analyzing the content of social networks, better understanding users' interests, and providing valuable information for business decisions and public opinion monitoring all rest on topic discovery technologies for social networks. The rapid development of social networks provides plentiful research material for topic discovery, which in turn requires topic discovery technologies to advance with the times and adapt to emerging new media and new data. In short, the application prospects of topic discovery will only broaden with the pace of the internet age.
studies have not effectively considered time characteristics of terms and analyzed the
distribution of topics across the timeline.
After the topic model was proposed, how to effectively use the time characteristics of terms in the topic model and study the characteristics of topic evolution became a hot issue in the study of social network text. Unlike the earlier TDT studies, in the topic model each text is a mixture of topics and each topic is a mixture over a set of terms. Because the topic model can capture the evolution of a topic, and introducing topics can benefit text prediction, the topic model has been widely used in the field of topic evolution.
This section focuses mainly on the most common and most widely used topic evolution methods in social networks. First, we introduce the widely used simple topic evolution method. Second, we introduce the LDA model and other high-precision topic models that incorporate time in various ways to meet a variety of application requirements. Finally, topic evolution methods that work better in some special application environments are introduced.
In the study of topic evolution in social networks, the most commonly used approach is the simple topic evolution method: a topic discovery method is applied within each time slice, and the similarity of the keywords produced for adjacent time slices is then analyzed and compared to characterize how topics evolve [27].
A typical article on topic evolution [25] points out that, because a social network continuously generates large amounts of data, it becomes impossible to rerun the topic discovery algorithm on the entire dataset every time new data arrive. Hence, the authors set a sliding time window of fixed size over the social network and use their topic discovery algorithm to compute the similarities and differences between the topics of the current window and those of the previous window, thereby analyzing the evolution of the topics. (A toy sketch of this adjacent-slice comparison is given below.)
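A toy Python sketch of this idea, assuming each time slice has already been reduced to a list of topic keyword sets; a Jaccard overlap above a threshold links a topic in one slice to a topic in the next (all names and the threshold are illustrative).

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def link_adjacent_slices(slices, threshold=0.3):
    """slices: list of time slices, each a list of topic keyword sets."""
    links = []
    for t in range(1, len(slices)):
        for i, prev in enumerate(slices[t - 1]):
            for j, cur in enumerate(slices[t]):
                sim = jaccard(prev, cur)
                if sim >= threshold:      # topic j at slice t continues topic i at slice t-1
                    links.append((t - 1, i, t, j, round(sim, 2)))
    return links

slices = [[{"mh370", "search", "malaysia"}, {"election", "vote"}],
          [{"mh370", "search", "ocean", "debris"}, {"election", "candidate", "vote"}]]
print(link_adjacent_slices(slices))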
Many topic evolution methods based on multi-word topic representations and supervised/semi-supervised algorithms have been proposed and can solve the problems of topic evolution in traditional media. For topic evolution in social networks, however, these traditional methods face the following problems:
(1) There is more noise in social network text than in traditional text, such as colloquial expressions, slang, and advertising.
(2) The text in social networks is shorter than that in traditional media, which greatly reduces the accuracy of text mining algorithms.
(3) Topics in social networks spread by word of mouth, which makes them evolve very fast.
the forwarding and commenting Weibo posts in the topic space. Therefore, even without predicting public concerns in advance, we can portray how public concerns evolve as events develop. Interested readers can refer to reference [10] for more details.
out how to apply citations to topic evolution. In this model, a citation is interpreted as the succession of a topic, and the time properties of documents are carefully considered; even the temporal order of documents within a time slice is measured, ultimately reflected as a partial order on the citation graph.
In reference [30], a research topic evolution graph is proposed, which includes the following elements: topic milestone papers, topic time strength, and topic keywords.
(1) Topic milestone papers are the papers most representative of a topic and most helpful for understanding it.
(2) The topic time strength indicates the number of related topic citations at different times, which mainly reveals the changes between current and previous citations and the lifecycle of a topic.
(3) The topic keywords are the keywords that accurately summarize the topic, enabling users to gain a general understanding of the topic even before reading the relevant references and helping users accurately locate the papers they are most interested in and most ought to read among the references.
The biggest difference between this work and previous research is that it considers not only the similarity between texts but also the dependencies between cited papers. The authors verify that papers citing the same paper exhibit topical similarity, which reflects the topic similarity between papers more accurately than the texts alone. Another feature of this work is that each paper is regarded as a set of citations and modeled with a topic generation model in which citations are mainly represented by the latent topic variable. This differs from traditional approaches that use a probabilistic topic model to discover topics in documents: here a topic is a multinomial distribution over research papers, whereas in traditional methods a topic is a multinomial distribution over words. Therefore, by taking into account the time factors of papers and citations, an accurate topic evolution graph can be obtained.
field is introduced to model the impact of historical states and dependency relationships on the graph, so that the topic model can generate the corresponding keywords for given topics of interest. As the Gibbs random field and the topic model interact, topic evolution becomes an optimization problem over a joint probability distribution involving historical, textual, and structural factors. The authors also verify that the classic topic model is a special case of this model.
For online social networks, and especially for the short texts encountered during topic evolution on Twitter, traditional text methods cannot solve the problem because of the tight time requirements, the large amount of data to be processed, and data sparsity. Specifically, in the continuously generated data stream, we need to find Tweets related to a preset topic; the incoming data stream is huge, the data related to the specified topic are very limited, and the topic evolution analysis must be completed under both time and space constraints. Because in practice only about one millisecond is available to process a Tweet, Lin et al. use a simple language model in reference [16]: the hashtags that appear in part of the Tweets are used as an approximation to train the probability model, which can then be applied to the continuous Tweet stream in online social networks, including the majority of Tweets that carry no hashtag. The paper adopts smoothing techniques to cope with timeliness and sparsity and also considers a variety of techniques for preserving historical information. Experiments show that the most appropriate smoothing technique in this setting is Stupid Backoff – the simplest one. The paper thus shows that, under the extreme circumstances of massive data and fast computing requirements in online social networks, and considering the balance among speed, availability, scalability, and other practical demands, the simplest way is often the best way.
As for the cyclical phenomena occurring in the topic evolution of social media (such as Flickr and Twitter), Yin et al. [31] proposed a probabilistic model based on latent periodic topic analysis (LPTA) in 2011. This model mainly addresses the following issues:
(1) Existing work on periodicity is generally concentrated on time series databases and is not suitable for the text processing required in topic evolution analysis.
(2) Periodic word analysis cannot satisfy the requirements of periodic topic analysis, because the same word may not reappear while other words of the same topic do.
(3) Owing to the diversity of language, there are many synonyms in text, which makes topic analysis a challenging problem.
Therefore, the proposed LPTA method can be considered a variant of the latent topic model; the difference from traditional topic models lies in its period property in the time domain. Specifically, it clusters words of the same topic that are separated by roughly one period, so the problem is transformed into estimating the period and determining the topic. Technically, the authors use a maximum likelihood approach to estimate the relevant parameters. The goal of LPTA is not only to identify the latent topic space from the data but also to discover whether topic evolution exhibits periodicity.
In addition to the simplest and most intuitive topic evolution model described in section 3.3.1 and the most commonly used topic models introduced in section 3.3.2, there are other types of models in the study of topic evolution. The difference is that these models mainly focus on the link between historical time slices and the present time slice, as well as the relevant background of topics.
In the social media stream, users have the following natural requirements for topic evolution:
(1) Once a new topic emerges in social media, it should be detected promptly.
(2) The evolution of topics of interest should be tracked in real time.
(3) The amount of information provided should be kept within a proper range to avoid information overload.
Saha et al. [22] point out that new information appears all the time in social media, so it is impossible to send all topic information to users; a model is therefore needed to infer the topics of interest to users from the topic information of historical time slices. The authors propose a system based on rigorous machine learning and optimization strategies for the online analysis of social media streams; specifically, it applies an effective topic model together with non-negative matrix factorization techniques. Unlike existing work, it maintains the temporal continuity of topic evolution during the discovery of new topics.
Unlike the above-mentioned papers on topic evolution, Lin et al. [15] also reveal a topic's latent diffusion paths in social networks, in addition to analyzing topic evolution. After jointly considering the text of documents, social influence, and topic evolution, they turn the problem into a joint inference problem. Specifically, the authors propose a hybrid model that, on the one hand, generates topic-related keywords based on topic evolution and diffusion and, on the other hand, adjusts the diffusion process with a Gaussian Markov random field using the social influence at the level of individual users. The paper argues that in social networks users acquire information more from their social links than from strangers; hence, the researchers believe that topics actually spread along the social network structure. Based on the proposed topic diffusion and evolution model, the latent diffusion path graph can be determined, which in turn helps locate the source of a topic; moreover, the temporal properties of the topic can be inferred to track its new developments and understand the regularity of its changes.
3.4 Summary
With the advancement of computing technologies and the popularization of the internet, online social networks have developed dramatically, accompanied by an explosive growth of information. Traditional knowledge acquisition approaches can no longer keep up with the pace of knowledge production in the era of big data, which demands intelligent processing of information and automatic discovery and acquisition of knowledge in the network age. Among the various sources, social networks contain a large amount of valuable information because they genuinely reflect people's lives; text mining and analysis for social networks is therefore particularly important. Topics are the information social network users care about most, and accurately detecting topics and tracking their evolution is of great value for monitoring and guiding public opinion, for business decisions, and for other purposes. This chapter introduced the theories and technologies related to topic discovery and topic evolution, in the hope that readers can gain a general understanding of the relevant fields.
Topic discovery methods mainly include the topic model, the traditional vector space model-based method, and the term co-occurrence graph-based method, of which the topic model is the current mainstream approach; all of these methods must be suitably modified to fit the characteristics of text in social networks in order to improve the analysis results. Research on topic evolution in social networks is relatively limited; the main method is still based on time slice segmentation, with more complex variants considering the relationships between time slices.
Overall, although research on topic discovery and evolution has been carried out for many years, the initial research field was relatively narrow and the methods used are not yet mature enough to be applied directly to current online social network environments. In recent years, research on topic discovery and evolution has made considerable progress, but we believe there are still many challenging issues to be solved:
(1) The formal definition and representation of a topic. Although many researchers have given relatively complete definitions and representations of a topic, the notion itself involves subjective factors and researchers have not established a systematic concept to describe it, which leads to more or less differences
References
[1] Paige H. Adams, Craig H. Martell. Topic detection and extraction in chat. 2008 IEEE
International Conference on Semantic Computing, 2008.
[2] James Allan, Jaime Carbonell, George Doddington, Jonathan Yamron, Yiming Yang. Topic
detection and tracking pilot study final report. In Proceedings of the DARPA Broadcast News
Transcription and Understanding Workshop, 1998.
[3] Hila Becker, Mor Naaman, Luis Gravano. Learning similarity metrics for event identification in
social media. In Proceedings of the third ACM international conference on Web search and data
mining, 2010.
[4] Hila Becker, Mor Naaman, Luis Gravano. Beyond trending topics: Real-world event identification
on Twitter. ICWSM 11, 2011: 438–441.
[5] David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, 3:993–1022.
[6] Levent Bolelli, Seyda Ertekin, C. Lee Giles. Topic and trend detection in text collections using
latent Dirichlet allocation. In ECIR’09, 2009.
[7] Kai Cui, Bin Zhou, Yan Jia, Zheng Liang. LDA-based model for online topic evolution mining.
Computer Science, 2011, 37(11):156–159.
[8] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard
Harshman. Indexing by latent semantic analysis. JASIS, 1990, 41(6):391–407.
[9] Lei Deng, Zhaoyun Ding, Bingying Xu, Bin Zhou, Peng Zou. Using social intelligence for
new event detection in microblog stream. 2012 Second International Conference on Cloud
and Green Computing (CGC), 2012: 434–439.
[10] Lei Deng, Bingying Xu, Lumin Zhang, Yi Han, Bin Zhou, Peng Zou. Tracking the evolution of public concerns in social media. In Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service, 2013: 353–357.
[11] Qi He, Bi Chen, Jian Pei, Baojun Qiu, Prasenjit Mitra, C. Lee Giles. Detecting topic evolution in
scientific literature: How can citations help?. CIKM’09, 2009.
[12] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual
international ACM SIGIR conference on Research and development in information retrieval,
1999.
[13] Liangjie Hong, Brian D. Davison. Empirical study of topic modeling in Twitter. In Proceedings of
the First Workshop on Social Media Analytics, 2010: 80–88.
[14] Cindy Xide Lin, Bo Zhao, Qiaozhu Mei, Jiawei Han. PET: A statistical model for popular events
tracking in social communities. KDD’10, 2010.
[15] Cindy Xide Lin, Qiaozhu Mei, Jiawei Han, Yunliang Jiang, Marina Danilevsky. The joint inference
of topic diffusion and evolution in social communities. ICDM’11, 2011.
[16] Jimmy Lin, Rion Snow, William Morgan. Smoothing techniques for adaptive online language
models: Topic tracking in Tweet streams. KDD’11, 2011.
[17] Yue Lu, Chengxiang Zhai. Opinion integration through semi-supervised topic modeling. In
Proceedings of the 17th international conference on World Wide Web, 2008.
[18] Omid Madani, Jiye Yu. Discovery of numerous specific topics via term co-occurrence analysis.
In Proceedings of the 19th ACM international conference on Information and knowledge
management, 2010.
[19] Daniel Ramage, Susan Dumais, Dan Liebling. Characterizing microblogs with topic models.
ICWSM, 2010.
[20] Daniel Ramage, David Hall, Ramesh Nallapati, Christopher D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009.
[21] Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, Padhraic Smyth. The author-topic model for
authors and documents. In Proceedings of the 20th conference on Uncertainty in artificial
intelligence, 2004.
[22] Ankan Saha, Vikas Sindhwani. Learning evolving and emerging topics in social media: a
dynamic nmf approach with temporal regularization. In Proceedings of the fifth ACM
international conference on Web search and data mining, 2012.
[23] Gerard M Salton, Andrew Wong, Chungshu Yang. A vector space model for automatic indexing.
Communications of the ACM, 1975, 18(11): 613–620.
[24] Hassan Sayyadi, Matthew Hurst, Alexey Maykov. Event detection and tracking in social
streams. ICWSM, 2009.
[25] Hassan Sayyadi, Matthew Hurst, Alexey Maykov. Event detection and tracking in social streams. ICWSM'09, 2009.
[26] Hassan Sayyadi, Louiqa Raschid. A graph analytical approach for topic detection. ACM Transactions on Internet Technology (TOIT), 2013, 13(2): 4.
[27] Jing Shi, Meng Fan, Wanlong Li. Topic analysis based on LDA model. Acta Automatica Sinica,
2009, 35(12): 1586–1592.
[28] Jintao Tang, Ting Wang, Qin Lu, Ji Wang, Wenjie Li. A wikipedia based semantic graph model
for topic tracking in blogosphere. In Proceedings of the Twenty-Second international joint
conference on Artificial Intelligence, 2011.
[29] Ville H. Tuulos, Henry Tirri. Combining topic models and social networks for chat data mining.
In Proceedings of the 2004 IEEE/WIC/ACM international Conference on Web intelligence,
2004.
[30] Xiaolong Wang, Chengxiang Zhai, Dan Roth. Understanding evolution of research themes: A probabilistic generative model for citations. KDD'13, 2013.
[31] Zhijun Yin, Liangliang Cao, Jiawei Han, Chengxiang Zhai, Thomas Huang. LPTA: A probabilistic
model for latent periodic topic analysis. In 2011 IEEE 11th International Conference on Data
Mining (ICDM), 2011.
[32] Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan, Xiaoming Li. Comparing Twitter and traditional media using topic models. In Proceedings of the 33rd European Conference on Advances in Information Retrieval, 2011: 338–349.
Appendix
1 Document 20140309 in application examples
We are working with anti-terrorism units, says Hisham
SEPANG: Malaysia has informed counter-terrorism agencies of various countries in light of several
imposters found to have boarded flight MH370 that went missing over the South China Sea early Saturday.
Defence Minister Datuk Seri Hishammuddin Hussein (pic) said Malaysia would be working with
intelligence agencies, including the Federal Bureau of Investigation (FBI), on the matter.
“If it is an international network, the Malaysian immigration alone will not be sufficient.”
“We have also informed the counter-terrorism units of all relevant countries.”
“At this point, we have not established if there was a security risk involved (and) we do not want to
jump the gun,” Hishammuddin said when asked if there could be any hijack or terror elements in the
disappearance of the MH370 flight.
On two impostors who boarded the flight using passports reported lost by an Italian and an
Austrian, Hishammuddin said the authorities would screen the entire manifest of the flight.
Both aircrafts arrived in Malaysia on March 21. After extensive briefings in Malaysia on Sunday,
both Indian aircrafts took off to be part of the Australian-led search on Sunday morning.
India has been participating in the search and rescue operation beginning March 11 in the
Andaman Sea and Bay of Bengal.
New statement from Malaysia Airlines: Tan Sri Md Nor Md Yusof, Chairman of Malaysia Airlines:
The painful reality is that the aircraft is now lost and that none of the passengers or crew on board
survived.
This is a sad and tragic day for all of us at Malaysia Airlines. While not entirely unexpected after an
intensive multi-national search across a 2.24 million square mile area, this news is clearly devastating
for the families of those on board. They have waited for over two weeks for even the smallest hope of
positive news about their loved ones.
This has been an unprecedented event requiring an unprecedented response. The investigation
still underway may yet prove to be even longer and more complex than it has been since March 8th.
But we will continue to support the families—as we have done throughout. And to support the
authorities as the search for definitive answers continues.
MAS Group CEO, Ahmad Jauhari Yahya, has said the comfort and support of families involved and
support of the multi-national search effort continues to be the focus of the airline. In the last 72 hours,
MAS has trained an additional 40 caregivers to ensure the families have access to round-the-clock
support.
A short while ago Australian Defense Minister David Johnston said “to this point, no debris or
anything recovered to identify the plane” He also said this is an extremely remote part of the world
and “it’s a massive logistical exercise.” “We are not searching for a needle in a haystack. We are still
trying to define where the haystack is,” he said.
He added that he would boycott all Malaysian products and avoid coming to Malaysia indefinitely.
Posts by Zhang, Chen and other Chinese celebrities have been widely shared online by Chinese
netizens.
Fish Leong
Some Chinese netizens have also urged a boycott of Malaysian artistes such as Fish Leong (right),
Gary Chaw, Lee Sinje and Ah Niu, who are popular for their music and movies.
Bahau-born Leong, who is an expectant mother, drew scorn from Microblogging users after
uploading a photograph of three candles as a mark of respect for MH370 victims.
Numerous Chinese netizens responded by cursing her and her unborn child. The photograph has
since been removed.
In an apparent attempt to stem the anger and distrust in China, Malaysian officials have also met
the Chinese ambassador to Malaysia Huang Huikang to ask for the Chinese government to engage
and help clarify the situation to the bereaved relatives and the public.
“Malaysia is working hard to try and make the briefings to the Chinese relatives in Beijing more
productive,” read a statement sent out by the Transport Ministry today.
https://doi.org/10.1515/9783110599435-004
Detailed studies on influence spread models have been carried out in the academic community, where the independent cascade model [2] and the linear threshold model [3] are currently the two most widely studied influence maximization models.
The independent cascade model is based on probabilities: each node, after switching into the active state, tries to activate each of its successor nodes with a certain probability, and the attempts of multiple active nodes to activate the same neighboring node are independent of each other; hence the name independent cascade model. The linear threshold model is based on thresholds: the attempts of multiple active nodes to activate the same successor node are dependent, and whether the activation succeeds depends on whether the sum of the influence weights exerted on that successor node by all active nodes surpasses the node's threshold. The independent cascade model and the linear threshold model describe the process of influence spread based on probability and threshold, respectively. Please refer to Chapter 10 for details; they will not be repeated here.
In addition to these models, there are also some other influence spread models in
the research of influence maximization. A brief description is presented below.
(1) Triggering model [1]. The triggering model was proposed by Kempe et al. in 2003. In the triggering model, each node v corresponds to a triggering set T_v, which defines the nodes capable of triggering node v to switch from the inactive state to the active state; namely, at time t, an inactive node v is activated if and only if at least one precursor u of node v is in the active state at time t − 1 and u ∈ T_v. Take Figure 4.1 as an example and suppose that T_e = {a, c}, T_f = {b, e}, and node c is in the active state at t = T_0. Then, at t = T_1, c tries to influence nodes e and f. Since c ∈ T_e but c ∉ T_f, c successfully activates node e, while f remains inactive. At t = T_2, the newly activated node e also tries to influence its neighboring node f; as e ∈ T_f, node e succeeds in activating node f.
(2) Weighted cascade model [4]. The weighted cascade model is a special case of the independent cascade model. The difference between the two lies in that, in the weighted cascade model, the probability that the node v successfully activates its subsequent node w is the reciprocal of the in-degree of w, namely, p(v, w) = 1/d_w, where d_w is the in-degree of the node w. For instance, in Figure 4.1, the probability that the node e successfully activates the node f is p(e, f) = 1/3.
(3) Voter model [5]. The voter model is a probabilistic model widely applied in statistical physics and particle systems, and was first presented by Peter Clifford and Aidan Sudbury. In the voter model, each node randomly selects one node from its precursor node set in each step and adopts the state of the selected node as its own state. Again taking Figure 4.1 as an example, suppose that only the node c is in the active state when t = T_0, while all the other nodes are inactive. Then, when t = T_1, the node e randomly selects one node from its precursor node set {a, c}; if the node a is selected, e remains inactive at time T_1; if the node c is selected, e switches from the inactive state into the active state at time T_1. It should be noted that a node in the voter model can switch either from the inactive state into the active state or from the active state back into the inactive state, so the voter model is more suitable for modeling situations in which nodes are allowed to change their viewpoints; for example, voters in democratic elections may change their votes under the influence of the candidates and of other people.
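The triggering model and the voter model described above can likewise be sketched in a few lines. The following illustrative functions (our own assumptions, not the book's code) perform one synchronous step of each model; in_nbrs[v] is the precursor set of v, trigger[v] is the triggering set T_v, and state[v] is True when v is active.

import random

def triggering_step(in_nbrs, trigger, active_prev):
    """One step of the triggering model: return the active set at time t."""
    newly_active = set()
    for v, preds in in_nbrs.items():
        if v in active_prev:
            continue
        # v becomes active iff some precursor u was active at t - 1 and u is in T_v
        if any(u in active_prev and u in trigger[v] for u in preds):
            newly_active.add(v)
    return active_prev | newly_active

def voter_step(in_nbrs, state):
    """One step of the voter model: every node copies a random precursor's state."""
    new_state = {}
    for v, preds in in_nbrs.items():
        if preds:
            new_state[v] = state[random.choice(list(preds))]
        else:
            new_state[v] = state[v]  # a node without precursors keeps its own state
    return new_state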
1. Run time
The computation time of an influence maximization algorithm is important: the problem has important applications in areas such as policy formulation, and these applications are very sensitive to the computation time of the algorithm and place very high demands on its efficiency.
2. Algorithm precision
Regarding influence maximization, algorithm precision refers to the number of nodes ultimately influenced, after the process of influence spread, by the seed set selected by the influence maximization algorithm. In practical applications of influence maximization, many applications require the largest possible ultimate influence range; typical examples are marketing and advertisement publication. In these two applications, a larger ultimate influence range indicates better promotion of the product and more commercial profit. Therefore, exploring high-precision algorithms is also a key issue in influence maximization research. Research in recent years has proved that finding the K most influential nodes is an NP-hard problem [1]. Traditional node-ranking methods such as PageRank pay no attention to the characteristics of influence spread, so their precision is too low to solve the influence maximization problem. As a result, high precision is also a goal pursued by influence maximization algorithms.
3. Scalability
Scalability is an important metric for practical applications of influence maximization. Owing to their complexity and long run times, current solution algorithms can be applied only to social networks with fewer than a million nodes. Faced with large-scale social networks, influence maximization algorithms with good scalability must be designed to handle the severe challenge posed by the massive data of social networks.
As the influence maximization problem has been proven to be an NP-hard problem [1], research on it can mainly be divided into two directions:
(1) Greedy Algorithms. Research on Greedy Algorithms is basically based on the Hill-Climbing Greedy Algorithm [1], in which the node providing the maximum marginal influence gain is selected at each step, and a locally optimal solution is used to approximate the globally optimal solution. The advantage of Greedy Algorithms is their comparatively high precision, which reaches an approximation ratio of 1 − 1/e − ε. However, Greedy Algorithms have a serious efficiency problem, namely high algorithmic complexity and long run time; as a result, they can hardly be applied to large-scale social networks. Numerous studies and optimizations have addressed the efficiency of Greedy Algorithms, and this problem is still a hot topic in current research.
(2) Heuristic Algorithms. Different from Greedy Algorithms, a Heuristic Algorithm selects the most influential nodes according to a designed heuristic strategy and does not need to calculate the precise influence value of each node; hence its run time is short and its efficiency is high. However, its precision is too low to compare with that of Greedy Algorithms.
A set function f(·) is a submodular function if the marginal profit brought to any set S by the addition of any element v_i is not lower than the marginal profit brought to any superset T ⊇ S by the addition of the same element v_i. The formal description is as follows:
f(S ∪ {v_i}) − f(S) ≥ f(T ∪ {v_i}) − f(T), for all S ⊆ T and all v_i ∉ T,
or, equivalently,
f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T), for all sets S and T.
Basic theory 1:
If the function f(·) is a submodular function as well as a monotonic function (f(S ∪ {v_i}) ≥ f(S) holds for all sets S and all elements v_i), then, when trying to locate an element set S of size K such that f(S) is maximum, the Hill-Climbing Greedy Algorithm can be used to obtain an approximate solution within a factor of 1 − 1/e − ε of the optimum [6, 7], where e is the base of the natural logarithm and ε can be any positive real number.
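As an illustration of Basic theory 1, the following sketch applies hill-climbing greedy selection to an arbitrary monotone submodular set function f supplied as a Python callable; with such an f, the returned set enjoys the 1 − 1/e − ε guarantee quoted above, the ε term accounting for any estimation error in f. The function name and signature are our own assumptions, not notation from the book.

def hill_climbing(candidates, f, k):
    """Greedily build a size-k set that approximately maximizes a monotone submodular f."""
    S = set()
    for _ in range(k):
        # pick the element with the largest marginal gain f(S | {v}) - f(S)
        best = max((v for v in candidates if v not in S),
                   key=lambda v: f(S | {v}) - f(S))
        S.add(best)
    return S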
Pedro Domingos and Matt Richardson [8] were the first to study influence maximization as an algorithmic problem. Their early method regards the market as a social network, models individual purchasing behavior and the overall earnings brought by marketing as a Markov random field, and puts forward a single-pass algorithm and a greedy search algorithm to obtain approximate solutions. Subsequently, Kempe et al. [1] were the first to refine the problem into a discrete optimization problem, that is, to find the K nodes that maximize the ultimate influence range under a given spread model. Kempe et al. proved that this optimization problem is NP-hard under both the independent cascade model and the linear threshold model. They further proved that the influence value function σ(·) satisfies submodularity and monotonicity under both influence spread models, and thus put forward a greedy hill-climbing approximation algorithm, BasicGreedy, which guarantees an approximation ratio of 1 − 1/e − ε.
The Greedy Hill-Climbing Algorithm presented by Kempe et al. is shown as Algorithm 4.1. The algorithm starts with S as an empty set (Line 1) and then executes K rounds (Line 2); in each round, the node v providing the maximum marginal profit is selected (Line 10) and added to the initial node set S (Line 11). To calculate the marginal profit s_v of each node in the graph G (Line 3), Kempe et al. run R rounds of simulation (Lines 5–7), counting in each round the number of nodes that are ultimately influenced with the set S ∪ {v} as the initial active set; finally, the average value is taken (Line 8) and the node with the largest marginal profit joins the set S.
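The following sketch restates the BasicGreedy procedure described above on top of the simulate_ic helper given earlier (it is a sketch of the same idea, not a reproduction of Algorithm 4.1); the parameter names and the default number of simulations R are assumptions.

def basic_greedy(graph, prob, K, R=1000):
    """Monte Carlo greedy seed selection in the spirit of BasicGreedy."""
    S = set()
    nodes = set(graph)
    for _ in range(K):
        avg_spread = {}
        for v in nodes - S:
            # estimate the influence of S | {v} by averaging R simulated diffusions
            avg_spread[v] = sum(simulate_ic(graph, prob, S | {v})
                                for _ in range(R)) / R
        # the node maximizing the spread of S | {v} also maximizes the marginal gain
        S.add(max(avg_spread, key=avg_spread.get))
    return S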
In 2007, Leskovec et al. [9] proposed CELF (Cost-Effective Lazy Forward), an optimization of BasicGreedy. Because the influence value function σ(·) is submodular, the marginal influence gain of any node v can only decrease as the initial active set S grows. Hence, the CELF Algorithm does not have to recalculate the marginal influence gains of all nodes in every round, as the BasicGreedy Algorithm does. If the marginal gain of the node u computed in a previous round is already smaller than the marginal gain of the node v in the current round, then the marginal gain of u in the current round is bound to be smaller than that of v; therefore u cannot be the node with the largest marginal gain in the current round, and there is no need to recompute its marginal gain in this round. Precisely by exploiting the submodularity of the influence maximization objective function, the CELF Algorithm significantly reduces the number of marginal-gain computations in each round and narrows the selection range of nodes, thereby lowering the overall computational complexity. Experimental results show that the precision of the CELF Algorithm is basically the same as that of the BasicGreedy Algorithm, whereas its efficiency is far higher, with a speedup of up to 700 times. Even so, it still takes hours for the CELF Algorithm to find the 50 most influential nodes in a data set with 37,000 nodes, and its efficiency can hardly satisfy the demand for short run times on current social networks.
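The lazy-forward idea can be sketched as follows; this is an illustrative rendering rather than the implementation of Leskovec et al. Cached marginal gains live in a max-heap, and a node is re-evaluated only when its cached value was not computed in the current round; by submodularity, a value that is both up to date and on top of the heap cannot be beaten. Here estimate stands for any assumed spread estimator, for example an average over simulate_ic runs.

import heapq

def celf(nodes, estimate, K):
    S, spread_S = set(), 0.0
    # heap entries: (-marginal_gain, node, round_in_which_the_gain_was_computed);
    # the initial gains are exact for round 1 because S is still empty
    heap = [(-estimate({v}), v, 1) for v in nodes]
    heapq.heapify(heap)
    for rnd in range(1, K + 1):
        while True:
            neg_gain, v, computed_in = heapq.heappop(heap)
            if computed_in == rnd:
                # a fresh gain on top of the heap: no other node can do better
                S.add(v)
                spread_S += -neg_gain
                break
            # stale entry: recompute the marginal gain w.r.t. the current S
            gain = estimate(S | {v}) - spread_S
            heapq.heappush(heap, (-gain, v, rnd))
    return S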
The optimizations adopted by the NewGreedy Algorithm and the CELF Algorithm are not in conflict, so the MixGreedy Algorithm combines the advantages of the two: it uses the NewGreedy Algorithm in the first round and the CELF Algorithm in succeeding rounds to reduce the amount of computation, thereby further reducing the overall algorithm complexity. The experimental results show that the NewGreedy Algorithm and the MixGreedy Algorithm significantly accelerate the discovery of the most influential users in social networks while ensuring precision basically the same as that of BasicGreedy.
In the second phase, the CGA Algorithm selects the most influential users from the detected communities using dynamic programming. Suppose that the k − 1 most influential nodes have been obtained in the preceding k − 1 rounds; in round k, each community serves as the object of study, the MixGreedy Algorithm is used within each community to select its most influential node, and the best of these candidates is chosen as the globally most influential node of round k. Through experiments on a mobile social network, the authors showed that the running speed of the CGA Algorithm is significantly higher than that of the MixGreedy Algorithm. However, this speedup comes at the cost of precision, because the CGA Algorithm approximates the global influence of a node by its influence inside its own community, which lowers the precision.
Goyal et al. [11] thoroughly analyzed the CELF Algorithm and presented the CELF++ Algorithm, an optimization of CELF. The CELF++ Algorithm once again exploits the submodularity of the influence value function σ(·): when the marginal gain of a node v_i is computed in the current iteration, the algorithm also records prevbest, the ID of the most influential node seen so far in that iteration. If the prevbest node of the node v_i is indeed selected as the most influential node of the current round, then the influence value of v_i does not need to be recalculated in the next round's iteration, thereby avoiding many of the recalculations of influence values that occur in the CELF Algorithm. The authors showed by experiments that the CELF++ Algorithm reduces the run time by 35%–55% compared with the CELF Algorithm.
On the basis of the MixGreedy Algorithm, Liu et al. [12] analyzed the layer dependency and parallelizability of the nodes in social networks and, by transforming the network into a directed acyclic graph and scanning it bottom-up layer by layer, designed BUTA, a highly parallelizable influence maximization algorithm that efficiently computes the influence values of all nodes in the social network concurrently. Taking the CPU+GPU parallel computing system as a representative of existing heterogeneous parallel computing frameworks, they then mapped the BUTA Algorithm onto the CPU+GPU heterogeneous parallel framework and proposed the IMGPU framework. In order to better adapt the BUTA Algorithm to the GPU hardware architecture and programming model, the authors provided three optimization methods, namely K-layer merging, data reorganization, and memory-access coalescing, to reduce the number of branches and memory accesses and to improve parallelism. Finally, extensive experiments were performed on real social networks, and the results showed that the execution speed of BUTA and IMGPU is notably improved relative to the aforementioned MixGreedy Algorithm.
In summary, Greedy Algorithms rely on a large amount of simulation-based influence computation, resulting in rather long run times. Numerous follow-up studies have been performed to optimize this efficiency problem and have achieved notable acceleration, but they still fail to meet the demand for high algorithm efficiency. In particular, in the face of today's large-scale social networks, designing more efficient influence maximization algorithms remains a core objective of current research.
When a neighbor of the node v has already been selected as an initial active node, the degree of the node v needs to be discounted to account for the overlap between the two nodes. Details of the discounting method are shown in Algorithm 4.3. Experimental results show that the precision of the DegreeDiscount Heuristic is much higher than that of the Degree Heuristic, but it still cannot compete with the aforementioned Greedy Algorithms.
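As an illustration of the idea (a sketch of the same principle, not a reproduction of Algorithm 4.3), the code below follows the degree-discount formula of Chen et al. [4] for the independent cascade model with a uniform propagation probability p; the adjacency structure nbrs[v] (the neighbor set of v) and the default value of p are our assumptions.

def degree_discount(nbrs, k, p=0.01):
    """DegreeDiscount-style seed selection under a uniform edge probability p."""
    degree = {v: len(ns) for v, ns in nbrs.items()}
    dd = dict(degree)             # discounted degree, initially the plain degree
    t = {v: 0 for v in nbrs}      # number of already-selected neighbors of v
    seeds = set()
    for _ in range(k):
        u = max((v for v in nbrs if v not in seeds), key=dd.get)
        seeds.add(u)
        for v in nbrs[u]:
            if v in seeds:
                continue
            t[v] += 1
            # discount v's degree to account for the overlap with selected neighbors
            dd[v] = degree[v] - 2 * t[v] - (degree[v] - t[v]) * t[v] * p
    return seeds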
For the independent cascade model, Chen et al. [13] proposed a new heuristic algorithm, PMIA, in 2010. The authors first proved that computing the influence value of a given initial active set under the independent cascade model is a #P-hard problem, and then provided PMIA, a heuristic algorithm based on local influence. PMIA achieves high efficiency and good scalability because it approximates the global influence value of a node by its influence value within its local neighborhood: it constructs the maximum influence arborescence (MIA) model from maximum influence paths and trades off execution efficiency against precision by regulating the size of the MIA. The authors proved that the influence function remains submodular in the MIA model, so Greedy Algorithms can still reach an approximation ratio of 1 − 1/e − ε. For higher execution efficiency, the authors provided the PMIA heuristic on the basis of the MIA model; PMIA only needs to calculate the influence values of nodes in a local area and to update the influence values of locally relevant nodes, so its computational efficiency is high. However, although the PMIA heuristic improves efficiency through local approximation, some precision is inevitably lost, which results in comparatively low algorithm precision.
Jung et al. [16] designed a new heuristic, IRIE, in 2012 on the basis of the independent cascade model. Traditional heuristics and the PMIA Heuristic obtain the influence values of nodes through rounds of simulation or through local influence-value calculation, and then select the node with the maximum influence value. However, for large-scale social networks, calculating the influence values of all nodes is rather time-consuming. The novelty of IRIE is that it does not need to calculate the influence value of each node; instead, based on a belief-propagation-style method, it ranks the influence of all nodes globally with only a few rounds of iteration and then selects the top-ranked nodes as the most influential nodes. Moreover, IRIE is integrated with a method of influence estimation: after each round of ranking, it estimates the influence of the most influential node on the other nodes and then adjusts the next round of influence ranking according to the result. By combining influence ranking with influence estimation, IRIE is on average two orders of magnitude faster than the PMIA heuristic for the independent cascade model, while being as precise as PMIA.
Furthermore, the literature [17] has indicated that it is not necessary to precisely calculate the influence value of each node in a social network; a relative ranking of the nodes by influence value is sufficient. In addition, the distribution of social network nodes obeys certain rules, so the nodes can be randomly sampled according to Monte Carlo theory and the distribution of the whole population can be approximated from the distribution of a small sample, so as to approximately estimate the node influence values, thereby decreasing the amount of computation and improving execution efficiency. The authors designed ESMCE, a supervised sampling method based on the power-law exponent. Through an in-depth analysis of node distribution features in social networks, the ESMCE Algorithm determines the number of nodes in the initial sample according to the power-law exponent of the given social network; to minimize the number of sample nodes and the number of sampling rounds, the ESMCE Algorithm forecasts the number of follow-up sample nodes with a grey forecasting model and gradually refines the precision through iterative sampling until the error satisfies the predetermined requirement.
In summary, heuristic algorithms offer high efficiency and short run times. However, their precision cannot be guaranteed and cannot compare with that of the aforementioned Greedy Algorithms.
as seed nodes; in this case, the Black party will be able to influence 12 nodes (the Black party realizes its maximum ultimate influence range). If the target of the Black party is to minimize the interests of its competitor, it will choose the nodes B and C; in this case, the Red party can influence only 6 nodes (the ultimate influence range of the Red party is minimized). If the Black party intends to win the competition with the Red party, it will choose the nodes C and D; in this case, the Red party can influence 9 nodes while the Black party can influence 10 nodes, thereby achieving the Black party's target of winning the competition. Apparently, different targets of competitive spread call for corresponding seed node selection strategies.
strategy based on communities, and verified the effectiveness of this method through experiments.
As for minimizing the influence range of an opponent, Budak et al. [22] from the University of California, Santa Barbara initiated relevant research in 2011. Based on an extended independent cascade model, Budak et al. proved that the problem of competitive influence minimization is an NP-hard problem and compared the performance of a Greedy Algorithm with that of three Heuristic Algorithms. In addition, He et al. [23] took multi-topic competition into consideration in the linear threshold model and proved that the objective function of the competitive influence minimization problem in the linear threshold model is submodular, so that an approximation ratio of 1 − 1/e − ε can be achieved when the greedy hill-climbing strategy is used; in order to enhance the computational efficiency of the greedy hill-climbing method, He et al. put forward an efficient heuristic algorithm, CLDAG, to make up for the long run time of Greedy Algorithms. In 2012, Tsai et al. [24] studied the problem of competitive influence minimization on the basis of game theory and designed a heuristic mixed strategy for its solution.
Based on the problem of influence maximization, Lappas et al. [28] studied the problem of team formation: given a task T (which requires a set of different skills), a candidate talent set X (each person has his or her own skill reserves), and a social network over the talent set (the weight of an edge between two persons represents the interaction price between them; the smaller the price, the more effectively they can cooperate), how can a subset X′ ⊆ X be found to execute the task T so that the sum of the interaction prices within X′ is the smallest? Based on the graph diameter and the minimum spanning tree, the authors defined two different ways of measuring the interaction price, proved that the team formation problem is NP-hard under both measures, and designed corresponding solution algorithms for the two measures.
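To make the MST-based interaction price concrete, the sketch below computes, with Prim's algorithm, the total weight of a minimum spanning tree of the subgraph induced by a candidate team, which is one natural reading of the MST cost of Lappas et al. [28]; w is an assumed symmetric dictionary of pairwise interaction prices, and a disconnected team is given infinite cost.

import heapq

def team_mst_cost(team, w):
    """Interaction price of a team as the MST weight of its induced subgraph."""
    team = list(team)
    if len(team) <= 1:
        return 0.0
    visited = {team[0]}
    # candidate edges leaving the visited set, kept as (weight, endpoint)
    frontier = [(w[(team[0], v)], v) for v in team[1:] if (team[0], v) in w]
    heapq.heapify(frontier)
    cost = 0.0
    while frontier and len(visited) < len(team):
        weight, v = heapq.heappop(frontier)
        if v in visited:
            continue
        visited.add(v)
        cost += weight
        for u in team:
            if u not in visited and (v, u) in w:
                heapq.heappush(frontier, (w[(v, u)], u))
    return cost if len(visited) == len(team) else float("inf")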
4.8 Summary
In recent years, with the rapid development of the internet and Web 2.0 technology, social networks, serving as a communication bridge mirroring the real human world, have become an important medium and platform for communication, knowledge sharing, and information spread. The problem of influence maximization, which aims at discovering the most influential node set in a social network, is a key problem in the field of social network analysis and is widely applied in marketing, advertisement publication, early warning of public sentiment, and many other important scenarios; hence, it is of high research and application value.
This chapter has summarized the problem of influence maximization in social networks and its main analysis methods. With respect to algorithms, current work focuses on Greedy Algorithms and heuristic strategies. Existing research has achieved some results in handling the influence maximization problem efficiently; however, because of the large scale of social networks, the complex connections between nodes, and the dynamic variation of the network structure, many new challenges arise in solving influence maximization efficiently. Therefore, many problems in this field still need further investigation.
(1) At present, numerous parallel computation frameworks are already widely applied in fields such as massive scientific computing. The MapReduce computation framework offers good programmability, has the advantages of automatic parallelization, load balancing, and the like, and can be operated on large-scale clusters, so its computing power is very remarkable. Therefore, a parallel influence maximization algorithm based on the MapReduce framework is a feasible and meaningful solution. The key problem to be solved in this research direction is how to rationally assign the task of computing the influence value of each node in the social network to the computing nodes of the cluster.
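As a purely conceptual sketch of this decomposition (not a real MapReduce job), one greedy round can be split into a map side that estimates each candidate's marginal gain independently and a reduce side that keeps the best candidate; estimate again stands for an assumed spread estimator such as the simulate_ic-based one sketched earlier.

from functools import reduce

def map_phase(candidates, S, spread_S, estimate):
    # in a real framework, each candidate (or block of candidates) would be
    # processed by a different worker node of the cluster
    return [(v, estimate(S | {v}) - spread_S) for v in candidates]

def reduce_phase(gain_pairs):
    # keep the candidate with the largest estimated marginal gain
    return reduce(lambda a, b: a if a[1] >= b[1] else b, gain_pairs)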
References
[1] David Kempe, Jon Kleinberg, Eva Tardos. Maximizing the spread of influence through a social
network. In Proceedings of the ninth ACM SIGKDD international conference on knowledge
discovery and data mining, 2003: 137–146.
[2] Jacob Goldenberg, Barak Libai, Eitan Muller. Talk of the network: A complex systems look at the
underlying process of word-of-mouth. Marketing Letters, 2001, 12(3):211–223.
[3] Mark Granovetter. Threshold models of collective behavior. American Journal of Sociology,
1978, 83(6):1420.
[4] Wei Chen, Yajun Wang, Siyu Yang. Efficient Influence Maximization in social networks. In
Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and
data mining, 2009: 199–208.
[5] Eyal Even-Dar, Asaf Shapira. A note on maximizing the spread of influence in social networks.
Internet and Network Economics, 2007, 281–286.
[6] Gerard Cornuejols, Marshall L. Fisher, George L. Nemhauser. Exceptional paper—location of
bank accounts to optimize float: An analytic study of exact and approximate algorithms.
Management Science. 1977, 23 (8):789–810.
[7] George L. Nemhauser, Lawrence A Wolsey, Marshall L. Fisher. An analysis of approximations for
maximizing submodular set functions—I. Mathematical Programming, 1978, 14 (1):265–294.
[8] Pedro Domingos, Matt Richardson. Mining the network value of customers. In Proceedings of
the seventh ACM SIGKDD international conference on Knowledge discovery and data mining,
2001: 57–66.
[9] Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen,
Natalie Glance. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM
SIGKDD international conference on Knowledge discovery and data mining, 2007: 420–429.
[10] Yu Wang, Gao Cong, Guojie Song, Kunqing Xie. Community-based Greedy Algorithm for mining
top-k influential nodes in mobile social networks. In Proceedings of the 16th ACM SIGKDD
international conference on knowledge discovery and data mining, 2010: 1039–1048.
[11] Amit Goyal, Wei Lu, Laks V.S. Lakshmanan. Celf++: optimizing the Greedy Algorithm for
Influence Maximization in social networks. In Proceedings of the 20th international conference
companion on World wide web, 2011: 47–48.
[12] Liu Xiaodong, Li Mo, Li Shanshan, Peng Shaoliang, Liao Xiangke, Lu Xiaopei. IMGPU: GPU
accelerated influence maximization in large-scale social networks. IEEE Transactions on
Parallel and Distributed Systems, 2014: 1.
[13] Wei Chen, Chi Wang, Yajun Wang. Scalable influence maximization for prevalent viral market-
ing in large-scale social networks. In Proceedings of the 16th ACM SIGKDD international
conference on Knowledge discovery and data mining, 2010: 1029–1038.
[14] Wei Chen, Yifei Yuan, Li Zhang. Scalable influence maximization in social networks under the
linear threshold model. In Proceedings of IEEE 10th International Conference on Data Mining,
2010: 88–97.
[15] Amit Goyal, Wei Lu, Laks V. S. Lakshmanan. Simpath: An efficient algorithm for influence
maximization under the linear threshold model. In Proceedings of IEEE 11th International
Conference on Data Mining, 2011: 211–220.
[16] Kyomin Jung, Wooram Heo, Wei Chen. IRIE: A scalable influence maximization algorithm for
independent cascade model and its extensions. arXiv preprint arXiv:1111.4795, 2011.
[17] Liu Xiaodong, Li Shanshan, Liao Xiangke, Peng Shaoliang, Wang Lei, Kong Zhiyin. Know by a
handful the whole sack: Efficient sampling for top-K influential user identification in large
graphs, World Wide Web Journal.
[18] Tim Carnes, Chandrashekhar Nagarajan, Stefan M. Wild, Anke van Zuylen. Maximizing influence
in a competitive social network: A follower’s perspective. In Proceedings of the ninth interna-
tional conference on Electronic commerce, 2007: 351–360.
[19] Shishir Bharathi, David Kempe, Mahyar Salek. Competitive influence maximization in social
networks. Internet and Network Economics, 2007, 306–311.
[20] Wan-Shiou Yang, Shi-Xin Weng. Application of the ant colony optimization algorithm to
competitive viral marketing. In Proceedings of the 7th Hellenic conference on Artificial
Intelligence: theories and applications, 2012: 1–8.
[21] Nam P. Nguyen, Guanhua Yan, My T. Thai, Stephan Eidenbenz. Containment of misinformation
spread in online social networks. Proceedings of the 4th ACM Web Science (WebSci’12), 2012.
[22] Ceren Budak, Divyakant Agrawal, Amr El Abbadi. Limiting the spread of misinformation in social networks. In Proceedings of the 20th international conference on World wide web, 2011: 665–674.
[23] Xinran He, Guojie Song, Wei Chen, Qingye Jiang. Influence blocking maximization in social
networks under the competitive Linear Threshold Model technical report. arXiv preprint
arXiv:1110.4723, 2011.
[24] Jason Tsai, Thanh H. Nguyen, Milind Tambe. Security games for controlling contagion. In
Proceedings of the Twenty-Sixth National Conference in Artificial Intelligence, 2012.
[25] Amit Goyal, Francesco Bonchi, Laks V. S. Lakshmanan, Suresh Venkatasubramanian. On
minimizing budget and time in influence propagation over social networks. Social Network
Analysis and Mining, 2012: 1–14.
[26] Cheng Long, Raymond Chi-Wing Wong. Minimizing seed set for viral marketing. In Proceedings
of the IEEE 11th International Conference on Data Mining, 2011: 427–436.
[27] Theodoros Lappas, Evimaria Terzi, Dimitrios Gunopulos, Heikki Mannila. Finding effectors in
social networks. In Proceedings of the 16th ACM SIGKDD international conference on
Knowledge discovery and data mining, 2010: 1059–1068.
[28] Theodoros Lappas, Kun Liu, Evimaria Terzi. Finding a team of experts in social networks. In
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and
data mining, 2009: 467–476.
https://doi.org/10.1515/9783110599435-005