
Named Entity Disambiguation by Leveraging Wikipedia Semantic Knowledge

Xianpei Han    Jun Zhao
Institute of Automation, Chinese Academy of Sciences
HaiDian District, Beijing, China
+86 10 8261 4468
{xphan, jzhao}@nlpr.ia.ac.cn

ABSTRACT
The name ambiguity problem has raised an urgent demand for efficient, high-quality named entity disambiguation methods. The key problem of named entity disambiguation is to measure the similarity between occurrences of names. Traditional methods measure this similarity using the bag of words (BOW) model. The BOW, however, ignores all the semantic relations, such as social relatedness between named entities, associative relatedness between concepts, and polysemy and synonymy between key terms, so it cannot reflect the actual similarity. Some research has investigated social networks as background knowledge for disambiguation. Social networks, however, can only capture the social relatedness between named entities, and often suffer from the limited coverage problem.

To overcome the deficiencies of these previous methods, this paper proposes to use Wikipedia as the background knowledge for disambiguation, which surpasses other knowledge bases in its coverage of concepts, rich semantic information and up-to-date content. By leveraging Wikipedia's semantic knowledge, such as social relatedness between named entities and associative relatedness between concepts, we can measure the similarity between occurrences of names more accurately. In particular, we construct a large-scale semantic network from Wikipedia so that the semantic knowledge can be used efficiently and effectively. Based on the constructed semantic network, a novel similarity measure is proposed to leverage Wikipedia semantic knowledge for disambiguation. The proposed method has been tested on the standard WePS data sets. Empirical results show that the disambiguation performance of our method achieves a 10.7% improvement over the traditional BOW based methods and a 16.7% improvement over the traditional social network based methods.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Storage and Retrieval - Information Search and Retrieval.

General Terms
Algorithms, Experimentation

Keywords
Named Entity Disambiguation, Name Ambiguity, Coreference Resolution, Record Linkage, Semantic Knowledge

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIKM'09, November 2-6, 2009, Hong Kong, China.
Copyright 2009 ACM 978-1-60558-512-3/09/11...$10.00.

1. INTRODUCTION
The name ambiguity problem is common on the Web. For example, the name "Michael Jordan" represents more than ten persons in the Google search results. Some of them are shown below:

Michael (Jeffrey) Jordan, Basketball Player
Michael (I.) Jordan, Professor at Berkeley
Michael Jordan, Footballer
Michael (B.) Jordan, American Actor

Name ambiguity has raised serious problems in many different areas such as web person search, data integration, link analysis and knowledge base population. For example, in response to a person query, a search engine returns a long, flat list of results containing web pages about several namesakes. The users are then forced either to refine their query by adding terms, or to browse through the search results to find the person they are looking for. Besides, an ever-increasing number of question answering and information extraction systems are coming to rely on data from multiple sources, where name ambiguity will lead to wrong answers and poor results. For example, in order to extract the birth date of the Berkeley professor Michael Jordan, an information extraction system may return the birth date of his popular namesake, the basketball player Michael Jordan. Furthermore, ambiguous names are not unique identifiers for specific entities and, as a result, there are many confounders in the construction of knowledge bases or social networks about named entities. So there is an urgent demand for efficient, high-quality named entity disambiguation methods, which can disambiguate occurrences of names by grouping them according to their represented named entities.

Named entity disambiguation, however, is by no means a trivial task. In order to group occurrences of names, the disambiguation system must decide whether the occurrences of a specific name represent the same entity. The manner by which a human makes this decision is often contingent on contextual clues as well as prior background knowledge. For example, when a reader encounters the following four occurrences of the name "Michael Jordan":

1) Michael Jordan is a leading researcher in machine learning.
2) Michael Jordan plays basketball in Chicago Bulls.
3) Michael Jordan wins NBA MVP.
4) Learning in Graphical Models: Michael Jordan.
the reader must decide whether these occurrences represent the same person. With the background knowledge that the machine learning in the context of the first Michael Jordan occurrence is semantically related to the Graphical Models in the context of the fourth occurrence via an associative relation, it is obvious that the first Michael Jordan represents the same person as the fourth. And with the background knowledge that the entity Chicago Bulls is semantically related to the entity NBA via a social relation, it is clear that the second and the third occurrences of Michael Jordan represent the same person.

Conventionally, named entity disambiguation methods determine whether two occurrences of a specific name represent the same entity by measuring the similarity between them. The traditional methods measure the similarity using the bag of words (BOW) model (Bagga and Baldwin [1]; Mann and Yarowsky [13]; Fleischman [21]; Pedersen et al. [26]), where an occurrence of a name is represented as a term vector consisting of the terms that appear in the context and their associated weights. By "terms" we mean words, phrases or extracted named entities, but in most cases they are single words. In this model, similarity is measured by the co-occurrence statistics of terms. Hence the disambiguation algorithm can only group the occurrences of names containing identical contextual terms, while all semantic relations (Hjørland, Birger [29]) like social relatedness between named entities, associative relatedness between concepts, and acronyms, synonyms and spelling variations between key terms are ignored. Thus, the BOW based similarity cannot reflect the actual similarity between name occurrences. Background knowledge is needed to capture the various semantic relations.

Recent research has investigated social networks as background knowledge for disambiguation (Malin and Airoldi [3]; Minkov et al. [10]; Bekkerman and McCallum [23]). Social networks can capture the social relatedness between named entities, so the similarity can be bridged by the socially related named entities. For example, although they share no identical contextual terms, the following two occurrences of Michael Jordan: "Michael Jordan plays basketball in Chicago Bulls" and "Michael Jordan wins NBA MVP" will still be identified as the same person if a social network can provide the information that NBA is socially related to Chicago Bulls. By leveraging social relatedness among entities, the social network based methods are more reliable than the BOW based methods in some situations. However, the social network based methods have a number of limitations. First, social networks can only capture a special type of semantic relation - the social relatedness between named entities - while all other semantic relations, such as the associative, hierarchical and equivalence relations between concepts, are still ignored (e.g., the associative relatedness between basketball and MVP in the above example). Second, social networks usually have limited coverage: most recent research uses social networks built from specific corpora, or some existing social networks of special domains, such as IMDB for the movie domain (Malin and Airoldi [3]) and DBLP for the research domain (Joseph et al. [17]).

To overcome the deficiencies of previous methods, in this paper we propose to use Wikipedia as the background knowledge for disambiguation, which surpasses other knowledge bases in its coverage of concepts, rich semantic information and up-to-date content (D. Milne, et al. [7]). By leveraging the semantic knowledge in Wikipedia, like social relatedness between named entities, associative relatedness between concepts, and acronyms and spelling variations between key terms, we can obtain a more accurate similarity measure between occurrences of names for disambiguation. In particular, we construct a large-scale semantic network from Wikipedia so that the semantic knowledge can be used efficiently and effectively: Wikipedia concepts in documents can be recognized, semantic relations between concepts can be identified and semantic relatedness between concepts can be measured. Based on the constructed semantic network, we first represent every occurrence of a name as a Wikipedia concept vector; then the similarity between concept vectors is computed using a novel similarity measure which can leverage various types of semantic relations; finally a hierarchical agglomerative clustering algorithm is applied to group occurrences of names based on the similarity. To evaluate the performance of the proposed method, we have performed an empirical evaluation on the standard WePS data sets. The experimental results show that, with the help of Wikipedia semantic knowledge, the disambiguation performance of our proposed method is greatly improved over the previous methods.

This paper is organized as follows. In the next section we state the named entity disambiguation problem and briefly review the related work. Next, in Section 3 we describe how to construct a semantic network from Wikipedia. In Section 4 we describe our proposed method in detail. Experimental results are discussed in Section 5. Section 6 concludes this paper and discusses future work.

2. PROBLEM STATEMENT AND RELATED WORK
Conventionally, a named entity disambiguation system is defined as a six-tuple M = {N, E, D, O, K, δ}, where:

N = {n1, n2, …, nl} is a set of ambiguous names which need to be disambiguated, e.g., {"Michael Jordan", …};

E = {e1, e2, …, ek} is a reference entity table containing the entities which the names in N may represent, e.g., {"Michael Jordan (Basketball player)", "Michael Jordan (Professor)", …};

D = {d1, d2, …, dn} is a set of documents containing the names in N;

O = {o1, o2, …, om} is the set of all name observations in D which need to be disambiguated. In this paper, we use the term observation to denote the basic unit to be disambiguated: an occurrence of a particular name combined with its context. For example, "NBA.com: Michael Jordan Bio" is an observation of "Michael Jordan". The name occurrence's context can take various forms, such as the contextual words within a fixed window size or sometimes the entire document;

K is the background knowledge used in named entity disambiguation. The background knowledge has been exploited along a continuum, from the BOW model, which includes no background knowledge, to the social network based methods, which employ the social relatedness between named entities;

δ : O × K → E is the disambiguation function, the key component of named entity disambiguation, which groups the observations according to their represented entities.

Obviously, the perfect reference entity table E is in most cases unavailable, so disambiguation must be conducted on the condition that the reference entity table is incomplete. Therefore, in most cases the disambiguation problem is regarded as a clustering task, where δ : O × K → E is a clustering algorithm which clusters all the observations of a particular name, with each resulting cluster corresponding to one specific entity.
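As a point of reference for the BOW model criticized above, the sketch below shows the term-vector cosine similarity that such baselines typically use; the whitespace tokenization and the two example contexts are simplified illustrations, not the paper's implementation.

```python
from collections import Counter
from math import sqrt

def bow_vector(context_terms):
    """Represent a name observation as a bag-of-words term vector."""
    return Counter(context_terms)

def cosine(v1, v2):
    """Cosine similarity of two term vectors: non-zero only on shared terms."""
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

# The two socially related observations from Section 1 share no terms,
# so the BOW similarity is 0.0 even though both describe the same person.
mj2 = bow_vector("plays basketball in chicago bulls".split())
mj3 = bow_vector("wins nba mvp".split())
print(cosine(mj2, mj3))  # 0.0
```

This is exactly the failure mode the paper targets: the measure sees only term overlap, so the semantic link between "Chicago Bulls" and "NBA" contributes nothing.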
A lot of research has focused on named entity disambiguation. The traditional methods disambiguate names based on the bag of words (BOW) model. Bagga and Baldwin [1] represented a name as a vector of its contextual words; the similarity between two names was then determined by the co-occurring words, and finally two names were predicted to be the same entity if their similarity was above a threshold. Cucerzan [24] disambiguated names by linking them to Wikipedia entities through comparing their term vector representations. Mann and Yarowsky [13] extended the name's vector representation by extracting biographic facts. Pedersen et al. [26] employed significant bigrams to represent the context of a name. Fleischman [21] trained a Maximum Entropy model to give the probability that two names represent the same entity, then used a modified agglomerative clustering algorithm to cluster names using the probability as the similarity. Bunescu and Pasca [22] disambiguated the names in Wikipedia by linking them to the most similar Wikipedia entities using the similarity computed by a disambiguation SVM kernel.

All the similarity measures used in the above BOW based methods are determined only by the co-occurrences of terms, while all semantic relations like social relatedness between named entities and associative relatedness between concepts are ignored. So background knowledge is needed to capture the various semantic relations. Recent research has investigated social networks as background knowledge for disambiguation. Bekkerman and McCallum [23] disambiguated names based on the link structure of the Web pages between a set of socially related persons; their model leveraged hyperlinks and the content similarity between web pages. Malin [2] and Malin and Airoldi [3] measured the similarity based on the probability of walking from one ambiguous name to another in the social network constructed from corpora. Minkov et al. [10] disambiguated names in email documents by building a social network from email data, then employed a random walk algorithm to compute the similarity. Joseph et al. [17] used the relationships from DBLP to pinpoint names in the research domain to the persons in DBLP. Kalashnikov et al. [8] enhanced the similarity measure by collecting named entity co-occurrence information via web search.

Social networks can enhance the similarity measure by leveraging social relatedness between named entities. However, as mentioned in Section 1, social networks can only capture a special type of semantic relation - the social relatedness between named entities - and often suffer from the limited coverage problem. To overcome these deficiencies, we propose to use Wikipedia as the background knowledge. In the following sections, we will show how to leverage the semantic knowledge in Wikipedia for disambiguation.

3. WIKIPEDIA AS A SEMANTIC NETWORK
Wikipedia is the largest encyclopedia in the world and surpasses other knowledge bases in its coverage of concepts, rich semantic information and up-to-date content. Its English version contains more than 2,800,000 articles, and new articles are added quickly and kept up-to-date¹. Each article in Wikipedia describes a single concept; its title is a succinct, well-formed phrase that resembles a term in a conventional thesaurus (Milne, et al. [7]). Wikipedia contains concepts in a wide range of domains², such as people, organizations, occupations and publications. Wikipedia contains rich semantic structures, such as disambiguation pages (polysemy), redirect pages (synonymy), and hyperlinks between Wikipedia articles (associative relatedness, social relatedness, etc.). Moreover, Wikipedia has high coverage of both concepts and semantic relations. For example, in the Food and Agriculture domain, the June 3, 2006 version of English Wikipedia covers 72% of useful concepts, 95% of synonymy relations, 69% of hierarchical relations and 56% of associative relations (Milne, et al. [7]). And, with the growth of Wikipedia, these coverage rates will be further improved.

However, Wikipedia is an open data resource built for human use, so it includes much noise, and the semantic knowledge within it is not suitable for direct use in named entity disambiguation. To make it clean and easy to use, we construct a semantic network from Wikipedia so that the semantic knowledge can be used efficiently and effectively for disambiguation: Wikipedia concepts within documents can be recognized, semantic relations between concepts can be identified, and semantic relatedness between concepts can be measured efficiently and accurately.

3.1 Wikipedia Concepts
As shown above, each article in Wikipedia describes a single concept and its title can be used to represent the concept it describes, e.g., the titles "IBM" and "Professor". However, some articles are meaningless - they are only used for Wikipedia management and administration, such as "1980s", "Wikipedia:Statistics", etc. Hence, we filter the noisy Wikipedia concepts using some rules from Hu, et al. [16], described below (titles satisfying one of the following will be filtered):

- The article belongs to categories related to chronology, i.e. "Years", "Decades" and "Centuries".
- The first letter is not a capital one.
- The title is a single stop word.

3.2 Surface Forms of Wikipedia Concepts
In many tasks, we need to recognize Wikipedia concepts in documents (plain texts, web pages, etc.). Usually the recognition is affected by two factors. First, a Wikipedia concept may appear in various surface forms. For example, IBM can appear in more than 40 forms, such as IBM, Big Blue and International Business Machine. Second, a surface form may represent several Wikipedia concepts. For example, as shown in Table 1, the surface form AI can represent more than 6 Wikipedia concepts, such as Artificial intelligence and Ai (singer).

[Figure 1. Three anchor texts of IBM]

¹ http://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
² http://en.wikipedia.org/wiki/Portal:Contents/Categorical_index
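The Section 3.1 title filter can be sketched as follows; the stop-word list and the category test are simplified stand-ins for the full rules of Hu, et al. [16], not the paper's actual rule set.

```python
# Illustrative subsets only; the real filter uses the rules of Hu et al. [16].
STOP_WORDS = {"the", "a", "an", "of", "is"}
CHRONOLOGY_CATEGORIES = {"Years", "Decades", "Centuries"}

def is_noisy_concept(title, categories):
    """Return True if a Wikipedia article title should be filtered out."""
    if CHRONOLOGY_CATEGORIES & set(categories):  # chronology articles like "1980s"
        return True
    if not title[:1].isupper():                  # first letter is not a capital one
        return True
    if title.lower() in STOP_WORDS and " " not in title:  # single stop word
        return True
    return False

print(is_noisy_concept("1980s", ["Decades"]))   # True
print(is_noisy_concept("IBM", ["Companies"]))   # False
```

Titles passing this filter become the concept nodes of the constructed semantic network.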
Taking into account the above two factors, we collect a table of the surface forms (full names, acronyms, alternative names, and spelling variations) of Wikipedia concepts for Wikipedia concept recognition. The surface forms of Wikipedia concepts can be collected from anchor texts in Wikipedia: each link in Wikipedia is associated with an anchor text, and the anchor text can be regarded as a surface form of its target concept. For example, the three anchor texts of IBM in Figure 1 are respectively its full name "International Business Machines", its acronym "IBM" and its alternative name "Big Blue". Using the anchor text collection in Wikipedia, we can collect all surface forms and, for each surface form, summarize its target concepts together with the count of how often it is used as the anchor text of a specific Wikipedia concept. Part of the surface form table is shown in Table 1. Using the collected surface form table, we are able to recognize Wikipedia concepts in documents; the detailed description of the recognition method is given in Section 4.1.

Surface Form                    Target Concept                  Count
IBM                             IBM                             3685
                                IBM mainframe                   2
                                IBM DB2                         2
                                …                               …
International Business Machine  IBM                             1
AI                              Artificial intelligence         581
                                Game artificial intelligence    48
                                Ai (singer)                     10
                                Angel Investigations            9
                                Strong AI                       3
                                Characters in the Halo series   2
                                …                               …

Table 1. Part of the surface form table of Wikipedia concepts

3.3 Semantic Relations between Wikipedia Concepts
Wikipedia contains rich relation structures, such as synonymy (redirect pages), polysemy (disambiguation pages), and social relatedness and associative relatedness (internal page links). All these semantic relations are expressed in the form of hyperlinks between Wikipedia articles, although, as Milne et al. [6] mentioned, some links connect only tenuously related articles. In the constructed semantic network, two Wikipedia concepts are considered to be semantically related if there are hyperlinks between them. In this way, the constructed semantic network can incorporate all the semantic relations expressed by the hyperlinks between Wikipedia articles. For example, Figure 2 shows a part of the constructed semantic network, which contains all the semantically related concepts of the Berkeley professor Michael I. Jordan.

[Figure 2. The semantically related concepts of Berkeley professor Michael I. Jordan in the constructed semantic network: machine learning, Artificial intelligence, Bayesian network, Statistics, Variational Bayesian methods, Expectation-maximization algorithm, Professor, University of California, Berkeley, David Blei, Lawrence Saul, Andrew Ng, Tommi Jaakkola, Zoubin Ghahramani, David Rumelhart, PDP Group]

3.4 Semantic Relatedness between Wikipedia Concepts
Semantic relations can provide information about whether two concepts are related, but they do not explicitly provide the strength of the relation. In order to incorporate Wikipedia semantics into the similarity measure, we must measure the semantic relation's strength (the semantic relatedness) between concepts. Several studies have focused on computing the semantic relatedness between Wikipedia concepts (Strube and Ponzetto [25]; Gabrilovich and Markovitch [12]; Milne and Witten [6]). In this paper, we adopt the method described in Milne and Witten [6] to compute the semantic relatedness between Wikipedia concepts. Based on the idea that more semantically related Wikipedia concepts will share more semantically related concepts, this method measures the semantic relatedness as:

sr(a, b) = [log(max(|A|, |B|)) − log(|A ∩ B|)] / [log(|W|) − log(min(|A|, |B|))]

where a and b are the two concepts of interest, A and B are the sets of all concepts that link to a and b respectively, and W is the entire Wikipedia. We show an example of the semantic relatedness between four selected concepts in Table 2, where the semantic relatedness reveals the associative relatedness between Bayesian network and Machine learning, and the social relatedness between Chicago Bulls and NBA.

                    Bayesian network    Chicago Bulls
Machine learning    0.74                0.00
NBA                 0.00                0.71

Table 2. The semantic relatedness table of four selected concepts

4. NAMED ENTITY DISAMBIGUATION BY LEVERAGING WIKIPEDIA SEMANTIC KNOWLEDGE
In this section, we describe our proposed method in detail and show how to leverage Wikipedia semantic knowledge for disambiguation. There are three steps in total: (1) representing name observations as Wikipedia concept vectors; (2) computing the similarity between name observations; (3) grouping name observations using a hierarchical agglomerative clustering algorithm. The critical innovation of the proposed method is a novel similarity measure which can accurately measure the similarity between name observations by incorporating the various semantic relations in Wikipedia.

4.1 Representing Name Observations as Wikipedia Concept Vectors
Intuitively, if two name observations represent the same entity, it is highly possible that the Wikipedia concepts in their contexts are highly related. In contrast, if two name observations represent different entities, the Wikipedia concepts in their contexts will not be closely related. Thus, a name observation o can be represented by the Wikipedia concepts in its context, i.e., a Wikipedia concept vector o = {(c1, w(c1, o)), (c2, w(c2, o)), ..., (cm, w(cm, o))}, where each concept ci is assigned a weight w(ci, o) indicating the relatedness between ci and o.
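The Section 3.4 link-based measure can be sketched as follows. One caution: in Milne and Witten's [6] formulation, the quantity printed above is a distance (0 when the in-link sets are identical), so this sketch assumes the common conversion relatedness = 1 − distance to obtain scores where higher means more related, as in Table 2; the link sets and article total below are illustrative.

```python
from math import log

def mw_distance(links_a, links_b, total_articles):
    """Milne-Witten link-based distance, the quantity in Section 3.4:
    (log(max(|A|,|B|)) - log(|A ∩ B|)) / (log(|W|) - log(min(|A|,|B|))),
    where A and B are the sets of articles linking to concepts a and b."""
    a, b = set(links_a), set(links_b)
    overlap = a & b
    if not overlap:
        return float("inf")  # no shared in-links: maximally distant
    return (log(max(len(a), len(b))) - log(len(overlap))) / \
           (log(total_articles) - log(min(len(a), len(b))))

def sr(links_a, links_b, total_articles):
    """Assumed conversion to a [0,1] relatedness score (1 - distance, floored at 0),
    so that higher values mean more related, matching Table 2."""
    return max(0.0, 1.0 - mw_distance(links_a, links_b, total_articles))

# Toy in-link sets: half of the in-links are shared.
print(round(sr({1, 2, 3, 4}, {3, 4, 5, 6}, total_articles=1000), 2))  # 0.87
```

In the real system the in-link sets come from the hyperlink structure of the constructed semantic network, and |W| is the number of Wikipedia articles.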

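The construction of the Table 1 surface form table from anchor texts (Section 3.2) can be sketched as follows; the (anchor text, target concept) pairs are illustrative, not real Wikipedia link counts.

```python
from collections import defaultdict

# Each Wikipedia link contributes one (anchor_text, target_concept) pair.
# These pairs are made-up examples standing in for the full anchor collection.
links = [
    ("IBM", "IBM"), ("IBM", "IBM"), ("Big Blue", "IBM"),
    ("International Business Machines", "IBM"),
    ("AI", "Artificial intelligence"), ("AI", "Ai (singer)"),
]

# surface_forms[anchor][concept] = number of times the anchor text
# was used to link to that concept, as in the Count column of Table 1.
surface_forms = defaultdict(lambda: defaultdict(int))
for anchor, concept in links:
    surface_forms[anchor][concept] += 1

print(dict(surface_forms["IBM"]))  # {'IBM': 2}
print(dict(surface_forms["AI"]))   # {'Artificial intelligence': 1, 'Ai (singer)': 1}
```

The per-concept counts later supply the Commonness statistic used when mapping an ambiguous surface form to its target concept (Section 4.1).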
In order to represent a name observation as a Wikipedia concept vector, we recognize the Wikipedia concepts in its context. In this paper, we use the collected table of surface forms and take the same route as Milne and Witten [5] to recognize Wikipedia concepts. The recognition takes three steps: (1) identifying surface forms; (2) mapping them to Wikipedia concepts; (3) weighting and pruning concepts for a better similarity measure. The detailed description is as follows.

Surface form identification. In order to recognize Wikipedia concepts, we first identify all occurrences of surface forms. Given a name observation's context as input, we gather all N-grams (up to 8 words) and match them to the surface forms in the collected surface form table (described in Section 3.2). Not all matches are considered, because even stop words such as "is" and "a" may represent a concept. We use Mihalcea and Csomai [19]'s keyphraseness feature to select helpful surface forms. In detail, for each surface form s, we first calculate its probability of representing a concept as fa(s)/(fa(s) + ft(s)), where fa(s) is the number of Wikipedia articles in which the surface form represents a concept, and ft(s) is the number of articles in which the surface form appears in any form. Then surface forms with low probabilities are discarded.

Mapping surface forms to concepts. As mentioned earlier, surface forms may be ambiguous, for they can represent more than one concept; for example, the candidate concepts of IBM in Table 1 include IBM, IBM mainframe, IBM DB2, etc. So a mapping step is needed to identify which concept a surface form actually represents. In this paper, we adopt the mapping method described in Medelyan et al. [19]: First, the method detects the "context concepts" T in a name observation, i.e., the concepts which the unambiguous surface forms (those with only one target concept, e.g., International Business Machine in Table 1) are mapped to. Then, the method scores the final mapping between a surface form s and a candidate concept c by combining the average relatedness of the candidate concept to the context concepts with the commonness of this mapping:

Score(s, c) = [Σ_{t∈T} sr(t, c) / |T|] × Commonness_{s,c}, where

Commonness_{s,c} = Count(s, c) / Count(s)

Finally, the candidate concept with the highest score is taken as the target concept of the surface form. Using this method, the mapping accuracy can be up to 93.3%. More details about this method can be found in Medelyan et al. [19].

Concept weighting and pruning. After the first two steps, a name observation is represented as a Wikipedia concept vector o = {c1, c2, …, cm}. However, not all concepts in the representation are equally helpful for named entity disambiguation: documents may contain noisy concepts (this is very common in web pages), and some concepts are only loosely related to the observed name. So here we expect to preserve the concepts that are highly related to the observed name, and discard the outliers that are only loosely related to it. This paper selects the helpful concepts by assigning each concept a weight indicating its relatedness to the observed name. In detail, for each concept c in a name observation o, we assign it a weight by averaging the semantic relatedness of c to all other concepts in o, i.e.:

w(c, o) = (1/|o|) Σ_{ci∈o, ci≠c} sr(c, ci)

Based on the computed weights, we are able to prune concepts to improve both the efficiency and accuracy of disambiguation, using a weight threshold which can be learned in a learning process.

4.2 Measuring the Similarity between Name Observations by Leveraging Wikipedia Semantic Knowledge
Through the method described in Section 4.1, a name observation is represented as a Wikipedia concept vector:

o = {(c1, w(c1, o)), (c2, w(c2, o)), ..., (cm, w(cm, o))}

where each concept ci is assigned a weight w(ci, o). For example, given the following three observations MJ1, MJ2 and MJ3 of "Michael Jordan", their concept vector representations are shown in Figure 3.

MJ1: Michael Jordan is a leading researcher in machine learning and artificial intelligence.
MJ2: Michael Jordan has published over 300 research articles on topics in computer science, statistics and cognitive science.
MJ3: Michael Jordan wins NBA MVP.

MJ1: Researcher (0.42), Machine learning (0.54), Artificial intelligence (0.51)
MJ2: Research (0.47), Statistics (0.52), Computer science (0.52), Cognitive science (0.51)
MJ3: National Basketball Association (0.57), National Basketball Association Most Valuable Player Award (0.57)

Figure 3. The concept representations of MJ1, MJ2 and MJ3

After obtaining the concept vector representations of name observations, previous methods' similarity measures could be applied to compute the similarity of two name observations. However, previous methods' similarity measures cannot take the semantic relations into consideration: the BOW based methods typically measure the similarity between name observations using the cosine of their term vectors, so that matches of terms indicate relatedness and mismatches indicate otherwise; the social network based methods measure the similarity using only the social relatedness between named entities.

So in this paper, we propose a novel similarity measure which allows us to take into account the full set of semantic relations indicated by hyperlinks within Wikipedia, rather than just term overlap or social relatedness between named entities. Given two name observations ol and ok, the proposed similarity measure is computed as follows:

Step 1. Concept alignment between two concept vector representations. In order to measure the similarity between two concept vector representations, we must first define the correspondence between the concepts of one vector and those of another. A simple alignment strategy is to assign a concept to the target concept which is exactly the same match, e.g., assign "Research" to "Research" and "Machine learning" to "Machine learning". This alignment strategy, however, cannot take semantic relations between concepts into account. Therefore, we use the following strategy to align concepts: for each concept c in an observation ol, we assign it a target concept Align(c, ok) in another observation ok which maximizes the semantic relatedness between the concept pair, i.e.:

Align(c, ok) = argmax_{ci∈ok} sr(c, ci)

We use the two examples shown in Figure 4 and Figure 5 to demonstrate the proposed concept alignment strategy, based on the semantic relatedness table shown in Table 3.

                    Researcher    Machine Learning    Artificial intelligence
Research            0.54          0.38                0.40
Statistics          0.32          0.58                0.46
Computer science    0.44          0.50                0.60
Cognitive science   0.44          0.66                0.65

Table 3. The semantic relatedness table between the concepts in MJ1 and MJ2

[Figure 4. The concept alignment from MJ1 to MJ2]

[Figure 5. The concept alignment from MJ2 to MJ1]

Step 2. Compute the semantic relatedness from one concept vector representation to another. We define the semantic relatedness from a source concept vector representation ok to

SIM(ok, ol) = (1/2) × (SR(ok → ol) + SR(ol → ok))

Because the semantic relatedness sr(c, ci) is always in [0, 1], SR(ol → ok) is also bounded within [0, 1]; thus the SIM(ok, ol) between two name observations is also bounded within [0, 1], where 0 indicates that the named entities represented by the two name observations are completely unrelated and 1 indicates that they are mostly related.

Using the proposed similarity measure, the semantic similarity SIM(MJ1, MJ2) is computed as (0.60 + 0.62)/2 = 0.61, SIM(MJ2, MJ3) is computed as 0.10 and SIM(MJ1, MJ3) is computed as 0.0. These similarities indicate that, although (MJ1, MJ2), (MJ1, MJ3) and (MJ2, MJ3) all have no concept overlap, the similarity values measured by leveraging Wikipedia semantic knowledge can still successfully reveal the fact that (MJ1, MJ2) very likely represent the same entity, while (MJ1, MJ3) and (MJ2, MJ3) are unlikely to represent the same entity.

4.3 Grouping Name Observations Using Hierarchical Agglomerative Clustering
Given the computed similarities, name observations are disambiguated by grouping them according to their represented entities. In this paper, we group name observations using the hierarchical agglomerative clustering (HAC) algorithm, which is widely used in prior disambiguation research and evaluation tasks (WePS1 and WePS2). The HAC algorithm produces clusters in a bottom-up way as follows: initially, each name observation is an individual cluster; then we iteratively merge the two clusters with the largest similarity value to form a new cluster, until this similarity value is smaller than a preset merging threshold or all the observations reside in one common cluster. The merging threshold can be determined through cross-validation. We employ the average-link method to compute the similarity between two clusters, which has been applied in prior disambiguation research (Bagga and Baldwin [1]; Mann and Yarowsky [13]), where the similarity between different clusters, denoted CSIM(ui, uj), is calculated as follows:

CSIM(ui, uj) = (|ui| |uj|)^(-1) Σ_{s∈ui, t∈uj} SIM(s, t)
target representation ol as the weighted average of all the semantic
relatedness between the source concepts in ok and their aligned where s, t are name observations in cluster ui and cluster uj.
target concepts in ol :
5. EXPERIMENTS
 w(c, o )  w( Align(c, o ), o )  sr (c, Align(c, o ))
cok
k l l l To assess the performance of our method and compare it with
SR(ok  ol )  traditional methods, we conduct a series of experiments. In the
cok
 w(c, o )  w( Align(c, o ), o )
k l l experiments, we evaluate our proposed method on the
disambiguation of personal names, which is the most common
Using the alignments shown in Figure 4 and Figure 5, SR(MJ1 type of named entity disambiguation. The experiments are
MJ2) is computed as (0.42×0.47×0.54 + 0.54×0.51×0.66 + conducted on a standard disambiguation data set, the WePS data
0.51×0.51×0.65)/(0.42×0.47 + 0.54×0.51 + 0.51×0.51)=0.62, and set [14,15]. In the following, we first explain the general
SR(MJ2MJ1) is computed as (0.47×0.42×0.54 + experimental settings in Section 5.1, 5.2 and 5.3, then evaluate
0.52×0.54×0.58 + 0.52×0.51×0.60+0.51×0.54×0.66)/(0.47×0.42 and discuss the performance of our method.
+ 0.52×0.54 + 0.52×0.51+0.51×0.54)=0.60.
5.1 Wikipedia Data
Step 3: Compute similarity between two concept vector Wikipedia data can be obtained easily from
representations. We compute the similarity between ol and ok as http://download.wikipedia.org for free research use. It is available
the average of the semantic relatedness from ol to ok and that from in the form of database dumps that are released periodically. The
ok to ol: version we used in our experiments was released on Sep. 9, 2007.
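As an aside, the three-step similarity computation of Section 4.2 can be made concrete in code. The following Python fragment is our own minimal re-implementation (names and structure are assumptions, not the authors' code); it reproduces the worked MJ1/MJ2 example from the Table 3 relatedness values and the concept weights of Figures 4 and 5:

```python
# Relatedness values from Table 3, keyed as (concept in MJ2, concept in MJ1).
SR_TABLE = {
    ("Research", "Researcher"): 0.54, ("Research", "Machine learning"): 0.38,
    ("Research", "Artificial intelligence"): 0.40,
    ("Statistics", "Researcher"): 0.32, ("Statistics", "Machine learning"): 0.58,
    ("Statistics", "Artificial intelligence"): 0.46,
    ("Computer science", "Researcher"): 0.44,
    ("Computer science", "Machine learning"): 0.50,
    ("Computer science", "Artificial intelligence"): 0.60,
    ("Cognitive science", "Researcher"): 0.44,
    ("Cognitive science", "Machine learning"): 0.66,
    ("Cognitive science", "Artificial intelligence"): 0.65,
}

def sr(a, b):
    """Symmetric lookup of the semantic relatedness sr(a, b)."""
    return SR_TABLE.get((a, b), SR_TABLE.get((b, a), 0.0))

# Concept weights w(c, o) as shown in Figures 4 and 5.
MJ1 = {"Researcher": 0.42, "Machine learning": 0.54,
       "Artificial intelligence": 0.51}
MJ2 = {"Research": 0.47, "Statistics": 0.52, "Computer science": 0.52,
       "Cognitive science": 0.51}

def align(c, target):
    """Step 1: map concept c to its most related concept in the target."""
    return max(target, key=lambda ci: sr(c, ci))

def SR(source, target):
    """Step 2: weighted average relatedness from source to target."""
    num = sum(source[c] * target[align(c, target)] * sr(c, align(c, target))
              for c in source)
    den = sum(source[c] * target[align(c, target)] for c in source)
    return num / den

def SIM(o1, o2):
    """Step 3: average of the two directed relatedness values."""
    return 0.5 * (SR(o1, o2) + SR(o2, o1))

print(round(SR(MJ1, MJ2), 2), round(SR(MJ2, MJ1), 2),
      round(SIM(MJ1, MJ2), 2))  # -> 0.62 0.6 0.61
```

The printed values match the SR(MJ1→MJ2) = 0.62, SR(MJ2→MJ1) = 0.60 and SIM(MJ1, MJ2) = 0.61 reported in the text; note that the alignment is many-to-one, so two concepts of MJ1 may legitimately align to the same concept of MJ2.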
We identified over 4,600,000 distinct concepts for the construction of the semantic network. The concepts are highly inter-linked: on average, each concept links to 10 other concepts. This indicates the rich semantic relations between Wikipedia concepts.

5.2 Disambiguation Data Sets

We adopted the standard data sets used in the First Web People Search Clustering Task (WePS1) [14] and the Second Web People Search Clustering Task (WePS2) [15]. All three data sets were used: the WePS1_training data set, the WePS1_test data set, and the WePS2_test data set. Each of the three data sets consists of a set of ambiguous personal names (109 personal names in total); and for each name, its observations in the web pages of the top N (100 for WePS1 and 150 for WePS2) Yahoo! search results need to be disambiguated.

The experiment made the standard "one person per document" assumption, which is widely used in the systems that participated in WePS1 and WePS2, i.e., all the observations of the same name in a document are assumed to represent the same entity. Based on this assumption, the features within the entire web page can be used for disambiguation.

5.3 Evaluation Criteria

We adopted the measures used in WePS1 [14] to evaluate the performance of name disambiguation. These measures are:

Purity (Pur): measures the homogeneity of the observations of names in the same cluster;

Inverse purity (Inv_Pur): measures the completeness of a cluster;

F-Measure (F): the harmonic mean of purity and inverse purity.

The detailed definitions of these measures can be found in Amigo et al. [11]. Because purity and inverse purity are often inversely correlated, they do not always reach their peaks at the same point. In this case, we used the F-measure as the most important measure, just like WePS1 and WePS2.

5.4 Experimental Results

We compared our method with three baselines: (1) The first one is the traditional BOW based method: hierarchical agglomerative clustering (HAC) over term vector similarity, where a web page is represented by features including single words and NEs, and all the features are weighted using TFIDF; we denote this baseline as BOW, and it is also the state-of-the-art method in WePS1 and WePS2. (2) The second one is the social network based method, which is the same as the method described in Malin and Airoldi [3]: HAC over the similarity obtained through a random walk over the social network built from the web pages of the top N search results; we denote this baseline as SocialNetwork. (3) The third one evaluates the efficiency of the Wikipedia concept representation: HAC over the cosine similarity between the Wikipedia concept representations of the name observations; we denote it as WikipediaConcept.

5.4.1 Overall Performance

We conducted several experiments on all three WePS data sets: the baseline BOW, the baseline SocialNetwork, the baseline WikipediaConcept, the proposed method with all Wikipedia concepts assigned the same weight 1.0 (WS-SameWeight), and the proposed method with concept weighting and pruning (WS). All the optimal merging thresholds used in HAC were selected by applying leave-one-out cross-validation. The concept pruning threshold for WS was set to 0.04 through a learning process which will be described in detail in the next section. The overall performance is shown in Table 4.

WePS1_training
Method             Pur    Inv_Pur   F
BOW                0.71   0.88      0.78
SocialNetwork      0.66   0.98      0.76
WikipediaConcept   0.80   0.88      0.82
WS-SameWeight      0.84   0.89      0.85
WS                 0.88   0.89      0.87

WePS1_test
Method             Pur    Inv_Pur   F
BOW                0.74   0.87      0.74
SocialNetwork      0.83   0.63      0.65
WikipediaConcept   0.73   0.72      0.71
WS-SameWeight      0.83   0.87      0.84
WS                 0.88   0.90      0.88

WePS2_test
Method             Pur    Inv_Pur   F
BOW                0.80   0.80      0.77
SocialNetwork      0.62   0.93      0.70
WikipediaConcept   0.71   0.84      0.75
WS-SameWeight      0.84   0.82      0.83
WS                 0.85   0.89      0.86

Table 4. Performance results of the baselines, WS-SameWeight and WS

From the performance results in Table 4, we can see that, among the three baselines:

1) BOW and WikipediaConcept perform better than SocialNetwork: in comparison with SocialNetwork, BOW gets a 6% improvement and WikipediaConcept gets a 5.7% improvement. We believe this is because SocialNetwork only uses the named entities within the context, which is usually insufficient for named entity disambiguation: compared with BOW, it ignores helpful contextual words; compared with WikipediaConcept, it ignores helpful concepts of other types.

2) There is no clear winner between BOW and WikipediaConcept: the winner differs across data sets. This may indicate that Wikipedia concept representations contain as much useful information as the BOW representations do.

By comparing the proposed method with the three baselines, we found that, by leveraging Wikipedia semantic knowledge, our method can greatly improve the disambiguation performance: compared with BOW, WS-SameWeight gets a 7.7% improvement and WS gets a 10.7% improvement on average on the three data sets; compared with SocialNetwork, WS-SameWeight gets a 13.7% improvement and WS gets a 16.7% improvement; compared with WikipediaConcept, WS-SameWeight gets an 8% improvement and WS gets an 11% improvement on average on the three data sets. Comparing the performance of the two proposed methods, WS-SameWeight and WS, we find that concept weighting and pruning improve the proposed method by 3% on average.
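The grouping procedure of Section 4.3 that all of the systems above share can be sketched as follows. This is a minimal Python illustration of average-link HAC with a merging threshold (our own sketch under assumed names, not the evaluated systems' code); `sim` is any symmetric pairwise similarity function over name observations:

```python
def csim(u, v, sim):
    """Average-link similarity: CSIM(u, v) = (|u||v|)^-1 * sum of pairwise sims."""
    return sum(sim(s, t) for s in u for t in v) / (len(u) * len(v))

def hac(observations, sim, threshold):
    """Merge the two most similar clusters until the best average-link
    similarity drops below `threshold`; return the final clusters."""
    clusters = [[o] for o in observations]  # each observation starts alone
    while len(clusters) > 1:
        # Find the pair of clusters with the largest CSIM.
        i, j = max(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda p: csim(clusters[p[0]], clusters[p[1]], sim),
        )
        if csim(clusters[i], clusters[j], sim) < threshold:
            break  # no sufficiently similar pair remains
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy example: observations 'a' and 'b' are highly similar, 'c' is not.
pair_sim = {frozenset("ab"): 0.9, frozenset("ac"): 0.1, frozenset("bc"): 0.1}
sim = lambda s, t: pair_sim[frozenset((s, t))]
print(hac(list("abc"), sim, threshold=0.5))  # -> [['a', 'b'], ['c']]
```

Lowering the threshold below 0.1 would merge everything into one cluster, which mirrors how the merging threshold trades purity against inverse purity and why it is tuned by cross-validation.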
Representation       Features
Terms                machine(5), learning(5), networks(2), statistics(2), David(2), cognitive(2), Department(2), students(2), postdocs(2), field(2)
Named Entities       Andrew Ng, David Blei, David E. Rumelhart, Lawrence Saul, Tommi Jaakkola, Zoubin Ghahramani, Berkeley, California, Department of EECS, Department of Statistics, PDP Group
Wikipedia Concepts   Statistics (0.273), Machine learning (0.269), Artificial intelligence (0.267), University of California, Berkeley (0.225), David Rumelhart (0.218), Inference (0.215), Professor (0.210), Bayesian network (0.210), Expectation-maximization algorithm (0.201), Doctor of Philosophy (0.194), Variational Bayesian methods (0.193), Postgraduate education (0.169), Zoubin Ghahramani (0.167), Student (0.158), Researcher (0.157), Postdoctoral researcher (0.144), Cognitive model (0.124), Perspective (cognitive) (0.108), Recurrent neural network (0.103), Formal system (0.082)

Table 5. The representations of Professor Michael Jordan's Wikipedia Page

Representation       Features
Terms                vol(9), pp(9), Research(6), Learning(6), Machine(5), Bayesian(4), Science(4), Fellow(4), 2006(4), Electrical(3), Engineering(3), Berkeley(3), Statistical(3)
Named Entities       A. Y. Ng, B. Taskar, D. M. Blei, Z. Ghahramani, P. Xing, W. Teh, D. Wolpert, AAAI, AAAS, IEEE, IMS, American Statistical Association, Arizona State University, Berkeley, Department of Electrical Engineering and Computer Science, Department of Statistics, University of California
Wikipedia Concepts   Computer science (0.257), Statistics (0.253), Neural network (0.244), Artificial intelligence (0.242), Cognitive science (0.238), Research (0.237), Bioinformatics (0.234), Massachusetts Institute of Technology (0.225), Machine learning (0.223), Inference (0.215), Robotics (0.211), Institute of Electrical and Electronics Engineers (0.200), Bayesian inference (0.192), Molecular biology (0.190), Bioengineering (0.185), University of California, Berkeley (0.183), Distributed computing (0.177), Prediction (0.169), Doctor of Philosophy (0.168), Computational biology (0.160)

Table 6. The representations of Professor Michael Jordan's Home Page

5.4.2 Optimizing Parameters

Our proposed method selects helpful concepts for disambiguation by assigning them weights and pruning them. A weight threshold needs to be set for pruning the outlier concepts. Usually a larger threshold will filter out more outlier concepts, but meanwhile it will also filter out more helpful concepts. Figure 6 plots the tradeoff. For the WePS1_training and WePS1_test data sets, a threshold of 0.04 results in the best performance. But the pruning of concepts leads to a decline in performance on the WePS2_test data set. Overall, the best threshold can only enhance the performance to a limited extent (0.1% on average). We believe this is because the concept weighting is good enough for disambiguation, so a pruning step cannot make significant improvements.

Figure 6. The F-Measure vs. the concept weight threshold on the three data sets

5.4.3 Detailed Analysis

To better understand the reasons why our proposed method works better than the BOW based methods and the social network based methods, we analyze the features of name observations generated by the different methods and show how they affect the similarity measures.

For demonstration, Tables 5 and 6 respectively show the top weighted features of two web pages which talk about the Berkeley professor Michael Jordan: one is his Wikipedia page (http://en.wikipedia.org/wiki/Michael_I._Jordan) and the other is his Berkeley homepage (http://www.eecs.berkeley.edu/Faculty/Homepages/jordan.html). The occurrence counts of the terms and the weights of the Wikipedia concepts are shown in the brackets after them.

Comparison of Representations. As shown in Tables 5 and 6, the feature representations generated by the different methods are different: the BOW based methods represent a name observation as a vector of terms; the social network based methods represent a name observation as a set of named entities; our method represents a name observation as a Wikipedia concept vector. Compared with the term vector representation and the named entity representation, the Wikipedia concept vector representation has the following advantages:
1) Compared with the term vector representation, the Wikipedia concept vector representation is more meaningful. All the features in the Wikipedia concept representation are concepts, which are themselves semantic units, while some terms in the term vector representation cannot convey their actual meaning by themselves. For example, the proposed method extracts a feature Machine learning from the phrase "Machine learning", while the term vector representation extracts two separate terms, machine and learning.

2) Compared with the social network based methods, our method can generate features of a larger scope. Besides the named entities, our method can also extract the concepts of other types contained in Wikipedia, such as occupation, subject and degree, which are also very useful for disambiguation; for example, the concepts Statistics, Professor, Computer science and Machine learning in Tables 5 and 6.

3) All the features in the Wikipedia concept vector representation correspond to Wikipedia articles, rather than to their surface forms. So our method can handle acronyms and spelling variations by mapping them into the same concept, while the other two representations usually lack this ability. For example, Andrew Ng in Table 5 and A. Y. Ng in Table 6 are actually the same person, and AAAS and American Statistical Association in Table 6 are actually the same organization, but the social network based methods cannot recognize them as the same one. On the other hand, it is obvious that the incorporation of semantic knowledge will be more effective and efficient using the Wikipedia concept representation: for every feature there is an article in Wikipedia which can provide detailed knowledge about it.

Comparison of Similarity Measures. When measuring the similarity between name observations, the three methods (BOW, social network and the proposed method) use different measures: the term vector similarity used in the BOW based methods is determined by term co-occurrence statistics; the social network similarity is determined by the social relatedness between contextual named entities; and our proposed similarity is determined by the semantic relatedness between Wikipedia concepts. Compared with the other two similarity measures, the proposed similarity measure shows the following advantages:

1) Compared with the other two similarity measures, our proposed similarity measure can incorporate more semantic relations between features. The term vector similarity ignores all the semantic relations between terms, such as the associative relatedness between statistics and Bayesian, and the social relatedness between Berkeley and David. The social network based methods can only capture the social relatedness between named entities, such as that between University of California and Department of EECS, or between Andrew Ng and Z. Ghahramani, in Tables 5 and 6. But they cannot incorporate semantic relations of other types, such as the associative relatedness between machine learning and statistics, or between Bayesian network and Cognitive science. Compared with the above two similarity measures, our proposed similarity measure can incorporate all these semantic relations.

2) The relatedness measure between features (terms, named entities and concepts) used in the proposed similarity measure is more reliable and accurate. The term vector similarity measures the relatedness between terms as either 0 or 1, which often conflicts with reality. For example, the two terms statistics and statistical are obviously more related than statistics and pp, but the term vector similarity gives them the same relatedness, 0. Currently, the social relatedness between named entities is usually set by manually defined heuristic rules (Malin and Airoldi [3]; Minkov et al. [10]), while, based on the large-scale and semantically rich data in Wikipedia, the semantic relatedness measure between concepts has shown its efficiency in Milne and Witten [6].

5.4.4 Comparison with State-of-the-art Performance

Figure 7. A comparison with WePS1 systems

Figure 8. A comparison with WePS2 systems using the B-Cubed F-measure

We also compared our method with the state-of-the-art performance in WePS1 (Artiles et al. [14]) and WePS2 (Artiles et al. [15]). Because WePS2 evaluated the participating systems using the B-Cubed measures, we compared our method with the systems participating in WePS2 by optimizing our method on the B-Cubed F-measure. The comparison results are shown in Figures 7 and 8. As shown in Figure 7, our method gets a 10% improvement over the best system of WePS1. As shown in Figure 8, in comparison with the systems participating in WePS2, our method can obtain the same performance as the best solution. We believe our method is competitive: the best solution in WePS2 extracted additional features such as the title words of the root page of the given web page and used some large additional resources such as the Web 1T 5-gram corpus of Google, while these features and knowledge are not used in our proposed method. And we believe our method can be further improved by collecting additional disambiguation evidence from the web.

6. CONCLUSIONS AND FUTURE WORK

In this paper, we demonstrate how to leverage the semantic knowledge in Wikipedia, so that the performance of named entity
disambiguation can be enhanced by obtaining a more accurate similarity measure between name observations. Concretely, we construct a large-scale semantic network from Wikipedia, so that the semantic knowledge can be used efficiently and effectively. Based on the constructed semantic network, a novel similarity measure is proposed to leverage Wikipedia semantic knowledge for disambiguation. On the standard WePS data sets, our method achieves appealing results: it gets a 10.7% improvement over the traditional BOW based method and a 16.7% improvement over the traditional social network based methods.

For future work, because Wikipedia also provides other semantic knowledge such as the category hierarchy and structural descriptions of entities (e.g., the infobox), Wikipedia semantic knowledge can also be used to tag and generate a concise structural summary of disambiguation results. Furthermore, Wikipedia semantic knowledge is also very useful in many other tasks, such as knowledge base population, link analysis, document clustering and classification.

7. ACKNOWLEDGMENTS

The work is supported by the National High Technology Development 863 Program of China under Grant no. 2006AA01Z144, and the National Natural Science Foundation of China under Grants no. 60673042 and 60875041.

8. REFERENCES

[1] A. Bagga and B. Baldwin. Entity-Based Cross-Document Coreferencing Using the Vector Space Model. In Proc. of HLT/ACL, 1998.
[2] B. Malin. Unsupervised Name Disambiguation via Social Network Similarity. In Proc. of SIAM, 2005.
[3] B. Malin and E. Airoldi. A Network Analysis Model for Disambiguation of Names in Lists. In Proc. of CMOT, 2005.
[4] Cheng Niu, Wei Li and R. Srihari. Weakly Supervised Learning for Cross-document Person Name Disambiguation Supported by Information Extraction. In Proc. of ACL, 2004.
[5] D. Milne and Ian H. Witten. Learning to Link with Wikipedia. In Proc. of CIKM, 2008.
[6] D. Milne and Ian H. Witten. An Effective, Low-cost Measure of Semantic Relatedness Obtained from Wikipedia Links. In Proc. of AAAI, 2008.
[7] D. Milne, O. Medelyan and Ian H. Witten. Mining Domain-Specific Thesauri from Wikipedia: A Case Study. In Proc. of IEEE/WIC/ACM WI, 2006.
[8] D. V. Kalashnikov, R. Nuray-Turan and S. Mehrotra. Towards Breaking the Quality Curse: A Web-Querying Approach to Web People Search. In Proc. of SIGIR, 2008.
[9] E. Gabrilovich and S. Markovitch. Feature Generation for Text Categorization Using World Knowledge. In Proc. of IJCAI, 2005.
[10] Einat Minkov, William W. Cohen and Andrew Y. Ng. Contextual Search and Name Disambiguation in Email Using Graphs. In Proc. of SIGIR, 2006.
[11] Enrique Amigo, Julio Gonzalo, Javier Artiles and Felisa Verdejo. A Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints. Information Retrieval, 2008.
[12] E. Gabrilovich and S. Markovitch. Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. In Proc. of IJCAI, 2007.
[13] Gideon S. Mann and David Yarowsky. Unsupervised Personal Name Disambiguation. In Proc. of CoNLL, 2003.
[14] Javier Artiles, Julio Gonzalo and Satoshi Sekine. The SemEval-2007 WePS Evaluation: Establishing a Benchmark for the Web People Search Task. In SemEval, 2007.
[15] Javier Artiles, Julio Gonzalo and Satoshi Sekine. WePS2 Evaluation Campaign: Overview of the Web People Search Clustering Task. In WePS2, WWW 2009, 2009.
[16] Jian Hu, Lujun Fang, Yang Cao, et al. Enhancing Text Clustering by Leveraging Wikipedia Semantics. In Proc. of SIGIR, 2008.
[17] J. Hassell, B. Aleman-Meza and I. B. Arpinar. Ontology-Driven Automatic Entity Disambiguation in Unstructured Text. In Proc. of ISWC, 2006.
[18] Kai-Hsiang Yang, Kun-Yan Chiou, Hahn-Ming Lee and Jan-Ming Ho. Web Appearance Disambiguation of Personal Names Based on Network Motif. In Proc. of WI, 2006.
[19] O. Medelyan, Ian H. Witten and D. Milne. Topic Indexing with Wikipedia. In WIKIAI, AAAI 2008, 2008.
[20] R. Mihalcea and A. Csomai. Wikify!: Linking Documents to Encyclopedic Knowledge. In Proc. of CIKM, 2007.
[21] Michael Ben Fleischman. Multi-Document Person Name Resolution. In Proc. of ACL, 2004.
[22] Razvan Bunescu and Marius Pasca. Using Encyclopedic Knowledge for Named Entity Disambiguation. In Proc. of EACL, 2006.
[23] Ron Bekkerman and Andrew McCallum. Disambiguating Web Appearances of People in a Social Network. In Proc. of WWW, 2005.
[24] Silviu Cucerzan. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proc. of EMNLP, 2007.
[25] M. Strube and S. P. Ponzetto. WikiRelate! Computing Semantic Relatedness Using Wikipedia. In Proc. of AAAI, 2006.
[26] Ted Pedersen, Amruta Purandare and Anagha Kulkarni. Name Discrimination by Clustering Similar Contexts. In Proc. of CICLing, 2005.
[27] Xiaojun Wan, Jianfeng Gao, Mu Li and Binggong Ding. Person Resolution in Person Search Results: WebHawk. In Proc. of CIKM, 2005.
[28] Ying Chen and James Martin. Towards Robust Unsupervised Personal Name Disambiguation. In Proc. of EMNLP, 2007.
[29] Birger Hjørland. Semantics and Knowledge Organization. Annual Review of Information Science and Technology, 41:367-405, 2007.
