
Combining Strategies for Extracting Relations from Text Collections

Eugene Agichtein Eleazar Eskin Luis Gravano

Department of Computer Science


Columbia University
{eugene,eeskin,gravano}@cs.columbia.edu

Abstract

Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or for running data mining tasks. Our Snowball system extracts these relations from document collections starting with only a handful of user-provided example tuples. Based on these tuples, Snowball generates patterns that are used, in turn, to find more tuples. In this paper we introduce a new pattern and tuple generation scheme for Snowball, with different strengths and weaknesses than those of our original system. We also show preliminary results on how we can combine the two versions of Snowball to extract tuples more accurately.

Figure 1: The main components of Snowball. [The figure shows the bootstrapping loop: Seed Tuples → Find Occurrences of Seed Tuples → Tag Entities → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table.]
1 Introduction

Text documents often hide valuable structured data. For example, a collection of newspaper articles might contain information on the location of the headquarters of a number of organizations. The web contains millions of pages whose text hides data that would be best exploited in structured form.

Brin [4] proposed the idea of DIPRE, which uses bootstrapping for extracting structured relations (or tables) from the web. A key assumption is that the table to be extracted appears redundantly in the document collection. As a result of this assumption, the patterns that DIPRE generates need not be overly general to capture every instance of an organization-location tuple. In effect, a system based on the DIPRE method will perform reasonably well even if certain instances of a tuple are missed, as long as the system captures one such instance. This approach is in contrast with the goals of traditional information extraction research, where a system attempts to extract as much information as possible from each document [13]. DIPRE, on the other hand, attempts to build the most comprehensive table from all of the documents in the collection. In [1] we built on this approach and introduced Snowball. We developed a method for defining and representing extraction patterns that is at the same time flexible, so that we capture most of the tuples that are hidden in the text of our collection, and selective, so that we do not generate invalid tuples. We also introduced a strategy for estimating the reliability of the extracted patterns and tuples. Finally, we presented a scalable evaluation methodology and associated metrics, which we used for large-scale experiments over collections of over 300,000 real documents. Our experiments showed that Snowball was able to extract more than 80% of the organization-location pairs mentioned in the collection with high precision.
The basic architecture of Snowball is shown in Figure 1. Initially, we provide Snowball with a handful of instances of valid organization-location pairs, such as the tuple <Microsoft, Redmond> (Table 1). Our system searches for occurrences of the example tuples' organizations and locations in the documents, identifying text lines where an organization and its corresponding location occur together. From these tagged example contexts, the system learns patterns that would indicate the desired relationship. For instance, from examining the occurrences of the seed tuples, we might learn that a context "<LOCATION>-based <ORGANIZATION>" is likely to indicate that LOCATION is the headquarters of the ORGANIZATION. Patterns built from examples like these are then used to scan through the corpus, discovering new tuples. The new tuples are evaluated, the most reliable ones are used as the new seed tuples, and the process repeats. A key step in generating and later matching patterns is finding where <ORGANIZATION> and <LOCATION> entities occur in the text. For this we tag the text documents using the MITRE Corporation's Alembic Workbench [8].

Organization    Location of Headquarters
MICROSOFT       REDMOND
EXXON           IRVING
IBM             ARMONK
BOEING          SEATTLE
INTEL           SANTA CLARA

Table 1: User-provided example tuples.

Related Work  Brin's DIPRE method and our Snowball system both address issues that have long been the subject of information extraction research. However, DIPRE and Snowball do not attempt to extract all the relevant information from each document, which has been the goal of traditional information extraction systems [13, 10]. One of the major challenges in information extraction is the necessary amount of manual tagging involved in training the system for each new task. [14] generates extraction patterns automatically by using a training corpus of documents that were manually marked as either relevant or irrelevant for the topic. This approach requires less manual labor than tagging the documents in full, but the effort involved is nevertheless substantial. [7] describes machine learning techniques for creating a knowledge base from the web, consisting of classes of entities and relations, by exploiting the content of the documents, as well as the link structure of the web. This method requires training over a large set of web pages, with relevant document segments manually labeled, as well as a large training set of page-to-page relations.

Finally, a number of systems use unlabeled examples for training. This direction of research is closest to our work. Specifically, the approach we are following falls into the broad category of bootstrapping techniques that have been successfully applied in other contexts. [17] demonstrated a bootstrapping technique for disambiguating senses of ambiguous nouns. [6] and [15] use bootstrapping to classify named entities in text. [18] describes an extension of DIPRE to mining the web for acronyms and their expansions. [3] presents a methodology and theoretical framework for combining unlabeled examples with labeled examples to boost the performance of a learning algorithm for classifying web pages. While the underlying principle of using the system's output to generate the training input for the next iteration is the same for all of these approaches, the tasks are different enough to require specialized methodologies.

Our Contributions  In this paper we consider two alternative methods for representing the textual contexts around the tuples that we want to identify. In Section 2.1 we briefly review the original Snowball system that we presented in [1], and which we refer to as Snowball-VS in this paper. Snowball-VS considers the textual context around the entities as an unordered collection of keywords. In Section 2.2 we introduce Snowball-SMT, a new system that takes advantage of the order of the words in the contexts. In Section 3 we present our preliminary exploration of methods to combine these complementary systems. Our approach allows us to exploit different representations of the data for the problem. In Section 4 we outline the experimental setup and evaluation methodology for the experiments in Section 5. Section 6 contains our preliminary conclusions and a discussion of future work.

2 Snowball

In this section, we explore different methods to learn patterns and generate tuples for Snowball: Snowball-VS, our original implementation [1], uses a vector-space model, whereas Snowball-SMT, a new system that we present in this paper, represents text as an ordered sequence of terms.
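Before turning to the two variants, the shared bootstrapping cycle of Figure 1 can be made concrete with a small, self-contained Python sketch. It deliberately replaces Snowball's learned patterns with literal middle strings (a DIPRE-like simplification); the toy corpus and the function name are illustrative only, not part of any released Snowball code.

    import re

    def snowball_loop(corpus, seeds, iterations=3):
        # Toy bootstrapping cycle: find occurrences of known tuples, derive
        # "patterns" from the literal text between the entities, and rescan
        # the corpus for new tuples until an iteration finds nothing new.
        table = set(seeds)
        for _ in range(iterations):
            patterns = set()
            for org, loc in table:
                for line in corpus:
                    m = re.search(re.escape(org) + r"(.{1,30}?)" + re.escape(loc), line)
                    if m:
                        patterns.add(m.group(1))
            new = set()
            for middle in patterns:
                for line in corpus:
                    for m in re.finditer(r"([A-Z][\w&]+)" + re.escape(middle)
                                         + r"([A-Z]\w+(?: [A-Z]\w+)*)", line):
                        new.add((m.group(1), m.group(2)))
            if new <= table:      # no new tuples: stop iterating
                break
            table |= new
        return table

    corpus = ["Microsoft, based in Redmond, said on Monday...",
              "Exxon, based in Irving, announced...",
              "Intel, based in Santa Clara, shipped..."]
    print(snowball_loop(corpus, {("Microsoft", "Redmond")}))

Each pass uses the current table as seeds, derives patterns from their occurrences, and rescans the corpus, mirroring the loop in Figure 1. The next two subsections replace the literal middle strings with the pattern representations that the real systems use.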
2.1 Snowball-VS

Snowball-VS is initially given a handful of example tuples. For every such organization-location tuple <o, ℓ>, Snowball-VS finds segments of text in the document collection where o and ℓ occur close to each other, and analyzes the text that connects o and ℓ to generate extraction patterns that will later be used to discover new tuples.

Generating Patterns and Tuples  A crucial step in the mining process is the generation of patterns that will be used to find new tuples in the documents. Ideally, we would like patterns both to be selective, so that they do not generate incorrect tuples, and to have high coverage, so that they identify many new tuples.

To improve the generality of the patterns, we represent the left, middle, and right contexts associated with a pattern analogously to the way the vector-space model of information retrieval represents documents and queries [16]. Thus, the left, middle, and right contexts are three vectors associating weights with terms. These weights indicate the importance of each term in the corresponding context. An example of a Snowball-VS pattern is the 5-tuple <{<the, 0.2>}, LOCATION, {<-, 0.5>, <based, 0.5>}, ORGANIZATION, {}>. This pattern will match strings like "the Irving-based Exxon Corporation...". To match text portions with our 5-tuple representation of patterns, Snowball-VS also associates a 5-tuple t with each document portion that contains two named entities with the correct tags (i.e., LOCATION and ORGANIZATION in our scenario), and matches it against the 5-tuple pattern p, where the degree of match Match(t, p) is calculated as the normalized sum of the inner products of the corresponding left, middle, and right context vectors.
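As a sketch of this degree-of-match computation, the snippet below represents each of the three contexts as a sparse term-weight dictionary and averages the normalized inner products. The exact normalization scheme is an assumption here (the text only specifies a "normalized sum of inner products"), and the tag check is presumed to have happened already.

    from math import sqrt

    def dot(u, v):
        # Inner product of two sparse term-weight vectors.
        return sum(w * v.get(t, 0.0) for t, w in u.items())

    def norm(u):
        return sqrt(dot(u, u)) or 1.0   # treat an empty context as norm 1

    def match(t, p):
        # t and p are (left, middle, right) triples of term-weight dicts;
        # average the three normalized inner products.
        return sum(dot(tv, pv) / (norm(tv) * norm(pv)) for tv, pv in zip(t, p)) / 3.0

    # The pattern from the text: <{<the, 0.2>}, LOCATION,
    #                             {<-, 0.5>, <based, 0.5>}, ORGANIZATION, {}>
    pattern = ({"the": 0.2}, {"-": 0.5, "based": 0.5}, {})
    segment = ({"the": 1.0}, {"-": 0.7, "based": 0.7}, {})  # "the Irving-based Exxon..."
    print(match(segment, pattern))  # ~0.67: left and middle match fully, empty right adds 0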
In order to generate a pattern, we group occurrences of known tuples in documents that occur in similar contexts. More precisely, Snowball-VS generates a 5-tuple for each string where a seed tuple occurs, and then clusters these 5-tuples using a simple single-pass bucket clustering algorithm [11], using the Match function described above to calculate the similarity between the 5-tuples, with minimum similarity threshold τ_sim. The pattern is represented as the representative 5-tuple of the cluster: the left vectors in the 5-tuples of the cluster are represented by a centroid l_s. Similarly, we collapse the middle and right vectors into m_s and r_s, respectively. These three centroids, together with the original tags (which are the same for all the 5-tuples in the cluster), form a Snowball-VS pattern <l_s, tag1, m_s, tag2, r_s>. As an initial filter, we eliminate all patterns supported by fewer than τ_sup seed tuples.
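A compact sketch of this single-pass grouping follows, under two assumptions the text does not spell out: each incoming 5-tuple joins the first sufficiently similar cluster, and centroids are kept as running term-weight sums (which leaves cosine-style similarities unchanged up to scale).

    def cluster_contexts(five_tuples, similarity, tau_sim):
        # Single-pass bucket clustering: a context joins the first cluster whose
        # centroid it matches with similarity >= tau_sim, else starts a new one.
        clusters = []   # each: {"centroid": [left, middle, right], "members": [...]}
        for t in five_tuples:
            for c in clusters:
                if similarity(t, tuple(c["centroid"])) >= tau_sim:
                    for vec, cvec in zip(t, c["centroid"]):
                        for term, w in vec.items():
                            cvec[term] = cvec.get(term, 0.0) + w
                    c["members"].append(t)
                    break
            else:
                clusters.append({"centroid": [dict(v) for v in t], "members": [t]})
        # The support filter from the text would then drop weak patterns, e.g.:
        # clusters = [c for c in clusters if len(c["members"]) >= tau_sup]
        return clusters

With the match function above supplied as the similarity, each surviving centroid triple, together with the entity tags, plays the role of a pattern <l_s, tag1, m_s, tag2, r_s>.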
Using these patterns, Snowball-VS scans the collection to discover new tuples. The system first identifies sentences that include an organization and a location, as determined by the named-entity tagger. For a given text segment with an associated organization o and location ℓ, Snowball-VS generates the 5-tuple t = <l_c, tag1, m_c, tag2, r_c>. A candidate tuple <o, ℓ> is generated if there is a pattern t_p such that Match(t, t_p) ≥ τ_sim, where τ_sim is the clustering similarity threshold. Each candidate tuple may be generated multiple times from different text segments, using either a single pattern to match the segments, or different patterns. For each candidate tuple, we store the set of patterns that generated it, each with an associated degree of match. Snowball-VS uses this information, together with information about the selectivity of the patterns, to decide what candidate tuples to actually add to the table that it is constructing.

Evaluating Patterns and Tuples  We can weigh the Snowball-VS patterns based on their selectivity, and trust the tuples that they generate accordingly. Thus, a pattern that is not selective will have a low confidence value. The tuples generated by such a pattern will be discarded, unless they are supported by selective patterns. Intuitively, the confidence of a tuple will be high if it is generated by several highly selective patterns.

We estimate the selectivity of each pattern during our scan of the corpus to discover new tuples. If a sentence matches one of our patterns and contains an organization that we have discovered in an earlier iteration of the system, we check whether the new location agrees with a previously extracted, known headquarters location for this organization. If so, this new match is considered positive for the pattern. Otherwise, the match is negative. This allows us to compute the confidence of the pattern. Note that this confidence computation assumes that organization is a key for the relation that we are extracting (i.e., two different tuples in a valid instance of the relation cannot agree on the organization attribute). Estimating the confidence of the patterns in discovering relations without such a single-attribute key is part of our future work. The confidence of a pattern P is defined as:

    Conf(P) = P.positive / (P.positive + P.negative)

where P.positive is the number of positive matches for P and P.negative is the number of negative matches. For illustration purposes, Table 2 lists three representative patterns that Snowball-VS extracted from the document collection described in Section 4.

Having scored the patterns, we are now able to evaluate the new candidate tuples. For each tuple we store the set of patterns that produced it, together with the degree of match between the context in which the tuple occurred and the matching pattern. The confidence of a candidate tuple T is:

    Conf(T) = 1 - ∏_{i=0}^{|P|} (1 - Conf(P_i) · Match(C_i, P_i))

where P = {P_i} is the set of patterns that generated T, and C_i is the context associated with an occurrence of T that matched P_i with degree of match Match(C_i, P_i). From the set of discovered tuples, the most reliable ones are selected as the seed for the next iteration of the system. A tuple T is added to the seed set if Conf(T) ≥ τ_min.
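These two definitions translate directly into code. A minimal sketch follows; the guard for a pattern that never fired is an implementation detail not discussed in the text.

    def pattern_confidence(positive, negative):
        # Conf(P) = P.positive / (P.positive + P.negative)
        total = positive + negative
        return positive / total if total else 0.0

    def tuple_confidence(evidence):
        # Noisy-or over the patterns that generated the tuple:
        # Conf(T) = 1 - prod_i (1 - Conf(P_i) * Match(C_i, P_i)).
        # `evidence` is a list of (pattern_confidence, degree_of_match) pairs.
        remainder = 1.0
        for p_conf, degree in evidence:
            remainder *= 1.0 - p_conf * degree
        return 1.0 - remainder

    # A tuple seen by one selective pattern and one weak pattern:
    print(tuple_confidence([(pattern_confidence(9, 1), 0.8),
                            (pattern_confidence(3, 7), 0.5)]))   # ~0.76

A tuple then enters the next iteration's seed set whenever its computed confidence reaches τ_min.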
Conf    Middle                                         Right
1.00    <based, 0.53>, <in, 0.53>                      <, , 0.01>
0.69    <', 0.42>, <s, 0.42>, <headquarters, 0.42>     <in, 0.12>
0.61    <(, 0.93>                                      <), 0.12>

Table 2: Actual patterns discovered by Snowball. (For all three of these patterns, the left vectors are empty, tag1 = ORGANIZATION, and tag2 = LOCATION.)

2.2 Snowball-SMT

The Snowball-VS patterns model each context as a bag of words, ignoring word order. These patterns then concentrate on the presence or absence of certain keywords. For instance, a context such as "... where Microsoft is located. Which Silicon Valley startup ..." will match a pattern <{}, ORGANIZATION, {<which, 0.5>, <is, 0.5>, <located, 0.5>, <in, 0.5>}, LOCATION, {}>, producing an incorrect tuple <Microsoft, Silicon Valley>. In this section we introduce Snowball-SMT, a variant of Snowball that takes into account the order of the words in each context, while keeping the patterns flexible enough to have high coverage. For this purpose, we model the textual contexts as ordered sequences of tokens and try to estimate the probability of sentences containing an instance of the organization-location relationship.

Thus, if a seed organization and its correct location are mentioned in the same sentence, the text context surrounding the entities is converted into a sequence of tokens, and a positive example is added to the training set. If the location does not match the known headquarters of this organization, a negative example is added. In each iteration, Snowball-SMT is trained on this set of examples, and builds a model that best describes the training set. Snowball-SMT then scans the corpus again, generating a tuple each time that a sequence of terms in the context surrounding the entities is accepted by the model.

We represent contexts as ordered sequences using sparse Markov transducers (SMTs), which estimate a probability distribution conditioned on a sequence. In our problem, we compute the probability that a tuple is an organization-location pair conditioned on the sequence of terms that make up the context of the tuple. The probability distribution is conditioned on some of these words and not the others. We wish to represent a part of the conditional sequence of words as "don't care", or φ-terms, in the probability model. For instance, the probability of a text fragment "near Boeing's renovated Seattle headquarters" containing a tuple T = <Boeing, Seattle> would be calculated as

    Conf(T) = P(T | near φ⁰ 's φ¹ headquarters)

where the system ignores the term "renovated" as irrelevant.

More formally, a sparse Markov transducer is a conditional probability of the form:

    P(T | φ^{n_1} t_1 φ^{n_2} t_2 ... φ^{n_k} t_k)

where T is the output label that Snowball-SMT returns upon recognizing a tuple. Each t_i is the i-th term in the context surrounding the entities, arranged into a sequence by starting from the terms on the left of the leftmost entity, adding the terms between the entities, and following with the terms to the right of the rightmost entity (as in the example above). In the equation, φ^{n_i} represents n_i consecutive φ-terms, and for a sequence of length n, n_1 + ... + n_k + k = n.

To estimate SMTs we use a type of prediction suffix tree called a sparse prediction tree, which is representationally equivalent to sparse Markov transducers. These trees probabilistically map the context of a tuple to a probability that the tuple is an organization-location pair.

A sparse prediction tree is a rooted tree where each node is either a leaf node or contains one branch labeled with φⁿ (n ≥ 0), which forks into a branch for each word. The paths from the root node to the leaf nodes represent the sequences of terms that make up the contexts surrounding the entities. Each leaf node stores an estimate of the probability that, if the node was reached, the context that was used to generate the path to the node contains a valid organization-location tuple. Figure 2 shows a sparse Markov tree. For example, the node labeled 3 would be reached by following the terms making up the context "<ORGANIZATION> based in <LOCATION>".

Figure 2: An example sparse Markov tree. [The figure shows a root with branches for the words "based", "located", and "near", interior φⁿ skip branches, further branches on "at" and "in", and leaf nodes numbered 1 through 7.]

A tree is used to obtain a probability for a tuple by following the context from the root node to a leaf node, skipping a token in the context for each φ along the path. The leaf node contains the tuple's probability of being an organization-location pair. The topology of a tree encodes the positions of the φ-terms in the probability distribution. Because we do not know the positions of the φ-terms for each context a priori, we do not know the best topology of the prediction tree to use. We approximate the best tree using a Bayesian mixture (weighted sum) technique. Instead of using a single tree, we use a weighted sum of all possible trees as our predictor. We then use a Bayesian update rule (described in Section 3) to update the weight of each tree based on its performance on a given element in the data set. At the end of this process, we have a weighted sum of trees in which the best performing trees in the set of all trees have the highest weights. The sparse prediction tree is rebuilt from scratch from the set of positive and negative examples on each iteration of Snowball-SMT. An in-depth description of SMTs is given in [9].
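To illustrate how a single sparse prediction tree maps a context to a probability, here is a small hand-built sketch for the running example. This fixes one topology, whereas, as described above, Snowball-SMT maintains a Bayesian mixture over all possible topologies rather than a single tree; the probability value is a made-up illustration.

    class Node:
        # A sparse prediction tree node: first consume `skip` phi (wildcard)
        # tokens, then branch on the next word; leaves hold a probability.
        def __init__(self, skip=0, children=None, prob=None):
            self.skip = skip
            self.children = children or {}
            self.prob = prob

        def lookup(self, context):
            # Follow the context from the root to a leaf and return its
            # probability, or None if the tree does not cover this sequence.
            if self.prob is not None:
                return self.prob
            rest = context[self.skip:]
            if not rest or rest[0] not in self.children:
                return None
            return self.children[rest[0]].lookup(rest[1:])

    # Encodes P(T | near phi^0 's phi^1 headquarters) from the example above.
    tree = Node(children={
        "near": Node(children={
            "'s": Node(skip=1, children={          # phi^1 skips "renovated"
                "headquarters": Node(prob=0.9)     # illustrative estimate
            })
        })
    })
    print(tree.lookup(["near", "'s", "renovated", "headquarters"]))   # 0.9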
3 Combining Snowball-VS and Snowball-SMT

The two systems that we used in our experiments, Snowball-VS and Snowball-SMT, focus on two different aspects of the textual context: the presence or absence of keywords that tend to indicate the correct relationships (Snowball-VS), and the order of the words in the contexts surrounding the entities (Snowball-SMT). We explore how to combine the two systems with the goal of improving our overall extraction accuracy. Combining predictors to increase accuracy is an active area of research. Some of the methods we considered include sleeping-experts, boosting by majority [12], and co-training [3]. In this section, we explore preliminary ways in which we can combine Snowball-VS and Snowball-SMT. (We discuss this issue further in Section 6.)

Initially, both Snowball-VS and Snowball-SMT receive the same set of seed tuples (Figure 3). Each system runs for one iteration, producing the sets of tuples Seed_VS and Seed_SMT, respectively. These two sets of tuples are combined into one set Seed_Combined (we will describe how shortly). Seed_Combined is then used as the set of seed tuples for both Snowball-VS and Snowball-SMT, and both systems are run for another iteration. This process repeats until we stop discovering new tuples. The final step of the extraction process returns the set Seed_Combined, containing the combination of the final set of tuples discovered by Snowball-VS and Snowball-SMT.

We explored three options for combining the tuples discovered by Snowball-VS and Snowball-SMT to create the new set of seed tuples: the Union, the Intersection, and the weighted Mixture of the tuples produced by the individual systems. The Intersection strategy was motivated by [3].
Figure 3: Combining Snowball-VS and Snowball-SMT into one system. [The figure shows the two pipelines side by side: each runs Find Occurrences of Seed Tuples, Tag Entities, then Generate Extraction Patterns (Snowball-VS) or Learn Model (Snowball-SMT), and finally Generate New Seed Tuples; a Combining Algorithm merges the two sets of new seed tuples.]

To implement the Union and Intersection strategies, the sets of tuples produced by Snowball-VS and Snowball-SMT are filtered using each system's individual thresholds for generating seed tuples, and the resulting sets are combined. In the Union model, seed tuples proposed by either Snowball-VS or Snowball-SMT are added to the combined set, unless the locations that the two systems propose for the same organization do not match (based on our unique-key assumption, only one of these can be correct). In the Intersection model, only seed tuples proposed by both Snowball-VS and Snowball-SMT are added.
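Assuming each system's output has already been reduced to an organization-to-location mapping by its own thresholds, the two set-based strategies are a few lines each; the variable names and data shapes below are illustrative choices, not from the paper.

    def combine_union(vs, smt):
        # Union: accept a tuple proposed by either system, but drop organizations
        # for which the two systems propose conflicting locations (under the
        # unique-key assumption, at most one of the proposals can be correct).
        combined = {}
        for org in set(vs) | set(smt):
            locations = {d[org] for d in (vs, smt) if org in d}
            if len(locations) == 1:
                combined[org] = locations.pop()
        return combined

    def combine_intersection(vs, smt):
        # Intersection: keep only tuples proposed, identically, by both systems.
        return {org: loc for org, loc in vs.items() if smt.get(org) == loc}

    vs  = {"MICROSOFT": "REDMOND", "IBM": "ARMONK", "EXXON": "DALLAS"}
    smt = {"MICROSOFT": "REDMOND", "EXXON": "IRVING"}
    print(combine_union(vs, smt))          # keeps MICROSOFT and IBM, drops EXXON
    print(combine_intersection(vs, smt))   # {'MICROSOFT': 'REDMOND'}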
To implement the weighted Mixture model, a tuple T is added to the combined set if Conf(T) ≥ τ_min, where Conf(T) is calculated as the weighted sum of the confidence values that each system assigns to tuple T. The weights are based on the accuracy of each system over the training data. To calculate these weights, we use an implementation of a Bayesian update rule. We first calculate the absolute weight W'_VS of Snowball-VS as:

    W'_VS = Σ_{correct tuples T} log(Conf(T)) + Σ_{incorrect tuples T} log(1 - Conf(T))

The absolute weight W'_SMT of Snowball-SMT is calculated similarly. Then the relative weights W_VS and W_SMT are:

    W_VS = W'_VS / (W'_VS + W'_SMT),    W_SMT = W'_SMT / (W'_VS + W'_SMT)

Finally, the combined confidence in tuple T is defined as:

    Conf(T) = W_VS · Conf_VS(T) + W_SMT · Conf_SMT(T)

In our experiments, we compare the performance of Snowball-VS and Snowball-SMT as well as that of the three combining strategies.
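A sketch transcribing these formulas follows; the guard against log(0) is an added implementation detail that the text does not discuss.

    from math import log

    def absolute_weight(training_results):
        # W' = sum over correct tuples of log(Conf(T))
        #    + sum over incorrect tuples of log(1 - Conf(T)).
        # `training_results` is a list of (confidence, is_correct) pairs.
        eps = 1e-9   # guard against log(0)
        return sum(log(max(conf, eps)) if correct else log(max(1.0 - conf, eps))
                   for conf, correct in training_results)

    def combined_confidence(conf_vs, conf_smt, w_vs_abs, w_smt_abs):
        # Normalize the absolute weights, then form the weighted sum:
        # Conf(T) = W_VS * Conf_VS(T) + W_SMT * Conf_SMT(T).
        total = w_vs_abs + w_smt_abs
        return (w_vs_abs / total) * conf_vs + (w_smt_abs / total) * conf_smt

    w_vs  = absolute_weight([(0.9, True), (0.8, True), (0.7, False)])
    w_smt = absolute_weight([(0.95, True), (0.6, True)])
    print(combined_confidence(0.8, 0.9, w_vs, w_smt))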
4 Experimental Setting

The goal of Snowball is to extract as many valid tuples as possible from the text collection. We do not attempt to capture every instance of such tuples. Instead, we exploit the fact that these tuples will tend to appear multiple times in the types of collections that we consider. As long as we capture one instance of such a tuple, we will consider our system to be successful for that tuple.

Methodology  We adapt the recall and precision metrics from information retrieval to quantify how accurate and comprehensive our combined table of tuples is [16]. Our metric for evaluating the performance of an extraction system over a collection of documents D is based on determining Ideal, the set of all the known test tuples that appear in collection D. After identifying Ideal, we compare it against the tuples produced by the system, Extracted, using adapted precision and recall metrics [1]. To create the Ideal set automatically, we start by considering a large, publicly available directory of more than 13,000 organizations provided on the Hoover's Online web site (http://www.hoovers.com). To determine the target set of tuples Ideal from the Hoover's-compiled table above, we keep only the tuples that have the organization mentioned together with its location in the collection. We match possible variations of companies' names by using Whirl [5], a research tool developed at AT&T Research Laboratories for integrating similar textual information.

An alternative to using our Ideal metric to estimate precision could be to sample the extracted table, and check each value in the sample tuples by hand. (Similarly, we could estimate the recall of the system by sampling documents in the collection, and checking how many of the tuples mentioned in those documents the system discovers.) For completeness, we also report precision estimates using sampling in Section 5. Please refer to [1] for more details on our evaluation methodology.
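One simplified reading of these adapted metrics is sketched below; it leaves out the Whirl-based matching of name variants that the full methodology in [1] performs, and assumes both tables are keyed by organization.

    def ideal_metrics(ideal, extracted):
        # `ideal` and `extracted` map organization -> headquarters location.
        # A tuple counts as correct only if its location matches the Ideal entry.
        correct = sum(1 for org, loc in extracted.items() if ideal.get(org) == loc)
        precision = correct / len(extracted) if extracted else 0.0
        recall = correct / len(ideal) if ideal else 0.0
        return precision, recall

    ideal = {"MICROSOFT": "REDMOND", "IBM": "ARMONK", "BOEING": "SEATTLE"}
    extracted = {"MICROSOFT": "REDMOND", "IBM": "AUSTIN"}
    print(ideal_metrics(ideal, extracted))   # (0.5, 0.333...)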
Occurrences    Organization-Location Pairs
               Training Collection    Test Collection
0              5455                   4642
1              3787                   3411
2              2774                   2184
5              1321                   909
10             593                    389

Table 3: Occurrence statistics of the test tuples in the experiment collections.

System          Parameter    Value    Description
Snowball-VS     τ_sim        0.6      degree of match
                τ_t          0.9      seed confidence
                τ_sup        2        pattern support
                τ_final      0.3      tuple confidence
                window       2        length of left, right contexts
Snowball-SMT    τ_t          0.99     seed confidence
                τ_final      0.99     tuple confidence

Table 4: Parameter values used for evaluating Snowball-VS and Snowball-SMT on the test collection.

Document Collections  Our experiments used large collections of real newspaper articles from the North American News Text Corpus, available from LDC (http://www.ldc.upenn.edu). This corpus includes articles from the Los Angeles Times, The Wall Street Journal, and The New York Times for 1994 to 1997. We split the corpus into two collections: training and test. The training collection consists of 178,000 documents, all from 1996. The test collection is composed of 142,000 documents, from 1995 and 1997.

Both Snowball and DIPRE rely on tuples appearing multiple times in the document collection at hand. To analyze how redundant the training and test collections are, we report in Table 3 the number of tuples in the Ideal set for each frequency level. For example, 5455 organizations in the Ideal set are mentioned in the training collection, and 3787 of them are mentioned in the same line of text with their location at least once. So, if we wanted to evaluate how our system performs on extracting tuples that occur at least once in the training collection, the Ideal set that we create for this evaluation will contain 3787 tuples. The first row of Table 3, corresponding to zero occurrences, deserves further explanation. If we wanted to evaluate the performance of our system on all the organizations that were mentioned in the corpus, even if the appropriate location never occurred near the organization name anywhere in the collection, we would include all these organizations in our Ideal set. So, if the system attempts to guess the value of the location for such an organization, any value that the system extracts will automatically be considered wrong in our evaluation.

5 Experimental Results

In [1] we extensively examined the performance of Snowball-VS, together with an implementation of DIPRE. In this paper we compare the performance of Snowball-VS and Snowball-SMT, and explore the effect of combining the two into a single system.

In the training phase of our experiments, we empirically determined the best individual operating parameters for Snowball-VS and Snowball-SMT by running the systems on the training collection. We then evaluated the systems on the test collection using the parameters in Table 4. As we discussed, the only input to both Snowball systems during this evaluation on the test collection was the five seed tuples of Table 1. All the extraction patterns were learned from scratch by running each Snowball system on a previously unseen test collection using the operational parameters of Table 4.

Figure 4 shows the performance of the individual systems as they attempt to extract test tuples that are mentioned more times in the test collection. For example, Snowball-VS correctly extracts 85% of the tuples that occur at least three times in the collection, with a precision of 89%. Not surprisingly, Snowball-VS performs increasingly well as the number of times that the test tuples are required to be mentioned in the collection is increased. Also, notice that while DIPRE has better precision than Snowball-VS at the 0-occurrence level (72% vs. 69%), Snowball-VS has significantly higher recall than DIPRE at all occurrence levels. However, Snowball-SMT has the highest precision when we consider all tuples (75%), and its precision steadily increases for more frequently occurring tuples.
Figure 4: Recall (a) and precision (b) of DIPRE, Snowball-VS, and Snowball-SMT (test collection).

We explored the three combining strategies on the training collection and used the parameters in Table 5 to run the combined system on the test collection. The individual systems were run using the parameters in Table 4. The τ_seed parameter of Table 5 is the minimum confidence value for a tuple to be chosen as seed by each individual system, and the τ_final parameter is the threshold used to filter the final table.

                  Snowball-VS          Snowball-SMT         Combined
                  τ_seed    τ_final    τ_seed    τ_final    τ_seed    τ_final
(1) Intersection  0.9       0.3        0.99      0.99       -         -
(2) Union         0.9       0.3        0.99      0.99       -         -
(3) Mixture       -         -          -         -          0.97      0.6

Table 5: Parameter values used for generating new seed tuples by combining Snowball-VS and Snowball-SMT using the Intersection (1), Union (2), and Mixture (3) strategies.

As we can see in Figure 5, the simple combining strategies we explored do not help us discover new tuples, but can be used to improve the precision of the extracted table. This claim is further supported by randomly sampling the tables produced by Snowball-VS, Snowball-SMT, and those produced by using the Intersection, Union, and Mixture combining strategies. The samples were manually checked for accuracy of the discovered tuples, with results shown in Table 6. We classify the errors into three types (Location, Organization, and Relationship), where the former two are due to the errors of the named-entity tagger, and the latter is completely the extraction system's fault. As we can see, Snowball-SMT produces few incorrect tuples (24 out of a sample of 100), while Snowball-VS is less selective, producing 48 incorrect tuples out of a sample of 100. The incorrect tuples are due mainly to erroneously tagging phrases as organizations (41 out of the 48 incorrect tuples for Snowball-VS). If we were querying the table extracted by Snowball-VS by organization, we would expect to find the correct headquarters for the organization approximately 88% of the time (52 correct tuples / (52 + 6 incorrect locations + 1 incorrect relationship) · 100%). Observe that the Intersection strategy appears to produce the cleanest table overall (81 tuples out of 100 are correct).

Thus, if we want a high-recall system, we should run Snowball-VS. Alternatively, if we want to create a table of high-quality tuples, we should run Snowball-SMT. Finally, we could combine the two systems using the Intersection strategy to create a table with high precision that also approaches Snowball-VS's recall values for high-frequency tuples.

6 Conclusions and Future Work

This paper presents significant extensions of Snowball, a system for extracting relations from large collections of plain-text documents that requires minimal training for each new scenario. We compared two alternatives for representing text for our extraction task, and presented preliminary results on combining the systems.

We only evaluated our techniques on plain-text documents, and it would require future work to adapt our methodology to HTML data. While HTML tags can be naturally incorporated into Snowball's pattern representation, it is problematic to extract named-entity tags from arbitrary HTML documents. State-of-the-art taggers rely on clues from the text surrounding each entity, which may be absent in HTML documents that often rely on visual formatting to convey information.
Figure 5: Recall (a) and precision (b) of the combined system for the Intersection, Union, and Mixture strategies, against Snowball-VS (test collection).

                                        Type of Error
                Correct  Incorrect  Location  Organization  Relationship
Snowball-VS     52       48         6         41            1
Snowball-SMT    76       24         3         19            2
Union           49       51         6         42            3
Mixture         73       27         4         19            4
Intersection    81       19         4         14            1

Table 6: Manually computed precision estimate, derived from a random sample of 100 tuples from each extracted table. (The three error-type columns break down the incorrect tuples.)

In the context of processing HTML data, we plan to explore the question of combining complementary information as part of the Snowball system. In this paper, we only had two systems to combine, and sophisticated methods for combining predictors (e.g., [12, 2]) were not likely to make a significant impact. In addition to the two systems that operate on the text immediately surrounding the entities, we could have a third system that considers the links between documents, each containing one of the attributes in the relation. Having one or multiple systems operating over this additional information will allow us to compare and exploit the benefits of more sophisticated methods for combining predictors. More importantly, this might result in even higher quality extraction strategies.

We have assumed throughout that the attributes of the relation we extract (i.e., organization and location) correspond to named entities that our tagger can identify accurately. As we mentioned, named-entity taggers like Alembic can be extended to recognize entities that are distinct in a context-independent way (e.g., numbers, dates, proper names). For some other attributes, we will need to extend Snowball so that its pattern generation and matching can be anchored around, say, a noun phrase as opposed to a named entity as in this paper. In the future, we will also generalize Snowball to relations of more than two attributes. Finally, another open problem is how to extend our tuple and pattern evaluation strategy of Section 2.1 so that it does not rely on an attribute being a key for the relation.

Acknowledgements  This material is based upon work supported by the National Science Foundation under Grant No. IIS-9733880. We also thank Kazi Zaman and Nicolas Bruno for their helpful comments.

References

[1] E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. In Proceedings of the 5th ACM International Conference on Digital Libraries, June 2000. http://www.cs.columbia.edu/~eugene/papers/dl00.pdf.

[2] A. Blum. Empirical support for winnow and weighted-majority algorithms: Results on a calendar scheduling domain. Machine Learning, 1997.
[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 1998 Conference on Computational Learning Theory, 1998.

[4] S. Brin. Extracting patterns and relations from the World-Wide Web. In Proceedings of the 1998 International Workshop on the Web and Databases (WebDB'98), Mar. 1998.

[5] W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of the 1998 ACM International Conference on Management of Data (SIGMOD'98), 1998.

[6] M. Collins and Y. Singer. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.

[7] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 1999.

[8] D. Day, J. Aberdeen, L. Hirschman, R. Kozierok, P. Robinson, and M. Vilain. Mixed-initiative development of language processing systems. In Proceedings of the Fifth ACL Conference on Applied Natural Language Processing, Apr. 1997.

[9] E. Eskin, W. N. Grundy, and Y. Singer. Protein family classification using sparse Markov transducers. To appear in Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, August 2000.

[10] D. Fisher, S. Soderland, J. McCarthy, F. Feng, and W. Lehnert. Description of the UMass systems as used for MUC-6. In Proceedings of the 6th Message Understanding Conference, Columbia, MD, 1995.

[11] W. B. Frakes and R. Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.

[12] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 1995.

[13] R. Grishman. Information extraction: Techniques and challenges. In Information Extraction (International Summer School SCIE-97). Springer-Verlag, 1997.

[14] E. Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1044-1049, 1996.

[15] E. Riloff and R. Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.

[16] G. Salton. Automatic Text Processing: The transformation, analysis, and retrieval of information by computer. Addison-Wesley, 1989.

[17] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189-196, Cambridge, MA, 1995.

[18] J. Yi and N. Sundaresan. Mining the web for acronyms using the duality of patterns and relations. In Proceedings of the 1999 Workshop on Web Information and Data Management, 1999.
