terms that make up the contexts surrounding the entities. Each leaf node stores an estimate of the probability that, if the node was reached, the context that was used to generate the path to the node contains a valid organization-location tuple. Figure 2 shows a sparse Markov tree. For example, the node labeled 3 would be reached by following the terms making up the context <ORGANIZATION> based in <LOCATION>. A tree is used to obtain a probability for a tuple by following the context from the root node to a leaf node, skipping a token in the context for each φ along the path. The leaf node contains the tuple's probability of being an organization-location pair. The topology of a tree encodes the positions of the φ-terms in the probability distribution. Because we do not know the positions of the φ-terms for each context a priori, we do not know the best topology of the prediction tree to use. We approximate the best tree using a Bayesian mixture (weighted sum) technique. Instead of using a single tree, we use a weighted sum of all possible trees as our predictor. We then use a Bayesian update rule (described in Section 3) to update the weight of each tree based on its performance on a given element in the data set. At the end of this process, we have a weighted sum of trees in which the best-performing trees in the set of all trees have the highest weights. The sparse prediction tree is rebuilt from scratch from the set of positive and negative examples on each iteration of Snowball-SMT. An in-depth description of SMTs is given in [9].

Figure 2: An example sparse Markov tree.
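To make these two ideas concrete, the following is a minimal Python sketch of a fixed-topology sparse prediction tree and of a Bayesian mixture over candidate topologies. It is an illustration only, not the implementation used in our experiments: the names (SparseTree, BayesianMixture, PHI, max_phis) are invented for this sketch, contexts are simplified to fixed-length token tuples, and the variable-length φ-terms and efficient weight updates of real SMTs [9] are omitted.

    from itertools import combinations

    PHI = "<PHI>"  # wildcard edge label: matches (and skips) any one token

    class SparseTree:
        """A sparse prediction tree with a fixed topology: the token
        positions in `phi_positions` are wildcards, all others literals."""

        def __init__(self, phi_positions):
            self.phi_positions = set(phi_positions)
            self.leaves = {}  # conditioning path -> [valid, invalid] counts

        def _path(self, context):
            # The path to a leaf: literal tokens, with PHI at wildcard slots.
            return tuple(PHI if i in self.phi_positions else token
                         for i, token in enumerate(context))

        def train(self, context, is_valid):
            leaf = self.leaves.setdefault(self._path(context), [0, 0])
            leaf[0 if is_valid else 1] += 1

        def prob_valid(self, context):
            valid, invalid = self.leaves.get(self._path(context), (0, 0))
            # Laplace smoothing: an unseen leaf predicts probability 1/2.
            return (valid + 1) / (valid + invalid + 2)

    def all_topologies(context_len, max_phis):
        """Every way of placing up to `max_phis` wildcards in the context."""
        for k in range(max_phis + 1):
            for positions in combinations(range(context_len), k):
                yield SparseTree(positions)

    class BayesianMixture:
        """Weighted sum of all candidate trees; a Bayesian update rule
        raises the weights of trees that predict the observed labels."""

        def __init__(self, context_len, max_phis=2):
            self.trees = list(all_topologies(context_len, max_phis))
            self.weights = [1.0] * len(self.trees)

        def prob_valid(self, context):
            total = sum(self.weights)
            return sum(w * t.prob_valid(context)
                       for w, t in zip(self.weights, self.trees)) / total

        def update(self, context, is_valid):
            for i, tree in enumerate(self.trees):
                p = tree.prob_valid(context)
                # Scale each tree's weight by the probability it assigned
                # to the observed outcome, then let the tree train on it.
                self.weights[i] *= p if is_valid else (1.0 - p)
                tree.train(context, is_valid)

    # Example: learn that "<ORGANIZATION> based in <LOCATION>"-style
    # contexts tend to indicate valid tuples.
    mixture = BayesianMixture(context_len=2)
    mixture.update(("based", "in"), is_valid=True)
    mixture.update(("based", "near"), is_valid=True)
    mixture.update(("talks", "with"), is_valid=False)
    print(mixture.prob_valid(("based", "in")))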
3 Combining Snowball-VS and Snowball-SMT

The two systems that we used in our experiments, Snowball-VS and Snowball-SMT, focus on two different aspects of the textual context: the presence or absence of keywords that tend to indicate the correct relationships (Snowball-VS), and the order of words in the contexts surrounding the entities (Snowball-SMT). We explore how to combine the two systems with the goal of improving our overall extraction accuracy. Combining predictors to increase accuracy is an active area of research. Some of the methods we considered include sleeping-experts, boosting by majority [12], and co-training [3]. In this section, we explore preliminary ways in which we can combine Snowball-VS and Snowball-SMT. (We discuss this issue further in Section 6.)

Initially, both Snowball-VS and Snowball-SMT receive the same set of seed tuples (Figure 3). Each system runs for one iteration, producing the sets of tuples Seed_VS and Seed_SMT, respectively. These two sets of tuples are combined into one set Seed_Combined (we will describe how shortly). Seed_Combined is then used as the set of seed tuples for both Snowball-VS and Snowball-SMT, and both systems are run for another iteration. This process repeats until we stop discovering new tuples. The final step of the extraction process returns the set Seed_Combined, containing the combination of the final sets of tuples discovered by Snowball-VS and Snowball-SMT.

We explored three options for combining the tuples discovered by Snowball-VS and Snowball-SMT to create the new set of seed tuples: the Union, the Intersection, and the weighted Mixture of the tuples produced by the individual systems. The Intersection strategy was motivated by [3]. To implement the Union and Intersection strategies, the sets of tuples produced by Snowball-VS and Snowball-SMT are filtered using each system's minimum seed confidence (the seed parameter of Table 5) before being merged, as sketched below.
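The iterative process and the three merging strategies can be summarized in the following Python sketch. This is a schematic rendering under assumptions, not our actual implementation: the one_iteration method is a hypothetical interface, tuples are assumed to carry confidences in [0, 1], and the use of max, min, and a weighted average to resolve confidences under Union, Intersection, and Mixture is one plausible reading of the strategies.

    def combine(seed_vs, seed_smt, strategy,
                t_vs=0.9, t_smt=0.99, t_combined=0.97, alpha=0.5):
        """Merge the tuple sets from one iteration of each system. Inputs
        map an (organization, location) tuple to a confidence in [0, 1];
        the default thresholds are the seed parameters of Table 5."""
        # Filter each system's tuples by its own seed threshold first.
        vs = {t: c for t, c in seed_vs.items() if c >= t_vs}
        smt = {t: c for t, c in seed_smt.items() if c >= t_smt}
        if strategy == "union":
            return {t: max(vs.get(t, 0.0), smt.get(t, 0.0))
                    for t in vs.keys() | smt.keys()}
        if strategy == "intersection":
            return {t: min(vs[t], smt[t]) for t in vs.keys() & smt.keys()}
        # "mixture": a weighted average of the raw confidences, kept as a
        # seed only if it clears the combined threshold.
        mixed = {t: alpha * seed_vs.get(t, 0.0)
                    + (1 - alpha) * seed_smt.get(t, 0.0)
                 for t in seed_vs.keys() | seed_smt.keys()}
        return {t: c for t, c in mixed.items() if c >= t_combined}

    def run_combined(snowball_vs, snowball_smt, initial_seeds, strategy):
        """Run both systems from a shared seed set until no new tuples."""
        seed_combined = dict(initial_seeds)
        while True:
            seed_vs = snowball_vs.one_iteration(seed_combined)    # hypothetical API
            seed_smt = snowball_smt.one_iteration(seed_combined)  # hypothetical API
            merged = combine(seed_vs, seed_smt, strategy)
            if not merged.keys() - seed_combined.keys():
                return merged  # nothing new was discovered; stop
            seed_combined = merged

A final pass would then filter the returned table using the final thresholds of Table 5.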
Figure 3: The combined extraction process. On each iteration, each system tags entities, finds occurrences of the current seed tuples, and generates new seed tuples; the combining algorithm merges the outputs of Snowball-VS and Snowball-SMT into the shared seed set.
        Snowball-VS     Snowball-SMT    Combined
        seed   final    seed   final    seed   final
  (1)   0.9    0.3      0.99   0.99     -      -
  (2)   0.9    0.3      0.99   0.99     -      -
  (3)   -      -        -      -        0.97   0.6

Table 5: Parameter values used for generating new seed tuples by combining Snowball-VS and Snowball-SMT using the Intersection (1), Union (2), and Mixture (3) strategies.

We explored the three combining strategies on the training collection and used the parameters in Table 5 to run the combined system on the test collection. The individual systems were run using the parameters in Table 4. The seed parameter of Table 5 is the minimum confidence value for a tuple to be chosen as a seed by each individual system, and the final parameter is the threshold used to filter the final table.

As we can see in Figure 5, the simple combining strategies we explored do not help us discover new tuples, but they can be used to improve the precision of the extracted table. This claim is further supported by randomly sampling the tables produced by Snowball-VS, Snowball-SMT, and those produced by using the Intersection, Union, and Mixture combining strategies. The samples were manually checked for accuracy of the discovered tuples, with results shown in Table 6. We classify the errors into three types (Location, Organization, and Relationship), where the former two are due to errors of the named-entity tagger, and the latter is entirely the extraction system's fault. As we can see, Snowball-SMT produces few incorrect tuples (24 out of a sample of 100), while Snowball-VS is less selective, producing 48 incorrect tuples out of a sample of 100. The incorrect tuples are due mainly to erroneously tagging phrases as organizations (41 out of the 48 incorrect tuples for Snowball-VS). If we were querying the table extracted by Snowball-VS by organization, we would expect to find the correct headquarters for the organization approximately 88% of the time (52 correct tuples / (52 + 6 incorrect locations + 1 incorrect relationship) ≈ 88%). Observe that the Intersection strategy appears to produce the cleanest table overall (81 of the 100 sampled tuples are correct).
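The arithmetic behind the 88% estimate, as a quick check: a tuple whose organization was mis-tagged never matches a well-formed query by organization, so only the location and relationship errors from the sample count against query-time precision.

    # Query-time precision of the Snowball-VS table for queries by
    # organization (counts from the 100-tuple sample in Table 6).
    correct, bad_location, bad_relationship = 52, 6, 1
    precision = correct / (correct + bad_location + bad_relationship)
    print(f"{precision:.0%}")  # -> 88%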
Thus, if we want a high-recall system, we should run Snowball-VS. Alternatively, if we want to create a table of high-quality tuples, we should run Snowball-SMT. Finally, we could combine the two systems using the Intersection strategy to create a table with high precision that also approaches Snowball-VS's recall values for high-frequency tuples.

6 Conclusions and Future Work

This paper presents significant extensions of Snowball, a system for extracting relations from large collections of plain-text documents that requires minimal training for each new scenario. We compared two alternatives for representing text for our extraction task, and presented preliminary results on combining the systems.

We have only evaluated our techniques on plain-text documents, and adapting our methodology to HTML data will require future work. While HTML tags can be naturally incorporated into Snowball's pattern representation, it is problematic to extract named-entity tags from arbitrary HTML documents. State-of-the-art taggers rely on clues from the text surrounding each entity, which may be absent in HTML documents that often rely on visual formatting to convey information.
Figure 5: Recall (a) and precision (b) of the combined system for the Intersection, Union, and Mixture strategies, against Snowball-VS (test collection).
                                         Type of Error
                Correct  Incorrect   Location  Organization  Relationship
  Snowball-VS     52        48          6          41             1
  Snowball-SMT    76        24          3          19             2
  Union           49        51          6          42             3
  Mixture         73        27          4          19             4
  Intersection    81        19          4          14             1

Table 6: Manually computed precision estimate, derived from a random sample of 100 tuples from each extracted table.
In the context of processing HTML data, we plan to explore the question of combining complementary information as part of the Snowball system. In this paper, we only had two systems to combine, and sophisticated methods for combining predictors (e.g., [12, 2]) were not likely to make a significant impact. In addition to the two systems that operate on the text immediately surrounding the entities, we could have a third system that considers the links between documents, each containing one of the attributes in the relation. Having one or multiple systems operating over this additional information will allow us to compare and exploit the benefits of more sophisticated methods for combining predictors. More importantly, this might result in even higher quality extraction strategies.

We have assumed throughout that the attributes of the relation we extract (i.e., organization and location) correspond to named entities that our tagger can identify accurately. As we mentioned, named-entity taggers like Alembic can be extended to recognize entities that are distinct in a context-independent way (e.g., numbers, dates, proper names). For some other attributes, we will need to extend Snowball so that its pattern generation and matching could be anchored around, say, a noun phrase as opposed to a named entity as in this paper. In the future, we will also generalize Snowball to relations of more than two attributes. Finally, another open problem is how to extend our tuple and pattern evaluation strategy of Section 2.1 so that it does not rely on an attribute being a key for the relation.

Acknowledgements  This material is based upon work supported by the National Science Foundation under Grant No. IIS-9733880. We also thank Kazi Zaman and Nicolas Bruno for their helpful comments.

References

[1] E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. In Proceedings of the 5th ACM International Conference on Digital Libraries, June 2000. http://www.cs.columbia.edu/eugene/papers/dl00.pdf.

[2] A. Blum. Empirical support for winnow and weighted-majority algorithms: Results on a calendar scheduling domain. Machine Learning, 1997.

[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 1998 Conference on Computational Learning Theory, 1998.

[4] S. Brin. Extracting patterns and relations from the World-Wide Web. In Proceedings of the 1998 International Workshop on the Web and Databases (WebDB'98), Mar. 1998.

[5] W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of the 1998 ACM International Conference on Management of Data (SIGMOD'98), 1998.

[6] M. Collins and Y. Singer. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.

[7] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 1999.

[8] D. Day, J. Aberdeen, L. Hirschman, R. Kozierok, P. Robinson, and M. Vilain. Mixed-initiative development of language processing systems. In Proceedings of the Fifth ACL Conference on Applied Natural Language Processing, Apr. 1997.

[12] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 1995.

[13] R. Grishman. Information extraction: Techniques and challenges. In Information Extraction (International Summer School SCIE-97). Springer-Verlag, 1997.

[14] E. Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1044-1049, 1996.

[15] E. Riloff and R. Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.

[16] G. Salton. Automatic Text Processing: The transformation, analysis, and retrieval of information by computer. Addison-Wesley, 1989.

[17] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189-196, Cambridge, MA, 1995.

[18] J. Yi and N. Sundaresan. Mining the web for acronyms using the duality of patterns and relations. In Proceedings of the 1999 Workshop on Web Information and Data Management, 1999.