
Combining Content and Collaboration in Text Filtering

Ian M. Soboroff
Department of Computer Science and Electrical Engineering

University of Maryland, Baltimore County

[email protected]

Abstract

We describe a technique for combining collaborative input and document content for text filtering. This technique uses latent semantic indexing to create a collaborative view of a collection of user profiles. The profiles themselves are term vectors constructed from documents deemed relevant to the user's information need. In initial experiments with a standard text collection, this approach performs quite favorably compared to other content-based approaches. In a larger collection with less possibility for collaboration, the technique does not perform as well.

1 Introduction

Filtering is a process of comparing an incoming document stream to a profile of a user's interests and recommending the documents according to that profile [Belkin and Croft, 1992]. A simple approach for filtering textual content might be to look at each document's similarity to an average of known relevant documents.

Collaborative filtering takes into account the similarities and differences among the profiles of several users in determining how to recommend a document. Typically, collaborative filtering is done by correlating users' ratings of documents. In this approach, a document is recommended to a user because it is highly rated by some other user with whom he or she tends to agree. This can also work for negative ratings; an article may not be recommended because some other "colleague" didn't like it. Examples of such collaborative systems are GroupLens [Konstan et al., 1997] and Ringo [Shardanand and Maes, 1995]. In these "pure" collaborative environments, a document's content is never examined, only its ratings. Such systems have the advantage that they can recommend any kind of content for which one can obtain ratings; however, if a document is unrated, it is difficult to recommend it.

Recently, some exploration has been made into content-based collaborative filtering. In such a system, document content is used in making collaborative decisions. An example of a content-based collaborative filtering environment is Fab [Balabanovic and Shoham, 1997]. In Fab, relevance feedback is used to simultaneously mold a personal filter as well as a communal "topic" filter. Documents are discovered and initially ranked by the topic filter according to conformance to their topic, and then sent to users' personal filters. A user then provides relevance feedback for that document, which is used to modify both the personal filter (what the user wants), and the originating topic filter (what matches the topic).

In this paper, we describe a new technique combining content-based and collaborative filtering. This approach compares user profiles and documents in a unified model, which derives relationships between users' interests. The technique uses latent semantic indexing to rearrange a collection of user profiles, so that their commonalities are exploited in the filtering task itself. Candidate documents are routed based on their similarities to the profiles in the LSI space, rather than against the original profiles. We explore the effectiveness of the technique in a batch filtering scenario using a small, standard information retrieval test collection. Finally, we present some results with the TREC collection, which do not illustrate any advantage for this technique, chiefly due to a lack of collaborative potential in the collection.

1.1 Latent Semantic Indexing

Latent semantic indexing, or LSI, is an enhancement to the familiar vector-space model of information retrieval [Deerwester et al., 1990]. Typically, authors will use many words to describe the same idea, and those words will appear in only a few contexts. LSI attempts to highlight these patterns of how words are used within a document collection. By grouping together the word co-occurrence patterns that characterize groups of documents, the "latent semantics" of the collection terms is described. Themes in the document collection arise from subsets of documents with similar word co-occurrences.
Specifically, each document is represented by a vector of terms, whose values are weights related to their importance or frequency of occurrence. The collection of documents, called the term-document matrix, is decomposed using the singular value decomposition

    M = T Σ D′

The columns of T and D are orthonormal, and are called the left and right singular vectors. Σ is a diagonal matrix containing the singular values σᵢ, ordered by size. If M is t × d and of rank r, T is a t × r matrix, D is d × r, and Σ is r × r.

The SVD projects the documents in the collection into an r-dimensional space, in contrast to their t-dimensional representation in the term-document matrix. This LSI space is described by the columns of T, and it is useful to think of T Σ⁻¹ as a projection for document vectors into the LSI space. In particular, multiplying document vector i from the original term-document matrix by T Σ⁻¹ yields the ith column of D, the document's representation in the LSI space. We can project any document vector into the LSI space in this way, and compare documents by taking the dot product of their LSI representations.

An important feature of the SVD is that the singular values give an indication of the relative importance of the dimensions, and one can choose how many dimensions to retain by eliminating low-valued dimensions. If all dimensions are kept, then document similarities are the same as they were using the original term-document matrix. If one keeps k dimensions, then the matrix product

    M_k = T_k Σ_k D_k′

is the closest rank-k approximation to the original term-document matrix M [Berry et al., 1995]. Document comparisons in this truncated LSI space may be more effective because low-impact term relationships are ignored. Choosing the best value of k is an open problem, and in this work we tried several values to find the best performance.
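To make the mechanics concrete, the following sketch shows a standard truncated SVD in numpy: decomposing a term-document matrix, folding a new document vector into the k-dimensional LSI space via T_k Σ_k⁻¹, and comparing documents by dot product. This is our own minimal illustration, not code from the paper; the toy matrix, the choice of k, and the fold_in helper are assumptions made for the example.

```python
import numpy as np

# Toy term-document matrix M (t terms x d documents), e.g. tf or tf-idf weights.
M = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

# Full SVD: M = T @ diag(sigma) @ D.T, with orthonormal columns in T and D.
T, sigma, Dt = np.linalg.svd(M, full_matrices=False)

# Keep only the k largest singular values (rank-k approximation M_k = T_k S_k D_k').
k = 2
T_k, sigma_k, D_k = T[:, :k], sigma[:k], Dt[:k, :].T

def fold_in(doc_vector):
    """Project a t-dimensional document vector into the k-dimensional LSI space
    by multiplying with T_k and the inverse of the retained singular values."""
    return doc_vector @ T_k @ np.diag(1.0 / sigma_k)

# A new document, expressed over the same term vocabulary.
new_doc = np.array([1.0, 0.0, 2.0, 0.0])
new_doc_lsi = fold_in(new_doc)

# Compare the new document to each training document by dot product in LSI space.
scores = D_k @ new_doc_lsi
print(scores)
```

In the filtering setting described below, the same fold-in and dot product are used, except the new document is compared against profile vectors rather than other documents.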
1.2 LSI for Content Filtering

Latent semantic indexing was applied to filtering as well as ad-hoc retrieval in TREC-3 using a technique similar to ours. The LSI space was first computed from a collection of documents, and then profiles were constructed as centroids of document representations from the LSI space [Dumais, 1995].

Hull, in looking at LSI for use in filtering and routing applications, computed the LSI from a set of documents known to be relevant to the filters [Hull, 1994]. In later experiments [Schutze et al., 1995], a "local LSI" was built from documents similar to a given query. The key insight here is that the LSI projection can be trained from any arbitrary set of document vectors, as long as the vectors we use adequately cover the set of terms we expect to occur in future documents. As Hull observed, the LSI can describe the occurrences of terms across only the documents which are used in the SVD computation. Thus, one should apply the SVD to a document collection that represents the kind of term distributions relevant to the task.

1.3 Combining Content and Collaborative Information

As we mentioned above, collaborative filtering recommends new documents based on correlations of users' ratings of past documents. In general, we can envision a matrix of documents by users, where each user (column) has a set of ratings for some of the documents (rows); this comprises her profile. The goal of a collaborative filtering engine is to fill in the blanks in this matrix with predicted ratings. This is done by computing a correlation coefficient between each user, and predicting a rating of a document to be the weighted sum of other users' ratings of that document by their coefficients [Shardanand and Maes, 1995]. There are other methods for performing this prediction, but this is by far the most common.
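As an illustration of that prediction rule, the sketch below computes correlations between a target user and the other users, then predicts an unseen rating as the correlation-weighted sum of their ratings. It is our own minimal rendering of the common scheme cited above, not the paper's code; the toy matrix and the Pearson-correlation choice are assumptions.

```python
import numpy as np

# Toy ratings matrix: rows = documents, columns = users; 0 marks "not rated".
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 0.0, 5.0, 1.0],
    [1.0, 2.0, 1.0, 5.0],
    [0.0, 1.0, 2.0, 4.0],
])

def predict(ratings, user, doc):
    """Predict ratings[doc, user] as a correlation-weighted sum of the
    other users' ratings of that document."""
    target = ratings[:, user]
    total, norm = 0.0, 0.0
    for other in range(ratings.shape[1]):
        if other == user or ratings[doc, other] == 0.0:
            continue
        # Correlate over documents that both users have rated.
        both = (ratings[:, user] > 0) & (ratings[:, other] > 0)
        if both.sum() < 2:
            continue
        r = np.corrcoef(target[both], ratings[both, other])[0, 1]
        if np.isnan(r):
            continue
        total += r * ratings[doc, other]
        norm += abs(r)
    return total / norm if norm > 0 else 0.0

print(predict(R, user=0, doc=3))   # estimate user 0's missing rating of document 3
```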
In content filtering, the user profile is often constructed from the content of past relevant documents using Rocchio expansion. This may simply yield a centroid of the relevant documents, but more sophisticated techniques are often applied to further refine the profile [Schapire et al., 1998]. These content profiles are closely related to the ratings matrix described above. We can construct a matrix from the ratings matrix such that an entry contains a 1 if the user's rating for that document exceeds some minimum threshold. Each column is then divided by the number of documents exceeding the threshold for that user. Multiplying this profile-construction matrix by the term-document matrix for the document collection produces a matrix of content profiles.

Both of these matrices, the ratings matrix and the content profile matrix, model the universe of user interests. One considers document objects explicitly and separately, while the other pools the document contents to give "ratings" of actual content terms. Neither representation is necessarily collaborative in any way. With the ratings matrix, collaboration occurs when the correlations are used to predict new ratings. For collaboration within the content profile matrix, we use a latent semantic index of the profiles.

Our approach differs from the LSI content filtering approaches in that the LSI space is computed from the collection of profiles, rather than a collection of documents. This means that commonalities between profiles guide the construction of the LSI space. Otherwise, only overlap among documents is able to affect the SVD. Projecting new documents into this space allows us to better view the documents' relation to the group of profiles. Furthermore, collaboration occurs as the LSI describes the relationships of important concepts across profiles.
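A minimal sketch of this pipeline, under our own reading of the description above (the array shapes, the rating threshold, and all variable names are assumptions for illustration): build the content-profile matrix from a thresholded ratings matrix, compute the SVD of the profile collection rather than the document collection, and score a candidate document against every profile in that space.

```python
import numpy as np

terms, docs, users = 6, 5, 3
rng = np.random.default_rng(0)

# Term-document matrix (terms x docs) and ratings matrix (docs x users).
M = rng.random((terms, docs))
ratings = rng.integers(0, 6, size=(docs, users)).astype(float)

# Profile-construction matrix: 1 where a rating exceeds the threshold,
# each column normalized by that user's number of selected documents.
threshold = 3
P = (ratings > threshold).astype(float)
counts = np.maximum(P.sum(axis=0), 1.0)
P /= counts

# Content profiles: one term vector per user (terms x users).
profiles = M @ P

# Collaborative LSI: SVD of the *profile* collection, truncated to k dimensions.
T, sigma, Dt = np.linalg.svd(profiles, full_matrices=False)
k = 2
T_k, sigma_k = T[:, :k], sigma[:k]
profiles_lsi = Dt[:k, :].T            # users x k

# Route a candidate document: fold it into the profile LSI space and
# score it against every profile by dot product.
candidate = rng.random(terms)
candidate_lsi = candidate @ T_k @ np.diag(1.0 / sigma_k)
scores = profiles_lsi @ candidate_lsi
print(scores)                          # higher score = route to that user's filter
```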
2 Initial Experiment: Cranfield

We examined our collaborative LSI filtering technique in comparison to two content-based approaches. The first compared incoming documents to the profile centroids using the ordinary term-document matrix. The second was similar to Dumais' approach as described above: an LSI was initially computed from the full document set, and profiles were centroids of some relevant documents within the LSI space. Incoming documents were cast into the LSI space and compared to the profiles.

2.1 Test Collection

Our experiments used the Cranfield corpus, a small test collection of 1400 documents and 225 scored queries. These queries have on average eight relevant documents, and each document is relevant to one or two queries. In order to have a reasonably-sized set of documents for each query, we used a subset of 26 queries which each have 15 or more relevant documents. In this subset, queries had 19 relevant documents on average, with a maximum of 40. The documents in the Cranfield collection are technical scientific abstracts, and are all quite short, between 100 and 4200 bytes in length.

2.2 Document Representation

The documents were indexed using the SMART system (ftp://ftp.cs.cornell.edu/pub/smart), which performs stemming of terms and elimination of stop words. The terms were weighted using log-tfidf weighting. All documents were length-normalized for the experiments. Our version of SMART has been modified by the addition of procedures for gathering documents into profiles and computing SVDs of document collections.
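As a rough illustration of that representation (our own sketch, assuming a generic log-tf·idf variant with cosine length normalization; SMART's actual weighting options, stemmer, and stop list are not reproduced here):

```python
import math
from collections import Counter

docs = [
    "boundary layer flow over a flat plate",
    "heat transfer in laminar boundary layer flow",
    "supersonic flow past a cone",
]

tokenized = [d.split() for d in docs]          # stand-in for SMART's stemming/stopping
vocab = sorted({t for doc in tokenized for t in doc})
df = Counter(t for doc in tokenized for t in set(doc))
N = len(tokenized)

def log_tf_idf(doc_tokens):
    """Weight each term by (1 + log tf) * idf, then length-normalize the vector."""
    tf = Counter(doc_tokens)
    vec = [(1.0 + math.log(tf[t])) * math.log(N / df[t]) if t in tf else 0.0
           for t in vocab]
    norm = math.sqrt(sum(w * w for w in vec)) or 1.0
    return [w / norm for w in vec]

vectors = [log_tf_idf(d) for d in tokenized]
```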
2.3 Filter and Test Set Construction

For each query, a training subset and a test subset were derived from the set of relevant documents for that query. This was done by randomly selecting 70% of the relevant documents for use in training, and saving the remaining 30% for testing.

The vectors of the documents in each query's training set were averaged to produce a centroid, which represented the user's profile or filter. The test documents for all queries were pooled to produce a single set of documents against which to test filtering performance. In other words, the test documents for the experiment are each relevant to at least one query. No effort was made to control or induce overlap between queries in the training sets. Two groups of random test and training data were constructed in this way. Each group was then used to generate a training matrix (terms by filters) and a test matrix (terms by documents).

2.4 Experimental Tasks

Three different tasks were performed with each test and training set. In the first task, each document in the test collection was ranked against each filter centroid, by taking the dot product between the document and filter vectors. This we call the "content" result, wherein no processing is made on the filters as a group; this is similar to the SIFT system [Yan and Garcia-Molina, 1995].

In the second task, an SVD was applied to the Cranfield term-document matrix. Both the filter centroids and the test documents were cast into this LSI space as described above. The documents were compared to the filters in the LSI space by taking the dot product between them. This we call the "content LSI" result, and is similar to the technique used in [Dumais, 1995].

In the third task, an SVD was applied to the training matrix, and then each document in the test set was cast into the LSI space and ranked against each filter. We call this the "collaborative LSI" result, as here we expected to see the effects of collaboration, as the SVD re-orients the document space along dimensions that highlight common features among the user filters.

As explained above, increasing performance with the SVD requires choosing a dimensionality k smaller than the full rank of the matrix, which will be less than or equal to the smaller of the number of terms and the number of documents. For our experiments, we ran task two using 25, 50, 100, 200, and 500 dimensions out of a possible 1398; and task three with 8, 15, and 18 dimensions out of a possible 26.
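A compact sketch of the first ("content") task, using made-up arrays rather than the actual Cranfield data: build each filter as the centroid of its training documents and rank the pooled test documents by dot product. Tasks two and three follow the same pattern, but compare the vectors after folding them into an LSI space computed from the document collection or from the training (filter) matrix, respectively.

```python
import numpy as np

rng = np.random.default_rng(1)
terms = 50

# Hypothetical per-query training documents (terms x n_docs each) and pooled test docs.
train_docs = {q: rng.random((terms, rng.integers(10, 16))) for q in range(5)}
test_docs = rng.random((terms, 20))

# Each filter is the centroid of that query's training documents.
filters = np.column_stack([d.mean(axis=1) for d in train_docs.values()])  # terms x filters

# "Content" task: rank every test document against every filter by dot product.
scores = filters.T @ test_docs            # filters x test documents
ranking_for_filter_0 = np.argsort(-scores[0])
print(ranking_for_filter_0[:5])           # top-ranked test documents for filter 0
```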
3 Results

Using our modified SMART system, the experimental runs yielded a set of ranked scores for each document against each filter. These rankings were used to calculate average precision over fixed levels of recall.

In summary, the collaborative LSI technique performed better than either the content LSI or baseline content approaches. The choice of k was crucial; the best k-value for collaborative LSI allowed it to achieve higher precision than any attempted k-value for content LSI. For most values of k, content LSI did not perform appreciably better than using no LSI at all. Moreover, the best value of k for collaborative LSI was much lower than that needed by content LSI to get good results, meaning that similarities are faster to compute for collaborative LSI.

Figure 1 shows the results of the best runs in each task, chosen by highest overall average precision.
Table 1: Overall average precision for each task. This average is over three recall points (0.20, 0.50, 0.80).

Task                  k-value   Average Precision
                                Set 1    Set 2
Content (log-tfidf)   -         0.2894   0.2705
Content LSI           25        0.2656   0.1980
                      50        0.3136   0.2686
                      100       0.3251   0.3053
                      200       0.3314   0.3144
                      500       0.3302   0.3149
Collaborative LSI     8         0.3136   0.2583
                      15        0.4151   0.3745
                      18        0.3600   0.3615

The overall average precision for all runs in both data sets is shown in Table 1. The figures show precision-recall graphs for the first data set.

Figure 2 shows precision-recall graphs for the content LSI task, compared to the baseline content performance. There isn't a large gain to be made from using LSI here, and in some cases the content LSI approach even performs worse than no LSI at all. In contrast, Figure 3 shows precision and recall for the collaborative LSI task, which show a marked gain in performance.

4 Experiments with TREC

The next step was to apply the technique in a larger collection. The TREC-8 routing task was an excellent opportunity for this. Routing at TREC is part of the Filtering track, which for the past two years has used a three-year span of documents from the Financial Times for evaluating filtering techniques.

In the routing task at TREC-8, participants were required to route documents from the 1993-4 Financial Times collection among fifty past TREC topics. These topics have some relevance judgments available in other parts of the TREC collection, including in the Financial Times from 1991-2, and systems were allowed to train their profiles using this information (but none from the test collection).

It's not clear that any collaboration exists among the topics in TREC, since the topics are not necessarily designed to overlap, either in information interest or in actual relevant document sets. However, several topics this year were closely related, from a reading of the topic descriptions. One group might be "clothing sweatshops" (361) and "human smuggling" (362). Another is "hydrogen energy" (375), "hydrogen fuel automobiles" (382), and "hybrid fuel cars" (385). There are also groups of topics dealing with medical disorders and their treatment, pharmaceuticals, and special education.

4.1 Profile Construction

This data set is much larger than the Cranfield collection; the test collection comprises some 140,000 documents. This also meant that the amount of available training data was much less compared to Cranfield. To attempt to overcome this, we employed a more sophisticated profile construction technique which included better term weighting and profile normalization, and unsupervised learning for some training examples.

To build our profiles, we used a technique similar to that used by the AT&T group in TREC-6 [Singhal, 1997] and TREC-7 [Singhal et al., 1998]. First, a training collection was constructed from the FBIS, Los Angeles Times, and Financial Times documents from 1992. We gathered collection statistics here for all future IDF weights. The training documents were weighted with log-tfidf, and normalized using the pivoted unique-term document normalization [Singhal et al., 1996].

We then built the routing queries using query zoning. An initial query was made from the short topic description, and the top 1000 documents were retrieved from the training collection. The results from this retrieval were used to build a Rocchio feedback query with:

- The initial short-description query (weight 3)
- All documents known to be relevant to the query in the training collection (weight 2)
- Retrieved documents 501-1000, assumed to be nonrelevant (weight 2)
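A schematic of this query-zoning step follows. It is our own sketch, not the AT&T or SMART implementation: the Rocchio weights are taken from the list above, but the retrieval, the clipping of negative weights, and the stand-in for the judged-relevant documents are simplifying assumptions.

```python
import numpy as np

def rocchio_query(initial_query, relevant_docs, nonrelevant_docs,
                  w_query=3.0, w_rel=2.0, w_nonrel=2.0):
    """Rocchio feedback: weighted initial query plus the mean of known-relevant
    vectors minus the mean of assumed-nonrelevant vectors (clipped at zero)."""
    q = w_query * initial_query
    if len(relevant_docs):
        q = q + w_rel * np.mean(relevant_docs, axis=0)
    if len(nonrelevant_docs):
        q = q - w_nonrel * np.mean(nonrelevant_docs, axis=0)
    return np.maximum(q, 0.0)          # negative term weights are usually dropped

# Query zoning: retrieve the top 1000 training documents for the short topic
# description; documents ranked 501-1000 are treated as the nonrelevant zone.
rng = np.random.default_rng(2)
terms = 100
topic_query = rng.random(terms)
training_docs = rng.random((2000, terms))

ranked = np.argsort(-(training_docs @ topic_query))
known_relevant = training_docs[ranked[:20]]          # stand-in for judged-relevant docs
assumed_nonrelevant = training_docs[ranked[500:1000]]

profile = rocchio_query(topic_query, known_relevant, assumed_nonrelevant)
```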
4.2 Results

We submitted two runs, lists of the 1000 most similar documents for each topic, to be judged. One, umrqz, routed the test documents using the exact profiles described above. The second, umrlsi, first computed the SVD of the collection of profiles, as we did for Cranfield, and routed the test documents in the resulting LSI space.

To choose a dimensionality for the LSI run, we evaluated the routing performance on the training collection. Disappointingly, we did not find that any choice of k gave any benefit to LSI. Arbitrarily, we chose to use 45 out of the possible 50 dimensions.

Overall, both runs performed quite well, with umrqz above the median of all submitted runs for 27 queries, and umrlsi for 23. For five queries, we produced the best performance, and for four of those, the LSI gave the maximum score.

For the majority of queries, however, there was only a very small difference in performance, if any, between the two runs. We take this to indicate that good overall performance is mostly due to the routing query construction, which uses a combination of approaches shown to work well in previous TRECs.
[Figure 1 plot: average precision vs. recall for the content log-tfidf run, content LSI with k = 200 and 500, and collaborative LSI with k = 15 and 18.]

Figure 1: Precision and recall for the top two values of k for content LSI and collaborative LSI, and the content-only task.

Figure 2: Precision and recall among content LSI runs (k = 25, 50, 100, 200, and 500) in the first data group.
Figure 3: Precision and recall among collaborative LSI runs (k = 8, 15, and 18) in the first data group.

Figure 4 shows the difference in average precision from the mean score for each topic, illustrating the similarity of the results.

This was somewhat expected: since the topics are mostly different, with little opportunity for overlap, the LSI should have been unable to help most queries. However, for the example candidate topic "clusters" described above, the difference in average precision from using LSI was negligible.

For 18 queries where the difference in average precision between the non-LSI and LSI routing was more than 0.009, in 11 cases the difference was quite small relative to the whole span of scores. In the other seven, the difference was more marked, and in all but one (381) against LSI. For one query (360), LSI gave the minimum performance and the non-rotated query gave the maximum.

Furthermore, in the twenty topics where average precision in the umrlsi run was high (> 0.5), precision without LSI was either the same or slightly higher. In eight topics, the LSI average precision was less than 60% of that achieved without LSI. These topics have a fair range of relevant document set sizes, and in only one of these topics was performance across all systems poor. One topic in this group was 375, "hydrogen energy", and three were drug-related (drug legalization, food/drug laws, mental illness drugs). It may be that the drug-related topics contained a lot of shared terms, but this caused LSI to bring out a lot of false friends.

To illustrate this, we looked at the distribution of relevant documents among topics, to see if topics indeed share relevant documents. Figure 5 shows, for each run and for the relevance judgments, how many (predicted) relevant topics were given for a document. The "qrels" bars show the actual relevance judgments; one can see that the lion's share of documents are relevant to only one topic; less than sixty documents are known to be relevant to more than one topic. If a pure collaborative algorithm were used to predict relevance for these topics, and these relevance judgments were sampled for training data, it would fail miserably because the matrix would be too sparse. The probability of any useful quantity of overlap occurring is very small.

The two charts differ in the method for predicting which documents in the umrqz and umrlsi runs are actually relevant. A routing run contains the highest-scored 1000 documents for each topic, but clearly the system does not expect that all 1000 documents are relevant. Thus, we only predict as relevant some of the documents in each run. The first picks the top 15 ranked documents; 15 is the median number of relevant documents per topic in the actual relevance judgments. The second picks the top 50.

We can see that our runs tend to spread documents across more topics than are actually relevant. Within the top 15, the qz run distribution is similar to the qrels, and the lsi run gives slightly more overlap. At 50 documents per topic the difference is much greater; however, for documents that are shared among only two or three topics, the runs are close to each other in overlap.
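A small sketch of how such a histogram can be computed (our own illustration; the run format and the toy document ids are assumptions, not taken from the paper): treat the top-N documents of each topic's ranking as predicted relevant, count how many topics each document is predicted relevant for, and tally those counts.

```python
from collections import Counter

def topics_per_document(run, top_n):
    """run maps topic -> ranked list of document ids.
    Returns a histogram: number of topics -> number of documents that are
    predicted relevant for exactly that many topics."""
    doc_topic_counts = Counter()
    for topic, ranked_docs in run.items():
        for doc_id in ranked_docs[:top_n]:
            doc_topic_counts[doc_id] += 1
    return Counter(doc_topic_counts.values())

# Tiny made-up run with three topics.
run = {
    "351": ["FT931-1", "FT931-2", "FT931-3"],
    "352": ["FT931-2", "FT931-4", "FT931-5"],
    "353": ["FT931-2", "FT931-1", "FT931-6"],
}
print(topics_per_document(run, top_n=2))   # -> Counter({1: 1, 2: 1, 3: 1})
```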
Figure 4: Difference in average precision from mean average precision for each topic (qz and lsi runs, topics 351-400). Note that there is very little difference in performance from using LSI.

Figure 5: Histograms showing how many topics are relevant to each document, as dictated by the TREC-8 Filtering relevance judgments, and as predicted by the submitted runs. The horizontal axis is the number of relevant topics; the vertical axis is a log scale of the number of documents which are relevant to only that many topics. The chart on the left uses the top 15 submitted documents in each run; the right uses the top 50.
5 Discussion

The results presented here show that LSI can be a useful tool for text filtering. It is essential that the set of filtering profiles have room for collaboration. The best performance is gained when the right documents are used to generate the best "rose-colored SVD" through which to compare documents to filters. Additionally, some work is needed to discover a good place to truncate the SVD. We currently do this by inspecting some measure of performance at several different k values, but this is obviously rather messy. Other schemes typically involve looking at the decay of the singular values σᵢ.

Given that Cranfield is a small and highly specialized collection, we are careful not to generalize too strongly on the results in that collection. The TREC experiment also shows that there are characteristics of the collection which are needed to make collaborative content filtering work. Other, intermediately-sized collections we are examining are the Topic Detection and Tracking corpus (http://www.ldc.upenn.edu/TDT/) and the Reuters-21578 text categorization collection (http://www.research.att.com/~lewis/reuters21578.html).

As far as we know, there are not yet any standard test collections for collaborative filtering whose objects contain a significant amount of text. The most widely used collaborative filtering collection is Digital's EachMovie database (http://www.research.digital.com/SRC/eachmovie/), but its textual content is limited to names of actors, directors, and the like. This makes it difficult to compare the collaborative aspects of our approach with a more straightforward collaborative filtering algorithm. Test collections will be very important as collaborative filtering research continues to move forward.

Finally, an implicit assumption is often made that interests, user ratings, document similarity, and topical relevance are somehow related. This is evident not only through the use of "did you like this page" ratings data but also in content-oriented information retrieval approaches to filtering. With the use of standard IR collections, the issue is even more pronounced; topical relevance is a much more differently defined and strictly determined concept than that of user preference or even the fulfillment of an information need. These simplifying assumptions are foundational in retrieval and filtering experimentation, but there will certainly be problems when they are applied in real-world filtering systems. It certainly seems, from the early analysis here, that in TREC strict topical relevance may hide the advantages of collaborative filtering.

References

[Balabanovic and Shoham, 1997] Marko Balabanovic and Yoav Shoham. Fab: Content-based, collaborative recommendation. Communications of the ACM, 40(3):66-72, March 1997.

[Belkin and Croft, 1992] Nicholas J. Belkin and W. Bruce Croft. Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12):29-38, December 1992.

[Berry et al., 1995] Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595, December 1995.

[Deerwester et al., 1990] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.

[Dumais, 1995] Susan T. Dumais. Using LSI for information filtering: TREC-3 experiments. In Donna K. Harman, editor, Proceedings of the Third Text REtrieval Conference (TREC-3), pages 219-230, Gaithersburg, MD, November 1995. Also titled "Latent Semantic Indexing (LSI): TREC-3 Report".

[Hull, 1994] David Hull. Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of the Seventeenth Annual International ACM SIGIR Conference (SIGIR '94), pages 282-291, Dublin, Ireland, July 1994.

[Hull, 1998] David A. Hull. The TREC-6 filtering track: Description and analysis. In Proceedings of the Sixth Text REtrieval Conference (TREC-6), 1998.

[Konstan et al., 1997] Joseph A. Konstan, Bradley N. Miller, David Maltz, Jonathan L. Herlocker, Lee R. Gordon, and John Riedl. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77-87, March 1997.

[Schapire et al., 1998] Robert E. Schapire, Yoram Singer, and Amit Singhal. Boosting and Rocchio applied to text filtering. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98), pages 215-223, Melbourne, Australia, August 1998. ACM Press.

[Schutze et al., 1995] Hinrich Schutze, David A. Hull, and Jan O. Pedersen. A comparison of classifiers and document representations for the routing problem. In Proceedings of the Eighteenth Annual International ACM SIGIR Conference (SIGIR '95), pages 229-237, Seattle, WA, USA, July 1995.
[Shardanand and Maes, 1995] Upendra Shardanand and Pattie Maes. Social information filtering: Algorithms for automating "word of mouth". In Proceedings of CHI '95 - Human Factors in Computing Systems, pages 210-217, Denver, CO, USA, May 1995.

[Singhal et al., 1996] Amit Singhal, Chris Buckley, and Mandar Mitra. Pivoted document length normalization. In W. Bruce Croft and C. J. van Rijsbergen, editors, Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21-29. Association for Computing Machinery, August 1996.

[Singhal et al., 1998] Amit Singhal, John Choi, Donald Hindle, David D. Lewis, and Fernando Pereira. AT&T at TREC-7. In E. M. Voorhees and D. K. Harman, editors, The Seventh Text REtrieval Conference, NIST Special Publication 500-242, pages 239-252, Gaithersburg, MD, November 1998. National Institute of Standards and Technology.

[Singhal, 1997] Amit Singhal. AT&T at TREC-6. In E. M. Voorhees and D. K. Harman, editors, The Sixth Text REtrieval Conference, NIST Special Publication 500-240, pages 215-226, Gaithersburg, MD, November 1997. National Institute of Standards and Technology.

[Yan and Garcia-Molina, 1995] Tak W. Yan and Hector Garcia-Molina. SIFT - a tool for wide-area information dissemination. In Proceedings of the 1995 USENIX Technical Conference, pages 177-186, 1995.
