Combining Content and Collaboration in Text Filtering
Ian M. Soboroff
Department of Computer Science and Electrical Engineering
[Figure 1 plot: average precision (y-axis) vs. recall (x-axis)]
Figure 1: Precision and recall for the top two values of k for content LSI and collaborative LSI, and the content-only task.
[Figure 2 plot: average precision (y-axis) vs. recall (x-axis)]
Figure 2: Precision and recall among content LSI runs (k = 25, 50, 100, 200, and 500) in the first data group.
[Figure 3 plot, "Content LSI vs Content Only": average precision (y-axis) vs. recall (x-axis), with curves for content log-tfidf and LSI at k = 8, 15, and 18]
Figure 3: Precision and recall among collaborative LSI runs (k = 5, 15, and 18) in the first data group.
[Figure 4 plot: difference in average precision (y-axis) vs. topic number 351-400 (x-axis), with curves for the qz and lsi runs]
Figure 4: Difference in average precision from mean average precision for each topic. Note that there is very little difference in performance from using LSI.
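For concreteness, the quantity plotted in Figure 4 is each topic's (non-interpolated) average precision minus the mean over all topics; a minimal sketch follows, with toy rankings and judgments standing in for the real runs and qrels.

    # Average precision: mean of the precision values at each relevant
    # document as it is retrieved.
    def average_precision(ranked_docs, relevant):
        hits, total = 0, 0.0
        for rank, doc in enumerate(ranked_docs, start=1):
            if doc in relevant:
                hits += 1
                total += hits / rank
        return total / len(relevant) if relevant else 0.0

    # Toy stand-ins for the per-topic rankings and judgments
    # (the real topics are numbered 351-400).
    runs = {351: ["d1", "d4", "d2"], 352: ["d9", "d3"]}
    qrels = {351: {"d1", "d2"}, 352: {"d3", "d8"}}

    aps = {t: average_precision(runs[t], qrels[t]) for t in runs}
    mean_ap = sum(aps.values()) / len(aps)
    for t in sorted(aps):
        print(t, round(aps[t] - mean_ap, 3))    # the plotted difference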
[Figure 5 plots: two histograms titled "Histogram of relevant topics per document", comparing the qrels with the qz run's top 15 (left) and top 50 (right); number of documents (log scale) vs. number of relevant topics]
Figure 5: Histograms showing how many topics are relevant to each document, as dictated by the TREC-8 Filtering relevance judgments, and as predicted by the submitted runs. The horizontal axis is the number of relevant topics; the vertical axis is a log scale of the number of documents which are relevant to only that many topics. The chart on the left uses the top 15 submitted documents in each run; the right uses the top 50.
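The counts behind Figure 5 fall directly out of a standard TREC qrels file, whose lines hold a topic number, an iteration field, a document number, and a judgment. A minimal sketch (the filename is hypothetical):

    from collections import Counter

    # For each document, count the topics that judge it relevant.
    topics_per_doc = Counter()
    with open("qrels.trec8.filtering") as f:     # hypothetical path
        for line in f:
            topic, _iteration, docno, judgment = line.split()
            if int(judgment) > 0:
                topics_per_doc[docno] += 1

    # histogram[n] = number of documents relevant to exactly n topics.
    histogram = Counter(topics_per_doc.values())
    for n in sorted(histogram):
        print(n, histogram[n])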
however, for documents that are shared among only two or three topics, the runs are close to each other in overlap.

5 Discussion

The results presented here show that LSI can be a useful tool for text filtering. It is essential that the set of filtering profiles have room for collaboration. The best performance is gained when the right documents are used to generate the best "rose-colored SVD" through which to compare documents to filters. Additionally, some work is needed to discover a good place to truncate the SVD. We currently do this by inspecting some measure of performance at several different values of k, but this is rather messy; other schemes typically involve examining the decay of the singular values.
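To make these two issues concrete, the following sketch builds a truncated SVD with numpy, picks the cutoff k from the decay of the singular values (here by a cumulative-energy threshold, one scheme of the kind alluded to above), and folds a new document into the reduced space to score it against a profile. The matrices are toy data; this illustrates the technique and is not our experimental code.

    import numpy as np

    # Toy term-document matrix: rows are terms, columns are the
    # documents chosen to define the LSI space.
    A = np.random.rand(1000, 200)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Choose k where the cumulative "energy" of the singular values
    # passes a threshold, rather than eyeballing performance at
    # several different k values.
    energy = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(energy, 0.9)) + 1
    Uk, sk = U[:, :k], s[:k]

    # Fold a term vector into the k-dimensional space: d' = d U_k S_k^-1.
    def fold_in(d):
        return (d @ Uk) / sk

    # Score a new document against a filtering profile by cosine
    # similarity in the reduced space.
    doc, profile = np.random.rand(1000), np.random.rand(1000)
    a, b = fold_in(doc), fold_in(profile)
    score = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    print(k, score)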
Given that Cranfield is a small and highly specialized collection, we are careful not to generalize too strongly from the results on that collection. The TREC experiment also shows that certain characteristics of the collection are needed to make collaborative content filtering work. Other, intermediately-sized collections we are examining are the Topic Detection and Tracking corpus [2] and the Reuters-21578 text categorization collection [3].

As far as we know, there are not yet any standard test collections for collaborative filtering whose objects contain a significant amount of text. The most widely used collaborative filtering collection is Digital's EachMovie database [4], but its textual content is limited to names of actors, directors, and the like. This makes it difficult to compare the collaborative aspects of our approach with a more straightforward collaborative filtering algorithm. Test collections will be very important as collaborative filtering research continues to move forward.
Finally, an implicit assumption is often made that interests, user ratings, document similarity, and topical relevance are somehow related. This assumption is evident not only in the use of "did you like this page" ratings data but also in content-oriented information retrieval approaches to filtering. With standard IR collections the issue is even more pronounced: topical relevance is defined far more strictly, and determined far more rigidly, than user preference or even the fulfillment of an information need. These simplifying assumptions are foundational in retrieval and filtering experimentation, but they will certainly cause problems when applied in real-world filtering systems. It certainly seems, from the early analysis here, that in TREC strict topical relevance may hide the advantages of collaborative filtering.
[2] http://www.ldc.upenn.edu/TDT/
[3] http://www.research.att.com/~lewis/reuters21578.html
[4] http://www.research.digital.com/SRC/eachmovie/