Automatic Evaluation of Web Search
Services

ABDUR CHOWDHURY
Search & Navigation Group
America Online
USA
[email protected]

Abstract
With the proliferation of online information, the task of finding information relevant
to users' needs becomes more difficult. However, most users are only partly concerned
with this growth. Rather, they are primarily focused on finding information in a manner
and form that will help their immediate needs. It is essential to have effective online
search services available in order to fulfill this need. The goal of this chapter is to
provide a basic understanding of how to evaluate search engines' effectiveness and to
present a new technique for automatic system evaluation.
In this chapter we explore four aspects of this growing problem of finding information
needles in a worldwide haystack of search services. The first and most difficult is the
exploration of the meaning of relevance to a user's need. The second aspect we examine
is how systems have been manually evaluated in the past and reasons why these
approaches are untenable. Third, we examine what metrics should be used to understand
the effectiveness of information systems. Lastly, we examine a new evaluation
methodology that uses data mining of query logs and directory taxonomies to evaluate
systems without human assessors, producing rankings of system effectiveness that have
a strong correlation to manual evaluations. This new automatic approach shows promise
in greatly improving the speed and frequency with which these systems can be evaluated,
thus allowing scientists to evaluate new and existing retrieval algorithms as online
content, queries, and the users' needs behind them change over time.

1. Introduction
2. Relevance
3. A Brief History of Effectiveness Evaluations
   3.1. Cranfield 2 Experiments

   3.2. TREC
4. Evaluation Metrics
   4.1. Task Evaluation Metrics
5. Web Search Tasks
   5.1. Manual Web Search Evaluations
   5.2. The Changing Web
   5.3. Changing Users' Interests
6. Estimating the Necessary Number of Queries
7. Automatic Web Search Evaluation
   7.1. On-Line Taxonomies
   7.2. Evaluation Methodologies
   7.3. Engines Evaluated
8. Automatic Evaluation Results
   8.1. Manual Evaluation
   8.2. Taxonomy Bias
   8.3. Automatic Evaluation
   8.4. Stability
   8.5. Category Matching
   8.6. Automatic Effectiveness Analysis
9. Intranet Site Search Evaluation
10. Conclusions and Future Work
Acknowledgements
References

1. Introduction

The growth of information on the web has spurred much interest from both users
and researchers. Users have been interested in the wealth of online information and
services while researchers have been interested in everything from the sociological
aspects to the graph theory involved in this hyperlinked information system. Because
of the users’ need to find information and services now available on the web, search
engine usage is the second most common web activity after email. This fundamental
need to find pertinent information has caused unprecedented growth in the market
for both general and niche search engines. Google™, one of the largest web search
engines, now boasts over 4 billion indexed HTML pages [24]. ResearchBuzz, a site
that tracks search engine news, has reported some 30 new search products per month
since the late 1990s [39]. These examples only begin to convey the growth in available
information and search engine activity that is being observed. With that growth, the
basic research question we are interested in is: “How effective are these systems in
finding relevant information?” This question is the focus of the chapter.

What does it mean to have an effective search service? There are many questions
to consider when evaluating the effectiveness of a search service:

• Is the system responsive in terms of search time?


• Is the UI intuitive and well laid out?
• Is the content being searched both useful and complete?
• Does the search service help users fulfill their information need?
• Are the results presented with enough surrogate information for the users to
understand whether their needs have been met?

These questions cover many aspects of a service’s quality, from operational system
characteristics [15], to the evaluation of the usability of the site [40]. Those issues
are covered in other bodies of work and beyond the scope of this chapter. What is
examined here is a service’s ability to take an information need from a user and find
the best set of results that satisfy that need. Additionally, we examine how a set of
engines providing the same service can be examined and ranked in terms of how
effectively they are meeting users’ information request needs.
In Section 2 we explore the meaning of relevance, and ask the question “What
is a good result from a search engine?" Since relevance is at the heart of information
science, we present a brief background on prior efforts that attempt to provide
a cogent definition of this elusive concept. In Section 3 we explore the history of
search effectiveness evaluations, and the various aspects of effectiveness that must
be studied. In Section 4 we explore the metrics used to understand these systems. In
Section 5 we examine the web and the tasks users expect to accomplish when using
web search services. In addition, we examine some of the factors that are specific to
web systems in terms of changing user interests and content changes. We argue that
because of constantly changing needs and content, traditional manual evaluations
are not a tenable solution to understanding the effectiveness of these systems in any
timely manner.
In Section 7 we examine a new methodology for automatically evaluating search
services that is much less resource-intensive than human-reviewed relevance assess-
ments. Performing human assessments on very large dynamic collections like the
web is impractical, since manual review can typically only be done on a very small
scale and is very expensive to repeat as content and users’ needs change over time.
In Section 8 we examine automatic estimates of effectiveness on various tasks in
relation to manual evaluations. In Section 9 we further explore how this approach
can be applied to site and intranet search services. Lastly, in Section 10 we examine
future research areas for these techniques.

2. Relevance

The concept of "relevance" is at the heart of information science and especially
information retrieval (IR) [54]. It is the idea that defines what users are looking for,
and the target that systems' models should most closely track.
Park presented "relevance" as the key problem of IR research [50]. So, if relevance
is at the heart of IR and is its key problem, what does relevance mean? This
basic question has been examined in hundreds of papers over the last half century.
Mizzaro examined over 160 papers with the goal of presenting the history of the
meaning of “relevance” [48]. While he did not find a concrete meaning of “rele-
vance,” Mizzaro did find that there is little agreement on what exactly “relevance”
is, means, or how it should be defined. Borlund examined the multi-dimensionality
of relevance [7]. Greisdorf provides a survey of the interdisciplinary investigations
of relevance and how they might be relevant to IR [26]. Froehlich identified several
factors contributing to “relevance”-related problems in IR [23]:
(1) Inability to define relevance.
(2) Inadequacy of topicality as the basis for relevance judgments.
(3) Diversity of non-topical results.
(4) User-centered criteria that affect relevance judgments.
(5) The dynamic and fluid character of information seeking behavior.
(6) The need for appropriate methodologies for studying the information seeking
behavior.
(7) The need for more complete cognitive models for IR system design and eval-
uation.
Why is the definition of relevance so difficult? How does all this apply to the evalua-
tion of a search service? We can start to examine the problem with several examples
of relevance. Consider the case where a user types “Saturn” into a web search engine.
Which of the following results would be considered on topic or off topic?
(1) Saturn the planet.
(2) Saturn cars.
(3) Saturn the Roman god.
This first search example implies that a user has some predefined notion of what
topic he is looking for. Each of the results listed above may be relevant to a user,
depending on the topic he or she is interested in. This notion of “on topic” is what
most system evaluations use as the metric for evaluation, e.g., either the result is on
topic or off topic. This binary relevance assessment, while easy to determine, is really
a simplification of a greater notion of “relevance,” necessary for evaluating system
effectiveness using current techniques.

There are many outstanding issues that make binary relevance a problematic
simplification. First, documents are not evaluated in isolation. As a user looks at one
document, he may expand the definition of his information need, and thus the
evaluations of subsequent documents are biased by prior information. Duplicate documents
are not equally relevant because that information has already been provided by the
system. Not all documents are considered equally relevant, for example a document
on “black bears” may discuss the mating, migration, and hibernation of the animal,
while a second document may only discuss seeing black bears in the forest. While
both documents could be considered relevant to the topic of “black bears” one docu-
ment could be considered more relevant than the other. Even more complicating are
situations in which a set of documents is relevant when retrieved together, but those
individual documents are not highly relevant in isolation.
Utility functions have been proposed that would account for some documents be-
ing judged as superior to others based on the novelty of information provided, etc.,
[20]. Finally, other metrics such as completeness of the relevance judgments, cover-
age of the collection evaluated, examination of quality of the query, etc. have been
examined. Yet, as we discuss later in this chapter, when evaluating many systems and
many documents, this level of results judgment is too expensive and may not provide
a better understanding of which system is performing most effectively.

3. A Brief History of Effectiveness Evaluations

In this section, we will examine how this vague idea of relevance is converted
into an information retrieval evaluation. Starting with the two historical milestones
in IR evaluation—the Cranfield 2 experiments and TREC—we will then move on to
consider some key questions in IR evaluation design:
(1) How many queries (sometimes referred to as topics) should be evaluated?
(2) What metrics should be used to compare systems?
(3) How can we estimate our confidence that one system is better than another
given these metrics?

3.1 Cranfield 2 Experiments


The Cranfield 2 experiments were one of the first attempts at creating a laboratory
experiment in which several search strategies could be examined [17,18]. These ex-
periments had three distinct components: a fixed document collection, a fixed set of
queries (called topics), and a set of topic relevance judgments for those queries over
that collection.

This set of experiments kept the information needs (queries) and document collec-
tion constant, so several search systems could be compared in a fixed environment.
Assessors were topic experts and “relevance” was determined by a document be-
ing considered similar to a topic. Additionally, these experiments made a number of
simplifications that remain in place today for most evaluations [61].
(1) Relevance is based on topic similarity:
(a) All relevant documents are equally relevant.
(b) The relevance of one document is independent of another.
(c) A user’s information need is static.
(2) A single set of relevance judgments is representative of the population as a
whole.
(3) A set of relevance judgments is complete, i.e., all documents have been eval-
uated for a given query for relevance.
The original Cranfield experiments did not assume binary relevance; they used a
five-point relevancy scale. However, most subsequent experiments did assume binary
relevance, because the non-binary scale did not yield enough additional insight into
the systems to justify its further use.
Most of the work in evaluating search effectiveness has followed this Cranfield
experimentation paradigm, which includes holding constant the test collection, using
topical queries resulting from a user's information need, and using complete manual
relevance judgments to compare retrieval systems based on the traditional metrics of
precision and recall (see footnote 1). However, evaluating the effectiveness of web
search engines provides many unique challenges that make such an evaluation
problematic [8,37]. The web is too large to perform manual relevance judgments of
enough queries with sufficient depth (see footnote 2) to calculate recall. In contrast to
a test collection, the web is "live" data that is continually changing, preventing
experiments from being exactly reproducible. In addition, it is believed that the set of
popular web queries and the desirable results for those queries change significantly
over time and that these changes have a considerable impact on evaluation [2,4].
Hawking et al. note that "Search engine performances may vary considerably over
different query sets and over time" [34,35]. These challenges demand that evaluation
be performed repeatedly to monitor the effect of these changing variables.
While test collections are a means for evaluating system effectiveness in a controlled
manner, they are expensive to create and maintain. The main expense comes from the
number of queries and results that must be evaluated to create a meaningful experiment.
When changing conditions make test collections inapplicable, new test collections must
be created. For example, if a system is being used in a new subject domain, or user
interests have changed, any prior evaluations of the system may no longer be valid.
This raises a need to find a way to evaluate these systems in a manner that is scalable
in terms of frequency and cost.

Footnote 1: Precision is the portion of retrieved results that are considered relevant, and
recall is the portion of relevant documents in the collection that have been retrieved.
Footnote 2: Depth is the number of results that are examined. Generally, even with
sufficient depth examined, the ability to calculate recall is not possible, since relevant
documents could exist that were not considered. Thus, the pooling of many systems'
results is used to estimate recall.

3.2 TREC
The datasets used in the Cranfield-like evaluations of information retrieval systems
were small in size, often on the order of megabytes, and the queries studied were lim-
ited in number, domain focus, and complexity. In 1985, Blair and Maron [6] authored
a seminal paper that demonstrated what was suspected earlier: performance measures
obtained using small datasets were not generalizable to larger document collections.
In the early 1990s, the United States National Institute of Standards and Technology
(NIST), using a text collection created by the United States Defense Advanced Re-
search Project Agency (DARPA), initiated a conference to support the collaboration
and technology transfer among academia, industry, and government in the area of
text retrieval. The conference, named the Text REtrieval Conference (TREC), aimed
to improve evaluation methods and measures in the information retrieval domain by
increasing the research in information retrieval using relatively large test collections
on a variety of datasets.
TREC is an annual event held each November at NIST, with 2004 scheduled as the
thirteenth conference in the series. Over the years, the number of participants has
steadily increased and the types of tracks have varied greatly. In its
most recent 2003 incarnation, TREC consisted of six tracks, each designed to study
a different aspect of text retrieval: Genomics, HARD, Novelty, Question Answering,
Robust Retrieval, and Web. The specifics of each track are not relevant as the tracks
are continually modified. Tracks vary the type of data, queries, evaluation metrics,
and interaction paradigms (with or without a user in the loop) year-to-year and task-
to-task. The common theme of all the tracks is to establish an evaluation method to
be used in evaluating search systems.
Conference participation procedures are as follows: initially a call for participation
is announced; those who participate collaborate and eventually define the specifics of
each task. Documents and topics (queries) are produced, and each participating team
conducts a set of experiments. The results from each team are submitted to NIST
for judgment. Relevance assessments are created centrally via assessors at NIST, and
each set of submitted results is evaluated. The findings are summarized and presented
to the participants at the annual meeting. After the meeting, all participants submit
their summary papers and a TREC conference proceeding is published by NIST.

Early TREC forums used data on the order of multiple gigabytes. Today, as men-
tioned, the types of data vary greatly, depending on the focus of the particular track.
Likewise, the volumes of data vary. At this writing, a terabyte data collection is pro-
posed for one of the 2004 TREC tracks. Thus, within roughly a decade, the collection
sizes have grown by three orders of magnitude from a couple of gigabytes to a ter-
abyte. As such, the terabyte track was developed to examine the question of whether
this growth of data might necessitate new evaluation metrics and approaches.
Throughout TREC's existence, interest in its activities has steadily increased.
With the expanding awareness and popularity of information retrieval engines (e.g.,
the various World Wide Web search engines) the number of academic and commer-
cial TREC participants continues to grow.
Given this increased participation, more and more retrieval techniques are being
developed and evaluated. The transfer of general ideas and crude experiments from
TREC participants to commercial practice from year to year demonstrates the suc-
cess of TREC.
Over the years, the performance of search systems in TREC initially increased and
then decreased. This appears to indicate that the participating systems have actually
declined in their accuracy over some of the past years. In actuality, the queries and
tasks have increased in difficulty. When the newer, revised systems currently par-
ticipating in TREC are run using the queries and data from prior years, they tend to
exhibit a higher degree of accuracy as compared to their predecessors [2,4]. Any per-
ceived degradation is probably due to the relative complexity increase of the queries
and the tasks themselves.
We do not review the performance of the individual engines participating in the
yearly event since the focus here is on automatic evaluation; the details of the effects
of the individual utilities and strategies are not always documented, and are beyond
the scope of this chapter. Detailed information on each TREC conference is available
in written proceedings or on the web at: http://trec.nist.gov.
Given the limited number of relevance judgments that can be produced by human
document assessors, pooling is used to facilitate evaluation [27]. Pooling is the
process of selecting a fixed number of top-ranked documents obtained from each
engine, merging and sorting them, and removing duplicates. The remaining unique
documents are then judged for relevance by the assessors. Although relatively effective,
pooling does result in several false-negative document ratings: documents that actually
were relevant are counted as non-relevant because they did not make it into the pools
and were never judged. However, this phenomenon has been shown not to adversely
affect the repeatability of the evaluations for most tracks, as long as there are enough
queries, participating engines (to enrich the pools), and a stable evaluation metric
is used [13]. Overall, TREC has clearly pushed the field of information retrieval by
providing a common set of queries and relevance judgments. Most significantly for
us, repeated TREC evaluations over the years have provided a set of laboratory-style
evaluations that are able to be compared to each other (meta-evaluated) in order to
build an empirical framework for IR evaluation. We note that the Cross-Language
Evaluation Forum (CLEF) has followed the basic TREC style but focuses on cross-
lingual evaluation. The CLEF web site is http://clef.iei.pi.cnr.it.
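The pooling step described above is simple enough to sketch in code. The following Python fragment is our own illustration, not code from TREC or from this chapter; the depth value and document identifiers are arbitrary assumptions, and the "sorting" mentioned in the description is left to whatever presentation order the assessors prefer.

    def pool(runs, depth=100):
        """Build an assessment pool: take the top `depth` documents from each
        engine's ranked run, merge them, and remove duplicates."""
        pooled = []
        seen = set()
        for ranked_list in runs:              # one ranked result list per engine
            for doc_id in ranked_list[:depth]:
                if doc_id not in seen:
                    seen.add(doc_id)
                    pooled.append(doc_id)
        return pooled

    # Two engines' results for one query; only the pooled documents get judged.
    print(pool([["d1", "d2", "d3"], ["d2", "d4"]], depth=2))   # ['d1', 'd2', 'd4']

Documents outside the pool are treated as non-relevant, which is the source of the false-negative ratings mentioned above.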
Although traditional TREC methodology has provided the foundation for a large
number of interesting studies, many do not consider it relevant to the relative
performance of web search engines as searchers actually interact with them.
Experiments in the interactive track of TREC have shown that significant differences
in mean average precision (see Section 4.1.1) in a batch evaluation did not correlate
with interactive user performance for a small number of topics in the instance recall
and question answering tasks [59].

4. Evaluation Metrics

The last sections have reviewed the concept of relevance: how system evaluation
can be simplified to evaluate results in terms of binary relevance. Additionally, the
concepts of precision and recall as well as search tasks have been mentioned. In this
section we will present more formal definitions of the metrics used to understand
system effectiveness and discuss how different metrics are used to understand
different search tasks.

4.1 Task Evaluation Metrics


Most TREC evaluations use between 25 and 50 queries/topics [13]. However, the
number of queries that should be used for evaluation depends heavily on the metric
used, as some metrics are less stable than others [13]. While many metrics can
be used for system evaluation, we review precision/recall, precision@X, and Mean
Reciprocal Rank (MRR) in this section. For a more in-depth review of possible met-
rics, see [53].

4.1.1 Precision/Recall
The basic goal of a system is to return all the documents relevant to a given query,
and only those relevant documents. By measuring a system’s ability to return relevant
documents we can judge its effectiveness. Each search system that is being evaluated
has indexed a set of documents that comprise the test collection. A subset of those
documents is judged relevant to each query posed to the system. For each query
processed, the system returns a subset of documents that it believes are relevant.

Fig. 1. All documents, the set of retrieved and relevant documents for a given query.

With those three sets of documents we define the following ratios:


Precision = |Relevant Retrieved| / |Retrieved|                                   (1)

Recall = |Relevant Retrieved| / |Total Relevant in Collection|                   (2)
So, the precision of a given query is the ratio of relevant documents retrieved to the
number of documents retrieved. This is referred to as Precision at X, where X is
the cutoff on the number retrieved. If ten documents are retrieved for a given query,
and five of the results are considered to be relevant, then we would say we have a
precision of .5 at 10.
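As a concrete illustration (a minimal sketch of these definitions, not code from the chapter), precision at a cutoff X and recall can be computed directly from a ranked result list and a set of judged-relevant documents; the document identifiers below are made up.

    def precision_at(retrieved, relevant, x):
        """Fraction of the top-x retrieved documents that are relevant."""
        top = retrieved[:x]
        return sum(1 for doc in top if doc in relevant) / x

    def recall(retrieved, relevant):
        """Fraction of all relevant documents in the collection that were retrieved."""
        if not relevant:
            return 0.0
        return sum(1 for doc in retrieved if doc in relevant) / len(relevant)

    # Ten results, five of them relevant: precision of .5 at 10.
    results = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
    judged_relevant = {"d1", "d3", "d5", "d7", "d9"}
    print(precision_at(results, judged_relevant, 10))   # 0.5
    print(recall(results, judged_relevant))             # 1.0 (all five relevant documents retrieved)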
Precision alone is not sufficient for truly understanding the system’s effectiveness.
Another prudent question is “What did the system miss?” That is recall, or the ratio
of relevant documents retrieved versus the total number of relevant documents for
the given query in the collection.
Thus, a system might have good precision by retrieving ten documents and finding
that nine are relevant (a 0.9 precision), but the total number of relevant documents
also matters. If there were only nine relevant documents and the system returned only
those nine, the system would be a huge success—however if millions of documents
were relevant and desired, this would not be a good result set. When the total number
of relevant documents in the collection is unknown, an approximation of the number
is obtained, usually through pooling.
Again, for each query there is a set of documents that are retrieved by the system,
and a subset of those are relevant to the given query. In a perfect system, these two
sets would be equivalent; it would only retrieve relevant documents. In reality, sys-
tems retrieve many non-relevant documents, hence the need to work on improving
the effectiveness of IR systems.

Consider the following scenario: systems A and B return five documents for a
given query. The first two retrieved documents for System A are considered relevant.
The last two retrieved documents for System B are considered relevant. Precision at
5 would give each system a score of .4, and if you then consider recall, they would
still be equivalent. Clearly, a system that retrieves relevant documents and ranks them
higher should be considered a better system. This property is highly desired by users,
who rely on systems to rank relevant documents higher, thus reducing the amount of
work they must do in culling through results.
Precision can also be computed at various points of recall. Now consider ten doc-
uments are retrieved, but only two documents (documents at ranks two and five) are
relevant to the query in the retrieved set, out of a total of two relevant documents
in the collection. Consider the document retrieval performance represented by the
sloped line shown in Figure 2. Fifty percent recall (finding one of the two relevant
documents) results when two documents are retrieved. At this point, precision is fifty
percent as we have retrieved two documents and one of them is relevant. To reach one
hundred percent recall, we must continue to retrieve documents until both relevant
documents are retrieved. For our example, it is necessary to retrieve five documents
to find both relevant documents. At this point, precision is forty percent because two
out of five retrieved documents are relevant. Hence, for any desired level of recall
it is possible to compute precision. Graphing precision at various points of recall is
referred to as a precision/recall curve.

Fig. 2. Typical precision/recall graph used to evaluate a system's effectiveness.



A typical precision/recall curve is shown in Figure 2. Typically, as higher recall is
desired, more documents must be retrieved to obtain the desired level of recall. In a
perfect system, only relevant documents are retrieved. This means that at any level of
recall, precision would be 1.0. The optimal precision/recall line is shown in Figure 2
as the dotted line.
Average precision is used to examine a system's effectiveness at retrieving and ranking
documents. As each relevant document is retrieved, its precision is calculated and
averaged with the precision values at the previously retrieved relevant documents. This
allows us to quantify a system's overall performance across the entire precision/recall
curve, giving us a better understanding of retrieval effectiveness than precision alone.
This is the metric that most TREC-style evaluations use to compare systems.
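Following the description above, average precision for a single query can be sketched as below (our own illustration, not the chapter's code). It averages the precision observed at each relevant document as it is retrieved; note that the standard TREC definition divides by the total number of relevant documents in the collection, which coincides with this sketch whenever all of them are retrieved.

    def average_precision(retrieved, relevant):
        """Average of the precision values at the ranks where relevant documents appear."""
        precisions = []
        hits = 0
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(precisions) if precisions else 0.0

    # The Figure 2 example: relevant documents appear at ranks 2 and 5.
    print(average_precision(["a", "r1", "b", "c", "r2"], {"r1", "r2"}))
    # (1/2 + 2/5) / 2 = 0.45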
Precision/recall graphs and their corresponding average precision value examine
systems’ effectiveness at finding relevant documents with the best ranking possible.
Most TREC evaluations for ad hoc retrieval over the last decade have used this as the
basis for their system comparison with much success, showing that system designers
have been able to take that information and build better systems. While the TREC
evaluation paradigm relies on pooling techniques to estimate a large collection’s full
set of relevant documents, this has been shown to be a valid technique [60].
While this metric has shown much success, it does imply a specific user task of
topical information seeking, sometimes referred to as ad hoc retrieval. As we con-
tinue our exploration of automatic evaluation techniques it is fair to ask the question:
is that the only task web users are performing, or are there other user tasks? If there
are other search tasks we must then determine the validity of using a single metric to
fully understand system effectiveness with respect to those tasks as well.
Hawking [37] argued that many users are looking for specific sites or known-
items on the web. This navigational or transactional search task does not really have a
notion of a set of relevant documents, but rather a single correct answer. For example,
a user types in “ebay,” but the intent of this user is not to look for pages containing
information about the online auction company eBay™, but rather to find the address
of its home page on the web. Because of the fundamental difference of the task, a
single metric may not be the most appropriate means for evaluating both tasks.

4.1.2 Mean Reciprocal Ranking—MRR


The goal of known-item search is to find a single known resource for a given
query and to make sure the system ranks that result as high as possible. The closer
the resource is to the top of the result set, the better the system is for the user’s
needs. Thus, we can say that for a given set of queries we will have a set of
(query, result) tuples. We can use this set of tuples to evaluate a system with a
metric called reciprocal ranking. Reciprocal ranking weights results based on the
rank of the result, i.e., 1/rank. Therefore, if the correct answer is found at rank 1,
a weight of 1 is used. If the rank were 2 then a weight of 1/2 is used, a rank of 3
would get a weight of 1/3, and so on.
The Mean Reciprocal Ranking (MRR) of a system is:
MRR = (1/n) * Σ_{q=1..n} (1/rank_q),                                            (3)
where:
• rank_q is the rank of the retrieved correct answer for query q,
• n is the number of queries posed to the system,
• MRR is that reciprocal rank averaged over the set of queries.
A system that produces an MRR of .25 would mean that, on average, the system
finds the known item at position four of the result set. A system that produces an
MRR of .75 would be finding the item between ranks 1 and 2 on average.
Thus, the effectiveness of the system increases as the MRR approaches 1.
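Equation (3) translates almost directly into code. The sketch below is illustrative only; in particular, treating a query whose known item is never retrieved as contributing 0 is a common convention, not something specified in the text.

    def mean_reciprocal_rank(results_by_query, known_item):
        """MRR over a set of queries, per Eq. (3).

        results_by_query: dict mapping each query to its ranked result list.
        known_item:       dict mapping each query to its single correct answer.
        """
        total = 0.0
        for query, ranked in results_by_query.items():
            target = known_item[query]
            if target in ranked:
                total += 1.0 / (ranked.index(target) + 1)   # ranks are 1-based
        return total / len(results_by_query)

    # Known item at rank 4 for one query and rank 1 for another: (0.25 + 1.0) / 2.
    runs = {"ebay": ["a", "b", "c", "ebay.com"], "google": ["google.com", "x"]}
    answers = {"ebay": "ebay.com", "google": "google.com"}
    print(mean_reciprocal_rank(runs, answers))   # 0.625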
In the prior sections we have reviewed the problem of understanding relevance
and the simplifications to this idea that were needed to make evaluations possible.
We also examined some of the history of these evaluations and briefly talked about
some of the different tasks users engage in. Lastly, we discussed the metrics that
are used when evaluating these tasks and provided definitions of the most commonly
used metrics for popular search tasks (ad hoc retrieval typically uses precision/recall,
known-item uses MRR). For a more in-depth review of these common evaluation
metrics see [49] and [53]. In the next section we will examine web search.

5. Web Search Tasks


Librarians and information analysts were the first users of information retrieval
systems. Their goals were to find all information on a given topic. This goal was rea-
sonably well represented by the Cranfield and TREC methods of evaluating system
effectiveness. As the World Wide Web grew in terms of number of users and amount
of content, web users could no longer reliably use human-created directories to find
all the new information, services, and sites. Search engines filled this void by spider-
ing (gathering web pages by following their links) and indexing the content on the
web. Users could then just go to a search engine and enter a representation of their
information need in order to find what they desired. The emergence of these search
services raised some questions. Did those prior system evaluation methods still hold?
Did the tasks that users were trying to accomplish fit with the Cranfield paradigm?
Could the old evaluation approaches of pooling still work?

Some basic facts about web search behavior are known. The general belief is that
the majority of web searchers are interested in a small number (often one) of highly
relevant pages. This would be consistent with the aspects of web searching that have
been measured from large query logs: the average web query is 2.21 terms in length
[41], users view only the top 10 results for 85% of their queries and they do not
revise their query after the first try for 75% of their queries [56]. It is also widely
believed that web search services are being optimized to retrieve highly relevant
documents with high precision at low levels of recall, features desirable for support-
ing known-item search. Singhal and Kaszkiel propose, “site-based grouping done by
most commercial web search engines artificially depresses the precision value for
these engines . . . because it groups several relevant pages under one item. . .” [57].
In order to answer the many questions web search evaluation demands, however, a
more in-depth investigation into the nature of queries and tasks used in web search is
needed. Spink gave a basis for classifying web queries as informational, navigational
or transactional [58], but no large-scale studies have definitively quantified the ratio
of web queries for the various tasks defined. Broder defined similar classifications
and presents a study of Altavista™ users via a popup survey and a self-admittedly
"soft" query log analysis indicating that less than half of users' queries are informational
in nature [10]. The study found that users' tasks could be classified into the following
three main types: navigational, informational, and transactional.
(1) Navigational (ebay, 1040 ez form, amazon, google, etc.).
(2) Informational (black bears, rock climbing, etc.).
(3) Transactional (plane ticket to atlanta, buy books, etc.).
Are these tasks so fundamentally different that the informational type of evaluation
most commonly used in retrieval experiments (e.g., TREC) does not help us
understand true system effectiveness? We think that there are enough differences
in the tasks that traditional informational evaluations using metrics such as preci-
sion/recall alone may not provide the best insight into system effectiveness for all
tasks. Rather, a combination of precision/recall with mean reciprocal ranking may
be prudent.
In Table I we show the top 20 queries from an AOL™ web search interface, from
a one-week time period in November 2003. Thirteen of the top queries are navigational,
i.e., looking for a single target site to go to; these queries have no informational intent.
The remaining seven are looking for information, but rather than the full body of
information about, e.g., "weather," the user is probably just looking for the best site to
provide the day's weather forecast. This concept of a single authority
for a given need is fundamentally different from the simplification most evaluations
make where all documents are considered equally relevant.

TABLE I
Top 20 Queries (Without Sexually Explicit Terms)

Search term          Rank

yahoo 1
google 2
hotmail 3
ebay 4
lyrics 5
ask jeeves 6
msn 7
mapquest 8
southwest airlines 9
weather 10
greeting cards 11
maps 12
aol member profile 13
pogo 14
games 15
yahoo mail 16
jobs 17
kazaa 18
billing 19
aim express 20
kelley blue book 21
yellow pages 22
yahoo games 23
black planet 24
slingo 25

We could try to evaluate these systems using solely P/R by setting the total number
of relevant documents to 1 and applying precision/recall evaluations. The one major
issue with this is that P/R evaluations use the area under the curve to show the tradeoff
between retrieving more documents to achieve higher recall and the effect this has on
precision at a higher number retrieved. MRR evaluations give us a better understanding
of a system's effectiveness at finding and ranking highly the best site or item
being sought by the users.
When examining the deeper ranks in the query logs (by frequency) over time, we
find some queries, such as current events like recent movies and news items, moving
up in rank (or newly appearing) and some queries moving down in rank or dropping
off the list altogether. As the list is examined further down we start to find
more traditional informational queries. What the reader should understand from this
is that web users may not be solely using these systems to find traditional information
on topics, but rather as ways of navigating this large system of services and sites. This
is one of the main reasons that precision/recall should not be the only metric systems
are evaluated against. Thus, we may need several interpretations of relevance given
a task.

5.1 Manual Web Search Evaluations


There have been several studies that evaluate web search engines using TREC
methodology of manual relevance judgments. In the past three years, the importance
of navigational queries has led TREC to incorporate known-item evaluations as part
of the web track [29–33]. These evaluations used MRR as a metric for evaluating the
relevance of homepages and named-pages in two collections: the WT10g, a cleaned
10-gigabyte general web crawl from 1997, and .GOV, a cleaned 18-gigabyte
focused crawl of only the pages in the .gov top-level domain from 2002 [31,32,1].
Hawking and Craswell, et al. evaluated web search engines [37,38,29,30] in com-
parison to TREC systems involved in TREC tracks from 1998–1999 that used the
100 GB VLC2 web snapshot (also from 1997; an un-cleaned superset of WT10g)
and 50 manually-assessed informational queries each year [36,38]. They found that
TREC systems generally outperformed web search engines on the informational
task in 1998 and 1999; however, they acknowledged that comparing TREC systems
with web engines in an ad hoc (informational) evaluation might not be sufficient
[21]. Their evaluation of the web search engines correlated with an informational
task evaluation done by Gordon and Pathak in 1998 [25]. Hawking, Craswell, and
Griffiths also manually evaluated web search engines on 106 transactional (online
service location) queries in 2000 [34,35], and 95 airline homepage finding queries in
2001 [34,35]. Although they do not provide a direct comparison of web search ser-
vices to TREC systems participating in similar transactional and navigational tasks
those years, their evaluations of the two are similar and the web engines’ scores are
generally equivalent or slightly above those of the TREC evaluations. Leighton and
Srivastava evaluated web search engine performance on an informational task using
a mixture of structured and unstructured queries and found differences in the en-
gines’ effectiveness in 1997 [45]. Ding and Marchionini evaluated three web search
engines on a small set of informational topics in 1996 and found no significant dif-
ference between them [22]. Other studies have used alternative methods of manually
evaluating web search engines. Bruza et al. compared the interactive effectiveness of
query-based, taxonomy-based, and phrase-based query reformulation search on the
web, showing that the assisted search of the latter technique could improve relevance
of results, but came at the cost of higher cognitive load and user time [11]. Singhal
and Kaszkiel mined homepage-finding queries from a large web query log by select-
ing those that contained terms such as “homepage,” “webpage,” and “website.” They
used the rank of manually judged homepages as their measure and found web engines'
effectiveness to be superior to that of a TREC system in 2001 [57].

5.2 The Changing Web


If understanding the effectiveness of web engines is important and it has been
possible to carry out some portions of effectiveness evaluation by hand, why not
just have humans repeat these evaluations as necessary? This approach would be the
most reliable means of understanding the question, but would this really be econom-
ically feasible? To examine that question, we must examine the changes in the web’s
content and users’ queries over time.
The size of the web has been growing and changing ever since its inception. Many
papers have examined this growth [44] and change [14] showing that the number of
servers, pages and content are very dynamic. Additionally, the growth of the hidden
or invisible web shows that there is a tremendous amount of dynamic content that
is also accessible, maybe even more than static content [52,9]. Lastly, watchers of
this industry show that search engines' indices are constantly growing and changing,
along with their ranking strategies (see www.searchenginewatch.com). The growth and
dynamic nature of the content, together with the changing systems, draw us to the
question of how often web search engines should be examined for effectiveness.

5.3 Changing Users’ Interests


If the above reasons alone do not motivate us to question how often to examine
these search systems, one additional question that must be asked is: do users’ inter-
ests and needs change over time? This question has gotten little examination in the
literature, due primarily to the lack of public access to large search engine query logs
[62]. Let’s examine the search query logs from AOL search for a one-week period
and review some log statistics.
Figure 3 shows the top 2 million queries over a one-week period. The queries are
sorted by frequency of occurrences with some case normalization. This shows that
a few million queries make up a large percentage of the total query traffic. When
further examining the head of this list we see that only a few thousand top queries
make up a significant part of the total query traffic (see Figure 4). This “top-heavy”
query distribution may mean that systems do not need to be examined often. Thus, if
the top queries are stable and a large percentage of users’ interests can be evaluated,
manual evaluations of the web may be possible.
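Statistics like those plotted in Figures 3-5 can be recomputed from any query log with a few lines of code. The sketch below is our own illustration, not the analysis actually used in the chapter; it assumes the log has already been reduced to one normalized query string per issued query.

    from collections import Counter

    def query_log_stats(queries, top_k=10_000, rare_threshold=5):
        """Share of total traffic covered by the top_k distinct queries, and the
        share of traffic made up of queries issued fewer than rare_threshold times."""
        counts = Counter(queries)
        total = sum(counts.values())
        top_coverage = sum(freq for _, freq in counts.most_common(top_k)) / total
        rare_share = sum(freq for freq in counts.values() if freq < rare_threshold) / total
        return top_coverage, rare_share

    # Toy log; a real log would be read from a file, one query per line.
    log = ["yahoo", "yahoo", "ebay", "weather", "black bears", "yahoo", "maps"]
    print(query_log_stats(log, top_k=2, rare_threshold=2))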
Fig. 3. Top 2 million ranked queries vs. their coverage for a 1-week period.
Fig. 4. Top 10 thousand ranked queries vs. their coverage for a 1-week period.
Fig. 5. Query frequency vs. percent of query stream.

To examine that question we need to examine the percentage of queries that occur
only a few times, and observe how much the top queries change over time. Figure 5
shows us that the majority of queries only occur a few times, and that ∼55% of all
queries occur less than 5 times. This implies that on average roughly half of the query
stream is constantly changing, or that users look for something and then move on and
that behavior is not repeated by much of the population. Nonetheless, we still have
not answered the question: do the most frequent queries change?
To answer that question we examined the similarity of the top queries from month-
to-month. Two metrics are used to examine the changes in the query stream over
these time periods: overlap and rank stability. The goal of this examination is to see
how stable the top queries are in terms of these metrics.
Overlap is the ratio of the intersection of the top queries over the union. We exam-
ine the similarity of the top queries over time in Figure 6, where each month is added
to the calculation. Thus, the denominator is the union of the top 30,000 queries for
each consecutive month. Examining Figure 6 we see that the overlap similarity of
the top queries diminishes over the year. This means that the top queries are changing.
This raises the question: are the queries that are stable, i.e., not changing, at least
consistent in rank and not greatly fluctuating?
 
Olap = |l1 ∩ l2| / |l1 ∪ l2|   (overlap).                                        (4)
To answer that question we examine the intersection of the top queries from two
months in Figure 7. We compare those sets using the Pearson correlation coefficient
[51]. The Pearson coefficient will be −1 if the scores in the ranked lists are exactly
opposite, 1 if they are the same, and 0 if there is no statistical correlation between the
two scored lists. Figure 7 shows us that while there is a statistical correlation between the
rankings of two months it is only moderately strong, which suggests that while the
top queries may be similar, users’ interests are changing in frequency.

Fig. 6. Overlap of queries month to month over a year.
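The two stability measures used here are also easy to sketch. The fragment below is illustrative only: Olap follows Eq. (4), and the correlation is computed on the ranks the common queries receive in the two months, since the chapter does not spell out exactly which scores are fed to the Pearson coefficient.

    import math

    def overlap(top_a, top_b):
        """Eq. (4): size of the intersection over size of the union of two top-query sets."""
        a, b = set(top_a), set(top_b)
        return len(a & b) / len(a | b)

    def pearson(xs, ys):
        """Pearson correlation coefficient of two equal-length score lists."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    # Two months of top queries mapped to their ranks.
    month1 = {"yahoo": 1, "google": 2, "ebay": 3, "weather": 4}
    month2 = {"yahoo": 1, "ebay": 2, "google": 3, "maps": 4}
    print(overlap(month1, month2))                        # 3 / 5 = 0.6
    common = sorted(set(month1) & set(month2))
    print(pearson([month1[q] for q in common], [month2[q] for q in common]))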

