Automatic Evaluation of Web Search
Services
       ABDUR CHOWDHURY
       Search & Navigation Group
       America Online
       USA
       [email protected]
       Abstract
       With the proliferation of online information, the task of finding information rele-
       vant to users’ needs becomes more difficult. However, most users are only partly
       concerned with this growth. Rather, they are primarily focused on finding infor-
       mation in a manner and form that will help their immediate needs. It is essential
       to have effective online search services available in order to fulfill this need. The
       goal of this chapter is to provide a basic understanding of how to evaluate search
       engines’ effectiveness and to present a new technique for automatic system eval-
       uation.
          In this chapter we explore four aspects of this growing problem of finding
       information needles in a worldwide haystack of search services. The first and
       most difficult is the exploration of the meaning of relevance to a user’s need. The
       second aspect we examine is how systems have been manually evaluated in the
       past and reasons why these approaches are untenable. Third, we examine what
       metrics should be used to understand the effectiveness of information systems.
       Lastly, we examine a new evaluation methodology that uses data mining of query
       logs and directory taxonomies to evaluate systems without human assessors, pro-
       ducing rankings of system effectiveness that have a strong correlation to manual
       evaluations. This new automatic approach shows promise in greatly improving
the speed and frequency with which these systems can be evaluated, thus allowing scientists to evaluate new and existing retrieval algorithms as online content, queries, and the users' needs behind them change over time.
 1. Introduction
 2. Relevance
 3. A Brief History of Effectiveness Evaluations
    3.1. Cranfield 2 Experiments
    3.2. TREC
 4. Evaluation Metrics
    4.1. Task Evaluation Metrics
 5. Web Search Tasks
    5.1. Manual Web Search Evaluations
    5.2. The Changing Web
    5.3. Changing Users' Interests
 6. Estimating the Necessary Number of Queries
 7. Automatic Web Search Evaluation
    7.1. On-Line Taxonomies
    7.2. Evaluation Methodologies
    7.3. Engines Evaluated
 8. Automatic Evaluation Results
    8.1. Manual Evaluation
    8.2. Taxonomy Bias
    8.3. Automatic Evaluation
    8.4. Stability
    8.5. Category Matching
    8.6. Automatic Effectiveness Analysis
 9. Intranet Site Search Evaluation
10. Conclusions and Future Work
    Acknowledgements
    References
1. Introduction
   The growth of information on the web has spurred much interest from both users
and researchers. Users have been interested in the wealth of online information and
services while researchers have been interested in everything from the sociological
aspects to the graph theory involved in this hyperlinked information system. Because
of the users’ need to find information and services now available on the web, search
engine usage is the second most common web activity after email. This fundamental
need to find pertinent information has caused unprecedented growth in the market
for both general and niche search engines. Google™, one of the largest web search
engines, now boasts over 4 billion indexed HTML pages [24]. ResearchBuzz, a site
that tracks search engine news, has reported some 30 new search products per month
since the late 1990s [39]. These examples only begin to convey the growth in available information and search engine activity being observed. With that growth, the
basic research question we are interested in is: “How effective are these systems in
finding relevant information?” This question is the focus of the chapter.
   What does it mean to have an effective search service? Many questions arise when evaluating the effectiveness of a search service, covering many aspects of a service's quality, from operational system characteristics [15] to the usability of the site [40]. Those issues are covered in other bodies of work and are beyond the scope of this chapter. What is
examined here is a service’s ability to take an information need from a user and find
the best set of results that satisfy that need. Additionally, we examine how a set of
engines providing the same service can be examined and ranked in terms of how
effectively they are meeting users’ information request needs.
   In Section 2 we explore the meaning of relevance, and ask the question “What
is a good result from a search engine?” Since relevance is at the heart of informa-
tion science, we present a brief background into prior efforts that attempt to provide
a cogent definition of this elusive concept. In Section 3 we explore the history of
search effectiveness evaluations, and the various aspects of effectiveness that must
be studied. In Section 4 we explore the metrics used to understand these systems. In
Section 5 we examine the web and the tasks users expect to accomplish when using
web search services. In addition, we examine some of the factors that are specific to
web systems in terms of changing user interests and content changes. We argue that
because of constantly changing needs and content, traditional manual evaluations
are not a tenable solution to understanding the effectiveness of these systems in any
timely manner.
   In Section 7 we examine a new methodology for automatically evaluating search
services that is much less resource-intensive than human-reviewed relevance assess-
ments. Performing human assessments on very large dynamic collections like the
web is impractical, since manual review can typically only be done on a very small
scale and is very expensive to repeat as content and users’ needs change over time.
   In Section 8 we examine automatic estimates of effectiveness on various tasks in
relation to manual evaluations. In Section 9 we further explore how this approach
can be applied to site and intranet search services. Lastly, in Section 10 we examine
future research areas for these techniques.
2. Relevance
   Evaluations typically simplify relevance to a binary judgment: a document either is or is not relevant to a query. There are many outstanding issues that make binary relevance a problematic simplification. First, documents are not evaluated in isolation: as a user looks at one document, he may expand the definition of his information need, and thus the evaluations of subsequent documents are biased by prior information. Duplicate documents
are not equally relevant because that information has already been provided by the
system. Not all documents are considered equally relevant, for example a document
on “black bears” may discuss the mating, migration, and hibernation of the animal,
while a second document may only discuss seeing black bears in the forest. While
both documents could be considered relevant to the topic of “black bears” one docu-
ment could be considered more relevant than the other. Even more complicating are
situations in which a set of documents is relevant when retrieved together, but those
individual documents are not highly relevant in isolation.
   Utility functions have been proposed that account for some documents being judged as superior to others based on, for example, the novelty of the information they provide [20]. Finally, other measures have been examined, such as the completeness of the relevance judgments, the coverage of the collection evaluated, and the quality of the query. Yet, as we discuss later in this chapter, when evaluating many systems and many documents, this level of results judgment is too expensive and may not provide a better understanding of which system is performing most effectively.
3. A Brief History of Effectiveness Evaluations

   In this section, we will examine how this vague idea of relevance is converted
into an information retrieval evaluation. Starting with the two historical milestones
in IR evaluation—the Cranfield 2 experiments and TREC—we will then move on to
consider some key questions in IR evaluation design:
  (1) How many queries (sometimes referred to as topics) should be evaluated?
  (2) What metrics should be used to compare systems?
  (3) How can we estimate our confidence that one system is better than another
      given these metrics?
3.1 Cranfield 2 Experiments

   The Cranfield 2 experiments kept the information needs (queries) and document collection constant so that several search systems could be compared in a fixed environment. Assessors were topic experts, and "relevance" was determined by whether a document was considered similar to a topic. Additionally, these experiments made a number of simplifications that remain in place today for most evaluations [61].
    (1) Relevance is based on topic similarity:
        (a) All relevant documents are equally relevant.
        (b) The relevance of one document is independent of another.
        (c) A user’s information need is static.
    (2) A single set of relevance judgments is representative of the population as a
        whole.
    (3) A set of relevance judgments is complete, i.e., all documents have been eval-
        uated for a given query for relevance.
The original Cranfield experiments did not assume binary relevance; they used a five-point relevancy scale. However, most subsequent experiments did assume binary relevance, because the non-binary scale did not yield enough additional insight into system behavior to justify its continued use.
   Most of the work in evaluating search effectiveness has followed this Cranfield ex-
perimentation paradigm, which includes holding constant the test collection, using
topical queries resulting from a user’s information need, and using complete manual
relevance judgments to compare retrieval systems based on the traditional metrics of
precision and recall.1 However, evaluating the effectiveness of web search engines
provides many unique challenges that make such an evaluation problematic [8,37].
The web is too large to perform manual relevance judgments of enough queries with
sufficient depth2 to calculate recall. In contrast to a test collection, the web is “live”
data that is continually changing, preventing experiments from being exactly repro-
ducible. In addition, it is believed that the set of popular web queries and the desirable
results for those queries changes significantly over time and that these changes have
a considerable impact on evaluation [2,4]. Hawking et al. note that "Search engine performances may vary considerably over different query sets and over time" [34,35]. These challenges demand an evaluation that can be performed repeatedly to monitor the effect of these changing variables.
   While test collections are a means for evaluating system effectiveness in a con-
trolled manner, they are expensive to create and maintain. The main expense comes
   1 Precision is the portion of retrieved results that are considered relevant, and recall is the portion of
relevant documents in the collection that have been retrieved.
   2 Depth is the number of results that are examined. Generally, even when sufficient depth is examined, recall cannot be calculated exactly, since relevant documents could exist that were never considered. Thus, the pooled results of many systems are used to estimate recall.
from the number of queries and results that must be evaluated to create a meaningful
experiment. When changing conditions make test collections inapplicable, new test
collections must be created. For example, if a system is being used in a new subject
domain, or user interests have changed, any prior evaluations of the system may no
longer be valid. This raises a need to find a way to evaluate these systems in a manner
that is scalable in terms of frequency and cost.
                                    3.2 TREC
   The datasets used in the Cranfield-like evaluations of information retrieval systems
were small in size, often on the order of megabytes, and the queries studied were lim-
ited in number, domain focus, and complexity. In 1985, Blair and Maron [6] authored
a seminal paper that demonstrated what was suspected earlier: performance measures
obtained using small datasets were not generalizable to larger document collections.
In the early 1990s, the United States National Institute of Standards and Technology
(NIST), using a text collection created by the United States Defense Advanced Re-
search Project Agency (DARPA), initiated a conference to support the collaboration
and technology transfer among academia, industry, and government in the area of
text retrieval. The conference, named the Text REtrieval Conference (TREC), aimed
to improve evaluation methods and measures in the information retrieval domain by
increasing the research in information retrieval using relatively large test collections
on a variety of datasets.
   TREC is an annual event held in November at NIST, with 2004 scheduled as the thirteenth conference in the series. Over the years, the number of participants has steadily increased and the types of tracks have varied greatly. In its
most recent 2003 incarnation, TREC consisted of six tracks, each designed to study
a different aspect of text retrieval: Genomics, HARD, Novelty, Question Answering,
Robust Retrieval, and Web. The specifics of each track are not relevant as the tracks
are continually modified. Tracks vary the type of data, queries, evaluation metrics,
and interaction paradigms (with or without a user in the loop) year-to-year and task-
to-task. The common theme of all the tracks is to establish an evaluation method to
be used in evaluating search systems.
   Conference participation procedures are as follows: initially a call for participation
is announced; those who participate collaborate and eventually define the specifics of
each task. Documents and topics (queries) are produced, and each participating team
conducts a set of experiments. The results from each team are submitted to NIST
for judgment. Relevance assessments are created centrally via assessors at NIST, and
each set of submitted results is evaluated. The findings are summarized and presented
to the participants at the annual meeting. After the meeting, all participants submit
their summary papers and a TREC conference proceeding is published by NIST.
   Early TREC forums used data on the order of multiple gigabytes. Today, as men-
tioned, the types of data vary greatly, depending on the focus of the particular track.
Likewise, the volumes of data vary. At this writing, a terabyte data collection is pro-
posed for one of the 2004 TREC tracks. Thus, within roughly a decade, the collection
sizes have grown by three orders of magnitude from a couple of gigabytes to a ter-
abyte. As such, the terabyte track was developed to examine the question of whether
this growth of data might necessitate new evaluation metrics and approaches.
   Throughout TREC's existence, interest in its activities has steadily increased.
With the expanding awareness and popularity of information retrieval engines (e.g.,
the various World Wide Web search engines) the number of academic and commer-
cial TREC participants continues to grow.
   Given this increased participation, more and more retrieval techniques are being
developed and evaluated. The transfer of general ideas and crude experiments from
TREC participants to commercial practice from year to year demonstrates the suc-
cess of TREC.
   Over the years, the performance of search systems in TREC initially increased and then decreased. This appears to indicate that the participating systems have actually declined in accuracy over some of the past years. In actuality, the queries and tasks have increased in difficulty. When the newer, revised systems currently participating in TREC are run on the queries and data from prior years, they tend to exhibit higher accuracy than their predecessors [2,4]. Any perceived degradation is probably due to the increased complexity of the queries and the tasks themselves.
   We do not review the performance of the individual engines participating in the
yearly event since the focus here is on automatic evaluation; the details of the effects
of the individual utilities and strategies are not always documented, and are beyond
the scope of this chapter. Detailed information on each TREC conference is available
in written proceedings or on the web at: http://trec.nist.gov.
   Given the limited number of relevance judgments that can be produced by human document assessors, pooling is used to facilitate evaluation [27]. Pooling is the process of selecting a fixed number of top-ranked documents obtained from each engine, merging and sorting them, and removing duplicates (a minimal sketch of this step follows this paragraph). The remaining unique documents are then judged for relevance by the assessors. Although relatively effective, pooling does produce some false-negative ratings, because documents that were actually relevant but did not make it into the pools are never judged. However, this phenomenon has been shown not to adversely affect the repeatability of the evaluations for most tracks, as long as there are enough queries, enough participating engines (to enrich the pools), and a stable evaluation metric [13]. Overall, TREC has clearly pushed the field of information retrieval forward by providing a common set of queries and relevance judgments. Most significantly for
us, repeated TREC evaluations over the years have provided a set of laboratory-style
evaluations that are able to be compared to each other (meta-evaluated) in order to
build an empirical framework for IR evaluation. We note that the Cross-Language
Evaluation Forum (CLEF) has followed the basic TREC style but focuses on cross-
lingual evaluation. The CLEF web site is http://clef.iei.pi.cnr.it.
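As an illustration, the following is a minimal sketch of the pooling step described above, assuming each engine's ranked results for a single query are given as ordered lists of document identifiers; the function name and pool depth are illustrative, not taken from the source.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int = 100) -> Set[str]:
    """Form a judgment pool: take the top `depth` documents from each
    engine's ranked run for one query, merge them, and drop duplicates.
    The returned set is what human assessors would then judge."""
    pool: Set[str] = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])   # keep only the top-ranked documents
    return pool


# Two hypothetical engines and a shallow pool depth, for illustration.
runs = {
    "engineA": ["d1", "d2", "d3", "d4"],
    "engineB": ["d3", "d5", "d1", "d6"],
}
print(sorted(build_pool(runs, depth=3)))   # ['d1', 'd2', 'd3', 'd5']
```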
   Although traditional TREC methodology has provided the foundation for a large number of interesting studies, many question its relevance to the relative performance of web search engines as searchers actually interact with them.
Experiments in the interactive track of TREC have shown that significant differences
in mean average precision (see Section 4.1.1) in a batch evaluation did not correlate
with interactive user performance for a small number of topics in the instance recall
and question answering tasks [59].
4. Evaluation Metrics
   The last sections have reviewed the concept of relevance: how system evaluation
can be simplified to evaluate results in terms of binary relevance. Additionally, the
concepts of precision and recall as well as search tasks have been mentioned. In this
section we will present more formal definitions of the various metrics used to under-
stand system effectiveness and how various metrics are used to understand different
search tasks.
4.1.1 Precision/Recall
  The basic goal of a system is to return all the documents relevant to a given query,
and only those relevant documents. By measuring a system’s ability to return relevant
documents we can judge its effectiveness. Each search system that is being evaluated
has indexed a set of documents that comprise the test collection. A subset of those
documents is judged relevant to each query posed to the system. For each query
processed, the system returns a subset of documents that it believes are relevant.
Fig. 1. All documents, the set of retrieved and relevant documents for a given query.
   Consider the following scenario: systems A and B each return five documents for a given query. The first two documents retrieved by System A are considered relevant, while the last two documents retrieved by System B are considered relevant. Precision at 5 ranks each system at 0.4, and if you consider recall, the two systems are still equivalent. Clearly, a system that retrieves relevant documents and ranks them higher should be considered a better system. This property is highly desirable for users, who rely on systems to rank relevant documents higher and thus reduce the work of culling through results.
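The comparison can be made concrete with a short sketch; the relevance labels below simply encode the hypothetical scenario above (1 = relevant, 0 = not relevant) and are not drawn from any real evaluation.

```python
def precision_at_k(relevance: list[int], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(relevance[:k]) / k


# Relevance labels for the scenario above (rank 1 is the first element).
system_a = [1, 1, 0, 0, 0]   # relevant documents at ranks 1 and 2
system_b = [0, 0, 0, 1, 1]   # relevant documents at ranks 4 and 5

print(precision_at_k(system_a, 5))   # 0.4
print(precision_at_k(system_b, 5))   # 0.4 -- identical, despite the worse ranking
```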
   Precision can also be computed at various points of recall. Now consider a case in which ten documents are retrieved, but only two of them (those at ranks two and five) are relevant to the query, and these are the only two relevant documents in the collection. Consider the document retrieval performance represented by the
sloped line shown in Figure 2. Fifty percent recall (finding one of the two relevant
documents) results when two documents are retrieved. At this point, precision is fifty
percent as we have retrieved two documents and one of them is relevant. To reach one
hundred percent recall, we must continue to retrieve documents until both relevant
documents are retrieved. For our example, it is necessary to retrieve five documents
to find both relevant documents. At this point, precision is forty percent because two
out of five retrieved documents are relevant. Hence, for any desired level of recall
it is possible to compute precision. Graphing precision at various points of recall is
referred to as a precision/recall curve.
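A minimal sketch of this computation, using the ten-document example above (relevant documents at ranks two and five, and exactly two relevant documents in the collection):

```python
def precision_recall_points(relevance: list[int], total_relevant: int):
    """Return (recall, precision) after each retrieved document."""
    points, hits = [], 0
    for i, rel in enumerate(relevance, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / i))
    return points


# Relevant documents appear at ranks 2 and 5; only two exist in the collection.
ranking = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
for recall, precision in precision_recall_points(ranking, total_relevant=2):
    print(f"recall={recall:.0%}  precision={precision:.0%}")
# After 2 documents: recall=50%, precision=50%.
# After 5 documents: recall=100%, precision=40%.
```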
   For known-item tasks, each query is scored by the reciprocal of the rank at which the correct answer is retrieved, i.e., 1/rank. Therefore, if the correct answer is found at rank 1, a weight of 1 is assigned; a rank of 2 yields a weight of 1/2, a rank of 3 a weight of 1/3, and so on.
   The Mean Reciprocal Ranking (MRR) of a system is:

$$\mathrm{MRR} = \frac{1}{n}\sum_{q=1}^{n}\frac{1}{\mathrm{rank}_q}, \tag{3}$$
where:
  • rankq is the rank of the retrieved correct answer for that query,
  • n is the number of queries posed to the system,
  • MRR is that reciprocal ranking averaged over the set of queries.
   An MRR of 0.25 means that, on average, the system finds the known item at position four of the result set; an MRR of 0.75 means it finds the item between ranks 1 and 2 on average. Thus, the effectiveness of the system increases as the MRR approaches 1.
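A minimal sketch of this calculation, assuming the rank of the correct answer is known for each query; treating queries whose correct answer was not retrieved as contributing 0 is an assumption here, not a convention stated in the text.

```python
def mean_reciprocal_rank(ranks: list[int | None]) -> float:
    """Average of 1/rank over all queries; `None` means the correct
    answer was not retrieved (scored as 0 here by assumption)."""
    reciprocals = [1.0 / r if r else 0.0 for r in ranks]
    return sum(reciprocals) / len(reciprocals)


# Ranks of the correct answer for four hypothetical known-item queries.
print(mean_reciprocal_rank([1, 2, 4, None]))   # (1 + 0.5 + 0.25 + 0) / 4 = 0.4375
```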
   In the prior sections we have reviewed the problem of understanding relevance
and the simplifications to this idea that were needed to make evaluations possible.
We also examined some of the history of these evaluations and briefly talked about
some of the different tasks users engage in. Lastly, we discussed the metrics that
are used when evaluating these tasks and provided definitions of the most commonly
used metrics for popular search tasks (ad hoc retrieval typically uses precision/recall,
known-item uses MRR). For a more in-depth review of these common evaluation
metrics see [49] and [53]. In the next section we will examine web search.
5. Web Search Tasks

   Some basic facts about web search behavior are known. The general belief is that the majority of web searchers are interested in a small number (often one) of highly relevant pages. This is consistent with the aspects of web searching that have been measured from large query logs: the average web query is 2.21 terms in length [41], users view only the top 10 results for 85% of their queries, and they do not revise their query after the first try for 75% of their queries [56]. It is also widely
believed that web search services are being optimized to retrieve highly relevant
documents with high precision at low levels of recall, features desirable for support-
ing known-item search. Singhal and Kaszkiel propose, “site-based grouping done by
most commercial web search engines artificially depresses the precision value for
these engines . . . because it groups several relevant pages under one item. . .” [57].
   In order to answer the many questions web search evaluation demands, however, a
more in-depth investigation into the nature of queries and tasks used in web search is
needed. Spink gave a basis for classifying web queries as informational, navigational
or transactional [58], but no large-scale studies have definitively quantified the ratio
of web queries for the various tasks defined. Broder defined similar classifications and presented a study of Altavista™ users, via a popup survey and a self-admittedly "soft" query log analysis, indicating that less than half of users' queries are informational in nature [10]. That study found that users' tasks could be classified into three main types: navigational, informational, and transactional.
  (1) Navigational (ebay, 1040 ez form, amazon, google, etc.).
  (2) Informational (black bears, rock climbing, etc.).
  (3) Transactional (plane ticket to atlanta, buy books, etc.).
Are these tasks so fundamentally different such that the informational type of eval-
uation most commonly used in retrieval experiments (e.g., TREC) does not help us
understand true system effectiveness? We think that there are enough differences
in the tasks that traditional informational evaluations using metrics such as preci-
sion/recall alone may not provide the best insight into system effectiveness for all
tasks. Rather, a combination of precision/recall with mean reciprocal ranking may
be prudent.
   In Table I we show the top 20 queries from an AOL™ web search interface over a one-week period in November 2003. Thirteen of the top queries are navigational, i.e., looking for a single target site to go to; these queries have no informational intent. The remaining seven are looking for information, but rather than the full body of information about, say, "weather," the user is probably just looking for the best site to provide that day's "weather" forecast. This concept of a single authority for a given need is fundamentally different from the simplification most evaluations make, where all documents are considered equally relevant.
Table I. Top 20 Queries (Without Sexually Explicit Terms)
   We could try to evaluate these systems using solely precision/recall by setting the total number of relevant documents to 1 and applying the usual precision/recall evaluations. The major issue with this is that precision/recall evaluations use the area under the curve to show the tradeoff between retrieving more documents to achieve higher recall and the effect this has on precision as more documents are retrieved. MRR evaluations give us a better understanding of a system's effectiveness in finding, and ranking highly, the single best site or item being sought by the user.
   When examining the deeper ranks in the query logs (by frequency) over time, we find that some queries, such as those for current events like recent movies and news items, move up in rank (or newly appear), while other queries move down in rank or drop off the list altogether. As the list is examined further down, we start to find more traditional informational queries. What the reader should take from this is that web users may not be using these systems solely to find traditional information on topics, but rather as a way of navigating this large system of services and sites. This
is one of the main reasons that precision/recall should not be the only metric systems
are evaluated against. Thus, we may need several interpretations of relevance given
a task.
One manual evaluation used the rank of manually judged homepages as its measure and found web engines' effectiveness to be superior to that of a TREC system in 2001 [57].
Fig. 3. Top 2 million ranked queries vs. their coverage for a 1-week period.
Fig. 4. Top 10 thousand ranked queries vs. their coverage for a 1-week period.
queries occur fewer than 5 times. This implies that, on average, roughly half of the query stream is constantly changing, or that users look for something, then move on, and that behavior is not repeated by much of the population. Nonetheless, we still have
not answered the question: do the most frequent queries change?
   To answer that question we examined the similarity of the top queries from month-
to-month. Two metrics are used to examine the changes in the query stream over
these time periods: overlap and rank stability. The goal of this examination is to see
how stable the top queries are in terms of these metrics.
   Overlap is the ratio of the intersection of the top queries over the union. We exam-
ine the similarity of the top queries over time in Figure 6, where each month is added
to the calculation. Thus, the denominator is the union of the top 30,000 queries for
each consecutive month. Examining Figure 6 we see that the overlap similarity of
the top queries diminishes over the year. This means that the top queries are changing, which raises the question: are the queries that are stable, i.e., not changing, at least consistent in rank rather than fluctuating greatly?
                         
$$\mathrm{Olap} = \frac{|l_1 \cap l_2|}{|l_1 \cup l_2|} \quad \text{(overlap)}. \tag{4}$$
To answer that question we examine the intersection of the top queries from two months in Figure 7. We compare those sets using the Pearson correlation coefficient [51]. The Pearson coefficient is −1 if the scores in the two ranked lists are exactly opposite, 1 if they are the same, and 0 if there is no statistical correlation between the two scored lists. Figure 7 shows that while there is a statistical correlation between the rankings of two months, it is only moderately strong, which suggests that while the top queries may be similar, the frequencies of users' interests are changing.
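A minimal sketch of these two measures, assuming each month's log has been reduced to a mapping from query string to its frequency; the month labels and counts are illustrative only, not real data.

```python
from statistics import correlation   # Pearson's r (Python 3.10+)


def overlap(top1: set[str], top2: set[str]) -> float:
    """Overlap of two top-query sets: |intersection| / |union|, as in Eq. (4)."""
    return len(top1 & top2) / len(top1 | top2)


# Illustrative query-frequency maps for two consecutive months.
month1 = {"weather": 900, "ebay": 850, "maps": 400, "news": 300}
month2 = {"weather": 880, "ebay": 700, "news": 500, "games": 450}

print(overlap(set(month1), set(month2)))   # 3 shared queries / 5 distinct = 0.6

# Rank stability: correlate the frequencies of the queries common to both months.
common = sorted(set(month1) & set(month2))
r = correlation([month1[q] for q in common], [month2[q] for q in common])
print(r)
```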