Introduction to Information Retrieval Systems

The document provides an introduction to information retrieval systems. It defines information retrieval as obtaining relevant information resources from a collection to satisfy an information need. The objectives of information retrieval systems are to minimize the time and effort for users to find needed information. Key concepts discussed include the difference between data and information retrieval, how information retrieval aims to interpret documents and rank them by relevance to the user's need, and how the user's task and the system's organization of documents affect effective retrieval of relevant information.

Module – I Introduction

Topics :
• Definition and objectives of an information retrieval system
• Motivation
• Basic Concepts
• Past and Future
• The Retrieval Process
• Information System: Components, parts, and types of information systems
.......................................................................................................................................................

Definition :

- What do you mean by information retrieval?


Information retrieval (IR) is the activity of obtaining information system resources that are
relevant to an information need from a collection of those resources (source: Wikipedia).
Searches can be based on full-text or other content-based indexing.

Information retrieval is the science of searching for information in a document, searching for
documents themselves, and also searching for the metadata that describes data, and for databases of
texts, images or sounds.

-- Web search engines are the most visible IR applications.


Information retrieval (IR) deals with the representation, storage, organization of, and access to
information items. The representation and organization of the information items should provide the
user with easy access to the information in which he is interested.

Objective :
The general objective of an Information Retrieval System is to minimize the overhead of a
user locating needed information. Overhead can be expressed as the time a user spends in
all of the steps leading to reading an item containing the needed information (e.g., query
generation, query execution, scanning results of query to select items to read, reading non-
relevant items). The success of an information system is very subjective, based upon what
information is needed and the willingness of a user to accept overhead.
In information retrieval the term “relevant” item is used to represent an item
containing the needed information. In reality the definition of relevance is not a binary
classification but a continuous function. From a user’s perspective “relevant” and
“needed” are synonymous. From a system perspective, information could be relevant to a
search statement (i.e., matching the criteria of the search statement) even though it is not
needed/relevant to user (e.g., the user already knew the information).

Question : Why is characterizing the user information need not a simple problem?


Answer : Consider, for instance, the following hypothetical user information need in the context of
the World Wide Web (or just the Web):
Find all the pages (documents) containing information on college tennis teams which:
(1) are maintained by a university in the USA and (2) participate in the NCAA tennis
tournament. To be relevant, the page must include information on the national ranking
of the team in the last three years and the email or phone number of the team coach.

Clearly, this full description of the user information need cannot be used directly to request
information using the current interfaces of Web search engines. Instead, the user must first translate
this information need into a query which can be processed by the search engine (or IR system).
In its most common form, this translation yields a set of keywords (or index terms) which
summarizes the description of the user information need. Given the user query, the key goal of an
IR system is to retrieve information which might be useful or relevant to the user. The emphasis is
on the retrieval of information as opposed to the retrieval of data.

Motivation :
1) Information versus Data Retrieval
Data retrieval, in the context of an IR system, consists mainly of determining which documents of a
collection contain the keywords in the user query which, most frequently, is not enough to satisfy
the user information need. In fact, the user of an IR system is concerned more with retrieving
information about a subject than with retrieving data which satisfies a given query. A data retrieval
language aims at retrieving all objects which satisfy clearly defined conditions such as those in a
regular expression or in a relational algebra expression. Thus, for a data retrieval system, a single
erroneous object among a thousand retrieved objects means total failure. For an information
retrieval system, however, the retrieved objects might be inaccurate and small errors are likely to go
unnoticed. The main reason for this difference is that information retrieval usually deals with
natural language text which is not always well structured and could be semantically ambiguous. On
the other hand, a data retrieval system (such as a relational database) deals with data that has a well
defined structure and semantics.
Data retrieval, while providing a solution to the user of a database system, does not solve the
problem of retrieving information about a subject or topic. To be effective in its attempt to satisfy
the user information need, the IR system must somehow `interpret' the contents of the information
items (documents) in a collection and rank them according to a degree of relevance to the user
query. This `interpretation' of a document content involves extracting syntactic and semantic
information from the document text and using this information to match the user information need.
The difficulty is not only knowing how to extract this information but also knowing how to use it to
decide relevance. Thus, the notion of relevance is at the center of information retrieval. In fact, the
primary goal of an IR system is to retrieve all the documents which are relevant to a user query
while retrieving as few non-relevant documents as possible.

Aspect                 Data Retrieval (DR)    Information Retrieval (IR)
---------------------  ---------------------  --------------------------
Matching               Exact match            Partial match, best match
Inference              Deduction              Induction
Model                  Deterministic          Probabilistic
Classification         Monothetic             Polythetic
Query language         Artificial             Natural
Query specification    Complete               Incomplete
Items wanted           Matching               Relevant
Error response         Sensitive              Insensitive
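
The "Matching" contrast between data retrieval and information retrieval can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the three toy documents, the function names, and the overlap-count score, which stands in for a real relevance model.

```python
import re

# Toy collection (hypothetical documents, for illustration only).
docs = {
    "d1": "college tennis team ranking",
    "d2": "tennis coach email and phone",
    "d3": "car racing at Le Mans",
}

def data_retrieval(pattern, docs):
    """Exact match: return every document that satisfies the condition."""
    regex = re.compile(pattern)
    return sorted(d for d, text in docs.items() if regex.search(text))

def information_retrieval(query, docs):
    """Best match: score each document by the query words it shares, then rank."""
    q = set(query.lower().split())
    scores = {d: len(q & set(text.lower().split())) for d, text in docs.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(d, s) for d, s in ranked if s > 0]

exact = data_retrieval(r"tennis team", docs)
ranked = information_retrieval("tennis team coach", docs)
```

Here the exact-match condition accepts only the document containing the literal phrase, while the best-match query also returns a document that shares just part of the query.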

2) Information Retrieval at the Center of the Stage

In the past 20 years, the area of information retrieval has grown well beyond its primary goals of
indexing text and searching for useful documents in a collection. Nowadays, research in IR includes
modeling, document classification and categorization, systems architecture, user interfaces, data
visualization, filtering, languages, etc. Despite its maturity, until recently, IR was seen as a narrow
area of interest mainly to librarians and information experts. Such a tendentious vision prevailed for
many years, despite the rapid dissemination, among users of modern personal computers, of IR
tools for multimedia and hypertext applications. In the beginning of the 1990s, a single fact changed
once and for all these perceptions -- the introduction of the World Wide Web.
The Web is becoming a universal repository of human knowledge and culture which has allowed
unprecedented sharing of ideas and information on a scale never seen before.
Despite so much success, the Web has introduced new problems of its own. Finding useful
information on the Web is frequently a tedious and difficult task. For instance, to satisfy his
information need, the user might navigate the space of Web links (i.e., the hyperspace) searching for
information of interest. However, since the hyperspace is vast and almost unknown, such a
navigation task is usually inefficient. For naive users, the problem becomes harder, which might
entirely frustrate all their efforts. The main obstacle is the absence of a well defined underlying data
model for the Web, which implies that information definition and structure is frequently of low
quality. These difficulties have attracted renewed interest in IR and its techniques as promising
solutions. As a result, almost overnight, IR has gained a place with other technologies at the center
of the stage.

Basic Concepts :

The effective retrieval of relevant information is directly affected both by the user task and by the
logical view of the documents adopted by the retrieval system.

1) The User Task


The user of a retrieval system has to translate his information need into a query in the language
provided by the system. With an information retrieval system, this normally implies specifying a set
of words which convey the semantics of the information need. With a data retrieval system, a query
expression (such as, for instance, a regular expression) is used to convey the constraints that must
be satisfied by objects in the answer set. In both cases, we say that the user searches for useful
information executing a retrieval task.
Consider now a user who has an interest which is either poorly defined or which is inherently broad.
For instance, the user might be interested in documents about car racing in general. In this situation,
the user might use an interactive interface to simply look around in the collection for documents
related to car racing. For instance, he might find interesting documents about Formula 1 racing,
about car manufacturers, or about the `24 Hours of Le Mans.' Furthermore, while reading about the
`24 Hours of Le Mans', he might turn his attention to a document which provides directions to Le
Mans and, from there, to documents which cover tourism in France. In this situation, we say that the
user is browsing the documents in the collection, not searching. It is still a process of retrieving
information, but one whose main objectives are not clearly defined in the beginning and whose
purpose might change during the interaction with the system.

Fig.1 : Interaction of the user with the retrieval system through distinct tasks.

Fig.1 shows two distinct types of task : information or data retrieval and browsing. Classic
information retrieval systems normally allow information or data retrieval. Hypertext systems are
usually tuned for providing quick browsing. Modern digital library and Web interfaces might
attempt to combine these tasks to provide improved retrieval capabilities. However, combination of
retrieval and browsing is not yet a well established approach and is not the dominant paradigm
(might become so in the future).
Both retrieval and browsing are, in the language of the World Wide Web, `pulling' actions. That is,
the user requests the information in an interactive manner. An alternative is to do retrieval in an
automatic and permanent fashion using software agents which push the information towards the
user. For instance, information useful to a user could be extracted periodically from a news service.
In this case, we say that the IR system is executing a particular retrieval task which consists of
filtering relevant information for later inspection by the user.

2) Logical View of the Documents


Due to historical reasons, documents in a collection are frequently represented through a set of
index terms or keywords. Such keywords might be extracted directly from the text of the document
or might be specified by a human subject (as frequently done in the information sciences arena). No
matter whether these representative keywords are derived automatically or generated by a specialist,
they provide a logical view of the document.
Modern computers are making it possible to represent a document by its full set of words. In this
case, we say that the retrieval system adopts a full text logical view (or representation) of the
documents. With very large collections, however, even modern computers might have to reduce the
set of representative keywords. This can be accomplished through the elimination of stopwords
(such as articles and connectives), the use of stemming (which reduces distinct words to their
common grammatical root), and the identification of noun groups (which eliminates adjectives,
adverbs, and verbs). Further, compression might be employed. These operations are called text
operations (or transformations) and are covered in detail in a later module. Text operations reduce the
complexity of the document representation and allow moving the logical view from that of a full
text to that of a set of index terms.

Fig.2 : Logical view of a document: from full text to a set of index terms.

The full text is clearly the most complete logical view of a document but its usage usually implies
higher computational costs. A small set of categories (generated by a human specialist) provides the
most concise logical view of a document but its usage might lead to retrieval of poor quality.
Several intermediate logical views (of a document) might be adopted by an information retrieval
system as illustrated in Fig.2. Besides adopting any of the intermediate representations, the retrieval
system might also recognize the internal structure normally present in a document (e.g., chapters,
sections, subsections, etc.). This information on the structure of the document might be quite useful
and is required by structured text retrieval models.
As illustrated in Fig.2, we view the issue of logically representing a document as a continuum in
which the logical view of a document might shift (smoothly) from a full text representation to a
higher level representation specified by a human subject.

The Retrieval Process :


To understand the retrieval process, we use a simple and generic software architecture as shown in
Fig.3. First of all, before the retrieval process can even be initiated, it is necessary to define the text
database. This is usually done by the manager of the database, who specifies the following: (a) the
documents to be used, (b) the operations to be performed on the text, and (c) the text model (i.e., the
text structure and what elements can be retrieved). The text operations transform the original
documents and generate a logical view of them.
Once the logical view of the documents is defined, the database manager (using the DB Manager
Module) builds an index of the text. An index is a critical data structure because it allows fast
searching over large volumes of data. Different index structures might be used, but the most popular
one is the inverted file as indicated in Fig.3. The resources (time and storage space) spent on
defining the text database and building the index are amortized by querying the retrieval system
many times.
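
As a rough illustration of why the index pays for itself, here is a minimal sketch of building an inverted file. The toy documents and the function name `build_inverted_file` are assumptions, and whitespace splitting stands in for the text operations.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical three-document collection.
docs = {
    1: "information retrieval system",
    2: "database system",
    3: "retrieval of information",
}
index = build_inverted_file(docs)
```

The index is built once by scanning every document. Each later query then only needs a dictionary lookup per term instead of a scan of the whole collection.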

Fig.3: The process of retrieving information

Given that the document database is indexed, the retrieval process can be initiated. The user first
specifies a user need which is then parsed and transformed by the same text operations applied to
the text. Then, query operations might be applied before the actual query, which provides a system
representation for the user need, is generated. The query is then processed to obtain the retrieved
documents. Fast query processing is made possible by the index structure previously built.
Before being sent to the user, the retrieved documents are ranked according to a likelihood of
relevance. The user then examines the set of ranked documents in the search for useful information.
At this point, he might pinpoint a subset of the documents seen as definitely of interest and initiate a
user feedback cycle. In such a cycle, the system uses the documents selected by the user to change
the query formulation. Hopefully, this modified query is a better representation of the real user
need.
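
One very simple way to picture the feedback cycle is query expansion: terms that are frequent in the documents the user selected are appended to the query. The sketch below is only an illustrative heuristic under that assumption. It is not a method named in these notes, and the function name and the `extra` parameter are hypothetical.

```python
from collections import Counter

def expand_query(query_terms, selected_docs, extra=2):
    """Add the `extra` most frequent new terms from the user-selected documents."""
    counts = Counter()
    for text in selected_docs:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    # Sort by descending count, then alphabetically, for a deterministic result.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:extra]
    return list(query_terms) + [term for term, _ in best]

new_query = expand_query(
    ["racing"],
    ["formula racing cars", "formula racing teams cars"],
    extra=2,
)
```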
Consider now the user interfaces available with current information retrieval systems (including
Web search engines and Web browsers). We first notice that the user almost never declares his
information need. Instead, he is required to provide a direct representation for the query that the
system will execute. Since most users have no knowledge of text and query operations, the query
they provide is frequently inadequate. Therefore, it is not surprising to observe that poorly
formulated queries lead to poor retrieval (as happens so often on the Web).
A traditional information retrieval system must perform two main tasks, building a retrieval
database from its set of documents and accessing this database to retrieve relevant documents for
the user (Fig. 4).
The first task involves extracting from each document the set of its representative terms, which are
associated with the document in the database through a set of auxiliary structures, such as inverted
indexes and postings files (Salton and McGill 1983). The second task starts with a user query,
which is used by the system to access relevant information from the retrieval database. This
information is then returned to the user, preferably ranked by order of relevance.
Fig. 4. Information retrieval system model

Information Retrieval is typically a two-step process:


(i) First potentially relevant documents are identified
(ii) And then found documents are ranked

The identification process is often conducted as set intersection – from the set of all documents the
potentially relevant documents are those that contain all or some of the search items.
Ranking involves combining a set of heuristics derived from the corpus, the result set, and
individual documents. Typical heuristics include tf (term frequency), idf (inverse document
frequency), proximity measures, etc. The similarity of each document to the query is computed and
the documents are sorted according to the ranking function based on these heuristics.
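
The two steps can be sketched as follows. The sketch makes several assumptions: the toy documents are invented, candidates are the documents containing at least one query term, and the ranking function is a plain sum of tf × idf with idf = log(N / df), which is only one of several common variants.

```python
import math
from collections import Counter

# Hypothetical collection, already tokenized by whitespace.
docs = {
    "d1": "tennis team ranking tennis",
    "d2": "tennis coach",
    "d3": "car racing",
}
tokens = {d: text.split() for d, text in docs.items()}

def candidates(query):
    """Step (i): documents containing at least one query term."""
    return {d for d, toks in tokens.items() if set(query) & set(toks)}

def idf(term):
    """Inverse document frequency, log(N / df); 0 for unseen terms."""
    df = sum(1 for toks in tokens.values() if term in toks)
    return math.log(len(tokens) / df) if df else 0.0

def rank(query):
    """Step (ii): sort candidates by the sum of tf * idf over the query terms."""
    scores = {}
    for d in candidates(query):
        tf = Counter(tokens[d])
        scores[d] = sum(tf[t] * idf(t) for t in query)
    return sorted(scores, key=lambda d: (-scores[d], d))

result = rank(["tennis", "coach"])
```

In this toy collection d2 ranks above d1. Although "tennis" occurs twice in d1, the rarer term "coach" has a higher idf and appears only in d2.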

Components of IRS :
An information retrieval system has two main components: the indexing system and the query
system.
The first is in charge of analyzing the documents downloaded from the Web and creating the
indexes that allow search queries to be made, while the second is the search engine's visible
interface, that is, the part with which users interact.

The various components of an Information Retrieval System are as follows:


1. Indexing :
For an IR system to judge efficiently whether the documents in a corpus match a given query, a
pre-processing step called indexing is usually applied. Indexing determines how documents are
managed in the collection. To make searching more efficient, a retrieval system stores documents in
an abstract representation: a set of keywords is stored, along with links to the documents in which
each word appears. Although there are other options, the most popular data structure for storing
this indexing information is the inverted file (IF). An IF is a transposed representation of the
original document collection, organised in posting lists. Each entry in the inverted file contains
information about a single term in the document collection. Since this structure requires a large
amount of storage space, posting lists are usually compressed.
The indexing process includes several steps, which are described as follows:
1.1 Tokenization: The first stage of the indexing process is typically known as tokenization. In this
phase, the document text is parsed and index words called tokens are generated. In addition, at this
stage, all characters in the tokens are often lower-cased and all punctuation is removed.
Text in different languages may be stored in different character encodings. We assume all the
documents (English as well as Hindi) are encoded in Unicode using UTF-8, which represents each
character with one or more 8-bit bytes.
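
A minimal sketch of this stage is shown below. It is an English-only simplification: the pattern `[a-z0-9]+` and the function name `tokenize` are assumptions, and a tokenizer for Hindi would also have to keep Devanagari letters together with their combining vowel signs. The last lines illustrate the multi-byte nature of UTF-8 on a four-code-point Hindi word.

```python
import re

def tokenize(text):
    """Lower-case the text and keep runs of ASCII letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("Information Retrieval, (IR) deals with text!")

# UTF-8 is a variable-length encoding: ASCII letters take one byte each,
# while each Devanagari code point takes three bytes.
hindi_word = "\u0928\u0940\u091a\u0947"  # four code points
code_points = len(hindi_word)
utf8_bytes = len(hindi_word.encode("utf-8"))
```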

1.2 Stop Words Removal: Luhn pointed out that the frequency of a term within a document can be a
good discriminator of its significance in the document. In addition, there are many extremely
frequent terms (e.g. “the”) that appear in almost all documents of a corpus. These terms are called
stop words, which bring little value for the purpose of representing the content of documents and
are normally filtered out from the list of potential indexing terms during the indexing process [58].
Removing the stop words also reduces the size of the generated document index.
However, removing stopwords from one document at a time is time consuming. A cost-effective
approach consists of removing all terms which appear commonly in the document collection and
which will not improve retrieval of relevant documents. We have categorized stopwords into two
categories: Relational (नीचे (below), ऊपर (above), आगे (ahead), अंदर (inside), etc. in Hindi; above,
below, inside, outside, etc. in English) and Non-relational (is, are, an, a, etc. in English).
These stopwords have different impacts on the information retrieval process. Relational stopwords
carry semantic information that is necessary for effective information retrieval. Removing relational
stopwords from a document would lose this information and so reduce the relevance of the system's
results, whereas removing non-relational stopwords reduces the document length, resulting in
faster search. We remove only non-relational stopwords
to perform relation inclusive searching. After applying stop words removal to
our example sentence, the text is reduced to the following:
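
The sketch below illustrates this selective removal on a made-up sentence; it is not the example sentence referred to above. The two word sets are small assumed samples built from the categories listed earlier (adding "the" to the non-relational set is an assumption), and the function name is hypothetical.

```python
# Assumed sample lists; a real system would use much larger ones.
NON_RELATIONAL = {"is", "are", "an", "a", "the"}
RELATIONAL = {"above", "below", "inside", "outside"}

def remove_stopwords(tokens):
    """Drop non-relational stopwords; keep relational ones and content words."""
    return [t for t in tokens if t not in NON_RELATIONAL]

tokens = "the book is inside a box".split()
kept = remove_stopwords(tokens)

# Relational stopwords survive, so the relation between the terms is preserved.
relations = [t for t in kept if t in RELATIONAL]
```

With this input the content words "book" and "box" are kept together with the relational stopword "inside", while "the", "is", and "a" are dropped.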

1.3 Stemming: Often, a user specifies a term in a query when only a variant of this term is contained
in a relevant document. Hence, it would be beneficial for retrieval if documents containing variants
of the query term were also considered. Plurals, gerund forms, and past tense suffixes are examples
of syntactical variations which prevent a perfect match between a query term and the corresponding
document term. This problem can be alleviated by applying stemming, which replaces a term with its
stem, so that different grammatical forms of a term are represented in a common base form. A stem
is the portion of a word which is obtained after chopping off its affixes. Stemming thus refers to
the process of reducing terms to their stems or root variants.
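
To make the idea of chopping off affixes concrete, here is a deliberately naive suffix stripper. It is a sketch only: the suffix list, the three-letter minimum, and the name `naive_stem` are assumptions, and it is not one of the established stemming algorithms.

```python
# Assumed suffix list, checked in this order (longest first).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["connects", "connected", "connecting", "connect"]]
```

All four forms reduce to the same stem, so a query for one variant can match documents containing the others. A stripper this crude will also conflate or mangle many words, which is why practical systems use more careful rule sets.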

2. Index Data Structure:


To enable efficient access to document representatives, a suitable data structure is necessary. The
most widely used data structure is the inverted index, which is a word-oriented mechanism. In
general, the inverted index structure contains two components: the vocabulary and the posting lists
(also called occurrences). The vocabulary is the set of all distinct terms extracted from the corpus
by the above steps. The posting lists store each vocabulary term's statistics in each document, such
as term frequency and term position.

Format of Posting List:


<d>,<n>:[[<pos_1>#{<relation_1>}],[<pos_2>#{<relation_2>}],...,[<pos_n>#{<relation_n>}]]
where,
<d>: document name
<n>: term frequency in document <d>
<pos_j>: j-th position of the term in document <d>
<relation_j>: relation represented by the term at the j-th position. It is present only in the case of
relational stopwords.
For relational stopwords, the relation information is stored in the posting list along with each
position.
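
A small sketch of how one posting could be serialised in this format follows. Several details are assumptions, since the notes do not specify them: the function name `format_posting`, the relation label "location", and the choice to write a position as `[<pos>]` with no `#{...}` part when the term is not a relational stopword.

```python
def format_posting(doc, positions, relations=None):
    """Serialise one posting as <d>,<n>:[[<pos>#{<relation>}],...]."""
    relations = relations or {}
    entries = []
    for pos in positions:
        if pos in relations:
            # Relational stopword: attach the relation to the position.
            entries.append(f"[{pos}#{{{relations[pos]}}}]")
        else:
            # Assumed form for ordinary terms: position only.
            entries.append(f"[{pos}]")
    return f"{doc},{len(positions)}:[{','.join(entries)}]"

relational = format_posting("doc1.txt", [4, 17], {4: "location", 17: "location"})
ordinary = format_posting("doc2.txt", [3])
```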

3. Query Parser:

It performs tokenization, stemming, and stop words removal operations on the query so that it is
easy to match the query terms against the indexed documents.

4. Matching:

Given a query, an ideal IR system should return only relevant documents and rank them in
decreasing order of relevance. In this phase, all the documents containing query terms are retrieved
from the inverted index structure. The relevance of a document to a given query can be estimated by
various IR models, such as the Boolean Model (BM), the Vector Space Model (VSM), and the
Probabilistic Model (PM).
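
The difference between the first two models can be sketched in a few lines. The sketch makes simplifying assumptions: the Boolean query is treated as an implicit AND of its terms, and the vector space score is the cosine between raw term-frequency vectors with no idf weighting. Both function names are hypothetical.

```python
import math
from collections import Counter

def boolean_and(query, document):
    """Boolean model with an implicit AND: every query term must be present."""
    return set(query.split()) <= set(document.split())

def cosine(query, document):
    """Vector space model: cosine between raw term-frequency vectors."""
    q, d = Counter(query.split()), Counter(document.split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

is_match = boolean_and("tennis coach", "tennis team")
partial = cosine("tennis coach", "tennis team")
```

The Boolean model rejects the document outright because "coach" is missing, whereas the vector space model assigns it a partial score that can be used for ranking.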

5. Ranking: Finally, all the retrieved documents are ranked according to their
relevance score, computed using a learned ranking function.

6. User Interface:

The interface manages interaction with the user by taking a query as input and displaying
documents according to their relevance score as output.

Three types of information systems:

1) Information-Retrieval Systems (IR)

• Search large bodies of information which are not specifically formatted as formal databases
• Web search engine
• Keyword search of a text base
• Typically read-only

2) Database Management Systems (DBMS)

• Relatively small schema
• Large body of homogeneous data
• Minor or no deductive capability
• Extensive formal update capability
• Shared use for both read and write

3) Knowledge-Base Systems (KBS)

• Relatively small body of heterogeneous information
• Significant deductive capability
• Typical use: support of an intelligent application

Applications of IR:
• Indexing
• Ranked retrieval
• Web search
• Query processing
• Online display advertising
• Automatic delivery of news/alerts
• Publish/subscribe systems
