IR Chapter 1

Uploaded by

abdelaj087
Chapter One

Information Retrieval
Faris A.

Information Retrieval Systems?
• Document (Web page) retrieval in response to a query
  - Quite effective (at some things)
  - Commercially successful (some of them)
• But what goes on behind the scenes?
  - How do they work?
  - What happens beyond the Web?
• Web search systems: Lycos, Excite, Yahoo, Google, Live, Northern Light,
  Teoma, HotBot, Baidu, …

Examples of IR systems
• Conventional (library catalog): search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST): search by keywords; limited
  search using queries in natural language.
• Multimedia (IBM's QBIC, WebSEEk, SaFe): search by visual appearance
  (shapes, colors, …).
• Question answering systems (AskJeeves, AnswerBus): search in (restricted)
  natural language.
• Other:
  - Cross-language information retrieval
  - Music retrieval

WebSEEk Search Engine

What is Information Retrieval?
• A good formal definition of information retrieval is given in
  Baeza-Yates & Ribeiro-Neto (1999):
  “Information retrieval deals with representation, storage, organization
  of, and access to information items. The organization and access of
  information items should provide the user with easy access to the
  information in which he is interested.”
• The definition incorporates all important features of a good information
  retrieval system:
  - Representation
  - Storage
  - Organization
  - Access
• The focus is mainly on the user information need.

Information Retrieval
• Information retrieval (IR) is the process of finding material (usually
  documents) of an unstructured nature (usually text) that satisfies an
  information need from within large collections (usually stored on
  computers).
• Information is organized into (a large number of) documents.
• Large collections of documents come from various sources: news articles,
  research papers, books, digital libraries, Web pages, etc.
• Example: Web search engines like Google claim to index trillions of pages.

General Goal of Information Retrieval
• To help users find useful information based on their information needs
  (with minimum effort) despite:
  - Increasing complexity of information
  - Changing needs of users
• Provide immediate random access to the document collection.
• Retrieval systems such as Google and Yahoo are developed with this aim.

Information Retrieval vs. Data Retrieval
• The emphasis of IR is on the retrieval of information, rather than on the
  retrieval of data.
• Data retrieval
  - Consists mainly of determining which documents contain a set of
    keywords in the user query (which is not enough to satisfy the user
    information need)
  - Aims at retrieving all objects that satisfy well-defined semantics
  - A single erroneous object among a thousand retrieved objects implies
    failure
  - Mainly designed for structured databases
• Information retrieval
  - Is concerned with retrieving information about a subject or topic
    rather than retrieving data that satisfies a given query
  - Semantics are frequently loose: the retrieved objects might be
    inaccurate
  - Small errors are tolerated

Information Retrieval vs. Data Retrieval
• An example of a data retrieval system is a relational database.

                     Data Retrieval                          Info Retrieval
Data organization    Structured                              Unstructured
Fields               Clear semantics (ID, Name, age, …)      No fields (other than text, images, etc.)
Matching             Exact (results are always “correct”)    Partial match, best match
Items wanted         Matching                                Relevant
Accuracy             100%                                    < 50%
Error response       Sensitive                               Insensitive

Why is IR so hard?
• Traditional information retrieval (IR) systems attempt to find relevant
  documents to respond to a user’s request.
• Information retrieval problem: locating relevant documents based on user
  input, such as keywords or example documents.
• The real problem boils down to matching the language of the query to the
  language of the document.
• Simply matching on words is a very brittle approach (it has no
  elasticity). One word can have several different meanings. Consider
  “take”:
  - “take a place at the table”
  - “take money to the bank”
  - “take a picture”

More Problems with IR
• You can’t even tell what part of speech a word has:
  - “I saw her duck”
  - A query that searches for “pictures of a duck” will find documents
    that contain:
    “I saw her duck away from the ball falling from the sky”
• Proper nouns often reuse regular nouns:
  - Consider a document with “a man named Abraham owned a Lincoln”
  - A word-matching query for “Abraham Lincoln” may well find the above
    document.
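
The brittleness of pure word matching can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper names `tokens`, `matches_all_words` and `matches_any_word` are hypothetical, and a "match" here means nothing more than shared lowercase words.

```python
import re

def tokens(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def matches_all_words(query, document):
    """True when every query word occurs somewhere in the document."""
    doc_words = set(tokens(document))
    return all(word in doc_words for word in tokens(query))

def matches_any_word(query, document):
    """True when at least one query word occurs in the document."""
    doc_words = set(tokens(document))
    return any(word in doc_words for word in tokens(query))

lincoln_doc = "A man named Abraham owned a Lincoln"
duck_doc = "I saw her duck away from the ball falling from the sky"

# Both calls report a match, although neither document is about the topic
# the user had in mind (the president, or the bird).
print(matches_all_words("Abraham Lincoln", lincoln_doc))
print(matches_any_word("pictures of a duck", duck_doc))
```

Both calls print True: the matcher sees the same strings but has no notion of word sense or part of speech.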

Basic Concepts in Information Retrieval:
(i) User Task and (ii) Logical View of Documents

The User Task:
• There are two user tasks: retrieval and browsing.

[Diagram: the USER interacts with the DB through Retrieval or Browsing]

The User Task
Retrieval
• It is the process of retrieving information whereby the main
  objective is clearly defined from the onset of the search process.
• The user of a retrieval system has to translate his information need
  into a query in the language provided by the system.
• In this context (i.e. by specifying a set of words), the user
  searches for useful information by executing a retrieval task.
• English-language statement:
  I want a book by J. K. Rowling titled The Chamber of Secrets

Browsing
• It is the process of retrieving information whereby the main objective
  is not clearly defined from the beginning and may change during the
  interaction with the system.
• E.g. a user might search for documents about ‘car racing’. Meanwhile he
  might find interesting documents about ‘car manufacturers’. While reading
  about car manufacturers in Addis, he might turn his attention to a
  document providing ‘directions to Addis’, and from this to documents
  that cover ‘Tourism in Ethiopia’.
• In this context, the user is said to be browsing the collection rather
  than searching it, since the user may simply be interested in glancing
  around.

Logical View of Documents
• Documents in a collection are frequently represented by a set of index
  terms or keywords.
• Such keywords are mostly extracted directly from the text of the
  document.
• These representative keywords provide a logical view of the document.

[Diagram: Docs → Tokenization → Stop words → Stemming → Indexing, moving
the logical view from full text to index terms]

• Document representation is viewed as a continuum, in which the logical
  view of documents might shift from full text to index terms.

Logical view of documents
• If full text:
  - Each word in the text is a keyword
  - Most complex form
  - Expensive
• If the full text is too large, the set of representative keywords can be
  reduced through a transformation process called text operation.
  - It reduces the complexity of the document representation and allows
    moving the logical view from that of a full text to a set of index
    terms.
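
A minimal sketch of such text operations (tokenization, stop-word removal, stemming) is given below. Everything in it is an assumption made for illustration: the stop list is a tiny hand-picked set, and `crude_stem` is a toy suffix stripper, not a real stemming algorithm such as Porter's.

```python
import re

# Tiny illustrative stop list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "for"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip a few common English suffixes (toy rules, not Porter)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Full text -> sorted set of index terms (the reduced logical view)."""
    return sorted({crude_stem(t) for t in remove_stop_words(tokenize(text))})

full_text = "The indexing of documents is done for searching the documents"
print(index_terms(full_text))  # ['document', 'done', 'index', 'search']
```

Ten words of full text shrink to four index terms, which is the shift in logical view described above.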

Structure of an IR System
• An information retrieval system serves as a bridge between the world of
  authors and the world of readers/users.
  - That is, writers present a set of ideas in a document using a set of
    concepts. Users then query the IR system for relevant documents that
    satisfy their information need.

[Diagram: User ↔ Black box ↔ Documents]

The black box is the information retrieval system.

Structure of an IR System
• To be effective in its attempt to satisfy the information need of users,
  the IR system must ‘interpret’ the contents of documents in a collection
  and rank them according to their degree of relevance to the user query.
• Thus the notion of relevance is at the center of IR.
• The primary goal of an IR system is:
  - To retrieve all the documents that are relevant to a user query while
    retrieving as few non-relevant documents as possible

Structure of an IR System
Typical IR Task
• Given:
  - A corpus of textual natural-language documents.
  - A user query in the form of a textual string.
• Find:
  - A ranked set of documents that are relevant to the query.

[Diagram: the document corpus and the query string are inputs to the IR
system, which outputs ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, …]

Web Search System

[Diagram: a spider collects pages from the Web into the document corpus;
the corpus and the query string are inputs to the IR system, which outputs
ranked documents: 1. Page1, 2. Page2, 3. Page3, …]

Overview of the Retrieval process

The Retrieval Process
• It is necessary to define the text database before any of the retrieval
  processes are initiated.
• This is usually done by the manager of the database and includes
  specifying the following:
  - The documents to be used
  - The operations to be performed on the text
  - The text model to be used (the text structure and what elements can be
    retrieved)
• The text operations transform the original documents and the information
  needs and generate a logical view of them.

Retrieval Process …
• Once the logical view of the documents is defined, the database module
  builds an index of the text.
  - An index is a critical data structure.
  - It allows fast searching over large volumes of data.
• Different index structures might be used, but the most popular one is the
  inverted file.
• Given that the document database is indexed, the retrieval process can be
  initiated.
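
As a rough illustration of an inverted file, the sketch below maps each term to the list of documents that contain it. The three sample documents and the function name `build_inverted_index` are made up for the example, and the tokenizer is deliberately simplistic.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical three-document collection.
docs = {
    1: "information retrieval systems",
    2: "data retrieval in relational databases",
    3: "web search systems",
}
index = build_inverted_index(docs)
print(index["retrieval"])  # [1, 2]
print(index["systems"])    # [1, 3]
```

Looking up a term is a single dictionary access, so the search no longer has to scan every document.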
The Retrieval Process …
• The user first specifies a user need, which is then parsed and
  transformed by the same text operations applied to the text.
• Next, query operations are applied before the actual query, which
  provides a system representation for the user need, is generated.
• The query is then processed to obtain the retrieved documents.
• Before the retrieved documents are sent to the user, they are ranked
  according to their likelihood of relevance.
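
The steps above can be sketched as follows. This is a simplified illustration, not a real ranking model: the same `text_operation` is applied to documents and query, and the "likelihood of relevance" is approximated by the number of distinct query terms a document contains. All names and sample documents are hypothetical.

```python
import re
from collections import Counter, defaultdict

def text_operation(text):
    """Same normalisation for documents and queries: lowercase tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Inverted index: term -> set of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text_operation(text):
            index[term].add(doc_id)
    return index

def search(query, index):
    """Return doc IDs ordered by how many distinct query terms they contain."""
    scores = Counter()
    for term in set(text_operation(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    "Doc1": "Information retrieval systems rank documents",
    "Doc2": "Relational databases support data retrieval",
    "Doc3": "Cooking recipes",
}
index = build_index(docs)
print(search("Information Retrieval", index))  # ['Doc1', 'Doc2']
```

Doc1 matches both query terms and is ranked first; Doc3 shares no term with the query and is not retrieved at all.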
The Retrieval Process …
• The user then examines the set of ranked documents in the search for
  useful information. Two choices for the user:
  (i) Reformulate the query and run it on the entire collection, or
  (ii) Reformulate the query and run it on the result set
• At this point, s/he might pinpoint a subset of the documents seen as
  definitely of interest and initiate a user feedback cycle.
• In such a cycle, the system uses the documents selected by the user to
  change the query formulation.
• Hopefully, this modified query is a better representation of the real
  user need.
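
One simple way to picture a feedback cycle is query expansion: terms that occur often in the documents the user selected are appended to the query. The sketch below is an assumption made for illustration only (the slides do not say how the query is reformulated), and `expand_query` is a hypothetical helper.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def expand_query(query, selected_docs, extra_terms=2):
    """Append the most frequent new terms from the user-selected documents."""
    query_terms = tokenize(query)
    counts = Counter(
        term
        for text in selected_docs
        for term in tokenize(text)
        if term not in query_terms
    )
    # Descending frequency, then alphabetical, so ties are deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return query_terms + ranked[:extra_terms]

# Documents the user marked as definitely of interest (made-up examples).
selected = ["jaguar cat habitat", "jaguar cat diet and habitat"]
print(expand_query("jaguar", selected))  # ['jaguar', 'cat', 'habitat']
```

The expanded query now leans toward the animal sense of "jaguar", which is closer to what this user selected.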
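Detail view of the Retrieval Process

[Diagram: the user need enters through the User Interface as text; Text
Operations produce a logical view of both the user need and the documents;
Query Operations turn the logical view of the need into a query; Indexing,
together with the DB manager module and the text database, builds the
index (inverted file); Searching uses the query and the index to obtain
the retrieved docs; Ranking returns ranked docs to the user, who may send
user feedback back to Query Operations]
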
Issues that arise in IR
• Text representation
  - What makes a “good” representation?
  - How is a representation generated from text?
  - What are retrievable objects and how are they organized?
• Information needs representation
  - What is an appropriate query language? E.g. weighting and ranking,
    relevance-orientation, or semantic relativism, etc.
  - How can interactive query formulation and refinement be supported?
• Comparing representations (to identify relevant documents)
  - What weighting scheme and similarity measure should be used?
  - What is a “good” model of retrieval?
• Evaluating effectiveness of retrieval
  - What are good metrics/measurements?
  - What constitutes a good experimental test bed?
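
To make "weighting scheme" and "similarity measure" concrete, the sketch below uses one common combination: TF-IDF weights compared with cosine similarity. This is only one possible choice, picked here for illustration; the sample documents reuse the browsing example, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def idf_table(token_lists):
    """Inverse document frequency: log(N / df) for each term."""
    n = len(token_lists)
    df = Counter(term for toks in token_lists for term in set(toks))
    return {term: math.log(n / count) for term, count in df.items()}

def weight(tokens, idf):
    """TF-IDF vector as a dict; terms unseen in the collection get weight 0."""
    return {term: tf * idf.get(term, 0.0) for term, tf in Counter(tokens).items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = ["car racing news", "car manufacturers in Addis", "tourism in Ethiopia"]
doc_tokens = [tokenize(d) for d in docs]
idf = idf_table(doc_tokens)
doc_vectors = [weight(t, idf) for t in doc_tokens]
query_vector = weight(tokenize("car racing"), idf)

scores = [cosine(query_vector, v) for v in doc_vectors]
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])  # car racing news
```

The rarer term "racing" gets a higher weight than "car", so the first document scores well above the second, and the third, which shares no term with the query, scores zero.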
Focus in IR System Design
Our focus during IR system design is:
• Improving the effectiveness of the system
  - Effectiveness of the system is measured in terms of precision,
    recall, …
  - Techniques: stemming, stop words, weighting schemes, matching
    algorithms
• Improving the efficiency of the system
  - The concern here is storage space usage, access time, searching time,
    data transfer time, …
  - Space-time tradeoffs are a concern!
  - Use compression techniques, data/file structures, etc.
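
Precision and recall, the effectiveness measures named above, can be computed as in the following sketch. It uses the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant); the document IDs are invented for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The system returned four documents; three documents are actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p)  # 0.5
print(r)  # two of the three relevant documents were found
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).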

Subsystems of an IR system
The two subsystems of an IR system:
• Searching: an online process of finding relevant documents in the index
  list as per the user's query
• Indexing: an offline process of organizing documents using keywords
  extracted from the collection
• Indexing and searching are unavoidably connected:
  - You cannot search what was not first indexed
  - Indexing of documents or objects is done so that they can be searched
  - To index, one needs an indexing language
    · There are many indexing languages
    · Even taking every word in a document is an indexing language

Indexing Subsystem

[Diagram: Documents → Assign document identifier (document IDs) → text →
Tokenize → tokens → Stop list → non-stoplist tokens → Stemming & Normalize
→ stemmed terms → Term weighting → terms with weights → Index]

Searching Subsystem

[Diagram: query → parse query → query tokens → Stop list → non-stoplist
tokens → Stemming & Normalize → stemmed terms → Term weighting → query
terms; the Similarity Measure compares the query terms with index terms
from the Index to give the relevant document set; ranking then produces
the ranked document set]

Thank you
