Chapter 1: Overview of Information Retrieval
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2024)
Information Retrieval
Information retrieval (IR) is the process of finding relevant
documents that satisfy the information need of users from large
collections of unstructured text.
General Goal of Information Retrieval:
To help users find useful information based on their
information needs (with minimum effort) despite:
the increasing complexity of information,
the changing needs of users.
To provide immediate random access to the document collection.
Document Corpus
Large collections of documents from various sources: news
articles, research papers, books, digital libraries, Web pages,
etc.
Sample Statistics of Text Collections:
Dialog: (http://www.dialog.com/)
Claims to have more than 20 terabytes of data in > 600 Databases, >
1 Billion unique records.
LEXIS/NEXIS: (http://www.lexisnexis.com/)
Claims 7 terabytes, 1.7 billion documents, 1.5 million subscribers,
11,400 databases; > 200,000 searches per day; 9 mainframes, 300
Unix servers, 200 NT servers.
Document Corpus
TREC (Text REtrieval Conference) collections:
It is an annual information retrieval conference & competition.
A total of about 10 GB of text data is available for IR evaluation.
Web Search Engines:
Google claims to index over 3 billion pages.
Information Retrieval Systems?
Document (Web page) retrieval in
response to a query.
Quite effective (at some things)
Commercially successful (some of them)
But what goes on behind the scenes?
How do they work?
What happens beyond the Web?
Web search systems:
Lycos, Excite, Yahoo, Google, Live,
Northern Light, Teoma, HotBot, Baidu,
…
Web Search Engines
There are more than 2,000 general web search engines.
The big four are Google, Yahoo!, Live Search, Ask.
Scientific research & selected journals search engine: Scirus,
About.
Meta search engine: Search.com, Searchhippo, Searchthe.net,
Windseek, Web-search, Webcrawler, Mamma, Ixquick,
AllPlus, Fazzle, Jux2
Multimedia search engine: Blinkx
Visual search engine: Ujiko, Web Brain, RedZee, Kartoo, Mooter
Audio/sound search engine: Feedster, Findsounds
Video search engine: YouTube, Trooker
Medical search engine: Search Medica, Healia, Omnimedicalsearch
Web Search Engines
Index/Directory: Sunsteam, Supercrawler, Thunderstone,
Thenet1, Webworldindex, Smartlinks, Whatusee, Re-quest,
DMOZ, Searchtheweb
Others: Lycos, Excite, Altavista, AOL Search, Intute, Accoona,
Jayde, Hotbot, InfoMine, Slider, Selectsurf, Questfinder, Kazazz,
Answers, Factbites, Alltheweb
There are also Virtual Libraries: Pinakes, WWW Virtual
Library, Digital-librarian, Librarians Internet Index.
Structure of an IR System
An Information Retrieval System serves as a bridge between
the world of authors and the world of readers/users.
Writers present a set of ideas in a document using a set of
concepts.
Users then query the IR system for relevant documents that
satisfy their information need.
[Diagram: User <-> Black box <-> Documents]
What is in the Black Box?
The black box is the processing part of the information retrieval system.
Information Retrieval vs. Data Retrieval
An example of a data retrieval system is a relational database.
Data Retrieval vs. Information Retrieval:
Data organization: Structured (clear semantics: name, age, ...) vs. Unstructured (no fields other than text)
Query language: Artificial (defined, e.g. SQL) vs. Free text ("natural language"), Boolean
Query specification: Complete vs. Incomplete
Items wanted: Exact matching vs. Partial & best matching, relevant items
Accuracy: 100% (results are always "correct") vs. < 50%
Error response: Sensitive vs. Insensitive
Typical IR Task
Given:
A corpus of document collections (text, image, video, audio)
published by various authors.
A user information need in the form of a query.
An IR system searches for:
A ranked set of documents that are relevant to the user's
information need.
Typical IR System Architecture
[Diagram: a query string and the document corpus are fed to the IR system,
which returns ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]
Web Search System
[Diagram: a Web spider crawls pages to build the document corpus; a query string
and the corpus are fed to the IR system, which returns ranked pages:
1. Page1, 2. Page2, 3. Page3, ...]
Overview of the Retrieval Process
[Figure: overview of the retrieval process]
Issues that arise in IR
Text representation:
What makes a “good” representation? The use of free-text or content-bearing
index-terms?
How is a representation generated from text?
What are retrievable objects and how are they organized?
Information needs representation:
What is an appropriate query language?
How can interactive query formulation and refinement be supported?
Comparing representations:
What is a “good” model of retrieval?
How is uncertainty represented?
Evaluating effectiveness of retrieval:
What are good metrics?
What constitutes a good experimental test bed?
Detail View of the Retrieval Process
[Diagram: the user's need is entered through the user interface and passed to
text operations, which produce a logical view of both documents and queries;
the logical view of the query is used to formulate the query (refined by user
feedback), while the logical view of documents feeds indexing, which builds an
inverted file over the text database; searching uses the index file to obtain
retrieved documents, which are then ranked and returned as ranked documents.]
Focus in IR System Design
Improving the retrieval effectiveness of the system:
Effectiveness of the system is evaluated in terms of precision,
recall, …
Stemming, stop words, weighting schemes, matching algorithms.
Improving the efficiency of the system. The concern here is:
Storage space usage, access time, …
Compression, data/file structures, space-time tradeoffs.
Subsystems of IR system
The two subsystems of an IR system:
Indexing:
Is an offline process of organizing documents using keywords
extracted from the collection.
Indexing is used to speed up access to desired information from
the document collection as per the user's query.
Searching:
Is an online process that scans the document corpus to find relevant
documents that match the user's query.
Statistical Properties of Text
How is the frequency of different words distributed?
A few words are very common.
The 2 most frequent words (e.g. "the", "of") can account for
about 10% of word occurrences.
Most words are very rare.
Half the words in a corpus appear only once; such words are
sometimes called hapax legomena ("read only once").
How fast does vocabulary size grow with the size of a corpus?
Such factors affect the performance of an IR system & can be used
to select suitable term weights & other aspects of the system.
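To make these properties concrete, here is a minimal Python sketch (not part of the lecture; the file name corpus.txt and the crude tokenizer are assumptions) that reports how much of a corpus the two most frequent words cover and what fraction of the vocabulary occurs only once.

```python
# Minimal sketch: word-frequency statistics of a corpus (assumed file "corpus.txt").
from collections import Counter
import re

def word_stats(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # crude tokenization
    freq = Counter(tokens)
    total = sum(freq.values())
    top2 = freq.most_common(2)                            # e.g. "the", "of"
    top2_share = sum(count for _, count in top2) / total
    hapax_share = sum(1 for c in freq.values() if c == 1) / len(freq)
    print(f"tokens={total}, vocabulary={len(freq)}")
    print(f"top-2 words cover {top2_share:.1%} of occurrences")
    print(f"{hapax_share:.1%} of vocabulary terms occur only once")

with open("corpus.txt", encoding="utf-8") as f:           # hypothetical corpus file
    word_stats(f.read())
```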
Text Operations
Not all words in a document are equally significant for
representing the contents/meaning of the document.
Some words carry more meaning than others.
Nouns tend to be the most representative of a document's content.
Therefore, the text of each document in a collection needs to be
preprocessed to select the words used as index terms.
Text operations are the process of transforming text into a
logical representation.
Text operations generate the set of index terms.
Text Operations
Main operations for selecting index terms:
Tokenization: identify the set of words used to describe the content
of a text document.
Stop word removal: filter out very frequently occurring words.
Stemming: remove prefixes, infixes & suffixes from words.
Designing term categorization structures (like a thesaurus), which
capture relationships that allow expansion of the original
query with related terms.
(A minimal sketch of the first three operations is shown below.)
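As a rough illustration of these operations (this is not the lecture's exact algorithm; the stop list and suffix set below are toy assumptions), a minimal pipeline might look like:

```python
# Minimal text-operations sketch: tokenization, stop-word removal, naive stemming.
import re

STOP_WORDS = {"the", "of", "and", "a", "an", "to", "in", "is", "are", "for"}  # toy stop list
SUFFIXES = ("ing", "ed", "es", "s")                                           # toy suffix set

def text_operations(text):
    tokens = re.findall(r"[a-z]+", text.lower())            # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]     # stop-word removal
    stems = []
    for t in tokens:                                         # very naive suffix stripping
        for suffix in SUFFIXES:
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems

print(text_operations("The friends of the Romans are visiting the countrymen"))
# ['friend', 'roman', 'visit', 'countrymen']
```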
Indexing Subsystem
[Diagram: indexing pipeline]
documents -> assign document identifier (document IDs) -> tokenize (tokens) ->
stop-list filtering (non-stop tokens) -> stemming & normalization (stemmed terms) ->
term weighting (weighted terms) -> index
Example: Indexing
Documents to be indexed: Friends, Romans, countrymen.
Tokenizer -> token stream: Friends, Romans, countrymen
Stemmer and Normalizer -> modified tokens: friend, roman, countryman
Indexer -> index file (inverted file):
friend -> 2, 4
roman -> 1, 2
countryman -> 13, 16
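A minimal sketch of how such an inverted file can be built (the three sample documents below are hypothetical; document IDs are simply positions in the input list):

```python
# Minimal inverted-index construction: term -> sorted list of document IDs.
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        for term in sorted(set(text.lower().split())):   # unique terms of this document
            index[term].append(doc_id)                    # append posting
    return dict(sorted(index.items()))                    # index terms kept in sorted order

docs = ["friend roman countryman", "roman friend", "countryman friend"]
print(build_inverted_index(docs))
# {'countryman': [1, 3], 'friend': [1, 2, 3], 'roman': [1, 2]}
```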
Index File
An index file consists of records, called index entries.
Index files are much smaller than the original file.
For a 1 GB TREC text collection, the vocabulary has a size of
only 5 MB (Ref: Baeza-Yates and Ribeiro-Neto, 2005).
This size may be further reduced by Linguistic pre-processing
(like stemming & other normalization methods).
The usual unit for text indexing is a word.
Index terms are used to look up records in a file.
An index file usually stores its index terms in sorted order.
The sort order of the terms in the index file provides an order for
the physical file.
Building Index file
An index file of a document collection consists of a list of index
terms, each with a link to one or more documents that contain the index term.
A good index file maps each keyword Ki to a set of documents Di
that contain the keyword.
An index file is list of search terms that are organized for
associative look-up, i.e., to answer user’s query:
In which documents does a specified search term appear?
Where within each document does each term appear?
For organizing the index file for a collection of documents, there are
various options available:
Decide what data structure and/or file structure to use: a
sequential file, an inverted file, a suffix array, a signature file, etc.
Searching Subsystem
[Diagram: searching pipeline]
query -> parse query (query tokens) -> stop-list filtering (non-stop tokens) ->
stemming & normalization (stemmed terms) -> query term weighting (query terms) ->
similarity measure against the index terms in the index file ->
relevant document set -> ranking -> ranked document set
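A minimal sketch of the search side (reusing the toy inverted index from the indexing example; the ranking here is a simple count of matched query terms rather than the full weighting scheme in the diagram):

```python
# Minimal search sketch: look up query terms in the inverted index and rank by overlap.
from collections import Counter

def search(query, index):
    terms = query.lower().split()                 # stand-in for the full text operations
    scores = Counter()
    for term in terms:
        for doc_id in index.get(term, []):        # fetch the term's posting list
            scores[doc_id] += 1                   # one point per matched query term
    return scores.most_common()                   # ranked (doc_id, score) pairs

index = {"countryman": [1, 3], "friend": [1, 2, 3], "roman": [1, 2]}
print(search("friend roman", index))              # [(1, 2), (2, 2), (3, 1)]
```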
IR Models - Basic Concepts
One central problem regarding IR systems is the issue of
predicting which documents are relevant and which are not.
Such a decision is usually dependent on a ranking algorithm
which attempts to establish a simple ordering of the documents
retrieved.
Documents appearing at the top of this ordering are considered
to be more likely to be relevant.
Thus ranking algorithms are at the core of IR systems.
The IR models determine the predictions of what is relevant and
what is not, based on the notion of relevance implemented by the
system.
IR Models - Basic Concepts
After preprocessing, N distinct terms remain; these unique
terms form the VOCABULARY.
Let ki be an index term i & dj be a document j.
Each term, i, in a document or query j, is given a real-valued
weight, wij.
wij is the weight associated with the pair (ki, dj). If wij = 0, it
indicates that the term does not belong to document dj.
The weight wij quantifies the importance of the index term for
describing the document contents.
vec(dj) = (w1j, w2j, …, wtj) is a weighted vector associated with
the document dj.
Mapping Documents & Queries
Represent both documents & queries as N-dimensional vectors in
a term-document matrix, which shows occurrence of terms in the
document collection/query.
E.g.
dj = (t1,j, t2,j, ..., tN,j);   qk = (t1,k, t2,k, ..., tN,k)
An entry in the matrix corresponds to the “weight” of a term in
the document; zero means the term doesn’t exist in the document.
The document collection is mapped to a term-by-document matrix:

       T1   T2   ...  TN
D1    w11  w12  ...  w1N
D2    w21  w22  ...  w2N
 :     :    :         :
DM    wM1  wM2  ...  wMN

Each document is viewed as a vector in a multidimensional space;
nearby vectors are related.
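A minimal sketch of building such a term-by-document matrix (raw term frequencies are used as weights purely for illustration; real systems typically use schemes such as tf-idf):

```python
# Minimal term-by-document matrix with raw term-frequency weights.
docs = {
    "D1": "friend roman friend".split(),
    "D2": "roman countryman".split(),
    "D3": "countryman".split(),
}

vocabulary = sorted({term for terms in docs.values() for term in terms})   # T1..TN

matrix = {
    doc_id: [terms.count(term) for term in vocabulary]   # w_ij = frequency of term i in doc j
    for doc_id, terms in docs.items()
}

print(vocabulary)   # ['countryman', 'friend', 'roman']
print(matrix)       # {'D1': [0, 2, 1], 'D2': [1, 0, 1], 'D3': [1, 0, 0]}
```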
IR Models: Matching function
IR models measure the similarity between documents and
queries.
A matching function is a mechanism used to match a query with a set
of documents.
For example, the vector space model considers documents and
queries as vectors in term-space and measures the similarity of the
document to the query.
Techniques for matching include dot-product, cosine similarity,
dynamic programming…
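A minimal cosine-similarity matching function over such vectors (the query and document vectors below reuse the toy vocabulary ordering from the matrix sketch above):

```python
# Minimal cosine-similarity matching between a query vector and document vectors.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = [0, 1, 1]                  # query mentioning "friend" and "roman"
print(cosine(query, [0, 2, 1]))    # D1: ~0.95 (closest to the query)
print(cosine(query, [1, 0, 1]))    # D2: 0.5
```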
IR Models
A number of major models have been developed to retrieve
information:
The Boolean model,
The vector space model,
The probabilistic model, and
Other models.
Boolean model: is often referred to as the "exact match" model;
Others are the "best match" models.
The Boolean Model: Example
Generate the relevant documents retrieved by the Boolean model
for the query:
q = k1 ∧ (k2 ∨ k3)
[Venn diagram: documents d1-d8 distributed over the regions of the index terms k1, k2, and k3]
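A minimal sketch of evaluating a Boolean query of the form k1 AND (k2 OR k3) with set operations (the posting sets below are made up for illustration and do not reproduce the diagram exactly):

```python
# Minimal Boolean retrieval: exact-match set operations, unranked result.
index = {
    "k1": {"d2", "d4", "d5", "d6"},
    "k2": {"d5", "d6", "d7"},
    "k3": {"d1", "d3", "d4", "d5"},
}

# q = k1 AND (k2 OR k3)
result = index["k1"] & (index["k2"] | index["k3"])
print(sorted(result))   # ['d4', 'd5', 'd6']
```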
IR System Evaluation?
It provides the ability to measure the difference between IR
systems.
How well do our search engines work?
Is system A better than B?
Under what conditions?
Evaluation drives what to research:
Identify techniques that work and do not work,
There are many retrieval models/ algorithms/ systems.
Which one is the best?
What is the best method for:
Similarity measures (dot-product, cosine, …)
Index term selection (stop-word removal, stemming…)
Term weighting (TF, TF-IDF,…)
Types of Evaluation Strategies
System-centered studies:
Given documents, queries, and relevance judgments.
Try several variations of the system.
Measure which system returns the “best” hit list.
User-centered studies:
Given several users, and at least two retrieval systems.
Have each user try the same task on both systems.
Measure which system best satisfies the users'
information need.
Evaluation Criteria
What are some main measures for evaluating an IR system’s
performance?
Measure effectiveness of the system:
How capable is the system of retrieving relevant documents from
the collection?
Is a system better than another one?
User satisfaction: How “good” are the documents that are
returned as a response to user query?
“Relevance” of results to meet information need of users.
Retrieval scenario
A scenario where 13 results are retrieved by different search engines
for a given query. Which search engine would you prefer? Why?
[Figure: six result lists (A-F) for the same query, with relevant and irrelevant documents marked]
Measuring Retrieval Effectiveness
Metrics often used to evaluate the effectiveness of the system:

                Relevant   Irrelevant
Retrieved          A           B
Not retrieved      C           D

Recall:
The percentage of relevant documents in the database that are
retrieved in response to the user's query: A / (A + C).
Precision:
The percentage of retrieved documents that are relevant to the
query: A / (A + B).
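A minimal sketch computing the two measures from sets of document IDs (the retrieved and relevant sets below are toy data):

```python
# Minimal precision/recall computation over sets of document IDs.
def precision_recall(retrieved, relevant):
    a = len(retrieved & relevant)                            # relevant documents retrieved
    precision = a / len(retrieved) if retrieved else 0.0     # A / (A + B)
    recall = a / len(relevant) if relevant else 0.0          # A / (A + C)
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
print(precision_recall(retrieved, relevant))   # (0.5, 0.6666...)
```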
Query Language
How do users query?
The basic IR approach is Keyword-based search.
Queries are combinations of words.
The document collection is searched for documents that
contain these words.
Word queries are intuitive, easy to express and provide fast
ranking.
There are different query languages:
Single-word queries,
Multiple-word queries,
Boolean queries, etc.
Problems with Keywords
May not retrieve relevant documents that include synonymous
terms (words with similar meanings).
“restaurant” vs. “café”
“Ethiopia” vs. “Abyssinia”
“Car” vs. “automobile”
“Buy” vs. “purchase”
“Movie” vs. “film”
May retrieve irrelevant documents that include polysemous terms
(terms with multiple meanings).
“Apple” (company vs. fruit)
“Bit” (unit of data vs. act of eating)
“Bat” (baseball vs. mammal)
“Bank” (financial institution vs. river bank)
Relevance Feedback
After initial retrieval results are presented, allow the user to
provide feedback on the relevance of one or more of the
retrieved documents.
Use this feedback information to reformulate the query.
Produce new results based on reformulated query.
Allows a more interactive, multi-pass process.
Relevance feedback can be obtained in two ways:
the user's explicit relevance feedback,
pseudo (automatic) relevance feedback.
(A sketch of a common reformulation method follows below.)
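One common way to automate the reformulation step is Rocchio-style query expansion (the slides do not name a specific method; the alpha/beta/gamma values below are illustrative defaults):

```python
# Minimal Rocchio-style query reformulation over term-weight vectors.
def rocchio(query, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """All arguments are equal-length lists of term weights."""
    new_query = []
    for i in range(len(query)):
        pos = sum(d[i] for d in relevant_docs) / len(relevant_docs) if relevant_docs else 0.0
        neg = sum(d[i] for d in nonrelevant_docs) / len(nonrelevant_docs) if nonrelevant_docs else 0.0
        new_query.append(max(0.0, alpha * query[i] + beta * pos - gamma * neg))
    return new_query

# Original query plus one document judged relevant and one judged non-relevant.
print(rocchio([0, 1, 1], relevant_docs=[[0, 2, 1]], nonrelevant_docs=[[1, 0, 1]]))
# [0.0, 2.5, 1.6]
```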
User Relevance Feedback Architecture
[Diagram: the query string and the document corpus are fed to the IR system,
which returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...); the user marks
retrieved documents as relevant or not (feedback), the query is reformulated into
a revised query, and the IR system returns reranked documents
(1. Doc2, 2. Doc1, 3. Doc4, ...)]
Challenges for IR Researchers and Practitioners
Technical challenge: what tools should IR systems provide to
allow effective and efficient manipulation of information within
such diverse media as text, image, video and audio?
Interaction challenge: what features should IR systems provide
in order to support a wide variety of users in their search for
relevant information?
Evaluation challenge: how can we measure the effectiveness of
retrieval? Which tools and features are effective and usable,
given the increasing diversity of end-users and information
seeking situations?
Assignments - One
Pick three of the following concepts (not already taken by
other students). Review the literature (books, articles & the Internet)
concerning the meaning, function, pros and cons &
applications of each concept.
1. Information Retrieval
2. Search engine
3. Data retrieval
4. Cross language IR
5. Multilingual IR
6. Document image retrieval
7. Indexing
8. Tokenization
9. Stemming
10. Stop words
11. Normalization
12. Thesaurus
13. Searching
14. IR models
15. Term weighting
16. Similarity measurement
17. Retrieval effectiveness
18. Query language
19. Relevance feedback
20. Query Expansion
Question & Answer
Thank You !!!