
Chapter 1: Overview of Information Retrieval
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Kibrom T (2023)
Information Retrieval

 Information retrieval (IR) is the process of finding relevant documents that satisfy the information needs of users from large collections of unstructured text.
– Relevant documents: documents that are related to the user's information need or search query.
– Information need: the requirement or query of the user that prompts the search for information.
– Unstructured text: data that does not have a predefined format, such as text documents, emails, web pages, and social media posts (e.g., Twitter, Facebook, Instagram). By contrast, structured text includes database records and spreadsheet data.
– Large collections: vast amounts of data that can be indexed, searched, and retrieved by the information retrieval system.
Information Retrieval

 General goals of Information Retrieval:
 To help users find useful information based on their information needs (with minimum effort), despite the increasing complexity of information and the changing needs of users, and to provide immediate random access to the document collection.
 To provide users with quick and accurate access to the information they need. The overall aim is to improve the efficiency of the search process and to help users find the information they need in a timely manner.
What is a Document Corpus

 Large collections of documents from various sources: news articles, research papers, books, digital libraries, Web pages, etc.
 A document corpus is a large and organized collection of text documents used for information retrieval or natural language processing tasks. It can include a variety of document types, such as web pages, emails, news articles, scientific papers, books, and other textual data sources.
 Sample statistics of text collections:
 Dialog (http://www.dialog.com/): claims more than 20 terabytes of data in over 600 databases and over 1 billion unique records.
 LEXIS/NEXIS (http://www.lexisnexis.com/): claims 7 terabytes, 1.7 billion documents, 1.5 million subscribers, 11,400 databases; over 200,000 searches per day; 9 mainframes, 300 Unix servers, 200 NT servers.
What is a Document Corpus

 Large collections of documents from various sources: news articles, research papers, books, digital libraries, Web pages, etc.
 TREC (Text REtrieval Conference) collections:
 An annual information retrieval conference & competition.
 A total of about 10 GB of text datasets for IR evaluation.
 Web search engines:
 Google claims to index over 3 billion pages.
What is a Document Corpus

Company/Platform | Type of Collection | Claimed Size
Amazon Web Services (AWS) | Public datasets, including text collections | Over 70,000 datasets
Microsoft Bing | Web pages and entities | Billions of web pages, billions of entities
Library of Congress Digital Collections | Books, manuscripts, photographs | Over 15 million items
HathiTrust Digital Library | Digitized books and journals | Over 17 million items
ProQuest | Academic journals and primary source material | Over 5 billion digital pages
Elsevier ScienceDirect | Scholarly articles | Over 16 million articles from over 3,800 journals
Facebook | Data about people, places, and things | Over 100 billion data objects
Twitter | Tweets | Over 500 million tweets per day
LexisNexis | Legal and business information | Over 2.7 billion searchable documents
Bloomberg Terminal | News sources and social media posts | Over 40,000 news sources, 100 million social media posts
Information Retrieval Systems?

 Document (Web page) retrieval in response to a query:
 Quite effective (at some things).
 Commercially successful (some of them).
 But what goes on behind the scenes?
 How do they work?
 What happens beyond the Web?
 Web search systems: Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu, …
Web Search Engines

 There are 500 to more than 2,000 general web search engines.
 The big four are Google, Yahoo!, Live Search, Ask.
 Scientific research & selected journals search engines: Scirus, About.
 Meta search engines: Search.com, Searchhippo, Searchthe.net, Windseek, Web-search, Webcrawler, Mamma, Ixquick, AllPlus, Fazzle, Jux2.
 Multimedia search engines: Blinkx.
 Visual search engines: Ujiko, Web Brain, RedZee, Kartoo, Mooter.
 Audio/sound search engines: Feedster, Findsounds.
 Video search engines: YouTube, Trooker.
 Medical search engines: Search Medica, Healia, Omnimedicalsearch.
Web Search Engines

 Index/Directory: Sunsteam, Supercrawler, Thunderstone, Thenet1, Webworldindex, Smartlinks, Whatusee, Re-quest, DMOZ, Searchtheweb.
 Others: Lycos, Excite, Altavista, AOL Search, Intute, Accoona, Jayde, Hotbot, InfoMine, Slider, Selectsurf, Questfinder, Kazazz, Answers, Factbites, Alltheweb.
 There are also virtual libraries: Pinakes, WWW Virtual Library, Digital-librarian, Librarians Internet Index.
Structure of an IR System

 An Information Retrieval System serves as a bridge between the world of authors and the world of readers/users.
 Writers present a set of ideas in a document using a set of concepts.
 Users then ask the IR system for relevant documents that satisfy their information need.

User → [Black box] → Documents

 What is in the black box?
 The black box is the processing part of the information retrieval system.
Structure of an IR System

 The key processing stages in the black box (the IR system) are:
 Query processor: receives the user's search query and processes it to generate a list of relevant documents; it may involve techniques such as stemming and ranking.
 Indexer: builds an index of the documents in the system, typically using techniques such as stop word removal and term weighting.
 Document repository: stores the documents that are indexed by the system; this can be a local or remote file system, a database, or another storage mechanism.
 Evaluation module: measures the effectiveness of the IR system using metrics such as precision, recall, and F1-score.
Information Retrieval vs. Data Retrieval

 An example of a data retrieval system is a relational database.

Aspect | Data Retrieval | Information Retrieval
Data organization | Structured (clear semantics: name, age, …) | Unstructured (no fields other than text)
Query language | Artificial (well defined, e.g. SQL) | Free text ("natural language"), Boolean
Query specification | Complete | Incomplete
Items wanted | Exact matching | Partial & best matching, relevant
Accuracy | 100% (results are always "correct") | < 50%
Error response | Sensitive | Insensitive
Typical IR Task

 Given:
 A corpus of document collections (text, image, video, audio) published by various authors.
 A user information need in the form of a query.
 An IR system searches for:
 A ranked set of documents that are relevant to the user's information need.
Typical IR System Architecture

 IR System Architecture refers to the overall design and organization of an IR system.

[Diagram: a query string and a document corpus are fed into the IR system, which returns a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3, …).]
Web Search System

[Diagram: a web spider collects the document corpus; a query string is submitted to the IR system, which returns a ranked list of pages (1. Page1, 2. Page2, 3. Page3, …).]
Overview of the Retrieval Process

[Figure: overview of the retrieval process.]
Detail View of the Retrieval Process

[Diagram: the user's information need enters through the user interface as text; text operations transform documents and the query into their logical views; indexing builds an inverted file over the text database; the formulated user query is run by the searching component against the index file; the retrieved documents are ranked, and user feedback on the ranked documents can be used to reformulate the query.]
Issues that arise in IR

 Text representation:
 How is a representation generated from text?
 What makes a "good" representation? The use of free-text or content-bearing index terms?
– Free-text index terms (also known as natural-language or uncontrolled terms) are words or phrases extracted from the text of the documents themselves.
– Content-bearing index terms, on the other hand, are controlled-vocabulary terms assigned to documents by human indexers or algorithms.
 What are retrievable objects and how are they organized?
 Information need representation:
 What is an appropriate query language?
 How can interactive query formulation and refinement be supported?
 Comparing representations:
 What is a "good" model of retrieval?
 How is uncertainty represented?
 Evaluating effectiveness of retrieval:
 What are good metrics?
 What constitutes a good experimental test bed?
Focus in IR System Design

 Improving the effectiveness of the system:
 Effectiveness is evaluated in terms of precision, recall, …
 Stemming, stop words, weighting schemes, matching algorithms.
 Improving the efficiency of the system. The concerns here are:
 Storage space usage, access time, …
 Compression, data/file structures, space-time tradeoffs.
Subsystems of IR system

 The two subsystems of an IR system:
 Indexing: the process of creating an index of the documents' content, based on keywords or other relevant information.
 It is an offline process of organizing documents using keywords extracted from the collection.
 Indexing is used to speed up access to desired information from the document collection in response to a user's query.
 Searching: the process of querying the IR system for relevant documents based on user input.
 It is an online process that scans the document corpus to find relevant documents that match the user's query.
Statistical Properties of Text

 How is the frequency of different words distributed?
 A few words are very common: the 2 most frequent words (e.g. "the", "of") can account for about 10% of word occurrences.
 Most words are very rare: half the words in a corpus appear only once (hapax legomena, sometimes glossed as words "read only once").
 How fast does the vocabulary size grow with the size of a corpus?
 Such factors affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system.
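As a rough illustration of these properties, the following is a minimal sketch (with a made-up three-sentence corpus) that counts word occurrences, reports the share of the two most frequent words, and lists the words that occur only once:

```python
from collections import Counter
import re

# A tiny sample corpus; in practice this would be the full document collection.
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog sleeps in the shade of the old tree.",
    "A fox and a dog met near the river bank.",
]

# Count word occurrences across the whole collection.
tokens = []
for doc in corpus:
    tokens.extend(re.findall(r"[a-z]+", doc.lower()))
freq = Counter(tokens)

total = sum(freq.values())
top_two = freq.most_common(2)
top_two_share = sum(count for _, count in top_two) / total
hapax = [word for word, count in freq.items() if count == 1]

print("Most frequent words:", top_two)
print(f"Share of the 2 most frequent words: {top_two_share:.0%}")
print(f"Words appearing only once (hapax legomena): {len(hapax)} of {len(freq)}")
```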
Text Operations

 Not all words in a document are equally significant for representing its contents/meaning.
 Some words carry more meaning than others.
 Nouns are usually the most representative of a document's content.
 Therefore, the text of the documents in a collection needs to be preprocessed to produce the index terms.
 Text operations are the process of transforming text into logical representations.
 The text operations generate a set of index terms.
Text Operations

 Main operations for selecting index terms (a small sketch follows below):
 Tokenization: identify the set of words used to describe the content of a text document.
 Stop word removal: filter out frequently appearing words.
 Stemming: remove prefixes, infixes & suffixes.
 Design term categorization structures (like a thesaurus) that capture term relationships, allowing the expansion of the original query with related terms.
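A minimal sketch of the first three operations; the stop list and the suffix-stripping rules below are toy stand-ins for a real stop list and a real stemmer (e.g. Porter's):

```python
import re

STOP_WORDS = {"the", "of", "and", "a", "an", "in", "to", "is", "are"}  # toy stop list

def tokenize(text):
    """Split raw text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Filter out frequently appearing, low-content words."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Very crude suffix stripping; a real system would use e.g. the Porter stemmer."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

text = "Users are searching the collections for relevant documents"
index_terms = [stem(t) for t in remove_stop_words(tokenize(text))]
print(index_terms)  # ['user', 'search', 'collection', 'for', 'relevant', 'document']
```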
Indexing Subsystem

 Documents → assign document identifier → document IDs + text
 Text → tokenize → tokens
 Tokens → stop list → non-stop-list tokens
 Non-stop-list tokens → stemming & normalization → stemmed terms
 Stemmed terms → term weighting → weighted terms
 Weighted terms → index
Example: Indexing

 Documents to be indexed: "Friends, Romans, countrymen."
 Token stream (tokenizer): Friends | Romans | countrymen
 Modified tokens (stemmer and normalizer): friend | roman | countryman
 Index file / inverted file (indexer): each term is stored with the documents in which it occurs, e.g.
friend → 2, 4
roman → 1, 2
countryman → 13, 16
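A small sketch of how an indexer could build such an inverted file; the document texts and IDs are invented for illustration, and stemming is omitted so the terms stay in their surface form:

```python
from collections import defaultdict
import re

STOP_WORDS = {"the", "and", "of", "a", "my"}

def normalize(text):
    """Tokenize, lowercase, and drop stop words (stemming omitted for brevity)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each index term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical document collection keyed by document ID.
docs = {
    1: "Romans ruled a vast empire",
    2: "Friends and Romans gathered",
    4: "My friends travelled far",
}

index = build_inverted_index(docs)
print(index["friends"])   # -> [2, 4]
print(index["romans"])    # -> [1, 2]
```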
Index File

 An index file consists of records, called index entries.
 Index files are much smaller than the original file.
 For 1 GB of TREC text collection, the vocabulary has a size of only 5 MB (Ref: Baeza-Yates and Ribeiro-Neto, 2005).
 This size may be further reduced by linguistic pre-processing (such as stemming and other normalization methods).
 The usual unit for text indexing is a word.
 Index terms are used to look up records in a file.
 An index file usually has its index terms in sorted order.
 The sort order of the terms in the index file provides an order on the physical file.
Building Index file

 An index file of a document collection is a file consisting of a list of index terms and, for each term, a link to one or more documents containing that term.
 A good index file maps each keyword Ki to the set of documents Di that contain the keyword.
 An index file is a list of search terms organized for associative look-up, i.e., to answer a user's query:
 In which documents does a specified search term appear?
 Where within each document does each term appear? (A positional-index sketch follows below.)
 For organizing an index file for a collection of documents, there are various options available:
 Decide what data structure and/or file structure to use: a sequential file, inverted file, suffix array, signature file, etc.?
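A sketch of a positional inverted index, which answers both questions above by storing, for each term, the documents it occurs in and the word positions within each document (documents and positions here are illustrative):

```python
from collections import defaultdict
import re

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} so we can answer both:
    which documents contain the term, and where it appears inside each one."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(re.findall(r"[a-z]+", text.lower())):
            index[term][doc_id].append(pos)
    return index

docs = {
    1: "to be or not to be",
    2: "to do is to be",
}
index = build_positional_index(docs)

print(dict(index["be"]))            # {1: [1, 5], 2: [4]}
print(sorted(index["to"].keys()))   # documents containing 'to': [1, 2]
```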
Searching Subsystem

 Query → parse query → query tokens
 Query tokens → stop list → non-stop-list tokens
 Non-stop-list tokens → stemming & normalization → stemmed terms
 Stemmed terms → term weighting → query terms
 Query terms + index terms (from the index file) → similarity measure → relevant document set
 Relevant document set → ranking → ranked document set
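A minimal end-to-end sketch of the searching side under simplified assumptions (no stemming, and a plain matching-term count in place of a weighted similarity measure, which is discussed below):

```python
from collections import defaultdict, Counter
import re

STOP_WORDS = {"the", "and", "of", "a", "in", "from"}

def normalize(text):
    """Tokenize, lowercase, and drop stop words (stemming omitted)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_index(docs):
    """Inverted index: term -> set of document IDs (the offline indexing step)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return index

def search(query, index):
    """Online step: normalize the query, look up postings, rank by matching-term count."""
    scores = Counter()
    for term in normalize(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return scores.most_common()        # ranked list of (doc_id, score) pairs

docs = {
    1: "information retrieval finds relevant documents",
    2: "databases answer structured queries",
    3: "retrieval of documents from large collections",
}
index = build_index(docs)
print(search("retrieval of relevant documents", index))   # [(1, 3), (3, 2)]
```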
IR Models - Basic Concepts

 One central problem of IR systems is predicting which documents are relevant and which are not.
 Such a decision usually depends on a ranking algorithm, which attempts to establish a simple ordering of the retrieved documents.
 Documents appearing at the top of this ordering are considered to be more likely to be relevant.
 Thus ranking algorithms are at the core of IR systems.
 The IR model determines the prediction of what is relevant and what is not, based on the notion of relevance implemented by the system.
IR Models - Basic Concepts

 After preprocessing, N distinct terms remain; these unique terms form the VOCABULARY.
 Let ki be an index term i and dj be a document j.
 Each term i in a document or query j is given a real-valued weight wij.
 wij is the weight associated with (ki, dj). If wij = 0, the term does not belong to document dj.
 The weight wij quantifies the importance of the index term for describing the document's contents.
 vec(dj) = (w1j, w2j, …, wtj) is the weighted vector associated with document dj.
Mapping Documents & Queries

 Represent both documents and queries as N-dimensional vectors in a term-document matrix, which records the occurrence of terms in the document collection and in the query.
 E.g. dj = (t1,j, t2,j, …, tN,j); qk = (t1,k, t2,k, …, tN,k)
 An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term does not exist in the document.

     T1   T2   …   TN
D1   w11  w12  …   w1N
D2   w21  w22  …   w2N
:    :    :        :
DM   wM1  wM2  …   wMN

 The document collection is mapped to a term-by-document matrix.
 Each document is viewed as a vector in a multidimensional space.
 Nearby vectors are related.
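A small sketch of building such a term-by-document matrix, using raw term frequency as the weight wij (TF-IDF and other schemes are covered under term weighting); the two documents are made up:

```python
import re

docs = {
    "D1": "information retrieval systems retrieve information",
    "D2": "database systems store structured data",
}

# Build the vocabulary of N distinct terms.
tokenized = {d: re.findall(r"[a-z]+", text.lower()) for d, text in docs.items()}
vocabulary = sorted({t for tokens in tokenized.values() for t in tokens})

# wij = frequency of term i in document j (0 means the term is absent).
matrix = {
    d: [tokens.count(term) for term in vocabulary]
    for d, tokens in tokenized.items()
}

print(vocabulary)
for d, row in matrix.items():
    print(d, row)
```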
IR Models: Matching function

 IR models measure the similarity between documents and queries.
 A matching function is the mechanism used to match a query against a set of documents.
 For example, the vector space model considers documents and queries as vectors in term-space and measures the similarity of each document to the query.
 Techniques for matching include the dot product, cosine similarity, dynamic programming, … (a cosine-similarity sketch follows below).
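A minimal sketch of cosine similarity as a matching function, using raw term-frequency vectors over a shared vocabulary (a simplification of the vector space model; the documents and query are made up):

```python
import math
import re

def tf_vector(text, vocabulary):
    """Raw term-frequency vector of `text` over a fixed vocabulary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(term) for term in vocabulary]

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| |v|); 0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = {
    "D1": "information retrieval ranks relevant documents",
    "D2": "relational databases store structured records",
}
query = "retrieval of relevant documents"

vocabulary = sorted({t for text in list(docs.values()) + [query]
                     for t in re.findall(r"[a-z]+", text.lower())})
q_vec = tf_vector(query, vocabulary)

ranking = sorted(
    ((cosine_similarity(q_vec, tf_vector(text, vocabulary)), d) for d, text in docs.items()),
    reverse=True,
)
print(ranking)   # D1 scores higher than D2 for this query
```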
IR Models

 A number of major models have been developed to retrieve information:
 The Boolean model,
 The vector space model,
 The probabilistic model, and
 Other models such as neural network-based models, the bag-of-words model, and latent semantic analysis.
 The Boolean model is often referred to as the "exact match" model;
 the others are "best match" models.
The Boolean Model: Example

 Generate the relevant documents retrieved by the Boolean model for the query:
 q = k1 ∧ (k2 ∨ k3)

[Venn diagram: documents d1–d8 distributed over the three overlapping index-term sets k1, k2 and k3; the retrieved documents are those in k1 that also lie in k2 or k3.]
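A sketch of evaluating this query with set operations, assuming the query reads k1 ∧ (k2 ∨ k3); the postings below are invented for illustration and are not read off the diagram:

```python
# Hypothetical postings: which of the documents d1..d8 contain each index term.
k1 = {"d1", "d2", "d3", "d4", "d5"}
k2 = {"d2", "d5", "d6", "d7"}
k3 = {"d3", "d5", "d8"}

# Boolean model: exact-match set algebra, q = k1 AND (k2 OR k3).
result = k1 & (k2 | k3)
print(sorted(result))   # documents satisfying the query: ['d2', 'd3', 'd5']
```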
IR System Evaluation?

 It provides the ability to measure the difference between IR systems.
 How well do our search engines work?
 Is system A better than B?
 Under what conditions?
 Evaluation drives what to research:
 Identify techniques that work and that do not work.
 There are many retrieval models/algorithms/systems; which one is the best?
 What is the best method for:
 Similarity measures (dot product, cosine, …),
 Index term selection (stop-word removal, stemming, …),
 Term weighting (TF, TF-IDF, …)?
Types of Evaluation Strategies

 System-centered studies:
 Given documents, queries, and relevance judgments,
 try several variations of the system, and
 measure which system returns the "best" hit list.
 User-centered studies:
 Given several users and at least two retrieval systems,
 have each user try the same task on both systems, and
 measure which system best satisfies the users' information need.
Evaluation Criteria

 What are the main measures for evaluating an IR system's performance?
 Measuring the effectiveness of the system:
 How capable is the system of retrieving relevant documents from the collection?
 Is one system better than another?
 User satisfaction: how "good" are the documents returned in response to the user's query?
 "Relevance" of results to the information need of users.
Retrieval scenario

 A scenario in which 13 results are retrieved by different search engines (A–F) for a given query. Which search engine do you prefer? Why?

[Figure: the result lists of search engines A–F, with each result marked as a relevant or an irrelevant document.]
Measuring Retrieval Effectiveness

 Metrics often used to evaluate the effectiveness of the system:

              | Relevant | Irrelevant
Retrieved     | A        | B
Not retrieved | C        | D

 Recall: the percentage of relevant documents in the collection that are retrieved in response to the user's query. Recall = A / (A + C).
 Precision: the percentage of retrieved documents that are relevant to the query. Precision = A / (A + B).
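A small sketch of computing these measures from a retrieved list and a set of relevance judgments (the document IDs are made up):

```python
def precision_recall(retrieved, relevant):
    """Precision = A/(A+B), Recall = A/(A+C), where A = relevant retrieved,
    B = irrelevant retrieved, C = relevant but not retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)                 # relevant documents that were retrieved
    precision = a / len(retrieved) if retrieved else 0.0
    recall = a / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4", "d5"]   # what the system returned
relevant = ["d1", "d3", "d6", "d7"]          # judged relevant in the collection

p, r = precision_recall(retrieved, relevant)
f1 = 2 * p * r / (p + r) if (p + r) else 0.0  # the F1-score mentioned earlier
print(f"precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
```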
Query Language

 How do users query?
 The basic IR approach is keyword-based search.
 Queries are combinations of words.
 The document collection is searched for documents that contain these words.
 Word queries are intuitive, easy to express, and allow fast ranking.
 There are different query languages:
 Single-word queries,
 Multiple-word queries,
 Boolean queries, etc.
Problems with Keywords

 May not retrieve relevant documents that use synonymous terms (words with similar meaning):
 "restaurant" vs. "café"
 "Ethiopia" vs. "Abyssinia"
 "car" vs. "automobile"
 "buy" vs. "purchase"
 "movie" vs. "film"
 May retrieve irrelevant documents that include polysemous terms (terms with multiple meanings):
 "apple" (company vs. fruit)
 "bit" (unit of data vs. past tense of "bite")
 "bat" (baseball vs. mammal)
 "bank" (financial institution vs. river bank)
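One common mitigation for the synonymy problem is to expand the query with related terms from a thesaurus; the following toy sketch uses a small hand-made synonym table rather than a real thesaurus such as WordNet:

```python
# Toy thesaurus; a real system might use WordNet or a domain-specific thesaurus.
SYNONYMS = {
    "restaurant": ["cafe"],
    "car": ["automobile"],
    "buy": ["purchase"],
    "movie": ["film"],
}

def expand_query(terms):
    """Add known synonyms so documents using different wording can still match."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query(["buy", "car"]))   # ['buy', 'purchase', 'car', 'automobile']
```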
Relevance Feedback

 After the initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents.
 Use this feedback information to reformulate the query.
 Produce new results based on the reformulated query.
 This allows a more interactive, multi-pass process.
 Relevance feedback can be automated in such a way that it allows:
 User relevance feedback,
 Pseudo relevance feedback.
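The slide does not name a specific reformulation method; a standard instance is the Rocchio method, which moves the query vector toward the relevant documents and away from the non-relevant ones: q_new = α·q + β·centroid(relevant) − γ·centroid(non-relevant). A sketch with made-up vectors:

```python
def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Reformulate the query vector from relevance feedback (Rocchio)."""
    dims = len(query_vec)

    def centroid(vectors):
        if not vectors:
            return [0.0] * dims
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

    rel_c, nonrel_c = centroid(relevant), centroid(non_relevant)
    new_q = [alpha * q + beta * r - gamma * n
             for q, r, n in zip(query_vec, rel_c, nonrel_c)]
    # Negative weights are usually clipped to zero.
    return [max(0.0, w) for w in new_q]

# Vectors over a 4-term vocabulary (weights are invented for illustration).
query = [1.0, 0.0, 1.0, 0.0]
relevant_docs = [[0.9, 0.1, 0.8, 0.0], [1.0, 0.0, 0.6, 0.1]]
non_relevant_docs = [[0.0, 0.9, 0.1, 0.8]]

print(rocchio(query, relevant_docs, non_relevant_docs))
```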
Users Relevance Feedback Architecture

[Diagram: a query string is run by the IR system against the document corpus, producing ranked documents (1. Doc1, 2. Doc2, 3. Doc3, …); the user marks each ranked document as relevant or not; query reformulation uses this feedback to produce a revised query, which the IR system runs again to return re-ranked documents (e.g. 1. Doc2, 2. Doc1, 3. Doc4, …).]
Challenges for IR researchers and practitioners

 Technical challenge: what tools should IR systems provide to allow effective and efficient manipulation of information within such diverse media as text, image, video and audio?
 Interaction challenge: what features should IR systems provide in order to support a wide variety of users in their search for relevant information?
 Evaluation challenge: how can we measure the effectiveness of retrieval? Which tools and features are effective and usable, given the increasing diversity of end-users and information-seeking situations?
Assignments - One

 Pick one of the following concepts (one that is not taken by other students). Review the literature (books, articles & Internet) concerning the meaning, function, pros and cons & application of the concept.
1. Information Retrieval
2. Search engine
3. Data retrieval
4. Cross language IR
5. Multilingual IR
6. Document image retrieval
7. Indexing
8. Tokenization
9. Stemming
10. Stop words
11. Normalization
12. Thesaurus
13. Searching
14. IR models
15. Term weighting
16. Similarity measurement
17. Retrieval effectiveness
18. Query language
19. Relevance feedback
20. Query Expansion
Question & Answer
Thank You !!!