Introduction to Information Retrieval

Introducing Information Retrieval and Web Search

Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
These days we frequently think first of web search,
but there are many other cases:
E-mail search
Searching your laptop
Corporate knowledge bases
Legal information retrieval
Unstructured (text) vs. structured (database) data in the mid-nineties

Unstructured (text) vs. structured (database) data today
Basic assumptions of Information Retrieval
Collection: A set of documents
Assume it is a static collection for the moment
Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task
The classic search model
User task:   Get rid of mice in a politically correct way
     |  (misconception?)
     v
Info need:   Info about removing mice without killing them
     |  (misformulation?)
     v
Query:       "how trap mice alive"
     |
     v
Search engine: runs the query over the Collection and returns results;
query refinement loops from the results back to the query.
How good are the retrieved docs?
Precision: fraction of retrieved docs that are relevant to the user's information need
Recall: fraction of relevant docs in the collection that are retrieved
More precise definitions and measurements to
follow later
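
For reference, the usual set-based formulas:

Precision = |{relevant} ∩ {retrieved}| / |{retrieved}|
Recall    = |{relevant} ∩ {retrieved}| / |{relevant}|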
Introduction to Information Retrieval
Term-document incidence matrices
Unstructured data in 1620
Which plays of Shakespeare contain the words
Brutus AND Caesar but NOT Calpurnia?
One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia.
Why is that not the answer?
Slow (for large corpora)
NOT Calpurnia is non-trivial
Other operations (e.g., find the word Romans near
countrymen) not feasible
Ranked retrieval (best documents to return)
Later lectures
Term-document incidence matrices
            Antony &   Julius   The      Hamlet   Othello   Macbeth
            Cleopatra  Caesar   Tempest
Antony          1         1        0        0        0         1
Brutus          1         1        0        1        0         0
Caesar          1         1        0        1        1         1
Calpurnia       0         1        0        0        0         0
Cleopatra       1         0        0        0        0         0
mercy           1         0        1        1        1         1
worser          1         0        1        1        1         0

Each entry is 1 if the play contains the word, 0 otherwise.
Query: Brutus AND Caesar BUT NOT Calpurnia
Incidence vectors
So we have a 0/1 vector for each term.
To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented), then bitwise AND them.
    Brutus             110100
AND Caesar             110111
AND NOT Calpurnia      101111
                     = 100100

Reading the columns in play order (Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth), the 1s pick out Antony and Cleopatra and Hamlet.
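
The same computation as a small Python sketch (not from the slides): each term's incidence row is encoded as a 6-bit integer, one bit per play.

# Incidence vectors, one bit per play, in the order:
# Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth
vectors = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}
ALL_PLAYS = 0b111111  # mask needed to complement a 6-bit vector in Python

# Brutus AND Caesar AND NOT Calpurnia
answer = vectors["Brutus"] & vectors["Caesar"] & (ALL_PLAYS & ~vectors["Calpurnia"])
print(format(answer, "06b"))  # -> 100100: Antony and Cleopatra, Hamlet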
Answers to query
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Bigger collections
Consider N = 1 million documents, each with about 1000 words.
Average 6 bytes/word, including spaces/punctuation
=> 6 GB of data in the documents.
Say there are M = 500K distinct terms among these.
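
Working the numbers: 10^6 documents × 10^3 words/document × 6 bytes/word = 6 × 10^9 bytes = 6 GB.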
Can't build the matrix
500K × 1M matrix has half-a-trillion 0s and 1s.
But it has no more than one billion 1s. Why? The 1M documents contain about 10^9 words in total, so there can be at most that many (term, document) incidences.
The matrix is extremely sparse.
What's a better representation? We only record the 1 positions.
Introduction to Information Retrieval

The Inverted Index
The key data structure underlying modern IR
Inverted index
For each term t, we must store a list of all
documents that contain t.
Identify each doc by a docID, a document serial
number
Brutus    -> 1  2  4  11  31  45  173  174
Caesar    -> 1  2  4  5  6  16  57  132
Calpurnia -> 2  31  54  101

Can we use fixed-size arrays for this?
What happens if the word Caesar
is added to document 14?
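
A minimal Python sketch of the issue (docIDs from the lists above): a dynamic structure grows to absorb the new posting, where a fixed-size array could not.

import bisect

index = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

# Caesar is added to document 14: insert while keeping the list sorted by docID.
bisect.insort(index["Caesar"], 14)
print(index["Caesar"])  # -> [1, 2, 4, 5, 6, 14, 16, 57, 132]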
Inverted index
We need variable-size postings lists
  On disk, a continuous run of postings is normal and best
  In memory, can use linked lists or variable-length arrays
    Some tradeoffs in size / ease of insertion

Brutus    -> 1  2  4  11  31  45  173  174
Caesar    -> 1  2  4  5  6  16  57  132
Calpurnia -> 2  31  54  101

Dictionary (the terms) -> Postings (each docID in a list is one posting)
Sorted by docID (more later on why).
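
As a preview of why docID order matters (treated properly later): two docID-sorted postings lists can be intersected in a single linear merge pass. A Python sketch:

def intersect(p1, p2):
    """Intersect two docID-sorted postings lists in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:       # docID in both lists: part of the answer
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:      # advance the pointer at the smaller docID
            i += 1
        else:
            j += 1
    return answer

# Brutus AND Caesar, using the postings above:
print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [1, 2, 4, 5, 6, 16, 57, 132]))
# -> [1, 2, 4]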
Inverted index construction
Documents to be indexed:   Friends, Romans, countrymen.
        | Tokenizer
        v
Token stream:              Friends  Romans  Countrymen
        | Linguistic modules
        v
Modified tokens:           friend  roman  countryman
        | Indexer
        v
Inverted index:            friend     -> 2  4
                           roman      -> 1  2
                           countryman -> 13  16
Initial stages of text processing
Tokenization
Cut character sequence into word tokens
Deal with "John's", "a state-of-the-art solution"
Normalization
Map text and query term to same form
You want U.S.A. and USA to match
Stemming
We may wish different forms of a root to match
authorize, authorization
Stop words
We may omit very common words (or not)
the, a, to, of
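
A toy Python sketch of these stages. The regex tokenizer, lowercasing normalizer, and suffix-stripping "stemmer" below are illustrative stand-ins, not real linguistic modules (a real system would use, e.g., a Porter-style stemmer and handle irregular forms such as countrymen -> countryman).

import re

STOP_WORDS = {"the", "a", "to", "of"}

def tokenize(text):
    # Cut the character sequence into word tokens.
    return re.findall(r"[A-Za-z']+", text)

def normalize(token):
    # Map text and query terms to the same form (here: just lowercase).
    return token.lower()

def stem(term):
    # Crude suffix stripping so that, e.g., authorize and authorization match.
    for suffix in ("ization", "ize", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(text):
    return [stem(t) for t in map(normalize, tokenize(text)) if t not in STOP_WORDS]

print(preprocess("Friends, Romans, countrymen."))
# -> ['friend', 'roman', 'countrymen']  (the irregular plural needs a real stemmer)
print(preprocess("authorize the authorization"))
# -> ['author', 'author']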
Indexer steps: Token sequence
Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Indexer steps: Sort
Sort by terms, and then by docID within each term
Core indexing step
Indexer steps: Dictionary & Postings
Multiple term entries
in a single document
are merged.
Split into Dictionary
and Postings
Doc. frequency
information is added.
Why frequency?
Will discuss later.
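
A compact Python sketch of these three indexer steps over the two toy documents (text assumed already tokenized and normalized):

from collections import defaultdict

docs = {
    1: "i did enact julius caesar i was killed i' the capitol brutus killed me",
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
}

# Step 1: sequence of (modified token, docID) pairs.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# Step 2 (core indexing step): sort by term, then by docID.
pairs.sort()

# Step 3: merge duplicate (term, docID) entries; split into dictionary and postings.
postings = defaultdict(list)
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)
dictionary = {term: len(plist) for term, plist in postings.items()}  # doc. frequency

print(dictionary["caesar"], postings["caesar"])  # -> 2 [1, 2]
print(dictionary["brutus"], postings["brutus"])  # -> 2 [1, 2]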
Where do we pay in storage?
We pay for:
  Terms and counts (the dictionary)
  Lists of docIDs (the postings)
  Pointers from dictionary entries to their postings lists

IR system implementation questions:
  How do we index efficiently?
  How much storage do we need?