Module 5 - Indexing and Searching

This document discusses indexing and searching techniques for information retrieval. It covers inverted files, suffix arrays, and signature files as the main indexing approaches. Inverted files are described in detail: their structure (a vocabulary plus occurrence lists), addressing schemes such as term offsets, byte offsets, and block offsets, construction by building partial indexes in memory and merging them, and their advantages and disadvantages compared with other approaches.

Uploaded by

Pravin Shinde


Module 5 – Indexing and Searching

Prof. Pravin V.Shinde

Indexing and Searching

• Indexing techniques:
– Inverted files
– Suffix arrays
– Signature files
• Techniques used to search each type of index
• Other searching techniques

Overview

• Just as in traditional RDBMSs, searching for data may be costly
• In a relational database one can take (a lot of) advantage of the well-defined structure of (and constraints on) the data
• A linear scan of the data is not feasible for non-trivial (real-life) datasets
• Indices are not optional in IR (which is not to say that they are optional in RDBMSs)

Overview (continued)

• Traditional indices, e.g., B-trees, are not well suited for IR

• Main approaches:
– Inverted files (or lists)
– Suffix arrays
– Signature files

Inverted Files
• There are two main elements:
– Vocabulary – the set of unique terms
– Occurrences – where those terms appear
• The occurrences can be recorded as term (word) offsets or byte offsets
• Term offsets make it easy to evaluate concepts such as proximity, whereas byte offsets allow direct access to the matching text positions

Vocabulary    Occurrences (byte offset)
…             …

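A minimal sketch of why term offsets suit proximity queries, assuming a plain whitespace tokenizer and a hypothetical helper `within_distance`; neither is taken from the slides. With word positions stored, "is X within k words of Y" reduces to comparing small integers.

```python
from collections import defaultdict


def term_offset_index(text):
    """Map each lowercased word to the list of its word positions (1-based)."""
    index = defaultdict(list)
    for position, token in enumerate(text.split(), start=1):
        word = token.strip(".,;:!?").lower()
        if word:
            index[word].append(position)
    return dict(index)


def within_distance(index, first, second, max_gap):
    """True if some occurrence of `first` lies at most `max_gap` words from `second`."""
    return any(
        abs(p - q) <= max_gap
        for p in index.get(first, [])
        for q in index.get(second, [])
    )


# Illustrative input only.
sample = "That house has a garden. The garden has many flowers."
idx = term_offset_index(sample)
near = within_distance(idx, "garden", "flowers", 3)
```

With byte offsets the same test would need the lengths of the intervening words, which is why the two addressing schemes serve different purposes.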
Inverted Files

• The vocabulary (the set of indexed terms) is often several orders of magnitude smaller than the documents themselves (MBs vs. GBs)
• The space consumed by the occurrence lists is not trivial: every time a term appears, an entry must be added to its list in the inverted file
• This may lead to considerable index overhead

Inverted Files - layout
[Figure: layout of an inverted file. Labels: Vocabulary (indexed terms, number of occurrences); Occurrences lists (posting file). Note in the figure: "This could be a tree-like structure!"]

Example
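• Text (the starting character position of each word, counting from 1, is given in parentheses):
That (1) house (6) has (12) a (16) garden. (18) The (26) garden (30) has (37) many (41) flowers. (46) The (55) flowers (59) are (67) beautiful (71)

• Inverted file:
Vocabulary    Occurrences
beautiful     71
flowers       46, 59
garden        18, 30
house         6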
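The construction of such a table can be sketched in a few lines. This is an illustrative sketch, not the slides' procedure: it assumes words are runs of ASCII letters, that positions are 1-based character offsets counted over the text exactly as written (spaces and punctuation included), and that a small stopword list (an assumption, chosen so that only the four terms of the example remain) is excluded from the vocabulary.

```python
import re
from collections import defaultdict

# Assumed stopword list; the slides do not say how the vocabulary was chosen.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}


def build_inverted_file(text, stopwords=STOPWORDS):
    """Return {term: [1-based character offsets]} with terms in sorted order."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in stopwords:
            # match.start() is 0-based, so add 1 to count from 1.
            index[word].append(match.start() + 1)
    return dict(sorted(index.items()))


text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
inverted = build_inverted_file(text)
for term, positions in inverted.items():
    print(term, positions)
```

Because the text is scanned left to right, each occurrence list comes out already sorted by position.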

Inverted Files
• Coarser addressing may be used:

Terms    Occurrences (block offset)
…        …

• All occurrences within a block (perhaps a whole document) are identified by the same block offset
• Much smaller overhead
• Some searches, e.g., proximity searches, will be less efficient: a linear scan of the candidate blocks may be needed, which is hardly feasible (especially online)

Space Requirements
• The space required for the vocabulary is rather small. According to Heaps’ law, the vocabulary grows as O(n^β), where n is the size of the text and β is a constant between 0.4 and 0.6 in practice
• The occurrences, on the other hand, demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n)
• To reduce space requirements, a technique called block addressing is used
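To give a feel for the two growth rates, the sketch below evaluates Heaps' law in the form V = K · n^β next to the linear growth of the occurrences. The constant K = 10, the exponent β = 0.5, and the text sizes are illustrative assumptions, not values from the slides; only the 0.4 to 0.6 range for β comes from the text.

```python
def heaps_vocabulary_size(n_words, k=10.0, beta=0.5):
    """Estimate the number of distinct terms in a text of n_words words."""
    return k * n_words ** beta


for n in (10**6, 10**8):
    vocabulary = heaps_vocabulary_size(n)
    # One occurrence entry per word in the text, hence n entries.
    print(f"n = {n:,} words: about {vocabulary:,.0f} terms, {n:,} occurrence entries")
```

Under these assumptions, a text 100 times larger has a vocabulary only 10 times larger, while the occurrence lists grow by the full factor of 100.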

Block Addressing
• The text is divided into blocks
• The occurrences point to the blocks where the word appears
• Advantages:
– fewer pointers need to be stored than with one pointer per position
– all the occurrences of a word inside a single block are collapsed to one reference
• Disadvantages:
– an online search over the qualifying blocks is needed if exact positions are required

Example
• Text (divided into four blocks: Block 1, Block 2, Block 3, Block 4):
That house has a garden. The garden has many flowers. The flowers are beautiful

• Inverted file:
Vocabulary    Occurrences (block)
beautiful     4
flowers       3
garden        2
house         1
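A sketch of block addressing on the same text. The block boundaries are an assumption: here each block holds four consecutive words (`BLOCK_SIZE`), which happens to reproduce the block numbers in the table above. The stopword list is likewise assumed, chosen so that only the four listed terms are indexed.

```python
from collections import defaultdict

# Both values are assumptions made for illustration.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}
BLOCK_SIZE = 4  # words per block


def build_block_index(text, block_size=BLOCK_SIZE, stopwords=STOPWORDS):
    """Return {term: sorted list of 1-based block numbers containing the term}."""
    index = defaultdict(set)
    for i, token in enumerate(text.split()):
        word = token.strip(".,;:!?").lower()
        if word and word not in stopwords:
            # A set collapses repeated occurrences inside one block.
            index[word].add(i // block_size + 1)
    return {term: sorted(blocks) for term, blocks in sorted(index.items())}


text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
block_index = build_block_index(text)
for term, blocks in block_index.items():
    print(term, blocks)
```

With this choice of blocks, both occurrences of "garden" fall in block 2 and both occurrences of "flowers" fall in block 3, so each term keeps a single reference; finding their exact positions would require scanning those blocks.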

Inverted Files - construction

• Building the whole index in main memory is not feasible for large collections (it would not fit, and swapping would be unbearable)
• Building it entirely on disk is not a good idea either (it would take a long time)
• One idea is to build several partial indices in main memory, one at a time, save each to disk, and then merge them all into a single index

Inverted Files - construction

• The procedure works as follows:

– Build and save partial indices I1, I2, …, In
– Merge Ij and Ij+1 into a single partial index Ij,j+1
• Merging indices means that their sorted vocabularies are merged, and if a term appears in both indices, the respective occurrence lists are merged as well (keeping the document order)
– Then indices Ij,j+1 and Ij+2,j+3 are merged into partial index Ij,j+3, and so on, until a single index is obtained
– Several partial indices can also be merged together at once
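The pairwise merge can be sketched as follows. This is an in-memory illustration under stated assumptions: each partial index is a list of (term, postings) pairs sorted by term, partial indices are given in text order so that the left index always covers earlier documents than the right one, and the names `merge_pair` and `merge_all` as well as the document numbers are hypothetical. A real implementation would stream the partial indices from disk instead of holding them in lists.

```python
def merge_pair(left, right):
    """Merge two partial indices, each a list of (term, postings) sorted by term."""
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_term, left_postings = left[i]
        right_term, right_postings = right[j]
        if left_term < right_term:
            merged.append((left_term, left_postings))
            i += 1
        elif right_term < left_term:
            merged.append((right_term, right_postings))
            j += 1
        else:
            # Term in both indices: left covers earlier documents,
            # so concatenation keeps the document order.
            merged.append((left_term, left_postings + right_postings))
            i += 1
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_all(partials):
    """Merge neighbouring indices level by level until one index remains."""
    while len(partials) > 1:
        next_level = [
            merge_pair(partials[k], partials[k + 1])
            for k in range(0, len(partials) - 1, 2)
        ]
        if len(partials) % 2:
            next_level.append(partials[-1])  # odd one out moves up unchanged
        partials = next_level
    return partials[0] if partials else []


# Hypothetical partial indices; postings are document numbers.
i1 = [("garden", [1]), ("house", [1, 2])]
i2 = [("flowers", [3]), ("garden", [4])]
i3 = [("beautiful", [5]), ("flowers", [5, 6])]
final_index = merge_all([i1, i2, i3])
```

Merging only neighbours at each level is what keeps the concatenation step valid; merging several indices at once, as the last item suggests, would generalize `merge_pair` to a k-way merge.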

Thank You
