Module 5 - Indexing and Searching

This document discusses indexing and searching techniques for information retrieval. It covers inverted files, suffix arrays, and signature files as the main indexing approaches. Inverted files are described in detail, including their structure of a vocabulary and occurrences lists, approaches for addressing like term offsets and block offsets, techniques for constructing inverted files by building partial indexes in memory and merging them, and their advantages and disadvantages compared to other approaches.

Uploaded by

Pravin Shinde
Copyright
© All Rights Reserved

Module 5 – Indexing and Searching

Prof. Pravin V. Shinde


Indexing and Searching

• Indexing techniques:
– Inverted files
– Suffix arrays
– Signature files
• Technique used to search each type of index
• Other searching techniques

Overview

• Just as in traditional RDBMSs, searching for data may be costly
• In a relational database one can take (a lot of) advantage of the well-defined structure of (and constraints on) the data
• A linear scan of the data is not feasible for non-trivial (real-life) datasets
• Indices are not optional in IR (which is not to say that they are optional in an RDBMS)

Overview (continued)

• Traditional indices, e.g., B-trees, are not well suited for IR

• Main approaches:
– Inverted files (or lists)
– Suffix arrays
– Signature files

Inverted Files
• There are two main elements:
– Vocabulary – the set of unique terms
– Occurrences – where those terms appear
• The occurrences can be recorded as term (word) offsets or byte offsets
• Term offsets are well suited to queries such as proximity, whereas byte offsets allow direct access to the text

Vocabulary   Occurrences (byte offset)
…            …
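
As a minimal sketch of why term offsets suit proximity queries, the Python below records the word position (1-based) of every occurrence and checks whether two terms appear within k words of each other. The names `build_term_index` and `within`, and the tokenization rule (lowercased alphabetic runs), are assumptions made for illustration and are not taken from the slides.

```python
import re
from collections import defaultdict

def build_term_index(text):
    """Map each term to the list of word positions (1-based) where it occurs."""
    index = defaultdict(list)
    words = re.findall(r"[a-z]+", text.lower())  # assumed tokenization
    for pos, word in enumerate(words, start=1):
        index[word].append(pos)
    return dict(index)

def within(index, a, b, k):
    """True if some occurrence of a and some occurrence of b are at most k words apart."""
    return any(abs(i - j) <= k
               for i in index.get(a, [])
               for j in index.get(b, []))

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
idx = build_term_index(text)
print(idx["garden"])                           # [5, 7]
print(within(idx, "garden", "flowers", 3))     # True
print(within(idx, "house", "beautiful", 3))    # False
```

With byte offsets the same question would require converting distances in characters back into distances in words, which is why term offsets are the more natural choice here.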
Inverted Files

• The vocabulary (the set of indexed terms) is often several orders of magnitude smaller than the document collection (megabytes vs. gigabytes)
• The space consumed by the occurrence lists is not trivial: every time a term appears in the text, an entry must be added to its list in the inverted file
• This may lead to considerable index overhead

Inverted Files - layout
[Figure: Vocabulary (indexed terms, number of occurrences; this could be a tree-like structure) and Occurrences Lists (posting file)]
Example
• Text (word start offsets shown above the text):
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful

• Inverted file:
Vocabulary   Occurrences
beautiful    70
flowers      45, 58
garden       18, 29
house        6
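
The example can be reproduced with a short Python sketch that records the character offset at which each indexed word starts. The function name `build_offset_index` and the stopword list are assumptions, chosen so that the vocabulary matches the four terms above. Offsets are counted 1-based from the string exactly as typed here, so values after the first sentence may differ by one from the figure, depending on how spacing was counted on the slide.

```python
import re
from collections import defaultdict

def build_offset_index(text, stopwords=frozenset()):
    """Map each non-stopword term to the 1-based character offsets where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        term = match.group().lower()
        if term not in stopwords:
            index[term].append(match.start() + 1)  # convert 0-based to 1-based
    return dict(index)

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
# Assumed stopword list, picked only to reproduce the slide's vocabulary.
STOPWORDS = frozenset({"that", "has", "a", "the", "many", "are"})

index = build_offset_index(text, STOPWORDS)
for term in sorted(index):
    print(term, index[term])

# Direct access: jump straight to an occurrence without scanning the text.
start = index["flowers"][0] - 1
print(text[start:start + len("flowers")])   # flowers
```

The last two lines illustrate the "direct access" property of byte offsets: the stored number is enough to seek to the occurrence in the text.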

Inverted Files
• Coarser addressing may be used
Terms   Occurrences (block offset)
…       …
• All occurrences within a block (perhaps a whole document) are identified by the same block offset
• Much smaller overhead
• Some searches will be less efficient, e.g., proximity searches. A linear scan may be needed, though it is hardly feasible (especially on-line)

Space Requirements
• The space required for the vocabulary is rather small. According to Heaps’ law the vocabulary grows as O(n^β), where β is a constant between 0.4 and 0.6 in practice
• On the other hand, the occurrences demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n)
• To reduce space requirements, a technique called block addressing is used
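
To get a feel for the difference between O(n^β) and O(n), the sketch below evaluates Heaps' law in the form V = K · n^β for a few text sizes. The constant K = 10 and the choice β = 0.5 are illustrative assumptions (the slides only give the range for β), and `heaps_vocabulary` is a hypothetical helper name.

```python
def heaps_vocabulary(n, k=10.0, beta=0.5):
    """Estimate the vocabulary size V = k * n**beta for a text of n words.

    k and beta are assumed, illustrative values; real collections
    need them fitted empirically.
    """
    return k * n ** beta

for n in (10**6, 10**8, 10**9):
    v = heaps_vocabulary(n)
    # The occurrences grow as O(n), so v / n shows how small the vocabulary is by comparison.
    print(f"n = {n:>13,} words -> V ~ {v:>10,.0f} terms ({v / n:.4%} of n)")
```

Under these assumptions the vocabulary shrinks relative to the text as n grows, while the occurrence lists keep growing linearly, which is what motivates block addressing.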

Block Addressing
• The text is divided into blocks
• The occurrences point to the blocks where the word appears
• Advantages:
– fewer pointers are needed than with exact positions
– all the occurrences of a word inside a single block are collapsed into one reference
• Disadvantages:
– an online search over the qualifying blocks is needed if exact positions are required

Example
• Text:
That house has a garden. The garden has many flowers. The flowers are beautiful
Block 1 Block 2 Block 3 Block 4

• Inverted file:
Vocabulary   Occurrences
beautiful    4
flowers      3
garden       2
house        1
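
A minimal Python sketch of block addressing follows. The exact block boundaries are not recoverable from the slide, so the four blocks below are an assumption chosen to be consistent with the table above; the stopword list and the names `build_block_index` and `find_positions` are likewise hypothetical.

```python
import re

def build_block_index(blocks, stopwords=frozenset()):
    """Map each term to the list of block numbers (1-based) that contain it."""
    index = {}
    for number, block in enumerate(blocks, start=1):
        # A set collapses repeated occurrences inside one block to a single reference.
        for term in set(re.findall(r"[a-z]+", block.lower())) - stopwords:
            index.setdefault(term, []).append(number)
    return index

def find_positions(blocks, index, term):
    """Scan only the qualifying blocks to recover (block, offset-in-block) pairs."""
    hits = []
    for number in index.get(term, []):
        pattern = rf"\b{re.escape(term)}\b"
        for match in re.finditer(pattern, blocks[number - 1].lower()):
            hits.append((number, match.start()))
    return hits

# Assumed block boundaries (not given explicitly in the source).
blocks = ["That house has a ", "garden. The garden has many ",
          "flowers. The flowers are ", "beautiful"]
STOPWORDS = frozenset({"that", "has", "a", "the", "many", "are"})

block_index = build_block_index(blocks, STOPWORDS)
print(block_index["garden"])                           # [2]
print(find_positions(blocks, block_index, "garden"))   # [(2, 0), (2, 12)]
```

Both occurrences of "garden" cost a single pointer in the index, but recovering their exact positions requires scanning block 2, which is the trade-off named under the disadvantages.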

Inverted Files - construction

• Building the index entirely in main memory is not feasible (it would not fit, and swapping would be unbearable)
• Building it entirely on disk is not a good idea either (it would take a long time)
• One idea is to build several partial indices in main memory, one at a time, saving each to disk, and then to merge all of them into a single index
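
The first half of this idea, building partial indices one at a time, might be sketched as follows. The memory limit is modelled by an assumed `max_postings` budget, and each partial index is simply yielded where a real system would write it to disk; `build_partial_indices` is a hypothetical name.

```python
import re
from collections import defaultdict

def build_partial_indices(docs, max_postings):
    """Yield partial indices with sorted vocabularies.

    max_postings stands in for 'what fits in main memory' (an assumption).
    Flushing happens at document boundaries, so each partial index covers
    a contiguous range of documents.
    """
    partial, count = defaultdict(list), 0
    for doc_id, text in enumerate(docs, start=1):
        for pos, term in enumerate(re.findall(r"[a-z]+", text.lower()), start=1):
            partial[term].append((doc_id, pos))
            count += 1
        if count >= max_postings:
            # Budget reached: emit this partial index (a real system saves it to disk).
            yield dict(sorted(partial.items()))
            partial, count = defaultdict(list), 0
    if partial:
        yield dict(sorted(partial.items()))

docs = ["That house has a garden",
        "The garden has many flowers",
        "The flowers are beautiful"]
parts = list(build_partial_indices(docs, max_postings=6))
print(len(parts))            # 2
print(parts[0]["garden"])    # [(1, 5), (2, 2)]
```

Keeping each partial vocabulary sorted is what later allows the partial indices to be merged in a single sequential pass.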

Inverted Files - construction

• The procedure works as follows:
– Build and save partial indices I_1, I_2, …, I_n
– Merge I_j and I_{j+1} into a single partial index I_{j,j+1}
• Merging indices means that their sorted vocabularies are merged, and if a term appears in both indices then the respective lists are merged (keeping the document order)
– Then indices I_{j,j+1} and I_{j+2,j+3} are merged into partial index I_{j,j+3}, and so on until a single index is obtained
– Several partial indices can be merged together at once
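
The merging steps above can be sketched in Python as follows. This is an in-memory illustration under stated assumptions: each partial index is a dict whose keys are in sorted order and whose values are lists of document ids, and earlier partial indices cover earlier documents. The names `merge_two` and `merge_all` are hypothetical.

```python
def merge_two(a, b):
    """Merge two partial indices with sorted vocabularies; a precedes b in document order."""
    merged = {}
    ta, tb = list(a), list(b)
    i = j = 0
    while i < len(ta) or j < len(tb):
        if j == len(tb) or (i < len(ta) and ta[i] < tb[j]):
            merged[ta[i]] = list(a[ta[i]])        # term only in a (so far)
            i += 1
        elif i == len(ta) or tb[j] < ta[i]:
            merged[tb[j]] = list(b[tb[j]])        # term only in b (so far)
            j += 1
        else:
            # Term in both: concatenate lists, a first, to keep document order.
            merged[ta[i]] = a[ta[i]] + b[tb[j]]
            i += 1
            j += 1
    return merged

def merge_all(partials):
    """Merge neighbours pairwise, level by level, until a single index remains."""
    level = list(partials)
    while len(level) > 1:
        merged = [merge_two(level[k], level[k + 1])
                  for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            merged.append(level[-1])              # odd one out moves up unchanged
        level = merged
    return level[0] if level else {}

i1 = {"garden": [1], "house": [1]}
i2 = {"flowers": [2], "garden": [2]}
i3 = {"beautiful": [3], "flowers": [3]}
final = merge_all([i1, i2, i3])
print(final)
```

Merging only neighbouring indices keeps every occurrence list in document order without any re-sorting. Merging several partial indices at once, as the last bullet suggests, would replace `merge_two` with a multi-way merge.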

Thank You
