Module 5 - Indexing and Searching
• Indexing techniques:
– Inverted files
– Suffix arrays
– Signature files
• Techniques used to search each type of index
• Other searching techniques
Overview
• Main approaches:
– Inverted files (or lists)
– Suffix arrays
– Signature files
Inverted Files
• There are two main elements:
– Vocabulary – the set of unique terms
– Occurrences – where those terms appear
• The occurrences can be recorded as
term (word) offsets or byte offsets
• Term offsets are well suited to queries
such as proximity, whereas byte offsets
allow direct access to the matching text
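To illustrate why term offsets suit proximity queries, here is a minimal Python sketch. It assumes occurrences are stored as sorted word positions (1 for the first word of the text); the function name `within_distance` and the sample positions are illustrative and not part of the slides.

```python
def within_distance(positions_a, positions_b, max_gap):
    """Return True if some occurrence of term A lies within max_gap words of term B.

    Both arguments are sorted lists of word positions (term offsets).
    """
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= max_gap:
            return True
        # Advance the list whose current position is smaller.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False


# Word positions in "That house has a garden. The garden has many flowers.
# The flowers are beautiful" (That = 1, house = 2, ...).
garden = [5, 7]
flowers = [10, 12]
print(within_distance(garden, flowers, 3))
print(within_distance(garden, flowers, 2))
```

With byte offsets the same test would need the text itself (or the word lengths) to convert a byte distance into a number of words, which is why term offsets are the more natural choice here.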
Inverted Files - layout
[Figure: layout of an inverted file. Labels: Vocabulary (indexed terms, number of occurrences); Posting File (occurrences lists). Annotation: this could be a tree-like structure.]
Example
• Text (the numbers give the starting character position of each word, counted from 1):
1 6 12 16 18 26 30 37 41 46 55 59 67 71
That house has a garden. The garden has many flowers. The flowers are beautiful
• Inverted file
Vocabulary   Occurrences
beautiful    71
flowers      46, 59
garden       18, 30
house        6
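The construction of such a table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list `STOP_WORDS` is hypothetical (chosen so that only the four vocabulary terms remain), the function name `build_inverted_file` is made up for this example, and offsets are character positions counted from 1, with spaces and punctuation included in the count.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the slides do not say which words are skipped.
STOP_WORDS = {"that", "has", "a", "the", "many", "are"}


def build_inverted_file(text):
    """Map each non-stop word to the 1-based character offsets where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in STOP_WORDS:
            index[word].append(match.start() + 1)  # convert 0-based to 1-based
    # Sort the vocabulary alphabetically, as in the example table.
    return dict(sorted(index.items()))


text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
for term, offsets in build_inverted_file(text).items():
    print(term, offsets)
```

A plain dictionary stands in for the vocabulary here; a real index might keep the vocabulary in a tree or hash table and the occurrence lists in a separate posting file.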
Inverted Files
• Coarser addressing may be used
Terms   Occurrences (block offset)
…       …
• All occurrences within a block (perhaps a whole
document) are identified by the same block offset
• Much smaller overhead
• Some searches become less efficient, e.g., proximity
searches: a linear scan of the block may be needed, which
is hardly feasible, especially on-line
Space Requirements
• The space required for the vocabulary is rather small.
According to Heaps’ law the vocabulary grows as O(n^β),
where n is the text size and β is a constant between 0.4 and 0.6 in practice
• On the other hand, the occurrences demand much
more space. Since each word appearing in the text is
referenced once in that structure, the extra space is
O(n)
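To get a feel for the difference between the two growth rates, the sketch below evaluates Heaps' law in its usual form V = K · n^β. The values K = 30 and β = 0.5 are assumptions picked for illustration only (the slides give just the range for β), and the function name `heaps_vocabulary_size` is made up for this example.

```python
def heaps_vocabulary_size(n_words, k=30.0, beta=0.5):
    """Estimate the vocabulary size V = k * n^beta (Heaps' law).

    k and beta are assumed, collection-dependent constants.
    """
    return k * n_words ** beta


for n in (10**6, 10**8):
    v = heaps_vocabulary_size(n)
    print(f"n = {n:>11,} words -> about {v:>9,.0f} distinct terms ({v / n:.4%} of n)")
```

Under these assumed constants, a text 100 times larger has a vocabulary only 10 times larger, while the occurrence lists grow in proportion to the text.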
• To reduce space requirements, a technique called block
addressing is used
Block Addressing
• The text is divided into blocks
• The occurrences point to the blocks
where the word appears
• Advantages:
– pointers are smaller, because there are fewer blocks
than positions
– all the occurrences of a word inside a single block
are collapsed to one reference
• Disadvantages:
– the qualifying blocks must be searched online (at
query time) if exact positions are required
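The disadvantage can be made concrete with a small Python sketch: the index only names the qualifying blocks, so exact positions have to be recovered by scanning those blocks at query time. The block split, the hand-built `block_index`, and the function name `exact_positions` are all assumptions made for this illustration.

```python
import re

# Assumed split of the sample sentence into four blocks.
blocks = [
    "That house has",
    "a garden. The garden has many",
    "flowers. The flowers are",
    "beautiful",
]
# Hand-built block index: term -> 1-based block numbers.
block_index = {"beautiful": [4], "flowers": [3], "garden": [2], "house": [1]}


def exact_positions(term):
    """Return (block number, 1-based offset inside the block) for each occurrence."""
    hits = []
    for block_no in block_index.get(term, []):
        block_text = blocks[block_no - 1]
        # Online step: scan only the qualifying block for whole-word matches.
        pattern = rf"\b{re.escape(term)}\b"
        for match in re.finditer(pattern, block_text, re.IGNORECASE):
            hits.append((block_no, match.start() + 1))
    return hits


print(exact_positions("garden"))
```

The cost of this scan grows with the block size, which is the trade-off against the smaller index.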
Example
• Text:
That house has a garden. The garden has many flowers. The flowers are beautiful
Block 1 Block 2 Block 3 Block 4
• Inverted file:
Vocabulary Occurrences
beautiful 4
flowers 3
garden 2
house 1
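A block-addressed index like this one can be built with the following Python sketch. The block boundaries are an assumption: the slide's exact split points are not recoverable, so the split below was chosen to be consistent with the table. `STOP_WORDS` and `build_block_index` are likewise hypothetical names for this example.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the slides do not say which words are skipped.
STOP_WORDS = {"that", "has", "a", "the", "many", "are"}

# Assumed block boundaries (not given in the slides).
blocks = [
    "That house has",
    "a garden. The garden has many",
    "flowers. The flowers are",
    "beautiful",
]


def build_block_index(blocks):
    """Map each non-stop word to the sorted 1-based numbers of the blocks containing it."""
    index = defaultdict(set)
    for block_no, block_text in enumerate(blocks, start=1):
        for word in re.findall(r"[A-Za-z]+", block_text.lower()):
            if word not in STOP_WORDS:
                # A set collapses repeated occurrences inside one block.
                index[word].add(block_no)
    return {term: sorted(nums) for term, nums in sorted(index.items())}


print(build_block_index(blocks))
```

Using a set per term is what collapses the two occurrences of "garden" (and of "flowers") inside one block into a single reference.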
Inverted Files - construction
Thank You