Indexing and Searching
Modern Information Retrieval
by R. Baeza-Yates and B. Ribeiro-Neto
Chapter 8
1
Outline
Inverted Files
Other Indices for Text
Sequential Searching
Pattern Matching
Compression
2
Inverted Files
And inverted file (or inverted index) is a word-
oriented mechanism for indexing a text collection
in order to speed up the searching task.
Structure:vocabulary and occurrences
Block addressing
The text is divided in blocks, and the
occurrences point to the blocks
Full inverted indices:exact occurrences
3
4
5
Inverted Files
The search algorithm on an inverted index
Vocabulary search
Retrieval of occurrences
Manipulation of occurrences
Construction (split the index into two files)
Posting file:the lists of occurrences are stored
contiguously
The vocabulary is stored in lexicographical
order and points to its list.
6
7
Inverted Files
For Large texts
Partial index
Merging two indices consists of merging
the sorted vocabularies.
8
9
Other Indices for Text
Suffix Trees
Suffix Arrays
Signature Files
10
Suffix Trees and Suffix Arrays
Each position in the text is considered as a
text suffix
Index points are selected form the text,
which point to the beginning of the text
positions which will be retrievable
11
12
Suffix arrays
The main drawbacks of Suffix Array are its
costly construction process.
Allow binary searches done by comparing
the contents of each pointer.
Supra-indices (for large suffix array)
13
14
15
Construction of Suffix Arrays for
Large Texts
16
Signature Files
Word-oriented index structures base on hashing
Maps words to bit masks of B bits
Divides the text in blocks of b words each
The mask is obtained by bitwise ORing the
signatures of all the words in the text block.
Hash the query to a bit mask W
If W & Bi = W, the text block may contain the
word
17
18
Sequential Searching
Brute Force
Knuth-Morris-Pratt
Boyer-Moore Family
Shift-Or
Suffix Automaton
Backward DAWG matching (BDM)
BNDM
19
Knuth-Morris-Pratt
20
Boyer-Moore Family
21
Shift-Or
22
Suffix Automaton
23
24
Pattern Matching
Searching allowing errors
Dynamic Programming
Automaton
Regular Expressions and Extended patterns
Pattern Matching Using Indices
Inverted files
Suffix Trees and Suffix Arrays
25
Dynamic Programming
26
Automaton
27
Regular Expressions
28
Pattern Matching Using Indices
Inverted Files
The types of queries such as suffix or
substring queries, searching allowing
errors and regular expressions, are solved
by a sequential search
The restriction is to find approximate
matches or regular expressions that span
many word.
29
Pattern Matching Using Indices
Suffix Trees
Suffix trees are able to perform complex
searches
Word, prefix, suffix, substring, and Range
queries
Regular expressions
Unrestricted approximate string matching
Useful in specific areas
Find the longest substring
Find the most common substring of a fixed 30
size
Pattern Matching Using Indices
Suffix Arrays
Some patterns can be searched directly in
the suffix array without simulation the
suffix tree
Word, prefix, suffix, subword search and
range search
31
Compression
Compressed text--Huffman coding
Taking words as symbols
Use an alphabet of bytes instead of bits
Compressed indices
Inverted Files
Suffix Trees and Suffix Arrays
Signature Files
32