
DEPARTMENT OF CSE (ARTIFICIAL INTELLIGENCE & MACHINE LEARNING)

R23 Regulation V Semester A.Y.: 2025-26

Lecture Notes
Information Retrieval System (23ML5T01)

Prepared By
Mr. P V Kishore Kumar M.Tech., (Ph.D.)
Associate Professor

Department of CSE (AI & ML)


Ramachandra College of Engineering (A), Eluru.



Course Objectives:
1. To introduce the fundamental concepts and domain of Information Retrieval (IR) systems and how they differ from other information systems.
2. To explain the data structures and algorithms fundamental to building and operating efficient IR systems.
3. To analyze the working of indexing techniques such as inverted files and signature files and their enhancements.
4. To explore advanced indexing structures like PAT Trees and the role of lexical analysis, stop lists, and stemming in IR systems.
5. To study various string searching algorithms and their applicability to IR systems for efficient retrieval.

Unit – I Introduction to Information Storage and Retrieval Systems: Domain Analysis of IR systems, IR and other types of Information Systems, IR System Evaluation. Introduction to Data Structures and Algorithms related to Information Retrieval: Basic Concepts, Data Structures, Algorithms.
Unit – II Inverted Files and Signature Files: Introduction, Structures used in Inverted Files, Building an Inverted File using a Sorted Array, Modifications to the Basic Techniques. Signature Files: Concepts of Signature Files, Compression, Vertical Partitioning, Horizontal Partitioning.
Unit – III New Indices for Text, Lexical Analysis and Stoplists: PAT Trees and PAT Arrays: Introduction, PAT Tree structure, Algorithms on the PAT Trees, Building PAT Trees as PATRICIA Trees, PAT representation as Arrays. Stoplists.
Unit – IV Stemming Algorithms and Thesaurus Construction: Types of Stemming Algorithms, Experimental Evaluations of Stemming, Stemming to Compress Inverted Files. Thesaurus Construction: Features of Thesauri, Thesaurus Construction, Thesaurus Construction from Texts, Merging Existing Thesauri.
Unit – V String Searching Algorithms: Introduction, Preliminaries, The Naive Algorithm, The Knuth-Morris-Pratt Algorithm, The Boyer-Moore Algorithm, The Shift-Or Algorithm, The Karp-Rabin Algorithm.

Text Books:
1. Modern Information Retrieval, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Pearson Education, 2007.
2. Information Storage and Retrieval Systems: Theory and Implementation, Gerald Kowalski and Mark Maybury, Kluwer Academic Publishers, 2000.



Unit – II: Inverted Files and Signature Files
INTRODUCTION
An inverted file is the most commonly used indexing structure in Information Retrieval (IR).
It maps each word (term) in the collection to the list of documents in which it appears. This
structure helps in fast searching and retrieval of documents.
Example: If we have documents containing 'data', 'information', and 'retrieval', the inverted
file will store each word along with the list of document IDs where it appears.
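The idea can be sketched in a few lines of Python; the three documents and their contents below are invented purely for illustration:

```python
# Minimal sketch of an inverted file: each term maps to a sorted list of the
# IDs of the documents that contain it. Document contents are invented.
from collections import defaultdict

docs = {
    1: "data retrieval",
    2: "information retrieval",
    3: "data information system",
}

inverted = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):           # index each term once per document
        inverted[term].append(doc_id)

for term in sorted(inverted):                # the index itself is kept sorted
    print(term, "->", inverted[term])
# data -> [1, 3]
# information -> [2, 3]
# retrieval -> [1, 2]
# system -> [3]
```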
Three of the most commonly used file structures for information retrieval can be classified as
lexicographical indices (indices that are sorted), clustered file structures, and indices based
on hashing.
The concept of the inverted file type of index is as follows. Assume a set of documents. Each document is assigned a list of keywords or attributes, with optional relevance weights associated with each keyword (attribute). An inverted file is then the sorted list (or index) of keywords (attributes), with each keyword having links to the documents containing that keyword (see Fig 1). This is the kind of index found in most commercial library systems. The use of an inverted file improves search efficiency by several orders of magnitude, a necessity for very large text files. The penalty paid for this efficiency is the need to store a data structure that ranges from 10 percent to 100 percent or more of the size of the text itself, and a need to update that index as the data set changes.

Fig 1: An inverted file implemented using a sorted array


Usually there are some restrictions imposed on these indices and consequently on later
searches. Examples of these restrictions are:
 a controlled vocabulary, which is the collection of keywords that will be indexed. Words in the text that are not in the vocabulary will not be indexed, and hence are not searchable.
 a list of stopwords (articles, prepositions, etc.) that for reasons of volume or precision and recall will not be included in the index, and hence are not searchable.
 a set of rules that decide the beginning of a word or a piece of text that is indexable. These rules deal with the treatment of spaces, punctuation marks, or some standard prefixes, and may have significant impact on what terms are indexed.



 a list of character sequences to be indexed (or not indexed). In large text databases, not all character sequences are indexed; for example, character sequences consisting only of digits are often not indexed.
A search in an inverted file is the composition of two searching algorithms: a search for a keyword (attribute), which returns an index, and then a possible search on that index for a particular attribute value. The result of a search on an inverted file is a set of records (or pointers to records).
STRUCTURE OF INVERTED FILES
The inverted file consists of two major components:
1. Dictionary (or Vocabulary) – List of all unique terms in the collection.
2. Posting Lists – For each term, the posting list contains document IDs (and possibly
frequencies, positions).
STRUCTURES USED IN INVERTED FILES
As Fig 1 shows, the index is essentially a table with terms on the left and their posting lists on the right.
Example:
Term → [Doc1, Doc2, Doc7]
There are several structures that can be used in implementing inverted files: sorted arrays, B-
trees, tries, and various hashing structures, or combinations of these structures. The first
three of these structures are sorted (lexicographically) indices, and can efficiently support
range queries, such as all documents having keywords that start with "comput."
The Sorted Array
An inverted file implemented as a sorted array structure stores the list of keywords in a sorted
array, including the number of documents associated with each keyword and a link to the
documents containing that keyword. This array is commonly searched using a standard
binary search, although large secondary-storage-based systems will often adapt the array (and
its search) to the characteristics of their secondary storage.
The main disadvantage of this approach is that updating the index (for example appending a
new keyword) is expensive. On the other hand, sorted arrays are easy to implement and are
reasonably fast.
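A minimal sketch of the lookup, using Python's bisect module for the standard binary search (the keyword array and postings are assumed toy data, matching the earlier example):

```python
# Sketch of keyword lookup in a sorted-array inverted file via binary search.
import bisect

# Parallel arrays: sorted keywords, document counts, links to the documents.
keywords   = ["data", "information", "retrieval", "system"]
doc_counts = [2, 2, 2, 1]
postings   = [[1, 3], [2, 3], [1, 2], [3]]

def lookup(word):
    i = bisect.bisect_left(keywords, word)   # standard O(log n) binary search
    if i < len(keywords) and keywords[i] == word:
        return doc_counts[i], postings[i]
    return 0, []

print(lookup("retrieval"))                   # (2, [1, 2])
print(lookup("query"))                       # (0, []): not indexed
```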
B-trees
A special case of the B-tree, the prefix B-tree, uses prefixes of words as primary keys in a B-
tree index (Bayer and Unterauer 1977) and is particularly suitable for storage of textual
indices. Each internal node has a variable number of keys. Each key is the shortest word (in
length) that distinguishes the keys stored in the next level. The key does not need to be a
prefix of an actual term in the index. The last level or leaf level stores the keywords
themselves, along with their associated data (see Fig 2). Because the internal node keys and
their lengths depend on the set of keywords, the order (size) of each node of the prefix B-tree
is variable. Updates are done similarly to those for a B-tree to maintain a balanced tree. The



prefix B-tree method breaks down if there are many words with the same (long) prefix. In
this case, common prefixes should be further divided to avoid wasting space.

Fig 2: A prefix B-tree


Compared with sorted arrays, B-trees use more space. However, updates are much easier and
the search time is generally faster, especially if secondary storage is used for the inverted file
(instead of memory). The implementation of inverted files using B-trees is more complex
than using sorted arrays.
Tries
Inverted files can also be implemented using a trie structure. This structure uses the digital
decomposition of the set of keywords to represent those keywords. A special trie structure,
the Patricia (PAT) tree, is especially useful in information retrieval.
BUILDING AN INVERTED FILE USING A SORTED ARRAY
Steps to construct an inverted file:
1. Scan each document in the collection.
2. Extract individual terms (after removing stopwords and stemming).
3. Insert the term into the dictionary if not already present.
4. Add the document ID to the posting list of that term.
Example: For three documents, the process builds a dictionary of terms and their associated
posting lists dynamically as each document is read.
The production of sorted array inverted files can be divided into two or three sequential steps
as shown in Fig.3. First, the input text must be parsed into a list of words along with their
location in the text. This is usually the most time-consuming and storage-consuming
operation in indexing. Second, this list must then be inverted, from a list of terms in location
order to a list of terms ordered for use in searching (sorted into alphabetical order, with a list
of all locations attached to each term). An optional third step is the postprocessing of these
inverted files, such as for adding term weights, or for reorganizing or compressing the files.



Fig 3. Overall schematic of sorted array inverted file creation
Creating the initial word list requires several different operations. First, the individual words
must be recognized from the text. Each word is then checked against a stoplist of common
words, and if it can be considered a noncommon word, may be passed through a stemming
algorithm. The resultant stem is then recorded in the word-within-location list.
The word list resulting from the parsing operation (typically stored as a disk file) is then inverted. This is usually done by sorting on the word (or stem), with duplicates retained (see Fig.4). Even with the use of high-speed sorting utilities, however, this sort can be time consuming for large data sets (on the order of n log n). One way to handle this problem is to break the data sets into smaller pieces, process each piece, and then correctly merge the results. (Methods that do not use sorting at all are described under "Modifications to the Basic Technique" below.) After sorting, the duplicates are merged to produce
within-document frequency statistics. (A system not using within-document frequencies can
just sort with duplicates removed.) Note that although only record numbers are shown as
locations in Fig.4, typically inverted files store field locations and possibly even word
location. These additional locations are needed for field and proximity searching in Boolean
operations and cause higher inverted file storage overhead than if only record location was
needed. Inverted files for ranking retrieval systems usually store only record locations and
term weights or frequencies.
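The sort-then-merge inversion can be sketched as follows; the parsed word/location list is invented toy data (real systems work on disk files):

```python
# Sketch of inversion by sorting: (term, record) pairs are sorted with
# duplicates retained, then merged into within-document frequency statistics.
from itertools import groupby

# Parsed word-within-location list, in text order (invented toy data).
pairs = [("data", 1), ("retrieval", 1), ("data", 1),
         ("information", 2), ("retrieval", 2), ("data", 3)]

pairs.sort()                                  # the O(n log n) sorting step

postings = {}
for (term, rec), group in groupby(pairs):     # merge runs of duplicates
    postings.setdefault(term, []).append((rec, len(list(group))))

print(postings)
# {'data': [(1, 2), (3, 1)], 'information': [(2, 1)], 'retrieval': [(1, 1), (2, 1)]}
```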

Fig.4: Inversion of word list


Although an inverted file could be used directly by the search routine, it is usually processed
into an improved final format. This format is based on the search methods and the (optional)
weighting methods used. A common search technique is to use a binary search routine on the
file to locate the query words. This implies that the file to be searched should be as short as
possible, and for this reason the single file shown containing the terms, locations, and
(possibly) frequencies is usually split into two pieces. The first piece is the dictionary
containing the term, statistics about that term such as number of postings, and a pointer to the
location of the postings file for that term. The second piece is the postings file itself, which



contains the record numbers (plus other necessary location information) and the (optional)
weights for all occurrences of the term. In this manner, the dictionary used in the binary
search has only one "line" per unique term. Fig 5 illustrates the conceptual form of the
necessary files; the actual form depends on the details of the search routine and on the
hardware being used. Work using large data sets (Harman and Candela 1990) showed that for
a file of 2,653 records, there were 5,123 unique terms with an average of 14 postings/term
and a maximum of over 2,000 postings for a term. A larger data set of 38,304 records had
dictionaries on the order of 250,000 lines (250,000 unique terms, including some numbers)
and an average of 88 postings per record. From these numbers it is clear that efficient storage
structures for both the binary search and the reading of the postings are critical.
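A sketch of that split, continuing the toy postings from the previous example (the exact record layout is an assumption for illustration):

```python
# Sketch of splitting the merged inverted file into a short dictionary and a
# separate postings file (layout is an assumption for illustration).
postings = {"data": [(1, 2), (3, 1)], "information": [(2, 1)],
            "retrieval": [(1, 1), (2, 1)]}

postings_file = []   # flat sequence of (record number, frequency) entries
dictionary = []      # one "line" per unique term: (term, #postings, offset)

for term in sorted(postings):
    dictionary.append((term, len(postings[term]), len(postings_file)))
    postings_file.extend(postings[term])

# The binary search runs over the short dictionary only; the stored offset and
# count then locate the term's entries in the much larger postings file.
print(dictionary)    # [('data', 2, 0), ('information', 1, 2), ('retrieval', 2, 3)]
```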

Fig 5: Dictionary and postings file from the last example


Types of Inverted Files
There are two major types of inverted files:
1. Term-to-Document Inverted File – Maps terms to documents.
2. Document-to-Term Inverted File – Maps documents to terms.
Example: The term-to-document file is the most common in search engines. For instance, searching for 'machine learning' will directly retrieve all documents containing these terms.
MODIFICATIONS TO THE BASIC TECHNIQUE
Two different techniques are presented as improvements on the basic inverted file creation. The first technique is for working with very large data sets using secondary storage. The second technique uses multiple memory loads for inverting files.
Producing an Inverted File for Large Data Sets without Sorting



Indexing large data sets using the basic inverted file method presents several problems. Most computers cannot sort the very large disk files needed to hold the initial word list within a reasonable time frame, and do not have the amount of storage necessary to hold a sorted and unsorted version of that word list, plus the intermediate files involved in the internal sort. Whereas the data set could be broken into smaller pieces for processing, and the resulting files properly merged, the following technique may be considerably faster. For small data sets, this technique carries a significant overhead and therefore should not be used.
The new indexing method (Harman and Candela 1990) is a two-step process that does not
need the middle sorting step. The first step produces the initial inverted file, and the second
step adds the term weights to that file and reorganizes the file for maximum efficiency (see
Fig.6).

Fig .6: Flowchart of new indexing method


The creation of the initial inverted file avoids the use of an explicit sort by using a right-
threaded binary tree (Knuth 1973). The data contained in each binary tree node is the current
number of term postings and the storage location of the postings list for that term. As each
term is identified by the text parsing program, it is looked up in the binary tree, and either is
added to the tree, along with related data, or causes tree data to be updated. The postings are
stored as multiple linked lists, one variable length linked list for each term, with the lists
stored in one large file. Each element in the linked postings file consists of a record number
(the location of a given term), the term frequency in that record, and a pointer to the next
element in the linked list for that given term. By storing the postings in a single file, no
storage is wasted, and the files are easily accessed by following the links. As the location of
both the head and tail of each linked list is stored in the binary tree, the entire list does not
need to be read for each addition, but only once for use in creating the final postings file (step
two).
Note that both the binary tree and the linked postings list are capable of further growth. This
is important in indexing large data sets where data is usually processed from multiple
separate files over a short period of time. The use of the binary tree and linked postings list
could be considered as an updatable inverted file. Although these structures are not as
efficient to search, this method could be used for creating and storing supplemental indices
for use between updates to the primary index.
The binary tree and linked postings lists are saved for use by the term weighting routine (step
two). This routine walks the binary tree and the linked postings list to create an alphabetical



term list (dictionary) and a sequentially stored postings file. To do this, each term is
consecutively read from the binary tree (this automatically puts the list in alphabetical order),
along with its related data. A new sequentially stored postings file is allocated, with two
elements per posting. The linked postings list is then traversed, with the frequencies being
used to calculate the term weights (if desired). The last step writes the record numbers and
corresponding term weights to the newly created sequential postings file. These sequentially
stored postings files could not be created in step one because the number of postings is
unknown at that point in processing, and input order is text order, not inverted file order.
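A compact sketch of the two-step method; a Python dict stands in for the right-threaded binary tree, and list indices stand in for file offsets, so this shows the bookkeeping rather than the on-disk layout:

```python
# Sketch of the two-step, sort-free indexing method. A Python dict stands in
# for the right-threaded binary tree, and list indices stand in for the file
# offsets of the linked postings elements.

tree = {}      # term -> [postings count, head index, tail index]
linked = []    # linked postings "file": (record, freq, next_index) elements

def add_posting(term, record, freq):
    linked.append((record, freq, None))      # append-only: no storage wasted
    new = len(linked) - 1
    node = tree.get(term)
    if node is None:
        tree[term] = [1, new, new]           # new term: head = tail = new element
    else:
        rec, f, _ = linked[node[2]]
        linked[node[2]] = (rec, f, new)      # patch the old tail's link
        node[0] += 1                         # update postings count
        node[2] = new                        # tail pointer: no need to walk the list

# Step 1: parse the text in document order (invented toy data).
for term, rec, f in [("data", 1, 2), ("retrieval", 1, 1), ("data", 3, 1)]:
    add_posting(term, rec, f)

# Step 2: walk the terms in sorted order and write each chain sequentially.
for term in sorted(tree):                    # alphabetical dictionary order
    out, i = [], tree[term][1]
    while i is not None:
        rec, f, i = linked[i]                # follow the link to the next element
        out.append((rec, f))
    print(term, "->", out)
# data -> [(1, 2), (3, 1)]
# retrieval -> [(1, 1)]
```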

Table 1: Indexing Statistics


Table 1 gives some statistics showing the differences between an older indexing scheme and the new indexing scheme. The old indexing scheme refers to the indexing method in which records are parsed into a list of words within record locations, the list is inverted by sorting, and finally the term weights are added.
SIGNATURE FILES
Introduction to Signature Files
Signature files are one of the indexing methods used in Information Retrieval (IR). They
provide an alternative to inverted files and use bit strings (signatures) to represent
documents. Searching is done by matching query signatures with document signatures.
Example: A document signature may be a bit string like 101001, where each bit represents
the presence/absence of certain terms.
 Text retrieval methods have attracted much interest recently. There are numerous
applications involving storage and retrieval of textual data.
 Electronic office filing.
 Computerized libraries.
 Automated law and patent offices.
 Electronic storage and retrieval of articles from newspapers and magazines.
 Consumers' databases, which contain descriptions of products in natural language.
 Electronic encyclopedias.
 Indexing of software components to enhance reusability.



 Searching databases with descriptions of DNA molecules.
 Searching image databases, where the images are manually annotated.
The main operational characteristics of all the above applications are the following two:
 Text databases are traditionally large.
 Text databases have archival nature: there are insertions in them, but almost never
deletions and updates.
Text retrieval methods form the following large classes: full text scanning, inversion, and signature files, on which we shall focus next.
Signature files are based on the idea of the inexact filter: They provide a quick test, which
discards many of the nonqualifying items. The qualifying items definitely pass the test; some
additional items ("false hits" or "false drops") may also pass it accidentally. The signature file
approach works as follows: The documents are stored sequentially in the "text file." Their
"signatures" (hash-coded bit patterns) are stored in the "signature file." When a query arrives,
the signature file is scanned and many nonqualifying documents are discarded. The rest are
either checked (so that the "false drops" are discarded) or they are returned to the user as they
are.
A brief, qualitative comparison of the signature-based methods versus their competitors is as follows: The signature-based methods are much faster than full text scanning (1 or 2 orders of magnitude faster, depending on the individual method). Compared to inversion, they require a modest space overhead (typically 10-15%, as opposed to the 50-300% that inversion requires); moreover, they can handle insertions more easily than inversion, because they need "append-only" operations--no reorganization or rewriting of any portion of the signatures. Methods requiring "append-only" insertions have the following advantages:
(a) increased concurrency during insertions (the readers may continue consulting the old
portion of index structure, while an insertion takes place).
(b) these methods work well on Write-Once-Read-Many (WORM) optical disks, which
constitute an excellent archival medium.
On the other hand, signature files may be slow for large databases, precisely because their
response time is linear on the number of items N in the database. Thus, signature files have
been used in the following environments:
1. PC-based, medium-size databases
2. WORM optical disks
3. parallel machines
4. distributed text databases
STRUCTURE OF SIGNATURE FILES
The structure of a signature file generally consists of:
1. Document Signatures – Bit patterns representing documents.



2. Query Signatures – Bit patterns generated from queries.
3. Matching Process – Compare query signatures with document signatures.
BASIC CONCEPTS
In the matching process, the query signature is ANDed with each document signature to filter candidate documents.
Signature files typically use superimposed coding to create the signature of a document. A brief description of the method follows. Each document is divided into "logical blocks," that is, pieces of text that contain a constant number D of distinct, noncommon words. (To improve the space overhead, a stoplist of common words is maintained.) Each such word yields a "word signature," which is a bit pattern of size F, with m bits set to "1", while the rest are "0" (see Fig 1). F and m are design parameters. The word signatures are OR'ed together to form the block signature. Block signatures are concatenated, to form the document signature. The m bit positions to be set to "1" by each word are decided by hash functions. Searching for a word is handled by creating the signature of the word and by examining each block signature for "1"s in those bit positions where the signature of the search word has a "1".
Fig 1: Illustration of the superimposed coding method. It is assumed that each logical block consists of D=2 words
only. The signature size F is 12 bits, m=4 bits per word.

In order to allow searching for parts of words, the following method has been suggested: Each word is divided into successive, overlapping triplets (e.g., "fr", "fre", "ree", "ee" for the word "free"). Each such triplet is hashed to a bit position by applying a hashing function on a numerical encoding of the triplet, for example, considering the triplet as a base-26 number. In the case of a word that has l triplets, with l > m, the word is allowed to set l (nondistinct) bits. If l < m, the additional bits are set using a random number generator, initialized with a numerical encoding of the word.
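A runnable sketch of superimposed coding with F = 12 and m = 4, the parameters of Fig 1 (the particular hash functions are assumptions chosen only for illustration):

```python
# Sketch of superimposed coding: F-bit block signatures, m bits per word.
import hashlib

F, m = 12, 4   # design parameters, as in Fig 1

def word_signature(word):
    sig = 0
    for i in range(m):                        # m hash functions -> m bit positions
        h = hashlib.md5(f"{word}:{i}".encode()).digest()
        sig |= 1 << (h[0] % F)                # positions need not be distinct
    return sig

def block_signature(words):
    sig = 0
    for w in words:
        sig |= word_signature(w)              # OR the word signatures together
    return sig

def may_contain(block_sig, word):
    ws = word_signature(word)
    return block_sig & ws == ws               # all query bits must be "1"

block = block_signature(["free", "text"])     # logical block of D = 2 words
print(may_contain(block, "free"))             # True: qualifying items always pass
print(may_contain(block, "data"))             # False, unless a false drop occurs
```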
An important concept in signature files is the false drop probability Fd. Intuitively, it gives the
probability that the signature test will fail, creating a "false alarm" (or "false hit" or "false
drop"). Notice that the signature test never gives a false dismissal.
DEFINITION: False drop probability, Fd, is the probability that a block signature seems to qualify, given that the block does not actually qualify. Expressed mathematically:

Fd = Prob{signature qualifies | block does not}

The signature file is an F × N binary matrix. Previous analysis showed that, for a given value of F, the value of m that minimizes the false drop probability is such that each row of the matrix contains "1"s with probability 50 percent. Under such an optimal design, we have

Fd = 2^(-m)

F ln 2 = m D
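A quick worked example of these formulas (the values D = 40 and m = 10 are assumptions chosen for illustration):

```python
# Worked example of the optimal-design formulas Fd = 2^(-m) and F ln 2 = m D.
import math

D, m = 40, 10                       # words per block, bits per word (assumed)
Fd = 2 ** (-m)                      # false drop probability: about 0.1 percent
F = math.ceil(m * D / math.log(2))  # required signature size in bits
print(Fd, F)                        # 0.0009765625 578
```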



This is the reason that documents have to be divided into logical blocks: Without logical blocks, a long document would have a signature full of "1"s, and it would always create a false drop. To avoid unnecessary complications, for the rest of the discussion we assume that all the documents span exactly one logical block.

Table 1: Symbols and definitions


The most straightforward way to store the signature matrix is to store the rows sequentially.
For the rest of this work, the above method will be called SSF, for Sequential Signature File.
Fig 2 illustrates the file structure used: In addition to the text file and the signature file, we
need the so-called "pointer file," with pointers to the beginnings of the logical blocks (or,
alternatively, to the beginning of the documents).

Fig 2 : File structure for SSF


Although SSF has been used as is, it may be slow for large databases. Many methods have
been suggested, trying to improve the response time of SSF, trading off space or insertion
simplicity for speed. The main ideas behind all these methods are the following:
1. Compression: If the signature matrix is deliberately sparse, it can be compressed.
2. Vertical partitioning: Storing the signature matrix columnwise improves the response
time at the expense of insertion time.
3. Horizontal partitioning: Grouping similar signatures together and/or providing an index
on the signature matrix may result in better-than-linear search.



The methods we shall examine form the classes shown in Fig 3. For each of these classes we shall describe the main idea, the main representatives, and the available performance results, discussing mainly:
 the storage overhead,
 the response time on single word queries,
 the performance on insertion, as well as whether the insertion maintains the "append-only" property.

Sequential storage of the signature matrix
 without compression: sequential signature files (SSF)
 with compression: bit-block compression (BC); variable bit-block compression (VBC)
Vertical partitioning
 without compression: bit-sliced signature files (BSSF, B'SSF); frame-sliced (FSSF); generalized frame-sliced (GFSSF)
 with compression: compressed bit slices (CBS); doubly compressed bit slices (DCBS); no-false-drop method (NFD)
Horizontal partitioning
 data independent partitioning: Gustafson's method; partitioned signature files
 data dependent partitioning: 2-level signature files; S-trees
Fig 3: Classification of the signature-based methods

COMPRESSION
Steps in constructing signature files:
1. Assign a unique bit position or hash for each term.
2. For each document, generate a bit string where bits are set according to terms present.
3. Store these bit strings in the signature file.
4. Queries are also converted into bit strings and matched.
Example: If terms are {data, retrieval, system}, a document containing {data, system} will
have signature 101.



In this section we examine a family of methods. These methods create sparse document signatures on purpose, and then compress them before storing them sequentially. Analysis in that paper showed that, whenever compression is applied, the best value for m is 1. Also, it was shown that the resulting methods achieve better false drop probability than SSF for the same space overhead.
The idea in these methods is that we use a (large) bit vector of B bits and we hash each word into one (or perhaps more, say n) bit position(s), which are set to "1" (see Fig 4). The resulting bit vector will be sparse and therefore it can be compressed.

Fig 4: Illustration of the compression-based methods. With B = 20 and n = 1 bit per word, the resulting
bit vector is sparse and can be compressed

The spacewise best compression method is based on run-length encoding, using the approach of "infinite Huffman codes." However, searching becomes slow: to determine whether a bit is "1" in the sparse vector, the encoded lengths of all the preceding intervals (runs) have to be decoded and summed (see Fig 5).
Fig 5: Compression using run-length encoding. The notation [x] stands for the encoded value of number x

Bit-block Compression (BC)

This method accelerates the search by sacrificing some space, compared to the run-length encoding technique. The compression method is based on bit-blocks, and was called BC (for bit-Block Compression). To speed up the searching, the sparse vector is divided into groups of consecutive bits (bit-blocks); each bit-block is encoded individually. For each bit-block we create a signature, which is of variable length and consists of at most three parts (see Fig 6):

Part I. It is one bit long, and it indicates whether there are any "1"s in the bit-block (1) or the bit-block is empty (0). In the latter case, the bit-block signature stops here.

Part II. It indicates the number s of "1"s in the bit-block. It consists of s - 1 "1"s and a terminating zero. This is not the optimal way to record the number of "1"s; however, this representation is simple and it seems to give results close to the optimal.

Part III. It contains the offsets of the "1"s from the beginning of the bit-block (lg b bits for each "1", where b is the bit-block size).

Fig 6: Illustration of the BC method with bit-block size b = 4.


Fig 6 illustrates how the BC method compresses the sparse vector of Fig 4. Fig 7 illustrates the way to store
parts of a document signature: the first parts of all the bit-block signatures are stored consecutively, then the
second parts, and so on.



0 1 0 1 1 | 10 0 0 | 00 11 10 00

Fig 7: BC method--Storing the signature by concatenating the parts. Vertical lines indicate the part
boundaries.
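A sketch of the BC encoding in Python; the 20-bit vector and its "1" positions are invented for the example, and offsets use lg b bits each, as in Part III above:

```python
# Sketch of bit-block compression (BC): encode a sparse bit vector in three parts.
import math

def bc_encode(bits, b):
    """bits: list of 0/1 values; b: bit-block size. Returns (part1, part2, part3)."""
    off_bits = int(math.log2(b))              # lg b bits per offset
    part1, part2, part3 = "", "", ""
    for start in range(0, len(bits), b):
        block = bits[start:start + b]
        ones = [i for i, bit in enumerate(block) if bit]
        if not ones:
            part1 += "0"                      # empty block: Part I only
            continue
        part1 += "1"                          # non-empty block
        part2 += "1" * (len(ones) - 1) + "0"  # s "1"s encoded as (s-1) ones + 0
        part3 += "".join(format(i, f"0{off_bits}b") for i in ones)
    return part1, part2, part3

# 20-bit sparse vector with "1"s at positions 3, 7, 11, 13 (invented example).
vec = [0] * 20
for pos in (3, 7, 11, 13):
    vec[pos] = 1
print(bc_encode(vec, b=4))   # ('11110', '0000', '11111101')
```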

Variable Bit-block Compression (VBC)


The BC method was slightly modified to become insensitive to changes in the number of words D per block. This is desirable because the need to split messages into logical blocks is eliminated, thus simplifying the resolution of complex queries: There is no need to "remember" whether some of the terms of the query have appeared in one of the previous logical blocks of the message under inspection.
The idea is to use a different value for the bit-block size b_opt for each message, according to the number W of bits set to "1" in the sparse vector. The size of the sparse vector B is the same for all messages. A formula in the original paper chooses the optimal bit-block size b for a document with W (distinct) words; arithmetic examples in the same paper indicate the advantages of the modified method.
This method was called VBC (for Variable bit-Block Compression). Fig 8 illustrates an example layout of the signatures in the VBC method. The upper row corresponds to a small message with small W, while the lower row corresponds to a message with large W. Thus, the upper row has a larger value of b_opt, fewer bit-blocks, a shorter Part I (the size of Part I is the number of bit-blocks), a shorter Part II (its size is W), and fewer but larger offsets in Part III (the size of each offset is log b_opt bits).
Fig 8: An example layout of the message signatures in the VBC method.
VERTICAL PARTITIONING
Methods of Generating Signatures
Common methods include:
1. Bit Slicing – Assigning fixed bits to terms.
2. Hashing – Applying hash functions to terms to set bits.
3. Superimposed Coding – Combining multiple hash results to reduce collisions.
Example: Hashing ensures that even large vocabularies can be mapped into smaller fixed-
size signatures.



The idea behind the vertical partitioning is to avoid bringing useless portions of the document signature into main memory; this can be achieved by storing the signature file in a bit-sliced form or in a "frame-sliced" form.
Bit-Sliced Signature Files (BSSF)
The bit-sliced design is illustrated in Fig 10 (a transposed bit matrix). To allow insertions, we propose using F different files, one per each bit position, which will be referred to as "bit-files." The method will be called BSSF, for "Bit-Sliced Signature Files."
B'SSF, a Faster Version of BSSF
The traditional design of BSSF suggests choosing the optimal value of m to be such that the document signatures contain "1"s with probability 50 percent. A typical value of m is of the order of 10. This implies 10 random disk accesses on a single word query. It is suggested to use a smaller than optimal value of m; thus, the number of random disk accesses decreases. The drawback is that, in order to maintain the same false drop probability, the document signatures have to be longer.
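A toy sketch of bit-sliced storage and search, with one integer bitmask standing in for each on-disk bit-file (the signatures and sizes are invented):

```python
# Toy sketch of bit-sliced signature files: the F x N signature matrix is kept
# column-wise, with one integer bitmask standing in for each on-disk bit-file.
F, N = 12, 8                       # signature size and number of blocks (assumed)
slices = [0] * F                   # slices[b] has bit d set iff block d sets bit b

def insert(doc_id, signature_bits):
    for b in signature_bits:       # append-only: only the affected slices change
        slices[b] |= 1 << doc_id

def search(query_bits):
    result = (1 << N) - 1          # start with every block as a candidate
    for b in query_bits:           # read just the m query slices, not all F
        result &= slices[b]        # AND the bit slices together
    return [d for d in range(N) if result >> d & 1]

insert(0, [1, 4, 7, 9])            # invented block signatures
insert(3, [1, 2, 7, 11])
print(search([1, 7]))              # [0, 3]: candidates that may contain the word
```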
Frame-Sliced Signature File
The idea behind this method is to force each word to hash into bit positions that are close to
each other in the document signature. Then, these bit files are stored together and can be
retrieved with few random disk accesses. The main motivation for this organization is that
random disk accesses are more expensive than sequential ones, since they involve movement
of the disk arm.
Generalized Frame-Sliced Signature File (GFSSF)
In FSSF, each word selects only one frame and sets m bit positions in that frame. A more
general approach is to select n distinct frames and set m bits (not necessarily distinct) in each
frame to generate the word signature. The document signature is the OR-ing of all the word
signatures of all the words in that document. This method is called Generalized Frame-Sliced
Signature File.
Notice that BSSF, B'SSF, FSSF, and SSF are actually special cases of GFSSF:
 When k = F and n = m, it reduces to the BSSF or B'SSF method.
 When n = 1, it reduces to the FSSF method.
 When k = 1 and n = 1, it becomes the SSF method (the document signature is broken down to one frame only).
VERTICAL PARTITIONING AND COMPRESSION



The idea in all the methods in this class is to create a very sparse signature matrix, to store it
in a bit-sliced form, and compress each bit slice by storing the position of the "1"s in the
slice. The methods in this class are closely related to inversion with a hash table.
Compressed Bit Slices (CBS)
Although the bit-sliced method is much faster than SSF on retrieval, there may be room for
two improvements:
1. On searching, each search word requires the retrieval of m bit files, exactly because each
word signature has m bits set to "1". The search time could be improved if m was forced to be
"1".
2. The insertion of a logical block requires too many disk accesses.
Doubly Compressed Bit Slices (DCBS)
The motivation behind this method is to try to compress the sparse directory of CBS. The file structure we propose consists of a hash table, an intermediate file, a postings file, and the text file, as in Fig 11. The method is similar to CBS. It uses a hashing function h1(), which returns values in the range (0, S-1) and determines the slot in the directory. The difference is that DCBS makes an effort to distinguish among synonyms, by using a second hashing function h2(), which returns bit strings that are h bits long. These hash codes are stored in the "intermediate file," which consists of buckets of Bi bytes (a design parameter). Each such bucket contains records of the form (hashcode, ptr). The pointer ptr is the head of a linked list of postings buckets.

Fig 11: Illustration of DCBS

Fig 11 illustrates an example where the word "base" appears in the document that starts at the 1145th byte of the text file. The example also assumes that h = 3 bits, h1("base") = 30, and h2("base") = (011)2.
Searching for the word "base" is handled as follows:
Step 1 h1("base") = 30: The 30-th pointer of the directory will be followed. The
corresponding chain of intermediate buckets will be examined.
Step 2 h2("base") = (011)2: The records in the above intermediate buckets will be examined.
If a matching hash code is found (at most one will exist!), the corresponding pointer is
followed, to retrieve the chain of postings buckets.
Step 3 The pointers of the above postings buckets will be followed, to retrieve the qualifying
(actually or falsely) documents.
Insertion is omitted for brevity.
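A simplified dict-and-list sketch of the DCBS lookup path (the hash functions and bucket layout here are assumptions; real intermediate and postings buckets live on disk):

```python
# Simplified sketch of the DCBS lookup path: directory slot -> intermediate
# records (h-bit hash codes) -> postings.
S, H = 64, 3                                   # directory slots, bits in h2()

def h1(word):                                  # picks the slot in the directory
    return sum(word.encode()) % S

def h2(word):                                  # h-bit code separating synonyms
    return hash(word) & ((1 << H) - 1)         # salted hash: fine within one run

directory = [[] for _ in range(S)]             # slot -> [(hashcode, postings)]

def insert(word, doc_ptr):
    bucket = directory[h1(word)]
    for code, postings in bucket:
        if code == h2(word):                   # same code: share the chain
            postings.append(doc_ptr)
            return
    bucket.append((h2(word), [doc_ptr]))

def search(word):
    for code, postings in directory[h1(word)]: # Step 1: follow the slot's chain
        if code == h2(word):                   # Step 2: at most one code matches
            return postings                    # Step 3: retrieve the documents
    return []

insert("base", 1145)                           # document starting at byte 1145
print(search("base"))                          # [1145]
```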
No False Drops Method (NFD)



This method avoids false drops completely, without storing the actual words in the index. The idea is to modify the intermediate file of the DCBS, and store a pointer to the word in the text file. Specifically, each record of the intermediate file will have the format (hashcode, ptr, ptr-to-word), where ptr-to-word is a pointer to the word in the text file (see Fig 12 for an illustration).

Fig 12: Illustration of NFD

This way each word can be completely distinguished from its synonyms, using only h bits for the hash code and p bytes (usually p = 4) for the ptr-to-word. The advantages of storing ptr-to-word instead of storing the actual word are two: (1) space is saved (a word from the dictionary is typically around 8 characters long, while the pointer needs only p bytes), and (2) the records of the intermediate file have fixed length. Thus, there is no need for a word delimiter and there is no danger of a word crossing bucket boundaries.
Searching is done in a similar way with DCBS. The only difference is that, whenever a
matching hash code is found in Step 2, the corresponding ptr-to-word is followed, to avoid
synonyms completely.
HORIZONTAL PARTITIONING
The motivation behind all these methods is to avoid the sequential scanning of the signature
file (or its bit-slices), in order to achieve better than O(N) search time. Thus, they group the
signatures into sets, partitioning the signature matrix horizontally. The grouping criterion can
be decided beforehand, in the form of a hashing function h(S), where S is a document
signature (data independent case). Alternatively, the groups can be determined on the fly,
using a hierarchical structure (e.g. a B-tree--data dependent case).
Data Independent Case
Gustafson's method
The earliest approach was proposed by Gustafson (1971). Suppose that we have records with,
say six attributes each. For example, records can be documents and attributes can be
keywords describing the document. Consider a hashing function h that hashes a keyword w to
a number h(w) in the range 0-15. The signature of a keyword is a string of 16 bits, all of
which are zero except for the bit at position h(w). The record signature is created by
superimposing the corresponding keyword signatures. If k < 6 bits are set in a record signature, an additional 6 - k bits are set by some random method. Thus, there are C(16,6) = 8,008 possible distinct record signatures (where C(m,n) denotes the number of combinations of m items choose n). Using a hash table with 8,008 slots, we can map each record signature to one such slot as follows: Let p1 < p2 < ... < p6 be the positions where the "1"s occur in the record signature. Then the function

C(p1,1) + C(p2,2) + ... + C(p6,6)



maps each distinct record signature to a number in the range 0-8,007. The interesting point of
the method is that the extent of the search decreases quickly (almost exponentially) with the
number of terms in the (conjunctive) query. Single word queries touch C(15,5) = 3,003 slots
of the hash table, two-word queries touch C(14, 4) = 1,001 slots, and so on.
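A sketch of the combinatorial addressing in Python (math.comb is the standard combinations function; the position lists are invented examples):

```python
# Sketch of Gustafson's combinatorial addressing: a record signature with
# exactly six "1"s maps to a unique slot in the range 0..8007.
from math import comb

def gustafson_address(positions):
    """positions: the six bit positions p1 < p2 < ... < p6 set in the signature."""
    return sum(comb(p, i) for i, p in enumerate(sorted(positions), start=1))

print(comb(16, 6))                                   # 8008 distinct signatures
print(gustafson_address([0, 1, 2, 3, 4, 5]))         # 0: the smallest signature
print(gustafson_address([10, 11, 12, 13, 14, 15]))   # 8007: the largest
```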
Although elegant, Gustafson's method suffers from some practical problems:
1. Its performance deteriorates as the file grows.
2. If the number of keywords per document is large, then either we must have a huge hash
table or usual queries (involving 3-4 keywords) will touch a large portion of the database.
3. Queries other than conjunctive ones are handled with difficulty.

Partitioned signature files


Lee and Leng (1989) proposed a family of methods that can be applied for longer documents.
They suggested using a portion of a document signature as a signature key to partition the
signature file. For example, we can choose the first 20 bits of a signature as its key and all
signatures with the same key will be grouped into a so-called "module." When a query
signature arrives, we will first examine its signature key and look for the corresponding
modules, then scan all the signatures within those modules that have been selected.
Data Dependent Case
Two-level signature files
Sacks-Davis and his colleagues (1983, 1987) suggested using two levels of signatures. Their
documents are bibliographic records of variable length. The first level of signatures consists
of document signatures that are stored sequentially, as in the SSF method. The second level
consists of "block signatures"; each such signature corresponds to one block (group) of
bibliographic records, and is created by superimposing the signatures of all the words in this
block, ignoring the record boundaries. The second level is stored in a bit-sliced form. Each
level has its own hashing functions that map words to bit positions.
Searching is performed by scanning the block signatures first, and then concentrating on
these portions of the first-level signature file that seem promising.
Analysis on a database with N = 10^6 records (with 128 bytes per record on the average) reported response times as low as 0.3 seconds for single word queries, when 1 record matched the query. The BSSF method required 1-5 seconds for the same situation.
A subtle problem arises when multiterm conjunctive queries are asked (e.g., "data and retrieval"): a block may result in an unsuccessful block match because it may contain the desired terms, but not within the same record. The authors propose a variety of clever ways to minimize these block matches.
S-tree



Deppisch (1986) proposed a B-tree-like structure to facilitate fast access to the records (which are signatures) in a signature file. The leaf of an S-tree consists of k "similar" (i.e., with small Hamming distance) document signatures along with the document identifiers. The OR-ing of these k document signatures forms the "key" of an entry in an upper level node, which serves as a directory for the leaves. Recursively we construct directories on lower level directories until we reach the root. The S-tree is kept balanced in a similar manner as a B-tree: when a leaf node overflows, it is split into two groups of "similar" signatures; the father node is changed appropriately to reflect the new situation. Splits may propagate upward until reaching the root.

Advantages and Disadvantages of Signature Files


Advantages:
Compact storage compared to inverted files.
Efficient for small and medium databases.

Disadvantages:
May produce false positives.
Not as efficient for very large collections compared to inverted files.

Long Answer Questions


1. Explain the structure of inverted files.
2. Describe the steps in building an inverted file using a sorted array.
3. Compare inverted file indexing with signature file indexing.
4. Explain the advantages of each approach.
5. Discuss vertical and horizontal partitioning in signature files.
6. Explain how compression is applied in signature files.
7. Implement a simple inverted index for a set of documents.
8. Demonstrate how modifications to basic techniques improve performance.
9. Evaluate the efficiency of inverted file indexing for large datasets.
10. Propose methods to optimize retrieval using indexing.
Short Answer Questions
1. Define an inverted file in Information Retrieval.
2. What are the two main components of an inverted file?
3. Differentiate between term-to-document and document-to-term inverted files.
4. State one advantage and one disadvantage of using a sorted array for inverted files.
5. What is the role of a dictionary in inverted file structure?
6. Mention one limitation of using B-trees in inverted file indexing.
7. What is a trie structure in the context of inverted files?
8. Write two steps involved in building an inverted file using a sorted array.
9. Define posting list with an example.
10. What is the need for modifications in the basic inverted file technique?
11. Define a signature file in Information Retrieval.
12. What is a false drop in signature files?



13. State one advantage and one disadvantage of signature files.
14. What are the two main types of partitioning in signature files?
15. Explain the role of superimposed coding in signature files.
16. What is the difference between vertical and horizontal partitioning?
17. Define bit-sliced signature file (BSSF).
18. What is the purpose of compression in signature files?
19. List any two applications of signature files.
20. Compare inverted files and signature files in one point each.

