Course Name: Advanced Information Retrieval

The document discusses inverted indexes. It begins by defining an inverted index as a database index that maps content like words or numbers to their locations in documents. It notes inverted indexes allow for fast full-text searches but at the cost of increased processing when adding documents. There are two types: record-level indexes containing word to document mappings, and word-level containing word positions within documents. Advantages are fast searches and common use in search engines, while disadvantages are large storage and maintenance costs for updates. Python code is provided as an implementation example using CountVectorizer.


JIMMA UNIVERSITY

JIMMA INSTITUTE OF TECHNOLOGY


FACULTY OF COMPUTING AND INFORMATICS
MSC. IN INFORMATION SCIENCE
(ELECTRONIC AND DIGITAL RESOURCE MANAGEMENT)

Course Name: Advanced Information Retrieval

Assignment 2: Inverted Index

Prepared by: Ruth Wondu

Submitted to: Dr Getachew

January, 2021
1. Definition of Inverted Index
An inverted index (also referred to as a postings file or inverted file) is a database index that stores a
mapping from content, such as words or numbers, to its locations in a table, in a document, or in
a set of documents. It is named in contrast to a forward index, which maps from documents to
content. The purpose of an inverted index is to allow fast full-text searches, at the cost of
increased processing when a document is added to the database. The inverted file may be the
database file itself, rather than its index. It is the most popular data structure in document
retrieval systems and is used on a large scale, for example in search engines. Additionally, several
significant general-purpose mainframe-based database management systems have used inverted
list architectures, including ADABAS, DATACOM/DB, and Model 204.

Put simply, an inverted index is a hash-map-like data structure that takes you from a word to the
documents or web pages that contain it. The alternate names ‘postings file’ and ‘inverted file’
refer to the same structure. A ‘forward index’ works in the opposite direction and maps each
document to the content it contains. Because a lookup goes directly from a word to its documents,
full-text searches are quick. The price is paid when a document is added to the database, since
every word in it must be entered into the index. This trade-off is why the structure is especially
useful on a large scale; in search engines, for instance.
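
The contrast between a forward index and an inverted index can be sketched in a few lines of Python. This is only a minimal illustration: the toy documents and the helper name `invert` are assumptions made for the example, not part of the assignment.

```python
from collections import defaultdict

# Forward index: document id -> words it contains (toy data, made up for illustration).
forward_index = {
    1: ["hello", "everyone"],
    2: ["hello", "world"],
}


def invert(forward):
    """Turn a document -> words mapping into a word -> documents mapping."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word in forward[doc_id]:
            # Record each document at most once per word.
            if doc_id not in inverted[word]:
                inverted[word].append(doc_id)
    return dict(inverted)


inverted_index = invert(forward_index)
print(inverted_index["hello"])  # [1, 2]
```

Looking up a word is then a single dictionary access, whereas answering the same question from the forward index would mean scanning every document.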

There are two types of inverted indexes:

1. A record-level inverted index – Alternatively an ‘inverted file index’ or just ‘inverted file’.
This variant contains a list of references to documents for each word.

2. A word-level inverted index – Alternatively ‘full inverted index’ or ‘inverted list’. This
variant additionally contains the positions of each word within a document. This form
provides more functionality (like phrase searches). However, it requires more processing
power and space to create.

Suppose we want to search the texts “hello everyone,” “this article is based on inverted index,”
“which is hash map like data structure”. If we index by (text, word within the text), each word
maps to a list of (text number, word position) pairs. For example, “hello” maps to (1, 1), and
“is” maps to (2, 3) and (3, 2).
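
A word-level index over the three example texts above can be sketched as follows. The function names `build_word_level_index` and `phrase_search` are hypothetical helpers introduced for illustration, and the sketch assumes simple whitespace tokenization with 1-based text numbers and word positions.

```python
from collections import defaultdict

texts = [
    "hello everyone",
    "this article is based on inverted index",
    "which is hash map like data structure",
]


def build_word_level_index(texts):
    """Map each word to a list of (text number, word position) pairs, both 1-based."""
    index = defaultdict(list)
    for text_no, text in enumerate(texts, start=1):
        for position, word in enumerate(text.split(), start=1):
            index[word].append((text_no, position))
    return dict(index)


def phrase_search(index, phrase):
    """Return the text numbers in which the words of `phrase` occur consecutively."""
    words = phrase.split()
    if not words:
        return []
    hits = set()
    for text_no, start in index.get(words[0], []):
        # Each following word must sit exactly one position further along.
        if all((text_no, start + offset) in index.get(word, [])
               for offset, word in enumerate(words[1:], start=1)):
            hits.add(text_no)
    return sorted(hits)


word_index = build_word_level_index(texts)
print(word_index["hello"])                          # [(1, 1)]
print(word_index["is"])                             # [(2, 3), (3, 2)]
print(phrase_search(word_index, "inverted index"))  # [2]
```

The phrase search works only because positions are stored; a record-level index could say that both words occur in text 2, but not that they are adjacent.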

An inverted index is an index data structure storing a mapping from content, such as words or
numbers, to its locations in a document or a set of documents. In simple words, it is a
hash-map-like data structure that directs you from a word to a document or a web page.

2. Advantages and Disadvantages


Advantages

• An inverted index allows fast full-text searches.
• It is easy to develop.
• It is the most popular data structure used in document retrieval systems and is used on a
large scale, for example in search engines.

Disadvantages

• Large storage overhead and high maintenance costs on each update, delete and insert.
• Increased processing whenever a document is added to the database.
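
The maintenance cost can be seen in a small sketch of a record-level index that supports insert and delete. The class `TinyInvertedIndex` and its method names are hypothetical and chosen for illustration only; a production system would organize its postings differently.

```python
from collections import defaultdict


class TinyInvertedIndex:
    """Record-level inverted index; a hypothetical helper for illustration."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of document ids

    def add(self, doc_id, text):
        # Every distinct word in the new document touches one postings list.
        for word in set(text.lower().split()):
            self.postings[word].add(doc_id)

    def remove(self, doc_id):
        # Without a forward index, every postings list has to be visited.
        for word in list(self.postings):
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]

    def search(self, word):
        # A lookup is a single dictionary access.
        return sorted(self.postings.get(word.lower(), ()))


index = TinyInvertedIndex()
index.add(1, "hello everyone")
index.add(2, "hello again")
print(index.search("hello"))  # [1, 2]
index.remove(1)
print(index.search("hello"))  # [2]
```

In this sketch a search costs one lookup, an insert costs one update per distinct word, and a delete walks the whole vocabulary, which mirrors the trade-off listed above.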

3. Implementation using any programming language (Python/Java ...). Python is the
recommended programming language

from sklearn.feature_extraction.text import CountVectorizer

# min_df=1 keeps every term that appears in at least one document.
vectorizer = CountVectorizer(min_df=1)

documents = ['Computer program used to retrieve digital information.',
             'Software is necessary for users to access digital information.',
             'Digital ICT is communication through computer-based systems.']

# Rows are documents, columns are vocabulary terms, cells are term counts.
In = vectorizer.fit_transform(documents).toarray()

print('In={0}'.format(In))

# vocabulary_ maps each term to its column index in the matrix.
print('{0}'.format(vectorizer.vocabulary_))
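
The matrix printed above has one row per document and one column per term, so it records term counts rather than postings lists. A minimal sketch of deriving postings from such a matrix follows. The function name `postings_from_matrix` is a hypothetical helper, and the toy matrix and vocabulary are made-up stand-ins, not the output of the code above.

```python
def postings_from_matrix(matrix, vocabulary):
    """Build term -> list of document numbers from a count matrix.

    matrix: rows are documents, columns are terms (nested lists or a NumPy array).
    vocabulary: term -> column index, shaped like CountVectorizer.vocabulary_.
    """
    postings = {}
    for term, column in sorted(vocabulary.items()):
        # A document belongs to the postings list if its count for the term is non-zero.
        postings[term] = [doc_no for doc_no, row in enumerate(matrix)
                          if row[column] > 0]
    return postings


# Toy stand-in data (made up for illustration): 3 documents, 3 terms.
toy_matrix = [[1, 1, 0],
              [0, 1, 1],
              [1, 1, 0]]
toy_vocabulary = {"computer": 0, "digital": 1, "software": 2}

toy_postings = postings_from_matrix(toy_matrix, toy_vocabulary)
for term, docs in toy_postings.items():
    print(term, docs)
```

Document numbers here are 0-based row indexes. Assuming the listing above has been run, calling `postings_from_matrix(In, vectorizer.vocabulary_)` should give the record-level postings for the three sample documents.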

