JIMMA UNIVERSITY
JIMMA INSTITUTE OF TECHNOLOGY
FACULTY OF COMPUTING AND
INFORMATICS
MSC. IN INFORMATION SCIENCE
(ELECTRONIC AND DIGITAL RESOURCE
MANAGEMENT)
Course Name: Advanced Information Retrieval
Assignment 2: Inverted Index
Prepared by: Ruth Wondu
Submitted to: Dr Getachew
January, 2021
1. Definition of Inverted Index
An inverted index (also referred to as a postings file or inverted file) is a database index that stores a
mapping from content, such as words or numbers, to its locations in a table, a document, or
a set of documents. It is named in contrast to a forward index, which maps from documents to
content. The purpose of an inverted index is to allow fast full-text searches, at the cost of
increased processing when a document is added to the database. The inverted file may be the
database file itself, rather than its index. It is the most popular data structure in document
retrieval systems and is used on a large scale, for example in search engines. Additionally, several
significant general-purpose mainframe-based database management systems have used inverted
list architectures, including ADABAS, DATACOM/DB, and Model 204.
In computer science terms, an inverted index is an index data structure that maps content, such as
words or numbers, to the places where that content occurs within a document or a set of
documents. A forward index does the opposite: it maps each document to the content it contains.
Put simply, an inverted index is a hash-map-like data structure that leads from a word to the
documents or web pages containing it. The price of fast full-text search is extra processing
each time a document is added, because every term in the new document must be entered into
the index. This trade-off pays off most at large scale, as in search engines.
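The contrast between a forward index and an inverted index can be sketched in a few lines of Python. This is only an illustration: the document IDs, the word lists, and the function name `invert` are hypothetical and do not come from any particular system.

```python
from collections import defaultdict

# Hypothetical forward index: document ID -> words it contains.
forward_index = {
    "doc1": ["inverted", "index", "search"],
    "doc2": ["forward", "index"],
    "doc3": ["search", "engine"],
}

def invert(forward):
    """Turn a document -> words mapping into a word -> documents mapping."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word in forward[doc_id]:
            # Record each document at most once per word.
            if doc_id not in inverted[word]:
                inverted[word].append(doc_id)
    return dict(inverted)

inverted_index = invert(forward_index)
print(inverted_index["index"])   # ['doc1', 'doc2']
print(inverted_index["search"])  # ['doc1', 'doc3']
```

Looking up a word is now a single dictionary access, whereas the forward index would have to be scanned document by document.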
There are two types of inverted indexes:
1. A record-level inverted index, also called an ‘inverted file index’ or just ‘inverted file’.
This variant contains a list of references to documents for each word.
2. A word-level inverted index, also called a ‘full inverted index’ or ‘inverted list’. This
variant additionally contains the positions of each word within a document. This form
provides more functionality (such as phrase searches), but it requires more processing
power and space to create.
Suppose we want to search the texts “hello everyone,” “this article is based on inverted index,”
and “which is hash map like data structure”. If we number the texts from 0 and index by (text,
word position within the text), the word-level index records, for example, that “is” occurs
at (1, 2) and (2, 1), while “hello” occurs only at (0, 0).
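The two variants can be compared on the three example texts with a minimal Python sketch. It assumes simple whitespace tokenization and numbers both texts and word positions from 0; the function names `build_record_level`, `build_word_level`, and `phrase_search` are hypothetical.

```python
from collections import defaultdict

# The three example texts, numbered from 0.
texts = [
    "hello everyone",
    "this article is based on inverted index",
    "which is hash map like data structure",
]

def build_record_level(docs):
    """Map each word to the sorted list of text numbers containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def build_word_level(docs):
    """Map each word to a list of (text number, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)

def phrase_search(index, phrase):
    """Return texts where the words of the phrase appear consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    hits = set()
    for doc_id, pos in index.get(words[0], []):
        # Every following word must sit at the next position in the same text.
        if all((doc_id, pos + i) in index.get(w, []) for i, w in enumerate(words)):
            hits.add(doc_id)
    return sorted(hits)

record_index = build_record_level(texts)
word_index = build_word_level(texts)

print(record_index["is"])                           # [1, 2]
print(word_index["is"])                             # [(1, 2), (2, 1)]
print(phrase_search(word_index, "inverted index"))  # [1]
```

The phrase search only works on the word-level index, because the record-level index has discarded the positions it would need.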
In summary, an inverted index is an index data structure that stores a mapping from content, such as
words or numbers, to its locations in a document or a set of documents. In simple words, it is a
hash-map-like data structure that directs you from a word to a document or a web page.
2. Advantages and Disadvantages
Advantages
An inverted index allows fast full-text searches, at the cost of increased processing when a
document is added to the database.
It is easy to develop.
It is the most popular data structure in document retrieval systems and is used on a large
scale, for example in search engines.
Disadvantages
Large storage overhead and high maintenance costs on every update, delete, and insert.
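The maintenance cost can be made concrete with a toy record-level index that counts how many postings lists each operation touches. This is a sketch under simplifying assumptions (in-memory sets, whitespace tokenization); the class `InvertedIndex` and its methods are hypothetical names, not a real library API.

```python
from collections import defaultdict

class InvertedIndex:
    """Toy record-level index that reports how many postings lists an operation touches."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of document IDs
        self.doc_terms = {}               # document ID -> its words (extra storage, needed for deletes)

    def add(self, doc_id, text):
        terms = set(text.lower().split())
        for term in terms:
            self.postings[term].add(doc_id)
        self.doc_terms[doc_id] = terms
        return len(terms)                 # one postings list touched per distinct term

    def remove(self, doc_id):
        terms = self.doc_terms.pop(doc_id, set())
        for term in terms:
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]   # drop postings lists that became empty
        return len(terms)

    def update(self, doc_id, text):
        # An update is handled as a delete followed by an insert.
        return self.remove(doc_id) + self.add(doc_id, text)

index = InvertedIndex()
touched_on_add = index.add(1, "fast full text search")
touched_on_update = index.update(1, "fast search")
print(touched_on_add, touched_on_update)  # 4 6
```

Even in this small case, changing one document touches every postings list for its old and new terms, and the index must keep a per-document term set just to support deletion.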
3. Implementation using Python
from sklearn.feature_extraction.text import CountVectorizer

# Build a term-document count matrix: one row per document, one column per term.
vectorizer = CountVectorizer(min_df=1)
documents = ['Computer program used to retrieve digital information.',
             'Software is necessary for users to access digital information.',
             'Digital ICT is communication through computer-based systems.']
# fit_transform learns the vocabulary and returns a sparse matrix of term counts.
In = vectorizer.fit_transform(documents).toarray()
print('In={0}'.format(In))
print('{0}'.format(vectorizer.vocabulary_))
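The count matrix records how often each term occurs in each document; an inverted index keeps, for each term, only the documents in which it occurs. The following plain-Python sketch builds such postings for the same three documents and answers a simple AND query. It assumes a tokenizer (lowercase tokens of two or more word characters) that is meant to approximate the default behaviour of CountVectorizer; the names `build_postings` and `search_all` are hypothetical.

```python
import re
from collections import defaultdict

documents = [
    "Computer program used to retrieve digital information.",
    "Software is necessary for users to access digital information.",
    "Digital ICT is communication through computer-based systems.",
]

# Assumption: lowercase tokens of two or more word characters, intended to
# approximate CountVectorizer's default tokenization.
TOKEN = re.compile(r"\b\w\w+\b")

def build_postings(docs):
    """Map each term to the set of document numbers that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in TOKEN.findall(text.lower()):
            postings[term].add(doc_id)
    return postings

def search_all(postings, query):
    """Return the sorted numbers of documents that contain every query term."""
    terms = TOKEN.findall(query.lower())
    if not terms:
        return []
    result = set(postings.get(terms[0], set()))
    for term in terms[1:]:
        result &= postings.get(term, set())
    return sorted(result)

postings = build_postings(documents)
print(sorted(postings["digital"]))                  # [0, 1, 2]
print(search_all(postings, "digital information"))  # [0, 1]
print(search_all(postings, "computer digital"))     # [0, 2]
```

Each query is answered by intersecting a few postings sets rather than rereading the documents, which is the fast full-text search that Section 1 describes.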