WOLKITE UNIVERSITY
COLLEGE OF COMPUTING AND INFORMATICS
DEPARTMENT OF INFORMATION SYSTEM
Course : Introduction to Information Storage & Retrieval
INSY 2063: IR
BSc(IS) Third Year, First Semester, 2018
ISAYAS W.
INFORMATION SYSTEM
Information Retrieval
Chapter 1:
Information Storage and Retrieval
Questions
• Find “BRUTUS AND CAESAR AND NOT
CALPURNIA” in the big book of Shakespeare.
• I want to get some idea about the concepts of information
retrieval.
Example
Introduction
• The practice of archiving written information (an archive is a
depository containing historical records and documents) can be
traced back to around 3000 BC, when the Sumerians designated
special areas to store clay tablets with cuneiform inscriptions
(Amit Singhal, 2001).
The need to store and retrieve written information became
increasingly important over centuries, especially with inventions
like paper and the printing press.
• After computers were invented, people realized that they could be
used for storing and mechanically retrieving large amounts of
information
Cont……
Approaching the end of the twentieth century, societies all
over the world are changing.
In countries of many different kinds, information now plays
an increasingly important part in economic, social, cultural
and political life.
This phenomenon is taking place regardless of a country’s
size, state of development or political philosophy.
Cont……..
Changes that are happening in Singapore, with a population
of 2.5 million, are similar to those taking place in Japan with
its population of 125 million.
Developing countries like Thailand are striving to build
information-intensive (concentrated) social and economic
systems just as hard as countries like the United Kingdom or
France.
The storage of information, from the earliest media to today
• Clay tablets
• Paper and other soft materials
• Computers
• Cloud
What is Information, storage and retrieval?
• Information:
Data that has been processed so that it carries meaning; the
meaning may be useful, but does not have to be.
Information is a critical business resource and, like any other
critical resource, must be properly managed.
Provides answers to who, what, where and when questions.
• Storage:
The action or method of storing something.
The place where data is held in electromagnetic or optical
form for access by a computer processor.
• Retrieval:
The process of getting something back from somewhere easily.
The action of obtaining or consulting material stored in a
computer system.
Example: find “BRUTUS AND CAESAR AND NOT
CALPURNIA” in the big book of Shakespeare.
Information Storage
• Computers can store different types of information in
different ways, depending on what the information is, how
much storage it requires and how quickly it needs to be
accessed.
• Information storage is the part of the accounting system
that keeps data accessible to the information processors
(CPU)
• An accounting system is the system used to manage the
income, expenses, and other financial activities of a
business.
Cont..
After the input devices enter data into an accounting system,
the information processors take the raw data and convert it
into a usable form.
• This information is then stored, often in the form of a
database, on the information storage component of the
accounting system.
Example
IR and IR systems
What is IR (Information Retrieval)?
Information Retrieval(IR)
• The term Information Retrieval was first coined by Calvin
Mooers (1950)
Definition: An important sub-discipline of Information Science
that is concerned with developing theories and methods of access
to information
– The focus is on helping users find information that matches their
information need (User Centered View)
• Is a branch of applied Computer Science that focuses on the
representation, storage, organization of, and access to information
items (System Centered View).
…cont
• A good formal definition of information retrieval is given in
Baeza-Yates and Ribeiro-Neto (1999, p. 1)
“Information retrieval deals with representation, storage,
organization of, and access to information items. The
organization and access of information items should provide
the user with easy access to the information in which he is
interested”
• Is about finding relevant information in large collections of
data
….cont
• Conceptually, IR is used to cover all related problems in
finding needed information
• Historically, information retrieval is about document retrieval,
emphasizing documents as the basic unit
– Until recently, in the above sense, IR was considered a narrow area of
interest for librarians and information experts
– Today, IR includes modelling, document classification, user
interfaces and visualization, multimedia retrieval, digital libraries,
filtering, natural languages, etc.
• Technically, information retrieval refers to (text) string
manipulation, indexing, matching, querying, etc.
The Task of Information Retrieval
• A large repository of documents is stored on a computer = corpus
• There is a topic about which we desire to get some
information = information need
• Some of those documents may contain the information that
satisfies our need = relevance
• How do we retrieve those documents?
• We communicate our information need to the computer by
expressing it in the form of a query
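The Boolean question asked at the start of the chapter can be sketched with set operations. This is only a minimal illustration: the four documents and their contents are made up, and the helper names docs_containing and boolean_query are hypothetical.

```python
# Toy corpus (made-up contents, for illustration only): document id -> text.
corpus = {
    "doc1": "brutus and caesar meet antony",
    "doc2": "brutus caesar and calpurnia argue",
    "doc3": "caesar speaks to the crowd",
    "doc4": "brutus praises caesar",
}


def docs_containing(term):
    """Return the set of document ids whose text contains the term."""
    return {doc_id for doc_id, text in corpus.items() if term in text.split()}


def boolean_query():
    """Evaluate BRUTUS AND CAESAR AND NOT CALPURNIA with set operations."""
    both = docs_containing("brutus") & docs_containing("caesar")  # AND
    return both - docs_containing("calpurnia")                    # AND NOT


result = sorted(boolean_query())
print(result)
```

Scanning every document for every term, as done here, does not scale to a large corpus; that is the motivation for building an index first.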
How to prepare a Query
• How the query is expressed will depend on whether the data is
structured or unstructured
• Structured data: information held in tables; it has a clear,
overt (obvious) semantic structure and is organized as relations
• Example:
Employee  Manager  Salary
A         B        80000
Example
• Structured data allows for expressive queries like:
Give me the social security numbers of all the employees
who have stayed with the company for more than five years
Id  Name   Manager  Years stayed
1   Abebe  Beka     2
2   Bona   Dedefo   2
3   Chala  Boru     6
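A query of this kind could be expressed against the table above as follows. This is a sketch using Python's built-in sqlite3 module; the table and column names (employee, years_stayed) are assumptions, and because the example table has no social security column the query returns the Id and Name instead.

```python
import sqlite3

# In-memory database holding the example table from the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER, name TEXT, manager TEXT, years_stayed INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Abebe", "Beka", 2), (2, "Bona", "Dedefo", 2), (3, "Chala", "Boru", 6)],
)

# Structured query: employees who have stayed more than five years.
long_serving = conn.execute(
    "SELECT id, name FROM employee WHERE years_stayed > ? ORDER BY id", (5,)
).fetchall()
print(long_serving)
conn.close()
```

Every row either satisfies the condition or does not, so the answer is exact and unranked.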
Cont……….
• Unstructured data: does not have a clear, overt semantic
structure (e.g. free text on a web page, video, audio)
• Allows less expressive queries of the form:
Give me all documents that have the keywords
‘These Romans are crazy’
Structured data → Database System
Unstructured data → Information Retrieval
Generally;
• Information Retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually stored on computers). Information retrieval
technology has been central to the success of the Web.
• Question 1: what is the difference between the structured
and unstructured data? What about semi-structured data?
Query
• (computing) a set of instructions passed to a database to
retrieve particular data (Dictionary Definition)
• Queries are formal statements of information needs that are put
to an IR system by the user to search for a document.
• The users’ query is matched to the documents stored in a
database through the documents’ index.
Query
• When formulating a query, the user can employ search
facilities such as search limits (by date of publication,
language, publication type, and so on) and Boolean operators
(AND/OR/NEAR/NOT) to make the query more specific
(i.e. refine or relax the query).
• The user can also often control the output in terms of, for
example, number of retrieved documents to display and of
highlighting search terms.
Goal of IR
• The general goal of IR is to
Help users find useful information based on their information
needs (with minimum effort), despite the increasing
complexity of information and the changing needs of users
Provide immediate random access to the data
Remark
Retrieval systems such as Google are developed with this aim
What IR assumes?
• Information is stored (or available)
• A user has an information need
• An automated system exists from which information can
be retrieved
• The system works!!
Cont…
Challenges in IR
• Representation of information items and information needs
(first problem)
– Document representation is one area of IR
– Query representation is another area of IR
• Matching (second problem)
– How to match need Vs. information items
• Modification of representation as a result of judgment (query
expansion or reformulation)
Question
Data Information Retrieval Systems
• Are systems which are built to retrieve documents highly
likely to be relevant to the user
• Are systems built to reduce the user’s workload in searching
through the store of documents to find relevant ones
• Are systems that give information about the presence or
absence of documents in accordance with the query
– Automated abstracts or summaries of documents were developed to
further simplify access to search results
• Are computer based systems (we are talking about
automation )
….cont
• Are systems that attempt to find relevant documents to
respond to a user’s request
• Are systems that are interposed (placed) between a
potential user of information and the information collection
itself.
– For a given information problem, the purpose of the
system is to capture wanted items and to filter out
unwanted items
…….cont
Programmable IR Tools
Apache Lucene
Apache Solr
Lemur
Terrier
RapidMiner
Generally;
• Is a set of rules and procedures, as operated by humans and/or
machines, for doing some or all of the following operations
– Indexing (or constructing representation of documents)
– Search formulation (or constructing representation of information
needs)
– Searching (or matching representation of documents against
representation of needs)
– Feedback (or repeating any or all of the above processes with
modifications introduced in response to an assessment of results of
some process)
Information retrieval system
• Consists of:
1. Sets of Information items (documents)
• Objects that have the information we need
2. A set of requests (Information needs)
3. Some mechanism for determining which information items meet
the requirements of the request (matching functions)
Information Retrieval Systems
Examples of IR systems
• Typical examples of IR systems are search engines that
can be found on the web or in library
– They concentrate on finding documents, performing
full text retrieval
– After a user types in several keywords, the system
returns the documents that are most interesting
according to the system
What an IR system should do?
Store/archive information
Provide access to that information
Answer queries with relevant information
Understand the user’s queries
Understand the user’s need
Act as an assistant
Major functions of an IR system
– Analyze the contents of information items
– Represent the contents of the analyzed sources in a way suitable
for matching with users’ queries
– Analyze users’ information needs and represent them in a form
that will be suitable for matching with the database
– Match the search statement with the stored database
– Retrieve or generate relevant information, in a ranking
which reflects relevance
– Make necessary adjustments in the system based on feedback
from users
Types of IR Systems/applications
• IR can be structured for ease of discussion as:
– Text IR
• Discusses the classic problem of searching a collection of documents
for useful information
• The focus is on documents that are predominantly text (rather than
pictures)
• These are called textual images and are amenable (agreeable) to
automatic extraction of keywords
– Multimedia IR
• Discusses how to index document images and other binary data by
extracting features from their content and how to search them
efficiently
– Human computer interaction (HCI) for IR
• Discusses current trends in IR towards improved user interface and
better data visualization tools
– Application of IR
• Covers modern applications of IR (such as the Web, bibliographic
systems, and digital libraries)
Components of an IR system
• An IR system comprises the following major subsystems
– Document selection subsystem
• Documents are held in the database. How are we going to select those
documents that are relevant (matched with user requests)?
– Vocabulary subsystem
• In indexing we need to use a controlled vocabulary, i.e., a list of selected
subject terms used to represent a document. The indexing is updated
on the basis of this vocabulary
– Text operations subsystem
• Tokenization, stopword removal, stemming
Data versus Information Retrieval
• 1. Information items
– DBMS
• highly structured data (are of known nature), often
homogeneous records, often semantically unambiguous (well
defined semantics)
– IR systems
• Unstructured or unformatted data (as opposed to relational
database). When you go to a specific document it is not
structured as in DB
• Free text
– text data- papers, technical reports, news article ( completely
untagged or plain text)
– Web-pages – HTML and XML files (semi structured)
• None textual data – images, graphics etc.
• Heterogeneous, Semantically ambiguous (semantics is
frequently loose; we want approximate match)
……cont
• 2. Answers
– DBMS:
• Records, tuples; no ranking
• Well defined results
• Perfect precision and recall, each item is relevant
– IR systems
• Documents, ranked list of documents. The issue of ranking is
very important (page through the top k documents)
• Imperfect precision and recall, each item has specific
relevance
….cont
• 3. Matching
– DBMS:
• Analogous to DB querying: Which records contain a set of
keywords?
• Exact match; we talk of items that match exactly; every record
either matches or fails to match a query; no notion of relevance
• A single erroneous (containing an error) object implies failure!
– IR systems
• Information about a subject or topic
• Partial or best match; We talk of possibly relevant items not
exact matched items
• Notion of relevance is most important- needs a model
• Small errors are tolerated (and in fact inevitable)
• Interpret contents of information items
• Generate a ranking which reflects relevance
….cont
• 4. Items wanted
– DBMS
• Matching
– IR systems
• Relevant
• 5. Model
– DBMS:
• Deterministic (the answer can be predetermined)
– IR systems:
• Probabilistic, not deterministic; the answer is not
predetermined
…cont
• 6. Querying
– DBMS:
• (DB query) assumes that the data is in standardized
format
– IR system
• Query assumes that we work on plain, unformatted data
• 7. Query language
– DBMS
• Artificial language
– IR system
• Natural language
….cont
• 8. Query specification
– DBMS
• Complete (requires precise retrieval criteria)
• A single erroneous object implies failure
– IR system
• Incomplete
• Small errors are tolerated
…cont
DB grew out of files and traditional business system
IR grew out of library science and need to
categorize/group/access books/articles
Information retrieval is much more difficult than data
retrieval
Both support queries over large data set, using indexing
Relationship
Systems complement each other
Summary of Comparison (data retrieval Vs information retrieval)
Discussion questions
1. What are IR and IR systems?
2. What is a search engine?
3. List and explain the components of the IR block diagram.
4. Write the difference between information retrieval (IR) and data retrieval (DR).
5. Apache Solr is open source software. What is open source
software?
IR and the Retrieval Process
• The purpose of an information retrieval strategy is to retrieve all
the relevant documents whilst (while) at the same time
retrieving as few non-relevant ones as possible
• The process involves a certain amount of feedback
and is best illustrated using the diagram in the next slide
• It can be seen or interpreted in terms of component sub-processes
whose study yields many of the topics that will be
covered in the course
Retrieval Process
[Figure: A simple and generic software architecture to describe the retrieval process]
Components: User Interface, Text Operations, Query Operations, Indexing, DB Manager Module, Searching, Index (inverted file), Text Database, Ranking
Data flows: user need, text, logical view, query, user feedback, retrieved docs, ranked docs
The architecture has two parts: the indexing part and the searching part
….cont
• There are three main ingredients to the IR process
– Texts or documents
– Queries
– The process of evaluation
• For texts, the main problem is to obtain a representation of
the text in a form which is amenable to automatic indexing
• This is achieved (i.e., the representation) by creating an
abbreviated form of the text, known as a text surrogate
• A typical surrogate would consist of a set of index terms or
keywords or descriptors
Document surrogates
Example
….cont
For queries
• The query arises as a result of an information
need on the part of the user
• The query is then a representation of the information need and
must be expressed in a language understood by the system
• Due to the inherent difficulty of accurately representing the
information need, the query in an IR system is always regarded
as approximate and imperfect
…..cont
For the evaluation
• The evaluation process involves a comparison of the texts
actually retrieved with those the user expected to retrieve
• This often leads to some modification, typically of the query,
though possibly of the information need or even of the
surrogates
• The extent to which modification is required is closely linked
with the process of measuring the effectiveness of the retrieval
operation (recall and precision)
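Recall and precision can be computed from two sets of document ids. The sketch below uses the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant); the document ids and relevance judgments are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents retrieved, 5 relevant overall, 3 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d3", "d7", "d9"})
print(p, r)
```

Here three of the four retrieved documents are relevant (precision 0.75), but two relevant documents were missed (recall 0.6), which would suggest reformulating the query.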
….cont
It is necessary to define the text database before any of the
retrieval processes are initiated
This is usually done by the manager of the database and
includes specifying the following
The documents to be used
The operations to be performed on the text
The text model to be used (the text structure and what
elements can be retrieved)
The text operations transform the original documents and the
information needs and generate a logical view of them
….cont.
• Once the logical view of the documents is defined, the
database module builds an index of the text
– An index is a critical data structure
– It allows fast searching over large volumes of data
• Different index structures might be used , but the most popular
one is the inverted file (more on this later) as indicated in the
slide
• Given the document database is indexed, the retrieval process
can be initiated
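A minimal sketch of an inverted file is given below, assuming whitespace tokenization and a hypothetical three-document collection; the function name build_inverted_index is illustrative only. Each term maps to the list of documents that contain it.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical three-document collection.
docs = {1: "information retrieval systems", 2: "database systems", 3: "information storage"}
index = build_inverted_index(docs)
print(index["information"])
print(index["systems"])
```

A query term is then answered by a single dictionary lookup instead of a scan over all documents, which is what makes searching large collections fast.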
……..cont.
• The user first specifies a user need which is then parsed and
transformed by the same text operation applied to the text
• Then the query operations might be applied before the actual
query, which provides a system representation for the user
need, is generated
• Matching- The query is then processed to obtain the retrieved
documents
• Before the retrieved documents are sent to the user, the
retrieved documents are ranked according to the likelihood of
relevance
……..cont.
• The user then examines the set of ranked documents in the
search for useful information
• Two choices for the user
– Reformulate query, run on entire collection
– Reformulate query, run on result set
• At this point, he might pinpoint a subset of the documents seen
as definitely of interest and initiate a user feedback cycle
• In such a cycle, the system uses the documents selected by the
user to change the query formulation
• Hopefully, this modified query is a better representation of the
real user need
Basic Structure of an IR System
Components of an IR System
Why
1. Regulatory compliance
• A well-organized information storage and retrieval system
that follows compliance (agreement) regulations and tax
record-keeping guidelines significantly increases a business
owner’s confidence that the business is fully complying.
2. Efficiency and Productivity
• A good information storage and retrieval system, including an
effective indexing system, not only decreases the chances
information will be misfiled but also speeds up the storing and
retrieval of information.
• The resulting time saving benefit increases office efficiency
and productivity while decreasing stress and anxiety
3. Improving working environment
• It can be disheartening to anyone walking through an office area
to see vital business documents and other information stacked
on top of file cabinets or in boxes next to office workstations.
• Not only does this create a stressful and poor working
environment, but if customers see it, it can cause them to
form a negative perception of the business.
• Contrast this with an office area in which file cabinets, passages
and workstations are clear and neatly organized to see how
important it is for even a small business to have a well-organized
information storage and retrieval system.
The Standard Retrieval Interaction Model
Question
Information Retrieval
Chapter 2:
Automatic Term Selection and
Term Weighting
Definition of Term Selection
• The act or fact of carefully choosing some term as being the
best or the most suitable
• The process of choosing the most important term from the
given documents for the purpose of indexing and text
operation
Why term selection?
• Some words are not good for representing documents
• Using all words has a computational cost and increases searching
time and storage requirements
• Using the set of all words in a collection to index
documents generates too much noise for the retrieval
task
Objective or aim of term selection
• Represent textual documents by a set of keywords called index
terms or simply terms
• Increase efficiency by extracting from the resulting document a
selected set of terms to be used for indexing the document
• If full text representation is adopted then all words are used for
indexing (this is not very efficient, as it carries an overhead in time
and space)
Index term
• Is also called keyword
• Is a word (a single word) or phrase (multiword) in a document
whose semantics gives an indication of the document’s theme
(main idea)
– A term that captures the subject or topic of a document.
– Helps in remembering the document’s main theme
Index Terms
• Assumption
– The index terms selected are assumed to reflect the content of
the text (are descriptions of content)
• Index terms can be extracted from the title, abstract and text of
the document
Indexing
Is a critical process
– A user’s ability to find documents on a particular subject is
limited by the indexing process used to create index terms for
the subject
Indexing is the act of classifying and providing an index
in order to make items easier to retrieve
Example: (in a book or set of books) an alphabetical list of names, subjects,
etc., with reference to the pages on which they are mentioned.
Indexing
• Some definitions
– Is the art of organizing information
– Is an association of descriptors (keywords, concepts) to
documents in view of future retrieval
– Is a process of constructing document surrogates by
assigning identifiers to text items
– Is the process of analyzing the information content in the
language of the indexing system
Document surrogates
Example
Indexing
• Purpose/objective
– To give access point to a collection that are expected to be
most useful to the users of information
– To allow easy identification of documents (e.g., find
documents by topic)
– To relate documents to each other
– To allow prediction of document relevance to a particular
information need
Indexing
• Indexing may also assign weights to terms
– Non-weighted indexing
– Weighted indexing
Indexing
• Non-weighted indexing
– No attempt to determine the value of the different terms
assigned to a document
– Not possible to distinguish between major topics and casual
references
– All retrieved documents are equal in value
– Typical of commercial systems through the 1980s
Indexing
• Weighted indexing
– Attempt made to place a value on each term of the
description of the document
– This value is related to the frequency of occurrence of the
term in the document (higher is better), but also to the
number of collection documents that uses this term (lower is
better)
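One common scheme that combines these two factors is tf-idf, sketched below. The formula weight = tf * log(N / df) is one of several variants and is an assumption here, not necessarily the exact scheme this course uses; the three documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)  # term frequency within this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


docs = {"d1": "retrieval retrieval system", "d2": "database system", "d3": "storage system"}
w = tf_idf(docs)
print(w["d1"])
```

In this toy collection "system" occurs in every document and so gets weight 0, while "retrieval" occurs twice in d1 and nowhere else and gets the highest weight.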
Indexing exhaustivity (completeness)
• Should we index only the most important concepts, or also more
minor concepts?
Indexing specificity
• Should we use general index terms or more specific terms?
• Should we use the term “computer” or “personal computer”?
Indexing
• Ways to do indexing
– Manual
– Automatic (focus of the course)
Manual Indexing
• Indexers decide which keywords to assign to documents based
on controlled vocabulary
– Human indexers assign index terms to documents
• The indexers try to summarize the contents or aboutness of the
whole document in a few keywords
• That is, indexers analyze and represent the content of a document
through keywords
• Is based on intellectual judgment and semantic interpretation of
(concepts, themes) of indexers
Manual Indexing
• Indexers’ prior knowledge of the following is important to come
up with good keywords or index terms
– Terms that will be used by the user
– Indexing vocabulary
– Collection characteristics
Advantages of Manual Indexing
• Ability to perform abstraction (conclude what the subject is) and
determine additional related terms
• Ability to judge the value of concepts (because it is done by
human being)
Disadvantages of Manual Indexing
• Slow and expensive (significant cost)
– Professional indexers are very expensive
• Is based on intellectual judgment and semantic interpretation
(concepts, themes)
– High probability of inconsistency or low consistency among
indexers (maintaining consistency is difficult),
• Labor intensive
• In automatic indexing, these problems are solved to some extent
Automatic Indexing
• Is the assignment of content identifiers, with the help of modern
computing technology
– A computer system is used to record the descriptors generated
by the human
• The system extracts “typical”/ “significant” terms
• The human may contribute by setting the parameters or
thresholds, or by choosing components or algorithms
Why automatic indexing?
• Reasons for the necessity of automatic indexing
– Information overload
• Enormous amount of information is being generated from
day to day activities
– Explosion of machine-readable text
• Massive information available in electronic format and on
Internet.
– Cost effectiveness
• Human indexing is expensive and labor intensive.
Current Procedures for Automatic Indexing
• Generating document representatives through automatic indexing
involves
– Lexical analysis: the process of converting an input stream of
characters into a stream of words or tokens
– Use of stoplist
– Use of conflation procedures (stemming, optional)
– Selection of index terms
– Weighting the resulting terms (optional)
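The first steps of this procedure can be sketched as a small pipeline. The stoplist and suffix list below are tiny hypothetical examples, and crude_stem is a naive suffix stripper standing in for a real stemmer such as Porter's algorithm; term weighting is left out.

```python
import re

# Hypothetical stoplist and suffix list; real systems use much larger resources.
STOPLIST = {"the", "of", "and", "a", "in", "to", "is", "are", "for"}
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lexical analysis: turn a character stream into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip one common suffix; a naive stand-in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, drop stoplist words, stem, and return the distinct index terms."""
    tokens = tokenize(text)
    content_words = [t for t in tokens if t not in STOPLIST]
    return sorted({crude_stem(t) for t in content_words})


terms = index_terms("The indexing of documents and the retrieval of indexed documents")
print(terms)
```

Ten input tokens collapse to three index terms: the stoplist removes the function words, and stemming conflates "indexing" and "indexed" into one term.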
Procedures for Building an Index Automatically
[Figure: flow of the automatic indexing procedure]
Documents → Tokenizing (text broken into words) → Noise reduction with a stoplist (non-stoplist words) → Feature normalization by stemming* (stemmed words) → Term weighting* (terms with weights) → Index database
* Indicates an optional operation
Procedures for Building an Index Automatically
• Thus, automatic indexing consists of two processes
– Assigning terms or concepts capable of representing document
content
– Assigning a weight or value to each term reflecting its
presumed importance for the purpose of content identification
• Important words are assigned higher weights
• Less important words are assigned lower weights
Advantages of Automatic Indexing
• Reduced processing time (Fast)
• Reduced cost (inexpensive)
• Easy to maintain
• Improved consistency(reliability)
– No inconsistency or high consistency
– Algorithms select index terms much more consistently than
humans.
• Better retrieval (achieved)
Disadvantages of Automatic Indexing
• Mechanical execution of algorithms, with no intelligent interpretation (of
aboutness / relevance)
Automatic Text Analysis
• Not all words in a text are good index terms
• Some are good, some are bad and some are indifferent
• How do we know whether a term is good or bad or indifferent for
indexing?
• Luhn’s idea will give us answer to this question
Automatic Text Analysis
• It was Luhn (1957) who first suggested that certain words could
be automatically extracted from texts to represent their content
– He was one of the earliest researchers in IR
• He discovered that the distribution patterns of words could give
significant information about the property of being content
bearing
• Much of text analysis has been built on the original idea of Luhn
Automatic Text Analysis
• Luhn’s proposal
“The frequency of word occurrences in an article furnishes a useful
measure of word significance…”
However, a high frequency term will be acceptable for indexing
purposes only if its occurrence frequency is not equally high in
all documents of the collection
Still today, the search engines that operate on the Internet index the
documents based on this principle
Automatic Text Analysis
• Luhn’s observation
– He noted that high frequency words tend to be common, non
content bearing words
– He also recognized that one or two occurrences of a word in
a relatively long text could not be taken significant in defining
the subject matter
• Came up with a model for selecting terms based on their
frequency of occurrences
Automatic Text Analysis
• Luhn’s model
– Words which occur very infrequently in a collection are of
little importance for indexing since they are unlikely to be
specified in queries
• Such rare terms are likely to be specific to the documents
and they may not occur in users queries
– Words which occur very frequently in a collection are of
little importance for indexing since they do not
discriminate sufficiently between documents
• Such frequent terms are unlikely to discriminate
a document from the others, so they are not important for
indexing
– The most important words for indexing are those which
occur with intermediate frequencies
• Thus, according to Luhn, medium frequency terms are
better candidates for indexing
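Luhn's selection rule can be sketched as a filter on collection frequency. The two cut-off values and the token stream below are hypothetical and chosen by hand for illustration; the model itself does not say what the cut-offs should be.

```python
from collections import Counter


def medium_frequency_terms(tokens, low_cutoff, high_cutoff):
    """Keep terms whose frequency lies between the two cut-offs (inclusive)."""
    counts = Counter(tokens)
    return sorted(t for t, c in counts.items() if low_cutoff <= c <= high_cutoff)


# Hypothetical token stream: one very frequent word, two medium ones, one rare one.
tokens = ["the"] * 10 + ["index"] * 4 + ["retrieval"] * 3 + ["zipf"] * 1
selected = medium_frequency_terms(tokens, low_cutoff=2, high_cutoff=5)
print(selected)
```

The very frequent "the" and the rare "zipf" are both discarded, leaving the medium frequency terms as index term candidates.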
Automatic Text Analysis
• Let f be the frequency of occurrence of various word types in
a given position of text
• Let r be their rank order, that is, the order of their frequency of
occurrence
• Then a plot relating f and r yields a curve similar to a
hyperbolic curve (shown in the accompanying figure)
• The curve, in fact, demonstrates Zipf’s law
Automatic Text Analysis
• Therefore, Luhn suggested using the words in the middle of the
frequency range
• These findings are the bases of a number of classical weighting
schemes
Problems with Luhn’s Selection Mechanism
• Finding a way for elimination of high and low frequency words
– Certain arbitrariness is involved in determining the cut-offs
– That is, there is no means which gives their values
– They have to be determined by trial and error
• The risk of loss of retrieval performance
– The removal of high frequency words may reduce recall
– The removal of low frequency words may bring losses in
precision
Zipf’s Law in IR
• The law states that there is an inverse relation between the
frequency of a word f and its rank r (the highest frequency term
has rank 1, the second highest frequency term has rank 2, etc.)
• If the terms in a collection are ranked (r) by their frequency (f),
they roughly fit the relation r * f ≈ C, or equivalently
f ≈ C * 1/r, which is known as Zipf’s law
– In other words, the law states that the product of the
frequency of use of words and their rank order is
approximately constant
rank * frequency ≈ constant
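The relation can be checked on any word count by listing rank times frequency for each term. In the sketch below the counts are hypothetical and were built to follow f = 12 / r exactly; real text fits the law only roughly.

```python
from collections import Counter


def rank_frequency_table(tokens):
    """Return (rank, term, frequency, rank * frequency) rows, most frequent first."""
    ranked = Counter(tokens).most_common()
    return [(rank, term, freq, rank * freq)
            for rank, (term, freq) in enumerate(ranked, start=1)]


# Hypothetical counts constructed so that rank * frequency is constant.
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3
table = rank_frequency_table(tokens)
for row in table:
    print(row)
```

On a real corpus the last column would vary around a constant rather than equal it, with the largest deviations usually at the very top and very bottom ranks.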
Question