Information Storage and Retrieval
Course Overview
• What the course is about
– How people search and find information
– How computers store and retrieve information
» Bring back the information stored
– How computer systems are designed to help people find information
they need
• What the course emphasizes: an understanding of
• Theories
• Tools (lexical analyzers, stemmers, etc.)
• Algorithms (ranking, matching, clustering, etc.) and
• Evaluation of information retrieval systems
Course Overview
• What this course is NOT
– An algorithm design course
• We might use several related algorithms, but we will not study them in detail
– A system development course
• Except that some assignments may require you to write or compile procedures
in C, C++, Java, etc.
• We look at an IR system as a whole, not as individual components
• A course offered to the following programs
– Computer Science
– Information Science
– Information Systems
– Library Science
– The technicality of the course differs from program to program (see section 1.6 of
Baeza-Yates)
Knowledge Useful for the Course
• Mathematics (set theory, probability, vector algebra)
• Data / file structures
• Linguistics (read papers on Linguistics in Information Science)
• Systems analysis and design (System A & D)
• Programming in higher level languages such as C, C++, Java,
VB, etc.
Chapter 1: Agenda
• Introduction (Motivation, Definitions of IR & IR Systems,
challenges of IR)
• Data Retrieval Vs Information Retrieval
• Basic concepts in IR
– User task
– Logical view of documents
• The Retrieval Process
– The structure of IR System
• List of important concepts and terms
Motivational Factors
1. Major observation on computers' capability
• Computers are able to scan whole documents and decide whether
they are relevant or not
• Question: What will happen if computers are not able to do
this?
• IR systems, since their inception, have been in place to reduce a user’s workload in
searching through the store of documents to find relevant ones
• The first IR systems were very basic and not very effective
• The gradual improvement in computer performance shifted the
focus over time towards implementing algorithms and
designing computer programs that automate the storage
and retrieval of information
Motivational Factors
2. Information explosion / overload
– Why information explosion?
– How do you relate information overload with IR?
Answering these questions will help us to appreciate IR
systems
– In IR we talk about information explosion/overload
• The rapid growth in the amount of information published
• Finding a needle in a haystack
– The growth in information and the retrieval mechanisms do
not match
– Our retrieval techniques lag behind the growth of
information
• The speed at which you retrieve matters a lot these days
Motivational Factors
• 2 (cont’d)
– The overload has made storage and retrieval of information very
difficult
– Because of the overload our search space becomes very large
– In the search space we have information items (generic
name) which could be in the form of books, journals, etc.
– Searching for information in such a large space is difficult
• You have a number of goal states (relevant docs)
– Because of the overload, there are too many alternative paths
to the goal state
– It requires defining the path from the start state to the goal state
– You must have a good system to evaluate these paths, be it in
terms of time, space and so on
Motivational Factors
3. Information Need
– The two most important entities in IR are
• Information Items
• Information Needs
– Some definitions of information need for our purpose
• Is what the user wants from the IR system
• Is a question that users ask
• Is the desire to know
• Is a desire to fill a gap in knowledge
• Is an information problem that causes the user to act
Motivational Factors
• 3 (cont’d)
– Examples: users interested in finding something
• Articles published on certain subjects (e.g., HLT in Ethiopia)
• Books written by a certain author
• Banks offering online banking service
• Information about the history of the Kennedys (an article about
the Kennedys: text retrieval)
• Information on what a brain tumor looks like on a CT scan (a
picture of a brain tumor: image retrieval)
Motivational Factors
• 3 (cont’d)
– Example of a complex information need (in the context of
the WWW)
“Find all pages (documents) containing information on
Computer Courses which (1) are offered by Universities in
Ethiopia, (2) have prerequisites and (3) are accredited by the
Ministry of Education. To be relevant the document must
include information on admission requirements, email and
phone number for contact purposes.”
• Could this be a query for current retrieval systems?
– An information need is usually expressed by key terms or
Boolean combinations of key terms
Motivational Factors
• 3 (cont’d)
– An information need is very complicated to pin down
– Often, we do not know our need exactly (it is not defined,
specific, or well crystallized)
– Our first impression of our need is not exactly what we
need
– Because of a lack of knowledge or a lack of expression, we
may also fail to provide a refined expression of our information
need
– Even assuming that we know our need, we have difficulty
expressing it
– We start from somewhere and go through a process to
define our need clearly
– In this case, we are searching rather than getting one specific
answer
– We look for a lead rather than the actual answer
Motivational Factors
• 3 (cont’d)
– We look for a pointer to the fact or the answer, not the fact or the answer
itself (you point to the resource)
– Often, it is less important to find the answer than to get a pointer to the
source
– When do we know our information need?
– We think we know our information need, but we don’t
– As we go through the process we refine our need
– Although we have a starting point, it doesn’t mean that we know our
need
– We know our need when we are provided with hits/responses to our
information need and asked to judge/evaluate them
– We come to know our need through a process: when we judge, when we evaluate,
etc.
Motivational Factors
• 3 (cont’d)
– This is the philosophical background of IR
– It has both a philosophical dimension and a psychological
dimension
– Is one of the main components of IR (the system must have
some mechanism to accept your information need)
• Hence the subject of IR
– Information need is a sophisticated area: the challenge is to come up with a
computer system that addresses this need, which is largely
psychological and philosophical
Motivational Factors
4. Importing/retrieving Knowledge
– Import or retrieve knowledge from wherever it is
– The retrieval system should be built in such a way that this
is possible
Information Retrieval
• The term Information Retrieval was coined by Calvin
Mooers (1950)
Definition
• Is an important sub-discipline of Information Science that is
concerned with developing theories and methods of access to
information
– Focus is on helping users find information that matches their
information need (User Centered View)
• Is a branch of applied Computer Science that focuses on
representation, storage, organization of, and access to
information items (System Centered View).
Information Retrieval (Definition)
• A good formal definition of information retrieval is given in Baeza-Yates & Ribeiro-Neto (1999, p. 1)
“Information retrieval deals with representation, storage, organization
of, and access to information items. The organization and access of
information items should provide the user with easy access to the
information in which he is interested”
• The definition incorporates all important features of a good information
retrieval system
– Representation
– Storage
– Organization
– Access
– Evaluation
• As a field, IR focuses on advanced applications of computers
• Is about finding relevant information in large collections of data
Information Retrieval
• Conceptually, IR is used to cover all related problems in finding
needed information
• Historically, information retrieval is about document retrieval,
emphasizing documents as the basic units
– Until recently, in the above sense, IR was considered a narrow area of
interest for librarians and information experts
– Today, IR includes modelling, document classification, user interfaces
and visualization, multimedia retrieval, digital libraries, filtering, natural
languages, etc.
• Technically, information retrieval refers to (text) string
manipulation, indexing, matching, querying, etc.
Goal of Information Retrieval
• The general goal of IR is to
– Help users find useful information based on their information
needs (with minimum effort)
despite the increasing complexity of information and the
changing needs of users
– Provide immediate random access to the data
Remark
– Retrieval systems such as Google are developed with this aim
What IR Assumes
• Information is stored (or available)
• A user has an information need
• An automated system exists from which information can be
retrieved
• The system works!!
Challenges in IR
• Representation of information items and information needs
(first problem)
– Document representation is one area of IR
– Query representation is another area of IR
• Matching (second problem)
– How to match needs against information items
• Modification of representation as a result of judgment (query
expansion or reformulation)
Information Retrieval Task
• Task statement
– Build a system that retrieves documents that users are highly
likely to find relevant to their request (i.e., Information need)
– Why highly likely relevant?
Information Retrieval Systems
• Are systems built to retrieve documents that are highly likely to be relevant
to the user
• Are systems built to reduce the user’s workload in searching
through the store of documents to find relevant ones
• Are systems that give information about the presence or
absence of documents in accordance with the query
– Automated abstracts or summaries of documents were developed to
further simplify access to search results
• Are computer-based systems (we are talking about automation)
Information Retrieval Systems
• Are systems that attempt to find relevant documents in response to
a user’s request
• Are devices interposed between a potential user of information
and the information collection itself.
– For a given information problem, the purpose of the system is
to capture wanted items and to filter out unwanted items
Information Retrieval Systems
• Is a set of rules and procedures, as operated by humans and/or
machines, for doing some or all of the following operations
– Indexing (or constructing representations of documents)
– Search formulation (or constructing representations of
information needs)
– Searching (or matching representations of documents against
representations of needs)
– Feedback (or repeating any or all of the above processes with
modifications introduced in response to an assessment of
the results of some process)
– Indexing language construction (or the generation of rules of
representation)
Examples of Information Retrieval Systems
• Typical examples of IR systems are search engines that can
be found on the web or in a library
– They concentrate on finding documents, performing full
text retrieval
– After a user types in several keywords, the system returns
the documents that it judges to be most
interesting
• More examples will be given under IR models
What an IR System Should Do
• Store/archive information
• Provide access to that information
• Answer queries with relevant information
• Stay current
• Wish list
– Understand the user’s queries
– Understand the user’s need
– Act as an assistant
Information Retrieval Systems
• Consists of
– Sets of information items (documents)
• Objects that have the information we need
– A set of requests (information needs)
– Some mechanisms for determining which information items meet the
requirements of the request (matching functions)
Information Retrieval Systems
A typical IR System
[Figure: block diagram with the labels Collection of Documents (Information items), User Query/Request, Processor, Internal Representation of Documents, Query Results, and Retrieved objects]
This is not a detailed schematic illustration of IR systems
Information Retrieval Systems
• Major functions
– Analyze contents of information items
– Represent the contents of the analyzed sources in a way
suitable for matching with users’ queries
– Analyze users’ information needs and represent them in a form
that will be suitable for matching with the database
– Match the search statement with the stored database
– Retrieve or generate relevant information, in a
ranking which reflects relevance
– Make necessary adjustments in the system based on feedback
from users
Components of an IR System
• An IR system comprises the following major subsystems
– Document selection subsystem
• Documents are there in the database. How are we going to select those
documents that are relevant (matched with user requests)?
– Vocabulary subsystem
• In indexing we need to use a controlled vocabulary, i.e., a list of selected
subject terms to represent a document. The index is updated on the basis
of this vocabulary
– Text Operations subsystem
• Forms index words (tokens)
– Tokenization
– Stopword removal
– Stemming
Components of an IR System
– Indexing subsystem
• Is a means of organizing the documents selected
• Constructs an inverted index of word to document pointers.
– Mapping from keywords to document ids
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Term        Doc #
I           1
did         1
enact       1
julius      1
caesar      1
I           1
was         1
killed      1
i'          1
the         1
capitol     1
brutus      1
killed      1
me          1
so          2
let         2
it          2
be          2
with        2
caesar      2
the         2
noble       2
brutus      2
hath        2
told        2
you         2
caesar      2
was         2
ambitious   2
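The mapping from terms to document ids can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function names `tokenize` and `build_inverted_index` and the simple tokenization rule (lowercase, letters and apostrophes only) are assumptions made for the example.

```python
from collections import defaultdict
import re

# The two example documents from the slide (Doc 1 and Doc 2).
DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


index = build_inverted_index(DOCS)
print(index["caesar"])   # documents that contain "caesar"
print(index["killed"])   # documents that contain "killed"
```

Unlike the Term / Doc # list above, which still has one row per occurrence, the index built here merges repeated terms so that each term points to one list of document ids.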
Components of an IR System
– The (document-query) matching subsystem
• This step matches the users’ queries against the available documents to find those that
are relevant.
– Searching subsystem
• Retrieves documents that contain a given query token from the inverted
index.
– Ranking subsystem
• Scores all retrieved documents according to a relevance metric
– The user (system) interface subsystem
• Software that enables you to give commands to the system (document
& query input module) and to receive the system's responses (output module).
• Manages interaction with the user:
– Query input and document output.
– Relevance feedback.
– Visualization of results
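The searching and ranking subsystems can be sketched together as follows. This is a hedged illustration only: the toy collection, the function names, and the relevance metric (the summed frequency of the query terms in each document) are assumptions chosen for brevity, not the metric prescribed by the course.

```python
from collections import Counter, defaultdict

# Hypothetical toy collection; the ids and texts are illustrative only.
DOCS = {
    1: "brutus killed caesar",
    2: "caesar was ambitious caesar was noble",
    3: "the noble brutus",
}


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.split()).items():
            index[term][doc_id] = freq
    return index


def search(index, query):
    """Searching: ids of documents that contain at least one query token."""
    hits = set()
    for term in query.split():
        hits.update(index.get(term, {}))
    return hits


def rank(index, query, hits):
    """Ranking: score each hit by summed query-term frequency (an assumed metric)."""
    scores = {
        doc_id: sum(index.get(term, {}).get(doc_id, 0) for term in query.split())
        for doc_id in hits
    }
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_index(DOCS)
hits = search(index, "noble caesar")
ranking = rank(index, "noble caesar", hits)
print(ranking)
```

Keeping `search` and `rank` as separate functions mirrors the split between the two subsystems: the first only decides which documents are candidates, the second orders them.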
Components of an IR System
– Query Operations (query reformulation) subsystem
• Transform the query to improve retrieval
– Query expansion using a thesaurus.
– Query transformation using relevance feedback.
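Query expansion with a thesaurus can be illustrated with a short sketch. The thesaurus entries and the function name `expand_query` are hypothetical; a real system would use a curated thesaurus and would usually weight the added terms.

```python
# Hypothetical hand-made thesaurus; a real system would load a curated one.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "film": ["movie"],
}


def expand_query(query, thesaurus):
    """Return the query terms followed by their thesaurus entries, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("car film review", THESAURUS))
```

The expanded term list can then be passed to the searching subsystem in place of the original query, so that documents using a synonym are also retrieved.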
Components of an IR System
• The main components of an IR system can also be discussed in
terms of
– Input, process, and output components or
– Human components (users, organizations, information
professionals) and system components (data, devices &
media, algorithms & procedures)
Primary goal of Information Retrieval Systems
• Retrieve all the documents which are relevant to the user’s query,
while retrieving as few non-relevant documents as possible
– Capture wanted items and filter out unwanted items
More on IR
• IR usually deals with natural language (NL) text, which is not always well
structured and can be semantically ambiguous
– Simple matching of words is a brittle approach, as one word
can have many different meanings
• IR deals with very large sets of documents
– Requires a high degree of robustness and efficiency
• IR often deals with domain-independent and multilingual sets
of documents
More on IR
• The activities of IR can be divided into 3 main
processes
– Content analysis
– Exploiting information structure
– Evaluation
More on IR
• Content analysis (main activity of IR)
– Is also called subject analysis
– The identification of subject matter in document texts
– Deciding the “aboutness” of a document
– Concerned with describing contents of documents
– Deals with representation of the thought-content of the
document
– It involves the analysis and assignment of terms or
identifiers that are capable of representing document content,
which can be used as access points to that document
– Indexing, cataloguing and abstracting are some of the
processes used to represent the thought content of the
document
More on IR
• Information structure (main activity of IR)
– Concerned with exploiting relationships between documents to
improve the efficiency and effectiveness of retrieval
strategies.
• Evaluation (main activity of IR)
– Deals with measuring the effectiveness of
retrieval
– Performance and evaluation are aspects of IR
because we are talking about improvements
• Of the three processes, content analysis is an important step
IR Types
• IR can be structured for ease of discussion as
– Text IR
• Discusses the classic problem of searching a collection of documents for
useful information
• Focus is on document images that are predominantly text (rather than
pictures)
• These are called textual images and are amenable to automatic extraction
of key words
– Multimedia IR
• Discusses how to index document images and other binary data by
extracting features from their content and how to search them efficiently
– Human computer interaction (HCI) for IR
• Discusses current trends in IR towards improved user interfaces and better
data visualization tools
– Application of IR
• Covers modern applications of IR (such as the Web, bibliographic systems,
and digital libraries)
History of IR: reading assignment
Database Retrieval Vs. Information Retrieval
• Information items
– DBMS
• Highly structured data (of known nature), often
homogeneous records, often semantically unambiguous (well
defined semantics)
– IR systems
• Unstructured or unformatted data (as opposed to a relational
database). An individual document is not
structured as in a DB
• Free text
– Text data: papers, technical reports, news articles (completely
untagged or plain text)
– Web pages: HTML and XML files (semi-structured)
• Non-textual data: images, graphics, etc.
• Heterogeneous, semantically ambiguous (semantics is
frequently loose; we want approximate match)
Database Retrieval Vs. Information Retrieval
• Answers
– DBMS:
• Records, tuples; no ranking
• Well defined results
• Perfect precision and recall; each item is relevant
– IR systems
• Documents, as a ranked list. The issue of
ranking is very important (users page through the top k
documents)
• Fuzzy results
• Imperfect precision and recall; each item has a specific
relevance
Database Retrieval Vs. Information Retrieval
• Matching
– DBMS:
• Analogous to DB querying: which docs contain a set of
keywords?
• Exact match; we talk of items that match exactly; every record
either matches or fails to match a query; no notion of relevance
• A single erroneous object implies failure!
– IR systems
• Information about a subject or topic
• Partial or best match; we talk of possibly relevant items, not
exactly matched items
• Notion of relevance is most important; it needs a model
• Small errors are tolerated (and in fact inevitable)
• Interpret contents of information items
• Generate a ranking which reflects relevance
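The contrast between exact match and best match can be made concrete with a small sketch. The documents, the query, and the scoring rule (the fraction of query terms a document contains) are assumptions for illustration; they stand in for whatever retrieval model a real IR system adopts.

```python
# Hypothetical documents, each already reduced to a set of index terms.
DOCS = {
    "d1": {"information", "retrieval", "system"},
    "d2": {"database", "system", "query"},
    "d3": {"information", "system"},
}


def exact_match(docs, query_terms):
    """Data-retrieval style: a document either contains every term or fails."""
    return sorted(doc_id for doc_id, terms in docs.items() if query_terms <= terms)


def best_match(docs, query_terms):
    """IR style: rank documents by the fraction of query terms they contain."""
    scored = [
        (doc_id, len(query_terms & terms) / len(query_terms))
        for doc_id, terms in docs.items()
    ]
    # Keep partially matching documents, best score first.
    return sorted(
        [(d, s) for d, s in scored if s > 0],
        key=lambda pair: (-pair[1], pair[0]),
    )


query = {"information", "retrieval", "query"}
print(exact_match(DOCS, query))  # no document holds all three terms
print(best_match(DOCS, query))   # partially matching documents, ranked
```

With this query the exact-match function returns nothing, while the best-match function still returns every document that shares at least one term, ordered by how much of the query it covers.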
Database Retrieval Vs. Information Retrieval
• Items wanted
– DBMS
• Matching
– IR systems
• Relevant
• Model
– DBMS:
• Deterministic (answer can be predetermined)
– IR systems:
• Probabilistic, not deterministic; answer is not
predetermined
Database Retrieval Vs. Information Retrieval
• Querying
– DBMS:
• (DB query) assumes that the data is in a standardized
format
– IR system
• Query assumes that we work on plain, unformatted data
• Query language
– DBMS
• Artificial language
– IR system
• Natural language
Database Retrieval Vs. Information Retrieval
• Query specification
– DBMS
• Complete (requires precise retrieval criteria)
• A single erroneous object implies failure
– IR system
• Incomplete
• Small errors are tolerated
Database Retrieval Vs. Information Retrieval
• DB grew out of files and traditional business systems
• IR grew out of library science and the need to
categorize/group/access books/articles
• Information retrieval is much more difficult than data retrieval
• Both support queries over large data sets, using indexing
• Relationship
– Systems complement each other
Summary of Comparison (data retrieval Vs information retrieval)
                        Data retrieval          Information retrieval
• Classification        Monothetic              Polythetic
• Inference             Deduction               Induction
• Error response        Sensitive               Insensitive
• Content               Data                    Information
• Data object           Table                   Document
• Matching              Exact match             Partial match, best match
• Items wanted          Matching                Relevant
• Query language        Artificial (e.g., SQL)  Natural
• Query specification   Complete                Incomplete
• Model                 Deterministic           Probabilistic
• Information item      Highly structured       Unstructured or semi-structured
Basic Concepts
• The effective retrieval of relevant information is directly affected by two
things
– The user task
– The logical view of the documents adopted by the
retrieval system
• User task
– Request for information (how?)
• The user does this by translating his information need into
keywords (or the query language provided by the system)
• By formulating a query which expresses his information need
– How you search, how you coin terms and the terms you use have
something to do with the effectiveness of an information retrieval
system
Basic Concepts – the User Task
• System
– Responds with an answer set by matching two abstractions
• Abstraction of information needs
• Abstraction of information items
– That is, the system task is to process the queries and
retrieve documents that best approximate the user query
• In all these interactions between the user and the system, the
user task might be one of retrieval or browsing
[Figure: diagram with the labels Retrieval, Database, and Browsing]
Basic Concepts – the User Task
• Retrieval
– Is a more organized or focused way of looking for information
– Information need (retrieval goal) is focused, crystallized and well defined
– More purposeful (you are not glancing around)
– Often the user is sophisticated
• Browsing
– Information need (retrieval goal) is vague, imprecise, not focused, not well
defined
– You start somewhere and go on searching one item after another
– Glancing around
– Often the user is naive
• Both retrieval and browsing are initiated by the user
• What the user does contributes to the retrieval quality (positively or
negatively)
Basic Concepts - Logical View of Documents
• The logical view of a document is nothing but the representation
of the document
• Why represent documents?
– Documents are full of text, and not every word in the text is
meaningful for search/retrieval
– For this reason documents must be processed and represented
in a concise and identifiable format or structure
• When are documents best represented?
Basic Concepts - Logical View of Documents
• Documents in a collection are frequently represented through a
set of index terms or keywords
– An index term is a keyword (or group of related words) which has some
meaning of its own (and usually has the semantics of a noun)
– In its more general form, an index term is simply any word which appears
in the text of a document collection
– It is simply a (document) word whose semantics help in
remembering the document’s main theme
– Index terms are used to index and summarize the document
content
– Index terms are mainly nouns because nouns have meaning by
themselves and thus their semantics are easier to identify and to grasp
– Adjectives, adverbs, and connectives are less useful as index terms
Basic Concepts - Logical View of Documents
• Keywords might be extracted directly from the text of the
document automatically or might be specified by a human
expert (this is frequently done in the information science arena)
• No matter whether these representative keywords are derived
automatically or generated by a specialist, they provide a
logical view of a document
Basic Concepts - Logical View of Documents
• The logical view of documents
– Full text
– Set of index terms
– Full text + structure
• Standard steps to index documents (to obtain index terms)
[Figure: Docs → structure → accents, spacing → stopwords → noun groups → stemming → manual indexing; the outputs are labelled structure, full text, and index terms]
Document representation viewed as a continuum: logical view of docs might shift
from full text to a set of index terms
Document Processing Steps
From “Modern IR” textbook
Document Processing Steps – Lexical Analysis
• Is used to break the documents into tokens
– Is the identification of words in a text
– Divides text into distinct terms
– Extracts individual words (tokenizes)
– Converts a stream of characters (text) into tokens
• Deals with term separators, accents, spacing, etc.
– The treatment of special characters such as hyphens, digits, punctuation
marks and spaces needs to be carefully considered
– Usually disregards punctuation marks, numbers, spaces, etc.
• Involves decisions
– On how to treat cases and hyphens
– To use or not to use formatting directions (e.g., HTML tags)
– On what determines the boundaries for terms (assuming for instance one
would want to identify only words within a document for indexing)
Document Processing Steps – Lexical Analysis
• Can be done for various portions of the document
– Not on the title but on the abstract and the body
– Only on the title (programming assignment)
– Only on the document (programming assignment)
– Etc.
• Can be done for both the document and the query
• It may be accomplished by
– Unix tools (large, complex and hard to change)
– Finite state machines
• Move from state to state to convert characters into tokens
• Fast, small, relatively easy to design
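A finite state machine for lexical analysis can be sketched with just two states. The function name `fsm_tokenize` and the specific decisions it makes (fold case, split on hyphens, discard digits and punctuation) are assumptions for this example; they are exactly the kind of choices the previous slide says must be considered carefully.

```python
def fsm_tokenize(text):
    """Two-state machine: OUTSIDE a token or INSIDE a token."""
    OUTSIDE, INSIDE = 0, 1
    state = OUTSIDE
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            # A letter starts or extends the current token.
            current.append(ch.lower())
            state = INSIDE
        elif state == INSIDE:
            # Any non-letter (space, digit, punctuation) closes the token.
            tokens.append("".join(current))
            current = []
            state = OUTSIDE
    if state == INSIDE:
        tokens.append("".join(current))  # flush the last token
    return tokens


print(fsm_tokenize("State-of-the-art IR systems, since 1950!"))
```

Treating hyphens differently (for example keeping "state-of-the-art" as one token) would only require one more transition, which is why such machines are considered easy to design and change.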
Document Processing Steps – Stopwords
• Words that either
– Appear so frequently that they do not distinguish documents or
– Have more syntactic than semantic meaning (e.g., the)
• Words that are likely to occur in almost all documents in the
collection and therefore can hardly provide a distinction between
documents concerning relevance
• The best way to avoid retrieving too many documents that do not
particularly match the user’s need is to filter out these words
• One way to do this is to use the relative frequencies of the words
within the documents as a threshold value for determining whether a
word has significant meaning for a document's subject or not
• Natural candidates for a list of stopwords include
– Articles
– Prepositions
– Conjunctions
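The frequency-threshold idea can be sketched as below. This is one assumed reading of "relative frequency": the share of documents in which a term occurs. The sample documents, the 0.9 threshold, and the function names are illustrative choices only.

```python
# Hypothetical tokenized documents; in practice these come from lexical analysis.
DOCS = [
    ["the", "retrieval", "of", "information"],
    ["the", "storage", "of", "documents"],
    ["the", "ranking", "of", "retrieved", "documents"],
]


def frequent_terms(docs, threshold):
    """Terms whose document frequency (share of documents) is >= threshold."""
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t for t, df in doc_freq.items() if df / len(docs) >= threshold}


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


stopwords = frequent_terms(DOCS, threshold=0.9)
print(sorted(stopwords))
print(remove_stopwords(DOCS[0], stopwords))
```

In practice a fixed list of articles, prepositions and conjunctions is often combined with such a threshold, since a purely statistical cut-off depends heavily on the collection.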
Document Processing Steps – Noun Groups
• Deals with the identification of nouns
• Discards terms that are not nouns
– Eliminates adjectives, adverbs, and verbs
• Fixes spelling errors
• Uses a thesaurus to combine similar terms
Document Processing Steps – Stemming
• Grammar permits modification of terms that changes their type
rather than their meaning (e.g., plurals, gerunds, i.e., attaching
prefixes & suffixes)
• Reduce distinct words to their grammatical root or stem by
applying a stemming algorithm
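A deliberately naive suffix-stripping stemmer shows the idea. It is not one of the established stemming algorithms (such as Porter's); the suffix list, the minimum stem length, and the name `naive_stem` are assumptions for this sketch, and a real system would use a tested stemmer.

```python
# Assumed suffix list, checked in this order; not a standard stemming algorithm.
SUFFIXES = ("ing", "es", "ed", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["retrieving", "retrieved", "retrieves", "documents", "index"]:
    print(w, "->", naive_stem(w))
```

The three forms of "retrieve" collapse to the same stem, so a query using one form can match documents using another; the stem itself need not be a dictionary word.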
Document Processing Steps
• Terms remaining after document processing must be stored to
facilitate retrieval
• Extract words (or tokens) along with references to the records
they come from
– Typically they are stored in an inverted file/index
– Build an inverted file of words or tokens
Document Processing Steps
• Remark
– Identifying tokens, applying a stoplist, noun identification,
applying stemming, and creating a searchable data structure are
all parts of the indexing process
– Stoplist and stemming algorithms are applied to reduce the
number of tokens to be processed
– These operations are called text operations (or
transformations) and will be dealt with in detail in the next chapter
Basic Concepts - Logical View of Documents
• Modern computers make it possible to represent a document by its
full set of words
• In this case, we say that the retrieval system adopts a full text
logical view (or representation) of the documents
• With very large collections, however, even modern computers might
have to reduce the set of representative keywords
• This is accomplished through the standard steps outlined
earlier or on the next slide
Basic Concepts - Logical View of Documents
• Standard steps
– Recognizing document structures (titles, sections, paragraphs, etc.)
– Breaking the text into tokens
– The elimination of stop words
– The use of stemming
– The identification of noun groups (which eliminates adjectives, adverbs,
and verbs)
– Further operations can also be performed
– Storing in an inverted index (to be discussed in later chapters)
• Such text operations reduce the complexity of the document
representation and allow moving the logical view from that of a
full text to that of a set of index terms (high level
representation specified by a human subject)
Basic Concepts - Logical View of Documents
• The full text is the most complete logical view of a document
– Its usage usually implies higher computational costs
• A small set of categories (generated by a human specialist)
provides the most concise logical view of a document
– Its usage might lead to retrieval of poor quality
• Several intermediate logical views (of a document) might be
adopted by an information retrieval system as indicated in the
diagram
Basic Concepts - Logical View of Documents
• Besides adopting any of the intermediate representations, the
retrieval system might also recognize the internal structure
normally present in a document (e.g. chapters, sections,
subsections, etc.)
– This information on the structure of the document might be quite useful
and is required by the structured text retrieval models
• The index terms obtained are a description of a document
content and of its structure
The Information Retrieval Process
• The purpose of an information retrieval strategy is to retrieve
all the relevant documents whilst at the same time retrieving as
few non-relevant ones as possible
• The process involves a certain element of feedback
and is best illustrated using the diagram in the next slide
• Can be seen or interpreted in terms of component sub-processes
whose study yields many of the topics that will
be covered in the course
The Retrieval Process
[Figure: the retrieval process, with the components User Interface, Text Operations, Query Operations, Indexing, DB Manager Module, Searching, Index, Ranking, and Text Database, and the flows user need, logical view, user feedback, query, inverted file, retrieved docs, and ranked docs. Callouts: "Web browser", "Web search engine", "Used to detail our view of the retrieval process".]
A simple and generic software architecture to describe the retrieval process
The Information Retrieval Process
• There are three main ingredients to the IR process
– Texts or documents
– Queries
– The process of evaluation
For texts
• For texts, the main problem is to obtain a representation of the
text in a form which is amenable to automatic indexing
• This is achieved (i.e., the representation) by creating an
abbreviated form of the text, known as a text surrogate
• A typical surrogate would consist of a set of index terms or
keywords or descriptors
The Information Retrieval Process
For queries
• A query arises as a result of an information need on the part
of the user
• The query is then a representation of the information need and
must be expressed in a language understood by the system
• Due to the inherent difficulty of accurately representing the
information need, the query in an IR system is always regarded as
approximate and imperfect
The Information Retrieval Process
For the evaluation
• The evaluation process involves a comparison of the texts
actually retrieved with those the user expected to retrieve
• This often leads to some modification, typically of the query,
though possibly of the information need or even of the
surrogates
• The extent to which modification is required is closely linked
with the process of measuring the effectiveness of the retrieval
operation (recall and precision)
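As a small illustration of the two effectiveness measures, the sketch below uses their standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name and the document ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two collections of doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                      # relevant documents actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents retrieved, 3 documents judged relevant, 2 in common.
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"], relevant=["d2", "d4", "d7"])
print(p)            # 0.5
print(round(r, 3))  # 0.667
```

A low recall suggests the query should be broadened; a low precision suggests it should be narrowed.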
The Information Retrieval Process
• It is necessary to define the text database before any of the
retrieval processes are initiated
• This is usually done by the manager of the database and includes
specifying the following
– The documents to be used
– The operations to be performed on the text
– The text model to be used (the text structure and what
elements can be retrieved)
• The text operations transform the original documents and the
information needs and generate a logical view of them
The Information Retrieval Process
• Once the logical view of the documents is defined, the database
module builds an index of the text
– An index is a critical data structure
– It allows fast searching over large volumes of data
• Different index structures might be used, but the most popular
one is the inverted file (more on this later), as indicated in the
slide
• Once the document database is indexed, the retrieval process
can be initiated
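To make the idea of an inverted file concrete, here is a minimal sketch that maps each term to the documents containing it. The function name and the three toy documents are assumptions for illustration; tokenization is plain whitespace splitting, with no stop-word removal or stemming.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Toy collection (hypothetical).
docs = {
    1: "information retrieval systems",
    2: "database systems",
    3: "information need",
}
index = build_inverted_file(docs)
print(index["information"])  # [1, 3]
print(index["systems"])      # [1, 2]
```

Looking up a term is a single dictionary access, which is why the index allows fast searching without scanning the text of every document.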
The Information Retrieval Process
• The user first specifies a user need, which is then parsed and
transformed by the same text operations applied to the text
• Query operations might then be applied before the actual
query, which provides a system representation of the user
need, is generated
• Matching: the query is then processed to obtain the retrieved
documents
• Before being sent to the user, the retrieved documents are
ranked according to their likelihood of relevance
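The matching and ranking steps can be sketched as follows. This is only an illustration under a strong simplifying assumption: the likelihood of relevance is approximated by the number of query terms a document shares with the query. Real ranking models are covered later in the course; the function name and toy documents are made up.

```python
def search(query, docs):
    """Return ids of documents sharing at least one term with the query, best first."""
    q_terms = set(query.lower().split())
    scores = {}
    for doc_id, text in docs.items():
        overlap = len(q_terms & set(text.lower().split()))   # matching
        if overlap:
            scores[doc_id] = overlap
    # Ranking: more shared terms first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    "d1": "storage and retrieval of information",
    "d2": "information retrieval systems",
    "d3": "file structures",
}
print(search("information retrieval systems", docs))  # ['d2', 'd1']
```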
The Information Retrieval Process
• The user then examines the set of ranked documents in the
search for useful information
• Two choices for the user
– Reformulate query, run on entire collection
– Reformulate query, run on result set
• At this point, the user might pinpoint a subset of the documents
seen as definitely of interest and initiate a user feedback cycle
• In such a cycle, the system uses the documents selected by the
user to change the query formulation
• Hopefully, this modified query is a better representation of the
real user need
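One simple way to picture the feedback cycle is query expansion: terms that occur often in the documents the user selected are added to the query. The sketch below is an assumption made for illustration, not a specific method from the course; expand_query and its n_new parameter are hypothetical names, and no stop-word filtering is applied.

```python
from collections import Counter

def expand_query(query_terms, selected_docs, n_new=2):
    """Add the n_new most frequent new terms from the user-selected documents."""
    seen = set(query_terms)
    counts = Counter(
        term
        for text in selected_docs
        for term in text.lower().split()
        if term not in seen
    )
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in ranked[:n_new]]

new_query = expand_query(["jaguar"], ["jaguar cat habitat", "jaguar cat diet"])
print(new_query)  # ['jaguar', 'cat', 'diet']
```

The reformulated query can then be run on the entire collection or on the previous result set.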
Structure of an IR System
• An Information Retrieval System serves as a bridge between the world
of authors and the world of readers/users
• That is, writers present a set of ideas in a document using a set of
concepts
[Figure: a black box placed between the User and the Documents]
• The black box is the processing part of the information
retrieval system
• It includes mainly indexing and searching
Structure of an IR System
[Figure: Information Storage and Retrieval System (adapted from
Soergel, p. 19). Search line: interest profiles and queries are
translated by formulating the query in terms of descriptors, then
kept in Store 1 (profiles/search requests). Storage line: documents
and data are translated by indexing (descriptive and subject), then
kept in Store 2 (document representations). Both translations follow
the rules of the game = rules for subject indexing + thesaurus (which
consists of lead-in vocabulary and indexing language). Comparison/
matching of Store 1 against Store 2, followed by ranking, yields the
potentially relevant documents.]
Structure of an IR System
• Translation from user need to query
– Usually done manually (by the user)
– Tools available to assist the process
• Translation from item to representation (surrogate)
– Often done automatically (by the system)
– Representation can be at different levels:
• Full text, abstract only, index terms only, etc.
• Duality of the two translations
– User query can be regarded as the representation of the ideal (sought-
after) item
– Often, similar techniques are used to generate both
Background Concepts for IR
• Data
• Information
• Retrieval
• Browsing
• Information Retrieval
• Information Storage and Retrieval
• Information Retrieval System
• Database
• Database system
• Information Need
• Document
• Text
• Query
• Searching
• Index Exhaustivity
• Index Specificity
• Recall
• Term
• Index term
• Keyword
• Descriptor
• Indexing
• Index
• Indexing Language
• Surrogate
• Inverted File
• Logical View of Document
• Stemming
• Stopword
• Stop List
• Precision
• Controlled Vocabulary (pre & post-coordination)
• Feedback
• User
• Effectiveness
• Efficiency
• Collections
• Evaluation
• Relevance
End of Chapter 1
Important Concepts and Terms
• Document
– Is (in theory, at least) taken as more or less synonymous with "text" in
linguistics: any piece of linguistic material (in the widest sense)
that can reasonably be considered as a unit
• Information items (documents): usually text, but possibly also image, audio,
video, etc.
• Textual items/documents: may be of different scope (books, scientific
articles, paragraphs, newspapers, reports, email messages, etc.)
• Graphical and multimedia items/documents: images, line drawings, PPT
presentations, web pages, moving pictures/video
• Spoken documents: sound recordings (voice messages, radio news,
telephone conversations)
• Focus: we will consider textual documents
Important Concepts and Terms
• Searching
– The way the file is examined and the items in it are taken as
related to a search query
• Term
– A term is a semantic unit: a word, a phrase, or potentially the
root of a word
• Inverted file
– A stored list of index terms with each index term having
links to the documents containing that term.
– The inverted index can be extended to include
• Term location information
• Word numbers within sentences
• Term weights
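An inverted file extended with term locations can be sketched as below. This is a minimal illustration under stated assumptions: positions are word offsets counted from 0 within each document, and the term weight shown is a plain occurrence count, one of many possible weighting schemes. The function name and toy documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [word positions]}, with positions counted from 0."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {1: "to be or not to be", 2: "not to worry"}
index = build_positional_index(docs)
print(index["to"])  # {1: [0, 4], 2: [1]}

# A simple term weight: how many times the term occurs in each document.
weights = {doc_id: len(positions) for doc_id, positions in index["to"].items()}
print(weights)      # {1: 2, 2: 1}
```

Storing positions makes phrase and proximity queries possible, at the cost of a larger index.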
Important Concepts and Terms
• Query
– Is a request for documents pertaining to some topic
– Has usually been taken to mean the statement by the requester
describing his/her information need
• Database
– Is a collection of documents
Summary
• We have seen that an IR system deals with the sources of
information on the one hand and the users’ requirements on the
other hand.
• Generally, there are two major tasks (or functions) in an IR
system
- To analyze the contents of the sources of information
as well as the users’ queries, and then
- To match both to retrieve those items which are
relevant.
• These functions can further be elaborated as follows:
Summary
1. Identify the sources of information relevant to the areas of
interest of the target users’ community;
2. Analyze the contents of the sources (documents);
3. Represent the contents of the analyzed sources in a way
that will be suitable for matching with the users’ queries;
4. Analyze users’ queries and represent them in a form
that will be suitable for matching with the database;
5. Match the search statement with the stored database;
6. Retrieve the information that is relevant; and
7. Make necessary adjustments in the system based on the
feedback from the user.
End Of Slide