
B Tech (CSE)

Sub:-Information Retrieval Systems

Part-1

By
Dr Dilip Kumar Sharma
Program Coordinator (MTech-CSE) & Professor
Dept of Computer Engineering & Applications
GLA University, India
Contact : [email protected]
Cabin@ R-305 Academic Block-1(II Floor)
Syllabus
Information Retrieval Systems (CSE440)
Important Terms used in IR
INTERNET
 The Internet is a global system of interconnected computer
networks that use the standard Internet Protocol Suite
(TCP/IP) to serve billions of users worldwide.
 It is a network of networks that consists of millions of private,
public, academic, corporate and government networks, of local
to global scope, that are linked by a broad array of electronic,
wireless and optical networking technologies.
 The Internet carries a vast range of information resources and
services, such as the inter-linked hypertext documents of the
Web and the infrastructure to support email.
 The origin of the Internet is traced back to ARPANET, a 1969 project of the
U.S. Defense Advanced Research Projects Agency (DARPA).
INTERNET…
 ARPANET was developed as a system for researchers and defence
contractors to share information over the network.
 Fig. below depicts the general architecture of the Internet.
WORLD WIDE WEB
 The World-Wide Web was developed to be a pool of human knowledge
and human culture, which would allow collaborators at remote sites to
share their ideas and all aspects of a common project.
 It is a system of interlinked hypertext documents accessed via the
Internet.
 With a web browser, one can view web pages that may contain text,
images, animation, videos and other multimedia, and navigate between
them via hyperlinks.
Browsing
 Interactive task in which the user is more interested in exploring the
document collection than in retrieving documents which satisfy a specific
information need.
Clustering
 The grouping of documents which satisfy a set of common properties.
 The aim is to assemble together documents which are related among
themselves.
Data retrieval
 The retrieval of items (tuples, objects, Web pages, documents) whose
contents satisfy the conditions specified in a (relational-algebra-like) user
query.
Digital library
 The combination of a collection of digital objects (repository);
descriptions of those objects (metadata); a set of users (patrons or target
audience);
 and systems that offer a variety of services such as indexing, cataloguing,
search, browsing, retrieval, delivery and preservation.
Directory
 A usually hierarchical categorization of concepts in a domain of
knowledge.
Compression of text
 The study of techniques for representing text in fewer bytes or bits.
 Entropy: a measure of information defined on the statistics of the
characters of a text.
Coding
 Coding: the substitution of text symbols by numeric codes with the aim
of encrypting or compressing text.
Document
 A unit of retrieval. It might be a paragraph, a section, a chapter, a Web
page, an article, or a whole book.
 A document is a sequence of terms, expressing ideas about some topic
in a natural language.
Hypertext model
 A model of information retrieval based on representing document
relationships as edges of a generic graph in which the documents are the
nodes.
Index term (or keyword)
 A pre-selected term which can be used to refer to the content of a
document.
 In the Web, however, some search engines use all the words in a
document as index terms.
 Index: a data structure built on the text to speed up searching.
Federated search
 It is an information retrieval technology that allows the
simultaneous search of multiple searchable resources.
 A user makes a single query request which is
distributed to the search engines participating in the
group.
 The federated search then aggregates the results that
are received from the search engines for presentation
to the user.
Interoperability
 The working together of a number of computer systems, typically for a
common purpose,
 such as when a number of digital libraries “support federated searching”,
often enabled by standards and agreed-upon conventions including data
formats and protocols.
Inverted file / inverted index
 A text index composed of a vocabulary and a list of occurrences.
 It is an index data structure storing a mapping from content, such as
words or numbers, to its locations in a database file, or in a document or
a set of documents.
 The purpose of an inverted index is to allow fast full-text searches, at a
cost of increased processing when a document is added to the database.
Multimedia data
 Data combining several different media, such as text, images, sound
and video.
Query
 A query is a request for documents pertaining to some
topic.
 The expression of the user information need in the
input language provided by the information system.
 The most common type of input language allows
simply the specification of keywords and of a few
Boolean connectives.
Tag
 A string which is used to mark the beginning or ending of structural
elements in the text.
Information retrieval (IR)
 Part of computer science which studies the retrieval of information (not
data) from a collection of written documents.
 The retrieved documents aim at satisfying a user information need
usually expressed in natural language.
Search & Information Retrieval
 Search on the Web is a daily activity for many people throughout the
world.
 Search and communication are the most popular uses of the computer.
 Applications involving search are everywhere.
 The field of computer science that is most involved with R&D for search
is information retrieval (IR).
Information Retrieval
 “Information retrieval is a field concerned with the structure, analysis,
organization, storage, searching, and retrieval of information.” (Salton,
1968)
 A general definition that can be applied to many types of information
and search applications.
 The primary focus of IR since the 50s has been on text and documents.
 Information retrieval (IR) deals with the representation, storage,
organization of, and access to information items. The representation and
organization of the information items should provide the user with easy
access to the information in which he is interested. (Ricardo Baeza-Yates,
Berthier Ribeiro-Neto)
IR Motivation
 Given the user query, the key goal of an IR system is
to retrieve information which might be useful or
relevant to the user.
 The emphasis is on the retrieval of information as
opposed to the retrieval of data.
 An Information Retrieval (IR) System attempts to find
relevant documents to respond to a user’s request.
What is Different about IR from the rest of Computer Science
 Most algorithms in computer science have a “right” answer.
Consider the two problems:
 – Sort the following ten integers
 – Find the highest integer
Now consider:
 – Find the document most relevant to “hippos in the zoo”
What is a Document?

 Examples:
 web pages, email, books, news stories, scholarly papers, text messages,
Word™, PowerPoint™, PDF, forum postings, patents, IM sessions, etc.
 Common properties
 Significant text content
 Some structure (e.g., title, author, date for papers;
subject, sender, destination for email)
Documents vs. Database Records
 Database records (or tuples in relational databases) are typically made
up of well-defined fields (or attributes)
 e.g., bank records with account numbers, balances, names, addresses,
social security numbers, dates of birth, etc.
 Easy to compare fields with well-defined semantics to queries in order
to find matches
 Text is more difficult
Information versus Data Retrieval
 Data retrieval, in the context of an IR system, consists mainly of
determining which documents of a collection contain the keywords in the
user query which, most frequently, is not enough to satisfy the user
information need.
 In fact, the user of an IR system is concerned more with retrieving
information about a subject than with retrieving data which satisfies a
given query.
Information versus Data Retrieval ...
 A data retrieval language aims at retrieving all objects which satisfy
clearly defined conditions such as those in a relational algebra expression.
 Thus, for a data retrieval system, a single erroneous object among a
thousand retrieved objects means total failure.
 For an information retrieval system, however, the retrieved objects
might be inaccurate and small errors are likely to go unnoticed.
Information versus Data Retrieval ...
 The main reason for this difference is that information retrieval usually
deals with natural language text which is not always well structured and
could be semantically ambiguous.
 On the other hand, a data retrieval system (such as a relational
database) deals with data that has a well-defined structure and semantics.
Information versus Data Retrieval ...
 Data retrieval, while providing a solution to the user of a database
system, does not solve the problem of retrieving information about a
subject or topic.
 To be effective in its attempt to satisfy the user information need, the
IR System must somehow 'interpret' the contents of the information items
(documents) in a collection and rank them according to a degree of
relevance to the user query.
Information versus Data Retrieval ...
 This 'interpretation' of a document content involves extracting syntactic
and semantic information from the document text and using this
information to match the user information need.
 The difficulty is not only knowing how to extract this information but
also knowing how to use it to decide relevance.
 Thus, the notion of relevance is at the center of information retrieval.
In fact, the primary goal of an IR system is to retrieve all the documents
which are relevant to a user query while retrieving as few non-relevant
documents as possible.
Documents vs. Records
 Example bank database query
 Find records with balance > $50,000 in branches located in Amherst, MA.
 Matches easily found by comparison with field values of records
 Example search engine query
 bank scandals in western group
 This text must be compared to the text of entire news stories
The IR Problem
 Users of modern IR systems, such as search engine users, have
information needs of varying complexity.
 An example of a complex information need is as follows:
Find all documents that address the role of the Federal Government in
financing the operation of the National Railroad Transportation Corporation
The IR Problem…
 This full description of the user information need is not necessarily a
good query to be submitted to the IR system.
 Instead, the user might first translate this information need into a query.
 This translation process yields a set of keywords, or index terms, which
summarize the user information need.
 Given the user query, the key goal of the IR system is to retrieve
information that is useful or relevant to the user.
The IR Problem…
 That is, the IR system must rank the information items according to a
degree of relevance to the user query.
The IR Problem
 The key goal of an IR system is to retrieve all the items that are relevant
to a user query, while retrieving as few nonrelevant items as possible.
 The notion of relevance is of central importance in IR.
Architecture of the IR System
Retrieval and Ranking Processes
 The processes of indexing, retrieval, and ranking
WEB CRAWLERS
 The crawler retrieves web pages, commonly for use by a search engine.
It traverses the web by downloading the documents and following
embedded links from page to page [24].
 Crawlers are mainly used by web search engines to gather data for
indexing.
 Formally, crawlers may be defined as “Software programs that traverse
the World Wide Web information space by following the hypertext links
extracted from hypertext documents”.
THE CRAWLERS…
 Web crawlers are also known as spiders, or wanderers, or robots, etc.
 Since the WWW is decentralized, dynamic and diverse, finding specific
documents through navigation is a very difficult exercise.
 This problem is also called the Resource Discovery Problem.
 A crawler solves the Resource Discovery Problem in the context of the
WWW by retrieving information from remote sites using standard web
protocols.
THE CRAWLERS…
Since a crawler identifies a document by its URL, it picks up a seed URL and
downloads the corresponding robots.txt file, which contains downloading
permissions and information about the files that should be excluded by the
crawler.
On the basis of the host protocol, it downloads the document and stores the
related pages in the search engine database.
THE CRAWLERS…
 It then repeats the whole process as per the algorithm shown in Fig.
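A minimal single-file sketch of such a crawler, using only Python's standard
library; the breadth-first frontier, the per-host robots.txt cache and the naive
regex link extraction are illustrative choices, not the exact algorithm of the
figure.

from collections import deque
from urllib import robotparser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import re

def crawl(seed, max_pages=50):
    frontier = deque([seed])          # URLs waiting to be fetched
    seen = {seed}                     # avoid re-downloading the same URL
    robots = {}                       # one robots.txt parser cached per host
    pages = {}                        # url -> html; stands in for the SE database
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        host = "{0.scheme}://{0.netloc}".format(urlparse(url))
        if host not in robots:        # fetch /robots.txt once per host
            rp = robotparser.RobotFileParser(host + "/robots.txt")
            try:
                rp.read()
            except OSError:
                rp = None
            robots[host] = rp
        rp = robots[host]
        if rp and not rp.can_fetch("*", url):
            continue                  # excluded by the robot exclusion standard
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue
        pages[url] = html
        # naive link extraction; a real crawler would use an HTML parser
        for link in re.findall(r'href="([^"#]+)"', html):
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages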
ROBOTS.TXT: A STANDARD FOR ROBOT EXCLUSION
 The crawlers or robots traverse many pages in the WWW by recursively
retrieving linked pages.
 In 1993 and 1994, there were occasions when robots visited WWW
servers where they were not welcome, for various reasons such as those
given below:
 Certain robots flooded servers with rapid-fire requests
 Some robots retrieved the same file repeatedly
 Robots traversed parts of WWW servers which were not suitable, such as
duplicate information, temporary information, access to CGI scripts etc.
 The above-mentioned points made it necessary to establish mechanisms
for WWW servers to indicate to robots which parts of their servers should
not be accessed. Therefore, the concept of a file named "/robots.txt" came
into existence. It specifies an access policy for robots.
ROBOTS.TXT: A STANDARD FOR ROBOT EXCLUSION…
 The record starts with one or more User-agent lines, followed by one or
more Disallow lines, as detailed below.
 User-agent
 The value of the field is the name of the robot for which the record is
describing the access policy.
 If more than one User-agent field is present, the record describes an
identical access policy for more than one robot. At least one field needs
to be present per record.
 The robot should be liberal in interpreting that field. A case-insensitive
substring match of the name without version information is recommended.
 If the value is "*", the record describes the default access policy for any
robot that has not matched any of the other records. It is not allowed to
have multiple such records in the "/robots.txt" file.
ROBOTS.TXT: A STANDARD FOR ROBOT EXCLUSION…
 Disallow
 The value of this field specifies a partial URL that is not to be visited.
This can be a full path or a partial path; any URL that starts with this
value will not be retrieved.
 For example: Disallow: /help disallows both /help.html and
/help/index.html,
 whereas Disallow: /help/ would disallow /help/index.html but allow
/help.html.
 An empty value indicates that all URLs can be retrieved.
 At least one Disallow field needs to be present.
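A small sketch of this prefix rule, reproducing the /help vs. /help/ example
above; the disallowed() helper is hypothetical, for illustration only.

def disallowed(path, rules):
    # A Disallow value is a simple path prefix: a URL path is excluded
    # if it starts with any non-empty Disallow value.
    return any(rule and path.startswith(rule) for rule in rules)

print(disallowed("/help.html",       ["/help"]))   # True
print(disallowed("/help/index.html", ["/help"]))   # True
print(disallowed("/help.html",       ["/help/"]))  # False
print(disallowed("/help/index.html", ["/help/"]))  # True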
ROBOTS.TXT: A STANDARD FOR ROBOT EXCLUSION…
Example: The following file:
# Go away
User-agent: *
Disallow: /
indicates that no robots should visit this site further.
Example: The following file:
# robots.txt for http://www.example.com
User-agent: *
Disallow: /cyberworld/map/
Disallow: /temp/
Disallow: /foo.html
indicates that no robot should visit any URL starting with
"/cyberworld/map/", "/temp/" or "/foo.html".
ROBOTS.TXT: A STANDARD FOR ROBOT EXCLUSION…
Example: The following file:
# robots.txt for http://www.example.com
User-agent: *
Disallow: /GLAU/secret/
# Crawler knows where to go
User-agent: Crawler
Disallow:
indicates that no robot should visit any URL starting with "/GLAU/secret/",
except the robot called "Crawler".
Thus, "/robots.txt" specifies which parts of the server URL space should be
avoided by the robots.
Logical view of a document: from full text to a set of index terms.
Document Preprocessing
Document preprocessing is a procedure which can be divided
mainly into five text operations (or transformations):
(1) Lexical analysis of the text with the objective of treating digits,
hyphens, punctuation marks, and the case of letters.
(2) Elimination of stopwords with the objective of filtering out
words with very low discrimination values for retrieval
purposes.
(3) Stemming of the remaining words with the objective of removing
affixes (i.e., prefixes and suffixes) and allowing the retrieval of documents
containing syntactic variations of query terms (e.g., connect, connecting,
connected, etc.).
Document Preprocessing…
(4) Selection of index terms to determine which words/stems (or groups
of words) will be used as indexing elements.
 Usually, the decision on whether a particular word will be used as an
index term is related to the syntactic nature of the word.
 In fact, nouns frequently carry more semantics than adjectives, adverbs,
and verbs.
(5) Construction of term categorization structures, such as a thesaurus, or
extraction of structure directly represented in the text, for allowing the
expansion of the original query with related terms (a usually useful
procedure).
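A minimal sketch of operations (1)-(3), assuming an illustrative stopword
list and a toy suffix-stripping stemmer standing in for a real one such as
Porter's.

import re

STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is"}

def stem(word):
    # toy stemmer: strips a few common suffixes (illustrative rules only)
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # (1) lexical analysis
    tokens = [t for t in tokens if t not in STOPWORDS]  # (2) stopword removal
    return [stem(t) for t in tokens]                    # (3) stemming

print(preprocess("Connecting and connected nodes in the graph"))
# ['connect', 'connect', 'node', 'graph']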
Logical view of a document throughout the various phases of text
preprocessing.
Lexical Analysis of the Text
Elimination of Stopwords
Stemming
Keyword Selection
The Web (A Brief History)
 Vannevar Bush’s 1945 essay As We May Think influenced people like
Douglas Engelbart, who invented the computer mouse and introduced the
concept of hyperlinked texts
 Ted Nelson, working in his Project Xanadu, pushed
the concept further and coined the term hypertext
 A hypertext allows the reader to jump from one
electronic document to another, which was one
important property regarding the problem that Tim
Berners-Lee faced in 1989



The Web (A Brief History)
 At the time, Berners-Lee worked in Geneva at the
CERN—European Organization for Nuclear Research
 There, researchers who wanted to share documentation
with others had to reformat their documents to make
them compatible with an internal publishing system
 Berners-Lee reasoned that it would be nice if the solution
of sharing documents were decentralized
 He saw that a networked hypertext would be a good
solution and started working on its implementation
The Web (A Brief History)
 In 1990, Berners-Lee
 Wrote the HTTP protocol
 Defined the HTML language
 Wrote the first browser, which he called World Wide Web
 Wrote the first Web server
 In 1991, he made his browser and server software
available in the Internet
 The Web was born!
INTRODUCTION TO WEB
 The World Wide Web (WWW) is a huge repository of hyperlinked
documents containing useful information.
 In recent years, the exponential growth of information technology has
led to a large amount of information being available through the WWW.
 Searching the WWW for useful and relevant information has become
more challenging as the size of the Web continues to grow.
INTRODUCTION TO WEB
 The WWW can be broadly divided into two types, i.e., the surface web
and the deep web, according to the depth of data.
 The contents of the deep web are dynamically generated by the web
server and returned to the user during an online query.
 A web crawler is a program that is specialized in downloading web
contents. Conventional web crawlers can easily index, search and analyze
the surface web of interlinked HTML pages, but they have limitations in
fetching data from the deep web.
 To access the deep web, a user must request information from a
particular database through a search interface.
How the Web Changed Search
 Web search is today the most prominent application of IR and its
techniques—the ranking and indexing components of any search engine
are fundamentally IR pieces of technology
 The first major impact of the Web on search is related to the
characteristics of the document collection itself
 The Web is composed of pages distributed over millions of sites and
connected through hyperlinks
 This requires collecting all documents and storing copies of them in a
central repository, prior to indexing
 This new phase in the IR process, introduced by the Web, is called
crawling
How the Web Changed Search…
 The second major impact of the Web on search is related to:
 The size of the collection
 The volume of user queries submitted on a daily basis
 As a consequence, performance and scalability have become critical
characteristics of the IR system
 The third major impact: in a very large collection, predicting relevance
is much harder than before
 Fortunately, the Web also includes new sources of evidence
 Ex: hyperlinks and user clicks in documents in the answer set

Scalability is the ability of a system, network, or process to handle a
growing amount of work in a capable manner, or its ability to be enlarged
to accommodate that growth.
How the Web Changed Search…
 The fourth major impact derives from the fact that the Web is also a
medium to do business
 The search problem has been extended beyond the seeking of text
information to also encompass other user needs
 Ex: the price of a book, the phone number of a hotel, the link for
downloading a software package
Practical Issues in the Web
 Security
 Commercial transactions over the Internet are not yet a completely
safe procedure
 Privacy
 Frequently, people are willing to exchange information as long as it
does not become public
 Copyright and patent rights
 It is far from clear how the widespread availability of data on the Web
affects copyright and patent laws in the various countries
CSE440
Information Retrieval Systems

Dr Dilip Kumar Sharma
Boolean retrieval
Sec. 1.1

Unstructured data in 1680
 Which plays of Shakespeare contain the words Brutus AND Caesar but
NOT Calpurnia?
 One could grep all of Shakespeare’s plays for Brutus and Caesar, then
strip out lines containing Calpurnia?
Sec. 1.1

Term-document incidence

            Antony &   Julius   The      Hamlet  Othello  Macbeth
            Cleopatra  Caesar   Tempest
Antony          1        1        0        0       0        1
Brutus          1        1        0        1       0        0
Caesar          1        1        0        1       1        1
Calpurnia       0        1        0        0       0        0
Cleopatra       1        0        0        0       0        0
mercy           1        0        1        1       1        1
worser          1        0        1        1       1        0

1 if play contains word, 0 otherwise.
Example query: Brutus AND Caesar BUT NOT Calpurnia
Sec. 1.1

Incidence vectors
 So we have a 0/1 vector for each term.
 To answer the query: take the vectors for Brutus, Caesar and Calpurnia
(complemented), then bitwise AND:
 110100 AND 110111 AND 101111 = 100100
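A sketch of this bitwise AND, encoding each 0/1 row of the incidence matrix
as a Python integer (leftmost bit = first play in the table above).

brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
mask      = 0b111111                 # one bit per play
answer = brutus & caesar & (~calpurnia & mask)
print(format(answer, "06b"))         # 100100 -> Antony and Cleopatra, Hamlet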
Sec. 1.1
Basic assumptions of Information
Retrieval
 Collection: Fixed set of documents
 Goal: Retrieve documents with information
that is relevant to the user’s information need
and helps the user complete a task

Sec. 1.1

Bigger collections
 Consider N = 1 million documents, each with about
1000 words.
 Avg 6 bytes/word including spaces/punctuation
 6GB of data in the documents.

 Say there are M = 500K distinct terms among these.

Sec. 1.1

Can’t build the matrix
 A 500K x 1M matrix has half-a-trillion 0’s and 1’s.
 But it has no more than one billion 1’s.
 ⇒ the matrix is extremely sparse.
 What’s a better representation?
 We only record the 1 positions.
Sec. 1.2

Inverted index
 For each term t, we must store a list of all documents that contain t.
 Identify each document by a docID, a document serial number.

Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
Sec. 1.2

Inverted index

Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

The terms on the left form the dictionary; each docID entry in a list on the
right is a posting. Postings are sorted by docID.
Sec. 1.2

Inverted index construction

Documents to be indexed: Friends, Romans, countrymen.
  ↓ Tokenizer
Token stream: Friends Romans Countrymen
  ↓ Linguistic modules
Modified tokens: friend roman countryman
  ↓ Indexer
Inverted index:
  friend → 2 → 4
  roman → 1 → 2
  countryman → 13 → 16
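A minimal sketch of this pipeline, with a bare lower-case split standing in
for the tokenizer and linguistic modules.

from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)               # one posting per (term, doc)
    return {term: sorted(ids) for term, ids in index.items()}  # sort by docID

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july",
        4: "july new home sales rise"}
print(build_inverted_index(docs)["sales"])   # [1, 2, 3, 4]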
Sec. 1.2

Indexer steps: Token sequence
 Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was
ambitious
Sec. 1.2

Indexer steps: Sort
 Sort by terms
 And then docID
Core indexing step
Sec. 1.2

Indexer steps: Dictionary & Postings
 Multiple term entries in a single document are merged.
 Split into Dictionary and Postings
 Doc. frequency information is added.

Why frequency? Will discuss later.
Sec. 1.2

Where do we pay in storage?
 The dictionary: terms and counts.
 The postings: lists of docIDs.
 Pointers from dictionary entries to postings lists.
Problem 1:
 Draw the inverted index that would be built for the following document
collection. (See Figure 1.3 for an example.)
Doc 1 new home sales top forecasts
Doc 2 home sales rise in july
Doc 3 increase in home sales in july
Doc 4 july new home sales rise
Problem 1: Sol
 Inverted index for the collection:
forecasts → 1
home → 1 → 2 → 3 → 4
in → 2 → 3
increase → 3
july → 2 → 3 → 4
new → 1 → 4
rise → 2 → 4
sales → 1 → 2 → 3 → 4
top → 1
Problem 2:
Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
a. Draw the term-document incidence matrix for this document collection.
b. Draw the inverted index representation for this collection, as in Figure
1.3.
Problem 2: Sol
a. Term-document incidence matrix:

               d1  d2  d3  d4
approach        0   0   1   0
breakthrough    1   0   0   0
drug            1   1   0   0
for             1   0   1   1
hopes           0   0   0   1
new             0   1   1   1
of              0   0   1   0
patients        0   0   0   1
schizophrenia   1   1   1   1
treatment       0   0   1   0

b. Inverted index:
approach → 3
breakthrough → 1
drug → 1 → 2
for → 1 → 3 → 4
hopes → 4
new → 2 → 3 → 4
of → 3
patients → 4
schizophrenia → 1 → 2 → 3 → 4
treatment → 3
Sec. 1.3

Query processing: AND
 Consider processing the query:
Brutus AND Caesar
 Locate Brutus in the Dictionary; retrieve its postings.
 Locate Caesar in the Dictionary; retrieve its postings.
 “Merge” the two postings:

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Sec. 1.3

The merge
 Walk through the two postings simultaneously, in time linear in the total
number of postings entries.

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Intersection → 2 8

 If list lengths are x and y, the merge takes O(x+y) operations.
 Crucial: postings sorted by docID.
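A sketch of this linear two-pointer merge, reproducing the Brutus/Caesar
example.

def intersect(p1, p2):
    # both lists must be sorted by docID
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]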
BOOLEAN MODEL
 The Boolean model of information retrieval is one of
the earliest and simplest retrieval methods that use the
method of exact matching to match documents
according to the user’s query wherein words are
logically combined by using Boolean operators like
AND, OR, and NOT.
 For example, the Boolean AND of two logical
statements x and y means that both x AND y must be
satisfied, while the Boolean OR of these two
statements means that at least one of these
statements must be satisfied.

BOOLEAN MODEL…
 In this model, a large number of logical statements can be combined
using the three Boolean operators.
 This model operates by considering which keywords of the user query
are present in a document.
 Thus, if the keywords are found in a document, then the document is
called relevant.
 In fact, there is no concept of a partial match between documents and
queries. This strategy can lead to poor performance [14].
The Boolean Model…
Drawbacks of the Boolean Model
 Retrieval based on binary decision criteria with no notion of partial
matching
 No ranking of the documents is provided (absence of a grading scale)
 Information need has to be translated into a Boolean expression, which
most users find awkward
 The model frequently returns either too few or too many documents in
response to a user query
Sec. 1.3

Query optimization
 What is the best order for query processing?
 Consider a query that is an AND of n terms.
 For each of the n terms, get its postings, then AND them together.

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 16 21 34
Calpurnia → 13 16

Query: Brutus AND Calpurnia AND Caesar
Sec. 1.3

Query optimization example
 Process in order of increasing freq:
 start with the smallest set, then keep cutting further.
(This is why we kept document freq. in the dictionary.)

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 16 21 34
Calpurnia → 13 16

Execute the query as (Calpurnia AND Brutus) AND Caesar.
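A sketch of this ordering heuristic on the postings above; intersect() is the
linear merge from the earlier sketch, repeated so the example is
self-contained.

def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def and_query(postings):
    # postings: term -> sorted docID list; process rarest terms first
    terms = sorted(postings, key=lambda t: len(postings[t]))
    result = postings[terms[0]]
    for term in terms[1:]:
        result = intersect(result, postings[term])
        if not result:          # early exit: an AND can only shrink the result
            break
    return result

print(and_query({"Brutus":    [2, 4, 8, 16, 32, 64, 128],
                 "Caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
                 "Calpurnia": [13, 16]}))   # [16]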
Information Retrieval Systems (CSE440)
Part-1
The term vocabulary and postings lists
By
Dr Dilip Sharma
GLA University, India
Document delineation and character sequence decoding
 Obtaining the character sequence in a document
 Digital documents that are the input to an indexing process are typically
bytes in a file or on a web server.
 The first step of processing is to convert this byte sequence into a linear
sequence of characters.
 For the case of plain English text in ASCII encoding, this is trivial. But
often things get much more complex.
 The sequence of characters may be encoded by one of various
single-byte or multibyte encoding schemes, such as Unicode UTF-8, or
various national or vendor-specific standards.
 We need to determine the correct encoding.
Determining the vocabulary of terms
Tokenization
 Given a character sequence and a defined document unit, tokenization
is the task of chopping it up into pieces, called tokens, perhaps at the
same time throwing away certain characters, such as punctuation.
Here is an example of tokenization:
Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends | Romans | Countrymen | lend | me | your | ears
 These tokens are often loosely referred to as terms or words, but it is
sometimes important to make a type/token distinction.
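A minimal tokenizer sketch; splitting on word characters and discarding
punctuation is just one possible policy.

import re

def tokenize(text):
    return re.findall(r"\w+", text)

print(tokenize("Friends, Romans, Countrymen, lend me your ears;"))
# ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']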
Stemming and Lemmatization
 The goal of both stemming and lemmatization is to reduce inflectional
forms, and sometimes derivationally related forms, of a word to a common
base form. However, the two differ in their flavor.
 Stemming usually refers to a crude heuristic process that chops off the
ends of words in the hope of achieving this goal correctly most of the
time, and often includes the removal of derivational affixes.
 Lemmatization usually refers to doing things properly with the use of a
vocabulary and morphological analysis of words, normally aiming to
remove inflectional endings only and to return the base or dictionary form
of a word, which is known as the lemma.
Stemming and Lemmatization…
 If confronted with the token saw, stemming might
return just s, whereas lemmatization would attempt to
return either see or saw depending on whether the use
of the token was as a verb or a noun.
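An illustrative comparison using the NLTK library (assumed to be installed,
with the WordNet data downloaded once via nltk.download("wordnet")).

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("operating"))              # crude suffix chopping: 'oper'
print(lemmatizer.lemmatize("saw", pos="v"))   # verb reading -> 'see'
print(lemmatizer.lemmatize("saw", pos="n"))   # noun reading -> 'saw'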
Information Retrieval System [CSE440]
Part 1: Dictionaries and tolerant retrieval
WILDCARD QUERY
 A query such as *a*e*i*o*u*, which seeks
documents containing any term that includes all
the five vowels in sequence.
 The * symbol indicates any (possibly empty)
string of characters.
 Users pose such queries to a search engine
when they are uncertain about how to spell a
query term, or seek documents containing
variants of a query term;
 for instance, the query automat* would seek
documents containing any of the terms
automatic, automation and automated.
BINARY TREE
 The best-known search tree is the binary tree, in which each internal
node has two children.
 The search for a term begins at the root of the tree.
 Each internal node (including the root) represents a binary test, based
on whose outcome the search proceeds to one of the two sub-trees below
that node.
Sec. 3.1

TREE: BINARY TREE

                Root
              /      \
           a-m        n-z
          /   \      /   \
       a-hu  hy-m  n-sh  si-z
B-TREE
 A search tree commonly used for a dictionary is the B-tree – a search
tree in which every internal node has a number of children in the interval
[a, b], where a and b are appropriate positive integers;
 Figure 3.2 shows an example with a = 2 and b = 4.
 Each branch under an internal node again represents a test for a range
of characters.
Sec. 3.1

TREE: B-TREE

Root node: [a-hu | hy-m | n-z]

 Definition: Every internal node has a number of children in the interval
[a,b] where a, b are appropriate natural numbers, e.g., [2,4].
Sec. 3.2

WILD-CARD QUERIES
 Wildcard queries are used in any of the following
situations:
(1) the user is uncertain of the spelling of a query term (e.g.,
Sydney vs. Sidney, which leads to the wildcard query S*dney);
(2) the user is aware of multiple variants of spelling a term and
(consciously) seeks documents containing any of the variants
(e.g., color vs. colour);
(3) the user seeks documents containing variants of a term that
would be caught by stemming, but is unsure whether the search
engine performs stemming (e.g., judicial vs. judiciary, leading to
the wildcard query judicia*);
(4) the user is uncertain of the correct rendition of a foreign word or
phrase (e.g., the query Universit* Stuttgart).

Sec. 3.2

WILD-CARD QUERIES: *
 mon*: find all docs containing any word beginning with “mon”.
 Easy with binary tree (or B-tree) lexicon: retrieve all words in the range
mon ≤ w < moo
 *mon: find words ending in “mon”: harder
 Maintain an additional B-tree for terms backwards. Can retrieve all words
in the range nom ≤ w < non.
 Exercise: from this, how can we enumerate all terms meeting the
wild-card query pro*cent?
Sec. 3.2

QUERY PROCESSING
 At this point, we have an enumeration of all terms in the dictionary that
match the wild-card query.
 We still have to look up the postings for each enumerated term.
 E.g., consider the query:
se*ate AND fil*er
This may result in the execution of many Boolean AND queries.
Sec. 3.2
B-trees handle *’s at the end of a
query term
 How can we handle *’s in the middle of query
term?
 co*tion
 We could look up co* AND *tion in a B-tree and
intersect the two term sets
 Expensive
 The solution: transform wild-card queries so that
the *’s occur at the end
 This gives rise to the Permuterm Index.

Sec. 3.2.1

Permuterm index
 Our first special index for general wildcard queries is
the permuterm index, a form of inverted index.
 First, we introduce a special symbol $ into our character
set, to mark the end of a term. Thus, the term hello is
shown here as the augmented term hello$.
 Next, we construct a permuterm index, in which the
various rotations of each term (augmented with $) all
link to the original vocabulary term.
 Figure 3.3 gives an example of such a permuterm index
entry for the term hello.
 We refer to the set of rotated terms in the permuterm
index as the permuterm vocabulary.
Sec. 3.2.1

Permuterm index
 For term hello, index under:
hello$, ello$h, llo$he, lo$hel, o$hell
where $ is a special symbol.
 Queries:
X     → lookup on X$
X*    → lookup on $X*
*X    → lookup on X$*
*X*   → lookup on X*
X*Y   → lookup on Y$X*
X*Y*Z → ??? Exercise!
 Query = hel*o: X = hel, Y = o → lookup o$hel*
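A sketch of a permuterm index for single-* queries; the linear dictionary
scan below stands in for the B-tree prefix lookup a real implementation
would use.

def permuterm_index(vocabulary):
    # every rotation of term+'$' maps back to the original term
    index = {}
    for term in vocabulary:
        augmented = term + "$"
        for i in range(len(augmented)):
            index[augmented[i:] + augmented[:i]] = term
    return index

def wildcard_lookup(index, query):
    # rotate query+'$' so that '*' lands at the end, then prefix-match
    augmented = query + "$"
    star = augmented.index("*")
    rotated = augmented[star + 1:] + augmented[:star]
    return sorted({t for key, t in index.items() if key.startswith(rotated)})

idx = permuterm_index(["hello", "help", "hall"])
print(wildcard_lookup(idx, "hel*o"))   # ['hello']  (rotation: 'o$hel')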
Sec. 3.2.1

Permuterm query processing
 Rotate the query wild-card to the right.
 Now use B-tree lookup as before.
 Permuterm problem: ≈ quadruples the lexicon size (empirical
observation for English).
Sec. 3.2.2

Bigram (k-gram) indexes
 Whereas the permuterm index is simple, it can lead to a considerable
blowup from the number of rotations per term; for a dictionary of English
terms, this can represent an almost ten-fold space increase.
 We now present a second technique, known as the k-gram index, for
processing wildcard queries.
 A k-gram is a sequence of k characters. Thus cas, ast and stl are all
3-grams occurring in the term castle.
 We use a special character $ to denote the beginning or end of a term,
so the full set of 3-grams generated for castle is: $ca, cas, ast, stl, tle,
le$. In a k-gram index, the dictionary contains all k-grams that occur in
any term in the vocabulary.
 Each postings list points from a k-gram to all vocabulary terms
containing that k-gram. For instance, the 3-gram etr would point to
vocabulary terms such as metric and retrieval. An example is given in
Figure 3.4.
Sec. 3.2.2

Bigram (k-gram) indexes
 Enumerate all k-grams (sequences of k chars) occurring in any term
 e.g., from the text “April is the cruelest month” we get the 2-grams
(bigrams)
$a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es,
st, t$, $m, mo, on, nt, h$
$ is a special word boundary symbol
 Maintain a second inverted index from bigrams to dictionary terms that
match each bigram.
Sec. 3.2.2

Bigram index example
 The k-gram index finds terms based on a query consisting of k-grams
(here k=2).

$m → mace → madden
mo → among → amortize
on → along → among
Sec. 3.2.2

Processing wild-cards
 Query mon* can now be run as
$m AND mo AND on
 Gets terms that match the AND version of our wildcard query.
 But we’d enumerate moon.
 Must post-filter these terms against the query.
 Surviving enumerated terms are then looked up in the term-document
inverted index.
 Fast, space efficient (compared to permuterm).
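A sketch of a bigram index with the post-filtering step just described;
fnmatch from the standard library supplies the shell-style * matching used
for the final filter.

from collections import defaultdict
from fnmatch import fnmatch

def kgrams(term, k=2):
    augmented = "$" + term + "$"
    return {augmented[i:i + k] for i in range(len(augmented) - k + 1)}

def build_kgram_index(vocabulary, k=2):
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

def query_grams(query, k=2):
    # k-grams implied by the query; '$' marks only the real term boundaries
    s = query if query.startswith("*") else "$" + query
    s = s if s.endswith("*") else s + "$"
    grams = set()
    for piece in s.split("*"):
        grams.update(piece[i:i + k] for i in range(len(piece) - k + 1))
    return grams

def wildcard_match(query, index, vocabulary, k=2):
    candidates = set(vocabulary)
    for gram in query_grams(query, k):
        candidates &= index[gram]              # Boolean AND over posting sets
    return sorted(t for t in candidates if fnmatch(t, query))  # post-filter

vocab = ["money", "moon", "month"]
idx = build_kgram_index(vocab)
print(wildcard_match("mon*", idx, vocab))   # ['money', 'month'] (moon filtered)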
Sec. 3.2.2

Processing wild-card queries
 As before, we must execute a Boolean query for each enumerated,
filtered term.
 Wild-cards can result in expensive query execution (very large
disjunctions…)
 pyth* AND prog*
 If you encourage “laziness” people will respond!
 Type your search terms, use ‘*’ if you need to. E.g., Alex* will match
Alexander.
 Which web search engines allow wildcard queries?
Sec. 3.3

Spell correction
 Two principal uses
 Correcting document(s) being indexed
 Correcting user queries to retrieve “right” answers
 Two main flavors:
 Isolated word
 Check each word on its own for misspelling
 Will not catch typos resulting in correctly spelled words
 e.g., from → form
 Context-sensitive
 Look at surrounding words,
 e.g., I flew form Heathrow to Narita.
Sec. 3.3

Document correction
 Especially needed for OCR’ed documents
 Correction algorithms are tuned for this: e.g., rn → m
 Can use domain-specific knowledge
 E.g., OCR can confuse O and D more often than it would confuse O and
I (adjacent on the QWERTY keyboard, so more likely interchanged in
typing).
 But also: web pages and even printed material have typos
 Goal: the dictionary contains fewer misspellings
 But often we don’t change the documents and instead fix the
query-document mapping
Sec. 3.3

Query mis-spellings
 Our principal focus here
 E.g., the query Alanis Morisett
 We can either
 Retrieve documents indexed by the correct spelling, OR
 Return several suggested alternative queries with the correct spelling
 Did you mean … ?
Sec. 3.3.2

Isolated word correction
 Fundamental premise – there is a lexicon from which the correct
spellings come
 Two basic choices for this
 A standard lexicon such as
 Webster’s English Dictionary
 An “industry-specific” lexicon – hand-maintained
 The lexicon of the indexed corpus
 E.g., all words on the web
 All names, acronyms etc.
 (Including the mis-spellings)
Sec. 3.3.2

Isolated word correction
 Given a lexicon and a character sequence Q, return the words in the
lexicon closest to Q
 What’s “closest”?
 We’ll study several alternatives
 Edit distance (Levenshtein distance)
 Weighted edit distance
 n-gram overlap
Sec. 3.3.3

Edit distance
 Given two strings S1 and S2, the minimum number
of operations to convert one to the other
 Operations are typically character-level
 Insert, Delete, Replace, (Transposition)
 E.g., the edit distance from dof to dog is 1
 From cat to act is 2 (Just 1 with transpose.)
 from cat to dog is 3.
 Generally found by dynamic programming.
 See http://www.merriampark.com/ld.htm for a nice
example plus an applet.

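A sketch of the standard dynamic program (insert/delete/replace at unit
cost, without the transposition operation).

def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of s1[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace / match
    return d[m][n]

print(edit_distance("dof", "dog"))   # 1
print(edit_distance("cat", "act"))   # 2 (would be 1 with transposition)
print(edit_distance("cat", "dog"))   # 3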
Sec. 3.3.3

Weighted edit distance
 As above, but the weight of an operation depends on the character(s)
involved
 Meant to capture OCR or keyboard errors
 Example: m is more likely to be mis-typed as n than as q
 Therefore, replacing m by n is a smaller edit distance than by q
 This may be formulated as a probability model
 Requires a weight matrix as input
 Modify the dynamic programming to handle weights
Sec. 3.3.4

Using edit distances
 Given a query, first enumerate all character sequences within a preset
(weighted) edit distance (e.g., 2)
 Intersect this set with the list of “correct” words
 Show terms you found to the user as suggestions
 Alternatively,
 We can look up all possible corrections in our inverted index and return
all docs … slow
 We can run with a single most likely correction
 The alternatives disempower the user, but save a round of interaction
with the user
Sec. 3.3.4

Edit distance to all dictionary terms?
 Given a (mis-spelled) query – do we compute its edit distance to every
dictionary term?
 Expensive and slow
 Alternative?
 How do we cut the set of candidate dictionary terms?
 One possibility is to use n-gram overlap for this
 This can also be used by itself for spelling correction.
Sec. 3.3.4

n-gram overlap
 Enumerate all the n-grams in the query string as well as in the lexicon
 Use the n-gram index (recall wild-card search) to retrieve all lexicon
terms matching any of the query n-grams
 Threshold by number of matching n-grams
 Variants – weight by keyboard layout, etc.
Sec. 3.3.4

Example with trigrams
 Suppose the text is november
 Trigrams are nov, ove, vem, emb, mbe, ber.
 The query is december
 Trigrams are dec, ece, cem, emb, mbe, ber.
 So 3 trigrams overlap (of 6 in each term)
 How can we turn this into a normalized measure of overlap?
Sec. 3.3.4

One option – Jaccard coefficient
 A commonly-used measure of overlap
 Let X and Y be two sets; then the Jaccard coefficient is
J(X, Y) = |X ∩ Y| / |X ∪ Y|
 Equals 1 when X and Y have the same elements and zero when they
are disjoint
 X and Y don’t have to be of the same size
 Always assigns a number between 0 and 1
 Now threshold to decide if you have a match
 E.g., if J.C. > 0.8, declare a match
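A sketch applying the coefficient to the november/december trigram
example from the previous slide.

def trigrams(term):
    return {term[i:i + 3] for i in range(len(term) - 2)}

def jaccard(x, y):
    return len(x & y) / len(x | y)

nov, dec = trigrams("november"), trigrams("december")
print(sorted(nov & dec))             # ['ber', 'emb', 'mbe']
print(round(jaccard(nov, dec), 3))   # 3 shared of 9 total -> 0.333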
Sec. 3.3.4

Matching trigrams
 Consider the query lord – we wish to identify words matching 2 of its 3
bigrams (lo, or, rd)

lo → alone → lore → sloth
or → border → lore → morbid
rd → ardent → border → card

 A standard postings “merge” will enumerate …
 Adapt this to using the Jaccard (or another) measure.
Sec. 3.3.5

Context-sensitive spell correction
 Text: I flew from Heathrow to Narita.
 Consider the phrase query “flew form Heathrow”
 We’d like to respond
Did you mean “flew from Heathrow”?
because no docs matched the query phrase.
Sec. 3.3.5

Context-sensitive correction
 Need surrounding context to catch this.
 First idea: retrieve dictionary terms close (in
weighted edit distance) to each query term
 Now try all possible resulting phrases with one
word “fixed” at a time
 flew from heathrow
 fled form heathrow
 flea form heathrow
 Hit-based spelling correction: Suggest the
alternative that has lots of hits.

Sec. 3.3.5

Exercise
 Suppose that for “flew form Heathrow” we have 7 alternatives for flew,
19 for form and 3 for heathrow.
How many “corrected” phrases will we enumerate in this scheme?
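One way to count, assuming the scheme substitutes alternatives for one
word at a time while keeping the other two as typed: 7 + 19 + 3 = 29
phrases. Trying every combination of alternatives instead would enumerate
7 × 19 × 3 = 399 phrases.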
Sec. 3.3.5

General issues in spell correction
 We enumerate multiple alternatives for “Did you mean?”
 Need to figure out which to present to the user
 The alternative hitting most docs
 Query log analysis
 More generally, rank alternatives probabilistically:
argmax_corr P(corr | query)
 From Bayes’ rule, this is equivalent to
argmax_corr P(query | corr) * P(corr)
where P(query | corr) is the noisy channel and P(corr) is the language
model.
Sec. 3.4

Soundex
 Class of heuristics to expand a query into phonetic equivalents
 Language specific – mainly for names
 E.g., chebyshev → tchebycheff
 Invented for the U.S. census … in 1918
Sec. 3.4

Soundex – typical algorithm
 Turn every token to be indexed into a 4-character reduced form
 Do the same with query terms
 Build and search an index on the reduced forms (when the query calls
for a soundex match)
 http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top
Sec. 3.4

Soundex – typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero):
   'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
   B, F, P, V → 1
   C, G, J, K, Q, S, X, Z → 2
   D, T → 3
   L → 4
   M, N → 5
   R → 6
Sec. 3.4

Soundex continued
4. Remove one of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four
positions, which will be of the form <uppercase letter> <digit> <digit>
<digit>.

E.g., Herman becomes H655.
Will hermann generate the same code?
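A sketch of the four-step reduction above, assuming plain ASCII names.

CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit
for ch in "AEIOUHWY":
    CODES[ch] = "0"          # step 2: vowels and H, W, Y become zero

def soundex(word):
    word = word.upper()
    digits = [CODES.get(ch, "0") for ch in word]   # step 3
    code = word[0]                                 # step 1: keep first letter
    for i in range(1, len(digits)):
        # step 4: skip the second of two consecutive identical digits;
        # step 5: skip zeros entirely
        if digits[i] != "0" and digits[i] != digits[i - 1]:
            code += digits[i]
    return (code + "000")[:4]                      # step 6: pad, keep 4 chars

print(soundex("Herman"))    # H655
print(soundex("Hermann"))   # H655 (same code)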
Sec. 3.4

Soundex
 Soundex is the classic algorithm, provided by most databases (Oracle,
Microsoft, …)
 How useful is soundex?
 Not very – for information retrieval
 Okay for “high recall” tasks (e.g., Interpol), though biased to names of
certain nationalities
 Zobel and Dart (1996) show that other algorithms for phonetic matching
perform much better in the context of IR
