
Chapter 5: Information Retrieval and Web Search

An introduction
Introduction
- Text mining refers to data mining using text documents as data.
- Most text mining tasks use Information Retrieval (IR) methods to pre-process text documents.
- These methods are quite different from the traditional data pre-processing methods used for relational tables.
- Web search also has its roots in IR.



Information Retrieval (IR)
- Conceptually, IR is the study of finding needed information, i.e., IR helps users find information that matches their information needs.
  - Needs are expressed as queries.
- Historically, IR is about document retrieval, emphasizing the document as the basic unit.
  - Finding documents relevant to user queries.
- Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.



IR architecture (figure not reproduced)



IR queries

- Keyword queries
- Boolean queries (using AND, OR, NOT)
- Phrase queries
- Proximity queries
- Full document queries
- Natural language questions



Information retrieval models
- An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined.
- Main models:
  - Boolean model
  - Vector space model
  - Statistical language model
  - etc.



Boolean model
- Each document or query is treated as a "bag" of words or terms. Word sequence is not considered.
- Given a collection of documents D, let V = {t1, t2, ..., t|V|} be the set of distinctive words/terms in the collection. V is called the vocabulary.
- A weight wij > 0 is associated with each term ti of a document dj ∈ D; for a term that does not appear in document dj, wij = 0. In the Boolean model the weights are binary: wij = 1 if ti appears in dj, and 0 otherwise. Each document is thus represented as a vector:

  dj = (w1j, w2j, ..., w|V|j)
Boolean model (contd)
- Query terms are combined logically using the Boolean operators AND, OR, and NOT.
  - E.g., ((data AND mining) AND (NOT text))
- Retrieval
  - Given a Boolean query, the system retrieves every document that makes the query logically true.
  - This is called exact match.
- The retrieval results are usually quite poor because term frequency is not considered.
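
A minimal sketch of Boolean retrieval in Python (the toy corpus and variable names are illustrative, not from the slides):

```python
docs = {
    1: "data mining and text mining",
    2: "data mining of relational tables",
    3: "text retrieval and web search",
}

# Build term -> set of doc ids (an inverted index with binary weights).
postings = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        postings.setdefault(term, set()).add(doc_id)

all_ids = set(docs)
# Evaluate ((data AND mining) AND (NOT text)) with set operations:
# AND -> intersection, OR -> union, NOT -> complement.
result = (postings["data"] & postings["mining"]) & (all_ids - postings["text"])
print(sorted(result))  # [2] -- an exact match; term frequency plays no role
```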



Vector space model
- Documents are also treated as a "bag" of words or terms.
- Each document is represented as a vector.
- However, the term weights are no longer 0 or 1. Each term weight is computed based on some variation of the TF or TF-IDF scheme.
- Term Frequency (TF) Scheme: The weight of a term ti in document dj is the number of times that ti appears in dj, denoted by fij. Normalization may also be applied, e.g., dividing by the frequency of the most frequent term in dj (one common variant):

  tfij = fij / max{f1j, f2j, ..., f|V|j}



TF-IDF term weighting scheme
- The most well-known weighting scheme.
- TF: still the (normalized) term frequency tfij.
- IDF: inverse document frequency,

  idfi = log(N / dfi)

  where N is the total number of documents and dfi is the number of documents in which ti appears.
- The final TF-IDF term weight is:

  wij = tfij × idfi



Retrieval in vector space model
- Query q is represented in the same way or slightly differently.
- Relevance of di to q: compare the similarity of query q and document di.
- Cosine similarity (the cosine of the angle between the two vectors):

  cosine(q, di) = (q · di) / (||q|| × ||di||)

- Cosine is also commonly used in text clustering.


An Example
- A document space is defined by three terms:
  - hardware, software, users
  - this is the vocabulary
- A set of documents is defined as:
  - A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
  - A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
  - A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
- If the query is "hardware and software",
  - which documents should be retrieved?



An Example (cont.)
- In Boolean query matching:
  - documents A4, A7 will be retrieved ("AND")
  - retrieved: A1, A2, A4, A5, A6, A7, A8, A9 ("OR")
- In similarity matching (cosine):
  - q = (1, 1, 0)
  - S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
  - S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
  - S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
- Document retrieved set (with ranking) = {A4, A7, A1, A2, A5, A6, A8, A9}
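
These scores are easy to check in Python (a small verification sketch; the vectors mirror the slide):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

docs = {"A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
        "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
        "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1)}
q = (1, 1, 0)  # the query "hardware and software"

for name, vec in docs.items():
    print(name, round(cosine(q, vec), 2))
# A4 scores 1.0, A7 0.82, A1 and A2 0.71 -- matching the ranking above.
```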



Okapi relevance method
- Another way to assess the degree of relevance is to directly compute a relevance score for each document with respect to the query.
- The Okapi method (BM25) and its variations are popular techniques in this setting.
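
The slide's scoring formula is not reproduced here; below is a sketch of one widely used Okapi/BM25 variant (the smoothed IDF and the default k1 and b values are common choices, not taken from the slides):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N, k1=1.2, b=0.75):
    # doc_tf: term -> frequency in this document; df: term -> number of
    # documents containing the term; N: number of documents in the collection.
    score = 0.0
    for t in query_terms:
        f = doc_tf.get(t, 0)
        if f == 0:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)  # smoothed IDF
        # Term frequency saturates via k1; b controls document-length
        # normalization against the average document length.
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```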



Relevance feedback
- Relevance feedback is one of the techniques for improving retrieval effectiveness. The steps:
  - the user first identifies some relevant (Dr) and irrelevant (Dir) documents in the initial list of retrieved documents
  - the system expands the query q by extracting additional terms from the sample relevant and irrelevant documents to produce qe
  - a second round of retrieval is performed with qe.
- Rocchio method (α, β and γ are parameters):

  qe = α·q + (β/|Dr|) · Σ_{dr ∈ Dr} dr − (γ/|Dir|) · Σ_{dir ∈ Dir} dir
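
A minimal numpy sketch of this update (the parameter defaults are common choices, not values from the slides):

```python
import numpy as np

def rocchio_expand(q, rel_docs, irrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # q is the query vector; rel_docs and irrel_docs are lists of document
    # vectors (Dr and Dir) over the same vocabulary.
    qe = alpha * np.asarray(q, dtype=float)
    if rel_docs:
        qe += beta * np.mean(rel_docs, axis=0)
    if irrel_docs:
        qe -= gamma * np.mean(irrel_docs, axis=0)
    return qe
```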



Rocchio text classifier
- In fact, a variation of the Rocchio method above, called the Rocchio classification method, can be used to improve retrieval effectiveness too
  - so can other machine learning methods. Why?
- A Rocchio classifier is constructed by producing a prototype vector ci for each class i (relevant or irrelevant in this case); one common form is:

  ci = (α/|Di|) · Σ_{d ∈ Di} d/||d|| − (β/|D − Di|) · Σ_{d ∈ D − Di} d/||d||

  where Di is the set of training documents in class i and D is the whole training set.
- In classification, cosine similarity is used.



Text pre-processing
- Word (term) extraction: easy
- Stopword removal
- Stemming
- Frequency counts and computing TF-IDF term weights



Stopwords removal
- Many of the most frequently used words in English are useless in IR and text mining; these words are called stopwords.
  - the, of, and, to, ...
  - Typically there are about 400 to 500 such words.
  - For an application, an additional domain-specific stopword list may be constructed.
- Why do we need to remove stopwords?
  - Reduce indexing (or data) file size
    - stopwords account for 20-30% of total word counts.
  - Improve efficiency and effectiveness
    - stopwords are not useful for searching or text mining
    - they may also confuse the retrieval system.



Stemming
- Techniques used to find the root/stem of a word. E.g.,
  - user, users, used, using → stem: use
  - engineering, engineered, engineer → stem: engineer
- Usefulness:
  - improving effectiveness of IR and text mining
    - matching similar words
    - mainly improves recall
  - reducing indexing size
    - combining words with the same root may reduce indexing size by as much as 40-50%.



Basic stemming methods
- Using a set of rules. E.g.,
  - remove endings:
    - if a word ends with a consonant other than s, followed by an s, then delete the s.
    - if a word ends in es, drop the s.
    - if a word ends in ing, delete the ing unless the remaining word consists of only one letter or is "th".
    - if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
    - ...
  - transform words:
    - if a word ends with "ies" but not "eies" or "aies", then replace "ies" with "y".
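
A sketch of these example rules in Python (illustrative only; practical rule-based stemmers such as Porter's are considerably more elaborate):

```python
VOWELS = set("aeiou")

def simple_stem(word):
    w = word.lower()
    # transform: "ies" -> "y" unless the word ends in "eies" or "aies"
    if w.endswith("ies") and not (w.endswith("eies") or w.endswith("aies")):
        return w[:-3] + "y"
    # a word ending in "es" drops the s
    if w.endswith("es"):
        return w[:-1]
    # a consonant other than s, followed by s: delete the s
    if w.endswith("s") and len(w) >= 2 and w[-2] not in VOWELS and w[-2] != "s":
        return w[:-1]
    # delete "ing" unless only one letter or "th" would remain
    if w.endswith("ing") and len(w[:-3]) > 1 and w[:-3] != "th":
        return w[:-3]
    # delete "ed" preceded by a consonant, unless a single letter remains
    if w.endswith("ed") and len(w[:-2]) > 1 and w[-3] not in VOWELS:
        return w[:-2]
    return w

print([simple_stem(w) for w in ["flies", "cats", "engineering", "engineered"]])
# -> ['fly', 'cat', 'engineer', 'engineer']
```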



Frequency counts + TF-IDF
- Count the number of times a word occurs in a document.
  - Occurrence frequencies indicate the relative importance of a word in a document:
  - if a word appears often in a document, the document likely "deals with" subjects related to the word.
- Count the number of documents in the collection that contain each word.
- With these two counts, TF-IDF can be computed.
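
Putting the counts and the earlier formulas together, a minimal sketch over a toy corpus (the corpus is illustrative):

```python
import math
from collections import Counter

docs = ["data mining mining text", "web search and data", "text retrieval"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: the number of documents containing each term.
df = Counter(t for tokens in tokenized for t in set(tokens))

def tf_idf(tokens):
    counts = Counter(tokens)
    max_f = max(counts.values())
    # tfij = fij / max f;  idfi = log(N / dfi);  wij = tfij * idfi
    return {t: (f / max_f) * math.log(N / df[t]) for t, f in counts.items()}

print(tf_idf(tokenized[0]))
# 'mining' gets the largest weight: frequent in this document, rare elsewhere.
```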



Evaluation: Precision and Recall
- Given a query:
  - Are all retrieved documents relevant?
  - Have all the relevant documents been retrieved?
- Measures for system performance:
  - The first question is about the precision of the search: the fraction of the retrieved documents that are relevant.
  - The second is about the completeness (recall) of the search: the fraction of all relevant documents that are retrieved.



Precision-recall curve (figure not reproduced)



Compare different retrieval algorithms (figure not reproduced)



Compare with multiple queries
- Compute the average precision at each recall level.
- Draw precision-recall curves.
- Do not forget the F-score evaluation measure: the harmonic mean F = 2PR / (P + R) of precision P and recall R.



Rank precision
- Compute the precision values at selected rank positions.
- Mainly used in Web search evaluation.
- For a Web search engine, we can compute precisions for the top 5, 10, 15, 20, 25 and 30 returned pages
  - as the user seldom looks at more than 30 pages.
- Recall is not very meaningful in Web search.
  - Why?
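
A one-function sketch of rank (top-k) precision; the document ids are illustrative:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    # Fraction of the top-k returned pages that are relevant.
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

print(precision_at_k(["d3", "d1", "d7", "d2", "d9"], {"d1", "d2"}, 5))  # 0.4
```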
Web Search as a huge IR system
- A Web crawler (robot) crawls the Web to collect all the pages.
- Servers build a huge inverted index database and other indexing databases.
- At query (search) time, search engines conduct different types of vector query matching.



Inverted index
- The inverted index of a document collection is basically a data structure that
  - attaches to each distinctive term a list of all the documents that contain the term.
- Thus, in retrieval, it takes constant time to
  - find the documents that contain a query term.
  - Multiple query terms are also easy to handle, as we will see soon.



An example (figure not reproduced)



Index construction
- Easy! See the example; a minimal construction sketch is shown below.
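
Since the example figure is not reproduced here, a minimal construction sketch (toy corpus; storing (doc id, term frequency) pairs is one common postings layout):

```python
from collections import Counter, defaultdict

docs = {
    1: "web mining is useful",
    2: "usage mining applications",
    3: "web structure mining studies the web hyperlink structure",
}

index = defaultdict(list)
for doc_id in sorted(docs):  # visiting ids in order keeps each postings list sorted
    for term, freq in Counter(docs[doc_id].split()).items():
        index[term].append((doc_id, freq))

print(index["mining"])  # [(1, 1), (2, 1), (3, 1)]
print(index["web"])     # [(1, 1), (3, 2)]
```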



Search using inverted index
Given a query q, search has the following steps:
- Step 1 (vocabulary search): find each term/word of q in the inverted index.
- Step 2 (results merging): merge the results to find the documents that contain all or some of the words/terms in q.
- Step 3 (rank score computation): rank the resulting documents/pages using
  - content-based ranking
  - link-based ranking
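
A sketch of steps 1 and 2, reusing the index built in the previous sketch and scoring each document by the number of query terms it contains (step 3 would refine this ranking):

```python
from collections import Counter

def search(query, index):
    scores = Counter()
    for term in query.split():                # step 1: vocabulary search
        for doc_id, freq in index.get(term, []):
            scores[doc_id] += 1               # step 2: merge postings lists
    return [doc_id for doc_id, _ in scores.most_common()]

print(search("web mining", index))  # documents containing both terms rank first
```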



Different search engines
- The real differences among different search engines are
  - their index weighting schemes
    - including the location of terms, e.g., title, body, emphasized words, etc.
  - their query processing methods (e.g., query classification, expansion, etc.)
  - their ranking algorithms
- Few of these are published by any of the search engine companies. They are tightly guarded secrets.



Summary
- We only give a VERY brief introduction to IR. There are a large number of other topics, e.g.,
  - Statistical language models
  - Latent semantic indexing (LSI and SVD)
  - (read an IR book or take an IR course)
- Many other interesting topics are not covered, e.g.,
  - Web search
  - Index compression
  - Ranking: combining contents and hyperlinks
  - Web page pre-processing
  - Combining multiple rankings and meta search
  - Web spamming
- Want to know more? Read the textbook.

