0% found this document useful (0 votes)

59 views77 pages

Introduction to Information Retrieval

This document provides an introduction to information retrieval (IR). It begins with an overview of the InfoSense research group and outlines some key IR textbooks. The document then discusses IR at both a broad and narrow scope. In the broad sense, IR is a discipline aimed at finding information people want using various techniques like classification, clustering, and question answering. In the narrow sense, IR specifically focuses on searching large document collections in response to queries. Finally, the document outlines the core components of IR systems including preprocessing, indexing, retrieval, and evaluation.

Uploaded by

lekha.cce

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

59 views77 pages

Introduction to Information Retrieval

Uploaded by

lekha.cce

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 77

Information Retrieval:

An Introduction
Dr. Grace Hui Yang
InfoSense
Department of Computer Science
Georgetown University, USA
[email protected]
Jan 2019 @ Cape Town 1
A Quick Introduction
• What do we do at InfoSense
• Dynamic Search
• IR and AI
• Privacy and IR
• Today’s lecture is on IR fundamentals
• Textbooks and some of their slides are referenced and used here
• Modern Information Retrieval: The Concepts and Technology behind Search. by Ricardo Baeza-Yates,
Berthier Ribeiro-Neto. Second condition. 2011.
• Introduction to Information Retrieval. C.D. Manning, P. Raghavan, H. Schütze. Cambridge UP, 2008.
• Foundations of Statistical Natural Language Processing. Christopher D. Manning and Hinrich Schütze.
• Search Engines: Information Retrieval in Practice. W. Bruce Croft, Donald Metzler, and Trevor Strohman.
2009.
• Personal views are also presented here
• Especially in the Introduction and Summary sections

2
Outline
• What is Information Retrieval
• Task, Scope, Relations to other disciplines
• Process
• Preprocessing, Indexing, Retrieval, Evaluation, Feedback
• Retrieval Approaches
• Boolean
• Vector Space Model
• BM25
• Language Modeling
• Summary
• What works
• State-of-the-art retrieval effectiveness
• Relation to the learning-based approaches

3
What is Information Retrieval (IR)?
• Task: To find a few among many
• It is probably motivated by the situation of information overload and
acts as a remedy to it
• When defining IR, we need to be aware that there is a broad sense
and a narrow sense

4
Broad Sense of IR
• It is a discipline that finds information that people want
• The motivation behind would include
• Humans’ desire to understand the world and to gain knowledge
• Acquire sufficient and accurate information/answer to accomplish a task
• Because finding information can be done in so many different ways, IR would involve:
• Classification (Wednesday lecture by Fraizio Sabastiani and Alejandro Mereo))
• Clustering
• Recommendation
• Social network
• Interpreting natural languages (Wednesday lecture by Fraizio Sabastiani and Alejandro Mereo))
• Question answering
• Knowledge bases
• Human-computer interaction (Friday lecture by Rishabh Mehrotra)
• Psychology, Cognitive Science, (Thursday lecture by Joshua Kroll), …
• Any topic that listed on IR conferences such as SIGIR/ICTIR/CHIIR/CIKM/WWW/WSDM…

5
Narrow Sense of IR
• It is ‘search’
• Mostly searching for documents
• It is a computer science discipline that designs and implements
algorithms and tools to help people find information that they want
• from one or multiple large collections of materials (text or multimedia,
structured or unstructured, with or without hyperlinks, with or without
metadata, in a foreign language or not – Monday Lecture Multilingual IR by
Doug Oard),
• where people can be a single user or a group
• who initiate the search process by an information need,
• and, the resulting information should be relevant to the information need
(based on the judgement by the person who starts the search)

6
Narrowest Sense of IR
• It helps people find relevant documents
• from one large collection of material (which is the Web or a TREC collection),
• where there is a single user,
• who initiates the search process by a query driven by an information need,
• and, the resulting documents should be ranked (from the most relevant to the
least) and returned in a list

7
Players in Information Retrieval

Corpus

Information
User Metric
Need

Results

8
A Brief Historical Line of Information Retrieval
8

0
1940s 1950s 1960s 1970s 1980s 1990s 2000 2005 2010 2015 2020
Memex Vector Space Model Probabilistic Theory Okapi BM25 TREC LM
Learning to Rank Deep Learning QA Filtering Query User

9
Relationships to Sister Disciplines
Solid line: transformations or special cases
Dashed line: overlap with

AI
Recommendation

Non-exhaustive search
Human issued queries;
DB Supervised

file
tab

pro
ul a
Un ted d ata ML

ser
str
u
da
ta; i ng
in da
ta

tu
ctu t ra n g
Bo of i ni

bu
red ol e e ra
s ot
da ;u

ry
ta; an n ; n

ue
qu e el s
NL eri d riv d

q
es ta - o
q dm

No
ue
rie Da t e
s -c raf
er t
Ex p
Understanding of data; Semantics Large scale; use of algorithms Library
NLP IR
Controlled vocabulary; browsing
Science
Loss of semantics; only counting terms
Inter
nts activ
e cted

User-centered study
cu m a e; co
o t r mple
a d of d e rs ex Singl x info
ste w e ite rmat
e rs in re a ns ratio
n ion n
nsw o eed
s a e p bef s
urn e st
Ret d i at
te rme
In Information
QA Seeking; IS
HCI 10
Outline
• What is Information Retrieval
• Task, Scope, Relations to other disciplines
• Process
• Preprocessing, Indexing, Retrieval, Evaluation, Feedback
• Retrieval Approaches
• Boolean
• Vector Space Model
• BM25
• Language Modeling
• Summary
• What works
• State-of-the-art retrieval effectiveness
• Relations to the learning-based approaches

11
Process of Information Retrieval
Information Document
Need Representation Corpus

Query Representation Indexing

Retrieval
Models Index

Retrieval Results

Evaluation/
Feedback
12
Terminology
• Query: text to represent an information need
• Document: a returned item in the index
• Term/token: a word, a phrase, an index unit
• Vocabulary: set of the unique tokens
• Corpus/Text collection
• Index/database: index built for a corpus
• Relevance feedback: judgment from human
• Evaluation Metrics: how good is a search system?
• Precision, Recall, F1

13
Document Retrieval Process
Information Document
Need Representation Corpus

Indexing
Query Representation

Querying Retrieval
Models Index

Retrieval Results

Evaluation/
Feedback
14
From Information Need to Query
Get rid of mice in a politically
TASK
correct way

Info Need Info about removing mice

without killing them

Verbal form
How do I trap mice alive?

Query mouse trap

15
Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma. Chap 1
Document Retrieval Process
Information Document
Need Representation Corpus

Indexing
Query Representation

Retrieval
Models Index
Indexing
Retrieval Results

Evaluation/
Feedback 16
Sec. 1.2

Inverted index construction

Documents to Friends, Romans, countrymen.
be indexed
Tokenizer
Tokens Friends Romans Countrymen

Linguistic modules
friend roman countryman
Normalized tokens
Indexer friend 2 4
roman 1 2
Inverted index
countryman 13 17 16
Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma. Ch 1
Sec. 1.2

An Index
• Sequence of (Normalized token, Document ID) pairs.

Doc 1 Doc 2

I did enact Julius So let it be with

Caesar I was killed Caesar. The noble
i' the Capitol; Brutus hath told you
Brutus killed me. Caesar was ambitious

18
Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma. Chap 1
Document Retrieval Process
Information Document
Need Representation Corpus

Query Representation Indexing

Retrieval
Models Index

Retrieval Results

Evaluation Evaluation/
Feedback 19
Evaluation
• Implicit (clicks, time spent) vs. Explicit (yes/no, grades)
• Done by the same user or by a third party (TREC-style)
• Judgments can be binary (Yes/No) or graded
• Assuming ranked or not
• Dimensions under consideration
• Relevance (Precision, nDCG)
• Novelty/diversity
• Usefulness
• Effort/cost
• Completeness/coverage (Recall)
• Combinations of some of the above (F1), and many more
• Relevance is the main consideration. It means
• If a document (a result) can satisfy the information need
• If a document contains the answer to my query
• The evaluation lecture (Tuesday by Nicola Ferror and Maria Maistro) will share much more
20
interesting details
Document Retrieval Process
Information Document
Need Representation Corpus

Query Representation Indexing

Retrieval
Algorithms Index

Retrieval
Retrieval Results

Evaluation/
Feedback
21
Outline
• What is Information Retrieval
• Task, Scope, Relations to other disciplines
• Process
• Preprocessing, Indexing, Retrieval, Evaluation, Feedback
• Retrieval Approaches
• Boolean
• Vector Space Model
• BM25
• Language Modeling
• Summary
• What works
• State-of-the-art retrieval effectiveness
• Relations to the learning-based approaches

22
How to find relevant documents for a query?

• By keyword matching
• boolean model
• By similarity
• vector space model
• By imaging how to write out a query
• how likely a query is written with this document in mind
• generate with some randomness
• query generation language model
• By trusting how other web pages think about the web page
• pagerank, hits
• By trusting how other people find relevant documents for the same/similar query
• Learning to rank

23
Sec. 1.3

Boolean Retrieval
• Views each document as a set of words
• Boolean Queries use AND, OR and NOT to join query terms
• Simple SQL-like queries
• Sometimes with weights attached to each component
• It is like exact match: document matches condition or not
• Perhaps the simplest model to build an IR system
• Many current search systems are still using Boolean
• Professional searchers who want to under control of the search process
• e.g. doctors and lawyers write very long and complex queries with Boolean
operators
24
Summary: Boolean Retrieval
• Advantages:
• Users are under control of the search results
• The system is nearly transparent to the user
• Disadvantages:
• Only give inclusion or exclusion of docs, not rankings
• Users would need to spend more effort in manually examining the returned
sets; sometimes it is very labor intensive
• No fuzziness allowed so the user must be very precise and good at writing
their queries
• However, in many cases users start a search because they don’t know the answer
(document)

25
Ranked Retrieval
• Often we want to rank results
• from the most relevant to the least relevant
• Users are lazy
• maybe only look at the first 10 results
• A good ranking is important
• Given a query q, and a set of documents D, the task is to rank those
documents based on a ranking score or relevance score:
• Score (q,di) in the range of [0,1]
• from the most relevant to the least relevant
• A lot of IR research is about to determine score (q,di)
26
Vector Space Model

27
Sec. 6.3

Vector Space Model

• Treat the query as a tiny document
• Represent the query and every document each as a word vector
in a word space
• Rank documents according to their proximity to the query in the
word space

28
Sec. 6.3

Represent Documents in a Space of Word Vectors

Suppose the corpus only has two
words: ’Jealous’ and ‘Gossip’

They form a space of “Jealous” and

“Gossip”

d1: gossip gossip jealous

gossip gossip gossip gossip
gossip gossip gossip gossip
d2: gossip gossip jealous
gossip gossip gossip gossip
gossip gossip gossip jealous
jealous jealous jealous jealous
jealous jealous gossip jealous q: gossip gossip jealous
gossip gossip gossip gossip
d3: jealous gossip jealous gossip jealous jealous
jealous jealous jealous jealous jealous jealous
jealous jealous jealous jealous Adapted from textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma.29Chap 6
Euclidean Distance
• If if p = (p1, p2,..., pn) and q = (q1, q2,..., qn) are two points in the
Euclidean space, their Euclidean distance is

30
Sec. 6.3
In a space of ‘Jealous’ and ‘Gossip’
Here, if you look at the content (or we say
the word distributions) of each
document, d2 is actually the most similar
document to q

However, d2 produces a bigger Eclidean

distance score to q
d1: gossip gossip jealous
gossip gossip gossip gossip
gossip gossip gossip gossip
d2: gossip gossip jealous
gossip gossip gossip gossip
gossip gossip gossip jealous
jealous jealous jealous jealous q: gossip gossip jealous
jealous jealous gossip jealous gossip gossip gossip gossip
d3: jealous gossip jealous gossip jealous jealous
jealous jealous jealous jealous jealous jealous
31
jealous jealous jealous jealous
Adapted from textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma. Chap 6
Sec. 6.3

Use angle instead of distance

• Short query and long documents will
always have big Euclidean distance
• Key idea: Rank documents according
to their angles with query
• The angle between similar vectors is
small, between dissimilar vectors is
large
• This is equivalent to perform a
document length normalization

32
Adapted from textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma. Chap 6
Sec. 6.3

Cosine Similarity

qi is the tf-idf weight of term i in the query

di is the tf-idf weight of term i in the document

cos(q,d) is the cosine similarity of q and d … or,

equivalently, the cosine of the angle between q and d. 33
Exercise: Cosine Similarity
Consider two documents D1, D2 and a query Q, which
document is more similar to the query?

D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2),

Q = (1.5, 1.0, 0)

Example from textbook “Search Engines: Information Retrieval in Practice” Chap 7 34

Answers:

35
Answers:
Consider two documents D1, D2 and a query Q
D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)

Example from textbook “Search Engines: Information Retrieval in Practice” Chap 7 36

What are those numbers in a vector?

• They are term weights

• They are used to indicate the importance of a term in a document

37
Term Frequency
• How many times a term appears in a document

38
• Some terms are common,
• less common than the stop words
• but still quite common
• e.g. “Information Retrieval” is uniquely important in NBA.com

• e.g. “Information Retrieval” appears at too many pages in SIGIR web site, so it is not a
very important term in those pages.

• How to discount their term weights?

39
Sec. 6.2.1

Inverse Document Frequency (idf)

• dft is the document frequency of t
• the number of documents that contain t
• it inversely measures how informative a term is
• The IDF of a term t is defined as

• Log is used here to “dampen” the effect of idf.

• N is the total number of documents
• Note it is a property of the term and it is query independent

40
Sec. 6.2.2

tf-idf weighting
• Product of a term’s tf weight and idf weight regarding a document

• Best known term weighting scheme in IR

• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
• Note: term frequency takes two inputs (the term and the document) while IDF
only takes one (the term)

41
Sec. 6.4

tf-idf weighting has many variants

42
Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma. Chap 6
Sec. 6.4

Standard tf-idf weighting scheme: Lnc.ltc

• A very standard weighting scheme is: lnc.ltc
• Document:
• L: logarithmic tf (l as first character)
• N: no idf
• C: cosine normalization
• Query:
• L: logarithmic tf (l in leftmost column)
• t: idf (t in second column)
• C: cosine normalization …
• Note: here the weightings differ in queries and in documents

43
Summary: Vector Space Model
• Advantages
• Simple computational framework for ranking documents given a query
• Any similarity measure or term weighting scheme could be used
• Disadvantages
• Assumption of term independence
• Ad hoc

44
BM25

45
The (Magical) Okapi BM25 Model
• BM25 is one of the most successful retrieval models
• It is a special case of the Okapi models
• Its full name is Okapi BM25
• It considers the length of documents and uses it to normalize the
term frequency
• It is virtually a probabilistic ranking algorithm though it looks very ad-
hoc
• It is intended to behave similarly to a two-Poisson model
• We will talk about Okapi in general

46
What is Behind Okapi?
• [Robertson and Walker 94 ]
• A two-Poisson document-likelihood Language model
• Models within-document term frequencies by means of a mixture of two Poisson
distributions
• Hypothesize that occurrences of a term in a document have a random or
stochastic element
• It reflects a real but hidden distinction between those documents which are “about” the concept
represented by the term and those which are not.
• Documents which are “about” this concept are described as “elite” for the term.
• Relevance to a query is related to eliteness rather than directly to term
frequency, which is assumed to depend only on eliteness.

47
Two-Poisson Model
• Term weight for a term t:

Figure adapted from “Search Engines: Information Retrieval in Practice” Chap 7

where lambda and mu are the Poisson means for tf
In the elite and non-elite sets for t

p’ = P(document elite for t| R)

q’ = P(document elite for t| NR)

48
Characteristics of Two-Poisson Model
• It is zero for tf=0;
• It increases monotonically with tf;
• but to an asymptotic maximum;
• The maximum approximates to the Robertson/Sparck-Jones weight
that would be given to a direct indicator of eliteness.

p = P(term present| R)
q = P(term present| NR)
49
Constructing a Function
• Constructing a function
• Such that tf/(constant + tf) increases from 0 to an asymptotic maximum
• A rough estimation of 2-poisson

Approximated term weight Robertson/Sparck-Jones weight;

Becomes the idf component of Okapi

constant
tf component of Okapi
50
Okapi Model
• The complete version of Okapi BMxx models

idf (Robertson-Sparck Jones weight) tf user related weight

Original Okapi: k1 = 2, b=0.75, k3 = 0

BM25: k1 = 1.2, b=0.75, k3 = a number from 0 to 1000
51
Exercise: Okapi BM25
• Query with two terms, “president lincoln”, (qtf = 1)
• No relevance information (r and R are zero)
• N = 500,000 documents
• “president” occurs in 40,000 documents (df1 = 40, 000)
• “lincoln” occurs in 300 documents (df2 = 300)
• “president” occurs 15 times in the doc (tf1 = 15)
• “lincoln” occurs 25 times in the doc (tf2 = 25)
• document length is 90% of the average length (dl/avdl = .9)
• k1 = 1.2, b = 0.75, and k3 = 100
• K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11

Example from textbook “Search Engines: Information Retrieval in Practice” Chap 7 52

Answer:

53
Answer: Okapi BM25

Example from textbook “Search Engines: Information Retrieval in Practice” Chap 7 54

Effect of term frequencies in BM25

Textbook slides from “Search Engines: Information Retrieval in Practice” Chap 7 55

Language Modeling

56
Using language models in IR
§ Each document is treated as (the basis for) a language model
§ Given a query q, rank documents based on P(d|q)

§ P(q) is the same for all documents, so ignore

§ P(d) is the prior – often treated as the same for all d
§ But we can give a prior to high-quality documents, e.g., those with high PageRank.
§ P(q|d) is the probability of q given d
§ Ranking according to P(q|d) and P(d|q) is equivalent

57
Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma.
Query-likelihood LM
Document Query
Language Model Likelihood
d1 θ d1 p (q | q d1 ) q
p(q | q d 2 )
d2 θd2 p(q | q d N )

dN θdN
• Scoring documents with query likelihood
• Known as the language modeling (LM) approach to IR

58
Adapted from Mei, Fang and Zhai‘s “A study of poison query generation model in IR”
A different language model for each document

String = frog said that toad likes frog STOP

P(string|Md1 ) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.02 = 0.0000000000048 = 4.8 · 10-12
P(string|Md2 ) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 · 0.02 = 0.0000000000120 = 12 · 10-12 P(string|Md1 ) < P(string|Md2 )
Thus, document d2 is more relevant to the string frog said that toad likes frog STOP than d1 is.

59
Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma.
Binomial Distribution
• Discrete
• Series of trials with only two outcomes, each trial being independent
from all the others
• Number r of successes out of n trials given that the probability of
success in any trial isq :
ænö r
b(r; n, q ) = çç ÷÷q (1 - q ) n - r
èrø

60
Multinomial Distribution
• The multinomial distribution is a generalization of the binomial distribution.

• The binomial distribution counts successes of an event (for example, heads in coin
tosses).
• The parameters:
– N (number of trials)
– q (the probability of success of the event)

• The multinomial counts the number of a set of events (for example, how many times
each side of a die comes up in a set of rolls).
– The parameters:
– N (number of trials)
– q1..q k (the probability of success for each category)

61
Each is estimated by Maximum Likelihood Estimation (MLE)

Multinomial Distribution
• W1,W2,..Wk are variables Number of possible orderings of N balls

N!
P(W1 = n1 ,..., W1 = nk | N , q 1 ,.., q k ) = q1n1q 2 n2 ..q k nk
n1 !n2 !..nk !

k k order invariant selections

ån
i =1
i =N åq i =1
i =1

A binomial distribution is the multinomial distribution with k=2 and q1 ,q 2 = 1 - q 2

Assume events (terms being generated ) are independent
62
Multi-Bernoulli vs. Multinomial
Multi-Bernoulli:
Flip a coin for each word
Doc: d text mining … model Query q:
text “text mining”
H H T
mining
model
clustering p ( q | d ) = Õ p ( w = 1 | d )Õ p ( w = 0 | d )
wÎq wÏq
text
Multinomial:
model
Roll a dice to choose a word
text
Query q:
… text text
“text mining”

model
mining
mining

|V |
p(q | d ) = Õ p( w j | d )
c ( w j ,q )

j =1
63
Adapted from Mei, Fang and Zhai‘s “A study of poison query generation model in IR”
Issue
§ Issue: a single t with P(t|Md) = 0 will make zero
§ Smooth the estimates to avoid zeros

64
Dirichlet Distribution & Conjugate Prior
• If the prior and the posterior are the same distribution, the prior is
called a conjugate prior for the likelihood

• The Dirichlet distribution is the conjugate prior for the multinomial,

just as beta is conjugate prior for the binomial.

Gamma function
65
Dirichlet Smoothing
• Let s say the prior for q1 ,..,q k is
Dir (a1 ,.., a k )
• From observations to the data, we have the following counts n1 ,.., nk
• The posterior distribution for q1 ,..,q k , given the data, is

Dir (a1 + n1 ,.., a k + nk )

• So the prior works like pseudo-counts

• it can be used for smoothing

66
JM Smoothing:

§ Also known as the Mixture Model

§ Mixes the probability from the document with the general collection
frequency of the word.
§ Correctly setting λ is very important for good performance.
§ High value of λ: conjunctive-like search – tends to retrieve documents
containing all query words.
§ Low value of λ: more disjunctive, suitable for long queries

67
Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma.
Poisson Query-likelihood LM
Poisson: Receiver: Query
Each term is written Rates of Duration: |q|
arrival : li
text text [ ] 1
mining 3/7
model mining [ ] 2
2/7
mining model [ / ] 0
text 1/7
clustering clustering [ / ] 0
1/7
text … [ ] 1
…
Query q :
“mining text mining systems”

3 - 2 / 7|q| 2 -1/ 7|q| 1 -1/ 7|q| 1

e - 3 / 7| q | 1
( | q |) e ( | q |) 2
e ( | q |) 0
e ( | q |) 0 e - li |q| (l | q |)1
7 7
p(q | d ) = 7
1!
7
2! 0! 0! 1!
i

Slides adapted from Mei, Fang and Zhai‘s “A study of poison query generation model in IR” 68
Comparison

multi-Bernoulli multinomial Poisson

|V |

Õ p(w = 1 | d )Õ p(w = 0 | d ) Õ
c ( w j ,q ) |V |

p(q | d ) wÎq wÏq j =1

p( w j | d ) Õ p (c ( w , q ) | d )
j =1
j

Event space Appearance Vocabulary frequency

/absence
Model frequency? No Yes Yes

Model length? No Implicitly yes Yes

(document/query)
w/o Sum-to-one constraint? Yes No Yes

Per-Term Smoothing Easy Hard Easy

Closed form solution for No No Yes

mixture of models?
Slides adapted from Mei, Fang and Zhai‘s “A study of poison query generation model in IR” 69
Summary: Language Modeling
• LM vs. VSM:
• LM: based on probability theory
• VSM: based on similarity, a geometric/ linear algebra notion
• Modeling term frequency in LM is better than just modeling term presence/absence
• Multinomial model performs better than multi-Bernoulli
• Mixture of Multinomials for the background smoothing model has been shown to be
effective for IR
• LDA-based retrieval [Wei & Croft SIGIR 2006]
• PLSI [Hofmann SIGIR 99]
§ Probabilities are inherently length-normalized
§ When doing parameter estimation
§ Mixing document and collection frequencies has an effect similar to idf
§ Terms rare in the general collection, but common in some documents will have a
greater influence on the ranking.

70
Outline
• What is Information Retrieval
• Task, Scope, Relations to other disciplines
• Process
• Preprocessing, Indexing, Retrieval, Evaluation, Feedback
• Retrieval Approaches
• Boolean
• Vector Space Model
• BM25
• Language Modeling
• Summary
• What works?
• State-of-the-art retrieval effectiveness – what should you expect?
• Relations to the learning-based approaches

71
What works?
• Term Frequency (tf)
• Inverse Document Frequency (idf)
• Document length normalization
• Okapi BM25
• Seems ad-hoc but works so well (popularly used as a baseline)
• Created by human experts, not by data
• Other more justified methods could achieve similar effectiveness as
BM25
• They help better deep understanding of IR, related disciplines

72
What might not work?
• You might have heard of other topics/techniques, such as
• Pseudo-relevance feedback
• Query expansion
• N-gram instead of unit gram
• Semantically-heavy annotations
• Sophisticated understanding of documents
• Personalization (Read a lot into the user)
• .. But they usually don’t work reliably (not as much as what we expect
and sometimes worsen the performance)
• Maybe more research needs to be done
• Or, maybe they are not the right directions

73
At the heart is the metric
• How our users feel good about the search results
• Sometimes it could be subjective
• The approaches that we discusses today do not directly optimize the
metrics (P, R, nDCG, MAP etc)
• These approaches are considered more conventional, without making
use of large amount of data that can be learned models from
• Instead, they are created by researchers based on their own
understanding of IR and they hand-crafted or imagined most of the
models
• And these models work very well
• Salute to the brilliant minds

74
Learning-based Approaches
• More recently, learning-to-rank has become the dominating approach
• Due to vast amount of logged data from Web search engines
• The retrieval algorithm paradigm
• Has become data-driven
• Requires large amount of data from massive users
• IR is formulated as a supervised learning problem
• directly uses the metrics as the optimization objectives
• No longer guess what a good model should be, but leave to the data to decide
• The Deep learning lecture (Thursday by Bhaskar Mitra, Nick Craswell,
and Emine Yilmaz) will introduce them in depth
75
References
• IR Textbooks used for this talk:
• Introduction to Information Retrieval. C.D. Manning, P. Raghavan, H. Schütze. Cambridge UP, 2008.
• Foundations of Statistical Natural Language Processing. Christopher D. Manning and Hinrich Schütze.
• Search Engines: Information Retrieval in Practice. W. Bruce Croft, Donald Metzler, and Trevor Strohman. 2009.
• Modern Information Retrieval: The Concepts and Technology behind Search. by Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Second
condition. 2011.
• Main IR research papers used for this talk:
• Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. Robertson, S. E., & Walker, S.
SIGIR 1994.
• Document Language Models, Query Models, and Risk Minimization for Information Retrieval. Lafferty, John and Zhai, Chengxiang.
SIGIR 2001.
• A study of Poisson query generation model for information retrieval. Qiaozhu Mei, Hui Fang, Chengxiang Zhai. SIGIR 2007.
• Course Materials/presentation slides used in this talk:
• Barbara Rosario’s “Mathematical Foundations” lecture notes for textbook “Statistical Natural Language Processing”
• Textbook slides for “Search Engines: Information Retrieval in Practice” by its authors
• Oznur Tastan s recitation for 10601 Machine Learning
• Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma
• CS276: Information Retrieval and Web Search by Pandu Nayak and Prabhakar Raghavan
• 11-441: Information Retrieval by Jamie Callan
• A study of Poisson query generation model for information retrieval. Qiaozhu Mei, Hui Fang, Chengxiang Zhai

76
Thank You Dr. Grace Hui Yang
InfoSense
Georgetown University, USA

Contact: [email protected]

Intro to Info Retrieval Course
No ratings yet
Intro to Info Retrieval Course
31 pages
Lecture17 IR
No ratings yet
Lecture17 IR
28 pages
CompletedUNIT 1 PPT 10.7.17
100% (6)
CompletedUNIT 1 PPT 10.7.17
87 pages
UNIT I - Introduction and Motivation
No ratings yet
UNIT I - Introduction and Motivation
57 pages
Chapter 1
No ratings yet
Chapter 1
52 pages
Unit 1: Introduction and Data Pre-Processing
No ratings yet
Unit 1: Introduction and Data Pre-Processing
71 pages
Informaiton Retrieval and Web Search
No ratings yet
Informaiton Retrieval and Web Search
44 pages
Introduction To Information Retrieval
No ratings yet
Introduction To Information Retrieval
50 pages
Introduction to Information Retrieval Course
No ratings yet
Introduction to Information Retrieval Course
39 pages
Materi Pertemuan Ke-1-Dno 2018-1
No ratings yet
Materi Pertemuan Ke-1-Dno 2018-1
42 pages
Introduction
No ratings yet
Introduction
32 pages
DDB Ch27
No ratings yet
DDB Ch27
60 pages
Information Storage and Retrieval Course
100% (1)
Information Storage and Retrieval Course
35 pages
An Overview of Information Retrieval Outline: A (Simple) Database Example Databases vs. IR
No ratings yet
An Overview of Information Retrieval Outline: A (Simple) Database Example Databases vs. IR
16 pages
Introduction To Information Retrieval
No ratings yet
Introduction To Information Retrieval
42 pages
Lecture1 Chap1
No ratings yet
Lecture1 Chap1
22 pages
IR Chapter 1
No ratings yet
IR Chapter 1
29 pages
Information Retrieval Detailed Lecture Nov 2023
No ratings yet
Information Retrieval Detailed Lecture Nov 2023
39 pages
Comprehensive Guide to Information Retrieval
No ratings yet
Comprehensive Guide to Information Retrieval
74 pages
01 Introduction To ISR
No ratings yet
01 Introduction To ISR
34 pages
Understanding Information Retrieval Systems
No ratings yet
Understanding Information Retrieval Systems
30 pages
Information Retrieval Systems
No ratings yet
Information Retrieval Systems
46 pages
Introduction To IR 2021
No ratings yet
Introduction To IR 2021
40 pages
Introduction to Information Retrieval
No ratings yet
Introduction to Information Retrieval
48 pages
Chap 1
No ratings yet
Chap 1
23 pages
Intro to Information Retrieval
No ratings yet
Intro to Information Retrieval
23 pages
ch1 - Information Retrieval Systems
No ratings yet
ch1 - Information Retrieval Systems
52 pages
Introduction To IIR
No ratings yet
Introduction To IIR
53 pages
Chapter 1 Ir
No ratings yet
Chapter 1 Ir
37 pages
Information Retrieval
No ratings yet
Information Retrieval
5 pages
Information Retrieval: Dr. Bassel ALKHATIB
No ratings yet
Information Retrieval: Dr. Bassel ALKHATIB
55 pages
Introduction Advanced DB
No ratings yet
Introduction Advanced DB
80 pages
Chapter 1 Introduction To IR
No ratings yet
Chapter 1 Introduction To IR
18 pages
Overview of Information Retrieval Systems
No ratings yet
Overview of Information Retrieval Systems
25 pages
Understanding Information Retrieval Systems
No ratings yet
Understanding Information Retrieval Systems
18 pages
Chapter One IR
No ratings yet
Chapter One IR
18 pages
Ch1 IR
No ratings yet
Ch1 IR
39 pages
Introduction To Information Retrieval - DR Alshli
No ratings yet
Introduction To Information Retrieval - DR Alshli
46 pages
Introduction to Information Retrieval
No ratings yet
Introduction to Information Retrieval
29 pages
1 IR Introductionn
No ratings yet
1 IR Introductionn
30 pages
Unit - I - IR
No ratings yet
Unit - I - IR
39 pages
IR Chapter 1
No ratings yet
IR Chapter 1
32 pages
1 IR Chapter-One
No ratings yet
1 IR Chapter-One
47 pages
Intro to Information Retrieval Systems
No ratings yet
Intro to Information Retrieval Systems
10 pages
1 Introduction MIR
No ratings yet
1 Introduction MIR
35 pages
Modern Information Retrieval: Computer Engineering Department Fall 2005
No ratings yet
Modern Information Retrieval: Computer Engineering Department Fall 2005
19 pages
Information Retrieval
No ratings yet
Information Retrieval
5 pages
IR-Module 1 and 2
No ratings yet
IR-Module 1 and 2
48 pages
Intro to Information Retrieval
No ratings yet
Intro to Information Retrieval
47 pages
Ch2 - IR and LT
No ratings yet
Ch2 - IR and LT
45 pages
Introduction To Information Retrieval - by William Scott - Medium
No ratings yet
Introduction To Information Retrieval - by William Scott - Medium
4 pages
Introduction to Information Retrieval
No ratings yet
Introduction to Information Retrieval
73 pages
IR Lec1
No ratings yet
IR Lec1
26 pages
IR Textbook
No ratings yet
IR Textbook
167 pages
Documentation Ir
No ratings yet
Documentation Ir
58 pages
Aspect Information Retrieval (IR) Web Search
No ratings yet
Aspect Information Retrieval (IR) Web Search
19 pages
Week 10
No ratings yet
Week 10
3 pages
Week 11
No ratings yet
Week 11
3 pages
Understanding Plagiarism and Detection Tools
No ratings yet
Understanding Plagiarism and Detection Tools
3 pages
Skill Enhancement
No ratings yet
Skill Enhancement
4 pages
Week 2
No ratings yet
Week 2
3 pages
Week 6
No ratings yet
Week 6
4 pages
Research Methodology
No ratings yet
Research Methodology
6 pages
Unit 2 - WD
No ratings yet
Unit 2 - WD
39 pages
BigData Mining and Analytics
No ratings yet
BigData Mining and Analytics
2 pages
Machine Learning Techniques
No ratings yet
Machine Learning Techniques
3 pages
Big Data Framework
No ratings yet
Big Data Framework
3 pages
XML
No ratings yet
XML
36 pages
Foundation of Datascience
No ratings yet
Foundation of Datascience
2 pages
Unit III - IV
No ratings yet
Unit III - IV
122 pages
cs8080 Irt Local Author
No ratings yet
cs8080 Irt Local Author
168 pages
Web Tech for CS Students
No ratings yet
Web Tech for CS Students
96 pages
IV CSE Handbook
No ratings yet
IV CSE Handbook
29 pages
Add Solar Camera to NVR via ISUP
No ratings yet
Add Solar Camera to NVR via ISUP
10 pages
Introduction To Manual vs. Autocad
No ratings yet
Introduction To Manual vs. Autocad
6 pages
Python Coding Interview Questions - 3
No ratings yet
Python Coding Interview Questions - 3
3 pages
Types of Hacking and Cyber Crimes
No ratings yet
Types of Hacking and Cyber Crimes
24 pages
Ai Logbook Face Detection
No ratings yet
Ai Logbook Face Detection
46 pages
615 Series IEC 60870-5-103 Point List Manual - A
No ratings yet
615 Series IEC 60870-5-103 Point List Manual - A
96 pages
A55F2-M3 V2.0 High PDF
No ratings yet
A55F2-M3 V2.0 High PDF
70 pages
Grading Rubrics
No ratings yet
Grading Rubrics
1 page
CHFI Course Overview and Objectives
No ratings yet
CHFI Course Overview and Objectives
24 pages
CSS: The Definitive Guide, 5th Edition (Early Release) Eric Meyer Get PDF
No ratings yet
CSS: The Definitive Guide, 5th Edition (Early Release) Eric Meyer Get PDF
169 pages
COVID-19 Data Analysis Project
0% (1)
COVID-19 Data Analysis Project
20 pages
Big M Method
No ratings yet
Big M Method
12 pages
Donald Knuth, 1979, Mathematical Typografy, Bull. Amer. Math. Soc. (N.S.) Volume 1, Number 2 (1979), 337-372 PDF
No ratings yet
Donald Knuth, 1979, Mathematical Typografy, Bull. Amer. Math. Soc. (N.S.) Volume 1, Number 2 (1979), 337-372 PDF
36 pages
PGP Encryption & Decryption
No ratings yet
PGP Encryption & Decryption
8 pages
Summarization Techniques for QA
No ratings yet
Summarization Techniques for QA
1,006 pages
Research Paper On Stock Market Forecast Using ARIMA & LSTM Hybrid Model With AI, Database & Blockchain
No ratings yet
Research Paper On Stock Market Forecast Using ARIMA & LSTM Hybrid Model With AI, Database & Blockchain
4 pages
Software Project Mgmt Exam Guide
No ratings yet
Software Project Mgmt Exam Guide
2 pages
Computer PPT 2
No ratings yet
Computer PPT 2
21 pages
Cisco Nexus 5548UP Switch Configuration Guide PDF
No ratings yet
Cisco Nexus 5548UP Switch Configuration Guide PDF
14 pages
PMKS+ Web How-To Videos - Planar Mechanism Kinema
No ratings yet
PMKS+ Web How-To Videos - Planar Mechanism Kinema
7 pages
Erp - Comparative Statement
100% (1)
Erp - Comparative Statement
5 pages
DGCA AME User Manual Overview
No ratings yet
DGCA AME User Manual Overview
53 pages
M.RT2261 (Without Audio) PDF
No ratings yet
M.RT2261 (Without Audio) PDF
9 pages
Baloncesto Aprender y Progresar PDF
100% (1)
Baloncesto Aprender y Progresar PDF
244 pages
Westernport Water Uses Watergems To Achieve Automated Distribution System
No ratings yet
Westernport Water Uses Watergems To Achieve Automated Distribution System
2 pages
Preparing For Your Rubrik Interview
No ratings yet
Preparing For Your Rubrik Interview
3 pages
OpenProject SQL Injection Vulnerability
No ratings yet
OpenProject SQL Injection Vulnerability
3 pages
Ethiopian TVET-System: Learning Guide # 7
100% (1)
Ethiopian TVET-System: Learning Guide # 7
16 pages
MIS Unit III: Info System Strategy
No ratings yet
MIS Unit III: Info System Strategy
5 pages
Free Microsoft Office Download Guide
No ratings yet
Free Microsoft Office Download Guide
4 pages

Introduction to Information Retrieval

Uploaded by

Introduction to Information Retrieval

Uploaded by

Information Retrieval:

Query Representation Indexing

Info Need Info about removing mice

Query mouse trap

Inverted index construction

I did enact Julius So let it be with

Query Representation Indexing

Query Representation Indexing

Vector Space Model

Represent Documents in a Space of Word Vectors

They form a space of “Jealous” and

d1: gossip gossip jealous

However, d2 produces a bigger Eclidean

Use angle instead of distance

qi is the tf-idf weight of term i in the query

cos(q,d) is the cosine similarity of q and d … or,

D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2),

Example from textbook “Search Engines: Information Retrieval in Practice” Chap 7 34

Example from textbook “Search Engines: Information Retrieval in Practice” Chap 7 36

• They are term weights

• How to discount their term weights?

Inverse Document Frequency (idf)

• Log is used here to “dampen” the effect of idf.

• Best known term weighting scheme in IR

tf-idf weighting has many variants

Standard tf-idf weighting scheme: Lnc.ltc

Figure adapted from “Search Engines: Information Retrieval in Practice” Chap 7

p’ = P(document elite for t| R)

Approximated term weight Robertson/Sparck-Jones weight;

idf (Robertson-Sparck Jones weight) tf user related weight

Original Okapi: k1 = 2, b=0.75, k3 = 0

Example from textbook “Search Engines: Information Retrieval in Practice” Chap 7 52

Example from textbook “Search Engines: Information Retrieval in Practice” Chap 7 54

Textbook slides from “Search Engines: Information Retrieval in Practice” Chap 7 55

§ P(q) is the same for all documents, so ignore

String = frog said that toad likes frog STOP

k k order invariant selections

A binomial distribution is the multinomial distribution with k=2 and q1 ,q 2 = 1 - q 2

• The Dirichlet distribution is the conjugate prior for the multinomial,

Dir (a1 + n1 ,.., a k + nk )

• So the prior works like pseudo-counts

§ Also known as the Mixture Model

3 - 2 / 7|q| 2 -1/ 7|q| 1 -1/ 7|q| 1

multi-Bernoulli multinomial Poisson

p(q | d ) wÎq wÏq j =1

Event space Appearance Vocabulary frequency

Model length? No Implicitly yes Yes

Per-Term Smoothing Easy Hard Easy

Closed form solution for No No Yes

You might also like