ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB SEARCH
Lecture 1: Introduction
S. M. Vahidipour
Outline
□ Introduction to the Course
□ Overview of the Semester
Textbooks
□ Search Engines: Information Retrieval in Practice
W. Bruce Croft, Donald Metzler, Trevor Strohman
Pearson Education, 2010
□ Modern Information Retrieval: The Concepts and Technology behind Search (2nd Edition)
Ricardo Baeza-Yates, Berthier Ribeiro-Neto
ACM Press Books, 2010
□ Introduction to Information Retrieval
C. Manning, P. Raghavan, and H. Schütze
Cambridge University Press, 2008
Search and Information Retrieval
Search on the Web is a daily activity for many people
throughout the world
□ Google: 40,000 searches per second (3.5 billion per
day; 1.2 trillion per year)
□ Yahoo: 3,200 searches per second (280 million per day;
8.4 billion per month)
□ Bing: 927 searches per second (80 million per day;
2.4 billion per month)
10^6: million, 10^9: billion, 10^12: trillion, 10^15: quadrillion, 10^18: quintillion, …
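The per-day, per-month, and per-year figures follow from the per-second rates by multiplication. A quick sanity check, assuming 86,400 seconds per day, 30 days per month, and 365 days per year (the slide does not say which conventions it used, so small rounding differences are expected):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def per_day(per_second):
    """Searches per day implied by a constant per-second rate."""
    return per_second * SECONDS_PER_DAY

# Searches per second, taken from the slide.
rates = {"Google": 40_000, "Yahoo": 3_200, "Bing": 927}

for engine, rate in rates.items():
    daily = per_day(rate)
    print(f"{engine}: {daily / 1e6:,.0f} million/day, "
          f"{daily * 30 / 1e9:.1f} billion/month, "
          f"{daily * 365 / 1e12:.2f} trillion/year")
```

For Google this gives roughly 3.46 billion searches per day and about 1.26 trillion per year, in line with the rounded figures on the slide.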
Search and Information Retrieval
□ Search and communication are the most popular uses of the computer.
□ Applications involving search are everywhere.
□ The field of computer science that is most involved with R&D for search
is information retrieval (IR).
Information Retrieval
“Information retrieval is a field concerned with the structure, analysis,
organization, storage, searching, and retrieval of information.”
(Salton, 1968)
□ General definition that can be applied to many types of information
and search applications
□ Still appropriate after 40 years.
□ Primary focus of IR since the 1950s has been on text and documents
Data/Information
□ Storage
□ Search
Data/Information
□ Structured
□ Unstructured
Structured vs. Unstructured Data
What is a Document?
Examples:
Web pages, email, books, news stories, scholarly papers, text
messages, Word™, PowerPoint™, PDF, forum postings, patents, IM
(instant messaging) sessions, etc.
Common properties
Significant text content
Some structure (≈ attributes in DB)
□ Papers: title, author, date
□ Email: subject, sender, destination, date
Comparing Text
Comparing the query text to the document text and determining what is
a good match is the core issue of information retrieval.
Exact matching of words is not enough
Many different ways to write the same thing in a “natural language” like
English
Does a news story containing the text “karl benz built the first automobile in 1886” match
the query “car inventor”?
Defining the meaning of a word, a sentence, a paragraph, or a story is
more difficult than defining the meaning of a database field.
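The limitation of exact word matching can be seen in a few lines. The sketch below is only an illustration under simple assumptions (a naive whitespace tokenizer and the hypothetical helper names `tokens` and `exact_overlap`); it compares the news text with a query about the inventor of the car:

```python
def tokens(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return set(text.lower().split())

def exact_overlap(query, document):
    """Return the set of words shared verbatim by query and document."""
    return tokens(query) & tokens(document)

document = "karl benz built the first automobile in 1886"
query = "car inventor"

print(exact_overlap(query, document))               # set(): no shared words
print(exact_overlap("first automobile", document))  # both query words match
```

The document is clearly about the topic of the query, yet the two share no words, so a purely exact-match system would not retrieve it.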
Dimensions of IR
IR is more than just text, and more than just web search,
although these are central
People doing IR work with different media, different types of search
applications, and different tasks
Three dimensions of IR
□ Content
□ Applications
□ Tasks
The Content Dimension
Textual data, but…
New applications increasingly involve new media
□ Video, photos, music, speech
□ Scanned documents (for legal purposes)
Like text, such content is difficult to describe and compare
□ Text may be used to represent it (e.g., tags)
IR approaches to search and evaluation are still appropriate
The Application Dimension
Web search
□ Most common
Vertical search
□ Restricted domain/topic
□ Books, movies, suppliers
Enterprise search
□ Corporate intranet
□ Databases, emails, web pages, documentation, code, wikis, tags,
directories, presentations, spreadsheets …
Desktop search
□ Personal enterprise search
□ See above plus recent web pages
P2P search
□ No centralized control
□ File sharing, shared locality
Literature search
Forum search
The Task Dimension
User queries / ad-hoc search
□ Range of queries is enormous, not pre-specified
Filtering
□ Given a profile (interests), notify about interesting news stories
□ Identify relevant user profiles for a new document
Classification / categorization
□ Automatically assign text to one or more classes of a given set
□ Identify relevant labels for documents
Question answering
□ Similar to search
□ Automatically answer a question posed in natural language
□ Provide concrete answer, not list of documents.
Main Issues in IR
Relevance
□ A relevant document contains the information a user was looking for when
he/she submitted the query
Evaluation
□ How well does the ranking meet the user's expectations?
Users and information needs
□ Users of a search engine are the ultimate judges of quality
IR and Search Engines
A search engine is the practical application of information retrieval
techniques to large scale text collections
Big issues include main IR issues but also some others…
Information Retrieval
● Relevance: Effective ranking
● Evaluation: Testing and measuring
● Information needs: User interaction
Search Engines (additional issues)
● Performance: Efficient search and indexing
● Incorporating new data: Coverage and freshness
● Scalability: Growing with data and users
● Adaptability: Tuning for applications
● Specific problems: e.g., spam
Outline
□ Introduction to the Course
□ Overview of the Semester
Search Engine
Basic architecture
Main issues
Indexing
□ Text acquisition
□ Text transformation
□ Index creation
Querying
□ User interaction
□ Ranking
□ Evaluation
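The indexing and querying stages listed above can be sketched end to end on a toy collection. This is a minimal illustration, not a real engine: the collection `DOCS`, the stopword list, and the function names are all hypothetical, and the query step only intersects posting sets without any ranking or evaluation.

```python
from collections import defaultdict

# Hypothetical toy collection standing in for the text acquisition step.
DOCS = {
    1: "Karl Benz built the first automobile",
    2: "The first web search engines indexed text",
    3: "Search engines rank text documents",
}
STOPWORDS = {"the", "a", "of", "in"}  # illustrative, not a standard list

def transform(text):
    """Text transformation: lowercase, tokenize, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def build_index(docs):
    """Index creation: map each term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in transform(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Querying: return ids of documents containing every query term."""
    postings = [index.get(t, set()) for t in transform(query)]
    return sorted(set.intersection(*postings)) if postings else []

index = build_index(DOCS)
print(search(index, "search engines"))  # [2, 3]
print(search(index, "the first"))       # [1, 2]
```

The term-to-documents mapping built here is usually called an inverted index; the query text goes through the same transformation as the documents so that both sides use the same vocabulary.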
Overview of Traditional Retrieval Models
Boolean retrieval
Vector space model
Probabilistic models
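To make the contrast between the first two models concrete, here is a small sketch under stated assumptions: a hypothetical three-document collection, whitespace tokenization, and one common TF-IDF variant (tf × log(N/df)) with cosine similarity. Other weighting schemes exist, and probabilistic models are not shown.

```python
import math
from collections import Counter

DOCS = {  # hypothetical toy collection
    "d1": "web search engines rank web pages",
    "d2": "desktop search finds local files",
    "d3": "boolean retrieval uses exact matching",
}

def boolean_and(query, docs):
    """Boolean retrieval: documents containing all query terms, unranked."""
    terms = set(query.split())
    return sorted(d for d, text in docs.items() if terms <= set(text.split()))

def tfidf_vector(tokens, df, n_docs):
    """Weight each known term by tf * log(N / df); unknown terms are skipped."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if t in df}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def vector_rank(query, docs):
    """Vector space model: rank documents by cosine similarity to the query."""
    df = Counter(t for text in docs.values() for t in set(text.split()))
    n = len(docs)
    q = tfidf_vector(query.split(), df, n)
    scores = {d: cosine(q, tfidf_vector(text.split(), df, n))
              for d, text in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(boolean_and("web search", DOCS))  # ['d1']
print(vector_rank("web search", DOCS))  # d1 first, then d2, then d3
```

Boolean retrieval returns only d1, the one document containing both terms, with no ordering. The vector space model also gives d2 a small nonzero score for its partial match, which is what allows results to be ranked.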
Overview of Evaluation Metrics
Effectiveness metrics
Efficiency metrics
Training, testing, and statistics
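As a preview of effectiveness metrics, the sketch below computes precision and recall at a cutoff and average precision for a single query. The ranking and the relevance judgments are made-up illustrative data, and the function names are assumptions; efficiency metrics such as latency and throughput are not covered here.

```python
def precision_recall(ranked, relevant, k):
    """Precision and recall over the top-k results of a ranking."""
    hits = sum(1 for d in ranked[:k] if d in relevant)
    return hits / k, hits / len(relevant)

def average_precision(ranked, relevant):
    """Sum of precision at each relevant hit, divided by the number of relevant docs."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

ranked = ["d3", "d1", "d7", "d2", "d9"]  # hypothetical system output
relevant = {"d1", "d2", "d5"}            # hypothetical relevance judgments

print(precision_recall(ranked, relevant, k=5))  # (0.4, 0.6666666666666666)
print(average_precision(ranked, relevant))
```

Two of the five returned documents are relevant (precision 0.4), and two of the three relevant documents were found (recall about 0.67). Average precision also rewards placing relevant documents near the top of the ranking.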
Advanced Retrieval Models
Language model-based retrieval
Learning to rank
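One common form of language model-based retrieval scores a document by the probability that its language model generates the query. The following is a minimal sketch assuming unigram models, Jelinek-Mercer smoothing with an arbitrarily chosen `lam=0.5`, and a hypothetical two-document collection; it is one possible formulation, not necessarily the one used later in the course.

```python
import math
from collections import Counter

DOCS = {  # hypothetical toy collection
    "d1": "information retrieval ranks documents for a query",
    "d2": "language models assign probabilities to text",
}

def query_log_likelihood(query, doc_id, docs, lam=0.5):
    """log P(query | document) with Jelinek-Mercer smoothing (lam is assumed)."""
    doc = Counter(docs[doc_id].split())
    coll = Counter(t for text in docs.values() for t in text.split())
    doc_len, coll_len = sum(doc.values()), sum(coll.values())
    score = 0.0
    for term in query.split():
        # Mix the document model with the collection model.
        p = (1 - lam) * doc[term] / doc_len + lam * coll[term] / coll_len
        if p == 0.0:  # term unseen in the whole collection
            return float("-inf")
        score += math.log(p)
    return score

for doc_id in DOCS:
    print(doc_id, query_log_likelihood("language text", doc_id, DOCS))
```

Because of smoothing, d1 still receives a finite score for the query "language text" even though it contains neither word, but d2 scores higher since both words occur in it.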
Word Mismatch Problem
Language model-based approaches
□ Translation model
□ Topic model
□ Word cluster model
□ WordNet
□ Dependency model
Query expansion approaches
Advanced/Specific IR Tasks
Query log and query suggestion
Personalized search
Information extraction
Cross-language IR
Question answering
Recommendation systems
Enterprise search
Digital library
Structured text retrieval
Multimedia retrieval
Query Log and Query Suggestion
Personalized Search
Information Extraction
Cross-language Retrieval
Question Answering
Recommendation Systems
Enterprise Search
Digital Library
Structured Text Retrieval
Multimedia Retrieval
Questions?