Information Storage
and Retrieval
CS418
Dr. Ebtsam AbdelHakam
Computer Science Dept.
Minia University
What is Information Retrieval?
• You have a collection of documents
‣ Books, web pages, journal articles, photographs, video clips,
tweets, a weather database, …
• You have an information need (query)
‣ “How many species of sparrow are native to New England?”
‣ “Find a new musician I’d enjoy listening to.”
‣ “Is it cold outside?”
• You want the documents that best satisfy that need
Web Search
Site-specific Search
Product Search
But also grouping related documents
And mining the web for knowledge
And answering everyday questions
Course Goals
• To help you understand the fundamentals of search
engines.
‣ How to crawl, index, and search documents
‣ How to evaluate and compare different search engines
‣ How to modify search engines for specific applications
• To provide broad coverage of the major issues in
information retrieval
• To take a closer look at particular applications of
Information Retrieval in industry
Course Materials
• Suggested books:
‣ Search Engines: Information Retrieval in Practice,
by Croft, Metzler, and Strohman
‣ Introduction to Information Retrieval, by Manning,
Raghavan, and Schütze
‣ Available for free online!
Course Topics
• Architecture of a search engine
• Data acquisition / web crawling
• Text representation
• Information extraction
• Indexing
• Query processing
• Ranking
• Evaluation
• Classification and clustering
• Social search
• More…
A brief history of IR
• Search of digital libraries was one of the earliest tasks
computers were used for.
• By the 1950s, rudimentary search systems could find
documents that contained particular terms.
• Documents were ranked based on how often the specific
search terms appeared in them — term frequency
weighting
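As an illustrative sketch of this kind of term-frequency ranking (the function and document names here are invented for the example, not part of any historical system):

```python
from collections import Counter

def tf_score(query_terms, doc_text):
    """Score a document by how often the query terms occur in it."""
    counts = Counter(doc_text.lower().split())
    return sum(counts[t] for t in query_terms)

docs = {
    "d1": "sparrow species of new england sparrow habitats",
    "d2": "weather report for new england",
}
query = ["sparrow", "england"]

# Rank documents by descending term-frequency score
ranked = sorted(docs, key=lambda d: tf_score(query, docs[d]), reverse=True)
```

Here "d1" outranks "d2" simply because the query terms occur in it more often.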
A brief history of IR
• In the 60s, new techniques were developed that treated a document as a
term vector.
‣ Using a “bag of words” model: assuming that the number of
occurrences of each term matters but term order does not
‣ A query can also be represented as a term vector, and the vectors can
be compared to measure similarity between the document and query
• Work also started on clustering documents with similar content
• The concept of relevance feedback was introduced: the best few
documents returned are assumed to be relevant, and documents that are
similar to them are assumed to be relevant to the original query as well.
• Some of the first commercial systems appeared in the 60s, sold to
companies who wanted to search their private records
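The term-vector and bag-of-words ideas above can be sketched in a few lines (a toy illustration with invented names, not a production implementation):

```python
import math
from collections import Counter

def vectorize(text):
    """Bag of words: term -> occurrence count; word order is discarded."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(v1[t] * v2[t] for t in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# A query is just another (short) term vector, compared the same way
doc = vectorize("the cat sat on the mat")
query = vectorize("cat mat")
sim = cosine(query, doc)
```

Documents can then be ranked by their cosine similarity to the query vector; note that "the cat sat" and "sat the cat" produce identical vectors, which is exactly the bag-of-words assumption.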
A brief history of IR
• Before the Internet, search was mainly about finding documents in your own collection
• The emphasis was largely on recall — making sure you find every relevant document
• Documents were mainly text files, and did not contain references to other documents
• With the arrival of the Internet, all of this changed
‣ Collection sizes jumped to billions of documents
‣ Documents are structured in networks, providing extra relevance information, and
often have other useful metadata (e.g. how many Facebook likes?)
‣ You can’t possibly know what’s in every document
‣ A “document” can be pages long or just 140 characters, or could be an image or
video clip, a file download, an abstract fact, or something else entirely
‣ You usually care more about precision — making sure your first few results are
relevant — because people only look at the first few results (except for when they
don’t…)
Challenges of IR
• Text documents are generally free-form
‣ The metadata is there, but you have to find it
‣ Most web pages contain lots of extra content —
ads, navigation bars, comments — that might or
might not be of interest
‣ Spam filtering is hard
• Searching multimedia content has its own challenges
‣ What are the features? How do you extract them?
Challenges of IR
• Running a query is hard
‣ You have less than one second to search the full text of
billions of documents to find the best ten matches
‣ …and the user only gave you two or three words
‣ …and one was misspelled, and one was “the”
‣ …and maybe throw a good relevant ad in, so you can
pay the bills
• Working at web scale means massive distributed systems,
sub-linear algorithms, and careful use of heuristics
Challenges of IR
• Comparing the query text to the document text and
determining what is a good match is the core issue of
information retrieval
‣ Exact matching of words is not enough
‣ Many different ways to write the same thing in a “natural
language” like English
‣ e.g., does a news story containing the text “bank
director in Amherst steals funds” match the query “bank
scandals in western mass”?
‣ Some stories will be better matches than others
Relevance
• What is relevance?
• Simple (and simplistic) definition: A relevant
document contains the information that a person
was looking for when they submitted a query to the
search engine
• Many factors influence a person’s decision about
what is relevant: e.g., task, context, novelty, style
Relevance
• Retrieval models define a particular view of
relevance based on some idea of what users want
• Ranking algorithms used in search engines are
based on retrieval models
• Most models are based on statistical properties of
text rather than deep linguistic analysis
• i.e., counting simple text features such as words
instead of parsing and analyzing the sentences
Users and Information Needs
• Search evaluation is user-centered
• Keyword queries are often poor descriptions of
actual information needs
• Interaction and context are important for
understanding user intent
• Query refinement techniques such as query
expansion, query suggestion, and relevance feedback
improve ranking
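One of these techniques, query expansion via pseudo-relevance feedback, can be sketched roughly as follows (a simplified toy version with invented names; real systems weight candidate terms much more carefully):

```python
from collections import Counter

def expand_query(query_terms, top_docs, k=2):
    """Assume the top-ranked docs are relevant; add their k most
    common terms (excluding the original query terms) to the query."""
    counts = Counter()
    for text in top_docs:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    return list(query_terms) + [term for term, _ in counts.most_common(k)]

# Top-ranked documents from an initial retrieval pass (made up here)
top_docs = ["bank director steals funds", "bank funds scandal in amherst"]
expanded = expand_query(["bank", "scandals"], top_docs)
```

The expanded query now also contains "funds", so a second retrieval pass can match relevant documents that never used the original wording.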
Research and Industry
• A search engine is the practical application of information
retrieval techniques to large scale text collections
• Web search engines are the best-known examples, but
there are many others
• Open source search engines are important for research
and development
• e.g., Lucene, Lemur/Indri, Galago
• Researchers are focused on many, but not all, of the
tasks that industry search engines care about
Research and Industry
Research Tasks
• Relevance
‣ Effective ranking
• Evaluation
‣ Testing and measuring
• Information needs
‣ User interaction

Search Engines
• Performance
‣ Efficient search and indexing
• Incorporating new data
‣ Coverage and freshness
• Scalability
‣ Growing with data and users
• Adaptability
‣ Tuning for applications
• Specific problems
‣ e.g. Spam
Search Engine Issues
• Performance
• Measuring and improving the efficiency of search
• e.g., reducing response time, increasing query
throughput, increasing indexing speed
• Indexes are data structures designed to improve
search efficiency
• Designing and implementing them are major
issues for search engines
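As a rough sketch of the idea (hypothetical names, ignoring compression, term positions, and ranking), an inverted index maps each term to the set of documents that contain it, so a query only touches the lists for its own terms instead of scanning every document:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query_terms):
    """Return docs containing all query terms (boolean AND)."""
    sets = [index.get(t, set()) for t in query_terms]
    return set.intersection(*sets) if sets else set()

docs = {1: "web search engines", 2: "search and indexing", 3: "web crawling"}
index = build_index(docs)
hits = search(index, ["web", "search"])
```

Real engines store far more in each posting (frequencies, positions, scores) and compress the lists, but the lookup-then-intersect pattern is the same.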
Search Engine Issues
• Dynamic data
• The “collection” for most real applications is constantly
changing in terms of updates, additions, deletions
• e.g., web pages
• Acquiring or “crawling” the documents is a major task
• Typical measures are coverage (how much has been
indexed) and freshness (how recently was it indexed)
• Updating the indexes while processing queries is also a
design issue
Search Engine Issues
• Scalability
• Making everything work with millions of users
every day, and many terabytes of documents
• Distributed processing is essential
• Adaptability
• Changing and tuning search engine components
such as ranking algorithms, indexing strategies,
interfaces for different applications
Search Engine Issues
• Spam
• For web search, spam in all its forms is one of the major
issues
• Affects the efficiency of search engines and, more
seriously, the effectiveness of the results
• Proliferation of spam varieties
• e.g. spamdexing or term spam, link spam, “optimization”
• New subfield called adversarial IR, since spammers are
“adversaries” with different goals