Fast Search Engine Development Guide

This document summarizes how search engines work by indexing web documents, processing user queries, and returning relevant results. It discusses key aspects like indexing data with inverted indexes, approximating relevance using statistical algorithms like Okapi BM25, compressing indexes and documents for fast retrieval, and architectures that divide indexes and documents across multiple servers. The document concludes by outlining current research at RMIT's Search Engine Group on fast search techniques and applications to other domains.

Uploaded by

Shikhir Kapoor

Building Fast Search Engines

Hugh E. Williams (hugh@[Link])

School of Computer Science and Information
Technology, RMIT
Overview
• User’s Information Needs
• Why users use search engines
• How users query with search engines
• Answers
• What is a good answer?
• How search engines provide a search service
• Indexing data
• Index design
• Architecture of a commercial search engine
• Research
• Fast searching and emerging technologies
Queries

• Search engines are one tool used to answer information needs
• Users express their information needs as queries
• Usually informally expressed as two or three words (we call this a ranked query)
• A recent study showed the mean query length was 2.4 words per query with a median of 2
• Around 48.4% of users submit just one query in a session, 20.8% submit two, and about 31% submit three or more
• Less than 5% of queries use Boolean operators (AND, OR, and NOT), and around 5% contain quoted phrases

Queries...

• About 1.28 million different words were used in queries in the Excite log studied (which contained 1.03 million queries)
• Around 75 words account for 9% of all words used in queries. The top-ten non-trivial words occurring in 531,000 queries are “sex” (10,757), “free” (9,710), “nude” (7,047), “pictures” (5,939), “university” (4,383), “pics” (3,815), “chat” (3,515), “adult” (3,385), “women” (3,211), and “new” (3,109)
• 16.9% of the queries were about entertainment, 16.8% about sex, pornography, or preferences, and 13.3% concerned commerce, travel, employment, and the economy
Answers

• What is a good answer to a query?
• One that is relevant to the user’s information need!
• Search engines typically return ten answers per page, where each answer is a short summary of a web document
• Likely relevance to an information need is approximated by statistical similarity between web documents and the query
• Users favour search engines that have high precision, that is, those that return relevant answers in the first page of results
An Example Query
Top-ten Answers
Approximating Relevance
• Statistical similarity is used to estimate the relevance of an answer to a query
• Consider the query “Richardson Richmond Football”
• A good answer contains all three words, and the more frequently the better; we call this term frequency (TF)
• Some query terms are more important (have better discriminating power) than others. For example, an answer containing only “Richardson” is likely to be better than an answer containing only “Football”; we call this inverse document frequency (IDF)
• A popular, state-of-the-art statistical ranking function that incorporates these ideas is Okapi
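The IDF idea can be illustrated with a small sketch. The toy collection below is invented, and log(N/n) is used as one common, simple form of IDF; it is an assumption for illustration, not the Okapi weight.

```python
import math

# Hypothetical four-document collection, invented for illustration.
docs = [
    "richardson kicks five goals for richmond",
    "richmond football club wins",
    "football season opens",
    "local football results",
]

def idf(term, docs):
    """Simple IDF: log(N / n), where n is the number of documents containing term."""
    n = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / n) if n else 0.0

idf_richardson = idf("richardson", docs)  # occurs in 1 of 4 documents
idf_football = idf("football", docs)      # occurs in 3 of 4 documents
```

The rarer term receives the larger weight, which is why a document matching only “Richardson” is likely to outrank one matching only “Football”.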
Okapi BM25 Function

• The Okapi ranking function is as follows:
$$\sum_{T \in Q} w \times \frac{(k_1 + 1)\,tf}{K + tf} \times \frac{(k_3 + 1)\,qtf}{k_3 + qtf}$$
• Q is a query that contains the words T
• k1, b, and k3 are constant parameters (k1=1.2 and b=0.75 work well, k3 is 7 or 1000)
• K is: $k_1\left((1 - b) + b \cdot dl / avdl\right)$
• tf is the term frequency of the term within a document
• qtf is the term frequency in the query
• w is: $\log \frac{N - n + 0.5}{n + 0.5}$
• N is the number of documents, n is the number containing the term
• dl and avdl are the document length and average document length
• Overall: ranking uses the number of times a word occurs in a document, the number of documents containing the term, and the document length
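A minimal sketch of the function above, using the slide's symbols as variable names. The function name, argument layout, and example statistics are assumptions made for illustration; document length is measured here in words, which is one possible choice.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avdl,
               k1=1.2, b=0.75, k3=7.0):
    """Okapi BM25 score of one document for one query.

    doc_freq maps a term to n, the number of documents containing it.
    """
    tf_counts = Counter(doc_terms)
    qtf_counts = Counter(query_terms)
    dl = len(doc_terms)
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for term, qtf in qtf_counts.items():
        tf = tf_counts.get(term, 0)
        n = doc_freq.get(term, 0)
        if tf == 0 or n == 0:
            continue  # the term contributes nothing to this document
        w = math.log((num_docs - n + 0.5) / (n + 0.5))
        score += w * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score

# Hypothetical collection statistics, invented for illustration.
query = ["richardson", "richmond", "football"]
doc_freq = {"richardson": 10, "richmond": 50, "football": 400}
score_a = bm25_score(query, ["richardson", "plays", "well"], doc_freq, 1000, 3.0)
score_b = bm25_score(query, ["football", "plays", "well"], doc_freq, 1000, 3.0)
```

With these numbers the document containing the rarer term scores higher. Note that w becomes negative once a term appears in more than half of the documents.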
More on Ranking...

• Other techniques are used to improve the accuracy of search engines:
• Google Inc. use their patented PageRank(tm) technology. Google ranks a page higher if it links to pages that are an authoritative source, and a link from an authoritative source to a page ranks that page higher
• Relevance feedback is a technique that adds words to a query based on a user selecting a “more like this” option
• Query expansion adds words to a query using thesaural or other techniques
• Searching within categories or groups narrows a search
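The link-based part of the PageRank bullet (a link from a highly ranked page raises the rank of its target) can be sketched with the textbook power-iteration formulation. The damping factor of 0.85, the iteration count, and the toy link graph are assumptions; this is not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical link graph: every page links to "hub", and "hub" links to "a" and "b".
links = {"hub": ["a", "b"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
ranks = pagerank(links)
```

In this graph "hub" ends up with the highest rank, and "a" outranks "c" because "a" is linked from the highly ranked "hub" while "c" has no incoming links.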
How Search Engines Work
• Search engines work as follows:
• They retrieve (spider or crawl) documents from the Web
• Documents are stored as a collection in a centralised repository
• The collection is indexed to allow fast ranking to find answers
• A web interface is provided for entering queries and presenting answers
• Document summarisation is used to present short answers to the user for judging relevance
• Documents are updated and re-indexed regularly
Indexing Data

• All search engines use inverted indexes to support fast searching
• An inverted index consists of two components:
• A searchable in-memory vocabulary of all words in the collection; stored with each word is the IDF and a pointer to the inverted list for that word
• An on-disk inverted list for each word in the collection. This list contains:
• the documents that contain the word
• the term frequency of the word in each document
• the offset or offsets of the word in each document (this is optional, and is used for proximity and phrase queries)
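The two components can be sketched as follows. This is a toy, in-memory version: a Python list stands in for the on-disk inverted lists, the "pointer" is a list index, whitespace splitting stands in for real parsing, and log(N/n) is an assumed simple IDF. All names are hypothetical.

```python
import math
from collections import defaultdict

def build_inverted_index(docs):
    """Return (vocabulary, inverted_lists) for a list of document strings."""
    postings = defaultdict(dict)  # term -> {doc_id: [word offsets]}
    for doc_id, text in enumerate(docs):
        for offset, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(offset)

    vocabulary = {}      # term -> IDF and pointer to the term's inverted list
    inverted_lists = []  # each entry: [(doc_id, term frequency, offsets), ...]
    for term in sorted(postings):
        entries = [(doc_id, len(offsets), offsets)
                   for doc_id, offsets in sorted(postings[term].items())]
        vocabulary[term] = {
            "idf": math.log(len(docs) / len(entries)),
            "list_id": len(inverted_lists),
        }
        inverted_lists.append(entries)
    return vocabulary, inverted_lists

docs = ["the cat sat on the mat", "the cat in the hat", "a hat on a mat"]
vocabulary, inverted_lists = build_inverted_index(docs)
cat_list = inverted_lists[vocabulary["cat"]["list_id"]]
the_list = inverted_lists[vocabulary["the"]["list_id"]]
```

Here `cat_list` records that “cat” occurs once in documents 0 and 1, at word offset 1 in each; the offsets are what a phrase or proximity query would consult.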
Indexing Data
Resolving Queries

• Queries are resolved using the inverted index

• Consider the example query “Cat Mat Hat”. This is
evaluated as follows:
• Select a word from the query (say, “Cat”)
• Retrieve the inverted list from disk for the word
• Process the list. For each document the word occurs in, add weight
to an accumulator for that document based on the TF, IDF, and
document length
• Repeat for each word in the query
• Find the documents with the highest accumulated weights
• Look up each of those documents in the mapping table
• Retrieve and summarise the documents, and present to the user
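The evaluation steps above can be sketched with a dictionary of accumulators. The inverted lists, document lengths, and mapping table are invented, and the weight added per posting is a deliberately crude TF x IDF / length stand-in (an assumption, not the Okapi function).

```python
import math
from collections import defaultdict

# Hypothetical inverted lists: term -> [(doc_id, term frequency), ...]
inverted_lists = {
    "cat": [(0, 2), (1, 1)],
    "mat": [(0, 1), (2, 1)],
    "hat": [(1, 1), (2, 2)],
}
doc_lengths = {0: 6, 1: 5, 2: 5}                                # in words
mapping_table = {0: "doc-a.html", 1: "doc-b.html", 2: "doc-c.html"}

def resolve(query, k=2):
    """Return the k best (document name, weight) pairs for a ranked query."""
    num_docs = len(doc_lengths)
    accumulators = defaultdict(float)
    for term in query.lower().split():
        entries = inverted_lists.get(term, [])  # "retrieve the list from disk"
        if not entries:
            continue
        idf = math.log(1 + num_docs / len(entries))
        for doc_id, tf in entries:
            # Add weight based on TF, IDF, and document length.
            accumulators[doc_id] += (1 + math.log(tf)) * idf / doc_lengths[doc_id]
    ranked = sorted(accumulators.items(), key=lambda item: item[1], reverse=True)
    return [(mapping_table[doc_id], weight) for doc_id, weight in ranked[:k]]

results = resolve("Cat Mat Hat")
```

Only documents that appear in at least one inverted list ever get an accumulator, so the cost of a query depends on the lengths of the lists processed rather than on the size of the collection.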
Fast Search Engines

• There are many well-known principles for building a fast search engine
• Perhaps the most important is compression:
• Inverted lists are stored in a compressed format. This allows more information per second to be retrieved from disk, and it lowers disk head seek times
• As long as decompression is fast, there is a beneficial trade-off in time
• Documents are stored in a compressed format for the same reason
• Different compression schemes are used for lists (which are integers) and documents (which are multimedia, but mostly text)
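As an illustration of integer compression for inverted lists, the sketch below stores the gaps between sorted document numbers using variable-byte coding. The slides do not say which scheme is used, so this choice is an assumption, as is the byte convention (7 data bits per byte, high bit set on the last byte of each number).

```python
def to_gaps(doc_ids):
    """Turn sorted document numbers into differences between neighbours."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def from_gaps(gaps):
    """Invert to_gaps by keeping a running total."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def vbyte_encode(numbers):
    """Encode non-negative integers, low-order 7-bit groups first."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # more bytes follow: high bit clear
            n >>= 7
        out.append(n | 0x80)      # last byte of this number: high bit set
    return bytes(out)

def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

doc_ids = [3, 7, 8, 150, 1000, 1001]
encoded = vbyte_encode(to_gaps(doc_ids))
decoded = from_gaps(vbyte_decode(encoded))
```

The six document numbers occupy 8 bytes here, against 24 bytes as fixed 32-bit integers. Small gaps, which dominate the lists of common words, fit in a single byte each.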
Fast Search Engines...

• Average query times and index sizes for 25,000 queries on 10 gigabytes of indexed Web data
[Figure: two bar charts comparing compressed and uncompressed indexes. Left: index size (% of collection size). Right: average query time (seconds).]
Fast Search Engines...

• Other principles of fast searching:
• Sort disk accesses to minimise disk head movement when retrieving lists or documents
• Use hash tables in memory to store the vocabulary; avoid slow hash functions that use modulo
• Pre-calculate and store constants in ranking formulae
• Carefully choose integer compression schemes
• Organise inverted lists so that the information frequently needed is at the start of the list
• Use heap structures when partial sorting is required
• Develop a query plan for each query
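The heap principle can be sketched as follows: when only the k best documents are needed, a min-heap of size k avoids sorting every accumulator. The function name and the accumulator values are invented for illustration.

```python
import heapq

def top_k(accumulators, k):
    """Return the k highest-weighted (doc_id, weight) pairs, best first."""
    heap = []  # min-heap of (weight, doc_id) holding at most k entries
    for doc_id, weight in accumulators.items():
        if len(heap) < k:
            heapq.heappush(heap, (weight, doc_id))
        elif weight > heap[0][0]:
            # The new weight beats the smallest weight kept so far.
            heapq.heapreplace(heap, (weight, doc_id))
    return [(doc_id, weight) for weight, doc_id in sorted(heap, reverse=True)]

# Hypothetical accumulator values: doc_id -> accumulated weight.
accumulators = {10: 0.4, 11: 2.5, 12: 1.1, 13: 0.9, 14: 3.2}
best = top_k(accumulators, 3)
```

With n accumulators this takes O(n log k) time rather than the O(n log n) of a full sort, which matters when a query touches many documents but only ten answers are shown.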

Search Engine Architecture
Search Engine Architecture...
• The inverted lists are divided amongst a number of servers,
where each is known as a shard
• If an inverted list is required for a particular range of words,
then that shard server is contacted
• Each shard server can be replicated as many times as
required; each server in a shard is identical
• Documents are also divided amongst a number of servers
• Again, if a document is required within a particular range,
then the appropriate document server is contacted
• Each document server can also be replicated as many
times as required
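A minimal sketch of routing a term to its shard by word range, with round-robin selection among identical replicas. The class, the range boundaries, and the host names are hypothetical, and round-robin is just one simple way to spread load.

```python
import bisect

class TermRangeRouter:
    """Route each term to the shard that holds its inverted list."""

    def __init__(self, boundaries, replicas):
        # boundaries is sorted; boundaries[i] is the first term of shard i + 1.
        # replicas[i] lists the identical servers that make up shard i.
        assert len(replicas) == len(boundaries) + 1
        self.boundaries = boundaries
        self.replicas = replicas
        self._next = [0] * len(replicas)

    def shard_for(self, term):
        return bisect.bisect_right(self.boundaries, term)

    def server_for(self, term):
        shard = self.shard_for(term)
        servers = self.replicas[shard]
        server = servers[self._next[shard]]
        self._next[shard] = (self._next[shard] + 1) % len(servers)
        return server

# Hypothetical layout: terms before "h", terms from "h" up to "q", and the rest.
router = TermRangeRouter(
    boundaries=["h", "q"],
    replicas=[["idx0-a", "idx0-b"], ["idx1-a"], ["idx2-a", "idx2-b"]],
)
first = router.server_for("cat")
second = router.server_for("cat")
```

Successive requests for the same term alternate between the replicas of its shard. The same pattern would apply to document servers, keyed on document number ranges instead of word ranges.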
What we’re working on...

• The Search Engine Group here at RMIT specialises in research into fast search engines and applications of search technology to other domains
• We are currently investigating:
• Fast phrase querying using new index structures
• Answer summarisation
• Index design
• Fast vocabulary searching and accumulation
• Index construction
• DNA and protein search engines
• Image and video management and retrieval
• General-purpose compression of collections
• Our new research testbed search engine will be released under the GPL later this year
Pointers (& advertising!)
• The Search Engine Group, [Link]
• My home page, [Link]
• Witten, Moffat, and Bell, “Managing Gigabytes”, 2nd edition, Morgan Kaufmann, 1999
• Spink, Wolfram, Jansen and Saracevic, “Searching the web: The public and their queries”,
Journal of the American Society for Information Science, 52(3), 226-234, 2001. Queries
are available from: [Link]
• Williams and Zobel, “Compressing Integers for Fast File Access”, The Computer Journal,
42(3), 193-201, 1999.
• Moffat, Zobel, and Sharman, “Text compression for dynamic document databases”, IEEE
Transactions on Knowledge and Data Engineering, 9(2):302-313, March-April 1997.
• Zobel and Moffat, “Adding compression to a full text retrieval system”, Software-Practice
and Experience, 25(8):891-903, 1995.
• Zobel, Heinz, and Williams, “In-memory Hash Tables for Accumulating Text Vocabularies”,
Information Processing Letters. To appear.
