Slides Chap11 PDF
Chapter 11
Web Retrieval
with Yoelle Maarek
A Challenging Problem
The Web
Search Engine Architectures
Search Engine Ranking
Managing Web Data
Search Engine User Interaction
Browsing
Beyond Browsing
Related Problems
Web Retrieval, Modern Information Retrieval, Addison Wesley, 2010 – p. 1
Introduction
The Web
very large, public, unstructured but ubiquitous repository
need for efficient tools to manage, retrieve, and filter information
search engines have become a central tool on the Web
[Figure: file size distributions on a log-log scale. x-axis: log(File Size in Bytes); y-axis: log(P[X>x]); curves: All Files, Image Files, Audio Files, Video Files, Text Files]
[Figure: architecture diagram with components Users, Interface, Indexer, Crawler, and Web]
[Figure: plot against cache size (x-axis, 0.1 to 0.7; y-axis values from 0.2 to 0.6, label not recovered); curves: static QTF/DF, LRU, LFU, Dyn-QTF/DF, QTF]
[Figure: diagram with components User, Broker (shown twice), and Gatherer]
\[
PR(a) = \frac{q}{T} + (1 - q) \sum_{i=1}^{n} \frac{PR(p_i)}{L(p_i)}
\]
where
T: total number of pages on the Web graph
q: parameter set by the system (typical value is 0.15)
p_1, ..., p_n: the pages that link to page a
L(p_i): number of outgoing links of page p_i
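The PageRank formula can be iterated until the values stabilize. The following is a minimal sketch, not the implementation used by any search engine. The function name `pagerank`, the dictionary-based graph, the fixed iteration count, and the rule that pages without outgoing links spread their rank uniformly over all pages are assumptions made for illustration.

```python
def pagerank(links, q=0.15, iterations=50):
    """Iterate PR(a) = q/T + (1 - q) * sum(PR(p_i) / L(p_i)).

    links maps each page to the list of pages it points to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    T = len(pages)
    pr = {p: 1.0 / T for p in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Assumption: a page with no outgoing links shares its rank
        # equally among all T pages, so the values keep summing to 1.
        dangling = sum(pr[p] for p in pages if not links.get(p))
        new_pr = {p: q / T + (1 - q) * dangling / T for p in pages}

        for p in pages:
            targets = links.get(p, [])
            for t in targets:
                # Page p passes PR(p) / L(p) to every page it links to.
                new_pr[t] += (1 - q) * pr[p] / len(targets)
        pr = new_pr
    return pr


# Toy graph: a links to b and c, b links to c, c links to a.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

In this toy graph, page c receives rank from both a and b, so it ends up with a higher value than b, which is linked only from a.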
Almost 80% of all links are local, that is, they point to
pages of the same site
if we assign close identifiers to URLs referring to the same site,
the adjacency lists will contain ids that are very close to each other
WebGraph
compresses typical Web graphs at about 3 bits per link
provides access to a link in a few hundred nanoseconds
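Why close identifiers help compression can be seen by storing each adjacency list as differences (gaps) between consecutive ids and giving small numbers short codes. The sketch below is a generic gap encoding with a 7-bits-per-byte variable-length code, chosen for illustration. It is not the WebGraph codec, and the names `gaps`, `varint_size`, and `encoded_size` as well as the example ids are hypothetical.

```python
def gaps(adjacency):
    """Return the first id followed by the differences between sorted ids."""
    ids = sorted(adjacency)
    if not ids:
        return []
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]


def varint_size(n):
    """Bytes needed to store n with 7 payload bits per byte."""
    size = 1
    while n >= 128:
        n >>= 7
        size += 1
    return size


def encoded_size(adjacency):
    """Total bytes for an adjacency list stored as variable-length gaps."""
    return sum(varint_size(g) for g in gaps(adjacency))


# Links to pages of the same site: ids are close together.
local_links = [1000000, 1000003, 1000004, 1000010]
# Links to pages spread over the whole id space.
scattered_links = [1000000, 2500000, 7000000, 9000000]

local_bytes = encoded_size(local_links)          # gaps: 1000000, 3, 1, 6
scattered_bytes = encoded_size(scattered_links)  # gaps are all in the millions
```

Both lists hold four links, but the local list needs 6 bytes while the scattered one needs 13, because every gap after the first fits in a single byte.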