Design and Implementation of a High-Performance Distributed Web Crawler


    Vladislav Shkapenyuk* Torsten Suel
                CIS Department
              Polytechnic University
               Brooklyn, NY 11201



* Currently at AT&T Research Labs, Florham Park
Overview:
 1. Introduction
 2. PolyBot System Architecture
 3. Data Structures and Performance
 4. Experimental Results
 5. Discussion and Open Problems
1. Introduction
Web Crawler: (also called spider or robot)


     tool for data acquisition in search engines
     large engines need high-performance crawlers
     need to parallelize crawling task
     PolyBot: a parallel/distributed web crawler
     cluster vs. wide-area distributed
Basic structure of a search engine:

[Figure: block diagram of a search engine: the Crawler fetches pages onto disks, an indexing step builds the Index, and Search.com answers a query (e.g., “computer”) by index lookup.]
Crawler

[Figure: crawler fetching pages from the web onto disks.]

   fetches pages from the web
   starts at set of “seed pages”
   parses fetched pages for hyperlinks
   then follows those links (e.g., BFS)
   variations:
- recrawling
- focused crawling
- random walks
Breadth-First Crawl:
  Basic idea:
- start at a set of known URLs
- explore in “concentric circles” around these URLs



  [Figure: concentric circles around the start pages: distance-one pages, then distance-two pages.]
  used by broad web search engines
  balances load between servers
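For concreteness, a minimal Python sketch of the breadth-first idea; fetch and extract_links are placeholder callables, not PolyBot interfaces:

    from collections import deque

    def bfs_crawl(seed_urls, fetch, extract_links, max_pages=1000):
        # FIFO frontier => pages are visited in "concentric circles" (BFS)
        frontier = deque(seed_urls)
        seen = set(seed_urls)
        pages = 0
        while frontier and pages < max_pages:
            url = frontier.popleft()
            page = fetch(url)              # download the page
            pages += 1
            for link in extract_links(page):
                if link not in seen:       # discover the next distance level
                    seen.add(link)
                    frontier.append(link)
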
Crawling Strategy and Download Rate:
   crawling strategy: “What page to download next?”
   download rate: “How many pages per second?”
   different scenarios require different strategies
   lots of recent work on crawling strategy
   little published work on optimizing download rate
(main exception: Mercator from DEC/Compaq/HP?)
   somewhat separate issues
   building a slow crawler is (fairly) easy ...
Basic System Architecture

[Figure: overall system architecture: crawling application, crawl manager, DNS resolvers, and downloaders.]

    Application determines the crawling strategy.
System Requirements:
   flexibility (different crawling strategies)
   scalability (sustainable high performance at low cost)
   robustness
(odd server content/behavior, crashes)
   crawling etiquette and speed control
(robot exclusion, 30 second intervals, domain level
throttling, speed control for other users)
   manageable and reconfigurable
(interface for statistics and control, system setup)
Details: (lots of ‘em)
       robot exclusion
    - robots.txt file and meta tags (see the robots.txt sketch after this list)
    - robot exclusion adds overhead
       handling filetypes
   (exclude some extensions, and use mime types)
      URL extensions and CGI scripts
   (to strip or not to strip? Ignore?)
      frames, imagemaps
      black holes (robot traps)
   (limit maximum depth of a site)
      different names for same site?
   (could check IP address, but no perfect solution)
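For illustration, robots.txt handling can be sketched with Python's standard urllib.robotparser (in PolyBot itself the manager performs robot exclusion by issuing requests through the downloaders, as described later); the URL and agent name below are hypothetical:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")   # hypothetical site
    rp.read()                                     # fetch and parse robots.txt
    if rp.can_fetch("PolyBot", "http://example.com/some/page.html"):
        pass  # this URL may be downloaded
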
Crawling courtesy
    minimize load on the crawled server
    no more than one outstanding request per site
    better: wait 30 seconds between accesses to a site
 (this number is not fixed)
    problems:
 - one server may have many sites
 - one site may be on many servers
 - 3 years to crawl a 3-million page site!
    give contact info for large crawls
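The "3 years" figure is simply the 30-second rule applied to a single large site; a quick sanity check:

    pages = 3_000_000
    delay = 30                                 # seconds between accesses to one site
    years = pages * delay / (3600 * 24 * 365)
    print(round(years, 1))                     # ~2.9 years for one 3-million page site
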
Contributions:
    distributed architecture based on collection of services
- separation of concerns
- efficient interfaces
    I/O efficient techniques for URL handling
- lazy URL-seen structure
- manager data structures
    scheduling policies
- manager scheduling and shuffling
    resulting system limited by network and parsing performance
    detailed description and how-to (limited experiments)
2. PolyBot System Architecture
Structure:
  separation of crawling strategy and basic system
  collection of scalable distributed services
(DNS, downloading, scheduling, strategy)
  for clusters and wide-area distributed
  optimized per-node performance
  no random disk accesses (no per-page access)
Basic Architecture (ctd):
     application issues requests to manager
     manager does DNS and robot exclusion
     manager schedules URL on downloader
     downloader gets file and puts it on disk
     application is notified of new files
     application parses new files for hyperlinks
     application sends data to storage component
(indexing done later)
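A highly simplified sketch of this division of labor; the component and method names below mirror the slides but are purely illustrative, and the real system exchanges batched files rather than making direct calls:

    def crawl_step(application, manager, downloader, storage):
        url = application.next_request()               # application decides what to crawl
        addr = manager.resolve_dns(url)                # manager: DNS lookup
        if manager.robots_allow(url):                  # manager: robot exclusion
            path = downloader.fetch_to_disk(url, addr) # downloader writes the file to disk
            application.notify(path)                   # application learns about the new file
            for link in application.parse_links(path): # application parses for hyperlinks
                application.enqueue(link)              # new URLs feed back into the cycle
            storage.store(path)                        # page data goes to the storage component
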
System components:
   downloader: optimized HTTP client written in Python
(everything else in C++)
   DNS resolver: uses asynchronous DNS library
   manager uses Berkeley DB and STL for external and
internal data structures
   manager does robot exclusion by generating requests
to downloaders and parsing files
   application does parsing and handling of URLs
(has this page already been downloaded?)
Scaling the system:
   small system on previous pages:
3-5 workstations and 250-400 pages/sec peak
   can scale up by adding downloaders and DNS
   resolvers
   at 400-600 pages/sec, application becomes bottleneck
   at 8 downloaders, the manager becomes the bottleneck
need to replicate application and manager
   hash-based technique (Internet Archive crawler)
partitions URLs and hosts among application parts
   data transfer batched and via file system (NFS)
Scaling up:

[Figure: scaled-up configuration with replicated application, manager, and downloader components.]

    20 machines
    1500 pages/s? (depends on crawl strategy)
    hash to nodes based on site (because of robot exclusion)
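Assigning URLs to nodes by hashing the site rather than the full URL keeps robot exclusion and per-host politeness local to one node; a possible hash, with the hash function and node count chosen arbitrarily:

    import hashlib
    from urllib.parse import urlsplit

    def node_for_url(url, num_nodes):
        # hash the hostname only, so every URL of a site lands on the same node
        host = urlsplit(url).hostname or ""
        digest = hashlib.md5(host.encode()).digest()
        return int.from_bytes(digest[:4], "big") % num_nodes

    node_for_url("http://www.example.com/a/b.html", 8)   # same node for all URLs on this host
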
3. Data Structures and Techniques
Crawling Application
   parsing using the pcre library
   NFS eventually bottleneck
   URL-seen problem:
- need to check if file has been parsed or downloaded before
- after 20 million pages, we have “seen” over 100 million URLs
- each URL is 50 to 75 bytes on average
   Options: compress URLs in main memory, or use disk
- prefix+huffman coding (DEC, JY01) or Bloom Filter (Archive)
- disk access with caching (Mercator)
- we use lazy/bulk operations on disk
Implementation of URL-seen check:
- while less than a few million URLs seen, keep in main memory
- then write URLs to file in alphabetic, prefix-compressed order
- collect new URLs in memory and periodically perform a bulk check
by merging new URLs into the file on disk (a sketch follows below)
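A minimal sketch of this lazy bulk check; a plain sorted text file stands in for the real prefix-compressed structure, and the whole file is read into memory here, whereas the real system streams and compresses it:

    def bulk_url_seen_check(new_urls, seen_file="urls_seen.txt"):
        # seen_file: previously seen URLs, one per line, sorted (assumed to exist)
        with open(seen_file) as f:
            old = [line.strip() for line in f]
        pending = sorted(set(new_urls))          # URLs collected since the last check
        unseen, merged, i, j = [], [], 0, 0
        while i < len(old) and j < len(pending):           # standard sorted merge
            if old[i] < pending[j]:
                merged.append(old[i]); i += 1
            elif old[i] > pending[j]:
                unseen.append(pending[j])                   # never seen before
                merged.append(pending[j]); j += 1
            else:                                           # already seen: keep one copy
                merged.append(old[i]); i += 1; j += 1
        merged.extend(old[i:])
        unseen.extend(pending[j:])
        merged.extend(pending[j:])
        with open(seen_file, "w") as f:                     # rewrite the sorted file on disk
            f.writelines(u + "\n" for u in merged)
        return unseen                                       # candidates for downloading
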
   When is a newly parsed URL downloaded?
   Reordering request stream
- want to space out requests from same subdomain
- needed due to load on small domains and due to security tools
- sort URLs with hostname reversed (e.g., com.amazon.www),
and then “unshuffle” the stream (provable load balance)
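A sketch of the sort-then-unshuffle step; the stride-based unshuffle below is one simple way to spread same-host URLs apart and is not necessarily the exact permutation PolyBot uses:

    from urllib.parse import urlsplit

    def reversed_host_key(url):
        # www.amazon.com -> com.amazon.www, so a site's URLs sort together
        host = urlsplit(url).hostname or ""
        return ".".join(reversed(host.split(".")))

    def reorder_requests(urls, stride=100):
        ordered = sorted(urls, key=reversed_host_key)
        # walk the sorted list in strides so neighbors (same host) end up far apart
        return [ordered[i]
                for start in range(stride)
                for i in range(start, len(ordered), stride)]
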
Crawling Manager
  large stream of incoming URL request files
  goal: schedule URLs roughly in the order that they
come, while observing time-out rule (30 seconds)
and maintaining high speed
  must do DNS and robot exclusion “right before” download
  keep requests on disk as long as possible!
- otherwise, structures grow too large after few million pages
(performance killer)
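A toy version of the pacing rule, keyed by the earliest time each host may be contacted again; the real manager keeps most of this state on disk in Berkeley DB and batches its decisions:

    import heapq, time

    class HostQueue:
        def __init__(self, timeout=30):
            self.timeout = timeout            # seconds between requests to one host
            self.ready = []                   # heap of (next_allowed_time, host)

        def add_host(self, host):
            heapq.heappush(self.ready, (0.0, host))

        def next_host(self):
            t, host = heapq.heappop(self.ready)
            delay = t - time.time()
            if delay > 0:
                time.sleep(delay)             # observe the 30-second rule
            # host becomes available again only after the timeout
            heapq.heappush(self.ready, (time.time() + self.timeout, host))
            return host
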
Manager Data Structures:

[Figure: manager data structures (diagram not reproduced).]

  when to insert new URLs into internal structures?
URL Loading Policy
  read new request file from disk whenever less than x hosts in ready queue
  choose x > speed * timeout (e.g., 100 pages/s * 30s)
  # of current host data structures is
x + speed * timeout + n_down + n_transit
which is usually < 2x
  nice behavior for BDB caching policy
  performs reordering only when necessary!
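Plugging the slide's example numbers into this bound (x chosen somewhat above speed * timeout):

    speed, timeout = 100, 30            # pages/s and per-host wait in seconds
    x = 4000                            # must exceed speed * timeout = 3000 hosts
    structures = x + speed * timeout    # plus small n_down and n_transit terms
    print(structures, 2 * x)            # 7000 vs. 8000: stays below 2x as the slide claims
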
4. Experimental Results
   crawl of 120 million pages over 19 days
161 million HTTP requests
16 million robots.txt requests
138 million successful non-robots requests
17 million HTTP errors (401, 403, 404, etc.)
121 million pages retrieved
   slow during day, fast at night
   peak about 300 pages/s over T3
   many downtimes due to attacks, crashes, revisions
   “slow tail” of requests at the end (4 days)
   lots of things happen
Experimental Results ctd.




    [Figure: traffic on the Poly T3 connection over 24 hours of 5/28/01: bytes in, bytes out, and frames out (courtesy of AppliedTheory).]
Experimental Results ctd.
   sustaining performance:
- will find out when data structures hit disk
- I/O-efficiency vital
   speed control tricky
- vary number of connections based on feedback
- also upper bound on connections
- complicated interactions in system
- not clear what we should want
  other configuration: 140 pages/sec sustained
on 2 Ultra10 with 60GB EIDE and 1GB/768MB
  similar for Linux on Intel
More Detailed Evaluation (to be done)
   Problems
- cannot get commercial crawlers
- need simulation system to find system bottlenecks
- often not much of a tradeoff (get it right!)
   Example: manager data structures
- with our loading policy, manager can feed several
downloaders
- naïve policy: disk access per page
   parallel communication overhead
- low for limited number of nodes (URL exchange)
- wide-area distributed: where do you want the data?
- more relevant for highly distributed systems
5. Discussion and Open Problems
Related work
    Mercator (Heydon/Najork from DEC/Compaq)
- used in AltaVista
- centralized system (2-CPU Alpha with RAID disks)
- URL-seen test by fast disk access and caching
- one thread per HTTP connection
- completely in Java, with pluggable components
    Atrax: very recent distributed extension to Mercator
- combines several Mercators
- URL hashing, and off-line URL check (as we do)
Related work (ctd.)
    early Internet Archive crawler (circa 96)
- uses hashing to partition URLs between crawlers
- Bloom filter for “URL seen” structure
    early Google crawler (1998)
    P2P crawlers (grub.org and others)
    Cho/Garcia-Molina (WWW 2002)
- study of overhead/quality tradeoff in parallel crawlers
- difference: we scale services separately, and focus on
single-node performance
- in our experience, parallel overhead low
Open Problems:
    Measuring and tuning peak performance
- need simulation environment
- eventually reduces to parsing and network
- to be improved: space, fault-tolerance (Xactions?)
    Highly Distributed crawling
- highly distributed (e.g., grub.org)? (maybe)
- hybrid? (different services)
- few high-performance sites? (several Universities)
    Recrawling and focused crawling strategies
- what strategies?
- how to express?
- how to implement?
