Working of a Web Crawler

How does a Web Crawler work?
Steps Involved in the Crawling Process. The user specifies the starting URL through the crawler's GUI. All links on that page are retrieved and added to the "crawl frontier", which is a list of the URLs still to visit. Each URL in the crawl frontier is then fetched in turn, and the links found on those pages are added to the frontier as well. The process repeats recursively, as sketched below.
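A minimal sketch of that frontier-driven loop in Python is shown here. The `fetch_links` callable is an illustrative assumption standing in for the fetch-and-parse step described on the following slides, not something specified in the deck.

```python
from collections import deque

def crawl(start_url, fetch_links, max_pages=100):
    """Breadth-first crawl. fetch_links(url) is any callable that
    returns the URLs linked from a page (a stand-in for the real
    fetching and parsing shown later)."""
    frontier = deque([start_url])   # the "crawl frontier": URLs still to visit
    visited = set()                 # URLs already processed

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue                # skip duplicates already crawled
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return visited
```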
The starting URL is specified through the WebSPHINX Web Crawler's GUI (the slide shows a screenshot of the interface).
The starting URL is the root of the tree. The crawler "checks" whether the URL exists by sending an HTTP request to the server, then parses the page, retrieves all of its links, and repeats the same process on each link obtained. A sketch of this fetch-and-parse step follows.
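For concreteness, here is a hedged sketch of that step using only the Python standard library; the names `fetch_links` and `LinkCollector` are illustrative assumptions. A failed request is treated as a URL that does not exist or cannot be reached.

```python
from html.parser import HTMLParser
from urllib.error import URLError
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags while parsing HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch_links(url, timeout=10):
    """Send an HTTP request; if the page exists and is HTML,
    parse it and return the absolute URLs of all links found."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            if "text/html" not in resp.headers.get("Content-Type", ""):
                return []
            html = resp.read().decode("utf-8", errors="replace")
    except URLError:
        return []            # server unreachable or URL does not exist
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(url, href) for href in parser.links]
```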
The process is repeated recursively down the tree: from the root of the tree to the children of the root, then to their children, and so on. Throughout, the crawler must constantly check links for duplication; otherwise the redundant visits take a toll on the efficiency of the crawling process. One way to do this is to normalise each URL and keep a set of URLs already seen, as sketched below.
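The slides do not prescribe a particular duplicate check; one simple, assumed approach is to reduce each URL to a canonical form before comparing it against a set of previously seen URLs. The helpers `normalize` and `is_new` below are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Reduce a URL to a canonical form so that trivially different
    spellings of the same page are not crawled twice."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        parts.query,
        "",                      # drop the #fragment: same document
    ))

seen = set()

def is_new(url):
    """Return True only the first time a (normalised) URL is seen."""
    key = normalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```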
In theory, this process should continue until every link has been retrieved. In practice, the crawler typically goes only five levels of depth from the home page of each URL it visits; beyond that it concludes there is no need to go further, although crawling can still continue from other URLs. In the diagram, the process stops after five depths, as illustrated by the depth-limited loop below.
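A possible way to enforce such a limit, assuming the same frontier loop as before, is to store each URL together with its depth and stop expanding links once the maximum depth is reached.

```python
from collections import deque

def crawl_depth_limited(start_url, fetch_links, max_depth=5):
    """Frontier loop where each entry carries its distance from the
    start page; links deeper than max_depth are never enqueued."""
    frontier = deque([(start_url, 0)])
    visited = set()

    while frontier:
        url, depth = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        if depth == max_depth:
            continue                      # reached the depth limit: stop expanding
        for link in fetch_links(url):
            if link not in visited:
                frontier.append((link, depth + 1))
    return visited
```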
In the diagram, the red crosses mark URLs from which crawling cannot continue. This can arise in the following cases:
1) The server hosting the URL takes too long to respond.
2) The server does not allow the crawler access.
3) The URL is left out to avoid duplication.
4) The crawler has been specifically designed to ignore such pages (the "politeness" of a crawler).
A common way to implement politeness is to consult the site's robots.txt before fetching, as sketched below.
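The sketch uses Python's standard urllib.robotparser; the caching helper `allowed` and the user-agent string are illustrative assumptions rather than anything taken from the slides.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

_robots_cache = {}

def allowed(url, user_agent="MyCrawler"):
    """Check the site's robots.txt before fetching; a polite crawler
    skips any URL the site owner has asked crawlers to ignore."""
    parts = urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    rp = _robots_cache.get(robots_url)
    if rp is None:
        rp = RobotFileParser(robots_url)
        try:
            rp.read()
        except OSError:
            pass    # robots.txt unreachable: can_fetch() errs on the side of not crawling
        _robots_cache[robots_url] = rp
    return rp.can_fetch(user_agent, url)
```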
Why go only five depths? Normally, five depths or levels of search are enough to gather the majority of the information reachable from a website's home page. The limit is also a precaution against "spider traps": web pages that contain an infinite loop within them, e.g. http://webcrawl.com/web/crawl/web/crawl... The crawler becomes trapped in the page or can even crash. Spider traps can be intentional or unintentional. They are set intentionally to trap crawlers, because crawlers eat up the site's bandwidth; they arise unintentionally in cases such as a dynamically generated calendar, where each date points to the next date and each year to the next year. A crawler's ability to avoid spider traps is known as the "robustness" of the crawler. A simple heuristic guard against such traps is sketched below.
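One crude but common heuristic, again an assumption rather than part of the original deck, is to reject URLs that are unusually long or whose path repeats the same segment many times, as in the looping example above.

```python
from urllib.parse import urlsplit

def looks_like_trap(url, max_length=200, max_repeats=3):
    """Heuristic spider-trap filter: very long URLs, or paths where the
    same segment repeats many times (e.g. /web/crawl/web/crawl/...),
    are usually machine-generated links that never terminate."""
    if len(url) > max_length:
        return True
    segments = [s for s in urlsplit(url).path.split("/") if s]
    for seg in set(segments):
        if segments.count(seg) > max_repeats:
            return True
    return False
```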
