Scrapy
Patrick O’Brien | @obdit
DataPhilly | 20131118 | Monetate
Steps of data science
● Obtain
● Scrub
● Explore
● Model
● iNterpret
Steps of data science
● Obtain ← Scrapy
● Scrub
● Explore
● Model
● iNterpret
About Scrapy
● Framework for collecting data
● Open source - 100% Python
● Simple and well documented
● OSX, Linux, Windows, BSD
Some of the features
● Built-in selectors
● Generating feed output (sketch below)
  ○ Format: json, csv, xml
  ○ Storage: local, FTP, S3, stdout
● Encoding support and auto-detection
● Stats collection
● Control via a web service
● Handles cookies, auth, robots.txt, user-agent
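
A minimal sketch of what the feed-output feature looks like in a project's settings.py; FEED_FORMAT and FEED_URI are Scrapy's feed-export settings, the values below are illustrative:

# settings.py (sketch) -- values are illustrative
FEED_FORMAT = 'json'                   # or 'csv', 'xml'
FEED_URI = 'file:///tmp/items.json'    # or an ftp://, s3://, or stdout: URI

The same output can typically be produced ad hoc from the command line, e.g. `scrapy crawl myspider -o items.json -t json`.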
Scrapy Architecture
Data flow
1. Engine opens, locates the Spider, and schedules the first URL as a Request
2. Scheduler sends the URL to the Engine, which sends it to the Downloader
3. Downloader sends the completed page as a Response through the middleware to the Engine
4. Engine sends the Response to the Spider through middleware
5. Spider sends Items and new Requests to the Engine
6. Engine sends Items to the Item Pipeline and Requests to the Scheduler (pipeline sketch below)
7. GOTO 2
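
The Item Pipeline in step 6 is not expanded on elsewhere in the deck; a minimal sketch of one (the class name and the price check are illustrative) would be:

from scrapy.exceptions import DropItem

class PriceValidationPipeline(object):
    """Hypothetical pipeline: discard any Item that arrives without a price."""
    def process_item(self, item, spider):
        if not item.get('price'):
            raise DropItem('Missing price in %s' % item)
        return item

Pipelines are activated through the project's ITEM_PIPELINES setting.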
Parts of Scrapy
● Items
● Spider
● Link Extractors
● Selectors
● Request
● Responses
Items
● Main container of structured information
● dict-like objects

from scrapy.item import Item, Field

class Product(Item):
    name = Field()
    price = Field()
    stock = Field()
    last_updated = Field(serializer=str)
Items
>>> product = Product(name='Desktop PC', price=1000)
>>> print product
Product(name='Desktop PC', price=1000)

>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC

>>> product.keys()
['price', 'name']
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]
Spiders
● Define how to move around a site
  ○ which links to follow
  ○ how to extract data
● Cycle
  ○ Initial request and callback
  ○ Store parsed content
  ○ Subsequent requests and callbacks
Generic Spiders
● BaseSpider
● CrawlSpider
● XMLFeedSpider
● CSVFeedSpider
● SitemapSpider
BaseSpider
● Every other spider inherits from BaseSpider
● Two jobs
  ○ Request `start_urls`
  ○ Callback `parse` on resulting response
BaseSpider
...
class MySpider(BaseSpider):
    name = 'example.com'
    allowed_domains = ['example.com']

    # Send Requests for example.com/[1:3].html
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        sel = Selector(response)

        # Yield title Item
        for h3 in sel.xpath('//h3').extract():
            yield MyItem(title=h3)

        # Yield new Request
        for url in sel.xpath('//a/@href').extract():
            yield Request(url, callback=self.parse)
CrawlSpider
● Provide a set of rules on what links to follow
  ○ `link_extractor`
  ○ `callback`

rules = (
    # Extract links matching 'category.php' (but not matching 'subsection.php')
    # and follow links from them (since no callback means follow=True by default).
    Rule(SgmlLinkExtractor(allow=('category.php', ), deny=('subsection.php', ))),

    # Extract links matching 'item.php' and parse them with the spider's method parse_item
    Rule(SgmlLinkExtractor(allow=('item.php', )), callback='parse_item'),
)
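
Putting those rules into a complete spider, a sketch might look like the following (the start URL, the XPaths in parse_item, and the items import path are illustrative; Product is the Item class defined on the Items slide):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from myproject.items import Product   # hypothetical project module

class ExampleCrawlSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('category.php', ), deny=('subsection.php', ))),
        Rule(SgmlLinkExtractor(allow=('item.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        return Product(name=sel.xpath('//h1/text()').extract(),
                       price=sel.xpath('//span[@class="price"]/text()').extract())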
Link Extractors
● Extract links from Response objects
● Parameters include (sketch below):
  ○ allow | deny
  ○ allow_domains | deny_domains
  ○ deny_extensions
  ○ restrict_xpaths
  ○ tags
  ○ attrs
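
For illustration only, a link extractor combining several of these parameters (every pattern and XPath below is made up) could be built and applied to a Response like this:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

lx = SgmlLinkExtractor(
    allow=(r'item\.php', ),                  # URL regexes to accept...
    deny=(r'print\.php', ),                  # ...and to reject
    allow_domains=('example.com', ),
    deny_extensions=('pdf', ),
    restrict_xpaths=('//div[@id="content"]', ),
    tags=('a', 'area'),
    attrs=('href', ),
)
links = lx.extract_links(response)           # list of Link objects found in the page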
Selectors
● Mechanisms for extracting data from HTML
● Built on top of the lxml library
● Two methods
  ○ XPath: sel.xpath('//a[contains(@href, "image")]/@href').extract()
  ○ CSS: sel.css('a[href*=image]::attr(href)').extract()
● A Response object is wrapped into a Selector
  ○ sel = Selector(response)
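
A quick self-contained illustration (not from the talk) that the two methods pull out the same data; a Selector can also be built from a raw HTML string instead of a Response:

from scrapy.selector import Selector

body = '<a href="image1.html">img</a><a href="about.html">about</a>'
sel = Selector(text=body)

sel.xpath('//a[contains(@href, "image")]/@href').extract()   # [u'image1.html']
sel.css('a[href*=image]::attr(href)').extract()              # [u'image1.html']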
Request
● Generated in the Spider, sent to the Downloader
● Represents an HTTP request
● FormRequest subclass performs HTTP POST
  ○ useful to simulate a user login (sketch below)
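
A sketch of that login pattern (the URL, the form field names, and the failure check are all illustrative):

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class LoginSpider(BaseSpider):
    name = 'login-example'
    start_urls = ['http://www.example.com/users/login']

    def parse(self, response):
        # Fill in the login form found on the page and POST it
        return FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login)

    def after_login(self, response):
        if 'authentication failed' in response.body:
            self.log('Login failed')
            return
        # ...logged in: keep crawling, the session cookie is handled for us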
Response
● Comes from the Downloader, sent to the Spider
● Represents an HTTP response
● Subclasses
  ○ TextResponse
  ○ HtmlResponse
  ○ XmlResponse
Advanced Scrapy
● Scrapyd
  ○ application to deploy and run Scrapy spiders
  ○ deploy projects and control them with a JSON API
● Signals
  ○ notify when events occur
  ○ hook into the Signals API for advanced tuning (sketch below)
● Extensions
  ○ custom functionality loaded at Scrapy startup
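
A small sketch of hooking one signal from inside a spider; the dispatcher-based connection shown here is the style the Scrapy docs used at the time, and the handler itself is illustrative:

from scrapy import signals
from scrapy.spider import BaseSpider
from scrapy.xlib.pydispatch import dispatcher

class NotifyingSpider(BaseSpider):
    name = 'signals-example'
    start_urls = ['http://www.example.com']

    def __init__(self, *args, **kwargs):
        super(NotifyingSpider, self).__init__(*args, **kwargs)
        # Call self.spider_closed once the crawl for this spider finishes
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_closed(self, spider):
        self.log('Spider closed: %s' % spider.name)

    def parse(self, response):
        pass

Scrapyd, by contrast, is driven entirely over HTTP, e.g. by POSTing a project and spider name to its schedule.json endpoint to start a crawl.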
More information
● http://doc.scrapy.org/
● https://twitter.com/ScrapyProject
● https://github.com/scrapy
● http://scrapyd.readthedocs.org
Demo
