# Scrapy Cluster
This Scrapy project uses Redis and Kafka to create a distributed, on-demand scraping cluster.
The goal is to distribute seed URLs among many waiting spider instances, whose requests are coordinated via Redis. Any other crawls those trigger, as a result of frontier expansion or depth traversal, will also be distributed among all workers in the cluster.
The input to the system is a set of Kafka topics and the output is a set of Kafka topics. Raw HTML and assets are crawled interactively, spidered, and output to the log. For easy local development, you can also disable the Kafka portions and work with the spider entirely via Redis, although this is not recommended due to the serialization of the crawl requests.
## Dependencies
Please see the ``requirements.txt`` within each subproject for pip package dependencies.
Other important components required to run the cluster:
- Python 2.7 or 3.6: https://www.python.org/downloads/
- Redis: http://redis.io
- Zookeeper: https://zookeeper.apache.org
- Kafka: http://kafka.apache.org
## Core Concepts
This project brings several new concepts to Scrapy and to large-scale distributed crawling in general:
- The spiders are dynamic and on demand, meaning that they allow the arbitrary collection of any web page that is submitted to the scraping cluster
- Scale Scrapy instances across a single machine or multiple machines
- Coordinate and prioritize their scraping effort for desired sites
- Persist data across scraping jobs
- Execute multiple scraping jobs concurrently
- Provides in-depth access to information about your scraping job: what is upcoming and how the sites are ranked
- Allows you to arbitrarily add/remove/scale your scrapers from the pool without loss of data or downtime
- Utilizes Apache Kafka as a data bus for any application to interact with the scraping cluster (submit jobs, get info, stop jobs, view results)
- Allows for coordinated throttling of crawls from independent spiders on separate machines that sit behind the same IP address
- Enables completely different spiders to yield crawl requests to each other, giving flexibility to how the crawl job is tackled
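
To make the coordinated throttling idea more concrete, here is a minimal sketch of a sliding-window limiter keyed by domain. It is an illustration only, not Scrapy Cluster's implementation: the `SharedThrottle` class and its parameters are hypothetical, and an in-memory dictionary stands in for the shared Redis state that spiders on separate machines would consult.

```python
import time
from collections import defaultdict, deque


class SharedThrottle:
    """Hypothetical limiter: at most `hits` requests per `window` seconds per key."""

    def __init__(self, hits, window, clock=time.monotonic):
        self.hits = hits
        self.window = window
        self.clock = clock
        # key (e.g. a domain) -> timestamps of recently allowed requests.
        # In a real cluster this state would live in a shared store such as Redis.
        self._stamps = defaultdict(deque)

    def allow(self, key):
        """Return True and record the request if `key` is under its limit."""
        now = self.clock()
        stamps = self._stamps[key]
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) < self.hits:
            stamps.append(now)
            return True
        return False
```

Every spider that asks `allow("example.com")` against the same state shares one budget for that domain, which is what keeps independent workers from collectively exceeding the limit.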
## Scrapy Cluster test environment
To set up a pre-canned Scrapy Cluster test environment, make sure you have [Docker](https://www.docker.com/).
### Steps to launch the test environment:
1. Build your containers (or omit `--build` to pull from Docker Hub)
```
docker-compose up -d --build
```
2. Tail Kafka to view your future results
```
docker-compose exec kafka_monitor python kafkadump.py dump -t demo.crawled_firehose -ll INFO
```
3. From another terminal, feed a request to Kafka
```
curl localhost:5343/feed -H "content-type:application/json" -d '{"url": "http://dmoztools.net", "appid":"testapp", "crawlid":"abc123"}'
```
4. Validate you've got data!
```
# wait a couple of seconds; your terminal from step 2 should dump JSON data
{u'body': '...content...', u'crawlid': u'abc123', u'links': [], u'encoding': u'utf-8', u'url': u'http://dmoztools.net', u'status_code': 200, u'status_msg': u'OK', u'response_url': u'http://dmoztools.net', u'request_headers': {u'Accept-Language': [u'en'], u'Accept-Encoding': [u'gzip,deflate'], u'Accept': [u'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], u'User-Agent': [u'Scrapy/1.5.0 (+https://scrapy.org)']}, u'response_headers': {u'X-Amz-Cf-Pop': [u'IAD79-C3'], u'Via': [u'1.1 82c27f654a5635aeb67d519456516244.cloudfront.net (CloudFront)'], u'X-Cache': [u'RefreshHit from cloudfront'], u'Vary': [u'Accept-Encoding'], u'Server': [u'AmazonS3'], u'Last-Modified': [u'Mon, 20 Mar 2017 16:43:41 GMT'], u'Etag': [u'"cf6b76618b6f31cdec61181251aa39b7"'], u'X-Amz-Cf-Id': [u'y7MqDCLdBRu0UANgt4KOc6m3pKaCqsZP3U3ZgIuxMAJxoml2HTPs_Q=='], u'Date': [u'Tue, 22 Dec 2020 21:37:05 GMT'], u'Content-Type': [u'text/html']}, u'timestamp': u'2020-12-22T21:37:04.736926', u'attrs': None, u'appid': u'testapp'}
```
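
The feed request from step 3 can also be sent from a script. The following is a minimal sketch using only the Python 3 standard library; it assumes the same `localhost:5343/feed` endpoint and JSON fields as the curl command above. The helper names `build_feed_request` and `submit_crawl` are made up for this example and are not part of Scrapy Cluster.

```python
import json
import urllib.request

# Endpoint taken from the curl command in step 3; adjust for your deployment.
FEED_ENDPOINT = "http://localhost:5343/feed"


def build_feed_request(url, appid, crawlid, endpoint=FEED_ENDPOINT):
    """Build (but do not send) a POST request carrying one crawl request."""
    payload = {"url": url, "appid": appid, "crawlid": crawlid}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_crawl(url, appid, crawlid, endpoint=FEED_ENDPOINT):
    """Send the crawl request and return the raw response body as text."""
    request = build_feed_request(url, appid, crawlid, endpoint)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires the test environment from step 1 to be running.
    print(submit_crawl("http://dmoztools.net", "testapp", "abc123"))
```

Building the request separately from sending it makes the payload easy to inspect or log before anything reaches the cluster.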
## Documentation
Please check out the official [Scrapy Cluster documentation](https://scrapy-cluster.readthedocs.io/en/dev/) for more information on how everything works!
## Branches
The `master` branch of this repository contains the latest stable release code for `Scrapy Cluster 1.2`.
The `dev` branch contains bleeding-edge code and is currently working towards [Scrapy Cluster 1.3](https://github.com/istresearch/scrapy-cluster/milestone/3). Please note that not everything may be documented, finished, tested, or finalized, but we are happy to help guide those who are interested.