A Powerful Spider (Web Crawler) System in Python

191 files in total

py: 108

md: 24

js: 13


Uploaded 2024-03-01 21:47:16, 114 views, 2.54MB ZIP


A Powerful Spider (Web Crawler) System in Python (191 files)

.babelrc 28B

CSDN关注我不迷路.bmp 2.79MB

logging.conf 763B

.coveragerc 350B

debug.min.css 5KB

index.min.css 2KB

tasks.min.css 1KB

task.min.css 783B

result.min.css 390B

Dockerfile 1KB

.gitignore 339B

.gitignore 5B

index.html 9KB

debug.html 6KB

task.html 4KB

result.html 4KB

tasks.html 2KB

MANIFEST.in 175B

tox.ini 300B

debug.min.js 24KB

debug.js 19KB

splitter.js 10KB

phantomjs_fetcher.js 7KB

css_selector_helper.js 7KB

index.js 6KB

puppeteer_fetcher.js 6KB

index.min.js 4KB

css_selector_helper.min.js 4KB

webpack.config.js 723B

result.min.js 257B

tasks.min.js 256B

task.min.js 255B

package.json 627B

config_example.json 423B

debug.less 7KB

index.less 2KB

task.less 1023B

result.less 641B

tasks.less 556B

variable.less 545B

LICENSE 11KB

splash_fetcher.lua 6KB

Command-Line.md 9KB

self.crawl.md 8KB

HTML-and-CSS-Selector.md 8KB

AJAX-and-more-HTTP.md 7KB

Architecture.md 5KB

Deployment-demo.pyspider.org.md 4KB

Deployment.md 4KB

Quickstart.md 4KB

Render-with-PhantomJS.md 3KB

Frequently-Asked-Questions.md 3KB

index.md 3KB

README.md 3KB

Working-with-Results.md 3KB

About-Projects.md 2KB

Running-pyspider-with-Docker.md 2KB

About-Tasks.md 2KB

Response.md 2KB

self.send_message.md 1KB

Script-Environment.md 1KB

@every.md 729B

ISSUE_TEMPLATE.md 659B

@catch_status_code_error.md 542B

index.md 396B

index.md 198B

demo.png 834KB

inspect_element.png 252KB

request-headers.png 237KB

search-for-request.png 232KB

css_selector_helper.png 124KB

tutorial_imdb_front.png 94KB

developer-tools-network.png 90KB

index_page.png 77KB

twitch.png 47KB

creating_a_project.png 42KB

run_one_step.png 29KB

developer-tools-network-filter.png 19KB

pyspider-arch.png 17KB

scheduler.py 46KB

run.py 32KB

tornado_fetcher.py 31KB

test_scheduler.py 28KB

test_database.py 26KB

test_fetcher.py 24KB

test_fetcher_processor.py 23KB

test_processor.py 20KB

test_webui.py 20KB

base_handler.py 15KB

test_run.py 14KB

counter.py 12KB

pprint.py 12KB

utils.py 12KB

project_module.py 9KB

task_queue.py 9KB

rabbitmq.py 9KB

bench.py 8KB

test_message_queue.py 8KB

processor.py 8KB

response.py 7KB


pyspider [![Build Status]][Travis CI] [![Coverage Status]][Coverage]
========

A Powerful Spider(Web Crawler) System in Python.

- Write script in Python
- Powerful WebUI with script editor, task monitor, project manager and result viewer
- [MySQL](https://www.mysql.com/), [MongoDB](https://www.mongodb.org/), [Redis](http://redis.io/), [SQLite](https://www.sqlite.org/), [Elasticsearch](https://www.elastic.co/products/elasticsearch); [PostgreSQL](http://www.postgresql.org/) with [SQLAlchemy](http://www.sqlalchemy.org/) as database backend
- [RabbitMQ](http://www.rabbitmq.com/), [Redis](http://redis.io/) and [Kombu](http://kombu.readthedocs.org/) as message queue
- Task priority, retry, periodical, recrawl by age, etc...
- Distributed architecture, Crawl Javascript pages, Python 2.{6,7}, 3.{3,4,5,6} support, etc...
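
To make "task priority" and "recrawl by age" from the feature list more concrete, here is a minimal standard-library sketch of the two ideas. It is an illustration only and not pyspider's scheduler code: `is_stale` and `ToyTaskQueue` are hypothetical names invented for this example, and the sketch assumes that a higher priority value means "run sooner" and that a page is only fetched again once its last crawl is older than `age` seconds.

```python
import heapq
import time


def is_stale(last_crawled, age, now=None):
    """Return True if a page should be fetched again.

    last_crawled: UNIX timestamp of the previous fetch, or None if never fetched.
    age: validity period in seconds (assumed semantics, see lead-in).
    """
    now = time.time() if now is None else now
    return last_crawled is None or now - last_crawled >= age


class ToyTaskQueue:
    """Hypothetical priority queue: higher `priority` values are popped first."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps insertion order for equal priorities

    def put(self, url, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

With this sketch, a URL queued with `priority=5` is returned before URLs queued earlier with the default priority, and a page fetched 10 seconds ago with `age=60` is not considered stale.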
- Tutorial: [http://docs.pyspider.org/en/latest/tutorial/](http://docs.pyspider.org/en/latest/tutorial/)
- Documentation: [http://docs.pyspider.org/](http://docs.pyspider.org/)
- Release notes: [https://github.com/binux/pyspider/releases](https://github.com/binux/pyspider/releases)

Sample Code
-----------

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    # Re-run the entry point once every 24 hours.
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    # Treat a fetched index page as still valid for 10 days.
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every absolute link found on the page.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict is stored as the result for this URL.
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```

Installation
------------

* `pip install pyspider`
* run command `pyspider`, visit [http://localhost:5000/](http://localhost:5000/)

**WARNING:** WebUI is open to the public by default and can be used to execute any command, which may harm your system. Please use it in an internal network or [enable `need-auth` for webui](http://docs.pyspider.org/en/latest/Command-Line/#-config).
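
As a follow-up to the warning above, the sketch below writes a small JSON config file that turns on `need-auth` for the WebUI. The `webui` section with `username`, `password` and `need-auth` follows the layout described in the linked Command-Line and Deployment docs as far as I know; the helper names `build_webui_config` and `write_config`, the file name `config.json`, and the credentials are placeholders chosen for this example. Check the documentation for your installed version before relying on it.

```python
import json


def build_webui_config(username, password):
    """Build a config dict that asks the WebUI to require a login."""
    return {
        "webui": {
            "username": username,   # placeholder credentials, replace with your own
            "password": password,
            "need-auth": True,
        }
    }


def write_config(path, config):
    """Write the config as JSON so it can be passed to pyspider with -c."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)


if __name__ == "__main__":
    write_config("config.json", build_webui_config("some_name", "some_passwd"))
```

The generated file would then be passed on the command line, for example `pyspider -c config.json`.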
Quickstart: [http://docs.pyspider.org/en/latest/Quickstart/](http://docs.pyspider.org/en/latest/Quickstart/)

Contribute
----------

* Use It
* Open [Issue], send PR
* [User Group]
* [Chinese Q&A](http://segmentfault.com/t/pyspider)

TODO
----

### v0.4.0

- [ ] a visual scraping interface like [portia](https://github.com/scrapinghub/portia)

License
-------

Licensed under the Apache License, Version 2.0

[Build Status]: https://img.shields.io/travis/binux/pyspider/master.svg?style=flat
[Travis CI]: https://travis-ci.org/binux/pyspider
[Coverage Status]: https://img.shields.io/coveralls/binux/pyspider.svg?branch=master&style=flat
[Coverage]: https://coveralls.io/r/binux/pyspider
[Try]: https://img.shields.io/badge/try-pyspider-blue.svg?style=flat
[Issue]: https://github.com/binux/pyspider/issues
[User Group]: https://groups.google.com/group/pyspider-users
