scrapy redis 配置文件setting参数详解

最新推荐文章于 2025-05-11 08:15:00 发布

原创最新推荐文章于 2025-05-11 08:15:00 发布 · 1k 阅读

1 ·

CC 4.0 BY-SA版权

文章标签：

#redis #分布式

scrapy 专栏收录该内容

2 篇文章

订阅专栏

本文详细介绍如何在Scrapy项目中集成Redis，实现任务队列、去重过滤、持久化存储等功能，适用于分布式爬虫场景。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

scrapy项目 setting.py

#Resis 设置

#使能Redis调度器

SCHEDULER = 'scrapy_redis.scheduler.Scheduler'

#所有spider通过redis使用同一个去重过滤器

DUPEFILTER_CLASS  = 'scrapy_redis.dupefilter.RFPDupeFilter'

#不清除Redis队列、这样可以暂停/恢复 爬取

#SCHEDULER_PERSIST = True

#SCHEDULER_QUEUE_CLASS ='scrapy_redis.queue.PriorityQueue' #默认队列，优先级队列
#备用队列。
#SCHEDULER_QUEUE_CLASS ='scrapy_redis.queue.FifoQueue'  #先进先出队列
#SCHEDULER_QUEUE_CLASS ='scrapy_redis.queue.LifoQueue'  #后进先出队列

#最大空闲时间防止分布式爬虫因为等待而关闭

#SCHEDULER_IDLE_BEFORE_CLOSE = 10


#将抓取的item存储在Redis中以进行后续处理。

ITEM_PIPELINES  = {
     'scrapy_redis.pipelines.RedisPipeline':300,
}

# The item pipeline serializes and stores the items in this redis key.
#item pipeline 将items 序列化 并用如下key名储存在redis中

#REDIS_ITEMS_KEY = '%(spider)s:items'

#默认的item序列化方法是ScrapyJSONEncoder，你也可以使用自定义的序列化方式

#REDIS_ITEMS_SERIALIZER = 'json.dumps'


#设置redis地址 端口 密码

REDIS_HOST = 'localhost'
REDIS_HOST = 6379

#也可以通过下面这种方法设置redis地址 端口和密码，一旦设置了这个，则会覆盖上面所设置的REDIS_HOST和REDIS_HOST

 REDIS_URL = 'redis://root:redis_pass@xxx.xx.xx.xx:6379'  
 #root用户名，redis_pass:你设置的redis验证密码，xxxx:你的主机ip

#你设置的redis其他参数 Custom redis client parameters (i.e.: socket timeout, etc.)
REDIS_PARAMS  = {}


#自定义的redis客户端类
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis ``zrevrange`` and ``zremrangebyrank`` operation. You have to use the ``zadd``
# command to add URLS and Scores to redis queue. This could be useful if you
# want to use priority and avoid duplicates in your start urls list.

#REDIS_START_URLS_AS_SET = False

# 默认的RedisSpider 或 RedisCrawlSpider  start urls key

#REDIS_START_URLS_KEY = '%(name)s:start_urls'

#redis的默认encoding是utf-8，如果你想用其他编码可以进行如下设置：

#REDIS_ENCODING = 'latin1'

类scrapy_redis.spiders.RedisSpider使spider可以从redis数据库中读取URL。Redis队列中的URL将被爬取，如果第一个请求产生更多请求，则spider将处理这些请求，然后再从Redis中获取另一个URL。

创建spider

from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'

    def parse(self, response):
        # do stuff
        pass

在redis-cli设置start_url

redis-cli lpush myspider:start_urls http://google.com