Using Elasticsearch with Scrapy, and distributed crawling with scrapy-redis

干啥都要好好干！

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/chasejava/article/details/80024698

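This article describes how to set up an integrated Elasticsearch and Kibana environment, including matching software versions, the choice of development tools, and the design of a basic index data model. An example shows how to scrape web page data with Python's Scrapy framework and store it in Elasticsearch.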

kibana-5.1.2-windows-x86

elasticsearch-rtf

elasticsearch-head

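The elasticsearch-rtf version should be as close as possible to the Kibana version; the setup steps can be found in the projects' GitHub repositories.

If npm is needed, download and install Node.js as well.

Create a models folder in the project, similar to Django.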

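```python
# DocType is the document base class in older elasticsearch_dsl releases
# (later renamed Document).
from elasticsearch_dsl import DocType, Date, Integer, Keyword, Text
from elasticsearch_dsl.connections import connections

# Register a default connection to the local Elasticsearch node.
connections.create_connection(hosts=["localhost"])


class jobboleItemsType(DocType):
    # Text fields are analyzed with the ik_max_word analyzer.
    title = Text(analyzer="ik_max_word")
    date_time = Date()
    style = Text(analyzer="ik_max_word")
    content = Text(analyzer="ik_max_word")
    cherish = Integer()
    # Keyword fields are indexed as-is, without analysis.
    image_url = Keyword()
    img_path = Keyword()

    class Meta:
        # Index and document type that the documents are saved under.
        index = "job_bole"
        doc_type = "article"


if __name__ == "__main__":
    # Create the index and its mapping in Elasticsearch.
    jobboleItemsType.init()
```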

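As shown above, define one field for each field of the item, much like creating a table in a database.

In the corresponding item class: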

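```python
# Method of the item class; jobboleItemsType comes from the models module above.
def save_to_es(self):
    # Copy the scraped fields into an Elasticsearch document.
    article = jobboleItemsType()
    article.title = self['title']
    article.content = self['content']
    article.date_time = self['date_time']
    article.cherish = self['cherish']
    article.image_url = self['image_url']
    # article.img_path = self['img_path']
    # Use the item's id as the document id.
    article.meta.id = self['id']
    article.save()
```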

Call this method from the corresponding pipeline, and the data will be stored in Elasticsearch.
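A minimal sketch of such a pipeline could look like the following. The class name `ElasticsearchPipeline` is hypothetical and not from the original project; the only assumption about the item is that it exposes the `save_to_es()` method described in this article.

```python
class ElasticsearchPipeline:
    """Hypothetical pipeline that stores every scraped item in Elasticsearch."""

    def process_item(self, item, spider):
        # Delegate persistence to the item's own save_to_es() method.
        item.save_to_es()
        # Return the item so that any later pipelines still receive it.
        return item
```

Keeping the Elasticsearch logic on the item means the pipeline stays the same when more item types are added, as long as each one provides its own `save_to_es()`.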

After writing the pipeline, remember to register it in settings.
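Registration is done through Scrapy's `ITEM_PIPELINES` setting. In the sketch below, the dotted path `myproject.pipelines.ElasticsearchPipeline` is a placeholder for a pipeline class in your own project, and the priority `300` is an arbitrary choice.

```python
# settings.py (fragment)
# The dotted path is hypothetical; replace it with the path to your pipeline class.
# The number sets the order: pipelines with lower values run first.
ITEM_PIPELINES = {
    "myproject.pipelines.ElasticsearchPipeline": 300,
}
```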

Distributed crawling-------------------------------------

Download and install scrapy-redis.

Copy the folder C:\Users\chase\Desktop\scrapy-redis-master\src\scrapy_redis into the project, then configure it as described in the README on GitHub.
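As a rough guide, the configuration described in the scrapy-redis README comes down to a few entries in settings. The fragment below is a sketch, not the author's exact configuration; the Redis address is an assumption (a local server on the default port).

```python
# settings.py (fragment) for scrapy-redis
# Use the Redis-backed scheduler so all crawler processes share one request queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Deduplicate requests through Redis so processes do not crawl the same URL twice.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the queue and the dedup set in Redis when a spider stops, so crawls can resume.
SCHEDULER_PERSIST = True
# Assumed address of the Redis server; adjust it to your environment.
REDIS_URL = "redis://localhost:6379"
```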

Then run the two commands below, replacing the names with your own.

run the spider:
```
scrapy runspider myspider.py
```

push urls to redis:

```
redis-cli lpush myspider:start_urls http://google.com
```
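The same push can be done from Python, which is handy when seeding many URLs. This is a sketch: the helper `push_start_urls` is hypothetical, and it assumes the spider reads from the key `<spider name>:start_urls`, as in the command above. The client can be any object with an `lpush(key, value)` method, such as a redis-py `redis.Redis` instance.

```python
def push_start_urls(client, spider_name, urls):
    """Push start URLs onto the Redis list that the spider reads from."""
    # Same key layout as the redis-cli command: "<spider name>:start_urls".
    key = f"{spider_name}:start_urls"
    for url in urls:
        client.lpush(key, url)
    return key


# Usage, assuming redis-py is installed and Redis runs on localhost:
#   import redis
#   client = redis.Redis(host="localhost", port=6379)
#   push_start_urls(client, "myspider", ["http://google.com"])
```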