Learning the Scrapy Framework from Scratch (5): Introducing and Using Items

弹吉他的羊驼

Published 2021-04-08 10:58:01

License: CC 4.0 BY-SA

Categories: Crawlers, scrapy. Tags: python, crawler

Article link: https://blog.csdn.net/qq_43278562/article/details/115508724

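This post shows how to use the Scrapy crawler framework to scrape job listings from the Tencent recruitment site. items.py defines a TencentItem class with the fields title, position, and publish_data. hr.py defines the HrSpider class, which parses each page and extracts the title, position, and publish date of every job. The spider also reads next_url to fetch the next page and keep crawling.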

Continuing the Tencent recruitment example from the previous post, let's look at how to use items.
In items.py, make the following changes:

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    # Field names must match the keys the spider assigns in hr.py
    position = scrapy.Field()
    publish_data = scrapy.Field()

hr.py needs a matching change: import the item class, then instantiate it.

import scrapy
from tencent.items import TencentItem

class HrSpider(scrapy.Spider):
    name = 'hr'
    # allowed_domains = ['tencent.com']
    start_urls = ['http://hr.tencent.com/position.php']

    def parse(self, response):
        tr_list = response.xpath('//table[@class="tablelist"]/tr')[1:-1]
        print(tr_list)
        for tr in tr_list:
            item = TencentItem()
            item['title'] = tr.xpath('./td[1]/a/text()').extract_first()
            item['position'] = tr.xpath('./td[2]/text()').extract_first()
            item['publish_data'] = tr.xpath('./td[5]/text()').extract_first()
            print("item contents:", item)
            yield item

        # Find the URL of the next page
        next_url = response.xpath('//a[@id="next"]/@href').extract_first()
        # On the last page the href is "javascript:;"; None means no link was found
        if next_url and next_url != "javascript:;":
            yield scrapy.Request(
                response.urljoin(next_url),  # resolve the relative href against the page URL
                callback=self.parse,  # pass the method itself, do not call it
            )
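
The spider turns the next-page href into an absolute URL and stops when the href is "javascript:;". The sketch below shows that decision in isolation using only the standard library. The helper name resolve_next and the sample href are hypothetical, and urllib.parse.urljoin is used here as a stand-in for the URL resolution Scrapy performs on a response.

```python
from urllib.parse import urljoin

PAGE_URL = "http://hr.tencent.com/position.php"  # the page being parsed
LAST_PAGE_HREF = "javascript:;"  # href the spider treats as "no next page"


def resolve_next(page_url, href):
    """Return the absolute next-page URL, or None when crawling should stop."""
    if not href or href == LAST_PAGE_HREF:
        return None
    # urljoin resolves a relative href against the URL of the current page.
    return urljoin(page_url, href)


# Hypothetical relative href, for illustration only.
print(resolve_next(PAGE_URL, "position.php?start=10"))
print(resolve_next(PAGE_URL, "javascript:;"))
print(resolve_next(PAGE_URL, None))
```

Checking for a missing href as well as the sentinel value avoids a TypeError when the page contains no next link at all.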
