Advanced Python: Scraping the Sunshine Government Affairs Platform with Scrapy
1. Goal
- Crawl http://wz.sun0769.com/political/index/politicsNewest
- Extract each post's title and detail-page content
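2. Page analysis
- 1. List page URL: http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1
Changing the value in page={} selects a different page, and the list content is present in the page source.
This program paginates by locating the next-page button, extracting its link, and sending a request for that link.
- 2. The detail page URL can be built by inspecting the list page, extracting each href, and joining it with the site address:
//ul[@class='title-state-ul']//li/span[3]/a/@href
- 3. The detail page content can be read from the pre tag.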
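The two URL steps described above (filling in page={} and joining an extracted href with the site address) can be sketched with the standard library alone. This is a minimal illustration, not the spider itself: the helper names `list_page_url` and `detail_page_url` and the sample href are hypothetical, chosen only to show the joining step.

```python
from urllib.parse import urljoin

# List-page URL template taken from the analysis above; {} is the page number.
LIST_URL = "http://wz.sun0769.com/political/index/politicsNewest?id=1&page={}"


def list_page_url(page):
    """Return the list-page URL for the given page number."""
    return LIST_URL.format(page)


def detail_page_url(list_url, href):
    """Resolve an href taken from the list page against that page's URL."""
    return urljoin(list_url, href)


first_page = list_page_url(1)
# Hypothetical relative href, used only to demonstrate the join.
example_href = "/political/politics/index?id=1"

print(first_page)
print(detail_page_url(first_page, example_href))
```

Inside a Scrapy callback, `response.urljoin(href)` performs the same resolution against the URL of the response, so the base address does not have to be hard-coded.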
3. Example code
1. Create the project
scrapy startproject sunSpider
cd sunSpider
scrapy genspider sun sun0769.com
2. start.py
from scrapy import cmdline
# cmdline.execute(['scrapy','crawl','sun'])
cmdline.execute('scrapy crawl sun'.split())
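Both forms in start.py pass the same argument list to `cmdline.execute`, because `str.split()` breaks the command on whitespace. The sketch below shows that equivalence and one caveat: if an argument contains a space, `str.split()` cuts it apart, while `shlex.split()` honours the quotes. The output file name `sun items.json` is a made-up example.

```python
import shlex

# The string form used in start.py splits into the explicit list form.
simple = "scrapy crawl sun"
simple_argv = simple.split()
print(simple_argv == ["scrapy", "crawl", "sun"])

# A command with a quoted argument (hypothetical output file name).
quoted = 'scrapy crawl sun -o "sun items.json"'
plain_argv = quoted.split()        # splits inside the quotes
shlex_argv = shlex.split(quoted)   # keeps the quoted argument whole

print(plain_argv)
print(shlex_argv)
```

For the plain `scrapy crawl sun` command either approach works; the list form or `shlex.split()` only becomes necessary once quoted arguments are involved.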
3. items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
# Always define the field names in items.py so that a mistyped name in the spider code is caught
class SunspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    href = scrapy.Field()
    content = scrapy.Field()