(I) Installing a Python virtual environment
1. Install virtualenv
pip install virtualenv
2. Install virtualenvwrapper
pip install virtualenvwrapper-win
Note: on Linux, run pip install virtualenvwrapper instead.
3. Set the WORKON_HOME environment variable
(1) Create a new folder named ENV inside the Anaconda directory.
(2) Set WORKON_HOME to the path of the ENV folder. With this variable set, any environment created with mkvirtualenv is placed automatically in that folder.
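For example, assuming the ENV folder was created at D:\Anaconda3\ENV (the actual path is whatever you chose above), the variable can be set from the command line; open a new command prompt afterwards so it takes effect:

setx WORKON_HOME D:\Anaconda3\ENV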
4. Create environments
(1) Create the environment spider_1 with mkvirtualenv:
C:\Users\Adminstrate>mkvirtualenv spider_1
(2) After it has been created, list all environments with workon or lsvirtualenv:
C:\Users\Adminstrate>workon
C:\Users\Adminstrate>lsvirtualenv
(3) Activate the environment spider_1 with workon:
C:\Users\Adminstrate>workon spider_1
(4) Exit the virtual environment:
(spider_1) C:\Users\Adminstrate>deactivate
(5) Remove a virtual environment:
C:\Users\Adminstrate>rmvirtualenv spider_1
5. Install packages
After activating the virtual environment with workon spider_1,
install Scrapy and mysqlclient:
(spider_1) C:\Users\Adminstrate>pip install Scrapy
(spider_1) C:\Users\Adminstrate>pip install mysqlclient
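To confirm that both packages are importable from the new environment, a quick check can be run (the printed version depends on what pip installed):

(spider_1) C:\Users\Adminstrate>python -c "import scrapy, MySQLdb; print(scrapy.__version__)"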
(II) Creating the first spider project
(1) Create the project:
(spider_1) C:\Users\Adminatrate>scrapy startproject nlp_spider
Scrapy prints a confirmation message once the project has been created.
Enter the project directory:
(spider_1) C:\Users\Adminatrate>cd nlp_spider
(spider_1) C:\Users\12823\nlp_spider>scrapy genspider tieba http://tieba.baidu.com/f?ie=utf-8&kw=%E8%81%8A%E5%9F%8E%E5%A4%A7%E5%AD%A6&red_tag=b1186180911
Here the URL is the page we want to crawl, and tieba is the name given to the spider for it.
If the error "'kw' is not recognized as an internal or external command, operable program or batch file" appears, the shell has split the command at the & characters in the URL; wrap the URL in quotes:
(spider_1) C:\Users\Adminstrate\nlp_spider>scrapy genspider tieba "http://tieba.baidu.com/f?ie=utf-8&kw=%E8%81%8A%E5%9F%8E%E5%A4%A7%E5%AD%A6&red_tag=b1186180911"
(Disabling redirects with meta={'dont_redirect': True} is something done per request inside the spider code, not an option of genspider.)
After creation, the nlp_spider project has the following directory structure.
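For reference, a project generated this way typically has the layout below (spiders/tieba.py appears after the genspider command above):

nlp_spider/
    scrapy.cfg
    nlp_spider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            tieba.py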
Going down into the inner nlp_spider package, you can see the generated files.
Enter the spiders directory.
Open the tieba.py file.
Modify allowed_domains to ['tieba.baidu.com'].
Modify start_urls to remove the duplicated http:// prefix (genspider prepends http:// to whatever value it is given); the result is sketched below.
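After these two edits, the top of tieba.py should look roughly like this (using the tieba URL passed to genspider above):

# -*- coding: utf-8 -*-
import scrapy


class TiebaSpider(scrapy.Spider):
    name = 'tieba'
    allowed_domains = ['tieba.baidu.com']
    start_urls = ['http://tieba.baidu.com/f?ie=utf-8&kw=%E8%81%8A%E5%9F%8E%E5%A4%A7%E5%AD%A6&red_tag=b1186180911']

    def parse(self, response):
        pass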
(III) Debugging the spider
1. Download PyCharm: https://www.jetbrains.com/pycharm/
2. Under File / Settings, change the Project Interpreter.
Click Show All.
Select the python.exe belonging to the newly created spider_1 environment.
If it is not listed, click the + button on the right to add it, and under Existing environment / Interpreter choose the path to python.exe inside spider_1.
3. Open the nlp_spider project and create a new main.py file in the project root, so the spider can be run and debugged from PyCharm.
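The file itself is only a small launcher; a minimal sketch, assuming main.py sits next to scrapy.cfg in the project root:

# main.py -- run the tieba spider from within PyCharm so breakpoints can be used
import os
import sys

from scrapy.cmdline import execute

# make sure the project root is on sys.path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# equivalent to running "scrapy crawl tieba" on the command line
execute(["scrapy", "crawl", "tieba"])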
4. Working with scrapy shell
To make it easier to work out selectors for page elements, you can first test them on the command line with scrapy shell <url>.
scrapy shell should be run inside the spider_1 environment.
In the browser, open Settings / Tools / Developer tools.
Use the element-picker tool on the far left to click an author's name; the corresponding element is located automatically.
(spider_1) C:\Users\Adminatrate>scrapy shell https://tieba.baidu.com/p/4482506671
>>> response.css('.p_author_name.j_user_card::text').extract()
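The other selectors that the finished spider uses can be tested the same way; whether each call returns data depends on the page loaded into the shell:

>>> response.css('.core_title_txt.pull-left.text-overflow::text').extract()
>>> response.css('.d_post_content.j_d_post_content').extract()
>>> response.css('.post-tail-wrap span[class=tail-info]::text').extract()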
5. Contents of tieba.py
# -*- coding: utf-8 -*-
from urllib import parse
import re

import scrapy

from nlp_spider.items import TiebaItem


class TiebaSpider(scrapy.Spider):
    name = 'tieba'
    allowed_domains = ['tieba.baidu.com']
    start_urls = ['https://tieba.baidu.com/f?ie=utf-8&kw=%E9%98%B2%E9%AA%97&fr=search']

    def parse(self, response):
        # URLs of the threads on the current list page
        url_list = response.css('.j_th_tit::attr(href)').extract()
        for url in url_list:
            print(url)
            # join the relative URL with the response URL and hand the
            # downloaded result over to parse_detail
            yield scrapy.Request(url=parse.urljoin(response.url, url), callback=self.parse_detail)
        next_url = response.css('.next.pagination-item::attr(href)').extract()
        if next_url:
            yield scrapy.Request(url=parse.urljoin(response.url, next_url[0]), callback=self.parse)

    def parse_detail(self, response):
        # title of the thread
        title = response.css(".core_title_txt.pull-left.text-overflow::text").extract()
        if title:
            # authors of the replies; this is a list
            authors = response.css(".p_author_name.j_user_card::text").extract()
            # raw reply contents; they still need further processing
            contents_list = response.css(".d_post_content.j_d_post_content").extract()
            # strip the surrounding markup (image links, line-break tags, etc.)
            contents_list = self.get_content(contents_list)
            # post times and floor numbers of the replies
            bbs_sendtime_list, bbs_floor_list = self.get_send_time_and_floor(response)
            for i in range(len(authors)):
                tieba_item = TiebaItem()
                tieba_item['title'] = title[0]
                tieba_item['author'] = authors[i]
                tieba_item['content'] = contents_list[i]
                tieba_item['reply_time'] = bbs_sendtime_list[i]
                tieba_item['floor'] = bbs_floor_list[i]
                yield tieba_item

    def get_content(self, contents):
        contents_list = []
        for content in contents:
            reg = ";\">(.*)</div>"
            result = re.findall(reg, content)[0]
            contents_list.append(result)
        return contents_list

    def get_send_time_and_floor(self, response):
        bbs_send_time_and_floor_list = response.css('.post-tail-wrap span[class=tail-info]::text').extract()
        # drop the "来自" entries; filtering into a new list avoids the
        # skipped-element bug of removing from a list while iterating over it
        bbs_send_time_and_floor_list = [
            item for item in bbs_send_time_and_floor_list if item != "来自"
        ]
        i = 0  # index into bbs_send_time_and_floor_list: even entries are floor numbers, odd entries are post times
        bbs_sendtime_list = []
        bbs_floor_list = []
        for bbs_send_time_and_floor in bbs_send_time_and_floor_list:
            if i % 2 == 0:  # floor number
                bbs_floor_list.append(bbs_send_time_and_floor)
            elif i % 2 == 1:  # post time
                bbs_sendtime_list.append(bbs_send_time_and_floor)
            i += 1
        return bbs_sendtime_list, bbs_floor_list
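With the spider complete, it can be run either through the main.py created above (for debugging in PyCharm) or directly from the command line inside the project directory:

(spider_1) C:\Users\Adminstrate\nlp_spider>scrapy crawl tieba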
(IV) Writing the data-persistence code
1. In settings.py, add the following lines.
Purpose: configure the MySQL connection and register the pipelines.
ITEM_PIPELINES = {
    'nlp_spider.pipelines.NlpSpiderPipeline': 300,
    'nlp_spider.pipelines.MysqlTwistedPipeline': 1,
}
MYSQL_HOST = "localhost"
MYSQL_DBNAME = "spider"
MYSQL_USER = "root"
MYSQL_PASSWORD = "123456"
2. items.py
Purpose: the data model; each scraped record is wrapped in an item object.
import scrapy


class TiebaItem(scrapy.Item):
    # title of the thread
    title = scrapy.Field()
    # author of the reply
    author = scrapy.Field()
    # content of the reply
    content = scrapy.Field()
    # time the reply was posted
    reply_time = scrapy.Field()
    # floor number of the reply
    floor = scrapy.Field()

    # build the SQL statement and parameters for inserting this item
    # into the database (the pipeline executes it)
    def get_insert_sql(self):
        insert_sql = """
            insert into baidutieba(title, author, content, reply_time, floor)
            values(%s, %s, %s, %s, %s)
        """
        params = (self['title'], self['author'], self['content'], self['reply_time'], self['floor'])
        return insert_sql, params
3. pipelines.py
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors


class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    # read the connection settings from settings.py and build the
    # Twisted connection pool
    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            password=settings["MYSQL_PASSWORD"],
            charset="utf8",
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # run the insert asynchronously so it does not block the crawl
        self.dbpool.runInteraction(self.do_insert, item)
        return item

    def do_insert(self, cursor, item):
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
(V) Loading the data into MySQL
1. Download MySQL: https://dev.mysql.com/downloads/mysql/
2. Create a new connection whose parameters match those in settings.py.
3. Create a database named spider, matching MYSQL_DBNAME in settings.py.
4. Create the table baidutieba, matching the table name used in the SQL statement in items.py; a sketch follows.
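A possible definition for that table; the column types and lengths are assumptions and only need to be generous enough for the scraped text:

CREATE TABLE baidutieba (
    title      VARCHAR(255),  -- thread title
    author     VARCHAR(255),  -- reply author
    content    TEXT,          -- reply content
    reply_time VARCHAR(50),   -- time the reply was posted
    floor      VARCHAR(50)    -- floor number
) DEFAULT CHARSET = utf8;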
5. Once the crawl finishes, the scraped records appear in the table.