Lab requirements: scrape the Douban book ranking (Top 250) list, store the data in a database, then scrape the book comments, analyze them, and extract keywords to build a word cloud.
Results:
Word cloud result:
Enough talk; let's get straight to it!
Scraping the Data
Let's start with a simple example:
import requests
from bs4 import BeautifulSoup

url = "https://book.douban.com/top250"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')

# Extract book titles and links
items = soup.select('.pl2 a')
for item in items:
    name = item['title']
    link = item['href']
    print(name, link)
The output:
红楼梦 https://book.douban.com/subject/1007305/
活着 https://book.douban.com/subject/4913064/
1984 https://book.douban.com/subject/4820710/
百年孤独 https://book.douban.com/subject/6082808/
三体全集 https://book.douban.com/subject/6518605/
飘 https://book.douban.com/subject/1068920/
...
Improvements:
1. Simulate a logged-in browser
If we hit a site too often within a short period, it detects the anomaly and blocks our IP. So each request should carry a User-Agent (and, if login state is needed, a Cookie) in its headers to look like a real browser session, and the User-Agent should be rotated so consecutive requests appear to come from different browsers. We also sleep for a random interval to avoid high-frequency scraping.
user_Agent = [
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)']
page = requests.get(url=url, headers={"User-Agent": random.choice(user_Agent)})
time.sleep(random.random()*3)
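The paragraph above also mentions carrying a Cookie to mimic login state. A minimal sketch, with the cookie string as a placeholder (paste the real value from your own logged-in browser session):

headers = {
    "User-Agent": random.choice(user_Agent),
    # Placeholder cookie: copy the actual value from your logged-in browser's request headers
    "Cookie": "bid=<your-bid>; dbcl2=<your-login-cookie>",
}
page = requests.get(url, headers=headers)
time.sleep(random.random() * 3)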
2. Extract fields with regular expressions and save the results to a txt file and the database
import re

findLink = re.compile(r'<a href="(\S*)"')
findName = re.compile(r'title="(.*?)"')
findQuote = re.compile(r'<span class="inq">(\S*)</span>')
findStar = re.compile(r'<span class="rating_nums">(\S*)</span>')
findAuthor = re.compile(r'<p class="pl">(.*?)</p>')

# The lines below run inside the per-book loop (see the full code):
link = re.findall(findLink, str(item))
name = re.findall(findName, str(item))
quote = re.findall(findQuote, str(item))
if not quote:  # some books have no one-line quote
    quote = ['null']
star = re.findall(findStar, str(item))
author = re.findall(findAuthor, str(item))

# Print the result
print(name[0] + ' ' + link[0] + ' ' + author[0] + ' ' + quote[0] + ' ' + star[0])
fp.write(name[0] + ' ' + link[0] + ' ' + author[0] + ' ' + quote[0] + ' ' + star[0] + '\n')
cursor.execute(sql, [name[0], link[0], author[0], quote[0], star[0]])
# Remember to commit() after writing to the database!
db.commit()
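For reference, the info table the insert writes into has to exist first. A minimal sketch of its schema; only the column names come from the insert statement, the types and lengths are assumptions:

# Run once to create the table; the types/lengths are assumptions
cursor.execute('''
    create table if not exists info(
        name   varchar(255),
        link   varchar(255),
        author varchar(512),
        quote  varchar(512),
        star   varchar(16)
    );
''')
db.commit()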
Full code:
import random
import time
import requests
from bs4 import BeautifulSoup
import re
import pymysql
db = pymysql.connect(host='127.0.0.1', user='root', password='******', database='Pachong')
cursor = db.cursor()
sql = '''insert into info(name,link,author,quote,star) values(%s,%s,%s,%s,%s);'''
user_Agent = [
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)']
fp = open("PachongTest.txt", "w", encoding='utf-8')

findLink = re.compile(r'<a href="(\S*)"')
findName = re.compile(r'title="(.*?)"')
findQuote = re.compile(r'<span class="inq">(\S*)</span>')
findStar = re.compile(r'<span class="rating_nums">(\S*)</span>')
findAuthor = re.compile(r'<p class="pl">(.*?)</p>')

for i in range(0, 226, 25):
    print(i)
    url = "https://book.douban.com/top250?start=" + str(i)
    page = requests.get(url=url, headers={"User-Agent": random.choice(user_Agent)})
    time.sleep(random.random() * 3)
    soup = BeautifulSoup(page.text, 'html.parser')
    # One block per book
    items = soup.select('.item')
    for item in items:
        link = re.findall(findLink, str(item))
        name = re.findall(findName, str(item))
        quote = re.findall(findQuote, str(item))
        if not quote:  # some books have no one-line quote
            quote = ['null']
        star = re.findall(findStar, str(item))
        author = re.findall(findAuthor, str(item))
        print(name[0] + ' ' + link[0] + ' ' + author[0] + ' ' + quote[0] + ' ' + star[0])
        fp.write(name[0] + ' ' + link[0] + ' ' + author[0] + ' ' + quote[0] + ' ' + star[0] + '\n')
        cursor.execute(sql, [name[0], link[0], author[0], quote[0], star[0]])
        db.commit()
fp.close()
cursor.close()
db.close()
Scraping the comments works in much the same way.
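For example, a minimal sketch for one book's comment page. It assumes each comment on the /comments/ page sits in a span with class "short" (an assumption about Douban's markup); the subject id is the one for 红楼梦 from the output above, and the output is appended to the commentsAll.txt file used in the cleaning step below:

# Sketch: collect short comments for one book; 'span.short' is an assumed selector
comment_url = "https://book.douban.com/subject/1007305/comments/"
page = requests.get(url=comment_url, headers={"User-Agent": random.choice(user_Agent)})
time.sleep(random.random() * 3)
soup = BeautifulSoup(page.text, 'html.parser')
with open("commentsAll.txt", "a", encoding='utf-8') as f:
    for span in soup.select('span.short'):
        f.write(span.get_text().strip() + '\n')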
Database result:
Data Cleaning
The scraped comments are messy, so we segment the text, extract keywords, and filter out stop words.
Stop word list:
import jieba

stopwords_filepath = r"scu_stopwords.txt"  # stop word file

# Build the stop word list
def stopwordslist(stopwords_filepath):
    with open(stopwords_filepath, 'r', encoding='utf-8-sig', errors='ignore') as f:
        return [line.strip() for line in f]

# Segment a sentence and drop the stop words
def seg_sentence(sentence):
    sentence_seged = jieba.cut(sentence.strip())
    stopwords = stopwordslist(stopwords_filepath)  # load the stop words
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords and word != '\t':
            outstr += word
            outstr += " "
    return outstr

inputs = open(r'commentsAll.txt', 'r', encoding='utf-8', errors='ignore')
outputs = open(r'result2.txt', 'w', encoding='utf-8')
for line in inputs:
    line_seg = seg_sentence(line)
    outputs.write(line_seg + '\n')
outputs.close()
inputs.close()
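A quick sanity check of seg_sentence on a made-up sentence; common function words such as "的" should be dropped, assuming they appear in scu_stopwords.txt:

# The sample sentence is made up; exact output depends on the stop word list
print(seg_sentence("这本书的故事非常精彩"))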
Before segmentation:
After segmentation:
Use jieba to extract keywords (ranked by TF-IDF weight), then generate the word cloud image with wordcloud:
import jieba.analyse
import wordcloud

path = 'result2.txt'
file_in = open(path, encoding='utf-8')
s = file_in.read()
file_in.close()

y = ''
# TF-IDF based extraction of the 100 highest-weighted keywords
for x, w in jieba.analyse.extract_tags(s, topK=100, withWeight=True):
    print('%s %s' % (x, w))
    y += x + ' '
print(y)

wc = wordcloud.WordCloud(
    background_color='white',  # image background color
    font_path='/System/Library/Fonts/PingFang.ttc',  # a font that supports Chinese (macOS path)
    scale=15
)
# Feed the keywords into the word cloud
wc.generate(y)
# Save the word cloud image
wc.to_file('ciyun.png')
Building the Flask Backend
Set up a simple backend:
from flask import Flask, jsonify
from flask_cors import CORS
import pymysql

db = pymysql.connect(host='127.0.0.1', user='root', password='******', database='Pachong')
cursor = db.cursor(cursor=pymysql.cursors.DictCursor)
cursor.execute('''select * from info;''')
info = list(cursor.fetchall())
cursor.execute('''select * from comments;''')
comments = list(cursor.fetchall())

app = Flask(__name__)
CORS(app, resources=r'/*')

@app.route('/getAll', methods=['GET', 'POST'])
def getInfo():
    return jsonify(info)

@app.route('/getComments', methods=['GET', 'POST'])
def getComments():
    return jsonify(comments)

if __name__ == '__main__':
    app.run()
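To verify an endpoint, request the same URL the Vue component below calls:

import requests

resp = requests.get("http://127.0.0.1:5000/getAll")
print(resp.json()[:2])  # first two rows of the info table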
Endpoint output:
Building the Vue Frontend
Routing with Vue Router
Code for the info component:
<template>
  <div>
    <el-table :data="tableData.slice((currentPage-1)*pageSize,currentPage*pageSize)" stripe style="width: 100%">
      <el-table-column prop="name" label="Title" />
      <el-table-column prop="link" label="Link">
        <template slot-scope="scope">
          <a :href="scope.row.link">{{scope.row.link}}</a>
        </template>
      </el-table-column>
      <el-table-column prop="author" label="Author" />
      <el-table-column prop="quote" label="Quote" />
      <el-table-column prop="star" label="Rating" />
    </el-table>
    <el-pagination background
      align='right'
      :page-size="pageSize"
      layout="prev, pager, next"
      :total="tableData.length"
      :current-page="currentPage"
      @size-change="handleSizeChange"
      @current-change="handleCurrentChange"
    />
  </div>
</template>
<script>
// import axios from 'axios'
export default {
  name: 'info',
  props: {
    msg: String
  },
  data() {
    return {
      tableData: [
        {
          name: 'ss'
        }
      ],
      currentPage: 1,
      pageSize: 10,
    }
  },
  mounted() {
    console.log("component loaded")
    this.getInfo()
  },
  methods: {
    getInfo() {
      this.axios.get("http://127.0.0.1:5000/getAll").then(res => {
        console.log(res.data)
        this.tableData = res.data
      })
    },
    handleSizeChange(val) {
      console.log(`${val} items per page`);
      this.currentPage = 1;
      this.pageSize = val;
    },
    handleCurrentChange(val) {
      console.log(`current page: ${val}`);
      this.currentPage = val;
    }
  },
}
</script>
<!-- Add "scoped" attribute to limit CSS to this component only -->
<style scoped>
h3 {
  margin: 40px 0 0;
}
ul {
  list-style-type: none;
  padding: 0;
}
li {
  display: inline-block;
  margin: 0 10px;
}
a {
  color: #2b8ae9;
}
.el-pagination {
  margin-top: 20px;
}
</style>