Lab requirements: scrape the Douban book ranking (Top 250) list, store the data in a database, then scrape the book comments, analyze them, and extract keywords to build a word cloud.
Results:
Word cloud result:
Enough talk; let's get straight to it!
Scraping the Data
Let's start with a simple example:
import requests
from bs4 import BeautifulSoup

url = "https://book.douban.com/top250"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')

# Extract book titles and links
items = soup.select('.pl2 a')
for item in items:
    name = item['title']
    link = item['href']
    print(name, link)
The output:
红楼梦 https://book.douban.com/subject/1007305/
活着 https://book.douban.com/subject/4913064/
1984 https://book.douban.com/subject/4820710/
百年孤独 https://book.douban.com/subject/6082808/
三体全集 https://book.douban.com/subject/6518605/
飘 https://book.douban.com/subject/1068920/
...
Improvements:
1. Simulate a logged-in browser
If we hit a site too often within a short period, it detects the anomaly and blocks our IP. So each request should carry a User-Agent (and, if login state is needed, a Cookie) in its headers to look like a real browser session, and the User-Agent should be rotated so consecutive requests appear to come from different browsers. We also sleep for a random interval to avoid high-frequency scraping.
user_Agent = [
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)']
page = requests.get(url=url, headers={"User-Agent": random.choice(user_Agent)})
time.sleep(random.random()*3)
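The paragraph above also mentions carrying a Cookie to mimic login state. A minimal sketch, with the cookie string as a placeholder (paste the real value from your own logged-in browser session):

headers = {
    "User-Agent": random.choice(user_Agent),
    # Placeholder cookie: copy the actual value from your logged-in browser's request headers
    "Cookie": "bid=<your-bid>; dbcl2=<your-login-cookie>",
}
page = requests.get(url, headers=headers)
time.sleep(random.random() * 3)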
2. Extract fields with regular expressions and save the results to a txt file and the database
import re

findLink = re.compile(r'<a href="(\S*)"')
findName = re.compile(r'title="(.*?)"')
findQuote = re.compile(r'<span class="inq">(\S*)</span>')
findStar = re.compile(r'<span class="rating_nums">(\S*)</span>')
findAuthor = re.compile(r'<p class="pl">(.*?)</p>')

# The lines below run inside the per-book loop (see the full code):
link = re.findall(findLink, str(item))
name = re.findall(findName, str(item))
quote = re.findall(findQuote, str(item))
if not quote:  # some books have no one-line quote
    quote = ['null']
star = re.findall(findStar, str(item))
author = re.findall(findAuthor, str(item))

# Print the result
print(name[0] + ' ' + link[0] + ' ' + author[0] + ' ' + quote[0] + ' ' + star[0])
fp.write(name[0] + ' ' + link[0] + ' ' + author[0] + ' ' + quote[0] + ' ' + star[0] + '\n')
cursor.execute(sql, [name[0], link[0], author[0], quote[0], star[0]])
# Remember to commit() after writing to the database!
db.commit()
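For reference, the info table the insert writes into has to exist first. A minimal sketch of its schema; only the column names come from the insert statement, the types and lengths are assumptions:

# Run once to create the table; the types/lengths are assumptions
cursor.execute('''
    create table if not exists info(
        name   varchar(255),
        link   varchar(255),
        author varchar(512),
        quote  varchar(512),
        star   varchar(16)
    );
''')
db.commit()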
Full code:
import random
import time
import requests
from bs4 import BeautifulSoup
import re
import pymysql
db = pymysql.connect(host='127.0.0.1', user='root', password='******', database='Pachong')
cursor = db.cursor()
sql = '''insert into info(name,link,author,quote,star) values(%s,%s,%s,%s,%s);'''
user_Agent = [
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)']
fp = open("PachongTest.txt", "w", encoding='utf-8')

findLink = re.compile(r'<a href="(\S*)"')
findName = re.compile(r'title="(.*?)"')
findQuote = re.compile(r'<span class="inq">(\S*)</span>')
findStar = re.compile(r'<span class="rating_nums">(\S*)</span>')
findAuthor = re.compile(r'<p class="pl">(.*?)</p>')

for i in range(0, 226, 25):
    print(i)
    url = "https://book.douban.com/top250?start=" + str(i)
    page = requests.get(url=url, headers={"User-Agent": random.choice(user_Agent)})
    time.sleep(random.random() * 3)
    soup = BeautifulSoup(page.text, 'html.parser')
    # One block per book
    items = soup.select('.item')
    for item in items:
        link = re.findall(findLink, str(item))
        name = re.findall(findName, str(item))
        quote = re.findall(findQuote, str(item))
        if not quote:  # some books have no one-line quote
            quote = ['null']
        star = re.findall(findStar, str(item))
        author = re.findall(findAuthor, str(item))
        print(name[0] + ' ' + link[0] + ' ' + author[0] + ' ' + quote[0] + ' ' + star[0])
        fp.write(name[0] + ' ' + link[0] + ' ' + author[0] + ' ' + quote[0] + ' ' + star[0] + '\n')
        cursor.execute(sql, [name[0], link[0], author[0], quote[0], star[0]])
        db.commit()
fp.close()
cursor.close()
db.close()
Scraping the comments works in much the same way.
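For example, a minimal sketch for one book's comment page. It assumes each comment on the /comments/ page sits in a span with class "short" (an assumption about Douban's markup); the subject id is the one for 红楼梦 from the output above, and the output is appended to the commentsAll.txt file used in the cleaning step below:

# Sketch: collect short comments for one book; 'span.short' is an assumed selector
comment_url = "https://book.douban.com/subject/1007305/comments/"
page = requests.get(url=comment_url, headers={"User-Agent": random.choice(user_Agent)})
time.sleep(random.random() * 3)
soup = BeautifulSoup(page.text, 'html.parser')
with open("commentsAll.txt", "a", encoding='utf-8') as f:
    for span in soup.select('span.short'):
        f.write(span.get_text().strip() + '\n')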
Database result:
Data Cleaning
The scraped comments are messy, so we segment the text, extract keywords, and filter out stop words.
Stop word list:
import jieba

stopwords_filepath = r"scu_stopwords.txt"  # stop word file

# Build the stop word list
def stopwordslist(stopwords_filepath):
    with open(stopwords_filepath, 'r', encoding='utf-8-sig', errors='ignore') as f:
        return [line.strip() for line in f]

# Segment a sentence and drop the stop words
def seg_sentence(sentence):
    sentence_seged = jieba.cut(sentence.strip())
    stopwords = stopwordslist(stopwords_filepath)  # load the stop words
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords and word != '\t':
            outstr += word
            outstr += " "
    return outstr

inputs = open(r'commentsAll.txt', 'r', encoding='utf-8', errors='ignore')
outputs = open(r'result2.txt', 'w', encoding='utf-8')
for line in inputs:
    line_seg = seg_sentence(line)
    outputs.write(line_seg + '\n')
outputs.close()
inputs.close()
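A quick sanity check of seg_sentence on a made-up sentence; common function words such as "的" should be dropped, assuming they appear in scu_stopwords.txt:

# The sample sentence is made up; exact output depends on the stop word list
print(seg_sentence("这本书的故事非常精彩"))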
Before segmentation:
After segmentation:
Use jieba to extract keywords (ranked by TF-IDF weight), then generate the word cloud image with wordcloud:
import jieba.analyse
import wordcloud

path = 'result2.txt'
file_in = open(path, encoding='utf-8')
s = file_in.read()
file_in.close()

y = ''
# TF-IDF based extraction of the 100 highest-weighted keywords
for x, w in jieba.analyse.extract_tags(s, topK=100, withWeight=True):
    print('%s %s' % (x, w))
    y += x + ' '
print(y)

wc = wordcloud.WordCloud(
    background_color='white',  # image background color
    font_path='/System/Library/Fonts/PingFang.ttc',  # a font that supports Chinese (macOS path)
    scale=15
)
# Feed the keywords into the word cloud
wc.generate(y)
# Save the word cloud image
wc.to_file('ciyun.png')
Building the Flask Backend
Set up a simple backend:
from flask import Flask, jsonify
from flask_cors import CORS
import pymysql

db = pymysql.connect(host='127.0.0.1', user='root', password='******', database='Pachong')
cursor = db.cursor(cursor=pymysql.cursors.DictCursor)
cursor.execute('''select * from info;''')
info = list(cursor.fetchall())
cursor.execute('''select * from comments;''')
comments = list(cursor.fetchall())

app = Flask(__name__)
CORS(app, resources=r'/*')

@app.route('/getAll', methods=['GET', 'POST'])
def getInfo():
    return jsonify(info)

@app.route('/getComments', methods=['GET', 'POST'])
def getComments():
    return jsonify(comments)

if __name__ == '__main__':
    app.run()
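To verify an endpoint, request the same URL the Vue component below calls:

import requests

resp = requests.get("http://127.0.0.1:5000/getAll")
print(resp.json()[:2])  # first two rows of the info table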
Endpoint output:
Building the Vue Frontend
Routing with Vue Router
Code for the info component:
<template>
  <div>
    <el-table :data="tableData.slice((currentPage-1)*pageSize,currentPage*pageSize)" stripe style="width: 100%">
      <el-table-column prop="name" label="Title" />
      <el-table-column prop="link" label="Link">
        <template slot-scope="scope">
          <a :href="scope.row.link">{{scope.row.link}}</a>
        </template>
      </el-table-column>
      <el-table-column prop="author" label="Author" />
      <el-table-column prop="quote" label="Quote" />
      <el-table-column prop="star" label="Rating" />
    </el-table>
    <el-pagination background
      align='right'
      :page-size="pageSize"
      layout="prev, pager, next"
      :total="tableData.length"
      :current-page="currentPage"
      @size-change="handleSizeChange"
      @current-change="handleCurrentChange"
    />
  </div>
</template>
<script>
// import axios from 'axios'
export default {
  name: 'info',
  props: {
    msg: String
  },
  data() {
    return {
      tableData: [
        {
          name: 'ss'
        }
      ],
      currentPage: 1,
      pageSize: 10,
    }
  },
  mounted() {
    console.log("component loaded")
    this.getInfo()
  },
  methods: {
    getInfo() {
      this.axios.get("http://127.0.0.1:5000/getAll").then(res => {
        console.log(res.data)
        this.tableData = res.data
      })
    },
    handleSizeChange(val) {
      console.log(`${val} items per page`);
      this.currentPage = 1;
      this.pageSize = val;
    },
    handleCurrentChange(val) {
      console.log(`current page: ${val}`);
      this.currentPage = val;
    }
  },
}
</script>
<!-- Add "scoped" attribute to limit CSS to this component only -->
<style scoped>
h3 {
  margin: 40px 0 0;
}
ul {
  list-style-type: none;
  padding: 0;
}
li {
  display: inline-block;
  margin: 0 10px;
}
a {
  color: #2b8ae9;
}
.el-pagination {
  margin-top: 20px;
}
</style>