Scraping a Novel for Fun: A Worked Example (Part 2)

Article link: https://blog.csdn.net/Coll_Jack/article/details/106854375


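To recap the previous post: the goal is to scrape Keigo Higashino's 《白夜行》 (Journey Under the Midnight Sun), and that post analyzed the structure of the site's pages.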

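With that analysis in hand, the next question is how to fetch the information we want. First, the number of pages to fetch is small, so the requests library is a good fit: it is simple to use and popular among people who write scrapers. Second, once a page has been fetched we need a tool to extract information from it. I chose the bs4 library (BeautifulSoup), which makes parsing HTML pages very convenient.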

To install the two libraries, run pip install requests for requests and pip install BeautifulSoup4 for bs4.
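A quick way to confirm that both installs succeeded is to check whether Python can locate the packages. The snippet below is a minimal sketch: the helper name is_installed is hypothetical, and it relies only on the standard library's importlib.util.find_spec. Note that BeautifulSoup4 is imported under the name bs4.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this import name can be found."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == '__main__':
    # 'bs4' is the import name of the BeautifulSoup4 package.
    for name in ('requests', 'bs4'):
        status = 'installed' if is_installed(name) else 'missing'
        print(f'{name}: {status}')
```

If either line reports missing, rerun the matching pip install command in the same Python environment that runs the script.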

With the libraries installed, it is time to write the scraping code. That needs a plan first, and mine is as follows:

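Download the pages the naive way, that is, one page at a time;
Have one function that downloads a page: given a url, it returns the page content;
Have one function that parses a page: given the page content, it returns the text I want;
Have one function that writes the parsed content to a given file;
Decide how the start and the end of the crawl are defined.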
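The last point, where a chapter starts and ends, can be sketched without touching the network. The sketch below assumes the URL pattern that the script in this post relies on: chapter i uses page id 147590+i, its first page is <id>.html, later pages are <id>_2.html, <id>_3.html and so on, and a chapter ends at the first page that cannot be fetched. The names build_page_url and iter_chapter_pages are hypothetical, and fetch stands in for any function that returns '' on failure.

```python
BASE_URL = 'https://www.yooread.net/3/3309/'  # same value as in the script
FIRST_PAGE_ID = 147590  # assumed offset: chapter i uses page id 147590 + i


def build_page_url(chapter, page, base_url=BASE_URL):
    """Build the URL of one page of a chapter (both numbered from 1)."""
    page_id = str(FIRST_PAGE_ID + chapter)
    suffix = page_id if page == 1 else f'{page_id}_{page}'
    return f'{base_url}{suffix}.html'


def iter_chapter_pages(chapter, fetch):
    """Yield the HTML of each page in a chapter until fetch returns ''."""
    page = 1
    while True:
        html = fetch(build_page_url(chapter, page))
        if html == '':
            return  # the first missing page marks the end of the chapter
        yield html
        page += 1


if __name__ == '__main__':
    # A fake fetch that pretends chapter 1 has exactly three pages.
    known = {build_page_url(1, p): f'<p>page {p}</p>' for p in (1, 2, 3)}
    pages = list(iter_chapter_pages(1, lambda url: known.get(url, '')))
    print(len(pages))
```

Passing fetch in as a parameter keeps the stopping rule testable offline. One caveat of this rule is that a temporary network error looks the same as the end of a chapter.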

Here is how the source code implements this:

""" Scrape 白夜行 (Journey Under the Midnight Sun) in a single thread

Steps
1. Fetch the HTML source
2. Parse the HTML source
3. Write the parsed text to a file

the statistics of this file:
lines(count)    understand_level(h/m/l)    classes(count)    functions(count)    fields(count)
000000000083    ----------------------m    00000000000000    0000000000000003    ~~~~~~~~~~~~1
"""

import time

import requests
import bs4

__author__ = '与C同行'
baiyexing_origin_url = 'https://www.yooread.net/3/3309/'


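def get_html_text(url):
    """
    Fetch the HTML source of a page; return '' if the request fails.
    """
    try:
        # A timeout keeps a stalled connection from hanging the whole run.
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # Covers network errors and HTTP error statuses such as 404.
        return ''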


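def parse_text(temp_list, text):
    """ Parse the HTML source and collect the article text

    :param temp_list: list that receives the parsed text
    :param text: HTML text
    """
    soup = bs4.BeautifulSoup(text, 'html.parser')
    head_tag = soup.find('h1')
    # Guard against pages without an <h1>, where find() returns None.
    if head_tag is not None:
        temp_list.append(head_tag.text)
    body_tag_list = soup.find_all('p')
    # The last two <p> tags are skipped, presumably because they are not story text.
    temp_list.extend([p.text for p in body_tag_list[:-2]])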


def write_in_file(store_list, filename):
    """ Append the parsed text to a file

    :param store_list: list holding the parsed text
    :param filename: name of the output file
    """
    with open(filename, 'a+', encoding='utf8') as f:
        for line in store_list:
            f.write(line+'\n')


if __name__ == '__main__':
    print(f'Current time: {time.ctime()}')
    print()

    one_addr_text_list = []
    store_file_name = '白夜行.txt'
    start_time = time.time()
    # Chapters 1 through 16; chapter i starts at page id 147590 + i.
    for i in range(1, 17):
        print(f'Downloading chapter {i}', '*'*5)
        j = 1
        # The first page of a chapter is <id>.html; later pages are <id>_<j>.html.
        partial_suffix = str(147590+i)
        while True:
            full_suffix = partial_suffix + '.html'
            current_url = baiyexing_origin_url + full_suffix
            one_addr_text = get_html_text(current_url)
            # An empty string means the page could not be fetched,
            # which is treated as the end of the chapter.
            if one_addr_text == '':
                print(f'This chapter has {j-1} pages')
                break
            parse_text(one_addr_text_list, one_addr_text)
            write_in_file(one_addr_text_list, store_file_name)
            one_addr_text_list.clear()
            j = j+1
            partial_suffix = str(147590+i) + '_{}'.format(j)
    end_time = time.time()
    print(f'Download took {end_time-start_time}s')
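To see what parse_text collects, it can help to run the same parsing logic on a small inline HTML string instead of a live page. The sketch below is only an illustration: the sample markup is invented, the function name extract_lines is hypothetical, and the idea that the last two <p> tags hold non-story text (navigation or a footer, say) is inferred from the [:-2] slice in the script rather than stated by it.

```python
import bs4

# Invented sample page: a heading, two story paragraphs, two trailing <p> tags.
SAMPLE_HTML = """
<html><body>
<h1>Chapter 1</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<p>previous / next</p>
<p>footer</p>
</body></html>
"""


def extract_lines(text):
    """Return the heading followed by every <p> except the last two."""
    soup = bs4.BeautifulSoup(text, 'html.parser')
    lines = []
    head_tag = soup.find('h1')
    if head_tag is not None:
        lines.append(head_tag.text)
    lines.extend(p.text for p in soup.find_all('p')[:-2])
    return lines


if __name__ == '__main__':
    for line in extract_lines(SAMPLE_HTML):
        print(line)
```

Returning a new list, rather than appending to a list passed in by the caller, is a design choice of this sketch only; the script above keeps its original signature.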