Tutorial: Scraping a Complete Novel from Baidu with Python Async Coroutines (Source Code Included)

Latest recommended article published on 2025-09-04 16:53:36

License: CC 4.0 BY-SA


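First, locate the chapter data for the novel you want to scrape:

Find the corresponding URL:

Open any chapter to obtain the second URL:

Tidy up these two URLs slightly:

The first URL returns the chapter list (the catalog), and the second returns the text of a single chapter: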

https://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"4345125319"}
https://dushu.baidu.com/api/pc/getChapterContent?data={"book_id":"4345125319","cid":"4345125319|1571332019","need_bookinfo":1}

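Send the HTTP request
Use requests.get(url) to fetch the catalog data synchronously, then parse it into a dict with resp.json().
Iterate over the chapter data
Extract each chapter's title (title) and chapter ID (cid) from dic['data']['novel']['items'].
Create the async tasks
Call aiodownload for each chapter to create a coroutine, collect the coroutines in the tasks list, and finally run them concurrently with asyncio.gather.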

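async def getCatalog(url):
    resp = requests.get(url)  # synchronous request; runs once before the async tasks start
    dic = resp.json()
    tasks = []
    for item in dic['data']['novel']['items']:  # each item holds one chapter's title and ID
        title = item['title']
        # each cid maps to one async task
        cid = item['cid']
        # queue the download coroutine (b_id is the module-level book ID)
        tasks.append(aiodownload(cid, b_id, title))
    await asyncio.gather(*tasks)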
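
To make the expected response shape concrete, here is a minimal sketch that pulls (title, cid) pairs out of a parsed catalog dict. The sample data is made up for illustration; only the key path dic['data']['novel']['items'] and the title and cid keys come from the code above, and extract_chapters is a hypothetical helper name.

```python
def extract_chapters(dic):
    """Return a list of (title, cid) tuples from a parsed catalog response."""
    items = dic['data']['novel']['items']
    return [(item['title'], item['cid']) for item in items]


# Made-up sample that mimics the shape the tutorial code relies on.
sample = {
    'data': {
        'novel': {
            'items': [
                {'title': 'Chapter 1', 'cid': '1001'},
                {'title': 'Chapter 2', 'cid': '1002'},
            ]
        }
    }
}

chapters = extract_chapters(sample)
print(chapters)
```

Separating the parsing step like this makes it easy to check the key path against a saved response before any downloads start.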

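b_id="4345125319" defines the book ID used to build the request URL.
url is built by string concatenation and points at the Baidu Reading catalog endpoint.
asyncio.run(getCatalog(url)) starts the async function getCatalog, which fetches and processes the catalog data.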

if __name__ == '__main__':
    b_id="4345125319"
    url='https://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"'+b_id+'"}'
    asyncio.run(getCatalog(url))
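
As an alternative to string concatenation, the catalog URL could be built the same way as the chapter URL, by serializing a dict and URL-encoding it. This is only a sketch: build_catalog_url is a hypothetical helper name, and it assumes the API host is dushu.baidu.com and that the endpoint accepts the encoded form of the same JSON.

```python
import json
from urllib.parse import quote


def build_catalog_url(b_id):
    """Build the catalog URL from a book ID (hypothetical helper)."""
    data = {"book_id": b_id}
    # json.dumps produces the JSON text; quote percent-encodes braces, quotes and spaces.
    return 'https://dushu.baidu.com/api/pc/getCatalog?data=' + quote(json.dumps(data))


catalog_url = build_catalog_url("4345125319")
print(catalog_url)
```

Building the parameter from a dict avoids hand-written quote characters and keeps both URLs consistent.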

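Parameters of aiodownload:
cid: chapter ID
b_id: book ID
title: chapter title, used as the name of the saved file
Steps:
Build the request parameters as JSON
URL-encode the parameters
Assemble the full API request URL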

data = {
    "book_id": f"{b_id}",
    "cid": f"{b_id}|{cid}",
    "need_bookinfo": 1
}
encoded_data = urllib.parse.quote(json.dumps(data))
url = f'https://dushu.baidu.com/api/pc/getChapterContent?data={encoded_data}'
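
A quick way to check that the encoding step loses nothing is to decode the result and compare it with the original dict. The sketch below uses the book ID and chapter ID that appear in the example URLs above; it only exercises the standard library and sends no request.

```python
import json
from urllib.parse import quote, unquote

b_id = "4345125319"
cid = "1571332019"
data = {
    "book_id": b_id,
    "cid": f"{b_id}|{cid}",
    "need_bookinfo": 1
}

# Encode exactly as in the tutorial code.
encoded_data = quote(json.dumps(data))

# Reverse the two steps: percent-decoding first, then JSON parsing.
decoded = json.loads(unquote(encoded_data))

print(encoded_data)
print(decoded == data)
```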

Create an aiohttp client session
Send the GET request and receive the response
Treat HTTP status code 200 as success
Parse the response body as JSON

async with aiohttp.ClientSession() as session:
    async with session.get(url) as resp:
        if resp.status == 200:
            dic = await resp.json()

Catch and print any exception that may occur, including failed network requests, JSON parsing errors, and file I/O errors.

except Exception as e:
    print(f"An error occurred: {str(e)}")
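
Printing inside each task works, but it does not tell the caller which chapters failed. One possible alternative, sketched here with the standard library only, is to let exceptions propagate and pass return_exceptions=True to asyncio.gather. fake_download and download_all are hypothetical stand-ins, not part of the tutorial code, and the failing cid is made up.

```python
import asyncio


async def fake_download(cid):
    """Stand-in for a chapter download; fails for one made-up cid."""
    await asyncio.sleep(0)
    if cid == 'bad':
        raise ValueError(f'cannot download {cid}')
    return cid


async def download_all(cids):
    # With return_exceptions=True, exceptions are returned in the result list
    # (in input order) instead of being raised from gather.
    results = await asyncio.gather(
        *(fake_download(c) for c in cids), return_exceptions=True
    )
    failed = [c for c, r in zip(cids, results) if isinstance(r, Exception)]
    return results, failed


results, failed = asyncio.run(download_all(['1', 'bad', '3']))
print(failed)
```

The list of failed chapter IDs could then be retried or logged in one place.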

Complete code:

# https://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"4345125319"}  returns the list of all chapters
# https://dushu.baidu.com/api/pc/getChapterContent?data={"book_id":"4345125319","cid":"4345125319|1571332019","need_bookinfo":1}  returns the text of one chapter

import requests
import asyncio
import aiohttp
import json
import aiofiles
import urllib.parse


async def aiodownload(cid, b_id, title):
    # Build the JSON parameters, then URL-encode them for the query string
    data = {
        "book_id": f"{b_id}",
        "cid": f"{b_id}|{cid}",
        "need_bookinfo": 1
    }
    encoded_data = urllib.parse.quote(json.dumps(data))
    url = f'https://dushu.baidu.com/api/pc/getChapterContent?data={encoded_data}'

    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                if resp.status == 200:
                    dic = await resp.json()
                    # Write the chapter text to a file named after the chapter title
                    async with aiofiles.open(f'{title}.txt', mode='w', encoding='utf-8') as f:
                        await f.write(dic['data']['novel']['content'])
                else:
                    print(f"Request failed, status code: {resp.status}")
    except Exception as e:
        print(f"An error occurred: {str(e)}")



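async def getCatalog(url):
    resp = requests.get(url)  # synchronous request; runs once before the async tasks start
    dic = resp.json()
    tasks = []
    for item in dic['data']['novel']['items']:  # each item holds one chapter's title and ID
        title = item['title']
        # each cid maps to one async task
        cid = item['cid']
        # queue the download coroutine (b_id is the module-level book ID)
        tasks.append(aiodownload(cid, b_id, title))
    await asyncio.gather(*tasks)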

if __name__ == '__main__':
    b_id="4345125319"
    url='https://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"'+b_id+'"}'
    asyncio.run(getCatalog(url))
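
Because the chapter title is used directly as the file name, a title containing characters such as / or ? could make the file write fail on some systems. A small sanitizing helper is one way to guard against that. safe_filename is a hypothetical name, and the set of replaced characters is an assumption chosen to cover common Windows and Unix restrictions.

```python
import re


def safe_filename(title, replacement='_'):
    """Replace characters that are commonly invalid in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, title).strip()
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or 'untitled'


print(safe_filename('Chapter 1: A/B?'))
```

In aiodownload this could be applied as aiofiles.open(f'{safe_filename(title)}.txt', ...), leaving the rest of the function unchanged.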