First, locate the chapter data for the novel you want to scrape:
Find the corresponding url.
Click into one chapter to obtain the second url.
Adjust the two urls slightly.
The first url returns the full chapter catalog; the second returns the text of a single chapter:
https://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"4345125319"}
https://dushu.baidu.com/api/pc/getChapterContent?data={"book_id":"4345125319","cid":"4345125319|1571332019","need_bookinfo":1}
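To confirm what these urls carry, you can pull the data query parameter back out and parse it as JSON; a quick sketch using only the standard library, with the catalog url from above:

```python
import json
from urllib.parse import urlsplit, parse_qs

url = 'https://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"4345125319"}'

# the query string holds a single "data" parameter whose value is a JSON object
query = parse_qs(urlsplit(url).query)
payload = json.loads(query['data'][0])

print(payload['book_id'])  # the book id the API will look up
```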
- Send the HTTP request: use requests.get(url) to fetch the catalog synchronously, then parse the response into a dict with resp.json().
- Iterate over the chapter data: from dic['data']['novel']['items'], extract each chapter's title (title) and chapter ID (cid).
- Create the async tasks: for each chapter, call aiodownload to create a coroutine, collect the coroutines in the tasks list, and finally run them all concurrently with asyncio.gather.
```python
async def getCatalog(url):
    resp = requests.get(url)
    dic = resp.json()
    tasks = []
    for item in dic['data']['novel']['items']:  # each item holds one chapter's title and id
        title = item['title']
        # each cid corresponds to one async task
        cid = item['cid']
        # prepare the async task
        tasks.append(aiodownload(cid, b_id, title))
    await asyncio.gather(*tasks)
```
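asyncio.gather starts every chapter download at once; for a long book you may want to cap the number of in-flight requests. A minimal sketch with asyncio.Semaphore, where fetch_one is a stand-in for aiodownload and the limit of 10 is an arbitrary choice:

```python
import asyncio

async def fetch_one(cid, sem):
    # stand-in for aiodownload: the real HTTP request would go inside the semaphore
    async with sem:
        await asyncio.sleep(0)  # simulate I/O
        return cid

async def main():
    sem = asyncio.Semaphore(10)  # at most 10 downloads run at the same time
    # gather preserves the order of the input coroutines in its result list
    return await asyncio.gather(*(fetch_one(cid, sem) for cid in range(25)))

results = asyncio.run(main())
print(len(results))  # 25
```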
The line b_id = "4345125319" defines the book ID used to build the request URL. url is assembled by string concatenation into the Baidu Dushu catalog API address, and asyncio.run(getCatalog(url)) starts the async getCatalog function to fetch and process the catalog data.
```python
if __name__ == '__main__':
    b_id = "4345125319"
    url = 'https://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"' + b_id + '"}'
    asyncio.run(getCatalog(url))
```
aiodownload takes three parameters: cid (the chapter ID), b_id (the book ID), and title (the filename to save under). It then:
- builds the request parameters as a JSON object
- URL-encodes the parameters
- joins them into the complete API request URL
```python
data = {
    "book_id": f"{b_id}",
    "cid": f"{b_id}|{cid}",
    "need_bookinfo": 1
}
encoded_data = urllib.parse.quote(json.dumps(data))
url = f'https://dushu.baidu.com/api/pc/getChapterContent?data={encoded_data}'
```
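urllib.parse.quote percent-encodes the characters that are not safe in a url, so the JSON braces, quotes, and colons survive transport. For example:

```python
import json
import urllib.parse

data = {"book_id": "4345125319", "need_bookinfo": 1}
encoded = urllib.parse.quote(json.dumps(data))
print(encoded)  # → %7B%22book_id%22%3A%20%224345125319%22...
```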
- create an aiohttp client session
- send a GET request and await the response
- check that the HTTP status code is 200 (success)
- parse the response body as JSON
```python
async with aiohttp.ClientSession() as session:
    async with session.get(url) as resp:
        if resp.status == 200:
            dic = await resp.json()
```
Catch and print any exception that may occur, including failed network requests, JSON parsing errors, and file I/O errors:

```python
except Exception as e:
    print(f"An error occurred: {e}")
```
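One file-related failure worth heading off: chapter titles can contain characters that are illegal in filenames (/, :, ?, and so on), which would make the file write fail. A hypothetical safe_filename helper, not part of the original script, could normalize the title before it is passed to aiofiles.open:

```python
import re

def safe_filename(title):
    # replace characters that are not allowed in Windows/Unix filenames
    return re.sub(r'[\\/:*?"<>|]', '_', title)

print(safe_filename('Chapter 1: A/B'))  # Chapter 1_ A_B
```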
Full code:
```python
# https://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"4345125319"}  -> the full chapter catalog
# https://dushu.baidu.com/api/pc/getChapterContent?data={"book_id":"4345125319","cid":"4345125319|1571332019","need_bookinfo":1}  -> the text of one chapter
import requests
import asyncio
import aiohttp
import json
import aiofiles
import urllib.parse

async def aiodownload(cid, b_id, title):
    data = {
        "book_id": f"{b_id}",
        "cid": f"{b_id}|{cid}",
        "need_bookinfo": 1
    }
    encoded_data = urllib.parse.quote(json.dumps(data))
    url = f'https://dushu.baidu.com/api/pc/getChapterContent?data={encoded_data}'
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                if resp.status == 200:
                    dic = await resp.json()
                    async with aiofiles.open(f'{title}.txt', mode='w', encoding='utf-8') as f:
                        await f.write(dic['data']['novel']['content'])
                else:
                    print(f"Request failed with status code: {resp.status}")
    except Exception as e:
        print(f"An error occurred: {e}")

async def getCatalog(url):
    resp = requests.get(url)
    dic = resp.json()
    tasks = []
    for item in dic['data']['novel']['items']:  # each item holds one chapter's title and id
        title = item['title']
        # each cid corresponds to one async task
        cid = item['cid']
        # prepare the async task
        tasks.append(aiodownload(cid, b_id, title))
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    b_id = "4345125319"
    url = 'https://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"' + b_id + '"}'
    asyncio.run(getCatalog(url))
```