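1. Task description: On the Dangdang home page (https://www.dangdang.com/), click "Books, Children's Books" (图书、童书) under "All Product Categories" (全部商品分类), then choose "Book Bestseller List" (图书畅销榜) under "Rankings" (排行榜), select the "Last 7 Days" (近7日) list, and scrape the data from pages 1 to 5.
Time: 2025-01-17 17:23:50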
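### Scraping the first 5 pages of Dangdang's 7-day book bestseller list with Python
To do this, `requests` fetches each page, `BeautifulSoup` normalizes the HTML, and a regular expression extracts the fields. The steps are as follows:
#### Preparation
Install the required libraries: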
```bash
pip install requests beautifulsoup4 lxml
```
#### Writing the spider script
Create a file named `dangdang_spider.py` and enter the following code:
```python
import re

import requests
from bs4 import BeautifulSoup

# One match per book entry; re.S lets ".*?" span the line breaks inside an entry.
# The pattern is tied to the ranking page's markup and needs adjusting if that changes.
BOOK_PATTERN = re.compile(
    r'list_num.*?(\d+).*?<img src="(.*?)".*?title="(.*?)".*?tuijian">(.*?)</span>.*?title="(.*?)".*?<span>(\d{4}-\d{2}-\d{2}).*?(\d+)次.*?price_n">¥(.*?)</span>.*?price_r">¥(.*?)</span>',
    re.S)


def fetch_page(url):
    """Download one ranking page and return its HTML as text."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code != 200:
        raise RuntimeError(f"Failed to load page {url}, status code: {response.status_code}")
    # Guess the charset from the page body so Chinese text is decoded correctly.
    response.encoding = response.apparent_encoding
    return response.text


def parse_html(html_content):
    """Extract one dict per book from a ranking page."""
    # Re-serializing through BeautifulSoup normalizes the markup and writes
    # entities such as &yen; as the literal "¥" used in the pattern.
    soup = BeautifulSoup(html_content, "lxml")
    books_info = []
    for match in BOOK_PATTERN.findall(str(soup)):
        books_info.append({
            "rank": match[0],
            "cover_url": match[1],
            "title": match[2].strip(),
            "recommendation": match[3].strip(),
            "author": match[4].strip(),
            "pub_date": match[5],
            "review_count": match[6],
            "price": match[7],
            "list_price": match[8],
        })
    return books_info


if __name__ == "__main__":
    # "bestsellers" selects the best-seller ranking and "recent7" the last 7 days;
    # the final number appended below is the page index.
    base_url = "http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent7-0-0-1-"
    all_books_data = []
    for i in range(1, 6):  # pages 1 through 5
        url = f"{base_url}{i}"
        html_text = fetch_page(url)
        all_books_data.extend(parse_html(html_text))
    # Write one tab-separated line per book.
    with open("books_top_5_pages.txt", mode="w", encoding="utf-8") as file:
        for item in all_books_data:
            file.write("\t".join(str(value) for value in item.values()) + "\n")
    print("Data has been successfully saved.")
```
This script loops over the five page URLs, extracts the required fields with a regular expression, and saves the results to a local text file.
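The text file written by the script has no header row, and a tab inside a book title would shift its columns. As a possible alternative, the sketch below writes the same kind of records as CSV with the standard-library `csv` module. The function name `write_books_csv` and the sample records are made up for illustration; the column names are taken from whatever keys the records carry.

```python
import csv
import io


def write_books_csv(records, stream):
    """Write a list of dicts (one per book) to `stream` as CSV with a header row.

    Returns the number of data rows written.
    """
    if not records:
        return 0
    # Column order follows the key order of the first record.
    writer = csv.DictWriter(stream, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return len(records)


if __name__ == "__main__":
    # Made-up sample records standing in for the scraper's output.
    sample = [
        {"rank": "1", "title": "Example, Book One", "price": "29.80"},
        {"rank": "2", "title": "Example Book Two", "price": "35.00"},
    ]
    buffer = io.StringIO()
    write_books_csv(sample, buffer)
    print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("books.csv", "w", newline="", encoding="utf-8-sig")`; `newline=""` is what the `csv` module expects, and the `utf-8-sig` encoding adds a byte-order mark that helps spreadsheet programs detect UTF-8.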