Crawler code to scrape the Douban Top 250 and save it to a CSV file
### Python Crawler: Scrape Douban Top 250 Movie Data and Save It to a CSV File
To implement this, you can use the `requests` library to fetch the pages and `lxml` to parse their contents, then write the extracted records to a CSV file with `pandas` (or the built-in `csv` module). Below is a complete code example:
```python
import requests
from lxml import etree
import pandas as pd


def fetch_page(url):
    """Download a single listing page and return its HTML."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    raise Exception(f'Failed to load page {url}')


def parse_html(html_content):
    """Extract one row of fields per movie from a listing page."""
    tree = etree.HTML(html_content)
    movies = []
    items = tree.xpath('//ol[@class="grid_view"]/li')
    for item in items:
        # The Chinese title is the first <span class="title">.
        title_chinese = item.xpath('.//span[@class="title"][1]/text()')[0]
        # The alternate title sits in <span class="other">; some entries omit it.
        try:
            title_english = item.xpath('.//span[@class="other"]/text()')[0].replace('/', '').strip()
        except IndexError:
            title_english = ''
        link = item.xpath('.//a/@href')[0]
        # The first <p> inside div.bd holds two text nodes: the director/cast
        # line, then a "year / country / genres" line.
        info_lines = [t.strip() for t in item.xpath('.//div[@class="bd"]/p[1]/text()') if t.strip()]
        director_and_casts = info_lines[0] if info_lines else ''
        year = nationality = movie_types = ''
        if len(info_lines) > 1:
            parts = [p.strip() for p in info_lines[-1].split('/')]
            if len(parts) >= 3:
                year = parts[0]
                nationality = parts[1]
                movie_types = '/'.join(parts[2].split())
        rating_score = item.xpath('.//span[@class="rating_num"]/text()')[0]
        num_of_ratings = item.xpath('.//div[@class="star"]/span[last()]/text()')[0].replace('人评价', '')
        movies.append([
            title_chinese,
            title_english,
            link,
            director_and_casts,
            year,
            nationality,
            movie_types,
            rating_score,
            num_of_ratings,
        ])
    return movies


def save_to_csv(data_list, filename='douban_top250.csv'):
    df = pd.DataFrame(
        data=data_list,
        columns=['中文名称', '外文名称', '详情链接', '导演与演员', '年份', '国家地区', '类型', '评分', '评论人数']
    )
    # utf-8-sig writes a BOM so Excel displays the Chinese headers correctly.
    df.to_csv(filename, index=False, encoding='utf-8-sig')


if __name__ == '__main__':
    base_url = 'https://movie.douban.com/top250?start={}&filter='
    all_movies = []
    for i in range(10):  # 10 pages in total, 25 movies each
        url = base_url.format(i * 25)
        html = fetch_page(url)
        all_movies.extend(parse_html(html))
        print(f'Successfully fetched and parsed page {i + 1}')
    save_to_csv(all_movies)
    print('All pages have been successfully processed.')
```
This script walks through the URL of each of the ten pages, extracts the required fields, and finally writes all the collected data to a file named `douban_top250.csv`[^2].
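The opening paragraph mentions the built-in `csv` module as an alternative to `pandas`. For completeness, here is a minimal sketch of that drop-in variant of `save_to_csv`, assuming the same row layout that `parse_html` produces:

```python
import csv

def save_to_csv(data_list, filename='douban_top250.csv'):
    # newline='' prevents blank rows on Windows; utf-8-sig adds a BOM for Excel.
    with open(filename, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['中文名称', '外文名称', '详情链接', '导演与演员',
                         '年份', '国家地区', '类型', '评分', '评论人数'])
        writer.writerows(data_list)
```

Either way, a quick `pd.read_csv('douban_top250.csv').head()` is an easy sanity check on the output.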